Supplementary material

Striking nucleotide frequency pattern at the borders of highly conserved vertebrate non-coding sequences

Klaudia Walter1,2, Irina Abnizova1,2, Greg Elgar1 and Walter R. Gilks2

1 MRC Rosalind Franklin Centre for Genomics Research, Hinxton, Cambridge, UK, CB10 1SB
2 MRC Biostatistics Unit, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge, UK, CB2 2SR

Corresponding author: Gilks, W.R. ([email protected]).

Reducing the noise level by smoothing

To reduce the noise level in the statistical signal in Figure 1a,b, smoothed versions of the Fugu and the human A+T frequencies were computed. They are shown in Figure S1a,b. Details of the smoothing process are described in the methods section. As we have no indication of whether orientation of CNEs is of significance, we reversed the order of nucleotides in the 3’ boundary sequence block and the right half of the central sequence block and combined them with the 5’ boundary sequence block and the left half of the central sequence block. This gives the same characteristic signal (Figure S2) and shows that the decrease in A+T across the flanking regions is mostly confined to the 150 bp closest to the boundary. A+T frequency rises dramatically across the 20 bp spanning the CNE boundary. Combining the 5’ and the 3’ sequence blocks also reduces the noise level.
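The smoothing step can be illustrated with a minimal Python sketch. It assumes the five-column window and the weights 0.4, 0.2 and 0.1 given in the Methods, and applies them to per-column A+T frequencies rather than raw counts (the two differ only by the constant factor N). The function names, the toy alignment and the treatment of the alignment ends (rescaling the remaining weights where the window is truncated) are assumptions for illustration and are not taken from the text.

```python
from typing import List, Sequence

# Weights from the Methods: 0.4 for the central column, 0.2 for the
# immediate neighbours and 0.1 for the columns two positions away.
WEIGHTS = (0.1, 0.2, 0.4, 0.2, 0.1)


def at_frequency(alignment: Sequence[str]) -> List[float]:
    """Fraction of A or T in each column of equal-length sequences."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return [
        sum(seq[i] in "AT" for seq in alignment) / n_seqs
        for i in range(n_cols)
    ]


def smooth(values: Sequence[float],
           weights: Sequence[float] = WEIGHTS) -> List[float]:
    """Weighted moving average over a centred window.

    Assumption (not stated in the text): near the ends, where the
    window is truncated, the remaining weights are rescaled to sum to 1.
    """
    half = len(weights) // 2
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for offset in range(-half, half + 1):
            j = i + offset
            if 0 <= j < len(values):
                total += weights[offset + half] * values[j]
                weight_sum += weights[offset + half]
        smoothed.append(total / weight_sum)
    return smoothed


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["ATGCATTA", "TTGCAGTA", "ACGCTTAA", "GTCCATTT"]
raw = at_frequency(toy_alignment)
print([round(x, 3) for x in smooth(raw)])
```

Because the weights sum to 1, a flat profile is left unchanged, while an isolated single-column spike is reduced to 40% of its height and spread over the four neighbouring columns.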

[Figure S1: two panels, (a) Fugu and (b) human. x-axis: Position (0 to 2000); y-axis: average A+T content (0.40 to 0.65). TRENDS in Genetics]

Figure S1. Panel (a) displays smoothed average A+T content in 2200 columns of Fugu sequences that contain 1000-bp 5' flanking region, 50 bp from the 5' end of the CNE, 100 bp from the centre of the CNE, 50 bp from the 3' end of the CNE and 1000-bp 3' flanking region. Panel (b) shows smoothed average A+T content of similarly compiled human CNE sequences. CNEs are shown in red and flanking regions in black.


[Figure S2: x-axis: Position (0 to 300); y-axis: average A+T content in Fugu (0.40 to 0.65).]

Figure S2. Average A+T content of Fugu sequences with 5' and 3' ends combined. 50 bp of 5' and 3' CNE and 50 bp of central CNE are shown in red and 200 bp of 5' and 3' flanking region in black.

Dinucleotide pattern

We also examined the dinucleotide pattern around the CNE boundaries. The CG/CG and GC/GC dinucleotides are of particular interest because of their low levels in the human genome, their comparatively high levels in human promoter regions, and the disputed link between CpG frequency and DNA methylation [1–4].

The dinucleotides AA, AT, TA and TT show patterns similar to that of A+T, most strikingly AT (Figure S3c). Excluding the immediate 100-bp flanking regions, the frequency of AT is ~30% higher within the CNEs than in the distal flanking regions. The other dinucleotides AA, TT and TA show the same pattern but to a lesser extent (Figure S3a,b,d).

The frequencies of the dinucleotides CC and GG (Figure S4a,b) rise sharply within 100 bp on both sides of the CNEs; however, frequencies within CNEs do not differ from frequencies around CNEs once the bordering 100 bp are excluded. CG frequency changes are more pronounced and differ substantially from GC frequencies (Figure S4c,d). CG frequencies within CNEs amount to only about 70% of CG frequencies within flanking regions, excluding the proximal 100 bp. There is no visible difference between CNEs and flanking regions in AC, CA, AG and GA frequencies (Figure S4e,f,g,h). However, AC frequencies are below, and CA frequencies above, the expected frequency (by expected frequency we mean P = 1/16 = 0.0625).

It is not yet known whether 5' to 3' direction has a role in the function of the CNE, that is, whether one strand is active and the other is passive. If direction does have a role, then about half of the CNEs will by chance be from the wrong strand (i.e. the wrong way round), giving rise to the symmetric pattern seen in Figures 1, S1, S3 and S4. We therefore expect to see similar patterns in reverse complements of our data. Thus the AC pattern, for example, resembles that of its reverse complement, GT (not shown). Similarly, the patterns for the dinucleotides CT, TG and TC resemble those of their respective reverse complements, AG, CA and GA. This also explains why AA and TT frequencies in our data are similar to each other, as are CC and GG frequencies: AA and TT are reverse complements of each other, as are CC and GG. By contrast, AT and TA frequencies are not the same, and neither are CG and GC frequencies, because the members of each pair are not reverse complements of each other.
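The reverse-complement pairing and the dinucleotide counting can be sketched in a few lines of Python. The Figure S3 caption states that dinucleotides were counted in non-overlapping 2-bp windows; the exact layout used here (window k covers alignment columns 2k and 2k+1, and the frequency is the fraction of sequences showing a given dinucleotide in that window) is an assumption, as are the function names.

```python
from collections import Counter
from typing import Dict, List, Sequence

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def dinucleotide_frequencies(
    alignment: Sequence[str],
) -> List[Dict[str, float]]:
    """Per-window dinucleotide frequencies across aligned sequences.

    Assumed layout: window k covers columns 2k and 2k+1, so windows do
    not overlap. Each entry maps a dinucleotide to the fraction of
    sequences that show it in that window.
    """
    n_seqs = len(alignment)
    n_windows = len(alignment[0]) // 2
    result = []
    for k in range(n_windows):
        counts = Counter(seq[2 * k: 2 * k + 2] for seq in alignment)
        result.append({dinuc: c / n_seqs for dinuc, c in counts.items()})
    return result


# Pair every dinucleotide with its reverse complement.
for first in "ACGT":
    for second in "ACGT":
        dinuc = first + second
        print(dinuc, "->", reverse_complement(dinuc))
```

In the printed pairing, AT, TA, CG and GC each map to themselves, whereas AA pairs with TT, CC with GG, AC with GT, CT with AG, TG with CA and TC with GA, which is the grouping used in the argument above.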


[Figure S3: four panels, (a) AA, (b) TT, (c) AT and (d) TA. x-axis: Position (0 to 1000); y-axis: average dinucleotide frequency (0.02 to 0.10).]

Figure S3. Dinucleotide frequencies for (a) AA, (b) TT, (c) AT and (d) TA within Fugu sequences. The sequences were joined as described earlier. Dinucleotides were counted in non-overlapping 2-bp windows. CNEs are shown in red. The horizontal line is at 1/16 = 0.0625.


[Figure S4: eight panels, (a) to (h), titled CC, GG, CG, GC, AC, CA, AG and GA. Horizontal axis ticks 0 to 1000; vertical axis ticks 0.02 to 0.10.]

Figure S4. Dinucleotide frequencies for (a) CC, (b) GG, (c) CG, (d) GC, (e) AC, (f) CA, (g) AG, (h) GA within Fugu sequences. CNEs are shown in red.

Methods

Smoothing process

The smoothed versions of the position-specific nucleotide frequencies in Figure S1a,b were calculated using a weighted average wi within a sliding 5-bp window around column i of the alignment:

$$ w_i = \sum_{j=i-2}^{i+2} a_j f_j $$

where fj are the observed counts in column j. We used the weights ai = 0.4, ai-1 = ai+1 = 0.2 and ai-2 = ai+2 = 0.1, thus allocating a greater weight to the central nucleotide of the sliding window than to its neighbouring bases.

Position weight matrix (PWM)

We calculated the nucleotide frequency pb,i of base b in alignment column i by dividing the counts fb,i of base b in alignment column i by the number of sequences N:

$$ p_{b,i} = \frac{f_{b,i}}{N} $$

To correct for the sequence-specific bias in nucleotide frequencies, we calculated the nucleotide frequency qb,j of base b in sequence j by dividing the counts fb,j of base b in sequence j by the length of the sequence n:

$$ q_{b,j} = \frac{f_{b,j}}{n} $$

A position weight matrix Wb,i,j of base b in alignment column i and for sequence j is then constructed in the following way [5]:

$$ W_{b,i,j} = \log_2 \frac{p_{b,i}}{q_{b,j}} $$

The score for sequence j is the total Wj of all Wb,i,j over alignment columns i in sequence j and over all bases:

$$ W_j = \sum_{b=1}^{4} \sum_{i=1}^{n} W_{b,i,j} $$
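A minimal Python sketch of this score, read literally from the definitions above: p is the column frequency, q is the within-sequence frequency, and the score sums the log2 ratio over all four bases and all columns. The function names and the `pseudocount` parameter are hypothetical. The text does not say how zero counts were handled, so the pseudocount is purely an assumption that keeps the logarithm defined; with the default of 0 the formulas are exactly those above.

```python
import math
from typing import Dict, List, Sequence

BASES = "ACGT"


def column_frequencies(alignment: Sequence[str],
                       pseudocount: float = 0.0) -> Dict[str, List[float]]:
    """p[b][i]: frequency of base b in alignment column i (counts / N)."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    denom = n_seqs + 4 * pseudocount
    return {
        b: [
            (sum(seq[i] == b for seq in alignment) + pseudocount) / denom
            for i in range(n_cols)
        ]
        for b in BASES
    }


def sequence_frequencies(seq: str,
                         pseudocount: float = 0.0) -> Dict[str, float]:
    """q[b]: frequency of base b within one sequence (counts / n)."""
    denom = len(seq) + 4 * pseudocount
    return {b: (seq.count(b) + pseudocount) / denom for b in BASES}


def sequence_score(alignment: Sequence[str], j: int,
                   pseudocount: float = 0.0) -> float:
    """W_j: sum of log2(p[b][i] / q[b]) over all bases b and columns i.

    Hypothetical helper. With pseudocount=0, any base that is absent
    from a column or from sequence j makes the ratio undefined and an
    exception is raised.
    """
    p = column_frequencies(alignment, pseudocount)
    q = sequence_frequencies(alignment[j], pseudocount)
    n_cols = len(alignment[0])
    return sum(
        math.log2(p[b][i] / q[b]) for b in BASES for i in range(n_cols)
    )


# Toy alignment: every column and every sequence contains each base once,
# so p = q = 1/4 everywhere and each log ratio is zero.
cyclic = ["ACGT", "CGTA", "GTAC", "TACG"]
print(sequence_score(cyclic, 0))
```

In the toy alignment the column composition matches the composition of each sequence, so the score is zero. Real alignments of limited depth will usually contain columns that lack some base, which is where an explicit choice about zero counts becomes necessary.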

Over-represented words

We looked for over-represented words of length 3, 4 and 5 bp in aligned sequences. Each sequence comprised 100 bases spanning a CNE boundary (50 bases to each side of the boundary), at each end of the CNE. We counted the frequency of occurrence for each word at each position along the alignment. We scored each word, w, in each position i by comparing its frequency fwi with its average frequency µw across the whole alignment, as follows:

scorewi = (fwi - µw)/σw

where σw is the standard deviation in fwi across the whole alignment [6].

References

1 Bird, A.P. (1980) DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 8(7), 1499–1504
2 Gardiner-Garden, M. and Frommer, M. (1987) CpG islands in vertebrate genomes. J. Mol. Biol. 196(2), 261–282
3 Takai, D. and Jones, P.A. (2002) Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. U.S.A. 99(6), 3740–3745
4 Jabbari, K. and Bernardi, G. (2004) Cytosine methylation and CpG, TpG (CpA) and TpA frequencies. Gene 333, 143–149
5 Wasserman, W.W. and Sandelin, A. (2004) Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 5(4), 276–287
6 FitzGerald, P.C. et al. (2004) Clustering of DNA sequences in human promoters. Genome Res. 14(8), 1562–1574
