Isolation and Nucleotide Sequence of the Rabbit Globin Gene Cluster

Vol. 261, No. 2, Issue of January 15, pp. 839-848, 1986. Printed in U.S.A.

THE JOURNAL OF BIOLOGICAL CHEMISTRY © 1986 by The American Society of Biological Chemists, Inc.

Isolation and Nucleotide Sequence of the Rabbit Globin Gene Cluster ψζ-α1-ψα
ABSENCE OF A PAIR OF α-GLOBIN GENES EVOLVING IN CONCERT*
(Received for publication, June 10, 1985)

Jan-Fang Cheng, Laurie Raid, and Ross C. Hardison
From the Department of Molecular and Cell Biology, The Pennsylvania State University, University Park, Pennsylvania 16802

A cloned 13.3-kilobase (kb) region of rabbit genomic DNA contains a cluster of α-like globin genes arranged 5'-ψζ-α1-ψα-3'.

[FIG. 5, continued: nucleotide sequence listing with position markers from 1900 to 4100.]

Furthermore, they contain repeats of the sequences CCCG and GCCCGCCGC. These G + C-rich repeating sequences are found scattered throughout all the non-coding regions (Fig. 7). They are found as tandem repeats of (CCCG)5 in intron 1 and as dispersed repeats in the 5' promoter region, in intron 2, and in the 3' untranslated region.
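As a minimal sketch of how tandem and dispersed copies of such short motifs could be located in a sequence, the Python function below reports each run of a motif with its 1-based start position and copy number. The function name `tandem_runs` and the test sequence are illustrative assumptions; the sequence is synthetic and is not taken from Fig. 5, and this is not the procedure used in the study.

```python
import re


def tandem_runs(seq, motif, min_copies=1):
    """List (start, copies) for each run of tandem copies of motif.

    start is 1-based, in the style of the nucleotide numbering of the figures.
    Runs are found left to right and do not overlap.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    return [(m.start() + 1, len(m.group(0)) // len(motif))
            for m in pattern.finditer(seq.upper())]


# Synthetic test sequence (hypothetical, not from the paper):
# a (CCCG)5 tandem array, a spacer, then two further copies.
example = "AT" + "CCCG" * 5 + "TTGA" + "CCCG" * 2 + "A"

print(tandem_runs(example, "CCCG"))                # every run, including short ones
print(tandem_runs(example, "CCCG", min_copies=5))  # only arrays of five or more copies
```

Setting `min_copies=1` also returns isolated copies, which corresponds to the dispersed repeats; raising it restricts the report to tandem arrays such as (CCCG)5.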

The conversion unit of the goat α-globin genes is flanked by a 23-bp direct repeat, and vestiges of this sequence can be found flanking the human and mouse α-globin genes (Schon et al., 1982). Examination of the corresponding regions of rabbit α1 reveals a 19-bp segment of the 5' flank from nucleotides 67 to 85 (GCCGCGCCGGCCAATGGGC) that matches for 14/19 positions with a segment 3' to the poly(A) site (GCAGGGCCGGCCTAGGGAC, nucleotides 896-914). Schon et al. (1982) suggest that the flanking direct repeats may be the remnant of an insertion of an ancestral α-globin gene.

DNA Sequence of the ψα-Globin Gene-A comparison of the nucleotide sequence of ψα and α1 indicates that the sequence matches begin at the ψα gene sequence ATG, which corresponds to the initiation codon, and end at AGTAAA which corresponds to the polyadenylation signal (Fig. 8). This pseudogene consists of three exons and two short introns, and the splice sites at the intron-exon junctions fit the consensus sequence (Figs. 7 and 8). However, the expected promoter sequences are not found in the ψα-globin gene; the ATA box at -30 is missing, and no matches with the CCAAT sequence are found in the -85 region. The 5' untranslated region of ψα has been replaced by 7 copies of a 9-bp tandem repeat which end 6 bp upstream from the ATG sequence. This particular sequence, GCCCGCCGC, repeats three times and is followed by four repeats with an additional T at the 3' end of each 9-bp sequence (Figs. 7 and 8). In addition, three deletions in exon 2 of ψα were necessary to complete the sequence alignment with the α1 gene. Two of these deletions alter the translational reading frame at codon positions 53 and 74 (Fig. 8). In this altered reading frame, UGA termination codons are in phase at codons 66, 73, and 92; these are indicated by asterisks in Fig. 8. The absence of a normal promoter and 5' untranslated sequence, the deletions and termination codons in exon 2, and the altered poly(A) addition site (AGTAAA) all strongly support the assignment of ψα as a pseudogene.

Like the α1 gene, pseudogene ψα has several copies of short G + C-rich repeats in the introns and the 5' flanking region. Two of the repeats in intron 1 match the 9-bp element that is tandemly repeated at the 5' end of ψα (Fig. 7). Most of the sequence matches in the 5' flanking regions of α1 and ψα are between the G + C-rich repeats (Fig. 8).

DNA Sequence of the α1-ψα Intergenic Region-Short repeating sequences are frequently found within the 2277 bp of intergenic sequence separating α1 and ψα. For example, eight repeating CA dinucleotides interrupted with a TG are located 167 bp downstream from the poly(A) site of the α1 gene (nucleotide positions 1029-1046 of Fig. 5), the tetranucleotide CCCA repeats five times beginning 273 bp 3' to the α1 gene (nucleotide positions 1135-1154), and a stretch of Gs starts at position 1879. Repeating dinucleotides (TG)5 and (CA)9 are located at positions 2334-2343 and 2558-2575, respectively. Furthermore, there is a region of 134 bp (positions 1542-1675) containing five repeats of a 26-bp element with a consensus sequence CACCGCCGTAGCCGGGAATGGTGGGG. The element at the 5' end of this group of 26-bp repeats shows the greatest deviation from the consensus. A computer generated dot plot comparison of the rabbit α1-ψα intergenic sequence with the human α2-α1 intergenic sequence (Michelson and Orkin, 1983; Hess et al., 1983), and the 5' flanks of the goat Iα- and IIα-globin genes (Schon et al., 1982) failed to reveal any long region of matching sequences.

The G + C Content of the α1-ψα Region-The overall G + C content in this 4-kb region of the rabbit α-globin gene cluster is 65.7%, which is very G + C rich compared to the average G + C content of 44.2% reported for the rabbit genome (Kritskii et al., 1967). The distribution of G + C content throughout this region is shown in Fig. 9. In the case of the α1 gene, the G + C contents of the 5' flanking region (73%) and the two introns (87% for intron 1 and 77% for intron 2) are higher than their neighboring regions (56% for the 5' untranslated region; 61, 62, and 61% for exons 1, 2, and 3, respectively). Similar observations of G + C-rich introns were reported for the goat α-globin genes (Schon et al., 1982), human α1-, α2-, and ζ-globin genes (Michelson and Orkin, 1980; Liebhaber et al., 1980; Proudfoot et al., 1982), and equine α1-globin gene (Clegg et al., 1984). Usually, this high G + C content correlates with the appearance of a number of short G + C-rich repeated elements. The rabbit ψα-globin gene also has a high G + C content in its 5' flanking region (88%) and intron 1 (86%); this is also due to the short repeat sequences. In the intergenic region, the high G + C content is evenly distributed except for a segment of 210 bp located approximately 580-790 bp upstream from the ψα gene. This region has a G + C content of 30% and is flanked by the inverted repeat sequences of (TG)n on the 5' end and (AC)n on the 3' end (Fig. 9).

FIG. 6. In vitro transcription assay of the α1-globin gene. Cloned DNA templates were used to program in vitro transcription by RNA polymerase II in the whole cell extract of Manley et al. (1979), and the transcripts were separated on 1.5% formaldehyde agarose gel, blotted onto nitrocellulose, and exposed to x-ray film. The upper portion of the figure shows an autoradiogram of the transcripts synthesized from the following DNA templates: lane 1, PstI-cleaved SV40; lane 2, EcoRI-cleaved pBR322; lane 3, HincII-cleaved pEH2.7; lane 4, EcoRI-cleaved pEH2.7; lane 5, AccI-cleaved pEH2.7; lane 6, EcoRI/BamHI-cleaved pEB1.95 (rabbit gene β1); lane 7, no DNA. The sizes of these α1 run-off transcripts are indicated on the right side of the autoradiogram. The large RNA molecules in lanes 5 and 6 may be caused by initiation within the pBR322 vector DNA. The lower portion of the figure indicates the positions of the enzymes used to truncate the pEH2.7 DNA template. The α1-globin gene is shown as a box with hatched 5' untranslated region, black coding regions, and white introns. The thick and thin lines represent rabbit DNA and pBR322 DNA, respectively. The wavy lines are the RNA molecules synthesized beginning at the cap site and ending at the restriction cleavage site.

FIG. 7. Structural features of the α1- and ψα-globin genes. Both the α1 and ψα-globin genes are shown using the conventions described in the legend to Fig. 4. I1 is intron 1 and I2 is intron 2. Conserved CCAAT and ATA boxes in the promoter region, GT and AG at the splice junctions, and the AATAAA polyadenylation signal are marked on the top of each gene. Locations and orientations of two short repeating sequences are indicated under the genes. Positions of nucleotide deletions in the second exon of ψα are shown by vertical arrows and the number of the nucleotides missing is indicated at each site.

FIG. 8. Alignment of α1 and ψα gene sequences. The sequences of the α1 (top strand)- and ψα (bottom strand)-globin genes were aligned using the computer program NUCALN from Wilbur and Lipman (1983). The parameters for this alignment are k-tuple size = 3, window size = 20, gap penalty = 5; essentially the same alignment was obtained at other settings of the parameters. The sequence of α1 is from nucleotide positions 51 to 1067 in Fig. 5, and the sequence of ψα is from positions 3030 to 4024. A colon is placed between matched nucleotides, and gaps are introduced into both gene sequences to optimize the alignment. The boxes enclose the three exons, vertical arrows point to the deletions in exon 2 of ψα, and asterisks indicate the nonsense codons in phase as a result of the frameshifts. Horizontal arrows show the G + C-rich repeating sequences.

FIG. 9. The distribution of G + C content in the rabbit α1-ψα region. The per cent G + C (horizontal axis) is plotted as a function of position in the α1-ψα region. The G + C content was calculated in 100-bp segments in the intergenic region, and the introns, exons, and untranslated segments of the genes were examined separately. The conventions used in drawing the genes are the same as Fig. 4.

DISCUSSION


The rabbit genomic DNA clone X b G 1 contains a cluster of α-like genes arranged 5'-ψζ-α1-ψα-3'. This corresponds to the major α-globin gene cluster in the rabbit genome, although genomic blot hybridization data reveal a second α-globin gene, α2, that appears to be a separate locus. Other clones currently being characterized contain
