ORGANIZATION, EVOLUTION AND FUNCTION OF ALPHA SATELLITE DNA AT HUMAN CENTROMERES

by M. KATHARINE RUDD

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Huntington F. Willard

Department of Genetics CASE WESTERN RESERVE UNIVERSITY

January, 2005

CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

M. Katharine Rudd ______________________________________________________ candidate for the Ph.D. degree *.

A. Gregory Matera

(signed)_______________________________________________ (chair of the committee)

Huntington F. Willard

________________________________________________

Patricia A. Hunt

________________________________________________

Stuart Schwartz

________________________________________________

Alan Tartakoff

________________________________________________

________________________________________________

July 28, 2004

(date) _______________________

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

Table of Contents
List of Tables
List of Figures
Acknowledgements
Abstract

Chapter 1: Introduction
Chapter 2: Analysis of centromeric regions of the human genome
Chapter 3: Alpha satellite evolution in primates: evidence for the homogenization of monomeric alpha satellite
Chapter 4: Human artificial chromosomes with alpha satellite-based de novo centromeres show increased frequency of nondisjunction and anaphase lag
Chapter 5: Conclusions and future studies
Appendix: Sequence organization and functional annotation of human centromeres

Bibliography

List of Tables

Table 2-1: Alpha satellite in the July 2003 (Build 34) assembly of the human genome
Table 2-2: Repeat content of the July 2003 (Build 34) human genome assembly
Table 3-1: Repeat content of the chromosome 17 satellite zone and flanking regions
Table 3-2: Mean percent identity among monomers from particular regions of alpha satellite
Table 4-1: Characteristics of human artificial chromosomes
Table 4-2: Segregation errors

List of Figures

Figure 1-1: Centromere organization among organisms
Figure 1-2: Alpha satellite organization at the human centromere
Figure 1-3: Alpha satellite evolution model
Figure 1-4: Centromere and pericentromere model
Figure 2-1: Alpha satellite location in the July 2003 (Build 34) human genome assembly
Figure 2-2: Genomic landscape of 1 Mb regions outside of the centromere gaps
Figure 2-3: Types of alpha satellite in the human genome
Figure 2-4: Alpha satellite and centromere protein colocalization
Figure 3-1: Alpha satellite organization in the centromeric region of chromosome 17
Figure 3-2: Percent identity scores for pairwise comparisons of alpha satellite monomers
Figure 3-3: Phylogenetic tree of alpha satellite on chromosome 17
Figure 3-4: Neighbor-joining tree of monomers from different chromosomes
Figure 3-5: Maximum likelihood tree of monomers from different chromosomes
Figure 3-6: Distributions of interchromosomal and intrachromosomal monomeric monomer percent identities
Figure 3-7: Genomic organization of 17q compared to the orthologous Pan troglodytes region
Figure 4-1: FISH analysis of artificial chromosomes
Figure 4-2: Anaphase segregation assay
Figure 4-3: Missegregation of natural, artificial and variant chromosomes
Figure 5-1: Model of alpha satellite evolution
Figure 5-2: Strategy for sequencing an entire human centromere
Figure A-1: Alpha satellite organization in the human genome
Figure A-2: Gaps in the public genome assembly of chromosomes X and 17
Figure A-3: Repeat content of the junction between the short arm euchromatin and centromere of the X chromosome
Figure A-4: Organization of D17Z1 and D17Z1-B higher-order repeats at the centromere of chromosome 17
Figure A-5: Phylogenetic analysis of 230 alpha satellite monomers from the X chromosome and chromosome 17
Figure A-6: Functional centromere annotation using a human artificial chromosome assay
Figure A-7: Genome assembly of the centromeric regions of the X chromosome, chromosome 17 and 21


ACKNOWLEDGEMENTS

I would like to thank my advisor, Hunt Willard, for introducing me to alpha satellite and nurturing my scientific development for the past five years. Hunt has made me a better scientist, writer and speaker, and has challenged me to think beyond my view of the chromosome.

I am also grateful to Pat Hunt and Terry Hassold. They have been a constant source of support throughout my graduate career, and have always made me feel like a part of their labs.

The members of the Willard lab have provided scientific discussion, thoughtful debate, and lots of fun over the years. I am especially grateful to Brenda Grimes and Mary Schueler for educating me in the ways of artificial chromosomes and alpha satellite and for discussing the complex centromere.

My friends at Case have been an integral part of my graduate school experience. Whether commiserating over proposal defenses, helping prepare for student seminars, or going out in Coventry to unwind, my friends have always been there for me. My mother and sisters have always encouraged and supported me, even when they weren’t exactly sure what I was doing in the lab.


Organization, evolution and function of alpha satellite DNA at human centromeres

Abstract by

M. KATHARINE RUDD

The centromere is a specialized locus responsible for ensuring proper chromosome segregation at mitosis and meiosis. Human centromeres are comprised of large arrays of a primate-specific repeat known as alpha satellite DNA. Understanding the organization and evolution of alpha satellite is essential to delineate the requirements for centromere function. The basic unit of alpha satellite is an ~171 bp monomer, and monomers may be organized in one of two types of structure. Higher-order alpha satellite is made up of monomers arranged in homogeneous multimeric higher-order repeat units. In contrast, more divergent monomeric alpha satellite lacks any higher-order periodicity. We have analyzed the alpha satellite in the human genome assembly (Build 34, July 2003), and found regions of both higher-order and monomeric alpha satellite. Although previously identified at all human centromeres, higher-order alpha satellite has only been included in the assemblies of eleven chromosomes. Monomeric alpha satellite typically lies at the edges of larger higher-order arrays, and has been included in all but three chromosome assemblies. The organization of alpha satellite in the human genome is a product of concerted evolutionary processes. We have analyzed the relationships between alpha satellite monomers from multiple chromosomes to discern the exchange mechanisms that have shaped the arrangement of alpha satellite in the genome. Like higher-order alpha satellite described previously, monomeric alpha satellite has a higher frequency of intrachromosomal exchange than interchromosomal exchange. However, comparing orthologous regions of human and chimpanzee alpha satellite, we find that monomeric alpha satellite is more conserved than higher-order alpha satellite. In addition to varying in sequence organization and evolutionary history, monomeric and higher-order alpha satellites also differ in their functionality. Using extended chromosome methods to achieve greater resolution, we have found that antibodies to centromeric proteins only colocalize with higher-order and not monomeric alpha satellite. We have also created artificial chromosomes with de novo centromeres from D17Z1 and DXZ1 higher-order alpha satellites, while other studies have shown that monomeric alpha satellite lacks this functional capacity. This work elucidates the genomic and functional differences between higher-order and monomeric alpha satellite to further define the complex human centromere.


Chapter 1

Introduction

The centromere in all eukaryotic organisms including humans plays a critical role in each step of chromosome segregation in mitosis and meiosis. As the site of the proteinaceous kinetochore, the centromere is responsible for attaching chromosomes to spindle microtubules that then align the chromosomes at the metaphase plate (Rieder and Salmon 1998). Proteins localized to the centromere are also involved in the metaphase to anaphase checkpoint and signal the attachment of all chromosomes to spindle microtubules before allowing the cell to progress into anaphase (Shah and Cleveland 2000). Finally, sister chromatid cohesion must be resolved at the centromere, and proper timing of the removal of cohesion is of vital importance for segregating chromatids (Lee and Orr-Weaver 2001).

The centromere is defined by specific DNA sequences as well as by a specialized chromatin structure. Although centromere proteins are well conserved among all organisms (Baum and Clarke 2000; Brown et al. 1993; Buchwitz et al. 1999; Earnshaw et al. 1987; Henikoff et al. 2000; Howman et al. 2000; Kalitsis et al. 1998; Oegema et al. 2001; Palmer et al. 1991; Stoler et al. 1995; Sullivan and Glass 1991; Takahashi et al. 2000; Tomkiel et al. 1994), the DNA sequence organization at the centromere is not at all well conserved (Malik and Henikoff 2002; Willard 1998). In fact, centromeres range in size and complexity from the 125 basepair point centromere found in budding yeast to the human centromere that spans several megabases. Like centromeres in most organisms, the human centromere is made up of repetitive DNA. Alpha satellite DNA, a tandemly repeated DNA family based on a fundamental unit length of ~171 bp, has been found at all human centromeres. However, the particular organization and sequence identity among alpha satellite repeats is largely chromosome-specific (Alexandrov et al. 2001; Warburton and Willard 1996; Willard 1985). Understanding the organization of alpha satellite DNA in the human genome and its role in centromere function is the focus of this dissertation.

This introductory chapter discusses the genomic organization and evolution of alpha satellite DNA, as well as the chromatin and protein requirements for centromere function, and compares human centromeres to centromeres from other species, as necessary background for chapters that follow.

Centromere organization among eukaryotes

The sequences that make up the centromeres of diverse organisms are extremely variable. Most well characterized centromeres contain repetitive DNA with an AT-richness greater than that of the genome average (Choo 2001; Koch 2000). However, individual organisms have evolved different genomic structures to create a locus capable of chromosome segregation. This section discusses the organization of centromeric DNA in yeasts, flies, plants, worms and mice, while a following section focuses specifically on the organization of human centromeres. The chromatin modifications and centromere proteins involved in centromere function are discussed in a later section.

The simplest centromere organization is found in the chromosomes of the yeast Saccharomyces cerevisiae (Figure 1-1). Only approximately 125 bp is required for centromere function in the budding yeast, and this consensus sequence is consistent among the centromeres of all 16 chromosomes (Clarke and Carbon 1985). There are three functional elements within the S. cerevisiae centromere, CDEI, CDEII and CDEIII. Deletions within CDEI (Hegemann et al. 1988) and CDEII (Sears et al. 1995) affect chromosome segregation in mitosis and meiosis, while deletions within CDEIII completely destroy centromere function (Jehn et al. 1991). Unlike other characterized centromeres, the budding yeast centromere consists of largely unique DNA (Clarke 1990). Nonetheless, the entire centromere is highly AT-rich, and the CDEII sequences are greater than 90% AT with short poly-A and poly-T regions (Fitzgerald-Hayes et al. 1982).

In contrast to the simple centromere of the budding yeast, the fission yeast centromere is more similar to the centromeres of higher eukaryotes in its size and complexity. Schizosaccharomyces pombe centromeres are made up of inner and outer inverted repeats flanking a non-repetitive central core (Clarke et al. 1986; Nakaseko et al. 1986; Nakaseko et al. 1987) (Figure 1-1), and each of these regions is AT-rich. Among the three S. pombe chromosomes, centromeres are similar, but not identical in organization. Each centromere contains an approximately 4 kb central core (Cnt), bordered by approximately 6 kb of imperfect repeats on the left and right arms (ImrL and ImrR). The organization of the outer repeats is more variable among chromosomes, but all are made up of dg and dh repeats, each of which is about 5 kb. Overall, S. pombe centromeres are 35 - 110 kb in size (Wood et al. 2002), a huge increase as compared to the S. cerevisiae centromeres. The inner repeats and central core are necessary for centromere function and bind spindle microtubules (Baum et al. 1994; Hahnenberger et al. 1991; Nakaseko et al. 2001), whereas the outer repeats recruit heterochromatin proteins and are more likely responsible for other functions such as heterochromatin formation and sister chromatid cohesion (Partridge et al. 2000; Partridge et al. 2002).

Figure 1-1: Centromere organization among organisms. The S. cerevisiae centromere is made up of three domains: CDEI, CDEII and CDEIII. The S. pombe centromere is comprised of a unique central core (Cnt) flanked by inner (ImrL, ImrR) and outer repeats (dg, dh). The fly centromere has two satellite domains, interspersed with transposable elements. Arrays of 180 bp repeats and CentO repeats interspersed with retrotransposons comprise the Arabidopsis and rice centromeres, respectively. The mouse centromere is made up of adjacent arrays of minor and major satellite. C. elegans has holocentric chromosomes, thus no specific sequence is required for centromere function. Human centromeres are made up of arrays of alpha satellite DNA organized in a hierarchical repetitive structure.

Centromeres in several other organisms are characterized by long stretches of so-called “satellite DNA”. This type of sequence was first identified as satellite bands in ultracentrifuge density gradients (Corneo et al. 1967; Kit 1961; Sueoka et al. 1959); now the term satellite DNA has come to refer to any tandem repetitive sequence (Charlesworth et al. 1994). Satellite DNAs may be divided into two major groups based on the size of the repeat unit; microsatellites are 2-20 bp long, whereas minisatellites are greater than 20 bp long (Charlesworth et al. 1994). Examples of microsatellites include the human classical satellites (Gosden et al. 1975; Prosser et al. 1986) as well as satellite sequences found in fly heterochromatin (Lohe et al. 1993; Peacock et al. 1978) among others. Larger minisatellites are found at the centromeres of Arabidopsis, rice, mice and humans (see below).

The fly centromere has been defined by a 420 kb region of a minichromosome that is required for chromosome transmission (Murphy and Karpen 1995; Sun et al. 1997). Similar to other centromeres, the Drosophila melanogaster centromere is repetitive, made up of satellite DNA and transposable elements (Figure 1-1). There are two adjacent blocks of microsatellites, AATAT and AAGAG satellites, that are interspersed with transposons as well as AT-rich DNA (Sun et al. 2003). Normal fly centromeres have not been sequenced to date, likely due to the difficulty in sequencing and assembling highly heterochromatic regions of the genome (Hoskins et al. 2002). However, the chromatin environment of endogenous Drosophila centromeres has been very well characterized (Blower and Karpen 2001; Blower et al. 2002).

Plant centromeres are very similar to the satellite- and transposon-rich fly centromeres (Figure 1-1). The major component of the Arabidopsis thaliana centromere is an AT-rich 180 bp repeat unit (Richards et al. 1991). The A. thaliana centromere was mapped as a recombination resistant region of the chromosome (Copenhaver et al. 1998), and subsequent sequence analysis identified large arrays of the 180 bp repeat, ranging from 400 kb to 1.4 Mb among chromosomes (Copenhaver et al. 1999). The Arabidopsis centromere is also enriched for retrotransposons not usually found on chromosome arms. Similarly, the rice (Oryza sativa) centromere is predominantly comprised of a 155 bp tandem repeat known as CentO arranged in arrays ranging from 65 kb to 2 Mb among the 12 rice chromosomes (Cheng et al. 2002). These arrays are interspersed with gypsy-class retrotransposons known as centromeric retrotransposons of rice. The Arabidopsis and rice centromeres have recently been defined at the level of chromatin, and chromatin immunoprecipitation experiments using antibodies to proteins required for centromere function have been conducted in both species. As expected, centromere proteins are associated with the 180 bp repeats in Arabidopsis (Nagaki et al. 2003) and the CentO repeats in rice (Nagaki et al. 2004). However, within the functional domain of the smallest rice centromere, there are also four expressed genes. This finding is surprising since centromeres are classically characterized as heterochromatic regions resistant to gene expression (Dillon and Festenstein 2002).

Mouse centromeric DNA is comprised of two types of sequence, major and minor satellite DNA (Figure 1-1). These regions have not been well defined, but mapping studies have shown that major and minor satellite are nonoverlapping arrays, with minor satellite positioned closer to the telomere (Joseph et al. 1989; Kipling et al. 1991). The sizes of the basic repeat units in major and minor satellite are approximately 234 bp and 120 bp, respectively (Horz and Altenburger 1981; Pietras et al. 1983). Only minor satellite coincides with antibodies to centromere proteins, suggesting that this region is part of the kinetochore (Wong and Rattner 1988). This finding is supported by the fact that minor satellite is found on all mouse chromosomes (Wong and Rattner 1988), whereas major satellite is located at the centromeres in only some species of mice (Wong et al. 1990).

As opposed to all other centromeres described, the centromeres of Caenorhabditis elegans appear to be completely sequence independent (Figure 1-1). C. elegans chromosomes are holocentric, meaning that many sites along the chromosome act as a centromere, recruiting centromere proteins necessary for segregation (Buchwitz et al. 1999; Moore et al. 1999; Oegema et al. 2001). The sequence-independent nature of C. elegans centromere activity is further supported by the fact that any sequence introduced into the genome segregates as an extrachromosomal element and is heritable (Stinchcomb et al. 1985). It would be interesting to see if particular sequences along the length of the chromosome are associated with centromere proteins on endogenous C. elegans chromosomes, or if every sequence truly plays an active role in chromosome segregation. Holocentric chromosomes are a curious contrast to the monocentric chromosomes found in most other species typically containing repetitive AT-rich DNA at the centromeres.
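The size-based division of satellite DNA quoted in this section (repeat units of 2-20 bp versus repeat units longer than 20 bp) can be written as a one-line rule. The Python sketch below is purely illustrative: the function name satellite_class and the dictionary REPEAT_UNITS_BP are hypothetical names introduced here, and the repeat-unit lengths are simply those given in the text of this section.

```python
def satellite_class(unit_bp):
    """Classify a tandem repeat by unit length, using the 2-20 bp / >20 bp
    thresholds quoted in the text (after Charlesworth et al. 1994)."""
    if unit_bp < 2:
        raise ValueError("a tandem repeat unit must be at least 2 bp long")
    return "microsatellite" if unit_bp <= 20 else "minisatellite"


# Repeat-unit lengths (bp) as stated in this section; the labels are informal.
REPEAT_UNITS_BP = {
    "fly AATAT satellite": 5,
    "fly AAGAG satellite": 5,
    "mouse minor satellite": 120,
    "rice CentO repeat": 155,
    "human alpha satellite monomer": 171,
    "Arabidopsis 180 bp repeat": 180,
    "mouse major satellite": 234,
}

if __name__ == "__main__":
    for name, unit_bp in REPEAT_UNITS_BP.items():
        print(f"{name}: {unit_bp} bp -> {satellite_class(unit_bp)}")
```

Under this rule the two fly satellites fall in the first class and the plant, mouse and human centromeric repeats in the second, in line with the examples listed in the text.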

Human centromere organization

The human centromere is made up of highly repetitive DNA known as alpha satellite. Alpha satellite was first discovered in the human genome (Manuelidis and Wu 1978) by its homology to a repetitive fraction of the African Green Monkey genome (Maio 1971). Further experiments localized these repeats to human centromeric regions by in situ hybridization (Manuelidis 1978). All human centromeres are comprised of alpha satellite DNA, although the organization of alpha satellite varies from centromere to centromere (Alexandrov et al. 1991; Choo et al. 1990; Devilee et al. 1988; Ge et al. 1992; Greig et al. 1989; Greig et al. 1993; Haaf and Willard 1992; Hulsebos et al. 1988; Jorgensen et al. 1988; Looijenga et al. 1992; Puechberty et al. 1999; Rocchi et al. 1991; Vissel and Choo 1991; Waye et al. 1987a; Waye et al. 1987b; Waye et al. 1987c; Waye and Willard 1985; Waye and Willard 1986; Waye and Willard 1987; Waye and Willard 1989a; Willard et al. 1983; Wolfe et al. 1985).

The most basic unit of alpha satellite DNA is an approximately 171 bp monomer (Manuelidis and Wu 1978), and monomers may be arranged in one of two types of alpha satellite, higher-order or monomeric. Higher-order alpha satellite is made up of monomers organized in highly identical higher-order repeat units (Willard et al. 1983; Willard and Waye 1987b; Yang et al. 1982). For example, the higher-order alpha satellite found on chromosome 17, D17Z1, is made up of sixteen monomers arranged head to tail to form a 2.7 kb higher-order repeat unit (Waye and Willard 1986) (Figure 1-2). This repeat unit is in turn repeated in tandem with over a thousand copies at the chromosome 17 centromere (Warburton and Willard 1990). Although higher-order repeats within a given array are nearly identical to each other (typically 97-100% identical (Durfy and Willard 1989; Schindelhauer and Schwarz 2002; Schueler et al. 2001; Warburton and Willard 1992)), the monomers that make up the D17Z1 higher-order repeat unit are much less homogeneous, about 76% identical to each other (Waye and Willard 1986). Higher-order alpha satellite has been found at all human centromeres, and higher-order arrays range between 240 kb (Tyler-Smith et al. 1993) and ~5 Mb (Wevrick and Willard 1989) in size.

Alpha satellite with a less homogeneous monomer organization has been described on chromosomes 7, 10, 16, 21 and the X chromosome (de la Puente et al. 1998; Guy et al. 2003; Horvath et al. 2000; Ikeno et al. 1994; Jackson et al. 1996; Schueler et al. 2001; Wevrick et al. 1992). Termed “monomeric” alpha satellite, this type of alpha satellite lacks any higher-order periodicity. Monomers within a region of monomeric alpha satellite exhibit greater sequence divergence than do higher-order repeat units. Where monomeric alpha satellite has been described, it has been found adjacent to higher-order alpha satellite and is less abundant than the megabase-sized arrays of higher-order alpha satellite. Unlike higher-order alpha satellite, monomeric alpha satellite is regularly interspersed with other repeats as well as some unique sequences. Centromere function has been linked to higher-order alpha satellite, yet there is no evidence for monomeric alpha satellite contributing to proper chromosome segregation (see below).

Thus, higher-order and monomeric alpha satellites occupy physically and functionally distinct regions of the chromosome. The arrays of higher-order alpha satellite and adjacent regions including monomeric alpha satellite have thus been termed the centromere and pericentromere, respectively (Horvath et al. 2001; Jackson 2003). The exact size of the pericentromeric regions likely varies among chromosomes and has not been defined. However, the sequences that make up the pericentromere have been a subject of much interest in recent years. In addition to monomeric alpha satellite, the pericentromeres of several chromosomes have been shown to contain classical satellites, expressed genes, as well as a high concentration of segmental duplications (Guy et al. 2000; Horvath et al. 2000; Jackson et al. 1996). Segmental duplications are duplicated blocks of genomic DNA several kilobases in size (Bailey et al. 2002; Bailey et al. 2001). These duplications occur within and between chromosomes and are highly enriched in pericentromeric regions. Although no centromere has been entirely sequenced (Eichler et al. 2004), there is no evidence within higher-order alpha satellite arrays for the kind of interspersed sequence organization seen in regions of monomeric alpha satellite (Schueler et al. 2001).

Figure 1-2: Alpha satellite organization at the human centromere. An example of human centromere organization is shown here for the chromosome 17 centromere. Alpha satellite organized in higher-order arrays (red) spans several megabases. In the case of D17Z1 higher-order alpha satellite, 16 monomers are arranged head to tail to comprise a 2.7 kb higher-order repeat unit. Monomeric alpha satellite (blue) lacking any higher-order periodicity is located at the edges of higher-order arrays. Higher-order repeat units are extremely homogeneous (97-100% identical), whereas monomeric monomers are on average 72% identical. (Figure labels: D17Z1 array, 3 +/- 1 Mb; monomer, 171 bp; higher-order repeat unit, 2.7 kb.)
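The sizes quoted for D17Z1 in this section follow from simple arithmetic on the ~171 bp monomer, and the identity values (for example 97-100% between higher-order repeat units) are percentages of matching positions between aligned sequences. The Python sketch below is an illustration only. The function names are hypothetical, the monomer length is treated as exactly 171 bp, the copy number of 1000 is a round stand-in for "over a thousand copies", and percent_identity assumes two pre-aligned, gap-free sequences of equal length, which is a simplification of how identity is scored in practice. The two short sequences in the usage lines are made-up toy strings, not alpha satellite.

```python
MONOMER_BP = 171  # approximate alpha satellite monomer length, from the text


def hor_length_bp(n_monomers, monomer_bp=MONOMER_BP):
    """Length of one higher-order repeat unit built from n_monomers monomers."""
    return n_monomers * monomer_bp


def array_length_bp(n_monomers, copies, monomer_bp=MONOMER_BP):
    """Length of a tandem array containing `copies` higher-order repeat units."""
    return hor_length_bp(n_monomers, monomer_bp) * copies


def percent_identity(seq_a, seq_b):
    """Percent of matching positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # D17Z1: 16 monomers per higher-order repeat unit.
    print(hor_length_bp(16))          # 2736 bp, i.e. about 2.7 kb
    print(array_length_bp(16, 1000))  # 2736000 bp, i.e. about 2.7 Mb
    # Toy sequences differing at one of ten positions.
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # 90.0
```

With sixteen monomers the unit length comes out at 2,736 bp, matching the 2.7 kb figure in the text, and a thousand such units already give an array of about 2.7 Mb, within the megabase range described for higher-order arrays.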

Alpha satellite evolution

The organization of alpha satellite is a product of concerted evolutionary processes (Durfy and Willard 1990; Warburton and Willard 1996; Warburton et al. 1993; Waye and Willard 1986). DNA sequences subject to concerted evolution typically exhibit higher sequence identity within a species than between species (Brown et al. 1972; Coen et al. 1982; Southern 1975). For example, higher-order repeat units from an array on one chromosome are more similar to each other than to the orthologous repeats in another species (Durfy and Willard 1990; Jorgensen et al. 1987).

Alpha satellite has been found at all primate centromeres studied; however, the organization and types of alpha satellite vary among species (Alexandrov et al. 2001; Warburton et al. 1996) (Figure 1-3A). In addition to the human centromere, higher-order alpha satellite has been found at some of the centromeres of chimpanzees (Baldini et al. 1991; Warburton et al. 1996; Waye and Willard 1989b), gorillas (Durfy and Willard 1990; Waye and Willard 1989b), and orangutans (Haaf and Willard 1998; Waye and Willard 1989b). Notably, higher-order alpha satellite has not been found in more distant primates. Indeed, only monomeric alpha satellite has been found in Old World Monkeys (Rosenberg et al. 1978; Singer and Donehower 1979; Thayer et al. 1981), New World Monkeys (Alves et al. 1994; Fanning 1989) and prosimians (Maio et al. 1981; Musich et al. 1980). As the centromeres from these monkeys have not been completely analyzed, the absence of higher-order alpha satellite may reflect a limited amount of alpha satellite sampling or a legitimate lack of higher-order structure. These findings are consistent with a model of alpha satellite evolution in which higher-order alpha satellite evolved relatively recently from monomeric alpha satellite.

Figure 1-3: Alpha satellite evolution model. (A) Types of alpha satellite found among primate centromeres. The simplified phylogenetic tree shows approximate divergence times between humans and other primates in millions of years ago (mya) (Kumar and Hedges 1998). (B) Model of alpha satellite evolution by unequal crossing over (see text for details). Small arrowheads denote monomeric alpha satellite, larger arrows represent higher-order alpha satellite. The black diamond and square represent non-alpha satellite sequences that are present within regions of monomeric alpha satellite in human chromosomes.

Like other tandem satellite families (Brown et al. 1972; Coen et al. 1982; Southern 1975), alpha satellite is subject to molecular drive mechanisms. Molecular drive is a model that attempts to explain the high sequence identity within a class of sequences. In this process, a sequence variant can quickly spread through a population and become fixed. Molecular drive operates within and between chromosomes and includes mechanisms such as unequal crossing-over, gene conversion, and transposition (Coen and Dover 1983; Dover 1982; Strachan et al. 1982; Strachan et al. 1985). Although all of these processes may be participating in alpha satellite evolution to some extent, the homogenization of alpha satellite can best be accounted for by unequal crossing-over.

Smith proposed a three-step mechanism to explain the emergence of tandem satellite repeats via unequal crossing-over (Smith 1976). The first step is a mutation that creates short local homology between two regions of a given sequence. In the second step, an unequal crossover event occurs between the two regions of homology generating two products, a deletion and a tandem duplication. Subsequent unequal crossovers between the duplicated repeats in the next step will produce expansions and contractions in the number of tandem repeats. As the number of tandem repeats increases so will the number of sites of homology, increasing the frequency of unequal crossovers between tandem repeats. Recurring crossovers will homogenize the tandem repeats, leading to highly identical repeat units. This process can explain the initial emergence of alpha satellite DNA as well as the homogenization of monomeric alpha satellite to form the higher-order repeat units that subsequently expanded to make up the megabase-sized arrays present on human centromeres (Figure 1-3).
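The second and third steps of the unequal crossing-over mechanism described above can be illustrated with a toy simulation. The Python sketch below is not the model used in this dissertation or in Smith (1976); it is a minimal illustration under stated assumptions. An array is represented as a list of repeat-unit labels, two identical copies of the array are misaligned by a few units, and one of the two recombinant products is kept at random. The function names, the uniform choice of breakpoints, the max_shift limit on misalignment and the min_len and max_len bounds on array length are all hypothetical choices made here. Mutation (the first step) is not modeled.

```python
import random


def unequal_crossover(a, b, i, j):
    """Recombine copy `a` broken at index i with copy `b` broken at index j.

    When i != j the copies are out of register, so one product gains repeat
    units (tandem duplication) and the other loses them (deletion).
    """
    return a[:i] + b[j:], b[:j] + a[i:]


def simulate(array, rounds, rng, max_shift=2, min_len=4, max_len=64):
    """Apply repeated unequal crossovers between two copies of one array."""
    arr = list(array)
    if min_len < 2 or not (min_len <= len(arr) <= max_len):
        raise ValueError("need 2 <= min_len <= len(array) <= max_len")
    for _ in range(rounds):
        n = len(arr)
        i = rng.randrange(1, n)                     # breakpoint in copy 1
        shift = rng.randint(-max_shift, max_shift)  # misalignment in units
        j = min(max(i + shift, 1), n - 1)           # breakpoint in copy 2
        product_1, product_2 = unequal_crossover(arr, arr, i, j)
        candidate = rng.choice([product_1, product_2])
        # Assumption: products outside the size bounds are simply discarded.
        if min_len <= len(candidate) <= max_len:
            arr = candidate
    return arr


if __name__ == "__main__":
    # One misaligned exchange: duplication of "BC" in one product, deletion in the other.
    dup, dele = unequal_crossover(list("ABCD"), list("ABCD"), 3, 1)
    print("".join(dup), "".join(dele))  # ABCBCD AD

    rng = random.Random(0)
    start = list(range(12))             # twelve distinct repeat units
    end = simulate(start, 2000, rng)
    print(len(set(start)), "distinct units at start,", len(set(end)), "after simulation")
```

Because no new variants are introduced, the set of repeat units after any number of rounds can only be a subset of the starting set; repeated expansion and contraction therefore tends to reduce the number of distinct units over time, which is the homogenizing effect the text attributes to recurring crossovers.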

The relationships between alpha satellite on different chromosomes, homologs of the same chromosome, and sister chromatids are very informative for determining the relative rates of unequal crossover events predicted to occur in alpha satellite evolution. With the exception of the centromeres on the acrocentric chromosomes, higher-order alpha satellite in the human genome is chromosome-specific, meaning that higher-order alpha satellite on one chromosome may be distinguished from that on another chromosome (Willard and Waye 1987a; Willard and Waye 1987b). This can best be explained by unequal crossover events between homologous chromosomes that homogenized alpha satellite into a chromosome-specific higher-order array (Durfy and Willard 1989; Schindelhauer and Schwarz 2002; Schueler et al. 2001; Warburton et al. 1993). The high sequence identity among thousands of higher-order repeat units on a given array argues that intrachromosomal exchange between homologous chromosomes is an efficient mechanism for homogenizing alpha satellite.

There is also evidence of interchromosomal exchanges involving alpha satellite. Higher-order repeats from different chromosomes have related organizations and fall into suprachromosomal families (Alexandrov et al. 1988; Greig et al. 1993; Waye et al. 1987a; Waye and Willard 1986; Willard and Waye 1987a). There are four major suprachromosomal families described in the human genome. Two families contain related higher-order alpha satellite organized in dimeric structures (...ABABAB...), while a third family contains higher-order alpha satellite organized in a pentameric structure (...ABCDEABCDE...) (Alexandrov et al. 1988). Higher-order alpha satellite found on the Y chromosome, DYZ3, does not fall into one of these groups and belongs to a more divergent family (Alexandrov et al. 1993). It is interesting to note that some chromosomes contain more than one higher-order array, and in most cases these arrays are very different in sequence identity and organization (Alexandrov et al. 1991; Baldini et al. 1989; Waye et al. 1987b; Waye et al. 1987c; Wevrick and Willard 1991), suggesting that separate homogenization events gave rise to the two distinct arrays. The related higher-order arrays on different chromosomes provide evidence for interchromosomal exchange; however, the overall sequence variation among higher-order repeats within a suprachromosomal family suggests that this type of exchange event occurred much less frequently than intrachromosomal exchanges between homologous chromosomes (Warburton and Willard 1995; Waye and Willard 1986; Willard and Waye 1987a).

Yet another driving force in alpha satellite evolution involves exchanges between sister chromatids. Warburton and Willard analyzed the higher-order repeat units that make up the D17Z1 array in three individual chromosomes 17 using two-dimensional gels (Warburton and Willard 1990). Variation in higher-order repeat unit length within an array on different chromosomes 17 suggests that these variants evolve along haplotypic lineages that have arisen relatively recently (Warburton and Willard 1990; Warburton and Willard 1995). Thus, alpha satellite evolution is a multifaceted process, working at the level of exchange events between different chromosomes, between homologs of the same chromosome, and between sister chromatids.

Based on these data, it is possible to hypothesize the steps involved in alpha satellite evolution (Figure 1-3B). Following sequence mutation(s), the first alpha satellite monomer was duplicated by unequal crossover to form a dimer early in the primate lineage. These dimers expanded via unequal crossover mechanisms to make a large stretch of tandem monomers such as the monomeric alpha satellite found at the centromeres of Old and New World Monkeys. Subsequent unequal crossovers within chromosomes homogenized monomers further to give rise to the higher-order arrays found in the great apes. As higher-order alpha satellite took on the role of centromere function (see below), the monomeric alpha satellite on the periphery was free to accumulate insertions and mutations without phenotypic consequence. This leads to the current organization of a typical human centromere: a large array of higher-order alpha satellite bordered by more divergent monomeric alpha satellite interspersed with other sequences.

The proposed model of alpha satellite evolution will be further evaluated in chapter 3. Based on the model, we would expect higher-order alpha satellite to evolve more rapidly than monomeric alpha satellite. The relationships between alpha satellites on different chromosomes of the same species and between orthologous chromosomes of different species will reveal a great deal about the evolution of alpha satellite in primates.
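The dimeric (...ABABAB...) and pentameric (...ABCDEABCDE...) organizations described above can be recognized from monomer-to-monomer identity: in a higher-order array, monomer i is nearly identical to monomer i + k, where k is the number of monomers per higher-order repeat unit, while adjacent monomers are more divergent. The following is a minimal sketch of that idea, assuming the monomers have already been extracted and aligned to equal length. The function names, the 95% threshold, and the 10-base toy "monomers" are illustrative assumptions, not values from the studies cited.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_identity_at_offset(monomers, k):
    """Mean identity between each monomer and the monomer k positions later."""
    pairs = [(monomers[i], monomers[i + k]) for i in range(len(monomers) - k)]
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


def infer_period(monomers, max_period, threshold=0.95):
    """Smallest offset whose mean identity reaches the threshold, else None.

    A return value of None is what a divergent monomeric array would give
    under these assumptions: no offset brings the monomers back into register.
    """
    for k in range(1, max_period + 1):
        if k < len(monomers) and mean_identity_at_offset(monomers, k) >= threshold:
            return k
    return None


# Toy stand-ins for divergent monomer types (hypothetical sequences).
A, B, C, D, E, F = (
    "ACGTACGTAC", "TTGACCATGA", "GGCATTACGT",
    "CATGGTCAAT", "TACCGATTGC", "AGTTCAGGCA",
)

dimeric_array = [A, B] * 4
pentameric_array = [A, B, C, D, E] * 3
monomeric_array = [A, B, C, D, E, F]

if __name__ == "__main__":
    print(infer_period(dimeric_array, max_period=6))
    print(infer_period(pentameric_array, max_period=6))
    print(infer_period(monomeric_array, max_period=4))
```

Real alpha satellite monomers are far longer and differ by insertions and deletions as well as substitutions, so an actual analysis would need an alignment step before any identity calculation of this kind.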


Assembling repetitive regions of the genome

To better understand the organization and evolution of alpha satellite, it is necessary to fully sequence and analyze at least some human centromeres. Assembling the extremely identical repeat units that make up higher-order alpha satellite is a daunting task. The majority of the human genome has been sequenced and assembled (Lander et al. 2001; Venter et al. 2001); yet, no human centromere has been completely assembled (Eichler et al. 2004). The centromere regions were intentionally omitted from the human genome project, due largely to the assumption that they contained nothing but junk DNA, and also due to the perceived difficulty in sequencing and assembling these repetitive regions (Collins et al. 1998; Lander et al. 2001).

Although no human centromere has been completely sequenced, several higher-order arrays of alpha satellite have been extensively mapped (Jackson et al. 1996; Mahtani and Willard 1990; Mahtani and Willard 1998; Puechberty et al. 1999; Tyler-Smith and Brown 1987; Warburton and Willard 1990; Wevrick and Willard 1989; Wevrick and Willard 1991). A general strategy for mapping higher-order arrays uses restriction enzymes that regularly cut within typical genomic DNA, but that rarely cut within higher-order alpha satellite (Warburton et al. 1991). After pulsed-field gel electrophoresis, megabase-sized arrays can be resolved and detected by Southern blot analysis. The next step in sequencing human centromeric regions should focus on connecting existing chromosome arm contigs to higher-order alpha satellite (Guy et al. 2003; Guy et al. 2000; Horvath et al. 2000; Schueler et al. 2001), and then on developing a strategy to sequence across the highly homogeneous arrays of higher-order alpha satellite.

The most challenging part of sequencing across megabases of higher-order alpha satellite is not the sequencing per se, but the assembly process. The human genome project used bacterial artificial chromosomes (BACs) to create a tiling path across chromosomes that were subsequently sequenced and assembled (Lander et al. 2001). This methodology could be applied to higher-order alpha satellite to create a BAC scaffold underlying a restriction-mapped array. However, the subsequent assembly of sequences within each BAC is far more complicated than assembling typical genomic DNA. Higher-order repeat units are up to 100% identical (Durfy and Willard 1989; Schindelhauer and Schwarz 2002; Schueler et al. 2001), so it would be very easy to compress independent sequences. The amount of sequence divergence among higher-order repeat units is comparable to the amount of variation seen in typical genomic DNA assemblies due to allelic variation or polymorphism (Lander et al. 2001; Venter et al. 2001). This is similar to the situation on the human Y chromosome. The Y chromosome is made up of highly homogeneous repetitive sequences, up to 100% identical to each other. To assemble the sequence of this problematic chromosome, Skaletsky et al. used a BAC library constructed from only one man’s Y chromosome and sequenced redundant BACs to avoid the problems associated with normal levels of polymorphism (Skaletsky et al. 2003). A similar strategy could be employed to assemble a higher-order array of alpha satellite.
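The risk of compressing independent copies of a repeat can be made concrete with a small k-mer comparison. The sketch below is an illustration only and does not reproduce the method of any particular assembler: the 12-base repeat unit, the k-mer length, and the variable names are hypothetical. It shows that arrays with different numbers of identical units contain exactly the same set of short subsequences, and that a single variant base in one unit creates the unique subsequences an assembly would need as anchors.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


UNIT = "ACGGTCATTGCA"   # hypothetical stand-in for one repeat unit
K = 8                   # arbitrary k-mer length, shorter than the unit

three_copies = UNIT * 3
ten_copies = UNIT * 10

# Identical units: the k-mer content carries no information on copy number.
copy_number_hidden = kmers(three_copies, K) == kmers(ten_copies, K)

# One substituted base (A -> T at the start of the fifth unit) yields
# k-mers found nowhere else in the array.
variant_unit = "T" + UNIT[1:]
marked_array = UNIT * 4 + variant_unit + UNIT * 5
anchors = kmers(marked_array, K) - kmers(ten_copies, K)

if __name__ == "__main__":
    print("same k-mer set for 3 and 10 copies:", copy_number_hidden)
    print("unique k-mers created by one variant base:", sorted(anchors))
```

This is also why polymorphism between two haplotypes is a problem in such regions: when true differences between repeat units are as rare as allelic differences, the two become hard to tell apart, which is consistent with the single-haplotype strategy described above.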

Centromeric chromatin and centromere function

Although, as outlined above, the organization of centromeric DNA varies widely among organisms, the chromatin modifications and proteins involved in centromere function are very well conserved from yeast to humans (Baum and Clarke 2000; Brown et al. 1993; Buchwitz et al. 1999; Earnshaw et al. 1987; Henikoff et al. 2000; Howman et al. 2000; Kalitsis et al. 1998; Oegema et al. 2001; Palmer et al. 1991; Stoler et al. 1995; Sullivan and Glass 1991; Takahashi et al. 2000; Tomkiel et al. 1994). Nevertheless, centromere function is complex in that the centromere coordinates multiple processes and different protein players are involved in each step. First of all, the centromere is the site of the kinetochore, a proteinaceous structure responsible for attaching the chromosome to spindle microtubules. To ensure proper chromosome segregation, the centromere must also satisfy spindle assembly checkpoints and release sister chromatid cohesion at the right time. This suggests that the sequences required for all aspects of centromere function may in fact be much larger than those that delineate the region of the kinetochore.

The likely primary mark of the functional centromere is the histone variant CENP-A, also known as CENH3 (Ahmad and Henikoff 2002). CENP-A is a histone H3 variant found at active centromeres in every organism studied (Sullivan et al. 2001). Depleting CENP-A in yeast (Stoler et al. 1995), flies (Blower and Karpen 2001), worms (Oegema et al. 2001), mice (Howman et al. 2000), and human cells (Valdivia et al. 1998) causes chromosome segregation defects and also has downstream effects on the localization of other centromere proteins, supporting its role as the primary epigenetic mark. CENP-A and histone H3 nucleosomes are interspersed at the centromeres of flies and humans (Blower et al. 2002), and CENP-A can substitute for histone H3 in reconstituted nucleosomes in vitro (Yoda et al. 2000). Given the fact that CENP-A is a histone variant, it may set up the centromere-specific chromatin conformation that then recruits other centromere proteins.

There are a number of proteins, DNA binding proteins as well as motor proteins, that are part of the kinetochore. Centromere protein B (CENP-B) is a DNA binding protein found at the centromeres of diverse organisms, and it recognizes a 17 bp sequence known as the “CENP-B box” in mouse minor satellite and human higher-order alpha satellite (Masumoto et al. 1989). The CENP-B box sequence has also been found at the centromeres of the great apes, but not in Old World Monkeys, New World Monkeys, or prosimians (Goldberg et al. 1996; Haaf et al. 1995). Despite its conservation, the role of this protein in centromere function is questionable, as knockout mice have no mitotic defects (Hudson et al. 1998) and the Y chromosome of mice and humans has no detectable CENP-B protein (Broccoli et al. 1990; Earnshaw et al. 1991). In fact, chromosome segregation errors associated with CENP-B depletion have only been seen in an S. pombe minichromosome (Irelan et al. 2001). Another DNA binding protein, CENP-C, is directly involved in centromere function, as its depletion causes chromosome segregation defects in yeast (Brown et al. 1993), C. elegans (Moore and Roth 2001; Oegema et al. 2001), mice (Kalitsis et al. 1998) and human cells (Tomkiel et al. 1994). Other proteins such as dynein, MCAK, and CENP-E are also members of the kinetochore, playing a role in chromosome movement along the microtubules (Rieder and Salmon 1998). Spindle checkpoint proteins such as Mad2 and Bub1 are also critical for chromosome segregation, as they signal the start of anaphase once all kinetochores are attached to the spindle (Shah and Cleveland 2000).

Proper resolution of sister chromatid cohesion is also required for chromosome segregation. After proceeding into mitotic anaphase, sister chromatids completely lose cohesion and separate to opposite poles of the cell. Loss of sister chromatid cohesion is a two-step process in meiosis; in the first meiotic division, cohesion is removed from chromosome arms but maintained at the centromere, and then in the second meiotic division cohesion is completely removed (Dej and Orr-Weaver 2000). The mitotic cohesin complex is made up of the proteins SCC1/Rad21, SCC3, SMC1 and SMC3; however, in meiosis Rec8 substitutes for SCC1/Rad21 (Klein et al. 1999; Lee and Orr-Weaver 2001; Parisi et al. 1999). In the absence of cohesins, chromosomes missegregate, exhibiting chromosome lag and premature sister chromatid separation (Bernard et al. 2001; Hoque and Ishikawa 2002; LeBlanc et al. 1999).

The relationships among centromeric chromatin, kinetochore formation, spindle checkpoints, and resolution of sister chromatid cohesion have been best described in S. pombe and D. melanogaster. The fission yeast centromere is made up of two main domains, the central core and inner repeats responsible for kinetochore activity (Nakaseko et al. 2001) and the outer repeats responsible for pericentromeric heterochromatin formation and sister chromatid cohesion (Figure 1-4). Both centromere chromatin domains are transcriptionally silent (Allshire et al. 1995); however, silencing is mediated by different proteins. Mutations in Swi6 and Chp1 alleviate silencing of transgenes in the outer repeats and a mutation in Mis6 alleviates silencing in the central core (Partridge et al. 2000). Swi6 is the yeast ortholog of heterochromatin protein 1 (HP1), a protein first discovered in flies and found to localize to heterochromatic regions of the Drosophila genome (James and Elgin 1986; James et al. 1989). The chromodomain protein, Chp1, is involved in heterochromatin formation and chromosome segregation (Doe et al. 1998). Mis6 is required for the proper loading of Cnp1, the S. pombe ortholog of CENP-A (Takahashi et al. 2000). Chromatin immunoprecipitation experiments are consistent with these data, as Cnp1 and Mis6 are associated with the central core and inner repeats (Takahashi et al. 2000) while Swi6 and Chp1 associate with the outer repeats (Partridge et al. 2000). The histone methyltransferase Clr4 (Su(var)3-9 ortholog) methylates histone H3 at lysine 9 and is required for Swi6 localization to the S. pombe centromere (Bannister et al. 2001). Swi6 is required for the

[Figure 1-4 schematic, three panels. S. pombe: centromere (ImrL, Cnt, ImrR) bound by Cnp1 and Mis6, flanked on both sides by pericentromeric dg and dh repeats bound by Swi6, Chp1, Rad21 and Psc3. D. melanogaster: underlying sequences unknown; centromere associated with Cid, Polo kinase, Rod, Cenp-meta, dynein, ZW10, Bub1 and Bub2, flanked by pericentromere associated with HP1, Prod, Rad21 and Mei-S332. Human: higher-order alpha satellite (CENP-A, CENP-B, CENP-C; other kinetochore proteins? checkpoint proteins?) flanked by monomeric alpha satellite (cohesins? heterochromatin proteins?).]

Figure 1-4: Centromere and pericentromere model. The S. pombe centromere has been defined at the DNA and chromatin levels by chromatin immunoprecipitation (ChIP). The CENP-A ortholog, Cnp1, and Mis6 are located at the centromeric region, and heterochromatin protein Swi6 and cohesins Rad21 and Psc3 are located at the pericentromere. The D. melanogaster centromere has not been sequenced; however, numerous proteins have been cytologically localized to the centromeric and pericentromeric regions. The human centromere is made up of higher-order alpha satellite and the adjacent pericentromere is made up of monomeric alpha satellite interspersed with other sequences. CENP-A, CENP-B and CENP-C proteins associate with higher-order alpha satellite as shown by ChIP; however, the precise location of other proteins is unknown.

association of the cohesins SCC1/Rad21 (Bernard et al. 2001) and Psc3 (Nonaka et al. 2002) at the outer repeats of the S. pombe centromere. Mutations in either Swi6 (Ekwall et al. 1995) or Rad21 (Bernard et al. 2001) cause chromosome lag, suggesting that although the outer repeats are not the site of microtubule binding, they still are important for centromere function.

The fly centromeric and pericentromeric chromatin domains have a similar bipartite organization. The Drosophila CENP-A ortholog, CID, localizes to the genetically defined minichromosome centromere (see above) as well as endogenous fly centromeres (Blower and Karpen 2001; Henikoff et al. 2000). CID is also required for recruiting kinetochore and spindle checkpoint proteins such as POLO kinase, ROD, Cenp-meta and BUB1, as well as the cohesin MEI-S332, to the centromere (Blower and Karpen 2001). However, CID is not responsible for the pericentromeric localization of the heterochromatin protein HP1 or the condensation protein PROD. Mutations in polo, mei-S332, HP1 or prod have no effect on CID localization, suggesting that CID is upstream of these proteins in the centromere function pathway. Cytological studies have shown that the kinetochore proteins occupy a region separate from the more distal pericentromeric heterochromatin and sister chromatid cohesion proteins (Blower and Karpen 2001) (Figure 1-4). Nevertheless, loss of mei-S332 affects chromosome segregation, as minichromosomes on a mutant mei-S332 background have a significant drop in transmission (Lopez et al. 2000). Cells mutant for the Drosophila ortholog of Rad21 have mitotic defects such as premature sister chromatid separation and abnormal spindle morphology (Vass et al. 2003). Thus it appears that, similar to the case in S. pombe, Drosophila centromeres can be divided into centromeric and pericentromeric chromatin domains, both of which are required for proper chromosome segregation.

It is tempting to apply this domain model to the organization of the human centromere (Sullivan 2002). As described above, the human centromere is made up of alpha satellite arranged in higher-order arrays flanked by more divergent monomeric alpha satellite (Figure 1-4). Monomeric alpha satellite is interspersed with other sequences, and there is no evidence for monomeric alpha satellite involvement in centromere function (see below). However, as presented in this thesis, human artificial chromosomes derived from higher-order alpha satellite and lacking monomeric alpha satellite have an increase in anaphase lag and nondisjunction as compared to normal chromosomes (Chapter 4). It may be the case that, although monomeric alpha satellite cannot nucleate the site of the kinetochore on its own (Ikeno et al. 1998), it is required for setting up the pericentromeric chromatin state, similar to the kinetochore flanking sequences in S. pombe and D. melanogaster. Future studies carefully dissecting the locations of centromere proteins, heterochromatin proteins and cohesins at the human centromere will determine the difference between centromeric and pericentromeric domains.


Assessing centromere function in human chromosomes

The requirements for centromeric and pericentromeric functions in model organisms have been well defined, both at the level of DNA sequence and centromere protein content. Studies involving the human centromere lack the tractable genetic systems found in other organisms, making it difficult to test specific regions functionally for centromere activity. Both sequence specificity and epigenetic modifications are likely responsible for human centromere function (Figueroa et al. 1998; Harrington et al. 1997; Ikeno et al. 1998; Tomkiel et al. 1994; Yen et al. 1991). Nevertheless, the roles of DNA sequence and epigenetics in human centromere function are a topic of much debate (Choo 2000; Cleveland et al. 2003; Karpen and Allshire 1997; Murphy and Karpen 1998). Although all human, and for that matter other primate, centromeres are made up of alpha satellite DNA, there are two lines of evidence used to argue for the sequence independence of human centromeres based on the study of different types of chromosome abnormalities observed in human patient material.

Dicentric chromosomes

Dicentric human chromosomes contain two distinct arrays of alpha satellite formed by chromosome breakage and fusion events (Earnshaw et al. 1989; Higgins et al. 1999; Page et al. 1995; Page and Shaffer 1998; Sullivan and Schwartz 1995; Sullivan and Willard 1998; Therman et al. 1974). To maintain chromosome stability, only one centromere can remain active, because if the chromosome attaches to spindle microtubules at two sites it could be pulled to opposite poles of the cell, causing chromosome breakage or anaphase bridging (McClintock 1939). Dicentric chromosome stability has been hypothesized to occur by either inactivating one centromere (Therman et al. 1974) or by coordinating the activity of the two centromeres (Page and Shaffer 1998; Sullivan and Willard 1998). In either case, centromere activity is assessed by the ability to bind centromere proteins. There are several cases in which only one of the two alpha satellite arrays on a dicentric chromosome binds antibodies to centromere proteins (Earnshaw et al. 1989; Sullivan and Schwartz 1995). Both active and inactive regions of alpha satellite bind antibodies to CENP-B; however, only the active centromere binds antibodies to CENP-C and CENP-E. This suggests that one previously active centromere has been epigenetically inactivated and has lost the ability to bind proteins involved in spindle microtubule attachment.

The fact that a region of alpha satellite can exist on a chromosome without conferring centromeric activity has led some to propose that alpha satellite is not sufficient for centromere function (Choo 2000; Cleveland et al. 2003; Karpen and Allshire 1997; Murphy and Karpen 1998). However, this argument is misleading. Much in the same way that a previously active gene can be silenced, human centromeres may be epigenetically inactivated during dicentric chromosome formation. Just as a silenced gene is still a “gene”, an inactive centromere is still a “centromere”. Alpha satellite is clearly sufficient for centromere function as demonstrated by artificial chromosome studies (see below).

Neocentromeres

The existence of neocentromeres also provides an argument for the sequence independence of centromere function. Neocentromeres are regions of chromosomes that do not contain typical centromeric DNA, but that have been modified epigenetically to act as a centromere and segregate the chromosome faithfully. Neocentromeres were first described in maize (Rhodes and Vilkomerson 1942), and have also been engineered in flies (Maggert and Karpen 2001; Platero et al. 1999; Williams et al. 1998). Human neocentromeres are found on marker chromosomes detected in patient material and derived from chromosome breakage events in which a previously acentric fragment acquires centromere activity (Depinet et al. 1997; du Sart et al. 1997).

Neocentromeres have been extensively characterized to determine the molecular structure and epigenetic modifications responsible for centromere activity. There appear to be “hotspots” for neocentromere formation as certain regions of the genome are commonly rearranged to form marker chromosomes with the same breakpoints (Warburton et al. 2000). The DNA sequences underlying the centromere protein binding domains of two different neocentromeres have been analyzed (Barry et al. 1999; Satinover et al. 2001). An increase in AT-richness as compared to the genome average was found at both neocentromeres, and other sequences such as classical satellites and LTRs were also enriched at the neocentromeres. These data suggest that, although neocentromeres contain no detectable alpha satellite DNA, there could be some sequence characteristics that predispose these loci to centromere activity (Koch 2000). This hypothesis has been challenged by a finding that three marker chromosomes derived from the same region of chromosome 13 all have different CENP-A binding domains (Alonso et al. 2003). Thus, the regions of the genome from which marker chromosomes are derived may be hotspots for chromosome breakage events; however, the acquisition of centromeric activity is probably not sequence dependent.

Other parameters such as centromere protein deposition, replication timing, and histone acetylation status likely define neocentromere function. Neocentromeres bind antibodies to centromere proteins found at normal active centromeres except for antibodies to CENP-B (Depinet et al. 1997; du Sart et al. 1997; Floridia et al. 2000; Saffery et al. 2000; Slater et al. 1999; Voullaire et al. 1999; Voullaire et al. 2001; Warburton et al. 2000). A neocentromere derived from chromosome 10q25 replicates later in the cell cycle than the normal 10q25 locus (Lo et al. 2001), similar to the replication timing of normal human centromeres (Shelby et al. 2000). Additionally, treatment with the drug Trichostatin A hyperacetylates the normally hypoacetylated neocentromere derived from 10q25 and shifts the CENP-A binding domain of the neocentromere (Craig et al. 2003).

Thus, neocentromeres appear to be defined by epigenetic factors such as histone modifications and variants, centromere protein binding and replication timing rather than DNA sequence specificity. These data demonstrate that alpha satellite DNA is not always necessary for centromere function, although neocentromere formation is an extremely rare event. However, alpha satellite is the only sequence capable of recapitulating centromere function in human cells (see below), providing strong evidence for a role for DNA sequence in determining centromere identity in normal chromosomes.
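The comparison of AT-richness at a locus against a genome-wide average, as reported for the two sequenced neocentromeres, amounts to a windowed base-composition calculation. The sketch below shows one way such a comparison could be set up. It is not the procedure used in the cited studies; the function names, window parameters, toy sequence, and the baseline value of 0.5 are placeholders chosen for illustration, and the caller would supply a real genome-wide AT fraction.

```python
def at_fraction(seq):
    """Fraction of A or T among the unambiguous bases of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no unambiguous bases in sequence")
    return sum(b in "AT" for b in bases) / len(bases)


def at_enrichment(seq, window, step, baseline):
    """List (start, AT fraction, AT fraction minus baseline) for each window."""
    results = []
    for start in range(0, len(seq) - window + 1, step):
        frac = at_fraction(seq[start:start + window])
        results.append((start, frac, frac - baseline))
    return results


if __name__ == "__main__":
    toy_locus = "AAAATTTTGGGGCCCC"   # hypothetical sequence
    for start, frac, delta in at_enrichment(toy_locus, window=8, step=8, baseline=0.5):
        print(start, frac, delta)
```

A positive difference marks a window that is more AT-rich than the supplied baseline. As the chromosome 13 marker data above indicate, such enrichment alone would not show that sequence composition determines where centromere proteins bind.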

Normal human centromeres

The most direct way to assess the requirements for human centromere function is by examining normal human centromeres. There are two principal strategies for determining what DNA is present at the functional centromere, either by looking for the type of DNA associated with centromere proteins on a normal chromosome or by using minichromosome and artificial chromosome assays to determine the minimal DNA sequences required for centromere function. In the first approach, antibodies to centromere proteins known to be present at active centromeres are used to determine which DNA sequences colocalize with the active centromere.

The colocalization of centromere proteins and alpha satellite DNA has been demonstrated in a number of studies. Before the purification of antibodies to specific centromere proteins, CREST antisera were used to identify the functional centromere. Serum from patients with calcinosis, Raynaud syndrome, esophageal dysmotility, scleroderma, and telangiectasia (CREST) (Moroi et al. 1980) contains antibodies to CENPs -A, -B, and -C (Earnshaw and Rothfield 1985). CREST sera were shown to colocalize with a degenerate alpha satellite probe on mechanically stretched chromosomes; however, the alpha satellite signal extended beyond the edges of the CREST immunostaining (Zinkowski et al. 1991). Similarly, antibodies to CENP-A only bind to a portion of the alpha satellite at human centromeres, at the site of the inner kinetochore (Warburton et al. 1997). Most recently, extended chromatin fiber experiments have demonstrated that antibodies to CENP-A only stain a portion of the alpha satellite at human centromeres. About one-half to two-thirds of the stretched alpha satellite region overlaps with CENP-A (Blower et al. 2002; Sullivan et al. 2002). These collective data suggest that only a subset of alpha satellite DNA is part of the functional centromere, but they do not define the particular type of alpha satellite participating in centromere function.

Haaf and Ward investigated the functionality of two distinct higher-order arrays of alpha satellite on chromosome 7, D7Z1 and D7Z2 (Waye et al. 1987c; Wevrick and Willard 1991). Studies in mechanically extended chromosomes and interphase nuclei showed that only D7Z1, and not D7Z2, colocalized with CREST autoantibodies (Haaf and Ward 1994). Thus, on chromosome 7, centromere function is restricted to only one type of higher-order alpha satellite. Given the adjacent organization of monomeric and higher-order alpha satellite at the human centromere and the difference in repetitive structure between the two, it is of interest to determine whether centromere function is restricted to higher-order alpha satellite or whether monomeric alpha satellite is also part of the kinetochore. This question will be addressed in this thesis in chapter 3.

Chromatin immunoprecipitation experiments with antibodies to centromere proteins also support a functional role for alpha satellite DNA. Vafa and Sullivan first showed that alpha satellite does in fact immunoprecipitate with antibodies to CENP-A and proposed a specialized phasing for CENP-A-containing nucleosomes (Vafa and Sullivan 1997). In another study, antibodies to CENP-B and CENP-C as well as CENP-A were found to be associated with alpha satellite DNA (Ando et al. 2002). Upon cloning and sequencing the chromatin immunoprecipitated DNA, the only type of alpha satellite associated with centromere proteins contained CENP-B boxes. CENP-B recognition sites are found only in higher-order alpha satellite and not monomeric alpha satellite (Masumoto et al. 1989), suggesting that only higher-order alpha satellite is part of the centromere protein complex at the human kinetochore. These two chromatin immunoprecipitation studies are consistent with cytological centromere protein colocalization experiments and strongly support a role for alpha satellite in centromere function.
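Whether a cloned alpha satellite sequence carries a CENP-B box can be checked with a simple motif scan. The sketch below assumes the 17 bp core pattern NTTCGNNNNANNCGGGN that is commonly quoted in the literature for the bases essential to CENP-B binding. That pattern is not given in this text and is used here as an assumption, as are the function names and the toy test sequences.

```python
import re

# Assumed 17 bp CENP-B box core (N = any base); a lookahead is used so that
# overlapping candidate sites would all be reported.
CENPB_CORE = re.compile(r"(?=([ACGT]TTCG[ACGT]{4}A[ACGT]{2}CGGG[ACGT]))")
BOX_LENGTH = 17
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_cenpb_boxes(seq):
    """Return sorted (start, strand, site) tuples for matches on both strands.

    `start` is the 0-based position of the leftmost base of the site on the
    given (forward) sequence, whichever strand the match lies on.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in CENPB_CORE.finditer(seq)]
    for m in CENPB_CORE.finditer(reverse_complement(seq)):
        start = len(seq) - (m.start() + BOX_LENGTH)
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    with_box = "GGG" + "CTTCGTTGGAAACGGGA" + "TTT"   # toy sequence with one site
    without_box = "ACGTACGTACGTACGTACGT"              # toy sequence with none
    print(find_cenpb_boxes(with_box))
    print(find_cenpb_boxes(without_box))
```

A scan of this kind only reports sequence matches. Whether a given site is actually bound by CENP-B in cells is an experimental question of the sort addressed by the chromatin immunoprecipitation studies described above.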


Minichromosomes

In addition to strategies that examine the DNA and protein composition at endogenous centromeres, minichromosome and artificial chromosome studies have also explored the minimal requirements for centromere function. Telomere sequences have been introduced into human cell lines to truncate existing chromosomes into smaller minichromosomes. The minichromosome can be mapped subsequently to determine the sequences responsible for centromere function on this minimal chromosome (Farr et al. 1992; Heller et al. 1996). In contrast, human artificial chromosomes are derived from naked DNA transfected into tissue culture cells (Harrington et al. 1997; Ikeno et al. 1998). Artificial chromosomes may be used as an assay to determine the types of sequences capable of forming a de novo centromere and are a valuable tool for determining the sequence requirements for centromere function (see below).

Farr et al. engineered telomere truncation chromosomes by introducing telomere repeats in a non-targeted fashion to truncate the human X chromosome at a number of locations along the q arm (Farr et al. 1992). These chromosomes were further truncated along the p arm to generate minichromosomes less than 2.4 Mb in size (Farr et al. 1995; Mills et al. 1999). The minimal chromosome that retained mitotic stability was 1.4 Mb overall with a 670 kb array of DXZ1 higher-order alpha satellite. Below this threshold, chromosomes with less DXZ1 or less flanking sequence on the p side of DXZ1 were mitotically unstable, suggesting that both higher-order alpha satellite and neighboring pericentromeric sequence may be required for proper centromere function (Spence et al. 2002).

Similar chromosome truncation studies have been conducted on the human Y chromosome (Heller et al. 1996). The smallest Y chromosome-based minichromosome exhibiting faithful segregation was 1.8 Mb overall, with an approximately 100 kb array of DYZ3 higher-order alpha satellite (Shen et al. 2001; Yang et al. 2000). These data from the X- and Y-based minichromosomes demonstrate that higher-order alpha satellite is capable of maintaining centromere function after the original chromosome has been significantly truncated and/or rearranged. The fact that the smallest minichromosomes are larger than just the higher-order alpha satellite array may reflect a requirement for other flanking sequences or may simply be an artifact of the telomere truncation process. Telomere constructs may not have integrated within higher-order alpha satellite on both sides of the array to create purely higher-order minichromosomes. Conversely, these kinds of events may have been so detrimental to chromosome segregation that they were not recoverable.

Human artificial chromosomes

Artificial chromosome studies address the requirements for centromere establishment as well as maintenance. Candidate DNA sequences are transfected into human tissue culture cells to test them for the ability to form an artificial chromosome with a de novo centromere derived from the input DNA.

Two major strategies have been employed to construct artificial chromosomes, using either single DNA molecules (Ikeno et al. 1998) or a combination of linear DNA fragments (Harrington et al. 1997). Numerous artificial chromosome studies have tested alpha satellite sequences, non-alpha satellite sequences, and different types of alpha satellite for centromere functionality. The first human artificial chromosome study combined the principal components of chromosomes, namely centromeres, telomeres, and genomic DNA (presumably containing origins of replication), to generate small linear artificial chromosomes. Harrington et al. combined linear arrays of synthetic D17Z1 or DYZ3 higher-order alpha satellite with linearized genomic DNA and TTAGGG telomere repeats (Harrington et al. 1997). In an alternate approach, Ikeno et al. engineered a yeast artificial chromosome (YAC) construct containing alpha satellite and telomere sequences on a single molecule (Ikeno et al. 1998). YACs containing either higher-order or monomeric alpha satellite from chromosome 21 were retrofitted with telomere repeats by homologous recombination. In both studies, DNA sequences were transfected into a human fibrosarcoma cell line, HT1080. Interestingly, only higher-order alpha satellites from chromosome 17 and chromosome 21 were capable of forming artificial chromosomes with de novo centromeres. The inability of higher-order alpha satellite from the Y chromosome to form a de novo centromere has also been demonstrated in other studies (Grimes et al. 2002). Artificial chromosomes were mitotically stable in the absence of drug selection and bound antibodies to centromere proteins,

demonstrating the assembly of a fully functional human centromere. These experiments suggest that higher-order and monomeric alpha satellite differ in their capacities to establish a centromere. Higher-order alpha satellite clearly has some functional capability lacking in monomeric alpha satellite.

Since these two original studies, higher-order alpha satellite from chromosome 22 (Kouprina et al. 2003) and a chimeric YAC containing higher-order alpha satellite found on chromosomes 4, 14 and 22 (Henning et al. 1999) have also been successful in generating artificial chromosomes. Conversely, the sequences comprising the neocentromere derived from chromosome 10q25 (Saffery et al. 2001) as well as other non-alpha satellite sequences (Ebersole et al. 2000; Grimes et al. 2002) are not capable of forming artificial chromosomes with de novo centromeres in human cells.

So what characteristic of higher-order alpha satellite is responsible for conferring centromere function? Is it the extremely homogeneous organization of higher-order repeats? Or the presence of CENP-B boxes in higher-order but not monomeric alpha satellite? Or are specific basepairs present in higher-order repeats besides the CENP-B box responsible for centromere function? Expanding on the earlier study involving alpha satellite from chromosome 21 (Ikeno et al. 1998), Ohzeki et al. generated a number of constructs to begin to address the specific characteristics of higher-order alpha satellite that nucleate centromere function (Ohzeki et al. 2002). A mutation was introduced into the CENP-B boxes in the higher-order repeat unit, causing a failure to bind CENP-B protein in a gel shift assay. BACs containing either normal higher-order or mutated higher-order alpha satellite were transfected into HT1080 cells. Only normal higher-order alpha satellite was capable of artificial chromosome formation. These data suggest that the mutations in this construct are responsible for the absence of centromere function, but it remains to be determined if centromere function was abolished due to the mutation specifically in the CENP-B box or if any mutation in higher-order alpha satellite could hinder centromere function. Further experiments showed that a non-alpha satellite construct that contained CENP-B boxes was not capable of artificial chromosome formation, suggesting that CENP-B binding alone is not sufficient for centromere function (Ohzeki et al. 2002). The absence of CENP-B protein on the Y chromosome and neocentromeres suggests that it is not an integral part of centromere function. The sequence requirements for human centromere function are complex, but likely involve AT-richness and highly repetitive DNA.

Research Objectives and Thesis Outline

The following chapters will expand on the requirements for human centromere function and evaluate the role of alpha satellite DNA through genomic analyses and functional experiments. Chapter 2 describes the analysis of the types of alpha satellite in the current genome assembly (Build 34, July 2003), as well as the other types of sequences in the vicinity of human centromeres and pericentromeric regions. The functionality of higher-order and monomeric alpha satellite is tested by centromere protein colocalization experiments on centromeres from six chromosomes. In Chapter 3, the chromosome 17 centromere is analyzed for its genomic content as well as its evolutionary history. Chapter 4 tests D17Z1 and DXZ1 higher-order alpha satellites functionally for their ability to form a de novo centromere on an artificial chromosome. The segregation of DXZ1- and D17Z1-based artificial chromosomes is evaluated and compared to that of patient-derived ring chromosomes and normal chromosomes. Chapter 5 discusses the impact of these experiments in the fields of centromere biology and genomics as well as the future experiments to further explore what is required for human centromere function. The appendix is based on a review article that describes the types of alpha satellite in the human genome and discusses ways to functionally annotate these sequences.


Chapter 2

Analysis of the centromeric regions of the human genome assembly

M. Katharine Rudd and Huntington F. Willard

Note: This chapter has been adapted from a manuscript accepted in Trends in Genetics as a peer-reviewed “Genome Analysis” article and reformatted for this document.

Abstract. The sequence of the human genome is not yet complete, and major gaps remain at the centromere region of each chromosome. Human centromeres are comprised of megabases of repetitive alpha satellite DNA, most of which is missing from the July 2003 (Build 34) genome assembly. Alpha satellite is a repeat family based on ~171 bp monomers that can be arranged either in a highly homogeneous higher-order organization or in a more heterogeneous monomeric form that lacks this higher-order periodicity. We have analyzed the ~7 megabases of alpha satellite that have been assembled thus far, and have found both higher-order and monomeric types of alpha satellite. The majority of alpha satellite in the assembly lies within 1 Mb of the centromere gaps; however, there are also small blocks of alpha satellite several megabases away from the centromere regions. The most centromere proximal regions of the genome assembly are enriched for other types of satellites as well as segmental duplications. In addition to characterizing the organization of alpha satellite in the genome assembly, we have also functionally annotated alpha satellite on several chromosomes. Using extended chromosome methods, we have found that antibodies to centromeric proteins only colocalize with higher-order and not monomeric alpha satellite. Thus, higher-order and monomeric alpha satellites differ in genomic organization as well as function.


Introduction

The centromere of most complex eukaryotic chromosomes is a specialized locus comprised of repetitive DNA that is responsible for chromosome segregation at mitosis and meiosis (Cleveland et al. 2003; Sullivan et al. 2001). Normal human centromeres are made up of megabases of alpha satellite DNA, a repeat family based on ~171 bp monomers (Willard and Waye 1987b). These monomers may be arranged either in a highly homogeneous, multimeric organization or in a more heterogeneous monomeric form that lacks this higher-order periodicity (Alexandrov et al. 2001; Warburton and Willard 1996; Willard 1991). Despite their obvious functional significance, centromeric regions and their constituent alpha satellite sequences were largely omitted by the Human Genome Project because of their repetitive nature and the expected deficiency of genes (Collins et al. 1998); the reported assemblies (Lander et al. 2001; Venter et al. 2001) of each chromosome arm thus end an uncertain distance from the functional centromere (Schueler et al. 2001). While such regions are often considered to be difficult to sequence, in fact it is the assembly, not the sequencing itself, that presents a challenge due to the high degree of sequence homogeneity among many hundreds or thousands of copies of a given repeated sequence.

Alpha satellite DNA has been identified at every human centromere (Alexandrov et al. 2001; Warburton and Willard 1996); however, among reported chromosome assemblies, the amount and type of alpha satellite varies. There are two major types of alpha satellite, higher-order and monomeric (Warburton and Willard 1996; Willard 1991). Higher-order alpha satellite is made up of ~171 bp monomers organized in arrays of multimeric repeat units that are highly homogeneous; in contrast, monomeric alpha satellite lacks any higher-order periodicity, and its monomers are on average only ~70% identical to each other (Wevrick et al. 1992). In addition to their different sequence organization, monomeric and higher-order alpha satellites also differ in their functionality. Higher-order alpha satellite has been demonstrated to be associated with centromere function on the basis of genomic (Schueler et al. 2001; Spence et al. 2002), biochemical (Ando et al. 2002; Vafa and Sullivan 1997) and artificial chromosome assays (Harrington et al. 1997; Ikeno et al. 1998; Schueler et al. 2001). In contrast, there is no evidence for the direct involvement of monomeric alpha satellite in centromere function (Ikeno et al. 1998).

Higher-order alpha satellite is the predominant type in the genome, present in megabase quantities at each centromere (Warburton and Willard 1996; Wevrick and Willard 1989; Willard and Waye 1987b). Where it has been studied, monomeric alpha satellite lies at the edges of higher-order arrays and is less abundant (Horvath et al. 2000; Schueler et al. 2001; Wevrick et al. 1992). This expectation notwithstanding, the vast majority of alpha satellite in the current assembly is of the monomeric type (see below), reflecting the currently incomplete nature of centromeric contigs.


Materials and Methods

Alpha satellite in the genome

The July 2003 genome assembly (Build 34) was extracted from the UCSC browser (Kent et al. 2002) (http://genome.ucsc.edu). For simplicity and because the sizes of most of the gaps in the genome have not been determined experimentally, we removed all non-centromere clone gaps within the reported chromosome arm contigs in the assembly. The resulting un-gapped assembly was divided into 1 Mb blocks starting from the centromere gaps. The amount of alpha satellite was calculated using RepeatMasker (http://repeatmasker.genome.washington.edu) and the alpha satellite within and beyond the first 1 Mb block was determined (Table 2-1, Figure 2-1). Segmental duplications in the July 2003 assembly were provided by Evan Eichler (Univ. of Washington). Alpha satellite and other satellites were extracted using RepeatMasker.

Alpha satellite was characterized as monomeric or higher-order using the dot-matrix program, DOTTER (Sonnhammer and Durbin 1995). Groups of monomers that appeared to have a higher-order structure by DOTTER (stringency of greater than or equal to 95% identical over 100 bp windows) were aligned using CLUSTALW (Thompson et al. 1994) and percent identity among higher-order repeats was determined (see below). As a complementary analysis, we made a database of 41 higher-order repeats reported in the literature and performed BLAST alignments vs. all alpha satellite in the July 2003 assembly (http://www.ncbi.nlm.nih.gov/BLAST/). Of these 41 known higher-order repeat families, only six (D2Z1, D7Z2, D8Z2, D17Z1-B, DXZ1, and DYZ3) were found in the assembly with alignments of greater than or equal to 97% identity. Thus, the current genome assembly is lacking most of the higher-order repeats previously reported in the literature.
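The division of the un-gapped assembly into 1 Mb blocks counted outward from the centromere gap can be sketched as follows. This is a minimal illustration only and not the procedure used in this analysis, which relied on RepeatMasker output and the UCSC browser; the function name `alpha_per_block`, the half-open coordinate convention and the toy coordinates are assumptions made for the example.

```python
from collections import defaultdict

def alpha_per_block(intervals, gap_start, gap_end, block=1_000_000):
    """Tally alpha satellite bp per block, counting outward from the centromere gap.

    intervals: (start, end) half-open coordinates of alpha satellite on an
               un-gapped chromosome assembly (hypothetical input format).
    gap_start, gap_end: half-open coordinates of the centromere gap.
    Returns {(arm, block_number): bp}; block 1 is adjacent to the gap.
    """
    totals = defaultdict(int)
    for start, end in intervals:
        if end <= gap_start:            # p arm: distance measured leftward from the gap
            arm = "p"
            near, far = gap_start - end, gap_start - start
        elif start >= gap_end:          # q arm: distance measured rightward from the gap
            arm = "q"
            near, far = start - gap_end, end - gap_end
        else:
            raise ValueError("interval overlaps the centromere gap")
        pos = near
        while pos < far:                # split intervals that cross a block boundary
            idx = pos // block
            nxt = min(far, (idx + 1) * block)
            totals[(arm, idx + 1)] += nxt - pos
            pos = nxt
    return dict(totals)

# Toy example: centromere gap at 5-8 Mb, three alpha satellite intervals.
toy = [(4_900_000, 5_000_000), (3_900_000, 4_100_000), (8_500_000, 8_600_000)]
print(alpha_per_block(toy, 5_000_000, 8_000_000))
```

Splitting intervals at block boundaries keeps the per-block totals exact when a stretch of alpha satellite straddles two blocks.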

Novel higher-order alpha satellite

Using DOTTER, we found four regions of higher-order alpha satellite not previously described in the literature in the assemblies of chromosomes 4, 10, 11, and 19. To determine percent identity among higher-order repeat units on a given chromosome arm assembly, we performed in silico restriction digests and aligned tandem higher-order repeat units using CLUSTALW. We extracted seven higher-order repeat units from the proximal q side of the centromere gap on the chromosome 4 assembly, totaling over 15 kb in sequence. Based on CLUSTALW alignments among all possible pairwise comparisons of the seven units, percent identity ranged from 98.5-99.8%, with a mean of 99.1 +/- 0.3%. Two higher-order repeats from 10q were extracted and aligned, and they were 99.3% identical. We also found higher-order alpha satellite on the proximal p side of the centromere gap of chromosome 11. Six higher-order repeats were aligned via CLUSTALW, and their percent identity ranged from 97.4-99.8%, with a mean of 98.3 +/- 0.5%. Lastly, six higher-order repeats from chromosome 19p were 98.7-100% identical, with a mean of 99.3 +/- 0.4%.
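The range, mean and spread of pairwise percent identity reported above can be illustrated with a short sketch. It assumes the repeat units have already been aligned (for example by CLUSTALW) into equal-length gapped strings, and it assumes that identity is scored over columns in which neither sequence has a gap and that the "+/-" value is a sample standard deviation; neither convention is stated in the text. The function names and toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

def percent_identity(a, b):
    """Percent identity between two aligned sequences, ignoring gapped columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":        # skip columns with a gap in either sequence
            continue
        compared += 1
        matches += (x == y)
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

def pairwise_summary(aligned):
    """Return (min, max, mean, sd) over all unordered pairs of aligned sequences."""
    scores = [percent_identity(a, b) for a, b in combinations(aligned, 2)]
    sd = stdev(scores) if len(scores) > 1 else 0.0
    return min(scores), max(scores), mean(scores), sd

# Toy alignment of three short sequences (not real alpha satellite).
toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGTAC-TAC"]
print(pairwise_summary(toy))
```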


Alpha satellite percent identity

Monomeric alpha satellite

To calculate the percent identity among monomeric alpha satellite monomers, we examined three regions of monomeric alpha satellite from the current assembly. Between 30 and 40 kb of sequence from regions including monomeric alpha satellite on chromosomes 3p, 15q and 17p were extracted from the July 2003 assembly using the UCSC browser (chr3:90232867-90263045, chr15:18260006-18300090, chr17:21904223-21937497). We used CLUSTALW to perform all pairwise alignments among monomers from a particular region. Upon alignment of all 461 monomers from the three regions (106030 alignments), pairwise percent identity ranged from 48.8-100%, with a mean of 71.6 +/- 8.3%.

Higher-order alpha satellite

Percent identity among higher-order repeat units in the July 2003 assembly was also determined. The assemblies of chromosomes 4, 7, 8, 10, 11, 17, 19 and X contain typical higher-order alpha satellite as determined by DOTTER and subsequent CLUSTALW alignments. We analyzed 73 higher-order repeats from these chromosome assemblies, totaling over 200 kb of sequence. We used CLUSTALW to perform all pairwise alignments among higher-order repeats from each chromosome. Within chromosome arm contigs, higher-order repeat unit identity ranges from 97.5% +/- 0.5% (8q; n=9 higher-order repeats) to 99.3% +/- 0.4% (19p; n=6 higher-order repeats), with an overall average identity of 98.4 +/- 0.5%. More divergent higher-order alpha satellite is found on the assemblies of chromosomes 2, 4, 6, 7, 11, X and Y. We analyzed over 10 kb of divergent higher-order alpha satellite (from chromosomes 2, 6 and Y), and found higher-order repeat units to be 82.3-100% identical, with an average of 93.7% +/- 2.6% identical.
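The number of alignments quoted for the monomeric comparison follows from counting unordered pairs, n(n-1)/2. A brief check in Python (the helper name `n_pairwise` is introduced here only for illustration):

```python
from math import comb

def n_pairwise(n_sequences):
    """Number of unordered pairs among n sequences, i.e. n(n-1)/2."""
    return comb(n_sequences, 2)

print(n_pairwise(461))  # 106030, the alignment count given for the 461 monomers
print(n_pairwise(7))    # 21 comparisons among seven higher-order repeat units
```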

Alpha satellite and CENP-E colocalization

To achieve greater resolution than conventional FISH, we generated extended chromosomes by treating cells with ethidium bromide and mechanically stretching chromosomes with harsh cytospinning conditions. Extended chromosomes were prepared as described (Haaf and Ward 1994) and stained with antibodies to CENP-A or CENP-E as described (Harrington et al. 1997). The CENP-A antibody was provided by Manuel Valdivia (Valdivia et al. 1998) and the CENP-E antibody has been described previously (Harrington et al. 1997).

BACs containing alpha satellite from the most proximal edges of the p and q arm contigs of chromosome 3 (RPCI-11 557B13, p; RPCI-11 124L3, q), chromosome 7 (RPCI-11 548K12, p; RPCI-11 435D24, q) and chromosome 12 (RPCI-11 191K23, p; RPCI-11 125N22, q) were hybridized to extended chromosomes stained with antibodies to CENP-E using described FISH conditions (Harrington et al. 1997). BAC RPCI-11 65I6, containing monomeric alpha satellite from the Xq contig, was hybridized to extended chromosomes stained with antibodies to CENP-E. BACs containing monomeric alpha satellite (RPCI-11 305L6, p; RPCI-11 846F4, p; RPCI-11 362P24, q) as well as D17Z1-B higher-order alpha satellite (RPCI-11 285M22, p) from chromosome 17 were hybridized to extended chromosomes stained with antibodies to CENP-A.

Plasmids containing higher-order repeat units from chromosomes 8, 17 and the X chromosome were hybridized to extended chromosomes stained with antibodies to CENP-E. The plasmid p17H8 contains D17Z1 higher-order alpha satellite (Waye and Willard 1986), and the plasmid pBamX7 contains DXZ1 higher-order alpha satellite from the X chromosome (Willard et al. 1983). The higher-order repeat found on chromosome 8 (Ge et al. 1992) was subcloned from BAC RPCI-11 451D21 and confirmed by end sequencing. Between 35 and 40 metaphase spreads were scored for colocalization between each alpha satellite FISH probe and CENP-E.

CENP-A ChIP data analysis

Sequences from DNA immunoprecipitated with antibodies to tagged CENP-A (Vafa and Sullivan 1997) were kindly provided by Kevin Sullivan (Scripps Research Institute, La Jolla, CA) and were compared to alpha satellite in the reported assembly. 69 sequences from the CENP-A immunoprecipitation were aligned versus the July 2003 (Build 34) assembly, as well as a database of alpha satellite in the literature. The database contains 41 known higher-order repeats comprising over 59 kb of alpha satellite sequences. 35 CENP-A associated sequences aligned to alpha satellite as the best BLAST hit (http://www.ncbi.nlm.nih.gov/BLAST/). 15/35 had high identity alignments (greater than or equal to 95% identical and >100 bp). All 15 of these sequences were higher-order alpha satellite, from chromosomes 1, 4, 8, 10, 13-21, 15, 17, 20 and X. None of the CENP-A associated sequences were determined to be monomeric alpha satellite, confirming the hypothesis that higher-order, not monomeric alpha satellite is responsible for centromere function.
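The high-identity criterion applied to the BLAST alignments (greater than or equal to 95% identical and >100 bp) amounts to a simple filter, sketched below. The `Hit` record, its field names and the toy hits are hypothetical and stand in for whatever tabular form the BLAST results took; only the two thresholds come from the text.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One best BLAST hit for a query sequence (hypothetical record layout)."""
    query: str
    subject: str
    pct_identity: float   # percent identity over the aligned region
    length: int           # alignment length in bp

def high_identity(hits, min_identity=95.0, min_length=100):
    """Keep hits that are >= min_identity percent identical and longer than min_length bp."""
    return [h for h in hits if h.pct_identity >= min_identity and h.length > min_length]

# Toy data: only the first hit passes both thresholds.
toy_hits = [
    Hit("seq01", "DXZ1", 98.2, 240),
    Hit("seq02", "D17Z1", 96.0, 100),   # identity passes, but length is not > 100 bp
    Hit("seq03", "D8Z2", 91.5, 310),    # long enough, but identity is below 95%
]
print([h.query for h in high_identity(toy_hits)])
```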

Non-centromeric alpha satellite

There are 133 blocks of alpha satellite greater than 5 Mb from the centromere gaps as identified by RepeatMasker. Although large blocks of alpha satellite could arise outside of the centromeric regions by inversions or other chromosome rearrangement mechanisms (Baldini et al. 1993; Yunis and Prakash 1982), it was unanticipated to find many small stretches of ectopic alpha satellite DNA. There are 60 blocks of alpha satellite < 1 kb in size, all > 5 Mb away from the centromere gaps reported in the July 2003 assembly. 13/60 blocks of such alpha satellite lie in an intron of a gene in the Reference Sequence collection (http://www.ncbi.nih.gov/RefSeq). 39/60 are within 10 bp of a transposable element; for this analysis, we included a 10 bp buffer to allow for discrepancies in RepeatMasker detection of alpha satellite and/or other repeats (31/39 are immediately adjacent to a transposable element). Of the 39 blocks of alpha satellite bordering transposable elements, 19 abutted a transposable element on both sides, totalling 58 alpha satellite edges next to a transposable element. Of these, 29/58 abutted an Alu element, 17/58 abutted a LINE element, 10/58 abutted an LTR, and 2/58 abutted a DNA element.

For validation purposes, we chose four small regions of non-centromeric alpha satellite and designed PCR primers flanking each region. Primers 578F (5'-CCAAAGTAGTCCAATCCATAG-3') and 578R (5'-AGGAACACATGCATATTCAGC-3') amplify a 113 bp region of alpha satellite on chromosome 5q34 that lies between a LINE and an Alu. Primers 738F (5'-ATCTGTACGTTCTGCCCATG-3') and 738R (5'-AGGTACCAATGGAGTGAGCC-3') amplify a 166 bp and a 99 bp region of alpha satellite flanked by LINEs, also on chromosome 5q34. Primers 389F (5'-AGTGAAGAGACATGTCCTTG-3') and 389R (5'-ACCTGCATGTTCTTCACACC-3') amplify a 38 bp region of alpha satellite next to a LINE on chromosome 2q37.3. PCR conditions were the same for each primer set (5 minute initial denaturation at 94°C followed by 35 cycles of: 94°C for 30 s, 55°C for 30 s and 72°C for 30 s). Each of these ectopic alpha satellite regions was validated by PCR in 20 unrelated individuals, and PCR products from two individuals were sequenced for each region. All 20 individuals were positive for each PCR reaction and the sequenced products agreed with the sequence in the July 2003 assembly in each case (data not shown).
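The adjacency test with a 10 bp buffer, and the counting of alpha satellite edges by the class of the neighboring transposable element, can be sketched as follows. This is an illustrative reconstruction rather than the script used here; the function names, the half-open coordinates and the toy annotations are assumptions.

```python
from collections import Counter

def flanking_classes(block, elements, buffer=10):
    """Return (left_class, right_class) of transposable elements bordering a block.

    block: (start, end) half-open coordinates of an alpha satellite block.
    elements: iterable of (start, end, repeat_class) for transposable elements.
    An element borders an edge if its nearer end lies within `buffer` bp of that
    edge, allowing for a small gap or a small overlap in repeat annotation.
    """
    start, end = block
    left = right = None
    for te_start, te_end, te_class in elements:
        if te_start < start and abs(start - te_end) <= buffer:
            left = te_class
        if te_end > end and abs(te_start - end) <= buffer:
            right = te_class
    return left, right

def edge_counts(blocks, elements, buffer=10):
    """Count block edges that border a transposable element, by repeat class."""
    counts = Counter()
    for block in blocks:
        for te_class in flanking_classes(block, elements, buffer):
            if te_class is not None:
                counts[te_class] += 1
    return counts

# Toy annotations: the first block lies between an Alu and a LINE,
# the second has no neighbor, the third follows an LTR by 10 bp.
elements = [(100, 400, "Alu"), (520, 900, "LINE"), (2000, 2300, "LTR")]
blocks = [(405, 518), (1500, 1600), (2310, 2400)]
print(edge_counts(blocks, elements))

# Edge arithmetic from the text: 39 bordering blocks, 19 of them flanked on both sides.
assert (39 - 19) * 1 + 19 * 2 == 58 == 29 + 17 + 10 + 2
```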


Results

Alpha satellite in the genome assembly

Despite the difficulty of assembling alpha satellite and the lack of specific attention to the centromere regions for most chromosomes, a number of chromosome assemblies do include alpha satellite in their contigs. The July 2003 (Build 34) assembly (http://genome.ucsc.edu/) contains 6.6 Mb of alpha satellite (an estimated 10-fold underrepresentation (Eichler et al. 2004) (appendix)), of which 5.7 Mb lie within the most proximal megabase of each reported arm contig, adjacent to the centromere gaps. As expected, there is a sharp drop in alpha satellite content outside of the first megabase (Figure 2-1, Table 2-1). To annotate the major alpha satellite regions of the reported genome assembly, we thus focused on the most proximal megabase of each chromosome arm. Validation of the sequence assembly of centromeric regions remains an important goal for future work; nonetheless, general features of the reported contigs have been confirmed in several instances by long-range pulsed field gel mapping (Guy et al. 2003; Schueler et al. 2001).

Alpha satellite content adjacent to the centromere gap varies widely among chromosomes (Fig. 2-2). Of the 43 chromosome arm assemblies examined (the five acrocentric chromosomes contain heterochromatic short arms and are not represented in the current genome assembly), nine have not reached any alpha satellite at all, suggesting that these contigs end a substantial distance away from the


[Figure 2-1 plot. Y axis: amount of alpha satellite per block (kb), 0 to 800. X axis: blocks 1 to 10 and > 10.]

Figure 2-1. Alpha satellite location in the July 2003 (Build 34) human genome assembly. All chromosomes were divided into 1 Mb blocks starting from the centromere gap, excluding any other gaps in the chromosome assembly. Blocks are labeled 1 - 10 and > 10 corresponding to the distance from the centromere gap. Amount of alpha satellite is expressed per block on the Y axis (in kb), where each data point represents a megabase block on a different chromosome. Only blocks containing alpha satellite are plotted (no zero data points are shown).


Table 2-1. Alpha satellite in the July 2003 (Build 34) assembly of the human genome

chromosome   assembled   proximal 1 Mb (p / q) outside 1 Mb   unassembled     total
1                14039   -                            14039        119036    133075
2                87192   53035                        34157             -     87192
3               192123   147508 / 8642                35973             -    192123
4                66016   7272 / 26088                 32656             -     66016
5               512981   401339 / 92797               18845             -    512981
6               159607   40137 / 118912                 558             -    159607
7               835724   118511 / 707204              10009             -    835724
8               650524   358752 / 291633                139        179653    830177
9               245578   150285                       95293        235452    481030
10              131485   130815                         670          4611    136096
11              982406   606419 / 166190             209797             -    982406
12              629125   223565 / 405040                520             -    629125
13                2064   2064                             -             -      2064
14                 143   143                              -             -       143
15               34846   34319                          527             -     34846
16              479031   182749                      296282             -    479031
17              129172   96712 / 32460                    -         18984    148156
18               35610   26169 / 9441                     -             -     35610
19              561238   160966 / 365924              34348         71823    633061
20               83305   69675 / 13554                   76             -     83305
21               31322   31322                            -             -     31322
22                6475   1006                          5469             -      6475
X               607663   296468 / 310795                400             -    607663
Y               171600   101582                       70018             -    171600
total (bp)     6649269   -                           859776             -   7278828

Amount of alpha satellite in base pairs is listed for each chromosome. The amount of alpha satellite assembled in the proximal Mb on the p and q sides of the centromere gap (excluding clone gaps in the chromosome assembly) and also outside the first Mb is shown. Alpha satellite that has been assigned to a chromosome but is not part of a reported contig is listed as ‘unassembled’.
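The columns of Table 2-1 appear to be related as follows: the assembled amount is the sum of the proximal p, proximal q and outside-1-Mb amounts, and the total is the assembled plus the unassembled amount. The short check below applies these relations to three rows copied from the table; treating empty cells as zero and the helper name `consistent` are assumptions made for the illustration.

```python
# Values in bp copied from Table 2-1:
# (assembled, proximal p 1 Mb, proximal q 1 Mb, outside 1 Mb, unassembled, total)
# Empty table cells are treated as 0 (an assumption).
rows = {
    "3":  (192123, 147508,   8642, 35973,      0, 192123),
    "8":  (650524, 358752, 291633,   139, 179653, 830177),
    "19": (561238, 160966, 365924, 34348,  71823, 633061),
}

def consistent(row):
    """Check assembled = p + q + outside and total = assembled + unassembled."""
    assembled, p, q, outside, unassembled, total = row
    return assembled == p + q + outside and total == assembled + unassembled

for chrom, row in rows.items():
    print(chrom, consistent(row))
```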

centromere. Only six chromosomes have >100 kb of alpha satellite assembled on both p and q arm contigs; the longest reported assembly on any chromosome is only 836 kb (Table 2-1), substantially less than the amount known to be located at each centromere on the basis of earlier molecular, cytogenetic and genomic studies (Alexandrov et al. 2001; Warburton and Willard 1996; Wevrick and Willard 1989). It is likely that this variation in coverage reflects the assembly progress on particular chromosomes rather than interchromosome differences in alpha satellite organization.

To characterize the types of alpha satellite in the current assembly, we used a combination of BLAST and DOTTER alignment tools (see Materials and Methods and Figure 2-3). By this analysis, >92% of alpha satellite in the current assembly is of the monomeric type, and only eleven chromosomes (chromosomes 2, 4, 6, 7, 8, 10, 11, 17, 19 and the X and Y) have reached higher-order alpha satellite (Fig. 2-2). Notably, even within this limited dataset, four of these assemblies contain previously undescribed families of higher-order alpha satellite (see Materials and Methods), suggesting that the complete set of centromeric repeats in the human genome has yet to be revealed.

In our analysis of alpha satellite in the current genome assembly, we found two categories of higher-order alpha satellite that differ in the degree and extent of sequence homogeneity (Figure 2-3). Within a region of higher-order alpha satellite on any one chromosome arm, the most homogeneous higher-order repeat units are 97-100% identical. In Build 34 of the current genome


Figure 2-2. Genomic landscape of 1 Mb regions outside of the centromere gaps. The 1 Mb regions adjacent to the centromere gaps are depicted for the p and q arms of chromosomes 1-22, X and Y. Monomeric alpha satellite (blue), typical higher-order alpha satellite (red), and more divergent higher-order alpha satellite (pink, with asterisks), as well as other satellites (grey) are shown above the black line. Arrows depict orientation of alpha satellite monomers. Refseq genes (purple) and segmental duplications are illustrated below the black line. Segmental duplications 98-99% or > 99% identical to another region of the genome are shown as yellow and green boxes, respectively.

assembly, 9 chromosome arms have reached higher-order alpha satellite of this type; this totals ~200 kb of sequence. Our analysis of 73 higher-order repeat units shows that within chromosome arm contigs, higher-order repeat unit identity ranges from 97.5% +/- 0.5% (8q; n=9 higher-order repeats) to 99.3% +/- 0.4% (19p; n=6 higher-order repeats), with an overall average identity of 98.4 +/- 0.5%. This degree of homogeneity reflects the concerted evolution of higher-order repeat units and is consistent with previous estimates of intra-array sequence homogeneity in the human genome (Durfy and Willard 1989; Schindelhauer and Schwarz 2002; Schueler et al. 2001). Between different chromosome arrays, however, higher-order repeats are quite divergent, as well-documented previously (Warburton and Willard 1996; Willard and Waye 1987b).

Other higher-order repeat units in the assembly lack the regular organization and consistent higher-order repeat length characteristic of highly homogeneous tandem arrays (Figure 2-3). Their less highly homogenized repeats, while clearly multimeric, are more divergent in both sequence and structure, with a pairwise mean identity of 93.6% +/- 2.6%. Seven chromosome assemblies contain this kind of higher-order alpha satellite, comprising ~100 kb of the current genome assembly. The nature of this second category of higher-order alpha satellite in the genome is itself likely heterogeneous. In some cases, these repeats correspond to diverged copies at the edges of an otherwise homogeneous array (Schueler et al. 2001); in other cases, they may represent vestiges of ancient arrays that are no longer present in the genome (or at least


Figure 2-3. Types of alpha satellite in the human genome. Four types of alpha satellite DNA are apparent in the current assembly of the human genome. DOTTER plots of 5 kb of alpha satellite compared to itself are shown for each type of alpha satellite. (a) Highly homogeneous higher-order alpha satellite is made up of multimeric repeat units that are 97-100% identical to one another. Higher-order repeat units are organized in tandem arrays that typically have a uniform repeat unit size and can span several Mb. (b) Other higher-order alpha satellite shows clear evidence of multimeric structure; however, these multimeric units are less regular and more divergent in sequence and are 93.7% +/- 2.6% identical on average. (c) Monomeric alpha satellite lacks any evidence of higher-order periodicity, and its monomers have an average pairwise percent identity of 71.6 +/- 8.3%. (d) Short zones of multimeric, highly homogeneous alpha satellite (5 Mb away from the centromere gaps. While the largest of these could represent ancient inversions or other chromosomal rearrangements involving centromere regions (Baldini et al. 1993; Yunis and Prakash 1982), there are 60 blocks containing 98% identical to another region of the genome (Figure 2-2). An emerging model is that segments rich in segmental duplications define some of the pericentromeric regions of the genome distal to alpha satellite, while the centromeric region itself is made up of alpha satellite and will be expected to be largely devoid of such duplications. Other repeats besides alpha satellite are also enriched at the centromere. Using RepeatMasker (http://repeatmasker.genome.washington.edu), we examined the repeat content of the combined 43 most proximal megabases adjacent to the centromere gaps, compared to the genome average. The genome as a whole and the most centromere proximal regions were 49% and 64% repetitive, respectively (Table 2-2). This enrichment in repeat content near

Table 2-2. Repeat content of the July 2003 (Build 34) human genome assembly

Repeat                Genome Avg   Proximal 1 Mb
LINE                      21.26%          22.20%
SINE                      13.72%           9.19%
LTR                        8.72%           9.84%
DNA Elements               3.03%           1.81%
Small RNAs                 0.04%           0.05%
Simple Repeats             0.92%           0.92%
Low Complexity             0.58%           0.53%
Unknown                    0.01%           0.01%
Other                      0.14%           0.22%
Satellites                 0.43%          18.76%
Alpha Satellite            0.26%          13.99%
Gamma Satellite            0.01%           0.59%
Beta Satellite             0.04%           0.53%
Human Satellite 4          0.01%           0.52%
CER DNA                    0.01%           0.50%
Human Satellite 2          0.01%           0.49%
(GATTG)n