Sequence Complexity of Histone H1 Subtypes

Imma Ponte, Roger Vila, and Pedro Suau
Departamento de Bioquímica y Biología Molecular, Facultad de Ciencias, Universidad Autónoma de Barcelona, Barcelona, Spain

H1 subtypes are involved in chromatin higher-order structure and gene regulation. H1 has a characteristic three-domain structure. We studied the length variation of the available H1 subtypes and showed that the length of the N-terminal and C-terminal domains was more variable than that of the central domain. The N-terminal and C-terminal domains were of low sequence complexity both at the nucleotide and at the amino acid level, whereas the globular domain was of high complexity. In most subtypes, low complexity was due only to cryptic simplicity, which reflects the clustering of a number of short and often imperfect sequence motifs. However, a subset of subtypes from eubacteria, plants, and invertebrates contained tandem repeats of short amino acid motifs (four to 12 residues), which could amount to a large proportion of the terminal domains. In addition, some other subtypes, such as those of Drosophila and mammalian H1t, were only marginally simple. The coexistence of these three kinds of subtypes suggests that the terminal domains could have originated in the amplification of short sequence motifs, which would then have evolved by point mutation and further slippage.

Introduction

Histone H1 binds to the linker DNA in the chromatin fiber. It is currently accepted that H1 could have a regulatory role in transcription through the modulation of chromatin higher-order structure. In vitro experiments with reconstituted chromatin have shown that H1 can repress promoters containing the RNA start site in the linker DNA and that some sequence-specific transcription factors can counteract the H1-mediated repression. However, experiments in vivo indicate that H1 does not function as a global transcriptional repressor, but instead participates in complexes that either activate or repress specific genes (Zlatanova and van Holde 1992; Bouvet, Dimitrov, and Wolffe 1994; Khochbin and Wolffe 1994; Shen and Gorovsky 1996; Wolffe, Khochbin, and Dimitrov 1997). Some of these effects could be attributed only to the globular domain (Vermaak et al. 1998), whereas other effects were localized to the tail-like domains (Lee and Archer 1998; Dou et al. 1999).

H1 has multiple isoforms. The sequences of some 100 H1 subtypes from plants, invertebrates, and vertebrates are available (Sullivan et al. 2002). Often more than one H1 subtype is expressed in a given species. The H1 complement has been best characterized in mammals, where six somatic subtypes, a germ line–specific subtype, and an oocyte-specific subtype have been identified (Panyim and Chalkley 1969; Bucci, Brock, and Meistrich 1982; Lennox 1984; Tanaka et al. 2001). The subtypes differ in extent of phosphorylation and in turnover rate (Lennox, Oshima, and Cohen 1982; Langan 1982). In vitro evidence supports the idea that the subtypes differ in their ability to condense chromatin (Liao and Cole 1981; Kadake and Rao 1995; Talasz et al. 1998). In vertebrates, the subtypes differ widely in evolutionary stability, suggesting that each subtype may have acquired a unique function (Lennox 1984; Ponte et al. 1998).

Key words: Histone H1, simplicity, slippage, tandem repeats, length mutations.
Mol. Biol. Evol. 20(3):371–380. 2003
DOI: 10.1093/molbev/msg041
© 2003 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

H1 histones from metazoa have a characteristic three-domain structure: the central domain is globular and contains a winged helix motif, while the N-terminal and C-terminal domains are highly basic and have little or no structure in solution. However, both terminal domains acquire a substantial proportion of secondary structure upon interaction with the DNA (Vila et al. 2000, 2001a, 2001b, 2002). The terminal domains have different evolutionary properties than the globular domain. Globular domains are much more evolutionarily stable and have basically evolved by nucleotide substitution. Terminal domains are, in general, more variable and have evolved by insertion/deletion, in addition to nucleotide substitution.

The composition of the H1 terminal domains is dominated by the amino acids Lys, Ala, and Pro. These residues are often arranged in simple repeats, such as KPK, AKP, SPKK, PKKA, and AAKK (Suzuki 1989; Churchill and Travers 1991). The low complexity of the H1 terminal domains suggests that the DNA coding for the terminal domains could also be simple. Simple DNA is formed by the clustering of a number of interspersed short and often imperfect repeats (Tautz, Trick, and Dover 1986; Hancock 1996). Simple sequences are easily misaligned during DNA replication, recombination, and repair and are prone to short insertions and deletions (Levinson and Gutman 1987). Slippage has been shown to act on both coding (Eickbush and Burke 1986; Djian and Green 1989; Treiter, Pfeifle, and Tautz 1989; Paulsson et al. 1990; Costa et al. 1991) and noncoding sequences (Hancock and Dover 1988; Hoelzel, Hancock, and Dover 1991; Ponte et al. 1996).

We have studied the length variation and the complexity of the amino acid and nucleotide coding sequences of available H1 subtypes, including those of some protists and eubacteria, which lack a winged helix motif, and found evidence for the involvement of slippage in their evolution.
The analysis of a large sample of subtypes has shown that although the majority of H1 subtypes have N-terminal and C-terminal domains of low complexity, the N-terminal and C-terminal domains of a few subtypes are only marginally simple and approach the complexity of common globular proteins. A third class of subtypes contains tandem repeats. The coexistence of


FIG. 1.—Frequency distributions of the lengths of the structural domains of the H1 subtypes. The limits of the domains were as in Sullivan et al. (2002). A total of 93 subtypes were analyzed. The number of domains of a given length is indicated in the x axis and the length of the domain in amino acid residues in the y axis. x̄ is the mean value, σ is the standard deviation, and cv is the coefficient of variation. NTD: N-terminal domain; GD: globular domain; CTD: C-terminal domain.

these three kinds of subtypes suggests that ancestral terminal domains could have originated in the amplification of short sequence motifs, which would then have evolved by point mutation and further slippage.

Materials and Methods

Analysis of Sequence Complexity of Nucleotide and Amino Acid Sequences

The composition bias present in the amino acid and the DNA coding sequences of the H1 subtypes was analyzed with Simple 3.0 (Albà, Laskowski, and Hancock 2002). The program was obtained at http://www.biochem.ucl.ac.uk/bsm/Simple. Simple 3.0 is an evolution of Simple (Tautz, Trick, and Dover 1986) and Simple34 (Hancock and Armstrong 1994). Briefly, clustered nucleotide motifs of three and four nucleotides were searched within a window of ±32 nucleotides. The algorithm assigns a simplicity score (SS) to individual nucleotides, which is a measure of the abundance of trinucleotide and tetranucleotide motifs starting to the right of each nucleotide inside the window. A score of 1 is assigned for each trinucleotide repeat and a score of 3 is assigned for each tetranucleotide repeat. Overall simplicity factors (SF) are calculated by summing all scores and dividing the sum by the number of nucleotides. Relative simplicity factors (RSF) are obtained by dividing the overall simplicity factors of the test sequences by the mean of the corresponding SF for 10 random sequences of the same composition and length as the test sequence. Randomization was carried out independently in three reading frames. No significant difference in RSF values was observed when position-independent randomization was carried out (Hancock and Armstrong 1994). The RSF of sequences showing the same amount of motif clustering as the random sequences should be close to 1 and significantly greater for simple sequences. The standard deviation of the 10 random sequences allows the analysis of the statistical significance of the RSF. Two confidence levels are returned by the program: 99% (P < 0.01) and 95% (P < 0.05). In addition, the program identifies simplicity scores associated with individual motifs that are significantly higher (10 times greater) than could be expected by chance. For the analysis of protein sequences, a window of ±10 amino acid residues was used. For the generation of the protein simplicity profiles, a weight of 1 point was accorded to repeats of a single amino acid and a weight of 2 points to repeats of two amino acids.
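The scoring scheme described above can be illustrated with a short sketch. This is not the Simple 3.0 code: it is a simplified approximation written for this text, and the function names, the symmetric counting of motif copies on both sides of each position, and the plain composition-preserving shuffle (rather than the frame-wise randomization used in the paper) are all assumptions made for illustration.

```python
import random


def simplicity_scores(seq, half_window=32, tri_weight=1, tetra_weight=3):
    """Per-position scores: for the tri- and tetranucleotide starting at each
    position, count the other copies lying inside a +/- half_window window
    (assumed weights: 1 per trinucleotide copy, 3 per tetranucleotide copy)."""
    n = len(seq)
    scores = [0] * n
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        for k, weight in ((3, tri_weight), (4, tetra_weight)):
            motif = seq[i:i + k]
            if len(motif) < k:
                continue  # too close to the end of the sequence
            for j in range(lo, hi - k + 1):
                if j != i and seq[j:j + k] == motif:
                    scores[i] += weight
    return scores


def simplicity_factor(seq, **kwargs):
    """SF: sum of all per-position scores divided by the sequence length."""
    return sum(simplicity_scores(seq, **kwargs)) / len(seq)


def relative_simplicity_factor(seq, n_random=10, seed=0, **kwargs):
    """RSF: SF of the test sequence divided by the mean SF of n_random
    shuffled sequences of the same composition and length."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    random_sfs = []
    for _ in range(n_random):
        letters = list(seq)
        rng.shuffle(letters)
        random_sfs.append(simplicity_factor("".join(letters), **kwargs))
    mean_random = sum(random_sfs) / n_random
    sf = simplicity_factor(seq, **kwargs)
    return sf / mean_random if mean_random > 0 else float("inf")
```

Under these assumptions, a perfect repeat such as `"AAG" * 20` should give an RSF well above 1, whereas a sequence whose motifs are no more clustered than in its shuffled versions should give a value near 1. For protein sequences the same idea would apply with `half_window=10` and motif lengths and weights changed accordingly.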
For the detection of motifs of two to four amino acids, a score of 1 was assigned to the repeat to be detected and a score of 0 was assigned to the rest. To test whether short motifs showed significant clustering, 100 random sequences of the same composition and length were generated by random shuffling. The number of random sequences was higher in the analysis of protein sequences than in the analysis of nucleotide sequences to take into consideration the high degree of variation when randomizing relatively short sequences that may have 20 different amino acids (instead of four nucleotides) at each position. Only those motifs that reached a score at least ten times higher than the average for the random sequences were considered. Tandem repeats were searched with the program DNASTAR.

Results

Length Variation of the Structural Domains of the H1 Subtypes

We have examined the length variation of the structural domains of about 100 H1 subtypes from vertebrates, invertebrates, and plants (fig. 1). Only typical subtypes with a three-domain structure were considered. Therefore, subtypes such as that of Tetrahymena thermophila, with a single domain similar to the C-terminus of


typical H1s, and that of Saccharomyces cerevisiae, with two globular domains, were not included. The sequences were obtained from the Histone Sequence Database (Sullivan et al. 2002). The globular domain had an average length of 79 ± 5 amino acids, while the average lengths of the N-terminal and C-terminal domains were 40 ± 13 and 106 ± 17, respectively. The coefficient of variation (cv), cv = σ/x̄, where σ is the standard deviation and x̄ the average value, allows comparison of the length variation on a proportional basis. Values of cv of 0.33, 0.062, and 0.16 were found for the N-terminal, globular, and C-terminal domains, respectively. The N-terminal domain thus appears as the most variable in length, followed by the C-terminal domain and the globular domain.
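As a quick check of the arithmetic, the coefficient of variation can be recomputed from the rounded means and standard deviations quoted above. The sketch below is illustrative only: the function names are hypothetical, and the use of the sample standard deviation for raw length data is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(lengths):
    """cv = standard deviation / mean for a list of raw domain lengths
    (sample standard deviation assumed)."""
    return stdev(lengths) / mean(lengths)


def cv_from_summary(mean_length, sd):
    """cv computed directly from a reported mean and standard deviation."""
    return sd / mean_length


# Rounded (mean, standard deviation) values quoted in the text, in residues.
reported = {"NTD": (40, 13), "GD": (79, 5), "CTD": (106, 17)}
cvs = {name: cv_from_summary(m, s) for name, (m, s) in reported.items()}

# Domains ordered from most to least variable in length.
ranking = sorted(cvs, key=cvs.get, reverse=True)
```

With the rounded inputs the ratios come out at roughly 0.33, 0.06, and 0.16, which matches the reported values to within rounding and reproduces the ordering N-terminal, C-terminal, globular.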

Complexity of the Nucleotide and the Amino Acid Sequences

The clustering of simple motifs present, either in short arrays or interspersed along the sequence, was estimated with Simple 3.0, a modified version of the Simple (Tautz et al. 1986) and Simple34 (Hancock and Armstrong 1994) programs. We calculated the relative simplicity factor (RSF) of the nucleotide sequences of a large sample of H1 subtypes present in the Histone Sequence Database (Sullivan et al. 2002) (table 1). All subtypes gave an RSF higher than 1. The average RSF was close to 1.9 for invertebrate, vertebrate, or plant subtypes. The highest value was that of a Caenorhabditis elegans subtype (code: x53277) with an RSF of 3.05, and the lowest values were those of Drosophila virilis and Lycopersicon pennellii, both with an RSF of 1.20. H1t, the male germ line–specific subtype, gave the lowest values among vertebrates, with an average RSF of 1.52.

H1-like basic proteins are also present in eubacteria. Recent results suggest that the ancient C-terminal domain of H1 originated in eubacteria (Kasinsky et al. 2001). We calculated the RSF of a sample of H1-like proteins from eubacteria (table 1). All bacterial subtypes appeared to be significantly simple. One of the subtypes of Bordetella pertussis was extremely simple, with an RSF of 2.91.

The complexity of the structural domains was independently analyzed in a sample of subtypes from plants, including Chlorophyta and Streptophyta, fungi, invertebrates, and vertebrates (table 2). When sequences were divided into 5′, central, and 3′ regions, corresponding, respectively, to the N-terminal, globular, and C-terminal domains, it was apparent that in most cases the sequences encoding the globular domain were not simple. In contrast, the terminal domains were as a rule significantly simple, in particular the C-terminal domain.
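The division of a coding sequence into 5′, central, and 3′ regions follows directly from the domain boundaries in the protein, since each residue corresponds to three nucleotides. A minimal sketch is given below; the function name and its arguments are hypothetical, and it assumes that the coding sequence starts at the first codon of the N-terminal domain and that the domain lengths are given in residues.

```python
def split_coding_sequence(cds, ntd_length, gd_length):
    """Split a coding sequence into the regions encoding the N-terminal,
    globular, and C-terminal domains.

    ntd_length and gd_length are domain lengths in amino acid residues;
    everything after the globular domain is assigned to the 3' region.
    """
    gd_start = 3 * ntd_length             # first nucleotide of the central region
    gd_end = gd_start + 3 * gd_length     # first nucleotide of the 3' region
    return cds[:gd_start], cds[gd_start:gd_end], cds[gd_end:]
```

Each of the three returned strings could then be scored separately, which is the kind of per-domain comparison summarized in table 2.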
For subtypes of average simplicity, the effect was clear enough to allow fairly precise identification of the limit between the globular and C-terminal domains merely from inspection of the simplicity profiles (fig. 2). However, this was not possible in subtypes of low simplicity, such as those of Drosophila and vertebrate H1t subtypes, whose C-terminal domains did not reach a high enough simplicity. The limit between the N-terminal and globular domains

was often not so neat, although the N-terminal domain generally contained a large peak of simplicity. Several significant three-nucleotide motifs were found associated with sequence simplicity in the terminal domains. Some motifs, such as AAA, AAG, AGA, CCA, and GAA, were present in most sequences examined. However, other motifs were characteristic of particular subtypes or groups of subtypes. The high level of clustered motifs in the terminal domains agrees with the high RSF achieved by the terminal domains. Consistently, no significant motifs were found in the sequences encoding the globular domains. The most abundant motif was AAG, and it was in frame with Lys, but most other significant motifs were out of frame.

The simplicity profiles of the protein sequences were analogous to those of the nucleotide sequences: they reflected the lower sequence complexity of the N-terminal and C-terminal domains compared with the globular domain. Moreover, in general, they permitted definition of the limits of the domains as in the corresponding profiles of the nucleotide sequences (fig. 2). Significant di-, tri-, and tetra-amino acid motifs were found in the terminal domains. As for nucleotide sequences, significant motifs were exclusive to the terminal domains. Most motifs were combinations of the three most abundant amino acids in the sequences, Lys, Ala, and Pro, such as KK, KA, KP, KPK, KKA, AAK, PKKA, AKKP, and KKAK. It is remarkable that each different motif was present only in a subset of the sample sequences. This reflects the variety of simple sequence patterns displayed by the terminal domains.

Tandem Repeats in H1 Subtypes

The above analysis does not distinguish between cryptic simplicity and simplicity arising from tandem repeats. A specific search for tandem repeats showed that repeats of short amino acid motifs longer than duplications were infrequent in H1 subtypes.
However, a small subset of subtypes contained significantly longer tandem repeats, which often represented a large fraction of the domain (fig. 3). A subtype from the fly Chironomus dorsalis contained nine copies of the consensus sequence KPAAKKPAA. The repeat was composed of two related shorter sequences, one containing a Lys residue and the other containing a Lys-Lys doublet. This tandem repeat represented about 79% of the C-terminal domain. Motifs of six amino acids repeated seven times were present in tomato and wheat subtypes (41% and 31% of the C-terminus, respectively). In the sea urchin Strongylocentrotus purpuratus, a motif of seven amino acids was repeated 10 times (59% of the C-terminus). A tandem repeat with consensus (KPKAA)5 was found in mammalian somatic subtypes. In H1b, it incorporated a single Ala/Val substitution in one of the motifs; in H1c, H1d, and H1e, it was shorter and more extensively modified; and it was absent in H1a (Parseghian et al. 1994). Among the one-domain subtypes of protists, a Euplotes crassus subtype had a short (KKSAT)3 tandem repeat. Duplications of relatively long sequences were occasionally found in the C-terminal domain (fig. 4). In


Table 1
Simplicity Analysis of the Entire Nucleotide Sequences of H1 Subtypes

Accession  Subtype and species  SF  RSF  N

Eubacteria
m30145  Pseudomonas aeruginosa  4.65  1.58  1023
al031124  Streptomyces coelicolor  5.13  1.71  657
m37891  Salmonella typhimurium  2.48  1.39  414
d90713  Escherichia coli  3.98  1.76  1266
u32470  Haemophilus influenzae  3.93  1.86  1149
u82555  Bordetella pertussis  3.37  1.69  438
l37438  Bordetella pertussis  6.74  2.91  549
l79945  Coxiella burnetii  2.93  1.37  354
l12962  Chlamydia trachomatis  3.83  1.76  498
ae001669  Chlamydia pneumoniae  4.19  1.61  372
ae001623  Chlamydia pneumoniae  3.21  1.62  519
x57311  Chlamydia trachomatis  4.37  1.45  378
m80324  Chlamydia psittaci  4.05  1.59  354

Viridiplantae
u16726  Chlamydomonas reinhardtii  4.81  2.46  696
l07946  Volvox carteri  5.42  2.47  780
l07947  Volvox carteri  5.00  2.39  720
af107022  Triticum aestivum  5.63  1.78  669
af107023  Triticum aestivum  5.25  1.86  711
af107024  Triticum aestivum  6.47  1.90  825
af107026  Triticum aestivum  6.19  1.94  681
af107027  Triticum aestivum  5.16  2.37  711
d87065  Triticum aestivum  6.19  2.32  861
x59872  Triticum aestivum  5.91  1.87  711
x57077  Zea mays  4.70  1.80  738
aj224933  Lycopersicon esculentum  3.88  2.05  813
u03391  Lycopersicon esculentum  4.04  2.05  858
u01890  Lycopersicon pennellii  2.45  1.20  606
af222804  Euphorbia esula  3.69  2.20  687
aj006767  Cicer arietinum  3.78  2.05  564
l29456  Nicotiana tabacum  3.18  1.60  846
ab029614  Nicotiana tabacum  3.03  1.59  837
l34578  Pisum sativum  4.30  2.50  555
x05636  Pisum sativum  3.73  2.08  795
u73781  Arabidopsis thaliana  3.20  1.48  582
x62456  Arabidopsis thaliana  3.15  1.71  819
x62458  Arabidopsis thaliana  3.20  1.63  819
x62459  Arabidopsis thaliana  3.93  2.05  816
ab012694  Lilium longiflorum  3.06  1.69  693
y12599  Apium graveolens  4.69  2.39  906

Euglenozoa
af131892  Leishmania braziliensis [a]  7.89  3.45  333

Alveolata
af127331  Euplotes crassus [a]  6.00  2.17  456
af127332  Euplotes crassus [a]  3.82  1.70  513
l15293  Euplotes eurystomus [a]  4.55  1.98  405
m14854  Tetrahymena thermophila [a]  4.59  1.99  492

Mycetozoa
l33457  Dictyostelium discoideum  3.82  1.43  540
u50904  Dictyostelium discoideum  3.85  1.41  471

Fungi
u43703  Saccharomyces cerevisiae [b]  2.53  1.32  774
af190622  Ascobolus immersus  3.18  1.88  639
aj011780  Emericella nidulans  3.09  1.43  600

Metazoa/echinodermata
m16033  Strongylocentrotus purpuratus  5.83  2.67  612
m20314  Strongylocentrotus purpuratus  5.41  2.35  630
j03807  Strongylocentrotus purpuratus  4.46  2.13  648
x04488  Lytechinus pictus  5.54  2.52  627
u84113  Parechinus angulosus  3.31  1.65  750
u07825  Psammechinus miliaris  4.75  1.80  726

Metazoa/mollusca
l41834  Ensis minor  3.29  1.51  1059

Metazoa/annelida
u96764  Chaetopterus variopedatus  3.88  1.91  606

Metazoa/nematoda
af005371  Caenorhabditis elegans  3.88  1.95  621
af005372  Caenorhabditis elegans  2.99  1.58  570
af017810  Caenorhabditis elegans  5.98  2.74  573
af017811  Caenorhabditis elegans  3.86  2.00  573
x53277  Caenorhabditis elegans  5.41  3.05  609

Metazoa/arthropoda
u21211  Chironomus dorsalis  6.06  2.51  723
l28724  Chironomus thummi  6.31  2.63  726
l28725  Chironomus thummi  4.24  1.56  678
l28726  Chironomus thummi  4.80  1.74  582
l28732  Chironomus thummi  4.02  1.45  699
x56325  Chironomus thummi  4.98  1.76  729
x72803  Chironomus thummi  5.72  2.21  690
l29107  Chironomus tentans  4.63  1.67  693
l29108  Chironomus tentans  4.84  1.84  693
l29109  Chironomus tentans  5.55  1.99  660
l29105  Chironomus tentans  4.86  1.90  708
l29101  Glyptotendipes barbipes  5.17  2.45  696
l29102  Glyptotendipes barbipes  4.37  1.59  693
l29103  Glyptotendipes barbipes  4.68  1.69  681
l29104  Glyptotendipes barbipes  5.53  2.17  696
l76558  Drosophila virilis  2.63  1.26  750
u67772  Drosophila virilis  3.22  1.20  750
u67936  Drosophila virilis  2.56  1.23  750
x14215  Drosophila melanogaster  2.78  1.46  765
x17072  Drosophila hydei  2.88  1.40  744
m84797  Tigriopus californianus  4.20  2.04  540

Metazoa/chordata
x02624  Oncorhynchus mykiss  5.32  2.20  621
u45877  Pleuronectes americanus  3.89  2.07  795

s69089  H1a Xenopus laevis  4.80  2.02  630
x13855  B4 Xenopus laevis  4.26  1.99  819
m22834  H1°a Xenopus laevis  3.37  1.55  657
m22835  H1°b Xenopus laevis  3.24  1.57  660

j00863  Gallus gallus  4.25  2.08  654
m17018  Gallus gallus  5.73  2.16  657
m17019  Gallus gallus  5.02  2.12  675
m17020  Gallus gallus  5.22  1.97  654
m17021  Gallus gallus  4.98  1.98  669
x01752  Gallus gallus  6.03  2.11  654
j00870  H5 Gallus gallus  3.80  1.61  570
x01065  H5 Cairina moschata  4.01  1.60  582
x06128  Anas platyrhynchos  4.82  1.75  651

m31229  H1e Rattus norvegicus  4.52  2.01  648
j03482  H1c Mus musculus  4.63  1.92  633
l26163  H1a Mus musculus  4.25  1.76  636
l26164  H1e Mus musculus  3.92  1.97  654
z38128  H1d Mus musculus  4.23  1.90  669
z46227  H1b Mus musculus  3.72  1.84  663
ay007195  H1oo Mus musculus  2.89  1.88  915
x57129  H1c Homo sapiens  3.74  1.79  636
x57130  H1a Homo sapiens  3.12  1.62  642
x83509  H1b Homo sapiens  3.98  1.87  675
m60747  H1d Homo sapiens  3.60  1.78  660
m60748  H1e Homo sapiens  4.69  2.01  645
nm006026  H1x Homo sapiens  3.69  1.50  639

x03473  H1° Homo sapiens  4.13  1.74  582
x13171  H1° Mus musculus  4.01  1.83  597
x72624  H1° Rattus norvegicus  4.28  1.85  597

m97755  H1t Homo sapiens  2.32  1.29  618
m97756  H1t Macaca mulatta  2.50  1.45  621
m28409  H1t Rattus norvegicus  2.82  1.59  621
u06232  H1t Mus musculus  2.72  1.59  621

NOTE.—N indicates the length of the sequence in nucleotides. SF is the simplicity factor. RSF is the relative simplicity factor.
[a] The H1 subtypes with a single domain, similar to the typical C-terminal domain.
[b] The Saccharomyces cerevisiae subtype with two globular domains.

Chironomus thummi, a sequence of 33 amino acids was duplicated, with only two amino acid substitutions between the two copies of the sequence. These 66 residues represent 61% of the C-terminal domain. In Oncorhynchus mykiss, a duplication of a sequence of 17 residues was observed, with one amino acid substitution and one insertion/deletion. In a Volvox subtype, there was a duplication of 19 amino acids, with three insertions/deletions in one of the copies. A remarkable case of tandem repeat is given by the

H1-like sperm protein from the bivalve Ensis minor, described by Bandiera et al. (1995). In this case, the repeat was found in the N-terminal domain, which is unusually long. The protein contains a globular domain, analogous to that of H1 proteins, preceded by 17 almost identical tandem repeats of the motif KKRSXSRKRSAS, where X is a basic residue. The C-terminus contains numerous basic clusters, which give rise to a large peak of simplicity. The N-terminal repeat unit contains two half-repeats that show up in the simplicity profiles of the

Table 2
Simplicity Analysis of the Individual Domains of a Sample of Plant, Fungi, Invertebrate, and Vertebrate H1 Subtypes

Columns: Accession, subtype and species; Entire Protein SF, RSF; NTD SF, RSF; GD SF, RSF; CTD SF, RSF

U16726  Chlamydomonas reinhardtii  4.81  2.46  4.68  2.11  2.93  1.31  7.97  2.12
x57077  Zea mays  4.70  1.80  5.06  1.57  2.63  1.01  5.86  1.82
l29456  Nicotiana tabacum  3.19  1.61  2.12  1.05  2.29  1.41  3.90  1.89
x62459  Arabidopsis thaliana  3.93  2.05  2.80  1.32  2.15  1.06  5.67  2.31
u73781  Arabidopsis thaliana  3.23  1.58  4.20  1.90  2.18  1.17  3.90  1.64
x59872  Triticum aestivum  5.91  1.87  3.19  1.21  3.44  1.22  8.85  2.08
af107026  Triticum aestivum  6.19  1.94  6.48  2.01  3.24  1.20  8.42  2.07
u01890  Lycopersicon pennellii  2.45  1.20  2.10  1.22  2.19  1.13  3.11  1.21
ab012694  Lilium longiflorum  3.06  1.69  3.73  1.81  1.73  0.99  3.82  1.80
af190622  Ascobolus immersus  3.18  1.88  3.93  1.86  1.80  0.99  4.39  2.13
aj011780  Emericella nidulans  3.09  1.43  3.12  1.52  1.83  1.06  3.89  1.77
l41834  Ensis minor  3.29  1.51  3.36  1.26  1.68  0.92  5.33  1.90
m20314  Strongylocentrotus purpuratus  5.39  2.32  1.90  0.77  2.20  1.08  7.83  2.70
u96764  Chaetopterus variopedatus  3.88  1.91  2.74  1.16  2.22  1.15  5.81  2.12
u21211  Chironomus dorsalis  6.06  2.51  7.58  2.98  2.31  1.23  8.01  2.35
l28725  Chironomus thummi  6.31  2.63  4.44  1.30  2.42  1.08  5.85  1.81
l29102  Glyptotendipes barbipes  4.37  1.59  4.18  1.48  2.51  1.16  5.82  1.67
U67772  Drosophila virilis  2.47  1.20  1.97  0.99  1.47  0.83  3.16  1.41
AF017810  Caenorhabditis elegans  5.98  2.74  5.70  2.48  3.18  1.60  8.11  3.46
x03473  H1° Homo sapiens  4.13  1.74  3.79  1.60  1.90  1.02  5.90  2.38
x72624  H1° Rattus norvegicus  4.28  1.85  4.79  1.48  1.90  0.86  6.11  2.75
x13171  H1° Mus musculus  4.01  1.83  2.83  1.29  1.90  0.94  5.93  2.72
J00870  H5 Gallus gallus  3.80  1.61  3.48  1.35  2.59  1.15  4.99  1.96
m22834  H1° Xenopus laevis  3.37  1.55  1.64  0.80  1.76  0.87  5.06  1.74
x57130  H1a Homo sapiens  3.12  1.62  3.00  1.76  1.94  1.12  4.03  1.72
l26164  H1a Mus musculus  3.93  0.97  3.62  1.96  2.07  1.04  5.46  2.52
x83509  H1b Homo sapiens  3.98  1.87  3.12  1.71  2.22  1.16  5.46  2.03
z46227  H1b Mus musculus  3.72  1.84  3.44  1.66  2.13  1.04  4.84  2.29
x57129  H1c Homo sapiens  3.74  1.79  3.82  1.94  2.13  1.13  4.91  2.13
j03482  H1c Mus musculus  4.63  1.92  6.42  2.61  2.61  1.12  5.74  2.23
m60747  H1d Homo sapiens  3.60  1.78  2.79  1.46  2.47  1.30  4.62  2.12
z38128  H1d Mus musculus  4.23  1.90  4.25  2.34  2.28  0.93  5.55  2.21
m60748  H1e Homo sapiens  4.69  2.01  3.96  2.04  2.21  1.12  6.62  2.24
l26163  H1e Mus musculus  4.25  1.76  4.82  1.93  2.37  0.99  5.41  1.90
m31229  H1e Rattus norvegicus  4.52  2.01  3.89  1.58  2.49  1.01  6.21  2.32
l28753  H1t Mus musculus  2.71  1.53  2.10  1.19  2.02  1.12  3.42  1.64
m97756  H1t Macaca mulatta  2.51  1.45  2.33  1.44  2.11  1.17  2.96  1.50
m97755  H1t Homo sapiens  2.32  1.29  2.19  1.22  1.87  1.09  2.93  1.53

NOTE.—SF is the simplicity factor. RSF is the relative simplicity factor. NTD is the N-terminal domain. GD is the globular domain. CTD is the C-terminal domain.


FIG. 2.—Simplicity profiles of H1 subtypes. (A) Nucleotide sequences. (B) Amino acid sequences. The horizontal axis represents the length of the sequences and the vertical axis the simplicity score. The scores from the simplicity factor calculation were averaged in 10 nucleotide steps. The vertical lines indicate the limits of the structural domains.

nucleotide sequences, each half giving rise to an independent peak (fig. 2). Even more remarkable were some of the tandem repeats of the H1 homologues of eubacteria. One of the subtypes of Bordetella pertussis contained 29 copies of the consensus sequence KKAVA, representing 79% of the entire protein. In Pseudomonas aeruginosa, a tandem of 40 copies of the consensus motif KPAA was present, representing 49% of the protein. In Escherichia coli, a repeat of the consensus (AAAEKAAADKAAAE)6 was present, but it was extensively modified by insertion/deletion. The other eubacterial H1s listed in table 1 did not contain tandem repeats.

The comparison of the simplicity profiles of the nucleotide and amino acid sequences shows that the overall simplicity of the proteins correlates with the simplicity of the DNA. In tandem repeats, the repeats in the protein always correlated with repeats in the DNA. This is not always the case in other proteins. Recent studies of polyglutamine-encoding regions show that a large proportion are encoded by near-random mixtures of codons (Albà, Santibáñez-Koref, and Hancock 1999a, 1999b; Albà, Laskowski, and Hancock 2001).
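The figures quoted for tandem repeats (number of copies and the percentage of the protein or domain they cover) can be illustrated with a small sketch. It is not the DNASTAR search used in the paper: the function names are hypothetical, only exact copies of a given motif are counted, and consensus repeats with substitutions or insertions/deletions, as found in several subtypes, would be undercounted.

```python
def longest_tandem_run(protein, motif):
    """Return (start, copies) for the longest run of exact, consecutive
    copies of motif in protein; (-1, 0) if the motif does not occur."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    k = len(motif)
    best_start, best_copies = -1, 0
    for start in range(len(protein) - k + 1):
        copies = 0
        # Count back-to-back copies beginning at this position.
        while protein.startswith(motif, start + copies * k):
            copies += 1
        if copies > best_copies:
            best_start, best_copies = start, copies
    return best_start, best_copies


def repeat_fraction(protein, motif):
    """Fraction of the sequence covered by the longest exact tandem run."""
    _, copies = longest_tandem_run(protein, motif)
    return copies * len(motif) / len(protein)
```

For a made-up toy sequence such as `"MS" + "KKAVA" * 3 + "GT"`, the run starts at position 2 with three copies and covers 15 of 19 residues. Applying the same calculation to real sequences would require an approximate matcher to handle the imperfect copies described in the text.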

Discussion

The N-terminal and C-terminal domains of H1 subtypes are more variable in length than the globular domain, indicating the important role of insertion/deletion events during the evolution of the terminal domains. The multiple alignment of H1 sequences confirms that gaps are abundant in the terminal domains and are much less frequent in the globular domain (Sullivan et al. 2002). In insects, the length variation of H1 subtypes, estimated by chemical modification and gel electrophoresis, was correlated with the number of species in the orders and taken as indicative of an adaptive mode of evolution (Berdnikov et al. 1993).

The sequences encoding the terminal domains show, in general, high or very high levels of sequence simplicity. This contrasts with the globular domain, whose complexity is typical of common globular proteins. In H1 subtypes, simplicity at the protein level is correlated with simplicity at the nucleotide level. A number of lines of evidence associate high levels of sequence simplicity with the occurrence of DNA slippage, which results in short insertions/deletions (Levinson and Gutman 1987; Schlötterer and Tautz 1992). It thus seems likely that DNA


FIG. 3.—Tandem repeats of short amino acid motifs and the corresponding coding sequences. The numbers on the left indicate the position in the sequence. The percentage of the domain represented by the tandem repeat is indicated in parentheses. EP: entire protein; NTD: N-terminal domain; CTD: C-terminal domain.


FIG. 4.—Duplication of long amino acid motifs. The number on the left indicates the position in the sequence. The percentage of the domain represented by the tandem repeat is indicated in parentheses. CTD: C-terminal domain.

slippage has played a major role in the evolution of H1 terminal domains. Other DNA amplification mechanisms, such as unequal crossing-over, may also have operated in the evolution of H1 terminal domains, especially in the case of long duplications. The simplicity of the terminal domains is, in general, of the cryptic type, that is, it does not basically arise from tandem repeats but reflects the clustering of different repeated motifs interspersed with each other and with unrepeated motifs. Tandem repetitions of short sequences longer than duplications are indeed infrequent in H1 subtypes. However, tandem repeats, which in some cases amount to a large proportion of the C-terminal domain, were found in some subtypes, mainly from invertebrates and plants, and also in the H1-like proteins from eubacteria. It has been suggested that cryptic simplicity may be the remnant of ancestral tandem repeats that were eroded by point mutations and slippage (Hancock 1993). The H1 subtypes containing tandem repeats are thus of great interest, as they suggest that the C-terminal domains could have originated through the amplification of short sequence motifs that would have accumulated by DNA slippage and then evolved by point mutation and further slippage. Some subtypes, such as those of Drosophila and mammalian H1t, would have been modified to such an extent as to no longer retain a significant degree of simplicity. H1 subtypes can thus be classified in three groups, according to the complexity of the C-terminal domain: (1) those containing tandem repetitions, (2) those having a high level of sequence simplicity, but without long tandem repetitions, comprising, among others, the mammalian subtypes, and (3) those with low or very low sequence simplicity. Clear examples of how the DNA amplification mechanisms have been involved in the genesis of the H1

terminal domains are the H1-like protein from the sperm of the bivalve Ensis minor (Bandiera et al. 1995) and the eubacterial H1-like proteins from Bordetella pertussis (Scarlato et al. 1995) and Pseudomonas aeruginosa (Kato, Misra, and Chakrabarty 1990). In E. minor, and in contrast to typical H1s, the tandem repeats are in the N-terminal domain, which is constituted by a series of 17 almost identical repeats of 12 amino acid residues. As in other tandem repeats, the repeats in the protein are correlated with the repeats in the DNA. Each repeat contains two related half-repeats, reminiscent of the hierarchical organization of human satellites. In B. pertussis, 79% of the protein is formed by 29 copies of the consensus motif KKAVA, whereas in P. aeruginosa, 40 copies of the consensus motif KPAA constitute 49% of the protein.

The H1 family of subtypes shows different degrees of simple sequence incorporation in the different lineages. Subtypes with significantly different sequence complexity were found in the same lineages and even in the same species. This suggests that sequence complexity may be related to subtype functional differentiation. The subtypes with long tandem repeats were basically circumscribed to plants, invertebrates, and eubacteria. In order to extend the knowledge on the functionality and evolutionary history of tandem repeats, it would be useful to obtain close homologues of tandem-containing subtypes as well as to explore the presence of polymorphisms (Albà, Santibáñez-Koref, and Hancock 1999b; Nishizawa and Nishizawa 1999; Pizzi and Frontali 2001).

Kasinsky et al. (2001) proposed on the basis of composition and sequence comparisons of H1 proteins from bacteria, protists, fungi, plants, and animals that H1-related histones originated in eubacteria long before the addition of the globular domain.
The analysis of the complexity of eubacterial H1s, showing that these subtypes are significantly simple and that some even contain tandem repeats, supports this conclusion: besides composition and sequence similarity, it reveals additional properties shared by eubacterial H1s and the C-terminus of H1s with tripartite structure. Eubacteria also provide a striking example of the efficacy of insertion/deletion mechanisms under conditions that are probably of low selective pressure: the B. pertussis homologue, BpH1, which is encoded by a dispensable gene, varies in size from 182 to 206 amino acids in different strains. The variability is due to the insertion or deletion of DNA modules (Scarlato et al. 1995).

The sequences of the N-terminal and C-terminal domains of most subtypes would still appear to be good substrates for slippage-based mutational mechanisms, which may produce gap mutations at a much higher frequency than nucleotide substitutions. However, insertions/deletions that might allow the fast evolution of protein variants and their functional differentiation may be hard to tolerate once functions have become fixed. Recent evidence showing that H1 may be involved in the activation or repression of specific genes, and that some of these effects can be attributed to the terminal domains, places the sequence variation and complexity of H1 subtypes in a wider context that goes beyond chromatin condensation.
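
As a rough check on the repeat figures quoted in this discussion, the following Python sketch computes the protein length implied by a given number of motif copies and the fraction of the protein they occupy, and counts exact tandem copies of a motif in a sequence. The function names and the toy sequence are illustrative assumptions and are not part of the original analysis. Real repeats are imperfect copies of a consensus, so exact matching of this kind would undercount them.

```python
def implied_length(copies: int, motif_len: int, fraction: float) -> float:
    """Protein length implied if `copies` motifs of `motif_len` residues
    make up `fraction` (between 0 and 1) of the whole protein."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return copies * motif_len / fraction


def longest_tandem_run(seq: str, motif: str) -> int:
    """Largest number of back-to-back exact copies of `motif` found in `seq`.

    Simplification: only perfect copies are counted, whereas the repeats
    discussed in the text are variants of a consensus motif.
    """
    if not motif:
        raise ValueError("motif must not be empty")
    step = len(motif)
    best = 0
    for start in range(len(seq)):
        copies = 0
        # Keep extending the run while the next window equals the motif.
        while seq.startswith(motif, start + copies * step):
            copies += 1
        best = max(best, copies)
    return best


if __name__ == "__main__":
    # Figures taken from the text: copies, motif length, fraction of protein.
    print(round(implied_length(29, 5, 0.79)))  # B. pertussis, KKAVA
    print(round(implied_length(40, 4, 0.49)))  # P. aeruginosa, KPAA

    # Synthetic toy sequence (hypothetical, not a real H1 sequence).
    toy = "MS" + "KKAVA" * 3 + "T"
    print(longest_tandem_run(toy, "KKAVA"))
```

With the B. pertussis figures (29 copies of a 5-residue motif, 79% of the protein), the implied length is about 184 residues, which lies inside the 182 to 206 range reported for BpH1. The P. aeruginosa figures (40 copies of a 4-residue motif, 49% of the protein) imply a protein of about 327 residues. Because the percentages in the text are rounded, these lengths are approximate.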

Acknowledgments

We thank Professor C. Crane-Robinson for drawing our attention to the H1 subtype from Ensis minor. This work has been financed in part by the Ministerio de Educación y Ciencia (DGICYT, PB98-0896) and the Generalitat de Catalunya (SGR2001/00199).

Literature Cited

Albà, M. M., R. A. Laskowski, and J. M. Hancock. 2002. Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics 18:672–678.
Albà, M. M., M. F. Santibáñez-Koref, and J. M. Hancock. 1999a. Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence for a slippage-like mutational process. J. Mol. Evol. 49:789–797.
Albà, M. M., M. F. Santibáñez-Koref, and J. M. Hancock. 1999b. Conservation of polyglutamine tract size between mouse and human depends on codon interruption. Mol. Biol. Evol. 16:1641–1644.
Bandiera, A., U. A. Patel, G. Manfioletti, A. Rustighi, V. Giancotti, and C. Crane-Robinson. 1995. A precursor-product relationship in molluscan sperm proteins from Ensis minor. Eur. J. Biochem. 233:744–749.
Berdnikov, V. A., S. M. Rozov, S. V. Temnykh, F. L. Gorel, and O. E. Kosterin. 1993. Adaptive nature of interspecies variation of histone H1 in insects. J. Mol. Evol. 36:497–507.
Bouvet, P., S. Dimitrov, and A. P. Wolffe. 1994. Specific regulation of Xenopus chromosomal 5S rRNA gene transcription in vivo by histone H1. Genes Dev. 8:1147–1159.
Bucci, L. R., W. A. Brock, and M. L. Meistrich. 1982. Distribution and synthesis of histone 1 subfractions during spermatogenesis in the rat. Exp. Cell Res. 140:111–118.
Churchill, M. E. A., and A. A. Travers. 1991. Protein motifs that recognize structural features of DNA. Trends Biochem. Sci. 16:92–97.
Costa, A. R., A. A. Peixoto, J. R. Thackeray, R. Dalgleish, and C. P. Kyriacou. 1991. Length polymorphism in the threonine-glycine encoding repeat regions of the period gene in Drosophila. J. Mol. Evol. 32:238–246.
Djian, P., and H. Green. 1989. Vectorial expansion of the involucrin gene and the relatedness to hominoids. Proc. Natl. Acad. Sci. USA 86:8447–8451.
Dou, Y., C. A. Mizzen, M. Abrams, C. D. Allis, and M. A. Gorovsky. 1999. Phosphorylation of linker histone H1 regulates gene expression in vivo by mimicking H1 removal. Mol. Cell 4:641–647.
Eickbush, T. H., and W. D. Burke. 1986. The silkmoth late chorion locus. J. Mol. Biol. 190:357–366.
Hancock, J. M. 1993. Evolution of sequence repetition and gene duplications in the TATA-binding protein TBP (TFIID). Nucleic Acids Res. 21:2823–2830.
Hancock, J. M. 1996. Simple sequences and the expanding genome. BioEssays 18:421–425.
Hancock, J. M., and J. S. Armstrong. 1994. Simple34: an improved and enhanced implementation for VAX and Sun computers of the Simple algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10:67–70.
Hancock, J. M., and G. A. Dover. 1988. Molecular coevolution among cryptically simple expansion segments of eukaryotic 26S/28S rRNAs. Mol. Biol. Evol. 5:377–391.
Hoelzel, A. R., J. M. Hancock, and G. A. Dover. 1991. Evolution of the Cetacean mitochondrial D-loop region. Mol. Biol. Evol. 8:475–493.
Kadake, J. R., and M. R. S. Rao. 1995. DNA- and chromatin-condensing properties of rat testes H1a and H1t compared to those of rat liver H1bdec: H1t is a poor condenser of chromatin. Biochemistry 34:15792–15801.
Kasinsky, H. E., J. D. Lewis, J. B. Dacks, and J. Ausió. 2001. Origin of H1 linker histones. FASEB J. 15:34–42.
Kato, J., T. K. Misra, and A. M. Chakrabarty. 1990. AlgR3, a protein resembling eukaryotic histone H1, regulates alginate synthesis in Pseudomonas aeruginosa. Proc. Natl. Acad. Sci. USA 87:2887–2891.
Khochbin, S., and A. P. Wolffe. 1994. Developmentally regulated expression of linker-histone variants in vertebrates. Eur. J. Biochem. 225:501–510.
Langan, T. A. 1982. Characterization of highly phosphorylated subcomponents of rat thymus H1 histone. J. Biol. Chem. 257:14835–14846.
Lee, H. L., and T. K. Archer. 1998. Prolonged glucocorticoid exposure dephosphorylates histone H1 and inactivates the MMTV promoter. EMBO J. 17:1454–1466.
Lennox, R. W. 1984. Differences in evolutionary stability among mammalian H1 subtypes. J. Biol. Chem. 259:669–672.
Lennox, R. W., R. G. Oshima, and L. H. Cohen. 1982. The H1 histones and their interphase phosphorylated states in differentiated and undifferentiated cell lines derived from murine teratocarcinomas. J. Biol. Chem. 257:5183–5189.
Levinson, G., and G. A. Gutman. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4:203–221.
Liao, L. W., and R. D. Cole. 1981. Differences among fractions of H1 histones in their interactions with linear and superhelical DNA: circular dichroism. J. Biol. Chem. 256:10124–10128.
Nishizawa, M., and K. Nishizawa. 1999. Local-scale repetitiveness in amino acid use in eukaryote protein sequences: a genomic factor in protein evolution. Proteins 37:284–292.
Panyim, S., and R. Chalkley. 1969. A new histone found only in tissues with little cell division. Biochem. Biophys. Res. Commun. 37:1042–1049.
Parseghian, M. H., A. H. Henschen, K. G. Krieglstein, and B. Hamkalo. 1994. A proposal for a coherent mammalian histone H1 nomenclature correlated with amino acid sequences. Protein Sci. 3:575–587.
Paulsson, G., U. Lendahl, J. Galli, C. Ericsson, and L. Wieslander. 1990. The balbiani ring 3 gene in Chironomus tentans has a diverged repetitive structure split by many introns. J. Mol. Biol. 211:331–349.
Pizzi, E., and C. Frontali. 2001. Low complexity regions in Plasmodium falciparum proteins. Genome Res. 11:218–229.
Ponte, I., C. Monsalves, M. Cabañas, P. Martínez, and P. Suau. 1996. Sequence simplicity and evolution of the 3′ untranslated region of the histone H1° gene. J. Mol. Evol. 43:125–134.
Ponte, I., J. M. Vidal-Taboada, and P. Suau. 1998. Evolution of the vertebrate H1 histone class: evidence for the functional differentiation of the subtypes. Mol. Biol. Evol. 15:702–708.
Scarlato, V., B. Arico, S. Goyard, S. Ricci, R. Manetti, A. Prugnola, R. Manetti, P. Polverino-De-Laurento, A. Ullmann, and R. Rappuoli. 1995. A novel chromatin-forming histone H1 homologue is encoded by a dispensable and growth-regulated gene in Bordetella pertussis. Mol. Microbiol. 5:871–881.
Schlotterer, C., and D. Tautz. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 20:211–215.
Shen, X., and M. A. Gorovsky. 1996. Linker histone H1 regulates specific gene expression but not global transcription in vivo. Cell 86:475–483.
Sullivan, S. A., D. W. Sink, K. L. Trout, I. Makalowska, P. M. Taylor, A. D. Baxevanis, and D. Landsman. 2002. The histone database. Nucleic Acids Res. 30:341–342.

Suzuki, M. 1989. SPKK, a new nucleic acid-binding unit of protein found in histone. EMBO J. 8:797–804.
Talasz, H., N. Sapojnikova, W. Helliger, H. Linder, and B. Puschendorf. 1998. In vitro binding of H1 histone subtypes to nucleosomal organized mouse mammary tumor virus long terminal repeat promotor. J. Biol. Chem. 273:32236–32243.
Tanaka, M., J. D. Hennebold, J. Macfarlane, and E. Y. Adashi. 2001. A mammalian oocyte-specific linker histone gene H1oo: homology with the genes for the oocyte-specific cleavage stage histone (CS-H1) of sea urchin and the B4/H1M histone of the frog. Development 128:655–664.
Tautz, D., M. Trick, and G. A. Dover. 1986. Cryptic simplicity in DNA is a major source of genetic variation. Nature 322:652–656.
Treier, M., C. Pfeifle, and D. Tautz. 1989. Comparison of the gap segmentation gene hunchback between Drosophila melanogaster and Drosophila virilis reveals novel modes of evolutionary change. EMBO J. 8:1517–1525.
Vermaak, D., O. C. Steinbach, S. Dimitrov, R. A. W. Rupp, and A. P. Wolffe. 1998. The globular domain of histone H1 is sufficient to direct specific gene repression in early Xenopus embryos. Curr. Biol. 8:533–536.
Vila, R., I. Ponte, M. Collado, J. L. R. Arrondo, M. A. Jiménez, M. Rico, and P. Suau. 2001a. DNA-induced α-helical structure in the NH2-terminal domain of histone H1. J. Biol. Chem. 276:46429–46435.
Vila, R., I. Ponte, M. Collado, J. L. R. Arrondo, and P. Suau. 2001b. Induction of secondary structure in a COOH-terminal peptide of histone H1 by interaction with the DNA. J. Biol. Chem. 276:30898–30903.
Vila, R., I. Ponte, M. A. Jiménez, M. Rico, and P. Suau. 2000. A helix-turn motif in the C-terminal domain of histone H1. Protein Sci. 9:627–636.
Vila, R., I. Ponte, M. A. Jiménez, M. Rico, and P. Suau. 2002. An inducible helix-Gly-Gly motif in the N-terminal domain of histone H1e: a CD and NMR study. Protein Sci. 11:214–220.
Wolffe, A. P., P. S. Khochbin, and S. Dimitrov. 1997. What do linker histones do in chromatin? BioEssays 19:249–255.
Zlatanova, J., and K. van Holde. 1992. Histone H1 and transcription: still an enigma? J. Cell Sci. 103:889–895.

Claudia Kappen, Associate Editor
Accepted October 23, 2002