© 1996 Oxford University Press
Nucleic Acids Research, 1996, Vol. 24, No. 1
245–247
Histone Sequence Database: a compilation of highly-conserved nucleoprotein sequences
Andreas D. Baxevanis and David Landsman*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Room 8N-805, Bethesda, MD 20894, USA
* To whom correspondence should be addressed
Received November 20, 1995; Accepted November 23, 1995
ABSTRACT
By searching the current protein sequence databases using sequences from human and chicken histones H1/H5, H2A, H2B, H3 and H4, a database of aligned histone protein sequences with statistically significant sequence similarity to the search sequences was constructed. In addition, a nucleotide sequence database of the corresponding coding regions for these proteins has been assembled. The region of each of the core histones containing the histone fold motif is identified in the protein alignments. The database contains >1300 protein and nucleotide sequences. All sequences and alignments in this database are available through the World Wide Web at http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES.
INTRODUCTION
The histone proteins play a critical role in the compaction and organization of DNA within the nucleus. The four core histones (H2A, H2B, H3 and H4), which are responsible for folding DNA into nucleosomes (1), have been very highly conserved throughout evolution. This conservation is particularly striking in the case of H4, where there is >95% identity across all known H4 sequences. In contrast, the linker histones H1 and H5, which are involved in the higher-order packing of nucleosomes, have a relatively conserved central globular domain but overall are conserved to a much lesser extent than their core histone counterparts.
We have assembled a database of all of the histone protein and nucleotide sequences as of November 1995. This database is intended to be a source of sequence information for these chromosomal proteins, with particular reference to conflicts between similar sequence entries in different source databases. The database, which will be updated as new histone sequence entries are processed, represents the most comprehensive collection and annotation of histone primary sequences assembled to date.
DATABASE CONTENT
The protein sequences were compiled by searching the Swiss-Prot version 31.0 (2), Protein Identification Resource (PIR) version 45.0 (3), GenPept release 91.0 (4), and the Protein Data Bank (PDB, April 1995) databases using both the BLASTP (5) and FASTA (6) algorithms. In each case, the sequences from chicken or human sources were used as the basis for comparison in the database searches; in the case of H5, only the chicken sequence was used.
In the course of the database searches, cases were noted where there were conflicts between the individual sequence entries for a given histone (http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES/table.html). In citing sequence conflicts, a majority-rule approach was used. In the cases where there was no clear majority among the sequences, the differences are noted with respect to the entries found in Swiss-Prot, where available. Cases where the protein sequences are incorrectly identified are also noted.
Nucleotide sequences corresponding to the protein sequences described above were compiled using AA2NUC (T. Tatusov, NCBI), a UNIX utility that performs automated TBLASTN searches of the databases. [TBLASTN compares a protein query sequence against nucleotide sequence databases dynamically translated in all six possible reading frames (5).] The databases searched included the EMBL Data Library release 43.0 (7), GenBank release 91.0 (4), and the Protein Data Bank (PDB, April 1995) databases.
DATABASE AVAILABILITY
A World Wide Web site has been developed to allow for easy access to the information in the histone sequence database (http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES/database.html). Through this point-and-click interface, the most current version of the database can be accessed, as well as updated information on database content and discrepancies between individual sequence entries.
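The discrepancies between individual sequence entries mentioned above are cited using the majority-rule approach described under DATABASE CONTENT. The following Python sketch is only an illustration of that idea, not the procedure actually used to build the database: the function name `cite_conflicts`, the entry identifiers and the toy sequences are all hypothetical, and the entries are assumed to be already aligned to equal length. At each position the majority residue is taken as the expected one; when there is no clear majority, the residue of a designated reference entry (for example the Swiss-Prot entry, where available) is used instead.

```python
from collections import Counter


def cite_conflicts(entries, reference_id=None):
    """Report positions where aligned entries for one histone disagree.

    entries: dict mapping a (hypothetical) entry id to an aligned sequence.
    reference_id: entry used as the baseline when no clear majority exists.
    Returns a list of (1-based position, expected residue, dissenting entries).
    """
    ids = list(entries)
    length = len(entries[ids[0]])
    if any(len(seq) != length for seq in entries.values()):
        raise ValueError("sequences must be aligned to the same length")

    conflicts = []
    for pos in range(length):
        column = {eid: entries[eid][pos] for eid in ids}
        counts = Counter(column.values())
        if len(counts) == 1:
            continue  # every entry agrees at this position

        (top, top_n), (_, second_n) = counts.most_common(2)
        if top_n > second_n:
            expected = top  # clear majority
        elif reference_id is not None:
            expected = column[reference_id]  # fall back to the reference entry
        else:
            expected = None  # no majority and no reference available

        dissent = {eid: res for eid, res in column.items() if res != expected}
        conflicts.append((pos + 1, expected, dissent))
    return conflicts


if __name__ == "__main__":
    toy = {"swissprot_entry": "MKT", "pir_entry": "MKT", "genpept_entry": "MRT"}
    print(cite_conflicts(toy, reference_id="swissprot_entry"))
```

With only two conflicting entries there is no majority, so the reference entry decides which residue is reported as expected.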
Pointers to the files containing the sequence entries are divided into three groups: (i) a complete set of protein sequence entries containing all full-length entries for each histone type, (ii) the nucleotide sequence set resulting from the TBLASTN searches described above and (iii) a non-redundant protein sequence set containing only one entry for each unique sequence from a particular organism or variant thereof. The histone database files can also be downloaded directly from the ftp site at NCBI (ncbi.nlm.nih.gov, directory /pub/baxevanis/histones). There are two protein sequence files and one nucleotide sequence file for each major histone type. The first
protein sequence file (*.raw) contains all of the sequences found for that histone type and is, therefore, redundant. The sequence data are presented in FASTA format, with the definition line containing the accession number, description and histone code for each individual entry, each separated by a vertical bar (|); the histone codes are in a text file (codes.txt) in the same directory. The second protein sequence file (*.nr) contains only one entry for each unique sequence from a particular organism or variant thereof; this file is non-redundant. These non-redundant files are also in FASTA format, with only the histone code appearing in the definition line. The nucleotide sequence file (*.nuc) contains the results of the AA2NUC program described above. Studies utilizing the data within this database, obtained either through the World Wide Web site or the anonymous ftp site, should cite this paper as the primary reference.
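As an illustration of the file layout described above, the following Python sketch reads a *.raw-style file and derives a non-redundant set from it. It is a minimal sketch under stated assumptions rather than the code used to build the database: the function names are hypothetical, the definition line is assumed to hold exactly three fields in the order accession number, description, histone code, and redundancy is assumed to mean the same histone code paired with the same sequence. The accession numbers and codes in the usage example are invented.

```python
def _make_record(header, chunks):
    """Split a definition line on '|' into accession, description and code."""
    fields = [field.strip() for field in header.split("|")]
    if len(fields) != 3:
        # Assumption: exactly three fields; a description containing '|'
        # would need different handling.
        raise ValueError(f"expected 3 '|'-separated fields, got: {header!r}")
    accession, description, code = fields
    return {
        "accession": accession,
        "description": description,
        "code": code,
        "sequence": "".join(chunks),
    }


def parse_raw_fasta(text):
    """Parse FASTA text whose definition lines are accession|description|code."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def non_redundant(records):
    """Keep the first record for each (histone code, sequence) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["code"], record["sequence"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    # Invented entries, for illustration only.
    sample = """>X00001|hypothetical histone entry|AAA
SGRGKGGKGL
GKGGAKRHRK
>X00002|hypothetical duplicate entry|AAA
SGRGKGGKGLGKGGAKRHRK
>X00003|hypothetical entry with another code|BBB
SGRGKGGKGLGKGGAKRHRK
"""
    records = parse_raw_fasta(sample)
    print(len(records), len(non_redundant(records)))
```

Keying on the histone code together with the sequence keeps identical sequences that carry different codes as separate entries, which is one reading of "one entry for each unique sequence from a particular organism or variant thereof".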
PROTEIN SEQUENCE CHARACTERISTICS
Alignments for each of the histones are available through the World Wide Web site. In the core histone alignments, the region identified as forming the histone fold motif (8) is boxed. As an example, the alignment of histone H2B is shown in Figure 1. The histone fold, which is involved in the extensive molecular contact interface seen within the core histone octamer, is basically an extended helix–strand–helix motif, consisting of a short alpha-helix, a loop/beta-strand segment, a long alpha-helix, a second loop/beta-strand segment, and another short alpha-helix (9). For H2B, H3 and H4, the length of the histone fold is extremely well conserved; only the rat H2B and Leishmania H3 sequences differ in length from the remainder of their respective groups by being one amino acid shorter. The histone fold within H2A is also highly conserved, but there are a greater number of cases where the length of the fold differs; most of these differences are seen in minor H2A variants such as H2A.vD from Drosophila, H2A.hv1 from Tetrahymena and H2A.F/Z from several organisms. Despite these relatively minor differences, the high degree of conservation of the histone fold points to the critical nature of this region in histone–histone and histone–DNA interactions.
Unlike the core histone proteins, the linker histones H1 and H5 are substantially less conserved. The most conserved region in these two proteins is the central, globular portion, with both the N- and C-terminal tails exhibiting large numbers of substitutions. As the H1 tails are responsible for binding to internucleosomal DNA (10,11) and are susceptible to post-translational modifications (12), the high variation in the sequences of these tails, even within a single organism, may play a critical role in differential chromatin condensation and, consequently, gene regulation (13,14). There is also considerable variation in the length of the H1 proteins, requiring the introduction of a large number of gaps to adequately align these sequences.
One of the oft-cited characteristics of the histone proteins is that they contain no tryptophan residues (15). Our database searches indicate that, while this observation is true for the most part, there are indeed several histones containing tryptophan residues; these include H2A from a Xenopus laevis clone, Trypanosoma cruzi H2B, mouse testis H2B, Leishmania H2B and H3 and a X. laevis H5. The presence of tryptophan in these histones makes them particularly attractive targets for physical biochemical studies such as fluorescence spectroscopy, which require such strongly absorbing heterocyclic residues.
ACKNOWLEDGEMENTS
We would like to thank Tatiana Tatusov for generously providing us with the AA2NUC program. We also thank Drs Gina Arents and Van Moudrianakis for numerous critical discussions regarding the histone fold motif.
REFERENCES
1 Kornberg, R. and Thomas, J.O. (1974) Science, 184, 865–868.
2 Bairoch, A. and Apweiler, R. (1996) Nucleic Acids Res., 24, 221–222.
3 George, D.G., Barker, W.C., Mewes, H.-W., Pfeiffer, F. and Tsugita, A. (1996) Nucleic Acids Res., 24, 17–20.
4 Benson, D., Boguski, M., Lipman, D.J. and Ostell, J. (1996) Nucleic Acids Res., 24, 1–7.
5 Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) J. Mol. Biol., 215, 403–410.
6 Pearson, W.R. and Lipman, D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.
7 Rodriguez-Tome, P., Stoehr, P.J., Cameron, G.N. and Flores, T.P. (1996) Nucleic Acids Res., 24, 8–12.
8 Arents, G., Burlingame, R.W., Wang, B.-C., Love, W.E. and Moudrianakis, E.N. (1991) Proc. Natl Acad. Sci. USA, 88, 10148–10152.
9 Arents, G. and Moudrianakis, E.N. (1993) Proc. Natl Acad. Sci. USA, 90, 10489–10493.
10 Varshavsky, A.J., Bakayev, V.V. and Georgiev, G.P. (1976) Nucleic Acids Res., 3, 477–492.
11 Garrard, W.T., Todd, R.D., Boatwright, D.T. and Albright, S.C. (1976) in 10th Int. Congress of Biochemistry, Hamburg, Germany, Abstract 02-3-178.
12 Hohmann, P. (1983) Mol. Cell. Biochem., 57, 81–92.
13 Laybourn, P.J. and Kadonaga, J.T. (1991) Science, 254, 238–245.
14 Croston, G.E., Kerrigan, L.A., Lira, L.M., Marshak, D.R. and Kadonaga, J.T. (1991) Science, 251, 643–649.
15 van Holde, K.E. (1989) Chromatin. Springer-Verlag, New York.
16 Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) Comput. Applic. Biosci., 8, 189–191.
17 Barton, G.J. (1993) Protein Engng, 6, 37–40.
Figure 1. Multiple sequence alignment of histone H2B. The major human histone sequence is shown at the top of the alignment. At each position, sequences in agreement with the first sequence are denoted by a dot (.), while gaps are denoted by dashes (–). For positions not in agreement with the first sequence, the actual amino acid residue for that position is shown. The region comprising the histone fold motif is boxed and was manually aligned. Regions outside the histone fold were aligned using CLUSTAL V (16). The alignment was formatted using ALSCRIPT (17).
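The display convention used in Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch only (the figure itself was formatted with ALSCRIPT): the function name `dot_format` is hypothetical, the input sequences are assumed to be already aligned to equal length with gaps written as an ASCII hyphen, and the example sequences are invented rather than taken from the database.

```python
def dot_format(aligned, gap="-"):
    """Rewrite an alignment relative to its first sequence.

    Residues identical to the first sequence become '.', gaps stay as the
    gap character, and differing residues are shown as they are.
    """
    reference = aligned[0]
    rows = [reference]  # the first sequence is printed in full
    for seq in aligned[1:]:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the same length")
        row = []
        for ref_res, res in zip(reference, seq):
            if res == gap:
                row.append(gap)
            elif res == ref_res:
                row.append(".")
            else:
                row.append(res)
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Invented toy alignment, for illustration only.
    for row in dot_format(["PEPAK", "PEPSK", "PE-AK"]):
        print(row)
```

Gaps are tested before identity, so a position that is gapped in both the first sequence and a later one is still shown as a gap rather than as a dot.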