© 2000 Oxford University Press
Nucleic Acids Research, 2000, Vol. 28, No. 1
323–325
MEROPS: the peptidase database
Neil D. Rawlings* and Alan J. Barrett
MRC Molecular Enzymology Laboratory, The Babraham Institute, Babraham, Cambridgeshire CB2 4AT, UK
Received September 29, 1999; Revised and Accepted October 8, 1999
ABSTRACT
Important additions have been made to the MEROPS database (http://www.bi.bbsrc.ac.uk/Merops/Merops.htm). These include sequence alignments and cladograms for many of the families of peptidases, and these have proved very helpful in the difficult task of distinguishing the sequences of peptidases that are simply species variants of already known enzymes from those that represent novel enzymes.

INTRODUCTION
The MEROPS database (http://www.bi.bbsrc.ac.uk/Merops/Merops.htm) provides a catalogue and a structure-based classification of peptidases (i.e. all proteolytic enzymes) (1). This is a large group of proteins (~2% of all gene products) that is of particular importance in medicine and biotechnology. An index of the peptidases by name or synonym (http://www.bi.bbsrc.ac.uk/Merops/indexes/pepidx.htm) gives access to a set of files termed PepCards (e.g. caspase-1; http://www.bi.bbsrc.ac.uk/Merops/pepcards/c14p001.htm) each of which provides information on classification and nomenclature for a single peptidase. Also provided are an interface to the relevant entries in databases for human genetics, protein and nucleic acid sequence data and tertiary structure data, and if the tertiary structure of the enzyme has been determined, a Richardson diagram (http://www.bi.bbsrc.ac.uk/Merops/images/1ice.htm) (2) is shown. Another index provides access to the PepCards by organism name (http://www.bi.bbsrc.ac.uk/Merops/indexes/pepidx.htm) so that the user can retrieve all known peptidases from a particular species.

The peptidases are classified into families on the basis of statistically significant similarities between the protein sequences in the part termed the ‘peptidase unit’ that is most directly responsible for activity. Families that are thought to have common evolutionary origins because they have similar tertiary folds are grouped into clans. The MEROPS database provides sets of pages called FamCards (e.g.
C14; http://www.bi.bbsrc.ac.uk/Merops/famcards/c14.htm) and ClanCards (e.g. CD; http://www.bi.bbsrc.ac.uk/Merops/clancards/cd.htm) describing the individual families and clans. Each FamCard page provides links to other databases of sequence motifs and secondary and tertiary structures, and shows the distribution of the family across the major kingdoms of living creatures.

Among the recent developments to the MEROPS database have been links to the Drosophila genome database FlyBase (3) and to TrEMBLNew, the updates to the TrEMBL database (4). Biomedical, biotechnological and physiological information
has been incorporated into the PepCards. We have also included amino acid sequence alignments of the peptidase units for many families and subfamilies, and cladograms that give a graphical representation of the degrees of similarity of the sequences. The way in which we have been able to use these to recognise homologous but different peptidases within a single organism, and forms of the same peptidase in different organisms, is the main topic of the present paper.

SEQUENCE ALIGNMENTS
Many peptidases are complex proteins in which the part of the molecule most directly responsible for peptide bond hydrolysis [which we term the ‘peptidase unit’ (1)] is linked at one or both ends to domains with other functions. Generally, the peptidase unit is a contiguous section of the sequence and contains about 200 amino acids. The peptidase unit is identified by reference to crystallographic structures if they are available. It cannot be larger than the smallest active peptidase in the family, and may be smaller. When the smallest active peptidase is a mosaic protein, we restrict the peptidase unit still further by excluding domains that show similarity to domains in proteins that are not peptidases.

Once the peptidase unit has been determined for the ‘type example’ (1) of the family, the limits of the peptidase unit for all members of the family can be determined with a FastA search (5). A library of all the full-length sequences of active peptidases in the family is compiled from the SWISS-PROT, TrEMBL and PIR protein sequence databases, and this is searched with the peptidase unit of the type example. The limits of the peptidase unit of each hit are calculated from the pairwise alignments returned in the FastA results file. If the alignment does not include the N- and C-termini of the type example, e.g. if the hit is a fragment of a sequence, the sequence is usually discarded.
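The calculation of peptidase unit limits from the pairwise alignments can be sketched as follows. This is a minimal illustration in Python, not the code used to build MEROPS: the PairwiseHit record, the function names and the toy coordinates are hypothetical, and the strict requirement that the alignment reach residue 1 and the last residue of the type-example unit is an assumption made for the example. Parsing of the FastA results file is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PairwiseHit:
    """One pairwise alignment from a search (hypothetical record; 1-based, inclusive)."""
    hit_id: str
    query_start: int  # first aligned residue of the type-example peptidase unit
    query_end: int    # last aligned residue of the type-example peptidase unit
    hit_start: int    # first aligned residue of the full-length library sequence
    hit_end: int      # last aligned residue of the full-length library sequence


def peptidase_unit_limits(hit: PairwiseHit, query_length: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the peptidase unit in the hit, or None to discard it.

    Assumption: the alignment must span the whole query, i.e. include both
    the N- and C-termini of the type-example peptidase unit.
    """
    covers_n_terminus = hit.query_start == 1
    covers_c_terminus = hit.query_end == query_length
    if not (covers_n_terminus and covers_c_terminus):
        return None  # e.g. the hit is a fragment of a sequence
    return hit.hit_start, hit.hit_end


def extract_unit(full_sequence: str, limits: Tuple[int, int]) -> str:
    """Cut the peptidase unit out of a full-length sequence (1-based, inclusive limits)."""
    start, end = limits
    return full_sequence[start - 1:end]


# Toy data: a 200-residue type-example unit and two invented hits.
complete = PairwiseHit("hit_complete", 1, 200, 35, 240)
fragment = PairwiseHit("hit_fragment", 40, 200, 1, 158)

print(peptidase_unit_limits(complete, 200))  # (35, 240)
print(peptidase_unit_limits(fragment, 200))  # None
print(extract_unit("ABCDEFGHIJ", (3, 6)))    # CDEF
```

In this sketch the first hit is kept with limits 35–240, whereas the second is discarded because its alignment starts at residue 40 of the type-example unit. A hit rejected in this way could still be examined by hand, as in the exceptional cases where a manual comparison is needed.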
When, exceptionally, the hit has an insert longer than that permitted in the FastA alignment algorithm, a manual comparison is made and the limits of the peptidase unit are calculated by hand. The protein sequences of all the peptidase units identified in this way are extracted and concatenated into a file which is read directly into the ClustalX program (6) and aligned using the default parameters. The alignment is checked visually to ensure that known catalytic residues are correctly aligned, and any sequences that cannot be aligned are discarded. Finally, the ClustalX alignment is converted into an HTML file. The value of an alignment is enhanced by annotation. The method we have used to annotate alignments automatically is to collect data that can be related directly to the sequence of the type example of the family or subfamily, and to use a combination
*To whom correspondence should be addressed. Tel: +44 1223 496649; Fax: +44 1223 496023; Email:
[email protected]
of symbols and color highlighting on the alignment to show conservation of these structural features. An example of an annotated alignment is that for the peptidase family M8 (http://www.bi.bbsrc.ac.uk/Merops/aln/m08_fra.htm) of leishmanolysin. An amino acid residue that is identical with that in the type example (Leishmania major leishmanolysin) is shown in purple and in bold. In keeping with the MSF format of the GCG package (7), gaps are shown as dots, and the sequences are displayed in blocks of 60 amino acids. Each block is preceded by two lines of residue numbering and up to three lines for the sequence-specific annotation symbols, and followed by three lines of consensus sequence. The alignment is numbered according to the complete sequence of the type example, and letters indicate insertions relative to the type example. The color-coded annotation symbols are used consistently in all the alignments, and the full list is shown in Table 1. At the end of each alignment there is a key to the symbols.

The first consensus line shows residues that occur in more than half of the sequences (or an ‘x’ if no single amino acid predominates). The second line shows the type of amino acid that occurs at each position by use of the SMART groupings and symbols (8). The third line shows an amino acid that is absent from that position in the alignment but is a member of the same SMART group depicted in the second line.

Table 1. Features that are annotated in the alignments and the symbols used
Feature                               Symbol
Active site residue                   *
Metal ligand                          +
Specificity site                      @
helix                                 a
sheet                                 b
Disulfide bridge (N-terminal end)     /(number)
Disulfide bridge (C-terminal end)     \(number)
N-linked carbohydrate attachment      ‡
Propeptide                            »
Transmembrane domain                  =
Cleavage site                         ¦

The symbols used in the database are color-coded: a red symbol indicates that the feature is a prediction only, a blue symbol shows that there is evidence from site-directed mutagenesis, and a black symbol shows that there is biochemical or tertiary structure evidence from at least one member of the family.

CLADOGRAMS
Cladograms are calculated from the amino acid sequence alignments. A difference matrix is calculated and converted to accepted point mutations (PAMs) according to the table of Dayhoff et al. (9). A single tree is calculated from this matrix with the Kitsch program of the Phylip package (10); this uses the algorithm of Fitch and Margoliash (11) with contemporary tips, with the assumption that mutation rate is constant throughout the family. The tree is displayed with the branches horizontal, the ‘root’ on the left and the tips on the right. (Trees generated by the Kitsch algorithm are unrooted, and the ‘root’ is the mid-point of the most ancient divergence.) We rearrange the branches so that the longest are towards the bottom of the image. The cladograms that are included in MEROPS are intended to show how similar sequences cluster together, and not to give an accurate depiction of evolution in the family, so no bootstrapping is done.

Both alignments and cladograms are presented within frames in the HTML pages of the database; each is shown in a frame that covers the top half of the screen, and the frame below contains the key. For a given family the order of sequences in the alignment is adjusted to match the order in the tree. Each tip (sequence) on the tree is numbered, and an asterisk marks the type example of the family. The key indicates the peptidase name, the scientific binomial of the species it is derived from, the SWISS-PROT or TrEMBL accession number and the MEROPS identifier. If the sequence has not been assigned to a MEROPS identifier, then it is described as an ‘other peptidase’ and an alternative name is given in parentheses. If the peptidase family is divided into subfamilies (see below), then the key is also divided into subfamilies. The maximum number of sequences that are included in an alignment or a tree is 100 so as to make them easy to read. If the family contains more than 100 sequences, then the alignments and trees are drawn for separate subfamilies where possible.

SUBFAMILIES
Criteria we have used to recognise subfamilies within a peptidase family were described previously (1), and the method of use of cladograms for this purpose is illustrated by the tree for peptidase family M10 (the interstitial collagenase family; http://www.bi.bbsrc.ac.uk/Merops/trees/m10_frt.htm). The family can be divided into three subfamilies, with tips 1–69 in subfamily M10A (the matrixin subfamily), tips 70–80 in subfamily M10B (the serralysin subfamily) and tips 81–82 in subfamily M10C (the fragilysin subfamily).
DISTINCTION OF KNOWN FROM NOVEL PEPTIDASES
As a result of the rapid progress of the nucleotide sequencing projects we are aware of the deduced amino acid sequences of many members of the known families of peptidases for which there are no biochemical data. For each of these putative peptidases a decision has to be made whether to assign it to the MEROPS identifier of an already known peptidase, with an existing PepCard, and with the implication that its properties and functions are similar, or to treat it as a novel entity. The cladograms and sequence alignments provide valuable clues in making such decisions. Criteria we have used for a single peptidase were described previously (1). Cladograms are used to give a preliminary indication of which sequences are species variants of a known peptidase.

The tree for peptidase family A1 (the pepsin family; http://www.bi.bbsrc.ac.uk/Merops/trees/a01_frt.htm) will be used as an example for a family in which it has been difficult to identify the sequences that represent species variants of known enzymes. One problem is that species variants have sometimes
been given different names simply because of their different origins, but also some organisms contain multiple duplicated genes encoding very similar enzymes. As an example of easy identification of orthologues, tips 1–11 have been assigned to the MEROPS identifier for cathepsin D (A01.009). The branching pattern matches that expected for orthologues, the most ancient divergences being in nematodes and trematodes, then in insects and vertebrates.

Aspartic endopeptidases from fungi have been assumed in the past to represent a variety of different enzymes, and the tree supports this view in many cases. Fungal vacuolar peptidases (tips 13–16) appear to be the fungal equivalents of cathepsin D, a lysosomal enzyme in animals, whereas aspergillopepsins I (tips 62–71), rhizopuspepsins (tips 72–77), mucorpepsins (tips 78–79) and candidapepsins (tips 82–90) are unlike known animal or plant peptidases. In several of these broad groups genes are known to have proliferated in some species (notably Candida albicans and Rhizopus niveus), and when biochemical evidence exists to show that these represent distinct enzymes we have maintained this distinction. This explains why penicillopepsin, for example, appears within the aspergillopepsin I branch. Presumably, accumulation of point mutations has led to a change in substrate specificity subsequent to the divergence of the Aspergillus and Penicillium genera. The distinction is that penicillopepsin can cleave κ-casein and thus clot milk (12) whereas aspergillopepsin I cannot (13).

There are many sequences that apparently represent novel enzymes but which have not yet been characterized biochemically. This is most notable for the nematode sequences in tips 50–56. Examples of mutations that change peptidase properties can be seen from the cladogram and alignment for cysteine endopeptidases of family C1 (the papain family).
The cladogram for family C1 (http://www.bi.bbsrc.ac.uk/Merops/trees/c01a_frt.htm) shows that four peptidase sequences from papaya (Carica papaya) form a branch distinct from those of other plants. These are papain, chymopapain, caricain and glycyl endopeptidase, but the cladogram gives no clue to their enzymology. The sequence alignment for plant members of family C1 (http://www.bi.bbsrc.ac.uk/Merops/aln/c01a_fra.htm) is more informative if inspected in the light of the known structure/function relationships in the family, in particular the residues that form the specificity
sites. Thus, Gly156 and Gly198 are both in the S1 specificity pocket of papain and are highly conserved except in glycyl endopeptidase where they are substituted with Glu and Arg, respectively. Enzymological data show that papain, chymopapain and caricain are all similar in specificity to the animal lysosomal endopeptidase cathepsin L, although they differ in details of the cleavage of the insulin B chain. Glycyl endopeptidase, however, has a very restricted specificity, able only to cleave glycyl bonds because of the bulky replacements in the S1 site (14). In the kiwi fruit (Actinidia deliciosa) actinidain is similar in properties to papain but is less efficient in cleaving substrates in which the P2 residue is aromatic, and the explanation for this is that Ser338 has been substituted by Met, making the S2 subsite shorter (15), and again this is apparent from the alignment.

REFERENCES
1. Rawlings,N.D. and Barrett,A.J. (1999) Nucleic Acids Res., 27, 325–331.
2. Richardson,J.S. (1985) Methods Enzymol., 115, 359–380.
3. The FlyBase Consortium (1999) Nucleic Acids Res., 27, 85–88.
4. Bairoch,A. and Apweiler,R. (1998) Nucleic Acids Res., 26, 38–42. Updated article in this issue: Nucleic Acids Res. (2000), 28, 45–48.
5. Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.
6. Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997) Nucleic Acids Res., 25, 4876–4882.
7. Genetics Computer Group (1994) Program Manual for the Wisconsin Package, Version 8, September 1994. University of Madison, Wisconsin.
8. Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) Nucleic Acids Res., 27, 229–232. Updated article in this issue: Nucleic Acids Res. (2000), 28, 231–234.
9. Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure 1978. National Biomedical Research Foundation, Washington D.C., pp. 353–358.
10. Felsenstein,J. (1989) Cladistics, 5, 164–166.
11. Fitch,W.M. and Margoliash,E. (1967) Science, 155, 281–284.
12. Hofmann,T. (1998) In Barrett,A.J., Rawlings,N.D. and Woessner,J.F. (eds), Handbook of Proteolytic Enzymes. Academic Press, London, pp. 878–883.
13. Ichishima,E. (1998) In Barrett,A.J., Rawlings,N.D. and Woessner,J.F. (eds), Handbook of Proteolytic Enzymes. Academic Press, London, pp. 872–878.
14. O’Hara,B.P., Hemmings,A.M., Buttle,D.J. and Pearl,L.H. (1995) Biochemistry, 34, 13190–13195.
15. Watts,A.B. and Brocklehurst,K. (1998) In Barrett,A.J., Rawlings,N.D. and Woessner,J.F. (eds), Handbook of Proteolytic Enzymes. Academic Press, London, pp. 573–576.