Journal of Structural Biology 134, 95–102 (2001), doi:10.1006/jsbi.2000.4332, available online at http://www.idealibrary.com
The MEROPS Database as a Protease Information System
Alan J. Barrett, Neil D. Rawlings, and Emmet A. O’Brien
Molecular Enzymology Laboratory, The Babraham Institute, Cambridge CB2 4AT, United Kingdom
Received November 3, 2000, and in revised form January 17, 2001; published online March 21, 2001
Peptidases (often termed proteases) are of great relevance to biology, medicine, and biotechnology. This practical importance creates a need for an integrated source of information about peptidases. In the MEROPS database (www.merops.ac.uk), peptidases are classified by structural similarities in the parts of the molecules responsible for their enzymatic activity. They are grouped into families on the basis of amino acid sequence homology, and the families are assembled into clans in light of evidence that they share common ancestry. The evidence for clan-level relationships usually comes from similarities in tertiary structure, but we suggest that secondary structure profiles may also be useful in the future. The classification forms a framework around which a wealth of supplementary information about the peptidases is organized. This includes images of three-dimensional structures, alignments of matching human and mouse ESTs, comments on biomedical relevance, human and other gene symbols, and literature references linked to PubMed. For each family, there is an amino acid sequence alignment and a dendrogram. There is a list of all peptidases known from each of over 1000 species, together with summary data for the distributions of the families and clans throughout the major groups of organisms. A set of online searches provides access to information about the location of peptidases on human chromosomes and peptidase substrate specificity. © 2001 Academic Press
Key Words: clan; family; MEROPS; protease; proteinase; peptidase; structural classification; database.
INTRODUCTION
We need to address a terminological issue right away. A redundant set of terms is used by the scientific community to refer to proteolytic enzymes. This includes the terms protease, proteinase, and peptidase. Of these, protease is the most familiar, and we therefore use it externally to describe the content of the MEROPS system. However, the term peptidase is technically preferable and is used within the MEROPS database. But there is little difference in meaning, and either term can be used without serious ambiguity.
Peptidases are all those enzymes that hydrolyze peptide bonds in proteins and fragments of proteins. In so doing, they cause irreversible modification or destruction of their substrates that may be of biological importance in many different ways. Peptidases have been of interest to mankind for hundreds of years because of the many ways in which they are involved in human physiology, pathology, and technology. The strongest of the forces currently driving research on peptidases is the fact that they are attractive targets for drug design. The concept that peptidase inhibitors can make effective drugs has been validated most dramatically for retropepsin, the processing endopeptidase of the human immunodeficiency virus, several inhibitors of which have proved to be potent antiviral agents (Wlodawer and Vondrasek, 1998). Countless additional examples of potential or proven practical relevance of peptidases have recently been described (Barrett et al., 1998).
Peptidases are numerous, representing about 2% of all gene products (Rawlings and Barrett, 1999) and nearly 10% of the enzymes included in the EC list (Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, 1992). Being the subjects of an active field of research, and also being numerous, peptidases require a good system of classification and nomenclature. Without this, communications between scientists, and the storage and retrieval of data arising from the research, are inefficient and are wasteful of precious resources. A good classification is also of heuristic value, posing significant questions that might otherwise not have been asked.
Peptidases cannot satisfactorily be classified by the methods that are used for most other kinds of enzymes. The reasons for this have been discussed fully elsewhere (Barrett, 2001), but the essence of the problem is that all peptidases catalyze essentially the same reaction, hydrolysis of a peptide
bond. The differences between the individual enzymes concern the position of the scissile bond in the peptide chain and the preferred sequence of amino acids around it, but there seems to be no way in which these differences can be used to construct sound and comprehensive systems of classification and nomenclature.
A first step toward a new type of classification specifically for peptidases was taken when Hartley (1960) proposed classification by the chemistry of the active site. This was a valuable proposal and has been implemented in one form or another in most successful schemes that have been developed since. But only four separate catalytic types could be recognized by Hartley, and even with the use of the concept of catalytic type, the EC list of IUBMB is able to divide peptidases into only 14 sub-subclasses. This is not adequate for such a large and diverse group of enzymes, so in the early 1990s two of the present authors began to think about a new approach to the classification of peptidases. It was fortuitous that it was just at this time that the flow of deduced primary sequences of proteins started to become a flood. Crystallographic data for the three-dimensional structures of proteases also became more abundant, and the scheme that was developed took advantage of both primary and tertiary structures. The aim was to bring together sets of peptidases that were similar both in molecular structure and in evolutionary origin (Rawlings and Barrett, 1993). The system formed the basis of the MEROPS database that was first published on the World Wide Web in 1996. In the present article we shall describe the recent form of MEROPS as it can be seen in Release 5.4 (March 2001).
THE MEROPS CLASSIFICATION OF PEPTIDASES
The organizational framework of the MEROPS database is one in which molecular structures are used to assign individual peptidases to families and to group the families into clans. A concise and unambiguous identifier is provided for each of the peptidases, families, and clans.
The Concept of Peptidase Unit
In setting out to classify peptidases by structural criteria, we were immediately confronted by a problem that is faced by anyone trying to classify proteins: that of their chimeric nature. The molecules of many proteins are built up of distinct structural modules that seem to have been “mixed and matched” during evolution. Each of these modules tends to have its own set of structural and evolutionary relationships to other proteins, and the relationships of one are commonly orthogonal to those of another, despite the fact that they are parts of the
same protein. But the fact that we were working with a single functional group of proteins immediately suggested a solution: we would work only with the parts of the peptidases that are responsible for their peptide hydrolase activity; this we termed the peptidase unit. The peptidase unit is required to contain all of the known active site residues and in most cases it is the part of the amino acid sequence that can be aligned with the smallest mature active peptidase known in the family. In a few families even the smallest peptidase is evidently chimeric, and when a domain can be recognized as having a nonpeptidase function it is excluded from the peptidase unit.
Assigning a Peptidase to a Family
Each family of peptidases is assembled around a founder member called a type example. A peptidase is assigned to the family when its amino acid sequence is found to show statistically significant similarity to that of the peptidase unit of the type example or to the peptidase unit of an existing member of the family. The relationships within a peptidase family are therefore transitive. The statistical significance of amino acid sequence relationships is evaluated with the BLAST program (Altschul et al., 1990), and a relationship is considered significant if the e score is <0.01 against the complete nonredundant database. The peptidases are then regarded as homologous in the strict sense (Reeck et al., 1987). Sometimes, the dendrogram for a family shows that the members of the family split into two or more groups very long ago. A divergence of 150 accepted point mutations per hundred residues (PAM) (Dayhoff et al., 1983) is taken to justify the recognition of subfamilies.
Assigning a Family to a Clan
Whenever possible, families are grouped into clans. A clan is a group of families for which there is evidence of common ancestry.
Such evidence comes from similarities in crystallographically determined tertiary structures or sometimes from identical order of catalytic residues in the linear sequence and similar sequence motifs around the catalytic residues. A clan may consist of a single family if the tertiary structure of its peptidases is unlike that of peptidases in any other family. Three-dimensional structures are not easy to represent or compare, and we have found that simple, linear representations of the secondary structures of type examples of the families are often sufficient to illustrate the clan-level relationships. Secondary structure is calculated from a PDB file according to the rules of Kabsch and Sander (1983), and a Perl script is used to plot a GIF image.
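The family-level rule given earlier in this section (a peptidase joins a family when its sequence matches the peptidase unit of the type example, or of any existing member, with a BLAST e score below 0.01) makes membership transitive. A minimal Python sketch of that grouping follows; it is an illustration and not the MEROPS implementation. The function name, the tuple layout of the hits, and the labels are assumptions, and the hits are presumed to have been parsed from BLAST output beforehand.

```python
from collections import defaultdict

E_SCORE_CUTOFF = 0.01  # significance threshold quoted in the text


def assign_families(peptidases, hits, cutoff=E_SCORE_CUTOFF):
    """Group peptidase units into families by transitive significant similarity.

    peptidases: iterable of labels (hypothetical identifiers).
    hits: iterable of (query, subject, e_score) tuples; the layout is an
          assumption, e.g. rows parsed from a BLAST report.
    Returns a sorted list of families, each a sorted list of labels.
    """
    parent = {p: p for p in peptidases}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, e_score in hits:
        # Make sure labels seen only in the hits are also tracked.
        parent.setdefault(query, query)
        parent.setdefault(subject, subject)
        if e_score < cutoff:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two groups

    families = defaultdict(set)
    for label in parent:
        families[find(label)].add(label)
    return sorted(sorted(members) for members in families.values())
```

With this sketch, a sequence that matches only a distant member of a family, and not the type example itself, still lands in the same family, which is the transitivity the text describes.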
FIG. 1. Secondary structures in clans containing serine peptidases. Secondary structures of representative members of 15 families in 8 clans containing serine peptidases were calculated from the structural coordinates as described in the text. Clan PA is so named because it also contains homologous cysteine peptidases in family C3. Drawings are to scale (by residue numbers), and the elements of secondary structure that are represented are alpha helix (orange), beta strand (green), and coil (cyan). Catalytic residues are marked, and boxes are intended to emphasize the conservation of structural elements around catalytic residues in the clan. Names of peptidases (MEROPS identifier, PDB accession): 1, trypsin (S01.151, 1JRS); 2, togavirin (S03.001, 2SNV); 3, flavivirin (S07.001, 1BEF); 4, hepacivirin (S29.001, 1A1R); 5, picornain 3C (C03.005, 1HAV); 6, subtilisin (S08.001, 1SCN); 7, pseudomonapepsin (S53.001, personal communication from A. Wlodawer), 8, carboxypeptidase Y (S10.001, 1CPY); 9, prolyl aminopeptidase (S33.001, 1QTR); 10, K15 DD-transpeptidase (S11.004, 1SKF); 11, D-Ala-D-Ala carboxypeptidase B (S12.001, 3PTE); 12, signalase (S26.001, 1B12); 13, assemblin (herpes simplex virus) (S21.001, 1AT3); 14, Clp protease (S14.001, 1TYF); 15, C-terminal processing protease (Synechocystis) (S41.002, personal communication from D.-I. Liao).
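The linear representation used for Fig. 1 reduces each structure to runs of helix, strand, and coil along the residue numbering. The Python sketch below shows one way such runs could be derived from a per-residue secondary-structure string. It is an assumption-laden illustration: the authors used a Perl script that is not reproduced in the article, and the one-letter codes here ("H" for helix, "E" for strand, anything else treated as coil) are a simplification chosen for the example.

```python
from itertools import groupby


def classify(code):
    """Map a one-letter per-residue code to a drawing class (assumed mapping)."""
    if code == "H":
        return "helix"
    if code == "E":
        return "strand"
    return "coil"


def segments(ss_string, first_residue=1):
    """Collapse a per-residue string into (class, start, end) runs.

    Residue positions are inclusive and start at first_residue.
    """
    result = []
    position = first_residue
    for kind, run in groupby(ss_string, key=classify):
        length = sum(1 for _ in run)
        result.append((kind, position, position + length - 1))
        position += length
    return result


if __name__ == "__main__":
    for kind, start, end in segments("--EEEE-HHH"):
        print(f"{kind:6s} {start:3d}-{end:3d}")
```

Each tuple can then be drawn as a coloured bar whose length is proportional to the number of residues, which is what makes drawings of different families comparable to scale.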
Figure 1 illustrates this approach as applied to the clans that contain serine peptidases. In clan PA of trypsin and its relatives (named for “protein nucleophile” catalytic type because the clan contains cysteine as well as serine peptidases), the boxed segments show the conservation of structures around the catalytic His, Ser, and Cys residues in these proteins, largely composed of beta structures. In contrast, the structures around the catalytic serines in the two families of clan SB (subtilisin and relatives) agree well with each other, but are completely different from those in clan PA. Clans SC and SE similarly show clear links between the families they contain. The structures of the single families in clans SF, SH, SK, and SM are all very different from the others and show why separate clans have been established for these families; they seem more likely to have resulted from separate evolutionary events than to have diverged from the ancestors of any of the other clans. The structures shown in Fig. 1 are all known structures from crystallographic coordinates, but we note that, unlike tertiary structures, secondary structures can often be deduced from amino acid sequences with fair confidence (Cuff and Barton, 2000), so it may sometimes be possible to make a provisional assignment of a family to a clan in the absence of a crystallographic structure.
Identifiers for Peptidases, Families, and Clans
Each family is given an identifier formed from a letter and a number (such as M24). The letter represents the catalytic type: A, aspartic; C, cysteine; M, metallo; S, serine; T, threonine; or U, unknown. The numbers are assigned sequentially for each catalytic type regardless of the clan to which the family may be assigned. An identifier for each subfamily is formed by adding a letter to the family name (e.g., M24A).
The MEROPS identifier for each individual peptidase is formed from the identifier of the family to which it belongs followed by a point and a three-digit number assigned arbitrarily (e.g., M24.003). Each clan is given an identifier formed of two letters (e.g., “MA”), the first letter representing the catalytic type of the peptidases in the clan and the second letter being added sequentially. A clan that contains families of more than one of the types C, S, and T is described as being of type P. As far as can be managed, the MEROPS identifiers of peptidases, families, and clans are stable. If an identifier disappears as a result of a reassignment, the identifier is not reused. As a result, the sequences of identifiers are no longer continuous, but this is of no significance.
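The identifier rules above are regular enough to be checked mechanically. The following Python sketch classifies a string as a peptidase, subfamily, family, or clan identifier using only the forms described in the text (M24.003, M24A, M24, MA). The function name and the returned dictionary layout are assumptions made for illustration; MEROPS itself may accept forms that this sketch does not cover.

```python
import re

CATALYTIC_TYPES = {
    "A": "aspartic", "C": "cysteine", "M": "metallo",
    "S": "serine", "T": "threonine", "U": "unknown",
}
# Clans may additionally be of type P (families of more than one of C, S, T).
CLAN_TYPES = dict(CATALYTIC_TYPES, P="mixed (more than one of C, S, T)")

# Order matters only for readability; fullmatch keeps the forms distinct.
_PATTERNS = [
    ("peptidase", re.compile(r"([ACMSTU])(\d+)\.(\d{3})")),  # e.g. M24.003
    ("subfamily", re.compile(r"([ACMSTU])(\d+)([A-Z])")),    # e.g. M24A
    ("family", re.compile(r"([ACMSTU])(\d+)")),              # e.g. M24
    ("clan", re.compile(r"([ACMSTUP])([A-Z])")),             # e.g. MA, PA
]


def parse_identifier(text):
    """Return the kind, catalytic type, and parts of a MEROPS-style identifier."""
    for kind, pattern in _PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            letter = match.group(1)
            types = CLAN_TYPES if kind == "clan" else CATALYTIC_TYPES
            return {
                "kind": kind,
                "catalytic_type": types[letter],
                "parts": match.groups(),
            }
    raise ValueError(f"not a recognized identifier form: {text!r}")


if __name__ == "__main__":
    for example in ("M24", "M24A", "M24.003", "MA", "PA"):
        print(example, parse_identifier(example))
```

Because identifiers are never reused and gaps are allowed, a parser of this kind should validate form only and should not assume that numbers are contiguous.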
PRESENTATION OF INFORMATION IN THE MEROPS DATABASE
Once a system of nomenclature and classification is in place, all sorts of information can be organized around it. In the MEROPS database this is done (a) by a system of fixed HTML pages that are made accessible by links from a set of index files and (b) by online search functions that interrogate an underlying database.
Card Files and Their Indexes
The database contains four major types of pages: PepCards, FamCards, ClanCards, and SpecCards, each of which leads to information on a single peptidase, family, clan, or species of organism. The use of the term “card” for these files originated in a style that was used in earlier releases of the MEROPS database, but it is still serviceable. In the present release, a card file appears in the upper frame of the browser window and contains only a title and a set of menu buttons. Each of the buttons opens a subpage in the lower frame, which initially contains a Summary file.
Any PepCard may be reached through one of three indexes. These list each peptidase by name or pseudonym, by MEROPS identifier, or by the organism in which the peptidase has been found. The PepCard is effectively a set of subpages built around a Summary page that gives data for assignments to family and clan, and sub-subclass in the IUBMB EC list. Many PepCards contain a brief statement of the known biomedical relevance of the enzyme, and there are also data for human and mouse genetics when appropriate. There is even a PepCard for a peptidase that cannot yet be assigned to a family because no complete amino acid sequence is available, so long as it has been characterized biochemically. A Sequences subpage makes links to the appropriate entries in nucleic acid and amino acid sequence databanks. If tertiary structural coordinates have been deposited, a Structure page lists PDB entries and usually provides a rendered molecular image.
From PepCards, the user can also explore alignments of the deduced amino acid sequences of human or mouse expressed sequence tags (ESTs) that match, or nearly match, the sequence (see below). A FamCard contains links to external databases of sequence motifs (PROSITE) and secondary (HSSP) and tertiary structure (CATH, SCOP). There is a list of all peptidases that have been assigned to the family, with links to their PepCards. There are also a protein sequence alignment for the peptidase units of members of the family and a tree showing how the members of the family are related
FIG. 2. Specificity search. The input form for this search provides a set of eight pull-down boxes representing the P4 to P4′ residues of the substrate. In each of these the user may select an amino acid abbreviation or a symbol from an extended set representing various types of amino acids, blocking groups, and others. The result shows substrates that are reported to be cleaved appropriately and a link to the PepCard of the peptidase(s) responsible.
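The subsite search of Fig. 2 amounts to matching an eight-position pattern (P4, P3, P2, P1, P1′, P2′, P3′, P4′) against stored cleavage sites, with the scissile bond between the fourth and fifth positions. The Python sketch below illustrates the idea under stated assumptions: the wildcard "X", the class symbols "+" and "-", the record layout, and the three toy records are all hypothetical and are not taken from MEROPS, whose extended symbol set is not listed in the article.

```python
WILDCARD = "X"  # assumed symbol meaning "any residue"

# Hypothetical class symbols standing in for the "extended set" of the form.
CLASSES = {
    "+": set("KRH"),  # basic residues
    "-": set("DE"),   # acidic residues
}


def position_matches(symbol, residue):
    """True if one pattern symbol accepts one residue (one-letter code)."""
    if symbol == WILDCARD:
        return True
    if symbol in CLASSES:
        return residue in CLASSES[symbol]
    return symbol == residue


def search_sites(pattern, records):
    """Return records whose P4..P4' window matches the eight-symbol pattern."""
    if len(pattern) != 8:
        raise ValueError("pattern must cover the eight positions P4 to P4'")
    return [
        record for record in records
        if len(record["site"]) == 8
        and all(position_matches(s, r) for s, r in zip(pattern, record["site"]))
    ]


if __name__ == "__main__":
    # Toy records for illustration only; names and sites are invented.
    toy_records = [
        {"peptidase": "pepA", "substrate": "toy substrate 1", "site": "AGLKSVDE"},
        {"peptidase": "pepB", "substrate": "toy substrate 2", "site": "LLVFAGSA"},
        {"peptidase": "pepC", "substrate": "toy substrate 3", "site": "GGPRIVGG"},
    ]
    # Any site with a basic residue in P1 (fourth position).
    for hit in search_sites("XXX+XXXX", toy_records):
        print(hit["peptidase"], hit["substrate"], hit["site"])
```

In a real service each matching record would also carry the MEROPS identifier of the peptidase, so that the result can link to its PepCard as the caption describes.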
(Rawlings and Barrett, 2000). To construct these, the amino acid sequences of peptidase units for the family are aligned by use of the CLUSTAL W program (Higgins et al., 1996), and the sequence alignment is formatted as an HTML page annotated to highlight the peptidase unit, catalytic residues, disulfide bridges, carbohydrate attachment sites, and transmembrane domains. A difference matrix is computed from the alignment, and percentage differences are converted to accepted point mutations (Dayhoff et al., 1983). The tree is then constructed by use of the KITSCH program of the PHYLIP suite (Felsenstein, 1989).
Each ClanCard contains a brief description of the common features of the members of the clan and a list of the families contained in the clan. Like the FamCards, ClanCards show the distribution of the peptidases in the group across the major kingdoms of living creatures. There are about 1200 SpecCards, each of which provides an abbreviated taxonomy for the species of organism, a list of all the peptidases known from it, and commonly the names of their genes.
For each PepCard, FamCard, and ClanCard, MEROPS provides an up-to-date reference list summarizing the literature on the peptidase, family, or clan concerned. Many of the references are not easily retrieved through conventional searches of the literature databases because of the muddled and redundant terminology in common use for peptidases. When possible, a reference is linked to an abstract in the PubMed database at the National Center for Biotechnology Information.
Search Functions
Online searches are an important feature of the MEROPS database. The user can search for a peptidase by its name (complete or partial) or by a known accession number from another database. The user can also retrieve all peptidases assigned to a specified human chromosome and, for any given family or clan, may find all peptidases for which structural coordinates have been deposited. The online searches also provide access to data on the specificity of peptidases. A collection of known cleavage sites in synthetic substrates, peptides, and proteins can be searched by providing the name of the substrate or the name of the peptidase or by filling substrate specificity subsites in a form. The search by specificity subsite is illustrated in Fig. 2.
Retrieval and Presentation of EST Data
Attached to many of the PepCards in MEROPS are alignments of the deduced amino acid sequences of human or mouse ESTs that match, or nearly match, the amino acid sequence of the peptidase. The strategy used to construct these depends upon searching with a complete set of query sequences for a given family of peptidases in one operation and is summarized in Scheme 1. This procedure is run for all families of peptidases that have been recognized in humans or the mouse. The EST alignments that result from this approach can yield several kinds of information. The ESTs in the closely matching alignment may be presumed to be products of the known gene and commonly indicate errors in the deposited query sequence, as well as polymorphisms and splice variations. Taken in conjunction with the table of EST data, they can show differences in expression among tissues, developmental states, and disease conditions. The alignment of ESTs that match the query sequence more poorly (but nevertheless better than
FIG. 3. Human homologues of membrane dipeptidase. Detail from an alignment of human ESTs matching the human M19.001 query sequence (top, shaded). Mismatching residues are printed in red. The upper set of ESTs is that matching to better than 95% identity. The lower set matched less well and reveals the presence of two human homologues of the membrane dipeptidase. The gray area to the right is that beyond the peptidase unit. The accession number of each EST (in the form of a link to GenBank) is visible on the right. The novel members of family M19 have now been cloned and sequenced as AJ295149 and AJ291679.
FIG. 4. Sequence of human aminopeptidase B. Detail from an alignment of human ESTs matching the rat aminopeptidase B (M01.014) query sequence (top, shaded). The set of human ESTs revealed the full-length human sequence and led to cloning and resequencing of it in our laboratory.
FIG. 5. Novel mammalian homologue. Detail from an alignment of mouse ESTs matching the Caenorhabditis elegans M08.002 query sequence (top, shaded). The set of mouse ESTs revealed the existence of a leishmanolysin homologue, previously unknown from mammals. The yellow bars indicate cysteine residues forming disulfide bonds potentially conserved between the leishmanolysin of Leishmania and the mouse protein.
SCHEME 1. Procedure for analysis of expressed sequence tags for one family of peptidases.
1. Assemble a library of full-length amino acid query sequences for the peptidase family that contains amino acid sequences for all known examples from the species of interest together with selected other mammalian sequences.
2. With TFASTX (Pearson et al., 1997), search a library containing all available ESTs for the given species with each of the query sequences.
3. Merge information from all the result files and reformat output so as to show for each query sequence the set of ESTs that match this better than any other.
4. Divide the set of ESTs matching each query sequence into (a) ESTs that match to better than 95% identity and so probably are the products of the known gene and (b) those that match less well and may be the products of novel homologous genes.
5. Assemble a sequence alignment for each subset and format this in HTML, marking features including extent of the peptidase unit and known catalytic residues.
6. Collect data for each EST regarding IMAGE clone number, description of library from which the EST was derived, and Unigene cluster to which the EST is assigned, if any. Tabulate these data.
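Steps 3 and 4 of Scheme 1 can be sketched in a few lines of Python. The sketch assumes that the search results have already been parsed into tuples of EST identifier, query identifier, alignment score, and percent identity, and that "matches better" means a higher score; neither the tuple layout nor that choice of criterion is specified in the article, and the function and key names are hypothetical.

```python
from collections import defaultdict

IDENTITY_CUTOFF = 95.0  # percent identity threshold from step 4


def partition_ests(hits, cutoff=IDENTITY_CUTOFF):
    """Assign each EST to its best query, then split by percent identity.

    hits: iterable of (est_id, query_id, score, percent_identity) tuples
          (assumed layout, e.g. parsed from the search result files).
    Returns {query_id: {"known_gene": [...], "possible_homologue": [...]}}.
    """
    # Step 3: keep, for every EST, only the query it matches best.
    best = {}
    for est_id, query_id, score, identity in hits:
        current = best.get(est_id)
        if current is None or score > current[1]:
            best[est_id] = (query_id, score, identity)

    # Step 4: divide each query's ESTs at the identity cutoff.
    result = defaultdict(lambda: {"known_gene": [], "possible_homologue": []})
    for est_id, (query_id, _score, identity) in sorted(best.items()):
        subset = "known_gene" if identity > cutoff else "possible_homologue"
        result[query_id][subset].append(est_id)
    return dict(result)


if __name__ == "__main__":
    # Invented hits for illustration; identifiers are placeholders.
    example_hits = [
        ("est1", "query_A", 300.0, 99.0),
        ("est1", "query_B", 120.0, 60.0),
        ("est2", "query_A", 150.0, 70.0),
        ("est3", "query_B", 200.0, 96.5),
    ]
    for query, subsets in partition_ests(example_hits).items():
        print(query, subsets)
```

Assigning each EST to a single best query before applying the identity cutoff is what keeps an EST from a known paralogue out of another query's "possible homologue" subset.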
they match any other known peptidase) may well be found to contain sequences of novel homologues. In the alignment, these novel homologues may be recognized with confidence because of the repeating patterns of mismatching residues that are highlighted in the sequences (Fig. 3). Alignments of ESTs that match query sequences from other organisms sometimes provide part or all of the sequence of a human or mouse peptidase before it has been determined experimentally and show EST clones that will facilitate the experimental confirmation of the sequence (Fig. 4). A non-human/mouse query sequence sometimes even leads to the discovery of a family of peptidases in mammals that was not previously known to be expressed in such organisms (Fig. 5).
SOME THINGS ONE MIGHT DISCOVER WITH THE MEROPS DATABASE
1. For any given peptidase, MEROPS shows which other peptidases are closely related at the family level or are more distantly related in the same clan. These relationships tend to carry with them similarities in properties, such as dominant specificity subsites or sensitivity to classes of inhibitors, that are of value in the design of inhibitors as potential drugs.
2. MEROPS provides simple identifiers for the peptidases, families, and clans. Numerous alternative names are in common use, and these are frequently ambiguous or even misleading. The MEROPS identifiers are concise and unambiguous and are increasingly being used to improve clarity in other databases and published articles.
3. The user can discover what peptidases are so far known from human, or mouse, or any of over 1000 other organisms, including about 30 with completely sequenced genomes.
4. The user can also learn how a given peptidase, family, or clan is distributed through the major groups of organisms. This information can suggest model systems in which to study the functions of an enzyme. It is also helpful to anyone planning work with an enzyme that will need to be species-specific, such as the use of a peptidase inhibitor as a chemotherapeutic agent against a pathogenic organism.
5. It will become clear what tertiary structures are available for the peptidase, family, or clan of interest.
6. At the most general level, MEROPS can serve as a model for a way in which to construct a database for a functionally defined group of proteins. Focusing on the functional unit (the peptidase unit, in MEROPS) is an appropriate way to resolve the problem of chimerism. Structural classification at the two levels of family (primary structure) and clan (tertiary structure) is widely applicable, and a simple naming system for the families and clans is both easy to set up and valuable. Once the classification is in place, all manner of other information can be attached to the framework.
We received financial support from the Medical Research Council (UK) and the Babraham Institute.
REFERENCES
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool, J. Mol. Biol. 215, 403–410.
Barrett, A. J. (2001) Proteolytic enzymes: Nomenclature and classification, in R. J. Beynon and J. S. Bond (Eds.), Proteolytic Enzymes. A Practical Approach, 2nd ed., Oxford University Press, Oxford.
Barrett, A. J., Rawlings, N. D., and Woessner, J. F. (Eds.) (1998) Handbook of Proteolytic Enzymes, Academic Press, London.
Cuff, J. A., and Barton, G. J. (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction, Proteins 40, 502–511.
Dayhoff, M. O., Barker, W. C., and Hunt, L. T. (1983) Establishing homologies in protein sequences, Methods Enzymol. 91, 524–545.
Felsenstein, J. (1989) PHYLIP - Phylogeny inference package, Cladistics 5, 164–166.
Hartley, B. S. (1960) Proteolytic enzymes, Annu. Rev. Biochem. 29, 45–72.
Higgins, D. G., Thompson, J. D., and Gibson, T. J. (1996) Using CLUSTAL for multiple sequence alignments, Methods Enzymol. 266, 383–402.
Kabsch, W., and Sander, C. (1983) How good are predictions of protein secondary structure?, FEBS Lett. 155, 179–182.
Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (1992) Enzyme Nomenclature 1992, Academic Press, Orlando. (See also http://www.chem.qmw.ac.uk/iubmb/enzyme/EC34/ for the latest revision of this document).
Pearson, W. R., Wood, T., Zhang, Z., and Miller, W. (1997) Comparison of DNA sequences with protein sequences, Genomics 46, 24–36.
Rawlings, N. D., and Barrett, A. J. (1993) Evolutionary families of peptidases, Biochem. J. 290, 205–218.
Rawlings, N. D., and Barrett, A. J. (1999) MEROPS: The peptidase database, Nucleic Acids Res. 27, 325–331.
Rawlings, N. D., and Barrett, A. J. (2000) MEROPS: The peptidase database, Nucleic Acids Res. 28, 323–325.
Reeck, G. R., de Haën, C., Teller, D. C., Doolittle, R. F., Fitch, W. M., Dickerson, R. E., Chambon, P., McLachlan, A. D., Margoliash, E., Jukes, T. H., and Zuckerkandl, E. (1987) “Homology” in proteins and nucleic acids: a terminology muddle and a way out of it, Cell 50, 667.
Wlodawer, A., and Vondrasek, J. (1998) Inhibitors of HIV-1 protease: A major success of structure-assisted drug design, Annu. Rev. Biophys. Biomol. Struct. 27, 249–284.