THE SWISS-PROT PROTEIN SEQUENCE DATABASE USER MANUAL

Release 39, May 2000

Amos Bairoch
Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire (CMU)
1, rue Michel Servet
1211 Geneva 4
Switzerland
Telephone: +41-22-702 58 60
Fax: +41-22-702 58 58
Electronic mail address: [email protected]
WWW server: http://www.expasy.ch/

Rolf Apweiler
The EMBL Outstation - The European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: +44-1223-494 444
Fax: +44-1223-494 468
Electronic mail address: [email protected]
WWW server: http://www.ebi.ac.uk/


Acknowledgements This release of SWISS-PROT has been prepared by: • Amos Bairoch, Kristian Axelsen, Marie-Claude Blatter-Garin, Brigitte Boeckmann, Silvia Braconi Quintaje, Florence Brunner, Danielle Coral, Sylvie Dethiollaz, Livia Famiglietti, Nathalie Farriol-Mathis, Serenella Ferro, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Chantal Hulo, Nicolas Hulo, Janet James, Silvia Jimenez, Eva Jung, Corinne Lachaize, Karine Michoud, Madelaine Moinat, Bruno Pardo, Catherine Rivoire, Bernd Roechert, Claudia Sapsezian, Christian Sigrist, Shyamala Sundaram, Anne-Lise Veuthey, Julia Williams-Nef and Nadine Zangger at the Swiss Institute of Bioinformatics (SIB) and the Medical Biochemistry Department of the University of Geneva; • Rolf Apweiler, Kirsty Bates, Sergio Contrino, Kirill Degtyarenko, Wolfgang Fleischmann, Gill Fraser, Cathy Gedman, Henning Hermjakob, Vivien Junker, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Evguenia Kriventseva, Fiona Lang, Minna Lehvaslaiho, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Virginie Mittard, Steffen Moeller, Nicola Mulder, Claire O'Donovan, Tom Oinn, John O’Rourke, Isabelle Phan, Sandrine Pilbout, Lucia Rodriguez-Monge, Margaret Shore-Nye, Eleanor Whitfield, Allyson Williams and Evgueni Zdobnov at the European Bioinformatics Institute (EBI). SWISS-PROT contains sequences translated from the EMBL Nucleotide Sequence Database, prepared by the European Bioinformatics Institute. For a recent reference see: Baker W., van den Broek A., Camon E., Hingamp P., Sterk P., Stoesser G. and Tuli M.A.; Nucleic Acids Res. 28:19-23(2000). A small part of the information in SWISS-PROT was originally adapted from information contained in the Protein Sequence Database of the Protein Information Resource (PIR). 
For a recent reference see: Barker W.C., Garavelli J.S., McGarvey P.B., Orcutt B.C., Srinivasarao G.Y., Xiao C., Yeh L.-S.L, Ledley R.S., Janda J.F., Pfeiffer F., Mewes H.-W., Tsugita A. and Wu C.; Nucleic Acids Res. 28:41-44(2000). Cross-references are made in SWISS-PROT to: • The PROSITE dictionary of sites and patterns in proteins prepared by Amos Bairoch and Philipp Bucher at the Swiss Institute of Bioinformatics. Reference: Hofmann K., Bucher P., Falquet L. and Bairoch A.; Nucleic Acids Res. 27:215-219(1999). • The Pfam database of protein domains prepared under the supervision of Richard Durbin and Sean Eddy. Reference: Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L. and Sonnhammer E.L.L.; Nucleic Acids Res. 28:263-266(2000). • The PRINTS database of protein fingerprints prepared under the supervision of Terri Attwood at the University of Manchester. Reference: Attwood T.K., Croning M.D.R., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N. and Wright W.; Nucleic Acids Res. 28:225-227(2000). • The 3D macromolecular structure Protein Data Bank (PDB) prepared by Research Collaboratory for Structural Bioinformatics (RCSB). Reference: Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. and Bourne P.E.; Nucleic Acids Res. 28:235-242(2000). • The database of Homology-derived Secondary Structure of Proteins (HSSP) compiled at the European Bioinformatics Institute (EBI). Reference: Holm L. and Sander C.; Nucleic Acids Res. 27:244-247(1999).


• The DictyDb database prepared by Douglas W. Smith and Bill Loomis from the University of California, San Diego (UCSD). Reference: Smith D.W. and Loomis W.F.; Proceedings of the International Dictyostelium Conference '96. • The Drosophila genome database (FlyBase) prepared under the supervision of Michael Ashburner at the Department of Genetics, University of Cambridge. Reference: Nucleic Acids Res. 27:85-88(1999). • The EcoGene Escherichia coli K12 and StyGene Salmonella typhimurium LT2 genome databases, both prepared by Ken Rudd at the Department of Biochemistry and Molecular Biology of the University of Miami School of Medicine. Reference: Rudd K.E.; Nucleic Acids Res. 28:60-64(2000). • The Maize genome database (MaizeDB) developed by the USDA-ARS Maize Genome Project as part of the National Agricultural Library's Plant Genome Research Program. • The Online Mendelian Inheritance in Man database (OMIM) prepared under the supervision of Victor McKusick at Johns Hopkins University. Reference: Hamosh A., Scott A.F., Amberger J., Valle D. and McKusick V.A.; Hum. Mutat. 15:57-61(2000). • The Mouse Genome Database (MGD) prepared by the Mouse Genome Informatics group at Jackson Laboratory. Reference: Blake J.A., Eppig J.T., Richardson J.E. and Davisson M.T.; Nucleic Acids Res. 28:108-111(2000). • The Saccharomyces Genome Database (SGD) prepared under the supervision of Mike Cherry at Stanford. Reference: Ball C.A., Dolinski K., Dwight S.S., Harris M.A., Issel-Tarver L., Kasarskis A., Scafe C.R., Sherlock G., Binkley G., Jin H., Kaloper M., Orr S.D., Schroeder M., Weng S., Zhu Y., Botstein D. and Cherry J.M.; Nucleic Acids Res. 28:77-80(2000). • The SubtiList relational database for the Bacillus subtilis 168 genome prepared under the supervision of Ivan Moszer at the Pasteur Institute. Reference: Moszer I.; FEBS Lett. 430:28-36(1998).
• The TubercuList relational database for the Mycobacterium tuberculosis H37Rv genome prepared under the supervision of Stewart Cole at the Pasteur Institute. • The WormPep database prepared by Richard Durbin and Erik Sonnhammer from the MRC Laboratory of Molecular Biology and Sanger Center at Hinxton Hall, Cambridge. Reference: Sonnhammer E.L.L. and Durbin R.; Genomics 46:200-216(1997). • The Zebrafish Information Network (ZFIN) database prepared by the Institute of Neuroscience of the University of Oregon. Reference: Westerfield M., Doerry E., Kirkpatrick A.E. and Douglas S.A.; Meth. Cell Biol. 60:339-355(1999).


• The G-protein--coupled receptor database (GCRDb) prepared by Lee Frank Kolakowski at the Department of Pharmacology of the University of Texas, San Antonio. Reference: Kolakowski L.F. Jr.; Receptors Channels 2:1-7(1994). • The restriction enzymes database (REBASE) prepared by Richard Roberts and Dana Macelis at New England BioLabs. Reference: Roberts R.J. and Macelis D.; Nucleic Acids Res. 28:306-307(2000). • The transcription factor database (TRANSFAC) developed under the supervision of Edgar Wingender from the Gesellschaft fuer Biotechnologische Forschung mbH in Braunschweig. Reference: Wingender E., Chen X., Hehl R., Karas H., Liebich I., Matys V., Meinhardt T., Pruess M., Reuter I. and Schacherer F.; Nucleic Acids Res. 28:316-319(2000). • The Encyclopedia of Escherichia coli genes and metabolism (EcoCyc) prepared under the supervision of Peter Karp at Pangea Systems and Monica Riley at MBL. Reference: Karp P.D., Riley M., Saier M., Paulsen I.T., Paley S.M. and Pellegrini-Toole A.; Nucleic Acids Res. 28:56-59(2000). • The 2D-gel protein database (SWISS-2DPAGE) prepared under the responsibility of Denis Hochstrasser and Ron Appel at the Swiss Institute of Bioinformatics. Reference: Hoogland C., Sanchez J.-C., Tonella L., Binz P.-A., Bairoch A., Hochstrasser D.F. and Appel R.D.; Nucleic Acids Res. 28:286-288(2000). • The gene-protein database of Escherichia coli K12 (2D-gel spots) (ECO2DBASE) prepared under the supervision of Ruth VanBogelen. Reference: VanBogelen R.A., Abshire K.Z., Moldover B., Olson E.R. and Neidhardt F.C.; Electrophoresis 18:1243-1251(1997). • The Harefield Hospital 2D gel protein databases prepared under the supervision of Mike Dunn. Reference: Corbett J.M., Wheeler C.H., Baker C.S., Yacoub M.H. and Dunn M.J.; Electrophoresis 15:1459-1465(1994). • The human keratinocyte 2D-gel protein database from the universities of Aarhus and Ghent. 
Reference: Celis J.E., Rasmussen H.H., Olsen E., Madsen P., Leffers H., Honore B., Dejgaard K., Gromov P., Hoffmann H.J., Nielsen M., Vassiliev A., Vintermyr O., Hao J., Celis A., Basse B., Lauridsen J.B., Ratz G.P., Andersen A.H., Walbum E., Kjaergaard I., Puype M., Van Damme J. and Vandekerckhove J.; Electrophoresis 14:1091-1198(1993). • The Yeast Electrophoresis Protein Database (YEPD) prepared under the supervision of Jim Garrels from Proteome Inc. Reference: Payne W.E. and Garrels J.I.; Nucleic Acids Res. 25:57-62(1997). • The Human Retroviruses and AIDS compilation of nucleic and amino acid sequences (HIV Sequence Database) edited by G. Myers, A.B. Rabson, S.F. Josephs, T.F. Smith, J.A. Berzofsky, F. Wong-Staal; published by the Theoretical Biology and Biophysics Group T-10 at Los Alamos National Laboratory; and funded by the AIDS program of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the United States Department of Energy.


Copyright notice

SWISS-PROT is copyright. It is produced through a collaboration between the Swiss Institute of Bioinformatics and the EMBL Outstation - the European Bioinformatics Institute. There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see: http://www.isb-sib.ch/announce/ or send an email to [email protected]. The above copyright notice also applies to this user manual as well as to any other SWISS-PROT documents.

How to submit data or updates/corrections to SWISS-PROT

To submit new sequence data to SWISS-PROT, and for all queries regarding submissions to SWISS-PROT, one should contact:

SWISS-PROT
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 462
Telefax: (+44 1223) 494 468
E-mail: [email protected] (for submissions); [email protected] (for queries)

To submit updates and/or corrections to SWISS-PROT you can either use the E-mail address: [email protected] or the WWW address: http://www.expasy.ch/sprot/sp_update_form.html

Citation

If you want to cite SWISS-PROT in a publication, please use the following reference: Bairoch A. and Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48(2000).


Table of contents

1) What is SWISS-PROT?
2) Conventions used in the database
   2.1 General structure of the database
   2.2 Classes of data
   2.3 Structure of a sequence entry
3) The different line types
   3.1 The ID line
   3.2 The AC line
   3.3 The DT line
   3.4 The DE line
   3.5 The GN line
   3.6 The OS line
   3.7 The OG line
   3.8 The OC line
   3.9 The reference (RN, RP, RC, RX, RA, RT, RL) lines
   3.10 The CC line
   3.11 The DR line
   3.12 The KW line
   3.13 The FT line
   3.14 The SQ line
   3.15 The sequence data line
   3.16 The // line
Appendix A: Feature table keys
   A.1 Change indicators
   A.2 Amino acid modifications
   A.3 Regions
   A.4 Secondary structure
   A.5 Others
Appendix B: Amino acid codes
Appendix C: Format differences between the SWISS-PROT and EMBL databases
   C.1 Generalities
   C.2 Differences in line types present in both databases
   C.3 Line types defined by SWISS-PROT but currently not used by EMBL
   C.4 Line types defined by EMBL but currently not used by SWISS-PROT


(1). What is SWISS-PROT?

SWISS-PROT is an annotated protein sequence database. It was established in 1986 and has been maintained collaboratively, since 1987, by the group of Amos Bairoch, first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics (SIB), and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The SWISS-PROT protein sequence database consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database.

The SWISS-PROT database distinguishes itself from other protein sequence databases by four distinct criteria:

a) Annotation

In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data consists of:

• The sequence data;
• The citation information (bibliographical references);
• The taxonomic data (description of the biological source of the protein).

The annotation consists of the description of the following items:

• Function(s) of the protein;
• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.;
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains, kringle, etc.;
• Secondary structure. For example alpha helix, beta sheet, etc.;
• Quaternary structure. For example homodimer, heterotrimer, etc.;
• Similarities to other proteins;
• Disease(s) associated with deficiency(ies) in the protein;
• Sequence conflicts, variants, etc.

We try to include as much annotation information as possible in SWISS-PROT. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins. We believe that our having systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of SWISS-PROT.


In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by ‘topics’; this approach permits the easy retrieval of specific categories of data from the database.

b) Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

c) Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. SWISS-PROT is currently cross-referenced with about 30 different databases. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT. This extensive network of cross-references allows SWISS-PROT to play a major role as a focal point of biomolecular database interconnectivity.

d) Documentation

SWISS-PROT is distributed with a large number of index files and specialized documentation files. Some of these files have been available for a long time (this user manual, the release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The release notes contain an up-to-date descriptive list of all distributed document files.


(2). Conventions used in the database

The following sections describe the general conventions used in SWISS-PROT to achieve uniformity of presentation. Experienced users of the EMBL Database can skip these sections and directly refer to Appendix C, which lists the minor differences in format between the two data collections.

(2.1). General structure of the database

The SWISS-PROT protein sequence database is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed to the bank or reported in the literature. In some cases, entries have been assembled from several papers that report overlapping sequence regions. Conversely, a single paper can provide data for several entries, e.g. when related sequences from different organisms are reported.

References to positions within a sequence are made using sequential numbering, beginning with 1 at the N-terminal end of the sequence. Except for initiator N-terminal methionine residues, which are not included in a sequence when their absence from the mature sequence has been proven, the sequence data correspond to the precursor form of a protein before post-translational modifications and processing.

(2.2). Classes of data

In order to make data available to users as quickly as possible after publication, SWISS-PROT is now distributed with a supplement called TrEMBL, where entries are released before all their details are finalized. To distinguish between fully annotated entries and those in TrEMBL, the 'class' of each entry is indicated on the first (ID) line of the entry. The two defined classes are:

STANDARD      Data which are complete to the standards laid down by the SWISS-PROT database.
PRELIMINARY   Sequence entries which have not yet been annotated by the SWISS-PROT staff up to the standards laid down by SWISS-PROT. These entries are exclusively found in TrEMBL.

(2.3). Structure of a sequence entry

The entries in the SWISS-PROT database are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used.


Each sequence entry is composed of lines. Different types of lines, each with its own format, are used to record the various data that make up the entry. A sample sequence entry is shown below.

ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   GRANZYME A PRECURSOR (EC 3.4.21.78) (CYTOTOXIC T-LYMPHOCYTE PROTEINASE
DE   1) (HANUKKAH FACTOR) (H FACTOR) (HF) (GRANZYME 1) (CTL TRYPTASE)
DE   (FRAGMENTIN 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=T-cell;
RX   MEDLINE; 88125000.
RA   Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
RT   "Cloning and chromosomal assignment of a human cDNA encoding a T
RT   cell- and natural killer cell-specific trypsin-like serine
RT   protease.";
RL   Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
RN   [2]
RP   SEQUENCE OF 29-53.
RX   MEDLINE; 88330824.
RA   Poe M., Bennett C.D., Biddison W.E., Blake J.T., Norton G.P.,
RA   Rodkey J.A., Sigal N.H., Turner R.V., Wu J.K., Zweerink H.J.;
RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT   and the characterization of inhibitor and substrate specificity.";
RL   J. Biol. Chem. 263:13215-13222(1988).
RN   [3]
RP   SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX   MEDLINE; 89009866.
RA   Hameed A., Lowrey D.M., Lichtenheld M., Podack E.R.;
RT   "Characterization of three serine esterases isolated from human IL-2
RT   activated killer cells.";
RL   J. Immunol. 141:3142-3147(1988).
RN   [4]
RP   SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX   MEDLINE; 89035468.
RA   Kraehenbuhl O., Rey C., Jenne D.E., Lanzavecchia A., Groscurth P.,
RA   Carrel S., Tschopp J.;
RT   "Characterization of granzymes A and B isolated from granules of
RT   cloned human cytotoxic T lymphocytes.";
RL   J. Immunol. 141:3471-3477(1988).
RN   [5]
RP   3D-STRUCTURE MODELING.
RX   MEDLINE; 89184501.
RA   Murphy M.E.P., Moult J., Bleackley R.C., Gershenfeld H.,
RA   Weissman I.L., James M.N.G.;
RT   "Comparative molecular model building of two serine proteinases from
RT   cytotoxic T lymphocytes.";
RL   Proteins 4:190-204(1988).
CC   -!- FUNCTION: THIS ENZYME IS NECESSARY FOR TARGET CELL LYSIS IN CELL-
CC       MEDIATED IMMUNE RESPONSES. IT CLEAVES AFTER LYS OR ARG. MAY BE
CC       INVOLVED IN APOPTOSIS.
CC   -!- CATALYTIC ACTIVITY: HYDROLYSIS OF PROTEINS, INCLUDING FIBRONECTIN,
CC       TYPE IV COLLAGEN AND NUCLEOLIN. PREFERENTIAL CLEAVAGE: ARG-|-XAA,
CC       LYS-|-XAA >> PHE-|-XAA IN SMALL MOLECULE SUBSTRATES.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC GRANULES.
CC   -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S1; ALSO KNOWN AS THE
CC       TRYPSIN FAMILY. STRONGEST TO OTHER GRANZYMES AND TO MAST CELL
CC       PROTEASES.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; M18737; AAA52647.1; -.
DR   PIR; A28943; A28943.
DR   PIR; A30525; A30525.
DR   PIR; A30526; A30526.
DR   PIR; A31372; A31372.
DR   PDB; 1HF1; 15-OCT-94.
DR   MIM; 140050; -.
DR   INTERPRO; IPR001254; -.
DR   PFAM; PF00089; trypsin; 1.
DR   PROSITE; PS00134; TRYPSIN_HIS; 1.
DR   PROSITE; PS00135; TRYPSIN_SER; 1.
KW   Hydrolase; Serine protease; Zymogen; Signal; T-cell; Cytolysis;
KW   Apoptosis; 3D-structure.
FT   SIGNAL        1     26
FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   CHAIN        29    262       GRANZYME A.
FT   ACT_SITE     69     69       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   ACT_SITE    114    114       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   ACT_SITE    212    212       CHARGE RELAY SYSTEM (BY SIMILARITY).
FT   DISULFID     54     70       BY SIMILARITY.
FT   DISULFID    148    218       BY SIMILARITY.
FT   DISULFID    179    197       BY SIMILARITY.
FT   DISULFID    208    234       BY SIMILARITY.
FT   CARBOHYD    170    170       N-LINKED (GLCNAC...) (POTENTIAL).
SQ   SEQUENCE   262 AA;  28968 MW;  DA87363A0D92BAF4 CRC64;
     MRNSYRFLAS SLSVVVSLLL IPEDVCEKII GGNEVTPHSR PYMVLLSLDR KTICAGALIA
     KDWVLTAAHC NLNKRSQVIL GAHSITREEP TKQIMLVKKE FPYPCYDPAT REGDLKLLQL
     TEKAKINKYV TILHLPKKGD DVKPGTMCQV AGWGRTHNSA SWSDTLREVN ITIIDRKVCN
     DRNHYNFNPV IGMNMVCAGS LRGGRDSCNG DSGSPLLCEG VFRGVTSFGL ENKCGDPRGP
     GVYILLSKKH LNWIIMTIKG AV
//


Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes and the order in which they appear in an entry, are shown in the table below.

Line code   Content                          Occurrence in an entry
ID          Identification                   Once; starts the entry
AC          Accession number(s)              One or more
DT          Date                             Three times
DE          Description                      One or more
GN          Gene name(s)                     Optional
OS          Organism species                 One or more
OG          Organelle                        Optional
OC          Organism classification          One or more
RN          Reference number                 One or more
RP          Reference position               One or more
RC          Reference comment(s)             Optional
RX          Reference cross-reference(s)     Optional
RA          Reference authors                One or more
RT          Reference title                  Optional
RL          Reference location               One or more
CC          Comments or notes                Optional
DR          Database cross-references        Optional
KW          Keywords                         Optional
FT          Feature table data               Optional
SQ          Sequence header                  Once
(blanks)    sequence data                    One or more
//          Termination line                 Once; ends the entry

As shown in the above table, some line types are found in all entries, while others are optional. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). A detailed description of each line type is given in the next section of this document. It must be noted that, with the exception of GN, all SWISS-PROT line types exist in the EMBL Database. A description of the format differences between the SWISS-PROT and EMBL databases is given in Appendix C of this document. The two-character line type code that begins each line is always followed by three blanks, so that the actual information begins with the sixth character. Information is not extended beyond character position 75, with one exception: CC lines that contain the ‘DATABASE’ topic (see section 3.10).
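The fixed layout described above (a two-character line code, three blanks, data from the sixth character onward) lends itself to a very small reader. The following Python sketch is only an illustration under that assumption; the function names `split_line` and `group_by_code` are hypothetical and not part of any SWISS-PROT software. The input lines are taken from the sample entry in section 2.3.

```python
def split_line(line):
    """Return (line_code, data) for one line of an entry.

    The code occupies columns 1-2 and the data starts at column 6.
    Sequence data lines begin with blanks, so their code is ''.
    """
    return line[:2].strip(), line[5:].rstrip()


def group_by_code(entry_text):
    """Map each line code to the list of its data strings, in file order."""
    grouped = {}
    for line in entry_text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        code, data = split_line(line)
        grouped.setdefault(code, []).append(data)
    return grouped


# A few lines copied from the sample entry GRAA_HUMAN.
sample = "\n".join([
    "ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.",
    "AC   P12544;",
    "GN   GZMA OR CTLA3 OR HFSP.",
    "     GVYILLSKKH LNWIIMTIKG AV",
    "//",
])
grouped = group_by_code(sample)
print(grouped["AC"])   # data of the AC line(s)
print(grouped[""])     # sequence data lines (blank line code)
```

Grouping by line code discards the relative order of different line types, which is acceptable for simple look-ups but not for the reference block, where RN, RP, RX, RA, RT and RL lines repeat once per reference.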


(3). The different line types

(3.1). The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME  DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.

(3.1.1). Entry Name

The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence. The entry name consists of up to ten uppercase alphanumeric characters. SWISS-PROT uses a general purpose naming convention that can be symbolized as X_Y, where:

• X is a mnemonic code of at most 4 alphanumeric characters representing the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is for Hemoglobin alpha chain and INS is for Insulin;
• The ‘_’ sign serves as a separator;
• Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein. This code is generally made of the first three letters of the genus and the first two letters of the species. Examples: PSEPU is for Pseudomonas putida and NAJNI is for Naja nivea.

However, for species most commonly encountered in the database, self-explanatory codes are used. There are 16 of those codes. They are: BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia coli, HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat (Triticum aestivum), YEAST for Baker's yeast (Saccharomyces cerevisiae).

As it was not possible to apply the above rules to viruses, they were given arbitrary, but generally easy to remember, identification codes. In some cases it was not possible to assign a definitive code to a species. In these cases a temporary code was chosen.

Examples of complete protein sequence entry names are: RL1_ECOLI for ribosomal protein L1 from Escherichia coli, FER_HALHA for ferredoxin from Halobacterium halobium. The names of all the presently defined species identification codes are listed in the SWISS-PROT document file SPECLIST.TXT.

(3.1.2). Data class

The second item on the ID line indicates the data class of the entry (see section 2.2).
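The X_Y entry-name convention of section 3.1.1 can be checked mechanically. The Python sketch below is a hedged illustration only: the names `ENTRY_NAME_RE`, `split_entry_name` and `default_species_code` are hypothetical, and the second function applies just the general "three letters of the genus plus two of the species" rule, so it will not reproduce the 16 self-explanatory codes, the virus codes or any temporary code (SPECLIST.TXT remains the authoritative list).

```python
import re

# X: 1-4 uppercase alphanumerics, '_' separator, Y: 1-5 uppercase alphanumerics.
ENTRY_NAME_RE = re.compile(r"([A-Z0-9]{1,4})_([A-Z0-9]{1,5})")


def split_entry_name(entry_name):
    """Split an entry name into (protein_code, species_code)."""
    match = ENTRY_NAME_RE.fullmatch(entry_name)
    if match is None:
        raise ValueError(f"not a valid entry name: {entry_name!r}")
    return match.group(1), match.group(2)


def default_species_code(genus, species):
    """Apply only the general rule: 3 letters of genus + 2 of species."""
    return (genus[:3] + species[:2]).upper()


print(split_entry_name("RL1_ECOLI"))
print(default_species_code("Pseudomonas", "putida"))
```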


(3.1.3). Molecule type

The third item on the ID line is a three-letter code that indicates the type of molecule of the entry: in SWISS-PROT it is ‘PRT’ (for PRoTein).

(3.1.4). Length of the molecule

The fourth and last item of the ID line is the length of the molecule, which is the total number of amino acids in the sequence. This number includes the positions reported to be present but which have not been determined (coded as ‘X’). The length is followed by the letter code ‘AA’ (Amino Acids).

(3.1.5). Examples of identification lines

Two examples of ID lines are shown below:

ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.
ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.
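The four items of an ID line can be pulled apart with one regular expression. The Python sketch below is an illustration under the assumption that items are separated by one or more blanks, as in the examples of section 3.1.5; `ID_RE` and `parse_id_line` are hypothetical names.

```python
import re

# ID   ENTRY_NAME  DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID_RE = re.compile(r"ID   (\S+)\s+(\w+);\s+(\w+);\s+(\d+) AA\.")


def parse_id_line(line):
    """Return the four items of an ID line as a dictionary."""
    match = ID_RE.fullmatch(line.rstrip())
    if match is None:
        raise ValueError(f"not an ID line: {line!r}")
    entry_name, data_class, molecule_type, length = match.groups()
    return {
        "entry_name": entry_name,
        "data_class": data_class,        # STANDARD or PRELIMINARY
        "molecule_type": molecule_type,  # 'PRT' in SWISS-PROT
        "length": int(length),           # number of amino acids
    }


print(parse_id_line("ID   CYC_BOVIN      STANDARD;      PRT;   104 AA."))
```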

(3.2). The AC line

The AC (ACcession number) line lists the accession number(s) associated with an entry. The format of the AC line is:

AC   AC_number_1;[ AC_number_2;]...[ AC_number_N;]

An example of an accession number line is shown below:

AC   P00321; P05348;

Semicolons separate the accession numbers and a semicolon terminates the list. If necessary, more than one AC line can be used. Example:

AC   Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894; Q92895;
AC   Q93053; Q99605; O00713; O00714; O00715;

The purpose of accession numbers is to provide a stable way of identifying entries from release to release. It is sometimes necessary for reasons of consistency to change the names of the entries, for example, to ensure that related entries have similar names. However, an accession number is always conserved, and therefore allows unambiguous citation of SWISS-PROT entries. Researchers who wish to cite entries in their publications should always cite the first accession number. This is commonly referred to as the ‘primary accession number’. Entries will have more than one accession number if they have been merged or split. For example, when two entries are merged into one, the accession numbers from both entries are stored into the AC line(s).
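Because every accession number is followed by a semicolon and the list may continue over several AC lines, collecting the numbers is a matter of splitting on semicolons. The Python sketch below illustrates this; `parse_accessions` is a hypothetical helper, and the first element of its result corresponds to the primary accession number described above.

```python
def parse_accessions(ac_lines):
    """Collect accession numbers from one or more AC lines, in order."""
    accessions = []
    for line in ac_lines:
        data = line[5:]  # skip the line code and the three blanks
        accessions.extend(part.strip() for part in data.split(";") if part.strip())
    return accessions


ac_lines = [
    "AC   Q16653; Q14855; Q13054; Q13055; Q92891; Q92892; Q92893; Q92894; Q92895;",
    "AC   Q93053; Q99605; O00713; O00714; O00715;",
]
accessions = parse_accessions(ac_lines)
primary = accessions[0]  # the accession number to cite
print(primary, len(accessions))
```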


If an existing entry is split into two or more entries (a rare occurrence), the original accession numbers are retained in all the derived entries and a new primary accession number is added to all the entries. An accession number is dropped only when the data to which it was assigned have been completely removed from the database. Deleted accession numbers are listed in the SWISS-PROT document file DELETEAC.TXT.

Accession numbers consist of 6 alphanumerical characters in the following format:

Position     1         2       3            4            5            6
Characters   [O,P,Q]   [0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

Here are some examples of valid accession numbers: P12345, Q1AAA9, O456A1 and P4A123.
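The six-position format above translates directly into a regular expression. The Python sketch below is a minimal check of that format only (the names `ACCESSION_RE` and `is_valid_accession` are hypothetical); it says nothing about whether a well-formed number is actually assigned to an entry.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or digit;
# position 6: digit.
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(accession):
    """Return True if the string has the accession number format."""
    return ACCESSION_RE.fullmatch(accession) is not None


for candidate in ["P12345", "Q1AAA9", "O456A1", "P4A123", "A12345"]:
    print(candidate, is_valid_accession(candidate))
```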

(3.3). The DT line

The DT (DaTe) lines show the date of creation and last modification of the database entry. The format of the DT line is:

DT   DD-MMM-YYYY (Rel. XX, Comment)

Where ‘DD’ is the day, ‘MMM’ the month, ‘YYYY’ the year, and ‘XX’ the SWISS-PROT release number. The comment portion of the line indicates the action taken on that date. There are always three DT lines in each entry, each associated with a specific comment:

• The first DT line indicates when the entry first appeared in the database. The associated comment is ‘Created’;
• The second DT line indicates when the sequence data was last modified. The associated comment is ‘Last sequence update’;
• The third DT line indicates when data (see the note below) other than the sequence was last modified. The associated comment is ‘Last annotation update’.

Example of a block of DT lines:

DT   01-AUG-1988 (Rel. 08, Created)
DT   01-JAN-1990 (Rel. 13, Last sequence update)
DT   15-APR-1999 (Rel. 38, Last annotation update)
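A DT line can be decomposed into its date, release number and comment. The Python sketch below is an illustration under the assumption that months are always the three-letter uppercase English abbreviations seen in the examples; `DT_RE`, `MONTHS` and `parse_dt_line` are hypothetical names. An explicit month table is used rather than locale-dependent date parsing.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# DT   DD-MMM-YYYY (Rel. XX, Comment)
DT_RE = re.compile(r"DT   (\d{2})-([A-Z]{3})-(\d{4}) \(Rel\. (\d+), (.+)\)")


def parse_dt_line(line):
    """Return (date, release_number, comment) for one DT line."""
    match = DT_RE.fullmatch(line.rstrip())
    if match is None:
        raise ValueError(f"not a DT line: {line!r}")
    day, month, year, release, comment = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), comment


print(parse_dt_line("DT   15-APR-1999 (Rel. 38, Last annotation update)"))
```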

Concerning the third DT line, one should note that such a line is not updated when an entry is the target of what we call a ‘global change’. A global change is defined as any operation which involves changes in all or most SWISS-PROT entries. These changes are announced in the release notes and are usually linked to formatting issues. As such global changes take place at almost every release, we strongly advise users of the database to completely reload SWISS-PROT at each release cycle.

(3.4). The DE line The DE (DEscription) lines contain general descriptive information about the sequence stored. This information is generally sufficient to identify the protein precisely. The format of the DE line is: DE

DESCRIPTION.

The description is given in ordinary English (using US-spelling) and is free-format. In cases where more than one DE line is required, the text is only divided between words and only the last DE line is terminated by a period. The description always starts with the proposed ‘official name’ of the protein. Synonyms are indicated between brackets. Example: DE DE DE

ANNEXIN V (LIPOCORTIN V) (ENDONEXIN II) (CALPHOBINDIN I) (CBP-I) (PLACENTAL ANTICOAGULANT PROTEIN I) (PAP-I) (PP4) (THROMBOPLASTIN INHIBITOR) (VASCULAR ANTICOAGULANT-ALPHA) (VAC-ALPHA) (ANCHORIN CII).

When a protein is known to be cleaved into multiple functional components, the description will start with the name of the precursor protein, followed by a section delimited by ‘[CONTAINS: ...]’. All the individual components are listed in that section and are separated by semicolons (‘;’). Synonyms are allowed at the level of the precursor and for each individual component. Example:

DE   CORTICOTROPIN-LIPOTROPIN PRECURSOR (PRO-OPIOMELANOCORTIN) (POMC)
DE   [CONTAINS: NPP; MELANOTROPIN GAMMA (GAMMA-MSH); CORTICOTROPIN
DE   (ADRENOCORTICOTROPIC HORMONE) (ACTH); MELANOTROPIN ALPHA (ALPHA-MSH);
DE   CORTICOTROPIN-LIKE INTERMEDIARY PEPTIDE (CLIP); LIPOTROPIN BETA
DE   (BETA-LPH); LIPOTROPIN GAMMA (GAMMA-LPH); MELANOTROPIN BETA
DE   (BETA-MSH); BETA-ENDORPHIN; MET-ENKEPHALIN].

When a protein is known to include multiple domains, each of which is described by a different name, the description will start with the name of the overall protein, followed by a section delimited by ‘[INCLUDES: ...]’. All the domains are listed in that section and are separated by semicolons (‘;’). Synonyms are allowed at the level of the protein and for each individual domain. Example:

DE   CAD PROTEIN (RUDIMENTARY PROTEIN) [INCLUDES: GLUTAMINE-DEPENDENT
DE   CARBAMOYL-PHOSPHATE SYNTHASE (EC 6.3.5.5); ASPARTATE
DE   CARBAMOYLTRANSFERASE (EC 2.1.3.2); DIHYDROOROTASE (EC 3.5.2.3)].


When the complete sequence was not determined, the last information given on the DE lines will be ‘(FRAGMENT)’ or ‘(FRAGMENTS)’. Example:

DE   LYSOPINE DEHYDROGENASE (EC 1.5.1.16) (OCTOPINE SYNTHASE) (LYSOPINE
DE   SYNTHASE) (FRAGMENT).
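The DE conventions above (official name first, parenthesized synonyms, optional ‘[CONTAINS: ...]’ or ‘[INCLUDES: ...]’ section, optional trailing fragment marker) can be pulled apart with a few regular expressions. This is only a sketch under stated assumptions: the function name `parse_de` is hypothetical, lines are assumed to be broken between words so that joining them with a space restores the text, and names or synonyms are assumed not to contain nested parentheses.

```python
import re


def parse_de(lines):
    """Split a block of DE lines into name, synonyms and sections."""
    # Drop the two-character line code and join the continuation lines.
    text = " ".join(line[2:].strip() for line in lines).rstrip(".")
    result = {"contains": [], "includes": []}
    for key in ("CONTAINS", "INCLUDES"):
        match = re.search(r"\[" + key + r": (.*?)\]", text)
        if match:
            # Components or domains are separated by semicolons.
            result[key.lower()] = [p.strip()
                                   for p in match.group(1).split(";")]
            text = (text[:match.start()] + text[match.end():]).strip()
    # '(FRAGMENT)' or '(FRAGMENTS)' is the last item when present.
    fragment = re.search(r"\s*\(FRAGMENTS?\)$", text)
    result["fragment"] = fragment is not None
    if fragment:
        text = text[:fragment.start()]
    # The official name precedes the first parenthesized synonym.
    result["name"] = text.split(" (", 1)[0].strip()
    result["synonyms"] = re.findall(r"\(([^()]*)\)", text)
    return result
```

Note that EC numbers given in parentheses, such as ‘(EC 1.5.1.16)’, are returned among the synonyms by this sketch; a caller that needs them separately would have to filter on the ‘EC ’ prefix.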

(3.5). The GN line

The GN (Gene Name) line contains the name(s) of the gene(s) that code for the stored protein sequence. The format of the GN line is:

GN   NAME1[ AND|OR NAME2...].

Examples:

GN   ALB.
GN   REX-1.

It often occurs that more than one gene name has been assigned to an individual locus. In that case all the synonyms will be listed. The word ‘OR’ separates the different designations. The first name in the list is assumed to be the most correct (or most current) designation. Example:

GN   HNS OR DRDX OR OSMZ OR BGLY.

In a few cases, multiple genes code for an identical protein sequence. In that case all the different gene names will be listed. The word ‘AND’ separates the designations. Example:

GN   CECA1 AND CECA2.

In very rare cases ‘AND’ and ‘OR’ can both be present. In that case parentheses are used as shown in the following example:

GN   GVPA AND (GVPB OR GVPA2).
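As an illustration of the ‘AND’/‘OR’ rules above, a GN line can be turned into a list of loci, each with its list of synonymous names. This is a hedged sketch: `parse_gn` is a hypothetical name, and the code assumes that parentheses only ever group ‘OR’ alternatives inside an ‘AND’ list, as in the last example; other nestings are not handled.

```python
def parse_gn(line):
    """Return a list of loci; each locus is a list of synonymous names."""
    # Drop the 'GN' line code and the terminating period.
    text = line[2:].strip().rstrip(".")
    loci = []
    # 'AND' separates different genes coding for the same protein.
    for group in text.split(" AND "):
        # Parentheses, when present, wrap a group of 'OR' synonyms.
        group = group.strip().strip("()")
        # 'OR' separates synonyms; the first is the preferred name.
        loci.append(group.split(" OR "))
    return loci
```

With this representation the preferred designation of each locus is simply the first element of its list.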

(3.6). The OS line

The OS (Organism Species) line specifies the organism(s) that was the source of the stored sequence. In the rare case where all the species information will not fit on a single line, more than one OS line is used. The last OS line is terminated by a period. The species designation consists, in most cases, of the Latin genus and species designation followed by the English name (in parentheses). For viruses, only the common English name is given. In cases where a protein sequence is identical in more than one species, the OS line(s) will list the names of all those species.


Examples of OS lines are shown here:

OS   Escherichia coli.
OS   Homo sapiens (Human).
OS   Acer spicatum (Moose maple) (Mountain maple).
OS   Rous sarcoma virus (strain Schmidt-Ruppin).

If a SWISS-PROT entry reports the sequence of a protein identical in a number of species, the names of these species will all be listed in the OS lines of that entry. The species names are separated by commas, the last species name being preceded by the word ‘and’. Species names are never cut across two lines. Here are examples of the OS lines for entries representing multiple species:

OS   Oncorhynchus nerka (Sockeye salmon), and
OS   Oncorhynchus masou (Cherry salmon) (Masu salmon).

OS   Mus musculus (Mouse), Rattus norvegicus (Rat), and
OS   Bos taurus (Bovine).
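The multi-species rules above (comma-separated names, ‘and’ before the last one, parenthesized English names) can be sketched in Python as follows. The names `split_top_level` and `parse_os` are hypothetical; the comma split ignores commas inside parentheses as a precaution, which is an assumption rather than something the format description states.

```python
import re


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_os(lines):
    """Return (scientific name, [parenthesized names]) per species."""
    text = " ".join(line[2:].strip() for line in lines).rstrip(".")
    species = []
    for part in split_top_level(text):
        # The last species name is preceded by the word 'and'.
        if part.startswith("and "):
            part = part[4:]
        name, _, rest = part.partition(" (")
        qualifiers = re.findall(r"\(([^()]*)\)", "(" + rest) if rest else []
        species.append((name, qualifiers))
    return species
```

For a virus line such as ‘Rous sarcoma virus (strain Schmidt-Ruppin).’ the parenthesized text is a strain rather than an English name, so the second element is best read as a list of qualifiers.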

(3.7). The OG line

The OG (OrGanelle) line indicates if the gene coding for a protein originates from the mitochondria, the chloroplast, a cyanelle, or a plasmid. The format of the OG line is:

OG   Chloroplast.
OG   Cyanelle.
OG   Mitochondrion.
OG   Plasmid name.

Where ‘name’ is the name of the plasmid. If a SWISS-PROT entry reports the sequence of a protein identical in a number of plasmids, the names of these plasmids will all be listed in the OG lines of that entry. The plasmid names are separated by commas, the last plasmid name being preceded by the word ‘and’. Plasmid names are never cut across two lines. Examples:

OG   Plasmid IncFIV R124, and Plasmid IncFI ColV3-K30.

OG   Plasmid R6-5, Plasmid IncFII NR1, and
OG   Plasmid IncFII R1-19 (R1 drd-19).


(3.8). The OC line

The OC (Organism Classification) lines contain the taxonomic classification of the source organism. The taxonomic classification used in SWISS-PROT is that maintained at the NCBI (see http://www.ncbi.nlm.nih.gov/Taxonomy/) and used by the nucleotide sequence databases (EMBL/GenBank/DDBJ). The NCBI’s taxonomy reflects current phylogenetic knowledge. It is a sequence-based taxonomy as much as possible and is based on published authorities wherever possible. Because of the inherent ambiguity of evolutionary classification and the specific needs of database users (e.g., trying to track down the phylogenetic history of a group of organisms or to elucidate the evolution of a molecule), this taxonomy strives to accurately reflect current phylogenetic knowledge. The NCBI’s taxonomy is intended to be informative and helpful; no claim is made that it is necessarily the best or most exact.

The classification is listed top-down as nodes in a taxonomic tree in which the most general grouping is given first. The classification may be distributed over several OC lines, but nodes are not split or hyphenated between lines. Semicolons separate the individual items and the list is terminated by a period. The format of the OC line is:

OC   Node[; Node...].

For example the classification lines for a human sequence would be:

OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

If a protein is identical in more than one species, all the species names will be listed (see 3.6) but the OC lines will only contain the classification for the first species listed.
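Because nodes are never split between OC lines, the classification can be recovered by joining the lines and splitting on semicolons. A minimal sketch, with the hypothetical function name `parse_oc`:

```python
def parse_oc(lines):
    """Return the taxonomic nodes of a block of OC lines, top-down."""
    # Drop the 'OC' line code from each line and join the data parts.
    text = " ".join(line[2:].strip() for line in lines)
    # The list ends with a period; nodes are separated by semicolons.
    return [node.strip() for node in text.rstrip(".").split(";")
            if node.strip()]
```

The first element of the returned list is the most general grouping and the last is the most specific one.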

(3.9). The reference (RN, RP, RC, RX, RA, RT, RL) lines

These lines comprise the literature citations within SWISS-PROT. The citations indicate the sources from which the data has been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RP, RC, RX, RA, RT and RL. Within each such reference block the RN and RP lines occur once, the RC, RX and RT lines occur zero or more times, and the RA and RL lines each occur one or more times. If several references are given, there will be a reference block for each. An example of a complete reference is:

RN   [1]
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC   STRAIN=Sprague-Dawley; TISSUE=Liver;
RX   MEDLINE; 91002678.
RA   Chan Y.-L., Paz V., Olvera J., Wool I.G.;
RT   "The primary structure of rat ribosomal proteins: the amino acid
RT   sequences of L27a and L28 and corrections in the sequences of S4
RT   and S12.";
RL   Biochim. Biophys. Acta 1050:69-73(1990).

The formats of the individual lines are explained below.

(3.9.1). The RN line

The RN (Reference Number) line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in comments and feature table notes. The format of the RN line is:

RN   [N]

where ‘N’ denotes the nth reference for this entry. The reference number is always enclosed in square brackets.

(3.9.2). The RP line

The RP (Reference Position) line describes the extent of the work carried out by the authors of the reference cited. The format of the RP line is:

RP   COMMENT.

Typical examples of RP lines are shown below:

RP   SEQUENCE FROM N.A.
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 12-35.
RP   SEQUENCE OF 34-56; 67-73 AND 123-345, AND DISULFIDE BONDS.
RP   REVISIONS TO 67-89.
RP   STRUCTURE BY NMR.
RP   X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).
RP   CHARACTERIZATION.
RP   MUTAGENESIS OF TYR-56.
RP   REVIEW.
RP   VARIANT ALA-58.
RP   VARIANTS XLI LEU-341; ARG-372 AND TYR-446.

(3.9.3). The RC line

The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited. The format of the RC line is:

RC   TOKEN1=Text; TOKEN2=Text; .....

Where the currently defined tokens are:

PLASMID
SPECIES
STRAIN
TISSUE
TRANSPOSON

Examples of RC lines:

RC   STRAIN=Sprague-Dawley; TISSUE=Liver;
RC   STRAIN=Holstein; TISSUE=Mammary gland, and Lymph node;
RC   SPECIES=Rat; STRAIN=Wistar;
RC   SPECIES=A.thaliana; STRAIN=cv. Columbia;
RC   PLASMID=IncFII R100;

The ‘SPECIES’ token is only used when an entry describes a sequence that is identical in more than one species; similarly the ‘PLASMID’ token is only used if an entry describes a sequence identical in more than one plasmid. The SWISS-PROT document TISSLIST.TXT lists all the tissues that are used in the database in the context of the ‘TISSUE’ token.
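The ‘TOKEN=Text;’ layout of the RC line maps naturally onto a list of key/value pairs. The sketch below uses the hypothetical name `parse_rc` and assumes that semicolons never occur inside the text of a token value.

```python
def parse_rc(lines):
    """Return the (token, text) pairs found in a block of RC lines."""
    text = " ".join(line[2:].strip() for line in lines)
    tokens = []
    # Every 'TOKEN=Text' item is terminated by a semicolon.
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        tokens.append((key, value))
    return tokens
```

A list of pairs is used rather than a dictionary so that the order of the tokens on the line is preserved.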

(3.9.4). The RX line

The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is:

RX   BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.

Where the valid bibliographic database names and their associated identifier are:

Name:        MEDLINE
Database:    Medline from the National Library of Medicine (NLM)
Identifier:  Eight-digit Medline Unique Identifier (UID)

Example of RX line:

RX   MEDLINE; 91002678.

(3.9.5). The RA line

The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank, followed by initial(s) with periods. The authors’ names are separated by commas and terminated by a semicolon. Author names are not split between lines. An example of the use of RA lines is shown below:

RA   Coffman B.L., Tephly T.R., Irshaid Y.M., Green M.D., Smith C.,
RA   Jackson M.R., Wooster R., Burchell B.;

As many RA lines as necessary are included for each reference. An author’s initials can be followed by an abbreviation such as ‘Jr’ (for Junior), ‘Sr’ (Senior), ‘II’, ‘III’ or ‘IV’ (2nd, 3rd and 4th). Example:

RA   Smith H. Jr., von Braun M.T. III;
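Since author names are separated by commas, never split between lines, and the list ends with a semicolon, the author list can be rebuilt as sketched below. The function name `parse_ra` is hypothetical.

```python
def parse_ra(lines):
    """Return the authors of a block of RA lines, in citation order."""
    # Join the lines, then drop the terminating semicolon.
    text = " ".join(line[2:].strip() for line in lines).rstrip(";")
    # Names are comma-separated; suffixes such as 'Jr.' stay attached.
    return [author.strip() for author in text.split(",") if author.strip()]
```

Suffixes such as ‘Jr.’ or ‘III’ remain part of the author string because they follow the initials without an intervening comma.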

(3.9.6). The RT line

The RT (Reference Title) lines give the title of the paper (or other work) cited as exactly as possible given the limitations of the computer character set. The format of the RT line is:

RT   "Title";

Example of a set of RT lines:

RT   "New insulin-like proteins with atypical disulfide bond pattern
RT   characterized in Caenorhabditis elegans by comparative sequence
RT   analysis and homology modeling.";

It should be noted that the format of the title is not always identical to that displayed at the top of the published work:

• Major title words are not capitalized;
• The text of a title ends with either a period ‘.’, a question mark ‘?’ or an exclamation mark ‘!’;
• Double quotation marks ‘"’ in the text of the title are replaced by single quotation marks;
• Titles of articles published in a language other than English have been translated into English;
• Greek letters are spelled out (alpha, beta, etc.).

(3.9.7). The RL line

The RL (Reference Location) lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question.

a) Journal citations

The RL line for a journal citation includes the journal abbreviation, the volume number, the page range, and the year. The format for such an RL line is:

RL   Journal_abbrev Volume:First_page-Last_page(YYYY).

Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards. A list of the abbreviations currently in use is given in the SWISS-PROT document file JOURLIST.TXT. An example of an RL line is:

RL   J. Mol. Biol. 168:321-331(1983).

When a reference is made to a paper which is ‘in press’ at the time when the database is released, the page range, and possibly the volume number, are indicated as ‘0’ (zero). An example of such an RL line is shown here:

RL   Nucleic Acids Res. 27:0-0(1999).

b) Book citations

A variation of the RL line format is used for papers found in books or other similar publications, which are cited using the following format:

RL   (In) Editor_1 I.[, Editor_2 I., Editor_X I.] (eds.);
RL   Book_name, pp.[Volume:]First_page-Last_page, Publisher, City (YYYY).

Examples:

RL   (In) Boyer P.D. (eds.);
RL   The enzymes (3rd ed.), pp.11:397-547, Academic Press, New York (1975).

RL   (In) Rich D.H., Gross E. (eds.);
RL   Proceedings of the 7th american peptide symposium, pp.69-72,
RL   Pierce Chemical Co., Rockford Il. (1981).

RL   (In) Magnusson S., Ottesen M., Foltmann B., Dano K.,
RL   Neurath H. (eds.);
RL   Regulatory proteolytic enzymes and their inhibitors, pp.163-172,
RL   Pergamon Press, New York (1978).

c) Plant Gene Register and Worm Breeder’s Gazette citations

The ‘(In)’ prefix used for books (see above) is also used for references to the electronic Plant Gene Register (http://www.tarweed.com/pgr/) as well as to the Worm Breeder’s Gazette (http://elegans.swmed.edu/wli/). Examples:

RL   (In) Plant Gene Register PGR98-023.
RL   (In) Worm Breeder's Gazette 15(3):34(1998).

d) Unpublished results

RL lines for unpublished results follow the format shown in the next example:

RL   Unpublished results, cited by:
RL   Shelnutt J.A., Rousseau D.L., Dethmers J.K., Margoliash E.;
RL   Biochemistry 20:6485-6497(1981).

e) Unpublished observations

For unpublished observations the format of the RL line is:

RL   Unpublished observations (MMM-YYYY).

Where ‘MMM’ is the month and ‘YYYY’ is the year. We use the ‘unpublished observations’ RL line to cite communications by scientists to SWISS-PROT of unpublished information concerning various aspects of a sequence entry.

f) Thesis

For Ph.D. theses the format of the RL line is:

RL   Thesis (Year), Institution_name, Country.

An example of such a line is given here:

RL   Thesis (1972), University of Geneva, Switzerland.

g) Patent applications

For patent applications the format of the RL line is:

RL   Patent number Pat_num, DD-MMM-YYYY.

Where ‘Pat_num’ is the international publication number of the patent, ‘DD’ is the day, ‘MMM’ is the month and ‘YYYY’ is the year. Example:

RL   Patent number WO9010703, 20-SEP-1990.

h) Submissions

The final form that an RL line can take is that used for submissions. The format of such an RL line is:

RL   Submitted (MMM-YYYY) to the Database_name.

Where ‘MMM’ is the month, ‘YYYY’ is the year and ‘Database_name’ is one of the following:

EMBL/GenBank/DDBJ databases
SWISS-PROT data bank
HIV data bank
PDB data bank
PIR data bank

Two examples of submission RL lines are given here:

RL   Submitted (APR-1994) to the EMBL/GenBank/DDBJ databases.
RL   Submitted (FEB-1999) to the SWISS-PROT data bank.
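Of the RL forms listed in this section, the journal citation (form a) is the most regular. The sketch below recognizes it with a regular expression and flags ‘in press’ citations. The function name `parse_journal_rl` is hypothetical, and the pattern assumes purely numeric volume and page numbers as in the examples; any other RL form simply yields `None`.

```python
import re

# Assumed layout: "Journal_abbrev Volume:First_page-Last_page(YYYY)."
JOURNAL_RL = re.compile(
    r"^(?P<journal>.+) (?P<volume>\d+):(?P<first>\d+)-(?P<last>\d+)"
    r"\((?P<year>\d{4})\)\.$")


def parse_journal_rl(line):
    """Parse a journal-style RL line, or return None for other forms."""
    match = JOURNAL_RL.match(line[2:].strip())
    if match is None:
        return None
    first = int(match.group("first"))
    last = int(match.group("last"))
    return {
        "journal": match.group("journal"),
        "volume": int(match.group("volume")),
        "pages": (first, last),
        "year": int(match.group("year")),
        # A page range of 0-0 marks a paper that was still in press.
        "in_press": first == 0 and last == 0,
    }
```

Book, thesis, patent and submission lines do not match the pattern and would need their own handling.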

(3.10). The CC line

The CC lines are free text comments on the entry, and may be used to convey any useful information. The comments always appear below the last reference line and are grouped together in comment blocks, a block being made of one or more comment lines. The first line of a block starts with the characters ‘-!-’. The format of a comment block is:

CC   -!- TOPIC: FIRST LINE OF A COMMENT BLOCK;
CC       SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.

The comment blocks are arranged according to what we designate as ‘topics’. The current topics and their definitions are listed in the table below.

Topic                  Description
ALTERNATIVE PRODUCTS   Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene or by the use of alternative initiation codons
CATALYTIC ACTIVITY     Description of the reaction(s) catalyzed by an enzyme [1]
CAUTION                This topic warns you about possible errors and/or grounds for confusion
COFACTOR               Description of an enzyme cofactor
DATABASE               Description of a cross-reference to a network database/resource for a specific protein [2]
DEVELOPMENTAL STAGE    Description of the developmental specific expression of a protein
DISEASE                Description of the disease(s) associated with a deficiency of a protein
DOMAIN                 Description of the domain structure of a protein
ENZYME REGULATION      Description of an enzyme regulatory mechanism
FUNCTION               General description of the function(s) of a protein
INDUCTION              Description of the compound(s) which stimulate the synthesis of a protein
MASS SPECTROMETRY      Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods [3]
MISCELLANEOUS          Any comment which does not belong to any of the other defined topics
PATHWAY                Description of the metabolic pathway(s) to which a protein is associated
PHARMACEUTICAL         Description of the use of a specific protein as a pharmaceutical drug
POLYMORPHISM           Description of polymorphism(s)
PTM                    Description of a post-translational modification
SIMILARITY             Description of the similarity(ies) (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION   Description of the subcellular location of the mature protein
SUBUNIT                Description of the quaternary structure of a protein
TISSUE SPECIFICITY     Description of the tissue specificity of a protein


Notes:

[1] For the ‘CATALYTIC ACTIVITY’ topic: whenever possible we have used, to describe the catalytic activity of an enzyme, the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) as published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New York, (1992).

[2] The syntax of the ‘DATABASE’ topic is:

CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].

Where:

• ‘NAME’ is the name of the database;
• ‘NOTE’ (optional) is a free text note;
• ‘WWW’ (optional) is the WWW address (URL) of the database;
• ‘FTP’ (optional) is the anonymous FTP address (including the directory name) where the database file(s) are stored.

Note: this is currently the only part of the database where lines longer than 75 characters can be found, as long URL or FTP addresses are not reformatted into multiple lines.

[3] The syntax of the ‘MASS SPECTROMETRY’ topic is:

CC   -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][; RANGE=XX-XX].

Where:

• ‘MW=XXX’ is the determined molecular weight (MW);
• ‘MW_ERR=XX’ (optional) is the accuracy or error range of the MW measurement;
• ‘METHOD=XX’ (optional) is the mass spectrometric method;
• ‘RANGE=XX-XX’ (optional) is used to indicate what part of the protein sequence entry corresponds to the molecular weight. If this qualifier is not present, the MW value corresponds to the full length of the protein sequence.
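The qualifier syntax of the ‘MASS SPECTROMETRY’ topic can be decoded as sketched below. This is an illustration only: the function name `parse_mass_spec` and the keys of the returned dictionary are hypothetical, and the input is assumed to be the text of one comment block with the ‘CC’ line codes and the ‘-!-’ marker already removed.

```python
def parse_mass_spec(block):
    """Decode the qualifiers of one MASS SPECTROMETRY comment block."""
    topic, _, body = block.partition(":")
    if topic.strip() != "MASS SPECTROMETRY":
        raise ValueError("not a MASS SPECTROMETRY block: %r" % block)
    fields = {}
    # Qualifiers are 'KEY=value' items separated by semicolons.
    for item in body.strip().rstrip(".").split(";"):
        key, _, value = item.strip().partition("=")
        fields[key] = value
    result = {"mw": float(fields["MW"])}
    if "MW_ERR" in fields:
        result["mw_err"] = float(fields["MW_ERR"])
    if "METHOD" in fields:
        result["method"] = fields["METHOD"]
    if "RANGE" in fields:
        # RANGE gives the first and last residue the MW applies to.
        start, end = fields["RANGE"].split("-")
        result["range"] = (int(start), int(end))
    return result
```

When ‘range’ is absent from the result, the molecular weight applies to the full length of the sequence, as stated in the last bullet point above.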

Each SWISS-PROT entry will contain a variable number of CC line topics. Most topics can be present more than once in a given entry. The only topics that can occur only once in an entry are: ALTERNATIVE PRODUCTS, COFACTOR, DEVELOPMENTAL STAGE, ENZYME REGULATION, INDUCTION, SUBCELLULAR LOCATION, SUBUNIT and TISSUE SPECIFICITY. We show here, for each of the defined topics, two examples of their usage:

CC   -!- ALTERNATIVE PRODUCTS: AT LEAST THREE ISOFORMS; AIRE-1 (SHOWN
CC       HERE), AIRE-2 AND AIRE-3; SEEM TO BE PRODUCED BY ALTERNATIVE
CC       SPLICING. AIRE-2 AND AIRE-3 SEEM TO BE LESS FREQUENTLY EXPRESSED
CC       THAN AIRE-1, IF AT ALL.
CC   -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN THE
CC       SAME READING FRAME, THE GENE TRANSLATES INTO THREE ISOZYMES:
CC       ALPHA, BETA AND BETA'.

CC   -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) = ADP +
CC       GLUTAMINE + PHOSPHATE.
CC   -!- CATALYTIC ACTIVITY: (R)-2,3-DIHYDROXY-3-METHYLBUTANOATE +
CC       NADP(+) = (S)-2-HYDROXY-2-METHYL-3-OXOBUTANOATE + NADPH.

CC   -!- CAUTION: REF.2 SEQUENCE DIFFERS FROM THAT SHOWN IN POSITIONS 92
CC       TO 165 DUE TO A FRAMESHIFT.
CC   -!- CAUTION: IT IS UNCERTAIN WHETHER MET-1 OR MET-3 IS THE
CC       INITIATOR.

CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- COFACTOR: FAD AND NONHEME IRON.

CC   -!- DATABASE: NAME=CD40Lbase;
CC       NOTE=European CD40L defect database (mutation db);
CC       WWW="http://www.expasy.ch/cd40lbase/";
CC       FTP="ftp://www.expasy.ch/databases/cd40lbase".
CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC       WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".

CC   -!- DEVELOPMENTAL STAGE: EXPRESSED EARLY DURING CONIDIAL (DORMANT
CC       SPORES) DIFFERENTIATION.
CC   -!- DEVELOPMENTAL STAGE: EXPRESSED IN EMBRYONIC AND EARLY LARVAL
CC       STAGES.

CC   -!- DISEASE: DEFECTS IN PHKA1 ARE LINKED TO X-LINKED MUSCLE
CC       GLYCOGENOSIS, A DISEASE CHARACTERIZED BY SLOWLY PROGRESSIVE,
CC       PREDOMINANTLY DISTAL MUSCLE WEAKNESS AND ATROPHY.
CC   -!- DISEASE: DEFECTS IN ALD ARE THE CAUSE OF X-LINKED
CC       ADRENOLEUKODYSTROPHY, A PEROXISOMAL DISORDER CHARACTERIZED BY
CC       PROGRESSIVE DEMYELINATION OF THE CNS AND ADRENAL
CC       INSUFFICIENCY.

CC   -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
CC       TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.
CC   -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
CC       CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).

CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED BY
CC       ADENYLATION. THE FULLY ADENYLATED ENZYME IS INACTIVE.
CC   -!- ENZYME REGULATION: ACTIVATED BY GRAM-NEGATIVE BACTERIAL
CC       LIPOPOLYSACCHARIDES AND CHYMOTRYPSIN.

CC   -!- FUNCTION: PROFILIN PREVENTS THE POLYMERIZATION OF ACTIN.
CC   -!- FUNCTION: INHIBITOR OF FUNGAL POLYGALACTURONASE. IT IS AN
CC       IMPORTANT FACTOR FOR PLANT RESISTANCE TO PHYTOPATHOGENIC
CC       FUNGI.

CC   -!- INDUCTION: BY SALT STRESS AND BY ABSCISIC ACID (ABA).
CC   -!- INDUCTION: BY INFECTION, PLANT WOUNDING, OR ELICITOR TREATMENT
CC       OF CELL CULTURES.

CC   -!- MASS SPECTROMETRY: MW=71890; MW_ERR=7; METHOD=MALDI.
CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;
CC       RANGE=40-119.

CC   -!- MISCELLANEOUS: BINDS TO BACITRACIN.
CC   -!- MISCELLANEOUS: JUVENILE HORMONE SUPPRESSES TRANSFERRIN LEVELS
CC       DRASTICALLY IN THE ADULT FEMALE COCKROACH.

CC   -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS PATHWAY.
CC   -!- PATHWAY: LAST STEP IN PROTOHEME BIOSYNTHESIS. IN ERYTHROID
CC       CELLS, FERROCHELATASE APPEARS TO BE THE RATE-LIMITING ENZYME.

CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment of
CC       multiple sclerosis (MS). Betaseron is a slightly modified form
CC       of IFNB1 with two residue substitutions.
CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
CC       Used in patients with renal cell carcinoma or metastatic melanoma.

CC   -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC       HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE
CC       WITH ARG-191 WITH A HIGH TURNOVER NUMBER.
CC   -!- POLYMORPHISM: THE TWO MAIN ALLELES OF HP ARE CALLED HP1F
CC       (FAST) AND HP1S (SLOW). THE SEQUENCE SHOWN HERE IS THAT OF THE
CC       HP1S FORM.

CC   -!- PTM: O-GLYCOSYLATED; AN UNUSUAL FEATURE AMONG VIRAL
CC       GLYCOPROTEINS.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.

CC   -!- SIMILARITY: BELONGS TO THE ANNEXIN FAMILY.
CC   -!- SIMILARITY: CONTAINS 12 EGF-LIKE DOMAINS.

CC   -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.
CC   -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN. INNER
CC       MEMBRANE.

CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED
CC       BY A DISULFIDE BOND.

CC   -!- TISSUE SPECIFICITY: KIDNEY, SUBMAXILLARY GLAND, AND URINE.
CC   -!- TISSUE SPECIFICITY: SHOOTS, ROOTS, AND COTYLEDON FROM
CC       DEHYDRATING SEEDLINGS.
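Since every comment block opens with ‘-!-’ and its topic ends at the first colon, a run of CC lines can be regrouped into (topic, text) pairs. The following is a sketch with the hypothetical function name `parse_cc`; it assumes each input line still carries its ‘CC’ line code and that continuation lines are joined with a single space.

```python
def parse_cc(lines):
    """Group CC lines into (topic, text) pairs, one per comment block."""
    blocks = []
    for line in lines:
        data = line[2:].strip()
        if data.startswith("-!-"):
            # The '-!-' marker opens a new comment block.
            blocks.append(data[3:].strip())
        elif blocks:
            # Any other line continues the current block.
            blocks[-1] += " " + data
    result = []
    for block in blocks:
        # The topic name runs up to the first colon.
        topic, _, text = block.partition(":")
        result.append((topic.strip(), text.strip()))
    return result
```

Because only the first colon is used as the separator, URLs inside a ‘DATABASE’ block stay intact in the text part.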

(3.11). The DR line

(3.11.1). Definition

The DR (Database cross-Reference) lines are used as pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT. The full list of all databases to which SWISS-PROT is cross-referenced can be found in the document file DBXREF.TXT. For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Protein Data Bank (PDB) there will be DR line(s) pointing to the corresponding entry(ies) in that database. For a sequence translated from a nucleotide sequence there will be DR line(s) pointing to the relevant entry(ies) in the EMBL/GenBank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated. The format of the DR line is:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.

Exceptions are cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database and to the PROSITE and Pfam databases. The specific formats for these cross-references are described in sections 3.11.5 and 3.11.6.


(3.11.2). Database identifier

The first item on the DR line, the ‘DATABASE_IDENTIFIER’, is the abbreviated name of the data collection to which reference is made. The currently defined database identifiers are listed below.

Identifier            Database description
EMBL                  Nucleotide sequence database of EMBL (EBI) (see 3.11.5)
CARBBANK              Complex carbohydrate structure database (CCSD) from CarbBank
DICTYDB               Dictyostelium discoideum genome database
ECO2DBASE             Escherichia coli gene-protein database (2D gel spots) (ECO2DBASE)
ECOGENE               Escherichia coli K12 genome database (EcoGene)
FLYBASE               Drosophila genome database (FlyBase)
GCRDB                 G protein-coupled receptor database (GCRDb)
HIV                   HIV sequence database
HSC-2DPAGE            Harefield hospital 2D gel protein databases (HSC-2DPAGE)
HSSP                  Homology-derived secondary structure of proteins database (HSSP)
INTERPRO              Integrated resource of protein families, domains and functional sites (InterPro)
MAIZEDB               Maize genome database (MaizeDB)
MAIZE-2DPAGE          Maize genome 2D Electrophoresis database (Maize-2DPAGE)
MENDEL                Plant gene nomenclature database (Mendel)
MGD                   Mouse genome database (MGD)
MIM                   Mendelian Inheritance in Man Database (MIM)
PDB                   3D-macromolecular structure Protein Data Bank (PDB)
PFAM                  Pfam protein domain database (see 3.11.6)
PIR                   Protein sequence database of the Protein Information Resource (PIR)
PRINTS                Protein Fingerprint database (PRINTS)
PROSITE               PROSITE protein domains and families database (see 3.11.6)
REBASE                Restriction enzymes and methylases database (REBASE)
AARHUS/GHENT-2DPAGE   Human keratinocyte 2D gel protein database from Aarhus and Ghent universities
SGD                   Saccharomyces Genome Database (SGD)
STYGENE               Salmonella typhimurium LT2 genome database (StyGene)
SUBTILIST             Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE          2D-PAGE database from the Geneva University Hospital (SWISS-2DPAGE)
TIGR                  The bacterial database(s) of ‘The Institute for Genomic Research’ (TIGR)
TRANSFAC              Transcription factor database (TRANSFAC)
TUBERCULIST           Mycobacterium tuberculosis H37Rv genome database (TubercuList)
WORMPEP               Caenorhabditis elegans genome sequencing project protein database (WormPep)
YEPD                  Yeast electrophoresis protein database (YEPD)
ZFIN                  Zebrafish Information Network genome database (ZFIN)

(3.11.3). The primary identifier

The second item on the DR line, the ‘PRIMARY_IDENTIFIER’, is an unambiguous pointer to the information entry in the database to which reference is being made.

• For a CarbBank, DictyDb, EcoGene, FlyBase, GCRDb, HIV, HSC-2DPAGE, InterPro, MAIZE-2DPAGE, Mendel, MGD, MIM, PIR, PRINTS, REBASE, SGD, StyGene, SubtiList, SWISS-2DPAGE, TRANSFAC, TubercuList or ZFIN reference the primary identifier is the first accession number (also called the Unique Identifier in some databases) of the entry to which reference is being made.
• For a PDB reference the primary identifier is the entry name.
• For an AARHUS/GHENT-2DPAGE, ECO2DBASE or YEPD reference the primary identifier is the protein spot alphanumeric designation.
• For a WormPep reference the primary identifier is the cosmid-derived name given to that protein by the C.elegans genome-sequencing project.
• For a MaizeDB reference the primary identifier is the ‘Gene-product’ accession ID.
• For a TIGR reference the primary identifier is the genome Open Reading Frame (ORF) code.
• For an HSSP reference the primary identifier is the accession number of a SWISS-PROT entry cross-referenced to a PDB entry whose structure is expected to be similar to that of the entry in which the HSSP cross-reference is present.

(3.11.4). The secondary identifier

The third item on the DR line, the ‘SECONDARY_IDENTIFIER’, is generally used to complement the information given by the first identifier.

• For an HIV, PIR, PRINTS or REBASE reference the secondary identifier is the entry’s name.
• For a PDB reference the secondary identifier is the most recent date on which PDB revised the entry (last ‘REVDAT’ record).
• For a DictyDb, EcoGene, FlyBase, Mendel, MGD, SGD, StyGene, SubtiList or ZFIN reference the secondary identifier is the gene designation. If the gene designation is not available, a dash (‘-’) is used.
• For an ECO2DBASE reference the secondary identifier is the latest release number or edition of the database that has been used to derive the cross-reference.
• For a SWISS-2DPAGE, HSC-2DPAGE or MAIZE-2DPAGE reference the secondary identifier is the species or tissue of origin.
• For an AARHUS/GHENT-2DPAGE reference the secondary identifier is either ‘IEF’ (for isoelectric focusing) or ‘NEPHGE’ (for non-equilibrium pH gradient electrophoresis).
• For a WormPep reference the secondary identifier is a number attributed by the C.elegans genome-sequencing project to that protein.
• For a CarbBank, GCRDb, InterPro, MaizeDB, MIM, TIGR, TRANSFAC, TubercuList or YEPD reference the secondary identifier is not defined and a dash (‘-’) is stored in that field.
• For an HSSP reference the secondary identifier is the entry name of the PDB structure related to that of the entry in which the HSSP cross-reference is present.


Examples of complete DR lines are shown here:

DR   AARHUS/GHENT-2DPAGE; 8006; IEF.
DR   CARBBANK; CCSD:27494; -.
DR   DICTYDB; DD01047; myoA.
DR   ECO2DBASE; G052.0; 6TH EDITION.
DR   ECOGENE; EG10054; araC.
DR   FLYBASE; FBgn0000055; Adh.
DR   GCRDB; GCR_0087; -.
DR   HIV; K02013; NEF$BRU.
DR   HSC-2DPAGE; P47985; HUMAN.
DR   HSSP; P00438; 1DOB.
DR   INTERPRO; IPR001254; -.
DR   MAIZEDB; 25342; -.
DR   MAIZE-2DPAGE; P80607; COLEOPTILE.
DR   MENDEL; 2596; AMAhy;psbA;1.
DR   MGD; MGI:87920; Adfp.
DR   MIM; 249900; -.
DR   PDB; 3ADK; 16-APR-88.
DR   PIR; A02768; R5EC7.
DR   PRINTS; PR00237; GPCRRHODOPSN.
DR   REBASE; RB00993; EcoRI.
DR   SGD; L0000008; AAR2.
DR   STYGENE; SG10312; proV.
DR   SUBTILIST; BG10774; oppD.
DR   SWISS-2DPAGE; P10599; HUMAN.
DR   TIGR; MJ0125; -.
DR   TRANSFAC; T00141; -.
DR   TUBERCULIST; Rv0001; -.
DR   WORMPEP; ZK637.7; CE00437.
DR   YEPD; 4270; -.
DR   ZFIN; ZDB-GENE-980526-290; hoxa1.
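The general three-field DR layout can be split as sketched below. The function name `parse_dr` is hypothetical. Only the first two ‘; ’ separators are used, so that a secondary identifier which itself contains semicolons (as in the MENDEL example) is kept whole. The sketch does not cover the EMBL, PROSITE and Pfam cross-references, which the manual describes as having their own formats.

```python
def parse_dr(line):
    """Split a general DR line into its three fields."""
    data = line[2:].strip()
    # Remove the period that terminates the line.
    if data.endswith("."):
        data = data[:-1]
    # Split on the first two separators only.
    parts = data.split("; ", 2)
    if len(parts) != 3:
        raise ValueError("unexpected DR line: %r" % line)
    database, primary, secondary = parts
    return database, primary, secondary
```

A secondary identifier of ‘-’ in the returned tuple means that the field is undefined or unavailable for that database.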

(3.11.5). Cross-references to the nucleotide sequence database

The specific format for cross-references to the EMBL/GenBank/DDBJ nucleotide sequence database is:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.

Where ‘PROTEIN_ID’ stands for the ‘Protein Sequence Identifier’. It is a string which is stored, in nucleotide sequence entries, in a qualifier called ‘/protein_id’ which is tagged to every CDS in the nucleotide database. Example:

FT   CDS             302..2674
FT                   /protein_id="CAA03857.1"
FT                   /db_xref="SWISS-PROT:P26345"
FT                   /gene="recA"
FT                   /product="RecA protein"


The Protein Sequence Identifier (Protein_ID) consists of a stable ID portion (8 characters: 3 letters followed by 5 numbers) plus a version number after a decimal point. The version number only changes when the protein sequence coded by the CDS changes, while the stable part remains unchanged. The Protein_ID effectively replaces what was previously known as the ‘PID’.

The ‘STATUS_IDENTIFIER’ provides information about the relationship between the sequence in the SWISS-PROT entry and the CDS in the corresponding EMBL entry.

a) In most cases the translation of the EMBL nucleotide sequence CDS results in the same sequence as shown in the corresponding SWISS-PROT entry, or the differences are mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or VARSPLIC and in the RP lines. In these cases the status identifier shows a dash (‘-’). Example:

DR   EMBL; Y00312; CAA68412.1; -.

b) In some cases the translation of the EMBL nucleotide sequence CDS results in a sequence different from the sequence shown in the corresponding SWISS-PROT entry. When the differences are either not mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or VARSPLIC (see Appendix A) and in the RP lines, or simply do not meet the criteria for such situations, the differences are indicated as follows:

1. If the difference is due to a different start of the sequence (e.g. SWISS-PROT believes that the start of the sequence is upstream or downstream of the site annotated as the start of the sequence in the EMBL database), the status identifier shows the comment 'ALT_INIT'. Example:

DR   EMBL; L29151; AAA99430.1; ALT_INIT.

2. If the difference is due to a different termination of the sequence (e.g. SWISS-PROT believes that the termination of the sequence is upstream or downstream of the site annotated as the end of the sequence in the EMBL database), the status identifier shows the comment 'ALT_TERM'. Example:

DR   EMBL; L20562; AAA26884.1; ALT_TERM.

3. If the difference is due to frameshifts in the EMBL sequence, the status identifier shows the comment 'ALT_FRAME'. Example:

DR   EMBL; X56420; CAA39814.1; ALT_FRAME.

4. If the difference is not due to any of the cases mentioned above (e.g. wrong intron-exon boundaries given in the EMBL entry) or is due to a mixture of the cases mentioned above, the status identifier shows the comment 'ALT_SEQ'. Example:

DR   EMBL; M28482; AAA26378.1; ALT_SEQ.

c) In some cases the nucleotide sequence of a complete CDS is divided into exons present in different EMBL entries. We point to the exon-containing EMBL entries by citing the Protein_ID as secondary identifier and adding the comment 'JOINED' in the status identifier. These EMBL entries do not contain a CDS feature; they contain exons joined to a CDS feature which is labeled with the given Protein_ID. Example:

DR   EMBL; M63397; AAA51662.1; -.
DR   EMBL; M63395; AAA51662.1; JOINED.
DR   EMBL; M63396; AAA51662.1; JOINED.

In the above example the SWISS-PROT sequence is derived from the CDS labeled with the Protein_ID AAA51662. This CDS feature can be found in the EMBL entry M63397. Exons belonging to this CDS are found not only in EMBL entry M63397, but also in the EMBL entries M63395 and M63396.

d) In some cases there is no CDS feature key annotating a protein translation in an EMBL entry and thus no Protein_ID for that CDS. It is therefore not possible for us to point to a Protein_ID as a secondary identifier. In these cases we point to the relevant EMBL entries by including a dash ('-') in the position of the missing Protein_ID and 'NOT_ANNOTATED_CDS' in the status identifier. Example:

DR   EMBL; J04126; -; NOT_ANNOTATED_CDS.
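The EMBL cross-reference format and its status identifiers can be checked mechanically. The following is a minimal Python sketch, not part of SWISS-PROT itself: the function name `parse_embl_dr`, the returned dictionary layout and the short status descriptions are illustrative assumptions, while the field order, the Protein_ID pattern and the list of status identifiers follow the text above.

```python
import re

# Status identifiers described in section 3.11.5 (descriptions paraphrased).
EMBL_STATUSES = {
    "-": "CDS translation matches, or differences are annotated in FT/RP lines",
    "ALT_INIT": "different start of the sequence",
    "ALT_TERM": "different termination of the sequence",
    "ALT_FRAME": "frameshift(s) in the EMBL sequence",
    "ALT_SEQ": "other difference, or a mixture of the above",
    "JOINED": "entry holds exons joined to a CDS found in another entry",
    "NOT_ANNOTATED_CDS": "no CDS feature, hence no Protein_ID",
}

# Protein_ID: 3 letters + 5 digits, a decimal point, then a version number.
PROTEIN_ID_RE = re.compile(r"^[A-Z]{3}[0-9]{5}\.[0-9]+$")


def parse_embl_dr(line):
    """Split a 'DR   EMBL; ...' line into its fields (hypothetical helper)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip()
    if not body.endswith("."):
        raise ValueError("DR line must end with a period")
    fields = [field.strip() for field in body[:-1].split(";")]
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference")
    _, accession, protein_id, status = fields
    if status not in EMBL_STATUSES:
        raise ValueError("unknown status identifier: " + status)
    # A dash stands in for a missing Protein_ID (case d above).
    if protein_id != "-" and not PROTEIN_ID_RE.match(protein_id):
        raise ValueError("malformed Protein_ID: " + protein_id)
    return {"accession": accession, "protein_id": protein_id, "status": status}
```

Rejecting unknown status identifiers outright is a design choice of this sketch; a more tolerant reader might keep them as free text, since new identifiers could be introduced in later releases.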

(3.11.6). Cross-references to the PROSITE and Pfam databases

The specific format for cross-references to the PROSITE and Pfam protein domain and family databases is:

DR   PROSITE | PFAM; ACCESSION_NUMBER; ENTRY_NAME; STATUS.

Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE or Pfam pattern, profile or HMM-profile entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is one of the following:

     n
     FALSE_NEG
     PARTIAL
     UNKNOWN_n

Where 'n' is the number of hits of the pattern or profile in that particular protein sequence. The 'FALSE_NEG' status indicates that while the pattern or profile did not detect the protein sequence, it is a member of that particular family or domain. The 'PARTIAL' status indicates that the pattern or profile did not detect the sequence because that sequence is not complete and lacks the region on which the pattern/profile is based. Finally the 'UNKNOWN' status indicates uncertainty as to whether the sequence is a member of the family or domain described by the pattern/profile. Pfam cross-references do not make use of the 'FALSE_NEG' and 'UNKNOWN' statuses.

Examples of PROSITE and Pfam cross-references:

DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
DR   PROSITE; PS00028; ZINC_FINGER_C2H2; 6.
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR; FALSE_NEG.
DR   PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL.
DR   PROSITE; PS00383; TYR_PHOSPHATASE_1; UNKNOWN_1.

DR   PFAM; PF00017; SH2; 1.
DR   PFAM; PF00008; EGF; 8.
DR   PFAM; PF00595; PDZ; PARTIAL.
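As an illustration of the four STATUS forms, the sketch below classifies a status string. It is only one possible reading of the format: the function name `parse_domain_status` and the label 'HIT' for a plain hit count are assumptions made for this example and do not appear in the database.

```python
import re


def parse_domain_status(status):
    """Interpret the STATUS field of a PROSITE or PFAM DR line.

    Returns a (kind, hits) tuple; hits is None when no count is given.
    The kind label 'HIT' is a name chosen for this sketch only.
    """
    if status in ("FALSE_NEG", "PARTIAL"):
        return (status, None)
    # Either a bare number of hits ('n') or 'UNKNOWN_n'.
    match = re.fullmatch(r"(UNKNOWN_)?([0-9]+)", status)
    if match is None:
        raise ValueError("unrecognized status: " + status)
    kind = "UNKNOWN" if match.group(1) else "HIT"
    return (kind, int(match.group(2)))
```

Applied to the examples above, ZINC_FINGER_C2H2 with status '6' is read as six hits, and TYR_PHOSPHATASE_1 with 'UNKNOWN_1' as one hit of uncertain significance.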

(3.12). The KW line

The KW (KeyWord) lines provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories. The keywords chosen for each entry serve as a subject reference for the sequence. The SWISS-PROT document KEYWLIST.TXT lists all the keywords that are used in the database. Often several KW lines are necessary for a single entry. The format of the KW line is:

KW   Keyword[; Keyword...].

More than one keyword may be listed on each KW line; semicolons separate the keywords, and the last keyword is followed by a period. Keywords may consist of more than one word (they may contain blanks), but are never split between lines. An example of a KW line is:

KW   Oxidoreductase; Acetylation.

The order of the keywords is not significant. The above example could also have been written:

KW   Acetylation; Oxidoreductase.
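Because keyword order carries no meaning, a reader of KW lines can reasonably collect the keywords into a set. The sketch below assumes that approach; the function name `parse_kw_lines` is hypothetical.

```python
def parse_kw_lines(lines):
    """Collect the keywords of one entry from its KW lines (hypothetical helper)."""
    keywords = set()
    for line in lines:
        if not line.startswith("KW"):
            continue
        body = line[2:].strip()
        # Intermediate KW lines end with ';', the last one with '.'.
        if body.endswith(".") or body.endswith(";"):
            body = body[:-1]
        keywords.update(k.strip() for k in body.split(";") if k.strip())
    return keywords
```

With this representation the two example lines above compare as equal, which mirrors the statement that the order of keywords is not significant.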

(3.13). The FT line

The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. The table describes regions or sites of interest in the sequence. In general the feature table lists posttranslational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included in the feature table. The feature table is updated when more becomes known about a given sequence.

The FT lines have a fixed format. The column numbers allocated to each of the data items within each FT line are shown in the following table (column numbers not referred to in the table are always occupied by blanks).

Columns     Data item
 1- 2       FT
 6-13       Key name
15-20       'FROM' endpoint
22-27       'TO' endpoint
35-75       Description

The key name and the endpoints are always on a single line, but the description may require continuation. For this purpose, the next line contains blanks in the key, the 'FROM', and the 'TO' column positions, and the description is continued in its normal position. Thus a blank key always denotes a continuation of the previous description. An example of a feature table is shown below:

[Example feature table only partly recoverable from the source. It consisted of ten FT lines using, among others, the keys NON_TER, SIGNAL, CHAIN, PROPEP and MOD_RES; the surviving description fragments are 'S (IN REF. 2).' and 'MISSING (IN REF. 3).']

The first item on each FT line is the key name, which is a fixed abbreviation (up to 8 characters) with a defined meaning. A list of the currently defined key names can be found in Appendix A of this document. Following the key name are the 'FROM' and 'TO' endpoint specifications. These fields designate (inclusively) the endpoints of the feature named in the key field. In general, these fields simply contain residue numbers indicating positions in the sequence as listed. Note that these positions are always specified assuming a numbering of the listed sequence from 1 to n; this numbering is not necessarily the same as that used in the original reference(s). The following should be noted in interpreting these endpoints:

• If the 'FROM' and 'TO' specifications are equal, the feature indicated consists of the single amino acid at that position;
• When a feature is known to extend beyond the end(s) of the sequenced region, the endpoint specification will be preceded by '<' for features which continue to the left end (N-terminal direction) or by '>' for features which continue to the right end (C-terminal direction);
• Unknown endpoints are denoted by '?'.

See also the notes concerning each of the key names in Appendix A.


The remaining portion of the FT line is a description that contains additional information about the feature. For example, for a post-translationally modified residue (key MOD_RES) the chemical nature of that modification is given, while for a sequence variation (key VARIANT) the nature of the variation is indicated. This portion of the line is generally in free form, and may be continued on additional lines when necessary.
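The fixed-column layout and the continuation rule described in this section lend themselves to a simple slicing parser. The following Python sketch is one way to read FT lines under those rules; the function names `parse_endpoint` and `parse_ft_lines` and the dictionary layout are assumptions for illustration. Column numbers in the manual are 1-based, so columns 6-13 become the slice `[5:13]`, and so on.

```python
def parse_endpoint(field):
    """Return (position, flag) for a 'FROM' or 'TO' field.

    flag is '<' or '>' when the feature extends beyond the sequenced
    region, '?' when the endpoint is unknown, and '' otherwise.
    """
    text = field.strip()
    if text == "?":
        return (None, "?")
    flag = ""
    if text[:1] in ("<", ">"):
        flag, text = text[0], text[1:]
    return (int(text), flag)


def parse_ft_lines(lines):
    """Parse FT lines into a list of feature dictionaries (hypothetical helper)."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        key = line[5:13].strip()            # columns 6-13
        description = line[34:75].strip()   # columns 35-75
        if not key:
            # A blank key denotes a continuation of the previous description.
            if not features:
                raise ValueError("continuation line without a feature")
            previous = features[-1]
            previous["description"] = (previous["description"] + " " + description).strip()
            continue
        features.append({
            "key": key,
            "from": parse_endpoint(line[14:20]),  # columns 15-20
            "to": parse_endpoint(line[21:27]),    # columns 22-27
            "description": description,
        })
    return features
```

Joining continuation text with a single space is a simplification; words that were hyphenated across lines would need extra handling.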

(3.14). The SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content. The format of the SQ line is:

SQ   SEQUENCE   XXXX AA;  XXXXX MW;  XXXXXXXXXXXXXXXX CRC64;

The line contains the length of the sequence in amino acids ('AA') followed by the molecular weight ('MW') rounded to the nearest mass unit (Dalton) and the sequence 64-bit CRC (Cyclic Redundancy Check) value ('CRC64'). The algorithm to compute the CRC64 is described in the ISO 3309 standard. It should be noted that, while in theory two different sequences could have the same CRC64 value, the likelihood of this happening is quite low. An example of an SQ line is shown here:

SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;

The information in the SQ line can be used as a check on accuracy or for statistical purposes. The word 'SEQUENCE' is present solely for readability.
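To use the SQ line as an accuracy check, its three values first have to be extracted. The sketch below does this with a regular expression; the names `SQ_RE` and `parse_sq_line` and the exact whitespace tolerance are assumptions of this example. It does not compute the CRC64 itself.

```python
import re

# Length in residues, molecular weight in Dalton, CRC64 as 16 hexadecimal digits.
SQ_RE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+([0-9A-F]{16})\s+CRC64;\s*$"
)


def parse_sq_line(line):
    """Extract length, molecular weight and CRC64 from an SQ line."""
    match = SQ_RE.match(line)
    if match is None:
        raise ValueError("not a valid SQ line")
    length, weight, crc64 = match.groups()
    return {"length": int(length), "mol_weight": int(weight), "crc64": crc64}
```

A typical check would compare the parsed `length` with the number of residues actually read from the sequence data lines that follow.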

(3.15). The sequence data line

The sequence data line has a line code consisting of two blanks rather than the two-letter codes used up until now. The sequence is written 60 amino acids per line, in groups of 10 amino acids, beginning in position 6 of the line. The characters used for the amino acids are the standard IUPAC one letter codes (see Appendix B). An example of sequence data lines is shown here:

     GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG
     EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE
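The layout rule (60 residues per line, groups of 10, starting in position 6) can be sketched in a few lines of Python. Both function names below are hypothetical; the defaults simply restate the numbers given in this section.

```python
def format_sequence(sequence, per_line=60, group=10):
    """Lay out a sequence as SWISS-PROT sequence data lines."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # Five leading blanks place the first residue in position 6.
        lines.append("     " + " ".join(groups))
    return lines


def read_sequence(lines):
    """Recover the plain sequence by dropping all blanks from the data lines."""
    return "".join("".join(line.split()) for line in lines)
```

Reading is deliberately more lenient than writing: `read_sequence` ignores all whitespace, so it also accepts lines whose grouping is slightly off.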


(3.16). The // line

The // (terminator) line contains no data or comments. It designates the end of an entry.


Appendix A: Feature table keys

The definition of each of the key names used in the feature table is explained here. It is probable that new key names will be progressively added to this list. For each key a number of examples are presented.

(A.1). Change indicators

CONFLICT - Different papers report differing sequences.
Examples of CONFLICT key feature lines:

FT   CONFLICT     33     33       MISSING (IN REF. 2).
FT   CONFLICT     60     60       P -> A (IN REF. 3 AND 4).
FT   CONFLICT     81     84       ASTQ -> GWT (IN REF. 3).

VARIANT - Authors report that sequence variants exist.
Examples of VARIANT key feature lines:

FT   VARIANT       3      3       V -> I.
FT   VARIANT      87     87       L -> T (IN STRAIN 2.3.1).
FT   VARIANT       1      2       MISSING (IN 25% OF THE CHAINS).

VARSPLIC - Description of sequence variants produced by alternative splicing.
Examples of VARSPLIC key feature lines:

FT   VARSPLIC    194    196       GRP -> DVR (IN SHORT FORM).
FT   VARSPLIC    197    211       MISSING (IN SHORT FORM).

MUTAGEN - Site which has been experimentally altered.
Examples of MUTAGEN key feature lines:

FT   MUTAGEN      65     65       H->F: 100% LOSS OF ACTIVITY.
FT   MUTAGEN     123    123       G->R,L,M: DNA BINDING LOST.

(A.2). Amino-acid modifications

MOD_RES - Post-translational modification of a residue.
The chemical nature of the modification is given in the description. The general format of the MOD_RES description field is:

FT   MOD_RES     xxx    xxx       MODIFICATION (COMMENT).

The most frequently occurring modifications are listed below.

Modification                   Description
ACETYLATION                    N-terminal or other
AMIDATION                      Generally at the C-terminal of a mature active peptide
BLOCKED                        Undetermined N- or C-terminal blocking group
FORMYLATION                    Of the N-terminal methionine
GAMMA-CARBOXYGLUTAMIC ACID     Of glutamate
HYDROXYLATION                  Of asparagine, aspartic acid, proline or lysine
METHYLATION                    Generally of lysine or arginine
PHOSPHORYLATION                Of serine, threonine, tyrosine, aspartic acid or histidine
PYRROLIDONE CARBOXYLIC ACID    N-terminal glutamate which has formed an internal cyclic lactam
SULFATATION                    Generally of tyrosine

Examples of MOD_RES key feature lines:

FT   MOD_RES       1      1       ACETYLATION.
FT   MOD_RES      11     11       PHOSPHORYLATION (BY PKC).
FT   MOD_RES       2      2       SULFATATION (BY SIMILARITY).
FT   MOD_RES       8      8       AMIDATION (G-9 PROVIDE AMIDE GROUP).
FT   MOD_RES       9      9       METHYLATION (MONO-, DI- & TRI-).
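The 'MODIFICATION (COMMENT).' layout of the MOD_RES description can be split into its two parts with a regular expression. This is a sketch under the assumption that the modification name itself never contains parentheses, which holds for the modifications listed above; the names `MOD_RES_RE` and `parse_mod_res` are hypothetical.

```python
import re

# Modification name, optional parenthesized comment, terminating period.
MOD_RES_RE = re.compile(r"^([^()]+?)(?:\s*\((.*)\))?\.$")


def parse_mod_res(description):
    """Split a MOD_RES description into (modification, comment or None)."""
    match = MOD_RES_RE.match(description.strip())
    if match is None:
        raise ValueError("unexpected MOD_RES description: " + description)
    return (match.group(1).strip(), match.group(2))
```

The same pattern would not be appropriate for CARBOHYD descriptions, which can carry two parenthesized parts.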

LIPID - Covalent binding of a lipidic moiety.
The chemical nature of the bound lipid moiety is given in the description. The general format of the LIPID description field is:

FT   LIPID       xxx    xxx       NAME OF THE ATTACHED GROUP (COMMENT).

The attached groups that are currently defined are listed below.

Attached group       Description
MYRISTATE            Myristate group attached through an amide bond to the
                     N-terminal glycine residue of the mature form of a protein
                     [1,2] or to an internal lysine residue
PALMITATE            Palmitate group attached through a thioether bond to a
                     cysteine residue or through an ester bond to a serine or
                     threonine residue [1,2]
FARNESYL             Farnesyl group attached through a thioether bond to a
                     cysteine residue [3,4]
GERANYL-GERANYL      Geranyl-geranyl group attached through a thioether bond to a
                     cysteine residue [3,4]
GPI-ANCHOR           Glycosyl-phosphatidylinositol (GPI) group linked to the
                     alpha-carboxyl group of the C-terminal residue of the mature
                     form of a protein [5,6]
N-ACYL DIGLYCERIDE   N-terminal cysteine of the mature form of a prokaryotic
                     lipoprotein with an amide-linked fatty acid and a glyceryl
                     group to which two fatty acids are linked by ester
                     linkages [7]

References:

[1] Grand R.J.A. Biochem. J. 258:626-638(1989).
[2] McLhinney R.A.J. Trends Biochem. Sci. 15:387-391(1990).
[3] Glomset J.A., Gelb M.H., Farnsworth C.C. Trends Biochem. Sci. 15:139-142(1990).
[4] Sinensky M., Lutz R.J. BioEssays 14:25-31(1992).
[5] Low M.G. FASEB J. 3:1600-1608(1989).
[6] Low M.G. Biochim. Biophys. Acta 988:427-454(1989).
[7] Hayashi S., Wu H.C. J. Bioenerg. Biomembr. 22:451-471(1990).

Examples of LIPID key feature lines:

FT   LIPID         1      1       MYRISTATE.
FT   LIPID        65     65       PALMITATE (BY SIMILARITY).
FT   LIPID       354    354       GPI-ANCHOR.

DISULFID - Disulfide bond.
The 'FROM' and 'TO' endpoints represent the two residues which are linked by an intra-chain disulfide bond. If the 'FROM' and 'TO' endpoints are identical, the disulfide bond is an interchain one and the description field indicates the nature of the cross-link.
Examples of DISULFID key feature lines:

FT   DISULFID     27     44       PROBABLE.
FT   DISULFID     14     14       INTERCHAIN (WITH A LIGHT CHAIN).

THIOLEST - Thiolester bond.
The 'FROM' and 'TO' endpoints represent the two residues which are linked by the thiolester bond.

THIOETH - Thioether bond.
The 'FROM' and 'TO' endpoints represent the two residues which are linked by the thioether bond.

CARBOHYD - Glycosylation site.
This key describes the occurrence of the attachment of a glycan (mono- or polysaccharide) to a residue of the protein:

• The type of linkage (C-, N- or O-linked) to the protein is indicated.
• If the nature of the reducing terminal sugar is known, its abbreviation is shown in parentheses. If three dots '...' follow the abbreviation, this indicates extension of the carbohydrate chain. Conversely, the absence of the dots indicates that a single monosaccharide is linked.

Examples of CARBOHYD key feature lines:

FT   CARBOHYD     52     52       N-LINKED (GLCNAC...) (POTENTIAL).
FT   CARBOHYD    162    162       O-LINKED (GLCNAC).
FT   CARBOHYD     10     10       O-LINKED (GALNAC...) (BY SIMILARITY).
FT   CARBOHYD     34     34       C-LINKED (MAN).

SE_CYS - Selenocysteine.
This key describes the occurrence of a selenocysteine in the sequence record.
Examples:

FT   SE_CYS       58     58
FT   SE_CYS       12     12       POTENTIAL.

METAL - Binding site for a metal ion.
The description field indicates the nature of the metal.
Examples of METAL key feature lines:

FT   METAL        18     18       IRON (HEME AXIAL LIGAND).
FT   METAL        87     87       COPPER (POTENTIAL).

BINDING - Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
The chemical nature of the group is given in the description field.
Examples of BINDING key feature lines:

FT   BINDING      14     14       HEME (COVALENT).
FT   BINDING     250    250       PYRIDOXAL PHOSPHATE.

(A.3). Regions

SIGNAL - Extent of a signal sequence (prepeptide).

TRANSIT - Extent of a transit peptide (mitochondrial, chloroplastic, cyanelle or for a microbody).
Examples of TRANSIT key feature lines:

FT   TRANSIT       1     42       CHLOROPLAST.
FT   TRANSIT       1     34       CYANELLE (BY SIMILARITY).
FT   TRANSIT       1     25       MITOCHONDRION.
FT   TRANSIT       1     23       MICROBODY (POTENTIAL).


PROPEP - Extent of a propeptide.
Examples of PROPEP key feature lines:

FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   PROPEP      550    574       REMOVED IN MATURE FORM.

CHAIN - Extent of a polypeptide chain in the mature protein.
Examples of CHAIN key feature lines:

FT   CHAIN        21    119       BETA-2 MICROGLOBULIN.
FT   CHAIN        37    >42       FACTOR XIIIA.

PEPTIDE - Extent of a released active peptide.
Examples of PEPTIDE key feature lines:

FT   PEPTIDE      13    107       NEUROPHYSIN 2.
FT   PEPTIDE     235    239       MET-ENKEPHALIN.

DOMAIN - Extent of a domain of interest on the sequence.
The nature of that domain is given in the description field.
Examples of DOMAIN key feature lines:

FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).
FT   DOMAIN      140    152       ANCESTRAL CALCIUM SITE.

CA_BIND - Extent of a calcium-binding region.

DNA_BIND - Extent of a DNA-binding region.
The nature of the DNA-binding region is given in the description field.
Examples of DNA_BIND key feature lines:

FT   DNA_BIND    335    415       ETS-DOMAIN.
FT   DNA_BIND    224    283       HOMEOBOX.
FT   DNA_BIND     16     67       MYB.
FT   DNA_BIND    135    200       TEA-DOMAIN.

NP_BIND - Extent of a nucleotide phosphate binding region.
The nature of the nucleotide phosphate is indicated in the description field.
Examples of NP_BIND key feature lines:

FT   NP_BIND      13     25       ATP.
FT   NP_BIND      45     49       GTP (POTENTIAL).
FT   NP_BIND       8     34       FAD (ADP PART).


TRANSMEM - Extent of a transmembrane region.

ZN_FING - Extent of a zinc finger region.
The zinc finger 'category' is indicated in the description field.
Examples of ZN_FING key feature lines:

FT   ZN_FING     110    134       GATA-TYPE.
FT   ZN_FING     559    579       C4-TYPE.

SIMILAR - Extent of a similarity with another protein sequence.
Precise information relative to that sequence is given in the description field.
Examples of SIMILAR key feature lines:

FT   SIMILAR     351    456       STRONG, WITH KAPPA CHAIN V REGIONS.
FT   SIMILAR     580   1182       WITH ERBB TRANSFORMING PROTEIN.

REPEAT - Extent of an internal sequence repetition.
Examples of REPEAT key feature lines:

FT   REPEAT       75     85       1.
FT   REPEAT       86     96       2.
FT   REPEAT       97    107       3 (APPROXIMATE).

(A.4). Secondary structure

The feature table of sequence entries of proteins whose tertiary structure is known experimentally contains the secondary structure information corresponding to that protein. The secondary structure assignment is made according to DSSP (see Kabsch W., Sander C.; Biopolymers, 22:2577-2637(1983)) and the information is extracted from the coordinate data sets of the Protein Data Bank (PDB).

In the feature table only three types of secondary structure are specified: helices (key HELIX), beta-strands (key STRAND) and turns (key TURN). Residues not specified in one of these classes are in a 'loop' or 'random-coil' structure. Because the DSSP assignment has more than the three common secondary structure classes, we have converted the different DSSP assignments to HELIX, STRAND, and TURN as shown in the table below.

DSSP code   DSSP definition                                   SWISS-PROT assignment
H           Alpha-helix                                       HELIX
G           3(10) helix                                       HELIX
I           Pi-helix                                          HELIX
E           Hydrogen bonded beta-strand (extended strand)     STRAND
B           Residue in an isolated beta-bridge                STRAND
T           H-bonded turn (3-turn, 4-turn or 5-turn)          TURN
S           Bend (five-residue bend centered at residue i)    Not specified

One should be aware of the following facts:

a) Segment length. For helices (alpha and 3-10), the residue just before and just after the helix as given by DSSP participates in the helical hydrogen bonding pattern with a single H-bond. For some practical purposes, one can therefore extend the HELIX range by one residue on each side, e.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of secondary structure segments are less well defined for lower-resolution structures. A fluctuation of +/- one residue is common.

b) Missing segments. In low-resolution structures, badly formed helices or strands may be omitted in the DSSP definition.

c) Special helices and strands. Helices of length three are 3-10 helices, those of length four and longer are either alpha-helices or 3-10 helices (pi helices are extremely rare). A strand of length one corresponds to a residue in an isolated beta-bridge. Such bridges can be structurally important.

d) Missing secondary structure. No secondary structure is currently given in the feature table in the following cases:

   • No sequence data in the PDB entry;
   • Structure for which only C-alpha coordinates are in PDB;
   • NMR structure with more than one coordinate data set;
   • Model (i.e. theoretical) structure.

Examples:

FT   HELIX         3     14
FT   TURN         15     15
FT   TURN         20     21
FT   STRAND       23     23
FT   HELIX        25     35
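The conversion table above amounts to a small lookup followed by grouping of consecutive residues. The sketch below illustrates this; it assumes the DSSP assignment is available as a string with one code per residue (any other character, including 'S', meaning no assignment) and that adjacent residues mapping to the same SWISS-PROT key are merged into one feature. Both assumptions, and the names used, are choices made for this example rather than a description of how SWISS-PROT derives its features.

```python
# DSSP codes mapped to SWISS-PROT keys; 'S' (bend) is deliberately absent.
DSSP_TO_SWISSPROT = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
}


def dssp_to_features(dssp_string):
    """Convert a per-residue DSSP string to (key, from, to) tuples.

    Positions are 1-based and inclusive, as in the feature table.
    """
    features = []
    current = None
    start = 0
    for position, code in enumerate(dssp_string, start=1):
        key = DSSP_TO_SWISSPROT.get(code)
        if key != current:
            if current is not None:
                features.append((current, start, position - 1))
            current, start = key, position
    if current is not None:
        features.append((current, start, len(dssp_string)))
    return features
```

For the input `"-HHHHTT-EE"` this yields a HELIX at 2-5, a TURN at 6-7 and a STRAND at 9-10; a single 'B' residue produces a STRAND of length one, matching note c) above.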

(A.5). Others

ACT_SITE - Amino acid(s) involved in the activity of an enzyme.
Examples of ACT_SITE key feature lines:

FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.
FT   ACT_SITE     99     99       CHARGE RELAY SYSTEM.


SITE - Any other interesting site on the sequence.
Examples of SITE key feature lines:

FT   SITE        285    288       PREVENTS SECRETION FROM ER.
FT   SITE        241    242       CLEAVAGE (BY ANIMAL COLLAGENASES).

INIT_MET - Initiator methionine.
This feature key is mostly associated with a zero value in the 'FROM' and 'TO' fields to indicate that the initiator methionine has been cleaved off and is not shown in the sequence:

FT   INIT_MET      0      0

It is not used for cases where the initiator methionine is not cleaved off, except to indicate internal alternative initiation sites. Example:

FT   INIT_MET     44     44       FOR CYTOPLASMIC ISOFORM.

NON_TER - The residue at an extremity of the sequence is not the terminal residue.
If applied to position 1, this signifies that the first position is not the N-terminus of the complete molecule. If applied to the last position, it signifies that this position is not the C-terminus of the complete molecule. There is no description field for this key.
Examples of NON_TER key feature lines:

FT   NON_TER       1      1
FT   NON_TER     150    150

NON_CONS - Non-consecutive residues.
Indicates that two residues in a sequence are not consecutive and that there are a number of unsequenced residues between them.
Examples of NON_CONS key feature lines:

FT   NON_CONS   1036   1037
FT   NON_CONS     33     34       N-TERMINAL / C-TERMINAL.

UNSURE - Uncertainties in the sequence.
Used to describe region(s) of a sequence for which the authors are unsure about the sequence assignment.


Appendix B: Amino acid codes

The one-letter and three-letter codes for amino acids used in SWISS-PROT are those adopted by the commission on Biochemical Nomenclature of the IUPAC-IUB (see the reference listed below).

One-letter code   Three-letter code   Amino-acid name
A                 Ala                 Alanine
R                 Arg                 Arginine
N                 Asn                 Asparagine
D                 Asp                 Aspartic acid
C                 Cys                 Cysteine
Q                 Gln                 Glutamine
E                 Glu                 Glutamic acid
G                 Gly                 Glycine
H                 His                 Histidine
I                 Ile                 Isoleucine
L                 Leu                 Leucine
K                 Lys                 Lysine
M                 Met                 Methionine
F                 Phe                 Phenylalanine
P                 Pro                 Proline
S                 Ser                 Serine
T                 Thr                 Threonine
W                 Trp                 Tryptophan
Y                 Tyr                 Tyrosine
V                 Val                 Valine
B                 Asx                 Aspartic acid or Asparagine
Z                 Glx                 Glutamic acid or Glutamine
X                 Xaa                 Any amino acid

Reference

IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). Nomenclature and Symbolism for Amino Acids and Peptides. Recommendations 1983. Eur. J. Biochem. 138:9-37(1984).

See also: http://www.chem.qmw.ac.uk/iupac/AminoAcid/
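The code table can be used directly to validate sequence data or to convert between the two notations. The following Python sketch encodes the 23 codes listed above; the dictionary and function names are hypothetical, and joining three-letter codes with hyphens is just one common convention.

```python
# One-letter to three-letter codes, as listed in Appendix B.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "X": "Xaa",
}


def invalid_residues(sequence):
    """Return the sorted characters of a sequence that are not valid codes."""
    return sorted(set(sequence) - set(ONE_TO_THREE))


def to_three_letter(sequence):
    """Translate a one-letter sequence into hyphen-joined three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in sequence)
```

Calling `invalid_residues` before `to_three_letter` avoids a `KeyError` on characters outside the table.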


Appendix C: Format differences between the SWISS-PROT and EMBL databases

(C.1). Generalities

The format of SWISS-PROT follows as closely as possible that of the EMBL database. The general structure of an entry is identical in both databases. The data classes used in both databases are the same except that SWISS-PROT does not make use of the 'BACKBONE', 'UNREVIEWED' and 'UNANNOTATED' data classes. One line type used in SWISS-PROT does not exist in the EMBL database (see section C.3); conversely SWISS-PROT does not currently make use of every EMBL line type (see section C.4).

(C.2). Differences in line types present in both databases

(C.2.1). The ID line (IDentification)

Differences with the EMBL database ID line format are:

• The entry name can be up to 10 characters long (instead of 9 in EMBL) and can begin with a numerical character;
• EMBL entry ID lines have an additional three-letter taxonomic division 'token' inserted between the data class and the molecule type;
• The molecule type is listed as 'PRT' rather than 'DNA' or 'RNA';
• The length of the molecule is followed by 'AA' (Amino Acid) instead of 'BP' (Base Pairs).

(C.2.2). The AC line (ACcession number)

The format of this line type completely follows that defined by the EMBL database. SWISS-PROT accession numbers do not overlap with those used in the EMBL/GenBank/DDBJ nucleotide sequence database. However, it should be noted that there are differences in the format of the accession numbers themselves. In SWISS-PROT accession numbers consist of 6 alphanumerical characters in the following format:

Position     1          2        3             4             5             6
Character    [O,P,Q]    [0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [0-9]

Examples: P01234; Q1AA12.

In EMBL, two different types of accession numbers co-exist:

1. Accession numbers with 6 alphanumerical characters, where the first character is any letter with the exception of O, P or Q and the five other characters are numbers (example: M23765);
2. Accession numbers with 8 alphanumerical characters, where the first two characters are letters and the following six characters are numbers (example: AB001084).
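The accession number formats described above translate directly into regular expressions, which also makes the non-overlap between the two databases easy to see. The sketch below is illustrative only; the names `SWISSPROT_AC_RE`, `EMBL_AC_RE` and `classify_accession` are assumptions, and upper-case input is assumed.

```python
import re

# SWISS-PROT: [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9]
SWISSPROT_AC_RE = re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")

# EMBL: one letter other than O, P or Q plus five digits,
# or two letters plus six digits.
EMBL_AC_RE = re.compile(r"^(?:[A-NR-Z][0-9]{5}|[A-Z]{2}[0-9]{6})$")


def classify_accession(accession):
    """Return 'SWISS-PROT', 'EMBL' or None for an accession number string."""
    if SWISSPROT_AC_RE.match(accession):
        return "SWISS-PROT"
    if EMBL_AC_RE.match(accession):
        return "EMBL"
    return None
```

Because the first character of a six-character EMBL accession number can never be O, P or Q, no string matches both patterns.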


(C.2.3). The DT line (DaTe)

Differences with the EMBL database DT line format are:

• In EMBL there are two DT lines per entry instead of three in SWISS-PROT;
• In EMBL the format of the DT line that indicates when an entry was created is identical to that defined in SWISS-PROT; but the two DT lines that convey information relevant to the updating of an entry are replaced by a single line in EMBL. This is shown in the example below.

DT lines in a SWISS-PROT entry:

DT   21-JUL-1986 (Rel. 01, Created)
DT   23-OCT-1986 (Rel. 02, Last sequence update)
DT   01-APR-1990 (Rel. 14, Last annotation update)

DT lines in an EMBL database entry:

DT   10-MAR-1990 (Rel. 22, Created)
DT   12-APR-1990 (Rel. 23, Last updated, Version 3)

(C.2.4). The DE line (DEscription)

• In SWISS-PROT the species of origin is not included in the description;
• In EMBL the last DE line is not terminated by a period.

(C.2.5). The OS line (Organism Species)

• In some cases the SWISS-PROT OS line includes more than one organism name (when the relevant sequence is completely conserved in different species);
• In EMBL the last OS line is not terminated by a period.

(C.2.6). The OG line (OrGanelle)

• EMBL makes a distinction between 'Mitochondrion' and 'Kinetoplast', while SWISS-PROT does not use the latter designation;
• EMBL makes a distinction between 'Chloroplast' and 'Plastid', while SWISS-PROT does not use the latter designation;
• In EMBL the OG line is not terminated by a period.

(C.2.7). The RP and RC lines

• In EMBL, contrary to SWISS-PROT, the RC line precedes the RP line;
• In EMBL the RC line is in free format and is generally not used.


(C.2.8). The RT line (Reference Title)

• In EMBL the reference title is not terminated by a period, a question mark or an exclamation mark.

(C.2.9). The FT line (Feature Table)

The format of this line is totally different from that currently defined for the EMBL database. The format used in SWISS-PROT is similar to that which was used in older versions of the EMBL database, prior to the introduction of the common EMBL/GenBank/DDBJ feature table.

(C.2.10). The CC line (Comment)

The comment lines, which are free text and can appear anywhere in an EMBL entry, are grouped together in the SWISS-PROT database. They are always listed below the last reference line, and follow a precise syntax (see section 3.10).

(C.2.11). The SQ line (SeQuence header)

Although the rough format and purpose of this line type is conserved, its exact content differs from that of the EMBL database. The numerical length of the sequence is listed, followed by 'AA' (Amino Acid) instead of 'BP' (Base Pairs). To replace the sequence composition which, for protein sequences, would not fit in a single line, the molecular weight and the 64-bit CRC (Cyclic Redundancy Check) value of the sequence are indicated.

(C.3). Line types defined by SWISS-PROT but currently not used by EMBL

Presently, there is only one line type which exists in SWISS-PROT and which is not used in the EMBL database; it is the GN line.

(C.4). Line types defined by EMBL but currently not used by SWISS-PROT

There are three line types which exist in the EMBL database and which are not, presently, used in SWISS-PROT:

• FH and XX. The FH and XX lines contain no data and are present in EMBL only to improve readability of an entry when it is printed or displayed on a terminal screen. These lines are not included in SWISS-PROT so as to keep it as compact as possible and thereby facilitate its use on small computer systems.
• SV. The SV (Sequence Version) line contains an identifier specific to nucleic acid sequences. It has no meaning in the context of SWISS-PROT.
