Vol. 16 no. 6 2000 Pages 520–526
BIOINFORMATICS
Quick selection of representative protein chain sets based on customizable requirements
Tamotsu Noguchi 1,∗, Kentaro Onizuka 1, Makoto Ando 1,3, Hideo Matsuda 2 and Yutaka Akiyama 1
1 Parallel Application TRC Laboratory, Real World Computing Partnership, Tsukuba Mitsui Building 1-6-1 Takezono, Tsukuba-shi Ibaraki 305-0032, Japan and 2 Department of Informatics and Mathematical Science, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka-shi Osaka 560-8531, Japan
Received on October 22, 1999; accepted on February 3, 2000
Abstract
Motivation: Protein structure classification has been recognized as one of the most important research issues in protein structure analysis. A substantial number of classification methods have been proposed, and several databases have been constructed using these methods. Since some proteins with very similar sequences may exhibit structural diversity, we previously proposed PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB), whose selection strategy is based not only on sequence similarity but also on structural similarity. Forty-eight representative sets with predetermined similarity criteria were made available over the World Wide Web (WWW). However, the sets were insufficient in number to satisfy users researching protein structures by various methods.
Result: We have improved the system for PDB-REPRDB so that the user may obtain a quick selection of representative chains from PDB. The selection of representative chains can be dynamically configured according to the user’s requirements. The WWW interface provides a large degree of freedom in setting parameters, such as cut-off scores of sequence and structural similarity. This paper describes the method we use to classify chains and select the representatives in the system. We also describe the interface used to set the parameters.
Availability: The system for PDB-REPRDB is available at the PAPIA WWW server (http://www.rwcp.or.jp/papia/).
Contact:
[email protected]
∗ To whom correspondence should be addressed.
3 Present address: Information Processing Systems Department, NKK Corporation, 1-1-2 Marunouchi, Chiyoda-ku, Tokyo 100-8202, Japan
© Oxford University Press 2000
Introduction
In recent years, the number of entries in the Protein Data Bank (PDB) (Bernstein et al., 1977) has been increasing rapidly, with the determination of large numbers of protein structures due to improved X-ray crystallography and NMR experimental techniques. These data are being used actively in studies of protein function and evolution, and with the surge in accumulated data, studies of protein structure prediction are thriving. But not all protein structure data in PDB are suitable for protein structure analysis. Many entries have insufficiently refined coordinate data, perhaps due to insufficient resolution in the X-ray crystallography or NMR spectroscopy. In many cases we should eliminate the imperfect data beforehand to achieve an accurate result. Moreover, a great many protein chains in PDB are similar in terms of sequence or structure. For an unbiased analysis, we should classify these chains and select only one representative from each group of similar chains.
At present, several classification databases have been proposed and are available on the World Wide Web (WWW). Hobohm et al. (1992) proposed a representative set of protein chains called ‘PDB SELECT’ (see also Hobohm and Sander, 1994), with a selection strategy based solely on sequence similarity. Chains that have similar sequences are automatically assumed to have similar structures. As a result, only one representative is selected out of each group regardless of any structural diversity within the group. The set of representatives in ‘PDB SELECT’, which has been updated regularly and is open to the public on an anonymous ftp site, is widely used in the community. Though it does not select a set of representatives, Homology-Derived Structures of Proteins (HSSP) (Sander and Schneider, 1991) is also a database based on sequence similarity; it contains sequence alignment data of similar sequences from PDB and SWISS-PROT. Although these strategies seem rational in terms of equality in homologous sequence elimination and are easy to implement, chances are that the selected set would not reflect local structural diversity between members of a protein family. Local structural diversity is informative for investigating the principles of the local conformation of proteins. Several protein chains whose sequence identity is greater than 90% have been found to exhibit structural diversity that is particularly local (Figure 1). In this case, the secondary structure at the reactive center loop in the active conformation of the molecule is different from that in the latent conformation, since the loop is cleaved in the active conformation (I chain) or is incorporated into the A-β-sheet in the latent conformation (L chain) (Skinner et al., 1997). Local structural diversity has also been found at insertion, deletion or mutation sites, since these sequence modifications cause structural changes.
On the other hand, CATH (Orengo et al., 1997), based on sequence and topology, and SCOP (Murzin et al., 1995), based on expert analysis (i.e. secondary structure and topology), have also been proposed. These classify the protein chains according to both sequence and structural similarity. CATH and SCOP are available on WWW sites and are widely used by structural biologists throughout the world. However, what they actually arrive at is a hierarchical classification of protein chains. They do not select representatives for the purposes of statistically unbiased structure analyses. FSSP (Holm and Sander, 1994) and HOMSTRAD (Mizuguchi et al., 1998), based on structural alignment, have also been proposed, but again a representative set is not selected.
We earlier reported ‘PDB-REPRDB’, a database of representative protein chains selected from PDB (Noguchi et al., 1997). The criteria used to select the representatives were: (a) quality of atomic coordinate data, (b) sequence uniqueness, and (c) conformation uniqueness that is particularly local.
We introduced the sequence identity (ID%) and the maximum distance between superimposed pairs of atoms from the two structures (‘Dmax’) as the respective measures of sequence and structural similarity; Dmax is more sensitive to local structural diversity than the root mean square deviation (RMSD). The previous version of PDB-REPRDB provided 48 representative sets (eight criteria for sequence similarity: ID% ≥ 25–95% with 10% increments, and six criteria for structural similarity: Dmax ≤ 10–50 Å with 10 Å increments, and ∞: differences in structure not considered) on the WWW. However, the sets were insufficient in number to satisfy users researching protein structures by various methods. We have now developed the interactive system for PDB-REPRDB further, to assure a quick selection of representative chain sets based on the user’s requirements. In this paper we report on this new system for PDB-REPRDB, which has been available since April 1999 at the Parallel Protein Information Analysis (PAPIA) system server (Akiyama et al., 1998).
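The contrast between Dmax and RMSD mentioned above can be illustrated with a small numerical sketch. It assumes the two chains are already superimposed and that residues are paired one-to-one; the function name `rmsd_and_dmax` and the toy coordinates are hypothetical and are not taken from the PDB-REPRDB system.

```python
import math

def rmsd_and_dmax(coords_a, coords_b):
    """Return (RMSD, Dmax) for two equal-length lists of superimposed (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Distance between each pair of corresponding atoms
    dists = [math.dist(p, q) for p, q in zip(coords_a, coords_b)]
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    dmax = max(dists)
    return rmsd, dmax

# Hypothetical data: 100 C-alpha positions on a line, 3.8 angstroms apart
chain_a = [(3.8 * i, 0.0, 0.0) for i in range(100)]
chain_b = list(chain_a)
chain_b[50] = (3.8 * 50, 10.0, 0.0)  # a single residue displaced by 10 angstroms

rmsd, dmax = rmsd_and_dmax(chain_a, chain_b)
print(f"RMSD = {rmsd:.2f}, Dmax = {dmax:.2f}")
```

In this toy case a single residue displaced by 10 Å leaves the RMSD at 1 Å because the deviation is averaged over 100 residues, whereas Dmax reports the full 10 Å.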
Methods
The policy of PDB-REPRDB generated by the new system remains the same as that of the original PDB-REPRDB (Noguchi et al., 1997), whose criteria for selecting the representative chains are quality of atomic coordinate data, sequence uniqueness, and conformation uniqueness, particularly local. Quality is important when selecting representatives from a structure database such as PDB, since structures differ slightly according to their quality, however similar the sequences are. This difference might lead to a different result in the comparison of structures or in secondary structure definition by programs such as DSSP (Kabsch and Sander, 1983). Under our criteria, the best-quality chain in each group is selected as the representative. In the new system, the user can exclude chains that are unnecessary for his/her own research and set the priority of quality factors for selecting the representatives. The selection procedure is almost completely automated. Furthermore, RMSD, which is commonly used as a measure of structural similarity, is added to the criteria for similarity. The operation of the system for PDB-REPRDB can be divided into two stages: (1) calculation of similarities between all pairs of protein chains, and (2) classification of those chains and selection of the representative chains according to the priorities specified by the user (Figure 2). Similarities between all pairs of protein chains are calculated beforehand, and an interactive system using the WWW classifies these chains and selects the representatives using the similarity data.
Calculation of similarities
A flowchart for the first stage is shown in Figure 3. The first stage is executed every time a new release of PDB appears. First, all chains are extracted from each PDB entry, and chains that match any of the following conditions are excluded: (a) DNA and RNA data, (b) theoretically modeled data, (c) short chains (l < 40 residues), or (d) data with non-standard amino acid residues at all residues. The current scope of our database is limited to protein structures determined by experiment (i.e. by X-ray crystallography, NMR, etc.). This excludes chains for (a) DNA and RNA, (b) peptides, or (c) data derived by theoretical modeling. Although (1) chains without backbone coordinates (i.e. with only Cα coordinates) at all residues, (2) chains without side chain coordinates at all residues, or (3) chains without refinement were excluded at this step in the original version, those chains
Fig. 1. An example of local structural diversity between protein chains whose sequence identity is greater than 90%. The superimposed L and I chains of Antithrombin (PDB ID: 2ant) are shown as a blue and red ribbon, respectively. The conformation of the reactive center loop is cleaved in the active conformation (I). In the latent conformation (L), the loop is incorporated into the β sheet.
[Figure 2 diagram. Stage 1: PDB → list of protein chains → calculate the similarities (ID%, RMSD and Dmax) between all pairs of chains (parallelized using MPI library) → similarity data between pairs of protein chains. Stage 2 (interactive system using the WWW): page for setting threshold values and the priority of factors → list of protein chains selected and sorted according to the user's requirements → page for setting parameters and threshold values → PDB-REPRDB list.]
Fig. 2. An outline of the construction of the new system for PDB-REPRDB. The process is divided into two stages. Stage 1 calculates similarities, while stage 2 performs the classification of chains and selection of the representative chains.
are now eliminated in the second stage by using the WWW interface to set thresholds for (1) the ratio of residues with only Cα coordinates, (2) the ratio of residues with only backbone coordinates, or (3) the R-factor value. All chains remaining after the exclusion step are sorted in decreasing order of the number of Cα coordinates in the chain.
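The exclusion and sorting step of the first stage might be sketched as follows. This is only an illustration under stated assumptions: the `Chain` record, its field names and the example chain names are hypothetical, and the 40-residue cut-off is applied to the residue count as in the text.

```python
from dataclasses import dataclass

@dataclass
class Chain:
    name: str             # hypothetical identifier, not a real PDB entry
    is_protein: bool      # False for DNA and RNA chains
    is_theoretical: bool  # True for theoretically modeled data
    n_residues: int
    n_nonstandard: int    # number of non-standard amino acid residues
    n_ca: int             # number of residues with C-alpha coordinates

MIN_RESIDUES = 40  # chains shorter than this are excluded

def stage1_candidates(chains):
    """Apply exclusions (a)-(d), then sort by decreasing number of C-alpha coordinates."""
    kept = [
        c for c in chains
        if c.is_protein                        # (a) no DNA or RNA
        and not c.is_theoretical               # (b) no theoretical models
        and c.n_residues >= MIN_RESIDUES       # (c) no short chains
        and c.n_nonstandard < c.n_residues     # (d) not non-standard at all residues
    ]
    return sorted(kept, key=lambda c: c.n_ca, reverse=True)

example = [
    Chain("demoA", True, False, 120, 0, 118),
    Chain("demoB", False, False, 60, 0, 0),    # nucleic acid chain
    Chain("demoC", True, True, 200, 0, 200),   # theoretical model
    Chain("demoD", True, False, 35, 0, 35),    # too short
    Chain("demoE", True, False, 300, 2, 300),
    Chain("demoF", True, False, 50, 50, 50),   # non-standard at all residues
]
stage1_names = [c.name for c in stage1_candidates(example)]
print(stage1_names)
```

Only the two experimental protein chains of sufficient length survive, with the chain having more Cα coordinates listed first.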
We define the similarities between protein chains by means of ID%, RMSD and Dmax. These similarity values are calculated for each pair of protein chains, working from the top of the sorted list to the tail. First, a pair of chains is aligned by pairwise sequence alignment (Needleman and Wunsch, 1970), which is based on dynamic programming, and ID% is calculated
[Figure 3 flowchart. PDB → list of all protein chains → (1) eliminate chains (DNA, RNA, theoretical model, number of residues with Cα coordinates < 40), (2) sort by the number of Cα coordinates → sorted list of protein chains → set the first chain in the list as a target chain → set chains below the target in the list as comparison chains → (1) align sequences by the pairwise sequence alignment method (Needleman-Wunsch), (2) calculate ID% for aligned sequence pairs, (3) superimpose each pair of Cα atoms in the aligned residues, (4) calculate RMSD and Dmax → set the next chain in the sorted list as a target chain (loop; parallelized using MPI library) → similarity data between pairs of protein chains.]
[Figure 4 flowchart. List of protein chains → eliminate and sort chains: (1) eliminate chains with factors greater/smaller than the thresholds set by the user, (2) sort chains according to various factors; the factor given priority '1' is compared first, and factors with lower priorities are compared only when higher-priority factors are equal → list of protein chains selected and sorted by the user's requirement → classify chains and select the representatives, using the similarity data between pairs of protein chains: pairs of chains are checked for similarity starting from the top of the list, and the first chain classified into a group is the representative. Similarity check: (1) by sequence similarity, chains whose ID% is over the threshold are classified into the same group; (2) by structural similarity, chains whose RMSD or Dmax is under the threshold are classified into the same group → PDB-REPRDB list.]
Fig. 3. Flowchart for calculating similarities between all pairs of protein chains.
from the result of the alignment. Next, the Cα atoms of each pair of aligned residues are superimposed by a least-squares fitting procedure (Kabsch, 1978), and RMSD and Dmax are calculated from the superposition. This procedure has been parallelized using the MPI library to speed up the calculation of similarities for a large number of protein chain pairs.
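The alignment and ID% steps described above can be sketched in a few lines. The scoring values (match 1, mismatch -1, linear gap -2) and the definition of ID% as identical residues divided by aligned (non-gap) residue pairs are assumptions made for illustration; the paper does not state the scoring scheme or the denominator used by the system, and the sequences below are invented.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def percent_identity(aln_a, aln_b):
    """ID% over aligned residue pairs (columns containing a gap are ignored)."""
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Invented sequences: one deletion and one substitution
aln_a, aln_b = needleman_wunsch("ACDEFGHIK", "ACDFGHLK")
id_percent = percent_identity(aln_a, aln_b)
print(aln_a, aln_b, id_percent)
```

The aligned residue pairs returned here are the ones whose Cα atoms would then be superimposed to obtain RMSD and Dmax.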
Classification of chains and selection of representatives
A flowchart for the second stage is shown in Figure 4. The policy for classification of chains and selection of representatives (i.e. considering not only the sequence similarity but also structural similarity, and giving priority to chain data with higher quality) remains the same as in the original version (Noguchi et al., 1997). However, in the new system, the chains to be classified vary according to the threshold values for various factors, and the representative list and classification data change according to the threshold values and priorities of the factors for selecting representatives and the similarity parameters. Furthermore, the sequence and structural similarities are
Fig. 4. Flowchart for the PDB-REPRDB WWW interface.
calculated beforehand between all chain pairs to decrease the time required to select representative sets. First, some chains in the list generated in the previous step are eliminated and the remaining chains are sorted according to the user’s requirements on the page for setting threshold values and the priority of factors. The chains are eliminated based on thresholds for various factors as described below.
1. Eliminate chains with a value greater than the thresholds for these factors: resolution, R-factor, the number of chain breaks, ratio of non-standard amino acid residues, ratio of residues with only Cα coordinates, ratio of residues with only backbone coordinates.
2. Eliminate chains with a value smaller than the threshold for the number of residues.
Furthermore, mutant chains, complex chains and chains solved by NMR are eliminated unless the user explicitly selects them. Second, the remaining chains are sorted by various factors, in a priority order defined by the user. These factors are:
1. resolution,
2. R-factor,
3. the number of chain breaks (the fewer the better),
4. ratio of non-standard amino acid residues (the smaller the better),
5. ratio of residues with only Cα coordinates (the smaller the better),
6. ratio of residues with only backbone coordinates (the smaller the better),
7. whether mutant or wild type (the wild type has priority), and
8. whether complex or not (the non-complex type has priority).
If all those factors are of the same value, the chains are sorted in alphabetical order by chain name (e.g. 1MCD < 1MCE, 5AT1A < 5AT1C). This order is also the priority used in the previous version of PDB-REPRDB. Third, chains are classified starting from the top of the sorted list according to the similarity defined by the user on the page for setting parameters and threshold values, and the representative chains are selected from the classification. In this phase, the first chain classified into a group is selected as a representative, because the chain has the highest priority in the group. Therefore, it is possible to change the priority for selection of representatives by defining priorities of factors for sorting chains on the top page. The similarity for classification is defined by selecting the parameters (ID%, and RMSD or Dmax) and setting the thresholds. In the similarity check, we consider the following chains not to be similar:
• if ID% is checked, chains whose ID% is less than the threshold; or
• if RMSD or Dmax is checked, chains whose RMSD or Dmax is greater than the threshold.
Finally, all chains are classified into protein chain groups, and representative chains are selected from the groups.
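The three phases described above (elimination by thresholds, sorting by prioritized factors, and classification from the top of the sorted list) can be sketched as follows. This is a minimal sketch, not the system's implementation: the dictionary layout, function names and toy values are hypothetical, only factors for which smaller is better are shown, group membership is assumed to be tested against the representative only, and a pair without precomputed similarity data is assumed not to be similar.

```python
def apply_thresholds(chains, max_values, min_residues):
    """Drop chains above any maximum threshold or below the minimum residue count."""
    return [c for c in chains
            if c["n_res"] >= min_residues
            and all(c[factor] <= limit for factor, limit in max_values.items())]

def is_similar(sim, id_min=None, dmax_max=None):
    """A threshold of None means that the parameter is not checked."""
    if id_min is not None and sim["id"] < id_min:
        return False
    if dmax_max is not None and sim["dmax"] > dmax_max:
        return False
    return True

def select_representatives(chains, priority, similarity, id_min=None, dmax_max=None):
    """Return {representative: [group members]}; the first chain in a group represents it."""
    # Lower-priority factors matter only when higher ones tie; the name breaks full ties
    ordered = sorted(chains, key=lambda c: tuple(c[f] for f in priority) + (c["name"],))
    groups, assigned = {}, set()
    for i, rep in enumerate(ordered):
        if rep["name"] in assigned:
            continue
        assigned.add(rep["name"])
        groups[rep["name"]] = [rep["name"]]
        for other in ordered[i + 1:]:
            if other["name"] in assigned:
                continue
            sim = similarity.get(frozenset((rep["name"], other["name"])))
            if sim is not None and is_similar(sim, id_min, dmax_max):
                groups[rep["name"]].append(other["name"])
                assigned.add(other["name"])
    return groups

# Hypothetical chains and precomputed similarity data
chains = [
    {"name": "demoA", "resolution": 2.0, "r_factor": 0.20, "n_res": 150},
    {"name": "demoB", "resolution": 1.5, "r_factor": 0.18, "n_res": 150},
    {"name": "demoC", "resolution": 1.5, "r_factor": 0.15, "n_res": 148},
    {"name": "demoD", "resolution": 3.5, "r_factor": 0.25, "n_res": 150},
]
similarity = {
    frozenset(("demoC", "demoB")): {"id": 95.0, "dmax": 25.0},
    frozenset(("demoC", "demoA")): {"id": 98.0, "dmax": 2.0},
    frozenset(("demoB", "demoA")): {"id": 96.0, "dmax": 24.0},
}
kept = apply_thresholds(chains, {"resolution": 3.0}, min_residues=40)
by_seq_and_struct = select_representatives(
    kept, ["resolution", "r_factor"], similarity, id_min=90.0, dmax_max=10.0)
by_seq_only = select_representatives(
    kept, ["resolution", "r_factor"], similarity, id_min=90.0)
print(by_seq_and_struct)
print(by_seq_only)
```

With the sequence criterion alone all three remaining chains fall into one group, whereas adding the Dmax criterion keeps the structurally divergent chain as a second representative, which is the behaviour the selection policy is intended to provide.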
Using the PDB-REPRDB via the WWW
The new system for PDB-REPRDB is designed to provide a WWW user interface to the representative chain sets. Now, the user can eliminate unnecessary chains (e.g. with insufficient resolution, with chain breaks, or without side chain coordinates) from the PDB chain list by setting threshold values, and change the priority of factors for selecting representatives on the top page. The ‘apply constraints’ option on this page controls whether the threshold of each factor is used. If the user chooses the ‘No’ option for a factor, chains will not be eliminated based on that factor. The ‘Yes’ option causes elimination of chains based on the following threshold. The ‘priority’ values must be integers, set in increasing order from one to nine. Previously, the representative sets were obtained only by similarity parameters that were determined beforehand. Now, the user can obtain representative sets for pairs of sequence and structural similarity parameters (e.g. ID% ≥ 30% and RMSD ≤ 15 Å, ID% ≥ 90% and Dmax ≤ 5 Å) on demand by selecting those parameters and setting the values on the following page. Examples of a representative list and classification data of protein chains obtained from this system are shown in Figure 5. In this case, the criteria for sequence similarity and structural similarity are ID% ≥ 30% and Dmax ≤ 10 Å. The screen shown at the top of the figure contains a list of the representative chains, including the following information.
• ID: PDB entry ID + chain ID
• ∗: link to ‘RasMol’ for displaying a 3D view of the protein
• naa: the number of amino acids (from the SEQRES line of PDB)
• Res: resolution
• Rfac: R-factor
• Methd: experimental method
• n sid: the number of residues with side chain coordinates
• n bck: the number of residues with backbone coordinates
• n ca: the number of residues with Cα coordinates
• n naa: the number of non-standard amino acid residues
• mutant: mutant or wild type
• complex: complex or not
• ECnumber: EC number
• COMPND: compound in PDB
‘ID’ sections are hot-linked to the screen shown below, which contains data on the classified groups, and a graphic representation of the 3D structure can be displayed using the RasMol program, by clicking on ‘∗’. Furthermore, ‘ECnumber’ sections are hot-linked to the Ligand chemical database for enzyme reactions (LIGAND) (Goto et al., 1999), which is one of the databases supported by DBGET/LinkDB (Fujibuchi et al., 1998) on GenomeNet
[Figure 5, upper screen. PDB-REPRDB: Database of representative protein chains in PDB, based on PDB Rel. #86, 30 Mar 99, by Tamotsu NOGUCHI, Kentaro ONIZUKA and Yutaka AKIYAMA (Real World Computing Partnership). Threshold: ID% = 30 %, DMAX = 10 A. COMPND entries are truncated as on the screen.]
No.  ID     naa  Res   Rfac  Methd  n_sid  n_bck  n_ca  n_naa  brk  mutnt  cmplx  ECnumber  COMPND
 1.  1GCI_  269  0.78  0.10  X      264    265    269   0      0    W      N                THE 0.78 ANGSTROMS STRUCTURE OF A SERINE PROTEAS
 2.  1CBN_   46  0.83  0.11  X       47     47     48   0      2    W      N                CRAMBIN
 3.  2PVB_  107  0.91  0.11  X       96     96    107   0      0    W      N                PIKE PARVALBUMIN (PI 4.10) AT LOW TEMPERATURE (1
 4.  3LZT_  129  0.92  0.09  X      126    127    129   0      1    W      N                REFINEMENT OF TRICLINIC LYSOZYME AT ATOMIC RESOL
 5.  2FDN_   55  0.94  0.10  X       55     55     55   0      0    W      N                2[4FE-4S] FERREDOXIN FROM CLOSTRIDIUM ACIDI-URIC
 6.  1NLS_  237  0.94  0.13  X      230    236    237   0      0    W      N                CONCANAVALIN A AND ITS BOUND SOLVENT AT 0.94A RE
 7.  1BXO_  323  0.95  0.10  X      322    323    323   0      0    W      N                ACID PROTEINASE (PENICILLOPEPSIN) (E.C.
 8.  1BRF_   53  0.95  0.13  X       53     53     53   0      0    W      N                RUBREDOXIN (WILD TYPE) FROM PYROCOCCUS FURIOSUS
 9.  1AHO_   64  0.96  0.16  X       61     64     64   0      0    W      N                THE AB INITIO STRUCTURE DETERMINATION AND REFINE
10.  1IXH_  321  0.98  0.12  X      320    321    321   0      0    W      N                PHOSPHATE-BINDING PROTEIN (PBP) COMPLEXED WITH P
11.  1CEX_  214  1.00  0.09  X      197    197    197   0      0    W      N                STRUCTURE OF CUTINASE
12.  2ERL_   40  1.00  0.13  X       40     40     40   0      0    W      N                PHEROMONE ER-1 FROM
13.  1LKKA  105  1.00  0.13  X      105    105    106   1      1    W      c                HUMAN P56-LCK TYROSINE KINASE SH2 DOMAIN IN COMP
14.  5PTI_   58  1.00  0.20  X       58     58     58   0      0    W      N                TRYPSIN INHIBITOR (CRYSTAL FORM /II$)
15.  3SIL_  379  1.05  0.12  X      379    379    379   0      0    W      N                SIALIDASE FROM SALMONELLA TYPHIMURIUM
16.  2IGD_   61  1.10  0.10  X       61     61     61   0      0    W      N                ANISOTROPIC STRUCTURE OF PROTEIN G IGG-BINDING D
17.  1BKRA  109  1.10  0.14  X      108    108    108   0      0    W      N                CALPONIN HOMOLOGY (CH) DOMAIN FROM HUMAN BETA-SP
18.  1CTJ_   89  1.10  0.14  X       89     89     89   0      0    W      N                CRYSTAL STRUCTURE OF CYTOCHROME C6
19.  1RGEA   96  1.15  0.11  X       95     96     96   0      0    W      N      3.1.27.3  HYDROLASE, GUANYLORIBONUCLEASE
20.  1A6G_  151  1.15  0.13  X      149    150    151   0      1    W      N      -         CARBONMONOXY-MYOGLOBIN, ATOMIC RESOLUTION
[Figure 5, lower screen. PDB-REPRDB classification data, one line per group, giving the representative chain followed by the similar chains in its group. Representatives shown: 1GCI, 1CBN, 2PVB, 3LZT, 2FDN, 1NLS, 1BXO, 1BRF, 1AHO, 1IXH, 1CEX, 2ERL, 1LKKA, 5PTI, 3SIL, 2IGD, 1BKRA, 1CTJ, 1RGEA, 1A6G, 1MROC, 1MROB, 1MROD, 1IFC, 1ARB, 1ATG, 1AMM, 2PTH, 1JETA.]
Fig. 5. A list of representatives in the WWW interface to PDB-REPRDB (top). The arrow indicates the chains classified into the ‘1GCI’ group in the classification data generated using the WWW interface to PDB-REPRDB.
in Japan. The classification data are presented on one page, in which each representative chain and the similar chains in its group are listed by ‘ID’ on a single line. Each ‘ID’ is hot-linked with the PDB on the DBGET/LinkDB; clicking it will show the contents of the corresponding PDB entry.
Conclusion
In this paper, we presented a new system for PDB-REPRDB, which makes it possible to obtain a representative chain dataset for which the selection criteria are based on both sequence and structural differences. The key point is that our representative set discriminates between chains that have similar sequences but also have meaningful local structural diversity (due to insertions, deletions, or mutations), and it even discriminates small structural changes caused by complex formation. We have only addressed classification at a global level in this work. Our method attempts to ensure that all local structures are represented in the database, but does not prevent local structures from occurring in multiple clusters. Domain-level classification is something we would like to pursue in the future. Several representative sets generated by this system have been used for our own research activities (e.g. protein secondary structure prediction, threading, and local structure classification) and as a database for PAPIA (Akiyama et al., 1998), which has been accessible on the PAPIA WWW page at http://www.rwcp.or.jp/papia/ since April 1998. A researcher studying the analysis and prediction of protein structure can readily obtain a set of representative chains based on his/her own requirements by using the system. That is, the system for PDB-REPRDB assures a quick selection of representative chains, whether the strategy of selection is based only on sequence similarity or on both sequence and structural similarity, and whatever threshold values are set. The system has been available at the PAPIA WWW server since April 1999. Similarity data for PDB-REPRDB are recalculated whenever a new PDB is released (until recently, about every 3 months). Recently, daily updates of PDB have become available, allowing the user to access the latest protein structure data over the WWW immediately. In addition, a daily update tool, which is mainly used for distribution of the PDB data, has been made available to update local copies of the PDB every day. Therefore, it has been requested that PDB-REPRDB be updated more frequently. Consequently, we are planning to update the similarity data every month for the time being.
Acknowledgments
We thank Dr Susumu Goto and Prof Minoru Kanehisa at the Institute for Chemical Research, Kyoto University for their support.
References
Akiyama,Y., Onizuka,K., Noguchi,T. and Ando,M. (1998) Parallel Protein Information Analysis (PAPIA) system running on a 64-node PC cluster. Proceedings of the Ninth Workshop on Genome Informatics. Universal Academy Press, pp. 131–140.
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.
Fujibuchi,W., Goto,S., Migimatsu,H., Uchiyama,I., Ogiwara,A., Akiyama,Y. and Kanehisa,M. (1998) DBGET/LinkDB: an integrated database retrieval system. Pac. Symp. Biocomput. 1998, 683–694.
Goto,S., Nishioka,T. and Kanehisa,M. (1999) LIGAND database for enzymes, compounds and reactions. Nucleic Acids Res., 27, 377–379.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.
Hobohm,U. and Sander,C. (1994) Enlarged representative set of protein structures. Protein Sci., 3, 522–524.
Holm,L. and Sander,C. (1994) The FSSP database of structurally aligned protein fold families. Nucleic Acids Res., 22, 3600–3609.
Kabsch,W. (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Cryst., A 34, 827–828.
Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Noguchi,T., Onizuka,K., Akiyama,Y. and Saito,M. (1997) PDB-REPRDB: a database of representative protein chains in PDB (Protein Data Bank). Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, pp. 214–217.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Sander,C. and Schneider,R. (1991) Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Skinner,R., Abrahams,J.P., Whisstock,J.C., Lesk,A.M., Carrell,R.W. and Wardell,M.R. (1997) The 2.6 Å structure of antithrombin indicates a conformational change at the heparin binding site. J. Mol. Biol., 266, 601–609.