WHAT DOES DATABASE FEDERATION MEAN TO CRYSTALLOGRAPHY?
Philip E. Bourne (1,2,3), Ilya N. Shindyalov (1), Christopher M. Smith (1) and Helge Weissig (1,2)
(1) San Diego Supercomputer Center, P.O. Box 85608, San Diego, CA 92186, USA
(2) Department of Pharmacology, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
(3) The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA
More than ever crystallographers are faced with using a variety of databases, each with its own content, organization and query interface. Database federation offers the promise of a unified view of these disparate data and detailed query through a single easy to use interface available via the World Wide Web. This paper discusses current progress towards database federation in molecular biology and uses the Protein Kinase Resource (PKR) as an example. The paper concludes that while we are a long way from achieving the computer scientist’s view of database federation, there are useful resources available today. These resources and what we can expect in the future are introduced. Throughout emphasis is placed on the importance of better computer-readable annotation for achieving true database federation.
1. INTRODUCTION
We begin with an explanation of what is meant in this paper by both “database” and “federation.” The definitions we use are not the formal definitions from computer science, but approximations that suit our final goals. Those goals are: (i) to provide insight into what federated databases are available via the Internet; (ii) to describe the technology behind these resources in regard to their strengths and weaknesses; and (iii) to provide insight into what we expect from the next generation of integrated data resources important to crystallographers. As we shall see, without these approximate definitions there would be little to discuss, since no formal database federations exist in molecular biology today. For our purposes a database can be considered as a set of data of some defined atomicity (level of detail) and scope, for example, a set of protein structures, a set of aligned sequences, and so on, in which each item can be referenced and is organized in a way that provides efficient answers to the queries being asked. By this definition the Protein Data Bank (PDB) [5] is a database since it has scope - all available X-ray crystal structures and NMR structures of proteins and DNA, each with a unique identifier - and is efficient as distributed if all you wish to do is locate a single structure using the PDB code found in the file name. The computer operating system can perform this query with a
simple directory lookup*. It is not an efficient organization, however, if you are trying to find all serine/threonine protein kinase structures solved to better than 2.5 Å resolution. To be answered efficiently, this type of query requires a specific level of data annotation and data organization, since opening each of over 6000 files (one per structure as of July 1997) and searching for the appropriate terms is time consuming even on current workstations. An efficient answer to the protein kinase question requires that the data be organized in a certain way. Whatever the underlying data model used to represent the data (we will get to this), the principle is the same: organize the data such that like features are grouped together. Thus, all compound names might be grouped together and all resolutions might be grouped together. Then, to find the resolution of a structure, the computer need not wade through a lot of extraneous information, which is time consuming. Each member of a grouping has an associated reference so that the correct protein name can be associated with the correct resolution. These references go by names like primary key, primary index, and object identifier, depending on whether the underlying data model is a relational database, an indexed database, or an object oriented database. The Nucleic Acid Database (NDB) [4], the Genome Database (GDB) [14] and the Biological Macromolecule Crystallization Database (BMCD) [17] are examples of databases that use a relational model; the Sequence Retrieval System (SRS) [12,13], Entrez [21,25] and the obsolete structures database [37] are examples that use the index system; and P/FDM [18] and the earlier versions of MOOSE [7,30] are examples that use object oriented databases. In the context of this paper, it is not important to understand these underlying data representations. It is important, however, to know that these resources exist and contain structural information important to crystallographers.
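The principle of grouping like features and tying them together with a shared reference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entry identifiers, compound names and resolutions are made up, and the names `Entry`, `build_columns` and `find` are hypothetical rather than taken from any of the resources discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One structure as it might appear in a single flat file."""
    entry_id: str      # primary key (hypothetical IDs, not real PDB codes)
    compound: str
    resolution: float  # in angstroms


# Made-up records for illustration only.
ENTRIES = [
    Entry("X001", "serine/threonine protein kinase", 2.1),
    Entry("X002", "serine/threonine protein kinase", 2.9),
    Entry("X003", "hemoglobin", 1.7),
]


def build_columns(entries):
    """Group like features together, each keyed by the primary key."""
    compounds = {e.entry_id: e.compound for e in entries}
    resolutions = {e.entry_id: e.resolution for e in entries}
    return compounds, resolutions


def find(compounds, resolutions, term, max_resolution):
    """Return keys whose compound name contains `term` and whose
    resolution is better (numerically lower) than `max_resolution`."""
    hits = [key for key, name in compounds.items() if term in name]
    return sorted(key for key in hits if resolutions[key] < max_resolution)


compounds, resolutions = build_columns(ENTRIES)
result = find(compounds, resolutions, "serine/threonine protein kinase", 2.5)
print(result)
```

The costly step is building the grouped columns once; the query itself then touches only the two features it needs, which is the point made above about avoiding extraneous information.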
Readers interested in database theory can refer to [9,10] for useful reading. What is important here is that data are organized and can be referenced. The organization is expressed by a schema which defines how items of data are grouped and related together.

Given the different types of databases what is a “database federation?” Again, a less than formal definition, but one which serves our purpose is as follows. A database federation is a collection of discrete databases to which a user can pose a single query and get an answer through reference to information contained in all databases. The federation should be able to include simple ASCII (flat) files, relational, indexed, and object oriented databases.

* Not strictly true since some operating systems limit the number of files that can be accessed in a single directory.

1.1 Is this Topic Relevant?

As a discipline crystallography, and macromolecular crystallography in particular, is changing. Thanks to expression systems providing larger amounts of protein than could be obtained from natural sources, more powerful synchrotrons and better detectors, improved phasing techniques, semi-automated electron density map fitting, and easier to use and more robust software, the time to complete a structure determination has been drastically reduced. Evidence for this progress is found in the exponential growth in the number of macromolecular structures. These events parallel those that took place in small molecule crystallography 20 years ago. The impact at that time was that small molecule crystallographers became more diverse in their research interests. This is now true for today’s macromolecular crystallographers. A macromolecular structure is the beginning of a line of inquiry that may extend to a detailed study of the functional role of the macromolecule and macromolecules like it. Such a diverse line of inquiry leads macromolecular crystallographers to a variety of different databases and at the same time often results in a sense of frustration since the information they are seeking is spread throughout these databases, each having its own unique query procedures and descriptions for each item of data. If the promise of database federation represents a simple means of searching multiple databases, then we argue that it is relevant to today’s crystallographers.
2. EXISTING DATABASE FEDERATIONS
Given our informal definition of a database federation, hyperlinks provide the simplest database federation. That is, a query of one resource, say for a particular structure, will provide specific links to information relating to that structure in, say, the PROSITE database of protein sequences [2,3]. While the query itself does not return information from the other database - that is left to the user - it does provide a way to navigate to that information. We refer to this capability as a loose federation.

2.1 Loose Federations
How is a loose federation implemented? Frequently the curators of a particular database make available a specification which describes how others may reference data in their database via a World Wide Web hyperlink. For example, the National Center for Biotechnology Information (NCBI) provides access to several of their databases in this way, namely:
• The PubMed database, a compendium of medical abstracts that includes all of the National Library of Medicine's MEDLINE database;
• The protein database (composite of native or translated sequences from GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB);
• The nucleotide database (composite from GenBank, EMBL, DDBJ);
• The MMDB 3-D protein structures database (from PDB data);
• The Genomes database (from GDB and elsewhere).
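A linking specification of this kind amounts to a URL pattern with a slot for the entry identifier. The sketch below illustrates the idea only: the `db.example.org` addresses, the `LINK_TEMPLATES` table and the `make_link` function are all hypothetical and do not reproduce the NCBI specification or that of any other curator.

```python
from urllib.parse import quote

# Hypothetical URL templates; a real curator's linking specification
# would supply the actual patterns.
LINK_TEMPLATES = {
    "sequence": "https://db.example.org/sequence?acc={id}",
    "structure": "https://db.example.org/structure?code={id}",
    "citation": "https://db.example.org/citation?uid={id}",
}


def make_link(kind, identifier):
    """Build a hyperlink to one discrete entry of the given kind."""
    template = LINK_TEMPLATES[kind]  # raises KeyError for unknown kinds
    return template.format(id=quote(identifier, safe=""))


link = make_link("structure", "4HHB")
print(link)
```

Because the identifier is the only variable part, such a link stays valid for as long as the curator keeps both the pattern and the identifier stable.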
The atomicity provided by the hyperlink is at the level of a discrete entry in one of these databases - a specific citation, a specific structure, or a specific sequence. Each of these discrete entities is identified by a unique code, for example, a GenBank accession number, which is never reused. Providing an immutable hyperlink to an item of data in a public database is a valuable service, and many Internet-accessible data resources, including our own Protein Kinase Resource (PKR) [34], use this facility to link sequences, structures, and citations found in our database to the primary source of data. So, for example, we make available sequence alignments for casein dependent kinases, where each sequence in the alignment is linked to its GenBank source, which includes additional information like the feature table and yet further links to citations and so on. However, in answering specific scientific questions this type of loose federation is limited. First, the link is uni-directional - there is not necessarily a link from an NCBI database to an external database like the PKR (the PDB does make these links to the PKR). Second, there is no guarantee that, having made the jump to the database of interest, you will acquire sufficient related information to provide useful answers to your query. For example, jumping to the PDB based on a specific sequence of a tyrosine kinase which has a known structure and getting access to all tyrosine
kinase structures is not possible. Third, this form of navigation permits you to acquire information, however, it does not perform some calculation or information filtering specific to a query. Finally, the desired link between two databases necessary to answer the query may simply not exist. The navigational shortcomings introduced above are addressed in part by resources such as Entrez, SRS, and DBGET [24] which take existing databases and create their own bi-directional loose federations with improved nomenclature and better level of atomicity, which are then accessible for query through the Web. The basic approach used by each of these resources is similar. Parse the flat files that are used to build each individual database and create an index of terms such that each term can be found in a number of associated databases built from the flat files. The index then relates identical items of information in multiple databases and allows you to access them quickly. While this sounds straightforward, it is not. The difficulty is not in building the indices, but parsing the files and interpreting the contents. Each contributing database has its own flat file format which is convenient to read, but difficult to parse consistently since there is a temporal inconsistency in each of these databases. That is, the format has evolved over time and the database may contain entries in different formats that the parser must deal with. The current PDB is an example with files in v1.0 and v2.0 of the PDB File Format. This problem is relatively trivial since those formats are documented and can be interpreted. What is more difficult to resolve are the undocumented interpretations of what a specific item of data meant to a curator 10 years ago compared to how it is interpreted today. For example, what constitutes a polypeptide chain in the PDB, and hence is given a chain identifier has changed over time [37]. We will come back to this issue of nomenclature.
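The parse-then-index approach, including the need to cope with more than one format version inside a single database, might be sketched as follows. The two flat file layouts, the parser names and the toy entries are assumptions made for illustration; they are not the formats or code used by Entrez, SRS or DBGET.

```python
from collections import defaultdict


def parse_v1(text):
    """Hypothetical older layout: 'KEY value' pairs, one per line."""
    record = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key and value:
            record[key.lower()] = value.strip()
    return record


def parse_v2(text):
    """Hypothetical newer layout: 'key: value' pairs, one per line."""
    record = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key and value:
            record[key.strip().lower()] = value.strip()
    return record


# One parser per format version the database has used over time.
PARSERS = {"v1": parse_v1, "v2": parse_v2}


def build_index(databases):
    """Map each term to the (database, entry) pairs that contain it.

    `databases` has the form {db_name: [(entry_id, version, text), ...]}.
    """
    index = defaultdict(set)
    for db_name, entries in databases.items():
        for entry_id, version, text in entries:
            record = PARSERS[version](text)
            for value in record.values():
                for term in value.lower().split():
                    index[term].add((db_name, entry_id))
    return index


# Made-up entries for illustration only.
DATABASES = {
    "structures": [
        ("S1", "v1", "COMPND tyrosine kinase\nSOURCE human"),
        ("S2", "v2", "compound: hemoglobin\nsource: human"),
    ],
    "sequences": [
        ("Q1", "v2", "definition: tyrosine kinase domain"),
    ],
}

index = build_index(DATABASES)
print(sorted(index["kinase"]))
```

Building the index is the easy part here, as the text notes; the sketch cannot capture the harder problem of undocumented changes in what a data item meant to a curator at different times.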
Figure 1. Interconnected databases available through SRS.
Parsing is handled differently by Entrez and SRS. Entrez works with a low-level form of data notation referred to as Abstract Syntax Notation One (ASN.1). Thus data are either collected directly in ASN.1, through, for example, direct deposition of sequences to the NCBI, or converted from other formats with programs specific to each database. (These programs are, however, built from the same library of tools.) SRS, on the other hand, uses an Object Data Definition (ODD), which is a generic data definition language used to describe the representation found, for example, in a PDB entry or GenBank sequence entry, or the flat file representations of approximately 80 other biological databases for that matter [13]. Once the ODD for a specific type of data is defined, the same code is used to perform the parsing and indexing by first reading the ODD definition for a particular database. Figure 1 gives a view of some of the databases accessible through SRS, characterized by the type of information available.

Entrez takes this indexing idea a step further with the concept of neighbors. Neighbors provide hypertext links between related items of information and are determined by algorithms which define the likelihood that the items of information are related. How this is computed depends on the type of information being related. For sequences it depends on a similarity score computed with BLAST [1] - the 100 nearest neighbors are reported; for structures it is computed with VAST [16], a structure matching algorithm; and for text it is determined by the frequency with which important terms appear in a document. It is beyond the scope of this paper to consider the details of how neighbors are determined in Entrez; see [28] for further details.

2.2 Tight Federations
While tight federations are an active area of computer science research, we consider only their practical implementation. Refer to [10] for additional reading on research into tight federations in molecular biology. A tight federation is defined here as one in which a single query can be made of multiple data items in any of the databases that comprise the federation. The query language should have considerable expressive power and be independent of the underlying database structure. The results of a query should be easy to interpret and form the basis for further study as needed. It must be said that while technology exists to create tight federations, it is people and policies that prevent the widespread availability of database federations today. Databases, or more specifically the data they contain, are coveted, and each curator has their own ideas of how a database should be queried and what the exact definition of each data item within the database is. The government funding of large and small biological databases, in the USA at least, has not traditionally fostered database federation - the federal agencies themselves covet the databases they fund. Efforts in Europe show promise. The European Union (EU) Bridge Database Project Consortium, funded by the European Community, produced a combined schema [19] for a macromolecular structure database based on the consortium’s experience in developing individual structure databases, notably SESAM [22], IDITIS [36] and P/FDM [18]. While not a database federation, it does represent a collective effort. This schema has yet to be implemented but is likely to be prominent in the European Bioinformatics Institute’s (EBI) plans for a structure database.

3. CURRENT APPROACHES - THE PROTEIN KINASE RESOURCE
The approach of the EU Bridge project in taking the best from existing schema and implementing those in a new schema is today’s norm. To better understand this process we take as an example our own on-going work in developing the Protein Kinase Resource (PKR) [34] which integrates sequence data from GenBank, structure data from PDB, genetic information from OMIM
[27], local laboratory data, enzymatic data extracted from the literature, and other miscellaneous data of use to researchers interested in protein kinases. There are various methodologies for developing integrated resources like the PKR. From a user’s perspective the important issues are the provision of accurate and comprehensive data and making it meaningful to access. It is these issues which are the focus here, not the details of the database implementation which are available elsewhere [31]. The PKR is one of a new generation of data resources. We consider existing resources as broad but shallow, that is, they cover all known instances, for example, all protein structures or all protein sequences, but, by necessity, do so in limited detail. New resources like the PKR are narrow but deep. That is, they cover a particular sub-topic, in this case one protein family, but do so in greater detail, bringing together data from multiple broad data resources and elsewhere. Consider the basic approach we are adopting in developing the PKR (Figure 2) on a step-by-step basis.
[Figure 2 diagram; labels recovered from the original: Raw Data, Sequence, Structure, Literature, Genetics, Lab. Data, Dictionaries, Filters, Load, Property Object Database, Storage & Memory Model, Query Library, Viz. Tools, Query, Display.]
Figure 2. The topology of the Protein Kinase Resource (PKR).
Step 1: Data from existing flat file formats (e.g., PDB, GenBank) are parsed, analogous to the parsing performed for SRS and Entrez. This is currently performed with special programs rather than an ODD approach.

Step 2: Each parsed data item is then checked against an appropriate dictionary to confirm that the item exists and to validate that its data type and value are appropriate. For dictionary checking we use the Self-defining Text Archival and Retrieval (STAR) data representation [8,15], defined for use in crystallography but of general applicability. This approach lets us leverage the extensive effort that has already been put into the macromolecular Crystallographic Information File (mmCIF) dictionary, which contains approximately 3,200 terms [6]. Additional dictionaries were written to cover primary sequence and sequence feature tables and enzymology [35]. Computer-readable dictionaries are critical for sustaining a consistent nomenclature in an evolving scientific discipline and their importance cannot be overemphasized. Not only do the dictionaries specify the state of knowledge of a given field at a particular point in time, they provide an explicit record of what exists in that database at a given point in time. Until now, too much of that information was in the curator's head and not documented, leading to inconsistencies in representation and misuse by programmers writing code to use these data. Each programmer, without the benefit of a definitive guide, is left to make their own interpretation of items of data. The lack of consistency that results has become apparent in our temporal study of the PDB [37]. Two examples are given here to highlight the type of problem which we anticipate exists in all major long-lived data resources and is not peculiar to the PDB.
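The dictionary check described in Step 2 can be illustrated with a minimal sketch. It assumes a heavily simplified dictionary held as a Python mapping; the item names (`resolution`, `chain_id`, `atom_count`) and the `validate` function are hypothetical and are not actual STAR or mmCIF definitions, which carry far richer information.

```python
# Hypothetical, heavily simplified dictionary: each item name maps to its
# expected data type and, optionally, a minimum permitted value.
DICTIONARY = {
    "resolution": {"type": float, "min": 0.0},
    "chain_id": {"type": str},
    "atom_count": {"type": int, "min": 1},
}


def validate(item_name, raw_value, dictionary=DICTIONARY):
    """Check one parsed item against the dictionary.

    Returns (True, converted_value) on success, or (False, message)
    when the item is unknown, has the wrong type, or is out of range.
    """
    definition = dictionary.get(item_name)
    if definition is None:
        return False, f"unknown item: {item_name}"
    try:
        value = definition["type"](raw_value)
    except ValueError:
        return False, f"bad type for {item_name}: {raw_value!r}"
    minimum = definition.get("min")
    if minimum is not None and value < minimum:
        return False, f"{item_name} below minimum {minimum}: {value}"
    return True, value


print(validate("resolution", "2.5"))
print(validate("colour", "red"))
```

An item that is absent from the dictionary is rejected rather than silently loaded, which is one way the dictionary can serve as an explicit record of what the database contains.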
Figure 3a shows a query for hemoglobin in the obsolete structures database and indicates that this macromolecule has existed in the PDB since 1975. However, in 1984 1HHB was replaced simultaneously by 3 entries (2HHB, 3HHB, 4HHB), all derived from the same data set (Figure 3b). Closer inspection reveals that while these entries are all correct within the limits of the experimental data, they show significant differences in their coordinate sets. This difference becomes apparent when considering how each structure compares to ideal geometric values taken from small molecule data [11]. Figure 3c indicates that 2HHB is a highly constrained model, whereas 4HHB is less constrained, and 3HHB is a dimer to which a non-crystallographic transformation has been applied to restore the tetramer. While the latter situation can be automatically ascertained when building a database through the distinct MTRIXn and SCALEn PDB records, the details of the differently constrained models 2HHB and 4HHB are only partially described in REMARK records, and in any case such free text is indecipherable by a parser; hence the information is lost when building a database unless entered manually - a time consuming and expensive process.

The second example (Figure 4) shows the distribution of all structures in the obsolete structures database and their corresponding entries in the current PDB distribution, based upon changes in the total number of atoms from one release to the next. Surprisingly, 15% of entries have fewer atoms than their corresponding earlier version. Close inspection reveals that this is caused predominantly by an over-determination of water being corrected in a later version of the structure. Neither this change in water content nor the criteria used to define the water cut-off is provided in these PDB entries. If it is reported at all, it is reported via free text REMARK records, and again cannot be consistently parsed and included in a database, and so vital information is lost.
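A comparison of atom counts between two versions of an entry can be sketched as below. The sketch relies only on the fixed-column layout of PDB coordinate records (record name in columns 1-6, residue name in columns 18-20) and on the conventional residue name HOH for water; the helper `make_line`, the truncated toy records and the function names are assumptions made for illustration, not the procedure used to produce Figure 4.

```python
def make_line(record, serial, atom, residue):
    """Build the leading columns of a PDB coordinate record (toy data).

    Columns 1-6 hold the record name, 7-11 the serial number,
    13-16 the atom name and 18-20 the residue name.
    """
    return f"{record:<6}{serial:>5}  {atom:<3} {residue:>3}"


def count_atoms(pdb_text):
    """Return (total atoms, water atoms) from ATOM and HETATM records."""
    total = 0
    waters = 0
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            total += 1
            if line[17:20] == "HOH":  # residue name, columns 18-20
                waters += 1
    return total, waters


def atom_change(old_text, new_text):
    """Change in total atoms and water atoms from old to new version."""
    old_total, old_waters = count_atoms(old_text)
    new_total, new_waters = count_atoms(new_text)
    return {"atoms": new_total - old_total,
            "waters": new_waters - old_waters}


# Made-up, truncated records for illustration only.
OLD = "\n".join([
    make_line("ATOM", 1, "N", "MET"),
    make_line("HETATM", 2, "O", "HOH"),
    make_line("HETATM", 3, "O", "HOH"),
])
NEW = "\n".join([
    make_line("ATOM", 1, "N", "MET"),
    make_line("HETATM", 2, "O", "HOH"),
])

change = atom_change(OLD, NEW)
print(change)
```

Counting is straightforward because these records are strictly formatted; what cannot be recovered this way is the reason for the change, which at best sits in free text REMARK records.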
Dictionaries provide strict definitions for all items of data to be included in the database and hence provide a high level of machine-usable and consistent annotation. The problem common to the two examples of temporal inconsistency - the inability to automatically characterize change from one version of a structure to the next - could have been avoided if terms describing these changes had been included in a computer-readable dictionary.

Step 3: A characteristic of the STAR dictionaries, when written using the Dictionary Definition Language (DDL) developed for macromolecular crystallography [38], is the notion of categories which group important data items together based on their structural relevance. We use these categories, and new categories defined in additional dictionaries, to represent indices in a property object database as described elsewhere [31]. In summary, the indices correspond to protein features, polypeptide chains, monomers, compounds etc. Each index then has multiple properties associated with it, for example, solvent exposure and secondary structure assignments for every residue in a polypeptide chain. Properties are maintained one per file. For example, solvent exposure for every residue found in all available protein kinase structures is in a single file. The index to each entry corresponds to the polypeptide chain and this is followed by the exposure values for each amino acid in that polypeptide chain. Properties can be retrieved very quickly using search methods that return the indices of polypeptide chains that include the search pattern. Collection objects group indices all having the same property. For example, all polypeptide chains containing a sequence of highly buried residues constitute a collection object. In this way the time consuming step becomes the building of the database rather than the query to find structures with buried residues, which can be performed in real time.
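The idea of a property keyed by polypeptide chain, searched by pattern to yield a collection of indices, can be sketched as follows. The one-letter exposure codes ('b' buried, 'e' exposed), the chain labels and the `search_property` function are assumptions for illustration and do not describe the storage format of the property object database itself [31].

```python
import re

# One property "file": chain index -> per-residue exposure codes.
# The codes are an assumption: 'b' = buried, 'e' = exposed.
EXPOSURE = {
    "chainA": "eebbbbbee",
    "chainB": "ebebebebe",
    "chainC": "bbbbbbeee",
}


def search_property(prop, pattern):
    """Return the sorted indices whose property string matches `pattern`.

    The returned list plays the role of a collection object: a group of
    indices that all share the same property.
    """
    regex = re.compile(pattern)
    return sorted(idx for idx, values in prop.items() if regex.search(values))


# Chains containing a run of at least five consecutive buried residues.
buried_run = search_property(EXPOSURE, r"b{5,}")
print(buried_run)
```

Because each property lives in its own compact structure, a query such as this scans only exposure values and never the rest of the data for each chain.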
Figure 5. The Compare3D Java applet.
Step 4: Query methods are hidden from the user by invoking queries through a Web interface, for example a simple Web form or a more sophisticated Java applet. Figure 5 illustrates an applet we have written [32] for the comparative analysis of proteins contained in the PKR (or for any other group of structurally related proteins). Protein sequences are selected from the property object database and a Smith-Waterman alignment [33] is applied. The C alpha coordinates from the
corresponding residues in the sequence alignment can then be superimposed in a least squares minimization according to the method of Hendrickson [20]. Structure rendering (translation, rotation, zooming, color coding, atom picking) can be performed and contact distance difference matrices examined. While useful, Java applets like this one nevertheless fall short of the rendering capability available with most molecular graphics programs. We anticipate that this situation will change with the advent of software libraries like Java-3D, which provide capabilities such as predefined graphics primitives, shadowing, and clipping currently found in popular Fortran and C libraries such as OpenGL.

4. THE FUTURE

The query language and visualization tools found in resources like PKR extend the indexing principles used in SRS and Entrez. Other resources are being developed which take native flat file data and load that data into specialized databases that support complex queries and modeling of more complete systems. These types of resources are in their infancy, but appear to be the next logical step in federated database development. Examples of this type of resource are the genetic and metabolic description of E. coli found in EcoCyc [23] and the metabolic maps found in PUMA [29], which are used in describing a variety of organisms.

5. CONCLUSION
A true database federation in the computer science sense does not currently exist in molecular biology. That is, queries that truly exploit the schema of multiple databases, and that can evolve efficiently as the schema of the underlying databases continues to evolve, do not exist. Rather, efforts have tended towards reorganizing data from multiple sources into a single index-based system that lets the user query that index. This methodology has proven to be efficient and can be expected to remain so in the face of exponentially increasing data growth. Nevertheless, this approach limits the types of queries that can be asked, and more formal data models and query languages are needed that better support iterative query and can model complete systems. It is a natural tendency to model these data resources after the subcellular, cellular, and higher-order physiology that as biologists we understand. Unfortunately, these model systems are as complicated as the biological systems they represent and we are only just beginning to understand how technically this might be done [26]. Close collaboration between computer scientists and structural and molecular biologists is needed to advance this cause. These collaborations are beginning, cross-disciplinary students are being trained, and the resultant discipline of bioinformatics is beginning to emerge [39]. We believe that the end result will be of great benefit to crystallographers.
ACKNOWLEDGEMENTS

Our own work on temporal databases, property object models, and the Protein Kinase Resource is supported by NSF grants BIR 9630339 and ASC 8902825. Our own work on mmCIF is supported by NSF grant BIR 9310154 and the DOE.

REFERENCES
[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, J. Mol. Biol. 5 (1990), 403.
[2] A. Bairoch and B. Boeckmann, Nucl. Acids Res. 21 (1993), 3093.
[3] A. Bairoch, Nucl. Acids Res. 21 (1993), 3097.
[4] H.M. Berman, W.K. Olson, D.L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S.H. Hsieh, A.R. Srinivasan and B. Schneider, Biophys. J. 63 (1992), 751.
[5] F. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer Jr., M.D. Brice, J.R. Rogers, O. Kennard, T. Shimanouchi and M. Tasumi, J. Mol. Biol. 112 (1977), 535.
[6] P.E. Bourne, H.M. Berman, B. McMahon, K. Watenpaugh, J. Westbrook and P.M.D. Fitzgerald, Methods in Enzymology 277 (1997), 571.
[7] W. Chang, I.N. Shindyalov, C. Pu and P.E. Bourne, CABIOS 10 (1994), 575.
[8] A. Cook and S.R. Hall, J. Chem. Inf. Comput. Sci. 31 (1992), 326.
[9] C.J. Date, An Introduction to Database Systems, Sixth Edition (Addison-Wesley, Reading, 1994).
[10] S.B. Davidson, C. Overton and P. Buneman, J. Comp. Biol. 2 (1995), 557.
[11] R.A. Engh and R. Huber, Acta Cryst. A47 (1991), 392.
[12] T. Etzold and P. Argos, CABIOS 9 (1993), 49.
[13] T. Etzold, A. Ulyanov and P. Argos, Methods in Enzymology 266 (1996), 114.
[14] K.H. Fasman, S.I. Letovsky, P. Li, R.W. Cottingham and D.T. Kingsbury, Nucl. Acids Res. 25 (1997), 72.
[15] S.R. Hall and N. Spadaccini, J. Chem. Inf. Comput. Sci. 34 (1994), 505.
[16] J.F. Gibrat, T. Madej and S.H. Bryant, Curr. Opin. Struct. Biol. 6 (1996), 377.
[17] G.L. Gilliland, M. Tung, D.M. Blakeslee and J. Ladner, Acta Cryst. D50 (1994), 408.
[18] P.M. Gray, N.W. Paton, G.J. Kemp and J.E. Fothergill, Protein Eng. 3 (1990), 235.
[19] P.M. Gray, G.J.L. Kemp, C.J. Rawlings, N.P. Brown, C. Sander, J.M. Thornton, C.M. Orengo, S.J. Wodak and J. Richelle, TIBS 21 (1996), 251.
[20] W.A. Hendrickson, Acta Cryst. A35 (1979), 158.
[21] C. Hogue, H. Ohkawa and S.H. Bryant, TIBS 21 (1996), 226.
[22] M. Huysmans, J. Richelle and S.J. Wodak, Proteins 11 (1991), 59.
[23] P. Karp and S. Paley, J. Comp. Biol. 3 (1996), 191.
[24] H. Migimatsu and W. Fujibuchi, The DBGET Resource (1997), http://www.genome.ad.jp/dbget/dbget.html.
[25] National Center for Biotechnology Information, Linking to Entrez Databases (1996), http://www3.ncbi.nlm.nih.gov/Entrez/linking.html.
[26] B. Palsson, Nature Biotech. 15 (1997), 3.
[27] P. Pearson, C. Francomano, P. Foster, C. Bocchini, P. Li and V. McKusick, Nucl. Acids Res. 22 (1994), 3470.
[28] G.D. Schuler, J.A. Epstein, H. Ohkawa and J.A. Kans, Methods in Enzymology 266 (1996), 141.
[29] E. Selkov, S. Basmanova, T. Gaasterland, I. Goryanin, Y. Gretchkin, N. Maltsev, V. Nenashev, R. Overbeek, E. Panyushkina, L. Pronevitch, E. Selkov, Jr. and I. Yunus, Nucl. Acids Res. 24 (1996), 26.
[30] I.N. Shindyalov, W. Chang, J.A. Cooper and P.E. Bourne, Proceeding of the 28th Annual Hawaii International Conference on System Science (Vol. V. Biotechnology Computing, 1995) IEEE Computer Society Press, p. 208. http://www.sdsc.edu/moose.
[31] I.N. Shindyalov and P.E. Bourne, CABIOS 13 (1997), In Press.
[32] I.N. Shindyalov and P.E. Bourne, The Compare3D Java Applet (1997), http://xtal1.sdsc.edu/misha/compare_3d.html.
[33] T.F. Smith and M.S. Waterman, J. Mol. Biol. 147 (1981), 195.
[34] C. Smith, M. Gribskov, I.N. Shindyalov, S.S. Taylor, L. Ten Eyck, S. Veretnik and P.E. Bourne, TIBS (1997), Submitted.
[35] C. Smith and P.E. Bourne, Enzymology Protein Information File Dictionary (1997), http://www.sdsc.edu/Kinases/development/PIF/SFBrowser.html.
[36] M.J. Sternberg and S.A. Islam, Biochem. Soc. Trans. 17 (1989), 845.
[37] H. Weissig and P.E. Bourne, Archive of obsolete PDB entries (1997), http://db2.sdsc.edu/PDBObs/PDBObs.cgi.
[38] J.D. Westbrook and S.R. Hall, A Dictionary Description Language for Macromolecular Structure, Report NDB-110 (Rutgers University, New Brunswick, NJ, 1995).
[39] N. Williams, Science 275 (1997), 301.
LIST OF FIGURES
Figure 3. Queries from the obsolete structures database: a) the chronology of hemoglobin; b) details of different hemoglobin replacements; c) color-coded deviation from ideal geometry (bond lengths, bond angles, and dihedral angles) averaged for each residue, green closest, red farthest from ideality.
Figure 4. The distribution of PDB replaced structures showing changes in the number of atoms.