The Design and Implementation of a Database For Human Genome Research
Rob Sargent*, Dave Fuhrman*, Terence Critchlow+, Tony Di Sera*, Robert Mecklenburg+, Gary Lindstrom+, Peter Cartwright*
*Utah Center for Human Genome Research
+Computer Science Department
University of Utah, Salt Lake City, UT 84112 USA
Abstract
The Human Genome Project poses severe challenges in database design and implementation. These include comprehensive coverage of diverse data domains and user constituencies; robustness in the presence of incomplete, inconsistent and multi-version data; accessibility through many levels of abstraction; and scalability in content and organizational complexity. This paper presents a new data model developed to meet these challenges by the Utah Center for Human Genome Research. The central characteristics are (i) a high level data model comprising five broadly applicable workflow notions; (ii) representation of those notions as objects in an extended relational model; (iii) expression of working database schemas as meta data in administration tables; (iv) population of the database through tables dependent on the meta data tables; and (v) implementation via a conventional relational database management system. We explore two advantages of this approach: the resulting representational flexibility, and the reflective use of meta data to accomplish schema evolution by ordinary updates. Implementation and performance pragmatics of this work are sketched, as well as implications for future database development.
Keywords: extended relational data model, schema evolution, meta data, reflection, genome informatics.
Requirements
The Human Genome Project is a 15-year, $3 billion international effort to sequence (identify) all 3 billion nucleotides (bases) found in human chromosomes, as well as the genomes of selected model organisms. The human genome comprises an estimated 50,000 to 100,000 genes, each of length 10,000 to 1 million bases. In coarsest terms, the human genetic sequencing problem amounts to deriving by laboratory observation a string of length 3 billion over a four-letter alphabet. While this sounds like (and is) a vast undertaking, a computer scientist might smugly observe that this monotonous string comprises only 6 billion bits, or
750 megabytes — hardly a large or richly structured database at first glance. Yet this view is as misguided as dismissing a piece of software as being simple because its executable can fit on a commodity hard disk. Robbins [1] has compared the goals of the Human Genome Project to reverse engineering Unix from maliciously mutilated remnants of its object files. In more prosaic terms, the driving requirements for genome databases are representational flexibility and scalability in complexity, i.e., data evolvability. Three database design strategies may be considered to address these requirements.
A customized design: represent directly and precisely the biochemical techniques, laboratory artifacts, and genetic mechanisms of direct interest to the center.
A generalized design: investigate the needs of several center laboratories and determine recurring concepts, emphasizing common features and specializations thereof. In software engineering terms, this is classical object analysis.
A meta design: generically categorize the fundamental phenomena of laboratory science. Then specify the notions of interest in the center's laboratory by definitions within these categories.
The UCHGR informatics team initially pursued a customized design strategy. About five years ago the merits of a generalized design — in particular, building on object-oriented (OO) concepts — became evident. The immaturity of OODBMS commercial products ruled out their adoption at that time. Construction of an in-house OODBMS was deemed unaffordable and ill-advised. Hence, use of current generation database technology was dictated, implying selection of a relational database management system (RDBMS). The merging of OO concepts and RDBMS technology required the representation of meta data, which in turn prompted the formulation of a meta design for laboratory science data.
A new data model
The two major goals of our new data model are expressive power and ease of accommodating change in application concepts and data organization. To attain these goals, the space of all concepts is divided into five categories:
Objects
These are equivalent to entities in the entity-relationship (E-R) model.
Relationships
These are similar to relationships in the E-R model, but substantially generalized to relate a variable number of, possibly ordered, sets of objects and relationships.
Processes
Processes represent the execution of some procedure — computational or laboratory. They require objects or relationships as input and produce new objects or relationships as output. They are analogous to simplified computational proxies (Cushing [2]) or experiments (Chen [3]).
Protocols
The execution of a process is guided by a protocol, which defines the steps involved and the materials and equipment required.
Environments
An environment records all the information required to re-execute a process.

There is an inherent representational difficulty in the relational model: since all composite data are represented via tables, differentiating between entities and relationships is difficult in the absence of meta-level information. The representational uniformity that is a prime virtue of the relational model also camouflages conceptual structure and inhibits principled evolution. Each change to the database design requires skilled shepherding by the database designer to ensure that schema changes ripple thoroughly and consistently throughout the system. Once a set of schema changes has been determined, the database system typically must be taken down and bulk reloaded. In short, the human and service costs of schema evolution are intolerable.

Our new data model exploits a reflective meta-data approach that makes schema information explicit and thereby amenable to manipulation. The implications of this new approach are pervasive, but most evident in three features: a new model for flexibly specifying and representing relationships; meta information stored in database tables with fixed schemas; and the representation of all five categories in our data model as database objects designated by globally unique identifiers (OIDs).

The two kinds of database objects of direct interest here are objects and relationships. Each class of relationship or object has a description stored in the database recording essential information for accessing and manipulating the object (e.g., the name of the table in which it is stored). These meta data descriptions can be queried and directly manipulated by applications to allow model objects and relationships to evolve without evolution of schemas in the underlying RDBMS.

As previously mentioned, each database object has a unique identifier, including all model objects, relationships, processes, and protocols. This allows uniform polymorphic references to any database object using a single identifier space. Types also have OIDs, permitting relationships and objects to refer to their types as meta data. This supports a dynamic type dispatch capability in applications. Object identifiers are never recycled, thereby ensuring that database objects and types never relinquish their identities. Hence archival data and their types can always be retrieved and interpreted under new process protocols — a crucial longevity requirement.
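To make the five categories concrete, the sketch below renders them as simple record types. It is an illustrative paraphrase of the model, not the center's implementation; the class and field names (oid, inputs, settings, and so on) are our own choices.

```python
# Illustrative only: a minimal, hypothetical rendering of the five model
# categories as Python records. Field names are assumptions, not the UCHGR
# schema; every entity carries a never-recycled, globally unique OID.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Object:                 # an entity, as in the E-R model
    oid: int
    type_oid: int             # types are themselves database objects with OIDs
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Relationship:           # relates possibly ordered sets of objects/relationships
    oid: int
    type_oid: int
    roles: Dict[int, List[int]] = field(default_factory=dict)  # role number -> member OIDs

@dataclass
class Protocol:               # how a process is performed: steps, materials, equipment
    oid: int
    steps: List[str] = field(default_factory=list)

@dataclass
class Process:                # one execution of a procedure, guided by a protocol
    oid: int
    protocol_oid: int
    inputs: List[int] = field(default_factory=list)    # OIDs of objects/relationships consumed
    outputs: List[int] = field(default_factory=list)   # OIDs of objects/relationships produced

@dataclass
class Environment:            # everything needed to re-execute a process
    oid: int
    process_oid: int
    settings: Dict[str, str] = field(default_factory=dict)
```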
Roles Table
Object OID   Relationship OID   Type     Role
123          167                object   1
501          167                object   2
502          167                object   2
503          167                object   2

Relationships Table
Relationship OID   Type OID
167                166

Objects Table
Object OID   Type OID
501          321
502          321
503          321
123          370

Type Table
Type OID   Description   Table Name
166        "K to L"
321        "L Class"     "L Table"
370        "K Class"     "K Table"

L Table
OID   Value
501   t
502   u
503   v

K Table
OID   Value
123   a

Figure 1. Implementation of relationship model.
Representation strategy
Objects in our model are complex entities whose components can include references to other objects, collections (e.g., sets, vectors), and primitive attributes. These objects can be stored in an RDBMS by mapping each class and collection type to a table. Each object instance is then stored as a tuple with an OID, and each collection instance as a set of tuples. For each class, one table, known as the dominant table, is selected as the root of the decomposition and is listed in the Type Table. The remaining tables are subordinate tables and are recorded in the Dependents Table, which contains the OID of the dependent table, the OID of the parent table, and the name of the table containing the object's attributes. The original object may be reconstructed through a series of selects and joins.

All data elements are accessible from a set of main tables: the Objects, Relationships, and Processes Tables. The Objects Table contains the OID and type of all object instances. The Relationships Table contains the OID and type of all relationship instances. The Relationships Table relies on the Roles Table to record the participating components of a relationship and their instances: the Roles Table contains the OID and type of every object participating in a relationship, along with the relationship's instance OID and the role the object plays within the relationship.
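The fixed schemas of these main tables can be pictured roughly as follows. This is a sketch inferred from Figure 1 and the description above, not the center's actual DDL; the column names, SQL types, and the use of SQLite are our assumptions.

```python
# A rough sketch (using SQLite for convenience) of the fixed-schema main and
# meta tables described above. Column names and types are inferred from
# Figure 1 and the surrounding text, not taken from the UCHGR DDL.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One row per object instance: its OID and the OID of its type.
CREATE TABLE Objects       (ObjectOID INTEGER PRIMARY KEY, TypeOID INTEGER);

-- One row per relationship instance: its OID and the OID of its relationship type.
CREATE TABLE Relationships (RelationshipOID INTEGER PRIMARY KEY, TypeOID INTEGER);

-- One row per participant in a relationship instance: which object (or
-- relationship) takes part, in which instance, and in which role.
CREATE TABLE Roles         (ObjectOID INTEGER, RelationshipOID INTEGER,
                            Type TEXT, Role INTEGER);

-- Meta data: every type is itself a database object; Table Name points at the
-- dominant data table holding instances of that type.
CREATE TABLE Type          (OID INTEGER PRIMARY KEY, Description TEXT, TableName TEXT);

-- Meta data for subordinate (collection/attribute) tables; reconstructing an
-- object walks from the dominant table through these entries via joins.
CREATE TABLE Dependents    (DependentOID INTEGER, ParentOID INTEGER, TableName TEXT);

-- Example of an ordinary data table named in the Type Table ("L Table" in Figure 1).
CREATE TABLE L_Table       (OID INTEGER PRIMARY KEY, Value TEXT);
""")
```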
This implementation is remarkably flexible, allowing relationships to be dynamically created by inserting elements into existing tables, without RDBMS schema modification. This flexibility also naturally accommodates several varieties of relationships that are traditionally difficult to model. Firstly, sets, bags, and sequences may be modeled as relationships containing only one role, with sequences being ordered. Secondly, n-ary relationships are easily represented, since there may be an arbitrary number of roles for each relationship. Finally, aggregates can be implemented by using relationship OIDs as members of a role.

Figure 1 illustrates the main tables in use. The example starts with the object K1, whose OID is 123. All objects related to it by a specific one-to-many relationship, "K to L", are to be retrieved. The steps enumerated below can be followed by corresponding labels in the diagram; the sketch at the end of this section renders them as executable queries.

Step 1: The OID of relationship "K to L", 166, is determined by performing a selection on the Type Table: select OID from Type where Description = "K to L".

Step 2: This OID is used to determine the set of relationship OIDs representing all of the instances of "K to L", the set {167}, by performing a selection on the Relationships Table.

Step 3: The set of relationship instances containing a reference to K1 is identified by performing a selection on the Roles Table: select Relationship_OID into X from Roles where Object_OID = 123 and Relationship_OID is in {167}. Knowing that the relationship is "K to L" tells us that K1 plays role 1 and the associated L objects play role 2.

Step 4: We now know all "K to L" relationship instances involving K1 (recorded in temporary table X). We find the associated L objects by another selection on the Roles Table: select Object_OID where Relationship_OID is in X and Role = 2, which yields the set {501, 502, 503}. At this point we have identified the OIDs of the objects we are interested in, but need to identify their types in order to retrieve the associated instance information.

Step 5: The type of these elements is "object", which identifies them as belonging to the Objects Table; that table is then queried to retrieve the OID of the type of these objects: 321. If these elements had been of type "relationship", the Relationships Table would have been queried instead of the Objects Table.

Step 6: The type OID is used to query the Type Table and retrieve the name of the table containing the instance information for the objects: L Table.

Step 7: The data table has now been identified, and the desired values can be extracted from it. If the types of the objects were known in advance, the data table could have been queried directly as soon as the OIDs were obtained, instead of indirecting through the Objects Table.

The flexibility of this design can be appreciated by noticing that the types of the objects retrieved in Step 5 can be dissimilar. By allowing varying types of objects to occur in the same role, the relationship may evolve without compromising its existing instances. This is most likely to occur when the old and new object types have some basis for similarity. In this case, both old and new applications may be able to access all of the relationship information without modification to either the application or the relationship instances. If the object types are not operationally equivalent, the new type is usually a subtype of the old type, allowing new applications to access all of the relationship information, while restricting old applications to access only supertype information. The ability to extend the semantics of a relationship without impacting existing applications is very important in a rapidly changing development environment.
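Below is a small, self-contained rendering of Steps 1 through 7 over the Figure 1 data. SQLite and Python's sqlite3 module stand in for the center's commercial RDBMS, and the underscore-separated table and column names are our own; treat it as an illustrative sketch rather than the UCHGR implementation.

```python
# A self-contained sketch of the Step 1-7 walkthrough over the Figure 1 data.
# SQLite and the underscore-separated names are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Type          (OID INTEGER, Description TEXT, TableName TEXT);
CREATE TABLE Relationships (RelationshipOID INTEGER, TypeOID INTEGER);
CREATE TABLE Roles         (ObjectOID INTEGER, RelationshipOID INTEGER, Type TEXT, Role INTEGER);
CREATE TABLE Objects       (ObjectOID INTEGER, TypeOID INTEGER);
CREATE TABLE L_Table       (OID INTEGER, Value TEXT);

INSERT INTO Type          VALUES (166,'K to L',NULL),(321,'L Class','L_Table'),(370,'K Class','K_Table');
INSERT INTO Relationships VALUES (167,166);
INSERT INTO Roles         VALUES (123,167,'object',1),(501,167,'object',2),(502,167,'object',2),(503,167,'object',2);
INSERT INTO Objects       VALUES (501,321),(502,321),(503,321),(123,370);
INSERT INTO L_Table       VALUES (501,'t'),(502,'u'),(503,'v');
""")

def in_clause(values):
    """Build a '?,?,...' placeholder list for an IN (...) predicate."""
    return ",".join("?" * len(values))

# Step 1: OID of the "K to L" relationship type.
(rel_type,) = con.execute("SELECT OID FROM Type WHERE Description = 'K to L'").fetchone()   # 166

# Step 2: all instances of that relationship type.
instances = [r[0] for r in con.execute(
    "SELECT RelationshipOID FROM Relationships WHERE TypeOID = ?", (rel_type,))]            # [167]

# Step 3: those instances in which K1 (OID 123) participates.
k1_rels = [r[0] for r in con.execute(
    f"SELECT RelationshipOID FROM Roles WHERE ObjectOID = 123 "
    f"AND RelationshipOID IN ({in_clause(instances)})", instances)]

# Step 4: the members playing role 2 (the L side) in those instances.
l_oids = [r[0] for r in con.execute(
    f"SELECT ObjectOID FROM Roles WHERE Role = 2 "
    f"AND RelationshipOID IN ({in_clause(k1_rels)})", k1_rels)]                              # [501, 502, 503]

# Step 5: the Roles rows say these members are of kind "object", so consult the
# Objects Table for their type OID (321).
type_oids = {r[0] for r in con.execute(
    f"SELECT TypeOID FROM Objects WHERE ObjectOID IN ({in_clause(l_oids)})", l_oids)}

# Step 6: the Type Table maps that type OID to the data table holding the instances.
(table_name,) = con.execute(
    "SELECT TableName FROM Type WHERE OID = ?", (type_oids.pop(),)).fetchone()               # 'L_Table'

# Step 7: pull the values themselves from the data table.
rows = con.execute(
    f"SELECT OID, Value FROM {table_name} WHERE OID IN ({in_clause(l_oids)})", l_oids).fetchall()
print(rows)  # [(501, 't'), (502, 'u'), (503, 'v')]
```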
Status and experience
The database as described is in daily use at the Utah Center for Human Genome Research. At present there are approximately 200 object types and 40 relationships modeled. The database contains a total of 249 tables, of which 15 contain meta information, 28 represent processes and environments, 32 represent relationships, 69 represent objects, and 105 represent archival data. The database spans 600 MB, of which 200 MB hold active data and 400 MB contain historical data. The largest individual table in the database has 400,000 entries. The active data size is expected to level off at approximately 2 GB under regular migration of aged data to archival storage.

This data model makes four contributions. The first lies in its simple yet fundamental concepts, which express database organizations not easily describable through direct use of the relational model. Second, the ability to perform schema evolution using the data manipulation language rather than the data definition language is a distinct advantage in an environment where schema changes are frequent: the UCHGR database has undergone at least 50 schema changes over the past four years, each of which would have required schema modifications under a conventional relational representation (a sketch of such a change appears at the end of this section). Third, this model offers a new migration path to object-oriented modeling without abandoning the pragmatic benefits of mature RDBMS technology; our approach fosters the incremental adoption of OO concepts while retaining the advantages of commercial technology. Finally, by instantiating the meta-level information within the database, the full power of the data manipulation language can be used to manipulate it. We have only scratched the surface of the potential power of this reflective approach.

On the other hand, integrity constraints and consistency checks are difficult to implement using the tools provided by the underlying RDBMS. This is a result of the incomplete meta information stored for relationships. There is also concern that, even if the meta information were known, using DBMS integrity features such as triggers would probably lead to extremely poor performance due to the difficulty of verifying all of the required information. It is unclear whether tuning the relationship implementation can overcome this problem.

While the current database performance is acceptable, it does not match the performance of a well-tuned traditional database system. There are severe penalties for the additional large self-joins that are required to denormalize information. In response, some performance-critical areas of the database have been converted to the traditional relational model, turning the existing implementation into a hybrid representation. Some applications take advantage of type regularities to increase performance by reducing the number of selections and joins used to obtain the desired results (as noted earlier under Step 7). While user response time may not be as fast as desired, this model provides a significant reduction in the time required to incorporate new schema information into the database, and a corresponding increase in data availability and development productivity.
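To illustrate the second contribution, the sketch below shows how a brand-new relationship class might be introduced with ordinary inserts alone, leaving the RDBMS schema untouched. It reuses the illustrative SQLite layout of the earlier sketches; the "M to N" relationship name, the OIDs, and the pre-existing objects are all hypothetical.

```python
# Sketch: evolving the *model* schema with DML only. The SQLite tables and all
# names/OIDs below are illustrative assumptions reusing the earlier sketches;
# no ALTER TABLE or CREATE TABLE is needed for the new relationship class.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Type          (OID INTEGER, Description TEXT, TableName TEXT);
CREATE TABLE Relationships (RelationshipOID INTEGER, TypeOID INTEGER);
CREATE TABLE Roles         (ObjectOID INTEGER, RelationshipOID INTEGER, Type TEXT, Role INTEGER);
CREATE TABLE Objects       (ObjectOID INTEGER, TypeOID INTEGER);
-- assume two pre-existing objects, 601 and 602, of existing types
INSERT INTO Objects VALUES (601, 321), (602, 370);
""")

# Defining a new relationship class "M to N" (hypothetical) is just a row in
# the Type Table; 900 stands in for a freshly allocated, never-recycled OID.
con.execute("INSERT INTO Type VALUES (?, ?, NULL)", (900, "M to N"))

# Creating an instance of the new class relating objects 601 and 602 is
# likewise ordinary DML: one Relationships row plus one Roles row per member.
con.execute("INSERT INTO Relationships VALUES (?, ?)", (901, 900))
con.executemany("INSERT INTO Roles VALUES (?, ?, ?, ?)",
                [(601, 901, "object", 1), (602, 901, "object", 2)])

# The new class is immediately queryable through the same fixed-schema tables.
print(con.execute("""
    SELECT r.ObjectOID, r.Role FROM Roles r
    JOIN Relationships rel ON rel.RelationshipOID = r.RelationshipOID
    JOIN Type t            ON t.OID = rel.TypeOID
    WHERE t.Description = 'M to N'
""").fetchall())   # [(601, 1), (602, 2)]
```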
Related work
Worldwide genetic data resides in several very large community databases, plus hundreds of smaller laboratory databases. Most of the community databases, such as the GDB genetic linkage database (GDB [4]), are currently implemented using the traditional relational model within an RDBMS. The schema evolution problems with this model are evidenced by the criticism that GDB has faced for its inability to evolve its data representation to meet community needs on a timely basis. While the problems of communicating between different databases are difficult, the problems faced by an individual laboratory database should not be underestimated (Goodman [5]).

Several papers (Goodman [6], Rozen [7], Rozen [8], Stein [9]) describe the LabBase system implemented at the Whitehead Institute, which attempts to deal with these problems. While the LabBase system faces the same challenges and problems as the system described in this paper, it emphasizes performance rather than flexibility. As a result, the LabBase system uses C++ and OODB technology, and can be expected to be much faster. However, LabBase faces major problems with respect to database evolution. In particular, whenever the schema changes, log files are used to reload the data into the new schema. While this may work with smaller databases, it clearly raises scalability issues. The ultimate database system for which we all strive would combine LabBase's performance and our flexibility.

The need for an appropriate data model has been faced by other informatics groups. In particular, the experiment model developed by the OPM project (Chen [3]) is similar to our meta data model; the major difference is the addition of protocols and environments in our model. The use of these two concepts allows us to separate the meta information about how a process is performed from the actual details of an individual process execution, while ensuring enough information is retained to redo any individual process from scratch. The OPM schema is implemented using a tool that converts directly from the conceptual schema to a straightforward RDBMS representation, which does not attempt to address the schema evolution problems that prompted our implementation strategy.
Future work
With the experience gained thus far, we are embarking on the implementation of a genuine OO language binding for the model. Among the lessons learned is the recognition that the relationship model is perhaps too blank a canvas, and invites free-form art by programmers. An OO binding
for the concept would provide an invigorating opportunity to define the semantics of the relationship model more rigidly and to codify idioms for its effective use.

The next area of exploration is performance enhancement. Although the current implementation performs tolerably well, we anticipate a ten-fold increase in data volume shortly. To improve performance we can employ caching at several levels; a sketch of one such cache appears at the end of this section. Fortunately, the data itself does not change rapidly during the execution of a single process, and concurrency control standards weaker than strict serializability are often acceptable.

Finally, we are left with the more profound questions concerning the results of this work. Our goal has been to make information implicit in the database explicit. This has resulted in a significantly more expressive representation of complex relationships, while meeting the evolvability requirement critical to project success. However, by diluting static structure in the database, we have also robbed the DBMS of information it needs to accomplish optimization and integrity control. We have yet to fully gauge the implications of this decision.
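The following is a minimal sketch of the kind of per-process read cache envisioned above, assuming that data fetched during a single process execution may safely be reused until that process completes. The class, its methods, and the fetch callback are hypothetical and not part of the existing system.

```python
# Minimal, hypothetical sketch of a per-process read cache keyed by OID.
# It assumes (as argued above) that data rarely changes during one process
# execution, so entries may be reused until the process finishes; this trades
# strict serializability for fewer round trips to the RDBMS.
from typing import Any, Callable, Dict

class ProcessCache:
    def __init__(self, fetch: Callable[[int], Any]):
        self._fetch = fetch              # callback that loads one row/object by OID
        self._entries: Dict[int, Any] = {}

    def get(self, oid: int) -> Any:
        """Return the cached value for an OID, loading it on first use."""
        if oid not in self._entries:
            self._entries[oid] = self._fetch(oid)
        return self._entries[oid]

    def invalidate(self, oid: int) -> None:
        """Drop a single entry, e.g. after this process writes the object."""
        self._entries.pop(oid, None)

    def clear(self) -> None:
        """Discard everything when the process execution completes."""
        self._entries.clear()
```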
Acknowledgments
The authors thank Douglas Adamson, Josh Cherry, Debi Nelson, and Bob Weiss for their many contributions to this effort. Funding for this project was provided in part by a National Institutes of Health grant to the Utah Center for Human Genome Research.
References
[1] Karen A. Frenkel. 1991. The Human Genome Project and Informatics. Communications of the ACM, Vol. 34, No. 11.
[2] Judith B. Cushing, David Maier, Meenakshi Rao, Don Abel, David Feller, D. Michael DeVaney. 1994. Computational Proxies: Modeling Scientific Applications in Object Databases. Proceedings of the Seventh International Working Conference on Scientific and Statistical Database Management.
[3] I-Min Chen, Victor M. Markowitz. 1995. An Overview of the Object Protocol Model (OPM) and the OPM Data Management Tools. Information Systems, Vol. 20, No. 5.
[4] Johns Hopkins University School of Medicine, Genome Database Project. 1994. GDB User Guide, Version 5.3.
[5] Nathan Goodman, Steve Rozen, Lincoln Stein. 1994. A Glimpse at the DBMS Challenges Posed by the Human Genome Project. Available via ftp from genome.wi.mit.edu as pub/papers/Y1994/challenges.ps.Z.
[6] Nathan Goodman, Steve Rozen, Lincoln Stein. 1994. Building a Laboratory Information System around a C++-Based Object-Oriented DBMS. Proceedings of the 20th International Conference on Very Large Databases. Available via ftp from genome.wi.mit.edu as pub/papers/Y1994/building.ps.Z.
[7] Steve Rozen, Lincoln Stein, Nathan Goodman. 1994. Constructing a Domain-Specific DBMS using a Persistent Object System. Sixth International Workshop on Persistent Object Systems. Available via ftp from genome.wi.mit.edu as pub/papers/Y1994/labbase-design.ps.Z.
[8] Steve Rozen, Lincoln Stein, Nathan Goodman. 1995. LabBase: A Database to Manage Laboratory Data in a Large-Scale Genome Mapping Project. Available via ftp from genome.wi.mit.edu as pub/papers/Y1995/labbase.ps.gz.
[9] Lincoln Stein, Steve Rozen, Nathan Goodman. 1994. Managing Laboratory Workflow with LabBase. Proceedings of the 1994 Conference on Computers in Medicine. Available via ftp from genome.wi.mit.edu as pub/papers/Y1995/workflow.ps.Z.