Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995
Comparative Study of Relational and Object-Oriented Modelings of Genomic Data

Dong-Guk Shin
Computer Science & Engineering
University of Connecticut
Storrs, CT 06269-3155

ABSTRACT

Two major techniques commonly available for modeling genomic data are the relational and object-oriented approaches. This paper discusses the strengths and weaknesses of both approaches. The comparison was done using two sets of disparate genomic data, one for the E. coli genome and one for the human genome, in order to demonstrate the generality of the modeling methods and the credibility of the comparison itself. One major strength of the object-oriented approach is its highly flexible data modeling power, offering an elegant way of representing complex genomic objects. The approach’s major weakness is the lack of a generic way of accessing complex objects. The strength of the relational approach is its full provision of SQL. But the approach’s weakness is the cumbersome modeling of complex genomic objects that is due to normalization. This paper also includes discussions on broader issues related to the two alternative approaches, such as support for ad hoc queries and a federation of heterogeneous genomic databases in which both relational and object-oriented data models are used.

1. INTRODUCTION

A major hurdle faced by molecular biologists doing large scale DNA sequencing and gene mapping involves efficient handling and analysis of the voluminous data that is acquired. In the data modeling aspect, two candidate techniques, relational and object-oriented, are commonly available. Of the two, the majority of existing genome centers use relational database management systems (DBMS) to store genomic data (e.g., GDB, GSDB, DDBJ, EMBL). Only a handful of groups are exploring the use of object-oriented DBMSs (e.g., MBASE). It has been highly debatable whether one technique outperforms the other and, if indeed one does, for what reasons.

This paper reports the results of a study comparing the two data modeling techniques on two disparate sets of genomic data, E. coli genome data and human genome data. Discussed first is how the relational modeling of E. coli genome data can be compared with its object-oriented counterpart. The study then includes the modeling and the comparison for some key portion of human genome data in order to demonstrate the generality of the modeling techniques and the credibility of the comparisons. This paper also includes discussions on broader issues related to the two alternative approaches, such as support for ad hoc queries and a federation of heterogeneous genomic databases in which both relational and object-oriented data models are used.

2. TWO METHODS OF MODELING E. COLI GENOME DATA

The E. coli genome data used here come from the EcoSeq database, maintained in ASN.1 form by Kenneth Rudd at NCBI [7]. A conceptual representation¹ of a small portion of the data is given in Figure 1. Here an E. coli chromosome is seen as a sequence of “contigs,” where each contig is called either a “meld” or a “single.” A meld is a larger piece of a sequenced object and is constructed by concatenating a sequence of fragments called “constituents” in a non-overlapping manner. For example, the meld rrnCecoM is made by putting together nine non-overlapping sub-sequences of eight constituents (note that ECOILVGE is used twice, at the 7th and 9th positions). Throughout the section, the data related to Figure 1 will be used to illustrate various comparisons.

¹Note that the conceptual view given in Figure 1 reflects only one way, i.e., Kenneth Rudd’s way, of organizing the data available for the E. coli genome.
Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE
Figure 1. Conceptual relationships among E. coli genomic entities.
2.1. Data Modeling

Figure 1 demonstrates one way of viewing an E. coli chromosome as a complex object, complex meaning that the chromosome is made of many sub-objects, where each sub-object is itself made of another layer of sub-objects, and so on. One issue for the comparison is how a complex genome object like the one described in Figure 1 can be modeled in the relational as well as in the object-oriented approach. In particular, two criteria are used for the comparison. One is how to describe what is called the IS-A-PART-OF relationship, like the one illustrated by “the rrnCecoM contig is a sub-object of the E. coli chromosome”, or by “ECORGNC is a sub-object of the meld rrnCecoM”. The other is how to represent the ordering information, like the one illustrated by “a portion of ECOILVGE is followed by a portion of ECOILVGMED, which in turn is followed by a portion of ECOILVGE”.
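To make the ordering criterion concrete, the following minimal sketch (an illustration added here, not code from the study) contrasts one possible relational encoding with an aggregate list. It assumes a simplified Meld table with hypothetical columns MeldName, Name and "Order", uses Python's built-in sqlite3 module in place of a production DBMS, and introduces a made-up constituent NEWFRAG; only the three constituent names and their 7th to 9th positions come from the running example.

```python
import sqlite3

# Relational encoding: an explicit position column records the ordering.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Meld (MeldName TEXT, Name TEXT, "Order" INTEGER)')
conn.executemany(
    "INSERT INTO Meld VALUES (?, ?, ?)",
    [
        ("rrnCecoM", "ECOILVGE", 7),
        ("rrnCecoM", "ECOILVGMED", 8),
        ("rrnCecoM", "ECOILVGE", 9),
    ],
)


def insert_constituent(conn, meld, name, position):
    """Insert a constituent and return how many existing rows were renumbered."""
    cur = conn.execute(
        'UPDATE Meld SET "Order" = "Order" + 1 WHERE MeldName = ? AND "Order" >= ?',
        (meld, position),
    )
    conn.execute("INSERT INTO Meld VALUES (?, ?, ?)", (meld, name, position))
    return cur.rowcount


def constituents(conn, meld):
    """Return the constituent names of a meld in sequence order."""
    rows = conn.execute(
        'SELECT Name FROM Meld WHERE MeldName = ? ORDER BY "Order"', (meld,)
    )
    return [name for (name,) in rows]


# NEWFRAG is a hypothetical constituent used only to show the renumbering.
renumbered = insert_constituent(conn, "rrnCecoM", "NEWFRAG", 8)

# Aggregate encoding: a list holds the same ordering without a position column.
meld_list = ["ECOILVGE", "ECOILVGMED", "ECOILVGE"]
meld_list.insert(1, "NEWFRAG")
```

In the tabular form, inserting one constituent renumbers every later row of the meld, whereas the list keeps the order implicitly in its own structure.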
Aggregate data types. First, we show how complex objects can be represented in the relational approach. Figure 2 shows one way of modeling the IS-A-PART-OF relationship and the ordering information in tabular form. This modeling is one of a few alternatives that can be exploited with the relational approach. See the table Meld. Here a column called “Order” is introduced to model the ordering among the constituents participating in forming a meld. Alternatively, one could use “NextConstituent” in lieu of “Order” to form a linked list that maintains the ordering among the involved constituents. Both approaches are cumbersome. In the approach using “Order”, inserting an element into the sequence, or deleting one from it, forces the order values of all subsequent elements to be adjusted. In the approach using “NextConstituent”, updates become easier, but retrieving the constituents in sequence requires repeated joins due to the nature of chained referencing. This problem is unavoidable when one attempts to model ordering information in tabular form. The deficiency stems from the relational data model’s lack of support for aggregate data types.

Figure 2. Relational modeling of E. coli genome data (sample rows of the relations Chromosome, GeneLocation, Meld and Sequence; tabular content not reproduced).

Figure 3. Instance level description of the object-oriented modeling of E. coli genome data.

Now we illustrate how the same relationships given in Figure 1 can be modeled in the object-oriented database approach. In object-oriented databases, the IS-A-PART-OF relationship and the ordering information can be described straightforwardly by using reference pointers and aggregate data types such as set, list, and array (e.g., in ONTOS one can use the Reference data type and the Set, List or Array data types). To make the comparison vivid, the instance level description of the modeled objects is shown in Figure 3. A schematic description of the modeling is shown in Figure 4. This schema description shows both the IS-A relationship hierarchy and the IS-A-PART-OF relationship hierarchy. To see the latter hierarchy, follow
the numbered arrows. Discussion of this modeling follows.

Naturalness of modeling. The instance level description shown in Figure 3 bears a striking resemblance to the conceptual description of the relationships given in Figure 1. This mainly stems from the support of aggregate data types, which eliminates the need for normalization. The use of direct references with object identifiers (OID) further facilitates the structural resemblance. Now the question is, “What benefit does the naturalness of modeling bring?” Although it may often be difficult to quantify, this naturalness in modeling is expected to aid code development and maintenance. For example, consider the case of producing the complete sequence for the entire E. coli chromosome. In the object-oriented modeling of Figure 3, a software engineer can quickly visualize the steps of the code: it begins from the root object C1, traverses the links to visit each sequence object (constituents as well as singles), and produces the overall sequence. In contrast, in the relational modeling of Figure 2, the software engineer must first make an attempt to understand how the normalized representation in tabular form relates to the structure of the E. coli chromosome. This conceptualization process is often nontrivial. Later, in Section 3.1, we illustrate that in the human genome data case understanding the tabular representation is far more difficult than in the E. coli case, due to its much higher degree of complexity.

2.2. Method Encapsulation vs. Stored Procedures
Besides the differences in data modeling, another distinctive feature of the object-oriented approach is a way of bundling together functions or methods with data. This feature is called encapsulation. It provides a way of tightly integrating data with user-defined functions, including access methods to complex objects, statistical computations, arbitrary object-dependent functional computations, triggering for integrity checks, etc. Included below is a discussion of how encapsulation differs from the common relational way of doing the same thing, i.e., the use of stored procedures in Sybase. Refer to the conceptual relationships described in Figure 1 and consider the following example query.

“Show the contigs that contain the gene ilvM and whose length is greater than 500.”

The above query can be expressed in an object-oriented SQL, in Ontos for example, as follows.

    SELECT Contig()->Retrieve1()
    FROM   Gene
    WHERE  Gene.Name() = "ilvM" and Contig()->Length() > 500;
First, the function Retrieve1() illustrates the idea of sharing procedures. In this arrangement, procedures or functions designed to access the complex object Contig are implemented independently of database applications, and they are stored within the database as sharable objects. Sharable means that multiple database applications can use a common set of functions designed to access Contig objects in predesigned styles. For example, another function called Retrieve2() can be encapsulated within Contig so that it returns a different portion of data from Contig objects. In a relational system such as Sybase, a similar sharability of user-defined functions can be achieved by stored procedures. Stored procedures, however, are significantly different from encapsulated functions. For example, in Ontos, as illustrated in the query example above, encapsulated functions can be used in a query expression the same way attributes are used. But in Sybase no stored procedure can be used inside SQL expressions, although the use of some system-defined functions is permitted. As another example, encapsulated functions can be written in free-style programming directly on top of class definitions. In the relational approach, however, stored procedures need to use SQL to bind the persistent data to local variables prior to any data manipulation. This limitation is known as “impedance mismatch”, by analogy with the notion in the electronics field [5].

Second, the function Length() illustrates the generally known strength of an object-oriented approach, i.e., inheritance and polymorphism. In the object-oriented design, Length() is defined as a virtual function for Contig. Then, in each of Contig’s subclasses, Meld and Single (see Figure 4), the method of computing the length of the sequence differs, and so Length() is implemented differently in each subclass. The user sends the request (i.e., a function call) to the superclass Contig without needing to discern whether the recipient object of the request is a meld or a single. Polymorphism is an effective way of organizing similar procedures and avoiding naming conflicts. In the relational approach there is no counterpart analogous to inheritance and polymorphism.

Third, the function Contig() illustrates a way of building explicit links between objects by using reference pointers from gene objects to contig objects.
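The roles of Length() and Contig() in the example query can be sketched in a general-purpose language. The following Python fragment is only an illustration under stated assumptions: the class layout follows the Contig, Meld and Single hierarchy described above, but the attribute names, the (start, stop) representation of constituent portions, and all sample values are hypothetical, and ONTOS itself is not used.

```python
from abc import ABC, abstractmethod


class Contig(ABC):
    """Superclass; length() plays the role of the virtual function Length()."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def length(self):
        """Each subclass computes its length in its own way."""


class Single(Contig):
    """A contig stored as one sequence string."""

    def __init__(self, name, sequence):
        super().__init__(name)
        self.sequence = sequence

    def length(self):
        return len(self.sequence)


class Meld(Contig):
    """A contig built from an ordered list of (start, stop) constituent portions."""

    def __init__(self, name, portions):
        super().__init__(name)
        self.portions = portions  # the list itself keeps the ordering

    def length(self):
        return sum(stop - start + 1 for start, stop in self.portions)


class Gene:
    def __init__(self, name, contig):
        self.name = name
        self._contig = contig  # direct reference, standing in for an OID

    def contig(self):
        return self._contig


def contigs_containing(genes, gene_name, min_length):
    """Names of contigs that contain the gene and are longer than min_length."""
    return [
        g.contig().name
        for g in genes
        if g.name == gene_name and g.contig().length() > min_length
    ]


# Hypothetical sample objects; the portion boundaries and sequence are made up.
meld = Meld("rrnCecoM", [(1, 300), (1, 400)])
single = Single("ECOTRNAG3", "ACGT" * 10)
genes = [Gene("ilvM", meld), Gene("glyV", single)]
result = contigs_containing(genes, "ilvM", 500)
```

The caller of contigs_containing() never checks whether a contig is a meld or a single; the call to length() is dispatched to the appropriate subclass.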
Given a Gene object, invoking the function Contig() returns pointers (e.g., unique object identifiers) to either single or meld objects which contain the gene object. This link appears in Figure 4 as a hashed arrow from Gene to Contig. A relational approach analogous to employing such a direct link would be to add a column such as Contig to the GeneLocation relation of Figure 2. This column contains foreign keys to tuples of Meld. For singles, the column can redundantly store the values given under the Sequence column of the same relation. One major difference between the two approaches lies in performing the actual data associations. In the object-oriented approach, once the OID for a contig object is available from the gene object, retrieving the referenced contig’s data means converting the OID to its physical storage location and fetching the corresponding record. In the relational approach, retrieving the referenced contig’s data means two joins, one between GeneLocation and Meld for melds and one between GeneLocation and Sequence for singles. Further discussion of the differences between the use of explicit links and relational joins is deferred to Section 3.2, where the use of explicit links is discussed in detail.

Figure 5. Relational modeling of human genomic entities and relationships.

Figure 6. Conceptual relationships among human genomic entities.

3. EXTENSION WITH HUMAN GENOME DATA

The human genome data used here is from GDB. We use a portion of the map data, namely, three tables: Locus, Order-Sets and Order-Elements. Among the current GDB data, these are known to have been the most difficult to deal with when they were being modeled into relational forms, due to the hierarchical nature of map nesting². This section focuses on two objectives. One is to elaborate the commonalities and differences in modeling two different types of genome data, i.e., human genome vs. E. coli. The other is to compare GDB’s relational modeling of the map portion with its object-oriented modeling counterpart.

²Personal communication with Ken Fasman at Genome Data Base.

3.1. Comparing the Modelings of Human and E. coli Genome Data

Our investigation of human genome data confirms the conjecture that much of the basic modeling discipline developed for the E. coli genome data is reusable. We find, however, that the human genome data poses a much more complex modeling challenge than E. coli due to the complexity inherent in the data. Small fractions of the three relations, Locus, Order-Sets and Order-Elements, are shown in Figure 5.

Similarities. Human genome modeling and E. coli genome modeling are similar in that they both require modeling of the IS-A-PART-OF hierarchy and of ordering among aggregate objects. Comparing Figures 2 and 5 reaffirms the typical way of modeling ordering in the relational approach. For example, in Order-Elements of Figure 5 the column Order-Element-Sequence is introduced to keep the ordering information, and this approach is very similar to the way the column Order is introduced in Meld as shown in Figure 2. How and why the relational
modeling of the ordering information results in a cumbersome design has already been discussed in Section 2.1. Figure 6 shows a conceptual view of the hierarchy of the human chromosome X that has been derived from analyzing the relations given in Figure 5. Note that the general structure of the map hierarchy given in Figure 6 is similar to the sequence hierarchy for the E. coli genome illustrated in Figure 1. It would thus not be surprising that the object-oriented modeling of the ordering and the IS-A-PART-OF hierarchy for the human genome map data can be done in the same way as for the E. coli genome. The instance level illustration of the object-oriented modeling is given in Figure 7. The schema description is given in Figure 8. As was the case for the E. coli data modeling, there is a close resemblance between the conceptualized map relationship shown in Figure 6 and its instance level object-oriented modeling given in Figure 7.

Figure 7. Instance level description of object-oriented modeling of human genome data.

Figure 8. Object-oriented schema for the human genome data.

Differences. Modeling human genome data is much more complicated than modeling E. coli genome data in two aspects. First, the degree of submap nesting is much deeper for the human genome data, and it also varies depending on which toplevel-maps are dealt with, as illustrated in Figure 6. In E. coli genome data, the level of nesting is only two (contigs and constituents) and is uniform across the entire genome, as illustrated in Figure 1. Second, the degree of incompleteness is much higher for the human genome than for E. coli. In the human genome case, a significant portion of the maps’ relative ordering along the chromosome is either unknown or unoriented (meaning that the orientation relative to “pter” and “qter” is unknown). In the relation Order-Sets of Figure 5, the values under the column Order-Class-Code distinguish three different ways of collecting submaps. Here G, U, and O mean grouped (i.e., ordering unknown), unoriented and ordered, respectively. Figure 6 shows examples: the submaps of 88 are ordered, the submaps of 89 are unoriented, and the submaps of 90 are completely unordered. Finally, deriving the conceptual view given in Figure 6 for the human genome map data by analyzing its relational modeling shown in Figure 5 is much more difficult than deriving the conceptual view given in Figure 1 for the E. coli genome data by analyzing its relational counterpart given in Figure 2. The augmented complexity plays a definitive role in increasing the difficulty. We conclude from this observation that the more complex an object is, the more significantly the naturalness of modeling factors in.

3.2. Joining vs. Using Direct Links in Handling Queries
Section 2.2 discussed that one of the important strengths of object-oriented modeling is the flexibility in data organization, which facilitates efficient query processing. This flexibility is achieved because object identifiers are used for direct referencing. It differs from the relational case, where referencing between different objects is done solely by values (e.g., foreign keys) and therefore joining is required. This observation is illustrated in detail, this time, in the human genome modeling domain. Consider a typical query, “Find maps (meaning toplevel-maps) containing RCP (red cone pigment)”. To answer this query, one should first join Locus and Order-Elements over locus identifiers. The SQL expression for the query is given below. Here Object-Class-Key = 1 means that in the Order-Elements relation the corresponding tuple’s Object-ID is a locus identifier.

    SELECT o1.Order-Set-ID
    FROM   Order-Elements o1, Locus l1
    WHERE  l1.Locus-Symbol = "RCP"
    AND    l1.Locus-ID = o1.Object-ID
    AND    o1.Object-Class-Key = 1

The result of the query is the Order-Set-ID of either a toplevel-map or a submap which immediately contains the RCP locus. By examining the relations given in Figure 5, one can conclude that 89 is returned as the answer. To see whether the map designated by 89 is a toplevel-map or not, the result must be joined with Order-Sets. If the tuple with Order-Set-ID = 89 in Order-Sets has a corresponding Locus-ID value, then 89 is a toplevel-map. Note that in Order-Sets of Figure 5, 89 is concluded to be a submap because it has no Locus-ID value. The fact that 89 is a submap can be verified in the conceptual diagram given in Figure 6. Next, since 89 is only a submap, the search should continue to find its topmap. Finding the toplevel-map containing the found submap requires a recursive search involving repeated joins between Order-Sets and Order-Elements. This process can be very expensive if the submap is deeply nested. For example, to find the map that immediately contains 89 and to see whether that map is a toplevel-map, one must subsequently perform the following join:

    SELECT o1.Locus-ID
    FROM   Order-Sets o1, Order-Elements o2
    WHERE  o2.Object-ID = "89"
    AND    o2.Object-Class-Key = 16
    AND    o1.Order-Set-ID = o2.Order-Set-ID

The outcome of the above query is 12955. The fact that the outcome is a non-null value indicates that the corresponding order set is a toplevel-map. In the current case, the recursive search stops after only one trial. But had RCP been contained in 90, an additional search would have been needed. The deeper the initial submap is nested, the more recursive searching is needed.

Object-oriented modeling of the same data illustrates that the object-oriented approach offers a feature, i.e., explicit referencing, that can eliminate the extensive joins demonstrated above. For example, the object-oriented modeling of the same data, Figure 7, shows that the locus object with identifier 2340 contains OID(TM27), which is the reference pointer to its corresponding toplevel-map 88. This referencing appears as a hashed arrow from Locus to TopMap in Figure 8. Assume that the function TopLevelMap() is encapsulated within Locus and that it returns the OIDs of the corresponding toplevel-maps. Also assume that Locus_ID() is encapsulated within TopMap and that it returns the corresponding locus identifier. The typical query used above, “Find maps (meaning toplevel-maps) containing RCP (red cone pigment)”, can be expressed in ONTOS SQL as follows.

    SELECT TopLevelMap()->Locus_ID()
    FROM   Locus
    WHERE  Locus.Locus_Symbol() = "RCP"

This query demonstrates that retrieving the query’s answer is achieved merely by following the reference pointers. By examining the data illustrated in Figure 7, one can conclude that 12955 is retrieved as the answer.

Limitation of the object-oriented approach. One problem of the object-oriented approach is that its advantages cannot be exploited unless a significant level of preconception is built into the data model at the beginning. For example, explicit reference links cannot be exploited unless the need for them has been foreseen. The need for encapsulated functions should likewise be determined a priori. A corollary to this observation is that even if DBAs were able to predict successfully extensive sets of functions and object references, letting end users know of their availability would not be an easy task. On the other hand, in the relational case end users know how to retrieve the data from persistent storage once they are familiar with SQL and the database schema. The latter approach appears to demand much less than the former approach.
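The contrast between repeated joins and direct references can be sketched as follows. This is a minimal illustration, not GDB's actual schema or code: table and column names are simplified and written with underscores, only the identifiers mentioned in the text (RCP, 2340, 89, 88, 12955) are used as toy rows, Python's sqlite3 module stands in for the relational server, and the TopMap and Locus classes are hypothetical stand-ins for objects with encapsulated access functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Locus (Locus_ID INTEGER, Locus_Symbol TEXT);
    CREATE TABLE Order_Sets (Order_Set_ID INTEGER, Locus_ID INTEGER);
    CREATE TABLE Order_Elements (
        Order_Set_ID INTEGER, Object_ID INTEGER, Object_Class_Key INTEGER);
    INSERT INTO Locus VALUES (2340, 'RCP');
    INSERT INTO Order_Sets VALUES (88, 12955);
    INSERT INTO Order_Sets VALUES (89, NULL);
    INSERT INTO Order_Elements VALUES (89, 2340, 1);
    INSERT INTO Order_Elements VALUES (88, 89, 16);
    """
)


def top_level_map_locus(conn, symbol):
    """Return (Locus_ID of the top-level map, number of climbing steps)."""
    # First join: the order set that immediately contains the locus.
    row = conn.execute(
        "SELECT oe.Order_Set_ID FROM Order_Elements oe, Locus l "
        "WHERE l.Locus_Symbol = ? AND l.Locus_ID = oe.Object_ID "
        "AND oe.Object_Class_Key = 1",
        (symbol,),
    ).fetchone()
    if row is None:
        return None, 0
    set_id, climbs = row[0], 0
    while True:
        # A non-null Locus_ID marks a top-level map.
        own = conn.execute(
            "SELECT Locus_ID FROM Order_Sets WHERE Order_Set_ID = ?", (set_id,)
        ).fetchone()
        if own is not None and own[0] is not None:
            return own[0], climbs
        # Otherwise climb one level: find the set containing this submap.
        parent = conn.execute(
            "SELECT Order_Set_ID FROM Order_Elements "
            "WHERE Object_ID = ? AND Object_Class_Key = 16",
            (set_id,),
        ).fetchone()
        if parent is None:
            return None, climbs
        set_id, climbs = parent[0], climbs + 1


class TopMap:
    def __init__(self, locus_id):
        self._locus_id = locus_id

    def locus_id(self):
        return self._locus_id


class Locus:
    def __init__(self, symbol, top_map):
        self.symbol = symbol
        self._top_map = top_map  # direct reference, standing in for an OID

    def top_level_map(self):
        return self._top_map


relational_answer = top_level_map_locus(conn, "RCP")
rcp = Locus("RCP", TopMap(12955))
object_answer = rcp.top_level_map().locus_id()
```

The relational lookup needs one extra step per level of nesting, while the object version follows a single stored reference; the price is that the reference had to be foreseen when the Locus class was designed.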
4. FURTHER DISCUSSIONS AND ISSUES
The two previous sections were mostly focused on discussing relational and object-oriented modeling of genomic data. This section is devoted to addressing broader issues related to the two alternative approaches.

4.1. Ad Hoc Query

One observation that is commonly discussed in the genome informatics community is: “The relational approach is better than the object-oriented approach because the former supports SQL fully but the latter supports it poorly”. Support for ad hoc querying by a DBMS is an important issue. It is worthwhile, however, to examine the validity of the above statement with respect to the requirements of the genome community. First, an examination of the makeup of the user groups for the genome databases reveals that the majority of biologists are considered noncomputer professionals. There is ample evidence indicating that teaching SQL to noncomputer professionals is ineffective. For example, researchers developing natural language interfaces to databases have reported statistics that demonstrate the ineffectiveness (e.g., [12]). Many corporate organizations using relational systems employ what are called “SQL specialists”, whose primary task is to form SQL expressions for data processing needs. More importantly, teaching SQL to end-users does not merely involve teaching a syntax. Often a much more difficult issue is making the user understand the detailed semantics of the underlying database schemas. For example, a user should know what is meant by Order-Sets and Order-Elements in the GDB context in order to use them correctly in forming an SQL expression. One cannot assume that the general users of GDB will be versed in a 43-page description of the schema definitions and 14 pages of Entity and Relationship diagrams. In sum, if the services of the genome databases are mostly aimed at the general user groups, then the argument against the object-oriented approach due to its current lack of SQL support does not have strong ground. The basis of the argument is further weakened if one considers the current effort by the Object Database Management Group (ODMG), which is in the process of developing what is called Object Query Language (OQL).
This language is proposed to provide SQL functionalities as well as features designed to manipulate compound data structures (i.e., sets, bags, lists, arrays, etc.) that are beyond what SQL supports [4]. The genome informatics community should focus not on weighing which of the two alternative data models supports SQL better, but on developing an ad hoc querying mechanism that can serve the users better than SQL. In such a query system biologists should be able to place queries, in an ad hoc fashion of course, without digesting the detailed construction of the underlying database. One ideal solution could be one that provides both the user friendliness of natural language (NL) queries and the preciseness of SQL. Recent proposals for graphic-based query systems (e.g., SNAP [2] and SUPER [1]) are worthy of investigation toward achieving
Figure 9. Federation model for heterogeneous genome databases.

Figure 10. Graphical query interface for the genome database federation.
the goal. However, these systems still lack higher level intelligence in handling man-machine communication. As discussed in [9,10], further research is needed to find ways of modeling knowledge about the stored data and its use for intelligent man-machine communication.

4.2. Federation of Relational and Object-Oriented Genome Databases
One of the genome informatics requirements summarized in [8] emphasizes interoperability between related genome databases. One approach is to put disparately maintained data into one repository (e.g., tying sequence and map databases as reported in [11]). The future model proposed in [8] is to form a federation of genome databases, a model similar to the one discussed in [E]. In the federation approach, genome databases are loosely integrated and the autonomy of the local databases is maximally preserved. Figure 9 shows an example of a federation where two community databases, GDB and GSDB, and one laboratory database, Lawrence Livermore National Laboratory’s (LLNL) chromosome 19 contig database, are joined for data exchange. For the sake of discussion, assume that the map portion of the GDB data is
maintained in an object-oriented system³. We examine how the following example query presented in [3] can be processed within the federation.

“Retrieve all sequences which map close to marker M on human chromosome 19, are putative members of the olfactory receptor family, and have been mapped on a contig map of the region; return also the contig descriptions.”

As shown in Figure 9, gene family information and map-related information are available from GSDB and GDB, respectively. Thus an operation similar to a cross-database join between GDB and GSDB would be needed. Then appropriate probe information is to be derived from GDB. Finally, the probe data can be used to retrieve the corresponding contig description from LLNL’s database, which specializes in chromosome 19.

Figure 10 shows a graphical ad hoc query interface (GAQI) model of the kind discussed at the end of Section 4.1, used here in the federated environment. In the model the GAQI part is formed into a client program, and local query processing is handled by SQL and OQL servers. The goal of this query interface is to allow users to place ad hoc queries based on their conceptual understanding of genomic objects. Two approaches can be thought of. One approach, similar to OPM [6], is to make the interface provide the users with conceptual diagrams with extended semantics that sufficiently describe the content of the databases. For example, the user is presented with semantically enriched conceptual diagrams for GDB and GSDB so that the user can grasp how to graphically specify the restriction of “olfactory receptor” for GSDB and “a map close to chromosome 19” for GDB. The user also specifies graphically, at the conceptual level, which portions of GSDB and GDB should be cross-database joined. The other approach is to embed intelligence in the interface so that the interface itself attempts to resolve the inherent semantic discrepancies that exist between the user’s view of the objects and the DBA’s modeling of the same objects. The latter approach differs from the former in that in the latter approach the burden for the mapping between the user’s view and the DBA’s modeling is on the system’s side. Note that in the former approach the burden for the mapping is still on the user’s side. For example, assume that conceptual diagrams for GSDB are available on screen and the user attempts to form the following subquery, “all sequences which are putative members of the olfactory receptor family.” It will be the user who needs to figure out that the term “olfactory receptor” corresponds to, say, a graphical object named “gene family” before specifying the correct restriction condition. Needless to say, the latter approach is much more difficult to pursue. Any further discussion of this subject is beyond the scope of this paper.

In sum, two points are to be made clear. One is that building an ad hoc query mechanism that allows users to place queries at the conceptual level poses a good research challenge to the genome informatics community. The other is that developing such a mechanism could be one way of achieving seamless interoperation between federated genome databases in which both relational and object-oriented modeling techniques are used.
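The processing steps for the example query of this section can be sketched as three lookups against autonomous sources. The fragment below is a toy illustration only: the three Python structures merely stand in for GSDB, GDB and the LLNL database, every record and field name is made up, and the assumption that sequences, map positions and contigs share a common probe identifier is introduced here purely to make the cross-database join expressible.

```python
# Toy stand-ins for the three autonomous databases; every record is made up.
GSDB_SEQUENCES = [
    {"seq": "SEQ1", "family": "olfactory receptor", "probe": "P1"},
    {"seq": "SEQ2", "family": "kinase", "probe": "P2"},
    {"seq": "SEQ3", "family": "olfactory receptor", "probe": "P3"},
]
GDB_MAP = [  # positions on chromosome 19 in arbitrary units
    {"probe": "M", "chromosome": 19, "position": 100},
    {"probe": "P1", "chromosome": 19, "position": 104},
    {"probe": "P2", "chromosome": 19, "position": 101},
    {"probe": "P3", "chromosome": 19, "position": 900},
]
LLNL_CONTIGS = {"P1": "contig C7 (hypothetical description)"}


def federated_query(marker, family, window):
    """Return (sequence, contig description) pairs satisfying all three conditions."""
    # Step 1 (GDB): probes mapped within `window` units of the marker.
    pos = {r["probe"]: r["position"] for r in GDB_MAP if r["chromosome"] == 19}
    near = {
        p for p, x in pos.items()
        if p != marker and abs(x - pos[marker]) <= window
    }
    # Step 2 (GSDB joined with the GDB result): family members among them.
    hits = [
        s for s in GSDB_SEQUENCES
        if s["family"] == family and s["probe"] in near
    ]
    # Step 3 (LLNL): keep sequences placed on a contig and fetch its description.
    return [
        (s["seq"], LLNL_CONTIGS[s["probe"]])
        for s in hits
        if s["probe"] in LLNL_CONTIGS
    ]


answer = federated_query("M", "olfactory receptor", 10)
```

Each step touches only one source, which mirrors the loose integration of a federation; a real implementation would issue the steps to the SQL and OQL servers rather than filter in-memory lists.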
5. CONCLUSION
One major strength of the object-oriented approach appears to be its highly flexible data modeling power, in particular its ability to represent ordering and IS-A-PART-OF relationships using the built-in aggregate data types. In the relational approach, the required normalization process results in cumbersome designs when complex genomic objects are modeled. One major weakness of the object-oriented approach is that it does not provide any generic way of accessing complex objects. The access methods need to be custom-built before any access is permitted. In the relational approach, on the other hand, any persistent data can be retrieved using the generic method, SQL, although this access method may not offer the most efficient solution. In some sense, the reason why object-oriented databases are not currently popular in the genome informatics community could be the immaturity of the technology and the lack of standards. These problems are expected to disappear as the technology evolves and groups such as ODMG work to produce a standard [4]. One may still argue that the need to preconceive all the benefits, i.e., predicting to-be-useful access methods or reference links between closely related objects, is a problem. However, this argument is a weak one considering the role differences between DBAs and end-users of the genome databases. For example, if end-users of the federated genome databases are not expected to be familiar with the detailed designs and constructs of the involved databases, the need to modify or change the data modeling on demand would affect only DBAs.
GDB is currently in the process of revising its existing schema into a new one, Version 6. In this revision, the plan is to separate the GDB data into multiple smaller databases. Only the citation database is currently separate from the rest. It is possible that the map portion of the data may also be put exclusively into a relational server. Then, although it may be in the distant future, it is not inconceivable that the map portion of the data will be modeled in an object-oriented system to take advantage of its flexible modeling power.
Acknowledgements
The author wishes to thank Kenneth Rudd and Ken Fasman for providing valuable discussions on EcoSeq E. coli genome data and GDB human genome data, respectively. He also would like to thank Jamie Cuticchia and Mike Chipperfield for helpful discussions on genomic terminologies and concepts. This paper was written during the author’s sabbatical leave at GDB. Robert Robbins of DOE and David Benton of NCHGR were instrumental in the author’s presence at GDB. This work was supported in part by National Institutes of Health grant HG00772-01A1 and a grant from the University of Connecticut Research Foundation.
References
[1] A. Auddino et al. SUPER - Visual interaction with an object-based ER Model. In Proc. of the 11th Int’l Conf. on the Entity-Relationship Approach, Karlsruhe, Germany, 1992.
[2] D. Bryce and R. Hull. SNAP: A Graphic-based Schema Manager. In Int’l Conf. on Data Engineering, Los Angeles, CA, Feb. 1986. IEEE Computer Society.
[3] DOE. Meeting Report: DOE Informatics Summit. Technical report, Germantown, MD, 1993.
[4] R. G. G. Cattell (Editor). The Object Database Standard: ODMG-93. Morgan Kaufmann Publishers, San Mateo, CA, 1994.
[5] S. Khoshafian. Modeling with Object-Oriented Databases. AI Expert, 6(10):26-33, 1991.
[6] V. Markowitz and A. Shoshani. Object queries over relational databases: Language, implementation, and applications. In Proc. of 9th Int’l Conf. on Data Engineering, Vienna, Austria, 1993.
[7] K. E. Rudd, W. Miller, C. Werner, J. Ostell, C. Tolstoshev, and S. G. Satterfield. Mapping sequenced E. coli genes by computer: Software, strategies and examples. Nucleic Acids Research, 19(3):637-647, 1991.
[8] A. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases. ACM Computing Surveys, 22(3), September 1990.
[9] D. G. Shin. Lk: A language for capturing real world meanings of the stored data. In Proc. of ACM 7th Int’l Conference on Data Engineering, Kobe, Japan, April 1991. Also submitted to IEEE Trans. on Knowledge and Data Engineering.
[10] D. G. Shin. An expectation-driven response understanding paradigm. IEEE Trans. on Knowledge and Data Engineering, 6(2), 1994.
[11] T. Suzuki et al. Development of an integrated database for genome mapping and nucleotide sequences. In Proc. of the 27th Hawaii Int’l Conf. on System Sciences, Maui, Hawaii, 1994.
[12] C. W. Thompson. Using Menu-Based Natural Language Understanding to Avoid Problems Associated with Traditional Natural Language Interfaces to Databases (Transportable, Update, NLmenu). PhD thesis, University of Texas at Austin, 1984.