In this work access support relations are introduced as a means for ... components in order to support queries involving path ...... 5 Query Processing and Update.
Access Support in Object Bases Alfons Kemper Guido Moerkotte Universitat Karlsruhe Fakultat fur Informatik 7500 Karlsruhe, W. Germany Netmail: kemper/moer @ira.uka.de
Abstract
Yet it is essential that the object-oriented systems will yield at least the same performance that relational systems achieve: otherwise their acceptance in the engineering eld is jeopardized even though they provide higher functionality than conventional DBMSs by, e.g., incorporation of type extensibility and object-speci c behavior within the model. Engineers are generally not willing to trade performance for extra functionality and expressive power. Therefore, we conjecture that the next couple of years will show an increased interest in optimization issues in the context of object-oriented DBMSs. The contribution of this paper can be seen as one important piece in the mosaic of performance enhancement methods for object-oriented database applications: the support of object access along reference chains. In relational database systems one of the most performance-critical operations is the join of two or more relations. A lot of research eort has been spent on expediting the join, e.g., access structures to support the join, the sort-merge join, and the hash-join algorithm were developed. Recently, the binary join index structure [11] was designed as another optimization method for this operation. In object-oriented database systems with object references the join based on matching attribute values plays a less predominant role. More important are object accesses along reference chains leading from one object instance to another. Some authors, e.g., [2], call this kind of object traversal functional join . This work presents an indexing technique, called access support relations, which is designed to support the functional join along arbitrary long attribute chains where the chain may even contain collection-valued attributes. The access support relations described in this paper constitute a generalization of the binary join indices originally proposed for the relational model [11], and later extended for object models [3, 12]. Rather than relating only two relations (or object types) our technique allows to support access paths of arbitrary length. Our indexing technique subsumes and extends several other previously
In this work access support relations are introduced as a means for optimizing query processing in object-oriented database systems. The general idea is to maintain redundant separate structures (disassociated from the object representation) to store object references that are frequently traversed in database queries. The proposed access support relation technique is no longer restricted to relate an object (tuple) to an atomic value (attribute value) as in conventional indexing. Rather, access support relations relate objects with each other and can span over reference chains which may contain collection-valued components in order to support queries involving path expressions. We present several alternative extensions of access support relations for a given path expression, the best of which has to be determined according to the application-speci c database usage pro le. An analytical cost model for access support relations and their application is developed. This analytical cost model is, in particular, used to determine the best access support relation extension and decomposition with respect to the speci c database con guration and application pro le.
1 Introduction Object-oriented database systems constitute a promising approach towards supporting technical application domains. Several object-oriented data models have been developed over the last couple of years. However, these systems are still not adequately optimized: for applications which involve a lot of associative search for objects on secondary memory they still have problems to keep up with the performance achieved by, e.g., relational DBMSs.
1
proposed strategies for optimizing access along attribute This allows for shared subobjects because the same chains in object bases. The index paths in GemStone [7] object may thus be associated with many database are restricted to chains that contain only single-valued atcomponents. Here, OIDs are denoted #i0 , #i1 , : : : . tributes and their representation is limited to binary partitions of the access path. Similarly, the object-oriented type constructors the most basic type constructor is the tuple constructor which aggregates dierently access techniques described for the Orion model [6, 1] are typed attributes to one object. In addition, GOM extended in several dimensions in our framework. has the two built-in collection type constructors set, Our technique diers in three major aspects from the denoted as fg, and list1 , denoted as . GOM two aforementioned approaches: also provides for subtyping of tuple-structured types; however this is irrelevant for the present discussion. access support relations allow collection-valued attributes within the attribute chain strong typing GOM is strongly typed, meaning that all database components, e.g., attributes, set elements, access support relations may be maintained in four etc, are constrained to a particular type. This, in dierent extensions. The extension determines the particular means that all path expressions are typed. amount of (reference) information that is kept in the However, the constrained type constitutes only an index structure. upper bound, the actually referenced instance may access support relations may be decomposed into arbe a subtype-instance thereof. bitrary partitions. This allows the database designer to choose the best extension and partition according object references assignment of an object to an attribute, a variable or insertion of an object into a set to the application characteristics. corresponds to maintaining a reference to the respecAlso the (separate) replication of object values as protive object. Thus, object references are stored unidiposed for the Extra object model [9] and for the PostGres rectional, conforming to almost all published object model [10, 8] are subsumed by our technique. models. The remainder of this paper is organized as follows. Section 2 introduces our Generic Object Model (GOM ), 2.1 Type De nitions which serves as the research vehicle for this work, and some simpli ed application example to highlight the re- A linear path is an attribute chain that contains only atquirements on object-oriented access support. Then, in tributes referring to a single object. Single-object-valued section 3 the access support relations are formally de- attributes are only useful to model 1 : 1, or N : 1 rela ned. In section 4 we start the development of an analyt- tionships. In order to represent 1 : M , or general N : M ical cost model for our indexing technique by estimating relations one needs to incorporate collection-valued atthe cardinalities of various representations of access sup- tributes, i.e., attributes referring to a set or list instance. port relations. Section 5 describes the utilization of ac- To illustrate this let us de ne a vastly simpli ed database cess support relations in query evaluation and estimates schema for modeling a Company composed of a set of Dithe performance enhancement on the basis of secondary vision s. Each Division Manufactures a set of Products , page accesses. Section 6 is dedicated to presenting some which themselves are composed of BasePart s. sample results of operation mix costs for a few selected The schema is outlined below: application characteristics. Section 7 concludes this pa- type Company is fDivisiong; per. type Division is [Name: STRING, Manufactures: ProdSET]; type ProdSET is fProductg; type Product is [Name: STRING, Composition: BasePartSET]; This research is based on an object-oriented model that type BasePartSET is fBasePartg; unites the most salient features of many recently proposed type BasePart is [Name: STRING, models in one coherent framework: the Generic Object Price: DECIMAL]; Model GOM . The interesting aspects of GOM concerning the access support relations are: Additionally we assume the existence of a reference to a given company. object identity each object instance has an identity that remains invariant throughout its lifetime. The var Mercedes: Company; object identi er (OID) is invisible for the database 1 Lists are not further considered in this paper, though user; it is used by the system to reference objects.
2 The Object Model GOM
Company Division
#i1
Name: \Auto" Manufactures: #i4
ProdSET Product
#i4 #i6
BasePartSET BasePart
f#i ; #i ; #i ; : : :g
#i0
Name: \560 SEC" Composition: #i7 #i7 #i8
#i2
1
2
3
Name: \Truck" Manufactures: #i5
f#i ; : : :g
#i5
6
Name: \MB Trak" #i9 Composition: NULL
f#i ; : : :g 8
#i10
Name: \Door" Price: 1205:50
f#i ; : : :g 8
\Space" #i3 Name: Manufactures: NULL
f#i ; #i : : :g 6
#i11
9
Name: \Sausage" Composition: #i13
#i13 f#i14 ; : : :g
\Pepper" #i14 Name: Price: 0:12
Figure 1: Database Extension With Non-Linear Paths A sample extension of this schema is presented in Figure 1. Note that an object is represented as a triple (#ij ; v; t) where #ij is the object identi er, v the object representation, and t the object's type. References, e.g., #i1 :Manufactures, are maintained unidirectionally by storing the associated object's identi er, #i4 , within the domain object (#i1 ). Now let us illustrate some typical queries in an SQLlike syntax which access objects along references (possibly leading through sets).
3 Access Support Relations As mentioned earlier access paths are used to support query evaluation. More precisely, access paths allow the fast selection of those members of an object collection which ful ll a given selection criteria based on object references along an attribute chain or path expression. A path expression or attribute chain is de ned as follows:
De nition 3.1 Let t0; : : : ; tn be (not necessarily distinct) types. A path expression on t0 is an expression t0 :A1 : :An i for each 1 i n one of the following conditions holds: Query 1: Which Division uses a BasePart named \Door"? Type ti?1 is de ned as type ti?1 is [: : : ; Ai : ti ; : : :], i.e., a tuple type containing at least the attribute Ai select d:Name of type ti . from d in Mercedes, b in d:Manufactures:Composition Type ti?1 is de ned as type ti?1 is [: : : ; Ai : t0i ; : : :] where b:Name = \Door" and the type t0i is de ned as type t0i is fti g. In this case we speak of a set occurrence at Ai in the path \d.Manufactures.Composition" is a set-valued path ext0 :A1 : :An . pression with the following semantics: [ The type ti?1 is called the domain of Ai , and ti is called d:Manufactures :Composition := m:Composition the range of Ai . m2d:Manufactures The second part of the de nition is useful to support Query 2: Retrieve the Name of all the BaseParts used access paths through sets . If it does not apply to a given by the Division named \Auto". 2
select d:Manufactures:Composition:Name from d in Mercedes where d:Name = \Auto"
path the path is called linear . For simplicity we require each path expression to originate in some type t0 ; alternatively we could have chosen 2 Note, however, that we do not permit powersets.
a particular collection C of elements of type t0 as the anchor of a path (leading to more dicult de nitions and cost functions, though). Since an access path can be seen as a relation we will use relation extensions to represent access paths. The next de nition maps a given path expression to the underlying access support relation declaration.
De nition 3.2 Let t0; : : : ; tn be types, t0:A1 : :An be a path expression, and k the number of set occurrences in t0 :A1 : :An . Then the access support relation Et0 :A1 ::A is of arity n + k + 1 and has the following form: Et0 :A1 ::A : [S0 ; : : : ; Sn+k ] The domain of the attribute S0 is the set of identi ers (OIDs) of objects of type t0 . For (1 i n) let k(i) be the number of set occurrences before Ai , i.e., set occurrences at Aj for j < i. Then the domain of the attribute Si+k(i) is the set of OIDs that identify objects of type n
Example: Recall the Company database extension of Figure 1. For the underlying schema we could declare the access support relation on the path expression Division :Manufactures :Composition :Name . This results in 3 temporary relations E0 , E1 , and E2 . OIDDivision OIDProdSET OIDProduct #i2 #i5 #i9 #i1 #i4 #i6 ::: ::: :::
E0
OIDProduct OIDBasePartSET OIDBasePart #i11 #i13 #i14 #i6 #i7 #i8 ::: ::: :::
E1
n
E2
OIDBasePart V ALUEName #i14 \Pepper" #i8 \Door" ::: :::
ti , if Ai is a single-valued attribute. Let us now introduce dierent possible extensions of a given access support relation E . For a given path expres t0i , if Ai is a set-valued attribute. In this case the sion t :A : :An we distinguish four extensions: domain of Si+k(i)+1 is the set of OIDs of type ti .
If tn is an atomic type then the domain of Sn+k is tn , i.e., values are directly stored in the access support relation. If the underlying path expression is clear from context we will write E instead of Et0 :A1 ::A . n
Let further m be de ned as m := n + k. We distinguish several possibilities for the extension of such relations. To de ne them for a given path expression t0 :A1 : :An we need n temporary relations E0 ; : : : ; En?1 .
De nition 3.3 For each Aj (1 j n) we construct the temporary relation Ej?1 . Depending on the domain of Aj the relation Ej?1 is:
0
1
1. the canonical extension, denoted Ecan contains only information about complete paths, i.e., paths originating in t0 and leading to tn . Therefore, it can only be used to evaluate queries that originate in an object of type t0 and \go all the way" to tn . 2. the left-complete extension Eleft contains all paths originating in t0 but not necessarily leading to tn , but possibly ending in a NULL. 3. the right-complete extension Eright , analogously, contains paths leading to tn , but possibly originating in some object oj of type tj which is not referenced by any object of type tj?1 via the Aj attribute.
4. nally, the full extension Efull contains all partial 1. binary, if Aj is a single-valued attribute. In paths, even if they do not originate in t0 or do end this case the relation Ej?1 contains the tuples in a NULL. (id(oj?1 ); id(oj )) for every object oj?1 of type tj?1 and oj of type tj such that oj?1 :Aj = oj . If tj is De nition 3.4 (Extensions) Let 1 ( 1 ; 1; 1 ) dean atomic type then id(oj ) corresponds to the value note the natural (outer, left outer, right outer) join on oj?1 :Aj . the last column of the rst relation and the rst column of the second relation. Then the dierent extensions are 2. ternary, if the attribute Aj is a set-valued at- obtained as follows: tribute. Then the relation Ej?1 contains the tuples (id(oj?1 ); id(o0j ); id(oj )) for every object oj?1 of Ecan := E0 1 1 En?1 type tj?1 , o0j of type t0j , and oj of type tj such that Efull := E0 1 1 En?1 oj?1 :Aj = o0j and the set o0j contains oj . In the spe0 Eleft := ( (E0 1 E1 ) 1 1 En?1 ) cial case that oj is an empty set the relation Ej?1 contains the tuple (id(oj?1 ); id(o0j ); NULL). Eright := (E0 1 ( 1 (En?2 1 En?1 ) )
Efull
OIDDivision OIDProdSET OIDProduct OIDBasePartSET OIDBasePart V ALUEName #i2 #i5 #i9 NULL NULL NULL NULL NULL #i11 #i13 #i14 \Pepper" #i1 #i4 #i6 #i7 #i8 \Door" ::: ::: ::: ::: ::: ::: Figure 2: A Sample Extension of Efull
Example: For our example application the full extension contains also the incomplete paths, i.e., those that 3.1 Sharing of Access Support Relations lead to a NULL (e.g., the rst tuple in the extension shown in Figure 2) or those not originating in an object Consider the following two path expressions: o0 of type t0 (the second tuple in Efull shown in Figure 2). t+ Even partial paths not originating in t0 and leading to a }| { z t NULL are to be included. The extension Ecan would only { contain the last tuple shown in Efull . Eright would not P1 tz0 :A1}| : :Ai :Ai+1 : :Ai+j :Ai+j+1 : :An contain the rst tuple shown in Figure 2, whereas in Eleft P2 s|0 :B1{z : :B}l :Ai+1 : :Ai+j :C1 : :Cq the second tuple would be omitted. t Aside from dierent extensions of the access support | {z } t+ relation also several decompositions are possible, which are discussed now. Since not all of them are meaningful we de ne a decomposition as follows (Remember: m = If t0 :A1 : :Ai and s0 :B1 : :Bl are path expressions both leading to objects of type ti then part of the access n + k.) support predicates may be shared. De nition 3.5 (Decomposition) Let E be an (m +1)- This, in general, is only possible when a full extension ary access support relation with attributes S0 ; : : : ; Sm . of at least one of the access support relations is mainThen the relations tained. Let Efull be the full extension for the path P1 , and Efull the full extension of the access support relation 0;i1 for 0 < i1 m E : [S0 ; : : : ; Si1 ] for path P2 . Then the decomposition (0; i; i + j; n) of Efull i ;i for i1 < i2 m E 1 2 : [Si1 ; : : : ; Si2 ] and (0; l; l + j; r)3 of Efull share a common partition, i.e., i;i+j = E l;l+j . Efull E i ;m : [Si ; : : : ; Sm ] for ik < m full Thus we obtain the following ve partitions: are called a decomposition of E . The individual relations 0;l 0;i E i ;i +1 , called partitions, are materialized by projecting Efull : [OID s0 ; : : : ; OID t ] : [OID t0 ; : : : ; OID t ] Efull the corresponding attributes of E . If every partition is a binary relation the decomposition is called binary. The E i;i+j = E l;l+j : [OID ; : : : ; OID ] t+ t full full above decomposition is denoted by (0; i1 ; i2; : : : ; ik ; m). i+j;n :[OID ; : : : ;OID ] E l+j;r : [OID ; : : : ; OID ] t+ s t t+ Note that m and n are equal only in the case that there Efull full is no set occurrence along the path. If there is any then m > n. Under the assumption that there is no set sharing, The ve partitions may then, individually, be further dethe set identi ers may be dropped from the access support composed. relation. This results in m = n. To simplify the analysis In general, this sharing is only possible for full extenwe will do so for the examples considered in the next sions. Exceptions are: section. Note, however, that the analytical cost model if both paths P1 and P2 originate in t0 , i.e., i = l = 0 captures the general case if one reads n as m. and t0 = s0 . Then the sharing is also possible for The last question discussed in this section concerns the left-complete extensions. usefulness of the above de ned decompositions. i j
i
i
i j
k
j
k
j
i
i
i j
i
i j
Theorem 3.6 Every decomposition of an access support
relation is lossless. The proof of this theorem is obvious since we decompose along multi-valued dependencies.
n
l j
r
if both paths lead to tn , i.e., their right-most part is identical, then the corresponding partition of the right-complete extensions may be shared.
3 the length of path P2 is r = l + j + q.
This should indicate that there may exist a higher level The probability PA that an object o i of type t i has a of organization, i.e., an access support relations manager de ned Ai+1 attribute value is which controls (and constrains) the possible extensions and decompositions. PA = di i
ci
i
4 Analytical Cost Model: Cardinality of Access Relations
The probability PH that a particular object oi of type t i is \hit" by a reference emanating from some object of type ti?1 is: i
In this section we develop the basis of our analytical cost PH = eci model: a model of the application pro le and formulas i for the cardinalities of access support relations under different extensions and decompositions. Later on, the cost Let us now derive the probability that, for some object model is used to derive the best physical database de- oi of type ti none of the fan i references of the attribute sign, i.e., to nd the best extension and decomposition of oi :Ai+1 hits a particular object oi+1 2 ti+1 , which belongs a given path expression according to the predetermined to the ei+1 referenced objects. This value is deduced by using the number of fan i operation mix. element subsets of the ei+1 objects of type ti+1 . This number is given as the binomial coecient 4.1 Preliminaries i
Before giving the sizes of the relations we introduce some parameters that model the characteristics of an application. These are listed in Figure 3. parameter n ci di
fan
i
shar size
i
i
PageSize OIDsize PPsize + Bfan
application-speci c parameters
semantics (and derivation/default) length of access path total number of objects of type ti the number of objects of type ti for which the attribute Ai+1 is not NULL the number of references emanating on the average from the attribute Ai+1 of an object oi of type ti the average number of objects of type ti that reference the same object in ti+1 . If no value for shar i is determined by the user, it is derived as shar i = min(1; (di fan i )=ci+1 ) average size of objects of type ti
system-speci c parameters
net size of pages, which is set to 4056. size of object identi ers, default is 8. size of page pointer, default is 4 fan out of the B + tree, which is derived as: bPageSize =(PPsize + OIDsize )c
Figure 3: System and Application Parameters
ei+1 ! ei+1 = fan i fan i !(ei+1 ? fan i )!
Then, the probability that the particular object oi+1 is not hit is given as: ?e
+1 ?1
= ei+1e ? fan i = 1 ? efan i
i
fan ?e i i+1 fan i
i+1
i+1
The probability that oi+1 is not hit by any of the references emanating from a subset fo1i ; o2i ; : : : ; oki g of objects of type ti , all of whose Ai attributes are de ned, is:
k 1 ? efan i
i+1
For 0 i < j n we now de ne RefBy (i; j; k), which denotes the number of objects in tj which lie on at least one (partial) path emanating from a k -element subset of ti : 8 > > > >
ei+1
fan j?1 E (i;j;k)
ej 1 ? 1 ? e j
> > > :
j =i+1 !
else
where the exponent E (i; j; k)= RefBy (i; j ? 1; k) PA ?1 . Further the probability, denoted PRefBy (i; j ), that a 4.1.1 Some Derived Quantities path between anyone object in ti and a particular object The number of objects in ti which are referenced by at oj in tj exists for 0 i < j n, is derived as: least one object in ti?1 is denoted as ei : 8 ' & i=j < 1 d i ? 1 fan i?1 RefBy ( i; j; d ) P ( i; j ) = i RefBy ei = shar else : i?1 cj j
Let Ref (i; j; k) denote the number of objects of type ti which have a path leading to some element of a k -element subset of objects of type tj for 0 i < j n. This value can be approximated as: 8 > > > >
j =i+1
E 0 (i;j;k) ! shar i di 1 ? 1 ? else
> > > :
i;j = #Efull i;j = #Eleft
di
i;j #Eright =
j ?k j ?i X X k=1 l=i j ?i X k=1 j ?i X k=1
Plb (max(i; l ? 1); l) path (l; l + k)
Prb (l + k; min(j; l + k + 1))
PRefBy (0; i) path (i; i + k)
Prb (i + k; min(j; i + k + 1)) Plb (max(i; j ? k ? 1); j ? k)
where the exponent E 0 (i; j; k) = Ref (i + 1; j; k) PH +1 . path(j ? k; j ) Pref (j; n) Let PRef (i; j ) be the probability that a given object in ti has at least one path leading to any one object in tj . 4.3 Storage Representation of Access Then Support Relations 8 < 1 i=j Following the proposal by Valduriez [11] for join indices PRef (i; j ) := : Ref (i; j; cj ) else an access support relation (partition) EXi;j is stored in two ci redundant B + trees, one being keyed (clustered) on the i.e., OIDs of objects of type ti , and the The number of paths between the objects in ti and the rst attribute, second B + tree being clustered on the last attribute, i.e., objects in tj can be estimated by OIDs of tj objects. In this way we can achieve a fast lookjY ?1 up of all tuples (partial paths) originating in some object oi of type ti and all (partial) paths leading to some object path(i; j ) = di fan i (PA fan l ) oj of type tj . Particularly, in this way the semi-join of l=i+1 access support relation partitions is eciently performed 4.2 Cardinalities of Access Support Re- in both directions. The right-to-left semi-join, e.g., i
l
lations
0
0
( j (EXi ;i j (EXi;j j EXj;j ) j ) )
We can now deduce closed formulas for the number of is performed for evaluating a backward query, the left-totuples in the access support relations. right semi-join to evaluate a forward query (cf. section 5). Let us rst introduce two more probabilistic values. Let Plb (i; j ) denote4 the probability that a particular object of type tj is not \hit" by any path emanating from some 4.4 Storage Costs for Access Support Reobject in ti for 0 i < j n: lations The size of a tuple in the access support relation EXi;j in