Nested Data Cubes for OLAP

Stijn Dekeyser, Bart Kuijpers, Jan Paredaens
Universiteit Antwerpen (UIA)

Jef Wijsen
Vrije Universiteit Brussel (VUB)

Abstract

We present a new model for OLAP, called the nested data cube (NDC) model. Nested data cubes are a generalization of other OLAP models such as f-tables [3] and hypercubes [2], but also of classical structures such as sets, bags, and relations. The model we propose adds to the previous models mainly flexibility in viewing the data, in that it allows for the assignment of priorities to the different dimensions of the multidimensional OLAP data. We also present an algebra in which all typical OLAP analysis and navigation operations can be formulated. In our model, aggregation is more explicitly supported by providing a way to store information after grouping but before aggregation. We present a number of algebraic operators that work on nested data cubes and that preserve the functional dependency between the dimensional coordinates of the data cube and the factual data in it. These operations include nesting, unnesting, summary, roll-up, and aggregation operations. We show how these operations can be applied to sub-NDC's at any depth, and prove that the operations that work on the highest level of nesting are sufficient. Furthermore, we show that the NDC algebra can express the SPJR algebra [1]. A major motivation for defining an algebra rather than a calculus is that an algebra naturally leads to an implementation strategy. Importantly, we show that the NDC algebra primitives can be implemented by linear time algorithms.

Contact author.
Universiteit Antwerpen (UIA), Informatica, Universiteitsplein 1, B-2610 Antwerpen, Belgium. Email: dekeyser, kuijpers, [email protected].
Vrije Universiteit Brussel (VUB), Informatica, Pleinlaan 2, B-1050 Brussel, Belgium. Email: [email protected]

1 Introduction

Since the seminal paper of Codd, Codd, and Salley [5] of 1993, on-line analytical processing (OLAP) has been recognized as a promising approach for the analysis and navigation of data warehouses and multidimensional data [4, 6, 9, 10, 12, 21]. Multidimensional databases typically are large collections of enterprise data which are arranged according to different dimensions (e.g., time, measures, products, geographical regions) to facilitate sophisticated analysis and navigation for decision support in, for instance, marketing. Figure 1 depicts a three-dimensional database, containing sales information of stores. A popular way of representing such information is the "data cube" [7, 13, 21]. Each dimension is assigned to an axis in the three-dimensional space, and the numeric values are placed in the corresponding cells of the cube.

Day : day   Item : item   Store : store
Jan1        Lego          Navona         →  32
Jan1        Lego          Colosseum      →  24
Jan1        Scrabble      Navona         →  13
Jan1        Scrabble      Colosseum      →  14
Jan1        Scrabble      Kinderdroom    →  22
Jan2        Lego          Kinderdroom    →   2
Jan2        Lego          Colosseum      →  21

Figure 1: A multidimensional database.

The effectiveness of analysis and the ease of navigation are mainly determined by the flexibility that the system offers to rearrange the perspectives on different dimensions and by its ability to efficiently calculate and store summary information. A sales manager might want to look at the sales per store, or might want to have the total number of items sold in all stores in a particular month. OLAP systems aim to offer these qualities.

During the past few years, many commercial systems have appeared that satisfactorily offer, through ever more efficient implementations, a wide range of analysis capabilities. A few well-known examples are Arbor Software's Essbase [8], IBM's Intelligent Server [14], Red Brick's Red Brick Warehouse [18], Pilot Software's Pilot Decision Support Suite [17], and Oracle's Sales Analyzer [19]. Some of these implementations are founded on theoretical results on efficient array manipulation [15, 16, 20].

In more recent years, however, the need has become apparent for a formal model for OLAP that explicitly incorporates the notion of different and independent views on dimensions and also offers a logical way to compute summary information. Lately, a number of starting points for such models have been proposed. Gyssens et al. [11] have proposed the first theoretical foundation for OLAP systems: the tabular database model. They give a complete algebraic language for querying and restructuring two-dimensional tables. Agrawal et al. [2] have introduced a hypercube-based data model with a number of operations that can be easily inserted into SQL. Cabibbo and Torlone [3] have recently proposed a data model that forms a logical counterpart of multidimensional arrays. Their model is based on dimensions and f-tables. Dimensions are partially ordered categories that correspond to different ways of looking at the multidimensional information (see also [7]). F-tables are structures to store factual data functionally dependent on the dimensions (such as the one depicted in figure 1).
They also propose a calculus to query f-tables that supports multidimensional data analysis and aggregation.

In this paper, we generalize the notions of f-tables and hypercube by introducing the nested data cube model. Our data model supports a variety of nested versions of f-tables, each of which corresponds to an assignment of priorities to the different dimensions. The answer to the query "Give an overview of the stores and their areas together with the global sales of the various items," on the data cube of figure 1, for instance, is depicted in figure 2 by a nested data cube. This data cube has two dimensions, Store and Area, at the outermost level of nesting, and one dimension, Item, at a deeper level. This table gives a view of the sales of figure 1 in such a way that the sales per store and per item are clearly visualized.

Store : store   Area : area
Navona          Italy         →   Item : item
                                  Lego        →  {|32|}
                                  Scrabble    →  {|13|}
Colosseum       Italy         →   Item : item
                                  Lego        →  {|24, 21|}
                                  Scrabble    →  {|14|}
Kinderdroom     Belgium       →   Item : item
                                  Lego        →  {|2|}
                                  Scrabble    →  {|22|}

Figure 2: A three-dimensional data cube containing sales information.

We also present an algebra to formulate queries, such as the one above, on nested data cubes. This query language supports all the important OLAP analysis and navigational restructuring operations on data cubes. There are a number of operations whose aim lies purely in rearranging the information in the cube in order to create different and independent views on the data. These include, for instance, nesting and unnesting operations, factual data collection in bags, duplication, extending (of the cube with additional dimensions), and renaming. The algebra also contains selection, aggregation, and roll-up operations that are mainly directed towards the analysis of the data.

For instance, the nested cube of figure 2 was obtained from the cube of figure 1 by the following series of operations: first, nesting on Store is performed, resulting in a cube with Store as the only dimension at the highest level, and Item and Day on a deeper level; then, on the inner level, nesting is used on Item, yielding a nested data cube with depth three and one dimension in each level of nesting; next, on the deepest level, the information per Item is collected, resulting in a decrease of the depth by one, the removal of information about Day, and the creation of bags of numbers in the functional part of the data cube at depth two; finally, the nested data cube is extended by one dimension in which the roll-up of the Store to Area information is stored. This results in a nested data cube in which the dimensions Store and Area are given higher importance than Item, and in which Day has disappeared. If one is interested in the total number of sales per Store and per Item, then an aggregate function sum can be applied to the individual bags in the cube. The ability to collect factual data in bags allows us to stay within our model while computing aggregate functions.
Once the collection of data in bags is done, a wide variety of aggregate functions can be performed on them by reusing the constructed bags. Another feature of the model is that grouping on values of an attribute is made explicit.

The paper is organized as follows. Section 2 introduces nested data cubes (NDC's). Section 3 defines the operators of the NDC algebra. Section 4 contains two results concerning the expressive power of the NDC algebra. Section 5 shows how our algebra can be implemented efficiently. Section 6 briefly summarizes the most important results. The appendix contains the proofs of the theorems.

2 Nested Data Cubes

In this section, we first define the construct of nested data cube (NDC). We then illustrate the construct by three examples. Finally, we show that the set of NDC's over a given scheme is recursively enumerable. Algebraic operators that work on NDC's are defined in the next section.

Definition 1 We use the delimiters $\{|\;|\}$ to denote a bag. We assume the existence of a set $\mathcal{A}$ of attributes and a set $\mathcal{L}$ of levels. Every level $l$ has associated with it a recursively enumerable set $\mathrm{dom}(l)$ of atomic values. $\lambda$ is a level with $\mathrm{dom}(\lambda) = \{\top\}$. We define $\mathbf{dom} = \bigcup \{\mathrm{dom}(l) \mid l \in \mathcal{L}\}$. For certain pairs $(l_1, l_2)$ of $\mathcal{L} \times \mathcal{L}$, there exists a roll-up function, denoted $\text{R-UP}^{l_2}_{l_1}$, that maps every element of $l_1$ to an element of $l_2$. Further requirements can be imposed on the nature of roll-up functions, as is done in [3].

Let $f$ be a total function with domain $S$. Let $S' \subseteq S$. We write $f[S']$ for the total function with domain $S'$ obtained by restricting the total function $f$ to $S'$.

A coordinate type is a set $\{A_1 : l_1, \ldots, A_n : l_n\}$ where $n \geq 0$, $A_1, \ldots, A_n$ are distinct attributes, and $l_1, \ldots, l_n$ are levels. A coordinate over the coordinate type $\{A_1 : l_1, \ldots, A_n : l_n\}$ is a set $\{A_1 : v_1, \ldots, A_n : v_n\}$ where $v_i \in \mathrm{dom}(l_i)$ for each $i \in [1, n]$. If $\delta$ is a coordinate type, then $\mathrm{dom}(\delta)$ is defined as the set of all coordinates over $\delta$.

Remark: A coordinate type is a function from $\{A_1, \ldots, A_n\}$ to $\mathcal{L}$. A coordinate is a function from $\{A_1, \ldots, A_n\}$ to $\mathbf{dom}$. If $\gamma$ is a coordinate type or a coordinate, we write $\mathrm{att}(\gamma)$ for the attributes appearing in $\gamma$. That is, $\mathrm{att}(\{A_1 : l_1, \ldots, A_n : l_n\}) = \{A_1, \ldots, A_n\}$.

A nested data cube scheme (NDC scheme, or simply scheme) is a statement $\tau$ conforming to the following syntax:

$$\tau = [\delta \to \tau] \mid \beta \qquad (1)$$
$$\beta = l \mid \{|\beta|\} \qquad (2)$$

where $\delta$ is a coordinate type, and $l$ is a level. Throughout this paper, the Greek letters $\delta$, $\tau$, and $\beta$ consistently refer to the above syntax. The function $\mathrm{dom}(\cdot)$ on levels and coordinate types is extended to schemes as follows:

$$\mathrm{dom}([\delta \to \tau]) = \{\{v_1 \to w_1, \ldots, v_m \to w_m\} \mid m \geq 0, \text{ and } v_1, \ldots, v_m \text{ are pairwise distinct coordinates of } \mathrm{dom}(\delta), \text{ and } w_i \in \mathrm{dom}(\tau) \text{ for each } i \in [1, m]\}$$
$$\mathrm{dom}(\{|\beta|\}) = \{\{|v_1, \ldots, v_m|\} \mid m \geq 1, \text{ and } v_i \in \mathrm{dom}(\beta) \text{ for each } i \in [1, m]\}$$

A nested data cube (NDC) over the scheme $\tau$ is an element of $\mathrm{dom}(\tau)$. The depth of a scheme $\tau$, denoted $\mathrm{depth}(\tau)$, is defined recursively as follows:

$$\mathrm{depth}([\delta \to \tau]) = 1 + \mathrm{depth}(\tau)$$
$$\mathrm{depth}(\beta) = 0$$

Let $\tau$ be a scheme and $n \in \mathbb{N}$ with $1 \leq n \leq \mathrm{depth}(\tau)$. The subscheme of $\tau$ at depth $n$, denoted $\mathrm{subscheme}(\tau, n)$, is recursively defined as follows:

$$\mathrm{subscheme}([\delta \to \tau], n) = \mathrm{subscheme}(\tau, n - 1) \quad \text{if } n \geq 2$$
$$\mathrm{subscheme}(\tau, 1) = \tau$$

The notion of depth is extended to NDC's in an obvious way. If $C$ is an NDC over the scheme $\tau$, then we say that the depth of $C$ is $\mathrm{depth}(\tau)$. □

We now illustrate the definition by some examples.

Example 1 Assume store, area, and item are levels. Figure 2 shows an NDC over the scheme

$$\tau = [\{Store : store, Area : area\} \to [\{Item : item\} \to \{|numeric|\}]].$$

We have $\mathrm{depth}(\tau) = 2$ and $\mathrm{subscheme}(\tau, 2) = [\{Item : item\} \to \{|numeric|\}]$. We assume it is intuitively clear how an NDC is represented in a tabular format. □
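The recursive definitions of depth and subscheme can be traced on the scheme of Example 1. The following Java sketch is an illustration only: the class names (`NdcSchemes`, `Scheme`, `Cube`, `Ground`) and the textual rendering of schemes are hypothetical choices and are not part of the model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of NDC schemes; all names here are illustrative assumptions. */
final class NdcSchemes {
    /** A scheme is either a cube scheme [delta -> tau] or a ground scheme beta. */
    abstract static class Scheme {
        /** depth([delta -> tau]) = 1 + depth(tau); depth(beta) = 0. */
        abstract int depth();

        /** subscheme(tau, 1) = tau; subscheme([delta -> tau], n) = subscheme(tau, n - 1) if n >= 2. */
        Scheme subscheme(int n) {
            if (n < 1 || n > depth()) {
                throw new IllegalArgumentException("need 1 <= n <= depth");
            }
            Scheme current = this;
            for (int i = n; i >= 2; i--) {
                current = ((Cube) current).inner; // safe because n <= depth
            }
            return current;
        }
    }

    /** Ground scheme: a level or a bag scheme, kept as plain text in this sketch. */
    static final class Ground extends Scheme {
        final String text;
        Ground(String text) { this.text = text; }
        @Override int depth() { return 0; }
        @Override public String toString() { return text; }
    }

    /** Cube scheme [delta -> tau], with delta stored as attribute -> level. */
    static final class Cube extends Scheme {
        final Map<String, String> coordinateType;
        final Scheme inner;
        Cube(Map<String, String> coordinateType, Scheme inner) {
            this.coordinateType = coordinateType;
            this.inner = inner;
        }
        @Override int depth() { return 1 + inner.depth(); }
        @Override public String toString() {
            StringBuilder sb = new StringBuilder("[{");
            String separator = "";
            for (Map.Entry<String, String> e : coordinateType.entrySet()) {
                sb.append(separator).append(e.getKey()).append(" : ").append(e.getValue());
                separator = ", ";
            }
            return sb.append("} -> ").append(inner).append("]").toString();
        }
    }

    /** The scheme of Example 1 (figure 2). */
    static Scheme figure2Scheme() {
        Map<String, String> outer = new LinkedHashMap<>();
        outer.put("Store", "store");
        outer.put("Area", "area");
        Map<String, String> item = new LinkedHashMap<>();
        item.put("Item", "item");
        return new Cube(outer, new Cube(item, new Ground("{|numeric|}")));
    }
}
```

Calling `NdcSchemes.figure2Scheme().depth()` yields 2, and `subscheme(2)` returns the inner scheme over Item, in line with Example 1.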

Example 2 Note that $[\{\} \to numeric]$ is a legal scheme. Its depth is equal to 1. All NDC's over this scheme can be listed as follows (assume $\mathrm{dom}(numeric) = \{1, 2, \ldots\}$): $\{\}$ (the empty NDC), $\{\{\} \to 1\}$, $\{\{\} \to 2\}$, and so on. Importantly, $numeric$ itself is also a legal scheme. Its depth is equal to 0. The NDC's over $numeric$ (as a scheme) are $1, 2, \ldots$ □

Example 3 NDC's can represent several common data structures, as follows.

Bag: An NDC over a scheme of the form $\{|\beta|\}$.
Set: An NDC over a scheme of the form $[\{A : l\} \to \lambda]$.
Relation: An NDC over a scheme of the form $[\delta \to \lambda]$. The NDC called $C$ in figure 3 represents a conventional relation.
F-tables [3]: An NDC over a scheme of the form $[\delta \to l]$. □

Example 2 shows that the NDC's over the scheme $[\{\} \to numeric]$ are recursively enumerable. The following theorem generalizes this result for arbitrary schemes.

Theorem 1 The set of all NDC's over a given scheme $\tau$ is recursively enumerable.

3 Operators

The NDC algebra consists of the following operators:

bagify: This operator decreases the depth of an NDC by replacing each innermost sub-NDC by a bag containing the right-hand values appearing in the sub-NDC.
extend: This operator adds an attribute to an NDC. The attribute values of the newly added attribute are computed from the coordinates in the original NDC.
nest and unnest: These operators capture the classical meaning of nesting and unnesting.
duplicate: This operator takes an NDC and replaces the right-hand values by the attribute values of some specified attribute.
select and rename: These operators correspond to operators with the same name in the conventional relational algebra.
aggregate: This operator replaces each right-hand value $w$ in an NDC by a new value obtained by applying a specified function to $w$.

Several operators are illustrated in figure 3.

C
Item : item   Day : day   Amount : numeric
Lego          Jan 1       68                 →  ⊤
Scrabble      Jan 1       45                 →  ⊤
Lego          Jan 2       70                 →  ⊤
Scrabble      Jan 2       47                 →  ⊤

C1 = duplicate(C, Amount)
Item : item   Day : day   Amount : numeric
Lego          Jan 1       68                 →  68
Scrabble      Jan 1       45                 →  45
Lego          Jan 2       70                 →  70
Scrabble      Jan 2       47                 →  47

C2 = nest(C1, Day)
Day : day
Jan 1   →   Item : item   Amount : numeric
            Lego          68                 →  68
            Scrabble      45                 →  45
Jan 2   →   Item : item   Amount : numeric
            Lego          70                 →  70
            Scrabble      47                 →  47

C3 = bagify(C2)
Day : day
Jan 1   →  {|68, 45|}
Jan 2   →  {|70, 47|}

C4 = aggregate(C3, sum)
Day : day
Jan 1   →  113
Jan 2   →  117

C5 = extend(C4, Month, month, R-UP^month_day)
Day : day   Month : month
Jan 1       Jan             →  113
Jan 2       Jan             →  117

C6 = nest(C5, Month)
Month : month
Jan   →   Day : day
          Jan 1   →  113
          Jan 2   →  117

Figure 3: Illustration of the NDC algebra.

3.1 The operator bagify

The bagify($\cdot$) operator takes as its argument an NDC of depth $n \geq 1$, and renders a new NDC with depth $n - 1$. Intuitively, this operator is related to the projection of the relational algebra, as it removes attributes.

Definition 2 Let $C$ be an NDC over the scheme $[\delta \to \tau]$. $\mathrm{bagify}([\delta \to \tau])$ is recursively defined on schemes as follows:

$$\mathrm{bagify}([\delta_1 \to [\delta_2 \to \tau]]) = [\delta_1 \to \mathrm{bagify}([\delta_2 \to \tau])]$$
$$\mathrm{bagify}([\delta \to \beta]) = \{|\beta|\}$$

$\mathrm{bagify}(C)$ is recursively defined on NDC's as follows.

- If $C = \{v_1 \to w_1, \ldots, v_m \to w_m\}$ is an NDC over $[\delta_1 \to [\delta_2 \to \tau]]$, then $\mathrm{bagify}(C) = \{v_1 \to \mathrm{bagify}(w_1), \ldots, v_m \to \mathrm{bagify}(w_m)\}$.
- If $C = \{v_1 \to w_1, \ldots, v_m \to w_m\}$ is an NDC over $[\delta \to \beta]$, then $\mathrm{bagify}(C) = \{|w_1, \ldots, w_m|\}$. □

Example 4 Let $C$ be the NDC shown in figure 4 top. Then $\mathrm{bagify}(C)$ is shown in figure 4 bottom. Note that $\mathrm{bagify}(\mathrm{bagify}(C))$ is equal to $\{|\{|32, 24, 12|\}, \{|17, 6, 22|\}|\}$, a bag of bags. □

Item : item
Lego       →   Day : day   Store : store
               Jan 1       Navona          →  32
               Jan 1       Colosseum       →  24
               Jan 1       Kinderdroom     →  12
Scrabble   →   Day : day   Store : store
               Jan 1       Navona          →  17
               Jan 1       Colosseum       →   6
               Jan 1       Kinderdroom     →  22

Item : item
Lego       →  {|32, 24, 12|}
Scrabble   →  {|17, 6, 22|}

Figure 4: Illustration of bagify operator.
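Definition 2 and Example 4 can be traced with a small Java sketch. It is an illustration under stated assumptions, not the implementation discussed in Section 5: an NDC of depth at least 1 is modelled as a `Map` from coordinates to right-hand values, a bag as a `List`, and the names `BagifySketch`, `coord`, and `figure4Top` are hypothetical. An empty NDC is treated as innermost, which is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of bagify on NDC's modelled as nested maps. */
final class BagifySketch {
    /** Builds a coordinate from alternating attribute/value arguments. */
    static Map<String, String> coord(String... attributeValuePairs) {
        Map<String, String> v = new LinkedHashMap<>();
        for (int i = 0; i < attributeValuePairs.length; i += 2) {
            v.put(attributeValuePairs[i], attributeValuePairs[i + 1]);
        }
        return v;
    }

    /** bagify(C): recurse to the innermost sub-NDC's and replace each by the bag of its right-hand values. */
    static Object bagify(Object ndc) {
        if (!(ndc instanceof Map)) {
            throw new IllegalArgumentException("bagify needs an NDC of depth >= 1");
        }
        Map<?, ?> cube = (Map<?, ?>) ndc;
        boolean nested = !cube.isEmpty() && cube.values().iterator().next() instanceof Map;
        if (nested) {
            // Case [delta1 -> [delta2 -> tau]]: keep the coordinates, bagify each sub-NDC.
            Map<Object, Object> result = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : cube.entrySet()) {
                result.put(e.getKey(), bagify(e.getValue()));
            }
            return result;
        }
        // Case [delta -> beta]: drop the coordinates, collect the right-hand values in a bag.
        List<Object> bag = new ArrayList<>(cube.values());
        return bag;
    }

    /** The NDC shown in figure 4 top. */
    static Map<Map<String, String>, Object> figure4Top() {
        Map<Map<String, String>, Object> lego = new LinkedHashMap<>();
        lego.put(coord("Day", "Jan 1", "Store", "Navona"), 32);
        lego.put(coord("Day", "Jan 1", "Store", "Colosseum"), 24);
        lego.put(coord("Day", "Jan 1", "Store", "Kinderdroom"), 12);
        Map<Map<String, String>, Object> scrabble = new LinkedHashMap<>();
        scrabble.put(coord("Day", "Jan 1", "Store", "Navona"), 17);
        scrabble.put(coord("Day", "Jan 1", "Store", "Colosseum"), 6);
        scrabble.put(coord("Day", "Jan 1", "Store", "Kinderdroom"), 22);
        Map<Map<String, String>, Object> c = new LinkedHashMap<>();
        c.put(coord("Item", "Lego"), lego);
        c.put(coord("Item", "Scrabble"), scrabble);
        return c;
    }
}
```

Applying `bagify` once to `figure4Top()` gives the NDC of figure 4 bottom; applying it a second time gives the bag of bags mentioned in Example 4.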

3.2 The operator extend

Definition 3 Let $C$ be an NDC over the scheme $[\delta \to \tau]$. Let $A$ be an attribute such that $A \notin \mathrm{att}(\delta)$. Let $l$ be a level. Let $R$ be a subset of $\mathrm{dom}(\delta) \times \mathrm{dom}(l)$.
The definition of extend on scheme level is as follows: $\mathrm{extend}([\delta \to \tau], A, l, R)$ is equal to $[\delta \cup \{A : l\} \to \tau]$.
The definition of extend on NDC level is as follows: $\mathrm{extend}(C, A, l, R)$ is defined as the smallest NDC containing $v \cup \{A : c\} \to w$ whenever (1) $C$ contains $v \to w$, and (2) $(v, c) \in R$. □

Example 5 Figure 5 top shows an NDC (call it $C$) over the scheme $[\{Store : store\} \to [\{Item : item\} \to numeric]]$. Let $R \subseteq \mathrm{dom}(\{Store : store\}) \times \mathrm{dom}(area)$ such that $(\{Store : x\}, y) \in R$ if and only if $\text{R-UP}^{area}_{store}(x) = y$. Then $\mathrm{extend}(C, Area, area, R)$ is shown in figure 5 bottom. Note that in this example $R$ happens to be a function. □

Store : store
Navona        →   Item : item
                  Lego        →  32
                  Scrabble    →  17
Colosseum     →   Item : item
                  Lego        →  24
                  Scrabble    →   6
Kinderdroom   →   Item : item
                  Lego        →  12
                  Scrabble    →  22

Store : store   Area : area
Navona          Italy         →   Item : item
                                  Lego        →  32
                                  Scrabble    →  17
Colosseum       Italy         →   Item : item
                                  Lego        →  24
                                  Scrabble    →   6
Kinderdroom     Belgium       →   Item : item
                                  Lego        →  12
                                  Scrabble    →  22

Figure 5: Illustration of extend operator.

The relation $R$ used in $\mathrm{extend}(\cdot, \cdot, \cdot, R)$ can be given by an NDC in the way defined next. This is an interesting feature, as it allows "joining" two NDC's.

Definition 4 Let $C$ be an NDC over the scheme $[\delta \to l]$. The relation induced by $C$, denoted $\mathrm{rel}(C)$, is the smallest subset of $\mathrm{dom}(\delta) \times \mathrm{dom}(l)$ containing $(v, c)$ whenever $C$ contains $v \to c$.
Let $C$ be an NDC over the scheme $[\delta \to \{|l|\}]$. The relation induced by $C$, denoted $\mathrm{rel}(C)$, is the smallest subset of $\mathrm{dom}(\delta) \times \mathrm{dom}(l)$ containing $(v, c)$ whenever $C$ contains $v \to w$ with $c \in w$. □

We note that in an application of $\mathrm{extend}(C, A, l, R)$, $C$ may contain an element $v \to w$ such that for all $(v', c) \in R$, $v \neq v'$. In that case, $v \to w$ will not contribute to the result NDC.

Area : area
Italy     →   Store : store
              Navona        →   Item : item
                                Lego        →  32
                                Scrabble    →  17
              Colosseum     →   Item : item
                                Lego        →  24
                                Scrabble    →   6
Belgium   →   Store : store
              Kinderdroom   →   Item : item
                                Lego        →  12
                                Scrabble    →  22

Figure 6: Result of applying nest(C, {Area}) on the NDC of figure 5.
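The NDC-level part of Definition 3 can be sketched in Java as follows. The sketch is illustrative and rests on assumptions: an NDC is modelled as a `Map` from coordinates to right-hand values, the relation $R$ as a list of (coordinate, value) pairs, and the level $l$ is omitted because it only affects the scheme. The names `ExtendSketch`, `pair`, and `figure3C5` are hypothetical. The data is that of C4 and C5 in figure 3.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of extend(C, A, l, R) on NDC's modelled as maps. */
final class ExtendSketch {
    /** One pair (v, c) of the relation R. */
    static Map.Entry<Map<String, String>, String> pair(Map<String, String> v, String c) {
        return new AbstractMap.SimpleEntry<>(v, c);
    }

    /** Adds attribute A with value c to every v -> w of the cube for which (v, c) is in R. */
    static <W> Map<Map<String, String>, W> extend(
            Map<Map<String, String>, W> cube,
            String attribute,
            List<Map.Entry<Map<String, String>, String>> relation) {
        Map<Map<String, String>, W> result = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, W> e : cube.entrySet()) {
            for (Map.Entry<Map<String, String>, String> p : relation) {
                if (p.getKey().equals(e.getKey())) {
                    Map<String, String> extended = new LinkedHashMap<>(e.getKey());
                    extended.put(attribute, p.getValue()); // v united with {A : c}
                    result.put(extended, e.getValue());
                }
            }
        }
        // Elements of the cube without a matching pair in R do not contribute to the result.
        return result;
    }

    /** C5 of figure 3: C4 extended with Month through the roll-up from day to month. */
    static Map<Map<String, String>, Integer> figure3C5() {
        Map<Map<String, String>, Integer> c4 = new LinkedHashMap<>();
        c4.put(Collections.singletonMap("Day", "Jan 1"), 113);
        c4.put(Collections.singletonMap("Day", "Jan 2"), 117);
        List<Map.Entry<Map<String, String>, String>> rollUp = new ArrayList<>();
        rollUp.add(pair(Collections.singletonMap("Day", "Jan 1"), "Jan"));
        rollUp.add(pair(Collections.singletonMap("Day", "Jan 2"), "Jan"));
        return extend(c4, "Month", rollUp);
    }
}
```

Because the right-hand values are carried over unchanged, the same method applies to cubes of any depth, such as the one in figure 5.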

3.3 The operator nest

Definition 5 Let $C$ be an NDC over the scheme $[\delta \to \tau]$. Let $X$ be a subset of $\mathrm{att}(\delta)$. Let $Y = \mathrm{att}(\delta) \setminus X$. Let $\delta_1 = \delta[X]$ and $\delta_2 = \delta[Y]$.
$\mathrm{nest}([\delta \to \tau], X)$ is equal to $[\delta_1 \to [\delta_2 \to \tau]]$.
$\mathrm{nest}(C, X)$ is defined as follows. Let $V = \{v[X] \mid v \to w \in C\}$. Clearly, $V \subseteq \mathrm{dom}(\delta_1)$. Then

1. For every $x \in V$, $\mathrm{nest}(C, X)$ contains $x \to C'$ where $C'$ is the NDC over $[\delta_2 \to \tau]$ satisfying $C' = \{v[Y] \to w \mid v \to w \in C \text{ and } v[X] = x\}$.
2. If $\mathrm{nest}(C, X)$ contains $x \to C'$ then $x \in V$; i.e., $\mathrm{nest}(C, X)$ contains no other elements than those specified in (1). □

Example 6 Consider the NDC (call it $C$) of figure 5 bottom. Figure 6 shows the result of $\mathrm{nest}(C, \{Area\})$. □

3.4 The operator unnest

The unnest operator is the inverse of the nest operator.

Definition 6 Let $C$ be an NDC over the scheme $[\delta_1 \to [\delta_2 \to \tau]]$, where $\mathrm{att}(\delta_1) \cap \mathrm{att}(\delta_2) = \{\}$.
$\mathrm{unnest}([\delta_1 \to [\delta_2 \to \tau]])$ is equal to $[\delta_1 \cup \delta_2 \to \tau]$.
$\mathrm{unnest}(C)$ is the smallest (w.r.t. set inclusion) NDC containing $v_1 \cup v_2 \to w$ whenever $C$ contains $v_1 \to C'$ with $v_2 \to w \in C'$. □
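Definitions 5 and 6 can be illustrated together, since unnest undoes nest. The Java sketch below is an illustration under assumptions: NDC's are modelled as maps from coordinates to right-hand values, and the names `NestSketch` and `figure3C1` are hypothetical. The data is that of C1 and C2 in figure 3.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of nest(C, X) and unnest(C) on NDC's modelled as maps. */
final class NestSketch {
    /** nest(C, X): group the elements v -> w of C by v[X]; inside each group keep v[Y] -> w. */
    static <W> Map<Map<String, String>, Map<Map<String, String>, W>> nest(
            Map<Map<String, String>, W> cube, Set<String> x) {
        Map<Map<String, String>, Map<Map<String, String>, W>> result = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, W> e : cube.entrySet()) {
            Map<String, String> outer = new LinkedHashMap<>(); // v[X]
            Map<String, String> inner = new LinkedHashMap<>(); // v[Y], with Y = att(delta) \ X
            for (Map.Entry<String, String> a : e.getKey().entrySet()) {
                (x.contains(a.getKey()) ? outer : inner).put(a.getKey(), a.getValue());
            }
            result.computeIfAbsent(outer, k -> new LinkedHashMap<>()).put(inner, e.getValue());
        }
        return result;
    }

    /** unnest(C): contains v1 united with v2 -> w whenever C contains v1 -> C' with v2 -> w in C'. */
    static <W> Map<Map<String, String>, W> unnest(
            Map<Map<String, String>, Map<Map<String, String>, W>> cube) {
        Map<Map<String, String>, W> result = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Map<Map<String, String>, W>> outer : cube.entrySet()) {
            for (Map.Entry<Map<String, String>, W> inner : outer.getValue().entrySet()) {
                Map<String, String> v = new LinkedHashMap<>(outer.getKey());
                v.putAll(inner.getKey());
                result.put(v, inner.getValue());
            }
        }
        return result;
    }

    /** C1 of figure 3. */
    static Map<Map<String, String>, Integer> figure3C1() {
        String[][] rows = {
            {"Lego", "Jan 1", "68"}, {"Scrabble", "Jan 1", "45"},
            {"Lego", "Jan 2", "70"}, {"Scrabble", "Jan 2", "47"}};
        Map<Map<String, String>, Integer> c1 = new LinkedHashMap<>();
        for (String[] row : rows) {
            Map<String, String> v = new LinkedHashMap<>();
            v.put("Item", row[0]);
            v.put("Day", row[1]);
            v.put("Amount", row[2]);
            c1.put(v, Integer.valueOf(row[2]));
        }
        return c1;
    }
}
```

With `X = {Day}`, `nest` turns C1 into C2 of figure 3, and `unnest` applied to that result gives back an NDC equal to C1.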

3.5 The operator duplicate

duplicate serves to duplicate attribute values at the right-hand side in an NDC. The operator has been illustrated in figure 3.

Definition 7 Let $C$ be an NDC over the scheme $[\delta \to \tau]$. Let $A : l \in \delta$.
$\mathrm{duplicate}([\delta \to \tau], A)$ is equal to $[\delta \to l]$.
$\mathrm{duplicate}(C, A)$ is the smallest NDC over $[\delta \to l]$ containing $v \to c$ whenever $C$ contains $v \to C'$ and $v(A) = c$ (for some NDC $C'$ over $\tau$). □

3.6 The operator select

Definition 8 Let $C$ be an NDC over the scheme $[\delta \to \tau]$. Let $A : l \in \delta$. Let $B \in \mathrm{att}(\delta)$. Let $c \in \mathrm{dom}(l)$.
$\mathrm{select}([\delta \to \tau], A, c)$ is equal to $[\delta \to \tau]$.
$\mathrm{select}([\delta \to \tau], A, B)$ is equal to $[\delta \to \tau]$.
$\mathrm{select}(C, A, c)$ is the smallest NDC over $[\delta \to \tau]$ containing $v \to w$ whenever $C$ contains $v \to w$ and $v(A) = c$.
$\mathrm{select}(C, A, B)$ is the smallest NDC over $[\delta \to \tau]$ containing $v \to w$ whenever $C$ contains $v \to w$ and $v(A) = v(B)$. □

Example 7 Consider the NDC (call it $C$) of figure 6. Figure 7 shows the result of $\mathrm{select}(C, Area, Italy)$. □

3.7 The operator rename

rename serves to rename attributes.

Definition 9 Let $C$ be an NDC over the scheme $[\delta \to \tau]$. Let $A \in \mathrm{att}(\delta)$ and let $B$ be an attribute not in $\mathrm{att}(\delta)$. Let $\delta'$ be the coordinate type obtained from $\delta$ by substituting $B$ for $A$.
$\mathrm{rename}([\delta \to \tau], A, B)$ is equal to $[\delta' \to \tau]$.
$\mathrm{rename}(C, A, B)$ is the smallest NDC over $[\delta' \to \tau]$ containing $v \cup \{B : c\} \to w$ whenever $C$ contains $v \cup \{A : c\} \to w$. □

Area : area
Italy   →   Store : store
            Navona      →   Item : item
                            Lego        →  32
                            Scrabble    →  17
            Colosseum   →   Item : item
                            Lego        →  24
                            Scrabble    →   6

Figure 7: Result of applying select(C, Area, Italy) on the NDC of figure 6.

3.8 The operator aggregate

aggregate applies an (aggregation) function on the right-hand values appearing in an NDC. The operator has been illustrated in figure 3.

Definition 10 The ground of a scheme $\tau$, denoted $\mathrm{ground}(\tau)$, is defined recursively as follows:

$$\mathrm{ground}([\delta \to \tau]) = \mathrm{ground}(\tau)$$
$$\mathrm{ground}(\beta) = \beta$$

Let $C$ be an NDC over the scheme $\tau$ with $\mathrm{ground}(\tau) = \beta_1$. Let $f$ be a total function from $\mathrm{dom}(\beta_1)$ to $\mathrm{dom}(\beta_2)$. $\mathrm{aggregate}(\tau, f)$ is recursively defined on schemes as follows:

$$\mathrm{aggregate}([\delta \to \tau], f) = [\delta \to \mathrm{aggregate}(\tau, f)]$$
$$\mathrm{aggregate}(\beta_1, f) = \beta_2$$

$\mathrm{aggregate}(C, f)$ is recursively defined on NDC's as follows.

- If $C = \{v_1 \to w_1, \ldots, v_m \to w_m\}$ is an NDC over $[\delta \to \tau]$, then $\mathrm{aggregate}(C, f) = \{v_1 \to \mathrm{aggregate}(w_1, f), \ldots, v_m \to \mathrm{aggregate}(w_m, f)\}$.
- If $C = c$ is an NDC over $\beta_1$, then $\mathrm{aggregate}(C, f) = f(c)$. □
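The recursion of Definition 10 on NDC's can be sketched in Java. The sketch makes simplifying assumptions that are not part of the definition: the ground values are bags of integers modelled as `List<Integer>`, cubes are modelled as maps, and the names `AggregateSketch`, `sum`, and `figure3C3` are hypothetical. The data is that of C3 and C4 in figure 3.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of aggregate(C, f) for NDC's whose ground values are bags of integers. */
final class AggregateSketch {
    /** Keeps all coordinates and applies f to every ground value. */
    @SuppressWarnings("unchecked")
    static Object aggregate(Object ndc, Function<List<Integer>, Integer> f) {
        if (ndc instanceof Map) {
            // Case [delta -> tau]: recurse into every right-hand value.
            Map<Object, Object> result = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) ndc).entrySet()) {
                result.put(e.getKey(), aggregate(e.getValue(), f));
            }
            return result;
        }
        // Ground case: the NDC is a single value c (here a bag), so the result is f(c).
        return f.apply((List<Integer>) ndc);
    }

    /** An example aggregate function: the sum of a bag. */
    static Integer sum(List<Integer> bag) {
        int total = 0;
        for (int value : bag) {
            total += value;
        }
        return total;
    }

    /** C3 of figure 3. */
    static Map<Map<String, String>, Object> figure3C3() {
        Map<Map<String, String>, Object> c3 = new LinkedHashMap<>();
        c3.put(Collections.singletonMap("Day", "Jan 1"), Arrays.asList(68, 45));
        c3.put(Collections.singletonMap("Day", "Jan 2"), Arrays.asList(70, 47));
        return c3;
    }
}
```

Calling `aggregate(figure3C3(), AggregateSketch::sum)` reproduces C4 of figure 3. Since the bags of C3 are left untouched, other aggregate functions can be applied to the same bags afterwards.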

4 Expressiveness

In this section, we show some properties concerning the expressiveness of the NDC algebra. We first show that the NDC algebra is sufficiently powerful to capture algebraic operations working directly on sub-NDC's over a subscheme of a scheme. Next we show that the NDC algebra can express the SPJR algebra [1].

4.1 Applying Operators at a Certain Depth

The recursion in the definition of NDC is a "tail recursion." Consequently, the "recursion depth" can be used to unequivocally address a sub-NDC within an NDC. This is an interesting and important property of NDC's. It is exploited by defining operators that directly work on sub-NDC's at a certain depth. Such operators reduce the need for frequent nesting and unnesting of NDC's.

Definition 11 Let $C$ be an NDC over the scheme $\tau$. Let $1 \leq d \leq \mathrm{depth}(\tau)$. Let $\mathrm{op}(C, a_1, \ldots, a_n)$ be any previously defined operation of the NDC algebra. Let $\tau' = \mathrm{subscheme}(\tau, d)$. $\mathrm{op}_d(C, a_1, \ldots, a_n)$ is defined iff $\mathrm{op}(\tau', a_1, \ldots, a_n)$ is defined.
We first give the result on schemes.

1. $\mathrm{op}_1(\tau, a_1, \ldots, a_n) = \mathrm{op}(\tau, a_1, \ldots, a_n)$.
2. If $d > 1$ then $\mathrm{op}_d([\delta \to \tau], a_1, \ldots, a_n) = [\delta \to \mathrm{op}_{d-1}(\tau, a_1, \ldots, a_n)]$.

We next give the result on NDC's.

1. $\mathrm{op}_1(C, a_1, \ldots, a_n) = \mathrm{op}(C, a_1, \ldots, a_n)$.
2. If $d > 1$ and $C = \{v_1 \to w_1, \ldots, v_m \to w_m\}$ then $\mathrm{op}_d(C, a_1, \ldots, a_n) = \{v_1 \to \mathrm{op}_{d-1}(w_1, a_1, \ldots, a_n), \ldots, v_m \to \mathrm{op}_{d-1}(w_m, a_1, \ldots, a_n)\}$. □

Example 8 Consider the NDC (call it $C$) of figure 6. Figure 8 shows the result of $\mathrm{unnest}_2(C)$. □

Area : area
Italy     →   Store : store   Item : item
              Navona          Lego          →  32
              Navona          Scrabble      →  17
              Colosseum       Lego          →  24
              Colosseum       Scrabble      →   6
Belgium   →   Store : store   Item : item
              Kinderdroom     Lego          →  12
              Kinderdroom     Scrabble      →  22

Figure 8: Result of applying unnest_2(C) on the NDC of figure 6.
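The NDC-level recursion of Definition 11 can be sketched in Java by passing the operator as a function. This is an illustration under assumptions: NDC's are modelled as nested maps, the extra arguments $a_1, \ldots, a_n$ are assumed to be already bound inside the function object, and the names `AtDepthSketch`, `atDepth`, `items`, and `figure6` are hypothetical. The example recomputes figure 8 from figure 6.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of op_d(C, ...) on NDC's modelled as nested maps. */
final class AtDepthSketch {
    /** op_1 = op; for d > 1, op_d keeps the coordinates and applies op_{d-1} to every right-hand value. */
    static Object atDepth(int d, Object ndc, UnaryOperator<Object> op) {
        if (d == 1) {
            return op.apply(ndc);
        }
        Map<Object, Object> result = new LinkedHashMap<>();
        for (Map.Entry<?, ?> e : ((Map<?, ?>) ndc).entrySet()) {
            result.put(e.getKey(), atDepth(d - 1, e.getValue(), op));
        }
        return result;
    }

    /** unnest on the nested-map model: merge the two outermost coordinate levels. */
    @SuppressWarnings("unchecked")
    static Object unnest(Object ndc) {
        Map<Map<String, String>, Object> result = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Object> outer
                : ((Map<Map<String, String>, Object>) ndc).entrySet()) {
            for (Map.Entry<Map<String, String>, Object> inner
                    : ((Map<Map<String, String>, Object>) outer.getValue()).entrySet()) {
                Map<String, String> v = new LinkedHashMap<>(outer.getKey());
                v.putAll(inner.getKey());
                result.put(v, inner.getValue());
            }
        }
        return result;
    }

    /** Sub-NDC over [{Item : item} -> numeric] with the given Lego and Scrabble values. */
    static Map<Map<String, String>, Object> items(int lego, int scrabble) {
        Map<Map<String, String>, Object> m = new LinkedHashMap<>();
        m.put(Collections.singletonMap("Item", "Lego"), lego);
        m.put(Collections.singletonMap("Item", "Scrabble"), scrabble);
        return m;
    }

    /** The NDC of figure 6. */
    static Object figure6() {
        Map<Map<String, String>, Object> italy = new LinkedHashMap<>();
        italy.put(Collections.singletonMap("Store", "Navona"), items(32, 17));
        italy.put(Collections.singletonMap("Store", "Colosseum"), items(24, 6));
        Map<Map<String, String>, Object> belgium = new LinkedHashMap<>();
        belgium.put(Collections.singletonMap("Store", "Kinderdroom"), items(12, 22));
        Map<Map<String, String>, Object> c = new LinkedHashMap<>();
        c.put(Collections.singletonMap("Area", "Italy"), italy);
        c.put(Collections.singletonMap("Area", "Belgium"), belgium);
        return c;
    }
}
```

`atDepth(2, figure6(), AtDepthSketch::unnest)` leaves the Area level in place and merges the Store and Item levels below it, which is the NDC of figure 8.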

The following theorem states that $\mathrm{op}_d(C, a_1, \ldots, a_n)$ is not a primitive operator; i.e., it can be expressed in terms of the operators of the NDC algebra.

Theorem 2 Let $C$ be an NDC over the scheme $\tau$. The operator $\mathrm{op}_d(C, a_1, \ldots, a_n)$ with $d \geq 2$ is redundant.

4.2 The SPJR Algebra

The following theorem states that the NDC algebra can express the SPJR algebra.

Theorem 3 The NDC algebra expresses the SPJR algebra [1].

5 Implementation

The operations we proposed for the NDC algebra can be implemented by algorithms that run in linear time with respect to the number of atomic values that appear in the data cube. We assume that each aggregate function is computable in polynomial time.

We introduce two new constructs: the iCube, which holds the actual data in an n-dimensional array, and the iStruct, essentially a string representing the structure behind the data. For example, consider the scheme $[\{A_1 : a_1, A_2 : a_2\} \to [\{B_1 : b_1, B_2 : b_2\} \to c]]$. It can be implemented by the iCube cube of type c[#b1][#a2][#d1][#b2][#a1] (the array type of the Java language is used for simplicity) together with the iStruct $[5, 2 \to [1, 4 \to \cdot]]$. In the iCube's type, #a1 denotes the cardinality of $\mathrm{dom}(a_1)$ plus one. That is, there is one entry for each element of $\mathrm{dom}(a_1)$ (indexes 1, 2, ..., #a1 − 1) on top of the entry with index 0. The numbers in the iStruct denote positions in the array type. For example, the number 5 refers to the fifth dimension, which ranges to #a1. A possible member of an NDC over the given scheme is $[\{A_1 : u_1, A_2 : u_2\} \to [\{B_1 : v_1, B_2 : v_2\} \to w]]$, which will be represented in the iCube as cube[v1][u2][0][v2][u1] = w. Note that an extra, unused dimension is present in the iCube (namely, the third dimension ranging to #d1). This is necessary in case the extend operation is used, as we require the iCubes to remain the same throughout the computation of the query. The spare dimension will then be used to store the new attribute created by the extend operator.

A final remark relates to the use of bags in our model. The implementation should support fast access to the elements in a bag of an NDC, to facilitate the computation of aggregate functions. Consider, for example, the scheme $[\{A : a\} \to \{|c|\}]$. NDC's over this scheme contain bags. One possible implementation uses the iCube cube of type c[#a][#d] and the iStruct $[1 \to \{|2|\}]$. A member $[\{A : v\} \to \{|w_1, w_2|\}]$ of an NDC over this scheme, for example, is represented by $\{|\,\text{cube}[v][i] \mid 1 \leq i < \#d\,|\} = \{|w_1, w_2|\}$.

The operations of the NDC algebra are implemented in such a way that the type of the iCube never changes. Importantly, the nest, unnest, and bagify operations only change the iStruct, leaving the iCube unaffected. The other operations also change the content of the iCube. Based on this, we can implement an expression in the NDC algebra in linear time.

We now give a concrete example. Let revenue be an NDC over the scheme

$$[\{Store : store\} \to [\{City : city\} \to [\{Product : product\} \to numeric]]]$$

We want to answer the query "For each city, give the maximal total revenue realized by any store in that city." The iCube and initial iStruct implementing the NDC are revenue = numeric[#store][#city][#product] and $[1 \to [2 \to [3 \to \cdot]]]$, respectively. The algebraic expression for the query is

aggregate(bagify(aggregate(bagify(nest(unnest(revenue), City)), sum)), max)

A Java-like linear-time program implementing the expression is given below. Since the valid indexes of a dimension run from 1 to #a − 1, the loops stop before #city, #store, and #product; the running maximum of a city is updated once per store, after that store's total has been computed.

    for (int c = 1; c < #city; c++) {
        revenue[0][c][0] = 0;
        for (int s = 1; s < #store; s++) {
            revenue[s][c][0] = 0;
            for (int p = 1; p < #product; p++) {
                revenue[s][c][0] += revenue[s][c][p];
            }
            revenue[0][c][0] = max(revenue[0][c][0], revenue[s][c][0]);
        }
    }
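The Java-like program can be turned into compilable Java under a few assumptions. In the sketch below, the iCube is a plain `int[][][]` whose array lengths play the role of #store, #city, and #product, index 0 of every dimension is the spare entry, absent combinations hold 0, and revenues are assumed to be non-negative so that 0 is a safe starting value for the maximum. The class and method names are hypothetical.

```java
/** Hypothetical runnable version of the revenue query on an iCube-like array. */
final class RevenueQuery {
    /**
     * For each city c, stores in revenue[0][c][0] the maximal total revenue realized
     * by any store in that city. Totals per store and city end up in revenue[s][c][0].
     * The array is indexed as revenue[store][city][product].
     */
    static void maxTotalRevenuePerCity(int[][][] revenue) {
        int stores = revenue.length;         // corresponds to #store
        int cities = revenue[0].length;      // corresponds to #city
        int products = revenue[0][0].length; // corresponds to #product
        for (int c = 1; c < cities; c++) {
            revenue[0][c][0] = 0;
            for (int s = 1; s < stores; s++) {
                // Inner bagify + sum: total revenue of store s in city c.
                revenue[s][c][0] = 0;
                for (int p = 1; p < products; p++) {
                    revenue[s][c][0] += revenue[s][c][p];
                }
                // Outer bagify + max: best store total seen so far in city c.
                revenue[0][c][0] = Math.max(revenue[0][c][0], revenue[s][c][0]);
            }
        }
    }
}
```

Every cell of the array is visited a constant number of times, which matches the linear-time claim for this expression. For instance, with made-up data in which two stores in city 1 have totals 30 and 45 and one store in city 2 has total 15, the method leaves 45 in `revenue[0][1][0]` and 15 in `revenue[0][2][0]`.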

6 Summary

We proposed the NDC data model and its associated algebra. The NDC data model differs from most existing OLAP data models in its explicit modeling of grouping at different levels of nesting. We believe that nested grouping naturally arises in many OLAP applications. We proved some properties about the expressiveness of our algebra. The advantage of an algebra in comparison with a calculus is that an algebraic approach is generally closer to an implementation. We indicated, in fact, that all operations of the NDC algebra can be implemented by linear time algorithms.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. IEEE Int. Conf. Data Engineering, pages 232–243, 1997.
[3] L. Cabibbo and R. Torlone. Querying multidimensional databases. In Sixth Int. Workshop on Database Programming Languages, pages 253–269, 1997.
[4] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, 1997.
[5] E. Codd, S. Codd, and C. Salley. Providing OLAP (On-Line Analytical Processing) to user-analysts: An IT mandate. Arbor Software White Paper, http://www.arborsoft.com.
[6] G. Colliat. OLAP, relational, and multidimensional database systems. SIGMOD Record, 25(3):64–69, 1996.
[7] C. Dyreson. Information retrieval from an incomplete data cube. In Proc. Int. Conf. Very Large Data Bases, pages 532–543, Bombay, India, 1996.
[8] Essbase. Arbor Software, http://www.arborsoft.com/OLAP.html.
[9] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by. In Proc. IEEE Int. Conf. Data Engineering, pages 152–159, 1997.
[10] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1:29–53, 1997.
[11] M. Gyssens, L. Lakshmanan, and I. Subramanian. Tables as a paradigm for querying and restructuring. In Proc. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 93–103, Montreal, Canada, 1996.
[12] J. Han. OLAP mining: An integration of OLAP with data mining. In Proceedings of the 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), pages 1–9, 1997.
[13] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 205–216, Montreal, Canada, 1996.
[14] Intelligent server. IBM, http://www.software.ibm.com/data/pubs/papers.
[15] L. Libkin, R. Machlin, and L. Wong. A query language for multidimensional arrays: Design, implementation, and optimization techniques. In Proc. ACM SIGMOD Int. Conf. Management of Data, pages 228–239, Montreal, Canada, 1996.
[16] A. Marathe and K. Salem. A language for manipulating arrays. In Proc. Int. Conf. Very Large Data Bases, pages 46–55, Athens, Greece, 1997.
[17] Pilot decision support suite. Pilot Software, http://rickover.pilotsw.com/products/Welcome.htm.
[18] Red brick warehouse. Red Brick, http://www.redbrick.com/rbs-g/html/plo.html.
[19] Sales analyzer. Oracle, http://www.oracle.com/products/olap/html/.
[20] S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In Proc. IEEE Int. Conf. Data Engineering, pages 328–336, Houston, Texas, 1994.
[21] A. Shoshani. OLAP and statistical databases: Similarities and differences. In Proc. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 185–196, Tucson, AZ, 1997.

Appendix

Proof of theorem 1

Proof. Let $\tau$ be a scheme. The proof runs by induction on the recursive construction of $\tau$. We distinguish between three cases:

Case 1: $\tau = l$. For each $l \in \mathcal{L}$, $\mathrm{dom}(l)$ is recursively enumerable by definition.

Case 2: $\tau = \{|\beta|\}$. By the induction hypothesis, $\mathrm{dom}(\beta)$ is recursively enumerable. We need to show $\mathrm{dom}(\{|\beta|\})$ is recursively enumerable. Since $\mathrm{dom}(\beta)$ is recursively enumerable, there is a computable function (call it $f$) from the set of non-negative integers onto $\mathrm{dom}(\beta)$. Let $\beta[m]$, where $m \geq 0$, denote the smallest set containing $f(i)$ whenever $0 \leq i \leq m$. Let $\beta[m, n]$, where $m, n \geq 0$, denote the smallest set containing every bag of $\mathrm{dom}(\{|\beta|\})$ that contains only elements of $\beta[m]$, and in which every member of $\beta[m]$ occurs at most $n$ times. Every bag of $\mathrm{dom}(\{|\beta|\})$ belongs to $\beta[m, n]$ for some $m, n \geq 0$. Clearly, $\beta[m, n]$ is recursively enumerable, and so is $\mathrm{dom}(\{|\beta|\}) = \bigcup_{m, n \in \mathbb{N}_0} \beta[m, n]$.

Case 3: $\tau = [\delta \to \tau']$. Clearly, $\mathrm{dom}(\delta)$ is recursively enumerable. By the induction hypothesis, $\mathrm{dom}(\tau')$ is recursively enumerable. We need to show $\mathrm{dom}(\tau)$ is recursively enumerable. Since $\mathrm{dom}(\delta)$ is recursively enumerable, there is a computable function (call it $f$) from the set of non-negative integers onto $\mathrm{dom}(\delta)$. Since $\mathrm{dom}(\tau')$ is recursively enumerable, there is a computable function (call it $g$) from the set of non-negative integers onto $\mathrm{dom}(\tau')$. Let $\delta[m]$, where $m \geq 0$, denote the smallest set containing $f(i)$ whenever $0 \leq i \leq m$. Let $\tau'[n]$, where $n \geq 0$, denote the smallest set containing $g(i)$ whenever $0 \leq i \leq n$. Let $\tau[m, n]$ be the set of partial functions from $\delta[m]$ into $\tau'[n]$. Every NDC of $\mathrm{dom}(\tau)$ corresponds to an element of $\tau[m, n]$ for some $m, n \geq 0$. Clearly, $\tau[m, n]$ is recursively enumerable, and so is $\bigcup_{m, n \in \mathbb{N}_0} \tau[m, n]$. This concludes the proof. □

Proof of theorem 2

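Proof. We show that $\mathit{op}_d(C, a_1, \ldots, a_n)$ can always be expressed in terms of other operators that apply at depth $< d$. We distinguish between five cases.

1. The proof is obvious for the operations $\mathrm{bagify}(\cdot)$ and $\mathrm{aggregate}(\cdot, \cdot)$.

2. Let $\mathit{op}_d(C, a_1, \ldots, a_n)$ be one of the operations $\mathrm{extend}_d(C, \cdot, \cdot, \cdot)$, $\mathrm{select}_d(C, \cdot, \cdot)$, or $\mathrm{rename}_d(C, \cdot, \cdot)$ with $d \geq 2$. Let $\mathit{subscheme}(d-1, \tau)$ be $[\delta_1 \to [\delta_2 \to \tau']]$. Let $\mathit{att}(\delta_1) = X$ and $\mathit{att}(\delta_2) = Y$. Consider the sequence:

\[
\begin{aligned}
C_1 &= \mathrm{unnest}_{d-1}(C) \\
C_2 &= \mathrm{nest}_{d-1}(C_1, Y) \\
C_3 &= \mathit{op}_{d-1}(C_2, a_1, \ldots, a_n) \\
C_4 &= \mathrm{unnest}_{d-1}(C_3) \\
C_5 &= \mathrm{nest}_{d-1}(C_4, X)
\end{aligned}
\]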

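We assumed no attribute naming conflicts. Naming conflicts can always be avoided by applying $\mathrm{rename}_{d-1}(\cdot, \cdot, \cdot)$. Then $C_5 = \mathit{op}_d(C, a_1, \ldots, a_n)$.

3. Let $\mathit{op}_d(C, a_1, \ldots, a_n)$ be $\mathrm{unnest}_d(C)$. Let $\mathit{subscheme}(d-1, \tau)$ be $[\delta_1 \to [\delta_2 \to [\delta_3 \to \tau']]]$. Let $\mathit{att}(\delta_1) = X$. Consider the sequence:

\[
\begin{aligned}
C_1 &= \mathrm{unnest}_{d-1}(C) \\
C_2 &= \mathrm{unnest}_{d-1}(C_1) \\
C_3 &= \mathrm{nest}_{d-1}(C_2, X)
\end{aligned}
\]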

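We assumed no attribute naming conflicts. Then $C_3 = \mathrm{unnest}_d(C)$.

4. Let $\mathit{op}_d(C, a_1, \ldots, a_n)$ be $\mathrm{nest}_d(C, Y)$. Let $\mathit{subscheme}(d-1, \tau)$ be $[\delta_1 \to [\delta_2 \to \tau']]$. Let $\mathit{att}(\delta_1) = X$. Consider the sequence:

\[
\begin{aligned}
C_1 &= \mathrm{unnest}_{d-1}(C) \\
C_2 &= \mathrm{nest}_{d-1}(C_1, XY) \\
C_3 &= \mathrm{nest}_{d-1}(C_2, X)
\end{aligned}
\]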

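We assumed no attribute naming conflicts. Then $C_3 = \mathrm{nest}_d(C, Y)$.

5. Let $\mathit{op}_d(C, a_1, \ldots, a_n)$ be $\mathrm{duplicate}_d(C, A)$. Let $\mathit{subscheme}(d-1, \tau)$ be $[\delta_1 \to [\delta_2 \to \tau']]$. Let $\mathit{att}(\delta_1) = X$. Consider the sequence:

\[
\begin{aligned}
C_1 &= \mathrm{unnest}_{d-1}(C) \\
C_2 &= \mathrm{duplicate}_{d-1}(C_1, A) \\
C_3 &= \mathrm{nest}_{d-1}(C_2, X)
\end{aligned}
\]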

We assumed no attribute naming conflicts. Then $C_3 = \mathrm{duplicate}_d(C, A)$.

By induction on $d$, it follows that $\mathit{op}_d(C, a_1, \ldots, a_n)$ can always be expressed using only operators that apply at depth $d = 1$. This concludes the proof. $\Box$
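The unnest/apply/nest pattern used in the proof can be tried out on a toy encoding. The sketch below is only an illustration under stated assumptions: an NDC is modelled as a Python dict from coordinates (frozensets of attribute-value pairs) to values or sub-dicts, and the behaviour of `nest`, `unnest`, and `rename` is a simplified reading of the operators defined in the body of the paper. It replays the sequence $C_1, \ldots, C_5$ for a rename with $d = 2$ and compares the result with renaming inside every sub-cube directly. All function names and the sample data are hypothetical.

```python
def coord(**attrs):
    """A coordinate: an immutable set of (attribute, value) pairs."""
    return frozenset(attrs.items())


def nest(cube, attrs):
    """Depth-1 nest: split each coordinate into its `attrs` part and the rest."""
    nested = {}
    for c, value in cube.items():
        outer = frozenset(pair for pair in c if pair[0] in attrs)
        nested.setdefault(outer, {})[c - outer] = value
    return nested


def unnest(cube):
    """Depth-1 unnest: merge each outer coordinate with its inner coordinates."""
    return {outer | inner: value
            for outer, sub in cube.items()
            for inner, value in sub.items()}


def rename(cube, old, new):
    """Depth-1 rename of attribute `old` to `new` in the top-level coordinates."""
    return {frozenset((new if a == old else a, v) for a, v in c): value
            for c, value in cube.items()}


def rename_at_depth_2(cube, old, new):
    """Assumed reference semantics: rename inside every sub-cube."""
    return {c: rename(sub, old, new) for c, sub in cube.items()}


def rename_at_depth_2_via_depth_1(cube, x, y, old, new):
    """The sequence C1..C5 with d = 2, using depth-1 operators only."""
    c1 = unnest(cube)
    c2 = nest(c1, y)
    c3 = rename(c2, old, new)
    c4 = unnest(c3)
    return nest(c4, x)


sales = {
    coord(Day="Mon", Item="tea"): 5,
    coord(Day="Mon", Item="milk"): 3,
    coord(Day="Tue", Item="tea"): 7,
}
cube = nest(sales, {"Day"})  # scheme [Day -> [Item -> num]]
direct = rename_at_depth_2(cube, "Item", "Product")
simulated = rename_at_depth_2_via_depth_1(cube, {"Day"}, {"Item"}, "Item", "Product")
```

On this example `direct` and `simulated` coincide. A select at depth 2 would need extra care in such an encoding, because the direct version may leave empty sub-cubes behind.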

Proof of theorem 3

Definition 12. Let $X$ be a set of attributes. A relation is defined relative to a level $l$. A tuple over $X$ is a total function from $X$ to $\operatorname{dom}(l)$. A relation over $X$ is a finite set of tuples over $X$. Let $R$ be a relation over the set $X$ of attributes. Let $\delta$ be the coordinate type with $\mathit{att}(\delta) = X$ and $\delta(A) = l$ for each $A \in X$. The NDC induced by $R$, denoted $\operatorname{cube}(R)$, is the smallest NDC over the scheme $[\delta \to \lambda]$ containing $v \to \top$ whenever $v \in R$. $\Box$

Proof. Let $R$ and $S$ be relations over $X$ and $Y$ respectively. We need to show the following:

Selection: $\operatorname{cube}(\sigma_{A=a} R)$ and $\operatorname{cube}(\sigma_{A=B} R)$ can be computed from $\operatorname{cube}(R)$ by the NDC algebra.

Projection: $\operatorname{cube}(\pi_X R)$ can be computed from $\operatorname{cube}(R)$.

Join: $\operatorname{cube}(R \bowtie S)$ can be computed from $\operatorname{cube}(R)$ and $\operatorname{cube}(S)$.

Rename: $\operatorname{cube}(\rho_{A \to B} R)$ can be computed from $\operatorname{cube}(R)$.

Selection and rename are obvious. For projection, let $R$ be a relation over $XY$. Let $f$ be a function from $\operatorname{dom}(\{|\lambda|\})$ to $\operatorname{dom}(\lambda)$ that maps each element of $\operatorname{dom}(\{|\lambda|\})$ to $\top$. Let $C = \operatorname{cube}(R)$. Let:

\[
\begin{aligned}
C_1 &= \mathrm{nest}(C, X) \\
C_2 &= \mathrm{bagify}(C_1) \\
C_3 &= \mathrm{aggregate}(C_2, f)
\end{aligned}
\]

Then $C_3 = \operatorname{cube}(\pi_X R)$.

For join, we make use of the following property. Let $R$ be a relation over $XY$, and let $S$ be a relation over $YZA$. Then

\[
R \bowtie S = (R \bowtie \pi_{YA} S) \bowtie S.
\]

Consequently, it suffices to consider joins $R \bowtie S$ of the following form:

1. $R$ is a relation over $XY$ and $S$ is a relation over $YA$, and $A \notin XY$.
2. $R$ is a relation over $XY$ and $S$ is a relation over $Y$.

We first consider simulating a join of type (1). Let $R$ and $S$ be relations over $XY$ and $YA$ respectively, where $X \cap Y = \{\}$ and $A \notin XY$. Let $C = \operatorname{cube}(R)$ and $D = \operatorname{cube}(S)$. Let:

\[
\begin{aligned}
D_1 &= \mathrm{duplicate}(D, A) \\
D_2 &= \mathrm{nest}(D_1, Y) \\
D_3 &= \mathrm{bagify}(D_2) \\
C_1 &= \mathrm{nest}(C, Y) \\
C_2 &= \mathrm{extend}(C_1, A, l, \mathit{rel}(D_3)) \\
C_3 &= \mathrm{unnest}(C_2)
\end{aligned}
\]

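Then $C_3 = \operatorname{cube}(R \bowtie S)$.

Next we consider simulating a join of type (2). Let $R$ and $S$ be relations over $XY$ and $Y$ respectively, where $X \cap Y = \{\}$. Let $C = \operatorname{cube}(R)$ and $D = \operatorname{cube}(S)$. Let:

\[
\begin{aligned}
C_1 &= \mathrm{nest}(C, Y) \\
C_2 &= \mathrm{extend}(C_1, A, \lambda, \mathit{rel}(D)) \\
C_3 &= \mathrm{unnest}(C_2) \\
C_4 &= \mathrm{nest}(C_3, XY) \\
C_5 &= \mathrm{bagify}(C_4) \\
C_6 &= \mathrm{aggregate}(C_5, f)
\end{aligned}
\]

Then $C_6 = \operatorname{cube}(R \bowtie S)$. This concludes the proof. $\Box$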
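Two steps of the proof of Theorem 3 lend themselves to a quick check on small data: the simulation of projection by nest, bagify, and aggregate, and the decomposition property used to reduce general joins to the two special forms. The sketch below is illustrative only. It assumes a dict-based encoding of NDCs (coordinates are frozensets of attribute-value pairs), a simplified reading of `nest`, `bagify`, and `aggregate`, and a string `TOP` standing for the distinguished cell value of an induced cube. The relations `R` and `S` and all helper names are hypothetical. The projection is taken onto the attribute that has duplicate values in `R`, so that the bags are not all singletons.

```python
TOP = "top"  # stands for the distinguished cell value of an induced cube


def row(**attrs):
    """A tuple (or coordinate): an immutable set of (attribute, value) pairs."""
    return frozenset(attrs.items())


def cube_of(relation):
    """cube(R): the NDC mapping every tuple of R to TOP."""
    return {t: TOP for t in relation}


def nest(cube, attrs):
    """Split each coordinate into its `attrs` part and the rest."""
    nested = {}
    for c, value in cube.items():
        outer = frozenset(pair for pair in c if pair[0] in attrs)
        nested.setdefault(outer, {})[c - outer] = value
    return nested


def bagify(cube):
    """Replace every sub-cube by the bag (encoded as a list) of its values."""
    return {c: list(sub.values()) for c, sub in cube.items()}


def aggregate(cube, f):
    """Apply the function f to every bag."""
    return {c: f(bag) for c, bag in cube.items()}


def project(relation, attrs):
    """Relational projection onto `attrs`."""
    return {frozenset(pair for pair in t if pair[0] in attrs) for t in relation}


def natural_join(r, s):
    """Relational natural join on the attributes shared by r and s."""
    joined = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                joined.add(frozenset({**dt, **du}.items()))
    return joined


R = {row(X=1, Y=1), row(X=2, Y=1), row(X=3, Y=2)}
S = {row(Y=1, Z="a", A=10), row(Y=1, Z="b", A=20), row(Y=3, Z="c", A=30)}

# Projection of R via nest, bagify, aggregate (f maps every bag to TOP).
c1 = nest(cube_of(R), {"Y"})
c2 = bagify(c1)
c3 = aggregate(c2, lambda bag: TOP)

# Decomposition property: R join S = (R join pi_{YA} S) join S.
lhs = natural_join(R, S)
rhs = natural_join(natural_join(R, project(S, {"Y", "A"})), S)
```

Here `c3` equals the cube induced by the projected relation, and `lhs` equals `rhs`, as the proof states.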
