To Appear in: Proc. ACM Symp. on Principles of Database Systems (PODS'96), Montreal, PQ, June 1996

Tables As a Paradigm for Querying and Restructuring (Extended Abstract)

Marc Gyssens†
University of Limburg

Laks V.S. Lakshmanan‡
Concordia University

Iyer N. Subramanian‡
Concordia University

* This research was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada and the Fonds Pour Formation De Chercheurs Et L'Aide A La Recherche of Quebec.
† Dept. WNI, University of Limburg, B-3590 Diepenbeek, Belgium. E-mail: [email protected].
‡ Dept. of Computer Science, Concordia University, Montreal, Quebec, Canada. Email: flaks,[email protected].

Abstract

Tables are one of the most natural representations of real-life data. Previous table-based data models (such as relational, nested relational, and complex objects models) capture only a limited variety of real-life tables. In this paper, we study the foundations of tabular representations of data. We propose the tabular database model for handling a broad class of natural data representations and develop tabular algebra as a language for querying and restructuring tabular data. We show that the tabular algebra is complete for a very general class of transformations and show that several languages designed for very different purposes can naturally be embedded into the tabular model. We also demonstrate the applicability of our model as a theoretical foundation for on-line analytical processing (OLAP), an emerging technology for complementing the robust data management and transaction processing of DBMS with powerful tools for data analysis.

1 Introduction

Tables are one of the most natural ways in which real-life data can be represented. Indeed, the success and popularity of the relational model (see [15]) is a testimony to this. The relational model, however, accounts for only a very limited variety of the tables that are possible. Real-world tables may have names for their columns (like relations) and for their rows (unlike relations), and these names need not be distinct (unlike in relations). To a limited extent, the nested relational and complex-object models (see [15]) mitigate the limitations of the relational model by allowing nesting and promoting structure sharing. These models, however, still fail to exploit the full power of tables.

Figure 1 shows several databases as sets of tables representing the same (sales) data in a variety of ways, and illustrates the points made above. For now, we only consider the tables and parts of tables outlined in bold. The database SalesInfo1 is a relational representation of the sales data. The databases SalesInfo2–SalesInfo4 fall outside the traditional relational model in that (some) rows have names just like columns. Compared to SalesInfo1, SalesInfo2 shows the sales data organized per region, thus facilitating a quick comparison of the performance of each part in various regions. Notice that the column names (usually called attributes) in this example are not all distinct. The database SalesInfo3 shows the sales data for each combination of a part and a region. Here, row and column "names" are actually data! Notice that, unlike in the relational model, the width of the table Sales in both SalesInfo2 and SalesInfo3 is not fixed, but depends on the particular instance. Finally, in SalesInfo4, there is a separate table for each region. All tables in this database have the same name; their number depends on the particular instance.

In summary, tables, as opposed to relations, offer a symmetry between rows and columns and the latitude that row and column names may occur multiply or may even be absent. Exploiting this symmetry and flexibility allows for a much broader class of natural data representations than captured by the traditional relational model.

Having more liberal tabular representations available for databases is interesting not only from a static but also from a dynamic point of view. It has been pointed out (e.g., see [7, 8]) that many applications can significantly benefit from the integration of database systems (whose strength is efficient and robust on-line transaction processing (OLTP) and handling large volumes of data) with analytical tools like spreadsheets (which offer strong on-line analytical processing (OLAP) capabilities). Indeed, spreadsheets model data in the form of tables (somewhat more liberally than in the relational model) and have several powerful analytical functions built into them. Examples include row and column arithmetic, generalized aggregation on arbitrary blocks of values drawn from tables, and the ability to invoke external functions. It has been pointed out [7, 8] that an integration of relational database systems and spreadsheets will combine their complementary strengths in OLTP and OLAP, respectively, leading to a powerful environment for data processing. Such an integration calls for a powerful model and language that supports convenient restructuring of data between various tabular representations.

In this paper, we propose the tabular database model as a general-purpose model allowing a very broad class of natural data representations, including those covered by the relational model and spreadsheets. Intuitively, a tabular database is a set of tables. Each table has a name, called the table name, which appears as the top-most left-most entry in it (see the tables of the databases SalesInfo2–SalesInfo4 in Figure 1). The other entries appearing in the top-most row are column attributes, and the other entries in the left-most column are row attributes. The remaining entries in a table can be thought of as "data entries." Notwithstanding, data can also occur in the attribute positions (see the table Sales of the database SalesInfo3 in Figure 1). Both row and column attributes are optional. Whenever a certain entry is not applicable, we indicate this by the special symbol ⊥, thought of as the "inapplicable" null (as shown in the tables of the databases SalesInfo2 and SalesInfo3 in Figure 1). We also study the problems of querying and of restructuring among different tabular representations of similar data.
This problem has applications in schema restructuring, view maintenance, integration of heterogeneous databases, interoperability, and integration of database systems with analytical tools such as spreadsheets, all currently active areas of research.

The main contributions of the paper are the following. (1) We propose the tabular database model, which is very simple and yet powerful in allowing a very broad class of natural data representations. (2) We propose an algebraic language, called the tabular algebra, for querying and restructuring tabular data. (3) We develop notions of genericity and completeness that in some sense capture the class of all "meaningful" queries and all "conceivable" forms of restructuring on tabular data representations. We also prove the surprising result that this algebra, which is essentially based on four very simple and natural forms of restructuring, is complete w.r.t. our criteria.

We also compare our model and language with existing ones and bring out their power and generality. Among other things, we show the following. (4) The graph-based object-oriented data model GOOD recently proposed by Gyssens et al. [9, 4, 3] can be embedded within the tabular database model. In particular, every GOOD query can be expressed in the tabular algebra. This observation also provides a means to embed other models encompassed by GOOD, such as the nested and complex-object models, in the tabular model. (5) The syntactic higher-order logic-based model of SchemaLog recently proposed by Lakshmanan et al. [11, 12] can also be embedded within the tabular model. In particular, every query or restructuring transformation expressible in SchemaLog (without function symbols) can also be expressed in tabular algebra. (6) The tabular algebra can serve as a fundamental query and restructuring language for OLAP-based information systems. To our knowledge, the theoretical foundations for OLAP systems have not been clearly developed in the OLAP literature, and the tabular algebra is the first fundamental querying and restructuring language to be proposed for such systems.

Before concluding this section, let us illustrate the flexibility and power of tables compared to relations. Revisit the tables representing sales data, shown in Figure 1. Suppose we wish to include summary data in these databases. Such data can come from, e.g., OLAP tools. The summary could include sales totals per part and per region and grand-total sales. In the relational representation (the database SalesInfo1), we are forced to store such information in separate relations. By contrast, we can easily absorb such summary data in the tables of SalesInfo2–SalesInfo4, as shown in regular outline in Figure 1. It has been pointed out that tabular representations are also useful in the context of data mining ([16]). Finally, as an illustration of the power of the tabular algebra, we mention that it is possible to restructure the data from any of the representations SalesInfo2–SalesInfo4 in Figure 1 to any other.

For lack of space, this extremely terse extended abstract suppresses many of the details and all proofs. All of these can be found in the full paper ([5]).

2 The Tabular Database Model

In this section, we present the tabular database model. To this end, we distinguish two sorts of symbols: $\mathcal{N}$ (called names) and $\mathcal{V}$ (called values). Names can be thought of as a generalization of relation and attribute names.

SalesInfo1

  Sales   Part     Region   Sold
          nuts     east     50
          nuts     west     60
          nuts     south    40
          screws   west     50
          screws   north    60
          screws   south    50
          bolts    east     70
          bolts    north    40

  TotalPartSales   Part     Total
                   nuts     150
                   screws   160
                   bolts    110

  TotalRegionSales   Region   Total
                     east     120
                     west     110
                     north    100
                     south    90

  GrandTotal   Total
               420

SalesInfo2

  Sales    Part     Sold   Sold   Sold    Sold    Sold
  Region   ⊥        east   west   north   south   Total
  ⊥        nuts     50     60     ⊥       40      150
  ⊥        screws   ⊥      50     60      50      160
  ⊥        bolts    70     ⊥      40      ⊥       110
  Total    ⊥        120    110    100     90      420

SalesInfo3

  Sales   nuts   screws   bolts   Total
  east    50     ⊥        70      120
  west    60     50       ⊥       110
  north   ⊥      60       40      100
  south   40     50       ⊥       90
  Total   150    160      110     420

SalesInfo4

  Sales    Part     Sold
  Region   east     east
  ⊥        nuts     50
  ⊥        bolts    70
  Total    ⊥        120

  Sales    Part     Sold
  Region   west     west
  ⊥        nuts     60
  ⊥        screws   50
  Total    ⊥        110

  Sales    Part     Sold
  Region   north    north
  ⊥        screws   60
  ⊥        bolts    40
  Total    ⊥        100

  Sales    Part     Sold
  Region   south    south
  ⊥        nuts     40
  ⊥        screws   50
  Total    ⊥        90

  Sales    Part     Sold
  Region   Total    Total
  ⊥        nuts     150
  ⊥        screws   160
  ⊥        bolts    110
  Total    ⊥        420

Figure 1: Some examples of tabular databases. (In the original figure, the base data are outlined in bold; the summary data, i.e., the Total rows, Total columns, and total tables, are drawn in regular outline.)

To allow a broad class of data representations, we allow names to occur also in those positions in tables normally thought of as data entry positions. Similarly, we allow values to occur also in attribute positions. Our operations will be allowed to distinguish individual names while, for genericity reasons [6], they will not be allowed to distinguish individual values. In concrete examples, we shall distinguish names from values by writing names in typewriter font. As is the case in real-life tables, our tables need not have entries for every row and column combination. To deal with this possible absence of entries, we introduce the ("inapplicable") null value ⊥. The set of all symbols is then $\mathcal{S} = \mathcal{N} \cup \mathcal{V} \cup \{\bot\}$.

The presence of ⊥ requires an adapted notion of equality. Let $A$ and $B$ be two sets of symbols, i.e., $A, B \subseteq \mathcal{S}$. We say that $A$ is weakly contained in $B$, denoted $A \sqsubseteq B$, if $A \setminus \{\bot\} \subseteq B \setminus \{\bot\}$. We say that $A$ and $B$ are weakly equal, denoted $A \doteq B$, if $A \sqsubseteq B$ and $B \sqsubseteq A$.

A tabular database, now, is a set of tables. In this way, a tabular database can be thought of as a three-dimensional table. Formally, a table is a total mapping from the Cartesian product of two initial segments of the natural numbers into $\mathcal{S}$. Hence we can think of a table as a matrix. If $\tau$ is a table with row numbers $0, \ldots, m$ and column numbers $0, \ldots, n$, then $\tau$ is called a table of width $n$ and height $m$. The width and height of $\tau$ are denoted $width(\tau)$ and $height(\tau)$, respectively. For $I$, a finite sequence over $\{0, \ldots, m\}$, and $J$, a finite sequence over $\{0, \ldots, n\}$, $\tau_I^J$ denotes the subtable of $\tau$ formed by the rows and columns indicated. In particular, for $0 \le i \le height(\tau)$ and $0 \le j \le width(\tau)$, $\tau_i$ denotes the $i$th row, $\tau^j$ the $j$th column, and $\tau_i^j$ the entry $(i, j)$. The sequence $(i+1)..height(\tau)$ will be abbreviated as $> i$ and the sequence $(j+1)..width(\tau)$ as $> j$. (The index position will disambiguate between the two possibilities.) In particular, $> 0$ will be further abbreviated to $>$.

$$\begin{array}{|c|c|} \hline \tau_0^0 & \tau_0^> \\ \hline \tau_>^0 & \tau_>^> \\ \hline \end{array}$$

Figure 2: Diagrammatic representation of a table.

Using this notation in a block diagram, four "regions" can be distinguished in a table $\tau$, as shown in Figure 2. The entry $\tau_0^0$ is called the table name, the entries $\tau_0^>$ are called the column attributes, the entries $\tau_>^0$ are called the row attributes, and the entries $\tau_>^>$ are called the data entries.

The databases SalesInfo2–SalesInfo4 in Figure 1 are examples of tabular databases satisfying the definitions given here. They will be used as a running example throughout this paper.

The possibility of multiple occurrences of column and row attributes requires appropriate terminology. If $\tau$ is a table, $\tau_i$ a row of $\tau$, and $a \in \mathcal{S}$ some symbol, then $\tau_i(a)$ is the set of all data entries of $\tau_i$ appearing in those columns named $a$, i.e., $\tau_i(a) = \{\tau_i^j \mid j > 0 \text{ and } \tau_0^j = a\}$. Whenever $\rho$ and $\sigma$ are (not necessarily distinct) tables, and $\rho_i$ and $\sigma_k$ are rows of $\rho$ and $\sigma$, respectively, then $\rho_i$ is said to be subsumed by $\sigma_k$, denoted $\rho_i \sqsubseteq \sigma_k$, if, for each column attribute $a$ in $\rho$ or in $\sigma$, $\rho_i(a) \sqsubseteq \sigma_k(a)$, i.e., the set $\rho_i(a)$ is weakly contained in the set $\sigma_k(a)$. Finally, the rows $\rho_i$ and $\sigma_k$ are said to subsume each other, denoted $\rho_i \doteq \sigma_k$, if $\rho_i \sqsubseteq \sigma_k$ and $\sigma_k \sqsubseteq \rho_i$. Similar terminology can be developed for columns.
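The definitions of this section can be made concrete with a small sketch. It assumes, purely for illustration, that a table is a Python list of rows, that `None` stands in for the inapplicable null, and that names and values are ordinary strings and numbers; the function names are hypothetical and not part of the paper.

```python
# Illustrative sketch only: a table is a list of rows (a matrix),
# and None plays the role of the inapplicable null.
NULL = None

def weakly_contained(a, b):
    """A is weakly contained in B: A minus {null} is a subset of B minus {null}."""
    return (set(a) - {NULL}) <= (set(b) - {NULL})

def weakly_equal(a, b):
    """A and B are weakly equal if each is weakly contained in the other."""
    return weakly_contained(a, b) and weakly_contained(b, a)

def entries_under(table, i, a):
    """Set of entries of row i in the columns j > 0 whose attribute is a."""
    return {table[i][j] for j in range(1, len(table[0])) if table[0][j] == a}

def row_subsumed(t1, i, t2, k):
    """Row i of t1 is subsumed by row k of t2 (checked per column attribute)."""
    attrs = set(t1[0][1:]) | set(t2[0][1:])
    return all(
        weakly_contained(entries_under(t1, i, a), entries_under(t2, k, a))
        for a in attrs
    )

# A fragment in the style of SalesInfo2: repeated column attribute Sold.
sales = [
    ["Sales", "Part", "Sold", "Sold"],
    ["Region", NULL, "east", "west"],
    [NULL, "nuts", 50, 60],
    [NULL, "nuts", 50, NULL],
]

print(entries_under(sales, 2, "Sold"))   # the Sold entries of row 2
print(row_subsumed(sales, 3, sales, 2))  # row 3 carries no extra information
```

In this encoding the table name is `table[0][0]`, the column attributes are `table[0][1:]`, and the row attributes are the first entries of the remaining rows.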

3 The Tabular Algebra

In this section, we describe the tabular algebra (TA), partly informally, due to space restrictions. Formal counterparts of informal descriptions can be found in the full paper [5].

The tabular algebra consists of assignment statements of the form T ← ⟨operation⟩ ⟨parameter list⟩ (⟨argument list⟩), with T a table name parameter, augmented with an iteration construct. The precise meaning of the parameters will be clarified in Section 3.6. For now, they can be considered as table names, column attributes, or sets of column attributes. The argument list is a sequence of table name parameters. Each time an assignment statement as above is invoked, the operation is executed on each sequence of tables in the database whose names match the table name parameters in the argument list. The resulting tables are named T.

The effect of the tabular algebra operations will sometimes be explained verbally and sometimes by means of a diagrammatic representation in the style of Figure 2. If an entry in a box of such a diagram contains position index parameters, these parameters run over all applicable values, subject to a condition in a condition box, and the corresponding entries are repeated accordingly. The range of the repetition is indicated with dashed boxes (e.g., see Figure 3). If a box contains a single entry (e.g., "T" or "⊥"), that entry is supposed to be repeated as often as is required to match the adjacent boxes, both horizontally and vertically.

Whichever way we choose below to explain a particular operation, we assume that $\rho$ and $\sigma$ are tables with names R and S, respectively. We then describe the effect of applying an assignment statement with that operation and argument(s) R (and S in case of a binary operation) on the tables $\rho$ and $\sigma$. For convenience, we refer to the top row of each table as its attribute row, and all other rows as data rows. A similar convention will also be used for columns.

Union:

$$\begin{array}{|c|c|c|} \hline T & \rho_0^> & \sigma_0^> \\ \hline \rho_>^0 & \rho_>^> & \bot \\ \hline \sigma_>^0 & \bot & \sigma_>^> \\ \hline \end{array}$$

Difference:

$$\begin{array}{|c|c|} \hline T & \rho_0^> \\ \hline \rho_i^0 & \rho_i^> \\ \hline \end{array} \qquad (i > 0) \wedge (\forall j)(\rho_i \not\sqsubseteq \sigma_j)$$

Cartesian product:

$$\begin{array}{|c|c|c|} \hline T & \rho_0^> & \sigma_0^> \\ \hline \rho_i^0 & \rho_i^> & \sigma_j^> \\ \hline \end{array} \qquad (i > 0) \wedge (j > 0)$$

Figure 3: Diagrammatic representation of the effect of union, difference, and Cartesian product.

3.1 Traditional operations

A first set of operations are adaptations of the traditional relational algebra operations to our setting: union, difference, Cartesian product, renaming, projection, and selection. Figure 3 shows the effect of T ← R ∪ S, T ← R \ S, and T ← R × S on the tables $\rho$ and $\sigma$. Notice that the operation is carried out on all (or all pairs of, as appropriate) tables with names R and S and the result is named T. Notice also that union and difference are defined in such a way that their result always exists. Intersection is defined in terms of difference in the usual way. The classical versions of these operations can be simulated by using their tabular versions together with the redundancy removal operations, as discussed in Section 3.4.

The effect of T ← rename$_{B \leftarrow A}$(R), T ← select$_{A = B}$(R), and T ← project$_{\mathcal{A}}$(R), with $A$ and $B$ attribute parameters and $\mathcal{A}$ an attribute-set parameter, is also defined in the usual way. We merely remark that weak equality is used instead of classical equality in the definition of selection.
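As a rough illustration of the tabular union, difference, and Cartesian product of Figure 3, the following sketch encodes a table as a list of rows with `None` as the null. The helper names are hypothetical, and two details are assumptions rather than statements of the paper: in the difference, the quantified rows of the second table are taken to be its data rows, and in the product each result row keeps the row attribute of the first operand.

```python
NULL = None  # stands in for the inapplicable null

def named(row, header, a):
    """Non-null entries of `row` in the columns of `header` named `a`."""
    return {row[j] for j in range(1, len(header)) if header[j] == a} - {NULL}

def subsumed(row, header, other, other_header):
    """`row` is subsumed by `other`, compared attribute by attribute."""
    attrs = set(header[1:]) | set(other_header[1:])
    return all(named(row, header, a) <= named(other, other_header, a)
               for a in attrs)

def union(name, r, s):
    """Attribute rows are concatenated; data rows are padded with nulls."""
    wr, ws = len(r[0]) - 1, len(s[0]) - 1
    out = [[name] + r[0][1:] + s[0][1:]]
    out += [row + [NULL] * ws for row in r[1:]]
    out += [[row[0]] + [NULL] * wr + row[1:] for row in s[1:]]
    return out

def difference(name, r, s):
    """Keep the data rows of r that no data row of s subsumes (assumption)."""
    out = [[name] + r[0][1:]]
    out += [row[:] for row in r[1:]
            if not any(subsumed(row, r[0], other, s[0]) for other in s[1:])]
    return out

def product(name, r, s):
    """Pair every data row of r with every data row of s."""
    out = [[name] + r[0][1:] + s[0][1:]]
    out += [x + y[1:] for x in r[1:] for y in s[1:]]
    return out

r = [["R", "A", "B"], [NULL, 1, 2], [NULL, 3, 4]]
s = [["S", "A", "B"], [NULL, 1, 2]]
for row in union("T", r, s):
    print(row)
```

Note that the union never fails: tables with different attribute rows are simply placed side by side, which is why redundancy removal is needed to recover the classical union.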

3.2 Restructuring Operations

We consider four restructuring operations: grouping, merging, splitting, and collapsing. Grouping and merging (respectively splitting and collapsing) can be seen as inverses of each other. Since the formal definitions are rather involved, we limit ourselves to giving an informal description on an example here. The formal definitions can be found in the full paper [5].

The syntax of a grouping assignment statement is T ← group by $\mathcal{A}$ on $\mathcal{B}$ (R), with $\mathcal{A}$ and $\mathcal{B}$ attribute-set parameters. To see its effect, consider the grouping assignment statement Sales ← group by Region on Sold (Sales) applied to the table in Figure 4, top, the obvious counterpart in the tabular model of the table Sales in the relational database SalesInfo1 of Figure 1. The resulting table, in Figure 4, bottom, is obtained as follows. (1) Its attribute row is obtained by first extracting from the attribute row of the original table the attributes different from both Region and Sold (only Part in our example), and then adding this together with as many copies of Sold as there are data rows in the original table. (2) Next, the column headed by Region is added as the first data row of the new table. (3) Finally, the data rows from top (after projecting out the Region entries) are added to the table bottom, as follows. Consider row i in top. The Sold entry of this row is added under the ith occurrence of the Sold column in bottom, on row i. The remaining entries of the row just added to bottom are filled up with ⊥'s.

The resulting table can be seen as a very uneconomical representation of (the bold part of) the Sales table in the tabular database SalesInfo2 in Figure 1. It is actually this latter table we had in mind as the result of grouping when we conceived this operation. To keep the definitions simple, we eventually chose the former, and defined additional operations (see Section 3.4) to obtain the latter table.

The syntax of a merging assignment statement is T ← merge on $\mathcal{B}$ by $\mathcal{A}$ (R), with $\mathcal{A}$ and $\mathcal{B}$ attribute-set parameters. Applying the merging assignment statement Sales ← merge on Sold by Region (Sales) on the table in the tabular database SalesInfo2 in Figure 1 yields the table in Figure 5, which can be seen as an uneconomical representation of the table in Figure 4, top. This is obtained by essentially "reversing" the steps involved in computing the grouping operation. (The redundancy in the table of Figure 5 can be removed by selecting out the tuples with Sold entry "⊥", an operation that can be simulated using projection, transposition (Section 3.3), and difference.) Applying the above merging assignment statement to the table in Figure 4, bottom, yields a "representation" of the table top which is even more uneconomical. Finally, it must be emphasized that merging is defined on all tables, not just those that can be thought of as having resulted from a grouping. Also, any number of rows (columns) may be named $\mathcal{A}$ ($\mathcal{B}$) in the definition of merge on $\mathcal{B}$ by $\mathcal{A}$ (R).

The syntax of a splitting assignment statement is T ← split on $\mathcal{A}$ (R), with $\mathcal{A}$ an attribute-set parameter.
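The three steps of the grouping operation described above can be sketched as follows. This is only an illustration under simplifying assumptions: both parameters are single attributes rather than attribute sets, each occurs exactly once among the column attributes, a table is a list of rows, and `None` stands in for the null. The function name `group` is hypothetical.

```python
NULL = None  # stands in for the inapplicable null

def group(name, table, by, on):
    """Sketch of 'group by <by> on <on>' for single, uniquely occurring attributes."""
    header, rows = table[0], table[1:]
    by_j = header.index(by, 1)   # column of the grouping attribute
    on_j = header.index(on, 1)   # column whose entries are spread out
    keep = [j for j in range(1, len(header)) if j not in (by_j, on_j)]
    n = len(rows)
    # (1) attribute row: remaining attributes, then one copy of `on` per data row
    out = [[name] + [header[j] for j in keep] + [on] * n]
    # (2) the column headed by `by` becomes the first data row
    out.append([by] + [NULL] * len(keep) + [row[by_j] for row in rows])
    # (3) row i contributes its `on` entry under the i-th copy of `on`
    for i, row in enumerate(rows):
        spread = [NULL] * n
        spread[i] = row[on_j]
        out.append([row[0]] + [row[j] for j in keep] + spread)
    return out

top = [
    ["Sales", "Part", "Region", "Sold"],
    [NULL, "nuts", "east", 50],
    [NULL, "nuts", "west", 60],
    [NULL, "bolts", "east", 70],
]
for row in group("Sales", top, "Region", "Sold"):
    print(row)
```

The result has one Sold column per original data row, which is the uneconomical shape discussed in the text; the redundancy removal operations of Section 3.4 are what bring it closer to the Sales table of SalesInfo2.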

Top:

  Sales   Part     Region   Sold
  ⊥       nuts     east     50
  ⊥       nuts     west     60
  ⊥       nuts     south    40
  ⊥       screws   west     50
  ⊥       screws   north    60
  ⊥       screws   south    50
  ⊥       bolts    east     70
  ⊥       bolts    north    40

Bottom:

  Sales    Part     Sold   Sold   Sold    Sold   Sold    Sold    Sold   Sold
  Region   ⊥        east   west   south   west   north   south   east   north
  ⊥        nuts     50     ⊥      ⊥       ⊥      ⊥       ⊥       ⊥      ⊥
  ⊥        nuts     ⊥      60     ⊥       ⊥      ⊥       ⊥       ⊥      ⊥
  ⊥        nuts     ⊥      ⊥      40      ⊥      ⊥       ⊥       ⊥      ⊥
  ⊥        screws   ⊥      ⊥      ⊥       50     ⊥       ⊥       ⊥      ⊥
  ⊥        screws   ⊥      ⊥      ⊥       ⊥      60      ⊥       ⊥      ⊥
  ⊥        screws   ⊥      ⊥      ⊥       ⊥      ⊥       50      ⊥      ⊥
  ⊥        bolts    ⊥      ⊥      ⊥       ⊥      ⊥       ⊥       70     ⊥
  ⊥        bolts    ⊥      ⊥      ⊥       ⊥      ⊥       ⊥       ⊥      40

Figure 4: Effect of Sales ← group by Region on Sold (Sales).

To see the effect of splitting, consider the splitting assignment statement Sales ← split on Region (Sales) applied to the table in Figure 4, top. The operation results in a set of tables named Sales, all of which have the same attribute row, obtained by removing Region from the attribute row of the original table. In the set, there is one table for each Region entry in the original table. The first data row of each of these tables has (the literal constant) Region as the row name, and the Region entry of the original table with which this table is associated in all other positions of this row. E.g., the table associated with `east' has (Region, east, east) as the first data row. The remaining data rows are projections of data rows of the original table with a matching Region entry. E.g., all rows in the original table with Region = `east' go into the table associated with `east'. The resulting set of tables is the bold part of the tabular database SalesInfo4 in Figure 1.

The syntax of a collapsing assignment statement is T ← collapse by $\mathcal{A}$ (R), with $\mathcal{A}$ an attribute-set parameter. Its effect is that all tables named R are first merged on $\mathcal{A}$ by all the attributes of their scheme. Then, their union is taken in the sense of Section 3.1. Applying Sales ← collapse by Region (Sales) to the tables outlined in bold in the tabular database SalesInfo4 in Figure 1 results in another (uneconomical) "representation" of the table in Figure 4, top, from which top can be obtained by applying the redundancy removal operations defined in Section 3.4.
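The splitting operation described above can be sketched in the same style. The sketch assumes a single splitting attribute that occurs once among the column attributes, a table encoded as a list of rows, and `None` for the null; the name `split` and the order in which the result tables are produced are illustrative choices, not part of the paper.

```python
NULL = None  # stands in for the inapplicable null

def split(name, table, on):
    """Sketch of 'split on <on>': one result table per distinct entry of `on`."""
    header, rows = table[0], table[1:]
    on_j = header.index(on, 1)
    keep = [j for j in range(1, len(header)) if j != on_j]
    values = []
    for row in rows:                     # distinct entries, first-seen order
        if row[on_j] not in values:
            values.append(row[on_j])
    tables = []
    for v in values:
        t = [[name] + [header[j] for j in keep]]   # shared attribute row
        t.append([on] + [v] * len(keep))           # e.g. (Region, east, east)
        t += [[row[0]] + [row[j] for j in keep]    # projections of matching rows
              for row in rows if row[on_j] == v]
        tables.append(t)
    return tables

top = [
    ["Sales", "Part", "Region", "Sold"],
    [NULL, "nuts", "east", 50],
    [NULL, "nuts", "west", 60],
    [NULL, "screws", "west", 50],
    [NULL, "bolts", "east", 70],
]
for t in split("Sales", top, "Region"):
    print(t)
```

All result tables carry the same name, so the number of tables named Sales depends on the instance, as noted for SalesInfo4 in the introduction.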

3.3 Transposition

The tabular algebra contains two transposition operators: transposing (in the strict sense) and switching. The effect of T ← transpose(R) is that, for each table $\rho$ named R, a new table is created by transposing $\rho$ as a matrix and renaming the result T. Hence, column attributes become row attributes, and vice versa. The effect of T ← switch$_V$(R), with $V$ an entry parameter, is that, for each table $\rho$ named R, a new table is created. If $V$ has a unique occurrence in $\rho_>^>$, say $\rho_i^j = V$, then the new table is obtained by swapping rows 0 and $i$, and columns 0 and $j$, and renaming the result as T; if $V$ does not have a unique occurrence in $\rho_>^>$, then the new table is obtained by simply renaming $\rho$ as T.

For each of the operations defined in the tabular algebra, it is now possible to express in the tabular algebra a dual operation obtained by interchanging the roles of rows and columns. Using this technique and switching, it is possible to express a constant selection T ← select$_{A = `V'}$(R), with $A$ an attribute parameter and $V$ an entry parameter.
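A minimal sketch of the two operators, assuming a table is a list of rows. The function names are hypothetical, and `switch_swap` covers only the row and column swap of switching; the final renaming step is deliberately left out of the sketch.

```python
def transpose(name, table):
    """Transpose the table as a matrix and name the result `name`."""
    t = [list(col) for col in zip(*table)]
    t[0][0] = name
    return t

def switch_swap(table, v):
    """Swap step of switching: if v occurs exactly once among the data
    entries, say at (i, j), swap rows 0 and i and columns 0 and j;
    otherwise return an unchanged copy."""
    hits = [(i, j)
            for i in range(1, len(table))
            for j in range(1, len(table[0]))
            if table[i][j] == v]
    t = [row[:] for row in table]
    if len(hits) == 1:
        i, j = hits[0]
        t[0], t[i] = t[i], t[0]
        for row in t:
            row[0], row[j] = row[j], row[0]
    return t

r = [["R", "A", "B"], ["x", 1, 2], ["y", 3, 4]]
print(transpose("T", r))
print(switch_swap(r, 4))
```

Transposition is also what makes dual operations cheap to express: any row-oriented operation can be turned into its column-oriented counterpart by transposing, applying the operation, and transposing back.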

3.4 Removal of Redundancy

The tabular algebra also contains an operation for redundancy removal: clean-up. The effect of T ← clean-up by $\mathcal{A}$ on $\mathcal{B}$ (R), with $\mathcal{A}$ and $\mathcal{B}$ attribute-set parameters, applied to a table $\rho$ with name R, is the following. For each $\mathcal{A}$-subtuple of any tuple in $\rho_>$, do the following. If all tuples $\rho_i$, $i > 0$, with the same $\mathcal{A}$-subtuple and with the row attribute $\rho_i^0$ in $\mathcal{B}$, are subsumed by a common tuple, then replace all these tuples with the "least" such common tuple (where "least" refers to information content). Otherwise, retain the original tuples. The result is named T.

  Sales   Part     Region   Sold
  ⊥       nuts     east     50
  ⊥       nuts     west     60
  ⊥       nuts     north    ⊥
  ⊥       nuts     south    40
  ⊥       screws   east     ⊥
  ⊥       screws   west     50
  ⊥       screws   north    60
  ⊥       screws   south    50
  ⊥       bolts    east     70
  ⊥       bolts    west     ⊥
  ⊥       bolts    north    40
  ⊥       bolts    south    ⊥

Figure 5: Effect of Sales ← merge on Sold by Region (Sales).

The statement Sales ← clean-up by Part on ⊥ (Sales), applied to the table in Figure 4, bottom, results in a table in which the relevant information on nuts, screws, and bolts is grouped in one row for each. The dual operation of clean-up (in the sense of Section 3.3) is called purge. If the statement Sales ← purge on Sold by Region (Sales) is applied to the table resulting from the above clean-up operation, then (the bold part of) the table Sales in SalesInfo2 is obtained. Thus, clean-up can be seen to be a generalization of duplicate (row) elimination, while purge is its dual. Classical union of (the tables representing) two union-compatible relations can be obtained by taking tabular union, followed by applications of purge to eliminate redundant columns, and then clean-up to eliminate duplicate rows.
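The following sketch illustrates clean-up and purge on a small grouped table. It is a deliberate simplification and not the paper's definition: rows are merged position by position (two rows are compatible when they never disagree on a non-null entry in the same column), whereas the paper defines the operation through subsumption and leaves the formal details to the full paper. Single attributes are used instead of attribute sets, and all function names are hypothetical.

```python
NULL = None  # stands in for the inapplicable null

def transpose(name, table):
    t = [list(col) for col in zip(*table)]
    t[0][0] = name
    return t

def merge_rows(rows):
    """Position-wise merge; returns None if two rows disagree on a non-null entry."""
    merged = []
    for column in zip(*rows):
        values = {v for v in column if v is not NULL}
        if len(values) > 1:
            return None
        merged.append(values.pop() if values else NULL)
    return merged

def clean_up(name, table, by, on):
    """Simplified 'clean-up by <by> on <on>': merge the rows whose row
    attribute is `on` and that agree on the columns named `by`."""
    header = table[0]
    by_cols = [j for j in range(1, len(header)) if header[j] == by]
    out, groups = [[name] + header[1:]], {}
    for row in table[1:]:
        if row[0] == on:
            groups.setdefault(tuple(row[j] for j in by_cols), []).append(row)
        else:
            out.append(row[:])          # rows outside the scope are kept as is
    for rows in groups.values():
        merged = merge_rows(rows)
        out.extend([merged] if merged is not None else [r[:] for r in rows])
    return out

def purge(name, table, on, by):
    """Dual of clean-up, obtained by transposing before and after."""
    return transpose(name, clean_up(name, transpose(name, table), by, on))

grouped = [
    ["Sales", "Part", "Sold", "Sold", "Sold"],
    ["Region", NULL, "east", "west", "east"],
    [NULL, "nuts", 50, NULL, NULL],
    [NULL, "nuts", NULL, 60, NULL],
    [NULL, "bolts", NULL, NULL, 70],
]
cleaned = clean_up("Sales", grouped, "Part", NULL)
purged = purge("Sales", cleaned, "Sold", "Region")
for row in purged:
    print(row)
```

On this input the two steps first gather the information on each part in one row and then fold the two columns for region east into one, which mirrors the passage from Figure 4, bottom, to the Sales table of SalesInfo2. Row order in the result is not meaningful, in line with the model's indifference to the order of data rows.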

3.5 Tagging Operations and Iteration

In order to achieve completeness, we introduce into the tabular model the possibility to create new values as well as an iteration construct. Both features are inspired by their counterparts in the relational language FO + while + new described in [3].

The tuple tagging statement T ← tuplenew$_A$(R), with $A$ an attribute parameter, when applied to a table $\rho$ with name R, adds a new column to $\rho$ with column attribute name $A$ containing a distinct new value (chosen non-deterministically from $\mathcal{S}$) for each tuple of $\rho_>$. The result is named T.

The set tagging statement T ← setnew$_A$(R), with $A$ an attribute parameter, when applied to a table $\rho$ with name R, adds a new entry to the attribute row $\rho_0$ with name $A$. The other rows of the new table are obtained by consecutively listing all non-empty subsets of $\rho_>$. Each subset corresponds to a distinct new value in the newly added column named $A$. The resulting table is called T.

Finally, a while program has the form while R ≠ ∅ do P, with P a tabular algebra program (see below) and R a table name parameter. The semantics of such a while program is that, to each combination of tables whose names correspond to the table name parameters in the while program, the tabular program P is applied as long as the table corresponding to R contains a non-empty set of data rows.
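Tuple tagging and the while construct can be sketched as follows. The sketch assumes a database encoded as a list of tables, each table a list of rows; new values are drawn deterministically from a hypothetical generator `fresh_values`, whereas the paper lets them be chosen non-deterministically. None of the function names come from the paper.

```python
import itertools

def fresh_values(used):
    """Yield values of the form 'new0', 'new1', ... that are not in `used`."""
    for n in itertools.count():
        v = f"new{n}"
        if v not in used:
            yield v

def tuple_new(name, table, attr, fresh):
    """Add a column named `attr` holding a distinct new value per data row."""
    out = [[name] + table[0][1:] + [attr]]
    out += [row[:] + [next(fresh)] for row in table[1:]]
    return out

def while_nonempty(db, r, program):
    """Apply `program` as long as some table named `r` has data rows."""
    while any(len(t) > 1 for t in db if t[0][0] == r):
        db = program(db)
    return db

table = [["R", "A"], [None, 1], [None, 2]]
tagged = tuple_new("T", table, "Id", fresh_values({"new0"}))
print(tagged)

def drop_last_row(db):
    """Toy loop body: remove the last data row of every table named R."""
    return [t[:-1] if t[0][0] == "R" and len(t) > 1 else t for t in db]

print(while_nonempty([table], "R", drop_last_row))
```

As with any while construct, termination is the responsibility of the loop body: the toy body above shrinks R on every pass, so the loop stops once R has no data rows left.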

3.6 Definition of Programs in Tabular Algebra

A tabular algebra program consists of a sequence of tabular algebra statements and while statements as defined in the previous paragraphs. Recall that the assignment statements in TA are of the form T ← ⟨operation⟩ ⟨parameter list⟩ (⟨argument list⟩). We now elaborate on the syntax and semantics of the parameters that may occur in the parameter list and the argument list in the right-hand side of an assignment statement. The general syntax of ⟨parameter⟩ is as follows:

  ⟨parameter⟩ ::= ∅ | ⊥ | *_i | ⟨name⟩{, ⟨name⟩} | (⟨parameter⟩, ⟨parameter⟩)
                  − ⊥ | ⟨name⟩ | ⟨name⟩{, ⟨name⟩} | (⟨parameter⟩, ⟨parameter⟩)

A parameter represents an entry or a set of entries, consisting of the interpretations of the items in the "positive" list that are not interpretations of items in the "negative" list. A star, possibly subscripted for distinction, is a wild card. A pair of parameters defines entries in the table under consideration by specifying row and column attribute entries. The parameter in the left-hand side of an assignment statement (resp., in the condition of a while program) may either be a name, or may be a wild card occurring in the right-hand side of the assignment statement (resp., the body of the while program).

The semantics of a tabular algebra program is now as follows. All statements are executed consecutively, starting from an input database. This database is augmented during the computation. Each statement is executed for all combinations of table names in the interpretation of the corresponding parameters. If the statement is an assignment statement whose right-hand side contains a wild card, or if it is a while program whose condition contains a wild card, then that wild card should be interpreted as the corresponding name in the combination of table names under consideration. In each computation, a parameter representing a single column attribute should have a singleton set as interpretation; otherwise the effect of the statement is undefined. A parameter representing a set of column attributes is interpreted in the obvious way.

As is typical in such languages, programs in tabular algebra may produce "scratch" tables during computation. As a result, the names of output tables should be specified as part of the program when simulating transformations, as discussed below.

4 Main Results

In this section, we present the main results of our work. First, we present the completeness results for TA. Then we discuss the embedding of SchemaLog, a syntactic higher-order logical model, in the tabular model. Finally, we discuss the applicability of our model for OLAP-based information systems.

4.1 Completeness of TA

First, we develop some formal notions for use in stating and explaining our results. If $D$ is a tabular database, we call any finite set $N \subseteq \mathcal{N}$ containing all the names of tables in $D$ a scheme for $D$. We denote by $inst(N)$ the set of all tabular databases for which $N$ is a scheme. For a tabular database $D$, $|D|$ will denote the set of symbols occurring in $D$.

Following Van den Bussche et al. [3], we define various morphisms on databases as follows. Two tabular databases $D$ and $D'$ are called isomorphic if there exists a bijection $\varphi : |D| \to |D'|$ (called an isomorphism from $D$ to $D'$) such that (i) $\varphi$ is the identity on the names in $|D|$; (ii) $\varphi$ is the identity on ⊥; and (iii) $\varphi(D)$, where $\varphi$ is extended to tabular databases in the obvious way, and $D'$ are identical up to permutations of the non-attribute rows and permutations of the non-attribute columns of the tables under consideration. If $M \subseteq |D|$, an isomorphism from $D$ to $D'$ is called an $M$-isomorphism if it is the identity on $M$. An automorphism of $D$ is an isomorphism from $D$ to itself. We denote by $Aut(D)$ the automorphism group of $D$.

The following notion of transformation as a database mapping expressing a restructuring operation or a query is inspired by Chandra and Harel [6], Abiteboul and Kanellakis [2], and Andries, Gyssens, Paredaens, Van den Bussche, and Van Gucht [17, 3]. Let $N \subseteq \mathcal{N}$. A transformation $Q$ is a recursively enumerable relation $Q \subseteq inst(N) \times inst(N)$ such that (i) $Q$ is invariant under every permutation of $\mathcal{S}$ that is the identity on $\mathcal{N} \cup \{\bot\}$; (ii) $Q$ is invariant under permutations of non-attribute rows and non-attribute columns in tables; (iii) $Q(D, D')$ implies $|D| \subseteq |D'|$; (iv) $Q(D, D'_1)$ and $Q(D, D'_2)$ imply that $D'_1$ and $D'_2$ are $|D|$-isomorphic; and (v) $Q(D, D')$ implies the existence of a group homomorphism $\mu : Aut(D) \to Aut(D')$ such that, for every $\varphi$ in $Aut(D)$, $\mu(\varphi)$ is an extension of $\varphi$.

Condition (i) is known as genericity and formalizes the intuition that a transformation should not distinguish between non-null values that are not names; condition (ii) says that the order of rows and columns in a table is irrelevant with respect to its meaning; condition (iii) says that the set of database entries can only grow, even if entries no longer occur in a particular table; condition (iv) is known as determinacy and formalizes that transformations are only non-deterministic in the particular choices of new values (created by tagging operations); and, finally, condition (v), known as constructivity, formalizes that new values have to be related to the original values in a certain manner.

The definition of transformation in the tabular model is very close to the definition of transformation in the relational model, for which the language FO + while + new was shown to be complete [3]. To show completeness of the tabular algebra for the notion of transformation given above, we use a reduction argument. Therefore, we first note the following:

Theorem 4.1 The language FO+while+new can be simulated within the tabular algebra.

The critical notion in the reduction argument mentioned above is that of a canonical representation of a tabular database. It also gives rise to a novel proof technique, which is used to show that various related models and languages can be embedded within the tabular model and algebra. Let $D$ be a tabular database over a scheme $N$. A canonical representation of $D$ is a relational database $R$ over the relational database scheme $Rep = \{Data(Tbl, Row, Col, Val),\ Map(Id, Entry)\}$ with the functional dependencies $Id \to Entry$ and $Tbl, Row, Col \to Val$, and satisfying the following property: there exists a table $\tau$ in $D$ with $\tau_{00}$, $\tau_{0i}$, $\tau_{j0}$, and $\tau_{ji}$ in the positions of the table name, a row attribute, a column attribute, and the corresponding entry, respectively, if and only if there exist $id_1$, $id_2$, $id_3$, and $id_4$ for which $(id_1, \tau_{00})$, $(id_2, \tau_{0i})$, $(id_3, \tau_{j0})$, and $(id_4, \tau_{ji})$ are in Map and $(id_1, id_2, id_3, id_4)$ is in Data.

Intuitively, a canonical representation of a database encodes every occurrence in the database as a unique id. The relation Map associates with these ids the entries in the corresponding occurrences, and the relation Data associates each occurrence with the occurrences of the corresponding row attribute, column attribute, and table name. Even though tables in tabular databases may have variable width, the canonical representation always encodes the information into a (relational) database with relations of fixed width. Canonical representations are unique up to the particular choice of occurrence identifiers, justifying the phrase "the" canonical representation.

We say that a program $P$ computes a transformation $Q$ if, whenever $Q(D, D')$, there exists $D''$ such that (i) $(D, D'')$ is an input-output pair of $P$ and (ii) $Q(D, D'')$. Observe that, whenever a program $P$ computes a transformation $Q$, we have that $Q(D, D')$ for every input-output pair $(D, D')$ of $P$.
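To make the encoding concrete, the following is a minimal Python sketch of a canonical representation. It is an illustration under stated assumptions, not the construction used in the proof: a table is modelled as a rectangular list of rows whose entry at position (0, 0) is the table name, whose first row and first column hold the column and row attributes, and only non-attribute positions contribute tuples to Data. The function names `canonical_representation` and `satisfies_fds` are hypothetical.

```python
from itertools import count

def canonical_representation(tables):
    """Encode tables as the relations Data(Tbl, Row, Col, Val) and Map(Id, Entry).

    Assumption: each table is a rectangular list of rows; table[0][0] is the
    table name, table[0][1:] the column attributes, table[i][0] the row attributes.
    """
    fresh = count(1)   # supplies one new id per occurrence
    map_ = set()       # pairs (Id, Entry)
    data = set()       # 4-tuples of ids (Tbl, Row, Col, Val)
    for table in tables:
        ids = [[next(fresh) for _ in row] for row in table]
        for r, row in enumerate(table):
            for c, entry in enumerate(row):
                map_.add((ids[r][c], entry))
        # Assumption: only non-attribute positions (i, j >= 1) yield Data tuples.
        for i in range(1, len(table)):
            for j in range(1, len(table[0])):
                data.add((ids[0][0], ids[i][0], ids[0][j], ids[i][j]))
    return data, map_

def satisfies_fds(data, map_):
    """Check Id -> Entry on Map and Tbl, Row, Col -> Val on Data."""
    return (len({i for i, _ in map_}) == len(map_)
            and len({d[:3] for d in data}) == len(data))

sales = [["Sales", "East", "West"],
         ["nuts", 50, 60],
         ["bolts", 70, 80]]
data, map_ = canonical_representation([sales])
```

Whatever the width of the input table, both relations have fixed arity (four and two columns), which mirrors the remark above about fixed-width relations.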

Lemma 4.2 Let $N$ be a set of names. There exists a program $P_{Rep}$ in the tabular algebra, dependent only upon $N$, such that for every tabular database $D$ with scheme $N$, $P_{Rep}(D)$ yields (the natural representation in the tabular model of) the canonical representation of $D$.

Lemma 4.3 Let $N$ be a set of names. There exists a program $P_{Rep^{-1}}$ in the tabular algebra, dependent only upon $N$, such that for (the natural representation in the tabular model of) an instance $R$ over the relational database scheme $Rep$ in which all named entries belong to $N$, $R = Rep(P_{Rep^{-1}}(R))$.

Although we cannot elaborate on the proof here, it is remarkable that the rather limited number of restructuring operations in the tabular algebra (which were devised for restructuring purposes, not for completeness purposes) is sufficient to establish the above results. We now state our main result:

Theorem 4.4 The tabular algebra programs compute precisely the transformations.

The main idea behind the proof is as follows. Let $Q$ be a transformation such that $Q(D, D')$. Let $Rep(D)$ and $Rep(D')$ be canonical representations. By the above lemmas, we know we can compute $D$ from (a tabular representation of) $Rep(D)$ up to permutations of rows and columns, independent of the particular database $D$. Let $Q^{\uparrow}$ be the corresponding transformation. Similarly, let $Q^{\downarrow}$ be the transformation computing (a tabular representation of) $Rep(D')$ from $D'$, independent of the particular database $D'$. Now the composition $Q^{\downarrow} \circ Q \circ Q^{\uparrow}$, considered as a relational database mapping, is constructive in the sense of [3] and can therefore be expressed by a program $P$ in FO+while+new. Let $P'$ be the corresponding program in the tabular algebra. Then $P_{Rep^{-1}} \circ P' \circ P_{Rep}$ computes $Q$. For the complete details of the proof, we refer to [5].

The proof sketch of our completeness results also yields a normal form for programs and transformations, by going via the canonical representations. It goes without saying, however, that this is not the way to proceed in practice. The tabular algebra is sufficiently rich to compute most of the transformations occurring in practice in a much more direct way. By the same token, not all operations introduced here are necessary to achieve completeness. The point, however, is that we would like tabular algebra to be rich so that useful transformations can be expressed directly at a high level. We anticipate that the traditional, restructuring, and redundancy removal operations, together with transposition, would be sufficient for most useful transformations arising in practice.
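The contrast between the normal form and a direct computation can be sketched in Python, taking transposition as the example. This is only an illustration under assumptions: tables are rectangular lists of rows with at least one non-attribute entry, occurrence ids are simply (row, column) positions, and the relational step chosen to play the role of $P'$ (exchanging the Row and Col columns of Data) is our own choice for the example. The names `p_rep`, `swap_row_and_col`, and `p_rep_inverse` are hypothetical stand-ins for the programs named above, not their actual definitions.

```python
def p_rep(table):
    """Encode one table as (data, map_); ids are (row, col) positions."""
    map_ = {(r, c): entry
            for r, row in enumerate(table)
            for c, entry in enumerate(row)}
    data = {((0, 0), (i, 0), (0, j), (i, j))
            for i in range(1, len(table))
            for j in range(1, len(table[0]))}
    return data, map_

def swap_row_and_col(data, map_):
    """Relational step: exchange the Row and Col columns of Data."""
    return {(t, c, r, v) for (t, r, c, v) in data}, map_

def p_rep_inverse(data, map_):
    """Rebuild a table from (data, map_); assumes data is non-empty."""
    tbl = next(iter(data))[0]
    cells = {(r, c): v for (_, r, c, v) in data}
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    header = [map_[tbl]] + [map_[c] for c in cols]
    body = [[map_[r]] + [map_[cells[r, c]] for c in cols] for r in rows]
    return [header] + body

def transpose_via_rep(table):
    """Normal form: encode, transform relationally, decode."""
    return p_rep_inverse(*swap_row_and_col(*p_rep(table)))

def transpose_directly(table):
    """The same transformation computed in one direct step."""
    return [list(column) for column in zip(*table)]

sales = [["Sales", "East", "West"],
         ["nuts", 50, 60],
         ["bolts", 70, 80]]
```

Both routes produce the same table on `sales`, but the direct version is a single operation, which is the practical point made above.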

4.2 Embedding SchemaLog into the Tabular Model

Lakshmanan et al. [11, 12] proposed a higher-order logic called SchemaLog, and more recently an extension to SQL called SchemaSQL [13], inspired by SchemaLog, for facilitating interoperability in a federation of databases. The SchemaLog data model is essentially the relational model, with the following differences: (i) tuple ids and relation and attribute names are first-class citizens in the SchemaLog data model; and (ii) variable-width relations (e.g., see Figure 1) are possible in the SchemaLog data model. Interestingly, it has been demonstrated that the language of SchemaLog possesses (querying and) restructuring capabilities. These observations suggest that the SchemaLog model has features similar to the tabular model and that a comparison between these models would be worthwhile.

As noted above, SchemaLog was proposed in the context of multi-database interoperability. Our primary concern, however, is restructuring and querying of individual databases. To facilitate a comparison, we therefore consider a "stripped-down" version of SchemaLog appropriate for interaction with individual databases. For convenience, we refer to this language as SchemaLog$_d$. Atomic formulas in SchemaLog$_d$ are of the form

Rel[Tid: Attr $\to$ Value],

with Rel, Tid, Attr, and Value constants or variables, in addition to atoms formed using the standard built-in predicates and programming predicates (see [11]). Clearly, every SchemaLog relation can be readily represented as a table in the tabular model. We were able to prove the following result:

Theorem 4.5 For every program $P$ in SchemaLog$_d$ there is an equivalent program in the tabular algebra.

Due to space restrictions, we refer to [5] for the proof. It may be argued that SchemaLog$_d$ programs essentially express transformations, so Theorem 4.5 is really a corollary of Theorem 4.4. However, such an argument offers little insight into the way such transformations can be simulated in tabular algebra, as the resulting program would be too low level. Our proof in [5] essentially gives a procedure for obtaining the equivalent TA program at a high level. We note that it is a simple matter to extend the tabular model and algebra in a way that accounts for a federation of (tabular) databases. Such an extended language would trivially subsume SchemaLog (without function symbols). Notice that even though function symbols are not directly supported in the tabular model, nothing is lost in terms of expressive power, because of the completeness result (Theorem 4.4).
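The observation that a SchemaLog relation is readily represented as a table can be illustrated with a short Python sketch. It assumes ground facts of the form Rel[Tid: Attr $\to$ Value] given as 4-tuples, at most one value per tuple id and attribute, and `None` standing in for the null when a tuple lacks an attribute (our way of modelling variable-width relations). The function name `atoms_to_tables` and the sample facts are hypothetical.

```python
def atoms_to_tables(atoms):
    """Turn facts (rel, tid, attr, value) into one table per relation.

    Each table is a list of rows: the first row holds the relation name and
    the attribute names, and every later row starts with its tuple id.
    """
    tables = {}
    for rel in sorted({a[0] for a in atoms}):
        facts = {(tid, attr): val for r, tid, attr, val in atoms if r == rel}
        tids = sorted({tid for tid, _ in facts})
        attrs = sorted({attr for _, attr in facts})
        header = [rel] + attrs
        body = [[tid] + [facts.get((tid, attr)) for attr in attrs]
                for tid in tids]
        tables[rel] = [header] + body
    return tables

atoms = {("emp", "t1", "name", "ann"),
         ("emp", "t1", "dept", "toys"),
         ("emp", "t2", "name", "bob")}
tables = atoms_to_tables(atoms)
```

Tuple ids become row attributes and attribute names become column attributes, so both are ordinary table entries, in line with their first-class status in SchemaLog.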

4.3 The Tabular Model as a Fundamental Basis for the OLAP Model

The relational model, while supporting efficient data management and robust on-line transaction processing (OLTP), provides little support for on-line analysis of data [7, 8]. To overcome this deficiency, Codd has recently proposed [7, 14] a data model called OLAP (for on-line analytical processing). Some of the major highlights of the OLAP model are the following: (i) whereas the relational model organizes data along one dimension (i.e., as a set of tuples), the OLAP model allows data to be stored in the form of (n-dimensional) matrices; and (ii) compared to the relational model, the OLAP model permits the efficient computation and storage of summary information on the data. Both features (the first for n = 2) can also be found in the tabular model and, in fact, are already illustrated by the examples in Figure 1.

Some of the drawbacks of OLAP technology as it stands today are the following: (i) unlike the relational model, the OLAP model has no stable theoretical foundation, and many concepts therein are used rather loosely (e.g., see [7, 8, 14]); (ii) no languages comparable to relational algebra or calculus have been developed, and whatever "operations" are referred to in the literature have no clear definition; some loose proposals for SQL-like languages do exist, but they are ad hoc and have no formal basis; and, finally, (iii) the integration with relational technology relies on ad hoc procedures for converting between the two models rather than on any fundamental principles.

The tabular model and language, studied for two dimensions in this paper, can be easily generalized to n dimensions. At a conceptual level, the tabular model is more general than both the relational and OLAP models and can serve as a common ground between them. At an implementation level, OLAP technology can be used as an efficient implementation of the physical scheme associated with a tabular database. Because of the natural fit between (2- or n-dimensional) tables and OLAP matrices, tabular algebra can be used as a fundamental querying and restructuring language for OLAP technology. Tabular algebra, being a complete language, provides a mechanism for restructuring OLAP matrices in all meaningful ways, including relational representations.
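As a small illustration of restructuring a matrix into a relational representation, the Python sketch below unpivots a two-dimensional table into a set of tuples. It is written in plain Python rather than in tabular algebra, and it assumes that the first row carries the table name followed by the column attributes, that every later row starts with its row attribute, and that `None` stands for the null. The function name `matrix_to_relation` and the sample data are hypothetical.

```python
def matrix_to_relation(table):
    """Unpivot a 2-D table into (row attribute, column attribute, value) tuples.

    Null entries (None) are dropped, since a relation simply omits them.
    """
    header = table[0]
    return {(row[0], header[j], row[j])
            for row in table[1:]
            for j in range(1, len(header))
            if row[j] is not None}

sales = [["Sales", "East", "West"],
         ["nuts", 50, None],
         ["bolts", 70, 80]]
relation = matrix_to_relation(sales)
```

The matrix organizes the data along two dimensions, while the resulting relation lists the same facts along one dimension, as a set of tuples.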

5 Summary and Future Research

We proposed the tabular database model, which is simple but powerful in allowing a very broad class of natural data representations. We developed tabular algebra and showed that it is generic and complete for querying and restructuring of tables. Well-known data models such as GOOD and (a large fragment of) SchemaLog can be embedded within the tabular model. The tabular model and algebra, while rich enough in expressive power to allow all transformations to be expressed, also provide simple and intuitive operations to express transformations in high-level terms. Tabular algebra (in its 2- and n-dimensional versions) can serve as a complete querying and restructuring language for OLAP, a technology with great application potential and of considerable interest to the database community, which has, until recently, lacked clear foundations.

Tabular algebra is presently being implemented on top of Microsoft Access and Excel, providing a seamless integration of relations and spreadsheets. In ongoing work, we are developing additional derived operations in an effort to enhance the "expressive power" of tabular algebra in allowing high-level expression of transformations. Query (and program) optimization is an important issue. In the direction of OLAP, tabular algebra covers only the aspect of restructuring. We are presently working on operations corresponding to classification and summarization, two other important functionalities for OLAP. We also intend to extend our implementation above to cover all OLAP functionalities.

References

[1] ACM. ACM Computing Surveys, volume 22, Sept 1990. Special issue on HDBS.

[2] S. Abiteboul and P. Kanellakis. Object identity as a query language primitive. In J. Clifford, B. Lindsay, and D. Maier, editors, Proceedings of the 1989 ACM SIGMOD International Conference on the Management of Data, volume 18:2 of SIGMOD Record, pages 159-173. ACM Press, 1989. Full version to appear in Journal of the ACM.

[3] J. Van den Bussche, D. Van Gucht, M. Andries, and M. Gyssens. On the completeness of object-creating database transformation languages, 1994. Manuscript. Preliminary version appeared in FOCS'92 and PODS'90.

[4] J. Van den Bussche, D. Van Gucht, M. Andries, and M. Gyssens. On the completeness of object-creating query languages. 33rd Symposium on Foundations of Computer Science, 1992.

[5] M. Gyssens, L.V.S. Lakshmanan, and I.N. Subramanian. Tables As a Paradigm for Querying and Restructuring. Technical report, Concordia University, Montreal, Nov 1995.

[6] A.K. Chandra and D. Harel. Computable queries for relational data bases. Journal of Computer and System Sciences, 21:156-178, 1980.

[7] E.F. Codd, S.B. Codd, and C.T. Salley. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate, 1995. White paper. URL: http://www.arborsoft.com/papers/coddTOC.html.

[8] R. Finkelstein. Understanding the need for on-line analytical servers, 1995. White paper. URL: http://www.arborsoft.com/papers/finkTOC.html.

[9] M. Gyssens, J. Paredaens, and D. Van Gucht. A graph-oriented object database model. In ACM Symp. Principles of Database Systems, pages 417-424, 1990.

[10] A.R. Hurson, M.W. Bright, and S. Pakzad. Multidatabase Systems: An Advanced Solution for Global Information Sharing. IEEE Computer Society, Los Alamitos, CA, 1994. Collection of papers.

[11] L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian. On the logical foundations of schema integration and evolution in heterogeneous database systems. In Proc. 3rd International Conference on Deductive and Object-Oriented Databases (DOOD '93). Springer-Verlag, LNCS-760, December 1993.

[12] L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian. Logic and Algebraic Languages for Interoperability in Multi-database Systems. Technical report, Concordia University, Montreal, March 1995. (Accepted to Journal of Logic Programming, February 1996.)

[13] L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian. SchemaSQL - A Language for Querying and Restructuring Multidatabase Systems. Technical report, Concordia University, Montreal, February 1996. (Submitted for publication.)

[14] Pilot Software. An introduction to OLAP, 1995. URL: http://www.pilotsw.com//pilot013.htm.

[15] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, Reading, MA, 1995.

[16] J. Han. Personal communication, Sept 1995.

[17] M. Andries and J. Paredaens. On instance-completeness of database query languages involving object creation. Journal of Computer and System Sciences. To appear. See also "A language for generic graph-transformations", Lecture Notes in Computer Science vol. 570, pp. 63-74.
