Towards a Theory of Bulk Types

David A. Watt and Phil Trinder

Abstract

A database programming language can model application domains most naturally if it supports several bulk types, e.g., lists, sets, and relations. Indeed some persistent programming languages permit the programmer to define new bulk types that are appropriate to the application domain. Such a richly typed language tends to be complex, since constructs must be provided to declare, construct, inspect, and update instances of every bulk type. The collection theory presented here controls the complexity of such richly typed languages by exploiting operations and properties common to a variety of bulk types. The theory is based on four operations (three constructor operations and one iterator) that obey certain algebraic laws. In addition, a rich set of additional operations can be defined in terms of the basic operations. Sets, bags, lists, certain trees, relations, and finite mappings are all encompassed by the collection theory. Conversely, types that we would not intuitively classify as bulk types are excluded by the theory. Collection types are uniformly equipped with comprehensions, a convenient and powerful query notation that is fully integrated into the database programming language.

Keywords: Database Programming Languages, Bulk Types, Queries, Algebraic Specification, Comprehensions, Ringads, Monads.

This work was supported by both the SERC Bulk Type Constructors Project and the Esprit FIDE Project (BRA 3070). Authors' address: Computing Science Department, University of Glasgow, Glasgow, Scotland. Email: {daw,[email protected]. Note: This is an extended version of a paper submitted for publication.


1 Introduction

Several database programming languages (DBPLs) are emerging that enable data-intensive applications to be created with greater ease [1, 3, 6, 10, 11, 12, 13]. In order to represent and manipulate large amounts of data, a DBPL must support one or more bulk types. Values of a bulk type are large and typically variable-sized. Examples of bulk types are lists, sets, relations, and finite mappings.

Most conventional database management systems and some DBPLs support a single bulk type. For example, the relation is supported by System R, Ingres, and the DBPL language [2, 9, 13]. This forces application programmers to model everything in terms of the single bulk type, which may be contrived. In a travel agency, for example, the people on a tour would be most naturally modelled by a set: nobody should be booked on the tour twice, and there is no obvious order amongst people. In contrast, the places visited on a tour are most naturally modelled by a list: a place may be visited twice, and the order of visits is important. In order to model application domains more naturally, therefore, some DBPLs permit the use of different bulk types [3, 10, 11].

Unfortunately, supporting several bulk types may complicate the language because there must be a means of declaring, constructing, inspecting, and updating instances of each type. The problem is exacerbated in languages, such as those with orthogonal persistence, that permit the programmer to define new bulk types. Ideally the DBPL should be extended in a uniform way to support efficient manipulation and interrogation of instances of the new bulk types.

We believe that the complexity of a DBPL can be controlled if operations and properties common to a variety of bulk types can be identified and exploited. This report presents the initial results of a search for common operations and properties. The theory comprises four basic operations on bulk types, and laws that these operations must obey. A bulk type that satisfies the theory is termed a collection type. Sets, bags, lists, certain trees, relations, and finite mappings can be classified as collections. Conversely, types that we would not intuitively classify as bulk types, such as single cells, are not collection types.

The collection operations are specified algebraically. Three of the basic operations are constructor operations, and the fourth is an iterator. A much larger set of useful operations can be specified in terms of the basic operations. The laws of the collection theory are concisely captured by describing a collection type constructor and associated operations as a ringad in category theory. (However, no category theory is presented here.)

The basic operations are sufficient to define comprehensions, a programming language construct that also serves very well as a query language. Moreover, comprehensions provide a uniform query notation over all collection types. This report summarises arguments made in [16] that not only do comprehensions allow queries to be expressed clearly, concisely, and efficiently, but they also combine computational power with ease of optimisation. It is believed that a DBPL compiler can also be easily extended to translate and optimise queries over programmer-defined collections if the basic operations are provided.

The primary advantage of collection theory is that it provides a uniform set of operations and query notation for a variety of bulk types. The theory also simplifies the DBPL compiler by providing uniform translation and optimisation schemes for queries over these bulk types. The theory allows for the automatic extension of a DBPL to incorporate queries over programmer-defined bulk types in a uniform and efficient manner. Moreover, the theory provides a mathematical model that has given insights into how new bulk types should be defined.

The remainder of the paper is structured as follows. Section 2 describes the basic operations, and gives some examples of bulk types with these operations. Section 3 describes comprehension query notation. Section 4 shows how comprehensions are defined in terms of the basic operations, and gives the laws that the operations must obey. Section 5 gives more examples of types that satisfy the theory. Section 6 assesses how well the theory meets the requirements of a bulk type. Section 7 concludes. Appendix A presents a prototyping mechanism for collection types. Appendix B is a summary of the theory of monads and ringads.

2 Collections

A data type is termed a collection type if it is equipped with four operations (empty, single, ⊕, and iter) that obey certain laws. Values of a collection type are called collections. A collection consists of zero or more elements. Every collection is homogeneous, i.e., all elements have a common type, termed the element type of the collection.

Several programming languages provide specific kinds of collection, such as α set in Machiavelli [12], and both α set and α list in O2 [10], where in each case α stands for the element type. Our objective, however, is to identify operations and properties common to all kinds of collection. Therefore we shall abstract from specific kinds of collection, and refer instead to α collection, the type of collections (of unspecified kind) with elements of type α. α list and α set may be viewed as instances of α collection, just as int list and bool list may be viewed as instances of α list.

2.1 Basic Operations

The basic operations required to construct and manipulate collections are summarised in Table 1. (Our convention is to use c to stand for an arbitrary collection of the given type, and x to stand for an element of a collection.) The values of type α collection are exactly those that can be constructed using just the first three operations. Note that all of the occurrences of collection in a type signature denote the same kind of collection. For example, an instance of ⊕ may have type int list × int list → int list, but not int list × int set → int set.

Before describing the operations more fully we introduce the following syntactic sugar:

    [ ]           = empty
    [x]           = single x
    [x1; ...; xn] = [x1] ⊕ ... ⊕ [xn]

Table 1: Basic operations

    Operation   Behaviour                                                    Type
    empty       Builds an empty collection.                                  α collection
    single x    Builds a collection containing a single element, x.          α → α collection
    c ⊕ c′      Builds a collection containing the elements of c and c′.     α collection × α collection → α collection
    iter f c    Iterates over c applying the multi-valued function f to      (α → β collection) → α collection → β collection
                every element, and combines the resulting collections.

The notation [x1; ...; xn] is called an enumeration, and is useful only if ⊕ is associative. If it is not clear from the context, we decorate the brackets with the name of a particular kind of collection. Thus [ ]set, [2]set, and [2; 3; 5; 7; 11]set are sets, but [ ]list, [2]list, and [2; 3; 5; 7; 11]list are lists.

The operation c ⊕ c′ (pronounced `c combined with c′') combines the two collections in such a way that every element of c, and every element of c′, has a counterpart in the combined collection. We shall make this more precise by stating laws that ⊕ must satisfy. A sensible restriction is that combining with the empty collection has no effect:

    c ⊕ [ ] = c                                                   (1)
    [ ] ⊕ c = c                                                   (2)

or (using algebraic terminology) [ ] is a unit of ⊕.

The operation iter f c iterates over c applying f to every element. The function f is multi-valued, i.e., when applied to an element it returns a collection. The resulting collections are combined together to produce the result of iter:

    iter f [x1; ...; xn] = f x1 ⊕ ... ⊕ f xn

For example, if factors is a function that returns the set of prime factors of an integer, then:

    iter factors [8; 9; 10]set = factors 8 ⊕ factors 9 ⊕ factors 10
                               = [2]set ⊕ [3]set ⊕ [2; 5]set
                               = [2; 3; 5]set

We specify the behaviour of iter more formally by the following equations:

    iter f [ ]      = [ ]                                         (3)
    iter f [x]      = f x                                         (4)
    iter f (c ⊕ c′) = iter f c ⊕ iter f c′   if c ⊕ c′ ≠ ⊥        (5)

where f is an arbitrary function of type α → β collection. The condition on (5) will be explained in Section 5; it can be ignored for the time being.
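
To see how the laws and the syntactic sugar interact, here is a small worked derivation (ours, not in the original text) of iter on a two-element set, assuming the combination is defined so that the condition on (5) holds:

    iter f [x1; x2]set = iter f ([x1]set ⊕ [x2]set)        by the enumeration sugar
                       = iter f [x1]set ⊕ iter f [x2]set   by (5)
                       = f x1 ⊕ f x2                       by (4)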

2.2 Definition

We are now in a position to make a precise definition of collections. A collection type is a parameterised type α collection equipped with operations empty, single, ⊕, and iter (with the types specified in Table 1), which satisfy laws (1)-(5) above. We specialise the collection theory to a particular kind of collection by stating any additional laws that must be satisfied by the four operations. In doing so we must ensure that the additional laws are consistent, in the algebraic specification sense, with laws (1)-(5). As an illustration a consistency proof is given for an example collection in the next section.

2.3 Example Collections

Sets

A set, conventionally written in the style {2, 3, 5, 7, 11}, is written in our notation as [2; 3; 5; 7; 11]set. The set operation ⊕ is just set union, and has several properties:

- It is associative, because the order of combination does not matter. For example, ([2; 3] ⊕ [5]) ⊕ [7; 11] = [2; 3] ⊕ ([5] ⊕ [7; 11]).
- It is idempotent, because duplicates do not count. For example, [2; 3] ⊕ [2; 3] = [2; 3].
- It is commutative, because the order of elements does not matter. For example, [2; 3] ⊕ [5; 7; 11] = [5; 7; 11] ⊕ [2; 3].

Hence, for sets to be collections we must show that the idempotence, commutativity, and associativity of ⊕ are consistent with laws (1)-(5). The consistency is easily proven. For example, to show that a commutative ⊕ (i.e., c1 ⊕ c2 = c2 ⊕ c1) is consistent with law (5):

    iter f (c1 ⊕ c2) = iter f (c2 ⊕ c1)
                     = iter f c2 ⊕ iter f c1
                     = iter f c1 ⊕ iter f c2

And to show that an idempotent ⊕ (i.e., c ⊕ c = c) is consistent with law (5):

    iter f (c ⊕ c) = iter f c
                   = iter f c ⊕ iter f c

Finally, to show that an associative ⊕ (i.e., c1 ⊕ (c2 ⊕ c3) = (c1 ⊕ c2) ⊕ c3) is consistent with law (5):

    iter f (c1 ⊕ (c2 ⊕ c3)) = iter f ((c1 ⊕ c2) ⊕ c3)
                            = (iter f c1 ⊕ iter f c2) ⊕ iter f c3
                            = iter f c1 ⊕ (iter f c2 ⊕ iter f c3)
                            = iter f c1 ⊕ iter f (c2 ⊕ c3)

Table 2: Additional laws for specific kinds of collection

    Law                  α set   α bag   α list   α tree
    ⊕ is associative     Y       Y       Y        N
    ⊕ is commutative     Y       Y       N        N
    ⊕ is idempotent      Y       N       N        N

Lists

A list can be written in our notation as [2; 3; 5; 7; 11]list. The list operation ⊕ satisfies different laws from the set operation ⊕:

- It is associative, because order of combination does not matter. For example, ([2; 3] ⊕ [5]) ⊕ [7; 11] = [2; 3] ⊕ ([5] ⊕ [7; 11]).
- It is non-idempotent, because duplicates do count. For example, [2; 3] ⊕ [2; 3] = [2; 3; 2; 3] ≠ [2; 3].
- It is non-commutative, because the order of elements does matter. For example, [2; 3] ⊕ [5; 7; 11] ≠ [5; 7; 11] ⊕ [2; 3].

Other Kinds of Collection

Binary trees (with elements at the leaves only) and bags are also collections. In a sense, sets, bags, lists, and trees are the fundamental kinds of collection. They are distinguished from one another by the simple properties of associativity, commutativity, and idempotence of their ⊕ operations, as summarised in Table 2. Other kinds of collection include relations and finite mappings. These are essentially set-like collections, but with certain additional operations and special properties. We shall return to discuss them in Section 5.

A Counter-example

As a counter-example, consider a parameterised type α cell with the property that a cell is either empty or contains a single element. The cell operation ⊕ `forgets' all but the latest element added, e.g., [2; 3; 5]cell = [2]cell ⊕ [3]cell ⊕ [5]cell = [5]cell. We can capture this property by asserting that the cell operation ⊕ satisfies the additional law:

    c ⊕ c′ = c′   if c′ ≠ [ ]

It is easy to show that α cell is not a collection type, since this additional property contradicts law (5). Let f be the function that maps 0 to [0]cell and all other integers to [ ]cell. Then iter f ([0]cell ⊕ [3]cell) = iter f [3]cell = [ ]cell, but iter f [0]cell ⊕ iter f [3]cell = [0]cell ⊕ [ ]cell = [0]cell. The cell operation ⊕ destroys information, which is precisely why we do not classify cells as a kind of collection.

2.4 Derived Operations

A bulk type should support a rich set of operations to simplify the application programmer's task. Although it might seem that different kinds of collection such as lists, bags, and sets have few properties in common, there are surprisingly many operations common to all of them. Moreover, these derived operations can be specified in terms of the basic operations, in a uniform manner independent of the particular kind of collection. For example, each kind of collection requires an operation to test for emptiness, and the properties of this operation can be specified algebraically as follows:

    is_empty            : α collection → bool
    is_empty [ ]        = true
    is_empty [x]        = false
    is_empty (c ⊕ c′)   = (is_empty c) ∧ (is_empty c′)

A more interesting example is the operation that tests whether a given value is an element of a given collection:

    is_in               : α → α collection → bool
    is_in y [ ]         = false
    is_in y [x]         = (x = y)
    is_in y (c ⊕ c′)    = (is_in y c) ∨ (is_in y c′)

More complicated operations can be specified as higher-order functions. One of the operands is itself a function, which is applied to each element of a given collection. For example, filter is an operation that iterates over a collection, retaining only those elements that satisfy a boolean-valued function p:

    filter              : (α → bool) → α collection → α collection
    filter p [ ]        = [ ]
    filter p [x]        = if p x then [x] else [ ]
    filter p (c ⊕ c′)   = (filter p c) ⊕ (filter p c′)
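
For instance (a small worked example of ours, not in the original text), unfolding these equations over a three-element set gives:

    filter odd [1; 2; 3]set = filter odd ([1]set ⊕ [2]set ⊕ [3]set)
                            = [1]set ⊕ [ ]set ⊕ [3]set
                            = [1; 3]set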

Another iterator operation is map. This operation iterates over a collection, applying a function f separately to each element of the collection, and forms a new collection of the results:

    map                 : (α → β) → α collection → β collection
    map f [ ]           = [ ]
    map f [x]           = [f x]
    map f (c ⊕ c′)      = (map f c) ⊕ (map f c′)

Table 3: Derived operations

    Operation            Behaviour                                                            Type
    is_empty c           Tests whether c is empty.                                            α collection → bool
    size c               Gives the number of elements of c.                                   α collection → int
    is_in x c            Tests whether x is an element of c.                                  α → α collection → bool
    all p c              Tests whether every element of c satisfies the predicate p.          (α → bool) → α collection → bool
    any p c              Tests whether at least one element of c satisfies the predicate p.   (α → bool) → α collection → bool
    filter p c           Gives a collection containing only those elements of c that          (α → bool) → α collection → α collection
                         satisfy the predicate p.
    map f c              Iterates over c applying the single-valued function f to             (α → β) → α collection → β collection
                         every element.
    collect cc           Flattens the collection of collections cc into a single              (α collection) collection → α collection
                         collection of the same kind.
    reduce (e, f, g) c   See explanation in the text.                                         β × (α → β) × (β × β → β) → α collection → β
    choose c             Selects an element from c.                                           α collection → α
    remove x c           Gives the collection obtained by removing the element x from c.      α → α collection → α collection
    subsumes (c, c′)     Tests whether every element of c′ is an element of c.                α collection × α collection → bool
    intersect (c, c′)    Gives a collection containing the common elements of c and c′.       α collection × α collection → α collection
    difference (c, c′)   Gives a collection containing those elements of c not in c′.         α collection × α collection → α collection
    min c                Gives the `least' element of c.                                      α collection → α
    max c                Gives the `greatest' element of c.                                   α collection → α
    avg c                Gives the `average' element of c.                                    num collection → num

It is important to note that these equations specify only the properties of the operations. They do not necessarily suggest an efficient implementation.

Table 3 summarises these and other derived operations that we consider useful. If further useful operations are identified, we think it likely that they can be specified in a similar way. Some of the operations rely on the existence of additional operations on the element type. For example, is_in relies on equality being defined on the element type. The semantics of the derived operations is non-destructive. That is to say, each operation preserves the original collection(s), but may construct a new collection reflecting a modification.

The specifications of the derived operations are very regular. Each operation is specified by three equations, one for each of the basic constructor operations. We can exploit this regularity by means of the operation reduce. This operation takes a triple of arguments representing the actions to be performed in each of the three cases:

    reduce                     : β × (α → β) × (β × β → β) → α collection → β
    reduce (e, f, g) [ ]       = e
    reduce (e, f, g) [x]       = f x
    reduce (e, f, g) (c ⊕ c′)  = g (reduce (e, f, g) c, reduce (e, f, g) c′)

The derived operations can all be defined in terms of reduce, e.g.:

    is_empty  = reduce (true, (λx. false), ∧)
    is_in y   = reduce (false, (λx. x = y), ∨)
    filter p  = reduce ([ ], (λx. if p x then [x] else [ ]), ⊕)
    map f     = reduce ([ ], single ∘ f, ⊕)
    iter f    = reduce ([ ], f, ⊕)
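
Other entries of Table 3 can be defined in the same way; for example (our illustration, mirroring the prototype in Appendix A), size is reduce (0, (λx. 1), +), so that

    size [2; 3; 5]list = 1 + 1 + 1 = 3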

Machiavelli's hom function has identical semantics to reduce, but is defined only over lists. Both map and reduce provide similar functionality to the MAP and REDUCE operators found in TRPL [14]. Indeed the TRPL work demonstrates that map and reduce can be automatically generated for any recursive parametric type using reflection.

3 Query Notation

The operations specified in Section 2 are general enough for us to express arbitrary queries on collections. Nevertheless, they do not provide us with a convenient query notation. For this purpose we introduce comprehensions, a notation that is concise enough for the interactive user and at the same time suitable to be integrated smoothly into a DBPL.

Here we shall show that comprehensions can be used uniformly for all kinds of collection. Thus we can design a DBPL with several kinds of collection, each equipped with a query notation, without overdoing the language's syntactic complexity. We can even allow a programmer to define new kinds of collection, which will automatically be equipped with their own query notation.

3.1 Comprehensions

For brevity, comprehension (or ZF) notation is not formally described here, merely illustrated by example. A full description is given in [20]. A set comprehension describing the set of squares of all the odd numbers in a set s is conventionally written:

    {square x | x ∈ s ∧ odd x}

We shall use the following notation:

    [square x | x ← s; odd x]set

The notation generalises to comprehensions over arbitrary collections. For example, if l is a list of numbers, the corresponding list comprehension is written:

    [square x | x ← l; odd x]list

These comprehensions will respect the properties of sets and lists, respectively. For example, the result of the set comprehension will contain elements in no particular order, and no duplicates, whereas the result of the list comprehension will contain elements in a well-defined order, possibly with duplicates.

The syntax of collection comprehensions is as follows, where E stands for an expression, Q stands for a qualifier, P stands for a pattern, and Λ stands for an empty qualifier:

    E ::= ... | [E | Q]
    Q ::= E | P ← E | Λ | Q; Q

The result of evaluating the comprehension [E | Q] is a new collection, computed from one or more existing collections of the same kind. The elements of the new collection are determined by repeatedly evaluating E, as controlled by the qualifier Q. A qualifier is either a filter, E, or a generator, P ← E, or a sequence of these. A filter is just a boolean-valued expression, expressing a condition that must be satisfied for an element to be included in the result. An example of a filter was odd x above, ensuring that only odd values of x are used in computing the result. A generator of the form V ← E, where E is a collection-valued expression, makes the variable V range over the elements of the collection. An example of a generator was x ← s above, making x range over the elements of the set s. More generally, a generator of the form P ← E contains a pattern P that binds one or more new variables to components of each element of the collection.

Queries

Database queries are easily expressed as collection comprehensions. Consider the following database with three relations used by Ullman [19]:

    members (name, address, balance)
    orders (order no, oname, oitem, quantity)
    suppliers (sname, saddress, sitem, sprice)

We shall view each relation as a set of tuples. For example, members is a set of (name, address, balance) triples. The query "Give the names of the members with negative balances" can be written as the set comprehension:

    [name m | m ← members; balance m < 0]set

The query works in a straightforward manner. Each tuple in members is retrieved by m ← members. If the balance attribute (balance m) is negative, then the name attribute (name m) is included in the result.

The query "Give the supplier names, items, and prices of all the suppliers that supply at least one item ordered by Brooks" can be written:

    [(sname s, sitem s, sprice s) | o ← orders; oname o = `Brooks'; s ← suppliers; oitem o = sitem s]set

Selector functions are used above to locate the attributes of tuples so that any attribute not relevant to the query can be ignored. This is a substantial advantage for real databases that contain relations with many attributes. An alternative way of writing queries is to provide a pattern that matches all of the attributes. The first query can be written in this style as follows:

    [name | (name, address, balance) ← members; balance < 0]set

Iteration

Comprehensions are also a concise notation for iterating over collections. The comprehension:

    [f x | x ← c]

applies an arbitrary function f (of the correct type) to every element of the collection c. It is also possible to manipulate only those elements of the collection that satisfy some condition. For example, to dock the pay of selected employees:

    let dock (name, salary) = if salary > 20000 then (name, salary - 2000)
                              else (name, salary)
    in [dock (n, s) | (n, s) ← employees]
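
The same iteration can be prototyped directly with the derived operations of Section 2.4. A minimal sketch (ours, not from the paper), using the List structure of Appendix A and an assumed list of (name, salary) pairs:

    val employees = List.enum [("Smith", 25000), ("Jones", 18000)]
    fun dock (name, salary) = if salary > 20000 then (name, salary - 2000)
                              else (name, salary)
    val docked = List.map dock employees    (* the comprehension above, expressed via map *)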

3.2 Requirements

A query notation should be concise, clear, efficient, and well integrated with the DBPL. In [16] it was argued that comprehensions meet these requirements. The essence of the argument is as follows. Comprehensions are concise because they are a declarative specification of the query. Comprehensions are clear because of their similarity to the relational calculus. In [20] the efficiency of list comprehensions was proved by showing that they perform the minimum number of cons operations required to produce the result list. (It remains to be seen whether this is true for comprehensions over all kinds of collection.) Comprehensions can be smoothly integrated into a DBPL because they are based on the lambda calculus, which is the basis of expressions in most programming languages.

3.3 Power and Optimisability

While query notations based on the relational algebra are easily optimised, they are not computationally complete: neither computation nor recursion can be expressed. Conversely, queries expressed as procedures in a programming language can perform arbitrary computation, but are hard to optimise. This section demonstrates the power of comprehensions, and shows that they can be optimised. It has also been shown in [15] that computational and recursive queries can be optimised.

3.3.1 Power

A query notation is said to be relationally complete, i.e., adequately expressive, if it is at least as powerful as the relational calculus. It was shown in [17] that any relational calculus query can be translated into an equivalent (list) comprehension expression. The equivalent comprehension could contain the operations ⊕, intersect, or is_empty, all of which are defined in our theory. Indeed comprehension notation, combined with recursive functions, has greater power than the relational calculus as it can express both recursion and computation [17].

As a short example, consider Date's recursive bill-of-materials query. The problem is stated as follows [7]. Given a relation:

    parts (main component, sub component, quantity)

determine the set of all component and sub-component parts of a given main part (to all levels). A solution using comprehensions is a recursive function explode with a single argument main, the main part:

    explode main = [p | (m, s, q) ← parts; m = main; p ← ([s] ⊕ explode s)]set

The explode function works as follows. Each tuple in the parts relation is obtained by (m, s, q) ← parts. If a tuple's main component is the assembly being exploded (m = main), then the parts p returned are the subassembly itself (s) and its subcomponents (explode s). This solution is considerably briefer than the 27-line SQL solution Date presents. It is also arguably clearer.
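
A runnable approximation (our sketch, not from the paper) desugars this comprehension into the basic operations of the Set structure of Appendix A; parts is assumed to be a set of (main component, sub component, quantity) triples passed as an explicit argument:

    (* For each (m, s, q) in parts with m = main, contribute s together with
       everything reachable from s; other tuples contribute the empty set. *)
    fun explode parts main =
      Set.iter (fn (m, s, q) =>
                  if m = main
                  then Set.++ (Set.single s, explode parts s)
                  else Set.empty)
               parts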

Atkinson and Buneman have described a more complicated bill-of-material query that entails both computation and recursion. This is also easily expressed as a comprehension [15].

3.3.2 Optimisability

It is important that a simple specification of a query can be transformed into a version that can be evaluated efficiently. For each of the well-known optimisation strategies on relational queries, there exists an equivalent (list) comprehension transformation [18]. Each of these transformations has also been proved for collection comprehensions in general, i.e., using just the collection laws. Some of the transformations rely on the collection kind having additional properties (such as commutativity of ⊕, as in the example below). To give a flavour of comprehension optimisation, a transformation equivalent to several relational algebra identities is given here.

Filter promotion (or selection promotion) is an algebraic improvement that Ullman [19] identifies as the most important. Filtering as early as possible reduces the size of the intermediate results by discarding elements that are not required. Filter promotion is achieved using a comprehension identity that allows the interchange of qualifiers. Qualifier interchange states that any two qualifiers can be swapped, provided that neither qualifier refers to variables bound in the other, and provided that ⊕ is commutative for the particular kind of collection under consideration. (The latter condition is satisfied by bags, sets, and relations.) This may be stated:

    [E | Q; Q′] = [E | Q′; Q]

For example, the query:

    [a | (n, b) ← names and bdates; (n′, a) ← names and addrs; b = 1970; n = n′]set

can be improved by applying qualifier interchange to `(n′, a) ← names and addrs' and `b = 1970' to yield:

    [a | (n, b) ← names and bdates; b = 1970; (n′, a) ← names and addrs; n = n′]set

This is considerably more efficient than the original query. Qualifier interchange is a generalisation of filter promotion, since it allows us to change the order of generation as well as the order of filtration. This generality is also reflected by the fact that qualifier interchange is analogous to several relational algebra identities. These are the identities governing the commuting of products and selections.

4 Monads and Ringads

Wadler [21] has studied the semantics of comprehensions, and shown how they can be understood in terms of monads in category theory. The monad is a very general concept, however: too general for our purposes in characterising bulk types. It is necessary therefore to specialise monads by additional operations and laws. Wadler has dubbed the resulting structures ringads. In this paper we prefer to avoid reliance on category theory. Instead we have characterised collection types directly by laws (1)-(5). However, it is important to establish that our theory is consistent with Wadler's, and we have done so.

4.1 Monads

Comprehensions can be defined in terms of just three of the collection operations, empty, single, and iter:

    [E | Λ]          = single E
    [E | E′; Q]      = if E′ then [E | Q] else empty
    [E | V ← E″; Q]  = iter (λV. [E | Q]) E″

where E′ stands for a boolean-valued expression and E″ for a collection-valued expression.
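
As an illustration (ours, not in the original text), applying these rules to the set comprehension of Section 3.1 unfolds it into the basic operations:

    [square x | x ← s; odd x]set
      = iter (λx. [square x | odd x]set) s
      = iter (λx. if odd x then [square x | Λ]set else empty) s
      = iter (λx. if odd x then single (square x) else empty) s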

Wadler [21] has stated certain laws that must be obeyed by the operations empty, single, and iter, in order that comprehensions have the properties we expect. He has also pointed out that these laws are just those of a monad with a zero in category theory. The implications of this result are explored in [16], and only summarised here.

For example, we would expect that inserting a value into a collection should not change the value inserted. If f is an observer function of type α → β collection, then we can formulate this, in the case of a singleton collection, by the following monad law:

    iter f (single x) = f x

This is the same as law (4) of collection theory. If f : α → β collection, g : β → γ collection, and ∘ denotes function composition, then the monad law:

    iter g (iter f c) = iter ((iter g) ∘ f) c

states that the following computations yield the same result:

- iterating over a collection twice, applying f on the first iteration and g on the second iteration;
- iterating over the collection once, applying f to each element, and iterating over the resulting collection applying g.

This monad law can be proved by structural induction using laws (3)-(5) of collection theory. A full set of monad laws is given in Appendix B and in [21]. The laws cited above are monad laws (7) and (8), expressed in a slightly modified form. We have proved that all the monad laws are consequences of the laws of collection theory. Thus all kinds of collection are monads, including sets, bags, lists, and certain

trees, and we can use comprehensions for all kinds of collection. This provides us with a uniform query notation with the desirable properties outlined in Section 3. The translation rules above provide a uniform compilation scheme for comprehensions over all kinds of collection, thus simplifying a compiler for a DBPL that supports several kinds of collection. Moreover, many optimisations can be proved using the monad laws, so a single optimiser can improve comprehension queries over all kinds of collection. Finally, the DBPL can be automatically extended to translate and optimise queries over programmer-defined collections, if the programmer defines the empty, single, and iter operations for each new kind of collection.

4.2 Ringads

The monad operations alone are insufficient for collections. They allow existing collections to be manipulated, but allow only empty and singleton collections to be constructed (by empty and single, respectively). To construct larger collections, an operation ⊕ is required. A ⊕ can be added to the monad in a consistent manner, and the resulting structure has been termed a ringad [22]. A ringad is defined to be a monad (collection, single, iter) together with a pair (⊕, empty) such that:

- empty is a unit of ⊕;
- (λx. empty) is a zero of the monad;
- iter distributes rightward through ⊕, i.e.:

    iter f (c ⊕ c′) = (iter f c) ⊕ (iter f c′)

It is easy to see that collection theory is consistent with this definition of a ringad.

5 Scope of the Collection Theory

Our objective is to tune the theory so that our definition of collection types includes as many as possible of the types that we would intuitively classify as bulk types, and excludes all types that we would not classify as such. We have already seen that sets, bags, lists, and (certain) trees are indeed collections, and that cells are not. In this section we show that relations and finite mappings are also collections.

Relations

A relation may be viewed simply as a set of tuples or records. Thus relations are collections, with the operation ⊕ being set union. For example:

    names and addrs  = [(David, Bearsden); (Phil, Glasgow); (Malcolm, Rhu); (Malcolm, Edinburgh)]set
    names and bdates = [(Ray, 1945); (David, 1946); (Phil, 1961)]set

are relations of types (name × address) set and (name × int) set, respectively.

A typical operation specific to relations is a join. For example, the (equi-)join of the above relations would be:

    [(David, Bearsden, 1946); (Phil, Glasgow, 1961)]set

and could be computed by:

    [(n, a, b) | (n, a) ← names and addrs; (n′, b) ← names and bdates; n = n′]set

We could easily parameterise this computation with respect to the two relations, giving a function of type (name × address) set × (name × int) set → (name × address × int) set. We can easily generalise the above with respect to a boolean-valued function p that tests whether two given elements contain consistent information, and with respect to a function f that `merges' two consistent elements. This gives us a general join operation on sets:

    join : (α × β → bool) → (α × β → γ) → (α set × β set → γ set)
    join p f (s, s′) = [f (x, x′) | x ← s; x′ ← s′; p (x, x′)]set
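
Desugaring this comprehension by the translation rules of Section 4.1 yields an executable version; a minimal sketch (ours, not from the paper) over the Set structure of Appendix A:

    fun join p f (s, s') =
      Set.iter (fn x =>
                  Set.iter (fn x' => if p (x, x')
                                     then Set.single (f (x, x'))
                                     else Set.empty)
                           s')
               s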

Finite Mappings

A finite mapping of type α mapping is a set of elements of type α, with the property that all elements of a particular mapping have unique keys. Enforcing this constraint gives mappings a special flavour in collection theory. We require that the element type α be equipped with a function same key : α × α → bool, which must be an equivalence relation. Then every mapping [x1; ...; xn]mapping must be such that same key (xi, xj) if and only if i = j. Typically the elements of a mapping are pairs, and the same key function compares the first components of the pairs. For example, the following mappings are of type (string × int) mapping:

    m1 = [("ace", 1); ("two", 2); ("three", 3)]mapping
    m2 = [("jack", 11); ("queen", 12); ("king", 13); ("ace", 14)]mapping

The mapping operation ⊕ is a partial function: the key uniqueness constraint implies that ⊕ fails when applied to two collections that contain conflicting elements (i.e., elements that are different but have the same keys). For example, the mappings m1 and m2 above have conflicting elements with key "ace", so m1 ⊕ m2 = ⊥. Here ⊥ denotes the `result' of an operation that fails, i.e., raises an exception. The extra law for the mapping operation ⊕ is:

    [x] ⊕ [x′] = ⊥   if same key (x, x′) ∧ x ≠ x′

In conjunction with the idempotence, associativity, and commutativity of ⊕, this law captures the operation's special property.

The condition in law (5) of collection theory is necessary in order to avoid inconsistency when ⊕ is partial as in the case of mappings. For example, suppose that iter f eliminates pairs in which the second component is even. Then we have:

    iter f (m1 ⊕ m2) = iter f ⊥ = ⊥

But without the condition in (5) we would also have:

    iter f (m1 ⊕ m2) = iter f m1 ⊕ iter f m2
                     = [("ace", 1); ("three", 3)]mapping ⊕ [("jack", 11); ("king", 13)]mapping
                     = [("ace", 1); ("three", 3); ("jack", 11); ("king", 13)]mapping

An additional lookup operation is defined on mappings: lookup y m gives the element of the mapping m that has the same key as y. (There is at most one such element.) For example:

    lookup ("ace", 0) [("ace", 1); ("two", 2); ("three", 3)]mapping = ("ace", 1)

This operation is specified as follows:

    lookup               : α → α mapping → α
    lookup y [ ]         = ⊥
    lookup y [x]         = x             if same key (x, y)
    lookup y [x]         = ⊥             if ¬ same key (x, y)
    lookup y (m ⊕ m′)    = lookup y m    if lookup y m′ = ⊥
    lookup y (m ⊕ m′)    = lookup y m′   if lookup y m = ⊥

As usual, this is only a specification. In practice, we would expect lookup to be implemented efficiently, for example using an index or hash table.
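
A concrete prototype of lookup for one particular element type illustrates the specification; this is our sketch (not in the paper), with mappings represented as lists of (string, int) pairs keyed on the first component, and an exception playing the role of ⊥:

    exception Lookup_fail

    (* same_key compares the key components of two (string, int) pairs. *)
    fun same_key ((k, _): string * int, (k', _): string * int) = (k = k')

    fun lookup y []       = raise Lookup_fail
      | lookup y (x :: m) = if same_key (x, y) then x else lookup y m

    val m1 = [("ace", 1), ("two", 2), ("three", 3)]
    val ace = lookup ("ace", 0) m1    (* = ("ace", 1) *)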

Another Counter-example: Destructive Mappings

Now consider an apparently minor variation of finite mappings as described above. Suppose that we make the mapping operation ⊕ destructive: if mappings m and m′ contain conflicting elements, then m ⊕ m′ includes the element from m′ and `forgets' the element from m. This version of ⊕ would be a total operation, avoiding the complications introduced by a partial operation.

However, we decided to reject this possibility. The following example illustrates why. Suppose that iter f eliminates pairs in which the second component is even. Then we have:

    iter f (m1 ⊕ m2) = iter f [("two", 2); ("three", 3); ("jack", 11); ("queen", 12); ("king", 13); ("ace", 14)]mapping
                     = [("three", 3); ("jack", 11); ("king", 13)]mapping

iter f (m1  m2) = iter f m1  iter f m2 = [( ace ; 1); ( three ; 3)]mapping  [( jack ; 11); ( king ; 13)]mapping = [( ace ; 1); ( three ; 3); ( jack ; 11); ( king ; 13)]mapping So a destructive version of the mapping operation  is inconsistent with collection theory. A simpler objection to the destructive  is that it is non-commutative. This might seem a rather abstract argument, but non-commutativity of  has the practical consequence of inhibiting the quali er interchange optimisation discussed in Section 3.3.2. It is interesting that other DBPL designers have also come to the same conclusion [5]. This example supports our claim that the collection theory provides useful mathematical insights into the design of bulk types and their operations. 0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

6 Discussion

A comprehensive list of requirements for bulk types has been set out in [5], and is reproduced here. (They are not necessarily independent of one another, and are not presented in any particular priority order.) The usefulness of collection theory can be judged by seeing how well it meets each of these requirements.

1. Instances of a bulk type are potentially large.
   The type α collection places no restriction on the size of instances and hence collections may be large.

2. Instances of a bulk type may vary in size.
   As above, α collection does not imply a fixed size. In particular, the type of ⊕, α collection × α collection → α collection, implies that collections with different sizes have the same type.

3. Bulk values may be mutable.
   The theory does not explicitly address this issue. However, it does not exclude updating operations that, for example, add a single element to a collection, or remove a single element, or modify an element of a mapping with a given key value. The semantics of such updating operations can be defined in terms of the collection theory.

4. A useful and succinct algebra is required over bulk values.
   The theory is specified in an algebraic manner, the ringad laws provide an algebra, and the comprehension optimisations provide an algebra of queries.

5. Bulk data types should be type complete.
   Collections are largely type complete as the type α collection may take any type α as a parameter. However, some collection operations depend on the element type being equipped with certain operations, the most pervasive example of this being the equality operation on elements demanded by (some) collection operations. An interesting aspect of our work has been to show that, whereas operations like is_in clearly depend on an equality operation on elements, and are therefore not type-complete, other operations like is_empty, all, any, filter, map, iter, and collect do not depend on an equality operation. Thus we can even construct sets of functions (for example) and perform useful computations with them, though not membership tests. Therefore we can claim that the collection type constructors themselves are type-complete, although some of their operations are not.

6. Bulk values should be first class citizens of the language.
   There is no reason why a DBPL supporting collection types should not treat them as first class citizens.

7. Bulk values should have persistence rights.
   As above, there is no reason why a DBPL supporting collection types should not permit collections to persist.

8. Explicit iteration should be provided.
   Section 3.1 demonstrated that comprehensions provide explicit iteration.

9. The order of iteration should be defined and specifiable.
   The iteration order of collection comprehensions is defined, but not specifiable short of performing a sort.

10. It should be possible to project an element out of a bulk type, e.g., choose an element from a set.
    This is provided by the choose operation.

11. The implementation should be efficient.
    Efficiency, in an absolute sense, is not addressed by the collection theory. However, the laws of collection theory provide a rigorous basis for query optimisations. Moreover, the uniform treatment of all kinds of collection can be exploited by a compiler to implement (and optimise) many kinds of collection without increase of complexity.

7 Conclusion

The initial results of an investigation to locate and exploit common properties of bulk types have been reported. Some bulk types with common operations that obey certain simple laws have been identified and termed collections. The operations on collections are partitioned into basic operations and derived operations, providing a fast means of prototyping new kinds of collection. Sets, bags, lists, certain trees, relations, and finite mappings are all collections. Types that we would not intuitively classify as bulk types are not collections. Comprehensions provide a desirable and uniform query notation over all collections.

We hope to encourage DBPL designers to use the theory when designing new bulk types to be incorporated into their languages. Collection theory has already influenced the design of the Comandos Object Data Management System [8]. The theory has also influenced the design of iteration and dyadic operations on finite mappings [5].

Further investigation is required to identify more bulk types that are collection types, perhaps other tree types and a graph type, and others that are not. Another issue that requires investigation is providing conversions between bulk types, such as converting a set into a list or vice versa. Another possibility is to generalise comprehensions so that we can iterate over a list and a set within the same comprehension.

Appendix A: Prototyping Collection Types

The collection operations are partitioned into:

- a small set of basic operations, and
- a much larger set of operations that can be defined in terms of the basic operations.

This suggests a simple means of prototyping a collection type. When a new collection type is proposed, only the basic operations need to be defined, in terms of a suitable representation type. The constructed operations can then be automatically generated. In this appendix we present a prototyper written in Standard ML. (For the benefit of readers unfamiliar with modules in ML, here is a summary of the main concepts. A structure is a module that exports some types and operations. A signature is the `type' of a structure; it specifies (only) the exported types and the names and types of the exported operations. A functor is a parameterised module; its argument is a structure, with a specified signature, which it uses to generate a new structure.)

The Functor

The following defines Collection_basic_sig to be the signature of a structure exporting a parameterised type named 't collection, equipped with operations named empty, single, ++ (corresponding to ⊕), iter, and reduce:

    signature Collection_basic_sig =
    sig
      type 't collection

      val empty  : 't collection
      val single : 't -> 't collection
      val ++     : 't collection * 't collection -> 't collection

      val iter   : ('s -> 't collection) -> 's collection -> 't collection
      val reduce : 't * ('s -> 't) * ('t * 't -> 't) -> 's collection -> 't
    end

The following defines Collection_sig to be the signature of a structure exporting a parameterised type named 't collection, equipped with most of the basic and derived operations summarised in Tables 1 and 3:

    signature Collection_sig =
    sig
      type 't collection

      val empty    : 't collection
      val single   : 't -> 't collection
      val ++       : 't collection * 't collection -> 't collection

      val iter     : ('s -> 't collection) -> 's collection -> 't collection
      val reduce   : 't * ('s -> 't) * ('t * 't -> 't) -> 's collection -> 't

      val is_empty : 't collection -> bool
      val size     : 't collection -> int
      val is_in    : 't -> 't collection -> bool
      val all      : ('t -> bool) -> 't collection -> bool
      val any      : ('t -> bool) -> 't collection -> bool
      val filter   : ('t -> bool) -> 't collection -> 't collection
      val map      : ('s -> 't) -> 's collection -> 't collection
      val collect  : ('t collection) collection -> 't collection

      val list     : 't collection -> 't list
      val enum     : 't list -> 't collection
    end

The following defines a functor make_collection. This functor takes a given structure Basic, with signature Collection_basic_sig, and uses it to generate a larger structure, with signature Collection_sig:

    functor make_collection (Basic: Collection_basic_sig) : Collection_sig =
    struct
      open Basic
      infix ++
      local
        fun id (x) = x
        fun both (b: bool, b': bool) = b andalso b'
        fun either (b: bool, b': bool) = b orelse b'
      in
        val is_empty = reduce (true, (fn x => false), both)
        val size     = reduce (0, (fn x => 1), op +)
        fun is_in y  = reduce (false, (fn x => (x = y)), either)
        fun all p    = reduce (true, p, both)
        fun any p    = reduce (false, p, either)
        fun filter p = iter (fn x => if p x then single x else empty)
        fun map f    = iter (single o f)
        val collect  = iter (id)
        val list     = reduce (nil, (fn x => x::nil), op @)
        fun enum nil     = empty
          | enum (x::xs) = single x ++ enum xs
      end
    end (*make_collection*)

Note the following points:

- 't collection is an abstract type, since the signatures Collection_basic_sig and Collection_sig reveal nothing about its representation.
- We have chosen to include reduce with the basic operations, because make_collection needs it to define some of the derived operations (the representation type being hidden).
- The operation enum mimics the syntactic sugar of enumerations (Section 2.1).

Using the Functor

We can use the functor make_collection to generate structures appropriate to specific kinds of collection. Here we illustrate the idea to generate prototypes for lists, sets, and (binary) relations. The following structure defines the basic operations for lists:

    structure List_basis : Collection_basic_sig =
    struct
      datatype 't collection = empty | cons of ('t * 't collection)
      fun single x = cons (x, empty)
      infix ++
      fun op ++ (empty, l') = l'
        | op ++ (cons(x,l), l') = cons (x, l ++ l')
      fun iter f empty = empty
        | iter f (cons(x,l)) = f x ++ iter f l
      fun reduce (e,f,g) empty = e
        | reduce (e,f,g) (cons(x,l)) = g (f x, reduce (e,f,g) l)
    end (*List_basis*)

The following defines a structure List exporting operations on lists:

    structure List : Collection_sig = make_collection (List_basis)
    open List
    infix ++

The type and operations exported by List are named 't List.collection, List.empty, List.single, List.++, etc. The open declaration abbreviates these to 't collection, empty, single, ++, etc. Now we can write down list expressions such as:

    val cs = single "England" ++ single "Mexico";
    val cs' = cs ++ enum ["Germany", "Argentina", "Spain",
                          "Mexico", "Italy", "USA"];
    size cs'

Here cs and cs' are of type string List.collection, and the value of size cs' is 8.
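
The other generated operations apply in the same way; for instance (our illustration, not in the original):

    is_in "Spain" cs';                             (* true *)
    size (filter (fn s => s = "Mexico") cs');      (* 2, since lists keep duplicates *)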

The following structure defines the basic operations for sets:

    structure Set_basis : Collection_basic_sig =
    struct
      datatype 't collection = empty | ins of ('t * 't collection)
      fun single x = ins (x, empty)
      infix ++
      fun op ++ (empty, s') = s'
        | op ++ (ins(x,s), s') =
            let
              fun member (y, empty) = false
                | member (y, ins(x,s)) = (y = x) orelse member (y, s)
            in
              if member (x, s') then s ++ s' else ins (x, s ++ s')
            end
      fun iter f empty = empty
        | iter f (ins(x,s)) = f x ++ iter f s
      fun reduce (e,f,g) empty = e
        | reduce (e,f,g) (ins(x,s)) = g (f x, reduce (e,f,g) s)
    end (*Set_basis*)

The following defines a structure Set exporting operations on sets:

    structure Set : Collection_sig = make_collection (Set_basis)
    open Set
    infix ++

The type and operations exported by Set are named 't Set.collection, Set.empty, Set.single, Set.++, etc. The open declaration abbreviates these to 't collection, empty, single, ++, etc. Now we can write down set expressions such as:

    val primes = single 2 ++ single 3 ++ single 5 ++ single 7;
    val odds = enum [1, 3, 5, 7, 9];
    val evens = map succ odds;
    is_in 5 primes

Here primes, odds, and evens are of type int Set.collection.

Finally, the following structure Relation implements binary relations. Note that we can reuse Set_basis to generate the basic and derived operations on binary relations, viewed as sets. For illustrative purposes, Relation defines and exports a few additional operations on binary relations: a join and some projections.

    structure Relation =
    struct
      structure R : Collection_sig = make_collection (Set_basis)
      open R
      fun join (r1, r2) =
        iter (fn (y2,z2) =>
                iter (fn (y1,z1) =>
                        if y1 = y2
                        then single (y1, (z1, z2))
                        else empty
                     ) r1
             ) r2
      fun project12 r = map (fn (x,(y,z)) => (x,y)) r
      fun project13 r = map (fn (x,(y,z)) => (x,z)) r
      fun project23 r = map (fn (x,(y,z)) => (y,z)) r
    end (*Relation*)
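
The prototypes above stop at lists, sets, and relations. As a further illustration (our sketch, not in the paper), a bag prototype can reuse essentially the list representation, since the commutativity expected of ⊕ for bags is an intended law rather than something the prototyper checks:

    structure Bag_basis : Collection_basic_sig =
    struct
      (* Bags are represented like lists; we simply intend never to rely on
         the order of elements. *)
      datatype 't collection = empty | item of ('t * 't collection)
      fun single x = item (x, empty)
      infix ++
      fun op ++ (empty, b') = b'
        | op ++ (item(x,b), b') = item (x, b ++ b')
      fun iter f empty = empty
        | iter f (item(x,b)) = f x ++ iter f b
      fun reduce (e,f,g) empty = e
        | reduce (e,f,g) (item(x,b)) = g (f x, reduce (e,f,g) b)
    end (*Bag_basis*)

    structure Bag : Collection_sig = make_collection (Bag_basis)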

Appendix B: Monads and Ringads

B.1 Kleisli Monads

This description of monads is closely based on [22], although the monad functions have been renamed. In the following, functions may be subscripted with their type, e.g., id_τ is the identity function on type τ. Function composition is denoted by ∘, i.e., (f ∘ g) x = f (g x).

A Kleisli monad is a triple (collection, single, iter) consisting of a type constructor collection, and functions single and iter. In category-theoretic terms, collection is a mapping from objects to objects, single : α → α collection is a family of arrows, and for each arrow f : α → β collection there is an arrow (iter f) : α collection → β collection, satisfying:

    iter single         = id_(α collection)                       (6)
    iter f ∘ single     = f                                       (7)
    (iter g) ∘ (iter f) = iter ((iter g) ∘ f)                     (8)

B.2 Monads with a Zero

A monad can be augmented by a zero. We define a family of functions zero_(α,β) : α → β collection, each of which simply ignores its argument and returns an empty collection:

    zero_(α,β) x = empty_(β collection)

The laws that a zero (and hence empty) must obey are as follows:

    zero_(β,γ) ∘ e      = zero_(α,γ)                              (9)
    iter zero_(α,β)     = zero_(α collection, β)                  (10)
    iter f ∘ zero_(α,β) = zero_(α,γ)                              (11)

for any functions e : α → β and f : β → γ collection. A monad with a zero, sometimes termed a quad, provides exactly the functions required to support comprehensions.

B.3 Ringads

A ringad incorporates a ⊕ function, in addition to the functions already mentioned. The ⊕ function has zero as left and right unit, i.e.:

    zero_(α,β) x ⊕ f x = f x                                      (12)
    f x ⊕ zero_(α,β) x = f x                                      (13)

for any function f : α → β collection. Note that these laws correspond to laws (1)-(2) of collection theory.

A ringad consists of a monad (collection, single, iter) and a pair (⊕, zero), such that zero is a unit of ⊕, zero is the zero of the monad, and iter distributes rightward through ⊕, i.e.:

    iter f (c ⊕ c′) = (iter f c) ⊕ (iter f c′)                    (14)

for any function f : α → β collection.

[1] Albano, A., Cardelli, L., and Orsini, R. Galileo: A Strongly Typed Interactive Conceptual Language. ACM Transactions on Database Systems 10, 2 (June 1985), 230-260.

[2] Astrahan, M.M., Blasgen, M.W., Chamberlin, D.D., Eswaran, K.P., Gray, J.N., Griffiths, P.P., King, W.F., Lorie, R.A., McJones, P.R., Mehl, J.W., Putzolu, G.R., Traiger, I.L., Wade, B.W., and Watson, V. System R: Relational Approach to Database Management. ACM Transactions on Database Systems 1, 2 (June 1976), 97-137.

[3] Atkinson, M.P. Programming Languages and Databases. Proceedings of the 4th International Conference on Very Large Databases (1978), 408-419.

[4] Atkinson, M.P., and Buneman, O.P. Types and Persistence in Database Programming Languages. ACM Computing Surveys 19, 2 (June 1987), 105-190.

[5] Atkinson, M.P., Richard, P., and Trinder, P.W. Bulk Types for Large Scale Programming. Proceedings of Information Systems for the 90's, Kiev, Ukraine (October 1990).

[6] Bancilhon, F., Briggs, T., Khoshafian, S., and Valduriez, P. FAD, A Powerful and Simple Database Language. Proceedings of the 13th International Conference on Very Large Databases, Brighton, England (September 1987), 97-107.

[7] Date, C.J. An Introduction to Database Systems, 4th Ed. Addison Wesley (1976).

[8] Harper, D.J., and Norrie, M. Data Management for Object-Oriented Systems. To appear in The Ninth British National Conference on Databases, Wolverhampton, England (July 1991).

[9] Held, G.D., Stonebraker, M.R., and Wong, E. Ingres, a Relational Database System. Proceedings of the 44th National Computing Conference (May 1975).

[10] Lecluse, C., Richard, P., and Velez, F. O2, an Object-Oriented Data Model. Proceedings of the ACM-SIGMOD Conference, Chicago, USA (1988).

[11] Morrison, R., Brown, F., Connor, R., and Dearle, A. The Napier88 Reference Manual. Universities of Glasgow and St Andrews, PPRR-77-89.

[12] Ohori, A., Buneman, O.P., and Breazu-Tannen, V. Database Programming in Machiavelli: a Polymorphic Language with Static Type Inference. Proceedings of the ACM-SIGMOD Conference, Portland, USA (1989), 46-57.

[13] Schmidt, J.W., Eckhardt, H., and Matthes, F. DBPL Report. DBPL-Memo 111-88, Fachbereich Informatik, Johann Wolfgang Goethe-Universität, Frankfurt, West Germany (1988).

[14] Sheard, T. A User's Guide to TRPL: A Compile-time Reflective Programming Language. Department of Computer and Information Science, University of Massachusetts, Amherst, MA, COINS Technical Report 90-109 (December 1990).

[15] Trinder, P.W. A Functional Database. D.Phil Thesis, Oxford University (December 1989).

[16] Trinder, P.W. Comprehensions: a Query Notation for DBPLs. To appear in the Proceedings of the Third International Workshop on Database Programming Languages, Nafplion, Greece (August 1991).

[17] Trinder, P.W., and Wadler, P.L. List Comprehensions and the Relational Calculus. Proceedings of the 1988 Glasgow Workshop on Functional Programming, Rothesay, Scotland (August 1988), 115-123.

[18] Trinder, P.W., and Wadler, P.L. Improving List Comprehension Database Queries. Proceedings of TENCON'89, Bombay, India (November 1989), 186-192.

[19] Ullman, J.D. Principles of Database Systems. Pitman, 1980.

[20] Wadler, P.L. List Comprehensions. Chapter 7 of Peyton Jones, S.L., The Implementation of Functional Programming Languages. Prentice Hall, 1987.

[21] Wadler, P.L. Comprehending Monads. Proceedings of the ACM Conference on Lisp and Functional Programming, Nice, France (June 1990).

[22] Wadler, P.L. Notes on Monads and Ringads. Internal document, Computing Science Dept., Glasgow University (September 1990).
