Generic Multiset Programming for Language-integrated Querying∗ Fritz Henglein and Ken Friis Larsen Department of Computer Science, University of Copenhagen (DIKU) Email: {henglein, kflarsen}@diku.dk August 5, 2010
Abstract
This paper demonstrates how relational algebraic programming based on efficient symbolic representations of multisets and operations on them can be applied to the query sublanguage of SQL in a type-safe fashion. In essence, it provides a library for naïve programming with multisets in a generalized SQL-style fashion, but avoids many cases of asymptotically inefficient nested iteration through cross-products.
1
Introduction
We introduce a library for generic multiset programming. It supports algebraic programming based on Codd’s original relational algebra with select (filter), project (map), cross product, (multi)set union and (multi)set difference as primitive operators and extends it with SQL-style functions corresponding to GROUP BY, SORT BY, HAVING, DISTINCT and aggregation functions. It generalizes the querying core of SQL as follows: Multisets may contain elements of arbitrary first-order data types, including references (pointers), recursive data types and nested multisets. It contains an expressive embedded domain-specific language for user-definable equivalence and ordering relations. And it allows in principle arbitrary user-defined functions for selection, projection and aggregation.

∗ This material is based upon work supported by the Danish National Science Foundation (FNU) under Project APPL and by the Danish National Advanced Technology Foundation under Project 3gERP.
Employing a recently discovered simple technique for representing multisets, predicates and projections using symbolic representations of collections, our implementation avoids the quadratic run-time penalty of multiplying out cross-products, which often makes the otherwise attractive use of list comprehensions for SQL-like querying unviable. In particular, (equi)joins can be defined naïvely by composition of selection and cross-product, yet get executed by invoking an efficient join algorithm. We provide illustrative applications; briefly discuss the tensions between semantic determinacy, efficient joins, and dependent products; and relate our work to other frameworks for encapsulating querying in-memory and RDBMS-managed data.
1.1
Contributions
In this paper we provide a library for SQL-style programming with multisets that
• supports all the classical features of the data query sublanguage of SQL;
• admits multisets of any element type, including nested multisets and trees;
• admits user-definable equivalences (equijoin conditions), predicates, and functions;
• admits naïve programming with cross-products, but avoids spending quadratic time on computing them;
• offers an object-oriented (object-method-parameters) syntax using infix binary operators;
• and is easy to implement.
To demonstrate this we include the complete source code (without basic routines for generic discrimination reported elsewhere) and a couple of paradigmatic examples.
1.2
Required background
A basic understanding of the relational data model and SQL is required. Any textbook on databases will do; e.g. Ramakrishnan and Gehrke (2003). We use the functional core parts of Haskell (Peyton Jones 2003) as our programming language, extended with Generalized Algebraic Data Types (GADTs), as implemented in Glasgow Haskell (GHC Team). GADTs provide a convenient type-safe framework for shallow embedding of little languages (Bentley 1986), which we use for symbolic representation of multisets, performable functions, predicates, equivalence and ordering relations. Apart from type safety, all other aspects of our library can be easily coded up in other general-purpose programming languages, both eager and lazy. Hudak and Fasel (1999) provide a brief and gentle introduction to Haskell, but as we deliberately do not use monads, type classes or any other Haskell-specific language constructs except for GADTs, we believe basic knowledge of functional programming is sufficient for understanding the code we provide.
2
Basic ingredients
In this section we recapitulate the basic ingredients for a simple, but efficient implementation of generic relational algebra (Henglein 2010). The ideas are the same, but the concrete representations are slightly different to make them practically more expressive and efficient. In particular, disjunction and negation are added here.
2.1
Multisets (bags)
A multiset, also called bag, is a collection of elements that may contain duplicates. We use a lazy data structure representation of bags using symbolic unions (U) and cross-products (X), where the leaves of the data structure are singleton bags or empty bags:

    data Bag a where
      Empty :: Bag a                        -- Empty: empty bag
      S     :: a -> Bag a                   -- S v: singleton bag { v }
      U     :: Bag a -> Bag a -> Bag a      -- union
      X     :: Bag a -> Bag b -> Bag (a, b) -- cross-product
Symbolic cross-products, together with the matching symbolic representations of functions and predicates operating on bags (see below) and efficient partitioning for user-defined equivalence relations, are the keys to allowing naïve filter-map-product programming without incurring the quadratic run-time blow-up of nested iteration through a cross-product. A bag can be constructed from a list by forming unions of singleton bags:

    bag :: [a] -> Bag a
    bag []       = Empty
    bag (x : xs) = foldr (\v s -> S v `U` s) (S x) xs
Conversely, a bag can be converted to a list:

    list :: Bag a -> [a]
    list Empty       = []
    list (S v)       = [v]
    list (s1 `U` s2) = list s1 ++ list s2
    list (s1 `X` s2) = [ (v1, v2) | v1 <- list s1, v2 <- list s2 ]
The last clause of list performs explicit multiplying out of symbolic cross-products, resulting in a potentially quadratic blow-up in time and space. Using a lazy list (stream) implementation the quadratic space consumption can often be staved off, but the time blow-up remains. Since this is the only nonlinear code we shall use we can easily identify where we incur potentially nonlinear complexity in the subsequent code. Observe that list encapsulates not only a computational, but also a semantic problem: it is nondeterministic in the sense that two different representations of the same bag may yield different results. In particular, representation changes of bags can be observed by list. Since all other operations we use are semantically deterministic, we can identify potential nondeterminism by the presence of list. The size of a bag is the number of elements it contains:

    size :: Bag a -> Int
    size Empty       = 0
    size (S _)       = 1
    size (s1 `U` s2) = size s1 + size s2
    size (s1 `X` s2) = size s1 * size s2
Note the last clause: It computes the size without multiplying out the cross-product.
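The effect of the last clause can be checked on a small example. The following is a minimal sketch that repeats the Bag, bag and size definitions from above; the values bigProduct and bigSize are hypothetical examples and not part of the library.

```haskell
{-# LANGUAGE GADTs #-}

-- Symbolic bag representation, as defined above.
data Bag a where
  Empty :: Bag a
  S     :: a -> Bag a
  U     :: Bag a -> Bag a -> Bag a
  X     :: Bag a -> Bag b -> Bag (a, b)

bag :: [a] -> Bag a
bag []       = Empty
bag (x : xs) = foldr (\v s -> S v `U` s) (S x) xs

size :: Bag a -> Int
size Empty       = 0
size (S _)       = 1
size (s1 `U` s2) = size s1 + size s2
size (s1 `X` s2) = size s1 * size s2

-- Hypothetical example: a symbolic product of two 1000-element bags.
-- Only the two factors are ever built; the million pairs are not.
bigProduct :: Bag (Int, Int)
bigProduct = bag [1 .. 1000] `X` bag [1 .. 1000]

-- Computed from the sizes of the two factors alone.
bigSize :: Int
bigSize = size bigProduct
```

Evaluating bigSize visits the 2000 leaves of the two factors once each, whereas length (list bigProduct) would have to enumerate every pair.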
2.2
Performable functions
We define symbolic representations of performable functions, functions that can be applied to each element of a bag:

    data Func a b where
      Func :: (a -> b) -> Func a b
      Fst  :: Func (a, b) a
      Snd  :: Func (a, b) b
      Par  :: Func a b -> Func c d -> Func (a, c) (b, d)
Any user-definable function, f, can be turned into a performable function by applying the constructor Func to it. The constructors Fst and Snd represent the projections on pairs. The parallel composition of performable functions f and g, which operates on the components of pairs independently, is represented symbolically as Par f g.
The function ext maps performable functions to their extensions as ordinary functions:

    ext :: Func a b -> a -> b
    ext (Func f)    x      = f x
    ext (Par f1 f2) (x, y) = (ext f1 x, ext f2 y)
    ext Fst         (x, y) = x
    ext Snd         (x, y) = y
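As a small illustration of how a symbolic function is interpreted, the sketch below repeats Func and ext and applies a parallel composition to a nested pair. The value example is a hypothetical name introduced only for this illustration.

```haskell
{-# LANGUAGE GADTs #-}

-- Performable functions, as defined above.
data Func a b where
  Func :: (a -> b) -> Func a b
  Fst  :: Func (a, b) a
  Snd  :: Func (a, b) b
  Par  :: Func a b -> Func c d -> Func (a, c) (b, d)

-- Extension of a performable function as an ordinary function.
ext :: Func a b -> a -> b
ext (Func f)    x      = f x
ext (Par f1 f2) (x, y) = (ext f1 x, ext f2 y)
ext Fst         (x, _) = x
ext Snd         (_, y) = y

-- Hypothetical example: take the length of the first component and
-- project the second component of the second component.
example :: (Int, Char)
example = ext (Par (Func length) Snd) ("abc", (True, 'x'))
```

Here example evaluates to (3, 'x'). Because Par and Snd remain visible as constructors, an operation on bags can inspect them before deciding how to evaluate, which a black-box function would not allow.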
2.3
Predicates
Similar to performable functions, we allow arbitrary Boolean functions for filtering elements from a bag as predicates, but maintain symbolic representations for the constant-true (TT), constant-false (FF), sequential conjunction (SAnd), parallel conjunction (PAnd), sequential disjunction (SOr), parallel disjunction (POr), join condition (In) and negated join condition (NotIn) predicates.

    data Pred a where
      Pred  :: (a -> Bool) -> Pred a
      TT    :: Pred a
      FF    :: Pred a
      SAnd  :: Pred a -> Pred a -> Pred a
      PAnd  :: Pred a -> Pred b -> Pred (a, b)
      SOr   :: Pred a -> Pred a -> Pred a
      POr   :: Pred a -> Pred b -> Pred (a, b)
      In    :: (Func a k, Func b k) -> Equiv k -> Pred (a, b)
      NotIn :: (Func a k, Func b k) -> Equiv k -> Pred (a, b)
The extension as Boolean functions and thus the semantics of predicates is given by sat:

    sat :: Pred a -> a -> Bool
    sat (Pred f)            x      = f x
    sat TT                  _      = True
    sat FF                  _      = False
    sat (p1 `SAnd` p2)      x      = sat p1 x && sat p2 x
    sat (p1 `PAnd` p2)      (x, y) = sat p1 x && sat p2 y
    sat (p1 `SOr` p2)       x      = sat p1 x || sat p2 x
    sat (p1 `POr` p2)       (x, y) = sat p1 x || sat p2 y
    sat ((f, g) `In` e)     (x, y) = eq e (ext f x) (ext g y)
    sat ((f, g) `NotIn` e)  x      = not (sat ((f, g) `In` e) x)
The negation notP of a predicate can be defined using De Morgan’s laws:

    notP :: Pred a -> Pred a
    notP (Pred f)        = Pred (not . f)
    notP TT              = FF
    notP FF              = TT
    notP (p1 `SAnd` p2)  = notP p1 `SOr` notP p2
    notP (p1 `PAnd` p2)  = notP p1 `POr` notP p2
    notP (p1 `SOr` p2)   = notP p1 `SAnd` notP p2
    notP (p1 `POr` p2)   = notP p1 `PAnd` notP p2
    notP (fs `In` e)     = fs `NotIn` e
    notP (fs `NotIn` e)  = fs `In` e
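The following sketch shows that notP keeps the result symbolic instead of wrapping it in an opaque function. It is deliberately restricted to the constructors that do not need join conditions, so In and NotIn (and with them Func and Equiv) are left out; evenAndPositive and isParallelOr are hypothetical names used only here.

```haskell
{-# LANGUAGE GADTs #-}

-- Predicates without the join conditions In and NotIn (a simplification).
data Pred a where
  Pred :: (a -> Bool) -> Pred a
  TT   :: Pred a
  FF   :: Pred a
  SAnd :: Pred a -> Pred a -> Pred a
  PAnd :: Pred a -> Pred b -> Pred (a, b)
  SOr  :: Pred a -> Pred a -> Pred a
  POr  :: Pred a -> Pred b -> Pred (a, b)

sat :: Pred a -> a -> Bool
sat (Pred f)       x      = f x
sat TT             _      = True
sat FF             _      = False
sat (p1 `SAnd` p2) x      = sat p1 x && sat p2 x
sat (p1 `PAnd` p2) (x, y) = sat p1 x && sat p2 y
sat (p1 `SOr` p2)  x      = sat p1 x || sat p2 x
sat (p1 `POr` p2)  (x, y) = sat p1 x || sat p2 y

notP :: Pred a -> Pred a
notP (Pred f)       = Pred (not . f)
notP TT             = FF
notP FF             = TT
notP (p1 `SAnd` p2) = notP p1 `SOr` notP p2
notP (p1 `PAnd` p2) = notP p1 `POr` notP p2
notP (p1 `SOr` p2)  = notP p1 `SAnd` notP p2
notP (p1 `POr` p2)  = notP p1 `PAnd` notP p2

-- Hypothetical example: first component even, second component positive.
evenAndPositive :: Pred (Int, Int)
evenAndPositive = Pred even `PAnd` Pred (> 0)

-- Checks whether the outermost constructor is a parallel disjunction.
isParallelOr :: Pred a -> Bool
isParallelOr (POr _ _) = True
isParallelOr _         = False
```

Negating evenAndPositive yields a POr of two negated component predicates, so an operation that recognizes POr on a cross-product can still apply to the negated predicate.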
Observe that PAnd and POr are analogous to Par: They combine two componentwise predicates to a predicate on pairs. A join condition (f, g) `In` e defines a predicate on a pair based on three ingredients: key functions f and g, which map the components to a common type t, and an equivalence expression e, which denotes an equivalence relation on the type t. Equivalence expressions provide a compositional way of defining equivalence relations, again represented symbolically, as values of type Equiv t. More precisely, each element of Equiv t denotes an equivalence relation on a subset of t. Figure 1 shows the definition of Equiv and the generic function eq, which maps e to the characteristic function of the equivalence relation denoted by e. The values x for which eq e x x terminates constitute the subset on which the equivalence relation is defined. We are stretching the usage of “expression” since equivalence expressions may actually be infinite terms, not just finite ones. Infinite equivalence expressions allow denoting equivalence relations on recursive types, such as lists and trees. Consider for example, listE e:

    listE :: Equiv t -> Equiv [t]
    listE e = MapE fromList (SumE TrivE (PairE e (listE e)))
      where
        fromList :: [t] -> Either () (t, [t])
        fromList []       = Left ()
        fromList (x : xs) = Right (x, xs)
It denotes the elementwise extension of the equivalence denoted by e to lists such that [v1, ..., vm] and [w1, ..., wn] are listE e-equivalent if and only if m = n, and vi and wi are e-equivalent for all i ∈ {1 ... m}. Componentwise equivalence is not the only useful equivalence on lists. Two lists are bag-equivalent under e, denoted by BagE e, if one of them can be permuted such that it is listE e-equivalent to the other. They are set-equivalent under e, denoted by SetE e, if each element in one list is e-equivalent to some element in the other list, and vice versa. Equivalence expressions provide an expressive language for defining useful equivalence relations. It is for example possible to define eqInt32 and eqString8, the equalities on 32-bit integers and on 8-bit character strings, respectively.

    data Equiv :: * -> * where
      NatE  :: Int -> Equiv Int
      TrivE :: Equiv t
      SumE  :: Equiv t1 -> Equiv t2 -> Equiv (Either t1 t2)
      PairE :: Equiv t1 -> Equiv t2 -> Equiv (t1, t2)
      MapE  :: (t1 -> t2) -> Equiv t2 -> Equiv t1
      BagE  :: Equiv t -> Equiv [t]
      SetE  :: Equiv t -> Equiv [t]

    eq :: Equiv t -> t -> t -> Bool
    eq (NatE n) x y =
      if 0 <= x && x <= n && 0 <= y && y <= n
        then (x == y)
        else error "Argument out of range"
    eq TrivE _ _ = True
    eq (SumE e1 _) (Left x)  (Left y)  = eq e1 x y
    eq (SumE _ _)  (Left _)  (Right _) = False
    eq (SumE _ _)  (Right _) (Left _)  = False
    eq (SumE _ e2) (Right x) (Right y) = eq e2 x y
    eq (PairE e1 e2) (x1, x2) (y1, y2) = eq e1 x1 y1 && eq e2 x2 y2
    eq (MapE f e) x y = eq e (f x) (f y)
    eq (BagE _) [] []      = True
    eq (BagE _) [] (_ : _) = False
    eq (BagE e) (x : xs') ys =
      case delete e x ys of
        Just ys' -> eq (BagE e) xs' ys'
        Nothing  -> False
      where
        delete :: Equiv t -> t -> [t] -> Maybe [t]
        delete e v = subtract' []
          where
            subtract' _ [] = Nothing
            subtract' accum (x : xs) =
              if eq e x v then Just (accum ++ xs) else subtract' (x : accum) xs
    eq (SetE e) xs ys = all (member e xs) ys && all (member e ys) xs
      where
        member :: Equiv t -> [t] -> t -> Bool
        member _ [] _ = False
        member e (x : xs) v = eq e v x || member e xs v

Figure 1: Equivalence expressions and the (characteristic functions of) equivalence relations denoted by them.

We can also define new equivalence constructors. For example,

    maybeE :: Equiv t -> Equiv (Maybe t)
    maybeE e = MapE (maybe (Left ()) Right) (SumE eqUnit e)
lifts an equivalence e to an equivalence on the Maybe data type such that two values are equivalent if they both are Nothing or they both have the form Just v such that the arguments of Just are e-equivalent.
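To see that an infinite equivalence expression such as listE e is unproblematic under lazy evaluation, the following sketch repeats a subset of Figure 1 (without BagE and SetE) and evaluates a few comparisons. It assumes that eqUnit can be taken to be TrivE at type (); that choice is an assumption made for this sketch, since eqUnit is not defined in the text.

```haskell
{-# LANGUAGE GADTs #-}

-- Subset of the equivalence expressions of Figure 1.
data Equiv t where
  NatE  :: Int -> Equiv Int
  TrivE :: Equiv t
  SumE  :: Equiv t1 -> Equiv t2 -> Equiv (Either t1 t2)
  PairE :: Equiv t1 -> Equiv t2 -> Equiv (t1, t2)
  MapE  :: (t1 -> t2) -> Equiv t2 -> Equiv t1

eq :: Equiv t -> t -> t -> Bool
eq (NatE n) x y
  | 0 <= x && x <= n && 0 <= y && y <= n = x == y
  | otherwise                            = error "Argument out of range"
eq TrivE _ _                       = True
eq (SumE e1 _) (Left x)  (Left y)  = eq e1 x y
eq (SumE _ e2) (Right x) (Right y) = eq e2 x y
eq (SumE _ _)  _         _         = False
eq (PairE e1 e2) (x1, x2) (y1, y2) = eq e1 x1 y1 && eq e2 x2 y2
eq (MapE f e) x y                  = eq e (f x) (f y)

-- Infinite term: unfolded only as far as the compared lists are long.
listE :: Equiv t -> Equiv [t]
listE e = MapE fromList (SumE TrivE (PairE e (listE e)))
  where
    fromList :: [t] -> Either () (t, [t])
    fromList []       = Left ()
    fromList (x : xs) = Right (x, xs)

-- maybeE with TrivE standing in for eqUnit (assumption).
maybeE :: Equiv t -> Equiv (Maybe t)
maybeE e = MapE (maybe (Left ()) Right) (SumE TrivE e)

-- Hypothetical example equivalences on small natural numbers.
smallInts :: Equiv [Int]
smallInts = listE (NatE 255)

maybeSmall :: Equiv (Maybe Int)
maybeSmall = maybeE (NatE 255)
```

With these definitions, eq smallInts [1, 2, 3] [1, 2, 3] is True, eq smallInts [1, 2] [1, 2, 3] is False, and eq maybeSmall Nothing Nothing is True.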
2.4
Generic discrimination
To compute equijoins and for the extensions to Generic SQL in Section 3 we require an efficient generic discriminator:

    disc :: Equiv k -> [(k, v)] -> [[v]]
A discriminator disc e groups values according to the e-equivalence classes of the keys they are associated with in the input. Below we shall assume that disc e is stable; that is, each group in the output contains its values in the same relative positional order as they occur in the input. A discriminator can be used to partition an input list into equivalence classes, which in turn can be used to return a representative for each such equivalence class:

    part :: Equiv k -> [k] -> [[k]]
    part e xs = disc e [ (x, x) | x <- xs ]

    reps :: Equiv k -> [k] -> [k]
    reps e xs = [ head ys | ys <- part e xs ]
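For experimenting with the code in this paper without the generic discrimination routines, a simple stand-in can be used. The sketch below is not the efficient discriminator assumed above: it takes quadratic time and is parameterized by an ordinary equivalence test rather than an equivalence expression. The names discBy, partBy, repsBy and sameMod3 are hypothetical. It does satisfy the stability requirement.

```haskell
import Data.List (partition)

-- Quadratic stand-in for a stable discriminator: groups the values
-- whose keys are related by the given equivalence test.
discBy :: (k -> k -> Bool) -> [(k, v)] -> [[v]]
discBy _     []              = []
discBy equiv ((k, v) : rest) = (v : map snd same) : discBy equiv others
  where
    -- partition keeps the relative order, so each group is stable
    (same, others) = partition (equiv k . fst) rest

-- Partition a list into equivalence classes.
partBy :: (k -> k -> Bool) -> [k] -> [[k]]
partBy equiv xs = discBy equiv [ (x, x) | x <- xs ]

-- One representative (the first occurrence) per equivalence class.
repsBy :: (k -> k -> Bool) -> [k] -> [k]
repsBy equiv xs = [ y | (y : _) <- partBy equiv xs ]

-- Hypothetical example equivalence: same remainder modulo 3.
sameMod3 :: Int -> Int -> Bool
sameMod3 x y = x `mod` 3 == y `mod` 3
```

For instance, partBy sameMod3 [1 .. 7] yields [[1,4,7],[2,5],[3,6]] and repsBy sameMod3 [1 .. 7] yields [1,2,3]. Instantiating the test with eq e gives a function of the same type as disc e, at the cost of the quadratic running time.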
2.5
Relational operators
The relational operators corresponding to selection (filter), projection (map), union and cross-product can now be defined as in Figure 2. We stick to relational algebra in calling the filtering function select. (Unfortunately the nomenclature in SQL is at odds with relational algebra terminology: the SELECT part of an SQL query specifies what corresponds to a projection in relational algebra, whereas the WHERE part corresponds to a selection.)

    -- select s p : select (filter) all elements of s that satisfy p
    select :: Bag a -> Pred a -> Bag a
    select s TT = s
    select _ FF = Empty
    select (s1 `U` s2) p = select s1 p `U` select s2 p
    select s (p1 `SAnd` p2) = select (select s p1) p2
    select s (p1 `SOr` p2) = select s p1 `U` select s p2
    select (s1 `X` s2) (p1 `PAnd` p2) = select s1 p1 `X` select s2 p2
    select (s1 `X` s2) (p1 `POr` p2) =
      (select s1 p1 `X` s2) `U` (s1 `X` select s2 p2)
    select (s1 `X` s2) ((f1, f2) `In` e) = join f1 e f2 s1 s2
    select s p = bag (filter (sat p) (list s))

    -- join f1 e f2 s1 s2 =
    --   { (x1, x2) : x1 in s1, x2 in s2 | f1(x1) is e-equiv with f2(x2) }
    join :: Func a k -> Equiv k -> Func b k -> Bag a -> Bag b -> Bag (a, b)
    join f1 e f2 s1 s2 =
      foldr (\b s -> let (vs, ws) = split b in (bag vs `X` bag ws) `U` s) Empty bs
      where
        xs = [ (ext f1 r, Left r)  | r <- list s1 ]
        ys = [ (ext f2 r, Right r) | r <- list s2 ]
        bs = disc e (xs ++ ys)

    split :: [Either a b] -> ([a], [b])
    split [] = ([], [])
    split (v : vs) =
      let (lefts, rights) = split vs
      in case v of
           Left v'  -> (v' : lefts, rights)
           Right v' -> (lefts, v' : rights)

    -- perform s f : apply function f to all elements of s
    -- (generalization of "project")
    perform :: Bag a -> Func a b -> Bag b
    perform (s1 `U` s2) f = perform s1 f `U` perform s2 f
    perform (s1 `X` s2) (Par f1 f2) = perform s1 f1 `X` perform s2 f2
    perform (s1 `X` s2) Fst = size s2 `times` s1
    perform (s1 `X` s2) Snd = size s1 `times` s2
    perform s f = bag (map (ext f) (list s))  -- default clause

    -- times k s : union of k copies of s
    times :: Int -> Bag a -> Bag a
    times n s = if n > 0 then s `U` times (n - 1) s else Empty

    -- union s t : bag union of s and t
    union :: Bag a -> Bag a -> Bag a
    union s Empty = s
    union Empty t = t
    union s t = s `U` t

    -- prod s t : cross-product of s and t
    prod :: Bag a -> Bag b -> Bag (a, b)
    prod Empty t = Empty
    prod s Empty = Empty
    prod s t = s `X` t

Figure 2: Definition of basic relational operators

The key idea for avoiding naïve multiplying out of cross-products is illustrated in the following clauses:

    select (s1 `X` s2) (p1 `PAnd` p2) = select s1 p1 `X` select s2 p2
    select (s1 `X` s2) (p1 `POr` p2) =
      (select s1 p1 `X` s2) `U` (s1 `X` select s2 p2)
    select (s1 `X` s2) ((f1, f2) `In` e) = join f1 e f2 s1 s2
In the first clause and similarly in the second clause we exploit that the results of parallel conjunction and disjunction over cross-products can themselves be expressed as cross-products. In the third clause we see that it is possible to dynamically discover when a join condition is applied to a cross-product and employ an efficient join algorithm, which avoids nested iteration through the cross-product. We call the functional that maps a performable function over a bag perform to avoid mixing it up with map, which does the same for lists. Analogous to parallel conjunction for select, when mapping a parallel composition over a symbolic cross-product, the cross-product is not multiplied out, but the result is computed symbolically:

    perform (s `X` r) (Par f g) = perform s f `X` perform r g
Just like the select-clauses for parallel conjunction and disjunction, this illustrates why it is important to maintain a symbolic representation of a parallel functional composition. If Par f g is only available as a black-box function, there is no alternative to multiplying out the cross-product before applying the function to each pair in the cross-product one at a time, as seen in the default clause for perform. In a set-theoretic semantics mapping a projection over a symbolic product could be performed by simply returning the corresponding component set, once we have checked that the other set is nonempty. With bags, however, we need to return as many copies as the other component has elements:

    perform (s1 `X` s2) Fst = size s2 `times` s1
    perform (s1 `X` s2) Snd = size s1 `times` s2
where the times operator defines iterated bag union. The union and cross-product operators are essentially just the corresponding bag constructors U and X, respectively. They are implemented as “smart” constructors here, which recognize whenever an empty bag argument is represented by the constructor Empty.
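The dispatch from select to join can be tried out in isolation. The sketch below is a heavily simplified variant of Figure 2 and rests on several assumptions: key functions and the key equivalence are plain Haskell functions instead of Func and Equiv values, only the Pred and In constructors are kept, and the discriminator is a quadratic stand-in called discBy, so the sketch shows the structure of the result but not the efficiency of the real library. The example data (customers, orders, joined) is hypothetical.

```haskell
{-# LANGUAGE GADTs #-}
import Data.Either (partitionEithers)
import Data.List (partition)

data Bag a where
  Empty :: Bag a
  S     :: a -> Bag a
  U     :: Bag a -> Bag a -> Bag a
  X     :: Bag a -> Bag b -> Bag (a, b)

bag :: [a] -> Bag a
bag []       = Empty
bag (x : xs) = foldr (\v s -> S v `U` s) (S x) xs

list :: Bag a -> [a]
list Empty       = []
list (S v)       = [v]
list (s1 `U` s2) = list s1 ++ list s2
list (s1 `X` s2) = [ (v1, v2) | v1 <- list s1, v2 <- list s2 ]

-- Simplified predicates: black-box predicates and join conditions only.
data Pred a where
  Pred :: (a -> Bool) -> Pred a
  In   :: (a -> k, b -> k) -> (k -> k -> Bool) -> Pred (a, b)

sat :: Pred a -> a -> Bool
sat (Pred f)        x      = f x
sat ((f, g) `In` e) (x, y) = e (f x) (g y)

-- Quadratic, stable stand-in for the generic discriminator.
discBy :: (k -> k -> Bool) -> [(k, v)] -> [[v]]
discBy _     []              = []
discBy equiv ((k, v) : rest) = (v : map snd same) : discBy equiv others
  where (same, others) = partition (equiv k . fst) rest

-- One symbolic product per group of equivalent keys.
join :: (a -> k) -> (k -> k -> Bool) -> (b -> k) -> Bag a -> Bag b -> Bag (a, b)
join f1 e f2 s1 s2 =
  foldr (\b s -> let (vs, ws) = partitionEithers b
                 in (bag vs `X` bag ws) `U` s) Empty bs
  where
    xs = [ (f1 r, Left r)  | r <- list s1 ]
    ys = [ (f2 r, Right r) | r <- list s2 ]
    bs = discBy e (xs ++ ys)

select :: Bag a -> Pred a -> Bag a
select (s1 `U` s2) p                 = select s1 p `U` select s2 p
select (s1 `X` s2) ((f1, f2) `In` e) = join f1 e f2 s1 s2
select s           p                 = bag (filter (sat p) (list s))

-- Hypothetical example data: (customer id, name) and (customer id, item).
customers, orders :: [(Int, String)]
customers = [(1, "Ann"), (2, "Bob")]
orders    = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "ink")]

joined :: Bag ((Int, String), (Int, String))
joined = (bag customers `X` bag orders) `select` ((fst, fst) `In` (==))
```

The query is written as a selection on a product, yet the pattern match on X and In routes it to join, which builds one small product per key group instead of filtering all eight pairs.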
3
Generic SQL
We now present the extension of the relational algebra core of Section 2.5 to SQL with EXCEPT, DISTINCT, GROUP BY, SORT BY, HAVING, and aggregation functions.
3.1
EXCEPT
The SQL difference operator EXCEPT is based on a notion of equality. We generalize it to an operation, diff, that removes all equivalent elements from a bag, where the equivalence is user-definable:

    diff :: Bag k -> (Equiv k, Bag k) -> Bag k
    diff s1 (e, s2) = foldr include Empty bs
      where
        xs = [ (x, Left x)  | x <- list s1 ]
        ys = [ (y, Right y) | y <- list s2 ]
        bs = disc e (ys ++ xs)
        untag (Left x)  = x
        untag (Right _) = error "Wrongly tagged value"
        include b@(Left x : _)  s = bag (map untag b) `U` s
        include b@(Right y : _) s = s
        include []              s = error "Application to empty block"

Since disc is stable and the elements of s2 are listed first, every group that contains an element of s2 starts with a Right value and is dropped.
For example,

    residents `diff` (MapE lastName eqString8, tenants)
returns all the residents that are not tenants, where two persons are considered to represent the same resident if their last names are the same.
3.2
DISTINCT
The SQL DISTINCT clause removes duplicate values from a bag, effectively returning the underlying set. As for EXCEPT this requires a notion of equality on the elements of a bag. As before we allow arbitrary user-definable equivalence relations. We can code DISTINCT as the binary operation coalesceBy, which leaves only one representative from each equivalence class in a bag.

    coalesceBy :: Bag a -> Equiv a -> Bag a
    coalesceBy s e = bag (reps e (list s))
For example,

    employees `perform` Func salaryLevel `coalesceBy` eqInt32
first forms the bag of all salary levels of employees and then returns the salary levels without duplicates.
3.3
GROUP BY
The SQL GROUP BY clause partitions a bag according to an equivalence relation specified by a so-called grouping list. All such equivalences, and more, are definable using the equivalence expressions in Figure 1. The GROUP BY clause can thus be defined as a binary operation that takes a bag and an equivalence expression and returns a partition, represented as a bag of bags.

    groupBy :: Bag a -> Equiv a -> Bag (Bag a)
    groupBy s e = bag (map bag (part e (list s)))
For example,

    employees `groupBy` (MapE salaryLevel eqInt32)
partitions the employees into groups of employees with the same salary level.
3.4
SORT BY
The SQL SORT BY clause (written ORDER BY in standard SQL) is handled completely analogously to GROUP BY. Instead of an equivalence expression it requires the specification of an ordering relation, and it returns a list:

    sortBy :: Bag a -> Order a -> [a]
    sortBy s r = sort r (list s)
For example,

    employees `sortBy` (MapO salaryLevel (Inv ordInt32))
returns the employees as a list, sorted in descending order according to their salary level. Here ordInt32 denotes the standard order on 32-bit integers, and MapO is analogous to MapE: it induces an ordering on the domain of a function from an ordering on its codomain. The complete definition of order expressions and their extensions as (characteristic functions of) ordering relations is given in Figure 3. The definition of sortBy uses sort, an order-expression generic sorting function. It can be implemented using generic order discrimination, which generalizes distributive sorting (Henglein 2008, 2009) from finite types and strings to all order-expression denotable orders, or by employing classical comparison-based sorting; e.g.

    sort r xs = csort (lte r) xs
where csort is a comparison-parameterized function implemented using Quicksort, Mergesort, Heapsort or similar.
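One possible csort, given here only as a sketch, delegates to the merge sort behind Data.List.sortBy. It assumes that its argument is the characteristic function of a total preorder, as lte r is; byScoreDesc and ranked are hypothetical example names.

```haskell
import Data.List (sortBy)

-- Comparison-parameterized sort: converts a "less than or equal" test
-- of a total preorder into an Ordering and reuses Data.List.sortBy.
csort :: (a -> a -> Bool) -> [a] -> [a]
csort lte = sortBy cmp
  where
    cmp x y
      | lte x y && lte y x = EQ   -- equivalent under the preorder
      | lte x y            = LT
      | otherwise          = GT

-- Hypothetical example preorder: higher scores come first.
byScoreDesc :: (String, Int) -> (String, Int) -> Bool
byScoreDesc x y = snd x >= snd y

ranked :: [(String, Int)]
ranked = csort byScoreDesc [("a", 1), ("b", 3), ("c", 2), ("d", 3)]
```

Since Data.List.sortBy is stable, elements that are equivalent under the preorder keep their input order; here ranked is [("b",3),("d",3),("c",2),("a",1)].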
3.5
HAVING
    data Order :: * -> * where
      Nat   :: Int -> Order Int
      Triv  :: Order t
      SumL  :: Order t1 -> Order t2 -> Order (Either t1 t2)
      PairL :: Order t1 -> Order t2 -> Order (t1, t2)
      MapO  :: (t1 -> t2) -> Order t2 -> Order t1
      BagO  :: Order t -> Order [t]
      SetO  :: Order t -> Order [t]
      Inv   :: Order t -> Order t

    lte (Nat n) x y =
      if 0 <= x && x <= n && 0 <= y && y <= n
        then x <= y
        else error "Argument out of allowed range"
    lte Triv _ _ = True
    lte (SumL r1 r2) (Left x)  (Left y)  = lte r1 x y
    lte (SumL r1 r2) (Left _)  (Right _) = True
    lte (SumL r1 r2) (Right _) (Left _)  = False
    lte (SumL r1 r2) (Right x) (Right y) = lte r2 x y
    lte (PairL r1 r2) (x1, x2) (y1, y2) =
      lte r1 x1 y1 && if lte r1 y1 x1 then lte r2 x2 y2 else True
    lte (MapO f r) x y = lte r (f x) (f y)
    lte (BagO r) xs ys = lte (MapO (sort r) (listL r)) xs ys
    lte (SetO r) xs ys = lte (MapO (usort r) (listL r)) xs ys
    lte (Inv r) x y = lte r y x

Figure 3: Order expressions and the (characteristic functions of) ordering relations (total preorders) denoted by them.

The SQL HAVING clause is used in combination with the GROUP BY clause. It retains only those groups of values that satisfy a user-provided predicate (on bags!). Since our bags can contain elements of any type, not just primitive ordered types, HAVING requires no special attention, but can be coded using select:

    having :: Bag (Bag a) -> Pred (Bag a) -> Bag (Bag a)
    having = select
For example,

    employees `groupBy` (MapE salaryLevel eqInt32)
              `having` (Pred (\emps -> size emps >= 6))
computes the salary groups with at least 6 employees each.
3.6
Aggregation
Aggregation functions are arguably the essential extension of SQL over relational algebra and the main cause for switching from sets to bags in the semantics of SQL. Since all inputs and outputs of SQL queries are required to be bags of primitive values, different intermediate data types such as the result of grouping operations need to be coupled with operations yielding primitive values again. This is why aggregation functions are coupled with grouping and sorting clauses in SQL. We are not restricted in the same fashion and can completely separate aggregation functions from grouping. Notably, any binary function and start value can be extended to an aggregation function on bags:

    fold :: Bag a -> (a -> b -> b, b) -> b
    fold (s1 `U` s2) (f, n) = fold s1 (f, fold s2 (f, n))
    fold s (f, n) = foldr f n (list s)  -- default clause
If the input function f satisfies f v1 (f v2 w) = f v2 (f v1 w), fold yields a semantically deterministic function; otherwise, the result may depend on the order in which the elements of the bag are enumerated by list. For example,

    employees `groupBy` (MapE salaryLevel eqInt32)
              `having` (Pred (\emps -> size emps >= 6))
              `fold` (union, Empty)
computes the employees belonging to salary level groups with at least 6 members each.
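The determinacy condition on f can be observed directly. The sketch below repeats Bag, list and fold and folds two different representations of the same bag {1, 2, 3}, once with addition, which satisfies the condition, and once with list construction, which does not. The values rep1 and rep2 are hypothetical examples.

```haskell
{-# LANGUAGE GADTs #-}

data Bag a where
  Empty :: Bag a
  S     :: a -> Bag a
  U     :: Bag a -> Bag a -> Bag a
  X     :: Bag a -> Bag b -> Bag (a, b)

list :: Bag a -> [a]
list Empty       = []
list (S v)       = [v]
list (s1 `U` s2) = list s1 ++ list s2
list (s1 `X` s2) = [ (v1, v2) | v1 <- list s1, v2 <- list s2 ]

fold :: Bag a -> (a -> b -> b, b) -> b
fold (s1 `U` s2) (f, n) = fold s1 (f, fold s2 (f, n))
fold s           (f, n) = foldr f n (list s)  -- default clause

-- Two representations of the same bag {1, 2, 3}.
rep1, rep2 :: Bag Int
rep1 = S 1 `U` (S 2 `U` S 3)
rep2 = (S 3 `U` S 1) `U` S 2

-- Addition satisfies the condition: both folds agree.
sum1, sum2 :: Int
sum1 = rep1 `fold` ((+), 0)
sum2 = rep2 `fold` ((+), 0)

-- List construction does not: the representation becomes observable.
elems1, elems2 :: [Int]
elems1 = rep1 `fold` ((:), [])
elems2 = rep2 `fold` ((:), [])
```

Both sums are 6, whereas elems1 is [1,2,3] and elems2 is [3,1,2].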
4
Examples
To demonstrate how the library works, we first present a series of examples of how to express queries both in SQL and using our library, and then we show how our library can be used to express queries over non-normalized data.
    data City = City {
      cid   :: Int,
      cname :: String
    } deriving (Eq, Show, Read)

    data Country = Country {
      code       :: String,
      name       :: String,
      population :: Int,
      capital    :: Maybe Int
    } deriving (Eq, Show, Read)

    data CountryLanguage = CountryLanguage {
      countryCode :: String,
      language    :: String,
      isOfficial  :: Bool,
      percentage  :: Float
    } deriving (Eq, Show, Read)

    CREATE TABLE City (
      ID int(11) NOT NULL,
      Name char(35) NOT NULL,
      PRIMARY KEY (ID)
    );

    CREATE TABLE Country (
      Code char(3) NOT NULL,
      Name char(52) NOT NULL,
      Population int(11) NOT NULL,
      Capital int(11),
      PRIMARY KEY (Code)
    );

    CREATE TABLE CountryLanguage (
      CountryCode char(3) NOT NULL,
      Language char(30) NOT NULL,
      IsOfficial char(1) NOT NULL,
      Percentage float(4,1) NOT NULL,
      PRIMARY KEY (CountryCode,Language)
    );

Figure 4: World Database SQL schemas and corresponding Haskell record declarations; some columns of the original data set are left out for clarity of presentation.
4.1
World database
For our first series of examples we shall use the world sample database from MySQL.com (wor 2010). The database consists of three tables declared by the schemas in Figure 4, shown together with the corresponding Haskell record declarations.
4.1.1
Finding the capital
The SQL query for finding the capital of each country is:

    SELECT City.Name, Country.Name
    FROM City, Country
    WHERE Capital = ID

When we want to write a similar query with our library we need to handle an issue that SQL brushes under the carpet. Notice that the Capital column does not have a NOT NULL constraint in the SQL schema in Figure 4 and, correspondingly, the capital field in the Haskell record has type Maybe Int. Thus, what should be done with countries that have a NULL value instead of an integer key in the Capital column? Our first solution is a query that uses PAnd to only select those countries that have a capital, and we use a join condition to match the remaining identifiers. Finally we use perform to project out the cname and name fields:

    findCapitals cities countries =
      (cities `prod` countries)
        `select` ((TT `PAnd` Pred hasCapital) `SAnd`
                  ((Func cid, Func $ value . capital) `In` eqInt32))
        `perform` (Par (Func cname) (Func name))
      where
        hasCapital = maybe False (\_ -> True) . capital
        value (Just x) = x
While findCapitals computes the right result, the predicate used with select is not as elegant as we would like. Thus, for our second solution we define a specialized equivalence relation that works similarly to the SQL semantics. That is, we lift the matching of fields to take Nothing values into account. This is done using the maybeE equivalence constructor defined earlier:

    findCapitals' cities countries =
      (cities `prod` countries)
        `select` ((Func $ Just . cid, Func capital) `In` maybeE eqInt32)
        `perform` (Par (Func cname) (Func name))
This computes the same result as our first formulation of findCapitals and, furthermore, it can be thought of as a mechanical translation of the original SQL query. Note that, in contrast to the SQL query, our translation does not iterate through the full cross-product. Our query only uses time linear in the sum of the sizes of cities and countries.
4.1.2
Group by language
To construct aggregate queries over groups of countries with the same official language, we need the skeleton SQL query:

    SELECT Language, ...
    FROM CountryLanguage, Country
    WHERE IsOfficial = 'T' AND Code = CountryCode
    GROUP BY Language
where the ellipsis denotes the computation we want to perform on each group. With our library we can first compute the grouping we want:

    groupByLanguage :: Bag (Bag (CountryLanguage, Country))
    groupByLanguage =
      (countryLanguage `select` Pred isOfficial)
        `prod` countries
        `select` ((Func countryCode, Func code) `In` eqString8)
        `groupBy` (MapE (language . fst) eqString8)
Because we now have an explicit representation of the grouping, we can reuse it for various “reports”. First, we might want to generate a list of pairs of a language and a list of countries where that language is the official language:

    countriesWithSameLang :: Bag (Bag (CountryLanguage, Country)) -> Bag (String, [String])
    countriesWithSameLang groupByLang =
      groupByLang `perform` (Func report)
      where
        report bag = bag `perform` (Par (Func language) (Func name))
                         `fold` (collect, ("", []))
        collect (lang, n) (_, ns) = (lang, n : ns)
Second, we can generate a top 10 of the most spoken official languages:

    languageTop10 :: Bag (Bag (CountryLanguage, Country)) -> [(String, Int)]
    languageTop10 groupByLang =
      take 10 $ groupByLang `perform` (Func report)
                            `sortBy` sumOrder
      where
        report bag = (lang bag, bag `perform` (Func summary)
                                    `fold` ((+), 0))
        lang = language . fst . head . list
        summary (l, c) = percentage l `percentOf` population c
        percentOf p r = round $ (p * fromIntegral r) / 100
        sumOrder = Inv $ MapO snd ordInt32
4.2
Working with non-normalized data
Our second series of examples does not have natural SQL counterparts, as we will be working with non-normalized data, namely lists and trees. Our running example is a custom data type of file system elements, which represent snapshots of a file system directory structure, without the contents of the files it contains:

    data FilesystemElem = File FilePath
                        | Directory FilePath [FilesystemElem]
where FilePath is synonymous with String. (How to generate a FilesystemElem from an actual file system is outside the scope of this paper.)
4.2.1
Finding duplicate file names
To find files with the same name, but in different places in the file system, we first flatten the file system to a list of files with the full path represented as a list in reverse order. We construct a product of the list with itself, and select those pairs that have the same file name, but different paths:

    findDuplicates :: FilesystemElem -> Bag ([FilePath], [FilePath])
    findDuplicates fs =
      fsBag `prod` fsBag
        `select` (((Func head, Func head) `In` eqString8) `SAnd`
                  (notP $ (Func id, Func id) `In` eqStrings))
      where
        fsBag = bag $ flatten fs
        eqStrings = listE eqString8
Recall that notP denotes negation, eqString8 equality on strings of 8-bit characters, and listE eqString8 equality on lists of such strings.
4.2.2
Finding duplicate directories
To find directories with the same content based on file and directory names, we first flatten the file system into a list of pairs: the full path of the directory and a file system element, one for each Directory element in the file system. Then we form a product of the list with itself and select pairs that are duplicates of each other, but with different paths:

    findDuplicateDirs fs =
      fsNonEmpty `prod` fsNonEmpty
        `select` (((Snd, Snd) `In` dirEq) `SAnd`
                  (notP $ (Fst, Fst) `In` eqStrings))
      where
        fsBag = bag $ dirs [] fs
        fsNonEmpty = fsBag `select` (Pred (not . null . snd))
        dirs path (Directory d cont) = (fullpath, cont) : subdirs
          where
            fullpath = d : path
            subdirs = [d | c <- cont, d <- dirs fullpath c]
        dirs _ _ = []
        eqStrings = listE eqString8
But what exactly is dirEq, which captures what it means for the contents of a (sub)directory to be a “duplicate” of another?
To start with, note that FilesystemElem is isomorphic to Either FilePath (FilePath, [FilesystemElem]). One direction of the isomorphism can be defined by
  fromFs (File f) = Left f
  fromFs (Directory d fs) = Right (d, fs)
With this we can recursively define structural equality on file system elements:
  fsEq0 :: Equiv FilesystemElem
  fsEq0 = MapE fromFs (SumE eqString8 (PairE eqString8 (listE fsEq0)))
In words, two file system elements are structurally equal if they both are either files with the same name or both are directories with the same name and containing FilesystemElems that are, recursively, pairwise structurally equal. As it is not guaranteed that the files and subdirectories in a directory are listed in the same order, however, we need to discover cases where a directory is a duplicate of another even though their contents are listed in different orders. This can be accomplished by replacing listE by BagE, the bag-equivalence constructor:
  fsEq1 :: Equiv FilesystemElem
  fsEq1 = MapE fromFs (SumE eqString8 (PairE eqString8 (BagE fsEq1)))
Using SetE instead of BagE we can even allow inadvertently duplicated file names and subdirectories and still identify the directories as duplicates:
  fsEq2 :: Equiv FilesystemElem
  fsEq2 = MapE fromFs (SumE eqString8 (PairE eqString8 (SetE fsEq2)))
Now we have the desired definition of dirEq:
  dirEq = SetE fsEq2
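One way to build intuition for the three equivalences is through canonical forms: two elements are equivalent exactly when their normal forms are equal. The sketch below is plain Haskell over ordinary lists and does not use the library's Equiv constructors; it swaps in sorting and duplicate removal as a stand-in technique, and the names norm0, norm1, norm2, equivBy and dirEqSketch are hypothetical.

```haskell
import Data.List (nub, sort)

data FilesystemElem
  = File FilePath
  | Directory FilePath [FilesystemElem]
  deriving (Eq, Ord, Show)

-- Order and multiplicity of children both matter (the reading of listE).
norm0 :: FilesystemElem -> FilesystemElem
norm0 = id

-- Children are normalized and sorted: order is ignored, multiplicity is
-- kept (the reading of BagE).
norm1 :: FilesystemElem -> FilesystemElem
norm1 (File f)         = File f
norm1 (Directory d cs) = Directory d (sort (map norm1 cs))

-- Children are normalized, sorted and deduplicated: order and multiplicity
-- are both ignored (the reading of SetE).
norm2 :: FilesystemElem -> FilesystemElem
norm2 (File f)         = File f
norm2 (Directory d cs) = Directory d (nub (sort (map norm2 cs)))

-- Equivalence induced by a normalization function.
equivBy :: Eq b => (a -> b) -> a -> a -> Bool
equivBy norm x y = norm x == norm y

-- Set-equivalence on directory contents, in the spirit of dirEq.
dirEqSketch :: [FilesystemElem] -> [FilesystemElem] -> Bool
dirEqSketch xs ys = canon xs == canon ys
  where canon = nub . sort . map norm2
```

For example, a directory listing x, y and one listing y, x are distinguished by norm0 but identified by norm1, while a listing with x repeated is identified with both only by norm2.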
4.2.3
Skipping the cross-product
In the previous two examples we have used the SQL idiom of forming a self-join and then filtering the resulting pairs for duplicates. For this particular pattern of queries, however, following the SQL idiom is not the most efficient approach. Instead, we can use groupBy directly on the (flattened) data, without first creating the cross-product. Thus, an alternative formulation of the findDuplicates function is:
  findDuplicates' :: FilesystemElem → Bag (Bag [FilePath])
  findDuplicates' fs = fsBag `groupBy` fpEq `having` (Pred (\s → size s ≥ 2))
    where fsBag = bag $ flatten fs
          fpEq = MapE head eqString8
This function returns a bag of bags, each of size at least 2 and containing files that have the same name but different paths. Note that this is a more succinct representation of the result compared to findDuplicates, which uses the SQL self-join idiom.
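The grouping formulation can be imitated on ordinary lists to see what shape the result takes. The sketch below is not the library's groupBy or having: it uses Data.List.groupBy after sorting, which is a different (comparison-based) technique, and the names fileName and duplicateGroups are hypothetical.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- File name of a reversed path; the empty path is mapped to the empty name.
fileName :: [FilePath] -> FilePath
fileName (n : _) = n
fileName []      = ""

-- Group reversed paths by file name and keep only the groups with at least
-- two members, i.e. the names that occur in more than one place.
duplicateGroups :: [[FilePath]] -> [[[FilePath]]]
duplicateGroups paths =
  filter ((>= 2) . length)
    (groupBy ((==) `on` fileName) (sortOn fileName paths))
```

Each group lists every path with a given name once, whereas the self-join formulation reports every ordered pair of such paths, which is why the grouped result is the more succinct of the two.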
5
Performance
We demonstrate the performance profile of our library on a number of benchmarks. All benchmark tests were performed on a lightly loaded MacBook Pro with an Intel Core i5 CPU and 8 GB of RAM. To orchestrate the benchmarks we use the Haskell library criterion (O’Sullivan 2010), which automatically ensures that the benchmarks are iterated enough times to fit with the resolution of the clock. Furthermore, criterion performs a bootstrap analysis on the timings to check that spikes in the load from other programs running on the computer do not skew the results. Each timing reported is the average of 100 samplings. Since the standard deviations of these averages are all negligible, we do not report them.
5.1
Simple cross product
Our first benchmark forms a simple self-join of a bag of consecutive integers and then selects those pairs from the cross-product whose two integers are equal:
  simpleCross n = (nats `prod` nats) `select` ((Func id, Func id) `In` eqInt32)
    where nats = bag [1 .. n]
For comparison, we perform the same experiment using SQLite (Hipp 2010), by creating an in-memory table with the schema:
  CREATE TABLE Nats (value INT NOT NULL);
and then performing the following query:
  SELECT COUNT(*) FROM Nats AS n1, Nats AS n2 WHERE n1.value = n2.value;
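To make the two evaluation strategies for this query concrete, the following sketch counts the matching pairs of a self-join in two ways on ordinary lists. It is an illustration only: it is not the library's join algorithm and says nothing about how SQLite plans the query, and the grouped version uses sorting, so it runs in O(n log n) rather than linear time. The names nestedLoopCount and groupedCount are hypothetical.

```haskell
import Data.List (group, sort)

-- Nested iteration: inspects all n * n pairs of the cross-product and keeps
-- those whose components are equal.
nestedLoopCount :: [Int] -> Int
nestedLoopCount xs = length [() | x <- xs, y <- xs, x == y]

-- Grouping first: equal values are collected into runs, and a run of length
-- k contributes k * k matching pairs. Non-matching pairs are never visited.
groupedCount :: [Int] -> Int
groupedCount xs = sum [k * k | run <- group (sort xs), let k = length run]
```

Both functions return the same count for any input; only the number of pairs they examine differs.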
Figure 5 shows the running time for the two experiments with different numbers of elements. Figure 5(a) shows the running times for generic bags. As expected, the running time grows linearly with the number of elements. Likewise, Figure 5(b) shows the running times for SQLite, which exhibits a quadratic running time behavior. Note that the graph in Figure 5(b) uses log-log scale. One might argue that the experiment is not valid, because the SQL schema does not have an index on the value column, which would prevent the quadratic blow-up. However, the point is not that the quadratic behavior can be avoided, but to illustrate that (a) it is easy to construct an example that exhibits quadratic behavior if one forgets an index, and (b) it is not necessary to figure out where to put indexes to get linear running time using our library for binary joins. We obtain a linear running time in this example without an index. Neither storage nor compute time for allocating and maintaining an index is required.

[Plot: time in milliseconds against number of elements]

  Number of elements   Time (ms)
              10,000      20
              50,000      99.1
             100,000     202
             150,000     335
             200,000     428

(a) With generic bags, running time in milliseconds.

[Plot: time in seconds against number of elements, log-log scale]

  Number of elements   Time (s)
               1,250      0.25
               2,500      1.01
               5,000      4.09
              10,000     16.12

(b) With SQLite, running time in seconds.

Figure 5: Running time for a simple join of two collections of consecutive integers.

              Files   findDuplicates   findDuplicatesGB   findDuplicateDirs
  ghc          3301               38                 15                 112
  testsuite    5732               84                 38                 264
  full         9934              165                 78                 521

Figure 6: Running times (in milliseconds) for various queries over file system elements.
5.2
Non-normalized Data
Figure 6 shows the running time for some of the queries from Section 4.2. The data used for the experiment are file system elements created from various Haskell open source projects:
• ghc is the source distribution for the Glasgow Haskell Compiler version 6.10.4.
• testsuite is the source distribution of the test-suite for the Glasgow Haskell Compiler version 6.10.4.
• full is a directory containing the above two elements plus the source distribution of the Haskell Platform version 2010.1.0.0.
The Files column gives a count of the number of files and directories found in the various data sets. findDuplicatesGB is the findDuplicates' function from Section 4.2.3 with the extra post-processing step of unfolding the result to be equivalent to the result returned by findDuplicates.
6
Discussion
We briefly discuss related work on data structures, language-integration, expression, and optimization of relational queries below, as well as variations and future work on semantics, expressiveness and further asymptotic optimizations.
6.1
Lazy data structures
Henglein (2010) has shown how to represent collections (there interpreted as sets) using symbolic (“lazy”) unions and cross-products, as summarized in Section 2. The idea of thinking of collections as constructed from a symmetric union operation instead of an element-by-element cons operation (lists) is at the heart of the Boom hierarchy and developed in the Bird-Meertens formalism (Backhouse 1989) for program calculation. Early on, Skillicorn observed its affinity with data-parallel implementations (Skillicorn 1990), which has received renewed interest in connection with MapReduce-style parallelization (Dean and Ghemawat 2004; Steele 2009). Representing cross-products lazily, however, does not seem to have received similar attention despite its eminent simplicity and usefulness in supporting naïve programming with cross-products without incurring a quadratic run-time blow-up.
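The idea of symbolic unions and cross-products can be illustrated with a small data type in which both operations are constructors rather than computations. This is a minimal sketch under our own assumptions, not the representation used by the library: the type MSet, its constructors and the functions size and toList are hypothetical names chosen for the example.

```haskell
{-# LANGUAGE GADTs #-}

-- Symbolic multiset: a union or a cross-product is recorded as a
-- constructor and is not multiplied out when it is built.
data MSet a where
  Elems :: [a] -> MSet a
  Union :: MSet a -> MSet a -> MSet a
  Cross :: MSet a -> MSet b -> MSet (a, b)

-- Number of elements, computed from the symbolic structure. A product of
-- two multisets costs one multiplication instead of an enumeration of pairs.
size :: MSet a -> Integer
size (Elems xs)  = fromIntegral (length xs)
size (Union s t) = size s + size t
size (Cross s t) = size s * size t

-- Explicit enumeration. Only here is a cross-product multiplied out.
toList :: MSet a -> [a]
toList (Elems xs)  = xs
toList (Union s t) = toList s ++ toList t
toList (Cross s t) = [(x, y) | x <- toList s, y <- toList t]
```

With this representation, the size of the product of two 100,000-element multisets is obtained without ever materializing its ten billion pairs; an operation that inspects the Cross constructor can likewise choose a strategy other than enumeration.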
6.2
Language-integrated querying
Exploiting the natural affinity of the relational model with functional programming has led to numerous proposals for integrating persistent database programming into a functional language in a type-safe fashion, initiated by Buneman in the early 80s; see e.g. Buneman et al. (1981, 1982); Atkinson and Buneman (1987); Ohori and Buneman (1988); Ohori (1990); Breazu-Tannen et al. (1992); Tannen et al. (1992); Buneman et al. (1995). More recently, a number of type-safe embedded query languages have been devised that provide for interfacing with SQL databases: HaskellDB (Leijen and Meijer 1999), Links (Cooper et al. 2006; Lindley et al. 2009) and Ferry (Grust et al. 2009a). The latter two allow formulating queries in a rich query language, which is mapped to SQL queries. HaskellDB is the basis of Microsoft’s popular LINQ technology (Meijer et al. 2006), which provides a typed language for formulating queries, admitting multiple query implementations and bindings to data sources. Execution of a query consists of building up an abstract syntax representation of the query and shipping it off to the data source manager for interpretation. In particular, the data can be internal (in-memory) collections processed internally, as we do here. However, we provide an actual query evaluator with an efficient join implementation, not just an abstract syntax tree generator or a naïve (nested loop) interpreter.
6.3
Comprehensions
Wadler (1992) introduces comprehending monads for expressing queries in relational calculus style using list comprehensions. Trinder and Wadler (1988, 1989) demonstrate how relational calculus queries can be formulated in this fashion and how common algebraic optimizations on queries have counterparts on list comprehensions. Peyton Jones and Wadler (2007) extend list comprehensions with language extensions to cover SQL-style constructs, notably SORT BY and GROUP BY. These comprehension-based formulations are based on nested iteration with the attendant quadratic run-time cost, however. Instead we forfeit the possibility of formulating arbitrary dependent products of the form [ ... | x