A family of nested query languages for semi-structured data
N. Bidoit, S. Maabout, M. Ykhlef
LaBRI, Université de Bordeaux I, 351 cours de la Libération, 33405 Talence
{bidoit, maabout, my}@labri.u-bordeaux.fr
Abstract. Semi-structured data are commonly represented by labelled graphs. The labels may be strings, integers, etc.; thus their type is atomic. They are carried by edges and/or nodes. In this paper, we investigate a nested graph representation of semi-structured data: some nodes of our graphs may be labelled by graphs. Our motivation is to bring the data model in use closer to the natural presentation of data, in particular closer to the Web presentation. The main purpose of the paper is to provide query languages of reasonable expressive power for querying nested db-graphs.
1 Introduction
Recently, semi-structured data have attracted a lot of attention from the database research community (see for example [8, 4, 3, 10, 9, 5, 6]). This interest is motivated by a wide range of applications such as genome databases, scientific databases, libraries of programs, digital libraries, on-line documentation and electronic commerce. Semi-structured data arise under a variety of forms: data in BibTeX or LaTeX files, HTML or XML files, data integrated from independent sources. Clearly enough, the main challenge is to provide suitable data models and languages to represent and manipulate semi-structured data. Several proposals have been made [13, 7, 6]. The representation of semi-structured data by graphs with labels on edges and/or vertices is shared by almost all proposals.

In a previous paper [6], we proposed a model for semi-structured data, called the db-graph model, together with a logic query language in the style of relational calculus. The contribution of [6] resides in a sound, theoretically founded language. Furthermore the language, called Graph(Fixpoint), has been shown to be more expressive than other proposals such as Lore [3], UnQL [8] and WebSQL [10].

In this paper, we study an extension of our db-graph model for representing semi-structured data. This extension allows one to have db-graphs whose vertex labels are db-graphs themselves. It is widely recognized that the difficulties arising when defining a model and a language to describe and manipulate semi-structured data are (i) a partial knowledge of the structure and (ii) a potentially deeply nested structure.
Bringing the data model closer to the natural presentation of data stored via Web documents is the main motivation behind nesting db-graphs. To illustrate the use of nested graphs, let us consider the case of XML documents. Assume we have an XML file book.xml. Its contents and its graph representation are described below.
. . . document heading . . .
Databases John Smith
...
[Graph of book.xml: two edges labelled title and author, leading to the data Databases and John Smith respectively.]
Clearly, title and author are considered as attribute names whereas Databases and John Smith are data. Now suppose that we add to this document the lines which describe an anchor pointing to the file contents.xml.
Table of contents
The view of book.xml becomes:
[Graph of book.xml: edges labelled title, author and contents, leading to the data Databases, John Smith and Table of contents respectively.]
Furthermore, assume that contents.xml is as described below.
. . . document heading . . .
Relational Databases Object Databases
...
Thus, we may enrich our initial graph to represent the contents of the second document. To this purpose, we may consider two alternatives. The first one uses a "flat graph" while the second uses a "nested graph". The two representations are shown in Figure 1(a) and Figure 1(b) respectively. We believe that the second graph is a more faithful representation of the real world. Indeed, it helps to distinguish between the structure of book.xml and that of the file
. . . document heading . . .
Databases John Smith Table of Contents Relational databases Object databases
...
whose graph representation is actually a flat graph identical to the one in Figure 1(a).
[Figure panels: (a) Flat graph; (b) Nested graph. Both panels use the edge labels title, author, contents, Section1 and Section2 and the data Databases, John Smith, Table of contents, Relational databases and Object databases.]
Fig. 1. Two possible representations

Let us now consider another situation. The Web can be viewed as a graph. Its nodes are files in particular formats. A node of the Web db-graph may well be a structured document (e.g., the annual report of the LaBRI database group in PDF format). As such, this node can itself naturally be represented by a graph. Another node of the Web may be a simple text file (e.g., describing the topics of interest of the LaBRI database group). This node will carry a simple label. Finally, there may exist a node providing an interface to an integrated movie database built from several sources containing information about movies, movie theaters in Bordeaux, ..., and comments about movies provided by the LaBRI database group members. In this case again, the Web node is in fact valued by a db-graph: the db-graph (let us call it the movie db-graph) represents the integrated sources of information.

It is easy to go one step further in the modeling process. Let us consider those nodes of the movie db-graph corresponding to the comments of the LaBRI database group members about movies. Some of these nodes may well be simple and carry simple text. Others may be complex and contain structured documents. This situation leads to a second level of nesting in the Web db-graph. Applications bring many examples of this kind. It is of course possible to represent these semi-structured data by flat db-graphs. The work presented in this paper investigates the gain in using a nested representation. The approach is similar to that of the nested relational model [1] compared to the relational model.

We mainly focus on providing languages to query nested db-graphs. Two languages are proposed. Both are formally defined in a calculus style as extensions of the language introduced in [6]. The two languages differ in the knowledge of the structure required to express queries. The first language, called Graph(Nest-Path), is based on a nested path language Nest-Path.
In order to specify a query with this language, it is required that the user knows the levels of nesting of the searched information. In contrast, the second language, called Graph(Emb-Path), has a higher degree of declarativeness: the user does not need to have any knowledge about the level of nesting of the wanted information.
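[Figure 2 content: the node n1 has an edge labelled imdb to the complex node n2, whose db-graph (with nodes such as n21 and n22) uses the edge labels title, director, type and episode and holds the data Psycho, Hitchcock, Destiny, Chahine, Cosby Show, Sandrich, Singletary, movie and tvserie. The simple node n3 is the root of a flat tree with edges labelled movie, title, director and genre and the data Destiny, Chahine, Apocalypse Now, Coppola and drama. The complex node n4 holds the relation pariscope, with edges labelled theater, title and schedule and the tuples (Gaumont, Destiny, 20:30) and (George V, Psycho, 22:15).]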
Fig. 2. A nested db-graph.

The paper is organized as follows. Section 2 is devoted to the presentation of our nested semi-structured data model. Section 3 prepares the definitions of the graph query languages by introducing common notions. In fact, Section 3 introduces a parameterized graph query language Graph(L) where L, the parameter, is a path query language. In Section 4, we instantiate L by a path query language based on nested path formulas, called Nest-Path, which extends the language Path introduced in [6]. The advantages and drawbacks of Graph(Nest-Path) are discussed. The main problem arising while using Graph(Nest-Path) to query a nested db-graph is that it requires knowledge of the levels of nesting. Section 5 tries to solve this problem and provides a new instantiation of L called Emb-Path. Further research topics conclude the paper.
2 Data model
Nested db-graphs (db-graphs for short) are a generalization of flat db-graphs in the sense that some nodes are allowed to be themselves db-graphs. Hence, in the next definition, we shall distinguish two sets of nodes: a set $N_s$ of simple nodes and a set $N_c$ of complex nodes.

Definition 1 (Nested db-graph). Let $V$ be an infinite set of atomic values. Let $N_s$ be a set of simple nodes and $N_c$ a set of complex nodes. Let $E$ be a set of edges (identifiers). Let $org$ and $dest$ be functions assigning to an edge respectively its origin and its destination. Let $\lambda$ be a polymorphic labelling function for nodes and edges such that labels of simple nodes and edges are atomic values in $V$ whereas labels of complex nodes are db-graphs. A nested db-graph $G = (N_s, N_c, E, \lambda, org, dest)$ is defined recursively by:
1. If $N_c$ is empty then $G = (N_s, N_c, E, \lambda, org, dest)$ is a nested db-graph. In this case, $G$ is called a flat db-graph.
2. $G = (N_s, N_c, E, \lambda, org, dest)$ is a nested db-graph over the db-graphs $G_1, \ldots, G_k$ if for each $n \in N_c$, $\lambda(n)$ is one of the db-graphs $G_1, \ldots, G_k$.

Example 1. Figure 2 is an example of a nested db-graph. It contains two complex nodes $n_2$ and $n_4$. The destination $n_2$ of the edge labelled by imdb contains a db-graph providing information about movies and tv-series. The complex node $n_4$ is labelled by a db-graph representing a relational database having a single relation pariscope(theater, title, schedule). The simple node $n_3$ is the root of a flat db-tree giving information about movies too.

Let us consider a nested db-graph $G = (N_s, N_c, E, \lambda, org, dest)$ over the db-graphs $G_1, \ldots, G_k$. We provide the reader with a few remarks and notations. In order to avoid ambiguity, we use the term node of $G$ to refer to an element of $N_s$ or $N_c$, and we use the term embedded node of $G$ to refer to either a node of $G$, a node of some $G_i$, or an embedded node of some $G_i$.
For instance, if one considers the db-graph drawn in Figure 2, the nodes $n_1$, $n_2$, $n_3$ are nodes of $G$; the nodes $n_{21}$ and $n_{22}$ are not direct nodes of $G$, they are embedded in the node $n_2$. When necessary, we use the notations $N_s(G)$, $N_c(G)$ and $N_{emb}(G)$ to refer respectively to the set of simple nodes of $G$, the set of complex nodes of $G$ and the set of embedded nodes of $G$.

We make the assumption that $N_s$ and $N_c$ are disjoint sets (a node cannot be simple and complex at the same time). We also make the assumption that $N_s$ and $N_c$ both have an empty intersection with $N_{emb}(G_i)$, for any $i$. Intuitively, a node cannot belong to several levels of nesting in $G$ (see below).

Two nodes can be linked by several edges, although each pair of such edges should carry different labels. Thus an edge is totally defined by its origin $o$, its destination $d$ and its label $l$. We sometimes use the notation $(o, d, l)$ to refer to that edge.
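The definition and notations above can be made concrete with a small data structure. The following is only a sketch under assumptions of ours: the class name `DbGraph`, its field names and the use of `None` for a simple node whose data is irrelevant are all hypothetical, and edges are stored directly as (origin, destination, label) triples rather than as identifiers with $org$ and $dest$ functions.

```python
from dataclasses import dataclass, field


@dataclass
class DbGraph:
    """A nested db-graph: a node label is an atomic value or another DbGraph."""
    labels: dict = field(default_factory=dict)  # node id -> atomic value or DbGraph
    edges: set = field(default_factory=set)     # (origin, destination, label) triples

    def simple_nodes(self):
        """N_s(G): nodes carrying an atomic label."""
        return {n for n, lab in self.labels.items() if not isinstance(lab, DbGraph)}

    def complex_nodes(self):
        """N_c(G): nodes labelled by a db-graph."""
        return {n for n, lab in self.labels.items() if isinstance(lab, DbGraph)}

    def is_flat(self):
        return not self.complex_nodes()

    def embedded_nodes(self):
        """N_emb(G): nodes of G plus, recursively, nodes of every embedded db-graph."""
        result = set(self.labels)
        for n in self.complex_nodes():
            result |= self.labels[n].embedded_nodes()
        return result

    def depth(self):
        """0 for a flat db-graph, else 1 + the largest depth of an embedded db-graph."""
        return max((1 + self.labels[n].depth() for n in self.complex_nodes()), default=0)


# A cut-down fragment in the spirit of Figure 2: n1 --imdb--> n2, with n2 complex.
inner = DbGraph(labels={"n21": None, "n22": "Psycho"},
                edges={("n21", "n22", "title")})
g = DbGraph(labels={"n1": None, "n2": inner},
            edges={("n1", "n2", "imdb")})
```

Here `g.embedded_nodes()` contains `n21` and `n22` although they are not nodes of `g` itself, which mirrors the distinction drawn above between nodes and embedded nodes.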
A path is a sequence of edges $\langle e_1, e_2, \ldots, e_k \rangle$ such that $dest(e_i) = org(e_{i+1})$ for $i = 1..k-1$.
In the following, as is usually done, we restrict our attention to simple paths, that is, paths with no multiple occurrences of the same edge. Note here again that when considering a path of $G$, embedded db-graphs are not expanded. The empty path is denoted $\varepsilon$. The mappings $org$ and $dest$ are extended to paths in a natural way.

The embedded db-graphs $G_1, \ldots, G_k$ of $G$ are at level 1 of nesting in $G$. The set of all embedded db-graphs of $G$ is composed of $G_1, \ldots, G_k$ and of their embedded db-graphs. The notion of level of nesting in $G$ of an embedded db-graph is the natural one. By a path $p$ of level $l$ in $G$, we mean a path $p$ of an embedded db-graph of level $l$. Recall that we make the assumption that an embedded node cannot belong to more than one level of nesting. Thus the same holds for a path.

Atomic labels are carried by simple nodes and edges of $G$ and also by simple nodes and edges of embedded db-graphs of $G$. This entails that a label may occur at different levels of nesting in $G$. For instance, in Figure 2, the atomic value title labels two embedded edges, one at level 0 and the other at level 1. In the following, atomic values labelling nodes are called data and atomic values labelling edges are called labels.

A node $r$ of a graph $G$ is a source for $G$ if there exists a path from $r$ to any other node of $G$. A graph is rooted if it has at least one source.
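The notions of simple path, source and rooted graph can be illustrated by a short sketch. It assumes, as a simplification of ours, that a graph is given by a set of node identifiers and a set of (origin, destination, label) triples; the function names are hypothetical. Only one level of the graph is traversed, in line with the remark that embedded db-graphs are not expanded.

```python
def simple_paths(edges, start):
    """Yield every non-empty simple path (no edge used twice) starting at `start`.

    A path is a tuple of edges; each edge is an (origin, destination, label) triple.
    """
    def extend(path, node):
        for e in edges:
            if e[0] == node and e not in path:
                new_path = path + (e,)
                yield new_path
                yield from extend(new_path, e[1])

    yield from extend((), start)


def is_source(nodes, edges, r):
    """True if every node other than r is the destination of a path starting at r."""
    reached = {p[-1][1] for p in simple_paths(edges, r)}
    return nodes - {r} <= reached


def is_rooted(nodes, edges):
    """True if the graph has at least one source."""
    return any(is_source(nodes, edges, r) for r in nodes)


# A small cyclic graph: a -> b -> c -> b (node and label names are illustrative).
nodes = {"a", "b", "c"}
edges = {("a", "b", "movie"), ("b", "c", "title"), ("c", "b", "back")}
```

Because a path may not reuse an edge, the enumeration terminates even on the cyclic graph above. Enumerating all simple paths is exponential in general; a plain reachability search would be enough to test for sources, and the enumeration is kept here only to stay close to the definitions.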
3 Querying semi-structured data: preliminaries
In this section, we introduce the general setting which serves the definition of our languages. In the literature, most of the languages proposed for querying semi-structured data represented by a graph are based on a similar approach, although the paradigms used are different. A query is meant to extract sub-graphs from the initial graph or, equivalently, the roots of these sub-graphs. What are the criteria participating in the specification of the sub-graph retrieval? Essentially, the languages allow one to express two kinds of properties: (i) reachability of the root of the output sub-graphs via a specific path; (ii) existence, in the output sub-graph, of some specific paths. It is easy to see that both kinds of property rely on paths. This explains why most languages, like ours [6], are based on an intermediate language which retrieves paths in a graph. These path languages often use path expressions as their basic constructs. Their expressive power determines the expressive power of the semi-structured query language based on them. In the following, for the purpose of the presentation, we will view a semi-structured query language as a language parameterized by a language of path formulas $L_{path}$.

The languages considered in this paper are calculi. Thus we suppose that four sorts of variables are on hand: path and graph variables denoted by $X, Y, Z, \ldots$, label variables denoted by $x, y, z, \ldots$ and data variables denoted by $\alpha, \beta, \gamma, \ldots$ We sometimes use bold capital characters $\mathbf{X}, \mathbf{Y}, \ldots$ when the sort of the variable is irrelevant. A term is defined as usual. A path formula $\varphi$ of $L_{path}$ may have free variables. All free variables of $\varphi$ are forced to be of the sort path. Defining the semantics of $L_{path}$ is done by defining the relation $G, \nu \models \varphi$ for any db-graph $G$ and any valuation $\nu$ of the free variables in $\varphi$. We now have all the ingredients to define a db-graph query language parameterized by the path formulas in $L_{path}$.

Definition 2. A query in the language Graph($L_{path}$) is an expression of the form $\{X_1, \ldots, X_n \mid \psi\}$ where the $X_i$'s are graph variables and, moreover, the $X_i$'s are the only free variables of the graph formula $\psi$. An atomic graph formula is an expression of one of the following forms:
1. a path formula in $L_{path}$,
2. $t : X$ where $t$ is a graph term and $X$ is a path variable,
3. $X[t]$ where $X$ is a path variable and $t$ is a graph term,
4. $t_1 \sim t_2$ where both $t_1$ and $t_2$ are graph terms and where $\sim$ is the symbol of bisimulation [12].
A general graph formula is defined in the usual way by introducing connectives and quantifiers.

Roughly speaking, $t : X$ means that $X$ is a path whose origin is a root of the graph $t$. Hence a formula of the kind $t : X$ expresses what has been introduced above as a retrieval condition of kind (ii): existence, in the output sub-graph, of some specific path. Intuitively, the atom $X[t]$ checks whether the graph $t$ is rooted at the destination of the path $X$. This formula illustrates what we have introduced above as a retrieval condition of kind (i): reachability of the root of the output sub-graphs via a specific path.

The formal definitions of graph formula satisfaction and of graph query answering are not developed here. The interested reader can find these definitions in [6]. As a matter of fact, in [6] we defined three path languages, namely Path, Path-Fixpoint and Path-While.
The expressive power of the graph languages G-Fixpoint, defined as Graph(Path-Fixpoint), and G-While, defined as Graph(Path-While), was investigated. In the next sections, using the same techniques as in [6], we define two nested db-graph query languages. The first one is based on a path language called Nest-Path and the second one is based on Emb-Path.
4 Nest-Path: a nested path calculus
In this section, we introduce Nest-Path, a path expression language which extends Path by nesting (simple) path expressions. A path expression in Path is an abstraction for describing paths which have a level of nesting equal to 0. Here we need a mechanism to abstract paths belonging to any level of nesting.

Let us first illustrate the language Nest-Path. A (simple) path expression (in Path) can be of the form $x \triangleright \alpha$, and then the intention is to describe all paths reduced to a single edge labelled by $x$ whose destination is a simple node labelled by $\alpha$. We extend these simple expressions by considering the more general form $l \triangleright q$ where $q$ is a path query. Such a nested path expression is meant to specify paths reduced to a single edge labelled by $l$ and whose destination is a complex node on which the query $q$ should evaluate to a non-empty set of paths.

Example 2. Let $q : \{Y \mid Y \in title\}$ be a path query. Then $imdb \triangleright q$ is a nested path expression. It specifies paths of level 0 reduced to a single edge labelled by imdb and whose destination is a complex node labelled by a nested graph $G_i$ on which the query $q$ evaluates to a non-empty set of paths of level 1, reduced to one edge labelled by title. The single path of the db-graph of Figure 2 spelled by $imdb \triangleright q$ is the edge $(n_1, n_2, imdb)$. Note that the query $q$ evaluated on the contents of the complex node $n_2$ returns the 3 edges whose destinations are respectively labelled by "Destiny", "Psycho" and "Cosby Show".
4.1 Nest-Path's syntax
We now formally define nested path expressions.
Definition 3 (Nest-Path). A nested path expression is recursively defined by:
1. A path variable or a label term is a nested path expression. In this case, it is both pre-free (the origin of the path is not constrained) and post-free (the destination of the path is not constrained).
2. (a) $s \triangleright t$, (b) $t \triangleleft s$, (c) $s \triangleright q(X_1, \ldots, X_n)$ and (d) $q(X_1, \ldots, X_n) \triangleleft s$ are nested path expressions if $s$ is a nested path expression. In cases (b) and (d), $s$ is required to be pre-free and in cases (a) and (c), $s$ is required to be post-free. Here $t$ is a data term, $q$ is a nested path query of arity $n$ (see Definition 4) and the $X_i$ are path variables. The nested path expressions of types (a) and (c) are post-bound. Those of types (b) and (d) are pre-bound.
3. $s_1.s_2$ is a nested path expression when $s_1$ and $s_2$ are nested path expressions, respectively post-free and pre-free. $s_1.s_2$ is pre-free (resp. post-free) as soon as $s_1$ (resp. $s_2$) is.
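The pre-free and post-free conditions of Definition 3 amount to a small recursive check. The sketch below is an illustration under assumptions of ours: expressions are encoded as nested tuples with hypothetical tags (`"var"`, `"label"`, `"post"`, `"pre"`, `"concat"`), data terms and queries are not inspected, and a post-bound expression is assumed to inherit its pre-free status from its sub-expression, which is consistent with the examples of this section but is not stated in the definition itself.

```python
def binding(expr):
    """Return (pre_free, post_free) for a nested path expression, or raise ValueError.

    expr is one of:
      ("var", name) or ("label", l)      item 1
      ("post", s, target)                s |> t  or  s |> q(...)   (cases a and c)
      ("pre", source, s)                 t <| s  or  q(...) <| s   (cases b and d)
      ("concat", s1, s2)                 s1.s2                     (item 3)
    """
    kind = expr[0]
    if kind in ("var", "label"):
        return True, True
    if kind == "post":
        pre, post = binding(expr[1])
        if not post:
            raise ValueError("s must be post-free in s |> t")
        return pre, False               # the destination is now bound
    if kind == "pre":
        pre, post = binding(expr[2])
        if not pre:
            raise ValueError("s must be pre-free in t <| s")
        return False, post              # the origin is now bound
    if kind == "concat":
        pre1, post1 = binding(expr[1])
        pre2, post2 = binding(expr[2])
        if not (post1 and pre2):
            raise ValueError("s1 must be post-free and s2 pre-free in s1.s2")
        return pre1, post2
    raise ValueError(f"unknown expression kind: {kind}")


movie_title = ("concat", ("label", "movie"), ("label", "title"))
bound = ("post", movie_title, "Destiny")
```

With this encoding, `binding(movie_title)` reports an expression that is both pre-free and post-free, while `binding(bound)` reports one that is pre-free and post-bound; concatenating anything after `bound` is rejected.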
Example 3. The expression $movie.title$ captures every path $p$ having two edges labelled successively by movie and title. It is pre-free and post-free since neither the destination nor the origin of $p$ is bound by a data term or a query. The path expression $movie.title \triangleright$ "Destiny" captures paths of the same form whose destination carries the data "Destiny". This expression is pre-free and post-bound. The expression $imdb \triangleright q(X)$ is a pre-free, post-bound nested path expression.

We are now going to complete the former definition to make precise what a path query is. A path formula is built from path expressions as follows:
1. It is a nested path expression.
2. It is $t_1 = t_2$ where $t_1$ and $t_2$ are terms of the same sort.
3. It is $t \in s$ where $t$ is a path term and $s$ is a nested path expression.
4. It is $\varphi \wedge \psi$ (resp. $\varphi \vee \psi$, $\neg\varphi$, $(\exists X)\varphi$ and $(\forall X)\varphi$) where $\varphi$ and $\psi$ are path formulas.
Intuitively, $t \in s$ intends to check that the path $t$ is among the paths spelled by the path expression $s$.

Definition 4 (Nested path query). A nested path query of arity $n$ is of the form $\{X_1, \ldots, X_n \mid \varphi\}$ where $\varphi$ is a nested path formula, for $i = 1..n$, $X_i$ is a path variable, and the set $Free(\varphi)$ of free variables of $\varphi$ is exactly $\{X_1, \ldots, X_n\}$.

Example 4. Let us consider the nested path query $r : \{X \mid imdb \triangleright q(X)\}$ where $q$ is the unary query $\{Y \mid Y \in title\}$. This query is meant to return paths of level 1 embedded in a node which is the destination of an edge of level 0 labelled by imdb. These output paths are reduced to single edges labelled by title.

The preceding example suggests that a level of nesting is associated with variables in path expressions and queries. The notion of level of a variable in a path expression or query is necessary in order to define the semantics. We will restrict our presentation to meaningful examples.

Example 5. In the previous example, the level of the variable $X$ occurring in $r$ is 1.
This is implied by the fact that the level of the variable $Y$ in $q$ is 0 and $X$ is linked to $Y$ by $q$ in the nested path expression $imdb \triangleright q(X)$.

Example 6. Consider the path query $r$ defined by $\{X \mid X \in title \vee imdb \triangleright q(X)\}$ where $q$ is $\{Y \mid Y \in title\}$. The query $r$ returns the edges of level 0 labelled by title (this comes from $X \in title$) as well as the edges of level 1 labelled by title (this comes from $imdb \triangleright q(X)$). Note that, in this case, two levels (0 and 1) are associated with the variable $X$. This situation is considered valid in the framework of nested db-graphs. The motivation is to allow one to retrieve paths at different levels of nesting. In fact, the only case where it is sound to assign multiple levels to a variable arises in disjunctive queries. Note here that the query $r$ could have been split into two conjunctive queries.
Example 7. The expression $\{X \mid X.imdb \triangleright q(X)\}$ where $q : \{Y \mid Y \in title\}$ is not a query because the variable $X$ is assigned two levels (0 and 1) and a concrete path cannot be assigned two different levels.

In [15], a procedure has been defined that assigns levels of nesting to the variables of a path query or path expression. In the following we assume that expressions are well nested and moreover that variables have a unique level (disjunction is ruled out without loss of generality).
4.2 Nest-Path's semantics
The semantics of Nest-Path is provided by giving a precise definition of the function $Spell$. Given a db-graph $G$ and a path expression $s$, the function $Spell$ gives the paths of $G$ or embedded in $G$ which conform to the expression $s$. In order to define $Spell$, we first need to tell how variables are valuated. As usual in the framework of calculi for querying databases [2], valuations rely on the active domain of the underlying db-graph $G$ and expression $s$. The formal definition of the active domain is cumbersome and we will avoid its presentation here. The general idea is that a variable $X$, for instance a path variable, cannot be valuated by an arbitrary path. It should be valuated by a path $p$ in $G$ or in an embedded graph of $G$, depending on the level of nesting associated with the variable $X$. So in fact, we always valuate a pair $(X, l)$, composed of a variable and its level in the path expression $s$ or in a query $r$, by a value in the graph of corresponding level $l$ of nesting. For the sake of simplicity, in the rest of the paper, we use the term valuation to abbreviate active domain valuation sound w.r.t. level assignment.

Definition 5 (Spelling). Let $G = (N_s, N_c, E, \lambda, org, dest)$ be a db-graph. $Var(s)$ is the set of variables appearing in the expression $s$. Let $\nu$ be a valuation of $Var(s)$. The set of simple paths spelled by $s$ in $G$ with respect to $\nu$, denoted by $Spell_G(\nu(s))$, is defined as follows:
1. if $s$ is $X$ then $Spell_G(\nu(s)) = \{\nu(X)\}$,
2. if $s$ is a label term $t$ then $Spell_G(\nu(s)) = \{\langle e \rangle \mid e \in E \text{ and } \lambda(e) = \nu(t)\}$,
3. if $s$ is $s_1 \triangleright t$ then $Spell_G(\nu(s)) = \{p \mid p \in Spell_G(\nu(s_1)) \text{ and } \lambda(dest(p)) = \nu(t)\}$; $Spell_G(\nu(t \triangleleft s_1))$ is defined in a similar fashion,
4. if $s$ is $s_1 \triangleright q(X_1, \ldots, X_n)$ then $Spell_G(\nu(s)) = \{p \mid p \in Spell_G(\nu(s_1)) \text{ and } q(\lambda(dest(p))) \neq \emptyset \text{ and } (\nu(X_1), \ldots, \nu(X_n)) \in q(\lambda(dest(p)))\}$, where $q(\lambda(dest(p)))$ denotes the answer (see Definition 7 below) of the path query $q$ evaluated on the db-graph labelling the destination of the path $p$; $Spell_G(\nu(q(X_1, \ldots, X_n) \triangleleft s_1))$ is defined in a similar way,
5. if $s$ is $s_1.s_2$ then $Spell_G(\nu(s)) = \{p_1.p_2 \mid p_1 \in Spell_G(\nu(s_1)) \text{ and } p_2 \in Spell_G(\nu(s_2)) \text{ and } dest(p_1) = org(p_2) \text{ and } p_1.p_2 \text{ is simple}\}$.

Example 8. Let $G$ be the db-graph of Figure 2. Let $s$ be the path expression $imdb \triangleright q(X)$ where $q : \{Y \mid Y \in title\}$. Consider the valuation $\nu$ of $X$ (whose level in $s$ is 1) by the path $\langle (n_{21}, n_{22}, title) \rangle$ of level 1; then it is easy to see that $\nu(X)$ belongs to the answer of the query $q$ evaluated on the db-graph labelling the complex node $n_2$. In fact, $Spell_G(\nu(s)) = \{\langle (n_1, n_2, imdb) \rangle\}$.

Definition 6 (Satisfaction of a path formula). Let $\varphi$ be a path formula and $\nu$ a valuation of $Free(\varphi)$. A db-graph $G$ satisfies $\varphi$ w.r.t. $\nu$, denoted $G \models \varphi[\nu]$, if
1. $\varphi$ is a path expression and $Spell_G(\nu(\varphi)) \neq \emptyset$,
2. $\varphi$ is $t_1 = t_2$ where $t_1$ and $t_2$ are both terms and $\nu(t_1) = \nu(t_2)$,
3. $\varphi$ is $t \in s$ where $t$ is a path term and there exists an embedded db-graph $G'$ of level equal to that of $t$ such that $\nu(t) \in Spell_{G'}(\nu(s))$,
4. satisfaction for $(\varphi \wedge \psi)$, $\neg\varphi$, $(\exists X)\varphi$ or $(\forall X)\varphi$ is defined as usual.
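To make the spelling function more tangible, here is a minimal sketch of $Spell$ restricted to variable-free expressions. It is an illustration under assumptions of ours, not the definition itself: a graph is a pair `(labels, edges)`, a complex node is recognised by having such a pair as its label, expressions are tagged tuples with hypothetical tags, and a nested query is passed as a Python function returning a set of paths. The membership test on valuated variables in item 4 is dropped, so only the non-emptiness condition is checked.

```python
def spell(graph, expr):
    """Return the set of paths (tuples of edges) of `graph` spelled by `expr`.

    graph: (labels, edges); labels maps a node to an atomic value or to a nested
           (labels, edges) pair; edges are (origin, destination, label) triples.
    expr:  ("label", l) | ("data", s, value) | ("query", s, q) | ("concat", s1, s2)
    """
    labels, edges = graph
    kind = expr[0]
    if kind == "label":        # item 2: single edges labelled l
        return {(e,) for e in edges if e[2] == expr[1]}
    if kind == "data":         # item 3: s |> t, destination carries the data t
        return {p for p in spell(graph, expr[1]) if labels[p[-1][1]] == expr[2]}
    if kind == "query":        # item 4 (simplified): q is non-empty on the destination
        out = set()
        for p in spell(graph, expr[1]):
            target = labels[p[-1][1]]
            if isinstance(target, tuple) and expr[2](target):
                out.add(p)
        return out
    if kind == "concat":       # item 5: s1.s2, keeping simple paths only
        return {p1 + p2
                for p1 in spell(graph, expr[1])
                for p2 in spell(graph, expr[2])
                if p1[-1][1] == p2[0][0] and not set(p1) & set(p2)}
    raise ValueError(f"unknown expression kind: {kind}")


# A fragment in the spirit of Figure 2 (node ids other than n1, n2, n21, n22 are ours).
inner = ({"n21": "", "n22": "Psycho", "n23": "Hitchcock"},
         {("n21", "n22", "title"), ("n21", "n23", "director")})
outer = ({"n1": "", "n2": inner, "n3": ""},
         {("n1", "n2", "imdb"), ("n1", "n3", "movie")})


def q(g):
    """The unary query {Y | Y in title}."""
    return spell(g, ("label", "title"))


result = spell(outer, ("query", ("label", "imdb"), q))
```

On this fragment `result` contains the single path made of the edge `("n1", "n2", "imdb")`, as in Example 8. Concatenating `imdb` with `title` at level 0 yields nothing, since embedded db-graphs are not expanded when spelling a path of the outer graph.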
Definition 7 (Path query answer). Let $q = \{X_1, \ldots, X_n \mid \varphi\}$ be a path query and $\nu$ be a valuation of $Free(\varphi)$. The answer of $q$ evaluated over the db-graph $G$ is $q(G) = \{(\nu(X_1), \ldots, \nu(X_n)) \mid G \models \varphi[\nu]\}$. Hence, a path query takes a db-graph as an input and outputs tuples of paths.

The following example illustrates how the language Nest-Path can be used to combine data which may reside in different embedded graphs.

Example 9. Let us consider once more the db-graph of Figure 2. Assume that the user knows about the structure of the information in node $n_2$ as well as the structure of the information in node $n_4$ (these structures may have been extracted by an intelligent engine). Thus he/she knows that $n_2$ provides titles together with directors of movies and $n_4$ provides titles together with theaters. Now assume that the user just wants to combine this information to obtain titles, directors and theaters together. He/she may express this by the following query $r : \{F, D, T \mid s\}$ where $s$ is the following path formula
$imdb \triangleright q_1(F, D) \wedge (\exists U)(\exists F_m)(U \triangleright q_2(F_m, T) \wedge (\exists \alpha)(F \in title \triangleright \alpha \wedge F_m \in title \triangleright \alpha))$
and
$q_1 : \{X_1, Y_1 \mid (\exists Z_1)(Z_1.X_1 \wedge X_1 \in title \wedge Z_1.Y_1 \wedge Y_1 \in director)\}$
$q_2 : \{X_2, Y_2 \mid (\exists Z_2)(Z_2 \in pariscope \wedge Z_2.X_2 \wedge X_2 \in title \wedge Z_2.Y_2 \wedge Y_2 \in theater)\}$
The query $q_1$ collects information (title and director) from the complex node $n_2$ and the query $q_2$ returns information (title and theater) from $n_4$. The combination is performed by the condition $(\exists \alpha)(F \in title \triangleright \alpha \wedge F_m \in title \triangleright \alpha)$ that appears in $r$. This combination acts like a join.
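The join-like behaviour of this condition can be sketched in a few lines. The code below is a simplification under assumptions of ours: the two embedded db-graphs are given as plain dictionaries and edge sets with hypothetical node identifiers, the quantified prefix paths $Z_1$ and $Z_2$ are reduced to the requirement that two edges leave the same node, and the result lists data values instead of paths.

```python
def edges_labelled(edges, label):
    """Edges carrying the given label, in a deterministic order."""
    return sorted(e for e in edges if e[2] == label)


def pairs_sharing_origin(edges, label_a, label_b):
    """Pairs (a, b) of edges labelled label_a and label_b that leave the same node."""
    return [(a, b)
            for a in edges_labelled(edges, label_a)
            for b in edges_labelled(edges, label_b)
            if a[0] == b[0]]


# Contents of the two complex nodes, with the data shown in Figure 2.
imdb_data = {"m1": "", "t1": "Destiny", "d1": "Chahine",
             "m2": "", "t2": "Psycho", "d2": "Hitchcock"}
imdb_edges = {("m1", "t1", "title"), ("m1", "d1", "director"),
              ("m2", "t2", "title"), ("m2", "d2", "director")}
paris_data = {"p1": "", "pt1": "Destiny", "th1": "Gaumont",
              "p2": "", "pt2": "Psycho", "th2": "George V"}
paris_edges = {("p1", "pt1", "title"), ("p1", "th1", "theater"),
               ("p2", "pt2", "title"), ("p2", "th2", "theater")}

q1 = pairs_sharing_origin(imdb_edges, "title", "director")   # (F, D) pairs
q2 = pairs_sharing_origin(paris_edges, "title", "theater")   # (Fm, T) pairs

# The condition on alpha: both title edges end on nodes carrying the same data.
r = sorted(
    (imdb_data[f[1]], imdb_data[d[1]], paris_data[t[1]])
    for f, d in q1
    for fm, t in q2
    if imdb_data[f[1]] == paris_data[fm[1]]
)
```

The shared data variable plays the role of the join attribute: a triple is kept only when the title found in the first graph equals the title found in the second.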
Now that we have a way to select paths (and embedded paths) in a db-graph, we naturally can select sub-graphs (and embedded sub-graphs) via the language Graph(Nest-Path). Let us reuse the preceding example to illustrate this language.

Example 10. The reader may have noticed that the path query $r$ returns triples of paths although it seems more natural for a user to get triples of nodes (the destinations of the paths) as an answer. Intuitively, this is what is performed by the following graph query: $\{X, Y, Z \mid (\exists F)(\exists D)(\exists T)(s \wedge F[X] \wedge D[Y] \wedge T[Z])\}$ where $s$ is the path formula defined in the previous example.
4.3 Expressiveness
In this section we investigate the expressive power of Nest-Path. Essentially, we compare the new path language Nest-Path with the simple path language Path. The next result relies on a translation of nested db-graphs into flat db-graphs and it shows that a Nest-Path query $q_{nest}$ can be translated into a Path query $q$ in such a way that evaluating $q_{nest}$ over a db-graph $G$ is equivalent to evaluating $q$ over the corresponding flat db-graph. Intuitively, this result tells us that Nest-Path is not more expressive than Path. It can be compared in spirit to the result of [14, 1], which bridges a gap between the nested and the flat relational models. The simple language Path is not redefined here. The reader can assume that Path is just like Nest-Path without nesting, that is, without the constructs 2.(c) and 2.(d) of Definition 3.
Theorem 1. There exist a mapping $S$ from db-graphs to flat db-graphs and a mapping $T$ from Nest-Path to Path such that if $q \in$ Nest-Path and $G$ is a nested db-graph then $q(G) = T(q)(S(G))$.

Before we proceed to a short proof, let us recall [6] the notion of maximal sub-graph of a graph $G$. A graph $G'$ rooted at $r$ is maximal w.r.t. $G$ if each path in $G$ starting at the source $r$ is also a path in $G'$. Moreover $G'$ is assumed to be a sub-graph of $G$. Finally, $G'$ is a cover of $G$ if it is maximal and if it is not a sub-graph of any other maximal sub-graph.
Sketch of proof. The mapping $S$ unnests a nested db-graph $G$ by repeating the process described below until $G$ becomes totally flat.
1. Every complex node $n$ of $G$ is replaced by a simple node $n_{flat}$. Consider the label of $n$, which is a graph $G_n$.
2. The nodes, edges, etc. of $G_n$ are added to the nodes, edges, etc. of $G$.
3. For each cover $G'$ of $G_n$, a root $r$ is selected and an edge $(n_{flat}, r, \%)$ is added to the edges of $G$. Note here that % is a new special label introduced for the purpose of unnesting.
Notice that $S$ is not deterministic because of the choice of a root (see point 3 above) for a cover of $G_n$. Figure 3 shows a flat db-graph which is the result of unnesting the db-graph of Figure 2 by applying $S$.

The translation of a path expression $s_{nest}$ of Nest-Path is not very complicated (although it is overly technical) once the transformation of a nested db-graph into a flat db-graph is defined. In fact, the only difficulty is to capture, in the flat path expression translating $s_{nest}$, the nested path queries $q$ that appear in sub-expressions of the kind $X \triangleright q(Y)$. The level of $Y$ is the information allowing one to make the translation. Let us consider the expression $s_{nest} = X \triangleright q(Y)$ where $q$ is the path query $\{Z \mid Z \in title\}$. The translation of $s_{nest}$ is $X.\%.Y \wedge Y \in title$. The fact that the level of $Y$ in $s_{nest}$ is 1 is reflected by the requirement of one edge labelled by % before $Y$.

It is then easy to see that Graph(Nest-Path) can be simulated by Graph(Path) by using the transformations $S$ and $T$ of Theorem 1.
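The unnesting steps above can be sketched as a short recursive function. This is an illustration under assumptions of ours rather than the mapping $S$ itself: a graph is a dictionary with hypothetical keys `labels`, `edges` and `root`, each embedded graph is assumed to have a single cover whose selected root is stored under `root`, node identifiers are assumed to be globally unique, and the flattened node $n_{flat}$ simply reuses the identifier of $n$ with an empty label.

```python
def flatten(graph):
    """Return (labels, edges) of a flat db-graph obtained by unnesting `graph`.

    graph is a dict with keys:
      "labels": node -> atomic value, or a nested graph dict for a complex node
      "edges":  set of (origin, destination, label) triples
      "root":   the root selected for the (single, by assumption) cover of the graph
    """
    labels, edges = {}, set(graph["edges"])
    for node, lab in graph["labels"].items():
        if isinstance(lab, dict):                   # complex node n
            sub_labels, sub_edges = flatten(lab)    # unnest deeper levels first
            labels.update(sub_labels)               # step 2: add nodes of G_n
            edges |= sub_edges                      # step 2: add edges of G_n
            labels[node] = ""                       # step 1: n becomes simple (n_flat)
            edges.add((node, lab["root"], "%"))     # step 3: link n_flat to the root
        else:
            labels[node] = lab
    return labels, edges


inner = {"labels": {"n21": "", "n22": "Psycho"},
         "edges": {("n21", "n22", "title")},
         "root": "n21"}
outer = {"labels": {"n1": "", "n2": inner},
         "edges": {("n1", "n2", "imdb")},
         "root": "n1"}

flat_labels, flat_edges = flatten(outer)
```

After flattening, the title edge that was at level 1 is reached from `n2` through exactly one edge labelled `%`, which is what the translation $X.\%.Y \wedge Y \in title$ relies on.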
5 Emb-Path: a fixpoint nested path calculus
When using Graph(Nest-Path) for extracting data from a semi-structured database represented by a nested db-graph, the following problem arises. How could the user write a query if he/she does not know the level of nesting of the data to be retrieved? For instance, consider that the user is seeking all possible movie or tv-serie titles. The database of interest is the one depicted in Figure 2. The user may not be aware that some of the data he/she is looking for are at level 1 of nesting and may try to get what he/she wants by writing the query $\{X \mid title[X]\}$. The answer returned by this query is not complete w.r.t. the user's expectation. Now, the user may have already navigated in the database and happen to know that data are nested at least one level deep. Thus he/she will probably write the following expression to retrieve movie and tv-serie titles: $\{X \mid title[X] \vee (\exists U)\, U \triangleright r(X)\}$ where $r : \{Y \mid title[Y]\}$. However, it may well be the case that the data are nested one level more, and so on.

First, it is unrealistic to require that the user (and even the data management system) knows the depth of nesting. Secondly, even in the case where the depth $d$ of the nested db-graph is known, writing a query such as the one above may become highly complicated if $d$ is large. In this section, we introduce a new path language called Emb-Path to overcome the above-mentioned problem and gain a degree of declarativeness by allowing the user to write queries such as the one of the motivating example without knowing the depth of the db-graph.
[Figure 3 content: the db-graph of Figure 2 after unnesting. The complex nodes n2 and n4 have become simple nodes, and new edges labelled % link them to the selected roots of the db-graphs they contained; all other edges and data are those of Figure 2.]
Fig. 3. Flattening a nested db-graph
5.1 Emb-Path's syntax and semantics
Roughly speaking, an embedded path formula is specified by two Path formulas. The evaluation of the embedded path formula specified by $\varphi$ and $\psi$ is an iterative process. At each step, the role of the first formula $\varphi$ is simply to extract paths of interest. At each step, the second formula $\psi$ serves to access some specific complex nodes in the db-graph, the ones which are going to be unnested in the sense that the labels of these complex nodes, which are db-graphs, are going to be processed (queried) at the next iteration. Thus in fact the second formula $\psi$ allows one to navigate in depth into the db-graph (not necessarily in the whole db-graph). Formally,
Definition 8 (Emb(Path,Path)). Let $\varphi$ be a Path formula with $n$ free path variables $X_1, \ldots, X_n$ and let $\psi$ be a Path formula with one free path variable $Y$. Then $Emb(\varphi, \psi)$ (read $\varphi$ is embedded in $\psi$) is an embedded path expression of arity $n$. Given a db-graph $G$, $Emb(\varphi, \psi)$ denotes the $n$-ary relation which is the limit of the sequence $\{x_k\}_{k \geq 0}$ defined by:
1. $x_0$ is the answer to the query $\{(X_1, \ldots, X_n) \mid \varphi\}$ evaluated on $G$, and $y_0$ is the answer to the query $\{(Y) \mid \psi\}$ evaluated on $G$.
2. $x_{k+1} = x_k \cup \varphi(y_k)$ and $y_{k+1} = y_k \cup \psi(y_k)$, where $\varphi(y_k)$ (resp. $\psi(y_k)$) is the union of the answers to the query $\{(X_1, \ldots, X_n) \mid \varphi\}$ (resp. $\{(Y) \mid \psi\}$) evaluated on the db-graphs labelling the destinations of the paths $p \in y_k$ (when this destination is a complex node).
An atomic embedded formula is an expression of the form $\mathrm{Emb}(\varphi, \psi)(t_1, \ldots, t_n)$ where $t_1, \ldots, t_n$ are path terms and $\mathrm{Emb}(\varphi, \psi)$ is an embedded path expression of arity $n$. Note that a Path formula $\varphi$ having $n$ free variables can be translated by embedding $\varphi$ in false: no complex node is then ever unnested, so the result is the answer to $\varphi$ on $G$ itself. The language of embedded formulas is denoted Emb-Path as an abbreviation of Emb(Path,Path). It leads to a graph query language in a straightforward manner by considering Graph(Emb-Path). Note that the above definition can be generalized in a standard way by allowing $\varphi$ and $\psi$ to be embedded formulas themselves. In this case, the language of embedded formulas is called Emb(Emb-Path,Emb-Path).

Example 11. Let us consider the introductory query example: the user is trying to gather all movie or tv-serie titles in the database, no matter at which level of nesting they are. In order to write this query, we use the Emb-Path expression $\mathrm{Emb}(\varphi, \psi)$ of arity one where $\varphi$ is $U \in \mathit{title}$ and $\psi$ is $V$. The graph query which returns all the titles is written in Graph(Emb-Path) as:
$\{X \mid (\exists Y)\; \mathrm{Emb}(\varphi, \psi)(Y) \wedge Y[X]\}$
This graph query simply extracts the subgraphs rooted at the destinations of the paths $Y$ satisfying the embedded formula $\mathrm{Emb}(\varphi, \psi)(Y)$. What kind of paths does the embedded formula select? Because the path formula $\varphi$ is $U \in \mathit{title}$, $\mathrm{Emb}(\varphi, \psi)$ returns embedded paths reduced to edges labelled by title. Because the path formula $\psi$ is $V$, all complex nodes are explored and unnested, and thus $\mathrm{Emb}(\varphi, \psi)$ returns all paths reduced to edges labelled by title and belonging to any level of nesting.

Example 12. Let us consider a second example having the same structure as the previous one, except that the formula $\psi$ is now replaced by $V \in \mathit{imdb}$. As in the previous example, because the path formula $\varphi$ is $U \in \mathit{title}$, $\mathrm{Emb}(\varphi, \psi)$ returns embedded paths reduced to edges labelled by title. However, because the path formula $\psi$ is now $V \in \mathit{imdb}$, the only complex nodes explored and unnested are the ones at the destination of an edge labelled by imdb. Thus $\mathrm{Emb}(\varphi, \psi)$ here returns all paths reduced to edges labelled by title and belonging to any embedded db-graph reachable by an edge labelled by imdb.
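The iteration of Definition 8 and the two examples above can be sketched in Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: paths are restricted to single edges (as in Examples 11 and 12), both sequences are assumed to accumulate by union, a Path formula is modelled as a predicate on an edge, and the class `DbGraph`, the helper names and the sample data (including the top-level title "Guide") are hypothetical.

```python
class DbGraph:
    """Toy nested db-graph: labelled edges plus the db-graphs labelling complex nodes."""

    def __init__(self, name, edges, nested=None):
        self.name = name              # used only for readability
        self.edges = edges            # list of (source, label, destination) triples
        self.nested = nested or {}    # complex node -> DbGraph that labels it


def answers(formula, graphs):
    """Paths (here: single edges) of the given db-graphs that satisfy the formula."""
    return {(g, e) for g in graphs for e in g.edges if formula(e)}


def emb(phi, psi, root):
    """Limit of the sequence x_k, with paths restricted to edges (assumption)."""
    x = answers(phi, [root])
    y = answers(psi, [root])
    while True:
        # db-graphs labelling the destinations of the paths in y (complex nodes only)
        inner = [g.nested[e[2]] for (g, e) in y if e[2] in g.nested]
        new_x = x | answers(phi, inner)
        new_y = y | answers(psi, inner)
        if new_x == x and new_y == y:   # both sequences are stable: limit reached
            return x
        x, y = new_x, new_y


# Path formulas as predicates on an edge (source, label, destination).
is_title = lambda e: e[1] == "title"    # U in title
any_path = lambda e: True               # V
via_imdb = lambda e: e[1] == "imdb"     # V in imdb
false = lambda e: False                 # embedding in false: no unnesting

# Hypothetical data loosely inspired by the running example.
imdb = DbGraph("imdb", [("m1", "title", "Destiny"), ("m1", "director", "Chahine"),
                        ("m2", "title", "Psycho")])
pariscope = DbGraph("pariscope", [("s1", "title", "Apocalypse Now"),
                                  ("s1", "genre", "drama")])
root = DbGraph("root", [("r", "imdb", "n1"), ("r", "pariscope", "n2"),
                        ("r", "title", "Guide")],
               nested={"n1": imdb, "n2": pariscope})


def titles(phi, psi):
    """Destinations of the selected title edges, sorted for display."""
    return sorted(e[2] for (_, e) in emb(phi, psi, root))


print(titles(is_title, any_path))  # ['Apocalypse Now', 'Destiny', 'Guide', 'Psycho']
print(titles(is_title, via_imdb))  # ['Destiny', 'Guide', 'Psycho']
print(titles(is_title, false))     # ['Guide']
```

The first call plays the role of Example 11 (every complex node is unnested), the second that of Example 12 (only nodes reached through an imdb edge are unnested), and the third shows that embedding in false reduces to the plain Path query.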
Note that there is another, slightly different way to define a graph query language in the same style. It consists in embedding a graph formula in a path formula, instead of embedding a path formula in a path formula and linking it to a graph formula.

Definition 9 (Emb(Graph,Path)). Let $\varphi$ be a Graph(Path) formula with $n$ free graph variables $X_1, \ldots, X_n$ and let $\psi$ be a Path formula with one free path variable $Y$. Then $\mathrm{Emb}(\varphi, \psi)$ (read $\varphi$ is embedded in $\psi$) is an embedded graph expression of arity $n$. An atomic formula of Emb(Graph,Path) is an expression of the form $\mathrm{Emb}(\varphi, \psi)(t_1, \ldots, t_n)$ where $t_1, \ldots, t_n$ are graph terms.

We do not develop the semantics of the language Emb(Graph,Path). The definition is similar to the one for Emb-Path. We simply illustrate it by the following example.

Example 13. Let us consider the introductory example. It can be expressed in Emb(Graph,Path) by the query $\{X \mid \mathrm{Emb}(\varphi, \psi)(X)\}$ where $\varphi$ is $\mathit{title}[Y]$ and $\psi$ is $V$.
5.2 Expressiveness
Let us now address the following question with respect to expressive power: how are the path languages Nest-Path and Emb-Path related when the depth of the db-graph is known? We first show that Emb-Path is subsumed by Nest-Path, although in a rather weak sense.
Theorem 2. For a given db-graph $G$ whose depth is known we have: for each query $q_{emb}$ in Emb-Path, there exists a query $q_{nest}$ in Nest-Path such that $q_{emb}(G)$ and $q_{nest}(G)$ are equal.
The result above does not imply the inclusion of Emb-Path into Nest-Path because the translation of the query $q_{emb}$ depends strongly on the queried db-graph $G$. In fact this result could be compared to the one showing that the transitive closure of a binary relation can be computed within the relational calculus as soon as one can use the number of tuples of the relation.

Proof. Let $k$ be the depth of $G$. The atomic embedded formula $\mathrm{Emb}(\varphi, \psi)(X_1, \ldots, X_n)$ can be simulated by the formula
$$
\begin{array}{l}
\varphi(X_1, \ldots, X_n) \\
\vee\; (\exists V_1)\, (\, \psi(V_1) \wedge V_1 \in V_1 \,.\, q(X_1, \ldots, X_n) \,) \\
\vee\; (\exists V_1)(\exists V_2)\, (\, \psi(V_1) \wedge V_1 \in V_1 \,.\, r(V_2) \wedge V_2 \in V_2 \,.\, q(X_1, \ldots, X_n) \,) \\
\quad \vdots \\
\vee\; (\exists V_1) \ldots (\exists V_{k-1})\, (\, \psi(V_1) \wedge V_1 \in V_1 \,.\, r(V_2) \wedge \ldots \wedge V_{k-1} \in V_{k-1} \,.\, q(X_1, \ldots, X_n) \,)
\end{array}
$$
where $q : \{(U_1, \ldots, U_n) \mid \varphi\}$ and $r : \{(V) \mid \psi\}$.
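To make the dependence on the depth $k$ concrete, here is a small Python sketch of the unrolling used in the proof. It is a simplified illustration, not the construction itself: paths are restricted to single edges, formulas are predicates on an edge, a flat db-graph is assumed to have depth 1, and the dictionary layout, the function names and the sample data are hypothetical.

```python
# A db-graph is a dict with a name, a list of (source, label, destination) edges,
# and a map from complex nodes to the db-graph labelling them (hypothetical layout).

def depth(g):
    """Nesting depth, assuming a flat db-graph has depth 1."""
    return 1 + max((depth(h) for h in g["nested"].values()), default=0)


def unrolled(phi, psi, root, k):
    """Union of k disjuncts: the j-th follows j psi-edges into complex nodes, then applies phi."""
    level, result = [root], set()
    for _ in range(k):
        # phi-answers at the current level of nesting
        result |= {(g["name"], e) for g in level for e in g["edges"] if phi(e)}
        # db-graphs labelling the destinations of psi-edges (complex nodes only)
        level = [g["nested"][e[2]] for g in level for e in g["edges"]
                 if psi(e) and e[2] in g["nested"]]
    return result


is_title = lambda e: e[1] == "title"   # phi: U in title
any_path = lambda e: True              # psi: V

inner = {"name": "inner", "edges": [("e1", "title", "Cosby Show")], "nested": {}}
imdb = {"name": "imdb",
        "edges": [("m1", "title", "Psycho"), ("m2", "episode", "n2")],
        "nested": {"n2": inner}}
root = {"name": "root", "edges": [("r", "imdb", "n1")], "nested": {"n1": imdb}}


def titles(k):
    return sorted(e[2] for (_, e) in unrolled(is_title, any_path, root, k))


print(depth(root))   # 3
print(titles(2))     # ['Psycho']
print(titles(3))     # ['Cosby Show', 'Psycho']
print(titles(4))     # ['Cosby Show', 'Psycho']
```

Unrolling fewer than `depth(root)` levels misses the deepest title, while unrolling more adds nothing. This is why the translated query is tied to the particular db-graph being queried.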
Concerning the inverse inclusion, proving that Nest-Path $\subseteq$ Emb-Path remains an open question. In contrast to the preceding inclusion, this one would be strong. It would imply that the language Emb-Path makes it possible both (1) to write queries without having to know the depth of the data to be retrieved, and (2) to write queries exploring the db-graph at a given level of nesting. For the time being, even for a simple Nest-Path query like $\{X \mid (\exists U)\; X \in \mathit{imdb} \,.\, q(U)\}$ where $q$ is $\{Y \mid Y \in \mathit{title}\}$, we have neither a translation into Emb-Path nor a proof that no such translation exists. The difficulty here is to find a mechanism that stops the embedding at a specific level of nesting.
6 Concluding Remarks
To conclude, we would like to stress that, although the current paper does not address it, the problem of restricting the languages to tractable queries, i.e. queries computable in polynomial time, has been investigated in [15]. To illustrate the problem, let us consider the path query $\{X \mid X\}$, which returns all simple paths in a db-graph. For instance, suppose that the db-graph nodes are $1, 2, \ldots, 2n$ and that its edges go from $i$ to $i+1$ and to $i+2$, for $i = 1, \ldots, 2n-2$. The number of simple paths from node 1 to node $2n$ then grows exponentially with $n$. This entails that the evaluation of the query $\{X \mid X\}$ is not polynomial. In order to avoid such situations, syntactic restrictions can be used (see [15, 11]).

We are currently studying the open problem mentioned in the previous section. As a matter of fact, we separate the problem into several questions: (i) is Nest-Path included in Emb-Path = Emb(Path,Path)? (ii) is Nest-Path included in Emb(Emb-Path,Emb-Path)? (iii) how is Emb(Graph,Path) related to Graph(Emb-Path)?
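The blow-up in the number of simple paths described above can be checked by brute force on small instances. The following Python sketch builds the db-graph of the example (nodes 1 to 2n, edges from i to i+1 and i+2 for i = 1, ..., 2n-2) and counts the simple paths from node 1 to node 2n. The function names are illustrative only, and the enumeration is deliberately naive.

```python
def ladder(n):
    """Successor lists for nodes 1..2n with edges i -> i+1 and i -> i+2, i = 1..2n-2."""
    succ = {i: [] for i in range(1, 2 * n + 1)}
    for i in range(1, 2 * n - 1):
        succ[i] += [i + 1, i + 2]
    return succ


def count_simple_paths(succ, src, dst, seen=frozenset()):
    """Number of simple paths from src to dst, by exhaustive depth-first enumeration."""
    if src == dst:
        return 1
    seen = seen | {src}
    return sum(count_simple_paths(succ, v, dst, seen)
               for v in succ[src] if v not in seen)


def count_paths(n):
    return count_simple_paths(ladder(n), 1, 2 * n)


print([count_paths(n) for n in (2, 3, 4, 5)])  # [1, 3, 8, 21]
print(count_paths(10))                         # 2584
```

Each increment of n multiplies the count by a factor greater than 2, so merely listing the answer to the query already takes time exponential in the size of the db-graph.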
References
1. S. Abiteboul and N. Bidoit. Non first normal form relations: An algebra allowing data restructuring. Journal of Computer and System Sciences, Academic, 33, 1986.
2. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
3. S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68-88, April 1997.
4. S. Abiteboul. Querying semistructured data. In ICDT, pages 1-18, 1997.
5. N. Bidoit and M. Ykhlef. Fixpoint Path Queries. In International Workshop on the Web and Databases WebDB'98, in conjunction with EDBT'98, pages 56-62, Valencia, Spain, 27-28 March 1998. http://www.dia.uniroma3.it/webdb98/papers.html/.
6. N. Bidoit and M. Ykhlef. Fixpoint Calculus for Querying Semistructured Data. Lecture Notes in Computer Science, 1590:78-98, 1999.
7. P. Buneman, S. B. Davidson, M. Fernandez, and D. Suciu. Adding Structure to Unstructured Data. In Proceedings of International Conference on Database Theory (ICDT), Delphi, Greece, January 1997.
8. P. Buneman, S. B. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM SIGMOD Conference on Management of Data, pages 505-516, Montreal, Canada, June 1996.
9. M. Fernandez, D. Florescu, J. Kang, A. Y. Levy, and D. Suciu. System demonstration - Strudel: A web-site management system. In Proceedings of ACM SIGMOD Conference on Management of Data, Tucson, Arizona, May 1997. Exhibition Program.
10. A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. International Journal on Digital Libraries, 1(1):54-67, 1997.
11. A. Mendelzon and P. Wood. Finding Regular Simple Paths in Graph Databases. SIAM Journal on Computing, 24, 1995.
12. R. Milner. Communication and concurrency. Prentice Hall, 1989.
13. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of IEEE International Conference on Data Engineering (ICDE), pages 251-260, Taipei, Taiwan, March 1995.
14. J. Paredaens and D. Van Gucht. Possibilities and limitations of using flat operators in nested algebra expressions. In Proceedings of PODS conference, 1988.
15. M. Ykhlef. Interrogation des données semistructurées. PhD thesis, Univ. of Bordeaux, 1999.