An Efficient Unification Algorithm for a Logic Database Language for Nested Relations

Yiu-Kai Ng, Qing Chang
Computer Science Department
Brigham Young University
Provo, Utah 84602, U.S.A.
Email: ng@cs.byu.edu
FAX: (801) 378-7775

Abstract

Although efficient unification algorithms exist for logic database queries for flat relations, no efficient unification algorithm has been proposed for logic database queries for nested relations. As a result, the time required to process logic database queries for nested relations is often less than ideal. To overcome this shortcoming, we propose here a linear time unification algorithm for a large class of logic database queries for nested relations. The algorithm provides an efficient evaluation technique for many common nested relational queries. Our paper includes a characterization of the class of nested relational queries that can be handled by our proposed linear time unification algorithm and also the proof of the linearity of the algorithm.

Keywords: Logic programming, logic database language, nested relation, unification.

1 Introduction

A logic database query can be answered by computing bindings, which are generated by unification. It is therefore important, especially for large databases, that unification algorithms be as efficient as possible. Indeed, linear time unification algorithms [PW78] [Cha86] do exist for querying flat relations, as found in ordinary relational databases. However, no efficient unification algorithms have been proposed for querying nested relations, which are the primary components of object-oriented database systems. The purpose of this paper is to propose a linear time unification algorithm that can be used to query nested relations.

The proposed algorithm handles a large and interesting class of queries Q, but does not handle all possible queries over nested relations. The class Q includes queries that are equivalent to those computable by (nested) selection, projection, join, union, difference, and by nest and unnest. The class also includes recursive queries and Horn-clause (Datalog) queries. It does not, however, include queries that involve the general set-term matching or set-term unification problem [KN86] [Sie89]. Indeed, unless P = NP, no algorithm exists for efficiently computing bindings for these queries since the set-term matching and set-term unification problems are NP-complete [KN86] [Sie89].

The language in which we express our logic database queries and whose syntax and semantics we use to prove our result is LDL/NR (a Logic Database Language for Nested Relations) [Cha93]. LDL/NR is a strongly typed language with full expressive power for handling mutually nested tuples and sets. It is thus more faithful to the nested relational model [Hul89] than other strongly typed logic database languages [CC90] [CK91] because it follows the constraint of alternating tuple terms and set terms exactly. It is more complete than other logic database languages such as [Zan85], which allows only nested tuples rather than nested sets, and [AGS92], which is based only on single-level setwise manipulation. These languages are incapable of dealing with an arbitrary nested relational structure.

Queries expressible by LDL/NR properly include the class of queries Q. Indeed, LDL/NR allows users to express queries that involve the general set-term matching and set-term unification problems. Based on the syntax of LDL/NR, however, we are able to characterize the queries that lie in Q, and thus we are able to know which queries expressible by LDL/NR can be processed in linear time. Roughly speaking, queries in Q are those that involve either an entire set term or a single-element set term.

We present the details of LDL/NR and its linear time unification algorithm for queries in Q as follows. In Section 2 we give the formal syntax and semantics of LDL/NR. This establishes a firm foundation for our unification algorithm and also lets us characterize the class of nested relational queries that can be resolved linearly by our proposed algorithm. In Section 3 we give our algorithm, which extends those of [PW78] [Cha86]. Our discussion also includes a proof which shows that the time complexity of the proposed algorithm is linear. We provide some concluding remarks in Section 4.

2 Syntax and Semantics of LDL/NR

2.1 Syntax of LDL/NR

LDL/NR is based on the notions of type, term, formula, rule, fact, and query, which in turn are defined on an alphabet. Constants and variables in LDL/NR are of atomic type. There are two other types in LDL/NR: set type and tuple type. A type in LDL/NR is inductively defined as follows: (i) an atomic type is a type, (ii) a set type is a type, and for a set-type object $r\{s\}$, $s$ is an object of either atomic type or tuple type, and (iii) a tuple type is a type, and for a tuple-type object $p(s_1, \ldots, s_n)$, each $s_i$, $1 \le i \le n$, is an object of either atomic type or set type. (Hence, the type symbols for the set type and tuple type are {} and (), respectively.) Tuple type and set type can be used alternately to form complex data types, as in [CC90] [CK91]. The type declaration dept(Dname, projects{Pname}, employees{employee(Ename, EID)}) defines dept, which is of tuple type with components Dname, which is of atomic type, and projects and employees, which are of set types.

Objects of atomic type are called atomic terms. A term is inductively defined as follows: (i) a constant is a term, (ii) a variable is a term, (iii) a set-type object $r\{s\}$ is a term called a set term, where $s$ is either an atomic term or a tuple term, and (iv) a tuple-type object $p(s_1, \ldots, s_n)$ is a term called a tuple term, where each $s_i$, $1 \le i \le n$, is either an atomic term or a set term. A tuple term of dept, as introduced earlier, is dept(cs, projects{db, se, lp}, employees{employee(smith, 123), employee(jones, 567), employee(snow, 201)}). A set term $r\{X\}$, where $X$ is a variable, denotes a set term of arbitrary cardinality. Further, LDL/NR adopts the Unique Name Assumption on types [CC90], which restricts symbols of different tuple and set terms contained in a term to be distinct. This same restriction also applies to variables of different types in a rule.

A (well-formed) formula is inductively defined as follows: (i) a tuple term or set term is a formula, (ii) if $T$ and $S$ are terms of comparable types, then $T \,\theta\, S$ is a formula, where $\theta$ is a (set) comparison operation, (iii) if $F$ is a formula and $X$ is a variable, then $\exists X\, F$ and $\forall X\, F$ are formulas, and (iv) if $F$ and $G$ are formulas, so are $\neg F$, $F \vee G$, $F \wedge G$, $F \rightarrow G$, and $F \leftrightarrow G$. Further, a tuple term, set term, or (set) comparison operator with arguments is an atom.^1 A literal is an atom or its negation. A ground formula or ground term is a formula or term without variables.

A rule is of the form head :- body, where head is an atom and body is a conjunction of atoms. The rule

works_on_all(Ename) :- employee(Ename, EID), works_on(EID, wprojs{Proj_ID}), projects{PID} ⊆ wprojs{Proj_ID}

retrieves all employees who work on all the projects in projects. A unit rule is a rule with an empty body. A fact is a ground unit rule. In a database, a finite set of facts forms an extensional database (EDB). A program P is a collection of rules such that for each $C \in P$, $C$ is either a fact in EDB or a rule. A goal is a rule without a head, denoted ?- body. A logic database query in LDL/NR consists of a program and a goal.

Example 1 Assume that a nested relational database, which consists of a nested relation parent and a flat relation person, is represented by the following facts:

parent(john, children{child(mary, 13), child(anthony, 10), child(bill, 5)}).
person(john, male).
person(mary, female).
person(anthony, male).
person(bill, male).

and the rule R,

father(X, Y) :- person(X, male), parent(X, Y).

which determines each father X and all of X's children. The logic database query, which consists of the program that includes R and the facts above and the goal ?- father(john, Z), retrieves all children of john. □
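As an illustration of the alternation constraint on tuple terms and set terms, the following minimal Python sketch encodes LDL/NR terms and checks that a term is well formed. The class names (Const, Var, SetTerm, TupleTerm) and the function well_formed are hypothetical and are not part of LDL/NR; numeric constants are stored as strings for simplicity.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    name: str


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class SetTerm:
    name: str
    elems: Tuple["Term", ...]   # elements must be atomic or tuple terms


@dataclass(frozen=True)
class TupleTerm:
    name: str
    args: Tuple["Term", ...]    # components must be atomic or set terms


Term = Union[Const, Var, SetTerm, TupleTerm]


def is_atomic(t):
    return isinstance(t, (Const, Var))


def well_formed(t):
    """Check that tuple terms and set terms alternate, as in Section 2.1."""
    if is_atomic(t):
        return True
    if isinstance(t, SetTerm):
        return all((is_atomic(e) or isinstance(e, TupleTerm)) and well_formed(e)
                   for e in t.elems)
    if isinstance(t, TupleTerm):
        return all((is_atomic(a) or isinstance(a, SetTerm)) and well_formed(a)
                   for a in t.args)
    return False


def child(name, age):
    return TupleTerm("child", (Const(name), Const(age)))


# The parent fact of Example 1.
parent_fact = TupleTerm("parent", (
    Const("john"),
    SetTerm("children", (child("mary", "13"),
                         child("anthony", "10"),
                         child("bill", "5"))),
))
```

Under this encoding, the parent fact of Example 1 is accepted, whereas a tuple term that directly contains another tuple term, or a set term that directly contains another set term, is rejected.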

2.2 Semantics of LDL/NR

The declarative semantics of an LDL/NR program (query) is given by the usual semantics of formulas in LDL/NR. To define formally the meaning of a fact as a logical consequence of a set of facts and rules, we introduce the concept of unification. In a logic database language, an answer substitution is obtained by substituting variables in a query q by a set of data values which yields a desired result of q. This process is called unification, the principle behind resolution in logic programming [Das92], which is an evaluation method that can be used for processing queries in logic database systems.

^1 Each (set) comparison operator optr with arguments arg1 and arg2 can be written as optr(arg1, arg2) and hence is treated as a tuple term.

Example 2 Given the LDL/NR program and goal in Example 1, since john and Z in the goal ?- father(john, Z) can be unified with X and Y, respectively, in the head predicate of the rule father(X, Y) :- person(X, male), parent(X, Y)., and since Y can be unified with children{child(mary, 13), child(anthony, 10), child(bill, 5)} in the first fact, Z can be unified with children{child(mary, 13), child(anthony, 10), child(bill, 5)}, which yields all children of john. □

3 A Unification Algorithm

Queries can be used for retrieving data from databases. This can be done by assigning data values in databases to variables in queries. In logic programming, the assignment of a term $t$ to a variable $V$, denoted $V/t$, is called a binding, assuming that $t$ and $V$ are distinct and $V$ does not occur in $t$. A finite set of bindings of variables $X_1, \ldots, X_n$, denoted $\theta = \{X_1/t_1, \ldots, X_n/t_n\}$, is called a substitution, where $X_i \ne X_j$, $1 \le i, j \le n$, if $i \ne j$. Two substitutions $\theta$ and $\sigma$ can be composed to yield another substitution $\gamma$, denoted $\gamma = \theta \circ \sigma$. The substitution which is an empty set of bindings is called the identity substitution, denoted $\epsilon$.

An expression is either a term, a literal, or a conjunction or disjunction of literals. A simple expression is either a term or an atom. A subexpression of an expression $E$ is an expression contained in $E$. If $S$ is a finite set of simple expressions and there exists a substitution $\theta$ such that $S\theta$ is a singleton, then $S$ is said to be unifiable (by $\theta$), or $\theta$ unifies $S$, and $\theta$ is called a unifier of $S$. If $\theta$ is a unifier of $S$ and for each unifier $\sigma$ of $S$ there exists a substitution $\gamma$ such that $\sigma = \theta \circ \gamma$, then $\theta$ is called a most general unifier (mgu) of $S$.
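The notions of binding, substitution, and composition can be made concrete with a small Python sketch. The encoding is an assumption made only for illustration: a variable is a string starting with an uppercase letter, a constant is any other string, and a tuple term or set term is a Python tuple ("tuple", name, arg, ...) or ("set", name, elem). The helper names apply and compose are hypothetical, and compose follows the usual textbook definition of composition.

```python
def is_var(t):
    """A variable is a string whose first character is uppercase (assumed encoding)."""
    return isinstance(t, str) and t[:1].isupper()


def apply(t, theta):
    """Apply the substitution theta (a dict variable -> term) to the term t once."""
    if is_var(t):
        return theta.get(t, t)
    if isinstance(t, tuple):
        # Keep the kind tag and the name; substitute in the arguments only.
        return t[:2] + tuple(apply(a, theta) for a in t[2:])
    return t  # a constant


def compose(theta, sigma):
    """Return the composition of theta and sigma: first theta, then sigma."""
    out = {x: apply(t, sigma) for x, t in theta.items()}
    out = {x: t for x, t in out.items() if t != x}   # drop trivial bindings X/X
    for y, s in sigma.items():
        if y not in theta:
            out[y] = s
    return out


theta = {"X": ("set", "s", "Y")}      # {X/s{Y}}
sigma = {"Y": ("tuple", "q", "Z")}    # {Y/q(Z)}
gamma = compose(theta, sigma)         # {X/s{q(Z)}, Y/q(Z)}
```

With this definition, applying gamma to an expression gives the same result as applying theta and then sigma, and composing with the empty dictionary (the identity substitution) leaves a substitution unchanged.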

3.1 An Efficient Unification Algorithm for LDL/NR

[Llo87] discussed the unification problem of logic programming and presented a simple but inefficient unification algorithm. [PW78] introduced an efficient unification algorithm based on an equivalence relation defined on the nodes in a directed acyclic graph (DAG); however, the computed substitution for any two expressions E and F represented by a DAG may not be a unifier of E and F. [Cha86] proposed a revised version of the unification algorithm in [PW78] with correction and modification. [Mar82] revealed the pattern matching nature of the unification problem, which can be exploited in many cases where symbolic expressions are dealt with. In this subsection, we propose an efficient unification algorithm which is based on the unification algorithms in [PW78] [Cha86]. These unification algorithms are extended for determining the unifiability of simple expressions in LDL/NR, which handles nested relational database structures.

The input of our algorithm is a DAG D which is set up initially to represent T, a set of two simple expressions E and F in LDL/NR. Common subexpressions in T are represented by a single subgraph in D. A subgraph S in D represents t, an expression or a subexpression of an expression in T, and the root node of S is labeled with the outermost symbol appearing in t and hence represents t. A node denoting either a tuple term, set term, or constant is called a non-variable node, and a node denoting a variable is called a variable node. If t is a tuple term with k components, then the node n which represents t has outdegree k, and the ith component of t, $i \le k$, is referred to as the ith child of n. A set-term node has outdegree 1, and a constant node has outdegree 0. There is a node with outdegree 0 for each distinct variable. The DAG for the two simple expressions p(X, X, X) and p(s{q(a)}, s{q(Z)}, s{Y}) is given in Figure 1(a).

Figure 1: A DAG for p(X, X, X) and p(s{q(a)}, s{q(Z)}, s{Y}) and its Reduced Graph.

Two nodes in D are said to be unifiable if the simple expressions they represent are unifiable. Given any two nodes n1 and n2 in D, if n1 is an ancestor of n2, then n1 and n2 are not unifiable because the expression represented by n1 contains the expression represented by n2. We want to define an equivalence relation R on the set of nodes in D so that given any two nodes u and v in D, if the two expressions that u and v represent are unifiable, then u and v are in the same equivalence class of R, and vice versa. Hence, determining the unifiability of E and F in T is the same as determining whether the two root nodes of D are in the same equivalence class of R. However, we cannot adopt the traditional definition of equivalence relation on the set of nodes in D directly because given any three nodes u, v, and w in D, even though u and v are unifiable and v and w are unifiable, u, v, and w may not be in the same equivalence class of R. For example, if we assume that node s represents the set term s{Y}, node X represents the variable X, and node c represents the constant c, then nodes s and X are unifiable and nodes X and c are unifiable; however, nodes s and c are not unifiable.
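The DAG representation described above can be sketched in Python as follows. This is only an illustration under assumed conventions: a variable is a string starting with an uppercase letter, a constant is any other string, and a tuple term or set term is a Python tuple ("tuple", name, arg, ...) or ("set", name, elem). The function build_dag and its return values are hypothetical, and the sketch makes no attempt to meet the linear time bound.

```python
def build_dag(*exprs):
    """Build a DAG in which each distinct subexpression has exactly one node.

    Returns (roots, nodes, children, parents), where nodes maps an expression
    to its node number, children[i] lists the child nodes of node i in order,
    and parents[i] holds one backward pointer per incoming edge of node i.
    """
    nodes = {}
    children = []
    parents = []

    def visit(e):
        if e in nodes:                      # common subexpression: reuse the node
            return nodes[e]
        kids = [visit(c) for c in e[2:]] if isinstance(e, tuple) else []
        nid = len(children)
        nodes[e] = nid
        children.append(kids)
        parents.append([])
        for k in kids:
            parents[k].append(nid)          # backward pointer from child to parent
        return nid

    roots = [visit(e) for e in exprs]
    return roots, nodes, children, parents


# The two simple expressions of Figure 1(a).
E = ("tuple", "p", "X", "X", "X")
F = ("tuple", "p",
     ("set", "s", ("tuple", "q", "a")),
     ("set", "s", ("tuple", "q", "Z")),
     ("set", "s", "Y"))
roots, nodes, children, parents = build_dag(E, F)
```

For these two expressions the sketch creates one node for the variable X, shared by all three components of the first root, and one node for each of the three distinct set terms of the second root.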
In order to capture precisely and correctly the equivalence relation on the set of nodes in D based on unifiability, we adopt and modify the definition of valid equivalence relation in [PW78]. We first define the reduced graph of a DAG.

Definition 1 Let D be a DAG representing two simple expressions, and let R be an equivalence relation on the nodes in D. D' is the reduced graph of D with respect to R if D' is a directed graph in which each node n contains a list of nodes l in D such that nodes in l are equivalent with respect to R, and the outdegree of n is the maximal outdegree of nodes in l. □

Example 3 We define a relation R on the nodes in the DAG D in Figure 1(a) such that $xRy$ if and only if $e_x\theta = e_y\theta$, where $e_x$ and $e_y$ are the two expressions represented by the nodes x and y in D, respectively, and $\theta$ is a unifier of $e_x$ and $e_y$. The reduced graph D' of D with respect to R is given in Figure 1(b). □

Definition 2 An equivalence relation R on the nodes in a DAG D is valid if R has the following properties:

(i) if two non-variable nodes in D are equivalent, then their corresponding children in D are equivalent in pairs;
(ii) each equivalence class is homogeneous, i.e., it does not contain two distinct non-variable nodes in D with different names or types;
(iii) the reduced graph D' of D with respect to R is acyclic. □

Assume that n1 and n2 are the two root nodes of a DAG. If n1 and n2 are unifiable, then n1 and n2 are equivalent with respect to a valid equivalence relation R, i.e., $n_1 R n_2$, and n1 and n2 are in the same equivalence class of R. The following lemma proves this result and shows that its converse also holds.

Lemma 1 Let n1 and n2 be the two root nodes of a DAG D. n1 and n2 are unifiable if and only if there exists a valid equivalence relation R on the nodes in D such that $n_1 R n_2$.

Proof: (only if) Assume that n1 and n2 are unifiable. Let $\theta$ be a unifier of the two simple expressions $e_1$ and $e_2$ represented by n1 and n2, respectively. Since $\theta$ is a unifier of $e_1$ and $e_2$, $\theta$ is a unifier of each pair of corresponding subexpressions of $e_1$ and $e_2$. Let R be a relation defined by $xRy$ if and only if $e_x\theta = e_y\theta$, where $e_x$ and $e_y$ are the two expressions represented by the nodes x and y in D, respectively. Hence, x and y are related if and only if $e_x$ and $e_y$ are unifiable (or simply x and y are unifiable). Let $e_r$, $e_s$, and $e_t$ be the expressions represented by nodes r, s, and t in D, respectively. For any node r, $e_r\theta = e_r\theta$, and hence $rRr$. If $rRs$, i.e., $e_r\theta = e_s\theta$, then $e_s\theta = e_r\theta$ and hence $sRr$. If $rRs$ and $sRt$, i.e., $e_r\theta = e_s\theta$ and $e_s\theta = e_t\theta$, then $e_r\theta = e_t\theta$, and hence $rRt$. Thus, R is an equivalence relation on the nodes in D.

Let x and y be two non-variable nodes in D, and let $e_x$ and $e_y$ be the two expressions represented by x and y, respectively. Suppose $e_x\theta = e_y\theta$, i.e., $xRy$. Let $s_x$ and $s_y$ be the corresponding ith ($i \in I^+$) children of x and y, respectively, and let $e_{s_x}$ and $e_{s_y}$ be the expressions represented by $s_x$ and $s_y$, respectively. Since $e_x\theta = e_y\theta$, and $e_{s_x}$ and $e_{s_y}$ are a pair of corresponding subexpressions of $e_x$ and $e_y$, respectively, $e_{s_x}\theta = e_{s_y}\theta$. Hence, $s_x$ and $s_y$ are equivalent in pairs, i.e., $s_x R s_y$, and thus Property (i) of valid equivalence relation holds for R.

Since R is an equivalence relation on the set of nodes in D, R yields a set of equivalence classes on the set of nodes in D. Let nodes x and y be in one of the equivalence classes of R, and let $e_x$ and $e_y$ be the two expressions represented by x and y, respectively. If nodes x and y are two distinct non-variable nodes with different names or types, then there does not exist a unifier $\theta$ such that $e_x\theta = e_y\theta$, and hence x and y cannot be in the same equivalence class, which contradicts the assumption that x and y are in the same equivalence class. Thus, no equivalence class of R contains two distinct non-variable nodes in D with different names or types, i.e., every equivalence class of R is homogeneous, and hence Property (ii) of valid equivalence relation holds for R.

Let D' be the reduced graph of D with respect to R. We claim that D' is acyclic. Assume not; then there exists at least one closed path^2 p: $c_1, c_2, \ldots, c_k, c_1$ in D'. By the definition of reduced graph, if $(c_k, c_1)$ is an edge in p, then there exist a node $n_k$ in $l_k$, the list of nodes in $c_k$, and a node $n_1$ in $l_1$, the list of nodes in $c_1$, such that $n_k$ is the parent of $n_1$ in D. By the same argument for the edge $(c_{k-1}, c_k)$ in p, there exists a node $n_{k-1}$ in $l_{k-1}$ such that $n_{k-1}$ is the parent of $n_k$ in D. By repeating this process and the fact that p is a closed path $c_1, \ldots, c_k, c_1$, there exists a node $n_1'$ in $l_1$ such that $n_1'$ is an ancestor of $n_1$ in D. Recall that all nodes in D that are in node $c_1$ are unifiable. Let $e_1$ and $e_1'$ denote the expressions that are represented by nodes $n_1$ and $n_1'$ in D, respectively. Since $n_1$ is a descendant of $n_1'$, $e_1$ appears in $e_1'$. Also, since $n_1$ and $n_1'$ are in $l_1$, $n_1$ and $n_1'$ are equivalent with respect to R. Thus, $e_1$ and $e_1'$ are unifiable. However, since $e_1$ appears in $e_1'$, $e_1$ and $e_1'$ are not unifiable, a contradiction. Hence, D' is acyclic, and thus Property (iii) of valid equivalence relation holds for R.

(if) Suppose R is a valid equivalence relation on the nodes in D and $n_1 R n_2$. Since R is a valid equivalence relation on the nodes in D, it yields a partition on the nodes in D. By Property (ii) of the valid equivalence relation, each equivalence class is homogeneous. Hence, each homogeneous equivalence class does not contain two distinct non-variable nodes with different names or types. Thus, if an equivalence class has the form $\{X_1, \ldots, X_n\}$, $n \ge 2$, and each $X_i$, $1 \le i \le n$, denotes a variable node, then a substitution $\{X_2/X_1, \ldots, X_n/X_1\}$ can be generated from this equivalence class.
If an equivalence class has the form $\{X_1, \ldots, X_n, t, \ldots, t\}$, where $X_i$, $1 \le i \le n$, denotes a variable node, t denotes a non-variable node, and $E_t$ denotes the expression represented by t, then a substitution $\{X_1/E_t, \ldots, X_n/E_t\}$ can be generated from this equivalence class. If an equivalence class has the form $\{X\}$ or $\{t\}$, where X and t represent a variable node and a non-variable node, respectively, then no substitution is generated for this equivalence class. Hence, a substitution for $e_1$ and $e_2$, the simple expressions represented by n1 and n2, respectively, is immediately derivable from these equivalence classes. Therefore, n1 and n2 are unifiable. □

From now on we refer to valid equivalence simply as equivalence, unless stated otherwise. Based on Lemma 1, we can reduce the unification problem of two simple expressions E and F to the problem of determining the equivalence of the two root nodes n1 and n2, which represent E and F, respectively, in a DAG. If n1 and n2 are in the same equivalence class, i.e., they are unifiable, then the unification algorithm given in this paper generates an mgu of E and F; otherwise, it reports the fact that E and F are not unifiable.

For any given DAG D, a node n in D is assigned a name, which is the outermost symbol appearing in the expression represented by n; a value, which is a unique identifier of n; an equivalence class identifier^3, which is the unique identifier of node x, where x is the first node included (and assumed to be) in the equivalence class of n (x = n if n is the first node included in its own equivalence class); a type, which is the type of expression (term) represented by n; and an indicator, which has an initial value NIL. The indicator of n is associated with a non-NIL value if n has been investigated for (further) substitution. A non-NIL value is an expression which is either the expression represented by n or an expression that replaces the original expression of n. The following definition captures these properties of a node precisely.

^2 A closed path $c_1, c_2, \ldots, c_k, c_1$ is a path, where $c_i \ne c_j$, $1 \le i, j \le k$, if $i \ne j$.
^3 Each node is in its own equivalence class and each equivalence class is assigned a unique identifier.

Definition 3 A node n in a DAG is a 5-tuple (name, id, class_id, type, assign), where name is the outermost symbol appearing in the expression represented by n, id is the unique identifier of n and is initialized to -1, class_id is the equivalence class identifier, which is the unique identifier of the first node included (and assumed to be) in the equivalence class of n, type is the type of n, and assign is a value of n which indicates whether n has been investigated for (further) substitution and is initialized to NIL. □

Definition 4 Two nodes X1 and X2 mismatch if (i) X1 and X2 are of different non-variable types, or (ii) X1 and X2 are of the same non-variable type, but they have different names or different numbers of children (the latter applies to tuple terms only). Otherwise, X1 and X2 match. □

It is assumed that during the transformation process which converts two simple expressions into a DAG D, the name field and the type field of each node n in D are extracted from the expression represented by n. If n is a variable node, then its initial substituted value in Subs, a global structure which records all bindings for variables in D, is NIL. As shown in the initialization step in Algorithm 1, G is a duplicated copy of the original graph D and is used during the process of computing a non-ordered substitution for the expressions represented by D. Given a DAG D as an input, Algorithm 1 proceeds to add an undirected edge between the two root nodes of D. (An undirected edge connecting nodes n1 and n2 denotes that n1 and n2 are to be considered for unification.)
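Definitions 3 and 4 can be sketched in Python as follows. The Node class mirrors the 5-tuple of Definition 3; the children field, the type names "variable", "constant", "tuple", and "set", and the use of None for NIL are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str                        # outermost symbol of the expression
    type: str                        # "variable", "constant", "tuple", or "set"
    id: int = -1                     # unique identifier, initialized to -1
    class_id: Optional[int] = None   # identifier of the equivalence class
    assign: Optional[object] = None  # NIL until investigated for substitution
    children: List["Node"] = field(default_factory=list)  # assumed helper field


def mismatch(x1, x2):
    """Return True if the two nodes mismatch in the sense of Definition 4."""
    if x1.type == "variable" or x2.type == "variable":
        return False                 # both conditions concern non-variable nodes
    if x1.type != x2.type:
        return True                  # (i) different non-variable types
    if x1.name != x2.name:
        return True                  # (ii) same type, different names
    # (ii) different number of children, tuple terms only
    return x1.type == "tuple" and len(x1.children) != len(x2.children)


X = Node("X", "variable")
Y = Node("Y", "variable")
c = Node("c", "constant")
s = Node("s", "set", children=[Y])
p3 = Node("p", "tuple", children=[X, X, X])
p2 = Node("p", "tuple", children=[X, X])
```

In this sketch the set-term node s and the constant node c mismatch, while the variable node X matches both of them, which is the situation described for nodes s, X, and c in Section 3.1.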
It then chooses an existing non-variable node n in D and calls the procedure Unify, which attempts to unify n and each of its ancestor nodes $n_1, \ldots, n_k$ ($k \ge 0$) with other nodes in D which are in the same equivalence classes as $n, n_1, \ldots, n_k$, respectively. $n_k$ (or n if n has no ancestor), which is one of the two root nodes of D, is first considered to be unified with $n_k'$, the other root node of D. If $n_k$ and $n_k'$ match, undirected edges connecting corresponding children of $n_k$ and $n_k'$ are created, and nodes $n_k$ and $n_k'$ and edges coming out of them are deleted from D. The node to be considered next for unification is $n_{k-1}$, assuming that $n_{k-1}$ is a direct descendant of $n_k$ and is an ancestor of n. For each node $n_i$, $1 \le i \le k-1$, if $n_i$ is not a (new) root node in D, i.e., $n_i$ has another ancestor, then each of $n_i$'s ancestors A is considered to be unified with other nodes in D which are in the same equivalence class as A. Algorithm 1 terminates when an mgu of the two simple expressions E and F represented by D is determined or when it has been decided that E and F are not unifiable.

The procedure Unify, as introduced in Algorithm 1, may produce an ordered substitution instead of a non-ordered substitution for the expressions represented by D. A substitution $\theta = \{X_1/t_1, \ldots, X_n/t_n\}$ is non-ordered if $X_i$ does not occur in $t_j$, $1 \le i, j \le n$; otherwise, $\theta$ is ordered [Cha86]. The subsequent procedure Comp_mgu in Algorithm 1 constructs a non-ordered substitution for the substitution generated by Unify. The procedure Comp_mgu determines whether further substitutions should be done on variables occurring in a term t of the binding X/t generated in Unify. It makes use of the structure Subs, which includes bindings for variables appearing in the two expressions represented by D. Each variable V in Subs is initially bound to NIL, and the binding may subsequently be changed to V/t in Unify. V/t may further be changed to V/t' in Comp_mgu if term t contains a variable W such that there exists a non-NIL binding for W in Subs. Eventually, the procedure Comp_mgu produces a non-ordered substitution $\sigma$ for the expressions represented by D. The functions Exp_Var, Descend, and Exp_Args, as included in Algorithm 1, are mutually recursive. This mutually recursive set of functions shares the global structures $\sigma$ (a substitution) and Subs.

Our efficient algorithm does not include the pre-processor of [Cha86], which assigns, in linear time, parent pointers for a node in a DAG. We assume that during the transformation from a set of two simple expressions into a DAG D, a list of parent pointers is assigned to each node in D by adding backward pointers from descendant nodes to parent nodes. This process works in linear time with respect to the number of nodes and edges in D.

ALGORITHM 1. Generates an mgu σ of the root nodes u and v if u and v are unifiable.

Input: A DAG D with root nodes u and v.
Output: An mgu of the two simple expressions represented by D, or a report that the expressions are not unifiable.

Begin
  1. Initialization: G := D; k := 1; σ := ε.
     For each node n in D Do
     Begin
       n.id := -1. n.assign := NIL.
       If n is a variable node, then Subs(n) := NIL.
     End.
  2. Create an undirected edge between (u, v).
  3. While there is a non-variable node n, call Unify(n).
  4. While there is a variable node n, call Unify(n).
  5. Call Comp_mgu and halt.
End. /* End of Algorithm 1 */

Procedure Unify(n)
Begin
  /* a positive n.id indicates that n attempts to unify with one of its descendants */
  1. If n.id > 0, then print("non-unifiable") and halt
     Else /* n is to be unified with other nodes which are in the equivalence class of n */
     Begin
       n.id := k. n.class_id := n.id. k := k + 1.
     End.
  /* store nodes in a stack which may be in the same equivalence class of n */
  2. Create a new pushdown stack with operations Push(node) and Pop.
  3. Push(n). /* n is in its own equivalence class */
  4. While stack is nonempty Do /* unify all the nodes in the equivalence class of n */
     Begin
       (a) m := Pop.
       (b) If m and n are non-variable nodes, then
             If m.type ≠ n.type, then /* type mismatched */
               print("non-unifiable") and halt
             Else If m.name ≠ n.name, then /* name mismatched */
               print("non-unifiable") and halt.
       /* unify m's parent nodes with nodes in their respective equivalence classes */
       (c) While m has a parent f do Unify(f).
       /* determine all the nodes which are in the same equivalence class of m */
       (d) While there is an undirected edge between (m, f)
           Begin
             (i) If f.id = -1, then /* f may be in the equivalence class of m */
                 Begin
                   f.id := k. f.class_id := n.class_id. k := k + 1. Push(f).
                 End
                 Else If f.class_id ≠ n.id, then /* f attempts to unify with its descendants */
                   print("non-unifiable") and halt.
             /* equivalence class relationship between m and f is to be determined */
             (ii) Delete undirected edge between (m, f) from D.
           End While. /* Step 4(d) */
       (e) If m.id ≠ n.id, then /* m and n are distinct nodes */
           Begin
             (i) If m is a variable node, then
                 Begin
                   Subs(m) := n. σ := σ ∪ {m/n}.
                 End
                 Else If m is a tuple term or set term with the same outdegree q of n, then
                   For i := 1 to q Do
                     create an undirected edge between (ith child(n), ith child(m))
                 Else print("non-unifiable"). /* m and n have different no. of children */
             (ii) Delete m and all directed edges out of m from D.
           End.
     End While. /* Step 4 */
  5. Delete n and all directed edges out of n from D.
End. /* End of Unify */

Procedure Comp_mgu /* Compute a non-ordered substitution for σ */
Begin
  For each binding V/t ∈ σ Do
    σ := σ ∪ {V/Exp_Var(V)}. /* Compute new binding for V, if necessary */
End. /* End of Comp_mgu */

Function Exp_Var(V) /* Compute new binding for variable V, if needed */
Begin
  If V.assign ≠ NIL, then /* (New) binding for V has been computed */
    return(V.assign)
  Else
  Begin
    If (bind := Descend(Subs(V))) = NIL, then /* V is not associated with a value */
      bind := V.
    End If
    V.assign := bind.
    return(bind).
  End. /* Else */
End. /* End of Exp_Var */

Function Descend(t) /* Determine the value of term t */
Begin
  Case 't = NIL': return(NIL). /* a non-existent term t */
  Case 't.type = variable': return(Exp_Var(t)). /* return the investigated value of t */
  Case 't.type = constant': return(t).
  Case 't.assign ≠ NIL': return(t.assign). /* an examined tuple term/set term */
  default: /* further examine a tuple term/set term */
    value := Exp_Args(Arguments-of(t)).
    If value = Arguments-of(t), then t.assign := t
    Else t.assign := concat(t.name, '(', value, ')').
    End If
    return(t.assign).
End. /* End of Descend */

Function Exp_Args(L) /* Explore a list of arguments L in a depth-first (preorder) mode */
Begin
  If L = NIL, then /* L is an empty list of terms */
    return(NIL)
  Else
  Begin
    head_val := Descend(head(L)). /* head(L) yields the 1st argument of L */
    tail_val := Exp_Args(tail(L)). /* tail(L) yields the remaining arguments of L */
    If head_val ≠ head(L) or tail_val ≠ tail(L), then
      return(Cons(head_val, tail_val)) /* concatenate the returned values of L */
    Else return(L).
  End.
End. /* End of Exp_Args */ □

The following notations are adopted during the process of unifying nodes in a DAG according to Algorithm 1: a deleted edge is represented by a dashed line, a deleted node is slashed through, and each node n that has been visited is assigned a pair of numbers (i, j), where i denotes the unique identifier (id) of n, and j denotes the class identifier (class_id) of n.
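The effect of Comp_mgu and its helper functions can be illustrated by a compact Python sketch that turns an ordered substitution into a non-ordered one. It is a simplification rather than a transcription of the procedures above: it works on nested tuples instead of DAG nodes, memoizes only variables (the role of V.assign), and assumes that the substitution is acyclic, as guaranteed when Unify succeeds. The encoding (uppercase strings for variables, ("tuple", name, ...) and ("set", name, elem) for compound terms) and the name resolve are assumptions.

```python
def is_var(t):
    """A variable is a string whose first character is uppercase (assumed encoding)."""
    return isinstance(t, str) and t[:1].isupper()


def resolve(subs):
    """Expand every binding in subs until no bound variable occurs in any term."""
    assign = {}                              # plays the role of V.assign

    def exp_var(v):
        if v in assign:                      # binding already computed
            return assign[v]
        t = subs.get(v)                      # None plays the role of NIL
        bind = v if t is None else descend(t)
        assign[v] = bind
        return bind

    def descend(t):
        if is_var(t):
            return exp_var(t)
        if isinstance(t, tuple):             # tuple term or set term
            return t[:2] + tuple(descend(a) for a in t[2:])
        return t                             # constant

    return {v: exp_var(v) for v in subs}


# An ordered substitution {X/s{Y}, Y/q(Z), Z/a}.
ordered = {"X": ("set", "s", "Y"), "Y": ("tuple", "q", "Z"), "Z": "a"}
non_ordered = resolve(ordered)
```

For the ordered substitution shown, the sketch yields the bindings X/s{q(a)}, Y/q(a), and Z/a, in which no bound variable occurs in any term.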

Example 4 Let p(X, X, X) and p(s{q(a)}, s{q(Z)}, s{Y}) be the two simple expressions represented by the DAG D in Figure 1(a) with an undirected edge connecting the two root nodes of D. We apply Algorithm 1 to determine an mgu σ of these two simple expressions. Assume that the rightmost s node is chosen to start with. Since (the rightmost) node p is a root node and is an ancestor of s, p is considered next. This p node matches another (the leftmost) p node. As a result, new undirected edges connecting the corresponding children (X and the three s nodes) of the two p nodes are created, and the two p nodes and the (un)directed edges coming out of them are deleted. Since (the rightmost) s becomes a (new) root node and there is an undirected edge connecting (the rightmost) s and X, s and X are unified. Further, since X and the other two s nodes are in the same equivalence class, the other two s nodes are visited and the undirected edges connecting X and all s nodes are deleted. The DAG at this stage is as shown in Figure 2.

According to the procedure Unify, the middle s node is to be considered next. Since the middle and the rightmost s nodes match, an undirected edge between their corresponding children, (the rightmost) q and Y, is created. Further, since the other pair of (the leftmost and the rightmost) s nodes are in the same equivalence class and they match, their corresponding children, (the leftmost) q and Y, are connected by an undirected edge. All s nodes and their edges are then deleted. The DAG at this stage is as shown in Figure 3.

Here, either node q (the leftmost or rightmost) or node a can be chosen next since they are non-variable nodes. Assume that the rightmost q is chosen. Since there is an undirected edge connecting (the rightmost) q and Y, (the rightmost) q and Y are unified.
Further, since the leftmost q and Y are in the same equivalent class as the rightmost q, and since the two q nodes match, their children, a and Z , are connected and to be considered for uni cation, and all q and Y nodes and (undirected) edges coming out of them are deleted. The resultant DAG at his stage is as shown in Figure 4. At this stage there are two nodes, a and Z , left to be considered. a is chosen over Z because a is a non-variable node. Since a and Z are in the same equivalent class, they are uni ed, and the procedure Unify yields  = fX=sfY g; Y=q(Z ); Z=ag. The subsequent procedure Comp mgu is called to generate a non-ordered substitution 12

Figure 2: The DAG after X and (the rightmost) s are unified.

Figure 3: The DAG after the equivalence class of s's children is determined.

Figure 4: The DAG after Y and (the middle) q are unified and before a and Z are to be considered.

for $\theta$. Since variable X is bound to $s\{Y\}$, i.e., Subs(X) = $s\{Y\}$, the function Descend is called in Exp_Var to investigate whether there exists a non- value associated with the binding for variable Y. Since Y is an argument of s, the function Exp_Args is called in Descend to explore Y to determine whether Y is a variable involved in a non- binding. As it turns out, Descend (called by Exp_Args) detects that Y is a variable and calls Exp_Var to retrieve the term $q(Z)$, which is bound to Y. Subsequent function calls yield the fact that Z is bound to a, a constant. Backward substitutions generate the new bindings $Y/q(a)$ and $X/s\{q(a)\}$, which yield the desired non-ordered substitution $\theta = \{X/s\{q(a)\},\ Y/q(a),\ Z/a\}$, which is an mgu for the two expressions $p(X, X, X)$ and $p(s\{q(a)\}, s\{q(Z)\}, s\{Y\})$. □

Notice that Algorithm 1 is designed to deal with two simple expressions. In order to show that our unification algorithm can handle unification problems of more than two simple expressions, as the unification algorithm in [Llo87] does, we propose a transformation, which involves only a linear increase in the length of the input, for converting a set of three or more simple expressions into a set of two simple expressions. The transformed set of expressions is then processed by Algorithm 1. Assume that the set of simple expressions $S = \{s_1, s_2, \ldots, s_n\}$, $n \ge 3$, is to be unified. If any two expressions in S mismatch (this can easily be verified by comparing the name, type, and number of subexpressions at the first level of the two expressions), then a message indicating that S is not unifiable is generated; otherwise, there are three cases to be considered during the transformation process.

Case (i) If S contains at least one tuple term, then each of the remaining simple expressions in S is either a tuple term or a variable. Suppose $s_1, \ldots, s_m$ are tuple terms and each of the remaining $s_i$, $m + 1 \le i \le n$, is a variable, denoted $X_i$. Let p and q be two names that do not appear in any expression in S. If $m = n$, then let $S' = \{p(q\{s_1\}, q\{s_2\}, \ldots, q\{s_{m-1}\}),\ p(q\{s_2\}, q\{s_3\}, \ldots, q\{s_m\})\}$; otherwise, let $S' = \{p(q\{s_1\}, \ldots, q\{s_{m-1}\}, q\{s_m\}, X_{m+1}, \ldots, X_{n-1}),\ p(q\{s_2\}, \ldots, q\{s_m\}, X_{m+1}, X_{m+2}, \ldots, X_n)\}$.

Case (ii) If S contains at least one set term, then each of the remaining simple expressions in S is either a set term or a variable. Suppose $s_1, \ldots, s_m$ are set terms and each of the remaining $s_i$, $m + 1 \le i \le n$, is a variable, denoted $X_i$. Let p be a name that does not appear in any expression in S. If $m = n$, then let $S' = \{p(s_1, s_2, \ldots, s_{m-1}),\ p(s_2, s_3, \ldots, s_m)\}$; otherwise, let $S' = \{p(s_1, \ldots, s_{m-1}, s_m, X_{m+1}, \ldots, X_{n-1}),\ p(s_2, \ldots, s_m, X_{m+1}, X_{m+2}, \ldots, X_n)\}$.

Case (iii) If S has neither tuple terms nor set terms, then S contains at most one constant and each of the remaining simple expressions in S is a variable. Let p be a name that does not appear in any expression in S, and let $S' = \{p(s_1, s_2, \ldots, s_{n-1}),\ p(s_2, s_3, \ldots, s_n)\}$.

Lemma 2 The unification problem of a set S of three or more simple expressions is the same as the unification problem of $S'$, where $S'$, which is transformed from S, is a set of two simple expressions.

Proof: We apply Algorithm 1 to $S'$ in case (iii) above to show that the unification problem of S is the same as the unification problem of $S'$. (The DAG D for $S'$ in case (iii) is as shown in Figure 5.) Suppose node $s_k$ in D is chosen first. If $k = 1$, then Algorithm 1 determines the unifiability of $s_2$ and $s_1$, $s_3$ and $s_1$, ..., and $s_n$ and $s_1$. If $k = n$, then Algorithm 1 determines the unifiability of $s_{n-1}$ and $s_n$, $s_{n-2}$ and $s_n$, ..., and $s_1$ and $s_n$. If $1 < k < n$, then Algorithm 1 determines the unifiability of $s_{k+1}$ and $s_k$, $s_{k+2}$ and $s_k$, ..., $s_n$ and $s_k$, $s_{k-1}$ and $s_k$, $s_{k-2}$ and $s_k$, ..., and $s_1$ and $s_k$. If each pair $(s_i, s_j)$, $1 \le i, j \le n$, is unifiable, then $S'$ is unifiable, and so is S. The same argument holds for $S'$ in cases (i) and (ii) above. Hence, the unification problem of S is the same as the unification problem of $S'$. □

Figure 5: Converting the unification problem with m expressions into the unification problem with 2 expressions.
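The transformation of cases (i) to (iii) can be sketched in a few lines of Python. The sketch is illustrative only and rests on several assumptions that are not part of the paper: terms are encoded as strings (uppercase initial for variables, otherwise constants) or as `(kind, name, args)` triples, the helper names `to_pair`, `fresh`, and `names_in` are hypothetical, fresh names are generated by appending a counter, and the first-level mismatch check described above is assumed to have been passed already. All three cases reduce to the same pattern: order the non-variable expressions before the variables, wrap tuple terms in a set term in case (i), and pair the list without its last element against the list without its first element.

```python
from typing import List, Set, Tuple, Union

# Assumed encoding: variables are strings starting with an uppercase letter,
# constants are other strings, tuple/set terms are (kind, name, args) triples.
Term = Union[str, Tuple]


def is_var(t: Term) -> bool:
    return isinstance(t, str) and t[:1].isupper()


def names_in(t: Term, acc: Set[str]) -> Set[str]:
    """Collect every name (functor, constant, variable) occurring in t."""
    if isinstance(t, tuple):
        acc.add(t[1])
        for arg in t[2]:
            names_in(arg, acc)
    else:
        acc.add(t)
    return acc


def fresh(base: str, used: Set[str]) -> str:
    """Return a name not in `used` (hypothetical naming scheme: base, base1, ...)."""
    name, i = base, 0
    while name in used:
        i += 1
        name = f"{base}{i}"
    used.add(name)
    return name


def to_pair(exprs: List[Term]) -> Tuple[Term, Term]:
    """Turn n >= 3 simple expressions into two, following cases (i)-(iii)."""
    used: Set[str] = set()
    for e in exprs:
        names_in(e, used)
    p = fresh("p", used)
    # Stable sort: non-variable expressions first, variables last.
    ordered = sorted(exprs, key=is_var)
    if any(isinstance(e, tuple) and e[0] == "tuple" for e in exprs):
        # Case (i): wrap each tuple term s_i as q{s_i}.
        q = fresh("q", used)
        ordered = [e if is_var(e) else ("set", q, (e,)) for e in ordered]
    # Cases (ii) and (iii) use the expressions unchanged.
    return ("tuple", p, tuple(ordered[:-1])), ("tuple", p, tuple(ordered[1:]))


f_a = ("tuple", "f", ("a",))
f_y = ("tuple", "f", ("Y",))
first, second = to_pair([f_a, "X", f_y])

# A set-term input whose functor is already called "p" forces a fresh name.
first2, second2 = to_pair([("set", "p", ("a",)), ("set", "p", ("Z",)), "W"])
```

Because each input expression appears at most twice in the output, the length of the transformed input grows only linearly, which is the property the text relies on.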

3.2 Computational Complexity of Algorithm 1

In this subsection, we show that Algorithm 1 is linear with respect to the number of names and edges appearing in the two simple expressions represented by a DAG which is given as an input for Algorithm 1.

Theorem 1 Algorithm 1 is a linear time algorithm.

Proof: Let $D = (V, E)$ be the input of Algorithm 1, and let $e_1$ and $e_2$ be the two simple expressions represented by D. There are two cases to be considered.

Case 1. Suppose that $e_1$ and $e_2$ are unifiable. Let $|V|$, $|E|$, and $|UE|$ denote the number of vertices, directed edges, and undirected edges created in D, respectively. Since D consists of finitely many nodes, $|V|$ is finite. Also, since there are finitely many directed edges between any two nodes in D, $|E|$ is finite. Since all the undirected edges, except the one connecting the two root nodes of D, are created in Step 4.(e)(i) of the procedure Unify according to the outdegree of two nodes n and m in D, $|UE| \le \sum_{n \in V} \mathrm{outdegree}(n) + 1$. Also, since the outdegree of a node n is the number of directed edges coming out of n, $|UE| \le |E| + 1$. Algorithm 1 terminates when all nodes and (un)directed edges have been removed from D and a non-ordered substitution has been computed. The time complexity of Algorithm 1 is determined by the number of times the nodes and (un)directed edges in D are visited.

In D each undirected edge e connecting nodes f and m of interest is visited at Step 4.(d) of Unify, and e is visited again at Step 4.(d)(ii) to be deleted. Thus, each undirected edge is visited exactly twice during the execution of Algorithm 1. It follows that $2|UE|$ visits are made on the undirected edges in D.

For any node n in D, if n has parents, then the directed edges of a path n, ..., m are visited to determine one of the current root nodes m of the DAG, where m is an ancestor of n. After two of the current root nodes have been unified, all the directed edges coming out of the two root nodes are deleted. Thus, any directed edge of D is visited once for determining one of the current root nodes of D and is visited again for deletion in either Step 4.(e)(ii) or Step 5 of Unify. Further, each directed edge in G (the duplicated copy of D; see footnote 4) is visited at most once in Descend during the process of computing a non-ordered substitution for the expressions represented by D. Since every directed edge in D is visited at most three times, at most $3|E|$ visits are made on the directed edges in D.

Since each node in D is initialized to -1 and is afterward set to a positive value, a unique identifier, during the execution of Algorithm 1, each node is visited exactly twice for setting up its value. In other words, $2|V|$ visits are made on the nodes in D for setting up their values. Further, two nodes are visited in order to create an undirected edge between them. Since there are $|UE|$ undirected edges, $2|UE|$ visits are made on the nodes in D for creating undirected edges. Also, each node n is visited when n is deleted at Step 4.(e)(ii) or Step 5 of Unify. It thus requires $|V|$ visits to delete all the nodes in D. Since each undirected edge (m, f) is visited at Step 4.(d) to determine the nodes f that are in the same equivalence class as m, each of these f nodes is visited once. Since there are $|UE|$ undirected edges in D, there are $|UE|$ node visits in D for determining equivalence classes. In addition, during the process of computing a non-ordered substitution for the expressions represented by D, each node (in G) is visited (at most) one more time. Thus, the number of visits made on the nodes in D does not exceed $4|V| + 3|UE|$.

Therefore, the total number of visits on the nodes and (un)directed edges in D does not exceed $5|UE| + 3|E| + 4|V|$. Since $|UE| \le |E| + 1$, the total number of visits does not exceed $8|E| + 4|V| + 5$. Thus, the time complexity of Algorithm 1 is $O(|V| + |E|)$. (We assume that each if-then-else statement, case statement, assignment statement, or unit operation such as Push, Pop, etc., takes one time unit to process. However, since there is a constant number of these statements and operations, their computational time can be ignored.)

Case 2. Suppose that $e_1$ and $e_2$ are not unifiable. Then Algorithm 1 terminates at either Step 1, Step 4.(b), Step 4.(d)(i), or Step 4.(e)(i) of Unify. The non-unifiable message is generated in the body of the conditional statement in one of these four steps. Since these statements are also checked in Case 1, the running time for Case 2 does not exceed the running time for Case 1. □
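The counting argument of Case 1 can be collected into a single chain of inequalities. The display below only restates the visit counts given in the proof, grouped by undirected edges, directed edges, and nodes; the symbol $T$ for the total number of visits is introduced here for convenience and does not appear in the original proof.

```latex
\begin{align*}
T &\le \underbrace{2|UE|}_{\text{undirected edges}}
     + \underbrace{3|E|}_{\text{directed edges}}
     + \underbrace{\bigl(2|V| + 2|UE| + |V| + |UE| + |V|\bigr)}_{\text{nodes}} \\
  &= 5|UE| + 3|E| + 4|V| \\
  &\le 5\bigl(|E| + 1\bigr) + 3|E| + 4|V| \\
  &= 8|E| + 4|V| + 5 \;=\; O(|V| + |E|).
\end{align*}
```

The five node terms correspond, in order, to setting node values, creating undirected edges, deleting nodes, determining equivalence classes, and computing the non-ordered substitution.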

4 Concluding Remarks

In this paper we provided an efficient unification algorithm for LDL/NR, a logic database language for nested relations. This unification algorithm can be used to determine the answers to a large and interesting class of LDL/NR queries. The proposed algorithm is linear with respect to the number of names and edges appearing in a DAG which represents two simple expressions in LDL/NR. The proof for the time complexity of the algorithm has been given.

Footnote 4: To avoid the confusion of working with two DAGs, D and G, we can modify Algorithm 1 so that an edge or a node can be marked for deletion and still exist physically for future reference.

References

[AGS92] N. Arni, S. Greco, and S. Sacca. Set-Term Matching in Logic Programming. In International Conference on Database Theory, pages 436-449, 1992. Lecture Notes in Computer Science, 646.

[CC90] Q. Chen and W. Chu. HILOG: A High-order Logic Programming Language for Non-1NF Deductive Databases. In W. Kim, J. Nicolas, and S. Nishio, editors, Deductive and Object-Oriented Database, pages 431-452. Elsevier Science Publishers, 1990.

[Cha86] D. D. Champeaux. About the Paterson-Wegman Linear Unification Algorithm. Journal of Computer and System Sciences, 32:79-90, 1986.

[Cha93] Q. Chang. A Logic Database Language for Nested Relations in Partitioned Normal Form. Master's thesis, Brigham Young University, Provo, Utah, December 1993.

[CK91] Q. Chen and Y. Kambayashi. Nested Relation Based Database Knowledge Representation. In Proceedings of 1991 ACM SIGMOD International Conference on Management of Data, pages 328-337. ACM, 1991.

[Das92] S. K. Das. Deductive Databases and Logic Programming. Addison-Wesley, New York, 1992.

[Hul89] Richard Hull. Four Views of Complex Objects: A Sophisticate's Introduction. In S. Abiteboul, P. Fischer, and H. Schek, editors, Nested Relations and Complex Objects in Databases, pages 87-116. Springer-Verlag, New York, 1989. Lecture Notes in Computer Science, 361.

[KN86] D. Kapur and P. Narendran. NP-Completeness of the Set Unification and Matching Problems. In J. Siekmann, editor, 8th International Conference on Automated Deduction, pages 489-495. Springer-Verlag, New York, 1986. Lecture Notes in Computer Science, 230.

[Llo87] J. W. Lloyd. Foundations of Logic Programming, Second, Extended Edition. Springer-Verlag, New York, 1987.

[Mar82] A. Martelli. An Efficient Unification Algorithm. ACM Transactions on Programming Languages and Systems, 4(2):258-282, 1982.

[PW78] M. S. Paterson and M. N. Wegman. Linear Unification. Journal of Computer and System Sciences, 16:158-167, 1978.

[Sie89] J. Siekmann. Unification Theory. Journal of Symbolic Computation, 7:207-274, 1989.

[Zan85] C. Zaniolo. The Representation and Deductive Retrieval of Complex Objects. In Proceedings of the International Conference on Very Large Databases, pages 458-469, Stockholm, 1985. ACM.
