Finite Implication of Keys and Foreign Keys for

0 downloads 0 Views 258KB Size Report
even under the primary key restriction: nite implication of unary keys, unary foreign keys and negations of unary keys. nite implication of unary keys, unary ...
Finite Implication of Keys and Foreign Keys for XML Data Wenfei Fan

Leonid Libkin

[email protected]

[email protected]

Bell Laboratories

Temple University

Abstract

needs to reason about them. There are two important questions in connection with these constraints. One is the nite implication problem: given a DTD D and a set  of keys and foreign keys, does it follow that all nite XML documents satisfying  and conforming to D must also satisfy some other key or foreign key '? XML documents are typically modeled as node-labeled trees, e.g., in XSL [12, 31], XQL [28], XML Schema [29], XPath [13] and DOM [3]. Thus the nite implication problem is to determine whether for all nite XML tree T that both conforms to D and satis es , T must also satisfy '. This question is important in, among other things, data integration. For example, one may want to know whether a constraint ' holds in a mediator interface, which may use XML as a uniform data format [4, 26]. But this cannot be veri ed directly since the mediator interface does not contain data. One way to verify ' is to show that it is implied by constraints that are known to hold [18]. The other important question is the nite satis ability problem: given a DTD D and a set  of keys and foreign keys, is there a nite XML tree that both conforms to D and satis es ? In other words, one wants to know whether an XML speci cation is meaningful. In relational databases the nite satis ability and nite implication problems have been well studied for keys, foreign keys and the more general functional and inclusion dependencies (see, e.g., [1] for a survey). In the XML setting, however, little work has been done in this line of research. Recently we investigated the nite satis ability problems associated with keys and foreign keys for XML [16], and our results show that reasoning about keys and foreign keys for XML is far more intriguing than that for relational databases. This is because XML DTDs interact with keys and foreign keys, and the interaction makes the analysis of nite satis ability of these constraints much harder. This paper extends the work of [16]. We continue to study the interaction between DTDs and key, foreign key constraints, and investigate nite implication problems associated with keys and foreign keys for XML.

We investigate nite implication problems associated with key and foreign key constraints for XML data. We demonstrate that there is interaction between DTDs and these constraints, and the interaction complicates the analysis of nite implication. In particular, we establish complexity results for reasoning about two classes of constraints for XML. One class, Lu , de nes unary keys and foreign keys, and the other, L, consists of multi-attribute keys and foreign keys. We show that the nite implication problem for Lu constraints is coNP-complete, and the nite implication problem for L constraints is undecidable. We also identify several PTIME decidable cases of the nite implication problems. In addition, we improve the results established in [16] by showing that the nite satis ability problem for unary keys, unary inclusion constraints and their negations is NP-complete.

1 Introduction Although a number of dependency formalisms were developed for relational databases, functional and inclusion dependencies are the only ones that are commonly used in practice. More precisely, only two subclasses of functional and inclusion dependencies, namely, keys and foreign keys, are still being widely used and supported by the SQL standard [23]. Keys and foreign keys are fundamental to conceptual database design. They provide a mechanism by which one tuple may refer to another tuple. They are also useful in update anomaly prevention, query optimization and index design [1, 30]. Keys and foreign keys have been proposed for XML in, e.g., XML Schema [29], XML Data [21] and in a recent proposal [6]. They express a fundamental part of the semantics of XML data and are important in, among other things, query formulation and optimization [27], data transformation and integration [18], and in data exchange for converting databases to an XML encoding. To take advantage of keys and foreign keys, one 1

Examples. Before we formally treat nite implication

college

of keys and foreign keys, let us consider some concrete examples. A DTD, D1 , for courses and course coordinators in a college is given as follows:

course



Here we omit the descriptions of elements whose type is string (e.g., PCDATA in XML). This DTD speci es that, among others, a course coordinator is either a professor or a staff. Assume that each coordinator has an attribute for referring to a course that the coordinator supervises, and the attribute is a key of coordinator, i.e., there is at most one coordinator for each course. A course has attribute num that uniquely identi es the course, and an attribute taught by that refers to a professor who teaches the course. Assume that a course is taught by at most one professor, i.e., taught by is also a key of course. In addition, professor and staff have a name attribute that is a key, and professor also has a teaches attribute that refers to a course that the professor teaches. Furthermore, we assume that no two professors teach the same course, i.e., teaches is also a key of professor. An XML tree conforming to D1 is depicted in Fig. 1, in which the values of the text elements are omitted. Here we consider single-valued attributes. That is, if an attribute l is de ned for an element type  in a DTD, then in a document valid w.r.t. the DTD, each element of type  must have a unique l attribute that has a string value. Using keys and foreign keys, the semantic information given above can be expressed as follows: (1) coordinator:for ! coordinator, (2) course:num ! course, (3) course:taught by ! course, (4) professor:name ! professor, (5) professor:teaches ! professor, (6) staff:name ! staff , (7) coordinator:for  course:num, (8) course:taught by  professor:name, (9) professor:teaches  course:num. Let us refer to this set of constraints as 1 . The rst six constraints are keys. They assert that the value of a key attribute uniquely identi es an element among elements of the same type. For example, referring to an XML tree T , the rst key assures that no two distinct coordinator nodes in T can have the same for attribute value. In other words, for any nodes x and y in T labeled coordinator, if the (string) values of their for attributes are equal, then x and y must be the same node. It should be noted that two notions of

@taught_by @num "Joe"

"CIS670"

course

@taught_by "Mary"

coordinator

@num

coordinator

professor @for professor @for

"CIS331"

"CIS670"

"CIS331"

@teaches

@name

@name

@teaches

"CIS670"

"Joe"

"Mary"

"CIS331"

Figure 1: An XML tree conforming to D

1

equality are used in the de nition of keys: we assume string value equality when comparing attribute values of x and y, and vertex equality when it comes to comparing x and y. The pairs of constraints (7, 2), (8, 4) and (9, 2) given above are foreign keys. They state that an attribute of an element type 1 is a foreign key referencing a key attribute of another element type 2 , and thus uniquely identi es an element of 2 . For example, the last foreign key asserts that for any node x in T labeled professor, there is a node in T labeled course such that the value of the teaches attribute of x equals to the value of the num attribute of y. Observe that y is uniquely identi ed since num is a key of course. One may wonder over nite XML trees conforming to D1 , whether 1 implies the following constraints:

'1 : professor:teaches  coordinator:for, '2 : staff:name  professor:name. That is, whether for any nite XML tree T conforming to D1 , if T satis es , then T satis es '1 and '2 .

This is indeed the case. To see this, let us rst de ne some notations. Given an XML tree T and an element type  , we use ext( ) to denote the set of all the nodes labeled  in T . Similarly, given an attribute l of  , we use ext(:l) to denote the set of l attribute values of all  elements. Let us denote the cardinality of a set S by jS j. Observe that if T satis es :l !  , then jext(:l)j = jext( )j, and if T satis es 1 :l1  2 :l2 , then jext(1 :l1 )j  jext(2 :l2 )j. Thus from 1 follows that in any nite XML tree conforming to D1 and satisfying 1 , we have the following:

jext(coordinator)j = jext(coordinator:for)j  jext(course:num)j = jext(professor:teaches)j 2

= jext(professor)j:

db

By the DTD D1 , a coordinator can have at most one professor subelement, and thus

jext(professor)j  jext(coordinator)j:

state

From these follows immediately

@name

jext(coordinator:for)j = jext(professor:teaches)j = jext(course:num)j: Thus by the constraints coordinator:for  course:num and professor:teaches  course:num in  we have

"PA"

"Harrisburg" @in_state

@name

"liberty bell"

"Philadelphia"

Figure 2: An XML tree conforming to D

2

foreign key referring to name of state elements together with the rst key. Now let us consider whether any nite XML tree that both conforms to D2 and satis es 2 must also satisfy another not-very-natural constraint:

'3 : city:name  state:name. Given this particular D2 , the nite implication holds.

coordinator (professor | staff)>,

then there is a nite XML tree T conforming to the new DTD and satisfying , but T does not satisfy any of '1 and '2 . In other words, over di erent DTDs, the nite implication has di erent outcomes. This example demonstrates that XML DTDs interact with keys and foreign keys, and the interaction complicates the analysis of nite implication of keys and foreign keys.

This is because there does not exist any nite XML tree that both conforms to D2 and satis es 2 . Indeed, from 2 follows immediately:

jext(capital)j  jext(state)j:

(a)

However, observe that D2 requires that each state node must have a capital as a child and in addition, the db node must have a non-empty sequence of capital children that are not children of any state node. Thus the following is enforced by D2 :

As another example, let us consider a DTD D2 for states and capitals: db state

capital

"PA"

"Harrisburg" @in_state

that '1 must hold. In addition, from these inequalities we have jext(coordinator)j = jext(professor)j. By the DTD D1 , a coordinator cannot both a professor and a staff subelements. Thus jext(staff )j = 0. From this follows '2 immediately. It should be noted that given a di erent DTD,  may not imply any of '1 and '2 . For example, if we change the element type de nition of coordinator to


city

capital

"PA"

1


capital

state

(state+, capital+)> (capital, city )>

jext(state)j < jext(capital)j:

It speci es that a database consists of a non-empty sequence of states and a non-empty sequence of capitals, and a state has a capital and other cities. Here again we omit the descriptions of elements whose type is string. Assume that state has an attribute name, capital has an attribute in state, and city has an attribute name. A nite XML tree conforming to D2 is shown in Fig.2. It is natural to consider 2 , a set of simple unary key and foreign key constraints:

(b)

Observe that (b) contradicts to (a) given above. Thus there is no nite XML tree that can satisfy both (a) and (b). Again, the DTD D2 interacts with the set 2 of key and foreign key constraints and as a result, the nite implication holds given this particular D2 . With a di erent DTD, the nite implication may not hold.

Contributions. The central technical problem con-

sidered in this paper is the nite implication problem associated with keys and foreign keys for XML. We investigate the problem for constraints de ned in two languages that were introduced and studied in [17, 16]. One language, denoted by Lu , de nes a class of unary keys and foreign keys, and the other, denoted by L, is a class of multi-attribute keys and foreign keys. We make the following contributions:

state:name ! state, capital:in state ! capital, capital:in state  state:name: These assert that name is a key of state elements, in state is a key of capital elements, and it is also a 3

tional dependencies over R, one can always construct a database instance I of R such that I satis es  and moreover, I contains one tuple for each relation in R. It is well-known that the nite implication problem for inclusion and functional dependencies is undecidable. It was shown in [17] that the nite implication problem for keys and foreign keys in relational databases is also undecidable. For unary inclusion and functional dependencies, [14] has shown that the nite implication problem is decidable in linear time. In contrast, this paper shows that nite implication problem associated with the simpler unary keys and foreign keys for XML is coNP-complete. A variety of path constraints have been studied for semistructured and XML data [2, 9, 11]. The interaction between path constraints and database schemas was investigated in [10]. Path constraints specify inclusions among certain sets of objects in edge-labeled graphs. They are not capable of expressing keys.

1. We show that the nite implication problem for L is undecidable. 2. We show that the nite implication problem for Lu is coNP-complete. 3. We identify several special cases of the nite implication problems for Lu and L, and establish their PTIME decidability. 4. It was proved in [16] that the nite satis ability problem for Lu is NP-complete. We improve this by showing that the nite satis ability problem for unary keys, unary inclusion constraints and their negations remains NP-complete, where inclusion constraints are a generalization of foreign keys. As shown by [14], the nite implication problem for unary keys and foreign keys in relational databases is decidable in linear time. The results of this paper show that reasoning about unary keys and foreign key is much harder, because of the interaction between DTDs and key, foreign key constraints.

Organization. The rest of the paper is organized as

follows. Section 2 describes a formalism of XML DTDs and de nes two classes of keys and foreign keys for XML, namely, Lu and L. Section 3 establishes the undecidability of the nite implication problem for L constraints. Section 4 shows that the nite implication problem for Lu constraints is coNP-complete. It also improves the results of [16] by showing that the nite satis ability problem for unary keys, unary inclusion constraints and their negations is NP-complete. Several PTIME decidable cases of the nite implication problems are also identi ed in Sections 3 and 4. Finally, Section 5 identi es directions for further work. Proofs of the results in the paper are given in an appendix.

Related work. Key and foreign key speci cations have

been proposed in, e.g., the XML standard (via ID and IDREF attributes) [5], XML Schema [29] and in a recent proposal [6]. However, few results have been known for reasoning about keys and foreign keys for XML. The nite implication problems associated with Lu and L constraints were investigated in connection with a graph model of XML in [17]. However, in contrast to the XML tree model, the graph model allows sharing of nodes and thus simpli es the analysis of nite implication. It should be mentioned that in general, there is no one to one mapping between XML graphs and XML documents. In other words, an XML graph may have multiple XML documents corresponding to it. The nite implication problem for the key language proposed in [6] was studied in [7, 8]. However, the problem was considered in the absence of foreign keys and DTDs, and as a result, reasoning about keys in their setting is much simpler. We have investigated nite satis ability of Lu and L constraints [16]. Among other things, [16] shows that the nite satis ability problem for Lu is NP-complete, and the nite satis ability problem for L is undecidable. For relational databases, the nite satis ability and nite implication problems have been well studied for the more general inclusion and functional dependencies (see, e.g, [1] for a survey). In the relational database setting, the nite satis ability problem for inclusion and functional dependencies is trivial: given any relational schema R and any set  of inclusion and func-

2 Keys and foreign keys for XML In this section, we review the formalism of XML DTDs, the de nition of XML trees and the de nition of Lu and L constraints introduced and studied in [17, 16]. We also describe the nite implication problems for Lu and L.

2.1 A formalism of DTDs We shall use the following formalism of XML DTDs. De nition 2.1 [17, 16]: A DTD (Document Type De nition) is de ned to be D = (E; A; P; R; r), where:

 E is a nite set of element types , ranged over by ;

4

 A is a nite set of attributes , ranged over by l; E and A are disjoint, i.e., E \ A = ;;  P is a mapping from E to element type de nitions:

 V is a set of vertices (nodes );  lab is a mapping from V to E [ A [ fSg;  ele is a partial function from V to sequences of V vertices such that for any v 2 V , ele(v) is de ned i lab(v) =  and  2 E , and moreover, if P ( ) is

P ( ) = ; where is a regular expression de ned

as follows: ::= S j  0 j  j 0 j0 j 0 ;0 j  where S denotes string type,  0 2 E ,  denotes the empty sequence, and \j", \;" and \" stand for union, concatenation, and the Kleene closure, respectively;  R is a mapping from E to P (A), the power-set of A;  r 2 E and is called the element type of the root . Without loss of generality, we assume that r does not occur in P ( ) for any  2 E . Moreover, we assume that that each  2 E ,  is connected to r. That is,  appears in the regular expression obtained by repeatedly substituting P ( 0 ) for each  0 that occurs in P (r). To simplify the discussion, we assume that all attributes are single-valued. That is, if l 2 R( ) then every element of type  has a unique l attribute and in addition, the value of the l attribute is a string. We do not consider other XML attributes, e.g., set-valued attributes, in this paper. As an example, let us consider the college DTD D1 given in Section 1. In our formalism, D1 can be represented as (E1 ; A1 ; P1 ; R1 ; r1 ), where E1 = fcollege; coordinator; course; professor; staff g A1 = ffor; taught by; num; name; teachesg P1 (college) = course ; coordinator P1 (coordinator) = professor j staff P1 (course) = P1 (professor) = P1 (staff ) = S R1 (coorinator) = fforg R1 (course) = fnum; taught byg R1 (professor) = fname; teachesg R1 (staff ) = fnameg R1 (college) =; r1 = college XML documents are typically modeled as trees, e.g., in XSL [12, 31], XQL [28], XML Schema [29], XPath [13] and DOM [3]. Along the same lines, we de ne XML trees as follows. De nition 2.2 [16]: Let D = (E; A; P; R; r) be a DTD. A data tree T valid w.r.t. D is de ned to be T = (V; lab; ele; att; val; root); where 0

0

   

and ele(v) = [v1 ; :::; vn ], then lab(v1):::lab(vn ) must be in the regular language de ned by ; att is a partial function from V  A to V such that for any v 2 V and l 2 A, att(v; l) is de ned i lab(v) =  ,  2 E and l 2 R( ); val is a partial function from V to string values such that for any node v 2 V , val(v) is a string i lab(v) = S or lab(v) 2 A; root is a distinguished vertex in V and is called the root of T ; in addition, lab(root) = r; T has a tree structure. More speci cally, for any v 2 V , there exists a unique path from root to v. A path from root is a sequence of labels, de ned as follows: {  is a path from root to itself; { if  is a path from root to v with lab(v) =  in E , then for any v0 in ele(v) with lab(v0 ) =  0 and  0 2 E [ fSg, : 0 is a path from root to v0 ; and moreover, for any l 2 R( ), :l is a path from root to att(v; l).

We assume that there is a unique node in T labeled r. A data tree T is said to be nite if V is nite. Intuitively, V is the set of vertices of the tree T . The mapping lab labels every node of V with a symbol from E [ A [ fSg. In general, vertices labeled with element types of E are internal nodes of T , and those labeled with S or attributes of A are leaves of T . More specifically, if a node x is labeled  in E , then functions ele and att de ne the children of x, which are partitioned into sub-elements and attributes according to P ( ) and R( ) in DTD D. Sub-elements of node x are ordered and their labels observe the regular expression P ( ). In contrast, attributes of node x are unordered and are identi ed by their labels (names). The function val assigns string values to attributes of nodes and also to vertices labeled S. The last condition in the de nition above assures that T indeed has a tree structure. As a result, sharing of nodes is not allowed in T . For example, Figures 1 and 2 depict data trees valid w.r.t. D1 and D2 given in Section 1, respectively. 5

R3 (student) = fstudent idg R3 (enroll) = fstudent id; dept; course nog R3 (school) = R3 (name) = R3 (gpa) = R3 (subject) = R3 (grade) = ; r3 = school Typical L constraints over D3 include: (1) student[student id] ! student, (2) course[dept; course no] ! course, (3) enroll[student id; dept; course no] ! enroll,  student[student id], (4) enroll[student id] (5) enroll[dept; course no]  course[dept; course no]. The rst three constraints are keys in L, the last two

There is an one to one mapping between data trees and XML documents valid w.r.t. a DTD D when the order of attributes is ignored. Given a DTD D and a data tree T valid w.r.t. D, we de ne the following notations. We use T j= D to denote that T is valid w.r.t. D. For any  2 E [ fSg, we use ext( ) to denote the set of all the nodes in T labeled  . For any node x in T labeled  in E and for any l 2 R( ), we use x:l to indicate val(att(x; l)), i.e., the value of the attribute l of x. We de ne ext(:l) to be fx:l j x 2 ext( )g, which is a set of strings. For any nite set S , we use jS j to denote the cardinality of S .

2.2 The de nitions of Lu and L constraints

are inclusion constraints, and the pairs (4, 1) and (5, 2) are foreign keys in L. Let T be a data tree valid w.r.t. D. For each  element x in T and a sequence X = [l1 ; : : : ; ln ] of attributes in R( ), we use x[X ] to denote the sequence of X attribute values of x, i.e., x[X ] = [x:l1 ; : : : ; x:ln ]. The tree T satis es a L constraint ' over a DTD D, denoted by T j= ', i

We next review the de nition of keys and foreign keys introduced in [17, 16] for XML. We begin with the constraint language of multi-attribute keys and foreign keys, denoted by L. Let D = (E; A; P; R; r) be a DTD. A constraint ' of L over D has one of the following forms:

 key:  [X ] !  , where  2 E and X is a set of

 if ' is a key  [X ] !  , then in T , ^ 8 x y 2 ext( ) ( (x:l = y:l) ! x = y):

attributes in R( ). It indicates that the set X of attributes is a key of elements of  .  foreign key: 1 [X ]  2 [Y ] and 2 [Y ] ! 2 , where 1 ; 2 2 E , X; Y are non-empty sequences of attributes in R(1 ), R(2 ), respectively, and moreover, X and Y have the same length. This constraint indicates that X is a foreign key of 1 elements referencing key Y of 2 elements.

l2X

That is, two distinct  nodes in T cannot have the same X attribute values;  if ' is a foreign key 1 [X ]  2 [Y ] and 2 [Y ] ! 2 , then T j= 2 [Y ] ! 2 and

A constraint of the form 1 [X ]  2 [Y ] is called an inclusion constraint. Observe that a foreign key is actually a pair of constraints, namely, an inclusion constraint 1 [X ]  2 [Y ] and a key 2 [Y ] ! 2 . In contrast to foreign keys, inclusion constraints do not require the presence of keys. To illustrate keys and foreign keys of L, let us borrow the following example from [16]. Consider DTD D3 = (E3 ; A3 ; P3 ; R3 ; r3 ), where

8 x 2 ext( ) 9 y 2 ext( ) (x[X ] = y[Y ]): 1

2

That is, the sequence of X attribute values of every 1 node in T must match the sequence of Y attribute values of some 2 node in T . In addition, Y is a key of 2 . Observe that two notions of equality are used to de ne keys: string value equality is assumed in x:l = y:l, and x = y is true if and only if x and y are the same node. This is di erent from the meaning of keys in relational databases. It should be noted that given any DTD D, there are nitely many L constraints over D.

E3 = fschool; student; course; enroll; name; gpa; subject; gradeg A3 = fstudent id; course no; deptg P3 (school) = course ; student; enroll P3 (course) = subject P3 (student) = name; gpa P3 (enroll) = grade P3 (name) = P3 (gpa) = P3 (subject) = P3 (grade) = S R3 (course) = fdept; course nog

The class of unary keys and foreign keys for XML, denoted by Lu, is a sublanguage of L. An Lu constraint is an L constraint in which the cardinality of the set (or the lengths of sequences) of attributes is 1. More specifically, let D = (E; A; P; R; r) be a DTD. A constraint ' of Lu over D has one of the following forms: 6

Lemma 3.1 [16]: The nite satis ability problem for

 key: :l !  , where  2 E and l 2 R( ).  foreign key:  :l   :l and  :l !  , where  ;  2 E , l 2 R( ), and l 2 R( ). For example, the constraints of  [ f' ; ' g and  [ f' g given in Section 1 are Lu constraints over 1

1

2

1

1

2 2

1

2

2

L is undecidable.

2

By reduction from the nite satis ability problem for L, we prove a stronger undecidability result: Theorem 3.2: It is undecidable to determine, given any DTD D, any set  of L constraints over D, any unary key '1 and any unary inclusion constraint '2 , whether  (D; ) j= '1 ;  (D; ) j= '2 .

2

1

2

2

1

2

3

DTDs D1 and D2 , respectively.

Let D be a DTD,  a set of L (Lu ) constraints over D and T a data tree valid w.r.t. D. We use T j=  to indicate that T satis es , i.e., for any  2 , T j= . Let ' be another L (Lu ) constraint. We say that  nitely implies ' over D, denoted by (D; ) j= ', if for any nite data tree T such that T j= D and T j= , there must be T j= '. It should be noted when ' is a foreign key, ' consists of an inclusion constraint 1 and a key 2 . In this case (D; ) j= ' in fact means that (D; ) j= 1 ^ 2 .

Recall that a foreign key ' in L consists of a key '1 and an inclusion constraint '2 . Thus (D; ) j= ' i (D; ) j= '1 ^ '2 : Therefore, from Theorem 3.2 follows immediately: Corollary 3.3: The nite implication problem for L is undecidable. We prove Theorem 3.2 by reduction from the nite satis ability problem for L. Let D = (E; A; P; R; r) be a DTD, and  a set of L constraints over D. We de ne another DTD D0 = (E 0 ; A0 ; P 0 ; R0 ; r), where E 0 = E [ fDY ; EX g where DY , EY are fresh element types A0 = A [ fK g where K is a fresh attribute P 0 (r) = P (r); DY ; DY ; EX i.e., P (r) followed by two DY elements and an EX element P 0 ( ) = P ( ) for all  2 E P 0 (DY ) = P 0 (EX ) =  R0 (DY ) = fK g R0 (EX ) = fK g R0 ( ) = R( ) for all  2 E R0 (r) = R(r) We de ne an unary key '1 , an unary inclusion constraint '2 and another key over D0 as follows: '1 : DY :K ! DY , '2 : DY :K  EX :K , : EX :K ! EX . Obviously, every L constraint over D is also an L constraint over D0 . Thus  is also a set of L constraints over D0 . We show that V 1.  is nitely satis able over D i  ^  ^ '2 ^:'1 is nitely satis able over D0 ; V 2.  is nitely satis able over D i  ^  ^ '1 ^:'2 is nitely satis able over D0 .

The central technical problem investigated in this paper is the nite implication problem for keys and foreign keys for XML. That is the problem to determine, given any DTD D and any set  [ f'g of keys and foreign keys over D, whether (D; ) j= '. When keys and foreign keys are in L (resp. Lu ), we refer to the problem as the nite implication problem for L (resp. Lu ). For example, (D1 ; 1 ) j= '1 ^ '2 and (D2 ; 2 ) j= '3 given in Section 1 are instances of the nite implication problem for Lu . Observe that there are keys in 1 (2 ) that, together with '1 and '2 ('3 ), make up foreign keys over D1 (D2 ). A closely related problem is the nite satis ability problem for keys and foreign keys for XML. That is the problem to determine, given any DTD D and any set  of keys and foreign keys over D, whether there exists a nite data tree T such that T j=  and T j= D. Again, when keys and foreign keys are in L (resp. Lu ), we refer to the problem as the nite satis ability problem for L (resp. Lu ). We investigate the nite implication problems for L and Lu in Sections 3 and 4, respectively. In Sections 4 we shall also study the nite satis ability problem for unary keys, unary inclusion constraints and their negations.

3 Finite implication of L constraints In this section, we rst show that the nite implication problem for L is undecidable, and then identify a lineartime decidable case of the nite implication problem. To establish the undecidability of the nite implication problem for L, we use the following result of [16]. 7

T.

r

Dy

Dy

Ex

@K

@K

@K

A proof of the lemma can be found in Appendix. Let  be a set of keys in L over D, and ' =  [X ] !  be another key in L over D. We say that  contains a superkey of ' if there is  [Y ] !  in  such that X  Y , i.e., X is a subset of Y . Using this and Lemma 3.5 we can prove the following (see Appendix for a proof): Lemma 3.6: Let D be a DTD,  a set of keys in L over D, and ' =  [X ] !  another key in L over D. There is a nite data tree T such that T j= D, T j=  but T j= :' i  does not contain a superkey of ' and moreover, there is a nite tree T 0 such that T 0 j= D and jext( )j > 1 in T 0. In addition, this is decidable in linear time in the sizes of D and  [ f'g. Lemma 3.6 suces to prove Theorem 3.4 because (D; ) j= ' i there is no nite data tree T such that T j= D, T j=  but T j= :'. By Lemma 3.6, the later can be decided in linear time.

T

Figure 3: A tree used in the proof of Theorem 3.2 For if these hold, then the encoding is a reduction from the nite satis ability problem for L to the complement of the nite implication problem described in Theorem 3.2. Theorem 3.2 holds if the complement of the nite implication problem is undecidable. We prove (1) as follows. If there exists a nite tree V 0 T j= D and T j=  ^  ^ '2 ^:'1 , then we construct another tree T 0 by removing DY , EX elements from T . Obviously, T 0 j= D and T 0 j= . Conversely, suppose that there is a nite tree T j= D and T j= . We construct another tree T 0 from T as shown in Fig. 3. Let us refer to the two DY elements in T 0 as d1 ; d2 , and the EX element as e. Let d1 :K = d2 :K = e:K . Then it is easy to see that T 0 j= D0 , T 0 j=  and T 0 j=  ^ '2 ^:'1 . We now prove (2). As above, weVcan show that if there is a nite tree T j= D0 and T j=  ^  ^ '1 ^:'2 , then there exists another tree T 0 such that T 0 j= D and T 0 j= . Conversely, suppose that there is a nite tree T j= D and T j= . We construct a tree T 0 from T as shown in Fig. 3. Again we refer to the two DY elements in T 0 as d1 ; d2 , and the EX element as e. Now let d1 :K 6= d2 :K . Then it is easy to see that T 0 j= D0 , T 0 j=  and T 0 j=  ^ '1 ^ :'2 . This completes the proof of Theorem 3.2. Observe that given any DTD D, there are nitely many L constraints over D. Thus when D is xed, nite implication of L constraints is decidable [15].

Given Theorem 3.4, one is tempted to think that nite implication of foreign keys in L would be simpler than the nite implication problem for L. However, it is not the case. Recall that a foreign key of L consists of an inclusion constraint and a key. Thus we cannot exclude keys in the presence of foreign keys. In fact, it is not hard to show that nite implication of foreign keys in L has the same complexity as that of the nite implication problem for L. Thus the problem remains undecidable.

4 Finite implication of Lu constraints The undecidability of the nite implication problem for

L suggests us to study sublanguages of L that possess

decidable nite implication problem. In this section we investigate Lu , the sublanguage of L consisting of unary keys and foreign keys for XML. The main complexity result of this section is: Theorem 4.1: The nite implication problem for Lu is coNP-complete. Recall that the nite implication problem for Lu is to determine, given any DTD D and any set  [f'g of Lu constraints over D, whether (D; ) j= '. To prove Theorem 4.1, it suces to show that the complement of the nite implication problem for Lu is NP-complete. That is the problem to determine whether there is a nite tree T such that

We next consider a special case of the nite implication problem for L. The nite implication problem for keys in L is the problem to determine, given any DTD D and any set  [ f'g of keys in L over D, whether (D; ) j= '. We show that this problem is decidable. Theorem 3.4: The nite implication problem for keys in L is decidable in linear time. To show this, we need the following lemma. Lemma 3.5: For any DTD D and element type  in D, it is decidable in linear time whether there is a nite tree T such that T j= D and moreover, jext( )j > 1 in

T j= D ^ 8

^

 ^ :':

Recall that a foreign key of Lu consists of a unary key '1 and a unary inclusion constraint '2 . Thus when ' is a foreign key, the problem above is a disjunction of two:

fact a reduction from the nite satis ability problem for Lu to the problems (1) and (2). In addition, the reduction is obviously in polynomial time. Therefore, we have the following: Lemma 4.3: The nite implication problem for Lu is coNP-hard.

^

T j= D ^  ^ :'1 ^ T j= D ^  ^ '1 ^ :'2 Observe that when ' is a key in Lu , the complement of (D; ) j= ' is actually problem (1) given above. Thus to (1) (2)

In relational databases it is common to consider primary keys. That is, for each relation one can specify at most one key, namely, the primary key of the relation. In the XML setting, the primary key restriction requires that for each element type  2 E , one can specify at most one key, i.e., there is at most one l 2 R( ) such that :l !  . Under the primary key restriction, the nite implication problem for Lu is to determine, given any DTD D and any set  [ f'g of Lu constraints in which there is at most one key for each element type (either given as a key or as part of a foreign key), whether (D; ) j= '. One might think that the primary key restriction will simplify the analysis of nite implication of Lu constraints. Unfortunately, this is not the case. Lemma 4.4: Under the primary key restriction, the nite implication problem for Lu is also coNP-hard. To prove this, we need another result from [16]. Lemma 4.5 [16]: Under the primary key restriction, the nite satis ability problem for Lu is NP-complete.

prove Theorem 4.1, we need only to show that problems (1) and (2) are NP-complete. Most known NP-complete problems are easily shown to be in NP, and only the reduction from a known NPcomplete problem is dicult [20]. For the complement of the nite implication problem for Lu , the opposite is the case. In the rest of the section, we rst show that the complement of the nite implication problem is NP-hard. We then show that it is in NP, and we generalize a result of [16] by showing that the nite satis ability problem for unary keys, unary inclusion constraints and their negations is NP-complete. Finally, we identify some PTIME decidable cases of the nite implication problem.

4.1 Finite implication problem for Lu is coNP-hard As observed above, to prove that the nite implication problem for Lu is coNP-hard, it suces to show that the problems (1) and (2) above are NP-hard. To do so, recall the following result of [16]. Lemma 4.2 [16]: The nite satis ability problem for Lu is NP-complete. Given this, we show that the problems (1) and (2) are NP-hard by reduction from the nite satis ability problem for Lu . More speci cally, given any DTD D and any set  of Lu constraints, we de ne in polynomial time a DTD D0 , sets 1 and 2 of Lu constraints, a unary key '1 and a unary inclusion constraint '2 such that

Observe that in the reduction from the nite satis ability problem for L to the complement of the nite implication problem for L given in the last section, if the set  of constraints given is a set of Lu constraints under the primary key restriction, then these and the new constraints de ned in the reduction do not violate the primary key restriction. In other words, the reduction also serves as a reduction from the nite satis ability problem for Lu to the complement of the nite implication problem for Lu under the primary key restriction. Therefore, from Lemma 4.5 follows immediately Lemma 4.4.

V

1. there is a nite tree VT j= D ^  i there is a nite tree T 0 j= D0 ^ 1 ^ :'1 ; V 2. there is a nite tree VT j= D ^  i there is a nite tree T 0 j= D0 ^ 2 ^ '1 ^ :'2 .

The lemma below is immediate from Lemmas 4.3 and 4.4. Corollary 4.6: The following problems are coNP-hard, even under the primary key restriction:  nite implication of unary keys, unary foreign keys and negations of unary keys.  nite implication of unary keys, unary inclusion constraints and their negations.

Recall the reduction from the nite satis ability problem for L to the complement of the nite implication problem for L given in Section 3. Observe that the new constraints '1 ; '2 ;  de ned in the reduction are in Lu . Thus given a set of Lu constraints, the reduction is in 9

where aij is the j th element of the ith row of A, xj is the j th entry of ~x and bi is the ith entry of ~b. It is known that LIP is NP-complete in the strong sense [19]. It is straightforward to represent (D; ) as an instance of LIP. As a result, whether (D; ) admits a non-negative integer solution is in NP. To complete the proof of Lemma 4.7, it suces to show the following: Lemma 4.9: Let D be a DTD,  be a set of of unary keys, unary inclusion constraints and negations of unary keys over D, and (D; ) be the system of linear inequalities determined by D and . There is a nite tree T such that T j=  and T j= D i (D; ) has a non-negative integer solution. A proof of Lemma 4.9 is given in Appendix.

4.2 Finite implication problem for Lu is in coNP To prove that the nite implication problem for Lu is

in coNP, we rst show that the problem (1) given above is in NP, and then prove the problem (2) is also in NP. To prove that that the problem (1) is in NP, we show a stronger result: Lemma 4.7: The following problem is in NP: given any DTD D and any set  of unary keys, unary inclusion constraints and negations of unary keys over D, whether there is a nite tree T such that T j= D and T j= . To show this lemma, we use the following result from [16]. Lemma 4.8 [16]: Given any DTD D and any set  of Lu constraints, one can compute a system (D; ) of linear inequalities in linear time in the sizes of D and , such that (D; ) has a non-negative integer solution i there is a nite tree T satisfying  and conforming to D, i.e., T j=  and T j= D. More speci cally, [16] shows the following. First, one can compute, in linear time, the system (D; ) in which jext( )j and jext(:l)j are variables for all element type  and attribute l of  in D. Second, given a nonnegative integer solution to (D; ) one can construct a nite tree T such that T j= , T j= D and moreover, the cardinalities jext( )j and jext(:l)j in T are the values of the variables jext( )j and jext(:l)j in (D; ) given by the solution; and vice versa. In fact, the proof of the lemma given in [16] shows that these also hold for unary keys and unary inclusion constraints. Now consider a set  = 1 [ 2 , where 1 is a set of unary keys and unary inclusion constraints over D, and 2 is a set of negations of unary keys over D. Let (D; 1 ) be the system of linear inequalities determined by D and 1 as described in Lemma 4.8. We de ne another system of linear inequalities, denoted by (D; ) and referred to as the system determined by D and , to be (D; 1 ) [ fjext(:l)j < jext( )j j :(:l !  ) 2 2 g: Clearly, (D; ) can be computed in linear time by Lemma 4.8. It is in NP to decide whether (D; ) admits a nonnegative integer solution. To see this, recall the linear integer programming (LIP ) problem: Given an m  n matrix A of integers and a column vector ~b of m integers, does there exist a column vector ~x of n (non-negative) integers such that A ~x  ~b? That is, for i 2 [1; m],

X

i2[1;n]

We now proceed to show that the problem (2) given above is also in NP. Again, we prove a stronger result. Lemma 4.10: The following problem is in NP: given any DTD D and any set  of unary keys, unary inclusion constraints and negations of unary inclusion constraints over D, whether there is a nite tree T such that T j= D and T j= . The proof of this lemma is a little involved. We cannot reduce the problem directly to LIP as above, because there is no direct connection between i :li 6 j :lj and the cardinalities jext(i )j, jext(j )j, jext(i :li )j and jext(j :lj )j in a tree. Below we present a NP algorithm for determining nite satis ability of unary keys, unary inclusion constraints and negations of unary inclusion constraints. Consider a DTD D and  = 1 [ 2 , where 1 is a set of unary keys and unary inclusion constraints over D, and 2 consists of negations of unary inclusion constraints over D. Let (D; 1 ) be the system of linear inequalities determined by D and 1 , as described in Lemma 4.8. Let l1 ; : : : ; ln be an enumeration of all attributes in D. Without loss of generality, assume that li is an attribute of element type i (note that i 's need not be distinct). Let U = (uij )ni;j=1 and V = (vij )ni;j=1 be two matrices whose elements are nonnegative integers. We say that they admit a set representation if there is a family of nite sets A1 ; : : : ; An such that

uij = j Ai \ Aj j;

vij = j Ai ? Aj j :

We extend (D; 1 ) with new variables, uij ; vij , and equations:

aij xj  bi ;

 jext(i :li )j = uii = uij + vij for all i; j 2 [1; n]; 10

 vij = 0 for all i :li  j :lj in  , and moreover,

vii = 0;  vij > 0 for all i :li 6 j :lj in 2 .

it must have one with a bound on uij ; vij which is polynomial (in terms of the size of (D; )). For that, recall [24] that if a system of k linear inequalities with l variables and all coecients at most c has a non-negative integer solution, then it has a non-negative integer solution in which none of the variables exceeds l(ck)2k+1 . Thus, M can be taken to be a number that in binary notation has 1+ dlog l +(2k +1)  log(ck)e many 1's. Note that the number of variables, l, of 0 (D; ) is at most exponential in the size of (D; ), and the number of equations, k, is at most polynomial. This shows that M can be found in polynomial time, and thus proves the lemma. Given Lemmas 4.11 and 4.12, we show that nite satis ability of  over D is in NP. Our nondeterministic machine computes M given by Lemma 4.12, and then guesses a solution with all the components bounded by M . It then tests if the U; V part has a set representation. To do so, we transform U; V, in polynomial time, into another matrix W, and then run a nondeterministic polynomial time machine on W. If it returns `yes', then U; V has a set representation, and thus by Lemma 4.11 the answer to whether  is nitely satis able over D is `yes'. Let K = M  n. We now de ne the matrix W. It is a 2n  2n matrix, with 8 uij if i; j  n >> if i  n; j > n < vi;j?n if i > n; j  n wij = > vi?n;j

1

Let us denote the new system by (D; ) and refer to it as the system determined by D and . Observe that (D; ) can be simply converted to a system of linear inequalities (by treating an equation as two inequalities). The intended interpretation for the variable uij is j ext(i :li ) \ ext(j :lj ) j, and j ext(i :li ) ? ext(j :lj ) j for vij . Thus, the inequalities vij > 0 in (D; ) says that ext(i :lj ) 6 ext(tj :lj ) for all i :li 6 j :lj in 2 . The lemma below reveals the connection between the encoding and the satis ability problem we are investigating. A proof of the lemma is given in Appendix. Lemma 4.11: The linear system (D; ) determined by DTD D and constraints  has a non-negative integer solution with U; V having a set representation i there is a nite tree T such that T j= D and T j= . It remains to show that one can check in NP whether the system (D; ) has a non-negative integer solution with U; V having a set representation. We start with a lemma. Lemma 4.12: Given (D; ), one can compute, in polynomial time, a number M such that (D; ) has a non-negative integer solution with U; V having a set representation i it admits such a solution with all variables being bounded by M . To prove the lemma, we need to extend (D; ). Let  be the set of functions  : f1; : : : ; ng ! f0; 1g which are not identically 0. For every , we introduce a new variable z (note that the number of variables is now exponential in the size of the problem). The intended interpretation of z is the cardinality of [ \ ext(j :lj ): ext(i :li ) ? i:(i)=1

>> K ? ui?n;j?n ? : vi?n;j?n ? vj?n;i?n

if i; j > n Recall the INTERSECTION PATTERN problem: Given an m  m matrix A, are there sets Y1 ; : : : ; Ym such that aij =j Yi \ Yj j? This problem is known to be NP-complete (see, e.g., [19]). We now show the following: The INTERSECTION PATTERN problem returns `yes' on input W i U; V have a set representation. First, assume U; V have a set representation. That is, there are nite sets A1 ; : : : ; An such that uij = j Ai \ Aj j; vij = j Ai ? Aj j : By the assumption, all entries in U; V are bounded by M , and hence we may assume all sets in the representation are subsets of a set U of cardinality K . Let m = 2n and de ne Yi to be Ai for i  n, and U ? Ai?n for i > n. Then W is the intersection pattern for this family of sets, and thus the INTERSECTION PATTERN problem returns `yes' on W. Next, assume the INTERSECTION PATTERN returns `yes' on W, so we have a family of sets Y1 ; : : : ; Y2n

j :(j )=0

We now extend (D; ) to 0 (D; ) by adding the following equations: X X z : z ; vij = uij = :(i)=(j )=1

:(i)=1;(j )=0

Clearly, (D; ) has a non-negative integer solution with U; V having a set representation i 0 (D; ) has a non-negative integer solution, as the variables z describe all possible intersections of ext(i :li ) and their complements, and the equations above show how to reconstruct uij and vij from them. We thus must show that if 0 (D; ) has a non-negative integer solution then 11

for which W is the intersection pattern. Let U be the union of all Yj 's. We show Yn+i = U ? Yi for all i  n. We have wi;n+i = vii = 0, and thus Yn+i  U ? Yi . Moreover, we have j Yi [ Yn+i j= wii + wn+i;n+i = K . We next show that for every i; j  n it is the case that Yi [ Yn+i = Yj [ Yn+j (and thus equals to U ). Note that both Yi [ Yn+i and Yj [ Yn+j are K -element sets. Furthermore, (Yi [ Yn+i ) \ (Yj [ Yn+j ) = (Yi \ Yj ) [ (Yi \ Yn+j ) [ (Yn+i \ Yj ) [ (Yn+i \ Yn+j ): These four sets are pairwise disjoint, and their cardinalities are wij = uij ; wi;j+n = vij ; wi+n;j = vji and wi+n;j+n = K ? uij ? vij ? vji , respectively. Thus, the cardinality of (Yi [ Yn+i ) \ (Yj [ Yn+j ) is K , and since the cardinality of each Yi [ Yn+i and Yj [ Yn+j is K , we conclude Yi [ Yn+i = Yj [ Yn+j . This nally shows that U has cardinality K , and thus each Yn+i is U ? Yi for all i  n. This immediately gives us a set representation for U; V. To conclude, once we guessed a bounded solution to (D; ) (all components are at most M ), we proceed to compute in polynomial time the matrix W from U and V, and then run a nondeterministic polynomial time algorithm on it to check if W is an intersection pattern. Putting everything together, we see that this nondeterministic polynomial time algorithm returns `yes' i there is bounded non-negative solution (and thus, there is a solution) to (D; ) for which U; V have a set representation. By Lemma 4.11, this happens if and only if there exists a nite tree T such that T j= D and T j= . This completes the proof of Lemma 4.10. Immediately from Lemmas 4.10 and 4.7 follows: Corollary 4.13: The nite implication problem for Lu constraints is in coNP. This and Lemma 4.3 prove Theorem 4.1.

nite tree T such that T j= D and T j= . Indeed, if (D; ) has such a solution, then by Lemma 4.11 there exists a nite tree such that T j= D, T j= 1 and T j= 3 . Moreover, since 2  , the proof of Lemma 4.7 shows that we must also have T j= 2 . Conversely, assume that there exists a nite tree T such that T j= D and T j= . Then by (the proof of) Lemma 4.7, there is a non-negative integer solution to (D; 1 [ 2 ) such that the values of the variables jext( )j and jext(:l)j are the cardinalities jext( )j and jext(:l)j in T , respectively. We extend the solution as follows: let uij be j ext(i :li ) \ ext(j :lj ) j, and vij be j ext(i :li ) ? ext(j :lj ) j. It is easy to verify that this is indeed a solution to (D; ) with U; V having a set representation. From this and the fact that nite satis ability of Lu constraints is NP-hard (Lemma 4.3) follows: Corollary 4.14: The nite satis ability problem for unary keys, unary inclusion constraints and their negations is NP-complete. This is stronger than Lemma 4.8 established in [16]. This corollary, Lemma 4.7 and Corollary 4.6 prove the following. Corollary 4.15: The following problems are coNPcomplete, even under the primary key restriction:  nite implication of unary keys, unary foreign keys and negations of unary keys.  nite implication of unary keys, unary inclusion constraints and their negations.

4.3 PTIME decidable cases We next consider special cases of the nite implication problem for Lu that are decidable in PTIME. First, suppose that we are given a xed DTD D. Consider a set  of unary keys, unary inclusion constraints and their negations over D. Let  = 1 [ 2 , where 1 is the set of unary keys and unary inclusion constraints in , and 2 consists of negations in . Recall the system 0 (D; ) of linear inequalities de ned in the proofs of Lemma 4.12 and Corollary 4.14 given above. The system consists of inequalities of (D; 1 ), the system determined by D and 1 , as well as inequalities for encoding set representation. In [16] it has been shown that the number of variables in (D; 1 ) is bounded by the size of D. Thus by the de nition of 0 (D; ) given above, the number of variables in 0 (D; ) is also bounded by D. Therefore, when D is xed, the number of variables in 0 (D; ) is bounded

Observe that the proof of Lemma 4.10 can be easily generalized to show that the nite satis ability problem for unary keys, unary inclusion constraints and their negations is also in NP. To see this, consider a set  of constraints over a DTD D, where  = 1 [ 2 [ 3 , 1 is a set of unary keys and unary inclusion constraints, 2 consists of negations of unary keys, and 3 is a set of negations of unary inclusion constraints. We rst compute (D; 1 [ 2 ) as in the proof of Lemma 4.7, and then extend it with U and V to de ne (D; ) as described in the proof of Lemma 4.10. Obviously, (D; ) is a system of linear inequalities that can be computed in PTIME. Moreover, (D; ) has an integer solution with U; V having a set representation i there is a 12

by a constant. It is known that when the number of variables in a system of linear inequalities is bounded, it can be determined in PTIME whether the system admits a non-negative integer solution [22, 25]. By Lemma 4.11 and the proofs of Lemma 4.12 and Corollary 4.14, 0 (D; ) admits a non-negative integer solution if and only if there is a nite tree T such that T j= D and T j= . Therefore, we have the following: Corollary 4.16: Given a xed DTD, the following problems can be determined in PTIME:

with these languages, and we expect that some complexity results established in this paper and [16] will still hold for these more powerful key and foreign key speci cations.

Acknowledgements. We thank Michael Benedikt, Peter Buneman, Frank Neven and Jer^ome Simeon for valuable suggestions and discussions.

References [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. [2] S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 122{133, Tucson, Arizona, May 1997. [3] V. Apparao, S. Byrne, M. Champion, S. Isaacs, I. Jacobs, A. Le Hors, G. Nicol, J. Robie, R. Sutor, C. Wilson, and L. Wood. Document Object Model (DOM) Level 1 Speci cation. W3C Recommendation, Oct. 1998. http://www.w3.org/TR/REC-DOM-Level-1/. [4] C. Baru, A. Gupta, B. Ludascher, R. Marciano, Y. Papakonstantinou, P. Velikhov, and V. Chu. XML-based information mediation with MIX. In Proceedings of ACM SIGMOD Conference on Management of Data, pages 597{599, Philadelphia, Pennsylvania, June 1999. [5] T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation, Feb. 1998. http://www.w3.org/TR/REC-xml/. [6] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for XML. Draft manuscript, July 2000. [7] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Reasoning about keys for XML. Draft manuscript, Sept. 2000. [8] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Reasoning about relative keys for XML. Draft manuscript, Sept. 2000. [9] P. Buneman, W. Fan, and S. Weinstein. Path constraints on semistructured and structured data. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 129{138, Seattle, Washington, June 1998.

 the nite implication problem for Lu ;  the nite implication and satis ability problems for unary keys, unary inclusion constraints and their negations.

Finally, we consider the nite implication problem for keys in Lu and the nite implication problem for foreign keys in Lu , which are de ned in the same way as the corresponding problems in connection with L described in Section 3. Immediately from Theorem 3.4 we have that the nite implication problem for keys in Lu is also decidable in linear time. Using an argument similar to the one for the nite implication problem for foreign keys in L given in Section 3, one can show that the nite implication problem for foreign keys in Lu remains to be coNP-complete.

5 Conclusion We have investigated the nite implication problems associated with two classes of keys and foreign keys for XML, namely, a class Lu of unary keys and foreign keys and a class L of multi-attribute keys and foreign keys. We have shown that the nite implication problem for L is undecidable. Moreover, in contrast to the linear time decidability of nite implication of unary keys and foreign keys for relational databases [14], the nite implication problem for Lu is coNP-complete because of the interaction between XML DTDs and these constraints. We have also identi ed several PTIME decidable cases of the nite implication problems. In addition, we have generalized the results of [16] by showing that the nite satis ability problem for unary keys, unary inclusion constraints and their negations is NP-complete. More expressive key and foreign key speci cations have been proposed for XML in, e.g., XML Schema [29] and a recent proposal [6], by using regular path expressions or XPath [13]. We intend to investigate the nite implication and nite satis ability problems associated 13

[22] A. K. Lenstra, H. W. Lenstra, and L. Lovasz. Factoring polynomials with rational coecients. Math. Ann., 261:515{534, 1982. [23] J. Melton and A. Simon. Understanding the New SQL: A Complete Guide. Morgan Kaufman, 1993. [24] C. H. Papadimitriou. On the complexity of integer programming. Journal of ACM, 28(4):765{768, 1981. [25] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994. [26] Y. Papakonstantinou and V. Vianu. Type inference for views of semistructured data. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 35{46, Dallas, Texas, May 2000. [27] L. Popa. Object/Relational Query Optimization with Chase and Backchase. PhD thesis, University of Pennsylvania, 2000. [28] J. Robie, J. Lapp, and D. Schach. XML Query Language (XQL). Workshop on XML Query Languages, Dec. 1998. [29] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C Working Draft, Apr. 2000. http://www.w3.org/TR/xmlschema-1/. [30] J. D. Ullman. Database and Knowledge Base Systems. Computer Science Press, 1988. [31] P. Wadler. A Formal Semantics for Patterns in XSL. Technical report, Computing Sciences Research Center, Bell Labs, Lucent Technologies, 2000. http://www.cs.bell-labs.com/~wadler/ topics/xml#xsl-semantics.

[10] P. Buneman, W. Fan, and S. Weinstein. Interaction between path and type constraints. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 56{67, Philadelphia, Pennsylvania, May 1999. [11] P. Buneman, W. Fan, and S. Weinstein. Query optimization for semistructured data using path constraints in a deterministic data model. In Proceedings of International Workshop on Database Programming Languages (DBPL), 1999. [12] J. Clark. XSL Transformations (XSLT). W3C Recommendation, Nov. 1999. http://www.w3.org/TR/xslt. [13] J. Clark and S. DeRose. XML Path Language (XPath). W3C Working Draft, Nov. 1999. http://www.w3.org/TR/xpath. [14] S. S. Cosmadakis, P. C. Kanellakis, and M. Y. Vardi. Polynomial-time implication problems for unary inclusion dependencies. Journal of the ACM, 37(1):15{46, Jan. 1990. [15] B. Dreben and W. D. Goldfarb. The Decision Problem: Solvable Classes of Quanti cational Formulas. Addison-Wesley, 1979. [16] W. Fan and L. Libkin. Finite satis ability of keys and foreign keys for XML data. Technical Report TUCIS-TR-2000-002, Department of Computer and Information Sciences, Temple University, 2000. [17] W. Fan and J. Simeon. Integrity constraints for XML. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 23{34, Dallas, Texas, May 2000. [18] D. Florescu, L. Raschid, and P. Valduriez. A methodology for query reformulation in CIS using semantic knowledge. International Journal of Cooperative Information Systems (IJCIS), 5(4):431{ 468, 1996. [19] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NPCompleteness. W. H. Freeman and Company, 1979. [20] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages and Computation. Addision Wesley, 1979. [21] A. Layman, E. Jung, E. Maler, H. S. Thompson, J. Paoli, J. Tigue, N. H. Mikula, and S. De Rose. XML-Data. W3C Note, Jan. 1998. http://www.w3.org/TR/1998/NOTE-XML-data. 14

where the depth of T is the number of nodes in the longest path from the root to a leaf of T . We next show Claim 2. We use a recursive function ftree ( ) to check whether a regular expression over E [ fSg yields a nite tree and if so, we compute num( ;  ). The function returns a pair (b; m), where b is a boolean indicating whether yields a nite tree, and m is the value of num( ;  ) if b is true. For each  0 2 E , we are to use two global boolean variables: checked( 0 ) to indicate whether  0 has been checked by the function, and finite( 0 ) to indicate whether  0 yields a nite tree. These are used to ensure that each  0 in E is checked at most once by the function. Initially, checked( 0 ) and finite( 0) have value false for each  0 2 E . We also use a global variable num( 0 ) for each  0 2 E to record num(P ( 0 );  ). Initially, num( 0 ) = 0.

Appendix

Proof of Lemma 3.5. Let D = (E; A; P; R; r) be a DTD and  2 E . We show that it is decidable in

linear time whether there exists a nite tree T such that T j= D and jext( )j > 1 in T . To do this, we rst introduce some notions. Let M = f0; 1; xg be a linearordered set with 0 < 1 < x, and ; be two binary operators on M de ned by:

80 < v v = : 1 x 80 < v v = : 1 x 1

2

1

2

if v1 = v2 = 0 if v1 = 0; v2 = 1 or v1 = 1; v2 = 0 otherwise if v1 = 0 or v2 = 0 if v1 = v2 = 1 otherwise

Given a regular expression over E [fSg, we say that yields a nite tree if one of the conditions given below is satis ed; and if so, we de ne the maximum number of  elements in to be a value of M , denoted by num( ;  ), as follows:

function ftree ( ) 1. case of either  or S return (true; 0);

0 2 E

j j

(1) if checked( 0 ) and  0 6=  then return (finite( 0 ); num( 0 )) else if checked( ) then return (finite( ); num( )  1); (2) checked( 0 ) := true; (3) (b; m) := ftree (P ( 0 )); 0 (4) finite( ) := b; (5) num( 0 ) := m; (6) if  6=  0 then return (b; m) else return (b; m  1); j ( 1 ; 2 ) (1) (b1 ; m1 ) := ftree ( 1 ); (2) (b2 ; m2 ) := ftree ( 2 ); (3) return (b1 ^ b2 ; m1  m2 ); j ( 1 j 2 ) (1) (b1 ; m1 ) := ftree ( 1 ); (2) (b2 ; m2 ) := ftree ( 2 ); (3) if b1 = false then m1 := 0; (4) if b2 = false then m2 := 0; (5) return (b1 _ b2 ; max(m1 ; m2 )); j

 is either  or S; let num( ;  ) = 0;  is  0 2 E and P ( 0 ) yields a nite tree; if  0 6=  ,

let num( ;  ) = num(P ( 0 );  ); otherwise (i.e., if  0 =  ) let num( ;  ) = num(P ( );  )  1;  = ( 1 ; 2 ) and both 1 and 2 yield nite trees; let num( ;  ) = num( 1;  )  num( 2 ;  );  = ( 1 j 2 ) and either 1 or 2 yields a nite tree; for i = 1; 2, let f ( i ;  ) = num( i ;  ) if i yields a nite tree, and f ( i ;  ) = 0 otherwise; let num( ;  ) = max(f ( 1 ;  ); f ( 2 ;  ));  is 1 ; let num( ;  ) = num( 1 ;  ) x if 1 yields a nite tree, and let num( ;  ) = 0 otherwise.

To prove the lemma, it suces to show the following: Claim 1. There exists a nite data tree valid w.r.t. D with jext( )j > 1 i P (r) yields a nite tree and num(P (r);  ) = x. Claim 2. It is decidable in linear time whether P (r) yields a nite tree and num(P (r);  ) = x. We rst show Claim 1. If P (r) yields a nite tree and num(P (r);  ) = x, then we can easily construct a nite tree T such that T j= D and jext( )j > 1, following the de nition of the function num given above. Conversely, if there exists a nite tree T such that T j= D and jext( )j > 1, we can verify by induction on the depth of T that P (r) yields a nite tree and num(P (r);  ) = x,

1

(1) (b1 ; m1 ) := ftree ( 1 ); (2) if b1 then return (b1 ; m1 x) else return (true; 0); 2. return (false; 0); 15

It is easy to verify that P (r) yields a nite tree with num(P (r);  ) = x if and only if ftree (P (r)) returns (true; x). In addition, the function ftree is called at most O(m) time, where m is the size of the function P . This is because for each  0 in E , P ( 0 ) is checked at most once by the function.

Assume that there exists a nite tree T such that T j=  and T j= D. Since T j= 1 , by (the proof of) Lemma 4.8 of [16], there is a non-negative integer solution to (D; 1 ), the system of linear inequalities determined by D and 1 , such that the values of the variables jext( )j and jext(:l)j in (D; 1 ) given by the solution are the cardinalities jext( )j and jext(:l)j in T . Note that for all element type  and attribute l of  in D, jext( )j and jext(:l)j are variables in (D; 1 ). Thus for each :l 6!  , the solutions also assigns values to jext( )j and jext(:l)j. We claim that it is also a solution of (D; ). To see this, observe that it is always true that jext( )j  jext(:l)j in T since every  element in T contributes at most one distinct :l value. Moreover, since T j= 2 , there must be two distinct  elements d1 and d2 in T such that d1 :l = d2 :l. Thus jext( )j > jext(:l)j. Therefore, all inequalities in (D; ) are satis ed by the solution. Conversely, assume that (D; ) has a non-negative integer solution. Since it is also a solution of (D; 1 ), by (the proof of) Lemma 4.8, there is a nite tree T such that T j= D, T j= 1 and moreover, the cardinalities jext( )j and jext(:l)j in T are the values of the variables jext( )j and jext(:l)j in (D; 1 ) given by the solution. We claim that T j= . Indeed, for any :l 6!  in 2 , we have jext( )j > jext(:l)j in T . Thus there must be two distinct  elements d1 and d2 in T such that d1 :l = d2 :l. That is, T j= :l 6!  . Hence T j= D and T j= . This completes the proof of the lemma.

Proof of Lemma 3.6. Let D be a DTD,  a set of keys in L over D, and ' =  [X ] !  another key in L over D. We rst show that there is a nite data tree T such that T j= D, T j=  but T j= :' i  does not contain a superkey of ' and moreover, there is a nite tree T 0 such that T 0 j= D and jext( )j > 1 in T 0. Suppose that there is a nite tree T such that T j= D, T j=  and T j= :'. Then obviously, T is valid w.r.t. D, and moreover, there must be at least two  elements d1 ; d2 in T such that d1 [X ] = d2 [X ] but d1 6= d2 since T j= :'. Thus there must be jext( )j > 1 in T . In addition,  cannot contain  [Y ] !  with X  Y , since otherwise it would contradict T j= :' and T j= . Conversely, let T 0 be a nite tree such that T 0 j= D and jext( )j > 1 in T 0. Thus there are at least two  elements d1 ; d2 in T 0. Let d1 [X ] = d2 [X ] but assign di erent string values to all other attributes in T 0 . Let us refer to the tree constructed this way as T . It is easy to verify that T j= D and T j= :' by the de nition of T . Moreover, T j=  by the de nition of T and the fact that d1 [X ] = d2 [X ] does not violate any keys in , given the assumption that  does not contain a superkey of '. Thus the rst statement of the lemma holds. To show that this can be determined in linear time, observe that by Lemma 3.5, it can be decided in linear time in the size of D whether there is a nite tree T such that T j= D and jext( )j > 1 in T . In addition, it is decidable in linear time in the size of  [f'g whether  contains a superkey of ' (see e.g., [1] for discussion about a linear time algorithm for checking ( nite) implication of functional dependencies). Thus it is decidable in linear time in the sizes of D and [f'g whether there is a nite tree T such that T j= D and jext( )j > 1 in T . This completes the proof of the lemma.

Proof of Lemma 4.11. Let D be a DTD, 1 be a set of unary keys and unary inclusion constraints over D, 2 be a set of negations of unary inclusion constraints over D,  = 1 [ 2 , and (D; ) be the system of linear inequalities determined by D and . We show that (D; ) has a non-negative integer solution with U; V having a set representation i there is a nite tree T such that T j=  and T j= D. Assume that there exists a nite tree T such that T j=  and T j= D. Since T j= 1 , as in the proof of Lemma 4.7 we can de ne a non-negative integer solution to (D; 1 ), the system of linear inequalities determined by D and 1 . We extend the solution as follows: let uij be j ext(i :li ) \ ext(j :lj ) j, and vij be j ext(i :li ) ? ext(j :lj ) j. It is easy to verify that this is indeed a solution to (D; ) with U; V having a set representation. Conversely, assume that (D; ) has a non-negative integer solution with U; V having a set representation.

Proof of Lemma 4.9. Let D be a DTD,  a set of 1

unary keys and unary inclusion constraints over D, 2 a set of negations of unary keys over D,  = 1 [ 2 , and (D; ) be the system of linear inequalities determined by D and . We show that (D; ) has a non-negative integer solution i there is a nite tree T such that T j=  and T j= D. 16

Then there are nite sets A1 ; : : : ; An such that

uij = j Ai \ Aj j;

vij = j Ai ? Aj j :

Again as in the proof of Lemma 4.7, we create a nite tree T such that T j= 1 and T j= D. In addition, we de ne the val function in T such that ext(i ) = Ai for i 2 [1; n]. This is possible since jext(i :li )j = uii = uij + vij is in (D; ) for all i; j 2 [1; n]. Because vij > 0 is in (D; ) for all i :li 6 j :lj in 2 , we have j ext(i :li ) ? ext(j :lj ) j> 0. That is, T j= i :li 6 j :lj . Thus T j= 2 . This completes the proof of the lemma.

17