Finite Satisfiability of Keys and Foreign Keys for XML Data

Wenfei Fan

Leonid Libkin


Bell Laboratories

Temple University

Abstract

Key and foreign key constraints are useful for XML [5] data in semantic specification, query optimization and, more importantly, information preservation in data exchange. Several XML proposals, e.g., XML Schema [28] and XML Data [21], support key and foreign key specifications. These constraints, however, may not be finitely satisfiable in the XML context. More specifically, given a DTD $D$ and a finite set $\Sigma$ of key and foreign key constraints, there may not exist any finite XML document that both conforms to $D$ and satisfies $\Sigma$. This paper investigates finite satisfiability problems associated with key and foreign key constraints for XML. We show that DTDs impose a class of cardinality constraints on XML documents, and these cardinality constraints interact with keys and foreign keys. We investigate the interaction and, in particular, establish complexity results for the analysis of finite satisfiability of keys and foreign keys defined in two constraint languages. One language, $L_u$, defines unary keys and foreign keys in terms of XML attributes, and the other, $L$, defines multi-attribute keys and foreign keys. We show that the finite satisfiability problem for $L_u$ constraints is NP-complete, and the finite satisfiability problem for $L$ constraints is undecidable. We also establish PTIME decidability for several special cases of the finite satisfiability problems.

1 Introduction

Keys and foreign keys have proved useful for databases. They describe an essential part of the semantics of data and are fundamental to conceptual database design. In relational databases, they provide the means by which one tuple may refer to another tuple. They are also useful in update anomaly prevention, query optimization and index design [1, 29].

Key and foreign key constraints are also important for XML [5] data. They provide a robust mechanism of identifying and referring to portions of an XML document. The current reference mechanism of the XML standard relies on ID and IDREF attributes. It is rather weak since, as opposed to keys and foreign keys, ID and IDREF attributes are neither typed nor scoped. In other words, one has little control over what a reference points to. This mechanism fails to capture the semantics of native XML documents. Worse still, it is not capable of preserving the semantics of data in data exchange. Typically XML documents are generated from persistent databases and then sent over a network to an application. As mentioned earlier, keys and foreign keys convey a fundamental part of the semantics of data in databases that one cannot afford to lose. These constraints, however, cannot be precisely expressed by ID and IDREF attributes, and as a result, in data exchange databases may suffer loss of information when translated into an XML encoding. To overcome this limitation, several XML proposals, e.g., XML Schema [28] and XML Data [21], support key and foreign key specifications.

An important question in connection with keys and foreign keys for XML is the question of finite satisfiability: given a finite set $\Sigma$ of keys and foreign keys and a DTD (Document Type Definition [5]) $D$, does there exist a finite XML document that both satisfies $\Sigma$ and conforms to $D$? XML documents are typically modeled as ordered node-labeled trees, referred to as XML trees, in which vertices are labeled with element or attribute tags. There is a one-to-one mapping between XML documents and XML trees when the order of attributes is disregarded. The tree model has been adopted by query languages, specifications and APIs for XML, e.g., XSL [14, 30], XQL [27], XML Schema [28] and DOM [3]. In the tree model, the finite satisfiability problem is to determine whether there exists a finite XML tree that both satisfies $\Sigma$ and conforms to $D$.

In contrast to the analysis of finite satisfiability of keys and foreign keys for relational databases, the analysis for XML is not trivial: one may not be able to find any finite XML tree that both conforms to $D$ and satisfies $\Sigma$, although when considered separately, $D$ has some finite XML tree $T_1$ conforming to it and $\Sigma$ can be satisfied by a finite tree $T_2$. This is because keys and foreign keys interact with DTDs. More specifically, DTD grammars impose a class of constraints, referred to as cardinality constraints, and there is interaction between cardinality constraints and key and foreign key constraints. Moreover, the interaction is not simple. To answer the question of finite satisfiability, it is important to understand the interaction. One wants to know whether a set of natural constraints and a perfectly natural DTD, when put together, could specify any finite XML documents. The problem, however, has received little attention. To the best of our knowledge, no previous work has studied the interaction and the finite satisfiability problem for XML.

This paper investigates finite satisfiability of key and foreign key constraints for XML data. We explore cardinality constraints imposed by XML DTDs and study their interaction with key and foreign key constraints. We also establish complexity results for determining finite satisfiability of key and foreign key constraints in the XML setting.

Examples. To illustrate the interaction between XML DTDs and key and foreign key constraints, let us consider a few concrete examples. Below is a DTD, $D_1$, which specifies a collection of teachers:

    <!ELEMENT teachers (teacher+)>
    <!ELEMENT teacher  (teach, research)>
    <!ELEMENT teach    (subject, subject)>

It specifies that, among other things, a teacher teaches two subjects. Here we omit the descriptions of elements whose type is string (e.g., PCDATA in XML). Assume each teacher has an attribute name and each subject has an attribute taught_by. Recall that XML attributes have string values. Here we consider single-valued attributes. That is, if an attribute $l$ is defined for an element type $\tau$ in a DTD, then in a document valid w.r.t. the DTD, each element of type $\tau$ must have a unique $l$ attribute that has a string value. Consider a set of simple unary key and foreign key constraints, $\Sigma_1$:

$$teacher.name \rightarrow teacher, \qquad subject.taught\_by \rightarrow subject, \qquad subject.taught\_by \subseteq teacher.name.$$

That is, name is a key of teacher elements, taught_by is a key of subject elements and it is also a foreign key referencing name of teacher elements. More specifically, referring to an XML tree $T$, the first constraint asserts that two distinct teacher nodes in $T$ cannot have the same name attribute value. In other words, for any nodes $x$ and $y$ in $T$ labeled teacher, if the (string) values of their name attributes are equal, then $x$ and $y$ must be the same node. It should be mentioned that two notions of equality are used in the definition of keys: we assume string value equality when comparing name attribute values of $x$ and $y$, and vertex equality when it comes to comparing $x$ and $y$. This provides a finer sharing and identifying semantics than that found in relational databases. The second constraint states that no two distinct subject nodes in $T$ can have the same taught_by attribute value. The third constraint, a foreign key, asserts that for any node $x$ in $T$ labeled subject, there is a node $y$ in $T$ labeled teacher such that the value of the taught_by attribute of $x$ equals the value of the name attribute of $y$. In addition, observe that name is a key of teacher. Thus by the foreign key, the value of the taught_by attribute of a subject node uniquely identifies a teacher node.

Obviously, there exists a finite XML tree conforming to $D_1$, as shown in Fig. 1. However, there is no finite XML tree that both conforms to $D_1$ and satisfies $\Sigma_1$. To see this, let us first define some notations. Given an XML tree $T$ and an element type $\tau$, we use $ext(\tau)$ to denote the set of all the nodes labeled $\tau$ in $T$. Similarly, given an attribute $l$ of $\tau$, we use $ext(\tau.l)$ to denote the set of $l$ attribute values of all $\tau$ elements. Let us denote the cardinality of a set $S$ by $|S|$. Then immediately from $\Sigma_1$ follows a set of dependencies:

$$|ext(teacher.name)| = |ext(teacher)|, \qquad |ext(subject.taught\_by)| = |ext(subject)|, \qquad |ext(subject.taught\_by)| \le |ext(teacher.name)|.$$

Therefore,

$$|ext(subject)| \le |ext(teacher)|. \tag{1}$$

On the other hand, the DTD $D_1$ requires that each teacher must teach two subjects. Since no sharing of nodes is allowed in XML trees, from $D_1$ follows:

$$2\,|ext(teacher)| = |ext(subject)|. \tag{2}$$

Thus $|ext(teacher)| < |ext(subject)|$. Obviously, (1) and (2) contradict each other and therefore, there exists no finite XML tree that both satisfies $\Sigma_1$ and conforms to $D_1$. In particular, the XML tree in Fig. 1 violates the key constraint $subject.taught\_by \rightarrow subject$.

Figure 1: An XML tree conforming to $D_1$ (nodes: teachers, teacher with @name "Joe", teach, research, and two subject nodes, each with @taught_by "Joe")

This example demonstrates that a DTD may impose on XML trees dependencies among cardinalities of certain sets of objects. Let us refer to the dependencies as cardinality constraints. These cardinality constraints interact with key and foreign key constraints. More specifically, key and foreign key constraints also enforce a class of cardinality constraints, and there is interaction between these two classes of cardinality constraints. This makes the analysis of finite satisfiability of keys and foreign keys for XML far more intriguing than that for relational databases. Because of the interaction, simple key and foreign key constraints (e.g., $\Sigma_1$) may not be satisfiable by finite XML trees conforming to certain DTDs (e.g., $D_1$).

As another example, let us consider a DTD $D_2$ for states and capitals:


    <!ELEMENT db    (state+, capital+)>
    <!ELEMENT state (capital, city*)>

It specifies that a database consists of a non-empty sequence of states and a non-empty sequence of capitals, and a state has a capital and other cities. Here again we omit the descriptions of elements whose type is string. Assume that state has an attribute name and capital has an attribute in_state. It is natural to consider $\Sigma_2$, a set of simple unary key and foreign key constraints:

$$state.name \rightarrow state, \qquad capital.in\_state \rightarrow capital, \qquad capital.in\_state \subseteq state.name.$$

These assert that name is a key of state elements, in_state is a key of capital elements and is a foreign key referring to name of state elements. From these follows immediately:

$$|ext(capital)| \le |ext(state)|. \tag{3}$$

Taking the DTD $D_2$ alone, there exists a finite XML tree conforming to $D_2$, as shown in Fig. 2. However, observe that $D_2$ requires that each state node must have a capital as a child and in addition, the db node must have a non-empty sequence of capital children that are not children of any state node. Therefore, the following cardinality constraint is enforced by $D_2$:

$$|ext(state)| < |ext(capital)|. \tag{4}$$

Observe that (4) contradicts (3) given above. Again, the DTD $D_2$ interacts with the set $\Sigma_2$ of key and foreign key constraints and as a result, $\Sigma_2$ cannot be satisfied by finite XML trees conforming to $D_2$. In particular, the XML tree given in Fig. 2 does not satisfy the key constraint $capital.in\_state \rightarrow capital$, because two distinct capital nodes have the same in_state attribute value.

Figure 2: An XML tree conforming to $D_2$ (nodes: db, state with @name "PA", city "Philadelphia", and two capital nodes "Harrisburg", each with @in_state "PA")

Some XML DTDs do not have finite XML trees conforming to them even in the absence of key and foreign key constraints. For instance, there exist no finite trees conforming to the DTD $D_3$ given below:

    <!ELEMENT db  (foo)>
    <!ELEMENT foo (foo)>

In other words, even given an empty set $\Sigma_3$ of constraints, $D_3$ still does not have any finite XML tree conforming to it.

Contributions. We consider two classes of key and foreign key constraints defined in terms of XML attributes, and investigate their interaction with DTDs. One class, denoted by $L_u$, consists of unary keys and foreign keys, and the other, denoted by $L$, is composed of multi-attribute keys and foreign keys. The main contributions of the paper are the following:

1. We show that $L_u$ constraints can be captured by cardinality constraints. Given a finite set $\Sigma$ of $L_u$ constraints, we define a finite set $C_\Sigma$ of cardinality constraints such that over finite XML trees, $\Sigma$ is satisfiable if and only if $C_\Sigma$ is satisfiable.

2. We explore the connection between DTDs and cardinality constraints. Given a DTD $D$, we define a finite set $\Psi_D$ of cardinality constraints that characterizes $D$: there exists a finite XML tree conforming to $D$ if and only if $\Psi_D$ is finitely satisfiable.

3. We investigate the finite satisfiability problem for $L_u$ constraints: given any DTD $D$ and any finite set $\Sigma$ of $L_u$ constraints over $D$, whether there exists a finite XML tree that both conforms to $D$ and satisfies $\Sigma$. We show the problem is NP-complete. More specifically, we show it is NP-hard by reduction from the linear integer programming problem (see, e.g., [20, 25]). To show it is in NP, we use $\Psi(D, \Sigma)$, which is defined in terms of $\Psi_D$ and $C_\Sigma$, to characterize $D$ and $\Sigma$: there exists a finite XML tree conforming to $D$ and satisfying $\Sigma$ if and only if $\Psi(D, \Sigma)$ admits an integer solution. We show that there is an NP algorithm for guessing and checking an integer solution of $\Psi(D, \Sigma)$.

4. We establish the undecidability of the finite satisfiability problem for $L$ constraints by reduction from an undecidable finite implication problem associated with keys and foreign keys for relational databases.

5. We show that several special cases of the finite satisfiability problems are decidable in PTIME. In particular, taking advantage of the characterization $\Psi(D, \Sigma)$, we show that the problem for $L_u$ can be decided in PTIME when the DTD $D$ is fixed.

These complexity results are rather surprising. Recall that keys and foreign keys for relational databases are always satisfiable as long as there is no typing error in the given constraints. Indeed, given any relational schema $R$ and any set $\Sigma$ of keys and foreign keys over $R$, one can always construct a database instance $I$ of $R$ such that $I$ satisfies $\Sigma$ and moreover, $I$ contains one tuple for each relation in $R$. In contrast, the results of this paper show that the analysis of finite satisfiability of keys and foreign keys for XML is not trivial.

Related work. The interaction between cardinality constraints and database schemas has been studied for object-oriented databases [12, 13]. This interaction is quite different from what we explore in this paper because XML DTDs are defined in terms of extended context free grammars and they lead to a class of cardinality constraints more complex than those studied for object-oriented databases.

Key and foreign key specifications were proposed in XML Schema [28] and XML Data [21]. However, their associated finite satisfiability problem has not been addressed. A key constraint language for XML was recently proposed in [6] and its associated finite satisfiability and finite implication problems were investigated in [7]. However, these papers do not consider foreign keys and the interaction between DTDs and constraints. A class of simple key and foreign key constraints was also studied for a graph model of XML [19]. Their associated implication and finite implication problems were investigated there. However, in contrast to the XML tree model, the graph model allows sharing of nodes and thus trivializes the analysis of finite satisfiability of key and foreign key constraints. It should be mentioned that in general, there is no one-to-one mapping between XML graphs and XML documents. In other words, an XML graph may have multiple XML documents corresponding to it.

Keys, foreign keys and more general inclusion and functional dependencies have been well studied for relational databases (see, e.g., [1, 29] for detailed surveys). In particular, [16] investigates a class of unary inclusion and functional dependencies, which is related to unary key and foreign key constraints considered in this paper, and their associated (finite) implication problem. As mentioned earlier, in the relational setting, the finite satisfiability problem for keys and foreign keys is trivial.

A variety of path constraints have been studied for semistructured and XML data [2, 8, 10]. The interaction between path constraints and database schemas was investigated in [9]. Path constraints specify inclusions among certain sets of objects in edge-labeled graphs. They are not capable of expressing key constraints.

Organization. The remainder of the paper is organized as follows. Section 2 presents a formalism of XML DTDs, describes the tree model for XML, and defines two classes of key and foreign key constraints, namely, $L_u$ and $L$. Section 3 provides a characterization of $L_u$ constraints and a characterization of DTDs in terms of cardinality constraints. Section 4 shows that the finite satisfiability problem for $L_u$ constraints is NP-complete by using the characterizations of DTDs and $L_u$ constraints. Section 5 establishes the undecidability of the finite satisfiability problem for $L$ constraints. Finally, Section 6 summarizes our results and identifies directions for further work. Proofs of the results in the paper are given in an appendix.

2 DTDs, keys and foreign keys

In this section, we first present a formalism of XML DTDs [5] and the tree model of XML. We then define two classes of key and foreign key constraints for XML and describe their associated finite satisfiability problems.

2.1 DTDs and XML trees

In the literature (e.g., [4, 11, 24]), DTDs are often formalized as Extended Context Free Grammars (ECFGs), with elements as non-terminals, basic XML types as terminals and element type definitions (in terms of regular expressions) as productions. This formalism fails to capture XML attributes. We extend this formalism of DTDs by incorporating simple attributes.

Definition 2.1: A DTD (Document Type Definition) is defined to be $D = (E, A, P, R, r)$, where:

• $E$ is a finite set of element types, ranged over by $\tau$;
• $A$ is a finite set of attributes, ranged over by $l$; $E$ and $A$ are disjoint, i.e., $E \cap A = \emptyset$;
• $P$ is a mapping from $E$ to element type definitions: $P(\tau) = \alpha$, where $\alpha$ is a regular expression defined as follows:

$$\alpha ::= S \;\mid\; \tau' \;\mid\; \epsilon \;\mid\; \alpha | \alpha \;\mid\; \alpha, \alpha \;\mid\; \alpha^*$$

  where $S$ denotes string type, $\tau' \in E$, $\epsilon$ denotes the empty sequence, and "$|$", "$,$" and "$*$" stand for union, concatenation, and the Kleene closure, respectively;
• $R$ is a mapping from $E$ to $\mathcal{P}(A)$, the power-set of $A$; if $l \in R(\tau)$ then we say $l$ is defined for $\tau$;
• $r \in E$ and is called the element type of the root.

Without loss of generality, we assume that $r$ does not occur in $P(\tau)$ for any $\tau \in E$. To simplify the discussion, we assume that all attributes are single-valued. That is, if $l \in R(\tau)$ then every element of type $\tau$ has a unique $l$ attribute and in addition, the value of the $l$ attribute is a string. We do not consider other XML attributes, e.g., set-valued attributes, in this paper.

As an example, let us consider the teachers DTD $D_1$ given in Section 1. In our formalism, $D_1$ can be represented as $(E_1, A_1, P_1, R_1, r_1)$, where

$$\begin{aligned}
E_1 &= \{teachers,\ teacher,\ teach,\ research,\ subject\} \\
A_1 &= \{name,\ taught\_by\} \\
P_1(teachers) &= teacher,\ teacher^* \\
P_1(teacher) &= teach,\ research \\
P_1(teach) &= subject,\ subject \\
P_1(subject) &= P_1(research) = S \\
R_1(teacher) &= \{name\} \\
R_1(subject) &= \{taught\_by\} \\
R_1(teachers) &= R_1(teach) = R_1(research) = \emptyset \\
r_1 &= teachers
\end{aligned}$$

Similarly, we represent the DTD $D_3$ given in Section 1 as $(E_3, A_3, P_3, R_3, r_3)$, where

$$\begin{aligned}
E_3 &= \{db,\ foo\} \\
A_3 &= \emptyset \\
P_3(db) &= foo \\
P_3(foo) &= foo \\
R_3(db) &= R_3(foo) = \emptyset \\
r_3 &= db
\end{aligned}$$

An XML document is typically modeled as a node-labeled ordered tree. Given a DTD, we define the notion of its valid documents, i.e., documents that conform to it, as follows.

Definition 2.2: Let $D = (E, A, P, R, r)$ be a DTD. A data tree $T$ valid w.r.t. $D$ is defined to be $T = (V, lab, ele, att, val, root)$, where

• $V$ is a set of vertices (nodes);
• $lab$ is a mapping from $V$ to $E \cup A \cup \{S\}$;
• $ele$ is a partial function from $V$ to sequences of $V$ vertices such that for any $v \in V$, $ele(v)$ is defined iff $lab(v) = \tau$ and $\tau \in E$, and moreover, if $P(\tau)$ is $\alpha$ and $ele(v) = [v_1, \ldots, v_n]$, then $lab(v_1) \ldots lab(v_n)$ must be in the regular language defined by $\alpha$;
• $att$ is a partial function from $V \times A$ to $V$ such that for any $v \in V$ and $l \in A$, $att(v, l)$ is defined iff $lab(v) = \tau$, $\tau \in E$ and $l \in R(\tau)$;
• $val$ is a partial function from $V$ to string values such that for any node $v \in V$, $val(v)$ is a string iff $lab(v) = S$ or $lab(v) \in A$;
• $root$ is a distinguished vertex in $V$ and is called the root of $T$; in addition, $lab(root) = r$;
• $T$ has a tree structure. More specifically, for any $v \in V$, there exists a unique path from $root$ to $v$. A path from $root$ is a sequence of labels, defined as follows:

  - $\epsilon$ is a path from $root$ to itself;
  - if $\rho$ is a path from $root$ to $v$ with $lab(v) = \tau$ in $E$, then for any $v'$ in $ele(v)$ with $lab(v') = \tau'$ and $\tau' \in E \cup \{S\}$, $\rho.\tau'$ is a path from $root$ to $v'$; and moreover, for any $l \in R(\tau)$, $\rho.l$ is a path from $root$ to $att(v, l)$.

We assume that there is a unique node in $T$ labeled $r$. A data tree $T$ is said to be finite if $V$ is finite.

Intuitively, $V$ is the set of vertices of the tree $T$. The mapping $lab$ labels every node of $V$ with a symbol from $E \cup A \cup \{S\}$. In general, vertices labeled with element types of $E$ are internal nodes of $T$, and those labeled with $S$ or attributes of $A$ are leaves of $T$. More specifically, if a node $x$ is labeled $\tau$ in $E$, then functions $ele$ and $att$ define the children of $x$, which are partitioned into sub-elements and attributes according to $P(\tau)$ and $R(\tau)$ in DTD $D$. Sub-elements of node $x$ are ordered and their labels observe the regular expression $P(\tau)$. In contrast, attributes of node $x$ are unordered and are identified by their labels (names). The function $val$ assigns string values to attributes of nodes and also to vertices labeled $S$. The last condition in the definition above assures that $T$ indeed has a tree structure. As a result, sharing of nodes is not allowed in $T$. For example, Figures 1 and 2 depict data trees valid w.r.t. $D_1$ and $D_2$ given in Section 1, respectively.

There is a one-to-one mapping between data trees and XML documents valid w.r.t. a DTD $D$ when the order of attributes is ignored. Given a data tree $T$ valid w.r.t. $D$, it is easy to derive an XML document valid w.r.t. $D$. The document is unique regardless of the order of attributes. Conversely, given an XML document valid w.r.t. $D$, it is straightforward to depict a data tree representing the document. Again, the tree is unique when the order of attributes is disregarded.

Given a DTD $D$ and a data tree $T$ valid w.r.t. $D$, we define the following notations. For any $\tau \in E \cup \{S\}$, we use $ext(\tau)$ to denote the set of all the nodes in $T$ labeled $\tau$. For any node $x$ in $T$ labeled $\tau$ in $E$ and for any $l \in R(\tau)$, we use $x.l$ to indicate $val(att(x, l))$, i.e., the value of the attribute $l$ of $x$. We define $ext(\tau.l)$ to be $\{x.l \mid x \in ext(\tau)\}$, which is a set of strings. For any finite set $S$, we use $|S|$ to denote the cardinality of $S$.

2.2 A constraint language $L_u$

As observed earlier, key and foreign key constraints are important for XML. Next, we define a class of unary key and foreign key constraints for XML, denoted by $L_u$.

Let $D = (E, A, P, R, r)$ be a DTD. A constraint $\varphi$ of $L_u$ over $D$ has one of the following forms:

• key constraint: $\tau.l \rightarrow \tau$, where $\tau \in E$ and $l \in R(\tau)$. It indicates that $l$ is a key of elements of $\tau$.
• foreign key constraint: $\tau_1.l_1 \subseteq \tau_2.l_2$, where $\tau_1, \tau_2$ are element types in $E$, $l_1 \in R(\tau_1)$, $l_2 \in R(\tau_2)$, and moreover, $\tau_2.l_2 \rightarrow \tau_2$. This constraint indicates that $l_1$ is a foreign key of $\tau_1$ elements referencing $l_2$ attributes of $\tau_2$ elements.

For example, the constraints of $\Sigma_1$ and $\Sigma_2$ given in Section 1 are $L_u$ constraints over DTDs $D_1$ and $D_2$, respectively.

Let $T$ be a data tree valid w.r.t. $D$. Then $T$ satisfies an $L_u$ constraint $\varphi$ over $D$, denoted by $T \models \varphi$, iff

• if $\varphi$ is a key constraint $\tau.l \rightarrow \tau$, then in $T$,
$$\forall x\, y \in ext(\tau)\; (x.l = y.l \rightarrow x = y);$$
  i.e., no two distinct vertices in $T$ labeled $\tau$ can have the same $l$ attribute value;
• if $\varphi$ is a foreign key constraint $\tau_1.l_1 \subseteq \tau_2.l_2$, then in $T$,
$$\forall x \in ext(\tau_1)\; \exists y \in ext(\tau_2)\; (x.l_1 = y.l_2);$$
  i.e., the set of $l_1$ attribute values of all $\tau_1$ vertices in $T$ is a subset of the set of $l_2$ attribute values of all $\tau_2$ vertices in $T$. In addition, $T \models \tau_2.l_2 \rightarrow \tau_2$.

Observe that two notions of equality are used to define keys: string value equality is assumed in $x.l = y.l$, and $x = y$ is true if and only if $x$ and $y$ are the same node. This is different from the meaning of keys in relational databases. A foreign key constraint asserts two things: $ext(\tau_1.l_1) \subseteq ext(\tau_2.l_2)$ and $\tau_2.l_2 \rightarrow \tau_2$. Thus the value of the $l_1$ attribute of a $\tau_1$ element (node) can uniquely identify a $\tau_2$ element (node).

Let $\Sigma$ be a set of $L_u$ constraints. We use $T \models \Sigma$ to indicate that $T$ satisfies $\Sigma$, i.e., for any $\varphi \in \Sigma$, $T \models \varphi$.

The finite satisfiability problem for $L_u$ is the problem to determine, given any DTD $D$ and any set $\Sigma$ of $L_u$ constraints over $D$, whether there exists a finite data tree $T$ such that $T \models \Sigma$ and $T$ is valid w.r.t. $D$.

Finally, it is easy to verify the following, which asserts, among other things, that any set of $L_u$ constraints over a DTD is finite.

Proposition 2.1: Let $D = (E, A, P, R, r)$ be a DTD. The cardinality of the set of all $L_u$ constraints over $D$ is at most $O(m^2 n^2)$, where $m = |E|$ and $n = |A|$.
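To make the two notions of equality in the $L_u$ semantics concrete, the following is a minimal Python sketch, not the authors' formalism. It assumes a simplified tree representation (the hypothetical class `Node` and the helper names `ext`, `satisfies_key` and `satisfies_foreign_key` are illustrative only), compares attribute values as strings and nodes by identity, and checks the three constraints of the teachers example on a tree shaped like the one the text describes for Fig. 1.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: nodes are compared by identity (vertex equality)
class Node:
    label: str
    attrs: dict = field(default_factory=dict)     # attribute name -> string value
    children: list = field(default_factory=list)  # ordered sub-elements


def ext(root, tau):
    """Return all nodes labeled tau in the tree rooted at root."""
    found = [root] if root.label == tau else []
    for child in root.children:
        found.extend(ext(child, tau))
    return found


def satisfies_key(root, tau, l):
    """tau.l -> tau: no two distinct tau nodes share the same l value."""
    values = [x.attrs[l] for x in ext(root, tau)]
    return len(values) == len(set(values))


def satisfies_foreign_key(root, tau1, l1, tau2, l2):
    """tau1.l1 is included in tau2.l2, and tau2.l2 is a key of tau2."""
    targets = {y.attrs[l2] for y in ext(root, tau2)}
    included = all(x.attrs[l1] in targets for x in ext(root, tau1))
    return included and satisfies_key(root, tau2, l2)


# Assumed shape of the example tree: one teacher "Joe" teaching two subjects.
subjects = [Node("subject", {"taught_by": "Joe"}),
            Node("subject", {"taught_by": "Joe"})]
teacher = Node("teacher", {"name": "Joe"},
               [Node("teach", children=subjects), Node("research")])
fig1 = Node("teachers", children=[teacher])

print(satisfies_key(fig1, "teacher", "name"))       # True
print(satisfies_key(fig1, "subject", "taught_by"))  # False
print(satisfies_foreign_key(fig1, "subject", "taught_by", "teacher", "name"))  # True
```

The key on subject fails because two distinct subject nodes carry the same taught_by string, which is the violation the text attributes to the tree of Fig. 1.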

2.3 A constraint language $L$

We next extend $L_u$ to define a class of multi-attribute keys and foreign keys for XML, denoted by $L$. Multi-attribute keys and foreign keys are important in, among other things, data conversion for capturing keys and foreign keys commonly found in relational databases.

Let $D = (E, A, P, R, r)$ be a DTD. A constraint $\varphi$ of $L$ over $D$ has one of the following forms:

• key: $\tau[X] \rightarrow \tau$, where $\tau \in E$ and $X$ is a set of attributes in $R(\tau)$. It indicates that the set $X$ of attributes is a key of elements of $\tau$.
• foreign key: $\tau_1[X] \subseteq \tau_2[Y]$, where $\tau_1, \tau_2$ are element types in $E$, $X, Y$ are non-empty sequences of attributes in $R(\tau_1)$, $R(\tau_2)$, respectively, and moreover, $X$ and $Y$ have the same length. In addition, $\tau_2[Y] \rightarrow \tau_2$. This constraint indicates that $X$ is a foreign key of $\tau_1$ elements referencing key $Y$ of $\tau_2$ elements.

As an example, let us consider another DTD $D_4 = (E_4, A_4, P_4, R_4, r_4)$, where

$$\begin{aligned}
E_4 &= \{school,\ student,\ course,\ enroll,\ name,\ gpa,\ subject,\ grade\} \\
A_4 &= \{student\_id,\ course\_no,\ dept\} \\
P_4(school) &= course^*,\ student^*,\ enroll^* \\
P_4(course) &= subject \\
P_4(student) &= name,\ gpa \\
P_4(enroll) &= grade \\
P_4(name) &= P_4(gpa) = P_4(subject) = P_4(grade) = S \\
R_4(course) &= \{dept,\ course\_no\} \\
R_4(student) &= \{student\_id\} \\
R_4(enroll) &= \{student\_id,\ dept,\ course\_no\} \\
R_4(school) &= R_4(name) = R_4(gpa) = R_4(subject) = R_4(grade) = \emptyset \\
r_4 &= school
\end{aligned}$$

Typical $L$ constraints over $D_4$ include:

$$\begin{aligned}
& student[student\_id] \rightarrow student, \\
& course[dept,\ course\_no] \rightarrow course, \\
& enroll[student\_id,\ dept,\ course\_no] \rightarrow enroll, \\
& enroll[student\_id] \subseteq student[student\_id], \\
& enroll[dept,\ course\_no] \subseteq course[dept,\ course\_no].
\end{aligned}$$

The first three constraints are keys in $L$, while the last two are foreign keys in $L$. Observe that $L_u$ is a sub-language of $L$, in which the sequences of attributes in a constraint are of length 1.

Let $T$ be a data tree valid w.r.t. $D$. For each $\tau$ element $x$ in $T$ and a sequence $X = [l_1, \ldots, l_n]$ of attributes in $R(\tau)$, we use $x[X]$ to denote the sequence of $X$ attribute values of $x$, i.e., $x[X] = [x.l_1, \ldots, x.l_n]$. A tree $T$ satisfies an $L$ constraint $\varphi$ over a DTD $D$, denoted by $T \models \varphi$, iff

• if $\varphi$ is a key $\tau[X] \rightarrow \tau$, then in $T$,
$$\forall x\, y \in ext(\tau)\; \Big(\bigwedge_{l \in X} (x.l = y.l) \rightarrow x = y\Big).$$
  That is, two distinct $\tau$ nodes in $T$ cannot have the same $X$ attribute values;
• if $\varphi$ is a foreign key $\tau_1[X] \subseteq \tau_2[Y]$, then in $T$,
$$\forall x \in ext(\tau_1)\; \exists y \in ext(\tau_2)\; (x[X] = y[Y]).$$
  That is, the sequence of $X$ attribute values of every $\tau_1$ node in $T$ must match the sequence of $Y$ attribute values of some $\tau_2$ node in $T$. In addition, $T \models \tau_2[Y] \rightarrow \tau_2$, i.e., $Y$ is a key of $\tau_2$.

The finite satisfiability problem for $L$ is defined in the same way as the finite satisfiability problem for $L_u$.

The central technical problems investigated in the paper are the finite satisfiability problems for $L_u$ and $L$. We investigate these problems in Sections 4 and 5, respectively.

3 Cardinality constraints, DTDs and $L_u$ constraints

In this section, we develop a class of cardinality constraints and provide characterizations of $L_u$ constraints and DTDs in terms of these cardinality constraints. More specifically, given a DTD $D$ and a set $\Sigma$ of $L_u$ constraints over $D$, we define a finite set $C_\Sigma$ of cardinality constraints such that there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$ if and only if there exists a finite data tree valid w.r.t. $D$ and satisfying $C_\Sigma$. Similarly, given a DTD $D$, we define a finite set $\Psi_D$ of cardinality constraints such that there exists a finite data tree valid w.r.t. $D$ if and only if $\Psi_D$ is finitely satisfiable. These characterizations are not only interesting in their own right, they are also important in the analysis of finite satisfiability and finite implication of $L_u$ constraints.

3.1 A characterization of $L_u$ constraints

Consider a DTD $D = (E, A, P, R, r)$ and a set $\Sigma$ of $L_u$ constraints over $D$. For any finite data tree $T$ valid w.r.t. $D$, we derive from $\Sigma$ a class of cardinality constraints on $T$, denoted by $C_\Sigma$, as follows. For any $\varphi \in \Sigma$,

• if $\varphi$ is a key constraint $\tau.l \rightarrow \tau$, then $|ext(\tau.l)| = |ext(\tau)|$ is in $C_\Sigma$;
• if $\varphi$ is a foreign key constraint $\tau_1.l_1 \subseteq \tau_2.l_2$, then $|ext(\tau_1.l_1)| \le |ext(\tau_2.l_2)|$ and $|ext(\tau_2.l_2)| = |ext(\tau_2)|$ are in $C_\Sigma$.

Moreover, for any $\tau \in E$ and $l \in R(\tau)$ that occur in $\Sigma$, $|ext(\tau.l)| \le |ext(\tau)|$ and $0 \le |ext(\tau.l)|$ are in $C_\Sigma$. We refer to $C_\Sigma$ as the set of cardinality constraints determined by $\Sigma$. We use $T \models C_\Sigma$ to denote that $T$ satisfies all constraints of $C_\Sigma$.

For example, recall the set $\Sigma_1$ of $L_u$ constraints over DTD $D_1$ given in Section 1. The set of cardinality constraints determined by $\Sigma_1$, denoted by $C_{\Sigma_1}$, consists of:

$$\begin{aligned}
|ext(teacher.name)| &= |ext(teacher)| \\
|ext(subject.taught\_by)| &= |ext(subject)| \\
|ext(subject.taught\_by)| &\le |ext(teacher.name)| \\
0 &\le |ext(teacher.name)| \\
0 &\le |ext(subject.taught\_by)|
\end{aligned}$$

The theorem below shows that $C_\Sigma$ characterizes finite satisfiability of $\Sigma$. It allows us to consider $C_\Sigma$ instead of $\Sigma$ when studying finite satisfiability of $\Sigma$.

Theorem 3.1: Let $D$ be a DTD, $\Sigma$ be a set of $L_u$ constraints over $D$, and $C_\Sigma$ be the set of cardinality constraints determined by $\Sigma$. Then the following two statements are equivalent.

1. There exists a finite data tree $T_1$ such that $T_1$ is valid w.r.t. $D$ and $T_1 \models \Sigma$.
2. There exists a finite data tree $T_2$ such that $T_2$ is valid w.r.t. $D$ and $T_2 \models C_\Sigma$.

In addition, any finite data tree valid w.r.t. $D$ and satisfying $\Sigma$ also satisfies $C_\Sigma$.

Indeed, given a finite data tree $T_1$ valid w.r.t. $D$ and satisfying $\Sigma$, one can verify $T_1 \models C_\Sigma$. Conversely, given a finite tree $T_2$ valid w.r.t. $D$ and satisfying $C_\Sigma$, one can construct a finite tree $T_1$ by modifying function $val$ in $T_2$ such that $T_1$ is valid w.r.t. $D$ and $T_1 \models \Sigma$.

It is straightforward to verify the following.

Proposition 3.2: Given any set $\Sigma$ of $L_u$ constraints over a DTD $D$, the set $C_\Sigma$ of cardinality constraints determined by $\Sigma$ can be computed in linear time.

As an immediate result of Proposition 3.2, the size of $C_\Sigma$ is linear in the size of $\Sigma$.
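The construction of the cardinality constraints determined by a set of unary constraints can be sketched as a single pass over the constraints. The Python below is only an illustration under assumed encodings: keys are written as tuples `("key", tau, l)`, foreign keys as `("fk", tau1, l1, tau2, l2)`, and each cardinality constraint as a `(lhs, op, rhs)` triple of strings; the function names `card` and `cardinality_constraints` are hypothetical.

```python
def card(tau, l=None):
    """Symbolic name for |ext(tau)| or |ext(tau.l)|."""
    return f"|ext({tau}.{l})|" if l else f"|ext({tau})|"


def cardinality_constraints(sigma):
    """Return the set of cardinality constraints determined by sigma."""
    constraints = set()
    occurring = set()  # (tau, l) pairs that occur in sigma
    for phi in sigma:
        if phi[0] == "key":
            _, tau, l = phi
            constraints.add((card(tau, l), "=", card(tau)))
            occurring.add((tau, l))
        elif phi[0] == "fk":
            _, tau1, l1, tau2, l2 = phi
            constraints.add((card(tau1, l1), "<=", card(tau2, l2)))
            constraints.add((card(tau2, l2), "=", card(tau2)))
            occurring.update({(tau1, l1), (tau2, l2)})
        else:
            raise ValueError(f"unknown constraint form: {phi!r}")
    # Bounds that hold for every attribute occurring in sigma.
    for tau, l in occurring:
        constraints.add((card(tau, l), "<=", card(tau)))
        constraints.add(("0", "<=", card(tau, l)))
    return constraints


# The teachers example: two keys and one foreign key.
sigma1 = [
    ("key", "teacher", "name"),
    ("key", "subject", "taught_by"),
    ("fk", "subject", "taught_by", "teacher", "name"),
]
for constraint in sorted(cardinality_constraints(sigma1)):
    print(*constraint)
```

Each input constraint contributes a constant number of triples, which matches the linear bound stated in the text. For the teachers example the sketch also emits the two bounds of the form $|ext(\tau.l)| \le |ext(\tau)|$, which are implied by the corresponding equalities.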

3.2 Eliminating Kleene closures

!

j

j

j

Element type de nitions in a DTD are de ned in terms of regular expressions. To simplify the discussion when it comes to the study of cardinality constraints enforced by DTDs, we want to rewrite element type de nitions to eliminate occurrences of the Kleene star. Moreover, the rewriting should not a ect nite satis ability of Lu constraints. More speci cally, given a DTD D, we de ne another DTD D? such that D? does not contain the Kleene star and in addition, for any set  of Lu constraints over D, there exists a nite data tree valid w.r.t. D and satisfying  if and only if there exists a nite data tree valid w.r.t. D? and satisfying . We next present a rewriting that preserves nite satis ability of Lu constraints. Let D = (E; A; P; R; r) be a DTD. For each element type  E , the de nition of  , P ( ), is a regular expression . We de ne another function P ? for element type de nitions: P ?( ) = f ( ), where f ( ) is given as follows:


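$$
f(\alpha) =
\begin{cases}
\alpha & \text{if } \alpha \text{ is } S,\ \tau' \text{ or } \epsilon \\
f(\alpha_1), f(\alpha_2) & \text{if } \alpha = \alpha_1, \alpha_2 \\
f(\alpha_1) \mid f(\alpha_2) & \text{if } \alpha = \alpha_1 \mid \alpha_2 \\
\tau_{\alpha_1} & \text{if } \alpha = \alpha_1^{*}
\end{cases}
$$

Here $\tau_{\alpha_1}$ is a new element type and $P^\star(\tau_{\alpha_1}) = \epsilon \mid \alpha_1, \tau_{\alpha_1}$.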

That is, we eliminate occurrences of the Kleene star by introducing new element types and by rewriting Kleene closures in terms of direct recursion. We refer to the function $f$ as the transition function. Formally, we define the $*$-free DTD of $D$, denoted by $D^\star$, to be $D^\star = (E^\star, A, P^\star, R^\star, r)$, where

$E^\star = E \cup E_a$, and $E_a$ is the set of new element types $\tau_\alpha$ introduced in the process of eliminating occurrences of the Kleene star, as described above;
$P^\star$ is defined in terms of $P$ and the transition function $f$ as described above;
$R^\star(\tau) = R(\tau)$ for all $\tau \in E$ and $R^\star(\tau) = \emptyset$ for all $\tau \in E_a$.
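The rewriting can be sketched as a recursive walk over the regular expressions. The block below is an illustration under stated assumptions rather than the paper's algorithm. Regular expressions are encoded as nested tuples (`("seq", a, b)`, `("alt", a, b)`, `("star", a)`), with plain strings for element types, `"S"` and `"EPSILON"`. Fresh element types are named `t1`, `t2`, ... (hypothetical names). The body of a star is itself passed through `f`, which is an assumption added so that nested stars also disappear.

```python
EPS = "EPSILON"


def eliminate_star(P):
    """Map each element type to a star-free definition, adding new element types."""
    P_star = {}
    used = set(P)
    counter = 0

    def fresh():
        # Pick a new element type name that does not clash with existing ones.
        nonlocal counter
        while True:
            counter += 1
            name = f"t{counter}"
            if name not in used:
                used.add(name)
                return name

    def f(alpha):
        if isinstance(alpha, str):          # S, an element type, or epsilon
            return alpha
        op = alpha[0]
        if op in ("seq", "alt"):
            return (op, f(alpha[1]), f(alpha[2]))
        if op == "star":
            tau = fresh()
            # New type: tau -> epsilon | body, tau   (direct recursion)
            P_star[tau] = ("alt", EPS, ("seq", f(alpha[1]), tau))
            return tau
        raise ValueError(f"unknown operator: {op!r}")

    for tau, alpha in P.items():
        P_star[tau] = f(alpha)
    return P_star


def contains_star(alpha):
    return not isinstance(alpha, str) and (
        alpha[0] == "star" or any(contains_star(a) for a in alpha[1:])
    )


# Assumed shape of the teachers DTD: teachers -> teacher, teacher*
P1 = {
    "teachers": ("seq", "teacher", ("star", "teacher")),
    "teacher": ("seq", "teach", "research"),
    "teach": ("seq", "subject", "subject"),
    "subject": "S",
    "research": "S",
}
P1_star = eliminate_star(P1)
```

Each star is visited once and contributes one new element type of constant size, which is consistent with the linear bound stated for the rewritten DTD.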

Observe that the element type $r$ of the root and the set $A$ of attributes remain unchanged in $D^\star$. For example, the $*$-free DTD of $D_1$ given in Section 2 is $D_1^\star = (E_1^\star, A_1, P_1^\star, R_1^\star, r_1)$, where

$E_1^\star = \{teachers, teacher, teach, research, subject, t\}$
$A_1 = \{name, taught\_by\}$
$P_1^\star(teachers) = teacher, t$
$P_1^\star(t) = \epsilon \mid teacher, t$
$P_1^\star(teacher) = teach, research$
$P_1^\star(teach) = subject, subject$
$P_1^\star(subject) = P_1^\star(research) = S$
$R_1^\star(teacher) = \{name\}$
$R_1^\star(subject) = \{taught\_by\}$
$R_1^\star(teachers) = R_1^\star(teach) = R_1^\star(research) = R_1^\star(t) = \emptyset$
$r_1 = teachers$

Here $t$ is the only new element type introduced. Observe that the $*$-free DTD $D_3^\star$ of $D_3$ in Section 1 is $D_3$ itself, because $D_3$ does not contain the Kleene star. Obviously, any set $\Sigma$ of $L_u$ constraints over $D$ is also a set of $L_u$ constraints over the $*$-free DTD $D^\star$ of $D$. The next theorem shows that $D^\star$ preserves finite

satisfiability of $\Sigma$. It allows us to consider $*$-free DTDs only, without loss of generality.

Theorem 3.3: Let $D$ be a DTD, $D^\star$ the $*$-free DTD of $D$, and $\Sigma$ a set of $L_u$ constraints over $D$. Then the following two statements are equivalent.
1. There exists a finite data tree $T_1$ such that $T_1$ is valid w.r.t. $D$ and $T_1 \models \Sigma$.
2. There exists a finite data tree $T_2$ such that $T_2$ is valid w.r.t. $D^\star$ and $T_2 \models \Sigma$.

Theorem 3.3 can be verified by induction on the structures of element type definitions in $D$. Suppose that there is a finite data tree $T_1$ valid w.r.t. $D$ and satisfying $\Sigma$. Then $T_1 \models C_\Sigma$ by Theorem 3.1, where $C_\Sigma$ is the set of cardinality constraints determined by $\Sigma$. We can construct a finite tree $T_2'$ by modifying $T_1$, following the definition of the transition function $f$, such that $T_2'$ is valid w.r.t. $D^\star$ and $T_2' \models C_\Sigma$. Again by Theorem 3.1, there is a finite data tree $T_2$ such that $T_2$ is valid w.r.t. $D^\star$ and $T_2 \models \Sigma$. Conversely, given a finite data tree $T_2$ valid w.r.t. $D^\star$ and satisfying $\Sigma$, we have $T_2 \models C_\Sigma$ by Theorem 3.1. In addition, by modifying $T_2$ one can construct a finite tree $T_1'$, directed by the definition of $f$, such that $T_1'$ is valid w.r.t. $D$ and $T_1' \models C_\Sigma$. Thus by Theorem 3.1, there is a finite data tree $T_1$ valid w.r.t. $D$ and satisfying $\Sigma$. See Appendix for a detailed proof. Finally, it is easy to verify the following.

Proposition 3.4: Given any DTD $D$, its $*$-free DTD $D^\star$ can be computed in linear time.

As an immediate result of Proposition 3.4, the size of $D^\star$ is linear in the size of $D$.

3.3 A characterization of DTDs

Next, we show that cardinality constraints can also be used to characterize DTDs. Without loss of generality by Theorem 3.3, we need to consider $*$-free DTDs only. Consider a DTD $D = (E, A, P, R, r)$. To define cardinality constraints to characterize $D$, we first describe unknowns to be used in these constraints. For each symbol $\tau \in E \cup \{S\}$, we assume a set of unknowns $X_\tau = \{x^i_\tau \mid i \in \mathbb{N}\}$, where $\mathbb{N}$ is the set of natural numbers. For each $\tau \in E$, consider the regular expression $\alpha = P(\tau)$. We define the set of regular expressions in $P(\tau)$, denoted by $S_\tau$, to be the set consisting of (1) $\alpha$, and (2) regular expressions in $\alpha_1$ or $\alpha_2$, if $\alpha = (\alpha_1, \alpha_2)$ or $\alpha = (\alpha_1 \mid \alpha_2)$. For the regular expressions in $S_\tau$, we also assume a set of unknowns $Y^\tau = \{y^\tau_{\beta,k} \mid \beta \in S_\tau,\ k \in \mathbb{N}\}$. Using these unknowns, we define a function $g$ from $S_\tau$ and $Y^\tau$ to sets of cardinality constraints as follows:

$$
g(\alpha, y^\tau_{\alpha,k}) =
\begin{cases}
\{ y^\tau_{\alpha,k} = x^i_{\tau'} \} & \text{if } \alpha \text{ is } \tau' \\
\{ y^\tau_{\alpha,k} = x^i_{S} \} & \text{if } \alpha \text{ is } S \\
\{ y^\tau_{\alpha,k} = y^\tau_{\epsilon,j} \} & \text{if } \alpha \text{ is } \epsilon \\
\{ y^\tau_{\alpha,k} = y^\tau_{\alpha_1,k_1},\ y^\tau_{\alpha,k} = y^\tau_{\alpha_2,k_2} \} \cup g(\alpha_1, y^\tau_{\alpha_1,k_1}) \cup g(\alpha_2, y^\tau_{\alpha_2,k_2}) & \text{if } \alpha = \alpha_1, \alpha_2 \\
\{ y^\tau_{\alpha,k} = y^\tau_{\alpha_1,k_1} + y^\tau_{\alpha_2,k_2} \} \cup g(\alpha_1, y^\tau_{\alpha_1,k_1}) \cup g(\alpha_2, y^\tau_{\alpha_2,k_2}) & \text{if } \alpha = \alpha_1 \mid \alpha_2
\end{cases}
$$

Here $x^i_{\tau'} \in X_{\tau'}$, $x^i_S \in X_S$, and $y^\tau_{\epsilon,j}, y^\tau_{\alpha_1,k_1}, y^\tau_{\alpha_2,k_2} \in Y^\tau$ are fresh variables. The set of cardinality constraints determined by $\tau$, denoted by $\Psi_\tau$, consists of the following:

$|ext(\tau)| = y^\tau_{\alpha,1}$, where $\alpha = P(\tau)$;
constraints in $g(\alpha, y^\tau_{\alpha,1})$;
$x^i_{\tau'} \ge 0$ for all $x^i_{\tau'}$ occurring in $g(\alpha, y^\tau_{\alpha,1})$;
$y^\tau_{\beta,k} \ge 0$ for all $y^\tau_{\beta,k}$ occurring in $g(\alpha, y^\tau_{\alpha,1})$.

Intuitively, constraints of $\Psi_\tau$ describe the grammar $P(\tau)$ in a quantitative way. Referring to a finite data tree $T$ valid w.r.t. $D$, let us consider the sub-elements $\vec{v} = [v_1, \dots, v_n]$ of a node in $T$ labeled $\tau$. The labels of the sub-elements, $lab(\vec{v}) = lab(v_1), \dots, lab(v_n)$, must observe $P(\tau)$. Moreover, a subsequence of $lab(\vec{v})$ may correspond to a certain occurrence of a regular expression in $S_\tau$. For example, if $P(\tau) = (\alpha_1, \alpha_2)$, then there is $m \in [1, n]$ such that $lab(v_1), \dots, lab(v_m)$ and $lab(v_{m+1}), \dots, lab(v_n)$ are in the regular languages defined by $\alpha_1$ and $\alpha_2$, respectively. We say that $[v_1, \dots, v_m]$ and $[v_{m+1}, \dots, v_n]$ correspond to $\alpha_1$ and $\alpha_2$, respectively.

In particular, if $\alpha = \tau'$ for some $\tau' \in E \cup \{S\}$, an occurrence of $\alpha$ may correspond to a node in $T$ labeled $\tau'$. The unknown $x^i_{\tau'}$ is intended to indicate the number of all $\tau'$ nodes in $T$ that correspond to a certain occurrence of $\tau'$ in $P(\tau)$. Similarly, $y^\tau_{\beta,k}$ keeps track of the number of sub-element sequences that correspond to the $k$th occurrence of $\beta$ in $P(\tau)$. Constraints of $\Psi_\tau$ specify dependencies among these numbers.

To illustrate this, let us consider the structure of $\alpha = P(\tau)$. Suppose that in $T$ there are $y^\tau_{\alpha,1}$ many nodes labeled $\tau$. Suppose $\alpha$ is some $\tau' \in E \cup \{S\}$; then each $\tau$ node has a distinct sub-element labeled $\tau'$. Thus there must be $y^\tau_{\alpha,1}$ many nodes labeled $\tau'$ that are sub-elements of $\tau$ nodes. This is exactly what the constraint $y^\tau_{\alpha,1} = x^i_{\tau'}$ says. Next, suppose $\alpha = (\alpha_1, \alpha_2)$. The sub-elements $\vec{v}$ of a $\tau$ node can be partitioned into $\vec{v} = \vec{v}_1 \vec{v}_2$ such that $\vec{v}_1$ and $\vec{v}_2$ correspond to $\alpha_1$ and $\alpha_2$, respectively. Assume that the numbers of sub-element sequences of $\tau$ nodes corresponding to $\alpha_1$ and $\alpha_2$ are $y^\tau_{\alpha_1,k_1}$ and $y^\tau_{\alpha_2,k_2}$, respectively. Then both $y^\tau_{\alpha_1,k_1}$ and $y^\tau_{\alpha_2,k_2}$ must equal $y^\tau_{\alpha,1}$. The constraints $y^\tau_{\alpha,1} = y^\tau_{\alpha_1,k_1}$ and $y^\tau_{\alpha,1} = y^\tau_{\alpha_2,k_2}$ describe this quantitative relation. Finally, suppose $\alpha = (\alpha_1 \mid \alpha_2)$. The sub-elements of a $\tau$ node correspond to either $\alpha_1$ or $\alpha_2$. Assume that the numbers of sub-elements of $\tau$ nodes corresponding to $\alpha_1$ and $\alpha_2$ are $y^\tau_{\alpha_1,k_1}$ and $y^\tau_{\alpha_2,k_2}$, respectively. Then the sum of $y^\tau_{\alpha_1,k_1}$ and $y^\tau_{\alpha_2,k_2}$ must equal $y^\tau_{\alpha,1}$. This is what the constraint $y^\tau_{\alpha,1} = y^\tau_{\alpha_1,k_1} + y^\tau_{\alpha_2,k_2}$ says.

The set of cardinality constraints determined by DTD $D$, denoted by $\Psi_D$, consists of the following:

$|ext(r)| = 1$;
constraints of $\Psi_\tau$ for each $\tau \in E$;
$|ext(\tau)| = \sum_{i \in [1, m_\tau]} x^i_\tau$ for each $\tau \in E \cup \{S\}$, where, without loss of generality, we assume $x^1_\tau, \dots, x^{m_\tau}_\tau$ are all the unknowns of $X_\tau$ occurring in some $\Psi_{\tau'}$.

Intuitively, referring to a finite data tree $T$, the constraint $|ext(r)| = 1$ asserts that $T$ has a unique root. For each $\tau \in E$, $\Psi_\tau$ assures that the grammar $P(\tau)$ is observed. Consider $\tau \in E \cup \{S\}$. Observe that each node of $T$ labeled $\tau$ corresponds to a certain occurrence of $\tau$ in $P(\tau')$ for some element type $\tau'$. Obviously, the set $ext(\tau)$ should include all $\tau$ vertices in $T$ no matter what occurrences of $\tau$ they correspond to. This is exactly what the constraint $|ext(\tau)| = \sum_{1 \le i \le m_\tau} x^i_\tau$ says, where $x^i_\tau$ indicates the number of all $\tau$ nodes in $T$ that correspond to a certain occurrence of $\tau$.

By treating $|ext(\tau)|$ in $\Psi_D$ as an unknown for each $\tau \in E \cup \{S\}$, we can view $\Psi_D$ as a system of linear equations and disequations. We say that $\Psi_D$ is finitely satisfiable if and only if $\Psi_D$ admits an integer solution. That is, there is an integer assignment to each unknown $|ext(\tau)|$, $x^i_{\tau'}$ and $y^\tau_{\beta,k}$ that satisfies all the equations and disequations in $\Psi_D$. For example, let us consider DTDs $D_1^\star$ and $D_3^\star$ given in Section 3. The cardinality constraints determined by these DTDs are given below:

$\Psi_{D_1}$:
$\Psi_{teachers}$: $|ext(teachers)| = x^1_{teacher}$, $|ext(teachers)| = x^1_{t}$
$\Psi_{teacher}$: $|ext(teacher)| = x^1_{teach}$, $|ext(teacher)| = x^1_{research}$
$\Psi_{t}$: $|ext(t)| = y^t_{\epsilon} + y^t_{(teacher,t)}$, $y^t_{(teacher,t)} = x^2_{teacher}$, $y^t_{(teacher,t)} = x^2_{t}$
$\Psi_{teach}$: $|ext(teach)| = x^1_{subject}$, $|ext(teach)| = x^2_{subject}$
$\Psi_{subject}$: $|ext(subject)| = x^1_{S}$
$\Psi_{research}$: $|ext(research)| = x^2_{S}$
moreover:
$|ext(teachers)| = 1$
$|ext(teacher)| = x^1_{teacher} + x^2_{teacher}$
$|ext(t)| = x^1_{t} + x^2_{t}$
$|ext(teach)| = x^1_{teach}$
$|ext(subject)| = x^1_{subject} + x^2_{subject}$
$|ext(research)| = x^1_{research}$
$|ext(S)| = x^1_{S} + x^2_{S}$
all unknowns $\ge 0$.

Observe, for example, that $x^1_{teacher}$ indicates the number of teacher nodes that are children of teachers nodes, and $x^2_{teacher}$ stands for the number of teacher children of nodes labeled $t$. The cardinality of $ext(teacher)$ equals the sum of $x^1_{teacher}$ and $x^2_{teacher}$. Obviously, there is a unique node labeled teachers, i.e., the root. Hence $x^1_{teacher} = 1$ since the root has a unique teacher child. Thus $|ext(teacher)| = 1 + x^2_{teacher}$.

$\Psi_{D_3}$:
$\Psi_{db}$: $|ext(db)| = x^1_{foo}$
$\Psi_{foo}$: $|ext(foo)| = x^2_{foo}$
moreover:
$|ext(db)| = 1$
$|ext(foo)| = x^1_{foo} + x^2_{foo}$
all unknowns $\ge 0$.

It is easy to check that $\Psi_{D_1}$ is finitely satisfiable, whereas $\Psi_{D_3}$ is not. Intuitively, the unknowns in the constraints provide a finer labeling that keeps track of how vertices correspond to certain occurrences of regular expressions in a DTD. We next show that $\Psi_D$ indeed characterizes $D$.
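The claim that the constraint system of $D_3$ has no integer solution can be checked by hand. The short derivation below uses only the four equations listed for $D_3$ in this section; it is a worked check, not part of the original proof.

```latex
\begin{align*}
|ext(foo)| &= x^1_{foo} + x^2_{foo}
  && \text{(all foo nodes, by occurrence)} \\
|ext(foo)| &= x^2_{foo}
  && \text{(each foo node has one foo child)} \\
\Rightarrow\quad x^1_{foo} &= 0 \\
x^1_{foo} &= |ext(db)| = 1
  && \text{(the unique root has one foo child)} \\
\Rightarrow\quad 0 &= 1 .
\end{align*}
```

The contradiction reflects the intuition that a foo node would need an infinite chain of foo descendants, so no finite tree can conform to $D_3$.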

Theorem 3.5: Let $D$ be a ($*$-free) DTD and $\Psi_D$ be the set of cardinality constraints determined by $D$. Then $\Psi_D$ is finitely satisfiable if and only if there is a finite data tree valid w.r.t. $D$.

Furthermore, given an integer solution of $\Psi_D$, one can construct a finite tree $T$ valid w.r.t. $D$ such that there is a certain quantitative connection between $ext(\tau)$ in $T$ and the corresponding unknown $|ext(\tau)|$ in $\Psi_D$.

Corollary 3.6: Let $D$ be a ($*$-free) DTD and $\Psi_D$ be the set of cardinality constraints determined by $D$. If $\Psi_D$ admits an integer solution such that $|ext(\tau)|$ is assigned a value $n_\tau$ for each $\tau \in E$, then there is a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and in addition, $|ext(\tau)|$ in $T$ is equal to $n_\tau$.

To prove Theorem 3.5 and Corollary 3.6, we use unknowns and the function $g$ to label vertices in a data tree. Given a finite data tree $T$ valid w.r.t. $D$, one can mark vertices (and vertex sequences) of $T$ with unknowns following the definition of $g$. Given the labeling, one can derive an integer assignment to unknowns in $\Psi_D$: each unknown is assigned the number of nodes (or node sequences) in $T$ marked with the unknown. It can be verified that the assignment is indeed a solution of $\Psi_D$. Conversely, given an integer solution of $\Psi_D$, one can construct a data tree $T$ such that for each $\tau \in E \cup \{S\}$, the cardinality of $ext(\tau)$ in $T$ is the value of the unknown $|ext(\tau)|$ in the solution. Moreover, sequences of vertices in $T$ can be marked with unknowns, directed by the definition of $g$. This labeling yields a definition of the function ele in $T$, which defines edges in $T$. It can be verified that $T$ is indeed finite and valid w.r.t. $D$, since $\Psi_D$ describes element type definitions in a quantitative way. See Appendix for a detailed proof. The proof of the next proposition is straightforward.

Proposition 3.7: Given any ($*$-free) DTD $D$, the set $\Psi_D$ of cardinality constraints determined by $D$ can be computed in linear time.

Immediately from Proposition 3.7 it follows that the size of $\Psi_D$ is linear in the size of $D$.

4 Finite satisfiability of Lu constraints

In this section, we show that the finite satisfiability problem for $L_u$ is NP-complete. That is, it is NP-complete to determine, given any DTD $D$ and any set $\Sigma$ of $L_u$ constraints over $D$, whether there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$. In addition, we show that several special cases of the finite satisfiability problem can be decided in PTIME.

4.1 Characterizing DTDs and Lu constraints

To establish our complexity results, we first define a system of linear equations and disequations to characterize DTD and $L_u$ constraints. More specifically, given a DTD $D$ and a finite set $\Sigma$ of $L_u$ constraints over $D$, we define a finite set $\Psi(D, \Sigma)$ of cardinality constraints. The set $\Psi(D, \Sigma)$ can be treated as a system of linear equations and disequations. We show that there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$ if and only if $\Psi(D, \Sigma)$ admits a (legal) integer solution. We proceed in two steps: we first consider $*$-free DTDs and then show that the results established for $*$-free DTDs can be extended to DTDs that are not necessarily $*$-free.

Let $D = (E, A, P, R, r)$ be a $*$-free DTD, and $\Sigma$ be a finite set of $L_u$ constraints over $D$. We define the set of cardinality constraints determined by $D$ and $\Sigma$, denoted by $\Psi(D, \Sigma)$, to be

$\Psi(D, \Sigma) = \Psi_D \cup C_\Sigma$,

where $C_\Sigma$ is the set of cardinality constraints determined by $\Sigma$, and $\Psi_D$ is the set of cardinality constraints determined by $D$, as defined in Sections 3.1 and 3.3, respectively. For each $\tau \in E \cup \{S\}$, we treat $|ext(\tau)|$ in $\Psi(D, \Sigma)$ as an unknown. Similarly, for each $\tau \in E$ and $l \in R(\tau)$, we treat $|ext(\tau.l)|$ also as an unknown. Thus $\Psi(D, \Sigma)$ can be viewed as a system of linear equations and disequations. We say that $\Psi(D, \Sigma)$ is finitely satisfiable if and only if $\Psi(D, \Sigma)$ admits a legal integer solution. By a legal solution we mean a solution that satisfies the following condition: if $|ext(\tau)| > 0$ then $|ext(\tau.l)| > 0$ for each $|ext(\tau)|$ and $|ext(\tau.l)|$ appearing in $\Psi(D, \Sigma)$. This is to ensure that every $\tau$ element has an $l$ attribute. Observe that $|ext(\tau.l)| \le |ext(\tau)|$ is in $C_\Sigma$.

For example, recall DTDs $D_1^\star$ and $D_3^\star$ given in Section 3 and constraint sets $\Sigma_1$ and $\Sigma_3$ in Section 1. It is easy to verify that neither $\Psi(D_1^\star, \Sigma_1)$ nor $\Psi(D_3^\star, \Sigma_3)$ is finitely satisfiable. This is consistent with the observations made in Section 1. We next show that $\Psi(D, \Sigma)$ characterizes $D$ and $\Sigma$.

Theorem 4.1: Let $D$ be a ($*$-free) DTD, $\Sigma$ a finite set of $L_u$ constraints over $D$, and $\Psi(D, \Sigma)$ the set of cardinality constraints determined by $D$ and $\Sigma$. Then $\Psi(D, \Sigma)$ is finitely satisfiable if and only if there is a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$.

To prove this theorem, we use Theorems 3.1 and 3.5. Given a finite data tree $T$ valid w.r.t. $D$ and satisfying $\Sigma$, we have $T \models C_\Sigma$ by Theorem 3.1, where $C_\Sigma$ is the set of cardinality constraints determined by $\Sigma$. By Theorem 3.5, from $T$ we can derive an integer assignment to the unknowns in $\Psi_D$ such that it satisfies $\Psi_D$. We extend the assignment as follows: for each $\tau \in E$ and $l \in R(\tau)$, let the unknown $|ext(\tau.l)|$ in $\Psi(D, \Sigma)$ have as its value the cardinality of $ext(\tau.l)$ in $T$. Clearly, $|ext(\tau.l)| > 0$ if $|ext(\tau)| > 0$ since every $\tau$ element in $T$ has an $l$ attribute. By $T \models C_\Sigma$, it can be verified that this extended assignment is a legal solution of $\Psi(D, \Sigma)$.

Conversely, given a legal integer solution of $\Psi(D, \Sigma)$, there exists a finite data tree $T_1$ valid w.r.t. $D$ by Theorem 3.5 and Corollary 3.6. In addition, for each $\tau \in E$, the cardinality of $ext(\tau)$ in $T_1$ equals the value of the unknown $|ext(\tau)|$ in the solution. Observe that the solution also satisfies the equations and disequations in $C_\Sigma$. That is, for all $\tau \in E$ and $l \in R(\tau)$ that occur in $\Sigma$, the values of the unknowns $|ext(\tau)|$ and $|ext(\tau.l)|$ in $\Psi(D, \Sigma)$ satisfy $C_\Sigma$. Thus we can construct a finite data tree $T_2$ by modifying the val function in $T_1$, directed by the solution, such that $T_2$ is valid w.r.t. $D$ and $T_2 \models C_\Sigma$. Observe that since the solution is legal we have $|ext(\tau.l)| > 0$ if $|ext(\tau)| > 0$. This makes it possible for every $\tau$ element in $T_2$ to have an $l$ attribute. Hence by Theorem 3.1, there exists a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$. A detailed proof can be found in Appendix.

Next, consider DTDs that are not necessarily $*$-free. The corollary below shows that for any DTD $D$ and finite set $\Sigma$ of $L_u$ constraints over $D$, $\Psi(D^\star, \Sigma)$ also characterizes whether there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$, where $D^\star$ is the $*$-free DTD of $D$ as defined in Section 3.

Corollary 4.2: Let $D$ be a DTD, $D^\star$ the $*$-free DTD of $D$, $\Sigma$ a finite set of $L_u$ constraints over $D$, and $\Psi(D^\star, \Sigma)$ the set of cardinality constraints determined by $D^\star$ and $\Sigma$. Then $\Psi(D^\star, \Sigma)$ is finitely satisfiable if and only if there is a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$.

The corollary follows from Theorems 3.3 and 4.1. To see this, first assume that there is a finite data tree $T$ valid w.r.t. DTD $D$ and satisfying $\Sigma$. By Theorem 3.3, there is a finite data tree $T'$ such that $T'$ is valid w.r.t. $D^\star$ and $T' \models \Sigma$. Thus by Theorem 4.1, $\Psi(D^\star, \Sigma)$ is finitely satisfiable. Conversely, suppose that $\Psi(D^\star, \Sigma)$ admits a legal integer solution. Then by Theorem 4.1, there is a finite data tree $T'$ such that $T'$ is valid w.r.t. $D^\star$ and $T' \models \Sigma$. Thus by Theorem 3.3, there is a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$.

4.2 Complexity analysis

Next, we establish one of the main complexity results of the paper:

Theorem 4.3: The finite satisfiability problem for $L_u$ constraints is NP-complete.

Before we prove the theorem, we first review the linear integer programming (LIP) problem: Given an $m \times n$ matrix $A$ of integers and a column vector $\vec{b}$ of $m$ integers, does there exist a column vector $\vec{x}$ of $n$ integers such that $A \vec{x} \le \vec{b}$? That is, for $i \in [1, m]$,

$\sum_{j \in [1, n]} a_{ij} x_j \le b_i$,

where $a_{ij}$ is the $j$th element of the $i$th row of $A$, $x_j$ is the $j$th entry of $\vec{x}$ and $b_i$ is the $i$th entry of $\vec{b}$. It is known that LIP is NP-complete in the strong sense [20]. In particular, when non-negative integer solutions are considered, [25] has shown that if the problem has a solution, then it has another solution in which for all $j \in [1, n]$, $x_j$ is no larger than $n (m a)^{2m+1}$, where $a$ is the largest absolute value of elements in $A$ and $\vec{b}$.

To show the finite satisfiability problem for $L_u$ is in NP, we encode an instance of the problem as an instance of LIP. By Corollary 4.2, given any DTD $D$ and any set $\Sigma$ of $L_u$ constraints over $D$, there is a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$ if and only if $\Psi(D^\star, \Sigma)$ has a legal integer solution. Note that $\Psi(D^\star, \Sigma)$ can be encoded as an instance of LIP. More specifically, observe that an equation $F_1 = F_2$ in $\Psi(D^\star, \Sigma)$ is equivalent to the disequations $F_1 \le F_2$ and $F_2 \le F_1$. Thus without loss of generality, we assume $\Psi(D^\star, \Sigma)$ consists of only disequations. We enumerate the unknowns in $\Psi(D^\star, \Sigma)$ as $x_1, \dots, x_n$ and the disequations as $F_1, \dots, F_m$. Thus we can define an $m \times n$ matrix $A$ of integers and a vector $\vec{b}$ of $m$ integers such that for $i \in [1, m]$, $\sum_{j \in [1, n]} a_{ij} x_j \le b_i$ is $F_i$. Hence $\Psi(D^\star, \Sigma)$ admits an integer solution if and only if there exists a vector $\vec{x}$ such that $A \vec{x} \le \vec{b}$. It should be noted that for all unknowns $x$ appearing in $\Psi(D^\star, \Sigma)$, $x \ge 0$ is in $\Psi(D^\star, \Sigma)$. In other words, we are interested in non-negative solutions to $A \vec{x} \le \vec{b}$ only.

We now consider the problem of determining whether $\Psi(D^\star, \Sigma)$ admits a legal integer solution. A legal solution requires that $|ext(\tau.l)| > 0$ if $|ext(\tau)| > 0$. To ensure this, consider a system $\Psi'$ of linear disequations that includes all the disequations in $\Psi(D^\star, \Sigma)$ and moreover, $|ext(\tau.l)| \ge 1$ and $|ext(\tau)| \ge 1$ for each $|ext(\tau.l)|$ and $|ext(\tau)|$ appearing in $\Psi(D^\star, \Sigma)$. Let $X$ be some set of pairs $(|ext(\tau.l)|, |ext(\tau)|)$ that appear in $\Psi(D^\star, \Sigma)$,

and $\Psi'_X$ be a system of linear disequations that is the same as $\Psi'$ except that it includes $|ext(\tau.l)| = 0$ and $|ext(\tau)| = 0$ for each $(|ext(\tau.l)|, |ext(\tau)|)$ in $X$, instead of $|ext(\tau.l)| \ge 1$ and $|ext(\tau)| \ge 1$. We can encode $\Psi'_X$ as an instance $A_X \vec{x} \le \vec{b}_X$ of LIP, as described above. It is easy to see that $\Psi(D^\star, \Sigma)$ admits a legal integer solution if and only if there is some $A_X \vec{x} \le \vec{b}_X$ that admits an integer solution. More specifically, an integer solution of any $A_X \vec{x} \le \vec{b}_X$ is a legal integer solution of $\Psi(D^\star, \Sigma)$, and conversely, a legal integer solution of $\Psi(D^\star, \Sigma)$ is an integer solution of some $A_X \vec{x} \le \vec{b}_X$.

By the result of [25] cited above, if $\Psi'_X$ has an integer solution, then it has one in which each unknown has a length in binary of at most $\lceil 1 + \log n + (2m+1) \log(m a) \rceil$ bits, where $m$, $n$ and $a$ are determined by $\Psi(D^\star, \Sigma)$ as described above. In other words, the bounds on solutions for all $\Psi'_X$'s are the same. Therefore, if the encoding $A \vec{x} \le \vec{b}$ of $\Psi(D^\star, \Sigma)$ has a legal integer solution, then it has one in which each unknown has a length in binary of at most $\lceil 1 + \log n + (2m+1) \log(m a) \rceil$ bits. Let $c$ be a number that in binary notation has $\lceil 1 + \log n + (2m+1) \log(m a) \rceil$ many 1's. Observe that the number $c$ can be computed in PTIME given $\Psi(D^\star, \Sigma)$. In light of these, we define a new system $\Psi''$ of linear disequations that is the same as $\Psi(D^\star, \Sigma)$ except it also includes:

$c \cdot |ext(\tau.l)| \ge |ext(\tau)|$

for every $|ext(\tau.l)|$ and $|ext(\tau)|$ appearing in $\Psi(D^\star, \Sigma)$. It is easy to verify that $\Psi(D^\star, \Sigma)$ has a legal integer solution if and only if $\Psi''$ has an integer solution. Indeed, if $\Psi(D^\star, \Sigma)$ has a legal integer solution then it has one bounded by $c$. Thus the solution satisfies $c \cdot |ext(\tau.l)| \ge |ext(\tau)|$. In other words, it is an integer solution of $\Psi''$. Conversely, if $\Psi''$ has an integer solution, then it is also an integer solution of $\Psi(D^\star, \Sigma)$ and moreover, if $|ext(\tau)| > 0$ then $|ext(\tau.l)| > 0$ by $c \cdot |ext(\tau.l)| \ge |ext(\tau)|$ in $\Psi''$. In other words, it is a legal solution of $\Psi(D^\star, \Sigma)$. Therefore, by Corollary 4.2, there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$ iff $\Psi''$ has an integer solution. By Propositions 3.4, 3.2 and 3.7, $\Psi(D^\star, \Sigma)$ can be computed in linear time and its size is linear in the sizes of $D$ and $\Sigma$. Thus $\Psi''$ can be computed in PTIME in the sizes of $D$ and $\Sigma$. It should be noted that from the discussion above it follows that if $\Psi''$ has an integer solution, then it has another solution in which all unknowns are no larger than $c$. Hence determining whether $\Psi''$ admits an integer solution is in NP. Therefore, the finite satisfiability problem for $L_u$ is in NP.

We show that the finite satisfiability problem is NP-hard by reduction from a variant of LIP, namely, $A \vec{x} = \vec{b}$, where for all $i \in [1, m]$, $j \in [1, n]$, the coefficients $a_{ij}$ are in $\{0, 1\}$, all $b_i$ elements are 1, and all $x_j$ components are binary, i.e., in $\{0, 1\}$. It is known that the variant is also NP-complete [20, 23]. Given an instance $A \vec{x} = \vec{b}$ of the variant of LIP, we define a DTD $D$ and a set $\Sigma$ of $L_u$ constraints over $D$ such that there is a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$ if and only if $A \vec{x} = \vec{b}$ admits a binary solution. For $i \in [1, m]$, we use $F_i$ to denote $\sum_{j \in [1, n]} a_{ij} x_j$. We define $D$ to be $(E, A, P, R, r)$, where

$E = \{r\} \cup \{F_i \mid i \in [1, m]\} \cup \{b_i \mid i \in [1, m]\} \cup \{VF_i \mid i \in [1, m]\} \cup \{X_{ij} \mid i \in [1, m], j \in [1, n]\} \cup \{Z_{ij} \mid i \in [1, m], j \in [1, n]\}$
$A = \{v\} \cup \{A_{ij} \mid i \in [1, m], j \in [1, n]\}$
$P(r) = F_1, \dots, F_m, b_1, \dots, b_m$
$P(F_i) = X_{ij_1}, \dots, X_{ij_l}$ for $i \in [1, m]$, where $X_{ij_1}, \dots, X_{ij_l}$ is a subsequence of $X_{i1}, \dots, X_{in}$ such that $X_{ij}$ is in $P(F_i)$ iff $a_{ij}$ in $A$ is 1
$P(X_{ij}) = Z_{ij} \mid \epsilon$ for $i \in [1, m]$ and $j \in [1, n]$
$P(Z_{ij}) = VF_i$ for $i \in [1, m]$ and $j \in [1, n]$
$P(VF_i) = P(b_i) = \epsilon$ for $i \in [1, m]$
$R(Z_{ij}) = \{A_{ij}\}$ for $i \in [1, m]$ and $j \in [1, n]$
$R(VF_i) = R(b_i) = \{v\}$ for $i \in [1, m]$
$R(r) = R(F_i) = R(X_{ij}) = \emptyset$

Intuitively, $X_{ij}$ indicates $x_j$ in $F_i$, and $Z_{ij}$ denotes the value of $X_{ij}$: $X_{ij}$ has value 1 if and only if $X_{ij}$ has a $Z_{ij}$ child. The attribute $A_{ij}$ of $Z_{ij}$ is used to ensure that all occurrences of $x_j$ have the same value. The element type $VF_i$ indicates the value of $F_i$, and its attribute $v$ is to ensure that the value of $F_i$ is 1. More specifically, these are captured by the set $\Sigma$ of $L_u$ constraints over $D$. To ensure that all occurrences of $x_j$ have the same value, the following are in $\Sigma$: for $j \in [1, n]$ and $i, l \in [1, m]$,

$Z_{ij}.A_{ij} \to Z_{ij}$
$Z_{ij}.A_{ij} \subseteq Z_{lj}.A_{lj}$

These assert that $X_{ij}$ has value 1 if and only if $X_{lj}$ equals 1. To ensure $F_i = b_i$, we include the following in $\Sigma$: for $i \in [1, m]$,

$VF_i.v \to VF_i$
$b_i.v \to b_i$
$VF_i.v \subseteq b_i.v$
$b_i.v \subseteq VF_i.v$

These assert that each $F_i$ node has a unique $VF_i$ descendant, and thus has value 1. A data tree valid w.r.t. $D$ has the form shown in Figure 3.
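For intuition about the source problem of this reduction, the variant of LIP used above (0/1 coefficients, every right-hand side equal to 1, binary unknowns) can be decided by exhaustive search on small instances. The sketch below is only an illustration of that problem statement. The function name is hypothetical, and the search takes exponential time, as one would expect for an NP-complete problem.

```python
from itertools import product


def binary_lip_solutions(a):
    """Yield every binary vector x with sum_j a[i][j] * x[j] == 1 for each row i.

    `a` is a list of rows, each a list of 0/1 coefficients of equal length.
    """
    n = len(a[0]) if a else 0
    for x in product((0, 1), repeat=n):
        if all(sum(aij * xj for aij, xj in zip(row, x)) == 1 for row in a):
            yield x


# x1 + x2 = 1 and x2 + x3 = 1
satisfiable = [[1, 1, 0],
               [0, 1, 1]]

# x1 + x2 = 1, x1 = 1 and x2 = 1 cannot hold together
unsatisfiable = [[1, 1],
                 [1, 0],
                 [0, 1]]
```

Each row corresponds to one $F_i$ in the reduction, and a binary solution corresponds to a choice of which $X_{ij}$ nodes receive a $Z_{ij}$ child.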

[Figure 3: A tree used in the proof of Theorem 4.3. Labels recoverable from the diagram: $r$; $F_1, \dots, F_i, \dots, F_m$; $b_1, \dots, b_m$; $X_{ij}$; $Z_{ij}$; $VF_i$; attributes @v and @$A_{ij}$.]

It is easy to verify that the encoding can be done in PTIME in $m$ and $n$. Moreover, $A \vec{x} = \vec{b}$ admits a binary solution if and only if there is a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$. Thus what is given above is indeed a PTIME reduction from the variant of LIP. This completes the proof of Theorem 4.3.

In relational databases it is common to consider primary keys. That is, for each relation one can specify at most one key, namely, the primary key of the relation. In the XML setting, the primary key restriction requires that for each element type $\tau \in E$, one can specify at most one key, i.e., there is at most one $l \in R(\tau)$ such that $\tau.l \to \tau$. Under the primary key restriction, the finite satisfiability problem for $L_u$ is to determine, given any DTD $D$ and finite set $\Sigma$ of $L_u$ constraints in which there is at most one key for each element type (either given as keys or implied by foreign keys), whether there is a finite data tree that satisfies $\Sigma$ and is valid w.r.t. $D$. One might think that the primary key restriction will simplify the analysis of finite satisfiability of $L_u$ constraints. Unfortunately, this is not true.

Corollary 4.4: Under the primary key restriction, the finite satisfiability problem for $L_u$ is NP-complete.

To show this, observe that the reduction from LIP given above defines at most one key for each element type. Therefore, it is also a reduction to the finite satisfiability problem under the primary key restriction. As a result, under the primary key restriction the problem remains NP-hard.

We next consider special cases of the finite satisfiability problem for $L_u$. We first show that given a fixed DTD $D$, it can be decided in PTIME whether an arbitrary set $\Sigma$ of $L_u$ constraints over $D$ is satisfiable by a finite tree valid w.r.t. $D$. Given a fixed DTD $D$, the number of unknowns in $C_\Sigma$ is bounded by the size of $D$ by Proposition 2.1, and the number of unknowns in $\Psi_{D^\star}$ is determined by $D$ and fixed. Thus the number of unknowns in $\Psi(D^\star, \Sigma)$ is bounded. Recall the system $\Psi''$ given above in the proof of NP-membership of the finite satisfiability problem for $L_u$. Observe that the number of unknowns in $\Psi''$ is the same as that in $\Psi(D^\star, \Sigma)$, and thus is also bounded given a fixed DTD. It is known that when the number of unknowns in a system of linear disequations is bounded, checking whether the system admits an integer solution can be done in PTIME [22, 26]. As shown above, $\Sigma$ can be satisfied by a finite data tree valid w.r.t. $D$ if and only if $\Psi''$ admits an integer solution. Moreover, $\Psi''$ can be computed in PTIME in the sizes of $D$ and $\Sigma$. Thus we have:

Corollary 4.5: Given a fixed DTD $D$, it can be determined in PTIME whether an arbitrary set of $L_u$ constraints over $D$ can be satisfied by any finite data tree valid w.r.t. $D$.

A natural question is: What is the complexity of checking whether a DTD has a finite data tree valid w.r.t. it? This is a special case of the finite satisfiability problem, namely, when the given set of $L_u$ constraints is empty. The theorem below answers the question:

Theorem 4.6: Given any DTD $D$, it can be decided in linear time whether there exists a finite data tree valid w.r.t. $D$.

To prove this theorem, we first define the following notion. Let $D = (E, A, P, R, r)$ be a DTD and $\alpha$ be a regular expression over $E \cup \{S\}$. We say that $\alpha$ yields a finite tree if one of the following is satisfied:

$\alpha$ is one of $\epsilon$, $S$ or $\alpha_1^{*}$;
$\alpha$ is some $\tau$ in $E$ and $P(\tau)$ yields a finite tree;
$\alpha = (\alpha_1, \alpha_2)$ and both $\alpha_1$ and $\alpha_2$ yield finite trees;
$\alpha = (\alpha_1 \mid \alpha_2)$ and either $\alpha_1$ or $\alpha_2$ yields a finite tree.

One can verify the following (see Appendix for a proof):
1. There exists a finite data tree valid w.r.t. $D$ iff $P(r)$ yields a finite tree.
2. Given an arbitrary DTD $D$, it can be decided in linear time whether $P(r)$ yields a finite tree.
From these Theorem 4.6 follows immediately.
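The notion of "yields a finite tree" is an inductive definition, so it can be computed as a least fixpoint over the element types. The sketch below does this with a simple repeat-until-stable loop. It is meant to illustrate the definition and is not the linear-time procedure referred to in the theorem. The tuple encoding of regular expressions (`"seq"`, `"alt"`, `"star"`), the reserved strings `"S"` and `"EPSILON"`, and the shapes assumed for the two example DTDs are illustrative assumptions.

```python
def has_finite_tree(P, root):
    """Decide whether P(root) yields a finite tree, by least-fixpoint iteration."""
    good = set()  # element types tau such that P(tau) yields a finite tree

    def yields(alpha):
        if isinstance(alpha, str):
            # epsilon and S are base cases; an element type needs P(tau) to yield.
            return alpha in ("S", "EPSILON") or alpha in good
        op = alpha[0]
        if op == "star":   # zero repetitions are always allowed
            return True
        if op == "seq":
            return yields(alpha[1]) and yields(alpha[2])
        if op == "alt":
            return yields(alpha[1]) or yields(alpha[2])
        raise ValueError(f"unknown operator: {op!r}")

    changed = True
    while changed:
        changed = False
        for tau, alpha in P.items():
            if tau not in good and yields(alpha):
                good.add(tau)
                changed = True
    return root in good


# Assumed shape of the recursive example: db -> foo, foo -> foo
P3 = {"db": "foo", "foo": "foo"}

# Assumed shape of the teachers example: teachers -> teacher, teacher*
P1 = {
    "teachers": ("seq", "teacher", ("star", "teacher")),
    "teacher": ("seq", "teach", "research"),
    "teach": ("seq", "subject", "subject"),
    "subject": "S",
    "research": "S",
}
```

The loop may rescan the definitions several times, so its worst case is quadratic; a worklist that revisits only the definitions mentioning a newly added element type is one way to approach the stated linear bound.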

The next corollary answers another natural question: What is the complexity of the finite satisfiability problem for key constraints only?

Corollary 4.7: The problem to determine, given any

a key constraint ', denoted by I = ', if

DTD D and any set  of key constraints in Lu over D, whether there exists a nite data tree valid w.r.t. D and satisfying  can be decided in linear time in the size of D. This follows from Theorem 4.6 and Proposition 4.8: Proposition 4.8: Let D be a DTD and  be a set of key constraints in Lu over D. Then  can be satis ed by a nite data tree valid w.r.t. D if and only if there exists a nite data tree valid w.r.t. D. To prove the proposition, observe that given any nite data tree T1 valid w.r.t. D, we can always construct another nite data tree T2 by modifying the val function in T1 such that ext( ) = ext(:l) for any constraint :l  in . Thus T2 = C . By Theorem 3.1, T2 = . In addition, it is easy to verify that T2 is valid w.r.t. D since T1 is valid w.r.t. D. j

!

j

j

8

1ik

(t1 :li = t2 :li )

^

l2Att(R)

(t1 :l = t2 :l));



8

j

t1 I t 2 I 0 ( 2

9

2

j

!

^

1jk

t1 :lj = t2 :lj0 );

where I and I 0 are the instances of R and R0 in I, respectively. Let  ' be a set of keys and foreign keys over R. We use  = ' to denote that  nitely satis es ', i.e., for any nite instance I of R, if I = , then I = '. In relational databases, the nite implication problem for keys and foreign keys is the problem of determining, given a relational schema R and a set  ' of keys and foreign keys over R, whether  = '. A special case of the problem is the nite implication problem for keys by keys and foreign keys , which is to determine whether  = ' where ' is a key and  is a set of keys and foreign keys over R. It was shown in [19] that the nite implication problem for keys and foreign keys in relational databases is undecidable. The theorem below improves the result. A proof of the theorem can be found in Appendix. Theorem 5.1: In relational databases, the nite implication problem for keys by keys and foreign keys is undecidable. One might also want to consider nite implication of foreign keys by keys and foreign keys. That is the problem to determine  = ' where ' is a foreign key and  is a set of keys and foreign keys over a schema R. However, it is easy to show that the problem has the same complexity as that of the nite implication problem for keys and foreign keys because a foreign key R[l1 ; :::; lk ] R0 [l10 ; :::; lk0 ] requires that the set consisting of l10 ; :::; lk0 is a key of R0 . [ f

g

j

j

!

j

[f

g

j

5 Finite satis ability of L constraints In this section, we establish the undecidability of the nite satis ability problem for the class L of multiattribute keys and foreign keys. To do this, we rst show that in relational databases, a nite implication problem associated with keys and foreign keys is undecidable, and then present a reduction from the nite implication problem to the nite satis ability problem for L constraints. We rst review keys, foreign keys and their associated nite implication problems in relational databases. Let R = (R1 ; : : : ; Rn ) be a relational schema. For each relation (schema) Ri in R, we use Att(Ri ) to denote the set of all attributes of Ri , and Inst(Ri ) to denote the set of instances of Ri . Observe that a database instance I of R has the form (I1 ; : : : ; In ), where Ii Inst(Ri ) for all i [1; n]. For an instance Ii Inst(Ri ), a tuple t Ii and an attribute l Att(Ri ), we use t:l to denote the l attribute value of t. Keys and foreign keys over R are de ned as follows:

j

j

2

2

!

where I is the instance of R in I; foreign key: R[l1; :::; lk ] R0 [l10 ; :::; lk0 ], where R, R0 are in R, [l1 ; :::; lk ] is a sequence of attributes in Att(R), and [l10 ; :::; lk0 ] is a sequence of attributes in Att(R0 ). In addition, R0 [l10 ; :::; lk0 ] R0 , i.e., the set consisting of l10 ; :::; lk0 is a key of R0 . An instance I of R satis es a foreign key constraint ', denoted by I = ', if I = R0 [l10 ; :::; lk0 ] R0 and moreover, j




Observe that the corollary no longer holds if $\Sigma$ consists only of foreign key constraints in $L_u$. This is because the presence of $\tau_1.l_1 \subseteq \tau_2.l_2$ requires that $l_2$ is a key of $\tau_2$, i.e., $\tau_2.l_2 \to \tau_2$. It is easy to show that this special case has the same complexity as the finite satisfiability problem for $L_u$.

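key: $R[l_1, \ldots, l_k] \to R$, where $R \in \mathbf{R}$, and for $i \in [1, k]$, $l_i \in Att(R)$. An instance $\mathbf{I}$ of $\mathbf{R}$ satisfies the key constraint $\varphi$, denoted by $\mathbf{I} \models \varphi$, if $\forall t_1, t_2 \in I \, \big( \bigwedge_{i \in [1,k]} (t_1.l_i = t_2.l_i) \to \bigwedge_{l \in Att(R)} (t_1.l = t_2.l) \big)$,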

From Theorem 5.1 it follows immediately that the complement of the finite implication problem for keys by keys and foreign keys is also undecidable. That is the problem of determining, given a relational schema $\mathbf{R}$, a set $\Sigma$ of keys and foreign keys over $\mathbf{R}$ and a key $\varphi$ over $\mathbf{R}$, whether there is a finite instance of $\mathbf{R}$ satisfying $\bigwedge \Sigma \wedge \neg\varphi$. Using this we show another main complexity result of the paper:

Theorem 5.2: The finite satisfiability problem for $L$ constraints is undecidable.

We prove this theorem by reduction from the complement of the finite implication problem for keys by keys and foreign keys. Let $\mathbf{R} = (R_1, \ldots, R_n)$ be a relational schema, $\Sigma$ be a set of keys and foreign keys over $\mathbf{R}$, and $\varphi = R[X] \to R$ be a key over $\mathbf{R}$. Let $Y = Att(R) \setminus X$. We encode $\mathbf{R}$, $\Sigma$ and $\varphi$ in terms of a DTD $D$ and a set $\Sigma'$ of $L$ constraints over $D$ as follows. Let $D = (E, A, P, R_A, r)$, where

$E = \{R_i \mid i \in [1, n]\} \cup \{t_i \mid i \in [1, n]\} \cup \{r, D_Y, E_X\}$,
$A = \bigcup_{i \in [1,n]} Att(R_i)$,

$d_1[X] = d_2[X]$ by $D_Y[X] \subseteq E_X[X]$ and the fact $|ext(E_X)| = 1$, and $d_1[Y] \neq d_2[Y]$ by $D_Y[Y] \to D_Y$. These nodes will serve as a witness for $\neg\varphi$. Given these, we show that $\bigwedge \Sigma \wedge \neg\varphi$ can be satisfied




by a finite instance of $\mathbf{R}$ if and only if $\Sigma'$ can be satisfied by a finite data tree valid w.r.t. $D$.

Assume that there is a finite instance $\mathbf{I}$ of $\mathbf{R}$ satisfying $\bigwedge \Sigma \wedge \neg\varphi$. We construct a finite data tree $T$ from $\mathbf{I}$ as follows. Let $T$ have a root node $r$ and an $R_i$ node for each $R_i$ in $\mathbf{R}$. For any $R_i \in \mathbf{R}$ and each tuple $p$ in the instance of $R_i$ in $\mathbf{I}$, we create a distinct $t_i$ node $x$ such that $p.l = x.l$ for all $l \in Att(R_i)$. Since $\mathbf{I} \models \neg\varphi$, there are two tuples $p$ and $p'$ in the instance of $R$ in $\mathbf{I}$ such that $p[X] = p'[X]$ and $p[Y] \neq p'[Y]$. We create two distinct $D_Y$ nodes $d_1$ and $d_2$ such that $d_1.l = p.l$ and $d_2.l = p'.l$ for all $l \in Att(R)$. In addition, we create a single $E_X$ node $e$ such that $e.l = p.l$ for all $l \in X$. We define the edge relation of $T$ such that $T$ has the form shown in Figure 4. It is easy to verify that $T$ is a finite tree valid w.r.t. $D$. By $\mathbf{I} \models \Sigma$ it is easy to verify that $T \models \Sigma_\Sigma$. By the definition of $T$, it is also easy to see that $T \models \Sigma_\varphi$. In particular, since $Att(R) = X \cup Y$, we have $T \models t_\varphi[XY] \to t_\varphi$, where $t_\varphi$ is the symbol in $P(R) = t_\varphi^*$. Therefore, $T \models \Sigma'$.

Conversely, suppose that there is a finite data tree $T$ valid w.r.t. $D$ and satisfying $\Sigma'$. We define a finite instance $\mathbf{I}$ of schema $\mathbf{R}$ as follows. For each $t_i$ node $x$, let $(l_1 = x.l_1, \ldots, l_m = x.l_m)$ be a tuple in the instance of $R_i$ in $\mathbf{I}$, where $l_1, \ldots, l_m$ is an enumeration of $Att(R_i)$. Obviously $\mathbf{I}$ is a finite instance of $\mathbf{R}$. By $T \models \Sigma_\Sigma$, it is easy to verify that $\mathbf{I} \models \Sigma$. Moreover, by $T \models \Sigma_\varphi$ and the definition of $\mathbf{I}$, we have $\mathbf{I} \models \neg\varphi$ since there must be two tuples $d_1$ and $d_2$ in the instance of $R$ in $\mathbf{I}$ such that $d_1[X] = d_2[X]$ but $d_1[Y] \neq d_2[Y]$. Thus the encoding is indeed a reduction from the complement


that have all the attributes in $X \cup Y$, and a single $E_X$ node that has all attributes in $X$. If $T \models \Sigma_\varphi$, then

$P(r) = R_1, \ldots, R_n, D_Y, D_Y, E_X$
$P(R_i) = t_i^*$ for $i \in [1, n]$
$P(t_i) = \epsilon$ for $i \in [1, n]$
$P(D_Y) = P(E_X) = \epsilon$
$R_A(t_i) = Att(R_i)$ for $i \in [1, n]$
$R_A(D_Y) = X \cup Y$
$R_A(E_X) = X$
$R_A(r) = R_A(R_i) = \emptyset$ for $i \in [1, n]$

In particular, we denote $P(R) = t_\varphi^*$ for the relation $R$ in $\varphi$. Note that $R = R_s$ and $t_\varphi = t_s$ for some $s \in [1, n]$. We encode $\Sigma$ and $\varphi$ with $\Sigma' = \Sigma_\Sigma \cup \Sigma_\varphi$, where $\Sigma_\Sigma$ is defined as follows: for every key $R_i[Z] \to R_i$ in $\Sigma$, $t_i[Z] \to t_i$ is in $\Sigma_\Sigma$; for every foreign key $R_i[Z] \subseteq R_j[Z']$ in $\Sigma$, $\Sigma_\Sigma$ includes $t_i[Z] \subseteq t_j[Z']$. The set $\Sigma_\varphi$ consists of the following:

$D_Y[Y] \to D_Y$; $E_X[X] \to E_X$; $D_Y[X] \subseteq E_X[X]$; $D_Y[X, Y] \subseteq t_\varphi[X, Y]$; $t_\varphi[XY] \to t_\varphi$;

where $[X, Y]$ stands for the concatenation of list $X$ and list $Y$, and $t_\varphi$ is the grammar symbol in $P(R) = t_\varphi^*$. Observe that $Att(R) = X \cup Y$ and thus we can require that $XY$ is a key of $t_\varphi$. As depicted in Figure 4, in any finite data tree valid w.r.t. $D$, there are two distinct $D_Y$ nodes $d_1$ and $d_2$

Figure 4: A tree used in the proof of Theorem 5.2


This work is the first step towards understanding the interaction between DTDs and integrity constraints. A number of questions remain open. First, we have only considered key and foreign key constraints defined in terms of XML attributes. More expressive key and foreign key constraints, such as those defined in terms of XPath [15] expressions (or regular path expressions), have been proposed for XML (e.g., XML Schema [28]). We expect to extend some of the complexity results established in this paper to these more powerful constraint languages. Second, the finite implication problem for key and foreign key constraints needs to be studied. This problem has been investigated for XML data modeled as graphs [19], but has not been addressed for the tree model of XML that is typically used to represent XML data. We defer to another paper [18] for a full treatment of finite implication of keys and foreign keys. Third, other constraints commonly found in databases, such as inverse constraints, have not been studied for the tree model of XML.

of the finite implication problem for keys by keys and foreign keys. From this follows Theorem 5.2.

Observe that given a fixed DTD $D$, there are only finitely many $L$ constraints over $D$. Thus when DTD $D$ is fixed, finite satisfiability of arbitrary sets of $L$ constraints over $D$ is decidable [17]. A more interesting special case of the finite satisfiability problem for $L$ constraints is the finite satisfiability problem for keys in $L$. That is the problem of determining, given any DTD $D$ and any set $\Sigma$ of key constraints in $L$ over $D$, whether there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$. The corollary below shows that this problem is decidable.

Corollary 5.3: The finite satisfiability problem for keys in $L$ is decidable in linear time.

The decidability of the problem follows from Theorem 4.6. A proof of Corollary 5.3 is given in the Appendix. One might be tempted to think that finite satisfiability of foreign keys in $L$ would be simpler than finite satisfiability of general $L$ constraints. That is the problem of determining, given any DTD $D$ and any set $\Sigma$ of foreign keys in $L$ over $D$, whether there exists a finite data tree valid w.r.t. $D$ and satisfying $\Sigma$. It is not hard to show, however, that this problem has the same complexity as the finite satisfiability problem for $L$ and is therefore undecidable. This is because a foreign key $\tau[X] \subseteq \tau'[Y]$ in $\Sigma$ requires $\tau'[Y] \to \tau'$, i.e., the presence of keys.

Acknowledgements. We thank Michael Benedikt, Peter Buneman, Frank Neven and Jérôme Siméon for valuable suggestions and discussions.

References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 122-133, Tucson, Arizona, May 1997.
[3] V. Apparao, S. Byrne, M. Champion, S. Isaacs, I. Jacobs, A. Le Hors, G. Nicol, J. Robie, R. Sutor, C. Wilson, and L. Wood. Document Object Model (DOM) Level 1 Specification. W3C Recommendation, Oct. 1998. http://www.w3.org/TR/REC-DOM-Level-1/.
[4] C. Beeri and T. Milo. Schemas for integration and translation of structured and semi-structured data. In Proceedings of International Conference on Database Theory (ICDT), Jerusalem, Israel, Jan. 1999.
[5] T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation, Feb. 1998. http://www.w3.org/TR/REC-xml/.


6 Conclusion

We have investigated the interaction between DTDs and key and foreign key constraints for XML. We have demonstrated that DTDs impose a form of cardinality constraints on XML documents and that, in the presence of these cardinality constraints, the analysis of finite satisfiability of keys and foreign keys for XML becomes far more intriguing than that for relational databases. In particular, we have introduced and studied a class $L_u$ of unary keys and foreign keys, and a class $L$ of multi-attribute keys and foreign keys for XML. We have shown that both DTDs and $L_u$ constraints can be characterized by cardinality constraints. Taking advantage of these characterizations, we have shown that the finite satisfiability problem for $L_u$ constraints is NP-complete for XML. We have also established PTIME decidability for several special cases of finite satisfiability of $L_u$ constraints. In addition, we have shown that the finite satisfiability problem for $L$ is undecidable.

[18] W. Fan and L. Libkin. Finite implication of keys and foreign keys for XML data. Technical Report TUCIS-TR-2000-003, Department of Computer and Information Sciences, Temple University, 2000.
[19] W. Fan and J. Siméon. Integrity constraints for XML. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 23-34, Dallas, Texas, May 2000.
[20] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[21] A. Layman, E. Jung, E. Maler, H. S. Thompson, J. Paoli, J. Tigue, N. H. Mikula, and S. De Rose. XML-Data. W3C Note, Jan. 1998. http://www.w3.org/TR/1998/NOTE-XML-data.
[22] A. K. Lenstra, H. W. Lenstra, and L. Lovász. Factoring polynomials with rational coefficients. Math. Ann., 261:515-534, 1982.
[23] H. R. Lewis and C. H. Papadimitriou. Elements of the Theory of Computation. Prentice-Hall, 1981.
[24] F. Neven. Extensions of attribute grammars for structured document queries. In Proceedings of International Workshop on Database Programming Languages (DBPL), 1999.
[25] C. H. Papadimitriou. On the complexity of integer programming. Journal of ACM, 28(4):765-768, 1981.
[26] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[27] J. Robie, J. Lapp, and D. Schach. XML Query Language (XQL). Workshop on XML Query Languages, Dec. 1998.
[28] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C Working Draft, Apr. 2000. http://www.w3.org/TR/xmlschema-1/.
[29] J. D. Ullman. Database and Knowledge Base Systems. Computer Science Press, 1988.
[30] P. Wadler. A Formal Semantics for Patterns in XSL. Technical report, Computing Sciences Research Center, Bell Labs, Lucent Technologies, 2000. http://www.cs.bell-labs.com/~wadler/topics/xml#xsl-semantics.

[6] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for XML. Draft manuscript, July 2000.
[7] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Reasoning about keys for XML. Draft manuscript, Sept. 2000.
[8] P. Buneman, W. Fan, and S. Weinstein. Path constraints on semistructured and structured data. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 129-138, Seattle, Washington, June 1998.
[9] P. Buneman, W. Fan, and S. Weinstein. Interaction between path and type constraints. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 56-67, Philadelphia, Pennsylvania, May 1999.
[10] P. Buneman, W. Fan, and S. Weinstein. Query optimization for semistructured data using path constraints in a deterministic data model. In Proceedings of International Workshop on Database Programming Languages (DBPL), 1999.
[11] D. Calvanese, G. De Giacomo, and M. Lenzerini. Representing and reasoning on XML documents: A description logic approach. Journal of Logic and Computation, 9(3):295-318, June 1999.
[12] D. Calvanese and M. Lenzerini. Making object-oriented schemas more expressive. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), 1994.
[13] D. Calvanese and M. Lenzerini. On the interaction between isa and cardinality constraints. In Proceedings of IEEE International Conference on Data Engineering (ICDE), 1994.
[14] J. Clark. XSL Transformations (XSLT). W3C Recommendation, Nov. 1999. http://www.w3.org/TR/xslt.
[15] J. Clark and S. DeRose. XML Path Language (XPath). W3C Working Draft, Nov. 1999. http://www.w3.org/TR/xpath.
[16] S. S. Cosmadakis, P. C. Kanellakis, and M. Y. Vardi. Polynomial-time implication problems for unary inclusion dependencies. Journal of the ACM, 37(1):15-46, Jan. 1990.
[17] B. Dreben and W. D. Goldfarb. The Decision Problem: Solvable Classes of Quantificational Formulas. Addison-Wesley, 1979.

Now for each $\tau \in E$, $l \in R(\tau)$ and $x \in ext(\tau)$, let $val'(att(x, l))$ be a string value in $s[\tau.l]$ such that in $T_1$, $s[\tau.l] = ext(\tau.l)$. This is possible since $T_2 \models C_\Sigma$ and moreover, $ext(\tau)$ in $T_1$ is the same as $ext(\tau)$ in $T_2$ by the definition of $T_1$. Given the definitions of $T_1$ and $val'$ above, it is easy to verify that $T_1 \models C_\Sigma$. In addition, it can be verified that $T_1 \models \Sigma$. More specifically, for any $\varphi \in \Sigma$, (1) if $\varphi$ is $\tau.l \to \tau$, then for any distinct $x, y \in ext(\tau)$, $x.l \neq y.l$ by the cardinality constraint $|ext(\tau.l)| = |ext(\tau)|$ and the definition of $val'$; (2) if $\varphi$ is $\tau_1.l_1 \subseteq \tau_2.l_2$, then $T_2 \models |ext(\tau_1.l_1)| \le |ext(\tau_2.l_2)|$ by $T_2 \models C_\Sigma$. Thus for any $x \in ext(\tau_1)$, we have that $x.l_1 \in ext(\tau_2.l_2)$ by the definition of $val'$. Therefore, $T_1$ is indeed a finite data tree such that $T_1$ is valid w.r.t. $D$ and $T_1 \models \Sigma$. Thus (1) holds.
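The link between unary constraints and cardinalities used in this proof can be made concrete with a small sketch. The flat node representation (a list of pairs of element type and attribute dictionary) and the names `ext`, `ext_attr`, `key_holds`, `foreign_key_holds` and `nodes` are assumptions made for this illustration; the tree structure is omitted because these checks only depend on labels and attribute values.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[str, Dict[str, str]]   # (element type, attribute name -> string value)


def ext(nodes: List[Node], tau: str) -> List[Node]:
    """ext(tau): all nodes labeled tau."""
    return [n for n in nodes if n[0] == tau]


def ext_attr(nodes: List[Node], tau: str, l: str) -> Set[str]:
    """ext(tau.l): the set of l-attribute values of tau nodes."""
    return {n[1][l] for n in ext(nodes, tau)}


def key_holds(nodes: List[Node], tau: str, l: str) -> bool:
    """tau.l -> tau holds exactly when |ext(tau.l)| = |ext(tau)|."""
    return len(ext_attr(nodes, tau, l)) == len(ext(nodes, tau))


def foreign_key_holds(nodes: List[Node], tau1: str, l1: str,
                      tau2: str, l2: str) -> bool:
    """tau1.l1 included in tau2.l2, where l2 must also be a key of tau2."""
    if not key_holds(nodes, tau2, l2):
        return False
    return ext_attr(nodes, tau1, l1).issubset(ext_attr(nodes, tau2, l2))


# Hypothetical nodes: two books with distinct ISBNs, two references to book "1".
nodes = [
    ("book", {"isbn": "1"}),
    ("book", {"isbn": "2"}),
    ("ref", {"to": "1"}),
    ("ref", {"to": "1"}),
]
```

In this example the inclusion forces $|ext(ref.to)| \le |ext(book.isbn)|$, which is the kind of cardinality constraint collected in $C_\Sigma$.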

Appendix

Proof of Theorem 3.1. We show (1) $\Leftrightarrow$ (2).

(1) $\Rightarrow$ (2). This direction is immediate. Given a finite data tree $T_1$ such that $T_1$ is valid w.r.t. $D$ and $T_1 \models \Sigma$, it is easy to verify that $T_1 \models C_\Sigma$. Thus (2) holds.


$T_2 = (V_2, lab_2, ele_2, att, val, root)$ by modifying $T_1$ such that $T_2$ is valid w.r.t. D?. We proceed to construct $T_2$ as follows. Initially, let $V_2 = V_1$, $lab_2 = lab_1$ and $ele_2 = ele_1$. Note that $att$, $val$ and $root$ are the same in $T_1$ and $T_2$. We traverse $T_2$ starting from $root$. For each $v \in V_2$ such that $lab(v) = \tau$ and $\tau \in E$, we increment $V_2$, $lab_2$ and $ele_2$ as follows. Assume $P(\tau) = \alpha$ and $ele(v) = [v_1, \ldots, v_n]$. Let $w = lab(v_1) \ldots lab(v_n)$. Since $T_1$ is valid w.r.t. $D$, by Definition 2.2, $w$ must be in the regular language defined by $\alpha$. By the definitions of the transition function $f$ and the function P? for element type definitions in D?, it is easy to verify that $w$ can be derived from the rule P?($\tau$) = $f(\alpha)$ and the related rules for new element types in $E_a$ introduced by $f(\alpha)$. Let $t$ be a derivation tree of $w$. Without loss of generality, assume the root of $t$ is $v$, the leaves of $t$ are $v_1, \ldots, v_n$, and the internal nodes of $t$ are $x_1, \ldots, x_m$. Observe that these internal nodes are labeled with element types in $E_a$, i.e., E? $\setminus$ $E$. We modify $T_2$ by substituting $t$ for $v$. More specifically, we (i) increment $V_2$ by including $x_1, \ldots, x_m$; (ii) increment $lab_2$ by defining $lab_2(x_i)$ to be the label of $x_i$ in $t$, for each $i \in [1, m]$; and (iii) increment $ele_2$ according to the structure of $t$, i.e., by substituting the children of $v$ in $t$ for $[v_1, \ldots, v_n]$ and by defining $ele_2(x_i)$ to be the children of $x_i$ in $t$, for $i \in [1, m]$. Let $T_2$ be the tree






Claim: Let $D$ be a DTD and D? be the $*$-free DTD of $D$. Then there exists a finite data tree $T_1$ valid w.r.t. $D$ iff there exists a finite data tree $T_2$ valid w.r.t. D?.

To verify the claim, first assume that there exists a finite data tree $T_1 = (V_1, lab_1, ele_1, att, val, root)$ such that $T_1$ is valid w.r.t. $D$. We construct a data tree


Proof of Theorem 3.3. We first show the following:


(2) $\Rightarrow$ (1). Let $T_2 = (V, lab, ele, att, val, root)$ be a finite data tree valid w.r.t. $D$ and satisfying $C_\Sigma$. We construct a data tree $T_1 = (V, lab, ele, att, val', root)$ such that $T_1 \models \Sigma$. Observe that the only difference between $T_1$ and $T_2$ is the definition of the function $val'$. The function $val'$ has the following property: for any $v \in V$, $val'(v)$ is a string if $lab(v) \in A$, and $val'(v) = val(v)$ otherwise. This suffices since $T_1 \models \Sigma$ and in addition, it is easy to verify that $T_1$ is a finite data tree valid w.r.t. $D$ by the assumption about $T_2$ and Definition 2.2.

We next define the function $val'$ such that $T_1 \models \Sigma$. To do so, let $S = \{\tau.l \mid \tau \in E, l \in R(\tau)\}$. We define a partial order $\preceq$ on $S$ as follows:
$\tau.l \preceq \tau.l$,
$\tau_1.l_1 \preceq \tau_2.l_2$ if $T_2 \models |ext(\tau_1.l_1)| \le |ext(\tau_2.l_2)|$,
$\tau_1.l_1 \preceq \tau_3.l_3$ if $\tau_1.l_1 \preceq \tau_2.l_2$ and $\tau_2.l_2 \preceq \tau_3.l_3$.
It is easy to verify that $\preceq$ is indeed a partial order on $S$. Using $\preceq$ we define an equivalence relation $\sim$ on $S$ as follows: $\tau_1.l_1 \sim \tau_2.l_2$ iff $\tau_1.l_1 \preceq \tau_2.l_2$ and $\tau_2.l_2 \preceq \tau_1.l_1$. Let $[\tau.l]$ denote the equivalence class of $\tau.l$ with respect to $\sim$, and let $S_{[\sim]} = \{[\tau.l] \mid \tau.l \in S\}$. We now extend $\preceq$ to $S_{[\sim]}$ such that $[\tau_1.l_1] \preceq [\tau_2.l_2]$ iff $\tau_1.l_1 \preceq \tau_2.l_2$. We then topologically sort $S_{[\sim]}$ based on $\preceq$. For each $[\tau.l]$, let $s[\tau.l]$ be a distinct set. Based on the topological order, we assign a set of string values to each $s[\tau.l]$ using the algorithm below:

1. for each $s[\tau.l] \in S_{[\sim]}$ do $s[\tau.l] := \emptyset$;
2. for each $s[\tau.l] \in S_{[\sim]}$ in the topological order do
   (1) add distinct string values to $s[\tau.l]$ such that $|s[\tau.l]|$ is equal to $|ext(\tau.l)|$ in $T_2$;
   (2) for each $s[\tau'.l'] \in S_{[\sim]}$ such that $[\tau.l] \preceq [\tau'.l']$ do $s[\tau'.l'] := s[\tau'.l'] \cup s[\tau.l]$;
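The two-step value assignment above can be sketched as follows. This is a minimal illustration under stated assumptions: the equivalence classes are given by name in an already topologically sorted list `order` (smallest first), `size` gives the required cardinality of each class as read off from $T_2$, and `above` lists, for each class, the classes that are larger in the order. These names, and the generated strings `v0`, `v1`, ..., are hypothetical and not from the paper.

```python
from typing import Dict, List, Set


def assign_values(order: List[str], size: Dict[str, int],
                  above: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Assign a set of strings to each class, following the two steps above."""
    s: Dict[str, Set[str]] = {c: set() for c in order}   # step 1: empty sets
    fresh = 0                                            # counter for new strings
    for c in order:                                      # step 2: topological order
        # (1) pad s[c] with fresh strings up to the required cardinality;
        # range() of a non-positive number is empty, so nothing is removed.
        for _ in range(size[c] - len(s[c])):
            s[c].add(f"v{fresh}")
            fresh += 1
        # (2) push the values of c into every class above c.
        for d in above.get(c, ()):
            s[d] |= s[c]
    return s


# Hypothetical chain a below b below c with cardinalities 1, 2, 2.
result = assign_values(
    order=["a", "b", "c"],
    size={"a": 1, "b": 2, "c": 2},
    above={"a": ["b"], "b": ["c"]},
)
```

Because smaller classes are processed first and their values are propagated upward, the sets come out nested along the order, which is what the inclusion constraints require; the cardinality assumption on $T_2$ is what guarantees that a propagated set never exceeds its target size.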


is a nite data tree valid w.r.t. D. We show that D admits an integer solution. It suces to de ne an integer assignment to unknowns in D that satis es equations and disequations in D . To do so, for each  E , let the unknown ext( ) be the number of vertices in V labeled  . Assume P ( ) = . Since T is valid w.r.t. D, for any vertex v  with lab(v) =  and ele(v) = [v1 ; :::; vn ], lab(v1):::lab(vn ) must be in the regular language de ned by . In other words, [v1 ; :::; vn ] observes the regular expression . Thus for i [1; n], we can mark vi with xi 0 if vi corresponds to a certain occurrence of  0 in , following the de nition of the function g. Similarly, we can mark a sequence of  to indicate that the sequence corre[v1 ; :::; vn ] with y ;k sponds to a certain occurrence of the regular expression in . More speci cally, we mark ~v = [v1 ; :::; vn ] with y ; 1 , and consider the structure of as follows. If is some  0 E S , then n must equal to 1. We mark v1 with xi 0 , where xi 0 is the corresponding variable in g( ; y ; 1 ). If is , then n = 0. We also mark the  , where y  is the correspondempty sequence with y;k ;k  ing variable in g( ; y ;1 ). If is ( 1 ; 2 ), then ~v can be partitioned into two sub-sequences v~1 = [v1 ; :::; vm ] and v~2 = [vm+1 ; :::; vn ] such that lab(v1 ) ::: lab(vm) and lab(vm+1 ) ::: lab(vn) are in the regular languages de ned by 1 and 2 , respectively. We mark v~1 with y  1;k1 and mark v~2 with y  2 ;k2 , where y  1;k1 and y  2 ;k2 are the  ). We then proceed corresponding variables in g( ; y ; 1 to mark sub-sequences of v~1 and v~2 in the same way. If is ( 1 2 ), then ~v corresponds to either 1 or 2 . We mark the sequence with the corresponding variable  ) and proceed to mark its sub-sequences as in g( ; y ; 1 above. These marks provide a ner labeling that keeps track of how [v1 ; :::; vn ] corresponds to regular expressions in P ( ). 
Let the unknown xi 0 be assigned as its value the number of all vertices in V marked xi 0 , and  be assigned the number of all vertex similarly, let y ;k  . This is indeed an integer assignsequences marked y ;k ment since T is nite. This assignment must satisfy  for each  E , since T is valid w.r.t. D and  is determined by P ( ) in D. Formally, this can be veri ed by induction on the structure of P ( ). In addition, it is easy to verify that the assignment satis es the other constraints in D . More speci cally, ext(r) = 1, all unknowns 0, and ext( ) is the collection of all vertices corresponding to some occurrences of  in element type de nitions of D. Therefore, the assignment is indeed an integer solution of D . That is, D is nitely satis able.

obtained at the end of the traversal. It is easy to verify that $T_2$ is finite and in addition, $T_2$ is valid w.r.t. D?.

Conversely, let $T_2 = (V_2, lab_2, ele_2, att, val, root)$ be such that $T_2$ is valid w.r.t. D?. We construct a data tree $T_1 = (V_1, lab_1, ele_1, att, val, root)$ by modifying $T_2$ such that $T_1$ is valid w.r.t. $D$. We construct $T_1$ as follows. For any node $v \in V_2$ with $lab(v) = \tau$ and $\tau \in E_a$, i.e., $\tau \in$ E? $\setminus$ $E$, we substitute the sub-elements of $v$ for $v$ in $ele(v')$, where $v'$ is the parent of $v$. In addition, we delete (i) $v$ from $V_2$, (ii) $lab_2(v)$ from $lab_2$, and (iii) $ele_2(v)$ from $ele_2$. We repeat the process until there is no node labeled with an element type in E? $\setminus$ $E$. Now let $V_1$, $lab_1$ and $ele_1$ be $V_2$, $lab_2$ and $ele_2$ obtained at the end of the process, respectively. Observe that $att$, $val$ and $root$ remain unchanged. It is easy to verify that $T_1$ is indeed a finite data tree valid w.r.t. $D$.
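The converse construction, which removes every node labeled with an auxiliary element type and promotes its children to its parent, can be sketched as below. The nested-tuple tree representation and the names `splice`, `aux`, `t2` and `t1` are assumptions for this illustration; attributes are omitted, and the root is assumed not to carry an auxiliary label.

```python
from typing import List, Set, Tuple

Tree = Tuple[str, List["Tree"]]   # (element type, ordered list of subtrees)


def splice(node: Tree, aux: Set[str]) -> Tree:
    """Return a copy of `node` in which every descendant whose label is in
    `aux` is replaced, in place and in order, by its own children."""
    label, children = node
    new_children: List[Tree] = []
    for child in children:
        cleaned = splice(child, aux)          # child's subtree is aux-free below it
        if cleaned[0] in aux:
            new_children.extend(cleaned[1])   # promote the grandchildren
        else:
            new_children.append(cleaned)
    return (label, new_children)


# Hypothetical tree with auxiliary types a1 and a2 nested inside each other.
t2 = ("r", [("a1", [("x", []), ("a2", [("y", [])])]), ("z", [])])
t1 = splice(t2, {"a1", "a2"})
```

The sketch leaves the nodes with labels in $E$ untouched, which mirrors the observation that the construction does not change $ext(\tau)$ for $\tau \in E$.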


Using the claim, we next show Theorem 3.3.

(1) $\Rightarrow$ (2). Assume that there exists a finite data tree $T_1$ such that $T_1$ is valid w.r.t. $D$ and $T_1 \models \Sigma$. We show that there exists a finite data tree $T_2$ such that $T_2$ is valid w.r.t. D? and $T_2 \models \Sigma$. By Theorem 3.1, there is a finite data tree $T_1'$ such that $T_1'$ is valid w.r.t. $D$ and $T_1' \models C_\Sigma$, where $C_\Sigma$ is the set of cardinality constraints determined by $\Sigma$. By the claim, we can construct a finite data tree $T_2'$ such that $T_2'$ is valid w.r.t. D?. Moreover, observe that the construction given in the proof of the claim does not alter $ext(\tau)$ and $ext(\tau.l)$ for any $\tau \in E$ and $l \in R(\tau)$. Therefore, by $T_1' \models C_\Sigma$ we have $T_2' \models C_\Sigma$. Thus again by Theorem 3.1, there is a finite data tree $T_2$ such that $T_2$ is valid w.r.t. D? and $T_2 \models \Sigma$. That is, (2) holds.

(2) $\Rightarrow$ (1). Suppose that there exists a finite data tree $T_2$ such that $T_2$ is valid w.r.t. D? and $T_2 \models \Sigma$. We show that there exists a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$. The proof is similar to the one given above. Consider the construction of $T_1$ valid w.r.t. $D$ from $T_2$ valid w.r.t. D? given in the proof of the claim. This construction does not alter $ext(\tau)$ and $ext(\tau.l)$ for any $\tau \in E$ and $l \in R(\tau)$. Thus $T_1 \models C_\Sigma$ iff $T_2 \models C_\Sigma$. Hence by Theorem 3.1, there exists a finite data tree $T$ valid w.r.t. $D$ and satisfying $\Sigma$. Therefore, (1) holds if (2) holds.

Proof of Theorem 3.5. First, assume

Conversely, assume that D admits an integer solution. That is, there is an integer assignment to un-

$T = (V, lab, ele, att, val, root)$

with y  1 ;k1 and y  2 ;k2 many distinct sequences marked  . with y  2;k2 . We mark these sequences also with y ;k Again this is possible by the de nition of constraints in D . For each v V , if lab(v) =  ,  E and P ( ) = , then we choose a distinct sequence of vertices [v1 ; :::; vn ]  . This is possible by constraints of marked with y ; 1 D . We de ne ele(v) = [v1 ; :::; vn ]. It can be shown that lab(v1):::lab(vn ) is indeed in the regular language de ned by , by the de nition of D . Formally, this can be veri ed by induction on the structure of , i.e., considering the cases when is one of S,  , , ( 1 2 ) and ( 1 ; 2 ). It is also easy to verify that T de ned in this way is indeed a nite data tree valid w.r.t. D. Therefore, there exists a nite data tree valid w.r.t. D.

 satisfying equations and knowns ext( ) , xi 0 and y ;k disequations in D . Observe that all these unknowns are equal to or greater than 0 because of disequations in D . We show that there is nite data tree T = (V; lab; ele; att; val; root) valid w.r.t. D. To do so, for each  E S , we create ext( ) many distinct vertices and label them with  . We refer to this set of vertices as ext( ). In addition, for each  E , each v ext( ) and l R( ), we create a distinct vertex and label it with l. We refer to this vertex as vl . Let [ V = ext( ) ext(S) j

j

2

j

2

[f g

j

2

2

 2E

[

 2E

2

[

j

[

vl v ext( ); l R( ) :

f

j

2

2

g

The functions lab, att and val are de ned as follows: for each v V and l A,  ( ) and  E S lab(v) = l ifif vv = ext vl0 for some v0 ext( )  vl if vl V att(v; l) = unde ned otherwise 8 < empty string if lab(v) is S or l, val(v) = : where l A unde ned otherwise It is easy to verify that these functions are well de ned. Let root be the vertex labeled r. This vertex is unique since ext(r) = 1 is in D . Finally, we de ne the function ele. To de ne ele, we do the following: (1) For each xi in D , we choose xi many distinct vertices labeled  and mark them with xi . We mark a vertex at most once. Observe that every vertex in V labeled  can be marked once and only once since the assignment to unknowns is an integer solution of D .  in D , we choose y  many distinct (2) For each y ;k ;k  , sequences of vertices in V and mark them with y ;k following the de nition of the function g. More speci  = xi 0 is in D for some  0 E S , then cally, if y ;k  we pick xi 0 many distinct vertices labeled xi 0 and mark  . If y  = y  is in D , then we them also with y ;k ;j ;k  create y ;k many empty sequences and mark them with  . If y  = y    y ;k ;k 1 ;k1 and y ;k = y 2 ;k2 are in D , then  many distinct sequences marked with we choose y ;k   many distinct sequences marked with y 1 ;k1 and y ;k   many distinct sequences of the y 2 ;k2 . We create y ;k form ~v = v~1 v~2 , where v~1 and v~2 are sequences chosen as above and marked with y  1;k1 and y  2 ;k2 , respectively.  . If y  = y   We mark ~v with y ;k ;k 1 ;k1 + y 2 ;k2 is in D ,  then we choose y 1 ;k1 many distinct sequences marked 2

Proof of Corollary 3.6. This follows immediately from the proof of Theorem 3.5. Indeed, given an integer solution of $\Psi_D$, the proof of Theorem 3.5 constructs a finite data tree $T$ valid w.r.t. $D$. In addition, for any $\tau \in E$, the number of vertices in $T$ labeled $\tau$ is exactly the non-negative integer value assigned to the unknown $|ext(\tau)|$ in $\Psi_D$. Thus Corollary 3.6 holds.


Proof of Theorem 4.1. Suppose that there exists a finite data tree $T$ such that $T$ is valid w.r.t. DTD $D$ and $T \models \Sigma$. We show that $\Psi(D, \Sigma)$ admits a legal integer solution. By Theorem 3.1, we have $T \models C_\Sigma$, where $C_\Sigma$ is the finite set of cardinality constraints determined by $\Sigma$. As in the proof of Theorem 3.5, we can define an integer assignment to the unknowns in $\Psi(D, \Sigma)$ such that the equations and disequations in $\Psi_D$ are satisfied. Observe that the assignment ensures that for each $\tau \in E$, the unknown $|ext(\tau)|$ equals the number of all vertices in $T$ labeled $\tau$. We extend the assignment as follows: for each $\tau \in E$ and $l \in R(\tau)$, let the unknown $|ext(\tau.l)|$ be the number of distinct $l$ attribute values of all vertices in $T$ labeled $\tau$. Thus by $T \models C_\Sigma$, this extended assignment satisfies all the equations and disequations in $C_\Sigma$. In addition, if $|ext(\tau)| > 0$ then $|ext(\tau.l)| > 0$ since every $\tau$ element in $T$ has an $l$ attribute. Hence the assignment is indeed a legal solution of $\Psi(D, \Sigma)$. That is, $\Psi(D, \Sigma)$ is finitely satisfiable.


Conversely, suppose that $\Psi(D, \Sigma)$ admits a legal integer solution. We show that there is a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$. As in the proofs of Theorem 3.5 and Corollary 3.6, given an integer solution of $\Psi(D, \Sigma)$, we can construct a finite data tree

Claim 1. There exists a finite data tree valid w.r.t. $D = (E, A, P, R, r)$ iff $P(r)$ yields a finite tree.

Claim 2. Given an arbitrary DTD $D = (E, A, P, R, r)$, it can be decided in linear time whether $P(r)$ yields a finite tree.

We first show Claim 1. If $P(r)$ yields a finite tree, one can easily construct a finite data tree valid w.r.t. $D$. Conversely, suppose that there exists a finite data tree $T$ valid w.r.t. $D$. We define the depth of $T$ to be the number of nodes in the longest path from the root to a leaf of $T$. Since $T$ is finite, its depth is some natural number $n$. We show by induction on $n$ that $P(r)$ yields a finite tree.

Basis: $n = 1$. Since $T$ is valid w.r.t. $D$, $\epsilon$ must be in the regular language $L(P(r))$ defined by $P(r)$. By induction on the structure of a regular expression $\alpha$, we can verify that $\alpha$ yields a finite tree if $\epsilon \in L(\alpha)$.

Inductive step: Assume the statement holds when $n = m$. We show that it also holds for $n = m + 1$. Let $[v_1, \ldots, v_l]$ be the sub-elements of the root of $T$. Since $T$ is valid w.r.t. $D$, $lab(v_1) \ldots lab(v_l)$ must be in $L(P(r))$. Observe that for each $i \in [1, l]$, the subtree of $T$ rooted at $v_i$ has depth at most $m$. Given these, by induction on the structure of $P(r)$, we can verify that $P(r)$ yields a finite tree.

$T' = (V, lab, ele, att, val, root)$ such that $T'$ is valid w.r.t. $D$. Moreover, for each $\tau \in E$, if the unknown $|ext(\tau)|$ is assigned $n$ by the integer assignment, then $n$ is the number of all vertices in $T'$ labeled $\tau$. We modify the definition of the function $val$ such that in $T'$, for each $\tau \in E$ and $l \in R(\tau)$ that occur in $\Sigma$, the number of distinct $l$ attribute values of all vertices labeled $\tau$ equals the value assigned to the unknown $|ext(\tau.l)|$ by the assignment. This is possible since $|ext(\tau.l)| \le |ext(\tau)|$ is in $C_\Sigma$, and the assignment is also a solution of $C_\Sigma$. Therefore, there is a finite data tree $T''$ such that $T'' \models C_\Sigma$ and $T''$ is valid w.r.t. $D$. Observe that since the solution is legal, we have $|ext(\tau.l)| > 0$ if $|ext(\tau)| > 0$, and thus every $\tau$ element in $T''$ can have an $l$ attribute. Hence by Theorem 3.1, there exists a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and moreover, $T \models \Sigma$.


Proof of Corollary 4.2. First assume that there exists a finite data tree $T$ such that $T$ is valid w.r.t. DTD $D$ and $T \models \Sigma$. We show that $\Psi$(D?, $\Sigma$) admits a legal integer solution, where D? is the $*$-free DTD of $D$. By Theorem 3.3, there exists a finite data tree $T'$ such that $T'$ is valid w.r.t. D? and $T' \models \Sigma$. Thus by Theorem 4.1, $\Psi$(D?, $\Sigma$) is finitely satisfiable.




We next show Claim 2. We use a boolean recursive function finite-tree($\alpha$) to check whether a regular expression $\alpha$ over $E \cup \{S\}$ yields a finite tree. For each $\tau \in E$, we use two global boolean variables: checked($\tau$) to indicate whether $\tau$ has been checked by the function, and finite($\tau$) to indicate whether $\tau$ yields a finite tree. These are used to ensure that each $\tau$ in $E$ is checked at most once by the function. Initially, checked($\tau$) and finite($\tau$) have value false for each $\tau \in E$.

function finite-tree($\alpha$)
  if $\alpha$ is one of $\epsilon$, $S$ or $\alpha_1^*$ then return true
  else if $\alpha = (\alpha_1, \alpha_2)$ then return (finite-tree($\alpha_1$) and finite-tree($\alpha_2$))
  else if $\alpha = (\alpha_1 \mid \alpha_2)$ then return (finite-tree($\alpha_1$) or finite-tree($\alpha_2$))
  else if $\alpha = \tau$ in $E$ then
    (1) if checked($\tau$) then return finite($\tau$);
    (2) checked($\tau$) := true;
    (3) finite($\tau$) := finite-tree($P(\tau)$);
    (4) return finite($\tau$);
  else return false;

It is easy to verify that $P(r)$ yields a finite tree if and

Conversely, suppose that $\Psi$(D?, $\Sigma$) admits a legal integer solution. We show that there is a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$. By Theorem 4.1, there exists a finite data tree $T'$ such that $T'$ is valid w.r.t. D? and $T' \models \Sigma$. Thus by Theorem 3.3, there exists a finite data tree $T$ such that $T$ is valid w.r.t. $D$ and $T \models \Sigma$.


Proof of Theorem 4.3. The proof is outlined in Section 4.2. In particular, a reduction from a variant of the linear integer programming (LIP) problem to the finite satisfiability problem for $L_u$ constraints is given there. It shows that the finite satisfiability problem is NP-hard since the variant of LIP is NP-complete [20, 23].

Proof of Corollary 4.4. The proof is given in Section 4.2.

Proof of Corollary 4.5. The proof is given in Section 4.2.

Proof of Theorem 4.6. It suffices to verify:

Let $\Sigma \cup \{\theta\}$ be a set of FDs and IDs over $\mathbf{R}$. We use $\Sigma \models \theta$ to denote that $\Sigma$ finitely implies $\theta$, as for keys and foreign keys. The finite implication problem for FDs by FDs and IDs is the problem of determining, given any relational schema $\mathbf{R}$, any set $\Sigma$ of FDs and IDs over $\mathbf{R}$ and an FD $\theta$ over $\mathbf{R}$, whether $\Sigma \models \theta$. This is a well-known undecidable problem [1].

We encode FDs and IDs in terms of keys and foreign keys as follows.

(1) FD $\theta = R : X \to Y$. Let $Z$ be a key for $R$, i.e., $R[Z] \to R$ (note that every relation schema has a key). We define a new (fresh) relation schema $R_{new}$ such that $Att(R_{new}) = XYZ$, i.e., the union of $X$, $Y$ and $Z$. Observe that $XYZ$ is a key for both $R_{new}$ and $R$. We encode $\theta$ with two key constraints $\varphi_1$ and $\varphi_4$, and two foreign key constraints $\varphi_2$ and $\varphi_3$:
$\varphi_1 = R_{new}[X] \to R_{new}$
$\varphi_2 = R[XY] \subseteq R_{new}[XY]$
$\varphi_3 = R_{new}[XYZ] \subseteq R[XYZ]$
$\varphi_4 = R_{new}[XY] \to R_{new}$

only if finite-tree($P(r)$) returns true. In addition, the function call finite-tree($P(r)$) takes $O(m)$ time, where $m$ is the size of the function $P$. This is because for each $\tau$ in $E$, $P(\tau)$ is checked at most once by the function.
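The recursive check can be sketched in Python as follows. This is a minimal illustration that mirrors the pseudocode of finite-tree, not an implementation taken from the paper: the classes `Eps`, `Str`, `Elem`, `Seq`, `Alt` and `Star`, the dictionary `productions` (standing in for $P$) and the wrapper `yields_finite_tree` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict


# Hypothetical AST for content models (regular expressions over E and S).
@dataclass(frozen=True)
class Eps:
    """The empty word."""


@dataclass(frozen=True)
class Str:
    """The string type S."""


@dataclass(frozen=True)
class Elem:
    name: str  # an element type tau in E


@dataclass(frozen=True)
class Seq:
    left: object
    right: object


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    inner: object


def yields_finite_tree(productions: Dict[str, object], root: str) -> bool:
    """Run the finite-tree check on P(root); productions plays the role of P."""
    checked: Dict[str, bool] = {}  # checked(tau), false by default
    finite: Dict[str, bool] = {}   # finite(tau), false by default

    def finite_tree(alpha: object) -> bool:
        if isinstance(alpha, (Eps, Str, Star)):
            return True
        if isinstance(alpha, Seq):
            return finite_tree(alpha.left) and finite_tree(alpha.right)
        if isinstance(alpha, Alt):
            return finite_tree(alpha.left) or finite_tree(alpha.right)
        if isinstance(alpha, Elem):
            tau = alpha.name
            if checked.get(tau, False):           # step (1)
                return finite.get(tau, False)
            checked[tau] = True                   # step (2)
            finite[tau] = finite_tree(productions[tau])  # step (3)
            return finite[tau]                    # step (4)
        return False

    return finite_tree(productions[root])


# Example: r -> (a, r*) with a -> S.
example = {"r": Seq(Elem("a"), Star(Elem("r"))), "a": Str()}
result = yields_finite_tree(example, "r")
```

Each element type is expanded at most once because of the `checked` table, which is the source of the linear bound stated above.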


Proof of Proposition 4.8. Suppose that there exists a finite data tree $T_1 = (V, lab, ele, att, val, root)$ valid w.r.t. $D$. We construct another finite data tree $T_2$ by modifying the $val$ function in $T_1$ such that for any constraint $\tau.l \to \tau$ in $\Sigma$, $|ext(\tau)| = |ext(\tau.l)|$. More specifically, let $T_2 = (V, lab, ele, att, val', root)$. Observe that the only difference between $T_1$ and $T_2$ is the definition of the function $val'$. For any distinct $v_1, v_2$ in $V$ with $lab(v_1) = \tau$ and $lab(v_2) = \tau$, we can make $val'(att(v_1, l)) \neq val'(att(v_2, l))$. Let $val'(v) = val(v)$ for all other vertices in $V$. Thus $T_2 \models C_\Sigma$. By Theorem 3.1, $T_2 \models \Sigma$. In addition, it is easy to verify that $T_2$ is valid w.r.t. $D$ since $T_1$ is valid w.r.t. $D$. The other direction is immediate.
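The value reassignment used in this argument can be illustrated with a small Python sketch. It assumes a simplified representation that is not in the paper: a data tree is flattened to a list of nodes, each a dictionary with a `label` and an `att` mapping, and a unary key $\tau.l \to \tau$ is the pair `(tau, l)`. The function names `reassign_key_values` and `key_holds` are hypothetical.

```python
from itertools import count
from typing import Dict, List, Set, Tuple

Node = Dict[str, object]  # {"label": str, "att": {attribute name: value}}


def reassign_key_values(nodes: List[Node], keys: Set[Tuple[str, str]]) -> List[Node]:
    """Return a copy of nodes in which every key attribute gets a fresh value.

    Labels and the set of attributes of each node are untouched, so only
    the val function changes, as in the construction of T2 from T1.
    """
    fresh = count()
    result = []
    for node in nodes:
        att = dict(node["att"])
        for name in att:
            if (node["label"], name) in keys:
                att[name] = f"fresh-{next(fresh)}"  # distinct for every node
        result.append({"label": node["label"], "att": att})
    return result


def key_holds(nodes: List[Node], label: str, name: str) -> bool:
    """Check the unary key label.name -> label: no two nodes share a value."""
    values = [n["att"][name] for n in nodes if n["label"] == label]
    return len(values) == len(set(values))


# Two person nodes that initially share the same id value.
tree = [
    {"label": "person", "att": {"id": "1", "name": "x"}},
    {"label": "person", "att": {"id": "1", "name": "x"}},
    {"label": "dept", "att": {"id": "1"}},
]
repaired = reassign_key_values(tree, {("person", "id")})
```

Because only attribute values are rewritten, any check that depends solely on labels and tree structure gives the same answer before and after the reassignment.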






Proof of Theorem 5.1. We prove this by reduction from the finite implication problem for functional dependencies by functional and inclusion dependencies, which is known to be undecidable (see, e.g., [1] for a proof). Before we give the reduction, we first review functional and inclusion dependencies in relational databases. Let $\mathbf{R}$ be a relational schema. Functional dependencies (FDs) and inclusion dependencies (IDs) over $\mathbf{R}$ are defined as follows.






FD. $R : X \to Y$, where $R \in \mathbf{R}$, and $X$ and $Y$ are subsets of attributes in $Att(R)$. An instance $\mathbf{I}$ of $\mathbf{R}$ satisfies the FD $\psi$, denoted by $\mathbf{I} \models \psi$, if

$$\forall t_1, t_2 \in I \, \Big( \bigwedge_{l \in X} (t_1.l = t_2.l) \to \bigwedge_{l' \in Y} (t_1.l' = t_2.l') \Big),$$

where $I$ is the instance of $R$ in $\mathbf{I}$. Observe that keys are a special case of FDs in which $Y = Att(R)$.

ID. $R[l_1, \ldots, l_k] \subseteq R'[l'_1, \ldots, l'_k]$, where $R, R' \in \mathbf{R}$, $[l_1, \ldots, l_k]$ is a sequence of attributes in $Att(R)$, and similarly, $[l'_1, \ldots, l'_k]$ is a sequence of attributes in $Att(R')$. It should be mentioned that in contrast to foreign keys, $l'_1, \ldots, l'_k$ are not necessarily a key for $R'$. An instance $\mathbf{I}$ of $\mathbf{R}$ satisfies the ID $\psi$, denoted by $\mathbf{I} \models \psi$, if

$$\forall t_1 \in I \; \exists t_2 \in I' \, \Big( \bigwedge_{1 \le j \le k} t_1.l_j = t_2.l'_j \Big),$$

where $I$ and $I'$ are the instances of $R$ and $R'$ in $\mathbf{I}$, respectively.

(2) ID $\psi = R_1[X] \subseteq R_2[Y]$. Let $Z$ be a key for $R_2$, i.e., $R_2[Z] \to R_2$. We define a new relation schema $R_{new}$ such that $Att(R_{new}) = YZ$. Observe that $YZ$ is a key for $R_2$. We encode $\psi$ with a key $\phi_1$ and two foreign keys $\phi_2$ and $\phi_3$:

$\phi_1 = R_{new}[Y] \to R_{new}$,
$\phi_2 = R_1[X] \subseteq R_{new}[Y]$,
$\phi_3 = R_{new}[YZ] \subseteq R_2[YZ]$.

We next show that the encoding is indeed a reduction from the finite implication problem for FDs by FDs and IDs to the finite implication problem for keys by keys and foreign keys. Given a relational schema $\mathbf{R}$, a set $\Sigma$ of FDs and IDs over $\mathbf{R}$, and an FD $\theta = R : X \to Y$ over $\mathbf{R}$, as described above we encode $\Sigma$ with a set $\Sigma_1$ of keys and foreign keys, and encode $\theta$ with

$\phi_1 = R_{new}[X] \to R_{new}$;
$\phi_2 = R[XY] \subseteq R_{new}[XY]$;
$\phi_3 = R_{new}[XYZ] \subseteq R[XYZ]$;
$\phi_4 = R_{new}[XY] \to R_{new}$,

where $R_{new}$ is the new relation schema introduced for $\theta$. Let $\Sigma' = \Sigma_1 \cup \{\phi_2, \phi_3, \phi_4\}$. It suffices to show that $\Sigma \models \theta$ iff $\Sigma' \models \phi_1$. Let $\mathbf{R}'$ be the relational schema that includes all relation schemas in $\mathbf{R}$ as well as new relations created in the encoding. We show the claim as follows.
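As a concrete reading of the FD and ID satisfaction conditions used in this proof, the following Python sketch checks them on small relation instances. It is an illustration under assumptions of our own: a relation instance is a list of dictionaries mapping attribute names to values, and the function names `satisfies_fd`, `satisfies_id` and `satisfies_key` are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Row = Dict[str, Hashable]


def satisfies_fd(instance: List[Row], x: Sequence[str], y: Sequence[str]) -> bool:
    """FD R : X -> Y: tuples that agree on X must also agree on Y."""
    seen = {}
    for t in instance:
        x_val = tuple(t[a] for a in x)
        y_val = tuple(t[a] for a in y)
        # setdefault stores y_val for a new x_val, else returns the stored one
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


def satisfies_key(instance: List[Row], x: Sequence[str], all_attrs: Sequence[str]) -> bool:
    """A key is the special case of an FD in which Y = Att(R)."""
    return satisfies_fd(instance, x, all_attrs)


def satisfies_id(inst1: List[Row], attrs1: Sequence[str],
                 inst2: List[Row], attrs2: Sequence[str]) -> bool:
    """ID R[l1..lk] <= R'[l1'..lk']: every projected tuple of R occurs in R'."""
    targets = {tuple(t[a] for a in attrs2) for t in inst2}
    return all(tuple(t[a] for a in attrs1) in targets for t in inst1)


emp = [
    {"id": 1, "dept": "db", "city": "nyc"},
    {"id": 2, "dept": "db", "city": "nyc"},
]
dept = [{"name": "db"}, {"name": "os"}]

fd_ok = satisfies_fd(emp, ["dept"], ["city"])
id_ok = satisfies_id(emp, ["dept"], dept, ["name"])
```

Note that `satisfies_id` does not require the target attributes to form a key of the second relation, which is exactly the difference between an ID and a foreign key pointed out above.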


$val'(att(v_1, l)) \neq val'(att(v_2, l))$ for any $l \in X$. Let $val'(v) = val(v)$ for all other vertices in $V$. It is easy to verify that $T_2$ is valid w.r.t. $D$ since $T_1$ is valid w.r.t. $D$. In addition, $T_2 \models \tau[X] \to \tau$ since for any distinct $x, y \in ext(\tau)$, $x[X] \neq y[X]$. The other direction is immediate.

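(1) Suppose that there is a finite instance $\mathbf{I}$ of $\mathbf{R}$ such that $\mathbf{I} \models \bigwedge \Sigma \wedge \neg\theta$. We show that there is a finite instance $\mathbf{I}'$ of $\mathbf{R}'$ such that $\mathbf{I}' \models \bigwedge \Sigma' \wedge \neg\phi_1$. We construct $\mathbf{I}'$ such that for any $R$ in $\mathbf{R}$, the instance of $R$ in $\mathbf{I}'$ is the same as the instance of $R$ in $\mathbf{I}$. We populate instances of new relations $R_{new}$ created in the encoding as follows.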


If $R_{new}$ is introduced in the encoding of an FD $R : X \to Y$ as described above, then we let the instance $I_{new}$ of $R_{new}$ in $\mathbf{I}'$ be a subset of $\pi_{XYZ}(I)$ such that $\pi_{XY}(I) = \pi_{XY}(I_{new})$ and moreover, $I_{new} \models R_{new}[XY] \to R_{new}$, where $I$ is the instance of $R$ in $\mathbf{I}$ and $\pi_W(I)$ is the projection of $I$ on attributes $W$.

If $R_{new}$ is introduced in the encoding of an ID $R_1[X] \subseteq R_2[Y]$ as described above, then let the instance $I_{new}$ of $R_{new}$ in $\mathbf{I}'$ be a subset of $\pi_{YZ}(I_2)$ such that $\pi_Y(I_2) = \pi_Y(I_{new})$ and in addition, $I_{new} \models R_{new}[Y] \to R_{new}$, where $I_2$ is the instance of $R_2$ in $\mathbf{I}$.

It is easy to verify that $\mathbf{I}' \models \bigwedge \Sigma' \wedge \neg\phi_1$.
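One way to obtain such an instance for a new relation is to keep exactly one tuple per value of the intended key attributes. The Python sketch below illustrates this for the FD case; it is only one possible choice of the subset, and the representation (lists of dictionaries) and the names `project` and `build_new_instance` are assumptions made for the example.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Row = Dict[str, Hashable]


def project(instance: List[Row], attrs: Sequence[str]) -> Set[Tuple[Hashable, ...]]:
    """Projection of an instance on attrs, as a set of value tuples."""
    return {tuple(t[a] for a in attrs) for t in instance}


def build_new_instance(instance: List[Row], key_attrs: Sequence[str],
                       all_attrs: Sequence[str]) -> List[Row]:
    """Pick one tuple of the projection on all_attrs per value of key_attrs.

    The result is a subset of the projection on all_attrs, has the same
    projection on key_attrs as the input, and has key_attrs as a key.
    """
    chosen: Dict[Tuple[Hashable, ...], Row] = {}
    for t in instance:
        k = tuple(t[a] for a in key_attrs)
        chosen.setdefault(k, {a: t[a] for a in all_attrs})  # keep first seen
    return list(chosen.values())


# Instance I of R with X = [a], Y = [b], Z = [z]; the FD a -> b fails in I.
X, Y, Z = ["a"], ["b"], ["z"]
I = [
    {"a": 1, "b": 2, "z": 10},
    {"a": 1, "b": 2, "z": 11},
    {"a": 1, "b": 3, "z": 12},
]
I_new = build_new_instance(I, X + Y, X + Y + Z)
```

In this example $XY$ is a key of `I_new` by construction, while $X$ alone is not, because the original instance violates the FD; this is the situation the argument relies on for the relation introduced for $\theta$.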


(2) Suppose that there is a finite instance $\mathbf{I}'$ of $\mathbf{R}'$ such that $\mathbf{I}' \models \bigwedge \Sigma' \wedge \neg\phi_1$. We construct a finite instance $\mathbf{I}$ of $\mathbf{R}$ by removing from $\mathbf{I}'$ all instances of new relations introduced in the encoding. It is easy to verify that $\mathbf{I} \models \bigwedge \Sigma \wedge \neg\theta$. Therefore, the encoding is indeed a reduction from the finite implication problem for FDs by FDs and IDs. This shows that the finite implication problem for keys by keys and foreign keys is undecidable.


Proof of Theorem 5.2. A proof is given in Section 5.

Proof of Corollary 5.3. Given any DTD $D$ and any set $\Sigma$ of keys in $L$ over $D$, it suffices to show that $\Sigma$ can be satisfied by a finite data tree valid w.r.t. $D$ if and only if there exists a finite data tree valid w.r.t. $D$. If this holds, then the corollary follows immediately from Theorem 4.6. We now show the claim. Suppose that there exists a finite data tree $T_1 = (V, lab, ele, att, val, root)$ valid w.r.t. $D$. We construct another finite data tree $T_2$ by modifying the $val$ function in $T_1$ such that for any constraint $\tau[X] \to \tau$ in $\Sigma$, $|ext(\tau)| = |ext(\tau.l)|$ for every $l \in X$. That is, $T_2 \models \tau.l \to \tau$ for all $l \in X$. More specifically, let $T_2 = (V, lab, ele, att, val', root)$. Observe that the only difference between $T_1$ and $T_2$ is the definition of the function $val'$. For any distinct $v_1, v_2$ in $V$ with $lab(v_1) = \tau$ and $lab(v_2) = \tau$, we can make
