Valid Query Answers for XML - Department of Computer Science and

0 downloads 0 Views 383KB Size Report
//proj[name='Pierogies']/emp[1]/name/text(). → ∅. //proj[name='Pierogies']/emp[1]/salary. → ∅. XML Document proj name. Pierogies emp name. Alice salary. $90 ...
Valid Query Answers for XML Slawomir Staworko1 1 2

Jan Chomicki2

University at Buffalo and INRIA Futurs

University at Buffalo and Warsaw University

June 25-29, 2007

Invalid XML documents

Querying Invalid XML I

Integration of XML documents

I

Slight differences between schemas (e.g. different cardinality constraints)

I

Legacy XML databases

I

Database updates

Invalid XML documents

Querying Invalid XML I

Integration of XML documents

I

Slight differences between schemas (e.g. different cardinality constraints)

I

Legacy XML databases

I

Database updates

Document Type Definition (D0 ) proj emp name salary

→ → → →

(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA

Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary

Invalid XML documents

Querying Invalid XML I

Integration of XML documents

I

Slight differences between schemas (e.g. different cardinality constraints)

I I

Document with errors proj name Pierogies

emp

emp

name salary

name salary

John $80k

Mary

Legacy XML databases Database updates proj

Document Type Definition (D0 ) proj emp name salary

→ → → →

(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA

Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary

...

$40k

Invalid XML documents

Querying Invalid XML I

Integration of XML documents

I

Slight differences between schemas (e.g. different cardinality constraints)

I I

Document with errors proj name Pierogies

emp

emp

name salary

name salary

John $80k

Mary

Legacy XML databases Database updates proj

Document Type Definition (D0 ) proj emp name salary

→ → → →

(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA

Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary

...

$40k

Invalid XML documents

Querying Invalid XML I

Integration of XML documents

I

Slight differences between schemas (e.g. different cardinality constraints)

I I

Document with errors proj name Pierogies emp

Legacy XML databases Database updates

Document Type Definition (D0 ) proj emp name salary

→ → → →

emp

emp

name salary

name salary

John $80k

Mary

name salary ???

???

proj

(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA

Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary

...

$40k

Core XPath Core XPath Queries I

text values but no attributes

I

all standard axes

I

subexpressions, value tests, and joins: //*[A/B], //*[B/text()=C/text()]

I

negation and disjunction: //*[not A/B], //A | //B

I

no functions count(/A/*)

N0 A N1

N3 B

N2

B

abcd

Tree Reachability Fact: (x, Q, y ) query Q node x

object y

Query Answers I

Basic facts use only /* and following sibling:: (N0 ,/*,N1 ), (N1 ,/*,N2 ),...

I I

Other facts are inferred with implications: (X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ). (N0 ,/*/*,N2 ).

given query Q and document T with the root node r find all tree facts that hold in T x in an answer to Q in T iff the tree fact (r , Q, x) holds in T

Query Evaluation for Positive Core XPath (no negation) Bottom-up approach I I I

computing tree facts for query Q tree facts for T1 , . . . , Tm computed before including inferred facts (involving subqueries of Q)

Tree N0 X Ni−1

N1 X1

...

Xi−1

T1 Ti−1

Algorithm I start with ∅ II for subtree Ti (i = 1, . . . , m) 1 add all facts of the subtree (obtained by recursion) 2 add (N0 ,/*,Ni ) 3 if i > 1 add (Ni−1 ,following sibling::*,Ni )

Ni Xi Ti

...

Nm Xm Tm

Editing operations

A B C

D

Editing operations

Cost: 3 Insert

A B C

Editing operations I

Inserting a subtree

A

D

B C

E F

D G

Editing operations

Cost: 3 Insert

A B C

Editing operations I I

Inserting a subtree Deleting a subtree

Cost: 2 Delete

A

D

B C

E F

A

D G

E F

D G

Editing operations

Cost: 1 H B

D

C

Editing operations I I I

Cost: 3

Modify

Inserting a subtree Deleting a subtree Modifying a node’s label

Insert

A B C

Cost: 2 Delete

A

D

B C

E F

A

D G

E F

D G

Edit Distance and Repairs [3]

Distance between documents dist(T , S) is the minimal cost of transforming T into S dist = 1

C

C

Distance to a DTD A

dist(T , D) is the minimal cost of repairing T w.r.t D i.e.,

B

B

B

B

min{dist(T , S)|S valid w.r.t D}

DTD C → (A,B)* A → EMPTY B → EMPTY

Repair T 0 is a repair of T w.r.t D iff dist(T 0 , T ) = dist(T , D)

C A

B

C A

B

C A

B

There can be an exponential number of repairs

A

B

Valid Query Answers

Valid Query Answers

XML Document

x is a valid answer to query Q in T w.r.t. D iff x is an answer to Q in every repair of T w.r.t. D.

proj name

Document Type Definition (D0 ) proj emp name salary

→ → → →

(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA

proj

Pierogies

... emp

emp

name salary name salary Alice $90K Bob

Queries and Valid Answers //proj[emp[1]/salary=’$90K’]/name/text() //proj[name=’Pierogies’]/emp[1]/salary/text() //proj[name=’Pierogies’]/emp[1]/name/text() //proj[name=’Pierogies’]/emp[1]/salary

→ → → →

{Pierogies} {$90K} ∅ ∅

$90K

Trace graph

Read A

DTD

C A

B

B

C → (A,B)* A → EMPTY B → EMPTY

Q0

Q1

Read B

A

B

B

Trace graph

Read A

DTD

C A

B

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

Q0

Q1 Ins B Read B

A

B

B

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

B

Q0

Q1 Ins B Read B

A

B

B

Q0

Q0

Q0

Q0

Q1

Q1

Q1

Q1

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

B

Q0

Q1 Ins B Read B

A

Q0

Q0 Re

Q1

B

B

Q0 ad Re

ad

Q1

Q0 Re

Q1

ad

Q1

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

B

Q0

Q1 Ins B Read B

Ins A

Ins A

Q1

Del

Q1

Re

ad

Del

Q0 Ins B

Del

ad Re

Del

Q0 Ins B

ad

Del

Q0 Ins B

Ins B

Re

B

Ins A

Del

Q0

Q1

B

Ins A

A

Q1

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

B

Q0

Q1 Ins B Read B

Ins A

Ins A

Q1 ,0

Del

Q1 ,1

Re

ad

Del

Q0 ,1 Ins B

Del

ad Re

Del

Q0 ,0 Ins B

ad

Del

Q0 ,1 Ins B

Ins B

Re

B

Ins A

Del

Q0 ,0

Q1 ,1

B

Ins A

A

Q1 ,2

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

B

Q0

Q1 Ins B Read B

Ins A

Ins A

Q1 ,0

Del

Q1 ,1

Re

ad

Del

Q0 ,1 Ins B

Del

ad Re

Del

Q0 ,0 Ins B

ad

Del

Q0 ,1 Ins B

Ins B

Re

B

Ins A

Del

Q0 ,0

Q1 ,1

B

Ins A

A

Q1 ,2

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del

C → (A,B)* A → EMPTY B → EMPTY

B

Q0

Del Q1

Ins B Read B

Compact representation of all repairs

Del

Q1 ,1

ad

Del

Ins A

Ins A

Ins A

Q1 ,0

Re

Q0 ,1

Repairing Paths: Ins B

Del

ad Re

Del

Q0 ,0 Ins B

ad

Ins B

Ins B

Q1 ,1

Re

Del

Q0 ,1

Ins A

Del

Q0 ,0

Q1 ,2

I

(Read, Read, Del)

I

(Read, Read, Ins A, Read)

I

(Read, Del, Read)

Computing Valid Query Answers

Certain Tree Facts Tree facts present in every repair of a given tree

A

B

Q0 ,0

Q0 ,0

e ad

ad

R

Bottom-up approach Q1 ,0

Q0 ,1

ead

R

Q1 ,1

set of facts collected “so far”: I

Intersection of the sets of all repairing paths

Del

For every repairing path, construct: I

Obtain certain facts by

Del

Ins A

Re

Precomputed values: I certain facts for all children I certain facts common for every minimal tree satisfying DTD

B

I I

start with ∅ Read adds certain facts of the corresponding child Del adds nothing Ins A adds certain facts common for minimal trees labeled with A

Eager intersection Problem Possibly an exponential number of paths

Solution: Eager Intersection Intersect all sets of certain facts for paths sharing the same last operation adding facts (Read/Ins).

Computing VQA Queries

Combined-complexity

Data-complexity

Descending axes

PTIME

PTIME

+ Ascending axes

co-NP-complete

???

+ Sliding axes

co-NP-complete

???

+ Negation/Disjunction

co-NP-complete

???

+ Joins

co-NP-complete

co-NP-complete

Experimental Results: Edit Distance Computation

Compared algorithms Parse base line Validate regular automata Dist distance computation

Data generation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10)

Experimental Results: Edit Distance Computation

Compared algorithms Parse

Parse base line Validate regular automata Dist distance computation

1. random valid document 2. removing and adding random nodes 3. invalidity ratio

Dist

12

Time (sec.)

Data generation

Validate

9 6 3

dist(T , D)/|T | ' 0.1% 4. small height (8-10)

10

20

30

Document size (MB)

40

50

Experimental Results: Edit Distance Computation

Compared algorithms Parse base line Validate regular automata Dist distance computation

1. random valid document 2. removing and adding random nodes

Dist

60 DTD size |D|

90

6 Time (sec.)

Data generation

Validate

4

2

3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10)

30

120

Experimental Results: Valid Query Answer Computation

Compared algorithms QA base line VQA eager intersection

Data generation I

invalidity ratio dist(T , D)/|T | ' 0.1%

I

small height (8-10)

Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time

Experimental Results: Valid Query Answer Computation

Compared algorithms QA

QA base line VQA eager intersection

VQA

16

I

invalidity ratio dist(T , D)/|T | ' 0.1%

I

small height (8-10)

Time (sec.)

Data generation 12 8 4

Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time

3

6

Document size (MB)

9

Experimental Results: Valid Query Answer Computation

Compared algorithms VQA

QA base line VQA eager intersection 30

I

invalidity ratio dist(T , D)/|T | ' 0.1%

I

small height (8-10)

Time (sec.)

Data generation

20

10

Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time

10

20

30

40

DTD size |D|

50

60

Conclusions and Future Work

Conclusions I

Framework for querying documents with validity violations of local nature (missing or superfluous nodes)

I

Efficient algorithm for computing valid answers to a class of XPath queries

Future Work I

Valid answers by query rewriting

I

In-depth analysis of data complexity

I

Other tree operations: inner node deletion/insertion, node move, . . . [1]

I

Semantic inconsistencies: keys, functional dependencies,. . . [2]

Acknowledgments

Marcelo Arenas Leo Bertossi Wenfei Fan Jerzy Marcinkowski Slawomir Staworko Jef Wijsen

P. Bille. A Survey on Tree Edit Distance and Related Problems. Theoretical Computer Science, 337(1-3):217–239, 2003. S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and Repairing Inconsistent XML Data. In Web Information Systems Engineering, pages 175–188. Springer, LNCS 3806, 2005. S. Staworko and J. Chomicki. Validity-Sensitive Querying of XML Databases. In EDBT Workshops (dataX), pages 164–177. Springer, LNCS 4254, 2006.

Suggest Documents