Valid Query Answers for XML - Department of Computer Science and

Valid Query Answers for XML Slawomir Staworko1 1 2

Jan Chomicki2

University at Buffalo and INRIA Futurs

University at Buffalo and Warsaw University

June 25-29, 2007

Invalid XML documents

Querying Invalid XML I

Integration of XML documents

I

Slight differences between schemas (e.g. different cardinality constraints)

I

Legacy XML databases

I

Database updates




I


I

Legacy XML databases

I

Database updates

Document Type Definition (D0 ) proj emp name salary

→ → → →

(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA

Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary




I


I I

Document with errors proj name Pierogies

emp

emp

name salary

name salary

John $80k

Mary

Legacy XML databases Database updates proj


→ → → →



...

$40k




I


I I

Document with errors proj name Pierogies

emp

emp

name salary

name salary

John $80k

Mary

Legacy XML databases Database updates proj


→ → → →



...

$40k




I


I I

Document with errors proj name Pierogies emp

Legacy XML databases Database updates


→ → → →

emp

emp

name salary

name salary

John $80k

Mary

name salary ???

???

proj



...

$40k

Core XPath Core XPath Queries I

text values but no attributes

I

all standard axes

I

subexpressions, value tests, and joins: //*[A/B], //*[B/text()=C/text()]

I

negation and disjunction: //*[not A/B], //A | //B

I

no functions count(/A/*)

N0 A N1

N3 B

N2

B

abcd

Tree Reachability Fact: (x, Q, y ) query Q node x

object y

Query Answers I

Basic facts use only /* and following sibling:: (N0 ,/*,N1 ), (N1 ,/*,N2 ),...

I I

Other facts are inferred with implications: (X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ). (N0 ,/*/*,N2 ).

given query Q and document T with the root node r find all tree facts that hold in T x in an answer to Q in T iff the tree fact (r , Q, x) holds in T

Query Evaluation for Positive Core XPath (no negation) Bottom-up approach I I I

computing tree facts for query Q tree facts for T1 , . . . , Tm computed before including inferred facts (involving subqueries of Q)

Tree N0 X Ni−1

N1 X1

...

Xi−1

T1 Ti−1

Algorithm I start with ∅ II for subtree Ti (i = 1, . . . , m) 1 add all facts of the subtree (obtained by recursion) 2 add (N0 ,/*,Ni ) 3 if i > 1 add (Ni−1 ,following sibling::*,Ni )

Ni Xi Ti

...

Nm Xm Tm

Editing operations

A B C

D

Editing operations

Cost: 3 Insert

A B C

Editing operations I

Inserting a subtree

A

D

B C

E F

D G

Editing operations

Cost: 3 Insert

A B C

Editing operations I I

Inserting a subtree Deleting a subtree

Cost: 2 Delete

A

D

B C

E F

A

D G

E F

D G

Editing operations

Cost: 1 H B

D

C

Editing operations I I I

Cost: 3

Modify

Inserting a subtree Deleting a subtree Modifying a node’s label

Insert

A B C

Cost: 2 Delete

A

D

B C

E F

A

D G

E F

D G

Edit Distance and Repairs [3]

Distance between documents dist(T , S) is the minimal cost of transforming T into S dist = 1

C

C

Distance to a DTD A

dist(T , D) is the minimal cost of repairing T w.r.t D i.e.,

B

B

B

B

min{dist(T , S)|S valid w.r.t D}

DTD C → (A,B)* A → EMPTY B → EMPTY

Repair T 0 is a repair of T w.r.t D iff dist(T 0 , T ) = dist(T , D)

C A

B

C A

B

C A

B

There can be an exponential number of repairs

A

B

Valid Query Answers

Valid Query Answers

XML Document

x is a valid answer to query Q in T w.r.t. D iff x is an answer to Q in every repair of T w.r.t. D.

proj name


→ → → →


proj

Pierogies

... emp

emp

name salary name salary Alice $90K Bob

Queries and Valid Answers //proj[emp[1]/salary=’$90K’]/name/text() //proj[name=’Pierogies’]/emp[1]/salary/text() //proj[name=’Pierogies’]/emp[1]/name/text() //proj[name=’Pierogies’]/emp[1]/salary

→ → → →

{Pierogies} {$90K} ∅ ∅

$90K

Trace graph

Read A

DTD

C A

B

B

C → (A,B)* A → EMPTY B → EMPTY

Q0

Q1

Read B

A

B

B

Trace graph

Read A

DTD

C A

B

B

Ins A

Del


Q0

Q1 Ins B Read B

A

B

B

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del


B

Q0

Q1 Ins B Read B

A

B

B

Q0

Q0

Q0

Q0

Q1

Q1

Q1

Q1

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del


B

Q0

Q1 Ins B Read B

A

Q0

Q0 Re

Q1

B

B

Q0 ad Re

ad

Q1

Q0 Re

Q1

ad

Q1

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del


B

Q0

Q1 Ins B Read B

Ins A

Ins A

Q1

Del

Q1

Re

ad

Del

Q0 Ins B

Del

ad Re

Del

Q0 Ins B

ad

Del

Q0 Ins B

Ins B

Re

B

Ins A

Del

Q0

Q1

B

Ins A

A

Q1

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del


B

Q0

Q1 Ins B Read B

Ins A

Ins A

Q1 ,0

Del

Q1 ,1

Re

ad

Del

Q0 ,1 Ins B

Del

ad Re

Del

Q0 ,0 Ins B

ad

Del

Q0 ,1 Ins B

Ins B

Re

B

Ins A

Del

Q0 ,0

Q1 ,1

B

Ins A

A

Q1 ,2

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del


B

Q0

Q1 Ins B Read B

Ins A

Ins A

Q1 ,0

Del

Q1 ,1

Re

ad

Del

Q0 ,1 Ins B

Del

ad Re

Del

Q0 ,0 Ins B

ad

Del

Q0 ,1 Ins B

Ins B

Re

B

Ins A

Del

Q0 ,0

Q1 ,1

B

Ins A

A

Q1 ,2

Del

Trace graph

Read A

DTD

C A

B

Ins A

Del


B

Q0

Del Q1

Ins B Read B

Compact representation of all repairs

Del

Q1 ,1

ad

Del

Ins A

Ins A

Ins A

Q1 ,0

Re

Q0 ,1

Repairing Paths: Ins B

Del

ad Re

Del

Q0 ,0 Ins B

ad

Ins B

Ins B

Q1 ,1

Re

Del

Q0 ,1

Ins A

Del

Q0 ,0

Q1 ,2

I

(Read, Read, Del)

I

(Read, Read, Ins A, Read)

I

(Read, Del, Read)

Computing Valid Query Answers

Certain Tree Facts Tree facts present in every repair of a given tree

A

B

Q0 ,0

Q0 ,0

e ad

ad

R

Bottom-up approach Q1 ,0

Q0 ,1

ead

R

Q1 ,1

set of facts collected “so far”: I

Intersection of the sets of all repairing paths

Del

For every repairing path, construct: I

Obtain certain facts by

Del

Ins A

Re

Precomputed values: I certain facts for all children I certain facts common for every minimal tree satisfying DTD

B

I I

start with ∅ Read adds certain facts of the corresponding child Del adds nothing Ins A adds certain facts common for minimal trees labeled with A

Eager intersection Problem Possibly an exponential number of paths

Solution: Eager Intersection Intersect all sets of certain facts for paths sharing the same last operation adding facts (Read/Ins).

Computing VQA Queries

Combined-complexity

Data-complexity

Descending axes

PTIME

PTIME

+ Ascending axes

co-NP-complete

???

+ Sliding axes

co-NP-complete

???

+ Negation/Disjunction

co-NP-complete

???

+ Joins

co-NP-complete

co-NP-complete

Experimental Results: Edit Distance Computation

Compared algorithms Parse base line Validate regular automata Dist distance computation

Data generation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10)


Compared algorithms Parse

Parse base line Validate regular automata Dist distance computation

1. random valid document 2. removing and adding random nodes 3. invalidity ratio

Dist

12

Time (sec.)

Data generation

Validate

9 6 3

dist(T , D)/|T | ' 0.1% 4. small height (8-10)

10

20

30

Document size (MB)

40

50


Compared algorithms Parse base line Validate regular automata Dist distance computation

1. random valid document 2. removing and adding random nodes

Dist

60 DTD size |D|

90

6 Time (sec.)

Data generation

Validate

4

2

3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10)

30

120

Experimental Results: Valid Query Answer Computation

Compared algorithms QA base line VQA eager intersection

Data generation I

invalidity ratio dist(T , D)/|T | ' 0.1%

I

small height (8-10)

Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time


Compared algorithms QA

QA base line VQA eager intersection

VQA

16

I


I

small height (8-10)

Time (sec.)

Data generation 12 8 4


3

6

Document size (MB)

9


Compared algorithms VQA

QA base line VQA eager intersection 30

I


I

small height (8-10)

Time (sec.)

Data generation

20

10


10

20

30

40

DTD size |D|

50

60

Conclusions and Future Work

Conclusions I

Framework for querying documents with validity violations of local nature (missing or superfluous nodes)

I

Efficient algorithm for computing valid answers to a class of XPath queries

Future Work I

Valid answers by query rewriting

I

In-depth analysis of data complexity

I

Other tree operations: inner node deletion/insertion, node move, . . . [1]

I

Semantic inconsistencies: keys, functional dependencies,. . . [2]

Acknowledgments

Marcelo Arenas Leo Bertossi Wenfei Fan Jerzy Marcinkowski Slawomir Staworko Jef Wijsen

P. Bille. A Survey on Tree Edit Distance and Related Problems. Theoretical Computer Science, 337(1-3):217–239, 2003. S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and Repairing Inconsistent XML Data. In Web Information Systems Engineering, pages 175–188. Springer, LNCS 3806, 2005. S. Staworko and J. Chomicki. Validity-Sensitive Querying of XML Databases. In EDBT Workshops (dataX), pages 164–177. Springer, LNCS 4254, 2006.

Valid Query Answers for XML - Department of Computer Science and

Valid Query Answers for XML - Department of Computer Science and

Suggest Documents

XPath Query Containment - Department of Computer Science

8. Query Processing - Department of Computer Science

Active XML and active query answers - Serge Abiteboul

Link Detection in XML Documents - Department of Computer Science

The XML Web: a First Study - Department of Computer Science

Indexing XML Data with ToXin - Department of Computer Science

Query-Sensitive Ray Shooting - UMD Department of Computer Science

Skyline Query Processing over Joins - Department of Computer Science

Ontoligent Interactive Query Tool - Department of Computer Science ...

Skyline Query Processing over Joins - Department of Computer Science

An Algebra for XML Query

An Algebra for XML Query

Department of Computer Science

Selectivity Estimation for Exclusive Query ... - Computer Science

OLAP-based Query Recommendation - Department of Computer

English for Computer Science - Department of Computer Science

Complex Group-By Queries for XML - UBC Department of Computer

Computer Science E-259 Lectures - Computer Science E-259: XML ...

Computer and Information Ethics - Computer Science Department ...

Efficient Query Answering for OWL 2 - Department of Computer

for Design Patterns - Department of Computer Science

Efficient Query Answering for OWL 2 - Department of Computer ...

Protocols for Processes - Department of Computer Science

Effective anonymization of query logs - Computer Science