//proj[name='Pierogies']/emp[1]/name/text(). â â
. //proj[name='Pierogies']/emp[1]/salary. â â
. XML Document proj name. Pierogies emp name. Alice salary. $90 ...
Valid Query Answers for XML Slawomir Staworko1 1 2
Jan Chomicki2
University at Buffalo and INRIA Futurs
University at Buffalo and Warsaw University
June 25-29, 2007
Invalid XML documents
Querying Invalid XML I
Integration of XML documents
I
Slight differences between schemas (e.g. different cardinality constraints)
I
Legacy XML databases
I
Database updates
Invalid XML documents
Querying Invalid XML I
Integration of XML documents
I
Slight differences between schemas (e.g. different cardinality constraints)
I
Legacy XML databases
I
Database updates
Document Type Definition (D0 ) proj emp name salary
→ → → →
(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA
Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary
Invalid XML documents
Querying Invalid XML I
Integration of XML documents
I
Slight differences between schemas (e.g. different cardinality constraints)
I I
Document with errors proj name Pierogies
emp
emp
name salary
name salary
John $80k
Mary
Legacy XML databases Database updates proj
Document Type Definition (D0 ) proj emp name salary
→ → → →
(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA
Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary
...
$40k
Invalid XML documents
Querying Invalid XML I
Integration of XML documents
I
Slight differences between schemas (e.g. different cardinality constraints)
I I
Document with errors proj name Pierogies
emp
emp
name salary
name salary
John $80k
Mary
Legacy XML databases Database updates proj
Document Type Definition (D0 ) proj emp name salary
→ → → →
(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA
Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary
...
$40k
Invalid XML documents
Querying Invalid XML I
Integration of XML documents
I
Slight differences between schemas (e.g. different cardinality constraints)
I I
Document with errors proj name Pierogies emp
Legacy XML databases Database updates
Document Type Definition (D0 ) proj emp name salary
→ → → →
emp
emp
name salary
name salary
John $80k
Mary
name salary ???
???
proj
(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA
Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary
...
$40k
Core XPath Core XPath Queries I
text values but no attributes
I
all standard axes
I
subexpressions, value tests, and joins: //*[A/B], //*[B/text()=C/text()]
I
negation and disjunction: //*[not A/B], //A | //B
I
no functions count(/A/*)
N0 A N1
N3 B
N2
B
abcd
Tree Reachability Fact: (x, Q, y ) query Q node x
object y
Query Answers I
Basic facts use only /* and following sibling:: (N0 ,/*,N1 ), (N1 ,/*,N2 ),...
I I
Other facts are inferred with implications: (X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ). (N0 ,/*/*,N2 ).
given query Q and document T with the root node r find all tree facts that hold in T x in an answer to Q in T iff the tree fact (r , Q, x) holds in T
Query Evaluation for Positive Core XPath (no negation) Bottom-up approach I I I
computing tree facts for query Q tree facts for T1 , . . . , Tm computed before including inferred facts (involving subqueries of Q)
Tree N0 X Ni−1
N1 X1
...
Xi−1
T1 Ti−1
Algorithm I start with ∅ II for subtree Ti (i = 1, . . . , m) 1 add all facts of the subtree (obtained by recursion) 2 add (N0 ,/*,Ni ) 3 if i > 1 add (Ni−1 ,following sibling::*,Ni )
Ni Xi Ti
...
Nm Xm Tm
Editing operations
A B C
D
Editing operations
Cost: 3 Insert
A B C
Editing operations I
Inserting a subtree
A
D
B C
E F
D G
Editing operations
Cost: 3 Insert
A B C
Editing operations I I
Inserting a subtree Deleting a subtree
Cost: 2 Delete
A
D
B C
E F
A
D G
E F
D G
Editing operations
Cost: 1 H B
D
C
Editing operations I I I
Cost: 3
Modify
Inserting a subtree Deleting a subtree Modifying a node’s label
Insert
A B C
Cost: 2 Delete
A
D
B C
E F
A
D G
E F
D G
Edit Distance and Repairs [3]
Distance between documents dist(T , S) is the minimal cost of transforming T into S dist = 1
C
C
Distance to a DTD A
dist(T , D) is the minimal cost of repairing T w.r.t D i.e.,
B
B
B
B
min{dist(T , S)|S valid w.r.t D}
DTD C → (A,B)* A → EMPTY B → EMPTY
Repair T 0 is a repair of T w.r.t D iff dist(T 0 , T ) = dist(T , D)
C A
B
C A
B
C A
B
There can be an exponential number of repairs
A
B
Valid Query Answers
Valid Query Answers
XML Document
x is a valid answer to query Q in T w.r.t. D iff x is an answer to Q in every repair of T w.r.t. D.
proj name
Document Type Definition (D0 ) proj emp name salary
→ → → →
(name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA
proj
Pierogies
... emp
emp
name salary name salary Alice $90K Bob
Queries and Valid Answers //proj[emp[1]/salary=’$90K’]/name/text() //proj[name=’Pierogies’]/emp[1]/salary/text() //proj[name=’Pierogies’]/emp[1]/name/text() //proj[name=’Pierogies’]/emp[1]/salary
→ → → →
{Pierogies} {$90K} ∅ ∅
$90K
Trace graph
Read A
DTD
C A
B
B
C → (A,B)* A → EMPTY B → EMPTY
Q0
Q1
Read B
A
B
B
Trace graph
Read A
DTD
C A
B
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
Q0
Q1 Ins B Read B
A
B
B
Del
Trace graph
Read A
DTD
C A
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
B
Q0
Q1 Ins B Read B
A
B
B
Q0
Q0
Q0
Q0
Q1
Q1
Q1
Q1
Del
Trace graph
Read A
DTD
C A
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
B
Q0
Q1 Ins B Read B
A
Q0
Q0 Re
Q1
B
B
Q0 ad Re
ad
Q1
Q0 Re
Q1
ad
Q1
Del
Trace graph
Read A
DTD
C A
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
B
Q0
Q1 Ins B Read B
Ins A
Ins A
Q1
Del
Q1
Re
ad
Del
Q0 Ins B
Del
ad Re
Del
Q0 Ins B
ad
Del
Q0 Ins B
Ins B
Re
B
Ins A
Del
Q0
Q1
B
Ins A
A
Q1
Del
Trace graph
Read A
DTD
C A
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
B
Q0
Q1 Ins B Read B
Ins A
Ins A
Q1 ,0
Del
Q1 ,1
Re
ad
Del
Q0 ,1 Ins B
Del
ad Re
Del
Q0 ,0 Ins B
ad
Del
Q0 ,1 Ins B
Ins B
Re
B
Ins A
Del
Q0 ,0
Q1 ,1
B
Ins A
A
Q1 ,2
Del
Trace graph
Read A
DTD
C A
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
B
Q0
Q1 Ins B Read B
Ins A
Ins A
Q1 ,0
Del
Q1 ,1
Re
ad
Del
Q0 ,1 Ins B
Del
ad Re
Del
Q0 ,0 Ins B
ad
Del
Q0 ,1 Ins B
Ins B
Re
B
Ins A
Del
Q0 ,0
Q1 ,1
B
Ins A
A
Q1 ,2
Del
Trace graph
Read A
DTD
C A
B
Ins A
Del
C → (A,B)* A → EMPTY B → EMPTY
B
Q0
Del Q1
Ins B Read B
Compact representation of all repairs
Del
Q1 ,1
ad
Del
Ins A
Ins A
Ins A
Q1 ,0
Re
Q0 ,1
Repairing Paths: Ins B
Del
ad Re
Del
Q0 ,0 Ins B
ad
Ins B
Ins B
Q1 ,1
Re
Del
Q0 ,1
Ins A
Del
Q0 ,0
Q1 ,2
I
(Read, Read, Del)
I
(Read, Read, Ins A, Read)
I
(Read, Del, Read)
Computing Valid Query Answers
Certain Tree Facts Tree facts present in every repair of a given tree
A
B
Q0 ,0
Q0 ,0
e ad
ad
R
Bottom-up approach Q1 ,0
Q0 ,1
ead
R
Q1 ,1
set of facts collected “so far”: I
Intersection of the sets of all repairing paths
Del
For every repairing path, construct: I
Obtain certain facts by
Del
Ins A
Re
Precomputed values: I certain facts for all children I certain facts common for every minimal tree satisfying DTD
B
I I
start with ∅ Read adds certain facts of the corresponding child Del adds nothing Ins A adds certain facts common for minimal trees labeled with A
Eager intersection Problem Possibly an exponential number of paths
Solution: Eager Intersection Intersect all sets of certain facts for paths sharing the same last operation adding facts (Read/Ins).
Computing VQA Queries
Combined-complexity
Data-complexity
Descending axes
PTIME
PTIME
+ Ascending axes
co-NP-complete
???
+ Sliding axes
co-NP-complete
???
+ Negation/Disjunction
co-NP-complete
???
+ Joins
co-NP-complete
co-NP-complete
Experimental Results: Edit Distance Computation
Compared algorithms Parse base line Validate regular automata Dist distance computation
Data generation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10)
Experimental Results: Edit Distance Computation
Compared algorithms Parse
Parse base line Validate regular automata Dist distance computation
1. random valid document 2. removing and adding random nodes 3. invalidity ratio
Dist
12
Time (sec.)
Data generation
Validate
9 6 3
dist(T , D)/|T | ' 0.1% 4. small height (8-10)
10
20
30
Document size (MB)
40
50
Experimental Results: Edit Distance Computation
Compared algorithms Parse base line Validate regular automata Dist distance computation
1. random valid document 2. removing and adding random nodes
Dist
60 DTD size |D|
90
6 Time (sec.)
Data generation
Validate
4
2
3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10)
30
120
Experimental Results: Valid Query Answer Computation
Compared algorithms QA base line VQA eager intersection
Data generation I
invalidity ratio dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time
Experimental Results: Valid Query Answer Computation
Compared algorithms QA
QA base line VQA eager intersection
VQA
16
I
invalidity ratio dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Time (sec.)
Data generation 12 8 4
Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time
3
6
Document size (MB)
9
Experimental Results: Valid Query Answer Computation
Compared algorithms VQA
QA base line VQA eager intersection 30
I
invalidity ratio dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Time (sec.)
Data generation
20
10
Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time
10
20
30
40
DTD size |D|
50
60
Conclusions and Future Work
Conclusions I
Framework for querying documents with validity violations of local nature (missing or superfluous nodes)
I
Efficient algorithm for computing valid answers to a class of XPath queries
Future Work I
Valid answers by query rewriting
I
In-depth analysis of data complexity
I
Other tree operations: inner node deletion/insertion, node move, . . . [1]
I
Semantic inconsistencies: keys, functional dependencies,. . . [2]
Acknowledgments
Marcelo Arenas Leo Bertossi Wenfei Fan Jerzy Marcinkowski Slawomir Staworko Jef Wijsen
P. Bille. A Survey on Tree Edit Distance and Related Problems. Theoretical Computer Science, 337(1-3):217–239, 2003. S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and Repairing Inconsistent XML Data. In Web Information Systems Engineering, pages 175–188. Springer, LNCS 3806, 2005. S. Staworko and J. Chomicki. Validity-Sensitive Querying of XML Databases. In EDBT Workshops (dataX), pages 164–177. Springer, LNCS 4254, 2006.