Nov 1, 2011 - Mark Musa, in his translation of Dante's Divine Comedy: Dante the Pilgrim, in order to arrive at the Divine Light, will come to an understanding ...
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
A precious class of cardinality constraints for flexible XML data processing Flavio Ferrarotti School of Information Management, The Victoria University of Wellington, New Zealand
Sebastian Link Department of Computer Science, The University of Auckland, New Zealand
Sven Hartmann Institute of Informatics, Clausthal University of Technology , Germany
This research is supported by the Marsden Fund Council from Government funding, administered by the Royal Society of New Zealand.
1
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Important Discovery
Theorem: The best beer is made in New Zealand.
2
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Proof of the discovery ,→ Lemma [Mayor of Brussels]: The quality of beer is determined by the quality of the rugby team. ,→ Axiom [of Rugby]: New Zealand has the best rugby team (The All Blacks).
3
⊓ ⊔
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Motivation • XML: lingua franca for Web data representation, exchange and integration • high degree of syntactic flexibility, low degree of semantic capabilities • challenge: full-fledged tools for native data management • S. Abiteboul, P. Buneman, D. Suciu. Data on the Web: From relations to semistructured data and XML. Morgan Kaufmann, 1999. • S. Ceri, P. Fraternali, S. Paraboschi. XML: Current developments and future challenges for the database community. EDBT, 3-17, 2000. • W. Fan. XML constraints. DEXA Workshops, 805-809, 2005. • D. Suciu. On database theory and XML. SIGMOD Record, 30(3):39-45,2001. • V. Vianu. A web odyssey: from Codd to XML. PODS, 1-15, 2001.
• integrity constraints enhance semantic capabilities: • restrict XML data stores to those that are meaningul for application • balance trade-off between expressiveness and efficiency • cardinality constraints bound the occurrences of data items based on data values ,→ bounds dominate our everyday life ,→ extend keys, and potentially occurrence constraints ,→ rich history in conceptual modeling (Chen, Thalheim, Embley, Lenzerini, Ling) 4
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
An XML data tree fragment db E E calendar A "2007" no A "1" name A "maths"
title E name E S Principal
E semester
E course
sid A teacher E "007" first E
S Skinner
S Bart
year
E course
name A "physics" E student last E grade E S Simpson
S A+
E semester
no A "2"
E teacher title E name E S
sid A "007"
first E
S Principal Skinner
S Bart
E
name A "PE" E student last E grade E S Simpson
5
course
sid E student sid A A "007" "247" first E last E grade E first E
E teacher title E name E
S S S A− Principal Skinner
S Bart
S Simpson
S A+
S Lisa
E student last E grade E S Simpson
S B−
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Some cardinality constraints • relatively to each year: a semester is determined by its number • each year node has either no semester nodes or precisely two • relatively to each course: a student is determined by its sid • each student deciding to study in a semester enrols into at least two and at most 4 courses in that semester • no student enrols more than three times into the same course • no course is offered more than once a year 6
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
An Application Scenario: Approximating Query Costs I Bart Simpson wishes to query XML source for results in 2007 ,→ using his cell-phone but only when costs reasonable ,→ service provider: only retrieve data if service paid for I consider the following XQuery query: for $s in doc(“enrol.xml”)/year[@calendar=“2007”]//course/student where $s/sid=“007” return ⟨grade⟩{$s/grade}⟨/grade⟩ I XML DBMSs capable of reasoning about cardinality constraints foresees maximal number of eight answers (without processing any data!) I approximate costs returned to Bart Simpson who decides accordingly I service provider minimizes costs for unpaid services 7
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Outline I introduce cardinality constraints based on XML tree model I investigate decision problems to unlock XML applications I identify sources of intractability: coNP -hardness results I identify precious subclass • finite axiomatization • characterize implication in terms of shortest path • devise algorithm to efficiently decide implication
8
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Value Equality • each node v is the root of a unique (maximal) sub-tree T (v) of T • two nodes v1 and v2 are value-equal iff their induced sub-trees T (v1) and T (v2) are isomorphic • we denote this by v1 =v v2 • roughly speaking, we have: • lab(v1 ) = lab(v2 ) • if v1 , v2 are attribute or text nodes, then val(v1 ) = val(v2 ) • if v1 , v2 are element nodes, then they possess sets of value-equal attribute children • lists of value-equal element and text children •
• we define the value-intersection of two node sets V1 and V2 • V1 ∩v V2 = {(v1 , v2 ) | v1 ∈ V1 , v2 ∈ V2 , v1 =v v2 } 9
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Example: Value-Equality db E E calendar A "2007" no A "1" name A "maths"
title E name E S Principal
E semester
E course
sid E teacher A "007" first E
S Skinner
S Bart
year
sid A teacher E "007"
E student last E grade E S Simpson
E course
name A "physics"
v1
S A+
E semester
no A "2"
title E name E S
first E
S Principal Skinner
S Bart
name A "PE"
v2 E student last E grade E S Simpson
E
course
v
3 sid student sid E E teacher A A "007" "247" last E grade E first E title E name E first E
S S S A− Principal Skinner
S Bart
S Simpson
S A+
S Lisa
v4 E student last E grade E S Simpson
• student-nodes: v1 =v v3, v2 ̸=v v1 and v2 ̸=v v3 • no two distinct course-nodes are value-equal • for V1 = {v1, v2} and V2 = {v3, v4} we have V1 ∩v V2 = {(v1, v3)} 10
S B−
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Path languages for selecting nodes • we need a suitable path language to select nodes in the XML tree • expressive enough to allow reasonable navigation • simple enough to allow efficient reasoning • use P L generated by Q → ℓ | ε | Q.Q | | ∗ ℓ ∈ E ∪ A ∪ {S} denotes any node name • denotes single label wildcard ∗ denotes variable length wildcard • • ϵ denotes the empty path •
• for S ⊆ {., , ∗}, PLS denotes the subset of P L expressions restricted to the constructs in S 11
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Selecting nodes in an XML tree • path expression Q said to be valid iff no attribute name and no S occur in position other than terminal •
attribute and text nodes are always leaves of the XML tree
• u[[Q]]: those nodes in tree T reachable from node u via simple P ∈ Q • Q1 ⊆ Q2 iff u[[Q1] ⊆ u[[Q2] for all trees T and all nodes u of T • Path containment for P L: decide for arbitrary Q1, Q2 ∈ P L whether Q1 ⊆ Q2 holds • Path containment for P L can be decided in quadratic time
12
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Example: Navigation between Nodes db E E calendar A "2007" no A "1" name A "maths"
first E v1 S S Skinner Bart
title E name E S Principal
u
E semester
E course
sid E teacher A "007"
year
E course
name A "physics" E student last E grade E S Simpson
S A+
E semester
no A "2"
E teacher title E name E S
sid A "007"
first E v2 S
S Principal Skinner
Bart
E
name A "PE" E student last E grade E S Simpson
sid E student sid A A "007" "247" first E last E grade E first E v4 v3
E teacher title E name E
S S S A− Principal Skinner
I u[[ .course. ∗.first]] = {v1, v2, v3, v4} I semester.course. . .first ⊆ .course. ∗.first I not .course. ∗.first ⊆ semester.course. ∗.first 13
course
S Bart
S Simpson
S A+
S Lisa
E student last E grade E S Simpson
S B−
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Cardinality constraints for XML • a cardinality constraint φ has the form card(Q, (Q′, {Q1, . . . , Qk })) = (min, max) Q ∈ P L is the context path ′ • Q ∈ P L is the selector path • Qi ∈ P L is a field path (for i = 1, . . . , k) ′ ′ • Q.Q and Q.Q .Qi are valid paths for i = 1, . . . , k •
• two nodes v1, v2 are {Q1, . . . , Qk }-ambiguous iff v1[ Qi] ∩v v2[ Qi] ̸= ∅ for all i = 1, . . . , k • the XML tree T satisfies φ iff for all nodes q ∈ r[[Q]] and all nodes q ′ ∈ q[[Q′] there are no less than min and no more than max nodes in q[[Q′] that are {Q1, . . . , Qk }-ambiguous with q ′ 14
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Example 1 I in every semester every student who studies must enrol in 2 to 4 courses T does not satisfy card ( ∗.semester,(course,{ .sid}))=(2,4) db E E calendar A "2007" no A "1" name A "maths"
title E name E S Principal
E semester
E course
sid E teacher A "007" first E
S Skinner
S Bart
year
E course
name A "physics" E student last E grade E S Simpson
S A+
E semester
no A "2"
E teacher title E name E S
sid A "007"
first E
S Principal Skinner
S Bart
E
name A "PE" E student last E grade E S Simpson
15
course
sid E student sid A A "007" "247" first E last E grade E first E
E teacher title E name E
S S S A− Principal Skinner
S Bart
S Simpson
S A+
S Lisa
E student last E grade E S Simpson
S B−
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Example 2 I T satisfies card (year,( .course,{teacher}))=(3,6) I T violates card ( ∗.semester,(course,{teacher}))=(2,2) db E E calendar A "2007" no A "1" name A "maths"
title E name E S Principal
E semester
E course
sid E teacher A "007" first E
S Skinner
S Bart
year
E course
name A "physics" E student last E grade E S Simpson
S A+
E semester
no A "2"
E teacher title E name E S
sid A "007"
first E
S Principal Skinner
S Bart
E
name A "PE" E student last E grade E S Simpson
16
course
sid E student sid A A "007" "247" first E last E grade E first E
E teacher title E name E
S S S A− Principal Skinner
S Bart
S Simpson
S A+
S Lisa
E student last E grade E S Simpson
S B−
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Example 3 I XML keys covered by cardinality constraints: min = max = 1 ,→ P. Buneman et al. Reasoning about keys for XML. Inf. Syst.28(8):1037–1063, 2003 ,→ HL. Efficient reasoning about a robust XML key fragment. ACM ToDS. 34(3): Article 10, 2009 I T satisfies card ( ∗.course,(student,{sid}))=(1,1) db E E calendar A "2007" no A "1" name A "maths"
title E name E S Principal
E semester
E course
sid E teacher A "007" first E
S Skinner
S Bart
year
E course
name A "physics" E student last E grade E S Simpson
S A+
E semester
no A "2"
E teacher title E name E S
sid A "007"
first E
S Principal Skinner
S Bart
E
name A "PE" E student last E grade E S Simpson
17
course
sid E student sid A A "007" "247" first E last E grade E first E
E teacher title E name E
S S S A− Principal Skinner
S Bart
S Simpson
S A+
S Lisa
E student last E grade E S Simpson
S B−
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Implication problems I Σ (finitely) implies φ ∈ Σ ∗ (semantic closure): ,→ each (finite) XML tree that satisfies Σ also satisfies φ I (finite) implication problem: ,→ Given any finite Σ ∪ {φ}: does Σ (finitely) imply φ? I efficient solution to decision problems unlocks XML applications + be the syntactic closure of Σ I let ΣR • with respect to a fixed system R of inference rules + can be inferred from Σ using the rules in R • φ ∈ Σ R + ⊆ Σ ∗, and complete iff Σ ∗ ⊆ Σ + I R is sound iff ΣR R
18
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
A Computer Scientist’s Pilgrimage I Mark Musa, in his translation of Dante’s Divine Comedy: Dante the Pilgrim, in order to arrive at the Divine Light, will come to an understanding of sin on his journey through Hell and will see the penance imposed on repentant sinners on the Mount of Purgatory. I our journey through Hell (it’s not that bad!): ,→ our sin: too much expressiveness ,→ understanding: identifying intractabilities I our Mount Purgatory: ,→ restriction to tractable cases ,→ do penance by studying properties I our “Divine Light”: ,→ finite axiomatization ,→ efficient algorithms 19
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Sources of Intractability: Lower and Upper Bounds I C1 = {card(ε, (P ′, {P1, . . . , Pk })) = (min, max) | P ′, P1, . . . , Pk ∈ PL{.}, k ≥ 1, max ≤ 5} I Theorem: The finite implication problem for the class C1 of all simple absolute cardinality constraints with a non-empty set of field paths is coNP hard. I suggests that computational intractability may result from the specification of both lower and upper bounds
20
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Sources of Intractability: Empty sets of field paths I C2 = {card(ε, (P ′, {P1, . . . , Pk })) = (1, max) | P ′, P1, . . . , Pk ∈ PL{.}, k ≥ 0, max ≤ 6} I Theorem: The finite implication problem for the class C2 of all absolute cardinality constraints where the lower bound is fixed to 1 is coNP -hard. I suggests that computational intractability may result from the permission of empty sets of field paths
21
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Sources of Intractability: General field path expressions I C3 = {card(ε, (Q′, {Q1, . . . , Qk })) = (1, max) | Q′, Q1, . . . , Qk ∈ ∗} {., PL , k ≥ 1, max ≤ 4} I Theorem: The finite implication problem for the class C3 of all absolute cardinality constraints that have a non-empty set of field paths and where the lower bound is fixed to 1 is coNP -hard. I suggests that computational intractability may result from permission to have arbitrary path expressions in both target- and field paths
22
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Max constraints for XML I a max constraint φ has the form card(Q, (Q′, {P1, . . . , Pk })) ≤ max Q ∈ P L is the context path ′ • Q ∈ P L is the selector path {., } is a field path (for i = 1, . . . , k and k ≥ 1) • Pi ∈ P L ′ • Q.Q .Pi valid path (for i = 1, . . . , k) •
I XML tree T satisfies card(Q, (Q′, {P1, . . . , Pk })) ≤ max iff T satisfies card(Q, (Q′, {P1, . . . , Pk })) = (1, max) I Theorem: Every finite set of max constraints is finitely satisfiable. I Theorem: Implication and finite implication coincide for max constraints. 23
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
A finite axiomatization for max constraints card(Q, (Q′, S)) ≤ ∞ (infinity)
card(Q, (ϵ, S)) ≤ 1 (epsilon)
card(Q, (Q′, S)) ≤ max card(Q, (Q′, S)) ≤ max + 1 (weakening)
card(Q, (Q′, S)) ≤ max card(Q, (Q′, S ∪ {P })) ≤ max (superkey)
card(Q, (Q′.P, {P ′})) ≤ max card(Q, (Q′, {P.P ′})) ≤ max (subnodes)
card(Q, (Q′.Q′′, S)) ≤ max card(Q.Q′, (Q′′, S)) ≤ max (selector-to-context)
card(Q, (Q′, S)) ≤ max Q′′ ⊆Q ′′ ′ card(Q , (Q , S)) ≤ max (context-path-containment)
card(Q, (Q′, S)) ≤ max Q′′ ⊆Q′ ′′ card(Q, (Q , S)) ≤ max (selector-path-containment)
card(Q, (Q′, S ∪ {P })) ≤ max P ′ ⊆P ′′ ′ card(Q, (Q , S ∪ {P })) ≤ max (field-path-containment)
card(Q, (Q′, S ∪ {ϵ, P })) ≤ max card(Q, (Q′, S ∪ {ϵ, P.P ′})) ≤ max (prefix-epsilon)
card(Q, (Q′.P, {ϵ, P ′})) ≤ max card(Q, (Q′, {ϵ, P.P ′})) ≤ max (subnodes-epsilon)
card(Q, (Q′, {P.P1, . . . , P.Pk })) ≤ max card(Q.Q′, (P, {P1, . . . , Pk })) ≤ max′ card(Q, (Q′.P, {P1, . . . , Pk })) ≤ max · max′ (multiplication)
24
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Example I employees take a role in projects of a company I employees have at least one quarter of no project involvement: σ1 = card( ∗.year, (quarter, {project. .eid.S})) ≤ 3
per quarter, every employee is involved in at most two projects: σ2 = card( ∗.year.quarter, (project, { .eid.S})) ≤ 2 per project, every employee has a unique role: σ3 = card( ∗.project, ( , {eid.S})) ≤ 1 I no employee must be involved in more than 5 project per annum: card( ∗.year, (quarter.project, { .eid.S})) ≤ 5
I φ cannot be inferred from Σ = {σ1, σ2, σ3} ,→ How can we show that φ is not implied by Σ? ,→ we construct a counter-example 25
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Shortest Paths vs Derivability I we construct a cardinality graph GΣ,φ ,→ l: maximum of consecutive occurrences in Σ ,→ replace ∗ by l + 1 occurrences of label ,→ replace by single label occurrence I if d(qφ, qφ′ ) ≤ maxφ in GΣ,φ, then φ ∈ Σ + I Example: •
φ = card( ∗.year, (quarter.project, { .eid.S})) ≤ 5
not derivable from σ1 = card( ∗.year, (quarter, {project. .eid.S})) ≤ 3 σ2 = card( ∗.year.quarter, (project, { .eid.S})) ≤ 2 σ3 = card( ∗.project, ( , {eid.S})) ≤ 1
26
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Completeness I T satisfies Σ: ,→ σ1 = card( ∗.year, (quarter, {project. .eid.S})) ≤ 3, ,→ σ2 = card( ∗.year.quarter, (project, { .eid.S})) ≤ 2, ,→ σ3 = card( ∗.project, ( , {eid.S})) ≤ 1 I T violates φ = card( ∗.year, (quarter.project, { .eid.S})) ≤ 5
27
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
An Algorithm for the Implication Problem • strong method: φ ∈ Σ ∗ iff d(qφ, qφ′ ) ≤ maxφ in GΣ,φ • this is how the algorithm works: • Input: Σ, φ ∗ • Output: YES if φ ∈ Σ ; No otherwise • Method: (1) construct the cardinality graph GΣ,φ (2) if d(qφ, qφ′ ) ≤ maxφ then return(YES) else return(NO) • l denotes the maximum number of consecutive occurrences in Σ • shortest path: ,→ Dijkstra’s algorithm with start node qφ (O(|φ|2 · max{1, l}2)) • cardinality graph can be constructed in time O(||Σ|| · |φ| · max{1, l}) 28
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
A note on the construction I for cardinality graph GΣ,φ important to ,→ replace ∗ by l + 1 occurrences of label ,→ and not by less I otherwise possible that ,→ d(qφ, qφ′ ) ≤ maxφ in GΣ,φ and ,→ φ ∈ / Σ+ I Example: •
φ = card( ∗.year, (quarter.project, { .eid.S})) ≤ 6
not derivable from σ1 = card( .year, (quarter, {project. .eid.S})) ≤ 3 σ2 = card( ∗.year.quarter, (project, { .eid.S})) ≤ 2
29
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Findings for max constraints I form natural class of XML constraints that generalize XML keys I useful for many XML applications: ,→ XQuery, XPath, XML Encryption, XQuery update facility ,→ cost estimation, query optimization, query rewriting I reasoning about cardinality constraints intractable in general ,→ specification of both lower and upper bounds ,→ empty set of field paths ,→ general key path expressions I identified max constraints as a precious subclass ,→ always satisfiable ,→ implication and finite implication coincide ,→ established finite axiomatization ,→ characterized implication in terms of shortest paths ,→ efficient solutions to decision problem unlocks applications 30
The 30th International Conference on Conceptual Modeling (ER)
01 November 2011, Brussels, Belgium
Future Work I push expressiveness while maintaining tractability I develop visual specification languages I validity: is an XML document valid wrt a CC (tree automata) I discover all CC satisfied by a given XML document (data mining) I interaction with schema specifications such as DTDs and XSDs I interaction with other types of constraints I automatic support for query optimization I physical database tuning: indexes for efficient query performance I investigate consistent query answering I database design: transform poorly-designed databases into welldesigned ones that avoid data redundancies, processing difficulties and inefficient query processing 31