7740 Trienter Str. 7860 Triester Str. 8580 Walther-v.-d.-Vogelweide-Pl. ... VLDB 2005, Trondheim. Nikolaus Augsten, Michael Böhlen, Johann Gamper. Page 2 ...
Approximate Tree Matching with pq-Grams
¨ Nikolaus Augstena , Michael Bohlen, Johann Gamper DIS - Center for Database and Information Systems Free University of Bozen-Bolzano, Italy
www.inf.unibz.it 1 – Motivation . . . . . . . . . 2 – Related Work . . . . . . . 3 – pq -Grams . . . . . . . . . 4 – Properties . . . . . . . . . 5 – Experiments . . . . . . . . 6 – Conclusion and Future Work
a
. . . . . .
. . . . . .
. . . . . .
Supported by the Municipality of Bozen-Bolzano.
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. 2 . 6 . 7 . 11 . 14 . 21
Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object.
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 2
Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object.
Land Register
Registration Office
LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...
VLDB 2005, Trondheim
RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3 SRO
SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...
resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...
street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE
id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...
street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 2
Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object. ☞ Example query: Who lives in Braun’s apartment? Land Register
Registration Office
LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...
VLDB 2005, Trondheim
RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3 SRO
SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...
resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...
street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE
id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...
street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 2
Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object. ☞ Example query: Who lives in Braun’s apartment? Land Register
Registration Office
LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...
resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...
SRO
SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...
VLDB 2005, Trondheim
RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3
street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE
?
id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...
street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 2
Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object. ☞ Example query: Who lives in Braun’s apartment? Land Register
LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...
Registration Office
!
VLDB 2005, Trondheim
RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3 SRO
SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...
resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...
street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE
?
id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...
street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 2
Motivation — Address Trees
☞ residential addresses are hierarchical → address tree
Address trees:
CESARE ABBA STRASSE
1 1 2 3
VLDB 2005, Trondheim
3
2 A
B 1 2 3 4
D
4 A B C
Giuseppe-Cesare-Abba-Str. 6
1
2
- A
B
13
1234
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
3 C
6
4 A
BC
123
Page 3
Motivation — Address Trees
☞ residential addresses are hierarchical → address tree ☞ Idea: corresponding streets ⇒ similar address tree How similar are two address trees? Address trees:
CESARE ABBA STRASSE
1 1 2 3
VLDB 2005, Trondheim
3
2 A
B 1 2 3 4
D
4 A B C
Giuseppe-Cesare-Abba-Str. 6
1
2
- A
B
13
1234
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
3 C
6
4 A
BC
123
Page 3
Motivation — Standard Solution: The Edit Distance
☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other.
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 4
Motivation — Standard Solution: The Edit Distance
☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′′
T a
b
x
b
c
d
e
f g
d
e
f g
h i k
h i
VLDB 2005, Trondheim
c
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 4
Motivation — Standard Solution: The Edit Distance
☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′
T a
a
insert(k, e, 3) −→
b
c
d
e
h i
VLDB 2005, Trondheim
T′′
f g
b
x
b
c
d
e
f g
h i k
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
c
d
e
f g
h i k
Page 4
Motivation — Standard Solution: The Edit Distance
☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′
T a
a
insert(k, e, 3) −→
b
c
d
e
f g
x
rename(a, x) −→
b
c
d
e
f g
h i k
h i
edit distance:
VLDB 2005, Trondheim
T′′
b
c
d
e
f g
h i k
disted (T, T′′ ) = 2
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 4
Motivation — Standard Solution: The Edit Distance
☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′
T a
a
insert(k, e, 3) −→
b
c
d
e
T′′
f g
x
rename(a, x) −→
b
c
d
e
f g
h i k
h i
edit distance:
b
c
d
e
f g
h i k
disted (T, T′′ ) = 2
☞ Problem: Best algorithms O(n2 log2 (n)) ⇒ not scalable.
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 4
Motivation — Problem Definition
☞ Our goal: Find an efficient and effective approximation of the tree edit distance that ➠ is scalable for large trees, ➠ emphasizes structure.
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 5
Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n))
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 6
Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 6
Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: O(n2 ) for very different trees
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 6
Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: O(n2 ) for very different trees ➳ Tree-edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]: ➠ O(n log n) ➠ guaranteed distance distortion for tree edit distance with subtree move
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 6
Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: O(n2 ) for very different trees ➳ Tree-edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]: ➠ O(n log n) ➠ guaranteed distance distortion for tree edit distance with subtree move ☞ Related work for strings: ➳ Navarro [Navarro, 2001]: good overview of the edit distance for strings and its variants ➳ Ukkonen [Ukkonen, 1992]: q -grams as lower bound for string edit distance VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 6
pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :
2, 3-Extended Tree:
Patch boundaries by adding null nodes (*):
➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after
T a a b c
the last child of each non-leaf node
➳ q children to each leaf
T2,3 * a
−→
**
e b
a **
e
b b
c
**
********
******
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 7
pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :
2, 3-Extended Tree:
Patch boundaries by adding null nodes (*):
➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after
T a
a
a b c
the last child of each non-leaf node
➳ q children to each leaf
T2,3 *
−→
**
e b
☞ pq -Gram G: Subtree of Tpq . ➳ Anchor node ➳ with p − 1 ancestors ➳ and q children. Contiguous siblings in G are contiguous siblings in Tpq .
a **
e
b b
c
**
********
******
2, 3-Gram Pattern: •
p−1
• anchor
• • • q
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 7
pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :
2, 3-Extended Tree:
Patch boundaries by adding null nodes (*):
➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after
T a
a
a b c
the last child of each non-leaf node
➳ q children to each leaf
T2,3 *
−→
**
e b
☞ pq -Gram G: Subtree of Tpq . ➳ Anchor node ➳ with p − 1 ancestors ➳ and q children. Contiguous siblings in G are contiguous siblings in Tpq .
**
e
b b
**
c
********
******
2, 3-Gram Pattern: •
p−1
•
Example pq -Grams for T:
a
anchor
q
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* a
* a
• • •
VLDB 2005, Trondheim
a
b c * a b 2, 3-gram 1, 2-gram
a
e b 3, 2-gram
Page 7
pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :
2, 3-Extended Tree:
Patch boundaries by adding null nodes (*):
T a
➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after
a
a b c
the last child of each non-leaf node
➳ q children to each leaf
−→
**
e b
☞ pq -Gram G: Subtree of Tpq . ➳ Anchor node ➳ with p − 1 ancestors ➳ and q children. Contiguous siblings in G are contiguous siblings in Tpq . ☞ pq -gram Profile Pp,q (T): ➳ Bag of all pq -grams of T. 2, 3-Gram Profile of T:
T2,3 *
a **
e
b
**
c
********
b
******
2, 3-Gram Pattern: •
Example pq -Grams for T:
p−1
a
*
•
a
a
a
anchor
• • •
b c * a b 2, 3-gram 1, 2-gram
q
*
e b 3, 2-gram
*
a
a
a
a
a
a
*
a
*
a
*
*
a
a
e
a
b
a
a
a
b
a
c
a
a
* * a * * e * * * * e b * * * e b * b * * * a b * * * a b c * * * b c * c * * VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 7
pq -Grams — Algorithm for pq -Gram Profile
P
T a a e
b
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc * *
P
T a a e
b
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a a e
b
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a *
*
a
*
b
c
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a
*
a
*
b
c
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a
*
a
*
b
* a * * a
c
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a
*
a
*
b
* a * * a
c
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a e
b
* a * * a
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a
*
*
sib
VLDB 2005, Trondheim
e
b
* a * * a
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a
*
*
sib
VLDB 2005, Trondheim
e
b
* a * * a a a * * e
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a
*
*
sib
VLDB 2005, Trondheim
e
b
* a * * a a a * * e
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
T anc a e
a b
* a * * a a a * * e
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
T anc a e *
*
sib
VLDB 2005, Trondheim
a b
* a * * a a a * * e
c
b *
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
T anc a e *
*
sib
VLDB 2005, Trondheim
a b
* a * * a a a * * e a e * * *
c
b *
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a
*
e
sib
VLDB 2005, Trondheim
b
* a * * a a a * * e a e * * *
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a
*
e
sib
VLDB 2005, Trondheim
b
c
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
* a * * a a a * a a a * a
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b
e * * * b * * *
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a e
b
sib
VLDB 2005, Trondheim
c
b
*
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
* a * * a a a * a a a * a a a e
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b * * * *
Page 8
pq -Grams — Algorithm for pq -Gram Profile
P
an c
T a a e
c
b b
*
sib
VLDB 2005, Trondheim
*
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
* a * * a a a * a a a * a a a e a a b
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b
* * * * * *
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a a
*
b
c
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
* a * * a a a * a a a * a a a e a a b * a * a b
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b
* * * * * *
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a a
b
c
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b
* * * * * *
* *
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a a
b
c
*
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c a c * * a b c *
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b
* * * * * *
* * * *
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a a
b
c
*
*
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c a c * * a b c * * a c * *
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b
* * * * * *
* * * *
Page 8
pq -Grams — Algorithm for pq -Gram Profile anc *
P
T a a
b
c
*
*
sib e
b
1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P
VLDB 2005, Trondheim
* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c a c * * a b c * * a c * *
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
* e e b b
e * * * b
* * * * * *
* * * *
Page 8
pq -Grams — The pq -Gram Profile The pq -gram profile is
☞ small → size O(n)
Theorem 1 For tree and i non-leaves:
T with l leaves
| Pp,q (T)| = 2l + qi − 1.
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 9
pq -Grams — The pq -Gram Profile The pq -gram profile is
☞ small → size O(n) ☞ easy to store ➳ represent the pq -grams by fingerprint hash value ➳ store profile in single-attribute relation
T a a b c e b
VLDB 2005, Trondheim
→
P2,3 (T) pq−gram (*,a,*,*,a) (a,a,*,*,*) (a,e,*,*,*) (a,a,*,e,b) (a,b,*,*,*) (a,a,e,b,*) (a,a,b,*,*) (*,a,*,a,b) (a,b,*,*,*) (*,a,a,b,c) (a,c,*,*,*) (*,a,b,c,*) (*,a,c,*,*)
Theorem 1 For tree and i non-leaves:
T with l leaves
| Pp,q (T)| = 2l + qi − 1.
P2,3 (T)
→
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
hash 10AE 2F1E 1008 13E1 5F31 AE1D 13DF F310 5F31 45A1 973F 3F1E 11EF
Page 9
pq -Grams — The pq -Gram Profile The pq -gram profile is
☞ small → size O(n) ☞ easy to store ➳ represent the pq -grams by fingerprint hash value ➳ store profile in single-attribute relation
Theorem 1 For tree and i non-leaves:
T with l leaves
| Pp,q (T)| = 2l + qi − 1.
☞ allows effective distance computation between trees T a a b c e b
VLDB 2005, Trondheim
→
P2,3 (T) pq−gram (*,a,*,*,a) (a,a,*,*,*) (a,e,*,*,*) (a,a,*,e,b) (a,b,*,*,*) (a,a,e,b,*) (a,a,b,*,*) (*,a,*,a,b) (a,b,*,*,*) (*,a,a,b,c) (a,c,*,*,*) (*,a,b,c,*) (*,a,c,*,*)
P2,3 (T)
→
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
hash 10AE 2F1E 1008 13E1 5F31 AE1D 13DF F310 5F31 45A1 973F 3F1E 11EF
Page 9
pq -Grams — pq -Gram Distance
☞ Definition 1 For two trees T1 and T2 the pq -gram distance is:
∆p,q (T1 , T2 ) = 1 − 2
VLDB 2005, Trondheim
| Pp,q (T1 ) ∩ Pp,q (T2 )| | Pp,q (T1 ) ∪ Pp,q (T2 )|
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 10
pq -Grams — pq -Gram Distance
☞ Definition 1 For two trees T1 and T2 the pq -gram distance is:
∆p,q (T1 , T2 ) = 1 − 2
| Pp,q (T1 ) ∩ Pp,q (T2 )| | Pp,q (T1 ) ∪ Pp,q (T2 )|
☞ can be computed in O(n log n) time and O(n) space (bag intersection of relations)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 10
pq -Grams — pq -Gram Distance
☞ Definition 1 For two trees T1 and T2 the pq -gram distance is:
∆p,q (T1 , T2 ) = 1 − 2
| Pp,q (T1 ) ∩ Pp,q (T2 )| | Pp,q (T1 ) ∪ Pp,q (T2 )|
☞ can be computed in O(n log n) time and O(n) space (bag intersection of relations) ☞ other terms are constants for normalization: ➳ ∆p,q (T1 , T2 ) = 1 if trees have no pq -grams in common ➳ ∆p,q (T1 , T2 ) = 0 if trees have the same pq -gram profile
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 10
Properties — Sensitivity to Structure Change
☞ Intuition: Nodes with structural information → more significant
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 11
Properties — Sensitivity to Structure Change
☞ Intuition: Nodes with structural information → more significant
T a c
b d
e
f g
h i k
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 11
Properties — Sensitivity to Structure Change
☞ Intuition: Nodes with structural information → more significant
disted = 2 ∆2,3 = 0.30
T′ a b
c d e f h i
VLDB 2005, Trondheim
T
←
a c
b d
e
f g
h i k
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 11
Properties — Sensitivity to Structure Change
☞ Intuition: Nodes with structural information → more significant
disted = 2 ∆2,3 = 0.30
T′ a b
c d e f h i
VLDB 2005, Trondheim
disted = 2 ∆2,3 = 0.89
T
←
→
a c
b d
e
T′′ a b d h i k f g
f g
h i k
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 11
Properties — Sensitivity to Structure Change
☞ Intuition: Nodes with structural information → more significant ☞ Address application: Mismatch of houses (with subnumbers and apartment numbers) is more significant than mismatch of apartments.
disted = 2 ∆2,3 = 0.30
T′ a b
c d e f h i
VLDB 2005, Trondheim
disted = 2 ∆2,3 = 0.89
T
←
→
a c
b d
e
T′′ a b d h i k f g
f g
h i k
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 11
Properties — Sensitive to Structure Change
☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 12
Properties — Sensitive to Structure Change
☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases ☞ cntpq (T, v) ≈ q + f p ➳ leaf change: cost depends only on q ➳ non-leaf change: p prevalent
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 12
Properties — Sensitive to Structure Change
☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases ☞ cntpq (T, v) ≈ q + f p ➳ leaf change: cost depends only on q ➳ non-leaf change: p prevalent
Theorem 2 For a complete tree T (fanout f , depth d) the number of pq -grams that contain a node v of level l is: ( f p −1 if p ≤ d − l f −1 (f + q − 1)
cntpq (T, v) = q sgn(l) +
VLDB 2005, Trondheim
f d−l −1 f −1 (f
+ q − 1) + f d−l if p > d − l.
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 12
Properties — Sensitive to Structure Change
☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases ☞ cntpq (T, v) ≈ q + f p ➳ leaf change: cost depends only on q ➳ non-leaf change: p prevalent ➳ p controls structure sensitivity
Theorem 2 For a complete tree T (fanout f , depth d) the number of pq -grams that contain a node v of level l is: ( f p −1 if p ≤ d − l f −1 (f + q − 1)
cntpq (T, v) = q sgn(l) +
VLDB 2005, Trondheim
f d−l −1 f −1 (f
+ q − 1) + f d−l if p > d − l.
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 12
Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes!
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 13
Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 13
Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change
☞ Local changes → less pq -grams change ➳ neighbored nodes “share” pq -grams ➳ change counts only once.
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 13
Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change
☞ Local changes → less pq -grams change ➳ neighbored nodes “share” pq -grams ➳ change counts only once. Theorem 3 Delete or update all nodes of subtree with l leaves and i non-leaves ⇒ only 2l + iq pq -grams change
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
+q−1
Page 13
Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change
☞ Local changes → less pq -grams change ➳ neighbored nodes “share” pq -grams ➳ change counts only once. Theorem 3 Delete or update all nodes of subtree with l leaves and i non-leaves ⇒ only 2l + iq pq -grams change Example: Delete subtree rooted in e
• e → in 11 pq -grams • h,i, k → in 4 pq -grams each • actually changing: 11 pq -grams (vs. 3 × 4 + 11 = 23)
T
b
+q−1
a c
d
e
f g
h i k
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 13
Experiments — Sensitivity to Structure Changes ☞ Cost for leaf change → depends only on q
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 14
Experiments — Sensitivity to Structure Changes ☞ Cost for leaf change → depends only on q ☞ Experiment: ➳ delete leaf nodes ➳ measure edit distance
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 14
Experiments — Sensitivity to Structure Changes ☞ Cost for leaf change → depends only on q ☞ Experiment: ➳ delete leaf nodes ➳ measure edit distance 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0
vary q
1,3-grams 2,3-grams 3,3-grams 4,3-grams
0
2
4
6
pq-gram distance
pq-gram distance
vary p
8
10 12 14 16 18 20
0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0
2,1-grams 2,2-grams 2,3-grams 2,4-grams
0
2
4
edit distance
6
8
10 12 14 16 18 20
edit distance
(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 14
Experiments — Sensitivity to Structure Changes ☞ Cost for non-leaf change → controlled by p
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 15
Experiments — Sensitivity to Structure Changes ☞ Cost for non-leaf change → controlled by p ☞ Experiment: ➳ delete non-leaf nodes ➳ measure edit distance
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 15
Experiments — Sensitivity to Structure Changes ☞ Cost for non-leaf change → controlled by p ☞ Experiment: ➳ delete non-leaf nodes ➳ measure edit distance 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
vary q 0.7
1,3-grams 2,3-grams 3,3-grams 4,3-grams
pq-gram distance
pq-gram distance
vary p 2,1-grams 2,2-grams 2,3-grams 2,4-grams
0.6 0.5 0.4 0.3 0.2 0.1 0
0
2
4
6
8 10 12 14 16 18 20
0
2
4
edit distance
6
8 10 12 14 16 18 20 edit distance
(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 15
Experiments — Robustness to Local Changes ☞ Subtree deletions → cheaper than distributed deletions
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 16
Experiments — Robustness to Local Changes ☞ Subtree deletions → cheaper than distributed deletions ☞ Experiment: ➳ delete subtree ➳ randomly delete same number of distributed nodes ➳ compare edit distance
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 16
Experiments — Robustness to Local Changes ☞ Subtree deletions → cheaper than distributed deletions ☞ Experiment: ➳ delete subtree ➳ randomly delete same number of distributed nodes ➳ compare edit distance
2,3-gram-distance
0.35 distributed changes local changes
0.3 0.25 0.2 0.15 0.1 0.05 0 0
5
10
15
20
edit distance (Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 16
Experiments — Scalability to Large Trees ☞ pq -gram distance → scalable to large trees ☞ compare with edit distance
a
implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 17
Experiments — Scalability to Large Trees ☞ pq -gram distance → scalable to large trees ☞ compare with edit distance ☞ Experiment: For pair of trees ➳ compute tree edit distancea and pq -gram distance ➳ vary tree size: up 2 × 106 nodes ➳ measure wall clock time
a
implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 17
Experiments — Scalability to Large Trees ☞ pq -gram distance → scalable to large trees ☞ compare with edit distance ☞ Experiment: For pair of trees ➳ compute tree edit distancea and pq -gram distance ➳ vary tree size: up 2 × 106 nodes ➳ measure wall clock time 27 hours
100000
edit dist 2,3-gram dist
10000
time [sec]
1000 100 10 1 0.1 0.01 0.001 1
10
100
1000
10000 100000 1e+06
number of nodes (n) a
implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 17
Experiments — Influence of p and q on Scalability ☞ Scalability independent of p and q .
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 18
Experiments — Influence of p and q on Scalability ☞ Scalability independent of p and q . ☞ Experiment: For pair of trees ➳ compute pq -gram distance for varying p and q ➳ vary tree size: up 106 nodes ➳ measure wall clock time
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 18
Experiments — Influence of p and q on Scalability ☞ Scalability independent of p and q . ☞ Experiment: For pair of trees ➳ compute pq -gram distance for varying p and q ➳ vary tree size: up 106 nodes ➳ measure wall clock time 25
3,4-gram dist 2,3-gram dist 1,2-gram dist
time [sec]
20
15
10
5
0 0
100000
200000
300000
400000
500000
number of nodes (n)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 18
Experiments — Effectiveness for Real World Dataset
☞ pq -gram distance → effective approximation of tree edit distance ☞ test on real world data (address databases)
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 19
Experiments — Effectiveness for Real World Dataset
☞ pq -gram distance → effective approximation of tree edit distance ☞ test on real world data (address databases) ☞ Experiment: For two sets of address trees ➳ find matches (closest tree of other set) ➳ use different distance functions ➳ count correct matches
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 19
Experiments — Effectiveness for Real World Dataset
☞ pq -gram distance → effective approximation of tree edit distance ☞ test on real world data (address databases) ☞ Experiment: For two sets of address trees ➳ find matches (closest tree of other set) ➳ use different distance functions ➳ count correct matches
VLDB 2005, Trondheim
accuracy
correct
false pos.
runtime
edit dist
82.7%
248
9
187,538s
1,2-grams
78.3%
235
5
181s
2,3-grams
77.3%
232
4
204s
3,2-grams
79.3%
238
2
180s
tree-embedding
69.0%
207
8
313s
buttom-up
50.0%
150
12
237s
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 19
Experiments — Tree-Embedding vs. pq -Grams tree edit distance embedding
☞ variable shapes
VLDB 2005, Trondheim
pq -grams ☞ fixed shape
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 20
Experiments — Tree-Embedding vs. pq -Grams pq -grams ☞ fixed shape
tree edit distance embedding
☞ variable shapes ☞ elements can be ➳ single node (no structure) ➳ chains (only vertical structure) ➳ contiguous leaves (only horizontal structure) ➳ subtrees with vertical and horizontal structure
☞ horizontal and vertical structure
Typical address tree (nearly complete tree): Phase
0
1
2
3
4
5
6
tot.
tree-embed
pq -gram
single nodes
29
8
4
2
1
-
-
44
65%
0%
chains
-
1
1
-
-
-
-
2
3%
0%
cont. leaves
-
7
2
-
-
-
-
9
13%
0%
subtrees
-
-
3
4
3
2
1
13
19%
100%
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 20
Experiments — Tree-Embedding vs. pq -Grams pq -grams ☞ fixed shape
tree edit distance embedding
☞ variable shapes ☞ elements can be ➳ single node (no structure) ➳ chains (only vertical structure) ➳ contiguous leaves (only horizontal structure) ➳ subtrees with vertical and horizontal structure
☞ horizontal and vertical structure
☞ guarantees with respect to tree edit distance
☞ emphasizes structure
Typical address tree (nearly complete tree): Phase
0
1
2
3
4
5
6
tot.
tree-embed
pq -gram
single nodes
29
8
4
2
1
-
-
44
65%
0%
chains
-
1
1
-
-
-
-
2
3%
0%
cont. leaves
-
7
2
-
-
-
-
9
13%
0%
subtrees
-
-
3
4
3
2
1
13
19%
100%
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 20
Conclusion and Future Work
☞ pq -gram distance
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work:
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work: ➳ strict bounds for pq -gram approximations
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work: ➳ strict bounds for pq -gram approximations ➳ clustering of XML data
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
Conclusion and Future Work
☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work: ➳ strict bounds for pq -gram approximations ➳ clustering of XML data ➳ incremental updates of pq -gram profiles
VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 21
References [Chawathe and Garcia-Molina, 1997] Chawathe, S. S. and Garcia-Molina, H. (1997). Meaningful change detection in structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 26–37, Tucson, Arizona, United States. ACM Press. [Chawathe et al., 1996] Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996). Change detection in hierarchically structured information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 493–504, Montreal, Quebec, Canada. ACM Press. [Garofalakis and Kumar, 2003] Garofalakis, M. and Kumar, A. (2003). Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), pages 143–154, San Diego, California. ACM Press. [Garofalakis and Kumar, 2005] Garofalakis, M. and Kumar, A. (2005). XML stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279–332. [Jiang et al., 1995] Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees—an alternative to tree edit. Theoretical Computer Science, 143(1):137–148. [Klein, 1998] Klein, P. N. (1998). Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 91–102, Venice, Italy. Springer. VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 22
[Lee et al., 2004] Lee, K.-H., Choy, Y.-C., and Cho, S.-B. (2004). An efficient algorithm to compute differences between structured documents. IEEE Transactions on Knowledge and Data Engineering, 16(8):965–979. [Navarro, 2001] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88. [Selkow, 1977] Selkow, S. M. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186. [Tanaka and Tanaka, 1988] Tanaka, E. and Tanaka, K. (1988). The tree-to-tree editing problem. Int. Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2(2):221–240. [Ukkonen, 1992] Ukkonen, E. (1992). Approximate string-matching with q -grams and maximal matches. Theoretical Computer Science, 92(1):191–211. [Valiente, 2001] Valiente, G. (2001). An efficient bottom-up distance between trees. In Proceedings of the 8th Symposium on String Processing and Information Retrieval, pages 212–219, Laguna de San Rafael, Chile. IEEE Computer Science Press. [Yang, 1991] Yang, W. (1991). Identifying syntactic differences between two programs. Software—Practice & Experience, 21(7):739–755. [Zhang and Shasha, 1989] Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262. VLDB 2005, Trondheim
¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper
Page 23