Approximate Tree Matching with pq-Grams - Database Research Group

1 downloads 0 Views 455KB Size Report
7740 Trienter Str. 7860 Triester Str. 8580 Walther-v.-d.-Vogelweide-Pl. ... VLDB 2005, Trondheim. Nikolaus Augsten, Michael Böhlen, Johann Gamper. Page 2 ...
Approximate Tree Matching with pq-Grams

¨ Nikolaus Augstena , Michael Bohlen, Johann Gamper DIS - Center for Database and Information Systems Free University of Bozen-Bolzano, Italy

www.inf.unibz.it 1 – Motivation . . . . . . . . . 2 – Related Work . . . . . . . 3 – pq -Grams . . . . . . . . . 4 – Properties . . . . . . . . . 5 – Experiments . . . . . . . . 6 – Conclusion and Future Work

a

. . . . . .

. . . . . .

. . . . . .

Supported by the Municipality of Bozen-Bolzano.

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. 2 . 6 . 7 . 11 . 14 . 21

Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object.

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 2

Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object.

Land Register

Registration Office

LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...

VLDB 2005, Trondheim

RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3 SRO

SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...

resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...

street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE

id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...

street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 2

Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object. ☞ Example query: Who lives in Braun’s apartment? Land Register

Registration Office

LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...

VLDB 2005, Trondheim

RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3 SRO

SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...

resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...

street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE

id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...

street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 2

Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object. ☞ Example query: Who lives in Braun’s apartment? Land Register

Registration Office

LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...

resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...

SRO

SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...

VLDB 2005, Trondheim

RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3

street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE

?

id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...

street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 2

Motivation — Example Data Sources ☞ We want to link data items in different databases that correspond to the same real world object. ☞ Example query: Who lives in Braun’s apartment? Land Register

LR id num entr apt owner 91 1 1 Maier 91 1 2 Rossi 91 1 3 Maier 91 2 A Braun ... 74 3 A 1 Spiro 74 3 A 2 Barducci 74 3 A 3 Costanzi ...

Registration Office

!

VLDB 2005, Trondheim

RO id num entr apt 30 1 1 30 1 3 30 2 A 30 2 B 1 120 3 A 1 120 3 A 2 120 3 A 3 SRO

SLR id 139 109 185 91 165 115 259 207 33 263 262 285 266 ...

resident Pichler Rieder Fischer Rossi ... Spiro Barducci Costanzi ...

street SIEGESPLATZ GILMWEG P. R. GIULIANI STR. CESARE ABBA STRASSE MUSTERPLATZ ITALIENSTRASSE TELSERDURCHGANG SERNESIDURCHGANG BOZNER BODENWEG TRIESTER STRASSE TRIENTER STRASSE WALTHERPLATZ TURINER STRASSE

?

id 30 5220 3000 3030 3540 4440 7180 7590 7620 7650 7740 7860 8580 ...

street Giuseppe-Cesare-Abba-Str. Bozner-Boden-Str. Hermann-von-Gilm-Str. Pater-Reginaldo-Giuliani-Str. Italienallee Musterplatzl Raffaello-Sernesi-Galerie Telsergalerie Friedensplatz Turiner Str. Trienter Str. Triester Str. Walther-v.-d.-Vogelweide-Pl.

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 2

Motivation — Address Trees

☞ residential addresses are hierarchical → address tree

Address trees:

CESARE ABBA STRASSE

1 1 2 3

VLDB 2005, Trondheim

3

2 A

B 1 2 3 4

D

4 A B C

Giuseppe-Cesare-Abba-Str. 6

1

2

- A

B

13

1234

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

3 C

6

4 A

BC

123

Page 3

Motivation — Address Trees

☞ residential addresses are hierarchical → address tree ☞ Idea: corresponding streets ⇒ similar address tree How similar are two address trees? Address trees:

CESARE ABBA STRASSE

1 1 2 3

VLDB 2005, Trondheim

3

2 A

B 1 2 3 4

D

4 A B C

Giuseppe-Cesare-Abba-Str. 6

1

2

- A

B

13

1234

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

3 C

6

4 A

BC

123

Page 3

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other.

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 4

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′′

T a

b

x

b

c

d

e

f g

d

e

f g

h i k

h i

VLDB 2005, Trondheim

c

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 4

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′

T a

a

insert(k, e, 3) −→

b

c

d

e

h i

VLDB 2005, Trondheim

T′′

f g

b

x

b

c

d

e

f g

h i k

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

c

d

e

f g

h i k

Page 4

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′

T a

a

insert(k, e, 3) −→

b

c

d

e

f g

x

rename(a, x) −→

b

c

d

e

f g

h i k

h i

edit distance:

VLDB 2005, Trondheim

T′′

b

c

d

e

f g

h i k

disted (T, T′′ ) = 2

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 4

Motivation — Standard Solution: The Edit Distance

☞ Edit distance: Minimum cost sequence of edit operations (node insertion, node deletion, and label change) that transform one tree into an other. T′

T a

a

insert(k, e, 3) −→

b

c

d

e

T′′

f g

x

rename(a, x) −→

b

c

d

e

f g

h i k

h i

edit distance:

b

c

d

e

f g

h i k

disted (T, T′′ ) = 2

☞ Problem: Best algorithms O(n2 log2 (n)) ⇒ not scalable.

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 4

Motivation — Problem Definition

☞ Our goal: Find an efficient and effective approximation of the tree edit distance that ➠ is scalable for large trees, ➠ emphasizes structure.

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 5

Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n))

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 6

Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 6

Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: O(n2 ) for very different trees

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 6

Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: O(n2 ) for very different trees ➳ Tree-edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]: ➠ O(n log n) ➠ guaranteed distance distortion for tree edit distance with subtree move

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 6

Related Work — Tree Distances ☞ n → number of tree nodes ☞ Tree edit distance: ➳ for balanced trees [Zhang and Shasha, 1989]: O(n2 log2 (n)) ➳ for arbitrary trees [Klein, 1998]: O(n3 log(n)) ☞ Tree edit distance approximations: ➳ Restricted versions of the tree edit distance: ➠ Alignment [Jiang et al., 1995]: O(n2 ) ➠ Isolated subtree [Tanaka and Tanaka, 1988]: O(n2 ) ➠ Top-down [Selkow, 1977, Yang, 1991]: O(n2 ) ➠ Bottom-up [Valiente, 2001]: O(n) → only very specific domains ➳ XML versioning [Chawathe et al., 1996, Chawathe and Garcia-Molina, 1997, Lee et al., 2004]: O(n2 ) for very different trees ➳ Tree-edit distance embedding [Garofalakis and Kumar, 2003, Garofalakis and Kumar, 2005]: ➠ O(n log n) ➠ guaranteed distance distortion for tree edit distance with subtree move ☞ Related work for strings: ➳ Navarro [Navarro, 2001]: good overview of the edit distance for strings and its variants ➳ Ukkonen [Ukkonen, 1992]: q -grams as lower bound for string edit distance VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 6

pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :

2, 3-Extended Tree:

Patch boundaries by adding null nodes (*):

➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after

T a a b c

the last child of each non-leaf node

➳ q children to each leaf

T2,3 * a

−→

**

e b

a **

e

b b

c

**

********

******

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 7

pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :

2, 3-Extended Tree:

Patch boundaries by adding null nodes (*):

➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after

T a

a

a b c

the last child of each non-leaf node

➳ q children to each leaf

T2,3 *

−→

**

e b

☞ pq -Gram G: Subtree of Tpq . ➳ Anchor node ➳ with p − 1 ancestors ➳ and q children. Contiguous siblings in G are contiguous siblings in Tpq .

a **

e

b b

c

**

********

******

2, 3-Gram Pattern: •

p−1

• anchor

• • • q

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 7

pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :

2, 3-Extended Tree:

Patch boundaries by adding null nodes (*):

➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after

T a

a

a b c

the last child of each non-leaf node

➳ q children to each leaf

T2,3 *

−→

**

e b

☞ pq -Gram G: Subtree of Tpq . ➳ Anchor node ➳ with p − 1 ancestors ➳ and q children. Contiguous siblings in G are contiguous siblings in Tpq .

**

e

b b

**

c

********

******

2, 3-Gram Pattern: •

p−1



Example pq -Grams for T:

a

anchor

q

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* a

* a

• • •

VLDB 2005, Trondheim

a

b c * a b 2, 3-gram 1, 2-gram

a

e b 3, 2-gram

Page 7

pq -Grams — Subtrees of the pq -Extended Tree ☞ Extended Tree Tpq :

2, 3-Extended Tree:

Patch boundaries by adding null nodes (*):

T a

➳ p − 1 ancestors to the root ➳ q − 1 nodes before the first and after

a

a b c

the last child of each non-leaf node

➳ q children to each leaf

−→

**

e b

☞ pq -Gram G: Subtree of Tpq . ➳ Anchor node ➳ with p − 1 ancestors ➳ and q children. Contiguous siblings in G are contiguous siblings in Tpq . ☞ pq -gram Profile Pp,q (T): ➳ Bag of all pq -grams of T. 2, 3-Gram Profile of T:

T2,3 *

a **

e

b

**

c

********

b

******

2, 3-Gram Pattern: •

Example pq -Grams for T:

p−1

a

*



a

a

a

anchor

• • •

b c * a b 2, 3-gram 1, 2-gram

q

*

e b 3, 2-gram

*

a

a

a

a

a

a

*

a

*

a

*

*

a

a

e

a

b

a

a

a

b

a

c

a

a

* * a * * e * * * * e b * * * e b * b * * * a b * * * a b c * * * b c * c * * VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 7

pq -Grams — Algorithm for pq -Gram Profile

P

T a a e

b

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc * *

P

T a a e

b

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a a e

b

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a *

*

a

*

b

c

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a

*

a

*

b

c

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a

*

a

*

b

* a * * a

c

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a

*

a

*

b

* a * * a

c

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a e

b

* a * * a

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a

*

*

sib

VLDB 2005, Trondheim

e

b

* a * * a

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a

*

*

sib

VLDB 2005, Trondheim

e

b

* a * * a a a * * e

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a

*

*

sib

VLDB 2005, Trondheim

e

b

* a * * a a a * * e

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

T anc a e

a b

* a * * a a a * * e

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

T anc a e *

*

sib

VLDB 2005, Trondheim

a b

* a * * a a a * * e

c

b *

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

T anc a e *

*

sib

VLDB 2005, Trondheim

a b

* a * * a a a * * e a e * * *

c

b *

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a

*

e

sib

VLDB 2005, Trondheim

b

* a * * a a a * * e a e * * *

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a

*

e

sib

VLDB 2005, Trondheim

b

c

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

* a * * a a a * a a a * a

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b

e * * * b * * *

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a e

b

sib

VLDB 2005, Trondheim

c

b

*

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

* a * * a a a * a a a * a a a e

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b * * * *

Page 8

pq -Grams — Algorithm for pq -Gram Profile

P

an c

T a a e

c

b b

*

sib

VLDB 2005, Trondheim

*

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

* a * * a a a * a a a * a a a e a a b

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b

* * * * * *

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a a

*

b

c

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

* a * * a a a * a a a * a a a e a a b * a * a b

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b

* * * * * *

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a a

b

c

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b

* * * * * *

* *

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a a

b

c

*

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c a c * * a b c *

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b

* * * * * *

* * * *

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a a

b

c

*

*

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c a c * * a b c * * a c * *

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b

* * * * * *

* * * *

Page 8

pq -Grams — Algorithm for pq -Gram Profile anc *

P

T a a

b

c

*

*

sib e

b

1 CREATE P ROFILE(T, p, q, P, r, anc) 2 anc := shift(anc, l(r)) 3 sib: shift register of size q (initialized with *) 4 5 if r is a leaf then 6 P := P ∪ (anc ◦ sib) 7 else 8 for each child c (from left to right) of r do 9 sib := shift(sib, l(c)) 10 P := P ∪ (anc ◦ sib) 11 P :=PROFILE(T, p, q, P, c, anc) 12 for k := 1 to q − 1 13 sib := shift(sib, *) 14 P := P ∪ (anc ◦ sib) 15 return P

VLDB 2005, Trondheim

* a * * a a a * a a a * a a a e a a b * a * a b a b * * a a b c a c * * a b c * * a c * *

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

* e e b b

e * * * b

* * * * * *

* * * *

Page 8

pq -Grams — The pq -Gram Profile The pq -gram profile is

☞ small → size O(n)

Theorem 1 For tree and i non-leaves:

T with l leaves

| Pp,q (T)| = 2l + qi − 1.

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 9

pq -Grams — The pq -Gram Profile The pq -gram profile is

☞ small → size O(n) ☞ easy to store ➳ represent the pq -grams by fingerprint hash value ➳ store profile in single-attribute relation

T a a b c e b

VLDB 2005, Trondheim



P2,3 (T) pq−gram (*,a,*,*,a) (a,a,*,*,*) (a,e,*,*,*) (a,a,*,e,b) (a,b,*,*,*) (a,a,e,b,*) (a,a,b,*,*) (*,a,*,a,b) (a,b,*,*,*) (*,a,a,b,c) (a,c,*,*,*) (*,a,b,c,*) (*,a,c,*,*)

Theorem 1 For tree and i non-leaves:

T with l leaves

| Pp,q (T)| = 2l + qi − 1.

P2,3 (T)



¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

hash 10AE 2F1E 1008 13E1 5F31 AE1D 13DF F310 5F31 45A1 973F 3F1E 11EF

Page 9

pq -Grams — The pq -Gram Profile The pq -gram profile is

☞ small → size O(n) ☞ easy to store ➳ represent the pq -grams by fingerprint hash value ➳ store profile in single-attribute relation

Theorem 1 For tree and i non-leaves:

T with l leaves

| Pp,q (T)| = 2l + qi − 1.

☞ allows effective distance computation between trees T a a b c e b

VLDB 2005, Trondheim



P2,3 (T) pq−gram (*,a,*,*,a) (a,a,*,*,*) (a,e,*,*,*) (a,a,*,e,b) (a,b,*,*,*) (a,a,e,b,*) (a,a,b,*,*) (*,a,*,a,b) (a,b,*,*,*) (*,a,a,b,c) (a,c,*,*,*) (*,a,b,c,*) (*,a,c,*,*)

P2,3 (T)



¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

hash 10AE 2F1E 1008 13E1 5F31 AE1D 13DF F310 5F31 45A1 973F 3F1E 11EF

Page 9

pq -Grams — pq -Gram Distance

☞ Definition 1 For two trees T1 and T2 the pq -gram distance is:

∆p,q (T1 , T2 ) = 1 − 2

VLDB 2005, Trondheim

| Pp,q (T1 ) ∩ Pp,q (T2 )| | Pp,q (T1 ) ∪ Pp,q (T2 )|

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 10

pq -Grams — pq -Gram Distance

☞ Definition 1 For two trees T1 and T2 the pq -gram distance is:

∆p,q (T1 , T2 ) = 1 − 2

| Pp,q (T1 ) ∩ Pp,q (T2 )| | Pp,q (T1 ) ∪ Pp,q (T2 )|

☞ can be computed in O(n log n) time and O(n) space (bag intersection of relations)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 10

pq -Grams — pq -Gram Distance

☞ Definition 1 For two trees T1 and T2 the pq -gram distance is:

∆p,q (T1 , T2 ) = 1 − 2

| Pp,q (T1 ) ∩ Pp,q (T2 )| | Pp,q (T1 ) ∪ Pp,q (T2 )|

☞ can be computed in O(n log n) time and O(n) space (bag intersection of relations) ☞ other terms are constants for normalization: ➳ ∆p,q (T1 , T2 ) = 1 if trees have no pq -grams in common ➳ ∆p,q (T1 , T2 ) = 0 if trees have the same pq -gram profile

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 10

Properties — Sensitivity to Structure Change

☞ Intuition: Nodes with structural information → more significant

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 11

Properties — Sensitivity to Structure Change

☞ Intuition: Nodes with structural information → more significant

T a c

b d

e

f g

h i k

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 11

Properties — Sensitivity to Structure Change

☞ Intuition: Nodes with structural information → more significant

disted = 2 ∆2,3 = 0.30

T′ a b

c d e f h i

VLDB 2005, Trondheim

T



a c

b d

e

f g

h i k

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 11

Properties — Sensitivity to Structure Change

☞ Intuition: Nodes with structural information → more significant

disted = 2 ∆2,3 = 0.30

T′ a b

c d e f h i

VLDB 2005, Trondheim

disted = 2 ∆2,3 = 0.89

T





a c

b d

e

T′′ a b d h i k f g

f g

h i k

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 11

Properties — Sensitivity to Structure Change

☞ Intuition: Nodes with structural information → more significant ☞ Address application: Mismatch of houses (with subnumbers and apartment numbers) is more significant than mismatch of apartments.

disted = 2 ∆2,3 = 0.30

T′ a b

c d e f h i

VLDB 2005, Trondheim

disted = 2 ∆2,3 = 0.89

T





a c

b d

e

T′′ a b d h i k f g

f g

h i k

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 11

Properties — Sensitive to Structure Change

☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 12

Properties — Sensitive to Structure Change

☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases ☞ cntpq (T, v) ≈ q + f p ➳ leaf change: cost depends only on q ➳ non-leaf change: p prevalent

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 12

Properties — Sensitive to Structure Change

☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases ☞ cntpq (T, v) ≈ q + f p ➳ leaf change: cost depends only on q ➳ non-leaf change: p prevalent

Theorem 2 For a complete tree T (fanout f , depth d) the number of pq -grams that contain a node v of level l is: ( f p −1 if p ≤ d − l f −1 (f + q − 1)

cntpq (T, v) = q sgn(l) +

VLDB 2005, Trondheim

f d−l −1 f −1 (f

+ q − 1) + f d−l if p > d − l.

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 12

Properties — Sensitive to Structure Change

☞ node changes ⇒ pq -grams change ☞ pq -grams change ⇒ distance increases ☞ cntpq (T, v) ≈ q + f p ➳ leaf change: cost depends only on q ➳ non-leaf change: p prevalent ➳ p controls structure sensitivity

Theorem 2 For a complete tree T (fanout f , depth d) the number of pq -grams that contain a node v of level l is: ( f p −1 if p ≤ d − l f −1 (f + q − 1)

cntpq (T, v) = q sgn(l) +

VLDB 2005, Trondheim

f d−l −1 f −1 (f

+ q − 1) + f d−l if p > d − l.

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 12

Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes!

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 13

Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 13

Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change

☞ Local changes → less pq -grams change ➳ neighbored nodes “share” pq -grams ➳ change counts only once.

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 13

Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change

☞ Local changes → less pq -grams change ➳ neighbored nodes “share” pq -grams ➳ change counts only once. Theorem 3 Delete or update all nodes of subtree with l leaves and i non-leaves ⇒ only 2l + iq pq -grams change

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

+q−1

Page 13

Properties — Robust to Local Changes ☞ Intuition: Weight local changes less than distributed changes! ☞ Address application: missing house with apartment numbers → small difference, but many nodes change

☞ Local changes → less pq -grams change ➳ neighbored nodes “share” pq -grams ➳ change counts only once. Theorem 3 Delete or update all nodes of subtree with l leaves and i non-leaves ⇒ only 2l + iq pq -grams change Example: Delete subtree rooted in e

• e → in 11 pq -grams • h,i, k → in 4 pq -grams each • actually changing: 11 pq -grams (vs. 3 × 4 + 11 = 23)

T

b

+q−1

a c

d

e

f g

h i k

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 13

Experiments — Sensitivity to Structure Changes ☞ Cost for leaf change → depends only on q

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 14

Experiments — Sensitivity to Structure Changes ☞ Cost for leaf change → depends only on q ☞ Experiment: ➳ delete leaf nodes ➳ measure edit distance

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 14

Experiments — Sensitivity to Structure Changes ☞ Cost for leaf change → depends only on q ☞ Experiment: ➳ delete leaf nodes ➳ measure edit distance 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0

vary q

1,3-grams 2,3-grams 3,3-grams 4,3-grams

0

2

4

6

pq-gram distance

pq-gram distance

vary p

8

10 12 14 16 18 20

0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0

2,1-grams 2,2-grams 2,3-grams 2,4-grams

0

2

4

edit distance

6

8

10 12 14 16 18 20

edit distance

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 14

Experiments — Sensitivity to Structure Changes ☞ Cost for non-leaf change → controlled by p

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 15

Experiments — Sensitivity to Structure Changes ☞ Cost for non-leaf change → controlled by p ☞ Experiment: ➳ delete non-leaf nodes ➳ measure edit distance

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 15

Experiments — Sensitivity to Structure Changes ☞ Cost for non-leaf change → controlled by p ☞ Experiment: ➳ delete non-leaf nodes ➳ measure edit distance 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

vary q 0.7

1,3-grams 2,3-grams 3,3-grams 4,3-grams

pq-gram distance

pq-gram distance

vary p 2,1-grams 2,2-grams 2,3-grams 2,4-grams

0.6 0.5 0.4 0.3 0.2 0.1 0

0

2

4

6

8 10 12 14 16 18 20

0

2

4

edit distance

6

8 10 12 14 16 18 20 edit distance

(Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 15

Experiments — Robustness to Local Changes ☞ Subtree deletions → cheaper than distributed deletions

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 16

Experiments — Robustness to Local Changes ☞ Subtree deletions → cheaper than distributed deletions ☞ Experiment: ➳ delete subtree ➳ randomly delete same number of distributed nodes ➳ compare edit distance

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 16

Experiments — Robustness to Local Changes ☞ Subtree deletions → cheaper than distributed deletions ☞ Experiment: ➳ delete subtree ➳ randomly delete same number of distributed nodes ➳ compare edit distance

2,3-gram-distance

0.35 distributed changes local changes

0.3 0.25 0.2 0.15 0.1 0.05 0 0

5

10

15

20

edit distance (Artificial tree with 144 nodes, 102 leaves, fanout 2–6 and depth 6. Average over 100 runs.)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 16

Experiments — Scalability to Large Trees ☞ pq -gram distance → scalable to large trees ☞ compare with edit distance

a

implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 17

Experiments — Scalability to Large Trees ☞ pq -gram distance → scalable to large trees ☞ compare with edit distance ☞ Experiment: For pair of trees ➳ compute tree edit distancea and pq -gram distance ➳ vary tree size: up 2 × 106 nodes ➳ measure wall clock time

a

implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 17

Experiments — Scalability to Large Trees ☞ pq -gram distance → scalable to large trees ☞ compare with edit distance ☞ Experiment: For pair of trees ➳ compute tree edit distancea and pq -gram distance ➳ vary tree size: up 2 × 106 nodes ➳ measure wall clock time 27 hours

100000

edit dist 2,3-gram dist

10000

time [sec]

1000 100 10 1 0.1 0.01 0.001 1

10

100

1000

10000 100000 1e+06

number of nodes (n) a

implementation by Zhang and Shasha (http://www.cs.nyu.edu/cs/faculty/shasha/papers/tree.html)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 17

Experiments — Influence of p and q on Scalability ☞ Scalability independent of p and q .

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 18

Experiments — Influence of p and q on Scalability ☞ Scalability independent of p and q . ☞ Experiment: For pair of trees ➳ compute pq -gram distance for varying p and q ➳ vary tree size: up 106 nodes ➳ measure wall clock time

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 18

Experiments — Influence of p and q on Scalability ☞ Scalability independent of p and q . ☞ Experiment: For pair of trees ➳ compute pq -gram distance for varying p and q ➳ vary tree size: up 106 nodes ➳ measure wall clock time 25

3,4-gram dist 2,3-gram dist 1,2-gram dist

time [sec]

20

15

10

5

0 0

100000

200000

300000

400000

500000

number of nodes (n)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 18

Experiments — Effectiveness for Real World Dataset

☞ pq -gram distance → effective approximation of tree edit distance ☞ test on real world data (address databases)

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 19

Experiments — Effectiveness for Real World Dataset

☞ pq -gram distance → effective approximation of tree edit distance ☞ test on real world data (address databases) ☞ Experiment: For two sets of address trees ➳ find matches (closest tree of other set) ➳ use different distance functions ➳ count correct matches

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 19

Experiments — Effectiveness for Real World Dataset

☞ pq -gram distance → effective approximation of tree edit distance ☞ test on real world data (address databases) ☞ Experiment: For two sets of address trees ➳ find matches (closest tree of other set) ➳ use different distance functions ➳ count correct matches

VLDB 2005, Trondheim

accuracy

correct

false pos.

runtime

edit dist

82.7%

248

9

187,538s

1,2-grams

78.3%

235

5

181s

2,3-grams

77.3%

232

4

204s

3,2-grams

79.3%

238

2

180s

tree-embedding

69.0%

207

8

313s

buttom-up

50.0%

150

12

237s

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 19

Experiments — Tree-Embedding vs. pq -Grams tree edit distance embedding

☞ variable shapes

VLDB 2005, Trondheim

pq -grams ☞ fixed shape

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 20

Experiments — Tree-Embedding vs. pq -Grams pq -grams ☞ fixed shape

tree edit distance embedding

☞ variable shapes ☞ elements can be ➳ single node (no structure) ➳ chains (only vertical structure) ➳ contiguous leaves (only horizontal structure) ➳ subtrees with vertical and horizontal structure

☞ horizontal and vertical structure

Typical address tree (nearly complete tree): Phase

0

1

2

3

4

5

6

tot.

tree-embed

pq -gram

single nodes

29

8

4

2

1

-

-

44

65%

0%

chains

-

1

1

-

-

-

-

2

3%

0%

cont. leaves

-

7

2

-

-

-

-

9

13%

0%

subtrees

-

-

3

4

3

2

1

13

19%

100%

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 20

Experiments — Tree-Embedding vs. pq -Grams pq -grams ☞ fixed shape

tree edit distance embedding

☞ variable shapes ☞ elements can be ➳ single node (no structure) ➳ chains (only vertical structure) ➳ contiguous leaves (only horizontal structure) ➳ subtrees with vertical and horizontal structure

☞ horizontal and vertical structure

☞ guarantees with respect to tree edit distance

☞ emphasizes structure

Typical address tree (nearly complete tree): Phase

0

1

2

3

4

5

6

tot.

tree-embed

pq -gram

single nodes

29

8

4

2

1

-

-

44

65%

0%

chains

-

1

1

-

-

-

-

2

3%

0%

cont. leaves

-

7

2

-

-

-

-

9

13%

0%

subtrees

-

-

3

4

3

2

1

13

19%

100%

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 20

Conclusion and Future Work

☞ pq -gram distance

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work:

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work: ➳ strict bounds for pq -gram approximations

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work: ➳ strict bounds for pq -gram approximations ➳ clustering of XML data

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

Conclusion and Future Work

☞ pq -gram distance ➳ scalable to large trees ➳ emphasizes structure ➳ robust to local change ➳ effective approximation of tree edit distance ☞ Ongoing and future work: ➳ strict bounds for pq -gram approximations ➳ clustering of XML data ➳ incremental updates of pq -gram profiles

VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 21

References [Chawathe and Garcia-Molina, 1997] Chawathe, S. S. and Garcia-Molina, H. (1997). Meaningful change detection in structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 26–37, Tucson, Arizona, United States. ACM Press. [Chawathe et al., 1996] Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996). Change detection in hierarchically structured information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 493–504, Montreal, Quebec, Canada. ACM Press. [Garofalakis and Kumar, 2003] Garofalakis, M. and Kumar, A. (2003). Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), pages 143–154, San Diego, California. ACM Press. [Garofalakis and Kumar, 2005] Garofalakis, M. and Kumar, A. (2005). XML stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279–332. [Jiang et al., 1995] Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees—an alternative to tree edit. Theoretical Computer Science, 143(1):137–148. [Klein, 1998] Klein, P. N. (1998). Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th European Symposium on Algorithms, volume 1461 of Lecture Notes in Computer Science, pages 91–102, Venice, Italy. Springer. VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 22

[Lee et al., 2004] Lee, K.-H., Choy, Y.-C., and Cho, S.-B. (2004). An efficient algorithm to compute differences between structured documents. IEEE Transactions on Knowledge and Data Engineering, 16(8):965–979. [Navarro, 2001] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88. [Selkow, 1977] Selkow, S. M. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186. [Tanaka and Tanaka, 1988] Tanaka, E. and Tanaka, K. (1988). The tree-to-tree editing problem. Int. Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2(2):221–240. [Ukkonen, 1992] Ukkonen, E. (1992). Approximate string-matching with q -grams and maximal matches. Theoretical Computer Science, 92(1):191–211. [Valiente, 2001] Valiente, G. (2001). An efficient bottom-up distance between trees. In Proceedings of the 8th Symposium on String Processing and Information Retrieval, pages 212–219, Laguna de San Rafael, Chile. IEEE Computer Science Press. [Yang, 1991] Yang, W. (1991). Identifying syntactic differences between two programs. Software—Practice & Experience, 21(7):739–755. [Zhang and Shasha, 1989] Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262. VLDB 2005, Trondheim

¨ Nikolaus Augsten, Michael Bohlen, Johann Gamper

Page 23