Instance-based OWL Schema Matching


Luiz André P. Paes Leme, Marco A. Casanova, Karin K. Breitman, Antonio L. Furtado Department of Informatics – Pontifical Catholic University of Rio de Janeiro {lleme, casanova, karin, furtado}@inf.puc-rio.br

Topics

• Introduction
• OWL Extralite
• Comments on similarity models
• Vocabulary matching
• Concept mapping
• Summary

14/5/2009


Introduction

1. Discovering new data sources
2. Capturing the export schemas of the data sources
3. Matching the export schemas
4. Translating user queries
5. Dealing with huge volumes of data
6. Dynamic matching
   – Data sources join or leave the mediated environment at will


Introduction

• Strategy
  – Instance-based schema matching technique
• Assumptions
  – The export schemas are known
  – The schemas are defined using a subset of OWL which has the same expressiveness as UML
  – Queries on data sources use SPARQL


OWL Extralite

• Schema elements
  – Classes
  – Subclasses
  – Datatype properties
  – Object properties
  – Individuals
• Property domains and ranges
  – Property domains are named classes
  – Ranges of datatype properties are XML Schema datatypes
  – Ranges of object properties are named classes
• Restrictions
  – Minimum cardinality
  – Maximum cardinality
  – Inverse functional (primary keys)

OWL Extralite

[Figure: Amazon schema in OWL Extralite]

OWL Extralite

[Figure: eBay schema in OWL Extralite]


Comments on similarity models

• Considerations
  – A model may use different representations for the vocabulary elements
  – A model depends on the choice of the similarity function
    • A model may combine several similarity functions
  – A model has to be calibrated

Comments on similarity models

• Property similarity by observed values
  – Given two properties Ai and Bj, if the set of observed values of Ai is similar to the set of observed values of Bj, then Ai and Bj are similar
  – Example
    OV((am:title,am:Book)) = {“King Lear”, “Romeo and Juliet”, “Hamlet”, “Othello”, “Dom Casmurro”}
    OV((eb:title,eb:Book)) = {“King Lear”, “Romeo and Juliet”, “Hamlet”, “Macbeth”, “Dom Casmurro”, “Quincas Borba”}
    OV((eb:title,eb:Music)) = {“Romeo and Juliet (Royal Ballet) - Rudolf Nureyev and Margot Fonteyn”, “Ambroise Thomas - Hamlet - Barcelona Opera”, “Othello Syndrome”, “Verdi: Macbeth”}
    (am:title,am:Book) × (eb:title,eb:Book) → 57% similarity
    (am:title,am:Book) × (eb:title,eb:Music) → 0% similarity
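The slide does not name the similarity function behind the 57% figure. As a minimal sketch, assuming a plain Jaccard coefficient over the sets of observed values (the function name `jaccard` and the variable names are illustrative), the example can be reproduced as follows; Jaccard happens to give 4/7, which is consistent with the reported 57%.

```python
def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Observed values copied from the slide.
ov_am_book = {"King Lear", "Romeo and Juliet", "Hamlet", "Othello", "Dom Casmurro"}
ov_eb_book = {"King Lear", "Romeo and Juliet", "Hamlet", "Macbeth",
              "Dom Casmurro", "Quincas Borba"}
ov_eb_music = {
    "Romeo and Juliet (Royal Ballet) - Rudolf Nureyev and Margot Fonteyn",
    "Ambroise Thomas - Hamlet - Barcelona Opera",
    "Othello Syndrome",
    "Verdi: Macbeth",
}

sim_book = jaccard(ov_am_book, ov_eb_book)    # 4 shared values, 7 distinct values
sim_music = jaccard(ov_am_book, ov_eb_music)  # no value is shared verbatim
print(f"{sim_book:.0%} {sim_music:.0%}")
```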

Comments on similarity models

• Property similarity by observed values
  – Example
    OV((am:title,am:Book)) = {“King Lear (new)”, “Romeo and Juliet”, “Hamlet”, “Othello”, “Dom Casmurro”}
    OV((eb:title,eb:Book)) = {“King Lear”, “Romeo and Juliet (Arkangel Complete Shakespeare)”, “Hamlet (Shakespeare)”, “Macbeth”, “Dom Casmurro”, “Quincas Borba”}
    OV((eb:title,eb:Music)) = {“Romeo and Juliet (Royal Ballet) - Rudolf Nureyev and Margot Fonteyn”, “Ambroise Thomas - Hamlet - Barcelona Opera”, “Othello Syndrome”, “Verdi: Macbeth”}
    (am:title,am:Book) × (eb:title,eb:Book) → 25% similarity
    (am:title,am:Book) × (eb:title,eb:Music) → 0% similarity

Comments on similarity models

• Property similarity by observed tokens
  – Example
    TK((am:title,am:Book)) = {“King”, “Lear”, “new”, “Romeo”, “Juliet”, “Hamlet”, “Othello”, “Dom”, “Casmurro”}
    TK((eb:title,eb:Book)) = {“King”, “Lear”, “Romeo”, “Juliet”, “Arkangel”, “Complete”, “Shakespeare”/2, “Hamlet”, “Macbeth”, “Dom”, “Casmurro”, “Quincas”, “Borba”}
    TK((eb:title,eb:Music)) = {“Romeo”, “Juliet”, “Royal”, “Ballet”, “Rudolf”, “Nureyev”, “Margot”, “Fonteyn”, “Ambroise”, “Thomas”, “Hamlet”, “Barcelona”, “Opera”, “Othello”, “Syndrome”, “Verdi”, “Macbeth”}
    (am:title,am:Book) × (eb:title,eb:Book) → 50% similarity
    (am:title,am:Book) × (eb:title,eb:Music) → 15% similarity
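The token sets above can be derived mechanically from the observed values. The sketch below is one possible tokenizer, not the authors' implementation: the regular expression, the stop-word list (`STOP_WORDS`, chosen because the slide's token lists omit "and") and the Jaccard coefficient are all assumptions. With these choices the book-to-book similarity comes out at 7/15 (about 47%), close to but not identical with the 50% on the slide, which suggests the slide used a different or rounded measure.

```python
import re

STOP_WORDS = {"and"}  # assumption: the slide's token lists omit "and"

def tokens(values):
    """Set of alphabetic tokens found in a collection of strings."""
    found = set()
    for value in values:
        for token in re.findall(r"[A-Za-z]+", value):
            if token.lower() not in STOP_WORDS:
                found.add(token)
    return found

def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

tk_am = tokens(["King Lear (new)", "Romeo and Juliet", "Hamlet",
                "Othello", "Dom Casmurro"])
tk_eb = tokens(["King Lear", "Romeo and Juliet (Arkangel Complete Shakespeare)",
                "Hamlet (Shakespeare)", "Macbeth", "Dom Casmurro", "Quincas Borba"])

sim = jaccard(tk_am, tk_eb)  # 7 shared tokens out of 15 distinct tokens
```

Because this variant works on sets, the multiplicity noted on the slide (“Shakespeare”/2) is ignored; a multiset representation would keep it.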

Comments on similarity models

• Property similarity by observed values
  – Example
    OV((am:edition,S)) = {1, 2, 3, 4}
    OV((eb:rating,S)) = {1, 2, 3, 4}
    am:edition × eb:rating → 100% similarity

Comments on similarity models

• Property similarity by (instance, value) pairs
  – considers (instance, value) pairs instead of the property values alone
  – requires instance alignment
  – Example:
    • book catalogs:
      – am:edition and eb:rating may have different values for the same book
• Example of (instance, value) pairs
  IV((am:edition,S)) = {(“Hamlet”,1), (“Othello”,2), (“Macbeth”,3), (“Dom Casmurro”,4)}
  IV((eb:rating,S)) = {(“Hamlet”,2), (“Othello”,4), (“Macbeth”,1), (“Dom Casmurro”,3)}
  am:edition × eb:rating → 0% similarity
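The contrast between the two representations can be checked directly. This is a small sketch assuming Jaccard as the similarity function (the slides do not name one for this example): comparing value sets alone makes am:edition and eb:rating look identical, while comparing (instance, value) pairs separates them.

```python
def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# (instance, value) pairs copied from the slide.
iv_am_edition = {("Hamlet", 1), ("Othello", 2), ("Macbeth", 3), ("Dom Casmurro", 4)}
iv_eb_rating = {("Hamlet", 2), ("Othello", 4), ("Macbeth", 1), ("Dom Casmurro", 3)}

# Values only: both properties range over {1, 2, 3, 4}.
value_sim = jaccard({v for _, v in iv_am_edition}, {v for _, v in iv_eb_rating})

# Pairs: no book carries the same value under both properties.
pair_sim = jaccard(iv_am_edition, iv_eb_rating)
```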

Comments on similarity models

• Contrast Model
  – The similarity between X and Y increases with the number of common features and decreases with the number of features that belong only to X or only to Y.

$$\tau(S_1, S_2) = \theta f(S_1 \cap S_2) - \alpha f(S_1 - S_2) - \beta f(S_2 - S_1)$$

$$\mathit{similarity}^{C}_{\theta,\alpha,\beta}(S_1, S_2) = \left[ 1 - \log \frac{N(S_1 \cap S_2)^{(\theta+\alpha+\beta)}}{N(S_1)^{\alpha}\, N(S_2)^{\beta}} \right]^{-1}, \qquad N(S) = |S|$$
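To make the linear form of the contrast model concrete, here is a minimal sketch that takes f to be set cardinality and uses unit weights. Both choices are assumptions made for illustration; the slides do not fix f, and the logarithmic similarity variant is not implemented here.

```python
def contrast(s1, s2, theta=1.0, alpha=1.0, beta=1.0):
    """Linear contrast score with f = set cardinality (an assumption):
    theta*|S1 ∩ S2| - alpha*|S1 - S2| - beta*|S2 - S1|."""
    return theta * len(s1 & s2) - alpha * len(s1 - s2) - beta * len(s2 - s1)

am_titles = {"King Lear", "Romeo and Juliet", "Hamlet", "Othello", "Dom Casmurro"}
eb_titles = {"King Lear", "Romeo and Juliet", "Hamlet", "Macbeth",
             "Dom Casmurro", "Quincas Borba"}

score = contrast(am_titles, eb_titles)  # 4 common, 1 only in am, 2 only in eb
heavier_penalty = contrast(am_titles, eb_titles, alpha=3.0, beta=3.0)
```

Raising alpha and beta penalizes unshared features more strongly, which is the calibration knob the model exposes.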

Comments on similarity models

• Cosine and TF-IDF
  – terms that occur frequently in a document contribute more to the similarity
  – terms that are rare in the set of all documents contribute more to the similarity
• Cosine similarity:

$$\mathit{similarity}(S_1, S_2) = \frac{\vec{S_1} \cdot \vec{S_2}}{\lVert \vec{S_1} \rVert \, \lVert \vec{S_2} \rVert}$$

• Representation of strings as bags of words
  – set: S = {t1, t2, ..., tn}
  – multiset: S = ({t1, t2, ..., tn}, {(t1,ft1), (t2,ft2), …, (tn,ftn)})
• Representation of strings as vectors
  – TF: $\vec{S} = (f_{t_1}, f_{t_2}, \ldots, f_{t_n})$
  – TF/IDF: $\vec{S} = (w_{t_1}, w_{t_2}, \ldots, w_{t_n})$, where

$$w_{t_n} = \frac{f_{t_n}}{\max_{t \in S} f_t} \, \log \frac{N}{N_{t_n}}$$
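The weighting and the cosine measure can be sketched in a few lines of standard-library Python. The reading of N as the number of documents and N_t as the number of documents containing token t follows the usual TF-IDF convention and is an assumption here, as are the function names and the toy corpus.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """Sparse vector token -> (f_t / max f) * log(N / N_t).
    `doc_tokens` must be one of the token lists in `corpus`."""
    tf = Counter(doc_tokens)
    max_f = max(tf.values())
    n_docs = len(corpus)
    return {
        t: (f / max_f) * math.log(n_docs / sum(1 for d in corpus if t in d))
        for t, f in tf.items()
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 for a zero vector."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus of tokenized titles (illustrative data).
corpus = [["hamlet", "shakespeare"],
          ["hamlet", "barcelona", "opera"],
          ["dom", "casmurro"]]
v0, v1, v2 = (tfidf_vector(doc, corpus) for doc in corpus)
print(cosine(v0, v1), cosine(v0, v2))
```

A token that occurs in every document gets weight log(1) = 0, so it no longer contributes to the similarity.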

Comments on similarity models



• Estimated Mutual Information

$$\mathit{similarity}(A_r, B_s) = \frac{m_{rs}}{\sum_{i,j} m_{ij}} \, \log \left( \frac{\dfrac{m_{rs}}{\sum_{i,j} m_{ij}}}{\dfrac{\sum_{j} m_{rj}}{\sum_{i,j} m_{ij}} \cdot \dfrac{\sum_{i} m_{is}}{\sum_{i,j} m_{ij}}} \right)$$
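The slide does not define m. As a sketch, assume m is a co-occurrence matrix in which m[i][j] counts the values (or tokens) shared by property A_i and property B_j; the convention that a zero cell contributes 0 is also an assumption.

```python
import math

def emi(m, r, s):
    """Estimated mutual information of cell (r, s) of co-occurrence matrix m."""
    total = sum(sum(row) for row in m)
    if total == 0 or m[r][s] == 0:
        return 0.0  # assumed convention: 0 * log(0) is taken as 0
    p_rs = m[r][s] / total                 # joint estimate
    p_r = sum(m[r]) / total                # row marginal
    p_s = sum(row[s] for row in m) / total  # column marginal
    return p_rs * math.log(p_rs / (p_r * p_s))

# Illustrative matrix: A_0 shares values only with B_0, A_1 only with B_1.
m = [[4, 0],
     [0, 4]]
print(emi(m, 0, 0), emi(m, 0, 1))
```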


Vocabulary matching

• Vocabulary matching
  – Example
    • property am:title, when applied to instances of type am:Book, is equivalent to property eb:title when applied to instances of type eb:Book
  – Vocabulary elements
    • Contextualized properties
    • Classes

S (Amazon)               T (eBay)
Element     Context      Element        Context
am:title    am:Book      eb:title       eb:Book
am:title    am:Music     eb:title       eb:Music
am:name     am:Publ      eb:publisher   eb:Book
am:Book     T            eb:Book        T
am:Music    T            eb:Music       T

Vocabulary matching

Four-step matching process

1. Compute a temporary contextualized property matching
   – use the set of tokens extracted from the property values (from the set of values)
2. Compute a class matching
   – use the set of properties of each class (from the schema)
   – use property compatibility (from Step 1)
3. Compute an instance matching
   – use class compatibility (from Step 2)
   – use the temporary property matching (from Step 1)
   – use the set of tokens extracted from the values of matching properties (from the set of values)
4. Refine the contextualized property matching
   – use property datatype (from the schema)
   – use context compatibility (from Step 2)
   – use the set of tokens extracted from the property values (from the set of values)
   – use the set of instance-value pairs (from Step 3)


Vocabulary matching

Step 1 – Compute a temporary property matching

• Representation of contextualized properties
  – datatype of property p (from the schema)
  – Tkp = set of observed tokens of property p (from the set of values)
• Contextualized property matching (properties p and q)
  – if p and q have different datatypes then p does not match q
  – otherwise, if similarity(Tkp,Tkq) ≥ threshold then p matches q
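Step 1 can be sketched as a filter over all source and target property pairs. The class `ContextualizedProperty`, the function `temporary_matching`, the sample token sets and the use of Jaccard are all illustrative assumptions; the experiment reported later uses the Contrast Model for this step.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextualizedProperty:
    name: str          # e.g. "am:title"
    context: str       # class in which the property is observed
    datatype: str      # taken from the schema
    tokens: frozenset  # Tkp, taken from the observed values

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def temporary_matching(source, target, threshold):
    """Tuples (p, context of p, q, context of q) that pass both Step 1 tests."""
    return {
        (p.name, p.context, q.name, q.context)
        for p in source
        for q in target
        if p.datatype == q.datatype and jaccard(p.tokens, q.tokens) >= threshold
    }

amazon = [ContextualizedProperty("am:title", "am:Book", "string",
                                 frozenset({"King", "Lear", "Hamlet", "Othello"}))]
ebay = [
    ContextualizedProperty("eb:title", "eb:Book", "string",
                           frozenset({"King", "Lear", "Hamlet", "Macbeth"})),
    ContextualizedProperty("eb:rating", "eb:Book", "integer",
                           frozenset({"1", "2", "3"})),
]
matches = temporary_matching(amazon, ebay, threshold=0.5)
```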

Vocabulary matching

Step 2 – Compute a class matching

• Representation of classes
  – PC = set of properties of class C (from the schema)
• Class matching (classes C and D)
  – if similarity(PC,PD) ≥ threshold then C matches D
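The slides do not say how property compatibility from Step 1 enters similarity(PC,PD). One plausible reading, shown here purely as a sketch, is a Jaccard-style ratio in which two properties count as shared when Step 1 matched them. The function name and the sample data are hypothetical.

```python
def class_similarity(props_c, props_d, property_matches):
    """Jaccard-style ratio where (p, q) in property_matches counts as shared."""
    matched_c = {p for p in props_c
                 if any((p, q) in property_matches for q in props_d)}
    matched_d = {q for q in props_d
                 if any((p, q) in property_matches for p in props_c)}
    shared = min(len(matched_c), len(matched_d))
    union = len(props_c) + len(props_d) - shared
    return shared / union if union else 0.0

# Property pairs assumed to come out of Step 1.
property_matches = {("am:title", "eb:title"), ("am:author", "eb:author")}

am_book = {"am:title", "am:author"}
eb_book = {"eb:title", "eb:author", "eb:binding"}

sim = class_similarity(am_book, eb_book, property_matches)  # 2 shared out of 3
```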

Vocabulary matching

Step 3 – Instance matching

• Representation of instances
  – CI = class of instance I (from the set of values)
  – TkI = set of observed tokens of common properties
    • Tokens from common properties (from the set of values)
      – am:Book[title, author]; eb:Book[title, author, binding]
      – tokens extracted from title and author
    • requires property and class matching
• Instance matching (instances I of class CI and J of class CJ)
  – if classes CI and CJ match and similarity(TkI,TkJ) ≥ threshold then I matches J
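A sketch of the instance test, using the am:Book[title, author] and eb:Book[title, author, binding] example. The dictionary layout, the helper names and the author and binding values are made up for illustration, and Jaccard stands in for the cosine/TF-IDF measure used in the reported experiment.

```python
import re

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def instance_tokens(instance, common_props):
    """TkI: lower-cased tokens from the values of the common properties only."""
    found = set()
    for prop in common_props:
        found.update(re.findall(r"\w+", instance["values"].get(prop, "").lower()))
    return found

def instances_match(i, j, class_matches, common_i, common_j, threshold):
    if (i["cls"], j["cls"]) not in class_matches:
        return False
    return jaccard(instance_tokens(i, common_i),
                   instance_tokens(j, common_j)) >= threshold

class_matches = {("am:Book", "eb:Book")}  # assumed output of Step 2
am_instance = {"cls": "am:Book",
               "values": {"am:title": "Hamlet", "am:author": "William Shakespeare"}}
eb_instance = {"cls": "eb:Book",
               "values": {"eb:title": "Hamlet", "eb:author": "Shakespeare, William",
                          "eb:binding": "Paperback"}}

same_book = instances_match(am_instance, eb_instance, class_matches,
                            ["am:title", "am:author"], ["eb:title", "eb:author"],
                            threshold=0.8)
```

The binding value is ignored because eb:binding has no matching Amazon property, which is why property and class matching must come first.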

Vocabulary matching

Step 4 – Refine the contextualized property matching

• Representation of contextualized properties
  – datatype of property p
  – context of property p
  – Tkp = set of observed tokens
  – IMp = set of observed (instance id, token) pairs
• Contextualized property matching (properties p and q)
  – if p and q have different datatypes or contexts then p does not match q
  – otherwise, if f(similarity(Tkp,Tkq),similarity(IMp,IMq)) ≥ threshold then p matches q
    • f() combines several similarity measures
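The refinement can be sketched with the am:edition and eb:rating example. The combining function f() is not specified on the slide; `min` is used below as a hypothetical choice, Jaccard again stands in for the similarity function, and the instance ids b1..b4 are assumed to have been aligned by Step 3.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def refined_match(p, q, class_matches, threshold, combine=min):
    """Step 4 test; `combine` plays the role of f() (min is an assumption)."""
    if p["datatype"] != q["datatype"]:
        return False
    if (p["context"], q["context"]) not in class_matches:
        return False
    score = combine(jaccard(p["tokens"], q["tokens"]),
                    jaccard(p["pairs"], q["pairs"]))
    return score >= threshold

class_matches = {("am:Book", "eb:Book")}
edition = {"datatype": "integer", "context": "am:Book",
           "tokens": {"1", "2", "3", "4"},
           "pairs": {("b1", "1"), ("b2", "2"), ("b3", "3"), ("b4", "4")}}
rating = {"datatype": "integer", "context": "eb:Book",
          "tokens": {"1", "2", "3", "4"},
          "pairs": {("b1", "2"), ("b2", "4"), ("b3", "1"), ("b4", "3")}}

# Token sets are identical, but no (instance id, token) pair is shared.
accepted = refined_match(edition, rating, class_matches, threshold=0.5)
```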

Vocabulary matching

Experiment – Configuration

• Configuration
  – Temporary contextualized property matching
    • Did not use multisets
    • Similarity function: Contrast Model (α=1.0, β=γ=3.0)
    • Threshold = max - 20%
  – Instance matching
    • Used multisets
    • Cosine with TF-IDF
    • Threshold = 0.8
  – Refinement of property matching
    • Did not use multisets
    • Similarity function: Contrast Model (α=1.0, β=γ=3.0)
    • Threshold = max - 30%

Vocabulary matching

Experiment – fragment of the results

• Results
  – Recall = 71%
  – Precision = 86%
  – Overall performance = 78%

Amazon                        eBay
v1                  e1        v2             e2      result
Books                         Books                  tp
author              B         author         B       tp
binding             B         biding         B       tp
edition             B         edition        B       tp
format              B         biding         B       tp
isbn-10             B         isbn           B       tp
isbn-13             B         ean            B       tp
publisher           B         name           P       tp
title               B         title          B       tp
ComputerNetworking            PCHardware             tp
operatingSys        CN        platform       PC      tp
operatingSystem     CN        operSyst       PC      tp
processorConfig     CN        cpuType        PC      tp
processorType       CN        cpuType        PC      tp
processorType       CN        cpuManufact    PC      tp
title               CN        title          PC      tp
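The slide does not say how the overall performance figure is computed. It is consistent with the harmonic mean of precision and recall (the F-measure), which the following check illustrates; treating it as the F-measure is an assumption.

```python
precision = 0.86
recall = 0.71

# Harmonic mean of precision and recall (F-measure).
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))  # 0.78
```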


Concept mapping

• Concept mapping from T into S
  – Defines concepts from T in terms of concepts of S by means of rules
    • Examples
      eb:title(p,t) ← am:title(p,t), am:Book(p)
      eb:publisher(b,n) ← am:publisher(b,p), am:name(p,n)
      eb:Book(p) ← am:Book(p)
  – Induced by the vocabulary matching
  – Used for translating queries over T to queries over S

Query translation

Concept mapping rules

  eb:title(p,t) ← am:title(p,t), am:Book(p)
  eb:publisher(b,n) ← am:publisher(b,p), am:name(p,n)

Query over eBay

  SELECT ?title ?pubName
  WHERE { ?s eb:title ?title .
          ?s eb:publisher ?pubName }

Query over Amazon

  SELECT ?title ?pubName
  WHERE { ?s am:title ?title .
          ?s rdf:type am:Book .
          ?s am:publisher ?p .
          ?p am:name ?pubName }
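The translation above amounts to unfolding each eBay triple pattern with the body of its concept mapping rule. The sketch below shows that idea under simplifying assumptions: rules are stored in a hypothetical `RULES` table keyed by target property, "x" and "y" stand for the head variables, names starting with "_" are intermediate variables that receive fresh names, and class atoms such as am:Book(p) are written as rdf:type triples.

```python
from itertools import count

# Rule bodies as source triple patterns (layout is an assumption).
RULES = {
    "eb:title": [("x", "am:title", "y"), ("x", "rdf:type", "am:Book")],
    "eb:publisher": [("x", "am:publisher", "_p"), ("_p", "am:name", "y")],
}

def translate(patterns):
    """Unfold each target triple pattern into the body of its rule."""
    fresh = count(1)
    translated = []
    for subject, predicate, obj in patterns:
        binding = {"x": subject, "y": obj}
        for s, p, o in RULES[predicate]:
            terms = []
            for term in (s, o):
                if term.startswith("_") and term not in binding:
                    binding[term] = f"?v{next(fresh)}"  # fresh intermediate variable
                terms.append(binding.get(term, term))   # constants pass through
            translated.append((terms[0], p, terms[1]))
    return translated

ebay_query = [("?s", "eb:title", "?title"), ("?s", "eb:publisher", "?pubName")]
amazon_patterns = translate(ebay_query)
print(" .\n".join(" ".join(triple) for triple in amazon_patterns))
```

The generated variable ?v1 plays the role of ?p in the Amazon query on the slide.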

Concept mapping rules

• Case 1 – class translation
  – vocabulary matching
    (am:Book,T,eb:Book,T)
    (am:Music,T,eb:Music,T)
  – induced concept mapping
    eb:Book(x) ← am:Book(x)
    eb:Music(x) ← am:Music(x)
    eb:Product(x) ← am:Book(x)
    eb:Product(x) ← am:Music(x)
• Case 2 – simple property translation
  – vocabulary matching
    (am:title,am:Book,eb:title,eb:Book)
    (am:Book,T,eb:Book,T)
  – induced concept mapping
    eb:title(x,y) ← am:title(x,y), am:Book(x)
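Cases 1 and 2 can be generated mechanically from the matching tuples. The function below is a sketch with a hypothetical name and tuple layout (source element, source context, target element, target context); it does not derive superclass rules such as the eb:Product ones, and it does not cover property translations that need a path.

```python
def induce_rules(matchings, top="T"):
    """Rules for class matchings (Case 1) and simple property matchings (Case 2)."""
    rules = []
    for v1, e1, v2, e2 in matchings:
        if e1 == top and e2 == top:
            # Case 1: class translation.
            rules.append(f"{v2}(x) ← {v1}(x)")
        elif (e1, top, e2, top) in matchings:
            # Case 2: the contexts match as classes, so the source context
            # becomes a condition in the rule body.
            rules.append(f"{v2}(x,y) ← {v1}(x,y), {e1}(x)")
    return rules

matchings = [
    ("am:Book", "T", "eb:Book", "T"),
    ("am:title", "am:Book", "eb:title", "eb:Book"),
]
rules = induce_rules(matchings)
print("\n".join(rules))
```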

Concept mapping rules

• Case 3 – property translation
  – vocabulary matching
    (am:name,am:Publ,eb:publisher,eb:Book)
  – Concept mapping rule
    eb:publisher(x,y) ← am:publisher(x,s), am:name(s,y), am:Publ(s)
    but, since the domain of am:name is equal to its context, we may simplify to:
    eb:publisher(x,y) ← am:publisher(x,s), am:name(s,y)

[Diagram: eb:publisher goes from eb:Book to String; am:publisher goes from am:Book to am:Publ, and am:name goes from am:Publ to String]

Consistency of vocabulary matching

• Consistency of the vocabulary matching for a concept mapping from T into S (T = eBay, S = Amazon)
  – "t:p matches s:q" is consistent with the integrity constraints only if the conditions marked in red in the diagram hold
  – otherwise, discard the matching to maintain consistency

[Diagram: in T, property t:p(x,s) with its domain (t:Domain of p), its context (t:Context) and its range (Class/Type); in S, a path from s:Class to s:Context, followed by property s:q(s,y) with its range (Class/Type); two "equivalent" links connect the T side to the S side]

Consistency of vocabulary matching

• Example (T = eBay, S = Amazon)

[Diagram: in T, eb:publisher(x,s) with domain eb:Book, context eb:Book and range String; in S, am:publisher(x,s) leads from am:Book to am:Publ, and am:name(s,y) has range String; two "equivalent" links connect the T side to the S side]


Summary

• Problem decomposition
  – Vocabulary matching + Concept mapping
  – Vocabulary matching induces the concept mapping
• Complex schema matching
  – Different similarity models for each matching task
  – Multiple representations for each type of vocabulary element
