Instance-based OWL Schema Matching
Luiz André P. Paes Leme, Marco A. Casanova, Karin K. Breitman, Antonio L. Furtado
Department of Informatics – Pontifical Catholic University of Rio de Janeiro
{lleme, casanova, karin, furtado}@inf.puc-rio.br
Topics
• Introduction
• OWL Extralite
• Comments on similarity models
• Vocabulary matching
• Concept mapping
• Summary
14/5/2009
Introduction
1. Discovering new data sources
2. Capturing the export schemas of the data sources
3. Matching the export schemas
4. Translating user queries
5. Dealing with huge volumes of data
6. Dynamic matching
   – Data sources join or leave the mediated environment at will
Introduction
• Strategy – Instance-based schema matching technique
• Assumptions – The export schemas are known
– The schemas are defined using a subset of OWL which has the same expressiveness as UML
– Queries on data sources use SPARQL
OWL Extralite
• Schema elements
  – Classes
  – Subclasses
  – Datatype properties
  – Object properties
  – Individuals
• Property domains and ranges
  – Property domains are named classes
  – Ranges of datatype properties are XML Schema datatypes
  – Ranges of object properties are named classes
• Restrictions
  – Minimum cardinality
  – Maximum cardinality
  – Inverse functional (primary keys)
OWL Extralite Amazon Schema
OWL Extralite eBay schema
Comments on similarity models
• Considerations
  – A model may use different representations for a vocabulary element
  – A model depends on the choice of the similarity function
    • A model may combine several similarity functions
  – A model has to be calibrated
Comments on similarity models
• Property similarity by observed values
  – Given two properties Ai and Bj, if the set of observed values of Ai is similar to the set of observed values of Bj, then Ai and Bj are similar
  – Example
    OV((am:title,am:Book)) = {“King Lear”, “Romeo and Juliet”, “Hamlet”, “Othello”, “Dom Casmurro”}
    OV((eb:title,eb:Book)) = {“King Lear”, “Romeo and Juliet”, “Hamlet”, “Macbeth”, “Dom Casmurro”, “Quincas Borba”}
    OV((eb:title,eb:Music)) = {“Romeo and Juliet (Royal Ballet) - Rudolf Nureyev and Margot Fonteyn”, “Ambroise Thomas - Hamlet - Barcelona Opera”, “Othello Syndrome”, “Verdi: Macbeth”}
    (am:title,am:Book) × (eb:title,eb:Book) → 57% similarity
    (am:title,am:Book) × (eb:title,eb:Music) → 0% similarity
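The comparison of observed-value sets can be sketched in a few lines. The slides do not name the similarity function used for this example, so the Jaccard coefficient below is an assumption; it happens to give 4 shared values out of 7 distinct ones, which agrees with the 57% figure for the two book catalogues.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| (assumed measure, not named in the slides)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Observed values copied from the example.
am_book = {"King Lear", "Romeo and Juliet", "Hamlet", "Othello", "Dom Casmurro"}
eb_book = {"King Lear", "Romeo and Juliet", "Hamlet", "Macbeth",
           "Dom Casmurro", "Quincas Borba"}
eb_music = {
    "Romeo and Juliet (Royal Ballet) - Rudolf Nureyev and Margot Fonteyn",
    "Ambroise Thomas - Hamlet - Barcelona Opera",
    "Othello Syndrome",
    "Verdi: Macbeth",
}

print(f"{jaccard(am_book, eb_book):.0%}")   # book titles against book titles
print(f"{jaccard(am_book, eb_music):.0%}")  # book titles against music titles
```

Because whole strings are compared, no music title equals a book title even when it contains one.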
Comments on similarity models
• Property similarity by observed values
  – Example (values with small variations in the strings)
    OV((am:title,am:Book)) = {“King Lear (new)”, “Romeo and Juliet”, “Hamlet”, “Othello”, “Dom Casmurro”}
    OV((eb:title,eb:Book)) = {“King Lear”, “Romeo and Juliet (Arkangel Complete Shakespeare)”, “Hamlet (Shakespeare)”, “Macbeth”, “Dom Casmurro”, “Quincas Borba”}
    OV((eb:title,eb:Music)) = {“Romeo and Juliet (Royal Ballet) - Rudolf Nureyev and Margot Fonteyn”, “Ambroise Thomas - Hamlet - Barcelona Opera”, “Othello Syndrome”, “Verdi: Macbeth”}
    (am:title,am:Book) × (eb:title,eb:Book) → 25% similarity
    (am:title,am:Book) × (eb:title,eb:Music) → 0% similarity
Comments on similarity models
• Property similarity by observed tokens
  – Example
    TK((am:title,am:Book)) = {“King”, “Lear”, “new”, “Romeo”, “Juliet”, “Hamlet”, “Othello”, “Dom”, “Casmurro”}
    TK((eb:title,eb:Book)) = {“King”, “Lear”, “Romeo”, “Juliet”, “Arkangel”, “Complete”, “Shakespeare”/2, “Hamlet”, “Macbeth”, “Dom”, “Casmurro”, “Quincas”, “Borba”}
    TK((eb:title,eb:Music)) = {“Romeo”, “Juliet”, “Royal”, “Ballet”, “Rudolf”, “Nureyev”, “Margot”, “Fonteyn”, “Ambroise”, “Thomas”, “Hamlet”, “Barcelona”, “Opera”, “Othello”, “Syndrome”, “Verdi”, “Macbeth”}
    (am:title,am:Book) × (eb:title,eb:Book) → 50% similarity
    (am:title,am:Book) × (eb:title,eb:Music) → 15% similarity
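A minimal sketch of how the token sets above could be obtained from the observed values. The tokenizer (word characters only) and the stop-word list containing just "and" are assumptions chosen so that the output agrees with the sets on the slide; the slides do not describe the actual tokenization.

```python
import re
from collections import Counter

STOPWORDS = {"and"}  # assumption: the token sets on the slide omit "and"


def tokens(values):
    """Multiset (token -> frequency) of word tokens over a collection of values."""
    bag = Counter()
    for value in values:
        for tok in re.findall(r"\w+", value):
            if tok.lower() not in STOPWORDS:
                bag[tok] += 1
    return bag


am_titles = ["King Lear (new)", "Romeo and Juliet", "Hamlet", "Othello",
             "Dom Casmurro"]
eb_titles = ["King Lear", "Romeo and Juliet (Arkangel Complete Shakespeare)",
             "Hamlet (Shakespeare)", "Macbeth", "Dom Casmurro", "Quincas Borba"]

tk_am = tokens(am_titles)
tk_eb = tokens(eb_titles)
print(sorted(tk_am))          # distinct tokens of am:title
print(tk_eb["Shakespeare"])   # frequency, written "Shakespeare"/2 on the slide
```

Keeping the frequencies in a `Counter` covers both representations: `set(tk_am)` gives the plain set, while the counter itself is the multiset.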
Comments on similarity models
• Property similarity by observed values
  – Example
    OV((am:edition,S)) = {1, 2, 3, 4}
    OV((eb:rating,S)) = {1, 2, 3, 4}
    am:edition × eb:rating → 100% similarity
Comments on similarity models
• Property similarity by (instance, value) pairs
  – considers (instance, value) pairs instead of the property values alone
  – requires instance alignment
  – Example:
    • book catalogues: am:edition and eb:rating may have different values for the same book
• Example of (instance, value) pairs
  IV((am:edition,S)) = {(“Hamlet”,1), (“Othello”,2), (“Macbeth”,3), (“Dom Casmurro”,4)}
  IV((eb:rating,S)) = {(“Hamlet”,2), (“Othello”,4), (“Macbeth”,1), (“Dom Casmurro”,3)}
  am:edition × eb:rating → 0% similarity
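The contrast between the two representations can be checked directly. This sketch again assumes a Jaccard coefficient as the similarity function (the slides do not name one): the value sets of am:edition and eb:rating are identical, while the (instance, value) pairs have nothing in common.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient, used here only as an illustrative similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# (instance, value) pairs copied from the example; instances are already aligned.
iv_edition = {("Hamlet", 1), ("Othello", 2), ("Macbeth", 3), ("Dom Casmurro", 4)}
iv_rating = {("Hamlet", 2), ("Othello", 4), ("Macbeth", 1), ("Dom Casmurro", 3)}

# Dropping the instance leaves only the observed values.
ov_edition = {value for _, value in iv_edition}
ov_rating = {value for _, value in iv_rating}

print(jaccard(ov_edition, ov_rating))  # values alone: the sets coincide
print(jaccard(iv_edition, iv_rating))  # pairs: no book has the same value
```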
Comments on similarity models
• Contrast Model
  – The similarity between X and Y increases with the number of features they share and decreases with the number of features that belong only to X or only to Y.

$$\tau(S_1, S_2) = \theta f(S_1 \cap S_2) - \alpha f(S_1 - S_2) - \beta f(S_2 - S_1)$$

$$\mathrm{similarity}_{\theta,\alpha,\beta}(S_1, S_2) = \left[ 1 - \log \frac{N(S_1 \cap S_2)^{\theta+\alpha+\beta}}{N(S_1)^{\alpha}\, N(S_2)^{\beta}} \right]^{-1}, \qquad N(S) = \frac{|S|}{C}$$
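As an illustration of the first formula, the linear contrast score can be computed over Python sets. Taking f to be set cardinality and the default weights of 1.0 are assumptions made for this sketch; the normalized similarity is not implemented here.

```python
def contrast(s1: set, s2: set, theta=1.0, alpha=1.0, beta=1.0, f=len) -> float:
    """tau(S1, S2) = theta*f(S1 ∩ S2) - alpha*f(S1 - S2) - beta*f(S2 - S1).

    f defaults to set cardinality, which is an assumption of this sketch.
    """
    common = f(s1 & s2)    # features shared by both
    only_s1 = f(s1 - s2)   # features of S1 that S2 lacks
    only_s2 = f(s2 - s1)   # features of S2 that S1 lacks
    return theta * common - alpha * only_s1 - beta * only_s2


am_book = {"King Lear", "Romeo and Juliet", "Hamlet", "Othello", "Dom Casmurro"}
eb_book = {"King Lear", "Romeo and Juliet", "Hamlet", "Macbeth",
           "Dom Casmurro", "Quincas Borba"}

# 4 common values, 1 only in am_book, 2 only in eb_book.
print(contrast(am_book, eb_book))
print(contrast(am_book, eb_book, alpha=0.5, beta=0.5))
```

Raising alpha or beta penalizes values that appear on only one side more heavily, which is the knob the calibration step would tune.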
Comments on similarity models
• Cosine and TF-IDF
  – terms that occur frequently in a document contribute more to the similarity
  – terms that are rare in the set of all documents contribute more to the similarity
• Cosine similarity:

$$\mathrm{similarity}(S_1, S_2) = \frac{\vec{S_1} \cdot \vec{S_2}}{\lVert \vec{S_1} \rVert \, \lVert \vec{S_2} \rVert}$$

• Representation of strings as bags of words
  – set: S = {t1, t2, ..., tn}
  – multiset: S = ({t1, t2, ..., tn}, {(t1,ft1), (t2,ft2), ..., (tn,ftn)})
• Representation of strings as vectors
  – TF: $\vec{S} = (f_{t_1}, f_{t_2}, \ldots, f_{t_n})$
  – TF/IDF: $\vec{S} = (w_{t_1}, w_{t_2}, \ldots, w_{t_n})$, where

$$w_{t_n} = \frac{f_{t_n}}{\max_{t \in S} f_t} \, \log \frac{N}{N_{t_n}}$$
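A small sketch of the TF/IDF weighting and the cosine measure. It assumes the usual reading of the symbols, which the slide does not spell out: N is the number of documents (bags of tokens) and N_t is the number of documents containing token t. The three example bags are made up for illustration.

```python
import math
from collections import Counter


def tfidf(bag, doc_freq, n_docs):
    """w_t = (f_t / max f) * log(N / N_t) for every token t in the bag."""
    max_f = max(bag.values())
    return {t: (f / max_f) * math.log(n_docs / doc_freq[t])
            for t, f in bag.items()}


def cosine(u, v):
    """Cosine of two sparse vectors stored as token -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Illustrative bags of tokens (hypothetical data).
docs = [Counter("king lear".split()),
        Counter("king lear new".split()),
        Counter("dom casmurro".split())]

# Document frequency: in how many bags each token occurs.
doc_freq = Counter(t for d in docs for t in d)
vecs = [tfidf(d, doc_freq, len(docs)) for d in docs]

print(round(cosine(vecs[0], vecs[1]), 3))  # overlapping titles
print(cosine(vecs[0], vecs[2]))            # no token in common
```

The rare token "new" receives a larger weight than the shared tokens "king" and "lear", so it pulls the first two vectors apart more than a plain term count would.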
Comments on similarity models
• Estimated Mutual Information

$$\mathrm{similarity}(A_r, B_s) = \frac{m_{rs}}{\sum_{i,j} m_{ij}} \, \log \frac{\dfrac{m_{rs}}{\sum_{i,j} m_{ij}}}{\dfrac{\sum_{j} m_{rj}}{\sum_{i,j} m_{ij}} \cdot \dfrac{\sum_{i} m_{is}}{\sum_{i,j} m_{ij}}}$$
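The formula can be evaluated mechanically once the matrix m is available. In the sketch below, m is taken to be a co-occurrence matrix whose entry m[i][j] counts what properties A_i and B_j have in common; that reading, the example matrix, and the convention of returning 0 when m[r][s] is 0 are all assumptions, since the slide gives only the formula.

```python
import math


def emi(m, r, s):
    """Estimated mutual information between A_r and B_s for a count matrix m."""
    total = sum(sum(row) for row in m)
    if total == 0 or m[r][s] == 0:
        return 0.0  # assumed convention: 0 * log(0) is treated as 0
    p_rs = m[r][s] / total                   # joint term m_rs / sum m_ij
    p_r = sum(m[r]) / total                  # row marginal  sum_j m_rj / sum m_ij
    p_s = sum(row[s] for row in m) / total   # column marginal sum_i m_is / sum m_ij
    return p_rs * math.log(p_rs / (p_r * p_s))


# Hypothetical 2x2 matrix: A_0 co-occurs only with B_0, A_1 only with B_1.
m = [[4, 0],
     [0, 4]]

print(emi(m, 0, 0))
print(emi(m, 0, 1))
```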
Vocabulary matching
• Vocabulary matching
  – Example
    • property am:title, when applied to instances of type am:Book, is equivalent to property eb:title when applied to instances of type eb:Book
  – Vocabulary elements
    • Contextualized properties
    • Classes

  S (Amazon)                  T (eBay)
  Element     Context         Element         Context
  am:title    am:Book         eb:title        eb:Book
  am:title    am:Music        eb:title        eb:Music
  am:name     am:Publ         eb:publisher    eb:Book
  am:Book     T               eb:Book         T
  am:Music    T               eb:Music        T
Vocabulary matching
Four-step matching process
1. Compute a temporary contextualized property matching
   – use the set of tokens extracted from the property values (from the set of values)
2. Compute a class matching
   – use the set of properties of each class (from the schema)
   – use property compatibility (from Step 1)
3. Compute an instance matching
   – use class compatibility (from Step 2)
   – use the temporary property matching (from Step 1)
   – use the set of tokens extracted from the values of matching properties (from the set of values)
4. Refine the contextualized property matching
   – use property datatype (from the schema)
   – use context compatibility (from Step 2)
   – use the set of tokens extracted from the property values (from the set of values)
   – use the set of instance-value pairs (from Step 3)
Vocabulary matching
Step 1 – Compute a temporary property matching
• Representation of contextualized properties
  – datatype of property p (from the schema)
  – Tkp = set of observed tokens of property p (from the set of values)
• Contextualized property matching (properties p and q)
  – if p and q have different datatypes then p does not match q
  – if similarity(Tkp,Tkq) ≥ threshold then p matches q
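The two rules of Step 1 translate into a short predicate. This is a sketch under stated assumptions: the `ContextualizedProperty` class, the Jaccard similarity, the 0.4 threshold and the token sets are all hypothetical stand-ins (the experiment reported later uses the Contrast Model with a relative threshold).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualizedProperty:
    """Hypothetical container: a property, its context class, datatype and tokens."""
    name: str
    context: str
    datatype: str
    tokens: frozenset


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def properties_match(p, q, threshold=0.4):
    """Step 1: datatypes must agree, then token similarity must reach the threshold."""
    if p.datatype != q.datatype:
        return False
    return jaccard(p.tokens, q.tokens) >= threshold


am_title = ContextualizedProperty("am:title", "am:Book", "string",
                                  frozenset({"King", "Lear", "Hamlet", "Othello"}))
eb_title = ContextualizedProperty("eb:title", "eb:Book", "string",
                                  frozenset({"King", "Lear", "Hamlet", "Macbeth"}))
eb_rating = ContextualizedProperty("eb:rating", "eb:Book", "integer",
                                   frozenset({"1", "2", "3", "4"}))

print(properties_match(am_title, eb_title))   # same datatype, 3 of 5 tokens shared
print(properties_match(am_title, eb_rating))  # different datatypes
```

Contexts are stored but deliberately not compared at this stage; context compatibility only becomes available after the class matching of Step 2.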
Vocabulary matching
Step 2 – Compute a class matching
• Representation of classes
  – PC = set of properties of class C (from the schema)
• Class matching (classes C and D)
  – if similarity(PC,PD) ≥ threshold then C matches D
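One possible reading of Step 2, as a sketch: two properties from different schemas count as "the same" when the temporary matching of Step 1 paired them, and class similarity is a Jaccard-style ratio over the property sets. The function name, the ratio, and the property lists (taken from the am:Book and eb:Book example used for Step 3) are assumptions.

```python
def class_similarity(props_c, props_d, property_matches):
    """Share of properties that Step 1 paired, relative to all distinct properties."""
    matched_c = {p for p in props_c
                 if any((p, q) in property_matches for q in props_d)}
    matched_d = {q for q in props_d
                 if any((p, q) in property_matches for p in props_c)}
    common = min(len(matched_c), len(matched_d))
    union = len(props_c) + len(props_d) - common
    return common / union if union else 0.0


# Contextualized properties written as (property, context) pairs.
am_book = {("am:title", "am:Book"), ("am:author", "am:Book")}
eb_book = {("eb:title", "eb:Book"), ("eb:author", "eb:Book"),
           ("eb:binding", "eb:Book")}
eb_music = {("eb:title", "eb:Music")}

# Hypothetical output of Step 1.
property_matches = {
    (("am:title", "am:Book"), ("eb:title", "eb:Book")),
    (("am:author", "am:Book"), ("eb:author", "eb:Book")),
}

print(round(class_similarity(am_book, eb_book, property_matches), 2))
print(class_similarity(am_book, eb_music, property_matches))
```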
Vocabulary matching
Step 3 – Instance matching
• Representation of instances
  – CI = class of instance I (from the set of values)
  – TkI = set of observed tokens of common properties
    • Tokens from common properties (from the set of values)
      – am:Book[title, author]; eb:Book[title, author, binding]
      – tokens extracted from title and author
    • requires property and class matching
• Instance matching (instances I of class CI and J of class CJ)
  – if classes CI and CJ match and similarity(TkI, TkJ) ≥ threshold then I matches J
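A sketch of the instance-matching rule. Instances are modelled as plain dicts, tokens are lower-cased words, and a Jaccard coefficient with a 0.5 threshold stands in for the similarity function; all of these, and the sample records, are assumptions (the reported experiment uses cosine with TF-IDF at this step).

```python
import re


def instance_tokens(instance, props):
    """Tokens of the values an instance holds for the given properties."""
    toks = set()
    for prop in props:
        value = str(instance["values"].get(prop, ""))
        toks.update(re.findall(r"\w+", value.lower()))
    return toks


def instances_match(i, j, class_matches, prop_pairs, threshold=0.5):
    """Step 3: classes must match, then tokens of the common properties are compared."""
    if (i["cls"], j["cls"]) not in class_matches:
        return False
    tk_i = instance_tokens(i, [p for p, _ in prop_pairs])
    tk_j = instance_tokens(j, [q for _, q in prop_pairs])
    union = tk_i | tk_j
    return bool(union) and len(tk_i & tk_j) / len(union) >= threshold


# Hypothetical results of Steps 1 and 2.
class_matches = {("am:Book", "eb:Book")}
prop_pairs = [("am:title", "eb:title"), ("am:author", "eb:author")]

am_hamlet = {"cls": "am:Book",
             "values": {"am:title": "Hamlet (Shakespeare)",
                        "am:author": "William Shakespeare"}}
eb_hamlet = {"cls": "eb:Book",
             "values": {"eb:title": "Hamlet", "eb:author": "William Shakespeare",
                        "eb:binding": "Paperback"}}
eb_opera = {"cls": "eb:Music",
            "values": {"eb:title": "Ambroise Thomas - Hamlet - Barcelona Opera"}}

print(instances_match(am_hamlet, eb_hamlet, class_matches, prop_pairs))
print(instances_match(am_hamlet, eb_opera, class_matches, prop_pairs))
```

Only title and author contribute tokens; eb:binding is ignored because it has no counterpart in the temporary property matching.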
Vocabulary matching
Step 4 – Refine the contextualized property matching
• Representation of contextualized properties
  – datatype of property p
  – context of property p
  – Tkp = set of observed tokens
  – IMp = set of observed (instance id, token) pairs
• Contextualized property matching (properties p and q)
  – if p and q have different datatypes or contexts then p does not match q
  – if f(similarity(Tkp,Tkq),similarity(IMp,IMq)) ≥ threshold then p matches q
    • f() combines several similarity measures
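A sketch of the refinement rule, using the edition/rating example from the similarity slides. The combining function f is not specified in the slides; a weighted average is assumed here, together with a Jaccard similarity, a 0.6 threshold, dict-based property records and hypothetical instance ids b1 to b4 that Step 3 is presumed to have aligned.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refined_match(p, q, context_matches, threshold=0.6, weight=0.5):
    """Step 4: check datatype and context, then combine token and pair similarity."""
    if p["datatype"] != q["datatype"]:
        return False
    if (p["context"], q["context"]) not in context_matches:
        return False
    token_sim = jaccard(p["tokens"], q["tokens"])
    pair_sim = jaccard(p["pairs"], q["pairs"])
    # f(): assumed to be a weighted average of the two measures.
    return weight * token_sim + (1 - weight) * pair_sim >= threshold


context_matches = {("am:Book", "eb:Book")}  # hypothetical output of Step 2

am_edition = {"datatype": "integer", "context": "am:Book",
              "tokens": {"1", "2", "3", "4"},
              "pairs": {("b1", "1"), ("b2", "2"), ("b3", "3"), ("b4", "4")}}
eb_rating = {"datatype": "integer", "context": "eb:Book",
             "tokens": {"1", "2", "3", "4"},
             "pairs": {("b1", "2"), ("b2", "4"), ("b3", "1"), ("b4", "3")}}
eb_edition = {"datatype": "integer", "context": "eb:Book",
              "tokens": {"1", "2", "3", "4"},
              "pairs": {("b1", "1"), ("b2", "2"), ("b3", "3"), ("b4", "4")}}

print(refined_match(am_edition, eb_rating, context_matches))   # same tokens, no shared pair
print(refined_match(am_edition, eb_edition, context_matches))  # tokens and pairs agree
```

The token sets alone cannot separate eb:rating from eb:edition; the (instance id, token) pairs are what removes the false match.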
Vocabulary matching
Experiment – Configuration
• Temporary contextualized property matching
  – Did not use multisets
  – Similarity function: Contrast Model (α=1.0, β=γ=3.0)
  – Threshold = max − 20%
• Instance matching
  – Used multisets
  – Similarity function: cosine with TF-IDF
  – Threshold = 0.8
• Refinement of property matching
  – Did not use multisets
  – Similarity function: Contrast Model (α=1.0, β=γ=3.0)
  – Threshold = max − 30%
Vocabulary matching
Experiment – fragment of the results
• Results
  – Recall = 71%
  – Precision = 86%
  – Overall performance = 78%

  Amazon                          eBay
  v1                  e1          v2             e2          result
  Books                           Books                      tp
  author              B           author         B           tp
  binding             B           binding        B           tp
  edition             B           edition        B           tp
  format              B           binding        B           tp
  isbn-10             B           isbn           B           tp
  isbn-13             B           ean            B           tp
  publisher           B           name           P           tp
  title               B           title          B           tp
  ComputerNetworking              PCHardware                 tp
  operatingSys        CN          platform       PC          tp
  operatingSystem     CN          operSyst       PC          tp
  processorConfig     CN          cpuType        PC          tp
  processorType       CN          cpuType        PC          tp
  processorType       CN          cpuManufact    PC          tp
  title               CN          title          PC          tp
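The slide does not say how the overall performance figure is derived from precision and recall. One common choice, assumed here, is the F-measure (harmonic mean); with the reported values it gives roughly 78%, though the arithmetic mean lands very close as well, so this sketch should not be read as the authors' definition.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of 'overall')."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f"{f_measure(0.86, 0.71):.0%}")
```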
Concept mapping
• Concept mapping from T into S
  – Defines concepts of T in terms of concepts of S by means of rules
    • Examples
      eb:title(p,t) ← am:title(p,t), am:Book(p)
      eb:publisher(b,n) ← am:publisher(b,p), am:name(p,n)
      eb:Book(p) ← am:Book(p)
  – Induced by the vocabulary matching
  – Used for translating queries over T into queries over S
Query translation
Concept mapping rules
  eb:title(p,t) ← am:title(p,t), am:Book(p)
  eb:publisher(b,n) ← am:publisher(b,p), am:name(p,n)

Query over eBay
  SELECT ?title ?pubName
  WHERE { ?s eb:title ?title .
          ?s eb:publisher ?pubName }

→ Query over Amazon
  SELECT ?title ?pubName
  WHERE { ?s am:title ?title .
          ?s rdf:type am:Book .
          ?s am:publisher ?p .
          ?p am:name ?pubName }
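The rewriting shown above can be sketched as template substitution over triple patterns. The `RULES` table encodes the two rules from the slide, with ?x and ?y standing for the head variables and ?z for the variable that occurs only in the rule body; the representation, the function name and the fresh-variable naming (?v1, ?v2, ...) are assumptions of this sketch.

```python
# Body of each rule as triple-pattern templates, keyed by the eBay predicate.
RULES = {
    "eb:title": [("?x", "am:title", "?y"), ("?x", "rdf:type", "am:Book")],
    "eb:publisher": [("?x", "am:publisher", "?z"), ("?z", "am:name", "?y")],
}


def translate(patterns):
    """Rewrite eBay triple patterns into Amazon triple patterns using RULES."""
    out = []
    for i, (subj, pred, obj) in enumerate(patterns, start=1):
        # ?z gets a fresh name per pattern so that separate joins never collide.
        binding = {"?x": subj, "?y": obj, "?z": f"?v{i}"}
        for triple in RULES[pred]:
            out.append(tuple(binding.get(term, term) for term in triple))
    return out


ebay_query = [("?s", "eb:title", "?title"), ("?s", "eb:publisher", "?pubName")]
amazon_query = translate(ebay_query)

where = " . ".join(" ".join(t) for t in amazon_query)
print(f"SELECT ?title ?pubName WHERE {{ {where} }}")
```

The result has the same shape as the Amazon query on the slide, with ?v2 playing the role of ?p.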
Concept mapping rules
• Case 1 – class translation
  – Vocabulary matching
    (am:Book,T,eb:Book,T)
    (am:Music,T,eb:Music,T)
  – Induced concept mapping
    eb:Book(x) ← am:Book(x)
    eb:Music(x) ← am:Music(x)
    eb:Product(x) ← am:Book(x)
    eb:Product(x) ← am:Music(x)
• Case 2 – simple property translation
  – Vocabulary matching
    (am:title,am:Book,eb:title,eb:Book)
    (am:Book,T,eb:Book,T)
  – Induced concept mapping
    eb:title(x,y) ← am:title(x,y), am:Book(x)
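Cases 1 and 2 can be sketched as a function from a vocabulary-matching tuple to a rule string. The tuple layout (S element, S context, T element, T context) follows the slide; the function name is hypothetical, and the sketch neither derives the eb:Product rules (which would need the class hierarchy) nor handles Case 3.

```python
TOP = "T"  # context marker used for classes in the vocabulary matching


def induce_rule(match):
    """Build the concept-mapping rule induced by one vocabulary-matching tuple."""
    s_elem, s_ctx, t_elem, t_ctx = match
    if s_ctx == TOP and t_ctx == TOP:
        # Case 1: class translation.
        return f"{t_elem}(x) ← {s_elem}(x)"
    # Case 2: simple property translation, restricted to the source context.
    return f"{t_elem}(x,y) ← {s_elem}(x,y), {s_ctx}(x)"


matching = [
    ("am:Book", "T", "eb:Book", "T"),
    ("am:Music", "T", "eb:Music", "T"),
    ("am:title", "am:Book", "eb:title", "eb:Book"),
]

for m in matching:
    print(induce_rule(m))
```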
Concept mapping rules
• Case 3 – property translation
  – Vocabulary matching
    (am:name,am:Publ,eb:publisher,eb:Book)
  – Concept mapping rule
    eb:publisher(x,y) ← am:publisher(x,s), am:name(s,y), am:Publ(s)
    but, since the domain of am:name is equal to its context, we may simplify to:
    eb:publisher(x,y) ← am:publisher(x,s), am:name(s,y)

[Diagram: eb:publisher goes from eb:Book to String; am:publisher goes from am:Book to am:Publ; am:name goes from am:Publ to String.]
Consistency of vocabulary matching
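• Consistency of the vocabulary matching for a concept mapping from T into S (T = eBay, S = Amazon)
  – a match between t:p and s:q is consistent with the integrity constraints only if the conditions highlighted in red in the original diagram hold
  – otherwise, discard the match to maintain consistency

[Diagram: in T, property t:p(x,s) with its domain, its context (t:Context) and its range (class/type). In S, property s:q(s,y) with its context (s:Context) and its range (class/type), reached by a path from s:Class to s:Context. Two pairs of elements are marked "equivalent".]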
Consistency of vocabulary matching
• Example (T = eBay, S = Amazon)
  – T: eb:publisher(x,s) has domain eb:Book, context eb:Book and range String
  – S: am:name(s,y) has context am:Publ and range String; the path am:publisher(x,s) leads from am:Book to am:Publ
  – the diagram marks two pairs of elements as equivalent

[Diagram: the consistency conditions instantiated for eb:publisher and am:name.]
Summary
• Problem decomposition
  – Vocabulary matching + concept mapping
  – Vocabulary matching induces the concept mapping
• Complex schema matching
  – Different similarity models for each matching task
  – Multiple representations for each type of vocabulary element