TENDER N° 10118 - EUROVOC Studies LOT2
D2.3 – “Report on execution and results of the interoperability tests”
Final Version
S. Faro, E. Francesconi, E. Marinai, V. Sandrucci
ITTIG-CNR – Institute of Legal Information Theory and Techniques, Italian National Research Council
Final Review Meeting – Luxembourg, January 24th, 2008
S. Faro, E. Francesconi, E. Marinai, V. Sandrucci
D2.3 – Execution and results of the interoperability tests
Overview
Formal characterization given to the thesaurus mapping problem
Interoperability workflow
– Thesauri SKOS Core transformation
– Thesaurus Mapping algorithms implementation
The “gold standard” data set and the THALEN application
Thesaurus interoperability assessment measures
Experimental results
Thesaurus Mapping (TM)
Definition: The process of identifying terms, concepts and hierarchical relationships that are approximately equivalent between thesauri.
The problem thus shifts to defining concept equivalence.
Concept equivalence
Definition (Instance-based equivalence): Two concepts are deemed to be equivalent if they are associated with, or classify, the same set of objects.
Definition (Schema-based equivalence): Two concepts are deemed to be equivalent if there exists a similarity among their features.
Identification of the problem characteristics
Thesaurus mapping for the project case study is a problem of term alignment, where only schema information is available.
It is not a problem of classifying instances with respect to a predefined schema of classes (instance-based matching).
It is a problem of measuring the conceptual/semantic similarity between a term (simple or complex) in the source thesaurus and candidate terms in a target thesaurus (schema-based matching).
Our proposal for Thesaurus Mapping formal characterization
We have proposed to characterize the problem of Thesaurus Mapping (TM) as a problem of Information Retrieval (IR).
In IR the aim is to find the documents, in a document collection, that best match the semantics of a query.
Similarly, in TM the aim is to find the terms, in a term collection (the target thesaurus), that best match the semantics of a given term in a source thesaurus.
TM                             IR
Term in source thesaurus   ⇒   Query
Term in target thesaurus   ⇒   Document
Our TM formal characterization
Definition: We propose to characterize TM as a 4-tuple $[D, Q, F, R(q_i, d_j)]$ where:
– $D$ is the set of possible representations (logical views) of a term in a target thesaurus (in IR, the documents in a collection)
– $Q$ is the set of possible representations (logical views) of a term in a source thesaurus (in IR, the queries to be matched with the documents of the collection)
– $F$ is the framework of term representations in source and target thesauri
– $R(q_i, d_j)$ is the ranking function, which associates a real number with each pair $(q_i, d_j)$, where $q_i \in Q$ and $d_j \in D$, giving an order of relevance to the terms $d_j$ of a target thesaurus with respect to a term $q_i$ of the source thesaurus
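The role of the ranking function can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names `rank_targets` and `token_overlap` are hypothetical, and the token-overlap score is only a toy stand-in for R, chosen so that the example is self-contained.

```python
from typing import Callable, List, Tuple


def rank_targets(
    q: str,
    targets: List[str],
    R: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every target term d against the source term q with R,
    then return the (term, score) pairs ordered by decreasing score."""
    scored = [(d, R(q, d)) for d in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def token_overlap(q: str, d: str) -> float:
    """Toy ranking function (an assumption for illustration only):
    Jaccard overlap between the lowercase word sets of the two terms."""
    a, b = set(q.lower().split()), set(d.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    target_terms = ["air pollution", "water pollution", "fishing"]
    for term, score in rank_targets("water pollution", target_terms, token_overlap):
        print(f"{score:.2f}  {term}")
```

Any function that returns a real number for a pair (source term, target term) can be passed as `R`, which mirrors the way the 4-tuple keeps the term representation and the ranking function separate.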
Isomorphism between TM and IR
TM ⇔ IR
Term lexical manifestation and semantics
Different terms can be used to identify the same concept:
– in the same language (e.g. ‘pollution’, ‘contamination’, ‘discharge of pollutants’);
– in different languages (e.g. EUROVOC EN term ‘water’ and IT term ‘acqua’).
TM should aim at matching term meanings (the semantics of the terms) rather than their formal (lexical) manifestations.
Hypothesis: The more terms in source and target thesauri are semantically characterized, the better the system will be able to match them according to their meanings, enhancing mapping reliability.
The proposed Logical Views of terms in source (Q) and target (D) thesauri
Representation of term meaning (the semantics of terms)
The semantics of a term is conveyed by:
1. its morphological characteristics
2. the context in which the term is used
3. the relations with other terms
We have proposed to represent the semantics of a term in a thesaurus by:
1. its Lexical Manifestation: strings (pre-processed strings)
2. its Lexical Context: vector of weighted/binary terms (the term itself and other related terms)
3. its Lexical Network: graph of terms (nodes are terms and labeled edges are relations)
A Lexical Manifestation
(Stemmed variation)
Parliamentary committees → Parliament$ committee$
A Lexical Context
A Lexical Context is a vector $\vec{d} = [w_1, \ldots, w_{|T|}]$ of binary or weighted terms, where $|T|$ is the size of the target thesaurus vocabulary $T$.
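As a minimal sketch of this representation, the Python code below builds binary Lexical Context vectors over a small vocabulary and compares them with the cosine similarity. The helper names, the four-word vocabulary and the example terms are assumptions made for illustration; weighted vectors would use real-valued weights in place of the 0/1 entries.

```python
import math
from typing import List, Sequence


def lexical_context(terms: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Binary vector over the vocabulary: 1.0 where the vocabulary entry
    appears among the given terms, 0.0 elsewhere."""
    present = set(terms)
    return [1.0 if entry in present else 0.0 for entry in vocabulary]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical vocabulary and term contexts.
    vocabulary = ["pollution", "water", "contamination", "air"]
    q = lexical_context(["pollution", "water"], vocabulary)
    d = lexical_context(["pollution", "air"], vocabulary)
    print(cosine_similarity(q, d))  # approximately 0.5: one shared entry out of two each
```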
A Lexical Network
The proposed Ranking Functions (R)
1. Lexical Manifestation: Levenshtein Distance/Similarity
2. Lexical Context: Cosine Distance/Similarity
3. Lexical Network: Graph Edit Distance/Similarity
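For the first ranking function, the sketch below computes the Levenshtein distance between two strings with the standard dynamic-programming recurrence. The normalization used to turn the distance into a similarity in [0, 1] (dividing by the length of the longer string) is an assumption; the slides name the measure but do not specify how the similarity is derived.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein_similarity("committee", "committees"))
```

Applied to pre-processed (for example stemmed) strings, a measure of this kind scores lexical manifestations that differ only by inflection as close to identical.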
Standards for Interoperability Environment
The interoperability environment is based on RDF standards for thesaurus description and mapping:
– SKOS Core
– SKOS Mapping (exactMatch, partial match (broadMatch, narrowMatch))
Interoperability Workflow
Workflow
1. SKOS Core transformation of each thesaurus (XSLT technologies)
2. Thesaurus term pre-processing
3. Thesaurus term representation
   – Lexical Manifestation; Lexical Context; Lexical Network
4. Thesaurus term candidate selection for mapping
   – Levenshtein Distance (for the Lexical Manifestation)
   – Cosine Distance (for the Lexical Context)
   – Graph Edit Distance (for the Lexical Network)
5. Ranking among candidate terms and mapping implementation
   – if sim < T1 ⇒ No Match
   – if T1 < sim < T2 ⇒ partial match (broadMatch or narrowMatch)
   – if T2 < sim ⇒ exactMatch
6. Representation of the semantics of mapping in SKOS Mapping
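The threshold rule of step 5 can be sketched in Python as follows. The function name `mapping_relation` and the threshold values in the usage example are hypothetical. The slides use strict inequalities and so leave the boundary cases open; this sketch assumes that sim = T1 counts as a partial match and sim = T2 as an exact match.

```python
def mapping_relation(sim: float, t1: float, t2: float) -> str:
    """Map a similarity score to a mapping relation using two thresholds.

    Assumption: boundary values are assigned to the stronger relation
    (sim == t1 gives a partial match, sim == t2 gives exactMatch).
    """
    if t1 > t2:
        raise ValueError("t1 must not exceed t2")
    if sim < t1:
        return "No Match"
    if sim < t2:
        return "partial match"  # to be refined into broadMatch or narrowMatch
    return "exactMatch"


if __name__ == "__main__":
    # Hypothetical thresholds, for illustration only.
    T1, T2 = 0.5, 0.9
    for score in (0.2, 0.7, 0.95):
        print(score, mapping_relation(score, T1, T2))
```

Choosing between broadMatch and narrowMatch for a partial match needs information beyond the similarity score, so the sketch leaves that refinement out.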
SKOS Core thesauri transformation
Technologies for SKOS Core transformation
XSLT technologies to transform XML proprietary formats ⇒ RDF SKOS Core
SKOS Core
SKOS Core: Relations (properties)
SKOS Core: Notes (properties)
SKOS Core: Concept Schemes (classes and properties)
SKOS Core: Collections (classes and properties)
SKOS Core: Subject Indexing (classes and properties)
SKOS Mapping
SKOS Core transformation procedures
SKOS Core EUROVOC transformation
XML elements                    SKOS Core
DOMAINE_ID                      skos:Collection
THESAURUS_ID                    skos:ConceptScheme
DESCRIPTEUR_ID                  skos:Concept
LIBELLE                         skos:prefLabel
USED_FOR                        skos:altLabel
PERMUTATION (/PERM/PERM_EL)     skos:altLabel
SCOPE_NOTE                      skos:scopeNote
RELATION_BT                     skos:broader
RELATION_RT                     skos:related
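The element-to-property correspondence above can be read as a lookup table. The project performs the transformation with XSLT; the Python sketch below only illustrates the mapping logic. The `RECORD` wrapper element, the flat record layout and the identifiers in the sample are hypothetical and do not reproduce the actual EUROVOC XML format.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Subset of the correspondence table: XML element name -> SKOS Core property.
EUROVOC_TO_SKOS = {
    "LIBELLE": "skos:prefLabel",
    "USED_FOR": "skos:altLabel",
    "SCOPE_NOTE": "skos:scopeNote",
    "RELATION_BT": "skos:broader",
    "RELATION_RT": "skos:related",
}


def record_to_statements(xml_text: str) -> List[Tuple[str, str, str]]:
    """Turn one (hypothetical) flat record into (subject, property, value)
    statements, typing the descriptor as a skos:Concept."""
    record = ET.fromstring(xml_text)
    concept_id = record.findtext("DESCRIPTEUR_ID", default="").strip()
    statements = [(concept_id, "rdf:type", "skos:Concept")]
    for child in record:
        skos_property = EUROVOC_TO_SKOS.get(child.tag)
        if skos_property is not None:
            statements.append((concept_id, skos_property, (child.text or "").strip()))
    return statements


# Hypothetical sample record; the identifiers 100 and 200 are made up.
SAMPLE = """<RECORD>
  <DESCRIPTEUR_ID>100</DESCRIPTEUR_ID>
  <LIBELLE>academic freedom</LIBELLE>
  <RELATION_BT>200</RELATION_BT>
</RECORD>"""

if __name__ == "__main__":
    for statement in record_to_statements(SAMPLE):
        print(statement)
```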
EUROVOC XSLT Transformation: eurovoc2skos.xslt ...
EUROVOC in SKOS Core format
... undervisningsfrihed, Freiheit der Lehre, academic freedom, libertad de enseñanza, akadeemiline vabadus, akateeminen vapaus, liberté de l’enseignement, akademske slobode ...
SKOS Core UNESCO Thesaurus transformation
XML elements                    SKOS Core
DATABASE_UNESEN/RECORD/MT       skos:ConceptScheme
MT                              skos:inScheme
RECORD                          skos:Concept
Term, Terme, Término            skos:prefLabel
UF, EP, UP, USE, EMP            skos:altLabel
SN, NE, NA                      skos:scopeNote
BT, TG                          skos:broader
NT, TS, TE                      skos:narrower
RT, TA, TR                      skos:related
UNESCO XSLT Transformation: unesco2skos.xsl