Report on execution and results of the interoperability tests - Eurovoc

TENDER N° 10118 - EUROVOC Studies LOT2

D2.3 – “Report on execution and results of the interoperability tests”
Final Version

S. Faro, E. Francesconi, E. Marinai, V. Sandrucci
ITTIG-CNR – Institute of Legal Information Theory and Techniques
Italian National Research Council

Final Review Meeting – Luxembourg, January 24th, 2008

S. Faro, E. Francesconi, E. Marinai, V. Sandrucci

D2.3 – Execution and results of the interoperability tests

Overview

Formal characterization given to the thesaurus mapping problem

Interoperability workflow
– Thesauri SKOS Core transformation
– Thesaurus Mapping algorithms implementation

The “gold standard” data set and the THALEN application

Thesaurus interoperability assessment measures

Experimental results

Thesaurus Mapping (TM)

Definition: The process of identifying terms, concepts and hierarchical relationships that are approximately equivalent between thesauri

The problem thus shifts to the definition of concept equivalence

Concept equivalence

Definition (Instance-based equivalence): Two concepts are deemed to be equivalent if they are associated with, or classify, the same set of objects

Definition (Schema-based equivalence): Two concepts are deemed to be equivalent if there exists a similarity among their features

Identification of the problem characteristics

Thesaurus mapping for the project case study is a problem of term alignment, where only schema information is available

It is not a problem of classifying instances with respect to a predefined schema of classes (Instance-based matching)

It is a problem of measuring the conceptual/semantic similarity between a term (simple or complex) in the source thesaurus and candidate terms in a target thesaurus (Schema-based matching)

Our proposal for Thesaurus Mapping formal characterization

We have proposed to characterize the problem of Thesaurus Mapping (TM) as a problem of Information Retrieval (IR)

In IR the aim is to find the documents, in a document collection, that best match the semantics of a query

Similarly, in TM the aim is to find the terms, in a term collection (target thesaurus), that best match the semantics of a given term in a source thesaurus

TM                          ⇒   IR
Term in source thesaurus    ⇒   Query
Term in target thesaurus    ⇒   Document

Our TM formal characterization

Definition: We propose to characterize TM as a 4-tuple [D, Q, F, R(q_i, d_j)] where:
– D is the set of possible representations (logical views) of a term in a target thesaurus (in IR, the documents in a collection)
– Q is the set of possible representations (logical views) of a term in a source thesaurus (in IR, the queries to be matched with documents of the collection)
– F is the framework of term representations in source and target thesauri
– R(q_i, d_j) is the ranking function, which associates a real number with a pair (q_i, d_j), where q_i ∈ Q and d_j ∈ D, giving an order of relevance to the terms d_j of a target thesaurus with respect to a term q_i of the source thesaurus
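
The role of the ranking function R can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logical view of a term is taken to be its plain string, and `shared_token_ratio` is a hypothetical toy stand-in for R (token overlap), not one of the ranking functions used in the project. The names `rank_candidates` and `RankingFunction` are likewise illustrative.

```python
from typing import Callable, List, Tuple

# Assumption: a ranking function R(q_i, d_j) takes two term representations
# (plain strings here) and returns a real-valued relevance score.
RankingFunction = Callable[[str, str], float]


def rank_candidates(
    source_term: str,
    target_terms: List[str],
    ranking: RankingFunction,
) -> List[Tuple[str, float]]:
    """Order the target terms d_j by decreasing R(q_i, d_j) for one source term q_i."""
    scored = [(d, ranking(source_term, d)) for d in target_terms]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def shared_token_ratio(q: str, d: str) -> float:
    """Toy stand-in for R: Jaccard overlap of lower-cased tokens (hypothetical)."""
    q_tokens, d_tokens = set(q.lower().split()), set(d.lower().split())
    union = q_tokens | d_tokens
    return len(q_tokens & d_tokens) / len(union) if union else 0.0


ranked = rank_candidates(
    "water pollution",
    ["air pollution", "water pollution", "fishing"],
    shared_token_ratio,
)
```

Any function with the same signature can be passed as `ranking`, which mirrors the way the definition keeps R separate from the term representations.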

Isomorphism between TM and IR

TM ⇐⇒ IR

Term lexical manifestation and semantics

Different terms can be used to identify the same concept
– in the same language (e.g. ‘pollution’, ‘contamination’, ‘discharge of pollutants’)
– in different languages (e.g. EUROVOC EN term ‘water’ and IT term ‘acqua’)

TM should aim at matching term meanings (the semantics of the terms) rather than their formal (lexical) manifestations

Hypothesis: The more the terms in source and target thesauri are semantically characterized, the better the system will be able to match them according to their meanings, enhancing mapping reliability

The proposed Logical Views of terms in source (Q) and target (D) thesauri

Representation of term meaning (the semantics of terms)

The semantics of a term is conveyed by:
1. its morphological characteristics
2. the context in which the term is used
3. the relations with other terms

We have proposed to represent the semantics of a term in a thesaurus by:
1. its Lexical Manifestation: strings (pre-processed strings)
2. its Lexical Context: vector of weighted/binary terms (the term itself and other related terms)
3. its Lexical Network: graph of terms (nodes are terms and labeled edges are relations)

A Lexical Manifestation (stemmed variation)

Parliamentary committees ⇒ Parliament$ committee$

A Lexical Context

A Lexical Context is a vector d of binary/weighted terms [w_1, ..., w_|T|], where |T| is the dimension of the target thesaurus vocabulary
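
As an illustration of how two Lexical Contexts could be compared, the sketch below computes the cosine similarity between two term vectors. It assumes a sparse dictionary representation (term to weight) instead of a dense vector of length |T|, and binary weights built by a hypothetical helper `binary_context`. The example terms are illustrative only and are not taken from the project data.

```python
import math
from typing import Dict, Iterable


def binary_context(terms: Iterable[str]) -> Dict[str, float]:
    """Build a binary Lexical Context: weight 1.0 for every term that is present."""
    return {term.lower(): 1.0 for term in terms}


def cosine_similarity(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(weight * d.get(term, 0.0) for term, weight in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


# Illustrative contexts: the term itself plus some related terms.
source = binary_context(["pollution", "contamination", "environment"])
target = binary_context(["pollution", "environment", "waste"])
similarity = cosine_similarity(source, target)
```

With binary weights the score depends only on how many terms the two contexts share; weighted variants only change how the dictionaries are filled.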

A Lexical Network

The proposed Ranking Functions (R)
1. Lexical Manifestation: Levenshtein Distance/Similarity
2. Lexical Context: Cosine Distance/Similarity
3. Lexical Network: Graph Edit Distance/Similarity
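
For the first of these ranking functions, a minimal Python sketch of the Levenshtein distance and a derived similarity is given below. The dynamic-programming distance is the standard one. Normalizing by the length of the longer string to obtain a similarity in [0, 1] is an assumption made here for illustration; the slides do not state which normalization was used.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

In practice the strings compared would be the pre-processed Lexical Manifestations, so that variants of the same stem differ by few or no edits.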

Standards for Interoperability Environment

The interoperability environment is based on RDF standards for thesaurus description and mapping:

– SKOS Core

– SKOS Mapping (exactMatch, partial match (broadMatch, narrowMatch))

Interoperability Workflow

Workflow

Workflow
1. SKOS Core transformation of each thesaurus (XSLT technologies)
2. Thesaurus term pre-processing
3. Thesaurus term representation
   – Lexical Manifestation
   – Lexical Context
   – Lexical Network
4. Thesaurus term candidate selection for mapping
   – Levenshtein Distance (for the Lexical Manifestation)
   – Cosine Distance (for the Lexical Context)
   – Graph Edit Distance (for the Lexical Network)
5. Ranking among candidate terms and mapping implementation
   – if sim < T1 ⇒ No Match
   – if T1 < sim < T2 ⇒ partial match (broadMatch or narrowMatch)
   – if T2 < sim ⇒ exactMatch
6. Representation of the semantics of mapping in SKOS Mapping
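
The threshold rule of the ranking step can be sketched as a small Python function. Several points are assumptions made for this sketch: the function name `mapping_relation` and the threshold values in the example are hypothetical; the slides use strict inequalities and leave sim = T1 and sim = T2 undefined, so the boundary values are assigned here to the higher band; and since the slides do not say how broadMatch is distinguished from narrowMatch, the sketch only reports a generic partial match.

```python
from typing import Optional


def mapping_relation(sim: float, t1: float, t2: float) -> Optional[str]:
    """Classify a similarity score against the two thresholds T1 <= T2.

    Returns None for "No Match", "partial match" for the middle band
    (broadMatch or narrowMatch, not decided here), and "exactMatch" otherwise.
    """
    if t1 > t2:
        raise ValueError("expected T1 <= T2")
    if sim < t1:
        return None
    if sim < t2:  # assumption: sim == T1 falls in the partial band
        return "partial match"
    return "exactMatch"  # assumption: sim == T2 counts as exact


# Hypothetical threshold values, chosen only for the example.
T1, T2 = 0.5, 0.9
labels = [mapping_relation(s, T1, T2) for s in (0.2, 0.7, 0.95)]
```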

SKOS Core thesauri transformation

Technologies for SKOS Core transformation

XSLT technologies to transform proprietary XML formats ⇒ RDF SKOS Core

SKOS Core

SKOS Core: Relations (properties)

SKOS Core: Notes (properties)

SKOS Core: Concept Schemes (classes and properties)

SKOS Core: Collections (classes and properties)

SKOS Core: Subject Indexing (classes and properties)

SKOS Mapping

SKOS Core transformation procedures

SKOS Core EUROVOC transformation

XML elements                   SKOS Core
DOMAINE_ID                     skos:Collection
THESAURUS_ID                   skos:ConceptScheme
DESCRIPTEUR_ID                 skos:Concept
LIBELLE                        skos:prefLabel
USED_FOR                       skos:altLabel
PERMUTATION (/PERM/PERM_EL)    skos:altLabel
SCOPE_NOTE                     skos:scopeNote
RELATION_BT                    skos:broader
RELATION_RT                    skos:related
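
The correspondence above can be read as a lookup table. The Python sketch below applies it to a single record and emits subject/property/value statements. It is not the project's XSLT stylesheet: Python is used here only for illustration, and the flat `<RECORD>` layout, the identifiers `c_0001` and `c_0002`, and the function name `record_to_statements` are all hypothetical. The nested PERMUTATION (/PERM/PERM_EL) element is left out because its exact structure is not shown in the slides.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Element-to-property correspondences taken from the table above
# (PERMUTATION omitted: its nested layout is not shown in the slides).
PROPERTY_MAP = {
    "LIBELLE": "skos:prefLabel",
    "USED_FOR": "skos:altLabel",
    "SCOPE_NOTE": "skos:scopeNote",
    "RELATION_BT": "skos:broader",
    "RELATION_RT": "skos:related",
}

# Hypothetical, simplified record; the real EUROVOC XML layout may differ.
SAMPLE = """<RECORD>
  <DESCRIPTEUR_ID>c_0001</DESCRIPTEUR_ID>
  <LIBELLE>academic freedom</LIBELLE>
  <RELATION_BT>c_0002</RELATION_BT>
</RECORD>"""


def record_to_statements(xml_text: str) -> List[Tuple[str, str, str]]:
    """Turn one record into (subject, property, value) statements."""
    record = ET.fromstring(xml_text)
    subject = record.findtext("DESCRIPTEUR_ID")
    if subject is None:
        raise ValueError("record has no DESCRIPTEUR_ID")
    # DESCRIPTEUR_ID identifies a skos:Concept according to the table.
    statements = [(subject, "rdf:type", "skos:Concept")]
    for child in record:
        prop = PROPERTY_MAP.get(child.tag)
        if prop is not None and child.text:
            statements.append((subject, prop, child.text.strip()))
    return statements


statements = record_to_statements(SAMPLE)
```

Keeping the correspondences in one dictionary means a different source vocabulary can be handled by swapping the table while the traversal stays the same.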

EUROVOC XSLT Transformation: eurovoc2skos.xslt

...

EUROVOC in SKOS Core format

...
undervisningsfrihed
Freiheit der Lehre
academic freedom
libertad de enseñanza
akadeemiline vabadus
akateeminen vapaus
liberté de l’enseignement
akademske slobode
...

SKOS Core UNESCO Thesaurus transformation

XML elements                   SKOS Core
DATABASE_UNESEN/RECORD/MT      skos:ConceptScheme
MT                             skos:inScheme
RECORD                         skos:Concept
Term, Terme, Término           skos:prefLabel
UF, EP, UP, USE, EMP           skos:altLabel
SN, NE, NA                     skos:scopeNote
BT, TG                         skos:broader
NT, TS, TE                     skos:narrower
RT, TA, TR                     skos:related

UNESCO XSLT Transformation: unesco2skos.xsl
