
UNITOR-CORE TYPED Combining Text Similarity and Semantic Filters through SV Regression

Danilo Croce, Valerio Storch and Roberto Basili
University of Roma Tor Vergata

*SEM, Atlanta, June 2013

Outline

- Modeling STS through kernel-based regression
  - similarity functions
- Semantic constraints for the Typed STS
- STS challenge results
  - Core-STS
  - Typed-STS

Textual similarity as SV regression

- STS is modeled as a Support Vector (SV) regression problem [Smola and Scholkopf, 2004]
- The semantic relatedness between two texts is first redundantly modeled through a set of independent similarity functions
  - each function reflects a specific semantic perspective, e.g. syntactic or lexical similarity
- A Support Vector regressor learns the proper combination of the different functions, which are acquired in an unsupervised fashion
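The combination step above can be sketched as follows. This is a minimal illustration of SV regression over similarity features, assuming scikit-learn's `SVR`; the feature values and gold scores are invented for the example, not taken from the system.

```python
# Minimal sketch: an SV regressor combining independent similarity
# scores into a single STS score. Data below is purely illustrative.
from sklearn.svm import SVR

# Each sentence pair is described by a vector of similarity scores
# (e.g. lexical overlap, word-space cosine, syntactic kernels).
X_train = [
    [0.80, 0.91, 0.75],   # highly similar pair
    [0.10, 0.22, 0.15],   # dissimilar pair
    [0.55, 0.60, 0.48],   # partially similar pair
]
y_train = [4.5, 0.5, 2.8]  # gold STS scores in [0, 5]

regressor = SVR(kernel="linear")
regressor.fit(X_train, y_train)

# Predict the STS score of a new pair from its similarity features.
score = float(regressor.predict([[0.70, 0.85, 0.66]])[0])
```

The regressor, not a hand-tuned formula, decides how much each similarity perspective contributes to the final score.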

STS functions: lexical information

- Lexical Overlap (LO) considers the lexical similarity between sentences
  - LO is the Jaccard similarity score between the word sets of the two sentences
- Lexical information is generalized through a Word Space model
  - each word is a vector in a space where distance reflects semantic relations
- Linear Combination (SUM)
  - a sentence is modeled as the linear combination of the vectors of its words
  - similarity is the cosine similarity between the resulting representations
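The two lexical functions above can be sketched as follows. The toy 2-dimensional vectors stand in for a real Word Space model and are purely illustrative.

```python
# Sketch of the two lexical similarity functions: LO (Jaccard over
# word sets) and SUM (cosine over summed word vectors).
import numpy as np

def lexical_overlap(s1, s2):
    """Jaccard similarity between the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def sum_similarity(s1, s2, space):
    """Cosine similarity between sentence vectors built as the
    linear combination (sum) of their word vectors."""
    v1 = sum(space[w] for w in s1.lower().split() if w in space)
    v2 = sum(space[w] for w in s2.lower().split() if w in space)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

space = {  # toy 2-dimensional "word space"
    "cat": np.array([1.0, 0.1]), "dog": np.array([0.9, 0.2]),
    "sleeps": np.array([0.1, 1.0]), "runs": np.array([0.2, 0.9]),
}
lo = lexical_overlap("the cat sleeps", "the dog sleeps")    # 2/4 = 0.5
sim = sum_similarity("cat sleeps", "dog runs", space)
```

Note how SUM scores "cat sleeps" vs "dog runs" as highly similar even with zero word overlap: the word space generalizes beyond exact matches, which is exactly what LO cannot do.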

STS functions: introducing syntax

- A Distributional Compositional Semantics based operator
  - a sentence is a set of syntactically restricted compounds (e.g. V-SBJ)
  - the similarity between syntactic bi-grams is modeled as a projection into lexically-driven subspaces [Annesi et al., 2012]
  - the Syntactic Soft Cardinality (SSC) operator combines the contributions of the specific compounds through the soft-cardinality function [Jimenez et al., 2012]
- A semantically Smoothed Partial Tree Kernel (SPTK) [Croce et al., 2011]
  - an operator based on a Convolution Kernel jointly modeling syntactic and lexical semantic similarity between the two sentences
  - it extends the similarity between tree structures with a node similarity function (here derived from the Word Space)
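The soft-cardinality function that SSC builds on [Jimenez et al., 2012] can be sketched as follows: the "soft" size of a set discounts elements that are similar to each other. The character-based similarity used here is only a stand-in for the system's compound similarity.

```python
# Sketch of soft cardinality: |A|' = sum_i 1 / sum_j sim(a_i, a_j)^p,
# with sim(a, a) = 1. Near-duplicate elements are counted almost once.
def soft_cardinality(items, sim, p=1.0):
    return sum(
        1.0 / sum(sim(a, b) ** p for b in items)
        for a in items
    )

def char_jaccard(a, b):
    """Toy element similarity: Jaccard over character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Dissimilar elements keep the crisp count; similar ones shrink it.
distinct = soft_cardinality(["cat", "dog"], char_jaccard)   # 2.0
similar = soft_cardinality(["cat", "cats"], char_jaccard)   # < 2.0
```

This is what lets SSC weigh a set of syntactic compounds without double-counting near-identical ones.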

Semantic constraints for the Typed STS

"The chemist R.S. Hudson began manufacturing soap in the back of his small shop in West Bomich in 1837"

- PERSON (e.g. R.S. Hudson): people involved
- Verbs (e.g. began manufacturing): event
- LOCATION (e.g. West Bomich): location
- TIME (e.g. 1837): time

Semantic constraints for the Typed STS

- Given a semi-structured source, we filter fields by their type
  - when a time-based similarity is targeted, the dcDate field should be considered while dcCreator may be neglected
  - other fields may contain useful information, but it should be filtered
- An information extraction system is used to extract the useful information, e.g. temporal expressions
- Similarity functions are applied to the selected fields, where the specific phrases have been extracted
- Example: the time similarity
  - the dcDate field is fully considered
  - only phrases expressing temporal information are considered within the dcTitle, dcSubject and dcDescription fields
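The field filtering described above can be sketched as follows. The field names follow the slides; the policy table and the toy year-matching extractor are illustrative stand-ins for the real IE system.

```python
# Sketch of type-based field filtering for the time similarity:
# "full" fields are used as-is, "filtered" fields contribute only
# extracted temporal phrases, and the rest are ignored.
import re

TIME_POLICY = {
    "dcDate": "full",
    "dcTitle": "filtered",
    "dcSubject": "filtered",
    "dcDescription": "filtered",
    "dcCreator": "ignore",
}

def extract_dates(text):
    """Toy temporal extractor: keeps 4-digit years only."""
    return re.findall(r"\b\d{4}\b", text)

def time_view(record):
    """Text actually passed to the time similarity functions."""
    parts = []
    for field, policy in TIME_POLICY.items():
        value = record.get(field, "")
        if policy == "full":
            parts.append(value)
        elif policy == "filtered":
            parts.extend(extract_dates(value))
    return " ".join(p for p in parts if p)

record = {
    "dcTitle": "Soap manufacturing since 1837",
    "dcCreator": "R.S. Hudson",
    "dcDate": "1837",
}
view = time_view(record)   # only temporal material survives
```

The similarity functions then operate on `view`, so the creator's name can no longer inflate a time-based score.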

UNITOR-CORE: Experimental Setup

- A regressor has been trained in a 13-dimensional feature space
  - 5 scores from LO
  - 5 scores from SUM
  - 1 score from SSC
  - 2 scores from SPTK
- The word space model is derived from the distributional analysis of the ukWaC corpus
- Three runs differing in the definition of the training set
  - Run1: the training datasets are heuristically selected, i.e. one regressor per test dataset; a linear kernel is employed
  - Run2: all available data are used within a single regressor
  - Run3: same training sets as Run1; a gaussian kernel is employed
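The 13-dimensional feature layout above can be made concrete as follows; all score values here are invented for illustration.

```python
# Sketch of the 13-dimensional feature vector fed to the regressor:
# 5 LO + 5 SUM + 1 SSC + 2 SPTK scores (values are illustrative).
lo = [0.50, 0.48, 0.52, 0.47, 0.51]     # 5 Lexical Overlap scores
sum_ = [0.70, 0.68, 0.72, 0.66, 0.71]   # 5 word-space SUM scores
ssc = [0.61]                            # 1 Syntactic Soft Cardinality score
sptk = [0.58, 0.63]                     # 2 Smoothed Partial Tree Kernel scores

features = lo + sum_ + ssc + sptk       # 13-dimensional vector
```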

UNITOR-Core: Results

Dataset    Run1       Run2       Run3       Run1*
headlines  .635 (50)  .651 (39)  .603 (58)  .671 (30)
OnWN       .574 (33)  .561 (36)  .549 (40)  .637 (25)
FNWN       .352 (35)  .358 (32)  .327 (44)  .459 (07)
SMT        .328 (39)  .310 (49)  .319 (44)  .348 (21)
Mean       .494 (37)  .490 (42)  .472 (52)  .537 (19)

(official rank in parentheses)

- The selection of a wrong training dataset produces a performance drop
  - Run1*: a better selection of the training material, i.e. the datasets maximizing performance
  - improvement from the 37th to the 19th position

UNITOR-Typed: Experimental Setup

- The same SV regressor-based schema is applied
  - a specific regressor is learned for each target similarity
  - texts are not sentential, so only the LO and SUM functions are employed
  - the Stanford NER system extracts phrases referring to the classes PERSON, TIME and LOCATION

Fields considered for the time similarity:

        dcTitle  dcSubject  dcDesc.  dcCreator  dcDate  dcSource
time    DATE     DATE       DATE     -          *       -

(* = field fully considered; DATE = only extracted DATE phrases are used; - = field ignored)

UNITOR-Typed: Results

- Two runs
  - Run1: a linear kernel is used within the regressor
  - Run2: a gaussian kernel is used

Run   general  author  peop.inv.  time   location  event  subject  descr.  mean
1     .7981    .8158   .6922      .7471  .7723     .6835  .7875    .7996   .7620
2     .7564    .8076   .6758      .7090  .7351     .6623  .7520    .7745   .7341

- Error analysis: some scores are overestimated due to "coincidences", e.g.:
  "The Octagon and Pavilions, Pavilion Garden, Buxton, c 1875"
  vs. "The Beatles, The Octagon, Pavillion Gardens, St John's Road, Buxton, 1963"

Conclusion

- We modeled STS as a SV regression problem
  - a SV regressor learns how to combine basic similarity measures
  - we apply simple but effective semantic filters to emphasize specific information
  - no hand-coded resource is used
- Future work
  - improve the domain adaptation of the proposed similarity functions and combination approach
  - provide a method to properly select the training material

Thank you for your attention…
