UNITOR-CORE TYPED: Combining Text Similarity and Semantic Filters through SV Regression
Danilo Croce, Valerio Storch and Roberto Basili (University of Roma, Tor Vergata)
*SEM - Atlanta, June 2013
Outline
- Modeling STS through kernel-based regression
  - Similarity functions
- Semantic constraints for the Typed STS
- STS challenge results
  - Core-STS
  - Typed-STS
Textual similarity as SV regression
- STS is modeled as a Support Vector (SV) regression problem [Smola and Scholkopf, 2004]
- The semantic relatedness between two texts is first redundantly modeled through a set of independent similarity functions
  - each function reflects a specific semantic perspective, e.g. syntactic or lexical similarity
- A Support Vector regressor learns the proper combination of the different functions, which are acquired in an unsupervised fashion
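The regression step above can be sketched as follows; this is a minimal illustration with scikit-learn's SVR, where the feature values and the two hypothetical similarity functions are made up for the example and do not reproduce the paper's actual feature set.

```python
# Sketch: combining independent similarity scores with SV regression.
# Feature values below are illustrative, not the paper's real data.
from sklearn.svm import SVR

# Each row: scores for one sentence pair from two hypothetical similarity
# functions (e.g. a lexical overlap and a word-space cosine).
# Target: a gold similarity judgment on the usual [0, 5] STS scale.
X_train = [
    [0.80, 0.75],
    [0.10, 0.20],
    [0.55, 0.60],
    [0.30, 0.25],
]
y_train = [4.5, 0.5, 3.0, 1.5]

regressor = SVR(kernel="linear")
regressor.fit(X_train, y_train)

# Predict a combined similarity score for an unseen pair.
score = float(regressor.predict([[0.70, 0.65]])[0])
print(round(score, 2))
```

The regressor plays the role of the learned combination: each basic function contributes one (or more) feature dimensions, and training fits their relative weights against the gold similarity judgments.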
STS functions: lexical information
- Lexical Overlap (LO) considers the lexical similarity between sentences
  - LO is the Jaccard Similarity score between the word sets
- Lexical information is generalized through a Word Space model
  - each word is a vector in a space where distance reflects semantic relations
- Linear Combination (SUM)
  - a sentence is modeled as the linear combination of its word vectors
  - similarity is the cosine similarity between the resulting representations
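The two lexical functions can be sketched as below; the tiny hand-built word space is purely illustrative, standing in for a word space acquired distributionally from a large corpus.

```python
# Sketch of the two lexical similarity functions described above.
# The toy word vectors are made up for illustration only.
import math

def lexical_overlap(s1, s2):
    """LO: Jaccard similarity between the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def sum_similarity(s1, s2, space):
    """SUM: cosine similarity between sentence vectors built as the
    sum of their word vectors (words missing from the space are skipped)."""
    def sent_vec(s):
        vecs = [space[w] for w in s.lower().split() if w in space]
        return [sum(dims) for dims in zip(*vecs)]
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return cosine(sent_vec(s1), sent_vec(s2))

# Toy word space: 2-dimensional vectors where nearby words get nearby vectors.
space = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "sleeps": [0.1, 1.0]}
lo = lexical_overlap("the cat sleeps", "the dog sleeps")
print(round(lo, 2))  # 2 shared words out of 4 distinct -> 0.5
```

Note how SUM can assign a high score to the pair even where LO is only 0.5, because "cat" and "dog" have nearby vectors: this is the generalization the Word Space provides.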
STS functions: introducing syntax
- A Distributional Compositional Semantics operator
  - a sentence is treated as a set of syntactically restricted compounds (e.g. V-SBJ)
  - syntactic bi-gram similarity is modeled as a projection into lexically-driven subspaces [Annesi et al., 2012]
  - the Syntactic Soft Cardinality (SSC) operator combines the contributions of specific compounds through the soft cardinality function [Jimenez et al., 2012]
- A semantically Smoothed Partial Tree Kernel (SPTK) operator [Croce et al., 2011]
  - a Convolution Kernel jointly modeling syntactic and lexical semantic similarity in both sentences
  - it extends the similarity between tree structures with a node similarity function (here derived from the Word Space)
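The soft cardinality idea underlying SSC can be sketched as follows; the character-overlap similarity and the example items are illustrative stand-ins for the paper's distributional similarity over syntactic compounds.

```python
# Sketch of the soft cardinality function [Jimenez et al., 2012] that SSC
# builds on: the "size" of a set shrinks when its elements resemble each
# other, so near-duplicate compounds do not count twice.
def char_sim(a, b):
    """Toy element similarity: Jaccard overlap of character sets
    (a stand-in for a distributional similarity over compounds)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def soft_cardinality(items, sim, p=1.0):
    """Each element contributes 1 / (sum of its similarities to all
    elements): identical elements collapse, dissimilar ones count fully."""
    return sum(1.0 / sum(sim(a, b) ** p for b in items) for a in items)

# With a crisp 0/1 similarity, soft cardinality reduces to plain cardinality.
crisp = soft_cardinality(["cat", "dog", "bird"],
                         lambda a, b: 1.0 if a == b else 0.0)
print(crisp)  # -> 3.0
```

With a graded similarity the count drops below the crisp cardinality, which is what lets SSC weigh partially redundant syntactic compounds smoothly instead of all-or-nothing.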
Semantic constraints for the Typed STS
"The chemist R.S. Hudson began manufacturing soap in the back of his small shop in West Bomich in 1837"
- PERSON (e.g. R.S. Hudson): people involved
- Verbs (e.g. began manufacturing): event
- LOCATION (e.g. West Bomich): location
- TIME (e.g. 1837): time
Semantic constraints for the Typed STS
- Given a semi-structured source, fields are filtered according to their type
  - when a time-based similarity is targeted, the dcDate field should be considered, while dcCreator may be neglected
  - other fields may contain useful information, but it should be filtered
- An information extraction system extracts the useful information, e.g. temporal expressions
- Similarity functions are applied only to the selected fields, where specific phrases have been extracted
- Example: the time similarity
  - the dcDate field is fully considered
  - only phrases expressing temporal information are considered within dcTitle, dcSubject and dcDescription
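The time-similarity filtering can be sketched as below; the regex-based year extractor and the helper name `time_view` are illustrative simplifications of the information extraction system used in the paper.

```python
# Sketch of the typed filtering step for the "time" similarity.
# A simple year regex stands in for the real temporal-expression extractor;
# field names follow the Dublin Core fields mentioned above.
import re

YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def time_view(record):
    """Keep dcDate in full; from the other textual fields keep only
    phrases expressing temporal information (here: year mentions)."""
    parts = [record.get("dcDate", "")]
    for field in ("dcTitle", "dcSubject", "dcDescription"):
        parts.extend(YEAR.findall(record.get(field, "")))
    return " ".join(p for p in parts if p)

record = {
    "dcTitle": "The Octagon and Pavilions, Buxton, c 1875",
    "dcCreator": "R.S. Hudson",   # neglected for the time similarity
    "dcDate": "1875",
}
print(time_view(record))
```

The similarity functions (e.g. LO and SUM) are then applied to these filtered views of the two records rather than to the full field contents.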
UNITOR-CORE: Experimental Setup
- A regressor has been trained in a 13-dimensional feature space
  - 5 scores from LO
  - 5 scores from SUM
  - 1 score from SSC
  - 2 scores from SPTK
- The word space model is derived from the distributional analysis of the UkWaC corpus
- Three runs, differing in the training set definition
  - Run1: training datasets are heuristically selected, i.e. one regressor per test dataset; a linear kernel is employed
  - Run2: all data are used to train a single regressor
  - Run3: same training sets as Run1; a gaussian kernel is employed
UNITOR-Core: Results (official rank in parentheses)

            Run1       Run2       Run3       Run1*
headlines   .635 (50)  .651 (39)  .603 (58)  .671 (30)
OnWN        .574 (33)  .561 (36)  .549 (40)  .637 (25)
FNWN        .352 (35)  .358 (32)  .327 (44)  .459 (07)
SMT         .328 (39)  .310 (49)  .319 (44)  .348 (21)
Mean        .494 (37)  .490 (42)  .472 (52)  .537 (19)

- Selecting the wrong training dataset causes a performance drop
  - Run1*: a better selection of the training material, i.e. the datasets maximizing performance
  - improvement from the 37th to the 19th position
UNITOR-Typed: Experimental Setup
- The same SV regressor-based schema is applied
  - a specific regressor is learned for each target similarity
  - texts are not sentential, so the LO and SUM functions are employed
  - the Stanford NER system extracts phrases referring to the classes PERSON, TIME and LOCATION

Field filtering for the time similarity (*: field fully considered; DATE: only temporal phrases considered; -: field neglected):

        dcTitle  dcSubject  dcDesc.  dcCreator  dcDate  dcSource
time    DATE     DATE       DATE     -          *       -
UNITOR-Typed: Results
- Two runs
  - Run1: a linear kernel is used
  - Run2: a gaussian kernel is used

           Run1    Run2
general    .7981   .7564
author     .8158   .8076
peop.inv.  .6922   .6758
time       .7471   .7090
location   .7723   .7351
event      .6835   .6623
subject    .7875   .7520
descr.     .7996   .7745
mean       .7620   .7341

- Error analysis: some scores are overestimated due to "coincidences", e.g.
  "The Octagon and Pavilions, Pavilion Garden, Buxton, c 1875"
  vs. "The Beatles, The Octagon, Pavillion Gardens, St John's Road, Buxton, 1963"
Conclusion
- We modeled STS as a SV regression problem
  - a SV regressor learns how to combine basic similarity measures
  - simple but effective semantic filters are applied to emphasize specific information
  - no hand-coded resource is used
- Future work
  - improve the domain adaptation of the proposed similarity functions and combination approach
  - provide a method to properly select the training material

Thank you for your attention…