Structured Composition of Semantic Vectors
Stephen Wu
Division of Biomedical Statistics and Informatics, Mayo Clinic
January 13, 2011 | IWCS
Outline

1. Introduction: Overview; Related Work
2. Structured Vectorial Semantics: Vector Composition; Semantically-annotated Parsing; Distributed Semantics in SVS
3. Evaluation: Model Fit; Parsing; Speed Performance
Big Picture

Distributed Semantic Vector Composition
+ (Syntactic) Parsing
= Structured Vectorial Semantics (SVS)

[Figure: the parse tree [S [NP [DT the] [NN engineers]] pulled off ...] with a semantic vector attached to each node, e.g. e = (.1 .2 .1) at NP and at DT "the", and e = (.5 .1 .1) at NN "engineers"; vector composition and syntactic parsing are carried out in one structure.]
Weaknesses of Distributed Semantic Models

1. No compositionality
   Ex 1: Patient is a 48-year old male with no significant past medical history complaining of abdominal pain.
2. Bag-of-words independence assumption
   Ex 2a: Significant improvement of health outcomes followed the drastic overhaul of surgical pre-operation procedure.
   Ex 2b: Significant overhaul of surgical pre-operation procedure followed the drastic improvement of health outcomes.

⇒ Structured Vectorial Semantics (SVS)
Vector Composition Background

General definition (Mitchell & Lapata ’08):
\[
\underbrace{e_\gamma}_{\text{target vector}} = f(\ \underbrace{e_\alpha}_{\text{source 1}},\ \underbrace{e_\beta}_{\text{source 2}},\ \underbrace{M}_{\text{syntax}},\ \underbrace{L}_{\text{knowledge}}\ )
\]
Simple instantiations:
\[
\text{Add: } e_\gamma[i] = e_\alpha[i] + e_\beta[i] \qquad \text{Mult: } e_\gamma[i] = e_\alpha[i] \cdot e_\beta[i]
\]
Composition informed by syntactic context:
  predicate–argument (Kintsch ’01); selectional preferences (Erk & Padó ’08); language models (Mitchell & Lapata ’09); matrices (Rudolph & Giesbrecht ’10)
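As a concrete reference point, here is a minimal sketch of the Add and Mult baselines; the toy vectors and variable names are illustrative, not from the original experiments:

    import numpy as np

    # Toy context vectors for two words over an arbitrary semantic space.
    e_alpha = np.array([0.1, 0.2, 0.1])   # e.g. "the"
    e_beta  = np.array([0.5, 0.1, 0.1])   # e.g. "engineers"

    # Mitchell & Lapata '08 baselines: elementwise addition and multiplication.
    e_add  = e_alpha + e_beta    # Add:  e_gamma[i] = e_alpha[i] + e_beta[i]
    e_mult = e_alpha * e_beta    # Mult: e_gamma[i] = e_alpha[i] * e_beta[i]

    print(e_add)   # ~[0.6  0.3  0.2 ]
    print(e_mult)  # ~[0.05 0.02 0.01]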
Semantically-annotated Parsing

Headword Lexicalization (Charniak ’97): one-word semantics, subcategorization ⇒ headword-lexicalized SVS
Latent Annotations (Matsuzaki et al. ’05): learned subcategories, clustered semantics ⇒ relationally-clustered SVS
Semantic parsing: logical forms ⇒ logical-interpretation SVS

[Figure: the parse tree of "the engineers pulled off an engineering trick", annotated three ways: (a) with headwords, e.g. NP:i_engineers over DT:i_the and NN:i_engineers; (b) with latent annotations, e.g. S[e], NP[e], DT[e], NN[e]; (c) with logical forms, e.g. S: pulled(egr, trick(egrng)), NP: egr, VP: pulled(x, trick(egrng)), with DT nodes contributing no content (-).]
SVS Composition Components

Concepts: i_u (unknown), i_k (known), i_p (people).

Word vectors in context (e):
\[
e_\alpha = \begin{bmatrix} .1 \\ .2 \\ .1 \end{bmatrix} = P(\text{the} \mid lci_\alpha),
\qquad
e_\beta = \begin{bmatrix} .5 \\ .1 \\ .1 \end{bmatrix} = P(\text{engineers} \mid lci_\beta)
\]
Relation matrices (L):
\[
L_{\gamma\times\alpha}(l_{\mathrm{Mod}}) = \begin{bmatrix} .6 & .2 & .2 \\ .2 & .5 & .3 \\ .1 & .2 & .7 \end{bmatrix} = P(i_\gamma \mid i_\alpha, l_\alpha),
\qquad
L_{\gamma\times\beta}(l_{\mathrm{Id}}) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = P(i_\gamma \mid i_\beta, l_\beta)
\]
Syntactic vector (m), not purely syntactic:
\[
m(l_{\mathrm{Mod}}\mathrm{NP} \rightarrow l_{\mathrm{Mod}}\mathrm{DT}\ \ l_{\mathrm{Id}}\mathrm{NN}) = \begin{bmatrix} .2 \\ .3 \\ .4 \end{bmatrix} = P(lci_\gamma \rightarrow lc_\alpha\, lc_\beta)
\]
These components instantiate the general form \( e_\gamma = f(e_\alpha, e_\beta, M, L) \).
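A minimal sketch of these components as numpy arrays; the values are the toy numbers from the slide, and the ordering of the concept axis (i_u, i_k, i_p) is an assumption:

    import numpy as np

    # Concept axis assumed as (i_u unknown, i_k known, i_p people).
    e_alpha = np.array([0.1, 0.2, 0.1])   # P(the | lci_alpha)
    e_beta  = np.array([0.5, 0.1, 0.1])   # P(engineers | lci_beta)

    # Relation matrices: rows index i_gamma, columns index the child concept.
    L_mod = np.array([[0.6, 0.2, 0.2],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7]])   # P(i_gamma | i_alpha, l_alpha)
    L_id  = np.eye(3)                     # P(i_gamma | i_beta, l_beta): identity

    # Syntactic vector for the rule l_Mod NP -> l_Mod DT  l_Id NN.
    m = np.array([0.2, 0.3, 0.4])         # P(lci_gamma -> lc_alpha lc_beta)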
SVS Composition Equation

Composing "the engineers ..." at the NP node (e_γ), with children e_α at (l_Mod)DT "the" and e_β at (l_Id)NN "engineers":
\[
e_\gamma = f(e_\alpha, e_\beta, M, L) = m \odot (L_{\gamma\times\alpha}\, e_\alpha) \odot (L_{\gamma\times\beta}\, e_\beta)
\]
\[
\underbrace{\begin{bmatrix} 0.0120 \\ 0.0042 \\ 0.0048 \end{bmatrix}}_{e_\gamma}
= \underbrace{\begin{bmatrix} .2 \\ .3 \\ .4 \end{bmatrix}}_{m}
\odot\ \underbrace{\begin{bmatrix} .6 & .2 & .2 \\ .2 & .5 & .3 \\ .1 & .2 & .7 \end{bmatrix}}_{L_{\gamma\times\alpha}(l_{\mathrm{Mod}})}
\underbrace{\begin{bmatrix} .1 \\ .2 \\ .1 \end{bmatrix}}_{e_\alpha}
\odot\ \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{L_{\gamma\times\beta}(l_{\mathrm{Id}})}
\underbrace{\begin{bmatrix} .5 \\ .1 \\ .1 \end{bmatrix}}_{e_\beta}
\]
What context? Choose between? ⇒ Consider the dual problem of parsing!
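The composition itself is one line of matrix and elementwise products. A self-contained sketch (⊙ is elementwise multiplication; the concept ordering is an assumption, so treat the printed numbers as illustrative):

    import numpy as np

    e_alpha = np.array([0.1, 0.2, 0.1])                  # "the"
    e_beta  = np.array([0.5, 0.1, 0.1])                  # "engineers"
    L_mod   = np.array([[0.6, 0.2, 0.2],
                        [0.2, 0.5, 0.3],
                        [0.1, 0.2, 0.7]])
    L_id    = np.eye(3)
    m       = np.array([0.2, 0.3, 0.4])

    # e_gamma = m ⊙ (L_mod e_alpha) ⊙ (L_id e_beta);
    # '*' is elementwise, '@' is the matrix-vector product.
    e_gamma = m * (L_mod @ e_alpha) * (L_id @ e_beta)
    print(e_gamma)  # ~[0.012 0.0045 0.0048]; the slide reports 0.0042 for the
                    # middle entry, which may reflect a different axis ordering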
Dual Problem: Parsing

Entry by entry, the composed vector expands into familiar parsing probabilities:
\[
\begin{aligned}
e_\gamma &= m \odot (L_{\gamma\times\alpha}\, e_\alpha) \odot (L_{\gamma\times\beta}\, e_\beta) \\
&= P(lci_\gamma \rightarrow lc_\alpha\, lc_\beta) \cdot \sum_{i_\alpha} P(i_\alpha \mid i_\gamma, l_\alpha) \, P(x_\alpha \mid lci_\alpha) \cdot \sum_{i_\beta} P(i_\beta \mid i_\gamma, l_\beta) \, P(x_\beta \mid lci_\beta) \\
&= \sum_{i_\alpha} \sum_{i_\beta} P(lci_\gamma \rightarrow lc_\alpha\, lc_\beta) \, P(i_\alpha \mid i_\gamma, l_\alpha) \, P(x_\alpha \mid lci_\alpha) \, P(i_\beta \mid i_\gamma, l_\beta) \, P(x_\beta \mid lci_\beta) \\
&= \sum_{i_\alpha} \sum_{i_\beta} \underbrace{P(lci_\gamma \rightarrow lc_\alpha\, lc_\beta) \, P(i_\alpha \mid i_\gamma, l_\alpha) \, P(i_\beta \mid i_\gamma, l_\beta)}_{P(lci_\gamma \rightarrow lci_\alpha\, lci_\beta)} \cdot P(x_\alpha \mid lci_\alpha) \, P(x_\beta \mid lci_\beta) \\
&= \sum_{i_\alpha} \sum_{i_\beta} P(lci_\gamma \rightarrow lci_\alpha\, lci_\beta) \, P(x_\alpha \mid lci_\alpha) \, P(x_\beta \mid lci_\beta)
\end{aligned}
\]
With semantic labels l and concepts i, these are standard parsing equations.
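A small sketch of this equivalence: one entry of e_γ computed by the explicit double sum, checked against the vectorized form (toy numbers as before; all names are illustrative):

    import numpy as np

    e_alpha = np.array([0.1, 0.2, 0.1])
    e_beta  = np.array([0.5, 0.1, 0.1])
    L_a = np.array([[0.6, 0.2, 0.2],
                    [0.2, 0.5, 0.3],
                    [0.1, 0.2, 0.7]])   # P(i_alpha | i_gamma, l_alpha), rows = i_gamma
    L_b = np.eye(3)                     # P(i_beta  | i_gamma, l_beta)
    m   = np.array([0.2, 0.3, 0.4])     # P(lci_gamma -> lc_alpha lc_beta)

    # Vectorized composition.
    e_gamma = m * (L_a @ e_alpha) * (L_b @ e_beta)

    # Explicit double sum over child concepts for each parent concept i_gamma.
    n = len(m)
    e_gamma_sum = np.zeros(n)
    for ig in range(n):
        for ia in range(n):
            for ib in range(n):
                e_gamma_sum[ig] += (m[ig] * L_a[ig, ia] * e_alpha[ia]
                                          * L_b[ig, ib] * e_beta[ib])

    assert np.allclose(e_gamma, e_gamma_sum)  # same result either way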
Most Likely Tree

Compare vectors via a prior vector \( a^{\mathsf{T}} \):
\[
P(x_\gamma, lc_\gamma) = \sum_{i_\gamma} P(lci_\gamma) \cdot P(x_\gamma \mid lci_\gamma) = a^{\mathsf{T}}_\gamma \cdot e_\gamma
\]
Best vector:
\[
P_{\theta\mathrm{Vit}(G)}(x_\gamma \mid lce_\gamma) \overset{\mathrm{def}}{=} \left\llbracket\, e_\gamma = \operatorname*{arg\,max}_{lce_\iota}\ a^{\mathsf{T}}_\iota e_\iota \,\right\rrbracket
\]
Implied tree
Similar at root
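A sketch of the Viterbi-style choice: score each candidate vector by its dot product with the prior a and keep the argmax. The candidate set and all values here are invented for illustration:

    import numpy as np

    a = np.array([0.3, 0.5, 0.2])   # prior over concepts, a[i] = P(lci)

    # Hypothetical candidate vectors e_iota for competing analyses of one span.
    candidates = {
        "NP -> DT NN": np.array([0.0120, 0.0045, 0.0048]),
        "NP -> NP NN": np.array([0.0030, 0.0020, 0.0010]),
    }

    # P(x, lc) = a^T e for each candidate; the Viterbi choice keeps the best.
    best = max(candidates, key=lambda k: a @ candidates[k])
    print(best, a @ candidates[best])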
SVS Probability Models

Syntactic model:    \( m(lc_\gamma \rightarrow lc_\alpha\, lc_\beta)[i_\gamma, i_\gamma] = P_{\theta M}(lci_\gamma \rightarrow lc_\alpha\, lc_\beta) \)
Semantic model:     \( L_{\gamma\times\iota}(l_\iota)[i_\gamma, i_\iota] = P_{\theta L}(i_\iota \mid i_\gamma, l_\iota) \)
Preterminal model:  \( e_\gamma[i_\gamma] = P_{\theta\mathrm{P\text{-}Vit}(G)}(x_\gamma \mid lci_\gamma) \), for preterminal \( \gamma \)
Root const. model:  \( a^{\mathsf{T}}_\epsilon[i_\epsilon] = P_{\pi G \epsilon}(lci_\epsilon) \)
Any const. model:   \( a^{\mathsf{T}}_\gamma[i_\gamma] = P_{\pi G}(lci_\gamma) \)

Different instantiations of these models yield the different SVS variants.
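Note that the syntactic model places the rule probability on the diagonal of m only ([i_γ, i_γ]), which is why m acts elementwise in the composition. A minimal sketch under that reading (names illustrative):

    import numpy as np

    # Rule probability for lci_gamma -> lc_alpha lc_beta, one value per i_gamma.
    p_rule = np.array([0.2, 0.3, 0.4])

    # As a matrix, m(...)[i, i] = P_thetaM(...): diagonal, zero elsewhere.
    M = np.diag(p_rule)

    v = np.array([0.5, 0.1, 0.1])
    # Multiplying by a diagonal matrix equals an elementwise product.
    assert np.allclose(M @ v, p_rule * v)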
Relationally-clustered Headwords

Headword lexicalization uses a one-hot vector over headwords; relational clustering replaces it with a distribution over a small number of clusters:
\[
e = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}
\begin{matrix} i_{\text{aardvark}} \\ i_{\text{engineers}} \\ \vdots \\ i_{\text{zygote}} \end{matrix}
\qquad\longrightarrow\qquad
e = \begin{bmatrix} p_1 \\ \vdots \\ p_{|e|} \end{bmatrix}
\begin{matrix} i_{\text{cluster}_1} \\ \vdots \\ i_{\text{cluster}_{|e|}} \end{matrix}
\]
[Figure: for "the engineers", the headword-annotated tree (l_Mod)NP:i_engineers over (l_Mod)DT:i_the and (l_Id)NN:i_engineers maps, via the Inside–Outside Algorithm (EM), to the cluster-annotated tree (l_Mod)NP:i_cluster1 over (l_Mod)DT:i_cluster2 and (l_Id)NN:i_cluster3.]
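A tiny sketch of the two representations; the vocabulary and cluster probabilities are invented for illustration:

    import numpy as np

    # Headword lexicalization: one-hot over a (tiny) headword vocabulary.
    headwords = ["aardvark", "engineers", "zygote"]
    e_headword = np.zeros(len(headwords))
    e_headword[headwords.index("engineers")] = 1.0   # i_engineers

    # Relational clustering: the same word as a distribution over few clusters.
    e_clustered = np.array([0.7, 0.2, 0.1])          # p_1..p_|e| over i_cluster1..3
    assert np.isclose(e_clustered.sum(), 1.0)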
Model Fit Evaluation

Setup: WSJ Sections 02–21 for training, Section 23 for testing; binarized, subcategorized; non-syntactic information added.

Quantitative fit, measured by perplexity (how well the models explain the language), on Section 23 with ‘unk’+‘num’:

                                Perplexity
  syntax-only baseline            428.94
  rel'n clust. 1k hw → 005e       371.76
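For reference, a minimal sketch of how perplexity is computed from a model's per-word probabilities; the probabilities here are invented:

    import math

    # Model probabilities assigned to each test word (toy values).
    word_probs = [0.01, 0.002, 0.05, 0.008]

    # Perplexity = exp of the average negative log-probability per word;
    # lower means the model explains the text better.
    avg_nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    print(math.exp(avg_nll))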
EM-learned Relational Clusters: Clusters in Syntactic Context (plural nouns)

  Cluster i0 'money':     unk 0.431, cents 0.135, shares 0.084, yen 0.036, sales 0.025, points 0.023, marks 0.018, francs 0.018, tons 0.013, people 0.012
  Cluster i1 'people':    officials 0.145, unk 0.141, years 0.132, shares 0.093, prices 0.061, people 0.050, stocks 0.032, sales 0.027, executives 0.024, analysts 0.018
  Cluster i2 'companies': unk 0.248, markets 0.056, companies 0.036, issues 0.035, firms 0.033, banks 0.030, loans 0.025, investors 0.024, contracts 0.022, stocks 0.021
  Cluster i5 'time':      years 0.25, months 0.19, unk 0.18, days 0.12, weeks 0.06, points 0.03, companies 0.02, hours 0.02, people 0.01, units 0.01
EM-learned Relational Clusters: Clusters in Syntactic Context (past-tense verbs)

  Cluster i1 'announcement':      unk 0.362, was 0.173, reported 0.097, posted 0.036, earned 0.029, filed 0.024, were 0.022, had 0.020, told 0.013, approved 0.013
  Cluster i5 'change in value':   rose 0.137, fell 0.124, unk 0.116, gained 0.063, dropped 0.051, attributed 0.051, jumped 0.046, added 0.041, lost 0.039, advanced 0.022
  Cluster i7 'change possession': unk 0.381, had 0.065, was 0.062, took 0.036, bought 0.027, completed 0.025, received 0.024, were 0.023, got 0.018, made 0.018, acquired 0.016
WSJ Parsing Accuracy and Relational Clusters

Are distributed semantics better? (Sec. 23, length < 40 wds)

                                LR     LP     F
  syntax-only baseline          83.32  83.83  83.57
  headword-lex. 10 hw           83.10  83.61  83.35
  headword-lex. 50 hw           83.09  83.40  83.24
  rel'n clust. 50 hw, 10 clust  83.67  84.13  83.90

Are more clusters better? (Sec. 23, length < 40 wds)

                                LR     LP     F
  baseline, 1 clust             83.34  83.90  83.62
  1000 hw, 5 clust (avg)        83.85  84.23  84.04
  1000 hw, 10 clust (avg)       84.04  84.40  84.21
  1000 hw, 15 clust (avg)       84.15  84.38  84.26
  1000 hw, 20 clust (avg)       84.21  84.42  84.31
Parsing Speed with Vectors

SVS adds extra operations, so a naive implementation is slower; vectorization recovers the speed. Runtime remains O(n^3) in sentence length, but the constant factor drops from 0.66505 (un-vectorized) to 0.00267 (vectorized) through efficient matrix operations.

[Figure: average parsing time (s, 0–500) vs. sentence length (0–40), non-vectorized vs. vectorized.]
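To illustrate the kind of constant-factor gap vectorization buys, here is a generic demonstration (not the paper's parser) of the same composition done with explicit loops vs. matrix operations:

    import time
    import numpy as np

    n = 200  # number of concepts, inflated so the difference is visible
    m = np.random.rand(n)
    L_a, L_b = np.random.rand(n, n), np.random.rand(n, n)
    e_a, e_b = np.random.rand(n), np.random.rand(n)

    # Un-vectorized: explicit loops over concept indices.
    t0 = time.perf_counter()
    e_loop = np.zeros(n)
    for ig in range(n):
        sa = sum(L_a[ig, ia] * e_a[ia] for ia in range(n))
        sb = sum(L_b[ig, ib] * e_b[ib] for ib in range(n))
        e_loop[ig] = m[ig] * sa * sb
    t_loop = time.perf_counter() - t0

    # Vectorized: the same composition as matrix and elementwise products.
    t0 = time.perf_counter()
    e_vec = m * (L_a @ e_a) * (L_b @ e_b)
    t_vec = time.perf_counter() - t0

    assert np.allclose(e_loop, e_vec)
    print(f"loop: {t_loop:.6f}s  vectorized: {t_vec:.6f}s")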
Conclusion: Structured Vectorial Semantics

Addressing weaknesses:
  No compositionality ← phrasal semantics
  Bag-of-words ← context

Relational-clustering SVS: distributed semantics + latent-annotation parsing; broad-coverage.

Evaluation: perplexity reduction, qualitatively sensible clusters, mild parsing gains, tractability.
Thank you!
[email protected]
Inside–Outside Algorithm (EM)

E-step (imagine annotations): estimate the posterior of an annotated rule,
\[
\hat{P}(i_\gamma, i_\alpha, i_\beta \mid lc_\gamma, lc_\alpha, lc_\beta)
= \frac{\hat{P}_{\theta\mathrm{Out}}(lci_\gamma,\ lch_\epsilon{-}lch_\gamma) \cdot \hat{P}_{\theta\mathrm{Ins}}(lch_\gamma \mid lci_\gamma)}{\hat{P}(lch_\epsilon)}
\]
then weight against real data:
\[
\tilde{P}(lci_\gamma, lci_\alpha, lci_\beta) = \hat{P}(i_\gamma, i_\alpha, i_\beta \mid lc_\gamma, lc_\alpha, lc_\beta) \cdot P(lc_\gamma, lc_\alpha, lc_\beta)
\]
M-step (frequency counts): estimate grammar rules,
\[
P_{\theta M}(lci_\eta \rightarrow lc_{\eta 0}\, lc_{\eta 1}) \leftarrow
\frac{\sum_{i_{\eta 0},\, i_{\eta 1}} \tilde{P}(lci_\eta, lci_{\eta 0}, lci_{\eta 1})}
     {\sum_{lci_{\eta 0},\, lci_{\eta 1}} \tilde{P}(lci_\eta, lci_{\eta 0}, lci_{\eta 1})}
\qquad \text{(numerator sums out latent annotations)}
\]
\[
P_{\theta L}(i_{\eta 0} \mid i_\eta; l_{\eta 0}) \leftarrow
\frac{\sum_{cl_\eta,\, c_{\eta 0},\, cli_{\eta 1}} \tilde{P}(lci_\eta, lci_{\eta 0}, lci_{\eta 1})}
     {\sum_{cl_\eta,\, ci_{\eta 0},\, cli_{\eta 1}} \tilde{P}(lci_\eta, lci_{\eta 0}, lci_{\eta 1})}
\]
\[
P_{\theta H}(h_\eta \mid lci_\eta) \leftarrow
\frac{\tilde{P}(lci_\eta, -, -)}{\sum_{h_\eta} \tilde{P}(lci_\eta, -, -)}
\]
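A schematic sketch of the M-step's frequency-count normalization, heavily simplified: the expected counts stand in for E-step output rather than being computed by inside–outside, and all shapes and names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n_rules, n = 4, 3   # child-category pairs; latent annotations per category

    # Expected counts from the E-step, P~(lci_eta, lci_eta0, lci_eta1),
    # indexed [rule r = (lc_eta0, lc_eta1), i_eta, i_eta0, i_eta1].
    counts = rng.random((n_rules, n, n, n))

    # M-step for P_thetaM(lci_eta -> lc_eta0 lc_eta1): the numerator sums out
    # the children's latent annotations; the denominator also sums over the
    # child-category pairs, so columns normalize per parent annotation i_eta.
    num = counts.sum(axis=(2, 3))   # shape (n_rules, n)
    den = num.sum(axis=0)           # shape (n,)
    p_rule = num / den              # P(rule | i_eta)
    assert np.allclose(p_rule.sum(axis=0), 1.0)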
Relational Clustering SVS

Five SVS models to train:
  Syntactic model     \( P_{\theta M}(lci_\gamma \rightarrow lc_\alpha\, lc_\beta) \)            estimated in EM
  Semantic model      \( P_{\theta L}(i_\iota \mid i_\gamma, l_\iota) \)                         estimated in EM
  Preterminal model   \( P_{\theta\mathrm{P\text{-}Vit}(G)}(x_\gamma \mid lci_\gamma) \)         backed off from EM
  Root const. model   \( P_{\pi G \epsilon}(lci_\epsilon) \)                                     byproduct of EM
  Any const. model    \( P_{\pi G}(lci_\gamma) \)                                                byproduct of EM

Preterminal model (back-off through ‘unk’ for out-of-vocabulary words):
\[
P_{\theta\mathrm{P\text{-}Vit}(G)}(x_\eta \mid lci_\eta) =
\begin{cases}
\hat{P}_{\theta H}(x_\eta \mid lci_\eta) & x_\eta \in H \\
P_{\theta\mathrm{P\text{-}Vit}(G)}(x_\eta \mid c_\eta) \cdot \hat{P}_{\theta H}(\mathrm{unk} \mid lci_\eta) & x_\eta \notin H
\end{cases}
\]
Root and any-constituent models:
\[
P_{\pi G \epsilon}(lci_\epsilon) \overset{\mathrm{def}}{=} \hat{P}_{\theta\mathrm{Out}}(lci_\epsilon,\ lch_\epsilon{-}lch_\epsilon),
\qquad
P_{\pi G}(lci_\eta) \overset{\mathrm{def}}{=} \sum_{lci_{\eta 0},\, lci_{\eta 1}} \tilde{P}(lci_\eta, lci_{\eta 0}, lci_{\eta 1})
\]
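A small sketch of the preterminal back-off logic; the vocabulary and all probabilities are invented for illustration:

    # Known-headword probabilities P_thetaH(x | lci) for one annotated category.
    p_known = {"engineers": 0.02, "trick": 0.01}
    p_unk_given_lci = 0.05     # P_thetaH(unk | lci)
    p_word_given_c = 0.001     # P_thetaP-Vit(G)(x | c), category-only back-off

    def p_preterminal(word):
        # In-vocabulary words use the EM-estimated headword model directly;
        # out-of-vocabulary words back off through the 'unk' mass.
        if word in p_known:                      # x in H
            return p_known[word]
        return p_word_given_c * p_unk_given_lci  # x not in H

    print(p_preterminal("engineers"), p_preterminal("flibbertigibbet"))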