Type-Based MCMC - Stanford CS

Type-Based MCMC
NAACL 2010 – Los Angeles
Percy Liang, Michael I. Jordan, Dan Klein


Learning latent-variable models

[Figure: example sentences with latent POS tags drawn from shared parameters θ, e.g. "... some stocks rose ..." (DT NNS VBD), "... tech stocks fell ...", "... the stocks soared ...", "... how stocks dipped ..." (WRB NNS VBD), "... many stocks created ..." (DT NNS VBD).]

Variables are highly dependent.
- Simple: token-based (e.g., Gibbs sampler) ⇒ local optima, slow mixing
- Classic: sentence-based (e.g., EM), which handles structural dependencies via dynamic programming
- New: type-based, which handles parameter dependencies via exchangeability

Structural dependencies

Dependencies between adjacent variables.

[Figure: probability landscape over taggings of "worker demands meeting resistance": the current state NN VBZ NN NN, the higher-probability target NN NNS VBG NN, and the intermediate state NN VBZ VBG NN lying in a valley between them.]

Token-based: update only one variable at a time. Problem: need to go downhill before going uphill.
Sentence-based: update all variables in a sentence.

Parameter dependencies

Dependencies between variables with shared parameters.

[Figure: probability landscape over joint taggings of "stocks" as VBZ or NNS across four sentences ("stocks rose", "stocks fell", "stocks took", "stocks shot"); flipping one occurrence at a time from all-VBZ to all-NNS passes through mixed, lower-probability states.]

Token-based: need to go downhill a lot before going uphill.
1. Parameter dependencies create deeper valleys.
2. Sentence-based samplers cannot handle these dependencies.
Type-based: update all variables of one type.

What exactly is a type?

How can we update all variables of a type efficiently?

Formal setup

Parameters θ: a vector of conditional probabilities.
  θ_{T:V,N}: from state V, the probability of transitioning to state N.
p(θ) is a product of Dirichlets.  [prior]

Choices z: specify the values of the latent and observed variables. Each choice z ∈ z uses one parameter, indexed j(z); e.g., j(z) = T:V,N for the transition where state V at position 8 is followed by state N at position 9.

p(z | θ) = ∏_{z∈z} θ_{j(z)}    [likelihood]
p(z) = ∫ p(z | θ) p(θ) dθ      [marginal likelihood]

Observations x: the observed part of z (e.g., the words).
Goal: sample from p(z | x)     [not p(z | x, θ)]

Exchangeability

Sufficient statistics n: # times each parameter was used in z.
  n_{T:V,N}: # times state V transitioned to state N.

Rewrite the likelihood:
p(z | θ) = ∏_j θ_j^{n_j} = a simple function of n and θ
⇒ p(z) = a simple function of n   (key: exchangeability)

[Figure: two assignments. z1: "... many stocks rose ..." tagged D V V and "... tech stocks were ..." tagged D N V. z2: the same sentences with the V/N tags on "stocks" swapped (D N V and D V V).]

z1 and z2 have the same sufficient statistics ⇒ p(z1) = p(z2).
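The claim that p(z) depends on z only through n can be checked directly. A minimal sketch under a symmetric Dirichlet prior: for brevity all transition counts share one Dirichlet block (the slides use one block per conditioning state, but the counts-only dependence is the same either way), and the tiny tag sequences are illustrative.

```python
from math import lgamma
from collections import Counter

def transition_counts(sentences):
    """Sufficient statistics n: n[(a, b)] = # times state a transitioned to b."""
    n = Counter()
    for tags in sentences:
        n.update(zip(tags, tags[1:]))
    return n

def log_marginal(n, K, alpha=1.0):
    """log p(z) with theta integrated out under a symmetric Dirichlet(alpha):
    Gamma(K*a)/Gamma(K*a + N) * prod_j Gamma(a + n_j)/Gamma(a),
    the standard Dirichlet-multinomial form over K possible outcomes."""
    N = sum(n.values())
    lp = lgamma(K * alpha) - lgamma(K * alpha + N)
    for c in n.values():
        lp += lgamma(alpha + c) - lgamma(alpha)
    return lp  # depends on z only through the counts n

# z1 and z2 from the slide: the N/V labels on "stocks" are swapped
# between the two sentences, yet the transition counts coincide.
z1 = [["D", "V", "V"], ["D", "N", "V"]]
z2 = [["D", "N", "V"], ["D", "V", "V"]]
n1, n2 = transition_counts(z1), transition_counts(z2)
assert n1 == n2                                                 # same n
assert abs(log_marginal(n1, 9) - log_marginal(n2, 9)) < 1e-12   # so p(z1) = p(z2)
```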

Types

Type of a variable: its dependent parameter components. For an HMM, the type of a tag variable is the assignment to its Markov blanket:

type(the tag of "stocks" in "... many stocks rose ...", tagged D _ V) = (D, stocks, V)

[Figure: four occurrences of "stocks" with the same blanket (D, stocks, V): "many stocks rose", "tech stocks were", "the stocks have", "how stocks from", and joint V/N assignments to the four tags, e.g. [VVNN], [VNVN], [VNNV], [NVVN], [NVNV], [NNVV].]

p(assignment) depends only on the number of Vs and Ns.
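A minimal sketch of type extraction for the HMM case; the START/STOP boundary symbols are an assumed convention, not something stated on the slide.

```python
def hmm_type(tags, words, i):
    """Type of the tag variable at position i: the assignment to its
    Markov blanket (previous tag, emitted word, next tag).
    START/STOP boundary symbols are an assumed convention."""
    prev = tags[i - 1] if i > 0 else "START"
    nxt = tags[i + 1] if i + 1 < len(tags) else "STOP"
    return (prev, words[i], nxt)

tags = ["D", "V", "V"]
words = ["many", "stocks", "rose"]
# Matches the slide: type(tag of "stocks") = (D, stocks, V)
assert hmm_type(tags, words, 1) == ("D", "stocks", "V")
```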

Sampling same-type variables

Goal: sample an assignment of a set of B same-type variables.

All C(B, m) assignments with exactly m Ns share the same probability p_m (B = 4 shown):

m=0: [VVVV]                                       (p_0)
m=1: [VVVN] [VVNV] [VNVV] [NVVV]                  (p_1)
m=2: [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV]    (p_2)
m=3: [VNNN] [NVNN] [NNVN] [NNNV]                  (p_3)
m=4: [NNNN]                                       (p_4)

Algorithm:
1. Choose m ∈ {0, ..., B} with probability ∝ C(B, m) · p_m.
2. Choose an assignment uniformly from column m.
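The two-stage step above can be sketched as follows; `log_pm` (the per-assignment log-probability for each column) and the V/N labels are placeholders for whatever the model supplies, and the log-space normalization is just a numerical convenience.

```python
import math
import random

def sample_same_type(B, log_pm, rng=random):
    """Sample an assignment of B same-type binary (V/N) variables.
    log_pm[m] is the unnormalized log-probability shared by every
    assignment containing exactly m Ns (there are C(B, m) of them)."""
    # 1. Choose m with probability proportional to C(B, m) * p_m.
    logw = [math.lgamma(B + 1) - math.lgamma(m + 1) - math.lgamma(B - m + 1)
            + log_pm[m] for m in range(B + 1)]
    mx = max(logw)
    weights = [math.exp(w - mx) for w in logw]  # shift in log space first
    m = rng.choices(range(B + 1), weights=weights)[0]
    # 2. Choose which m of the B sites get label N, uniformly at random.
    n_sites = set(rng.sample(range(B), m))
    return ["N" if i in n_sites else "V" for i in range(B)]

rng = random.Random(0)
assignment = sample_same_type(4, [0.0] * 5, rng)
assert len(assignment) == 4 and set(assignment) <= {"V", "N"}
```

With a flat `log_pm`, step 1 reduces to the binomial distribution over m, so every one of the 2^B assignments is equally likely.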

Full algorithm

Iterate:
1. Choose a position (e.g., position 2).
2. Add all variables with the same type (e.g., (D, stocks, V)).
3. Sample m (e.g., m = 1 ⇒ {V, V, V, N}).
4. Sample an assignment uniformly (e.g., [VNVV]).

[Figure: the four (D, stocks, V) sites across "many stocks rose", "tech stocks were", "the stocks have", "how stocks from", jointly relabeled in one step.]
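The four steps can be sketched as one function. This is a sketch, not the paper's exact implementation: `sample_block` is a hypothetical callback standing in for the two-stage binomial sampler, and the sketch ignores the care needed to keep the chosen sites' Markov blankets disjoint, which the actual sampler must handle.

```python
import random

def full_algorithm_step(tags, words, sample_block, rng=random):
    """One iteration of the type-based sampler (sketch). tags/words are
    parallel lists over a flattened corpus; sample_block(B) returns a
    joint assignment of labels for B same-type variables."""
    def blanket(i):  # HMM type = assignment to the Markov blanket
        prev = tags[i - 1] if i > 0 else "START"
        nxt = tags[i + 1] if i + 1 < len(tags) else "STOP"
        return (prev, words[i], nxt)
    i = rng.randrange(len(words))                 # 1. choose a position
    t = blanket(i)
    sites = [j for j in range(len(words)) if blanket(j) == t]  # 2. same type
    labels = sample_block(len(sites))             # 3-4. sample m and assignment
    for j, lab in zip(sites, labels):
        tags[j] = lab
    return sites
```

For example, with tags D V V D N V over "many stocks rose tech stocks were", choosing position 1 gathers both occurrences of "stocks" (type (D, stocks, V)) and resamples them jointly.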

Experimental setup: sampling algorithms

[Figure: the example corpus "tech stocks rose in heavy trading", "worker demands meeting resistance", "many stocks shot up today", "investors await stocks news", with numeric state labels over the words.]

Token-based sampler (Token): 1. Choose a token. 2. Update it conditioned on the rest.
Sentence-based sampler (Sentence): 1. Choose a sentence. 2. Update it conditioned on the rest.
Type-based sampler (Type): 1. Choose a type. 2. Update it conditioned on the rest.

Experimental setup: models/tasks/datasets

Hidden Markov model (HMM). Task: part-of-speech induction (e.g., NN NNS VBG NN for "worker demands meeting resistance"). Dataset: WSJ (49K sentences, 45 tags).

Unigram segmentation model (USM) [Goldwater et al., 2006]. Task: word segmentation (e.g., "look at the book"). Dataset: CHILDES (9.7K sentences).

Probabilistic tree-substitution grammar (PTSG) [Cohn et al., 2009] (e.g., a parse of "the sun has risen"). Dataset: WSJ (49K sentences, 45 tags).

Token versus Type

[Figure: panels comparing the Token and Type samplers. HMM: log-likelihood (-7.3e6 to -6.5e6) and tag accuracy (35 to 60) vs. time (3 to 12 hr.). USM: log-likelihood (-2.1e5 to -1.7e5) and word token F1 vs. time (2 to 8 min.). PTSG: log-likelihood (-6.1e6 to -5.3e6) vs. time (3 to 12 hr.).]

Sensitivity to initialization

η = 0 (use few params.):
  HMM: V V V V / worker demands meeting resistance (every token gets the same tag)
  USM: l o o k a t t h e b o o k (every character is its own word)

η = 1 (use many params.):
  HMM: P N D V / worker demands meeting resistance (tags assigned at random)
  USM: lookatthebook (the whole utterance is one word)

14
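One way to picture the η knob on these slides (an illustrative sketch; the paper's exact initialization scheme may differ): for USM, place a word boundary at each interior position with probability 1 - η; for the HMM, give each token a uniformly random tag with probability η and a single default tag otherwise.

```python
import random

def init_segmentation(chars, eta):
    """Initialize word boundaries: each interior position gets a boundary
    with probability 1 - eta. eta=0 -> every character is its own word
    (few word types); eta=1 -> the whole utterance is one word (many types)."""
    words, cur = [], chars[0]
    for c in chars[1:]:
        if random.random() < 1 - eta:
            words.append(cur)
            cur = c
        else:
            cur += c
    words.append(cur)
    return words

def init_tags(tokens, tagset, eta):
    """Initialize tags: with probability eta a token gets a uniformly random
    tag, otherwise a single default tag. eta=0 -> one tag everywhere;
    eta=1 -> fully random tagging."""
    default = tagset[0]
    return [random.choice(tagset) if random.random() < eta else default
            for _ in tokens]
```

At η = 0 the initial state touches very few parameters; at η = 1 nearly every parameter is in use from the start, which is the regime the next slide varies.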

Sensitivity to initialization

[Figure: performance as a function of the initialization parameter η (0.2 to 1.0) for Token vs. Type.
HMM: log-likelihood (-7.2e6 to -6.6e6) and tag accuracy (10 to 60).
USM: log-likelihood (-3.4e5 to -1.9e5) and word token F1 (10 to 60).
PTSG: log-likelihood (-5.7e6 to -5.4e6).]

Type less sensitive than Token

15

Alternatives

Can we get the gains of Type via simpler means?

• Annealing (Tokenanneal)
  – Use p(z | x)^(1/T), temperature T from 5 to 1

17
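The Tokenanneal baseline can be sketched as follows: raise the sampling distribution to the power 1/T and renormalize before drawing, with T decreased from 5 to 1 over the run. The linear schedule below is an illustrative choice:

```python
import math, random

def annealed_sample(log_probs, T):
    """Sample an index from a categorical whose unnormalized log-probabilities
    are divided by temperature T: T > 1 flattens the distribution (more random
    mobility); T = 1 recovers the target p(z | x)."""
    scaled = [lp / T for lp in log_probs]
    mx = max(scaled)                      # subtract max for numerical stability
    w = [math.exp(s - mx) for s in scaled]
    r = random.random() * sum(w)
    for i, wi in enumerate(w):
        r -= wi
        if r <= 0:
            return i
    return len(w) - 1

def temperature(step, num_steps, T_start=5.0, T_end=1.0):
    """Linear annealing schedule from T_start down to T_end, matching the
    'temperature T from 5 to 1' setting of the Tokenanneal baseline."""
    frac = step / max(1, num_steps - 1)
    return T_start + frac * (T_end - T_start)
```

Each token-level Gibbs step then calls `annealed_sample` with the current temperature in place of an ordinary categorical draw.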

Annealing

[Figure: Token vs. Tokenanneal vs. Type.
HMM: log-likelihood and tag accuracy vs. time (hr.) [anneal. hurts].
USM: log-likelihood and word token F1 vs. time (min.) [anneal. helps].
PTSG: log-likelihood vs. time (hr.) [no effect].]

18

Alternatives

Can we get the gains of Type via simpler means?

• Annealing (Tokenanneal)
  – Use p(z | x)^(1/T), temperature T from 5 to 1
  – More (random) mobility through the space, but insufficient

• Sentence-based (Sentence)

20

Sentence

USM: [Figure: log-likelihood (-2.1e5 to -1.7e5) and word token F1 (35 to 60) vs. time (2 to 8 min.) for Token, Type, and Sentence.]

Sentence performs comparably to Token, but worse than Type.

Sentence requires dynamic programming, making it computationally more expensive than Token and Type.

21

Alternatives

Can we get the gains of Type via simpler means?

• Annealing (Tokenanneal)
  – Use p(z | x)^(1/T), temperature T from 5 to 1
  – More (random) mobility through the space, but insufficient

• Sentence-based (Sentence)
  – Sentence handles structural dependencies
  – Type handles parameter dependencies
  – Intuition: parameter dependencies matter more in unsupervised learning

23

Summary and outlook

General strategy: update many dependent variables tractably.

Variables sharing parameters are highly dependent; the type-based sampler updates exactly these variables.

Techniques for tractability:
  Old: exploit dynamic programming structure
  New: think about exchangeability

Tokens versus types:
• Older work operates on types (e.g., model merging): larger updates, but greedy and brittle.
• Recent methods operate on sentences or tokens: smaller updates, but softer and more robust.

Type-based sampling combines the advantages of both.

24
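The exchangeability trick the summary alludes to can be seen in a toy collapsed model (a sketch, not the paper's actual HMM/USM/PTSG samplers): S exchangeable binary variables share one Beta-distributed parameter that has been integrated out. Because the collapsed joint depends only on the count of ones, a block update over all 2^S configurations reduces to sampling a single count from S + 1 values.

```python
import math, random

def type_based_resample(S, alpha, beta):
    """Jointly resample S exchangeable Bernoulli variables whose shared
    parameter theta ~ Beta(alpha, beta) is collapsed out.
    The collapsed joint depends only on m = number of ones, so the
    2^S-state block update becomes a draw over m in {0, ..., S}."""
    def log_beta_fn(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    # log p(m ones) = log C(S, m) + log B(alpha + m, beta + S - m), up to a constant
    logw = [math.lgamma(S + 1) - math.lgamma(m + 1) - math.lgamma(S - m + 1)
            + log_beta_fn(alpha + m, beta + S - m) for m in range(S + 1)]
    mx = max(logw)
    w = [math.exp(l - mx) for l in logw]
    r = random.random() * sum(w)
    for m, wm in enumerate(w):
        r -= wm
        if r <= 0:
            break
    # by exchangeability, the m ones go to a uniformly random subset of sites
    sites = list(range(S))
    random.shuffle(sites)
    z = [0] * S
    for i in sites[:m]:
        z[i] = 1
    return z
```

The same pattern drives the type-based moves in the talk: all sites of one type are exchangeable given the rest, so their joint conditional collapses to a low-dimensional count distribution.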

Thank you!

25