Type-Based MCMC NAACL 2010 – Los Angeles
Percy Liang
Michael I. Jordan
Dan Klein
Learning latent-variable models

[figure: a corpus of tagged sentences sharing parameters θ — DT NNS VBD over "· · · some stocks rose · · ·", NN NNS VBD over "· · · tech stocks fell · · ·", DT NNS VBD over "· · · the stocks soared · · ·", WRB NNS VBD over "· · · how stocks dipped · · ·", DT NNS VBD over "· · · many stocks created · · ·"]

Variables are highly dependent:
• Simple: token-based (e.g., Gibbs sampler) ⇒ local optima, slow mixing
• Classic: sentence-based (e.g., EM) ⇒ exploits structural dependencies via dynamic programming
• New: type-based ⇒ exploits parameter dependencies via exchangeability
Structural dependencies

Dependencies between adjacent variables

[figure: probability landscape over taggings of "worker demands meeting resistance" — moving from NN VBZ NN NN to the target NN NNS VBG NN requires passing through the lower-probability intermediate NN VBZ VBG NN]

• Token-based: update only one variable at a time. Problem: need to go downhill before going uphill.
• Sentence-based: update all variables in a sentence.
Parameter dependencies

Dependencies between variables with shared parameters

[figure: probability landscape over joint taggings of four occurrences — "stocks rose", "stocks fell", "stocks took", "stocks shot" — as the tag of "stocks" flips from VBZ to NNS one occurrence at a time]

• Token-based: need to go downhill a lot before going uphill.
• Type-based: update all variables of one type.

1. Parameter dependencies create deeper valleys.
2. Sentence-based sampling cannot handle these dependencies.
What exactly is a type?
How can we update all variables of a type efficiently?
Formal setup

Parameters θ: vector of conditional probabilities
• θ_{T:V,N}: from state V, the probability of transitioning to state N
• p(θ) is a product of Dirichlets  [prior]

Choices z: specifies the values of the latent and observed variables
• j(z): the parameter used by choice z — e.g., for the transition [T:V,N at position 8], j(z) = T:V,N

[figure: positions 8 and 9 of a sequence, with states V → N above the word "stocks"]

• p(z | θ) = ∏_{z∈z} θ_{j(z)}  [likelihood]
• p(z) = ∫ p(z | θ) p(θ) dθ  [marginal likelihood]

Observations x: observed part of z (e.g., the words)

Goal: sample from p(z | x)  [not p(z | x, θ)]
Exchangeability

Sufficient statistics n: # times parameters were used in z
• n_{T:V,N}: # times that state V transitioned to state N

Rewrite the likelihood: p(z | θ) = ∏_j θ_j^{n_j} = simple function of n and θ
⇒ p(z) = simple function of n  (key: exchangeability)

[figure: two taggings z1 and z2 of "· · · many stocks rose · · ·" and "· · · tech stocks were · · ·" (tags D/N/V) that differ only in which occurrence of "stocks" is tagged N and which is tagged V]

z1 and z2 have the same sufficient statistics ⇒ p(z1) = p(z2)
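Because p(z) depends on z only through n, it can be computed in closed form by integrating out each Dirichlet-distributed parameter block (a Dirichlet-multinomial). A minimal sketch, assuming a symmetric Dirichlet(α) prior over a block with K outcomes (the function name is illustrative):

```python
import math
from collections import Counter

def log_marginal_likelihood(counts, alpha, K):
    # log p(z) contribution of one parameter block under a symmetric
    # Dirichlet(alpha) prior: integrate out theta (Dirichlet-multinomial).
    # counts: outcome -> n_j (the sufficient statistics); K: # outcomes.
    n = sum(counts.values())
    ll = math.lgamma(K * alpha) - math.lgamma(K * alpha + n)
    for n_j in counts.values():
        ll += math.lgamma(alpha + n_j) - math.lgamma(alpha)
    return ll

# Two choice sequences with the same sufficient statistics get the
# same marginal likelihood, as exchangeability requires.
z1 = Counter(["V", "N", "V", "V"])
z2 = Counter(["V", "V", "N", "V"])
```

For z1 above (three Vs, one N, K = 2, α = 1), this evaluates to log(1/20), matching ∫ θ³(1−θ) dθ = 1/20 under a uniform prior.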
Types

Type of a variable: its dependent parameter components.
For an HMM, type = assignment to the Markov blanket:
• type(the tag of "stocks" in "D · [stocks] · V") = (D, stocks, V)

[figure: four sentences — "· · · many stocks rose · · ·", "· · · tech stocks were · · ·", "· · · the stocks have · · ·", "· · · how stocks from · · ·" — each with an occurrence of "stocks" of type (D, stocks, V), tagged either V or N]

Assignments: [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV]

p(assignment) only depends on the number of Vs and Ns
Sampling same-type variables

Goal: sample an assignment of a set of same-type variables

[figure: all 2^B assignments for B = 4, arranged in columns by m = number of Ns — m=0: [VVVV] (prob. p0); m=1: [VVVN] [VVNV] [VNVV] [NVVV] (each p1); m=2: [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV] (each p2); m=3: [VNNN] [NVNN] [NNVN] [NNNV] (each p3); m=4: [NNNN] (p4)]

Algorithm:
1. Choose m ∈ {0, . . . , B} with prob. ∝ (B choose m) · p_m
2. Choose an assignment uniformly from column m
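A minimal sketch of the two-stage draw, assuming a binary V/N choice and log-domain column probabilities log_pm[m] supplied by the model (names are illustrative):

```python
import math
import random

def sample_same_type(B, log_pm):
    # log_pm[m]: log-probability of any single assignment with exactly
    # m Ns -- by exchangeability it depends only on m.
    # 1. Choose m with prob. proportional to C(B, m) * p_m.
    logw = [math.log(math.comb(B, m)) + log_pm[m] for m in range(B + 1)]
    mx = max(logw)
    weights = [math.exp(x - mx) for x in logw]
    m = random.choices(range(B + 1), weights=weights)[0]
    # 2. Choose which m positions get N, uniformly at random.
    ns = set(random.sample(range(B), m))
    return ["N" if i in ns else "V" for i in range(B)]
```

The binomial coefficient is what makes step 1 correct: column m contains C(B, m) equiprobable assignments, so its total mass is C(B, m)·p_m.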
Full algorithm

Iterate:
1. Choose a position  (e.g., 2)
2. Add variables with the same type  (e.g., (D, stocks, V))
3. Sample m  (e.g., 1 ⇒ {V, V, V, N})
4. Sample an assignment  (e.g., [VNVV])

[figure: the four "stocks" sentences from before, with the tags at the selected same-type positions jointly updated]
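For the HMM case, one iteration might look like the sketch below (illustrative: tags/words are flat lists, the callback stands in for the two-stage same-type draw, and it omits the paper's handling of sites whose Markov blankets overlap):

```python
import random

def type_based_sweep(tags, words, resample_same_type):
    # One iteration of the type-based sampler on a flat tag sequence.
    # 1. Choose an interior position at random (so its blanket exists).
    i = random.randrange(1, len(tags) - 1)
    # 2. Collect all positions sharing its type (prev tag, word, next tag).
    t = (tags[i - 1], words[i], tags[i + 1])
    sites = [j for j in range(1, len(tags) - 1)
             if (tags[j - 1], words[j], tags[j + 1]) == t]
    # 3-4. Jointly resample those positions; `resample_same_type` is a
    # callback standing in for the exchangeable two-stage scheme.
    for j, v in zip(sites, resample_same_type(sites)):
        tags[j] = v
    return sites
```

Only the selected same-type positions are touched; every other tag in the corpus stays fixed for the iteration.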
Experimental setup: sampling algorithms

[figure: sentences "tech stocks rose in heavy trading", "worker demands meeting resistance", "many stocks shot up today", "investors await stocks news", each token labeled with a numbered hidden state]

Token-based sampler (Token): 1. Choose token  2. Update conditioned on the rest
Sentence-based sampler (Sentence): 1. Choose sentence  2. Update conditioned on the rest
Type-based sampler (Type): 1. Choose type  2. Update conditioned on the rest
Experimental setup: models/tasks/datasets

Hidden Markov model (HMM) — part-of-speech induction
• Example: NN NNS VBG NN / worker demands meeting resistance
• Dataset: WSJ (49K sentences, 45 tags)

Unigram segmentation model (USM) [Goldwater et al., 2006] — word segmentation
• Example: look at the book
• Dataset: CHILDES (9.7K sentences)

Probabilistic tree-substitution grammar (PTSG) [Cohn et al., 2009]
• Example: (S (NP (DT the) (NN sun)) (VP (VBD has) (VBN risen)))
• Dataset: WSJ (49K sentences, 45 tags)
Token versus Type

[plots: log-likelihood vs. time for HMM (−7.3e6 to −6.5e6 over 3–12 hr.), USM (−2.1e5 to −1.7e5 over 2–8 min.), and PTSG (−6.1e6 to −5.3e6 over 3–12 hr.), plus tag accuracy (HMM) and word token F1 (USM); curves for Token and Type]
Sensitivity to initialization

η = 0 (use few params.):
• V V V V / worker demands meeting resistance
• l o o k a t t h e b o o k

η = 1 (use many params.):
• P N D V / worker demands meeting resistance
• lookatthebook
Sensitivity to initialization

[plots: final log-likelihood vs. η ∈ [0.2, 1.0] for HMM (−7.2e6 to −6.6e6), USM (−3.4e5 to −1.9e5), and PTSG (−5.7e6 to −5.4e6), plus tag accuracy (HMM) and word token F1 (USM); curves for Token and Type]

Type is less sensitive to initialization than Token
Alternatives: Can we get the gains of Type via simpler means?
• Annealing (Tokenanneal)
  – Use p(z | x)^{1/T}, temperature T from 5 to 1
17
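The Tokenanneal scheme above (sampling from p(z | x)^{1/T} while lowering T from 5 to 1) can be sketched as follows. This is a minimal illustration, not the authors' code; the helper names and the linear temperature schedule are assumptions.

```python
import math
import random

def annealed_choice(log_probs, T, rng=random):
    """Sample an index i with probability proportional to p_i^(1/T).
    T > 1 flattens the conditional (more random mobility); T = 1
    recovers the ordinary Gibbs conditional."""
    scaled = [lp / T for lp in log_probs]
    mx = max(scaled)                       # subtract max for numerical stability
    ws = [math.exp(s - mx) for s in scaled]
    r = rng.random() * sum(ws)
    for i, w in enumerate(ws):
        r -= w
        if r <= 0:
            return i
    return len(ws) - 1

def temperature(step, num_steps, T_start=5.0, T_end=1.0):
    """Anneal T from T_start down to T_end over num_steps sweeps
    (linear schedule assumed for illustration)."""
    frac = step / max(1, num_steps - 1)
    return T_start + frac * (T_end - T_start)
```

A token-based sweep would call `annealed_choice` on each token's log conditional probabilities, with `temperature(step, num_steps)` decreasing across sweeps.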
Annealing
[Plots, Token vs. Tokenanneal vs. Type:
 HMM: log-likelihood and tag accuracy vs. time (hr.) [anneal. hurts]
 USM: log-likelihood and word token F1 vs. time (min.) [anneal. helps]
 PTSG: log-likelihood vs. time (hr.) [no effect]]
18
Alternatives: Can we get the gains of Type via simpler means?
• Annealing (Tokenanneal)
  – Use p(z | x)^{1/T}, temperature T from 5 to 1
  – More (random) mobility through space, but insufficient
• Sentence-based (Sentence)
20
Sentence
[Plots, USM: log-likelihood and word token F1 vs. time (min.); Token vs. Type vs. Sentence]
Sentence performs comparably to Token, but worse than Type
Sentence requires dynamic programming and is computationally more expensive than Token and Type
21
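The dynamic program behind a sentence-based sampler is forward filtering followed by backward sampling, which draws a whole tag sequence jointly from p(z | x). A minimal sketch for an HMM with fixed parameters; the function name and toy parameters are illustrative, and the real sampler would use the model's current (collapsed) counts.

```python
import random

def ffbs_sample(obs, pi, A, B, rng=random):
    """Forward-filter backward-sample one tag sequence z ~ p(z | x)
    for an HMM with start probs pi[k], transitions A[k][j],
    and emissions B[k][word]."""
    K, T = len(pi), len(obs)
    # forward pass: alpha[t][k] proportional to p(x_1..t, z_t = k)
    alpha = [[0.0] * K for _ in range(T)]
    for k in range(K):
        alpha[0][k] = pi[k] * B[k][obs[0]]
    for t in range(1, T):
        for j in range(K):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][k] * A[k][j] for k in range(K))
        s = sum(alpha[t])                 # rescale to avoid underflow
        alpha[t] = [a / s for a in alpha[t]]

    def draw(weights):
        r = rng.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                return k
        return len(weights) - 1

    # backward pass: sample z_T, then z_t given z_{t+1}
    z = [0] * T
    z[T - 1] = draw(alpha[T - 1])
    for t in range(T - 2, -1, -1):
        z[t] = draw([alpha[t][k] * A[k][z[t + 1]] for k in range(K)])
    return z
```

The O(TK^2) forward pass is the extra dynamic-programming cost the slide refers to, paid once per sentence per sweep.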
Alternatives: Can we get the gains of Type via simpler means?
• Annealing (Tokenanneal)
  – Use p(z | x)^{1/T}, temperature T from 5 to 1
  – More (random) mobility through space, but insufficient
• Sentence-based (Sentence)
  – Sentence handles structural dependencies
  – Type handles parameter dependencies
  – Intuition: parameter dependencies more important in unsupervised learning
23
Summary and outlook
General strategy: update many dependent variables tractably
• Variables sharing parameters are highly dependent; the type-based sampler updates exactly these variables
Techniques for tractability:
• Old: exploit dynamic programming structure
• New: think about exchangeability
Tokens versus types:
• Older work operates on types (e.g., model merging): larger updates, but greedy and brittle
• Recent methods operate on sentences or tokens: smaller updates, but softer and more robust
• Type-based sampling combines the advantages of both
24
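As a concrete instance of the exchangeability idea: in a collapsed two-cluster Dirichlet-multinomial mixture (a stand-in for the talk's models, not the paper's actual HMM/USM/PTSG code), every assignment of one word type's S tokens that sends the same number m to cluster 1 has the same probability. So a type-based update only needs to score S+1 values of m and then place tokens at random. All names and the toy model below are illustrative assumptions.

```python
import math
import random

def log_rising(x, k):
    # log of the rising factorial x(x+1)...(x+k-1); k = 0 gives 0
    return sum(math.log(x + j) for j in range(k))

def type_update(S, n0, n1, V, alpha=1.0, beta=0.5, rng=random):
    """Jointly resample the cluster labels of all S tokens of one word
    type. n0, n1: token counts in each cluster excluding this type's
    tokens; V: vocabulary size. By exchangeability, only m = number of
    tokens sent to cluster 1 matters."""
    logw = []
    for m in range(S + 1):
        lw = (math.lgamma(S + 1) - math.lgamma(m + 1) - math.lgamma(S - m + 1)  # C(S, m)
              + log_rising(n1 + alpha, m) + log_rising(n0 + alpha, S - m)       # cluster prior
              + log_rising(beta, m) - log_rising(n1 + V * beta, m)              # emissions, cluster 1
              + log_rising(beta, S - m) - log_rising(n0 + V * beta, S - m))     # emissions, cluster 0
        logw.append(lw)
    mx = max(logw)
    ws = [math.exp(l - mx) for l in logw]
    r = rng.random() * sum(ws)
    m = S
    for i, w in enumerate(ws):
        r -= w
        if r <= 0:
            m = i
            break
    # exchangeable: any m of the S sites may go to cluster 1
    sites = list(range(S))
    rng.shuffle(sites)
    z = [0] * S
    for i in sites[:m]:
        z[i] = 1
    return z
```

This is the one-weekend version of the trick: O(S) work per type instead of enumerating 2^S joint assignments, which is what makes the type-based block update tractable.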
Thank you!
25