C19 : Lecture 4 : A Gibbs Sampler for Gaussian Mixture Models

Frank Wood

University of Oxford

January, 2014

Figures and derivations from [Wood and Black, 2008]


Gaussian Mixture Model



\[
\begin{aligned}
c_i \mid \vec{\pi} &\sim \mathrm{Discrete}(\vec{\pi}) \\
\vec{y}_i \mid c_i = k, \Theta &\sim \mathrm{Gaussian}(\cdot \mid \theta_k) \\
\vec{\pi} \mid \alpha &\sim \mathrm{Dirichlet}\!\left(\cdot \,\Big|\, \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\right) \\
\Theta &\sim G_0
\end{aligned}
\]

[Figure: Bayesian GMM graphical model]

Kinds of questions:
- What's the probability mass in this region?
- Is item i the same as item j?
- How many classes are there? (somewhat dangerous)


Gibbs Sampler for GMM I

A Gaussian mixture model is a density constructed by mixing Gaussians

\[
P(\vec{y}_i) = \sum_{k=1}^{K} P(c_i = k)\, P(\vec{y}_i \mid \theta_k)
\]

where K is the number of "classes," c_i is a class indicator variable (i.e. c_i = k means that the ith observation came from class k), P(c_i = k) = π_k represents the a priori probability that observation i was generated by class k, and the likelihood P(y_i | θ_k) is the generative model of data from class k, parameterised by θ_k. The observation model is taken to be a multivariate Gaussian with parameters θ_k for each class k.
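To make the density concrete, here is a minimal sketch (not part of the lecture; the mixture weights, means, and covariances are made-up illustrative values) that evaluates P(y_i) with numpy and scipy:

```python
# Illustrative sketch: evaluating a GMM density P(y) = sum_k pi_k N(y | mu_k, Sigma_k).
# The mixture parameters below are made-up example values, not from the lecture.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.5, 0.3, 0.2])                           # class prior probabilities pi_k
mus = [np.array([0.0, 0.0]),
       np.array([3.0, 3.0]),
       np.array([-3.0, 2.0])]                            # class means mu_k
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]   # class covariances Sigma_k

def gmm_density(y):
    """P(y) = sum_k P(c = k) P(y | theta_k) with Gaussian class-conditionals."""
    return sum(pi_k * multivariate_normal.pdf(y, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas))

print(gmm_density(np.array([1.0, 1.0])))
```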


Gibbs Sampler for GMM II

Notationally, C = {c_i}_{i=1}^N is the collection of all class indicator variables, Θ = {θ_k}_{k=1}^K is the collection of all class parameters, θ_k = {μ_k, Σ_k} are the mean and covariance for class k, and π = {π_k}_{k=1}^K, with π_k = P(c_i = k), are the class prior probabilities.


Gibbs Sampler for GMM III

To estimate the posterior distribution we first have to specify a prior for all of the parameters of the model:

\[
\vec{\pi} \mid \alpha \sim \mathrm{Dirichlet}\!\left(\cdot \,\Big|\, \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\right) \tag{1}
\]

and Θ ∼ G_0, where Θ ∼ G_0 is shorthand for

\[
\Sigma_k \sim \text{Inverse-Wishart}_{\nu_0}(\Lambda_0^{-1}) \tag{2}
\]

\[
\vec{\mu}_k \sim \mathrm{Gaussian}(\vec{\mu}_0, \Sigma_k / \kappa_0). \tag{3}
\]

These priors are chosen for mathematical convenience and interpretable expressiveness. They are conjugate priors which will allow us to analytically perform many of the marginalization steps (integrations) necessary to derive a Gibbs sampler for this model.
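A minimal sketch of drawing (π, Θ) from these priors, under assumed hyperparameter values; note the assumption that scipy's invwishart with scale Λ_0 is the parameterisation matching the Inverse-Wishart density written out in equation (8) below:

```python
# Sketch: one joint draw from the priors in (1)-(3), under assumed hyperparameters.
# Note: scipy's invwishart takes a scale matrix; passing Lambda0 as that scale is
# intended to match the Inverse-Wishart density written out in equation (8).
import numpy as np
from scipy.stats import dirichlet, invwishart, multivariate_normal

np.random.seed(0)
K, d = 3, 2                      # number of classes and data dimension (assumed)
alpha = 1.0                      # Dirichlet concentration (assumed)
mu0, kappa0 = np.zeros(d), 1.0   # prior mean and its pseudo-observation count
nu0, Lambda0 = d + 2, np.eye(d)  # inverse-Wishart degrees of freedom and scale

# pi | alpha ~ Dirichlet(alpha/K, ..., alpha/K)
pi = dirichlet.rvs(np.full(K, alpha / K))[0]

# Theta ~ G0: Sigma_k ~ Inverse-Wishart_{nu0}(Lambda0^{-1}), mu_k ~ N(mu0, Sigma_k / kappa0)
thetas = []
for _ in range(K):
    Sigma_k = invwishart.rvs(df=nu0, scale=Lambda0)
    mu_k = multivariate_normal.rvs(mean=mu0, cov=Sigma_k / kappa0)
    thetas.append((mu_k, Sigma_k))
```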


Gibbs Sampler for GMM IV

The parameters of the Inverse-Wishart prior, H = {Λ_0^{-1}, ν_0, μ_0, κ_0}, are used to encode our prior beliefs regarding class observation distribution shape and variability. For instance, μ_0 specifies our prior belief about what the mean of all classes should look like, while κ_0 is the number of pseudo-observations we are willing to ascribe to that belief (in a way similar to that described above for the Dirichlet prior). The hyperparameters Λ_0^{-1} and ν_0 encode the ways in which individual observations are likely to vary from the mean and how confident we are in our prior beliefs about that.
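The slides do not prescribe particular values; as one illustration, a weakly-informative, data-driven choice might look like the following sketch (every value here is an assumption, not the lecture's recommendation):

```python
# Illustrative only (not from the lecture): one common weakly-informative way to
# set H = {Lambda0^{-1}, nu0, mu0, kappa0} from the data itself.
import numpy as np

def default_hyperparameters(Y):
    """Y: (N, d) array of observations. Returns (mu0, kappa0, nu0, Lambda0)."""
    N, d = Y.shape
    mu0 = Y.mean(axis=0)               # centre class means on the overall data mean
    kappa0 = 1.0                       # give that belief the weight of one pseudo-observation
    nu0 = d + 2                        # smallest df for which the inverse-Wishart mean exists
    Lambda0 = np.cov(Y, rowvar=False)  # prior mean of Sigma_k is Lambda0/(nu0 - d - 1) = data cov
    return mu0, kappa0, nu0, Lambda0
```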


Gibbs Sampler for GMM V

The joint distribution (from the graphical model) for a Gaussian mixture model is

\[
P(Y, \Theta, C, \vec{\pi}, \alpha; H) = \left(\prod_{j=1}^{K} P(\theta_j; H)\right) \left(\prod_{i=1}^{N} P(\vec{y}_i \mid c_i, \theta_{c_i})\, P(c_i \mid \vec{\pi})\right) P(\vec{\pi} \mid \alpha)\, P(\alpha).
\]


Gibbs Sampler for GMM VI

Applying Bayes' rule and conditioning on the observed data, we see that the posterior distribution is simply proportional to the joint (rewritten slightly)

\[
P(C, \Theta, \vec{\pi}, \alpha \mid Y; H) \propto P(Y \mid C, \Theta)\, P(\Theta; H) \left(\prod_{i=1}^{N} P(c_i \mid \vec{\pi})\right) P(\vec{\pi} \mid \alpha)\, P(\alpha)
\]

where

\[
P(Y \mid C, \Theta) = \prod_{i=1}^{N} P(\vec{y}_i \mid c_i, \theta_{c_i})
\]

and

\[
P(\Theta; H) = \prod_{j=1}^{K} P(\theta_j; H).
\]


Gibbs Sampler for GMM VII

Gibbs sampling is possible in this model. Deriving a Gibbs sampler for this model requires deriving an expression for the conditional distribution of every latent variable conditioned on all of the others. To start, note that π can be analytically marginalised out:

\[
P(C \mid \alpha) = \int d\vec{\pi} \prod_{i=1}^{N} P(c_i \mid \vec{\pi})\, P(\vec{\pi} \mid \alpha)
= \frac{\prod_{k=1}^{K} \Gamma\!\left(m_k + \frac{\alpha}{K}\right)}{\Gamma\!\left(\frac{\alpha}{K}\right)^{K}} \, \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \tag{4}
\]

where m_k is the number of observations assigned to class k.
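A short sketch (the function name is my own) evaluating the log of (4) with scipy's gammaln, which is how this term typically enters a sampler:

```python
# Sketch: log P(C | alpha) from equation (4), computed in log space for stability.
import numpy as np
from scipy.special import gammaln

def log_p_C_given_alpha(c, K, alpha):
    """c: length-N integer array of class indicators in {0, ..., K-1}."""
    N = len(c)
    m = np.bincount(c, minlength=K)          # m_k = number of observations in class k
    return (np.sum(gammaln(m + alpha / K)) - K * gammaln(alpha / K)
            + gammaln(alpha) - gammaln(N + alpha))
```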


Gibbs Sampler for GMM VIII

Continuing along this line, it is also possible to analytically marginalize out Θ. In the following expression the joint over the data is considered and broken into K parts, each corresponding to one class. We use H to denote all hyperparameters.

\[
\begin{aligned}
P(C, Y; H) &= \int d\Theta\, P(C, \Theta, Y; H) \\
&= P(C; H) \int \cdots \int d\theta_1 \cdots d\theta_K \left(\prod_{j=1}^{K} P(\theta_j; H)\right) \prod_{i=1}^{N} P(\vec{y}_i \mid c_i, \theta_{c_i}) \\
&\propto P(C; H) \int \cdots \int d\theta_1 \cdots d\theta_K \prod_{j=1}^{K} \left(\prod_{i=1}^{N} P(\vec{y}_i \mid c_i = j, \theta_j)^{\mathbb{I}(c_i = j)}\right) P(\theta_j; H) \\
&= P(C; H) \prod_{j=1}^{K} \int d\theta_j \left(\prod_{i=1}^{N} P(\vec{y}_i \mid c_i = j, \theta_j)^{\mathbb{I}(c_i = j)}\right) P(\theta_j; H) \tag{5}
\end{aligned}
\]


Gibbs Sampler for GMM IX

Focusing on a single cluster and dropping the class index j for now, we see that

\[
P(Y \mid H) = \int \left(\prod_{i=1}^{N} P(\vec{y}_i \mid \theta)\right) P(\theta; H)\, d\theta \tag{6}
\]

is the standard conjugate likelihood-prior integral for an MVN-IW model where, remembering that θ = {μ, Σ}, the MVN likelihood term P(y_i | θ) expands to the familiar joint distribution for N i.i.d. multivariate normal observations

\[
\prod_{i=1}^{N} P(\vec{y}_i \mid \theta) = (2\pi)^{-\frac{Nd}{2}}\, |\Sigma|^{-\frac{N}{2}}\, e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1} S_0)} \tag{7}
\]

where S_0 = \sum_{i=1}^{N} (\vec{y}_i - \vec{\mu})(\vec{y}_i - \vec{\mu})^T. Here, following convention, |X| means the matrix determinant of X.
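The trace form in (7) is easy to check numerically; the following sketch (illustrative parameters and data, not from the slides) compares it against a direct sum of scipy log-densities:

```python
# Sketch: the joint MVN log-likelihood in the trace form of equation (7),
# checked against a direct sum of scipy log-densities.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, d = 50, 2
mu, Sigma = np.zeros(d), np.eye(d)           # illustrative parameters
Y = rng.normal(size=(N, d))                  # illustrative data

S0 = (Y - mu).T @ (Y - mu)                   # S0 = sum_i (y_i - mu)(y_i - mu)^T
logdet = np.linalg.slogdet(Sigma)[1]
loglik_trace = (-0.5 * N * d * np.log(2 * np.pi) - 0.5 * N * logdet
                - 0.5 * np.trace(np.linalg.solve(Sigma, S0)))

loglik_direct = multivariate_normal.logpdf(Y, mean=mu, cov=Sigma).sum()
assert np.allclose(loglik_trace, loglik_direct)
```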


Gibbs Sampler for GMM X

For ease of reference the MVN-IW prior P(θ; H) is

\[
\begin{aligned}
P(\theta; H) &= P(\vec{\mu}, \Sigma \mid \vec{\mu}_0, \Lambda_0, \nu_0, \kappa_0) \\
&= \left(\frac{2\pi}{\kappa_0}\right)^{-\frac{d}{2}} |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{\kappa_0}{2}(\vec{\mu} - \vec{\mu}_0)^T \Sigma^{-1} (\vec{\mu} - \vec{\mu}_0)} \;
\frac{|\Lambda_0|^{\frac{\nu_0}{2}}\, |\Sigma|^{-\frac{\nu_0 + d + 1}{2}}\, e^{-\frac{1}{2}\mathrm{tr}(\Lambda_0 \Sigma^{-1})}}{2^{\frac{\nu_0 d}{2}}\, \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\!\left(\frac{\nu_0 + 1 - j}{2}\right)} \tag{8}
\end{aligned}
\]


Gibbs Sampler for GMM XI

Now, the choice earlier in this section of a conjugate prior for the MVN class parameters helps tremendously. This seemingly daunting integral has a simple analytical solution thanks to conjugacy. Following [Gelman et al., 1995], pg. 87, we make the following variable substitutions

\[
\begin{aligned}
\vec{\mu}_n &= \frac{\kappa_0}{\kappa_0 + N}\vec{\mu}_0 + \frac{N}{\kappa_0 + N}\bar{y} \\
\kappa_n &= \kappa_0 + N \\
\nu_n &= \nu_0 + N \\
\Lambda_n &= \Lambda_0 + S + \frac{\kappa_0 N}{\kappa_0 + N}(\bar{y} - \vec{\mu}_0)(\bar{y} - \vec{\mu}_0)^T
\end{aligned}
\]

where

\[
S = \sum_{i=1}^{N} (\vec{y}_i - \bar{y})(\vec{y}_i - \bar{y})^T.
\]


Gibbs Sampler for GMM XII

Now

\[
P(Y; H) = \frac{1}{Z_0} \int\!\!\int d\vec{\mu}\, d\Sigma\, |\Sigma|^{-\left(\frac{\nu_n + d}{2} + 1\right)} e^{-\frac{1}{2}\mathrm{tr}(\Lambda_n \Sigma^{-1}) - \frac{\kappa_n}{2}(\vec{\mu} - \vec{\mu}_n)^T \Sigma^{-1} (\vec{\mu} - \vec{\mu}_n)}
\]

can be solved immediately by realizing that the integrand is itself an (unnormalised) MVN-IW distribution, so the integral is simply that distribution's normalization constant. Here Z_0 collects the normalization constants of the likelihood and the prior,

\[
Z_0 = (2\pi)^{\frac{Nd}{2}} \left(\frac{2\pi}{\kappa_0}\right)^{\frac{d}{2}} 2^{\frac{\nu_0 d}{2}}\, \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\!\left(\frac{\nu_0 + 1 - j}{2}\right) |\Lambda_0|^{-\frac{\nu_0}{2}}. \tag{9}
\]

The integral itself, call it Z_n, is the MVN-IW normalization constant evaluated at the updated parameters, i.e.,


Gibbs Sampler for GMM XIII

\[
\begin{aligned}
Z_n &= \int\!\!\int d\vec{\mu}\, d\Sigma\, |\Sigma|^{-\left(\frac{\nu_n + d}{2} + 1\right)} e^{-\frac{1}{2}\mathrm{tr}(\Lambda_n \Sigma^{-1}) - \frac{\kappa_n}{2}(\vec{\mu} - \vec{\mu}_n)^T \Sigma^{-1} (\vec{\mu} - \vec{\mu}_n)} \\
&= \left(\frac{2\pi}{\kappa_n}\right)^{\frac{d}{2}} 2^{\frac{\nu_n d}{2}}\, \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\!\left(\frac{\nu_n + 1 - j}{2}\right) |\Lambda_n|^{-\frac{\nu_n}{2}}
\end{aligned}
\]

which yields

\[
P(Y; H) = \frac{Z_n}{Z_0} \tag{10}
\]

\[
= (2\pi)^{-\frac{Nd}{2}} \left(\frac{\kappa_0}{\kappa_n}\right)^{\frac{d}{2}} 2^{\frac{d(\nu_n - \nu_0)}{2}}\, \frac{|\Lambda_0|^{\frac{\nu_0}{2}}}{|\Lambda_n|^{\frac{\nu_n}{2}}}\, \frac{\prod_{j=1}^{d} \Gamma\!\left(\frac{\nu_n + 1 - j}{2}\right)}{\prod_{j=1}^{d} \Gamma\!\left(\frac{\nu_0 + 1 - j}{2}\right)} \tag{11}
\]

Since ν_n − ν_0 = N, the constant factors (2π)^{-Nd/2} 2^{d(ν_n − ν_0)/2} simplify to π^{-Nd/2}.
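A sketch of the per-class log marginal likelihood implied by (11), computed from the updates on the previous slides; the function name and the use of scipy's multigammaln (whose π factors cancel in the ratio) are my own choices:

```python
# Sketch: per-class log marginal likelihood log P(Y; H) = log Zn - log Z0, i.e.
# equation (11) with the constants simplified to pi^{-Nd/2}.
import numpy as np
from scipy.special import multigammaln

def log_marginal_likelihood(Y, mu0, kappa0, nu0, Lambda0):
    """Y: (N, d) array of the observations assigned to one class (N >= 1)."""
    N, d = Y.shape
    ybar = Y.mean(axis=0)
    S = (Y - ybar).T @ (Y - ybar)
    kappan, nun = kappa0 + N, nu0 + N
    diff = (ybar - mu0).reshape(-1, 1)
    Lambdan = Lambda0 + S + (kappa0 * N / kappan) * (diff @ diff.T)
    return (-0.5 * N * d * np.log(np.pi)
            + 0.5 * d * (np.log(kappa0) - np.log(kappan))
            + 0.5 * nu0 * np.linalg.slogdet(Lambda0)[1]
            - 0.5 * nun * np.linalg.slogdet(Lambdan)[1]
            + multigammaln(nun / 2.0, d) - multigammaln(nu0 / 2.0, d))
```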


Gibbs Sampler for GMM XIV

Remembering that our derivation of P(Y; H) was for a single class, we now have an analytic expression for

\[
P(C, Y; H) = \left(\prod_{j=1}^{K} P(Y^{(j)} \mid C; H)\right) P(C \mid H)
\]

where Y^{(j)} denotes the observations currently assigned to class j. From this we could Metropolis-Hastings (MH) sample by modifying each c_i and recomputing the joint.


Gibbs Sampler for GMM XV

In this model we can do better by deriving a Gibbs update for each class indicator variable c_i:

\[
\begin{aligned}
P(c_i = j \mid C_{-i}, Y, \alpha; H) &\propto P(Y \mid C; H)\, P(C \mid \alpha) \\
&\propto \left(\prod_{j'=1}^{K} P(Y^{(j')}; H)\right) P(c_i = j \mid C_{-i}, \alpha) \\
&\propto P(Y^{(j)}; H)\, P(c_i = j \mid C_{-i}, \alpha) \\
&\propto P(\vec{y}_i \mid Y^{(j)} \setminus \vec{y}_i; H)\, P(c_i = j \mid C_{-i}, \alpha)
\end{aligned} \tag{12}
\]

where Y^{(j)} \ y_i is the set of observations currently assigned to class j except y_i (y_i is "removed" from the class to which it belongs when sampling).


Gibbs Sampler for GMM XVI

Because of our choice of conjugate prior we know that P(y_i | Y^{(j)} \ y_i; H) is multivariate Student-t (Gelman et al. [1995], pg. 88):

\[
\vec{y}_i \mid Y^{(j)} \setminus \vec{y}_i; H \sim t_{\nu_n - D + 1}\!\left(\vec{\mu}_n,\ \frac{\Lambda_n (\kappa_n + 1)}{\kappa_n (\nu_n - D + 1)}\right) \tag{13}
\]

where

\[
\begin{aligned}
\vec{\mu}_n &= \frac{\kappa_0}{\kappa_0 + N}\vec{\mu}_0 + \frac{N}{\kappa_0 + N}\bar{y} \\
\kappa_n &= \kappa_0 + N \\
\nu_n &= \nu_0 + N \\
\Lambda_n &= \Lambda_0 + S + \frac{\kappa_0 N}{\kappa_0 + N}(\bar{y} - \vec{\mu}_0)(\bar{y} - \vec{\mu}_0)^T
\end{aligned}
\]

and D is the dimensionality of y_i. Note that N, ȳ, μ_n, κ_n, ν_n, Λ_n must all be computed from the observations currently assigned to class j, excluding y_i.
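A sketch of the log predictive density in (13); it assumes scipy >= 1.6 (for multivariate_t) and that Yj already contains only the class-j observations with y_i removed. The empty-class fallback to the prior predictive is an implementation choice, not something stated on the slide:

```python
# Sketch: log P(y_i | Y^(j) \ y_i; H), the Student-t predictive of equation (13).
import numpy as np
from scipy.stats import multivariate_t

def log_predictive(y, Yj, mu0, kappa0, nu0, Lambda0):
    """y: (d,) query point; Yj: (N, d) observations currently in class j, minus y."""
    y, mu0 = np.asarray(y, float), np.asarray(mu0, float)
    d = len(y)
    N = len(Yj)
    if N == 0:
        # Empty class: fall back to the prior predictive (the N = 0 case of the updates).
        mun, kappan, nun, Lambdan = mu0, kappa0, nu0, Lambda0
    else:
        ybar = Yj.mean(axis=0)
        S = (Yj - ybar).T @ (Yj - ybar)
        kappan, nun = kappa0 + N, nu0 + N
        mun = (kappa0 * mu0 + N * ybar) / kappan
        diff = (ybar - mu0).reshape(-1, 1)
        Lambdan = Lambda0 + S + (kappa0 * N / kappan) * (diff @ diff.T)
    df = nun - d + 1
    shape = Lambdan * (kappan + 1) / (kappan * df)
    return multivariate_t.logpdf(y, loc=mun, shape=shape, df=df)
```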


Gibbs Sampler for GMM XVII

Deriving P(c_i = j | C_{-i}, α) was covered in the first lecture and involves simplifying a ratio of two Dirichlet normalising constants. Its simplified form is

\[
P(c_i = j \mid C_{-i}, \alpha) = \frac{m_j + \frac{\alpha}{K}}{N - 1 + \alpha}
\]

where m_j is the number of observations, excluding y_i, currently assigned to class j.
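In code this term is one line; the sketch below (names are my own) returns the log prior probability for all K classes at once, with m_j counted from the labels excluding observation i:

```python
# Sketch: log P(c_i = j | C_{-i}, alpha) for a finite mixture with a Dirichlet(alpha/K) prior.
import numpy as np

def log_prior_class_prob(c_minus_i, K, alpha):
    """c_minus_i: integer class labels of all observations except i; returns length-K log probs."""
    N_minus_1 = len(c_minus_i)
    m = np.bincount(c_minus_i, minlength=K)      # m_j excludes observation i
    return np.log(m + alpha / K) - np.log(N_minus_1 + alpha)
```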


Gibbs Sampler for GMM XVIII

Given

\[
P(c_i = j \mid C_{-i}, Y, \alpha; H) \propto P(\vec{y}_i \mid Y^{(j)} \setminus \vec{y}_i; H)\, P(c_i = j \mid C_{-i}, \alpha)
= P(\vec{y}_i \mid Y^{(j)} \setminus \vec{y}_i; H)\, \frac{m_j + \frac{\alpha}{K}}{N - 1 + \alpha}
\]

we can enumerate all K values c_i could take, normalise, and then sample from this discrete distribution.
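Putting the two factors together, a minimal sketch of one Gibbs sweep over the indicators, reusing the hypothetical helpers log_predictive and log_prior_class_prob sketched above; this is an illustration of the update, not the lecture's reference implementation:

```python
# Sketch: one Gibbs sweep over the class indicators c_1, ..., c_N, reusing the
# hypothetical helpers log_predictive and log_prior_class_prob sketched earlier.
import numpy as np

def gibbs_sweep(Y, c, K, alpha, mu0, kappa0, nu0, Lambda0, rng):
    """Y: (N, d) data; c: (N,) integer labels in {0, ..., K-1}, updated in place."""
    N = len(Y)
    for i in range(N):
        keep = np.ones(N, dtype=bool)
        keep[i] = False                           # "remove" y_i from its current class
        log_prob = log_prior_class_prob(c[keep], K, alpha)
        for j in range(K):
            Yj = Y[keep & (c == j)]               # class-j observations excluding y_i
            log_prob[j] += log_predictive(Y[i], Yj, mu0, kappa0, nu0, Lambda0)
        log_prob -= log_prob.max()                # normalise in log space for stability
        probs = np.exp(log_prob)
        c[i] = rng.choice(K, p=probs / probs.sum())
    return c
```

Repeated sweeps produce (correlated) samples of C from its posterior; because Θ and π were marginalised out, class parameters can be recovered afterwards from their conditional posteriors if needed.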


Study Suggestions

- Implement this sampler.
- Write the following question as an integral: what's the probability of a datapoint falling in a particular region of space? Given the Gibbs sampler described, how would you answer this question efficiently?
- Derive and implement a Gibbs sampler for LDA.
- Derive and implement a sampler for PPCA.
- How would you answer the question: how many classes are there?
- Is it safe, in this model, to ask questions about the characteristics of class i?


Bibliography I

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995.

F. Wood and M. J. Black. A nonparametric Bayesian alternative to spike sorting. Journal of Neuroscience Methods, to appear, 2008.

