Bayesian SPLDA

Jesús Villalba
Communications Technology Group (GTC), Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain

arXiv:1511.07318v1 [stat.ML] 20 Nov 2015

[email protected]

May 15, 2012

1 Introduction

In this document we derive the equations needed to implement a Variational Bayes (VB) estimation of the parameters of the SPLDA model [1]. This can be used to adapt SPLDA from one database to another when little development data is available, or to implement the fully Bayesian recipe [2]. Our approach is similar to Bishop's VB PPCA in [3].

2 The Model

2.1 SPLDA

SPLDA is a linear generative model represented in Figure 1.

[Figure 1: BN for the Bayesian SPLDA model, with variables N_i, φ_ij, y_i, θ_ij, M, µ, W, V and α.]

An i-vector φ of speaker i can be written as:
\begin{equation}
\phi_{ij} = \mu + \mathbf{V}\mathbf{y}_i + \boldsymbol{\epsilon}_{ij} \tag{1}
\end{equation}
where µ is a speaker-independent mean, V is the eigenvoices matrix, y_i is the speaker factor vector, and ε_ij is a channel offset. We assume the following priors for the variables:
\begin{align}
\mathbf{y}_i &\sim \mathcal{N}\left(\mathbf{y}_i|\mathbf{0},\mathbf{I}\right) \tag{2} \\
\boldsymbol{\epsilon}_{ij} &\sim \mathcal{N}\left(\boldsymbol{\epsilon}_{ij}|\mathbf{0},\mathbf{W}^{-1}\right) \tag{3}
\end{align}
where N denotes a Gaussian distribution and W is the within-class precision matrix.

2.2 Notation

We introduce some notation:

• Let Φ_d be the development i-vector dataset.
• Let Φ_t = {l, r} be the test i-vectors.
• Let Φ be any of the previous datasets.
• Let θ_d be the labelling of the development dataset. It partitions the N_d i-vectors into M_d speakers.
• Let θ_t be the labelling of the test set, so that θ_t ∈ {T, N}, where T is the hypothesis that l and r belong to the same speaker and N is the hypothesis that they belong to different speakers.
• Let θ be any of the previous labellings.
• Let Φ_i be the i-vectors belonging to speaker i.
• Let φ_ij be the i-vector j of speaker i.
• Let Y_d be the speaker identity variables of the development set. We have as many identity variables as speakers.
• Let Y_t be the speaker identity variables of the test set.
• Let Y be any of the previous sets of speaker identity variables.
• Let d be the i-vector dimension.
• Let n_y be the speaker factor dimension.
• Let M = (µ, V, W) be the set of all model parameters.

3 Sufficient statistics

We define the sufficient statistics for speaker i. The zero-order statistic is the number of observations of speaker i, N_i. The first-order and second-order statistics are
\begin{align}
\mathbf{F}_i &= \sum_{j=1}^{N_i}\phi_{ij} \tag{4} \\
\mathbf{S}_i &= \sum_{j=1}^{N_i}\phi_{ij}\phi_{ij}^T \tag{5}
\end{align}
We define the centered statistics as
\begin{align}
\bar{\mathbf{F}}_i &= \mathbf{F}_i - N_i\mu \tag{6} \\
\bar{\mathbf{S}}_i &= \sum_{j=1}^{N_i}\left(\phi_{ij}-\mu\right)\left(\phi_{ij}-\mu\right)^T = \mathbf{S}_i - \mu\mathbf{F}_i^T - \mathbf{F}_i\mu^T + N_i\mu\mu^T \tag{7}
\end{align}

We define the global statistics
\begin{align}
N &= \sum_{i=1}^{M}N_i \tag{8} \\
\mathbf{F} &= \sum_{i=1}^{M}\mathbf{F}_i \tag{9} \\
\bar{\mathbf{F}} &= \sum_{i=1}^{M}\bar{\mathbf{F}}_i \tag{10} \\
\mathbf{S} &= \sum_{i=1}^{M}\mathbf{S}_i \tag{11} \\
\bar{\mathbf{S}} &= \sum_{i=1}^{M}\bar{\mathbf{S}}_i \tag{12}
\end{align}

4 Data conditional likelihood

The likelihood of the data given the hidden variables for speaker i is
\begin{align}
\ln P\left(\Phi_i|\mathbf{y}_i,\mu,\mathbf{V},\mathbf{W}\right) &= \sum_{j=1}^{N_i}\ln\mathcal{N}\left(\phi_{ij}|\mu+\mathbf{V}\mathbf{y}_i,\mathbf{W}^{-1}\right) \tag{13} \\
&= \frac{N_i}{2}\ln\left|\frac{\mathbf{W}}{2\pi}\right| - \frac{1}{2}\sum_{j=1}^{N_i}\left(\phi_{ij}-\mu-\mathbf{V}\mathbf{y}_i\right)^T\mathbf{W}\left(\phi_{ij}-\mu-\mathbf{V}\mathbf{y}_i\right) \tag{14} \\
&= \frac{N_i}{2}\ln\left|\frac{\mathbf{W}}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(\mathbf{W}\bar{\mathbf{S}}_i\right) + \mathbf{y}_i^T\mathbf{V}^T\mathbf{W}\bar{\mathbf{F}}_i - \frac{N_i}{2}\mathbf{y}_i^T\mathbf{V}^T\mathbf{W}\mathbf{V}\mathbf{y}_i \tag{15}
\end{align}
We can write this likelihood in another form:
\begin{align}
\ln P\left(\Phi_i|\mathbf{y}_i,\mu,\mathbf{V},\mathbf{W}\right) = \frac{N_i}{2}\ln\left|\frac{\mathbf{W}}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(\mathbf{W}\left(\mathbf{S}_i - 2\mathbf{F}_i\mu^T + N_i\mu\mu^T - 2\left(\mathbf{F}_i - N_i\mu\right)\mathbf{y}_i^T\mathbf{V}^T + N_i\mathbf{V}\mathbf{y}_i\mathbf{y}_i^T\mathbf{V}^T\right)\right) \tag{16}
\end{align}
We can also write this likelihood in terms of augmented variables if we define
\begin{align}
\tilde{\mathbf{y}}_i = \begin{bmatrix}\mathbf{y}_i \\ 1\end{bmatrix} \tag{17}, \quad \tilde{\mathbf{V}} = \begin{bmatrix}\mathbf{V} & \mu\end{bmatrix} \tag{18}
\end{align}
Then
\begin{align}
\ln P\left(\Phi_i|\mathbf{y}_i,\mu,\mathbf{V},\mathbf{W}\right) &= \sum_{j=1}^{N_i}\ln\mathcal{N}\left(\phi_{ij}|\tilde{\mathbf{V}}\tilde{\mathbf{y}}_i,\mathbf{W}^{-1}\right) \tag{19} \\
&= \frac{N_i}{2}\ln\left|\frac{\mathbf{W}}{2\pi}\right| - \frac{1}{2}\sum_{j=1}^{N_i}\left(\phi_{ij}-\tilde{\mathbf{V}}\tilde{\mathbf{y}}_i\right)^T\mathbf{W}\left(\phi_{ij}-\tilde{\mathbf{V}}\tilde{\mathbf{y}}_i\right) \tag{20} \\
&= \frac{N_i}{2}\ln\left|\frac{\mathbf{W}}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(\mathbf{W}\mathbf{S}_i\right) + \tilde{\mathbf{y}}_i^T\tilde{\mathbf{V}}^T\mathbf{W}\mathbf{F}_i - \frac{N_i}{2}\tilde{\mathbf{y}}_i^T\tilde{\mathbf{V}}^T\mathbf{W}\tilde{\mathbf{V}}\tilde{\mathbf{y}}_i \tag{21} \\
&= \frac{N_i}{2}\ln\left|\frac{\mathbf{W}}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(\mathbf{W}\left(\mathbf{S}_i - 2\mathbf{F}_i\tilde{\mathbf{y}}_i^T\tilde{\mathbf{V}}^T + N_i\tilde{\mathbf{V}}\tilde{\mathbf{y}}_i\tilde{\mathbf{y}}_i^T\tilde{\mathbf{V}}^T\right)\right) \tag{22}
\end{align}
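As a hedged illustration (not from the report), the form (15) can be evaluated directly from the centered statistics of Section 3 when W and V are point estimates; names below are illustrative.

```python
import numpy as np

def loglike_speaker(N_i, Sc_i, Fc_i, y_i, V, W):
    """Log-likelihood of speaker i in the form of eq. (15)."""
    d = W.shape[0]
    _, logdet_W = np.linalg.slogdet(W)
    ll = 0.5 * N_i * (logdet_W - d * np.log(2.0 * np.pi))   # (N_i/2) ln|W/2pi|
    ll -= 0.5 * np.trace(W @ Sc_i)                          # -1/2 tr(W S_i bar)
    ll += y_i @ V.T @ W @ Fc_i                              # y_i^T V^T W F_i bar
    ll -= 0.5 * N_i * y_i @ V.T @ W @ V @ y_i               # -(N_i/2) y_i^T V^T W V y_i
    return ll
```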

5 Variational inference with Gaussian-Gamma priors for V, Gaussian for µ and Wishart for W (informative and non-informative)

5.1 Model priors

We introduce a hierarchical prior P(V|α) over the matrix V governed by an n_y-dimensional vector of hyperparameters α, where n_y is the dimension of the factors. Each hyperparameter controls one of the columns of the matrix V through a conditional Gaussian distribution of the form:
\begin{align}
P\left(\mathbf{V}|\boldsymbol{\alpha}\right) = \prod_{q=1}^{n_y}\left(\frac{\alpha_q}{2\pi}\right)^{d/2}\exp\left(-\frac{1}{2}\alpha_q\mathbf{v}_q^T\mathbf{v}_q\right) \tag{23}
\end{align}
where v_q are the columns of V. Each α_q controls the inverse variance of the corresponding v_q. If a particular α_q has a posterior distribution concentrated at large values, the corresponding v_q will tend to be small, and that direction of the latent space will be effectively "switched off". We define a prior for α:
\begin{align}
P\left(\boldsymbol{\alpha}\right) = \prod_{q=1}^{n_y}\mathcal{G}\left(\alpha_q|a_\alpha,b_\alpha\right) \tag{24}
\end{align}
where G denotes the Gamma distribution. Bishop defines broad priors by setting a = b = 10^{-3}. We place a Gaussian prior on the mean µ:
\begin{align}
P\left(\mu\right) = \mathcal{N}\left(\mu|\mu_0,\mathrm{diag}\left(\boldsymbol{\beta}\right)^{-1}\right) \tag{25}
\end{align}
We will consider both the case where each dimension has a different precision and the case with isotropic precision (diag(β) = βI). Finally, we put a Wishart prior on W:
\begin{align}
P\left(\mathbf{W}\right) = \mathcal{W}\left(\mathbf{W}|\boldsymbol{\Psi}_0,\nu_0\right) \tag{26}
\end{align}
We can make the Wishart prior non-informative as in [4]:
\begin{align}
P\left(\mathbf{W}\right) &= \lim_{k\to 0}\mathcal{W}\left(\mathbf{W}|\mathbf{W}_0/k,k\right) \tag{27} \\
&= \alpha\left|\mathbf{W}\right|^{-(d+1)/2} \tag{28}
\end{align}

5.2 Variational distributions

We write the joint distribution of the observed and latent variables:
\begin{align}
P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}|\mu_0,\boldsymbol{\beta},a_\alpha,b_\alpha\right) = P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)P\left(\mathbf{Y}\right)P\left(\mathbf{V}|\boldsymbol{\alpha}\right)P\left(\boldsymbol{\alpha}|a,b\right)P\left(\mu|\mu_0,\boldsymbol{\beta}\right)P\left(\mathbf{W}\right) \tag{29}
\end{align}
In the following, the conditioning on (µ_0, β, a_α, b_α) will be dropped for convenience. Now, we consider the partition of the posterior:
\begin{align}
P\left(\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}|\Phi\right) \approx q\left(\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right) = q\left(\mathbf{Y}\right)\prod_{r=1}^{d}q\left(\tilde{\mathbf{v}}'_r\right)q\left(\mathbf{W}\right)q\left(\boldsymbol{\alpha}\right) \tag{30}
\end{align}
where ṽ'_r is a column vector containing the rth row of Ṽ. If W were a diagonal matrix, the factorization ∏_{r=1}^{d} q(ṽ'_r) would not be necessary because it arises naturally when solving the posterior. However, for a full-covariance W, the posterior of vec(Ṽ) is a Gaussian with a huge full covariance matrix. Therefore, we force the factorization to make the problem tractable.

The optimum for q*(Y):
\begin{align}
\ln q^*\left(\mathbf{Y}\right) &= \mathrm{E}_{\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{31} \\
&= \mathrm{E}_{\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \ln P\left(\mathbf{Y}\right) + \mathrm{const} \tag{32} \\
&= \sum_{i=1}^{M}\mathbf{y}_i^T\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\left(\mathbf{F}_i - N_i\mu\right)\right] - \frac{1}{2}\mathbf{y}_i^T\left(\mathbf{I} + N_i\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\mathbf{V}\right]\right)\mathbf{y}_i + \mathrm{const} \tag{33}
\end{align}
Therefore q*(Y) is a product of Gaussian distributions:
\begin{align}
q^*\left(\mathbf{Y}\right) &= \prod_{i=1}^{M}\mathcal{N}\left(\mathbf{y}_i|\bar{\mathbf{y}}_i,\mathbf{L}_{y_i}^{-1}\right) \tag{34} \\
\mathbf{L}_{y_i} &= \mathbf{I} + N_i\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\mathbf{V}\right] \tag{35} \\
\bar{\mathbf{y}}_i &= \mathbf{L}_{y_i}^{-1}\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\left(\mathbf{F}_i - N_i\mu\right)\right] \tag{36} \\
&= \mathbf{L}_{y_i}^{-1}\left(\mathrm{E}\left[\mathbf{V}\right]^T\mathrm{E}\left[\mathbf{W}\right]\mathbf{F}_i - N_i\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\mu\right]\right) \tag{37}
\end{align}
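The following is a minimal sketch (not the author's code) of the q*(y_i) update in (35)-(37), assuming the moments E[V], E[W], E[V^T W V] and E[V^T W µ] have been computed elsewhere (Section 5.2 gives the expressions); all names are illustrative.

```python
import numpy as np

def update_q_y(N_i, F_i, EV, EW, VtWV, VtWmu):
    """Posterior precision and mean of each speaker factor, eqs. (35)-(37)."""
    ny = VtWV.shape[0]
    y_mean, L_y = [], []
    for Ni, Fi in zip(N_i, F_i):
        L = np.eye(ny) + Ni * VtWV              # precision, eq. (35)
        rhs = EV.T @ EW @ Fi - Ni * VtWmu       # right-hand side of eq. (37)
        y_mean.append(np.linalg.solve(L, rhs))  # mean, eq. (36)
        L_y.append(L)
    return np.stack(y_mean), np.stack(L_y)
```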

The optimum for q*(ṽ'_r):
\begin{align}
\ln q^*\left(\tilde{\mathbf{v}}'_r\right) &= \mathrm{E}_{\mathbf{Y},\mathbf{W},\boldsymbol{\alpha},\tilde{\mathbf{v}}'_{s\neq r}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{38} \\
&= \mathrm{E}_{\mathbf{Y},\mathbf{W},\tilde{\mathbf{v}}'_{s\neq r}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \mathrm{E}_{\boldsymbol{\alpha},\mathbf{v}'_{s\neq r}}\left[\ln P\left(\mathbf{V}|\boldsymbol{\alpha}\right)\right] + \mathrm{E}_{\mu_{s\neq r}}\left[\ln P\left(\mu\right)\right] + \mathrm{const} \tag{39} \\
&= -\frac{1}{2}\mathrm{tr}\left(\mathrm{E}\left[\mathbf{W}\right]\left(-2\mathbf{C}\,\mathrm{E}_{\tilde{\mathbf{v}}'_{s\neq r}}\left[\tilde{\mathbf{V}}\right]^T + \mathrm{E}_{\tilde{\mathbf{v}}'_{s\neq r}}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right]\right)\right) - \frac{1}{2}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right]\mathrm{E}_{\mathbf{v}'_{s\neq r}}\left[\mathbf{v}_q^T\mathbf{v}_q\right] - \frac{1}{2}\beta_r\left(\mu_r-\mu_{0r}\right)^2 + \mathrm{const} \tag{40--41}
\end{align}
Keeping only the terms that depend on ṽ'_r and collecting the linear and quadratic parts gives
\begin{align}
\ln q^*\left(\tilde{\mathbf{v}}'_r\right) = \tilde{\mathbf{v}}_r'^T\left(\bar{w}_{rr}\mathbf{C}_r^T + \sum_{s\neq r}\bar{w}_{rs}\left(\mathbf{C}_s^T - \mathbf{R}_{\tilde{y}}\mathrm{E}\left[\tilde{\mathbf{v}}'_s\right]\right) + \beta_r\tilde{\mu}_{0r}\right) - \frac{1}{2}\tilde{\mathbf{v}}_r'^T\left(\mathrm{diag}\left(\tilde{\boldsymbol{\alpha}}_r\right) + \bar{w}_{rr}\mathbf{R}_{\tilde{y}}\right)\tilde{\mathbf{v}}'_r + \mathrm{const} \tag{42--46}
\end{align}
where w̄_rs is the element (r, s) of E[W],
\begin{align}
\mathbf{C} &= \sum_{i=1}^{M}\mathbf{F}_i\mathrm{E}\left[\tilde{\mathbf{y}}_i\right]^T \tag{47} \\
\mathbf{R}_{\tilde{y}} &= \sum_{i=1}^{M}N_i\mathrm{E}\left[\tilde{\mathbf{y}}_i\tilde{\mathbf{y}}_i^T\right] \tag{48} \\
\mathrm{E}\left[\tilde{\mathbf{y}}_i\tilde{\mathbf{y}}_i^T\right] &= \begin{bmatrix}\mathrm{E}\left[\mathbf{y}_i\mathbf{y}_i^T\right] & \mathrm{E}\left[\mathbf{y}_i\right] \\ \mathrm{E}\left[\mathbf{y}_i\right]^T & 1\end{bmatrix} \tag{49} \\
\tilde{\boldsymbol{\alpha}}_r &= \begin{bmatrix}\mathrm{E}\left[\boldsymbol{\alpha}\right] \\ \beta_r\end{bmatrix}, \quad \tilde{\mu}_{0r} = \begin{bmatrix}\mathbf{0}_{n_y\times 1} \\ \mu_{0r}\end{bmatrix} \tag{50}
\end{align}
and C_r is the rth row of C. Then q*(ṽ'_r) is a Gaussian distribution:
\begin{align}
q^*\left(\tilde{\mathbf{v}}'_r\right) &= \mathcal{N}\left(\tilde{\mathbf{v}}'_r|\bar{\tilde{\mathbf{v}}}'_r,\mathbf{L}^{-1}_{\tilde{V}_r}\right) \tag{51} \\
\mathbf{L}_{\tilde{V}_r} &= \mathrm{diag}\left(\tilde{\boldsymbol{\alpha}}_r\right) + \bar{w}_{rr}\mathbf{R}_{\tilde{y}} \tag{52} \\
\bar{\tilde{\mathbf{v}}}'_r &= \mathbf{L}^{-1}_{\tilde{V}_r}\left(\bar{w}_{rr}\mathbf{C}_r^T + \sum_{s\neq r}\bar{w}_{rs}\left(\mathbf{C}_s^T - \mathbf{R}_{\tilde{y}}\bar{\tilde{\mathbf{v}}}'_s\right) + \beta_r\tilde{\mu}_{0r}\right) \tag{53}
\end{align}
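A hedged sketch (assumptions: NumPy arrays, illustrative names) of the row-wise update (51)-(53): `C` is the d×(n_y+1) matrix of (47), `Ry` the matrix of (48), `W_bar` = E[W], and `V_rows` holds the current row means of Ṽ.

```python
import numpy as np

def update_q_V_row(r, V_rows, C, Ry, W_bar, alpha_tilde_r, beta_r, mu0_tilde_r):
    """One coordinate-ascent step for row r of V-tilde, eqs. (51)-(53)."""
    d = W_bar.shape[0]
    L = np.diag(alpha_tilde_r) + W_bar[r, r] * Ry                # precision, eq. (52)
    rhs = W_bar[r, r] * C[r] + beta_r * mu0_tilde_r
    for s in range(d):
        if s != r:
            rhs = rhs + W_bar[r, s] * (C[s] - Ry @ V_rows[s])    # cross-row terms in eq. (53)
    return np.linalg.solve(L, rhs), L
```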

The optimum for q*(α):
\begin{align}
\ln q^*\left(\boldsymbol{\alpha}\right) &= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{54} \\
&= \mathrm{E}_{\mathbf{V}}\left[\ln P\left(\mathbf{V}|\boldsymbol{\alpha}\right)\right] + \ln P\left(\boldsymbol{\alpha}|a_\alpha,b_\alpha\right) + \mathrm{const} \tag{55} \\
&= \sum_{q=1}^{n_y}\frac{d}{2}\ln\alpha_q - \frac{1}{2}\alpha_q\mathrm{E}\left[\mathbf{v}_q^T\mathbf{v}_q\right] + \left(a_\alpha-1\right)\ln\alpha_q - b_\alpha\alpha_q + \mathrm{const} \tag{56} \\
&= \sum_{q=1}^{n_y}\left(\frac{d}{2}+a_\alpha-1\right)\ln\alpha_q - \alpha_q\left(b_\alpha+\frac{1}{2}\mathrm{E}\left[\mathbf{v}_q^T\mathbf{v}_q\right]\right) + \mathrm{const} \tag{57--58}
\end{align}
Then q*(α) is a product of Gammas:
\begin{align}
q^*\left(\boldsymbol{\alpha}\right) &= \prod_{q=1}^{n_y}\mathcal{G}\left(\alpha_q|a'_\alpha,b'_{\alpha q}\right) \tag{59} \\
a'_\alpha &= a_\alpha + \frac{d}{2} \tag{60} \\
b'_{\alpha q} &= b_\alpha + \frac{1}{2}\mathrm{E}\left[\mathbf{v}_q^T\mathbf{v}_q\right] \tag{61}
\end{align}
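A short sketch (not from the report; names are illustrative) of the Gamma update (59)-(61), where `E_vq_sq` is an array holding E[v_q^T v_q] for each column q as in (79)-(80):

```python
def update_q_alpha(a_alpha, b_alpha, E_vq_sq, d):
    """ARD hyperparameter posterior, eqs. (59)-(61)."""
    a_prime = a_alpha + 0.5 * d           # eq. (60), shared by all columns
    b_prime = b_alpha + 0.5 * E_vq_sq     # eq. (61), one value per column
    E_alpha = a_prime / b_prime           # expectation used later, eq. (76)
    return a_prime, b_prime, E_alpha
```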

The optimum for q*(W) in the non-informative case:
\begin{align}
\ln q^*\left(\mathbf{W}\right) &= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\boldsymbol{\alpha}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{62} \\
&= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \ln P\left(\mathbf{W}\right) + \mathrm{const} \tag{63} \\
&= \frac{N}{2}\ln\left|\mathbf{W}\right| - \frac{d+1}{2}\ln\left|\mathbf{W}\right| - \frac{1}{2}\mathrm{tr}\left(\mathbf{W}\mathbf{K}\right) + \mathrm{const} \tag{64}
\end{align}
where
\begin{align}
\mathbf{K} &= \sum_{i=1}^{M}\mathrm{E}\left[\mathbf{S}_i - \mathbf{F}_i\tilde{\mathbf{y}}_i^T\tilde{\mathbf{V}}^T - \tilde{\mathbf{V}}\tilde{\mathbf{y}}_i\mathbf{F}_i^T + N_i\tilde{\mathbf{V}}\tilde{\mathbf{y}}_i\tilde{\mathbf{y}}_i^T\tilde{\mathbf{V}}^T\right] \tag{65} \\
&= \mathbf{S} - \mathbf{C}\mathrm{E}\left[\tilde{\mathbf{V}}\right]^T - \mathrm{E}\left[\tilde{\mathbf{V}}\right]\mathbf{C}^T + \mathrm{E}_{\tilde{\mathbf{V}}}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right] \tag{66}
\end{align}
Then q*(W) is Wishart distributed:
\begin{align}
q^*\left(\mathbf{W}\right) &= \mathcal{W}\left(\mathbf{W}|\boldsymbol{\Psi},\nu\right) \quad \text{if } \nu > d \tag{67} \\
\boldsymbol{\Psi}^{-1} &= \mathbf{K} \tag{68} \\
\nu &= N \tag{69}
\end{align}

The optimum for q*(W) in the informative case:
\begin{align}
\ln q^*\left(\mathbf{W}\right) &= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\boldsymbol{\alpha}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{70} \\
&= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \ln P\left(\mathbf{W}\right) + \mathrm{const} \tag{71} \\
&= \frac{N}{2}\ln\left|\mathbf{W}\right| + \frac{\nu_0-d-1}{2}\ln\left|\mathbf{W}\right| - \frac{1}{2}\mathrm{tr}\left(\mathbf{W}\left(\boldsymbol{\Psi}_0^{-1}+\mathbf{K}\right)\right) + \mathrm{const} \tag{72}
\end{align}
Then q*(W) is Wishart distributed:
\begin{align}
q^*\left(\mathbf{W}\right) &= \mathcal{W}\left(\mathbf{W}|\boldsymbol{\Psi},\nu\right) \tag{73} \\
\boldsymbol{\Psi}^{-1} &= \boldsymbol{\Psi}_0^{-1} + \mathbf{K} \tag{74} \\
\nu &= \nu_0 + N \tag{75}
\end{align}
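A minimal sketch (not from the report) of the Wishart update (73)-(75); setting the prior arguments to zero recovers the non-informative case (67)-(69). Names are illustrative.

```python
import numpy as np

def update_q_W(K, N, Psi0_inv=None, nu0=0.0):
    """Wishart posterior parameters of q(W), eqs. (67)-(69) or (73)-(75)."""
    Psi_inv = K if Psi0_inv is None else Psi0_inv + K   # eq. (74) (or (68))
    nu = nu0 + N                                        # eq. (75) (or (69))
    Psi = np.linalg.inv(Psi_inv)
    E_W = nu * Psi                                      # expectation used later, eq. (78)
    return Psi, nu, E_W
```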

Finally, we evaluate the expectations:
\begin{align}
\mathrm{E}\left[\alpha_q\right] &= \frac{a'_\alpha}{b'_{\alpha q}} \tag{76} \\
\bar{\tilde{\mathbf{V}}} &= \mathrm{E}\left[\tilde{\mathbf{V}}\right] = \begin{bmatrix}\bar{\tilde{\mathbf{v}}}_1'^T \\ \bar{\tilde{\mathbf{v}}}_2'^T \\ \vdots \\ \bar{\tilde{\mathbf{v}}}_d'^T\end{bmatrix} \tag{77} \\
\bar{\mathbf{W}} &= \mathrm{E}\left[\mathbf{W}\right] = \nu\boldsymbol{\Psi} \tag{78} \\
\mathrm{E}\left[\mathbf{v}_q^T\mathbf{v}_q\right] &= \sum_{r=1}^{d}\mathrm{E}\left[v'^2_{rq}\right] = \sum_{r=1}^{d}\left(\mathbf{L}^{-1}_{\tilde{V}_r}\right)_{qq} + \bar{v}'^2_{rq} \tag{79--80} \\
\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\mathbf{V}\right] &= \sum_{r=1}^{d}\sum_{s=1}^{d}\bar{w}_{rs}\mathrm{E}\left[\mathbf{v}'_r\mathbf{v}_s'^T\right] = \sum_{r=1}^{d}\bar{w}_{rr}\boldsymbol{\Sigma}_{V_r} + \bar{\mathbf{V}}^T\bar{\mathbf{W}}\bar{\mathbf{V}} \tag{81--84} \\
\mathrm{E}\left[\mathbf{V}^T\mathbf{W}\mu\right] &= \sum_{r=1}^{d}\bar{w}_{rr}\boldsymbol{\Sigma}_{V\mu_r} + \sum_{r=1}^{d}\sum_{s=1}^{d}\bar{w}_{rs}\bar{\mathbf{v}}'_r\bar{\mu}_s = \sum_{r=1}^{d}\bar{w}_{rr}\boldsymbol{\Sigma}_{V\mu_r} + \bar{\mathbf{V}}^T\bar{\mathbf{W}}\bar{\mu} \tag{85--86} \\
\mathrm{E}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right] &= \bar{\tilde{\mathbf{V}}}\mathbf{R}_{\tilde{y}}\bar{\tilde{\mathbf{V}}}^T + \mathrm{diag}\left(\boldsymbol{\rho}\right) \tag{87--93}
\end{align}
where v̄'_{rq} is the qth element of v̄'_r and ṽ'_{ir} is the rth element of ṽ'_i,
\begin{align}
\boldsymbol{\Sigma}_{\tilde{V}_r} &= \begin{bmatrix}\boldsymbol{\Sigma}_{V_r} & \boldsymbol{\Sigma}_{V\mu_r} \\ \boldsymbol{\Sigma}_{V\mu_r}^T & \Sigma_{\mu_r}\end{bmatrix} = \mathbf{L}^{-1}_{\tilde{V}_r} \tag{94} \\
\boldsymbol{\rho} &= \begin{bmatrix}\rho_1 & \rho_2 & \dots & \rho_d\end{bmatrix}^T, \quad \rho_i = \sum_{r}\sum_{s}\left(\mathbf{R}_{\tilde{y}}\circ\mathbf{L}^{-1}_{\tilde{V}_i}\right)_{rs} \tag{95--96}
\end{align}
and ◦ is the Hadamard product.

5.2.1 Distributions with deterministic annealing

If we use annealing with parameter κ, we have:
\begin{align}
q^*\left(\mathbf{Y}\right) &= \prod_{i=1}^{M}\mathcal{N}\left(\mathbf{y}_i|\bar{\mathbf{y}}_i,\frac{1}{\kappa}\mathbf{L}^{-1}_{y_i}\right) \tag{97} \\
q^*\left(\tilde{\mathbf{v}}'_r\right) &= \mathcal{N}\left(\tilde{\mathbf{v}}'_r|\bar{\tilde{\mathbf{v}}}'_r,\frac{1}{\kappa}\mathbf{L}^{-1}_{\tilde{V}_r}\right) \tag{98} \\
q^*\left(\mathbf{W}\right) &= \mathcal{W}\left(\mathbf{W}|\frac{1}{\kappa}\boldsymbol{\Psi},\kappa\left(\nu-d-1\right)+d+1\right) \quad \text{if } \kappa\left(\nu-d-1\right)+1>0 \tag{99} \\
q^*\left(\boldsymbol{\alpha}\right) &= \prod_{q=1}^{n_y}\mathcal{G}\left(\alpha_q|a'_\alpha,b'_{\alpha q}\right) \tag{100} \\
a'_\alpha &= \kappa\left(a_\alpha+\frac{d}{2}-1\right)+1 \tag{101} \\
b'_{\alpha q} &= \kappa\left(b_\alpha+\frac{1}{2}\mathrm{E}\left[\mathbf{v}_q^T\mathbf{v}_q\right]\right) \tag{102}
\end{align}

5.3 Variational lower bound

The lower bound is given by
\begin{align}
\mathcal{L} = &\mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \mathrm{E}_{\mathbf{Y}}\left[\ln P\left(\mathbf{Y}\right)\right] + \mathrm{E}_{\mathbf{V},\boldsymbol{\alpha}}\left[\ln P\left(\mathbf{V}|\boldsymbol{\alpha}\right)\right] + \mathrm{E}_{\boldsymbol{\alpha}}\left[\ln P\left(\boldsymbol{\alpha}\right)\right] + \mathrm{E}_{\mu}\left[\ln P\left(\mu\right)\right] + \mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] \nonumber \\
&- \mathrm{E}_{\mathbf{Y}}\left[\ln q\left(\mathbf{Y}\right)\right] - \mathrm{E}_{\tilde{\mathbf{V}}}\left[\ln q\left(\tilde{\mathbf{V}}\right)\right] - \mathrm{E}_{\boldsymbol{\alpha}}\left[\ln q\left(\boldsymbol{\alpha}\right)\right] - \mathrm{E}_{\mathbf{W}}\left[\ln q\left(\mathbf{W}\right)\right] \tag{103}
\end{align}
The term E_{Y,µ,V,W}[ln P(Φ|Y, µ, V, W)]:
\begin{align}
\mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] &= \frac{N}{2}\mathrm{E}\left[\ln\left|\mathbf{W}\right|\right] - \frac{Nd}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\bar{\mathbf{W}}\left(\mathbf{S} - 2\mathbf{C}\bar{\tilde{\mathbf{V}}}^T + \mathrm{E}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right]\right)\right) \tag{104} \\
&= \frac{N}{2}\overline{\ln W} - \frac{Nd}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\bar{\mathbf{W}}\mathbf{S}\right) - \frac{1}{2}\mathrm{tr}\left(-2\bar{\tilde{\mathbf{V}}}^T\bar{\mathbf{W}}\mathbf{C} + \mathrm{E}\left[\tilde{\mathbf{V}}^T\mathbf{W}\tilde{\mathbf{V}}\right]\mathbf{R}_{\tilde{y}}\right) \tag{105}
\end{align}
where
\begin{align}
\overline{\ln W} = \mathrm{E}\left[\ln\left|\mathbf{W}\right|\right] = \sum_{i=1}^{d}\psi\left(\frac{\nu+1-i}{2}\right) + d\ln 2 + \ln\left|\boldsymbol{\Psi}\right| \tag{106--107}
\end{align}
and ψ is the digamma function.
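As a small sketch (not from the report), (106)-(107) can be evaluated with SciPy's digamma; `Psi` and `nu` are the parameters of the Wishart q(W):

```python
import numpy as np
from scipy.special import digamma

def expected_logdet_W(Psi, nu):
    """E[ln |W|] for a Wishart q(W), eq. (107)."""
    d = Psi.shape[0]
    _, logdet_Psi = np.linalg.slogdet(Psi)
    return digamma(0.5 * (nu + 1 - np.arange(1, d + 1))).sum() + d * np.log(2.0) + logdet_Psi
```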

The term E_Y[ln P(Y)]:
\begin{align}
\mathrm{E}_{\mathbf{Y}}\left[\ln P\left(\mathbf{Y}\right)\right] &= -\frac{Mn_y}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\sum_{i=1}^{M}\mathrm{E}\left[\mathbf{y}_i\mathbf{y}_i^T\right]\right) \tag{108} \\
&= -\frac{Mn_y}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\mathbf{P}\right) \tag{109}
\end{align}
where
\begin{align}
\mathbf{P} = \sum_{i=1}^{M}\mathrm{E}\left[\mathbf{y}_i\mathbf{y}_i^T\right] \tag{110}
\end{align}
The term E_{V,α}[ln P(V|α)]:
\begin{align}
\mathrm{E}_{\mathbf{V},\boldsymbol{\alpha}}\left[\ln P\left(\mathbf{V}|\boldsymbol{\alpha}\right)\right] = -\frac{n_y d}{2}\ln(2\pi) + \frac{d}{2}\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] - \frac{1}{2}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right]\mathrm{E}\left[\mathbf{v}_q^T\mathbf{v}_q\right] \tag{111}
\end{align}
where
\begin{align}
\mathrm{E}\left[\ln\alpha_q\right] = \psi\left(a'_\alpha\right) - \ln b'_{\alpha q} \tag{112}
\end{align}
The term E_α[ln P(α)]:
\begin{align}
\mathrm{E}_{\boldsymbol{\alpha}}\left[\ln P\left(\boldsymbol{\alpha}\right)\right] &= n_y\left(a_\alpha\ln b_\alpha - \ln\Gamma\left(a_\alpha\right)\right) + \sum_{q=1}^{n_y}\left(a_\alpha-1\right)\mathrm{E}\left[\ln\alpha_q\right] - b_\alpha\mathrm{E}\left[\alpha_q\right] \tag{113} \\
&= n_y\left(a_\alpha\ln b_\alpha - \ln\Gamma\left(a_\alpha\right)\right) + \left(a_\alpha-1\right)\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] - b_\alpha\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right] \tag{114}
\end{align}
The term E_µ[ln P(µ)]:
\begin{align}
\mathrm{E}_{\mu}\left[\ln P\left(\mu\right)\right] &= -\frac{d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\beta_r - \frac{1}{2}\sum_{r=1}^{d}\beta_r\left(\mathrm{E}\left[\mu_r^2\right] - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2\right) \tag{115} \\
&= -\frac{d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\beta_r - \frac{1}{2}\sum_{r=1}^{d}\beta_r\left(\Sigma_{\mu_r} + \mathrm{E}\left[\mu_r\right]^2 - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2\right) \tag{116}
\end{align}

The term E_W[ln P(W)] for the non-informative case:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] = -\frac{d+1}{2}\overline{\ln W} \tag{117}
\end{align}
The term E_W[ln P(W)] for the informative case:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] = \ln B\left(\boldsymbol{\Psi}_0,\nu_0\right) + \frac{\nu_0-d-1}{2}\overline{\ln W} - \frac{\nu}{2}\mathrm{tr}\left(\boldsymbol{\Psi}_0^{-1}\boldsymbol{\Psi}\right) \tag{118}
\end{align}

The term E_Y[ln q(Y)]:
\begin{align}
\mathrm{E}_{\mathbf{Y}}\left[\ln q\left(\mathbf{Y}\right)\right] &= -\frac{Mn_y}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln\left|\mathbf{L}_{y_i}\right| - \frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(\mathbf{L}_{y_i}\mathrm{E}\left[\left(\mathbf{y}_i-\bar{\mathbf{y}}_i\right)\left(\mathbf{y}_i-\bar{\mathbf{y}}_i\right)^T\right]\right) \tag{119} \\
&= -\frac{Mn_y}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln\left|\mathbf{L}_{y_i}\right| - \frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(\mathbf{L}_{y_i}\left(\mathrm{E}\left[\mathbf{y}_i\mathbf{y}_i^T\right] - \bar{\mathbf{y}}_i\mathrm{E}\left[\mathbf{y}_i\right]^T - \mathrm{E}\left[\mathbf{y}_i\right]\bar{\mathbf{y}}_i^T + \bar{\mathbf{y}}_i\bar{\mathbf{y}}_i^T\right)\right) \tag{120} \\
&= -\frac{Mn_y}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln\left|\mathbf{L}_{y_i}\right| - \frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(\mathbf{I}\right) \tag{121} \\
&= -\frac{Mn_y}{2}\left(\ln(2\pi)+1\right) + \frac{1}{2}\sum_{i=1}^{M}\ln\left|\mathbf{L}_{y_i}\right| \tag{122}
\end{align}
The term E_Ṽ[ln q(Ṽ)]:
\begin{align}
\mathrm{E}_{\tilde{\mathbf{V}}}\left[\ln q\left(\tilde{\mathbf{V}}\right)\right] = -\frac{d\left(n_y+1\right)}{2}\left(\ln(2\pi)+1\right) + \frac{1}{2}\sum_{r=1}^{d}\ln\left|\mathbf{L}_{\tilde{V}_r}\right| \tag{123}
\end{align}
The term E_α[ln q(α)]:
\begin{align}
\mathrm{E}_{\boldsymbol{\alpha}}\left[\ln q\left(\boldsymbol{\alpha}\right)\right] &= -\sum_{q=1}^{n_y}\mathrm{H}\left[q\left(\alpha_q\right)\right] \tag{124} \\
&= \sum_{q=1}^{n_y}\left(a'_\alpha-1\right)\psi\left(a'_\alpha\right) + \ln b'_{\alpha q} - a'_\alpha - \ln\Gamma\left(a'_\alpha\right) \tag{125} \\
&= n_y\left(\left(a'_\alpha-1\right)\psi\left(a'_\alpha\right) - a'_\alpha - \ln\Gamma\left(a'_\alpha\right)\right) + \sum_{q=1}^{n_y}\ln b'_{\alpha q} \tag{126}
\end{align}
The term E_W[ln q(W)]:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln q\left(\mathbf{W}\right)\right] = -\mathrm{H}\left[q\left(\mathbf{W}\right)\right] = \ln B\left(\boldsymbol{\Psi},\nu\right) + \frac{\nu-d-1}{2}\overline{\ln W} - \frac{\nu d}{2} \tag{127--128}
\end{align}
where
\begin{align}
B\left(\mathbf{A},N\right) = \frac{\left|\mathbf{A}\right|^{-N/2}}{2^{Nd/2}Z_{Nd}}, \quad Z_{Nd} = \pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma\left(\frac{N+1-i}{2}\right) \tag{129--130}
\end{align}

5.4 Hyperparameter optimization

We can set the hyperparameters (µ_0, β, a_α, b_α) manually or estimate them from the development data by maximizing the lower bound. We derive for a_α:
\begin{align}
\frac{\partial\mathcal{L}}{\partial a_\alpha} = n_y\left(\ln b_\alpha - \psi\left(a_\alpha\right)\right) + \sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] = 0 \implies \psi\left(a_\alpha\right) = \ln b_\alpha + \frac{1}{n_y}\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] \tag{131--132}
\end{align}
We derive for b_α:
\begin{align}
\frac{\partial\mathcal{L}}{\partial b_\alpha} = \frac{n_y a_\alpha}{b_\alpha} - \sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right] = 0 \implies b_\alpha = \left(\frac{1}{n_y a_\alpha}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right]\right)^{-1} \tag{133--134}
\end{align}
We solve these equations with the procedure described in [5]. We write
\begin{align}
\psi(a) &= \ln b + c \tag{135} \\
b &= \frac{a}{d} \tag{136}
\end{align}
where
\begin{align}
c &= \frac{1}{n_y}\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] \tag{137} \\
d &= \frac{1}{n_y}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right] \tag{138}
\end{align}
Then
\begin{align}
f(a) = \psi(a) - \ln a + \ln d - c = 0 \tag{139}
\end{align}
We can solve for a using Newton-Raphson iterations:
\begin{align}
a^{\mathrm{new}} &= a - \frac{f(a)}{f'(a)} \tag{140} \\
&= a\left(1 - \frac{\psi(a) - \ln a + \ln d - c}{a\psi'(a) - 1}\right) \tag{141}
\end{align}
This algorithm does not assure that a remains positive. We can put a minimum value on a. Alternatively, we can solve the equation for ã such that a = exp(ã):
\begin{align}
\tilde{a}^{\mathrm{new}} = \tilde{a} - \frac{f(\tilde{a})}{f'(\tilde{a})} = \tilde{a} - \frac{\psi(a) - \ln a + \ln d - c}{\psi'(a)a - 1} \tag{142--143}
\end{align}
Taking the exponential of both sides:
\begin{align}
a^{\mathrm{new}} = a\exp\left(-\frac{\psi(a) - \ln a + \ln d - c}{\psi'(a)a - 1}\right) \tag{144}
\end{align}
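A hedged sketch (not from the report) of the log-domain Newton-Raphson update (144), using SciPy's digamma and trigamma; `c` and `dd` are the averages defined in (137)-(138), and the initial value and iteration count are arbitrary choices.

```python
import numpy as np
from scipy.special import digamma, polygamma

def solve_gamma_shape(c, dd, a0=1.0, n_iter=20):
    """Solve psi(a) - ln(a) + ln(dd) - c = 0 with the update of eq. (144)."""
    a = a0
    for _ in range(n_iter):
        num = digamma(a) - np.log(a) + np.log(dd) - c
        den = polygamma(1, a) * a - 1.0          # psi'(a) a - 1
        a = a * np.exp(-num / den)               # log-domain step keeps a > 0
    b = a / dd                                   # corresponding rate, eq. (136)
    return a, b
```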

We derive for µ_0:
\begin{align}
\frac{\partial\mathcal{L}}{\partial\mu_0} = 0 \implies \mu_0 = \mathrm{E}\left[\mu\right] \tag{145--146}
\end{align}
We derive for β:
\begin{align}
\frac{\partial\mathcal{L}}{\partial\boldsymbol{\beta}} = 0 \implies \beta_r^{-1} = \Sigma_{\mu_r} + \mathrm{E}\left[\mu_r\right]^2 - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2 \tag{147--148}
\end{align}
If we take an isotropic prior for µ:
\begin{align}
\beta^{-1} = \frac{1}{d}\sum_{r=1}^{d}\left(\Sigma_{\mu_r} + \mathrm{E}\left[\mu_r\right]^2 - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2\right) \tag{149}
\end{align}

5.5 Minimum divergence

We assume a more general prior for the hidden variables:
\begin{align}
P\left(\mathbf{y}\right) = \mathcal{N}\left(\mathbf{y}|\mu_y,\boldsymbol{\Lambda}_y^{-1}\right) \tag{150}
\end{align}
To minimize the divergence we maximize the part of L that depends on (µ_y, Λ_y):
\begin{align}
\mathcal{L}\left(\mu_y,\boldsymbol{\Lambda}_y\right) = \sum_{i=1}^{M}\mathrm{E}_{\mathbf{Y}}\left[\ln\mathcal{N}\left(\mathbf{y}_i|\mu_y,\boldsymbol{\Lambda}_y^{-1}\right)\right] \tag{151}
\end{align}
Then, we get
\begin{align}
\mu_y &= \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_{\mathbf{Y}}\left[\mathbf{y}_i\right] \tag{152} \\
\boldsymbol{\Sigma}_y = \boldsymbol{\Lambda}_y^{-1} &= \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_{\mathbf{Y}}\left[\left(\mathbf{y}_i-\mu_y\right)\left(\mathbf{y}_i-\mu_y\right)^T\right] = \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_{\mathbf{Y}}\left[\mathbf{y}_i\mathbf{y}_i^T\right] - \mu_y\mu_y^T \tag{153--154}
\end{align}
We have a transform y = φ(y') such that y' has a standard prior:
\begin{align}
\mathbf{y} = \mu_y + \left(\boldsymbol{\Sigma}_y^{1/2}\right)^T\mathbf{y}' \tag{155}
\end{align}
We can also write this as
\begin{align}
\tilde{\mathbf{y}} = \mathbf{J}\tilde{\mathbf{y}}' \tag{156}
\end{align}
where
\begin{align}
\mathbf{J} = \begin{bmatrix}\left(\boldsymbol{\Sigma}_y^{1/2}\right)^T & \mu_y \\ \mathbf{0}^T & 1\end{bmatrix} \tag{157}
\end{align}
Now, we modify q(ṽ'_r) such that, if we apply the transform y' = φ^{-1}(y), the term E[ln P(Φ|Y, µ, V, W)] of L remains constant:
\begin{align}
\bar{\tilde{\mathbf{v}}}'_r &\leftarrow \mathbf{J}^T\bar{\tilde{\mathbf{v}}}'_r \tag{158} \\
\mathbf{L}^{-1}_{\tilde{V}_r} &\leftarrow \mathbf{J}^T\mathbf{L}^{-1}_{\tilde{V}_r}\mathbf{J} \tag{159} \\
\mathbf{L}_{\tilde{V}_r} &\leftarrow \mathbf{G}^T\mathbf{L}_{\tilde{V}_r}\mathbf{G} \tag{160}
\end{align}
where
\begin{align}
\mathbf{G} = \left(\mathbf{J}^T\right)^{-1} = \begin{bmatrix}\left(\boldsymbol{\Sigma}_y^{1/2}\right)^{-1} & \mathbf{0} \\ -\mu_y^T\left(\boldsymbol{\Sigma}_y^{1/2}\right)^{-1} & 1\end{bmatrix} \tag{161--162}
\end{align}
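The following is a hedged sketch (not the author's code) of the minimum-divergence step of Section 5.5: estimate (µ_y, Σ_y) from the q(y_i) moments and fold the affine transform J of (157) into the rows of Ṽ as in (158)-(159). It assumes Σ_y^{1/2} is taken as a Cholesky factor; all names are illustrative.

```python
import numpy as np

def min_div_transform(y_mean, Eyy_sum, v_rows, Lv_inv_rows):
    """Minimum-divergence re-parameterization, eqs. (152)-(159)."""
    M = y_mean.shape[0]
    mu_y = y_mean.mean(axis=0)                        # eq. (152)
    Sigma_y = Eyy_sum / M - np.outer(mu_y, mu_y)      # eq. (154), Eyy_sum = sum_i E[y_i y_i^T]
    ny = mu_y.size
    J = np.zeros((ny + 1, ny + 1))
    J[:ny, :ny] = np.linalg.cholesky(Sigma_y).T       # (Sigma_y^{1/2})^T
    J[:ny, ny] = mu_y
    J[ny, ny] = 1.0                                   # eq. (157)
    new_rows = [J.T @ v for v in v_rows]              # eq. (158)
    new_Lv_inv = [J.T @ L @ J for L in Lv_inv_rows]   # eq. (159)
    return mu_y, Sigma_y, new_rows, new_Lv_inv
```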

6 Variational inference with Gaussian-Gamma priors for V, Gaussian for µ and Gamma for W

6.1 Model priors

˜ Then, to In section 5, we saw that if we use a full covariance W we had a full covariance posterior for V. ˜ get a tractable solution, we forced independence between the the rows of V when choosing the variational partition function. 13

In this section, we are going to assume that we have applied a rotation to the data such as we can consider that W is going to remain diagonal during the VB iteration. Then we are going to place a broad Gamma prior over each element of the diagonal of W: P (W) =

d Y

G (wrr |aw , bw )

(163)

r=1

We also consider the case of an isotropic W (W = wI). Then the prior is P (w) =G (w|aw , bw )

6.2

(164)

Variational distributions

We write the joint distribution of the observed and latent variables:
\begin{align}
P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}|\mu_0,\boldsymbol{\beta},a_\alpha,b_\alpha,a_w,b_w\right) = P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)P\left(\mathbf{Y}\right)P\left(\mathbf{V}|\boldsymbol{\alpha}\right)P\left(\boldsymbol{\alpha}|a,b\right)P\left(\mu|\mu_0,\boldsymbol{\beta}\right)P\left(\mathbf{W}|a_w,b_w\right) \tag{165}
\end{align}
In the following, the conditioning on (µ_0, β, a_α, b_α, a_w, b_w) will be dropped for convenience. Now, we consider the partition of the posterior:
\begin{align}
P\left(\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}|\Phi\right) \approx q\left(\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right) = q\left(\mathbf{Y}\right)q\left(\tilde{\mathbf{V}}\right)q\left(\mathbf{W}\right)q\left(\boldsymbol{\alpha}\right) \tag{166}
\end{align}
The optima for q*(Y) and q*(α) are the same as in section 5.2.

The optimum for q*(Ṽ):
\begin{align}
\ln q^*\left(\tilde{\mathbf{V}}\right) &= \mathrm{E}_{\mathbf{Y},\mathbf{W},\boldsymbol{\alpha}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{167} \\
&= \mathrm{E}_{\mathbf{Y},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \mathrm{E}_{\boldsymbol{\alpha}}\left[\ln P\left(\mathbf{V}|\boldsymbol{\alpha}\right)\right] + \ln P\left(\mu\right) + \mathrm{const} \tag{168} \\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{\mathbf{V}}^T\mathrm{E}\left[\mathbf{W}\right]\mathbf{C} + \tilde{\mathbf{V}}^T\mathrm{E}\left[\mathbf{W}\right]\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\right) - \frac{1}{2}\sum_{r=1}^{d}\mathbf{v}_r'^T\mathrm{diag}\left(\mathrm{E}\left[\boldsymbol{\alpha}\right]\right)\mathbf{v}'_r - \frac{1}{2}\sum_{r=1}^{d}\beta_r\left(\mu_r-\mu_{0r}\right)^2 + \mathrm{const} \tag{169} \\
&= -\frac{1}{2}\sum_{r=1}^{d}\mathrm{tr}\left(-2\tilde{\mathbf{v}}'_r\left(\bar{w}_{rr}\mathbf{C}_r + \beta_r\tilde{\mu}_{0r}^T\right) + \tilde{\mathbf{v}}'_r\tilde{\mathbf{v}}_r'^T\left(\mathrm{diag}\left(\tilde{\boldsymbol{\alpha}}_r\right) + \bar{w}_{rr}\mathbf{R}_{\tilde{y}}\right)\right) + \mathrm{const} \tag{170}
\end{align}
Then q*(Ṽ) is a product of Gaussian distributions:
\begin{align}
q^*\left(\tilde{\mathbf{V}}\right) &= \prod_{r=1}^{d}\mathcal{N}\left(\tilde{\mathbf{v}}'_r|\bar{\tilde{\mathbf{v}}}'_r,\mathbf{L}^{-1}_{\tilde{V}_r}\right) \tag{171} \\
\mathbf{L}_{\tilde{V}_r} &= \mathrm{diag}\left(\tilde{\boldsymbol{\alpha}}_r\right) + \bar{w}_{rr}\mathbf{R}_{\tilde{y}} \tag{172} \\
\bar{\tilde{\mathbf{v}}}'_r &= \mathbf{L}^{-1}_{\tilde{V}_r}\left(\bar{w}_{rr}\mathbf{C}_r^T + \beta_r\tilde{\mu}_{0r}\right) \tag{173}
\end{align}
The optimum for q*(W):

\begin{align}
\ln q^*\left(\mathbf{W}\right) &= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\boldsymbol{\alpha}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W},\boldsymbol{\alpha}\right)\right] + \mathrm{const} \tag{174} \\
&= \mathrm{E}_{\mathbf{Y},\mu,\mathbf{V}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \ln P\left(\mathbf{W}\right) + \mathrm{const} \tag{175} \\
&= \sum_{r=1}^{d}\frac{N}{2}\ln w_{rr} - \frac{1}{2}w_{rr}k_{rr} + \left(a_w-1\right)\ln w_{rr} - b_w w_{rr} + \mathrm{const} \tag{176} \\
&= \sum_{r=1}^{d}\left(a_w+\frac{N}{2}-1\right)\ln w_{rr} - \left(b_w+\frac{1}{2}k_{rr}\right)w_{rr} + \mathrm{const} \tag{177}
\end{align}
where
\begin{align}
\mathbf{K} = \mathrm{diag}\left(\mathbf{S} - \mathbf{C}\mathrm{E}\left[\tilde{\mathbf{V}}\right]^T - \mathrm{E}\left[\tilde{\mathbf{V}}\right]\mathbf{C}^T + \mathrm{E}_{\tilde{\mathbf{V}}}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right]\right) \tag{178}
\end{align}
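As a hedged sketch (not from the report), the per-dimension Gamma update that follows from (176)-(178) can be written as below, with `K_diag` holding the diagonal of K; names are illustrative.

```python
def update_q_W_diag(a_w, b_w, K_diag, N):
    """Gamma posterior of each diagonal precision w_rr (diagonal-W model)."""
    a_prime = a_w + 0.5 * N          # shared shape parameter
    b_prime = b_w + 0.5 * K_diag     # one rate per dimension
    E_w = a_prime / b_prime          # E[w_rr], used in the other updates
    return a_prime, b_prime, E_w
```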

Then q*(W) is a product of Gammas:
\begin{align}
q^*\left(\mathbf{W}\right) &= \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a'_w,b'_{w_r}\right) \tag{179} \\
a'_w &= a_w + \frac{N}{2} \tag{180} \\
b'_{w_r} &= b_w + \frac{1}{2}k_{rr} \tag{181}
\end{align}
If we force an isotropic W, the optimum q*(W) is
\begin{align}
\ln q^*\left(\mathbf{W}\right) &= \frac{Nd}{2}\ln w - \frac{1}{2}wk + \left(a_w-1\right)\ln w - b_w w + \mathrm{const} \tag{182} \\
&= \left(a_w+\frac{Nd}{2}-1\right)\ln w - \left(b_w+\frac{1}{2}k\right)w + \mathrm{const} \tag{183}
\end{align}
where
\begin{align}
k &= \mathrm{tr}\left(\mathbf{S} - \mathbf{C}\mathrm{E}\left[\tilde{\mathbf{V}}\right]^T - \mathrm{E}\left[\tilde{\mathbf{V}}\right]\mathbf{C}^T + \mathrm{E}_{\tilde{\mathbf{V}}}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right]\right) \tag{184} \\
&= \mathrm{tr}\left(\mathbf{S} - 2\mathbf{C}\mathrm{E}\left[\tilde{\mathbf{V}}\right]^T\right) + \mathrm{tr}\left(\mathrm{E}\left[\tilde{\mathbf{V}}^T\tilde{\mathbf{V}}\right]\mathbf{R}_{\tilde{y}}\right) \tag{185}
\end{align}
Then q*(W) is a Gamma distribution:
\begin{align}
q^*\left(\mathbf{W}\right) &= \mathcal{G}\left(w|a'_w,b'_w\right) \tag{186} \\
a'_w &= a_w + \frac{Nd}{2} \tag{187} \\
b'_w &= b_w + \frac{1}{2}k \tag{188}
\end{align}
Finally, we evaluate the expectations:
\begin{align}
\mathrm{E}\left[\tilde{\mathbf{V}}^T\tilde{\mathbf{V}}\right] = \sum_{r=1}^{d}\mathrm{E}\left[\tilde{\mathbf{v}}'_r\tilde{\mathbf{v}}_r'^T\right] = \sum_{r=1}^{d}\mathbf{L}^{-1}_{\tilde{V}_r} + \mathrm{E}\left[\tilde{\mathbf{V}}\right]^T\mathrm{E}\left[\tilde{\mathbf{V}}\right] \tag{189--191}
\end{align}

6.3 Variational lower bound

The lower bound is given by
\begin{align}
\mathcal{L} = &\mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \mathrm{E}_{\mathbf{Y}}\left[\ln P\left(\mathbf{Y}\right)\right] + \mathrm{E}_{\mathbf{V},\boldsymbol{\alpha}}\left[\ln P\left(\mathbf{V}|\boldsymbol{\alpha}\right)\right] + \mathrm{E}_{\boldsymbol{\alpha}}\left[\ln P\left(\boldsymbol{\alpha}\right)\right] + \mathrm{E}_{\mu}\left[\ln P\left(\mu\right)\right] + \mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] \nonumber \\
&- \mathrm{E}_{\mathbf{Y}}\left[\ln q\left(\mathbf{Y}\right)\right] - \mathrm{E}_{\tilde{\mathbf{V}}}\left[\ln q\left(\tilde{\mathbf{V}}\right)\right] - \mathrm{E}_{\boldsymbol{\alpha}}\left[\ln q\left(\boldsymbol{\alpha}\right)\right] - \mathrm{E}_{\mathbf{W}}\left[\ln q\left(\mathbf{W}\right)\right] \tag{192}
\end{align}
The term E_{Y,µ,V,W}[ln P(Φ|Y, µ, V, W)]:
\begin{align}
\mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] = \frac{N}{2}\sum_{r=1}^{d}\overline{\ln w}_{rr} - \frac{Nd}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\bar{\mathbf{W}}\mathbf{S}\right) - \frac{1}{2}\mathrm{tr}\left(-2\bar{\tilde{\mathbf{V}}}^T\bar{\mathbf{W}}\mathbf{C} + \mathrm{E}\left[\tilde{\mathbf{V}}^T\mathbf{W}\tilde{\mathbf{V}}\right]\mathbf{R}_{\tilde{y}}\right) \tag{193}
\end{align}
where
\begin{align}
\overline{\ln w}_{rr} = \mathrm{E}\left[\ln w_{rr}\right] = \psi\left(a'_w\right) - \ln b'_{w_r} \tag{194--195}
\end{align}
and ψ is the digamma function.

The term E_W[ln P(W)] with non-isotropic W:

\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] &= \sum_{r=1}^{d}a_w\ln b_w + \left(a_w-1\right)\mathrm{E}\left[\ln w_{rr}\right] - b_w\mathrm{E}\left[w_{rr}\right] - \ln\Gamma\left(a_w\right) \tag{196} \\
&= d\left(a_w\ln b_w - \ln\Gamma\left(a_w\right)\right) + \left(a_w-1\right)\sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] - b_w\sum_{r=1}^{d}\mathrm{E}\left[w_{rr}\right] \tag{197}
\end{align}
The term E_W[ln P(W)] with isotropic W:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] = a_w\ln b_w - \ln\Gamma\left(a_w\right) + \left(a_w-1\right)\mathrm{E}\left[\ln w\right] - b_w\mathrm{E}\left[w\right] \tag{198}
\end{align}
The term E_W[ln q(W)] with non-isotropic W:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln q\left(\mathbf{W}\right)\right] &= -\sum_{r=1}^{d}\mathrm{H}\left[q\left(w_{rr}\right)\right] \tag{199} \\
&= d\left(\left(a'_w-1\right)\psi\left(a'_w\right) - a'_w - \ln\Gamma\left(a'_w\right)\right) + \sum_{r=1}^{d}\ln b'_{w_r} \tag{200}
\end{align}
The term E_W[ln q(W)] with isotropic W:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln q\left(\mathbf{W}\right)\right] = -\mathrm{H}\left[q\left(w\right)\right] = \left(a'_w-1\right)\psi\left(a'_w\right) - a'_w - \ln\Gamma\left(a'_w\right) + \ln b'_w \tag{201--202}
\end{align}
The rest of the terms are the same as in section 5.3.

6.4 Hyperparameter optimization

We can estimate the parameters (a_w, b_w) from the development data by maximizing the lower bound. For non-isotropic W, we derive for a_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial a_w} = d\left(\ln b_w - \psi\left(a_w\right)\right) + \sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] = 0 \implies \psi\left(a_w\right) = \ln b_w + \frac{1}{d}\sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] \tag{203--204}
\end{align}
We derive for b_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial b_w} = \frac{da_w}{b_w} - \sum_{r=1}^{d}\mathrm{E}\left[w_{rr}\right] = 0 \implies b_w = \left(\frac{1}{da_w}\sum_{r=1}^{d}\mathrm{E}\left[w_{rr}\right]\right)^{-1} \tag{205--206}
\end{align}
For isotropic W, we derive for a_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial a_w} = \ln b_w - \psi\left(a_w\right) + \mathrm{E}\left[\ln w\right] = 0 \implies \psi\left(a_w\right) = \ln b_w + \mathrm{E}\left[\ln w\right] \tag{207--208}
\end{align}
We derive for b_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial b_w} = \frac{a_w}{b_w} - \mathrm{E}\left[w\right] = 0 \implies b_w = \left(\frac{1}{a_w}\mathrm{E}\left[w\right]\right)^{-1} \tag{209--210}
\end{align}
We can solve these equations by Newton-Raphson iterations as described in section 5.4.

7 Variational inference with full-covariance Gaussian priors for V and µ and Wishart for W

7.1 Model priors

Let us assume that we compute the posterior of the model parameters given a development database with a large amount of data. If we want to compute the model posterior for a small database, we can use the posterior given the large database as the prior.

Thus, we take a prior distribution for Ṽ:
\begin{align}
P\left(\tilde{\mathbf{V}}\right) = \prod_{r=1}^{d}\mathcal{N}\left(\tilde{\mathbf{v}}'_r|\tilde{\mathbf{v}}'_{0r},\mathbf{L}^{-1}_{\tilde{V}_{0r}}\right) \tag{211}
\end{align}
The prior for W is
\begin{align}
P\left(\mathbf{W}\right) = \mathcal{W}\left(\mathbf{W}|\boldsymbol{\Psi}_0,\nu_0\right) \tag{212}
\end{align}
The parameters ṽ'_{0r}, L^{-1}_{Ṽ_{0r}}, Ψ_0 and ν_0 are computed with the large dataset.

7.2 Variational distributions

The joint distribution of the observed and latent variables is:
\begin{align}
P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right) = P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)P\left(\mathbf{Y}\right)P\left(\tilde{\mathbf{V}}\right)P\left(\mathbf{W}\right) \tag{213}
\end{align}
Now, we consider the partition of the posterior:
\begin{align}
P\left(\mathbf{Y},\mu,\mathbf{V},\mathbf{W}|\Phi\right) \approx q\left(\mathbf{Y},\tilde{\mathbf{V}},\mathbf{W}\right) = q\left(\mathbf{Y}\right)\prod_{r=1}^{d}q\left(\tilde{\mathbf{v}}'_r\right)q\left(\mathbf{W}\right) \tag{214}
\end{align}
The optimum for q*(Y) is the same as in section 5.2.

The optimum for q*(ṽ'_r):
\begin{align}
\ln q^*\left(\tilde{\mathbf{v}}'_r\right) &= \mathrm{E}_{\mathbf{Y},\mathbf{W},\tilde{\mathbf{v}}'_{s\neq r}}\left[\ln P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \mathrm{const} \tag{215} \\
&= \mathrm{E}_{\mathbf{Y},\mathbf{W},\tilde{\mathbf{v}}'_{s\neq r}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \ln P\left(\tilde{\mathbf{V}}\right) + \mathrm{const} \tag{216} \\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{\mathbf{v}}'_r\left(\bar{w}_{rr}\mathbf{C}_r + \sum_{s\neq r}\bar{w}_{rs}\left(\mathbf{C}_s - \mathrm{E}\left[\tilde{\mathbf{v}}'_s\right]^T\mathbf{R}_{\tilde{y}}\right)\right) + \tilde{\mathbf{v}}'_r\tilde{\mathbf{v}}_r'^T\bar{w}_{rr}\mathbf{R}_{\tilde{y}}\right) - \frac{1}{2}\left(\tilde{\mathbf{v}}'_r-\tilde{\mathbf{v}}'_{0r}\right)^T\mathbf{L}_{\tilde{V}_{0r}}\left(\tilde{\mathbf{v}}'_r-\tilde{\mathbf{v}}'_{0r}\right) + \mathrm{const} \tag{217} \\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{\mathbf{v}}'_r\left(\tilde{\mathbf{v}}_{0r}'^T\mathbf{L}_{\tilde{V}_{0r}} + \bar{w}_{rr}\mathbf{C}_r + \sum_{s\neq r}\bar{w}_{rs}\left(\mathbf{C}_s - \mathrm{E}\left[\tilde{\mathbf{v}}'_s\right]^T\mathbf{R}_{\tilde{y}}\right)\right) + \tilde{\mathbf{v}}'_r\tilde{\mathbf{v}}_r'^T\left(\mathbf{L}_{\tilde{V}_{0r}} + \bar{w}_{rr}\mathbf{R}_{\tilde{y}}\right)\right) + \mathrm{const} \tag{218}
\end{align}
Then q*(ṽ'_r) is a Gaussian distribution:
\begin{align}
q^*\left(\tilde{\mathbf{v}}'_r\right) &= \mathcal{N}\left(\tilde{\mathbf{v}}'_r|\bar{\tilde{\mathbf{v}}}'_r,\mathbf{L}^{-1}_{\tilde{V}_r}\right) \tag{219} \\
\mathbf{L}_{\tilde{V}_r} &= \mathbf{L}_{\tilde{V}_{0r}} + \bar{w}_{rr}\mathbf{R}_{\tilde{y}} \tag{220} \\
\bar{\tilde{\mathbf{v}}}'_r &= \mathbf{L}^{-1}_{\tilde{V}_r}\left(\mathbf{L}_{\tilde{V}_{0r}}\tilde{\mathbf{v}}'_{0r} + \bar{w}_{rr}\mathbf{C}_r^T + \sum_{s\neq r}\bar{w}_{rs}\left(\mathbf{C}_s^T - \mathbf{R}_{\tilde{y}}\bar{\tilde{\mathbf{v}}}'_s\right)\right) \tag{221}
\end{align}

ln q ∗ (W) =EY,µ,V,α [ln P (Φ, Y, µ, V, W, α)] + const

where

=EY,µ,V [ln P (Φ|Y, µ, V, W)] + ln P (W) + const  N ν0 − d − 1 1 = ln |W| + ln |W| − tr W Ψ−1 + const 0 +K 2 2 2

(222) (223) (224)

h iT h i i h ˜ ˜ CT + E ˜ VR ˜T ˜ y˜ V K =S − CE V −E V V

(225)

P (W) =W (W|Ψ, ν)

(226)

=Ψ−1 0

(227)

Then q ∗ (W) is Wishart distributed:

−1

Ψ

+K

ν =ν0 + N 7.2.1

(228)

Distributions with deterministic annealing

If we use annealing, for a parameter κ, we have: q ∗ (W) =W (W|1/κ Ψ, κ(ν0 + N − d − 1) + d + 1)

7.3

if κ(ν0 + N − d − 1) + 1 > 0

(229)

Variational lower bound

The lower bound is given by h  i ˜ + EW [ln P (W)] L =EY,µ,V,W [ln P (Φ|Y, µ, V, W)] + EY [ln P (Y)] + EV ˜ ln P V h  i ˜ − EW [ln q (W)] − EY [ln q (Y)] − EV ˜ ln q V

18

(230)

h  i ˜ : The term EV ˜ ln P V

d h  i 1 X ny d ˜ ln(2π) + ln LV EV ln P V = − ˜ 0r ˜ 2 2 r=1 d



i h 1X  ′ ′ ′ T ′ tr LV ˜ 0r E (vr − v0r ) (vr − v0r ) 2 r=1

(231)

d

=−

ny d 1 X ln(2π) + ln LV ˜ 0r 2 2 r=1 d

− =−

  1X  −1 ′ ′T ′ ′T ′ ′T ′ ′T tr LV ˜ 0r LV ˜ r + vr vr − v0r vr − vr v0r + v0r v0r 2 r=1 d ny d 1 X ln(2π) + ln LV ˜ 0r 2 2 r=1 d



(232)

d

 1X 1X  −1 ′ ′ tr LV (v′ − v′0r )T LV ˜ 0r (vr − v0r ) ˜ 0r LV ˜r − 2 r=1 2 r=1 r

(233)

d

=−

1 X ny d ln(2π) + ln LV ˜ 0r 2 2 r=1 d



d

 1X   1X  −1 ′ ′ ′ ′ T tr LV tr LV ˜ 0r LV ˜ 0r (vr − v0r ) (vr − v0r ) ˜r − 2 r=1 2 r=1

(234)

The term EW [ln P (W)]: EW [ln P (W)] = ln B (Ψ0 , ν0 ) + where

 ν ν0 − d − 1 ln W − tr Ψ−1 0 Ψ 2 2

ln W =E [ln |W|]   d X ν +1−i ψ = + d ln 2 + ln |Ψ| 2 i=1

(235)

(236) (237)

and ψ is the digamma function. The term EW [ln q (W)] EW [ln q (W)] = − H [q (W)] = ln B (Ψ, ν) +

(238) ν −d−1 νd ln W − 2 2

(239)

The rest of terms are the same as the ones in section 5.3.

8 Variational inference with full-covariance Gaussian priors for V and µ and Gamma for W

8.1 Model priors

As in section 7, we take a prior distribution for Ṽ:
\begin{align}
P\left(\tilde{\mathbf{V}}\right) = \prod_{r=1}^{d}\mathcal{N}\left(\tilde{\mathbf{v}}'_r|\tilde{\mathbf{v}}'_{0r},\mathbf{L}^{-1}_{\tilde{V}_{0r}}\right) \tag{240}
\end{align}
The prior for non-isotropic W is
\begin{align}
P\left(\mathbf{W}\right) = \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a_w,b_{w_r}\right) \tag{241}
\end{align}
The prior for isotropic W is
\begin{align}
P\left(\mathbf{W}\right) = \mathcal{G}\left(w|a_w,b_w\right) \tag{242}
\end{align}
The parameters ṽ'_{0r}, L^{-1}_{Ṽ_{0r}}, a_w and b_w are computed with the large dataset.

8.2 Variational distributions

The joint distribution of the observed and latent variables is:
\begin{align}
P\left(\Phi,\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right) = P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)P\left(\mathbf{Y}\right)P\left(\tilde{\mathbf{V}}\right)P\left(\mathbf{W}\right) \tag{243}
\end{align}
Now, we consider the partition of the posterior:
\begin{align}
P\left(\mathbf{Y},\mu,\mathbf{V},\mathbf{W}|\Phi\right) \approx q\left(\mathbf{Y},\tilde{\mathbf{V}},\mathbf{W}\right) = q\left(\mathbf{Y}\right)q\left(\tilde{\mathbf{V}}\right)q\left(\mathbf{W}\right) \tag{244}
\end{align}
The optimum for q*(Y) is the same as in section 5.2.

The optimum for q*(Ṽ) is:
\begin{align}
q^*\left(\tilde{\mathbf{V}}\right) &= \prod_{r=1}^{d}\mathcal{N}\left(\tilde{\mathbf{v}}'_r|\bar{\tilde{\mathbf{v}}}'_r,\mathbf{L}^{-1}_{\tilde{V}_r}\right) \tag{245} \\
\mathbf{L}_{\tilde{V}_r} &= \mathbf{L}_{\tilde{V}_{0r}} + \bar{w}_{rr}\mathbf{R}_{\tilde{y}} \tag{246} \\
\bar{\tilde{\mathbf{v}}}'_r &= \mathbf{L}^{-1}_{\tilde{V}_r}\left(\mathbf{L}_{\tilde{V}_{0r}}\tilde{\mathbf{v}}'_{0r} + \bar{w}_{rr}\mathbf{C}_r^T\right) \tag{247}
\end{align}
The optimum for q*(W) for non-isotropic W:
\begin{align}
q^*\left(\mathbf{W}\right) &= \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a'_w,b'_{w_r}\right) \tag{248} \\
a'_w &= a_w + \frac{N}{2} \tag{249} \\
b'_{w_r} &= b_{w_r} + \frac{1}{2}k_{rr} \tag{250}
\end{align}
where
\begin{align}
\mathbf{K} = \mathrm{diag}\left(\mathbf{S} - \mathbf{C}\mathrm{E}\left[\tilde{\mathbf{V}}\right]^T - \mathrm{E}\left[\tilde{\mathbf{V}}\right]\mathbf{C}^T + \mathrm{E}_{\tilde{\mathbf{V}}}\left[\tilde{\mathbf{V}}\mathbf{R}_{\tilde{y}}\tilde{\mathbf{V}}^T\right]\right) \tag{251}
\end{align}
The optimum for q*(W) for isotropic W:
\begin{align}
q^*\left(\mathbf{W}\right) &= \mathcal{G}\left(w|a'_w,b'_w\right) \tag{252} \\
a'_w &= a_w + \frac{Nd}{2} \tag{253} \\
b'_w &= b_w + \frac{1}{2}k \tag{254}
\end{align}
where
\begin{align}
k = \mathrm{tr}\left(\mathbf{S} - 2\mathbf{C}\mathrm{E}\left[\tilde{\mathbf{V}}\right]^T\right) + \mathrm{tr}\left(\mathrm{E}\left[\tilde{\mathbf{V}}^T\tilde{\mathbf{V}}\right]\mathbf{R}_{\tilde{y}}\right) \tag{255}
\end{align}

8.3 Variational lower bound

The lower bound is given by
\begin{align}
\mathcal{L} = &\mathrm{E}_{\mathbf{Y},\mu,\mathbf{V},\mathbf{W}}\left[\ln P\left(\Phi|\mathbf{Y},\mu,\mathbf{V},\mathbf{W}\right)\right] + \mathrm{E}_{\mathbf{Y}}\left[\ln P\left(\mathbf{Y}\right)\right] + \mathrm{E}_{\tilde{\mathbf{V}}}\left[\ln P\left(\tilde{\mathbf{V}}\right)\right] + \mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] \nonumber \\
&- \mathrm{E}_{\mathbf{Y}}\left[\ln q\left(\mathbf{Y}\right)\right] - \mathrm{E}_{\tilde{\mathbf{V}}}\left[\ln q\left(\tilde{\mathbf{V}}\right)\right] - \mathrm{E}_{\mathbf{W}}\left[\ln q\left(\mathbf{W}\right)\right] \tag{256}
\end{align}
The term E_W[ln P(W)] with non-isotropic W:
\begin{align}
\mathrm{E}_{\mathbf{W}}\left[\ln P\left(\mathbf{W}\right)\right] &= \sum_{r=1}^{d}a_w\ln b_{w_r} + \left(a_w-1\right)\mathrm{E}\left[\ln w_{rr}\right] - b_{w_r}\mathrm{E}\left[w_{rr}\right] - \ln\Gamma\left(a_w\right) \tag{257} \\
&= -d\ln\Gamma\left(a_w\right) + a_w\sum_{r=1}^{d}\ln b_{w_r} + \left(a_w-1\right)\sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] - \sum_{r=1}^{d}b_{w_r}\mathrm{E}\left[w_{rr}\right] \tag{258}
\end{align}
The terms E_Y[ln P(Y)], E_Y[ln q(Y)] and E_Ṽ[ln q(Ṽ)] are the same as in section 5.3. The terms E_{Y,µ,V,W}[ln P(Φ|Y, µ, V, W)], E_W[ln P(W)] with isotropic W and E_W[ln q(W)] are the same as in section 6.3. The term E_Ṽ[ln P(Ṽ)] is the same as in section 7.3.

References

[1] Jesús Villalba, "SPLDA," Tech. Rep., University of Zaragoza, Zaragoza, 2011.

[2] Jesús Villalba and Niko Brummer, "Towards Fully Bayesian Speaker Recognition: Integrating Out the Between-Speaker Covariance," in Interspeech 2011, Florence, 2011, pp. 28–31.

[3] C. M. Bishop, "Variational principal components," in 9th International Conference on Artificial Neural Networks (ICANN 99), vol. 1, no. 470, pp. 509–514, 1999.

[4] Jesús Villalba, "Fully Bayesian Two-Covariance Model," Tech. Rep., University of Zaragoza, Zaragoza, Spain, 2010.

[5] Matthew J. Beal, "Variational Algorithms for Approximate Bayesian Inference," Ph.D. thesis, University College London, 2003.
