Bayesian SPLDA

Jesús Villalba
Communications Technology Group (GTC), Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain
[email protected]
May 15, 2012
1 Introduction
In this document we derive the equations needed to implement a variational Bayes (VB) estimation of the parameters of the SPLDA model [1]. This can be used to adapt SPLDA from one database to another when little development data is available, or to implement the fully Bayesian recipe [2]. Our approach is similar to Bishop's VB PPCA in [3].
2 The Model
2.1 SPLDA
SPLDA is a linear generative model represented in Figure 1.
[Figure 1: BN for the Bayesian SPLDA model, with nodes φij, yi and θij inside plates over the M speakers and the Ni i-vectors of each speaker, and parameter nodes µ, V (governed by the hyperparameter α) and W.]

An i-vector φij of speaker i can be written as
\[
\phi_{ij} = \mu + V y_i + \epsilon_{ij} \tag{1}
\]
where µ is a speaker-independent mean, V is the eigenvoices matrix, y_i is the speaker factor vector, and ε_ij is a channel offset. We assume the following priors for the variables:
\begin{align}
y_i &\sim \mathcal{N}\left(y_i|0, I\right) \tag{2}\\
\epsilon_{ij} &\sim \mathcal{N}\left(\epsilon_{ij}|0, W^{-1}\right) \tag{3}
\end{align}
where $\mathcal{N}$ denotes a Gaussian distribution and W is the within-class precision matrix.
2.2 Notation
We are going to introduce some notation:
• Let Φd be the development i-vector dataset.
• Let Φt = {l, r} be the test i-vectors.
• Let Φ be either of the previous datasets.
• Let θd be the labelling of the development dataset. It partitions the Nd i-vectors into Md speakers.
• Let θt be the labelling of the test set, so that θt ∈ {T, N}, where T is the hypothesis that l and r belong to the same speaker and N is the hypothesis that they belong to different speakers.
• Let θ be either of the previous labellings.
• Let Φi be the i-vectors belonging to speaker i.
• Let φij be the i-vector j of speaker i.
• Let Yd be the speaker identity variables of the development set. We have as many identity variables as speakers.
• Let Yt be the speaker identity variables of the test set.
• Let Y be either of the previous sets of speaker identity variables.
• Let d be the i-vector dimension.
• Let ny be the speaker factor dimension.
• Let M = (µ, V, W) be the set of all the parameters.
3 Sufficient statistics
We define the sufficient statistics for speaker i. The zero-order statistic is the number of observations of speaker i, N_i. The first-order and second-order statistics are
\begin{align}
F_i &= \sum_{j=1}^{N_i} \phi_{ij} \tag{4}\\
S_i &= \sum_{j=1}^{N_i} \phi_{ij}\phi_{ij}^T \tag{5}
\end{align}
We define the centered statistics as
\begin{align}
\bar{F}_i &= F_i - N_i\mu \tag{6}\\
\bar{S}_i &= \sum_{j=1}^{N_i}\left(\phi_{ij} - \mu\right)\left(\phi_{ij} - \mu\right)^T = S_i - \mu F_i^T - F_i\mu^T + N_i\mu\mu^T \tag{7}
\end{align}
We define the global statistics
\begin{align}
N &= \sum_{i=1}^{M} N_i \tag{8}\\
F &= \sum_{i=1}^{M} F_i \tag{9}\\
\bar{F} &= \sum_{i=1}^{M} \bar{F}_i \tag{10}\\
S &= \sum_{i=1}^{M} S_i \tag{11}\\
\bar{S} &= \sum_{i=1}^{M} \bar{S}_i \tag{12}
\end{align}
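For concreteness, a minimal Python/NumPy sketch of equations (4)-(12) follows; the function names and the spk_ids speaker-label encoding are illustrative assumptions, not part of the original report.

```python
import numpy as np

def sufficient_stats(Phi, spk_ids):
    """Per-speaker statistics N_i, F_i (eq. 4) and S_i (eq. 5).

    Phi     : (N, d) array of i-vectors.
    spk_ids : length-N integer array mapping each i-vector to a speaker.
    """
    M, d = spk_ids.max() + 1, Phi.shape[1]
    N_spk, F, S = np.zeros(M), np.zeros((M, d)), np.zeros((M, d, d))
    for i in range(M):
        Phi_i = Phi[spk_ids == i]
        N_spk[i] = len(Phi_i)
        F[i] = Phi_i.sum(axis=0)
        S[i] = Phi_i.T @ Phi_i
    return N_spk, F, S

def center_stats(N_spk, F, S, mu):
    """Centered statistics of eqs. (6)-(7); global stats (8)-(12) are sums over i."""
    Fc = F - N_spk[:, None] * mu
    Sc = S - np.einsum('id,e->ide', F, mu) - np.einsum('d,ie->ide', mu, F) \
         + N_spk[:, None, None] * np.outer(mu, mu)
    return Fc, Sc
```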
4 Data conditional likelihood
The likelihood of the data given the hidden variables for speaker i is
\begin{align}
\ln P(\Phi_i|y_i, \mu, V, W) &= \sum_{j=1}^{N_i}\ln\mathcal{N}\left(\phi_{ij}|\mu + Vy_i, W^{-1}\right) \tag{13}\\
&= \frac{N_i}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2}\sum_{j=1}^{N_i}\left(\phi_{ij} - \mu - Vy_i\right)^T W\left(\phi_{ij} - \mu - Vy_i\right) \tag{14}\\
&= \frac{N_i}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(W\bar{S}_i\right) + y_i^T V^T W\bar{F}_i - \frac{N_i}{2}y_i^T V^T WVy_i \tag{15}
\end{align}
We can write this likelihood in another form:
\[
\ln P(\Phi_i|y_i, \mu, V, W) = \frac{N_i}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(W\left(S_i - 2F_i\mu^T + N_i\mu\mu^T - 2\left(F_i - N_i\mu\right)y_i^T V^T + N_i Vy_iy_i^T V^T\right)\right) \tag{16}
\]
We can write it in yet another form if we define
\[
\tilde{y}_i = \begin{bmatrix} y_i \\ 1 \end{bmatrix}, \qquad \tilde{V} = \begin{bmatrix} V & \mu \end{bmatrix} \tag{17-18}
\]
Then
\begin{align}
\ln P(\Phi_i|y_i, \mu, V, W) &= \sum_{j=1}^{N_i}\ln\mathcal{N}\left(\phi_{ij}|\tilde{V}\tilde{y}_i, W^{-1}\right) \tag{19}\\
&= \frac{N_i}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2}\sum_{j=1}^{N_i}\left(\phi_{ij} - \tilde{V}\tilde{y}_i\right)^T W\left(\phi_{ij} - \tilde{V}\tilde{y}_i\right) \tag{20}\\
&= \frac{N_i}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(WS_i\right) + \tilde{y}_i^T\tilde{V}^T WF_i - \frac{N_i}{2}\tilde{y}_i^T\tilde{V}^T W\tilde{V}\tilde{y}_i \tag{21}\\
&= \frac{N_i}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2}\mathrm{tr}\left(W\left(S_i - 2F_i\tilde{y}_i^T\tilde{V}^T + N_i\tilde{V}\tilde{y}_i\tilde{y}_i^T\tilde{V}^T\right)\right) \tag{22}
\end{align}
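The per-speaker likelihood (15) can be evaluated directly from the centered statistics. A minimal sketch in the same NumPy style; all names are illustrative:

```python
import numpy as np

def loglik_speaker(Ni, Fc_i, Sc_i, y_i, V, W):
    """Data log-likelihood for one speaker, eq. (15).

    Ni, Fc_i, Sc_i : zero-order and centered first/second-order statistics (eqs. 6-7).
    y_i, V, W      : speaker factors, eigenvoice matrix, within-class precision.
    """
    d = W.shape[0]
    # ln|W/(2 pi)| = ln|W| - d ln(2 pi), via a Cholesky factor for stability
    logdet = 2 * np.log(np.diag(np.linalg.cholesky(W))).sum() - d * np.log(2 * np.pi)
    return (0.5 * Ni * logdet
            - 0.5 * np.trace(W @ Sc_i)
            + y_i @ V.T @ W @ Fc_i
            - 0.5 * Ni * y_i @ V.T @ W @ V @ y_i)
```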
5 Variational inference with Gaussian-Gamma priors for V, Gaussian for µ and Wishart for W (informative and non-informative)

5.1 Model priors
We introduce a hierarchical prior P(V|α) over the matrix V governed by an n_y-dimensional vector of hyperparameters, where n_y is the dimension of the factors. Each hyperparameter controls one of the columns of V through a conditional Gaussian distribution of the form
\[
P(V|\alpha) = \prod_{q=1}^{n_y}\left(\frac{\alpha_q}{2\pi}\right)^{d/2}\exp\left(-\frac{1}{2}\alpha_q v_q^T v_q\right) \tag{23}
\]
where the v_q are the columns of V. Each α_q controls the inverse variance of the corresponding v_q. If a particular α_q has a posterior distribution concentrated at large values, the corresponding v_q will tend to be small, and that direction of the latent space will be effectively "switched off". We define a prior for α:
\[
P(\alpha) = \prod_{q=1}^{n_y}\mathcal{G}\left(\alpha_q|a_\alpha, b_\alpha\right) \tag{24}
\]
where $\mathcal{G}$ denotes the Gamma distribution. Bishop defines broad priors by setting a_α = b_α = 10^{-3}. We place a Gaussian prior on the mean µ:
\[
P(\mu) = \mathcal{N}\left(\mu|\mu_0, \mathrm{diag}(\beta)^{-1}\right) \tag{25}
\]
We will consider the case where each dimension has a different precision and the case with isotropic precision (diag(β) = βI). Finally, we put a Wishart prior on W:
\[
P(W) = \mathcal{W}\left(W|\Psi_0, \nu_0\right) \tag{26}
\]
We can make the Wishart prior non-informative as in [4]:
\begin{align}
P(W) &= \lim_{k\to 0}\mathcal{W}\left(W|W_0/k, k\right) \tag{27}\\
&= \alpha\,|W|^{-(d+1)/2} \tag{28}
\end{align}

5.2 Variational distributions
We write the joint distribution of the observed and latent variables:
\[
P(\Phi, Y, \mu, V, W, \alpha|\mu_0, \beta, a_\alpha, b_\alpha) = P(\Phi|Y, \mu, V, W)\,P(Y)\,P(V|\alpha)\,P(\alpha|a_\alpha, b_\alpha)\,P(\mu|\mu_0, \beta)\,P(W) \tag{29}
\]
In the following, the conditioning on (µ_0, β, a_α, b_α) will be dropped for convenience. Now we consider the partition of the posterior:
\[
P(Y, \mu, V, W, \alpha|\Phi) \approx q(Y, \mu, V, W, \alpha) = q(Y)\prod_{r=1}^{d} q(\tilde{v}'_r)\,q(W)\,q(\alpha) \tag{30}
\]
where ṽ'_r is a column vector containing the rth row of Ṽ. If W were a diagonal matrix, the factorization $\prod_{r=1}^d q(\tilde{v}'_r)$ would not be necessary because it arises naturally when solving for the posterior. However, for full-covariance W, the posterior of vec(Ṽ) is a Gaussian with a huge full covariance matrix; therefore, we force the factorization to make the problem tractable.

The optimum for q*(Y):
\begin{align}
\ln q^*(Y) &= \mathrm{E}_{\mu,V,W,\alpha}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{31}\\
&= \mathrm{E}_{\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \ln P(Y) + \mathrm{const} \tag{32}\\
&= \sum_{i=1}^{M} y_i^T\,\mathrm{E}\left[V^T W\left(F_i - N_i\mu\right)\right] - \frac{1}{2}y_i^T\left(I + N_i\,\mathrm{E}\left[V^T WV\right]\right)y_i + \mathrm{const} \tag{33}
\end{align}
Therefore q*(Y) is a product of Gaussian distributions:
\begin{align}
q^*(Y) &= \prod_{i=1}^{M}\mathcal{N}\left(y_i|\bar{y}_i, L_{y_i}^{-1}\right) \tag{34}\\
L_{y_i} &= I + N_i\,\mathrm{E}\left[V^T WV\right] \tag{35}\\
\bar{y}_i &= L_{y_i}^{-1}\,\mathrm{E}\left[V^T W\left(F_i - N_i\mu\right)\right] \tag{36}\\
&= L_{y_i}^{-1}\left(\mathrm{E}\left[V\right]^T\mathrm{E}\left[W\right]F_i - N_i\,\mathrm{E}\left[V^T W\mu\right]\right) \tag{37}
\end{align}
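A sketch of the q(Y) update (35)-(37); the moment arguments (E_VtWV, E_VtWmu, ...) would come from the expectation formulas (76)-(96) below, and all names are illustrative:

```python
import numpy as np

def update_qY(N_spk, F, E_VtWV, E_V, E_W, E_VtWmu):
    """VB update of q(Y), eqs. (35)-(37).

    N_spk, F : zero- and first-order statistics.
    E_VtWV   : E[V^T W V]  (ny, ny).   E_V : E[V] (d, ny).
    E_W      : E[W] (d, d).            E_VtWmu : E[V^T W mu] (ny,).
    """
    M, ny = F.shape[0], E_VtWV.shape[0]
    y_bar, L_y = np.zeros((M, ny)), np.zeros((M, ny, ny))
    for i in range(M):
        L_y[i] = np.eye(ny) + N_spk[i] * E_VtWV              # eq. (35)
        rhs = E_V.T @ E_W @ F[i] - N_spk[i] * E_VtWmu        # eq. (37)
        y_bar[i] = np.linalg.solve(L_y[i], rhs)
    return y_bar, L_y
```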
The optimum for q*(ṽ'_r):
\begin{align}
\ln q^*(\tilde{v}'_r) &= \mathrm{E}_{Y,W,\alpha,\tilde{v}'_{s\neq r}}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{38}\\
&= \mathrm{E}_{Y,W,\tilde{v}'_{s\neq r}}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \mathrm{E}_{\alpha,v_{s\neq r}}\left[\ln P(V|\alpha)\right] + \mathrm{E}_{\mu_{s\neq r}}\left[\ln P(\mu)\right] + \mathrm{const} \tag{39}\\
&= -\frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(\mathrm{E}[W]\left(-2F_i\,\mathrm{E}[\tilde{y}_i]^T\,\mathrm{E}_{\tilde{v}'_{s\neq r}}[\tilde{V}]^T + N_i\,\mathrm{E}_{\tilde{v}'_{s\neq r}}\left[\tilde{V}\,\mathrm{E}[\tilde{y}_i\tilde{y}_i^T]\,\tilde{V}^T\right]\right)\right) \notag\\
&\quad - \frac{1}{2}\sum_{q=1}^{n_y}\mathrm{E}[\alpha_q]\,\mathrm{E}_{v_{s\neq r}}\left[v_q^T v_q\right] - \frac{1}{2}\beta_r\left(\mu_r - \mu_{0r}\right)^2 + \mathrm{const} \tag{40}\\
&= -\frac{1}{2}\mathrm{tr}\left(-2\,\mathrm{E}_{\tilde{v}'_{s\neq r}}[\tilde{V}]^T\,\mathrm{E}[W]\,C + \mathrm{E}_{\tilde{v}'_{s\neq r}}\left[\tilde{V}^T\,\mathrm{E}[W]\,\tilde{V}\right]R_{\tilde{y}}\right) \notag\\
&\quad - \frac{1}{2}v'^T_r\,\mathrm{diag}\left(\mathrm{E}[\alpha]\right)v'_r - \frac{1}{2}\beta_r\left(\mu_r - \mu_{0r}\right)^2 + \mathrm{const} \tag{41-42}\\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{v}'_r\left(\bar{w}_{rr}C_r + \sum_{s\neq r}\bar{w}_{rs}\left(C_s - \mathrm{E}[\tilde{v}'_s]^T R_{\tilde{y}}\right)\right) + \bar{w}_{rr}\tilde{v}'_r\tilde{v}'^T_r R_{\tilde{y}}\right) \notag\\
&\quad - \frac{1}{2}v'^T_r\,\mathrm{diag}\left(\mathrm{E}[\alpha]\right)v'_r - \frac{1}{2}\beta_r\left(\mu_r - \mu_{0r}\right)^2 + \mathrm{const} \tag{43-44}\\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{v}'_r\left(\bar{w}_{rr}C_r + \sum_{s\neq r}\bar{w}_{rs}\left(C_s - \mathrm{E}[\tilde{v}'_s]^T R_{\tilde{y}}\right) + \beta_r\tilde{\mu}_{0r}^T\right) + \tilde{v}'_r\tilde{v}'^T_r\left(\mathrm{diag}(\tilde{\alpha}_r) + \bar{w}_{rr}R_{\tilde{y}}\right)\right) + \mathrm{const} \tag{45-46}
\end{align}
where w̄_rs is the element r, s of E[W],
\begin{align}
C &= \sum_{i=1}^{M} F_i\,\mathrm{E}[\tilde{y}_i]^T \tag{47}\\
R_{\tilde{y}} &= \sum_{i=1}^{M} N_i\,\mathrm{E}\left[\tilde{y}_i\tilde{y}_i^T\right] \tag{48}\\
\mathrm{E}\left[\tilde{y}_i\tilde{y}_i^T\right] &= \begin{bmatrix} \mathrm{E}\left[y_iy_i^T\right] & \mathrm{E}[y_i] \\ \mathrm{E}[y_i]^T & 1 \end{bmatrix} \tag{49}\\
\tilde{\alpha}_r &= \begin{bmatrix} \mathrm{E}[\alpha] \\ \beta_r \end{bmatrix}, \qquad \tilde{\mu}_{0r} = \begin{bmatrix} 0_{n_y\times 1} \\ \mu_{0r} \end{bmatrix} \tag{50}
\end{align}
and C_r is the rth row of C. Then q*(ṽ'_r) is a Gaussian distribution:
\begin{align}
q^*(\tilde{v}'_r) &= \mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, L_{\tilde{V}_r}^{-1}\right) \tag{51}\\
L_{\tilde{V}_r} &= \mathrm{diag}\left(\tilde{\alpha}_r\right) + \bar{w}_{rr}R_{\tilde{y}} \tag{52}\\
\bar{\tilde{v}}'_r &= L_{\tilde{V}_r}^{-1}\left(\bar{w}_{rr}C_r^T + \sum_{s\neq r}\bar{w}_{rs}\left(C_s^T - R_{\tilde{y}}\bar{\tilde{v}}'_s\right) + \beta_r\tilde{\mu}_{0r}\right) \tag{53}
\end{align}
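A sketch of the coordinate update (52)-(53) for one row of Ṽ; names are illustrative, and the rows are meant to be swept r = 1..d repeatedly:

```python
import numpy as np

def update_qV_row(r, Vt_bar, C, R_y, E_W, E_alpha, beta, mu0):
    """VB update of one row of Vt = [V mu], eqs. (52)-(53).

    Vt_bar : (d, ny+1) current posterior mean of Vt; row r is overwritten.
    C, R_y : accumulators of eqs. (47)-(48).
    E_W, E_alpha : posterior means of W and alpha.
    beta, mu0    : prior precision and prior mean of mu, per dimension.
    """
    alpha_r = np.append(E_alpha, beta[r])                 # alpha~_r, eq. (50)
    mu0_r = np.append(np.zeros(len(E_alpha)), mu0[r])     # mu~_0r,  eq. (50)
    L_r = np.diag(alpha_r) + E_W[r, r] * R_y              # eq. (52)
    rhs = E_W[r, r] * C[r] + beta[r] * mu0_r              # eq. (53)
    for s in range(Vt_bar.shape[0]):
        if s != r:
            rhs += E_W[r, s] * (C[s] - R_y @ Vt_bar[s])
    Vt_bar[r] = np.linalg.solve(L_r, rhs)
    return L_r
```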
The optimum for q*(α):
\begin{align}
\ln q^*(\alpha) &= \mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{54}\\
&= \mathrm{E}_V\left[\ln P(V|\alpha)\right] + \ln P(\alpha|a_\alpha, b_\alpha) + \mathrm{const} \tag{55}\\
&= \sum_{q=1}^{n_y}\frac{d}{2}\ln\alpha_q - \frac{\alpha_q}{2}\mathrm{E}\left[v_q^T v_q\right] + \left(a_\alpha - 1\right)\ln\alpha_q - b_\alpha\alpha_q + \mathrm{const} \tag{56}\\
&= \sum_{q=1}^{n_y}\left(\frac{d}{2} + a_\alpha - 1\right)\ln\alpha_q - \left(b_\alpha + \frac{1}{2}\mathrm{E}\left[v_q^T v_q\right]\right)\alpha_q + \mathrm{const} \tag{57-58}
\end{align}
Then q*(α) is a product of Gammas:
\begin{align}
q^*(\alpha) &= \prod_{q=1}^{n_y}\mathcal{G}\left(\alpha_q|a'_\alpha, b'_{\alpha_q}\right) \tag{59}\\
a'_\alpha &= a_\alpha + \frac{d}{2} \tag{60}\\
b'_{\alpha_q} &= b_\alpha + \frac{1}{2}\mathrm{E}\left[v_q^T v_q\right] \tag{61}
\end{align}
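The q(α) update (60)-(61) only needs the second moment (79)-(80) evaluated below; an illustrative sketch:

```python
import numpy as np

def update_qalpha(a_alpha, b_alpha, V_bar, L_Vt_inv):
    """VB update of q(alpha), eqs. (60)-(61).

    a_alpha, b_alpha : prior Gamma parameters (e.g. 1e-3 for broad priors).
    V_bar    : (d, ny) posterior mean of V (first ny columns of Vt).
    L_Vt_inv : (d, ny+1, ny+1) posterior covariances of the rows of Vt.
    """
    d, ny = V_bar.shape
    # E[v_q^T v_q] = sum over rows of variance + squared mean, eqs. (79)-(80)
    E_vqvq = L_Vt_inv[:, :ny, :ny].diagonal(axis1=1, axis2=2).sum(axis=0) \
             + (V_bar ** 2).sum(axis=0)
    return a_alpha + 0.5 * d, b_alpha + 0.5 * E_vqvq
```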
The optimum for q*(W) in the non-informative case:
\begin{align}
\ln q^*(W) &= \mathrm{E}_{Y,\mu,V,\alpha}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{62}\\
&= \mathrm{E}_{Y,\mu,V}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \ln P(W) + \mathrm{const} \tag{63}\\
&= \frac{N}{2}\ln|W| - \frac{d+1}{2}\ln|W| - \frac{1}{2}\mathrm{tr}\left(WK\right) + \mathrm{const} \tag{64}
\end{align}
where
\begin{align}
K &= \sum_{i=1}^{M}\mathrm{E}\left[S_i - F_i\tilde{y}_i^T\tilde{V}^T - \tilde{V}\tilde{y}_iF_i^T + N_i\tilde{V}\tilde{y}_i\tilde{y}_i^T\tilde{V}^T\right] \tag{65}\\
&= S - C\,\mathrm{E}[\tilde{V}]^T - \mathrm{E}[\tilde{V}]\,C^T + \mathrm{E}_{\tilde{V}}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right] \tag{66}
\end{align}
Then q*(W) is Wishart distributed:
\begin{align}
q^*(W) &= \mathcal{W}\left(W|\Psi, \nu\right) \quad \text{if } \nu > d \tag{67}\\
\Psi^{-1} &= K \tag{68}\\
\nu &= N \tag{69}
\end{align}
The optimum for q*(W) in the informative case:
\begin{align}
\ln q^*(W) &= \mathrm{E}_{Y,\mu,V,\alpha}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{70}\\
&= \mathrm{E}_{Y,\mu,V}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \ln P(W) + \mathrm{const} \tag{71}\\
&= \frac{N}{2}\ln|W| + \frac{\nu_0 - d - 1}{2}\ln|W| - \frac{1}{2}\mathrm{tr}\left(W\left(\Psi_0^{-1} + K\right)\right) + \mathrm{const} \tag{72}
\end{align}
Then q*(W) is Wishart distributed:
\begin{align}
q^*(W) &= \mathcal{W}\left(W|\Psi, \nu\right) \tag{73}\\
\Psi^{-1} &= \Psi_0^{-1} + K \tag{74}\\
\nu &= \nu_0 + N \tag{75}
\end{align}
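Both Wishart updates can share one routine, since (68)-(69) are the Ψ0⁻¹ = 0, ν0 = 0 limit of (74)-(75). A sketch, with E[ṼRỹṼᵀ] computed as in (93)-(96); names are illustrative:

```python
import numpy as np

def update_qW(S, C, Vt_bar, L_Vt_inv, R_y, N, Psi0_inv=None, nu0=None):
    """VB update of q(W), eqs. (65)-(66) and (73)-(75).

    S, C, R_y : global statistics and accumulators (eqs. 11, 47, 48).
    Vt_bar, L_Vt_inv : row means and covariances of Vt.
    Psi0_inv, nu0    : Wishart prior; None selects the non-informative case.
    """
    # E[Vt R_y Vt^T] = Vt_bar R_y Vt_bar^T + diag(rho), eqs. (93)-(96)
    rho = np.array([(R_y * Li).sum() for Li in L_Vt_inv])
    E_VRV = Vt_bar @ R_y @ Vt_bar.T + np.diag(rho)
    K = S - C @ Vt_bar.T - Vt_bar @ C.T + E_VRV          # eq. (66)
    if Psi0_inv is None:                                 # eqs. (68)-(69)
        return np.linalg.inv(K), N
    return np.linalg.inv(Psi0_inv + K), nu0 + N          # eqs. (74)-(75)
```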
Finally, we evaluate the expectations:
\begin{align}
\mathrm{E}[\alpha_q] &= \frac{a'_\alpha}{b'_{\alpha_q}} \tag{76}\\
\bar{\tilde{V}} = \mathrm{E}[\tilde{V}] &= \begin{bmatrix} \bar{\tilde{v}}'^T_1 \\ \bar{\tilde{v}}'^T_2 \\ \vdots \\ \bar{\tilde{v}}'^T_d \end{bmatrix} \tag{77}\\
\bar{W} = \mathrm{E}[W] &= \nu\Psi \tag{78}\\
\mathrm{E}\left[v_q^T v_q\right] &= \sum_{r=1}^{d}\mathrm{E}\left[v'_{rq}v'_{rq}\right] = \sum_{r=1}^{d}\left(L_{\tilde{V}_r}^{-1}\right)_{qq} + \bar{v}'^2_{rq} \tag{79-80}\\
\mathrm{E}\left[V^T WV\right] &= \sum_{r=1}^{d}\sum_{s=1}^{d}\bar{w}_{rs}\,\mathrm{E}\left[v'_rv'^T_s\right] \tag{81}\\
&= \sum_{r=1}^{d}\bar{w}_{rr}\Sigma_{V_r} + \sum_{r=1}^{d}\sum_{s=1}^{d}\bar{w}_{rs}\bar{v}'_r\bar{v}'^T_s \tag{82-83}\\
&= \sum_{r=1}^{d}\bar{w}_{rr}\Sigma_{V_r} + \bar{V}^T\bar{W}\bar{V} \tag{84}\\
\mathrm{E}\left[V^T W\mu\right] &= \sum_{r=1}^{d}\bar{w}_{rr}\Sigma_{V\mu_r} + \sum_{r=1}^{d}\sum_{s=1}^{d}\bar{w}_{rs}\bar{v}'_r\bar{\mu}_s \tag{85}\\
&= \sum_{r=1}^{d}\bar{w}_{rr}\Sigma_{V\mu_r} + \bar{V}^T\bar{W}\bar{\mu} \tag{86}\\
\mathrm{E}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right] &= \sum_{r=1}^{n_y+1}\sum_{s=1}^{n_y+1} r_{\tilde{y}\,rs}\,\mathrm{E}\left[\tilde{v}_r\tilde{v}_s^T\right] \tag{87}\\
&= \sum_{r=1}^{n_y+1}\sum_{s=1}^{n_y+1} r_{\tilde{y}\,rs}\,\mathrm{E}[\tilde{v}_r]\,\mathrm{E}[\tilde{v}_s]^T + \mathrm{diag}(\rho) \tag{88-92}\\
&= \bar{\tilde{V}}R_{\tilde{y}}\bar{\tilde{V}}^T + \mathrm{diag}(\rho) \tag{93}
\end{align}
where ṽ_r is the rth column of Ṽ (so that ṽ_ri is its ith element and ṽ'_ir is the rth element of the row ṽ'_i), r_ỹrs is the element r, s of R_ỹ,
\begin{align}
\Sigma_{\tilde{V}_r} = L_{\tilde{V}_r}^{-1} &= \begin{bmatrix} \Sigma_{V_r} & \Sigma_{V\mu_r} \\ \Sigma_{V\mu_r}^T & \Sigma_{\mu_r} \end{bmatrix} \tag{94}\\
\rho &= \begin{pmatrix} \rho_1 & \rho_2 & \ldots & \rho_d \end{pmatrix}^T \tag{95}\\
\rho_i &= \sum_{r=1}^{n_y+1}\sum_{s=1}^{n_y+1}\left(R_{\tilde{y}}\circ L_{\tilde{V}_i}^{-1}\right)_{rs} \tag{96}
\end{align}
and ◦ is the Hadamard product.

5.2.1 Distributions with deterministic annealing
If we use annealing, for a parameter κ, we have:
\begin{align}
q^*(Y) &= \prod_{i=1}^{M}\mathcal{N}\left(y_i|\bar{y}_i, (1/\kappa)\,L_{y_i}^{-1}\right) \tag{97}\\
q^*(\tilde{v}'_r) &= \mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, (1/\kappa)\,L_{\tilde{V}_r}^{-1}\right) \tag{98}\\
q^*(W) &= \mathcal{W}\left(W|(1/\kappa)\Psi,\ \kappa(\nu - d - 1) + d + 1\right) \quad \text{if } \kappa(\nu - d - 1) + 1 > 0 \tag{99}\\
q^*(\alpha) &= \prod_{q=1}^{n_y}\mathcal{G}\left(\alpha_q|a'_\alpha, b'_{\alpha_q}\right) \tag{100}\\
a'_\alpha &= \kappa\left(a_\alpha + \frac{d}{2} - 1\right) + 1 \tag{101}\\
b'_{\alpha_q} &= \kappa\left(b_\alpha + \frac{1}{2}\mathrm{E}\left[v_q^T v_q\right]\right) \tag{102}
\end{align}

5.3 Variational lower bound
The lower bound is given by
\begin{align}
\mathcal{L} &= \mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \mathrm{E}_Y\left[\ln P(Y)\right] + \mathrm{E}_{V,\alpha}\left[\ln P(V|\alpha)\right] + \mathrm{E}_\alpha\left[\ln P(\alpha)\right] + \mathrm{E}_\mu\left[\ln P(\mu)\right] + \mathrm{E}_W\left[\ln P(W)\right] \notag\\
&\quad - \mathrm{E}_Y\left[\ln q(Y)\right] - \mathrm{E}_{\tilde{V}}\left[\ln q(\tilde{V})\right] - \mathrm{E}_\alpha\left[\ln q(\alpha)\right] - \mathrm{E}_W\left[\ln q(W)\right] \tag{103}
\end{align}
The term E_{Y,µ,V,W}[ln P(Φ|Y, µ, V, W)]:
\begin{align}
\mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] &= \frac{N}{2}\mathrm{E}\left[\ln|W|\right] - \frac{Nd}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\bar{W}\left(S - 2C\bar{\tilde{V}}^T + \mathrm{E}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right]\right)\right) \tag{104}\\
&= \frac{N}{2}\overline{\ln W} - \frac{Nd}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\bar{W}S\right) - \frac{1}{2}\mathrm{tr}\left(-2\bar{\tilde{V}}^T\bar{W}C + \mathrm{E}\left[\tilde{V}^T W\tilde{V}\right]R_{\tilde{y}}\right) \tag{105}
\end{align}
where
\begin{align}
\overline{\ln W} &= \mathrm{E}\left[\ln|W|\right] \tag{106}\\
&= \sum_{i=1}^{d}\psi\left(\frac{\nu + 1 - i}{2}\right) + d\ln 2 + \ln|\Psi| \tag{107}
\end{align}
and ψ is the digamma function. The term E_Y[ln P(Y)]:
\begin{align}
\mathrm{E}_Y\left[\ln P(Y)\right] &= -\frac{Mn_y}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\sum_{i=1}^{M}\mathrm{E}\left[y_iy_i^T\right]\right) \tag{108}\\
&= -\frac{Mn_y}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(P\right) \tag{109}
\end{align}
where
\[
P = \sum_{i=1}^{M}\mathrm{E}\left[y_iy_i^T\right] \tag{110}
\]
The term E_{V,α}[ln P(V|α)]:
\[
\mathrm{E}_{V,\alpha}\left[\ln P(V|\alpha)\right] = -\frac{n_yd}{2}\ln(2\pi) + \frac{d}{2}\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] - \frac{1}{2}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right]\mathrm{E}\left[v_q^T v_q\right] \tag{111}
\]
where
\[
\mathrm{E}\left[\ln\alpha_q\right] = \psi(a'_\alpha) - \ln b'_{\alpha_q} \tag{112}
\]
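The only non-trivial pieces of these bound terms are the expected log-determinants (107) and (112); a sketch using SciPy's digamma:

```python
import numpy as np
from scipy.special import digamma

def E_logdet_W(Psi, nu):
    """E[ln|W|] under a Wishart posterior, eq. (107)."""
    d = Psi.shape[0]
    i = np.arange(1, d + 1)
    logdet_Psi = 2 * np.log(np.diag(np.linalg.cholesky(Psi))).sum()
    return digamma((nu + 1 - i) / 2).sum() + d * np.log(2) + logdet_Psi

def E_log_alpha(a_prime, b_prime):
    """E[ln alpha_q] under the Gamma posterior, eq. (112)."""
    return digamma(a_prime) - np.log(b_prime)
```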
The term E_α[ln P(α)]:
\begin{align}
\mathrm{E}_\alpha\left[\ln P(\alpha)\right] &= n_y\left(a_\alpha\ln b_\alpha - \ln\Gamma(a_\alpha)\right) + \sum_{q=1}^{n_y}\left(a_\alpha - 1\right)\mathrm{E}\left[\ln\alpha_q\right] - b_\alpha\mathrm{E}\left[\alpha_q\right] \tag{113}\\
&= n_y\left(a_\alpha\ln b_\alpha - \ln\Gamma(a_\alpha)\right) + \left(a_\alpha - 1\right)\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] - b_\alpha\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right] \tag{114}
\end{align}
The term E_µ[ln P(µ)]:
\begin{align}
\mathrm{E}_\mu\left[\ln P(\mu)\right] &= -\frac{d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\beta_r - \frac{1}{2}\sum_{r=1}^{d}\beta_r\left(\mathrm{E}\left[\mu_r^2\right] - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2\right) \tag{115}\\
&= -\frac{d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\beta_r - \frac{1}{2}\sum_{r=1}^{d}\beta_r\left(\Sigma_{\mu_r} + \mathrm{E}\left[\mu_r\right]^2 - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2\right) \tag{116}
\end{align}
The term E_W[ln P(W)] for the non-informative case:
\[
\mathrm{E}_W\left[\ln P(W)\right] = -\frac{d+1}{2}\overline{\ln W} \tag{117}
\]
The term E_W[ln P(W)] for the informative case:
\[
\mathrm{E}_W\left[\ln P(W)\right] = \ln B\left(\Psi_0, \nu_0\right) + \frac{\nu_0 - d - 1}{2}\overline{\ln W} - \frac{\nu}{2}\mathrm{tr}\left(\Psi_0^{-1}\Psi\right) \tag{118}
\]
The term E_Y[ln q(Y)]:
\begin{align}
\mathrm{E}_Y\left[\ln q(Y)\right] &= -\frac{Mn_y}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln|L_{y_i}| - \frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(L_{y_i}\mathrm{E}\left[\left(y_i - \bar{y}_i\right)\left(y_i - \bar{y}_i\right)^T\right]\right) \tag{119}\\
&= -\frac{Mn_y}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln|L_{y_i}| - \frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(L_{y_i}\left(\mathrm{E}\left[y_iy_i^T\right] - \bar{y}_i\mathrm{E}\left[y_i\right]^T - \mathrm{E}\left[y_i\right]\bar{y}_i^T + \bar{y}_i\bar{y}_i^T\right)\right) \tag{120}\\
&= -\frac{Mn_y}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln|L_{y_i}| - \frac{1}{2}\sum_{i=1}^{M}\mathrm{tr}\left(I\right) \tag{121}\\
&= -\frac{Mn_y}{2}\left(\ln(2\pi) + 1\right) + \frac{1}{2}\sum_{i=1}^{M}\ln|L_{y_i}| \tag{122}
\end{align}
The term E_Ṽ[ln q(Ṽ)]:
\[
\mathrm{E}_{\tilde{V}}\left[\ln q(\tilde{V})\right] = -\frac{d(n_y+1)}{2}\left(\ln(2\pi) + 1\right) + \frac{1}{2}\sum_{r=1}^{d}\ln\left|L_{\tilde{V}_r}\right| \tag{123}
\]
The term E_α[ln q(α)]:
\begin{align}
\mathrm{E}_\alpha\left[\ln q(\alpha)\right] &= -\sum_{q=1}^{n_y}H\left[q(\alpha_q)\right] \tag{124}\\
&= \sum_{q=1}^{n_y}\left(a'_\alpha - 1\right)\psi(a'_\alpha) + \ln b'_{\alpha_q} - a'_\alpha - \ln\Gamma(a'_\alpha) \tag{125}\\
&= n_y\left(\left(a'_\alpha - 1\right)\psi(a'_\alpha) - a'_\alpha - \ln\Gamma(a'_\alpha)\right) + \sum_{q=1}^{n_y}\ln b'_{\alpha_q} \tag{126}
\end{align}
The term E_W[ln q(W)]:
\begin{align}
\mathrm{E}_W\left[\ln q(W)\right] &= -H\left[q(W)\right] \tag{127}\\
&= \ln B\left(\Psi, \nu\right) + \frac{\nu - d - 1}{2}\overline{\ln W} - \frac{\nu d}{2} \tag{128}
\end{align}
where
\begin{align}
B(A, N) &= |A|^{-N/2}\left(2^{Nd/2}Z_{N_d}\right)^{-1} \tag{129}\\
Z_{N_d} &= \pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma\left((N + 1 - i)/2\right) \tag{130}
\end{align}
5.4 Hyperparameter optimization

We can set the hyperparameters (µ_0, β, a_α, b_α) manually or estimate them from the development data by maximizing the lower bound. We derive for a_α:
\begin{align}
\frac{\partial\mathcal{L}}{\partial a_\alpha} = n_y\left(\ln b_\alpha - \psi(a_\alpha)\right) + \sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] = 0 \quad&\Longrightarrow \tag{131}\\
\psi(a_\alpha) = \ln b_\alpha + \frac{1}{n_y}\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] \tag{132}
\end{align}
We derive for b_α:
\begin{align}
\frac{\partial\mathcal{L}}{\partial b_\alpha} = \frac{n_ya_\alpha}{b_\alpha} - \sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right] = 0 \quad&\Longrightarrow \tag{133}\\
b_\alpha = \left(\frac{1}{n_ya_\alpha}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right]\right)^{-1} \tag{134}
\end{align}
We solve these coupled equations with the procedure described in [5]. We write
\begin{align}
\psi(a) &= \ln b + c \tag{135}\\
b &= \frac{a}{d} \tag{136}
\end{align}
where
\begin{align}
c &= \frac{1}{n_y}\sum_{q=1}^{n_y}\mathrm{E}\left[\ln\alpha_q\right] \tag{137}\\
d &= \frac{1}{n_y}\sum_{q=1}^{n_y}\mathrm{E}\left[\alpha_q\right] \tag{138}
\end{align}
Then
\[
f(a) = \psi(a) - \ln a + \ln d - c = 0 \tag{139}
\]
We can solve for a using Newton-Raphson iterations:
\begin{align}
a^{\mathrm{new}} &= a - \frac{f(a)}{f'(a)} \tag{140}\\
&= a\left(1 - \frac{\psi(a) - \ln a + \ln d - c}{a\psi'(a) - 1}\right) \tag{141}
\end{align}
This algorithm does not assure that a remains positive; we can enforce a minimum value for a. Alternatively, we can solve the equation for ã such that a = exp(ã):
\[
\tilde{a}^{\mathrm{new}} = \tilde{a} - \frac{f(\tilde{a})}{f'(\tilde{a})} = \tilde{a} - \frac{\psi(a) - \ln a + \ln d - c}{\psi'(a)a - 1} \tag{142-143}
\]
Taking exponentials on both sides:
\[
a^{\mathrm{new}} = a\exp\left(-\frac{\psi(a) - \ln a + \ln d - c}{\psi'(a)a - 1}\right) \tag{144}
\]
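A sketch of this log-domain Newton-Raphson solver (139)-(144); solve_gamma_shape and its arguments are illustrative names:

```python
import numpy as np
from scipy.special import digamma, polygamma

def solve_gamma_shape(c, d_mean, a0=1.0, n_iter=100, tol=1e-8):
    """Solve psi(a) - ln a + ln d - c = 0, eq. (139), via eq. (144).

    c      : mean of E[ln alpha_q], eq. (137).
    d_mean : mean of E[alpha_q],    eq. (138).
    Returns the shape a; the rate follows as b = a / d_mean, eq. (136).
    """
    a = a0
    for _ in range(n_iter):
        f = digamma(a) - np.log(a) + np.log(d_mean) - c
        step = f / (polygamma(1, a) * a - 1)    # eq. (143); polygamma(1,.) = psi'
        a_new = a * np.exp(-step)               # eq. (144), keeps a > 0
        if abs(a_new - a) < tol * a:
            return a_new
        a = a_new
    return a
```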
We derive for µ_0:
\[
\frac{\partial\mathcal{L}}{\partial\mu_0} = 0 \quad\Longrightarrow\quad \mu_0 = \mathrm{E}\left[\mu\right] \tag{145-146}
\]
We derive for β:
\[
\frac{\partial\mathcal{L}}{\partial\beta} = 0 \quad\Longrightarrow\quad \beta_r^{-1} = \Sigma_{\mu_r} + \mathrm{E}\left[\mu_r\right]^2 - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2 \tag{147-148}
\]
If we take an isotropic prior for µ:
\[
\beta^{-1} = \frac{1}{d}\sum_{r=1}^{d}\left(\Sigma_{\mu_r} + \mathrm{E}\left[\mu_r\right]^2 - 2\mu_{0r}\mathrm{E}\left[\mu_r\right] + \mu_{0r}^2\right) \tag{149}
\]
5.5 Minimum divergence
We assume a more general prior for the hidden variables:
\[
P(y) = \mathcal{N}\left(y|\mu_y, \Lambda_y^{-1}\right) \tag{150}
\]
To minimize the divergence, we maximize the part of L that depends on (µ_y, Λ_y):
\[
\mathcal{L}(\mu_y, \Lambda_y) = \sum_{i=1}^{M}\mathrm{E}_Y\left[\ln\mathcal{N}\left(y_i|\mu_y, \Lambda_y^{-1}\right)\right] \tag{151}
\]
Then, we get
\begin{align}
\mu_y &= \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_Y\left[y_i\right] \tag{152}\\
\Sigma_y = \Lambda_y^{-1} &= \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_Y\left[\left(y_i - \mu_y\right)\left(y_i - \mu_y\right)^T\right] \tag{153}\\
&= \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_Y\left[y_iy_i^T\right] - \mu_y\mu_y^T \tag{154}
\end{align}
We have a transform y = φ(y') such that y' has a standard prior:
\[
y = \mu_y + (\Sigma_y^{1/2})^T y' \tag{155}
\]
We can also write this as
\[
\tilde{y} = J\tilde{y}' \tag{156}
\]
where
\[
J = \begin{bmatrix} (\Sigma_y^{1/2})^T & \mu_y \\ 0^T & 1 \end{bmatrix} \tag{157}
\]
Now, we get q(ṽ'_r) such that, if we apply the transform y' = φ^{-1}(y), the term E[ln P(Φ|Y, W)] of L remains constant:
\begin{align}
\bar{\tilde{v}}'_r &\leftarrow J^T\bar{\tilde{v}}'_r \tag{158}\\
L_{\tilde{V}_r}^{-1} &\leftarrow J^TL_{\tilde{V}_r}^{-1}J \tag{159}\\
L_{\tilde{V}_r} &\leftarrow G\,L_{\tilde{V}_r}\,G^T \tag{160}
\end{align}
where
\[
G = \left(J^T\right)^{-1} = \begin{bmatrix} (\Sigma_y^{1/2})^{-1} & 0 \\ -\mu_y^T(\Sigma_y^{1/2})^{-1} & 1 \end{bmatrix} \tag{161-162}
\]
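A sketch of the full minimum-divergence step: estimate (µy, Σy) by (152)-(154), build J by (157), and apply (158)-(159) in place; the precision update (160) then follows by inversion. All names are illustrative:

```python
import numpy as np

def min_divergence_transform(y_bar, L_y_inv, Vt_bar, L_Vt_inv):
    """Minimum-divergence update, eqs. (152)-(159), applied in place.

    y_bar, L_y_inv   : posterior means/covariances of the speaker factors.
    Vt_bar, L_Vt_inv : row means/covariances of Vt = [V mu].
    """
    mu_y = y_bar.mean(axis=0)                                   # eq. (152)
    Sigma_y = (L_y_inv + np.einsum('ip,iq->ipq', y_bar, y_bar)).mean(axis=0) \
              - np.outer(mu_y, mu_y)                            # eq. (154)
    ny = len(mu_y)
    J = np.zeros((ny + 1, ny + 1))
    J[:ny, :ny] = np.linalg.cholesky(Sigma_y)   # (Sigma_y^{1/2})^T via Cholesky
    J[:ny, ny] = mu_y
    J[ny, ny] = 1.0                                             # eq. (157)
    Vt_bar[:] = Vt_bar @ J                      # eq. (158) applied to every row
    for r in range(L_Vt_inv.shape[0]):
        L_Vt_inv[r] = J.T @ L_Vt_inv[r] @ J                     # eq. (159)
```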
6 Variational inference with Gaussian-Gamma priors for V, Gaussian for µ and Gamma for W

6.1 Model priors
In section 5, we saw that if we use a full-covariance W, we obtain a full-covariance posterior for Ṽ. Then, to get a tractable solution, we forced independence between the rows of Ṽ when choosing the variational partition function.

In this section, we are going to assume that we have applied a rotation to the data such that we can consider that W remains diagonal during the VB iterations. Then we place a broad Gamma prior over each element of the diagonal of W:
\[
P(W) = \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a_w, b_w\right) \tag{163}
\]
We also consider the case of an isotropic W (W = wI). Then the prior is
\[
P(w) = \mathcal{G}\left(w|a_w, b_w\right) \tag{164}
\]

6.2 Variational distributions

We write the joint distribution of the observed and latent variables:
\[
P(\Phi, Y, \mu, V, W, \alpha|\mu_0, \beta, a_\alpha, b_\alpha, a_w, b_w) = P(\Phi|Y, \mu, V, W)\,P(Y)\,P(V|\alpha)\,P(\alpha|a_\alpha, b_\alpha)\,P(\mu|\mu_0, \beta)\,P(W|a_w, b_w) \tag{165}
\]
In the following, the conditioning on (µ_0, β, a_α, b_α, a_w, b_w) will be dropped for convenience. Now we consider the partition of the posterior:
\[
P(Y, \mu, V, W, \alpha|\Phi) \approx q(Y, \mu, V, W, \alpha) = q(Y)\,q(\tilde{V})\,q(W)\,q(\alpha) \tag{166}
\]
The optima for q*(Y) and q*(α) are the same as in section 5.2.
The optimum for q*(Ṽ):
\begin{align}
\ln q^*(\tilde{V}) &= \mathrm{E}_{Y,W,\alpha}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{167}\\
&= \mathrm{E}_{Y,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \mathrm{E}_\alpha\left[\ln P(V|\alpha)\right] + \ln P(\mu) + \mathrm{const} \tag{168}\\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{V}^T\mathrm{E}[W]C + \tilde{V}^T\mathrm{E}[W]\tilde{V}R_{\tilde{y}}\right) - \frac{1}{2}\sum_{r=1}^{d}v'^T_r\,\mathrm{diag}\left(\mathrm{E}[\alpha]\right)v'_r - \frac{1}{2}\sum_{r=1}^{d}\beta_r\left(\mu_r - \mu_{0r}\right)^2 + \mathrm{const} \tag{169}\\
&= -\frac{1}{2}\sum_{r=1}^{d}\mathrm{tr}\left(-2\tilde{v}'_r\left(\bar{w}_{rr}C_r + \beta_r\tilde{\mu}_{0r}^T\right) + \tilde{v}'_r\tilde{v}'^T_r\left(\mathrm{diag}(\tilde{\alpha}_r) + \bar{w}_{rr}R_{\tilde{y}}\right)\right) + \mathrm{const} \tag{170}
\end{align}
Then q*(Ṽ) is a product of Gaussian distributions:
\begin{align}
q^*(\tilde{V}) &= \prod_{r=1}^{d}\mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, L_{\tilde{V}_r}^{-1}\right) \tag{171}\\
L_{\tilde{V}_r} &= \mathrm{diag}\left(\tilde{\alpha}_r\right) + \bar{w}_{rr}R_{\tilde{y}} \notag\\
\bar{\tilde{v}}'_r &= L_{\tilde{V}_r}^{-1}\left(\bar{w}_{rr}C_r^T + \beta_r\tilde{\mu}_{0r}\right) \tag{172}
\end{align}
The optimum for q*(W):
\begin{align}
\ln q^*(W) &= \mathrm{E}_{Y,\mu,V,\alpha}\left[\ln P(\Phi, Y, \mu, V, W, \alpha)\right] + \mathrm{const} \tag{173}\\
&= \mathrm{E}_{Y,\mu,V}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \ln P(W) + \mathrm{const} \tag{174}\\
&= \sum_{r=1}^{d}\frac{N}{2}\ln w_{rr} - \frac{1}{2}w_{rr}k_{rr} + \left(a_w - 1\right)\ln w_{rr} - b_ww_{rr} + \mathrm{const} \tag{175-176}\\
&= \sum_{r=1}^{d}\left(a_w + \frac{N}{2} - 1\right)\ln w_{rr} - \left(b_w + \frac{1}{2}k_{rr}\right)w_{rr} + \mathrm{const} \tag{177}
\end{align}
where
\[
K = \mathrm{diag}\left(S - C\,\mathrm{E}[\tilde{V}]^T - \mathrm{E}[\tilde{V}]\,C^T + \mathrm{E}_{\tilde{V}}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right]\right) \tag{178}
\]
Then q*(W) is a product of Gammas:
\begin{align}
q^*(W) &= \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a'_w, b'_{w_r}\right) \tag{179}\\
a'_w &= a_w + \frac{N}{2} \tag{180}\\
b'_{w_r} &= b_w + \frac{1}{2}k_{rr} \tag{181}
\end{align}
If we force an isotropic W, the optimum q*(W) is
\begin{align}
\ln q^*(w) &= \frac{Nd}{2}\ln w - \frac{1}{2}wk + \left(a_w - 1\right)\ln w - b_ww + \mathrm{const} \tag{182}\\
&= \left(a_w + \frac{Nd}{2} - 1\right)\ln w - \left(b_w + \frac{1}{2}k\right)w + \mathrm{const} \tag{183}
\end{align}
where
\begin{align}
k &= \mathrm{tr}\left(S - C\,\mathrm{E}[\tilde{V}]^T - \mathrm{E}[\tilde{V}]\,C^T + \mathrm{E}_{\tilde{V}}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right]\right) \tag{184}\\
&= \mathrm{tr}\left(S - 2C\,\mathrm{E}[\tilde{V}]^T\right) + \mathrm{tr}\left(\mathrm{E}\left[\tilde{V}^T\tilde{V}\right]R_{\tilde{y}}\right) \tag{185}
\end{align}
Then q*(W) is a Gamma distribution:
\begin{align}
q^*(W) &= \mathcal{G}\left(w|a'_w, b'_w\right) \tag{186}\\
a'_w &= a_w + \frac{Nd}{2} \tag{187}\\
b'_w &= b_w + \frac{1}{2}k \tag{188}
\end{align}
Finally, we evaluate the expectations:
\begin{align}
\mathrm{E}\left[\tilde{V}^T\tilde{V}\right] &= \sum_{r=1}^{d}\mathrm{E}\left[\tilde{v}'_r\tilde{v}'^T_r\right] \tag{189}\\
&= \sum_{r=1}^{d}L_{\tilde{V}_r}^{-1} + \bar{\tilde{v}}'_r\bar{\tilde{v}}'^T_r \tag{190-191}
\end{align}
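A sketch of the diagonal and isotropic Gamma updates (179)-(188), reusing the E[ṼRỹṼᵀ] construction from the Wishart case; names are illustrative:

```python
import numpy as np

def update_qW_diag(S, C, Vt_bar, L_Vt_inv, R_y, N, a_w, b_w, isotropic=False):
    """VB update of a diagonal or isotropic q(W), eqs. (179)-(188).

    Arguments follow update_qW above; a_w, b_w are the Gamma prior parameters.
    Returns (a_prime, b_prime); E[w_rr] = a_prime / b_prime.
    """
    rho = np.array([(R_y * Li).sum() for Li in L_Vt_inv])
    E_VRV = Vt_bar @ R_y @ Vt_bar.T + np.diag(rho)
    k = np.diag(S - C @ Vt_bar.T - Vt_bar @ C.T + E_VRV)    # eq. (178)
    if isotropic:
        return a_w + 0.5 * N * len(k), b_w + 0.5 * k.sum()  # eqs. (187)-(188)
    return a_w + 0.5 * N, b_w + 0.5 * k                     # eqs. (180)-(181)
```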
6.3 Variational lower bound
The lower bound is given by
\begin{align}
\mathcal{L} &= \mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \mathrm{E}_Y\left[\ln P(Y)\right] + \mathrm{E}_{V,\alpha}\left[\ln P(V|\alpha)\right] + \mathrm{E}_\alpha\left[\ln P(\alpha)\right] + \mathrm{E}_\mu\left[\ln P(\mu)\right] + \mathrm{E}_W\left[\ln P(W)\right] \notag\\
&\quad - \mathrm{E}_Y\left[\ln q(Y)\right] - \mathrm{E}_{\tilde{V}}\left[\ln q(\tilde{V})\right] - \mathrm{E}_\alpha\left[\ln q(\alpha)\right] - \mathrm{E}_W\left[\ln q(W)\right] \tag{192}
\end{align}
The term E_{Y,µ,V,W}[ln P(Φ|Y, µ, V, W)]:
\[
\mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] = \frac{N}{2}\sum_{r=1}^{d}\overline{\ln w_{rr}} - \frac{Nd}{2}\ln(2\pi) - \frac{1}{2}\mathrm{tr}\left(\bar{W}S\right) - \frac{1}{2}\mathrm{tr}\left(-2\bar{\tilde{V}}^T\bar{W}C + \mathrm{E}\left[\tilde{V}^T W\tilde{V}\right]R_{\tilde{y}}\right) \tag{193}
\]
where
\begin{align}
\overline{\ln w_{rr}} &= \mathrm{E}\left[\ln w_{rr}\right] \tag{194}\\
&= \psi(a'_w) - \ln b'_{w_r} \tag{195}
\end{align}
and ψ is the digamma function.
The term E_W[ln P(W)] with non-isotropic W:
\begin{align}
\mathrm{E}_W\left[\ln P(W)\right] &= \sum_{r=1}^{d}a_w\ln b_w + \left(a_w - 1\right)\mathrm{E}\left[\ln w_{rr}\right] - b_w\mathrm{E}\left[w_{rr}\right] - \ln\Gamma(a_w) \tag{196}\\
&= d\left(a_w\ln b_w - \ln\Gamma(a_w)\right) + \left(a_w - 1\right)\sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] - b_w\sum_{r=1}^{d}\mathrm{E}\left[w_{rr}\right] \tag{197}
\end{align}
The term E_W[ln P(W)] with isotropic W:
\[
\mathrm{E}_W\left[\ln P(W)\right] = a_w\ln b_w - \ln\Gamma(a_w) + \left(a_w - 1\right)\mathrm{E}\left[\ln w\right] - b_w\mathrm{E}\left[w\right] \tag{198}
\]
The term E_W[ln q(W)] with non-isotropic W:
\begin{align}
\mathrm{E}_W\left[\ln q(W)\right] &= -\sum_{r=1}^{d}H\left[q(w_{rr})\right] \tag{199}\\
&= d\left(\left(a'_w - 1\right)\psi(a'_w) - a'_w - \ln\Gamma(a'_w)\right) + \sum_{r=1}^{d}\ln b'_{w_r} \tag{200}
\end{align}
The term E_W[ln q(W)] with isotropic W:
\begin{align}
\mathrm{E}_W\left[\ln q(W)\right] &= -H\left[q(w)\right] \tag{201}\\
&= \left(a'_w - 1\right)\psi(a'_w) - a'_w - \ln\Gamma(a'_w) + \ln b'_w \tag{202}
\end{align}
The rest of the terms are the same as in section 5.3.
6.4 Hyperparameter optimization
We can estimate the parameters (a_w, b_w) from the development data by maximizing the lower bound.
For non-isotropic W, we derive for a_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial a_w} = d\left(\ln b_w - \psi(a_w)\right) + \sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] = 0 \quad&\Longrightarrow \tag{203}\\
\psi(a_w) = \ln b_w + \frac{1}{d}\sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] \tag{204}
\end{align}
We derive for b_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial b_w} = \frac{da_w}{b_w} - \sum_{r=1}^{d}\mathrm{E}\left[w_{rr}\right] = 0 \quad&\Longrightarrow \tag{205}\\
b_w = \left(\frac{1}{da_w}\sum_{r=1}^{d}\mathrm{E}\left[w_{rr}\right]\right)^{-1} \tag{206}
\end{align}
For isotropic W, we derive for a_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial a_w} = \ln b_w - \psi(a_w) + \mathrm{E}\left[\ln w\right] = 0 \quad&\Longrightarrow \tag{207}\\
\psi(a_w) = \ln b_w + \mathrm{E}\left[\ln w\right] \tag{208}
\end{align}
We derive for b_w:
\begin{align}
\frac{\partial\mathcal{L}}{\partial b_w} = \frac{a_w}{b_w} - \mathrm{E}\left[w\right] = 0 \quad&\Longrightarrow \tag{209}\\
b_w = \left(\frac{1}{a_w}\mathrm{E}\left[w\right]\right)^{-1} \tag{210}
\end{align}
We can solve these equations by Newton-Raphson iterations as described in section 5.4.
7 Variational inference with full covariance Gaussian prior for V and µ and Wishart for W

7.1 Model priors
Let us assume that we compute the posterior of the model parameters given a development database with a large amount of data. If we then want to compute the model posterior for a small database, we can use the posterior given the large database as prior.
Thus, we take a prior distribution for Ṽ:
\[
P(\tilde{V}) = \prod_{r=1}^{d}\mathcal{N}\left(\tilde{v}'_r|v'_{0r}, L_{\tilde{V}_{0r}}^{-1}\right) \tag{211}
\]
The prior for W is
\[
P(W) = \mathcal{W}\left(W|\Psi_0, \nu_0\right) \tag{212}
\]
The parameters v'_{0r}, L_{Ṽ0r}^{-1}, Ψ_0 and ν_0 are computed on the large dataset.
7.2 Variational distributions

The joint distribution of the observed and latent variables:
\[
P(\Phi, Y, \mu, V, W) = P(\Phi|Y, \mu, V, W)\,P(Y)\,P(\tilde{V})\,P(W) \tag{213}
\]
Now, we consider the partition of the posterior:
\[
P(Y, \mu, V, W|\Phi) \approx q(Y, \tilde{V}, W) = q(Y)\prod_{r=1}^{d}q(\tilde{v}'_r)\,q(W) \tag{214}
\]
The optimum for q*(Y) is the same as in section 5.2. The optimum for q*(ṽ'_r):
\begin{align}
\ln q^*(\tilde{v}'_r) &= \mathrm{E}_{Y,W,\tilde{v}'_{s\neq r}}\left[\ln P(\Phi, Y, \mu, V, W)\right] + \mathrm{const} \tag{215}\\
&= \mathrm{E}_{Y,W,\tilde{v}'_{s\neq r}}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \ln P(\tilde{V}) + \mathrm{const} \tag{216}\\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{v}'_r\left(\bar{w}_{rr}C_r + \sum_{s\neq r}\bar{w}_{rs}\left(C_s - \mathrm{E}[\tilde{v}'_s]^TR_{\tilde{y}}\right)\right) + \bar{w}_{rr}\tilde{v}'_r\tilde{v}'^T_rR_{\tilde{y}}\right) \notag\\
&\quad - \frac{1}{2}\left(\tilde{v}'_r - v'_{0r}\right)^TL_{\tilde{V}_{0r}}\left(\tilde{v}'_r - v'_{0r}\right) + \mathrm{const} \tag{217}\\
&= -\frac{1}{2}\mathrm{tr}\left(-2\tilde{v}'_r\left(v'^T_{0r}L_{\tilde{V}_{0r}} + \bar{w}_{rr}C_r + \sum_{s\neq r}\bar{w}_{rs}\left(C_s - \mathrm{E}[\tilde{v}'_s]^TR_{\tilde{y}}\right)\right) + \tilde{v}'_r\tilde{v}'^T_r\left(L_{\tilde{V}_{0r}} + \bar{w}_{rr}R_{\tilde{y}}\right)\right) + \mathrm{const} \tag{218}
\end{align}
Then q*(ṽ'_r) is a Gaussian distribution:
\begin{align}
q^*(\tilde{v}'_r) &= \mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, L_{\tilde{V}_r}^{-1}\right) \tag{219}\\
L_{\tilde{V}_r} &= L_{\tilde{V}_{0r}} + \bar{w}_{rr}R_{\tilde{y}} \tag{220}\\
\bar{\tilde{v}}'_r &= L_{\tilde{V}_r}^{-1}\left(L_{\tilde{V}_{0r}}v'_{0r} + \bar{w}_{rr}C_r^T + \sum_{s\neq r}\bar{w}_{rs}\left(C_s^T - R_{\tilde{y}}\bar{\tilde{v}}'_s\right)\right) \tag{221}
\end{align}
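A sketch of the prior-informed row update (220)-(221); it mirrors update_qV_row above, with the (v'_0r, L_Ṽ0r) prior replacing the Gaussian-Gamma terms. Names are illustrative:

```python
import numpy as np

def update_qV_row_prior(r, Vt_bar, C, R_y, E_W, V0, L_V0):
    """Row update of Vt with an informative Gaussian prior, eqs. (220)-(221).

    V0   : (d, ny+1) prior means v'_0r, from the large dataset.
    L_V0 : (d, ny+1, ny+1) prior precisions L_V0r.
    Remaining arguments are as in update_qV_row above.
    """
    L_r = L_V0[r] + E_W[r, r] * R_y               # eq. (220)
    rhs = L_V0[r] @ V0[r] + E_W[r, r] * C[r]      # eq. (221)
    for s in range(Vt_bar.shape[0]):
        if s != r:
            rhs += E_W[r, s] * (C[s] - R_y @ Vt_bar[s])
    Vt_bar[r] = np.linalg.solve(L_r, rhs)
    return L_r
```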
The optimum for q*(W):
\begin{align}
\ln q^*(W) &= \mathrm{E}_{Y,\mu,V}\left[\ln P(\Phi, Y, \mu, V, W)\right] + \mathrm{const} \tag{222}\\
&= \mathrm{E}_{Y,\mu,V}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \ln P(W) + \mathrm{const} \tag{223}\\
&= \frac{N}{2}\ln|W| + \frac{\nu_0 - d - 1}{2}\ln|W| - \frac{1}{2}\mathrm{tr}\left(W\left(\Psi_0^{-1} + K\right)\right) + \mathrm{const} \tag{224}
\end{align}
where
\[
K = S - C\,\mathrm{E}[\tilde{V}]^T - \mathrm{E}[\tilde{V}]\,C^T + \mathrm{E}_{\tilde{V}}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right] \tag{225}
\]
Then q*(W) is Wishart distributed:
\begin{align}
q^*(W) &= \mathcal{W}\left(W|\Psi, \nu\right) \tag{226}\\
\Psi^{-1} &= \Psi_0^{-1} + K \tag{227}\\
\nu &= \nu_0 + N \tag{228}
\end{align}

7.2.1 Distributions with deterministic annealing
If we use annealing, for a parameter κ, we have:
\[
q^*(W) = \mathcal{W}\left(W|(1/\kappa)\Psi,\ \kappa(\nu_0 + N - d - 1) + d + 1\right) \quad\text{if } \kappa(\nu_0 + N - d - 1) + 1 > 0 \tag{229}
\]

7.3 Variational lower bound
The lower bound is given by
\begin{align}
\mathcal{L} &= \mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \mathrm{E}_Y\left[\ln P(Y)\right] + \mathrm{E}_{\tilde{V}}\left[\ln P(\tilde{V})\right] + \mathrm{E}_W\left[\ln P(W)\right] \notag\\
&\quad - \mathrm{E}_Y\left[\ln q(Y)\right] - \mathrm{E}_{\tilde{V}}\left[\ln q(\tilde{V})\right] - \mathrm{E}_W\left[\ln q(W)\right] \tag{230}
\end{align}
The term E_Ṽ[ln P(Ṽ)]:
\begin{align}
\mathrm{E}_{\tilde{V}}\left[\ln P(\tilde{V})\right] &= -\frac{(n_y+1)d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\left|L_{\tilde{V}_{0r}}\right| - \frac{1}{2}\sum_{r=1}^{d}\mathrm{tr}\left(L_{\tilde{V}_{0r}}\mathrm{E}\left[\left(\tilde{v}'_r - v'_{0r}\right)\left(\tilde{v}'_r - v'_{0r}\right)^T\right]\right) \tag{231}\\
&= -\frac{(n_y+1)d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\left|L_{\tilde{V}_{0r}}\right| - \frac{1}{2}\sum_{r=1}^{d}\mathrm{tr}\left(L_{\tilde{V}_{0r}}\left(L_{\tilde{V}_r}^{-1} + \bar{\tilde{v}}'_r\bar{\tilde{v}}'^T_r - v'_{0r}\bar{\tilde{v}}'^T_r - \bar{\tilde{v}}'_rv'^T_{0r} + v'_{0r}v'^T_{0r}\right)\right) \tag{232}\\
&= -\frac{(n_y+1)d}{2}\ln(2\pi) + \frac{1}{2}\sum_{r=1}^{d}\ln\left|L_{\tilde{V}_{0r}}\right| - \frac{1}{2}\sum_{r=1}^{d}\mathrm{tr}\left(L_{\tilde{V}_{0r}}L_{\tilde{V}_r}^{-1}\right) - \frac{1}{2}\sum_{r=1}^{d}\left(\bar{\tilde{v}}'_r - v'_{0r}\right)^TL_{\tilde{V}_{0r}}\left(\bar{\tilde{v}}'_r - v'_{0r}\right) \tag{233-234}
\end{align}
The term E_W[ln P(W)]:
\[
\mathrm{E}_W\left[\ln P(W)\right] = \ln B\left(\Psi_0, \nu_0\right) + \frac{\nu_0 - d - 1}{2}\overline{\ln W} - \frac{\nu}{2}\mathrm{tr}\left(\Psi_0^{-1}\Psi\right) \tag{235}
\]
where
\begin{align}
\overline{\ln W} &= \mathrm{E}\left[\ln|W|\right] \tag{236}\\
&= \sum_{i=1}^{d}\psi\left(\frac{\nu + 1 - i}{2}\right) + d\ln 2 + \ln|\Psi| \tag{237}
\end{align}
and ψ is the digamma function. The term E_W[ln q(W)]:
\begin{align}
\mathrm{E}_W\left[\ln q(W)\right] &= -H\left[q(W)\right] \tag{238}\\
&= \ln B\left(\Psi, \nu\right) + \frac{\nu - d - 1}{2}\overline{\ln W} - \frac{\nu d}{2} \tag{239}
\end{align}
The rest of the terms are the same as the ones in section 5.3.
8 Variational inference with full covariance Gaussian prior for V and µ and Gamma for W

8.1 Model priors

As in section 7, we take a prior distribution for Ṽ:
\[
P(\tilde{V}) = \prod_{r=1}^{d}\mathcal{N}\left(\tilde{v}'_r|v'_{0r}, L_{\tilde{V}_{0r}}^{-1}\right) \tag{240}
\]
The prior for non-isotropic W is
\[
P(W) = \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a_w, b_{w_r}\right) \tag{241}
\]
The prior for isotropic W is
\[
P(w) = \mathcal{G}\left(w|a_w, b_w\right) \tag{242}
\]
The parameters v'_{0r}, L_{Ṽ0r}^{-1}, a_w and b_w are computed on the large dataset.
8.2 Variational distributions

The joint distribution of the observed and latent variables:
\[
P(\Phi, Y, \mu, V, W) = P(\Phi|Y, \mu, V, W)\,P(Y)\,P(\tilde{V})\,P(W) \tag{243}
\]
Now, we consider the partition of the posterior:
\[
P(Y, \mu, V, W|\Phi) \approx q(Y, \tilde{V}, W) = q(Y)\,q(\tilde{V})\,q(W) \tag{244}
\]
The optimum for q*(Y) is the same as in section 5.2. The optimum for q*(Ṽ) is
\begin{align}
q^*(\tilde{V}) &= \prod_{r=1}^{d}\mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, L_{\tilde{V}_r}^{-1}\right) \tag{245}\\
L_{\tilde{V}_r} &= L_{\tilde{V}_{0r}} + \bar{w}_{rr}R_{\tilde{y}} \tag{246}\\
\bar{\tilde{v}}'_r &= L_{\tilde{V}_r}^{-1}\left(L_{\tilde{V}_{0r}}v'_{0r} + \bar{w}_{rr}C_r^T\right) \tag{247}
\end{align}
The optimum for q*(W) for non-isotropic W:
\begin{align}
q^*(W) &= \prod_{r=1}^{d}\mathcal{G}\left(w_{rr}|a'_w, b'_{w_r}\right) \tag{248}\\
a'_w &= a_w + \frac{N}{2} \tag{249}\\
b'_{w_r} &= b_{w_r} + \frac{1}{2}k_{rr} \tag{250}
\end{align}
where
\[
K = \mathrm{diag}\left(S - C\,\mathrm{E}[\tilde{V}]^T - \mathrm{E}[\tilde{V}]\,C^T + \mathrm{E}_{\tilde{V}}\left[\tilde{V}R_{\tilde{y}}\tilde{V}^T\right]\right) \tag{251}
\]
The optimum for q*(W) for isotropic W:
\begin{align}
q^*(W) &= \mathcal{G}\left(w|a'_w, b'_w\right) \tag{252}\\
a'_w &= a_w + \frac{Nd}{2} \tag{253}\\
b'_w &= b_w + \frac{1}{2}k \tag{254}
\end{align}
where
\[
k = \mathrm{tr}\left(S - 2C\,\mathrm{E}[\tilde{V}]^T\right) + \mathrm{tr}\left(\mathrm{E}\left[\tilde{V}^T\tilde{V}\right]R_{\tilde{y}}\right) \tag{255}
\]
8.3 Variational lower bound

The lower bound is given by
\begin{align}
\mathcal{L} &= \mathrm{E}_{Y,\mu,V,W}\left[\ln P(\Phi|Y, \mu, V, W)\right] + \mathrm{E}_Y\left[\ln P(Y)\right] + \mathrm{E}_{\tilde{V}}\left[\ln P(\tilde{V})\right] + \mathrm{E}_W\left[\ln P(W)\right] \notag\\
&\quad - \mathrm{E}_Y\left[\ln q(Y)\right] - \mathrm{E}_{\tilde{V}}\left[\ln q(\tilde{V})\right] - \mathrm{E}_W\left[\ln q(W)\right] \tag{256}
\end{align}
The term E_W[ln P(W)] with non-isotropic W:
\begin{align}
\mathrm{E}_W\left[\ln P(W)\right] &= \sum_{r=1}^{d}a_w\ln b_{w_r} + \left(a_w - 1\right)\mathrm{E}\left[\ln w_{rr}\right] - b_{w_r}\mathrm{E}\left[w_{rr}\right] - \ln\Gamma(a_w) \tag{257}\\
&= -d\ln\Gamma(a_w) + a_w\sum_{r=1}^{d}\ln b_{w_r} + \left(a_w - 1\right)\sum_{r=1}^{d}\mathrm{E}\left[\ln w_{rr}\right] - \sum_{r=1}^{d}b_{w_r}\mathrm{E}\left[w_{rr}\right] \tag{258}
\end{align}
The terms E_Y[ln P(Y)], E_Y[ln q(Y)] and E_Ṽ[ln q(Ṽ)] are the same as in section 5.3. The terms E_{Y,µ,V,W}[ln P(Φ|Y, µ, V, W)], E_W[ln P(W)] with isotropic W, and E_W[ln q(W)] are the same as in section 6.3. The term E_Ṽ[ln P(Ṽ)] is the same as in section 7.3.
References

[1] Jesús Villalba, "SPLDA," Tech. Rep., University of Zaragoza, Zaragoza, 2011.

[2] Jesús Villalba and Niko Brummer, "Towards Fully Bayesian Speaker Recognition: Integrating Out the Between-Speaker Covariance," in Interspeech 2011, Florence, 2011, pp. 28-31.

[3] C. M. Bishop, "Variational principal components," in 9th International Conference on Artificial Neural Networks (ICANN 99), 1999, vol. 1, pp. 509-514.

[4] Jesús Villalba, "Fully Bayesian Two-Covariance Model," Tech. Rep., University of Zaragoza, Zaragoza, 2010.

[5] Matthew J. Beal, "Variational Algorithms for Approximate Bayesian Inference," Ph.D. thesis, University College London, 2003.