Unsupervised Adaptation of SPLDA

Jesús Villalba
Communications Technology Group (GTC), Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain
arXiv:1511.07421v1 [stat.ML] 20 Nov 2015
[email protected]
June 19, 2013
1 Introduction
In this document we present a Variational Bayes solution to adapt an SPLDA [1] model to a new domain using unlabelled data. We assume that we have access to a labelled dataset (for example Switchboard) to initialise the model.
2 The Model

2.1 SPLDA
SPLDA is a linear generative model where an i-vector $\phi_j$ of speaker $i$ can be written as
$$ \phi_j = \mu + V y_i + \epsilon_j \quad (1) $$
where $\mu$ is a speaker-independent mean, $V$ is the eigen-voices matrix, $y_i$ is the speaker factor vector, and $\epsilon_j$ is a channel offset. We assume the following priors for $y$ and $\epsilon$:
$$ y_i \sim \mathcal{N}(y_i|0, I) \quad (2) $$
$$ \epsilon_j \sim \mathcal{N}\left(\epsilon_j|0, W^{-1}\right) \quad (3) $$
where $\mathcal{N}$ denotes a Gaussian distribution and $W$ is the within-class precision matrix.

Figure 1 shows the case where the development dataset is split into two parts: one part where the speaker labels are known (supervised) and another with unknown labels (unsupervised).

[Figure 1: BN for Bayesian SPLDA model.]

We introduce the variables involved:
• Let $\Phi_d$ be the i-vectors of the supervised dataset.
• Let $\Phi$ be the i-vectors of the unsupervised dataset.
• Let $\Phi_i$ be the i-vectors belonging to speaker $i$.
• Let $Y_d$ be the speaker identity variables of the supervised dataset.
• Let $Y$ be the speaker identity variables of the unsupervised dataset.
• Let $\theta_d$ be the labelling of the supervised dataset. It partitions the $N_d$ i-vectors into $M_d$ speakers.
• Let $\theta$ be the labelling of the unsupervised dataset. It partitions the $N$ i-vectors into $M$ speakers. $\theta_j$ is a latent variable comprising a 1-of-$M$ binary vector with elements $\theta_{ji}$, $i = 1, \dots, M$. This variable is equivalent to the cluster occupations of a GMM. The conditional distribution of $\theta$ given the weights of the mixture is
$$ P(\theta|\pi_\theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} \pi_{\theta_i}^{\theta_{ji}} . \quad (4) $$
• Let $\pi_\theta$ be the weights of the mixture. We choose a Dirichlet prior for the weights:
$$ P(\pi_\theta|\tau_0) = \mathrm{Dir}(\pi_\theta|\tau_0) = C(\tau_0) \prod_{i=1}^{M} \pi_{\theta_i}^{\tau_0 - 1} \quad (5) $$
where by symmetry we have chosen the same parameter $\tau_0$ for each of the components, and $C(\tau_0)$ is the normalisation constant of the Dirichlet distribution, defined as
$$ C(\tau_0) = \frac{\Gamma(M\tau_0)}{\Gamma(\tau_0)^M} \quad (6) $$
and $\Gamma$ is the Gamma function.
• Let $d$ be the i-vector dimension.
• Let $n_y$ be the speaker factor dimension.
• Let $\mathcal{M} = (\mu, V, W)$ be the set of all the SPLDA parameters. In the most general case, we can assume that the parameters of the model are also hidden variables with prior and posterior distributions.
2.2 Sufficient Statistics
We define some statistics for speaker $i$ in the unsupervised dataset:
$$ N_i = \sum_{j=1}^{N} \theta_{ji} \quad (7) $$
$$ F_i = \sum_{j=1}^{N} \theta_{ji} \phi_j \quad (8) $$
$$ S_i = \sum_{j=1}^{N} \theta_{ji} \phi_j \phi_j^T . \quad (9) $$

We define the centred statistics as
$$ \bar{F}_i = F_i - N_i \mu \quad (10) $$
$$ \bar{S}_i = \sum_{j=1}^{N} \theta_{ji} (\phi_j - \mu)(\phi_j - \mu)^T = S_i - \mu F_i^T - F_i \mu^T + N_i \mu\mu^T . \quad (11) $$

We define the global statistics
$$ N = \sum_{i=1}^{M} N_i \quad (12) $$
$$ F = \sum_{i=1}^{M} F_i \quad (13) $$
$$ \bar{F} = \sum_{i=1}^{M} \bar{F}_i \quad (14) $$
$$ S = \sum_{i=1}^{M} S_i \quad (15) $$
$$ \bar{S} = \sum_{i=1}^{M} \bar{S}_i . \quad (16) $$

Equally, we can define statistics $N_d$, $F_d$, $S_d$, etc., for the supervised dataset.
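To make the bookkeeping concrete, here is a minimal NumPy sketch that accumulates these statistics from an i-vector matrix `Phi` (N × d) and a responsibility matrix `r` (N × M) holding the values of θ (or their expectations). The array names are illustrative, not from the original report, and the global second-order statistic assumes that each row of `r` sums to one.

```python
import numpy as np

def sufficient_stats(Phi, r, mu):
    """Zero-, first- and second-order statistics per speaker.

    Phi : (N, d) i-vectors, r : (N, M) occupations theta_ji,
    mu  : (d,) speaker-independent mean.
    """
    N_i = r.sum(axis=0)                       # Eq. (7)
    F_i = r.T @ Phi                           # Eq. (8), one row per speaker
    F_bar_i = F_i - N_i[:, None] * mu         # Eq. (10)
    # Global stats; S = sum_i S_i holds when responsibilities sum to one per i-vector
    N = N_i.sum()
    F = F_i.sum(axis=0)
    S = Phi.T @ Phi                           # Eq. (15)
    S_bar = S - np.outer(mu, F) - np.outer(F, mu) + N * np.outer(mu, mu)  # Eq. (11) summed
    return N_i, F_i, F_bar_i, S, S_bar
```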
2.3 Data conditional likelihood
The likelihood of the data given the hidden variables for speaker $i$ is
$$ \ln P(\Phi_i|y_i, \theta, \mu, V, W) = \sum_{j=1}^{N} \theta_{ji} \ln \mathcal{N}\left(\phi_j|\mu + V y_i, W^{-1}\right) \quad (17) $$
$$ = \frac{N_i}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \sum_{j=1}^{N} \theta_{ji} (\phi_j - \mu - V y_i)^T W (\phi_j - \mu - V y_i) \quad (18) $$
$$ = \frac{N_i}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \mathrm{tr}\left(W \bar{S}_i\right) + y_i^T V^T W \bar{F}_i - \frac{N_i}{2} y_i^T V^T W V y_i . \quad (19) $$

We can also write this likelihood as
$$ \ln P(\Phi_i|y_i, \theta, \mu, V, W) = \frac{N_i}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \mathrm{tr}\Big(W \Big[ S_i - 2 F_i \mu^T + N_i \mu\mu^T \quad (20) $$
$$ \qquad - 2 (F_i - N_i\mu) y_i^T V^T + N_i V y_i y_i^T V^T \Big]\Big) . \quad (21) $$

If we define
$$ \tilde{y}_i = \begin{pmatrix} y_i \\ 1 \end{pmatrix} , \qquad \tilde{V} = \begin{pmatrix} V & \mu \end{pmatrix} \quad (22) $$
we can write it as
$$ \ln P(\Phi_i|y_i, \theta, \mu, V, W) = \sum_{j=1}^{N} \theta_{ji} \ln \mathcal{N}\left(\phi_j|\tilde{V}\tilde{y}_i, W^{-1}\right) \quad (23) $$
$$ = \frac{N_i}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \mathrm{tr}\left(W \left[ S_i - 2 F_i \tilde{y}_i^T \tilde{V}^T + N_i \tilde{V} \tilde{y}_i \tilde{y}_i^T \tilde{V}^T \right]\right) . \quad (24) $$
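As a sanity check, Eq. (24) can be evaluated directly from the per-speaker statistics. The sketch below assumes point estimates of the parameters (as in Section 3) and illustrative variable names.

```python
import numpy as np

def loglk_speaker(N_i, F_i, S_i, y_i, V, mu, W):
    """ln P(Phi_i | y_i, theta, mu, V, W) following Eq. (24)."""
    d = mu.shape[0]
    V_tilde = np.hstack([V, mu[:, None]])      # [V, mu]
    y_tilde = np.append(y_i, 1.0)              # [y_i; 1]
    _, logdet_W = np.linalg.slogdet(W)
    # S_i - 2 F_i y~^T V~^T + N_i V~ y~ y~^T V~^T
    A = (S_i - 2 * np.outer(F_i, y_tilde) @ V_tilde.T
         + N_i * V_tilde @ np.outer(y_tilde, y_tilde) @ V_tilde.T)
    return 0.5 * N_i * (logdet_W - d * np.log(2 * np.pi)) - 0.5 * np.trace(W @ A)
```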
[Figure 2: BN for SPLDA with point estimates of the model parameters.]
3 Variational Inference with Point Estimates of µ, V and W

As a first approximation, we assume a simplified model where we take point estimates of the parameters $\mu$, $V$ and $W$. In this case, the graphical model simplifies to the one in Figure 2. In this model, $y_i$, $y_{d_i}$ and $\theta_{ji}$ are the only hidden variables; $V$, $\mu$ and $W$ are hyperparameters that can be obtained by maximising the VB lower bound.
3.1 Variational Distributions

We write the joint distribution of the observed and latent factors:
$$ P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta|\theta_d, \tau_0, \mu, V, W) = P(\Phi|Y,\theta,\mu,V,W)\, P(Y)\, P(\theta|\pi_\theta)\, P(\pi_\theta|\tau_0)\, P(\Phi_d|Y_d,\theta_d,\mu,V,W)\, P(Y_d) . \quad (25) $$
In the following, the conditioning on $(\theta_d, \tau_0, \mu, V, W)$ is dropped for convenience. Now, we consider the partition of the posterior:
$$ P(Y, Y_d, \theta, \pi_\theta|\Phi, \Phi_d) \approx q(Y, Y_d, \theta, \pi_\theta) = q(Y, Y_d)\, q(\theta)\, q(\pi_\theta) . \quad (26) $$
The optimum for $q^*(Y, Y_d)$ is
$$ \ln q^*(Y, Y_d) = E_{\theta,\pi_\theta}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta)] + \mathrm{const} \quad (27) $$
$$ = E_\theta[\ln P(\Phi|Y,\theta)] + \ln P(Y) + \ln P(\Phi_d|Y_d) + \ln P(Y_d) + \mathrm{const} \quad (28) $$
$$ = \sum_{i=1}^{M} y_i^T V^T W E[\bar{F}_i] - \frac{1}{2} y_i^T \left(I + E[N_i] V^T W V\right) y_i + \sum_{i=1}^{M_d} y_{d_i}^T V^T W \bar{F}_{d_i} - \frac{1}{2} y_{d_i}^T \left(I + N_{d_i} V^T W V\right) y_{d_i} + \mathrm{const} . \quad (29) $$

Therefore $q^*(Y, Y_d)$ is a product of Gaussian distributions:
$$ q^*(Y, Y_d) = \prod_{i=1}^{M} \mathcal{N}\left(y_i|\bar{y}_i, L_{y_i}^{-1}\right) \prod_{i=1}^{M_d} \mathcal{N}\left(y_{d_i}|\bar{y}_{d_i}, L_{y_{d_i}}^{-1}\right) \quad (30) $$
$$ L_{y_i} = I + E[N_i] V^T W V \quad (31) $$
$$ \bar{y}_i = L_{y_i}^{-1} V^T W E[\bar{F}_i] \quad (32) $$
$$ E[N_i] = \sum_{j=1}^{N} E[\theta_{ji}] \quad (33) $$
$$ E[\bar{F}_i] = \sum_{j=1}^{N} E[\theta_{ji}] (\phi_j - \mu) \quad (34) $$
$$ L_{y_{d_i}} = I + N_{d_i} V^T W V \quad (35) $$
$$ \bar{y}_{d_i} = L_{y_{d_i}}^{-1} V^T W \bar{F}_{d_i} \quad (36) $$
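A minimal NumPy sketch of the updates (31)–(34) for the unsupervised speakers, assuming the responsibility matrix `r` from the previous section (illustrative names):

```python
import numpy as np

def update_q_y(r, Phi, mu, V, W):
    """Posterior precisions L_yi and means y_bar_i, Eqs. (31)-(34)."""
    VtW = V.T @ W                                  # (ny, d)
    VtWV = VtW @ V                                 # (ny, ny)
    E_N = r.sum(axis=0)                            # E[N_i], Eq. (33)
    E_Fbar = r.T @ (Phi - mu)                      # E[F_bar_i], Eq. (34), (M, d)
    ny = V.shape[1]
    L_y = np.eye(ny) + E_N[:, None, None] * VtWV   # Eq. (31), (M, ny, ny)
    rhs = (E_Fbar @ VtW.T)[:, :, None]             # V^T W E[F_bar_i]
    y_bar = np.linalg.solve(L_y, rhs)[:, :, 0]     # Eq. (32), (M, ny)
    return L_y, y_bar, E_N, E_Fbar
```

The supervised speakers follow the same pattern with hard counts $N_{d_i}$ and statistics $\bar{F}_{d_i}$ (Eqs. (35)–(36)).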
The optimum for $q^*(\theta)$ is
$$ \ln q^*(\theta) = E_{Y,Y_d,\pi_\theta}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta)] + \mathrm{const} \quad (37) $$
$$ = E_Y[\ln P(\Phi|Y,\theta)] + E_{\pi_\theta}[\ln P(\theta|\pi_\theta)] + \mathrm{const} \quad (38) $$
$$ = \sum_{j=1}^{N} \sum_{i=1}^{M} \theta_{ji} \left( \frac{1}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} E\left[(\phi_j - \mu - V y_i)^T W (\phi_j - \mu - V y_i)\right] + E[\ln \pi_{\theta_i}] \right) + \mathrm{const} \quad (39) $$
$$ = \sum_{j=1}^{N} \sum_{i=1}^{M} \theta_{ji} \left( \frac{1}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} (\phi_j - \mu)^T W (\phi_j - \mu) + E[y_i]^T V^T W (\phi_j - \mu) - \frac{1}{2}\mathrm{tr}\left(V^T W V E\left[y_i y_i^T\right]\right) + E[\ln \pi_{\theta_i}] \right) + \mathrm{const} . \quad (40) $$

Taking exponentials on both sides:
$$ q^*(\theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} r_{ji}^{\theta_{ji}} \quad (41) $$
where
$$ r_{ji} = \frac{\varrho_{ji}}{\sum_{i=1}^{M} \varrho_{ji}} \quad (42) $$
$$ \ln \varrho_{ji} = \frac{1}{2}\ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} (\phi_j - \mu)^T W (\phi_j - \mu) + E[y_i]^T V^T W (\phi_j - \mu) - \frac{1}{2}\mathrm{tr}\left(V^T W V E\left[y_i y_i^T\right]\right) + E[\ln \pi_{\theta_i}] . \quad (43) $$

The optimum for $q^*(\pi_\theta)$ is
$$ \ln q^*(\pi_\theta) = E_{Y,Y_d,\theta}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta)] + \mathrm{const} \quad (44) $$
$$ = E_\theta[\ln P(\theta|\pi_\theta)] + \ln P(\pi_\theta|\tau_0) + \mathrm{const} \quad (45) $$
$$ = \sum_{j=1}^{N} \sum_{i=1}^{M} E[\theta_{ji}] \ln \pi_{\theta_i} + (\tau_0 - 1) \sum_{i=1}^{M} \ln \pi_{\theta_i} + \mathrm{const} \quad (46)-(47) $$
$$ = \sum_{i=1}^{M} (E[N_i] + \tau_0 - 1) \ln \pi_{\theta_i} + \mathrm{const} . \quad (48) $$

Thus
$$ q^*(\pi_\theta) = \mathrm{Dir}(\pi_\theta|\tau) = C(\tau) \prod_{i=1}^{M} \pi_{\theta_i}^{\tau_i - 1} \quad (49) $$
where
$$ \tau_i = E[N_i] + \tau_0 \quad (50) $$
$$ C(\tau) = \frac{\Gamma\left(\sum_{i=1}^{M} \tau_i\right)}{\prod_{i=1}^{M} \Gamma(\tau_i)} . \quad (51) $$

Finally, we evaluate the expectations:
$$ E[y_i] = \bar{y}_i \quad (52) $$
$$ E\left[y_i y_i^T\right] = L_{y_i}^{-1} + \bar{y}_i \bar{y}_i^T \quad (53) $$
$$ E\left[\tilde{y}_i \tilde{y}_i^T\right] = \begin{pmatrix} E\left[y_i y_i^T\right] & E[y_i] \\ E[y_i]^T & 1 \end{pmatrix} \quad (54) $$
$$ E[\theta_{ji}] = r_{ji} \quad (55) $$
$$ E[\pi_{\theta_i}] = \frac{\tau_i}{\sum_{i=1}^{M} \tau_i} \quad (56) $$
$$ E[\ln \pi_{\theta_i}] = \psi(\tau_i) - \psi\left(\sum_{i=1}^{M} \tau_i\right) \quad (57) $$
where $\psi$ is the digamma function.
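For reference, a sketch of the responsibility update (42)–(43) together with the expectations (53)–(54) and (57); `scipy.special.digamma` supplies $\psi$, and a log-sum-exp shift keeps the exponentiation stable (array names are illustrative):

```python
import numpy as np
from scipy.special import digamma

def update_q_theta(Phi, mu, V, W, y_bar, L_y, tau):
    """Responsibilities r_ji of Eq. (42), with ln rho_ji from Eq. (43)."""
    d = mu.shape[0]
    _, logdet_W = np.linalg.slogdet(W)
    E_ln_pi = digamma(tau) - digamma(tau.sum())                          # Eq. (57)
    E_yyT = np.linalg.inv(L_y) + y_bar[:, :, None] * y_bar[:, None, :]   # Eq. (53)
    Phic = Phi - mu
    quad = np.einsum('jd,de,je->j', Phic, W, Phic)     # (phi_j-mu)^T W (phi_j-mu)
    cross = Phic @ W @ V @ y_bar.T                     # E[y_i]^T V^T W (phi_j-mu)
    tr_term = np.einsum('ab,iba->i', V.T @ W @ V, E_yyT)
    ln_rho = (0.5 * (logdet_W - d * np.log(2 * np.pi))
              - 0.5 * quad[:, None] + cross
              - 0.5 * tr_term[None, :] + E_ln_pi[None, :])
    ln_rho -= ln_rho.max(axis=1, keepdims=True)        # stabilise before exponentiating
    r = np.exp(ln_rho)
    return r / r.sum(axis=1, keepdims=True)
```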
3.1.1 Distributions with deterministic annealing

If we use deterministic annealing with parameter $\kappa$, we have
$$ q^*(Y, Y_d) = \prod_{i=1}^{M} \mathcal{N}\left(y_i|\bar{y}_i, 1/\kappa\, L_{y_i}^{-1}\right) \prod_{i=1}^{M_d} \mathcal{N}\left(y_{d_i}|\bar{y}_{d_i}, 1/\kappa\, L_{y_{d_i}}^{-1}\right) \quad (58) $$
$$ q^*(\theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} r_{ji}^{\theta_{ji}} \quad (59) $$
where
$$ r_{ji} = \frac{\varrho_{ji}^\kappa}{\sum_{i=1}^{M} \varrho_{ji}^\kappa} \quad (60) $$
$$ q^*(\pi_\theta) = \mathrm{Dir}(\pi_\theta|\tau) = C(\tau) \prod_{i=1}^{M} \pi_{\theta_i}^{\tau_i - 1} \quad (61) $$
where
$$ \tau_i = \kappa(E[N_i] + \tau_0 - 1) + 1 . \quad (62) $$

3.2 Variational lower bound
The lower bound is given by
$$ \mathcal{L} = E_{Y,\theta}[\ln P(\Phi|Y,\theta)] + E_Y[\ln P(Y)] + E_{\theta,\pi_\theta}[\ln P(\theta|\pi_\theta)] + E_{\pi_\theta}[\ln P(\pi_\theta)] + E_{Y_d}[\ln P(\Phi_d|Y_d)] + E_{Y_d}[\ln P(Y_d)] - E_Y[\ln q(Y)] - E_\theta[\ln q(\theta)] - E_{\pi_\theta}[\ln q(\pi_\theta)] - E_{Y_d}[\ln q(Y_d)] . \quad (63) $$

The term $E_{Y,\theta}[\ln P(\Phi|Y,\theta)]$:
$$ E_{Y,\theta}[\ln P(\Phi|Y,\theta)] = \frac{E[N]}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \mathrm{tr}\left(W\left(E[S] - 2\sum_{i=1}^{M} E[F_i] E[\tilde{y}_i]^T \tilde{V}^T + \tilde{V} \sum_{i=1}^{M} E[N_i] E\left[\tilde{y}_i \tilde{y}_i^T\right] \tilde{V}^T\right)\right) \quad (64) $$

We define
$$ C_{\tilde{y}} = \sum_{i=1}^{M} E[F_i] E[\tilde{y}_i]^T \quad (65) $$
$$ R_{\tilde{y}} = \sum_{i=1}^{M} E[N_i] E\left[\tilde{y}_i \tilde{y}_i^T\right] . \quad (66) $$

Then
$$ E_{Y,\theta}[\ln P(\Phi|Y,\theta)] = \frac{E[N]}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \mathrm{tr}\left(W\left(E[S] - 2 C_{\tilde{y}} \tilde{V}^T + \tilde{V} R_{\tilde{y}} \tilde{V}^T\right)\right) . \quad (67) $$

The term $E_{Y_d}[\ln P(\Phi_d|Y_d)]$:
$$ E_{Y_d}[\ln P(\Phi_d|Y_d)] = \frac{N_d}{2} \ln\left|\frac{W}{2\pi}\right| - \frac{1}{2} \mathrm{tr}\left(W\left(S_d - 2 C_{\tilde{y}_d} \tilde{V}^T + \tilde{V} R_{\tilde{y}_d} \tilde{V}^T\right)\right) \quad (68) $$
where
$$ C_{\tilde{y}_d} = \sum_{i=1}^{M_d} F_{d_i} E[\tilde{y}_{d_i}]^T \quad (69) $$
$$ R_{\tilde{y}_d} = \sum_{i=1}^{M_d} N_{d_i} E\left[\tilde{y}_{d_i} \tilde{y}_{d_i}^T\right] . \quad (70) $$

The term $E_Y[\ln P(Y)]$:
$$ E_Y[\ln P(Y)] = -\frac{M n_y}{2} \ln(2\pi) - \frac{1}{2} \mathrm{tr}(P_y) \quad (71) $$
where
$$ P_y = \sum_{i=1}^{M} E\left[y_i y_i^T\right] . \quad (72) $$

The term $E_{Y_d}[\ln P(Y_d)]$:
$$ E_{Y_d}[\ln P(Y_d)] = -\frac{M_d n_y}{2} \ln(2\pi) - \frac{1}{2} \mathrm{tr}(P_{y_d}) \quad (73) $$
where
$$ P_{y_d} = \sum_{i=1}^{M_d} E\left[y_{d_i} y_{d_i}^T\right] . \quad (74) $$

The term $E_{\theta,\pi_\theta}[\ln P(\theta|\pi_\theta)]$:
$$ E_{\theta,\pi_\theta}[\ln P(\theta|\pi_\theta)] = \sum_{j=1}^{N} \sum_{i=1}^{M} r_{ji} E[\ln \pi_{\theta_i}] \quad (75) $$

The term $E_{\pi_\theta}[\ln P(\pi_\theta)]$:
$$ E_{\pi_\theta}[\ln P(\pi_\theta)] = \ln C(\tau_0) + (\tau_0 - 1) \sum_{i=1}^{M} E[\ln \pi_{\theta_i}] \quad (76) $$

The term $E_Y[\ln q(Y)]$:
$$ E_Y[\ln q(Y)] = -\frac{M n_y}{2} (\ln(2\pi) + 1) + \frac{1}{2} \sum_{i=1}^{M} \ln|L_{y_i}| \quad (77) $$

The term $E_{Y_d}[\ln q(Y_d)]$:
$$ E_{Y_d}[\ln q(Y_d)] = -\frac{M_d n_y}{2} (\ln(2\pi) + 1) + \frac{1}{2} \sum_{i=1}^{M_d} \ln\left|L_{y_{d_i}}\right| \quad (78) $$

The term $E_\theta[\ln q(\theta)]$:
$$ E_\theta[\ln q(\theta)] = \sum_{j=1}^{N} \sum_{i=1}^{M} r_{ji} \ln r_{ji} \quad (79) $$

The term $E_{\pi_\theta}[\ln q(\pi_\theta)]$:
$$ E_{\pi_\theta}[\ln q(\pi_\theta)] = \ln C(\tau) + \sum_{i=1}^{M} (\tau_i - 1) E[\ln \pi_{\theta_i}] \quad (80) $$
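As an illustration of how these terms translate into code, the sketch below evaluates only the label and weight contributions (75), (76), (79) and (80), using `scipy.special.gammaln` and `digamma`; the remaining Gaussian terms follow the same pattern (names are illustrative):

```python
import numpy as np
from scipy.special import gammaln, digamma

def ln_C(tau):
    """Log normaliser of a Dirichlet, Eq. (51)."""
    return gammaln(tau.sum()) - gammaln(tau).sum()

def dirichlet_label_terms(r, tau, tau0):
    """Contribution of terms (75), (76), (79) and (80) to the lower bound."""
    M = tau.shape[0]
    E_ln_pi = digamma(tau) - digamma(tau.sum())                      # Eq. (57)
    E_lnP_theta = (r * E_ln_pi).sum()                                # Eq. (75)
    E_lnP_pi = ln_C(np.full(M, tau0)) + (tau0 - 1) * E_ln_pi.sum()   # Eq. (76)
    E_lnq_theta = (r * np.log(np.clip(r, 1e-300, None))).sum()       # Eq. (79)
    E_lnq_pi = ln_C(tau) + ((tau - 1) * E_ln_pi).sum()               # Eq. (80)
    return E_lnP_theta + E_lnP_pi - E_lnq_theta - E_lnq_pi
```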
3.3 Hyperparameter optimisation
We can obtain the hyperparameters $(\tau_0, \mu, V, W)$ by maximising the lower bound. We control the weight of each of the databases on the estimation by introducing a parameter $\eta \le 1$ into the lower bound expression:
$$ \mathcal{L}(\mu, V, W, \tau_0) = E_{Y,\theta}[\ln P(\Phi|Y,\theta)] + E_{\pi_\theta}[\ln P(\pi_\theta)] + \eta E_{Y_d}[\ln P(\Phi_d|Y_d)] + \mathrm{const} \quad (81) $$

We derive for $\tilde{V}$:
$$ \frac{\partial \mathcal{L}}{\partial \tilde{V}} = C_{\tilde{y}} + \eta C_{\tilde{y}_d} - \tilde{V}\left(R_{\tilde{y}} + \eta R_{\tilde{y}_d}\right) = 0 \implies \quad (82) $$
$$ \tilde{V} = C'_{\tilde{y}} R'^{-1}_{\tilde{y}} \quad (83) $$
where
$$ C'_{\tilde{y}} = C_{\tilde{y}} + \eta C_{\tilde{y}_d} \quad (84) $$
$$ R'_{\tilde{y}} = R_{\tilde{y}} + \eta R_{\tilde{y}_d} \quad (85) $$

We derive for $W$:
$$ \frac{\partial \mathcal{L}}{\partial W} = \frac{E[N] + \eta N_d}{2} \left(2 W^{-1} - \mathrm{diag}\left(W^{-1}\right)\right) - \frac{1}{2}\left(K + K^T - \mathrm{diag}(K)\right) \quad (86) $$
where
$$ K = E[S] + \eta S_d - 2 C'_{\tilde{y}} \tilde{V}^T + \tilde{V} R'_{\tilde{y}} \tilde{V}^T \quad (87) $$
Then
$$ W^{-1} = \frac{1}{E[N] + \eta N_d} \, \frac{K + K^T}{2} \quad (88) $$

We derive for $\tau_0$:
$$ \frac{\partial \mathcal{L}}{\partial \tau_0} = M \left(\psi(M\tau_0) - \psi(\tau_0)\right) + \sum_{i=1}^{M} E[\ln \pi_{\theta_i}] = 0 \quad (89) $$
We define $\tau_0 = \exp(\tilde{\tau}_0)$ and
$$ f(\tilde{\tau}_0) = \psi(M\tau_0) - \psi(\tau_0) + g = 0 \quad (90) $$
$$ g = \frac{1}{M} \sum_{i=1}^{M} E[\ln \pi_{\theta_i}] . \quad (91) $$
We can solve for $\tilde{\tau}_0$ by Newton-Raphson iterations:
$$ \tilde{\tau}_0^{\mathrm{new}} = \tilde{\tau}_0 - \frac{f(\tilde{\tau}_0)}{f'(\tilde{\tau}_0)} \quad (92) $$
$$ = \tilde{\tau}_0 - \frac{\psi(M\tau_0) - \psi(\tau_0) + g}{\tau_0 \left(M\psi'(M\tau_0) - \psi'(\tau_0)\right)} \quad (93) $$
Taking exponentials on both sides:
$$ \tau_0^{\mathrm{new}} = \tau_0 \exp\left(-\frac{\psi(M\tau_0) - \psi(\tau_0) + g}{\tau_0 \left(M\psi'(M\tau_0) - \psi'(\tau_0)\right)}\right) \quad (94) $$
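A sketch of update (94), iterated to convergence with SciPy's digamma/trigamma functions (illustrative names; the derivative includes the chain-rule factor $M$ on $\psi'(M\tau_0)$):

```python
import numpy as np
from scipy.special import digamma, polygamma

def update_tau0(E_ln_pi, tau0=1.0, n_iter=20):
    """Newton-Raphson in ln(tau0), Eq. (94). E_ln_pi holds E[ln pi_theta_i]."""
    M = E_ln_pi.shape[0]
    g = E_ln_pi.mean()                                        # Eq. (91)
    for _ in range(n_iter):
        f = digamma(M * tau0) - digamma(tau0) + g             # Eq. (90)
        fp = M * polygamma(1, M * tau0) - polygamma(1, tau0)  # d/dtau0 of f
        tau0 *= np.exp(-f / (tau0 * fp))                      # Eq. (94)
    return tau0
```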
3.4 Minimum divergence

We assume a more general prior for the hidden variables:
$$ P(y) = \mathcal{N}\left(y|\mu_y, \Lambda_y^{-1}\right) \quad (95) $$
Then we maximise
$$ \mathcal{L}(\mu_y, \Lambda_y) = \sum_{i=1}^{M} E_Y\left[\ln \mathcal{N}\left(y_i|\mu_y, \Lambda_y^{-1}\right)\right] + \eta \sum_{i=1}^{M_d} E_{Y_d}\left[\ln \mathcal{N}\left(y_{d_i}|\mu_y, \Lambda_y^{-1}\right)\right] \quad (96) $$
$$ = \frac{M + \eta M_d}{2} \ln|\Lambda_y| - \frac{1}{2} \mathrm{tr}\left(\Lambda_y \left(\sum_{i=1}^{M} E\left[(y_i - \mu_y)(y_i - \mu_y)^T\right] + \eta \sum_{i=1}^{M_d} E\left[(y_{d_i} - \mu_y)(y_{d_i} - \mu_y)^T\right]\right)\right) + \mathrm{const} \quad (97) $$

We derive for $\mu_y$:
$$ \frac{\partial \mathcal{L}(\mu_y, \Lambda_y)}{\partial \mu_y} = \frac{1}{2}\sum_{i=1}^{M} \Lambda_y E[y_i - \mu_y] + \eta\, \frac{1}{2}\sum_{i=1}^{M_d} \Lambda_y E[y_{d_i} - \mu_y] = 0 \implies \quad (98) $$
$$ \mu_y = \frac{1}{M + \eta M_d} \left(\sum_{i=1}^{M} E[y_i] + \eta \sum_{i=1}^{M_d} E[y_{d_i}]\right) \quad (99) $$

We derive for $\Lambda_y$:
$$ \frac{\partial \mathcal{L}(\mu_y, \Lambda_y)}{\partial \Lambda_y} = \frac{M + \eta M_d}{2}\left(2\Lambda_y^{-1} - \mathrm{diag}\left(\Lambda_y^{-1}\right)\right) - \frac{1}{2}\left(2S - \mathrm{diag}(S)\right) = 0 \quad (100) $$
where
$$ S = \sum_{i=1}^{M} E\left[(y_i - \mu_y)(y_i - \mu_y)^T\right] + \eta \sum_{i=1}^{M_d} E\left[(y_{d_i} - \mu_y)(y_{d_i} - \mu_y)^T\right] \quad (101) $$
Then
$$ \Sigma_y = \Lambda_y^{-1} = \frac{1}{M + \eta M_d}\left(P_y + \eta P_{y_d}\right) - \mu_y \mu_y^T \quad (102) $$

To obtain a standard prior for $y$, we transform $\mu$ and $V$ by using
$$ \mu' = \mu + V \mu_y \quad (103) $$
$$ V' = V \left(\Sigma_y^{1/2}\right)^T \quad (104) $$
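A sketch of this minimum-divergence step for the point-estimate model of Section 3, Eqs. (99) and (102)–(104); any square root $S$ with $S S^T = \Sigma_y$ works, and a Cholesky factor is used here (illustrative names):

```python
import numpy as np

def min_divergence(mu, V, y_bar, L_y, yd_bar, L_yd, eta=1.0):
    """Re-standardise the speaker-factor prior, Eqs. (99), (102)-(104)."""
    M, Md = y_bar.shape[0], yd_bar.shape[0]
    w = M + eta * Md
    mu_y = (y_bar.sum(axis=0) + eta * yd_bar.sum(axis=0)) / w      # Eq. (99)
    P_y = np.linalg.inv(L_y).sum(axis=0) + y_bar.T @ y_bar         # Eq. (72)
    P_yd = np.linalg.inv(L_yd).sum(axis=0) + yd_bar.T @ yd_bar     # Eq. (74)
    Sigma_y = (P_y + eta * P_yd) / w - np.outer(mu_y, mu_y)        # Eq. (102)
    mu_new = mu + V @ mu_y                                         # Eq. (103)
    V_new = V @ np.linalg.cholesky(Sigma_y)                        # Eq. (104): V' V'^T = V Sigma_y V^T
    return mu_new, V_new
```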
3.5 Determining the number of speakers

To determine the number of speakers, we initialise the algorithm assuming that there is a large number of speakers and, after some iterations, we eliminate speakers based on heuristics:
• Each i-vector belongs only to one speaker.
• Each speaker has an integer number of i-vectors.
• If several i-vectors have similar $E[\theta]$ for several speakers, we can merge the speakers.
• Compare the lower bound for different values of $M$ to determine the best number of speakers.
3.6 Initialise the VB

• The values of $\mu$, $V$ and $W$ can be initialised using the supervised dataset.
• $q(\pi_\theta)$ can be initialised assuming that all the speakers have the same number of i-vectors.
• $q(\theta)$ can be initialised using AHC or some simple algorithm based on the pairwise scores computed by evaluating the initial PLDA model. We should also initialise $q(\theta)$ with the oracle labels and check that the partition does not degrade itself as the algorithm iterates. This will provide an upper bound for the performance of the algorithm.
• Instead of initialising $q(\theta)$, we can initialise $q(Y)$ by sampling random speakers from the standard distribution and afterwards compute $q(\theta)$ given $q(Y)$.
3.7 Combining VB and sampling methods

I am interested in Dan's idea of combining VB and sampling methods. Instead of computing the i-vector statistics as shown in Equations (33) and (34), we can draw samples $\hat{\theta}^k_j$, $k = 1, \dots, K$, from $q(\theta)$. Then, we compute $K$ i-vector statistics for speaker $i$ as
$$ N_i^k = \sum_{j=1}^{N} \hat{\theta}^k_{ji} , \qquad F_i^k = \sum_{j=1}^{N} \hat{\theta}^k_{ji} \phi_j . \quad (105) $$
Thus, the statistics are computed in a way that each i-vector belongs only to one speaker, while in the standard VB formulation i-vectors are shared between several clusters. Then, we can follow several strategies (a sketch of the sampling step is given after this list):
• Select the sample $k^*$ that maximises the lower bound.
• For each sample $k$, obtain the accumulators needed to compute $\mu$, $V$ and $W$ ($R_{\tilde{y}}$, $C_{\tilde{y}}$, etc.), average the accumulators over all the samples and compute the model.
• For each sample $k$, compute a model and average the models. However, I think that averaging the accumulators is more correct.
The drawback of this method is that the computational cost grows linearly with $K$, and we may need a large $K$ to make it work.
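A sketch of the sampling step: draw $K$ hard labellings from $q(\theta)$ and accumulate the per-sample statistics of Eq. (105). The function and array names are illustrative.

```python
import numpy as np

def sample_stats(r, Phi, K, rng=None):
    """Draw K hard labellings from q(theta) and build Eq. (105) statistics."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = r.shape
    N_k = np.zeros((K, M))
    F_k = np.zeros((K, M, Phi.shape[1]))
    for k in range(K):
        # one categorical draw per i-vector: each phi_j goes to a single speaker
        labels = np.array([rng.choice(M, p=r[j]) for j in range(N)])
        theta_hat = np.eye(M)[labels]            # hard 1-of-M assignments
        N_k[k] = theta_hat.sum(axis=0)           # N_i^k
        F_k[k] = theta_hat.T @ Phi               # F_i^k
    return N_k, F_k
```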
4 Variational inference with Gaussian-Gamma priors for V, Gaussian for µ and non-informative prior for W

4.1 Model priors
We choose the model priors based on Bishop's paper on VB PPCA [2]. We introduce a hierarchical prior $P(V|\alpha)$ over the matrix $V$, governed by an $n_y$-dimensional vector of hyperparameters, where $n_y$ is the dimension of the factors. Each hyperparameter controls one of the columns of the matrix $V$ through a conditional Gaussian distribution of the form
$$ P(V|\alpha) = \prod_{q=1}^{n_y} \left(\frac{\alpha_q}{2\pi}\right)^{d/2} \exp\left(-\frac{1}{2}\alpha_q \mathbf{v}_q^T \mathbf{v}_q\right) \quad (106) $$
where $\mathbf{v}_q$ are the columns of $V$. Each $\alpha_q$ controls the inverse variance of the corresponding $\mathbf{v}_q$. If a particular $\alpha_q$ has a posterior distribution concentrated at large values, the corresponding $\mathbf{v}_q$ will tend to be small, and that direction of the latent space will be effectively 'switched off'. We define a prior for $\alpha$:
$$ P(\alpha) = \prod_{q=1}^{n_y} \mathcal{G}(\alpha_q|a_\alpha, b_\alpha) \quad (107) $$
where $\mathcal{G}$ denotes the Gamma distribution. Bishop defines broad priors by setting $a_\alpha = b_\alpha = 10^{-3}$. We place a Gaussian prior on the mean $\mu$:
$$ P(\mu) = \mathcal{N}\left(\mu|\mu_0, \mathrm{diag}(\beta)^{-1}\right) . \quad (108) $$
We will consider the case where each dimension has a different precision and the case with isotropic precision ($\mathrm{diag}(\beta) = \beta I$). Finally, we use a non-informative prior for $W$ as in [3]:
$$ P(W) = \lim_{k \to 0} \mathcal{W}(W|W_0/k, k) \quad (109) $$
$$ \propto |W|^{-(d+1)/2} . \quad (110) $$

4.2 Variational distributions
We write the joint distribution of the observed and latent variables:
$$ P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha|\theta_d, \tau_0, \mu_0, \beta, a_\alpha, b_\alpha) = P(\Phi|Y,\theta,\mu,V,W)\, P(Y)\, P(\theta|\pi_\theta)\, P(\pi_\theta|\tau_0)\, P(V|\alpha)\, P(\alpha|a_\alpha, b_\alpha)\, P(\mu|\mu_0, \beta)\, P(W)\, P(\Phi_d|Y_d,\theta_d,\mu,V,W)\, P(Y_d) \quad (111) $$
In the following, the conditioning on $(\theta_d, \tau_0, \mu_0, \beta, a_\alpha, b_\alpha)$ is dropped for convenience. Now, we consider the partition of the posterior:
$$ P(Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha|\Phi, \Phi_d) \approx q(Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha) = q(Y, Y_d)\, q(\theta)\, q(\pi_\theta) \prod_{r=1}^{d} q(\tilde{v}'_r)\, q(W)\, q(\alpha) \quad (112) $$
where $\tilde{v}'_r$ is a column vector containing the $r$th row of $\tilde{V}$. If $W$ were a diagonal matrix, the factorisation $\prod_{r=1}^{d} q(\tilde{v}'_r)$ would not be necessary because it arises naturally when solving the posterior. However, for full covariance $W$, the posterior of $\mathrm{vec}(\tilde{V})$ is a Gaussian with a huge full covariance matrix. We force the factorisation to make the problem tractable.

The optimum for $q^*(Y, Y_d)$ is
$$ \ln q^*(Y, Y_d) = E_{\theta,\pi_\theta,\mu,V,W,\alpha}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha)] + \mathrm{const} \quad (113) $$
$$ = E_{\theta,\mu,V,W}[\ln P(\Phi|Y,\theta,\mu,V,W)] + \ln P(Y) + E_{\mu,V,W}[\ln P(\Phi_d|Y_d,\mu,V,W)] + \ln P(Y_d) + \mathrm{const} \quad (114) $$
$$ = \sum_{i=1}^{M} y_i^T E\left[V^T W (F_i - N_i\mu)\right] - \frac{1}{2} y_i^T \left(I + E[N_i] E\left[V^T W V\right]\right) y_i + \sum_{i=1}^{M_d} y_{d_i}^T E\left[V^T W (F_{d_i} - N_{d_i}\mu)\right] - \frac{1}{2} y_{d_i}^T \left(I + N_{d_i} E\left[V^T W V\right]\right) y_{d_i} + \mathrm{const} \quad (115) $$
Therefore $q^*(Y, Y_d)$ is a product of Gaussian distributions:
$$ q^*(Y, Y_d) = \prod_{i=1}^{M} \mathcal{N}\left(y_i|\bar{y}_i, L_{y_i}^{-1}\right) \prod_{i=1}^{M_d} \mathcal{N}\left(y_{d_i}|\bar{y}_{d_i}, L_{y_{d_i}}^{-1}\right) \quad (116) $$
$$ L_{y_i} = I + E[N_i] E\left[V^T W V\right] \quad (117) $$
$$ \bar{y}_i = L_{y_i}^{-1} \left(E[V]^T E[W] E[F_i] - E[N_i] E\left[V^T W \mu\right]\right) \quad (118) $$
$$ E[N_i] = \sum_{j=1}^{N} E[\theta_{ji}] \quad (119) $$
$$ E[F_i] = \sum_{j=1}^{N} E[\theta_{ji}] \phi_j \quad (120) $$
$$ L_{y_{d_i}} = I + N_{d_i} E\left[V^T W V\right] \quad (121) $$
$$ \bar{y}_{d_i} = L_{y_{d_i}}^{-1} \left(E[V]^T E[W] F_{d_i} - N_{d_i} E\left[V^T W \mu\right]\right) \quad (122) $$
The optimum for $q^*(\theta)$ is
$$ \ln q^*(\theta) = E_{Y,Y_d,\pi_\theta,\mu,V,W,\alpha}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha)] + \mathrm{const} \quad (123) $$
$$ = E_{Y,\mu,V,W}[\ln P(\Phi|Y,\theta,\mu,V,W)] + E_{\pi_\theta}[\ln P(\theta|\pi_\theta)] + \mathrm{const} \quad (124) $$
$$ = \sum_{j=1}^{N} \sum_{i=1}^{M} \theta_{ji} \left( \frac{1}{2} E[\ln|W|] - \frac{d}{2}\ln(2\pi) - \frac{1}{2} E\left[(\phi_j - \tilde{V}\tilde{y}_i)^T W (\phi_j - \tilde{V}\tilde{y}_i)\right] + E[\ln \pi_{\theta_i}] \right) + \mathrm{const} \quad (125) $$
Taking exponentials on both sides:
$$ q^*(\theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} r_{ji}^{\theta_{ji}} \quad (126) $$
where
$$ r_{ji} = \frac{\varrho_{ji}}{\sum_{i=1}^{M} \varrho_{ji}} \quad (127) $$
$$ \ln \varrho_{ji} = \frac{1}{2} E[\ln|W|] - \frac{d}{2}\ln(2\pi) - \frac{1}{2} E\left[(\phi_j - \tilde{V}\tilde{y}_i)^T W (\phi_j - \tilde{V}\tilde{y}_i)\right] + E[\ln \pi_{\theta_i}] . \quad (128) $$

The optimum for $q^*(\pi_\theta)$ is
$$ q^*(\pi_\theta) = \mathrm{Dir}(\pi_\theta|\tau) = C(\tau) \prod_{i=1}^{M} \pi_{\theta_i}^{\tau_i - 1} \quad (129) $$
where
$$ \tau_i = E[N_i] + \tau_0 \quad (130) $$
$$ C(\tau) = \frac{\Gamma\left(\sum_{i=1}^{M} \tau_i\right)}{\prod_{i=1}^{M} \Gamma(\tau_i)} . \quad (131) $$
To compute the optimum for $q^*(\tilde{v}'_r)$ we again introduce the parameter $\eta$ to control the weight of the supervised dataset:
$$ \ln q^*(\tilde{v}'_r) = E_{Y,Y_d,\theta,\pi_\theta,W,\alpha,\tilde{v}'_{s\neq r}}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha)] + \mathrm{const} \quad (132) $$
$$ = E_{Y,\theta,W,\tilde{v}'_{s\neq r}}[\ln P(\Phi|Y,\theta,\mu,V,W)] + \eta E_{Y_d,W,\tilde{v}'_{s\neq r}}[\ln P(\Phi_d|Y_d,\mu,V,W)] + E_{\alpha,\tilde{v}'_{s\neq r}}[\ln P(V|\alpha)] + E_{\mu_{s\neq r}}[\ln P(\mu)] + \mathrm{const} \quad (133) $$
$$ = -\frac{1}{2} \mathrm{tr}\left( -2\left( \bar{w}_{rr} C_r + \sum_{s\neq r} \bar{w}_{rs}\left(C_s - E[\tilde{v}'_s]^T R'_{\tilde{y}}\right) + \beta_r \tilde{\mu}_{0r}^T \right) \tilde{v}'_r + \tilde{v}'_r \tilde{v}'^T_r \left(\mathrm{diag}(\tilde{\alpha}_r) + \bar{w}_{rr} R'_{\tilde{y}}\right) \right) + \mathrm{const} \quad (134) $$
where $\bar{w}_{rs}$ is the element $(r,s)$ of $E[W]$,
$$ C_{\tilde{y}} = \sum_{i=1}^{M} E[F_i] E[\tilde{y}_i]^T \quad (135) $$
$$ R_{\tilde{y}} = \sum_{i=1}^{M} E[N_i] E\left[\tilde{y}_i \tilde{y}_i^T\right] \quad (136) $$
$$ C_{\tilde{y}_d} = \sum_{i=1}^{M_d} F_{d_i} E[\tilde{y}_{d_i}]^T \quad (137) $$
$$ R_{\tilde{y}_d} = \sum_{i=1}^{M_d} N_{d_i} E\left[\tilde{y}_{d_i} \tilde{y}_{d_i}^T\right] \quad (138) $$
$$ C'_{\tilde{y}} = C_{\tilde{y}} + \eta C_{\tilde{y}_d} \quad (139) $$
$$ R'_{\tilde{y}} = R_{\tilde{y}} + \eta R_{\tilde{y}_d} \quad (140) $$
$$ \tilde{\mu}_{0r} = \begin{pmatrix} 0_{n_y \times 1} \\ \mu_{0r} \end{pmatrix} , \qquad \tilde{\alpha}_r = \begin{pmatrix} E[\alpha] \\ \beta_r \end{pmatrix} \quad (141) $$
and $C_r$ is the $r$th row of $C'_{\tilde{y}}$. Then $q^*(\tilde{v}'_r)$ is a Gaussian distribution:
$$ q^*(\tilde{v}'_r) = \mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, L_{\tilde{V}_r}^{-1}\right) \quad (142) $$
$$ L_{\tilde{V}_r} = \mathrm{diag}(\tilde{\alpha}_r) + \bar{w}_{rr} R'_{\tilde{y}} \quad (143) $$
$$ \bar{\tilde{v}}'_r = L_{\tilde{V}_r}^{-1}\left( \bar{w}_{rr} C_r^T + \sum_{s\neq r} \bar{w}_{rs}\left(C_s^T - R'_{\tilde{y}} \bar{\tilde{v}}'_s\right) + \beta_r \tilde{\mu}_{0r} \right) \quad (144) $$
The optimum for $q^*(\alpha)$ is
$$ \ln q^*(\alpha) = E_{Y,Y_d,\theta,\pi_\theta,\mu,V,W}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha)] + \mathrm{const} \quad (145) $$
$$ = E_V[\ln P(V|\alpha)] + \ln P(\alpha|a_\alpha, b_\alpha) + \mathrm{const} \quad (146) $$
$$ = \sum_{q=1}^{n_y} \left(\frac{d}{2} + a_\alpha - 1\right) \ln \alpha_q - \alpha_q \left(b_\alpha + \frac{1}{2} E\left[\mathbf{v}_q^T \mathbf{v}_q\right]\right) + \mathrm{const} \quad (147)-(148) $$
Then $q^*(\alpha)$ is a product of Gammas:
$$ q^*(\alpha) = \prod_{q=1}^{n_y} \mathcal{G}\left(\alpha_q|a'_\alpha, b'_{\alpha_q}\right) \quad (149) $$
$$ a'_\alpha = a_\alpha + \frac{d}{2} \quad (150) $$
$$ b'_{\alpha_q} = b_\alpha + \frac{1}{2} E\left[\mathbf{v}_q^T \mathbf{v}_q\right] \quad (151) $$

The optimum for $q^*(W)$ is
$$ \ln q^*(W) = E_{Y,Y_d,\theta,\pi_\theta,\mu,V,\alpha}[\ln P(\Phi, \Phi_d, Y, Y_d, \theta, \pi_\theta, \mu, V, W, \alpha)] + \mathrm{const} \quad (152) $$
$$ = E_{Y,\theta,\mu,V}[\ln P(\Phi|Y,\theta,\mu,V,W)] + \eta E_{Y_d,\mu,V}[\ln P(\Phi_d|Y_d,\mu,V,W)] + \ln P(W) + \mathrm{const} \quad (153) $$
$$ = \frac{N'}{2} \ln|W| - \frac{d+1}{2} \ln|W| - \frac{1}{2} \mathrm{tr}(WK) + \mathrm{const} \quad (154) $$
where
$$ N' = E[N] + \eta N_d \quad (155) $$
$$ K = E[S] + \eta S_d - C'_{\tilde{y}} E[\tilde{V}]^T - E[\tilde{V}] C'^T_{\tilde{y}} + E\left[\tilde{V} R'_{\tilde{y}} \tilde{V}^T\right] \quad (156) $$
Then $q^*(W)$ is Wishart distributed:
$$ q^*(W) = \mathcal{W}\left(W|K^{-1}, N'\right) \quad \text{if } N' > d . \quad (157) $$

Finally, we evaluate the expectations:
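Before listing the expectations, here is a sketch of the Gamma and Wishart parameter updates (150)–(151) and (155)–(157); `E_vqvq` stands for the precomputed vector of $E[\mathbf{v}_q^T \mathbf{v}_q]$ values from Eqs. (167)–(168), and all names are illustrative.

```python
import numpy as np

def update_q_alpha(a_alpha, b_alpha, E_vqvq, d):
    """Gamma posterior parameters, Eqs. (150)-(151)."""
    a_prime = a_alpha + 0.5 * d                # Eq. (150)
    b_prime = b_alpha + 0.5 * E_vqvq           # Eq. (151), one value per column q
    return a_prime, b_prime

def update_q_W(E_S, S_d, C_p, E_V_tilde, E_VRV, E_N, N_d, eta):
    """Wishart posterior parameters, Eqs. (155)-(157)."""
    N_prime = E_N + eta * N_d                                            # Eq. (155)
    K = E_S + eta * S_d - C_p @ E_V_tilde.T - E_V_tilde @ C_p.T + E_VRV  # Eq. (156)
    return np.linalg.inv(K), N_prime       # scale matrix K^{-1} and dof, Eq. (157)
```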
$$ E[y_i] = \bar{y}_i \quad (158) $$
$$ E\left[y_i y_i^T\right] = L_{y_i}^{-1} + \bar{y}_i \bar{y}_i^T \quad (159) $$
$$ E\left[\tilde{y}_i \tilde{y}_i^T\right] = \begin{pmatrix} E\left[y_i y_i^T\right] & E[y_i] \\ E[y_i]^T & 1 \end{pmatrix} \quad (160) $$
$$ E[\theta_{ji}] = r_{ji} \quad (161) $$
$$ E[\pi_{\theta_i}] = \frac{\tau_i}{\sum_{i=1}^{M} \tau_i} \quad (162) $$
$$ E[\ln \pi_{\theta_i}] = \psi(\tau_i) - \psi\left(\sum_{i=1}^{M} \tau_i\right) \quad (163) $$
$$ E[\alpha_q] = \frac{a'_\alpha}{b'_{\alpha_q}} \quad (164) $$
$$ \bar{\tilde{V}} = E[\tilde{V}] = \begin{pmatrix} \bar{\tilde{v}}'^T_1 \\ \bar{\tilde{v}}'^T_2 \\ \vdots \\ \bar{\tilde{v}}'^T_d \end{pmatrix} \quad (165) $$
$$ \bar{W} = E[W] = N' K^{-1} \quad (166) $$
$$ E\left[\mathbf{v}_q^T \mathbf{v}_q\right] = \sum_{r=1}^{d} E\left[v'_{rq} v'_{rq}\right] \quad (167) $$
$$ = \sum_{r=1}^{d} \left(\left[L_{\tilde{V}_r}^{-1}\right]_{qq} + \bar{v}'^2_{rq}\right) \quad (168) $$
$$ E\left[V^T W V\right] = \sum_{r=1}^{d} \sum_{s=1}^{d} \bar{w}_{rs} E\left[v'_r v'^T_s\right] \quad (169)-(170) $$
$$ = \sum_{r=1}^{d} \bar{w}_{rr} \Sigma_{V_r} + \bar{V}^T \bar{W} \bar{V} \quad (171) $$
$$ E\left[V^T W \mu\right] = \sum_{r=1}^{d} \bar{w}_{rr} \Sigma_{V\mu_r} + \bar{V}^T \bar{W} \bar{\mu} \quad (172) $$
$$ E\left[\tilde{V}^T W \tilde{V}\right] = \sum_{r=1}^{d} \bar{w}_{rr} \Sigma_{\tilde{V}_r} + \bar{\tilde{V}}^T \bar{W} \bar{\tilde{V}} \quad (173) $$
$$ E\left[(\phi_j - \tilde{V}\tilde{y}_i)^T W (\phi_j - \tilde{V}\tilde{y}_i)\right] = \phi_j^T \bar{W} \phi_j - 2 \phi_j^T \bar{W} \bar{\tilde{V}} \bar{\tilde{y}}_i + \mathrm{tr}\left(E\left[\tilde{V}^T W \tilde{V}\right] E\left[\tilde{y}_i \tilde{y}_i^T\right]\right) \quad (174) $$
$$ E\left[\tilde{V} R'_{\tilde{y}} \tilde{V}^T\right] = \sum_{r=1}^{n_y} \sum_{s=1}^{n_y} \bar{r}_{\tilde{y}_{rs}} E\left[\tilde{v}_r \tilde{v}_s^T\right] \quad (175) $$
$$ = \bar{\tilde{V}} R'_{\tilde{y}} \bar{\tilde{V}}^T + \mathrm{diag}(\rho) \quad (176) $$
where
$$ \Sigma_{\tilde{V}_r} = \begin{pmatrix} \Sigma_{V_r} & \Sigma_{V\mu_r} \\ \Sigma_{V\mu_r}^T & \Sigma_{\mu_r} \end{pmatrix} = L_{\tilde{V}_r}^{-1} \quad (177) $$
with $\Sigma_{V_r}$ being the upper-left $n_y \times n_y$ block,
$$ \rho = \begin{pmatrix} \rho_1 & \rho_2 & \dots & \rho_d \end{pmatrix}^T \quad (178) $$
$$ \rho_i = \sum_{r=1}^{n_y} \sum_{s=1}^{n_y} \left[R'_{\tilde{y}} \circ L_{\tilde{V}_i}^{-1}\right]_{rs} \quad (179) $$
and $\circ$ is the Hadamard product.
4.2.1 Distributions with deterministic annealing

If we use deterministic annealing with parameter $\kappa$, we have
$$ q^*(Y, Y_d) = \prod_{i=1}^{M} \mathcal{N}\left(y_i|\bar{y}_i, 1/\kappa\, L_{y_i}^{-1}\right) \prod_{i=1}^{M_d} \mathcal{N}\left(y_{d_i}|\bar{y}_{d_i}, 1/\kappa\, L_{y_{d_i}}^{-1}\right) \quad (180) $$
$$ q^*(\theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} r_{ji}^{\theta_{ji}} \quad (181) $$
where
$$ r_{ji} = \frac{\varrho_{ji}^\kappa}{\sum_{i=1}^{M} \varrho_{ji}^\kappa} \quad (182) $$
$$ q^*(\pi_\theta) = \mathrm{Dir}(\pi_\theta|\tau) = C(\tau) \prod_{i=1}^{M} \pi_{\theta_i}^{\tau_i - 1} \quad (183) $$
where
$$ \tau_i = \kappa(E[N_i] + \tau_0 - 1) + 1 \quad (184) $$
$$ q^*(\tilde{v}'_r) = \mathcal{N}\left(\tilde{v}'_r|\bar{\tilde{v}}'_r, 1/\kappa\, L_{\tilde{V}_r}^{-1}\right) \quad (185) $$
$$ q^*(W) = \mathcal{W}\left(W|1/\kappa\, K^{-1}, \kappa(N' - d - 1) + d + 1\right) \quad \text{if } \kappa(N' - d - 1) + 1 > 0 \quad (186) $$
$$ q^*(\alpha) = \prod_{q=1}^{n_y} \mathcal{G}\left(\alpha_q|a'_\alpha, b'_{\alpha_q}\right) \quad (187) $$
$$ a'_\alpha = \kappa\left(a_\alpha + \frac{d}{2} - 1\right) + 1 \quad (188) $$
$$ b'_{\alpha_q} = \kappa\left(b_\alpha + \frac{1}{2} E\left[\mathbf{v}_q^T \mathbf{v}_q\right]\right) . \quad (189) $$
4.3 Variational lower bound
The lower bound is given by
$$ \mathcal{L} = E_{Y,\theta,\mu,V,W}[\ln P(\Phi|Y,\theta,\mu,V,W)] + E_Y[\ln P(Y)] + E_{\theta,\pi_\theta}[\ln P(\theta|\pi_\theta)] + E_{\pi_\theta}[\ln P(\pi_\theta)] + E_{V,\alpha}[\ln P(V|\alpha)] + E_\alpha[\ln P(\alpha)] + E_\mu[\ln P(\mu)] + E_W[\ln P(W)] + \eta E_{Y_d,\mu,V,W}[\ln P(\Phi_d|Y_d,\mu,V,W)] + \eta E_{Y_d}[\ln P(Y_d)] - E_Y[\ln q(Y)] - E_\theta[\ln q(\theta)] - E_{\pi_\theta}[\ln q(\pi_\theta)] - E_{\tilde{V}}[\ln q(\tilde{V})] - E_\alpha[\ln q(\alpha)] - E_W[\ln q(W)] - \eta E_{Y_d}[\ln q(Y_d)] . \quad (190) $$

The term $E_{Y,\theta,\mu,V,W}[\ln P(\Phi|Y,\theta,\mu,V,W)]$:
$$ E_{Y,\theta,\mu,V,W}[\ln P(\Phi|Y,\theta,\mu,V,W)] = \frac{E[N]}{2} E[\ln|W|] - \frac{E[N] d}{2} \ln(2\pi) - \frac{1}{2} \mathrm{tr}\left(\bar{W}\left(E[S] - 2 C_{\tilde{y}} \bar{\tilde{V}}^T + E\left[\tilde{V} R_{\tilde{y}} \tilde{V}^T\right]\right)\right) \quad (191) $$
$$ = \frac{E[N]}{2} \overline{\ln W} - \frac{E[N] d}{2} \ln(2\pi) - \frac{1}{2} \mathrm{tr}\left(\bar{W} E[S]\right) - \frac{1}{2} \mathrm{tr}\left(-2 \bar{\tilde{V}}^T \bar{W} C_{\tilde{y}} + E\left[\tilde{V}^T W \tilde{V}\right] R_{\tilde{y}}\right) \quad (192) $$
where
$$ \overline{\ln W} = E[\ln|W|] = \sum_{i=1}^{d} \psi\left(\frac{N' + 1 - i}{2}\right) + d \ln 2 + \ln\left|K^{-1}\right| \quad (193)-(194) $$
and $\psi$ is the digamma function.

The term $E_{Y_d,\mu,V,W}[\ln P(\Phi_d|Y_d,\mu,V,W)]$:
$$ E_{Y_d,\mu,V,W}[\ln P(\Phi_d|Y_d,\mu,V,W)] = \frac{N_d}{2} E[\ln|W|] - \frac{N_d d}{2} \ln(2\pi) - \frac{1}{2} \mathrm{tr}\left(\bar{W}\left(S_d - 2 C_{\tilde{y}_d} \bar{\tilde{V}}^T + E\left[\tilde{V} R_{\tilde{y}_d} \tilde{V}^T\right]\right)\right) \quad (195) $$
$$ = \frac{N_d}{2} \overline{\ln W} - \frac{N_d d}{2} \ln(2\pi) - \frac{1}{2} \mathrm{tr}\left(\bar{W} S_d\right) - \frac{1}{2} \mathrm{tr}\left(-2 \bar{\tilde{V}}^T \bar{W} C_{\tilde{y}_d} + E\left[\tilde{V}^T W \tilde{V}\right] R_{\tilde{y}_d}\right) \quad (196) $$

The term $E_{V,\alpha}[\ln P(V|\alpha)]$:
$$ E_{V,\alpha}[\ln P(V|\alpha)] = -\frac{n_y d}{2} \ln(2\pi) + \frac{d}{2} \sum_{q=1}^{n_y} E[\ln \alpha_q] - \frac{1}{2} \sum_{q=1}^{n_y} E[\alpha_q] E\left[\mathbf{v}_q^T \mathbf{v}_q\right] \quad (197) $$
where
$$ E[\ln \alpha_q] = \psi(a'_\alpha) - \ln b'_{\alpha_q} . \quad (198) $$

The term $E_\alpha[\ln P(\alpha)]$:
$$ E_\alpha[\ln P(\alpha)] = n_y (a_\alpha \ln b_\alpha - \ln \Gamma(a_\alpha)) + (a_\alpha - 1) \sum_{q=1}^{n_y} E[\ln \alpha_q] - b_\alpha \sum_{q=1}^{n_y} E[\alpha_q] \quad (199) $$

The term $E_\mu[\ln P(\mu)]$:
$$ E_\mu[\ln P(\mu)] = -\frac{d}{2} \ln(2\pi) + \frac{1}{2} \sum_{r=1}^{d} \ln \beta_r - \frac{1}{2} \sum_{r=1}^{d} \beta_r \left(\Sigma_{\mu_r} + E[\mu_r]^2 - 2\mu_{0r} E[\mu_r] + \mu_{0r}^2\right) \quad (200) $$

The term $E_W[\ln P(W)]$:
$$ E_W[\ln P(W)] = -\frac{d+1}{2} \overline{\ln W} \quad (201) $$

The term $E_{\tilde{V}}[\ln q(\tilde{V})]$:
$$ E_{\tilde{V}}[\ln q(\tilde{V})] = -\frac{d(n_y + 1)}{2} (\ln(2\pi) + 1) + \frac{1}{2} \sum_{r=1}^{d} \ln\left|L_{\tilde{V}_r}\right| \quad (202) $$

The term $E_\alpha[\ln q(\alpha)]$:
$$ E_\alpha[\ln q(\alpha)] = -\sum_{q=1}^{n_y} H[q(\alpha_q)] \quad (203) $$
$$ = n_y \left((a'_\alpha - 1)\psi(a'_\alpha) - a'_\alpha - \ln \Gamma(a'_\alpha)\right) + \sum_{q=1}^{n_y} \ln b'_{\alpha_q} \quad (204) $$

The term $E_W[\ln q(W)]$:
$$ E_W[\ln q(W)] = -H[q(W)] \quad (205) $$
$$ = \ln B\left(K^{-1}, N'\right) + \frac{N' - d - 1}{2} \overline{\ln W} - \frac{N' d}{2} \quad (206) $$
where
$$ B(A, N) = |A|^{-N/2} \left(2^{Nd/2} Z_{Nd}\right)^{-1} \quad (207) $$
$$ Z_{Nd} = \pi^{d(d-1)/4} \prod_{i=1}^{d} \Gamma((N + 1 - i)/2) \quad (208) $$
The expressions for the terms EY [ln P (Y)], EYd [ln P (Yd )], Eθ,πθ [ln P (θ|πθ )], Eπθ [ln P (πθ )], EY [ln q (Y)], EYd [ln q (Yd )], Eθ [ln q (θ)] and Eπθ [ln q (πθ )] are the same as the ones in Section 3.2.
4.4 Hyperparameter optimisation

We can set the hyperparameters $(\tau_0, \mu_0, \beta, a_\alpha, b_\alpha)$ manually or estimate them from the development data by maximising the lower bound. $\tau_0$ can be derived as shown in Section 3.3. We derive for $a_\alpha$:
$$ \frac{\partial \mathcal{L}}{\partial a_\alpha} = n_y (\ln b_\alpha - \psi(a_\alpha)) + \sum_{q=1}^{n_y} E[\ln \alpha_q] = 0 \implies \quad (209) $$
$$ \psi(a_\alpha) = \ln b_\alpha + \frac{1}{n_y} \sum_{q=1}^{n_y} E[\ln \alpha_q] \quad (210) $$
We derive for $b_\alpha$:
$$ \frac{\partial \mathcal{L}}{\partial b_\alpha} = \frac{n_y a_\alpha}{b_\alpha} - \sum_{q=1}^{n_y} E[\alpha_q] = 0 \implies \quad (211) $$
$$ b_\alpha = \left(\frac{1}{n_y a_\alpha} \sum_{q=1}^{n_y} E[\alpha_q]\right)^{-1} \quad (212) $$
We solve these equations with the procedure described in [4]. We write
$$ \psi(a) = \ln b + c \quad (213) $$
$$ b = \frac{a}{d} \quad (214) $$
where
$$ c = \frac{1}{n_y} \sum_{q=1}^{n_y} E[\ln \alpha_q] \quad (215) $$
$$ d = \frac{1}{n_y} \sum_{q=1}^{n_y} E[\alpha_q] . \quad (216) $$
Then
$$ f(a) = \psi(a) - \ln a + \ln d - c = 0 . \quad (217) $$
We can solve for $a$ using Newton-Raphson iterations:
$$ a^{\mathrm{new}} = a - \frac{f(a)}{f'(a)} \quad (218) $$
$$ = a \left(1 - \frac{\psi(a) - \ln a + \ln d - c}{a\psi'(a) - 1}\right) . \quad (219) $$
This algorithm does not assure that $a$ remains positive; we can enforce a minimum value for $a$. Alternatively, we can solve the equation for $\tilde{a}$ such that $a = \exp(\tilde{a})$:
$$ \tilde{a}^{\mathrm{new}} = \tilde{a} - \frac{f(\tilde{a})}{f'(\tilde{a})} \quad (220) $$
$$ = \tilde{a} - \frac{\psi(a) - \ln a + \ln d - c}{\psi'(a)a - 1} . \quad (221) $$
Taking exponentials on both sides:
$$ a^{\mathrm{new}} = a \exp\left(-\frac{\psi(a) - \ln a + \ln d - c}{\psi'(a)a - 1}\right) . \quad (222) $$
(223) (224)
We derive for β: ∂L =0 ∂β
=⇒
(225) 2
βr−1 =Σµr + E [µr ] − 2µ0r E [µr ] + µ20r
(226)
If we take an isotropic prior for µ: d
β −1 =
4.5
1X Σµ + E [µr ]2 − 2µ0r E [µr ] + µ20r d r=1 r
(227)
Some ideas
What we expect from this model is:
• We expect that by taking into account the full posterior of the SPLDA parameters, we will obtain a better estimation of the labels and the number of speakers.
• The variances of $V$ and $W$ decrease as the number of speakers and segments, respectively, grows. Thus, we expect a larger improvement in cases where we have scarce adaptation data.
• We can analyse how the labels affect the posteriors of the parameters. I have the intuition that if the labels are wrong, the variance of $V$ should be larger than if the labels are right.
• From $q(\alpha)$ we can infer the best value for $n_y$. If $E[\alpha_q]$ (the prior precision of $\mathbf{v}_q$) is large, $\mathbf{v}_q$ will tend to be small, as can be seen in Equation (106).
References

[1] Jesús Villalba, "SPLDA," Tech. Rep., University of Zaragoza, Zaragoza, Spain, July 2011.
[2] Christopher Bishop, "Variational principal components," in Proceedings of the 9th International Conference on Artificial Neural Networks, ICANN 99, Edinburgh, Scotland, Sept. 1999, IET, vol. 1, pp. 509–514.
[3] Jesús Villalba, "Fully Bayesian Two-Covariance Model," Tech. Rep., University of Zaragoza, Zaragoza, Spain, 2010.
[4] Matthew J. Beal, Variational algorithms for approximate Bayesian inference, Ph.D. thesis, University College London, 2003.