PLDA with Two Sources of Inter-session Variability

Jesús Villalba
Communications Technology Group (GTC), Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain

arXiv:1511.06772v1 [stat.ML] 20 Nov 2015

[email protected]

Nov 6, 2012

1 The Model

1.1 PLDA

We take a linear-Gaussian generative model M. We suppose that we have i-vectors of the same conversation recorded simultaneously by different channels or under different noise conditions. Then, an i-vector φ_{ijl} of speaker i, session j, recorded in channel l can be written as

\phi_{ijl} = \mu + V y_i + U x_{ij} + \epsilon_{ijl}   (1)

where µ is a speaker-independent term, V is the eigenvoices matrix, y_i is the speaker factor vector, U is the eigenchannels matrix, x_{ij} is the channel factor vector, and ε_{ijl} is a channel offset. The term x_{ij} must be the same for all the recordings of the same conversation. The term ε_{ijl} accounts for the channel variability. We assume the following priors for the variables:

y \sim \mathcal{N}(y|0, I)   (2)

x \sim \mathcal{N}(x|0, I)   (3)

\epsilon \sim \mathcal{N}(\epsilon|0, W^{-1})   (4)

where \mathcal{N} denotes a Gaussian distribution and W is a full-rank precision matrix. φ is an observable variable, and y and x are hidden variables.
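As an illustration, the following minimal NumPy sketch draws i-vectors from this generative model for a single speaker. The dimensions, the random parameters and the function name are arbitrary choices made for the example; they only show how y_i, x_{ij} and ε_{ijl} enter equation (1).

import numpy as np

rng = np.random.default_rng(0)
d, ny, nx = 100, 50, 30          # i-vector, speaker-factor and channel-factor dimensions (illustrative)
mu = rng.normal(size=d)          # speaker-independent offset
V = rng.normal(size=(d, ny))     # eigenvoices
U = rng.normal(size=(d, nx))     # eigenchannels
W = np.eye(d)                    # within-class precision (full rank)

def sample_speaker(H_i, L_ij):
    """Draw i-vectors for one speaker with H_i sessions, each seen through L_ij channels."""
    y = rng.normal(size=ny)                       # speaker factor, shared by all sessions
    Phi = []
    for j in range(H_i):
        x = rng.normal(size=nx)                   # channel factor, shared within session j
        for l in range(L_ij):
            eps = rng.multivariate_normal(np.zeros(d), np.linalg.inv(W))
            Phi.append(mu + V @ y + U @ x + eps)  # equation (1)
    return np.array(Phi)

phi = sample_speaker(H_i=3, L_ij=2)   # 6 i-vectors: 3 sessions, 2 channels each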

1.2 Notation

We are going to introduce some notation:

• Let Φ_d be the development i-vectors dataset.
• Let Φ_t = {l, r} be the test i-vectors.
• Let Φ be any of the previous datasets.
• Let θ_d be the labelling of the development dataset. It partitions the N_d i-vectors into M_d speakers. Each speaker has H_i sessions, and each session can be recorded by L_{ij} different channels.
• Let θ_t be the labelling of the test set, so that θ_t ∈ {T, N}, where T is the hypothesis that l and r belong to the same speaker and N is the hypothesis that they belong to different speakers.
• Let θ be any of the previous labellings.
• Let Φ_i be the i-vectors belonging to speaker i.


• Let Y_d be the speaker identity variables of the development set. We will have as many identity variables as speakers.
• Let Y_t be the speaker identity variables of the test set.
• Let Y be any of the previous sets of speaker identity variables.
• Let X_d be the channel variables of the development set.
• Let X_t be the channel variables of the test set.
• Let X be any of the previous sets of channel variables.
• Let X_i = [x_{i1}, x_{i2}, ..., x_{iH_i}] be the channel variables of speaker i.
• Let M = (µ, V, U, W) be the set of all the model parameters.

2 Likelihood calculations

2.1 Definitions

We define the sufficient statistics for speaker i. The zero-order statistic is the number of observations of speaker i, N_i. The first-order and second-order statistics are

F_i = \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} \phi_{ijl}   (5)

S_i = \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} \phi_{ijl} \phi_{ijl}^T   (6)

We define the centered statistics as

\bar{F}_i = F_i - N_i \mu   (7)

\bar{S}_i = \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} (\phi_{ijl} - \mu)(\phi_{ijl} - \mu)^T = S_i - \mu F_i^T - F_i \mu^T + N_i \mu \mu^T   (8)

We define the session statistics as

F_{ij} = \sum_{l=1}^{L_{ij}} \phi_{ijl}   (9)

\bar{F}_{ij} = F_{ij} - L_{ij} \mu   (10)

where L_{ij} is the number of channels for conversation ij. We define the global statistics

N = \sum_{i=1}^{M} N_i   (11)

F = \sum_{i=1}^{M} F_i   (12)

\bar{F} = \sum_{i=1}^{M} \bar{F}_i   (13)

S = \sum_{i=1}^{M} S_i   (14)

\bar{S} = \sum_{i=1}^{M} \bar{S}_i   (15)
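These statistics translate directly into code. The sketch below is illustrative only and assumes each speaker's i-vectors are stored as a list of per-session arrays of shape (L_ij, d); the global statistics (11)-(15) are simply the sums of these quantities over speakers.

import numpy as np

def speaker_stats(sessions, mu):
    """sessions: list of arrays, one per session, each of shape (L_ij, d)."""
    N_i = sum(len(s) for s in sessions)                    # zero-order statistic
    F_i = sum(s.sum(axis=0) for s in sessions)             # first-order statistic, eq. (5)
    S_i = sum(s.T @ s for s in sessions)                   # second-order statistic, eq. (6)
    Fc_i = F_i - N_i * mu                                  # centered first-order, eq. (7)
    Sc_i = S_i - np.outer(mu, F_i) - np.outer(F_i, mu) + N_i * np.outer(mu, mu)   # eq. (8)
    F_ij = [s.sum(axis=0) for s in sessions]               # session statistics, eq. (9)
    Fc_ij = [f - len(s) * mu for f, s in zip(F_ij, sessions)]   # centered session stats, eq. (10)
    L_ij = [len(s) for s in sessions]                      # channels per session
    return N_i, Fc_i, Sc_i, Fc_ij, L_ij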


2.2 Data conditional likelihood

The likelihood of the data given the hidden variables for speaker i is

\ln P(\Phi_i|y_i, X_i, \mathcal{M}) = \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} \ln \mathcal{N}(\phi_{ijl}|\mu + V y_i + U x_{ij}, W^{-1})   (16)

= \frac{N_i}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} (\phi_{ijl} - \mu - V y_i - U x_{ij})^T W (\phi_{ijl} - \mu - V y_i - U x_{ij})   (17)

= \frac{N_i}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \mathrm{tr}(W \bar{S}_i) + y_i^T V^T W \bar{F}_i - \frac{N_i}{2} y_i^T V^T W V y_i + \sum_{j=1}^{H_i} \left( x_{ij}^T U^T W \bar{F}_{ij} - L_{ij} y_i^T V^T W U x_{ij} - \frac{L_{ij}}{2} x_{ij}^T U^T W U x_{ij} \right)   (18)

We can write this likelihood in another form if we define

\tilde{y}_{ij} = \begin{bmatrix} y_i \\ x_{ij} \\ 1 \end{bmatrix}, \qquad \tilde{V} = \begin{bmatrix} V & U & \mu \end{bmatrix}   (19)

Then

\ln P(\Phi_i|y_i, X_i, \mathcal{M}) = \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} \ln \mathcal{N}(\phi_{ijl}|\tilde{V} \tilde{y}_{ij}, W^{-1})   (20)

= \frac{N_i}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \sum_{j=1}^{H_i} \sum_{l=1}^{L_{ij}} (\phi_{ijl} - \tilde{V} \tilde{y}_{ij})^T W (\phi_{ijl} - \tilde{V} \tilde{y}_{ij})   (21)

= \frac{N_i}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \mathrm{tr}(W S_i) + \sum_{j=1}^{H_i} \left( \tilde{y}_{ij}^T \tilde{V}^T W F_{ij} - \frac{L_{ij}}{2} \tilde{y}_{ij}^T \tilde{V}^T W \tilde{V} \tilde{y}_{ij} \right)   (22)

= \frac{N_i}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \mathrm{tr}\left( W \left( S_i + \sum_{j=1}^{H_i} \left( -2 F_{ij} \tilde{y}_{ij}^T \tilde{V}^T + L_{ij} \tilde{V} \tilde{y}_{ij} \tilde{y}_{ij}^T \tilde{V}^T \right) \right) \right)   (23)

2.3 Posterior of the hidden variables

The posterior of the hidden variables can be decomposed into two factors:

P(y_i, X_i|\Phi_i, \mathcal{M}) = P(X_i|y_i, \Phi_i, \mathcal{M}) P(y_i|\Phi_i, \mathcal{M})   (24)

2.3.1 Conditional posterior of X_i

The conditional posterior of X_i is

P(X_i|y_i, \Phi_i, \mathcal{M}) = \frac{P(\Phi_i|y_i, X_i, \mathcal{M}) P(X_i)}{P(\Phi_i|y_i, \mathcal{M})}   (25)

Using equations (3) and (18),

\ln P(X_i|y_i, \Phi_i, \mathcal{M}) = \ln P(\Phi_i|y_i, X_i, \mathcal{M}) + \ln P(X_i|\mathcal{M}) + \text{const}   (26)

= \sum_{j=1}^{H_i} \left( x_{ij}^T U^T W \bar{F}_{ij} - L_{ij} y_i^T V^T W U x_{ij} - \frac{L_{ij}}{2} x_{ij}^T U^T W U x_{ij} - \frac{1}{2} x_{ij}^T x_{ij} \right) + \text{const}   (27)

= \sum_{j=1}^{H_i} \left( x_{ij}^T U^T W (\bar{F}_{ij} - L_{ij} V y_i) - \frac{1}{2} x_{ij}^T L_{x_{ij}} x_{ij} \right) + \text{const}   (28)

= \sum_{j=1}^{H_i} \left( x_{ij}^T \zeta_{ij} - \frac{1}{2} x_{ij}^T L_{x_{ij}} x_{ij} \right) + \text{const}   (29)

where

\zeta_{ij} = U^T W (\bar{F}_{ij} - L_{ij} V y_i) = \tilde{\zeta}_{ij} - L_{ij} J y_i   (30)

\tilde{\zeta}_{ij} = U^T W \bar{F}_{ij}   (31)

J = U^T W V   (32)

L_{x_{ij}} = I + L_{ij} U^T W U   (33)

Equation (29) has the form of a product of Gaussian distributions. Therefore

P(X_i|y_i, \Phi_i, \mathcal{M}) = \prod_{j=1}^{H_i} \mathcal{N}(x_{ij}|\bar{x}_{ij}, L_{x_{ij}}^{-1})   (34)

where

\bar{x}_{ij} = L_{x_{ij}}^{-1} \zeta_{ij}   (35)

2.3.2 Posterior of y_i

The marginal posterior of y is

P(y|\Phi_i, \mathcal{M}) = \frac{P(\Phi_i|y, \mathcal{M}) P(y)}{P(\Phi_i|\mathcal{M})}   (36)

We can use Bayes' theorem to write

P(\Phi_i, X_i|y_i, \mathcal{M}) = P(\Phi_i|y_i, X_i, \mathcal{M}) P(X_i|y_i, \mathcal{M}) = P(X_i|\Phi_i, y_i, \mathcal{M}) P(\Phi_i|y_i, \mathcal{M})   (37)

Simplifying,

P(\Phi_i|y_i, X_i, \mathcal{M}) P(X_i) = P(X_i|\Phi_i, y_i, \mathcal{M}) P(\Phi_i|y_i, \mathcal{M})   (38)

Then

P(y|\Phi_i, \mathcal{M}) = \left. \frac{P(\Phi_i|y_i, X_i, \mathcal{M}) P(X_i) P(y)}{P(X_i|\Phi_i, y_i, \mathcal{M}) P(\Phi_i|\mathcal{M})} \right|_{X_i = 0}   (39)

Using equations (2), (18) and (34),

\ln P(y_i|\Phi_i, \mathcal{M}) = \ln P(\Phi_i|y_i, X_i, \mathcal{M}) + \ln P(y) - \ln P(X_i|\Phi_i, y_i, \mathcal{M}) + \text{const}   (40)

= y_i^T V^T W \bar{F}_i - \frac{N_i}{2} y_i^T V^T W V y_i - \frac{1}{2} y_i^T y_i + \frac{1}{2} \sum_{j=1}^{H_i} \bar{x}_{ij}^T L_{x_{ij}} \bar{x}_{ij} + \text{const}   (41)

= y_i^T V^T W \bar{F}_i - \frac{1}{2} y_i^T \left( I + N_i V^T W V \right) y_i + \frac{1}{2} \sum_{j=1}^{H_i} (\bar{F}_{ij} - L_{ij} V y_i)^T W U L_{x_{ij}}^{-1} U^T W (\bar{F}_{ij} - L_{ij} V y_i) + \text{const}   (42)

= y_i^T V^T W \bar{F}_i - \frac{1}{2} y_i^T \left( I + N_i V^T W V \right) y_i + \frac{1}{2} \sum_{j=1}^{H_i} \left( \bar{F}_{ij}^T W U L_{x_{ij}}^{-1} U^T W \bar{F}_{ij} - 2 L_{ij} y_i^T V^T W U L_{x_{ij}}^{-1} U^T W \bar{F}_{ij} + L_{ij}^2 y_i^T V^T W U L_{x_{ij}}^{-1} U^T W V y_i \right) + \text{const}   (43)

= y_i^T \left( V^T W \bar{F}_i - \sum_{j=1}^{H_i} L_{ij} V^T W U L_{x_{ij}}^{-1} U^T W \bar{F}_{ij} \right) - \frac{1}{2} y_i^T \left( I + V^T \left( N_i W - \sum_{j=1}^{H_i} L_{ij}^2 W U L_{x_{ij}}^{-1} U^T W \right) V \right) y_i + \text{const}   (44)

Then

P(y_i|\Phi_i, \mathcal{M}) = \mathcal{N}(y_i|\bar{y}_i, L_{y_i}^{-1})   (45)

where

L_{y_i} = I + V^T \left( N_i W - \sum_{j=1}^{H_i} L_{ij}^2 W U L_{x_{ij}}^{-1} U^T W \right) V   (46)

= I + N_i V^T W V - \sum_{j=1}^{H_i} L_{ij}^2 J^T L_{x_{ij}}^{-1} J   (47)

\gamma_i = V^T W \bar{F}_i - \sum_{j=1}^{H_i} L_{ij} V^T W U L_{x_{ij}}^{-1} U^T W \bar{F}_{ij} = \tilde{\gamma}_i - \sum_{j=1}^{H_i} L_{ij} J^T L_{x_{ij}}^{-1} \tilde{\zeta}_{ij}   (48)

\tilde{\gamma}_i = V^T W \bar{F}_i   (49)

\bar{y}_i = L_{y_i}^{-1} \gamma_i   (50)
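The posteriors above are closed form, so they map almost line by line onto code. The following sketch, under the same illustrative conventions as the previous snippets, computes L_{x_ij}, ζ̃_{ij}, J, L_{y_i}, γ_i and the posterior means for one speaker; it is meant to mirror equations (30)-(35) and (45)-(50), not to be an optimized implementation.

import numpy as np

def speaker_posterior(V, U, W, N_i, Fc_i, Fc_ij, L_ij):
    """Posterior of y_i and x_ij for one speaker, equations (30)-(35) and (45)-(50)."""
    ny = V.shape[1]
    J = U.T @ W @ V                                        # eq. (32)
    VtW = V.T @ W
    Lx = [np.eye(U.shape[1]) + L * U.T @ W @ U for L in L_ij]   # eq. (33)
    zeta_t = [U.T @ W @ f for f in Fc_ij]                  # eq. (31)
    # Precision and mean of P(y_i | Phi_i)
    Ly = np.eye(ny) + N_i * VtW @ V                        # first terms of eq. (47)
    gamma = VtW @ Fc_i                                     # eq. (49)
    for L, Lxj, zt in zip(L_ij, Lx, zeta_t):
        Lxj_inv = np.linalg.inv(Lxj)
        Ly -= L**2 * J.T @ Lxj_inv @ J                     # completes eq. (47)
        gamma -= L * J.T @ Lxj_inv @ zt                    # eq. (48)
    y_bar = np.linalg.solve(Ly, gamma)                     # eq. (50)
    # Conditional posterior means of x_ij with y_i set to its posterior mean, eq. (35)
    x_bar = [np.linalg.solve(Lxj, zt - L * J @ y_bar)
             for L, Lxj, zt in zip(L_ij, Lx, zeta_t)]
    return Ly, gamma, y_bar, Lx, zeta_t, J, x_bar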

2.4 Marginal likelihood of the data

The marginal likelihood of the data is

P(\Phi_i|\mathcal{M}) = \left. \frac{P(\Phi_i|y_i, X_i, \mathcal{M}) P(y_i) P(X_i)}{P(X_i|y_i, \Phi_i, \mathcal{M}) P(y_i|\Phi_i, \mathcal{M})} \right|_{y_i = 0,\, X_i = 0}   (51)

Taking equations (18), (2), (3), (34) and (45),

\ln P(\Phi_i|\mathcal{M}) = \frac{N_i}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \mathrm{tr}(W \bar{S}_i) - \frac{1}{2} \sum_{j=1}^{H_i} \ln |L_{x_{ij}}| + \frac{1}{2} \sum_{j=1}^{H_i} \tilde{\zeta}_{ij}^T L_{x_{ij}}^{-1} \tilde{\zeta}_{ij} - \frac{1}{2} \ln |L_{y_i}| + \frac{1}{2} \gamma_i^T L_{y_i}^{-1} \gamma_i   (52)
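Equation (52) can then be evaluated from the quantities returned by the posterior sketch above; it is also the per-speaker term of the EM objective of Section 3.4. The slogdet calls assume W, L_{x_ij} and L_{y_i} are positive definite.

import numpy as np

def speaker_marginal_loglik(W, N_i, Sc_i, Lx, zeta_t, Ly, gamma):
    """ln P(Phi_i | M), equation (52)."""
    d = W.shape[0]
    llk = 0.5 * N_i * (np.linalg.slogdet(W)[1] - d * np.log(2 * np.pi))
    llk -= 0.5 * np.trace(W @ Sc_i)
    for Lxj, zt in zip(Lx, zeta_t):
        llk -= 0.5 * np.linalg.slogdet(Lxj)[1]
        llk += 0.5 * zt @ np.linalg.solve(Lxj, zt)
    llk -= 0.5 * np.linalg.slogdet(Ly)[1]
    llk += 0.5 * gamma @ np.linalg.solve(Ly, gamma)
    return llk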

3 EM algorithm

3.1 E-step

In the E-step we calculate the posteriors of y and X with equation (24).

3.2 M-step ML

We maximize the EM auxiliary function Q(M):

Q(\mathcal{M}) = \sum_{i=1}^{M} E_{Y,X}[\ln P(\Phi_i, y_i, X_i|\mathcal{M})]   (53)

= \sum_{i=1}^{M} E_{Y,X}[\ln P(\Phi_i|y_i, X_i, \mathcal{M})] + E_{Y,X}[\ln P(y_i)] + E_{Y,X}[\ln P(X_i)]   (54)

Taking equation (23),

Q(\mathcal{M}) = \frac{N}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \mathrm{tr}\left( W \left( S + \sum_{i=1}^{M} \sum_{j=1}^{H_i} \left( -2 F_{ij} E_{Y,X}[\tilde{y}_{ij}]^T \tilde{V}^T + L_{ij} \tilde{V} E_{Y,X}[\tilde{y}_{ij} \tilde{y}_{ij}^T] \tilde{V}^T \right) \right) \right) + \text{const}   (55)

We define

R_{\tilde{y}} = \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} E_{Y,X}[\tilde{y}_{ij} \tilde{y}_{ij}^T]   (56)

C = \sum_{i=1}^{M} \sum_{j=1}^{H_i} F_{ij} E_{Y,X}[\tilde{y}_{ij}]^T   (57)

Then

Q(\mathcal{M}) = \frac{N}{2} \ln |W| - \frac{1}{2} \mathrm{tr}\left( W \left( S - 2 C \tilde{V}^T + \tilde{V} R_{\tilde{y}} \tilde{V}^T \right) \right) + \text{const}   (58)

\frac{\partial Q(\mathcal{M})}{\partial \tilde{V}} = C - \tilde{V} R_{\tilde{y}} = 0 \implies   (59)

\tilde{V} = C R_{\tilde{y}}^{-1}   (60)

\frac{\partial Q(\mathcal{M})}{\partial W} = \frac{N}{2} \left( 2 W^{-1} - \mathrm{diag}(W^{-1}) \right) - \frac{1}{2} \left( K + K^T - \mathrm{diag}(K) \right) = 0   (61)

where K = S - 2 C \tilde{V}^T + \tilde{V} R_{\tilde{y}} \tilde{V}^T, so

W^{-1} = \frac{1}{N} \frac{K + K^T}{2}   (62)

= \frac{1}{N} \left( S - \tilde{V} C^T - C \tilde{V}^T + \tilde{V} R_{\tilde{y}} \tilde{V}^T \right)   (63)

= \frac{1}{N} \left( S - \tilde{V} C^T \right)   (64)

Finally, we need to evaluate the expectations E_{Y,X}[\tilde{y}_{ij}] and E_{Y,X}[\tilde{y}_{ij} \tilde{y}_{ij}^T] and compute R_{\tilde{y}} and C.

C = \sum_{i=1}^{M} \sum_{j=1}^{H_i} F_{ij} E_{Y,X}[\tilde{y}_{ij}]^T = \sum_{i=1}^{M} \sum_{j=1}^{H_i} F_{ij} \begin{bmatrix} E_Y[y_i]^T & E_{Y,X}[x_{ij}]^T & 1 \end{bmatrix} = \begin{bmatrix} C_y & C_x & F \end{bmatrix}   (65)

Now

E_Y[y_i] = \bar{y}_i   (66)

E_{Y,X}[x_{ij}] = E_Y[\bar{x}_{ij}] = L_{x_{ij}}^{-1} \left( \tilde{\zeta}_{ij} - L_{ij} J \bar{y}_i \right)   (67)

C_y = \sum_{i=1}^{M} \sum_{j=1}^{H_i} F_{ij} \bar{y}_i^T = \sum_{i=1}^{M} F_i \bar{y}_i^T   (68)

C_x = \sum_{i=1}^{M} \sum_{j=1}^{H_i} F_{ij} \left( \tilde{\zeta}_{ij} - L_{ij} J \bar{y}_i \right)^T L_{x_{ij}}^{-1}   (69)

Now

R_{\tilde{y}} = \begin{bmatrix} R_y & R_{yx} & R_{y1} \\ R_{xy} & R_x & R_{x1} \\ R_{y1}^T & R_{x1}^T & N \end{bmatrix}   (70)

where R_{yx} = R_{xy}^T and

R_{y1} = \sum_{i=1}^{M} N_i E_Y[y_i] = \sum_{i=1}^{M} N_i \bar{y}_i   (71)

R_{x1} = \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} E_{Y,X}[x_{ij}] = \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} L_{x_{ij}}^{-1} \left( U^T W \bar{F}_{ij} - L_{ij} J \bar{y}_i \right)   (72)

R_y = \sum_{i=1}^{M} N_i E_Y[y_i y_i^T] = \sum_{i=1}^{M} N_i \left( L_{y_i}^{-1} + \bar{y}_i \bar{y}_i^T \right)   (73)

R_{xy} = \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} E_{Y,X}[x_{ij} y_i^T] = \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} E_Y\left[ L_{x_{ij}}^{-1} \left( \tilde{\zeta}_{ij} - L_{ij} J y_i \right) y_i^T \right]   (74)

= \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} L_{x_{ij}}^{-1} \left( U^T W \bar{F}_{ij} \bar{y}_i^T - L_{ij} J E_Y[y_i y_i^T] \right)   (75)

R_x = \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} E_{Y,X}[x_{ij} x_{ij}^T]   (76)

= \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} \left( L_{x_{ij}}^{-1} + L_{x_{ij}}^{-1} E_Y\left[ \left( U^T W \bar{F}_{ij} - L_{ij} J y_i \right) \left( U^T W \bar{F}_{ij} - L_{ij} J y_i \right)^T \right] L_{x_{ij}}^{-1} \right)   (77)

= \sum_{i=1}^{M} \sum_{j=1}^{H_i} L_{ij} \left( L_{x_{ij}}^{-1} + L_{x_{ij}}^{-1} \left( U^T W \bar{F}_{ij} \bar{F}_{ij}^T W U - L_{ij} U^T W \bar{F}_{ij} \bar{y}_i^T J^T - L_{ij} J \bar{y}_i \bar{F}_{ij}^T W U + L_{ij}^2 J E_Y[y_i y_i^T] J^T \right) L_{x_{ij}}^{-1} \right)   (78)
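Given the accumulators C and R_ỹ of equations (56)-(57) (assembled blockwise through (65)-(78)) and the global statistics N and S, the ML updates (60) and (62)-(64) are a few lines. The sketch below is illustrative; R_yt stands for R_ỹ, and the accumulation loop over speakers is omitted.

import numpy as np

def m_step_ml(C, R_yt, S, N):
    """ML update of V~ = [V U mu] and W, equations (60) and (62)-(64)."""
    V_t = np.linalg.solve(R_yt.T, C.T).T      # V~ = C R_y~^{-1}, eq. (60)
    Sigma = (S - V_t @ C.T) / N               # W^{-1}, eq. (64)
    Sigma = 0.5 * (Sigma + Sigma.T)           # enforce symmetry, eq. (62)
    W = np.linalg.inv(Sigma)
    return V_t, W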

3.3 M-step MD

We assume a more general prior for the hidden variables:

P(y_i) = \mathcal{N}(y_i|\mu_y, \Lambda_y^{-1})   (79)

P(x_{ij}|y_i) = \mathcal{N}(x_{ij}|H y_i + \mu_x, \Lambda_x^{-1})   (80)

To minimize the divergence we maximize

Q(\mu_y, \Lambda_y, H, \mu_x, \Lambda_x) = \sum_{i=1}^{M} \left( \sum_{j=1}^{H_i} E_{Y,X}\left[ \ln \mathcal{N}(x_{ij}|H y_i + \mu_x, \Lambda_x^{-1}) \right] + E_Y\left[ \ln \mathcal{N}(y_i|\mu_y, \Lambda_y^{-1}) \right] \right)   (81)

= \frac{M}{2} \ln |\Lambda_y| - \frac{1}{2} \mathrm{tr}\left( \Lambda_y \sum_{i=1}^{M} E_Y\left[ (y_i - \mu_y)(y_i - \mu_y)^T \right] \right) + \frac{H}{2} \ln |\Lambda_x| - \frac{1}{2} \mathrm{tr}\left( \Lambda_x \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_{Y,X}\left[ (x_{ij} - H y_i - \mu_x)(x_{ij} - H y_i - \mu_x)^T \right] \right) + \text{const}   (82)

Note that H is overloaded: in the scalar factors H/2, 1/H and H µ_x µ_x^T below it denotes the total number of sessions \sum_{i=1}^{M} H_i, while elsewhere it is the matrix introduced in (80).

\frac{\partial Q(\mu_y, \Lambda_y, H, \mu_x, \Lambda_x)}{\partial \mu_y} = \sum_{i=1}^{M} \Lambda_y E_Y[y_i - \mu_y] = 0 \implies   (83)

\mu_y = \frac{1}{M} \sum_{i=1}^{M} E_Y[y_i]   (84)

\frac{\partial Q(\mu_y, \Lambda_y, H, \mu_x, \Lambda_x)}{\partial \Lambda_y} = \frac{M}{2} \left( 2 \Lambda_y^{-1} - \mathrm{diag}(\Lambda_y^{-1}) \right) - \frac{1}{2} \left( 2 S - \mathrm{diag}(S) \right) = 0   (85)

where S = \sum_{i=1}^{M} E_Y[(y_i - \mu_y)(y_i - \mu_y)^T], so

\Sigma_y = \Lambda_y^{-1}   (86)

= \frac{1}{M} \sum_{i=1}^{M} E_Y\left[ (y_i - \mu_y)(y_i - \mu_y)^T \right]   (87)

= \frac{1}{M} \sum_{i=1}^{M} \left( E_Y[y_i y_i^T] - \mu_y E_Y[y_i]^T - E_Y[y_i] \mu_y^T + \mu_y \mu_y^T \right)   (88)

= \frac{1}{M} \sum_{i=1}^{M} E_Y[y_i y_i^T] - \mu_y \mu_y^T   (89)

\frac{\partial Q(\mu_y, \Lambda_y, H, \mu_x, \Lambda_x)}{\partial \mu_x} = \Lambda_x \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_{Y,X}[x_{ij} - H y_i - \mu_x] = 0 \implies   (90)

\mu_x = \frac{1}{H} \left( \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_X[x_{ij}] - H \sum_{i=1}^{M} H_i E_Y[y_i] \right)   (91)

= \frac{1}{H} \left( P_{x1} - H P_{y1} \right)   (92)

where

P_{x1} = \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_X[x_{ij}]   (93)

P_{y1} = \sum_{i=1}^{M} H_i E_Y[y_i]   (94)

\frac{\partial Q(\mu_y, \Lambda_y, H, \mu_x, \Lambda_x)}{\partial H} = \Lambda_x \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_{Y,X}\left[ (x_{ij} - H y_i - \mu_x) y_i^T \right] = 0   (95)

\implies P_{xy} - H P_y - \mu_x P_{y1}^T = 0   (96)

\implies P_{xy} - H P_y - \frac{1}{H} (P_{x1} - H P_{y1}) P_{y1}^T = 0   (97)

\implies \left( P_{xy} - \frac{1}{H} P_{x1} P_{y1}^T \right) - H \left( P_y - \frac{1}{H} P_{y1} P_{y1}^T \right) = 0 \implies   (98)

H = \left( P_{xy} - \frac{1}{H} P_{x1} P_{y1}^T \right) \left( P_y - \frac{1}{H} P_{y1} P_{y1}^T \right)^{-1}   (99)

where

P_{xy} = \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_{Y,X}[x_{ij} y_i^T]   (100)

P_y = \sum_{i=1}^{M} H_i E_Y[y_i y_i^T]   (101)

P_x = \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_X[x_{ij} x_{ij}^T]   (102)

\frac{\partial Q(\mu_y, \Lambda_y, H, \mu_x, \Lambda_x)}{\partial \Lambda_x} = \frac{H}{2} \left( 2 \Lambda_x^{-1} - \mathrm{diag}(\Lambda_x^{-1}) \right) - \frac{1}{2} \left( 2 S - \mathrm{diag}(S) \right) = 0   (103)

where S = \sum_{i=1}^{M} \sum_{j=1}^{H_i} E_{Y,X}[(x_{ij} - H y_i - \mu_x)(x_{ij} - H y_i - \mu_x)^T], so

\Sigma_x = \Lambda_x^{-1}   (104)

= \frac{1}{H} \left( P_x - P_{xy} H^T - H P_{xy}^T - P_{x1} \mu_x^T - \mu_x P_{x1}^T + H P_y H^T + H P_{y1} \mu_x^T + \mu_x P_{y1}^T H^T + H \mu_x \mu_x^T \right)   (105)

= \frac{1}{H} \left( P_x - P_{xy} H^T - H P_{xy}^T + H P_y H^T - (P_{x1} - H P_{y1}) \mu_x^T - \mu_x (P_{x1} - H P_{y1})^T + H \mu_x \mu_x^T \right)   (106)

= \frac{1}{H} \left( P_x - P_{xy} H^T - H P_{xy}^T + H P_y H^T - (P_{x1} - H P_{y1}) \mu_x^T \right)   (107)

The transform (y, x) = φ(y', x'), such that y' and x' have standard priors, is

y = \mu_y + (\Sigma_y^{1/2})^T y'   (108)

x = \mu_x + H y + (\Sigma_x^{1/2})^T x'   (109)

= \mu_x + H \mu_y + H (\Sigma_y^{1/2})^T y' + (\Sigma_x^{1/2})^T x'   (110)

We can transform µ, V and U using that transform:

U' = U (\Sigma_x^{1/2})^T   (111)

V' = (V + U H) (\Sigma_y^{1/2})^T   (112)

\mu' = \mu + (V + U H) \mu_y + U \mu_x   (113)
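Once µ_y, Σ_y, H, µ_x and Σ_x have been estimated with (84), (89), (92), (99) and (107), the minimum-divergence step folds the general prior back into the model parameters. The sketch below implements only the transform (111)-(113), using Cholesky factors as the matrix square roots (one valid choice among several) and assuming the covariances are positive definite.

import numpy as np

def md_transform(mu, V, U, mu_y, Sigma_y, H, mu_x, Sigma_x):
    """Fold the general prior (79)-(80) back into the model, equations (111)-(113)."""
    Sy_half = np.linalg.cholesky(Sigma_y)      # Sigma_y = Sy_half @ Sy_half.T
    Sx_half = np.linalg.cholesky(Sigma_x)
    U_new = U @ Sx_half                        # eq. (111)
    V_new = (V + U @ H) @ Sy_half              # eq. (112)
    mu_new = mu + (V + U @ H) @ mu_y + U @ mu_x   # eq. (113)
    return mu_new, V_new, U_new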

3.4 Objective function

The EM objective function is equation (52) summed over all speakers:

\ln P(\Phi|\mathcal{M}) = \frac{N}{2} \ln \left| \frac{W}{2\pi} \right| - \frac{1}{2} \mathrm{tr}(W \bar{S}) - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{H_i} \ln |L_{x_{ij}}| + \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{H_i} \tilde{\zeta}_{ij}^T L_{x_{ij}}^{-1} \tilde{\zeta}_{ij} - \frac{1}{2} \sum_{i=1}^{M} \ln |L_{y_i}| + \frac{1}{2} \sum_{i=1}^{M} \gamma_i^T L_{y_i}^{-1} \gamma_i   (114)

Likelihood ratio

Given a model M we can calculate the ratio of the posterior probabilities of target and non target as shown in [1]: P (T |Φt , M, π) PT P (Φt |T , M) PT R (Φt , M) = = P (N |Φt , M, π) PN P (Φt |N , M) PN

(115)

where we have defined the plug-in likelihood ratio R (Φt , M). To get this ratio we need to calculate P (Φ|θ, M). Given a model M, the y1 , y2 , . . . , yM ∈ Y are sampled independently from P (y|M). Besides, given the M and a speaker i the set Φi of i-vectors produced by that speaker are drown independently from P (Φ|yi , M). Using these independence assumptions we can write: P (Φ|θ, M) =

M Y

P (Φi |M)

(116)

i=1

P (Φi |y, M) =

Y

P (φ|y, M)

(117)

φ∈Φi

Then, the likelihood of Φ is P (Φ|θ, M) =

M Y P (Φi |y0 , M) P (y0 |M) = K(Φ)L(θ|Φ) P (y0 |Φi , M) i=1

(118)

Q where K(Φ) = N i=1 P (φj |y0 , M) is a term that only dependent on the dataset, not θ, so it vanishes when doing the ratio and we do not need to calculate it. What we need to calculate is: L(θ|Φ) =

M Y

Q (Φi )

(119)

P (y0 |M) P (y0 |Φi , M)

(120)

i=1

Q (Φi ) = and the likelihood ratio is:

R (Φt , M) =

Q ({l, r}) Q ({l}) Q ({r})

(121)

Making y0 = 0 we can get use (39), (2) to calculate Q (Φ) ln Q (Φi ) =

 1 − ln |Lyi | + γiT L−1 y i γi 2

(122)

Given a set of training observations Φ1 of a speaker 1 with statistics N1 and F1 ; and a set of test observations Φ2 of a speaker 2 with statistics N2 and F2 . To test if the speakers 1 and 2 are the same speaker the log-likelihood ratio is ln R (Φt , M) =

 1 T −1 T −1 − ln |L3 | + γ3T L−1 3 γ3 + ln |L1 | − γ1 L1 γ1 + ln |L2 | − γ2 L2 γ2 2 10

(123)

where −1 P (y|Φ1 , M) =N y|γ1 L−1 1 , L1

P (y|Φ2 , M) =N P (y|Φ1 , Φ2 , M) =N



−1 y|γ2 L−1 2 , L2  −1 y|γ3 L−1 3 , L3



(124) (125) (126) (127)

Using that γ3 = γ1 + γ2 : ln R (Φt , M) =

 1 −1 −1 −1 −1 T T ln |L1 | + ln |L2 | − ln |L3 | + 2γ1T L−1 3 γ2 + γ1 (L3 − L1 )γ1 + γ2 (L3 − L2 )γ2 2 (128)
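For a trial where side 1 contributes the statistics of one speaker and side 2 those of another, equation (128) gives the verification score from the per-side posterior parameters (γ_1, L_1) and (γ_2, L_2), computed for example with the speaker_posterior sketch of Section 2.3. The sketch below uses γ_3 = γ_1 + γ_2 and, following equation (47), L_3 = L_1 + L_2 − I for the pooled set.

import numpy as np

def llr_score(gamma1, L1, gamma2, L2):
    """Log-likelihood ratio of same-speaker vs different-speaker hypotheses, eq. (128)."""
    ny = L1.shape[0]
    gamma3 = gamma1 + gamma2                   # eq. (127)
    L3 = L1 + L2 - np.eye(ny)                  # pooled posterior precision, from eq. (47)
    L3_inv = np.linalg.inv(L3)
    llr = 0.5 * (np.linalg.slogdet(L1)[1] + np.linalg.slogdet(L2)[1]
                 - np.linalg.slogdet(L3)[1]
                 + 2 * gamma1 @ L3_inv @ gamma2
                 + gamma1 @ (L3_inv - np.linalg.inv(L1)) @ gamma1
                 + gamma2 @ (L3_inv - np.linalg.inv(L2)) @ gamma2)
    return llr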

References

[1] Niko Brümmer and Edward de Villiers, "The Speaker Partitioning Problem," in Odyssey: The Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.

