Joint Factor Analysis for Speaker Recognition reinterpreted as Signal Coding using Overcomplete Dictionaries

Daniel Garcia-Romero and Carol Espy-Wilson
Department of Electrical & Computer Engineering, University of Maryland, College Park, MD, USA

Outline
• Introduce JFA configuration and notation
• Present an alternative perspective
  – Link JFA with Signal Coding using an Overcomplete Dictionary
  – Discuss algorithmic differences
• Experimental validation
• Remarks about cross-pollination opportunities

JFA configuration
• Based on point estimates
• Hyperparameters:
  – m and Σ taken from the UBM and kept fixed
  – Independent training of the subspaces V, U and D by ML
• Speaker models:
  – MAP estimation with one EM iteration
• Scoring:
  – Linear scoring

JFA: Model training (1)
• Given observed vectors o_1, ..., o_T and assuming the speaker supervector

    M = m + V y + U x + D z,   with y, x and z having standard normal priors

• Obtain the MAP speaker model with one EM iteration:
  – E-step -> Compute the zeroth-order counts and centered first-order sufficient statistics
  – M-step -> Minimize the resulting surrogate objective with respect to the factors
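As a concrete illustration of the E-step above, the following is a minimal numpy sketch (not the authors' code) of how the zeroth-order counts and centered first-order sufficient statistics of one utterance could be accumulated against a diagonal-covariance GMM-UBM; the function name and array shapes are illustrative assumptions.

    import numpy as np

    def baum_welch_stats(features, ubm_weights, ubm_means, ubm_covs):
        """Zeroth-order counts N_c and centered first-order stats F_c for one
        utterance against a diagonal-covariance GMM-UBM.

        features:    (T, F) acoustic frames
        ubm_weights: (C,)   mixture weights
        ubm_means:   (C, F) mixture means
        ubm_covs:    (C, F) diagonal covariances
        """
        # Per-frame, per-mixture log-likelihoods of diagonal Gaussians.
        diff = features[:, None, :] - ubm_means[None, :, :]                  # (T, C, F)
        log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_covs[None, :, :], axis=2)
                            + np.sum(np.log(2.0 * np.pi * ubm_covs), axis=1))
        # Posterior responsibilities (alignment against the UBM).
        log_post = np.log(ubm_weights)[None, :] + log_gauss                  # (T, C)
        log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
        gamma = np.exp(log_post)
        # Zeroth-order counts and centered first-order statistics.
        N = gamma.sum(axis=0)                                                # (C,)
        F_stats = gamma.T @ features - N[:, None] * ubm_means                # (C, F)
        return N, F_stats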

JFA: Model training (2)
[Block diagram: Feature extraction -> Data alignment -> E-step -> M-step -> solve for the speaker factors]

JFA: Hyperparameter estimation
• Given training utterances and an initial subspace, use EM to solve for the ML estimate [Kenny et al., 2008]:
  – E-step -> Posterior means and correlation matrices of the latent factors
  – M-step -> For the rows belonging to each mixture, solve independent linear systems of equations whose right-hand sides are built from the accumulated first-order statistics and posterior means
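The per-mixture structure of this M-step can be sketched as follows, assuming an accumulator A[c] (one R x R matrix per mixture, built from the posterior correlation matrices) and an accumulator Cacc (C*F x R, built from the first-order statistics and posterior means); the names and exact accumulation are assumptions for illustration, not the authors' implementation.

    import numpy as np

    def ml_m_step(A, Cacc, C, F, R):
        """Per-mixture subspace update V_c = Cacc_c * inv(A_c).

        A:    (C, R, R)  per-mixture accumulator (posterior correlation matrices)
        Cacc: (C*F, R)   accumulator from first-order stats and posterior means
        The F rows belonging to mixture c share the same R x R system matrix,
        so the update is C independent linear solves, each with F right-hand sides.
        """
        V = np.zeros((C * F, R))
        for c in range(C):
            rows = slice(c * F, (c + 1) * F)
            # Solve V_c A_c = Cacc_c   <=>   A_c^T V_c^T = Cacc_c^T
            V[rows] = np.linalg.solve(A[c].T, Cacc[rows].T).T
        return V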

JFA: Scoring
• Linear scoring [Glembek et al., 2009]:
  – Speaker model -> supervector offset V y_s + D z_s
  – Test utterance -> channel-compensated, centered first-order statistics
  – Score -> inner product of the two, weighted by the UBM precisions
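A minimal sketch of linear scoring in the spirit of [Glembek et al., 2009], assuming a diagonal-covariance UBM and the shapes below; the exact compensation terms used in the authors' system may differ.

    import numpy as np

    def linear_score(model_offset, ubm_means, ubm_covs, N_test, F_test, U, x_test):
        """Linear score between a speaker model and a test utterance.

        model_offset: (C*F,)   speaker supervector offset, e.g. V y_s + D z_s
        ubm_means:    (C, F)   UBM means;  ubm_covs: (C, F) diagonal covariances
        N_test:       (C,)     zeroth-order counts of the test utterance
        F_test:       (C, F)   uncentered first-order statistics of the test utterance
        U, x_test:    (C*F, R) channel subspace and (R,) estimated channel factors
        """
        C, F = ubm_means.shape
        # Center the statistics and remove the estimated channel offset.
        F_comp = (F_test - N_test[:, None] * ubm_means).reshape(C * F)
        F_comp -= np.repeat(N_test, F) * (U @ x_test)
        # Inner product weighted by the UBM precisions.
        precisions = (1.0 / ubm_covs).reshape(C * F)
        return float(model_offset @ (precisions * F_comp))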

Definitions
• Overcomplete dictionary [Rubinstein et al., 2010]:
  – Full row-rank matrix with more columns than rows
  – Analytical (e.g., unions of FFT, DCT or wavelet bases) or data-driven (learned from data)
  – Applications: compression, denoising, source separation, face recognition
• Signal Coding (SC):
  – Given a signal and a dictionary, obtain an optimal encoding according to a given objective

Signal generation (E-step)
[Block diagram: Feature extraction -> Data alignment against the UBM]
• Each utterance is represented by a fixed-length vector (built from the centered first-order statistics) and a weighting matrix (built from the occupation counts and the UBM covariances)
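One plausible concrete form of this representation, shown here as a hedged sketch: the fixed-length signal is taken as the count-normalized centered first-order statistics and the weighting is diagonal with entries N_c times the UBM precision, a convention under which the ridge-regression encoding on the next slide matches the JFA MAP estimate; treat the exact convention as an assumption.

    import numpy as np

    def utterance_signal(N, F_stats, ubm_covs, floor=1e-6):
        """Fixed-length signal and diagonal weighting from the sufficient statistics.

        N:        (C,)    zeroth-order counts
        F_stats:  (C, F)  centered first-order statistics
        ubm_covs: (C, F)  UBM diagonal covariances
        Returns the signal s (C*F,) and the diagonal w (C*F,) of the weighting
        matrix W, here chosen as N_c times the UBM precision per dimension.
        """
        C, F = F_stats.shape
        Nc = np.maximum(N, floor)                        # guard against empty mixtures
        s = (F_stats / Nc[:, None]).reshape(C * F)       # count-normalized centered stats
        w = (N[:, None] / ubm_covs).reshape(C * F)       # weight = N_c / sigma_c^2
        return s, w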

Signal Coding (Ridge regression)
• Encode the utterance signal on the dictionary by minimizing a weighted least-squares fit plus a quadratic penalty on the coefficients; this coincides with the JFA MAP point estimate of the factors
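A minimal sketch of the weighted ridge-regression encoding, using the signal and diagonal weighting from the previous sketch; the dictionary T (e.g., the stacked [V U D] columns) and the unit penalty are assumptions consistent with the JFA MAP interpretation.

    import numpy as np

    def ridge_encode(s, w, T, lam=1.0):
        """Weighted ridge regression:
            min_a  (s - T a)^T diag(w) (s - T a) + lam * ||a||^2
        solved in closed form via the normal equations.

        s: (D,) signal   w: (D,) weighting diagonal   T: (D, K) dictionary
        """
        WT = w[:, None] * T                              # diag(w) T
        A = T.T @ WT + lam * np.eye(T.shape[1])          # T^T W T + lam I
        b = WT.T @ s                                     # T^T W s
        return np.linalg.solve(A, b)

Under the assumed signal and weighting, setting lam = 1 reproduces the JFA posterior-mean (MAP) equations, which is the sense in which the speaker model is a ridge-regression encoding.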

Dictionary learning
• Given a set of training utterances, solve a joint optimization over the dictionary and the encoding coefficients
• Block-coordinate descent [Bertsekas, 1999]: alternating optimization (see the sketch after this list)
  – Signal Coding (SC): performed keeping the dictionary fixed
  – Dictionary Update (DU): performed keeping the coefficients fixed
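A minimal sketch of the alternation, with assumed shapes; for brevity the DU step below drops the per-utterance weighting (an unweighted least-squares fit), whereas the full update would keep it.

    import numpy as np

    def ridge_encode(s, w, T, lam=1.0):
        """Weighted ridge regression (same closed form as in the SC sketch above)."""
        WT = w[:, None] * T
        return np.linalg.solve(T.T @ WT + lam * np.eye(T.shape[1]), WT.T @ s)

    def dictionary_learning(signals, weights, T0, lam=1.0, n_iter=10):
        """Block-coordinate descent alternating SC and DU.

        signals: (N, D) one row per training utterance
        weights: (N, D) per-utterance diagonal weighting
        T0:      (D, K) initial dictionary
        """
        T = T0.copy()
        N, D = signals.shape
        K = T.shape[1]
        coeffs = np.zeros((N, K))
        for _ in range(n_iter):
            # SC step: encode every utterance with the dictionary kept fixed.
            coeffs = np.stack([ridge_encode(signals[i], weights[i], T, lam)
                               for i in range(N)])
            # DU step: least-squares dictionary update with the coefficients fixed:
            #   min_T ||S - A T^T||_F^2   =>   T^T = (A^T A)^{-1} A^T S
            G = coeffs.T @ coeffs + 1e-8 * np.eye(K)     # small ridge for stability
            T = np.linalg.solve(G, coeffs.T @ signals).T
        return T, coeffs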

DU: Algorithmic opportunity
• Keeping the coefficients fixed, the DU step reduces to a least-squares update of the dictionary
• Comparing with JFA ML estimation:
  – E-step -> Posterior means and correlation matrices of the latent factors
  – M-step -> For the rows belonging to each mixture, solve independent linear systems of equations
• No explicit matrix inversions in the SC step -> about 2 times faster

Hybrid approach
• JFA -> M-step update built from the posterior correlation matrices
• DU -> update built from the point estimates of the coefficients
• Hybrid -> DU-style update in which the weighting is the average weighting matrix computed from the training utterances

Scoring
• Notions of model and test segments are blurred
  – Both are treated as signals to be encoded on the dictionary
• Given two utterances, the score is computed in three steps (see the sketch after this list):
  1. Signal coding -> encode each utterance on the dictionary to obtain its coefficient vector
  2. Compensation -> zero out the channel coefficients of each encoding
  3. Similarity computation -> cosine similarity in supervector space, where several candidates for the matrix used to map the compensated coefficients back to supervector space were considered
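A minimal sketch of these three steps, assuming the channel (U) coefficients occupy a known index set and that the compensated coefficients are mapped back to supervector space through the dictionary; this is one of the candidate choices, not necessarily the one used in the experiments.

    import numpy as np

    def cosine_score(a1, a2, T, channel_idx):
        """Encode-compensate-compare scoring.

        a1, a2:      (K,)   coefficient vectors of the two utterances
        T:           (D, K) dictionary, e.g. the stacked [V U D] columns
        channel_idx: index array of the channel (U) coefficients to zero out
        """
        a1, a2 = a1.copy(), a2.copy()
        a1[channel_idx] = 0.0                            # compensation
        a2[channel_idx] = 0.0
        s1, s2 = T @ a1, T @ a2                          # back to supervector space
        return float(s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2) + 1e-12))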

Experimental setup
• SWB-I database:
  – 520 speakers, balanced in gender, with 4856 speech files
  – Telephone speech (approx. 70% electret, 30% carbon-button handsets)
  – Two balanced partitions, P1 and P2
• Parameterization*:
  – 38 MFCCs (19 + deltas), band-limited to 300-3400 Hz, computed every 10 ms with a 20 ms Hamming window
• UBM: 2048-mixture GMM trained on P2 data
• Simple dictionary (eigenchannel configuration) also learned from P2

* Feature extraction and UBM training done with MIT-LL software

Analysis of DL procedure
• Evaluate the distance between the learned subspaces:
  – Use the projection distance between their column spaces (see the sketch below)
  – Computed from orthonormal bases for the subspaces
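A minimal sketch of one common definition of the projection distance, assuming orthonormal bases are supplied (e.g., via a QR decomposition of each learned subspace); the normalization used on the slide may differ.

    import numpy as np

    def projection_distance(Q1, Q2):
        """Projection distance between the column spaces of Q1 and Q2.

        Q1, Q2: (D, k) matrices with orthonormal columns.
        Uses ||Q1 Q1^T - Q2 Q2^T||_F, which for equal-rank subspaces equals
        sqrt(2 * (k - ||Q1^T Q2||_F^2)).
        """
        k = Q1.shape[1]
        overlap = np.linalg.norm(Q1.T @ Q2, 'fro') ** 2
        return float(np.sqrt(max(2.0 * (k - overlap), 0.0)))

    # Orthonormal bases can be obtained first, e.g. Q1, _ = np.linalg.qr(V_learned)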

Analysis of DL procedure
• Examine recognition accuracy:
  – Closed-set identification on P1
    • 2408 utterances from 260 speakers -> 33,866 trials
    • Model and test utterances encoded and compensated
    • Cosine similarity score as the metric

  Dimension   JFA     Hybrid   DU
  128         95.0%   94.9%    94.9%
  64          94.5%   94.5%    94.5%
  32          93.3%   93.3%    93.3%

  – No apparent degradation from either alternative
  – How these results hold for reduced amounts of data will be studied in the near future

Analysis of encoding and scoring (I)
• Model and test utterances encoded the same way
• Cosine similarity score

Analysis of encoding and scoring (II)
• Verification experiments:
  – Leave-one-out on 2408 utterances from P1
    • 33,866 target trials and 5,764,598 non-target trials
  – Only U and D: slightly better to treat model and test utterances the same way
  – Full JFA with U, V and D: mixed results; some tasks favor encoding both utterances the same way, others do not
• Mixed results for SRE 2010

Discussion (Speculation)
• Many public resources for dictionary learning:
  – K-SVD and multiple variations, FOCUSS-DL, MOD
• Discriminatively trained dictionaries:
  – [Mairal et al., 2008], in Proc. CVPR
  – Emphasize that the ultimate goal is discrimination, not representation
• Sparsity-inducing priors (e.g., a Laplacian prior):
  – LASSO [Tibshirani, 1996]
  – For reduced amounts of data or image occlusion, L1 regularization has proved extremely effective [Wright et al., 2008, in PAMI] (see the sketch below)

Conclusions
• Different perspective on JFA
• Algorithmic suggestion for ML subspace training
• Explored different scorings and metrics
• Opportunities for cross-pollination

Acknowledgments
• Special thanks to MIT-LL for the binaries used to parameterize the data and compute the UBM

Sparse Coding (SC)


SRE 2010
