Joint Factor Analysis for Speaker Recognition reinterpreted as Signal Coding using Overcomplete Dictionaries
Daniel Garcia-Romero, Carol Espy-Wilson
Department of Electrical & Computer Engineering, University of Maryland, College Park, MD, USA
Outline
• Introduce JFA configuration and notation
• Present an alternative perspective
  – Link JFA with Signal Coding using an Overcomplete Dictionary
  – Discuss algorithmic differences
• Experimental validation
• Remarks about cross-pollination opportunities
JFA configuration
• Based on point estimates
• Hyperparameters:
  – Mean supervector m and covariances Σ taken from the UBM and kept fixed
  – Independent training of the subspaces U, V, and D by ML
• Speaker models:
  – MAP estimation with one EM iteration
• Scoring:
  – Linear scoring
JFA: Model training (1)
• Given observed feature vectors and assuming the standard JFA decomposition of the speaker supervector,

    M = m + Vy + Ux + Dz,   with latent factors y, x, z ~ N(0, I) and fixed UBM covariances Σ

• Obtain the MAP speaker model with one EM iteration:
  E-step -> Compute zero-order counts and centered first-order sufficient statistics
  M-step -> Minimize the surrogate (auxiliary) objective with respect to the point estimates of the latent factors
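A minimal numpy sketch of these two quantities and of the resulting MAP (posterior-mean) point estimate, assuming a diagonal-covariance UBM and a single generic loading matrix; the function and variable names are illustrative, not from the original software:

import numpy as np

def baum_welch_stats(X, gamma, ubm_means):
    """Zero-order counts and centered first-order stats per mixture.

    X         : (T, F) frame-level features
    gamma     : (T, C) UBM posteriors (responsibilities) per frame
    ubm_means : (C, F) UBM component means
    Returns N of shape (C,) and F_centered of shape (C, F).
    """
    N = gamma.sum(axis=0)                        # zero-order counts
    F_raw = gamma.T @ X                          # first-order stats
    F_centered = F_raw - N[:, None] * ubm_means  # center around the UBM means
    return N, F_centered

def map_point_estimate(N, F_centered, V, ubm_covs):
    """Posterior-mean (MAP) point estimate of the latent factor for one
    loading matrix V of shape (C*F, R), assuming a standard-normal prior
    and diagonal UBM covariances ubm_covs of shape (C, F)."""
    C, F = F_centered.shape
    R = V.shape[1]
    sigma_inv = 1.0 / ubm_covs.reshape(C * F)    # diagonal Sigma^{-1}
    n_sup = np.repeat(N, F)                      # counts expanded to supervector layout
    f_sup = F_centered.reshape(C * F)
    # Solve (I + V' Sigma^{-1} N V) y = V' Sigma^{-1} F for the point estimate y
    A = np.eye(R) + V.T @ (sigma_inv[:, None] * n_sup[:, None] * V)
    b = V.T @ (sigma_inv * f_sup)
    return np.linalg.solve(A, b)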
JFA: Model training (2)
• Block diagram (figure): feature extraction -> data alignment against the UBM -> E-step (sufficient statistics) -> M-step (solve for the speaker factors)
JFA: Hyperparameter estimation
• Given a set of training utterances and an initial subspace, use EM to solve for the ML estimate [Kenny et al., 2008]:
  E-step -> Posterior means and correlation matrices of the latent factors
  M-step -> Update each F-dimensional subset of rows of the subspace (one subset per mixture) by solving independent linear systems of equations whose right-hand sides are built from the accumulated statistics
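A compact numpy sketch of this M-step, assuming the per-utterance statistics and posterior moments have already been accumulated; names and shapes are my own choices for illustration:

import numpy as np

def ml_subspace_update(stats):
    """M-step of the ML subspace (e.g. V) re-estimation.

    stats is an iterable of per-utterance tuples (N, F_centered, Ey, Eyy):
      N          : (C,)   zero-order counts
      F_centered : (C, F) centered first-order stats
      Ey         : (R,)   posterior mean of the latent factor
      Eyy        : (R, R) posterior correlation matrix E[y y^T]
    Returns the updated loading matrix of shape (C*F, R).
    """
    stats = list(stats)
    C, F = stats[0][1].shape
    R = stats[0][2].shape[0]

    A = np.zeros((C, R, R))                           # one accumulator per mixture
    B = np.zeros((C, F, R))
    for N, F_c, Ey, Eyy in stats:
        A += N[:, None, None] * Eyy                   # sum_s N_c(s) E[y y^T]
        B += F_c[:, :, None] * Ey[None, None, :]      # sum_s F_c(s) E[y]^T

    # One independent R x R linear system per mixture (one F-dim row block)
    V_new = np.zeros((C, F, R))
    for c in range(C):
        V_new[c] = np.linalg.solve(A[c].T, B[c].T).T  # V_c = B_c A_c^{-1}
    return V_new.reshape(C * F, R)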
JFA: Scoring
• Linear scoring [Glembek et al., 2009]:
  Speaker model -> offset of the speaker supervector from the UBM
  Test utterance -> channel-compensated centered first-order statistics
  Score -> Σ^{-1}-weighted inner product between the two
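A sketch in the spirit of that linear score, assuming the speaker offset, the test-utterance statistics, and the test channel estimate are already available; this is an illustration, not the exact formula from the slide:

import numpy as np

def linear_score(speaker_offset, N_test, F_test_centered, Ux_test, sigma_diag):
    """Linear-scoring sketch.

    speaker_offset  : (C*F,) speaker supervector offset, e.g. V y + D z
    N_test          : (C,)   zero-order counts of the test utterance
    F_test_centered : (C*F,) centered first-order stats of the test utterance
    Ux_test         : (C*F,) estimated channel offset for the test utterance
    sigma_diag      : (C*F,) diagonal of the UBM covariance supervector
    """
    F = F_test_centered.shape[0] // N_test.shape[0]
    n_sup = np.repeat(N_test, F)
    # Channel-compensate the test statistics, then take a Sigma^{-1}-weighted inner product
    compensated = F_test_centered - n_sup * Ux_test
    return speaker_offset @ (compensated / sigma_diag)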
Definitions
• Overcomplete dictionary [Rubinstein et al., 2010]:
  – Full row-rank matrix with more columns than rows
  – Analytical (e.g., unions of FFT, DCT, wavelet bases) or data-driven (learned from data)
  – Applications: compression, denoising, source separation, face recognition
• Signal Coding (SC):
  – Given a signal and a dictionary, obtain an optimal encoding according to a given objective (a generic form is written out below)
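A generic way to write such an objective (the symbols s, D, w, \lambda, and \psi are notation introduced here for illustration, not taken from the slides):

    \hat{w} \;=\; \arg\min_{w}\; \lVert s - D w \rVert_2^2 \;+\; \lambda\,\psi(w)

where \psi selects the coding flavor: \psi(w) = \lVert w \rVert_2^2 gives the ridge regression used later, while \psi(w) = \lVert w \rVert_1 gives sparse coding.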
Signal generation (E-step)
• Block diagram (figure): feature extraction -> data alignment against the UBM -> sufficient statistics
• Utterance represented by a fixed-length vector (the centered first-order statistics) and a weighting matrix (built from the zero-order counts and the UBM covariances)
Signal Coding (Ridge regression)
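The closed-form solution of a weighted ridge-regression encoding, sketched here under the assumption that the weighting matrix is diagonal (as it is when built from counts and UBM precisions); the function name and signature are illustrative:

import numpy as np

def ridge_code(s, D, W=None, lam=1.0):
    """Weighted ridge-regression encoding of a signal s on dictionary D.

    Solves  min_w  (s - D w)^T W (s - D w) + lam * ||w||^2.

    s   : (M,)   utterance supervector (centered first-order stats)
    D   : (M, K) overcomplete dictionary
    W   : (M,)   diagonal of the weighting matrix (identity if None)
    lam : ridge regularization weight
    """
    if W is None:
        W = np.ones_like(s)
    A = D.T @ (W[:, None] * D) + lam * np.eye(D.shape[1])
    b = D.T @ (W * s)
    return np.linalg.solve(A, b)

Note the formal similarity to the JFA MAP point estimate sketched earlier: the dictionary plays the role of the loading matrices and the ridge term plays the role of the standard-normal prior.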
Dictionary learning
• Given a set of training utterances, solve a joint optimization over the codes and the dictionary
• Block-coordinate descent [Bertsekas, 1999]: alternating optimization between
  – Signal coding (SC), performed keeping the dictionary fixed
  – Dictionary update (DU), performed keeping the codes fixed
  (a minimal sketch of the alternation follows)
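A minimal, unweighted block-coordinate-descent sketch of this alternation (the per-utterance weighting used in the slides is omitted for brevity; names and defaults are mine):

import numpy as np

def learn_dictionary(S, K, n_iter=10, lam=1.0, seed=0):
    """Alternating SC / DU sketch of dictionary learning.

    S : (M, N) matrix whose columns are utterance supervectors
    K : number of dictionary atoms (columns of D)
    """
    rng = np.random.default_rng(seed)
    M, N = S.shape
    D = rng.standard_normal((M, K)) * 0.01
    for _ in range(n_iter):
        # SC step: ridge-regression codes with the dictionary D held fixed
        A = D.T @ D + lam * np.eye(K)
        codes = np.linalg.solve(A, D.T @ S)          # (K, N)
        # DU step: least-squares dictionary with the codes held fixed
        G = codes @ codes.T + 1e-8 * np.eye(K)       # small jitter for stability
        D = (S @ codes.T) @ np.linalg.inv(G)
    return D, codes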
DU: Algorithmic opportunity
• Keeping the coefficients fixed, the DU step reduces to solving a least-squares problem for the dictionary
• Comparing with JFA ML estimation:
  E-step -> Posterior means and correlation matrices of the latent factors
  M-step -> For each F-dim subset of rows (one per mixture), solve an independent linear system
• No explicit matrix inversions in the SC step -> about 2 times faster
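To make the comparison concrete, here is a hypothetical DU accumulation that mirrors the JFA M-step sketch above but uses only the point-estimate codes (outer products w w^T in place of posterior correlation matrices); this is my illustration of the difference, not the authors' implementation:

import numpy as np

def du_step(stats):
    """DU step with fixed point-estimate codes.

    stats: iterable of (N, F_centered, w) with
      N          : (C,)   zero-order counts
      F_centered : (C, F) centered first-order stats
      w          : (K,)   point-estimate code from the SC step
    Returns the updated dictionary of shape (C*F, K).
    """
    stats = list(stats)
    C, F = stats[0][1].shape
    K = stats[0][2].shape[0]
    A = np.zeros((C, K, K))
    B = np.zeros((C, F, K))
    for N, F_c, w in stats:
        A += N[:, None, None] * np.outer(w, w)        # replaces E[y y^T]
        B += F_c[:, :, None] * w[None, None, :]
    D_new = np.zeros((C, F, K))
    for c in range(C):
        D_new[c] = np.linalg.solve(A[c].T, B[c].T).T  # one system per mixture
    return D_new.reshape(C * F, K)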
Hybrid approach
• JFA -> full ML update based on posterior means and correlation matrices
• DU -> update based on point-estimate codes only
• Hybrid -> DU-style update in which the per-utterance weighting is replaced by the average weighting matrix from the training utterances
Scoring
• Notions of model and test segments are blurred
  – Both are treated as signals to be encoded on the dictionary
• Given two utterances:
  1. Signal coding -> encode each utterance on the dictionary
  2. Compensation -> zero out the channel coefficients of each code
  3. Similarity computation -> cosine similarity in SV space between the compensated representations, where several candidate weighting matrices can serve as the metric (a sketch follows)
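A hypothetical sketch of this three-step scoring, assuming the dictionary codes have already been computed and that the channel atoms are identified by their column indices; the metric argument stands in for whichever candidate weighting is chosen:

import numpy as np

def cosine_score(w1, w2, D, channel_idx, metric_diag=None):
    """Encode -> zero channel coefficients -> reconstruct in supervector
    space -> (weighted) cosine similarity.

    w1, w2      : (K,) dictionary codes of the two utterances
    D           : (M, K) dictionary
    channel_idx : indices of the channel atoms to zero out
    metric_diag : (M,) optional diagonal metric; identity if None
    """
    def compensate_and_project(w):
        w = w.copy()
        w[channel_idx] = 0.0          # drop channel coefficients
        return D @ w                  # back to supervector (SV) space

    s1, s2 = compensate_and_project(w1), compensate_and_project(w2)
    if metric_diag is None:
        metric_diag = np.ones(D.shape[0])
    num = s1 @ (metric_diag * s2)
    den = np.sqrt(s1 @ (metric_diag * s1)) * np.sqrt(s2 @ (metric_diag * s2))
    return num / den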
Experimental setup
• SWB-I database:
  – 520 speakers, balanced in gender, with 4,856 speech files
  – Telephone speech (approx. 70% electret, 30% carbon-button handsets)
  – Two balanced partitions, P1 and P2
• Parameterization*:
  – 38 MFCCs (19 static + deltas), band-limited to 300-3400 Hz, every 10 ms with a 20 ms Hamming window
• UBM: 2048-mixture GMM trained on P2 data
• Simple dictionary (eigenchannel configuration) also learned from P2
* Feature extraction and UBM training done with MIT-LL software
Analysis of DL procedure
• Evaluate the distance between subspaces:
  – Use the projection distance, computed from orthonormal bases for the subspaces (a sketch follows)
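One common definition of the projection distance, assumed here since the exact formula on the slide is not recoverable:

import numpy as np

def projection_distance(A, B):
    """Projection distance between the column spaces of A and B.

    A, B : (M, k) matrices spanning the two subspaces.
    """
    Qa, _ = np.linalg.qr(A)   # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis for span(B)
    # Frobenius norm of the difference of the orthogonal projectors,
    # equivalently sqrt(2k - 2*||Qa^T Qb||_F^2) when both have rank k.
    return np.linalg.norm(Qa @ Qa.T - Qb @ Qb.T, ord='fro')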
Analysis of DL procedure
• Examine recognition accuracy:
  – Closed-set identification on P1
    • 2,408 utterances from 260 speakers -> 33,866 trials
    • Model and test utterances encoded and compensated
    • Cosine similarity score as the metric

  Dimension | 128   | 64    | 32
  JFA       | 95.0% | 94.5% | 93.3%
  Hybrid    | 94.9% | 94.5% | 93.3%
  DU        | 94.9% | 94.5% | 93.3%

  – No apparent degradation from either alternative
  – How these results hold for reduced amounts of data will be studied in the near future
Analysis of encoding and scoring (I)
• Model and test utterances encoded the same way
• Cosine similarity score
Analysis of encoding and scoring (II)
• Verification experiments:
  – Leave-one-out on 2,408 utterances from P1
    • 33,866 target trials and 5,764,598 non-target trials
  – Only U and D: slightly better to treat model and test utterances the same way
  – Full JFA with U, V, and D: mixed results (some tasks favor encoding both the same way, others do not)
  – Mixed results for SRE 2010
Discussion (Speculation)
• Many public resources for dictionary learning:
  – K-SVD and multiple variations, FOCUSS-DL, MOD
• Discriminatively trained dictionaries:
  – [Mairal et al., 2008], in Proc. CVPR
  – Emphasize that the ultimate goal is discrimination, not representation
• Sparsity-inducing priors (e.g., Laplacian prior):
  – LASSO [Tibshirani, 1996]
  – With reduced amounts of data or image occlusion, L1 regularization has proved extremely effective [Wright et al., 2008, PAMI]
Conclusions
• Different perspective on JFA
• Algorithmic suggestion for ML subspace training
• Explored different scorings and metrics
• Opportunities for cross-pollination
Acknowledgments
• Special thanks to MIT-LL for the binaries used to parameterize the data and compute the UBM
Sparse Coding (SC)
SRE 2010