The two “R” of Speech Processing

UEF Summer School Joensuu, Finland / Aug 15, 2018


Dr. DAYANA RIBAS

ViVoLab, Zaragoza University, Spain https://sites.google.com/view/dayanaribas/presentations

Outline
A. Introduction
B. Speech processing: challenges and trends
C. From ROBUST to REALISTIC speech processing
D. Future perspectives

Introduction
Dayana Ribas González, originally from Havana, Cuba
Research Scientist, PhD
https://sites.google.com/view/dayanaribas
Research interests:
- Robust speech processing
- Speaker recognition in noisy environments
- Speech quality measures
- Realism in robust speech and language processing

[Map: research career path across UCLV Santa Clara, CENATAV La Habana, UT Dallas (USA), INRIA Nancy, ViVoLab Zaragoza, and UEF Joensuu]

Speech
A convenient and minimally invasive way to interact with the user:
- Spontaneity
- Portability
- Does not require expensive equipment

Speech Processing: Applications
COMMERCIAL
- Domotics: home automation, in-car GPS, assistance for the handicapped, hearing aids, interface personalization, intelligent answering machines
- Automatic transcription: meetings/lectures, closed captioning, TV/radio streams
- Biometric authentication: voice services (e.g. banking), commercial transactions by phone/website
SECURITY / POLICE
- Access control: physical facilities, networks, websites
- Remote verification
- Forensic surveillance and identification
- Home parole

Speech Processing: Scenarios
OUTDOOR: wind, street noises, traffic, train, engine
INDOOR: music, TV, conversation, voices, console games, air conditioning
MEETING (cocktail party effect): applause, cutlery, moving chairs, typing

ROBUST Speech Processing
APPLICATION (e.g. access control) + SCENARIO (e.g. outdoor: voices, wind, street noises, train, engine, traffic) → CHALLENGES & TRENDS

Challenges
Speaker-based variability: the range of changes in how a speaker produces the speech signal, including within- and between-speaker variability.
- Stress
- Vocal effort/style
- Emotion
- Physiological changes
- Spoofing attacks

Challenges
Speaker-based variability: vocal effort (Lombard effect)
Neutral speech: 7% error; Lombard speech: 19% error

[1] John Hansen, Hynek Boril: “Robustness in Speech, Speaker, and Language Recognition: You’ve Got to Know Your Limitations”, Interspeech 2016.

Challenges Speaker-based variability: Spoofing attacks

[2] Zhizheng Wu and Haizhou Li: “Voice conversion and spoofing attack on speaker verification systems“, Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

Challenges
Technological or external variability: how and where the audio was captured.
- Channel/session mismatch
- Environmental distortions
- Distant microphone
- Reverberation
- Data quality
- Short-duration utterances

Challenges
Technological or external variability: environmental distortions
Clean speech: 3% error; reverberated speech: 13% error; noisy speech: 28% error

[3] Dayana Ribas, Emmanuel Vincent, José R. Calvo: “Full multicondition training for robust i-vector based speaker recognition”, Interspeech 2015.

Challenges
Technological or external variability: short-duration utterances
Speech duration of 2.5 min: 3% error; 2 sec: 30% error

[4] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, M. Mason: “i-vector Based Speaker Recognition on Short Utterances”, Interspeech 2011.

Trends
(Mostly based on speaker recognition technology)
- Robust features
- Alternative acoustic modeling
- Back-end classifiers
- Score calibration and fusion
- End-to-end systems

Speaker Recognition System
[Diagram: speech signal → feature extraction → frame-level feature vectors]

Trends: Robust features
- Delayed decision making
+ Text and language independence
+ Real-time recognition
High-level features: phones, idiolect (personal lexicon), semantics, accent, pronunciation. These reflect socio-economic status, language background, personality type, and parental influence.
Physiological (organic) and voice-source features: spectrum (MFCC, LFCC), formants, glottal pulse features. These reflect the length and dimensions of the vocal tract.
Data-driven modeling techniques have been more successful at improving robustness than new features, mainly when large datasets are used to train strong discriminative models.
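Spectral features such as MFCC and LFCC start from short overlapping frames of the waveform. As a minimal illustration of that first stage (not from the slides; the pre-emphasis coefficient and frame sizes are common but arbitrary choices), here is framing plus log energy in plain Python:

```python
import math

def frame_log_energies(signal, sample_rate=16000, frame_ms=25, hop_ms=10, preemph=0.97):
    """Split a waveform into overlapping frames and return the log energy of each.
    This is the first stage of spectral features such as MFCC/LFCC."""
    # Pre-emphasis boosts high frequencies: x'[n] = x[n] - a * x[n-1]
    emphasized = [signal[0]] + [signal[n] - preemph * signal[n - 1]
                                for n in range(1, len(signal))]
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    energies = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len]
        energy = sum(s * s for s in frame)
        energies.append(math.log(energy + 1e-10))  # floor avoids log(0) on silence
    return energies

# Example: a 0.1 s, 100 Hz sine at 16 kHz gives one value per 10 ms hop
signal = [math.sin(2 * math.pi * 100 * n / 16000) for n in range(1600)]
print(len(frame_log_energies(signal)))  # 8 frames
```

A real front-end would follow each frame with a window, an FFT, and a mel filterbank before the log.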

Trends: Acoustic Modeling
GMM-based methods: the probabilistic nature of the Gaussian Mixture Model (GMM) represents the large variability of speech data through unsupervised modeling of the acoustic classes in the speech signal.
1995, GMM [5]
[Diagram: feature vectors from speakers Sx1 and Sx2, each modeled by a per-speaker GMM]

2000, GMM-UBM: speaker model adapted from a Universal Background Model (UBM) [6]
2006, GMM-SVM: introducing the discriminating ability of the SVM over GMM mean supervectors (m1, m2, m3, ..., mx) [7]
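The GMM-UBM step adapts the universal model toward a speaker's data with relevance MAP. A minimal sketch of mean-only adaptation for a single Gaussian (the relevance factor of 16 is a common choice, not from the slides; full systems also weight frames by Gaussian posteriors):

```python
def map_adapt_mean(ubm_mean, speaker_frames, relevance=16.0):
    """Relevance-MAP adaptation of one Gaussian mean (GMM-UBM style).

    The adapted mean interpolates between the speaker data and the UBM prior:
        alpha = n / (n + r)
        m_adapted = alpha * E[x] + (1 - alpha) * m_ubm
    With little data (small n) the model stays close to the UBM."""
    n = len(speaker_frames)
    if n == 0:
        return list(ubm_mean)  # no data: keep the UBM prior
    dim = len(ubm_mean)
    data_mean = [sum(f[d] for f in speaker_frames) / n for d in range(dim)]
    alpha = n / (n + relevance)
    return [alpha * data_mean[d] + (1 - alpha) * ubm_mean[d] for d in range(dim)]

ubm = [0.0, 0.0]
frames = [[1.0, 2.0]] * 16          # 16 identical 2-D frames
print(map_adapt_mean(ubm, frames))  # alpha = 16/32 = 0.5 -> [0.5, 1.0]
```

The same interpolation is what makes GMM-UBM robust to short enrollment data: undertrained components simply fall back to the UBM.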

Trends: Acoustic modeling
Factor Analysis: to add robustness against channel variability.
2005, JFA (Joint Factor Analysis): models speaker and channel variability in GMM supervectors, exploiting the behavior of features in a variety of channel conditions learned using FA [8]:
s = m + Vy + Ux + Dz
where m is the speaker-independent component (UBM supervector), Vy the speaker-dependent component, Ux the channel-dependent component, and Dz the residual.
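The JFA decomposition is a sum of linear terms, so composing a supervector from its factors is just matrix-vector arithmetic. A toy composition with hypothetical matrices (dimensions chosen only for illustration; real supervectors have thousands of dimensions and the matrices are learned from data):

```python
def jfa_supervector(m, V, y, U, x, D, z):
    """Compose a JFA supervector: s = m + V y + U x + D z.

    m: UBM mean supervector (speaker-independent)
    V y: speaker-dependent component (eigenvoices)
    U x: channel-dependent component (eigenchannels)
    D z: speaker-specific residual (D is diagonal)."""
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    Vy = matvec(V, y)
    Ux = matvec(U, x)
    Dz = [D[i] * z[i] for i in range(len(z))]   # diagonal D stored as a vector
    return [m[i] + Vy[i] + Ux[i] + Dz[i] for i in range(len(m))]

# Toy dimensions: supervector dim 3, speaker factor dim 2, channel factor dim 1
m = [1.0, 1.0, 1.0]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [0.5, -0.5]
U = [[2.0], [0.0], [1.0]]
x = [0.1]
D = [0.1, 0.1, 0.1]
z = [1.0, 1.0, 1.0]
print(jfa_supervector(m, V, y, U, x, D, z))  # sum of the four components
```

The robustness comes from discarding Ux at scoring time: the channel-dependent part is modeled explicitly so it can be removed.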

Trends: Acoustic modeling
Factor Analysis: to add robustness against channel variability.
2009, i-vector: reduces the supervector dimension using FA before classification [9]:
s = m + Tw
where m is the speaker-independent component, T the total variability matrix, and w the identity vector (i-vector). Scoring compares the i-vectors (i1, i2, i3, ..., ix) of two utterances to decide on the speaker identity.
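The slides leave the scoring block abstract; cosine similarity is one common parameter-free back-end for comparing i-vectors (an illustrative choice, not necessarily the one used in the cited systems):

```python
import math

def cosine_score(ivec1, ivec2):
    """Cosine similarity between two i-vectors: a common, parameter-free
    scoring back-end. Higher scores suggest the same speaker."""
    dot = sum(a * b for a, b in zip(ivec1, ivec2))
    norm1 = math.sqrt(sum(a * a for a in ivec1))
    norm2 = math.sqrt(sum(b * b for b in ivec2))
    return dot / (norm1 * norm2)

enroll = [0.3, -0.9, 0.1]      # i-vector from the enrollment utterance (toy values)
same = [0.28, -0.85, 0.12]     # test utterance pointing in a similar direction
diff = [-0.6, 0.2, 0.7]        # test utterance pointing in a different direction
print(cosine_score(enroll, same) > cosine_score(enroll, diff))  # True
```

In practice the score is compared to a calibrated threshold to make an accept/reject decision.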

Trends: Back-end classifiers
2011, i-vector + PLDA: classification based on Probabilistic Linear Discriminant Analysis, scoring pairs of i-vectors extracted with the total variability model (s = m + Tw) [10]
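As a hedged sketch of what PLDA scoring computes, here is the two-covariance log-likelihood ratio in one dimension (real PLDA operates on high-dimensional i-vectors with matrices estimated from data; all numbers here are toy values):

```python
import math

def plda_llr(x1, x2, between_var, within_var):
    """Two-covariance PLDA log-likelihood ratio for 1-D embeddings.
    Model: x = speaker + noise, speaker ~ N(0, between_var), noise ~ N(0, within_var).

    Same-speaker hypothesis: (x1, x2) jointly Gaussian with covariance
    [[b+w, b], [b, b+w]]; different-speaker: independent N(0, b+w)."""
    def log_gauss2(x, cov):
        (a, b_), (c, d) = cov
        det = a * d - b_ * c
        inv = [[d / det, -b_ / det], [-c / det, a / det]]  # inverse of 2x2 cov
        quad = (x[0] * (inv[0][0] * x[0] + inv[0][1] * x[1])
                + x[1] * (inv[1][0] * x[0] + inv[1][1] * x[1]))
        return -0.5 * (quad + math.log(det) + 2 * math.log(2 * math.pi))

    b, w = between_var, within_var
    same = log_gauss2([x1, x2], [[b + w, b], [b, b + w]])
    diff = log_gauss2([x1, x2], [[b + w, 0.0], [0.0, b + w]])
    return same - diff

# Close embeddings score higher than distant ones
print(plda_llr(1.0, 1.1, between_var=1.0, within_var=0.1)
      > plda_llr(1.0, -1.0, between_var=1.0, within_var=0.1))  # True
```

The LLR form is what makes PLDA scores directly usable for calibration and fusion, one of the trend items listed above.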

Trends: Deep Learning
DNN-based systems: replacing parts of the system with a DNN.
2012, indirect method: uses a DNN to extract data that is then used to train a secondary classifier for the recognition task [11-18], e.g.:
- Front-end: bottleneck features (CNN, RNN)
- UBM: accumulating a Gaussian vector with a TDNN
- Scoring: replacing PLDA with a DNN

Trends: Deep learning
DNN-based systems: embeddings and end-to-end systems.
2015, direct method: uses a DNN trained as a classifier for the recognition task, directly discriminating among speakers, phones, languages, etc. [19-23]
[Diagram: DNN with output units speaker 1, speaker 2, ..., speaker n]
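A key ingredient of such embedding networks is the pooling layer that maps a variable-length utterance to a fixed-size vector. A sketch of x-vector style statistics pooling (the surrounding network layers are omitted; frame activations here are toy values):

```python
import math

def stats_pooling(frame_features):
    """Statistics pooling, as used in x-vector style embedding networks:
    collapse a variable-length sequence of frame-level activations into one
    fixed-length vector by concatenating per-dimension mean and std."""
    n = len(frame_features)
    dim = len(frame_features[0])
    means = [sum(f[d] for f in frame_features) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frame_features) / n)
            for d in range(dim)]
    return means + stds  # embedding of size 2 * dim, independent of n

# Utterances of different lengths map to the same embedding size
short = stats_pooling([[1.0, 2.0], [3.0, 4.0]])
longer = stats_pooling([[0.0, 1.0]] * 50)
print(len(short), len(longer))  # 4 4
```

After pooling, fully connected layers produce the embedding that is compared at test time, typically with PLDA or cosine scoring.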

Results
[Chart: EER (%) of state-of-the-art systems across NIST evaluations.
Systems timeline: GMM-UBM (2000), GMM-SVM (2006), JFA (2008), i-vector (2011, 2012), DNN bottleneck (2014), embeddings/x-vectors (2017).
Reported operating points: 4.47% and 3.19% [I4U, 2008], 8.10% [SoA, 2010], 3.45% [I4U, 2012], 16.21% [I2R, 2016], 13.6-16.67% [JHU-I2R, 2016], 11.9% [JHU, 2017].
Evaluation conditions: NIST 2008 (channel mismatch USA-USA, telephone and microphone speech, mostly English), NIST 2012 (additive noise), NIST 2016 (channel mismatch USA-Asia, Asian languages).]


New challenges: NIST 2018
- Language shift: the vast majority of the training data consists of English speakers, while the test data consists of Arabic speakers.
- Channel mismatch: significant mismatch in channel conditions; the recordings are made in different parts of the world.
- Distortions: VoIP.
- Found data: audiovisual samples from YouTube.

FROM ROBUST TO REALISTIC SPEECH PROCESSING
The technology transfer of laboratory-based solutions to real-world applications.
Unrealistic speech (e.g. NIST telephone speech datasets) vs. realistic speech in the real scenario (babble noise, etc.): how can robust speech processing bridge this gap? Acoustic, speaker, and language variabilities cause systems to underperform in practice.

Problem: early evaluations in practical target scenarios are hardly feasible.
DATASET options: simulated data (acoustically realistic?), real datasets (ecologically realistic?), found data (what are the characteristics of the collection?).
Consequences:
- Satisfactory performance of the tested methods on scenarios that will never happen in practice, while their performance may be much worse in real scenarios.
- Complex and expensive methods may be developed that are not actually required.
- Intensive use of data with little understanding of the impact this data has on final solutions (e.g. DNN-based solutions).

A suitable dataset for a given task will strongly depend on how well it matches the acoustic characteristics of the target application scenarios.

Problem: there is a lack of commonly used guidelines about which conditions are actually present in a given application scenario.
EXPERIMENTAL PROTOCOL: simulation tools (realistic?), experimental setup (guidelines?). METHOD REQUIREMENTS: robust methods (practical value?).
Consequences:
- Difficulty comparing algorithms and drawing solid conclusions about performance.
- Some method requirements are difficult to meet in practice, resulting in unreliable solutions, for example: huge amounts of training data, specific data characteristics, high computational cost for online execution, or unrealistic assumptions (e.g. a clean speech reference in enhancement methods).

THE RESEARCH COMMUNITY'S INTEREST IN THE PROBLEM OF REALISM IS INCREASING
Challenges and evaluations have evolved toward more realistic conditions, e.g. CHiME 2014, NIST-SRE 2016 and 2018, and the Paralinguistic Challenges at Interspeech. Prestigious journals have included special issues on related topics:
- “Realism in robust speech processing”, D. Ribas, E. Vincent, J.H.L. Hansen, Interspeech, 2016.
- “Multi-microphone speech recognition in everyday environments”, J. Barker, R. Marxer, E. Vincent, Computer Speech and Language, Elsevier, 2017.
- “Realism in robust speech and language processing”, D. Ribas, E. Vincent, J.H.L. Hansen, Speech Communication, Elsevier, in process, 2018.

https://sites.google.com/view/dayanaribas/realism-in-speech-processing

Trends: Characterization of real scenarios
- Position papers about best practices
- Objective experimental characterization of the acoustic conditions: reverberation, noise, sensor variability, source/sensor movement, environment change, etc. [24]
- Objective experimental characterization of the speech characteristics: spontaneous speech, number of speakers, vocal effort, effect of age, non-neutral speech, etc. [1]
- Objective experimental characterization of the language variability

Speech distortion conditions in real scenarios
[Diagram: applications placed along three dimensions: scenario location (indoor/outdoor), speaker-microphone distance (close/far), and noise level. Examples: access control, meeting, home automation, forensics, web-driven voice application platform, intelligent answering machine, home parole.]

PROCEDURE: dataset selection for the experimental setup
1. APPLICATION: define the application of interest
2. SCENARIO: define the potential scenario
3. INTERFACE: define the recording interface
4. D: define the speaker distance
5. LOCATION: identify the scenario location
6. Characterize the distortion type:
   a) Reverberation: if location = indoor, then define RT60 (s) and compute the DRR
   b) Noise: if you have access to the target scenario, then measure LN,A and LS,A,mic; else get LN,A from a dataset and predict LS,A,mic
7. SNR: compute the SNR

D. Ribas, E. Vincent, J.R. Calvo, “A study of speech distortion conditions in real scenarios for speech processing applications”, IEEE-SLT 2016.
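Steps 6-7 of the procedure predict the speech level at the microphone and derive an SNR. A minimal sketch, assuming free-field spherical spreading for the distance correction and SNR as the difference of A-weighted speech and noise levels (simplifications not claimed by the paper; real rooms add reverberant energy, which the cited study handles with its own corrections). All numeric values below are hypothetical:

```python
import math

def level_at_distance(level_at_1m_db, distance_m):
    """Predict the A-weighted speech level at the microphone from the level
    at 1 m, assuming free-field spherical spreading (6 dB per doubling of
    distance). A simplification: real rooms add reverberant energy."""
    return level_at_1m_db - 20 * math.log10(distance_m)

def snr_db(speech_level_db, noise_level_db):
    """SNR in dB as the difference of A-weighted speech and noise levels."""
    return speech_level_db - noise_level_db

# Hypothetical case: speech at 60 dB(A) at 1 m, microphone at 2 m,
# background noise at 40 dB(A)
snr = snr_db(level_at_distance(60.0, 2.0), 40.0)
print(round(snr, 1))  # 14.0
```

Sweeping the distance and noise-level ranges of a target scenario through such a model yields the kind of SNR range the study reports per application.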

Speech distortion conditions in real scenarios
[Example: Application: Access Control; Scenario: Office; D = 30-70 cm; Location: indoor; RT60 = 0.25-0.43 s; DRR = -2 to 9 dB; LN,A = 40-60 dB; LS,A,1m(1) = 52-69 dB; Cdist = 1; resulting SNR = 1.3-27.5 dB.
Plot: SNR (dB, 0-30) vs. DRR (dB, -2 to 10) for combinations of LN,A, RT60, and LS,A,1m(1): (40, 0.25, 52-58), (40, 0.43, 52-58), (60, 0.25, 61-69), (60, 0.43, 61-69).]


Trends: Robust methods and evaluations
INDEPENDENT EFFORTS: evaluate results in conditions as realistic as possible.
- Robust methods and systems: enhancement, adaptation, robust features/models, etc. [25-28]
- Evaluations on real speech, real vs. simulated datasets [29]
- Challenges (CHiME, NIST-SRE, Paralinguistics Challenge, etc.)
BIG COMPANIES' STRATEGY: collect huge amounts of training data that are as real as possible.

Trends: Datasets
- Data collection protocols [30]
- Data simulation algorithms/toolkits [31]
- New datasets [30]
- Use of datasets [32,33]
- Dataset quality measures

The REALISTIC paradigm has to be a multidisciplinary experience and a collaborative network: motivating the interrelation among researchers and practitioners, and exchanging experiences from diverse points of view.

Conclusions
The acoustic variability in the application scenarios of speech-related technologies raises several robustness problems due to environmental noise, reverberation, and source or sensor movement, among others. For many decades, the development of methods following the ROBUST PARADIGM has been a main goal in the speech processing field.

Conclusions
Beyond the independent efforts of authors to evaluate their results in conditions as realistic as possible, or heavy system training, there is a need to establish a consistent research line that addresses the current limitations interfering with the development of realistic solutions. The present moment calls for more effective solutions rather than continuing to develop new algorithms under the same structural problems.


Be REALISTIC

Kiitos Thank you Gracias

Questions?


References
[1] John H.L. Hansen, Hynek Boril: “Robustness in Speech, Speaker, and Language Recognition: You’ve Got to Know Your Limitations”, Interspeech, 2016.
[2] Zhizheng Wu, Haizhou Li: “Voice conversion and spoofing attack on speaker verification systems”, APSIPA Annual Summit and Conference, 2013.
[3] D. Ribas, E. Vincent, J.R. Calvo: “Full multicondition training for robust i-vector based speaker recognition”, Interspeech, 2015, pp. 1057-1061.
[4] A. Kanagasundaram, R. Vogt, D. Dean, et al.: “i-vector Based Speaker Recognition on Short Utterances”, Interspeech, 2011.
[5] D.A. Reynolds, R.C. Rose: “Robust text-independent speaker identification using GMM”, IEEE Trans. Speech Audio Process. 3, 72-83, 1995.
[6] D.A. Reynolds, T.F. Quatieri, R.B. Dunn: “Speaker Verification Using Adapted Gaussian Mixture Models”, Digital Signal Processing 10, 19-41, 2000.
[7] W.M. Campbell, D. Sturim, et al.: “Support vector machines using GMM supervectors for speaker verification”, IEEE Signal Processing Letters 13(5), 2006.
[8] P. Kenny, G. Boulianne, P. Dumouchel: “Eigenvoice Modeling With Sparse Training Data”, IEEE Trans. Speech Audio Process. 13(3), 2005.
[9] N. Dehak, P. Kenny, R. Dehak, et al.: “Front-End Factor Analysis for Speaker Verification”, IEEE Trans. Audio, Speech and Lang. Proc. 19(4), 2011.
[10] D. Garcia-Romero, C.Y. Espy-Wilson: “Analysis of I-vector Length Normalization in Speaker Recognition Systems”, Interspeech, 2011.
[11] S. Yaman, J. Pelecanos, R. Sarikaya: “Bottleneck Features for Speaker Recognition”, Odyssey, 2012.
[12] T. Yamada, L. Wang, A. Kai: “Improvement of Distant-Talking Speaker Identification Using Bottleneck Features of DNN”, Interspeech, 2013.
[13] A. Lozano-Diez, A. Silnova, P. Matejka, et al.: “Analysis and optimization of bottleneck features for speaker recognition”, Odyssey, 2016, pp. 352-357.
[14] Y. Lei, N. Scheffer, L. Ferrer, et al.: “A novel scheme for speaker recognition using a phonetically-aware deep neural network”, ICASSP, 2014, pp. 1695-1699.
[15] M. McLaren, Y. Lei, N. Scheffer, et al.: “Application of Convolutional Neural Networks to Speaker Recognition in Noisy Conditions”, Interspeech, 2014.
[16] H. Zheng, S. Zhang, W. Liu: “Exploring Robustness of DNN/RNN for Extracting Speaker Baum-Welch Statistics in Mismatched Conditions”, Interspeech, 2015.
[17] D. Snyder, D. Garcia-Romero, D. Povey: “Time delay deep neural network-based universal background models for speaker recognition”, ASRU, 2015.
[18] O. Ghahabi, J. Hernando: “Deep belief networks for i-vector based speaker recognition”, ICASSP, 2014, pp. 1700-1704.
[19] G. Bhattacharya, J. Alam, P. Kenny: “Deep speaker embeddings for short-duration speaker verification”, Interspeech, 2017, pp. 1517-1521.
[20] D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur: “Deep neural network embeddings for text-independent speaker verification”, Interspeech, 2017, pp. 999-1003.
[21] H.-S. Heo, I.-H. Yang, M.-J. Kim, et al.: “Advanced b-vector system based deep neural network as classifier for speaker verification”, ICASSP, 2016.
[22] A. Nautsch, H. Hao, T. Stafylakis, et al.: “Towards PLDA-RBM based speaker recognition in mobile environment”, ICASSP, 2016.
[23] G. Heigold, I. Moreno, S. Bengio, et al.: “End-to-end text-dependent speaker verification”, ICASSP, 2016, pp. 5115-5119.
[24] D. Ribas, E. Vincent, J.R. Calvo: “A study of speech distortion conditions in real scenarios for speech processing applications”, IEEE-SLT, 2016.
[25] D. Ribas, E. Vincent, et al.: “Uncertainty propagation for noise robust speaker recognition: the case of NIST-SRE”, Interspeech, 2015, pp. 3536-3540.
[26] P. Rajan, A. Afanasyev, et al.: “From single to multiple enrollment i-vectors: Practical PLDA scoring variants for speaker recognition”, Digital Signal Processing, Elsevier, vol. 31, 2014, pp. 93-101.
[27] B. Li, T.N. Sainath, et al.: “Acoustic Modeling for Google Home”, Interspeech, 2017, pp. 399-403.
[28] S. Garimella, A. Mandal, et al.: “Robust i-vector based Adaptation of DNN Acoustic Model for Speech Recognition”, Interspeech, 2015.
[29] F. Richardson, M. Brandstein, J. Melot, D. Reynolds: “Speaker Recognition Using Real vs Synthetic Parallel Data for DNN Channel Compensation”, Interspeech, 2016.
[30] N. Bertin, E. Camberlein, E. Vincent, et al.: “A French Corpus for Distant-Microphone Speech Processing in Real Homes”, Interspeech, 2016.
[31] M. Ravanelli, P. Svaizer, M. Omologo: “Realistic Multi-Microphone Data Simulation for Distant Speech Recognition”, Interspeech, 2016.
[32] D.E. Sturim, P.A. Torres-Carrasquillo, J.P. Campbell: “Corpora for the Evaluation of Robust Speaker Recognition Systems”, Interspeech, 2016.
[33] Dataset catalog created by the ISCA SIG on Robust Speech Processing (RoSP). Available at: https://wiki.inria.fr/rosp/Datasets