Deep Learning for Speech Generation and Synthesis

Yao Qian, Frank K. Soong
Speech Group, Microsoft Research Asia

Sep. 13, 2014

Background

- Deep learning has made a huge impact on automatic speech recognition (ASR) research, products and services.
- Speech generation and synthesis is the inverse process of speech recognition: Text-to-Speech (TTS) ↔ Speech-to-Text (STT).
- This tutorial covers deep learning approaches to speech generation and synthesis, with a focus on TTS synthesis and voice conversion.


Outline

- Statistical parametric speech generation and synthesis
  - HMM-based speech synthesis
  - GMM-based voice conversion
- Deep learning
  - RBM, DBN, DNN, MDN and RNN
- Deep learning for speech generation and synthesis
  - Approaches to speech synthesis
  - Approaches to voice conversion
- Conclusions and future work


HMM-based Speech Synthesis [1]

- By reversing the input and output of the Hidden Markov Model (HMM) used in speech recognition, the HMM becomes a generative model for text-to-speech (TTS) synthesis.
- Speech spectrum, fundamental frequency, voicing and duration are modeled simultaneously by HMMs [2].
- Models are trained, and signals are generated, under the same universal Maximum Likelihood (ML) criterion.


HMM-based Speech Synthesis [1]

[Figure: block diagram of the HMM-based TTS system.
Training stage: speech database → feature extraction of the speech signal (Log F0, LSP & gain) → HMM training, guided by a phonetic & prosodic question set → stream-dependent HMMs.
Synthesis stage: text → text analysis → labels → sequence of HMMs → source-filter parameter generation; the Log F0 stream drives mixed-excitation generation, and the LSP & gain stream drives LPC synthesis → synthesized speech.]



Advantages of HMM-based TTS

- The statistical parametric model can be trained efficiently from recorded speech data and the corresponding transcriptions.
- The HMM-based statistical model is parsimonious (i.e., small model size) and data efficient.
- HMMs can generate an "optimal" speech parameter trajectory in the ML sense.


Concatenative vs. HMM-based TTS Synthesis

- HMM-based synthesis
  - Statistically trained
  - Vocoded speech (smooth & stable)
  - Small footprint (less than 2 MB)
  - Easy to modify its voice characteristics
- Concatenative synthesis
  - Sample-based (unit selection)
  - High quality, with occasional glitches
  - Large footprint
  - More difficult to modify its voice characteristics

Technologies

- Multi-Space Distribution (MSD)-HMM for F0 and voicing modeling
- Decision-tree-based context clustering
- Parameter generation with dynamic features
- Mixed excitation
- Global variance
- Minimum generation error training

MSD-HMM for F0 Modeling [3]

- A schematic representation of using MSD-HMM for F0 modeling.
- F0 is only defined in voiced regions, so each observation lies either in a one-dimensional subspace (voiced, the F0 value) or in a zero-dimensional subspace (unvoiced); the MSD-HMM covers both with state-dependent subspace weights.

[Figure: an F0 contour (Hz, vs. time at 10 ms/frame) with state-level subspace weights shown below. In one segment the first-subspace weights dominate (0.83, 0.99, 0.87 across three states, leaving 0.17, 0.01, 0.13 for the remaining mixture components); in another segment they are small (0.05, 0.01, 0.1, leaving 0.95, 0.99, 0.9).]

Context Clustering [4,5]

- Each state of the model has three decision trees: for duration, spectrum and pitch.
- The Minimum Description Length (MDL) criterion is used as the stopping criterion.

An Example of Decision Trees

[Figure: context-dependent triphone models such as k-a+b, t-a+h and s-i+n are clustered by a tree whose nodes ask questions from the question set, e.g. "C=fricative?", "C=Vowel?", "R=silence?", "R='cl'?", "L='gy'?", "L=voiced?" (C: current phone, L: left context, R: right context).]

Parameter Generation [6]

Speech parameter generation from HMM under dynamic feature constraints. For a given HMM λ, determine the speech parameter vector sequence O = [o₁ᵀ, o₂ᵀ, o₃ᵀ, …, o_Tᵀ]ᵀ, with oₜ = [cₜᵀ, Δcₜᵀ, Δ²cₜᵀ]ᵀ, which maximizes

    P(O | λ) = Σ_{all Q} P(O, Q | λ)

For a given state sequence Q, maximize P(O | Q, λ) with respect to O = WC:

    log P(O | Q, λ) = −(1/2) Oᵀ U⁻¹ O + Oᵀ U⁻¹ M + K

Setting ∂ log P(WC | Q, λ) / ∂C = 0 yields the linear system

    Wᵀ U⁻¹ W C = Wᵀ U⁻¹ M
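To make the closed-form solution concrete, here is a minimal NumPy sketch (my own illustration, not the tutorial's implementation) that builds a delta/delta-delta window matrix W for one feature dimension and solves Wᵀ U⁻¹ W C = Wᵀ U⁻¹ M; `means` and `variances` stand for the stacked state statistics M and diag(U), assumed given from the chosen state sequence.

```python
import numpy as np

def mlpg(means, variances, T):
    """ML parameter generation for one static dimension.
    means, variances: (3T,) stacked [static, delta, delta-delta]
    statistics per frame from the chosen HMM state sequence Q.
    Returns the T static parameters C solving (W'U^-1 W)C = W'U^-1 M."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                      # static window: c_t
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1] = -0.5         # delta: 0.5*(c_{t+1} - c_{t-1})
            W[3 * t + 1, t + 1] = 0.5
            W[3 * t + 2, t - 1] = 1.0          # delta-delta: c_{t-1} - 2c_t + c_{t+1}
            W[3 * t + 2, t] = -2.0
            W[3 * t + 2, t + 1] = 1.0
    U_inv = np.diag(1.0 / variances)
    A = W.T @ U_inv @ W                        # W' U^-1 W
    b = W.T @ U_inv @ means                    # W' U^-1 M
    return np.linalg.solve(A, b)               # the smooth trajectory C
```

Practical implementations exploit the band structure of Wᵀ U⁻¹ W with a banded Cholesky solver rather than the dense solve used here.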

Generated Speech Parameter Trajectory [7]

[Figure: example of speech parameter trajectories generated under the dynamic feature constraints.]

Mixed Excitation [8,9]

- Traditional excitation model ("hiss-buzz" model): a binary switch between periodic pulses (voiced) and noise (unvoiced).
- A better excitation model (mixed excitation model): periodic and noise components are mixed, reducing buzziness.

Global Variance (GV) [10]

- The "global variance" is the variance of the speech features computed over an utterance.
- It is modeled by a single Gaussian with mean μᵥ and covariance Σᵥ, and parameter generation is modified so that the generated trajectory's GV stays close to that of natural speech, counteracting over-smoothing.
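As a tiny illustration (assuming per-utterance feature matrices), the GV statistic itself is just the per-dimension variance over the frames of one utterance:

```python
import numpy as np

def global_variance(c):
    """Per-utterance 'global variance' v(c) of a feature trajectory.
    c: (T, D) array of generated or natural parameters; returns (D,)."""
    return np.var(c, axis=0)
```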

Minimum Generation Error Training [11]

- Train the model by maximizing the likelihood of the training data:

    λ* = argmax_λ Σ_{all Q} P(O, Q | λ),   with O = WC

- Refine the model by minimizing a weighted generation error:

    λ̃ = argmin_λ Σ_{all Q} P(Q | λ, O) D(C, C̃)

  where C̃ is the parameter trajectory generated from λ given Q, and D(·,·) is a generation error measure (e.g., Euclidean distance).
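A minimal sketch of the error term only (the weighting by P(Q | λ, O) and the sum over state sequences are omitted; pairing with the mlpg() sketch above is my own illustration):

```python
import numpy as np

def generation_error(c_natural, c_generated):
    """Euclidean generation error D(C, C~) between the natural and the
    model-generated trajectories -- the quantity MGE training minimizes."""
    return float(np.sum((c_natural - c_generated) ** 2))

# e.g., with the mlpg() sketch above:
#   err = generation_error(c_nat, mlpg(means, variances, T))
```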

GMM-based Voice Conversion [12,15]

- Model the joint feature vector of the source and target speakers, z = [xᵀ, yᵀ]ᵀ, with a GMM.

GMM-based Voice Conversion [12,13]

- Maximum likelihood conversion: generate the target feature trajectory that maximizes its likelihood given the source features under the joint model.
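For illustration, here is a sketch of the simpler frame-wise (MMSE) variant of GMM conversion, rather than the ML trajectory version on this slide: fit a GMM on joint vectors z = [x; y], then map each frame with the conditional mean E[y | x]. The function names and the use of scikit-learn/SciPy are my own choices, not the tutorial's.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_mix=8):
    """Fit a full-covariance GMM on joint vectors z = [x; y].
    x, y: (T, D) time-aligned source and target features."""
    return GaussianMixture(n_mix, covariance_type='full').fit(np.hstack([x, y]))

def convert(gmm, x):
    """Frame-wise MMSE mapping: y_hat = sum_m P(m | x) E[y | x, m]."""
    D = x.shape[1]
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    Sxx, Syx = gmm.covariances_[:, :D, :D], gmm.covariances_[:, D:, :D]
    M = gmm.n_components
    # responsibilities P(m | x) under the marginal GMM over x
    lik = np.stack([multivariate_normal.pdf(x, mu_x[m], Sxx[m])
                    for m in range(M)], axis=1)            # (T, M)
    post = gmm.weights_ * lik
    post /= post.sum(axis=1, keepdims=True)
    # conditional means E[y | x, m] = mu_y + Sigma_yx Sigma_xx^-1 (x - mu_x)
    y_hat = np.zeros((x.shape[0], mu_y.shape[1]))
    for m in range(M):
        cond = mu_y[m] + (x - mu_x[m]) @ np.linalg.solve(Sxx[m], Syx[m].T)
        y_hat += post[:, m:m + 1] * cond
    return y_hat
```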

Limitations

- Vocoder
  - The synthesized voice carries a characteristic, traditional "vocoded" flavor.
- Acoustic modeling
  - "Wrong" model assumptions made out of necessity, e.g., GMMs with diagonal covariances.
  - Greedy, hence suboptimal, search in deriving the decision-tree state clustering (tying).


Deep Learning [16]

- A machine learning approach that learns high-level abstractions in data using multi-layered models.
  - Typically neural nets with 3 or more hidden layers.
- Motivated by the human brain, which organizes ideas and concepts hierarchically.
- DNNs were difficult to train in the 1980s: training was slow, and gradients vanished (especially in RNNs).

Deep Learning [16]

- DNN training has improved significantly since 2006, thanks to:
  - GPUs
  - Big data
  - Pre-training
- Different architectures, e.g. DNN, CNN, RNN and DBN, are used for computer vision, speech recognition and synthesis, and natural language processing.

Restricted Boltzmann Machine (RBM) [17]

- "Restricted": no interactions between hidden variables, and none between visible variables.
- The joint distribution is defined in terms of an energy function; for binary units,

    P(v, h) = exp(−E(v, h)) / Z,   E(v, h) = −aᵀv − bᵀh − vᵀWh

- Parameters can be estimated by contrastive divergence learning [18].
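A minimal NumPy sketch of one CD-1 update for a binary RBM (my own illustrative code; persistent chains, momentum and weight decay are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: (batch, V) visible data; W: (V, H) weights; a: (V,) visible
    bias; b: (H,) hidden bias. Updates parameters in place."""
    ph0 = sigmoid(v0 @ W + b)                  # positive phase P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + a)                # one Gibbs step back to v
    ph1 = sigmoid(pv1 @ W + b)                 # P(h=1 | v1)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```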

Deep Belief Network (DBN) [17]

- Each successive pair of layers is treated as an RBM.
- RBMs are stacked to form a DBN.

Deep Neural Net (DNN)

- A feed-forward artificial neural network with multiple hidden layers:

    y_j = f(x_j),   where x_j = b_j + Σᵢ yᵢ w_ij

- Nonlinear activation functions f:

    Sigmoid:                    f(x_j) = 1 / (1 + e^(−x_j))
    Hyperbolic tangent (tanh):  f(x_j) = (e^(x_j) − e^(−x_j)) / (e^(x_j) + e^(−x_j))
    Rectified linear (ReLU):    f(x_j) = max(x_j, 0)
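A small sketch putting these pieces together: the activations and a forward pass where hidden layers use f and the output layer is linear (the layer layout is my assumption for illustration):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(x, 0.0)      # np.tanh covers tanh

def forward(x, layers, f=sigmoid):
    """Feed-forward pass: at each node, y_j = f(b_j + sum_i y_i w_ij).
    layers: list of (W, b) pairs; the last layer is kept linear,
    as is typical for regression-style TTS outputs."""
    h = x
    for W, b in layers[:-1]:
        h = f(h @ W + b)
    W, b = layers[-1]
    return h @ W + b
```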

DNN

- Discriminative models are trained via back-propagation (BP) [19] with an appropriate cost function.
- A "mini-batch" based stochastic gradient descent algorithm is used.
- Improved acoustic models have enhanced speech recognition performance significantly [17].

DNN Pre-training

- Initialize the weights to a better starting point than random initialization, prior to BP.
  - DBN [17]: layer-by-layer trained RBMs; unsupervised training via contrastive divergence.
  - Layer-wise BP [20]: add one hidden layer after another; supervised training with BP.

Mixture Density Networks (MDN) [21]

- An MDN combines a conventional neural network with a mixture density model.
- A conventional neural network can represent arbitrary functions.
- In principle, an MDN can represent arbitrary conditional probability distributions.

Mixture Density Networks (MDN) [21]

- The output is a mixture density over the target:
  - This makes the neural network a generative model rather than a regression model.
- Activation functions:
  - Sigmoid or hyperbolic tangent for hidden nodes.
  - Softmax for mixture weights, exponential for variances, and identity for means.
- Criterion: maximum likelihood estimation (MLE).
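A sketch of how these output activations map raw network outputs to mixture parameters, assuming M mixtures over a D-dimensional target with one shared variance per mixture (the output layout is my assumption):

```python
import numpy as np

def mdn_params(z, M, D):
    """Split raw outputs z (batch, M*(2+D)) into mixture parameters:
    softmax -> weights, exponential -> std devs, identity -> means."""
    w_raw, s_raw, mu = np.split(z, [M, 2 * M], axis=1)
    w = np.exp(w_raw - w_raw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # mixture weights (softmax)
    sigma = np.exp(s_raw)                      # positive scales (exp)
    return w, sigma, mu.reshape(-1, M, D)      # means (identity)
```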

Recurrent Neural Network (RNN) [22]

Feed-forward neural network: each frame x(t) is mapped independently, so acoustic context must be supplied through the stacked input window:

    h_t = ℋ(W_xh x_t + b_h),   y_t = W_hy h_t + b_y

Recurrent neural network: the hidden state also depends on the previous hidden state, carrying context through time:

    h_t = ℋ(W_xh x_t + W_hh h_(t−1) + b_h),   y_t = W_hy h_t + b_y
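A minimal NumPy sketch of the recurrent case (tanh chosen as ℋ for concreteness):

```python
import numpy as np

def rnn_forward(X, Wxh, Whh, Why, bh, by):
    """Elman RNN over a sequence X: (T, D_in).
    h_t = tanh(Wxh x_t + Whh h_{t-1} + bh); y_t = Why h_t + by."""
    h = np.zeros(bh.shape[0])
    Y = []
    for x_t in X:
        h = np.tanh(x_t @ Wxh + h @ Whh + bh)  # recurrent hidden update
        Y.append(h @ Why + by)                 # per-frame output
    return np.stack(Y), h
```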

Long Short-Term Memory (LSTM) [22,23]

    i_t = σ(W_xi x_t + W_hi h_(t−1) + W_ci c_(t−1) + b_i)
    f_t = σ(W_xf x_t + W_hf h_(t−1) + W_cf c_(t−1) + b_f)
    c_t = f_t ⊙ c_(t−1) + i_t ⊙ tanh(W_xc x_t + W_hc h_(t−1) + b_c)
    o_t = σ(W_xo x_t + W_ho h_(t−1) + W_co c_t + b_o)
    h_t = o_t ⊙ tanh(c_t)

[Figure: an LSTM layer unrolled over time steps 1-7, from inputs through the hidden layer to outputs.]
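A direct transcription of these equations into NumPy. The parameter dict P and its key names are my assumption; the peephole weights w_ci, w_cf, w_co are kept diagonal (elementwise), as in Graves' formulation.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step following the slide's equations."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sig(x @ P['Wxi'] + h_prev @ P['Whi'] + c_prev * P['wci'] + P['bi'])
    f = sig(x @ P['Wxf'] + h_prev @ P['Whf'] + c_prev * P['wcf'] + P['bf'])
    c = f * c_prev + i * np.tanh(x @ P['Wxc'] + h_prev @ P['Whc'] + P['bc'])
    o = sig(x @ P['Wxo'] + h_prev @ P['Who'] + c * P['wco'] + P['bo'])
    h = o * np.tanh(c)
    return h, c
```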

Bidirectional RNN [22,24]

Forward layer (sees x(1), x(2), …, x(t): past and current):

    h⃗_t = ℋ(W_xh⃗ x_t + W_h⃗h⃗ h⃗_(t−1) + b_h⃗)

Backward layer (sees x(T), x(T−1), …, x(t): future and current):

    h⃖_t = ℋ(W_xh⃖ x_t + W_h⃖h⃖ h⃖_(t+1) + b_h⃖)

Output (sees the whole sequence x(1), …, x(T)):

    y_t = W_h⃗y h⃗_t + W_h⃖y h⃖_t + b_y
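Reusing the rnn_forward sketch from the RNN slide, a bidirectional layer can be sketched as two passes over the sequence, one reversed, with their output projections summed:

```python
def birnn_forward(X, fwd_params, bwd_params):
    """Bidirectional sketch: y_t = W_fy h_fwd_t + W_by h_bwd_t + b_y,
    realized here as the sum of two rnn_forward() output streams
    (each pass carries its own output projection and bias)."""
    Yf, _ = rnn_forward(X, *fwd_params)        # left-to-right pass
    Yb, _ = rnn_forward(X[::-1], *bwd_params)  # right-to-left pass
    return Yf + Yb[::-1]                       # re-align backward outputs
```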


Deep Learning for Speech Generation and Synthesis

- Motivated by the DNN's superior performance in ASR, we investigate the potential advantages of neural networks in speech synthesis and voice conversion.
- Model long-span, high-dimensional and correlated features as input:
  - MCEP or LSP → spectral envelope; 1 frame → multiple frames.
- Non-linear mapping between input and output features with a deep-layered, hierarchical structure:
  - Represents a highly variable mapping function compactly.
  - Simulates the human speech production system.

Deep Learning for Speech Generation and Synthesis

- Distributed representation:
  - Observed data are generated by the interactions of many different factors at different levels.
  - Replaces HMM states and decision trees, which decompose the training data into small partitions.
- "Discriminative" and "predictive" in the generation sense, with appropriate cost function(s), e.g. generation error:
  - The training criterion is closer to the synthesis objective than ML.

Approaches of Deep Learning to Speech Synthesis

- RBM/DBN: CUHK [25], USTC/Microsoft [26]
- DNN: Google [27], CSTR [28], Microsoft [29]
- DBN/DNN for feature transformation: IBM [30]
- MDN: Google [31]
- RNN with BLSTM: Microsoft [32], IBM [33]

Generative Models: RBM vs. GMM [34]

- GMM:

    p(v) = Σ_{i=1}^{M} m_i N(v; μᵢ, Σᵢ)

  - Mixture models cannot generate distributions sharper than their individual components.
  - Need lots of data to estimate the parameters.
- RBM:
  - Product models can generate distributions sharper than their individual components.
  - Need less data to estimate the parameters.

DBN for Speech Synthesis [25,26]

- Model the joint distribution of linguistic and acoustic features:
  - Fully utilizes the generative nature of the DBN to generate speech parameters.
- DBNs replace GMMs as the distribution of the spectral envelopes at each decision-tree-clustered HMM state:
  - The estimated spectral RBM shows a much sharper formant structure than the Gaussian mean.

Deep Learning (DNN, MDN, RNN) for Speech Synthesis

[Figure: a common architecture for DNN-, MDN- or RNN-based synthesis. Input layer: binary features (phone label, POS label, word, …) and numeric features (number of words, frame position, duration, …). Hidden layers: the DNN, MDN or RNN. Output layer: spectral features (LSP spectrum, gain) and excitation features (F0).]

DNN for TTS Training

- Input features x and output features y are time-aligned frame-by-frame by well-trained HMM models.
- L2 cost function, including weight decay, minimized with back-propagation (BP):

    C = (1/2T) Σ_{t=1}^{T} ‖f(x^(t)) − y^(t)‖² + (λ/2) Σ_{i,j,l} (w_ij^l)²

- A "mini-batch" based stochastic gradient descent algorithm:

    (W^l, b^l) ← (W^l, b^l) − ε ∂C/∂(W^l, b^l),   0 ≤ l ≤ L
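A sketch of the cost and update rule above (the gradients themselves are assumed to come from BP elsewhere; all names are illustrative):

```python
import numpy as np

def l2_cost(Y_pred, Y, weights, lam):
    """C = (1/2T) sum_t ||f(x_t) - y_t||^2 + (lam/2) sum_{i,j,l} w_ij^2."""
    T = len(Y)
    data_term = np.sum((Y_pred - Y) ** 2) / (2.0 * T)
    decay_term = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + decay_term

def sgd_step(params, grads, eps):
    """(W_l, b_l) <- (W_l, b_l) - eps * dC/d(W_l, b_l), for all layers l."""
    for p, g in zip(params, grads):
        p -= eps * g                           # in-place mini-batch update
```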

DNN vs. HMM for TTS Synthesis

[Figure: both systems map text through text analysis to label sequences X₁ … X_N, and then to vocoder parameters Y₁ … Y_N for parameter generation and waveform synthesis. The HMM system routes each label through a decision tree of context questions (e.g., "Vowel?", "nasal?", "R-silence?", "voiced plosive?", "L-semivowel?") to a clustered leaf; the DNN system maps the same labels through shared hidden layers h¹, h², h³ (each with M nodes).]

Deep MDN (DMDN) based Speech Synthesis [31]

- An MDN, in principle, can represent arbitrary conditional probability distributions.
- It addresses two limitations of the DNN:
  - The unimodal nature of its objective function.
  - The lack of ability to predict variances.

DMDN for TTS Training [31]

- Training criterion: maximum likelihood, consistent with the generation algorithm:

    p(y | x, λ) = p(y | Q, λ)

DMDN for TTS Synthesis [31]

[Figure: the same pipeline as the DNN system (text → text analysis → labels X₁ … X_N → hidden layers h¹ … h³ → outputs Y₁ … Y_N → parameter generation → synthesized speech), but the output layer predicts a full mixture density per frame instead of point estimates.]

DNN output:   Y = [u₁, u₂, …, u_N]

DMDN output:  Y = [u₁¹, u₂¹, …, u_N¹, σ¹, w₁¹, w₂¹, …, w_N¹,
                   u₁², u₂², …, u_N², σ², w₁², w₂², …, w_N²,
                   …,
                   u₁ᴹ, u₂ᴹ, …, u_Nᴹ, σᴹ, w₁ᴹ, w₂ᴹ, …, w_Nᴹ]

DBLSTM-RNN based Speech Synthesis

- Speech production is a continuous, dynamic process:
  - Consider semantic and syntactic information, select words, formulate phonetics, articulate sounds.
  - These features interact, with complex contextual effects from neighboring and distant inputs.
- An RNN with LSTM can, in principle, capture information from anywhere in the feature sequence.
- Bidirectional layers access long-range context in both the forward and backward directions:
  - A whole sentence is given as input.

DBLSTM-RNN for TTS Training

- Training criterion: L2 cost function.
- Back-propagation through time (BPTT):
  - Unfold the RNN into a feed-forward network through time.
  - Train the unfolded network with back-propagation.
- Sequence mapping between input and output:
  - The weight gradients are computed over the entire utterance.
  - Utterance-level randomization; tens of utterances per "mini-batch".

DBLSTM-RNN for TTS Synthesis

[Figure: synthesis pipeline — text → text analysis → input feature extraction → input features → stacked forward/backward (bidirectional LSTM) layers → output features → parameter generation → vocoder → waveform.]

DBLSTM-RNN based Speech Synthesis

- Pros:
  - Bidirectional, deep-in-time and deep-in-layer structure.
  - Embedded sequence training.
  - Matched criteria for training and testing.
- Cons:
  - High computational complexity.
  - Slow to converge.

Experimental Setup

Corpus:                     U.S. English utterances (female, general domain)
Training / test set:        5,000 / 200 sentences
Linguistic features:        319 binary (e.g. phone, POS); 36 numerical (e.g. word position)
Acoustic features:          1 U/V, LSP + gain (40 + 1), F0, Δ, Δ²
HMM topology:               5-state, left-to-right HMM; MSD F0; MDL; MGE
DNN, MDN, RNN* architecture: 1-6 layers, 512 or 1024 nodes/layer; sigmoid/tanh; continuous F0
Post-processing:            formant sharpening of LSP frequencies

* RNN trained with CURRENNT [40], open-source code under GPLv3

Evaluation Metrics

- Objective measures:
  - Log Spectral Distance (LSD)
  - Voiced/unvoiced (V/U) error rate
  - Root Mean Squared Error (RMSE) of F0
- Subjective measure:
  - AB preference test
  - Crowdsourced (20 subjects, 50 pairs of sentences each)

Experimental Results

- Activation functions (sigmoid vs. tanh)

[Figure: LSD (dB, axis 3.6-4.2) vs. number of training epochs (5-45) for sigmoid and tanh hidden activations.]

* A rectified-linear activation function is better in objective measures [31].

Experimental Results

- Pre-training

[Figure: LSD (dB, axis 3.6-4.2) vs. number of training epochs (10-90) for no pre-training, DBN pre-training, and layer-wise BP pre-training.]

Experimental Results

- The number of hidden layers

[Figure: LSD (dB, axis 3.6-4.2) vs. number of training epochs (10-50) for networks with 1-6 hidden layers (H1-H6).]

Evaluation Results (DNN)

DNN structure          LSD (dB)   V/U error rate (%)   F0 RMSE (Hz)
512 × 3  (0.77 M)      3.76       5.9                  15.8
512 × 6  (1.55 M)      3.73       5.8                  15.8
1024 × 3 (2.59 M)      3.73       5.9                  15.9

HMM (MDL factor)       LSD (dB)   V/U error rate (%)   F0 RMSE (Hz)
1   (2.89 M)           3.74       5.8                  17.7
1.6 (1.52 M)           3.85       6.1                  18.1
3   (0.85 M)           3.91       6.2                  18.4

AB preference:
  HMM (MDL=1) 46%  /  Neutral 10%  /  DNN (512 × 3)  44%   (p = 0.45)
  HMM (MDL=1) 23%  /  Neutral 10%  /  DNN (1024 × 3) 67%   (p …)