Deep Learning for Speech Generation and Synthesis

Yao Qian, Frank K. Soong
Speech Group, Microsoft Research Asia

Sep. 13, 2014

Background

- Deep learning has made a huge impact on automatic speech recognition (ASR) research, products and services.
- Speech generation and synthesis is the inverse process of speech recognition: Text-to-Speech (TTS) ↔ Speech-to-Text (STT).
- This tutorial covers deep learning approaches to speech generation and synthesis, with a focus on TTS synthesis and voice conversion.


Outline

- Statistical parametric speech generation and synthesis
  - HMM-based speech synthesis
  - GMM-based voice conversion
- Deep learning
  - RBM, DBN, DNN, MDN and RNN
- Deep learning for speech generation and synthesis
  - Approaches to speech synthesis
  - Approaches to voice conversion
- Conclusions and future work


HMM-based Speech Synthesis [1]

- By reversing the input and output of the Hidden Markov Model (HMM) used in speech recognition, the HMM becomes a generative model for text-to-speech (TTS) synthesis.
- Speech spectrum, fundamental frequency, voicing and duration are modeled simultaneously by HMMs [2].
- Models are trained, and signals are generated, under the same universal Maximum Likelihood (ML) criterion.


HMM-based Speech Synthesis [1]

[Figure: block diagram of the HMM-based TTS system.
Training stage: speech database → feature extraction of the speech signal (Log F0, LSP & gain) → HMM training, guided by a phonetic & prosodic question set → stream-dependent HMMs.
Synthesis stage: text → text analysis → labels → sequence of HMMs → source-filter parameter generation; the Log F0 stream drives mixed-excitation generation, and the LSP & gain stream drives LPC synthesis → synthesized speech.]



Advantages of HMM-based TTS

- The statistical parametric model can be trained efficiently from recorded speech data and the corresponding transcriptions.
- The HMM-based statistical model is parsimonious (i.e., small model size) and data efficient.
- HMMs can generate an "optimal" speech parameter trajectory in the ML sense.


Concatenative vs. HMM-based TTS Synthesis

- HMM-based synthesis
  - Statistically trained
  - Vocoded speech (smooth & stable)
  - Small footprint (less than 2 MB)
  - Easy to modify its voice characteristics
- Concatenative synthesis
  - Sample-based (unit selection)
  - High quality, with occasional glitches
  - Large footprint
  - More difficult to modify its voice characteristics

Technologies

- Multi-Space Distribution (MSD)-HMM for F0 and voicing modeling
- Decision-tree-based context clustering
- Parameter generation with dynamic features
- Mixed excitation
- Global variance
- Minimum generation error training

MSD-HMM for F0 Modeling [3]

- A schematic representation of using MSD-HMM for F0 modeling.
- F0 is only defined in voiced regions, so each observation lies either in a one-dimensional subspace (voiced, the F0 value) or in a zero-dimensional subspace (unvoiced); the MSD-HMM covers both with state-dependent subspace weights.

[Figure: an F0 contour (Hz, vs. time at 10 ms/frame) with state-level subspace weights shown below. In one segment the first-subspace weights dominate (0.83, 0.99, 0.87 across three states, leaving 0.17, 0.01, 0.13 for the remaining mixture components); in another segment they are small (0.05, 0.01, 0.1, leaving 0.95, 0.99, 0.9).]

Context Clustering [4,5]

- Each state of the model has three decision trees: for duration, spectrum and pitch.
- The Minimum Description Length (MDL) criterion is used as the stopping criterion.

An Example of Decision Trees

[Figure: context-dependent triphone models such as k-a+b, t-a+h and s-i+n are clustered by a tree whose nodes ask questions from the question set, e.g. "C=fricative?", "C=Vowel?", "R=silence?", "R='cl'?", "L='gy'?", "L=voiced?" (C: current phone, L: left context, R: right context).]

Parameter Generation [6]

Speech parameter generation from HMM under dynamic feature constraints. For a given HMM λ, determine the speech parameter vector sequence O = [o₁ᵀ, o₂ᵀ, o₃ᵀ, …, o_Tᵀ]ᵀ, with oₜ = [cₜᵀ, Δcₜᵀ, Δ²cₜᵀ]ᵀ, which maximizes

    P(O | λ) = Σ_{all Q} P(O, Q | λ)

For a given state sequence Q, maximize P(O | Q, λ) with respect to O = WC:

    log P(O | Q, λ) = −(1/2) Oᵀ U⁻¹ O + Oᵀ U⁻¹ M + K

Setting ∂ log P(WC | Q, λ) / ∂C = 0 yields the linear system

    Wᵀ U⁻¹ W C = Wᵀ U⁻¹ M
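To make the closed-form solution concrete, here is a minimal NumPy sketch (my own illustration, not the tutorial's implementation) that builds a delta/delta-delta window matrix W for one feature dimension and solves Wᵀ U⁻¹ W C = Wᵀ U⁻¹ M; `means` and `variances` stand for the stacked state statistics M and diag(U), assumed given from the chosen state sequence.

```python
import numpy as np

def mlpg(means, variances, T):
    """ML parameter generation for one static dimension.
    means, variances: (3T,) stacked [static, delta, delta-delta]
    statistics per frame from the chosen HMM state sequence Q.
    Returns the T static parameters C solving (W'U^-1 W)C = W'U^-1 M."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                      # static window: c_t
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1] = -0.5         # delta: 0.5*(c_{t+1} - c_{t-1})
            W[3 * t + 1, t + 1] = 0.5
            W[3 * t + 2, t - 1] = 1.0          # delta-delta: c_{t-1} - 2c_t + c_{t+1}
            W[3 * t + 2, t] = -2.0
            W[3 * t + 2, t + 1] = 1.0
    U_inv = np.diag(1.0 / variances)
    A = W.T @ U_inv @ W                        # W' U^-1 W
    b = W.T @ U_inv @ means                    # W' U^-1 M
    return np.linalg.solve(A, b)               # the smooth trajectory C
```

Practical implementations exploit the band structure of Wᵀ U⁻¹ W with a banded Cholesky solver rather than the dense solve used here.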

Generated Speech Parameter Trajectory [7]

[Figure: example of speech parameter trajectories generated under the dynamic feature constraints.]

Mixed Excitation [8,9]

- Traditional excitation model ("hiss-buzz" model): a binary switch between periodic pulses (voiced) and noise (unvoiced).
- A better excitation model (mixed excitation model): periodic and noise components are mixed, reducing buzziness.

Global Variance (GV) [10]

- The "global variance" is the variance of the speech features computed over an utterance.
- It is modeled by a single Gaussian with mean μᵥ and covariance Σᵥ, and parameter generation is modified so that the generated trajectory's GV stays close to that of natural speech, counteracting over-smoothing.
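As a tiny illustration (assuming per-utterance feature matrices), the GV statistic itself is just the per-dimension variance over the frames of one utterance:

```python
import numpy as np

def global_variance(c):
    """Per-utterance 'global variance' v(c) of a feature trajectory.
    c: (T, D) array of generated or natural parameters; returns (D,)."""
    return np.var(c, axis=0)
```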

Minimum Generation Error Training [11]

- Train the model by maximizing the likelihood of the training data:

    λ* = argmax_λ Σ_{all Q} P(O, Q | λ),   with O = WC

- Refine the model by minimizing a weighted generation error:

    λ̃ = argmin_λ Σ_{all Q} P(Q | λ, O) D(C, C̃)

  where C̃ is the parameter trajectory generated from λ given Q, and D(·,·) is a generation error measure (e.g., Euclidean distance).
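A minimal sketch of the error term only (the weighting by P(Q | λ, O) and the sum over state sequences are omitted; pairing with the mlpg() sketch above is my own illustration):

```python
import numpy as np

def generation_error(c_natural, c_generated):
    """Euclidean generation error D(C, C~) between the natural and the
    model-generated trajectories -- the quantity MGE training minimizes."""
    return float(np.sum((c_natural - c_generated) ** 2))

# e.g., with the mlpg() sketch above:
#   err = generation_error(c_nat, mlpg(means, variances, T))
```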

GMM-based Voice Conversion [12,15]

- Model the joint feature vector of the source and target speakers, z = [xᵀ, yᵀ]ᵀ, with a GMM.

GMM-based Voice Conversion [12,13]

- Maximum likelihood conversion: generate the target feature trajectory that maximizes its likelihood given the source features under the joint model.
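For illustration, here is a sketch of the simpler frame-wise (MMSE) variant of GMM conversion, rather than the ML trajectory version on this slide: fit a GMM on joint vectors z = [x; y], then map each frame with the conditional mean E[y | x]. The function names and the use of scikit-learn/SciPy are my own choices, not the tutorial's.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, y, n_mix=8):
    """Fit a full-covariance GMM on joint vectors z = [x; y].
    x, y: (T, D) time-aligned source and target features."""
    return GaussianMixture(n_mix, covariance_type='full').fit(np.hstack([x, y]))

def convert(gmm, x):
    """Frame-wise MMSE mapping: y_hat = sum_m P(m | x) E[y | x, m]."""
    D = x.shape[1]
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    Sxx, Syx = gmm.covariances_[:, :D, :D], gmm.covariances_[:, D:, :D]
    M = gmm.n_components
    # responsibilities P(m | x) under the marginal GMM over x
    lik = np.stack([multivariate_normal.pdf(x, mu_x[m], Sxx[m])
                    for m in range(M)], axis=1)            # (T, M)
    post = gmm.weights_ * lik
    post /= post.sum(axis=1, keepdims=True)
    # conditional means E[y | x, m] = mu_y + Sigma_yx Sigma_xx^-1 (x - mu_x)
    y_hat = np.zeros((x.shape[0], mu_y.shape[1]))
    for m in range(M):
        cond = mu_y[m] + (x - mu_x[m]) @ np.linalg.solve(Sxx[m], Syx[m].T)
        y_hat += post[:, m:m + 1] * cond
    return y_hat
```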

Limitations

- Vocoder
  - The synthesized voice carries a characteristic, traditional "vocoded" flavor.
- Acoustic modeling
  - "Wrong" model assumptions made out of necessity, e.g., GMMs with diagonal covariances.
  - Greedy, hence suboptimal, search in deriving the decision-tree state clustering (tying).


Deep Learning [16]

- A machine learning approach that learns high-level abstractions in data using multi-layered models.
  - Typically neural nets with 3 or more hidden layers.
- Motivated by the human brain, which organizes ideas and concepts hierarchically.
- DNNs were difficult to train in the 1980s: training was slow, and gradients vanished (especially in RNNs).

Deep Learning [16]

- DNN training has improved significantly since 2006, thanks to:
  - GPUs
  - Big data
  - Pre-training
- Different architectures, e.g. DNN, CNN, RNN and DBN, are used for computer vision, speech recognition and synthesis, and natural language processing.

Restricted Boltzmann Machine (RBM) [17]

- "Restricted": no interactions between hidden variables, and none between visible variables.
- The joint distribution is defined in terms of an energy function; for binary units,

    P(v, h) = exp(−E(v, h)) / Z,   E(v, h) = −aᵀv − bᵀh − vᵀWh

- Parameters can be estimated by contrastive divergence learning [18].
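A minimal NumPy sketch of one CD-1 update for a binary RBM (my own illustrative code; persistent chains, momentum and weight decay are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: (batch, V) visible data; W: (V, H) weights; a: (V,) visible
    bias; b: (H,) hidden bias. Updates parameters in place."""
    ph0 = sigmoid(v0 @ W + b)                  # positive phase P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + a)                # one Gibbs step back to v
    ph1 = sigmoid(pv1 @ W + b)                 # P(h=1 | v1)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```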

Deep Belief Network (DBN) [17]

- Each successive pair of layers is treated as an RBM.
- RBMs are stacked to form a DBN.

Deep Neural Net (DNN)

- A feed-forward artificial neural network with multiple hidden layers:

    y_j = f(x_j),   where x_j = b_j + Σᵢ yᵢ w_ij

- Nonlinear activation functions f:

    Sigmoid:                    f(x_j) = 1 / (1 + e^(−x_j))
    Hyperbolic tangent (tanh):  f(x_j) = (e^(x_j) − e^(−x_j)) / (e^(x_j) + e^(−x_j))
    Rectified linear (ReLU):    f(x_j) = max(x_j, 0)
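A small sketch putting these pieces together: the activations and a forward pass where hidden layers use f and the output layer is linear (the layer layout is my assumption for illustration):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(x, 0.0)      # np.tanh covers tanh

def forward(x, layers, f=sigmoid):
    """Feed-forward pass: at each node, y_j = f(b_j + sum_i y_i w_ij).
    layers: list of (W, b) pairs; the last layer is kept linear,
    as is typical for regression-style TTS outputs."""
    h = x
    for W, b in layers[:-1]:
        h = f(h @ W + b)
    W, b = layers[-1]
    return h @ W + b
```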

DNN

- Discriminative models are trained via back-propagation (BP) [19] with an appropriate cost function.
- A "mini-batch" based stochastic gradient descent algorithm is used.
- Improved acoustic models have enhanced speech recognition performance significantly [17].

DNN Pre-training

- Initialize the weights to a better starting point than random initialization, prior to BP.
  - DBN [17]: layer-by-layer trained RBMs; unsupervised training via contrastive divergence.
  - Layer-wise BP [20]: add one hidden layer after another; supervised training with BP.

Mixture Density Networks (MDN) [21]

- An MDN combines a conventional neural network with a mixture density model.
- A conventional neural network can represent arbitrary functions.
- In principle, an MDN can represent arbitrary conditional probability distributions.

Mixture Density Networks (MDN) [21]

- The output is a mixture density over the target:
  - This makes the neural network a generative model rather than a regression model.
- Activation functions:
  - Sigmoid or hyperbolic tangent for hidden nodes.
  - Softmax for mixture weights, exponential for variances, and identity for means.
- Criterion: maximum likelihood estimation (MLE).
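A sketch of how these output activations map raw network outputs to mixture parameters, assuming M mixtures over a D-dimensional target with one shared variance per mixture (the output layout is my assumption):

```python
import numpy as np

def mdn_params(z, M, D):
    """Split raw outputs z (batch, M*(2+D)) into mixture parameters:
    softmax -> weights, exponential -> std devs, identity -> means."""
    w_raw, s_raw, mu = np.split(z, [M, 2 * M], axis=1)
    w = np.exp(w_raw - w_raw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # mixture weights (softmax)
    sigma = np.exp(s_raw)                      # positive scales (exp)
    return w, sigma, mu.reshape(-1, M, D)      # means (identity)
```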

Recurrent Neural Network (RNN) [22]

Feed-forward neural network: each frame x(t) is mapped independently, so acoustic context must be supplied through the stacked input window:

    h_t = ℋ(W_xh x_t + b_h),   y_t = W_hy h_t + b_y

Recurrent neural network: the hidden state also depends on the previous hidden state, carrying context through time:

    h_t = ℋ(W_xh x_t + W_hh h_(t−1) + b_h),   y_t = W_hy h_t + b_y
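A minimal NumPy sketch of the recurrent case (tanh chosen as ℋ for concreteness):

```python
import numpy as np

def rnn_forward(X, Wxh, Whh, Why, bh, by):
    """Elman RNN over a sequence X: (T, D_in).
    h_t = tanh(Wxh x_t + Whh h_{t-1} + bh); y_t = Why h_t + by."""
    h = np.zeros(bh.shape[0])
    Y = []
    for x_t in X:
        h = np.tanh(x_t @ Wxh + h @ Whh + bh)  # recurrent hidden update
        Y.append(h @ Why + by)                 # per-frame output
    return np.stack(Y), h
```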

Long Short-Term Memory (LSTM) [22,23]

    i_t = σ(W_xi x_t + W_hi h_(t−1) + W_ci c_(t−1) + b_i)
    f_t = σ(W_xf x_t + W_hf h_(t−1) + W_cf c_(t−1) + b_f)
    c_t = f_t ⊙ c_(t−1) + i_t ⊙ tanh(W_xc x_t + W_hc h_(t−1) + b_c)
    o_t = σ(W_xo x_t + W_ho h_(t−1) + W_co c_t + b_o)
    h_t = o_t ⊙ tanh(c_t)

[Figure: an LSTM layer unrolled over time steps 1-7, from inputs through the hidden layer to outputs.]
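A direct transcription of these equations into NumPy. The parameter dict P and its key names are my assumption; the peephole weights w_ci, w_cf, w_co are kept diagonal (elementwise), as in Graves' formulation.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step following the slide's equations."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sig(x @ P['Wxi'] + h_prev @ P['Whi'] + c_prev * P['wci'] + P['bi'])
    f = sig(x @ P['Wxf'] + h_prev @ P['Whf'] + c_prev * P['wcf'] + P['bf'])
    c = f * c_prev + i * np.tanh(x @ P['Wxc'] + h_prev @ P['Whc'] + P['bc'])
    o = sig(x @ P['Wxo'] + h_prev @ P['Who'] + c * P['wco'] + P['bo'])
    h = o * np.tanh(c)
    return h, c
```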

Bidirectional RNN [22,24]

Forward layer (sees x(1), x(2), …, x(t): past and current):

    h⃗_t = ℋ(W_xh⃗ x_t + W_h⃗h⃗ h⃗_(t−1) + b_h⃗)

Backward layer (sees x(T), x(T−1), …, x(t): future and current):

    h⃖_t = ℋ(W_xh⃖ x_t + W_h⃖h⃖ h⃖_(t+1) + b_h⃖)

Output (sees the whole sequence x(1), …, x(T)):

    y_t = W_h⃗y h⃗_t + W_h⃖y h⃖_t + b_y
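Reusing the rnn_forward sketch from the RNN slide, a bidirectional layer can be sketched as two passes over the sequence, one reversed, with their output projections summed:

```python
def birnn_forward(X, fwd_params, bwd_params):
    """Bidirectional sketch: y_t = W_fy h_fwd_t + W_by h_bwd_t + b_y,
    realized here as the sum of two rnn_forward() output streams
    (each pass carries its own output projection and bias)."""
    Yf, _ = rnn_forward(X, *fwd_params)        # left-to-right pass
    Yb, _ = rnn_forward(X[::-1], *bwd_params)  # right-to-left pass
    return Yf + Yb[::-1]                       # re-align backward outputs
```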


Deep Learning for Speech Generation and Synthesis

- Motivated by the DNN's superior performance in ASR, we investigate the potential advantages of neural networks in speech synthesis and voice conversion.
- Model long-span, high-dimensional and correlated features as input:
  - MCEP or LSP → spectral envelope; 1 frame → multiple frames.
- Non-linear mapping between input and output features with a deep-layered, hierarchical structure:
  - Represents a highly variable mapping function compactly.
  - Simulates the human speech production system.

Deep Learning for Speech Generation and Synthesis

- Distributed representation:
  - Observed data are generated by the interactions of many different factors at different levels.
  - Replaces HMM states and decision trees, which decompose the training data into small partitions.
- "Discriminative" and "predictive" in the generation sense, with appropriate cost function(s), e.g. generation error:
  - The training criterion is closer to the synthesis objective than ML.

Approaches of Deep Learning to Speech Synthesis

- RBM/DBN: CUHK [25], USTC/Microsoft [26]
- DNN: Google [27], CSTR [28], Microsoft [29]
- DBN/DNN for feature transformation: IBM [30]
- MDN: Google [31]
- RNN with BLSTM: Microsoft [32], IBM [33]

Generative Models: RBM vs. GMM [34]

- GMM:

    p(v) = Σ_{i=1}^{M} m_i N(v; μᵢ, Σᵢ)

  - Mixture models cannot generate distributions sharper than their individual components.
  - Need lots of data to estimate the parameters.
- RBM:
  - Product models can generate distributions sharper than their individual components.
  - Need less data to estimate the parameters.

DBN for Speech Synthesis [25,26]

- Model the joint distribution of linguistic and acoustic features:
  - Fully utilizes the generative nature of the DBN to generate speech parameters.
- DBNs replace GMMs as the distribution of the spectral envelopes at each decision-tree-clustered HMM state:
  - The estimated spectral RBM shows a much sharper formant structure than the Gaussian mean.

Deep Learning (DNN, MDN, RNN) for Speech Synthesis

[Figure: a common architecture for DNN-, MDN- or RNN-based synthesis. Input layer: binary features (phone label, POS label, word, …) and numeric features (number of words, frame position, duration, …). Hidden layers: the DNN, MDN or RNN. Output layer: spectral features (LSP spectrum, gain) and excitation features (F0).]

DNN for TTS Training

- Input features x and output features y are time-aligned frame-by-frame by well-trained HMM models.
- L2 cost function, including weight decay, minimized with back-propagation (BP):

    C = (1/2T) Σ_{t=1}^{T} ‖f(x^(t)) − y^(t)‖² + (λ/2) Σ_{i,j,l} (w_ij^l)²

- A "mini-batch" based stochastic gradient descent algorithm:

    (W^l, b^l) ← (W^l, b^l) − ε ∂C/∂(W^l, b^l),   0 ≤ l ≤ L
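A sketch of the cost and update rule above (the gradients themselves are assumed to come from BP elsewhere; all names are illustrative):

```python
import numpy as np

def l2_cost(Y_pred, Y, weights, lam):
    """C = (1/2T) sum_t ||f(x_t) - y_t||^2 + (lam/2) sum_{i,j,l} w_ij^2."""
    T = len(Y)
    data_term = np.sum((Y_pred - Y) ** 2) / (2.0 * T)
    decay_term = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + decay_term

def sgd_step(params, grads, eps):
    """(W_l, b_l) <- (W_l, b_l) - eps * dC/d(W_l, b_l), for all layers l."""
    for p, g in zip(params, grads):
        p -= eps * g                           # in-place mini-batch update
```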

DNN vs. HMM for TTS Synthesis

[Figure: both systems map text through text analysis to label sequences X₁ … X_N, and then to vocoder parameters Y₁ … Y_N for parameter generation and waveform synthesis. The HMM system routes each label through a decision tree of context questions (e.g., "Vowel?", "nasal?", "R-silence?", "voiced plosive?", "L-semivowel?") to a clustered leaf; the DNN system maps the same labels through shared hidden layers h¹, h², h³ (each with M nodes).]

Deep MDN (DMDN) based Speech Synthesis [31]

- An MDN, in principle, can represent arbitrary conditional probability distributions.
- It addresses two limitations of the DNN:
  - The unimodal nature of its objective function.
  - The lack of ability to predict variances.

DMDN for TTS Training [31]

- Training criterion: maximum likelihood, consistent with the generation algorithm:

    p(y | x, λ) = p(y | Q, λ)

DMDN for TTS Synthesis [31]

[Figure: the same pipeline as the DNN system (text → text analysis → labels X₁ … X_N → hidden layers h¹ … h³ → outputs Y₁ … Y_N → parameter generation → synthesized speech), but the output layer predicts a full mixture density per frame instead of point estimates.]

DNN output:   Y = [u₁, u₂, …, u_N]

DMDN output:  Y = [u₁¹, u₂¹, …, u_N¹, σ¹, w₁¹, w₂¹, …, w_N¹,
                   u₁², u₂², …, u_N², σ², w₁², w₂², …, w_N²,
                   …,
                   u₁ᴹ, u₂ᴹ, …, u_Nᴹ, σᴹ, w₁ᴹ, w₂ᴹ, …, w_Nᴹ]

DBLSTM-RNN based Speech Synthesis

- Speech production is a continuous, dynamic process:
  - Consider semantic and syntactic information, select words, formulate phonetics, articulate sounds.
  - These features interact, with complex contextual effects from neighboring and distant inputs.
- An RNN with LSTM can, in principle, capture information from anywhere in the feature sequence.
- Bidirectional layers access long-range context in both the forward and backward directions:
  - A whole sentence is given as input.

DBLSTM-RNN for TTS Training

- Training criterion: L2 cost function.
- Back-propagation through time (BPTT):
  - Unfold the RNN into a feed-forward network through time.
  - Train the unfolded network with back-propagation.
- Sequence mapping between input and output:
  - The weight gradients are computed over the entire utterance.
  - Utterance-level randomization; tens of utterances per "mini-batch".

DBLSTM-RNN for TTS Synthesis

[Figure: synthesis pipeline — text → text analysis → input feature extraction → input features → stacked forward/backward (bidirectional LSTM) layers → output features → parameter generation → vocoder → waveform.]

DBLSTM-RNN based Speech Synthesis

- Pros:
  - Bidirectional, deep-in-time and deep-in-layer structure.
  - Embedded sequence training.
  - Matched criteria for training and testing.
- Cons:
  - High computational complexity.
  - Slow to converge.

Experimental Setup

Corpus:                     U.S. English utterances (female, general domain)
Training / test set:        5,000 / 200 sentences
Linguistic features:        319 binary (e.g. phone, POS); 36 numerical (e.g. word position)
Acoustic features:          1 U/V, LSP + gain (40 + 1), F0, Δ, Δ²
HMM topology:               5-state, left-to-right HMM; MSD F0; MDL; MGE
DNN, MDN, RNN* architecture: 1-6 layers, 512 or 1024 nodes/layer; sigmoid/tanh; continuous F0
Post-processing:            formant sharpening of LSP frequencies

* RNN trained with CURRENNT [40], open-source code under GPLv3

Evaluation Metrics

- Objective measures:
  - Log Spectral Distance (LSD)
  - Voiced/unvoiced (V/U) error rate
  - Root Mean Squared Error (RMSE) of F0
- Subjective measure:
  - AB preference test
  - Crowdsourced (20 subjects, 50 pairs of sentences each)

Experimental Results

- Activation functions (sigmoid vs. tanh)

[Figure: LSD (dB, axis 3.6-4.2) vs. number of training epochs (5-45) for sigmoid and tanh hidden activations.]

* A rectified-linear activation function is better in objective measures [31].

Experimental Results

- Pre-training

[Figure: LSD (dB, axis 3.6-4.2) vs. number of training epochs (10-90) for no pre-training, DBN pre-training, and layer-wise BP pre-training.]

Experimental Results

- The number of hidden layers

[Figure: LSD (dB, axis 3.6-4.2) vs. number of training epochs (10-50) for networks with 1-6 hidden layers (H1-H6).]

Evaluation Results (DNN)

DNN structure          LSD (dB)   V/U error rate (%)   F0 RMSE (Hz)
512 × 3  (0.77 M)      3.76       5.9                  15.8
512 × 6  (1.55 M)      3.73       5.8                  15.8
1024 × 3 (2.59 M)      3.73       5.9                  15.9

HMM (MDL factor)       LSD (dB)   V/U error rate (%)   F0 RMSE (Hz)
1   (2.89 M)           3.74       5.8                  17.7
1.6 (1.52 M)           3.85       6.1                  18.1
3   (0.85 M)           3.91       6.2                  18.4

AB preference:
  HMM (MDL=1) 46%  /  Neutral 10%  /  DNN (512 × 3)  44%   (p = 0.45)
  HMM (MDL=1) 23%  /  Neutral 10%  /  DNN (1024 × 3) 67%   (p …)