Deep Learning for Speech Generation and Synthesis
Yao Qian and Frank K. Soong
Speech Group, Microsoft Research Asia
Sep. 13, 2014
Background
Deep learning has made a huge impact on automatic speech recognition (ASR) research, products and services.
Speech generation and synthesis is the inverse process of speech recognition: Text-to-Speech (TTS) vs. Speech-to-Text (STT).
This tutorial covers deep learning approaches to speech generation and synthesis, with a focus on TTS synthesis and voice conversion.
Outline
Statistical parametric speech generation and synthesis
- HMM-based speech synthesis
- GMM-based voice conversion
Deep learning
- RBM, DBN, DNN, MDN and RNN
Deep learning for speech generation and synthesis
- Approaches to speech synthesis
- Approaches to voice conversion
Conclusions and future work
HMM-based Speech Synthesis [1]
By reversing the input and output of a Hidden Markov Model (HMM) used in speech recognition, the HMM becomes a generative model for text-to-speech (TTS) synthesis.
Speech spectrum, fundamental frequency, voicing and duration are modeled simultaneously by HMMs [2].
Models are trained, and signals are generated, under the universal Maximum Likelihood (ML) criterion.
[Figure: the HMM-based speech synthesis system. Training: speech signal from the SPEECH DATABASE → feature extraction (log F0, LSP & gain) → HMM training with a phonetic & prosodic question set → stream-dependent HMMs. Synthesis: TEXT → text analysis → labels → sequence of HMMs → source-filter parameter generation (log F0 → mixed excitation generation; LSP & gain → LPC synthesis) → synthesized speech.]
Advantages of HMM-based TTS
The statistical parametric model can be efficiently trained from recorded speech data and the corresponding transcriptions.
The HMM-based statistical model is parsimonious (i.e., small model size) and data efficient.
HMMs can generate the "optimal" speech parameter trajectory in the ML sense.
Concatenative vs. HMM-based TTS synthesis
HMM-based synthesis
- Statistically trained
- Vocoded speech (smooth & stable)
- Small footprint (less than 2 MB)
- Easy to modify its voice characteristics

Concatenative synthesis
- Sample-based (unit selection)
- High quality with occasional glitches
- Large footprint
- More difficult to modify its voice characteristics
Technologies
Multi-Space Distribution (MSD)-HMM for F0 and voicing modeling
Decision-tree based context clustering
Parameter generation with dynamic features
Mixed excitation
Global Variance
Minimum Generation Error Training
MSD-HMM for F0 Modeling[3]
A schematic representation of F0 modeling with the MSD-HMM:

[Figure: an F0 contour (Hz, 10 ms/frame) aligned with MSD-HMM states. Each state mixes a one-dimensional subspace for the voiced part with a zero-dimensional subspace for the unvoiced part; the subspace weights are state-dependent (e.g., voiced weights around 0.83–0.99 in voiced regions vs. 0.01–0.1 in unvoiced regions).]
Context Clustering[4,5]
Each state of the model has three decision trees: for duration, spectrum and pitch.
Minimum Description Length (MDL) is used as the stopping criterion.
An Example of Decision Trees

[Figure: decision trees splitting triphone models (e.g., k-a+b, t-a+h, s-i+n) with questions from the question set such as "C = fricative?", "C = vowel?", "R = silence?", "R = 'cl'?", "L = 'gy'?", "L = voiced?".]
Parameter Generation [6]

Speech parameter generation from an HMM with dynamic constraints: for a given HMM $\lambda$, determine the speech parameter vector sequence $O = [o_1^T, o_2^T, o_3^T, \ldots, o_T^T]^T$, with $o_t = [c_t^T, \Delta c_t^T, \Delta^2 c_t^T]^T$, which maximizes

$$P(O \mid \lambda) = \sum_{\text{all } Q} P(O, Q \mid \lambda)$$

If $Q$ is given, maximize $P(O \mid Q, \lambda)$ with respect to $O$ under the constraint $O = WC$:

$$\log P(O \mid Q, \lambda) = -\frac{1}{2} O^T U^{-1} O + O^T U^{-1} M + K$$

Setting $\partial \log P(WC \mid Q, \lambda) / \partial C = 0$ yields the linear system

$$W^T U^{-1} W C = W^T U^{-1} M$$
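To make the generation step concrete, here is a minimal NumPy sketch (function names and the simple edge handling are ours, not from the tutorial) that builds the window matrix W for static/delta/delta-delta features and solves the normal equations above per feature dimension; production systems exploit the banded structure of $W^T U^{-1} W$ rather than using a dense solve.

```python
# A minimal sketch of ML parameter generation with dynamic-feature constraints.
import numpy as np

def build_window_matrix(T):
    """Stack static, delta and delta-delta windows into W (3T x T)."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                       # static: c_t
        W[3 * t + 1, max(t - 1, 0)] -= 0.5      # delta: 0.5 (c_{t+1} - c_{t-1})
        W[3 * t + 1, min(t + 1, T - 1)] += 0.5
        W[3 * t + 2, max(t - 1, 0)] += 1.0      # delta-delta: c_{t-1} - 2 c_t + c_{t+1}
        W[3 * t + 2, t] -= 2.0
        W[3 * t + 2, min(t + 1, T - 1)] += 1.0
    return W

def generate_trajectory(means, variances):
    """means, variances: (3T,) stats ordered [static, delta, delta2] per frame,
    pinned down by the state sequence Q; returns the static trajectory C."""
    T = len(means) // 3
    W = build_window_matrix(T)
    U_inv = np.diag(1.0 / variances)            # diagonal covariance assumption
    A = W.T @ U_inv @ W
    b = W.T @ U_inv @ means
    return np.linalg.solve(A, b)                # W^T U^-1 W C = W^T U^-1 M
```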
Generated Speech Parameter Trajectory [7]
Mixed Excitation [8,9]
Traditional excitation model (hiss-buzz model)
A better excitation model (mixed excitation model)
Global Variance (GV) [10]
The "global variance" (GV) of the features is computed over an utterance and modeled by a single Gaussian with mean $\mu_v$ and covariance $\Sigma_v$; during generation, the trajectory likelihood is combined with the GV likelihood so that the generated features keep a natural dynamic range.
Minimum Generation Error Training [11]
Train the model by maximizing the likelihood of the training data:

$$\lambda^{*} = \arg\max_{\lambda} \sum_{\text{all } Q} P(O, Q \mid \lambda), \quad \text{w.r.t. } O = WC$$

Refine the model by minimizing the weighted generation error:

$$\tilde{\lambda} = \arg\min_{\lambda} \sum_{i} \sum_{\text{all } Q} P(Q \mid \lambda, O) \, D(C_i, \tilde{C}_i)$$

where $D(\cdot, \cdot)$ is a distance between the natural and generated static parameter trajectories.
GMM-based Voice Conversion [12,15]
Model the joint feature vector of the source and target speakers with a GMM.
GMM-based Voice Conversion [12,13]
Maximum likelihood conversion
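As an illustration, a frame-wise MMSE variant of the joint-density mapping can be sketched as below; this is a didactic simplification (the ML trajectory conversion of [13] also brings in dynamic features), and all names and shapes here are assumptions.

```python
# A minimal sketch of frame-wise GMM-based conversion under a joint-density model.
import numpy as np
from scipy.stats import multivariate_normal

def convert_frame(x, weights, mu_x, mu_y, S_xx, S_yx):
    """x: source frame (d,); per-mixture joint-GMM params: weights (M,),
    mu_x/mu_y (M, d), S_xx/S_yx (M, d, d) blocks of each joint covariance."""
    M = len(weights)
    # responsibilities P(m | x) from the source marginal of the joint GMM
    resp = np.array([w * multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                     for m, w in enumerate(weights)])
    resp /= resp.sum()
    # expected target: sum_m P(m|x) [mu_y^m + S_yx^m (S_xx^m)^-1 (x - mu_x^m)]
    y = np.zeros_like(mu_y[0])
    for m in range(M):
        y += resp[m] * (mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m]))
    return y
```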
Limitations
Vocoder
- The synthesized voice retains a traditional "vocoded" flavor

Acoustic modeling
- "Wrong" model assumptions made out of necessity, e.g. GMMs, diagonal covariances
- Greedy, hence suboptimal, decision-tree search for state clustering (tying)
Deep Learning[16]
A machine learning approach that learns high-level abstractions in data using multi-layered models; typically neural nets with 3 or more hidden layers.
Motivated by the human brain, which organizes ideas and concepts hierarchically.
DNNs were difficult to train in the 1980s: slow training, and vanishing gradients (RNNs).
Deep Learning[16]
DNN training has improved significantly since 2006, thanks to:
- GPUs
- Big data
- Pre-training

Different architectures, e.g. DNN, CNN, RNN and DBN, are used for computer vision, speech recognition and synthesis, and natural language processing.
Restricted Boltzmann Machine (RBM) [17]

Restricted: no interaction between hidden variables, and no interaction between visible variables.

The joint distribution is defined in terms of an energy function:

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad E(v, h) = -a^T v - b^T h - v^T W h$$

Parameters can be estimated by contrastive divergence learning [18].
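For concreteness, a minimal CD-1 update for a binary RBM might look like the following sketch (didactic only; the learning rate and sampling details are illustrative, not the tutorial's implementation):

```python
# A minimal CD-1 (contrastive divergence) sketch for a binary RBM.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step on a batch v0 (n, n_vis); W (n_vis, n_hid), a/b biases."""
    # positive phase: sample hidden units given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one step of Gibbs sampling
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # gradient approximation: <v h>_data - <v h>_model
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```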
Deep Belief Network (DBN) [17]
Each successive pair of layers is treated as an RBM
Stack RBMs to form a DBN
Deep Neural Net (DNN)
A feed-forward artificial neural network with multiple hidden layers:

$$y_j = f(x_j), \qquad x_j = b_j + \sum_i y_i w_{ij}$$

with a nonlinear activation function $f$, e.g.:

- Sigmoid: $f(x_j) = \dfrac{1}{1 + e^{-x_j}}$
- Hyperbolic tangent (tanh): $f(x_j) = \dfrac{e^{x_j} - e^{-x_j}}{e^{x_j} + e^{-x_j}}$
- Rectified linear: $f(x_j) = \max(x_j, 0)$
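A small sketch of the forward pass with these activations (layer sizes, names, and the linear output layer are our assumptions, not prescribed by the tutorial):

```python
# A forward-pass sketch of a feed-forward DNN.
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(x, 0.0)

def forward(x, layers, hidden_act=tanh):
    """layers: list of (W, b) per layer; hidden layers use hidden_act, and the
    output layer is left linear, as is common when regressing vocoder parameters."""
    h = x
    for W, b in layers[:-1]:
        h = hidden_act(h @ W + b)
    W, b = layers[-1]
    return h @ W + b
```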
DNN
Discriminative models are trained via back-propagation (BP) [19] with an appropriate cost function, using a "mini-batch" based stochastic gradient descent algorithm.
DNN acoustic models have improved speech recognition performance significantly [17].
DNN Pre-training
Initialize the weights to a better starting point than random initialization, prior to BP.

DBN [17]
- Layer-by-layer trained RBMs
- Unsupervised training via contrastive divergence

Layer-wise BP [20]
- Add one hidden layer after another
- Supervised training with BP
Mixture Density Networks (MDN) [21]
An MDN combines a conventional neural network with a mixture density model.
A conventional neural network can represent arbitrary functions.
In principle, an MDN can therefore represent arbitrary conditional probability distributions.
Mixture Density Networks (MDN) [21]
The output is a mixture density over the targets, making the neural network a generative model rather than a regression model.

Activation functions
- Sigmoid or hyperbolic tangent for hidden nodes
- Softmax for mixture weights, exponential for variances, identity for means

Criterion: maximum likelihood estimation (MLE)
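A sketch of such an MDN output head and its negative log-likelihood, assuming one scalar deviation per mixture component as described above (shapes and names are illustrative, and the mixture sum is computed naively rather than with log-sum-exp):

```python
# A sketch of an MDN output head and its negative log-likelihood.
import numpy as np

def mdn_params(z, M, d):
    """Split network outputs z (M + M + M*d,) into mixture parameters."""
    logits, log_sigma, mu = np.split(z, [M, 2 * M], axis=-1)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)          # softmax -> mixture weights
    sigma = np.exp(log_sigma)              # exponential -> positive deviation
    mu = mu.reshape(*z.shape[:-1], M, d)   # identity -> means
    return w, sigma, mu

def mdn_nll(y, w, sigma, mu):
    """-log sum_m w_m N(y; mu_m, sigma_m^2 I) for a single target y (d,)."""
    d = y.shape[-1]
    sq = ((y[..., None, :] - mu) ** 2).sum(-1)                   # (..., M)
    log_n = -0.5 * sq / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)
    return -np.log((w * np.exp(log_n)).sum(-1))
```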
Recurrent Neural Network (RNN) [22]

Feed-forward neural network (the input carries an acoustic context window):

$$h_t = \mathcal{H}(W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y$$

Recurrent neural network:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y$$
Long Short-Term Memory (LSTM) [22,23]

$$\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
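The equations translate directly into a step function; in this sketch the peephole weights $W_{ci}, W_{cf}, W_{co}$ are taken to be diagonal (applied elementwise), as in Graves' formulation, and the parameter container is our own convention:

```python
# A direct sketch of the peephole LSTM step above.
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """p: dict of weights named after the equations; peephole weights Wci,
    Wcf, Wco are vectors applied elementwise to the cell state."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c
```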
Bidirectional RNN [22,24]

Forward layer (past and current context, $x_1, x_2, \ldots, x_t$):

$$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$$

Backward layer (future and current context, $x_T, x_{T-1}, \ldots, x_t$):

$$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})$$

Output (bidirectional: the whole sequence $x_1, \ldots, x_T$):

$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$$
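A compact sketch of the bidirectional pass, assuming step functions that close over their own weights (e.g., two independently parameterized lstm_step instances from above), with hypothetical output weights Wf_y, Wb_y:

```python
# A bidirectional pass built from a generic recurrent step function.
import numpy as np

def birnn(xs, step_fwd, step_bwd, h0, c0, Wf_y, Wb_y, by):
    """xs: list of input frames; step_*(x, h, c) -> (h, c). Returns
    y_t = Wf_y hf_t + Wb_y hb_t + by for every frame."""
    hf, cf, fwd = h0, c0, []
    for x in xs:                      # forward layer: past -> current
        hf, cf = step_fwd(x, hf, cf)
        fwd.append(hf)
    hb, cb, bwd = h0, c0, []
    for x in reversed(xs):            # backward layer: future -> current
        hb, cb = step_bwd(x, hb, cb)
        bwd.append(hb)
    bwd.reverse()
    return [Wf_y @ f + Wb_y @ b + by for f, b in zip(fwd, bwd)]
```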
Deep Learning for Speech Generation and Synthesis
Motivated by the DNN's superior performance in ASR, we investigate the potential advantages of neural networks in speech synthesis and voice conversion:

Modeling long-span, high-dimensional and correlated features as input
- MCEP or LSP → spectral envelope; 1 frame → multiple frames

Non-linear mapping between input and output features with a deep-layered, hierarchical structure
- Represents a highly-variable mapping function compactly
- Simulates the human speech production system
Deep Learning for Speech Generation and Synthesis
Distributed representation
- Observed data are generated by the interactions of many different factors at different levels
- Replaces HMM states and decision trees, which decompose the training data into small partitions

"Discriminative" and "predictive" in the generation sense, with appropriate cost function(s), e.g. generation error
- The training criterion is closer to the objective than ML
Approaches of Deep Learning to Speech Synthesis
RBM/DBN: CUHK [25], USTC/Microsoft [26]
DNN: Google [27], CSTR [28], Microsoft [29]
DBN/DNN for feature transformation: IBM [30]
MDN: Google [31]
RNN with BLSTM: Microsoft [32], IBM [33]
Generative models: RBM vs. GMM [34]
GMM

$$p(v) = \sum_{i=1}^{M} m_i \, \mathcal{N}(v;\, u_i, \Sigma_i)$$

- Mixture models cannot generate distributions sharper than their individual components
- Need lots of data to estimate the parameters

RBM
- Product models can generate distributions sharper than their individual components
- Need less data to estimate the parameters
DBN for Speech Synthesis [25,26]
Model the joint distribution of linguistic and acoustic features, fully utilizing the generative nature of the DBN to generate speech parameters.
DBNs replace GMMs for the distribution of the spectral envelopes at each decision-tree-clustered HMM state.
The estimated spectral RBM has a much sharper formant structure than the Gaussian mean.
Deep Learning (DNN, MDN, RNN) for Speech Synthesis
[Figure: a DNN, MDN or RNN maps an input layer of linguistic features — binary (phone label, POS label, word, ...) and numeric (number of words, frame position, duration, ...) — through hidden layers to an output layer of acoustic features: spectral (LSP spectrum, gain) and excitation (F0).]
DNN for TTS Training
Input features x and output features y are time-aligned frame-by-frame by well-trained HMMs.

L2 cost function with back-propagation (BP):

$$C = \frac{1}{2T} \sum_{t=1}^{T} \left\| f(x^{(t)}) - y^{(t)} \right\|^2 + \frac{\lambda}{2} \sum_{i,j,l} \left( w_{i,j}^{l} \right)^2$$

A "mini-batch" based stochastic gradient descent algorithm:

$$(W^l, b^l) \leftarrow (W^l, b^l) - \varepsilon \frac{\partial C}{\partial (W^l, b^l)}, \quad 0 \le l \le L$$
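The update, with the L2 penalty folded in as weight decay, can be sketched as follows (the back-propagated gradients of the data term are assumed given and not shown; hyperparameter values are illustrative):

```python
# A sketch of the mini-batch SGD update with L2 weight decay.
import numpy as np

def sgd_step(layers, grads, lr=0.01, l2=1e-4):
    """layers/grads: lists of (W, b) and (dW, db) per layer, where dW/db are
    gradients of the data term; the L2 penalty adds lambda * W to dC/dW."""
    for (W, b), (dW, db) in zip(layers, grads):
        W -= lr * (dW + l2 * W)   # weight decay from the (lambda/2)||w||^2 term
        b -= lr * db              # biases are typically not decayed
    return layers
```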
44
Tutorial, ISCSLP 2014, Singapore
44
DNN vs. HMM for TTS Synthesis

[Figure: parameter generation at synthesis time. Text analysis converts text into label sequences X_1 ... X_N; an HMM system maps the labels to vocoder parameters Y_1 ... Y_N through decision trees (questions such as "vowel?", "nasal?", "R-silence?", "L-voiced plosive?"), while a DNN maps them through hidden layers h^1 ... h^3; the vocoder parameters then drive the synthesized speech.]
Deep MDN (DMDN) based Speech Synthesis[31]
An MDN, in principle, can represent arbitrary conditional probability distributions. It addresses two limitations of the DNN:
- The unimodal nature of its objective function
- The lack of ability to predict variances
DMDN for TTS training[31]
Training criterion: maximum likelihood, consistent with the generation algorithm; the network outputs define the conditional density directly:

$$p(y \mid x, \lambda) = \sum_{m=1}^{M} w_m(x) \, \mathcal{N}\!\left(y;\, \mu_m(x),\, \sigma_m^2(x) I\right)$$
DMDN for TTS Synthesis [31]

[Figure: parameter generation at synthesis time. Text analysis converts text into labels X_1 ... X_N; the network's hidden layers h^1 ... h^3 output per-frame mixture parameters, from which parameter generation and the vocoder produce the synthesized speech.]

Whereas a DNN outputs only means, $Y = [u_1, u_2, \ldots, u_N]$, the DMDN outputs full mixture parameters per frame:

$$Y = [u_1^1, \ldots, u_N^1, \sigma^1, w^1, \; u_1^2, \ldots, u_N^2, \sigma^2, w^2, \; \ldots, \; u_1^M, \ldots, u_N^M, \sigma^M, w^M]$$
DBLSTM-RNN based Speech Synthesis
Speech production is a continuous dynamic process: we consider semantic and syntactic information, select words, formulate phonetics and articulate sounds, and these features interact with complex contextual effects from neighboring and distant input.

An RNN with LSTM can, in principle, capture information from anywhere in the feature sequence; a bi-directional network accesses long-range context in both the forward and backward directions. A whole sentence is given as input.
DBLSTM-RNN for TTS training
Training criterion: L2 cost function.

Back-propagation through time (BPTT)
- Unfold the RNN into a feed-forward network through time
- Train the unfolded network with back-propagation

Sequence mapping between input and output
- The weight gradients are computed over the entire utterance
- Utterance-level randomization
- Tens of utterances for each "mini-batch"
DBLSTM-RNN for TTS Synthesis

[Figure: text → text analysis → input feature extraction → input features → forward and backward layers → output features → parameter generation → vocoder → waveform.]
DBLSTM-RNN based Speech Synthesis
Pros
Bidirectional, deep-in-time and deep-in-layer structure
Embedded sequence training
Matched criteria for training and testing
Cons
High computational complexity
Slow to converge
Experimental Setup

Corpus: U.S. English utterances (female, general domain)
Training / test set: 5,000 / 200 sentences
Linguistic features: 319 binary (e.g. phone, POS); 36 numerical (e.g. word position)
Acoustic features: 1 U/V, LSP + gain (40+1), F0, Δ, Δ²
HMM topology: 5-state, left-to-right HMM, MSD F0, MDL, MGE
DNN, MDN, RNN* architecture: 1–6 layers, 512 or 1024 nodes/layer, sigmoid/tanh, continuous F0
Post-processing: formant sharpening of LSP frequencies

* CURRENNT [40], open-source code released under GPLv3
Evaluation Metrics
Objective measures
- Log Spectral Distance (LSD)
- Voiced/unvoiced (V/U) error rate
- Root Mean Squared Error (RMSE) of F0

Subjective measure
- AB preference test via crowdsourcing (20 subjects, 50 pairs of sentences each)
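Hedged sketches of the objective measures, under common definitions that the tutorial does not spell out (LSD as the RMS of dB spectral differences; F0 RMSE over frames both streams mark as voiced):

```python
# Sketches of the objective measures under assumed standard definitions.
import numpy as np

def log_spectral_distance(S_ref, S_gen):
    """Mean LSD in dB between two magnitude spectrogram arrays (T, K)."""
    d = 20.0 * (np.log10(S_ref) - np.log10(S_gen))
    return np.mean(np.sqrt(np.mean(d ** 2, axis=1)))

def vuv_error_rate(v_ref, v_gen):
    """Fraction of frames with mismatched voiced/unvoiced decisions."""
    return np.mean(np.asarray(v_ref) != np.asarray(v_gen))

def f0_rmse(f0_ref, f0_gen, v_ref, v_gen):
    """RMSE in Hz over frames that both streams mark as voiced."""
    both = np.asarray(v_ref) & np.asarray(v_gen)
    diff = np.asarray(f0_ref)[both] - np.asarray(f0_gen)[both]
    return np.sqrt(np.mean(diff ** 2))
```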
Experimental Results
Activation functions (sigmoid vs. tanh)

[Plot: LSD (dB, ~3.6–4.2) vs. number of epochs (5–45) for sigmoid and tanh hidden activations.]

* The rectified linear activation function is better in objective measures [31].
Experimental Results
Pre-training

[Plot: LSD (dB, ~3.6–4.2) vs. number of epochs (10–90) for no pre-training, DBN pre-training, and layer-wise BP pre-training.]
Experimental Results
The number of hidden layers

[Plot: LSD (dB, ~3.6–4.2) vs. number of epochs (10–50) for 1–6 hidden layers (H1–H6).]
Evaluation Results (DNN)

Objective measures:

DNN structure | LSD (dB) | V/U error rate (%) | F0 RMSE (Hz)
512 × 3 (0.77 M) | 3.76 | 5.9 | 15.8
512 × 6 (1.55 M) | 3.73 | 5.8 | 15.8
1024 × 3 (2.59 M) | 3.73 | 5.9 | 15.9

HMM, MDL factor | LSD (dB) | V/U error rate (%) | F0 RMSE (Hz)
1 (2.89 M) | 3.74 | 5.8 | 17.7
1.6 (1.52 M) | 3.85 | 6.1 | 18.1
3 (0.85 M) | 3.91 | 6.2 | 18.4

Subjective AB preference:

HMM (MDL=1) 46% vs. neutral 10% vs. DNN (512 × 3) 44%, p = 0.45
HMM (MDL=1) 23% vs. neutral 10% vs. DNN (1024 × 3) 67%