Generative Model-Based Text-to-Speech Synthesis
Heiga Zen (Google London)
February 3rd, 2017 @ MIT
Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics
Text-to-speech as sequence-to-sequence mapping
• Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen"
• Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen"
• Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" → speech
Speech production process

[Figure: speech production as modulation of a carrier wave by speech information: from text (concept), the fundamental frequency, voiced/unvoiced switching, and frequency transfer characteristics control a sound source (voiced: pulse train; unvoiced: noise) driven by air flow and shaped by the vocal tract's frequency transfer characteristics into speech.]
Typical flow of a TTS system

TEXT
→ Text analysis (NLP frontend; discrete → discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation
→ Speech synthesis (speech backend; discrete → continuous): prosody prediction, waveform generation
→ SYNTHESIZED SPEECH
Speech synthesis approaches
• Rule-based, formant synthesis [1]
• Sample-based, concatenative synthesis [2]
• Model-based, generative synthesis: p(speech = x | text = "Hello, my name is Heiga Zen.")
Probabilistic formulation of TTS

Random variables:
• X: speech waveforms (data), observed
• W: transcriptions (data), observed
• w: given text, observed
• x: synthesized speech, unobserved

Synthesis:
• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample x̄ from it
Probabilistic formulation

Introduce auxiliary variables (representation) & factorize the dependency:

$$p(x \mid w, X, W) = \sum_{\forall l} \sum_{\forall L} \iiint \frac{p(x \mid o)\, p(o \mid l, \lambda)\, p(l \mid w)\, p(X \mid O)\, p(O \mid L, \lambda)\, p(\lambda)\, p(L \mid W)}{p(X)}\, \mathrm{d}o\, \mathrm{d}O\, \mathrm{d}\lambda$$

where O, o are acoustic features, L, l are linguistic features, and λ is the model.

[Figure: graphical model over X, W, O, L, l, w, o, x, and λ.]
Approximation (1)

Approximate the sums & integrals by best point estimates (as in MAP) [3]:

$$p(x \mid w, X, W) \approx p(x \mid \hat{o})$$

where

$$\{\hat{o}, \hat{l}, \hat{O}, \hat{L}, \hat{\lambda}\} = \arg\max_{o, l, O, L, \lambda}\; p(o \mid l, \lambda)\, p(l \mid w)\, p(X \mid O)\, p(O \mid L, \lambda)\, p(\lambda)\, p(L \mid W)$$

[Figure: graphical model with point estimates ô, l̂, Ô, L̂, λ̂.]
Approximation (2)

Joint → step-by-step maximization [3]:

• Extract acoustic features: $\hat{O} = \arg\max_O p(X \mid O)$
• Extract linguistic features: $\hat{L} = \arg\max_L p(L \mid W)$
• Learn mapping: $\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$
• Predict linguistic features: $\hat{l} = \arg\max_l p(l \mid w)$
• Predict acoustic features: $\hat{o} = \arg\max_o p(o \mid \hat{l}, \hat{\lambda})$
• Synthesize waveform: $\bar{x} \sim f_x(\hat{o}) = p(x \mid \hat{o})$

Representations: acoustic, linguistic, mapping. A toy sketch of this pipeline follows below.
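To make the factorization concrete, here is a deliberately toy Python sketch of the step-by-step pipeline. Every component is an invented stand-in (a crude log-spectral "vocoder analysis", character-code "text analysis", and a linear-Gaussian acoustic model fitted by least squares); only the six-stage structure mirrors the maximization steps above, not any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_acoustic_features(waveform):          # O^ = arg max_O p(X | O): "vocoder" analysis
    frames = waveform[: len(waveform) // 80 * 80].reshape(-1, 80)
    return np.log(np.abs(np.fft.rfft(frames)) + 1e-6)   # crude log-spectral features

def extract_linguistic_features(text, n_frames):  # L^ = arg max_L p(L | W): "text" analysis
    codes = np.array([ord(c) % 32 for c in text])
    idx = np.linspace(0, len(codes) - 1, n_frames).astype(int)
    return np.eye(32)[codes[idx]]                 # frame-level one-hot features

def fit_acoustic_model(L, O):                     # lambda^ = arg max p(O^ | L^, lambda) p(lambda)
    W, *_ = np.linalg.lstsq(L, O, rcond=None)     # ML estimate of a linear-Gaussian mapping
    return W

def synthesize(text, W, n_frames=50):
    l = extract_linguistic_features(text, n_frames)   # l^ = arg max_l p(l | w)
    o = l @ W                                         # o^ = arg max_o p(o | l^, lambda^)
    return o    # a real system would now run vocoder synthesis: x~ ~ p(x | o^)

x_train = rng.standard_normal(8000)               # stand-in training waveform X
O_hat = extract_acoustic_features(x_train)
L_hat = extract_linguistic_features("hello world", len(O_hat))
W_hat = fit_acoustic_model(L_hat, O_hat)
print(synthesize("hello", W_hat).shape)
```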
Representation – Linguistic features

[Figure: hierarchical linguistic structure of "Hello, world.":
Sentence ("Hello, world."): length, ...
Phrases ("Hello," / "world."): intonation, ...
Words (hello / world): POS, grammar, ...
Syllables (h-e2 / l-ou1 / w-er1-l-d): stress, tone, ...
Phones (h, e, l, ou, w, er, l, d): voicing, manner, ...]

→ Based on knowledge about spoken language:
• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules
Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o):

[Figure: vocal source (pulse train for voiced, white noise for unvoiced; fundamental frequency, aperiodicity, voicing) excites a vocal tract filter (cepstrum, LPC, ...), giving x(n) = h(n) * e(n), with overlap/shift windowing of the speech signal.]

→ Requires solving an inverse problem:
• Estimate parameters from signals
• Use the estimated parameters (e.g., cepstrum) as acoustic features
Representation – Mapping

Rule-based, formant synthesis [1]:
• Vocoder analysis: $\hat{O} = \arg\max_O p(X \mid O)$
• Text analysis: $\hat{L} = \arg\max_L p(L \mid W)$
• Extract rules: $\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$
• Text analysis: $\hat{l} = \arg\max_l p(l \mid w)$
• Apply rules: $\hat{o} = \arg\max_o p(o \mid \hat{l}, \hat{\lambda})$
• Vocoder synthesis: $\bar{x} \sim f_x(\hat{o}) = p(x \mid \hat{o})$

→ Hand-crafted rules on knowledge-based features
Representation – Mapping

HMM-based [4], statistical parametric synthesis [5]:
• Vocoder analysis: $\hat{O} = \arg\max_O p(X \mid O)$
• Text analysis: $\hat{L} = \arg\max_L p(L \mid W)$
• Train HMMs: $\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$
• Text analysis: $\hat{l} = \arg\max_l p(l \mid w)$
• Parameter generation: $\hat{o} = \arg\max_o p(o \mid \hat{l}, \hat{\lambda})$
• Vocoder synthesis: $\bar{x} \sim f_x(\hat{o}) = p(x \mid \hat{o})$

→ Replace the rules with an HMM-based generative acoustic model
Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics
HMM-based generative acoustic model for TTS
• Context-dependent subword HMMs
• Decision trees to cluster & tie HMM states → interpretable

[Figure: linguistic features l_1, ..., l_N drive a left-to-right HMM whose discrete hidden states q_1, q_2, ... emit the continuous acoustic frames o_1, o_2, ..., o_T.]

$$p(o \mid l, \lambda) = \sum_{\forall q} \prod_{t=1}^{T} p(o_t \mid q_t, \lambda)\, P(q \mid l, \lambda) = \sum_{\forall q} \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_{q_t}, \Sigma_{q_t})\, P(q \mid l, \lambda)$$

where q_t is the hidden state at time t.
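The sum over all state sequences q above is intractable by enumeration but is computed exactly by the standard forward algorithm. A minimal NumPy sketch with diagonal-covariance Gaussian emissions; the 3-state left-to-right HMM and all parameter values below are invented for illustration.

```python
import numpy as np

def log_gauss(o, mu, var):
    """log N(o; mu, diag(var)) of one frame, evaluated for every state at once."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var, axis=-1)

def forward_loglik(obs, log_pi, log_A, mu, var):
    """log p(o | lambda) = log sum_q prod_t N(o_t; mu_{q_t}, var_{q_t}) P(q)."""
    log_alpha = log_pi + log_gauss(obs[0], mu, var)
    for o_t in obs[1:]:
        m = log_alpha.max()                       # log-sum-exp over previous states
        log_alpha = m + np.log(np.exp(log_alpha - m) @ np.exp(log_A))
        log_alpha += log_gauss(o_t, mu, var)      # add emission log-likelihood
    m = log_alpha.max()
    return m + np.log(np.sum(np.exp(log_alpha - m)))

# 3-state left-to-right HMM with 2-dimensional observations (all values invented)
log_pi = np.log([1.0, 1e-10, 1e-10])
log_A = np.log(np.array([[0.6, 0.4, 1e-10],
                         [1e-10, 0.6, 0.4],
                         [1e-10, 1e-10, 1.0]]))
mu = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
var = np.ones((3, 2))
obs = np.array([[0.1, -0.1], [0.9, 1.2], [2.1, 1.8]])
print(forward_loglik(obs, log_pi, log_A, mu, var))
```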
HMM-based generative acoustic model for TTS
• Non-smooth, step-wise statistics → smoothing is essential
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
• Data fragmentation → ineffective, local representation

A lot of research has been done to address these issues.
Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic

[Figure: regression tree with yes/no splits mapping linguistic features l to statistics of acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features.
Replace the tree with a general-purpose regression model → artificial neural network.
FFNN-based acoustic model for TTS [6]

[Figure: feed-forward network mapping the frame-level linguistic feature l_t (input) to the frame-level acoustic feature o_t (target).]

$$h_t = g(W_{hl} l_t + b_h), \quad \hat{o}_t = W_{oh} h_t + b_o, \quad \lambda = \{W_{hl}, W_{oh}, b_h, b_o\}$$
$$\hat{\lambda} = \arg\min_{\lambda} \sum_t \lVert o_t - \hat{o}_t \rVert^2 \;\;\Rightarrow\;\; \hat{o}_t \approx \mathbb{E}[o_t \mid l_t]$$

→ Replaces the decision trees & Gaussian distributions
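These equations translate almost line for line into NumPy. In the sketch below, the tanh choice for g, the layer sizes, and the random training data are all illustrative assumptions; gradients of the squared error are derived by hand for the single hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim_l, dim_h, dim_o = 200, 30, 64, 40          # frames and feature sizes (made up)
L = rng.standard_normal((T, dim_l))               # frame-level linguistic features l_t
O = rng.standard_normal((T, dim_o))               # frame-level acoustic targets o_t

W_hl = 0.1 * rng.standard_normal((dim_l, dim_h)); b_h = np.zeros(dim_h)
W_oh = 0.1 * rng.standard_normal((dim_h, dim_o)); b_o = np.zeros(dim_o)

lr = 1e-3
for step in range(500):                           # lambda^ = arg min sum_t ||o_t - o^_t||^2
    H = np.tanh(L @ W_hl + b_h)                   # h_t = g(W_hl l_t + b_h)
    O_hat = H @ W_oh + b_o                        # o^_t = W_oh h_t + b_o
    dO = 2 * (O_hat - O) / T                      # gradient of the mean squared error
    dH = (dO @ W_oh.T) * (1 - H ** 2)             # back-propagate through tanh
    W_oh -= lr * H.T @ dO;  b_o -= lr * dO.sum(0)
    W_hl -= lr * L.T @ dH;  b_h -= lr * dH.sum(0)

print(np.mean((O_hat - O) ** 2))                  # o^_t approximates E[o_t | l_t]
```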
RNN-based acoustic model for TTS [7]

[Figure: recurrent network with recurrent connections between frames, mapping l_t to o_t.]

$$h_t = g(W_{hl} l_t + W_{hh} h_{t-1} + b_h), \quad \hat{o}_t = W_{oh} h_t + b_o, \quad \lambda = \{W_{hl}, W_{hh}, W_{oh}, b_h, b_o\}$$
$$\hat{\lambda} = \arg\min_{\lambda} \sum_t \lVert o_t - \hat{o}_t \rVert^2$$

FFNN: $\hat{o}_t \approx \mathbb{E}[o_t \mid l_t]$  vs.  RNN: $\hat{o}_t \approx \mathbb{E}[o_t \mid l_1, \ldots, l_t]$
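For comparison with the FFNN sketch above, here is a forward-pass-only sketch of the recurrence (same invented shapes): the added W_hh term is the only structural change, and it makes each prediction depend on the whole history l_1, ..., l_t.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim_l, dim_h, dim_o = 200, 30, 64, 40
L = rng.standard_normal((T, dim_l))
W_hl = 0.1 * rng.standard_normal((dim_l, dim_h))
W_hh = 0.1 * rng.standard_normal((dim_h, dim_h))  # recurrent connections
W_oh = 0.1 * rng.standard_normal((dim_h, dim_o))
b_h, b_o = np.zeros(dim_h), np.zeros(dim_o)

h = np.zeros(dim_h)
O_hat = np.empty((T, dim_o))
for t in range(T):
    h = np.tanh(L[t] @ W_hl + h @ W_hh + b_h)     # h_t = g(W_hl l_t + W_hh h_{t-1} + b_h)
    O_hat[t] = h @ W_oh + b_o                     # o^_t ~ E[o_t | l_1, ..., l_t]
print(O_hat.shape)
```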
NN-based generative acoustic model for TTS
• Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]
• Data fragmentation → distributed representation [10]

The NN-based approach is now mainstream in research & products:
• Models: FFNN [6], MDN [11], RNN [7], highway network [12], GAN [13]
• Products: e.g., Google [14]
Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics
NN-based generative model for TTS

[Figure: the speech production process again: text (concept) modulates the carrier wave via frequency transfer characteristics, voiced/unvoiced switching, and fundamental frequency; the source (pulse/noise) driven by air flow and the vocal tract filter yield speech.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform
Knowledge-based features → Learned features

Unsupervised feature learning:

[Figure: an auto-encoder compresses the raw FFT spectrum x(t) into an acoustic feature o(t); a word embedding maps the 1-hot representation w(n) of "hello" / "world" to a linguistic feature l(n).]

• Speech: auto-encoder on FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] → less positive
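As a rough illustration of the auto-encoder idea (not the architecture of [9], which is deep and non-linear), the sketch below compresses raw log-FFT spectra through a linear bottleneck, whose optimum is given in closed form by an SVD, and uses the bottleneck activations as learned acoustic features o(t).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                       # stand-in waveform
frames = x[: len(x) // 256 * 256].reshape(-1, 256)
spec = np.log(np.abs(np.fft.rfft(frames)) + 1e-6)    # raw log-FFT spectra, dim 129

mean = spec.mean(axis=0)
U, S, Vt = np.linalg.svd(spec - mean, full_matrices=False)
encoder = Vt[:32].T                                  # 129 -> 32 bottleneck
o = (spec - mean) @ encoder                          # learned acoustic features o(t)
recon = o @ encoder.T + mean                         # tied-weights linear decoder
print(np.mean((recon - spec) ** 2))                  # reconstruction error
```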
Relax approximation – Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

$$\hat{O} = \arg\max_O p(X \mid O), \quad \hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$$
$$\Downarrow$$
$$\{\hat{\lambda}, \hat{O}\} = \arg\max_{\lambda, O} p(X \mid O)\, p(O \mid \hat{L}, \lambda)\, p(\lambda)$$

Joint source-filter & acoustic model optimization:
• HMM [18, 19, 20]
• NN [21, 22]
Relax approximation – Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]:

[Figure: an LSTM-RNN predicts voiced and unvoiced cepstra (o_t^(v), o_t^(u)) from linguistic features l_t; the pulse train p_n passes through G(z), and the speech s_n through the inverse filter H_u^{-1}(z), yielding the residual e_n; derivatives of the waveform-level error are back-propagated through the synthesis filter into the network.]
Relax approximation – Direct mapping from linguistic features to waveform

No explicit acoustic features:

$$\{\hat{\lambda}, \hat{O}\} = \arg\max_{\lambda, O} p(X \mid O)\, p(O \mid \hat{L}, \lambda)\, p(\lambda)$$
$$\Downarrow$$
$$\hat{\lambda} = \arg\max_{\lambda} p(X \mid \hat{L}, \lambda)\, p(\lambda)$$

Generative models for raw audio:
• LPC [23]
• WaveNet [24]
• SampleRNN [25]
WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals, with $x = \{x_0, x_1, \ldots, x_{N-1}\}$ the raw waveform:

$$p(x \mid \lambda) = p(x_0, x_1, \ldots, x_{N-1} \mid \lambda) = \prod_{n=0}^{N-1} p(x_n \mid x_0, \ldots, x_{n-1}, \lambda)$$

WaveNet [24] models $p(x_n \mid x_0, \ldots, x_{n-1}, \lambda)$ with a convolutional NN.

Key components:
• Causal dilated convolution: captures long-term dependency
• Gated convolution + residual + skip connections: powerful non-linearity
• Softmax at the output: classification rather than regression
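The AR factorization implies sequential, sample-by-sample generation: draw x_n from the predicted categorical distribution, feed it back, and repeat. Below is a sketch of that ancestral-sampling loop; the "network" is a placeholder returning uniform logits, since only the loop structure is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 256                                   # 8-bit quantized amplitude classes

def next_sample_logits(history):
    """Placeholder for the WaveNet stack: logits over Q classes given history."""
    return np.zeros(Q)                    # uniform; a real model conditions on history

x = []
for n in range(100):                      # ancestral sampling, one sample at a time
    logits = next_sample_logits(x)
    p = np.exp(logits - logits.max()); p /= p.sum()
    x.append(rng.choice(Q, p=p))          # x_n ~ p(x_n | x_0, ..., x_{n-1})

print(len(x), x[:5])
```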
WaveNet – Causal dilated convolution

100 ms at 16 kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM.

Dilated convolution: the receptive field size grows exponentially with the number of layers.

[Figure: a stack of causal dilated convolutions (dilations 1, 2, 4, 8) over inputs x_{n-16}, ..., x_{n-1} computing p(x_n | x_{n-1}, ..., x_{n-16}).]
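A NumPy sketch of the mechanism: with filter width 2 and dilations 1, 2, 4, 8, four layers already cover the 16 past samples shown in the figure, and the receptive field is 1 plus the sum of the dilations. Weights are random, since only the connectivity pattern matters here.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[n] = w[0]*x[n - dilation] + w[1]*x[n], zero-padded on the left (causal)."""
    x_shift = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * x_shift + w[1] * x

rng = np.random.default_rng(0)
h = rng.standard_normal(32)               # toy input signal
for dilation in [1, 2, 4, 8]:             # dilation doubles at every layer
    h = np.tanh(causal_dilated_conv(h, rng.standard_normal(2), dilation))

receptive_field = 1 + sum([1, 2, 4, 8])   # = 16 past samples per output
print(receptive_field)
```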
WaveNet – Non-linearity

[Figure: a stack of 30 residual blocks; each block applies a 2×1 dilated convolution (512 channels), a gated activation, and 1×1 convolutions, with a residual connection to the next block and a skip connection out; the summed skip connections (256 channels) pass through ReLU, 1×1 convolution, ReLU, 1×1 convolution, and a 256-way softmax to produce p(x_n | x_0, ..., x_{n-1}).]
WaveNet – Softmax

[Figure: an analog audio signal (amplitude vs. time) is sampled & quantized; the next-sample distribution over the quantized category indices (1–16 in the illustration) becomes a categorical distribution, i.e., a histogram, which can be unimodal, multimodal, skewed, ...]
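The slides show only the quantization step; the WaveNet paper [24] additionally applies mu-law companding before 8-bit quantization, so that the 256 softmax classes are perceptually spaced. A sketch of that encode/decode pair, following the standard mu-law definition:

```python
import numpy as np

MU = 255

def mu_law_encode(x):
    """x in [-1, 1] -> integer class in [0, 255]; finer resolution near zero."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mu_law_decode(c):
    """Integer class -> reconstructed amplitude in [-1, 1]."""
    y = 2 * (c.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1, 1, 11)
print(mu_law_encode(x))                       # small amplitudes get more classes
print(mu_law_decode(mu_law_encode(x)) - x)    # quantization error per sample
```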
WaveNet – Conditional modelling

[Figure: linguistic features l are projected by 1×1 convolutions into an embedding h_n at each time n and added as conditioning inside every residual block's gated activation, yielding p(x_n | x_0, ..., x_{n-1}, l).]
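In the notation of [24], the conditioning enters each residual block's gated activation as z = tanh(W_f * x + V_f h) ⊙ σ(W_g * x + V_g h). A single-timestep sketch with made-up dimensions; plain matrix multiplies stand in for the 1×1 convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_res, dim_cond = 64, 32
x = rng.standard_normal(dim_res)           # dilated-conv output at time n
h = rng.standard_normal(dim_cond)          # linguistic embedding h_n

W_f, W_g = 0.1 * rng.standard_normal((2, dim_res, dim_res))   # filter & gate weights
V_f, V_g = 0.1 * rng.standard_normal((2, dim_cond, dim_res))  # conditioning projections

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

z = np.tanh(x @ W_f + h @ V_f) * sigmoid(x @ W_g + h @ V_g)   # gated activation
residual_out = x + z @ (0.1 * rng.standard_normal((dim_res, dim_res)))  # 1x1 + residual
print(residual_out.shape)
```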
WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27]:
• Stationary process with a fixed-length analysis window → model estimated within a 20–25 ms window with a 5–10 ms shift
• Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
• Gaussian process → assumes speech signals are normally distributed

WaveNet:
• Sample-by-sample, non-linear, and can take additional inputs
• Arbitrarily-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24]
[Figure: mean opinion scores: HMM < LSTM < concatenative < WaveNet.]
Relax approximation – Towards Bayesian end-to-end TTS

Integrated end-to-end:

$$\hat{\lambda} = \arg\max_{\lambda} p(X \mid \hat{L}, \lambda)\, p(\lambda), \quad \hat{L} = \arg\max_L p(L \mid W), \quad \hat{l} = \arg\max_l p(l \mid w)$$
$$\Downarrow$$
$$\hat{\lambda} = \arg\max_{\lambda} p(X \mid W, \lambda)\, p(\lambda)$$

Text analysis is integrated into the model.
Relax approximation – Towards Bayesian end-to-end TTS

Bayesian end-to-end:

$$\hat{\lambda} = \arg\max_{\lambda} p(X \mid W, \lambda)\, p(\lambda), \quad \bar{x} \sim f_x(w, \hat{\lambda}) = p(x \mid w, \hat{\lambda})$$
$$\Downarrow$$
$$\bar{x} \sim f_x(w, X, W) = p(x \mid w, X, W) = \int p(x \mid w, \lambda)\, p(\lambda \mid X, W)\, \mathrm{d}\lambda \approx \frac{1}{K} \sum_{k=1}^{K} p(x \mid w, \hat{\lambda}_k) \quad \text{(ensemble)}$$

Marginalize over model parameters & architecture.
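A toy illustration of the final ensemble approximation above: K point estimates λ̂_k stand in for posterior draws, and the predictive density is their average. The per-model Gaussian predictive and all numbers here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
model_means = rng.normal(0.0, 0.3, size=K)   # stand-ins for K posterior draws lambda^_k

def predictive_density(x):
    """p(x | w, X, W) ~ (1/K) sum_k p(x | w, lambda^_k); each term Gaussian here."""
    densities = np.exp(-0.5 * (x - model_means) ** 2) / np.sqrt(2 * np.pi)
    return densities.mean()

print(predictive_density(0.0))
```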
Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics
Generative model-based text-to-speech synthesis
• Bayes formulation + factorization + approximations
• Representation: acoustic features, linguistic features, mapping
  Mapping: rules → HMM → NN
  Features: engineered → unsupervised, learned
• Fewer approximations: joint training, direct waveform modelling
  Moving towards integrated & Bayesian end-to-end TTS

Naturalness: concatenative < generative
Flexibility: concatenative ≪ generative (e.g., multiple speakers)
Beyond "text"-to-speech synthesis

TTS on conversational assistants:
• Texts aren't fully self-contained
• More context is needed:
  Location, to resolve homographs
  User query, to put the right emphasis

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models.
Beyond "generative" TTS

Generative model-based TTS:
• The model represents the process behind speech production
  Trained to minimize error against human-produced speech
  Learned model → "speaker"
• But speech is for communication
  Goal: maximize the amount of information received
  The "listener" is missing → bring a "listener" into training, or into the model itself?
Thanks!
References I
[1] D. Klatt. Real-time speech synthesis by rule. Journal of ASA, 68(S1):S18, 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf. Invited talk given at ASRU 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II(11):2099–2107, 2000. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.
References II
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. http://research.google.com/pubs/pub44630.html. Invited talk given at ASRU 2015.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representations. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.
References III
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, pages 4879–4883, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D(10):2471–2480, 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, pages 3925–3928, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, pages 577–580, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW7, pages 88–93, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, pages 5640–5644, 2016.
References IV
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, pages 203–206, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ, 2010. (in Japanese).
[Appendix figure: graphical models of the eight formulations discussed in the talk: (1) Bayesian, (2) auxiliary variables + factorization, (3) joint maximization, (4) step-by-step maximization (e.g., statistical parametric TTS), (5) joint acoustic feature extraction + model training, (6) conditional WaveNet-based TTS, (7) integrated end-to-end, (8) Bayesian end-to-end.]