Generative Model-Based Text-to-Speech Synthesis
Heiga Zen (Google London)
February 3rd, 2017 @ MIT

Outline
• Generative TTS
• Generative acoustic models for parametric TTS
  - Hidden Markov models (HMMs)
  - Neural networks
• Beyond parametric TTS
  - Learned features
  - WaveNet
  - End-to-end
• Conclusion & future topics

Text-to-speech as sequence-to-sequence mapping
• Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen"
• Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen"
• Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" → speech


Speech production process

[Figure: speech production as modulation of a carrier wave (air flow) by speech information: a sound source (voiced: pulse train at the fundamental frequency; unvoiced: noise) is shaped by the frequency transfer characteristics of the vocal tract, turning text (concept) into speech.]

Typical flow of a TTS system

TEXT
→ Text analysis (NLP frontend, discrete → discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation
→ Speech synthesis (speech backend, discrete → continuous): prosody prediction, waveform generation
→ SYNTHESIZED SPEECH

Speech synthesis approaches
• Rule-based, formant synthesis [1]
• Sample-based, concatenative synthesis [2]
• Model-based, generative synthesis: p(speech | text = "Hello, my name is Heiga Zen.")

Probabilistic formulation of TTS

Random variables
  X : speech waveforms (training data), observed
  W : transcriptions (training data), observed
  w : given text, observed
  x : synthesized speech, unobserved

Synthesis
• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample x̄ from it

Probabilistic formulation
Introduce auxiliary variables (representations) and factorize the dependency:

  p(x | w, X, W) = Σ_{∀l} Σ_{∀L} ∫∫∫ p(x | o) p(o | l, λ) p(l | w)
                   × p(X | O) p(O | L, λ) p(λ) p(L | W) / p(X) do dO dλ

where
  O, o : acoustic features
  L, l : linguistic features
  λ    : model

Approximation (1)
Approximate the sum & integral by best point estimates (as in MAP) [3]:

  p(x | w, X, W) ≈ p(x | ô)

where
  {ô, l̂, Ô, L̂, λ̂} = arg max  p(x | o) p(o | l, λ) p(l | w)
                              × p(X | O) p(O | L, λ) p(λ) p(L | W)

Approximation (2)
Joint → step-by-step maximization [3]:

  Ô = arg max_O p(X | O)              Extract acoustic features
  L̂ = arg max_L p(L | W)              Extract linguistic features
  λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Learn mapping
  l̂ = arg max_l p(l | w)              Predict linguistic features
  ô = arg max_o p(o | l̂, λ̂)           Predict acoustic features
  x̄ ~ f_x(ô) = p(x | ô)               Synthesize waveform

Representations: acoustic, linguistic, mapping
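To make the step-by-step factorization concrete, here is a deliberately tiny, hypothetical Python sketch in which each function corresponds to one arg-max step above (vocoder analysis, text analysis, mapping, prediction, waveform generation). The function names, toy lexicon, and log-energy "acoustic feature" are illustrative stand-ins, not the components of any real system.

```python
# Toy, hypothetical sketch of the step-by-step factorization above: each
# function stands in for one arg-max step. Nothing here is a real system.
import numpy as np

FRAME = 80  # samples per frame (5 ms at 16 kHz)

def extract_acoustic_features(waveform):
    """O-hat = arg max p(X | O): here just per-frame log energy."""
    n = len(waveform) // FRAME
    frames = waveform[:n * FRAME].reshape(n, FRAME)
    return np.log(np.mean(frames ** 2, axis=1) + 1e-8)[:, None]   # (n, 1)

def extract_linguistic_features(text, lexicon):
    """L-hat = arg max p(L | W): map words to phone indices via a lexicon."""
    return [p for w in text.lower().split() for p in lexicon[w]]

def learn_mapping(acoustic, linguistic):
    """lambda-hat = arg max p(O | L, lambda) p(lambda): per-phone mean feature."""
    idx = np.linspace(0, len(linguistic) - 1e-6, len(acoustic)).astype(int)
    return {ph: acoustic[[i for i, j in enumerate(idx) if linguistic[j] == ph]].mean(axis=0)
            for ph in set(linguistic)}

def predict_acoustic_features(linguistic, model, frames_per_phone=20):
    """o-hat = arg max p(o | l-hat, lambda-hat)."""
    return np.vstack([np.tile(model[ph], (frames_per_phone, 1)) for ph in linguistic])

def synthesize_waveform(acoustic, rng):
    """x-bar ~ p(x | o-hat): noise excitation scaled by the predicted energy."""
    gains = np.exp(0.5 * acoustic[:, 0])
    return np.concatenate([g * rng.standard_normal(FRAME) for g in gains])

rng = np.random.default_rng(0)
lexicon = {"hello": [0, 1, 2, 3], "world": [4, 5, 2, 6]}            # toy lexicon
X = rng.standard_normal(16000)                                      # "training" waveform
O = extract_acoustic_features(X)                                    # acoustic features
L = extract_linguistic_features("hello world", lexicon)             # linguistic features
lam = learn_mapping(O, L)                                           # mapping
l_hat = extract_linguistic_features("world hello", lexicon)         # predict linguistic
o_hat = predict_acoustic_features(l_hat, lam)                       # predict acoustic
x_bar = synthesize_waveform(o_hat, rng)                             # synthesize waveform
print(x_bar.shape)
```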

Representation – Linguistic features

"Hello, world."
  Sentence: length, ...
  Phrase: intonation, ...            Hello, | world.
  Word: POS, grammar, ...            hello | world
  Syllable: stress, tone, ...        h-e2 | l-ou1 | w-er1-l-d
  Phone: voicing, manner, ...        h e l ou | w er l d

→ Based on knowledge about spoken language (a small feature-vector sketch follows)
• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules
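As a purely illustrative example of what one element of l looks like, the sketch below builds a per-phone feature vector from a one-hot phone identity plus a few positional and stress counts. Real frontends emit hundreds of such context features; the tiny phone set and chosen features here are assumptions for the example only.

```python
# Illustrative phone-level linguistic feature vector: one-hot phone identity
# plus simple positional counts (stress, position in syllable/word, sizes).
import numpy as np

PHONES = ["h", "e", "l", "ou", "w", "er", "d"]  # toy phone set

def linguistic_vector(phone, stress, pos_in_syl, n_phones_in_syl,
                      pos_in_word, n_syls_in_word):
    one_hot = np.zeros(len(PHONES))
    one_hot[PHONES.index(phone)] = 1.0
    numeric = np.array([stress, pos_in_syl, n_phones_in_syl,
                        pos_in_word, n_syls_in_word], dtype=float)
    return np.concatenate([one_hot, numeric])

# "hello" = syllables h-e2 . l-ou1  ->  feature vector for the phone "l"
v = linguistic_vector("l", stress=1, pos_in_syl=0, n_phones_in_syl=2,
                      pos_in_word=1, n_syls_in_word=2)
print(v.shape, v)
```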

Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o)

[Figure: a vocal source (pulse train when voiced, white noise when unvoiced, driven by the fundamental frequency, plus aperiodicity/voicing) excites a vocal tract filter (cepstrum, LPC, ...), giving x(n) = h(n) * e(n); parameters are extracted frame-by-frame with overlap/shift windowing.]

→ Needs to solve an inverse problem
• Estimate parameters from signals
• Use the estimated parameters (e.g., cepstrum) as acoustic features
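A minimal numpy/scipy sketch of the source-filter model above, assuming an arbitrary all-pole "vocal tract" filter: a pulse-train (voiced) or noise (unvoiced) excitation e(n) is filtered to give x(n) = h(n) * e(n). The filter coefficients and f0 below are illustration values, not estimated from speech.

```python
# Minimal source-filter sketch: excitation e(n) (pulse train if voiced,
# white noise if unvoiced) passed through a toy all-pole vocal-tract filter,
# giving x(n) = h(n) * e(n). Coefficients are arbitrary but stable.
import numpy as np
from scipy.signal import lfilter

fs = 16000          # sampling rate [Hz]
f0 = 120.0          # fundamental frequency [Hz]
n = fs // 2         # 0.5 s of audio
rng = np.random.default_rng(0)

voiced = True
if voiced:
    e = np.zeros(n)
    e[::int(fs / f0)] = 1.0          # pulse train at the fundamental period
else:
    e = rng.standard_normal(n)       # white noise

a = np.array([1.0, -1.3, 0.8, -0.2]) # arbitrary stable all-pole denominator
x = lfilter([1.0], a, e)             # x = h * e
print(x.shape)
```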

Representation – Mapping
Rule-based, formant synthesis [1]

  Ô = arg max_O p(X | O)              Vocoder analysis
  L̂ = arg max_L p(L | W)              Text analysis
  λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Extract rules
  l̂ = arg max_l p(l | w)              Text analysis
  ô = arg max_o p(o | l̂, λ̂)           Apply rules
  x̄ ~ f_x(ô) = p(x | ô)               Vocoder synthesis

→ Hand-crafted rules on knowledge-based features

Representation – Mapping
HMM-based [4], statistical parametric synthesis [5]

  Ô = arg max_O p(X | O)              Vocoder analysis
  L̂ = arg max_L p(L | W)              Text analysis
  λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Train HMMs
  l̂ = arg max_l p(l | w)              Text analysis
  ô = arg max_o p(o | l̂, λ̂)           Parameter generation
  x̄ ~ f_x(ô) = p(x | ô)               Vocoder synthesis

→ Replace the rules by an HMM-based generative acoustic model

Outline → Generative acoustic models for parametric TTS

HMM-based generative acoustic model for TTS
• Context-dependent subword HMMs
• Decision trees to cluster & tie HMM states → interpretable

  p(o | l, λ) = Σ_{∀q} Π_{t=1}^{T} p(o_t | q_t, λ) P(q | l, λ)
              = Σ_{∀q} Π_{t=1}^{T} N(o_t ; μ_{q_t}, Σ_{q_t}) P(q | l, λ)

where q_t is the hidden state at time t (discrete) and o_t is the acoustic feature vector at time t (continuous).
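A minimal sketch of evaluating p(o | l, λ) for one such model: a single left-to-right HMM with diagonal-covariance Gaussian state outputs, where the sum over all state sequences q is computed with the forward algorithm. Topology, parameters, and data are random stand-ins; context clustering and duration modelling are omitted.

```python
# Forward-algorithm sketch: log p(O | lambda) for a left-to-right HMM with
# Gaussian state outputs, i.e. the sum over all state sequences q of
# prod_t N(o_t; mu_{q_t}, Sigma_{q_t}) P(q | l, lambda).
import numpy as np

def log_gauss(o, mu, var):
    """log N(o; mu, diag(var)) for one frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def hmm_log_likelihood(O, log_A, mu, var):
    """O: (T, D) frames; log_A: (S, S) log transition matrix;
    mu, var: (S, D) Gaussian parameters. Returns log p(O | lambda)."""
    T, S = len(O), len(mu)
    log_alpha = np.full(S, -np.inf)
    log_alpha[0] = log_gauss(O[0], mu[0], var[0])          # start in state 0
    for t in range(1, T):
        prev = log_alpha[:, None] + log_A                   # (from, to)
        log_alpha = np.logaddexp.reduce(prev, axis=0)       # sum over previous states
        log_alpha += np.array([log_gauss(O[t], mu[s], var[s]) for s in range(S)])
    return np.logaddexp.reduce(log_alpha)                   # sum over final states

rng = np.random.default_rng(0)
S, D, T = 3, 5, 40
A = np.array([[0.9, 0.1, 0.0],                              # left-to-right topology
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
with np.errstate(divide="ignore"):
    log_A = np.log(A)
mu, var = rng.standard_normal((S, D)), np.ones((S, D))
O = rng.standard_normal((T, D))
print(hmm_log_likelihood(O, log_A, mu, var))
```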

HMM-based generative acoustic model for TTS
• Non-smooth, step-wise statistics → smoothing is essential
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
• Data fragmentation → ineffective, local representation

A lot of research work has been done to address these issues

Alternative acoustic model
• HMM: handles variable length & alignment
• Decision tree: maps linguistic → acoustic

[Figure: a regression tree asks yes/no questions about the linguistic features l and maps them to statistics of the acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features
Replace the tree with a general-purpose regression model → artificial neural network

FFNN-based acoustic model for TTS [6]

Input: frame-level linguistic feature l_t → hidden layer h_t → target: frame-level acoustic feature o_t

  h_t = g(W_hl l_t + b_h)
  ô_t = W_oh h_t + b_o
  λ̂ = arg min_λ Σ_t || o_t - ô_t ||² ,   λ = {W_hl, W_oh, b_h, b_o}
  ô_t ≈ E[o_t | l_t]

→ Replaces the decision trees & Gaussian distributions
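A minimal numpy sketch of this frame-level model: one tanh hidden layer, squared-error loss, plain gradient descent. Sizes, data, and learning rate are toy assumptions; real systems train deeper networks on aligned linguistic/acoustic feature pairs.

```python
# FFNN acoustic model sketch: o_hat_t = W_oh g(W_hl l_t + b_h) + b_o,
# trained to minimize sum_t ||o_t - o_hat_t||^2 with gradient descent.
import numpy as np

rng = np.random.default_rng(0)
T, L_DIM, H_DIM, O_DIM = 200, 30, 64, 10          # toy sizes
l = rng.standard_normal((T, L_DIM))               # frame-level linguistic features
o = rng.standard_normal((T, O_DIM))               # frame-level acoustic targets

W_hl = rng.standard_normal((L_DIM, H_DIM)) * 0.1
W_oh = rng.standard_normal((H_DIM, O_DIM)) * 0.1
b_h, b_o = np.zeros(H_DIM), np.zeros(O_DIM)

lr = 1e-3
for step in range(500):
    h = np.tanh(l @ W_hl + b_h)                   # h_t = g(W_hl l_t + b_h)
    o_hat = h @ W_oh + b_o                        # o_hat_t = W_oh h_t + b_o
    err = o_hat - o                               # gradient of the squared error (up to 2x)
    dW_oh, db_o = h.T @ err, err.sum(axis=0)      # backpropagation
    dh = err @ W_oh.T * (1 - h ** 2)
    dW_hl, db_h = l.T @ dh, dh.sum(axis=0)
    for p, g in [(W_hl, dW_hl), (W_oh, dW_oh), (b_h, db_h), (b_o, db_o)]:
        p -= lr * g / T                           # in-place parameter update
    if step % 100 == 0:
        print(step, float(np.mean(err ** 2)))
```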

RNN-based acoustic model for TTS [7]

Same as above, but with recurrent connections in the hidden layer:

  h_t = g(W_hl l_t + W_hh h_{t-1} + b_h)
  ô_t = W_oh h_t + b_o
  λ̂ = arg min_λ Σ_t || o_t - ô_t ||² ,   λ = {W_hl, W_hh, W_oh, b_h, b_o}

  FFNN: ô_t ≈ E[o_t | l_t]        RNN: ô_t ≈ E[o_t | l_1, ..., l_t]
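The corresponding change in code is small: the hidden state is carried across frames through W_hh, so ô_t can depend on the whole past input sequence l_1, ..., l_t. The sketch below shows only the forward pass (no backpropagation through time); all sizes are toy assumptions.

```python
# Forward pass of the simple RNN acoustic model: the hidden state carries
# information from l_1..l_t, so o_hat_t approximates E[o_t | l_1, ..., l_t].
import numpy as np

rng = np.random.default_rng(0)
T, L_DIM, H_DIM, O_DIM = 200, 30, 64, 10
l = rng.standard_normal((T, L_DIM))

W_hl = rng.standard_normal((L_DIM, H_DIM)) * 0.1
W_hh = rng.standard_normal((H_DIM, H_DIM)) * 0.1   # recurrent connections
W_oh = rng.standard_normal((H_DIM, O_DIM)) * 0.1
b_h, b_o = np.zeros(H_DIM), np.zeros(O_DIM)

h = np.zeros(H_DIM)
o_hat = []
for t in range(T):
    h = np.tanh(l[t] @ W_hl + h @ W_hh + b_h)      # h_t = g(W_hl l_t + W_hh h_{t-1} + b_h)
    o_hat.append(h @ W_oh + b_o)                   # o_hat_t = W_oh h_t + b_o
o_hat = np.stack(o_hat)
print(o_hat.shape)                                 # (T, O_DIM)
```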

NN-based generative acoustic model for TTS
• Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]
• Data fragmentation → distributed representation [10]

The NN-based approach is now mainstream in research & products
• Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN [13]
• Products: e.g., Google [14]

Outline → Beyond parametric TTS

NN-based generative model for TTS

[Figure: the speech production process again, from text (concept) through sound source and frequency transfer characteristics to the waveform.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform

Knowledge-based features → Learned features

Unsupervised feature learning
[Figure: an auto-encoder maps the raw FFT spectrum x(t) to an acoustic feature o(t); an embedding maps the 1-hot word representation w(n) to a linguistic feature l(n).]

• Speech: auto-encoder on FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] → less positive
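A minimal sketch of the auto-encoder idea for learned acoustic features: a narrow bottleneck trained to reconstruct spectra, whose activations are then used as o(t). The "spectra" here are random stand-ins, and the single linear/tanh layer is far simpler than the deep auto-encoder of [9].

```python
# Toy auto-encoder on "spectra": encode to a low-dimensional bottleneck,
# decode to reconstruct the input; the bottleneck then serves as a learned
# acoustic feature in place of hand-designed cepstra.
import numpy as np

rng = np.random.default_rng(0)
N, D, BOTTLENECK = 500, 257, 16                   # frames, FFT bins, feature dim
X = np.abs(rng.standard_normal((N, D)))           # stand-in for FFT magnitude spectra

W_enc = rng.standard_normal((D, BOTTLENECK)) * 0.01
W_dec = rng.standard_normal((BOTTLENECK, D)) * 0.01

lr = 1e-3
for step in range(300):
    Z = np.tanh(X @ W_enc)                        # learned acoustic features o(t)
    X_hat = Z @ W_dec                             # reconstructed spectrum
    err = X_hat - X
    dW_dec = Z.T @ err                            # backpropagation of squared error
    dZ = err @ W_dec.T * (1 - Z ** 2)
    dW_enc = X.T @ dZ
    W_enc -= lr * dW_enc / N
    W_dec -= lr * dW_dec / N

features = np.tanh(X @ W_enc)                     # use these in place of cepstra
print(features.shape)                             # (N, 16)
```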

Relax approximation
Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

  Ô = arg max_O p(X | O)
  λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)
        ↓
  {λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization
• HMM [18, 19, 20]
• NN [21, 22]

Relax approximation
Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]

[Figure: the speech signal s_n is passed through an inverse filter H_u^{-1}(z), parameterized by the cepstrum, and compared against a pulse-train excitation; the resulting error e_n and its derivatives are back-propagated into an LSTM-RNN that predicts the cepstra from the linguistic features l_t, so vocoder analysis and acoustic model are optimized jointly.]

Relax approximation
Direct mapping from linguistic features to waveform

No explicit acoustic features:

  {λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)
        ↓
  λ̂ = arg max_λ p(X | L̂, λ) p(λ)

Generative models for raw audio
• LPC [23]
• WaveNet [24]
• SampleRNN [25]

WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals, with x = {x_0, x_1, ..., x_{N-1}} the raw waveform:

  p(x | λ) = p(x_0, x_1, ..., x_{N-1} | λ) = Π_{n=0}^{N-1} p(x_n | x_0, ..., x_{n-1}, λ)

WaveNet [24] → p(x_n | x_0, ..., x_{n-1}, λ) is modeled by a convolutional NN

Key components
• Causal dilated convolution: capture long-term dependency
• Gated convolution + residual + skip connections: powerful non-linearity
• Softmax at output: classification rather than regression

WaveNet – Causal dilated convolution

100 ms at 16 kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM

Dilated convolution: the receptive field grows exponentially with the number of layers
[Figure: stacked causal convolutions (input, hidden layers 1-3, output) with increasing dilation, so the output p(x_n | x_{n-1}, ..., x_{n-16}) sees the 16 most recent samples.]
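A minimal numpy sketch of a causal dilated convolution stack with filter width 2 and dilations 1, 2, 4, 8, matching the figure: the receptive field reaches 16 past samples after four layers. Gating, residual, and skip paths are omitted, and the filter weights are random.

```python
# Causal dilated convolution stack (filter width 2, dilations 1, 2, 4, 8).
# Each output depends only on current & past inputs; the receptive field
# grows by the dilation of each layer: 1 + (1+2+4+8) = 16 samples.
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[n] = w[0] * x[n - dilation] + w[1] * x[n], zero-padded on the left."""
    x_shift = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * x_shift + w[1] * x

rng = np.random.default_rng(0)
x = rng.standard_normal(1600)                     # ~100 ms at 16 kHz
h = x
receptive_field = 1
for dilation in (1, 2, 4, 8):
    w = rng.standard_normal(2)
    h = np.tanh(causal_dilated_conv(h, w, dilation))
    receptive_field += dilation
print(h.shape, receptive_field)                   # (1600,) 16
```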

WaveNet – Non-linearity

[Figure: a stack of 30 residual blocks with skip connections. Each block applies a 2x1 dilated convolution, a gated activation, and a 1x1 convolution (512 → 256 channels), adds the result back to its input (residual connection), and sends a copy to the skip path; the summed skip outputs pass through ReLU → 1x1 conv → ReLU → 1x1 conv → 256-way softmax to give p(x_n | x_0, ..., x_{n-1}).]

WaveNet – Softmax

[Figure: an analog audio signal (amplitude over time) is sampled & quantized; each sample becomes a category index (1-16 in the illustration), so the model's output over the next sample is a categorical distribution (a histogram), which can be unimodal, multimodal, skewed, ...]
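The figure shows only 16 categories for readability; the WaveNet paper [24] quantizes each sample into 256 categories after mu-law companding, and those 256 classes are what the output softmax predicts. A small sketch of that encoding and decoding:

```python
# Mu-law companding + quantization: map each sample in [-1, 1] to one of
# 256 category indices (the classes of the output softmax), and back.
import numpy as np

MU = 255

def mu_law_encode(x):
    """x in [-1, 1] -> integer category index in {0, ..., 255}."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)   # compand to [-1, 1]
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)           # quantize

def mu_law_decode(idx):
    """Category index -> approximate sample value in [-1, 1]."""
    y = 2 * (idx.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.sin(np.linspace(0, 20 * np.pi, 1000)) * 0.5             # toy waveform
idx = mu_law_encode(x)
x_rec = mu_law_decode(idx)
print(idx.min(), idx.max(), float(np.max(np.abs(x - x_rec))))
```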

WaveNet – Conditional modelling

[Figure: linguistic features l are embedded (h_n at time n) and added through 1x1 convolutions to the gated activation inside every residual block, so the network models p(x_n | x_0, ..., x_{n-1}, l).]
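A minimal sketch of the conditioned gated activation inside one residual block, z = tanh(W_f x + V_f h) * sigmoid(W_g x + V_g h), following the form described in the WaveNet paper [24]. The weights, sizes, and inputs below are random stand-ins, and the surrounding dilated convolution and skip path are omitted.

```python
# Conditioned gated activation inside a residual block:
#   z_n = tanh(W_f x_n + V_f h_n) * sigmoid(W_g x_n + V_g h_n)
# where x_n is the dilated-convolution output and h_n the linguistic
# embedding at time n (added via 1x1 convolutions).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
T, CH, COND = 100, 32, 20                       # time steps, channels, conditioning dim
x = rng.standard_normal((T, CH))                # output of the 2x1 dilated conv
h = rng.standard_normal((T, COND))              # per-step linguistic embedding h_n

W_f, W_g = rng.standard_normal((CH, CH)), rng.standard_normal((CH, CH))
V_f, V_g = rng.standard_normal((COND, CH)), rng.standard_normal((COND, CH))

z = np.tanh(x @ W_f + h @ V_f) * sigmoid(x @ W_g + h @ V_g)   # gated activation
W_res = 0.1 * rng.standard_normal((CH, CH))
residual_out = x + z @ W_res                    # 1x1 conv then residual connection
print(z.shape, residual_out.shape)
```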

WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27]
• Stationary process with a fixed-length analysis window → model estimated within a short window with a small shift
• Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
• Gaussian process → assumes speech signals are normally distributed

WaveNet
• Sample-by-sample, non-linear, can take additional (conditioning) inputs
• Arbitrary-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24]
[Figure: mean opinion scores increasing from HMM and LSTM parametric systems through concatenative to WaveNet.]

Relax approximation
Towards Bayesian end-to-end TTS

Integrated end-to-end:

  L̂ = arg max_L p(L | W)
  λ̂ = arg max_λ p(X | L̂, λ) p(λ)
        ↓
  λ̂ = arg max_λ p(X | W, λ) p(λ)

→ Text analysis is integrated into the model

Relax approximation
Towards Bayesian end-to-end TTS

Bayesian end-to-end:

  λ̂ = arg max_λ p(X | W, λ) p(λ)
  x̄ ~ f_x(w, λ̂) = p(x | w, λ̂)
        ↓
  x̄ ~ f_x(w, X, W) = p(x | w, X, W)
                    = ∫ p(x | w, λ) p(λ | X, W) dλ
                    ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂_k)    (ensemble)

→ Marginalize model parameters & architecture
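A minimal sketch of the ensemble approximation in the last line: K models, stood in for here by toy Gaussian predictive densities with different parameter draws λ_k, are averaged to approximate p(x | w, X, W), and sampling x̄ amounts to picking one model and sampling from it. Everything below is synthetic illustration, not a trained system.

```python
# Ensemble approximation to the Bayesian predictive distribution:
#   p(x | w, X, W) ~= (1/K) * sum_k p(x | w, lambda_k)
# Each lambda_k is a toy Gaussian predictive density standing in for an
# independently trained model (or posterior draw of the parameters).
import numpy as np

rng = np.random.default_rng(0)
K = 10
means = 0.2 + 0.05 * rng.standard_normal(K)      # lambda_k: predicted mean
stds = 0.1 + 0.02 * rng.random(K)                # lambda_k: predicted std

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_grid = np.linspace(-0.5, 1.0, 300)
p_ensemble = np.mean([gauss_pdf(x_grid, m, s) for m, s in zip(means, stds)], axis=0)

# Sampling x_bar: pick a model k uniformly, then sample from p(x | w, lambda_k).
k = rng.integers(K)
x_bar = rng.normal(means[k], stds[k])
print(float(np.sum(p_ensemble) * (x_grid[1] - x_grid[0])), x_bar)   # ~1.0, one sample
```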

Outline → Conclusion & future topics

Generative model-based text-to-speech synthesis
• Bayes formulation + factorization + approximations
• Representation: acoustic features, linguistic features, mapping
  Mapping: rules → HMM → NN
  Features: engineered → unsupervised, learned
• Fewer approximations
  Joint training, direct waveform modelling
  Moving towards integrated & Bayesian end-to-end TTS

Naturalness: concatenative ≤ generative
Flexibility: concatenative ≪ generative (e.g., multiple speakers)

Beyond “text”-to-speech synthesis

TTS on conversational assistants
• The text alone isn't fully self-contained
• More context is needed
  Location: to resolve homographs
  User query: to put the right emphasis

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models

Beyond “generative” TTS

Generative model-based TTS
• The model represents the process behind speech production
  Trained to minimize the error against human-produced speech
  Learned model → speaker
• Speech is for communication
  Goal: maximize the amount of information received by the listener
  The "listener" is missing → include a "listener" in training / in the model itself?

Thanks!


References

[1] D. Klatt. Real-time speech synthesis by rule. Journal of ASA, 68(S1), 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373-376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. Invited talk given at ASRU 2011. https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II, 2000. (in Japanese)
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039-1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962-7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964-1968, 2014.
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. Invited talk given at ASRU 2015. http://research.google.com/pubs/pub44630.html
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representations. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D, 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW7, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ. (in Japanese)

[Figure: graphical models summarizing the successive relaxations: (1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization (e.g., statistical parametric TTS); (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]