Introduction to Speech Recognition
Paul R. Dixon, Yandex Zurich
[email protected]
Overview of a Statistical Speech Recognition System

[Diagram: spoken utterance → feature extraction → search → written transcription, with acoustic, pronunciation, and language models estimated from training data feeding the search.]

• Statistical models built from large data and expert knowledge
  • Language independent: Japanese, Arabic, …
• Feature extraction: signal processing to extract salient features from the speech
• Search: given the models and the audio input, generate a text transcription
  • Voice search, dictation, transcription of meetings, talks and lectures, audio indexing and keyword spotting
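As a toy illustration of the feature-extraction step, the sketch below applies pre-emphasis and slices a waveform into overlapping 25 ms frames every 10 ms, computing only log-energy per frame (real front ends go on to compute e.g. MFCC or log-mel filterbank features; all parameter values here are common defaults, not taken from the slides):

```python
import math

# Toy front end: pre-emphasis, then overlapping frames with log-energy.
def frames_log_energy(signal, sample_rate=16000,
                      frame_ms=25, shift_ms=10, preemph=0.97):
    # Pre-emphasis boosts high frequencies: s'[t] = s[t] - 0.97 * s[t-1]
    s = [signal[0]] + [signal[t] - preemph * signal[t - 1]
                       for t in range(1, len(signal))]
    flen = sample_rate * frame_ms // 1000     # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
    feats = []
    for start in range(0, len(s) - flen + 1, shift):
        frame = s[start:start + flen]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats

# 0.1 s of a synthetic 440 Hz tone at 16 kHz -> 8 frames
sig = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
print(len(frames_log_energy(sig)))  # 8
```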
Speech Decoding Problem

[Diagram: the decoding cascade. Acoustic observations O pass through the context-dependent phone models HC (~10^5 Gaussians), the lexicon L (phones), and the grammar G (~10^6 words).]

Language model (N-gram):

    P(Wn | Wn−N+1, …, Wn−1) = Count(Wn−N+1, …, Wn) / Count(Wn−N+1, …, Wn−1)

    e.g. P(W3 | W1, W2) = Count(W1, W2, W3) / Count(W1, W2)

Acoustic score (from a softmax output layer with I units and activations ai):

    P(Ci | O) = e^{ai} / Σ_{j=1..I} e^{aj},    P(O | Ci) = P(Ci | O) / P(Ci)

Best word sequence from the search graph:

    W* = Best(O ∘ H ∘ C ∘ L ∘ G)
Evaluation of Speech Recognition

Hyp:  A    A    Dog  Ate       Homework
Ref:  The       Dog  Ate  My   Homework
Err:  Sub  Ins            Del
(Example source: Kingsbury 2009)

Error types: substitution, deletion, insertion. Find the best alignment with edit (Levenshtein) distance.

    Word Error Rate (WER) = 100 × (number of errors) / (number of reference words)

Other metrics:
• Real Time Factor (RTF): time taken / length of speech
• Memory usage: memory taken to store the models and the search
• Latency, Sentence Error Rate

Results always depend on the task: domain, vocabulary, language model size, …
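The WER computation above can be sketched with a standard edit-distance dynamic program (minimal version: it returns the error rate but not the alignment itself):

```python
# Minimal WER via Levenshtein distance over words (illustrative sketch).
def wer(ref, hyp):
    """Word Error Rate: 100 * (Sub + Del + Ins) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

# The slide's example: 1 Sub + 1 Ins + 1 Del over 5 reference words
print(wer("the dog ate my homework", "a a dog ate homework"))  # 60.0
```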
Performance of Speech Recognition

[Figure: NIST STT Benchmark Test History, May 2009. WER (%) on a log scale against years 1988–2011, with curves for read speech (5k, 20k, noisy, varied microphones), air travel planning kiosk speech, broadcast speech (News English 1X/10X/Unlimited, News Mandarin/Arabic 10X), conversational telephone speech (Switchboard, Switchboard II, CTS Fisher (UL), CTS Arabic (UL), CTS Mandarin (UL)), and meeting speech (Meeting—SDMOV4, Meeting—MDMOV4, Meeting—IHM); each curve represents a different task, and a band marks the range of human error in transcription. Caption: after [19], a summarization of historical performance progress on major ASR benchmark tests conducted by NIST. Image source: He & Deng 2011.]

Observations:
• The benchmark history stops before deep learning became popular
• Mismatches between training and test conditions cause huge problems
• As tasks are solved, the community moves on to more difficult tasks
Results: In Practice, Tune for a Trade-off Between WER/RTF

[Figure: Moses benchmarks (8 threads) comparing language model data structures — probing, trie, chop, Rand backoff, and SRI — with each curve representing a different system, plotting WER against RTF and time/perplexity against memory (GB). Image source: Heafield, "KenLM: Faster and Smaller Language Model Queries".]
Fundamental Equation of Speech Recognition

    W* = argmax_W P(W | O)
       = argmax_W P(O | W) P(W) / P(O)      (Bayes' rule)
       = argmax_W P(O | W) P(W)             (P(O) is constant)

To avoid underflow, work with log probabilities:

    W* = argmax_W { log P(O | W) + log P(W) }

In practice a language model weight α (a "fudge factor") is added:

    W* ≈ argmax_W { log P(O | W) + α log P(W) }

Where:
• W is some word sequence
• W* is the best word sequence
• O is a sequence of acoustic vectors
• α is the language model weight

Challenges:
1. Efficiently build accurate models
2. Efficiently decode speech
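As a toy illustration of the weighted log-domain combination, the snippet below scores two candidate transcriptions; the scores and the weight α = 10 are invented for the example, not taken from the slides:

```python
import math

# Combine acoustic and language model scores in the log domain, as in
# W* = argmax { log P(O|W) + alpha * log P(W) }. All numbers invented.
def combined_score(log_p_acoustic, log_p_lm, alpha=10.0):
    return log_p_acoustic + alpha * log_p_lm

hyps = {
    "recognize speech":   {"ac": math.log(1e-40), "lm": math.log(1e-3)},
    "wreck a nice beach": {"ac": math.log(2e-40), "lm": math.log(1e-6)},
}
best = max(hyps, key=lambda w: combined_score(hyps[w]["ac"], hyps[w]["lm"]))
print(best)  # recognize speech
```

Even though the second hypothesis has a slightly better acoustic score here, the language model weight pushes the decision toward the far more probable word sequence.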
Formulate Recognition as a Transduction
• Weighted Finite State Transducers [Pereira 1994, Mohri 2009] give a formal framework for expressing the models, providing operations:
  • Composition: combine levels of information
  • Determinization: optimize for speed
• Models are learned from data
  • N-grams, Gaussian mixture models, (hidden) Markov models, neural networks
[Hierarchy sidebar: Sentence → Word → Phone → Acoustic]
Weighted Finite State Transducers
• Generalization of a finite state machine
• Pioneered at AT&T in the mid 90s
• Weighted mapping between strings
• Unified framework for representation and optimization
• See OpenFst - http://openfst.org

[Figure: an example WFST, with an input label, output label, and weight marked on an arc, plus the initial state and final states.]
Language Model
• Disambiguates acoustically similar words
  • "Recognize speech" vs "Wreck a nice beach"
• Most popular model is a simple N-gram
  • Easy to train and scales well
• Used in many tasks: POS tagging, machine translation, handwriting recognition, text prediction, sentiment analysis
Language Model
• Want to compute
    P(W1, W2, W3, …, Wn)
• Apply the chain rule
    P(W1, W2, W3, …, Wn) = P(W1) P(W2|W1) P(W3|W1, W2) … P(Wn|W1 … Wn−1)
• Approximate with a truncated history, e.g. a bigram
    P(W1, W2, …, Wn) ≈ P(W1) P(W2|W1) P(W3|W2) P(W4|W3) … P(Wn|Wn−1)

  Unigram: P(W1)   Bigram: P(W2|W1)   Trigram: P(W3|W1, W2)

N-gram Language Model
• Estimate parameters using maximum likelihood
    P(Wn | Wn−N+1, …, Wn−1) = Count(Wn−N+1, …, Wn) / Count(Wn−N+1, …, Wn−1)
  e.g.
    P(W2 | W1) = Count(W1, W2) / Count(W1)
    P(W3 | W1, W2) = Count(W1, W2, W3) / Count(W1, W2)
• (In practice also use back-off, smoothing, …)

[Figure: a small language model over the symbols A and B encoded as a weighted FST, with arcs labelled symbol/weight, e.g. A/0.6799 and B/0.75869, and negative-log weights.]
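The maximum-likelihood estimate above can be sketched directly from counts; this toy ignores smoothing, back-off, and sentence boundaries:

```python
from collections import Counter

# Maximum-likelihood bigram estimation: P(W2|W1) = Count(W1, W2) / Count(W1).
corpus = "the dog ate the homework the dog slept".split()

unigrams = Counter(corpus)                      # Count(W1)
bigrams = Counter(zip(corpus, corpus[1:]))      # Count(W1, W2)

def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

# "the" occurs 3 times, "the dog" occurs twice -> 2/3
print(p_bigram("the", "dog"))
```

Any bigram unseen in the corpus gets probability zero here, which is exactly why real systems add smoothing and back-off.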
Pronunciation Model
• Data sparsity means many words will never be seen, or will hardly be seen
• Build words from smaller phoneme units
• Can recognize words not seen in training
• One of the few parts of the system that uses expert knowledge

[Figure: a lexicon transducer. Each arc carries a phoneme input label and a word output label; the word is emitted on the first arc and the remaining arcs output epsilon, e.g. THE = DH AH or DH IY, DOOR = D AO R, DOG = D AO G. Epsilon/null transitions don't consume symbols.]

[Hierarchy sidebar: Sentence → Word → Phone → Acoustic]
Combining the Language Model and Lexicon

[Figure: the lexicon transducer (THE = DH AH | DH IY, DOOR = D AO R, DOG = D AO G) alongside a small grammar accepting THE followed by DOOR or DOG.]

• Composition

[Figure: the composed transducer L ∘ G, pairing each phoneme path with its word sequence.]

• Determinization

[Figure: the determinized transducer. Common prefixes are shared (a single DH:THE arc and a single D:ε arc), and the word outputs DOG/DOOR are delayed until the disambiguating phoneme (G vs R).]
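The composition step can be sketched in pure Python on unweighted transducers represented as arc dictionaries. This toy uses a naive epsilon rule (when the first machine outputs epsilon, the second machine stays put); OpenFst handles the general case correctly with an epsilon filter, and the tiny lexicon/grammar below are invented:

```python
EPS = ""  # epsilon label

def compose(A, B, start_a, start_b, finals_a, finals_b):
    """Compose transducers A and B: A's output labels must match B's
    input labels. Arcs are dicts: state -> [(in, out, next_state)]."""
    start = (start_a, start_b)
    arcs, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        sa, sb = stack.pop()
        out = arcs.setdefault((sa, sb), [])
        if sa in finals_a and sb in finals_b:
            finals.add((sa, sb))
        for (i, o, na) in A.get(sa, []):
            if o == EPS:                         # A advances alone
                moves = [(i, EPS, (na, sb))]
            else:                                # match o against B's input
                moves = [(i, o2, (na, nb))
                         for (i2, o2, nb) in B.get(sb, []) if i2 == o]
            for arc in moves:
                out.append(arc)
                if arc[2] not in seen:
                    seen.add(arc[2])
                    stack.append(arc[2])
    return arcs, start, finals

# Toy lexicon L: THE = DH AH (word output on the first arc), and a
# grammar G accepting the single word THE.
L = {0: [("DH", "THE", 1)], 1: [("AH", EPS, 2)]}
G = {0: [("THE", "THE", 1)]}
arcs, start, finals = compose(L, G, 0, 0, {2}, {1})
print(sorted(finals))  # [(2, 1)] : the phoneme path DH AH maps to THE
```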
Phone Model
• Hidden Markov models (HMMs) are the building block of modern ASR systems
• Typically a three-state hidden Markov model
  • Enforces a minimum phone duration
• Trained with the expectation maximization algorithm
• In practice build context-dependent units

[Figure: a three-state left-to-right HMM. Each arc is labelled with a PDF index (or DNN output unit) and a transition probability as a −log weight, e.g. 2322/0.33423.]
Acoustic Modeling with Gaussian Mixture Models

The component densities are combined to give the overall probability of the vector under the model:

    p(x | Θ) = Σ_k w_k p_k(x | θ_k)                                  (2.1)

where x is a data vector, the mixing parameters w_k are constrained so that 0 ≤ w_k ≤ 1 and Σ_k w_k = 1, θ_k represents the parameters of the k-th Gaussian, and Θ denotes all of the parameters in the model. Each component is a d-dimensional multivariate Gaussian density:

    p_k(x | θ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2) (x − μ_k)^T Σ_k^{−1} (x − μ_k) )    (2.2)

where μ_k and Σ_k are the mean and covariance. Sampling from a GMM is simple: first a Gaussian component is chosen according to the prior probabilities w = (w1, w2, …, wk), then a sample is drawn from the selected Gaussian distribution.

Training a Gaussian Mixture Model: given training data X = {x1, x2, …, xN} drawn from an Independent and Identically Distributed (IID) probability distribution, learning involves adjusting the free parameters Θ = (w, μ, Σ) to find the maximum likelihood solution.

• Train from data with the expectation maximization algorithm
• Speaker adaptation
• Training can be parallelized, and bootstrapping is straightforward
• Scales with data
• Speed-up techniques
• Implementation in http://kaldi.sf.net
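Equations (2.1)-(2.2) can be sketched in a few lines. This toy restricts to diagonal covariances to keep the density simple and uses the log-sum-exp trick for numerical stability; all parameter values are invented:

```python
import math

# Log-density of a diagonal-covariance GMM (sketch of eqs. 2.1-2.2).
def log_gauss_diag(x, mu, var):
    """log N(x; mu, diag(var)) for a d-dimensional point."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def gmm_log_density(x, weights, means, variances):
    """log p(x|Theta) = log sum_k w_k N(x; mu_k, Sigma_k), done stably."""
    comps = [math.log(w) + log_gauss_diag(x, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    m = max(comps)                       # log-sum-exp trick
    return m + math.log(sum(math.exp(c - m) for c in comps))

# Two-component mixture over 2-d vectors (toy parameters).
w = [0.4, 0.6]
mu = [[0.0, 0.0], [3.0, 3.0]]
var = [[1.0, 1.0], [1.0, 1.0]]
print(gmm_log_density([0.1, -0.2], w, mu, var))
```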
CD-DNN-HMM

[Figure: a deep neural network with input layer I1 … In (a window of acoustic feature frames), hidden layers H1 … Hn, and output layer O1 … On, feeding state likelihoods into a three-state HMM.]

• The last layer is a softmax with one unit for each HMM state:
    P(Ci | O) = e^{ai} / Σ_{j=1..I} e^{aj}
• Normalize by the state priors to give HMM state likelihoods:
    P(O | Ci) = P(Ci | O) / P(Ci)
• Trained with stochastic gradient descent
• Input is a window of feature frames
• Bootstrapped from a GMM system
• Implementation in http://kaldi.sf.net
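The posterior-to-likelihood conversion above can be sketched as follows; the activations and state priors are invented for the example:

```python
import math

# Convert DNN output activations into HMM state (pseudo-)likelihoods:
# softmax gives P(Ci|O), then P(O|Ci) = P(Ci|O) / P(Ci).
def softmax(a):
    m = max(a)                           # subtract max for stability
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def state_log_likelihoods(activations, priors):
    post = softmax(activations)
    return [math.log(p) - math.log(pr) for p, pr in zip(post, priors)]

acts = [2.0, 0.5, -1.0]          # one output unit per HMM state
priors = [0.5, 0.3, 0.2]         # state priors from the training alignment
print(state_log_likelihoods(acts, priors))
```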
Performance of DNNs on the Switchboard 300 hrs task (WER %)

Year   One-pass          Multi-pass / combination   Details
       GMM     DNN       GMM     DNN
2011   23.6    16.1      17.1    -      (Seide 2011)
2012   18.9    13.3      15.1    -      (Kingsbury 2012). DNN sequence training
2013   18.6    12.6      -       -      (Vesely 2013). DNN sequence training
2014   14.5    11.5      -       10.7   (Sainath 2014). Convolutional neural network

(Disclaimer: potential differences in the experimental setups of the systems)
Search Algorithm
• Shortest path through a huge graph
• Dynamic programming: the Viterbi algorithm
• Beam search
  • At any time, only keep the states that are within a given beam of the best scoring state
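A minimal sketch of time-synchronous Viterbi beam search follows; the graph, emission costs, and beam width are invented, and all scores are negative log probabilities (lower is better):

```python
import math

# Time-synchronous Viterbi beam search over a small arc-list graph.
def beam_search(arcs, emit, start, finals, T, beam=5.0):
    """arcs: state -> [(next_state, label, weight)];
    emit(t, label): acoustic cost of 'label' at frame t."""
    frontier = {start: 0.0}                      # state -> best cost so far
    for t in range(T):
        nxt = {}
        for s, cost in frontier.items():
            for ns, label, w in arcs.get(s, []):
                c = cost + w + emit(t, label)
                if c < nxt.get(ns, math.inf):    # Viterbi: keep the best
                    nxt[ns] = c
        best = min(nxt.values())
        # Beam pruning: drop states worse than best + beam
        frontier = {s: c for s, c in nxt.items() if c <= best + beam}
    return min((c for s, c in frontier.items() if s in finals),
               default=math.inf)

# Toy graph: a good path 0 -a-> 1 -b-> 2 and a worse path 0 -a-> 3 -b-> 2.
arcs = {0: [(1, "a", 0.1), (3, "a", 2.0)],
        1: [(2, "b", 0.1)],
        3: [(2, "b", 0.1)]}
cost = beam_search(arcs, lambda t, l: 0.5, start=0, finals={2}, T=2)
print(round(cost, 2))  # 1.2
```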
Open Source Resources (not exhaustive)
Kaldi - http://kaldi.sourceforge.net
OpenFst - http://openfst.org
OpenGrm - http://opengrm.org
Phonetisaurus - https://code.google.com/p/phonetisaurus/
Voxforge - http://voxforge.org/
TED-LIUM corpus - http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
CMU dictionary - http://www.speech.cs.cmu.edu/cgi-bin/cmudict
1 Billion word benchmark - https://code.google.com/p/1-billion-word-language-modeling-benchmark/
Thanks for listening. Questions?
References
• Xiaodong He and Li Deng, "Speech Recognition, Machine Translation, and Speech Translation – A Unified Discriminative Learning Paradigm," IEEE Signal Processing Magazine, September 2011
• Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 2nd edition, 2009
• Fernando Pereira, Michael Riley, and Richard W. Sproat, "Weighted Rational Transductions and their Application to Human Language Processing," Human Language Technology Workshop, pp. 262-267
• Mehryar Mohri, "Weighted Automata Algorithms," 2009
• Frank Seide, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011
• B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization," Proc. INTERSPEECH, September 2012
• K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-Discriminative Training of Deep Neural Networks," Interspeech 2013
• Tara N. Sainath, "Improvements to Deep Neural Networks for Large Vocabulary Continuous Speech Recognition Tasks," 2014, http://wissap.iiit.ac.in/proceedings/TSai_L8.pdf
ASR and Speech Processing
• Automatic Speech Recognition (ASR)
  • Given an audio input, generate a text transcription
  • Applications: voice search, dictation, transcription of meetings, talks and lectures, audio indexing and keyword spotting
• Speech synthesis (text to speech)
  • Voice conversion, singing synthesis
• Speaker recognition
  • Speaker verification, speaker identification
  • Related tasks: gender or language recognition
• Spoken dialogue/understanding systems, speech enhancement
• Machine translation
What Makes Speech Recognition Hard?

Speaker
• Age
• Accent/dialect/non-native
• Speaking rate
• Co-articulation
• Style
• Physical differences
• Language

Environment
• Background noise
• Other speakers
• Reverberation
• Microphone placement

Audio quality
• Device
• Channel characteristics
• Volume
• Knowledge of the task

Changes and variations in conditions will often cause a huge impact on recognition performance.
The Noisy Channel Model

[Figure: the noisy channel model. Image source: Jurafsky & Martin 2009]

• The acoustic realization is a corrupted version of the original text
  • It has passed through a noisy channel
• Build a channel model so we can recover the original
  • Search through the space of all possible sentences and pick the one that is most probable given the waveform
• Statistical speech processing: CMU Harpy and the IBM systems of the 1970s
• The same framework applies to many problems, including OCR and machine translation