Discriminative Training of Linear Transformations and Mixture Density Splitting for Speech Recognition
Von der Fakultät für Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigte Dissertation
vorgelegt von M. Sc. Muhammad Ali Tahir aus Rawalpindi, Pakistan
Berichter: Professor Dr.-Ing. Hermann Ney
Professor Dr.-Ing. Reinhold Häb-Umbach

Tag der mündlichen Prüfung: 27. November 2015

Diese Dissertation ist auf den Internetseiten der Hochschulbibliothek online verfügbar.
Acknowledgments

I would like to thank Prof. Dr.-Ing. Hermann Ney for providing me with the opportunity to carry out research at the Chair of Computer Science 6 of RWTH Aachen University, where I have been since June 2007. I would also like to thank Prof. Dr.-Ing. Reinhold Häb-Umbach for accepting to co-supervise this work. My special thanks go to my supervisor Dr. rer. nat. Ralf Schlüter for providing help and support throughout the thesis, without whom it would not have been possible. I would also like to thank Georg Heigold, Christian Plahl, Dr. Volker Steinbiß, Simon Wiesler, Heyun Huang, Markus Nußbaum-Thom, Mahaboob Ali Basha Shaik, Zoltán Tüske, Pavel Golik and Albert Zeyer for their guidance and help, and Gisela Gillmann, Katja Bäcker, Andrea Kierdorf, Stephanie Jansen and Dhenya Schwarz for their help in administrative matters. Last but not least, my thanks go to my wife, daughter, parents and sisters, who made these studies possible and gave me their love and support.

This work was partly realized as part of the Quaero Programme, funded by OSEO, the French State agency for innovation. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 287658 (EU-Bridge) and under grant agreement No. 287755 (transLectures).
Abstract

Discriminative training has been established as an effective technique for training the acoustic model of an automatic speech recognition system. It reduces the word error rate compared to standard maximum likelihood training. This thesis concerns itself with training reduced-rank linear transformations in order to reduce the number of parameters. Conversely, it is also investigated whether a small robust model can be split into a larger model, providing a robust initialization. Previous work has shown the usefulness of discriminatively trained log-linear acoustic models, which have been shown to cover Gaussian single density and mixture models. Log-linear training, i.e. convex optimization, can also be used to train linear feature transformations. The main focus of this thesis is to explore log-linear training of such feature transformations. This has been done for direct feature transformations, for high-dimensional polynomial features and for multilayer log-linear training. Multilayer log-linear transformation training makes the use of polynomial features of higher order computationally feasible. Another important aspect of this thesis is the discriminative splitting of log-linear models into log-linear mixture models with hidden variables. Thus the acoustic mixture models are both trained and split during discriminative training, instead of using splitting information from a previous maximum likelihood training step. Recently, the speech recognition community has shifted its focus to deep neural networks for acoustic modelling. We therefore also apply the discriminative splitting concept to deep neural networks and achieve encouraging results.
Zusammenfassung

Diskriminatives Training hat sich als ein effektives Verfahren für das Trainieren des akustischen Modells eines automatischen Spracherkennungssystems etabliert. Es reduziert die Wortfehlerrate im Gegensatz zum Standard-Maximum-Likelihood-Training. Diese Arbeit beschäftigt sich mit dem Trainieren von rangreduzierten linearen Transformationen zur Verringerung der Anzahl von Parametern. Umgekehrt wird ebenfalls untersucht, ein kleines robustes Modell in ein größeres Modell aufzuteilen, zwecks Regularisierung. Frühere Arbeiten hatten die Nützlichkeit von diskriminativ trainierten log-linearen akustischen Modellen gezeigt. Es wurde gezeigt, dass log-lineare Modelle eine Verallgemeinerung von Gaußschen Modellen mit Einzel- und Mischverteilungen darstellen. Log-lineares Training bzw. konvexe Optimierung kann auch zum Trainieren einer linearen Transformation verwendet werden. Die Hauptaufgabe dieser Arbeit ist, das log-lineare Training dieser Transformationen zu erforschen. Dies wurde für die direkte Feature-Transformation, für hochdimensionale polynomielle Features sowie für mehrschichtiges log-lineares Training untersucht. Log-lineares Training ermöglicht das Trainieren von hochdimensionalen polynomiellen Features von zweiter oder höherer Ordnung. Ein weiterer wichtiger Aspekt dieser Arbeit ist die diskriminative Aufteilung der log-linearen Modelle in log-lineare Mischverteilungen bzw. Modelle mit verborgenen Variablen. So werden die Mischverteilungs-basierten akustischen Modelle während des diskriminativen Trainings trainiert sowie aufgeteilt, anstelle der Verwendung von Aufteilungsinformationen aus einem früheren Maximum-Likelihood-Trainingsschritt. Vor kurzem hat die Spracherkennungsgemeinschaft ihren Fokus auf tiefe neuronale Netze für die akustische Modellierung gesetzt. Deswegen wandten wir dieses Aufteilungskonzept auch auf tiefe neuronale Netze an und erzielten vielversprechende Ergebnisse.
Contents

1 Introduction
  1.1 An overview of Statistical Speech Recognition
    1.1.1 Feature Extraction
    1.1.2 Acoustic Model
      1.1.2.1 HMM Topologies
      1.1.2.2 n-Phone State-tying
      1.1.2.3 Gaussian Mixture Model
    1.1.3 Language Model
    1.1.4 Global Search
    1.1.5 Acoustic Model Training
      1.1.5.1 Maximum Likelihood Training
      1.1.5.2 Optimality of Maximum Likelihood
    1.1.6 Speaker Adaptation
  1.2 An Overview of Discriminative Acoustic Model Training
    1.2.1 Training Criteria and Optimization
    1.2.2 Discriminative Training of Feature Transformations
    1.2.3 Direct Models
    1.2.4 Multilayer Perceptrons
2 Scientific Goals
3 Discriminative Training
  3.1 The Discriminative Paradigm
    3.1.1 Maximum Likelihood versus Discriminative Training for two-class problem
  3.2 Frame-level Discriminative Training
  3.3 Sentence-level Discriminative Training Criteria
    3.3.1 Maximum Mutual Information
    3.3.2 Minimum Classification Error
    3.3.3 Minimum Word Error and Minimum Phone Error
  3.4 Unified forms of Discriminative Training Criteria
  3.5 Parameter Optimization
    3.5.1 Extended Baum-Welch Algorithm
  3.6 Weighted Finite State Transducer Framework
  3.7 Word Lattices
4 Log-Linear Acoustic Models
  4.1 Introduction
    4.1.1 Generalized Iterative Scaling Algorithm (GIS)
    4.1.2 Improved Iterative Scaling (IIS)
    4.1.3 RPROP
  4.2 Log-Linear Acoustic Modelling
    4.2.1 Log-Linear Models
      4.2.1.1 Conversion between log-linear and Gaussian acoustic model
    4.2.2 Log-linear Mixture Model
      4.2.2.1 Conversion between log-linear mixture model and Gaussian mixture model
    4.2.3 Integration of SAT MLLR and CMLLR
  4.3 Log-linear Training
    4.3.1 Frame-level Discriminative Training
    4.3.2 Sentence-level Discriminative Training
  4.4 Experiments and Results
  4.5 Conclusion
5 Training of Linear Feature Transformations
  5.1 Generative Techniques
    5.1.1 Linear Discriminant Analysis (LDA)
    5.1.2 Heteroscedastic discriminant analysis (HDA)
  5.2 Discriminative Techniques
  5.3 Log-linear Discriminative Training
    5.3.1 Training on State Level
    5.3.2 Training on Sentence Level
  5.4 Direct Transformation of Input Features
  5.5 Dimension-reduced Higher-order Polynomial Features
    5.5.1 Initialization of Log-Linear Training
    5.5.2 Experiments and Results
    5.5.3 Effect of Additional Parameters
    5.5.4 Effect of Unconstrained αs
  5.6 Dimension Reduction for Multi-layer Log-linear Training
  5.7 Conclusion
6 Mixture Density Splitting
  6.1 Overview of Techniques for Mixture Density Splitting
    6.1.1 K-means clustering
    6.1.2 Splitting in Expectation-Maximization
    6.1.3 Other Techniques
  6.2 Discriminative Splitting of Log-linear Mixture Models
    6.2.1 Maximum Approximation
    6.2.2 Experiments and Results
  6.3 Discriminative Splitting for Deep Neural Networks
    6.3.1 Deep Neural Networks
    6.3.2 Linear bottlenecks for DNNs
    6.3.3 Discriminative Splitting for DNNs
    6.3.4 Experiments and Results
  6.4 Conclusion
7 Scientific Contributions
  7.1 Main Contributions
  7.2 Secondary Contributions
A Mathematical Symbols
B Acronyms
C Corpora and ASR systems
  C.1 European Parliament Plenary Sessions (EPPS) English
  C.2 QUAERO English
  C.3 QUAERO Spanish
Bibliography
List of Tables

1.1 An excerpt from the EPPS pronunciation lexicon [Lööf & Gollan+ 07]
3.1 Comparison of different training criteria under Schlüter & Macherey's unifying approach
3.2 Comparison of different training criteria under He & Deng's unifying approach
3.3 Definition of operations for different semirings over R
3.4 Definition of operations for the expectation semiring over R+ × R
4.1 EPPS English dev2007: Comparison of Gaussian and log-linear single density and mixture models
4.2 QUAERO Spanish task: Comparison of MPE training of LLMM acoustic model with GMM, for MLP input features
4.3 Training corpus: QUAERO English 50h / 250h. Comparison of Gaussian and log-linear mixture models
5.1 Training corpus: QUAERO English 50h. Log-linear training of feature transformation
5.2 EPPS English dev2007: Mixtures vs. Polynomials
5.3 Training corpus: QUAERO English 50h, MFCC window size 9: WER for ML vs. higher order features
5.4 Training corpus: QUAERO English 50h. Combination of state-of-the-art DNN with different techniques discussed in this work
5.5 Training corpus: QUAERO English 50h. Dimension reduction for two-layer log-linear training, each layer of size 4501
6.1 Training corpus: QUAERO English 50h. Comparison of ML split and discriminatively split log-linear mixture models
6.2 Training corpus: QUAERO English 50h. Input size 493 and output layer size 4501. Splitting of a single hidden layer of size 1024
6.3 Training corpus: QUAERO English 50h. Input size 493 and output layer size 4501. Splitting of a single hidden layer of size 128 and converting it to a full (non-sparse) matrix
6.4 Training corpus: QUAERO English 50h. Input size 493 and output layer size 4501. Six sigmoid hidden layers and linear bottleneck size 256. Splitting has been done directly from a network of 6 × 256
6.5 Training corpus: QUAERO 50h. Input size 493 and output layer size 4501. Six sigmoid hidden layers and linear bottleneck size 256. Splitting has been done step-wise by successive doubling and training, initializing from a network of 6 × 256
6.6 Training corpus: QUAERO 50h. Input size 493 and output layer size 4501. Six ReLU hidden layers and linear bottleneck size 256. Splitting has been done step-wise by successive doubling and training, initializing from a network of 6 × 256
List of Figures

1.1 Block diagram of a speech recognition system using Bayes' architecture
1.2 Flow diagram of MFCC feature extraction process
1.3 6-state HMM of a phonetic unit unrolled along time axis
1.4 Approximation of a two-dimensional probability distribution with a Gaussian mixture model of 3 densities. Each density has separate mean and covariance vectors
1.5 Example of beam search for an ASR digit recognition task [Ney 06]
1.6 Pictorial representation of Baum-Welch algorithm
3.1 Maximum likelihood estimates for a two-class problem with a class-independent full covariance matrix (a) and a class-independent diagonal covariance matrix (b) [Macherey 10]
3.2 Maximum mutual information estimates for a two-class problem with a class-independent diagonal covariance matrix (a). In (b) the ML estimate of the diagonal covariance matrix is used [Macherey 10]
3.3 Diagram of a word lattice resulting from recognition of the audio utterance "I like this meal"
4.1 Flow diagram of MPE discriminative training of log-linear mixture model
5.1 Scope of linear feature transformations
5.2 Flowchart of iterative polynomial features
5.3 EPPS English dev2007: Comparison of discriminatively trained second order and first order MFCC systems
5.4 EPPS English dev2007: WER(%) vs. number of parameters for higher-order polynomial features in comparison with mixture densities
5.5 EPPS English dev2007 with second order features: Comparison of discriminatively trained log-linear model versus the Gaussian mixture model obtained from that
6.1 Graphical depiction of mixture density splitting for the expectation-maximization algorithm
6.2 Flow diagram of discriminative training and splitting process for log-linear mixture model
6.3 EPPS English dev2007: Ascent of MMI objective function versus number of training iterations. The density splitting events are marked by +
6.4 EPPS English dev2007: Comparison of WER of discriminatively split and ML split log-linear models
6.5 EPPS English dev2007: Comparison of WER of discriminatively split and ML split log-linear models, with SAT MLLR and CMLLR
6.6 Example of a multilayer perceptron with one hidden (sigmoid) layer
6.7 Diagram of discriminative splitting of MLP hidden layer
C.1 Flow diagram of multilingual MLP features
Chapter 1

Introduction

The task of an automatic speech recognition (ASR) system is to convert human speech into written text. The input to the system is an audio signal (which may contain noise in addition to human speech) and the output is the corresponding text in a particular language. Such a speech recognizer can also be part of a higher-level task, for example language understanding [Bender & Macherey+ 03], speech-to-speech translation [Bub & Schwinn 96], spoken document retrieval [Johnson & Jourlin+ 99], spoken dialogue systems [Kristiina & McTear 09] and automatic video subtitling, to name a few. The overall performance of such a system can depend heavily on its constituent speech recognition system.

There are several challenges involved in achieving good speech recognition performance. The speech may have been recorded under noisy conditions, and there can be significant variations of accent among speakers of the same language. The goal of speech recognition research is to model all these variations and still achieve good recognition performance. The earliest endeavours in this field tried to model the speech and grammatical structure of a language by rule-based systems; since then, the statistical pattern recognition approach has become the de facto standard. The basic principle of this paradigm is Bayes' decision rule [Bayes 1763], which decides the most likely sequence of words given an input sequence of acoustic observations. Theoretically, Bayes' decision rule can be proved to be optimal among all decision rules if the true posterior distribution of word sequences given the acoustic input is known. In practice, however, the true distributions are not known, and they are therefore approximated by suitable statistical models. The performance of such a recognizer heavily depends on the choice of probability distribution used for the phonetic model, as well as on the method used to obtain such a model. The probability distribution of a particular phonetic unit should match its corresponding occurrences in the input observations as often as possible.
1.1 An overview of Statistical Speech Recognition
This work discusses automatic speech recognition within the framework of statistical decision theory. Models whose parameters capture the statistical properties of the data can reduce the computational complexity of speech recognition. Two types of information are combined to create this model: the acoustic model and the language model. The acoustic model measures the phonetic similarity between the input audio features and the different phonetic units of the language. The language model measures the probability that a particular sequence of words has been said, depending on the language's lexical structure and grammar.

[Figure 1.1: Block diagram of a speech recognition system using Bayes' architecture: speech input → feature extraction → feature vectors x_1...x_T → global search (maximize p(w_1...w_N) · p(x_1...x_T | w_1...w_N) over w_1...w_N), drawing on the acoustic model (subword units, pronunciation lexicon) and the language model p(w_1...w_N) → recognized word sequence {w_1...w_N}_opt]

Usually the word error rate (WER) metric is used to measure the performance of a speech recognition system. It is the minimum number of word insertions, deletions and substitutions required to convert the recognized hypothesis word sequence into the reference word sequence. It can be calculated by aligning the two sentences and measuring the Levenshtein distance between them:

\mathrm{WER} := \frac{I + D + S}{N}   (1.1)

where I is the number of insertions, D the number of deletions, S the number of substitutions and N the number of words in the reference. For some applications like keyword search, a more useful performance metric may be the precision and recall of certain keywords in the recognized text.
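As an illustration of Equation (1.1), the following minimal Python sketch computes the WER by dynamic programming over the Levenshtein alignment of the two word sequences; the function name is chosen for this example only and is not part of any toolkit mentioned in this thesis.

```python
def word_error_rate(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / reference length,
    computed via Levenshtein alignment of the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i like this meal", "i liked this meal"))  # 0.25
```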
For ASR, the first step is to convert the audio signal into a stream of n-dimensional feature vectors. Given a stream of such acoustic observation vectors x_1^T = x_1, ..., x_T, the best word sequence hypothesis w_1^N = w_1, ..., w_N is calculated according to Bayes' rule, which maximizes the posterior probability of the word sequence given the input feature vector stream:
[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \{ p(w_1^N | x_1^T) \} = \arg\max_{w_1^N} \{ p(x_1^T | w_1^N)\, p(w_1^N) \}   (1.2)
As discussed in Section 1.1.5.2, Bayes' decision rule is optimal among all decision rules only if the true probability distributions are known; in that case it guarantees the lowest possible classification error [Duda & Hart+ 01]. According to Equation (1.2), a combination of acoustic model and language model is used to estimate the posterior probability of the word sequence. The acoustic model p(x_1^T | w_1^N) denotes the conditional probability of observing the feature vector sequence x_1^T given the word sequence w_1^N. The language model p(w_1^N), on the other hand, does not depend on the input feature vectors; it denotes the prior probability of occurrence of the word sequence w_1^N.

The number of possible hypotheses at the output of a speech recognizer is arbitrary: there can be any number of words N from a possible vocabulary of thousands of words. Therefore we need a method to carry out the maximization in Equation (1.2) over a finite set of the most probable hypotheses. For this purpose, dynamic-programming-based search algorithms are used. The integral parts of a modern statistical speech recognition system are:

• Feature extraction or acoustic analysis extracts useful information from the speech signal and represents it as a sequence of T feature vectors x_1^T. Each feature vector represents the information of a small piece of speech, usually a few milliseconds long.

• The acoustic model represents speech units called phonemes and the way they are grouped together to form words using a dictionary called the pronunciation lexicon. If properly trained, the probability p(x|s) should be high for the correct phonetic state and low for all other states (given feature vector x and phonetic state s). The acoustic model is used to calculate the conditional probability p(x_1^T | w_1^N) given some sequence of N words w_1^N.

• The language model calculates how likely a given sentence w_1^N is to be spoken, out of all possible sentences, irrespective of its acoustic model probability.

• The training process takes a number of speech sequences as input, together with their respective word transcriptions. The goal is to calculate the acoustic model parameters that maximize the probabilities p(w_1^N | x_1^T) for all training sequences.

• Recognition takes a test speech sequence as input and attempts to find the most likely sequence of words w_1^N that was spoken. For this purpose it utilizes the combined information from the acoustic and language models, as shown in Equation (1.2); a toy illustration of this decision rule follows below.

These different steps are shown graphically in Figure 1.1 and are explained briefly in the following sections. In the scope of this thesis, we investigate the training of acoustic model parameters. Furthermore, linear feature transformations are explored, which can be thought of as an intermediate stage between feature extraction and the acoustic model.
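As a toy illustration of the decision rule in Equation (1.2), the Python sketch below scores a handful of candidate transcriptions by combining acoustic and language model log scores; the score tables are invented for this example and stand in for real models.

```python
# Hypothetical log-probabilities for three candidate transcriptions of one utterance.
log_acoustic = {"i like this meal": -120.4,    # log p(x_1^T | w_1^N)
                "i liked this meal": -119.8,
                "eye like this meal": -121.0}
log_language = {"i like this meal": -8.1,      # log p(w_1^N)
                "i liked this meal": -9.5,
                "eye like this meal": -14.2}

def bayes_decision(candidates):
    # arg max over w_1^N of p(x|w) * p(w), carried out in the log domain
    return max(candidates, key=lambda w: log_acoustic[w] + log_language[w])

print(bayes_decision(log_acoustic.keys()))  # "i like this meal"
```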
1.1.1 Feature Extraction
The purpose of feature extraction is to represent the input audio waveform in a compact parametric representation, so that it holds the necessary information to identify the phonetic components in the speech. There is a lot of extraneous information in the audio signal that does not usually contribute to the recognition process, for example background noise, loudness, and the speaker's voice quality and pitch. Such information should be removed during feature extraction as much as possible. Common signal analysis techniques for speech recognition employ a short-term spectral analysis, usually a Fourier transform. Some well known feature extraction techniques are Mel frequency cepstral coefficients (MFCC) [Mermelstein 76], perceptual linear prediction (PLP) [Hermansky 90] and Gammatone features [Aertsen & Johannesma+ 80, Schlüter & Bezrukov+ 07]. More recently, there have been feature extraction techniques that also extract longer-term features, such as MRASTA [Valente & Hermansky 08]; these features are usually used in conjunction with short-term features. Furthermore, for languages such as Chinese that depend on tone information to determine the meaning of words, tonal features have also proven to be helpful [Chang & Zhou+ 00].

[Figure 1.2: Flow diagram of the MFCC feature extraction process: speech signal → pre-emphasis and windowing → magnitude spectrum → Mel-scaled filter bank → logarithm → cepstral decorrelation → mean normalization → dynamic features → feature vectors]

For the MFCC features used in this thesis, a short-term discrete Fourier transform (DFT) is computed on overlapping windows of 25 milliseconds, shifted every 10 ms. Such abrupt cutting of the speech waveform can cause spectral distortion, therefore before the DFT a Hamming window [Harris 78] is multiplied with the signal to make the spectral representation smoother. A fast Fourier transform (FFT) algorithm [Cooley & Tukey 65] is used for the DFT; it is a low-complexity algorithm for the case when the number of samples is a power of two. To model the human auditory response to different frequencies, the output of the FFT is frequency-warped with respect to a Mel-frequency scale, and triangular frequency filters smooth the output frequency coefficients. The human ear perceives loudness on a logarithmic scale with respect to intensity, therefore a logarithm is applied to the output. A discrete cosine transform (DCT) can then be used to decorrelate the information among the frequency coefficients. The human voice frequency range is between 300 and 6000 Hz, therefore very high frequency components can be discarded. The usual number of MFCC features is between 12 and 16; the better the acoustic input quality, the more coefficients are needed to represent it.

The phonetic sounds of human speech can be divided into two complementary categories: voiced and unvoiced. This voicedness measure can be appended to the features to improve classification; it has been shown that a voicedness feature appended to the MFCC features results in a WER improvement [Zolnay & Schlüter+ 02]. To include some dynamic information along the time axis, first and second derivatives of the MFCC features can be appended to the MFCC feature vector. Alternatively, a number of temporally consecutive feature vectors can be concatenated into a single vector and linear discriminant analysis (LDA) [Häb-Umbach & Ney 92] can be performed on them. LDA is a dimension-reducing linear transformation which maximizes the class separability between different phonetic classes; it assumes equal class covariances. Another method, heteroscedastic discriminant analysis (HDA) [Kumar & Andreou 98], calculates the transformation assuming class-specific covariances.

MFCC features are also used for speaker identification tasks [Reynolds 94], which means that there is still a lot of speaker-dependent information present in the features that could be removed. This can be done by a linear speaker-dependent transformation of the features, called constrained maximum likelihood linear regression (CMLLR) [Leggetter & Woodland 95]. Another way is to adapt the acoustic model parameters to the speaker, called maximum likelihood linear regression (MLLR). Such transformations have the added benefit of filtering out the effect of background noise from the features.
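To make the pipeline concrete, here is a compact numpy sketch of the main MFCC steps described above (framing, Hamming window, magnitude spectrum, Mel filter bank, logarithm, DCT). It is a simplified illustration, not the RWTH feature extractor: the filter bank construction is reduced to its essentials and the constants (20 filters, 12 cepstra) are example values.

```python
import numpy as np

def mfcc(signal, sample_rate=16000, frame_len=400, frame_shift=160,
         n_filters=20, n_ceps=12):
    """Very simplified MFCC extraction: framing -> Hamming window ->
    magnitude spectrum -> Mel filter bank -> log -> DCT (cepstral decorrelation)."""
    # split the signal into overlapping frames (25 ms window, 10 ms shift at 16 kHz)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)            # taper frame edges
    spectrum = np.abs(np.fft.rfft(frames, axis=1))     # magnitude spectrum

    # triangular Mel-scaled filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, spectrum.shape[1]))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_energies = np.log(spectrum @ fbank.T + 1e-10)   # logarithmic compression

    # DCT-II for cepstral decorrelation, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filters)
    return log_energies @ basis.T                        # shape (n_frames, n_ceps)

features = mfcc(np.random.randn(16000))                  # 1 s of dummy audio
print(features.shape)                                     # (98, 12)
```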
1.1.2 Acoustic Model
The acoustic model p(x_1^T | w_1^N) provides the conditional probability of a sequence of feature vectors x_1^T, given that a sequence of words w_1^N has occurred. In essence this is a measure of the phonetic similarity of the input audio features to a particular string of words, regardless of the probability of that word sequence occurring in terms of grammatical correctness. For an ASR system with a small number of possible words, e.g. numerical digits, each word may be represented by a sequence of sub-word units called acoustic states. During training, the statistics of each state are calculated from the occurrences of feature vectors corresponding to that state. For large vocabulary ASR, there may be thousands of words, and due to data sparsity it is not practical to accumulate sufficient statistics for each word separately. It is also desirable to recognize words which have few or no occurrences in the training data. In this more general case, the words are defined as sequences of phonetic units called phonemes. This is similar to the way word pronunciations are represented in language dictionaries, where every word is defined by a sequence of phonetic units of that language. Such a representation based on sub-word units is called a pronunciation lexicon. Each word in the lexicon may have one or more pronunciations; multiple pronunciations per word allow us to handle different words with the same orthography as well as accent variations. The lexicon can be updated as needed by adding new words, and good algorithms exist that estimate the pronunciation of new words based on known ones; this process is called grapheme-to-phoneme conversion [Bisani & Ney 03]. An example excerpt from a pronunciation lexicon for English is shown in Table 1.1.

Table 1.1: An excerpt from the EPPS pronunciation lexicon [Lööf & Gollan+ 07]

Word        Pronunciation
This        dh ih s
Thomas      t oh m ax s
Thorn       th ao n
Those       dh ow z
Thursday    th er z d ey , th er z d ih
Thursdays   th er z d ih z , th er z d ey z
Thyssen     dh ay s s eh n

The pronunciation of a phoneme also depends on its predecessor and successor phonemes; this effect is referred to as coarticulation. Therefore modern speech recognition systems accumulate sub-word statistics for groups of context-dependent phonemes called n-phones. Commonly, phonetic units with a context of three phonemes are used, called triphones. If these groups of phonemes transcend word boundaries, the model is an across-word model. The implementation of across-word models in the RWTH ASR toolkit is described in [Sixtus 03], and some adaptations for discriminative training are given in Chapter 5 of [Macherey 10].

A further source of variation in spoken words lies in the temporal domain. Even the same speaker may not utter the same word the same way every time; there can be differences in duration, pitch and emphasis pattern. Since the feature vectors are sampled at regular intervals during feature extraction, we need some way to model this temporal variability. A hidden Markov model (HMM) is used to model such distortion [Baker 75]. The HMM is a finite state automaton with unobservable (hidden) states, which are connected by transitions (including cyclic ones). These states emit observable symbols according to some stochastic probability distribution; for ASR these symbols are continuous feature vectors. The HMM-based acoustic model can be represented as:
p(x_1^T | w_1^N) = \sum_{s_1^T} p(x_1^T, s_1^T | w_1^N)   (1.3)
The summation in Equation (1.3) signifies that, given the word sequence w_1^N, the output observations x_1^T can be generated by multiple configurations of the hidden states s_1^T. Using the decomposition rule, this becomes:
p(x_1^T | w_1^N) = \sum_{s_1^T} \prod_{t=1}^{T} p(x_t | x_1^{t-1}, s_1^t, w_1^N) \cdot p(s_t | x_1^{t-1}, s_1^{t-1}, w_1^N)   (1.4)
For simplification, we apply the (first-order) Markov assumption [Duda & Hart+ 01], which states that the conditional probability distribution of the present state depends only on the previous state and not on any others:

p(x_1^T | w_1^N) = \sum_{s_1^T} \prod_{t=1}^{T} p(x_t | s_t, w_1^N) \cdot p(s_t | s_{t-1}, w_1^N)   (1.5)
The first term in Equation (1.5) is the emission probability p(x_t | s_t, w_1^N), i.e. the probability of observing the vector x_t given that the automaton is in state s_t. The second term is the transition probability p(s_t | s_{t-1}, w_1^N) of moving to state s_t given that the automaton is in state s_{t-1}. A given path s_1^T defines the alignment between the word sequence and the feature stream. Given an alignment, the emission probability of a state s can be estimated by accumulating the statistics of the features aligned to it. The transition probabilities can be calculated from first-order counts on the state sequence [Rabiner 89], or more simply set to fixed transition penalties which only depend on the type of transition and not on the current state. In the RWTH ASR toolkit, the transition probabilities are implemented as fixed penalties, called time distortion penalties (TDPs) [Rybach & Gollan+ 09]. The summation in Equation (1.5) can be replaced by the maximum or Viterbi approximation:

p(x_1^T | w_1^N) = \max_{s_1^T} \prod_{t=1}^{T} p(x_t | s_t, w_1^N) \cdot p(s_t | s_{t-1}, w_1^N)   (1.6)

1.1.2.1 HMM Topologies
The triphones are usually represented by a finite state automaton with three or six states. There are three types of transitions; forward: going from a state to the next one, skip: going directly to the second-next state, and loop: repeating the current state. In a three-state HMM, the three states correspond to the beginning, middle and end of a triphone respectively, and a skip transition may not be present. A 6-state HMM or Bakis topology [Bakis 76] is similar to the three-state topology with each state duplicated, and with skip transitions between the states. If feature extraction samples the audio signal every 10 ms, the fastest phoneme that can be recognized by a 3-state HMM lasts 30 ms, and by a 6-state HMM 60 ms. In cases where this is too restrictive, a 3-state HMM with skip transitions may be used. The complete HMM for a word sequence is formed by concatenating the HMMs of its constituent phonetic units obtained from the lexicon. The parameters of the HMM are estimated by the forward-backward algorithm [Baum 72, Rabiner & Juang 86].

If the input features and the corresponding word sequence are given (as during training), then finding the corresponding HMM state sequence as in Equation (1.3) is called forced alignment. The problem is similar to finding the shortest path through a graph and is solved by dynamic programming. An example of time alignment is depicted in Figure 1.3: the complete HMM for the word "example" (phonemes I g' z a: m p l) is unrolled along the time axis. The vertical axis represents the concatenation of 6-state HMMs that make up this word, and the horizontal axis shows the acoustic feature vectors. For the Viterbi approximation case as in Equation (1.6), finding the highest-probability path from the lower-left to the upper-right corner gives the best state sequence for the forced alignment.

[Figure 1.3: 6-state HMM of a phonetic unit unrolled along the time axis (state index vs. time index)]
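To make the forced-alignment idea concrete, here is a small numpy sketch of Viterbi alignment of a feature sequence to a left-to-right HMM with only loop and forward transitions (skips are omitted for brevity). The emission scores are taken as given, standing in for the log emission probabilities of Equation (1.6); the fixed transition penalties are example values.

```python
import numpy as np

def viterbi_align(log_emission, log_loop=-0.7, log_forward=-0.7):
    """log_emission[t, s] = log p(x_t | s) for a left-to-right HMM.
    Returns the best state sequence (forced alignment), cf. Equation (1.6)."""
    T, S = log_emission.shape
    score = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emission[0, 0]          # must start in the first state
    for t in range(1, T):
        for s in range(S):
            loop = score[t - 1, s] + log_loop
            fwd = score[t - 1, s - 1] + log_forward if s > 0 else -np.inf
            backptr[t, s] = s if loop >= fwd else s - 1
            score[t, s] = max(loop, fwd) + log_emission[t, s]
    # backtrace from the final state at the last time frame
    states = [S - 1]
    for t in range(T - 1, 0, -1):
        states.append(backptr[t, states[-1]])
    return states[::-1]

# toy example: 8 frames aligned to a 3-state HMM
rng = np.random.default_rng(0)
print(viterbi_align(np.log(rng.dirichlet(np.ones(3), size=8))))
```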
1.1.2.2 n-Phone State-tying

A problem with n-phone models is data sparsity, owing to their exponential number of possibilities. For example, for a phoneme set of 50 phonemes, the number of triphones can be 50^3, although language constraints may not allow all of these. Many of these triphones may have few or no observations, making statistics accumulation difficult and error-prone. A workaround is to cluster n-phones based on their acoustic similarity, yielding generalized n-phone models. In this thesis, the triphones are clustered and stored in the leaves of a decision tree [Beulen & Ortmanns+ 99], a so-called classification and regression tree (CART) [Breiman & Leo+ 84].
1.1.2.3 Gaussian Mixture Model
The most common way of expressing the acoustic model HMM emission probabilities is through Gaussian mixture models (GMMs). The theoretical appeal of the Gaussian distribution lies in the central limit theorem [Rice 95], which states that under mild conditions the mean of many random variables drawn from any distribution is Gaussian distributed. Similarly, speech vectors emanating from a particular HMM state would be Gaussian distributed around a central mean vector, assuming infinite data and random error sources. In practice, however, neither is the data infinite nor are the error sources random. For example, in a speech training data set with both male and female speakers, we can model each HMM state with a mixture of two Gaussian distributions, one for each gender.

[Figure 1.4: Approximation of a two-dimensional probability distribution with a Gaussian mixture model of 3 densities. Each density has separate mean and covariance vectors]

The general GMM emission probability distribution is defined as:

p(x|s; w_1^N) = \sum_{l=1}^{L_s} c_{s,l} \, \mathcal{N}(x | \mu_{s,l}, \Sigma_{s,l}; w_1^N)   (1.7)
where L_s is the total number of densities for state s, and c_{s,l}, \mu_{s,l} and \Sigma_{s,l} are the mixture weight, mean and covariance matrix of density (s, l). For D-dimensional input feature vectors, \mu_{s,l} \in \mathbb{R}^D and c_{s,l} \in \mathbb{R} such that \sum_{l=1}^{L_s} c_{s,l} = 1. \Sigma_{s,l} \in \mathbb{R}^{D \times D} is positive definite and symmetric, and to alleviate data sparsity it is often tied across several mixture components or states [Povey & Burget+ 11].

\mathcal{N}(x | \mu_{s,l}, \Sigma_{s,l}; w_1^N) = \frac{1}{\sqrt{(2\pi)^D |\Sigma_{s,l}|}} \exp\left( -\frac{1}{2} (x - \mu_{s,l})^\top \Sigma_{s,l}^{-1} (x - \mu_{s,l}) \right)   (1.8)
The RWTH ASR toolkit makes some simplifying assumptions for the GMM. There is a global (pooled) covariance vector \sigma \in \mathbb{R}_{\ge 0}^{D} (a D-dimensional non-negative real vector) representing the diagonal of a global covariance matrix \Sigma \in \mathbb{R}^{D \times D}. Using a diagonal covariance requires that the components of the feature vectors are statistically independent; therefore the features are decorrelated during feature extraction by the DCT and later by LDA (Section 5.1.1). The GMM is thus implemented as:

p(x|s; w_1^N) = \sum_{l=1}^{L_s} c_{s,l} \, \mathcal{N}(x | \mu_{s,l}, \sigma; w_1^N)   (1.9)
Maximum Approximation for Mixture Densities

For high-dimensional data like speech feature vectors, the Gaussian mixture distribution is sparse in the sense that the probability of a Gaussian component decays exponentially as the feature vector moves away from it, as shown in Figure 1.4. Therefore, it is likely that the bulk of the probability for a given input vector depends on only one of the Gaussians in the mixture. This makes it possible to approximate the sum of probabilities of a mixture by a maximum of probabilities. The conditional probability then becomes:

p(x|s; w_1^N) = \max_{l} \, c_{s,l} \, \mathcal{N}(x | \mu_{s,l}, \Sigma_{s,l}; w_1^N)   (1.10)
In this thesis, the maximum approximation is used for all Gaussian mixture models. The parameters of the GMM can be estimated from training data by maximum likelihood training. Due to the hidden variables s_1^T in the HMM, the expectation maximization (EM) algorithm is used, as explained in Section 1.1.5.1. The ML-trained acoustic model can be further refined by discriminative training (Chapter 3), which tries to maximize the posterior probability of each class with respect to all other classes.
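A small numpy sketch of the emission score of Equations (1.9) and (1.10), using a pooled diagonal covariance and the maximum approximation over densities; the parameter values below are random placeholders rather than trained model parameters.

```python
import numpy as np

def log_emission_max_approx(x, means, log_weights, pooled_var):
    """Max-approximated log p(x|s) of a diagonal-covariance GMM, cf. Eq. (1.10):
    log max_l c_{s,l} N(x | mu_{s,l}, sigma)."""
    diff = x - means                                      # shape (L, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / pooled_var, axis=1)
                        + np.sum(np.log(2.0 * np.pi * pooled_var)))
    return np.max(log_weights + log_gauss)                # keep the best density only

rng = np.random.default_rng(1)
D, L = 16, 4                                              # feature dimension, densities per state
means = rng.normal(size=(L, D))
log_weights = np.log(np.full(L, 1.0 / L))                 # uniform mixture weights
pooled_var = np.ones(D)                                   # globally pooled diagonal covariance
x = rng.normal(size=D)
print(log_emission_max_approx(x, means, log_weights, pooled_var))
```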
1.1.3 Language Model
In the decision rule in Equation (1.2), the language model (LM) p(w_1^N) gives the a-priori probability of the word sequence w_1^N. In essence it provides a measure of the likelihood of the word sequence, based on the grammatical rules of a language. Furthermore, we want the language model to capture domain-specific information for certain special-case ASR systems. For example, a recognizer for educational lectures in mathematics should assign higher probabilities to the related technical terms and spoken mathematical formulas than to the rest of the text. Since this model only depends on text and is independent of acoustic data, large amounts of text from the Internet, books, articles etc. can be used as input. To capture certain idiosyncrasies of spontaneous human speech, e.g. grammatical errors common in speaking, repetitions etc., text transcriptions are also a useful data source. The language model probability of a word sequence w_1^N is given by

p(w_1^N) = \prod_{n=1}^{N} p(w_n | w_1^{n-1})   (1.11)
Since the total number of possible word sequences is unlimited, simplifying assumptions have to be made to obtain reliable, non-sparse estimates. The most common way to estimate the language model is through the accumulation of counts of neighbouring words. This is called an m-gram language model. It assumes that the probability of the current word w_n depends only on the previous m-1 words w_{n-m+1}^{n-1} [Bahl & Jelinek+ 83]. For example, for a trigram LM the counts of all word sequences of length three are accumulated and then normalized to yield probabilities. The language model thus becomes:

p(w_1^N) := \prod_{n=1}^{N} p(w_n | w_{n-m+1}^{n-1})   (1.12)
The word sequence w_{n-m+1}^{n-1} is called the history h_n of the word w_n, with h_n = w_1^{n-1} if n < m, and h_n = ∅ if n − 1 < n − m + 1. In recent ASR systems it is common to use 4-gram and 5-gram LMs, while vocabulary sizes are in the range of a few hundred thousand words. Since the number of possible word m-grams increases polynomially with the vocabulary size and exponentially with m, it is highly likely that many m-grams occur only a few times or not at all in the training data. Looking at Equation (1.12), it is desirable to give non-zero probabilities to such unseen m-grams. A technique called discounting is used for this purpose [Katz 87, Ney & Essen+ 94]: some probability mass is subtracted from the seen events and distributed over the unseen events (backing off), usually utilizing a model with a shorter history. For estimating the smoothing parameters, cross-validation techniques like leaving-one-out are utilized [Ney & Essen+ 94].
The perplexity measure is used to estimate how well a language model is suited to a particular recognition domain. For some given sample text w_1^N:

\log PP = \log \left[ \prod_{n=1}^{N} p(w_n | w_{n-m+1}^{n-1}) \right]^{-1/N}   (1.13)
The log-perplexity in Equation (1.13) is the entropy of the model. It can also be interpreted as the average number of word choices that can follow a given history of m − 1 words. The language model can be optimized or domain-adapted to yield a low perplexity on some development text. A low perplexity either means that the LM is compact and well structured, or that the LM is highly suited to the recognition task at hand. Recently, there has been interest in newer methods for obtaining and improving language models. The LM can be discriminatively trained to obtain a better representation for recognition, as the ultimate goal is to reduce the word error rate. Neural network LMs in conjunction with m-gram LMs have shown significant improvements [Bengio & Ducharme+ 01, Mikolov 12]. Maintaining an LM cache during recognition is another effective way to adapt to the recognition text domain, as words are likely to be repeated during a conversation or speech.
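The following Python sketch estimates a bigram (m = 2) LM from a toy corpus by relative frequencies and evaluates the perplexity of Equation (1.13) on a sample sentence. Smoothing and discounting are omitted, so unseen bigrams would require the backing-off discussed above; the corpus and sentences are invented for the example.

```python
import math
from collections import Counter

corpus = ["i like this meal", "i like this soup", "we like this meal"]

# accumulate history (unigram) and bigram counts, with a sentence-start token
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(w, h):
    # relative-frequency estimate p(w | h); no discounting in this sketch
    return bigrams[(h, w)] / unigrams[h]

def log_perplexity(sentence):
    words = ["<s>"] + sentence.split()
    log_probs = [math.log(p_bigram(w, h)) for h, w in zip(words[:-1], words[1:])]
    return -sum(log_probs) / len(log_probs)      # cf. Equation (1.13)

print(math.exp(log_perplexity("we like this soup")))
```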
1.1.4 Global Search
Given the input features x_1^T, the goal of the global search is to find the word sequence w_1^N that maximizes the a-posteriori probability:

[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \{ p(w_1^N | x_1^T) \} = \arg\max_{w_1^N} \{ p(x_1^T | w_1^N)\, p(w_1^N) \}   (1.14)
As stated in Equation (1.14), for this purpose we need to search the space of all state sequences s_1^T that correspond to all possible word sequences w_1^N. For the case of an HMM-based acoustic model and an m-gram language model, this becomes:

[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \left\{ \left[ \prod_{n=1}^{N} p(w_n | w_{n-m+1}^{n-1}) \right] \cdot \left[ \sum_{s_1^T} \prod_{t=1}^{T} p(x_t | s_t; w_1^N) \cdot p(s_t | s_{t-1}; w_1^N) \right] \right\}   (1.15)
Since our objective is to find the optimal word sequence and not the exact posterior probability, and assuming that the best state sequence corresponds to the optimal word sequence, we can replace the summation in Equation (1.15) with the maximum or Viterbi approximation:

[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \left\{ \left[ \prod_{n=1}^{N} p(w_n | w_{n-m+1}^{n-1}) \right] \cdot \left[ \max_{s_1^T} \prod_{t=1}^{T} p(x_t | s_t; w_1^N) \cdot p(s_t | s_{t-1}; w_1^N) \right] \right\}   (1.16)
Due to the Viterbi approximation and the previously discussed Markov assumption for the HMM, the problem can be divided into sub-problems with local dependencies. The task therefore reduces to finding the best path through a state graph, and efficient dynamic-programming methods can be used for decoding [Bellman 57]. There are two paradigms for graph search: depth-first and breadth-first. Two examples of depth-first search are Dijkstra's algorithm [Dijkstra 59] and A* search [Jelinek 69]; the idea is to traverse the graph by pushing and popping states from a stack so as to reach the final state through the best path first. The breadth-first search, on the other hand, traverses several paths simultaneously in a time-synchronous manner [Baker 75, Ney 84].

A complete optimization of Equation (1.16) is only feasible for small-vocabulary ASR. For large-vocabulary systems we are searching a space of practically infinite state sequences, and an exhaustive search is not possible. Therefore, during search the number of active hypotheses needs to be limited (pruning). In breadth-first search it is possible to compare all competing hypotheses at each time step t, and therefore to prune the least probable hypotheses according to some threshold. This allows significant savings in the computation and memory requirements of the search. In beam search [Lowerre 76, Ney & Mergel+ 87] only a fraction of the total hypotheses are expanded to the next time step, provided that their scores are sufficiently close to that of the currently best hypothesis. An additional improvement of the search pruning is possible by language model look-ahead, where the language model probabilities are already considered at early stages of the search [Ortmanns & Ney+ 96]. A pictorial representation of the beam search process is shown in Figure 1.5, where the recognition vocabulary consists of ten digits. The dark lines show the shortest path through the currently recognized digit, while the shaded area around this line shows the high-probability portion of the search space that has not been pruned.
[Figure 1.5: Example of beam search for an ASR digit recognition task [Ney 06]]

It is important to note that the beam search method can be error-prone: there is a probability that the best hypothesis is pruned early in the search process. However, proper tuning of the pruning parameters ensures that this source of error has minimal impact on finding the best word sequence. Some other techniques that aim to reduce the computational complexity of the search are the lexical prefix tree [Ney & Häb-Umbach+ 92], fast likelihood computation [Ramasubramanian & Paliwal 92, Ortmanns & Ney+ 97], multi-pass approaches that reduce the size of the search space beforehand [Murveit & Butzberger+ 93, Ney & Aubert 94], word lattices [Ney & Aubert 94] and N-best hypothesis lists [Schwartz & Chow 90].
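To illustrate beam pruning, the following Python sketch performs a time-synchronous search over toy hypotheses and keeps only those whose score lies within a fixed beam of the current best. The expansion function and scores are invented for the example and do not correspond to the RWTH decoder.

```python
def beam_search(initial, expand, n_steps, beam=5.0):
    """Time-synchronous beam search: at each step, expand all active hypotheses
    and prune those whose log score falls more than `beam` below the best one."""
    active = [initial]                                  # list of (log_score, history)
    for _ in range(n_steps):
        expanded = [(score + dscore, history + [symbol])
                    for score, history in active
                    for symbol, dscore in expand(history)]
        best = max(score for score, _ in expanded)
        active = [h for h in expanded if h[0] >= best - beam]   # beam pruning
    return max(active)

# toy expansion: every hypothesis can be extended by one of three digit words
def expand(history):
    return [("one", -1.0), ("two", -2.5), ("three", -4.0)]

score, words = beam_search((0.0, []), expand, n_steps=4, beam=3.0)
print(score, words)
```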
1.1.5 Acoustic Model Training
There are two major paradigms for acoustic model training: maximum likelihood (ML) and discriminative training. ML training estimates the joint probability of the input features and their transcriptions; each class (e.g. HMM triphone state) is handled separately, without regard to possible overlaps between classes. Discriminative training tries to estimate the posterior probabilities directly, in comparison to the probabilities of the competing classes. These two approaches are described in some detail below.

1.1.5.1 Maximum Likelihood Training
The input to acoustic model training is a training corpus (X, W) = {(X_r, W_r)} of r = 1, ..., R speech utterances. Utterance (or segment) r in the corpus consists of the acoustic feature vectors X_r = x_{r,1}, ..., x_{r,T_r} and the corresponding transcribed word sequence W_r = w_{r,1}, ..., w_{r,N_r}.

The standard method for acoustic model training is maximum likelihood, which maximizes the conditional probability of the acoustic features given the word transcriptions. The objective function to be optimized is:

F^{*}_{ML}(\theta) = \prod_{r=1}^{R} p_\theta(X_r | W_r)   (1.17)
where \theta denotes the acoustic model parameter set. For computational reasons, a logarithmic version is often used:

F_{ML}(\theta) = \sum_{r=1}^{R} \log p_\theta(X_r | W_r)   (1.18)
For the case of an HMM-based acoustic model with Gaussian mixtures as emission probabilities (Section 1.1.2.3), a closed-form solution is not possible, as there are hidden variables (the HMM states) involved. Under the maximum likelihood paradigm, the objective function in Equation (1.18) can be maximized by the expectation maximization (EM) algorithm. The re-estimation equations for Gaussian HMMs were first derived in [Dempster & Laird+ 77]. Calculating the derivative of F_{ML}(\theta) with respect to the acoustic model parameters \theta gives:

\frac{\partial F_{ML}(\theta)}{\partial \theta_{s,l}} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} p_\theta(s_t = s | X_r, W_r) \cdot \frac{\partial \log p(x_{r,t} | \theta_s)}{\partial \theta_{s,l}}   (1.19)
where p_\theta(s_t = s | X_r, W_r) is the word-conditioned state posterior probability that the path s_1^T through the state graph passes through state s at time t. It is also called the forward-backward probability and can be written as:

\gamma_{r,t}(s | W) = p_\theta(s_t = s | X_r, W)   (1.20)
These forward-backward probabilities can be calculated by the well-known Baum-Welch algorithm [Baum 72, Rabiner & Juang 86], a general method for estimating the unknown parameters of an HMM. For Gaussian mixture densities as defined in Section 1.1.2.3, the partial derivative of the last term in Equation (1.19) with respect to the density-specific parameters is:

\frac{\partial \log p(x | \theta_s)}{\partial \theta_{s,l}} = \frac{c_{s,l}\, p(x | \theta_{s,l})}{\sum_{\hat{l}} c_{s,\hat{l}}\, p(x | \theta_{s,\hat{l}})} \cdot \frac{\partial \log p(x | \theta_{s,l})}{\partial \theta_{s,l}}   (1.21)

Due to the maximum approximation this becomes

\frac{\partial \log p(x | \theta_s)}{\partial \theta_{s,l}} \approx \delta_{l,\, l_\theta(x,s)} \, \frac{\partial \log\left[ c_{s,l}\, p(x | \theta_{s,l}) \right]}{\partial \theta_{s,l}}   (1.22)

and the conditional forward-backward probability is written as:
\gamma_{r,t}(s, l | W) := \gamma_{r,t}(s | W) \cdot \frac{c_{s,l}\, p(x_{r,t} | \theta_{s,l})}{\sum_{\hat{l}} c_{s,\hat{l}}\, p(x_{r,t} | \theta_{s,\hat{l}})}   (1.23)

[Figure 1.6: Pictorial representation of the Baum-Welch algorithm: calculating \gamma_{r,t}(\sigma, s|W) from the forward probability a_{r,t-1}(\sigma|W), the transition probability p(s|\sigma) and the backward probability b_{r,t}(s|W)]
Accordingly, the derivative in Equation (1.19) becomes:

\frac{\partial F_{ML}(\theta)}{\partial \theta_{s,l}} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_{r,t}(s, l | W_r) \cdot \frac{\partial \log p(x_{r,t} | \theta_{s,l})}{\partial \theta_{s,l}}   (1.24)
The partial derivative of the objective function with respect to a state transition probability can be represented similarly:

\frac{\partial F_{ML}(\theta)}{\partial p(s | \sigma)} = \frac{1}{p(s | \sigma)} \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_{r,t}(\sigma, s | W_r)   (1.25)
where \sigma is the state before the transition to state s, and

\gamma_{r,t}(\sigma, s | W) := p_\theta(s_{t-1} = \sigma, s_t = s | X_r, W)   (1.26)
The term \gamma_{r,t}(\sigma, s | W) is the probability that a path through the state graph for the alignment of the feature string X_r to a word sequence W passes through state \sigma at time t−1 and through state s at time t. A summation over all transitions connecting the previous states to a particular state yields the conditional forward-backward probability of state s at time t:

\gamma_{r,t}(s | W) = \sum_{\sigma} \gamma_{r,t}(\sigma, s | W)   (1.27)
In Section 1.1.2.1 the standard HMM topology was introduced, which has only loop, skip and forward transitions. The state path can be decomposed into two parts: the one leading up to s (forward path) and the one after s until the final state (backward path):

\gamma_{r,t}(\sigma, s | W) = \frac{p_\theta(X_r, s_{t-1} = \sigma, s_t = s | W)}{p_\theta(X_r | W)}
= \frac{\left[ \sum_{s_1^{t-2},\, s_{t-1} = \sigma | W} p_\theta(x_{r,1}^{t-1}, s_1^{t-1}) \right] \cdot p(s | \sigma) \cdot \left[ \sum_{s_t = s,\, s_{t+1}^{T_r} | W} p_\theta(x_{r,t}^{T_r}, s_t^{T_r}) \right]}{p_\theta(X_r | W)}   (1.28)
:= \frac{a_{r,t-1}(\sigma; W) \cdot p(s | \sigma) \cdot b_{r,t}(s; W)}{p_\theta(X_r | W)}   (1.29)
where a_{r,t-1}(\sigma; W) is defined as the forward probability: the sum of the joint probabilities of all paths up to state \sigma at time t−1. Likewise, b_{r,t}(s; W) is the backward probability: the sum of the joint probabilities of all paths that follow state s at time t. A pictorial representation of this process can be seen in Figure 1.6. The evaluation of these quantities can be done recursively through dynamic programming:

a_{r,t}(s | W) = p_\theta(x_{r,t} | s, W) \cdot \sum_{\sigma} a_{r,t-1}(\sigma; W)\, p(s | \sigma)   (1.30)

b_{r,t}(s | W) = \sum_{\sigma} b_{r,t+1}(\sigma; W)\, p_\theta(x_{r,t+1} | \sigma, W)\, p(\sigma | s)   (1.31)
The recursion is initialized with a_{r,0}(s_0; W) = 1, where s_0 is the initial state, and b_{r,T_r}(s_{T_r}; W) = 1, where s_{T_r} is the final state. The forward probabilities can be computed by a time-synchronous forward pass over the state graph, and likewise the backward probabilities can be calculated by going backwards through the graph.
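A compact numpy sketch of the forward-backward recursions (1.30)-(1.31) for a single utterance; as a simplification of the word-conditioned state graph used in practice, the emission and transition probabilities are passed in as dense matrices and the computation is done without log-domain scaling.

```python
import numpy as np

def forward_backward(emission, transition):
    """emission[t, s] = p(x_t | s), transition[sigma, s] = p(s | sigma).
    Returns the state posteriors gamma[t, s], cf. Equations (1.27)-(1.31)."""
    T, S = emission.shape
    a = np.zeros((T, S))                        # forward probabilities
    b = np.zeros((T, S))                        # backward probabilities
    a[0] = emission[0] * (np.arange(S) == 0)    # start in the first state
    for t in range(1, T):
        a[t] = emission[t] * (a[t - 1] @ transition)           # Eq. (1.30)
    b[T - 1] = (np.arange(S) == S - 1).astype(float)            # end in the last state
    for t in range(T - 2, -1, -1):
        b[t] = transition @ (emission[t + 1] * b[t + 1])        # Eq. (1.31)
    gamma = a * b
    return gamma / gamma.sum(axis=1, keepdims=True)             # normalize per frame

# toy example: 6 frames, 3-state left-to-right HMM with loop/forward transitions
transition = np.array([[0.5, 0.5, 0.0],
                       [0.0, 0.5, 0.5],
                       [0.0, 0.0, 1.0]])
emission = np.random.default_rng(2).random((6, 3))
print(forward_backward(emission, transition).round(2))
```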
(1.32)
R PT r X t=1 γr,t (s, l|Wr ) xr,t µ s,l = PT r r=1 t=1 γr,t (s|Wr )
(1.33)
R PT r > X t=1 γr,t (s, l|Wr ) (xr,t − µ s,l )(xr,t − µ s,l ) Σ s,l = PT r r=1 t=1 γr,t (s|Wr )
(1.34)
A more detailed description can be found in [Bahl & Jelinek+ 83]. For reasons of numerical stability these probability calculations are done in log-domain; because there can be orders of
1.1. AN OVERVIEW OF STATISTICAL SPEECH RECOGNITION
17
magnitude of difference between different probabilities and also to take advantage of summation instead of multiplication. If the Viterbi approximation is used, as mentioned in Section 1.1.2, then only the maximum from the neighbouring states is considered during the recursive calculation of forward or backward probability. Because of the normalization constraint for probability distributions, the forward-backward probability of states occurring on the Viterbi path will be 1 and others will be 0 γr,t (s|W) ≈ δ s,st (Xr ,W)
(1.35)
where δ is the Kronecker delta function and st (Xr , W) corresponds to the state at time t for the best path aligning feature vectors Xr to word hypothesis W. 1.1.5.2
Optimality of Maximum Likelihood
The optimality of maximum likelihood estimation is subject to a number of underlying assumptions. As shown by [Nádas 83], the parameter set estimated by ML is optimal if:
• The samples of the training data obey the assumed distribution exactly
• The distribution family is well-behaved
• The number of training samples is large enough to sufficiently represent its corresponding probability distribution
In practice, due to limited data these conditions may not be satisfied, which has led to investigations in discriminative training, covered in Chapter 3.
1.1.6 Speaker Adaptation
The features obtained from feature extraction can vary substantially, depending on the pitch and harmonic quality of the speaker's voice. There are differences between male and female speakers, and the shape and length of the vocal tract influence the output speech. Below we briefly review three major techniques for speaker adaptation, which have been used in the ASR systems in the context of this thesis.

Vocal Tract Length Normalization For utterances of a given phonetic sound, the frequency positions of the spectral formant peaks are inversely proportional to the length of the vocal tract. Therefore, one commonly employed method is vocal tract length normalization (VTLN). The Mel filterbanks for MFCC features are warped along the frequency axis by a warping factor. For all segments of a particular speaker, the warping factor is chosen by a grid search that maximizes the likelihood of the speaker (or speaker cluster) given the hypothesis of a preliminary recognition. There is also a
faster variant (called fast VTLN), which is used in the RWTH ASR toolkit; here the warping factor for each speech segment is selected by a classifier [Lee & Rose 96, Molau 03]. The frequency axis of every utterance is scaled piecewise linearly so that it appears to have been produced by a vocal tract of standard length:
\[ f_{\text{norm}} = \begin{cases} \alpha f & 0 \le f \le f_0 \\[4pt] \dfrac{f_{ny} - \alpha f_0}{f_{ny} - f_0}\,(f - f_0) + \alpha f_0 & f_0 \le f \le f_{ny} \end{cases} \qquad (1.36) \]
where f_ny is the Nyquist frequency of 7 to 8 kHz. f_0 and α can be estimated by maximum likelihood training, as in [Lee & Rose 96].

Maximum Likelihood Linear Regression Maximum likelihood linear regression (MLLR) [Gales 98] is a linear transform of the acoustic model means and covariances to compensate for speaker variation. It requires a speaker label S for each speech segment in the training/recognition data. For training data, this label may be given manually by the transcriber; alternatively, speaker clustering algorithms exist to label the segments with their respective speaker IDs. The resulting model transformations are:
\[ \hat{\mu}^{(S)}_{s,l} = A^{(S)}_s \mu_{s,l} + b^{(S)}_s, \qquad \hat{\Sigma}^{(S)}_{s,l} = H^{(S)}_s \Sigma_{s,l} H^{(S)\top}_s \qquad (1.37) \]
The RWTH recognizer uses a globally pooled covariance matrix; therefore only the means are transformed and not the covariances. On average, each speaker may have only a few associated segments, and it is not practical to estimate separate parameters for each triphone state s; therefore state tying schemes are used [Pitz 05]. During the estimation phase, the transformations are calculated so as to maximize the likelihood of the corresponding speaker cluster, given the transcription obtained from an initial recognition.

Constrained Maximum Likelihood Linear Regression In the CMLLR implementation in RWTH ASR, the means and variances are transformed by the same state-independent matrices. These matrices depend only on the speaker, resulting in more robust estimates when the acoustic data per speaker is limited. If the acoustic model is to be transformed, then:
\[ \hat{\mu}^{(S)}_{s,l} = A^{(S)} \mu_{s,l} + b^{(S)}, \qquad \hat{\Sigma}^{(S)} = A^{(S)} \Sigma\, A^{(S)\top} \qquad (1.38) \]
The CMLLR transformation is more commonly applied as a linear transformation of the input acoustic feature vectors, which is equivalent to transforming the acoustic model. This makes it simpler to integrate into the feature extraction pipeline [Leggetter & Woodland 95], and maximum likelihood training can subsequently be performed on the transformed features. It may be noted that this technique also helps to normalize the effect of some non-speaker-related variations such as room acoustics (e.g. reverberation), microphone conditions and background noise, because it can be thought of as an acoustic noise removal filter applied at the speech segment level.
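As a small illustration of the feature-space view, the sketch below applies a per-speaker affine transform to a block of feature vectors; the function name and array layout are assumptions for this example, not the RASR interface.

```python
import numpy as np

def apply_cmllr(features, A, b):
    """Apply a feature-space CMLLR transform to all frames of one speaker.

    features : (T, D) matrix of acoustic feature vectors of the speaker
    A, b     : (D, D) matrix and (D,) bias estimated for this speaker
    Transforming the features with y_t = A x_t + b corresponds to an
    equivalent transformation of the model means and covariances.
    """
    return features @ A.T + b
```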
1.2 An Overview of Discriminative Acoustic Model Training
The discriminative training approach can be applied to acoustic model training in several different ways. First, a Gaussian mixture model may be trained with a discriminative training criterion. Secondly, a discriminative acoustic model may be defined that models the posteriors directly; deep neural networks also belong to this class. Thirdly, a linear feature transformation may be trained discriminatively. A concise review of the state of the art for these approaches is provided in the following sections.
1.2.1 Training Criteria and Optimization
Gaussian mixture models have been traditionally estimated by generative maximum likelihood training [Rabiner 89]. Later work has shown that such an ML-trained acoustic model can be used as an initial guess and further trained by optimizing a discriminative criterion [Bahl & Brown+ 86, Normandin 96]. A comparison between ML and MMI training is reported by [Bahl & Brown+ 86] for an isolated word recognition system. These aforementioned works used the sentence-level maximum mutual information (MMI) criterion. Another variant of MMI is on frame-level i.e. context independent [Kapadia 98, Povey & Woodland 99]. Frame-level optimization is generally faster; however it has been superseded by sentence-level training because of latter’s better WER performance. An implementation of sentence-level MMI for continuous phoneme based systems is presented in [Merialdo 88]. A comparison of different word models for MMI training can be found in [Cardin & Normandin+ 93]. The minimum classification error (MCE) criterion minimizes the classification error on the training data. This error measure can be sentence, phoneme or state error [McDermott & Katagiri 97]. MCE has been successfully applied to digit recognition tasks in [Chou & Juang+ 92], and to large vocabulary tasks in [Chou & Lee+ 93, McDermott & Hazen 04]. Next in the list are minimum word error and minimum phone error criteria, aimed at minimizing the word and phoneme error rates respectively. MWE was proposed by [Na & Jeon+ 95] and first applied to a large vocabulary ASR task by [Kaiser & Horvat+ 00]. Along with MWE, the MPE criterion was implemented in [Povey & Woodland 02]. It was shown that both MWE and MPE are better than MMI on the Switchboard/Call Home corpus [Povey 03]. There have been some attempts to unify the aforementioned training criteria into a single training criterion [Schl¨uter & Macherey+ 01, He & Deng+ 08, Nakamura & McDermott+ 09]. The MMI objective function can have a large, possibly infinite number of competing hypotheses, as will be discussed in Section 3.3.1. Therefore approximation methods are required for tractable optimization. In [Chow 90], 10-best lists are used for MMI optimization of DARPA resource management corpus. Word lattices were first investigated by [Valtchev & Odell+ 96] on Wall street journal task and showed 5-10% relative improvement. Other successful applications of lattice based MMI are [Schl¨uter & M¨uller+ 99, Woodland & Povey 00]. For the numerical optimization, the most common method is to use the growth transformation based extended Baum-Welch (EBW) algorithm [Normandin & Mogera 91]. Existence of finite iteration constants for real-valued densities was shown in [Kanevsky 04, Axelrod & Goel+ 07]. Instances of using EBW to re-estimate the GMM-HMM parameters are [Merialdo 88, Schl¨uter 00, Povey 03]. On the other end of spectrum are general-purpose gradient based
optimization methods [Nocedal & Wright 99], which are shown to converge to a local optimum. Gradient descent (GD) is used in [Chou & Juang+ 92, Valtchev 95, McDermott & Katagiri 97]. Experimental comparisons of gradient based algorithms like RPROP, QPROP and L-BFGS are found in [McDermott & Katagiri 05, Gunawardana & Mahajan+ 05]. In our work, the frame-based MMI criterion is used for most experiments. The sentence-level MMI and MPE criteria are also used for some experiments.
1.2.2 Discriminative Training of Feature Transformations
Just as discriminative training methods can be applied to acoustic model training, they can also be used to train linear feature transformations. Traditionally, feature transforms have been trained in a maximum likelihood framework by linear discriminant analysis (LDA) [Häb-Umbach & Ney 92] or heteroscedastic discriminant analysis (HDA) [Kumar & Andreou 98]. An MMI based discriminative training of a feature transform was implemented in [Macherey 98] and resulted in a significant improvement on a digit recognition task. Another example of reduced-rank transformation training is [Omer & Hasegawa-Johnson 03]. Similarly, discriminative objective functions have been used to train linear speaker adaptive transformations [Gunawardana 01, McDonough & Waibel 02, Lööf & Schlüter+ 07]. In [Tsakalidis & Doumpiotis+ 02] the transforms are trained for feature normalization as well as speaker adaptive training. A linear transformation can also be applied to reduce the dimensions of very high-dimensional features, to make them computationally tractable. In fMPE [Povey & Kingsbury+ 05], the posteriors of Gaussians are projected to high dimensions and then the MPE criterion is used to train the dimension-reducing linear transformation. The principal task of this thesis is to investigate discriminative training of linear feature transformations (Chapter 5), primarily for reducing the number of parameters and thus making the system faster and possibly more robust to over-training.
1.2.3 Direct Models
As stated in the previous section, the traditional approach has been to use generative models and further train them discriminatively. A more logical approach would be to compute the HMM state posteriors directly from the acoustic model parameters, instead of first computing class conditional probabilities as for generative models. An example of this is the maximum entropy Markov model (MEMM) [McCallum & Freitag+ 00], which employs a structure different from the HMM. [Kuo & Gao 06] used MEMMs and computed the WER by rescoring N-best lists, achieving about 4% absolute improvement as compared to HMMs. The maximum entropy principle [Darroch & Ratcliff 72] with its global maximum property motivates the use of an acoustic model with log-linear parameters. Log-linear acoustic models have been investigated by [Hifny & Renals+ 05, Macherey & Ney 03]. Conditional random fields with hidden variables have been investigated by [Gunawardana & Mahajan+ 05]. It was shown that the Gaussian mixture model and the hidden conditional random field are equivalent under certain conditions [Heigold & Schlüter+ 07]; this equivalence can be used to convert either one into the other. Log-linear models can also take complex features as input, for example rank-based features [Kuo & Gao 06], posterior-based features [Hifny & Gao 08],
cluster features [Wiesler & Nußbaum+ 09] and MLP features [Fosler-Lussier & Morris 08]. In most experiments in this thesis, log-linear single density and mixture models are used for acoustic modelling.
1.2.4 Multilayer Perceptrons
Conceptually, neural networks are inspired by a mammalian brain. There are millions of connected neurons, and information processing and memory storage causes changes in the weights of these connections. The multilayer perceptrons (MLP) are used for creating probabilistic features for speech recognition. A multilayer perceptron has non-linear activation functions in the hidden and output layers. The use of neural networks for speech recognition was initiated in [Peeling & Moore+ 86, Bourlard & Wellekens 87, Waibel & Hanazawa+ 89]. These systems modelled the whole recognition process by MLPs and were later superseded by HMMGMM based systems. More recently, the interest in MLPs has resurged due to their improved performance thanks to better methodologies and increased availability of computing power. There are two types of approaches: tandem and hybrid. Both of these approaches employ HMM architecture for the time alignment problem. A tandem MLP system [Hermansky & Ellis+ 00] uses the output of a bottleneck layer as features for training a regular GMM-HMM speech recognition system. These features may be concatenated with regular features e.g. MFCC [Plahl & Schl¨uter+ 11]. Speaker adaptation for tandem MLP is the same as that for a normal GMM-based ASR system. Going one step further away from GMM acoustic model, a hybrid MLP system [Seide & Gang+ 11, Dahl & Deng+ 12] directly uses the posterior probabilities from the last softmax layer of MLP network as acoustic model probabilities for corresponding monophone/triphone states. During the last few years, deep neural networks i.e. MLPs with several hidden layers have caught the attention of pattern recognition community. The number of hidden layers can be six or more. A major breakthrough was the use of layer by layer pre-training strategy [Hinton 10]. The use of deep neural networks has become widespread for speech recognition, and has provided huge gains of WER in comparison to Gaussian mixture models [Hinton & Deng+ 12, Dahl & Deng+ 12, Sainath & Kingsbury+ 11]. In the scope of this thesis, experiments are performed to train linear transformations between layers of a deep neural network (Chapter 6). This is done by initializing from a small network and splitting to increase the number of parameters, thus making it more robust by providing a good initial guess.
Chapter 2 Scientific Goals

The initial goal of this thesis was the investigation of different aspects and properties of log-linear acoustic models and transformations. A conventional HMM-Gaussian mixture model implicitly calculates the phone state posteriors from conditional probabilities through Bayes' rule. In contrast to this approach, previous work has shown the usefulness of direct posterior models, log-linear models being one of them. Furthermore, Gaussian mixture models can be converted to log-linear models due to the equivalence of their posterior forms. It has also been experimentally shown that log-linear models perform as well as or better than discriminatively trained Gaussian mixture models. A frame-based maximum mutual information criterion allows convex optimization of log-linear models, due to the maximum entropy principle. It has been verified experimentally that single density log-linear models may be initialized from arbitrary random values and upon training converge to the global maximum.

Training of log-linear mixture models by splitting Increasing the model resolution during ML training (mixture splitting) was standard procedure, but the same had not been explored for discriminative training. As mentioned in the previous paragraph, the optimization of a single density log-linear acoustic model is a convex optimization problem. However, a log-linear mixture model does not conform to this convexity property, and therefore a good initial guess is required. In previous work, such a mixture-based log-linear model was initialized from a previously trained maximum likelihood Gaussian mixture model. It is desirable to use the same objective function for both parameter training and splitting, so that the distribution of mixture densities corresponds better to the objective function. In this work discriminative splitting is investigated as an alternative approach. Would we be able to get a better mixture distribution if we start from a single density log-linear model and successively split it during discriminative training?

Efficient exploitation of higher-order polynomial features Polynomial features of higher than second order can have prohibitively high dimensions. If possible, their dimensionality should be reduced without compromising WER. One useful property of log-linear models (as successfully exploited in other fields such as machine translation) is their ability to combine different knowledge sources, and their relative robustness to feature scaling and linear dependences. So the question arises: can we also use log-linear models
to combine different types of input acoustic features? Secondly, previous work has shown the successful use of discriminative dimension reduction for combining very high dimensional input features (fMPE [Povey & Kingsbury+ 05]). So a natural motivation arises to apply log-linear transformation training to such high dimensional feature combination. One such set of very high dimensional features are polynomial features. However, going beyond second order is prohibitively expensive, as the feature dimension increases polynomially with increasing order. In this case, log-linear dimension reduction is a good candidate for reducing the number of dimensions without compromising classification performance.

Log-linear training in a layer-by-layer fashion The multilayer concept had not been used for log-linear training (our work was initiated before the use of deep neural networks became widespread for speech recognition). For polynomial features, after second-order expansion and subsequent dimension reduction, the resulting feature vector is small enough to be polynomial-expanded again. This provides a multi-layer feature extraction mechanism where the parameters of each layer are trained log-linearly. At each layer, the linear transformation and the log-linear acoustic model parameters are to be trained discriminatively. As shown in Chapter 5, the MMI objective function is convex for the transformation matrix optimization if the acoustic model parameters are held constant. Alternatively, the objective function is convex for the acoustic model parameters if the linear transform parameters are held constant. Previously it was shown that these two sets of parameters can be optimized alternately (i.e. one after the other). Although this is no longer a convex optimization problem, experimental results indicate that alternate optimization provides a reduction in the objective function and also in WER. In the current work the same approach is applied, although this time in a multi-layer scenario. This work on polynomial features therefore provides a multi-layer representation of features. At the time this work was commenced, deep learning was not yet widespread in the speech recognition and acoustic modelling community; our work can therefore be viewed as one of the earlier endeavours in the direction of deep networks.

Initializing a large deep neural network from a smaller network by splitting Deep neural network (DNN) training is a highly non-convex problem, and random initialization for pre-training is the norm. It can be argued that initializing a large DNN from a smaller network can give a better initial guess. Recently, deep neural networks have become the industry standard for acoustic modelling, and have consistently provided superior results as compared to conventional Gaussian mixture models. DNNs have shown their potential to model complex non-linear functions. Hybrid DNNs are also direct models like log-linear models, because they model the state posteriors directly without first calculating class conditional probabilities. Another similarity is the normalization in the last DNN layer (softmax activation function): the softmax is exactly equal to a log-linear model, and a log-linear mixture model can be visualized as a softmax layer followed by a sparse linear layer (summation). These similarities motivate us to transfer some of the concepts and optimization strategies of log-linear models to deep neural networks.
Since the DNNs are highly non-linear networks, the optimization problem is also theoretically prone to local optima. To alleviate this effect we can apply the same strategy as we used for the similarly non-convex log-linear mixture models: initialize from a small convex model and successively increase its resolution during the training. Therefore we aim to
optimize a deep neural network by initializing from a small and robust model and then splitting to increase its resolution to a desirable complexity. A narrow deep neural network can be trained first, which contains a much smaller than usual number of nodes per layer. Once this small network is trained to convergence, it can be split into a large resolution DNN and then trained further. The details of this splitting process are given in Chapter 6. After splitting, the original width of the small network becomes the size of the linear bottlenecks between the layers of the large network. It is hoped that such a splitting-based initialization is superior to random initialization based pre-training, and that initializing from a smaller model provides resistance against overtraining.
Chapter 3 Discriminative Training

This chapter provides an overview of the discriminative training paradigm. Different popular discriminative training criteria are introduced with their associated properties. Two different approaches for the unification of these criteria into a general equation are discussed. Next, details of their implementation, i.e. equations for the calculation of derivatives and parameter update formulas according to the extended Baum-Welch algorithm, are presented. To reduce the computational complexity of discriminative training, the word lattice approximation and associated equations are also provided. Finally, weighted finite state transducers (WFSTs) are introduced, because the discriminative training experiments in this thesis make use of this approach.
3.1 The Discriminative Paradigm
Some limitations of maximum likelihood (ML), as described in Section 1.1.5.1, particularly the fact that it does not take the competition between classes into account, motivate the pursuit of another training approach. If the distribution of the data does not conform to the model, or the model type is not well-suited to the problem, then maximum likelihood can estimate incorrect model parameters. This motivates us to find another training method whereby reasonable model parameter estimates can be obtained even if the model type is not exactly appropriate to the problem at hand. Such a class of parameter estimation functions is called discriminative training criteria. The name discriminative training derives from the fact that, apart from optimizing each class region itself, a discriminative training objective function also takes into account the data belonging to other classes, and hence enhances discrimination between the competing classes. Therefore, it is hoped that discriminative training can result in better classification performance than ML when the true data distribution is not represented correctly by the model. In the light of this assumption, a substantial gain in classification as compared to ML is possible only when there exists a set of parameters that can model the mismatch between the model type and the data generating stochastic process. In other words, the mismatch between ML and the true distribution should not be random but should lend itself to be modelled, so as to generalize well to unseen data. The difference between classification with ML and discriminatively trained models is explained by an example below.
Figure 3.1: Maximum likelihood estimates for a two-class problem with a class-independent full covariance matrix (a) and a class-independent diagonal covariance matrix (b) [Macherey 10]
3.1.1 Maximum Likelihood versus Discriminative Training for a Two-Class Problem
To explain how discriminative training can be advantageous in the case of an imperfect acoustic model, a two-class problem is presented as in [Macherey 10]. Suppose we have a problem where both classes have true Gaussian probability density functions represented by the parameter set θ. A set of 1000 samples from each class s = {1, 2} has been generated by sampling according to the respective probability density functions. The resulting set of two-dimensional points is shown in Figure 3.1. The Gaussian conditional probabilities are:
\[ p(x|s) = \mathcal{N}(x|\mu_s, \Sigma) \qquad (3.1) \]
The corresponding means {μ_1, μ_2} are shown as central points, and the single pooled covariance Σ of both classes as the ellipse of the contour at p(x|s) = 0.5. From Bayes' rule, the posterior probabilities are:
\[ p(s|x) = \frac{p(s)\,p(x|s)}{\sum_{s'} p(s')\,p(x|s')}, \qquad s \in \{1, 2\} \qquad (3.2) \]
For simplification we keep the prior probabilities of both classes equal i.e. p(s) = 0.5. The Gaussian parameter values are
Figure 3.2: Maximum mutual information estimates for a two-class problem with a class-independent diagonal covariance matrix (a). In (b) the ML estimation of the diagonal covariance matrix is used [Macherey 10]
" µ1 =
−0.5 0.5
"
# ,
µ2 =
0.1 0.1
"
# ,
Σ=
0.9 0.8 0.8 0.9
# (3.3)
Using this model and the corresponding class posteriors, the decision boundary between these classes is as shown in Figure 3.1 (a), which also corresponds to the correct decision boundary for the true probability distributions. Now let us introduce some assumptions that make the model deviate from the sample data. If a diagonal covariance matrix is assumed (as is common in speech recognition and other applications), then the resulting covariance contour changes significantly. This in turn causes the decision boundary to change for the worse, as in Figure 3.1 (b). Due to the suboptimal model there are some recognition errors for samples of both classes. To test the discriminative approach for this classification problem, the maximum mutual information criterion is optimized. MMI is defined as:
\[ F_{\text{MMI}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \log p_\theta(s_r|x_r) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p(s_r)\,p(x_r|s_r)}{\sum_{\hat{s}} p(\hat{s})\,p(x_r|\hat{s})} \qquad (3.4) \]
Equation (3.4) means that for the MMI criterion the class posterior probabilities of all samples are maximized with respect to their respective class labels, while the posteriors of the competing classes are simultaneously minimized. The parameter set θ contains two types of parameters, i.e. means and covariances; therefore we examine the effect on the decision boundary of training either one or both of them. As shown in Figure 3.2 (a), if the
mean vectors and diagonal covariances are optimized by MMI, their positions are shifted so as to optimize the decision boundary and minimize the classification error. This is desirable even if these new values deviate from the true mean values of the sample data. In a similar way, if only the means are trained while using the diagonal covariance from the ML step, the means shift as shown in Figure 3.2 (b) to improve the decision boundary between the two classes.
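The following self-contained NumPy sketch reproduces the flavour of this toy experiment under the parameters of Equation (3.3): ML means with a mismatched pooled diagonal covariance, followed by plain gradient ascent on the MMI criterion for the means only. The random seed, learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# True class distributions from Equation (3.3)
mu_true = np.array([[-0.5, 0.5], [0.1, 0.1]])
Sigma_true = np.array([[0.9, 0.8], [0.8, 0.9]])
X = np.vstack([rng.multivariate_normal(m, Sigma_true, 1000) for m in mu_true])
y = np.repeat([0, 1], 1000)

# ML estimates under a (mismatched) pooled diagonal covariance assumption
mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
var = X.var(axis=0)                               # pooled diagonal covariance

def log_lik(X, mu, var):
    # log N(x | mu_k, diag(var)) for every sample and class, up to a constant
    d = X[:, None, :] - mu[None, :, :]
    return -0.5 * np.sum(d * d / var, axis=2)

# MMI training of the means only (equal priors), by plain gradient ascent
for _ in range(200):
    post = np.exp(log_lik(X, mu, var))
    post /= post.sum(axis=1, keepdims=True)        # p(k | x_n)
    target = np.eye(2)[y]
    for k in (0, 1):
        # dF_MMI/dmu_k = (1/R) sum_n (delta(y_n,k) - p(k|x_n)) (x_n - mu_k)/var
        w = target[:, k] - post[:, k]
        grad = (w[:, None] * (X - mu[k]) / var).sum(axis=0) / len(X)
        mu[k] += 0.5 * grad
```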
3.2 Frame-level Discriminative Training
The frame-level criterion used in the scope of this thesis is the context independent maximum mutual information (MMI) criterion. This objective function only takes into account the competition between different phone classes; contextual information such as the language model and state transition penalties is not taken into account. It is defined as:
\[ F^{(\text{frame})}(\theta, A) = -\tau_\theta \|\theta\|^2 + \sum_{r=1}^{R}\sum_{t=1}^{T_r} w_{s_t} \log p_\theta(s_t|x_t) \qquad (3.5) \]
where p_θ(s_t|x_t) are the state posterior probabilities of state s_t given feature vector x_t. Here the state parameters are θ_s = {μ_s, Σ_s} with a fixed alignment s_1^T. τ_θ is a regularization parameter. The w_s are state weights which could be tuned to give less weight to accumulations of e.g. noise and silence states. R is the total number of sentences in the training corpus. The state posterior probability is defined according to Bayes' theorem:
\[ p_\theta(s|x) = \frac{p(s)\,p_\theta(x|s)}{\sum_{s'} p(s')\,p_\theta(x|s')} \qquad (3.6) \]
This objective function has interesting convergence properties, as will be discussed in detail in Chapter 4. For a fixed feature-vector to HMM-state alignment s_1^T and a single density per CART state s, there is exactly one global optimum, which can be estimated robustly. This has also been verified experimentally [Heigold 10]. The state priors are used for training, but can be removed later for recognition, because for recognition purposes sentence priors from the language model are used to incorporate contextual information.
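As an illustration, the sketch below evaluates the frame-level criterion of Equation (3.5) for one utterance from precomputed log class-conditional scores; the argument names and the dense score matrix are assumptions made for this example.

```python
import numpy as np
from scipy.special import logsumexp

def frame_mmi_objective(scores, log_prior, align, weights, theta_l2, tau):
    """Frame-level MMI criterion of Equation (3.5) for one utterance.

    scores    : (T, S) log p_theta(x_t | s) for every frame and state
    log_prior : (S,)   log state priors p(s)
    align     : (T,)   fixed state alignment s_1^T (integer state indices)
    weights   : (S,)   state weights w_s
    theta_l2  : squared L2 norm of the model parameters
    tau       : regularization constant tau_theta
    """
    joint = scores + log_prior                                  # log p(s) p(x_t|s)
    log_post = joint - logsumexp(joint, axis=1, keepdims=True)  # Eq. (3.6)
    frames = np.arange(len(align))
    return -tau * theta_l2 + np.sum(weights[align] * log_post[frames, align])
```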
3.3 Sentence-level Discriminative Training Criteria

3.3.1 Maximum Mutual Information
The MMI criterion is defined as the sum of logarithms of the posterior probabilities of all training utterances. For a corpus containing r = 1, ..., R training utterances, with each utterance r having acoustic feature vector string X_r = x_{r,1}, ..., x_{r,T_r} and corresponding transcription W_r = w_{r,1}, ..., w_{r,N_r}, the MMI objective function is:
\[ F_{\text{MMI}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \log p_\theta(W_r|X_r) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p(W_r)\,p_\theta(X_r|W_r)}{\sum_{W \in M_r} p(W)\,p_\theta(X_r|W)} \qquad (3.7) \]
where θ is the acoustic model parameter set and provides the class conditional probability p_θ(X|W). The priors p(W) are obtained from the language model and are assumed to be constant for this optimization problem. M_r is the set of all possible word sequences. The conditional probability p(X_r|W) is calculated from the frame-wise product of HMM state transition and emission probabilities of an alignment between W and X_r:
\[ p(X_r|W) = \max_{s_1^{T_r}|W} \prod_{t=1}^{T_r} p(s_t|s_{t-1})\, \mathcal{N}_\theta(x_t|\mu_{s_t}, \Sigma_{s_t}) \qquad (3.8) \]
The name maximum mutual information comes from the information theoretic concept of mutual information I(X, W), which measures the amount of information obtained when data X is transmitted on a noisy channel and labels W are observed at the output:
\[ I(X, W) = \sum_{X,W} p(X, W) \log \frac{p(X, W)}{p(X)\,p(W)} \qquad (3.9) \]
Since the prior probabilities of the data labels, i.e. the language model, are fixed, the maximization of the mutual information is equal to that of the conditional entropy H(W|X):
\[ H(W|X) = \frac{1}{R} \sum_{r=1}^{R} \log p(W_r|X_r) \qquad (3.10) \]
Looking at MMI objective function in Equation (3.7), it is clear that the logarithmic terms will react strongly to those sentences where the posterior probability of that sentence is low. This is the case for those data outliers which are not properly modelled by the acoustic model parameters. Therefore the optimization will try to correct those outliers and a small change in their posterior probability will be reflected strongly in the objective function. The posterior probabilities of utterances will be most influenced by a finite number of competing word sequences in M that match in some degree to the correct transcription. The other word sequences have close to zero probabilities and can be neglected. This property is exploited in making the denominator in the equation computationally tractable. Such approximations are lattices [Ney & Aubert 94] and N-best lists. According to [He & Deng 08], MMI is a discriminative performance measure at superstring level, because it tries to improve the conditional likelihood of entire training data instead of each utterance individually. This can be seen from the product form in Equation (3.7) (logarithmic sum). The objective function takes on a value between 0 and 1 based on its product terms. The MMI training criterion maximizes the posterior probabilities of correct word sequences given the acoustic feature vectors. In reality, however, we are more interested in minimizing the error metrics like word error rate. The next class of training criteria tries to minimize the expectation of an error loss function on the training data. Since the objective function is to be maximized, the loss function can be replaced by its respective accuracy function A(Wr , W).
3.3.2 Minimum Classification Error
The minimum classification error (MCE) criterion minimizes the expectation of a smoothed sentence error on the training data. The probability of making a classification error on training utterance r is given by the local error probability:
\[ p_B(e_r) = 1 - p(W_r|X_r) \qquad (3.11) \]
\[ = \frac{\sum_{W \neq W_r} p(W)\,p(X_r|W)}{\sum_{W} p(W)\,p(X_r|W)} \qquad (3.12) \]
\[ = \left( 1 + \frac{p(W_r)\,p(X_r|W_r)}{\sum_{W \neq W_r} p(W)\,p(X_r|W)} \right)^{-1} \qquad (3.13) \]
This is in the form of a sigmoid or logistic function. Therefore the smoothing function is
\[ f_\rho(z) = \frac{1}{1 + e^{2\rho z}} \qquad (3.14) \]
where ρ is a slope parameter to be chosen. Summing over all training utterances and smoothing with f_ρ(z) yields the MCE objective function:
\[ F_{\text{MCE}}(\theta) = -\sum_{r=1}^{R} f_\rho\!\left( \log \frac{p(W_r)\,p_\theta(X_r|W_r)}{\sum_{W \neq W_r} p(W)\,p_\theta(X_r|W)} \right) \qquad (3.15) \]
Given that the slope of f_ρ(z) is made sufficiently steep by adjusting the parameter ρ, the derivative will be small for those training utterances which have high probabilities and conform to the acoustic model. On the other hand, those sentences which lie near the decision boundary, and therefore have large derivatives, will be affected most by the optimization: a small change in the parameter set causes a relatively large change in the objective function for these borderline sentences, and their corresponding parameters are therefore more likely to be adjusted. For those outlier training sentences which match the acoustic model least, the optimization will again not be affected significantly, as they lie away from the high-derivative region of the smoothing function. Unlike MMI, the MCE objective function is a sum of rational functions and therefore optimizes the conditional likelihood on the training data at the string level. The objective function can take a value between 0 and R. This string level optimization concept also holds for MWE and MPE, introduced next.
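For illustration, a minimal sketch of Equation (3.15) given per-utterance log scores of the reference and of the summed competitors; the function and argument names are assumptions for this example.

```python
import numpy as np

def mce_objective(log_num, log_den, rho=1.0):
    """MCE criterion of Equation (3.15) from per-utterance log scores.

    log_num : (R,) log p(W_r) p_theta(X_r | W_r) of the reference
    log_den : (R,) log of the sum over competing hypotheses W != W_r
    """
    z = log_num - log_den                         # log-ratio inside f_rho
    return -np.sum(1.0 / (1.0 + np.exp(2.0 * rho * z)))
```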
3.3.3 Minimum Word Error and Minimum Phone Error
The MWE and MPE criteria are motivated by Bayes’ rule for minimum expected loss. An approximation of word error or phone error respectively is to be minimized by these objective functions. The MWE criterion measures the average word transcription accuracy over all training sentences:
\[ F_{\text{MWE}}(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_\theta(W|X_r)\, A_w(W, W_r) \qquad (3.16) \]
where A_w(W, W_r) is the total number of words in the reference transcription W_r minus the number of word errors in the given sentence W. The posterior probability is defined as for the previously mentioned criteria:
\[ p_\theta(W|X_r) := \frac{p(W)\,p_\theta(X_r|W)}{\sum_{V \in M_r} p(V)\,p_\theta(X_r|V)} \qquad (3.17) \]
Similarly, the MPE objective function is:
\[ F_{\text{MPE}}(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_\theta(W|X_r)\, A_p(W, W_r) \qquad (3.18) \]
where A p (W, Wr ) is the phone accuracy i.e. total number of phones in the reference transcription minus the number of phone errors in the given hypothesis. MWE and MPE criteria were proposed by [Na & Jeon+ 95, Povey 03]. In [Povey & Woodland 02] lattice-based implementations were provided for MWE and MPE, and it was found that discriminative MPE training helps achieve better WER than MMI training.
3.4 Unified Forms of Discriminative Training Criteria
In [Schl¨uter & Macherey+ 01], an approach for a unified criterion for MMI and MCE was proposed, which was extended in [Macherey & Haferkamp+ 05] to MPE and MWE training criteria. Let (X, W) := {(Xr , Wr )} for r = 1, ...R training data utterances. The general form of a discriminative training criterion is:
R
1X f log F (X, W|θ; f, κ, G, {Mr }) = R r=1
P
pκ (W)pκθ (Xr |W) G(W, Wr ) P κ κ W∈Mr p (W)pθ (Xr |W)
W
!1/κ ! (3.19)
where f denotes the smoothing function and κ is a weighting exponent for the sentence probabilities. A value of κ = 1 yields the different criteria in the forms presented in Section 3.3. For larger values of κ, the word sequences in M_r having high posterior probabilities will dominate the rest. For κ = ∞, only the best recognizable hypothesis is considered. Such a modification of MMI is called corrective training [Bahl & Brown+ 88] and that of MCE is called falsifying training [Macherey 98]. G(W, W_r) is a gain function that differentiates the discriminative criteria; it is the opposite of some error metric such as the Levenshtein distance or the sentence error. Table 3.1 gives an overview of the differences between the criteria with respect to the parameters of Equation (3.19). Here δ(W, W_r) is 1 if W = W_r and 0 otherwise.
Table 3.1: Comparison of different training criteria under Schlüter & Macherey's unifying approach

Criterion   f(z)                  M_r                G(W, W_r)
MMI         z                     all                δ(W, W_r)
MCE         −1/(1 + e^{2ρz})      all without W_r    δ(W, W_r)
MWE         e^z                   all                A_w(W, W_r)
MPE         e^z                   all                A_p(W, W_r)
Table 3.2: Comparison of different training criteria under He & Deng's unifying approach

Criterion   G(w_r)            G(w)
MMI         δ(w_r, W_r)       ∏_{r=1}^{R} G(w_r)
MCE         δ(w_r, W_r)       ∑_{r=1}^{R} G(w_r)
MWE         A_w(w_r, W_r)     ∑_{r=1}^{R} G(w_r)
MPE         A_p(w_r, W_r)     ∑_{r=1}^{R} G(w_r)
Another unified form in the literature is given by [He & Deng 08]. A superstring X = X_1, ..., X_R is the concatenation of all acoustic observations, W = W_1, ..., W_R is the concatenation of the correct reference transcriptions, and w = w_1, ..., w_R is an arbitrary word sequence that can be aligned with X. Then the unified objective function is:
\[ F(\theta) = \frac{\sum_{w_1,...,w_R} p_\theta(X_1, ..., X_R, w_1, ..., w_R)\, G(w)}{\sum_{w_1,...,w_R} p_\theta(X_1, ..., X_R, w_1, ..., w_R)} \qquad (3.20) \]
The above unified form is a rational function, therefore all the criteria that conform to this form can be optimized by growth transformation based algorithms (discussed in Section 3.5.1). The criteria differ only in the term G(w) which does not depend on the acoustic observations X and the parameter set θ. The distinguishing terms for different criteria are shown in Table 3.2.
3.5 Parameter Optimization
In [He & Deng 08] GMM parameter optimization equations for HMM based Gaussian acoustic models are provided. The function in Equation (3.20) for HMM does not have a closed form solution, therefore gradient-based numerical methods are needed for optimization. However, because this is a rational function, more efficient optimization techniques like growth transformation method are also applicable.
Growth Transformation Method The growth transformation based optimization is an iterative algorithm which constructs an auxiliary function to indirectly optimize the main objective function. The auxiliary function is usually easier to optimize than the original one. For a given parameter set θ, let G(θ) and H(θ) be two real-valued functions, with H(θ) positive. If the objective function can be represented in the form
\[ O(\theta) = \frac{G(\theta)}{H(\theta)} \qquad (3.21) \]
then a growth transformation method exists for its optimization. The auxiliary function is defined as:
\[ F(\theta; \theta') = G(\theta) - O(\theta')\,H(\theta) + D \qquad (3.22) \]
where θ are the parameters of the current iteration and θ' are the parameter values from the previous iteration. D is an optimization constant used for convergence control. It can be shown that
\[ F(\theta; \theta') - F(\theta'; \theta') = H(\theta)\left( O(\theta) - O(\theta') \right) \qquad (3.23) \]
Since H(θ) is positive, an increase in F(θ; θ') also results in an increase in O(θ).
3.5.1 Extended Baum-Welch Algorithm
For the special case of the HMM, the growth transformation based optimization is called the extended Baum-Welch algorithm. The objective function F(θ) to be optimized is as in Equation (3.20). For Gaussian emission probabilities, it can be shown that the growth transformation auxiliary function is:
\[ U(\theta; \theta') = \sum_{w} p_{\theta'}(w|X)\left( G(w) - F(\theta') \right) \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{s} \gamma_{r,t}(s|w)\, \log p(x_{r,t}|\theta_s) \qquad (3.24) \]
\[ \qquad\qquad + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{s} d(r,t,s) \int_{X_{r,t}} p(X_{r,t}|\theta'_s)\, \log p(X_{r,t}|\theta_s)\, dX_{r,t} \qquad (3.25) \]
γ_{r,t}(s|w) can be obtained from the efficient forward-backward algorithm as in Section 1.1.5.1, and
\[ d(r,t,s) = \sum_{w} d'(s)\, p(s_{r,t} = s \,|\, w, \theta') \qquad (3.26) \]
where the d'(s) are state dependent convergence constants. For details refer to [He & Deng 08]. Taking the partial derivatives of U(θ; θ') with respect to μ_s and Σ_s and setting them to zero yields
the parameter update formulas. Initial values of these parameters can be obtained by an ML parameter estimation step. Since, unlike ML, no closed form solution is available, the parameters are refined by merging the new update with their values from the previous iteration:
\[ \mu_s = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r} \Delta\gamma_{r,t}(s)\, x_{r,t} + D_s \mu'_s}{\sum_{r=1}^{R}\sum_{t=1}^{T_r} \Delta\gamma_{r,t}(s) + D_s} \qquad (3.27) \]
\[ \Sigma_s = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r} \Delta\gamma_{r,t}(s)\,(x_{r,t}-\mu_s)(x_{r,t}-\mu_s)^\top + D_s \Sigma'_s + D_s (\mu_s - \mu'_s)(\mu_s - \mu'_s)^\top}{\sum_{r=1}^{R}\sum_{t=1}^{T_r} \Delta\gamma_{r,t}(s) + D_s} \qquad (3.28) \]
\[ \Delta\gamma_{r,t}(s) = \sum_{w} p(w|X, \theta')\left( G(w) - F(\theta') \right) \gamma_{r,t}(s|w) \qquad (3.29) \]
\[ D_s = \sum_{r=1}^{R}\sum_{t=1}^{T_r} d(r,t,s) \qquad (3.30) \]
These formulas are unified parameter estimation formulas for MMI, MCE, MWE and MPE. The term ∆γr,t (s) differs among the criteria because of using G(w) in its calculation. Extension to Gaussian mixture densities i.e. θ s,l can be done as for ML training in Section 1.1.5.1.
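A minimal sketch of the update equations (3.27) and (3.28) for a single Gaussian, assuming the signed statistics Δγ have already been accumulated over the training data; the argument names and the scalar D_s handling are illustrative assumptions.

```python
import numpy as np

def ebw_update(delta_gamma, x_sum, x_sq_sum, mu_old, sigma_old, D_s):
    """Extended Baum-Welch re-estimation of one Gaussian (Eqs. 3.27, 3.28).

    delta_gamma : scalar, sum over r,t of Delta gamma_{r,t}(s)
    x_sum       : (D,)   sum over r,t of Delta gamma_{r,t}(s) * x_{r,t}
    x_sq_sum    : (D, D) sum over r,t of Delta gamma_{r,t}(s) * x x^T
    mu_old, sigma_old : parameters from the previous iteration
    D_s         : iteration constant, chosen large enough for convergence
    """
    denom = delta_gamma + D_s
    mu_new = (x_sum + D_s * mu_old) / denom
    # second-order statistics re-centred around the new mean
    scatter = (x_sq_sum - np.outer(x_sum, mu_new) - np.outer(mu_new, x_sum)
               + delta_gamma * np.outer(mu_new, mu_new))
    sigma_new = (scatter
                 + D_s * sigma_old
                 + D_s * np.outer(mu_new - mu_old, mu_new - mu_old)) / denom
    return mu_new, sigma_new
```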
3.6 Weighted Finite State Transducer Framework
The discriminative training framework used in the scope of this thesis is based on a weighted finite state transducer (WFST) implementation. A finite state transducer (FST) is basically a sequence detector: a state graph which takes an input sequence of symbols, traverses a particular chain of states based on this input and emits a sequence of output symbols. The output for a given input sequence may be generated non-deterministically, or there may be no output at all. In general, an FST defines a relation between two formal languages. The edges of an FST may carry weights in addition to input and output symbols; it is then called a weighted FST. If a WFST takes only input and generates no output, it is an acceptor. WFSTs provide, for example, an elegant method for representing problems of statistical non-linear sequence alignment such as hidden Markov model based speech recognition. This section provides an introduction to RWTH's WFST framework used for discriminative training. Formally, a weighted finite state transducer is a 7-tuple
\[ T := \left( \Sigma_{\text{in}}, \Sigma_{\text{out}}, (K, \oplus, \otimes, \bar{0}, \bar{1}), S, I, F, E \right) \qquad (3.31) \]
where K denotes a field, Σ_in is the input alphabet and Σ_out is the output alphabet. S ⊂ ℕ is the set of traversable states, where ℕ is the set of natural numbers. I ∈ S × K is a unique initial state. F ∈ S × K are the final states. E ⊂ S × {Σ_in ∪ {ε}} × {Σ_out ∪ {ε}} × K × S is the set of edges connecting the states S. Next we define the semiring, which provides the basic set of mathematical operations on the WFST. (K, ⊕, ⊗, 0̄, 1̄) is a semiring iff:
Table 3.3: Definition of operations for different semirings over R semiring
K
x⊕y
x⊗y
0¯
1¯
inv(x)
probability
R+
x+y
x·y
0
1
1 x
log
R ∪ {−∞, +∞}
− log(exp(−x) + exp(−y))
x+y
+∞
0
-x
tropical
R ∪ {−∞, +∞}
min{x, y}
x+y
+∞
0
-x
Table 3.4: Definition of operations for the expectation semiring over R+ × R

semiring      K         (p, v) ⊕ (p', v')     (p, v) ⊗ (p', v')            0̄         1̄         inv(p, v)
expectation   R+ × R    (p + p', v + v')      (p · p', p · v' + p' · v)    (0, 0)     (1, 0)     (1/p, −v/p²)
• (K, ⊕, 0̄) is a commutative monoid: (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z), 0̄ ⊕ x = x ⊕ 0̄ = x and x ⊕ y = y ⊕ x.
• (K, ⊗, 1̄) is a monoid: (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z) and 1̄ ⊗ x = x ⊗ 1̄ = x
• ⊗ distributes over ⊕: x ⊗ (y ⊕ z) = (x ⊗ y) ⊕ (x ⊗ z) and (x ⊕ y) ⊗ z = (x ⊗ z) ⊕ (y ⊗ z)
• 0̄ is an annihilator for ⊗: 0̄ ⊗ x = x ⊗ 0̄ = 0̄

Additionally, a semiring may have an inverse such that inv(x) ⊗ x = 1̄ for any x ∈ K. Figure 3.3 shows an example of a WFST. Each edge is labelled by input symbol, output symbol and weight respectively. Table 3.3 shows the mathematical operations for different common semirings. Another important semiring is the expectation semiring [Eisner 01], which is used to calculate expectations for transducer-based MMI training. Its operations are shown in Table 3.4. Here p behaves like the probability semiring and v can be used to represent an additive random variable like the word error.
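As a small illustration, the sketch below implements the ⊕ and ⊗ operations of the log and tropical semirings from Table 3.3 in negative-log space; the class and method names are illustrative assumptions.

```python
import math

class LogSemiring:
    zero, one = math.inf, 0.0                       # 0-bar and 1-bar
    @staticmethod
    def plus(x, y):                                 # -log(exp(-x) + exp(-y))
        m = min(x, y)
        if m == math.inf:
            return math.inf
        return m - math.log1p(math.exp(-abs(x - y)))
    @staticmethod
    def times(x, y):
        return x + y

class TropicalSemiring:
    zero, one = math.inf, 0.0
    plus = staticmethod(min)                        # Viterbi: keep the best path
    times = staticmethod(lambda x, y: x + y)

# Combining two arc weights, and merging two alternative paths:
w = LogSemiring.times(2.3, 0.7)          # 3.0
best = TropicalSemiring.plus(2.3, 0.7)   # 0.7
```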
Figure 3.3: Diagram of a word lattice resulting from recognition of audio utterance “I like this meal”
3.7 Word Lattices
As indicated before in Section 3.3, the sentence-level discriminative training criteria require summation over all possible sequences of words for objective function calculation. This can be prohibitively expensive if iterated over all feature vectors and all the aligned states. Word lattices represent a subspace of the full search space containing only sentences with posterior probabilities above a certain threshold. Lattices offer a more compact representation of search space as compared to N-best lists, because many sequences in N-best lists differ from each other only slightly. Lattices on the other hand have such word sequences joined in a word graph form thus removing the redundancy and hence speeding up calculations. Word lattices can be represented as acyclic WFSTs with the language pronunciation lexicon as its vocabulary. Each state in a word lattice holds word boundary information. This includes the time frame at the start and end of words and also the acoustic context across word boundaries. Edges between words can be marked with either language model, acoustic model or combined log scores. Assuming that we only concern ourselves with Viterbi i.e. the probability of a word sequence depends only on the best aligned HMM state sequence, the suitable semiring is the tropical semiring (as in Table 3.3 ). In case we need to calculate the HMM forwardbackward probabilities through the state graph without maximum approximation, a log semiring can be employed. Lattice generation and processing tasks in the scope of this thesis have been performed using the RWTH speech recognition toolkit (RASR). For more details about lattice processing in general and by this toolkit, refer to [He & Deng+ 08, Rybach & Gollan+ 09]. For using WFST’s in discriminative training, there are some special considerations. Different types of silence and noise states can cause extra edges in the graph and duplicate hypotheses which should be removed for good performance. There exist algorithms to filter the noise and silence edges as preprocessing steps [Wessel & Schl¨uter+ 01, Hoffmeister & Klein+ 06]. Also, noise and silence states tend to compete with each other during discriminative training and therefore can cause problems in accuracy computation of MPE training. Therefore for MPE one could merge the noise and silence states into one CART leaf. This proposition is explained in more detail in Section 4.2.
Chapter 4 Log-Linear Acoustic Models

The maximum entropy principle provides a general purpose machine learning technique for the classification and prediction of data. The idea of maximum entropy is that, given a set of training samples, we estimate the distribution parameters while satisfying some constraints; other than the constraints, we do not assume anything else. The resulting distribution should be as close to the uniform distribution as possible, subject to constraint satisfaction. It can be shown that the resulting distribution leads to a log-linear model [Heigold 10]. The advantage of ME is the fact that we can combine information from multiple knowledge sources and solve for a large number of free parameters in a single model. This chapter is organized as follows: Sections 4.1 through 4.3 present an overview of the maximum entropy principle, log-linear acoustic models and training methods. Sections 4.4 and 4.5 present our work in this context, and some observations.
4.1 Introduction
Conditional maximum entropy models can be represented as
\[ p_\Lambda(y|x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)} \qquad (4.1) \]
where x is an input vector and y is the output, which could be a class or a state. The functions f_i(x, y) are called feature functions and they are a way of specifying constraints on the model. A feature function has a positive value if some property relating x and y is satisfied, and zero otherwise. The parameters Λ = {λ_i} are weighting factors for their respective feature functions f_i(x, y). Given a set of training samples (x_n, y_n), the objective function of ME is
\[ G(\Lambda) := \sum_{n=1}^{N} \log p_\Lambda(y_n|x_n) = \sum_{x,y} N(x, y) \log p_\Lambda(y|x) \qquad (4.2) \]
where N(x, y) is the number of occurrences of the pair (x, y) in the training data. It is a property of ME that the model in Equation (4.1) maximizes the probability of the training data. G(Λ) is a sum of concave functions and is therefore also concave, so we are guaranteed to find the global maximum. This is one advantage over the EM algorithm, where we can only find a local maximum. As an example, suppose that in a language translation task we need to know the context in order to translate a word. For example, the word "note" could be translated to different words in another language, depending on the words around it. Let x represent the context information from neighbouring words and y be the context name (e.g. music, literature or financial). If we have the word "currency" in the context, then f_currency(x, financial) would evaluate to true, and its corresponding λ_i will take a large value. Similarly f_bank(x, financial) could also evaluate to true, but its λ_i could be smaller because it is less frequent. The algorithms used for maximum entropy parameter estimation include generalized iterative scaling and improved iterative scaling, as well as general purpose optimization techniques such as gradient ascent, conjugate gradient, and variable metric methods.
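For illustration, a minimal sketch of the conditional model of Equation (4.1) with a dense feature representation; the names and array shapes are assumptions made for this example.

```python
import numpy as np
from scipy.special import logsumexp

def me_posteriors(lambdas, feats):
    """Conditional maximum entropy model of Equation (4.1).

    lambdas : (F,)    weights lambda_i
    feats   : (Y, F)  feature function values f_i(x, y) for one input x
                      and every candidate output y
    Returns p_Lambda(y | x) for all y.
    """
    scores = feats @ lambdas                     # sum_i lambda_i f_i(x, y)
    return np.exp(scores - logsumexp(scores))
```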
Convexity property A desirable property of the aforementioned log-linear models is that they have a global optimum according to the maximum entropy principle. This means that any local optimum of the solution must be a global optimum, and it can theoretically be reached regardless of the initial values of the model parameters. This special property can make the optimization easier than in the general case. Formally, for a real vector space with a real-valued function f defined on a convex subset X of it,
\[ f : X \to \mathbb{R} \qquad (4.3) \]
the optimization problem is to find any point x* in X for which the value of the function f(x) is smallest, i.e.
\[ f(x^*) \le f(x) \quad \forall\, x \in X \qquad (4.4) \]
In the following, a list of optimization methods for log-linear models is provided, including specific methods like generalized iterative scaling and improved iterative scaling, as well as general-purpose numerical optimization methods like RPROP. Due to its robustness, all the log-linear optimization experiments in the scope of this thesis are performed with the RPROP algorithm.
4.1.1 Generalized Iterative Scaling Algorithm (GIS)
GIS as proposed originally by [Darroch & Ratcliff 72] is an iterative algorithm for maximum entropy learning. We need to maximize the function given in Equation (4.2) by finding optimum values of Λ. For this, we take partial derivatives of G with respect to λi
\[ \frac{\partial G}{\partial \lambda_i} = N_i - Q_i(\Lambda) \qquad (4.5) \]
\[ N_i := \sum_{x,y} N(x, y) \cdot f_i(x, y) \qquad \text{and} \qquad Q_i(\Lambda) := \sum_{x} N(x) \sum_{y} p_\Lambda(y|x) \cdot f_i(x, y) \qquad (4.6) \]
Here N_i is the observed count of feature occurrences and Q_i is the expected count. For each iteration we need a constant number of active features, which can be achieved by adding a correction feature f_0:
\[ f_0(x, y) := F - \sum_i f_i(x, y), \qquad F := \max_{x,y} \sum_i f_i(x, y) \qquad (4.7) \]
F does not depend on the parameters, so it can be determined beforehand on the training data. The update Δλ_i can be calculated by solving the equation
\[ Q_i(\Lambda) \cdot \exp(\Delta\lambda_i F) = N_i \qquad (4.8) \]
At each iteration, GIS adjusts the parameters to increase the likelihood of the training data. The step size is guaranteed to be neither too large nor too small. For a given feature function f_i, we can count how many times it was observed in the training data, observed[i] = Σ_j f_i(x_j, y_j). We can also estimate the same count from our ME model p_Λ by expected[i] = Σ_{j,y} p_Λ(y|x_j) f_i(x_j, y). ME models have the property that observed[i] = expected[i] ∀ i. The GIS algorithm tries to achieve this equality by moving the two counts closer at every iteration. Since the λ_i are free parameters, an apparent solution is to add log(observed[i]/expected[i]) to λ_i. This would work if there were just one parameter, but because every parameter λ_i affects the objective function, adjusting one parameter by a large value can upset the other parameters and lead to instability. For this reason there is a slowing factor f# equal to the largest total value of the f_i. The update increment δ_i is added to λ_i in each iteration:
\[ \delta_i = \frac{\log(\text{observed}[i]/\text{expected}[i])}{f^{\#}}, \qquad f^{\#} = \max_{j,y} \sum_i f_i(x_j, y) \qquad (4.9) \]
When applying this algorithm in practice, we keep a data structure that, for each training input x_j and output y, contains the indices i such that f_i(x_j, y) ≠ 0. GIS is much slower than other training methods like EM, but its strength lies in the fact that it can train a large number of parameters simultaneously.
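A minimal sketch of one GIS iteration (Eq. 4.9) for a small dense problem; it assumes non-negative feature functions with non-zero observed and expected counts, and the array layout is an illustrative choice.

```python
import numpy as np

def gis_step(lambdas, feats, labels):
    """One GIS update for a dense feature representation.

    feats   : (N, Y, F) feature values f_i(x_j, y) for every sample and output
    labels  : (N,)      observed outputs y_j
    lambdas : (F,)      current model weights
    """
    N, Y, F = feats.shape
    f_sharp = feats.sum(axis=2).max()                 # slowing factor f#

    # observed[i] = sum_j f_i(x_j, y_j)
    observed = feats[np.arange(N), labels].sum(axis=0)

    # expected[i] = sum_j sum_y p_lambda(y|x_j) f_i(x_j, y)
    scores = feats @ lambdas                          # (N, Y)
    post = np.exp(scores - scores.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    expected = (post[:, :, None] * feats).sum(axis=(0, 1))

    return lambdas + np.log(observed / expected) / f_sharp
```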
4.1.2 Improved Iterative Scaling (IIS)
The IIS algorithm optimizes the same objective function as the GIS algorithm, with the difference that it relaxes the optimization criterion by bounding the change in the objective function [Berger 97]. In turn it achieves independence between the free parameters, which results in faster convergence. Rewriting Equation (4.2) we get
\[ G(\Lambda) := \sum_{x,y} p(x, y) \log p_\Lambda(y|x) \qquad (4.10) \]
It can be seen that G(Λ) is always less than 0 and G(Λ) = 0 will be the optimal solution. Let ∆ = {δi } be the factor that is added to Λ in every iteration. Then with respect to the observed distribution p(x, y), the change in likelihood corresponding to a change in parameters will be
\[ G(\Lambda + \Delta) - G(\Lambda) = \sum_{x,y} p(x, y) \log p_{\Lambda'}(y|x) - \sum_{x,y} p(x, y) \log p_\Lambda(y|x) = \sum_{x,y} p(x, y) \sum_i \delta_i f_i(x, y) - \sum_{x} p(x) \log \frac{Z_{\Lambda'}(x)}{Z_\Lambda(x)} \qquad (4.11) \]
where Λ' = Λ + Δ, and Z_Λ(x) is the normalization term. We use the inequality −log α ≥ 1 − α (∀ α > 0) to establish a lower bound on the change in the objective function:
\[ G(\Lambda + \Delta) - G(\Lambda) \ge \sum_{x,y} p(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_{x} p(x)\, \frac{Z_{\Lambda'}(x)}{Z_\Lambda(x)} \qquad (4.12) \]
Let the right side of the above inequality be called A(Δ|Λ). Substituting the value of Z_Λ(x) and simplifying, it can be written as
\[ A(\Delta|\Lambda) := \sum_{x,y} p(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_{x} p(x)\, \frac{\sum_y \exp\left( \sum_i (\lambda_i + \delta_i) f_i(x, y) \right)}{\sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)} = \sum_{x,y} p(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_{x} p(x) \sum_y p_\Lambda(y|x) \exp\left( \sum_i \delta_i f_i(x, y) \right) \qquad (4.13) \]
Using f#(x, y) := Σ_i f_i(x, y), we can rewrite Equation (4.13) as
\[ A(\Delta|\Lambda) = \sum_{x,y} p(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_{x} p(x) \sum_y p_\Lambda(y|x) \exp\left( f^{\#}(x, y) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)}\, \delta_i \right) \qquad (4.14) \]
Here f_i(x, y)/f#(x, y) is a probability distribution, because it is non-negative and sums to 1 over i. Therefore we can apply Jensen's inequality for a probability distribution p(x),
\[ \exp\left( \sum_x p(x)\, q(x) \right) \le \sum_x p(x) \exp q(x) \qquad (4.15) \]
and it becomes
\[ A(\Delta|\Lambda) \ge \sum_{x,y} p(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_{x} p(x) \sum_y p_\Lambda(y|x) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)} \exp\left( \delta_i f^{\#}(x, y) \right) \qquad (4.16) \]
Let us call the right side of the above inequality B(Δ|Λ). This is another lower bound on the likelihood function, because B(Δ|Λ) ≤ G(Λ + Δ) − G(Λ) by Inequalities (4.12) and (4.16). Differentiating B(Δ|Λ) with respect to δ_i gives
\[ \frac{\partial B(\Delta)}{\partial \delta_i} = \sum_{x,y} p(x, y)\, f_i(x, y) - \sum_{x} p(x) \sum_y p_\Lambda(y|x)\, f_i(x, y) \cdot \exp\left( \delta_i f^{\#}(x, y) \right) \qquad (4.17) \]
The above equation has a useful property that δi appears alone, and therefore the optimization of one parameter is not influenced by other free parameters. We can solve for each parameter by differentiating B(∆|Λ) by each δi separately, and then add these to update the respective λi .
4.1.3 RPROP
For the optimization of the objective function in Equation (3.5) we use the general purpose RPROP algorithm [Riedmiller & Braun 93]. RPROP is an acronym for resilient backpropagation, a method for continuously differentiable training criteria. It is a first order optimization algorithm that takes only the sign of the partial derivatives into account (and not the magnitude). The step size of a parameter is increased if there was no sign change of its partial derivative in the last iteration, and decreased otherwise. There are some advantages of RPROP in comparison to special-purpose algorithms like extended Baum-Welch and GIS. Due to its simple implementation there are not many heuristic parameters to tune, and it has a lower memory requirement since, unlike EBW, the numerator and denominator statistics do not need to be stored separately. Under mild assumptions (e.g. the gradient of the training function must be Lipschitz-continuous), the RPROP algorithm is guaranteed to converge to a local optimum [Anastasiadis & Magoulas+ 05]. For log-linear model optimization with the frame-MMI criterion (Section 4.3.1), this naturally translates to a global optimum because the criterion is convex. In Chapter 6 of [Heigold 10] it has been shown that for log-linear acoustic model training, RPROP outperforms GIS. In all the log-linear experiments in this thesis, the RPROP algorithm is used for optimization.
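For illustration, a simplified sketch of one sign-based RPROP step for a maximization problem, without the weight-backtracking of the original algorithm; the parameter values and names are illustrative assumptions.

```python
import numpy as np

def rprop_step(params, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One RPROP update for maximization (per-parameter, sign-based).

    params, grad, prev_grad, step are arrays of the same shape; `step` holds
    the per-parameter step sizes carried over from the previous iteration.
    """
    sign_change = grad * prev_grad
    # grow the step where the gradient sign is stable, shrink it otherwise
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # move each parameter in the direction of its gradient sign only
    params = params + np.sign(grad) * step
    return params, step, grad.copy()
```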
4.2 Log-Linear Acoustic Modelling
This section provides an overview of the log-linear paradigm for acoustic modelling. The HMM emission probabilities of a pooled-covariance Gaussian acoustic model can be converted into log-linear form. As for a Gaussian mixture model, mixtures can also be introduced for a log-linear model
in the form of a log-linear mixture model (LLMM). This log-linear or LLMM acoustic model can be initialized from a Gaussian single density or mixture density acoustic model, respectively. As with the Gaussian model, discriminative training of the log-linear model can be done either at the frame level (context independent) or at the sentence level (context dependent). As we shall see in the course of this chapter, although the posterior forms of the Gaussian and log-linear models are equivalent, in practice they can differ from each other at recognition due to state prior issues.
4.2.1
Log-Linear Models
Assume that the acoustic representation of a speech frame is denoted by x ∈ R^D, with a label belonging to one of S triphone classes. Each class s = 1, 2, ..., S can be modelled by a Gaussian distribution with parameter set θ_s = {µ_s, Σ_s}, which can be trained by maximum likelihood (ML) estimators. For a Gaussian single density acoustic model, the state posterior probability p(s|x) of state s given feature vector x is given by
p_\theta(s|x) = \frac{p(s)\, p(x|s)}{\sum_{s'} p(s')\, p(x|s')} = \frac{p(s)\, \mathcal{N}(x|\mu_s, \Sigma_s)}{\sum_{s'} p(s')\, \mathcal{N}(x|\mu_{s'}, \Sigma_{s'})}        (4.18)
expanding the Gaussian representation yields
p(s|x) = \frac{\dfrac{p(s)}{\sqrt{(2\pi)^D |\Sigma_s|}} \exp\left( -\tfrac{1}{2}(x - \mu_s)^\top \Sigma_s^{-1} (x - \mu_s) \right)}{\sum_{s'} \dfrac{p(s')}{\sqrt{(2\pi)^D |\Sigma_{s'}|}} \exp\left( -\tfrac{1}{2}(x - \mu_{s'})^\top \Sigma_{s'}^{-1} (x - \mu_{s'}) \right)}        (4.19)
Gaussian models can be converted to log-linear models [Heigold & Lehnen+ 08, Deselaers & Gass+ 11] by collecting the quadratic and first-order terms of x (as well as a constant):
p_\theta(s|x) = \frac{\exp(x^\top \Lambda_s x + \lambda_s^\top x + \alpha_s)}{\sum_{s'} \exp(x^\top \Lambda_{s'} x + \lambda_{s'}^\top x + \alpha_{s'})}        (4.20)
in which θ = {Λ s , λ s , α s , s = 1, 2, . . . , S }. The most conspicuous advantage of log-linear models over Gaussian models lies in the fact that the exponential terms do not strictly need to be probability distributions. Assuming a pooled covariance for the Gaussian model, the quadratic log-linear model can be similarly simplified by setting Λ s = Λ [Heigold 10]:
p_\theta(s|x) = \frac{\exp(\lambda_s^\top x + \alpha_s)}{\sum_{s'} \exp(\lambda_{s'}^\top x + \alpha_{s'})}        (4.21)
The optimization of a log-linear model is a convex problem according to maximum entropy principle [Darroch & Ratcliff 72]. For a single density per state, the corresponding log-linear
model has a global maximum, that can be reached regardless of initial values of parameters. This theoretical property has also been validated experimentally in [Heigold & Rybach+ 09]. Another similar work is [Kuo & Gao 06], although it assumes a different structure of the HMM. A useful property of log-linear models is that they can be used to combine features from different knowledge sources [Fayolle & Moreau+ 10], as the optimum is robust to feature scaling and linear dependencies between different features.
4.2.1.1
Conversion between log-linear and Gaussian acoustic model
For a single density per HMM state, the log-linear parameters can be initialized from a maximum likelihood trained Gaussian acoustic model [Heigold & Ney+ 11]:

\alpha_s = -\frac{1}{2}\mu_s^\top \Sigma^{-1} \mu_s, \qquad \lambda_s = \Sigma^{-1} \mu_s        (4.22)

To convert the log-linear model back to Gaussian form:

\Sigma = \text{any symmetric and positive definite matrix}, \qquad \mu_s = \Sigma \lambda_s, \qquad \log p_0(s) = \alpha_s + \frac{1}{2}\mu_s^\top \Sigma^{-1} \mu_s        (4.23)
4.2.2 Log-linear Mixture Model
In Section 4.2.1 it has been shown that the posterior form of a Gaussian single density acoustic model can be represented as a log-linear model, for a fixed HMM-state to acoustic vector alignment. For the case of Gaussian mixtures, the GMM can still be converted to a log-linear model, with the provision that it now includes hidden variables. For a pooled covariance GMM, this becomes a log-linear mixture model (LLMM). Let the speech feature vectors x_1^T belong to one of s = 1, ..., S generalized triphone classes, derived from a classification and regression tree (CART), each class with Gaussian mixture parameter set θ_{s,l} = {µ_{s,l}, c_{s,l}}. For pooled covariance Σ, the posterior probability becomes [Heigold & Ney+ 11]

p_\theta(s|x) = \frac{p(s)\, p_\theta(x|s)}{\sum_{s'} p(s')\, p_\theta(x|s')} = \frac{p(s) \sum_l c_{s,l}\, \mathcal{N}(x|\mu_{s,l}, \Sigma)}{\sum_{s'} p(s') \sum_l c_{s',l}\, \mathcal{N}(x|\mu_{s',l}, \Sigma)} = \frac{\sum_l \exp(\lambda_{s,l}^\top x + \alpha_{s,l})}{\sum_{s',l} \exp(\lambda_{s',l}^\top x + \alpha_{s',l})}        (4.24)
The new acoustic model parameters are λ_{s,l} ∈ R^D and α_{s,l} ∈ R, for l = 1, ..., L_s mixture components in each class s. This is a log-linear mixture model. While the optimization of a single density log-linear model is a convex problem, the corresponding optimization of an LLMM acoustic model does not guarantee a global optimum, since its functional form does not satisfy the maximum entropy principle. Nevertheless, as we shall see in the experiments and results in Section 4.3, this non-convexity does not pose a significant disadvantage, and the LLMM can still be optimized effectively. It may be initialized from an ML estimated GMM, or it may be initialized by discriminative splitting of a log-linear model (as explained in Section 6.2).
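The final expression of Equation (4.24) is straightforward to evaluate. The following is a small sketch (array shapes and names are illustrative assumptions, not the thesis implementation) of the LLMM state posterior: the exponentiated scores are summed over the densities of each state before normalizing over states.

```python
import numpy as np

def llmm_posterior(x, lam, alpha):
    """LLMM state posterior, last expression of Eq. (4.24).

    x:     (D,)      feature vector
    lam:   (S, L, D) log-linear weights lambda_{s,l}
    alpha: (S, L)    log-linear offsets alpha_{s,l}
    returns p(s|x) of shape (S,)
    """
    scores = np.einsum('sld,d->sl', lam, x) + alpha   # lambda_{s,l}^T x + alpha_{s,l}
    scores -= scores.max()                            # numerical stabilization
    unnorm = np.exp(scores).sum(axis=1)               # sum over densities l
    return unnorm / unnorm.sum()                      # normalize over states s
```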
4.2.2.1
Conversion between log-linear mixture model and Gaussian mixture model
LLMM parameters can be initialized from a GMM acoustic model [Heigold & Ney+ 11]:

\alpha_{s,l} = -\frac{1}{2}\mu_{s,l}^\top \Sigma^{-1} \mu_{s,l} + \log c_{s,l}, \qquad \lambda_{s,l} = \Sigma^{-1} \mu_{s,l}        (4.25)

To convert the LLMM parameters back to a GMM, we use the following equations:

\Sigma = \text{any symmetric and positive definite matrix}, \qquad \mu_{s,l} = \Sigma \lambda_{s,l}, \qquad \log c_{s,l} = \alpha_{s,l} + \frac{1}{2}\mu_{s,l}^\top \Sigma^{-1} \mu_{s,l}
\log c_{s,l} \leftarrow \log c_{s,l} - \log \sum_{l'} c_{s,l'} \quad \text{(normalization)}        (4.26)
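As an illustration of Equations (4.25) and (4.26), the following sketch (array shapes and names are illustrative assumptions) converts a pooled-covariance GMM to LLMM form and back; the normalize flag controls whether the mixture-weight constraint is re-imposed, which, as discussed later, discards the extra degree of freedom of the unnormalized weights.

```python
import numpy as np

def gmm_to_llmm(mu, log_c, sigma_inv):
    """Eq. (4.25): mu (S,L,D) means, log_c (S,L) log weights, sigma_inv (D,D) pooled precision."""
    lam = np.einsum('de,sle->sld', sigma_inv, mu)              # lambda_{s,l} = Sigma^-1 mu_{s,l}
    alpha = -0.5 * np.einsum('sld,sld->sl', mu, lam) + log_c   # alpha_{s,l}
    return lam, alpha

def llmm_to_gmm(lam, alpha, sigma, normalize=True):
    """Eq. (4.26): back-conversion for a chosen symmetric positive definite sigma."""
    mu = np.einsum('de,sle->sld', sigma, lam)                  # mu_{s,l} = Sigma lambda_{s,l}
    log_c = alpha + 0.5 * np.einsum('sld,sld->sl', mu, lam)    # mu^T Sigma^-1 mu = mu^T lambda
    if normalize:
        m = log_c.max(axis=1, keepdims=True)                   # log-sum-exp normalization per state
        log_c = log_c - (m + np.log(np.exp(log_c - m).sum(axis=1, keepdims=True)))
    return mu, log_c
```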
4.2.3
Integration of SAT MLLR and CMLLR
Constrained maximum likelihood linear regression (CMLLR) is a linear feature transform while maximum likelihood linear regression (MLLR) transforms parameters of the Gaussian model. Their purpose is to remove speaker specific information. For example, as shown later in Table 4.1, for the EPPS task SAT gives a WER improvement of 3% absolute. Therefore it should be helpful to integrate SAT into the log-linear discriminative training framework. For this purpose a maximum likelihood SAT CMLLR is performed on the training data to obtain speaker specific transformation matrices. These matrices are added to the feature extraction pipeline of Section 4.3 and the rest of the procedure stays the same. For recognition these log-linear mixture models are converted back into Gaussian models [Heigold & Lehnen+ 08], and then SAT MLLR and CMLLR is performed. The conversion to Gaussian mixture models is necessary since MLLR operates on means and covariances and therefore requires a Gaussian form of the model. In this way we are able to utilize the speaker adaptation ability of MLLR for a log-linear model, by converting the latter back to its Gaussian mixture model form.
4.3 Log-linear Training

4.3.1 Frame-level Discriminative Training
The Maximum Mutual Information (MMI) criterion is adopted as the frame-level objective function, with an extra regularization term:
F(\lambda, \alpha) = -\tau\, \|\lambda, \alpha\|^2 + \sum_{r=1}^{R} \sum_{t=1}^{T_r} w_{s_t} \log p_{\lambda,\alpha}(s_t|x_t)        (4.27)
for a fixed alignment s_1^T, where the state parameters are {λ_s, α_s}. R is the total number of utterances in the training corpus and T_r is the number of feature vectors in the r-th utterance. τ is a regularization parameter to increase robustness and avoid overfitting of the model to the training data. ||λ, α|| is the Euclidean distance of [λ^T α]^T from some central values; these may be zeros (in case the input features x are mean normalized), or the initial values as obtained from a maximum likelihood GMM based initialization. In the experiments, the parameters have been regularized with respect to the ML-based initial values. The w_s are state weights which can be tuned to give less weight to some states, e.g. silence, which occupies a large portion of the alignment. The state priors are later subtracted from the trained α̂_s for recognition, because for recognition we use language model priors instead of state priors. MMI optimization may also be done at the sentence level, by using language model probabilities as priors; this is explained in the following section.
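For illustration, the following sketch (a simplification for a single-density log-linear model with center regularization; shapes and names are assumptions, not the thesis implementation) evaluates the criterion of Equation (4.27) and its gradient.

```python
import numpy as np

def frame_mmi(lam, alpha, lam0, alpha0, X, states, state_weights, tau):
    """Frame-level MMI criterion of Eq. (4.27) and its gradient.

    X: (T,D) frames, states: (T,) aligned state indices, state_weights: (S,),
    lam/alpha: current parameters, lam0/alpha0: regularization center, tau: constant.
    """
    scores = X @ lam.T + alpha                        # (T,S): lambda_s^T x_t + alpha_s
    scores -= scores.max(axis=1, keepdims=True)
    post = np.exp(scores)
    post /= post.sum(axis=1, keepdims=True)           # p(s|x_t)
    w = state_weights[states]                         # (T,) per-frame state weights
    reg = tau * (np.sum((lam - lam0) ** 2) + np.sum((alpha - alpha0) ** 2))
    value = -reg + np.sum(w * np.log(post[np.arange(len(states)), states]))
    # gradient: weighted (target indicator - posterior), minus regularization term
    target = np.zeros_like(post)
    target[np.arange(len(states)), states] = 1.0
    diff = w[:, None] * (target - post)               # (T,S)
    grad_lam = diff.T @ X - 2.0 * tau * (lam - lam0)
    grad_alpha = diff.sum(axis=0) - 2.0 * tau * (alpha - alpha0)
    return value, grad_lam, grad_alpha
```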
4.3.2
Sentence-level Discriminative Training
Context-dependent discriminative training of an LLMM acoustic model can be done in a similar way as for a GMM acoustic model. The same types of objective functions can be used, i.e. MMI, MPE and MCE; for a discussion of these objective functions refer to Section 3.3. The sentence-level minimum phone error criterion [Povey 03] incorporates an accuracy score A(W, W_r) of a hypothesis sentence W given the reference sentence W_r, based on forced alignment. This is an approximation of the phone accuracy that would be obtained from a Levenshtein alignment.

F_{\text{MPE}}(\Lambda) = -\tau_\Lambda \|\Lambda - \Lambda_0\|^2 + \sum_{r=1}^{R} \sum_{W \in M_r} P_\Lambda(W|X_r)\, A(W, W_r)        (4.28)

P_\Lambda(W_r|X_r) = \frac{\left[ p(W_r)^{1/\eta} \cdot p_\Lambda(X_r|W_r) \right]^\beta}{\sum_{W \in M_r} \left[ p(W)^{1/\eta} \cdot p_\Lambda(X_r|W) \right]^\beta}        (4.29)

p_\Lambda(X_r|W) = \max_{s_1^{T_r}|W} \prod_{t=1}^{T_r} \left[ p(s_t|s_{t-1}) \sum_l \exp(\lambda_{s_t,l}^\top x_t + \alpha_{s_t,l}) \right]        (4.30)
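For illustration, the following sketch evaluates the MPE score of Equations (4.28)-(4.29) for a single utterance, approximating the hypothesis space M_r by an N-best list. The thesis experiments use word lattices instead, so this is only a simplified, hedged illustration; all names are hypothetical.

```python
import numpy as np

def mpe_score_nbest(log_p_lm, log_p_ac, accuracy, eta, beta):
    """Posterior-weighted accuracy sum of Eqs. (4.28)-(4.29) over an N-best list.

    log_p_lm:  (N,) language model log probabilities log p(W)
    log_p_ac:  (N,) acoustic log likelihoods log p_Lambda(X_r|W)
    accuracy:  (N,) accuracy scores A(W, W_r) against the reference
    eta: language model scale, beta: posterior scale
    """
    log_joint = beta * (log_p_lm / eta + log_p_ac)
    log_joint -= log_joint.max()                 # stabilize before exponentiation
    post = np.exp(log_joint)
    post /= post.sum()                           # P_Lambda(W|X_r) over the N-best list
    return float(np.dot(post, accuracy))
```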
Figure 4.1: Flow diagram of MPE discriminative training of the log-linear mixture model. (MLP bottleneck features → maximum likelihood training with CMLLR → Gaussian means, mixture weights and pooled covariance → Gaussian-to-log-linear conversion → word lattice generation → log-linear MPE training → log-linear acoustic model → log-linear-to-Gaussian conversion → Gaussian mixture means, weights (not normalized) and pooled covariance → MLLR → recognition.)
The state parameters are Λ_{s,l} = {λ_{s,l}, α_{s,l}}. τ_Λ is the regularization parameter to increase robustness and avoid over-fitting. The regularization used here is called center regularization, which loosely binds Λ to its initial values Λ_0. R is the total number of sentences in the training corpus, M_r is the set of all possible word sequences, η is a language model scale, and β is a posterior scale. For our experiments, the language model scale is determined from a recognition run based on the initial acoustic model parameters; the posterior scale is 1. The RPROP algorithm [Riedmiller & Braun 93] is used for optimization of the objective function in Equation (4.28) in all experiments in Section 4.3. We note that in Equation (4.30) p(s) is not present, as for MPE the prior probabilities are computed from the language model. LLMM discriminative training implicitly trains the state priors, and the posterior forms of GMM and LLMM are equivalent if the state priors are used for the posterior computation. The state priors are not used in the MPE objective
function and for recognition, therefore the information of modified state priors is lost when the mixture weights are normalized. On the contrary, if we do not normalize the mixture weights, the resulting Gaussian conditional p(x|s) may not be a valid probability density function. This is because its integral will no longer be equal to one. However, without normalization the information loss can be avoided.
4.4
Experiments and Results
There are two sets of experiments for MPE training of LLMM acoustic models: the first uses MFCC features on the EPPS English task, while the second uses MLP input features on the QUAERO Spanish task. For a description of these corpora and systems refer to Appendix C.

Table 4.1: EPPS English dev2007: Comparison of Gaussian and log-linear single density and mixture models.

  speaker adaptive        acoustic     training     WER (%)          WER (%)
  training                model        criterion    single density   64 densities per state
  without SAT             Gaussian     ML           28.3             17.1
                                       MPE          24.5             15.8
                          log-linear   MMI-frame    23.3             16.4
                                       MPE          22.8             15.3
  with SAT MLLR & CMLLR   Gaussian     ML           16.7             13.6
                                       MPE          15.3             12.5
                          log-linear   MMI-frame    15.0             13.1
                                       MPE          14.7             12.1
A summary of recognition results for MFCC features on EPPS English corpus is shown in Table 4.1. The WER improvement from MMI discriminative training is quite large for single densities. However, as the number of densities increases, the difference is reduced. For 64 densities per state this difference is 0.7% absolute without SAT and 0.5% with SAT, small but still significant in relative terms. An important point to note here is that the frame-level MMI criterion is not the best criterion in terms of WER. The purpose of using it for our experiments was its robustness and global maximum property (for single densities). The real usefulness of this discriminative splitting approach is due to the improvement that it provides after a further pass of log-MPE training, which causes further reduction in WER. By this method, the total WER improvement over the baseline ML system is 1.8% without SAT and 1.5% with SAT. For comparison, we take an ML Gaussian model already split as 64 densities per state, and train it discriminatively by MPE. This is a model where only the training of means µ s,l is done discriminatively, and no splitting is done in between. This way we only get a 1.3% improvement
Table 4.2: QUAERO Spanish task: Comparison of MPE training of LLMM acoustic model with GMM, for MLP input features.

  MLP features   SAT MLLR + CMLLR   training                                   WER (%) dev12   WER (%) eval12
  Spanish        no                 ML                                         19.7            19.8
                 yes                ML                                         17.4            18.1
                                    MPE                                        17.0            17.2
                                    MPE (single noise state)                   16.8            16.9
                                    LLMM MPE (single noise state)              16.4            16.0
                                    LLMM MPE converted to Gaussian
                                      (with mixture weight normalization)      16.9            16.9
  Multilingual   no                 ML                                         16.9            -
                 yes                ML                                         15.0            -
                                    MPE (single noise state)                   14.4            -
                                    LLMM MPE (single noise state)              14.1            -
over the ML model without SAT and 1.1% improvement with SAT, which is less than what we obtain by integrated discriminative splitting and log-linear training. A possible reason is that the integrated approach is less susceptible to getting stuck in a local optimum. For both Gaussian mixture model MPE and log-MPE, the same language models and lattice generation techniques are used. We also observe that the WER improvements with SAT are almost as good as the improvements without SAT. This is because of the inclusion of the CMLLR matrices in the feature extraction pipeline during optimization.

Table 4.3: Training corpus: QUAERO English 50h / 250h. Comparison of Gaussian and log-linear mixture models.

  model type   training criterion   WER 50h (%)        WER 250h (%)
                                    dev      eval      dev      eval
  GMM          ML                   24.3     31.2      22.1     28.6
               MPE                  23.6     30.2      20.4     26.2
  LLMM         MMI-frame            23.9     30.9      22.1     28.6
               MPE                  23.6     30.4      20.3     26.7
For the experiments on the QUAERO Spanish task with MLP input features, initially a maximum likelihood mixture acoustic model is trained until there are up to 256 densities per triphone CART state. As shown in Figure 4.1, this is then converted to an LLMM acoustic mixture model and MPE training is performed. All the noise and silence states are combined into a
single noise state, so that during training they are not discriminated against each other. This combination is necessary for an LLMM model, because its mixture weights are not normalized: if the term A(W, W_r) in Equation (4.28) becomes too small because of discrimination between competing noise phonemes, the denominator can dominate and MPE training can cause a large reduction in the weights of some of the noise states. The input features for MPE training include the CMLLR matrices in the feature extraction pipeline. The MPE-trained LLMM acoustic model is then converted back to Gaussian form, but without normalization of the mixture weights. In a normal Gaussian mixture density acoustic model, the sum of the weights of all densities in a particular CART state is 1. This condition does not hold for the LLMM model, which adds an extra degree of freedom to it. In practice we have found that this leads to a better WER, and therefore we do not re-normalize the mixture weights of the converted Gaussian mixture model. The converted mixture model is used to estimate the MLLR transformations and then for recognition.

Looking at the results in Table 4.2, we see that starting from a maximum likelihood Gaussian mixture acoustic model on top of Spanish MLP features, the WER on the eval12 set is 18.1%. MPE discriminative training improves it to 16.9%. If we convert the ML Gaussian acoustic model to LLMM form and do MPE, the WER decreases to 16.0%. A similar improvement is also evident in the case of multilingual MLP features, which are already well-optimized. For the QUAERO English task results in Table 4.3, the difference between GMM-MPE and LLMM-MPE is not apparent. After a sufficient number of MPE iterations the acoustic model starts to over-train, so further optimization of the objective function beyond that point is no longer relevant. It is hypothesized that the WERs in the table for both GMM-MPE and LLMM-MPE correspond to the minimum WER reached before the onset of overtraining. For frame-MMI on the 50 hours corpus, there is an improvement of 0.3% absolute over ML-GMM. For the 250 hours corpus there is no WER improvement; however, the ML-GMM result corresponds to 512 densities per state, while frame-MMI achieves the same WER with 128 densities per state.
4.5
Conclusion
In this chapter, an overview of log-linear training and the maximum entropy principle was presented first, and some algorithms for solving log-linear optimization problems were discussed. Experimental results for log-linear acoustic models were presented on speech recognition tasks, both for frame-level and sentence-level objective functions. Among other things, we have presented minimum phone error training results of a log-linear mixture model on top of MLP features in a tandem framework. The LLMM parameters in this context are useful because on one hand they can equivalently represent a Gaussian mixture model, and on the other hand they provide a softmax layer followed by a linear summation layer at the end of an MLP network. Therefore we are able to combine the strengths of mixture density based acoustic models and MPE training of a neural network. There is an extra degree of freedom that we exploit by relaxing the normalization constraint on the weights of the Gaussian mixture model. In this way, some phonemes that occur less often but are important for recognition may get higher weights to aid them during recognition, and vice versa. This
weighting is not possible in a regular Gaussian mixture model, as in that case the sum of the mixture weights of all densities in a CART state is 1. Based on observations from these experiments, further work in this direction could be to include such a mixture-like representation explicitly in an MLP feature network, thus providing a similar type of non-linearity coverage to that of a mixture-density-based Gaussian tandem MLP system. Furthermore, this triphone state weighting information can be removed from the LLMM acoustic model and added to the language model according to the pronunciations of words (as these are triphone weights). This would allow the modified language model to carry this information over to other speech recognition tasks. The effect of the unnormalized nature of the log-linear mixture weights is also discussed in Section 5.5.4, with relevant results shown in Figure 5.5.
Chapter 5
Training of Linear Feature Transformations

This chapter provides an overview of different linear feature transformation techniques, commonly used in conjunction with acoustic features for ASR. Section 5.3 will present our work on log-linear training of linear feature transformations. Figure 5.1 illustrates the feature transformation step in the context of the overall speech recognition process.

Figure 5.1: Scope of linear feature transformations. (Speech input → feature extraction → feature vectors x1...xT → linear feature transformation → x'1...x'T → global search maximizing p(w1...wN) · p(x'1...x'T | w1...wN) over w1...wN, using the acoustic model (subword units, pronunciation lexicon) and the language model p(w1...wN) → recognized word sequence {w1...wN}opt.)
5.1
Generative Techniques
The acoustic feature vectors x_1^T, which are obtained as the output of the feature extraction process, are transformed by a matrix A such that

x' = A \cdot x, \qquad x \in \mathbb{R}^D, \; x' \in \mathbb{R}^{D'}, \; A \in \mathbb{R}^{D' \times D}        (5.1)

for some D' ≤ D. There are several reasons for applying such transformations:
• To increase the hyperspace separation between features belonging to different classes, which results in better recognition performance. • It is common to use diagonal covariance matrices for Gaussian mixture models. A rotation transformation aligns the features better to the axes, and therefore minimizes the degradation caused by using diagonal covariance matrices. • To reduce the dimensions of feature vectors, which results in speed and memory optimization. • To cancel the effect of speaker variations by adapting the acoustic model to a particular speaker, as in CMLLR
5.1.1
Linear Discriminant Analysis (LDA)
The purpose of linear discriminant analysis (LDA) [Fisher 36] is to find a linear transformation that best separates two or more classes. In speech recognition we obtain a sequence of continuous feature vectors as a result of signal processing. Although this process is based on a model of human perception and frequency analysis, there is no guarantee that the resulting features are also optimal and relevant for the recognition process. Therefore we use LDA to find a weighted combination of these features such that the separation between different classes is maximized. The classes here correspond to different phoneme models. Another advantage of performing LDA is that it allows us to represent the most important discriminating information in fewer dimensions, and therefore addresses the data sparseness problem typical of high-dimensional feature spaces. We can achieve a dimensionality reduction by dropping some of the insignificant dimensions; the reduced-dimension vectors are also more computationally efficient. In addition, LDA can improve classification performance, because dimensions containing information irrelevant to classification, which would otherwise act like noise and degrade recognition, are filtered out. LDA assumes homoscedasticity, i.e. the within-class covariances are assumed to be the same (the more general variant is presented in Section 5.1.2). In speech recognition systems it is common to use a pooled covariance matrix for all mixture densities of all classes.
LDA for two classes
Let X = {x_i}, i = 1...N, be a set of observation vectors (independent variables), where each element x_i corresponds to a known class y_i from the set Y = {0, 1} (dependent variables). This is called a training set. The task is to find a classification criterion that classifies a vector x as being of the more likely class y; this x may not be part of the original training set. We assume that both class distribution functions p(x|y = 0) and p(x|y = 1) are normally distributed. Let n_0 be the number of samples of class 0 and n_1 the number of samples of class 1. Then the class means and covariances are

\mu_0 = \frac{1}{n_0} \sum_{i\,|\,y_i=0} x_i, \qquad \mu_1 = \frac{1}{n_1} \sum_{i\,|\,y_i=1} x_i        (5.2)

\Sigma_0 = \frac{1}{n_0} \sum_{i\,|\,y_i=0} (x_i - \mu_0)(x_i - \mu_0)^\top, \qquad \Sigma_1 = \frac{1}{n_1} \sum_{i\,|\,y_i=1} (x_i - \mu_1)(x_i - \mu_1)^\top        (5.3)
Due to the normality assumption we can use Bayes' decision rule and classify the vectors based on whether the likelihood ratio is above or below a threshold T:

(x - \mu_0)^\top \Sigma_{y=0}^{-1} (x - \mu_0) + \log|\Sigma_{y=0}| - (x - \mu_1)^\top \Sigma_{y=1}^{-1} (x - \mu_1) - \log|\Sigma_{y=1}| < T        (5.4)

where Σ^{-1} is the inverse of the covariance matrix Σ. Now we make the further assumption that the covariance matrices of both classes are equal (homoscedasticity) and that they have full rank. Here we take the average of both covariances to obtain a single Σ. The above equation then simplifies to

(\mu_1 - \mu_0)^\top \Sigma^{-1} x < T        (5.5)
for some constant T. Note that this is roughly equivalent to projecting the vector onto the line connecting the two means, followed by a basis transformation (a scaling in the case of a diagonal covariance matrix).

LDA for more than two classes
Suppose we have vectors x corresponding to C different classes, each class j having mean µ_j and covariance Σ_j. Then we define the within-class scatter as

\Sigma = \sum_j p_j \cdot \Sigma_j        (5.6)

where p_j is the a-priori probability of class j. The between-class scatter is

\Sigma_b = \sum_j (\mu_j - \mu)(\mu_j - \mu)^\top        (5.7)
where µ is the mean of all x over all classes. We need to find a transformation that maximizes the ratio of between-class scatter to within-class scatter. The class separation criterion is

S = \frac{w^\top \Sigma_b w}{w^\top \Sigma w} = \frac{w^\top (\Sigma^{-1}\Sigma_b)\Sigma\, w}{w^\top \Sigma w}        (5.8)

for some vector w. If w is an eigenvector of Σ^{-1}Σ_b, then Σ^{-1}Σ_b w = λw and hence S = λ. The rank of Σ_b is at most C − 1, so there are at most C − 1 non-zero eigenvalues; additional zero eigenvalues indicate linear dependencies between features. The eigenvectors are the basis vectors of the transformation we are trying to find. If we set w equal to the eigenvector with the largest eigenvalue, then S is maximal, and this direction is therefore the one with maximum inter-class separation. Eigenvectors with smaller eigenvalues carry less discrimination information. As seen above, the transformation matrix for LDA is the matrix containing the eigenvectors of Σ^{-1}Σ_b as its rows. To achieve dimension reduction, we can drop the rows corresponding to the smaller eigenvalues.
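As an illustration of the procedure just described, the following is a minimal sketch (not the implementation used in this work) of computing an LDA projection from class statistics. Weighting the global mean by the class priors and taking the real parts of the eigen-decomposition are assumptions of this sketch.

```python
import numpy as np

def lda_transform(means, covs, priors, n_dims):
    """Multi-class LDA as in Eqs. (5.6)-(5.8): rows of the returned matrix are the
    eigenvectors of Sigma^-1 Sigma_b with the n_dims largest eigenvalues.
    means: (C,D), covs: (C,D,D), priors: (C,)."""
    mu = priors @ means                                  # global mean (prior-weighted)
    sigma_w = np.einsum('c,cde->de', priors, covs)       # within-class scatter, Eq. (5.6)
    diff = means - mu
    sigma_b = np.einsum('cd,ce->de', diff, diff)         # between-class scatter, Eq. (5.7)
    eigval, eigvec = np.linalg.eig(np.linalg.solve(sigma_w, sigma_b))
    order = np.argsort(-eigval.real)[:n_dims]            # keep largest eigenvalues
    return eigvec.real[:, order].T                       # (n_dims, D) projection matrix
```

The rows of the returned matrix can then be applied to the features as in Equation (5.1).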
Use of LDA in Speech Recognition
For small vocabulary systems where the words themselves are defined as classes, the use of LDA is straightforward [Doddington 89]. For large vocabulary systems, the situation is more complicated. [Yu & Russel+ 90] used the basic phonemes of the language as classes in an HMM framework, but this did not result in an overall improvement in WER. [Wood & Pearce+ 91] defined the classes as sub-phone units called phonicles, which resulted in a performance improvement. In [H¨ab-Umbach & Ney 92], experiments were performed with different definitions of classes, comparing the performance in each case. The interaction between LDA and modelling is complicated and can best be answered experimentally. Their system is a continuous speech recognizer with Laplacian mixture density HMMs. The classes are based on the set of 40-50 phonemes of the language, each of which can be further divided into three sub-phone units called "phoneme segments". The best class definition turned out to be sub-phone segments, also because it is less computationally intensive than using mixture densities.
5.1.2
Heteroscedastic discriminant analysis (HDA)
In the previous section on LDA we made the simplifying assumption that the class covariances are equal. If we do not make this assumption, the procedure is called HDA [Kumar & Andreou 98]. Since the objective is to reduce the dimensionality of the data, this is modelled explicitly here: we assume that for D-dimensional data, D' dimensions carry all significant classification information, and the other D − D' dimensions can be rejected. This is equivalent to saying that the means and variances in the rejected dimensions are the same for all classes. The maximum likelihood principle is then used to find the optimal transformation. Let A be a D × D transformation matrix applied to the vectors x; since the distribution of x is Gaussian, the distribution of the transformed vector A^T x is also Gaussian. Let the means and variances of class y_j be µ_j and Σ_j respectively. As we assume that the discrimination information is only contained in the first D' dimensions, we can partition these parameters as

\mu_j = \begin{pmatrix} \mu_j^{D'} \\ \mu_0 \end{pmatrix}, \qquad \Sigma_j = \begin{pmatrix} \Sigma_j^{D' \times D'} & 0 \\ 0 & \Sigma^{(D-D') \times (D-D')} \end{pmatrix}        (5.9)
for j = 1...J classes. Here µ_0 and Σ^{(D-D')} are common to all the classes. Accordingly, the probability density of x_i is

p(x_i) = \frac{|A|}{\sqrt{(2\pi)^D |\Sigma_{y_i}|}} \exp\left( -\frac{(A^\top x_i - \mu_{y_i})^\top \Sigma_{y_i}^{-1} (A^\top x_i - \mu_{y_i})}{2} \right)        (5.10)
where xi belongs to class yi . The above equation assumes that |A| is positive. If not, we can make it positive by multiplying a row with -1. The log likelihood of the data under the linear transformation A is given by
\log L_F(\mu_j, \Sigma_j, A; \{x_i\}) = -\frac{1}{2} \sum_{i=1}^{N} \left[ (A^\top x_i - \mu_{y_i})^\top \Sigma_{y_i}^{-1} (A^\top x_i - \mu_{y_i}) + \log\left( (2\pi)^D |\Sigma_{y_i}| \right) \right] + N \log |A|        (5.11)
The likelihood equation can now be maximized with respect to its parameters. Optimizing all parameters simultaneously would be computationally intensive, so we break the process into two parts. First we calculate the mean and variance parameters for a fixed transformation A; the initial guess for A is obtained by applying homoscedastic LDA. Differentiating the likelihood equation with respect to µ_j and Σ_j and setting the first partial derivatives equal to zero, we can estimate the means and variances:
\hat{\mu}_j^{D'} = A_{D'}^\top \mu_j, \qquad \hat{\mu}_0 = A_{D-D'}^\top \mu        (5.12)

\hat{\Sigma}_j^{D'} = A_{D'}^\top \Sigma_j A_{D'}, \qquad \hat{\Sigma}^{D-D'} = A_{D-D'}^\top \Sigma A_{D-D'}        (5.13)
Parameters with a hat are the means and variances of the transformed data, and parameters without a hat are those of the original data. Putting these values into the log-likelihood equation and simplifying, we obtain the maximization criterion for A:

\hat{A}_F = \arg\max_A \left\{ -\frac{N}{2} \log\left| A_{D-D'}^\top \Sigma A_{D-D'} \right| - \sum_{j=1}^{J} \frac{N_j}{2} \log\left| A_{D'}^\top \Sigma_j A_{D'} \right| + N \log |A| \right\}        (5.14)
This criterion is more complicated than the one we obtained for LDA, and there is no closed form solution available for it. Therefore we have to use an iterative numerical algorithm to find out the transformation. Quadratic programming techniques can be used, which are available in most popular mathematics packages. But the function surface is not strictly quadratic and
sometimes the optimization would fail. In such cases more general optimization methods like the steepest descent can be used. But as seen experimentally by [Kumar & Andreou 98], quadratic programming techniques are about two orders of magnitude faster than steepest descent optimization.
Use of HDA in speech recognition
[Kumar & Andreou 98] propose embedding HDA into the HMM framework. Assuming that the initial and final states of the HMM are known, it is completely specified by the state transition probability matrix B = [b_ij] and the state probability distributions p_j(x) (which are given by the parameters {A, µ_0, µ_j^{D'}, Σ_j^{D'}, Σ^{(D-D')}}, using the same parameter definitions as in the previous section). The input is a time series of vectors x_1, ..., x_T. The EM algorithm is used, with the new parameter B also included in its maximization step. First, an initial guess for the model parameters is used. During each iteration, the expectation step estimates the probabilities of being in a particular state at each time instance. Then, in the maximization step, the values of {b_ij, A, µ_j, Σ_j} are re-estimated. Here each HMM state is defined as a class.
5.2
Discriminative Techniques
One strategy is to maximize the mutual information between the features and their class [Omer & Hasegawa-Johnson 03]; this is more intuitively related to minimizing the recognition error and could therefore be a better objective for discriminant analysis. The performance of this algorithm was compared with other approaches using LDA and HDA, and it was observed that the MCMIP system achieves an improvement in recognition accuracy compared to the baseline system. In fMPE [Povey & Kingsbury+ 05], very high dimensional sparse features are dimension-reduced by a linear transformation. This transformation is trained discriminatively with the MPE criterion and results in WER improvements.
5.3
Log-linear Discriminative Training
The maximum entropy principle can also be used for training linear feature transformations. In Section 5.1.1 linear discriminant analysis (LDA) was used to reduce the dimensions of input feature vectors. It also rotates the features to align them better to the feature space axes; this results in better classification performance for the case of diagonal covariance matrices. In this chapter, elements of the transformation matrix A are formulated as log-linear parameters. This is essentially a discriminative training with log-linear parameters, and therefore can be optimized using GIS or RPROP algorithm. From Section 4.2.1 the state posterior probabilities of a single-density log-linear acoustic model are
p_\Lambda(s|x) = \frac{\exp(\lambda_s^\top x + \alpha_s)}{\sum_{s'} \exp(\lambda_{s'}^\top x + \alpha_{s'})}        (5.15)
The feature transformation matrix A can be included in Equation (5.15):

p_{\Lambda,A}(s|x) = \frac{\exp(\lambda_s^\top A x + \alpha_s)}{\sum_{s'} \exp(\lambda_{s'}^\top A x + \alpha_{s'})} = \frac{\exp\left( \sum_{d,d'} a_{d,d'}\, \lambda_{s,d}\, x_{d'} + \alpha_s \right)}{\sum_{s'} \exp\left( \sum_{d,d'} a_{d,d'}\, \lambda_{s',d}\, x_{d'} + \alpha_{s'} \right)}        (5.16)

where λ_s and α_s are the log-linear parameters of state s, λ_{s,d} is the d-th scalar component of the vector λ_s, x_{d'} is the d'-th component of x, and a_{d,d'} is the element of matrix A in the d-th row and d'-th column. If the mixture parameters λ_{s,d} are held constant, the equation is log-linear with respect to the parameters a_{d,d'}. The term log-linear transformation training will be used in the rest of this text to refer to the training of the parameters a_{d,d'} while keeping the mixture parameters constant.
5.3.1
Training on State Level
The state-level training of log-LDA optimizes the following objective function:

F^{(\text{state})}(\Lambda, A) = \sum_{t=1}^{T} \log p_{\Lambda,A}(s_t|x_t)        (5.17)
where the state sequence s_1^T is obtained by aligning the feature vector sequence x_1^T with the HMM, using an alignment from an initial ML training, and p_{Λ,A}(s_t|x_t) is computed using Equation (5.16). The above objective function does not include any context information and therefore leads to simpler update rules. For this fixed-alignment case, there is exactly one global optimum, which can be found using maximum entropy optimization algorithms like GIS and IIS, or general algorithms like RPROP; likewise, general purpose gradient ascent algorithms can be used. For the training, the feature functions are defined as

f_{s',d,d'}(x_t, s) = \begin{cases} \lambda_{s,d} \cdot x_{t,d'} & \text{if } s = s' \\ 0 & \text{otherwise} \end{cases}        (5.18)
Taking the derivative of F^{(state)}(Λ, A) with respect to a_{d,d'}:

\frac{\partial F^{(\text{state})}(\Lambda, A)}{\partial a_{d,d'}} = \sum_t \sum_s \left( \delta_{s_t,s} - p_{\Lambda,A}(s|x_t) \right) \lambda_{s,d}\, x_{t,d'} = \sum_s \lambda_{s,d} \sum_t \left( \delta_{s_t,s} - p_{\Lambda,A}(s|x_t) \right) x_{t,d'}        (5.19)
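As an illustration of Equation (5.19), the following sketch (array shapes and names are illustrative assumptions, not the thesis implementation) computes the gradient of the state-level objective with respect to A for fixed log-linear parameters.

```python
import numpy as np

def grad_A_state_level(X, states, lam, alpha, A):
    """Gradient of the state-level objective, Eq. (5.19), w.r.t. the transform A.

    X: (T,D') untransformed features, states: (T,) aligned state indices,
    lam: (S,D), alpha: (S,), A: (D,D').
    """
    z = X @ A.T                                   # transformed features, shape (T,D)
    scores = z @ lam.T + alpha                    # (T,S)
    scores -= scores.max(axis=1, keepdims=True)
    post = np.exp(scores)
    post /= post.sum(axis=1, keepdims=True)       # p_{Lambda,A}(s|x_t)
    target = np.zeros_like(post)
    target[np.arange(len(states)), states] = 1.0
    # sum_t sum_s (delta_{s_t,s} - p(s|x_t)) * lambda_{s,d} * x_{t,d'}
    return lam.T @ (target - post).T @ X          # gradient of shape (D, D')
```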
For normalization, the features are scaled between their minimum and maximum values, which are

\min_{t,s}\{\lambda_{s,d}\, x_{t,d'}\} = \begin{cases} \min_s\{\lambda_{s,d} \min_t\{x_{t,d'}\}\} & \text{if } \lambda_{s,d} \ge 0 \\ \min_s\{\lambda_{s,d} \max_t\{x_{t,d'}\}\} & \text{otherwise} \end{cases}        (5.20)

\max_{t,s}\{\lambda_{s,d}\, x_{t,d'}\} = \begin{cases} \max_s\{\lambda_{s,d} \max_t\{x_{t,d'}\}\} & \text{if } \lambda_{s,d} \ge 0 \\ \max_s\{\lambda_{s,d} \min_t\{x_{t,d'}\}\} & \text{otherwise} \end{cases}        (5.21)

5.3.2 Training on Sentence Level
For a speech segment r with feature vectors x_1^{T_r} and spoken word sequence w_1^{N_r}, the Maximum Mutual Information criterion is

F^{(\text{MMI})}(\Lambda, A) = \sum_r \log p_{\Lambda,A}(w_1^{N_r}|x_1^{T_r})        (5.22)
Now considering the training data to be a single large utterance x_1^T with corresponding word sequence w_1^N,

F^{(\text{MMI})}(\Lambda, A) = \log p_{\Lambda,A}(w_1^N|x_1^T)
 = \log \frac{p(w_1^N) \sum_{s_1^T|w_1^N} \prod_{t=1}^{T} p_{\Lambda,A}(x_t|s_t)\, p(s_t|s_{t-1})}{\sum_{v_1^M} p(v_1^M) \sum_{s_1^T|v_1^M} \prod_{t=1}^{T} p_{\Lambda,A}(x_t|s_t)\, p(s_t|s_{t-1})}
 = \log \frac{p(w_1^N) \sum_{s_1^T|w_1^N} \prod_{t=1}^{T} p(s_t|s_{t-1}) \cdot \exp\left( \sum_{t=1}^{T} \left[ \sum_{d,d'} a_{d,d'}\, \lambda_{s_t,d}\, x_{t,d'} + \alpha_{s_t} \right] \right)}{\sum_{v_1^M} p(v_1^M) \sum_{s_1^T|v_1^M} \prod_{t=1}^{T} p(s_t|s_{t-1}) \cdot \exp\left( \sum_{t=1}^{T} \left[ \sum_{d,d'} a_{d,d'}\, \lambda_{s_t,d}\, x_{t,d'} + \alpha_{s_t} \right] \right)}        (5.23)
The functional form in Equation (5.23) is difficult to optimize; therefore, for calculating derivatives we use another objective function F^{(context)} which is a weak auxiliary function of F^{(MMI)} at (Λ, A), i.e. they are equal if the true values of p_t(s, v_1^M | x_1^T) are known. For a discussion of weak auxiliary functions, refer to [Povey 03].

F^{(\text{context})}(\Lambda, A) = \sum_t \log \sum_s \tilde{q}_t(s)\, p_{\Lambda,A}(s|x_t) = \sum_t \log \frac{\sum_s q_t(s)\, p_{\Lambda,A}(x_t|s)}{\sum_s p_t(s)\, p_{\Lambda,A}(x_t|s)}        (5.24)

q_t(s) = \frac{p_{\Lambda',A',t}(s, w_1^N | x_1^T \setminus x_t)}{\sum_{s'} p_{\Lambda',A',t}(s', w_1^N | x_1^T \setminus x_t)}, \qquad p_t(s) = \frac{p_{\Lambda',A',t}(s | x_1^T \setminus x_t)}{\sum_{s'} p_{\Lambda',A',t}(s' | x_1^T \setminus x_t)}        (5.25)

q_t(s) and p_t(s) are context priors (time dependent priors) calculated while keeping (Λ', A') constant, and

p_{\Lambda,A,t}(s, w_1^N | x_1^T \setminus x_t) = p(w_1^N)\, \frac{\gamma_{\Lambda,A,t}(s|w_1^N)}{p_{\Lambda,A}(x_t|s_t)}, \qquad p_{\Lambda,A,t}(s | x_1^T \setminus x_t) = \sum_{v_1^M} p_{\Lambda,A,t}(s, v_1^M | x_1^T \setminus x_t)        (5.26)
where γ_{Λ,A,t}(s|w_1^N) are the forward-backward probabilities, which can be calculated using the Baum-Welch algorithm as described in Section 1.1.5.1. The context priors are useful because they contain complete context information despite being single numbers. As in the case of GMM-based MMI discriminative training, the Viterbi approximation and lattices can be used for speed. However, the optimization of the objective function in Equation (5.23) is not a convex problem, unlike Equation (5.17); therefore only a locally optimal solution can be expected. Other sentence-level criteria can be trained similarly. For MPE, the objective function would contain an additional accuracy term in the numerator of Equation (5.23), similar to that in Equation (3.18). In Sections 5.4 through 5.6, log-linear transformation training is applied to different scenarios and the relevant results and discussions are presented.
5.4
Direct Transformation of Input Features
In this section, the use of log-linear training for a direct feature transform is compared with LDA. Both the λ_s parameters of the acoustic model and the feature transformation A are trained alternately. It is observed that if only the matrix A is discriminatively trained, the objective function improves but the WER does not change. When a subsequent training of the log-linear parameters λ_s is done, the objective function as well as the WER improve. Initial experimental results for log-linear transformation training were presented in [Tahir & Heigold+ 09] for a digit recognition task. Some further results are shown in Table 5.1 for single density log-linear acoustic models. The improvement due to log-LDA on top of λ_s training is small, possibly because the number of parameters in the transformation matrix (6K) is much smaller than the amount of training data and the other acoustic model parameters (207K). For this reason, the advantage of discriminative training is not fully realizable. However, in the next sections, where the number of trainable parameters in the transformation matrix is large, we shall see significant improvements.
Table 5.1: Training corpus: QUAERO English 50h. Log-linear training of feature transformation

  Acoustic model training   Feature transformation training   WER (%) dev   WER (%) eval
  ML                        LDA                               40.7          49.0
  MMI-frame                 LDA                               35.6          43.3
  MMI-frame                 log-LDA                           35.4          43.0
Figure 5.2: Flowchart of iterative polynomial features
5.5
Dimension-reduced Higher-order Polynomial Features
The input features (or LDA-transformed features) can be cross-multiplied with themselves and then vectorized to create second-order polynomial features. Thus an n-dimensional feature vector becomes an n×n-dimensional feature vector after this squaring. However, almost half of the elements in this squared vector are duplicate values, and after removing these duplicate elements the number of elements is n(n + 1)/2. As we have seen in Section 4.2.1, a Gaussian acoustic model with a pooled covariance matrix can be simplified by cancelling the squared feature terms. The same is true for an equivalent log-linear model. However, by using squared features we can implicitly represent the same type of information as a Gaussian/log-linear model with class-specific covariances. This squaring of features represents a non-linear transformation of the input features into a much higher dimensional space, where the classes are expected to be more easily separable. In previous work, second order polynomial features have resulted in WER improvements over the original MFCC features [Wiesler & Nußbaum+ 09]. These features cause a large increase in the length of the acoustic model mean vectors, but the WER is as good as that of a full mixture system with an even larger number of parameters. Therefore these squared features are more parameter efficient than the corresponding mixture density system. Polynomial features are feasible for second order, e.g. for 45 dimensional input features the size of the squared features is 1035. However, for higher orders like fourth or eighth order, their practicality is limited by the fact that going to higher orders increases the number of dimensions exponentially. For second order features of 1035 dimensions, fourth order polynomial features would have 536K dimensions, which is about 50 times more parameters than the full mixture acoustic model. To reduce the number of parameters to a computationally tractable size, dimension-reducing log-linear transformations can be applied
to polynomial features after each squaring, thus allowing us to go to fourth-order, eighth-order polynomials and so on. At each layer k with input features x^{(k)} ∈ R^{r_k}, a transformation A^{(k)} ∈ R^{r_k × d_k} is trained. As seen in Equation (5.27), at each layer k the input features are also concatenated with the squared features to retain the information of the original features:

p_\theta(s|x^{(k)}) = \frac{\exp\left( \lambda_s^{\top(k)} \cdot A^{(k)} \begin{bmatrix} \mathrm{vech}(x^{(k)} x^{\top(k)}) \\ x^{(k)} \end{bmatrix} + \alpha_s \right)}{\sum_{s'} \exp\left( \lambda_{s'}^{\top(k)} \cdot A^{(k)} \begin{bmatrix} \mathrm{vech}(x^{(k)} x^{\top(k)}) \\ x^{(k)} \end{bmatrix} + \alpha_{s'} \right)}        (5.27)

where A^{(k)} is the dimension-reducing matrix to be trained in the current layer. For a symmetric n × n matrix M, the vech(·) operator denotes half-vectorization: the n(n+1)/2-dimensional column vector obtained by vectorizing only the lower triangular part of M, where vectorizing means stacking the columns of M on top of one another. A pictorial representation of multilayer polynomial features can be seen in Figure 5.2.
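To make the construction in Equation (5.27) concrete, the following is a small sketch (a simplified illustration, not the training code used in this work) of one polynomial layer: the input is squared, half-vectorized, concatenated with the original features, and reduced in dimension by a matrix A.

```python
import numpy as np

def vech(M):
    """Half-vectorization: stack the lower-triangular part (incl. diagonal) of M."""
    rows, cols = np.tril_indices(M.shape[0])
    return M[rows, cols]

def polynomial_layer(x, A):
    """One layer of Eq. (5.27): augment x with vech(x x^T) and reduce the dimension.

    A is the trained dimension-reducing matrix of shape (r_out, n(n+1)/2 + n).
    """
    augmented = np.concatenate([vech(np.outer(x, x)), x])
    return A @ augmented
```

For n = 45 input dimensions, vech(x x^T) has 1035 entries and the augmented vector 1080, matching the dimensions quoted in this section.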
5.5.1
Initialization of Log-Linear Training
Log-linear training of the transformation depends crucially on reasonable initial values of the projection matrix and the log-linear weights, A^{(k0)} and λ^{(k0)} respectively. One reason why good initial values are crucial is that the large scale of ASR training data makes it likely that the algorithm will suffer from slow convergence to a (local) maximum. Therefore, A^{(k)} ∈ R^{r_k × d_k} and the corresponding λ^{(k)} are initialized by setting A^{(k0)} = [0_{r_k × (d_k − r_k)} | I_{r_k × r_k}] and λ^{(k0)} = λ^{(k−1)}. It can be inferred that this initialization guarantees that the MMI score of the new iteration is exactly identical to that of the previous iteration:
\lambda^{\top(k0)} \left[\, 0_{r_k \times (d_k - r_k)} \;\middle|\; I_{r_k \times r_k} \,\right] \begin{bmatrix} \mathrm{vech}(x^{(k)} x^{\top(k)}) \\ x^{(k)} \end{bmatrix} = \lambda^{\top(k-1)} A^{(k-1)} x^{(k-1)}        (5.28)
Since the iterations continuously absorb more discriminative information from higher-order polynomials, the low-dimensional reduced vector might not be able to capture such additional information and needs to be augmented:
A'^{(k0)} = \begin{bmatrix} 0_{r_k \times (d_k - r_k)} & I_{r_k \times r_k} \\ 0_{(r_{k+1} - r_k) \times (d_k - r_k)} & 0_{(r_{k+1} - r_k) \times r_k} \end{bmatrix}        (5.29)
Compared with A^{(k0)} defined in the previous subsection, the zero block 0_{(r_{k+1} − r_k) × d_k} is inserted. Accordingly, λ^{(k0)} is also augmented by its copy, which results in λ'^{(k0)} = [λ^{(k0)} | λ^{(k0)}]. It can be seen that λ'^{(k0)} A'^{(k0)} = λ^{(k0)} A^{(k0)}. Therefore, the additional rows can be expected to learn the additional information when higher-order polynomials are considered.
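A small sketch of this initialization follows (assuming, as in the 45 → 90 case of Table 5.2, that the augmented output size doubles the previous one; variable names are illustrative):

```python
import numpy as np

def init_layer(r_k, d_k, lam_prev, augment=False):
    """Initialization of Eqs. (5.28)-(5.29).

    A^(k0) = [0 | I] makes the new layer reproduce the previous layer's score
    exactly. With augment=True the output size is doubled by adding zero rows
    to A and copying lambda, so lambda'A' = lambda A still holds.
    """
    A0 = np.hstack([np.zeros((r_k, d_k - r_k)), np.eye(r_k)])
    lam0 = lam_prev.copy()                            # lambda^(k0) = lambda^(k-1), shape (S, r_k)
    if augment:
        A0 = np.vstack([A0, np.zeros((r_k, d_k))])    # extra rows can absorb new information
        lam0 = np.concatenate([lam0, lam0], axis=1)   # lambda'^(k0) = [lambda^(k0) | lambda^(k0)]
    return A0, lam0
```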
Table 5.2: EPPS English dev2007: Mixtures vs. Polynomials

  Polynom.   Densities   Reduction   Criterion   Feature     No. of    WER
  order      / mixture                           dimension   params.   (%)
  1st        1           LDA         ML          45          213k      28.3
  1st        1           LDA         MMI-frame   45                    25.3
  1st        1           log-LDA     MMI-frame   45                    23.5
  1st        2           log-LDA     MMI-frame   45          433k      20.2
  1st        4           log-LDA     MMI-frame   45          853k      18.4
  2nd        1           None        MMI-frame   1080        4972k     20.8
  2nd        1           log-LDA     MMI-frame   45          263k      21.1
  4th        1           log-LDA     MMI-frame   45          311k      19.3
  8th        1           log-LDA     MMI-frame   45          359k      19.0
  8th        1           log-LDA     MMI-frame   90          611k      18.2
Table 5.3: Training corpus: QUAERO English 50h, MFCC window size 9: WER for ML vs. higher order features

  Feature      Densities   Training    Final       No. of     WER (%)
  polynomial   per state   criterion   feature     params.    dev     eval
  order                                dimension   × 1000
  1st          single      ML          45          214        40.7    49.0
  1st          256         ML          45          53010      24.3    31.2
  1st          single      MMI-frame   45          214        35.4    43.0
  1st          128         MMI-frame   45          26508      23.9    30.9
  2nd (Full)   single      MMI-frame   1080        4872       24.2    31.8
  2nd+sparse   single      MMI-frame   1080                   23.6    30.6
  2nd          single      MMI-frame   90          467        28.5    35.9
  4th          single      MMI-frame   135         721        27.2    34.4
  4th          32          MMI-frame   135         19688      23.1    30.2

5.5.2 Experiments and Results
Results for second-order polynomial features were presented in [Tahir & Schl¨uter+ 11a], and extended to 8th-order features in [Tahir & Huang+ 13]. Table 5.2 shows some results for dimension-reduced higher-order polynomial features on the EPPS English large vocabulary task (Appendix C). The input features are 16 × 9 MFCC features which have been LDA transformed to 45 dimensions.
Figure 5.3: EPPS English dev2007: Comparison of discriminatively trained second order and first order MFCC systems. (WER (%) vs. number of parameters; curves: ML first order MFCC, log-linear first order MFCC, log-linear second order MFCC; L = number of densities per state.)
The results compare the full second-order polynomial features with dimension-reduced second, fourth and eighth-order polynomial features. They are also compared with the case where the non-linearity is modelled by conventional mixtures. Note that the features/models for all rows in the table (except the first) have been trained using the same training criterion: frame-level MMI. It can be seen that the number of parameters required for 8th order polynomial features is smaller than that for mixtures with 4 densities per state, while the WER of the former is lower than that of the latter. This indicates that modelling the feature non-linearity by a polynomial representation requires fewer parameters than doing so through mixtures. This effect can be seen graphically in Figure 5.4. The difference between polynomial features and mixtures is more pronounced for smaller models; the more parameters both models have, the smaller (though still significant) the WER difference. Another point to note from Table 5.2 is the effect of the number of output dimensions of the dimension-reducing transform. For 8th-order features, using 45 output dimensions gives a WER of 19.0%, while increasing the dimensions to 90 decreases the WER to 18.2%. This shows that as the order of the polynomial features gets larger, it is beneficial to use projective transformation matrices with a larger number of rows. In addition to EPPS English, similar experiments have been performed for the acoustically more difficult QUAERO English 50 hours task. These results are summarized in Table 5.3; the input features are the same as for EPPS, and the results lead to the same conclusions as Table 5.2. In the EPPS table, the systems contained either polynomial features or mixture density acoustic models. One additional aspect of Table 5.3 is the result in the last row, which combines polynomial features with a mixture-based acoustic model.
Table 5.4: Training corpus: QUAERO English 50h. Combination of state-of-the-art DNN with different techniques discussed in this work

  Model type                   Training criterion   No. of params.   WER (%) dev   WER (%) eval
  GMM                          ML                   53M              24.3          31.2
  DNN 6-layer Hybrid           cross-entropy        7.9M             18.6          24.9
  DNN 5th-layer + LLMM         cross-entropy        42M              18.6          24.7
  DNN 6th-layer + polynomial   MPE                  15M              18.2          24.4
  DNN 6th-layer + log-linear   MPE                  7.9M             18.3          24.3
  DNN 5th-layer + LLMM         MPE                  42M              18.0          24.0
For the original mixture density model with MFCC features (4th row of Table 5.3), the best WER is 30.9%, with 128 densities per CART state. Going to a higher number of densities per state causes the model to overtrain, because the training corpus is relatively small (50 hours). If the mixture densities are trained on 4th order polynomial features, as in the last row of the table, the WER drops to 30.2% with just 32 densities per state, because of the better available input features. This shows that the results obtained by dimension-reduced polynomial features with log-linear mixtures are significantly better than those of full-rank polynomial features, reduced-rank polynomial features or log-linear mixtures alone.

Table 5.4 shows some results of combining state-of-the-art deep neural network (DNN) bottleneck features with the log-linear modelling approaches researched within the framework of this thesis. The baseline DNN system is the same as in [Wiesler & Richard+ 14]. A brief introduction to deep neural networks and linear bottleneck features can be found in Chapter 6; for now it is sufficient to refer to them as 256-dimensional input features. This DNN has 6 sigmoid layers with a linear bottleneck of 256 nodes after each sigmoid layer. The network is trained with the cross-entropy criterion, which is the same as frame-MMI for log-linear training. In the table, “DNN 6th-layer” means that the feature input for the log-linear model is the output of the last linear bottleneck. Similarly, “DNN 5th-layer” means the output of the linear bottleneck after the 5th sigmoid layer. It can be seen that a combination of DNN features and a polynomial layer (4th row in the table) improves the WER by 0.5% absolute (eval). Some internal experiments on this corpus at our institute show that adding a 7th or 8th sigmoid layer after six layers of DNN does not bring any WER improvement; thus the WER improvement from adding the polynomial layer can be genuinely attributed to its difference from the sigmoid layers. Also, MPE training of an LLMM acoustic model on top of 5th-layer DNN features (6th row in the table) is 0.3% better than if only the last softmax layer is MPE trained (5th row in the table).
5.5.3
Effect of Additional Parameters
If full second order polynomial features without dimension reduction were used, it could be argued that the improvement in the word error rate is coming from the large number of additional parameters.
Figure 5.4: EPPS English dev2007: WER (%) vs. no. of parameters for higher-order polynomial features in comparison with mixture densities. (Curves: ML (mixtures), MMI (mixtures), MMI (polynomial with dim. reduction), MMI 2nd-order polynomial (full); number of parameters shown on a logarithmic axis from 10^5 to 10^7.)
Instead of (d + 1) × S parameters λ_{s,d} and α_s as for the first-order MFCC system, we would now have ((d² + 3d)/2 + 1) × S. Here d is the number of dimensions of λ_s and S is the total number of CART states. However, in our experiments we use a second transformation matrix for dimension reduction, so the number of additional parameters is very small: there are (d + 1) × S lambda parameters as before, plus (d² + 3d)/2 × d transformation matrix elements. This means just a 23% increase in the number of parameters for single densities and a mere 1.4% increase for 16 densities per state. Secondly, this means a negligible overhead at recognition time, as the additional parameters are in the feature extraction phase. During recognition, the feature extraction phase is far less computationally intensive than the beam search phase, so moving parameters from the acoustic model into the feature extraction part will invariably speed up recognition. Figure 5.4 shows the comparison between mixture densities and polynomial features. It can be seen from the figure that increasing the number of parameters by multi-layer polynomial features is more parameter efficient than mixtures.
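As a quick check of these figures, the following back-of-the-envelope computation reproduces the quoted increases, assuming d = 45 and roughly S ≈ 4500 CART states (S is not stated explicitly here; it is inferred from the 213k single-density parameter count in Table 5.2):

```python
d = 45                        # feature dimensions
S = 4500                      # assumed number of CART states (approximate)

single_density = (d + 1) * S                 # lambda_s and alpha_s, first-order features
poly_dim = d * (d + 3) // 2                  # (d^2 + 3d)/2 = 1080 augmented feature terms
transform = poly_dim * d                     # elements of the dimension-reducing matrix

print(transform / single_density)            # ~0.23  -> ~23% extra for single densities
print(transform / (16 * single_density))     # ~0.015 -> ~1.4% extra for 16 densities per state
```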
Figure 5.5: EPPS English dev2007 with second order features: Comparison of the discriminatively trained log-linear model versus the Gaussian mixture model obtained from it. (WER (%) vs. number of parameters; curves: maximum likelihood, Gaussian discriminative, log-linear discriminative; L = number of densities per state.)
5.5.4
Effect of Unconstrained α s
Another interesting property of log-linear models is the unconstrained nature of the constant parameter α_s. For single densities and pooled covariance, α_s is initialized from the Gaussian model by the following equality:

\alpha_s = -\frac{1}{2}\mu_s^\top \Sigma^{-1} \mu_s + \log p(s)        (5.30)

This equality implies a dependence of α_s on the λ_s parameters (since λ_s = Σ^{-1} µ_s). However, during log-linear training the α_s are optimized as free parameters too. This means that they can deviate from the equality if this increases the value of the objective function. Initialization of α_{s,l} for mixtures is done as

\alpha_{s,l} = -\frac{1}{2}\mu_{s,l}^\top \Sigma^{-1} \mu_{s,l} + \log c_{s,l} + \log p(s)        (5.31)
under the constraint that the sum of the weights of all densities is 1, i.e.

\forall s: \quad \sum_l c_{s,l} = 1        (5.32)
Equivalently, this puts a constraint on α_{s,l}, which may no longer hold after log-linear training. So when the LLMM is converted to a GMM and Equality (5.32) is enforced again by normalization, some information is lost. Therefore, although theoretically the Gaussian and log-linear posterior models are equivalent, they may not be equal at recognition time, because for recognition the priors are calculated from the language model and not from state priors. To test the effect of this extra degree of freedom, the discriminatively trained log-linear models were converted back to Gaussian models and recognition was done using these models. During this Gaussian conversion, the mixture weights c_{s,l} were normalized so that their sum was 1. Experiments show that for single densities this conversion had a significant effect: the newly converted Gaussian models performed about 4% absolute worse than the corresponding log-linear models. This is illustrated in Figure 5.5. The reason is that for a small number of densities per state, a free α_s can significantly alter the decision boundaries. On the other hand, for the acoustic model with 16 densities per state, there was no significant WER difference between the log-linear model and the converted Gaussian model. It is important to note that it is possible to keep the GMM-LLMM equivalence for recognition if, after converting the LLMM to GMM, the mixture weights of the GMM are not normalized. Recall that a GMM is required if we want to perform MLLR on the acoustic model. The results presented in Table 4.1 are obtained by converting the LLMM to GMM (without normalization) and then doing MLLR; the WER in this case remains the same for LLMM and GMM.

Table 5.5: Training corpus: QUAERO English 50h. Dimension reduction for two-layer log-linear training, each layer of size 4501
  Layer 1        Layer 2      WER (%) dev   WER (%) eval
  log-linear     -            35.4          43.0
  mixture × 32   -            24.7          31.5
  log-linear     log-linear   27.4          34.5
  mixture × 32   log-linear   24.8          31.8

5.6 Dimension Reduction for Multi-layer Log-linear Training
The third use case of log-linear feature dimension reduction is its application to a two-layer log-linear acoustic model. This is similar to a two-layer MLP network where the hidden layer is a sigmoid and the output layer is a softmax; for the multi-layer log-linear case, both the hidden and the output layers are softmax. Another key difference is that no backpropagation is involved. We chose not to incorporate it so that our setup retains the layer-wise convexity property of a regular log-linear model. For each new layer to be trained, the output of the previous layers is regarded as the input features. On top of these features, the current
layer can be log-linearly trained to a global optimum regardless of the initial guess. If back-propagation were involved, the experiment would lose its log-linear character and become similar to a multilayer perceptron network. As such, our multi-layer log-linear experiments can be thought of as forming a connection between the log-linear and MLP-based acoustic models. In the presented two-layer log-linear model, both layers consist of a linear transform followed by a softmax activation. Therefore, a dimension-reducing transform between these two layers can reduce the number of parameters, while at the same time allowing us to split the output log-linear layer discriminatively into log-linear mixtures.

Experiments for the two-layer log-linear network are shown in Table 5.5. It was observed that in absolute terms this two-layer log-linear network has a worse WER than a single-layer full mixture (LLMM) acoustic model. This could be because, in our case, there is no back-propagation as in standard MLP training. Secondly, both layers have been trained to optimize the same set of labels (CART states) and therefore they represent the same transformation twice, causing redundancy. In another experiment, the first layer has mixtures (32 densities per state) and the second layer is log-linear. The WER is the same as that for a single mixture layer. This case can be thought of as a linear transformation between the two layers, because a mixture layer is essentially a non-linear (softmax in this case) layer followed by a linear transformation (summation) layer. Nevertheless, this experiment provides motivation to try similar dimension reduction techniques for regular and deep MLP training where the hidden layers have a sigmoid activation function; in other words, one could create mixtures in sigmoid hidden layers. Such experiments and results are reported in Section 6.3.
5.7 Conclusion
In this chapter, different generative and discriminative techniques for linear transformation training are reviewed. After this, Section 5.3 explores the application of log-linear training to the transformation matrix parameters. Like the training of acoustic model mixture (or single density) parameters, log-linear training of transformation matrix parameters can also benefit from the convex optimization property. The caveat, though, is that the objective function is convex either with respect to λ or with respect to the matrix parameters, but not jointly. In practice, however, both parameter sets can be optimized alternately. In this chapter three different types of transformation matrices are optimized. Firstly, an LDA-initialized direct feature transformation was optimized. This training gave a small improvement for single densities; the improvement, however, disappeared after full mixture density training. Secondly, the training of multilayer polynomial features was tested, with dimension reduction after each squaring of the input features. This is a promising application of log-linear transformation matrix training. It was shown that higher-order features are more parameter-efficient than mixtures in the acoustic model. Thirdly, a two-layer log-linear model was trained with a log-linear mixture model as the first layer (representing a log-linear layer, then a linear transformation, then another log-linear layer). While not giving any WER gain, this experiment provided motivation for a mixture-like structure in the sigmoid hidden layers of deep neural networks.
Chapter 6

Mixture Density Splitting

For a long time the standard technique for acoustic modelling has been the hidden Markov model (HMM) based Gaussian mixture model (GMM). The GMM is commonly trained by a maximum likelihood (ML) training procedure. Using an initial ML-trained GMM and the correspondingly created acoustic-feature-to-HMM-state alignment, the GMM can be further trained with a discriminative criterion such as MMI or MPE. However, a more direct approach would be to train the acoustic model completely with a discriminative training procedure. Some discriminative acoustic models such as maximum entropy Markov models (MEMM) [McCallum & Freitag+ 00] and log-linear models [Heigold & Lehnen+ 08] have been proposed in the literature. These acoustic models can be trained completely with a discriminative criterion, utilizing ML training only for the feature-to-HMM alignment. The log-linear mixture model (LLMM) also incorporates mixtures like a standard GMM, and due to its equivalence with the GMM (for first-order features and pooled covariance), it has been initialized from GMM models. In this work we show that an LLMM can be estimated directly, without initialization from a GMM. This is done by discriminative splitting of the LLMM parameters during discriminative training. More recently, multilayer perceptrons and deep neural networks have become increasingly popular for modelling the emission posterior probabilities of HMM phone states [Hinton & Deng+ 12]. Since the log-linear model has a structure similar to an MLP network (it is equivalent to a layer with a softmax activation function), we aim to extend some of the results of LLMM discriminative splitting to deep neural networks. Experiments show that discriminative splitting can be used to estimate a sparse linear transformation between two layers of a deep MLP, whose performance compares favourably with other procedures. Secondly, this transformation can be used to initialize a non-sparse linear transformation, which can be employed as an effective pre-training method.
6.1 Overview of Techniques for Mixture Density Splitting

6.1.1 K-means clustering
k-means is a vector quantization method from the signal processing domain [Lloyd 82], also widely used in data mining [MacQueen 67]. The objective of k-means clustering is to partition t observation vectors into L clusters, such that each observation is associated with its nearest mean. This symbolism (i.e. L instead of k) is used to ensure notational consistency with Section 1.1.2.3. The problem is NP-hard and therefore a globally optimal solution is not guaranteed. Given a set of observations x = x_1, x_2, ..., x_t and L clusters c = c_1, c_2, ..., c_L, the solution is a set c that minimizes
$$\arg\min_{c} \sum_{l=1}^{L} \sum_{x \in c_l} \| x - \mu_l \|^2 \qquad (6.1)$$
where µ_l is the mean of the observations in cluster c_l. The algorithm operates in two steps. In the assignment step, each observation is assigned to the cluster whose mean has the least Euclidean distance to it. In the update step, new means for each cluster are calculated from the observations currently assigned to that cluster. Starting from a predefined number of clusters L and randomly initialized mean vectors, the clustering algorithm is guaranteed to eventually converge. The solution, however, depends on the initialization. For speech recognition, k-means clustering can be used to initialize more complex algorithms [Renals & Bell 13], such as the expectation maximization algorithm of the next section.
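A minimal NumPy sketch of the two alternating steps, assuming plain Euclidean distances, random initial means and a fixed number of iterations:

```python
import numpy as np

def kmeans(x, L, iters=20, seed=0):
    """Plain k-means: alternate assignment and mean update for objective (6.1)."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=L, replace=False)]      # random initial means
    for _ in range(iters):
        # assignment step: nearest mean in Euclidean distance
        d = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute each mean from its currently assigned observations
        for l in range(L):
            if np.any(labels == l):
                mu[l] = x[labels == l].mean(axis=0)
    return mu, labels

x = np.random.default_rng(1).normal(size=(1000, 2))
mu, labels = kmeans(x, L=4)
print(mu.shape)   # (4, 2)
```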
6.1.2 Splitting in Expectation-Maximization
The expectation maximization (EM) algorithm is used for calculating maximum likelihood estimates of a model where the equations cannot be solved directly. Like the above-mentioned k-means algorithm, it converges to a locally optimal solution. The EM algorithm is described along with the relevant equations in Section 1.1.5.1. In the current section the splitting method used in the RWTH ASR toolkit [Rybach & Gollan+ 09] in conjunction with the EM algorithm is discussed. Let observations x belong to classes s (in this case triphone states). The initial feature vector to state assignment is obtained by aligning the feature vectors to the triphone states. From this alignment, means µ_s and variance vectors σ²_s are calculated (a variance vector implies diagonal covariance). Thus the procedure begins with single Gaussian densities, and afterwards the following steps are repeated until the desired resolution is achieved.

• splitting: generate two mean vectors from each mean vector

$$\mu_{s,l}^{+} = \mu_{s,l} + \epsilon \cdot u \; ; \quad \mu_{s,l}^{-} = \mu_{s,l} - \epsilon \cdot u \qquad (6.2)$$

for a small scalar ε and some suitable direction vector u. This step is shown graphically in Figure 6.1.
Figure 6.1: Graphical depiction of mixture density splitting for the expectation-maximization algorithm

• parameter estimation: All observation vectors belonging to class s are assigned to the mean µ_{s,l} having the highest Gaussian probability. The means µ_{s,l} and variances σ²_{s,l} are recalculated from their currently assigned observations. (For the RWTH toolkit there is a pooled variance for all states.)

• realignment (optional): Based on the newly calculated Gaussian mixture parameters, the feature vector to state alignment can be refined.
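A small sketch of the splitting step, assuming a random unit vector as the "suitable direction" u; the subsequent parameter estimation step proceeds analogously to the k-means update above (with a pooled covariance, assigning to the nearest mean coincides with assigning to the highest Gaussian probability).

```python
import numpy as np

def split_means(mu, eps=0.1, seed=0):
    """Split every mean into mu + eps*u and mu - eps*u (Eq. 6.2). Here u is a random
    unit vector per mean, a common stand-in for 'some suitable direction'."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=mu.shape)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return np.concatenate([mu + eps * u, mu - eps * u], axis=0)

mu = np.zeros((1, 39))              # single starting density for one state
for _ in range(3):
    mu = split_means(mu)            # 1 -> 2 -> 4 -> 8 densities per state
print(mu.shape)                     # (8, 39)
```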
6.1.3 Other Techniques
Gaussian parameter splitting may also be accomplished discriminatively to obtain better-fitting models, as in [Normandin 95], where results on a digit recognition task are presented. The emphasis there is to retain the performance of a good system while successively reducing the number of parameters. In [Valtchev & Odell+ 97], a measure of classification error is used to determine the candidate densities to be split. In [Schlüter & Macherey+ 99], mixture densities are split discriminatively and then further trained by ML estimation.
6.2 Discriminative Splitting of Log-linear Mixture Models
Log-linear training is only convex for a single density per state s. For log-linear mixture model (LLMM) training this presents challenges, as the initial guess is very important and can influence the final values of the objective function and the WER. Therefore we need a method to specify a better initial guess for training mixture densities, so that the word error rate is at least as good as the WER of a similar but less complex model. To solve this problem we adopt an approach similar to the iterative density splitting algorithm used in the maximum likelihood framework, but applied to the log-linear parameters λ_{s,l} instead of the means as in the Gaussian mixture case.
Figure 6.2: Flow diagram of discriminative training and splitting process for log-linear mixture model
All the λ_{s,l} in state s are duplicated and a small offset is added to both new λ's to pull them apart. The log-linear model is covariance normalized; therefore the direction of the offset is not important. Subsequent training of this newly split model increases the objective function as the new λ's discriminatively adapt themselves to the training data. This successive discriminative training and splitting can be repeated several times until the desired model resolution is achieved. A flow diagram of the training process is shown in Figure 6.2. The initial alignment between the training acoustic data and its transcription is obtained by training a generative Gaussian ML system with mixture densities. This alignment is kept fixed during the later discriminative training stage, as it was found experimentally that updating it had virtually no effect on the optimization procedure. The single density Gaussian acoustic model is initialized by maximum likelihood training. This model is converted to log-linear form and trained to optimize the frame-level MMI criterion. We choose frame-level MMI because it guarantees a global maximum for single densities. Once the single density optimization has converged, we split the model and hence double the number of parameters. When this in turn has converged, we split it again. This process is repeated until we reach (for example) 128 densities per triphone state. During the course of this process a steady increase in the objective function value is observed.
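The following sketch outlines this split-and-train loop in NumPy; train_mmi is a hypothetical placeholder for the frame-level MMI optimization (e.g. with RPROP) and is not implemented here.

```python
import numpy as np

def split_lambdas(lam, offset=1e-3, seed=0):
    """Duplicate every lambda_{s,l} and pull the two copies apart by a small offset.
    Since the log-linear model is covariance normalized, the direction is arbitrary."""
    rng = np.random.default_rng(seed)
    d = offset * rng.normal(size=lam.shape)
    return np.concatenate([lam + d, lam - d], axis=1)   # doubles densities per state

def train_with_splitting(lam, train_mmi, target_densities=128):
    """lam has shape (states, densities_per_state, feature_dim); train_mmi(lam)
    stands in for optimizing the current resolution until convergence."""
    while lam.shape[1] < target_densities:
        lam = train_mmi(lam)        # frame-level MMI training at the current resolution
        lam = split_lambdas(lam)    # then double the number of densities per state
    return train_mmi(lam)
```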
6.2.1 Maximum Approximation
MMI optimization on a large training corpus can be computationally expensive. A remedy is to use the maximum approximation for the optimization of mixture densities. This means that for each p(x|s) the score of the highest-scoring density is used instead of the sum over all densities. In practice this was found to be detrimental to the optimization process. When the maximum option is enabled, only those feature vectors that lie closer to λ_{s,l} than to all other λ_{s,l′} contribute to its partial derivatives. Therefore, if a particular λ_{s,l} strays away from the solution due to a large step size, it will not be pulled back towards the solution, because no feature vectors contribute to its partial derivatives. This leads to discontinuities in the partial derivatives. For this reason we calculate the sum over all densities where possible. However, for a very large number of λ_{s,l} parameters per state (for example 16), calculating the sum is no longer feasible due to its computational requirements. Therefore, in that case we have to resort to the maximum approximation. With proper limiting of the RPROP step sizes to increase its robustness, it can still give reasonable gains in the objective function.
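The difference between the two options can be illustrated with a small sketch of the per-state mixture score in the log domain; it is only meant to show the sum versus maximum choice, not the actual toolkit implementation.

```python
import numpy as np

def logsumexp(a, axis=-1):
    """Numerically stable log of a sum of exponentials."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def mixture_log_score(log_density_scores, use_max=False):
    """log p(x|s): full sum over the densities of the mixture, or the maximum
    approximation that keeps only the best-scoring density."""
    if use_max:
        return log_density_scores.max(axis=-1)
    return logsumexp(log_density_scores, axis=-1)

scores = np.log(np.array([[0.6, 0.3, 0.1]]))   # toy per-density likelihoods of one state
print(mixture_log_score(scores), mixture_log_score(scores, use_max=True))  # 0.0 vs log(0.6)
```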
6.2.2 Experiments and Results
The speech recognition task for these experiments is the QUAERO English 50h corpus, as described in Appendix C. A single density log-linear acoustic model is successively split into a mixture model. Up to 32 densities per state there is no overtraining and therefore the WER corresponds to the objective function. However, beyond this point the system starts to overtrain (as it is only a 50 hour corpus), and so a limited number of iterations is performed while going from 32 to 64 densities and then from 64 to 128 densities. The number of these limited iterations is determined by validating the WER on the development corpus.
Figure 6.3: EPPS English dev2007: Ascent of the MMI objective function versus the number of training iterations. The density splitting events are marked by +.

To test the effectiveness of discriminative splitting, we take an ML-trained Gaussian model already split into 256 densities per state, and train it further discriminatively. This is a model where only the training of λ_{s,l} is done discriminatively, and no splitting is done in between. The WER starts to degrade with training, because of overtraining. Then we try smaller models, i.e. 128/64/32 densities per state, and the WER is as good as that of ML training but with 87% fewer parameters. It is, however, still 0.3% worse than the WER with discriminative splitting. This result is shown in Table 6.1.

Table 6.1: Training corpus: QUAERO English 50h. Comparison of ML-split and discriminatively split log-linear mixture models
Training criterion | Splitting criterion | WER (%) dev | WER (%) eval
ML                 | ML                  | 24.3        | 31.2
MMI                | ML                  | 24.3*       | 31.2*
MMI                | MMI                 | 23.9        | 30.9

* same WER with 87% less params but no WER improvement
A second set of experiments has been performed on the EPPS English task (see Appendix C). These experiments have been reported in [Tahir & Schlüter+ 11b].
Figure 6.4: EPPS English dev2007: Comparison of WER of discriminatively split and ML split log-linear models
Figure 6.5: EPPS English dev2007: Comparison of WER of discriminatively split and ML split log-linear models, with SAT MLLR and CMLLR
The initial alignment between the training acoustic data and its transcription is obtained by training a generative Gaussian ML system with 256 densities per triphone state. This alignment is kept fixed during the later stage of discriminative training. Starting from a single density acoustic model, discriminative training and splitting is done until we reach 64 densities per triphone state. During the course of this process a steady increase in the objective function value is observed. For up to 8 densities per state we use the sum over all the mixture parameters λ_{s,l}. Since the computation time doubles with each doubling of the resolution, for 16 densities it becomes prohibitive, i.e. 20 hours for a single iteration. So from here onwards we switch to the maximum approximation, and set limits on the step sizes of the RPROP procedure to increase its robustness. Figure 6.3 illustrates the progress of the objective function against the number of iterations. The blue * marks on the figure represent the points where splitting has been done and consequently the number of λ has doubled. The graph shows a consistent gain in the objective function, even for a large number of densities per state. Also, when the objective function optimization starts to converge for a particular mixture resolution, splitting and further training cause a sharp increase in the slope of the objective function versus the number of iterations. Looking at the graph it seems that the trend would hold if we continued splitting the log-linear models further. However, the usefulness of this consistent ascent of the objective function is limited by overtraining, which starts to occur at high mixture resolutions. This overtraining can be avoided by recognizing the development corpus with the acoustic model obtained after each splitting and MMI training step, and stopping at the point where the WER starts to increase again. This approach is used in the scope of this thesis. Alternatively, at each splitting step one can calculate the objective function on some held-out data that is excluded from the discriminative training procedure. If the objective function on this held-out data starts to decrease again, the training can be stopped or the step sizes can be reduced. This approach is called cross-validation. As shown in Figure 6.4 and Figure 6.5, the WER differences between single density maximum likelihood and discriminative training are quite large. However, as the number of densities increases, the difference between the two is reduced. For 64 densities per state this difference is 0.7% absolute without SAT and 0.5% with SAT, small but still significant in relative terms. To compare ML-based and discriminative splitting, we take a maximum-likelihood-split model of 64 densities per state and train it discriminatively. For this model only the training of λ_{s,l} is done discriminatively, and not the splitting. This way we only get a 0.2% improvement over the ML model without SAT and a 0.1% improvement with SAT. This improvement is significantly smaller than what was obtained by integrated splitting and training. A possible reason is the higher susceptibility of such an approach to getting stuck in a local optimum. In addition to the aforementioned experiments, some results are presented in [Tahir & Nußbaum-Thom+ 12], where a discriminatively split log-linear mixture model is trained with the MPE criterion. This results in a better WER as compared to an ML-split mixture model also trained with the MPE objective function.
6.3 Discriminative Splitting for Deep Neural Networks

6.3.1 Deep Neural Networks
Artificial neural networks (ANN) have become an important tool for creating probabilistic features for speech recognition. A neural network for speech recognition consists of a multilayer perceptron (MLP), having non-linear activation functions in the hidden and output layers. Some earlier works exploring the use of ANNs or MLPs for speech recognition are [Peeling & Moore+ 86, Bourlard & Wellekens 87, Waibel & Hanazawa+ 89]. These were complex systems aiming to model the whole speech recognition process by ANNs, but were not able to outperform GMM-HMM based approaches. More recently, there are two ways of applying neural networks for acoustic modelling: hybrid and tandem MLPs.

• A hybrid MLP system [Seide & Gang+ 11, Dahl & Deng+ 12] directly uses the posterior probabilities of the MLP network as acoustic model probabilities. The probabilities correspond to clustered triphone (CART) states.

• A tandem MLP system [Hermansky & Ellis+ 00] has a bottleneck layer as the last output layer of the network, and then a regular GMM-based maximum likelihood acoustic model is trained on top of it. This makes it easier to use GMM-based concepts for optimization such as linear discriminant analysis (LDA) and speaker adaptation techniques like maximum likelihood linear regression (MLLR) and constrained maximum likelihood linear regression (CMLLR).

The concept of neural networks is inspired by the human brain, where millions of neurons are connected to each other and information processing causes changes in the connections and weights between these neurons. Formally, an ANN is a set of neurons linked together by weighted connections. A neuron or node consists of an input activation z_j and an output activation y_j. The input activation is a weighted combination of the outputs x_i of the nodes in the previous layer, plus a bias constant α_j. The output activation of the node is a non-linear transformation applied to the node input:

$$z_j = \sum_i \lambda_{i,j} \cdot x_i + \alpha_j \; ; \qquad y_j = \sigma(z_j) \qquad (6.3)$$
There are two popular activation functions, among others, that are used for the hidden layers of ANNs; the sigmoid:

$$y_j = \frac{1}{1 + e^{-z_j}} \qquad (6.4)$$

and the rectified linear unit (ReLU):

$$y_j = \max(0, z_j) \qquad (6.5)$$
For the last output layer of the network, a normalized softmax activation function is used, because the outputs are to be interpreted as probabilities:

$$y_j = \frac{e^{z_j}}{\sum_i e^{z_i}} \qquad (6.6)$$

Figure 6.6: Example of a multilayer-perceptron with one hidden (sigmoid) layer
In case a layer has the identity activation, i.e. y_j = z_j, it is called a linear layer. For all MLP experiments in Section 6.3.4, either sigmoid or ReLU activation is used for the hidden layers and softmax for the output layer neurons. For the linear (bottleneck) layers, the activation is the identity. Figure 6.6 shows an example of an MLP with an input, one hidden and an output layer. For MLP training on a classification task with inputs x_n belonging to classes c_n, there are two error functions that can be minimized. The first is the squared error

$$E_n = \frac{1}{2} \sum_{k=1}^{K} \left( y_k - \delta(k, c_n) \right)^2 \qquad (6.7)$$

where E_n is the local error of input x_n and y_k is the corresponding network output of the k'th node. δ(k, c_n) is the Kronecker delta, which is 1 only when k = c_n. The second error criterion is the cross-entropy

$$E_n = - \sum_{k=1}^{K} \delta(k, c_n) \ln(y_k) \qquad (6.8)$$
The error of the last layer is back-propagated through the network, based on the connections of each node to each node in the previous layer. When the errors of all nodes in all layers are known, the connection weights can be updated based on the error gradients. Details of this process and the derivations can be found in [Plahl 14]. A deep neural network refers to an MLP with several non-linear hidden layers, which can be as many as six or more. The use of deep neural networks has become state-of-the-art for acoustic modelling in the last few years [Hinton & Deng+ 12, Dahl & Deng+ 12, Sainath & Kingsbury+ 11]. In this work the application of discriminative splitting to the hidden layers of an MLP is discussed, which implies that these concepts can be used for tandem as well as hybrid approaches.
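As an illustration of Equations (6.3) to (6.8), the following is a minimal NumPy sketch of the forward pass of an MLP with one sigmoid hidden layer, a softmax output layer and the cross-entropy error. The layer sizes follow Section 6.3.4; the random weights, batch of inputs and target labels are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):                      # Eq. (6.4)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                      # Eq. (6.6)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    """One sigmoid hidden layer and a softmax output layer, cf. Eq. (6.3)."""
    h = sigmoid(x @ W1 + b1)
    return softmax(h @ W2 + b2)

def cross_entropy(y, targets):
    """Eq. (6.8), averaged over the batch."""
    return -np.mean(np.log(y[np.arange(len(y)), targets] + 1e-12))

rng = np.random.default_rng(0)
D, H, K = 493, 1024, 4501            # input, hidden and output sizes as in Section 6.3.4
x = rng.normal(size=(8, D))
y = forward(x, rng.normal(size=(D, H)) * 0.01, np.zeros(H),
            rng.normal(size=(H, K)) * 0.01, np.zeros(K))
print(cross_entropy(y, rng.integers(0, K, size=8)))
```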
6.3.2 Linear bottlenecks for DNNs
Two consecutive layers of an MLP network that are fully connected may have redundancy in the structure. Many of the elements in a layer may have a negligibly small effect on the output of that layer. If the number of elements in a layer can be reduced by removing those redundancies without compromising classification performance, this can provide large decreases in the time and memory requirements of MLP training. Several methods have been proposed to achieve this compression. [Yu & Seide+ 12] have reduced the number of elements in the layers by removing close-to-zero weights and converting the matrices to an index-based representation. [Xue & Li+ 13] have factored the weight matrix into a product of two smaller matrices, providing parameter compression. They have reported encouraging results by performing a singular value decomposition (SVD) based factorization between the hidden layers. The error rate degrades at first, but after a full network training with back-propagation the classification performance of the MLP network is restored. [Wiesler & Richard+ 14] have proposed a training mechanism whereby a hidden layer and its low-rank factorization can be trained simultaneously from scratch. Apart from model parameter reduction, they report an added regularization benefit from this factorization. Thus the linear bottleneck can reduce over-training of the MLP network parameters.
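The SVD-based factorization idea can be sketched in a few lines. This is not the cited authors' implementation, merely a truncated SVD of a randomly generated weight matrix with an assumed bottleneck rank of 256:

```python
import numpy as np

def low_rank_factorize(W, r):
    """Factor a weight matrix W (m x n) into two smaller matrices A (m x r) and
    B (r x n) via a truncated SVD, so that W is approximated by A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]          # absorb the singular values into the left factor
    B = Vt[:r, :]
    return A, B

W = np.random.default_rng(0).normal(size=(2048, 2048))
A, B = low_rank_factorize(W, r=256)
print(A.shape, B.shape, np.linalg.norm(W - A @ B) / np.linalg.norm(W))
# parameters drop from 2048*2048 to 2*2048*256, at the cost of an approximation error
```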
6.3.3 Discriminative Splitting for DNNs
In this section a method for training a linear bottleneck between two MLP layers is investigated, which is inspired by the mixture density training of GMM acoustic models. For a GMM, initially a single density is trained for each tied context-dependent state. This density is then split iteratively into a successively larger number of mixture densities until the desired parameter resolution is achieved. The final class-conditional probability is a weighted sum over all the densities of that mixture. A similar approach has also been employed in Section 6.2 for the discriminative training of log-linear mixture models (LLMM), where state posterior probabilities have been successively split during discriminative training. Such a method can in principle also be used for the hidden layers of MLP networks: a hidden layer is trained, and then all nodes and their weight parameters are duplicated with some random offset. The duplicated and offset copies of each node are summed up into the original node, thus converting the original hidden layer into a linear bottleneck with a new, larger hidden layer before it. The feasibility of such a layer splitting method is investigated. This process is illustrated graphically in Figure 6.7. Mathematically, Equation (6.3) for the hidden layer becomes
$$z_j^{+} = \sum_i (\lambda_{i,j} + d \cdot \eta) \cdot x_i + \alpha_j + d \cdot \eta \; ; \qquad z_j^{-} = \sum_i (\lambda_{i,j} - d \cdot \eta) \cdot x_i + \alpha_j - d \cdot \eta \qquad (6.9)$$

$$y_j = \frac{\sigma(z_j^{+}) + \sigma(z_j^{-})}{2} \qquad (6.10)$$
for noise η with zero mean and unit standard deviation; d is a suitable scaling factor, which has been fixed to 0.001 in our experiments. After splitting, the neural network is trained further so that z_j^+ and z_j^- move away from each other and adapt themselves to the training data.
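A minimal sketch of this splitting step, assuming NumPy weight matrices: it duplicates each hidden node according to Equation (6.9) and builds the fixed sparse averaging matrix that realizes Equation (6.10). Before further training, the split network produces (up to the small offset d) the same output as the original layer, which is why the objective function is initially unchanged.

```python
import numpy as np

def split_hidden_layer(W, b, d=0.001, seed=0):
    """Duplicate every hidden node with a +/- d*noise offset (Eq. 6.9) and build the
    fixed sparse matrix that averages each pair into one bottleneck node (Eq. 6.10)."""
    rng = np.random.default_rng(seed)
    n_in, n_hid = W.shape
    eta_W = rng.normal(size=W.shape)
    eta_b = rng.normal(size=b.shape)
    # new sigmoid layer with 2*n_hid nodes, columns interleaved as (+, -) pairs
    W_split = np.empty((n_in, 2 * n_hid))
    b_split = np.empty(2 * n_hid)
    W_split[:, 0::2], W_split[:, 1::2] = W + d * eta_W, W - d * eta_W
    b_split[0::2], b_split[1::2] = b + d * eta_b, b - d * eta_b
    # sparse linear bottleneck: node l is fed only by its two parents 2l and 2l+1
    M = np.zeros((2 * n_hid, n_hid))
    cols = np.arange(n_hid)
    M[2 * cols, cols] = 0.5
    M[2 * cols + 1, cols] = 0.5
    return W_split, b_split, M

def forward_split(x, W_split, b_split, M):
    sig = 1.0 / (1.0 + np.exp(-(x @ W_split + b_split)))
    return sig @ M        # per node: (sigma(z_j+) + sigma(z_j-)) / 2, as in Eq. (6.10)

# sanity check: before further training, the split layer reproduces the original output
rng = np.random.default_rng(1)
W, b, x = rng.normal(size=(493, 1024)), np.zeros(1024), rng.normal(size=(5, 493))
y_orig = 1.0 / (1.0 + np.exp(-(x @ W + b)))
Ws, bs, M = split_hidden_layer(W, b)
print(np.max(np.abs(forward_split(x, Ws, bs, M) - y_orig)))   # small, of the order of d
```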
6.3.4 Experiments and Results
The speech corpus is the QUAERO English 50h corpus (Appendix C). The input MFCC feature vector length is 29. A window of 17 consecutive frames is concatenated instead of 9, and no LDA is performed. The MLP network therefore has 493 input features. The number of nodes in the output softmax layer is 4501 (the number of CART states). The hidden layers have a sigmoid activation function. The number of hidden layers and the number of nodes in each layer are varied during the course of these experiments. As presented in [Tahir & Wiesler+ 15], first the results for a single hidden layer are given, and then for the case of six hidden layers (the best system configuration).
Figure 6.7: Diagram of discriminative splitting of MLP hidden layer
MLP with one hidden layer

For the one hidden layer MLP, initially a network with 1024 hidden layer nodes (sigmoid) is trained. This initial training consists of the usual pre-training and then cross-entropy based training with back-propagation. This gives a WER of 22.7%. Then each node in the hidden layer is duplicated such that every two consecutive nodes sum up into one node in the following linear bottleneck layer. Thus we now have a new MLP network with two hidden layers: one sigmoid hidden layer with 2048 nodes and then a linear layer with 1024 nodes. This new linear layer contains only zeros and "0.5"s as its entries. In effect this means that the nodes in the sigmoid layer are very sparsely connected to the linear layer. Each node in the linear layer is fed by only two exclusive nodes of the sigmoid layer. A small random offset is added to each node's parameters, so the objective function (cross-entropy) remains roughly the same. The network is then trained further so that the increased-resolution model adapts itself to the training data. The WER decreases to 22.1%. Splitting the hidden layer further to 8192 nodes brings the WER down to 21.1%. This compares favourably with the WER of a hidden layer of size 8192 with no linear bottleneck, with only a fraction of the parameters. This is shown in Table 6.2. After splitting, one way to initialize the new parameters is to add small offsets to the sigmoid layer after duplication. Another way is to hold all other layers constant (not trainable) and to initialize the sigmoid hidden layer with random values. After one or more iterations of training exclusively this layer, the complete training of all layers is done together with back-propagation. In practice it was found that this did not cause a significant difference in the objective function or the WER as compared to initializing with small offsets. However, there was a difference in the number of iterations required, because the offset initialization is closer to the local minimum than random initialization.

Table 6.2: Training corpus: QUAERO English 50h. Input size 493 and output layer size 4501. Splitting of a single hidden layer of size 1024
size of hidden layer | size of linear (mixture) bottleneck | no. of params. | WER (%) dev | WER (%) eval
1024 | -             | 5.1M | 22.7 | 28.9
2048 | 1024          | 5.6M | 22.1 | 28.4
8192 | 1024          | 8.7M | 21.2 | 27.7
2048 | no bottleneck | 10M  | 21.8 | 28.4
8192 | no bottleneck | 41M  | 20.6 | 27.1
The linear mixture bottleneck as described above contains only one trainable matrix (the other matrix is sparse, consisting of only zeros and ones in a particular order, and hence has no parameters to be trained or stored). What if we convert this sparse matrix into a full matrix and train it further, using the sparse representation as the initial guess? Would it be able to perform better than a linear bottleneck initialized from discriminative pre-training? As shown in Table 6.3, a full linear bottleneck (two matrices) initialized from discriminative splitting has a 3.0% absolute better WER (eval) than a mixture bottleneck (with only one non-sparse matrix).
Table 6.3: Training corpus: QUAERO English 50h. Input size 493 and output layer size 4501. Splitting of a single hidden layer of size 128 and converting it to a full (non-sparse) matrix

size of hidden layer | size of linear BN | type of BN | initialization type | no. of params. | WER (%) dev | WER (%) eval
128  | no BN | -       | -            | 0.6M | 32.1 | 39.6
2048 | no BN | -       | -            | 10M  | 21.8 | 28.4
2048 | 128   | mixture | splitting    | 1.6M | 24.2 | 31.0
2048 | 128   | full    | splitting    | 1.9M | 21.8 | 28.0
2048 | 128   | full    | pre-training | 1.9M | 22.1 | 29.0
It is also 1.0% better than a full linear bottleneck initialized from discriminative pre-training. This shows that discriminative splitting can be used as an effective pre-training method in cases where linear bottlenecks are involved between the layers of a neural network. Furthermore, by using a smaller model as the initial guess, the number of training iterations could also be decreased as compared to random initialization, although this effect is not investigated in detail in the scope of this thesis.
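A small back-of-the-envelope check of the parameter counts, assuming the layer sizes of Table 6.3 and assuming that bias vectors are included in the counts:

```python
# Why the full bottleneck adds only about 0.3M parameters over the sparse "mixture" one.
D, H, B, K = 493, 2048, 128, 4501           # input, sigmoid layer, bottleneck, output sizes

mixture_bn = D * H + H + B * K + K          # only the sigmoid layer and the output layer
                                            # are trainable; the 0/0.5 matrix has no parameters
full_bn = mixture_bn + H * B + B            # densifying the bottleneck adds an H x B matrix

print(f"mixture bottleneck: {mixture_bn / 1e6:.1f}M, full bottleneck: {full_bn / 1e6:.1f}M")
# -> roughly 1.6M vs 1.9M, in line with the parameter counts in Table 6.3
```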
MLP with six hidden layers

Table 6.4: Training corpus: QUAERO English 50h. Input size 493 and output layer size 4501. Six sigmoid hidden layers and linear bottleneck size 256. Splitting has been done directly from a network of 6 × 256
initialization type | sigmoid layer size | linear BN size | no. of params. | CE objec. | WER (%) dev | WER (%) eval
pre-training | 2048 | 256 | 7.9M  | 2.02 | 18.2 | 24.0
splitting    | 2048 | 256 | 7.9M  | 2.02 | 18.0 | 23.8
pre-training | 4096 | 256 | 14.7M | 1.34 | 18.3 | 24.3
splitting    | 4096 | 256 | 14.7M | 1.63 | 17.9 | 23.6
The discriminative splitting method for one hidden layer (as in the previous section) can easily be extended to a deep neural network scenario. Table 6.4 shows the WER for a deep neural network with six hidden layers for the same task and configuration as in the previous section. In this table, CE objec. refers to the value of the cross-entropy objective function. Each hidden layer has 2048 nodes. There is a linear bottleneck of size 256 between every two hidden layers and between the last hidden layer and the output layer. The parameters of these linear bottleneck layers may be initialized by discriminative pre-training with random initialization.
Table 6.5: Training corpus: QUAERO 50h. Input size 493 and output layer size 4501. Six sigmoid hidden layers and linear bottleneck size 256. Splitting has been done step-wise by successive doubling and training, initializing from a network of 6 × 256

initialization type | sigmoid layer size | linear BN size | no. of params. | CE objec. | WER (%) dev | WER (%) eval
pre-training | 256  | -   | 1.6M  | 2.52 | 20.4 | 26.1
pre-training | 512  | 256 | 2.8M  | 2.39 | 19.3 | 24.9
splitting    | 512  | 256 | 2.8M  | 2.31 | 18.6 | 24.2
pre-training | 1024 | 256 | 4.5M  | 2.19 | 18.7 | 24.1
splitting    | 1024 | 256 | 4.5M  | 2.10 | 17.9 | 23.8
pre-training | 2048 | 256 | 7.9M  | 2.02 | 18.2 | 24.0
splitting    | 2048 | 256 | 7.9M  | 1.90 | 17.7 | 23.3
pre-training | 4096 | 256 | 14.7M | 1.39 | 18.5 | 24.4
splitting    | 4096 | 256 | 14.7M | 1.78 | 17.6 | 23.2
Alternatively, they may be initialized by first pre-training a small (narrow) MLP of 256 hidden nodes in each of the six layers without linear bottlenecks. This small network is then discriminatively split to 2048 nodes in each sigmoid layer, followed by a linear bottleneck of 256 nodes. The comparison between these two initialization types is shown in the first and second rows of results in Table 6.4. The 3rd and 4th rows correspond to a network twice as large: the number of sigmoid nodes in each of the six layers is 4096 and the size of the linear bottlenecks is still 256. The result in the 3rd row is initialized randomly in the same way as for the first row (pre-training). The result in the 4th row is obtained by splitting the network of the 2nd row once more to double its size. The noticeable result here is that the pre-trained network of 4096 nodes is slightly worse than the network of 2048 nodes, possibly because of over-training: the network in this case is too large in comparison to the amount of training data (50 hours). What is interesting, however, is that the split network of 4096 nodes has a WER that is even better than the one with 2048 nodes, both on the dev and the eval corpora. This shows that the splitting approach is less prone to overtraining than pre-training. It is also observed that the objective function of the pre-training case is lower than that of the splitting case, despite the worse WER. It could be that, due to the better initialization, the splitting network ends up in a better local optimum than the pre-training case, and therefore the former is safeguarded against over-training. Table 6.5 shows the results of another similar experiment. In the previous Table 6.4 the splitting was done directly from a network of 6 × 256 nodes, so that each node is duplicated 8 or 16 times (for a final size of 2048 or 4096 nodes, respectively). In Table 6.5, however, the splitting is done in a binary fashion. The 6 × 256 network is first split into 512 sigmoid nodes per layer and then trained until convergence. Then it is split into 1024 nodes per layer and again trained until convergence, and so on. It is seen that the WER in this case is even lower than for the previous table, and the total WER reduction is now 0.6% absolute (dev) and 0.8% (eval) as compared to discriminative pre-training.
Table 6.6: Training corpus: QUAERO 50h. Input size 493 and output layer size 4501. Six ReLU hidden layers and linear bottleneck size 256. Splitting has been done step-wise by successive doubling and training, initializing from a network of 6 × 256
initialization type | ReLU layer size | linear BN size | no. of params. | regularization | CE objec. | WER (%) dev | WER (%) eval
pre-training | 256  | -   | 1.6M | no  | 2.39 | 19.4 | 25.2
pre-training | 2048 | 256 | 7.9M | no  | 1.52 | 18.6 | 24.8
pre-training | 2048 | 256 | 7.9M | yes | 1.76 | 17.7 | 23.3
splitting    | 2048 | 256 | 7.9M | no  | 1.81 | 17.3 | 22.7
The results in Table 6.6 correspond to an experiment with ReLU as the activation function in the hidden layers. Other than this, it has the same settings as the previous Table 6.5 (with sigmoid). It can be seen that our discriminative splitting is equally applicable to the ReLU activation function as to the sigmoid. Furthermore, it can be seen from the table that the result of splitting without regularization is better than the result of conventional pre-training with regularization. This indicates that starting from a robust model and then splitting to the final size is an effective way to safeguard against overtraining.
6.4 Conclusion
In this chapter discriminative splitting is presented as an approach to increase the model resolution during the discriminative training process. For training a log-linear mixture model, instead of using ML-split mixture models one can perform the splitting during MMI training. Experiments show that this approach yields better results than discriminative training of ML-split models. Similarly, the resolution of a deep neural network may be increased by the discriminative splitting method. Experiments for the single hidden layer and six hidden layer cases show the potential of this approach as an alternative pre-training method for linear bottlenecks between MLP hidden layers.
Chapter 7

Scientific Contributions

Previous work has shown the usefulness of discriminatively trained log-linear acoustic models. Log-linear training, i.e. convex optimization, can also be used to train linear feature transformations. The principal task of this thesis is to explore the log-linear training of such feature transformations. This has been done for direct feature transformations as well as for high dimensional polynomial features and for multilayer log-linear training. Log-linear transformation training makes high dimensional polynomial features of greater than 2nd order computationally feasible. Another important aspect of this thesis is the discriminative splitting of log-linear models into conditional random fields with hidden variables. Thus the acoustic mixture models are trained as well as split during discriminative training, instead of using the splitting information from a previous maximum likelihood training step. Recently, the speech recognition community has shifted its focus to deep neural networks for acoustic modelling. Thus we also apply the discriminative splitting concept to DNNs and achieve encouraging results.
7.1 Main Contributions
The different techniques researched and developed within the scope of this thesis are listed below, along with their corresponding results and conclusions.

1. Training of log-linear mixture models by splitting

WER:
• mixture splitting during ML training, then frame-MMI training = 24.3% dev, 31.2% eval
• mixture splitting during frame-MMI training = 23.9% dev, 30.9% eval

Mixture splitting during discriminative training works better than using an ML-split mixture set for discriminative training. This is possibly because the same training criterion is used for both parameter training and splitting. This technique can also be applied to Gaussian or any other mixture densities.
2. Efficient exploitation of higher-order polynomial features

Discriminative training of multilayer high-order polynomial features becomes feasible due to the log-linear training of feature transformations. Higher-order polynomial features are more parameter-efficient than mixtures. Both systems are trained with the same frame-level MMI criterion.

WER:
• first-order features + mixtures = 23.9% dev, 30.9% eval (∼20 M parameters, best mixtures)
• 4th-order polynomial features + mixtures = 23.1% dev, 30.2% eval (∼27 M parameters)

This concept may also be applied to other very high dimensional features (like fMPE). Additionally, multi-layer training of high dimensional features is possible due to such discriminatively trained dimension reduction. This work was started before the use of deep neural networks became widespread for speech recognition. The multilayer polynomial network can be considered as an alternative deep architecture. An example of combining this approach with a DNN is shown in Chapter 5 (Table 5.4). Another aspect is that the multilayer polynomial features can be described as a special case of sum-product networks [Gens & Domingos 12]. In this respect our work can be described as a proof of concept of the applicability of sum-product networks to acoustic modelling.

3. Initializing a large deep neural network from a smaller network by splitting

The motivation is to obtain a better initialization of the MLP network, since DNN optimization is a highly non-convex problem. It is compared with standard pre-training, i.e. layer-by-layer random initialization.

WER:
• with random initialization = 18.2% dev, 24.0% eval
• with discriminative splitting = 17.6% dev, 23.2% eval

Discriminative splitting provides better WER results, especially for very wide networks that may otherwise be susceptible to over-training. However, further investigation is required to exploit the potential speed advantage of the better initialization.
7.2 Secondary Contributions
Apart from the main objectives of this thesis, some other ideas arising from the work and their conclusions are listed below.

1. The posterior forms of Gaussian and log-linear acoustic models are equivalent (for first-order features and pooled covariance). However, for recognition there is a difference, as the language model priors are used instead of the state priors. Recall that the state priors are also trained during log-linear training and thus contain discriminative information. This effect is more pronounced for single densities and becomes less important for mixture densities. For full mixtures, no appreciable WER difference was found for either frame-level or sentence-level training. This concept is explained in Section 5.5.4.
2. For multilayer log-linear features, mixtures in the hidden layer improve the WER.

• For a 2-layer network with mixtures in the first layer, the WER is 24.8%, as compared to 27.4% for the 2-layer single density case.
• However, there is no improvement for the multilayer mixture as compared to the single-layer mixture (WER = 24.7%).

Lesson: This experiment provides motivation for mixture-like hidden variables in DNN layers.
Appendix A

Mathematical Symbols

• θ : set of all acoustic model parameters
• t : time frame index
• r : speech utterance index
• x_1^T : sequence of acoustic feature vectors from time 1 to T, i.e. x_1 to x_T
• w_1^N : string of N hypothesized words, i.e. w_1 to w_N
• M_r : set of all possible word sequences hypothesized for utterance r
• s : HMM state
• µ_{s,l} : mean vector for state s and density index l
• c_{s,l} : mixture weight for state s and density index l
• A : linear feature transformation matrix
• a_{i,j} : element at row i and column j of the linear feature transformation matrix
• Σ : pooled Gaussian covariance matrix
• Λ : set of log-linear parameters
• λ_{s,l} : log-linear parameter vector for HMM state s and mixture density index l
• α_{s,l} : log-linear constant for HMM state s and mixture density index l
• I : identity matrix
Appendix B

Acronyms

• ASR : Automatic speech recognition
• CART : Classification and regression tree
• CE : Cross-entropy objective function
• CMLLR : Constrained maximum likelihood linear regression
• dev : Development corpus
• DNN : Deep neural network
• EPPS : European parliament plenary sessions corpus
• eval : Evaluation corpus
• fMPE : Feature-level minimum phone error
• GMM : Gaussian mixture model
• HMM : Hidden Markov model
• LDA : Linear discriminant analysis
• LLMM : Log-linear mixture model
• MCE : Minimum classification error
• ME : Maximum entropy
• MFCC : Mel-frequency cepstral coefficients
• ML : Maximum likelihood
• MLLR : Maximum likelihood linear regression
• MLP : Multilayer perceptron
• MMI : Maximum mutual information
• MPE : Minimum phone error
• MRASTA : Multi-resolution relative spectral transform
• MWE : Minimum word error
• PLP : Perceptual linear prediction
• RASR : RWTH speech recognition toolkit
• SAT : Speaker adaptive training
• TDP : Time distortion penalty
• VTLN : Vocal tract length normalization
• WER : Word error rate
• WFST : Weighted finite state transducer
Appendix C

Corpora and ASR systems

C.1 European Parliament Plenary Sessions (EPPS) English
The EPPS British English task consists of planned (non-spontaneous) speeches of the European Parliament under clean conditions. It is part of the TC-STAR project [Lööf & Gollan+ 07]. One source of variability is the presence of a number of non-native speakers. The input audio is sampled at 16 kHz and the initial input features are 16 MFCC features with VTLN warping plus an energy and a voicedness feature. These features are concatenated in a window of 9 frames (−4 to +4) and then transformed by LDA to a 45-dimensional vector. The lexicon consists of 60K words and the language model is a true-case 4-gram trained on 400M running words. A triphone-based classification and regression tree is computed which clusters the triphones into 4501 classes including silence. The baseline acoustic model is a Gaussian mixture density model with 256 densities per CART state. Apart from silence, the phoneme set also includes some non-voice phonemes like hesitation, noise and breath. The acoustic training data is 90 hours of audio (with about 40% silence ratio). Newer versions of this task contain more training data. The development and evaluation corpora are about 3 hours each. For speaker adaptation, MLLR and CMLLR are trained on the hypotheses of the first recognition pass and applied to the acoustic model and the input features, respectively, for the second recognition pass.
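For illustration, a minimal sketch of this front-end step (frame stacking followed by a linear projection): the projection matrix below is random and merely stands in for the actual LDA transform, and the per-frame dimension of 18 assumes 16 MFCCs plus the energy and voicedness features.

```python
import numpy as np

def stack_frames(feats, context=4):
    """Concatenate each frame with its +/- context neighbours (window of 9 frames)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

T, D = 100, 18                       # 16 MFCC + energy + voicedness per frame (assumed)
feats = np.random.default_rng(0).normal(size=(T, D))
stacked = stack_frames(feats)        # (100, 162)
A = np.random.default_rng(1).normal(size=(stacked.shape[1], 45))  # stand-in for the LDA matrix
reduced = stacked @ A                # (100, 45) features fed to the acoustic model
print(stacked.shape, reduced.shape)
```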
C.2 QUAERO English
The QUAERO British English task consists of news broadcasts, podcasts, debates, interviews etc. [Sundermeyer & Nußbaum-Thom+ 11]. The speaking style ranges from planned for broadcast news to conversational for interviews. The audio is sampled at 16 kHz, but some recordings contain telephone calls which are band-limited to 4 kHz. There are also some instances of multiple simultaneous speakers and background music/noise. There are two versions of this task used as a standard setup for GMM/log-linear/MLP comparison experiments at RWTH Aachen: a 50 hour training data task and a 250 hour training data task. The input features for the baseline setup are 16 MFCC features with VTLN warping. Nine such consecutive frames are concatenated and LDA-transformed to 45 dimensions. The lexicon contains 325K words and the language model is trained on 2G running words. There are 4501 CART triphone classes.
Figure C.1: Flow diagram of multilingual MLP features

For the MLP experiments the input features are slightly different from the baseline setup. Instead of 9 MFCC frames, 17 consecutive MFCC frames are concatenated and there is no LDA transformation after that. This feature arrangement was found to be better in terms of WER. The development and evaluation corpora (eval10 and eval11, respectively) are 3 hours each.
C.3 QUAERO Spanish
The QUAERO Spanish corpus [Shaik & Tüske+ 14] contains news broadcasts, podcasts and debates with a wide variety of noise, including background music, simultaneous speakers etc. The training audio data is 390 hours and also includes audio from other sources (TC-STAR, Translectures, Hub-4). The acoustic model of the baseline system uses 4501 triphone CART leaves. A pooled covariance is used. The lexicon contains 300k words and there is a 5-gram language model. The initial features are 16 MFCCs (vocal tract length normalized, including an energy feature), augmented with a single voicedness feature. Nine such consecutive frames are concatenated and then projected by a classical LDA matrix to 45 dimensions. These features are then concatenated with MLP features. During recognition, speaker adaptive training (SAT) with MLLR and CMLLR is applied. The QUAERO Spanish experiments are performed with two different MLP features as input, the details of which are given below.

Spanish MLP features

Three types of acoustic features are given as input to the neural network: MFCC, PLP and Gammatone. After augmenting with first and second order temporal derivatives, 33-dimensional
feature vectors each for MFCC and PLP and 31-dimensional feature vectors for Gammatone are formed. Nine consecutive temporal frames are concatenated, which altogether gives an 873-dimensional feature vector. The training is done with a feed-forward neural network of three layers. There are 4000 nodes in the hidden layer and 37 in the output layer (phoneme targets). After transforming by logarithm and PCA, a 15-dimensional feature vector is obtained, which is then concatenated with the LDA-transformed MFCC feature vector of 45 dimensions. More details of these MLP features can be found in [Plahl & Schlüter+ 11].

Multilingual MLP features

The multilingual MLP features are trained on an 800 hour corpus of French, English, German and Polish. The training method does not need to map the phonetic units of multiple languages to a common set. Bottleneck features are extracted by hierarchical MLP-based processing of the modulation spectrum. The input of the first MLP contains the fast modulation components of MRASTA filtering. The subsequent MLP is trained on the bottleneck output of the first MLP and the slow modulation part. The input features in both cases are augmented with the logarithm of the critical band energies. The number of nodes in the hidden layers is 2000 → 60 → 2000. The final layer has 1500 × 4 outputs, corresponding to the four languages. These are PCA-transformed and then concatenated with the LDA-transformed MFCC feature vector of 45 dimensions. This system is explained in detail in [Tüske & Schlüter+ 13]. A pictorial representation of this MLP network is shown in Figure C.1.
Bibliography [Aertsen & Johannesma+ 80] A.M.H.J. Aertsen, P.I.M. Johannesma, D.J. Hermes: Spectrotemporal receptive fields of auditory neurons in the grassfrog. II. Analysis of the stimulusevent relation for tonal stimuli, Biological Cybernetics, Vol. 38, pp. 235-248. 1980. [Anastasiadis & Magoulas+ 05] A.D. Anastasiadis, G.D. Magoulas, M.N. Vrahatis: New globally convergent training scheme based on the resilient propagation algorithm, Neurocomputing, Vol. 64, pp. 253-270. 2005. [Anderson & Bai+ 99] E. Anderson, Z. Bai, J. Bishop, J. Demmel, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. Sorensen: LAPACK User’s Guide, SIAM, 3rd edition, Philadelphia, PA, USA. 1999. [Axelrod & Goel+ 07] S. Axelrod, V. Goel, R. Gopinath, P. Olsen, K. Visweswariah: Discriminative estimation of subspace constrained Gaussian mixture models for speech recognition, IEEE Transactions on Speech and Audio Processing, Vol. 15, pp. 172-189. Jan. 2007. [Bahl & Brown+ 86] L.R. Bahl, P.F. Brown, P.V. Souza, R.L. Mercer: Maximum mutual information estimation of hidden Markov model parameters for speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.49-52, Tokyo, Japan. Apr. 1986. [Bahl & Brown+ 88] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer: A new algorithm for the estimation of hidden Markov model parameters, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 493-496, New York, NY, USA. Apr. 1988. [Bahl & Jelinek+ 83] L.R. Bahl, F. Jelinek, R.L. Mercer: A Maximum likelihood approach to continuous speech recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, pp. 179-190. 1983. [Baker 75] J.K. Baker: Stochastic modeling for automatic speech understanding, In D. R. Reddy, editor, Speech Recognition, pp. 512-542. Academic Press, New York, NY, USA. 1975. [Bakis 76] R. Bakis: Continuous speech word recognition via centisecond acoustic states, Acoustic Society of America Meeting, Washington, DC, USA. Apr. 1976. [Baum 72] L.E. Baum: An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes, Inequalities, Vol. 3, pp. 1-8. Academic Press, New York, NY. 1972. 97
[Bayes 1763] T. Bayes: An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society of London, 53:370-418. 1763. [Bellman 57] R.E. Bellman: Dynamic programming, Princeton University Press, Princeton, NJ, USA. 1957. [Bender & Macherey+ 03] O. Bender, K. Macherey, F.J. Och, H. Ney: Comparison of alignment templates and maximum entropy models for natural language understanding, In 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 11-18, Budapest, Hungary. Apr. 2003. [Bengio & Ducharme+ 01] Y. Bengio, R. Ducharme, P. Vincent: A neural probabilistic language model, Advances in Neural Information Processing Systems, Vol. 13, pp. 932938. 2001. [Berger 97] A. Berger: The improved iterative scaling algorithm: A gentle introduction, http://www.cs.cmu.edu/afs/cs/user/aberger/www/ps/scaling.ps. 1997. [Beulen & Ortmanns+ 99] K. Beulen, S. Ortmanns, C. Elting: Dynamic programming search techniques for across-word modeling in speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 609-612, Phoenix, AZ, USA. Mar. 1999. [Bisani & Ney 03] M. Bisani, H. Ney: Multigram-based grapheme-to-phoneme conversion for LVCSR, In Interspeech, pp. 933-936, Geneva, Switzerland. Sep. 2003. [Bourlard & Wellekens 87] H. Bourlard, C.J. Wellekens: Multi-layer perceptron and automatic speech recognition, In International Conference on Neural Networks (ICNN), Vol. 4, pp. 407-416, San Diego, CA, USA. Jun. 1987. [Breiman & Leo+ 84] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone: Classification and regression trees, Wadsworth & Brooks/Cole Advanced Books & Software. 1984. [Bub & Schwinn 96] T. Bub, J. Schwinn: VERBMOBIL: the evolution of a complex large speech-to-speech translation system, In International Conference on Spoken Language Processing (ICSLP), pp. 2371-2374 vol. 4, Philadelphia, PA, USA. Oct. 1996. [Cardin & Normandin+ 93] R. Cardin, Y. Normandin, E. Millien: Inter-Word coarticulation modeling and MMIE training for improved connected digit recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2, pp. 243-246, Minneapolis, MN, USA. Apr. 1993. [Chang & Zhou+ 00] E. Chang, J. Zhou, S. Di, C. Huang, K. Lee: Large vocabulary Mandarin speech recognition with different approaches in modeling tones, In International Conference on Spoken Language Processing (ICSLP), pp. 983-986, Beijing, China. Oct. 2000. [Chou & Juang+ 92] W. Chou, B.H. Juang, C.H. Lee: Segmental GPD training of HMM based speech recognizer, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 473-476, San Francisco, CA, USA. Mar. 1992.
[Chou & Lee+ 93] W. Chou, C.H. Lee, B.H. Juang: Minimum error rate training based on n-best string models, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2, pp. 652-655, Minneapolis, Minnesota, USA. Apr. 1993. [Chow 90] Y. Chow: Maximum mutual information estimation of HMM parameters for continuous speech recognition using the N-Best algorithm, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 701-704, Albuquerque, NM, USA. Apr. 1990. [Cooley & Tukey 65] J. Cooley, J. Tukey: An algorithm for the machine calculation of complex Fourier series, Math. Comput., 19, pp. 297-301. 1965. [Dahl & Deng+ 12] G.E. Dahl, D. Yu, L. Deng, A. Acero: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio Speech Lang. Process., Vol. 20, No. 1, pp. 30-42. 2012. [Darroch & Ratcliff 72] J.N. Darroch, D. Ratcliff: Generalized iterative scaling for log-linear models, Annals of Mathematical Statistics, Vol. 43, no. 5, pp. 1470-1480. 1972. [Dempster & Laird+ 77] A.P. Dempster, N.M. Laird, D.B. Rubin: Maximum-Likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, pp. 1-38. 1977. [Deselaers & Gass+ 11] T. Deselaers, T. Gass, G. Heigold, H. Ney: Latent log-linear models for handwritten digit classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, pp. 1105-1117. 2011. [Dijkstra 59] E.W. Dijkstra: A note on two problems in connection with graphs, Numerische Mathematik, 1:269-271, 1959. [Doddington 89] G.R. Doddington: Phonetically sensitive discriminants for improved speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 556-559, Glasgow, Scotland. May 1989. [Duda & Hart+ 01] R.O. Duda, P.E. Hart, D.G. Stork: Pattern classification, John Wiley & Sons, New York, NY, USA. 2001.
[Eisner 01] J. Eisner: Expectation semirings: Flexible EM for finite-state transducers, In International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP), Helsinki, Finland. Aug. 2001. [Fayolle & Moreau+ 10] J. Fayolle, F. Moreau, C. Raymond, G. Gravier, P. Gros: CRF-based combination of contextual features to improve a posteriori word-level confidence measures, In Interspeech, pp. 1942-1945, Makuhari, Japan. Sep. 2010. [Fisher 36] R.A. Fisher: The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, pp. 179-188 . 1936.
[Fosler-Lussier & Morris 08] E. Fosler-Lussier, J. Morris: CRANDEM systems: Conditional random field acoustic models for hidden Markov models, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4049-4052, Las Vegas, NV, USA. Apr. 2008. [Gales 98] M.J.F. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, Vol. 12, pp. 75-98. 1998. [Gens & Domingos 12] R. Gens, P. Domingos: Discriminative learning of sum-product networks, Advances in Neural Information Processing Systems, Vol. 25. 2012. [Goodman 02] J. Goodman: Sequential conditional generalized iterative scaling, 40th Annual Meeting on Association For Computational Linguistics, pp. 9-16, Philadelphia, PA, USA. Jul. 2002. [Gopalakrishnan & Kanevsky+ 91] P.S. Gopalakrishnan, D. Kanevsky, A. N´adas, D. Nahamoo: An inequality for rational functions with applications to some statistical estimation problems, IEEE Transactions on Information Theory, 37, pp. 107-113. 1991. [Gopalakrishnan & Kanevsky+ 88] P.S. Gopalakrishnan, D. Kanevsky, A. N´adas, D.Nahamoo, M.A. Picheny: Decoder selection based on cross-entropies, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 20-23, New York, USA. Apr. 1988. [Greenwood 90] D.D. Greenwood: A cochlear frequency-position function for several species - 29 years later, The Journal of the Acoustical Society of America, Vol. 87, No. 6, pp. 25922605. 1990. [Gunawardana 01] A. Gunawardana: Maximum mutual information estimation of acoustic HMM emission densities, CLSP Research Note No. 40, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA. 2001. [Gunawardana & Mahajan+ 05] A. Gunawardana, M. Mahajan, A. Acero, J. Platt: Hidden conditional random fields for phone classification, In Interspeech, pp. 117-120, Lisbon, Portugal. Sep. 2005. [Harris 78] F.J. Harris: On the use of windows for harmonic analysis with the discrete Fourier transform, In Proc. of the IEEE, Vol. 66, pp. 51-83. Jan. 1978. [H¨ab-Umbach & Ney 92] R. H¨ab-Umbach, H. Ney: Linear discriminant analysis for improved large vocabulary continuous speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 13-16. Mar. 1992. [He & Deng 08] X. He, L. Deng: Discriminative learning for speech recognition: theory and practice, Morgan & Claypool. 2008. [He & Deng+ 08] X. He, L. Deng, W. Chou: Discriminative learning in sequential pattern recognition - a unifying review for optimization-oriented speech recognition, IEEE Signal Processing Magazine, pp. 14-36. 2008.
[Heigold 10] G. Heigold: A log-linear discriminative modeling framework for speech recognition, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Jun. 2010.
[Heigold & Lehnen+ 08] G. Heigold, P. Lehnen, R. Schlüter, H. Ney: On the equivalence of Gaussian and log-linear HMMs, In Interspeech, pp. 273-276, Brisbane, Australia. Sep. 2008.
[Heigold & Ney+ 11] G. Heigold, H. Ney, P. Lehnen, T. Gass, R. Schlüter: Equivalence of generative and log-linear models, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 5, pp. 1138-1148. 2011.
[Heigold & Rybach+ 09] G. Heigold, D. Rybach, R. Schlüter, H. Ney: Investigations on convex optimization using log-linear HMMs for digit string recognition, In Interspeech, pp. 216-219, Brighton, UK. Sep. 2009.
[Heigold & Schlüter+ 07] G. Heigold, R. Schlüter, H. Ney: On the equivalence of Gaussian HMM and Gaussian HMM-like hidden conditional random fields, In Interspeech, pp. 1721-1724, Antwerp, Belgium. Aug. 2007.
[Hermansky 90] H. Hermansky: Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 1990.
[Hermansky & Ellis+ 00] H. Hermansky, D.P.W. Ellis, S. Sharma: Tandem connectionist feature extraction for conventional HMM systems, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 3, pp. 1635-1638, Istanbul, Turkey. Jun. 2000.
[Hifny & Gao 08] Y. Hifny, Y. Gao: Discriminative training using the trusted expectation maximization, In Interspeech, pp. 948-951, Brisbane, Australia. Sep. 2008.
[Hifny & Renals+ 05] Y. Hifny, S. Renals, N.D. Lawrence: A hybrid MaxEnt/HMM based ASR system, In Interspeech, pp. 3017-3020, Lisbon, Portugal. Sep. 2005.
[Hinton 10] G.E. Hinton: A practical guide to training restricted Boltzmann machines, Tech. Report, CS Dept, Univ. of Toronto. 2010.
[Hinton & Deng+ 12] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, Vol. 29, No. 6, pp. 82-97. 2012.
[Hoffmeister & Klein+ 06] B. Hoffmeister, T. Klein, R. Schlüter, H. Ney: Frame based system combination and a comparison with weighted ROVER and CNC, In Interspeech, pp. 537-540, Pittsburgh, PA, USA. Sep. 2006.
[Jelinek 69] F. Jelinek: A fast sequential decoding algorithm using a stack, IBM Journal of Research and Development, Vol. 13, pp. 675-685. 1969.
[Johnson & Jourlin+ 99] S.E. Johnson, P. Jourlin, G.L. Moore, K.S. Jones, P.C. Woodland: The Cambridge university spoken document retrieval system, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 49-52, Phoenix, AZ, USA. May 1999.
[Kaiser & Horvat+ 00] J. Kaiser, B. Horvat, Z. Kačič: A novel loss function for the overall risk criterion based discriminative training of HMM models, In Int. Conf. on Spoken Language Processing, Vol. 2, pp. 887-890, Beijing, China. Oct. 2000.
[Kanevsky 04] D. Kanevsky: Extended Baum Welch transformations for general functions, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 821-824, Montreal, Quebec, Canada. May 2004.
[Kapadia 98] S. Kapadia: Discriminative training of hidden Markov models, Ph.D. thesis, Downing College, Cambridge University, Cambridge, UK. Mar. 1998.
[Katz 87] S.M. Katz: Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Transactions on Speech and Audio Processing, Vol. 35, pp. 400-401. Mar. 1987.
[Kristiina & McTear 09] J. Kristiina, M. McTear: Spoken dialogue systems, Synthesis Lectures on Human Language Technologies, Vol. 2, No. 1, pp. 1-151. 2009.
[Kneser & Ney 95] R. Kneser, H. Ney: Improved backing-off for m-gram language modeling, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 181-184, Detroit, MI, USA. May 1995.
[Kumar & Andreou 98] N. Kumar, A.G. Andreou: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Communication, Vol. 26, No. 4, pp. 283-297. 1998.
[Kuo & Gao 06] H. Kuo, Y. Gao: Maximum entropy direct models for speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 3, pp. 873-881. 2006.
[Laube 06] T. Laube: Acceleration of search in large vocabulary continuous speech recognition, Diploma thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Aug. 2006.
[Lee & Rose 96] L. Lee, R. Rose: Speaker normalization using efficient frequency warping procedures, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 353-356, Atlanta, GA, USA. May 1996.
[Leggetter & Woodland 95] C.J. Leggetter, P.C. Woodland: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, Vol. 9, No. 2, pp. 171-185. 1995.
[Liu & Chen+ 06] J. Liu, S. Chen, X. Tan, D. Zhang: An efficient pseudoinverse linear discriminant analysis method for face recognition, In International Conference on Neural Information Processing, Hong Kong, China. Oct. 2006.
[Lloyd 82] S.P. Lloyd: Least squares quantization in PCM, IEEE Transactions on Information Theory, Vol. 28, No. 2, pp. 129-137. 1982.
[Lööf & Gollan+ 07] J. Lööf, C. Gollan, S. Hahn, G. Heigold, B. Hoffmeister, C. Plahl, D. Rybach, R. Schlüter, H. Ney: The RWTH 2007 TC-STAR evaluation system for European English and Spanish, In Interspeech, pp. 2145-2148, Antwerp, Belgium. Aug. 2007.
[Lööf & Schlüter+ 07] J. Lööf, R. Schlüter, H. Ney: Efficient estimation of speaker-specific projecting feature transforms, In Interspeech, pp. 1557-1560, Antwerp, Belgium. Aug. 2007.
[Lowerre 76] B. Lowerre: A comparative performance analysis of speech understanding systems, Ph.D. thesis, Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA, USA. Apr. 1976.
[Macherey 98] W. Macherey: Implementation and comparison of discriminative training methods for automatic speech recognition, M.Sc. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Nov. 1998.
[Macherey 10] W. Macherey: Discriminative training and acoustic modeling for automatic speech recognition, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Mar. 2010.
[Macherey & Haferkamp+ 05] W. Macherey, L. Haferkamp, R. Schlüter, H. Ney: Investigations on error minimizing training criteria for discriminative training in automatic speech recognition, In Interspeech, pp. 2133-2136, Lisbon, Portugal. Sep. 2005.
[Macherey & Ney 03] W. Macherey, H. Ney: A comparative study on maximum entropy and discriminative training for acoustic modeling in automatic speech recognition, In European Conference on Speech Communication and Technology (Eurospeech), pp. 493-496, Geneva, Switzerland. Sep. 2003.
[MacQueen 67] J.B. MacQueen: Some methods for classification and analysis of multivariate observations, In 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297. 1967.
[McCallum & Freitag+ 00] A. McCallum, D. Freitag, F. Pereira: Maximum entropy Markov models for information extraction and segmentation, In 17th Intl. Conf. on Machine Learning (ICML), pp. 591-598, Stanford, CA, USA. Mar. 2000.
[McDermott & Hazen 04] E. McDermott, T.J. Hazen: Minimum classification error training of landmark models for real-time continuous speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 937-940, Montreal, Canada. May 2004.
[McDermott & Katagiri 97] E. McDermott, S. Katagiri: String-level MCE for continuous phoneme recognition, In Europ. Conf. on Speech Communication and Technology, Vol. 1, pp. 123-126, Rhodes, Greece. Sep. 1997.
[McDermott & Katagiri 05] E. McDermott, S. Katagiri: Minimum classification error for large scale speech recognition tasks using weighted finite state transducers, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Philadelphia, PA, USA. Apr. 2005.
[McDonough & Waibel 02] J. McDonough, A. Waibel: On maximum mutual information speaker-adapted training, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 601-604, Orlando, FL, USA. May 2002.
[Merialdo 88] B. Merialdo: Phonetic recognition using hidden Markov models and maximum mutual information training, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 111-114, New York, NY, USA. Apr. 1988.
[Mermelstein 76] P. Mermelstein: Distance measures for speech recognition, psychological and instrumental, Pattern Recognition and Artificial Intelligence, pp. 374-388. 1976.
[Mikolov 12] T. Mikolov: Statistical language models based on neural networks, Ph.D. thesis, Brno University of Technology, Brno, Czech Republic. 2012.
[Molau 03] S. Molau: Normalization in the acoustic feature space for improved speech recognition, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Feb. 2003.
[Murveit & Butzberger+ 93] H. Murveit, J. Butzberger, V. Digalakis, M. Weintraub: Progressive-search algorithms for large-vocabulary speech recognition, In Workshop on Human Language Technology, pp. 87-90, Morristown, NJ, USA. 1993.
[Na & Jeon+ 95] K. Na, B. Jeon, D. Chang, S. Chae, S. Ann: Discriminative training of hidden Markov models using overall risk criterion and reduced gradient method, In ISCA Europ. Conf. on Speech Communication and Technology, Vol. 1, pp. 97-100, Madrid, Spain. Sep. 1995.
[Nádas 83] A. Nádas: A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 31, No. 4, pp. 814-817. 1983.
[Nakamura & McDermott+ 09] A. Nakamura, E. McDermott, S. Watanabe, S. Katagiri: A unified view for discriminative objective functions based on negative exponential of difference measure between strings, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1633-1636, Taipei, Taiwan. Apr. 2009.
[Ney 06] H. Ney: Lecture notes: Automatic speech recognition, Chair of Computer Science 6, RWTH Aachen, Germany. 2006.
[Ney 84] H. Ney: The use of a one-stage dynamic programming algorithm for connected word recognition, IEEE Transactions on Speech and Audio Processing, Vol. 32, No. 2, pp. 263-271. 1984.
[Ney & Aubert 94] H. Ney, X. Aubert: A word graph algorithm for large vocabulary continuous speech recognition, In International Conference on Spoken Language Processing (ICSLP), Vol. 3, pp. 1355-1358, Yokohama, Japan. Sep. 1994.
[Ney & Essen+ 94] H. Ney, U. Essen, R. Kneser: On structuring probabilistic dependencies in language modeling, Computer Speech and Language, 2(8):1-38. Mar. 1994.
[Ney & Häb-Umbach+ 92] H. Ney, R. Häb-Umbach, B.H. Tran, M. Oerder: Improvements in beam search for 10000-word continuous speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 9-12, San Francisco, CA, USA. Mar. 1992.
[Ney & Mergel+ 87] H. Ney, D. Mergel, A. Noll, A. Paeseler: A data-driven organization of the dynamic programming beam search for continuous speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 833-836, Dallas, TX, USA. Apr. 1987.
[Nocedal & Wright 99] J. Nocedal, S. Wright: Numerical optimization, Springer. 1999.
[Normandin 95] Y. Normandin: Optimal splitting of HMM Gaussian mixture components with MMIE training, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 449-452, Detroit, MI, USA. May 1995.
[Normandin 96] Y. Normandin: Maximum mutual information estimation of hidden Markov models, Kluwer Academic Publishers, pp. 57-81, Norwell, MA, USA. 1996.
[Normandin & Mogera 91] Y. Normandin, S. Morgera: An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 537-540, Toronto, Canada. May 1991.
[Omer & Hasegawa-Johnson 03] M.K. Omar, M. Hasegawa-Johnson: Maximum conditional mutual information projection for speech recognition, In International Conference on Spoken Language Processing (ICSLP), pp. 505-508, Geneva, Switzerland. Sep. 2003.
[Ortmanns & Ney+ 97] S. Ortmanns, H. Ney, T. Firzlaff: Fast likelihood computation methods for continuous mixture densities in large vocabulary speech recognition, In European Conference on Speech Communication and Technology (Eurospeech), Vol. 1, pp. 139-142, Rhodes, Greece. Sep. 1997.
[Ortmanns & Ney+ 96] S. Ortmanns, H. Ney, A. Eiden: Language-model look-ahead for large vocabulary speech recognition, In International Conference on Spoken Language Processing (ICSLP), Vol. 4, pp. 2095-2098, Philadelphia, PA, USA. Oct. 1996.
[Peeling & Moore+ 86] S.M. Peeling, R.K. Moore, M.J. Tomlinson: The multi-layer perceptron as a tool for speech pattern processing research, In Institute of Acoustics, Autumn Conference on Speech and Hearing, Vol. 8, pp. 307-314, Windermere, UK. 1986.
[Plahl 14] C. Plahl: Neural network based feature extraction for speech and image recognition, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Jan. 2014.
[Plahl & Schlüter+ 11] C. Plahl, R. Schlüter, H. Ney: Improved acoustic feature combination for LVCSR by neural networks, In Interspeech, pp. 1237-1240, Florence, Italy. Aug. 2011.
[Pitz 05] M. Pitz: Investigations on linear transformations for speaker adaptation and normalization, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Mar. 2005.
[Povey 03] D. Povey: Discriminative training for large vocabulary speech recognition, Ph.D. thesis, Engineering Department, Univ. of Cambridge, Cambridge, UK. Mar. 2003.
[Povey & Burget+ 11] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, A. Goel, M. Karafiat, A. Rastrow, R.C. Rose, P. Schwarz, S. Thomas: The subspace Gaussian mixture model - A structured model for speech recognition, Computer Speech and Language, Vol. 25, No. 2, pp. 404-439. 2011.
[Povey & Kingsbury+ 05] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, G. Zweig: fMPE: Discriminatively trained features for speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 961-964, Philadelphia, PA, USA. Mar. 2005.
[Povey & Woodland 99] D. Povey, P.C. Woodland: Frame discrimination training of HMMs for large vocabulary speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 333-336, Phoenix, AZ, USA. May 1999.
[Povey & Woodland 02] D. Povey, P.C. Woodland: Minimum phone error and I-smoothing for improved discriminative training, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 105-108, Orlando, FL, USA. May 2002.
[Rabiner 89] L.R. Rabiner: A tutorial on hidden Markov models and selected applications in speech recognition, In Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286. Feb. 1989.
[Rabiner & Juang 86] L. Rabiner, B.-H. Juang: An introduction to hidden Markov models, IEEE ASSP Magazine, Vol. 3, No. 1, pp. 4-16. 1986.
[Rabiner & Schafer 78] L. Rabiner, R. Schafer: Digital processing of speech signals, Prentice Hall, Englewood Cliffs, NJ, USA. 1978.
[Ramasubramanian & Paliwal 92] V. Ramasubramanian, K.K. Paliwal: Fast k-dimensional tree algorithms for nearest neighbor search with application to vector quantization encoding, IEEE Transactions on Speech and Audio Processing, Vol. 40, No. 3, pp. 518-528. 1992.
[Renals & Bell 13] S. Renals, P. Bell: Hidden Markov models and Gaussian mixture models, ASR lectures 4 & 5, Univ. of Edinburgh, UK. 2013.
[Reynolds 94] D.A. Reynolds: Experimental evaluation of features for robust speaker identification, IEEE Trans. on Acoust. Speech and Audio Processing, Vol. 2, No. 4, pp. 639-643. 1994.
[Rice 95] J. Rice: Mathematical statistics and data analysis, 2nd edition, Duxbury Press. 1995.
[Riedmiller & Braun 93] M. Riedmiller, H. Braun: A direct adaptive method for faster backpropagation learning: The RPROP algorithm, In IEEE International Conference on Neural Networks, pp. 586-591, San Francisco, CA, USA. Mar. 1993.
[Rybach & Gollan+ 09] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, H. Ney: The RWTH Aachen University open source speech recognition system, In Interspeech, pp. 2111-2114, Brighton, UK. Sep. 2009.
[Sainath & Kingsbury+ 11] T.N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, A. Mohamed: Making deep belief networks effective for large vocabulary continuous speech recognition, In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 30-35, Waikoloa, HI, USA. Dec. 2011.
[Seide & Gang+ 11] F. Seide, L. Gang, Y. Dong: Conversational speech transcription using context-dependent deep neural network, In Interspeech, pp. 437-440, Florence, Italy. Aug. 2011.
[Schlüter 00] R. Schlüter: Investigations on discriminative training, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Sep. 2000.
[Schlüter & Bezrukov+ 07] R. Schlüter, I. Bezrukov, H. Wagner, H. Ney: Gammatone features and feature combination for large vocabulary speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 649-652, Honolulu, HI, USA. Apr. 2007.
[Schlüter & Macherey+ 99] R. Schlüter, W. Macherey, B. Müller, H. Ney: A combined maximum mutual information and maximum likelihood approach for mixture density splitting, In European Conf. on Speech Communication and Technology, Vol. 4, pp. 1715-1718, Budapest, Hungary. Sep. 1999.
[Schlüter & Macherey+ 01] R. Schlüter, W. Macherey, B. Müller, H. Ney: Comparison of discriminative training criteria and optimization methods for speech recognition, Speech Communication, Vol. 34, pp. 287-310. 2001.
[Schlüter & Müller+ 99] R. Schlüter, B. Müller, F. Wessel, H. Ney: Interdependence of language models and discriminative training, In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 1, pp. 119-122, Keystone, CO, USA. Dec. 1999.
[Schlüter & Zolnay+ 06] R. Schlüter, A. Zolnay, H. Ney: Feature combination using linear discriminant analysis and its pitfalls, In 9th International Conference on Spoken Language Processing, pp. 345-348, Pittsburgh, PA, USA. Sep. 2006.
[Schwartz & Chow 90] R. Schwartz, Y.L. Chow: The N-best algorithm: An efficient and exact procedure for finding the N most likely sentence hypotheses, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81-84, Albuquerque, NM, USA. Apr. 1990.
[Sixtus 03] A. Sixtus: Across-word phoneme models for large vocabulary continuous speech recognition, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Jan. 2003.
[Shaik & Tüske+ 14] M.A.B. Shaik, Z. Tüske, M.A. Tahir, M. Nußbaum-Thom, R. Schlüter, H. Ney: RWTH LVCSR systems for Quaero and EU-Bridge: German, Polish, Spanish and Portuguese, In Interspeech, pp. 973-977, Singapore. Sep. 2014.
[Sundermeyer & Nußbaum-Thom+ 11] M. Sundermeyer, M. Nußbaum-Thom, S. Wiesler, C. Plahl, A. El-Desoky Mousa, S. Hahn, D. Nolden, R. Schlüter, H. Ney: The RWTH 2010 Quaero ASR evaluation system for English, French, and German, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2212-2215, Prague, Czech Republic. May 2011.
[Tahir & Heigold+ 09] M.A. Tahir, G. Heigold, C. Plahl, R. Schlüter, H. Ney: Log-linear framework for linear feature transformations in speech recognition, In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 76-81, Merano, Italy. Dec. 2009.
[Tahir & Huang+ 13] M.A. Tahir, H. Huang, R. Schlüter, H. Ney, L. ten Bosch, B. Cranen, L. Boves: Training log-linear acoustic models in higher-order polynomial feature space for speech recognition, In Interspeech, pp. 3352-3355, Lyon, France. Aug. 2013.
[Tahir & Nußbaum-Thom+ 12] M.A. Tahir, M. Nußbaum-Thom, R. Schlüter, H. Ney: Simultaneous discriminative training and mixture splitting of HMMs for speech recognition, In Interspeech, pp. 571-574, Portland, OR, USA. Sep. 2012.
[Tahir & Schlüter+ 11a] M.A. Tahir, R. Schlüter, H. Ney: Log-linear optimization of second-order polynomial features with subsequent dimension reduction for speech recognition, In Interspeech, pp. 1705-1708, Florence, Italy. Aug. 2011.
[Tahir & Schlüter+ 11b] M.A. Tahir, R. Schlüter, H. Ney: Discriminative splitting of Gaussian/log-linear mixture HMMs for speech recognition, In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 7-11, Waikoloa, HI, USA. Dec. 2011.
[Tahir & Wiesler+ 15] M.A. Tahir, S. Wiesler, R. Schlüter, H. Ney: Investigation of mixture splitting concept for training linear bottlenecks of deep neural network acoustic models, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4614-4618, Brisbane, Australia. Apr. 2015.
[Tsakalidis & Doumpiotis+ 02] S. Tsakalidis, V. Doumpiotis, W. Byrne: Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation, In International Conference on Spoken Language Processing (ICSLP), pp. 2585-2588, Denver, CO, USA. Sep. 2002.
[Tüske & Schlüter+ 13] Z. Tüske, R. Schlüter, H. Ney: Multilingual hierarchical MRASTA features for ASR, In Interspeech, pp. 2222-2226, Lyon, France. Aug. 2013.
[Valente & Hermansky 08] F. Valente, H. Hermansky: Hierarchical and parallel processing of modulation spectrum for ASR applications, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4165-4168, Las Vegas, NV, USA. Apr. 2008.
[Valtchev 95] V. Valtchev: Discriminative methods in HMM-based speech recognition, Ph.D. thesis, Engineering Department, University of Cambridge, Cambridge, UK. 1995.
[Valtchev & Odell+ 96] V. Valtchev, J.J. Odell, P.C. Woodland, S.J. Young: Lattice-based discriminative training for large vocabulary speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2, pp. 605-608, Atlanta, GA, USA. May 1996.
[Valtchev & Odell+ 97] V. Valtchev, J.J. Odell, P.C. Woodland, S.J. Young: MMIE training of large vocabulary recognition systems, Speech Communication, Vol. 22, No. 4, pp. 303-314. Sep. 1997.
[Visweswariah & Gopinath 04] K. Visweswariah, R. Gopinath: Adaptation of front end parameters in a speech recognizer, In International Conference on Spoken Language Processing (ICSLP), pp. 21-24, Jeju Island, South Korea. Oct. 2004.
[Viterbi 67] A. Viterbi: Error bounds for convolutional codes and an asymptotically optimal decoding algorithm, IEEE Trans. Information Theory, Vol. IT-13, pp. 260-269. 1967.
[Waibel & Hanazawa+ 89] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K.J. Lang: Phoneme recognition using time-delay neural networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 3, pp. 328-339. 1989.
[Welling 99] L. Welling: Merkmalsextraktion in Spracherkennungssystemen für grossen Wortschatz, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Jan. 1999.
[Wessel & Schlüter+ 01] F. Wessel, R. Schlüter, K. Macherey, H. Ney: Confidence measures for large vocabulary continuous speech recognition, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 3, pp. 288-298. 2001.
[Wiesler & Nußbaum+ 09] S. Wiesler, M. Nußbaum, G. Heigold, R. Schlüter, H. Ney: Investigations on features for log-linear acoustic models in continuous speech recognition, In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 52-57, Merano, Italy. Dec. 2009.
[Wiesler & Richard+ 14] S. Wiesler, A. Richard, R. Schlüter, H. Ney: Mean-normalized stochastic gradient for large-scale deep learning, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 180-184, Florence, Italy. May 2014.
[Wood & Pearce+ 91] L. Wood, D. Pearce, F. Novello: Improved vocabulary independent subword HMM modeling, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 181-184, Toronto, Canada. May 1991.
[Woodland & Povey 00] P.C. Woodland, D. Povey: Large scale discriminative training for speech recognition, In ISCA ITRW ASR2000, Automatic Speech Recognition: Challenges for the New Millennium, pp. 7-16, Paris, France. Sep. 2000.
[Xue & Li+ 13] J. Xue, J. Li, Y. Gong: Restructuring of deep neural network acoustic models with singular value decomposition, In Interspeech, pp. 2365-2369, Lyon, France. Aug. 2013.
[Yu & Russel+ 90] G. Yu, W. Russel, R. Schwartz, J. Makhoul: Discriminant analysis and supervised vector quantization for continuous speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 685-688, Albuquerque, NM, USA. Apr. 1990.
[Yu & Seide+ 12] D. Yu, F. Seide, G. Li, L. Deng: Exploiting sparseness in deep neural networks for large vocabulary speech recognition, In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4409-4412, Kyoto, Japan. Mar. 2012.
[Zolnay 06] A. Zolnay: Acoustic feature combination for speech recognition, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Aug. 2006.
[Zolnay & Schlüter+ 02] A. Zolnay, R. Schlüter, H. Ney: Robust speech recognition using a voiced-unvoiced feature, In International Conference on Spoken Language Processing (ICSLP), Vol. 2, pp. 1065-1068, Denver, CO, USA. Sep. 2002.