Support Vector Machines for Postprocessing of Speech Recognition Hypotheses

A. Stuhlsatz (1,2), H.-G. Meier (2), M. Katz (1), S. E. Krüger (1) and A. Wendemuth (1)

(1) Otto-von-Guericke University Magdeburg, Cognitive Systems Group, Department of Electrical Engineering, P.O. Box 4120, 39016 Magdeburg, Germany, e-mail: [email protected]

(2) University of Applied Sciences Düsseldorf, Department of Electrical Engineering, Josef-Gockeln-Str. 9, 40474 Düsseldorf, Germany, e-mail: [email protected]

Abstract

In this paper, we introduce an approach to improve the recognition performance of a Hidden Markov Model (HMM) based monophone recognizer using Support Vector Machines (SVMs). We developed and examined a method for re-scoring the HMM recognizer hypotheses with SVMs in a phoneme recognition framework. Compared to a stand-alone HMM system, the hybrid framework achieved a relative improvement of 9.2% on the TIMIT database and 12.8% on the Wall Street Journal Cambridge database.

Keywords

Support Vector Machine, Speech Recognition, Hidden Markov Model, TIMIT database, Wall Street Journal Cambridge database

1. Introduction

Maximum A Posteriori (MAP) classifiers (1) are commonly used in an HMM-based recognizer to map the acoustic features x ∈ ℝ^n to their corresponding classes c*, e.g. phoneme classes. In practice, the unknown parameters ϑ are learned from given data x_m ∈ X ⊆ ℝ^n, m ∈ M := {1, ..., M}, using iterative methods that increase the likelihood p(X_c|ϑ_c) with known class membership c in terms of ϑ_c. A Maximum Likelihood (ML) approach is not optimized with respect to generalization performance, because it minimizes only the empirical risk (2) (ERM principle [1]) without making any statement about the expected actual risk (3).

c^* := \arg\max_{c} \left[ \frac{p(x \mid \vartheta_c) \, P(\vartheta_c)}{p(x)} \right], \quad \forall c \in C := \{1, \dots, N\}    (1)

R_{emp}(\vartheta) := \frac{1}{M} \sum_{m \in M} L(x_m, c_m, \vartheta)    (2)

R_{act}(\vartheta) := \int_{\mathbb{R}^n \times C} L(x, c, \vartheta) \, dP(x, c)    (3)
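For illustration only, a minimal numpy sketch of the decision rule (1) with hypothetical isotropic Gaussian class-conditional likelihoods; the class means, variances and priors are placeholder values, not parameters estimated in this work:

```python
import numpy as np

# Hypothetical class models theta_c = (mean, variance) and priors P(theta_c)
# for N = 3 classes over n = 2 dimensional features.
means = np.array([[0.0, 0.0], [1.5, 1.5], [-1.0, 2.0]])
variances = np.array([0.5, 0.8, 0.6])
priors = np.array([0.5, 0.3, 0.2])

def log_likelihood(x, mean, var):
    """log p(x | theta_c) for an isotropic Gaussian."""
    n = x.shape[0]
    return -0.5 * (n * np.log(2 * np.pi * var) + np.sum((x - mean) ** 2) / var)

def map_classify(x):
    """MAP rule (1): argmax_c p(x | theta_c) P(theta_c); p(x) is constant in c."""
    scores = [log_likelihood(x, m, v) + np.log(p)
              for m, v, p in zip(means, variances, priors)]
    return int(np.argmax(scores))

print(map_classify(np.array([1.4, 1.6])))  # index of the most probable class
```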

1.1. Support Vector Machine

Instead of minimizing just the empirical risk (2) with an appropriate loss function L, the SVM [1] algorithm exerts influence on an upper bound of the actual risk (3) by controlling its complexity. This is achieved by an approximate implementation of Structural Risk Minimization (SRM [1]), which leads to the quadratic optimization problem (4) and the decision function (5). While C is a free parameter vector and K(x_l, x_m) := x_l · x_m a Euclidean scalar product, the bias β can be derived from the optimality conditions of problem (4). The vectors x_m ∈ ℝ^n, m ∈ S, of a set of given training examples (x_m, y_m), y_m ∈ {−1, 1}, are called Support Vectors.

\alpha^* := \arg\max_{\alpha \in \mathbb{R}^M} \; e^T \alpha - \frac{1}{2} \alpha^T A \alpha    (4)

\text{s.t.} \quad \alpha^T y = 0, \quad 0 \le \alpha \le C

\text{where} \quad y := (y_1, \dots, y_M)^T, \quad e := (1, \dots, 1)^T \in \mathbb{R}^M, \quad A := \left( y_l y_m K(x_l, x_m) \right)_{l,m \in M}

c^*(x) := \operatorname{sign}\left[ \beta + \sum_{m \in S} \alpha^*_m y_m K(x, x_m) \right], \quad S := \{ s \in M \mid 0 < \alpha^*_s \le C \}    (5)

Although the SVM learns a linear discriminant, it is possible to learn non-linearities via kernel functions that fulfill Mercer's condition [4]. Important kernels are the polynomial kernel K(x_l, x_m) := (x_l · x_m + 1)^d and the Gaussian radial basis function kernel K(x_l, x_m) := exp(−γ|x_l − x_m|²).
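For illustration, training such a classifier and evaluating the decision function (5) can be sketched with scikit-learn (the experiments in this paper used an SVMlight-based toolkit [9, 10] instead). The toy data below are random; the Gaussian-kernel settings are simply the TIMIT values from Table 2:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 118))                     # toy "composed" feature vectors
y = np.concatenate([np.ones(100), -np.ones(100)])   # binary labels y_m in {-1, 1}

# Gaussian RBF kernel K(x_l, x_m) = exp(-gamma * |x_l - x_m|^2); C bounds 0 <= alpha <= C.
svm = SVC(kernel="rbf", gamma=0.25, C=200.0)
svm.fit(X, y)

f = svm.decision_function(X[:5])   # f(x) = beta + sum_m alpha*_m y_m K(x, x_m), cf. (6)
c_star = np.sign(f)                # decision rule (5)
print(f, c_star)
```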

2. Combined system architecture

SVMs are excellent classifiers for static pattern classification tasks due to their superior generalization ability, whereas HMMs excel at adapting to the temporal structure of speech. Thus, our goal is to develop an architecture combining both methods, using HMMs to model the temporal variability and SVMs for classification, as e.g. indicated in [5] and [6]. This should improve the recognition performance compared to a stand-alone HMM system, as seen in connectionist systems where neural networks (NNs) replace the Gaussian output probabilities of the HMM acoustic model [11]. Hybrid NN/HMM systems often use NNs to estimate posterior probabilities and employ an underlying HMM structure to model the temporal evolution. Instead of this integrated approach, we utilized an N-best-list re-scoring approach based on the work of [5], which we improved using phoneme lattices. Figure 1 shows the structure of the architecture in principle. For the multi-class situation here, we utilized one-versus-all classifiers, which learn to discriminate one class against the remaining classes.

Figure 1: Scheme of the hybrid architecture consisting of the HMM recognizer [7] and our developed tools [10] (including [9]).

2.1. HMM segmentation

The HMM speech recognizer [7] provides information about the time alignment and the corresponding feature vectors. The feature vectors representing a time segment of a single phoneme are grouped into three regions [5] (first, middle and end part of the phoneme) to construct the time-independent vectors required by the SVM. Within each of these three groups, we calculated an arithmetic mean vector of the corresponding feature vectors. Concatenating these three mean vectors resulted in a composed vector which represents one phoneme and its corresponding stream of feature vectors per time segment (Fig. 2). Because this approach is ill-suited to very short phonemes, we added the logarithm of the segment duration as an additional feature. These composed vectors are used for SVM training and classification.

Figure 2: Scheme of the composed vector building process.
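As an illustration, a minimal numpy sketch of this segment-to-vector construction, using the 30%-40%-30% grouping listed later in Table 2; the function name, the duration measured in frames and the boundary handling are our own simplification, not code from the HSVM toolkit [10]:

```python
import numpy as np

def compose_segment_vector(frames):
    """frames: (T, 39) feature vectors of one phoneme segment (T frames).
    Returns a 3*39 + 1 = 118 dimensional composed vector."""
    T = frames.shape[0]
    # Split the segment into first/middle/end regions (30%-40%-30%).
    b1 = max(1, int(round(0.3 * T)))
    b2 = min(max(b1 + 1, int(round(0.7 * T))), T)
    regions = [frames[:b1], frames[b1:b2], frames[b2:]]
    # Arithmetic mean per region; fall back to the segment mean if a region is empty.
    means = [r.mean(axis=0) if len(r) > 0 else frames.mean(axis=0) for r in regions]
    # Log segment duration (here: number of frames) as an extra feature for short phonemes.
    return np.concatenate(means + [np.array([np.log(T)])])

segment = np.random.randn(12, 39)              # 12 frames of 39-dim features
print(compose_segment_vector(segment).shape)   # (118,)
```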

2.2. Estimation of posteriors from the SVM output

The output (6) of an SVM can be seen as a distance measure between a test pattern x and the decision boundary {x | f(x) = 0}. There is no clear relationship with the posterior class probability P(y|f), which we need in order to tie the HMM model and the SVM model together. A possible estimate for these probabilities can be obtained by modeling the distribution of the SVM outputs using Gaussians [8] with equal variances (7). Applying Bayes' formula, we get the conditional probability that the vector x belongs to the positive class (8). From this follows (9), where a and b depend on μ_c and σ. We estimated the parameters a and b with a Minimum Squared Error (MSE) parameter estimator.

f(x) := \beta + \sum_{m \in S} \alpha^*_m y_m K(x, x_m)    (6)

p(f(x) \mid y = c) := \frac{1}{\sigma \sqrt{2\pi}} \exp\left( \frac{-[f(x) - \mu_c]^2}{2\sigma^2} \right), \quad c \in \{-1, 1\}    (7)

P(y = 1 \mid f(x)) := \frac{p(f(x) \mid y = 1) \, P(y = 1)}{\sum_{c \in \{-1,1\}} p(f(x) \mid y = c) \, P(y = c)}    (8)

P(y = 1 \mid f(x)) = \frac{1}{1 + \exp(a \cdot f(x) + b)}    (9)
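A minimal sketch of estimating a and b in (9) with a least-squares (MSE) criterion; the use of scipy's generic curve fitting and the synthetic SVM outputs below are our own illustration, not the estimator implementation used in the experiments:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(f, a, b):
    """Posterior model (9): P(y = 1 | f) = 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + np.exp(a * f + b))

# SVM outputs f(x_m) on a held-out set and binary targets (1 for the positive
# class, 0 otherwise); random placeholders stand in for real data.
f_out = np.concatenate([np.random.randn(100) + 1.0, np.random.randn(100) - 1.0])
targets = np.concatenate([np.ones(100), np.zeros(100)])

# MSE fit of (a, b); a should come out negative so that large f gives P close to 1.
(a, b), _ = curve_fit(sigmoid, f_out, targets, p0=(-1.0, 0.0))
print(a, b, sigmoid(2.0, a, b))
```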

2.3. Re-scoring of N-best-lists

As a first step of integrating SVMs into the speech recognition process, we used the SVM output, respectively the probability P(y = 1|f(x)), to re-score N-best-lists [5] generated by a standard HMM system [7]. An N-best-list consists of the N most likely recognition hypotheses of a test utterance according to the HMM score. Every hypothesis is broken down into its transcription and its segmentation. The scores ln[P(y = 1|f(x))] are added (with suitable factors α and β) to the scores provided by the HMM decoder for every segment (resp. phoneme) (10). After re-scoring all hypotheses of the N-best-list of every test utterance, the lists are reordered and a new best hypothesis (maximum total score) is chosen for every utterance to measure the performance of the combined system.

\mathrm{score}_{HMM/SVM} := \alpha \cdot \mathrm{score}_{HMM} + \beta \cdot \mathrm{score}_{SVM}    (10)
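The following sketch illustrates the re-scoring step (10) for one utterance; the data layout (each hypothesis as a list of per-segment HMM log scores and SVM posteriors) is a simplification assumed purely for illustration:

```python
import math

def rescore_nbest(hypotheses, alpha=0.01, beta=1.0):
    """hypotheses: list of hypotheses, each a list of segments given as
    (hmm_log_score, svm_posterior) pairs.  Returns the hypotheses re-ordered
    by the combined score (10)."""
    rescored = []
    for hyp in hypotheses:
        total = sum(alpha * hmm_score + beta * math.log(p_svm)
                    for hmm_score, p_svm in hyp)
        rescored.append((total, hyp))
    rescored.sort(key=lambda t: t[0], reverse=True)
    return rescored

# Two toy hypotheses with two segments each.
nbest = [[(-120.0, 0.70), (-95.0, 0.20)],
         [(-125.0, 0.90), (-90.0, 0.85)]]
best_score, best_hyp = rescore_nbest(nbest)[0]
print(best_score, best_hyp)
```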

2.4. Re-scoring of phoneme lattices

A disadvantage of re-scoring N-best-lists is that the N best hypotheses are fixed by the HMM recognizer alone. Because re-scoring can only choose from this pool, it is possible that the best hypothesis is not contained in it. It would be better if the SVM could, together with the HMM model, directly influence the selection of the hypotheses to enable a better choice. For this reason, we introduced an improved method: the re-scoring of phoneme lattices. For every test utterance, a lattice is produced using the Viterbi decoder of the HMM system [7]. In this lattice the possible phoneme sequences (which in fact include the hypotheses of the N-best-list) are made up of nodes and arcs. The nodes represent a certain point in time, while each arc represents a possible phoneme scored by the HMM decoder with the logarithm of the acoustic probability. In the same manner as for the N-best-lists, we re-scored the phoneme lattices using the linear combination (10). After this, the best hypothesis is found by searching the lattice for the best path, i.e. the path with the maximum total score.
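A sketch of the best-path search over a re-scored lattice; nodes are assumed to be indexed in time order so that all arcs run forward and the graph is acyclic, and the arc tuples are a simplification of the actual HTK lattice format [7]:

```python
def best_path(num_nodes, arcs, alpha=1.75, beta=10.0):
    """arcs: list of (start_node, end_node, phone, hmm_score, svm_log_posterior).
    Nodes are indexed 0..num_nodes-1 in time order; node 0 is the start node,
    node num_nodes-1 the end node.  Returns (best total score, phone sequence)."""
    NEG_INF = float("-inf")
    best = [NEG_INF] * num_nodes          # best combined score reaching each node
    back = [None] * num_nodes             # (previous node, phone) backpointers
    best[0] = 0.0
    # Processing arcs in order of their end node finalizes each node before it is left.
    for start, end, phone, hmm, svm in sorted(arcs, key=lambda a: a[1]):
        if best[start] == NEG_INF:
            continue
        score = best[start] + alpha * hmm + beta * svm   # linear combination (10)
        if score > best[end]:
            best[end] = score
            back[end] = (start, phone)
    # Trace back the best phoneme sequence from the final node.
    phones, node = [], num_nodes - 1
    while back[node] is not None:
        node, phone = back[node]
        phones.append(phone)
    return best[num_nodes - 1], list(reversed(phones))

arcs = [(0, 1, "s", -40.0, -0.2), (0, 1, "z", -38.0, -1.5),
        (1, 2, "ih", -35.0, -0.1), (1, 2, "iy", -36.0, -0.05)]
print(best_path(3, arcs))
```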

3. Experiments

3.1. Experimental setup

In previous work based on our framework [12] we used the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) [2] for the experiments. For this paper we additionally evaluated our approach on the Wall Street Journal Cambridge Corpus (WSJCAM0) [3]. The TIMIT database as well as the WSJCAM0 database were divided into three independent sets for training, test and evaluation. During evaluation, we used only a subset of the entire training-set to keep the evaluation time manageable (Tab. 1). The HMMs are trained with 39-dimensional feature vectors. The same feature vectors are used to construct the 118-dimensional vectors (Fig. 2) for the SVM training. The required segmentation was generated by a Viterbi decoder [7] without a language model. The HMM monophone recognition achieved 43.26% Phoneme-Error-Rate (PER) on the TIMIT test-set (cf. [12]) and 50.41% PER on the WSJCAM0 test-set. Details about the baseline are summarized in Tab. 1.

3.2. Test performance

The TIMIT test-set contained 7,215 phonemes, while the WSJCAM0 test-set consisted of 37,573 phonemes. First, the SVMs were trained on the TIMIT training-set with the optimal parameter settings found during the evaluation (Tab. 2). The Viterbi decoder [7] produced the 50-best-lists and lattices utilizing the HMM models, which were trained on the same training-set as the SVMs. Then, the 50-best-lists were re-scored using the linear combination with α = 0.01 and β = 1.0. After reordering, the best hypotheses reached 41.75% PER (cf. [12]). For re-scoring the lattices we applied the scaling α = 1.75 and β = 10.0. The re-scored lattices achieved a performance of 39.28% PER (cf. [12]). As a second test, we trained SVMs on WSJCAM0 with the evaluated parameters, and re-scored the 50-best-lists as well as the lattices produced by the Viterbi decoder trained on this speech corpus. We obtained a PER of 49.19% with N-best-list re-scoring and 43.98% PER with lattice re-scoring.

Table 1: Facts about the baseline.

workstation:     2.4 GHz CPU, 4 GB RAM
features:        12 MFCC + energy + δ1 + δ2 (39 components)
HMM recognizer:  50 monophone models (TIMIT), 45 monophone models (WSJCAM0),
                 3 emitting states per model, 10 Gaussian mixtures per state
                 with diagonal covariance matrix
segmentation:    automatic Viterbi alignment

corpora:            TIMIT (phonemes)   WSJCAM0 (phonemes)
training-set:       140,154            168,765
re-training-set:    11,396             10,211
test-set:           7,215              37,573
evaluation-set:     931                1,277

Table 2: Summary of the evaluated parameter settings.

SVM training parameters:   Gaussian kernel function
                           γ = 0.250, C = 200.0 (TIMIT)
                           γ = 0.600, C = 5.0 (WSJCAM0)
                           grouping: 30%-40%-30%
Re-scoring settings:
  N-best-list:             50 hypotheses, α = 0.01, β = 1.0
  phoneme lattice:         10 incoming edges per node, α = 1.75, β = 10.0

Finally, to reduce the doubt that the improvement we obtained is solely the result of re-scoring the lattices with any kind of trained classifier, we trained simple MAP classifiers (assuming i.i.d. features, uniform class probabilities P(ϑ_c) and Gaussian likelihoods p(x|ϑ_c) with equal variances) to re-score the lattices on both corpora. Our results on the test-set are summarized in Fig. 3.

Figure 3: Phoneme-Error-Rates after re-scoring the 50-best-lists and lattices of the test-set using the best parameter settings from the evaluation (Tab. 2).

4. Conclusion

We applied our approach, the re-scoring of phoneme lattices using Support Vector Machines, as an improved method of N-best-list re-scoring [5]. Utilizing our method, we achieved a relative PER reduction of 9.2% on the TIMIT corpus compared to the pure HMM recognizer PER of 43.26% [12]. On the WSJCAM0 database this approach proved its ability to improve the overall recognition performance by reducing the PER by 12.8% relative to the 50.41% PER of the HMM recognizer. Both results are superior to the previous attempt [5] of re-scoring N-best-lists with SVMs, with which we achieved relative reductions of only 3.5% on TIMIT [12] and 2.4% on WSJCAM0. Interestingly, lattice re-scoring was able to increase performance on the WSJCAM0 corpus even with simple trained MAP classifiers (6.4% relative PER reduction). This result emphasizes that the phoneme lattice re-scoring approach is in general preferable. Unlike the SVM, the predictive power of the MAP classifier was not sufficient to increase the performance on TIMIT, due to the better generalization of the TIMIT baseline HMM system compared to the WSJCAM0 baseline. This was reliably confirmed: as we successively reduced the number of mixtures of the acoustic model, the relative performance of the MAP combination increased while the baseline performance decreased. However, the MAP classifier never reached the potential of the SVM to classify the acoustic features independently of the quality of the HMM baseline. As in many other applications, our results confirm that SVMs classify the features in speech recognition more accurately than, e.g., MAP classifiers.

5. References

[1] Vapnik, V. (1995), "The Nature of Statistical Learning Theory.", Springer Verlag
[2] Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S. and Dahlgren, N. L. (1993), "The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.", U.S. Dept. of Commerce, NIST, Gaithersburg, MD, February
[3] Robinson, T., Fransen, J., Pye, D., Foote, J. and Renals, S. (1995), "WSJCAM0: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition.", Proc. ICASSP '95, Detroit, MI, http://citeseer.nj.nec.com/robinson95wsjcam.html
[4] Courant, R. and Hilbert, D. (1953), "Methods of Mathematical Physics.", Interscience Publishers, Inc.
[5] Ganapathiraju, A. (2001), "Support Vector Machines for Speech Recognition.", PhD thesis, Mississippi State University
[6] Salomon, J. (2001), "Support Vector Machines for Phoneme Classification.", Master thesis, University of Edinburgh
[7] Young, S. (1999), "HTK.", Cambridge University Engineering Department, http://htk.eng.cam.ac.uk
[8] Platt, J. (1999), "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.", in Smola, A., Bartlett, P., Schölkopf, B., Schuurmans, D. (eds.), Advances in Large Margin Classifiers, MIT Press
[9] Joachims, T. (1999), "Making Large-Scale SVM Learning Practical.", in Schölkopf, B., Burges, C., Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, http://cs.cornell.edu/People/tj/svm light
[10] Stuhlsatz, A. (2004), "HSVM - A SVM toolkit for segmented speech data.", unpublished, University of Applied Sciences Düsseldorf, Otto-von-Guericke University Magdeburg, [email protected]
[11] Trentin, E. and Gori, M. (2001), "A survey of hybrid ANN/HMM models for automatic speech recognition.", Neurocomputing, 37:91-126
[12] Stuhlsatz, A., Meier, H.-G., Katz, M., Krüger, S. E. and Wendemuth, A. (2003), "Classification of speech recognition hypotheses with Support Vector Machines.", in Proceedings of the Speech Processing Workshop in connection with DAGM (Speech-DAGM), Magdeburg, pp. 65-72, published by University of Magdeburg