Transition-based Feature Extraction within Frame-based Recognition

Zhihong Hu, Etienne Barnard, Ronald Cole
{zhihong,barnard,cole}@cse.ogi.edu
Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology
Abstract
Current frame-based speech recognition systems sample speech at a fixed set of locations relative to each frame, which complicates modeling the temporal dynamics of speech. This work shows that by explicitly using transitional information when extracting features, one can better model acoustic-phonetic structure, resulting in higher word-level recognition performance. In the proposed approach, features representing local transitional information are used: a constant number of features is selected at each time frame, but the features are sampled near the areas of greatest spectral change within a relatively long window. By explicitly modeling transitions in this way, we can also model local contextual information. Using this technique, the word-level error rate decreased by up to 30% on the databases we tested.
1. Introduction
Current speech recognition systems can be categorized as either frame-based or segment-based [1]. Frame-based systems are currently more popular since they do not require explicit detection of segment boundaries in a vocabulary-independent fashion, and thus give better classification performance. They suffer, however, from severe modeling limitations: speech is modeled as a sequence of independent frames, and these are assumed to have piecewise-constant statistics. One way to overcome some of these limitations is to construct a frame-based system that focuses on transitional information. There is evidence that the transitional parts of speech carry important information for speech perception [2]. Various approaches to incorporating this information have been suggested. For example, Ghitza and Sondhi's [3] HMM-based system recognizes speech based only upon classification of diphone transitions. Fanty and Cole [4] modeled dynamics by sampling features more densely at segment boundaries, thereby improving alphabet recognition performance. Goldenthal's segment-based system [5] also attempts to model transitions by incorporating cross-phone feature trajectories. Bahl's [6] HMM system uses discriminant projections to extract the most informative features from the adjoining features of several frames in order to capture temporal information. These methods are computationally intensive, and are not easy to extend to real-time implementation. We propose an alternative methodology for acoustic-phonetic modeling of speech sounds. In this method we extract features within the frame-based approach, while emphasizing local transitional information. In the following section we outline the general framework of the frame-based system and describe our approach in greater detail. Section 3 describes two different system structures, along with experiments and results. Finally, Section 4 concludes and proposes future developments along these lines.
2. Feature Extraction
The general system structure is shown in Figure 1.
[Figure 1 block diagram: signal → delta-PLP feature extraction over a 174 ms window around the center frame → baseline (fixed sampling) or transition-based feature extraction.]
Figure 1: System structure.

Features are extracted for each frame of incoming speech. Typically these features are cepstrum-based after some spectral warping; in our systems, 7th-order PLP coefficients are used [7]. These features then serve as input either to a discriminative classifier (e.g., a neural network) or to an estimator of likelihoods (such as continuous hidden Markov models). In our baseline system, which uses fixed sampling of features, the input to the classifier consists of 56 features representing PLP coefficients from 7 regions spanning a 174 ms window centered on the frame to be classified. Once we have an estimate of the probability of each possible phone for each frame, these estimates are used as input to a Viterbi search, producing the recognized word or word string.

Within the framework described above, our new approach explicitly incorporates local transitional information, as depicted in Figure 1. The crux of this approach is to provide the classifier with information from the local parts of speech where the spectrum changes rapidly. The assumption behind incorporating local transitional information is that these features may be more informative and/or more invariant than the fixed-sampling feature representation. Hypothesized peaks, i.e., points where the spectrum changes most rapidly, are obtained using the first- and second-order delta cepstra computed from the PLP representation of the speech signal. A neural network classifier is then used to determine whether a hypothesized peak is a true transition or not. This decision is based upon the acoustic features surrounding the peak, as well as a parametric representation of the shape of the peak.¹ To examine how well the detected spectral-change peaks correlate with hand-labeled segment boundaries, we compared the peaks detected by the neural network classifier with the hand-labeled boundaries: 62.3% of the spectral-change peaks were located within 40 ms of a hand-labeled segment boundary.
It must be noted that the detected peaks reflect mostly spectral properties, which do not necessarily correspond to the hand-labeled segment boundaries. We expect a high correlation between the two, but not an exact match.
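The peak-hypothesis and variable-sampling idea can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the spectral-change measure, the window handling, and the function names (`spectral_change`, `sample_features`) are all assumptions; the paper states only that first- and second-order delta cepstra drive the peak hypotheses and that a constant number of features is drawn from a 174 ms window.

```python
import numpy as np

def spectral_change(d1, d2):
    """Per-frame spectral-change score from first- and second-order delta
    cepstra. The exact measure is an assumption; the paper does not give it."""
    return np.linalg.norm(d1, axis=1) + np.linalg.norm(d2, axis=1)

def sample_features(features, change, center, half_width=8, n_regions=7):
    """Select a constant number of frames from the window around `center`,
    clustered near the frame of greatest spectral change (the hypothesized
    transition) instead of at evenly spaced positions as in the baseline."""
    lo = max(center - half_width, 0)
    hi = min(center + half_width + 1, len(features))
    peak = lo + int(np.argmax(change[lo:hi]))        # hypothesized transition
    offsets = np.arange(n_regions) - n_regions // 2  # e.g. [-3 .. 3]
    idx = np.clip(peak + offsets, lo, hi - 1)        # stay inside the window
    return features[idx].ravel()                     # n_regions * n_coeffs values

# Toy example: 100 frames of 8 cepstral coefficients per frame.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 8))
d1 = np.diff(feats, axis=0, prepend=feats[:1])       # crude delta stand-in
d2 = np.diff(d1, axis=0, prepend=d1[:1])
vec = sample_features(feats, spectral_change(d1, d2), center=50)
print(vec.shape)                                     # (56,) matches baseline size
```

Note that the system described above additionally runs a neural-network verifier over each hypothesized peak before accepting it as a transition; that step is omitted from this sketch.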
3. Experiments and Results
To evaluate the proposed method, we tested both the baseline approach and the feature extraction based on transitional information on (a) continuous digit recognition and (b) recognition of a 58-word vocabulary collected for a telecommunications task.²

¹ In this case, we use both PLP coefficients and first-order delta PLP coefficients to represent the acoustics at the hypothesized peak.
² The task is to identify which of the 58 words or phrases was spoken in the incoming utterance.
To study the effect of the feature extraction method, we keep all other stages of the recognition process, as described in Figure 1, the same (i.e., word model, grammar, and recognition method). To constrain the comparison to the effect of the features used for classification alone, we keep the back end of recognition as simple as possible: we use a single pronunciation model for each word, an unconstrained grammar, and assume pauses between digits. This results in a much higher word-level error rate than we could obtain with more complicated language and word models, but it does allow for a valid comparison of the two feature representations discussed in Section 2. In the experiments, we use both tri-phones and bi-phones as our units of recognition. For example, the tri-phone "f<ay>v" represents phoneme "ay" in the left context of phoneme "f" and the right context of phoneme "v". The bi-phone "f<ay" represents phoneme "ay" in the left context of phoneme "f", and the bi-phone "ay>v" represents phoneme "ay" in the right context of phoneme "v".
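The unit-naming convention above is easy to make concrete. The helper names below are hypothetical; only the "f<ay>v", "f<ay", "ay>v" notation comes from the text:

```python
def triphone(left, phone, right):
    # "f<ay>v": phoneme "ay" with left context "f" and right context "v"
    return f"{left}<{phone}>{right}"

def biphones(left, phone, right):
    # a left bi-phone "f<ay" plus a right bi-phone "ay>v"
    return f"{left}<{phone}", f"{phone}>{right}"

# Units for the "ay" in "five" (/f ay v/):
print(triphone("f", "ay", "v"))   # f<ay>v
print(biphones("f", "ay", "v"))   # ('f<ay', 'ay>v')
```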
3.1. Feature invariance analysis
To examine the assumption that the transition-based feature representation is more invariant than the fixed-sampling features, we compared the average covariance across each segment class. Figure 2 shows the histogram, over all classes examined (tri-phone and bi-phone), of the differences in covariance between the baseline and the transition-based features. These statistics were computed on both the 58-word and the digit databases, using both tri-phone and bi-phone phoneme models. The covariance of a given class was determined by averaging the covariance matrices of that class over a subset of the training database; the difference in covariance between the transition-based features and the baseline approach can then be calculated per class. The sum of the eigenvalues of a covariance matrix equals its trace, i.e., the sum of the individual feature variances, and this sum gives an indication of the invariance of the features used. Positive values indicate that the sum of feature variances for the baseline system is larger than that of the transition-based approach, i.e., that the baseline features are more variant. From Figure 2 we can conclude that: 1. The transition-based features are generally a more invariant representation of the same phoneme in the same context than the fixed-sampling features. 2. The results on digits are better than those on the 58-word task. This might be related to the fact that the digit data are hand-labeled but the 58-word data were forced-aligned, and
thus the boundaries are not as accurate as they would be if hand-labeled.

[Figure 2: Variance analysis for the two kinds of features: histograms of the per-class covariance differences for bi-phone and tri-phone models on the 58-WORDS and DIGITS corpora.]

3.2. Experiments using the tri-phone phoneme model

In this section we report on experiments using the tri-phone as our unit of recognition; that is, each segment is represented by a phoneme with specified phonemes as its left and right contexts, respectively. The first experiment used the OGI Numbers database.³ The comparison experiment was performed using the utterances that contain only continuous digit strings. Table 1 shows the word-level results of both systems.

                 baseline   transition
  substitution   8.9%       7.5%
  deletion       9.5%       7.3%
  insertion      3.3%       1.6%
  word error     21.8%      16.4%

Table 1: Tri-phone results on the digits corpus.

Using features based on transitional information helped decrease both insertions and deletions, resulting in a 25% reduction in word error.

The second experiment was performed on a database containing 58 words or multi-word phrases spoken by each of 1120 different speakers. 60% of the data (670 speakers) were used for training. The remaining 400 speakers were divided into two groups of 200 speakers each, one for the development set and the other for the final test set. The results are shown in Table 2. Using transition-based features, the word error was reduced by 30%.

               baseline   transition
  word error   12.3%      8.7%

Table 2: Tri-phone results on the 58-words corpus.

³ This corpus contains a series of spoken numbers taken from utterances of zip codes, street addresses, and telephone numbers. The corpus is available to academic institutions free of charge from OGI. The first version has been released publicly, and the experimental results reported in this paper were obtained on that part of the corpus.
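The word-error figures reported in the tables combine substitutions, deletions, and insertions. As a reference point, the standard alignment-based way of obtaining these counts (the usual metric, not code from this work) can be sketched as:

```python
def word_errors(ref, hyp):
    """Count substitutions, deletions, and insertions between a reference
    word string and a hypothesis via Levenshtein alignment; also return
    the word error rate relative to the reference length."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)                    # all deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)                    # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if ref[i - 1] == hyp[j - 1] else 1
            t, su, de, ins = cost[i - 1][j - 1]
            best = (t + s, su + s, de, ins)          # match / substitution
            t, su, de, ins = cost[i - 1][j]
            if t + 1 < best[0]:
                best = (t + 1, su, de + 1, ins)      # deletion
            t, su, de, ins = cost[i][j - 1]
            if t + 1 < best[0]:
                best = (t + 1, su, de, ins + 1)      # insertion
            cost[i][j] = best
    total, subs, dels, ins = cost[n][m]
    return subs, dels, ins, total / max(n, 1)

# One substitution ("two" -> "three") and one insertion ("oh"): WER = 2/4.
print(word_errors("one two three four".split(),
                  "one three three oh four".split()))   # (1, 0, 1, 0.5)
```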
Both experiments show that incorporating transitional information in the features improves recognition performance significantly.
3.3. Experiments using bi-phone phoneme model
Encouraged by the results obtained with tri-phone phoneme models, we implemented a set of bi-phone phoneme models using transition-based features. The motivation is to obtain better performance through finer modeling of the speech sounds, and also to reduce the complexity of the recognizer for large-vocabulary tasks. The system structure of our bi-phone phoneme model system is shown in Figure 3.
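A back-of-envelope count illustrates the complexity argument. The inventory size of 40 phonemes is a hypothetical figure, and the count assumes every context combination actually occurs (in practice far fewer are seen):

```python
# Upper bounds on the number of context-dependent units for an
# inventory of N phonemes, assuming every context combination occurs.
N = 40                       # illustrative phoneme-inventory size (assumption)
tri_phones = N ** 3          # phone p in left context l and right context r
bi_phones = 2 * N ** 2       # left bi-phones "l<p" plus right bi-phones "p>r"
print(tri_phones, bi_phones) # 64000 3200
```

The quadratic rather than cubic growth is what makes bi-phone models attractive for large-vocabulary tasks.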
[Figure 3: Bi-phone model system structure: signal → delta-PLP, center-frame, and transition-based feature extraction → Left Net and Right Net → search → word string.]

In this case, we concentrated on the OGI Numbers
database. The training and test conditions were kept the same as in the first experiment. Table 3 shows the word-level results on the digits corpus. In this table, experiment B1 is the baseline system (fixed sampling) using only PLP features; experiment B2 is the baseline system using both PLP and first-order delta PLP features; experiment T1 uses transition-based features (variable sampling) as input to the phoneme probability estimator; finally, in experiment T2 we use the hand-labeled segment boundaries to indicate the transition points, i.e., where to extract the transition-based features.⁴ The last experiment (T2) was done to further verify that the transition-based features are less variant, and it provides a lower bound on the recognition error obtainable with an approach like our current one.

                 B1      B2      T1      T2
  substitution   6.7%    6.3%    5.1%    4.8%
  deletion       4.4%    4.0%    3.8%    3.8%
  insertion      1.6%    2.0%    1.3%    1.5%
  word error     12.7%   12.3%   10.2%   10.2%

Table 3: Bi-phone results on the digits corpus.

Using features based on transitional information helped decrease both insertions and deletions, resulting in a 20% reduction in word error. Comparing the results of T2 and T1, we can also conclude that the hand-labeled segment boundaries are not necessarily the spectral transition points; knowing exactly where the segment boundary lies therefore gives no significant advantage over our transition-peak detection algorithm for extracting transition-based features in this system.
4. Summary and Future Work
In this paper we presented a new approach to incorporating transitional information into the features used for continuous speech recognition. Feature analysis suggests that the transition-based features (variable sampling) are more invariant than the baseline features (fixed sampling). Experimental results on two different corpora indicate that transition-based features improve recognition performance significantly, and the method therefore represents a promising new technique for extending speaker-independent continuous speech recognition. The results we obtained conform to previous work on stochastic perceptual models of speech [8].

⁴ The feature vector dimension in experiment B1 is 56; the feature vector dimension in experiments B2, T1, and T2 is 104.
The following extensions of this technique are now being studied:
- The majority of the errors in our transition peak-detection algorithm occur at vowel-vowel or vowel-semivowel boundaries in fluent speech, where the transition is not sharp. Further study is therefore needed to develop more intelligent techniques for accurately finding the transitional parts of fluent speech, which directly impacts the performance of this approach.
- More informative representations of the transitional information should be designed to incorporate the information more efficiently.
References
[1] L. Rabiner and B.H. Juang. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall (Signal Processing Series), 1993. ISBN 0-13-015157-2.
[2] S. Furui. On the role of spectral transition for speech perception. J. Acoust. Soc. Am., 80(4):1016–1025, October 1986.
[3] O. Ghitza and M.M. Sondhi. Hidden Markov models with templates as non-stationary states: an application to speech recognition. Computer Speech and Language, 2:101–119, 1993.
[4] M. Fanty, R. Cole, and K. Roginski. English alphabet recognition with telephone speech. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann, 1992.
[5] W.D. Goldenthal. Statistical Trajectory Models for Phonetic Recognition. PhD thesis, M.I.T., August 1994.
[6] L.R. Bahl, P.V. de Souza, P.S. Gopalakrishnan, D. Nahamoo, and M.A. Picheny. Robust methods for using context-dependent features and models in a continuous speech recognizer. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pages 533–536, April 1994.
[7] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., 87(4):1738–1752, 1990.
[8] N. Morgan, H. Bourlard, S. Greenberg, H. Hermansky, and S. Wu. Stochastic perceptual models of speech. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pages 397–400, May 1995.