EMBEDDED IMPLEMENTATION OF A HYBRID NEURAL-NETWORK TELEPHONE SPEECH RECOGNITION SYSTEM

Johan Schalkwyk, Pieter Vermeulen, Mark Fanty and Ronald Cole

Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology, 20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000, USA. Tel: +1 503-690-1318, E-mail: [email protected]

ABSTRACT

An embedded implementation of a hybrid neural-network based telephone speech recognition system is described. The system (1) digitizes speech and computes a PLP spectral representation; (2) uses a neural network to estimate probabilities for 534 context-dependent phonetic categories; and (3) uses these phoneme probabilities for a general-purpose word-spotting task.


1. Introduction

In this paper we present an embedded implementation of a hybrid neural-network based speaker-independent telephone speech recognition system. The algorithm is implemented on a Linkon FC3000 PC telephony board, which can carry up to 12 DSP32C processors per board (one for each connected telephone line). Given the US-standard T1 digital interface with 24 telephone lines, two of these telephony boards allow up to 24 simultaneous recognition tasks to run on the same platform. In the next section we outline the algorithm. Section 3 briefly discusses modularity of code under the VCOS operating environment, followed by a discussion of some of the implementation issues in Section 4. In Section 5 we present timing and memory usage for each of the DSP modules used in the embedded implementation, followed by a summary and conclusions in Section 6.

2. Algorithmic Outline

Figure 1 depicts the basic outline of the speech recognition algorithm. This hybrid technology lends itself naturally to a pipelined implementation, reducing the associated response time. We now outline the modules of this algorithm, which is detailed elsewhere [1]:

Figure 1. Algorithmic outline: data capture, barge-in/end-of-utterance detection, RASTA-PLP computation and energy pre-normalization, followed by feature collection, frame-based phonetic classification and Viterbi search.

Barge-In/End of Utterance Detection: Barge-in (or talk-through) concerns itself with detecting whether the end-user is preempting the voice prompt. The problem in this scenario is not only to detect that the person is speaking, and thus be able to act upon this fact, but also to avoid acting on other spurious high-energy non-speech events which may occur. Various algorithms have been proposed for detecting speech versus non-speech events [2, 3]. These are based either on relative energy measurements or on deviations from the spectral characteristics of the background noise. We propose a method similar to that of Mauuary [2], but choose to use a running estimate of the variance of the absolute amplitude of the signal over time. Together with state duration constraints, this allows us to effectively filter short energy events and to trigger on end-of-speech events.
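For concreteness, a minimal C sketch of such a detector is given below. The update rule and all constants (decay, deviation threshold, duration limits) are illustrative assumptions; the text above does not specify them.

```c
#include <math.h>

/* Illustrative sketch of the barge-in detector described above: an
 * exponentially weighted running estimate of the mean and variance of
 * the absolute sample amplitude, gated by state duration constraints.
 * All constants are assumptions, not values from the paper. */

#define ALPHA       0.995f  /* decay of the running statistics (assumed)          */
#define SPEECH_K    4.0f    /* trigger at K standard deviations (assumed)         */
#define MIN_SPEECH  30      /* frames (~300 ms) before declaring speech (assumed) */
#define MIN_SILENCE 50      /* frames (~500 ms) before end of utterance (assumed) */

typedef struct {
    float mean, var;   /* running statistics of |x|                 */
    int   run;         /* duration of the current tentative state   */
    int   in_speech;   /* current decision                          */
} bargein_t;

/* Update with one frame's mean absolute amplitude; returns 1 while
 * speech is active, 0 otherwise. */
static int bargein_update(bargein_t *d, float abs_amp)
{
    float dev  = abs_amp - d->mean;
    int   loud = dev > SPEECH_K * sqrtf(d->var + 1e-6f);

    if (!d->in_speech) {
        /* Adapt the background statistics only while not in speech. */
        d->mean += (1.0f - ALPHA) * dev;
        d->var  += (1.0f - ALPHA) * (dev * dev - d->var);
        d->run = loud ? d->run + 1 : 0;
        if (d->run >= MIN_SPEECH) {   /* duration constraint filters clicks */
            d->in_speech = 1;
            d->run = 0;
        }
    } else {
        d->run = loud ? 0 : d->run + 1;
        if (d->run >= MIN_SILENCE) {  /* end-of-utterance trigger */
            d->in_speech = 0;
            d->run = 0;
        }
    }
    return d->in_speech;
}
```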

DC-Removal: Because the PLP feature computation involves computing the FFT of the waveform, the DC component is easily removed by setting the first FFT component to zero [4]. However, to keep the system modular and facilitate the replacement of the PLP analysis with other feature extractors, such as LPC, which do not involve an FFT, the DC-component removal unit was implemented using a simple first-order IIR filter.

Signal Representation: The speech signal is represented using a seventh-order PLP (Perceptual Linear Predictive) analysis [5], computed every 10 ms over a 10 ms window. This analysis yields eight coefficients per frame (7 cepstral coefficients and one energy coefficient).

Pre-normalization: The energy coefficient of the input cepstral feature vector is normalized using an automatic gain control (AGC) filter with a look-ahead buffer of 160 ms. The normalization is performed by a variable-gain amplifier in which the gain is controlled by a peak detector on the energy feature. The peak detector has a decay factor of 0.999 and also includes a limiter to prevent excessive gain during silence [4]. Currently this form of energy pre-normalization introduces an inherent delay of 160 ms into the pipelined recognition process. We have reason to believe that this look-ahead is not needed in a digital telephone network with built-in gain control, and that a well-chosen initial value together with a simple IIR low-pass estimate of the peak value will suffice. This will be investigated in the future. (A sketch of the DC filter and the AGC appears after this list.)

Feature Collection: The input to the phonemic classifier consists of 56 features representing PLP cepstral coefficients from 7 distinct regions spanning a 160 ms window centered on the frame to be classified. This adds another 80 ms of inherent delay to the pipelined recognition.
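As referenced above, the DC filter and the AGC admit compact per-frame implementations. The sketch below is illustrative: the DC-blocker coefficient is a typical choice rather than a value from the text, the peak-detector decay of 0.999 and the limiter follow the description above, and the 160 ms look-ahead buffering is omitted for brevity.

```c
/* Minimal sketch of the front-end normalization stages described above.
 * The DC filter is a standard first-order IIR DC blocker (coefficient
 * DC_R is a typical value, not one given in the paper). The AGC follows
 * the description: a peak detector with decay 0.999 driving a variable
 * gain, with a limiter to prevent excessive gain during silence. */

#define DC_R       0.999f   /* DC-blocker pole (assumed typical value) */
#define PEAK_DECAY 0.999f   /* peak-detector decay, as in the text     */
#define GAIN_MAX   100.0f   /* limiter ceiling (assumed)               */

typedef struct { float x1, y1; } dcblock_t;

/* y[n] = x[n] - x[n-1] + DC_R * y[n-1]: removes DC without an FFT. */
static float dc_remove(dcblock_t *f, float x)
{
    float y = x - f->x1 + DC_R * f->y1;
    f->x1 = x;
    f->y1 = y;
    return y;
}

/* Normalize the per-frame energy feature by a decaying peak estimate. */
static float agc_normalize(float *peak, float energy)
{
    *peak *= PEAK_DECAY;                  /* let the peak estimate decay  */
    if (energy > *peak) *peak = energy;   /* track new maxima instantly   */
    float gain = 1.0f / (*peak + 1e-6f);
    if (gain > GAIN_MAX) gain = GAIN_MAX; /* limiter: cap gain in silence */
    return energy * gain;
}
```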

Frame-based Phonetic Probability Estimation: We use a mixture of triphones, biphones, monophones and broad categories based on place of articulation as our units of recognition. For example, the /aa/ in the word "man" (/m aa n/) is represented by the three-part sequence "nasal<aa", meaning the first part of an /aa/ that follows a nasal; "<aa>", the middle or stationary part; and "aa>nasal", the final part of an /aa/ that precedes a nasal. Some phonemes do not have stable middle parts, and others have the same characteristics independent of context; these are represented by two units and one unit respectively.

All English words are represented in terms of 534 such units. A fully connected three-layer perceptron (MLP) neural network estimates the probability that each of these categories is present.

Search: Finally, a Viterbi search is used to combine this matrix of classification probabilities and decide which word (or sequence of words) was spoken. Each word is expressed in terms of the sequence of phonetic units expected when that word is uttered. To compute the likelihood that each word was spoken, one typically assumes that the acoustic vectors in different time frames are independent, so that the likelihood of a phoneme occurring in a sequence of time frames is the product of its likelihoods in the individual time frames.
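Under this independence assumption, scoring a fixed alignment of frames to units reduces to summing log probabilities, as in the minimal sketch below (the full Viterbi recursion over alignments is omitted):

```c
#include <math.h>

/* Minimal sketch of the frame-independence assumption described above:
 * given the network's per-frame probabilities for each of the 534 units,
 * the log-likelihood of a fixed alignment (one unit index per frame) is
 * simply the sum of the per-frame log probabilities, i.e. the log of
 * the product of the individual frame likelihoods. */

#define NUM_UNITS 534

static float path_log_likelihood(const float probs[][NUM_UNITS],
                                 const int *alignment, int num_frames)
{
    float ll = 0.0f;
    for (int t = 0; t < num_frames; t++)
        ll += logf(probs[t][alignment[t]] + 1e-10f); /* log of product */
    return ll;
}
```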

3. Regarding Modularity

The philosophy behind the real-time implementation was to maintain modularity of run-time code throughout the whole recognition process. Maintaining modularity allows rapid future development of related algorithms. Using the VCOS [6] operating environment we were able to efficiently develop stand-alone DSP modules with input, output and parameter buffers. VCOS provides an environment in which one can reuse these modules in an application without referring to the assembly code. All basic building blocks of the algorithm were developed as separate entities, where each entity is described by a set of input, output and parameter buffers, based upon the VCOS assembler macros. Using this approach one can easily deploy both a speaker verification and a recognition system that share run-time code wherever possible. This process is depicted in Figure 2. VCOS employs a visible caching mechanism to gain efficiency: each time a module is called, it loads its program and/or data into "cache memory" and then processes a block of data. By processing a large block of data on each call, the caching overhead is reduced in direct proportion to the amount of data processed.
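The actual module interface is defined by the VCOS macros [6]; purely as an illustration of the caching argument, the hypothetical module below pays a fixed cache-load cost once per call, so the per-sample overhead falls in proportion to the block size:

```c
/* Generic illustration of the caching argument above; this is NOT the
 * VCOS API, just a hypothetical module with input, output and parameter
 * buffers. The fixed cost of loading code/state into on-chip memory is
 * paid once per call, so the per-sample overhead is (load cost)/nsamples
 * and shrinks as the processed block grows. */

typedef struct {
    const float *input;   /* input buffer                    */
    float       *output;  /* output buffer                   */
    void        *params;  /* parameter buffer (module state) */
} module_io_t;

static void module_process(module_io_t *io, int nsamples)
{
    /* load_state_to_cache(io->params);    fixed cost, once per call */
    for (int n = 0; n < nsamples; n++)
        io->output[n] = io->input[n];   /* per-sample work goes here  */
    /* save_state_from_cache(io->params);  fixed cost, once per call */
}
```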

Figure 2. Modularity of DSP run-time code: shared modules (DMA, echo canceller, barge-in detector, speech detector, DC filter, signal analysis) feed both a speaker-verification task and a word spotter (here with the vocabulary "help", "call wait" and "call block").

4. Embedded implementation

The system described above has been implemented to run in real time on the Linkon FC3000 board, based on the DSP32C. The modifications required to the baseline algorithm to achieve real-time performance were mostly related to optimal use of the multiply-accumulate pipeline and efficient use of DSP pipeline delay slots. If several multiply-accumulate instructions are executed one after the other, the DSP32C automatically pipelines the instructions such that one instruction completes every instruction cycle, but results are not immediately available for reuse; this latency is 4 instruction cycles on the DSP32C [7]. The very regular memory access pattern of the matrix-vector multiplications in the feed-forward MLP is especially well suited to this architecture. The Viterbi search, on the other hand, is not well suited to it, since it involves intensive memory usage with multiple floating-point comparisons. We optimized the Viterbi search by representing the word models using a compact tree representation, depicted in Figure 3 for the three phrases "call waiting", "call block" and "help". By sharing common states throughout, the compact-tree search reduces both the memory and computational requirements of the word-spotting task. Furthermore, we introduced a pruning mechanism by which the tree is pruned, keeping only states (or branches) with a significant score. The pruning factor is adapted so as to keep a predefined number of states active during the search; in our experiments we found that we can keep 30% of the states active without incurring recognition errors.
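The standard way to keep such a pipeline full is to interleave independent accumulation chains. The C sketch below illustrates the idea for the MLP's matrix-vector product (the production code was hand-written DSP32C assembly; the four-way interleaving mirrors the 4-cycle result latency):

```c
/* Sketch of the latency-hiding idea in C. Because a multiply-accumulate
 * result is not available for reuse for 4 cycles, accumulating into a
 * single variable would stall the pipeline; interleaving 4 independent
 * accumulators (here: 4 output neurons at a time) lets one MAC complete
 * every cycle. Assumes rows is a multiple of 4 for brevity. */

static void matvec4(const float *w, const float *x,
                    float *y, int rows, int cols)
{
    for (int r = 0; r < rows; r += 4) {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        const float *w0 = w + (r + 0) * cols;
        const float *w1 = w + (r + 1) * cols;
        const float *w2 = w + (r + 2) * cols;
        const float *w3 = w + (r + 3) * cols;
        for (int c = 0; c < cols; c++) {
            /* four independent MAC chains: no result is reused within
             * the 4-cycle pipeline latency */
            a0 += w0[c] * x[c];
            a1 += w1[c] * x[c];
            a2 += w2[c] * x[c];
            a3 += w3[c] * x[c];
        }
        y[r + 0] = a0; y[r + 1] = a1; y[r + 2] = a2; y[r + 3] = a3;
    }
}
```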

Figure 3. Compact tree representation of word models for the phrases "call waiting", "call block" and "help".
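Purely as an illustration of the adaptive pruning described above (the data layout and the adaptation step are assumptions, not taken from the implementation):

```c
/* Hypothetical sketch of adaptive pruning over the compact tree. Each
 * tree state carries a path score; states below the current threshold
 * are deactivated, and the threshold is nudged so that roughly `target`
 * states stay active (the text reports ~30% sufficing without
 * recognition errors). */

typedef struct {
    float score;
    int   active;
} tree_state_t;

static void prune(tree_state_t *states, int nstates,
                  float *threshold, int target)
{
    int active = 0;
    for (int i = 0; i < nstates; i++) {
        states[i].active = states[i].active &&
                           (states[i].score >= *threshold);
        active += states[i].active;
    }
    /* Adapt the pruning threshold toward the target active count;
     * the step size is an illustrative assumption. */
    if (active > target)      *threshold += 1.0f;  /* prune harder */
    else if (active < target) *threshold -= 1.0f;  /* prune softer */
}
```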

Further optimizations concentrated on optimal usage of internal cache memory for lookup tables (sigmoid and log functions), intermediate results (neural-network multiplications) and certain Viterbi search history variables. For most small-vocabulary applications the current implementation runs in real time. As the vocabulary grows, the search memory and computational requirements grow, and beyond some cut-off it may no longer fit in real time. Depending on the application, this may be acceptable: for example, end-pointing of the utterance typically lags the actual end of speech by a few hundred milliseconds, and the analysis of these non-speech frames can be queued without actually being processed if not needed. In our implementation all modules except the neural-network probability estimator and the Viterbi search run in an interrupt mode, executed every 10 ms. The neural-network probability estimator and the Viterbi search run in interruptible mode and can lag real-time execution, depending on the size of the vocabulary. The communication between these two parts of the algorithm is via a shared parameter buffer.

5. Timing and Memory Usage Requirements

Table 1 tabulates the memory usage and percentage of real time for each of the modules running as a non-interruptible process. All measurements are based on the 88 MHz DSP32C as implemented on the Linkon FC3000 board. These modules constitute mainly the feature collection part of the speech recognizer.

Module Descriptor     code   data    % real time
Echo Canceller         708   2088    25 %
Barge-In Detector     2708     36     3 %
DC-Offset Removal      256     44     0.2 %
PLP                   2704   3052    10 %
Pre-normalization      564    576     0.04 %
Feature Collection     564   1068     0.1 %
Shared Parameters       --  87431    --

Table 1. Memory requirements (bytes) and timing measurements for interrupt-mode DSP modules.

Every 10 ms the DSP interrupts the current process, updates the DMA frame pointer and executes all interrupt-mode modules. Since these modules are non-interruptible, it is important to keep their execution well below real time, so as not to lose any DMA samples while still leaving enough cycles for the background modules (i.e., the neural network and the compact-tree search) to complete. While the voice prompt is playing, each of these modules runs continually. The PLP features are stored in a circular buffer, which can hold up to five seconds of PLP features. During this mode the background part of the recognizer is not running, giving a total CPU requirement of roughly 40% of real time. As soon as the barge-in module detects speech apart from the prompt (talk-through), the voice prompt is switched off and the recognizer is switched on; processing then starts on features one second earlier than the current DMA frame. During recognition the echo canceller is switched off, and the speech detector switches to end-of-utterance detection. In recognition mode the interrupt-level requirement drops to 14% of CPU time, since echo cancellation is not needed, leaving the remaining 86% of CPU time for the neural network and the search. Depending on both the size of the neural network (number of parameters) and the size of the vocabulary (number of states in the compact tree), the recognizer may run in real time or lag the feature extractor.
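A minimal sketch of such a circular feature buffer, with sizes taken from the text (100 frames per second, five seconds of storage, eight coefficients per frame) and an otherwise assumed layout:

```c
/* Minimal sketch of the circular PLP feature buffer described above:
 * 5 seconds at 100 frames/s, 8 coefficients per frame. On barge-in, the
 * background recognizer starts reading 100 frames (one second) behind
 * the current DMA frame. The layout is an illustrative assumption. */

#define FRAMES_PER_SEC 100                 /* one frame every 10 ms    */
#define BUF_FRAMES     (5 * FRAMES_PER_SEC)
#define NUM_COEFF      8                   /* 7 cepstral + 1 energy    */

typedef struct {
    float frames[BUF_FRAMES][NUM_COEFF];
    int   write;                 /* index of the next frame to write   */
} feat_buf_t;

/* Called from the 10 ms interrupt after feature extraction. */
static void feat_push(feat_buf_t *b, const float coeff[NUM_COEFF])
{
    for (int i = 0; i < NUM_COEFF; i++)
        b->frames[b->write][i] = coeff[i];
    b->write = (b->write + 1) % BUF_FRAMES;
}

/* On barge-in: index of the frame one second before the current one. */
static int feat_rewind_one_second(const feat_buf_t *b)
{
    return (b->write - FRAMES_PER_SEC + BUF_FRAMES) % BUF_FRAMES;
}
```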

Table 2 tabulates the memory usage and timing measurements for the background-level tasks, for two different recognizer configurations.

Module Descriptor                        code   memory   % real time
Neural Network (56 x 45 x 534)           5840   106704   44 %
Neural Network (56 x 45 x 109)
  (month recognizer)                     5840    29880   12 %
Compact Tree (1163 states) (45 words)    4328    52264   49 %
Compact Tree (250 states) (12 words)     4328     8648   10 %

Table 2. Memory requirements (bytes) and timing measurements for background-level DSP modules.

6. Conclusion

The architecture presented lends itself to an efficient hardware implementation: the algorithm is compact and mostly highly regular, and thus executes very efficiently on a pipelined DSP architecture such as the DSP32C. Sharing run-time code in a modular fashion, based on the VCOS operating environment, enables rapid development of applications and deployment of speech recognizers and related algorithms.

References

[1] E. Barnard, R. Cole, M. Fanty, and P. Vermeulen, "Real-world speech recognition with neural networks," in Applications of Neural Networks to Telecommunications (J. Alspector, R. Goodman, and T. X. Brown, eds.), vol. 2, (Hillsdale, New Jersey), pp. 186-193, IWANNT'95, Lawrence Erlbaum Assoc., 1995.
[2] L. Mauuary and J. Monné, "Speech/non-speech detection for voice response systems," in International Conference on Acoustics, Speech, and Signal Processing, pp. 1097-1100, 1993.
[3] M. Rangoussi, S. Bakamidis, and G. Carayannis, "Robust end-point detection of speech in the presence of noise," in Proceedings of Eurospeech'93, pp. 649-652, 1993.
[4] P. Schmid, R. Cole, M. Fanty, and H. Bourlard, "Real-time, neural-network based, French alphabet recognition with telephone speech," in Proceedings of the 3rd European Conference on Speech Communication and Technology, 1993.
[5] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn, "Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP)," in Proceedings of Eurospeech'91, pp. 1367-1370, 1991.
[6] VCOS Software Development Kit, Technical Reference Manual. AT&T, 1994.
[7] DSP32C Digital Signal Processor (Information Manual). AT&T, 1990.