A DSP-BASED MODULAR ARCHITECTURE FOR NOISE CANCELLATION AND SPEECH RECOGNITION
P. Gómez, A. Alvarez, R. Martinez, M. Pérez-Castellanos, V. Rodellar and V. Nieto
Dept. Arquitectura y Tecnología de Sistemas Informáticos, Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo, s/n, Boadilla del Monte, 28660, Madrid, SPAIN
Phone: +34.1.336.73.82, FAX: +34.1.336.66.01
E-mail: [email protected]
ABSTRACT
This paper presents the architecture of a low-cost Hardware Building Block (HBB) Board for Signal Processing. An overview of the algorithms to be supported by the HBB is followed by a brief presentation of the low-cost hardware platform, which is based on the TMS320C31 and interfaces with a host computer through a PCI bus. The modular solutions based on this board for Isolated-Word Speech Recognition in Noisy Environments are also described. Measurements of the real-time performance of the card and several Noise-Canceling experiments are shown. The proposed architecture may find application in Speech Training, Language Acquisition, Phonetics and related fields.
[Figure 1 block labels: background noise; Gradient Adaptive Lattice + Levinson-Durbin Estimation; Formant Estimation; Vector Quantizer; Viterbi Algorithm]
1. INTRODUCTION
Speech Recognition is a field undergoing rapid expansion toward newer areas of application, such as Human-Machine Interfaces, Dictation Systems, Process Control, Training and Education, Computer-Aided Manufacturing, Handicap-Aid Systems, Automotive Systems, Multimedia, etc. [1-6]. This expansion is based on the fact that speech interaction seems to be a more natural way of introducing data and control in Command Interfaces (friendlier environments), opening the possibility of interacting hands-free with complex systems through the development of conversational protocols [3, 4, 7]. A key aspect of this success is reliability. Certain applications demand a high hit rate when commands or data are introduced through hands-free interfaces. Reliability is especially necessary in hostile environments where noise degrades recognition, as is the case in Automotive, Avionics or Computer-Aided Manufacturing applications. Such environments require a specific treatment of noise, and robust Speech Recognizers must be designed to counter noise and its undesired side effects [8]. These treatments are based on the development of new and flexible noise-canceling algorithms, which are introduced as pre-processing stages in Speech Recognition Modules [9]. One such system, currently being developed within the Project IVORY (ESPRIT-OMI No. 20277) [10], is shown in Figure 1.
0-7803-4455-3/98/$10.00 © 1998 IEEE
[Figure 1 block label: Word Coder]
Figure 1. General Framework of the Isolated-Word Speech Recognition System.
Figure 1 describes an Isolated-Word Speech Recognition System. A Gradient Adaptive Lattice Spectral Estimator [11] produces Vector Templates containing Dynamic Formant Information from speech. These Vector Templates are Vector Quantized using "a priori" information from a Code Book and parsed by a Hidden Markov Model (HMM) Parser, using Word Models from a previously trained Data Base. This structure is appropriate when clean speech is used as input. If speech is contaminated by noise, a two-signal recording scheme is used: one channel is reserved for noisy speech and the other for a sample of the noise, both signals being derived from a pair of microphones separated by a given distance. A Noise Canceller [12] based on a Joint Process Estimator [11, 13] then produces a sample of clean speech from both signals, improving the Signal/Noise ratio by around 10-12 dB. This enhancement, together with the tailoring of the Data Base to the characteristics of Noise-Affected Speech (Model Compensation) [8, 14], grants a significant improvement in the recognition rates. An important task to be solved is the design of a computing platform capable of supporting the computational requirements of the Noise Canceller, the Spectral Estimator, the Vector Quantizer and the HMM Parser. This platform must also be low-cost and PC compatible.
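The two-microphone recording scheme can be illustrated with a minimal sketch. Note that it swaps in a normalized LMS transversal filter for the lattice-based Joint Process Estimator of [11, 13], purely to keep the example short. The function name `nlms_cancel`, the tap count, the step size, the acoustic path and the synthetic signals are all assumptions made for illustration, not the authors' implementation, and the printed figure is not meant to reproduce the 10-12 dB reported above.

```python
import math
import random


def nlms_cancel(primary, reference, taps=8, mu=0.5, eps=1e-6):
    """Two-channel adaptive noise cancelling with a normalized LMS filter.

    primary   : noisy speech samples (speech plus filtered noise)
    reference : noise-only samples from the second microphone
    Returns the error signal, which serves as the clean-speech estimate.
    """
    w = [0.0] * taps          # adaptive filter weights
    x = [0.0] * taps          # delay line of recent reference samples
    out = []
    for d, r in zip(primary, reference):
        x = [r] + x[:-1]                              # shift in newest reference sample
        y = sum(wi * xi for wi, xi in zip(w, x))      # estimate of the noise in primary
        e = d - y                                     # error = speech estimate
        norm = eps + sum(xi * xi for xi in x)         # input power for normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out


def power_db(sig):
    """Average power of a signal in dB."""
    return 10.0 * math.log10(sum(s * s for s in sig) / len(sig))


rng = random.Random(0)
n = 4000
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed acoustic path from the noise source to the primary microphone.
path = [0.8, -0.4, 0.2]
leak = [sum(h * noise[i - k] for k, h in enumerate(path) if i - k >= 0)
        for i in range(n)]
speech = [0.1 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
primary = [s + v for s, v in zip(speech, leak)]

clean = nlms_cancel(primary, noise)
# Residual noise before and after cancelling, skipping the adaptation transient.
before = power_db([p - s for p, s in zip(primary[1000:], speech[1000:])])
after = power_db([c - s for c, s in zip(clean[1000:], speech[1000:])])
print(f"noise reduction: {before - after:.1f} dB")
```

A lattice-ladder structure such as the one used in the paper typically converges faster on non-stationary noise than this transversal filter, at the cost of extra divisions per stage.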
V-178
2. PROPOSED ARCHITECTURE
The computational requirements of the Noise Canceling and Speech Recognition Algorithms may be met by many off-the-shelf General Purpose and DSP Processors. The structure of the Noise Canceller and the Spectral Estimators requires a number of divisions in the main loops of both algorithms, and these divisions are critical to the computational cost. Therefore, a signal-processing oriented CPU implementing a fast library division routine was preferred. The design of an ASIC was disregarded as a first choice, to be reconsidered later to reduce global costs for mass production.
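To make the role of the divisions concrete, the following is a minimal sketch of the textbook Levinson-Durbin recursion, one of the tasks listed in Table 1. It is not the authors' DSP code; the function name and the test autocorrelation are assumptions. The point is that each stage of the main loop performs one division by the running prediction error.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for linear prediction.

    r     : autocorrelation values r[0..order]
    Returns (a, k, err): predictor coefficients with a[0] == 1, the
    reflection coefficients, and the final prediction error power.
    """
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        km = -acc / err                 # the one division per stage
        k.append(km)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + km * prev[m - i]
        a[m] = km
        err *= 1.0 - km * km            # updated prediction error power
    return a, k, err


# Autocorrelation of a first-order process with correlation 0.5 (assumed test data).
r = [1.0, 0.5, 0.25]
a, k, err = levinson_durbin(r, 2)
print(a, k, err)
```

On a processor without a hardware divider, each of these per-stage divisions becomes a call to a library routine, which is why the speed of that routine matters for the overall cycle count.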
The static RAM can hold up to 80 Word Models to be parsed simultaneously. The board is connected to the host motherboard by a PCI bus, under the control of an FPGA, which also manages the Analog Interface. The Analog Interface in turn comprises a stereo CODEC (for two microphone channels at 44.8 kHz) and a stereo sound output. The card has been produced and tested using the mentioned algorithms, with good results for both the 50 and 60 MHz versions of the DSP processor. Figure 3 shows a picture of the 60-MHz version.
3. MODULAR SOLUTIONS
The proposed HBB may be used as a stand-alone module or as part of a more complex system integrating several HBBs. For a Clean Speech Recognizer, a stand-alone HBB suffices. This section presents the modular structure required when the Noise Canceller and the Speech Recognizer have to be combined for applications in Noisy Environments where noise cancellation is demanded. Figure 4 gives an overview of the modular solution adopted in that case.
[Figure block labels: Primary; NC-HBB; ROM; Front-End; SR-HBB; Processor; PCI bus]
Figure 2. General structure of the Hardware Building Board (HBB).
As a result of the above design considerations, the Hardware Building Board (HBB) shown in Fig. 2 has been designed [15]. It is based on the TMS320C31 to optimize the ratio between computational power and cost, and includes a bank of up to 1 MB of 0-wait-state static RAM distributed as two blocks of 128 K words of 32 bits each.
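As a quick check of the memory figures, the short sketch below assumes binary prefixes (1 K = 1024) and, purely for illustration, that the whole static RAM is shared evenly among the 80 Word Models mentioned in the text. The even split is an assumption; the actual memory layout is not given in the paper.

```python
WORD_BITS = 32               # word width stated for the static RAM
BLOCK_WORDS = 128 * 1024     # 128 K words per block
BLOCKS = 2                   # two blocks on the board
WORD_MODELS = 80             # Word Models held simultaneously

total_words = BLOCKS * BLOCK_WORDS
total_bytes = total_words * WORD_BITS // 8
# Upper bound per model if every word of RAM were given to the models (assumption).
words_per_model = total_words // WORD_MODELS

print(total_bytes, "bytes in total")
print(words_per_model, "32-bit words per Word Model at most")
```

Two blocks of 128 K 32-bit words therefore add up to exactly 1 MB, which leaves at most about 3.2 K words per Word Model under the even-split assumption.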
Figure 4. Modular solution for the Robust Speech Recognizer.
The example illustrates the modularity of the HBB. With minor changes in the configuration, two HBBs may be combined to implement a Noise-Robust Speech Recognizer. The first HBB (NC-HBB) is configured as a Noise-Canceling system. The Noise Canceling Algorithm is highly recursive and uses only the internal memory of the DSP; therefore, the static RAM may be eliminated from this HBB. The solid-contour blocks show the hardware physically present on each card, and the blocks not present are drawn with dashed contours. A second HBB is used for Speech Recognition (SR-HBB). The Pre-Conditioning Electronics and the Analog Interface may be removed from this board. Here the amount of memory present has a direct influence on the number of Word Models available for a given command set, so, depending on the application, as much memory as possible may be required. The addition of new NC-HBBs allows the concurrent operation of several noise cancelers supporting more conversational channels for other speakers, thus enabling the management of multiple-party Command Interfaces.
Figure 6.d. Power Spectrum of the Clean Speech Trace.
Figure 5. Communication strategy between the NC and the SR HBBs for Noisy Speech Recognition.
The two boards shown in Fig. 4 communicate under host control through the PCI bus. This exchange is organized as shown in Figure 5. The white-filled blocks designate the HBB Drivers, the gray-filled blocks the Noise Canceling and Speech Recognition microkernels, and the bar-filled boxes the intermediate buffers. On the NC-HBB, the input buffers are filled with input data under the control of the I-C (Initialize-Codec) function. The NC (Noise-Canceller) microkernel produces clean speech data, which are transferred by the RM-DSP (ReadMemory-DSP) function. The EPD (EndPointDetector) uses the average energy difference between clean and noisy speech to detect the speech segments, switching the Speech Recognition processes accordingly. On the SR-HBB, the WM-DSP (WriteMemory-DSP) function transfers clean speech data to the input buffers of the TE (Template Extractor), VQ (Vector Quantizer) and HMMP (HMM-Parser) for recognition. The resulting recognition data are transferred to the host by the interrupt-driven function R-DSP (Read-DSP). The One-Word Command Models used in the HMM-Parser may be changed dynamically using the WM-DSP (WriteMemory-DSP) function.
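The data flow just described can be mimicked with a toy host-side relay. Everything in this sketch is hypothetical: the classes `NcBoard` and `SrBoard`, the method names (chosen only to echo the RM-DSP, WM-DSP and R-DSP functions), the plain subtraction standing in for the NC microkernel, the power-ratio test standing in for the EPD, and the polled rather than interrupt-driven result read. Running the EPD on the host is also an assumption, since the paper does not state where it executes.

```python
from collections import deque


def frame_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


class NcBoard:
    """Hypothetical stand-in for the NC-HBB and its driver calls."""

    def __init__(self):
        self.input_buffer = deque()
        self.output_buffer = deque()

    def fill_input(self, noisy, noise):
        # On the real board the codec fills this buffer; here the host does.
        self.input_buffer.append((noisy, noise))

    def run_nc(self):
        # Placeholder for the NC microkernel: plain subtraction, not adaptive.
        noisy, noise = self.input_buffer.popleft()
        clean = [p - r for p, r in zip(noisy, noise)]
        self.output_buffer.append((clean, noisy))

    def rm_dsp(self):
        # ReadMemory-DSP: the host reads clean data back from the board.
        return self.output_buffer.popleft()


class SrBoard:
    """Hypothetical stand-in for the SR-HBB and its driver calls."""

    def __init__(self):
        self.input_buffer = []

    def wm_dsp(self, frame):
        # WriteMemory-DSP: the host writes clean speech into the input buffer.
        self.input_buffer.append(frame)

    def r_dsp(self):
        # Read-DSP: return a dummy result (number of frames in the segment).
        result = len(self.input_buffer)
        self.input_buffer = []
        return result


def host_relay(frames, nc, sr, ratio=2.0):
    """Relay clean frames NC -> host -> SR while the toy EPD flags speech."""
    results = []
    in_speech = False
    for noisy, noise in frames:
        nc.fill_input(noisy, noise)
        nc.run_nc()
        clean, noisy_back = nc.rm_dsp()
        # Toy EPD: speech is present when cancelling removed little power.
        speech = frame_power(noisy_back) < ratio * frame_power(clean)
        if speech:
            sr.wm_dsp(clean)
            in_speech = True
        elif in_speech:
            results.append(sr.r_dsp())   # segment ended: collect the result
            in_speech = False
    if in_speech:
        results.append(sr.r_dsp())
    return results


noise_only = ([1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0])
with_speech = ([6.0, 4.0, 6.0, 4.0], [1.0, -1.0, 1.0, -1.0])
segments = host_relay([noise_only, with_speech, with_speech, noise_only],
                      NcBoard(), SrBoard())
print(segments)
```

The sketch only shows the ordering of the transfers: data always pass through the host, and the recognizer is fed only while the endpoint detector reports speech.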
4. RESULTS AND DISCUSSION
The algorithms for Noise Canceling have been thoroughly simulated and tested; some results are shown in Figure 6 below.
Figure 6.a. Noisy Speech Trace (Primary Microphone).
Figure 6.b. Power spectrum of the Noisy Speech Trace.
Figure 6.e. Average Power of the Noisy (thin line) and Clean (thick line) Speech Traces (dB).
Figure 6.f. Average Power Difference between Noisy and Clean Speech traces (upper line). Adaptive threshold for begin-endpoint detection (middle line). Segmentation results (lowest and thickest line).
A pair of microphones separated by 20 cm was used in the experiment described. A noise source (loudspeaker) placed at a distance of 1.50 m from the reference microphone induced a noise level of between 85 and 100 dB on the reference microphone, consisting of strong car-engine bursts whose intensity changed with time. A speaker uttered the sequence of isolated one-word commands /left, right, up, down, go, stop/. The noisy speech and its Power Spectrum may be seen in Figs. 6.a and 6.b. The first four words can barely be observed, and the sudden and intense increase in noise power coinciding with the beginning of the word /go/ blurs the spectral structure of the last two words. In Fig. 6.c the resulting Speech Trace shows an average reduction of 10 dB after noise removal, and the speech bursts associated with the six words can now be appreciated. The corresponding power spectrum is shown in Fig. 6.d. Even the high-frequency burst corresponding to the initial fricative of the word /stop/ (last word) may be clearly distinguished in this case. In Fig. 6.e the upper (thinner) line shows the average power of the noisy trace (Fig. 6.a). The lower (thicker) trace shows the average power of clean speech. The peak-to-valley difference is now well above 20 dB. The difference in average power between both traces (noise reduction level) is plotted in Figure 6.f (upper line). This level drops in the presence of speech because the clean and noisy average powers then coincide, as speech is not removed. This fact may be used for speech segmentation (begin-end detection), as shown in the bottom (thick) line. The accuracy of this method is discussed in a related work [16].
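The segmentation idea can be sketched as follows. The frame power and the noise-reduction level follow the description above, but the threshold rule (a fixed fraction of a slowly decaying peak of the reduction level) is an assumption made only for this example; it is not the adaptive threshold of [16]. The function names, the frame length and the synthetic signals are likewise hypothetical.

```python
import math


def frame_power_db(signal, frame_len):
    """Average power per frame in dB (a small floor avoids log of zero)."""
    powers = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        p = sum(s * s for s in frame) / frame_len
        powers.append(10.0 * math.log10(p + 1e-12))
    return powers


def detect_speech(noisy, clean, frame_len=64, fraction=0.5, decay=0.99):
    """Flag frames where the noise-reduction level drops below a threshold.

    The threshold tracks a slowly decaying maximum of the reduction level;
    this tracking rule is an assumption, not the method of the paper.
    """
    noisy_db = frame_power_db(noisy, frame_len)
    clean_db = frame_power_db(clean, frame_len)
    flags = []
    peak = 0.0
    for n_db, c_db in zip(noisy_db, clean_db):
        reduction = n_db - c_db            # large when only noise was removed
        peak = max(reduction, peak * decay)
        flags.append(reduction < fraction * peak)
    return flags


# Synthetic traces: 8 frames of 4 samples, speech present in frames 4 and 5.
noise = [1.0, -1.0] * 16
speech = [0.0] * 16 + [4.0, 4.0, -4.0, -4.0] * 2 + [0.0] * 8
noisy = [s + v for s, v in zip(speech, noise)]
clean = [s + 0.1 * v for s, v in zip(speech, noise)]   # 20 dB of noise removed
flags = detect_speech(noisy, clean, frame_len=4)
print(flags)
```

In noise-only frames the reduction level stays near 20 dB in this synthetic case, while in speech frames it collapses to a fraction of a dB, so the speech frames are the ones flagged.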
Module               Task                       Execution Time (Cycles/Sec)  C31-50 Real Time %  C31-60 Real Time %
Noise Canceller      Noise Cancellation         23,748,000                   94.99               79.16
Parameter Extractor  Gradient Adaptive Lattice  5,595,000                    22.38               18.65
                     Levinson-Durbin            263,600                      1.05                0.88
Table 1. Computational performance of the algorithmic set on the TMS320C31 platforms.
The performance of the computing platform is given in Table 1. The third column gives the number of execution cycles required to compute each task. The fourth and fifth columns give the percentage of real time for the 50 and 60 MHz versions of the HBB. It may be seen that a 14-stage Noise Canceller consumes almost all the computational power of a single 50-MHz HBB, making it by far the most demanding section. Other expensive tasks are Gradient Adaptive Lattice and Formant Estimation. These three processes, which are devoted to increasing system robustness, require 117.76% of the 60-MHz HBB computing power, or almost 80% of the global computational complexity: the Noise Canceller to remove as much noise as possible, and the Parameter Extractor and Formant Estimator to create more robust template vectors for recognition.
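The real-time percentages in Table 1 can be reproduced from the cycle counts with a short sketch, under the assumption that one instruction cycle of the TMS320C31 takes two clock periods (25 and 30 million instruction cycles per second for the 50 and 60 MHz parts). The function and dictionary names are illustrative only; the cycle counts are those that survive in Table 1.

```python
CYCLES_PER_SECOND = {
    "Noise Cancellation": 23_748_000,
    "Gradient Adaptive Lattice": 5_595_000,
    "Levinson-Durbin": 263_600,
}


def real_time_percent(cycles_per_second, clock_hz):
    """Share of the DSP's instruction cycles consumed by a task, in percent."""
    instruction_rate = clock_hz / 2   # assumed: one instruction cycle = 2 clock periods
    return 100.0 * cycles_per_second / instruction_rate


for task, cycles in CYCLES_PER_SECOND.items():
    p50 = real_time_percent(cycles, 50e6)
    p60 = real_time_percent(cycles, 60e6)
    print(f"{task:26s} {p50:6.2f} {p60:6.2f}")
```

A task fits in real time on a single board only while the sum of its percentages with those of the other tasks on that board stays below 100, which is why the Noise Canceller is given a board of its own.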
5. ACKNOWLEDGMENTS
This work is being funded by grants TIC95-0122, TIC96-1889CE, ESPRIT IVORY No. 20277 and TIC97-1011.
6. REFERENCES
[1] S. Aguilera, M. A. Berrojo, F. M. Giménez, J. Colás, J. Macias, J. M. Montero, "Impaired Persons Facilities based on a Multi-Modality Speech Processing System", Proc. of the Speech and Lang. Techn. for Disabled Persons, Stockholm, Sweden, May 31-June 2, 1993, pp. 129-132.
[2] P. Gómez, D. Martinez, V. Nieto, V. Rodellar, "MECALLSAT: A Multimedia Environment for Computer-Aided Language Learning Incorporating Speech Assessment Techniques", Proc. of the Int. Conf. on Spoken Lang. Proc. ICSLP'94, Yokohama, Japan, Sept. 18-22, 1994, pp. 1295-1298.
[3] B. Z. Manaris and B. M. Slator, "Interactive Natural Language Processing: Building on Success", Computer, July 1996, pp. 28-32.
[4] P. Martin, F. Crabbe, S. Adams, E. Baatz and N. Yankelovich, "SpeechActs: A Spoken-Language Framework", Computer, July 1996, pp. 33-40.
[5] S. Oviatt, "User-Centered Modeling for Spoken Language and Multimodal Interfaces", IEEE Multimedia, Winter 1996, pp. 26-35.
[6] A. Waibel, "Interactive Translation of Conversational Speech", Computer, July 1996, pp. 41-48.
[7] J. J. Mariani, "Spoken Language Processing in Multimodal Communication", Proc. of the Int. Conf. on Speech Processing ICSP'97, Seoul, Korea, August 26-28, 1997, pp. 3-12.
[8] S. Furui, "Recent Advances in Robust Speech Recognition", Proc. of the ESCA-NATO Tutorial and Research Workshop on Robust Speech Rec. for Unknown Comm. Channels, Pont-à-Mousson, France, 17-18 April 1997, pp. 11-20.
[9] R. Martinez, A. Alvarez, V. Nieto, V. Rodellar and P. Gómez, "ASR in Highly Non-Stationary Environments using Adaptive Noise Canceling Techniques", Proc. of the ESCA-NATO Tutorial and Research Workshop on Robust Speech Rec. for Unknown Comm. Channels, Pont-à-Mousson, France, 17-18 April 1997, pp. 181-184.
[10] Project IVORY (Integrated Voice Recognition system), http://tamarisco.datsi.fi.upm.es/projects/IVORY/IVORY.html.
[11] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1996.
[12] R. Martinez, A. Alvarez, V. Nieto, V. Rodellar and P. Gómez, "Implementation of an Adaptive Noise Canceller on the TMS320C31-50 for Non-Stationary Environments", Proc. of the 13th Int. Conf. on Dig. Sig. Proc., Santorini, Greece, 2-4 July 1997, pp. 49-52.
[13] J. R. Deller, J. G. Proakis and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Pub. Co., Englewood Cliffs, NJ, 1993.
[14] M. J. F. Gales, ""NICE" Model-Based Compensation Schemes for Robust Speech Recognition", Proc. of the ESCA-NATO Tutorial and Research Workshop on Robust Speech Rec. for Unknown Comm. Channels, Pont-à-Mousson, France, 17-18 April 1997, pp. 55-64.
[15] P. Gómez, M. Pérez, N. Mayo, F. Rubio, A. Alvarez, R. Martinez, V. Rodellar, V. Nieto, U. Zangheri and P. Pisani, "Visual Representations of the Speech Trace on a Real Time Platform", 1997 Workshop on Sign. Proc. Systems, De Montfort University, Leicester, UK, November 3-5, 1997, pp. 283-292.
[16] R. Martinez, A. Alvarez, P. Gómez, M. Pérez, V. Nieto, V. Rodellar, "A Speech Pre-Processing Technique for End-Point Detection in Highly Non-Stationary Environments", Proc. of EUROSPEECH'97, Rhodes, Greece, 22-25 September 1997, pp. 1111-1114.