SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENT

Abdul Wahab and Tan Eng Chong
School of Applied Science, Nanyang Technological University, Nanyang Avenue, Singapore 639798
and
Hüseyin Abut
E.C.E. Department, San Diego State University, San Diego, CA 92182

Abstract

The increasing demand for mobile multimedia communication has prompted extensive studies on effective ways to communicate in the vehicular environment. The proposed system covers the analysis and cancellation or suppression of the various noises in a vehicular environment, ranging from engine noise and wind noise to road noise. In addition, it is critical to understand the echoes generated by the vehicle chamber. All these noises and unwanted signals degrade speech quality and reduce clarity even for uncompressed voice. Compressed speech uses the LPC algorithm in the vocoder, which worsens clarity further and can, in the worst case, render the speech totally unrecognisable. Hence, the intelligent dashboard would require microphone arrays and a multi-tasking system to handle the vast processing requirement. Numerous asynchronous scheduling tasks necessitate a re-configurable architecture. A cost-effective technology for this would be a powerful digital signal processing (DSP) sub-system equipped with the necessary support electronics.

1. Introduction

It has been a matter of public debate for some time that vehicles in the future will need to detect, process, and communicate significantly more information. This communication will take place between the vehicle and the driver, among the people in the vehicle, and between the vehicle and the outside world, including other vehicles, the road itself, and the Advanced Traffic Management and Information Systems (ATMIS). The driver and other passengers may want to communicate with the outside world verbally, or to have a conference call. These activities have traditionally been handled by car phones or short-wave radios, where the underlying signal is the band-limited voice-grade waveform. These signals are transmitted over a communication channel that is severely corrupted by echoes both in the transmission link and inside the chamber of the vehicle, by natural and man-made noise from numerous sources, and by interfering signals from other channels, passengers, and the audio information subsystem. It is commonly accepted that the next generation of car phones will be totally digital cellular and that the volume of applications will increase. However, a number of ills will not go away, and a speech processing system will be required to tackle them. Some of the tasks for this system will be noise suppression, echo cancellation, source localisation, speaker identification, speech coding, compression, and transmission by digital means.

We will briefly discuss the spectral dissection of various degradations in the vehicular environment. A proposed cost-effective model for the speech processing and communication system and the re-configurable digital signal processing concept will then be introduced. Finally, we conclude with a summary of the proposed system architecture.

2. Spectra of Vehicular Disturbances

In order to justify the various components of the proposed system, it is appropriate to observe visually the ills mentioned above and the vehicular echo problem. To study the problem carefully and to gather road data, we performed a field test. We equipped a compact van with a DAT tape recorder and a low-cost low-pass microphone. There were two passengers to act as interference sources in addition to the driver. We travelled along the city streets and two expressways in Singapore for a number of hours. The resulting database holds 40 minutes of recordings under 16 different experimental conditions. We captured most of the recordings onto a hard disk using the speech I/O unit of a digital signal processing development system. We sampled our data at a clock rate of 8,000 samples/s, which corresponds to the Nyquist rate after properly band-limiting the signal to the voice-grade service bandwidth of the next generation digital cellular phones.

In Figure 1, we present in two plots the spectrum of the engine noise while the vehicle is moving at a nominal speed of 60 km/h. The windows were rolled up and the chamber was quiet. There was no other vehicle in the vicinity and it was not possible to detect wind noise inside the vehicle. As can be seen from these two plots, the engine noise does not have any effect above 200 Hz. This should be very easily tackled by the enhanced speech processing and communication system proposed here.
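The band-by-band reading of the spectra can be illustrated with a small sketch. It is not the analysis procedure used for the recordings; it assumes NumPy, a Hann-windowed FFT, and a synthetic two-tone signal standing in for engine noise. The function name `band_energy_fractions` and the band edges (chosen to mirror the plotted ranges, with the 200 Hz split taken from the observation above) are illustrative assumptions.

```python
import numpy as np

FS = 8000  # samples/s, the sampling rate quoted in the text


def band_energy_fractions(x, fs=FS, bands=((0, 200), (200, 1000), (1000, 4000))):
    """Return the fraction of spectral energy falling in each band [lo, hi) in Hz."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(len(x))                    # reduce spectral leakage
    power = np.abs(np.fft.rfft(x * window)) ** 2   # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)    # bin centre frequencies in Hz
    total = power.sum()
    fractions = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        fractions.append(power[mask].sum() / total)
    return fractions


# Synthetic stand-in for engine noise (an assumption, not recorded data):
# two low-frequency tones, one second long.
t = np.arange(FS) / FS
engine_like = np.sin(2 * np.pi * 90 * t) + 0.5 * np.sin(2 * np.pi * 150 * t)
fractions = band_energy_fractions(engine_like)
print([round(f, 4) for f in fractions])
```

For a signal of this kind nearly all of the energy lands in the first band, which is the situation in which a fixed high-pass stage would be enough.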

Figure 1. Spectrum of the engine noise in frequency ranges 0-1,000 Hz and 200-1,000 Hz for a vehicle moving at 60 km/h (windows rolled up and quiet inside the chamber).

In Figure 2, we display the spectra in three plots in the frequency ranges 0-1,000 Hz, 200-1,000 Hz, and 1,000-4,000 Hz. In this case, the vehicle is stationary with the windows down. There was a heavy vehicle moving at about 50 km/h and the levels of the ambient road noise and the wind noise were rather significant. In addition to the very-low frequency components of the previous case representing the engine noise, we have two additional spectral regions to consider.

Figure 2. The spectrum of the stationary engine noise, ambient wind noise, and interferences from vehicles passing by. The windows are down and the speakers are silent. The frequency ranges are from 0-1,000 Hz, 200-1,000 Hz and 1,000-4,000 Hz, respectively.

As can be seen from the second plot of this figure, there is considerable energy in the frequency range between 200 and 400 Hz. We believe this comes from the ambient wind noise, the wind generated by vehicles passing by, and the road noise from tire friction on the pavement. Suppression of this degradation is not as simple as in the previous case, since it exhibits a slowly varying random behaviour. Nevertheless, a slowly adaptive filtering process should be able to minimise its effects. Noise components in the frequency range 1,000-4,000 Hz exhibit a widely spread coloured noise spectrum. Since this spectrum covers the complete speech frequency range, it is very difficult to tackle. Source localisation based on adaptive beamforming, followed by a trainable and quickly adapting estimation and cancellation scheme, will be needed to suppress the contributions from these sources.

Finally, in Figure 3, we display similar spectra under more severe conditions. This time, the vehicle is travelling at a speed of 60 km/h with windows rolled up; there are other vehicles passing by; the driver is trying to communicate and the two passengers keep talking. The first spectrum is very similar to the one in Figure 1. However, the noise in the low frequency range 200-400 Hz is drastically reduced in comparison to Figure 2.

Figure 3. The spectrum when the driver is trying to communicate and two passengers keep talking in a vehicle moving at 60 km/h with windows rolled up. As before, the frequency ranges are 0-1,000 Hz, 200-1,000 Hz and 1,000-4,000 Hz, respectively.

In Figure 3, it is possible to observe the formant structure of the speech. We believe this will be one of the most frequently encountered scenarios, and the speech enhancement task will be very demanding since all three speakers are talking and their acoustical echoes are riding on all the other ills. It is impossible to eliminate all the degradations completely in this case, but the advanced speech enhancement features of the proposed system should improve the quality of speech enough to permit uninterrupted communication.

3. The Enhanced Speech Processing and Communication System

The speech quality of the emerging totally digital cellular phones will depend to a great extent on the speech quality available at the near-end transmitter of the communication link. Despite this, most research efforts have been directed towards speech coding techniques, channel transmission issues of cellular telephony, and noise control and optimisation [1-4]. Very little research has been performed on the effects of ambient acoustical noise and the echoes in the vehicular environment. Throughout the world, a significant percentage of cellular phone users are in vehicular chambers (cars, trucks, buses, and public transportation systems) where degradations due to echoes, interferences, and various types of noise are severe. Recently, some research results which address some of these problems have been reported [4-10].

An ideal solution is an enhanced speech processing and communication system with a re-configurable and multi-tasking architecture. The system should be able to locate an intended speaker, cancel echoes generated inside the vehicle, combat various noise and jamming signals, and handle all the speech processing, compression, transmission, reception, and data and network communication tasks. In Figure 4, we present a block diagram of the proposed speech processing and communication system. Speech input to the system will be provided by a microphone array strategically positioned on the dashboard to capture various signals from speech, different types of noise, echoes and other interferences. The front-end CODEC will have a set of 16-bit analogue-to-digital (A/D) and digital-to-analogue (D/A) converters with a sampling rate between 8,000 and 10,000 samples per second.

Before any processing task, the system should be able to locate and identify the primary speaker. That is, the system must focus on its primary user. Speech from other people in the vehicle, sound from the hi-fi system, echoes, engine noise, road noise, wind noise, and noise from vehicles standing nearby or passing by will be considered unwanted input signals. Our objective is to eliminate them or, at least, to suppress them significantly. This, in turn, will improve the quality of the speech from the genuine user.

[Figure 4 block diagram labels: CODEC (16-bit A/D and D/A converters, sampling rate 8 to 10 ksamples/s); CANCELLER/ENHANCER; SPEECH CODER (8.0 VCELP, 4.8 CELP, 2.4 MELP, 32 Kb/s ADPCM, 16 Kb/s LD-CELP); TRANSMITTER/RECEIVER.]

Figure 4. The Block Diagram of the Proposed Speech Processing and Communication System

One of the most annoying impediments to speech quality in a vehicular chamber is the echo generated by the leakage of the far-end speaker. When the near-end speaker (i.e. the driver) or any of the passengers in the car speaks, this echo is mixed with his/her speech and transmitted as a composite signal. Thus, the first task of the proposed speech enhancement system is to adaptively cancel the echo during non-speech periods. However, its coefficients should not be updated when the near-end speaker speaks; that is, no adaptation is to be performed during near-end speech. This necessitates the inclusion of a near-end speaker activity detection mechanism. In our literature survey [4-10], we have noticed that some researchers have used a coefficient adaptation algorithm based on the least-mean-squared (LMS) error criterion for echo cancelling. Although very successful in echo cancellation, the basic LMS technique is not very effective in tackling other degradations.

Secondly, in the vehicular hands-free cellular communication framework, the engine, road, and wind noise components need to be considered. It has been observed that the degradation in the intelligibility and the general quality of the cellular speech due to these noise components is as disturbing as the echo discussed above. Hence, the second objective of the enhanced speech processing and communication system is to combat these imperfections of the cellular speech or data. Although there are some recent studies and analyses on the spectra of these noise sources [6-9], they are not directly applicable here since these noise sources have statistically different spectral behaviour. For instance, the engine noise is significantly correlated with the engine RPM and is therefore rather deterministic. On the other hand, the road and wind noises are stochastic in nature and spread over a frequency range.

The worst class of degradation is inter-speaker interference. In this case, the primary signal and the interfering signals have similar spectra, which makes the problem extremely difficult to tackle. This is the main reason why we propose the inclusion of speaker tracking and identification capabilities in this speech processing system.
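The echo-cancelling rule described above (adapt only while the near-end speaker is silent) can be sketched as follows. This is a minimal illustration, not the proposed implementation: it uses the normalised variant of LMS rather than the basic LMS mentioned in the survey, assumes NumPy, and takes the near-end activity flags as a given input instead of deriving them from a detector. The function name `lms_echo_canceller`, the tap count, the step size, and the synthetic echo path are all assumptions.

```python
import numpy as np


def lms_echo_canceller(far_end, mic, near_end_active, num_taps=32, mu=0.5, eps=1e-8):
    """Normalised-LMS echo canceller that freezes adaptation during near-end speech.

    far_end         : reference signal played by the loudspeaker
    mic             : microphone signal (echo plus any near-end speech)
    near_end_active : boolean flag per sample, True while the near-end speaker talks
    Returns the echo-reduced signal and the final filter weights.
    """
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    w = np.zeros(num_taps)      # adaptive estimate of the echo path
    buf = np.zeros(num_taps)    # most recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        error = mic[n] - w @ buf            # subtract the estimated echo
        out[n] = error
        if not near_end_active[n]:          # no adaptation while near end talks
            w += mu * error * buf / (buf @ buf + eps)
    return out, w


# Synthetic demonstration (all signals below are made up for illustration).
rng = np.random.default_rng(0)
far_end = rng.standard_normal(4000)
echo_path = np.array([0.0, 0.5, 0.3, -0.1])           # hypothetical cabin echo path
mic = np.convolve(far_end, echo_path)[:len(far_end)]  # echo only, no near-end speech
near_end_active = np.zeros(len(mic), dtype=bool)
residual, w = lms_echo_canceller(far_end, mic, near_end_active)
print(np.round(w[:4], 3))
```

In this noise-free setting the leading weights approach the hypothetical echo path. When the activity flags are all true the weights stay at their previous values, so the filter continues to subtract its last echo estimate without being disturbed by near-end speech.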

This last point, in particular, suggests a beamforming structure based on a microphone array followed by an adaptive filtering scheme. Beamforming techniques, which have found important applications in radar, sonar, radio astronomy, geophysics, and biomedical signal processing, appear to be a conceptually sound candidate for our speech enhancement task. The simplest form of beamforming is delay-and-sum beamforming, which compensates the delay of the target signal at each microphone and sums the signals in the beam so that the target signals have the same phase while the interfering signals exhibit different phases. Here we propose to use the delay-and-sum beamforming technique. First, it follows the genuine speaker, and then it adaptively cancels noises coming from the interfering speakers, the engine, the wind (especially critical when the windows are down), and the road noise coming from other vehicles and the road-tire friction. There are some studies on this method for speech recognition in a hands-free telephone set-up [4-10]. Figure 5 shows the structure of the proposed enhancer with the microphone array and the A/D converters (Dn+1) as the inputs. The output of the system is cleaned speech to be transmitted after compression.

[Figure 5 diagram labels: Genuine Speaker Tracker; microphones M1, M2, M3, ..., Mn+1, each followed by a block D1, D2, D3, ..., Dn+1 and a filter FIR1, FIR2, FIR3, ..., FIRn+1; two summing nodes (Σ); filter coefficient update; signals labelled "Speech + noise", "noise and other imperfection" (subtracted), and "Speech output".]

Figure 5. The Speech Enhancement Circuit.
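The delay-and-sum step described above can be illustrated with a short sketch. It covers only the alignment and averaging stage, not the speaker tracker or the adaptive FIR stage of Figure 5. It assumes NumPy, known non-negative integer sample delays for the target at each microphone, and synthetic signals; the function name `delay_and_sum` and all numeric values are illustrative assumptions.

```python
import numpy as np


def delay_and_sum(channels, delays):
    """Advance each channel by its target delay (in samples) and average.

    channels : array of shape (num_mics, num_samples)
    delays   : non-negative integer delay of the target at each microphone
    Samples shifted in from beyond the end of a channel are left at zero.
    """
    channels = np.asarray(channels, dtype=float)
    num_mics, num_samples = channels.shape
    out = np.zeros(num_samples)
    for channel, d in zip(channels, delays):
        aligned = np.zeros(num_samples)
        aligned[:num_samples - d] = channel[d:]   # undo the propagation delay
        out += aligned
    return out / num_mics


# Synthetic demonstration: one target tone reaching four microphones with
# different delays, plus independent noise at each microphone (made-up values).
rng = np.random.default_rng(1)
fs, n = 8000, 8000
target = np.sin(2 * np.pi * 300 * np.arange(n) / fs)
delays = [0, 2, 5, 7]
channels = np.array([
    np.concatenate([np.zeros(d), target[:n - d]]) + 0.5 * rng.standard_normal(n)
    for d in delays
])
enhanced = delay_and_sum(channels, delays)

valid = n - max(delays)   # region where every channel contributes
single_error = np.mean((channels[0][:valid] - target[:valid]) ** 2)
beam_error = np.mean((enhanced[:valid] - target[:valid]) ** 2)
print(single_error, beam_error)
```

Because the target adds coherently while the independent noise does not, the residual noise power after averaging is lower than at any single microphone. Correlated interference, such as a competing talker, would not be reduced this way without the adaptive stage.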

4. Re-configurable Digital Signal Processing

The above speech enhancement architecture requires a considerable amount of computation. Depending on the particulars of the actual speech/speaker detection circuitry, the beamformer, the adaptive filter banks, and the digital speech compression algorithm, we anticipate the overall computational complexity to be on the order of 35-40 million operations per second (MOPS) (see footnote 1). In particular, the 2,400 bits/s U.S. government standard MELP coder will require 22-25 MOPS [11-12]. The remaining 13-15 MOPS will be needed for all other tasks. This conservative figure should be sufficient since all tasks other than the speech compression will be performed in a re-configurable multi-tasking fashion (see footnote 2). We believe the Texas Instruments, Inc., TMS320C4X DSP hardware platform operating at 40 MHz should be able to handle all the computational needs.

In order to support a microphone array of six or more elements, we propose that the front-end audio input/output unit have eight multiplexed channels with an aggregate A/D rate of 200,000 Hz and a minimum of two output channels. Operation of the system will require a scheduler and a memo-passing facility so that information can be passed from one process to another. A memo in this case will consist of the type of processing requirement, the placement of data in memory and, of course, the originating and destination units.

Footnote 1: Here we use the term MOP in the framework of the Texas Instruments TMS320C4X DSP systems family.
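The memo described above can be pictured as a small record passed through a queue. The sketch below is only one way such a facility could look, written in Python for illustration rather than for the DSP target. The class names `Memo` and `Scheduler`, the field names, the extra `data_length` field, and the unit names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Memo:
    """One message between processing units (field names are assumptions)."""
    task_type: str      # type of processing requirement
    data_address: int   # placement of the data in memory
    data_length: int    # number of samples at that address (added for illustration)
    source: str         # originating unit
    destination: str    # destination unit


class Scheduler:
    """Minimal first-in, first-out memo dispatcher."""

    def __init__(self):
        self._queue = deque()
        self._handlers = {}

    def register(self, unit, handler):
        """Attach a callable that processes memos addressed to `unit`."""
        self._handlers[unit] = handler

    def post(self, memo):
        self._queue.append(memo)

    def run(self):
        """Dispatch memos until the queue is empty; return the task types handled."""
        handled = []
        while self._queue:
            memo = self._queue.popleft()
            reply = self._handlers[memo.destination](memo)
            handled.append(memo.task_type)
            if reply is not None:       # a handler may pass work to the next unit
                self._queue.append(reply)
        return handled


# Hypothetical pipeline: the codec hands a frame to the beamformer, which
# forwards the same buffer to the echo canceller.
scheduler = Scheduler()
scheduler.register(
    "beamformer",
    lambda m: Memo("echo_cancel", m.data_address, m.data_length,
                   "beamformer", "echo_canceller"),
)
scheduler.register("echo_canceller", lambda m: None)
scheduler.post(Memo("beamform", 0x1000, 160, "codec", "beamformer"))
handled = scheduler.run()
print(handled)
```

Only units that have a pending memo do any work in this arrangement, which is the property the re-configurable design relies on to keep the computational load within budget.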

5. Conclusions

In this study, we propose a working model for future dashboards in intelligent vehicles. The system includes a totally digital speech processing and communication system. Since it is a digital system, it can easily be reconfigured to work as an advanced packet data communication system including fax and electronic mail, voice mail and high-speed data transfer tasks. We have presented the enhanced speech communication sub-system and the source tracking and noise cancellation circuitry. However, we would like to emphasise that the proposed architecture and its components are to be accepted as models in transition. In other words, we will improve and appropriately modify the system as the technology in this field evolves.

6. References

[1] Thomas E. Miller and Jeffrey Barish, "Optimizing Sound for Listening in the Presence of Road Noise," The International Conference on Signal Processing Applications and Technology, ICSPAT '93, Santa Clara, Calif., USA, Sept. 28-Oct. 1, 1993, Vol. 1, pp. 97-106.
[2] Carlos R. Martins, Moises S. Piedade, INESC and Ceautl Lisboa, "Fast Adaptive Noise Canceller using the LMS Algorithm," The International Conference on Signal Processing Applications and Technology, ICSPAT '93, Santa Clara, Calif., USA, Sept. 28-Oct. 1, 1993, Vol. 1, pp. 121-127.
[3] Harrison, W. A., J. S. Lim and E. Singer, "A New Application of Adaptive Noise Cancellation," IEEE Trans. Acoust., Speech and Signal Processing, Vol. ASSP-34, No. 1, pp. 21-27, Feb. 1986.
[4] H. Olson, "Electronic Control of Noise, Vibration and Reverberation," J. Acoust. Soc. Am., Vol. 28, 1956, pp. 966-972.
[5] D. Messerschmitt, D. Hedberg, C. Cole, A. Haoui, and P. Winship, "Digital Voice Echo Canceller with a TMS320C20," in DSP Applications, K.-S. Lin, Ed., Prentice-Hall, 1987.
[6] S. Oh, V. Viswanathan, and P. Papamichalis, "Hands-Free Voice Communication in an Automobile With a Microphone Array," Proc. IEEE ICASSP-92, pp. I281-284, San Francisco, CA.
[7] I. Claesson, S.E. Nordholm, B.A. Bengtsson, and P. Erickson, "A Multi-DSP Implementation of a Broad-Band Adaptive Beamformer for Use in a Hands-Free Mobile Radio Telephone," IEEE Trans. on Vehicular Technology, Vol. 40, pp. 194-201, Feb. 1991.
[8] L.J. Griffiths and C.W. Jim, "An Alternative Approach to Linearly Constrained Adaptive Beamforming," IEEE Trans. on Antennas Propag., Vol. AP-30, pp. 27-34, January 1982.
[9] E. Arkan, "Echo and Road Noise Cancellation in Digital Cellular Telephone," M.S. Thesis, San Diego State University, Spring 1994.
[10] E. Arkan, H. Abut, S. Pelling, fj. harris, and G.C. Marques, "Implementation of a 5.0 KB/s Coder for Vehicular Applications: Part II: Acoustic Echo and Noise Canceller," Proc. of ASILOMAR-1993 Conf. on Sig., Sys. & Computers, pp. 776-780, IEEE Computer Society Press, 1993.
[11] J. Tardelli, Chair, "US DoD Selection of 2400 BPS Standard," Special Session SPEC3, Proceedings of IEEE ICASSP-96, pp. 1137-1164, May 1996, Atlanta, GA.
[12] A. McCree, K. Truong, E.B. George, T.P. Barnwell, III and V. Viswanathan, "A 2.4 KBIT/S MELP Coder Candidate for the new U.S. Federal Standard," Proceedings of the IEEE ICASSP-96, May 1996, Atlanta, GA.

Footnote 2 (to Section 4): It should be easy to guess that the computational complexity would increase enormously if the architecture did not have re-configurability. That is, the overall computational load would be unacceptably high if the algorithms and circuits for all tasks were kept running at all times.