
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 7, JULY 2007

Low-Power Dual-Microphone Speech Enhancement Using Field Programmable Gate Arrays David Halupka, Member, IEEE, Alireza Seyed Rabi, Parham Aarabi, Senior Member, IEEE, and Ali Sheikholeslami, Senior Member, IEEE

Abstract—This paper discusses two implementations of a dual-microphone phase-based speech enhancement technique. The implementations compared are based on a field-programmable gate array (FPGA) and a digital signal processor (DSP). The time-varying, frequency-dependent phase difference between the two incoming sound signals is used to mask each frequency component, thereby reducing noise by minimizing the contributions of signal frequency components that have a low signal-to-noise ratio (SNR). Phase-based filtering can achieve high SNR gains with just two microphones, making it ideal for small devices that lack the room for a multimicrophone array. Moreover, these devices often have a limited battery life and lack the processing power needed for software-based speech enhancement. This paper presents an FPGA-based dual-microphone speech enhancement implementation, which was designed specifically for low-power operation. This implementation is compared with an off-the-shelf DSP implementation, with respect to processing capabilities and power utilization.

Index Terms—Digital signal processors, field-programmable gate arrays (FPGAs), low-power systems, sound localization, speech enhancement, speech processing.

I. INTRODUCTION

SPEECH recognition systems do not work effectively in practical environments where noise, background conversations, or reverberations corrupt the signal from the speaker of interest. This shortcoming has fueled research in the area of speech enhancement [1]–[3]. Microphone-array-based speech enhancement techniques, in particular, have received much interest because of their potential for significant noise reduction. Various microphone-array-based speech enhancement techniques have been proposed, including beamforming [4], superdirective beamforming [5], [6], postfiltering [7], [8], and phase-based filtering [1], [9]–[11]. Of these techniques, phase-based filtering has shown a great deal of promise. Phase-based filtering is a form of time-frequency masking; we will refer to it as PBTFM in this paper. PBTFM has been

Manuscript received June 29, 2005; revised August 12, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Shuvra S. Bhattacharyya. This work was supported by grants and funding from the National Research Council of Canada and the Government of Ontario. The authors would also like to thank Altera for their financial and technical support. D. Halupka, P. Aarabi, and A. Sheikholeslami are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: [email protected]; [email protected]; [email protected]). A. S. Rabi was with the Department of Engineering Science, University of Toronto, Toronto, ON, Canada. He is now with The Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TSP.2007.893918

shown to achieve recognition gains of 28.9% over the single-channel noisy signal, 22.0% over superdirective beamforming, and 8.5% over postfiltering [2] with two microphones in a reverberant environment (with a reverberation time of 0.1 s). However, PBTFM requires that the speaker's position be known. A popular method for determining the speaker's position is to estimate the time delay of arrival (TDOA) of the speaker's signal between a microphone pair [12]. However, during the testing of speech enhancement techniques, the location of the speaker is manually specified. Manual specification of the speaker's position is not practical in real-world situations, as the user's position is not going to be fixed or known a priori. Furthermore, it would be desirable for the system to operate with as little user interaction as possible. Hence, a complete multimicrophone system must use an automatic and reliable method of determining the speaker's position. However, estimating the position of the speaker requires an exhaustive linear search, as the likelihood of each possible speaker location must be systematically evaluated. Hence, the TDOA algorithm is computationally expensive. For example, a 1-GHz Intel Pentium III processor is needed to perform real-time TDOA estimation for a 20-kHz dual-microphone audio stream, which is processed in half-overlapping 1024-sample (51.2-ms) segments. In terms of the number of floating-point operations, a 1024-sample fast Fourier transform (FFT) requires approximately 102 400 floating-point operations, and the FFT of each microphone signal is required. The number of calculations required for the cross-correlation search depends on the chosen algorithm, the intermicrophone distance, and the required localization accuracy. For the following analysis, we use the least computationally expensive method, an intermicrophone separation of 20 cm, and a localization accuracy of 0.125 ms.
This results in approximately 2.3 million additional floating-point operations for the exhaustive search, for an overall total of approximately 2.4 million operations every 25.6 ms; almost 100 million floating-point operations per second. Hand-held devices, which would benefit most from speech recognition, do not have the equivalent processing power of a Pentium III processor. Users will also not accept speech recognition systems that significantly reduce the battery life of their devices. As a result, there is a definite need for power-efficient implementations of sound localization, speech enhancement, and speech recognition algorithms. This paper presents a field-programmable gate array (FPGA) implementation of phase-error-based filtering and sound localization specifically designed for low power consumption. The presented filtering and localization system is implemented and tested on a Stratix EP1S40F780C5 40 000-logic-element



device. This implementation is compared with an off-the-shelf digital signal processor (DSP) implementation based on Freescale Semiconductor's DSP56858. The DSP implementation is meant to provide a point of comparison for the FPGA results. Section II gives a brief overview of the microphone-array-based speech enhancement techniques discussed in this paper. Section III describes the FPGA implementation and the power-saving optimizations utilized. Section IV outlines the DSP-based implementation and the tradeoffs necessary to achieve real-time operation. Finally, Section V compares the two implementations in terms of postfiltering signal-to-noise ratio (SNR), signal processing capabilities, and power utilization.
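The workload estimate in Section I can be sanity-checked in a few lines. The constants below are the paper's own figures (one 1024-point FFT at roughly 102 400 floating-point operations, roughly 2.3 million operations for the exhaustive search, and one new half-overlapping segment every 25.6 ms), so this is a sketch of the arithmetic rather than a measurement:

```python
# Sanity check of the Section I workload estimate, using the paper's figures.
fft_flops = 102_400          # stated cost of one 1024-point FFT
search_flops = 2_300_000     # stated cost of the exhaustive cross-correlation search
total_per_segment = 2 * fft_flops + search_flops   # one FFT per microphone + search
hop_seconds = 25.6e-3        # a new half-overlapping segment arrives every 25.6 ms
mflops = total_per_segment / hop_seconds / 1e6
print(round(mflops))         # ~98 MFLOPS, i.e. "almost 100 million" operations/s
```

The result is consistent with the "almost 100 million floating-point operations per second" figure quoted above.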


II. ARRAY-BASED SPEECH PROCESSING

This section provides a brief overview of the microphone-array-based algorithms discussed in this paper.

A. Time Delay of Arrival Estimation

Sound localization is performed by utilizing a microphone array that observes a sound source from different positions in an environment. Two microphones are capable of constraining a sound source's position to a hyperboloid in three dimensions. Triangulation, amongst multiple microphone pairs, will provide an estimate of the speaker's three-dimensional position. A simple model for the signals observed by two microphones in a nonreverberant environment is given as

$$x_1(t) = s(t) + n_1(t) \qquad (1a)$$
$$x_2(t) = s(t - \tau) + n_2(t) \qquad (1b)$$

where $s(t)$ is the signal source of interest, and $n_1(t)$ and $n_2(t)$ are used to model microphone noise, environmental noise, and possibly signals from other speakers. Attenuation of $s(t)$ has been neglected, which is a reasonable assumption when the intermicrophone distance is much smaller than the source-to-microphone distance. The goal of sound localization is to use the observed signals $x_1(t)$ and $x_2(t)$ to deduce $\tau$, the TDOA of the source between the two microphones. Typical acoustic environments are reverberant and therefore cause signal-correlated noise. This more complex environment can be modeled as

$$x_1(t) = h_1(t) * s(t) + n_1(t) \qquad (2a)$$
$$x_2(t) = h_2(t) * s(t) + n_2(t) \qquad (2b)$$

where $h_1(t)$ and $h_2(t)$ are the impulse responses of the environment with respect to each microphone's position. Noise that is not due to signal reverberation is modeled by $n_1(t)$ and $n_2(t)$. TDOA estimates are computed based on short-duration signal segments to allow for the tracking of a moving sound source. Moreover, in the case of multiple speakers, it is likely that not all speakers will be talking at the same time; small signal segments can therefore sometimes capture only one active speaker, thereby reducing the amount of noise present.

In the frequency domain, the short-duration microphone signals can be represented as

$$X_1(\omega) = S(\omega) + N_1(\omega) \qquad (3a)$$
$$X_2(\omega) = S(\omega)e^{-j\omega\tau} + N_2(\omega) \qquad (3b)$$

where all upper-case terms are the frequency representations of their respective lower-case time-domain signals. A common TDOA estimation method makes use of the unfiltered cross correlation (UCC) [13] between the two microphone signals. The TDOA estimate, $\hat{\tau}$, can be calculated by finding the delay (phase shift) that maximizes the cross correlation between $X_1(\omega)$ and $X_2(\omega)$, as given by

$$\hat{\tau} = \arg\max_{\tau'} \int X_1(\omega)X_2^*(\omega)e^{-j\omega\tau'}\,d\omega \qquad (4)$$
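The basic UCC search can be sketched in the time domain on a toy signal. `ucc_tdoa` is our name for this hypothetical helper, and only integer lags are searched here; the paper's frequency-domain formulation also resolves fractional delays:

```python
def ucc_tdoa(x1, x2, max_lag):
    """Unfiltered cross correlation: return the integer lag d maximizing
    sum_n x1[n] * x2[n + d] (x2 lags x1 by d samples at the peak)."""
    best_lag, best_score = 0, float("-inf")
    n = len(x1)
    for d in range(-max_lag, max_lag + 1):
        score = sum(x1[i] * x2[i + d] for i in range(n) if 0 <= i + d < n)
        if score > best_score:
            best_score, best_lag = score, d
    return best_lag

# x2 is x1 delayed by 3 samples: x2[n] = x1[n - 3].
x1 = [0, 0, 1, 4, 2, -1, 0, 0, 0, 0, 0, 0]
x2 = [0, 0, 0, 0, 0, 1, 4, 2, -1, 0, 0, 0]
print(ucc_tdoa(x1, x2, 5))  # -> 3
```

The exhaustive loop over candidate delays is exactly the linear search whose cost motivates the optimizations of Section III.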

Frequency-based cross correlation, unlike time-based cross correlation, allows for fractional, intersample delay resolution. In practice, unfiltered cross correlation does not perform well in reverberant environments [12], [14]. As such, a frequency-dependent weighting factor $W(\omega)$ is typically introduced into the cross correlation. The weighting factor is chosen in order to make the TDOA estimates less sensitive to noise and/or reverberations. Frequency-weighted cross correlation is defined as

$$\hat{\tau} = \arg\max_{\tau'} \int W(\omega)X_1(\omega)X_2^*(\omega)e^{-j\omega\tau'}\,d\omega \qquad (5)$$

where two example weighting functions are

$$W_{\mathrm{ML}}(\omega) = \frac{|X_1(\omega)||X_2(\omega)|}{|N_1(\omega)|^2|X_2(\omega)|^2 + |N_2(\omega)|^2|X_1(\omega)|^2} \qquad (6a)$$
$$W_{\mathrm{PHAT}}(\omega) = \frac{1}{|X_1(\omega)X_2^*(\omega)|} \qquad (6b)$$

denoting maximum-likelihood (ML) and phase transform (PHAT) weighting, respectively. The ML weight is useful if the noise spectra are known a priori, which is seldom the case in practice. For reverberant environments, the PHAT weight has been demonstrated to be most suitable [12], [14] because the signal spectrum is normalized (whitened). In a purely reverberant environment, the signal-to-reverberant-noise ratio is similar over each microphone's frequency spectrum. Thus, unfiltered cross correlation places undue emphasis on dominant frequency components for an environmental model in which all frequency components have an equal noise-to-signal ratio. PHAT TDOA estimation can be mathematically simplified to

$$\hat{\tau} = \arg\max_{\tau'} \int \cos\!\big(\angle X_1(\omega) - \angle X_2(\omega) - \omega\tau'\big)\,d\omega \qquad (7)$$

where $\angle$ denotes the phase angle of its argument. Note that only the phase of each frequency component is used to estimate


Fig. 1. Two-microphone speech enhancement system block diagram, showing the relationship between the sound localization and speech enhancement subsystems.

Fig. 2. Block diagram for proposed hardware architecture to be used to compute TDOA estimates.

the TDOA because the amplitude of each frequency component is made to be unity by $W_{\mathrm{PHAT}}(\omega)$. Once the location of the speaker is known, speech enhancement is possible.

B. Phase-Based Time-Frequency Masking

Phase-based time-frequency masking is based on the following simple observation. Under ideal conditions (no noise, no reverberations, and a single speaker), the microphone signals observed during each fixed time interval will be related by

$$X_1(\omega) = X_2(\omega)e^{j\omega\tau} \qquad (8)$$

where $\tau$ represents the speaker's location, that is, the intermicrophone TDOA of the speaker's produced signal. However, in noisy and/or reverberant environments, the two microphone signals are no longer strictly related, and the resulting phase error is given by

$$\theta(\omega) = \angle X_1(\omega) - \angle X_2(\omega) - \omega\tau. \qquad (9)$$

Work by [2] showed that $\theta(\omega)$ is proportional to the amount of noise and reverberation corrupting the desired signal. Hence, the phase error $\theta(\omega)$ can be used to construct a time-varying filter that attenuates the amplitude of signal frequencies that have a high phase error [1]. The time-varying phase-error-based filter proposed by [1] is

$$Y(\omega) = M(\omega)X_1(\omega) \qquad (10)$$

with the frequency-dependent weight $M(\omega)$ given by (11), a function that decays as $|\theta(\omega)|$ grows. It is assumed that $\theta(\omega)$ is wrapped to lie within $\pm\pi$. The term $\gamma$ in (11) is an adjustable parameter that controls the aggressiveness of the filter. In low-SNR conditions, a high value of $\gamma$ is favorable, whereas in high-SNR conditions, a low value of $\gamma$ is favorable, as a high value of $\gamma$ will actually corrupt the signal of interest [1], [2]. It was shown by [9] that $\gamma = 5$ results in good speech enhancement; the results presented in this paper utilize

Fig. 3. Block diagram for proposed hardware architecture to be used to perform modified phase-based time-frequency masking.

this fixed value of $\gamma = 5$. However, this parameter can be adjusted via configuration pins or a programmable register.

PBTFM is a time-varying filter that operates on short-duration signal segments. It is possible that each segment, after filtering, will not align properly with the previously processed segment at the segment boundaries, thereby introducing high-frequency distortions. To overcome this shortcoming, we process half-overlapping signal segments that are windowed using a Hanning window, thereby providing smooth intersegment signal transitions.

III. FPGA IMPLEMENTATION

A simplified block diagram for a dual-microphone sound localization and speech enhancement system is shown in Fig. 1. In this figure, the enhanced speech signal is shown being used for speech recognition, but it can also be used for other applications, such as a front end for telephony. Note that the sound localization subsystem provides location information directly to the speech enhancement block. In practice, instantaneous TDOA estimates would need to be filtered or averaged over a finite length of time, as they tend to be error prone in low-SNR conditions. Moreover, if there are conflicting conversations, the TDOA estimates would alternate between different speakers, depending on which speaker is active during each time segment. Therefore, in general, some form of user feedback or tracking algorithm would be utilized so that the user's speech, rather than noise, is localized and enhanced.
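The per-bin masking of Section II-B can be sketched as follows. The specific decay $1/(1+\gamma\theta^2)$ is an illustrative stand-in for the paper's (11) (any mask that decays with the magnitude of the wrapped phase error behaves similarly), with the paper's $\gamma = 5$; `wrap_phase` and `phase_error_mask` are our names:

```python
import math

def wrap_phase(theta):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))

def phase_error_mask(phi1, phi2, omega_tau, gamma=5.0):
    """Attenuation weight for one frequency bin: the wrapped phase error of (9),
    pushed through an illustrative decaying mask 1/(1 + gamma * theta^2)."""
    theta = wrap_phase(phi1 - phi2 - omega_tau)
    return 1.0 / (1.0 + gamma * theta * theta)

print(phase_error_mask(0.5, 0.5, 0.0))                 # no phase error -> 1.0
print(round(phase_error_mask(math.pi, 0.0, 0.0), 3))   # large error -> 0.02
```

Bins whose phase difference matches the expected $\omega\tau$ pass through unchanged, while bins dominated by noise or reverberation are strongly attenuated.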


Fig. 4. Block diagram for proposed hardware architecture to be used to perform sound localization and speech enhancement.

Fig. 5. FFT twiddle multipliers: (a) Lifting-based multiplier; (b) complex number multiplier.
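The lifting-based twiddle multiplier of Fig. 5(a) realizes a complex rotation as three shear (lifting) steps, i.e., three multiplications and three additions instead of the four multiplications and two additions of a direct complex multiply. A floating-point sketch of the standard shear decomposition follows (`lifting_rotate` is our name; the integer FFT of [16] applies the same structure with rounding at each lifting step, which is what makes it invertible):

```python
import math

def lifting_rotate(re, im, theta):
    """Rotate (re + j*im) by theta via three lifting steps:
    shear by a = -tan(theta/2), lift by sin(theta), shear by a again."""
    a = -math.tan(theta / 2.0)
    re = re + a * im                  # multiply 1, add 1
    im = im + math.sin(theta) * re    # multiply 2, add 2
    re = re + a * im                  # multiply 3, add 3
    return re, im

# Rotating 1 + 0j by 90 degrees gives (approximately) 0 + 1j.
x, y = lifting_rotate(1.0, 0.0, math.pi / 2.0)
print(round(x, 9), round(y, 9))  # -> 0.0 1.0
```

Because each lifting step can be inverted exactly by subtracting the same product, rounding inside the steps does not destroy invertibility, unlike rounding after a direct complex multiplication.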

A. Sound Localization

The PHAT TDOA application-specific integrated circuit design proposed in [15] used the system architecture shown in Fig. 2. Not shown in this diagram are the memories required for signal buffering and for the storage of intermediate results. The system in [15] also computed TDOA estimates based on nonoverlapping finite-duration signal segments.

B. Speech Enhancement

Phase-error-based speech enhancement, as described in Section II, requires a hardware architecture similar to that shown in Fig. 3. Half-overlapping finite-duration signal segments are buffered from the two microphones. Once a complete signal segment is buffered, it is multiplied by a Hanning window. The windowed time-domain signals are then transformed to the frequency domain via the FFT. Phase-based speech enhancement uses the phase information between the two signals' spectra to change the magnitude of one of the signals. A phase-based filter is easily described and implemented in polar representation. Hence, prior to filtering, each signal's spectrum is first converted from Cartesian to polar representation. Given the speaker's location, a phase-based filter is constructed and applied to one of the input signals. The filtered signal is then converted back to the time domain and reconstructed into a nonsegmented signal by the summation of half-overlapping processed signal segments.

C. Integrated Sound Localization and Speech Enhancement

Many of the blocks shown in Figs. 2 and 3 are similar. Specifically, the front-end signal processing required for the standalone sound localization and speech enhancement implementations is almost identical. Therefore, a combined hardware architecture is beneficial, as resources can be saved. Fig. 4 shows the

functional block diagram of the proposed FPGA system. Note that this system applies the phase-error filter to both microphone signals and then delay-and-sums the filtered signals. For the proposed integrated implementation, the dual-microphone signals are processed in half-overlapping 51.2-ms (1024-sample) segments. Therefore, a minimum of 1024 × 8 bits of memory is required to buffer each microphone channel before processing. Once every 25.6 ms (51.2 ms / 2), the input buffers are copied into the processing storage to be processed. In order to minimize time and resources, the Fourier transforms of the two real signals are computed concurrently: a synthetic complex signal is constructed so that one microphone signal forms its real component and the other microphone signal forms its imaginary component. The two transformed microphone signals can then be extracted from the transformed synthetic signal by using the real and imaginary conjugate symmetries of the Fourier transform. An integer fast Fourier transform [16] algorithm is used to perform an invertible discrete Fourier transform. This version of the FFT algorithm uses lifting operations to perform twiddle-factor multiplication [Fig. 5(a)] as opposed to using a complex number multiplier [Fig. 5(b)]. It has been shown by [16] that this method of computing the FFT yields a perfectly invertible Fourier transform; that is, the transform does not introduce noise as a result of finite-precision arithmetic. As a side benefit, the lifting-based multiplier uses only three multiplications and three additions, rather than the four multiplications and two additions required by a typical complex number multiplier. Cartesian-to-polar and polar-to-Cartesian coordinate conversion can be performed using the CORDIC algorithm [17]–[20]. A general-purpose CORDIC implementation can convert between the two coordinate systems depending on how it is configured. Instead, we used two specialized CORDIC units, which


Fig. 6. The solid lines show piecewise cosine approximations, overlaid with a dotted ideal cosine function: (a) rectangular; (b) three-level cosine; (c) seven-level cosine; and (d) 15-level cosine.

we were able to carefully optimize, as each performs only one-way coordinate conversion. The frequency-dependent intermicrophone phase information is used to estimate the TDOA and also to construct the PBTFM filter. Each microphone signal is filtered using the PBTFM filter, and the two signals are delayed-and-summed together. After the inverse FFT is performed on the resulting signal, the processed data is half-overlapped and summed with the previously processed samples, which also need to be buffered. Typically, the PBTFM mask need only be applied to one of the input channels to obtain the filtered signal. However, in this implementation, we chose to apply the mask to both channels and then delay-and-sum the resulting signals. Delay-and-summing does not result in a significant SNR improvement, but it makes it possible to use multiple instances of this design as a simple method of multimicrophone PBTFM; a formal method of multimicrophone PBTFM is presented in [3], [21]. The proposed multimicrophone filtering is achieved

by allowing the user to specify both the intermicrophone and intramicrophone TDOA. That is, the filtered signal is given by (12), where $\tau$ is the intermicrophone TDOA, and $\tau_a$ is the phase shift required to phase-align signals between different microphone pairs. An Altera FPGA housed on a Microtronix development kit is utilized as the target device. The FPGA system runs on a 6-MHz clock. Analog-to-digital conversion is provided by two 8-bit National Semiconductor ADC08831 converters, with a sampling rate of 20 kHz.

D. TDOA Estimation Optimizations

The discrete-time equivalent of (7) is

$$\hat{d} = \arg\max_{d} \sum_{k} \cos\!\big(\angle X_1[k] - \angle X_2[k] - 2\pi k d/N\big) \qquad (13)$$


TABLE I TDOA POWER CONSUMPTION


TABLE II PERCENTAGE OF ABNORMAL TDOA ESTIMATES

The ideal cosine is represented using a 10th-order Taylor expansion that was 16-bit accurate.

where $k$ is the discrete frequency index, and $N$ is the number of samples being processed. The discrete-time TDOA search is the most computationally expensive and power-consuming operation in this design. For example, searching over all possible delays between −32 and +32 samples, with a resolution of 1/8 sample, requires approximately 130 000 evaluations of the cosine function. The sound localization implementation presented in [15] uses just such an approach and consequently uses 28.98 mW of power. There are three commonly used methods of evaluating trigonometric functions in hardware: the CORDIC algorithm, Taylor expansion, and parametric curve fitting. Work by [22] showed that the PHAT cosine can be approximated by a rectangular function without a significant reduction in TDOA estimation accuracy. In this paper, various quantizations (approximations) of the cosine function were studied; a sample of the promising ones is shown in Fig. 6. Table I shows the average power required to perform TDOA estimation using the different approximations. Power estimates are given for a 0.18-µm CMOS process available through TSMC. The power measurements are obtained using Synopsys design tools and are based on the average capacitive switching activity for each node in the circuit.1 Each design evaluated was synthesized to target a 0.18-µm Artisan gate-level cell library. Each cosine approximation was evaluated in terms of its accuracy in localizing a recording of a male speaker in a noisy and reverberant environment. Approximately 60 s of recorded speech was used for these tests; the signal to be localized was placed at four different positions about the microphone pair on a semicircular arc. The noise source was placed directly in front of the microphone pair. Localization accuracy was measured for four different SNR conditions. Table II summarizes the localization accuracy of each function tested. The results are presented as the percentage of location estimates that were off by more than 2 samples from their expected value, averaged over the four signal source locations tested. The seven-level approximation to the cosine function is used in this implementation, as it offers good localization accuracy for the amount of power utilized.
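The effect of quantizing the cosine can be sketched end to end on a toy PHAT search in the form of (13). The uniform level placement in `quantized_cos` is our guess at Fig. 6's quantizers, and all function names are ours; the point is only that a coarse cosine still localizes to within the 1/8-sample search step:

```python
import cmath, math

def dft(x):
    """Direct O(N^2) DFT; stands in for the implementation's FFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def quantized_cos(theta, levels):
    """Piecewise-constant cosine with an odd number of uniform levels in [-1, 1]."""
    step = 2.0 / (levels - 1)
    return step * round(math.cos(theta) / step)

def phat_tdoa(x1, x2, max_lag, cos_fn=math.cos):
    """Discrete PHAT search: maximize sum_k cos_fn(angle(X1[k]) - angle(X2[k])
    - 2*pi*k*d/N) over candidate delays d in 1/8-sample steps."""
    N = len(x1)
    X1, X2 = dft(x1), dft(x2)
    lags = [d / 8.0 for d in range(-8 * max_lag, 8 * max_lag + 1)]
    def score(lag):
        return sum(cos_fn(cmath.phase(X1[k]) - cmath.phase(X2[k])
                          - 2.0 * math.pi * k * lag / N)
                   for k in range(1, N // 2))
    return max(lags, key=score)

# Toy broadband signal; x2 is x1 circularly delayed by 2 samples, so the
# DFT phase model is exact.
N = 32
x1 = [float((2 * i) % 13 - 6) for i in range(N)]
x2 = [x1[(i - 2) % N] for i in range(N)]
exact = phat_tdoa(x1, x2, 4)                                  # ideal cosine
coarse = phat_tdoa(x1, x2, 4, lambda t: quantized_cos(t, 7))  # seven-level cosine
print(exact, abs(coarse - 2.0) <= 0.125)
```

Because the quantized cosine produces ties among neighboring candidate delays, the coarse search may land one 1/8-sample step away from the true delay; deviations this small are well inside the 2-sample criterion used for Table II.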

TABLE III WORD ERROR RATES FOR A SPEAKER IN A NOISY AND REVERBERANT ENVIRONMENT USING MICROSOFT'S ENGLISH RECOGNIZER v6.0 (γ = 5)

Phase-error-based time-frequency masking and beamforming are straightforward operations. However, division, or

equivalently inversion followed by multiplication, is needed to scale the magnitudes of different signal frequency components. Division is typically an expensive operation to perform in hardware, especially for fixed-point operands. For example, a 32-bit divider is necessary to perform 16-bit fixed-point division and obtain a nontruncated 16-bit quotient. Our implementation performs PBTFM without the use of division and without any significant loss in perceptual signal quality or SNR gain. This approach is based on the following observation. Each microphone's phase will, unavoidably, have some degree of error. This error is a result of windowing and of the limited numerical precision used to sample and process the signal. Therefore, a curve that approximates the PBTFM function can be used to perform phase-based speech enhancement instead of using the smooth curve described by (11) directly. Through comparative2 experiments, qualitative listening tests, and quantitative experiments using Microsoft's English Recognizer v6.0 speech recognition system, we found that a step approximation (0.05-rad steps) to the PBTFM masking function yields good enhancement results. Table III shows speech recognition results based on filtered signals obtained from the proposed PBTFM approximation and from the standard PBTFM approach. The noisy signals were obtained by mixing the desired signal and the noise signal in MATLAB at different SNRs; both signals were recordings of human speakers. The test utterances used to train the speech recognition system were used as the desired signal. We used approximately three minutes' worth of speech, containing 487 words. The speech recognizer's capability to train on the fly and adapt to adverse noise conditions was disabled for these tests. It is interesting to note that the MATLAB version of the algorithm, in some cases, performs worse than the proposed hardware approximation to the algorithm.
This does not mean that the algorithm is flawed. In fact, better SNR performance is a

1Synopsys power measurements were verified to be within 5% of power estimates reported by a SPICE circuit simulation tool. Estimates in this paper are meant to serve as a method of evaluating the relative power consumption of the different designs.

2The speech enhancement results presented are meant to be purely comparative. We are not evaluating the performance of the speech recognition system. We only wish to show that there are no abnormal recognition or SNR improvements/losses as a result of our PBTFM approximations.

E. Phase-Error-Based Filter Optimizations

3532

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 7, JULY 2007

TABLE IV DESIGN PLACEMENT SUMMARY

result of the leniency of the approximated algorithm toward frequencies that have a low phase error. The PBTFM filter of (11) begins to attenuate frequencies with a relatively small phase error rather quickly. The proposed approximation, however, treats all frequencies within a relatively small phase-error region equally. Specifically, frequencies with phase errors relatively close to zero are not attenuated, so the proposed filter does not damage the signal of interest, which is affected only by small phase errors. For the implementation of the approximated algorithm, the step function can easily be stored in a lookup table, owing to our quantization of $\theta(\omega)$. However, multiplication is still required to scale the magnitudes of each microphone channel. To further minimize power consumption, we replaced multiplication operations with a limited number of right-shift and addition operations. We divide the magnitude $m$ to be attenuated by a selectable power of 2 and then selectively add magnitudes divided by successively larger powers of two, as follows:

$$\tilde{m} = m \cdot 2^{-s} + \sum_{i=1}^{3} b_i \, m \cdot 2^{-(s+i)}, \quad b_i \in \{0, 1\}. \qquad (14)$$

This allows us to represent the approximated function by just four parameters. The first parameter describes the magnitude of the initial right shift, in the range of 0 to 15, and the remaining three binary parameters select the additional right-shift operations. The results of all shift operations are summed.

F. Numerical Precision

The proposed algorithms lend themselves easily to a fixed-point numerical representation: the input data set is bounded, the output data is bounded, and the algorithmic operations are fixed. Hence, the dynamic range of each operation can be controlled such that, for all possible input combinations, the output dynamic range is never exceeded. Furthermore, for a fixed algorithm and a fixed binary-width representation, a fixed-point representation will result in a higher numerical accuracy per number of bits utilized than an equivalent floating-point numerical representation.
This is because all bits in the fixed-point numerical representation are used to represent the mantissa, that is, no bits are wasted in representing the power-of-two exponent. It is clear that the precision of a fixed-point numerical representation increases proportionately with the number of bits utilized. However, the power utilized by operations also increases: linearly for addition or subtraction operations and quadratically for multiplication operations. Therefore, there is a tradeoff between numerical precision and power consumption.
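The division-free scaling of Section III-E can be sketched directly from its description: an initial right shift of 0 to 15 bits, then up to three optional further one-bit shifts whose outputs are summed. `shift_add_attenuate` is our name for this hypothetical helper, and the parameter encoding is our reading of the scheme:

```python
def shift_add_attenuate(mag, s0, extra=(False, False, False)):
    """Approximate mag * g for g in (0, 1] using only right shifts and adds:
    out = (mag >> s0) plus optional (mag >> (s0 + i)) terms for i = 1..3."""
    out = mag >> s0
    shift = s0
    for use in extra:
        shift += 1
        if use:
            out += mag >> shift
    return out

# A x0.75 gain on a 16-bit magnitude: (m >> 1) + (m >> 2) = m * (1/2 + 1/4).
print(shift_add_attenuate(40000, 1, (True, False, False)))  # -> 30000
```

Only a small set of gains of the form $2^{-s}(1 + b_1/2 + b_2/4 + b_3/8)$ is representable, which is exactly why the coarse step quantization of the mask makes the multiplier-free datapath acceptable.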

Fig. 7. (a) Processing noise vs. numerical bit precision. (b) Figure-of-merit used to pick the optimal power/error numerical precision point.

A MATLAB model of the proposed design was developed in order to find an optimal numerical precision. This emulation model allows different numerical representation bit widths to be used, so that the overall accuracy of the proposed design can be evaluated. We measured the error caused by the limited numerical precision as the amount of noise artificially injected into the output signal: we fed two identical input signals to the system and performed all computational operations except PBTFM, in which case the output signal should equal the input signal. We measured the amount of injected noise using the ratio given in (15).

(15)

where $P_o(\omega)$ is the power spectrum of the original signal and $P_p(\omega)$ is the power spectrum of the processed signal; (15) expresses the power of the injected noise normalized by the power of the original signal.

TABLE V BDTI'S FIXED-POINT PROCESSOR COMPARISON [24]

This quantity is plotted in Fig. 7(a) for numerical bit widths in the range of 8 to 32 bits. Note that there is very little change in error once 19 bits of numerical precision are exceeded. We used a figure-of-merit that takes into account both the noise due to the limited numerical precision and the average power expended for addition operations. That is, our figure of merit is given as

TABLE VI POSTPROCESSING SNR (γ = 5) [25]

(16)

where $P_{\mathrm{add}}$ is the average power dissipated for additions, and $\mathrm{NSR}$ is the postprocessing noise-to-signal ratio of (15). A plot of this figure-of-merit is shown in Fig. 7(b). The optimal point, with respect to power and accuracy, occurs around bit widths of 15 to 16 bits. We used a 16-bit representation to absorb any slight numerical errors introduced by speech enhancement.

IV. DSP IMPLEMENTATION

Results published by Berkeley Design Technology, Inc.3 (BDTI) [23] have shown FPGA-based signal processing to be more efficient than DSP-based signal processing. The goal of this section, however, is not to provide an "FPGA versus DSP" comparison, but rather to provide a point of reference for the results in this paper. The DSP implementation discussed herein is based on Freescale Semiconductor's DSP56858 DSP. BDTI's results, a pertinent sample of which is shown in Table V, show that the chosen DSP is not the most power-efficient DSP available. However, the results reported in this section can be normalized to obtain performance metrics for more power-efficient DSPs. We chose Freescale's DSP because we originally anticipated that it would be capable of supporting the developed algorithm. The DSP software was developed to have minimal execution time (instruction count) while minimizing the power consumed by memory traffic, capitalizing on data locality wherever possible. To minimize execution and development time, vendor-provided assembly-level functions were utilized wherever possible. Additional code was written in ANSI C and further optimized using aggressive compiler settings. The DSP code and the required algorithmic constants, such as those for the FFT and CORDIC algorithms, are stored in ROM and copied into RAM on power-up. Due to the nature of the optimizations required, the code base, constants, and data do not fit into the DSP's memory cache. Moreover, the CODEC writes samples directly to memory using its direct-memory-access capabilities. Hence, instructions and data still have to access the memory subsystem periodically. CodeWarrior provides predefined trigonometric functions, but each call results in over 300 assembly-level instructions.

3www.bdti.com

Considering that the cosine function needs to be evaluated once per frequency and for each possible time delay, the built-in trigonometric functions significantly impact the processing latency of a cosine-based TDOA algorithm. Instead, a rectangular function [22] is used, as it is simple enough to allow for real-time TDOA estimates. However, even with such a significant tradeoff in terms of accuracy and speed of operation, the DSP is still incapable of providing real-time TDOA estimates for a 20-kHz audio signal. In fact, the DSP is only capable of providing real-time TDOA estimates for an 8-kHz audio signal, with the TDOA search window restricted to ±3.375 samples (±0.422 ms) and using a 0.125-sample (15.625-µs) resolution. Real-time speech enhancement of 12-kHz audio is possible if TDOA estimation is disabled.

V. RESULTS

The benefits and drawbacks of the PBTFM approach compared to other speech enhancement techniques have been previously discussed [1], [2], [9] for both synthetic and realistic environments. Hence, the results in this section only compare the performance of a 64-bit floating-point MATLAB implementation to the FPGA and DSP implementations in terms of postprocessing SNR.

The following test setup is used to measure the performance of our two implementations. The desired signal is artificially mixed in MATLAB with a conflicting human speaker at six different SNRs. The noise source is delayed by 0.5 ms with respect to the virtual microphone pair, while the signal of interest is in phase across the pair. A total of three minutes of speech is used. In order to avoid biases introduced by microphones, analog amplifiers, and sampling circuitry, as well as by digital-to-analog conversion (DAC), all data is written to and read from the FPGA and DSP using a digital interface. Table VI summarizes the postprocessing SNR obtained using the MATLAB, FPGA, and DSP implementations.
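As an illustration of the rectangular selector used for TDOA estimation above, the sketch below scores each candidate delay by counting frequency bins whose inter-microphone phase error falls inside a fixed window, then picks the best-scoring delay on the ±3.375-sample, 0.125-step grid quoted earlier. The test signal, window width, and floating-point arithmetic are illustrative choices, not the DSP's fixed-point code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
true_delay = 3                       # samples; inside the ±3.375-sample window

x1 = rng.standard_normal(n)          # microphone 1
x2 = np.roll(x1, true_delay)         # microphone 2: delayed (circular) copy

X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
phase_diff = np.angle(X1 * np.conj(X2))      # observed inter-mic phase
omega = 2 * np.pi * np.arange(len(X1)) / n   # bin frequency in rad/sample

def rect_score(tau, width=0.2):
    """Rectangular selector: count bins whose wrapped phase error
    (observed phase minus the phase a delay of tau would produce)
    lies within +/- width radians."""
    err = np.angle(np.exp(1j * (phase_diff - omega * tau)))  # wrap to (-pi, pi]
    return int(np.sum(np.abs(err) < width))

# Search grid matching the DSP implementation: +/-3.375 samples, 0.125 steps.
taus = np.arange(-3.375, 3.375 + 1e-9, 0.125)
scores = [rect_score(t) for t in taus]
tdoa = taus[int(np.argmax(scores))]
print(f"estimated TDOA: {tdoa} samples")     # the true delay here is 3 samples
```

The appeal for a slow DSP is that each candidate delay costs only a subtraction, a wrap, and a comparison per bin, with no cosine evaluation anywhere in the loop.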
Power consumption for the FPGA and DSP implementations is presented in Table VII using two metrics. We used two methods for measuring power consumption because both the FPGA and the DSP are housed on evaluation boards and make use of off-chip components. The IC computational core power is only the power utilized by the FPGA's or DSP's low-voltage processing core. The incremental board-level system power, in contrast, is the additional power consumed by the evaluation board when the system is in operation, relative to the board's programmed quiescent power consumption. The programmed quiescent power is defined as the power utilized by the evaluation board once the DSP or FPGA has been programmed, but while signal processing is suspended.

TABLE VII
OVERALL SYSTEM POWER UTILIZATION
(FPGA core power is biased by leakage power due to the 90% of unutilized FPGA logic elements. The DSP uses limited phase-based sound localization and speech enhancement in order to achieve real-time signal processing: 8-kHz input sampling, reduced search range and resolution.)

TABLE VIII
COMPARISON OF FPGA AND DSP IMPLEMENTATIONS

The presented board-level power measurements are meant to give the reader an appreciation of the overall power an FPGA or DSP implementation will require. Specifically, we are trying to show that although an FPGA or DSP might be advertised as a low-power device, the required peripheral electronics might make the overall system power consumption quite high. In this paper, the FPGA implementation consumes less power than the DSP implementation, in terms of core IC power consumption and especially in terms of board-level power consumption. Table VIII summarizes the subtle differences between the FPGA and DSP implementations.

Although the DSP implementation offers a quick method to demonstrate the functionality of the proposed phase-error-based filter, the chosen DSP is not a suitable platform for a speech enhancement system. For a real-time implementation, a hardware-based system is most suitable. An FPGA or a fully custom application-specific integrated circuit is capable of providing real-time performance with designer-controlled power consumption. We have also ported the FPGA design to a 0.18-µm black-box gate-level CMOS implementation, which had an estimated power consumption of only 3.44 mW.

VI. CONCLUSION

An FPGA implementation of sound localization and phase-error-based speech enhancement is presented in this paper. Algorithmic, architectural, and low-level optimizations are employed in order to minimize the power utilized by both algorithms. As a result, the novel contributions of this work can be summarized as follows.
• Optimization of a widely accepted sound localization algorithm for power consumption without loss of localization accuracy.

• Optimization of a state-of-the-art speech enhancement algorithm for power consumption while retaining speech recognition performance equivalent to the original algorithm.
• The implementation of both algorithms on an FPGA and a DSP platform.

The resulting FPGA implementation uses approximately 68.55 mW of power while achieving sound localization accuracy and speech enhancement quality similar to a 64-bit floating-point software implementation. The presented DSP implementation used approximately 3.5× more power than the FPGA and was not able to achieve the same rate of signal processing. The 6-MHz FPGA system outperformed the 120-MHz DSP because the FPGA implementation is structured to perform parallel and compound data-processing operations every clock cycle, whereas the DSP requires additional clock cycles to fetch instructions that perform simple data-processing operations.

REFERENCES

[1] G. Shi and P. Aarabi, "Robust digit recognition using phase-dependent time-frequency masking," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), Hong Kong, Apr. 2003, pp. 684–687.
[2] P. Aarabi and G. Shi, "Phase-based dual-microphone robust speech enhancement," IEEE Trans. Syst., Man, Cybern., vol. 34, no. 4, pp. 1763–1773, Aug. 2004.
[3] C. Lai and P. Aarabi, "Multiple-microphone time-varying filters for robust speech recognition," presented at the IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), Montreal, QC, Canada, May 2004.
[4] G. DeMuth, "Frequency domain beamforming techniques," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, May 1977, vol. 2, pp. 713–715.
[5] J. Bitzer, K. U. Simmer, and K. D. Kammeyer, "Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Mar. 1999, vol. 5, pp. 2965–2968.
[6] J. Bitzer, K. D. Kammeyer, and K. U. Simmer, "An alternative implementation of the superdirective beamformer," in Proc. IEEE Workshop Applications Signal Processing to Audio Acoustics, Oct. 1999, pp. 7–10.
[7] R. Le Bouquin-Jeannès, A. A. Azirani, and G. Faucon, "Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator," IEEE Trans. Speech Audio Process., vol. 5, pp. 484–487, Sep. 1997.
[8] C. Marro, Y. Mahieux, and K. U. Simmer, "Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering," IEEE Trans. Speech Audio Process., vol. 6, pp. 240–259, May 1998.
[9] G. Shi, P. Aarabi, and N. Lazic, "Adaptive time-frequency data fusion for speech enhancement," presented at the Int. Conf. Information Fusion (FUSION), Cairns, Australia, Jul. 2003.
[10] P. Aarabi, G. Shi, and O. Jahromi, "Robust speech separation using time-frequency masking," in Proc. Int. Conf. Multimedia Expo (ICME), Baltimore, MD, Jul. 2003, vol. 1, pp. 741–744.
[11] P. Aarabi and S. Mavandadi, "Robust sound localization using conditional time-frequency histograms," Inf. Fusion, vol. 4, pp. 111–122, 2003.
[12] M. Brandstein and H. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), May 1997, pp. 375–378.
[13] C. H. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 4, pp. 320–327, Aug. 1976.
[14] T. Gustafsson, B. D. Rao, and M. Trivedi, "Analysis of time-delay estimation in reverberant environments," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), Orlando, FL, May 2002, vol. 2, pp. 2097–2100.


[15] D. Halupka, N. J. Mathai, P. Aarabi, and A. Sheikholeslami, "Robust sound localization in 0.18-µm CMOS," IEEE Trans. Signal Process., vol. 53, no. 6, pp. 2243–2250, Jun. 2005.
[16] S. Oraintara, Y.-J. Chen, and T. Nguyen, "Integer fast Fourier transform (INTFFT)," IEEE Trans. Signal Process., vol. 50, no. 3, pp. 607–618, Mar. 2002.
[17] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," in Proc. 1998 ACM/SIGDA 6th Int. Symp. FPGAs, Monterey, CA, 1998, pp. 191–200.
[18] S. Freeman and M. O'Donnell, "A complex arithmetic digital signal processor using CORDIC rotators," in Proc. 1995 Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), 1995, vol. 5, pp. 3191–3194.
[19] E. Grass, B. Sarker, and K. Maharatna, "A dual-mode synchronous/asynchronous CORDIC processor," in Proc. 8th Int. Symp. Asynchronous Circuits Systems, Apr. 2002, pp. 76–83.
[20] J. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electron. Comput., vol. EC-8, no. 3, pp. 330–334, Sep. 1959.
[21] C. Lai, "Analysis and extension of time-frequency masking," M.A.Sc. thesis, Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada, 2003.
[22] S. Mavandadi and P. Aarabi, "Time-delay of arrival estimation using non-linear phase-error selector functions," presented at the Sensor Fusion: Architectures, Algorithms, Applications VII (AeroSense 2003), Orlando, FL, Apr. 2003.
[23] Altera, "FPGAs for high-performance DSP applications," white paper, May 31, 2005 [Online]. Available: http://www.altera.com/literature/wp/wp_dsp_comp.pdf
[24] Berkeley Design Technology, Inc., "Speed per milliwatt ratios for fixed-point packaged processors," May 31, 2005 [Online]. Available: http://www.bdti.com/bdtimark/chip_fixed_power_scores.pdf
[25] D. Halupka and S. A. Rabi, "Clips of TFM pre- & post-processed audio samples," Aug. 2004 [Online]. Available: http://www.apl.toronto.edu/projects/fpgatfm.html
[26] D. Halupka, S. A. Rabi, P. Aarabi, and A. Sheikholeslami, "Comparison of FPGA and DSP based phase-error filtering," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Mar. 2005, pp. 149–152.

David Halupka (S'99–M'02) received the B.Sc. (Hons.) degree and the M.A.Sc. degree, both in computer and electrical engineering, from the University of Toronto, Toronto, ON, Canada, in 2002 and 2004, respectively. He is currently working towards the Ph.D. degree in electrical engineering at the Department of Electrical and Computer Engineering, University of Toronto. His interests include reliable circuit design in the presence of CMOS process variation and bio-inspired methods of signal processing. Mr. Halupka holds the prestigious Natural Sciences and Engineering Research Council of Canada's Canadian Graduate Scholarship and has also held the Ontario Graduate Scholarship during his M.A.Sc. work. He was the recipient of the IEEE Canadian Foundation's McNaughton Scholarship in 2001.

Alireza Seyed Rabi received the B.A.Sc. degree with honors from the University of Toronto, Toronto, ON, Canada. He is currently working towards the M.D./Ph.D. degree at The Johns Hopkins University, Baltimore, MD. Mr. Rabi received the Queen Elizabeth II Aiming for the Top Scholarship award from the University of Toronto, the Leonardo da Vinci Top Competitor award, and the University of Toronto Scholar award, along with a University of Toronto Entrance Scholarship, the Whealy Joseph Scholarship, and the Guenther J. Frank Scholarship. In 2004–2005, he received the Etkin Medal of Excellence. He is also a two-time NSERC award holder.


Parham Aarabi (M’02–SM’06) received the B.A.Sc. degree in engineering science (electrical option) and the M.A.Sc. degree in computer engineering from the University of Toronto, Toronto, ON, Canada, in 1998 and 1999, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 2001. He is a Canada Research Chair in multisensor information systems; a tenured Associate Professor in The Edward S. Rogers, Sr. Department of Electrical and Computer Engineering; and the Founder and Director of the Artificial Perception Laboratory, University of Toronto. His current research, which includes multisensor information fusion, human–computer interactions, and hardware implementation of sensor fusion algorithms, has appeared in over 50 peer-reviewed publications and has been covered by media such as The New York Times, MIT’s Technology Review Magazine, Scientific American, Popular Mechanics, the Discovery Channel, CBC Newsworld, Tech TV, Space TV, and City TV. Dr. Aarabi received many awards, including, most recently, the 2002, 2003, and 2004 Professor of the Year Awards; the 2003 Faculty of Engineering Early Career Teaching Award; the 2004 IEEE Mac Van Valkenburg Early Career Teaching Award; the 2005 Gordon Slemon Award; the 2005 TVO Best Lecturer (Top 30) selection; the Premier’s Research Excellence Award; the 2006 APUS/SAC University of Toronto Undergraduate Teaching Award; as well as MIT Technology Review’s 2005 TR35 “World’s Top Young Innovator” Award.

Ali Sheikholeslami (S’98–M’99–SM’02) received the B.Sc. degree from Shiraz University, Shiraz, Iran, in 1990 and the M.A.Sc. and Ph.D. degrees from the University of Toronto, Toronto, ON, Canada, in 1994 and 1999, respectively, all in electrical and computer engineering. In 1999, he joined the Department of Electrical and Computer Engineering, University of Toronto, where he is currently an Associate Professor. His research interests are in the areas of analog and digital integrated circuits, high-speed signaling, VLSI memory design (including SRAM, DRAM, and CAM), and ferroelectric memories. He has collaborated with industry on various VLSI design projects in the past few years, including work with Nortel, Canada, in 1994, with Mosaid, Canada, since 1996, and with Fujitsu Laboratories, Japan, since 1998. He is currently spending the first half of his sabbatical year with Fujitsu Laboratories, Japan. He presently supervises three active research groups in the areas of ferroelectric memory, CAM, and high-speed signaling. He has coauthored several journal and conference papers (in all three areas), in addition to two U.S. patents on CAM and one U.S. patent on ferroelectric memories. Dr. Sheikholeslami has served on the Memory Subcommittee of the IEEE International Solid-State Circuits Conference (ISSCC) from 2001 to 2004 and on the Technology Directions Subcommittee of the same conference from 2002 to 2005. He presented a tutorial on ferroelectric memory design at the ISSCC 2002. He was the Program Chair for the Thirty-Fourth IEEE International Symposium on Multiple-Valued Logic (ISMVL 2004) held in Toronto, ON, Canada. He received the Best Professor of the Year Award in 2000, 2002, and 2005 by the popular vote of the undergraduate students in the Department of Electrical and Computer Engineering, University of Toronto. He is a registered Professional Engineer in the province of Ontario, Canada.