IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 8, NOVEMBER 2002


Performance Improvement of a Bitstream-Based Front-End for Wireless Speech Recognition in Adverse Environments Hong Kook Kim, Senior Member, IEEE, Richard V. Cox, Fellow, IEEE, and Richard C. Rose, Senior Member, IEEE

Abstract—In this paper, we propose a feature enhancement algorithm for wireless speech recognition in adverse acoustic environments. A speech recognition system is realized at the network side of a wireless communications system and feature parameters are extracted directly from the bitstream of the speech coder employed in the system, where the feature parameters are composed of spectral envelope information and coder-specific information. The coder-specific information is apt to be affected by environmental noise because the speech coder fails to generate high quality speech in noisy environments. We first found that enhancing noisy speech prior to speech coding improves the recognizer's performance. However, our aim was to develop a robust front-end operating at the network side of a wireless communications system without regard to whether speech enhancement was applied at the sender side. We investigated the effect of a speech enhancement algorithm on the bitstream-based feature parameters. Consequently, a feature enhancement algorithm is proposed which incorporates feature parameters obtained from the decoded speech and a noise-suppressed version of the decoded speech. The coder-specific information can also be improved by re-estimating the codebook gains and residual energy from the enhanced residual signal. HMM-based connected digit recognition experiments show that the proposed feature enhancement algorithm significantly improves recognition performance at low signal-to-noise ratio (SNR) without causing poorer performance at high SNR. From large vocabulary speech recognition experiments with far-field microphone speech signals recorded in an office environment, we show that the feature enhancement algorithm greatly improves word recognition accuracy.

Index Terms—Bitstream-based front-end, feature enhancement, robust speech recognition, speech enhancement, wireless speech recognition.

I. INTRODUCTION

WIRELESS communications networks provide more adverse environments for speech recognition than wireline networks. The adverse environments are characterized by low signal-to-noise ratio (SNR) speech due to ambient noise [1] and by packet loss due to channel impairments [2], [3]. The former environment also occurs in wireline networks and degrades automatic speech recognition (ASR) performance [1]. In addition to the packet loss, low-bit-rate speech coders employed in wireless networks result in additional degradation to

Manuscript received September 25, 2001; revised August 2, 2002. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Harry Printz. The authors are with the AT&T Labs—Research, Florham Park, NJ 07932 USA (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TSA.2002.804302

recognition performance [4]. Moreover, in noisy acoustic environments, a speech coder fails to generate high quality speech. As a remedy to these problems, we can consider introducing robust techniques that are conventionally used in wireline speech recognition [5], [6]. They include speech enhancement, feature normalization/transformation and model compensation. A typical speech enhancement algorithm tries to subtract the noise spectrum from the noisy speech spectrum by using a spectral subtraction method [7]–[9]. As another approach, speech enhancement decomposes a noisy signal space into a signal-plus-noise subspace and a noise subspace, and then removes the noise subspace [10]. These speech enhancement algorithms have been successfully applied to both speech coding and recognition. Feature normalization techniques such as cepstral mean subtraction (CMS) [11] and high-pass filtering of cepstral coefficients [12] have been developed to remove the convolutional noise caused by the channel, where the channel includes the transducer effect and telephone channel characteristics [1], [12]. Model compensation or adaptation techniques [6] are used to update the model parameters with a small amount of data obtained from a new environment. However, model compensation techniques are outside the scope of this work because we are mainly concerned with the design of a robust front-end for wireless speech recognition.

In this work, we assume that a speech recognition system is constructed on the network side of a wireless network and the recognition feature parameters are obtained directly from the bitstream generated by a speech coder [13]–[15]. Therefore, the robustness of the feature parameters is highly dependent on the accuracy of the spectral analysis used in the speech coder. When speech enhancement is applied to the input noisy speech as a preprocessing stage to speech coding, the speech encoder can perform a more accurate spectral analysis of the noisy speech.
Thus, improved performance of the speech recognition system can be achieved [15]. However, speech enhancement may degrade the speech quality in a high SNR environment and may lower recognition accuracy as well. Moreover, it may not be practical to implement a speech enhancement algorithm in the encoder side of the speech coder. The speech recognition system considered in this work must give reliable performance for speech signals obtained from a broad range of wireless phones. Some phones may incorporate speech enhancement algorithms, but others may not. This leads us to develop an enhancement scheme that only operates in the network side of a wireless communications system. We first introduce a well-known speech enhancement technique as a preprocessor prior to the speech encoder in order to



Fig. 1. Block diagram for a speech enhancement scenario combined with the IS-641 speech coder.

investigate how much the enhancement algorithm improves speech recognition performance. The speech enhancement algorithm used in this work is based on minimum mean-square error log-spectral amplitude (MMSE-LSA) estimation [16], and it has been applied to several standard speech coders [17]–[19]. Toward this end, we consider the scenario where speech enhancement and the speech coder are combined as shown in Fig. 1, and we measure the effect of speech enhancement on speech using objective criteria. Based on what these objective measures reveal about the desirable properties of the speech enhancement algorithm, we propose a feature enhancement algorithm that combines two sets of feature parameters obtained from the decoded noisy speech signal and its enhanced version. Moreover, we apply the speech enhancement algorithm to the decoded residual signal and then re-estimate the coder-specific parameters1, such as the fixed codebook gain and the adaptive codebook gain, from this enhanced residual signal. To avoid degrading the recognition performance under a high SNR condition, the feature enhancement algorithm is turned on or off depending on an estimated SNR. This has the effect of disabling the feature enhancement algorithm when the estimated SNR is high.

Following this introduction, we briefly review the bitstream-based front-end for wireless speech recognition and the MMSE-LSA speech enhancement algorithm in Section II, where the motivation for this work is also discussed. In Section III, we describe the proposed feature enhancement algorithm and discuss its implementation issues. In Section IV, we perform connected digit recognition in additive noise environments and large vocabulary recognition with both close-talking and far-field microphone speech signals recorded in an office environment, in order to show the effectiveness of the proposed feature enhancement algorithm. Finally, we present our conclusions in Section V.
II. BACKGROUND AND MOTIVATION

In order to provide the background and motivation for the network-based feature enhancement algorithm that is presented in Section III, it is necessary to review several well-known results. The enhancement algorithm is applied in the context of the IS-641 standard algebraic code excited linear prediction (ACELP) speech coder [20] and incorporates the MMSE-LSA speech enhancement system as part of the algorithm.

1The bitstream-based front-end [15] uses spectral envelope and periodicity information for recognition parameters, while conventional front-ends utilize only spectral envelope information. The periodicity information is represented by parameters obtained from the codebook gains, which are somewhat dependent on the speech coder structure; we therefore call these parameters coder-specific parameters.

We give

an overview of the IS-641 speech coder used in the IS-136 digital cellular system and briefly describe the bitstream-based front-end in Section II-A. The MMSE-LSA speech enhancement algorithm is described in Section II-B. Furthermore, the feature enhancement algorithm acts to minimize a line spectrum pair (LSP) based average spectral distortion measure. This distortion measure is defined and motivated in terms of its ability to act as an objective measure of speech signal distortion in Section II-C.

A. Review of a Bitstream-Based Front-End

In this work, we choose the IS-136 digital cellular system as a wireless communications system that employs the IS-641 speech coder [20]. The IS-641 speech coder compresses speech at a rate of 7.4 kbit/s by first decomposing each 20 ms speech segment into a spectral envelope and a residual signal. The spectral envelope is represented as ten LSPs that are quantized with a 26-bit split vector quantizer. The residual signal is modeled once every 5 ms segment, called a subframe. An adaptive codebook is employed to remove the periodicity, and the resulting signal is further modeled by an algebraic fixed codebook. Therefore, the parameters of the speech coder are ten LSPs, four adaptive codebook lags and gains, and four fixed codebook gains and codebook indices.

The bitstream-based front-end [15] extracts two recognition feature sets from these parameters: an LPC-cepstra based and a mel-cepstra based feature set. The common procedure is as follows. We first obtain the decoded LSPs at a frame rate of 50 Hz. Since the frame rate of the conventional spectral analysis used for speech recognition is usually 100 Hz, and we want to simply replace the conventional front-end with the proposed one, we interpolate these LSPs with those of the previous frame and thus raise the frame rate to 100 Hz. Next, the LSPs are converted into the linear predictive coding (LPC) coefficients of order ten.
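The frame-rate doubling by LSP interpolation and the subsequent LPC-to-cepstrum conversion can be sketched as follows. This is an illustrative sketch, not the coder's reference implementation: the function names are ours, and the cepstral recursion is the standard one for an all-pole model $A(z) = 1 + \sum_k a_k z^{-k}$.

```python
import numpy as np

def interpolate_lsp(prev_lsp, curr_lsp, weight=0.5):
    """Linear interpolation of ordered LSP vectors to double the frame
    rate from 50 Hz to 100 Hz; a convex combination of two ordered LSP
    vectors preserves the ordering property."""
    return (1.0 - weight) * np.asarray(prev_lsp) + weight * np.asarray(curr_lsp)

def lpc_to_cepstrum(a, n_cep=10):
    """Standard LPC-to-cepstrum recursion for A(z) = 1 + sum_k a[k] z^-k:
    c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a single-pole model the recursion reproduces the known series expansion of the log spectrum, which makes it easy to sanity-check.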
For the LPC-cepstra based feature set, cepstral coefficients of order ten are obtained from the LPC-to-cepstrum conversion. We obtain the ten liftered cepstral coefficients by applying the bandpass lifter [21] to the cepstral coefficients. For the mel-cepstra feature set, we apply 24 mel-filter banks to the spectral envelope that is obtained from the decoded LSPs. By applying the discrete cosine transform to the 24 mel-filter bank energies, the ten mel-frequency cepstral coefficients are finally obtained. For the coder-specific parameters we decode the residual signal from the adaptive codebook lags, the shape or algebraic codebook indices, and the codebook gains. Then, we compute an energy parameter by taking the logarithm of the square-sum of the residual for 10 ms. From the speech coder, we have four adaptive codebook gains and four fixed codebook gains for each analysis frame because they are transmitted once every


subframe. The first two gains are merged into one parameter and the last two gains are merged into another parameter [15]. As a result, we obtain two adaptive codebook gain parameters and two fixed codebook gain parameters twice every frame. Consequently, the feature vector is 13-dimensional, comprising ten LPC-cepstral coefficients (for the LPC-cepstra based feature set) or ten mel-frequency cepstral coefficients (for the mel-cepstra based feature set), a normalized logarithmic residual energy, an adaptive codebook gain parameter and a fixed codebook gain parameter. The energy normalization is performed so that the maximum value of energy in an utterance becomes zero. Finally, we apply CMS to each feature parameter except the normalized logarithmic residual energy, and add the first and second time differences of each parameter, where the first and second differences are computed over five- and three-frame windows, respectively. Therefore, the feature vector dimension is 39.

B. MMSE-LSA Algorithm

The MMSE-LSA speech enhancement algorithm operates in the frequency domain. A nonlinear frequency-dependent gain function is estimated from the noisy speech signal and then applied to the spectral components of the noisy speech signal in an attempt to obtain estimates of the spectral components of the corresponding clean speech. Let $S(\omega)$, $N(\omega)$ and $Y(\omega)$ be the discrete Fourier transforms of clean speech $s(n)$, additive noise $n(n)$ and noisy speech $y(n)$, respectively, as shown in Fig. 1. The objective of MMSE-LSA is to find the estimator $\hat{S}(\omega)$ for the clean speech spectrum that minimizes the distortion measure $E\{(\log|S(\omega)| - \log|\hat{S}(\omega)|)^2\}$ for a given noisy observation $Y(\omega)$. The MMSE-LSA estimate of the clean speech spectrum has the form

$\hat{S}(\omega) = G(\omega)\,Y(\omega)$   (1)

where $G(\omega)$ is a nonlinear multiplicative gain function depending on the speech-presence probability, the a priori SNR estimate and the a posteriori SNR estimate for each frequency bin [18]. Finally, we obtain the enhanced speech signal, $\hat{s}(n)$, by applying the inverse discrete Fourier transform to $\hat{S}(\omega)$. For a detailed description of the algorithm, the reader is referred to [17]–[19].

C. Effect of MMSE-LSA on Spectral Distortion and Correlation

This section motivates the proposed feature enhancement algorithm as a means of reducing distortion as defined by the objective measures described here. Since the bitstream-based front-end is based on LSP parameters, energy and two codebook gain parameters, as mentioned in Section II-A, we need to define distortion measures related to these parameters. The measures are the average spectral distortion (SD) of LSPs and the correlation coefficients of the other three parameters. They are defined in this section and motivated in terms of their ability to predict distortions in speech. We first denote the LSP vectors of the $l$th clean speech segment, the corresponding noisy speech segment and the enhanced speech segment as $\mathbf{f}_s(l)$, $\mathbf{f}_y(l)$ and $\mathbf{f}_{\hat{s}}(l)$, respectively.


Let $P_s(\omega; l)$, $P_y(\omega; l)$ and $P_{\hat{s}}(\omega; l)$ be the LPC power spectra obtained from $\mathbf{f}_s(l)$, $\mathbf{f}_y(l)$ and $\mathbf{f}_{\hat{s}}(l)$, respectively. The SD in dB between clean speech and noisy speech, $\mathrm{SD}(s, y)$, is defined as

$\mathrm{SD}(s, y) = \frac{1}{L}\sum_{l=1}^{L}\sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\big[10\log_{10}P_s(\omega; l) - 10\log_{10}P_y(\omega; l)\big]^2\,d\omega}$   (2)

where $L$ is the total number of frames [22]. Similarly, we can define the SD between clean speech and enhanced speech, $\mathrm{SD}(s, \hat{s})$, by replacing $P_y(\omega; l)$ in (2) with $P_{\hat{s}}(\omega; l)$. The goal of the feature enhancement algorithm is to reduce $\mathrm{SD}(s, \hat{s})$ with respect to $\mathrm{SD}(s, y)$:

$\mathrm{SD}(s, \hat{s}) < \mathrm{SD}(s, y).$   (3)

In practice, the SD can be replaced by the following Euclidean distance between LSPs [22]:

$\mathrm{SD}(s, y) \approx \frac{1}{L}\sum_{l=1}^{L} d\big(\mathbf{f}_s(l), \mathbf{f}_y(l)\big)$   (4)

where

$d(\mathbf{f}_s, \mathbf{f}_y) = \sqrt{\sum_{i=1}^{10}\big(f_{s,i} - f_{y,i}\big)^2}$   (5)

and $f_{s,i}$ and $f_{y,i}$ are the $i$th elements of $\mathbf{f}_s$ and $\mathbf{f}_y$. Equations (3)–(5) will be used for the objective function in the proposed feature enhancement algorithm in Section III.

Fig. 2(a) shows the relationship between the SD measure given in (2) and the SNR for speech signals corrupted by two types of background noise. Clean speech was corrupted by car noise and babble noise at SNRs ranging from 0 to 30 dB. The SD, shown on the vertical axis of Fig. 2(a), was computed from LSPs estimated from the clean and noise-corrupted speech. It is clear from the figure that SD increases linearly with decreasing SNR for both noise types, suggesting that it may be a reasonably good predictor of speech recognition accuracy. The feature enhancement algorithm developed in Section III is designed to reduce the average SD under noisy conditions.

In addition, we measure the contribution of the speech enhancement algorithm to the coder-specific parameters by defining the correlation coefficients

$\rho_{c,y} = \frac{\sum_{l} c(l)\,\hat{c}_y(l)}{\sqrt{\sum_{l} c(l)^2}\,\sqrt{\sum_{l} \hat{c}_y(l)^2}}$   (6)

and

$\rho_{c,\hat{s}} = \frac{\sum_{l} c(l)\,\hat{c}_{\hat{s}}(l)}{\sqrt{\sum_{l} c(l)^2}\,\sqrt{\sum_{l} \hat{c}_{\hat{s}}(l)^2}}$   (7)

where $c(l)$ is one of the coder-specific parameters of the $l$th clean speech frame. These include the residual energy, the fixed codebook gain parameter and the adaptive codebook gain parameter. $\hat{c}_y(l)$ and $\hat{c}_{\hat{s}}(l)$ are the coder-specific parameters estimated from the noisy and enhanced residual, respectively. $\rho_{c,y}$ is the correlation coefficient between $c$ and $\hat{c}_y$, and $\rho_{c,\hat{s}}$ is the correlation coefficient between $c$ and $\hat{c}_{\hat{s}}$. We can say that the feature enhancement algorithm performs well if $\rho_{c,\hat{s}} > \rho_{c,y}$. Before computing (6) or (7), each parameter is normalized so that it has zero mean.

Fig. 2. Average SD and correlation coefficients for clean speech and noisy speech. (a) Average SD for LSPs and correlation coefficients for (b) residual energy, (c) fixed codebook gain parameter, and (d) adaptive codebook gain parameter. (Separate markers indicate the car and babble noise conditions.)

Fig. 2(b)–(d) shows the relationship between the correlation coefficients given in (6) and the SNR for noise-corrupted speech. The figures show that the correlation coefficients for the residual energy, the fixed codebook gain parameter and the adaptive codebook gain parameter all increase with increasing SNR for both noise types. An enhancement procedure is described in Section III for increasing the magnitude of the correlation coefficients for noise-corrupted speech.

The effect of the speech enhancement algorithm on both the average SD and the correlation coefficients is investigated here in order to illustrate deficiencies of the algorithm in dealing with certain noise types. The MMSE-LSA speech enhancement algorithm was applied to noisy speech, and the average SD, $\mathrm{SD}(s, \hat{s})$, and correlation coefficients, $\rho_{c,\hat{s}}$, between clean speech and enhanced speech were measured. Fig. 3 shows the average SD and the correlation coefficients plotted as dashed lines with the curves of Fig. 2 overlaid as solid lines for comparison. The curves in Fig. 3(a) show that the speech enhancement algorithm reduced the average SD for the noisy speech corrupted by car noise, but had no effect for the babble noise condition. This is because the spectral characteristics of the babble noise are similar to those of the speech signal, making it difficult for the speech enhancement algorithm to correctly estimate the babble noise spectrum. This unwanted effect can be mitigated in the design of the feature enhancement algorithm in Section III. Also, from Fig. 3(b)–(d), we found that the speech enhancement algorithm improved the correlations of the residual energy and fixed codebook gain parameter under both car and babble noise conditions. However, speech enhancement had no effect on the correlation of the adaptive codebook gain parameter for the babble noise condition. This inconsistency in the effect of the speech enhancement algorithm on the correlation coefficients in different noise types is dealt with by the feature enhancement algorithm in Section III.

Fig. 3. Average SD and correlation coefficients for noisy speech and enhanced speech. (a) Average SD for LSPs and correlation coefficients for (b) residual energy, (c) fixed codebook gain parameter, and (d) adaptive codebook gain parameter. (Separate markers indicate the car and babble noise conditions; dashed and solid lines indicate the values before and after speech enhancement, respectively.)

III. FEATURE ENHANCEMENT

In this section, we propose a feature enhancement algorithm that resides within the network infrastructure of the wireless communications system. The feature enhancement algorithm is realized by utilizing the decoded speech and its enhanced version, as shown in Fig. 4.

Fig. 4. Block diagram of the proposed feature enhancement algorithm realized in the decoder side of the speech coder.

The spectral envelope parameters are enhanced in the LSP domain using the algorithm described in Section III-A. Also, the coder-specific parameters are updated by one of two methods: a direct assignment method and a re-estimation method, explained in Section III-B. Finally, we propose an SNR-based enhancement method in order to avoid degrading the recognition performance under high SNR conditions. This allows the proposed feature enhancement algorithm to be applied selectively to speech signals depending on the estimated SNR. We begin by defining the notation used in describing the feature enhancement algorithm.
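Before turning to that notation, the objective measures of Section II-C, the LSP-domain distortion proxy of (4)–(5) and the correlation coefficients of (6)–(7), can be sketched as follows. Function names are ours, and per-frame LSP vectors and scalar parameters are assumed to already be available as arrays.

```python
import numpy as np

def lsp_distance(f1, f2):
    """Euclidean LSP distance of (5), a proxy for the SD of (2)."""
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    return float(np.sqrt(np.sum((f1 - f2) ** 2)))

def average_sd(clean_lsps, other_lsps):
    """Frame-averaged LSP distance, as in (4)."""
    return float(np.mean([lsp_distance(a, b)
                          for a, b in zip(clean_lsps, other_lsps)]))

def correlation(clean_param, other_param):
    """Normalized correlation of (6)/(7); per the text, each parameter
    track is mean-removed before the correlation is computed."""
    c = np.asarray(clean_param, dtype=float) - np.mean(clean_param)
    o = np.asarray(other_param, dtype=float) - np.mean(other_param)
    return float(np.sum(c * o) / np.sqrt(np.sum(c ** 2) * np.sum(o ** 2)))
```

A perfectly preserved parameter track gives a correlation of 1, and identical LSP tracks give an average SD of 0, matching the way these measures are read off Figs. 2 and 3.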


A. Spectral Parameters

As shown in Fig. 4, we first define the following four 10-dimensional LSP column vectors and four three-dimensional coder-specific parameter vectors at analysis frame index $l$, obtained from speech corrupted by background noise $n$:

— $\mathbf{f}_y(l)$ and $\mathbf{c}_y(l)$: an LSP vector and a coder-specific parameter vector derived from the bitstream of the speech coder using the bitstream-based ASR front-end described in Section II-A.
— $\mathbf{f}_d(l)$ and $\mathbf{c}_d(l)$: an LSP vector and a coder-specific parameter vector obtained from analysis of the decoded speech signal, which contains distortions introduced by both the background noise, $n$, and quantization noise, $q$, from the speech decoder.
— $\mathbf{f}_e(l)$ and $\mathbf{c}_e(l)$: an LSP vector and a coder-specific parameter vector obtained from analysis of the enhanced version of the decoded speech signal. In other words, they are enhanced versions of $\mathbf{f}_d(l)$ and $\mathbf{c}_d(l)$.
— $\hat{\mathbf{f}}(l)$ and $\hat{\mathbf{c}}(l)$: estimates of the clean speech LSP vector and coder-specific parameter vector after applying the proposed feature enhancement algorithm.2

2The subscript $l$ is dropped in the rest of the paper for the sake of simplicity.

The objective of the proposed feature enhancement algorithm is to obtain an estimate of the clean speech spectrum vector, $\hat{\mathbf{f}}$, that is in some way closer to the actual clean speech spectrum vector, $\mathbf{f}_s$, than the noisy speech spectrum vector, $\mathbf{f}_y$:

$d^2(\mathbf{f}_s, \hat{\mathbf{f}}) < d^2(\mathbf{f}_s, \mathbf{f}_y)$   (8)

where the distortion $d(\cdot,\cdot)$ is identical to the Euclidean LSP distance defined in (5). Therefore, (8) is closely related to (3). The desired property of the speech enhancement algorithm is a smaller SD between clean speech and enhanced speech than between clean speech and noisy speech. Likewise, the feature enhancement algorithm is motivated by the desire to improve upon the spectrum estimate, $\mathbf{f}_e$, obtained by the MMSE-LSA speech enhancement algorithm. Since we are constrained to operate only on the network side of the wireless channel, we do not have access to the clean speech spectrum vector. The speech enhancement algorithm shown in Fig. 4 must operate on the reconstructed noisy speech signal that has been obtained from the bitstream, and is therefore limited in its ability to recover the original speech. It is shown below that having access to the spectrum vector obtained directly from the bitstream provides an opportunity to improve on the spectral estimate obtained from the speech enhancement algorithm.

The speech enhancement algorithm attempts to minimize the mean square error between the log spectral magnitudes of the clean and noisy speech. It can be shown that the estimated log spectral magnitude is obtained from the noisy speech log spectral magnitude by the nonlinear multiplicative gain function $G(\omega)$ in (1). However, in the system of Fig. 4, the SNR terms in this gain must be estimated from the decoded noisy speech signal, and not the original noisy speech signal. Consequently, these SNR estimates will be biased and the reconstructed speech will not be an accurate representation of the enhancement that would be obtained from the original noisy signal.

Fig. 5. Simplified schematic diagram of the proposed feature enhancement algorithm.

A simplified block diagram is shown in Fig. 5 to illustrate the modeling assumptions used in feature enhancement. The distortion $D_n$ is the spectral distortion between LSP vectors obtained from the bitstream-based front-end under the assumption that there is no additive noise and the assumption that there is additive noise, $n$; $D_n$ is dominated by this additive environmental noise. The distortion $D_d$ is the spectral distortion resulting from the encoding/decoding of noisy speech as well as the additional spectral analysis of this decoded speech. The distortion $D_e$ is the spectral distortion between the LSP vector, $\mathbf{f}_d$, obtained from the decoded speech and the LSP vector, $\mathbf{f}_e$, obtained by the MMSE-LSA speech enhancement. Intuitively, we can approximate the functions of the speech decoder and the speech enhancement as

$\mathbf{f}_d = \mathbf{f}_s + \mathbf{n} + \mathbf{q}$   (9)

$\mathbf{f}_e = \tilde{\mathbf{f}}_s + \mathbf{n}_e + \mathbf{q}_e$   (10)

where $\mathbf{q}$ is a distortion vector induced by quantization noise and is assumed to be independent of $\mathbf{n}$; $\mathbf{n}_e$ and $\mathbf{q}_e$ are the background noise vector and the quantization noise vector remaining after the speech enhancement algorithm. In addition, $\tilde{\mathbf{f}}_s$ is an estimate of the actual clean LSP vector, $\mathbf{f}_s$, after speech enhancement, and it differs from the actual clean LSP vector because speech enhancement introduces spectral distortion to clean speech while it removes additive noise. Using the notation of $D_n$, $D_d$ and $D_e$, we approximate $D_n$ as

$D_n \approx \lambda_e D_e - \lambda_d D_d$   (11)

where $\lambda_e$ and $\lambda_d$ are weighting values for the two distortions. Equation (11) means that the spectral distortion caused by background noise can be estimated by subtracting a properly scaled distortion introduced by the speech decoder from the scaled distortion estimated by speech enhancement. The right-hand side of (11) then serves as a kind of criterion function


in the gradient descent algorithm, because this function attains its minimum at the optimum value of $\hat{\mathbf{f}}$ under the assumption of (8). We denote this criterion as $J(\hat{\mathbf{f}}) = \lambda_e D_e(\hat{\mathbf{f}}) - \lambda_d D_d(\hat{\mathbf{f}})$. We assume that the wanted value of $\hat{\mathbf{f}}$ is not far from $\mathbf{f}_y$. This corresponds to setting the initial value in the gradient descent algorithm to $\mathbf{f}_y$ and performing only one iteration per analysis frame. Thus, we propose a feature enhancement algorithm of the form

$\hat{\mathbf{f}} = \mathbf{f}_y - \mu \nabla J(\mathbf{f})\big|_{\mathbf{f}=\mathbf{f}_y} = \mathbf{f}_y + \mathbf{g}_n + \mathbf{g}_d$   (12)

where $\nabla$ denotes the gradient operator, and $\mathbf{g}_d$ and $\mathbf{g}_n$ are the compensation vectors for the speech encoding/decoding effect and the background noise, respectively. It will be shown later that $\mathbf{g}_d = \mathbf{0}$ in (12) by assuming that the quantization noise is independent of the background noise.

Now we consider the convergence of the criterion. By substituting (12) into (8), we obtain

$\Delta = d^2(\mathbf{f}_s, \mathbf{f}_y) - d^2(\mathbf{f}_s, \hat{\mathbf{f}}).$   (13)

In order for (12) to be a meaningful update equation, $\Delta$ should be greater than or equal to 0. This is satisfied component-wise if the updated LSP vector lies between $\mathbf{f}_y$ and the enhanced LSP vector $\mathbf{f}_e$. In other words,

$\Delta \ge 0 \quad \text{if} \quad f_{y,i} \le \hat{f}_i \le f_{e,i} \ \text{for all } i$   (14)

or

$\Delta \ge 0 \quad \text{if} \quad f_{e,i} \le \hat{f}_i \le f_{y,i} \ \text{for all } i.$   (15)

Next, we find a simple update equation. From (11) and (12), the compensation vectors can be represented as

$\mathbf{g}_d = \mu\lambda_d \nabla D_d(\mathbf{f})\big|_{\mathbf{f}=\mathbf{f}_y}$   (16)

$\mathbf{g}_n = -\mu\lambda_e \nabla D_e(\mathbf{f})\big|_{\mathbf{f}=\mathbf{f}_y}$   (17)

where $\lambda_d$ and $\lambda_e$ are smoothing coefficients. From the assumption of (9), the gradient of $D_d$ with respect to $\hat{\mathbf{f}}$ becomes zero, and thus $\mathbf{g}_d = \mathbf{0}$. With $D_e(\mathbf{f}) = d^2(\mathbf{f}, \mathbf{f}_e)$, the right-hand side of (17) is derived from

$\nabla D_e(\mathbf{f})\big|_{\mathbf{f}=\mathbf{f}_y} = 2(\mathbf{f}_y - \mathbf{f}_e).$   (18)

Since the speech enhancement algorithm applied to the decoded speech sufficiently removes the noise component in the noisy speech, we can also assume that $\mathbf{n}_e$ is independent of $\mathbf{n}$. Therefore, from (17) and (18), we have

$\mathbf{g}_n = 2\mu\lambda_e(\mathbf{f}_e - \mathbf{f}_y)$   (19)

and, absorbing the constants into a single step size $\mu$ and substituting (19) into (12), we obtain

$\hat{\mathbf{f}} = \mathbf{f}_y + \mu(\mathbf{f}_e - \mathbf{f}_y).$   (20)

We have the following remarks on the smoothing coefficient and the filter stability.

1) Determination of $\mu$: By substituting (9) into (20), (20) can be rewritten in the form of a least-mean-square update (21). From the sufficient condition for stability of the least-mean-square algorithm [23], the step size must satisfy $0 < \mu < 2/\mathrm{tr}(\mathbf{R})$, where $\mathrm{tr}(\cdot)$ is the trace of a matrix whose size is given by the dimension $P$ of the LSP vector. Therefore, we set $\mu = 2/P - \epsilon$ for the fastest convergence, where $\epsilon$ is an infinitesimal number. In this paper, since the IS-641 speech coder provides a 10-dimensional LSP vector for representing the spectral envelope, $P$ is equal to 10 and thus $\mu$ is set to 0.2 in practice.

2) Filter Stability: We apply a stabilization procedure to $\hat{\mathbf{f}}$ by simply checking the ordering property of LSPs [24]. The stability of the synthesis filter constructed from $\hat{\mathbf{f}}$ is guaranteed if $0 < \hat{f}_1 < \hat{f}_2 < \cdots < \hat{f}_P < \pi$, where $\hat{f}_i$ is the $i$th element of $\hat{\mathbf{f}}$. Although $\mathbf{f}_y$, $\mathbf{f}_d$ and $\mathbf{f}_e$ each satisfy the ordering property, (20) might result in $\hat{f}_i \ge \hat{f}_{i+1}$ for some $i$, which makes the synthesis filter unstable. If $\hat{\mathbf{f}}$ is found to be unstable, we replace the vector with the original one; in other words, we set $\hat{\mathbf{f}} = \mathbf{f}_y$. By doing this, we can solve the stability problem of the feature enhancement algorithm.

Fig. 6. Average SD for LSPs under car and babble noise conditions, where dashed and solid lines indicate the SD prior to and after feature enhancement, respectively.

The performance of the feature enhancement algorithm described so far was measured by determining the degree to which the SD defined in (2) was reduced for the connected digit task in two noise conditions. Fig. 6 shows the SD after applying the feature enhancement algorithm to noisy speech. Compared with Fig. 3(a), the SD was not affected by the feature enhancement for the babble noisy speech, while the SD for the car noisy speech was improved. We therefore conclude that the feature enhancement algorithm of (20) should provide improved ASR accuracy.
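A per-frame sketch of the update in (20) together with the ordering-based stability check of remark 2) follows. The symbols and the fallback rule follow the text; the update form is our reconstruction of (20), so the exact published constants may differ.

```python
import numpy as np

def enhance_lsp(f_y, f_e, mu=0.2):
    """One-step LSP feature enhancement: move the bitstream LSP vector
    f_y toward the enhanced-speech LSP vector f_e with step size mu.
    If the result violates the LSP ordering property (which would make
    the synthesis filter unstable), fall back to f_y."""
    f_y = np.asarray(f_y, dtype=float)
    f_e = np.asarray(f_e, dtype=float)
    f_hat = f_y + mu * (f_e - f_y)
    # Stability: 0 < f_hat[0] < f_hat[1] < ... < f_hat[-1] < pi.
    ordered = (np.all(np.diff(f_hat) > 0.0)
               and f_hat[0] > 0.0 and f_hat[-1] < np.pi)
    return f_hat if ordered else f_y
```

With mu = 0.2 the update is a mild pull toward the enhanced spectrum, so a single frame with a badly estimated enhanced LSP vector cannot destabilize the feature stream.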


Fig. 7. Procedure of re-estimating the coder-specific parameters.
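The re-estimation procedure of Fig. 7 reduces, per subframe, to two least-squares projections: the long-term predictor gain is the projection of the enhanced residual onto its delayed copy, and the fixed codebook gain is the projection of the resulting error signal onto the decoded codevector. A sketch under those assumptions (the function names are ours, not the coder's):

```python
import numpy as np

def ltp_gain(subframe, delayed):
    """Open-loop long-term predictor gain g minimizing the energy of
    v[n] = subframe[n] - g * delayed[n] over one subframe, i.e., a
    least-squares projection onto the lag-delayed enhanced residual."""
    denom = float(np.dot(delayed, delayed))
    return float(np.dot(subframe, delayed)) / denom if denom > 0.0 else 0.0

def fixed_codebook_gain(error, codevector):
    """Fixed codebook gain re-estimated by projecting the decoded
    algebraic codevector onto the long-term prediction error signal."""
    denom = float(np.dot(codevector, codevector))
    return float(np.dot(error, codevector)) / denom if denom > 0.0 else 0.0
```

Both gains recover the true scale exactly when one signal is a scaled copy of the other, which is the sense in which the projections are optimal in the least-squares sense.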

B. Coder-Specific Parameters

In this subsection, we discuss enhancement procedures for the coder-specific parameters, which include the residual energy, the adaptive codebook gain parameter and the fixed codebook gain parameter. Two procedures are investigated: the direct assignment of enhanced parameters as estimates of the clean speech parameters, and the re-estimation of parameters using residual enhancement.

1) Direct Assignment: In addition to distorting the spectral envelope, the additive noise also blurs the voicing information used in the bitstream-based front-end. It was found in Section II that enhancing the noisy speech signal could improve the voicing information as well as the spectral information for some noise types. Let $\mathbf{c}_y$ and $\mathbf{c}_e$ be the coder-specific parameter vectors associated with the spectral vectors $\mathbf{f}_y$ and $\mathbf{f}_e$, respectively. One way to obtain the estimate of the clean speech coder-specific parameters is simply to assign the values obtained from the enhanced speech:

$\hat{\mathbf{c}} = \mathbf{c}_e.$   (22)

We refer to the enhancement of the coder-specific parameters using (22) as the direct assignment method.

2) Re-Estimation of Parameters Using Residual Enhancement: We consider a procedure for re-estimating the clean speech excitation parameters based on the block diagram in Fig. 7. Let $\hat{e}(n)$ be the enhanced residual signal obtained by applying the speech enhancement algorithm to the decoded residual signal. For the $k$th subframe, an error signal from a one-step long-term predictor can be written as

$v(n) = \hat{e}(n) - g_a\,\hat{e}(n - T_k), \quad n = 0, \ldots, N - 1$   (23)

where $N$ is the subframe size of 40 samples, $T_k$ is the transmitted long-term predictor lag of the $k$th subframe and $g_a$ is the desired long-term predictor gain for the $k$th subframe. The open-loop long-term predictor analysis is used to find the coefficient $g_a$ such that the energy of $v(n)$ over the subframe is minimized [25], [26]. Thus, $g_a$ is given by

$g_a = \frac{\sum_{n=0}^{N-1} \hat{e}(n)\,\hat{e}(n - T_k)}{\sum_{n=0}^{N-1} \hat{e}(n - T_k)^2}.$   (24)

Next, the fixed codebook gain is re-estimated by projecting the decoded fixed codebook signal, $c_k(n)$, onto the error signal $v(n)$ of (23). Consequently, the re-estimated fixed codebook gain, $g_c$, is given by

$g_c = \frac{\sum_{n=0}^{N-1} v(n)\,c_k(n)}{\sum_{n=0}^{N-1} c_k(n)^2}$   (25)

[27], where $c_k(n)$ is constructed from the pulse magnitudes and pulse positions of the $k$th subframe decoded from the algebraic codebook indices. From $\hat{e}(n)$, $g_a$ and $g_c$, we obtain the estimates of the clean speech parameters by following the procedure described in Section II-A. The enhancement of the coder-specific parameters with this approach is referred to as the re-estimation method.

Fig. 8(a)–(c) shows the correlation coefficients, as defined in (6) or (7), of the residual energy, the fixed codebook gain parameter and the adaptive codebook gain parameter plotted with respect to the input SNR. Separate curves are displayed in each plot for the car and babble noise types. The curves show correlation coefficients measured prior to feature enhancement (dashed lines), after applying the direct assignment method (dash-dotted lines) and after applying the re-estimation method (solid lines). It is clear from Fig. 8 that the two methods increased the correlation for the residual energy and the fixed codebook gain parameter, with the direct assignment method giving higher correlation than the re-estimation method. It is also clear from the figure that neither of the two methods increased the correlation for the adaptive codebook gain parameter at any SNR level. Consequently, no feature enhancement is applied to the adaptive codebook gain parameter in the following experiments.

C. SNR-Based Feature Enhancement

Fig. 8. Correlation coefficients for (a) residual energy, (b) fixed codebook gain parameter, and (c) adaptive codebook gain parameter under car and babble noise conditions, where dashed lines, dash-dotted lines and solid lines designate measurements made prior to feature enhancement, after applying the direct assignment method and after applying the re-estimation method, respectively.

At high input SNR, the speech enhancement algorithm has a tendency to degrade ASR performance [28]. Under some circumstances, it can introduce spectral distortion to the enhanced speech. This suggests that the feature enhancement procedure described in Section III would perform better if it were applied only in regions of low SNR. In other words, the parameter update given in (20) would be applied only to utterances where the estimated SNR (ESNR) is less than a predefined threshold, $\mathrm{SNR}_{th}$. Therefore, (20) is modified as

$$\hat{\omega} = \begin{cases} \tilde{\omega}, & \text{if } \mathrm{ESNR} < \mathrm{SNR}_{th} \\ \omega_y, & \text{otherwise} \end{cases} \qquad (26)$$

where $\tilde{\omega}$ denotes the enhanced spectral parameters given by the update in (20) and $\omega_y$ the parameters obtained from the noisy decoded speech. The same SNR-based criterion will also be used to determine when the coder-specific parameters are updated. The ESNR is obtained for an entire utterance as follows. As shown in Fig. 4, we let $y_t$ and $\hat{s}_t$ be the 160-dimensional decoded speech signal column vectors at the $t$th analysis frame before and after applying the speech enhancement algorithm, respectively. The ESNR for a sentence is estimated as

$$\mathrm{ESNR} = \frac{1}{M} \sum_{t=1}^{M} 10 \log_{10} \frac{\hat{s}_t^T \hat{s}_t}{(y_t - \hat{s}_t)^T (y_t - \hat{s}_t)} \qquad (27)$$

where $M$ is the total number of frames in the sentence. The above equation is identical to the standard definition of segmental SNR except that the enhanced signal, $\hat{s}_t$, is used in place of the clean speech. Fig. 9 shows the ESNR plotted against the input SNR. The difference between the ESNR and the input SNR corresponds to the gain of the speech enhancement algorithm at the decoder side of the speech coder. While Fig. 9 shows how the speech enhancement algorithm improves the SNR of noisy speech, it is well known that speech enhancement degrades the recognition performance of clean speech. This problem was addressed by setting $\mathrm{SNR}_{th}$ to an empirically determined value of 40 dB. As a result, feature enhancement was selectively applied to the noisy speech whose ESNR was less than 40 dB.

Fig. 9. Estimated SNR against input SNR. The difference between the two SNRs corresponds to the gain of the speech enhancement algorithm applied to the decoded speech.

IV. SPEECH RECOGNITION EXPERIMENTS

This section presents the results of an experimental study evaluating the performance of the feature enhancement algorithm in Section III. A comparison between the bitstream-based front-end described in Section II-A and the front-ends commonly used for ASR in wireline and wireless telecommunications scenarios is also performed. The section is composed of


three parts. Section IV-A describes the wireline and wireless front-ends and also describes a set of experiments where the ASR performance using these front-ends is compared to that of the bitstream-based system on a noisy connected digit recognition task. Section IV-B describes the performance of the feature enhancement algorithm on the same noisy connected digit task. Section IV-C presents the results of a similar study measuring large vocabulary ASR performance on a directory retrieval task from users in a mobile wireless domain.

A. Evaluation of Bitstream-Based Front-End on a Noisy Connected Digit Task

In this subsection, we first define the wireline and the wireless front-ends used in this work. Then, we evaluate the ASR performance of the bitstream-based front-end and these front-ends on a noisy connected digit string task. The task and the recognition system will be described in Section IV-A2. The performance comparison will be discussed in Section IV-A3.

1) Wireline and Wireless Front-Ends: The feature analysis algorithms in the wireline and wireless front-ends are identical, but the input to each front-end is different. The wireline front-end directly utilizes speech signals recorded over the public switched telephone network (PSTN) or recorded by a microphone. On the other hand, feature parameters in the wireless front-end are extracted from the decoded speech signals obtained from the IS-641 speech coder on the network side of the IS-136 digital cellular system. Therefore, we can estimate the effect of a wireless environment on the ASR performance by comparing the performance of the two front-ends, where speech coding distortion due to the IS-641 speech coder simulates the wireless environment. Similar to the bitstream-based front-end, each front-end provides both the LPC-cepstra based feature set and the mel-cepstra based feature set.
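The LPC-cepstral analysis chain used by these front-ends (preemphasis, Hamming windowing, autocorrelation analysis, the Levinson-Durbin recursion, and LPC-to-cepstrum conversion, as detailed in the following paragraph) can be sketched as below. This is an illustrative sketch, not the exact front-end implementation of this paper: the preemphasis coefficient of 0.95 and an 8-kHz sampling rate (so a 30-ms frame is 240 samples) are assumptions, and the bandpass lifter, CMS, energy term and delta features are omitted.

```python
import numpy as np

def levinson_durbin(r, order):
    # Solve the LPC normal equations from autocorrelation coefficients r[0..order].
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = np.dot(a[:i], r[i:0:-1])      # sum_{j<i} a[j] * r[i-j]
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # prediction error update
    return a, err

def lpc_to_cepstrum(a, n_cep):
    # Recursively convert A(z) = 1 + sum_k a[k] z^-k to the first n_cep
    # cepstral coefficients of 1/A(z).
    p = len(a) - 1
    c = np.zeros(n_cep + 1)
    for n in range(1, n_cep + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def lpc_cepstra(frame, order=10, n_cep=12, preemph=0.95):
    # Preemphasize, apply a Hamming window, and compute LPC-cepstral features.
    x = np.append(frame[0], frame[1:] - preemph * frame[:-1])
    x = x * np.hamming(len(x))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a, _ = levinson_durbin(r, order)
    return lpc_to_cepstrum(a, n_cep)
```

In a full front-end this analysis would be repeated every 10 ms (100-Hz frame rate) and the cepstra postprocessed with liftering, CMS and time differences.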
The performance of the LPC-cepstra and the mel-cepstra feature sets will be evaluated in connected digit string recognition and large vocabulary recognition, respectively. The detailed feature extraction procedure in the wireline front-end is described in [29]. Speech signals are preemphasized by using a first-order differentiator. After that, a Hamming window of length 30 ms is applied to each speech segment, and a linear prediction polynomial is obtained by using the autocorrelation method. The Levinson-Durbin recursion is applied to the autocorrelation coefficients to extract the LPC coefficients of order ten. For the LPC-cepstra based feature set, the ten LPC coefficients are converted into twelve cepstral coefficients, and then a bandpass cepstral lifter is applied to the cepstral coefficients. The analysis is repeated once every 10 ms, which results in a frame rate of 100 Hz. For the mel-cepstra based feature set, we extract twelve mel-LPC based cepstral coefficients instead of LPC cepstral coefficients. A feature vector is 39-dimensional, including 12 LPC-cepstral or mel-LPC cepstral coefficients postprocessed with CMS, a normalized logarithmic energy, and their first and second time differences. The first and second differences are computed over five- and three-frame windows, respectively. The feature analysis algorithm in the wireless front-end is identical to that of the wireline front-end described above. The

TABLE I COMPARISON OF WORD AND STRING ACCURACIES (%) BETWEEN THE BITSTREAM-BASED FRONT-END, THE WIRELINE FRONT-END AND THE WIRELESS FRONT-END FOR CONNECTED DIGIT STRINGS RECORDED BY A QUIET TELEPHONE HANDSET

only difference is that the input to the wireless front-end is the speech signal reconstructed by the IS-641 decoder. It also produces both the LPC-cepstra and the mel-cepstra based feature sets.

2) Connected Digit Database and Recognition System: In this part, we explain the connected digit string database and the speech recognition system. We used the same hidden Markov model (HMM) structure and subword models described in [11], [15]. Each digit was modeled by a set of left-to-right continuous density HMMs. In this task, we used a total of 274 context-dependent subword models, which were trained by maximum likelihood estimation. Subword models had a head-body-tail structure: the head and tail models were represented with three states and the body models were represented with four states. Each state had eight Gaussian mixtures. Silence was modeled by a single state with 32 Gaussian mixtures. As a result, the recognition system had 274 subword HMMs, 831 states and 6672 mixtures. The training set and the test set consisted of 9766 and 1584 digit strings, respectively, recorded over the PSTN. The length of all digit strings for the testing was 14. The recognition experiments were done with an unknown-length grammar. The word recognition accuracy was computed by counting insertion, deletion and substitution errors.

3) Discussion: We discuss the ASR performance of the front-ends on the connected digit string recognition task. The performance degradation will be shown for each front-end under car and babble noise conditions. After that, the effect of the MMSE-LSA speech enhancement algorithm, applied directly to the input speech, is evaluated. First, we evaluated the word and string accuracies of the bitstream-based, wireline and wireless front-ends and showed their performance in Table I.
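The word recognition accuracy above is obtained by aligning each recognized string to its reference and counting substitution, deletion and insertion errors. A minimal sketch of such an alignment-based scorer (illustrative only, not the scoring tool used in this work):

```python
def count_word_errors(ref, hyp):
    """Align hypothesis to reference with dynamic programming and
    count substitution, deletion, and insertion errors."""
    R, H = len(ref), len(hyp)
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = i                      # all reference words deleted
    for j in range(H + 1):
        dp[0][j] = j                      # all hypothesis words inserted
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                           dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1)      # insertion
    # Backtrace the optimal alignment to classify the errors.
    subs = dels = ins = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

def word_accuracy(ref, hyp):
    # Word accuracy = (N - S - D - I) / N for N reference words.
    s, d, i = count_word_errors(ref, hyp)
    return 1.0 - (s + d + i) / len(ref)
```

String accuracy would simply count the fraction of test strings recognized with zero errors.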
From the results, we found that the bitstream-based front-end reduced the word error rate by 22% compared with the wireless front-end, and it also provided statistically comparable performance to the wireline front-end. Second, we evaluated the recognition performance of the front-ends under noisy conditions. We considered two types of background noise signals: car and babble noise. Each noise signal was added to the test digit strings, and the level of additive noise was chosen to obtain segmental SNRs ranging from 0 to 30 dB. The recognition rate was obtained by averaging the results for the car and babble noise speech. Table II shows the recognition accuracies of the three front-ends as a function of segmental SNR for the connected digit strings. At low SNRs, the recognition performance was degraded severely for all the


TABLE II COMPARISON OF WORD ACCURACIES (%) BETWEEN THE BITSTREAM-BASED FRONT-END, THE WIRELINE FRONT-END AND THE WIRELESS FRONT-END FOR CONNECTED DIGIT STRINGS IN NOISY ENVIRONMENTS WHERE THE WORD ACCURACY WAS OBTAINED BY AVERAGING THE RESULTS UNDER CAR AND BABBLE NOISE CONDITIONS

TABLE III COMPARISON OF WORD ACCURACIES (%) BETWEEN THE BITSTREAM-BASED FRONT-END, THE WIRELINE FRONT-END AND THE WIRELESS FRONT-END FOR CONNECTED DIGIT STRINGS IN NOISY ENVIRONMENTS, WHERE A SPEECH ENHANCEMENT ALGORITHM WAS APPLIED TO THE INPUT SPEECH AND THE WORD ACCURACY WAS OBTAINED BY AVERAGING THE RESULTS UNDER CAR AND BABBLE NOISE CONDITIONS

front-ends. While the performance of the wireless front-end was worse than the others at all SNRs, the performance of the bitstream-based front-end was comparable to or better than that of the wireline front-end except at 0 dB SNR. Next, we applied the speech enhancement algorithm to the noisy speech, so that the noisy speech signals were enhanced prior to speech encoding for both the bitstream-based front-end and the wireless front-end. Table III shows the recognition performance under this condition. Compared with the results in Table II, we found that the speech enhancement algorithm improved recognition performance at low SNR, but it lowered the recognition accuracies at high SNR. Interestingly, the bitstream-based front-end gave better word accuracies than the other front-ends by incorporating the speech enhancement at the encoder side of the speech coder.

B. Evaluation of Feature Enhancement Algorithm on a Noisy Connected Digit Task

We evaluated the ASR performance of the feature enhancement algorithm in Section III by applying the algorithm to the bitstream-based front-end. Three components of the feature enhancement algorithm were evaluated sequentially: spectral parameter enhancement, coder-specific parameter enhancement and SNR-based enhancement, where the direct assignment method and the re-estimation method were applied separately. Table IV shows the word accuracies of the bitstream-based front-end at different SNRs. In the table, baseline means that we did not apply the feature enhancement algorithm, and thus the results are identical to those in the third row of Table II. We first applied the spectral parameter enhancement of (20) to the noisy speech (denoted as LSP). However, the


TABLE IV WORD ACCURACIES (%) OF THE BITSTREAM-BASED FRONT-END WITH AND WITHOUT FEATURE ENHANCEMENT FOR CONNECTED DIGIT STRINGS IN NOISY ENVIRONMENTS, WHERE BASELINE MEANS THE FRONT-END WITHOUT FEATURE ENHANCEMENT

improvement in word accuracy was small except for the 0 dB SNR case, and the word accuracy under the clean condition decreased under this scenario. Next, we combined the coder-specific parameter enhancement algorithm with the spectral parameter enhancement. Comparing the direct assignment method (denoted as DA) with the re-estimation method (denoted as RE), the DA method was better than the RE method for connected digit string recognition. In order to investigate the reason for this result, we performed separate comparisons of the DA and RE methods under car and babble noise conditions. It was revealed that the DA method gave better performance than the RE method for the noisy speech corrupted by car noise, while the RE method was slightly better than the DA method under the babble noise condition. When the feature enhancement algorithm was applied to both spectral parameters and coder-specific parameters, the word error rates were reduced by approximately 20% and 30% at 20 dB SNR and 10 dB SNR, respectively.³ In order to prevent the performance on clean speech from being degraded after applying the feature enhancement algorithm, the SNR-based approach described in Section III-C was applied. As shown in the fifth and sixth rows of Table IV, the SNR-based feature enhancement methods (denoted as LSP+DA+SNR and LSP+RE+SNR) restored the recognition performance to the baseline accuracy under the clean condition, and the word accuracies below 30 dB SNR were the same as those obtained by LSP+DA or LSP+RE. We also compared the performance of the speech enhancement algorithm applied at the encoder (the third row of Table III) with the performance of LSP+DA+SNR. At SNRs above 20 dB, we found that the feature enhancement algorithm gave performance comparable to the bitstream-based front-end with speech enhancement at the encoder side.

³A significance test showed that the improvement of the feature enhancement algorithm over the baseline was statistically significant.

C. Performance Evaluation of Feature Enhancement Algorithm on Large Vocabulary Recognition

In this subsection, we performed the same experiments described in Sections IV-A and IV-B for a large vocabulary recognition task. The recognition task and the recognition system will be explained in Section IV-C1. In Section IV-C2, we will compare the ASR performance of the bitstream-based front-end with


TABLE V COMPARISON OF WORD ACCURACIES (%) BETWEEN THE BITSTREAM-BASED FRONT-END, THE WIRELINE FRONT-END AND THE WIRELESS FRONT-END FOR THREE DATASETS RECORDED BY A CLOSE-TALKING MICROPHONE IN AN OFFICE ENVIRONMENT

those of the wireline and wireless front-ends. Also, we will discuss the ASR performance of the bitstream-based front-end with the feature enhancement algorithm.

1) Database and Recognition System: For large-vocabulary speech recognition, continuous density context-dependent HMMs were trained from a set of 16 538 utterances of personal names that were also recorded over the PSTN. A top-down decision-tree based method was used for clustering HMM states. The resulting model consisted of 7731 phone units and approximately 7500 states with a maximum of six Gaussian mixtures per state, where each phone was modeled by a three-state left-to-right HMM. The test data for the large vocabulary task consisted of utterances collected from AT&T Labs employees interacting with a database retrieval application in an office environment [30], [31]. A total of 10 496 utterances were collected from 50 speakers. Each utterance was spoken simultaneously through a close-talking head-mounted microphone and a far-field microphone located approximately 0.75 m from the speaker. A subset of the data consisting of isolated utterances of first names, last names and work locations for a 3500-employee directory was considered. The number of utterances for each of the three datasets was 2893, 2947, and 2005 for the first-name, last-name, and work location datasets, respectively. Also, the vocabulary size for the grammar was 1902, 2978, and 62 for the first-name, last-name and work location datasets, respectively. The data from the close-talking microphone were used for the experiments simulating the clean condition. Instead of adding background noise signals to the test data as in the connected digit string recognition, we used the data recorded by the far-field microphone to simulate noisy conditions.

2) Discussion: The experimental results for the bitstream-based front-end and the feature enhancement algorithm on the large vocabulary task will be presented as they were in Sections IV-A3 and IV-B for the connected digit task.
Table V shows the word accuracies of the bitstream-based, wireline and wireless front-ends for the three datasets in the large vocabulary recognition task. The relative performance of the front-ends showed the same tendency as in the connected digit string task. In other words, we found that the bitstream-based front-end gave better recognition accuracy than the wireless front-end, and it also provided performance comparable to the wireline front-end. We then evaluated the recognition performance of the front-ends using the speech data recorded by the far-field microphone in an office environment; the results are shown in Table VI. Compared with the close-talking microphone speech recognition results shown in Table V, we can see the performance degradation caused by the use of the far-field microphone.

TABLE VI COMPARISON OF WORD ACCURACIES (%) BETWEEN THE BITSTREAM-BASED FRONT-END, THE WIRELINE FRONT-END AND THE WIRELESS FRONT-END FOR THREE DATASETS RECORDED BY A FAR-FIELD MICROPHONE IN AN OFFICE ENVIRONMENT

Next, we applied the MMSE-LSA speech enhancement algorithm to the far-field microphone speech signals and evaluated the performance of the front-ends. Table VII shows the recognition performance after applying the speech enhancement algorithm; the bitstream-based front-end was applied after the enhanced speech signals were processed by the IS-641 speech coder. By comparing the results shown in Tables VI and VII, it is clear that, although the far-field microphone degraded ASR performance, applying speech enhancement to the input speech resulted in a significant improvement. This was especially true for the bitstream-based front-end.

TABLE VII COMPARISON OF WORD ACCURACIES (%) BETWEEN THE BITSTREAM-BASED FRONT-END, THE WIRELINE FRONT-END AND THE WIRELESS FRONT-END FOR THREE DATASETS RECORDED BY A FAR-FIELD MICROPHONE IN AN OFFICE ENVIRONMENT, WHERE A SPEECH ENHANCEMENT ALGORITHM WAS APPLIED TO THE INPUT SPEECH

Finally, we applied the feature enhancement algorithm to the far-field microphone speech. Table VIII shows the word accuracies of the bitstream-based front-end for the three datasets recorded by the far-field microphone. The spectral parameter enhancement reduced the word error rates by 21%, 23%, and 77% for the first-name, last-name, and location datasets, respectively. Comparing the coder-specific enhancement methods, the RE method was better than the DA method, which agrees with the result under the babble noise condition in the connected digit string recognition mentioned in Section IV-A3. The performance of the SNR-based feature enhancement algorithm is shown in the last two rows of the table; however, it did not give any further improvement in word accuracy. In the SNR-based approach, we set the threshold to 40 dB, but the average SNR for the far-field microphone speech was measured as 23.5 dB [30]. Therefore, the enhancement algorithms were never actually deactivated by the threshold-based procedure.

TABLE VIII WORD ACCURACIES (%) OF THE BITSTREAM-BASED FRONT-END WITH AND WITHOUT FEATURE ENHANCEMENT ALGORITHM FOR THREE DATASETS RECORDED BY A FAR-FIELD MICROPHONE IN AN OFFICE ENVIRONMENT, WHERE BASELINE MEANS THE FRONT-END WITHOUT FEATURE ENHANCEMENT

V. CONCLUSIONS

We have proposed a feature enhancement algorithm for wireless speech recognition in adverse environments. The feature enhancement algorithm was realized by incorporating the MMSE-LSA speech enhancement algorithm within a gradient-descent based parameter update procedure on the network side of the wireless communications system. This procedure utilized feature parameters obtained from the decoded speech and the enhanced version of the decoded speech. Moreover, the coder-specific parameters, such as the residual energy, the fixed codebook gain parameter and the adaptive codebook gain parameter, could also be re-estimated by applying the speech enhancement algorithm to the decoded residual signal. Additionally, the algorithm was selectively applied depending on the estimated SNR. First of all, we performed connected digit HMM recognition experiments to show the effectiveness of the proposed feature enhancement algorithm. In a clean environment, the bitstream-based front-end gave better recognition performance than the wireline front-end and the wireless front-end using the decoded speech. However, it degraded the recognition performance under severe noisy conditions. By incorporating the proposed feature enhancement algorithm into the bitstream-based front-end, we could obtain improved recognition performance for speech in noisy environments without hurting the performance obtained for speech in a clean environment.
Next, we performed large vocabulary recognition experiments, where speech data were simultaneously recorded by both a close-talking microphone and a far-field microphone and the recognizer was trained only with telephone-line speech. For the close-talking microphone speech, the bitstream-based front-end provided performance that was comparable to the wireline front-end and better than the wireless front-end. On the other hand, the performance of the bitstream-based front-end degraded severely for the far-field microphone speech, but incorporating the feature enhancement algorithm greatly improved it.

ACKNOWLEDGMENT

The authors would like to thank Dr. H.-G. Kang for his preparation of an enhancement module, and the anonymous reviewers, whose valuable comments and suggestions led to improvements in this paper.

REFERENCES

[1] J.-C. Junqua and J.-P. Haton, Robustness in Automatic Speech Recognition. Norwell, MA: Kluwer, 1996.


[2] V. K. Varma, “Testing speech coders for usage in wireless communications systems,” in Proc. IEEE Workshop on Speech Coding for Telecommunications, QC, Canada, Oct. 1993, pp. 93–94.
[3] A. Gallardo-Antolin, F. Diaz-de-Maria, and F. Valverde-Albacete, “Avoiding distortions due to speech coding and transmission errors in GSM ASR tasks,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Phoenix, AZ, Mar. 1999, pp. 277–280.
[4] B. T. Lilly and K. K. Paliwal, “Effect of speech coders on speech recognition performance,” in Proc. Int. Conf. Spoken Language Processing, Philadelphia, PA, Oct. 1996, pp. 2344–2347.
[5] C. Mokbel, L. Mauuary, L. Karray, D. Jouvet, J. Monne, J. Simonin, and K. Bartkova, “Towards improving ASR robustness for PSN and GSM telephone applications,” Speech Commun., vol. 23, no. 1–2, pp. 141–159, Oct. 1997.
[6] C.-H. Lee, “On stochastic feature and model compensation approaches to robust speech recognition,” Speech Commun., vol. 25, no. 1–3, pp. 29–47, Aug. 1998.
[7] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 113–120, Apr. 1979.
[8] M. Berouti, R. Schwartz, and J. Marhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Washington, DC, Apr. 1979, pp. 208–211.
[9] P. Lockwood and J. Boudy, “Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars,” Speech Commun., vol. 11, no. 2–3, pp. 215–228, June 1992.
[10] J. Huang and Y. Zhao, “An energy-constrained signal subspace method for speech enhancement and recognition in white and colored noise,” Speech Commun., vol. 26, no. 3, pp. 165–181, Nov. 1998.
[11] M. Rahim, B. H. Juang, W. Chou, and E. Buhrke, “Signal conditioning techniques for robust speech recognition,” IEEE Signal Processing Lett., vol. 3, pp. 107–109, Apr. 1996.
[12] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech Audio Processing, vol. 2, pp. 578–589, Oct. 1994.
[13] J. M. Huerta and R. M. Stern, “Speech recognition from GSM codec parameters,” in Proc. Int. Conf. Spoken Language Processing, Sydney, Australia, Nov. 1998, pp. 1463–1466.
[14] H. K. Kim and R. V. Cox, “Bitstream-based feature extraction for wireless speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Istanbul, Turkey, June 2000, pp. 1207–1210.
[15] ——, “A bitstream-based front-end for wireless speech recognition on IS-136 communications system,” IEEE Trans. Speech Audio Processing, vol. 9, pp. 558–568, July 2001.
[16] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443–445, Apr. 1985.
[17] A. J. Accardi and R. V. Cox, “A modular approach to speech enhancement with an application to speech coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Phoenix, AZ, Mar. 1999, pp. 201–204.
[18] D. Malah, R. V. Cox, and A. J. Accardi, “Tracking speech-presence uncertainty to improve speech enhancement in nonstationary noise environments,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Phoenix, AZ, Mar. 1999, pp. 789–792.
[19] R. Martin and R. V. Cox, “New speech enhancement techniques for low bit rate speech coding,” in Proc. 1999 IEEE Workshop on Speech Coding, Porvoo, Finland, June 1999, pp. 165–167.
[20] T. Honkanen, J. Vainio, K. Järvinen, P. Haavisto, R. Salami, C. Laflamme, and J.-P. Adoul, “Enhanced full rate speech codec for IS-136 digital cellular system,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Munich, Germany, Apr. 1997, pp. 731–734.
[21] B. H. Juang, L. R. Rabiner, and J. G. Wilpon, “On the use of bandpass liftering in speech recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 947–954, July 1987.
[22] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 bits/frame,” IEEE Trans. Speech Audio Processing, vol. 1, pp. 3–14, Jan. 1993.
[23] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[24] G. S. Kang and L. J. Fransen, “Application of line spectrum pairs to low bit rate speech coders,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Tampa, FL, Mar. 1985, pp. 7.3.1–7.3.4.
[25] R. P. Ramachandran and P. Kabal, “Pitch prediction filters in speech coding,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 467–478, Apr. 1989.


[26] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communications Systems. Chichester, U.K.: Wiley, 1994.
[27] M. A. Ramirez and M. Gerken, “Joint position and amplitude search of algebraic multipulses,” IEEE Trans. Speech Audio Processing, vol. 8, pp. 633–637, Sept. 2000.
[28] C. Kermorvant, “A comparison of noise reduction techniques for robust speech recognition,” IDIAP-RR 99-10, 1999.
[29] C.-H. Lee, E. Giachin, L. R. Rabiner, R. Pieraccini, and A. E. Rosenberg, “Improved acoustic modeling for large vocabulary continuous speech recognition,” Comput. Speech Lang., vol. 6, pp. 103–127, Apr. 1992.
[30] B. Gajic and R. C. Rose, “Hidden Markov model environmental compensation for automatic speech recognition on hand-held mobile devices,” in Proc. Int. Conf. Spoken Language Processing, Beijing, China, Oct. 2000, pp. 405–408.
[31] R. C. Rose, S. Parthasarathy, B. Gajic, A. E. Rosenberg, and S. Narayanan, “On the implementation of ASR algorithms for hand-held wireless mobile devices,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Salt Lake City, UT, May 2001, pp. 17–20.

Hong Kook Kim (M’99–SM’01) received the B.S. degree in control and instrumentation engineering from Seoul National University, Seoul, Korea, in 1988, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1990 and 1994, respectively. While pursuing the Ph.D. degree, he also worked with Samsung Electronics Co., Ltd., Korea (from 1991 to 1993) and the Samsung Advanced Institute of Technology (SAIT) (from 1993 to 1994), developing a voice dialing system for a digital keyphone and implementing speech coding and speech synthesis algorithms. From 1994 to February 1998, he was a Member of Technical Staff with SAIT, where he developed speech coders operating at 4 kbit/s. In 1998, he was a Senior Researcher with MMC Technology, Inc., and a Research Scientist with KAIST, where he developed a voice dialing system operating in a mobile environment. From December 1998 to May 2000, he was with AT&T Labs-Research, Florham Park, NJ, as a Consultant in wireless speech recognition. Since June 2000, he has been a Senior Technical Staff Member with the Voice Enabled Services Research Lab, AT&T Labs-Research. His current research interests include speech analysis for wireless speech recognition and coding, speech recognition in noisy environments, and speech coding for the Internet and its interoperability in communications networks.

Richard V. Cox (S’69–M’70–SM’87–F’91) received the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ. In 1979, he joined the Acoustics Research Department of Bell Laboratories. He has conducted research in the areas of speech coding, digital signal processing, analog voice privacy, audio coding, real-time implementations, speech recognition, and speech enhancement. He is well known for his work in speech coding standards. He collaborated on the low-delay CELP algorithm that became ITU-T Recommendation G.728 in 1992. He managed the ITU effort that resulted in the creation of ITU-T Rec. G.723.1 in 1995. In 1987, he was promoted to Supervisor of the Digital Principles Research Group. In 1992, he was appointed Department Head of the Speech Coding Research Department of AT&T Bell Labs. In 1996, he joined AT&T Labs as Division Manager of the Speech Processing Software and Technology Research Department. In 1999, he was awarded the AT&T Science and Technology Medal. In August 2000, he was appointed Speech and Image Processing Services Research Vice-President. In February 2002, as part of a reorganization, he was appointed Voice Enabled Services Research Vice-President. Dr. Cox is a Fellow of the IEEE and is currently President of the IEEE Signal Processing Society. He recently completed nine years as a member of the Board of Directors of Recording for the Blind & Dyslexic, the only U.S. provider of textbooks and reference books for people with print disabilities. At RFB&D, he is continuing to help lead the effort to develop digital books combining audio, text, images and graphics for their consumers. These “multimedia books” became generally available in September 2002 for RFB&D K-14 students throughout the U.S.

Richard C. Rose (S’84–M’86–SM’00) received the B.S. and M.S. degrees from the Electrical and Computer Engineering Department, University of Illinois. He received the Ph.D. degree from what is now the Center for Signal and Image Processing (CSIP) at the Georgia Institute of Technology, Atlanta, in 1988, with a thesis on speech coding and speech analysis. From 1980 to 1984, he was with Bell Laboratories, now a division of Lucent Technologies, working on signal processing and digital switching systems. From 1988 to 1992, he was a member of the Speech Systems and Technology group, now called the Information Systems Technology Group, at the Massachusetts Institute of Technology (MIT) Lincoln Laboratory, working on speech recognition and speaker recognition. He has been with AT&T since 1993 and is currently a member of the Voice Enabled Services Research Laboratory at AT&T Labs-Research, Florham Park, NJ. He spent the spring of 1996 at the Furui Laboratory at NTT in Tokyo. He is the author of over 60 refereed papers and is an inventor on 13 patents and patent applications. Dr. Rose served as a member of the IEEE Signal Processing Society Technical Committee on Digital Signal Processing from 1990 to 1995 and was on the organizing committee of the 1990 and 1992 DSP workshops. He has served as an adjunct faculty member of the Georgia Institute of Technology. He was elected as an At Large Member of the Board of Governors for the Signal Processing Society from 1995 to 1997 and served as membership coordinator during that time. He was an Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING from 1997 to 1998 and is currently a member of the IEEE SPS Speech Technical Committee. He is a member of Tau Beta Pi, Eta Kappa Nu, and Phi Kappa Phi.
