Static and Dynamic Variance Compensation for Recognition of Reverberant Speech With Dereverberation Preprocessing

Marc Delcroix, Member, IEEE, Tomohiro Nakatani, Senior Member, IEEE, and Shinji Watanabe, Member, IEEE

Manuscript received March 10, 2008; revised September 20, 2008. Current version published February 11, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark Hasegawa-Johnson. The authors are with NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2008.2010214

Abstract—The performance of automatic speech recognition is severely degraded in the presence of noise or reverberation. Much research has been undertaken on noise robustness. In contrast, the problem of recognizing reverberant speech has received far less attention and remains very challenging. In this paper, we use a dereverberation method to reduce reverberation prior to recognition. Such a preprocessor may remove most reverberation effects. However, it often introduces distortion, causing a dynamic mismatch between the speech features and the acoustic model used for recognition. Model adaptation could be used to reduce this mismatch. However, conventional model adaptation techniques assume a static mismatch and may therefore not cope well with a dynamic mismatch arising from dereverberation. This paper proposes a novel adaptation scheme that is capable of managing both static and dynamic mismatches. We introduce a parametric model for variance adaptation that includes static and dynamic components in order to realize an appropriate interconnection between dereverberation and a speech recognizer. The model parameters are optimized using adaptive training implemented with the Expectation Maximization algorithm. An experiment using the proposed method with reverberant speech for a reverberation time of 0.5 s revealed that it was possible to achieve an 80% relative error rate reduction compared with the recognition of dereverberated speech (word error rate of 31%); the final error rate of 5.4% was obtained by combining the proposed variance compensation and MLLR adaptation.

Index Terms—Dereverberation, model adaptation, robust automatic speech recognition (ASR), variance compensation.

I. INTRODUCTION

THE deployment of automatic speech recognition (ASR)-based products has been limited due in part to the lack of robustness of current systems to noise and reverberation. The noise robustness problem has attracted much attention, and many solutions have been proposed, including feature-based approaches [1], [2] and model-based approaches [3]–[6]. The recognition of reverberant speech has received less attention, although it is essential if we are to achieve the recognition of distant speech occurring, for example, in a
mitigated. However, it is expensive to prepare appropriate models for all conditions. In addition, only static mismatches can be compensated by optimizing the acoustic model parameters.

Recently, there have been several proposals suggesting the use of feature reliability information to improve the performance of ASR systems that employ speech enhancement preprocessing [19]–[24]. The idea consists of focusing on reliable feature components during decoding. As an example, dynamic variance compensation increases the model variance for unreliable feature components. In [19], a substantial ASR improvement was reported when the feature variance could be obtained in an oracle situation, i.e., where the feature variance of the enhanced speech features was calculated from the enhanced features and clean speech features that were known a priori. We call this feature variance the "oracle variance." However, the performance was much poorer when an estimated feature variance was used instead of the oracle variance. There have been several proposals regarding the estimation of the variance of enhanced features [19], [20], but the methods are usually dependent on the speech enhancement algorithm and may therefore not be directly usable with a dereverberation preprocessor. For example, in [19], the feature variance is derived from a speech enhancement method based on a Gaussian mixture model of clean speech. When dealing with a dereverberation preprocessor, the feature variance could be approximated as, for example, an estimated mismatch (given by the distance between an enhanced feature and an observed reverberant feature), as with the blind source separation (BSS) preprocessor of [20]. However, the estimated variance may be far from the oracle variance, which can result in unsatisfactory levels of performance.

In order to interconnect a dereverberation preprocessor and a speech recognizer to ensure good performance, we propose introducing a dynamic variance compensation scheme into a static adaptive training framework [25]. We design a novel parametric model for variance compensation that includes static and dynamic components. The dynamic component can be derived from the dereverberation preprocessor output as an approximated mismatch. The static adaptation is realized by weighting the acoustic model variances as in conventional static variance adaptation [26]. The model parameters are optimized using an adaptive training approach derived from the expectation maximization (EM) algorithm. Therefore, it is possible to mitigate the mismatch between the oracle and the estimated feature variances. Consequently, we propose a new approach for combining feature-based and model-based approaches within an adaptive training framework, in order to increase the robustness of ASR to reverberation. Moreover, the proposed variance adaptation method can be combined with conventional mean adaptation techniques such as maximum likelihood linear regression (MLLR) to further reduce the mismatch. It is important to note that we do not adopt any method for adapting the mean parameters of the acoustic model in a dynamic manner, namely frame by frame, in order to reduce the dynamic mismatch. We assume that such compensation should be accomplished by the dereverberation preprocessor.

In this paper, we focus on the recognition of reverberant speech, whereas [25] aimed at a more general presentation
of the concepts of static and dynamic variance compensation. Consequently, this paper discusses the entire recognition system for reverberant speech in more detail. Moreover, we have also included experiments and discussions of practical considerations, such as the convergence of the EM algorithm, the distance between the speaker and the microphone, and the extension to unsupervised adaptation, to show the method's potential for real-life applications.

The organization of the paper is as follows. In Section II, we provide an overview of the recognition system and briefly review the principles of its two main steps, namely the dereverberation preprocessor and variance compensation. In Section III, we introduce our proposed method for variance compensation, which is based on a parametric model of the variance, and show how the parameters can be estimated using an adaptive training scheme. In Section IV, we show simulation results that attest to the significant improvement brought about by the proposed method for the recognition of reverberant speech; Section IV reports supervised adaptation results. Section V shows that the method can easily be extended to unsupervised adaptation, and also presents results obtained under different recording conditions. Finally, we conclude the paper and discuss some future research directions.

II. OVERVIEW OF RECOGNITION SYSTEM

Fig. 1. Schematic diagram of the recognition system for reverberant speech.

Fig. 1 is a schematic diagram of the proposed ASR system for the recognition of reverberant speech. First, a speech dereverberation preprocessor is used to remove the effect of reverberation from the observed signal. Even if reverberation is significantly reduced, a mismatch remains between the dereverberated speech features and the clean speech features used for training the acoustic model. Therefore, we use variance compensation to mitigate the mismatch. Let us briefly recall the principles behind each of these steps.

A. Conventional Speech Recognizer

Recognition is usually achieved by finding a word sequence that maximizes a likelihood function as

$$\hat{W} = \arg\max_{W}\, p(X \mid W)\, p(W) \qquad (1)$$

where $t$ is a frame index, $X = (x_1, \ldots, x_T)$ is a sequence of speech features $x_t$, and $p(W)$ is a language model. Speech is
modeled using a hidden Markov model (HMM) with the state density modeled by a Gaussian mixture (GM)

$$p(x_t \mid j) = \sum_{k=1}^{M} w_{j,k}\, \mathcal{N}\!\left(x_t;\, \mu_{j,k},\, \Sigma_{j,k}\right) \qquad (2)$$

where $j$ is the state index, $k$ is the Gaussian mixture component index, $M$ is the number of Gaussian mixtures, $w_{j,k}$ is a mixture weight, and $\mu_{j,k}$ and $\Sigma_{j,k}$ are a mean vector and a covariance matrix, respectively. In the following, we consider diagonal covariance matrices and denote the diagonal elements of $\Sigma_{j,k}$ by $\sigma^2_{j,k,d}$, where $d$ is the feature dimension index. The parameters of the acoustic model are trained with clean speech data. In practice, the speech features used for recognition may differ from the clean speech features used for training because of noise, reverberation, or distortions induced by speech enhancement preprocessing. Here, we focus on the problem of reverberation. To reduce the effect of reverberation, we use a dereverberation method for preprocessing.

B. Dereverberation Based on Late Reverberation Energy Suppression

Let us briefly review the speech dereverberation method used for preprocessing. Reverberant speech $y(n)$ is usually modeled as the convolution of clean speech $s(n)$ with a room impulse response $h(n)$ as

$$y(n) = h(n) * s(n) \qquad (3)$$

where $*$ is a convolution operator and $n$ is a discrete time index. The room impulse response $h(n)$ represents the multipath propagation of speech caused by the reflections of sounds from surfaces in the room. The reverberation time (RT60) of the room impulse response is typically several hundred milliseconds in usual living spaces [27]. Let us divide the room impulse response into two parts as

$$h(n) = h_e(n) + h_l(n) \qquad (4)$$

where $h_e(n)$ represents early reflections, which consist of the direct path and all the reflections arriving at the microphone within 30 ms of the direct sound, and $h_l(n)$ represents late reflections, which consist of all later reflections. Consequently, we can rewrite the microphone signal as

$$y(n) = h_e(n) * s(n) + h_l(n) * s(n). \qquad (5)$$

In this paper, we set the duration of the early reflections at the same length as the speech analysis frame used for recognition [16]. Although early reflections generally modify the spectral shape of the speech signal, they can be compensated to a great degree by current ASR systems based on, for example, cepstral mean normalization (CMN) [28]. In contrast, the later parts of the reverberation, namely late reflections, fall outside the spectral analysis frame, and as a result past frames with a certain attenuation are added to the current frame. The induced
distortion, i.e., $h_l(n) * s(n)$, is thus nonstationary and cannot be handled by conventional techniques. Many researchers have reported that late reflections are the main cause of the severe degradation in ASR performance when recognizing reverberant speech [7], [12]. Accordingly, we assume that the main role of the dereverberation preprocessor is to eliminate as far as possible all reverberation components corresponding to the late reflections.

In this paper, we use a dereverberation method that focuses on late reverberation removal. As an example, we adopt the method that was introduced in [16], [17]. Fig. 2 is a schematic diagram of the dereverberation preprocessor. The dereverberation method consists of two steps. First, the late reverberation, $y_l(n) = h_l(n) * s(n)$, is estimated using multistep linear prediction. We can show that an estimate $\hat{y}_l(n)$ of the late reverberation can be approximated by a convolution of the observed reverberant speech with a multistep delay linear prediction filter $c(n)$ as

$$\hat{y}_l(n) = c(n) * y(n). \qquad (6)$$

The coefficients $\mathbf{w} = [w_0, \ldots, w_{L-1}]^T$ of the filter can be obtained by solving a normal equation given as

$$\mathbf{w} = \left\langle \mathbf{y}_D(n)\, \mathbf{y}_D(n)^T \right\rangle^{+} \left\langle \mathbf{y}_D(n)\, y(n) \right\rangle \qquad (7)$$

where $\mathbf{y}_D(n) = [y(n-D), \ldots, y(n-D-L+1)]^T$, $D$ is a delay of around 30 ms, $L$ is the filter order, and $\langle\cdot\rangle$ is a time averaging operator. $(\cdot)^{+}$ indicates the Moore–Penrose pseudoinverse. The derivation of this result is given in [16]. Once we calculate the prediction coefficients $\mathbf{w}$, the prediction filter $c(n)$ is obtained by adding $D$ zeros at the beginning of $\mathbf{w}$, and the late reflections are obtained with (6). Then, the estimated late reflections $\hat{y}_l(n)$ are subtracted from the observed signal by using a spectral subtraction technique [29] in the short-time power spectral domain, and the dereverberated signal $\hat{s}(n)$ is resynthesized based on the overlap–add synthesis technique by substituting the phase of the observed signal for that of the dereverberated signal. Note that the method can be extended to a multimicrophone case [17], but here we consider the more challenging single-microphone case.

This dereverberation method may remove late reverberation well. It is robust and has low computational complexity. However, due to imperfect estimation of the late reverberation and
the use of spectral subtraction, distortions arise that prevent any great improvement in ASR performance. Therefore, there is a dynamic mismatch between the clean feature and the dereverberated feature. In Section II-C, we investigate the use of dynamic variance compensation to mitigate the effect of such a mismatch.
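To make the processing chain concrete, the following Python sketch outlines the two steps described above: multistep delayed linear prediction followed by spectral subtraction with the observed phase. It is an illustrative reconstruction, not the implementation used in this paper; the function and parameter names, the default filter order (the experiments later use 4000 taps), and the STFT settings are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft, lfilter

def dereverberate(y, fs, filter_order=1000, delay_ms=30, frame_ms=30):
    """Late-reverberation suppression sketch: multistep LPC + spectral subtraction."""
    D = int(fs * delay_ms / 1000)                       # prediction delay (about 30 ms)

    # Step 1: multistep (delayed) linear prediction. Predict y(n) from
    # y(n - D), ..., y(n - D - filter_order + 1); the least-squares solution
    # plays the role of the pseudoinverse in (7).
    n = np.arange(D + filter_order - 1, len(y))
    X = np.stack([y[n - D - i] for i in range(filter_order)], axis=1)
    w, *_ = np.linalg.lstsq(X, y[n], rcond=None)

    # Step 2: estimate of the late reverberation, cf. (6): filter the observed
    # signal with the prediction coefficients preceded by D zeros.
    c = np.concatenate([np.zeros(D), w])
    late = lfilter(c, [1.0], y)

    # Step 3: spectral subtraction in the short-time power spectral domain and
    # resynthesis with the phase of the observed signal (overlap-add in istft).
    nperseg = int(fs * frame_ms / 1000)
    _, _, Y = stft(y, fs, nperseg=nperseg)
    _, _, L = stft(late, fs, nperseg=nperseg)
    power = np.maximum(np.abs(Y) ** 2 - np.abs(L) ** 2, 1e-3 * np.abs(Y) ** 2)
    S = np.sqrt(power) * np.exp(1j * np.angle(Y))
    _, s_hat = istft(S, fs, nperseg=nperseg)
    return s_hat[: len(y)]
```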

C. Dynamic Variance Compensation

Initially, dynamic variance compensation was introduced to compensate for distortion caused by noise reduction speech enhancement. In this paper, we investigate the use of dynamic variance compensation with a speech dereverberation preprocessor. Here, we briefly recall the main idea behind dynamic variance compensation. Let us model the mismatch between clean speech and reverberant speech as

$$x_t = s_t + b_t \qquad (8)$$

where $x_t$ is the observed reverberant speech feature and $b_t$ is modeled as a Gaussian with

$$p(b_t) = \mathcal{N}\!\left(b_t;\, \hat{b}_t,\, \Sigma_{b,t}\right) \qquad (9)$$

where $\hat{b}_t$ is an estimate of the mismatch, i.e., $\hat{b}_t = x_t - \hat{s}_t$, $\hat{s}_t$ is the dereverberated speech feature, and $\Sigma_{b,t}$ represents a time-varying feature covariance matrix. The likelihood of a reverberant speech feature given a state can be obtained by marginalizing the joint probability over clean speech

$$p(x_t \mid j, k) = \int p(x_t \mid s_t)\, p(s_t \mid j, k)\, ds_t \qquad (10)$$

where we used the probability multiplication rule. We can further develop (10), by inserting (2) and (9)

$$p(x_t \mid j, k) = \mathcal{N}\!\left(x_t;\, \mu_{j,k} + \hat{b}_t,\, \hat{\Sigma}_{j,k,t}\right) \qquad (11)$$

where it is assumed that $p(x_t \mid s_t, j, k) = p(x_t \mid s_t) = p(b_t = x_t - s_t)$.^1 $\hat{\Sigma}_{j,k,t} = \Sigma_{j,k} + \Sigma_{b,t}$ is a time-varying mixture variance obtained after compensation.

It is shown in [19] that dynamic variance compensation is very effective, especially when the oracle feature variance is used for $\Sigma_{b,t}$. In practice, the oracle feature variance is not available, and therefore the compensated variance may not be optimal. In an effort to approach the performance obtained with the oracle variance, we propose a novel parametric model for the compensated variance $\hat{\Sigma}_{j,k,t}$, and a procedure for estimating the model parameters using adaptive training implemented with the EM algorithm. By considering a model for $\hat{\Sigma}_{j,k,t}$ that includes static and dynamic components, we may compensate for both static and dynamic mismatches.

^1 Note that here our explanation considers a mismatch between clean speech and reverberant speech features. However, the result of (11) may also be obtained by considering a mismatch between clean and enhanced features as in [19].
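As a small numerical illustration of the decoding rule in (11), the sketch below evaluates the log-likelihood of a dereverberated feature under a diagonal Gaussian whose variance is inflated by a per-frame feature variance; this is equivalent to evaluating $\mathcal{N}(x_t; \mu + \hat{b}_t, \Sigma + \Sigma_{b,t})$ at $x_t = \hat{s}_t + \hat{b}_t$. Variable names and the toy values are assumptions for illustration only.

```python
import numpy as np

def compensated_log_gaussian(s_hat_t, mu, sigma2, sigma2_b_t):
    """log N(s_hat_t; mu, sigma2 + sigma2_b_t) with diagonal covariances."""
    var = sigma2 + sigma2_b_t                       # time-varying mixture variance
    diff = s_hat_t - mu
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var)

# Toy usage: an unreliable dimension (large sigma2_b) contributes less to the score.
mu = np.zeros(3); sigma2 = np.ones(3)
s_hat_t = np.array([0.1, 0.2, 3.0])                 # last dimension is badly enhanced
sigma2_b_t = np.array([0.01, 0.01, 9.0])            # estimated mismatch variance
print(compensated_log_gaussian(s_hat_t, mu, sigma2, sigma2_b_t))
```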

III. PROPOSED METHOD FOR VARIANCE COMPENSATION

A. Combination of Static and Dynamic Variance Adaptation

In this paper, in order to account for both static and dynamic mismatches, we propose a novel variance adaptation scheme that combines static and dynamic compensation. The compensated mixture covariance matrix is modeled as

$$\hat{\Sigma}_{j,k,t} = \Sigma^{S}_{j,k} + \Sigma^{D}_{t} \qquad (12)$$

where $\Sigma^{S}_{j,k}$ and $\Sigma^{D}_{t}$ represent static and dynamic variance components, respectively. We further express $\Sigma^{S}_{j,k}$ and $\Sigma^{D}_{t}$ with a parametric representation similar to MLLR. The static variance can thus be expressed as

$$\Sigma^{S}_{j,k} = A\, \Sigma_{j,k} \qquad (13)$$

where $A$ is a matrix of static variance compensation parameters. Equation (13) can be simplified if we assume the use of a diagonal covariance matrix, which is widely employed in speech recognition

$$\left(\sigma^{S}_{j,k,d}\right)^2 = \alpha_d\, \sigma^{2}_{j,k,d} \qquad (14)$$

where $\alpha_d$ can be interpreted as the weight of the variances of the acoustic models [26]. In a similar way, we model the dynamic variance as

$$\Sigma^{D}_{t} = B\, \Sigma_{b,t} \qquad (15)$$

where $B$ is a matrix of dynamic variance compensation parameters, and $\Sigma_{b,t}$ is the previously defined feature variance. Ideally, the feature variance should be computed as the squared difference between clean and enhanced speech features [19]. However, this calculation is not possible because the clean speech features are unknown. Here, we consider a diagonal feature covariance and assume that it is proportional to the square of an estimated mismatch, $\hat{b}_t$, i.e., the difference between observed reverberant and dereverberated speech features. Intuitively, this means that speech enhancement introduces more distortion when a great amount of reverberation energy is removed. Therefore, we can express the dynamic variance component as

$$\left(\sigma^{D}_{t,d}\right)^2 = \beta_d\, \hat{b}^{2}_{t,d} \qquad (16)$$

where $\beta_d$ are model parameters. Therefore, with the proposed model, we can rewrite the time-varying state variance as

$$\hat{\sigma}^{2}_{j,k,t,d} = \alpha_d\, \sigma^{2}_{j,k,d} + \beta_d\, \hat{b}^{2}_{t,d}. \qquad (17)$$

The parameters $\alpha_d$ and $\beta_d$ can be optimized by using adaptive training. Note that if $\beta_d = 0$ the model is equivalent to that of conventional static variance compensation [26], and if $\alpha_d$ is
constant and $\beta_d = 1$, it is equivalent to the conventional dynamic variance compensation model [19]. The proposed model enables us to combine both static and dynamic variance compensation within an adaptive training framework. It is important to note that the proposed method can be further combined with mean adaptation techniques such as MLLR [3], in order to further reduce the gap between the model and the speech features, as discussed in Section IV-D.

Fig. 3. Schematic diagram of adaptation.

Fig. 3 is a schematic diagram of the adaptation process used to estimate the parameters $\{\alpha_d, \beta_d\}$. First, dereverberation and feature extraction are performed. Adaptation is then performed with the EM algorithm to obtain optimal variance parameters $\{\alpha_d, \beta_d\}$ using the dereverberated features, the estimated mismatch $\hat{b}_t$, an acoustic model, and labels. In Section III-B, we discuss the adaptation in detail.

B. Adaptation of Variance Model Parameters

The model variance parameters, $\Theta = \{\alpha_d, \beta_d\}$, can be obtained by maximizing the likelihood as

$$\hat{\Theta} = \arg\max_{\Theta}\; p\left(X \mid W, \Theta, \Lambda\right) \qquad (18)$$

where $X = (x_1, \ldots, x_T)$ is a sequence of observed speech features. For simplicity, we consider supervised adaptation, where the word sequence $W$ is known. The extension to unsupervised adaptation is very straightforward, and we will show experimentally in Section V-B that the method may also work well in this case. The maximum-likelihood estimation problem can be solved using the EM algorithm. We define an auxiliary function as

$$Q(\Theta; \hat{\Theta}) = \sum_{S \in \mathcal{S}} \sum_{K \in \mathcal{K}} p\left(S, K \mid X, \hat{B}, W, \hat{\Theta}, \Lambda\right) \log p\left(X, \hat{B}, S, K \mid W, \Theta, \Lambda\right) \qquad (19)$$

where $\hat{B} = (\hat{b}_1, \ldots, \hat{b}_T)$ is a mismatch feature sequence, $\mathcal{S}$ is the set of all possible state sequences, $\mathcal{K}$ is the set of all mixture component sequences, $\Lambda$ represents the acoustic model parameters, and $\hat{\Theta}$ represents an estimate of the parameters obtained from the previous step of the EM algorithm. By considering the HMM model for clean speech, and assuming that the clean speech and the mismatch features are independent, we express the auxiliary function as

$$Q(\Theta; \hat{\Theta}) \doteq \sum_{S \in \mathcal{S}} \sum_{K \in \mathcal{K}} p\left(S, K \mid X, \hat{B}, W, \hat{\Theta}, \Lambda\right) \sum_{t=1}^{T} \log p\left(x_t \mid \hat{b}_t, s_t, k_t, \Theta, \Lambda\right) \qquad (20)$$

where $\doteq$ means that the terms that do not depend on $\Theta$ are neglected on the right-hand side, and $s_t = j$ and $k_t = k$ mean that the state sequence at frame $t$ has a state $j$ and the mixture component sequence at frame $t$ has a component $k$, respectively. The auxiliary function can be rewritten by replacing the summation over $S$ and $K$ with the summation over $t$, $j$, and $k$, as

$$Q(\Theta; \hat{\Theta}) \doteq \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} \gamma_t(j,k)\, \log \mathcal{N}\!\left(x_t;\, \mu_{j,k} + \hat{b}_t,\, \hat{\Sigma}_{j,k,t}\right) \qquad (21)$$

where $N$ is the number of HMM states, $M$ is the number of mixture components in a GMM, and $\gamma_t(j,k) = p(s_t = j, k_t = k \mid X, \hat{B}, W, \hat{\Theta}, \Lambda)$. The auxiliary function of (21) is similar to that used for stochastic matching [26]. The difference arises from the model of the mismatch given by (16), which includes a dynamic part. $\Theta$ should be obtained by maximizing (21). We observe that the auxiliary function decomposes into two functions of $\alpha_d$ and $\beta_d$. However, there is no closed form solution for the joint estimation of $\{\alpha_d, \beta_d\}$. Therefore, we consider the three following cases: a constant $\beta_d$ [i.e., Static Variance Adaptation (SVA)], $\alpha_d = 1$ [Dynamic Variance Adaptation (DVA)], and a combination of the two [Static and Dynamic Variance Adaptation (SDVA or DSVA)].

1) Static Variance Adaptation (SVA, constant $\beta_d$): Let us here consider the maximization of $Q$ with respect to $\alpha_d$ for a constant $\beta_d$. By considering the model expressed by (14) and
performing a similar calculation to that in [26], we can show that a closed form solution may be obtained as

$$\alpha_d = \frac{\sum_{t,j,k} \gamma_t(j,k)\, \hat{\sigma}^2_{s,t,j,k,d} \,/\, \sigma^2_{j,k,d}}{\sum_{t,j,k} \gamma_t(j,k)} \qquad (22)$$

where $\gamma_t(j,k)$ is the mixture component occupancy probability, which can be obtained using the forward–backward algorithm, and $\hat{\sigma}^2_{s,t,j,k,d}$ is an estimate of the dereverberated feature variance. It is given by

$$\hat{\sigma}^2_{s,t,j,k,d} = \tilde{\sigma}^2_{t,j,k,d} + \left(\tilde{s}_{t,j,k,d} - \mu_{j,k,d}\right)^2 \qquad (23)$$

where $\tilde{s}_{t,j,k,d}$ is an estimate of the clean speech feature, expressed as

$$\tilde{s}_{t,j,k,d} = \frac{\beta_d\, \hat{b}^2_{t,d}\, \mu_{j,k,d} + \alpha_d\, \sigma^2_{j,k,d}\, \hat{s}_{t,d}}{\alpha_d\, \sigma^2_{j,k,d} + \beta_d\, \hat{b}^2_{t,d}} \qquad (24)$$

and $\tilde{\sigma}^2_{t,j,k,d}$ is an estimate of the clean feature variance, expressed as

$$\tilde{\sigma}^2_{t,j,k,d} = \frac{\alpha_d\, \sigma^2_{j,k,d}\; \beta_d\, \hat{b}^2_{t,d}}{\alpha_d\, \sigma^2_{j,k,d} + \beta_d\, \hat{b}^2_{t,d}}. \qquad (25)$$

Equations (24) and (25) follow from derivations similar to those in [30]. Note that if $\beta_d = 0$, the problem is reduced to conventional static model variance adaptation as proposed in [26], which is sometimes referred to as variance scaling. Looking at (22), we can interpret $\alpha_d$ as the average of the ratio between the enhanced feature variance and the model variance.

2) Dynamic Variance Adaptation (DVA, $\alpha_d = 1$): When $\alpha_d = 1$, we can find a closed form solution to the maximization problem. By inserting (16) and (9) in (21) and maximizing with respect to $\beta_d$, we find the following expression:

$$\beta_d = \frac{\sum_{t,j,k} \gamma_t(j,k)\, \hat{\sigma}^2_{b,t,j,k,d} \,/\, \hat{b}^2_{t,d}}{\sum_{t,j,k} \gamma_t(j,k)} \qquad (26)$$

where $\hat{\sigma}^2_{b,t,j,k,d}$ is an estimate of the mismatch variance given the enhanced feature and the acoustic model. From a similar definition to those in (24) and (25), it follows that

$$\hat{\sigma}^2_{b,t,j,k,d} = \tilde{\sigma}^2_{b,t,j,k,d} + \left(\tilde{b}_{t,j,k,d} - \hat{b}_{t,d}\right)^2 \qquad (27)$$

where $\tilde{b}_{t,j,k,d}$ is a mismatch estimate expressed as

$$\tilde{b}_{t,j,k,d} = \frac{\beta_d\, \hat{b}^2_{t,d}\left(x_{t,d} - \mu_{j,k,d}\right) + \sigma^2_{j,k,d}\, \hat{b}_{t,d}}{\sigma^2_{j,k,d} + \beta_d\, \hat{b}^2_{t,d}} \qquad (28)$$

and $\tilde{\sigma}^2_{b,t,j,k,d} = \sigma^2_{j,k,d}\, \beta_d\, \hat{b}^2_{t,d} \,/\, (\sigma^2_{j,k,d} + \beta_d\, \hat{b}^2_{t,d})$ is the corresponding mismatch variance estimate. Note that it follows from (26) that $\beta_d$ is simply a weighted average of the ratio between the mismatch variance given the enhanced feature and the acoustic model, and the estimated mismatch variance $\hat{b}^2_{t,d}$.
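The following sketch illustrates one pass of the static update (22) using the posterior statistics (23) to (25), with the notation reconstructed above; the dynamic update (26) to (28) is analogous. It assumes the occupancies have already been computed with the forward–backward algorithm, the array shapes are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def update_alpha(gamma, s_hat, b_hat, mu, sigma2, alpha, beta):
    """One closed-form SVA-style update of alpha, cf. (22)-(25).

    gamma:  (T, J, K) occupancy probabilities from forward-backward
    s_hat:  (T, d)    dereverberated features
    b_hat:  (T, d)    estimated mismatch features
    mu:     (J, K, d) Gaussian means; sigma2: (J, K, d) diagonal variances
    alpha, beta: (d,) parameter values from the previous EM iteration
    """
    prior_var = alpha * sigma2                                 # (J, K, d)
    mis_var = beta * b_hat[:, None, None, :] ** 2              # (T, 1, 1, d)
    denom = prior_var + mis_var                                # (T, J, K, d)
    # posterior mean and variance of the clean feature, cf. (24) and (25)
    s_tilde = (mis_var * mu + prior_var * s_hat[:, None, None, :]) / denom
    var_tilde = prior_var * mis_var / denom
    # second-order statistic of (23) and the weighted ratio of (22)
    sigma2_s = var_tilde + (s_tilde - mu) ** 2
    num = np.sum(gamma[..., None] * sigma2_s / sigma2, axis=(0, 1, 2))
    return num / np.sum(gamma)
```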

3) Static and Dynamic Variance Adaptation (SDVA or DSVA): It may not be easy to find a closed form solution of the EM algorithm when the maximization relative to $\alpha_d$ and $\beta_d$ is performed at the same time. However, we determined that solutions could be found if we considered the maximization relative to $\alpha_d$ and $\beta_d$ separately. As these two maximization problems involve the same likelihood function, the likelihood would also increase if we performed the maximization relative to each parameter in turn, as in the Expectation Conditional Maximization (ECM) algorithm [31]. This procedure may approach the general case.

In the first case, we start by removing the static bias with static variance adaptation, as described in Section III-B.1, setting $\beta_d$ to a constant. Then, using the previously adapted acoustic model, we perform dynamic variance adaptation as shown in Section III-B.2. This is referred to as Static and Dynamic Variance Adaptation (SDVA). We also consider the opposite case, where dynamic variance adaptation is performed first, followed by static variance adaptation (i.e., Dynamic and Static Variance Adaptation, DSVA).

IV. EXPERIMENTS

We carried out experiments to confirm the effectiveness of combining static and dynamic variance adaptation.

A. Experimental Settings

In this experiment, we used the SOLON recognizer [32], modified to account for the decoding rule of (11). The recognition task consisted of continuous digit utterances. The acoustic model consisted of speaker-independent word-based HMMs with 16 states and three Gaussians per state. The HMMs were trained using clean speech drawn from the TI-Digit database. The signals were downsampled from 20 to 8 kHz. The acoustic features consisted of 39 coefficients: 12 MFCCs, the 0th cepstrum coefficient, delta, and acceleration. Cepstral mean normalization (CMN) was applied to the features. The above experimental setting almost corresponds to the clean speech set of the Aurora 2 noisy digits recognition task [33].
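A rough sketch of the 39-dimensional front-end described above (12 MFCCs plus the 0th cepstral coefficient, with delta and acceleration coefficients, followed by CMN) is given below. The use of librosa and the unspecified frame settings are assumptions for illustration; this is not necessarily the exact front-end used in the experiments.

```python
import numpy as np
import librosa

def extract_features(wav, sr=8000):
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)    # c0..c12 (13 static coefficients)
    delta = librosa.feature.delta(mfcc)                      # first derivatives
    accel = librosa.feature.delta(mfcc, order=2)             # second derivatives
    feats = np.vstack([mfcc, delta, accel]).T                # (frames, 39)
    feats -= feats.mean(axis=0, keepdims=True)               # cepstral mean normalization
    return feats
```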

We generated reverberant speech by convolving clean speech with a room impulse response. The impulse response was measured in a room with a reverberation time of around 0.5 s. The distance between the speaker and the microphone was 1.5 m, except in Section V-A, where we used a distance of 2 m for comparison. The Deutlichkeit values (D50) of the two impulse responses with distances of 1.5 and 2.0 m were 0.694 and 0.629, respectively. The D50 of an impulse response is defined as the ratio of the power of the impulse response within the first 50 ms to that of the whole impulse response, which is mathematically defined as

$$D_{50} = \frac{\sum_{n=0}^{0.05 f_s} h^2(n)}{\sum_{n=0}^{\infty} h^2(n)} \qquad (29)$$

where $h(n)$ is the room impulse response and $f_s$ is the sampling rate.
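A direct transcription of (29) is shown below; the only assumption is that the impulse response array starts at the arrival of the direct sound.

```python
import numpy as np

def deutlichkeit_d50(h, fs):
    """Ratio of impulse-response energy in the first 50 ms to the total energy."""
    n50 = int(0.05 * fs)
    return np.sum(h[:n50] ** 2) / np.sum(h ** 2)
```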

In the dereverberation described in Section II-B, we set the length of the prediction filter at 4000 taps (0.5 s), the delay of the multistep linear prediction at 30 ms, and the frame rate and frame size for spectral subtraction at 3.7 and 30 ms, respectively. The test set consisted of 561 utterances (6173 digits) spoken by 104 male and female speakers. The clean speech utterances were obtained from the TI-Digit clean test set. To account for the long distortion caused by reverberation, the test utterances were generated by concatenating two or three utterances from the same speaker without inserting pauses, so that the average duration of the test utterances was around 6 s. We measure the ASR performance using the word error rate (WER).

B. Baseline Results

TABLE I. Baseline ASR results measured by word error rate (WER (%)).

Table I gives the following baseline recognition results.
• Clean: recognizing clean speech without preprocessing or variance compensation.
• Reverberant: recognizing reverberant speech without preprocessing or variance compensation.
• Dereverberated: recognizing dereverberated speech with dereverberation preprocessing [17] and without variance compensation.
• Variance compensation (estimated variance): recognizing dereverberated speech with variance compensation, with the variance given by the square of the estimated mismatch (without adaptation, i.e., with the variance weights fixed).
• Variance compensation (oracle feature variance): recognizing dereverberated speech with variance compensation using the ideal (oracle) variance, given by the square of the mismatch between clean features known a priori and dereverberated speech features.

We observed severe degradation induced by reverberation. Only a small error reduction was achieved when using single-channel dereverberation. We also show that variance compensation reduces the error, especially with the oracle variance, in which case the WER is very close to that of clean speech. This result confirms the great potential of dynamic variance compensation.
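The two feature variances compared in Table I can be written compactly as follows; the oracle variance requires clean features known a priori, whereas the estimated variance uses only the observed and dereverberated features. Array names are assumptions, and all arrays hold features of shape (frames, dimensions).

```python
import numpy as np

def oracle_variance(clean_feats, dereverb_feats):
    # squared mismatch between clean and enhanced features (available only in simulation)
    return (dereverb_feats - clean_feats) ** 2

def estimated_variance(reverb_feats, dereverb_feats):
    # squared estimated mismatch b_hat_t = x_t - s_hat_t
    return (reverb_feats - dereverb_feats) ** 2
```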

Our objective is to approach the level of performance provided by the oracle variance.

C. Results of Variance Adaptation

Fig. 4. WER as a function of the number of adaptation data for SVA (thin solid line), DVA (dashed line), SDVA (thick solid line), and DSVA (dash-dotted line).

We use speaker-independent adaptation data, i.e., utterances spoken by various speakers, to evaluate the adaptation performance with respect to mitigating only the distortion originating from the preprocessor. The adaptation data consist of 520 utterances, which were generated by concatenating two or three utterances, similarly to the test set preparation. They were spoken by the same female and male speakers who spoke the test set. To test the influence of the amount of adaptation data, we used subsets containing from 2 to 512 utterances extracted randomly from the 520 adaptation utterances. The number of iterations of the EM algorithm was set at 2 for SVA and 30 for DSVA; see the discussion at the end of this section. Fig. 4 plots the WER as a function of the number of adaptation utterances for SVA, DVA, DSVA, and SDVA. The results are averaged over five randomly generated adaptation data sets. Inter-algorithm WER differences are highly significant: the two-sided 99% confidence intervals for WER were small for all of the data points in Fig. 4.

We observe that in all cases, convergence is almost achieved after two utterances, since the number of adaptation parameters is small, namely 39 for SVA and DVA, and 2 × 39 for SDVA and DSVA. A great reduction in the WER, from 31% to 15.2%, is achieved using SVA. DVA achieved results that were almost as good. In contrast, when using DSVA and SDVA the performance improved by an additional 1% and 2%, respectively. The SDVA
performance was better than the DSVA performance by up to 1.0%, owing to the order of the static and dynamic variance optimization; SDVA and DSVA are further compared in [34]. These results show that, even though the results of the proposed method are inferior to the clean speech and oracle results shown in Table I, the proposed method could significantly improve the ASR performance, reducing the error by 56% compared with the recognition of dereverberated speech. This experiment reveals the effectiveness of combining static and dynamic variance adaptation.

Finally, let us briefly discuss the convergence of the proposed adaptation scheme. Since the proposed scheme adds a dynamic variance term to the acoustic model variances, the variances become time-varying, and we may therefore expect poor convergence of the EM algorithm.

Fig. 5. Distortion and WER as a function of the number of iterations of the EM algorithm for SVA, DVA, SDVA, and DSVA. In this experiment, four utterances were used for adaptation.

Fig. 5 plots the distortion (negative log-likelihood) and the WER as a function of the number of iterations of the EM algorithm for SVA, DVA, DSVA, and SDVA. The EM algorithm converges after only two iterations for SVA. In contrast, the convergence is much slower with DVA. The poor convergence is due to the difficulty of handling the dynamic component, and to the fact that the mismatch may not be well modeled with only a dynamic component, as suggested in Section IV-C. Such a poor convergence property would pose a problem, especially for online applications. When combining static and dynamic adaptation, we observe that even though the distortion does not completely converge after 32 iterations, the WER converges after two to four iterations. Therefore, a limited number of iterations may be sufficient to attain good performance, suggesting the potential of the method for online use.

D. Results of Variance Adaptation Combined With MLLR for Mean Adaptation

Here, we investigate the combination of feature variance adaptation with mean adaptation using global MLLR with a full transformation matrix. In this experiment, when combining mean and variance adaptation, we observed that better results were obtained when variance adaptation was performed first, followed by mean adaptation with MLLR. Therefore, we use this order for the results presented here.

Fig. 6. WER as a function of the number of adaptation data for MLLR (dotted line), SVA+MLLR (thin solid line), DVA+MLLR (dash-dotted line), and SDVA+MLLR (thick solid line).

Fig. 6 plots the WER as a function of the number of adaptation utterances when using only MLLR (mean), SVA+MLLR (mean), DVA+MLLR (mean), SDVA+MLLR (mean), and DSVA+MLLR (mean). The values of the two-sided 99% confidence intervals were up to 3.7% for different samples of adaptation data. Note that SVA+MLLR (mean) is equivalent to "unconstrained MLLR," where the covariance matrices are transformed independently of the mean vectors [3]. With only MLLR, the WER converges to around 17%. By combining SVA with MLLR, the WER is reduced to 11%. Using DVA+MLLR reduces the WER further to 8%. Finally, SDVA+MLLR converges to a WER close to 5%, which corresponds to a relative error rate reduction of more than 80% compared with the recognition of dereverberated speech. This WER approaches that of clean speech. This experiment confirms the effectiveness of combining the proposed method with mean adaptation.

Note that MLLR required more than eight utterances to converge. Therefore, SVA+MLLR, DVA+MLLR, and DSVA+
MLLR also needed more than eight utterances to converge. When SDVA+MLLR is used, better performance is achieved at the cost of more adaptation data (here, more than 128 utterances). However, when using SDVA+MLLR, we obtain poorer results when too few utterances are used. The problem may arise
from instabilities that occur when the maximization steps of the EM algorithm are performed in turn. One way to solve this problem may be to perform interleaved updates of the mean and variance parameters in the EM algorithm. Future work will include an investigation of this matter.

Examining the results shown in Figs. 4 and 6, we observe that, depending on the amount of adaptation data available, different adaptation techniques should be used to obtain optimal results. In each case, optimal results are obtained by combining dynamic variance adaptation with static adaptation of the mean or variance.

V. PRACTICAL ISSUES

Here, we discuss issues related to the use of the proposed adaptation method in more practical situations. First, we confirm the improvement in ASR performance realized by the method when dealing with more challenging reverberation, obtained by increasing the distance between the speaker and the microphone. Then, we show experimentally that the method may easily be extended to unsupervised adaptation.

A. Results for a Different Distance Between the Speaker and the Microphone

Here, we investigate the method in a more severe case, in which the distance between the speaker and the microphone was 2 m.

TABLE II. ASR results for distances of 1.5 and 2 m between the speaker and the microphone, comparing the proposed variance adaptation, variance compensation (without adaptation), and reverberant/dereverberated speech recognition (without any variance adaptation/compensation).

Table II shows comparative results for distances of 1.5 and 2 m. Even if the reverberation time remains the same, increasing the distance to the microphone makes dereverberation more challenging because the ratio of early to late reverberation energy decreases, as revealed by the D50 measure described in Section IV-A. Consequently, the WER of reverberant speech degrades from 32.7% to 37.1%. Even in this more challenging case, we observe the same tendency as for the 1.5-m results. In Table II, we highlight the optimal WER obtained for a given number of adaptation utterances. We observe that when only a few adaptation utterances are available, optimal performance was obtained by employing only SDVA. If more than eight utterances are available, we may combine mean and variance adaptation. Here, DVA+MLLR (mean) gives optimal results. Finally, if a large amount of adaptation data is available, here more than 128 utterances, performance further improves by using SDVA+MLLR (mean). This experiment confirms the improvement brought about by the proposed method even in more challenging reverberant environments, and that optimal performance was obtained by combining dynamic variance adaptation with static adaptation of the mean or variance. Moreover, after adaptation, similar results were obtained for 1.5 and 2 m.

B. Unsupervised Adaptation

Thus far, we have considered supervised adaptation. However, when a set of adaptation parameters is globally shared among all the Gaussians of the models, there is the potential for extension to unsupervised adaptation. Indeed, in this case, the estimation of the global adaptation parameters is far less sensitive to errors in estimated labels than cluster-based adaptation. We tested unsupervised adaptation for SDVA and SDVA+MLLR (mean) when the distance between the speaker and the microphone was 1.5 m. For unsupervised adaptation, we first apply the recognizer to the unlabeled adaptation data to obtain the HMM state alignment; adaptation is then achieved using the previously obtained state alignment instead of the labels. We consider open and closed adaptation. With open adaptation, the adaptation data set is different from the recognition data set. This is equivalent to the previous experiment and is used here for comparison. With closed adaptation, we use the same data set for recognition and adaptation.

TABLE III. Comparison of supervised and unsupervised adaptation for SDVA and SDVA+MLLR (mean).

Table III gives the WER for unsupervised adaptation in the open and closed cases. We observe that using unsupervised SDVA reduces the performance by only around 0.4% of WER compared with the supervised case. Moreover, the same performance is obtained for closed and open adaptation. Note also that, as only two utterances may be sufficient to achieve convergence with SDVA, adaptation could be performed online or in a semi-batch way. Therefore, we expect that the proposed algorithm could be made robust to changes in the acoustic environment as long as the time scale of the changes is of the order of a couple of seconds. When using unsupervised SDVA+MLLR (mean), even if the reduction in WER is less than that obtained in the supervised case, we still observe a WER reduction of around 2% compared with unsupervised SDVA when sufficient adaptation data are used. This result confirms that the proposed method may be combined with MLLR even for unsupervised adaptation.

VI. CONCLUSION

In this paper, we investigated the use of variance compensation to improve the performance of a speech dereverberation
preprocessor. We proposed a new parametric model for the HMM variances that includes both static and dynamic elements. The model parameters were optimized using adaptive training, and a solution was provided with the EM algorithm. This is a high-performance approach for combining a dereverberation preprocessor and a speech recognizer. We tested the method in a simulation with a reverberation time of 0.5 s. Dereverberation combined with variance adaptation alone was effective in reducing the WER by more than 50%, especially when we combined static and dynamic adaptation. The adaptation quickly improves the performance, with only two utterances and two iterations, and it should therefore also be effective for fast-changing environments or reverberant conditions if the proposed method is extended to an online adaptation scheme. Moreover, by combining the proposed variance adaptation method with conventional MLLR for mean adaptation, the proposed method achieved an 80% relative error rate reduction. Although most results were obtained for supervised adaptation, we also confirmed the efficiency of the method for unsupervised adaptation. This shows the potential of using the proposed method for semi-batch applications.

In this paper, we focused on the dereverberation problem, since a dereverberation preprocessor often induces a dynamic mismatch between the clean and dereverberated speech features due to imperfect estimation of the late reverberation and the use of spectral subtraction. However, our primary motivation for this work is to develop a general interconnection method between any speech enhancement preprocessor and a speech recognizer [25]. The formulation in this paper does not depend on a particular preprocessor, and the method could in theory be applied to other preprocessing techniques. Therefore, an investigation of the use of the proposed method with other speech enhancement methods, such as the ETSI advanced front-end [35], blind source separation [36], [37], and spectral subtraction for noise reduction [29], by using a common database (e.g., Aurora 2 [33]), will form part of our future work. We will also try combining a matched-condition model and the proposed adaptation by using the matched-condition model as an initial model. Although it is expensive to prepare an appropriate model for unseen conditions, this will further improve the ASR performance.

ACKNOWLEDGMENT

The authors would like to thank the members of the Signal Processing Research Group of NTT Communication Science Laboratories, and especially K. Kinoshita, for their support of this work, and Dr. D. Kolossa of Berlin University of Technology for the fruitful discussions about variance compensation techniques during her stay at NTT Communication Science Laboratories. They would also like to thank the anonymous reviewers for their valuable comments, which have improved the quality of the paper.

REFERENCES

[1] L. Deng, A. Acero, M. Plumpe, and X. Huang, “Large-vocabulary speech recognition under adverse acoustic environments,” in Proc. Int. Conf. Spoken Lang. Process. (ICSLP’00), 2000, vol. 3, pp. 806–809. [2] B. Raj and R. M. Stern, “Missing-feature approaches in speech recognition,” IEEE Signal Process. Mag., vol. 22, no. 5, pp. 101–116, Sep. 2005. [3] M. J. F. Gales and P. C. Woodland, “Mean and variance adaptation within the MLLR framework,” Comput. Speech Lang., vol. 10, pp. 249–264, 1996. [4] H. Jiang, K. Hirose, and Q. Huo, “Robust speech recognition based on a Bayesian prediction approach,” IEEE Trans. Speech Audio Process., vol. 7, no. 4, pp. 426–440, Jul. 1999. [5] A. Sankar and C.-H. Lee, “Robust speech recognition based on stochastic matching,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’95), 1995, vol. 1, pp. 121–125. [6] M. J. F. Gales and S. J. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Trans. Speech Audio Process., vol. 4, no. 5, pp. 352–359, Sep. 1996. [7] I. Tashev and D. Allred, “Reverberation reduction for improved speech recognition,” in Proc. Joint Workshop Hands-Free Speech Commun. Microphone Arrays (HSCMA’05), 2005, CD-ROM. [8] L. Couvreur and C. Couvreur, “Blind model selection for automatic speech recognition in reverberant environments,” J. VLSI Signal Process. Syst., vol. 36, no. 2-3, pp. 189–203, 2004. [9] C. K. Raut, T. Nishimoto, and S. Sagayama, “Model adaptation by state splitting of HMM for long reverberation,” in Proc. 9th Eur. Conf. Speech Commun. Technol. (Interspeech’05-Eurospeech), 2005, pp. 277–280. [10] T. Takiguchi and M. Nishimura, “Acoustic model adaptation using first order prediction for reverberant speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), 2004, vol. 1, pp. 869–972. [11] A. Sehr and W. Kellerman, “A new concept for feature-domain dereverberation for robust distant-talking ASR,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’07), 2007, vol. 4, pp. 369–372. [12] B. W. Gillespie and L. E. Atlas, “Acoustic diversity for improved speech recognition in reverberant environments,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’02), 2002, vol. 1, pp. 557–600. [13] P. A. Naylor and N. D. Gaubitch, “Speech dereverberation,” in Proc. Int. Workshop Acoust. Echo and Noise Control (IWAENC’05), 2005, iwaenc05.ele.tue.nl/proceedings/papers/pt03.pdf. [14] T. Nakatani, K. Kinoshita, and M. Miyoshi, “Harmonicity-based blind dereverberation for single-channel speech signals,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 80–95, Jan. 2007. [15] T. Hikichi, M. Delcroix, and M. Miyoshi, “Speech dereverberation algorithm using transfer function estimates with overestimated order,” Acoust. Sci. Technol., vol. 27, no. 1, pp. 28–35, 2006. [16] K. Kinoshita, T. Nakatani, and M. Miyoshi, “Spectral subtraction steered by multi-step forward linear prediction for single channel speech dereverberation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’06), 2006, vol. 1, pp. 817–820. [17] K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, “A linear prediction-based microphone array for speech dereverberation in a realistic sound field,” in Proc. Audio Eng. Soc. (AES) 13th Regional Conv., Tokyo, Japan, 2007, CD-ROM. [18] M. Wu and D. Wang, “A two-stage algorithm for one-microphone reverberant speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 774–784, May 2006. [19] L. 
Deng, J. Droppo, and A. Acero, “Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion,” IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 412–421, May 2005.
[20] D. Kolossa, H. Sawada, R. F. Astudillo, R. Orglmeister, and S. Makino, “Recognition of convolutive speech mixtures by missing feature techniques for ICA,” in Proc. Asilomar Conf. Signals, Syst., Comput. (ACSSC’06), 2006, pp. 1397–1401. [21] J. Arrowood and M. Clements, “Using observation uncertainty in HMM decoding,” in Proc. Int. Conf. Spoken Lang. Process. (ICSLP’02), 2002, vol. 3, pp. 1562–1564. [22] J. Droppo, A. Acero, and L. Deng, “Uncertainty decoding with splice for noise robust speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’02), 2002, vol. 1, pp. 57–60. [23] H. Liao and M. J. F. Gales, “Joint uncertainty decoding for noise robust speech recognition,” in Proc. 9th Eur. Conf. Speech Commun. Technol. (Interspeech’05-Eurospeech), 2005, pp. 3129–3132. [24] M. P. Cooke, P. D. Green, L. B. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and uncertain acoustic data,” Speech Commun., vol. 34, pp. 267–285, 2001. [25] M. Delcroix, T. Nakatani, and S. Watanabe, “Combined static and dynamic variance adaptation for efficient interconnection of a speech enhancement pre-processor with speech recognizer,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’08), 2008, pp. 4073–4076. [26] A. Sankar and C.-H. Lee, “A maximum-likelihood approach to stochastic matching for robust speech recognition,” IEEE Trans. Speech Audio Process., vol. 4, no. 3, pp. 190–202, May 1996. [27] H. Kuttruff, Room Acoustics, 3rd ed. London, U.K.: Elsevier Science, 1991. [28] T. F. Quatieri, Discrete-Time Speech Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 2002. [29] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979. [30] R. C. Rose, E. M. Hofstetter, and D. A. Reynolds, “Integrated models of signal and background with application to speaker identification in noise,” IEEE Trans. Speech Audio Process., vol. 2, no. 3, pp. 245–257, May 1994. [31] X.-L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: A general framework,” Biometrika, vol. 80, pp. 267–278, 1993. [32] T. Hori, “Ntt speech recognizer with outlook on the next generation: Solon,” in Proc. NTT Workshop Communication Scene Analysis, SP-6, 2004, CD-ROM. [33] H. G. Hirsch and D. Pearce, “The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy condition,” in Proc. ISCA Tutorial Research Workshop Autom. Speech Recognition: Challenges for the New Millenium (ITRW ASR2000), Paris, France, Sep. 2000, pp. 18–20. [34] M. Delcroix, T. Nakatani, and S. Watanabe, “Dynamic feature variance adaptation for robust speech recognition with a speech enhancement pre-processor,” 2007, IEICE Tech. Rep., 2007-SP-105. [35] ETSI ES 202 050 v1.1.1 “STQ; Distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms,” 2002. [36] H. Sawada, S. Araki, and S. Makino, “A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures,” in Proc. 2007 IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA’07), 2007, pp. 139–142. [37] L. Parra and C. Spence, “Convolutive blind separation of non-stationary sources,” IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 320–327, May 2000.

Marc Delcroix (M’06) was born in Brussels, Belgium, in 1980. He received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and the Ecole Centrale Paris, Paris, France, in 2003 and the Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in 2007. From 2004 to 2008, he was a Researcher at NTT Communication Science Laboratories, Kyoto, Japan, working on speech dereverberation and speech recognition. He is now working at Pixela on software development for digital television. Dr. Delcroix received the 2005 Young Researcher Award from the Kansai section of the Acoustic Society of Japan, the 2006 Student Paper Award from the IEEE Kansai section, and the 2006 Sato Paper Award from ASJ.

Tomohiro Nakatani (SM’06) received the B.E., M.E., and Ph.D. degrees from Kyoto University, Kyoto, Japan, in 1989, 1991, and 2002, respectively. He is a Senior Research Scientist at NTT Communication Science Labs, NTT Corporation, Kyoto. Since he joined NTT Corporation as a Researcher in 1991, he has been investigating speech enhancement technologies for developing intelligent human–machine interfaces. From 1998 to 2001, he was engaged in developing multimedia services at business departments of NTT and NTT-East Corporations. In 2005, he visited the Georgia Institute of Technology, Atlanta, as a Visiting Scholar for a year, and started investigating a probabilistic formulation of speech dereverberation with Prof. Juang. Dr. Nakatani was honored to receive the 1997 JSAI Conference Best Paper Award, the 2002 ASJ Poster Award, and the 2005 IEICE Paper Awards. He is a member of IEEE CAS Blind Signal Processing Technical Committee, an Associate Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, and a Technical Program Chair of IEEE WASPAA-2007. He is a member of IEICE and ASJ.

Shinji Watanabe (M’03) received the B.S., M.S., and Dr.Eng. degrees from Waseda University, Tokyo, Japan, in 1999, 2001, and 2006, respectively. In 2001, he joined Nippon Telegraph and Telephone Corporation (NTT) and has since been working at NTT Communication Science Laboratories, Kyoto, Japan. His research interests include Bayesian learning, pattern recognition, and speech and spoken language processing. Dr. Watanabe is a member of the Acoustical Society of Japan (ASJ) and the Institute of Electronics, Information and Communications Engineers (IEICE). He received the Awaya Prize from the ASJ in 2003, the Paper Award from the IEICE in 2004, the Itakura prize from ASJ in 2006, and the TELECOM System Technology Award from the Telecommunications Advancement Foundation in 2006.
