IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 2, FEBRUARY 2001
HMM-Separation-Based Speech Recognition for a Distant Moving Speaker

Tetsuya Takiguchi, Member, IEEE, Satoshi Nakamura, Member, IEEE, and Kiyohiro Shikano, Member, IEEE
Abstract—This paper presents a hands-free speech recognition method based on HMM composition and separation for speech contaminated not only by additive noise but also by an acoustic transfer function. The method realizes an improved user interface in which the user is not encumbered by microphone equipment in noisy and reverberant environments. The use of HMM composition has already been proposed for countering additive noise. In this paper, the same approach is extended to handle convolutional acoustic distortion in a reverberant room by using an HMM to model the acoustic transfer function. The states of this HMM correspond to different positions of the sound source, so the model can represent the positions of the sound source even if the speaker moves. This paper also proposes a new method, HMM separation, for estimating the HMM parameters of the acoustic transfer function in a maximum likelihood manner. The proposed method reverses the process of HMM composition: the model parameters are estimated by maximizing the likelihood of adaptation data uttered from an unknown position, so measurement of impulse responses is not required. The paper also describes the performance of the proposed methods for recognizing real distant-talking speech. The results of experiments clarify the effectiveness of the proposed method.

Index Terms—Adaptation, composition, distant moving speaker, noise, reverberation, separation.
I. INTRODUCTION
IN hands-free speech recognition, one of the key issues as regards practical use is the development of a technology that allows accurate recognition of noisy and reverberant speech. Such technology will play an especially important role in the recognition of distant-talking speech. In the past few years, much work has been done on HMM's and their training algorithms to improve the accuracy of speaker-independent speech recognition. To achieve high recognition accuracy, a user must normally be equipped with a close-talking microphone. If the user speaks at a distance from the microphone, the recognition accuracy is seriously degraded by the influence of reverberation and environmental noise. Much research has been done on robust speech recognition, where the two most important problems to be overcome are additive noise and convolutional distortion.

Manuscript received July 29, 1999; revised January 24, 2000. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rafid A. Sukkar.
T. Takiguchi is with IBM Research, Tokyo Research Laboratory, Kanagawa, Japan (e-mail: [email protected]).
S. Nakamura was with the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan. He is now with ATR Spoken Language Translation Labs, Kyoto, Japan (e-mail: [email protected]).
K. Shikano is with the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan (e-mail: [email protected]).
Publisher Item Identifier S 1063-6676(01)00847-1.
Additive noise usually consists of background noise, the voices of other speakers, and so on. Its effect on the input speech appears as addition in the wave domain and in the linear-spectral domain. Convolutional distortion is usually caused by the telephone channel, microphone characteristics, reverberation, and so on. Its effect on the input speech appears as convolution in the wave domain and as multiplication in the linear-spectral domain. Many methods have been presented for solving each problem, and their approaches can be summarized as feature compensation and model compensation. As examples of the former approach, the spectral subtraction method for additive noise and the cepstral mean normalization method for convolutional distortion have been proposed and their effectiveness confirmed ([1], [2]). As examples of the latter approach, conventional multi-template methods, model adaptation methods ([3], [4]), and model (de-)composition methods ([5], [6], [8], [10], [13]) have been proposed. Among the last group of methods, HMM composition is the most promising, because an HMM for noisy speech can be easily generated by composing the speech HMM's and a noise HMM trained during a period of noise. It has been shown in [6] and [8] that the composed noisy HMM achieves very high accuracy. A recognition method based on signal decomposition using HMM's is proposed in [5]; recognition is carried out by extending the normal Viterbi decoding algorithm to a search of the combined state space of clean-speech HMM's and a noise HMM. However, this method assumes that speech and noise are independent in the log-spectral domain. Techniques using microphone arrays have also been applied to enhance speech intelligibility (e.g., [17]–[19]) or to recognize speech in adverse environments (e.g., [20], [21]).

Reverberation can be characterized by an impulse response (acoustic transfer function), and its influence is commonly described by a scalar index, the reverberation time (e.g., [22], [23]). The impulse response changes according to not only the shape of a room but also the temperature, the humidity, and the positions of the source and microphone. Figs. 1 and 2 show examples of waveforms and narrow-band spectrograms for original (clean) speech and reverberant speech. When the training data of an acoustic model consist of clean speech, as shown in Fig. 1, and the testing data consist of reverberant speech, as shown in Fig. 2, a serious mismatch occurs between the training data and the test utterances. Present spectral-matching measures have the shortcoming of being easily affected by noise, reverberation, and so on, and are very sensitive to spectral distortion. On the other hand, if the training data consist of speech collected under every conceivable combination of signal conditions, the recognition accuracy will not be seriously degraded. However, it is not practical to collect such a huge set of utterances. Therefore, it is desirable to adapt the acoustic model to the target environment by using a small amount of a user's speech.

Fig. 1. Original speech: the speech waveform and narrow-band spectrogram of the Japanese utterance /ai/.

Fig. 2. Reverberant speech (reverberation time = 0.6 s): the speech waveform and narrow-band spectrogram of the Japanese utterance /ai/.

We apply HMM composition to the recognition of speech contaminated not only by additive noise but also by the reverberation of the room [13]. HMM composition can be used if the components are independent of each other and additive. Noise and speech are independent and additive in the linear-spectral domain. The acoustic transfer function and speech are convolutional in the time domain, but they are independent and additive in the cepstral domain. Therefore, HMM composition is applicable to noisy and reverberant speech. There have already been some reports on the problem of compensating for spectral tilt in speech with noise and channel distortion ([7], [9], [11], [12]). This paper addresses the problem of compensating not only for the spectral tilt but also for a room's acoustic transfer function. It also proposes the HMM separation method for estimating the HMM parameters of an acoustic transfer function ([15], [16]). The model parameters are estimated by maximizing the likelihood of adaptation data uttered from an unknown position; the HMM separation method does not require measured impulse responses or a reference signal. The proposed method is obtained by reversing the process of HMM composition.

The remainder of this paper is organized as follows. Section II describes a robust speech recognition method based on HMM composition for noisy and reverberant speech. Section III describes a method for estimating the HMM parameters of an acoustic transfer function, based on HMM separation in the model domain. Section IV describes the performance of the HMM composition and separation methods for real distant-talking speech. Finally, Section V summarizes the paper and suggests future research directions.

II. HMM COMPOSITION

On the assumption that the speech signal, $S(t)$, and the noise signal, $N(t)$, are independent, the observed signal, $O(t)$, is represented by

$$O(t) = S(t) + N(t). \qquad (1)$$

The conventional approach uses noise statistics obtained during a period of noise and recognizes input noisy speech by using noise-added reference patterns. Since the levels of speech and noise are generally different in training and testing, adjustment factors are introduced. If the level of the speech data is different, $S(t)$ is multiplied by an adjustment factor $k$, and if the level of the noise data is different, $N(t)$ is multiplied by an adjustment factor $k'$. In this paper, the observed signal is represented by

$$O(t) = k \cdot S(t) + k' \cdot N(t). \qquad (2)$$

This relation is preserved in the linear-spectral domain as follows:

$$O(\omega; w) = k \cdot S(\omega; w) + k' \cdot N(\omega; w) \qquad (3)$$

where $O(\omega; w)$, $S(\omega; w)$, and $N(\omega; w)$ are short-term linear spectra in the analysis window $w$. HMM composition executes the addition in the model domain instead of in the parameter domain. Generally, parameters for speech recognition are represented by the cepstrum, so the parameters have to be transformed to the linear-spectral domain, where the addition of speech and noise holds ([6], [8]).

For convolutional distortion, the observed signal is represented by

$$O(t) = S(t) * H(t) \qquad (4)$$

where $H(t)$ is the acoustic transfer function (room impulse response) and $*$ denotes convolution. The spectral analysis for speech recognition is based on short-term windowing. If the length of the acoustic transfer function is shorter than that of the window, the observed distorted spectrum is represented by

$$O(\omega; w) = S(\omega; w) \cdot H(\omega). \qquad (5)$$

However, since the length of the acoustic transfer function is greater than that of the window, the observed distorted spectrum is only approximately represented by

$$O(\omega; w) \approx S(\omega; w) \cdot H(\omega; w) \qquad (6)$$

where $H(\omega; w)$ is the acoustic transfer function from the sound source to the microphone in the analysis window $w$. $H(\omega; w)$ is a function of the window $w$, since the sound source may move.
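As a concrete illustration of (4)–(6), the following minimal NumPy sketch convolves a synthetic source with an impulse response longer than the analysis window and compares the windowed spectrum of the reverberant signal with the product $S(\omega; w) H(\omega)$. The signal, impulse response, and frame index are arbitrary illustrative choices, not the paper's experimental settings.

```python
import numpy as np

fs = 12000                     # sampling rate (Hz), illustrative
win = 384                      # 32-ms analysis window (samples)
rng = np.random.default_rng(0)

s = rng.standard_normal(fs)                                   # synthetic "speech" source
h = rng.standard_normal(win * 3) * np.exp(-np.arange(win * 3) / 800.0)
o = np.convolve(s, h)                                         # linear convolution, as in eq. (4)

w = 5                                                         # pick one analysis window
frame_s = s[w * win:(w + 1) * win]
frame_o = o[w * win:(w + 1) * win]

S = np.fft.rfft(frame_s * np.hamming(win))
O = np.fft.rfft(frame_o * np.hamming(win))
H = np.fft.rfft(h, n=win)                                     # transfer function truncated to the window

# Because len(h) > win, O is only approximately S * H, as in eq. (6).
approx = S * H
err = np.mean(np.abs(np.abs(O) - np.abs(approx))) / np.mean(np.abs(O))
print(f"relative magnitude mismatch of the short-term approximation: {err:.2f}")
```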
The multiplication can be converted into addition in the cepstral domain as follows:

$$O_{\mathrm{cep}} = S_{\mathrm{cep}} + H_{\mathrm{cep}} \qquad (7)$$

where $O_{\mathrm{cep}}$, $H_{\mathrm{cep}}$, and $S_{\mathrm{cep}}$ are cepstra for the observed signal, acoustic transfer function, and speech signal, respectively. Therefore, the observed signal is represented by

$$O(\omega; w) = k \cdot S(\omega; w) \cdot H(\omega; w) + k' \cdot N(\omega; w). \qquad (8)$$

Since spectral analysis in speech recognition is based on short-term windowing, the multiplication of the short-term signal spectra and the acoustic transfer function is equivalent to periodic convolution in the time domain. However, actual distorted speech results from linear convolution. Since the proposed HMM composition of the speech and acoustic transfer function realizes only periodic convolution, the composed HMM cannot model acoustically distorted speech accurately. In this paper, the covariance matrix of the Gaussian probability density function (PDF) deals with the influence of a long impulse response.

Fig. 3. Block diagram of the proposed HMM composition.

The composition procedure is summarized in Fig. 3. The cosine transform, inverse cosine transform, exponential transform, and log transform are performed on HMM parameters; the terms (Cos), (Cos$^{-1}$), (Exp), and (Log) are defined as transforms of the Gaussian PDF. The procedure is as follows [14]. (A numerical sketch of steps 2)–7) is given at the end of this section.)

1) Estimate the HMM's of the speech, noise, and acoustic transfer function in the cepstral domain.

2) Compose HMM's of the speech and acoustic transfer function in the cepstral domain:
$$\mu_{SH,\mathrm{cep}} = \mu_{S,\mathrm{cep}} + \mu_{H,\mathrm{cep}}, \qquad \Sigma_{SH,\mathrm{cep}} = \Sigma_{S,\mathrm{cep}} + \Sigma_{H,\mathrm{cep}} \qquad (9)$$
Here, $(\mu_{S,\mathrm{cep}}, \Sigma_{S,\mathrm{cep}})$, $(\mu_{H,\mathrm{cep}}, \Sigma_{H,\mathrm{cep}})$, and $(\mu_{SH,\mathrm{cep}}, \Sigma_{SH,\mathrm{cep}})$ are the mean vectors and covariance matrices of the clean-speech HMM, the acoustic transfer function HMM, and the composed HMM's in the cepstral domain, respectively.

3) Compute the cosine transform of each Gaussian probability density function (PDF) of the HMM's:
$$\mu_{\mathrm{log}} = \Gamma\, \mu_{\mathrm{cep}}, \qquad \Sigma_{\mathrm{log}} = \Gamma\, \Sigma_{\mathrm{cep}}\, \Gamma^{T} \qquad (10)$$
Here, $\Gamma$ is a cosine transform matrix, and $\mu_{\mathrm{log}}$ and $\Sigma_{\mathrm{log}}$ are the mean vector and covariance matrix of a Gaussian PDF in the log-power spectral domain. The transposition is denoted by "$T$".

4) Compute the exponential transform to the linear-spectral domain. The random vector obtained by the exponential transform of a normal random vector has a log-normal distribution whose mean and covariance are given by
$$\mu_{\mathrm{lin},i} = \exp\left(\mu_{\mathrm{log},i} + \tfrac{1}{2}\Sigma_{\mathrm{log},ii}\right) \qquad (11)$$
$$\Sigma_{\mathrm{lin},ij} = \mu_{\mathrm{lin},i}\, \mu_{\mathrm{lin},j}\left(\exp(\Sigma_{\mathrm{log},ij}) - 1\right). \qquad (12)$$
Here, $\mu_{\mathrm{lin}}$ and $\Sigma_{\mathrm{lin}}$ are the mean vector and covariance matrix in the linear-spectral domain.

5) Compose the two distributions according to (8):
$$\mu_{O,\mathrm{lin}} = k\, \mu_{SH,\mathrm{lin}} + k'\, \mu_{N,\mathrm{lin}}, \qquad \Sigma_{O,\mathrm{lin}} = k^{2}\, \Sigma_{SH,\mathrm{lin}} + k'^{2}\, \Sigma_{N,\mathrm{lin}} \qquad (13)$$
Here, $(\mu_{N,\mathrm{lin}}, \Sigma_{N,\mathrm{lin}})$ and $(\mu_{O,\mathrm{lin}}, \Sigma_{O,\mathrm{lin}})$ are the mean vectors and covariance matrices of the noise and composed observed-speech HMM's in the linear-spectral domain, respectively.

6) Compute the log transform of the composed HMM's:
$$\mu_{\mathrm{log},i} = \log(\mu_{\mathrm{lin},i}) - \frac{1}{2}\log\left(\frac{\Sigma_{\mathrm{lin},ii}}{\mu_{\mathrm{lin},i}^{2}} + 1\right) \qquad (14)$$
$$\Sigma_{\mathrm{log},ij} = \log\left(\frac{\Sigma_{\mathrm{lin},ij}}{\mu_{\mathrm{lin},i}\,\mu_{\mathrm{lin},j}} + 1\right). \qquad (15)$$

7) Compute the inverse cosine transform to the cepstral domain:
$$\mu_{\mathrm{cep}} = \Gamma^{-1}\mu_{\mathrm{log}}, \qquad \Sigma_{\mathrm{cep}} = \Gamma^{-1}\Sigma_{\mathrm{log}}(\Gamma^{-1})^{T}. \qquad (16)$$

The HMM recognizer decodes observed speech on a trellis diagram by maximizing the log-likelihood. The decoded path will find an optimal combination of speech, noise, and the acoustic transfer function.

Fig. 4 shows the proposed acoustic transfer function HMM in the case of five states. Each state of the acoustic transfer function HMM corresponds to a position in a room, and all transitions among states are permitted. Therefore, the proposed acoustic transfer function HMM is able to represent the positions of sound sources, even if the speaker moves. Since each state of the acoustic transfer function HMM has a Gaussian distribution, it is also possible for the acoustic transfer function HMM to deal with variations in a user's position or the influence of a long impulse response.

Fig. 4. Ergodic HMM of acoustic transfer functions.
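For concreteness, the following NumPy sketch carries a single diagonal-covariance Gaussian through steps 2)–7) under the log-normal approximation described above. The orthonormal DCT used for the cosine transform matrix, the dimensions, the random statistics, and the unit gain factors are illustrative assumptions, not the settings used in the paper's experiments.

```python
import numpy as np

def dct_ortho(n):
    """Orthonormal DCT-II matrix C: cepstrum = C @ log_spectrum (illustrative choice)."""
    k = np.arange(n)
    C = np.cos(np.pi * k[:, None] * (k[None, :] + 0.5) / n) * np.sqrt(2.0 / n)
    C[0] *= np.sqrt(0.5)
    return C

n = 16                 # illustrative spectral/cepstral dimension
C = dct_ortho(n)       # log-spectral -> cepstral
G = C.T                # cepstral -> log-spectral ("Gamma" in (10)), orthonormal inverse

def cep_to_lin(mu_cep, var_cep):
    """Steps 3)-4): cepstrum -> log spectrum -> linear spectrum (log-normal moments)."""
    mu_log = G @ mu_cep
    cov_log = G @ np.diag(var_cep) @ G.T
    mu_lin = np.exp(mu_log + 0.5 * np.diag(cov_log))            # eq. (11)
    cov_lin = np.outer(mu_lin, mu_lin) * (np.exp(cov_log) - 1)  # eq. (12)
    return mu_lin, cov_lin

def lin_to_cep(mu_lin, cov_lin):
    """Steps 6)-7): linear spectrum -> log spectrum -> cepstrum."""
    mu_log = np.log(mu_lin) - 0.5 * np.log(np.diag(cov_lin) / mu_lin**2 + 1.0)   # eq. (14)
    cov_log = np.log(cov_lin / np.outer(mu_lin, mu_lin) + 1.0)                    # eq. (15)
    return C @ mu_log, C @ cov_log @ C.T                                          # eq. (16)

rng = np.random.default_rng(1)
mu_s, var_s = rng.normal(0, 1, n), np.full(n, 0.05)    # clean-speech Gaussian (cepstral)
mu_h, var_h = rng.normal(0, 0.3, n), np.full(n, 0.02)  # acoustic transfer function Gaussian
mu_n, var_n = rng.normal(0, 1, n), np.full(n, 0.05)    # noise Gaussian (cepstral)
k, k2 = 1.0, 1.0                                       # gain adjustment factors, illustrative

# Step 2): compose speech and acoustic transfer function in the cepstral domain, eq. (9).
mu_sh, var_sh = mu_s + mu_h, var_s + var_h

# Steps 3)-5): move both Gaussians to the linear-spectral domain and add them, eq. (13).
mu_sh_lin, cov_sh_lin = cep_to_lin(mu_sh, var_sh)
mu_n_lin, cov_n_lin = cep_to_lin(mu_n, var_n)
mu_o_lin = k * mu_sh_lin + k2 * mu_n_lin
cov_o_lin = k**2 * cov_sh_lin + k2**2 * cov_n_lin

# Steps 6)-7): back to the cepstral domain for decoding.
mu_o_cep, cov_o_cep = lin_to_cep(mu_o_lin, cov_o_lin)
print(mu_o_cep[:4])
```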
III. ESTIMATION OF THE ACOUSTIC TRANSFER FUNCTION ON THE BASIS OF HMM SEPARATION

Estimation of the HMM parameters of the acoustic transfer function remains a serious problem. The mean vectors of the acoustic transfer function HMM can be derived from measured impulse responses, but it is inconvenient and unrealistic to measure impulse responses for a new environment. This section presents a new method for estimating the HMM parameters of the acoustic transfer function on the basis of HMM separation in the model domain. The estimation is implemented by maximizing the likelihood of adaptation (observed) data from any user's position. A maximum-likelihood (ML) estimation method is presented in [3], where methods of estimating the feature-space and model-space bias are introduced. That re-estimation formula includes the sequence of the adaptation data, and it deals only with convolutional noise. In contrast, the re-estimation formula in HMM separation includes the sufficient statistic (model parameter) instead of the sequence of the adaptation data, and it deals with both additive and convolutional noise.

Model parameters are estimated in an ML manner by using the expectation-maximization (EM) algorithm, which maximizes the likelihood of the observed speech:

$$\hat{\lambda}_{H} = \operatorname*{argmax}_{\lambda_{H}} \Pr(O \mid \lambda_{S}, \lambda_{N}, \lambda_{H}). \qquad (17)$$

Here, $\lambda$ denotes the set of HMM parameters, while the suffixes $S$, $N$, and $H$ denote clean speech, noise, and the acoustic transfer function. The observed speech is now represented by

$$O_{\mathrm{cep}}(d; t) = \mathrm{Cos}\Bigl(\log\bigl(\exp\bigl(\mathrm{Cos}^{-1}(S_{\mathrm{cep}}(d; t) + H_{\mathrm{cep}}(d; t))\bigr) + N_{\mathrm{lin}}(\omega; t)\bigr)\Bigr). \qquad (18)$$

Here, $\mathrm{Cos}$ and $\mathrm{Cos}^{-1}$ are the Fourier (cosine) transform and inverse Fourier (cosine) transform, respectively. $O_{\mathrm{cep}}(d; t)$, $S_{\mathrm{cep}}(d; t)$, and $H_{\mathrm{cep}}(d; t)$ are cepstra for the observed speech, the clean speech, and the acoustic transfer function of quefrency $d$ in the $t$-th frame, and $N_{\mathrm{lin}}(\omega; t)$ is the linear spectrum for noise of frequency $\omega$ in the $t$-th frame. Accordingly, the acoustic transfer function is represented by

$$H_{\mathrm{cep}}(d; t) = \mathrm{Cos}\Bigl(\log\bigl(\exp\bigl(\mathrm{Cos}^{-1}(O_{\mathrm{cep}}(d; t))\bigr) - N_{\mathrm{lin}}(\omega; t)\bigr)\Bigr) - S_{\mathrm{cep}}(d; t). \qquad (19)$$

The separation of HMM's is defined by the operator $\ominus$ in this paper. For example,

$$\lambda_{A} = \lambda_{AB} \ominus \lambda_{B} \qquad (20)$$

denotes the separation of $\lambda_{B}$ from $\lambda_{AB}$. Accordingly, the equation for estimating the acoustic transfer function HMM is written as

$$\hat{\lambda}_{H,\mathrm{cep}} = \mathrm{Cos}\Bigl(\mathrm{Log}\bigl(\mathrm{Exp}\bigl(\mathrm{Cos}^{-1}(\lambda_{O,\mathrm{cep}})\bigr) \ominus \lambda_{N,\mathrm{lin}}\bigr)\Bigr) \ominus \lambda_{S,\mathrm{cep}} \qquad (21)$$

where the suffixes cep and lin represent the cepstral domain and the linear-spectral domain, respectively, and the terms Cos, Log, and Exp are the cosine transform, logarithm transform, and exponential transform of the Gaussian PDF. This equation shows that HMM separation is applied twice to noisy and reverberant speech. First, the HMM separation method is applied in the linear-spectral domain to estimate the distorted-speech HMM's by ML estimation. Then, the distorted-speech HMM's are converted to the cepstral domain, and the HMM separation method is applied again in the cepstral domain to estimate the acoustic transfer function HMM by ML estimation. The procedure is as follows.

1) Re-estimate the parameters $(\mu_{O,\mathrm{cep}}, \Sigma_{O,\mathrm{cep}})$ (focusing on only the observation probability density function) of the composed HMM's with the corresponding transcription by ML estimation in the cepstral domain, using adaptation data. Here, $\mu_{O,\mathrm{cep}}$ and $\Sigma_{O,\mathrm{cep}}$ are the mean vectors and covariance matrices of the observed-speech HMM. Next, estimate the parameters $\lambda_{N}$ of the noise HMM from the signal during a period of noise, and then convert the observed-speech Gaussian PDF to the log-spectral domain:
$$\mu_{O,\mathrm{log}} = \Gamma\, \mu_{O,\mathrm{cep}}, \qquad \Sigma_{O,\mathrm{log}} = \Gamma\, \Sigma_{O,\mathrm{cep}}\, \Gamma^{T} \qquad (22)$$
Here, $\Gamma$ is a cosine transform matrix. The noise HMM, $\lambda_{N}$, is also converted to the log-spectral domain in a similar way. The Gaussian PDF in the log-spectral domain is then converted to the linear-spectral domain:
$$\mu_{O,\mathrm{lin},i} = \exp\left(\mu_{O,\mathrm{log},i} + \tfrac{1}{2}\Sigma_{O,\mathrm{log},ii}\right) \qquad (23)$$
$$\Sigma_{O,\mathrm{lin},ij} = \mu_{O,\mathrm{lin},i}\, \mu_{O,\mathrm{lin},j}\left(\exp(\Sigma_{O,\mathrm{log},ij}) - 1\right) \qquad (24)$$
Here, $\mu_{O,\mathrm{lin}}$ and $\Sigma_{O,\mathrm{lin}}$ are the mean vectors and covariance matrices of the observed-speech HMM, $\lambda_{O}$, in the linear-spectral domain.

2) Separate $\lambda_{N}$ from $\lambda_{O}$ as follows:
$$\lambda_{SH,\mathrm{lin}} = \lambda_{O,\mathrm{lin}} \ominus \lambda_{N,\mathrm{lin}}. \qquad (25)$$
As shown in (25), the HMM separation method deals with the sufficient statistic (model parameter) of the observed speech instead of the sequence of the observed speech.
Fig. 5. Parameter estimation by HMM separation. The HMM separation method is applied twice to noisy and reverberant speech.

3) Convert $\lambda_{SH,\mathrm{lin}}$ to the log-spectral domain:
$$\mu_{SH,\mathrm{log},i} = \log(\mu_{SH,\mathrm{lin},i}) - \frac{1}{2}\log\left(\frac{\Sigma_{SH,\mathrm{lin},ii}}{\mu_{SH,\mathrm{lin},i}^{2}} + 1\right) \qquad (26)$$
$$\Sigma_{SH,\mathrm{log},ij} = \log\left(\frac{\Sigma_{SH,\mathrm{lin},ij}}{\mu_{SH,\mathrm{lin},i}\,\mu_{SH,\mathrm{lin},j}} + 1\right). \qquad (27)$$
Then, convert the Gaussian PDF in the log-spectral domain to the cepstral domain:
$$\mu_{SH,\mathrm{cep}} = \Gamma^{-1}\mu_{SH,\mathrm{log}}, \qquad \Sigma_{SH,\mathrm{cep}} = \Gamma^{-1}\Sigma_{SH,\mathrm{log}}(\Gamma^{-1})^{T} \qquad (28)$$
Here, $\mu_{SH,\mathrm{cep}}$ and $\Sigma_{SH,\mathrm{cep}}$ are the mean vectors and covariance matrices of the distorted-speech HMM, $\lambda_{SH}$, in the cepstral domain.

4) Separate $\lambda_{S}$ from $\lambda_{SH}$ as follows:
$$\lambda_{H,\mathrm{cep}} = \lambda_{SH,\mathrm{cep}} \ominus \lambda_{S,\mathrm{cep}}. \qquad (29)$$
It is impossible to measure the distorted-speech data practically; however, the HMM separation method can deal with the model parameter, $\lambda_{SH}$, instead of the series of the data.

The procedure is summarized in Fig. 5. We estimate the parameters, $\lambda_{O}$, of the composed HMM's by ML estimation in the cepstral domain, using observation data. After separating the distorted-speech HMM, $\lambda_{SH}$, from $\lambda_{O}$ in the linear-spectral domain, the second HMM separation can be executed by using the model parameter, $\lambda_{SH}$, in the cepstral domain. As shown in Fig. 6, the distorted-speech HMM, $\lambda_{SH}$, is separated into the acoustic transfer function HMM, $\lambda_{H}$, and the clean-speech HMM, $\lambda_{S}$, in the cepstral domain. The acoustic transfer function HMM, $\lambda_{H}$, is updated only when the distorted-speech HMM, $\lambda_{SH}$, is updated; the distorted-speech HMM, $\lambda_{SH}$, is updated only when the observed-speech HMM, $\lambda_{O}$, is updated. This is because the assignment of observation vectors to HMM states in the cepstral domain (the second HMM separation) cannot be calculated. Therefore, one iteration is executed in each domain. After separating the acoustic transfer function HMM, $\lambda_{H}$, from $\lambda_{SH}$ in the cepstral domain, we compose HMM's of the speech, acoustic transfer function, and noise, and then repeat procedures 1)–4) until the log-likelihood converges.

Fig. 6. Illustration of model separation. The composed HMM is separated into a known HMM and an unknown HMM.

The acoustic transfer function HMM is estimated in an ML manner by using the expectation-maximization (EM) algorithm, which maximizes the likelihood of the observed speech:

$$\hat{\lambda}_{H} = \operatorname*{argmax}_{\lambda_{H}} \Pr(O \mid \lambda_{S}, \lambda_{N}, \lambda_{H}). \qquad (30)$$

Here, the observed speech is given in the linear-spectral domain by

$$O = S \cdot H + N. \qquad (31)$$

Therefore, $\Pr(O \mid \lambda_{S}, \lambda_{N}, \lambda_{H})$ is represented by

$$\Pr(O \mid \lambda_{S}, \lambda_{N}, \lambda_{H}) = \Pr(S \cdot H + N \mid \lambda_{S}, \lambda_{N}, \lambda_{H}). \qquad (32)$$

When the noise, $N$, is given, and the distorted speech and the noise are independent of each other in the linear-spectral domain, this equation is represented by

$$\Pr(O \mid \lambda_{S}, \lambda_{N}, \lambda_{H}) = \Pr(S \cdot H \mid \lambda_{S}, \lambda_{H}). \qquad (33)$$

Calculating the probability in the cepstral domain gives

$$\Pr(S \cdot H \mid \lambda_{S}, \lambda_{H}) = \Pr(S_{\mathrm{cep}} + H_{\mathrm{cep}} \mid \lambda_{S}, \lambda_{H}). \qquad (34)$$

This equation is identical to that of HMM separation in the cepstral domain. The HMM separation method deals with the model parameter, $\lambda_{SH}$, instead of the series of the data.
The model parameter, $\lambda_{SH}$, is calculated by HMM separation in the linear-spectral domain, which maximizes the likelihood of the observed speech. Therefore, the proposed algorithm is based on the maximum likelihood of the observed speech. The HMM separation method, as shown in Fig. 5, is applied twice to noisy and reverberant speech. Although this method is able to estimate an acoustic transfer function HMM for each clean-speech HMM, it is not possible to collect sufficient adaptation examples of all the phonemes. Therefore, an estimation method based on tied-mixture HMM's is introduced in the following subsections, which describe the model separation in detail. Section III-A describes the separation of a known noise HMM and unknown distorted-speech HMM's, and Section III-B describes the separation of known clean-speech HMM's and an unknown acoustic transfer function HMM.
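The first separation stage works entirely on Gaussian moments in the linear-spectral domain. The following NumPy sketch illustrates that model-domain step as a simple moment subtraction mirroring the composition of (13); the gain factors, the flooring constant, and the 4-bin statistics are illustrative assumptions, and the exact ML re-estimation used in the paper (see (49) below) is not reproduced here.

```python
import numpy as np

def separate_noise_lin(mu_o_lin, cov_o_lin, mu_n_lin, cov_n_lin, k=1.0, k2=1.0, eps=1e-6):
    """Sketch of step 2): remove the noise Gaussian from the observed-speech Gaussian.

    Inputs are Gaussian moments already converted to the linear-spectral domain
    via (22)-(24); the subtraction simply runs the composition of (13) backwards.
    """
    mu_sh = (mu_o_lin - k2 * mu_n_lin) / k
    cov_sh = (cov_o_lin - k2**2 * cov_n_lin) / k**2
    mu_sh = np.maximum(mu_sh, eps)                    # keep spectral means positive
    d = np.diag_indices_from(cov_sh)
    cov_sh[d] = np.maximum(cov_sh[d], eps)            # keep variances positive
    return mu_sh, cov_sh

# Illustrative call with arbitrary 4-bin statistics:
mu_o = np.array([4.0, 3.0, 2.5, 2.0]); cov_o = np.diag([0.5, 0.4, 0.3, 0.3])
mu_n = np.array([1.0, 0.8, 0.6, 0.5]); cov_n = np.diag([0.1, 0.1, 0.1, 0.1])
print(separate_noise_lin(mu_o, cov_o, mu_n, cov_n)[0])
```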
A. Separation of a Noise HMM and Distorted-Speech HMM's

This section describes the separation of a known noise HMM and unknown distorted-speech HMM's. Consider tied-mixture HMM's with diagonal covariance matrices, $\lambda = (A, B)$, where $A$ is the transition probability matrix and $B$ is the set of observation probability density functions (PDF's); $J$ is the number of states. The observation Gaussian PDF $b_{j}(\mathbf{o})$ is given by

$$b_{j}(\mathbf{o}) = \sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}; \mu_{m}, \Sigma_{m}) \qquad (35)$$

where $c_{jm}$ is the probability of mixture $m$ in state $j$, $M$ is the total number of Gaussian PDF's tied by all of the states, and $N(\mathbf{o}; \mu_{m}, \Sigma_{m})$ is the multivariate Gaussian distribution given by

$$N(\mathbf{o}; \mu_{m}, \Sigma_{m}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_{m}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{o} - \mu_{m})^{T}\Sigma_{m}^{-1}(\mathbf{o} - \mu_{m})\right) \qquad (36)$$

where $D$ is the dimension of the adaptation vector $\mathbf{o}$, $\mu_{m}$ and $\Sigma_{m}$ are the mean vector and covariance matrix corresponding to mixture $m$, respectively, and $T$ denotes transposition. (A small numerical sketch of (35) and (36) follows (41) below.)

For an adaptation data sequence $O$, let $q$ and $c$ be the unobserved state sequence and the unobserved mixture component labels, respectively. The joint probability of observing the sequences $O$, $q$, and $c$ can be calculated as

$$\Pr(O, q, c \mid \lambda) = \prod_{t} a_{q_{t-1} q_{t}}\, c_{q_{t} c_{t}}\, N(\mathbf{o}_{t}; \mu_{c_{t}}, \Sigma_{c_{t}}). \qquad (37)$$

Therefore, the probability of observing the sequence $O$ is given by

$$\Pr(O \mid \lambda) = \sum_{q}\sum_{c} \Pr(O, q, c \mid \lambda) \qquad (38)$$

where the summations are taken over all possible state sequences and all possible mixture component labels.

Now, the separation of distorted-speech HMM's is handled in an ML framework:

$$\hat{\lambda}_{SH} = \operatorname*{argmax}_{\lambda_{SH}} \Pr(O \mid \lambda_{SH}, \lambda_{N,\mathrm{lin}}) \qquad (39)$$

where $\lambda_{N,\mathrm{lin}}$ is the model parameter of noise in the linear-spectral domain. The above ML parameter estimation can be solved by using the EM algorithm, which is a two-step iterative procedure. In the first step, called the expectation step (E-step), the following auxiliary function is calculated:

$$Q(\hat{\lambda}_{SH} \mid \lambda_{SH}) = \sum_{e=1}^{E}\sum_{q}\sum_{c} \Pr(O^{(e)}, q, c \mid \lambda_{SH}, \lambda_{N,\mathrm{lin}}) \log \Pr(O^{(e)}, q, c \mid \hat{\lambda}_{SH}, \lambda_{N,\mathrm{lin}}) \qquad (40)$$

where $E$ is the total number of adaptation phonemes, each phoneme having its own adaptation data; $O^{(e)}$ is the observation sequence for phoneme $e$; and $q$ and $c$ are the unobserved state sequence and mixture component labels corresponding to $O^{(e)}$. The joint probability of observing the sequences $O^{(e)}$, $q$, and $c$ can be calculated as

(41)

where $\Pr(\cdot)$ denotes the probability density function of the corresponding random variable.
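A minimal numerical sketch of the tied-mixture observation density of (35) and (36), for diagonal covariances; the state and mixture sizes are illustrative.

```python
import numpy as np

def tied_mixture_pdf(o, c_jm, means, variances):
    """b_j(o) for every state j of a tied-mixture HMM, cf. (35)-(36).

    o         : (D,) observation vector
    c_jm      : (J, M) mixture weights per state (rows sum to 1)
    means     : (M, D) tied Gaussian means
    variances : (M, D) tied diagonal covariances
    """
    diff = o[None, :] - means                                                   # (M, D)
    log_norm = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(axis=1)
    gauss = np.exp(log_norm)                                                    # N(o; mu_m, Sigma_m)
    return c_jm @ gauss                                                         # (J,) values of b_j(o)

# Illustrative sizes: 3 states sharing 4 tied Gaussians in 2 dimensions.
rng = np.random.default_rng(2)
c = rng.dirichlet(np.ones(4), size=3)
mu = rng.normal(size=(4, 2)); var = np.full((4, 2), 0.5)
print(tied_mixture_pdf(np.array([0.1, -0.2]), c, mu, var))
```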
Let the observation Gaussian PDF in the model $\lambda_{SH}$ be of the form shown in (35), and let the observation Gaussian PDF in the model $\lambda_{N}$ be a single Gaussian. Since the model $\lambda_{SH}$ is independent of the model $\lambda_{N}$ in the linear-spectral domain, the mean vector and covariance matrix corresponding to mixture $m$ in the model $\lambda_{O}$ are derived by adding the mean vector and covariance matrix in the model $\lambda_{SH}$ to the mean vector and covariance matrix in the model $\lambda_{N}$:

$$\mu_{O,m} = \mu_{SH,m} + \mu_{N}, \qquad \Sigma_{O,m} = \Sigma_{SH,m} + \Sigma_{N} \qquad (42)$$

where $\mu_{SH,m}$ and $\Sigma_{SH,m}$ are the mean vector and covariance matrix corresponding to mixture $m$ in the model $\lambda_{SH}$, and $\mu_{N}$ and $\Sigma_{N}$ are the mean vector and covariance matrix in the model $\lambda_{N}$. Therefore, (41) can be written as

(43)

where

(44)

Here, we focus on only the terms involving $\mu_{SH,m}$ and $\Sigma_{SH,m}$. Therefore, (44) can be written as shown in (45), where

(45)

(46)

and

(47)

The last term can be calculated efficiently by using the forward-backward algorithm [24]. The M-step in the EM algorithm maximizes the auxiliary function with respect to $\mu_{SH,m}$ and $\Sigma_{SH,m}$:

(48)

which enables us to solve for $\mu_{SH,m}$ and $\Sigma_{SH,m}$. Therefore, we get

(49)

It is straightforward to derive that [25]

(50)

(51)

Equation (49) shows that the HMM separation method deals with the model parameter instead of the sequence of the observed speech. If we have a large amount of adaptation (observation) data, the mean vector, $\mu_{SH,m}$, will not be less than zero. However, it is desirable to adapt the acoustic model to the target environment by using only a small amount of adaptation data, and in this case the mean vector may be less than zero. When a mean vector, $\mu_{SH,m}$, is less than zero, it is not used for estimating the acoustic transfer function; in this paper, only the nonnegative mean vectors are used.
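A minimal sketch of that guard, assuming the re-estimated per-mixture means are held in an array; mixtures whose linear-spectral mean has any negative component are simply excluded from the transfer-function estimate, as described above.

```python
import numpy as np

def usable_mixtures(mu_sh_lin):
    """Return a mask of mixtures whose linear-spectral mean vectors are nonnegative.

    mu_sh_lin : (M, D) re-estimated distorted-speech means in the linear-spectral domain.
    Mixtures with a negative component are skipped when estimating the
    acoustic transfer function.
    """
    return (mu_sh_lin >= 0.0).all(axis=1)

mu = np.array([[0.8, 1.2, 0.5],
               [0.3, -0.1, 0.9]])       # second mixture has a negative component
print(usable_mixtures(mu))              # [ True False]
```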
B. Separation of Clean-Speech HMM's and an Acoustic Transfer Function HMM

This section describes the separation of known clean-speech HMM's and an unknown acoustic transfer function HMM. The HMM separation method is applied in the cepstral domain, as shown in Fig. 5. First, the model parameter $\lambda_{SH}$, which was estimated in Section III-A, is converted to the cepstral domain. Then, the acoustic transfer function HMM, $\lambda_{H}$, is estimated by using maximum likelihood in the model domain:

$$\hat{\lambda}_{H} = \operatorname*{argmax}_{\lambda_{H}} \Pr(O_{SH} \mid \lambda_{S,\mathrm{cep}}, \lambda_{H}) \qquad (52)$$

where $\lambda_{S,\mathrm{cep}}$ is the model parameter of the clean speech in the cepstral domain. It is impossible to measure the distorted-speech data $O_{SH}$ practically; however, the HMM separation method can deal with the model parameter, $\lambda_{SH,\mathrm{cep}}$, instead of the series of the data.

The auxiliary function is defined in a similar way to that in Section III-A:

(53)

Since we focus on only the terms involving the acoustic transfer function, (53) can be written as

(54)

where $\lambda_{S}$ is the tied-mixture HMM, the total number of Gaussian mixtures is $M$, $\mu_{S,m}$ and $\Sigma_{S,m}$ are the mean vector and covariance matrix corresponding to mixture $m$, and $\lambda_{H}$ is a single Gaussian. We assume that the occupancy statistic above is equal to that of Section III-A. The M-step in the EM algorithm maximizes the auxiliary function with respect to $\lambda_{H}$:

(55)

which enables us to solve for $\mu_{H}$ and $\Sigma_{H}$. Therefore,

(56)

Since the model parameter $\lambda_{SH,\mathrm{lin}}$ in the linear-spectral domain can be calculated from (49), the model parameter $\lambda_{SH,\mathrm{cep}}$ is given by

(57)

On the other hand, the mean vector and covariance matrix in the model $\lambda_{SH,\mathrm{cep}}$ can also be represented by using the terms of the composed model as follows:

(58)

(59)

Therefore, we get

(60)

(61)

According to (56), on the assumption that the variance is fixed, the re-estimation formula of $\mu_{H}$ is given by

(62)

This equation shows that the HMM separation method deals with the model parameter, $\mu_{SH,m}$, instead of the data, $O_{SH}$.
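Only the general shape of (62) survives in the text: with the variance fixed, the acoustic transfer function mean is re-estimated in the cepstral domain from distorted-speech model parameters rather than from data. The sketch below assumes the update is a weighted average of the per-mixture difference between the distorted-speech and clean-speech cepstral means; the weights `gamma` stand in for E-step occupancy statistics and are an assumption, not the paper's exact formula.

```python
import numpy as np

def reestimate_h_mean(mu_sh_cep, mu_s_cep, gamma):
    """Hedged sketch of a model-domain mean update for the transfer function.

    mu_sh_cep : (M, D) cepstral means of the distorted-speech model
    mu_s_cep  : (M, D) cepstral means of the tied clean-speech Gaussians
    gamma     : (M,)  assumed occupancy weights (illustrative)
    """
    w = gamma / gamma.sum()
    return w @ (mu_sh_cep - mu_s_cep)      # weighted average of per-mixture differences

mu_sh = np.array([[1.0, 0.2], [0.8, 0.1]])
mu_s = np.array([[0.7, 0.1], [0.5, 0.0]])
print(reestimate_h_mean(mu_sh, mu_s, np.array([3.0, 1.0])))
```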
The re-estimation formula of the covariance matrix, $\Sigma_{H}$, cannot be derived in a similar way. Therefore, we employ the following term:

(63)

The re-estimation formula of $\Sigma_{H}$ can be derived by a Taylor expansion with respect to this term, because use of the EM algorithm causes the estimate to converge. According to (63), on the assumption that the variance is fixed, the re-estimation formula is given by

(64)

Equation (64) shows that the HMM separation method deals with the model parameter instead of the data. Then, we get

(65)

where

(66)

Now, define a function as follows:

(67)

If this function is expanded in a Taylor series through terms of the first order, we obtain

(68)

where the remainder converges to 0. Therefore, the re-estimation formula of $\Sigma_{H}$ is given by

(69)

Equation (69) shows that the HMM separation method deals with the model parameter instead of the data. In this paper, the acoustic transfer function HMM is re-estimated according to (63), (64), and (69).

IV. EXPERIMENTS

Speech recognition experiments were carried out to investigate the effectiveness of the proposed method. We measured distant-talking speech, capturing the sound signal by using a single directional microphone. Section IV-B describes the performance of the HMM composition and separation methods for the speech of a distant stationary speaker, and Section IV-C describes the performance for a distant moving speaker.

A. Experimental Conditions

The recognition algorithm is based on 256 tied-mixture diagonal-covariance HMM's. Each HMM has three states and three self-loops. The models of 55 context-independent phonemes are trained by using 2620 words in the ATR Japanese speech database for speaker-dependent HMM's; the other 500 words in the same database are used for testing. The speaker-independent HMM's are trained by using about 9600 sentences uttered by 64 speakers, which are contained in the Acoustical Society of Japan (ASJ) continuous speech database. The speech signal is sampled at 12 kHz and windowed with a 32-ms Hamming window every 8 ms. Then an FFT is used to calculate 16-order MFCC's (mel-frequency cepstral coefficients) and power. In recognition, the power term is not used, because it is only necessary to adjust the signal-to-noise ratio (SNR) in the HMM composition. In Section IV-C, 16-order MFCC's with their first-order differentials (ΔMFCC), and first-order differentials of normalized logarithmic energy (Δpower), are calculated as the observation vector of each frame. There are 256 Gaussian mixture components with diagonal covariance matrices shared by all of the models for the MFCC and ΔMFCC streams, respectively, and 128 Gaussian mixture components shared by all of the models for Δpower. A single Gaussian is employed to model the noise and the acoustic transfer function.

Real Distant-Talking Speech: To evaluate our method with real speech, we measured distant-talking speech from four sound source positions, p1, ..., p4 (Fig. 7). The distant-talking speech is contaminated by computer noise, air conditioner noise, and ventilation fan noise, and the SNR is 16.7 dB on average. The SNR is calculated as follows:

(70)

where $O(t)$ and $N(t)$ denote the observed speech and the noise at time $t$, respectively.
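A minimal sketch of the SNR measurement of (70), assuming the conventional energy-ratio form with the noise energy taken from a speech-free segment; the exact expression used in the paper is not recoverable here, so the clean-speech energy is approximated by subtracting the noise energy from the observed energy.

```python
import numpy as np

def snr_db(observed, noise):
    """SNR in dB from an observed noisy-speech segment and a noise-only segment."""
    obs_energy = np.mean(np.asarray(observed, dtype=float) ** 2)
    noise_energy = np.mean(np.asarray(noise, dtype=float) ** 2)
    speech_energy = max(obs_energy - noise_energy, 1e-12)   # illustrative approximation
    return 10.0 * np.log10(speech_energy / noise_energy)

rng = np.random.default_rng(3)
speech = np.sin(2 * np.pi * 200 * np.arange(12000) / 12000.0)
noise = 0.1 * rng.standard_normal(12000)
print(f"{snr_db(speech + noise, noise):.1f} dB")
```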
Fig. 7. Recording condition for the speech of a distant stationary speaker.

Fig. 8. Cepstral coefficients for different sound source positions.

One male speaker is used as the testing speaker in the speaker-dependent (SD) experiments. Two male speakers and one female speaker are used as the testing speakers in the speaker-independent (SI) experiments. Each testing speaker utters 1–50 words as adaptation data, which are not used in the training. For testing, 500 words, which are different from the words used in the training, are used. Section IV-B1 describes word-recognition experiments carried out on real distant-talking speech.

Simulated Distant-Talking Speech: To evaluate our method for unknown positions and compare it with cepstral mean normalization (CMN), we used simulated distant-talking speech. We measured nine transfer functions corresponding to nine sound source positions, h1, ..., h5 and p1, ..., p4, by using the method reported in [26]. Distant-talking speech was simulated by linear convolution of clean speech and the measured impulse responses. The length of the original impulse response was about 180 ms. Fig. 8 shows the cepstral coefficients of the acoustic transfer functions for several training positions; the differences shown will cause degradation of speech recognition. Sections IV-B2–IV-B4 describe word-recognition experiments that we carried out with simulated distant-talking speech.

B. Evaluation for Speech of the Distant Stationary Speaker

1) Performance for Real Distant-Talking Speech: Recognition results obtained with the speaker-dependent (SD) and speaker-independent (SI) models are shown in Figs. 9 and 10.

Fig. 9. Word-recognition rates with speaker-dependent models in a real environment.

Fig. 10. Word-recognition rates with speaker-independent models in a real environment.

The recognition rate with the initial HMM's (clean-speech HMM's) is 77.2% for the SD model and 54.4% for the SI model. The recognition rate with HMM's composed of clean-speech HMM's and a noise HMM is 87.5% for the SD model and 61.5% for the SI model. Application of the HMM separation method to only the mean vector, "Sepa.(Mean)," improved the recognition rate for 10 adaptation words to 90.5% for the SD model and 64.9% for the SI model, where only the mean vector was estimated for the acoustic transfer function. Application of the HMM separation method to both the mean vector and covariance matrix, "Sepa.(Mean,Cov.)," increases the performance to 91.2% for the SD model and 66.2% for the SI model. These results show the effectiveness of the estimated covariance matrix of the acoustic transfer function. Some adaptation data cause a small decrease in the recognition rate; this is because there is a mismatch between the adaptation data and the testing data. The recognition rate in the case of the known acoustic transfer function is 92.2% for the SD model and 67.8% for the SI model.
Fig. 11. Word-recognition rates in a reverberant environment.

Fig. 12. Cepstral distance between known-training positions and unknown-testing positions.
Here, the mean value of the $d$-th cepstral coefficient for the acoustic transfer function is given by

$$\mu_{H}(d) = \frac{1}{T}\sum_{t=1}^{T}\bigl(O_{\mathrm{cep}}(d; t) - S_{\mathrm{cep}}(d; t)\bigr) \qquad (71)$$

where $T$ is the total number of frames in the training data, $O_{\mathrm{cep}}(d; t)$ is the $d$-th cepstral coefficient in frame $t$ for the distorted speech generated by linear convolution, and $S_{\mathrm{cep}}(d; t)$ is the $d$-th cepstral coefficient in frame $t$ for the clean speech. The covariance is given by

$$\Sigma_{H}(d) = \frac{1}{T}\sum_{t=1}^{T}\bigl(O_{\mathrm{cep}}(d; t) - S_{\mathrm{cep}}(d; t) - \mu_{H}(d)\bigr)^{2} \qquad (72)$$

where we assume that the cepstral coefficients are uncorrelated. In this experiment, 500 words were used for the training. These recognition results show that the performance of the HMM separation method becomes closer to that of the case using the known acoustic transfer function as the amount of adaptation data increases. Finally, in the case of the matched condition, the SD and SI recognition rates are 96.4% and 70.7%, where each phoneme HMM is trained by using simulated distant-talking speech.

2) Comparison with Cepstral Mean Normalization (CMN): Fig. 11 shows the speaker-dependent experimental results averaged over p1, ..., p4 for different amounts of adaptation data, where distant-talking speech is simulated by linear convolution of clean speech and the measured impulse responses. The recognition rate with the initial HMM's (clean-speech HMM's) is 88.1%. Application of the HMM separation method to only the mean vector, "Sepa.(Mean)," improves the performance to 91.8% with 10 adaptation words. Application of the HMM separation method to both the mean vector and covariance matrix, "Sepa.(Mean,Cov.)," increases the performance by about 1%. In the CMN-based testing case, the phoneme HMM's are trained by using CMN-processed clean-speech data. Subtraction of each cepstral mean value from each of the testing data gives a recognition rate of 80.7%. The experimental results clearly show that the simple CMN technique does not work well. In this simulated experiment, the silent parts of the samples are excluded; the length of one word is about 0.6 s on average. Fig. 11 also shows the recognition rate in the case of the known acoustic transfer function, where the model parameters of the acoustic transfer function are estimated according to (71) and (72). The recognition rate in the case of the known acoustic transfer function is 92.8%. These results show that there is essentially no difference between the known acoustic transfer function and the acoustic transfer function estimated with the HMM separation method. In the case of the matched condition, the SD recognition rate is 96.6%, where each phoneme HMM is trained by using acoustically distorted speech. Comparison of this result with that of the composed HMM, "Sepa.(Mean,Cov.)," shows a difference in performance of 3.3%.

3) Performance for Unknown Positions: The performance of the proposed method was evaluated for unknown positions of the testing speaker. The five positions, h1, ..., h5, were used for the model composition. Fig. 4 shows the acoustic transfer function HMM: each state directly corresponds to one of the training positions h1, ..., h5, all transitions among states are permitted, and their probabilities are set to 0.2. The other four positions, p1, ..., p4, were used for the recognition tests. Fig. 12 shows the cepstral distance between the known-training positions and the unknown-testing positions. The cepstral distance is given by

(73)

where $D$ is the cepstral order, $C_{h}(d)$ is the $d$-th cepstral coefficient for the known-training position, and $C_{p}(d)$ is the $d$-th cepstral coefficient for the unknown-testing position. For example, the training position h2 is the closest position to the testing position p2 in the cepstral domain.
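A minimal sketch of the cepstral distance of (73) between a training-position transfer function and a testing-position transfer function, assuming the plain Euclidean form over the cepstral coefficients; the exact normalization used in the paper is not recoverable here.

```python
import numpy as np

def cepstral_distance(c_train, c_test):
    """Euclidean cepstral distance between two acoustic transfer functions, cf. (73)."""
    c_train = np.asarray(c_train, dtype=float)
    c_test = np.asarray(c_test, dtype=float)
    return float(np.sqrt(np.sum((c_train - c_test) ** 2)))

# Illustrative 4-coefficient vectors for a known and an unknown position:
print(cepstral_distance([0.5, -0.2, 0.1, 0.0], [0.4, -0.1, 0.2, 0.05]))
```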
Table I shows the average word-recognition rates with the SD model for the known-training and unknown-testing positions, both before and after compensating for the acoustic transfer function.

TABLE I
WORD-RECOGNITION RATES [%] FOR KNOWN/UNKNOWN POSITIONS

The recognition rates with the composed ergodic HMM's for the known-training and unknown-testing positions are 87.2% and 86.2%, respectively. The table confirms that the degradation between the training and testing sound source positions is relatively small for all composed HMM's. This is because the cepstral distance between a testing position and the closest training position is not large, as shown in Fig. 12.

Fig. 13. Word-recognition rates and cepstral distance for an unknown position p2.

Fig. 13 shows the recognition rates for the unknown-testing position p2 obtained by using the acoustic transfer function of each of the training positions h1, ..., h5, and also shows the cepstral distance between the testing position p2 and each of the training positions h1, ..., h5. This figure indicates that the closest position gives the best performance, 86.2%. When the cepstral distance is greater than that of the closest position, the performance is reduced; in the case of the training position h1 the recognition rate decreases to 82.6%. The figure also shows that the performance difference between the acoustic transfer function HMM of the closest position and the ergodic HMM of the acoustic transfer functions (shown in Table I) is quite small, because the decoded path obtained by using the ergodic HMM finds the optimal combination of the acoustic transfer function HMM and the clean-speech HMM.

TABLE II
WORD-RECOGNITION RATES [%] WITH TEN ADAPTATION WORDS AT VARIOUS SNRS

4) Performance at Various SNR's: The recognition rates at various SNR's with 10 adaptation words are shown in Table II, where the computer-noise signal is added to the simulated distant-talking speech signal for p1 at various SNR's: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. At an SNR of 0 dB, the recognition rate with the clean-speech HMM's (HMM-S) is 7.8%. The recognition rate with the composed HMM's (HMM-SN) obtained from the clean-speech HMM's and the noise HMM is 81.6%. Applying the HMM composition and separation methods to noisy and acoustically distorted speech, "Sepa.(Mean)," increases the performance by about 1.0%, where the mean vector of the acoustic transfer function HMM is estimated and composed. At an SNR of 20 dB, the recognition rate is improved from 82.8% to 92.7%. The recognition rate with the matched HMM's is 88.2% at an SNR of 0 dB and 96.4% at an SNR of 20 dB. In comparison with the performance of the matched HMM's, the difference is 5.7% at an SNR of 0 dB and 3.7% at an SNR of 20 dB.

C. Evaluation for the Speech of a Distant Moving Speaker

This section describes the performance of the HMM composition and separation methods for recognition of the speech of a distant moving speaker. The speech of the distant moving speaker is recognized by using an ergodic HMM of acoustic transfer functions. Each state of the ergodic HMM of acoustic transfer functions corresponds to a position in a room, and all transitions among states are permitted. Therefore, the proposed ergodic HMM of acoustic transfer functions is able to trace the positions of sound sources.

1) Speech Data Collection for a Distant Moving Speaker: Recognition experiments were conducted to evaluate the effectiveness of an ergodic HMM of acoustic transfer functions for recognition of the speech of a distant moving speaker. Fig. 14 shows the recording condition for the speech of the distant moving speaker. One male speaker walks from the "starting position" shown in Fig. 14 and utters 31 sentences while moving. We also recorded distant-talking speech from the sound source positions g1, g2, and g3 with the speaker stationary. One sentence is used for estimation of the acoustic transfer function.

2) Results for the Speech of a Distant Moving Speaker: The points to be investigated are the performance of
• parallel models of acoustic transfer functions, in which composed HMM's for each acoustic transfer function are set up separately, likelihood scores for each composed HMM are calculated, and the composed HMM's having the maximum likelihood are then selected; and
• ergodic models of acoustic transfer functions.
A phrase recognition experiment was carried out for continuous-sentence speech, in which the sentences included six to seven phrases on average. The task contained 306 phrases with a phrase perplexity of 306. The phrase accuracy is calculated as follows:

$$\mathrm{Accuracy} = \frac{N - D - S - I}{N} \times 100\ [\%] \qquad (74)$$

where $N$ is the total number of phrases, $D$ the number of deletions, $S$ the number of substitutions, and $I$ the number of insertions.
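The accuracy of (74) follows directly from these counts; a one-line check with illustrative numbers, not results from the paper:

```python
def phrase_accuracy(n_total, n_del, n_sub, n_ins):
    """Phrase accuracy [%] as in (74): (N - D - S - I) / N * 100."""
    return 100.0 * (n_total - n_del - n_sub - n_ins) / n_total

print(phrase_accuracy(306, 10, 40, 15))   # illustrative counts only
```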
Fig. 14. Recording condition for the speech of a distant moving speaker.

Fig. 15. Example of a composed ergodic HMM in experiments with a distant moving speaker.

TABLE III
PHRASE ACCURACY [%] FOR THE SPEECH OF THE DISTANT STATIONARY SPEAKER

TABLE IV
PHRASE ACCURACY [%] FOR THE SPEECH OF A DISTANT MOVING SPEAKER

The phrase accuracy for the speech of a nearby speaker was 90.4%. Table III shows the average phrase accuracy [%] for the speech of the distant stationary speaker. The phrase accuracy with the clean-speech HMM's was 69.5%. Next, we composed the clean-speech HMM's with each of the acoustic transfer function HMM's, g1, g2, and g3. The performance of the parallel models, where the composed HMM's having the maximum likelihood are selected, is 76.5% on average. The performance of the composed ergodic HMM's (shown in Fig. 15) is 75.5% on average. Comparison of this result with that of the parallel models shows a difference in performance of 1.0%. This is because all transition probabilities of the acoustic transfer functions in the ergodic HMM are set equally, so a wrong path may be chosen.

Table IV shows the average phrase accuracy [%] for speech recognition of the distant moving speaker. The phrase accuracy with the clean-speech HMM's is 63.3%. The performance of the parallel models, where the composed HMM's having the maximum likelihood are selected, is 76.7%. The performance with the ergodic HMM's of the acoustic transfer functions at g1, g2, and g3 is improved to 82.3%. These experimental results show the effectiveness of the ergodic HMM's for recognition of the speech of a distant moving speaker.

V. CONCLUSION

This paper has detailed a robust speech recognition technique for acoustic model adaptation based on the HMM composition and separation methods in noisy and reverberant environments, where a user speaks from a distance of 0.5 m–3.0 m. The aim of the HMM composition and separation methods is to estimate the model parameters so as to adapt the model to a
target environment by using a small amount of a user’s speech in noisy and reverberant environments. In this paper, the HMM composition algorithm for additive noise was extended to model the acoustic transfer function of a reverberant room. In this approach, an attempt is made to model the acoustic transfer function by means of an HMM. The states of the acoustic transfer function HMM correspond to different sound source positions. This HMM can represent the positions of sound sources, even if the speaker moves. In Section III, a new method was proposed for estimating the HMM parameters of the acoustic transfer function on the basis of HMM separation. This method is able to estimate the model parameters by using observed speech uttered from an unknown position without measurement of impulse responses. The estimated acoustic transfer function, clean-speech HMM’s, and noise HMM are composed to recognize noisy and reverberant speech. Speech recognition experiments were carried out to investigate the effectiveness of the HMM composition and separation methods for real speech of the distant stationary speaker. The proposed method improves the word-recognition rates for the speaker-dependent (SD) and speaker-independent (SI) models. The experimental results also show that the covariance matrix of the acoustic transfer function is an effective means of compensating for the influence of long impulse responses. However, the performance of the proposed method is lower than that of the matched condition, where each phoneme HMM is trained by using simulated distant-talking speech. Therefore, further improvement of the HMM adaptation method is necessary. This paper also investigated the performance of the HMM composition and separation methods for recognition of the speech of a distant moving speaker. Such speech is recognized by using an ergodic HMM of acoustic transfer functions. The experimental results show that the ergodic HMM can improve the performance of speech recognition for a distant moving speaker.
In summary, the HMM composition and separation methods are applicable to a wide variety of additive noise and convolutional distortion tasks. In future work, we will investigate how to choose the number of states in the ergodic HMM, and how to estimate the transition probabilities of acoustic transfer functions in the ergodic HMM.

REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1979.
[2] A. Acero, "Acoustical and environmental robustness in automatic speech recognition," Ph.D. dissertation, Sept. 1990.
[3] A. Sankar and C.-H. Lee, "Robust speech recognition based on stochastic matching," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1995, pp. 121–124.
[4] V. Abrash, A. Sankar, H. Franco, and M. Cohen, "Acoustic adaptation using transformations of HMM parameters," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1996, pp. 729–737.
[5] A. P. Varga and R. K. Moore, "Hidden Markov model decomposition of speech and noise," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1990, pp. 845–848.
[6] M. J. F. Gales and S. J. Young, "An improved approach to the hidden Markov model decomposition of speech and noise," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1992, pp. 233–236.
[7] M. J. F. Gales and S. J. Young, "PMC for speech recognition in additive and convolutional noise," Tech. Rep. CUED-F-INFENG-TR154, 1993.
[8] F. Martin, K. Shikano, and Y. Minami, "Recognition of noisy speech by composition of hidden Markov models," in Proc. EUROSPEECH '93, 1993, pp. 1031–1034.
[9] Y. Minami and S. Furui, "A maximum likelihood procedure for a universal adaptation method based on HMM composition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1995, pp. 129–132.
[10] Y. Minami and S. Furui, "Adaptation method based on HMM composition and EM algorithm," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1996, pp. 327–330.
[11] M. Afify, Y. Gong, and J.-P. Haton, "A general joint additive and convolutive bias compensation approach applied to noisy Lombard speech recognition," IEEE Trans. Speech Audio Processing, vol. 6, pp. 524–538, Nov. 1998.
[12] O. Siohan and C.-H. Lee, "Iterative noise and channel estimation under the stochastic matching algorithm framework," IEEE Signal Processing Lett., vol. 4, no. 11, pp. 304–306, 1997.
[13] S. Nakamura, T. Takiguchi, and K. Shikano, "Noise and room acoustics distorted speech recognition by HMM composition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1996, pp. 69–72.
[14] T. Takiguchi, "Hands-free speech recognition by HMM composition in noisy reverberant environments," M.S. thesis, Nara Inst. Sci. Technol., Nara, Japan, 1996.
[15] T. Takiguchi, S. Nakamura, Q. Huo, and K. Shikano, "Adaptation of model parameters by HMM decomposition in noisy reverberant environments," in Proc. ESCA-NATO Workshop Robust Speech Recognition Unknown Communication Channels, 1997, pp. 155–158.
[16] T. Takiguchi, S. Nakamura, Q. Huo, and K. Shikano, "Model adaptation based on HMM decomposition for reverberant speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1997, pp. 827–830.
[17] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 145–152, Feb. 1988.
[18] H. Wang and F. Itakura, "An approach of dereverberation using multi-microphone sub-band envelope estimation," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1991, pp. 953–956.
[19] P. W. Shields and D. R. Campbell, "Intelligibility improvements obtained by an enhancement method applied to speech corrupted by noise and reverberation," in Proc. ESCA-NATO Workshop Robust Speech Recognition Unknown Communication Channels, 1997, pp. 91–94.
[20] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani, "Hands-free speech recognition in a noisy and reverberant environment," in Proc. ESCA-NATO Workshop Robust Speech Recognition Unknown Communication Channels, 1997, pp. 195–198.
[21] T. Yamada, S. Nakamura, and K. Shikano, "Robust speech recognition with speaker localization by a microphone array," in Proc. Int. Conf. Spoken Language Processing, 1996, pp. 1317–1320.
[22] G. W. Mackenzie, Acoustics. New York: Focal, 1964.
[23] M. Tohyama, H. Suzuki, and Y. Ando, The Nature and Technology of Acoustic Space. New York: Academic, 1995. [24] L. E. Baum, “An inequality and associated maximization techniques in statistical estimation for probabilistic functions of Markov processes,” Inequalities, vol. 3, pp. 1–8, 1972. [25] B.-H. Juang, “Maximum-likelihood estimation of mixture multivariate stochastic observations of Markov chains,” AT&T Tech. J., vol. 64, no. 6, pp. 1235–1249, 1985. [26] Y. Suzuki, F. Asano, H.-Y. Kim, and T. Sone, “An optimum computergenerated pulse signal suitable for the measurement of very long impulse responses,” J. Acoust. Soc. Amer., vol. 97, no. 2, pp. 1119–1123, 1995.
Tetsuya Takiguchi (M’99) received the B.S. degree in applied mathematics from Okayama University of Science, Japan, in 1994, and the M.E. and Ph.D. degrees in information science from Nara Institute of Science and Technology, Nara, Japan, in 1996 and 1999, respectively. From April 1996 to March 1999, he was a Research Assistant at Nara Institute of Science and Technology. He is currently a research staff member at IBM Research, Tokyo Research Laboratory, Kanagawa, Japan. Dr. Takiguchi is a member of the Acoustical Society of Japan.
Satoshi Nakamura (M'89) was born in Japan on August 4, 1958. He received the B.S. and Ph.D. degrees in electronics engineering from Kyoto Institute of Technology, Kyoto, Japan, in 1981 and 1992, respectively. From 1981 to 1986 and from 1990 to 1993, he was with the Central Research Laboratory, Sharp Corporation, Nara, Japan, where he was engaged in speech recognition research. From 1986 to 1989, he was a Researcher with the Speech Processing Department, ATR Interpreting Research Laboratories, Kyoto. From 1994 to 2000, he was an Associate Professor with the Graduate School of Information Science, Nara Institute of Science and Technology. In 1996, he was a Visiting Research Professor with the CAIP Center, Rutgers, The State University, New Brunswick, NJ. He is currently the Head of the Department, ATR Spoken Language Translation Laboratories. His current research interests include speech recognition, speech translation, spoken dialogue systems, stochastic modeling of speech, and microphone arrays. Dr. Nakamura received the Awaya Award from the Acoustical Society of Japan in 1992. He is a member of the Acoustical Society of Japan and the Information Processing Society of Japan.
Kiyohiro Shikano (M'84) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Nagoya University, Nagoya, Japan, in 1970, 1972, and 1980, respectively. He is currently a Professor with the Nara Institute of Science and Technology (NAIST), Nara, Japan, where he is directing the Speech and Acoustics Laboratory. His major research areas are speech recognition, multimodal dialog systems, speech enhancement, adaptive microphone arrays, and acoustic field reproduction. From 1972, he was with NTT Laboratories, where he was engaged in speech recognition research. During 1990–1993, he was the Executive Research Scientist at NTT Human Interface Laboratories, where he supervised research on speech recognition and speech coding. During 1986–1990, he was the Head of the Speech Processing Department at ATR Interpreting Telephony Research Laboratories, Kyoto, Japan, where he was directing speech recognition and speech synthesis research. During 1984–1986, he was a Visiting Scientist at Carnegie Mellon University, Pittsburgh, PA, where he worked on distance measures, speaker adaptation, and statistical language modeling. Dr. Shikano received the Yonezawa Prize from IEICE in 1975, the Signal Processing Society 1990 Senior Award from IEEE in 1991, and the Technical Development Award from ASJ in 1994. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan (IEICE), the Information Processing Society of Japan, and the Acoustical Society of Japan (ASJ).