IEICE TRANS. INF. & SYST., VOL.E88–D, NO.3 MARCH 2005
PAPER
Special Section on Corpus-Based Speech Technologies
Multiple Regression of Log Spectra for In-Car Speech Recognition Using Multiple Distributed Microphones

Weifeng LI†a), Tetsuya SHINDE†, Hiroshi FUJIMURA††, Chiyomi MIYAJIMA††, Takanori NISHINO††, Katunobu ITOU††, Nonmembers, Kazuya TAKEDA††, Member, and Fumitada ITAKURA†††, Fellow
SUMMARY  This paper describes a new multi-channel method for noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on the multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: 1) it does not require a sensitive geometric layout, calibration of the sensors, or additional pre-processing for tracking the speech source; 2) it requires very little computation; and 3) the regression weights can be statistically optimized over the given training data. Once the optimal regression weights are obtained by regression learning, they can be used to generate the estimated log spectrum in the recognition phase, where the close-talking speech is no longer required. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison to the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
key words: speech recognition, microphone arrays, adaptive beamforming, signal-to-deviation ratio, multiple regression
1. Introduction

Improving the accuracy of speech recognition in noisy environments is one of the important issues in extending the application domain of speech recognition technology [1]. Among the various approaches previously proposed for noisy speech recognition, speech enhancement based on multi-channel data acquisition is currently under extensive research. The most fundamental and important multi-channel method is the microphone-array beamformer, because the assumption that the target and the interfering signals are not spatially dispersed, and are apart from each other, is reasonable in many situations. In other words, the microphone-array beamformer is effective when the positions of the speaker and the noise sources are predetermined and do not change during the signal acquisition process.

On the other hand, when the spatial configuration of the speaker and the noise sources is unknown or changes continuously, the directivity of the beamformer must be adaptively controlled to follow the new environment. One of the most common approaches for adaptive microphone array systems is the Generalized Sidelobe Canceller (GSC), introduced by Griffiths and Jim [2]. Many modern schemes (e.g. [3], [4]) are derived from this structure. However, these techniques require a large amount of computation for updating the filter parameters, as well as precise arrangement and calibration of the sensors [5]. Moreover, a persistent problem with microphone arrays has been their poor low-frequency directivity for practical array dimensions [6].

As an approach to distant speech recognition, on the other hand, the authors have earlier proposed the feature averaging method, which uses an averaged cepstrum of the distant speech signals captured through distributed microphones [7]. We experimentally confirmed that feature averaging improves the recognition accuracy of distant speech by about 20% compared with a single distant channel. By extending the simple averaging, i.e., equal weighting, to regression, i.e., optimal weighting, in this paper we propose Multiple Regression of the Log Spectrum (MRLS) of speech captured by spatially distributed microphones, so as to approximate the speech at the close-talking microphone in the log spectrum domain∗. Since MRLS is much simpler than microphone-array techniques, the following advantages are expected: 1) the method does not require a sensitive geometric layout, calibration of the sensors, or additional pre-processing for tracking the speech source; 2) it requires very little computation; and 3) the regression weights can be statistically optimized over the given training data. Once the optimal regression weights are obtained by regression learning, they can be used to generate the estimated log spectrum in the recognition phase, where the close-talking speech is no longer required.

The aim of this paper is to describe the proposed method and evaluate its performance for in-car speech recognition. In Sect. 2, we introduce the spectral regression method for speech enhancement. In Sect. 3, the in-car speech database used for the evaluation experiments is described. In Sect. 4, we present evaluation experiments, and Sect. 5 summarizes this paper.

∗ Minimizing the regression error in the log spectrum domain is equivalent to minimizing the cepstrum distance, due to the orthogonality of the discrete time cosine transform (DCT).

Manuscript received June 23, 2004. Manuscript revised September 14, 2004.
† The authors are with the Department of Information Electronics, Graduate School of Engineering, Nagoya University, Nagoya-shi, 464–8603 Japan.
†† The authors are with the Department of Media Science, Graduate School of Information Science, Nagoya University, Nagoya-shi, 464–8603 Japan.
††† The author is with the Faculty of Science and Technology, Meijo University, Nagoya-shi, 468–8502 Japan.
a) E-mail: [email protected]
DOI: 10.1093/ietisy/e88–d.3.384
2. Multiple Regression of Log Spectra (MRLS)

2.1 Two-Dimensional Taylor-Series Expansion of Log Spectrum

Assume that the speech signal x_i(t) at the i-th microphone position is given by a mixture of the source speech s(t) and the noise n(t), convolved with the transfer functions to that position, h_i(t) and g_i(t), i.e.,

  x_i(t) = h_i(t) ∗ s(t) + g_i(t) ∗ n(t),   (1)

as shown in Fig. 1.

Fig. 1  Signal captured through distributed microphones.

Assume also that the power spectrum of x_i(t) is given by the 'power sum' of the filtered speech and noise, i.e.,

  X_i(ω) = |H_i(ω)|^2 S(ω) + |G_i(ω)|^2 N(ω),

where S(ω), X_i(ω) and N(ω) are the power spectra of the speech signal at its source position, the noisy speech signal at the i-th microphone position, and the noise signal at its source position, respectively. (The frequency index (ω) will be omitted in the remainder of this paper.) Consequently, the corresponding log power spectrum of the signal at the i-th microphone position is given by

  log X_i = log(|H_i|^2 S + |G_i|^2 N).   (2)

The derivative of log X_i can be calculated as

  Δ log X_i = (∂ log X_i / ∂ log S) Δ log S + (∂ log X_i / ∂ log N) Δ log N
            = a_i Δ log S + b_i Δ log N,   (3)

where a_i and b_i are given by

  a_i = |H_i|^2 S / (|H_i|^2 S + |G_i|^2 N),   (4)
  b_i = |G_i|^2 N / (|H_i|^2 S + |G_i|^2 N).   (5)

Note that both a_i and b_i are functions of the ratio between signal and noise at their source positions, i.e., S/N. Small deviations in the log power spectrum of the signal at the i-th microphone position can be approximated by a two-dimensional Taylor-series expansion around X_i^0, i.e.,

  log X_i − log X_i^0 ≈ a_i (log S − log S^0) + b_i (log N − log N^0),   (6)

where log X_i^0 = a_i log S^0 + b_i log N^0. Using the superscript (·)^(d) for the deviation from (·)^0, e.g., log X_i^(d) = log X_i − log X_i^0, the Taylor expansion can be rewritten as

  log X_i^(d) ≈ a_i log S^(d) + b_i log N^(d).   (7)
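To make the linearization concrete, the following short Python check (our own illustrative sketch, not part of the original derivation; the transfer-function gains and the operating point are arbitrary values) compares the exact change of log X_i with the first-order approximation of Eq. (7) for a single frequency bin.

import numpy as np

# Arbitrary illustrative values for one microphone and one frequency bin.
H2, G2 = 0.6, 0.3          # |H_i|^2 and |G_i|^2
S0, N0 = 2.0, 0.5          # operating-point power spectra of speech and noise

# Taylor coefficients a_i and b_i at the operating point (Eqs. (4)-(5)).
a = H2 * S0 / (H2 * S0 + G2 * N0)
b = G2 * N0 / (H2 * S0 + G2 * N0)

# Small deviations of the source speech and noise powers.
S = S0 * 1.10              # +10% speech power
N = N0 * 0.95              # -5% noise power
dlogS = np.log(S) - np.log(S0)
dlogN = np.log(N) - np.log(N0)

# Exact deviation of log X_i versus its first-order approximation (Eq. (7)).
dlogX_exact = np.log(H2 * S + G2 * N) - np.log(H2 * S0 + G2 * N0)
dlogX_approx = a * dlogS + b * dlogN
print(dlogX_exact, dlogX_approx)   # the two values agree closely

For small relative changes of S and N the two values agree closely, which is the regime in which the regression of the next subsection operates.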
2.2 Multiple Regression of Multi-Channel Signals

Approximation of log S^(d) by the multiple regression of log X_i^(d) has the form

  log S^(d) ≈ Σ_{i=1}^{M} λ_i log X_i^(d),   (8)

where M denotes the number of microphones and λ_i is the regression weight. Note that the feature averaging method can be realized by setting equal weights, i.e., λ_i = 1/M, regardless of the microphone positions. By substituting Eq. (7) into this expression, the regression error of the approximation, ε, can be calculated as follows:

  ε = ( log S^(d) − Σ_{i=1}^{M} λ_i log X_i^(d) )^2
    = ( log S^(d) − Σ_{i=1}^{M} λ_i ( a_i log S^(d) + b_i log N^(d) ) )^2
    = ( ( 1 − Σ_{i=1}^{M} λ_i a_i ) log S^(d) − ( Σ_{i=1}^{M} λ_i b_i ) log N^(d) )^2.

Assuming the orthogonality between log S^(d) and log N^(d), the expectation of the regression error becomes

  E[ ( 1 − Σ_{i=1}^{M} λ_i a_i )^2 { log S^(d) }^2 + ( Σ_{i=1}^{M} λ_i b_i )^2 { log N^(d) }^2 ].

The minimum regression error is then achieved when

  Σ_{i=1}^{M} E{a_i} λ_i = 1,   Σ_{i=1}^{M} E{b_i} λ_i = 0.   (9)

Thus, the optimal {λ_i} can be uniquely determined as a vector that is orthogonal to {b_i} and whose inner product with {a_i} is equal to unity. The relationship among these three vectors is shown in Fig. 2. Here, a_i and b_i correspond to the signal-to-noise and noise-to-signal ratios, respectively, at the microphone position, and the relationship

  a_i + b_i = 1   (10)

holds for every microphone position. Therefore, once the signal-to-noise ratios at all microphone positions are given, λ_i can be uniquely determined.

Fig. 2  The geometric relationship among the optimal regression weights λ and the Taylor series coefficients a and b. In the log-power-spectrum domain, a_i + b_i = 1 continues to hold.
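As a small numerical illustration of Eqs. (9) and (10) (our own sketch; the values of E{a_i} are arbitrary, and the two-microphone case is chosen because the two constraints then determine the weights exactly), the regression weights can be computed directly from the expected Taylor coefficients:

import numpy as np

# Expected Taylor coefficients E{a_i} at each microphone (illustrative values);
# a_i plays the role of a local SNR term, and b_i = 1 - a_i from Eq. (10).
a = np.array([0.9, 0.6])
b = 1.0 - a

# Eq. (9): the optimal weights satisfy  a . lam = 1  and  b . lam = 0.
A = np.vstack([a, b])
lam = np.linalg.solve(A, np.array([1.0, 0.0]))
print(lam)                       # [ 1.333..., -0.333...] for the values above
print(a @ lam, b @ lam)          # -> 1.0 and 0.0, as required

For more than two microphones these two constraints alone do not pin down the weights; in practice the weights are obtained by the regression training described next.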
Multiple regression in the log power spectrum domain can therefore be regarded as an implicit estimation of the local SNR at each microphone position.

Note that in the practical implementation, the optimal {λ_i} is obtained by regression training using the log spectrum of the close-talking microphone and those of the five distant microphones simultaneously, i.e., the optimal weights are obtained by minimizing the mean squared error

  E = Σ_{n=1}^{N} { [log S^(d)]_n − Σ_{i=1}^{M} λ_i [log X_i^(d)]_n }^2   (11)

over the training examples. Here, [log S^(d)]_n and [log X_i^(d)]_n denote the output of the close-talking microphone and of the i-th distant microphone for the n-th training example, respectively, and N denotes the number of training examples. Finally, the optimal {λ_i} can be obtained by solving the simultaneous equations

  ∂E/∂λ_i = 0,   i = 1, . . . , M.   (12)

In the recognition phase, the optimal weights are used to generate the estimated log spectrum through Eq. (8), where the close-talking speech is no longer required.

On the other hand, when the multiple regression is performed in the power-spectrum domain, {a_i} and {b_i} are given by

  a_i = |H_i|^2,   b_i = |G_i|^2,   (13)

since X_i = |H_i|^2 S + |G_i|^2 N holds. However, unlike in the log power spectrum domain, |H_i| and |G_i| are independent: they cannot be uniquely related to the optimized λ_i.

2.3 Implementation

We have devised the following procedure to implement the technique. The log power spectrum is calculated through mel-filter bank (MFB) analysis, followed by a log operation [8]. The spectrum of the speech captured by the close-talking microphone, X_0, is used as the speech at the source position S. All log power spectra log X_i are normalized so that their means over an utterance become zero, i.e., log X_i^(d) ≈ log X_i − mean(log X_i).

Note that in this implementation, minimizing the regression error is equivalent to minimizing the MFCC distance between the approximated and the target spectra, due to the orthogonality of the discrete time cosine transform (DCT) matrix. Therefore, MRLS has the same form as the maximum likelihood optimization of the filter-and-sum beamformer proposed in [9].
3. In-Car Speech Corpus

The multiple microphone speech corpus is a part of the CIAIR (Center for Integrated Acoustic Information Research) in-car speech database collected at Nagoya University [10], which contains data from 800 speakers. The corpus includes isolated word utterances, phonetically balanced sentences, and dialogues recorded while driving. The data were collected with a specially designed data collection vehicle (DCV) that can acquire up to 16 channels of audio signals, three channels of video, and other driving-related information, i.e., car position, vehicle speed, engine speed, brake and accelerator pedals, and steering wheel. Five spatially distributed microphones (#3 to #7) are placed around the driver's seat, as shown in Fig. 3, where top and side views of the driver's seat are depicted. Microphone positions are marked by the black dots. Microphones #3 and #4 are located on the dashboard, while #5, #6 and #7 are attached to the ceiling; microphone #6 is the nearest to the speaker. In addition to these distributed microphones, the driver wears a headset with a close-talking microphone (#1). A four-element linear microphone array (#9 to #12) with an inter-element spacing of 5 cm is located at the visor position. In the majority of the corpus, the speaker is driving in city traffic near Nagoya University.

Fig. 3  Side view and top view of the arrangement of the multiple spatially distributed microphones and the linear array in the data collection vehicle.

4. Experimental Evaluations

4.1 Adaptive Beamforming (ABF)

The adaptive beamforming approach has become attractive for speech enhancement and speech recognition (e.g. [11]). One of the most common approaches for adaptive microphone array systems is the Generalized Sidelobe Canceller (GSC) [2]. For comparison, we apply the GSC to our in-car speech recognition task. Four linearly spaced microphones (#9 to #12 in Fig. 3) with an inter-element spacing of 5 cm at the visor position are used. The architecture of the GSC used is shown in Fig. 4. It comprises a fixed beamformer (top branch), a blocking matrix, and three adaptive FIR filters (bottom branch). The top branch produces the beamformed signal, which is used as the primary signal. In our car system, τ_i is set equal to zero, since the speakers (drivers) sit directly in front of the array line, while w_i is set equal to 1/4. The delay is chosen as half of the adaptive filter order to ensure that the component in the middle of each adaptive filter at time n corresponds to y_bf(n). The blocking matrix in the bottom branch is used to block out the target signal, producing the outputs u_i, and takes the form

  B = [ 1 −1  0  0
        0  1 −1  0
        0  0  1 −1 ],

which takes the difference between the signals at adjacent microphones. The three FIR filters are adapted sample by sample to generate replicas of the noise or interfering sources contained in the beamformed signal, using the Normalized Least Mean Square (NLMS) method [12]. Note that the FIR filters are adapted during both noise and speech intervals. The replicas y_a(n) are subtracted from y_bf(n). As a result, in the output y_o(n), the target signal is enhanced and detrimental signals such as ambient noise and interference are suppressed. In our experiments, the number of taps and the adaptation step size of the adaptive beamformer were set experimentally to 100 and 0.01, respectively. A minimal code sketch of this processing is given after Fig. 4.

Fig. 4  Block diagram of the Generalized Sidelobe Canceller.
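The sketch below (ours, for illustration only) follows the GSC structure described above: a fixed delay-and-sum beamformer with w_i = 1/4 and τ_i = 0, an adjacent-difference blocking matrix, and sample-by-sample NLMS adaptation with 100 taps and step size 0.01. The simplified buffering, the single shared normalization term, and the absence of any adaptation control are our simplifications of the real system.

import numpy as np

def gsc(mics, taps=100, mu=0.01, eps=1e-8):
    """Generalized Sidelobe Canceller on a (4, n_samples) array of microphone signals."""
    n_mics, n_samples = mics.shape
    y_bf = mics.mean(axis=0)                      # fixed beamformer: w_i = 1/4, tau_i = 0
    u = mics[:-1] - mics[1:]                      # blocking matrix: adjacent differences
    delay = taps // 2                             # align y_bf with the filter midpoints
    w = np.zeros((n_mics - 1, taps))              # one adaptive FIR filter per blocked channel
    y_out = np.zeros(n_samples)
    for n in range(taps, n_samples):
        frames = u[:, n - taps:n][:, ::-1]        # most recent 'taps' samples, newest first
        y_a = np.sum(w * frames)                  # noise replica from the adaptive filters
        d = y_bf[n - delay]                       # delayed primary signal
        e = d - y_a                               # enhanced output sample
        y_out[n - delay] = e
        norm = np.sum(frames * frames) + eps      # NLMS normalization
        w += mu * e * frames / norm               # sample-by-sample NLMS update
    return y_out

# Example usage: mics is a (4, n_samples) array holding the visor-array channels.
# enhanced = gsc(mics)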
4.2 Recognition Experiments

Speech signals used in the experiments were digitized into 16 bits at a sampling frequency of 16 kHz. For the spectral analysis, a 24-channel mel-filter bank (MFB) analysis is performed by applying triangular windows to the FFT spectrum of 25-ms-long windowed speech with a frame shift of 10 ms. Spectral components lower than 250 Hz are filtered out, since the spectrum of the in-car noise is concentrated in the lower-frequency region. This basic analysis is realized through the standard HTK MFB analysis [8]. The regression analysis is performed on the logarithm of the MFB outputs to obtain the estimated feature vectors. Then, 12 cepstral mean normalized mel-frequency cepstral coefficients (CMN-MFCC) are obtained for the speech recognition experiments through a Discrete Cosine Transform (DCT) of the log MFB parameters and subtraction of the cepstral means.

The training data comprise a total of 7,000 phonetically balanced sentences (uttered by 202 male and 91 female speakers): 3,600 of them were collected in the idling condition and 3,400 while driving the DCV on the streets near Nagoya University. The structure of the HMM is fixed, i.e., 1) three-state triphones based on 43 phonemes that share 1,000 states; 2) each state has a 32-component Gaussian mixture distribution; and 3) the feature vector is a 25-dimensional vector (12 CMN-MFCC + 12 Δ CMN-MFCC + Δ log energy).

The test data for recognition include 364 sentences uttered by 21 speakers while talking to an ASR-based restaurant guidance system while driving the DCV [13]. The diagram of the MRLS-based HMM and speech recognition is given in Fig. 5. The MRLS weights are learned from 12,569 example frames (72 sentences uttered by another 5 speakers under the same road and driving conditions as the test data). This weight-training data contains various driving conditions such as waiting at a traffic signal, turning, and accelerating. The optimal MRLS weights are then used to generate the 7,000 sentences for HMM training and the 364 test sentences from the speech of the 5 distributed microphones only; the speech at the close-talking microphone is no longer required. For language models, forward bi-gram and backward tri-gram models with 2,000-word vocabularies are used. The decoder used is Julius 3.4.1 [14]. Note that in our experiments, neither the training data nor the test data are pre-processed by the spectral subtraction (SS) method.

Fig. 5  Diagram of the MRLS-based HMM and speech recognition.

To evaluate the performance of MRLS, we performed the following four recognition experiments: CTK, recognition of the close-talking microphone speech using the HMM trained on the close-talking microphone; ABF, recognition of the adaptive beamforming output using the HMM trained on the adaptive beamformer's output; MRLS, recognition of the MRLS output using the MRLS HMM; and DST, recognition of the nearest distant microphone's (#6 in Fig. 3) speech using the corresponding HMM.

Figure 6 shows the averaged word accuracy scores. The word accuracy is defined as

  Accuracy = (N − S − D − I) / N × 100 [%],   (14)

where N, S, D and I denote the number of words, substitutions, deletions, and insertions, respectively. This figure clearly shows that the MRLS method outperforms the adaptive beamforming method. Compared to the use of the nearest distant microphone, recognition accuracy is improved by 4% with the proposed MRLS method. In comparison to the nearest distant microphone and the multi-microphone adaptive beamformer, the proposed MRLS obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively. This result indicates that the approximation of the speech signals of the close-talking microphone using MRLS is reasonable and practical for improving recognition accuracy. In terms of computation cost, the proposed MRLS method is far superior to the adaptive beamforming approach.

Fig. 6  Recognition performance of MRLS.
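For reference, the counts S, D and I in Eq. (14) come from a minimum-edit-distance alignment of the reference and recognized word sequences; the following utility sketch (ours, not part of the original evaluation setup) shows one way to compute the word accuracy:

def word_accuracy(ref, hyp):
    """Word accuracy of Eq. (14), with S, D, I from a Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (errors, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag, up, left = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
            dp[i][j] = min(
                (diag[0] + sub, diag[1] + sub, diag[2], diag[3]),  # match / substitution
                (up[0] + 1, up[1], up[2] + 1, up[3]),              # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),      # insertion
            )
    _, s, d, ins = dp[n][m]
    return (n - s - d - ins) / n * 100.0

# Example: 1 substitution and 1 insertion against a 4-word reference -> 50.0
print(word_accuracy("turn left at signal".split(), "turn right at the signal".split()))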
4.3 Spectral Distortion

The effectiveness of the approximation by regression is also verified from the viewpoint of spectral distortion. The signal-to-deviation ratio (SDR) is given by

  SDR [dB] = 10 log10 ( Σ_{l=1}^{N} ‖X^(l)‖^2 / Σ_{l=1}^{N} ‖X^(l) − X̂^(l)‖^2 ),   (15)

where X^(l) is the log spectrum vector of the reference speech captured by the close-talking microphone, and X̂^(l) is the log spectrum vector to be evaluated. The evaluated log spectra are those of the distant microphone outputs, the adaptive beamformer output, and the result of the approximation through multiple regression. N denotes the number of frames in one utterance. Figure 7 shows the evaluated SDR values, averaged over all the utterances in the test data. Among the distributed distant microphones, the best SDR value is obtained by the nearest microphone, microphone #6, as expected. The effectiveness of the spectral regression is revealed by the result that the SDR of the approximated log spectrum is about 1.7 dB higher than that of the best single microphone, whereas the adaptive beamformer contributes only a 0.3-dB improvement.

Fig. 7  SDR values for the test data at the five distributed microphones (#3 to #7 in Fig. 3) and for those obtained by the adaptive beamformer (ABF) and the MRLS method.
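A minimal sketch of the SDR computation in Eq. (15) (our own illustrative code; storing the frame-wise log spectrum vectors as rows of NumPy arrays is our assumption):

import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-deviation ratio of Eq. (15), in dB.

    reference : (frames, channels) log spectrum vectors of the close-talking speech
    estimate  : (frames, channels) log spectrum vectors under evaluation
    """
    signal = np.sum(reference ** 2)
    deviation = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(signal / deviation)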
4.4 Discussion

It should be noted that the proposed MRLS method depends to some extent on the positions of the distributed microphones. The basic idea of MRLS is to approximate the log mel-filter bank (log MFB) outputs of a close-talking microphone by a linear combination of the log spectra of the distant microphones, which can be regarded as an extension of spectral subtraction. Therefore, to perform this "spectral subtraction" more effectively, it is desirable to capture the noise signal as well as a comparatively clean speech signal. Widely distributed microphones (e.g., microphones #3 to #7 used in this paper) can record both the noise signal and the comparatively clean signal in an effective way. In our experiments, the MRLS method using the locally concentrated microphones (#9 to #12) gives a recognition accuracy of 65.3% and an SDR of 20.5 dB. Although this accuracy is not as high as that obtained with microphones #3 to #7, it still outperforms the adaptive beamforming.

5. Conclusion

In this paper, we have proposed a new speech enhancement method for robust speech recognition in noisy car environments. The approach utilizes multiple spatially distributed microphones to perform multiple regression of the log spectra (MRLS), aiming at approximating the log spectra at a close-talking microphone. The method does not require a sensitive geometric layout, calibration of the sensors, or additional pre-processing for tracking the speech source, and the regression weights can be statistically optimized over the given training data. The results of our studies have shown that the proposed method provides a good approximation to the speech of a close-talking microphone and outperforms the adaptive beamformer in terms of recognition performance while maintaining a low computation cost.

The present speech enhancement method is limited to simple linear regression; nonlinear regression approaches such as neural networks [15] and support vector machines [16] are expected to yield better performance if a better approximation to the clean speech is required. Other methods for speech enhancement may be combined with the proposed method to obtain further improvements in the recognition of speech in noisy environments. Further research on adapting the MRLS weights to particular driving conditions is expected to improve the performance even more. Moreover, the method is expected to enhance recognition accuracy in very noisy situations and to be applicable to a large number of real-life environments.
References

[1] J.-C. Junqua and J.-P. Haton, Robustness in Automatic Speech Recognition, Kluwer Academic Publishers, 1996.
[2] L.J. Griffiths and C.W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol.AP-30, no.1, pp.27–34, Jan. 1982.
[3] Y. Kaneda and J. Ohga, "Adaptive microphone-array system for noise reduction," IEEE Trans. Acoust. Speech Signal Process., vol.34, no.6, pp.1391–1400, 1986.
[4] O. Hoshuyama and A. Sugiyama, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.925–928, 1996.
[5] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications, Springer-Verlag, 2001.
[6] I. McCowan, D. Moore, and S. Sridharan, "Near-field adaptive beamformer with application to robust speech recognition," Digit. Signal Process.: A Review, vol.12, no.1, pp.87–106, Jan. 2002.
[7] Y. Shimizu, S. Kajita, K. Takeda, and F. Itakura, "Speech recognition based on space diversity using distributed multi-microphone," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol.III, pp.1747–1750, Istanbul, June 2000.
[8] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Version 2.1, Cambridge University, 1997.
[9] M.L. Seltzer, B. Raj, and R.M. Stern, "Speech recognizer-based microphone array processing for robust hands-free speech recognition," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP2002, vol.I, pp.897–900, Orlando, 2002.
[10] N. Kawaguchi, S. Matsubara, H. Iwa, S. Kajita, K. Takeda, F. Itakura, and Y. Inagaki, "Construction of speech corpus in moving car environment," Proc. 6th International Conference on Spoken Language Processing, ICSLP2000, pp.362–365, Beijing, China, 2000.
[11] X. Zhang and J.H.L. Hansen, "A constrained switched adaptive beamforming for speech enhancement & recognition in real car environments," IEEE Trans. Speech Audio Process., vol.11, no.6, pp.733–745, Nov. 2003.
[12] S. Haykin, Adaptive Filter Theory, Prentice Hall, 2002.
[13] H. Fujimura, K. Itou, K. Takeda, and F. Itakura, "In-car speech recognition experiments using a large-scale multi-mode dialogue corpus," Proc. Independent Components Analysis (ICA), vol.IV, pp.2583–2586, 2004.
[14] Julius speech recognition engine, http://julius.sourceforge.jp
[15] S. Haykin, Neural Networks – A Comprehensive Foundation, Prentice Hall, 1999.
[16] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
Weifeng Li received the B.E. degree in Mechanical Electronics from Tianjin University, China, in 1997, and the M.E. degree in Information Electronics from Nagoya University, Japan, in 2003. He is currently a Ph.D. candidate at the Graduate School of Engineering, Nagoya University, Japan. His research interests are in the areas of speech signal processing and robust speech recognition.

Tetsuya Shinde received the B.E. and M.E. degrees in Information Electronics from Nagoya University in 2001 and 2003, respectively. He is currently a researcher at NEC Corporation. His research interest is speech recognition.

Hiroshi Fujimura received the B.E. degree from Nagoya University in 2003. He is currently pursuing a Master's degree at the Graduate School of Information Science, Nagoya University. His research interest is robust speech recognition.
Chiyomi Miyajima received the B.E. degree in computer science and the M.E. and Dr.Eng. degrees in electrical and computer engineering from Nagoya Institute of Technology, Nagoya, Japan, in 1996, 1998, and 2001, respectively. From 2001 to 2003, she was a Research Associate of the Department of Computer Science, Nagoya Institute of Technology. She is currently a Research Associate of the Graduate School of Information Science, Nagoya University, Nagoya, Japan. Her research interests include automatic speaker recognition and multi-modal speech processing. She is a member of ASJ and JASL.

Takanori Nishino received the B.E., M.E., and Dr.Eng. degrees from Nagoya University in 1995, 1997, and 2003, respectively. He was an assistant professor in the Faculty of Urban Science, Meijo University, from 2000 to 2003. He is currently an assistant professor in the EcoTopia Science Institute, Nagoya University. His research interests are spatial audio and human behavioral signal processing. He is a member of the Acoustical Society of Japan, the Information Processing Society of Japan, and the Audio Engineering Society.

Katunobu Itou received the B.E., M.E., and Ph.D. degrees in computer science from Tokyo Institute of Technology in 1988, 1990, and 1993, respectively. Since 2003, he has been an associate professor at the Graduate School of Information Science, Nagoya University. His research interest is spoken language processing. He is a member of the IPSJ and ASJ.

Kazuya Takeda received the B.S., M.S., and Dr. of Engineering degrees from Nagoya University in 1983, 1985, and 1994, respectively. In 1986, he joined ATR (Advanced Telecommunication Research Laboratories), where he was involved in two major projects: speech database construction and speech synthesis system development. In 1989, he moved to KDD R&D Laboratories and participated in a project for constructing a voice-activated telephone extension system. He joined the Graduate School of Nagoya University in 1995. Since 2003, he has been a professor at the Graduate School of Information Science, Nagoya University. He is a member of the IEEE and the ASJ.

Fumitada Itakura was born in Toyokawa, near Nagoya, in 1940. He earned undergraduate and graduate degrees at Nagoya University. In 1968, he joined NTT's Electrical Communication Laboratory in Musashino, Tokyo. He completed his Ph.D. in speech processing in 1972, writing his dissertation on "Speech Analysis and Synthesis System based on a Statistical Method." He worked on isolated word recognition in the Acoustics Research Department of Bell Labs under James Flanagan from 1973 to 1975. Between 1975 and 1981, he researched problems in speech analysis and synthesis based on the Line Spectrum Pair (LSP) method. In 1981, he was appointed Chief of the Speech and Acoustics Research Section at NTT. He left this position in 1984 to take a professorship in communications theory and signal processing at Nagoya University. After 20 years of teaching and research at Nagoya University, he retired and joined Meijo University in Nagoya. His major contributions include theoretical advances involving the application of stationary stochastic processes, linear prediction, and maximum likelihood classification to speech recognition. He patented the PARCOR vocoder in 1969 and the LSP in 1977. His awards include the IEEE ASSP Senior Award, 1975, an award from Japan's Ministry of Science and Technology, 1977, the 1986 Morris N. Liebmann Award (with B.S. Atal), the 1997 IEEE Signal Processing Society Award, and the IEEE Third Millennium Medal. He is a fellow of the IEEE and a member of the Acoustical Society of Japan.