©2007 The Acoustical Society of Japan
Acoust. Sci. & Tech. 28, 3 (2007)
PAPER
Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation Hideki Banno1; , Hiroaki Hata2 , Masanori Morise2 , Toru Takahashi2 , Toshio Irino2 and Hideki Kawahara2 1
¹Faculty of Science and Technology, Meijo University, 1-501, Shiogamaguchi, Tempaku-ku, Nagoya, 468-8502 Japan
²Faculty of Systems Engineering, Wakayama University, 930, Sakaedani, Wakayama, 640-8510 Japan

(Received 31 May 2006, Accepted for publication 31 October 2006)

Abstract: A very high quality speech analysis, modification and synthesis system—STRAIGHT—has now been implemented in C language and operated in realtime. This article first provides a brief summary of STRAIGHT components and then introduces the underlying principles that enabled realtime operation. In STRAIGHT, the built-in extended pitch synchronous analysis, which does not require analysis window alignment, plays an important role in realtime implementation. A detailed description of the processing steps, which are based on the so-called ‘‘just-in-time’’ architecture, is presented. Further, discussions on other issues related to realtime implementation and performance measures are also provided. The software will be available to researchers upon request.

Keywords: STRAIGHT speech manipulation system, Realtime, Pitch synchronous analysis, F0 extraction, Voice conversion

PACS number: 43.42.Ja, 43.72.Ar
[doi:10.1250/ast.28.140]

1. INTRODUCTION
STRAIGHT [1] (Speech Transformation and Representation by Adaptive Interpolation of weiGHTed spectrogram) was originally designed to investigate human speech perception in terms of auditorily meaningful parametric domains. STRAIGHT’s design was motivated by the belief that nonlinear systems such as human speech perception should be investigated by using their normal input signals, i.e., ecologically relevant stimuli. Although the underlying structure of STRAIGHT is similar to that of the classical channel VOCODER [2], the speech sounds reproduced and/or manipulated by STRAIGHT are sometimes indistinguishable from the original speech sounds in terms of their naturalness [3,4]. This conceptual simplicity together with manipulation flexibility and a highly natural reproduced speech quality have made STRAIGHT a powerful tool for speech perception research [5–9]. In addition to the utility of the basic STRAIGHT system, its extensions to auditory morphing [3,4,10,11] opened up novel prospects in speech manipulations. Realtime STRAIGHT, which is the focus of this article,
e-mail: [email protected]
will also promote other aspects of STRAIGHT's design objective that have not yet been exploited. From the beginning, STRAIGHT was designed so that it could be applied to auditory feedback research [12] in the near future, when processors would become sufficiently fast for realtime operation; auditory feedback was a research topic of one of the authors. Owing to this background, all the algorithms incorporated into STRAIGHT were designed from the outset to be compatible with realtime operation [1]. However, the current STRAIGHT implementation in Matlab (henceforth referred to as ‘‘Matlab STRAIGHT’’) does not take realtime processing into account; it is instead designed so that users can manipulate STRAIGHT parameters easily. This article reports the first attempt at testing how the original design objective works with current technologies. Once STRAIGHT is implemented in realtime, it will also be useful in various types of applications such as voice conversion, text-to-speech synthesis, musical performances, and speech style conversion.
2. INTRODUCTION TO STRAIGHT [1]
STRAIGHT has been evolving through investigations with regard to the following topics:
(1) Periodic excitation in voiced sounds can be interpreted as a two-dimensional sampling operation of the smooth time-frequency surface that represents articulatory information [1,13].
(2) Group delay manipulation of the excitation source [13].
(3) F0 estimation that does not require a priori knowledge for designing a relevant analysis window [1].
(4) Extended pitch synchronous analysis that does not require alignment with pitch marks [1].
(5) F0 extraction based on fixed-point analysis of a mapping from the carrier frequencies of the analyzing wavelet to the instantaneous frequencies of the corresponding wavelet transform [14].
(6) Acoustic event information extraction based on fixed-point analysis of window center location with respect to the centroid of the windowed signal, and minimum-phase group delay based compensation [15].
(7) Auditory morphing [3,10].
(8) Algorithm AMALGAM [16] that can seamlessly morph different speech processing algorithms such as waveform-based synthesis, sinusoidal models and STRAIGHT.
(9) Nearly defect-free F0 extractor using multiple F0 cues and post processing, suitable for offline and quality-sensitive applications [17].
These studies roughly trace the course of STRAIGHT's evolution. The components developed in the first and third topics were replaced by their counterparts developed in the fourth and fifth topics, respectively, and the former no longer exist. The modules developed in the sixth and eighth topics have not yet been integrated into STRAIGHT. At this stage, the most important topic among those presented for realtime STRAIGHT implementation is the fourth one—the extended pitch synchronous spectral analysis. The topic of secondary importance is that on F0 extraction.

2.1. Architecture of STRAIGHT
Figure 1 shows the schematic diagram of Matlab STRAIGHT. STRAIGHT is basically a channel VOCODER with enhanced parameter modification capabilities and a very high quality.
The parameters manipulated are (a) smoothed spectrogram, (b) fundamental frequency, and (c) time-frequency periodicity map. The frequency resolution of the periodicity map is set to one ERB_N rate by smoothing along a nonlinear frequency axis. STRAIGHT offers a graphical interface for analysis, modification, and synthesis, and it also allows direct access to the Matlab functions. The central feature of STRAIGHT is the extended pitch synchronous spectral analysis that provides a smooth artifact-free time-frequency representation of the spectral envelope of the speech signal.
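As a concrete picture of these three parameters, a per-frame container in C (the language of the realtime implementation) might look as follows. This is an illustrative sketch only; the type and field names are ours, not those of the actual source:

```c
#include <stdlib.h>

/* Illustrative per-frame container for the three STRAIGHT parameters;
   all names here are hypothetical, not taken from the actual code. */
typedef struct {
    double  f0;           /* (b) fundamental frequency [Hz]; 0.0 when unvoiced */
    double *spectrum;     /* (a) smoothed (STRAIGHT) spectral envelope bins */
    double *periodicity;  /* (c) time-frequency periodicity map, smoothed to
                             one ERB_N-rate resolution on a nonlinear axis */
    int     n_freq;       /* number of frequency bins per frame */
} StraightFrame;

StraightFrame *straight_frame_new(int n_freq)
{
    StraightFrame *fr = calloc(1, sizeof(StraightFrame));
    if (!fr) return NULL;
    fr->n_freq = n_freq;
    fr->spectrum = calloc((size_t)n_freq, sizeof(double));
    fr->periodicity = calloc((size_t)n_freq, sizeof(double));
    return fr;
}

void straight_frame_free(StraightFrame *fr)
{
    if (fr) { free(fr->spectrum); free(fr->periodicity); free(fr); }
}
```

A synthesis step would read one such frame per excitation pulse, which is all the ‘‘just-in-time’’ analysis described later needs to retain.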
Fig. 1 Schematic structure of Matlab STRAIGHT. (Block diagram: input speech feeds an F0 adaptive interference-free spectral information extractor and an F0/source information extractor; the resulting spectral information and source information pass through modification stages and drive a mixed mode excitation source with group delay manipulation followed by a minimum phase filter, producing output speech. ERB: equivalent rectangular bandwidth.)
2.2. Extended Pitch Synchronous Analysis
The most unique feature of STRAIGHT is its extended pitch synchronous analysis. Unlike other pitch synchronous procedures, STRAIGHT does not require its analysis frame to be aligned with the pitch marks placed on the waveform under study. This analysis employs a compensatory set of windows. The primary window is an effectively isometric Gaussian window convolved with a pitch adaptive Bartlett window h(t), where the fundamental period is represented as t_0:

    w_p(t) = e^{-(t / (\eta t_0))^2} \ast h(t / t_0),    (1)

    h(t) = 1 - |t|  for |t| < 1,  and  h(t) = 0  otherwise,

where \eta represents a temporal stretching factor for improving the frequency resolution slightly, and \ast represents convolution. This convolved window is pitch synchronized and yields temporally constant spectral peaks at harmonic frequencies. However, periodic zeros (with period equal to the fundamental period t_0) between the harmonic components still remain. Modulating this window by a sinusoid of frequency f_0/2 yields a compensating window that produces spectral peaks at the positions where zeros were located in the original spectra:

    w_c(t) = w_p(t) \sin(\pi t / t_0).    (2)

A temporally stable composite spectrum P_r(\omega, t) is represented as a weighted squared sum of the power spectra P_o^2(\omega, t) and P_c^2(\omega, t) obtained with the original time window and the compensatory window, respectively:

    P_r(\omega, t) = \sqrt{ P_o^2(\omega, t) + \xi P_c^2(\omega, t) },    (3)

where the mixing coefficient \xi is numerically optimized to minimize temporal variations both in the peaks and the
valleys. This provides pitch synchronous spectral analysis without the need for pitch marking. This pitch marking independence is crucial for realtime STRAIGHT implementation. In addition, STRAIGHT introduces spline-based F0 adaptive spectral smoothing that eliminates only the interferences due to periodic excitation, to finally yield a smoothed spectrum [1]—the so-called STRAIGHT spectrum.

2.3. Instantaneous Frequency Based F0 Extraction
The quality of source information extraction, namely F0 extraction, is critical for the quality of resynthesized speech. The F0 trajectory has to be band limited so that it does not contain any traces of F0 jumps, because such discontinuities introduce additional temporal jitter when F0 and/or the temporal axis are manipulated. An instantaneous frequency based source information extractor was developed to meet these requirements [14].

2.4. Resynthesis and Group Delay Manipulation
Group delay manipulation was introduced to enable an F0 control that is finer than the resolution determined by the sampling interval. It was also used to reduce the ‘‘buzzy’’ sound usually found in synthetic speech. Based on prior listening tests and considerations of ecological validity, the minimum phase impulse response was used. All these operations are implemented as group delay manipulations and integrated.
3. REALTIME STRAIGHT
Realtime STRAIGHT, named ‘‘Herium,’’ is multi-platform software currently available on Windows XP Home/Professional and Mac OS X 10.2 or later. This section introduces the issues involved in converting Matlab STRAIGHT into a C-based realtime executable system.

3.1. Architecture
Although STRAIGHT was designed to be compatible with realtime applications, the implementation of Matlab STRAIGHT is not compatible with realtime processing. This is because the Matlab version was designed for interactive explorations and offline batch processing. Therefore, a complete reconfiguration of the program architecture was inevitable. Realtime speech processing systems, realtime STRAIGHT being one of them, require an audio interface that is capable of full-duplex recording and playback operations. To handle this duplex data stream between the audio interface and the PC, realtime systems must be equipped with speech input and speech output buffers. In addition, the constituent processes should communicate based on the
status of the data in each buffer.
3.1.1. Buffer-based processing steps
Figure 2 illustrates how the constituent tasks of realtime STRAIGHT are coordinated by sharing two buffers and how each reference pointer for synthesis is synchronized. The figure also shows how the audio interface acquires contiguous output audio data seamlessly. The following steps outline how realtime STRAIGHT processing takes place.
(1) Initialize the input and output buffer contents to zero.
(2) Fetch the input speech data into a portion located at the end of the input buffer. The length of this portion is referred to as the ‘‘buffer shift length.’’
(3) Estimate F0 using the speech data stored in the input buffer. (In the current implementation, a single F0 is calculated using all the data in the input buffer.)
(4) Read a data segment from where a pointer (RPS: reference point for synthesis) points. Next, extract the STRAIGHT spectrum from the read data segment. Finally, transform the STRAIGHT spectrum into an impulse response.
(5) Add the impulse response to the output buffer at the RPS of the output buffer. Then, add the fundamental period (reciprocal of F0) for synthesis to the RPS position to advance the pointer. The fundamental period is modified if F0 conversion is applied to the input speech data.
(6) Repeat the previous step until the RPS location surpasses the buffer shift length from the beginning of the buffer.
(7) Transmit the buffer shift length portion of the output buffer from the beginning of the buffer to the audio interface (see Fig. 2).
(8) Shift the contents of both buffers backward by a length equal to that of the buffer shift. This operation generates an empty space of buffer shift length at the end of each buffer. The RPS locations are also shifted backward by subtracting the buffer shift length from their location counters. Then, the second step is repeated.
This is a pipeline process.
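The eight steps above can be sketched in C as follows, assuming the configuration described in this section (a 32-ms buffer with an 8-ms shift, taken here at 16 kHz). The names RtState and rt_step are ours, F0 is passed in rather than estimated from the buffer, and a unit pulse stands in for the impulse response of step (4):

```c
#include <string.h>

#define BUF_LEN   512   /* 32 ms at 16 kHz */
#define SHIFT_LEN 128   /*  8 ms at 16 kHz */

typedef struct {
    double in[BUF_LEN];   /* input speech buffer  */
    double out[BUF_LEN];  /* output speech buffer */
    double rps;           /* reference point for synthesis, in samples */
} RtState;

/* One pipeline iteration covering steps (2)-(8): consumes SHIFT_LEN
   input samples, produces SHIFT_LEN output samples, and returns the
   number of excitation pulses placed.  Step (1) is the caller zeroing
   the state before the first call. */
int rt_step(RtState *s, const double *in_chunk, double *out_chunk,
            double f0, double fs)
{
    double t0 = fs / f0;   /* fundamental period in samples */
    int pulses = 0;
    /* (2) fetch new input into the tail of the input buffer */
    memcpy(s->in + BUF_LEN - SHIFT_LEN, in_chunk,
           SHIFT_LEN * sizeof(double));
    /* (3) F0 estimation and (4) spectrum extraction would happen here */
    /* (5)-(6) add excitation at successive RPS positions until the RPS
       surpasses the buffer shift length */
    while (s->rps < SHIFT_LEN) {
        s->out[(int)s->rps] += 1.0;   /* stand-in for the impulse response */
        s->rps += t0;                 /* advance the pointer by 1/F0 */
        pulses++;
    }
    /* (7) transmit the head of the output buffer */
    memcpy(out_chunk, s->out, SHIFT_LEN * sizeof(double));
    /* (8) shift both buffers backward and move the RPS counter */
    memmove(s->in,  s->in  + SHIFT_LEN,
            (BUF_LEN - SHIFT_LEN) * sizeof(double));
    memmove(s->out, s->out + SHIFT_LEN,
            (BUF_LEN - SHIFT_LEN) * sizeof(double));
    memset(s->out + BUF_LEN - SHIFT_LEN, 0, SHIFT_LEN * sizeof(double));
    s->rps -= SHIFT_LEN;
    return pulses;
}
```

Each call consumes one shift length of input and emits one shift length of output; because the RPS is held as a double, excitation pulses keep their fractional spacing across buffer boundaries.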
In the current implementation, the buffer size is 32 ms and the buffer shift length is 8 ms. The intrinsic delay due to this buffering, 24 ms in this case, is added to the internal processing delays and latency of the audio interface.

Fig. 2 The input buffer (left) and output buffer (right). This plot shows the status just after the sixth step (RPS: reference point for synthesis).

3.1.2. F0 extraction
The STRAIGHT F0 extractor has not been implemented in the current realtime STRAIGHT, because the F0 extraction algorithm in Matlab STRAIGHT must be slightly modified to be implemented in realtime. Therefore, instead of the STRAIGHT F0 extractor, a conventional Cepstrum-based F0 extractor was implemented using the same input buffer structure (32-ms window length with 8-ms frame shift). The quefrency of the maximum peak of the Cepstrum is used to determine the fundamental period, which is then converted into F0. A voiced/unvoiced decision is made based on the peak height. The effects of this F0 extractor replacement are discussed in the evaluation section.
3.1.3. Spectral estimation
The spectral information extraction of Matlab STRAIGHT is excessively redundant because its main target application is the interactive exploration of speech attributes and perceptual effects. The current default analysis frame rate of Matlab STRAIGHT is 1 ms. However, thanks to the compensating windowing, the current version does not require a fine temporal resolution in the spectral analysis. Realtime STRAIGHT analyzes spectral information only when it is required, in other words, when each excitation pulse is generated. This corresponds to a ‘‘just-in-time’’ analysis. This implementation is also useful in offline processing because it enables a huge reduction in the storage necessary for STRAIGHT parameters. The F0 adaptive analysis time window length is updated only once in a single frame. This design was selected because, based on our preliminary tests, minor F0 errors do not degrade STRAIGHT spectra and have negligible effects on the reproduction quality. The spectral analysis in realtime STRAIGHT can be switched between the Cepstrum-based method [18] and the STRAIGHT analysis. The effects of this difference between the spectral analyses are discussed in the evaluation section.
3.1.4. Synthesis from STRAIGHT spectra
The following functions are eliminated in the current realtime STRAIGHT:
- Minimum phase impulse response.
- Group delay manipulation in a higher frequency region to reduce the ‘‘buzzyness’’ of the artificial excitation sources.
- An F0 control finer than sampling pulse resolution using group delay manipulation.
This elimination can slightly degrade the naturalness of resynthesized speech. The difference between realtime STRAIGHT and the conventional analysis/synthesis system is then only the spectral representation.
3.1.5. Implementation
Realtime STRAIGHT is a multi-platform system. It is built on ‘‘spLibs,’’ [19] a multi-platform audio processing library set also developed by the first author. The current implementation uses ‘‘spBase,’’ the basic library; ‘‘spAudio,’’ the audio input and output library; ‘‘spComponent,’’ the GUI library; and ‘‘spLib,’’ the signal processing library. The portability of spLibs makes realtime STRAIGHT easily portable. It should be noted that the processing time of realtime STRAIGHT is significantly reduced by using the fast FFT libraries distributed by chip manufacturers. It was found that in our applications, these FFTs were two to four times faster than the usual mathematical function libraries. In STRAIGHT, the analysis window size depends on the sampling frequency. This dependency introduces a super-linear increase in the number of numerical operations and imposes an upper limit on the sampling frequency for realtime operation. Based on tests using various types of PCs, it is safe to state that normal realtime operation at a sampling frequency of 32 kHz is possible on PCs with processors running at around 1 GHz. The latency without an audible gap was less than 100 ms, although this value strongly depends on the operating system, the audio interface, and so on.

3.2. Evaluation
A preliminary evaluation of the effects of the modifications introduced for realizing realtime STRAIGHT was conducted by replacing Matlab STRAIGHT's components with modules that implemented the algorithms used in realtime STRAIGHT. Figure 3 shows the trajectories extracted using the conventional Cepstrum pitch extractor and the STRAIGHT F0 extractor. The speech sample is recorded female speech sampled at 16 kHz. The conventional frame-based analysis and the coarse temporal resolution due to sampling introduce staircase features in the Cepstrum-based trajectory.
The perceptual effects of this jaggedness were not prominent in the case of male speech; however, the degradation was noticeable in female speech, especially in the case of high-pitched subjects. This degradation is easily audible when headphones are used instead of loudspeakers. Voiced/unvoiced decision errors, observed around 100 ms and 200 ms in the Cepstrum case, can cause noticeable degradation. Figure 4 shows the spectrograms extracted using the conventional fixed frame Cepstrum analysis and the STRAIGHT spectral analysis. The fixed frame rate of the conventional method fails to capture the fine and smooth spectral structures found in the STRAIGHT spectrogram. The degradation in the resynthesized speech when the Cepstrum-based spectrogram was used could be easily perceived. This degradation is mainly due to inappropriate
Fig. 3 F0 trajectories extracted by the Cepstrum-based method (upper plot) and the instantaneous frequency based method used in Matlab STRAIGHT (lower plot).

Fig. 4 Smoothed spectrogram extracted by the Cepstrum-based method (upper plot) and the extended pitch synchronous method used in Matlab STRAIGHT (lower plot).
(usually too strong) spectral smoothing in the conventional method. Further, this degradation is audible in the current realtime STRAIGHT, which has an option for switching the spectral analysis between the STRAIGHT-based method and the Cepstrum-based method. It is interesting to note that when the STRAIGHT-based F0 extraction was used for the Cepstrum analysis/synthesis system, a noticeable improvement in the speech quality was observed.
A subjective evaluation has also been performed. Ten listeners with normal hearing participated in this experiment. The speech materials were ten sentences uttered by four males and four females (80 utterances in total). The sampling frequency of the materials is 22.05 kHz. Figure 5 shows the mean opinion scores (MOS) of the subjective experiment. ‘‘Org’’ is the original; ‘‘STR,’’ Matlab STRAIGHT; ‘‘RT-STR,’’ realtime STRAIGHT; and ‘‘CepVoc,’’ the conventional Cepstrum analysis/synthesis system. The MOS of realtime STRAIGHT is approximately one point better than that of the Cepstrum-based system. This indicates that the quality of realtime STRAIGHT is better than that of the Cepstrum-based system. In comparison with Matlab STRAIGHT, realtime STRAIGHT has poorer performance, especially in the case of female speech. This is caused by voiced/unvoiced decision errors in both males and females, and by the jaggedness of the F0 trajectories, mainly in females.

Fig. 5 Results of the subjective experiment. The error bars denote ±1 standard deviation.

4. DEMONSTRATIONS

Figure 6 shows a frame from a movie demonstrating realtime STRAIGHT running on a tablet PC (Sony VAIO-U). The figure also shows the GUI of realtime STRAIGHT. In addition to sliders for F0 conversion and frequency axis stretching, a gender control slider is also installed. This slider changes the frequency stretching factor \beta in proportion to the cubic root of the F0 conversion factor \alpha. This relation, \beta = \alpha^{1/3}, was determined in our preliminary tests, and it was found to be consistent with a number of reports in the speech perception literature. The source code of this software is available upon request to researchers who already have Matlab STRAIGHT.

Fig. 6 STRAIGHT in realtime operation on a tablet PC (Sony VAIO-U)—the black unit on top of the silver unit (AD/DA converter, Roland UA-25). The graphical user interface (GUI) on the PC's display is shown. The left panel in the images shows the GUI of realtime STRAIGHT. Four sliders control the round trip delay, F0 conversion ratio, frequency axis stretching coefficient and gender of the resynthesized speech.

5. CONCLUSION

The first implementation of realtime STRAIGHT is introduced. Currently, it runs faster than realtime on PCs with processors running at around 1 GHz. However, this performance was achieved by eliminating some important components and replacing others with conventional procedures. Our next step is to upgrade the downgraded components to Matlab STRAIGHT's latest algorithms by developing computationally efficient implementations.

ACKNOWLEDGMENTS
The authors would like to thank Professor Fumitada Itakura for his valuable suggestions and discussions. This research was partly supported by the MEXT leading project e-Society, a Grant-in-Aid for Scientific Research for Young Researchers (B) (16700183), and ERATO by JST.

REFERENCES
[1] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné, ‘‘Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction,’’ Speech Commun., 27, 187–207 (1999).
[2] H. Dudley, ‘‘Remaking speech,’’ J. Acoust. Soc. Am., 11, 169–177 (1939).
[3] H. Matsui and H. Kawahara, ‘‘Investigation of emotionally morphed speech perception and its structure using a high quality speech manipulation system,’’ Proc. Eurospeech 2003, Geneva, pp. 2113–2116 (2003).
[4] T. Yonezawa, N. Suzuki, K. Mase and K. Kogure, ‘‘Gradually changing expression of singing voice based on morphing,’’ Proc. Interspeech 2005, Lisboa, pp. 541–544 (2005).
[5] T. Irino and R. D. Patterson, ‘‘Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The Stabilised Wavelet Mellin Transform,’’ Speech Commun., 36, 181–203 (2002).
[6] D. R. R. Smith, R. D. Patterson, H. K. R. Turner and T. Irino, ‘‘The processing and perception of size information in speech sounds,’’ J. Acoust. Soc. Am., 117, 305–318 (2005).
[7] J. Jin, H. Banno, H. Kawahara and T. Irino, ‘‘Intelligibility of degraded speech from smeared STRAIGHT spectrum,’’ Proc. ICSLP 2004, Vol. 4, Jeju, pp. 530–533 (2004).
[8] C. Liu and D. Kewley-Port, ‘‘Vowel formant discrimination for high-fidelity speech,’’ J. Acoust. Soc. Am., 116, 1224–1233 (2004).
[9] P. F. Assmann and W. F. Katz, ‘‘Synthesis fidelity and time-varying spectral change in vowels,’’ J. Acoust. Soc. Am., 117, 886–895 (2005).
[10] H. Kawahara and H. Matsui, ‘‘Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation,’’ Proc. ICASSP 2003, Vol. 1, Hong Kong, pp. 256–259 (2003).
[11] Y. Sogabe, K. Kakehi and H. Kawahara, ‘‘Psychological evaluation of emotional speech using a new morphing method,’’ Proc. ICCS 2003, Vol. 2, Melbourne, pp. 628–633 (2003).
[12] H. Kawahara and J. C. Williams, ‘‘Effects of auditory feedback on voice pitch trajectories,’’ in Vocal Fold Physiology, P. J. Davis and N. H. Fletcher, Eds. (Singular Publishing Group, San Diego, 1996) Chap. 18, pp. 263–278.
[13] H. Kawahara, ‘‘Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited,’’ Proc. ICASSP ’97, Vol. 2, Munich, pp. 1303–1306 (1997).
[14] H. Kawahara, H. Katayose, A. de Cheveigné and R. D. Patterson, ‘‘Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity,’’ Proc. Eurospeech ’99, Vol. 6, Budapest, pp. 2781–2784 (1999).
[15] H. Kawahara, Y. Atake and P. Zolfaghari, ‘‘Accurate vocal event detection method based on a fixed-point analysis of mapping from time to weighted average group delay,’’ Proc. ICSLP 2000, Beijing, pp. 664–667 (2000).
[16] H. Kawahara, H. Banno, T. Irino and P. Zolfaghari, ‘‘Algorithm AMALGAM: Morphing waveform based methods, sinusoidal models and STRAIGHT,’’ Proc. ICASSP 2004, Vol. 1, Montreal, pp. 13–16 (2004).
[17] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi and T. Irino, ‘‘Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT,’’ Proc. Interspeech 2005, Lisboa, pp. 537–540 (2005).
[18] A. V. Oppenheim, ‘‘Speech analysis-synthesis system based on homomorphic filtering,’’ J. Acoust. Soc. Am., 45, 458–465 (1969).
[19] http://www.sp.m.is.nagoya-u.ac.jp/people/banno/spLibs/

Hideki Banno was born in Toukai, Japan on January 22, 1974. He received B.E., M.E., and Ph.D. degrees from Nagoya University in 1996, Nara Institute of Science and Technology in 1998, and Nagoya University in 2003, respectively. From 2001 to 2003, he was a research assistant at the Graduate School of Economics, Nagoya University. From 2003 to 2005, he was a research assistant at the Faculty of Systems Engineering, Wakayama University. From 2004 to 2005, he was also an invited researcher at ATR Spoken Language Communication Research Laboratories. He has been a lecturer at the Faculty of Science and Technology, Meijo University, since 2005. His research interests include speech analysis and synthesis, acoustic signal processing and auditory perception. He is a member of ASJ and IEICE.

Hiroaki Hata received the B.E. degree in Systems Engineering from Wakayama University in 2005. He is currently a master's student at the Graduate School of Systems Engineering, Wakayama University. His research interests include the implementation of speech analysis-synthesis systems.
Masanori Morise received the B.E. and M.E. degrees in Systems Engineering from Wakayama University in 2004 and 2006, respectively. In 2006, he became a Research Fellow (DC1) of JSPS. He has been a doctoral candidate at the Graduate School of Systems Engineering, Wakayama University, since 2006. His research interests include acoustic signal processing and electro-acoustic systems. He is a member of ASJ and IEICE.
Toru Takahashi was born in Akita, Japan on June 8, 1973. He received a B.E. in computer science, and M.E. and Ph.D. degrees in electrical and electronic engineering from Nagoya Institute of Technology in 1996, 1998, and 2004, respectively. He has been a research assistant in the Faculty of Systems Engineering at Wakayama University since 2004. His research interests include speech analysis and synthesis, acoustic signal processing, and auditory perception. He is a member of ASJ, IEICE and JPSJ.

Toshio Irino was born in Yokohama, Japan, in 1960. He received the B.S., M.S., and Ph.D. degrees in electrical and electronic engineering from Tokyo Institute of Technology in 1982, 1984, and 1987, respectively. From 1987 to 1997, he was a research scientist at NTT Basic Research Laboratories. From 1993 to 1994, he was a visiting researcher at the Medical Research Council-Applied Psychology Unit (MRC-APU, currently CBU) in Cambridge, UK. From 1997 to 2000, he was a senior researcher at ATR Human Information Processing Research Laboratories (ATR HIP). From 2000 to 2002, he was a senior research scientist at NTT Communication Science Laboratories. Since 2002, he has been a professor of the Faculty of Systems Engineering, Wakayama University. He is also a visiting professor at the Institute of Statistical Mathematics. The focus of his current research is a computational theory of the auditory system. Dr. Irino is a member of the Acoustical Society of America (ASA), the Acoustical Society of Japan (ASJ), and the Institute of Electronics, Information and Communication Engineers (IEICE).

Hideki Kawahara received the B.E., M.E. and Dr.Eng. degrees in Electrical Engineering from Hokkaido University, Sapporo, Japan in 1972, 1974 and 1977, respectively. In 1977, he joined the Electrical Communications Laboratories of Nippon Telephone and Telegraph public corporation. In 1992, he joined ATR Human Information Processing Research Laboratories, Japan, as a department head.
In 1997, he became an invited researcher of ATR, and he has been a professor at the Faculty of Systems Engineering, Wakayama University, since the same year. He received the Sato award from the ASJ in 1998, the TELECOM System Technology Prize from the Telecommunications Advancement Foundation, Japan, in 1998, and the EURASIP best paper award for his contribution to the Speech Communication journal in 2000. His research interests include auditory signal processing models, speech analysis and synthesis, electro-acoustic systems and auditory perception. He is a member of ASA, ASJ, IEEE, IPSJ, ISCA, IEICE and JNNS.