
2008 8th IEEE-RAS International Conference on Humanoid Robots, December 1–3, 2008, Daejeon, Korea

An Open Source Software System For Robot Audition HARK and Its Evaluation

Kazuhiro Nakadai 1,3, Hiroshi G. Okuno 2, Hirofumi Nakajima 1, Yuji Hasegawa 1, Hiroshi Tsujino 1

1 Honda Research Institute Japan Co., Ltd., JAPAN
{nakadai, nakajima, yuji.hasegawa, tsujino}@jp.honda-ri.com

2 Kyoto University, JAPAN
[email protected]

3 Tokyo Institute of Technology, JAPAN

Abstract— A robot's capability of listening to several things at once with its own ears, that is, robot audition, is important in improving human-robot interaction. The critical issue in robot audition is real-time processing in noisy environments, with enough flexibility to support various kinds of robots and hardware configurations. This paper presents open-source robot audition software called "HARK", which includes sound source localization, sound source separation, and automatic speech recognition (ASR). Since separated sounds suffer from spectral distortion caused by the separation process, HARK generates a temporal-frequency map of reliability, called a "missing feature mask", for the features of each separated sound. The separated sounds are then recognized by Missing-Feature-Theory (MFT) based ASR using these masks. HARK is implemented on the middleware "FlowDesigner", which shares intermediate audio data among modules and thus provides real-time processing. HARK's performance in recognizing noisy and simultaneous speech is demonstrated on three humanoid robots, Honda ASIMO, SIG2, and Robovie, with different microphone layouts.

I. INTRODUCTION

Speech recognition is essential for communication and social interaction, and people with normal hearing can listen to many kinds of sounds under various acoustic conditions. If robots are expected to help us in daily environments, they should have hearing capabilities equivalent to ours in order to realize human-robot communication and social interaction. In such daily environments there are many noise sources, including the robot's own motor noise, besides the target speech source. Many robot systems for social interaction have avoided this problem by forcing the participants in an interaction to wear headset microphones [1]. For smoother and more natural interaction, a robot should listen to sounds with its own ears instead of relying on the participants' headsets. Thus "robot audition", which aims to recognize noisy speech such as simultaneous speech by using robot-embedded microphones, was proposed in [2]. It has been studied actively in recent years, as typified by the organized sessions on robot audition at IROS 2004–2008, and we have reported simultaneous speech recognition based on missing-feature-theory-based integration of sound source localization, sound source separation, and speech recognition of separated speech [3]. We released our robot audition system as open source software with the following two goals in mind:
1) allowing each researcher to evaluate the technique he or she is focusing on within a total robot audition system, and
2) enabling researchers who are not familiar with robot audition to construct a robot system with listening capability.



The robot audition software system is called HARK¹, which stands for Honda Research Institute Japan Audition for Robots with Kyoto University². HARK provides a complete module set for building a real-time robot audition system without writing additional code. Many multi-channel sound cards are supported so that a real-time robot audition system can be built easily. Modules for preprocessing, such as the sound source localization, tracking, and separation used in our previously reported systems, are available, and they can be integrated with automatic speech recognition based on the missing feature theory. Thanks to its module-based architecture, HARK is applicable to various types of robots and hardware configurations. In this paper, we describe the implementation of HARK and evaluate its speech recognition performance in noisy environments as well as its general applicability through demonstrations. The rest of the paper is organized as follows: Section II introduces work related to HARK, Section III explains the implementation of HARK, Section IV describes the evaluation of the system, and the last section concludes the paper.

II. RELATED WORK

In robotics, a variety of software programming environments have been reported [4], [5], [6], [7]. Each environment improves software reusability through a modular architecture, and robot audition software should likewise be based on a programming environment with a modular architecture. The requirements for such an environment are: it should be open source so that its stability improves continuously, it should integrate modules with small latency for real-time processing, and it should provide many ready-made modules for signal processing and ASR for the convenience of users. A data-flow oriented software programming environment called FlowDesigner [8]³ satisfies all three requirements. It is open source and integrates modules as shared objects, that is, by function-call-based integration. In addition, another package for robot audition called ManyEars⁴ is available on FlowDesigner. We therefore decided to use FlowDesigner as the programming environment for HARK.

A. FlowDesigner

FlowDesigner allows us to create reusable modules and to connect them together using a standardized mechanism to create a network of modules.

¹ "hark" means "listen" in Old English.
² Available at http://winnie.kuis.kyoto-u.ac.jp/HARK/
³ http://flowdesigner.sourceforge.net/
⁴ http://manyears.sourceforge.net/

The modules are connected dynamically at run time, and a network composed of multiple modules can itself be used inside other networks. This mechanism makes the whole system easier to maintain, since everything is grouped into smaller functional blocks. Once a program has been created by connecting modules, the user can execute it from the GUI or from a shell terminal. When two blocks have matching interfaces, they can be connected regardless of their internal processing, and one-to-many and many-to-many connections are also possible. A block is written in C++ and implemented as a class inheriting from the fundamental block class. Dynamic linking at run time is realized with shared objects in the case of Linux. Since data communication is done through pointers, it is much faster than socket communication. FlowDesigner therefore maintains a well-balanced trade-off between module independence and processing speed.

B. ManyEars

ManyEars provides a set of modules for microphone array processing on FlowDesigner, such as steered-beamforming-based sound source localization [9], particle-filtering-based sound source tracking [9], and geometric-source-separation-based sound source separation [10]. We provide patches for ManyEars to keep it software-compatible with HARK. Thus, most modules in ManyEars are interchangeable with the corresponding ones in HARK. Because HARK provides corresponding modules based on other array processing methods, users can easily build various kinds of robot audition systems by combining modules from HARK and ManyEars.

III. IMPLEMENTATION OF HARK

This section describes the implementation of HARK. HARK consists of many modules implemented as blocks on FlowDesigner, which runs on Linux (Fedora distributions are recommended). The modules fall into six categories: multi-channel audio input, sound source localization and tracking, sound source separation, acoustic feature extraction for automatic speech recognition (ASR), automatic missing feature mask (MFM) generation, and ASR interface. Missing-feature-theory-based ASR (MFT-ASR) is also provided, as a source code patch for Julius/Julian [11], which are Japanese/English open source speech recognition systems. MFT-ASR is the only component of HARK that is not implemented as a FlowDesigner module, but it connects to FlowDesigner through a module in the ASR interface category. Table I summarizes the categories and modules provided by HARK.

A. Multi-channel Audio Input

These modules output multi-channel audio signals. There are two modules in this category: AudioStreamFromWave and AudioStreamFromMic. The AudioStreamFromWave module reads audio signals from a file in WAVE format (RIFF Waveform Audio Format). The AudioStreamFromMic module captures audio signals from a sound input device.
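To make the data-flow style described above more concrete, the following is a minimal, hypothetical sketch of how blocks and their connections could be modeled. The class names and methods are illustrative only and are not FlowDesigner's actual C++ API; real HARK blocks are C++ classes loaded as shared objects.

```python
# Illustrative sketch of a data-flow network of blocks (not FlowDesigner's real API).
# Each block pulls its inputs from upstream blocks, processes one frame, and exposes
# its result on a named output, mirroring function-call-based integration.

class Block:
    def __init__(self, name):
        self.name = name
        self.inputs = {}  # input port name -> (upstream block, output port)

    def connect(self, in_port, upstream, out_port="out"):
        """Connect an input port of this block to an output port of another block."""
        self.inputs[in_port] = (upstream, out_port)

    def pull(self, in_port, frame):
        upstream, out_port = self.inputs[in_port]
        return upstream.calculate(frame)[out_port]

    def calculate(self, frame):
        """Compute outputs for one time frame; overridden by concrete blocks."""
        raise NotImplementedError


class AudioSource(Block):
    def __init__(self, name, frames):
        super().__init__(name)
        self.frames = frames  # list of (multi-channel) frames

    def calculate(self, frame):
        return {"out": self.frames[frame]}


class Gain(Block):
    def __init__(self, name, gain):
        super().__init__(name)
        self.gain = gain

    def calculate(self, frame):
        return {"out": [self.gain * x for x in self.pull("in", frame)]}


# Usage: build a tiny two-block network and run it frame by frame.
if __name__ == "__main__":
    src = AudioSource("src", frames=[[0.1, 0.2], [0.3, 0.4]])
    amp = Gain("amp", gain=2.0)
    amp.connect("in", src)
    for t in range(2):
        print(amp.calculate(t)["out"])
```

The point of the sketch is the connection mechanism: as long as two blocks agree on the port interface, they can be wired together at run time regardless of what they compute internally.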

TABLE I
MODULES PROVIDED BY HARK

Category                                  | Module Name
Multi-channel Audio Input                 | AudioStreamFromMic, AudioStreamFromWave
Sound Source Localization and Tracking    | LocalizeMUSIC, ConstantLocalization, SourceTracker, DisplayLocalization, SaveSourceLocation
Sound Source Separation                   | DSBeamformer, Separation (as a patch for ManyEars), Synthesize
Acoustic Feature Extraction               | Delta, MelFilterBank, FeatureExtraction, SpectralMeanNormalization, SaveFeatures
Automatic Missing Feature Mask Generation | DeltaMask, MFMGeneration, StationaryNoiseEstimator
ASR Interface                             | SpeechRecognitionClient
MFT-ASR                                   | Multiband Julius/Julian (non-FlowDesigner module)

In the AudioStreamFromMic module, three types of devices are available: ALSA (Advanced Linux Sound Architecture) based sound cards⁵, the RASP series⁶, and TD-BD-8CSUSB⁷. ALSA is an open source sound I/O driver that supports many sound cards, including multi-channel cards such as the RME Hammerfall. The RASP series are multi-channel audio signal processing devices with on-board CPUs produced by JEOL System Technology Co., Ltd.; they support both multi-channel audio recording and processing. TD-BD-8CSUSB, produced by Tokyo Electron Device Limited, is one of the smallest 8-ch audio input systems for embedded use and captures audio signals via a USB interface.

B. Sound Source Localization and Tracking

Six modules are prepared for sound source localization and tracking. LocalizeMUSIC calculates the directions of arrival of sound sources from a multi-channel audio signal and outputs the number of sound sources and their directions at each time frame. Since this module uses an adaptive beamforming algorithm called MUltiple SIgnal Classification (MUSIC) [12], it localizes multiple sound sources robustly in a real environment. ConstantLocalization outputs fixed sound source directions without performing localization and is mainly used for performance evaluation and debugging. ManyEars provides LocalizeBeam, based on steered beamforming, and these three modules are interchangeable. SourceTracker tracks sound sources from the localization results and outputs sound source directions with source IDs; when the current direction and the previous direction derive from the same sound source, they are given the same source ID. DisplayLocalization is a viewer module for LocalizeMUSIC, ConstantLocalization, and SourceTracker. SaveSourceLocation stores the localization results to a text file at every time frame.

⁵ http://www.alsa-project.org/
⁶ http://jeol-st.com/mt/2007/04/rasp2.html
⁷ http://www.inrevium.jp/eng/news.html (in Japanese)
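As a rough illustration of the MUSIC-style localization performed by LocalizeMUSIC, the sketch below computes a MUSIC pseudospectrum over candidate directions from a frame of multi-channel spectra. The far-field steering model, the scan grid, and the single-frequency-bin simplification are assumptions for illustration only; the actual LocalizeMUSIC module operates on broadband spectra with calibrated transfer functions.

```python
# Minimal MUSIC pseudospectrum sketch for one frequency bin (illustrative only).
import numpy as np

def music_spectrum(X, mic_pos, freq, n_sources, angles_deg, c=343.0):
    """X: (n_mics, n_frames) complex STFT values at one frequency bin.
    mic_pos: (n_mics, 2) microphone coordinates in meters.
    Returns the MUSIC pseudospectrum evaluated at each candidate azimuth."""
    n_mics = X.shape[0]
    # Spatial correlation matrix averaged over frames.
    R = X @ X.conj().T / X.shape[1]
    # Eigen-decomposition; the eigenvectors of the smallest eigenvalues span the noise subspace.
    eigval, eigvec = np.linalg.eigh(R)                  # ascending eigenvalues
    noise_subspace = eigvec[:, : n_mics - n_sources]
    spectrum = []
    for theta in np.deg2rad(angles_deg):
        # Far-field steering vector for a plane wave arriving from azimuth theta.
        direction = np.array([np.cos(theta), np.sin(theta)])
        delays = mic_pos @ direction / c
        a = np.exp(-2j * np.pi * freq * delays)
        a /= np.linalg.norm(a)
        denom = np.linalg.norm(noise_subspace.conj().T @ a) ** 2
        spectrum.append(1.0 / max(denom, 1e-12))
    return np.array(spectrum)

# Usage: peaks of music_spectrum(...) over angles_deg indicate the estimated source directions.
```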

Fig. 1. 8-ch microphone arrays: a) cube layout (ASIMO), b) circular layout (ASIMO), c) sparse layout (SIG2), d) sparse layout (Robovie).

C. Sound Source Separation

Three modules are prepared for sound source separation: DSBeamformer, Separation, and Synthesize. DSBeamformer separates sound sources in the frequency domain by using the sound source tracking results and the mixture of sound sources. It uses a simple delay-and-sum beamformer, so its parameters are easy to control and it is highly robust against environmental noise. A more sophisticated sound source separation module called Separation is provided as a source code patch for SeparGSS in ManyEars. Separation uses the Geometric Source Separation (GSS) algorithm and a multi-channel post-filter [3]⁸. GSS is a hybrid of Blind Source Separation (BSS) [14] and beamforming; it relaxes BSS's limitations, such as the permutation and scaling problems, by introducing "geometric constraints" obtained from the locations of the microphones and sound sources. We modified GSS to adapt faster by using a stochastic gradient, which improved the signal-to-noise ratio by 10.3 dB on average for the separation of three simultaneous speech signals [3]. The multi-channel post-filter [3] enhances the output of GSS. It is a spectral filter using the optimal noise estimator described in [15]. We extended the original noise estimator to estimate both stationary and non-stationary noise by using multi-channel information, whereas most post-filters address the reduction of only stationary background noise [16]. Finally, Separation outputs the spectra of the separated sound and the post-filtered sound. Synthesize converts the separated sound spectra obtained from DSBeamformer and Separation into time-domain signals for ASR and for producing audible signals.

⁸ Another implementation of GSS, which supports adaptive step-size control [13], is included in the non-free version of HARK.
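To illustrate the kind of frequency-domain delay-and-sum beamforming performed by DSBeamformer, here is a minimal sketch under simplifying assumptions (far-field plane-wave model, known 2-D microphone positions, one fixed target direction taken from the tracking result). It is not the actual HARK module.

```python
# Frequency-domain delay-and-sum beamformer sketch (illustrative only).
import numpy as np

def delay_and_sum(X, mic_pos, theta_deg, fs, c=343.0):
    """X: (n_mics, n_bins) one STFT frame of the multi-channel mixture
    (n_bins assumed to be n_fft // 2 + 1 of a real FFT).
    mic_pos: (n_mics, 2) microphone coordinates in meters.
    theta_deg: target direction (azimuth) from localization/tracking.
    Returns a single-channel STFT frame steered toward theta_deg."""
    n_mics, n_bins = X.shape
    freqs = np.arange(n_bins) * fs / (2.0 * (n_bins - 1))
    theta = np.deg2rad(theta_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    delays = mic_pos @ direction / c                         # per-microphone arrival advance (s)
    # Phase-align every channel to the target direction, then average.
    steering = np.exp(-2j * np.pi * np.outer(delays, freqs)) # (n_mics, n_bins)
    return np.mean(steering * X, axis=0)

# Usage: Y = delay_and_sum(X, mic_pos, theta_deg=30.0, fs=16000)
# followed by an inverse STFT (cf. the Synthesize module) to obtain a waveform.
```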

Fig. 2. An example of a robot audition system using HARK.

D. Acoustic Feature Extraction for ASR

This category includes five modules for acoustic feature extraction: MelFilterBank, FeatureExtraction, SpectralMeanNormalization, Delta, and SaveFeatures. Most conventional ASR systems use Mel-Frequency Cepstral Coefficients (MFCC) as acoustic features, but noise and distortion that are concentrated in certain areas of the spectro-temporal space are spread over all coefficients in MFCC. HARK therefore provides the Mel-Scale Log Spectrum (MSLS) [17] as its acoustic feature. MelFilterBank takes a spectrum as input, analyzes it with Mel-frequency filter banks, and outputs a Mel-scale spectrum. FeatureExtraction takes the Mel-filtered spectrum as input, performs liftering, and outputs MSLS features. SpectralMeanNormalization normalizes the MSLS features to improve their noise robustness. Delta takes the MSLS features, calculates the linear regression of each spectral coefficient over time, and outputs the MSLS features together with their delta features. SaveFeatures stores a sequence of MSLS features to a file for each utterance.

E. Automatic Missing Feature Mask Generation

Missing Feature Theory (MFT) is an approach to noise-robust ASR [18], [19]. MFT uses only reliable acoustic features in speech recognition and masks out the unreliable parts caused by interfering sounds and by preprocessing; it thus provides smooth integration between preprocessing and ASR. The mask used in MFT is called a missing feature mask (MFM) and is represented as a spectro-temporal map. This category includes three modules that generate MFMs without using prior information. The inputs of MFMGeneration are the Mel-filtered spectra of the separated sound, the post-filtered sound, and the background noise estimated by StationaryNoiseEstimator. It estimates the leak noise by exploiting the fact that the post-filtered sound is close to clean speech, while the separated sound still contains background and leak noise. When the estimated leak noise in a frequency bin is lower than a threshold, that bin is regarded as reliable; otherwise it is unreliable. MFMGeneration thus produces the MFM as a binary mask. DeltaMask takes MFMs as input and outputs delta MFMs.

F. ASR Interface and MFT-ASR

SpeechRecognitionClient takes acoustic features, MFMs, and sound source directions as inputs and sends them to Multiband Julius/Julian via TCP/IP. Multiband Julius/Julian is provided as a source code patch for the original Julius/Julian.
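The following sketch shows one plausible way to compute a binary missing feature mask and its delta mask from Mel-filtered spectra, in the spirit of MFMGeneration and DeltaMask. The exact leak-estimation formula, the threshold value, and the delta-window handling are assumptions for illustration, not the formulas used in HARK.

```python
# Binary missing feature mask (MFM) sketch (illustrative; not HARK's exact formula).
import numpy as np

def generate_mfm(sep_mel, post_mel, bg_mel, threshold=1.0):
    """sep_mel, post_mel, bg_mel: (n_frames, n_mel) Mel-filtered spectra of the
    separated sound, the post-filtered sound, and the estimated background noise.
    A bin is reliable (mask = 1) when the estimated leak energy is below the threshold."""
    # Treat whatever energy the post-filter output and background estimate do not explain as leak.
    leak = np.maximum(sep_mel - post_mel - bg_mel, 0.0)
    return (leak < threshold).astype(np.float32)

def delta_mask(mfm, width=2):
    """A delta feature at frame t depends on frames t-width..t+width, so its mask is
    marked reliable only if every static mask in that window is reliable (cf. DeltaMask)."""
    n_frames = mfm.shape[0]
    out = np.ones_like(mfm)
    for t in range(n_frames):
        lo, hi = max(0, t - width), min(n_frames, t + width + 1)
        out[t] = mfm[lo:hi].min(axis=0)
    return out

# Usage: mask = generate_mfm(sep_mel, post_mel, bg_mel); dmask = delta_mask(mask)
# The masks accompany the (delta-)MSLS features sent to the MFT-ASR decoder.
```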

Fig. 3. Isolated word recognition against diffuse and directional noise. Word correct rate (%) versus target-to-robot-noise ratio (dB) for a single front microphone ("1 mic (front)") and the proposed system: a) speech (0 deg) + robot's noise, b) speech (0 deg) + music (60 deg) + robot's noise, c) speech (0 deg) + speech (60 deg) + robot's noise.

Multiband Julius/Julian supports various types of HMMs, such as shared-state triphones and tied-mixture models, and both statistical language models and network grammars. In decoding, a forward word bigram is used in the first pass and a reverse word trigram in the second pass. It works as a standalone or a client-server application; to run it as a server, we modified the system so that it can receive acoustic features and MFMs over a network.

IV. EVALUATION

Fig. 2 shows the robot audition system built with HARK that we used for evaluation. The system uses MUSIC for sound source localization (LocalizeMUSIC), GSS and the multi-channel post-filter for sound source separation (Separation), and the speech recognition client for MFT-ASR (SpeechRecognitionClient). As shown in Fig. 1, we used four hardware configurations: Honda ASIMO (with two microphone layouts), SIG2, and Robovie. The robot audition system was applied to all configurations by modifying only a file describing the microphone positions. We then evaluated HARK in terms of the following points: 1) speech recognition performance (ASIMO, SIG2), 2) processing speed (ASIMO), and 3) applications of HARK (ASIMO, Robovie).

A. Speech Recognition Performance

Speech recognition performance was evaluated with respect to robustness against diffuse and directional noise, and the effectiveness of preprocessing and MFT-ASR.

1) Robustness against diffuse and directional noise: Isolated word recognition was performed using Honda ASIMO with the 8-ch circular microphone array shown in Fig. 1a). ASIMO was located at the center of a 7 m × 4 m room. Three walls were covered with sound-absorbing material, while the remaining wall was made of glass, which produces strong echoes. The reverberation time (RT20) of the room was about 0.2 seconds, although the reverberation was not uniform across the room because of the asymmetrical echo generated by the glass wall. As a test dataset, an ATR phonemically-balanced word set (ATR-PB) was used. This word set includes 216 Japanese words per speaker. We used the data of six speakers from the word set and measured the average word correct rate (WCR).

A triphone acoustic model was trained on a separate speech dataset, the JNAS (Japanese Newspaper Article Sentences) corpus, which includes 60 hours of speech spoken by 306 male and female speakers; the experiment was therefore a word-open test. Three types of noise environments were used. The first was ASIMO's own noise, including the sounds produced by its fans and motors; since this noise is mainly produced inside ASIMO, it can be regarded as a kind of diffuse noise. The second was a speech noise source located 60 degrees to the left of ASIMO's front direction at a distance of 1 m; the ASIMO noise was also present, so this environment is a mixture of directional and diffuse noise. The last was similar to the second except that a music source was used as the directional noise. The case with a speech noise source thus corresponds to simultaneous speech recognition under diffuse noise. In every noise environment, the target speech source was placed in the front direction, 1 m away from ASIMO, and its level was varied from -8 dB to 3 dB relative to the ASIMO noise. The speech data for isolated word recognition were synthesized using impulse responses and noise data pre-measured with the microphone array in these noise environments; that is, speech files were fed into the robot audition system. The robot audition system for this experiment was constructed by replacing AudioStreamFromMic with AudioStreamFromWave in Fig. 2.

Fig. 3 shows the results. To establish a baseline, we measured the WCR obtained by selecting a single front microphone without any robot audition processing. Fig. 3a) shows that the robot audition system improves performance drastically: even when the SNR is as low as -6 dB, the WCR stays around 90%. This indicates that the post-filtering in the sound source separation module deals properly with diffuse noise. Figs. 3b) and c) show that the WCRs without any preprocessing were almost 0%; because the target speech is contaminated by both the ASIMO noise and another directional noise, such noisy speech is difficult to recognize without preprocessing. The HARK-based robot audition system improved speech recognition drastically, with WCRs over 75% for SNRs higher than -4 dB in both cases. GSS worked for directional noise sources regardless of the noise type.

2) Effectiveness of preprocessing and MFT-ASR: To assess the effectiveness of each technique in the robot audition system, we conducted isolated word recognition of three simultaneous speech signals.

Fig. 4. Effectiveness of preprocessing and MFT-ASR: word correct rate (%) for the front speaker at speaker intervals of 30 and 60 degrees under the conditions described in the text.

TABLE II
CPU OCCUPANCY OF EACH PROCESS

Process   | localization | separation | feature extraction | misc.
Occupancy | 44.0%        | 47.3%      | 7.8%               | 0.9%

SIG2 with the 8-ch microphone array shown in Fig. 1c) was used in a 4 m × 5 m room with an RT20 of 0.3–0.4 sec. Three simultaneous speech signals for the test data were recorded with the 8-ch microphone array in the room using three loudspeakers (Genelec 1029A). The distance between each loudspeaker and the center of the robot was 2 m. Two speaker settings, (-30°, 0°, 30°) and (-60°, 0°, 60°), were evaluated (0° is the front direction of SIG2). As the test dataset, a female (f101), a male (m101), and another male (m102) speech source were selected from ATR-PB for the left, center, and right loudspeakers, respectively, and the three words uttered simultaneously were chosen at random. During this recording, the robot's power was turned off. Using the test data, the system recognized the three speakers under the following conditions:

C1: The input from the left-front microphone was used without any preprocessing, with a clean acoustic model.
C2: GSS and the post-filter were used as preprocessing, but the MFT function was not; the clean acoustic model was used.
C3: The same as C2, except that the MFT function was used with automatically generated MFMs.
C4: The multi-condition-trained acoustic model was used; the other conditions were the same as C3.
C5: The same as C4, except that an a priori MFM created from clean speech was used. This is regarded as the potential upper limit of our system.

The clean acoustic model was trained with 10 male and 12 female ATR-PB word sets, excluding the three word sets (f101, m101, and m102) used for the recording; it was thus a speaker-open, word-closed acoustic model. The multi-condition-trained acoustic model was trained with the same ATR-PB word sets plus separated speech datasets. The latter were generated by separating three-word combinations of f102-m103-m104 and f102-m105-m106, recorded in the same way as the test data.

Fig. 4 shows the WCRs for the front speaker. Every technique contributed to speech recognition performance. The WCR was around 90% in the 60° interval case, where sound source separation was the most effective factor because GSS-based separation works very well. In the 30° interval case, on the other hand, a 30-point degradation was observed with C2 because the sound sources were close to each other; however, missing-feature-theory-based ASR and the multi-condition-trained acoustic model compensated for this degradation and improved the WCR up to around 80%. The fact that the a priori mask showed quite high performance suggests that there is still considerable room to improve the MFM generation algorithm. Note that localization does not affect the WCR very much: in another experiment, we found that automatic localization degrades the WCR by only 5 points compared with manually given localization.

B. Processing Time

To measure processing speed, we had the robot audition system separate and recognize 800 seconds of speech consisting of isolated words. Recognition took 499 seconds on a Pentium 4 at 2.4 GHz; FlowDesigner consumed 369 seconds and Multiband Julius the remaining time. The CPU occupancy of each process running on FlowDesigner is shown in Table II. Sound source localization and separation consumed most of the computational power, because the eigenvalue decomposition for localization and the update of the separation matrix require many multiplications; this load could be reduced by using special hardware such as a DSP or an FPGA. The delay between the beginning of localization and the beginning of recognition was about 0.40 seconds, and the delay between the end of separation and the end of recognition was about 0.046 seconds. As a whole, the robot audition system worked in real time.

C. Applications of HARK

We introduce two applications of HARK: a robot referee for rock-paper-scissors (RPS) sound games and a meal order taking robot. In both applications, the HARK-based robot audition systems were implemented on a laptop PC with a Core 2 Duo at 2 GHz. The first application uses ASIMO with the microphone array configuration shown in Fig. 1b). To implement the robot referee, the robot audition system was connected to a simple speech dialog system, implemented as a system-initiative dialog system for multiple players based on deterministic finite automata. It first waits for a trigger command to start an RPS sound game, controls the dialog with the players during the game, and finally announces the winner. Fig. 5 shows snapshots of this referee task. At present, the system supports up to three players. With two players, we attained a 70% task completion rate on average [20]. It takes around 2 sec to estimate the number of simultaneous players, because the system must detect that all players have finished their RPS utterances; the system latency in this task was therefore 2–3 sec. The second application takes meal orders from simultaneous speech. In this case, we used Robovie with the microphone array shown in Fig. 1d), and the robot audition system was connected to a simple dialog system for order taking. In this application we replaced LocalizeMUSIC with LocalizeBeam, and the system worked properly with steered beamforming. Fig. 6 shows snapshots of the meal order taking task. In this task, the latency from the end of the users' utterance to the beginning of the robot's response was only 1.9 sec. This latency was mainly caused by speech recognition, because the ASR outputs its result only after the system detects the end of speech.

Fig. 5. Snapshots of the rock-paper-scissors game (A: ASIMO, U1: left user, U2: center user, U3: right user). a) U2: Let's play an RPS game. b) U1-U3: rock-paper-scissors... c) U1: paper, U2: paper, U3: scissors. d) A: U3 won.

Fig. 6. Snapshots of the meal order taking task (A: ASIMO, U1: left user, U2: center user, U3: right user). a) A: May I take your order? b) U1-U3: order dishes at the same time. c) A: repeats the order for confirmation. d) A: tells the total price.

V. CONCLUSION

We described an open source robot audition system called HARK. HARK provides a complete module set for constructing a high-performance, real-time robot audition system, including a multi-channel input module that supports many sound input devices. Its modular architecture is designed for general applicability, so HARK can be applied to various robots and hardware configurations, such as different microphone layouts, with minimal modification. In addition, modules from another robot audition package, ManyEars, are interchangeable with those in HARK, because both are implemented on the data-flow oriented programming environment FlowDesigner. These advantages were confirmed through the evaluation of speech recognition performance in noisy environments, processing speed, and two applications in which HARK was combined with dialog systems on Honda ASIMO and Robovie. We believe that HARK is helpful for rapid prototyping of robot audition systems, and that it can also contribute to dialog, vision, and navigation research by bringing high-performance robot audition into those systems. Extensions of HARK such as a sound scene viewer, sound source identification, and music recognition are future work.

ACKNOWLEDGMENT

We thank Dr. Jean-Marc Valin and Dr. Shunichi Yamamoto for their valuable advice on the implementation of HARK.

REFERENCES

[1] C. Breazeal, "Emotive qualities in robot speech," in Proc. IROS-2001. IEEE, 2001, pp. 1389–1394.
[2] K. Nakadai et al., "Active audition for humanoid," in Proc. AAAI-2000. AAAI, 2000, pp. 832–839.

[3] S. Yamamoto et al., "Making a robot recognize three simultaneous sentences in real-time," in Proc. IROS-2005. IEEE, 2005, pp. 897–902.
[4] M. Montemerlo et al., "Perspectives on standardization in mobile robot programming: The Carnegie Mellon navigation (CARMEN) toolkit," in Proc. IROS-2003. IEEE, 2003, pp. 2436–2441.
[5] N. Ando et al., "RT-middleware: Distributed component middleware for RT (robot technology)," in Proc. IROS-2005. IEEE, 2005, pp. 3555–3560.
[6] C. Schlegel, "A component approach for robotics software: Communication patterns in the OROCOS context," in 18. Fachtagung Autonome Mobile Systeme (AMS). Springer, 2003, pp. 253–263.
[7] C. Côté et al., "Using MARIE in software development and integration for autonomous mobile robotics," Int'l Journal of Advanced Robotic Systems, vol. 3, no. 1, pp. 55–60, 2006.
[8] C. Côté et al., "Reusability tools for programming mobile robots," in Proc. IROS-2004. IEEE, 2004, pp. 1820–1825.
[9] J.-M. Valin et al., "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering," Robotics and Autonomous Systems Journal, vol. 55, no. 3, pp. 216–228, 2007.
[10] J.-M. Valin et al., "Enhanced robot audition based on microphone array source separation with post-filter," in Proc. IROS-2004, 2004, pp. 2123–2128.
[11] T. Kawahara and A. Lee, "Free software toolkit for Japanese large vocabulary continuous speech recognition," in Proc. ICSLP-2000, vol. 4, 2000, pp. 476–479.
[12] F. Asano et al., "Sound source localization and signal separation for office robot "Jijo-2"," in Proc. MFI-99, 1999, pp. 243–248.
[13] H. Nakajima et al., "Adaptive step-size parameter control for real-world blind source separation," in Proc. ICASSP-2008, 2008, pp. 149–152.
[14] L. C. Parra and C. V. Alvino, "Geometric source separation: Merging convolutive source separation with geometric beamforming," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 6, pp. 352–362, 2002.
[15] Y. Ephraim and D. Malah, "Speech enhancement using minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-32, no. 6, pp. 1109–1121, 1984.
[16] I. McCowan and H. Bourlard, "Microphone array post-filter for diffuse noise field," in Proc. ICASSP-2002, vol. 1, 2002, pp. 905–908.
[17] Y. Nishimura et al., "Noise-robust speech recognition using multi-band spectral features," in Proc. of 148th ASA Meeting, no. 1aSC7, 2004.
[18] J. Barker et al., "Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise," in Proc. of Eurospeech-2001. ESCA, 2001, pp. 213–216.
[19] M. Cooke et al., "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, no. 3, pp. 267–285, May 2000.
[20] K. Nakadai et al., "A robot referee for rock-paper-scissors sound games," in Proc. ICRA-2008, 2008, pp. 3469–3474.
