Speaker Localization on a Humanoid Robot's Head using the TDOA-based Feature Matrix

Ui-Hyun Kim1,2, Jinsung Kim1, Doik Kim1, Hyogon Kim2, and Bum-Jae You1

1 Center for Cognitive Robotics Research, Korea Institute of Science and Technology, Seoul, Republic of Korea (e-mail: {uihyun, jinsung, doikkim, ybj}@kist.re.kr)
2 Department of Computer and Radio Communications Engineering, Korea University, Seoul, Republic of Korea (e-mail: {uihyun, hyogon}@korea.ac.kr)

Abstract—Research on human-robot interaction has recently been receiving an increasing amount of attention, and speech signal processing in particular is the source of much interest in this field. In this paper, we propose a time delay of arrival (TDOA)-based feature matrix and an accompanying algorithm based on the minimum sum of absolute errors (MSAE) for sound source localization. The TDOA-based feature matrix is defined as a database matrix of the TDOAs calculated from the pairs of microphones installed on a humanoid robot. Conventional methods for sound source localization with multiple channels use approximate nonlinear equations to find the position of a sound source from the estimated TDOAs. Our proposed method, using the TDOA-based feature matrix and its algorithm, estimates the location of the sound source simply, without solving approximate nonlinear equations. We also report on a speaker localization system with six microphones for the humanoid robot MAHRU at KIST. Experimental results show that a humanoid robot using the TDOA-based feature matrix and its algorithm can successfully localize a speaker's direction over the whole azimuth and over a height range divided into three parts.


I. INTRODUCTION

Nowadays, robot development is focused on service robots that communicate with humans, and humanoid robots are expected to communicate much as humans do. In particular, speech signal processing has become a technology of great interest in the research field of human-robot interaction. For this reason, studies on the sound source localization of speech and acoustic signals are being carried out actively, and various algorithms and new methods that are more robust in real environments are being introduced. Sound source localization is defined as the determination of the coordinates of sound sources in relation to a point in space. It is achieved by using differences in the sound received at different microphones to estimate the direction and, eventually, the actual location of the sound source. For example, the human ears act as two different sound observation points, enabling humans to estimate the direction of the source of a sound. Supposing that the sound source is modeled as a point source, two different cues can be used in sound source localization. The first cue is the inter-aural level difference (ILD): emitted sound signals have a loudness that gradually decays as the microphone moves further away from the sound source [6].

A system using the ILD cue usually takes the direction of the microphone with the greatest loudness as the direction of the sound source. This has the merit of simplicity, but its localization capability is poor and numerous microphones are needed. The other cue that can be used for sound source localization is the inter-aural time difference (ITD), more commonly referred to as the TDOA. Since the distance between each microphone and the sound source differs, the sound signals produced by the source arrive at the microphones at different times due to the finite speed of sound. Many algorithms exist to estimate the most likely TDOA between a pair of microphones, such as the maximum likelihood (ML) approach, the generalized cross-correlation (GCC) approach and its phase transform (GCC-PHAT), cross-power spectrum phase (CSP) analysis, and frequency whitening [2]-[5].

Auditory systems for humanoid robots using microphone arrays have been developed in various forms: Jijo-2 of AIST [14] with eight microphones, QRIO of Sony [8] with seven microphones, and ASIMO of Honda [9] and Robita of Waseda University [10] with two microphones. SIG of Kyoto University [11], [12] uses two pairs of microphones, one pair installed at the ear positions of the head to collect sound from the external world, and the other placed inside the head to collect internal sounds (caused by motors) for noise cancellation. Like humans, these robots use binaural localization, i.e., the ability to locate the source of a sound in three-dimensional space. However, matching the hearing capabilities of humans with only one pair of microphones on a humanoid robot is a difficult challenge: the human hearing sense takes into account the acoustic shadow created by the head and the reflections of the sound by the two ridges running along the edges of the outer ears [13].

We developed a speaker localization system to be applied to a humanoid robot called MAHRU. In the usual methods for sound source localization on humanoid robots, approximate nonlinear equations must be used to exploit TDOAs that are inaccurate because of the humanoid robot's shape, environmental noise, calculation errors, and so on. Instead, we propose a method that localizes a sound source using the TDOA-based feature matrix and its algorithm based on MSAE. Our proposed method does not require solving approximate nonlinear equations, and it compensates for the inaccuracy of the measured TDOAs.

The paper is organized as follows. Section II describes the microphone array on the humanoid robot's head, the algorithms required by our system for sound source localization, and the TDOA-based feature matrix and its algorithm. Section III presents the system outline and the experimental results. Section IV concludes the paper.

II. SPEAKER LOCALIZATION

A. Microphone Array

The initial step for sound source localization is to design a microphone array suited to the purpose of the system. In this paper, our goal is to create a human-friendly auditory system that resembles a human being's in both morphology and ability. With regard to morphology, such an auditory system consists of a pair of microphones located on the left and right sides of a humanoid robot's head. Some methods for sound source localization have been studied with one pair of microphones installed at the external or internal ear positions of a dummy head. However, it is well known that detecting the exact position of a sound source in space with only one pair of microphones is difficult. Therefore, although the morphological resemblance is weakened, we designed a microphone array with six microphones to reinforce the localization ability. We installed the six microphones on the left and right sides of the humanoid robot's head; this formation is an intermediate step toward an auditory system similar to that of a human being. The concrete microphone array is shown in Fig. 1. The horizontal distance between the microphones of each pair is over 15 cm and the vertical distance is over 14 cm. We regard the central point of the array as the motor axis of the humanoid robot's neck, and we placed the microphone array on the head with bilateral symmetry. Consequently, we have the advantage that the microphone array does not change its overall orientation, regardless of how the humanoid robot's neck rotates.

Our installed microphone array is used to find the direction of a sound source from the TDOA information of each pair of microphones; that is, we use the ITD cue to find the direction of a sound source. To localize a sound source, we organize the TDOA-based feature matrix over the fifteen pairs of microphones obtained through the combinations of the six microphones.

B. Voice Activity Detection (VAD)

Let H0 and H1 denote speech absence and speech presence. For each frame of the short-time Fourier transform (STFT), VAD considers the two hypotheses

$$H_0:\ \text{speech absent}:\ X_k(n) = N_k(n) \qquad (1)$$

$$H_1:\ \text{speech present}:\ X_k(n) = N_k(n) + S_k(n) \qquad (2)$$

where $X_k(n)$, $S_k(n)$, and $N_k(n)$ are the k-th elements of the STFTs of the measured signal, the speech, and the uncorrelated additive noise, respectively, on the n-th frame; $k \in \{0, 1, \dots, T-1\}$ is a frequency index and $n$ is a time-frame index. We detect speech frames using the VAD algorithm proposed in [1]. Speech absent frames and speech present frames are determined by the maximum likelihood (ML)-based decision rule:

$$H_0:\ \log\Lambda(n) = \frac{1}{T}\sum_{k=0}^{T-1}\left\{\gamma_k - \log\gamma_k - 1\right\} < \eta$$

$$H_1:\ \log\Lambda(n) = \frac{1}{T}\sum_{k=0}^{T-1}\left\{\gamma_k - \log\gamma_k - 1\right\} > \eta \qquad (3)$$

where $\gamma_k = \|X_k\|^2/\lambda_k$ is the a posteriori signal-to-noise ratio (SNR), $\lambda_k$ is the estimated variance of $N_k(n)$, and $\eta$ is a threshold [1].

Figure 2 shows that the VAD decision rule (3) can detect speech frames well: (a) is a speech source with air-conditioner noise in the background, while (b) and (c) are the likelihood ratio and the speech present periods calculated by the decision rule with η set to 2. We recorded the English speech source X sampled at 16 kHz in a general office room. The frame size K for the STFT was 1024 (64 ms), and the frame shift was 256 (16 ms).
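As an illustration only, the decision rule (3) can be sketched in a few lines of Python/NumPy; the per-bin noise variances `noise_var` (the λk) are assumed to be estimated elsewhere, e.g. from noise-only frames, which the paper defers to [1].

```python
import numpy as np

def vad_decision(X_frame, noise_var, eta=2.0):
    """ML-based VAD decision of Eq. (3) for one STFT frame.

    X_frame   : complex STFT coefficients X_k(n) of the frame
    noise_var : estimated noise variances lambda_k (same shape)
    eta       : decision threshold
    Returns (is_speech, log_likelihood_ratio).
    """
    # A posteriori SNR: gamma_k = |X_k|^2 / lambda_k
    gamma = np.abs(X_frame) ** 2 / np.maximum(noise_var, 1e-12)
    # Frame-level statistic of Eq. (3): mean of {gamma_k - log(gamma_k) - 1}
    log_lr = np.mean(gamma - np.log(gamma) - 1.0)
    return log_lr > eta, log_lr

# Toy usage: a windowed 200 Hz tone in white noise, compared against the noise variance
rng = np.random.default_rng(0)
T = 1024
window = np.hanning(T)
frame = np.sin(2 * np.pi * 200 * np.arange(T) / 16000) + rng.normal(scale=0.1, size=T)
X = np.fft.rfft(window * frame)
noise_var = np.full(X.shape, 0.1 ** 2 * np.sum(window ** 2))  # rough per-bin noise power
print(vad_decision(X, noise_var))  # -> (True, value well above eta)
```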

Fig. 1. A microphone array installed on the head of a humanoid robot called MAHRU. The arrows point to the locations of the microphones.

Fig. 2. Result of voice activity detection. (a) a speech source, (b) estimated log Λ, (c) estimated speech frames.

C. Extraction of a TDOA using CSP analysis

The extraction of exact TDOAs is of great importance for sound source localization. An example of the estimation of a TDOA is illustrated in Fig. 3, where $S_i(n)$ and $S_j(n)$ are the signals measured from microphones i and j. Consider a pair of microphones separated by a distance d and a sound source that satisfies the far-field assumption. The received signal at the m-th microphone is

$$X_m(n) = \alpha_m S(n - \tau_m) + N_m(n) \qquad (4)$$

where $\alpha_m$, $\tau_m$, and $N_m(n)$ are an attenuation factor, the time delay from the position of the sound source $S(n)$ to the m-th microphone, and the noise on the n-th frame, respectively. The TDOA $\tau_{ij}$ between two microphones i and j is then

$$\tau_{ij} = \tau_j - \tau_i \qquad (5)$$

Several methods for estimating a TDOA $\tau_{ij}$ with the unknown parameters $\tau_i$ and $\tau_j$ have been proposed. One of them uses the correlation between the signals measured at the microphones [3]. The cross-power spectrum phase (CSP) analysis method extracts a TDOA from the spectral correlation between the signals in the frequency domain. For a single sound source, the TDOA $\tau_{ij}$ can be estimated by finding the maximum value of the CSP coefficients [2], given by

$$\mathrm{CSP}_{ij}(k) = \mathrm{ISTFT}\!\left[\frac{\mathrm{STFT}[S_i(n)]\,\mathrm{STFT}[S_j(n)]^{*}}{\left|\mathrm{STFT}[S_i(n)]\right|\left|\mathrm{STFT}[S_j(n)]\right|}\right] = \mathrm{ISTFT}\!\left[e^{-j(\angle S_i(n) - \angle S_j(n))}\right] \qquad (6)$$

$$\tau_{ij} = \arg\max_k\left(\mathrm{CSP}_{ij}(k)\right) \qquad (7)$$

where k is the index of a time delay, STFT and ISTFT are the short-time Fourier transform and its inverse, and * denotes the complex conjugate. Figure 4 shows the detection of a TDOA on one frame of a measured signal using CSP analysis. We can see that the signal $S_i(n)$ is delayed relative to the signal $S_j(n)$ by the lag at which the CSP coefficients reach their maximum value. Since the maximum value of the CSP coefficients is found in the time domain after the ISTFT, $\tau_{ij}$ depends on the sampling rate of the signal; because we used a sampling rate of 16 kHz, the minimum $\tau_{ij}$ that we can estimate is limited to 62.5 µs (1 s / 16 kHz).

Fig. 3. Estimation of a TDOA with a pair of microphones.

Fig. 4. Result of CSP analysis. (a) and (b) are an English speech signal measured from two microphones placed 21 cm apart; (c) is the CSP coefficients with their estimated maximum value (peak).
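To make (6)-(7) concrete, here is a minimal NumPy sketch of CSP-based TDOA estimation on a single frame (our illustration, not the authors' implementation); the returned lag is an integer number of samples, so at 16 kHz the resolution is the 62.5 µs mentioned above.

```python
import numpy as np

def csp_tdoa(si, sj, fs=16000):
    """Estimate the TDOA between two frames with CSP analysis, Eqs. (6)-(7).

    si, sj : equal-length time-domain frames from microphones i and j
    fs     : sampling rate in Hz
    Returns (tdoa_in_seconds, lag_in_samples, csp_coefficients).
    """
    n = len(si)
    Si, Sj = np.fft.rfft(si), np.fft.rfft(sj)
    # Phase-only cross spectrum e^{-j(angle_Si - angle_Sj)}, the right-hand side of Eq. (6)
    cross = np.conj(Si) * Sj
    csp = np.fft.irfft(cross / np.maximum(np.abs(cross), 1e-12), n=n)
    # Put lag 0 in the middle and pick the peak, Eq. (7)
    csp = np.fft.fftshift(csp)
    lags = np.arange(-(n // 2), n - n // 2)
    lag = int(lags[np.argmax(csp)])
    return lag / fs, lag, csp

# Toy usage: sj is si delayed by 8 samples, i.e. tau_ij = +0.5 ms at 16 kHz
rng = np.random.default_rng(1)
si = rng.normal(size=1024)
sj = np.roll(si, 8)
print(csp_tdoa(si, sj)[:2])  # -> (0.0005, 8)
```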


D. TDOA-based Feature Matrix

TDOAs obtained from each pair of microphones can be employed in various ways for sound source localization. We propose a method that uses the extracted TDOAs to find the direction of a sound source over the whole azimuth and over a height range divided into three parts. Since sound source localization systems integrated with a visual system have the horizontal viewing angle of a normal camera, usually more than 10 degrees, we take 10 degrees as the unit of azimuth resolution for sound source localization. If the TDOA $\tau_{ij}$ is obtained, we can estimate the angle θ characterizing the direction of a sound source in Fig. 3. The angle θ is derived from the following equation:

$$\theta = \cos^{-1}\!\left(\frac{\tau_{ij}\, c}{d}\right) \qquad (8)$$

where c is the velocity of sound (340.5 m/s, at 15 °C, in air).

We can identify some problems with finding the direction of a sound source from Fig. 5. The results in Fig. 5 were simulated with a distance of 15 cm between two microphones and a 10-degree interval between angles. Panel (a) shows the variation of the TDOA driven by the angle θ characterizing the direction of a sound source in 10-degree steps; the TDOA varies distinctly with the angle and is symmetrical. Panel (b) is a chart of the differences between TDOAs at 10-degree intervals. Consider a single pair of microphones 15 cm apart from which we must estimate the direction of a sound source using CSP analysis with a 16 kHz sampling rate and a 10-degree resolution. Even without considering the front (0 to 180 degrees) versus back (180 to 360 degrees) ambiguity, we can only estimate directions between 50 and 130 degrees, because the smallest TDOA difference we can resolve is restricted by the sampling rate, as mentioned in subsection C, and the TDOA variation is symmetrical about an angle, as shown in Fig. 5. The simple solution to problems like these is to extend the distance between the two microphones or to increase the sampling rate, but these measures also have their limitations.
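As a quick numerical check of the 50 to 130 degree range (using only $d = 15$ cm, $c = 340.5$ m/s, and one sample period of 62.5 µs at 16 kHz), the TDOA difference between adjacent 10-degree angles follows directly from (8):

$$\Delta\tau(\theta) = \frac{d}{c}\left[\cos\theta - \cos(\theta + 10^{\circ})\right], \qquad \frac{d}{c} \approx 440.5\ \mu\mathrm{s}$$

$$\Delta\tau(50^{\circ}) \approx 440.5 \times (\cos 50^{\circ} - \cos 60^{\circ}) \approx 63\ \mu\mathrm{s} > 62.5\ \mu\mathrm{s}, \qquad \Delta\tau(40^{\circ}) \approx 440.5 \times (\cos 40^{\circ} - \cos 50^{\circ}) \approx 54\ \mu\mathrm{s} < 62.5\ \mu\mathrm{s}$$

so only the angles between roughly 50 and 130 degrees (by symmetry about 90 degrees) change the TDOA by more than one sample period per 10-degree step.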

Fig. 5. (a) Distribution chart of the TDOAs. (b) Differences between TDOAs in 10-degree unit intervals; at a 16 kHz sampling rate, only the angles whose TDOA difference exceeds 62.5 µs can be estimated.

However, we propose a TDOA-based feature matrix and an accompanying algorithm as a solution to such problems. The TDOAs calculated from each pair of microphones carry different information determined by the microphone array. If there is a database matrix that contains the TDOA values expected by the installed microphone array for every direction of a sound source, then the direction of a sound source can be found by comparing the TDOAs measured each time with the values in the database matrix. This is the basic conception of the TDOA-based feature matrix and its algorithm, which compensate for the inaccuracy in estimating the direction of a sound source from the TDOAs calculated from each pair of microphones. The TDOA-based feature matrix is composed of TDOAs determined by angles in 10-degree units and by all pairs of microphones; we used the fifteen pairs obtained through the combinations of the six installed microphones. Its TDOAs are derived from (8), which can be rewritten as

$$M_{\theta p} = \frac{\cos(\theta)\, D_p}{c} \qquad (9)$$

where $M_{\theta p}$ is the TDOA element at the θ-th row and p-th column of the TDOA-based feature matrix, $D_p$ is the distance between the p-th pair of microphones, and θ ∈ {0°, 10°, 20°, ..., 350°} is an angle in 10-degree units characterizing the direction of a sound source. The system using the TDOA-based feature matrix can find the direction of a sound source over the whole azimuth and the height by means of the algorithm based on MSAE. The algorithm for the TDOA-based feature matrix is as follows:

$$E(\theta) = \sum_{p}\left|M_{\theta p} - \tau_p\right| \qquad (10)$$

$$\theta' = \arg\min_{\theta}\left(E(\theta)\right) \qquad (11)$$

where $\tau_p$ is the TDOA measured from the p-th installed pair of microphones, $E(\theta)$ is the sum of the absolute TDOA errors over the p-th columns at the θ-th row of the new TDOA-based feature matrix, and θ' is the estimated angle of the direction of the sound source.
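For illustration, the sketch below builds such a feature matrix and applies the MSAE search of (10)-(11) in Python/NumPy. The microphone coordinates, azimuth-only geometry, and noise level are placeholders of ours (the paper's matrix also covers the three height parts and uses MAHRU's real geometry); the expected TDOAs are computed from the far-field model, which reduces to (9) for a pair whose baseline is aligned with the reference direction.

```python
import numpy as np
from itertools import combinations

C = 340.5  # speed of sound in m/s, as used in Eq. (8)

def build_feature_matrix(mic_xy, angles_deg):
    """Feature matrix M[theta, p]: expected TDOA for every angle and microphone pair."""
    pairs = list(combinations(range(len(mic_xy)), 2))          # 15 pairs for 6 mics
    u = np.stack([np.cos(np.radians(angles_deg)),
                  np.sin(np.radians(angles_deg))], axis=1)     # far-field source directions
    baselines = np.array([mic_xy[i] - mic_xy[j] for i, j in pairs])
    # Expected tau_ij = u . (r_i - r_j) / c under the far-field model;
    # for a baseline aligned with the reference axis this is cos(theta) D_p / c, Eq. (9).
    return u @ baselines.T / C, pairs

def localize(measured_tdoas, feature_matrix, angles_deg):
    """MSAE search of Eqs. (10)-(11): return the angle with the minimum sum of |errors|."""
    E = np.sum(np.abs(feature_matrix - measured_tdoas), axis=1)   # E(theta), Eq. (10)
    return angles_deg[int(np.argmin(E))]                          # theta', Eq. (11)

# Hypothetical head-mounted layout in metres (not MAHRU's actual geometry)
mics = np.array([[0.08, 0.07], [0.08, -0.07], [0.00, 0.09],
                 [0.00, -0.09], [-0.08, 0.07], [-0.08, -0.07]])
angles = np.arange(0, 360, 10)
M, pairs = build_feature_matrix(mics, angles)

# Simulated measurement: a far-field source at 130 degrees with 5 us of TDOA noise
u_true = np.array([np.cos(np.radians(130)), np.sin(np.radians(130))])
tau = np.array([u_true @ (mics[i] - mics[j]) / C for i, j in pairs])
tau += np.random.default_rng(2).normal(scale=5e-6, size=tau.shape)
print(localize(tau, M, angles))  # -> 130
```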




We have previously utilized a TDOA-based feature matrix [7]. The earlier TDOA-based feature matrix was divided into an azimuth matrix and a height matrix covering azimuth angles between 0 and 180 degrees only. The system using the earlier matrix was insufficient for azimuth angles between 0 and 30 degrees and between 150 and 180 degrees, because its microphone array and matrix were restricted to azimuth angles within 180 degrees. That system also estimated the direction of a sound source by choosing the angle with the greatest angle-score. Since the angle-score was calculated by adding points whenever a TDOA estimated from a pair of microphones fell within the angle elements of the earlier matrix, angle-scores presumed to be caused by noise had to be eliminated for stability. The proposed TDOA-based feature matrix is not split into an azimuth matrix and a height matrix, and (10) and (11) can be expected to give the sound source localization system a more effective performance than the angle-score of the earlier TDOA-based feature matrix could provide.

III. EXPERIMENT

A. Implementation of the system

We used the humanoid robot MAHRU as a test mock-up. The system ran on a personal computer with an AMD Athlon 3200+ (2.0 GHz) processor.

The OS was Windows XP Professional. We used a National Instruments NI-4772 data acquisition (DAQ) device, a PreSonus DigiMax FS preamplifier, and Sony ECM-77B microphones. The flow of the speaker localization system is shown in Fig. 6. Our system employed time frames for the frequency-domain approach and the 16 kHz sampling rate that sound source localization systems normally use. The frame size for the STFT was 2048 (128 ms), and the frame overlap was 1024 (64 ms). When sound data was acquired by the microphone array and the DAQ device, the VAD algorithm classified the speech absent frames and the speech present frames. The CSP algorithm then extracted TDOAs from each pair of microphones installed on MAHRU at each speech present frame. The estimated TDOAs were fed to the new TDOA-based feature matrix and its algorithm to produce the sound source localization result, and MAHRU turned to look at the estimated position of the sound source through a motor event. We set the VAD threshold η of (3) to 10 in order to estimate correct speech frames. The conclusive angle θ characterizing the direction of a sound source was determined by selecting the most frequently estimated azimuth and height over the speech frames.
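The selection of the most frequently estimated azimuth and height over the speech frames is effectively a majority vote; a small sketch (ours, not the authors' code) is:

```python
from collections import Counter

def conclusive_direction(frame_estimates):
    """Return the (azimuth, height-part) pair estimated most often over the speech frames."""
    return Counter(frame_estimates).most_common(1)[0][0]

print(conclusive_direction([(130, "near 0 cm"), (130, "near 0 cm"), (120, "over +20 cm")]))
# -> (130, 'near 0 cm')
```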

Fig. 7. Experiment with a speech source. (a) 0°, 0 cm; (b) 10°, +5 cm; (c) 50°, +23 cm; (d) 100°, -38 cm; (e) 150°, 0 cm; (f) 300°, -38 cm; (g) 270°, +23 cm; (h) 290°, -38 cm (azimuth, height).

B. Experiment results

We tested our system in a general office room with some noise from an air conditioner and from personal computers. The robot was located at the center of a room measuring 11 m × 8 m with a 4.5 m ceiling. The wall in front of the robot was a partition 10 m × 0.5 m in size and 1 m tall; the wall behind it was glass, covered with window blinds. In consideration of the microphones' performance, we tested within a radius of 5 m from where the robot was positioned. Figure 7 shows the room configuration and the experimental method, with the spontaneous speech of a man at each trial. Since the robot's neck could be rotated within a height range of -15 to 15 degrees, the height resolution for localization was divided into three parts, over +20 cm, near 0 cm, and below -20 cm, measured at a distance of 1.5 m from the motor axis of the robot's neck. Figure 7 shows that the robot succeeded in finding the location of the man speaking.

To make the experiment precise, we used speech data recorded in English for 2 seconds and three loudspeakers. The three loudspeakers were placed at heights of 130 cm, 150 cm, and 180 cm, respectively, because MAHRU is 150 cm tall. The sound source was played 20 times at each predetermined locus, for every 10-degree azimuth unit and every one of the three height parts, to localize a sound source over the whole azimuth and the height divided into three parts. Each time recorded speech was played from a loudspeaker, we counted a success if MAHRU looked at the location of the loudspeaker by the motor event and a failure if it looked at the wrong location.

TABLE I
SUCCESS RATES OF EXPERIMENT

Height, over +20 cm:  97.27%
Height, near 0 cm:    94.25%
Height, below -20 cm: 99.61%
Azimuth, 0°~360°:     95.63%


Fig. 6. System overview.



Fig. 8. Success rates of the experiment.

Experimental results for 2160 trials (36 azimuth units × 3 height parts × 20 repetitions) are shown in Table I and Fig. 8. At times, our system had an error of 10 degrees along the azimuth or missed the height, but it exhibited a solid performance. The high success rates in Table I and Fig. 8 confirm that we overcame the problems mentioned in Section II.

IV. CONCLUSION

In this paper, we reported on a speaker localization system for the head of a humanoid robot at KIST. We also pointed out some problems in estimating the direction of a sound source over the whole azimuth and height range. To solve these problems, we proposed and utilized a TDOA-based feature matrix and its algorithm based on MSAE. Our speaker localization system also involved the VAD and CSP processes and the design of the microphone array. Experimental results demonstrated that our method is effective.

REFERENCES

[1] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, 1999.
[2] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of multiple sound sources based on a CSP analysis with a microphone array," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 1053-1056, 2000.
[3] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
[4] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications, 1st ed., Springer, pp. 157-180, 2001.
[5] M. S. Brandstein and H. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 375-378, Munich, Germany, April 1997.
[6] K. Guentchev and J. Weng, "Learning-based three dimensional sound localization using a compact non-coplanar array of microphones," in Proc. AAAI Spring Symposium on Intelligent Environments, Stanford, CA, USA, March 1998.
[7] J. Kim, U. H. Kim, and B. J. You, "Sound source localization using time delay of arrival feature matrix," in Proc. KIEE & IEEK Conference on Information and Control Systems, pp. 143-144, 2007.
[8] T. Ishida and Y. Kuroki, "Sensor system of a small biped entertainment robot," Advanced Robotics, vol. 18, no. 10, pp. 1039-1052, 2004.
[9] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura, "The intelligent ASIMO: system overview and integration," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 2478-2483, 2002.
[10] Y. Matsusaka, T. Tojo, S. Kubota, K. Furukawa, D. Tamiya, K. Hayata, Y. Nakano, and T. Kobayashi, "Multi-person conversation via multi-modal interface: a robot who communicates with multi-user," in Proc. EUROSPEECH-99, pp. 1723-1726, 1999.
[11] H. G. Okuno, K. Nakadai, and H. Kitano, "Real-time sound source localization and separation for robot audition," in Proc. IEEE Int. Conf. on Spoken Language Processing, pp. 193-196, 2002.
[12] H. G. Okuno, K. Nakadai, and H. Kitano, "Social interaction of humanoid robot based on audio-visual tracking," in Proc. 18th Int. Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, pp. 725-735, 2002.
[13] J.-M. Valin, F. Michaud, J. Rouat, and D. Letourneau, "Robust sound source localization using a microphone array on a mobile robot," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, vol. 2, pp. 1228-1233, 2003.

[14] F. Asano, H. Asoh, and T. Matsui, "Sound source localization and signal separation for office robot 'Jijo-2'," in Proc. 1999 IEEE Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems, pp. 243-248, 1999.