2011 IEEE/RSJ International Conference on Intelligent Robots and Systems September 25-30, 2011. San Francisco, CA, USA

SLAM-based Online Calibration of Asynchronous Microphone Array for Robot Audition

Hiroaki Miura, Takami Yoshida, Keisuke Nakamura, and Kazuhiro Nakadai

Abstract— This paper addresses the online calibration of an asynchronous microphone array for robots. Conventional microphone array technologies require many measurements of transfer functions to calibrate microphone locations, and a multi-channel A/D converter for inter-microphone synchronization. We solve these two problems with a framework that combines Simultaneous Localization and Mapping (SLAM) and beamforming in an online manner. To do this, we let the estimations of the microphone locations, the sound source location, and the microphone clock differences correspond to mapping, self-localization, and observation errors in SLAM, respectively. In our framework, the SLAM process calibrates the locations and clock differences of the microphones every time the microphone array observes a sound such as a human's handclap, and a beamforming process works as a cost function deciding the convergence of calibration by localizing the sound with the estimated locations and clock differences. After calibration, beamforming is used for sound source localization. We implemented a prototype system using Extended Kalman Filter (EKF) based SLAM and Delay-and-Sum Beamforming (DS-BF). The experimental results showed that the microphone locations and clock differences were estimated properly with 10-15 sound events (handclaps), and that the error of sound source localization with the estimated information was less than the grid size of beamforming, that is, the lowest theoretically attainable error.

I. INTRODUCTION

Microphone array processing is a promising approach to improving robot audition. Indeed, many studies on robot audition utilize a microphone array [1], [2], [3], [4]. Thanks to such microphone array processing, sound source localization, sound source separation, and automatic speech recognition of separated speech have been reported, and simultaneous speech recognition and moving sound source recognition have been attained [5]. Most of these studies, however, rely on two assumptions for microphone array processing. One is that the location of each microphone is known. Because microphones are embedded in a robot (e.g., in a robot's head), it is sometimes hard to measure accurate microphone locations. Instead of using microphone locations, a transfer function between each microphone and a sound source can be used; however, since a set of transfer functions must be measured at many sound source directions (e.g., at 5-degree intervals), the measurements are time-consuming. The other is that a sound is synchronously recorded with all microphones, that is, inter-microphone synchronization is assumed. This requires a special sound capturing device such as a multi-channel A/D converter. There are several commercial products for this, but they are too big to be embedded in a robot, and small-size devices are usually expensive. These two assumptions are barriers to deploying microphone array processing for robots, even though such processing has drastically improved auditory functions such as sound source localization and separation.

In the fields of ubiquitous sensors and signal processing, distributed microphone arrays have been studied [6], [7], [8], [9]. Most studies in these fields also make the above two assumptions. However, for more practical use, recent methods that relax these assumptions have started to be reported. Thrun reported online calibration of a microphone array [10] and showed its effectiveness using actual microphone devices. However, this work assumed that sound source locations are given and that microphones are fully synchronized, i.e., inter-microphone synchronization. Ono et al. defined the "Blind Alignment Problem": estimating sound source locations, microphone locations, and microphone clock differences using only the sounds observed with the microphones [11]. They clarified the requirements to solve this problem, such as the relationship between the number of microphones and the number of sound events, and showed a theoretical framework by introducing a supplementary function. However, they focused only on an offline solution, which means that the calculation cost of their method is expensive, and thus it does not work in real time. In addition, with offline processing, it is difficult to know in advance the minimum number of sound events needed to obtain a target calibration precision. Online calibration incrementally outputs the calibrated information, which means that the system can finish the calibration process as soon as the target precision is fulfilled. Therefore, online calibration is more effective and efficient.

This paper proposes a framework combining Simultaneous Localization and Mapping (SLAM) and Beam-Forming (BF) to solve these two problems in an online manner. To use the SLAM framework, we let the estimations of the microphone locations, the sound source location, and the microphone clock differences correspond to mapping, self-localization, and observation errors in the SLAM framework, respectively. SLAM calibrates the location of each microphone and the microphone clock difference every time the microphone array observes a sound such as a human's handclap. BF is used to decide the convergence of the SLAM estimation: a localization result obtained with the estimated locations and clock differences serves as a cost function for convergence during calibration. BF is also used to localize a sound source after calibration. We implemented a prototype system using Extended Kalman Filter based SLAM (EKF-SLAM) and Delay-and-Sum Beam-Forming (DS-BF), which realize online calibration and source localization for an asynchronous microphone array. The effectiveness of the proposed approach and the implemented system is shown through numerical experiments as well as with an actual microphone array device.

The rest of this paper is organized as follows: Section II proposes the calibration method to solve the above-mentioned issues. Section III shows the system structure introducing the proposed method. Section IV evaluates the system to show the validity of our proposed method. The last section gives the conclusion.

H. Miura, T. Yoshida, and K. Nakadai are with the Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan. {miura,yoshida}@cyb.mei.titech.ac.jp

K. Nakamura and K. Nakadai are with Honda Research Institute Japan Co., Ltd., 8-1 Honcho, Wako, Saitama 351-0114, Japan. {keisuke,nakadai}@jp.honda-ri.com



TABLE I
NOTATION OF VARIABLES

  N                      number of microphones
  n                      microphone index
  K                      number of handclaps
  k                      handclap index
  l                      recursive time-step index in EKF-SLAM
  c                      sound speed
  τ_s[k]                 absolute handclap time
  x_s[k], y_s[k]         x-y position of the k-th handclap
  θ_s[k]                 orientation of the human walk
  ξ_s[k]                 human state [x_s[k], y_s[k], θ_s[k]]^T
  x_mn, y_mn             x and y positions of the n-th microphone
  τ_mn                   microphone clock difference
  ξ_mn                   n-th microphone state [x_mn, y_mn, τ_mn]^T
  ξ_m                    microphone array state [ξ_m1^T, ..., ξ_mN^T]^T
  ξ                      all states [ξ_s[k]^T, ξ_m^T]^T
  N(x̄, σ)                Gaussian noise of mean x̄ and standard deviation σ
  S_[k](ω)               spectrum of the k-th handclap
  X_n[k](ω)              k-th handclap spectrum of the n-th microphone
  X_[k](ω)               [X_1[k](ω), ..., X_N[k](ω)]^T
  A(ξ_s[k], ξ_m, ω)      transfer function between ξ_s[k] and ξ_m
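To make the state layout of Table I concrete, the short Python sketch below (our own illustration; the paper provides no code, and the function names are hypothetical) packs the human state and the N microphone states into the single vector ξ that the EKF operates on later.

```python
import numpy as np

def pack_state(human, mics):
    """xi = [xi_s^T, xi_m1^T, ..., xi_mN^T]^T per Table I.
    human: (x_s, y_s, theta_s); mics: (N, 3) rows of (x_mn, y_mn, tau_mn)."""
    return np.concatenate([np.asarray(human, float),
                           np.asarray(mics, float).ravel()])

def unpack_state(xi, N):
    """Recover the human state and the (N, 3) microphone block from xi."""
    return xi[:3], xi[3:3 + 3 * N].reshape(N, 3)

xi = pack_state((0.0, 0.0, 0.0), np.zeros((8, 3)))  # N = 8 -> len(xi) == 27
```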

Fig. 1. Microphone Clock Difference (signal reception timing at the 1st and n-th microphones).

Fig. 2. Model of Human and Microphones (estimated and true states).

II. SLAM-BASED ONLINE CALIBRATION

This section describes the online microphone calibration method based on EKF-SLAM. The notation of the variables used in this paper is shown in Table I. The SLAM framework estimates both the human (sound source) location and the asynchronous microphone locations simultaneously as the human moves and claps hands. The framework follows common EKF-SLAM in the following sense:
1) Map (landmark location) estimation is extended to estimating the location of each microphone.
2) Self-localization is extended to estimating the human location.
3) Error minimization is utilized to minimize the errors of 1) and 2), including the microphone clock differences.
Moreover, in addition to EKF-SLAM, our approach simultaneously estimates the location of the sound source based on DS-BF, which tells us the approximate estimation error and is used as the trigger for finishing the routine.

A. Model and Assumptions

1) Sound Source Model: The proposed framework assumes the following conditions:
1) One person walks around the microphone array while clapping hands.
2) Each handclap is a point source.
3) All microphones are stationary and in free space.
4) Each microphone has a constant delay.
From 1), we denote the human location at the k-th handclap as ξ_s[k]. From 2), the spectrum of the k-th handclap at frequency ω is defined as S_[k](ω). Assumptions 3) and 4) mean that x_mn, y_mn, and τ_mn should be estimated and converge to constant values. Assumptions 2) and 3) mean that we can define the observed signals using a spatial transfer function between the point of the k-th handclap and the microphone array by the following linear model:

X_{[k]}(\omega) = A(\xi_{s[k]}, \xi_m, \omega) S_{[k]}(\omega).   (1)

By 4), we can compute A(ξ_s[k], ξ_m, ω) from both the geometrical relationship of the microphone array and τ_mn. The time at which the n-th microphone receives the k-th sound is shown in Fig. 1 and is mathematically described as

t_{n[k]} = \tau_{s[k]} + \frac{D_{n[k]}}{c} + \tau_{m_n},   (2)

where D_n[k] is the distance between the n-th microphone and the k-th sound source, calculated by

D_{n[k]} = \sqrt{(x_{s[k]} - x_{m_n})^2 + (y_{s[k]} - y_{m_n})^2}.   (3)

From Eq. (2), A(ξ_s[k], ξ_m, ω) is ideally calculated as

A(\xi_{s[k]}, \xi_m, \omega) = [\exp(-2\pi j \omega t_{1[k]}), \ldots, \exp(-2\pi j \omega t_{N[k]})]^T.   (4)

This is a so-called steering vector, which is utilized for Sound Source Localization (SSL). Conventional SSL methods [12] assume pre-measured steering vectors; in practice, however, the measurement takes a long time and assumes a well-equipped setup. Thus, this paper obtains A(ξ_s[k], ξ_m, ω) by estimating ξ_s[k] and ξ_m using the combination of EKF-SLAM and DS-BF.
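As a concrete illustration of Eqs. (2)-(4), the following sketch (our own addition, not the authors' code; the sound speed value and variable names are assumptions) computes the arrival times t_n[k] and the resulting steering vector for one handclap.

```python
import numpy as np

C = 343.0  # sound speed c [m/s]; an assumed value, not stated in the paper

def arrival_times(src_xy, mic_xy, mic_tau, tau_src=0.0, c=C):
    """Eqs. (2)-(3): t_n[k] = tau_s[k] + D_n[k]/c + tau_mn."""
    D = np.linalg.norm(mic_xy - src_xy, axis=1)   # D_n[k], Eq. (3)
    return tau_src + D / c + mic_tau              # t_n[k], Eq. (2)

def steering_vector(src_xy, mic_xy, mic_tau, omega, c=C):
    """Eq. (4): A = [exp(-2*pi*j*omega*t_1[k]), ..., exp(-2*pi*j*omega*t_N[k])]^T."""
    t = arrival_times(src_xy, mic_xy, mic_tau, c=c)
    return np.exp(-2j * np.pi * omega * t)

# Example: 8 microphones with zero clock offsets, a source at (1, 2), omega = 1 kHz.
mics = np.random.default_rng(0).uniform(-1.0, 1.0, size=(8, 2))
A = steering_vector(np.array([1.0, 2.0]), mics, np.zeros(8), omega=1000.0)
```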

2) State Transition Model: The human motion in the state transition model for EKF-SLAM is modeled as a straight walk with a constant velocity v_s[l] in the orientation θ_s[l] (see Fig. 2). Here, we use the index l instead of k, since the state is updated at a constant time interval rather than at every handclap. Thus, the model is described as

\xi_{s[l+1]} = \xi_{s[l]} + \begin{bmatrix} \sin\theta_{s[l]} & 0 \\ \cos\theta_{s[l]} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v_{s[l]}\,\Delta t \\ \dot{\theta}_{s[l]}\,\Delta t \end{bmatrix} + W_{s[l]},   (5)

where v_s[l] is the walking velocity of the human and W_s[l] is Gaussian noise represented by

W_{s[l]} = [\mathcal{N}(0, \sigma_x), \mathcal{N}(0, \sigma_y), \mathcal{N}(0, \sigma_a)]^T.   (6)

We consider stationary microphones for ξ_m, so the model is represented as

\xi_{m[l+1]} = \xi_{m[l]} + W_{m[l]},   (7)

W_{m[l]} = [W_{m_1[l]}^T, \ldots, W_{m_N[l]}^T]^T \in \mathbb{R}^{3N \times 1},   (8)

W_{m_n[l]} = [\mathcal{N}(0, \sigma_m), \mathcal{N}(0, \sigma_m), 0]^T.   (9)

3) Observation Model: The observation obtains the time of arrival of S_[k](ω) at each microphone, namely t_n[k]. Since τ_s[k] is unknown, we take the difference between the observation time of the 1st microphone and that of each other microphone, which eliminates τ_s[k]:

t_{n[k]} - t_{1[k]} = \frac{D_{n[k]} - D_{1[k]}}{c} + \tau_{m_n} - \tau_{m_1}.   (10)

Therefore, the observation model is represented with the microphone clock differences as

\zeta_{[k]} = \begin{bmatrix} \frac{D_{2[k]} - D_{1[k]}}{c} + \tau_{m_2} - \tau_{m_1} \\ \vdots \\ \frac{D_{N[k]} - D_{1[k]}}{c} + \tau_{m_N} - \tau_{m_1} \end{bmatrix} + \delta_{[k]},   (11)

where δ_[k] is the Gaussian observation noise described as

\delta_{[k]} = [\mathcal{N}(0, \sigma_r), \ldots, \mathcal{N}(0, \sigma_r)]^T \in \mathbb{R}^{(N-1) \times 1}.   (12)
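A minimal sketch of this observation model (again our own illustration, with hypothetical names; microphone 1 is index 0 here): given the geometry and clock offsets, it produces the (N−1)-dimensional observation ζ_[k] of Eq. (11) with the noise of Eq. (12).

```python
import numpy as np

def observe_tdoa(src_xy, mic_xy, mic_tau, sigma_r, c=343.0, rng=None):
    """Eqs. (10)-(12): time differences of arrival relative to microphone 1."""
    if rng is None:
        rng = np.random.default_rng()
    D = np.linalg.norm(mic_xy - src_xy, axis=1)              # D_n[k], Eq. (3)
    zeta = (D[1:] - D[0]) / c + (mic_tau[1:] - mic_tau[0])   # Eq. (11)
    return zeta + rng.normal(0.0, sigma_r, size=zeta.shape)  # noise, Eq. (12)
```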

B. Extended Kalman Filter

Fig. 3. Calibration Flow (prediction step: motion prediction; observation step: measurement and subtraction of the mic 1 data; update step: Kalman gain calculation, prediction update, and covariance update).

Fig. 3 shows the calibration flow based on EKF-SLAM. The EKF framework consists of the following four repetitive steps: prediction, observation, update, and evaluation. The evaluation step is our new routine, which utilizes DS-BF to estimate the current prediction error. The intervals of k and l are different: the prediction and evaluation steps are executed regardless of the k-th observation, whereas the observation and update steps are processed only when the system receives the k-th handclap.

1) Prediction Step: In the prediction step, we update the mean and variance of the state vector, which includes the microphone and human locations. Here, the updated mean and variance at the l-th update step are denoted by ξ̂_[l] and P̂_[l]. From Eqs. (5) and (7), ξ̂_[l−1] = [ξ̂_{s[l−1]}^T, ξ̂_{m[l−1]}^T]^T is updated by

\hat{\xi}_{s[l|l-1]} = \hat{\xi}_{s[l-1]} + \begin{bmatrix} \sin\hat{\theta}_{s[l-1]} & 0 \\ \cos\hat{\theta}_{s[l-1]} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v_{s[l]}\,\Delta t \\ \dot{\theta}_{s[l]}\,\Delta t \end{bmatrix},   (13)

\hat{\xi}_{m[l|l-1]} = \hat{\xi}_{m[l-1]}.   (14)

P̂_[l−1] is updated as follows:

\hat{P}_{[l|l-1]} = G_{[l]} \hat{P}_{[l-1]} G_{[l]}^T + F^T R F,   (15)

F = [I^{3 \times 3}, O^{3 \times 3N}],   (16)

where R is the covariance matrix defined by R = diag(σ_x^2, σ_y^2, σ_a^2), and G_[l] is the Jacobian of the state transition model, represented by

G_{[l]} = I + F^T \begin{bmatrix} 0 & 0 & -v_{s[l]}\,\Delta t \sin\theta_{s[l]} \\ 0 & 0 & v_{s[l]}\,\Delta t \cos\theta_{s[l]} \\ 0 & 0 & 0 \end{bmatrix} F.   (17)

2) Observation Step: When each handclap S_[k](ω) is received by the microphone array, we take the differences of the arrival times. Here, we denote the measured time differences as ζ_[k]. ζ_[k] is predicted from Eq. (11) as follows:

h(\hat{\xi}_{[k|k-1]}) = \begin{bmatrix} \frac{\hat{D}_{2[k]} - \hat{D}_{1[k]}}{c} + \hat{\tau}_{m_2} - \hat{\tau}_{m_1} \\ \vdots \\ \frac{\hat{D}_{N[k]} - \hat{D}_{1[k]}}{c} + \hat{\tau}_{m_N} - \hat{\tau}_{m_1} \end{bmatrix}.   (18)

The Kalman gain minimizes the estimation error between h(ξ̂_[k|k−1]) and ζ_[k].

3) Update Step: This step computes the Kalman gain to minimize the estimation error as follows:

K_{[k]} = P_{[k|k-1]} H_{[k]}^T \left( H_{[k]} P_{[k|k-1]} H_{[k]}^T + Q_{[k]} \right)^{-1},   (19)

\hat{\xi}_{[k]} = \hat{\xi}_{[k|k-1]} + K_{[k]} \left( \zeta_{[k]} - h(\hat{\xi}_{[k|k-1]}) \right),   (20)

P_{[k]} = (I - K_{[k]} H_{[k]}) P_{[k|k-1]},   (21)

where H_[k] = ∂h(ξ_[k])/∂ξ_[k] |_{ξ = ξ̂_[k|k−1]}, and Q_[k] is the covariance matrix defined by Q_[k] = diag(σ_r^2, ..., σ_r^2) ∈ ℝ^{(N−1)×(N−1)}.
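The three EKF steps above fit in a few lines of code. The sketch below is our own condensation of Eqs. (13)-(21), not the authors' MATLAB implementation; for brevity, the measurement Jacobian H_[k] is obtained by finite differences rather than analytically, and the motion Jacobian is written to be consistent with the motion model as implemented here.

```python
import numpy as np

def predict(xi, P, v, theta_dot, dt, R3):
    """Prediction step, Eqs. (13)-(17). xi: state of length 3 + 3N."""
    n = len(xi)
    th = xi[2]
    xi_p = xi.copy()
    xi_p[0] += v * dt * np.sin(th)   # Eq. (13), human x
    xi_p[1] += v * dt * np.cos(th)   # Eq. (13), human y
    xi_p[2] += theta_dot * dt        # Eq. (13), heading
    F = np.hstack([np.eye(3), np.zeros((3, n - 3))])   # Eq. (16)
    G = np.eye(n)                    # Eq. (17): derivative of the motion above
    G[0, 2] = v * dt * np.cos(th)
    G[1, 2] = -v * dt * np.sin(th)
    P_p = G @ P @ G.T + F.T @ R3 @ F                   # Eq. (15)
    return xi_p, P_p

def h(xi, N, c=343.0):
    """Predicted observation, Eq. (18)."""
    src, mics = xi[:2], xi[3:].reshape(N, 3)
    D = np.linalg.norm(mics[:, :2] - src, axis=1)
    return (D[1:] - D[0]) / c + (mics[1:, 2] - mics[0, 2])

def update(xi, P, zeta, N, sigma_r, eps=1e-6):
    """Update step, Eqs. (19)-(21), with a finite-difference Jacobian H."""
    n = len(xi)
    h0 = h(xi, N)
    H = np.zeros((N - 1, n))
    for j in range(n):               # numerical dh/dxi
        d = np.zeros(n); d[j] = eps
        H[:, j] = (h(xi + d, N) - h0) / eps
    Q = sigma_r ** 2 * np.eye(N - 1)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Q)   # Eq. (19)
    xi_n = xi + K @ (zeta - h0)                    # Eq. (20)
    P_n = (np.eye(n) - K @ H) @ P                  # Eq. (21)
    return xi_n, P_n
```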

4) Evaluation Step: This step evaluates the approximate estimation error of the microphone locations by DS-BF [13]. ξ̂_[k] holds the locations and clock differences of all microphones, namely ξ̂_m. First, we decide a point for beamforming, denoted as ξ̃_s = [x̃_s, ỹ_s]^T: we create a grid map of the sound source location and execute beamforming for each grid point. As in Eq. (3), the distance between ξ̃_s and the n-th microphone is computed by

\tilde{D}_n = \sqrt{(\tilde{x}_s - \hat{x}_{m_n})^2 + (\tilde{y}_s - \hat{y}_{m_n})^2}.   (22)

The time difference between the emission of S_[k](ω) and its reception at each microphone is described as

\tilde{t} = \begin{bmatrix} \tilde{t}_1 \\ \vdots \\ \tilde{t}_N \end{bmatrix} = \frac{1}{c} \begin{bmatrix} \tilde{D}_1 \\ \vdots \\ \tilde{D}_N \end{bmatrix} + \begin{bmatrix} \hat{\tau}_{m_1} \\ \vdots \\ \hat{\tau}_{m_N} \end{bmatrix}.   (23)

Fig. 4. Room Setup (microphones, the walking human, and the evaluation points for the convergence decision).

Fig. 5. System Structure (sound source → HARK audio stream → sound event detection → EKF-SLAM, which outputs microphone locations and delays; a convergence decision loops back until calibration finishes; sound locations are obtained with SLAM and, after calibration, with beamforming).

Then, the steering vector for ξ̃_s is computed by

W(\tilde{\xi}_s, \hat{\xi}_m, \omega) = [\exp(-2\pi j \omega \tilde{t}_1), \ldots, \exp(-2\pi j \omega \tilde{t}_N)]^T.   (24)

Finally, with the measured spectrum in Eq. (1), the output of the beamforming is computed by

Y(\tilde{\xi}_s, \omega) = W^*(\tilde{\xi}_s, \hat{\xi}_m, \omega) X_{[k]}(\omega).   (25)

If a sound source exists in the direction of ξ̃_s, the output of the beamformer is maximized; hence the sound source will be located where the beamformer output is maximized. The output is calculated in each frequency bin ω_i. Here, we simply regard the summation of the output over ω_i as the final beamforming result, described as follows:

\bar{Y}(\tilde{\xi}_s) = \sum_{i = i_l}^{i_h} \left| Y(\tilde{\xi}_s, \omega_i) \right|,   (26)

where i_l and i_h are the frequency bin indices representing the minimum and maximum frequencies considered. Then, the argmax of Ȳ(ξ̃_s) over ξ̃_s is regarded as the estimation result.

The error convergence is evaluated as follows (a compact sketch of this routine is given after the list):
1) Decide 9 evaluation points to detect convergence, as shown in Fig. 4: divide the sound source field into a 3-by-3 grid and define the center position of each grid cell as ξ̃_si (1 ≤ i ≤ 9).
2) Using ξ̂_m, generate 9 virtual point sound sources at the ξ̃_si. Define the 9 sources as X_i(ω) (1 ≤ i ≤ 9).
3) Use the beamformer to estimate the location of each X_i(ω). Define the location estimated by the beamformer as ȳ(ξ̃_si).
4) Take the error between ȳ(ξ̃_si) and ξ̃_si, represented as e_i = ȳ(ξ̃_si) − ξ̃_si.
5) Compute the total error ē = Σ_i e_i.
6) If ē is smaller than the grid size, finish EKF-SLAM.
The important point is that this beamforming routine works independently of EKF-SLAM and therefore gives an objective estimation status.
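The sketch below is our own compact rendering of this evaluation routine (Eqs. (22)-(26) plus the six-step convergence check); the grid parameters and the spectrum generator make_spectra are hypothetical placeholders, not named in the paper.

```python
import numpy as np

def dsbf_map(grid_xy, mic_est, X, omegas, c=343.0):
    """Summed beamformer power, Eqs. (22)-(26), for each grid point.
    mic_est: (N, 3) estimated (x, y, tau); X: (len(omegas), N) observed spectra."""
    power = np.zeros(len(grid_xy))
    for g, p in enumerate(grid_xy):
        D = np.linalg.norm(mic_est[:, :2] - p, axis=1)   # Eq. (22)
        t = D / c + mic_est[:, 2]                        # Eq. (23)
        for i, w in enumerate(omegas):
            W = np.exp(-2j * np.pi * w * t)              # Eq. (24)
            power[g] += abs(np.conj(W) @ X[i])           # Eqs. (25)-(26)
    return power

def converged(eval_pts, mic_est, make_spectra, omegas, grid_xy, grid_size):
    """Six-step check: localize the 9 virtual sources and compare the
    total localization error against the beamforming grid size."""
    total = 0.0
    for p in eval_pts:                                   # the 9 points xi_si
        X = make_spectra(p, mic_est)                     # virtual source at p
        best = grid_xy[np.argmax(dsbf_map(grid_xy, mic_est, X, omegas))]
        total += np.linalg.norm(best - p)                # e_i
    return total < grid_size                             # step 6
```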

III. SYSTEM IMPLEMENTATION

We implemented the proposed calibration system in an online manner. We used a room-installed, distributed 8-channel microphone array; the input acoustic signal was sampled at 16 kHz with 16 bits. The window and shift lengths for frequency analysis were set to 512 and 160 samples, respectively. For the sound source, we used real human handclaps. The audio stream extraction is implemented as modules for the open-source robot audition software HARK [14]. The audio stream is sent to MATLAB, which detects the sound source events t_[k] = [t_1[k], ..., t_N[k]]^T. t_[k] is then sent to the EKF-SLAM and DS-BF modules implemented in MATLAB, which output ξ̂. The entire system worked in real time on a laptop with a 2.0 GHz Intel Core i7 CPU and 8 GB SDRAM running Linux.

IV. EVALUATION

We evaluated the effectiveness of EKF-SLAM using both numerical simulation and a real microphone array in a room.

A. Numerical simulation

In the numerical simulation, we used two types of rooms: one is 4 m × 7 m (room A) and the other is 1.2 m × 2.4 m (room B). The number of microphones is 8 in both cases; their locations are shown in Table II. Since we use relative values for the observation times, as shown in Eq. (11), the system has ambiguities. For disambiguation, the location of Mic 1 was always set to the origin, and the x and y positions of Mic 2 were maintained at 0 and a positive value, respectively. We used an impulse signal, corresponding to a handclap, as the sound source. It moved along the edges of the room in the counterclockwise direction. The starting point was the left-bottom corner of the room, which was given to the system. It produced a sound at five-step intervals; one step was around 0.6 m in room A and 0.3 m in room B. The standard deviations for the human motion model were 0.1 m for location and 1 deg for orientation. The observation noise had a 0.5 ms deviation, which corresponds to 0.17 m. The clock difference of each microphone was fixed, with a deviation of 0.1 s. In the initial states of the microphone array calibration, the microphones were distributed uniformly in the room, and the clock differences were set to 0 for all microphones.
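To make the simulated setup concrete, a generator like the following (our illustration; observe_tdoa is the sketch from Section II-A) could produce the clap positions along the room edges. Note that the coordinates here are relative to the left-bottom corner, whereas Table II centers the origin in the room.

```python
import numpy as np

def clap_positions(width, height, step, steps_per_clap=5):
    """Claps along the room edges, counterclockwise from the left-bottom
    corner, one clap every `steps_per_clap` steps of length `step` [m]."""
    corners = np.array([[0, 0], [width, 0], [width, height],
                        [0, height], [0, 0]], dtype=float)
    spacing = step * steps_per_clap
    pts = []
    for a, b in zip(corners[:-1], corners[1:]):
        n = max(1, int(np.linalg.norm(b - a) / spacing))
        pts.extend(a + (b - a) * k / n for k in range(n))
    return np.array(pts)

# Room A: 4 m x 7 m, ~0.6 m steps, a clap every 5 steps.
claps = clap_positions(4.0, 7.0, 0.6)
```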

TABLE II
MICROPHONE ARRAY SETTING

            room A            room B
  Mic No.   x (m)    y (m)    x (m)    y (m)
     1       0.0      0.0      0.0      0.0
     2       1.0      0.0      0.3      0.0
     3      -0.5      3.0      0.3     -0.6
     4       1.0      2.5     -0.3     -0.3
     5      -1.0     -1.0     -0.3     -0.9
     6      -0.5     -2.5     -0.3      0.3
     7       1.0     -2.0      0.3      0.6
     8      -1.0      1.5      0.0      0.9

  The origin is the center of each room.

Fig. 6. Spatial spectrum of beamforming (a sound source is located at the center): a) before calibration (with initial states), b) after calibration, c) ground truth. Each panel plots the beamformer power over the x-y plane.

Fig. 7. Estimation Errors in Numerical Simulation (rooms A and B, plotted against the number of handclaps): a) estimation error of human location, b) average estimation error of microphone locations, c) average estimation error of microphone clock differences, d) average estimation error of delay-and-sum beamforming, with the DS-BF grid size marking convergence.

Figure 7a)-d) shows the estimation errors of microphone calibration with the proposed method in terms of human location, microphone locations, microphone clock differences, and sound source localization with beamforming, respectively. Figure 6 compares the beamforming results before calibration, after calibration, and for the ground truth. In Fig. 7a), the human location has large errors due to non-linear human motions, while in Fig. 7d), the error of beamforming converges with every handclap. This shows that the proposed framework for microphone array calibration works properly. However, EKF-SLAM alone is insufficient to localize a sound source, and beamforming is effective for localizing a sound source accurately, or for localizing multiple sound sources. Since the grid size of beamforming was 0.2 m, we can say that the process has converged when the error of beamforming becomes less than 0.2 m. Based on this criterion, 13 and 15 handclaps were necessary to calibrate the microphone array in rooms A and B, respectively. In Fig. 7b), the error of the microphone locations decreases steadily, which means that our proposed method successfully calibrates the microphone positions. On the other hand, in Fig. 7c), the errors appear almost unchanged; however, in the first five handclaps the errors were slightly reduced in both cases. In Fig. 6a), beamforming failed to localize a sound source located at the center of the room, since the microphones were randomly distributed in the initial states. Figure 6b) shows that our microphone calibration method achieves sufficient performance to localize a sound source: a sharp peak appears properly at the center of the room. In addition, compared to Fig. 6c), it is clear that the directivity characteristics are successfully estimated.

B. The use of a real microphone array

Since our proposed method was theoretically validated through the numerical simulation, we applied it to the calibration of a real microphone array. We set up an 8-channel microphone array using MEMS microphones on a table in a 4 m × 5 m room. The size of the table was around 1.2 m × 2.4 m, and the layout of the microphones was the same as room B in the numerical simulation. We used a multichannel A/D converter called RASP24, developed by Systems In Frontier Corp., which supports synchronous recording, i.e., no microphone clock difference. After we recorded the clapping sounds, we added a delay to each channel of the recorded sound. This means that we know the exact values of the microphone clock differences, but the system does not. We asked a person to clap his hands 1) every five steps and 2) every seven steps. The length of his step was approximately 0.3 m, which was almost the same as the condition of room B in the numerical simulation. The height of the handclaps was 0.05 m above that of the microphone array. The standard deviations for the clock difference, motion model, and observation were set to the same values as in the numerical simulation.

Figure 8 shows the estimation errors with the real microphone array, and Fig. 9 shows the final calibration results of the microphone arrays. In Fig. 8, the system showed the same tendency as in the numerical simulation. When we consider 0.2 m as the beamforming error allowance, only 9 and 14 handclaps are necessary to finish the microphone array calibration in the cases of 7-step and 5-step intervals, respectively. These results are comparable to the simulation. Fig. 9 shows that the location of each microphone is estimated with a small error. On the other hand, this indicates that precise locations of microphones are not always necessary in microphone array processing like beamforming, although this might depend on how accurate the desired sound source localization is.
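The delay-injection procedure described above can be sketched as follows (our illustration, not the authors' tooling); it shifts each channel of the synchronously recorded signal by a known whole-sample offset, so the ground-truth clock differences are known to the experimenter but not to the system. The 16 kHz sampling rate follows Section III.

```python
import numpy as np

def add_channel_delays(audio, delays_s, fs=16000):
    """audio: (channels, samples) synchronized recording;
    delays_s: per-channel delay in seconds (positive delays the channel)."""
    out = np.zeros_like(audio)
    n = audio.shape[1]
    for ch, d in enumerate(delays_s):
        s = int(round(d * fs))          # whole-sample shift only
        if s >= 0:
            out[ch, s:] = audio[ch, :n - s]
        else:
            out[ch, :s] = audio[ch, -s:]
    return out
```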

Fig. 8. Estimation Errors with a Real Microphone Array (one clap every 5 steps vs. one clap every 7 steps, plotted against the number of handclaps): a) estimation error of human location, b) average estimation error of microphone locations, c) average estimation error of microphone clock differences, d) average estimation error of delay-and-sum beamforming, with the DS-BF grid size marking convergence.

Fig. 9. Estimation Results of Microphone Array Calibration: a) seven-step interval case, b) five-step interval case. Each panel shows the estimated and ground-truth positions of the human and the microphones.

V. CONCLUSION

This paper presented SLAM-based online calibration of an asynchronous microphone array for robot audition. We proposed to combine EKF-SLAM and beamforming to realize this, and showed the effectiveness of the proposed method through numerical simulation and experiments with real microphone arrays. The proposed method relaxes two main barriers to deploying microphone array processing in robotics, that is, the time-consuming acoustical measurement needed to obtain transfer functions between a sound source and a microphone array, and the necessity of special hardware for multichannel synchronous recording. Therefore, we believe that further development of this method will bring a breakthrough in the fields of robot audition and sensor networks. In this paper, we used a distributed microphone array in a room; our method is also applicable to robot-embedded microphone arrays and to widely distributed microphones in the field, and its applicability to such purposes should be confirmed. We used the EKF in this paper although it is not robust enough for a highly non-linear problem. Thus, extending the proposed framework to deal with such non-linearity by introducing the UKF or particle filtering remains future work.

ACKNOWLEDGMENTS

This research was partially supported by Grant-in-Aid for Scientific Research No. 22700165, No. 19100003, and No. 22118502.


REFERENCES

[1] J.-M. Valin et al., "Enhanced robot audition based on microphone array source separation with post-filter," in IROS 2004. IEEE, 2004, pp. 2123-2128.
[2] F. Asano et al., "Sound source localization and signal separation for office robot 'Jijo-2'," in Proc. of IEEE Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems (MFI-99), 1999, pp. 243-248.
[3] S. Yamamoto, T. Ogata, H. G. Okuno, et al., "Enhanced robot speech recognition based on microphone array source separation and missing feature theory," in IROS 2005. IEEE, 2005, pp. 1489-1494.
[4] H. Saruwatari et al., "Two-stage blind source separation based on ICA and binary masking for a real-time robot audition system," in IROS 2005. IEEE, 2005, pp. 209-214.
[5] K. Nakadai et al., "A robot referee for rock-paper-scissors sound games," in ICRA 2008, 2008, pp. 3469-3474.
[6] K. Nakadai et al., "Real-time tracking of multiple sound sources by integration of in-room and robot-embedded microphone arrays," in IROS 2006. IEEE, 2006, pp. 852-859.
[7] P. Aarabi and S. Zaky, "Robust sound localization using multi-source audiovisual information fusion," Information Fusion, vol. 2, no. 3, pp. 209-223, 2001.
[8] H. Silverman et al., "The huge microphone array," LEMS, Brown University, Technical Report, 1996.
[9] E. Weinstein, K. Steele, et al., "LOUD: A 1020-node modular microphone array and beamformer for intelligent computing spaces," MIT/LCS Technical Memo MIT-LCS-TM-642, 2004.
[10] S. Thrun, "Affine structure from sound," Advances in Neural Information Processing Systems, vol. 18, pp. 1353-1360, 2006.
[11] N. Ono et al., "Blind alignment of asynchronously recorded signals for distributed microphone array," in 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009, pp. 161-164.
[12] K. Nakamura et al., "Intelligent sound source localization for dynamic environments," in IROS 2009, 2009, pp. 664-669.
[13] K. Nakadai, T. Nakamura, and H. Tsujino, "Sound source tracking with directivity pattern estimation using a 64 ch microphone array," in IROS 2005, 2005, pp. 196-202.
[14] K. Nakadai et al., "Design and implementation of robot audition system HARK," Advanced Robotics, vol. 24, pp. 739-761, 2009.

