DISTRIBUTED MOBILE MICROPHONE ARRAYS FOR ROBOT NAVIGATION AND ACOUSTIC SOURCE LOCALIZATION

Stanisław A. Raczyński, Łukasz Grzymkowski, Kacper Główczewski

Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Department of Automatic Control, ul. G. Narutowicza 11/12, 80-233 Gdańsk, Poland

ABSTRACT

In distributed microphone arrays, each microphone is connected to a separate recording device, which requires compensation for their different clock shifts. In this paper, we propose a new type of distributed array in which the microphones are moving. This mobility introduces innovation that can be used to estimate the positions of both the sound sources and the microphones, as well as to compensate for the clock shifts. Estimation is done using extended Kalman filters with heuristics that improve the convergence. We discuss the proposed method in the context of mobile robotics: each microphone is attached to a controllable mobile robot. In this scenario, the array can be used to localize sound-emitting objects in the robots' environment, while providing information about the robots' positions. Results show that the error of estimating the robots' positions is as low as 5 cm, with a sound source localization error of 6 mm and a clock synchronization error of 178 µs, for the case of 9 microphones and 5 sound sources.

Index Terms— microphone arrays, distributed microphone arrays, mobile robots, Kalman filters, robot navigation, source localization

1. INTRODUCTION

Audio source localization is one of the fundamental problems in the field of signal processing. It is performed using microphone arrays, i.e., microphones distributed in space. The sound reaches them at different times and with different intensities, which is exploited to calculate the direction of arrival (DOA) of the sound. The primary source localization method is, however, based solely on the inter-channel time differences (ITD; also known as the time difference of arrival, TDOA) [1]. Unfortunately, using microphone arrays suffers from two problems: it requires costly specialized multi-channel (8, 16, 32 and more channels) A/D converters to ensure sampling synchronization between channels [2], and it requires that the positions of the microphones are known with high precision in order for the system to be able to calculate the TDOA.

The first problem can be dealt with by connecting the microphones to much cheaper single-channel A/D converters; the result, however, is desynchronization of the recorded signals. Lienhart et al. proposed to synchronize the recording devices over a network [3]. Another solution has been proposed by Ono et al., who developed a method to jointly estimate the microphone locations, the location of a single source and the time origins of the recording devices [2, 4]. This method solves both problems at the same time, although it is designed for off-line processing, which limits its applications. Miura et al. proposed an on-line algorithm based on the extended Kalman filter [5], which they applied to microphone array calibration, i.e., to estimating the positions of the microphones and of a single sound source, as well as the clock shifts. In this paper, we aim to extend this work by introducing microphone mobility and by allowing an arbitrary number of sound sources.

Source localization is an important ability for robots as well. It helps the robot direct its attention or follow the object of interest: an interacting humanoid agent can turn its head towards the speaker [6–9], service robots can follow the voice of a human master [1, 10–12], security robots can investigate an unexpected noise source [13], while scout and rescue robots can find their way towards their target [14]. Furthermore, localization of the source can increase the accuracy of source separation [15], which is very important, for example, for robots that communicate with humans using speech. Because the proposed method estimates the positions of the microphones, it can also be used for robot navigation, i.e., for keeping track of the robot's position in its environment. Acoustic navigation of robots has been done with non-distributed microphone arrays: either mounted on the walls around the robot to track its position [16], or mounted on the robot itself to track the positions of multiple sound sources and determine its own position using triangulation [17]. Mumolo et al. tried to integrate speaker localization using a small microphone array with odometric and collision sensor-based navigation [1]. To the best of our knowledge, however, no work has been done on
using distributed microphone arrays for this purpose.

2. DISTRIBUTED MOBILE MICROPHONE ARRAY

We define our distributed mobile microphone array as follows.

1. There are N sound sources and M microphones.
2. Positions of the sound sources are fixed, but unknown.
3. Microphones are distributed and attached to separate recording devices (robots); their clocks and system times are not synchronized, the clock shifts are unknown, and the synchronization error is random.
4. Robots are mobile, but have only two degrees of freedom: front-back movement and turning around (a typical, car-like mobile robot design).
5. We assume that the speeds of the robots are known, e.g., from dead reckoning, as is common in mobile robotics.
6. The sources emit acoustic events that can be detected and identified by all microphones. The robots need to be able to identify which sound source emitted which acoustic event; this can be achieved by emitting signals separated either in time (e.g., clicks or chirps emitted in a known, repeated sequence) or in frequency (e.g., beeps at different frequencies).

The following variables need to be estimated in the system: the positions of the sources (source localization), the positions of the robots (robot navigation), the orientations of the robots, and the system clock shifts of the individual robots. For each sound source, its state can be represented by the vector

\xi^{(s)}_i(t) = \left[ x^{(s)}_i \;\; y^{(s)}_i \right]^T, \quad (1)

where x^{(s)}_i and y^{(s)}_i are the coordinates of the i-th sound source, and for each robot by

\xi^{(r)}_j(t) = \left[ x^{(r)}_j(t) \;\; y^{(r)}_j(t) \;\; \varphi_j(t) \;\; \tau_j(t) \right]^T, \quad (2)

where x^{(r)}_j and y^{(r)}_j are the j-th robot's position, \varphi_j its orientation and \tau_j its clock difference with respect to the real time.

2.1. State transition

Based on the above assumptions, the states associated with the sound sources and the robots change in a discrete-time system as

\xi^{(s)}_i(t+1) = \xi^{(s)}_i(t), \quad (3)

\xi^{(r)}_j(t+1) = \xi^{(r)}_j(t) + \left[ v_j(t)\sin(\varphi_j) \;\; v_j(t)\cos(\varphi_j) \;\; \dot{\varphi}_j(t) \;\; 0 \right]^T + \mathbf{v}(t), \quad (4)

where \mathbf{v}(t) is the process noise and represents the imprecision of robot movement control. Since the noise comes from many sources (no wheel position/speed feedback, odometry measurement errors, wheel slippage, etc.), we assume it is zero-mean Gaussian noise:

\mathbf{v}(t) \sim \mathcal{N}(0, \Sigma_v), \quad \Sigma_v = \mathrm{diag}(\sigma_x, \sigma_y, \sigma_\varphi, 0). \quad (5)
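To make the state-transition model concrete, the following minimal sketch (in Python/NumPy, our own illustrative code rather than the authors' implementation) propagates a single robot state one step according to Eqs. (4)-(5); the function name and the default noise values are assumptions made here for illustration only.

```python
import numpy as np

def propagate_robot_state(xi_r, v_j, phi_dot, sigmas=(0.01, 0.01, 0.001)):
    """One discrete-time step of Eq. (4) for a single robot.

    xi_r    -- robot state [x, y, phi, tau]
    v_j     -- known robot speed (assumption 5)
    phi_dot -- change of orientation applied at this step
    sigmas  -- illustrative (sigma_x, sigma_y, sigma_phi); tau receives no noise
    """
    x, y, phi, tau = xi_r
    # Deterministic part of Eq. (4): translate along the current heading,
    # rotate by phi_dot, keep the clock shift tau constant.
    motion = np.array([v_j * np.sin(phi), v_j * np.cos(phi), phi_dot, 0.0])
    # Process noise v(t) ~ N(0, Sigma_v) with Sigma_v = diag(sigma_x, sigma_y, sigma_phi, 0), Eq. (5).
    noise = np.random.normal(0.0, list(sigmas) + [0.0])
    return np.asarray(xi_r, dtype=float) + motion + noise

# The sound-source states need no propagation, Eq. (3): xi_s(t + 1) = xi_s(t).
```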
2.2. Observed variables

The time at which an audio event emitted by the i-th source is recorded by the j-th robot is equal to

t_{i,j} = t_i + \frac{D_{i,j}(t)}{c} + \tau_j, \quad (6)

where D_{i,j} is the distance between them and c is the speed of sound. This distance at time t can be calculated as

D^2_{i,j}(t) = \left( x^{(r)}_j(t) - x^{(s)}_i \right)^2 + \left( y^{(r)}_j(t) - y^{(s)}_i \right)^2. \quad (7)

Since we do not know the actual emission time t_i of each audio event, we follow [5] and use the differences of the arrival times between each microphone and a reference microphone, \Delta t_{i,j} = t_{i,j} - t_{i,1}. There are therefore N \times (M-1) observed variables in the system:

\zeta(t) = \left[ \;\cdots\;\; \Delta t_{i,j} + w(t) \;\;\cdots\; \right]^T, \quad (8)

where i = 1, \dots, N, j = 2, \dots, M, and

\Delta t_{i,j} = \frac{D_{i,j}(t) - D_{i,1}(t)}{c} + \tau_j - \tau_1, \quad (9)

where w(t) is the measurement noise, also assumed to be Gaussian:

w(t) \sim \mathcal{N}(0, \sigma_w I), \quad (10)

where I is the identity matrix.
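A corresponding sketch of the observation model of Eqs. (7)-(10), again as illustrative NumPy code under our own naming conventions; it returns the N × (M − 1) vector of arrival-time differences relative to the reference microphone (robot 1).

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s, as assumed in Sec. 4

def tdoa_observation(src_xy, rob_xy, rob_tau, sigma_w=0.0):
    """Eqs. (7)-(10): TDOA observations relative to robot 1.

    src_xy  -- (N, 2) array of source positions
    rob_xy  -- (M, 2) array of robot (microphone) positions
    rob_tau -- (M,) array of clock shifts tau_j
    Returns a flat vector of length N * (M - 1).
    """
    src_xy = np.asarray(src_xy, dtype=float)
    rob_xy = np.asarray(rob_xy, dtype=float)
    rob_tau = np.asarray(rob_tau, dtype=float)
    # Pairwise distances D_{i,j}, Eq. (7).
    D = np.linalg.norm(src_xy[:, None, :] - rob_xy[None, :, :], axis=-1)  # shape (N, M)
    # Delta t_{i,j} = (D_{i,j} - D_{i,1}) / c + tau_j - tau_1, Eq. (9), for j = 2, ..., M.
    dt = (D[:, 1:] - D[:, :1]) / C_SOUND + (rob_tau[1:] - rob_tau[0])
    # Additive Gaussian measurement noise w(t), Eq. (10).
    dt = dt + np.random.normal(0.0, sigma_w, size=dt.shape)
    return dt.ravel()
```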
3. INFERENCE

Due to the non-linear nature of the state transition and observation, the inference in our system can be performed using the extended Kalman filter (EKF) [18].

3.1. Prediction

The robot state for the next instant is predicted using Eq. (4). The process Jacobian is a block matrix of the form

F(t) = I + \begin{bmatrix} 0 & 0 \\ 0 & F_{22}(t) \end{bmatrix}, \quad (11)

where

F_{22}(t) = \begin{bmatrix} F_1(t) & 0 & \cdots & 0 \\ 0 & F_2(t) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & F_M(t) \end{bmatrix} \quad (12)

and

F_j(t) = \begin{bmatrix} 0 & 0 & v_j(t)\cos(\hat{\varphi}_j) & 0 \\ 0 & 0 & -v_j(t)\sin(\hat{\varphi}_j) & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \quad (13)

The a priori error covariance matrix P(t+1|t) is calculated as

P(t+1|t) = F(t) P(t|t) F^T(t) + R, \quad (14)

where R = \mathrm{diag}(N \times (0, 0),\, M \times (\sigma_x, \sigma_y, \sigma_\varphi, 0)) is the process noise covariance matrix: each of the N fixed sources contributes a zero 2 × 2 block and each of the M robots a \mathrm{diag}(\sigma_x, \sigma_y, \sigma_\varphi, 0) block.

Fig. 1. Example of estimated positions of robots and sound sources for 10 microphones (robots) and 10 sources (including 3 reference sources). Microphones are marked with letters, sound sources with numbers; real positions are shown on the right, while the estimated positions are shown on the left. Short lines indicate the robots' orientations.
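For illustration, a sketch of the prediction step of Eqs. (11)-(14). It assumes, as a convention of this sketch only, that the global state vector stacks the 2N source coordinates first and then the M robot states of four elements each; any consistent ordering works as long as the observation Jacobian uses the same one. The sigma values are treated here as variances on the diagonal of R.

```python
import numpy as np

def ekf_predict_covariance(P, robot_states, speeds, sigmas=(1e-4, 1e-4, 1e-6)):
    """A priori covariance, Eqs. (11)-(14): P(t+1|t) = F P F^T + R.

    P            -- current (2N + 4M) x (2N + 4M) error covariance
    robot_states -- (M, 4) array of estimated [x, y, phi, tau]
    speeds       -- (M,) known robot speeds v_j(t)
    sigmas       -- illustrative process noise variances (sigma_x, sigma_y, sigma_phi)
    """
    M = robot_states.shape[0]
    N = (P.shape[0] - 4 * M) // 2
    sx, sy, sphi = sigmas

    F = np.eye(P.shape[0])                     # F(t) = I plus the blocks F_j, Eqs. (11)-(12)
    R = np.zeros_like(P)                       # process noise covariance
    for j in range(M):
        base = 2 * N + 4 * j                   # first index of robot j's block
        phi = robot_states[j, 2]
        # F_j(t), Eq. (13): only the phi column is non-zero.
        F[base + 0, base + 2] += speeds[j] * np.cos(phi)
        F[base + 1, base + 2] += -speeds[j] * np.sin(phi)
        # Robot block of R: diag(sigma_x, sigma_y, sigma_phi, 0); source blocks stay zero.
        R[base:base + 4, base:base + 4] = np.diag([sx, sy, sphi, 0.0])

    return F @ P @ F.T + R                     # Eq. (14)
```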
3.2. Update

The elements of the observation Jacobian H are calculated as

\frac{\partial \Delta t_{i,j}(t)}{\partial x^{(s)}_{k=i}} = \frac{1}{c \hat{D}_{k,1}(t)} \left( \hat{x}^{(r)}_1(t) - \hat{x}^{(s)}_k(t) \right) - \frac{1}{c \hat{D}_{k,j}(t)} \left( \hat{x}^{(r)}_j(t) - \hat{x}^{(s)}_k(t) \right), \quad (15)

\frac{\partial \Delta t_{i,j}(t)}{\partial y^{(s)}_{k=i}} = \frac{1}{c \hat{D}_{k,1}(t)} \left( \hat{y}^{(r)}_1(t) - \hat{y}^{(s)}_k(t) \right) - \frac{1}{c \hat{D}_{k,j}(t)} \left( \hat{y}^{(r)}_j(t) - \hat{y}^{(s)}_k(t) \right), \quad (16)

\frac{\partial \Delta t_{i,j}(t)}{\partial x^{(s)}_{k \neq i}} = 0, \quad (17)

\frac{\partial \Delta t_{i,j}(t)}{\partial y^{(s)}_{k \neq i}} = 0, \quad (18)

\frac{\partial \Delta t_{i,j}(t)}{\partial x^{(r)}_{k=1}} = -\frac{1}{c \hat{D}_{i,k}(t)} \left( \hat{x}^{(r)}_k(t) - \hat{x}^{(s)}_i(t) \right), \quad (19)

\frac{\partial \Delta t_{i,j}(t)}{\partial x^{(r)}_{k>1,\,k=j}} = \frac{1}{c \hat{D}_{i,k}(t)} \left( \hat{x}^{(r)}_k(t) - \hat{x}^{(s)}_i(t) \right), \quad (20)

\frac{\partial \Delta t_{i,j}(t)}{\partial y^{(r)}_{k=1}} = -\frac{1}{c \hat{D}_{i,k}(t)} \left( \hat{y}^{(r)}_k(t) - \hat{y}^{(s)}_i(t) \right), \quad (21)

\frac{\partial \Delta t_{i,j}(t)}{\partial y^{(r)}_{k>1,\,k=j}} = \frac{1}{c \hat{D}_{i,k}(t)} \left( \hat{y}^{(r)}_k(t) - \hat{y}^{(s)}_i(t) \right), \quad (22)

\frac{\partial \Delta t_{i,j}(t)}{\partial \varphi_k} = 0, \quad (23)

\frac{\partial \Delta t_{i,j}(t)}{\partial \tau_{k=1}} = -1, \quad (24)

\frac{\partial \Delta t_{i,j}(t)}{\partial \tau_{k>1,\,k=j}} = 1, \quad (25)

where

\hat{D}^2_{i,k}(t) = \left( \hat{x}^{(r)}_k(t) - \hat{x}^{(s)}_i \right)^2 + \left( \hat{y}^{(r)}_k(t) - \hat{y}^{(s)}_i \right)^2.

The Kalman gain is then calculated as

K = P(t+1|t) H^T \left( H P(t+1|t) H^T + Q \right)^{-1}, \quad (26)

where Q = \mathrm{diag}(N(M-1) \times (\sigma_w)) is the observation noise covariance and

\hat{\zeta}(t) = \left[ \;\cdots\;\; \widehat{\Delta t}_{i,j} \;\;\cdots\; \right]^T, \quad (27)

\widehat{\Delta t}_{i,j} = \frac{\hat{D}_{i,j}(t) - \hat{D}_{i,1}(t)}{c} + \hat{\tau}_j - \hat{\tau}_1 \quad (28)

are the predicted measurements. The a posteriori state estimate is calculated as

\hat{\xi}(t+1|t+1) = \hat{\xi}(t+1|t) + K \left( \zeta(t+1) - \hat{\zeta}(t+1) \right) \quad (29)

and the a posteriori error covariance matrix as

P(t+1|t+1) = (I - KH) P(t+1|t). \quad (30)
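To show how the update fits together, here is a sketch of Eqs. (26)-(30). Instead of transcribing the analytic Jacobian of Eqs. (15)-(25) element by element, it approximates H by central finite differences of a user-supplied measurement function (for instance, a flattened-state wrapper around the tdoa_observation sketch above); this is a simplification for illustration, not the paper's exact implementation.

```python
import numpy as np

def ekf_update(x_hat, P_pred, z, h_func, sigma_w, eps=1e-6):
    """EKF measurement update, Eqs. (26)-(30).

    x_hat   -- predicted state estimate xi_hat(t+1|t), flat vector
    P_pred  -- a priori covariance P(t+1|t)
    z       -- observed TDOA vector zeta(t+1)
    h_func  -- maps the flat state to predicted TDOAs, Eqs. (27)-(28)
    sigma_w -- measurement noise variance, so Q = sigma_w * I
    """
    z_hat = h_func(x_hat)                                    # predicted measurements
    # Numerical observation Jacobian H (central differences), standing in for Eqs. (15)-(25).
    H = np.zeros((z_hat.size, x_hat.size))
    for k in range(x_hat.size):
        dx = np.zeros_like(x_hat)
        dx[k] = eps
        H[:, k] = (h_func(x_hat + dx) - h_func(x_hat - dx)) / (2.0 * eps)
    Q = sigma_w * np.eye(z_hat.size)
    S = H @ P_pred @ H.T + Q
    K = P_pred @ H.T @ np.linalg.inv(S)                      # Kalman gain, Eq. (26)
    x_new = x_hat + K @ (z - z_hat)                          # Eq. (29)
    P_new = (np.eye(x_hat.size) - K @ H) @ P_pred            # Eq. (30)
    return x_new, P_new
```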
4. EXPERIMENTAL RESULTS
The proposed method has been tested through simulations. The robots were simulated to be moving with constant speed while changing their orientation by a random amount at random moments in time. Their movement was limited to a square 80 m by 80 m with the point (0, 0) in its center (i.e., the coordinates were limited to values between -40 and 40); robots that reached the border of the simulated square room were re-oriented towards the point (0, 0). The initial positions of both the robots and the sound sources were randomly selected according to a normal distribution with its mean at the point (0, 0). The experiments were run for three process standard deviation values (0.1 cm, 1 cm and 10 cm) and three measurement standard deviation values (2 µs, 20 µs and 0.2 ms). These values correspond to, respectively, "optimistic", "realistic" and "pessimistic" assumptions about what the errors would be in real life. The speed of sound was assumed to be 343 m/s. The number of sound sources varied between 5 and 10 and the number of microphones between 4 and 10.

Due to the high non-linearity of the proposed system, the EKF estimates had a tendency to diverge in very bad, i.e., heavily underdetermined cases. To avoid that, some heuristics were added to the algorithm. Firstly, the algorithm was re-initialized every time the absolute value of any of the estimated positions exceeded a threshold, which was set manually between 40 and 80, depending on the number of robots, with higher threshold values used for experiments with more robots. This helped to deal with cases in which badly selected initial estimates caused the system to diverge. Secondly, we measured the difference between the time a robot reached the room's border and the time its estimated position did the same. If this difference was higher than expected based on the noise levels present in the system, the confidence in the measurements was reduced by appropriately changing the error covariance matrix.

Despite the added heuristics, we have found the proposed algorithm to be often non-convergent in the case when neither the positions of the sound sources nor the positions of the microphones were known. The position estimates would exhibit a constantly and slowly changing common bias due to the underdeterminedness of the system. This bias can be removed by introducing reference sound sources of known positions, and we have found that it is completely removed and the system becomes convergent if 3 or more reference sources are used (see Figs. 1 and 2). We believe that having 3 reference sound sources in the robots' environment is a realistic scenario. Another convergence problem can arise if the sound sources are clustered together in a small area, which heavily limits the amount of information observed by the system and should be avoided.

Average estimation errors were used to evaluate the model's performance. Mean squared error was used to calculate the average position estimation error, and the arithmetic mean was used for the robot orientation and clock estimation errors. All experiments were repeated N = 100 times. Since the distributions of the measured estimation errors e_i were found to be log-normal across the experiments, for every parameter set we have calculated the log-normal distribution mode m = e^{\hat{\mu} - \hat{\sigma}^2} using the maximum-likelihood estimates of the log-normal distribution parameters: \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} \ln e_i, \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (\ln e_i - \hat{\mu})^2. The resulting modes are presented in Fig. 3 and selected cases for the "realistic" noises are shown in Table 1.

The errors were in general lower for larger numbers of sound sources (more information available) and for smaller numbers of robots (fewer parameters to estimate, i.e., a more determined system), as well as for lower measurement noise. Lower process noise resulted in lower estimation error, except for the cases with low measurement noise, where the "realistic" process noise offered better performance than the "pessimistic" one, probably by increasing the amount of innovation in a system that was otherwise over-trusting of the measurements.

Table 1. Estimation errors for the "realistic" measurement and process noises for selected numbers of microphones (RC) and sound sources (SC). Abbreviations: SP – sound source position errors, RP – robot position errors, RO – robot orientation errors, C – clock errors.

RC  SC  SP [cm]  RP [cm]  RO [°]  C [µs]
 4   5      2.7     14.0     6.6     265
 4  10     29.0     18.6     6.4     341
 7   7      0.9      4.8     8.0      99
 9   5      0.6      5.0     9.2     178
 9  10      1.4      4.9     8.9     229
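The mode of the fitted log-normal distribution used above can be computed directly from the per-run errors; a minimal sketch (the function name is ours):

```python
import numpy as np

def lognormal_mode(errors):
    """Mode m = exp(mu_hat - sigma_hat^2) of a log-normal fit to positive errors,
    where mu_hat and sigma_hat^2 are the ML estimates of the log-domain mean and variance."""
    log_e = np.log(np.asarray(errors, dtype=float))
    mu_hat = log_e.mean()
    sigma2_hat = np.mean((log_e - mu_hat) ** 2)
    return np.exp(mu_hat - sigma2_hat)
```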
5. CONCLUSION AND FUTURE WORK

This article presents a new type of microphone array, the distributed mobile microphone array, and applies it to the tasks of robot navigation and localization of multiple sound sources. Results show that both tasks can be accomplished with satisfactory precision, although convergence is achieved only if the sound sources are spread across the environment and three reference sources are used.

In future work, we will implement the proposed solution and test it using real-life devices. There is huge potential for improvement in the algorithm: we will try to estimate the velocity of the robots (now fixed and known), so that the system could work without any odometric sensors; we will try to improve the stability of the system and the accuracy of estimation by using unscented Kalman filters, as suggested in [5]; we will also try scenarios in which not only the microphones but also the sound sources are moving; and we will try to improve the accuracy by fusing the estimates with positions obtained from other sources (GPS, Wi-Fi positioning, etc.). Finally, we will develop algorithms that work with an unknown number of sound sources.
Fig. 2. Average estimation errors for the example shown in Fig. 1, plotted against the iteration number – from left to right: positions [m], orientations [degrees] and clocks [ms], respectively – separately for the sound sources and the robots.
REFERENCES

[1] E. Mumolo, M. Nolich, and G. Vercelli, "Algorithms for acoustic localization based on microphone array in service robotics," Robotics and Autonomous Systems, vol. 42, no. 2, pp. 69–88, 2003.
[2] N. Ono, H. Kohno, N. Ito, and S. Sagayama, "Blind alignment of asynchronously recorded signals for distributed microphone array," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2009, pp. 161–164.
[3] R. Lienhart, I. Kozintsev, S. Wehr, and M. Yeung, "On the importance of exact synchronization for distributed audio signal processing," in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2003, vol. 4, pp. 840–843.
[4] K. Hasegawa, N. Ono, S. Miyabe, and S. Sagayama, "Blind estimation of locations and time offsets for distributed recording devices," Springer, 2010, pp. 57–64.
[5] H. Miura, T. Yoshida, K. Nakamura, and K. Nakadai, "SLAM-based online calibration of asynchronous microphone array for robot audition," in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2011, pp. 524–529.
[6] J. Hornstein, M. Lopes, J. Santos-Victor, and F. Lacerda, "Sound localization for humanoid robots – building audio-motor maps based on the HRTF," in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2006, pp. 1170–1176.
[7] G. Athanasopoulos, T. Dekens, H. Brouckxon, and W. Verhelst, "The effect of speech denoising algorithms on sound source localization for humanoid robots," in Proc. Intl. Conf. on Information Science, Signal Processing and their Applications (ISSPA), 2012, pp. 327–332.
[8] S. S. Ge, J.-J. Cabibihan, Z. Zhang, Y. Li, C. Meng, H. He, M. R. Safizadeh, Y. B. Li, and J. Yang, "Design and development of Nancy, a social robot," in Proc. Intl. Conf. on Ubiquitous Robots and Ambient Intelligence (URAI), 2011, pp. 568–573.
[9] Y. Matsusaka, T. Tojo, S. Kuota, K. Furukawa, D. Tamiya, K. Hayata, Y. Nakano, and T. Kobayashi, "Multi-person conversation via multi-modal interface – a robot who communicates with multi-user," in Proc. European Conf. on Speech Communication and Technology (EUROSPEECH), 1999, pp. 1723–1726.
[10] M. Sato, A. Sugiyama, and S. Ohnaka, "Auditory system in a personal robot, PaPeRo," in Proc. Intl. Conf. on Consumer Electronics (ICCE), 2006, pp. 19–20.
[11] J. Huang, T. Supaongprapa, I. Terakura, F. Wang, N. Ohnishi, and N. Sugie, "A model-based sound localization system and its application to robot navigation," Robotics and Autonomous Systems, vol. 27, no. 4, pp. 199–209, 1999.
[12] F. Asano, H. Asoh, and T. Matsui, "Sound source localization and signal separation for office robot 'Jijo-2'," in Proc. IEEE/SICE/RSJ Intl. Conf. on Multisensor Fusion and Integration for Intelligent Systems, 1999, pp. 243–248.
[13] S. H. Young and M. V. Scanlon, "Detection and localization with an acoustic array on a small robotic platform in urban environments," Tech. Rep., DTIC Document, 2003.
[14] A. R. Kulaib, M. Al-Mualla, and D. Vernon, "2D binaural sound localization for urban search and rescue robotics," in Proc. Intl. Conf. on Climbing and Walking Robots, 2009, pp. 9–11.
[15] K. Nakadai, H. G. Okuno, H. Kitano, et al., "Real-time sound source localization and separation for robot audition," in Proc. IEEE Intl. Conf. on Spoken Language Processing, 2002, pp. 193–196.
[16] Q. H. Wang, T. Ivanov, and P. Aarabi, "Acoustic robot navigation using distributed microphone arrays," Information Fusion, vol. 5, no. 2, pp. 131–140, 2004.
[17] Y. Sasaki, S. Kagami, and H. Mizoguchi, "Multiple sound source mapping for a mobile robot by self-motion triangulation," in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2006, pp. 380–385.
[18] G. Welch and G. Bishop, "An introduction to the Kalman filter," Tech. Rep., Dept. Comput. Sci., Univ. North Carolina, 2000.
[Figure 3: four rows of panels showing the speaker position estimation error [cm], the robot position estimation error [cm], the robot orientation estimation error [deg.] and the clock estimation error [ms], each plotted against the robot count (left column, grouped by the sound source count SC) and against the measurement noise (right column, grouped by the process noise PN).]
Fig. 3. Estimation errors when 3 reference sound sources are used, depending on the robot count, the sound source count (SC), the measurement noise compared to the "realistic" value of 20 µs and the process noise (PN) compared to the "realistic" value of 1 cm. The bars correspond to quartiles calculated over the other parameters and the points to outliers.