Speaker Selection and Tracking in a Cluttered Environment with Audio and Visual Information

Yoonseob Lim and Jongsuk Choi

Abstract — Presented in this paper is a data association method using audio and visual data which localizes targets in a cluttered environment and detects who is speaking to a robot. A particle filter is applied to efficiently select the optimal association between the target and the measurements. State variables are composed of target positions and speaking states. To update the speaking state, we first evaluate the incoming sound signal based on cross-correlation and then calculate a likelihood from the audio information. The visual measurement is used to find an optimal association between the target and the observed objects. The number of targets that the robot should interact with is updated from the existence probabilities and associations. Experimental data were collected beforehand and simulated on a computer to verify the performance of the proposed method applied to the speaker selection problem in a cluttered environment. The algorithm was also implemented in a robotic system to demonstrate reliable interactions between the robot and speaking targets.

Index Terms — Speaker Localization, Human-Robot Interaction, Data Association, and Particle Filter.

This research is supported by the Development of Active Audition System Technology for Intelligent Robots through the Center for Intelligent Robotics in South Korea. Yoonseob Lim is a research scientist at the Center of Cognitive Robotics Research, Korea Institute of Science and Technology, Korea (e-mail: [email protected]). Jongsuk Choi is a senior research scientist at the Center of Cognitive Robotics Research, Korea Institute of Science and Technology, Korea (e-mail: [email protected]). Contributed Paper. Manuscript received July 7, 2009.

I. INTRODUCTION

Robots have been developed to improve human life. Some robots can act as museum guides, recognizing human faces and understanding human speech [1]. These robots have integrated systems to recognize the environment through sensory and motor information, including cameras, microphones, and electric motors. This equipment is even applied in toy robots so that they can develop friendly relationships with the children who own them. Speaker tracking is an essential function that robots should have to facilitate human interaction.

Numerous approaches for maintaining reliable interactions between a robot and a speaker exist in speaker tracking, which employ a variety of features, tracking schemes, and sensor setups. Kalman filters, under Gaussian models and linear dynamics, are commonly used to track a single object. Probabilistic models, such as Bayesian networks [2] and the Sequential Monte Carlo method [3], have been proposed for coping with fluctuating target states in a noisy and cluttered environment. Among them, the particle filter method is well suited to dealing with multiple types of data [4]. A particle filter is an approximation technique for use in a non-linear and non-Gaussian situation. Much research has been focused on visual problems. Some research has used object contours in dense visual clutter [5] or a constant object color distribution over a certain time period [6]. Particle filters have also been applied to audio source localization, using beamformer-based sound source localization [7]. This method has an advantage in that it does not require calculation of time-delay estimates.

A single-modality tracking method has its own advantages, but we believe that an effective tracker needs to handle a fusion of multiple sensor mechanisms, such as a combination of audio and visual signals. An audio signal is highly adequate for voice activity detection but has lower spatial resolution than visual information. If a robot is situated in a very noisy and reverberant environment, speaker tracking is not possible due to the presence of unwanted or highly reflected signals. Even though visual information has much higher resolution than audio information, it often fails to discriminate, for example, the existence of objects or newly added objects, in multiple-object situations. Hence, integration of audio and visual information is desirable. H.D. Kim et al. [8] showed that effective speaker tracking could be accomplished by integrating audio and visual information. In this system, audio and visual information are processed in series. A robot first recognizes the location of a sound source and confirms the source by recognizing a human face in that position. However, this mechanism deals with audio and visual information separately. Therefore, it cannot cope with multiple objects and concentrates only on the object that speaks first. A process for predicting the number of objects and associating current data with measured information is needed for good tracking performance when more than two objects are present.

Several algorithms for data association have been developed, especially for radar systems. Reid [9] proposed an algorithm that searches every possible association and determines the most probable association between the target and the measurement. However, the number of probable associations increases exponentially as the number of targets increases. Recently, Vermaak et al. [10] proposed an efficient approach using a particle filter that can detect and track multiple objects when the number of targets is known.


Schulz et al. [11] proposed a statistical data association filter and implemented it on a robot using laser-range data. The number of objects is determined based on a sensor model that assigns probabilities to observed features within the perceptual range of the sensor. Zhao et al. [12] proposed a probabilistic algorithm with Markov chain dynamics that can track humans in a crowded environment when the number of humans is not constant.

In this paper, we present a data association algorithm based on a particle filter for selecting a speaker and tracking a target in a cluttered environment. We applied a sensitivity model for the microphone arrangement for effective evaluation of audio information. Through the proposed algorithm, the speaking state and position of each target can be obtained; this information is used to enable a robot to look at a speaker and track targets. Before this method was implemented in a robotic system, audio and visual information were collected using our mobile robot platform, and the tracking algorithm was verified on a personal computer. The proposed algorithm was also implemented in a robotic system to demonstrate reliable speaker selection and tracking performance. With the suggested method, we found that the robot could easily interact with people, even when the number of people changed.

II. PARTICLE FILTER

A system is represented by state-space and observation equations of the form

X_t = f(X_{t-1}, u_t), \quad y_t = g(X_t, v_t)    (1)

where y_t is a vector of observations, X_t is a state vector, g(\cdot) is a measurement function, f(\cdot) is a system transition function, u_t and v_t are noise vectors, and the subscript t denotes the time index. Based on the observation and the assumptions used in state prediction, the state of the object is estimated recursively. The Kalman filter has been widely used with the assumptions that the system is linear and the noise sources are independent, additive and Gaussian. The particle filter method is an alternative to the extended Kalman filter. It approximates the posterior from a finite number of parameters. If we are given an object representation and a state-space model with state X_t and observations y_t from audio and visual information, the distribution can be recursively calculated from

p(X_t \mid y_{1:t}) = c \cdot p(y_t \mid X_t) \int_{X_{t-1}} p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid y_{1:t-1}) \, dX_{t-1}    (2)

where y_{1:t} = (y_1, \ldots, y_t). A particle filter approximates (2) for nonlinear and non-Gaussian problems. The design of a particle filter involves the definition of the object representation, state space, dynamic process, and a probability model for sound and vision. Each of these is presented in the following sections.

III. ALGORITHM DESCRIPTION

The goal is to find an optimal association that represents the most probable position and speaking status of each target. The weight of a particle should represent the association and speaking probabilities according to the current measurement and the prediction from the previous state. Processing of the measurement data (audio and video) will be discussed in the following sections. If resampling is required when calculating the weights of particles, a low-variance resampling method is used.

A. State Prediction

The calculation procedure of the particle filter can be divided into two steps, the prediction step and the update step. In this section, we describe models for predicting the position and speaking status of each target according to the previous state and data association. We formulate the speaker selection problem as understanding where a speaker is standing and whether the target is speaking as the number of targets (K) changes during the interaction with the robot. Each target is parameterized by a state X_{k,t} = (x_{k,t}, s_{k,t}), (k = 1, \ldots, K), where both the position (x_{k,t}) and the speaking state (s_{k,t}) are represented. The individual positions of the targets are assumed to evolve independently according to the Langevin process model, (3):

\dot{x}_{k,t+1} = a_x \dot{x}_{k,t} + b_x F, \quad x_{k,t+1} = x_{k,t} + \dot{x}_{k,t+1} \Delta T
a_x = \exp(-\beta_x \Delta T), \quad b_x = \bar{v}_x \sqrt{1 - a_x^2}    (3)

where \Delta T is the time step during computation, \bar{v}_x is the steady-state root mean velocity, \beta_x is the rate constant and F \sim N(0,1). The model for predicting the speaking status depends on the previous speaking status of each target, (4):

p(s_{k,t} \mid s_{k,t-1}) = \begin{pmatrix} P_{s \to s} & P_{s \to q} \\ P_{q \to s} & P_{q \to q} \end{pmatrix}    (4)

where P_{s \to q} indicates the transition probability from the speaking state to the quiet state. If a target was previously speaking, the probability of that target maintaining the speaking status is much higher than the probability of that target transitioning to quiet status, and vice versa. We defined the probability of maintaining the speaking or quiet state to be 0.8. This value was determined through simulation and verified in robotic experiments.
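As a concrete illustration, the sketch below propagates particles through the prediction models above (Eqs. (3) and (4)). The time step, rate constant and velocity scale are placeholder values for this example, not the ones used in the paper; only the 0.8 self-transition probability comes from the text.

```python
import numpy as np

def predict_states(pos, vel, speaking, dt=0.1, beta=10.0, v_bar=1.0, p_stay=0.8, rng=None):
    """One prediction step: Langevin dynamics for position (Eq. 3) and a
    two-state Markov chain for the speaking status (Eq. 4)."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.exp(-beta * dt)                               # a_x = exp(-beta_x * dT)
    b = v_bar * np.sqrt(1.0 - a ** 2)                    # b_x = v_x * sqrt(1 - a_x^2)
    vel = a * vel + b * rng.standard_normal(vel.shape)   # velocity update with F ~ N(0, 1)
    pos = pos + vel * dt                                 # position update
    flip = rng.random(speaking.shape) > p_stay           # keep the state with probability 0.8
    speaking = np.where(flip, 1 - speaking, speaking)
    return pos, vel, speaking

# Example: 100 particles, 3 targets, one-dimensional positions.
rng = np.random.default_rng(0)
pos = rng.normal(0.0, 1.0, size=(100, 3))
vel = np.zeros((100, 3))
spk = rng.integers(0, 2, size=(100, 3))
pos, vel, spk = predict_states(pos, vel, spk, rng=rng)
```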


TABLE I
Data association probability table for multiple target tracking

              y_{1,t}      y_{2,t}      y_{3,t}      ...   y_{j,t}      Erase        Undetect
X_{0,t}       P_{01}       P_{02}       P_{03}       ...   P_{0j}       P_{0,e}      P_{0,u}
X_{1,t}       P_{11}       P_{12}       P_{13}       ...   P_{1j}       P_{1,e}      P_{1,u}
...           ...          ...          ...          ...   ...          ...          ...
X_{k,t}       P_{k1}       P_{k2}       P_{k3}       ...   P_{kj}       P_{k,e}      P_{k,u}
New           P_{1,new}    P_{2,new}    P_{3,new}    ...   P_{j,new}
False         P_{1,fal}    P_{2,fal}    P_{3,fal}    ...   P_{j,fal}

B. Data Association

The proposed method in this article is designed to allow a robot to localize a speaking person among several people and to track the state of each person. To track multiple targets, the robot must be able to associate existing targets with detected objects. The data association probability table is shown in Table I. P_{kj} is defined to be the probability of association between the kth target and the jth object. When tracking and selecting targets, the targets can disappear or go undetected. P_{k,e} and P_{k,u} are the probabilities of the kth target disappearing or going undetected, respectively. In any observation, there could be a new object or a false alarm. P_{j,new} and P_{j,fal} are the probabilities of a new object or a false alarm, respectively, in the jth observation. To associate targets to observed objects, we defined an association vector (r) to be used to find an optimal association hypothesis. If we apply the association vector to equation (2), the likelihood can be described as (5).

p(y_t \mid X_t) = P_C \cdot \prod_{k=1}^{K} p(y_{r_{k,t},t} \mid X_{k,t})    (5)

r_{k,t} indicates the index of the observed object with which the kth target is associated. P_C is the probability of a cluttered measurement and is calculated as in [9]. The system we developed can acquire both audio and visual information. Since we assumed that there is only one audio source around the robot and used time delay information to evaluate the audio information, the visual information is appropriate for solving the data association problem in tracking multiple targets. The association vector can be used to calculate a set of state likelihoods for each object from the visual information, and equation (5) can be expanded as (6).

p(y_{r_{k,t},t} \mid X_{k,t}) = \underbrace{p(y_t^v \mid r_{k,t}, x_{k,t}) \cdot p(r_{k,t} \mid r_{1:k-1,t})}_{\text{Video}} \cdot \underbrace{p(y_t^a \mid s_{k,t}, x_{k,t})}_{\text{Audio}}    (6)

The second term in the video component is the kth target association probability under a predicted association hypothesis. The association result indicates a highly probable association between a target and an observed object, and cannot give the probability of situations like appearance, false alarm, disappearance or an object going undetected. For reliable interactions, the robot needs to recognize the target's current position and speaking state as well as its existence around the robot. The robot deduces the existence of targets based on the association result. The detailed method is described in the section Number of Targets.
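A minimal sketch of how one association hypothesis per particle could be drawn under the Table I semantics and the exclusivity rule later formalized in Eq. (15). The detection probability of 0.8 is the value quoted in the text; the way an unclaimed object is chosen and the denominator used for the prior are illustrative assumptions of this example.

```python
import numpy as np

def sample_association(num_targets, num_objects, p_detect=0.8, rng=None):
    """Draw one association vector r (Table I / Eq. (15) semantics):
    r[k] = 0  -> target k matches no observed object,
    r[k] = j  -> target k matches object j, and each object may be
    claimed by at most one target."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.zeros(num_targets, dtype=int)
    prior = np.ones(num_targets)
    free = list(range(1, num_objects + 1))               # objects not yet claimed
    for k in range(num_targets):
        if free and rng.random() < p_detect:
            r[k] = free.pop(rng.integers(len(free)))     # claim an unused object
            prior[k] = p_detect / num_targets            # P_D / M_k
        else:
            r[k] = 0                                     # undetected / no match
            prior[k] = 1.0 - p_detect
    return r, prior

r, prior = sample_association(num_targets=2, num_objects=3, rng=np.random.default_rng(1))
print(r, prior)
```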

C. Audio Measurement

1) Voice Activity Detection (VAD)
Traditional VAD systems use the pitch, zero crossing rate and the energy of the sound [15]. To obtain more reliable VAD performance, we developed a probabilistic way of determining whether the current sound is a voice, based on Bayes' theorem. The sound state (s_t), which has a binary value (voice or non-voice), is defined to be the state variable. The proposed VAD algorithm has a simple prediction step and an update step. For prediction, the transition probability of the sound state is based on the previous state, defined as (7).

\begin{pmatrix} P_{v \to v} & P_{v \to nv} \\ P_{nv \to v} & P_{nv \to nv} \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}    (7)

P_{v \to nv} is the transition probability from the voice to the non-voice state. Each transition probability is assumed to be constant. With this transition probability matrix, we can predict the current sound state according to the previous sound state, p(s_t \mid s_{t-1}). The current state is confirmed after updating with measured information. We defined a measurement (Z_t^{vad}) to be composed of the pitch (p_t), zero crossing (zc_t) and the energy (E_t) of the sound [15]. When calculating the zero crossing, a center clipping method is used. The threshold for clipping is chosen to be 64% of the maximal input sound. The likelihood of each measurement, p(z_i \mid s_t), z_i \in Z_t^{vad}, is calculated according to the predicted sound state. Basically, the likelihood follows a Gaussian distribution if the current sound is expected to be a voice. If not, we apply two different constant values depending on whether or not the measurement value is above a predefined threshold. Since the information is three-fold, we use the average value of each likelihood and calculate the probability that the current sound is a voice using (8) and (9):

p(Z_t^{vad} \mid s_t) = \sum_{z_i \in Z_t^{vad}} p(z_i \mid s_t) / 3    (8)

p(s_t \mid Z_t^{vad}) \propto p(s_t \mid s_{t-1}) \cdot p(Z_t^{vad} \mid s_t)    (9)

After normalizing the calculated values, the voice probability can be obtained. Fig. 1 shows the result of the proposed VAD algorithm applied to a sound recorded in an office.

Fig. 1. Voice activity detection result.
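The sketch below illustrates the frame-level VAD update of Eqs. (7)-(9). The feature extraction (autocorrelation pitch, center-clipped zero-crossing rate, frame energy) and the likelihood means, standard deviations and the non-voice constant are rough placeholders, not the paper's tuned values.

```python
import numpy as np

TRANS = np.array([[0.8, 0.2],    # voice     -> (voice, non-voice)
                  [0.2, 0.8]])   # non-voice -> (voice, non-voice)

def frame_features(frame, fs=16000):
    """Pitch, zero-crossing rate and energy of one audio frame (rough estimates)."""
    energy = float(np.mean(frame ** 2))
    clip = 0.64 * np.max(np.abs(frame))                      # center-clipping threshold (64%)
    clipped = np.where(np.abs(frame) > clip, frame, 0.0)
    zcr = float(np.mean(np.abs(np.diff(np.sign(clipped))) > 0))
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lags = np.arange(fs // 400, fs // 60)                    # search 60-400 Hz
    pitch = fs / lags[np.argmax(ac[lags])]
    return np.array([pitch, zcr, energy])

def vad_update(p_voice_prev, feats, means, stds, lik_nonvoice=0.1):
    """Eqs. (8)-(9): predict with the transition matrix, then average the
    per-feature likelihoods and normalize."""
    pred = np.array([p_voice_prev, 1.0 - p_voice_prev]) @ TRANS
    lik_voice = np.mean(np.exp(-0.5 * ((feats - means) / stds) ** 2))
    post = pred * np.array([lik_voice, lik_nonvoice])
    post /= post.sum()
    return post[0]                                           # probability that the frame is voice

rng = np.random.default_rng(0)
frame = rng.normal(0.0, 0.01, 512)                           # one 512-sample frame
p_voice = vad_update(0.5, frame_features(frame),
                     means=np.array([150.0, 0.1, 0.05]),
                     stds=np.array([80.0, 0.1, 0.05]))
```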

2) Evaluating Audio Information
Fig. 2 shows the position of the three microphones and their relation to the sound source. Since we use audio information to detect the source position and speaking status of each target, we first evaluate the audio information and then predict the speaking status of each target. Time Delay of Arrival (TDOA) values are calculated and used to evaluate the sound signal with a cross-correlation method [8]. With the observed time delay values, a process to evaluate the audio information is performed, (10):

L(y_t^a \mid X_{k,t}) = \prod_{1 \le i < j \le m} \exp\left[ -\frac{(\tau_{ij} - \tau_{ij}^k)^2}{2 (\sigma_a^{ij})^2} \right]    (10)

where \tau_{ij} is the observed time delay between microphones i and j, and \tau_{ij}^k is the expected time delay between microphones i and j from the kth target.

Fig. 2. Location of microphones for sound localization (r = 15 cm).
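A small sketch of the audio evaluation of Eq. (10). The microphone layout (three microphones evenly spaced on a 0.15 m circle) and the per-pair standard deviations are assumptions made only for this example; in the paper the per-pair deviation follows the sensitivity model below.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

# Assumed layout: three microphones on a 0.15 m circle (cf. Fig. 2).
MICS = 0.15 * np.array([[np.cos(a), np.sin(a)] for a in np.deg2rad([90.0, 210.0, 330.0])])

def expected_tdoas(src_xy):
    """Expected pairwise delays tau_ij^k for a hypothesised source position."""
    d = np.linalg.norm(MICS - src_xy, axis=1)
    return np.array([(d[i] - d[j]) / C
                     for i in range(len(MICS)) for j in range(i + 1, len(MICS))])

def audio_evaluation(observed_tdoas, src_xy, sigma_a):
    """Eq. (10): product of Gaussian factors over microphone pairs."""
    diff = observed_tdoas - expected_tdoas(src_xy)
    return float(np.exp(-np.sum(diff ** 2 / (2.0 * sigma_a ** 2))))

# Example: evaluate a particle hypothesis against delays observed for a
# synthetic source at 1.5 m, 30 degrees.
true_src = 1.5 * np.array([np.cos(np.deg2rad(30)), np.sin(np.deg2rad(30))])
obs = expected_tdoas(true_src)
sigma_a = np.full(obs.shape, 1e-4)                  # per-pair std dev (illustrative)
print(audio_evaluation(obs, true_src, sigma_a))     # close to 1 at the true position
print(audio_evaluation(obs, np.array([0.0, 1.5]), sigma_a))  # much smaller elsewhere
```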

3) Sensitivity Model
To obtain reliable localization ability through audio information, we defined the sensitivity between two microphones, which can actively change its value according to the target's position. The standard deviation for the sound information from each microphone is divided into two components, (11):

\sigma_a^{ij} = \sigma_{mic} \times \Lambda_{ij}    (11)

where \sigma_{mic} is the standard deviation of the error of the microphone and \Lambda_{ij} is the inverse of the sensitivity value from the relationship between the microphone and the particle state, which is defined as (12):

\Lambda_{ij} = \left( \frac{d\tau_{ij}}{d\theta} \right)^{-1}    (12)

where \theta is the horizontal angle of the source that is depicted in Fig. 2. The standard deviation for the microphone (\sigma_{mic}) is determined empirically.

4) Audio Likelihood
The speaking state of each target (s_{k,t}) has a binary value: 1 for the speaking state and 0 for the quiet state. As explained in the model description section, these values are selected probabilistically according to the speaking state of the target. Since the speaking state is primarily related to the audio information, it is used to calculate the likelihood of the audio information [13]:

p(y_t^a \mid s_{k,t}, x_{k,t}) = \begin{cases} K_a & \text{if } s_{k,t} = 1, \; L(y_t^a \mid X_{k,t}) \ge L_{min} \\ K_b & \text{if } s_{k,t} = 1, \; L(y_t^a \mid X_{k,t}) < L_{min} \\ K_b & \text{if } s_{k,t} = 0, \; L(y_t^a \mid X_{k,t}) \ge L_{min} \\ K_a & \text{if } s_{k,t} = 0, \; L(y_t^a \mid X_{k,t}) < L_{min} \end{cases}    (13)

where K_a and K_b are constant likelihood values, and K_a is greater than K_b. By applying a different likelihood to each target according to its speaking state and audio evaluation value, we can improve the chance of selecting the correct speaking state for each target. The effect of this method will be illustrated by comparing it with the result obtained when only the values evaluated from (10) are used.
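A compact sketch of the case selection in Eq. (13); the constants K_a, K_b and the threshold L_min used here are illustrative, not the values used in the experiments.

```python
def audio_likelihood(speaking, audio_eval, l_min=0.5, k_a=0.8, k_b=0.2):
    """Eq. (13): return K_a when the speaking state agrees with the audio
    evaluation L(y_t^a | X_k,t) relative to L_min, and K_b otherwise."""
    agrees = (audio_eval >= l_min) == bool(speaking)
    return k_a if agrees else k_b

print(audio_likelihood(1, 0.9), audio_likelihood(0, 0.9))   # 0.8, 0.2
```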

D. Visual Measurement
To detect human faces, we used OpenCV (Open Computer Vision), the open-source vision library developed by Intel. To acquire visual information, we installed a stereo camera (Videre Design STH-MDCS2) on our robot platform. Using a face detection routine in OpenCV, we can obtain the number and locations of human faces. The likelihood for the visual observation is described as (14).

Fig. 3. Coordinates for visual information. The reference point is the center of the robot's shoulder.

p(y_t^v \mid r_{k,t}, x_{k,t}) \propto \exp\left[ -\frac{(y_{r_{k,t},t}^v - x_{k,t})^2}{2 \sigma_v^2} \right]    (14)

where y_{r_{k,t},t}^v defines the location of the detected face associated with the kth target. We defined the detection probability of a target (P_D) to be constant and equal to 0.8. The association probability can be defined by (15):

p(r_{k,t} = j \mid r_{1:k-1,t}, x_{k,t}) = \begin{cases} 1 - P_D & \text{if } j = 0 \\ 0 & \text{if } j > 0, \; j \in \{r_{1,t}, \ldots, r_{k-1,t}\} \\ P_D / M_k & \text{otherwise} \end{cases}    (15)

where M_k is the number of targets. If r_{k,t} is 0, it means that the kth target is associated with no observed objects. If the kth target is associated with an observed object that has already been associated with another target, the association probability is set to zero.
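The face-position likelihood of Eq. (14) reduces to a Gaussian around the associated detection. The sketch below assumes face and particle positions expressed as horizontal angles (degrees) and an illustrative sigma_v; the face positions themselves would come from the OpenCV face detector mentioned above.

```python
import numpy as np

def visual_likelihood(face_pos, particle_pos, sigma_v=5.0):
    """Eq. (14): Gaussian likelihood of the face position associated with a
    target, evaluated at a particle's position hypothesis."""
    return np.exp(-(face_pos - particle_pos) ** 2 / (2.0 * sigma_v ** 2))

# Faces detected at -40 and 10 degrees; particle hypotheses for one target
# that has been associated with the second face (j = 1).
faces = np.array([-40.0, 10.0])
particles = np.array([8.0, 12.0, -35.0])
print(visual_likelihood(faces[1], particles))
```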

E. Data Fusion
For data association in a multiple-object environment, we use visual information, and the association vector r_k is only related to the visual measurement. We also assume that the speaking state does not influence the visual information. With this assumption, we can compute the total likelihood for each target from the visual and audio information, (16).

p(y_{j,t} \mid r_{k,t}, X_{k,t}) = p(y_t^v \mid r_{k,t}, x_{k,t}) \cdot p(y_t^a \mid s_{k,t}, x_{k,t})    (16)

The sampling frequencies for audio and visual information are different. Normally, audio data is sampled much more frequently than video data. There is no guarantee that both types of information are available at each time step, so we assign the likelihood of missing data to be 1.

F. Speaker Selection
After particle filtering, we can obtain the speaking probability of each target. This information can be used to select a speaking target from the several targets around the robot. For more reliable selection of speaking targets, we average the probability value over a certain time range and use the average to determine the speaking target. The target that has the maximal average speaking probability is selected as the speaking target.

\text{Speaker} = \arg\max_k P(s_{k,t} \mid y_t), \quad k \in \{1, \ldots, K\}    (17)

We also define a threshold for determining the speaker. If the maximum probability is not greater than the threshold value, we consider the previously speaking target to stay in the speaking state.
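A sketch of the fusion and selection rules (Eqs. (16)-(17)): a missing modality contributes a likelihood of 1, as described above, and the speaker is the target with the highest time-averaged speaking probability, falling back to the previous speaker below a threshold. The window length and threshold value here are illustrative.

```python
import numpy as np

def fused_weight(vis_lik, aud_lik):
    """Eq. (16): total per-target likelihood; a missing modality contributes 1."""
    return (1.0 if vis_lik is None else vis_lik) * (1.0 if aud_lik is None else aud_lik)

def select_speaker(speaking_prob_history, threshold=0.6, prev_speaker=None):
    """Eq. (17) with temporal averaging: pick the target whose speaking
    probability, averaged over a short window, is highest; keep the previous
    speaker if that maximum does not exceed the threshold."""
    avg = np.mean(speaking_prob_history, axis=0)          # average over the time window
    best = int(np.argmax(avg))
    return best if avg[best] > threshold else prev_speaker

print(fused_weight(0.7, None))                            # audio missing in this frame -> 0.7
history = np.array([[0.2, 0.7], [0.3, 0.8], [0.4, 0.9]])  # rows: time steps, cols: targets
print(select_speaker(history, prev_speaker=0))            # -> 1
```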

G. Number of Targets
To associate observed objects to current targets, the number of targets must be recognized at each time step. During tracking, objects can appear, suddenly disappear, or go undetected. To handle these problems, we calculate the number of objects based on both visual and audio information.

1) By Vision
Visual information can contain the positions of human faces and non-human faces. To handle this problem, we calculate the frequency of face detection. If more than 5 unassociated faces are found within 0.6 m of one of the most recently observed objects, it is selected to be a new target.

2) By Audio
If only visual information is used to update the number of targets around the robot, the robot cannot respond to a stimulus that is out of its sight. To recognize a call from a position out of the robot's view, a circle 2 m in radius is defined around the robot. On this circle, 50 points are deployed, separated by 7.2 degrees. At each grid point, we calculate a TDOA value (\tau_{k,ij}) and compare it with the observed time delay (\tau_{ij}). Each grid point has its own weight value, which is defined and updated according to (18):

W_{k,t}^{au} = W_{k,t-1}^{au} \cdot \sum_{1 \le i < j \le m} (\tau_{ij} - \tau_{k,ij})    (18)

If the maximum weight value is over a certain limit (we defined it to be 0.5), there could be a sound source on one of the grid points. If the angle between the predicted point and the currently existing target is over 15 degrees, a new target is designated and the robot turns its head and gazes toward the detected position. Since the grid is composed of static points and is selected to hear possible calls from outside the robot's view, the robot relocates the target when it sees the face of the new target.
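A sketch of the out-of-view call detection on the 2 m grid. The Gaussian mismatch score and its scale stand in for the weight update of Eq. (18) and are assumptions of this example; the 50-point grid, the 0.5 confidence limit and the 15-degree separation rule come from the text.

```python
import numpy as np

# 50 candidate call directions on a 2 m circle, 7.2 degrees apart.
GRID_ANGLES = np.arange(50) * 7.2                                        # degrees
GRID_XY = 2.0 * np.column_stack([np.cos(np.deg2rad(GRID_ANGLES)),
                                 np.sin(np.deg2rad(GRID_ANGLES))])

def detect_out_of_view_call(observed_tdoas, expected_tdoas_fn, target_angles,
                            score_limit=0.5, min_separation=15.0, scale=1e-7):
    """Score each grid point by how well its expected delays match the observed
    TDOAs, and designate a new target if the best point is confident and more
    than 15 degrees away from every existing target."""
    scores = np.array([np.exp(-np.sum((observed_tdoas - expected_tdoas_fn(p)) ** 2) / scale)
                       for p in GRID_XY])
    best = int(np.argmax(scores))
    if scores[best] < score_limit:
        return None                                                      # no convincing call
    angle = GRID_ANGLES[best]
    sep = np.abs((np.asarray(target_angles, dtype=float) - angle + 180.0) % 360.0 - 180.0)
    return angle if (sep.size == 0 or sep.min() > min_separation) else None

# Usage (with the expected_tdoas helper from the audio-evaluation sketch above):
#   new_angle = detect_out_of_view_call(obs, expected_tdoas, target_angles=[0.0])
```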

3) Existence Probability
Using the method described above, the robot can recognize newly appearing objects. However, the robot needs to know whether some of the targets really exist or not. Calculating a belief of existence is similar to the problem of map building with the noisy sensors of mobile robots. Moravec and Elfes [14] proposed a classic way of updating a belief upon sensory input based on a recursive Bayesian scheme. In our case, this method calculates the existence probability P(e \mid y_{1:t}^{ass}) according to the data association result y_t^{ass}, which is 1 if a target is associated and 0 if not:

P(e \mid y_{1:t}^{ass}) = \left[ 1 + \frac{1 - P(e \mid y_t^{ass})}{P(e \mid y_t^{ass})} \cdot \frac{P(e)}{1 - P(e)} \cdot \frac{1 - P(e \mid y_{1:t-1}^{ass})}{P(e \mid y_{1:t-1}^{ass})} \right]^{-1}    (19)

We set the prior probability (P(e)) to be 0.5. We need to specify the probability P(e \mid y_t^{ass} = 1) that a target exists if it is associated with a measurement and the probability P(e \mid y_t^{ass} = 0) that a target exists if it is not associated with a measurement. In our experiment, we determined these probabilities to be 0.9 and 0.47, respectively. We defined a threshold level for determining the existence of a target. If the probability is over this threshold, the target is determined to exist. This probability can help the robot recognize undetected targets that exist but are not associated with measurements. An undetected target is tracked if the probability is over the threshold. If the probability is less than the threshold, the target is considered to have disappeared.
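The recursive update of Eq. (19) can be written directly with the values quoted in the text (prior 0.5, association probabilities 0.9 and 0.47); the association sequence used below is just an example.

```python
def update_existence(p_prev, associated, p_if_assoc=0.9, p_if_not=0.47, prior=0.5):
    """Eq. (19): recursive Bayesian update of the existence probability,
    using the values quoted in the text (0.9, 0.47, prior 0.5)."""
    p_obs = p_if_assoc if associated else p_if_not
    odds = ((1.0 - p_obs) / p_obs) * (prior / (1.0 - prior)) * ((1.0 - p_prev) / p_prev)
    return 1.0 / (1.0 + odds)

# Consecutive associations push the probability toward 1; misses lower it gradually.
p = 0.5
for associated in [True, True, False, False, False]:
    p = update_existence(p, associated)
    print(round(p, 3))
```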

H. Algorithm Procedure
In this section, we briefly summarize the proposed method for data association in target tracking and speaker selection. We select a particle filtering method to efficiently search for association hypotheses, and use both audio and visual information.

(1) For n = 1, \ldots, N, initialize the particle weights: \alpha_t^{0,n} = \omega_{t-1}^n.
(2) For k = 1, \ldots, K:
    - For n = 1, \ldots, N, predict the kth target states.
    - For n = 1, \ldots, N, generate associations between the target and the visual observations.
    - Calculate the total likelihood of the kth target: \alpha_{k,t}^n = \alpha_{k-1,t}^n \cdot p(X_{k,t} \mid y_t).
    - Update the state and normalize the particle weights \alpha_{k,t}^n.
    - If resampling is required, resample the particles.
(3) Calculate the weight of the data association and speaking status: \omega_t^n = \alpha_{K,t}^n \cdot P_C.
    - If resampling is required, resample the particles.
(4) Select the target that meets the condition for selecting a speaker from the objects.
(5) Update the current number of targets, using not only audio and video information but also the probability of target existence.
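The procedure repeatedly calls for resampling when required. One standard low-variance (systematic) resampler is sketched below, together with an effective-sample-size test for deciding when to resample; the 0.5 fraction in that test is an assumption of this sketch, not a value taken from the paper.

```python
import numpy as np

def low_variance_resample(particles, weights, rng=None):
    """Systematic (low-variance) resampling: returns equally weighted copies
    of the particles drawn with one random offset and N evenly spaced points."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    idx = np.searchsorted(cumulative, positions)
    return [particles[i] for i in idx], np.full(n, 1.0 / n)

def needs_resampling(weights, threshold=0.5):
    """Resample when the effective sample size falls below a fraction of N."""
    ess = 1.0 / np.sum(np.asarray(weights) ** 2)
    return ess < threshold * len(weights)

w = np.array([0.70, 0.15, 0.10, 0.05])
print(needs_resampling(w))                                   # True
print(low_variance_resample(list("abcd"), w, np.random.default_rng(0)))
```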

IV. RESULTS

A. Simulation
Before we implemented the proposed algorithm in a robotic system, we collected two types of data and simulated them on a PC. In one portion of the data, a speaker sits at a fixed point while another person moves around without saying a word (scene 1). In the other portion, two people have a conversation, each sitting in a fixed position (scene 2). The audio data were sampled at 16 kHz with three omnidirectional microphones, and the visual data had 640 × 480 pixels. The visual data acquisition rate depends on the status of the computer, so the frame rate could differ from second to second (20-24 frames/sec).


Fig. 4. Tracking and speaking probability from simulation when two objects are overlapping (Scene 1).

For the simulation, we used an amplifying system that we developed, which nonlinearly amplifies incoming audio signals [8]. With the acquired data, we performed a computer simulation using MATLAB. We processed the audio data with 512 samples in each timeframe with 50% overlap (256 samples). Fig. 4 shows the speaking probability, tracking results and voice information when two objects overlapped in scene 1. We can see that when the two objects crossed each other, the objects had similar speaking probability values and patterns. This is because the state variable contains the target position, and the speaking states are influenced by target positions. Fig. 5 shows the results from scene 2. In this experiment, each speaker spoke 3 sentences. These two results show that speaker selection as well as position estimation can be obtained using the proposed algorithm. To recognize when the speaking person and position change, we proposed a method in which the audio likelihood is dependent on the magnitude of evaluated TDOA values.

Fig. 5. Simulated tracking and speaking probability when two people are conversing (Scene 2).


Fig. 6. Effect of the proposed method for calculating the audio likelihood (Scene 2).

Fig. 6 shows the effect of the proposed method for audio information. After 10 sec, the speaker index changes from 0 to 1. Before applying the proposed method of calculating the audio likelihood, the detected speaker did not change. However, when the proposed method is used, the speaker changes from 0 to 1.

B. Robot

Fig. 7. A conversation experiment with the robot.

After verifying the proposed algorithm via experimental simulation, we implemented the algorithm into our robotic platform, Infotainment (the robot is shown in Fig. 7). This robot has several omnidirectional microphones on its shoulder and around its head. We used the microphone arrays that are installed on its shoulder to record audio signal. Recorded audio signals are amplified using an audio board developed by Tokyo Electronics. It has 8 input channels, and the amplifier gains can be adjusted via a computer program. The amplifying board is different from the board we used for simulation. With simulation data, we found that there is a significant delay in the data acquisition system (about 4 sec); this does not guarantee real-time processing in a robotic system. We conducted two experiments with the robot. One experiment was with a person out of the robot's view calling the robot, and the other was with two people calling the robot in turn. We could not conduct an experiment analogous to the scene 1 simulation because of the slow rate of visual information acquisition (2 or 3 frames/sec).


Fig. 8. Results of the interaction when there is one person speaking.

Fig. 9. Results from a conversation with two people.

Fig. 8 shows the results when a person called the robot. The angle between the person and the robot was 50 degrees. At first, the robot did not recognize that there was a person in front of it. After 8 sec, the robot recognized that there was a speaker at a -50 degree angle (See "Detect") and turned its head towards him (See "Associate"). The existence probability dropped a little, but after the observed face was associated with the sound source, the probability became 1, which means that the target really exists in front of the robot. The red line in the audio data plot indicates the result of the VAD, which is 1 if there is a voice. Fig. 9 shows the results when two people were speaking to the robot. In the beginning, target 0 was within view but target 1 was not. At first, target 0 spoke to the robot, followed by target 1 speaking to the robot after 30 sec. When the first call from target 1 was heard by the robot, it recognized that a sound source was out of its view. The robot turned its head towards it and associated the observed face with the detected sound source (target 1). The robot recognized that there were two targets around it and successfully turned its head to respond to calls from each target. At the end of the experiment (about 80 sec), the disappearance of target 1 resulted in a decrease in existence probability. When this probability was lower than 0.1, the robot recognized that the target had indeed disappeared.

V. DISCUSSION

The robot selects the current speaker based on the speaking probability calculated by the proposed algorithm. The speaking state transition model assumes that the previous speaker is much more likely to speak than the quiet person is. To see the effects of this assumption, we conducted an experiment in which two participants spoke simultaneously, as shown in Fig. 10. In this plot, the periods in which several voices were speaking simultaneously are indicated by a red rectangle. At the beginning, the robot recognized the current speaker as target 0, who was positioned at an angle of 0 degrees. When the two people spoke the same word together at the same time, we can see that the speaking probability of target 0 stayed higher than that of target 1. Sometimes, there were small fluctuations in the speaking probability for target 1, but it was consistently lower than that of target 0. In this experiment, the two participants spoke numbers together, with a speaking duration of less than 0.5 sec. If two different long sentences were generated, the robot would try to see both faces, one at a time.

Fig. 11 shows another case, where the robot was confused into associating a voice with target 1 when the sound was actually generated by another person near target 1. From the beginning, target 0 sat right in front of the robot and spoke to it. After 15 seconds or so, target 1 appeared and called to the robot. The robot responded to the call and turned its head towards him (-50 degrees). After 23 seconds, a voice was generated near the position of target 1. This voice belonged to another person (target 2), and the robot did not recognize him. The robot knew only that two targets (target 0 and 1) existed around it. When the voice was transmitted to the robot, it recognized that the voice came from target 1 and turned its head towards him (-50 degrees). This is because tracking is based on the positions of the targets. We use the TDOA value to evaluate the audio information, which is highly correlated with the position of the source and the geometry of the microphone array.


Fig. 10. Experimental results in the presence of two simultaneous voices speaking near the robot.

The likelihood of the video information is also calculated based on the positions of the faces. All the inferences made in this algorithm depend upon the positions of the targets. This causes confusion in selecting the speaker, as shown in Fig. 11.

Fig. 11. The robot mistakes the speaker for target 1 when target 2 talks to the robot.

VI. CONCLUSION

Speaker selection is one of the key functions in initiating reliable interactions with a robot. However, in a real environment, a robot can obtain a lot of erroneous information, resulting in inaccurate tracking. The proposed algorithm can simultaneously calculate the position and speaking state of targets in a cluttered environment. It exhibits good performance in selecting a speaker and tracking multiple targets through several experimental simulations, and when implemented in a robotic system.

ACKNOWLEDGMENT

This research is supported by the Development of Active Audition System Technology for Intelligent Robots through the Center for Intelligent Robotics in South Korea.

REFERENCES

[1] W. Burgard, A. B. Cremers, D. Fox, D. Hahnel, G. Lakemeyer, D. Schulz, W. Steiner, and S. Thrun, "The Interactive Museum Tour-Guide Robot," in Proceedings of the AAAI, Madison, WI, 1998.
[2] V. Pavlovic, A. Garg, and J. Rehg, "Multimodal speaker detection using error feedback dynamic Bayesian networks," in Proceedings of the IEEE CVPR, Hilton Head Island, SC, 2000, pp. 34-43.
[3] J. Vermaak, M. Gangnet, A. Blake, and P. Perez, "Sequential Monte Carlo fusion of sound and vision for speaker tracking," in Proceedings of the IEEE ICCV, Vancouver, 2001, pp. 741-746.
[4] A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice, Springer-Verlag, 2001.
[5] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, no. 1, 1998, pp. 5-28.
[6] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in Proceedings of the 7th European Conference on Computer Vision, 2002.
[7] D. B. Ward, E. A. Lehmann, and R. C. Williamson, "Particle Filtering Algorithms for Tracking an Acoustic Source in a Reverberant Environment," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, 2003, pp. 826-836.
[8] H. D. Kim, J. S. Choi, and M. S. Kim, "Human-Robot Interaction in Real Environments by Audio-Visual Integration," International Journal of Control, Automation, and Systems, vol. 5, no. 1, 2007, pp. 61-69.
[9] D. Reid, "An Algorithm for Tracking Multiple Targets," IEEE Transactions on Automatic Control, vol. 24, no. 6, 1979, pp. 843-854.
[10] J. Vermaak, S. J. Godsill, and P. Perez, "Monte Carlo Filtering for Multi-Target Tracking and Data Association," IEEE Transactions on Aerospace and Electronic Systems, vol. 41, no. 1, 2005, pp. 309-322.
[11] D. Schulz, W. Burgard, D. Fox, and A. B. Cremers, "People tracking with mobile robots using sample-based joint probabilistic data association filters," The International Journal of Robotics Research, vol. 22, no. 2, 2003, pp. 99-116.
[12] T. Zhao and R. Nevatia, "Tracking multiple humans in crowded environment," in Proceedings of the IEEE CVPR, 2004, pp. 406-413.
[13] D. G. Perez, G. Lathoud, J. M. Odobez, and I. M. Cowan, "Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 2, 2007, pp. 601-616.
[14] H. P. Moravec and A. E. Elfes, "High resolution maps from wide angle sonar," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1985.
[15] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.

Yoonseob Lim was born in South Korea in 1979. He received a B.S. and an M.S. from Seoul National University, South Korea, in 2002 and 2004, respectively. He joined the Center of Cognitive Robotics Research, Korea Institute of Science and Technology. His research interests include audio-visual tracking techniques, human-robot interaction and cognitive robotics.

Jongsuk Choi received a B.S., M.S., and Ph.D. in Electrical Engineering from the Korea Advanced Institute of Science and Technology in 1994, 1996, and 2001, respectively. In 2001, he joined the Intelligent Robotics Research Center, Korea Institute of Science and Technology (KIST), Seoul Korea as a Research Scientist and now is a Senior Research Scientist at KIST. His research interests include signal processing, human-robot interaction, mobile robot navigation and localization.
