Particle Filtering with Multiple Cues for Object Tracking in Video Sequences

Paul Brasnett, Lyudmila Mihaylova, Nishan Canagarajah and David Bull
University of Bristol, Department of Electrical and Electronic Engineering, Merchant Venturers Building, Woodland Road, UK

ABSTRACT

In this paper we investigate object tracking in video sequences by using the potential of particle filtering to process features from video frames. A particle filter (PF) and a Gaussian sum particle filter (GSPF) are developed based upon multiple information cues, namely colour and texture, which are described with highly nonlinear models. The algorithms rely on likelihood factorisation as a product of the likelihoods of the cues. We demonstrate the advantages of tracking with multiple independent, complementary cues over tracking with individual cues, namely increased robustness and improved accuracy. The performance of the two filters is investigated and validated over both synthetic and natural video sequences.

Keywords: particle filtering, Bayesian methods, tracking in video sequences, colour, texture, Gaussian sum particle filtering

1. INTRODUCTION

Object tracking is required in many vision applications such as human-computer interfaces, video communication/compression, road traffic control, and security and surveillance systems. Often the goal is to obtain a record of the trajectory of single or multiple moving targets over time and space. Object tracking in video sequences is a challenging task because of the large amount of data involved and the common requirement for real-time computation. Moreover, most of the models encountered in visual tracking are nonlinear, non-Gaussian, multi-modal or any combination of these.

In this paper we focus on particle filtering techniques for tracking, since they have recently proven to be a powerful and reliable tool for nonlinear systems.1–3 Particle filtering is a promising technique because of its inherent ability to fuse different sensor data, to account for different uncertainties, to cope with data association problems when multiple targets are tracked with multiple sensors, and to incorporate constraints. There are still issues which must be properly addressed if successful algorithms are to be developed. These include the degeneracy problem (when all but one particle have negligible normalised weights), the appropriate choice of proposal distribution, likelihood evaluation, and computational complexity for efficient implementation. Here we look at the likelihood evaluation, specifically comparing single and multiple independent cues.

Particle filtering has been applied to various fields and appears under other names, including the CONDENSATION algorithm4 and the bootstrap filter.1 All these algorithms fall into the category of sequential Monte Carlo techniques since they keep track of the state through a sample-based representation of probability density functions over time.

Different tracking algorithms have been proposed for video sequences and their particularities are mainly application dependent. Many of them rely on a single cue, which can be chosen according to the application context. A cue-selection approach is proposed in Ref. 5, embedded in a hierarchical vision-based tracking algorithm: when the target is lost, layers cooperate to perform a rapid search for the target and continue tracking. Another approach, called democratic integration,6 implements cues concurrently, where all vision cues are complementary and contribute simultaneously to the overall result. Robustness and generality are major features of this approach. Colour cues are a significant part of many tracking techniques,7–12 in combination with shape,7 motion and sound.8 Colour cues provide a weak model and as such are very unrestrictive about the objects they can be used to represent.

Further author information: (Send correspondence to P.B.) E-mail: P.B.: [email protected]; L.M.: [email protected]


The greatest weakness of colour cues is their ambiguity in scenes containing objects or regions with colour features similar to those of the object of interest. As shown in Ref. 8, this ambiguity can be considerably reduced by fusing colour and motion cues, provided the object is moving.

The motivation for multiple cues comes from the inability of a single cue to fully describe the object and therefore to achieve accurate and robust results. For example, in Ref. 8 colour is used as the main cue, and motion and sound cues are proposed to complement it because they can provide a greater degree of discrimination. In Ref. 9 a general framework is introduced for the integration of multiple cues, driven by the idea that different cues are suitable for different conditions; activating and suppressing them dynamically may increase the performance of the system.

So far, texture cues have not been widely used for video-based tracking purposes, and they have not previously been applied to tracking with particle filtering techniques. In this paper we show that colour and texture complement each other and provide reliable performance. In addition to colour and texture, other cues such as gradient, edges, motion and illumination can be added when necessary to avoid ambiguities. Adaptive colour and texture segmentation for tracking moving objects is proposed in Ref. 10, where an autobinomial Gibbs Markov random field is used for modelling the texture whilst colour is modelled by a 2D Gaussian distribution. In this paper texture is instead represented using a wavelet decomposition, within the framework of particle filtering; moreover, Ref. 10 considers segmentation-based tracking with a Kalman filter rather than the feature-based tracking proposed here.

The paper is organised as follows. Section 2 states the problem of object tracking in video sequences within a sequential Monte Carlo framework. Section 3 introduces the motion model and the cues used for tracking. Section 4 describes a PF algorithm we developed based on fusing colour and texture information cues. Section 5 presents a GSPF algorithm for object tracking in video sequences. In Section 6 the filters' performance is investigated and validated over different scenarios, showing the advantage of fusing multiple cues compared to colour-only and texture-only tracking on both synthetic and natural video sequences. Finally, Sect. 7 discusses the results and open issues for future research.

2. SEQUENTIAL MONTE CARLO FRAMEWORK

The aim of sequential Monte Carlo estimation is to evaluate the posterior probability density function (PDF) p(x_k | Z^k) of the state vector x_k given a set Z^k = {z_1, ..., z_k} of sensor measurements up to time k. The Monte Carlo approach relies on a sample-based construction to represent the state PDF. Multiple particles (samples) of the state are generated, each one associated with a weight which characterises the quality of that particular particle. An estimate of the variable of interest is obtained by the weighted sum of particles. Two major stages can be distinguished: prediction and update. During prediction, each particle is modified according to the state model of the region of interest in the video frame, including the addition of random noise in order to simulate the effect of noise on the state. In the update stage, each particle's weight is re-evaluated based on the new data. A resampling procedure eliminates particles with small weights and replicates particles with larger weights.
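To make the predict/update/resample cycle concrete, the following is a minimal sketch (not from the paper) of one step of a generic sampling importance resampling filter; `transition` and `likelihood` are placeholder callables standing in for a state model and a measurement model, and plain multinomial resampling is used for brevity.

```python
import numpy as np

def sir_step(particles, weights, transition, likelihood, z):
    """One predict-update-resample cycle of a generic SIR particle filter.

    `transition` propagates each particle through the state model (adding
    process noise); `likelihood` evaluates L(z | x) for each particle.
    """
    # Prediction: propagate every particle through the motion model.
    particles = transition(particles)

    # Update: re-weight each particle by the likelihood of the new data.
    weights = weights * likelihood(z, particles)
    weights /= np.sum(weights)

    # Output: the weighted mean approximates E[x_k | Z^k].
    estimate = np.sum(weights[:, None] * particles, axis=0)

    # Resampling: discard low-weight particles, replicate high-weight ones.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights, estimate
```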

3. MOTION AND LIKELIHOOD MODELS

3.1. Motion Model

For the purpose of tracking an object in video we initially choose a region which defines the object. The shape of this region is fixed a priori and in our case we choose a rectangular box characterised by the state vector

x = (x, \dot{x}, \ddot{x}, y, \dot{y}, \ddot{y})^{\top},

with x and y denoting the pixel location of the top-left corner of the rectangle, with velocities \dot{x} and \dot{y}, respectively, and accelerations \ddot{x} and \ddot{y}. Note that the dimensions of the rectangle are fixed through the sequence. The motion model used is the constant acceleration model13

x_{k+1} = F x_k + \Gamma v_k, \qquad v_k \sim \mathcal{N}(0, Q),   (1)


where F = \mathrm{diag}(\bar{F}, \bar{F}), \Gamma = (\bar{\Gamma}, \bar{\Gamma}),

\bar{F} = \begin{pmatrix} 1 & T & \frac{1}{2}T^2 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{pmatrix}, \qquad \bar{\Gamma} = \begin{pmatrix} \frac{1}{2}T^2 \\ T \\ 1 \end{pmatrix},

and T is the sampling interval. The Gaussian system noise v = (v_x, v_y)^{\top} has a covariance matrix Q = \mathrm{diag}(Q_x, Q_y), Q_x = V \sigma_x^2, Q_y = V \sigma_y^2, with

V = \begin{pmatrix} \frac{1}{4}T^4 & \frac{1}{2}T^3 & \frac{1}{2}T^2 \\ \frac{1}{2}T^3 & T^2 & T \\ \frac{1}{2}T^2 & T & 1 \end{pmatrix},

where \sigma_x and \sigma_y are the standard deviations of the noise in the x and y coordinates, respectively. It is suggested that suitable values for \sigma_v are in the range 0.5 a_m \le \sigma_v \le a_m, with a_m being the maximum acceleration.13
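As an illustration, the model matrices above might be assembled as follows (a sketch assuming NumPy; `T`, `sigma_x` and `sigma_y` are free parameters):

```python
import numpy as np

def constant_acceleration_model(T, sigma_x, sigma_y):
    """Build F, Gamma-bar and Q for the constant acceleration model (1)."""
    # Per-coordinate transition block and noise gain for a (p, v, a) state.
    F_bar = np.array([[1.0, T, 0.5 * T**2],
                      [0.0, 1.0, T],
                      [0.0, 0.0, 1.0]])
    G_bar = np.array([0.5 * T**2, T, 1.0])

    # Full 6x6 transition matrix for the state (x, x', x'', y, y', y'').
    F = np.block([[F_bar, np.zeros((3, 3))],
                  [np.zeros((3, 3)), F_bar]])

    # V = Gamma_bar Gamma_bar^T, so Q = diag(V sigma_x^2, V sigma_y^2).
    V = np.outer(G_bar, G_bar)
    Q = np.block([[V * sigma_x**2, np.zeros((3, 3))],
                  [np.zeros((3, 3)), V * sigma_y**2]])
    return F, G_bar, Q
```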

3.2. Likelihood Models

Under the assumption that the features being used as cues are independent, the overall likelihood is a product of the likelihoods of the separate cues, in our case colour and texture, as shown in equation (2):

L(z_k | x_k) = L_{colour}(z_{colour,k} | x_k) \, L_{texture}(z_{texture,k} | x_k),   (2)

where z_k denotes the measurement vector, composed of the measurement vector z_{colour,k} from the colour cue and the vector z_{texture,k} from the texture cue. Sections 3.3 and 3.4 give more details of the particular cues considered. To combine colour and texture features in a Bayesian framework we need to make this assumption of conditional independence.
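A minimal illustration (not part of the paper) of evaluating Eq. (2) for a set of particles; summing log-likelihoods before exponentiating is an implementation detail we add here for numerical robustness:

```python
import numpy as np

def fused_likelihood(particles, cue_likelihoods):
    """Evaluate Eq. (2): the product of independent cue likelihoods.

    `cue_likelihoods` is a list of callables, each mapping the particle
    array to per-particle likelihood values (e.g. colour and texture).
    """
    # Work in the log domain so the product of many small likelihoods
    # does not underflow to zero.
    log_l = np.zeros(len(particles))
    for likelihood in cue_likelihoods:
        log_l += np.log(likelihood(particles) + 1e-300)
    return np.exp(log_l - np.max(log_l))  # unnormalised weights
```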

3.3. Colour Cue

The colour likelihood model has to be defined in a way that favours candidate colour histograms close to the reference histogram. The histogram h^c_x = (h^c_{1,x}, \ldots, h^c_{B,x}) for a region S_x corresponding to a state x is given by

h^c_{i,x} = C_N \sum_{u \in S_x} \delta_i(b^c_u), \quad i = 1, \ldots, B,   (3)

where b^c_u \in \{1, \ldots, B\} is the histogram bin index associated with the intensity at pixel location u = (x, y) in channel c of the colour image y^c, and C_N is a normalising constant such that \sum_{i=1}^{B} h^c_{i,x} = 1. We use 8x8x8-bin histograms in the three channels of the red, green, blue (RGB) colour space.8

A distance metric which is appropriate to make decisions about the closeness of two histograms \hat{h}_1 = \{h^c_{x_1}\}_{c \in \{R,G,B\}} and \hat{h}_2 = \{h^c_{x_2}\}_{c \in \{R,G,B\}} is the Bhattacharyya similarity distance12,14,15

d(\hat{h}_1, \hat{h}_2) = \sqrt{1 - \rho(\hat{h}_1, \hat{h}_2)},   (4)

where \rho(\hat{h}_1, \hat{h}_2) is the Bhattacharyya coefficient,

\rho(\hat{h}_1, \hat{h}_2) = \sum_{i=1}^{m} \sqrt{\hat{h}_{1,i} \hat{h}_{2,i}}.   (5)

The larger the coefficient \rho(\hat{h}_1, \hat{h}_2) is, the more similar the distributions are. The Bhattacharyya distance values lie within the interval [0, 1]. For two identical normalised histograms we obtain d = 0 (\rho = 1), indicating a perfect match.


Based on this distance, the colour likelihood model can be defined by8

L_{colour}(z_{colour} | x) \propto \exp\Big(-\sum_{c \in \{R,G,B\}} d^2(h^c_x, h^c_{ref}) / 2\sigma_C^2\Big),   (6)

where the standard deviation \sigma_C specifies the Gaussian noise in the measurements, h^c_x is the current histogram of the target, and h^c_{ref} is the reference histogram. Small Bhattacharyya distances correspond to large weights in the particle filter.
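A sketch of the colour cue under the definitions above (NumPy assumed; `patch` is the RGB image region selected by a particle, `h_ref` a reference histogram, and `sigma_c` an illustrative tuning value). For brevity a single Bhattacharyya distance is computed over the stacked R, G, B histograms rather than the per-channel sum written in Eq. (6):

```python
import numpy as np

def rgb_histogram(patch, bins=8):
    """Normalised per-channel histograms (Eq. 3) for an RGB patch in [0, 255]."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(patch[..., c], bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)  # stacked R, G, B histograms

def bhattacharyya_distance(h1, h2):
    """Eqs. (4)-(5): distance between two normalised histograms."""
    rho = np.sum(np.sqrt(h1 * h2))
    return np.sqrt(max(1.0 - rho, 0.0))

def colour_likelihood(patch, h_ref, sigma_c=0.2):
    """Eq. (6) up to proportionality; sigma_c is a tuning parameter."""
    d = bhattacharyya_distance(rgb_histogram(patch), h_ref)
    return np.exp(-d**2 / (2.0 * sigma_c**2))
```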

3.4. Texture Cue

Texture is an appealing feature to use as a basis for an observation model because of its intuitive definition. Qualitatively, texture can be described by terms such as fine, coarse, grained and smooth. Although there is no unique definition of texture, it is generally agreed that texture describes the spatial arrangement of pixel levels in an image, which may be stochastic, periodic, or both.16 When a texture is viewed from a distance it may appear to be fine; however, when viewed from close up it may appear to be coarse. Texture properties are analysed by different techniques such as statistical methods, spatial-frequency methods, structural methods and fractal-based methods. The method we have chosen for feature extraction is the spatial-frequency method of wavelets, implemented by a discrete wavelet transform.

3.4.1. The discrete wavelet transform

The discrete wavelet transform (DWT)17,18 decomposition performs a series of subband filtering operations on the original image region. The lower-frequency subband then forms the input image for the next level of the transform (Fig. 1a). It should be noted that the resulting transformed image has the same dimensions as the original image due to a subsampling stage between each level of decomposition. The wavelet transform provides a representation containing different frequencies and different resolutions in frequency and time. In this work the Haar wavelets are used as the filter due to their simplicity and compactness. Other wavelet transforms can be applied and allow for an improved overall performance.

3.4.2. Texture analysis using wavelets

The coefficients of each channel in the DWT contain information about the spatial and spatial-frequency content of the input image (see Fig. 1b). The LH channel (channel 9), for example, contains the lower subband of the horizontal frequencies and the higher subband of the vertical frequencies; this channel is therefore dominated by horizontal edge information.

We now describe the process of generating a feature vector representing the texture of a region. Firstly, to remove the effect of the input image mean we subtract it from the image before taking the wavelet transform. Then a three-level DWT decomposition is applied to generate ten channels (Fig. 1). The texture of the m-th channel x_m is then represented by the l_1-norm

e_m = \frac{1}{MP} \sum_{i=1}^{M} \sum_{j=1}^{P} |x_m(i, j)|,   (7)

where M and P are the dimensions of the channel and x_m(i, j) is the wavelet coefficient at location (i, j). The texture of the region under this scheme is represented by the ten-element feature vector

t = \{e_1, e_2, \ldots, e_{10}\}.   (8)

Before we use the feature vector to represent the texture PDF we must ensure that the values are in the range [0, 1] and are normalised. To do this we mean-shift by min{t, 0} and then normalise the vector, where min{x} denotes the smallest value in x. We denote the mean-shifted, normalised feature vector as \bar{t}. It is now possible to determine the distance between two texture feature vectors.


Figure 1. Discrete wavelet transform: (a) two-level wavelet transform, showing the LL, HL, LH and HH subbands; (b) the ten channels of a three-level discrete wavelet transform.

A measure d is defined using the Bhattacharyya coefficient, characterising the difference (distance) between two normalised texture feature vectors \bar{t}_1 and \bar{t}_2:

d(\bar{t}_1, \bar{t}_2) = \sqrt{1 - \rho(\bar{t}_1, \bar{t}_2)},   (9)

where the Bhattacharyya coefficient \rho(\bar{t}_1, \bar{t}_2) is defined as

\rho(\bar{t}_1, \bar{t}_2) = \sum_{i=1}^{m} \sqrt{\bar{t}_{1,i} \bar{t}_{2,i}}.   (10)

The texture likelihood can then be defined in a similar way to the colour likelihood introduced in Sect. 3.3:

L_{texture}(z_{texture} | x) \propto \exp\big(-d^2(\bar{t}_x, \bar{t}_{ref}) / 2\sigma_t^2\big),   (11)

where \sigma_t is the standard deviation of the Gaussian texture noise, \bar{t}_x is the texture feature vector of the current frame, and \bar{t}_{ref} is the reference texture feature vector.
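The texture cue can be sketched with the PyWavelets package (assumed available; `pywt.wavedec2` with `level=3` yields the approximation channel plus three detail channels per level, i.e. the ten channels of Fig. 1b). Here `patch_gray` is the greyscale image region of a particle and `sigma_t` an illustrative tuning value:

```python
import numpy as np
import pywt  # PyWavelets, assumed available

def texture_feature(patch_gray):
    """Ten-element wavelet texture vector (Eqs. 7-8), normalised to sum to 1."""
    # Remove the image mean, then apply a three-level Haar DWT.
    coeffs = pywt.wavedec2(patch_gray - patch_gray.mean(), 'haar', level=3)
    channels = [coeffs[0]] + [c for level in coeffs[1:] for c in level]

    # Eq. (7): mean absolute coefficient (l1-norm) of each channel.
    t = np.array([np.abs(c).mean() for c in channels])

    # Shift so all entries are non-negative, then normalise (Sect. 3.4.2).
    t = t - min(t.min(), 0.0)
    return t / max(t.sum(), 1e-12)

def texture_likelihood(t_x, t_ref, sigma_t=0.2):
    """Eq. (11) up to proportionality; sigma_t is a tuning parameter."""
    rho = np.sum(np.sqrt(t_x * t_ref))   # Bhattacharyya coefficient (10)
    d = np.sqrt(max(1.0 - rho, 0.0))     # Bhattacharyya distance (9)
    return np.exp(-d**2 / (2.0 * sigma_t**2))
```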

4. A PARTICLE FILTER ALGORITHM FOR OBJECT TRACKING USING MULTIPLE CUES

Within the Bayesian framework, the conditional PDF p(x_{k+1} | Z^k) is recursively updated according to the prediction step

p(x_{k+1} | Z^k) = \int p(x_{k+1} | x_k) \, p(x_k | Z^k) \, dx_k,   (12)

and the update step

p(x_{k+1} | Z^{k+1}) = \frac{p(z_{k+1} | x_{k+1}) \, p(x_{k+1} | Z^k)}{p(z_{k+1} | Z^k)},   (13)


where p(z_{k+1} | Z^k) is a normalising constant. The recursive update of p(x_{k+1} | Z^{k+1}) is proportional to

p(x_{k+1} | Z^{k+1}) \propto p(z_{k+1} | x_{k+1}) \, p(x_k | Z^k).   (14)

Usually there is no simple analytical expression for propagating p(x_{k+1} | Z^{k+1}) through (14), so numerical methods are used.

In the particle filter approach a cloud of N weighted particles, drawn from the posterior conditional PDF, is used to map integrals to discrete sums. The posterior p(x_{k+1} | Z^{k+1}) is approximated by

\hat{p}(x_{k+1} | Z^{k+1}) \approx \sum_{l=1}^{N} \widetilde{W}^{(l)}_{k+1} \, \delta(x_{k+1} - x^{(l)}_{k+1}),   (15)

where \delta is the Dirac delta function and \widetilde{W}^{(l)}_{k+1} are the normalised importance weights. New weights are calculated, putting more weight on particles that are important according to the posterior PDF (15).

The random samples \{x^{(l)}_{k+1}\}_{l=1}^{N} are drawn from p(x_{k+1} | Z^k). It is often impossible to sample from the posterior density function p(x_k | Z^k); this difficulty is circumvented by making use of importance sampling from a known proposal distribution, p(x_{k+1} | x_k).

The developed particle filter algorithm is given in Table 1. Note that the resampling step is included in order to avoid the problem of degeneracy, when only one particle has significant normalised weight.

Table 1. The particle filter with colour and texture cues

Initialisation
1. k = 0; for l = 1, ..., N, generate samples \{x^{(l)}_0\} from the initial distribution p(x_0).

Prediction step
2. For k = 1, 2, ... and l = 1, ..., N, sample x^{(l)}_{k+1} \sim p(x_{k+1} | x^{(l)}_k) from the motion model (1).

Measurement update: evaluate the importance weights
3. On receipt of a new measurement, compute the weights

W^{(l)}_{k+1} \propto L(z_{k+1} | x^{(l)}_{k+1}).   (16)

The likelihood L(z_{k+1} | x^{(l)}_{k+1}) is calculated from Eq. (2).

4. Normalise the weights: \widetilde{W}^{(l)}_{k+1} = W^{(l)}_{k+1} / \sum_{l=1}^{N} W^{(l)}_{k+1}.

Output
5. The approximate posterior distribution is computed from the collection of samples as

\hat{p}(x_{k+1} | Z^{k+1}) = \sum_{l=1}^{N} \widetilde{W}^{(l)}_{k+1} \, \delta(x_{k+1} - x^{(l)}_{k+1}).

6. The posterior mean E[x_{k+1} | Z^{k+1}] is computed using the collection of samples (particles):

\hat{x}_{k+1} = E[x_{k+1} | Z^{k+1}] = \sum_{l=1}^{N} \widetilde{W}^{(l)}_{k+1} \, x^{(l)}_{k+1}.   (17)

Selection step (resampling)
7. Multiply/suppress samples x^{(l)}_{k+1} with high/low importance weights \widetilde{W}^{(l)}_{k+1}, in order to obtain N new random samples approximately distributed according to p(x_{k+1} | Z^{k+1}).
8. Set k = k + 1 and return to step 2.
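The paper does not spell out which resampling scheme is used in step 7 (Ref. 2 offers several); a sketch of residual resampling, the scheme named later for the GSPF, might look as follows:

```python
import numpy as np

def residual_resample(weights):
    """Residual resampling: return indices of the particles to keep.

    Each particle is first replicated floor(N * w) times; the remaining
    slots are filled by multinomial sampling from the residual weights.
    """
    n = len(weights)
    counts = np.floor(n * weights).astype(int)   # deterministic copies
    residual = n * weights - counts              # leftover weight mass
    n_rest = n - counts.sum()
    if n_rest > 0:
        residual /= residual.sum()
        extra = np.random.choice(n, size=n_rest, p=residual)
        counts += np.bincount(extra, minlength=n)
    return np.repeat(np.arange(n), counts)       # n indices in total
```

After resampling, one would set `particles = particles[residual_resample(weights)]` and reset every weight to 1/N.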


5. A GAUSSIAN SUM PARTICLE FILTER WITH MULTIPLE CUES

The GSPF19,20 is a recently developed filtering technique in which the state filtering and prediction probability density functions are approximated by finite mixtures of Gaussian components. When the object model and observation model noises are non-Gaussian, they can also be represented as Gaussian mixtures. The GSPF is essentially a bank of Gaussian particle filters.21 The GSPF reduces the estimation problem to updating a Gaussian mixture, whose means, covariances and weights are tracked with each new observation. The filtering state PDF can be approximated as

p(x_{k+1} | z^{k+1}) = \sum_{i=1}^{G} \widetilde{W}_{k+1,i} \, \mathcal{N}(x_{k+1}; \mu_{x_{k+1},i}, \Sigma_{x_{k+1},i}),   (18)

by G components having means \mu_{x_{k+1},i} and covariances \Sigma_{x_{k+1},i}, respectively. The GSPF algorithm with multiple information cues is given in Table 2; note that G is the number of Gaussian components.

Table 2. The GSPF with colour and texture cues

Initialisation
1. k = 0; for i = 1, ..., G, initialise \mu_{x_0,i}, \Sigma_{x_0,i}, \widetilde{W}_{0,i} = 1/G.

Time update
For k = 1, 2, ...,
2. For i = 1, ..., G, obtain samples from \mathcal{N}(x_k; \mu_{x_k,i}, \Sigma_{x_k,i}) and denote them \{x^{(j)}_{k,i}\}_{j=1}^{N} for N particles.
3. For i = 1, ..., G and j = 1, ..., N, sample x^{(j)}_{k+1,i} \sim p(x_{k+1,i} | x^{(j)}_{k,i}) from the motion model (1).
4. For i = 1, ..., G, update the weights W_{k+1,i} = \widetilde{W}_{k,i}.
5. For i = 1, ..., G, from \{x^{(j)}_{k+1,i}\}_{j=1}^{N} obtain the sample mean \bar{\mu}_{x_{k+1},i} and sample covariance \bar{\Sigma}_{x_{k+1},i} by averaging over the N particles.

The time-updated (prediction) density can now be approximated as

p(x_{k+1} | Z^k) = \sum_{i=1}^{G} W_{k+1,i} \, \mathcal{N}(x_{k+1}; \bar{\mu}_{x_{k+1},i}, \bar{\Sigma}_{x_{k+1},i}).   (19)

Measurement update
6. For i = 1, ..., G, obtain samples from the distribution \mathcal{N}(x_{k+1}; \bar{\mu}_{x_{k+1},i}, \bar{\Sigma}_{x_{k+1},i}) and denote them \{x^{(j)}_{k+1,i}\}_{j=1}^{N}.
7. For i = 1, ..., G and j = 1, ..., N, compute the weights \omega^{(j)}_{k+1,i} \propto L(z_{k+1} | x^{(j)}_{k+1,i}). The likelihood L is calculated from Eq. (2).
8. For i = 1, ..., G, estimate the mean and covariance by

\mu_{x_{k+1},i} = \frac{\sum_{j=1}^{N} \omega^{(j)}_{k+1,i} \, x^{(j)}_{k+1,i}}{\sum_{j=1}^{N} \omega^{(j)}_{k+1,i}},   (20)

\Sigma_{x_{k+1},i} = \frac{\sum_{j=1}^{N} \omega^{(j)}_{k+1,i} \, (x^{(j)}_{k+1,i} - \mu_{x_{k+1},i})(x^{(j)}_{k+1,i} - \mu_{x_{k+1},i})^{\top}}{\sum_{j=1}^{N} \omega^{(j)}_{k+1,i}}.   (21)

9. For i = 1, ..., G, update the weights

W_{k+1,i} = \widetilde{W}_{k,i} \, \frac{\sum_{j=1}^{N} \omega^{(j)}_{k+1,i}}{\sum_{i=1}^{G} \sum_{j=1}^{N} \omega^{(j)}_{k+1,i}}.   (22)

10. Normalise the weights:

\widetilde{W}_{k+1,i} = \frac{W_{k+1,i}}{\sum_{i=1}^{G} W_{k+1,i}}.   (23)

Output
11. Calculate the state estimate \hat{x}_{k+1} = E(x_{k+1} | Z^{k+1}) and covariance \hat{\Sigma}_{x_{k+1}} = E[(x_{k+1} - \hat{x}_{k+1})(x_{k+1} - \hat{x}_{k+1})^{\top}]:

\hat{x}_{k+1} = \sum_{i=1}^{G} \widetilde{W}_{k+1,i} \, \mu_{x_{k+1},i},   (24)

\hat{\Sigma}_{x_{k+1}} = \sum_{i=1}^{G} \widetilde{W}_{k+1,i} \big[\Sigma_{x_{k+1},i} + (\hat{x}_{k+1} - \mu_{x_{k+1},i})(\hat{x}_{k+1} - \mu_{x_{k+1},i})^{\top}\big].   (25)

Resampling
12. Resample the weights \widetilde{W}_{k+1,i} by the residual resampling algorithm. Other schemes such as the one described in Ref. 2 can also be applied.
13. For g = 1, ..., G, draw a number j \in \{1, ..., G\} with probabilities proportional to \widetilde{W}_{k+1,1}, ..., \widetilde{W}_{k+1,G}, set (\mu_{k+1,g}, \Sigma_{x_{k+1},g}) = (\mu_{k+1,j}, \Sigma_{x_{k+1},j}), and set \widetilde{W}_{k+1,g} = 1/G.
14. Set k = k + 1 and return to step 2.

During the resampling step of the GSPF, mixands with insignificant weights are discarded, whilst mixands with significant weights are duplicated. Notice that the resampling procedure applied to the GSPF with multiple cues differs slightly from that described in Ref. 19, because we are not representing the target noise as a Gaussian mixture. In the GSPF the mixing components are resampled, whilst in the PF the resampling step applies to particles. In the implemented GSPF, resampling is performed at each time step. Since resampling introduces some bias, the calculation of the estimates in both the PF and the GSPF is performed before the resampling step.
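A condensed sketch of one GSPF cycle (Table 2) under the definitions above; `transition_sample` and `likelihood` are placeholders for the motion model (1) and the fused likelihood (2). For brevity the propagated particles are reused directly in the measurement update rather than refitting and re-sampling the predicted Gaussian (steps 5-6), and the component resampling of steps 12-13 (e.g. via `residual_resample` from the earlier sketch) is omitted:

```python
import numpy as np

def gspf_step(means, covs, weights, transition_sample, likelihood, z, n):
    """One time/measurement update of a GSPF with G Gaussian components.

    means: (G, d) array, covs: (G, d, d) array, weights: (G,) array.
    """
    for i in range(len(weights)):
        # Time update: sample the component and propagate through the model.
        x = np.random.multivariate_normal(means[i], covs[i], n)
        x = transition_sample(x)

        # Measurement update: weight by the fused colour/texture likelihood.
        w = likelihood(z, x)
        w_sum = w.sum()

        # Refit the component (Eqs. 20-21) from the weighted particles.
        means[i] = (w[:, None] * x).sum(axis=0) / w_sum
        diff = x - means[i]
        covs[i] = (w[:, None, None] *
                   np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / w_sum

        # Component weight update (Eq. 22, up to normalisation).
        weights[i] *= w_sum

    weights /= weights.sum()  # Eq. (23)

    # Output (Eq. 24): mixture mean as the state estimate.
    estimate = (weights[:, None] * means).sum(axis=0)
    return means, covs, weights, estimate
```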

6. EXPERIMENTAL RESULTS

This section evaluates two different ideas presented in this paper: i) the performance of colour cues, texture cues and combined colour and texture cues; ii) the performance of the PF and the GSPF with multiple cues.

6.1. Colour and Texture Cues

To evaluate the performance of the colour, texture and combined colour and texture cues we first present, in Sect. 6.1.1, results from artificial sequences with known object motion, obtained from 100 independent Monte Carlo runs. Then in Sect. 6.1.2 we present results from natural sequences obtained using the particle filter.

6.1.1. Synthetic sequences

To generate the synthetic sequences we use textures taken from the Brodatz set,22 as seen in Fig. 2. The background of a particular sequence is taken as one of the textures, and the object to be tracked is a sub-region of a texture overlaid on the background, as illustrated in Fig. 3. Given the correct location x of the target object in each frame i = 1, ..., N_f, and with R denoting the number of realisations, the root mean squared error (RMSE) at a given frame is23

RMSE_x = \sqrt{\frac{1}{R} \sum_{r=1}^{R} (x - \hat{x}_r)^2},   (26)


Figure 2. Textures 1-4, taken from the Brodatz texture book,22 used to create foreground and background objects in the artificial sequences.

RMSE_{xy} = (RMSE_x + RMSE_y)/2.   (27)

We use the RMSE_{xy} defined in Eq. (27) to further define an error value for a sequence, E_{seq}:

E_{seq} = \frac{1}{N_f} \sum_{i=1}^{N_f} RMSE_{xy}(i).   (28)

This mean error can then characterise the tracking performance through an entire sequence. The mean error is presented for a set of synthetic sequences in Table 3.
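A small sketch of Eqs. (26)-(28), assuming `truth` holds the known (x, y) track and `estimates` the estimated tracks from the Monte Carlo runs:

```python
import numpy as np

def sequence_error(truth, estimates):
    """Eqs. (26)-(28): mean RMSE_xy over a sequence.

    truth:     (n_frames, 2) array of true (x, y) positions.
    estimates: (n_runs, n_frames, 2) array of estimated positions
               from R independent Monte Carlo runs.
    """
    # Eq. (26) per coordinate: RMSE over the R realisations at each frame.
    rmse = np.sqrt(np.mean((estimates - truth[None]) ** 2, axis=0))

    # Eq. (27): average of the x and y RMSEs at each frame.
    rmse_xy = rmse.mean(axis=1)

    # Eq. (28): mean over all frames gives the sequence error E_seq.
    return rmse_xy.mean()
```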

Figure 3. Synthetic sequence creation: a sub-region of an object texture is overlaid on a background texture to form each synthetic frame.

Figure 4. RMSE_xy for colour, texture and combined colour and texture cues (from 100 runs). The tracked object in (a) has different colour and texture to the background; (b) is from a sequence where the texture of the background and of the object is the same, and the colours of the object and background are very similar.


Figure 5. Tracking an object with camera-induced motion: results from the first (frame 1), intermediate (frame 35) and final (frame 73) frames. Figs. (a)-(c) are generated using colour cues only, (d)-(f) use texture cues only, and (g)-(i) use combined colour and texture cues.

Table 3. E_seq of the particle filter with texture feature cues for synthetic sequences. The textures used are shown in Fig. 2.

                    Background Texture
Object Texture      1       2       3       4
1                   0.081   0.053   0.088   0.049
2                   1.145   0.636   1.023   1.493
3                   0.149   0.465   0.199   0.154
4                   4.593   4.881   0.983   1.772

The results of the experiments with each cue are presented in Fig. 4, showing the RMSE_xy over 55 frames. Two different synthetic sequences were used: i) the target object has texture and colour distinct from the background (Fig. 4a), and ii) the target object has colour and texture similar to the background (Fig. 4b).

It can be seen that for the first sequence (Fig. 4a) colour cues track the object with a much lower degree of accuracy than the texture or combined cues. Texture provides good tracking results, but the most accurate results are observed when the colour and texture cues are combined.

A different outcome is noticeable in the second sequence (Fig. 4b). Here the weak model of the colour cues performs significantly better than the texture cues. The colour cue is able to track the object through most of the sequence, although it does lose it in the last couple of frames. The texture cue loses the object at around frame 15 because of the similarity between the background and the target object. It can be seen that


the combined cues provide a good track that is more accurate and robust than either cue individually.

Figure 6. RMSE_xy for the particle filter and the Gaussian sum particle filter on two artificial sequences: (a) different background and object textures and colours; (b) similar background and object textures and colours. The results were obtained using combined colour and texture cues.

6.1.2. Natural sequences

It is important to investigate the tracking performance over natural sequences as well. Results for a natural sequence are given in Fig. 5. Using only colour cues we do not track the object accurately, as demonstrated by Fig. 5b. Texture cues are more accurate but do occasionally drift away from the target, as in Fig. 5f. When colour and texture are combined we see consistently more accurate and robust results.

6.2. Comparison of Particle Filter and Gaussian Sum Particle Filter

Here we present results of experiments carried out to compare the performance of the PF and the GSPF algorithms with multiple cues. In order to assess both algorithms under the same conditions we chose N = GM, where N is the number of particles in the particle filter, G is the number of Gaussian mixing components in the GSPF, and M is the number of particles per component in the GSPF, so that the two filters have the same total number of particles. Both filters were given the same noise covariances. It can be seen in Fig. 6 that the performance of the two filters, in terms of RMSE_xy, is very similar.

7. CONCLUSIONS AND FUTURE WORK

This paper has presented a particle filter and a Gaussian sum particle filter for tracking. Both algorithms track objects in video by fusing multiple cues. Colour and texture features are the cues described here; other possible cues include motion, edges and illumination. The advantages of combined colour and texture cues over texture-only and colour-only tracking demonstrated here are: i) improved accuracy and ii) improved robustness. The benefits of fusing multiple independent cues are shown to apply to both filters. Current and future areas for research include the investigation of other cues, different fusion schemes, improved proposal distributions, automatic detection of and recovery from a lost track, tracking multiple objects, and online adaptation of the target model.

ACKNOWLEDGMENTS

This work has been conducted with the UK MOD Data and Information Fusion Defence Technology Centre under project DIF DTC 2.2.


REFERENCES

1. N. Gordon, D. Salmond, and A. Smith, "A novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F - Radar and Signal Processing 140(2), pp. 107-113, 1993.
2. A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for Bayesian filtering," Statistics and Computing 10(3), pp. 197-208, 2000.
3. M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. on Signal Processing 50(2), pp. 174-188, 2002.
4. M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," International Journal of Computer Vision 28(1), pp. 5-28, 1998.
5. K. Toyama and G. Hager, "Incremental focus of attention for robust vision-based tracking," International Journal of Computer Vision 35(1), pp. 45-63, 1999.
6. J. Triesch and C. von der Malsburg, "Democratic integration: Self-organized integration of adaptive cues," Neural Computation 13(9), pp. 2049-2074, 2001.
7. C. Shen, A. van den Hengel, and A. Dick, "Probabilistic multiple cue integration for particle filter based tracking," in Proceedings of the VIIth Digital Image Computing: Techniques and Applications, C. Sun, H. Talbot, S. Ourselin, and T. Adriaansen, Eds., 10-12 Dec. 2003.
8. P. Pérez, J. Vermaak, and A. Blake, "Data fusion for tracking with particles," Proceedings of the IEEE 92, pp. 495-513, March 2004.
9. M. Spengler and B. Schiele, "Towards robust multi-cue integration for visual tracking," Machine Vision and Applications 14(1), pp. 50-58, 2003.
10. E. Ozyildiz, N. Krahnstöver, and R. Sharma, "Adaptive texture and color segmentation for tracking moving objects," Pattern Recognition 35, pp. 2013-2029, October 2002.
11. P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in 7th European Conference on Computer Vision, Vol. 2350 of Lecture Notes in Computer Science, pp. 661-675, (Denmark), May 2002.
12. K. Nummiaro, E. Koller-Meier, and L. Van Gool, "An adaptive color-based particle filter," Image and Vision Computing 21(1), pp. 99-110, 2003.
13. Y. Bar-Shalom and X. Li, Estimation and Tracking: Principles, Techniques and Software, Artech House, 1993.
14. A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society 35, pp. 99-110, 1943.
15. D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), pp. 564-577, 2003.
16. R. Porter, Texture Classification and Segmentation, PhD Thesis, University of Bristol, Centre for Communications Research, 1997.
17. E. J. Stollnitz, T. D. DeRose, and D. H. Salesin, "Wavelets for computer graphics: A primer, part 1," IEEE Computer Graphics and Applications 15, pp. 76-84, May 1995.
18. E. J. Stollnitz, T. D. DeRose, and D. H. Salesin, "Wavelets for computer graphics: A primer, part 2," IEEE Computer Graphics and Applications 15, pp. 75-85, July 1995.
19. J. H. Kotecha and P. M. Djurić, "Gaussian sum particle filtering," IEEE Transactions on Signal Processing 51, pp. 2602-2612, Oct. 2003.
20. J. H. Kotecha and P. M. Djurić, "Gaussian sum particle filtering for dynamic state space models," in Proceedings of the ICASSP, (Salt Lake City, Utah, USA), 2001.
21. J. H. Kotecha and P. M. Djurić, "Gaussian particle filtering," IEEE Transactions on Signal Processing 51, pp. 2592-2601, Oct. 2003.
22. P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover Publications, 1966.
23. Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation, John Wiley & Sons, 2001.
