Multi-Target Tracking Using Hybrid Particle Filtering

Jens Rittscher, Nils Krahnstoever and Luis Galup
General Electric Global Research, One Research Circle, Niskayuna, NY 12309, USA
{jens.rittscher|nils.krahnstoever|luis.galup}@research.ge.com

Abstract

We address the problem of multi-target tracking based on sequential Monte Carlo filtering for a visual access control application. Sequential Monte Carlo methods are well suited to approximating posterior distributions in single-target tracking applications. However, tracking multiple targets is more difficult and critically depends on the ability to represent all statistically significant modes with a sufficient number of samples. Even when tracking a single target, controlling the effective sample size of the particle set only crudely estimates how well it approximates the posterior target distribution. In contrast, previous work demonstrates that a Kalman filter control loop, which monitors the performance of the particle filter, can dramatically improve the approximation of the posterior distribution in a dynamic fashion. This paper extends this principle to multi-target tracking by introducing a technique called mode stratification. In addition, a method to automatically create and delete modes using local relative entropy measures is introduced. Experiments applying the proposed technique to visual head tracking in an access control application illustrate the effectiveness of the method.

Proceedings of the Seventh IEEE Workshop on Applications of Computer Vision (WACV/MOTION’05) 0-7695-2271-8/05 $ 20.00 IEEE

1 Introduction

A number of applications require real-time detection and tracking of specific objects. In our particular setting it is necessary to acquire the faces of people passing an access point or portal. The GE IonTrack EntryScan, shown in figure 1, is a chemical trace detector which tests for substances like narcotics and explosives. In the near future this technology will be deployed to screen passengers in airports or other mass transit terminals. To be effective in practice, every trace must be associated with an image of the person’s face. It is important that this face image is of sufficiently high quality and suitable for automatic face recognition. Rather than taking a single snapshot of a person, we propose to track the heads of all passengers approaching the EntryScan portal. A direct benefit of this approach is that the sequence of images can be used for improving the performance of current face recognition systems and producing a high-resolution face image. In this paper a tracking module suited to this particular task is presented.

Figure 1: EntryScan Portal. This chemical trace detector is used to screen passengers at airports or other mass transit terminals. The system overview illustrates that a sequence of face images is acquired to generate a high-quality image of every passenger, which is then stored with the trace information.

A number of researchers have demonstrated that sequential Monte Carlo methods are a very attractive principle for achieving robust visual tracking. When applying these techniques in practice two specific issues need to be addressed: the dependency on prior model assumptions and the effects of sample impoverishment when tracking multiple targets with a finite number of particles. In practical applications, the main problem is to maintain multiple modes with a finite number of particles [9, 16]. Although a number of importance sampling methods [11] and specific sampling methods for maintaining multiple modes [16] have been suggested, none of these methods safeguards against the loss of the target. In other words, these methods cannot automatically adjust the number of particles in situations where loss of lock is likely to occur. The estimate of the effective sample size [11] is often used to monitor the quality of the estimated distribution but, as illustrated in figure 2, it can serve only as very rough guidance, especially when tracking multiple modes. A purely model-driven approach does of course depend on prior model choices. This dependency on prior model assumptions can be addressed by combining the purely model-driven approach with data-driven methods such as mean shift [6] or correlation-based tracking. This principle of hybrid sampling for tracking a single target has been explored in [15]. In order to address the challenge of mixing the deterministic and probabilistic search directly, we argue that it is best to use an explicit control loop as proposed in [15] for tracking a single particle.

Here we explore how this principle can be applied to the problem of maintaining multiple modes. Since the dynamics and the observation processes of the different modes are statistically independent, the resampling of each of these particle sets can be performed independently. This is achieved through a novel method termed mode stratification, which is introduced in section 3.1. Here a stratum is defined to be the set of particles which represents a particular mode of the posterior distribution. The resulting algorithm dynamically adjusts the number of particles needed to maintain each of the modes. In addition to particle set maintenance, it is necessary to address the problem of mode creation and mode deletion. Any method to maintain multiple modes would be incomplete without a systematic criterion for when to create or delete strata. For this purpose, and for making mode stratification computationally efficient, the concept of a control space is introduced. It can be viewed as a subspace of the configuration space. Within this control space, it is possible to estimate the background noise and to introduce a natural criterion, based on relative entropy, which indicates when to create and delete modes.

In section 3.3 we illustrate the performance of the newly introduced method using a tutorial example. The results of tracking faces past an entry point are presented in section 4.1.

2 Enhancing Sequential Sampling

The objective of this section is to identify a sampling method which effectively prevents sampling impoverishment in the case of sudden changes in the dynamic system. These changes can be caused by outliers in the observation and by noise. This is particularly relevant for the visual tracking problem, which is our application of interest. One additional requirement is that the filter needs to be efficient with respect to the number of particles necessary for tracking. One standard method of preventing sampling impoverishment is the resampling of the posterior. During a resampling step, particles are sampled independently from the weighted sample set $S_t$ and their weights are adjusted accordingly. As stated in [11], it is not obvious why resampling is useful in practice. Although it can prune away underrepresented samples and produce multiple copies of good samples, it does not address the problem of missing important peaks altogether. In previous work [15] we use a control loop which directly monitors the performance of the tracker. In the following we assume that the observation likelihood has the form

$p(z|x) = e^{\lambda C(x,z)}$, (1)

where the function $C(\cdot)$, referred to as the cost function, measures the quality of the hypothesized state $x$ given the set of observations or measurements $z$. In [15] this specific form of the observation likelihood $p(z|x)$ is used to aid the random search by applying selective gradient descent steps to minimize $C(\cdot, z)$ with respect to $x$. The value of the scaling parameter $\lambda$ is determined in a learning step and controls which values of the cost function $C(\cdot)$ are significant. Assuming that only a single target is present, the expected value of the cost function $C(\cdot)$ is estimated in addition to the state vector $x_t$ itself, i.e.

$c_t = \max_{x_t} C(x_t, z_t)$. (2)
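Equations (1) and (2) can be illustrated with a minimal sketch. It assumes that the cost scores of the current particles are already available as plain numbers, that $\lambda > 0$, and that the maximum in equation (2) is approximated by the best score in the particle set. The function names and the example values are hypothetical and do not come from the paper.

```python
import math

def likelihood_weights(costs, lam):
    """Normalized particle weights proportional to exp(lam * C), as in equation (1)."""
    # Subtracting the largest cost before exponentiating avoids overflow
    # (for lam > 0); the shift cancels out in the normalization.
    c_max = max(costs)
    unnormalized = [math.exp(lam * (c - c_max)) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def lock_score(costs):
    """Sample-based estimate of c_t in equation (2): the best cost score attained."""
    return max(costs)

# Hypothetical cost scores C(x, z) for three particles.
costs = [0.2, 0.9, 0.5]
weights = likelihood_weights(costs, lam=5.0)
c_t = lock_score(costs)
```

A larger value of `lam` concentrates the weight on the particles with the highest cost scores, which is one way to read the remark that $\lambda$ controls which values of the cost function are significant.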

In practice, this is achieved by tracking the lock score $c_t$ using a Kalman filter. The experiments presented in [15] show that very simple dynamical models effectively capture the change of the cost score due to the variation of the object’s visual appearance. The prediction of the lock score at time $t$, $c_t^-$, is used to generate a fixed number of new particles. These new particles are sampled from the sample set $S_{t-1}$. This is done whenever the cost score of a new sample $C(x_t, z_t)$ lies outside the acceptance interval of the Kalman filter prediction. A bookkeeping step ensures that underrepresented particles are pruned away. This idea is related to the method of partial rejection control [11], for which it has also been shown by Liu et al. [12] that this operation is proper in the sense of Monte Carlo estimation. Their method can be understood as resampling at periodic as well as dynamic checkpoints whenever the effective sample size of the sample set $S_t$, defined as

$\mathrm{ESS}_t = \dfrac{1}{\sum_i (\pi_t^i)^2}$, (3)

drops below a certain threshold. As shown in figure 2, monitoring the effective sample size alone is not sufficient to decide if an important peak has been lost. When tracking multiple targets using one observation model the posterior $p(x|z)$ can, as proposed by Vermaak et al. [16], be modeled as a non-parametric mixture model. The advantage of this approach, as opposed to using one state vector for multiple objects, is that it is no longer necessary to employ an exclusion principle or to model partial occlusions.

Firstly, our model parameters of the acceptance probability are determined using a separate control loop, on a mode-by-mode basis. Secondly, we adjust the size of the particle set, $N_t$, dynamically in relation to the predicted variance $\sigma_t^-$ associated with the Kalman filter update. This is motivated by the fact that this variance is a measure of uncertainty. Hence, more particles need to be generated when $\sigma_t^-$ is large. Lastly, we dynamically adjust the number of modes by looking at the ensemble statistics of all the modes together. A summary of our single-target sampling with partial rejection control is given in Fig. 3.

Figure 2: Effective Sample Size. The two graphs show synthetically generated particle sets, plotted as $p(x)$ against $x$. The bimodal distribution shown on the left (ESS = 64) was generated such that both peaks are populated with a fixed number of particles. The dominant peak was removed to generate the particle set shown on the right (ESS = 91). Note that the effective sample size increases, although the main peak is omitted.

Sampling with Partial Rejection Control
For $j = 0, \ldots, N_t$ with $N_t = \beta \sigma_t^-$:
  (a) Generate a sample $(x_t^j, \pi_t^j)$ with cost score $c_t^j = C(x_t^j, z_t)$ by sampling from $S_{t-1}$.
  (b) Accept sample $(x_t^j, \pi_t^j)$ with probability $p = N(c_t^-, \sigma_t^-)(c_t^j)$.
  (c) If rejected, repeat steps (a) and (b) until acceptance.
Normalize the weights $\pi_t^j$ such that $\sum_j \pi_t^j = 1$.
Compute the Kalman filter update to obtain $c_{t+1}^-$ and $\sigma_{t+1}^-$.

Figure 3: Sampling with Partial Rejection Control.

3 Multi-Target Tracking

Maintaining multiple modes for extended periods of time with only a finite number of samples is a major challenge [9, 16]. The most commonly observed problem is that one single peak quickly starts to dominate the entire particle set. The following section will explain how the method of partial rejection control can be applied to each of the visible modes to ensure their survival over time.

3.1 Mode Stratification

When observing a set of targets with a single target likelihood function, each target gives rise to a separate mode in the posterior distribution. For each of these modes, a stratum is defined as a set of particles representing this mode. More specifically, a stratum is represented by a particle set $S_t^k$ of size $N_t^k$, $k = 1, \ldots, K_t$, as follows

$S_t^k = \{ (x_{k,t}^i, \pi_{k,t}^i, w_{k,t}^i)_{i=1}^{N_t^k} \mid x_{k,t}^i \in X, \; \pi_{k,t}^i, w_{k,t}^i \in \mathbb{R}^+ \}$ (4)

where

$\sum_{k=1}^{K_t} \sum_{i=1}^{N_t^k} \pi_{k,t}^i = 1, \qquad \sum_{i=1}^{N_t^k} w_{k,t}^i = 1, \quad k = 1, \cdots, K_t$. (5)

The $\pi$’s are the ensemble weights of each particle, while the $w$’s are the stratum (local) weights of each particle and $K_t$ is the number of currently maintained strata at time $t$. The posterior distribution is represented by the union of these particle sets, $S_t = S_t^1 \cup \cdots \cup S_t^{K_t}$, and is approximated by

$p(x_t | z_t) \approx \sum_{k=1}^{K_t} \sum_{i=1}^{N_t^k} \pi_{k,t}^i \, \delta(x_t - x_{k,t}^i)$. (6)

Therefore the $w_{k,t}^i$ encapsulate the relative heights of the peaks represented by each stratum, while the $\pi_{k,t}^i$’s encapsulate the likelihood weights of the particles within each stratum.

Mode Stratified Particle Filtering
(1) Initialization: $t = 0$. Set $K_0 = 0$.
(2) For $t = 1, 2, \cdots$
  (a) Sampling Step. Perform a resampling step in each stratum $V_t^k$ as described in Section 2.
  (b) Redefine Strata. Redefinition of the strata as described in Section 3.2.
  (c) Manage Strata. Create or delete strata as explained in Section 3.2.
  (d) Normalization. Normalize the $\pi_{k,t}^i$ over all the strata.

Figure 4: Overview of the mode stratified particle filter.
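Applying partial rejection control to a single mode can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: states are scalars, the acceptance probability is a Gaussian in the cost score rescaled so that its peak equals 1 (a raw density is not a probability), new weights follow equation (1), and a `max_tries` guard stops the rejection loop. The names `prc_sample`, `move_fn` and `cost_fn` and all numeric values are hypothetical.

```python
import math
import random

def accept_prob(cost, pred_cost, pred_std):
    """Gaussian in the cost score, scaled so that its peak equals 1 (assumption)."""
    z = (cost - pred_cost) / pred_std
    return math.exp(-0.5 * z * z)

def prc_sample(prev_states, prev_weights, cost_fn, move_fn,
               pred_cost, pred_std, beta, lam, rng, max_tries=200):
    """One partial-rejection-control sampling step for a single mode."""
    n_new = max(1, round(beta * pred_std))       # N_t = beta * sigma_t^-
    states, weights = [], []
    for _ in range(n_new):
        for _ in range(max_tries):               # steps (a) to (c)
            parent = rng.choices(prev_states, weights=prev_weights, k=1)[0]
            x = move_fn(parent, rng)             # hypothetical dynamics model
            c = cost_fn(x)
            if rng.random() < accept_prob(c, pred_cost, pred_std):
                break
        # If max_tries is exhausted, the last candidate is kept (guard only).
        states.append(x)
        weights.append(math.exp(lam * c))        # likelihood weight, equation (1)
    total = sum(weights)
    return states, [w / total for w in weights]

# Toy usage: the cost peaks at x = 2 and the Kalman prediction expects a score near 0.9.
rng = random.Random(0)
cost = lambda x: math.exp(-(x - 2.0) ** 2)
move = lambda x, r: x + r.gauss(0.0, 0.3)
states, weights = prc_sample([1.5, 2.0, 2.5], [0.2, 0.6, 0.2], cost, move,
                             pred_cost=0.9, pred_std=0.2, beta=50.0, lam=5.0, rng=rng)
```

Because the number of new particles scales with the predicted standard deviation, a confident prediction yields a small particle set and an uncertain one yields a larger set. Running the function once per stratum, each with its own prediction, gives every mode its own control loop.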

3.2 Creation and Deletion of Modes

As discussed in section 2, both the number of particles created to support each mode, $N_t^k$, and the acceptance probability depend on the predicted lock score $c_t^-$. In order to ensure the survival of each of the different modes it is necessary to introduce a separate control loop for each mode. To facilitate the resampling, merging, splitting, creation and deletion of strata, a control space $Y$, a subspace of the configuration space $X$, is introduced that is defined by a projection $f : X \to Y$. The control space $Y$ is divided into disjoint (e.g., rectangular) cells $Y_i$ of a fixed volume, $Y = \bigcup_i Y_i$. Based on this control space partitioning, the $k$-th stratum at time $t$,

$V_t^k := \{ Y_i : \text{there exists a } j \text{ s.t. } f(x_{k,t}^j) \in Y_i \}$, (7)

is defined as the collection of cells that the particles contained in $S_t^k$ map into. In this paper it is assumed that the cell sets $V_t^k$ are disjoint, which means that a cell is committed to at most one stratum at a time. In addition, it is enforced that neighboring occupied cells belong to the same stratum. The size and the elements of each $V_t^k$ are adaptively determined in the sampling step (Figure 3). After each sampling step, the state of each of the particles, $x_{k,t}^i$, changes. The individual particle sets have to be redefined accordingly, as do their state variables and cell memberships. The latter can necessitate a splitting and merging of strata in order to fulfill the disjoint-cell and neighborhood requirements above. In addition, the control space $Y$ is used to manage the creation and deletion of strata, which are responsible for handling the appearance and disappearance of tracks over time. This will be described in section 4. An overview of the mode stratified sampling approach is given in Figure 4.

3.3 Example: Pool Game

The first experiment is a tutorial example which showcases the mechanics and the capabilities of our proposed algorithm. The sequence shows a clip from a pool game. It contains the following actions: the cue ball is struck by a cue, rebounds off a side cushion, and then collides with the main pack of other balls before coming to rest. The goal is to track the cue ball. We chose this example because it illustrates a number of challenges inherent in tracking using a particle filter. First of all, another “decoy” ball is similar in appearance (the white ball with the stripe) and can confuse a tracker, particularly when the two are close. Note that the cue ball actually strikes the decoy ball. To further confuse the tracker, we actually have it track both balls simultaneously. The second challenge to the tracker is that the cue ball undergoes a dramatic appearance change. The cue ball is stationary at first (appears white and round), whereupon it undergoes a strong momentum change. Due to motion blur it appears as a gray elongated blur in subsequent frames. Tracking this sequence successfully using a standard particle filter would require a very large number of particles, and sample impoverishment would need to be monitored very carefully.

The initialization of this experiment was performed as follows: A target model consisting of a circular shape with a previously learned color distribution for white colored balls was used. The observation likelihood is based upon the Bhattacharyya coefficient. An optical flow field is computed to provide low level information used for importance sampling. A standard implementation was used for these experiments. The flow field itself is computed on a subsampled image in order to reduce the computation time, but more efficient methods could be implemented. To further improve the efficiency of the sampling step, a gradient descent step as described in [15] is applied.

Our method of mode stratification allows us to easily address the aforementioned tracking challenges. Each of the strata will now be analyzed individually. Since the decoy ball is stationary, its cost scores (shown as a solid red line in figure 6) are consistently high. Applying the gradient descent steps adaptively prevents the samples from drifting away from the optimal position, and no random search is required to attain the predicted lock scores. Throughout the sequence no more than 3 particles are ever necessary to represent this mode. This behavior is very different from the cost scores of the stratum associated with the cue ball. The cost scores of this stratum are plotted as a solid green line in figure 6. Note that once the cue ball is struck by the cue, the attained lock scores are much lower than their predictions (shown as a dotted green line). This causes an increase of the search scale and the generation of a relatively large particle set. As soon as this situation is resolved the particle set shrinks down to fewer than 10 particles. The effectiveness of incorporating the optical flow is apparent once the cue ball bounces off the wall. The bounce effectively causes no or only very little uncertainty. This is a vast improvement over the results reported in [15] with simplified model assumptions and fewer particles.

Both graphs in figures 5 and 6 show that an additional third stratum was created during the collision of the cue ball with the other light colored ball. After only a short time this stratum dissolves since its particles get merged with the two existing strata. The simplicity of this tracking scenario highlights the advantages of the proposed method. In order to compensate for the uncertainty associated with the rapid acceleration of the cue ball, the method increases the number of particles necessary to support this mode. Although the likelihood scores attained in the different modes are significantly different, both modes survive even though only a few particles are necessary.

Figure 5: Pool sequence. To compare the heights of the different modes, the maximal weight $\log \pi^i$ attained in each of the strata is shown with respect to time $t$ (events a to d are marked over roughly the first 3 s). Frames of the tracking sequence are shown for all the salient events in the experiment. This illustrates how effectively the proposed sampling scheme deals with the rapid acceleration of the cue ball (a) and the collision (c). In each frame the top 20 particles are shown. Please see the text for a detailed discussion of the result.

Figure 6: Cost scores. This graph shows the highest cost score $c_t$ (see equation 2) attained in each stratum (plotted as a solid line). The color of each of the lines indicates the index of the stratum. The actual results correspond to the experiment presented in figure 5. Note that the cost scores drop very sharply once the white cue ball is hit by the cue. This illustrates that the scores of the cost function are severely affected by the observation noise, here caused by the motion blur. The role of the predicted lock score (plotted as a dotted line) is discussed in section 2.
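The pool experiment bases its observation likelihood on the Bhattacharyya coefficient. The paper does not give the histogram layout, so the sketch below is only an illustration of the coefficient itself: it assumes colour distributions stored as lists of non-negative bin counts, and the three-bin histograms are invented values.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Similarity in [0, 1] between two histograms with the same number of bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    # Normalize each histogram to sum to 1, then sum sqrt(p_i * q_i) over the bins.
    return sum(math.sqrt((a / sum_p) * (b / sum_q)) for a, b in zip(p, q))

# Hypothetical 3-bin colour histograms (bin counts).
model = [8, 1, 1]      # learned target model
sharp = [7, 2, 1]      # candidate region that resembles the model
blurred = [3, 4, 3]    # candidate region with washed-out colours

score_sharp = bhattacharyya_coefficient(model, sharp)
score_blurred = bhattacharyya_coefficient(model, blurred)
```

A candidate whose colours are smeared across bins scores lower against the model than a sharp one, which matches the qualitative drop in cost scores under motion blur described above.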

4 Automatic Mode Detection

Unlike occlusion and proximity between targets, which are handled during the redefinition of the strata assignments as described above, the appearance and disappearance of targets have to be managed in a separate process. By identifying cells of the control space as belonging either to the background or to the foreground, it is possible to treat this problem systematically. First, each cell of the control grid is associated with a likelihood value. It is computed either from the strata samples contained in this cell or, if no particle resides in the cell, by randomly sampling from the part of the configuration space that maps to this cell. The configuration space is mapped to the control space as $Y = f(X)$. Hence, each control space cell $Y_i$ defines a volume in configuration space as $X_i := \{x \in X : f(x) \in Y_i\}$. The probability of observing a target $x \in f^{-1}(Y_i)$ is proportional to $\int_{X_i} p(z_t|x)\,p(x)\,dx$. In the present work we assume that this distribution (over $i$) is approximately uniform if no target is present, or corresponds to a known background distribution. In contrast, in the presence of a single target, this distribution would contain a single fairly isolated mode. Between these two cases it is easy to decide whether an existing stratum should be abandoned (former case) or a stratum should be created (latter case). Hence the goal is to reduce the general N-target case to these two circumstances. We introduce distributions on the control space as follows,

$p_{i,t}^k = c_t^k \int_{X_i} p(z_t^k | x)\, p(x)\, dx$

where the notation $z_t^k$ is used to express the observations $z_t$ with the target corresponding to the $k$-th stratum removed from the observations. This removal is done in image space directly by masking out the visual features that arise from a target at that particular stratum location. The special case $z_t^0$ is defined as the observations with all foreground strata removed. For control space regions that contain strata, the integral can be calculated directly from the strata particle representations. For the remaining space, the above integral is obtained through Monte Carlo sampling in combination with a suitable importance sampler.

The resulting control space distributions directly reflect the modal structure of the current configuration space and can be used to manage the death and birth of strata. If all visible targets are accounted for by existing strata and were to be removed from the observation set, the control space distribution of the remaining observation (i.e., a supposedly empty background image) should contain no further information, i.e., have a high entropy. On the other hand, if visible targets remain (visible targets that are currently not tracked), there is a higher information content and a resulting low entropy. Hence, the creation of strata can be managed by computing the entropy of the control space distribution $p_t^0 = \{p_{i,t}^0\}$, which is hypothesized to contain no targets. The deletion is managed by computing the entropy of $p_t^k = \{p_{i,t}^k\}$, which should contain a single target (if the target is still present). Both entropies are calculated relative to a learned background reference distribution $q_t = \{q_{i,t}\}$ containing no targets. More specifically,

$KL(p_{i,t}^k, q_{i,t}) := \sum_i p_{i,t}^k \log \dfrac{p_{i,t}^k}{q_{i,t}}$.

In 8(b) both strata corresponding to the two targets were removed from the distribution and one can see that the residual background is almost uniformly distributed, which is also reflected in a low relative entropy. The small bumps that can be seen are caused by the targets in the background that have not yet been discovered by the tracker. In 8(c) and (d) only the right and left targets were removed from the distribution respectively, leading to a high entropy relative to the background distribution. The entropy thresholds were set to $\tau_d = 1.9$ and $\tau_b = 4.1$. One can see that the criterion is far from triggering the deletion of strata for cases (c) and (d), since $KL(p_t^k, q_t) > \tau_d$, and that it has not yet reached a level high enough for creating a new stratum for case (b), since $KL(p_t^0, q_t) < \tau_b$.
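The relative entropy test for creating and deleting strata can be sketched as follows. The sketch assumes natural logarithms, a uniform reference distribution over 100 hypothetical control space cells, and inclusive threshold comparisons; none of these choices is stated in the paper. Only the threshold values 1.9 and 4.1 are taken from the text.

```python
import math

def relative_entropy(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i); cells with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def should_delete(p_k, q, tau_d):
    """Delete a stratum when its distribution is close to the background reference."""
    return relative_entropy(p_k, q) <= tau_d

def should_create(p_0, q, tau_b):
    """Create a stratum when the residual distribution is far from the reference."""
    return relative_entropy(p_0, q) >= tau_b

n_cells = 100                                # hypothetical number of control space cells
q = [1.0 / n_cells] * n_cells                # uniform background reference
flat = list(q)                               # residual that looks like pure background
peaked = [1.0] + [0.0] * (n_cells - 1)       # all mass in one cell: an isolated mode

kl_flat = relative_entropy(flat, q)
kl_peaked = relative_entropy(peaked, q)
```

With a uniform reference over n cells the relative entropy is bounded by log(n), so under the natural-log assumption a creation threshold of 4.1 can only be reached when the grid has more than about 60 cells.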

The reference distribution {qi,t } can be assumed to be uniform if no known background distribution is available or be a filtered version of p0t , qi,t = (1 − t )qi,t−1 + t p0i,t , with t =