
Multiple-Target Tracking by Spatiotemporal Monte Carlo Markov Chain Data Association

Qian Yu, Member, IEEE, and Gérard Medioni, Fellow, IEEE

Abstract—We propose a framework for tracking multiple targets, where the input is a set of candidate regions in each frame, as obtained from a state-of-the-art background segmentation module, and the goal is to recover trajectories of targets over time. Due to occlusions by targets and static objects, as well as noisy segmentation and false alarms, one foreground region may not correspond to one target faithfully. Therefore, the one-to-one assumption used in most data association algorithms is not always satisfied. Our method overcomes the one-to-one assumption by formulating the visual tracking problem as finding the best spatial and temporal association of observations, which maximizes the consistency of both motion and appearance of trajectories. To avoid enumerating all possible solutions, we take a Data-Driven Markov Chain Monte Carlo (DD-MCMC) approach to sample the solution space efficiently. The sampling is driven by an informed proposal scheme controlled by a joint probability model combining motion and appearance. Comparative experiments with quantitative evaluations are provided.

Index Terms—Multiple-target tracking, data association, MCMC, visual surveillance.

1 INTRODUCTION

The ability to simultaneously track multiple targets is a critical component of any video surveillance system, as it provides the description of spatiotemporal relationships among moving objects in the scene. Unlike single-target tracking, where the main focus is modeling the appearance of the target or estimating its kinematic state, the core issue in tracking multiple targets is to recover the data association between multiple targets and multiple observations. Environments of interest usually contain an unknown number of targets (of various types), and multiple observations of targets are reported. Various types of observations are adopted as input for tracking under different situations; foreground regions obtained with motion segmentation techniques are used as the input of our tracking algorithm. Many data association algorithms have been proposed in recent decades [10], [13], [14], [15], [16], [17]. Most existing algorithms impose a one-to-one mapping between targets and observations: at a given time instant, one observation can be associated with at most one target and, vice versa, one target corresponds to at most one observation. This assumption is reasonable when the considered observations are point based. However, in the visual tracking problem, the observations correspond to blobs, or regions, which cannot be faithfully modeled by a single point. Moreover, erroneous detections due to


occlusion and spurious segmentation provide a set of observations where a single moving object is often detected as multiple moving regions, or multiple moving objects are merged into a single blob. Therefore, the one-to-one association is often violated in real environments. Here, we propose a general framework which makes use of spatiotemporal consistency in both motion and appearance and does not require a one-to-one mapping between observations and targets. Although our framework can accommodate additional information, such as generic model information (which is discussed in future work), in this paper we use consistency of motion and appearance as the only constraints. Instead of inferring the association and the targets' states from current observations only, our method uses a batch of observations. A track is regarded as a path, in space-time, traveled by a target. We aim to recover the tracks of an unknown number of targets using the consistency in motion and appearance of tracks. Due to the high computational complexity of such an association scheme, a Data-Driven Markov Chain Monte Carlo (DD-MCMC) [18] method is proposed to sample the solution space. Both spatial and temporal association samples are incorporated into the Markov chain transitions. One key contribution of the paper is the explicit use of spatiotemporal smoothness in motion and appearance to overcome the one-to-one assumption used in most data association algorithms, by means of a spatiotemporal MCMC. The paper is organized as follows: related work is discussed in Section 2. We formulate the multiple-target tracking problem and present our spatiotemporal MCMC data association algorithm in Sections 3 and 4, respectively. We discuss how to determine the parameters used in our probabilistic model by Linear Programming and provide comparative results on both simulated and real data sets in Section 5, followed by conclusions and discussion in Section 6.



2 RELATED WORK

Object tracking is a fundamental issue for video analysis and surveillance systems. There are many ways to categorize tracking problems. In terms of the number of objects of interest, tracking can be divided into two types. One is single-object tracking, which focuses on estimating the state (position, dimension, velocity, etc.) of the object according to appearance or motion cues (readers can refer to the survey [6] by Yilmaz et al. for an extensive overview). The other type, which our method addresses, is multiple-target tracking. Since multiple targets and multiple observations of targets exist simultaneously in each frame, data association becomes a first-line problem in multiple-target tracking. The purpose of data association is to recover the correct correspondence between observations and targets. Indeed, data association and state estimation are two interrelated problems: once data association is established, filtering techniques can be applied to estimate the states of the targets, and the way to evaluate a possible data association is to check whether the estimated states of the targets form consistent trajectories in terms of both motion and appearance. Existing multiple-target tracking methods can be categorized into two types: single scan and multiple scan (or n-scan) [19]. For single-scan algorithms, the data association decision is made sequentially, in a deterministic (e.g., nearest neighbor) or probabilistic (e.g., probabilistic data association) way, at each time step. In contrast, n-scan methods defer the data association decision until n frames of observations are collected and thus are also called deferred logic methods. Although single-scan methods are computationally more efficient than n-scan methods, their solutions are clearly suboptimal compared to multiple-scan methods. Batch processing of the observations from many frames together requires intensive computational power for a problem of nontrivial size. A combinatorial optimization formulation is often adopted for the multiple-target tracking problem: the task of labeling each observation with either a track ID or a false alarm is related to a set packing problem, which is NP-hard [11]. No matter how the formulation is transformed, e.g., into a 0-1 integer programming problem [11] or a multidimensional assignment problem [19], in practice the whole solution space has to be reduced to a feasible one by the use of heuristics. Among the large body of work in multiple-target tracking, the multiple hypothesis tracker (MHT) [3] and the joint probabilistic data association filter (JPDAF) [20] are two classical methods. MHT is a statistical framework that evaluates the likelihood of each hypothesis, where a hypothesis represents a set of assignments between observations and targets. To find the best hypothesis over time, in practice, the k-best hypotheses are maintained at each time step, which can be solved in polynomial time [13]. The essential difference between JPDAF and MHT is that, instead of finding the best hypothesis, JPDAF computes the expectation of the state of the targets over all hypotheses (joint association events). Also, any practical implementation of MHT and JPDAF requires pruning the set of all hypotheses to a smaller set and thus leads to a suboptimal solution. Both of these data association methods


assume the one-to-one mapping between observations and targets. Instead of explicitly reducing the size of the hypothesis set, sampling-based techniques have recently been proposed to solve the combinatorial optimization problem. Many of them adopt sequential inference to avoid an exponential explosion of the size of the solution space. An MCMC-based variant of the auxiliary variable particle filter is proposed to approximately infer the positions of the targets [12]. It is worth noting that the data association in that work considers split and merged measurements, and the MCMC sampling simulates the probability of a data association; however, it assumes a known number of targets, and data association is determined in a sequential way. In [21], transdimensional MCMC is used to sample the probability of the data association with a varying number of targets. This sequential tracking method uses a pairwise Markov random field (MRF) prior to penalize overlap between targets at the same time instant. In [9], an articulated human foreground model is adopted, and multiple people are detected and tracked in crowded scenes using an MCMC-based method to estimate the state and the number of targets sequentially. In [7], an ad hoc Markov network is used to model the interaction between multiple targets at each time instant, and a mean field Monte Carlo algorithm is applied to approximately estimate the posterior density of each target. In [22], a multiview approach uses a particle-filter-based method to segment and track people against clutter. In order to reduce the ambiguity in data association, many of these sequential methods employ model information to identify a specific type of target against the background, such as [7], [9], [12], [17]. Among the many sampling-based methods, Oh et al. originally proposed using MCMC to directly sample the data association in an n-scan setting [15]. This method is a general framework, capable of initiating and terminating a varying number of tracks and able to incorporate domain knowledge. It is appropriate for point-based observations but cannot be applied to region-based observations. Compared with this method, our approach overcomes the one-to-one assumption by introducing the spatiotemporal data association. Also, we encode both motion and appearance information in the posterior distribution, which allows the method to deal with region-based observations in vision applications. Moreover, since the success of the MAP formulation relies on the definition of a posterior distribution, we avoid determining the posterior empirically and instead introduce a practical method to estimate the parameters in the posterior offline.

3 MULTIPLE-TARGET TRACKING

3.1 Anatomy of the Problem

Suppose there are K unknown targets in the scene within the time interval [1, T]. The input for the tracking algorithm is a set of regions after foreground segmentation. Let y_t denote the set of foreground regions at time t and Y = ∪_{t=1}^{T} y_t be the set of all available foreground regions within [1, T]. In the simplest case, a single target is perfectly



Fig. 2. Two ways to represent foreground regions. (a) Pixel-level labeling of foreground regions and (b) rectangle cover of foreground regions.

Fig. 1. Segmentation of foreground regions in space-time by the use of motion and appearance smoothness. (a) Foreground region in one frame and (b) motion and appearance of two targets.

segmented from the background, and tracking is straightforward. When there are multiple targets in the scene and they never overlap, nor get fragmented, the one-to-one mapping, which is assumed by many tracking algorithms, holds: any track τ_k contains at most one observation at one time instant, i.e., |τ_k ∩ y_t| ≤ 1, ∀ k ∈ [1, K], and no observation belongs to more than one track, i.e., τ_i ∩ τ_j = ∅, i ≠ j, ∀ i, j ∈ [1, K]. If the one-to-one mapping holds, tracking can be done by associating the foreground regions directly. However, in the most general case, which is common in real scenarios, one foreground region may correspond to multiple targets (one example is shown in Fig. 1a), and one target may correspond to multiple foreground regions. In this case, without using any model information, it is difficult to segment the foreground regions in a single frame. However, if we consider this task in space-time, the smoothness in motion and appearance of targets can be used to solve this problem. One example is shown in Fig. 1b, where the segmentation of the foreground regions becomes much easier than in Fig. 1a: if we look at several observations over time, smoothness in motion and appearance of targets helps to disambiguate the targets. There are many ways to represent foreground regions corresponding to diverse targets. The most detailed representation is to assign to each foreground pixel a label (or a set of labels). The label (or labels) indicates the target (or targets) that the pixel belongs to. We allow the case where one pixel is assigned multiple labels to represent the occlusion situation, as shown in Fig. 2a. Note that areas with a common label may not necessarily be connected. This is different from a partition segmentation problem, where regions must be disjoint, i.e., each pixel belongs to one region exclusively. Although such a representation is very accurate, labeling each pixel is expensive to implement. We adopt a more efficient alternative representation, and use rectangles to approximately represent the shapes of targets; the bounding rectangles form a cover of the foreground regions, as shown in Fig. 2b. The overlap between two rectangles indicates an occluded area. Given

pixel labels, we can precisely derive a rectangle cover representation, and conversely pixel labels can be approximately obtained from the rectangle cover representation. The approximation is useful since it provides an efficient explanation of foreground regions with occlusion, and significantly reduces the complexity of the problem. In such a scheme, the center and the size of a rectangle are used as the abstract representation of the motion state, and the foreground area covered by a rectangle contains the appearance of one target. Covering rectangles with labels (track IDs) over time form a cover of the foreground regions in a sequence of frames, and a track is a set of covering rectangles with the same label. Formally, a cover ω with m covering rectangles of Y is defined as follows:

ω = {CR_i = (r_i, t_i, l_i)}, r_i ∈ Ω_r, t_i ∈ [1, T], l_i ∈ [1, K],   (1)

subject to

∀ i, j ∈ [1, m], i ≠ j: (t_i, l_i) ≠ (t_j, l_j),   (2)

where CR_i is one covering rectangle, r_i and t_i represent the state (center position and size) and the timestamp of the rectangle, l_i indicates the label assigned to the rectangle r_i, and K is the upper bound on the number of targets. Ω_r is the set of all possible rectangles. Although the candidate space of possible rectangles is very large, i.e., |Ω_r| is a large number, it is still finite if we discretize the state of a rectangle in the 2D image space. The constraint in (2) means that no two covering rectangles can share both the same timestamp and the same track label. In other words, one track can have at most one covering rectangle at one time instant. Thus, the number of rectangles that one cover can contain is bounded, m ≤ M = KT. Forming one cover can be regarded as first selecting m rectangles from the space Ω_r and then filling them into KT sites, where one site corresponds to one unique pair of time mark and track label, i.e., <t_i, l_i>; no two rectangles can fill the same site. Let τ_k(t) denote the covering rectangle in track k at time t. If we consider τ_k(t) a virtual measurement, the data association between virtual measurements still complies with the one-to-one mapping, namely, there is at most one virtual measurement for one track at one time instant. The virtual measurement derives from the foreground regions: a virtual measurement can correspond to (i.e., cover) more than one foreground region or a part of a foreground region. The relationship between virtual measurements and the real observations from foreground regions constitutes the spatial data association between foreground regions. A similar concept of virtual measurement is also introduced in [23] for establishing correspondence in the Structure from Motion (SfM) problem. By


introducing the concept of virtual measurement, we differentiate a spatial data association from a temporal data association. The optimal joint spatiotemporal data association leads to the final solution of such a multiple-target tracking problem. Let C(M, m) denote the space of all possible combinations of m locations from M sites; the whole solution space (ω ∈ Ω) can be represented as

Ω = ∪_{m=1}^{M} Ω_m = ∪_{m=1}^{M} [ C(M, m) × (Ω_r × ... × Ω_r) ],   (3)

where the product Ω_r × ... × Ω_r is taken m times. The structure of the solution space is typical for vision problems. As in [18], the solution of the segmentation problem is formulated in a similar way, where the entire solution space is a union of m-partition spaces (m is the number of regions). In the case of a single target with perfect foreground segmentation, the set of minimum bounding rectangles (MBRs) for each foreground region at different times forms the best cover of the target. However, when interocclusion between multiple targets and noisy foreground segmentation exist, it is not trivial to find the optimal cover. Fig. 3 shows a case with observations in five frames and illustrates the cases of split observations (in frame 2) and merged observations (in frame 3). Let τ_k denote one track in ω. A cover with K tracks can also be written as follows:

ω = {τ_1, ..., τ_K}.   (4)

Fig. 3 shows one possible cover ω = (τ_1, τ_2) with two tracks in different colors. From the implementation perspective, one cover contains a set of tracks, and each track consists of a sequence of covering rectangles, which are represented as the dashed rectangles in Fig. 3. For example, τ_1 and τ_2 each contain five rectangles, one at each time instant. As defined in (1), besides the location and the size, each covering rectangle has two properties, the track ID and the time label. Temporal data association is implemented by changing the track IDs, for example, splitting τ_1 into two tracks. Spatial data association involves changing the location and the size of one covering rectangle, for example, a diffusion of one track at a given time. Intuitively, exploring the solution space from one cover to another is implemented by changing properties of the covering rectangles. The underlying constraint for tracking is that a good explanation of the foreground regions exhibits good consistency in motion and appearance over time. Formally, in a Bayesian formulation, the tracking problem is to find a cover that maximizes the posterior (MAP) of a cover of foreground regions, given the set of observations Y:

ω* = arg max_ω p(ω | Y).   (5)

In the MAP problem defined in (5), the cover ω is denoted by a set of hidden variables. We make inference about ω from Y over the solution space ω ∈ Ω:

ω* ~ p(ω | Y) ∝ p(Y | ω) p(ω), ω ∈ Ω.   (6)

The likelihood p(Y | ω) represents how well the cover ω explains the foreground regions Y in terms of the spatiotemporal smoothness in both motion and appearance. The prior model regulates the cover to avoid overfitting the smoothness. In the following sections, we discuss the prior and likelihood models used in our method.

Fig. 3. One possible cover of the observations, which includes two tracks (τ_1, τ_2). The dashed rectangles represent the covering rectangles of foreground regions. The uncovered regions correspond to false alarms.

3.2 Prior Probability

To find a cover with reasonable properties, we first define a prior model which considers the following criterion: we prefer a small number of long tracks with little overlap with other tracks. Accordingly, we adopt the prior probability of a cover ω as the product of the following terms:

p(ω) = p(N) p(L) p(O).   (7)

1. Number of tracks: Let K denote the number of tracks. We adopt an exponential model p(N) to penalize the number of tracks:

   p(N) = (1/z_0) exp(−λ_0 K).   (8)

2. Length of each track: We adopt an exponential model p(L) of the length of each track. Let |τ_k| denote the length, i.e., the number of elements (rectangles) in τ_k:

   p(L) = ∏_{k=1}^{K} (1/z_1) exp(λ_1 |τ_k|).   (9)

3. Spatial overlap between different tracks: We adopt an exponential model in (10) to penalize overlap between different tracks, where ρ(t) denotes the average overlap ratio of different tracks at time t:

   p(O) = ∏_{t=1}^{T} (1/z_2) exp(−λ_2 ρ(t)),
   ρ(t) = [ Σ_{τ_i(t) ∩ τ_j(t) ≠ ∅} |τ_i(t) ∩ τ_j(t)| / |τ_i(t) ∪ τ_j(t)| ] / |{(i, j) : τ_i(t) ∩ τ_j(t) ≠ ∅}|.   (10)

In the solution space of (3), the prior model is applied to prevent the adoption of a more complex model than necessary. For example, a short track usually has better smoothness than a long track. Merely considering the smoothness defined by the likelihood will segment a long track into short tracks. In an extreme condition, each track contains a single observation and has the best smoothness.


The prior penalizes such an extreme condition through all three terms: the number of tracks, the length of each track, and the overlap among different tracks. Consider another extreme condition: a cover ω_1 that contains two perfect tracks, τ_1 and τ_2, which completely overlap with each other, and another cover ω_2 with the single track τ_1. Without the prior, the decision cannot be made, since the number of targets is unknown and ω_1 and ω_2 have the same smoothness. The parameters in the prior model are hard to determine empirically; we will show how to determine them appropriately for specific data sets.
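To make the prior concrete, the following minimal sketch (in Python, not part of the original implementation) evaluates the unnormalized log-prior of (7)-(10) for a cover stored as a list of tracks, each a mapping from frame index to a covering rectangle (x, y, w, h); the weights λ_0, λ_1, λ_2 below are illustrative values, not trained parameters.

import numpy as np

def iou(a, b):
    """Intersection-over-union of two rectangles given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def avg_overlap(tracks, t):
    """rho(t): average IoU over pairs of tracks whose rectangles overlap at frame t."""
    rects = [trk[t] for trk in tracks if t in trk]
    pairs = [iou(a, b) for i, a in enumerate(rects) for b in rects[i + 1:]]
    pairs = [p for p in pairs if p > 0]
    return float(np.mean(pairs)) if pairs else 0.0

def log_prior(tracks, T, lam0=1.0, lam1=0.5, lam2=2.0):
    """Unnormalized log p(omega) = log p(N) + log p(L) + log p(O), cf. (7)-(10)."""
    log_pN = -lam0 * len(tracks)                          # penalize the number of tracks
    log_pL = lam1 * sum(len(trk) for trk in tracks)       # favor long tracks
    log_pO = -lam2 * sum(avg_overlap(tracks, t) for t in range(1, T + 1))
    return log_pN + log_pL + log_pO

# A toy cover with two three-frame tracks (frame index -> rectangle).
tracks = [{1: (10, 10, 20, 40), 2: (14, 10, 20, 40), 3: (18, 10, 20, 40)},
          {1: (60, 12, 18, 36), 2: (57, 12, 18, 36), 3: (54, 12, 18, 36)}]
print(log_prior(tracks, T=3))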

3.3 Joint Likelihood p(Y | ω)

We assume that the characteristics of motion and appearance of targets are independent; therefore, the likelihood can be written as

p(Y | ω) = f_F(ω) ∏_{k=1}^{K} f(τ_k),   (11)

where f_F(ω) represents the likelihood of the foreground area left uncovered by ω and f(τ_k) is the likelihood for each track. The area not covered by any rectangle indicates the false alarms in the observations. We prefer to cover foreground regions as much as possible unless the spatiotemporal smoothness prevents us from doing so. We adopt an exponential model of uncovered areas as

f_F(ω) = (1/z_3) exp(−λ_3 F),   (12)

where F is the foreground area (in pixels) which is not covered by any track. The appearance of foreground regions covered by each track τ_k is supposed to be coherent and the motion of such a rectangle sequence should be smooth. Hence, we consider a probabilistic framework for incorporating two independent likelihoods, the motion likelihood f_M and the appearance likelihood f_A; then

f(τ_k) = f_M(τ_k) f_A(τ_k).   (13)

We represent the elements in track τ_k as (τ_k(t_1), τ_k(t_2), ..., τ_k(t_{|τ_k|})), where t_i ∈ [1, T] and (t_{i+1} − t_i) ≥ 1. Each τ_k(t_i) can be regarded as the observation of track k at time t_i. Since missing detection may happen, it is possible that no observation is assigned to track k at some time instants.

3.3.1 Motion Likelihood

For each target, we consider a linear kinematic model:

x_{t+1}^k = A x_t^k + w,
y_t^k = H x_t^k + v,   (14)

where x_t^k is the hidden kinematic state vector, which includes the position (u, v), the size (w, h), and the first-order derivatives (u̇, v̇, ẇ, ḣ) in 2D image coordinates. The observation y_t^k in (14) corresponds to the position and size of τ_k(t) in 2D image coordinates. w ~ N(0, Q) and v ~ N(0, R) are the Gaussian process noise and observation noise. To determine the motion likelihood L_M for each track, according to (14), an observation τ_k(t_i) has a Gaussian probability density function N(·; ·, ·) given the predicted kinematic state x̄_k(t_i):

L_M[τ_k(t_i)] ≐ L_M[τ_k(t_i) | x̄_k(t_i)] = N(τ_k(t_i); H x̄_k(t_i), S_k(t_i)),   (15)

where S_k(t_i) = H Σ̄_k(t_i) H^T + R and Σ̄_k(t_i) is the prior estimate of the covariance matrix at time t_i. The motion likelihood for track τ_k can be represented as

f_M(τ_k) = ∏_{i=3}^{|τ_k|} L_M[τ_k(t_i)].   (16)

Since we consider derivatives in the kinematic state, we need two observations to initialize one track; thus, the motion likelihood can be computed from the third observation on. The motion likelihood in (15) can be obtained as follows:

N[τ_k(t_i)] = |2π S_k(t_i)|^{-1/2} exp{ −(1/2) e_k(t_i)^T S_k(t_i)^{-1} e_k(t_i) },
e_k(t_i) = τ_k(t_i) − H x̄_k(t_i | t_{i−1}).   (17)

The details of updating the prior and posterior estimates in Kalman filters can be found in [4]. Note that, if a missing detection happens in τ_k at time t, i.e., there is no observation at time t for track k, the prior estimate is assigned to the posterior estimate.
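As an illustration of (15)-(17), the sketch below evaluates the Gaussian innovation likelihood of one observation given the Kalman prediction; the state layout (position and size plus their first derivatives), the constant-velocity matrices A and H, and the noise levels are assumptions made only for this example, not settings reported in the paper.

import numpy as np

def motion_log_likelihood(z, x_pred, P_pred, H, R):
    """log N(z; H x_pred, S) with S = H P_pred H^T + R, cf. (15) and (17)."""
    S = H @ P_pred @ H.T + R
    e = z - H @ x_pred                                   # innovation e_k(t_i)
    _, logdet = np.linalg.slogdet(2 * np.pi * S)
    return -0.5 * (logdet + e @ np.linalg.solve(S, e))

dim = 4                                                  # observed: position (u, v), size (w, h)
A = np.eye(2 * dim); A[:dim, dim:] = np.eye(dim)         # constant-velocity transition
H = np.hstack([np.eye(dim), np.zeros((dim, dim))])       # observe position and size only
Q = 0.1 * np.eye(2 * dim)                                # process noise (illustrative)
R = 1.0 * np.eye(dim)                                    # observation noise (illustrative)

x = np.array([10., 20., 30., 60., 1., 0.5, 0., 0.])      # posterior state at t-1
P = np.eye(2 * dim)
x_pred, P_pred = A @ x, A @ P @ A.T + Q                  # Kalman prediction step
z = np.array([11.2, 20.4, 30.1, 59.8])                   # covering rectangle observed at t
print(motion_log_likelihood(z, x_pred, P_pred, H, R))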

3.3.2 Appearance Likelihood

In order to model the appearance of each detected region, we adopt the nonparametric histogram-based descriptor [10] to represent the appearance of the foreground area covered by ω. The appearance likelihood of one track is modeled as a chain-like MRF. The likelihood between two neighbors is defined as follows:

L_A(τ_k(t_i), τ_k(t_{i−1})) ≐ L_A[τ_k(t_i)] = (1/z_4) exp(−λ_4 D(τ_k(t_i), τ_k(t_{i−1}))),   (18)

where D(·) represents the symmetric Kullback-Leibler (KL) distance between the histogram-based descriptors of the foreground covered by τ_k(t_i) and τ_k(t_{i−1}). The entire appearance likelihood of τ_k can be factorized as

f_A(τ_k) = ∏_{i=2}^{|τ_k|} L_A[τ_k(t_i)].   (19)
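A minimal sketch of the pairwise appearance term in (18): color histograms of the patches covered by two consecutive rectangles are compared with a symmetric KL distance. The 8-bins-per-channel quantization and the value of λ_4 are illustrative assumptions, not the settings of the paper.

import numpy as np

def color_histogram(patch, bins=8):
    """Normalized joint RGB histogram of an image patch (H x W x 3, uint8)."""
    idx = (patch.astype(int) // (256 // bins)).reshape(-1, 3)
    flat = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    hist = np.bincount(flat, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric Kullback-Leibler distance between two normalized histograms."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def appearance_log_likelihood(patch_prev, patch_cur, lam4=1.0):
    """log L_A up to the normalizer z_4, cf. (18)."""
    d = symmetric_kl(color_histogram(patch_prev), color_histogram(patch_cur))
    return -lam4 * d

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(40, 20, 3), dtype=np.uint8)
b = np.clip(a.astype(int) + rng.integers(-10, 10, a.shape), 0, 255).astype(np.uint8)
print(appearance_log_likelihood(a, b))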

Given one cover, the motion and appearance likelihood of a target is assumed to be independent of the other targets. The joint likelihood of a cover can thus be factorized as in (20):

p(Y | ω) = f_F(ω) ∏_{k=1}^{K} f_M(τ_k) f_A(τ_k)
         = f_F(ω) ∏_{k=1}^{K} ( ∏_{i=3}^{|τ_k|} L_M[τ_k(t_i)] ∏_{i=2}^{|τ_k|} L_A[τ_k(t_i)] ).   (20)

With some manipulations, we combine the prior p(ω) in (7) and the likelihood p(Y | ω) in (20) to rewrite the posterior as in (21):


p(ω | Y) ∝ exp{ −C_0 S_len − C_1 K − C_2 F − C_3 S_olp − C_4 S_app − S_mot },

S_len = −Σ_{k=1}^{K} |τ_k|,   S_olp = Σ_{t=1}^{T} ρ(t),
S_app = Σ_{k=1}^{K} Σ_{i=2}^{|τ_k|} D(τ_k(t_i), τ_k(t_{i−1})),
S_mot = Σ_{k=1}^{K} Σ_{i=3}^{|τ_k|} ( log(|S_k(t_i)|) + e(t_i)^T S_k(t_i)^{-1} e(t_i) ),   (21)

where e(t_i) = τ_k(t_i) − H x̄_k(t_i | t_{i−1}) and C_0, ..., C_4 are positive real constants, newly introduced parameters replacing (λ_i, z_i), i = 0, ..., 4. The parameters in the prior and likelihood functions are absorbed into the free parameters C_0, ..., C_4. Once a possible cover ω is given, the variables S_len, K, F, S_olp, S_app, and S_mot can be computed. The global maximum (called the mode in statistics) of the posterior p(ω | Y) is our MAP solution. Equation (21) reveals that the MAP estimation is equivalent to finding the minimum of an energy function. Determining the parameters in such a posterior is as important as maximizing the posterior: an improper parameter setting makes the optimization process meaningless. This is an issue which is very often ignored and thus causes critiques of Bayesian MAP inference. In Section 5.1, we discuss how to determine the parameters automatically by Linear Programming.
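The energy form of (21) can be sketched as follows: given the summary statistics of a cover, the unnormalized log-posterior is linear in the trade-off parameters C_0, ..., C_4, which is what makes both the ratio computation of Section 4 and the constraint-based training of Section 5.1 tractable. The statistic values below are placeholders, not measured quantities.

import numpy as np

def log_posterior(stats, C):
    """Unnormalized log p(omega | Y), cf. (21).
    stats = (S_len, K, F, S_olp, S_app, S_mot); C = (C0, ..., C4)."""
    S_len, K, F, S_olp, S_app, S_mot = stats
    C0, C1, C2, C3, C4 = C
    return -(C0 * S_len + C1 * K + C2 * F + C3 * S_olp + C4 * S_app + S_mot)

# Placeholder statistics for two covers; only the ratio (difference of energies) matters.
stats_a = (-120.0, 3, 450.0, 0.8, 14.2, 95.0)    # cover omega
stats_b = (-118.0, 4, 300.0, 1.5, 16.0, 97.0)    # cover omega'
C = (1.0, 5.0, 0.01, 10.0, 0.5)
log_ratio = log_posterior(stats_b, C) - log_posterior(stats_a, C)
print(np.exp(min(0.0, log_ratio)))               # min(1, pi(omega') / pi(omega))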

4 SPATIOTEMPORAL MCMC DATA ASSOCIATION

Directly optimizing the posterior by enumerating all possible solutions in the solution space defined in (3) is simply not feasible. We propose to use data-driven MCMC to estimate the best spatiotemporal cover of foreground regions. To ensure that detailed balance is satisfied, the Markov chain is designed to be ergodic and aperiodic. It is also important to design samplers that converge quickly. Due to the ergodicity of the Markov chain, there is always a "path" from one state to another state with nonzero probability; however, sufficient flexibility in the transitions of the Markov chain significantly reduces the mixing time. In the design of the Markov chain transitions, we provide flexibility in two ways. First, we design 10 types of transitions. They contain some redundancy; for example, merge (or split) can be implemented by death moves with extension moves, and switch can be implemented by split and merge moves. Second, within a time span, the "future" and "past" information is symmetric: we can extend a track in both the positive and negative time directions. Thus, we select moves uniformly at random (u.a.r.) in both temporal directions, forward and backward. This bidirectional sampling has more flexibility and reduces the total number of samples. It differs from the temporal moves proposed in [24], where only forward inference is used. In the following section, we only describe sampling in the positive time direction; sampling in the other direction proceeds in a symmetric way.

Fig. 4. Illustration of the neighborhood and the association likelihood, where τ_k(t_3) has three neighbors.

To make the sampling more efficient, we define a neighborhood in spatiotemporal space. Two covering rectangles are regarded as neighbors if their temporal distance and spatial distance are smaller than a threshold. The neighborhood actually forms a graph, where a covering rectangle corresponds to a node and an edge between two nodes indicates that the two covering rectangles are neighbors. In the rest of the paper, we use "node" and "covering rectangle" interchangeably. A neighbor with a smaller (larger) frame number is called a parent (child) node. The neighborhood makes the algorithm more manageable since candidates are considered only within the neighborhood system. Fig. 4 illustrates the neighborhood. The joint motion and appearance likelihood of assigning an observation y (i.e., one foreground region) to a track τ_k after t_i is represented as

L(y | τ_k(t_i)) = L_M(y | τ_k(t_i)) L_A(y, τ_k(t_i)).   (22)
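The neighborhood system can be built directly from the covering rectangles of the detected regions. The sketch below links two nodes when their frame difference and center distance fall below thresholds (the threshold values are illustrative) and records the parent/child direction by frame order.

from itertools import combinations

def center(rect):
    x, y, w, h = rect
    return (x + w / 2.0, y + h / 2.0)

def build_neighborhood(nodes, max_dt=3, max_dist=50.0):
    """nodes: list of (frame, rect). Returns parent and child adjacency lists."""
    children = {i: [] for i in range(len(nodes))}
    parents = {i: [] for i in range(len(nodes))}
    for i, j in combinations(range(len(nodes)), 2):
        (ti, ri), (tj, rj) = nodes[i], nodes[j]
        if ti == tj or abs(ti - tj) > max_dt:
            continue
        (cxi, cyi), (cxj, cyj) = center(ri), center(rj)
        if ((cxi - cxj) ** 2 + (cyi - cyj) ** 2) ** 0.5 > max_dist:
            continue
        a, b = (i, j) if ti < tj else (j, i)     # node in the earlier frame is the parent
        children[a].append(b)
        parents[b].append(a)
    return parents, children

nodes = [(1, (10, 10, 20, 40)), (2, (14, 11, 20, 40)), (2, (80, 15, 18, 36)),
         (3, (18, 12, 20, 40))]
parents, children = build_neighborhood(nodes)
print(children)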

In our proposal distribution, the sampler contains two types of moves: temporal and spatial moves. One move here means one transition of the state of the Markov chain. Temporal moves only change the labels of rectangles in the cover. However, since detected moving regions do not always correspond to a single target (they may represent parts of a target or delineate multiple targets moving close to each other), merely using temporal moves cannot probe the spatial cover of the foreground. Hence, we propose a set of spatial moves to segment, aggregate, or diffuse detected regions, to infer the best cover of the foreground. The spatial and temporal moves are interdependent: the result of a spatial move is evaluated within temporal moves, and the result of a temporal move guides subsequent spatial moves. The overview of our MCMC data association algorithm is shown in Algorithm 1. The input to the algorithm is the set of original foreground regions Y, the initial cover ω_0, and the total number of samples n_mc. The initial cover ω_0 is initialized with a greedy criterion, namely, using the MHT algorithm but keeping only the best hypothesis at each time. The covering rectangles in ω_0 are directly obtained from the MBRs of the foreground regions. Each move is sampled according to its own prior probability. Since the temporal information is also applied in the spatial moves, we first take γ · n_mc (γ = 0.15 in experiments) temporal moves, and then both types of moves are considered without distinction. Note that instead of keeping all samples, we only keep the cover with the maximum posterior, since we do not need the whole distribution but only the MAP estimate.


Algorithm 1. Spatiotemporal MCMC Data Association
    Input: Y, n_mc, γ, ω = ω_0
    Output: ω*
    for n = 1 to n_mc do
        if n < γ · n_mc then
            Sample one temporal move.
        else
            Sample one move from all candidate moves.
        end if
        Propose ω' according to q(ω, ω').
        Sample U from Unif[0, 1].
        if U < A(ω, ω') then ω_n = ω' else ω_n = ω end if
        if p(ω_n | Y) > p(ω* | Y) then ω* = ω_n end if
    end for

The target distribution is the posterior distribution of ω, i.e., π(ω) = p(ω | Y), which is defined on a union of subspaces of varying dimension. Thus, we adopt a transdimensional MCMC algorithm [25], which deals with the case of proposal and target distributions in spaces of varying dimension. One move from ω_m ∈ Ω_m to ω_{m'} ∈ Ω_{m'} (m ≠ m') is a jump between two different models. Reversible-Jump MCMC [2], proposed by Green, connects these two models by drawing "dimension matching" variables u and u' from proposal distributions q_m(u) and q_{m'}(u'), provided that dim(ω) + dim(u) = dim(ω') + dim(u'), where dim(·) denotes the dimension of a vector. Then ω and ω' can be generated from deterministic functions ω = g(ω', u') and ω' = g(ω, u). The acceptance ratio is defined as follows:

A_m(ω, ω') = min( 1, [π(ω') / π(ω)] · [q_{m'}(ω | ω') / q_m(ω' | ω)] · |∂(ω', u') / ∂(ω, u)| ).   (23)

The temporal moves of merge, split, and switch do not change the number of covering rectangles but change only the labels of the rectangles. All spatial moves do not change the labels of the rectangles but only change the states of the rectangles. These types of moves do not change the dimension of the space. The temporal moves of birth, death, extension, and reduction involve transdimensional dynamics. Note that both dimension-increasing and dimension-decreasing moves only change one part of the cover and do not affect the remaining part of the cover. For a pair of dimension increasing/decreasing moves, if u is a random variable, u ~ q(u), and the move is defined as ω' = g(ω, u) = [ω, u] with dim(ω') = dim(ω) + dim(u), then q_m(ω' | ω) = q(u). In RJ-MCMC, if u is independent of ω, it is easy to show that the Jacobian is unity [5]. In such a Markov chain transition, the computation for each MCMC move is actually low, since we only need to compute the ratio π(ω')/π(ω) instead of computing the value of each posterior. Moreover, since the Markov chain dynamics only change one part of the cover and do not affect the remaining part of a cover, the ratio π(ω')/π(ω) can be computed by only considering the change from ω to ω'.


For instance, for a split/merge move, we only need to consider the likelihood change and the prior change for the affected track. In the subsequent sections, we show how to devise the Markov chain transitions by considering specific choices for the proposal distribution q(ω' | ω).
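A compact sketch of the main loop of Algorithm 1 under the Metropolis-Hastings rule. The functions propose_move and log_posterior stand in for the move set of Section 4.1 and the posterior of (21); they are stubbed here with a one-dimensional toy problem, so the snippet only illustrates the accept/keep-best logic, not the actual tracker.

import math
import random

def mcmc_data_association(omega0, log_posterior, propose_move, n_mc=10000, gamma=0.15):
    """Return the maximum a posteriori sample visited by the chain (cf. Algorithm 1)."""
    omega, best = omega0, omega0
    lp = best_lp = log_posterior(omega0)
    for n in range(n_mc):
        temporal_only = n < gamma * n_mc                  # warm up with temporal moves only
        omega_new, log_q_fwd, log_q_rev = propose_move(omega, temporal_only)
        lp_new = log_posterior(omega_new)
        log_accept = min(0.0, lp_new - lp + log_q_rev - log_q_fwd)
        if random.random() < math.exp(log_accept):        # Metropolis-Hastings acceptance
            omega, lp = omega_new, lp_new
        if lp > best_lp:                                  # keep only the best cover seen
            best, best_lp = omega, lp
    return best

# Toy usage: the "cover" is a single number and moves are symmetric Gaussian perturbations.
toy_posterior = lambda x: -(x - 3.0) ** 2
toy_move = lambda x, _temporal: (x + random.gauss(0.0, 0.5), 0.0, 0.0)
print(round(mcmc_data_association(0.0, toy_posterior, toy_move), 2))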

4.1 Markov Chain Dynamics

Dynamics 1-7 are temporal moves, which involve changing the labels of rectangles. The operation of selecting candidate rectangles in a birth move or an extension move only involves selecting from the covering rectangles of the original foreground regions. Dynamics 8-10 are spatial moves, which change the state of covering rectangles. The prior for each move from 1 to 10 is predetermined as p(1) to p(10).

Dynamics 1-2: Forward Birth and Death. For a forward birth move, we pick two neighboring nodes in different frames to form a track seed, which contains two nodes:

ω = {r_i}_{i=1}^{m} → (ω, {r_{m+1}, r_{m+2}}) = ω'.   (24)

For the first candidate rectangle, we select u.a.r. one of the covering rectangles of the original foreground regions that have not been covered, i.e., q_b(r_{m+1}) is equal to one over the number of original bounding rectangles that are not covered. Let child(r_{m+1}) be the set of child nodes of r_{m+1} that have not been covered; the probability of selecting the second candidate is

q_b(r_{m+2} | r_{m+1}) = (−log L_A(r_{m+2}, r_{m+1}) + 1)^{-1} / Σ_{y ∈ child(r_{m+1})} (−log L_A(y, r_{m+1}) + 1)^{-1}.   (25)

When we select the second node in a track seed, we only use the appearance likelihood in (25) (since the computation of the motion likelihood needs at least two nodes). To avoid the probability of one candidate dominating all the others, we use the inverse of the negative log-likelihood to define the probability. For the reverse move, we select u.a.r. one of the existing track seeds and remove it from the current cover, i.e., q(seed) is equal to one over the number of track seeds. By the Metropolis-Hastings method, we need two proposal probabilities, q_birth(ω, ω') and q_death(ω', ω). q_birth(ω, ω') is the conditional probability of the Markov chain proposing to move to ω', and q_death(ω', ω) is the likelihood of coming back. The acceptance probability of a birth move is then

A(ω, ω') = min( 1, [π(ω') q_death(ω', ω)] / [π(ω) q_birth(ω, ω')] ),   (26)

where the proposal probability of a birth move is the product of the prior of a birth move p(1) and the probability of selecting the two candidate rectangles, i.e., q_birth(ω, ω') = p(1) q_b(r_{m+1}) q_b(r_{m+2} | r_{m+1}). The proposal probability of a death move is the product of the prior of a death move and the probability of selecting one track seed, i.e., q_death(ω', ω) = p(2) q(seed).
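The informed choice of the second seed node in (25) amounts to weighted sampling with weights (−log L_A + 1)^{-1}. The candidates and likelihood values below are placeholders; the only assumption is that the negative log-likelihoods are nonnegative.

import random

def selection_probs(neg_log_likelihoods):
    """q_b of (25): inverse of (negative log-likelihood + 1), normalized over candidates."""
    weights = [1.0 / (nll + 1.0) for nll in neg_log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

def sample_candidate(candidates, neg_log_likelihoods):
    probs = selection_probs(neg_log_likelihoods)
    return random.choices(candidates, weights=probs, k=1)[0], probs

# Three uncovered child nodes of r_{m+1} with placeholder -log L_A values.
candidates = ["r_a", "r_b", "r_c"]
neg_log_la = [0.2, 1.5, 4.0]
chosen, probs = sample_candidate(candidates, neg_log_la)
print(chosen, [round(p, 3) for p in probs])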


Dynamics 3-4: Forward Extension and Reduction. For a forward extension move, we select a track τ_k ∈ ω according to its length, i.e.,

q_e(τ_k) = exp(λ_e |τ_k|) / Σ_{τ_k ∈ ω} exp(λ_e |τ_k|).

Suppose the end node of track τ_k is at frame t_i; we select one covering rectangle r_{m+1} of an original foreground region from child(τ_k(t_i)) and add it to τ_k. The probability of selecting the new node can be represented as

q_e(r_{m+1} | τ_k(t_i)) = (−log L(r_{m+1} | τ_k(t_i)) + 1)^{-1} / Σ_{y ∈ child(τ_k(t_i))} (−log L(y | τ_k(t_i)) + 1)^{-1}.   (27)

This probability is similar to the one in (25) but considers both the motion and the appearance likelihoods. For the reverse move, we select u.a.r. a track τ_k that contains more than two nodes and remove its end node. To allow multiple extensions or reductions, after one extension we continue to extend the same track τ_k with a probability γ_e; similarly, after one reduction we continue to reduce τ_k with probability γ_r. The proposal probability of extension is

q_extension(·) = p(3) q_e(τ_k) (γ_e)^{n−1} (1 − γ_e) ∏_{i=1}^{n} q_e(r_{m+i}),

and the proposal probability of the reverse move is q_reduction(·) = p(4) q_r(τ_k) (γ_r)^{n−1} (1 − γ_r), where n indicates the number of extension or reduction moves that actually occur.

Dynamics 5-6: Merge and Split. If a track's (τ_{k_1}) end node is in the parent set of another track's (τ_{k_2}) start node, this pair of tracks is a candidate for a merge move. We select u.a.r. a pair of tracks from the candidates and merge the two tracks into a new track τ_k = {τ_{k_1}} ∪ {τ_{k_2}}. The proposal probability of a merge move is q_merge(·) = p(5) q_m(τ_{k_1}, τ_{k_2}). For the reverse move, we select a track τ_k according to

q_s(τ_k) = exp(λ_s |τ_k|^{-1}) / Σ_{|τ_k| ≥ 4} exp(λ_s |τ_k|^{-1}),

and then select a break point according to the probability q_brk(i):

q_brk(i) = −log L(τ_k(t_{i+1}) | τ_k(t_i)) / Σ_j ( −log L(τ_k(t_{j+1}) | τ_k(t_j)) ).   (28)

q_brk(i) is designed to prefer breaking a track at a location where the motion and appearance likelihood is low. The nodes of the track after the break point are moved to a new track. If the break point happens at the first link or the last link, the split operation has the same effect as a reduction operation. The proposal probability of a split move is q_split(·) = p(6) q_s(τ_k) q_brk(i).

Dynamics 7: Switch. If there exist two locations p, q in two tracks τ_{k_1}, τ_{k_2} such that τ_{k_1}(t_p) is in the parent set of τ_{k_2}(t_{q+1}) and τ_{k_2}(t_q) is in the parent set of τ_{k_1}(t_{p+1}) as well, this pair of nodes is a candidate for a switch move. We select a candidate u.a.r. and define two new tracks as

τ'_{k_1} = { τ_{k_1}(t_1), ..., τ_{k_1}(t_p), τ_{k_2}(t_{q+1}), ..., τ_{k_2}(t_{|τ_{k_2}|}) },
τ'_{k_2} = { τ_{k_2}(t_1), ..., τ_{k_2}(t_q), τ_{k_1}(t_{p+1}), ..., τ_{k_1}(t_{|τ_{k_1}|}) }.   (29)


The reverse move of a switch is symmetric, i.e., the reverse move of a switch is still a switch. The proposal probabilities of a switch move and its reverse move are identical; thus, there is no need to compute the proposal probability. The acceptance probability of a switch move is

A_switch(ω, ω') = min( 1, π(ω') / π(ω) ).

Dynamics 8: Diffusion. We select one covering rectangle τ_k(t) in a track according to the probability

q_dif(τ_k(t)) = −log L(τ_k(t_i) | τ_k(t_{i−1})) / Σ_{k=1}^{K} Σ_{i=2}^{|τ_k|} ( −log L(τ_k(t_i) | τ_k(t_{i−1})) ).

This probability prefers selecting a covering rectangle that has a low motion and appearance likelihood with respect to its preceding neighbor; a low likelihood indicates that the covering rectangle of the track in this frame may be erroneous. In order to update its state, we first obtain its estimated state τ̄_k(t) from the motion model and then update its position and size according to the appearance model. We generate a new covering rectangle τ'_k(t) from the probability S(τ'_k(t) | τ̄_k(t)):

S(y'_t | ȳ_t) ~ N( ȳ_t − β (dE/dx)|_{x = ȳ_t}, Σ_u ),   (30)

where E = −log L_A(x | y_t) is the appearance energy function, β is a scalar controlling the step size, and Σ_u is the covariance of a Gaussian white noise u added to avoid local minima. In practice, we adopt the spatioscale mean shift vector [8] to approximate the gradient of the negative appearance likelihood in terms of position and scale. A scale space is conceptually generated by convolving a filter bank of spatial Difference of Gaussian (DOG) filters with a weight image. Searching for the mode in such a 3D scale space is efficiently implemented in [8] by a two-stage mean-shift procedure that interleaves spatial and scale mode-seeking, rather than explicitly building a 3D scale space and then searching it. In our experiments, we only compute the mean shift vector in scale space once, namely, we perform the spatial mean shift once followed by the scale mean shift, without iterations. The diffusion move is illustrated in Fig. 5. The color histogram of one track is derived in RGB space with 16 × 16 × 16 bins. Around the initial state τ̄_k(t), a weight image is computed using histogram backprojection, replacing each pixel with the probability associated with its RGB value in the color histogram. Note that the weight image is masked by the foreground regions: the weight of a background pixel is always zero, as shown in Fig. 5. A new proposal is generated by drifting the initial state along the mean shift vector and adding Gaussian noise according to (30). The newly generated covering rectangle takes the place of τ_k(t). The diffusion move may leave parts of the foreground regions uncovered; these regions can be covered by new rectangles generated in birth moves if they can form a consistent track. The proposal probability of a diffusion move is q_dif(·) = p(8) q_dif(τ_k(t)) S(τ'_k(t) | τ̄_k(t)). The diffusion move is also symmetric. The acceptance ratio of a diffusion move is


Fig. 5. Illustrations of a diffusion move: the RGB color histogram is quantized in 16 × 16 × 16 bins; the weight image is the backprojection from the color histogram and is masked by the foreground regions; the blue dashed rectangle indicates the prediction from the motion model, the red arrow is the spatial-scale mean shift vector, and the dashed red rectangle shows the proposal by a diffusion move.

A_dif(ω, ω') = min( 1, [π(ω') S(τ_k(t) | ȳ_t)] / [π(ω) S(τ'_k(t) | ȳ_t)] ).   (31)

Both motion and appearance information are considered in the diffusion operation: the initial state for computing the mean shift vector is the state τ̄_k(t) predicted by the Kalman filter, and the diffusion vector is computed according to the appearance information. The diffusion is used for generating new hypotheses, and the decision of acceptance is still made according to the Metropolis-Hastings algorithm, where the posterior distribution that encodes the joint motion and appearance likelihood plays an important role in accepting a good solution. Since we do not have a precise segmentation of the foreground regions, the appearance computation may not be very accurate when occlusion happens; the motion likelihood helps in estimating a good cover when appearance is not reliable. This is the reason why we need the joint motion and appearance model. The parameters C_0, ..., C_4 represent the trade-off between the different factors in the posterior and are trained offline to adapt to a specific data set.

Dynamics 9: Segmentation. If the predictions τ̄_k(t) of more than one track have enough overlap with one covering rectangle y at time t, as illustrated in Fig. 6a, this indicates that one covering rectangle may correspond to multiple tracks. Such a rectangle is regarded as a candidate for a segmentation move, and the tracks are the related tracks of the candidate y. We randomly select such a candidate y and, for

Fig. 6. Illustrations of segmentation (a) and aggregation (b) moves, where the color indicates the object ID, dashed boxes indicate the estimated rectangles from the motion model, and regions with red boundaries are original foreground regions.

each related track τ_k, generate a new covering rectangle τ'_k(t) according to the probability S(τ'_k(t) | τ̄_k(t)). The segmentation move is achieved through diffusion moves (each related track performs one diffusion). Thus, the reverse of a segmentation move is also a segmentation move. The acceptance ratio of one segmentation move is

A_seg(ω, ω') = min( 1, [π(ω') ∏ S(τ_k(t) | ȳ_t)] / [π(ω) ∏ S(τ'_k(t) | ȳ_t)] ).   (32)

Dynamics 10: Aggregation. If one track's prediction τ̄_k(t) has enough overlap with more than one covering rectangle at time t, as illustrated in Fig. 6b, this indicates that the observation of this track in this frame may be fragmented into multiple regions. This forms a candidate for an aggregation move. We randomly select such a candidate τ̄_k(t) and, for the track τ_k, generate a new covering rectangle τ'_k(t) according to the probability S(τ'_k(t) | τ̄_k(t)). The newly generated covering rectangle takes the place of τ_k(t). The aggregation move is also symmetric and its acceptance ratio is similar to the one in (31). Both segmentation and aggregation moves are implemented by diffusion moves. In other words, the segmentation and aggregation moves are particular types of diffusion moves that address merged and fragmented observations, respectively.
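The diffusion proposal of (30) can be sketched without any tracking library: a weight image is obtained by back-projecting the track's color histogram and masking it with the foreground, the rectangle is drifted toward the weighted centroid inside its current window (one spatial mean-shift step), and Gaussian noise is added. The bin count, step size, noise level, and the uniform histogram are illustrative assumptions, and scale adaptation is omitted.

import numpy as np

def backproject(image, mask, hist, bins=16):
    """Weight image: probability of each pixel's quantized RGB bin, zero off the foreground."""
    idx = image.astype(int) // (256 // bins)
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    return hist[flat] * mask

def diffuse_rect(rect, weight, step=1.0, sigma=2.0, rng=None):
    """One spatial mean-shift step on the rectangle center plus Gaussian noise, cf. (30)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x, y, w, h = rect
    win = weight[y:y + h, x:x + w]
    if win.sum() <= 0:
        return rect
    ys, xs = np.mgrid[0:h, 0:w]
    dx = (xs * win).sum() / win.sum() - (w - 1) / 2.0     # mean-shift vector, x component
    dy = (ys * win).sum() / win.sum() - (h - 1) / 2.0     # mean-shift vector, y component
    nx = int(round(x + step * dx + rng.normal(0, sigma)))
    ny = int(round(y + step * dy + rng.normal(0, sigma)))
    return (nx, ny, w, h)

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
mask = np.zeros((120, 160)); mask[30:90, 40:100] = 1.0    # foreground support
hist = np.ones(16 ** 3) / 16 ** 3                         # track color model (uniform here)
weight = backproject(image, mask, hist)
print(diffuse_rect((50, 40, 30, 40), weight))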

5 EXPERIMENTS

5.1 Parameter Training

Properly selecting the parameters in (21) is necessary to ensure that the Markov chain converges to the correct distribution. Determining the parameters in a principled way is not a trivial task. First, the posterior can only be known up to a scale factor, because computing the normalization factor over the entire solution space Ω is intractable. Second, the parameters encode a great deal of scenario-specific domain knowledge, such as the rates of false alarms and overlap. Empirical knowledge cannot determine the parameters in such a complex posterior, and hand tuning would make the process not repeatable. Determining a proper setting of the parameters is a first-line problem before any stochastic optimization, and a casual setting of the parameters in the posterior makes all the effort spent searching for a MAP solution meaningless. The global optimal solution, which is "optimal" only for some oracle type of posterior, may not be more meaningful than some inferior local maxima, or even nonmaxima at all. This issue


was noticed by Tu and Zhu in [18], where the authors applied Data-Driven MCMC to image segmentation (segmentation is also intrinsically ambiguous). The authors proposed a K-Adventurer algorithm to extract K distinct solutions from the Markov chain sampling. This method requires storing the Markov chain in order to select the K distinct solutions. However, considering the computational cost and the difference in the definition of the goal, in a multiple-target tracking problem it is not appropriate to keep multiple solutions from the whole chain. Here, we propose an automatic solution to determine the parameters in such a probabilistic model. Given one ω, the log posterior density is a linear combination of the parameters (note that the log posterior density is not a linear function of ω; otherwise, direct optimization of the posterior could be expected). Such a linear combination in parameter space is commonly seen in the definition of a posterior that can be factorized into a set of independent components. As mentioned in Section 4, we only need to compute the ratio π(ω')/π(ω) in the Markov chain transitions instead of computing the values of π(ω') and π(ω). Inspired by this property, although we cannot know the values of π(ω') and π(ω), we can establish a constraint π(ω)/π(ω') ≥ (or ≤) 1 whenever we know whether one solution is no worse than the other. Such constraints can be transformed into a set of linear inequalities in the parameters. After collecting enough inequalities, we can apply Linear Programming to find a feasible solution for the parameters. Given ground truth data, the information of whether one solution is no worse than another is easy to obtain by degrading the ground truth using the spatial and temporal moves defined in Section 4.1. In the experiments, the ground truth contains tracks with correct labels and locations. We obtain foreground regions as observations. First, by fitting partial ground truth and observations into the motion model, we determine the parameters of the motion model, i.e., Q and R in (14). This information is required to compute S_mot in (21). Then, we start with the best cover ω* obtained from ground truth and use the temporal and spatial moves to degrade the best cover to ω_i. For each ω_i, we have the constraint

π(ω*) / π(ω_i) ≥ 1.   (33)

Given one cover, according to (21), the log posterior f(C | ω) = log(p(ω | Y)) is a linear function of the free parameters. Equation (33) therefore provides one linear inequality, f(C | ω*) − f(C | ω_i) ≥ 0. After collecting multiple constraints, we use Linear Programming to find a solution of positive parameters with maximum sum:

Maximize: a^T C
Subject to: A^T C ≤ b, C ≥ 0,   (34)

where C = [C_0, ..., C_4]^T, a = [1, 1, 1, 1, 1]^T, and each row of A^T C ≤ b encodes one constraint from (33). In our experiments, 5,000 constraints, which cover most of the cases of different moves from multiple sequences in one data set, are sequentially generated and added to a constraint set. Due to the ambiguity existing in the ground truth, a small number of conflicting constraints may exist; any constraint that conflicts with the existing set is ignored. In fact, the objective function, namely a in (34), is a rather loose parameter as long as enough constraints are collected: any vector a containing five positive numbers will work. In order to determine how many constraints are enough to get an accurate estimate of the parameters, we simulate a density function with five parameters. For a given number of constraints (x-axis in Fig. 7), we independently generate multiple sets of constraints and compute the average estimation errors, which are shown in Fig. 7. Note that, since the parameters indeed encode scenario-related knowledge, we train the parameters separately for different data sets. Here, a data set means a set of video sequences in a similar scenario (similar background, moving objects, and the same method for foreground segmentation). A desired Markov chain transition and a correct MAP solution are ensured by the trained parameters.

Fig. 7. Average normalized error ‖Ĉ − C‖/‖C‖ of the estimated parameters Ĉ with different numbers of constraints.
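A sketch of the constraint-based training of (33)-(34) with scipy.optimize.linprog: writing the energy of (21) as E(ω; C) = C_0 S_len + C_1 K + C_2 F + C_3 S_olp + C_4 S_app + S_mot, each degraded cover ω_i contributes one inequality E(ω*; C) ≤ E(ω_i; C). The statistics below are synthetic, and the upper bound on C is added only to keep this toy LP bounded; it is not part of the formulation in the paper.

import numpy as np
from scipy.optimize import linprog

def train_parameters(stats_gt, smot_gt, stats_deg, smot_deg, c_max=100.0):
    """Find C >= 0 maximizing sum(C) such that every degraded cover has energy >= ground truth.
    stats_* hold (S_len, K, F, S_olp, S_app); S_mot enters the right-hand side directly."""
    A_ub = np.asarray(stats_gt) - np.asarray(stats_deg)   # (s* - s_i) . C <= S_mot,i - S_mot,*
    b_ub = np.asarray(smot_deg) - smot_gt
    res = linprog(c=-np.ones(5), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, c_max)] * 5, method="highs")
    return res.x if res.success else None

# Synthetic ground-truth statistics and three degraded covers (placeholder values).
stats_gt, smot_gt = np.array([-150.0, 4, 200.0, 0.5, 10.0]), 80.0
stats_deg = np.array([[-140.0, 5, 260.0, 0.9, 12.0],
                      [-150.0, 4, 150.0, 2.5, 11.0],
                      [-100.0, 9, 220.0, 0.6, 10.5]])
smot_deg = np.array([85.0, 82.0, 90.0])
print(train_parameters(stats_gt, smot_gt, stats_deg, smot_deg))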

5.2 Simulation Results

To demonstrate the concept of our approach, we design simulation experiments. In an L × L square region, there are K (an unknown number of) moving discs. Each disc has an independent color appearance and an independent constant velocity and scale change in the 2D region. False alarms (nonoverlapping with targets) are located u.a.r. in the scene, and the number of false alarms is uniformly distributed on [0, FA]. If the number of existing targets in the square region is less than the upper bound N, a target is added randomly. We also add several bars as occlusions in the scene. These static occlusions cause a target to break into several foreground regions, which simulates real scenarios where foreground regions are fragmented due to noisy background modeling. The input to our tracking algorithm contains only the foreground regions in each frame, without any shape information. The design of this simulation experiment specifically targets the evaluation of a method's ability to recover the spatial data association. In such a simulation, if no occlusions between objects and no static occlusions occur, the temporal data association decision is very easy to make without any ambiguity; it is precisely the lack of spatial completeness that makes these sequences challenging for many existing data association methods. Without jointly considering the spatial and temporal data association, a tracking algorithm cannot produce the correct segmentation of the foreground regions. Fig. 8 gives the results of our spatiotemporal MCMC data association algorithm. Colored and black rectangles display the targets and false alarms,


Fig. 8. Simulation result with L = 200, N = 7, FA = 7, and T = 50. Colored rectangles indicate the IDs of targets. Targets may split or merge when they appear. (a) 1st frame, (b) 30th frame, (c) 43rd frame, and (d) 50th frame.

Red links indicate where spatial segmentation occurs between nodes. To evaluate the performance of our approach quantitatively, we adopt the "Sequence Tracking Detection Accuracy" (STDA) metric proposed in [26], a spatiotemporal measure that penalizes fragmentation in both the temporal and the spatial domains. Computing the STDA score requires a one-to-one match between the tracked targets and the ground truth targets; the matching is implemented (in the evaluation software) by computing the measure over all combinations of ground truth and detected objects and choosing the combination that maximizes the overall score for a sequence [26]. Given M matched tracks consisting of tracked targets $\tau_k(i)$, $k = 1, \ldots, M$, $i = 1, \ldots, T$, and the corresponding ground truth tracks $G_k(i)$, $k = 1, \ldots, M$, $i = 1, \ldots, T$, STDA is computed as

$$\mathrm{STDA} = \sum_{i=1}^{M} \frac{\sum_{t=1}^{T} \frac{|G_i(t) \cap \tau_i(t)|}{|G_i(t) \cup \tau_i(t)|}}{N_{\mathrm{frame}}(G_i \cup \tau_i \neq \emptyset)} \Bigg/ \frac{N_G + N_T}{2}, \qquad (35)$$

where the denominator for each track, $N_{\mathrm{frame}}(G_i \cup \tau_i \neq \emptyset)$, is the number of frames in which either a ground truth target or a tracked target (or both) is present. The numerator for each track measures the spatial accuracy by accumulating the overlap between the matched tracking results and the ground truth targets over the sequence. The normalization factor is the average of the number of tracked targets $N_T$ and the number of ground truth targets $N_G$. STDA takes real values between 0 (worst) and 1 (best possible performance).
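For concreteness, a minimal sketch of computing (35) follows, assuming the one-to-one matching between tracked and ground truth tracks has already been established and that each track is a per-frame list of axis-aligned boxes (None when absent); the box format and helper names are our assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2); None means absent."""
    if a is None or b is None:
        return 0.0
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def stda(gt_tracks, det_tracks, n_gt, n_det):
    """gt_tracks[i][t] and det_tracks[i][t] hold the box (or None) of matched pair i at frame t."""
    total = 0.0
    for g, d in zip(gt_tracks, det_tracks):
        # frames in which either the ground truth or the tracked target is present
        frames_present = sum(1 for gb, db in zip(g, d) if gb is not None or db is not None)
        if frames_present == 0:
            continue
        overlap = sum(iou(gb, db) for gb, db in zip(g, d))  # per-frame spatial overlap
        total += overlap / frames_present
    return total / ((n_gt + n_det) / 2.0)                    # normalize by (N_G + N_T) / 2
```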


Fig. 9. (a) STDA as a function of N, the maximum number of targets, with L = 200, FA = 0, and T = 50. (b) STDA as a function of FA, the number of false alarms, with L = 200, N = 5, and T = 50.

As the target density and false alarm rate increase, tracking becomes increasingly difficult. For each setting (i.e., number of targets and number of false alarms), we generate 20 sequences and report the average performance; each sequence contains T = 50 frames. We compare our method with other methods, including a JPDAF-based method from [10], the MHT from [13], and our own algorithm with temporal moves only. All methods employ the same motion and appearance likelihoods. To prune hypotheses, both JPDAF and MHT use a minimum ratio between the likelihood of the worst retained hypothesis and the likelihood of the best one: any hypothesis whose likelihood is lower than the product of this ratio and the best hypothesis's likelihood is discarded. For JPDAF, we use 1-scan-back and keep at most 50 hypotheses at each scan; for MHT, we use 3-scan-back and keep at most 300 group hypotheses. In fact, even with a larger scan-back (5-scan-back), MHT does not show significant improvement in the simulation. This is because temporal data association (when no occlusion happens) is quite straightforward in the simulation, and the ambiguities caused by errors in the spatial relationships cannot be resolved simply by increasing the scan-back depth. The initial cover ω_0 for MCMC sampling is obtained by a greedy criterion, namely, running the MHT algorithm but keeping only the best hypothesis at each time step. The MCMC sampler is run for a total of 10,000 iterations, of which the first 1,500 consist solely of temporal moves. The average score over multiple runs of our method is reported. Fig. 9a compares the performance as the number of targets increases, and Fig. 9b shows the tolerance of the different methods to false alarms. Because we consider the spatial and temporal association jointly, our method dominates the other three methods, which perform almost equally poorly since they fail in similar cases when split or merged observations exist.
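The ratio-based pruning used for the JPDAF and MHT baselines can be illustrated with a short sketch; the data structures and threshold value below are illustrative, not taken from the baselines' implementations.

```python
def prune_hypotheses(hypotheses, ratio, max_keep):
    """Keep at most max_keep hypotheses whose likelihood is at least
    ratio times the likelihood of the best hypothesis."""
    hypotheses = sorted(hypotheses, key=lambda h: h["likelihood"], reverse=True)
    best = hypotheses[0]["likelihood"]
    kept = [h for h in hypotheses if h["likelihood"] >= ratio * best]
    return kept[:max_keep]

# e.g., keep at most 50 hypotheses per scan (JPDAF) or 300 group hypotheses (MHT)
survivors = prune_hypotheses([{"likelihood": l} for l in (0.9, 0.5, 0.02)],
                             ratio=0.05, max_keep=50)
```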


Fig. 10. STDA and runtime (seconds) for the online and offline versions, for different window sizes W and numbers of samples n_mc, with L = 200, FA = 0, and T = 1,000.

To extend our algorithm to long sequences, we implement the proposed association as an online algorithm within a sliding window of size W. The overlap between consecutive windows is controlled by ΔW: when the window advances, it contains ΔW new frames and shares W − ΔW frames with the previous window. The cover of the overlapping part of the current window is initialized from the best cover of the previous window, and the cover of the new frames is initialized by the greedy criterion. In the experiments, we use ΔW = 1. The comparison between the online and offline versions is shown in Fig. 10. The online version reduces the complexity of data association and bounds the output delay for long sequences.
We also use the simulation experiments to test the robustness of the posterior density function to parameter changes. Since directly comparing the shape of the posterior or evaluating the drift of its mode is difficult, we again use the STDA score to evaluate the effect of parameter changes on the MAP solution. To evaluate each component Ĉ_i ∈ Ĉ, i = 0, ..., 4, we use the estimated Ĉ_i as the center and uniformly select C_i ∈ [Ĉ_i(1 − Δ/2), Ĉ_i(1 + Δ/2)]. We generate multiple sequences with the setting N = 7 and FA = 7 and run the MCMC sampling with a posterior using the perturbed parameters. The results are shown in Fig. 11: the average STDA score does not change significantly under these parameter variations, so the posterior is robust to them. This can be understood as follows: there exists a domain in R^5 in which all of the training constraints are satisfied. Parameters that fall outside of (but close to) this valid domain violate only part of the constraint pool and are unlikely to lead to a very wrong solution on a particular sequence, since the constraint pool is generated for a general setting and the violated constraints may not be exercised by that sequence.
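A small sketch of the sliding-window bookkeeping described above follows, under the stated W and ΔW convention; the association itself (greedy initialization and MCMC sampling over each window) is only indicated in the comments.

```python
def window_schedule(T, W=50, dW=1):
    """Yield (start, end, new_start) for each sliding window of length W over T frames.
    Frames [start, new_start) overlap the previous window, so their cover would be
    initialized from the previous window's best cover; frames [new_start, end) are the
    dW new frames, initialized greedily before MCMC sampling runs on the whole window."""
    start = 0
    while start + W <= T:
        new_start = start if start == 0 else start + W - dW  # first window is fully new
        yield start, start + W, new_start
        start += dW

for s, e, n in window_schedule(T=55, W=50, dW=1):
    print(f"window [{s}, {e}): reuse cover on [{s}, {n}), greedy init on [{n}, {e})")
```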

5.3 Real Scenarios

We show results and evaluations on three video sets to demonstrate the effectiveness of our method in real scenarios. The first set is a selection from CLEAR [1], captured with a stationary camera mounted a few meters above the ground and looking down toward a street; the targets include vehicles and pedestrians. The second set, called the "campus ground set," is captured with a stationary camera on a tripod; the foreground in this set is clean, but the inter-target occlusion is extensive. The third set is a selection from the VIVID-I and II data sets, which are captured from UAV cameras. For videos captured from UAV cameras, we first compensate for the background motion.


Fig. 11. Average STDA score change with parameter variations.

This can be accomplished with an affine transformation [10], since the camera is relatively far from the scene and the background can be modeled as a plane. The main difficulty of the third data set comes from noisy foreground regions and false alarms caused by erroneous registration and parallax. The input to our tracking algorithm consists of foreground regions extracted using a dynamic background model estimated within a sliding window [10]. Tracking is performed automatically from the detected blobs without any manual initialization. In the experiments, we use online tracking with a sliding window of W = 50 and n_mc = 1,000; the first sliding window is initialized with the greedy criterion. Table 1 gives the quantitative comparison, where a complete track is defined as one in which at least 80 percent of the trajectory is tracked with no ID changes. The tracking process runs at around 3 fps on a Pentium IV 3.0 GHz PC. Some of the foreground regions used as input and some tracking results are shown in Fig. 12. One advantage of our tracking algorithm, demonstrated in both the simulation and the real data sets, is worth highlighting: because the bidirectional (forward/backward) sampling is applied symmetrically, our approach can deal with targets that are merged or split when they first appear. Fig. 13 compares our method to algorithms with only forward or only backward inference on an image sequence from the "campus ground set." The first observation is obtained at t = 48 because a 95-frame sliding window is used for motion segmentation. The colors at the bottom of each chart correspond to the labels allocated by each algorithm for the three moving persons in the sequence, while the red bars correspond to targets mislabeled due to merged observations. The proposed bidirectional sampling estimates the trajectories and labels them consistently throughout the sequence. We observe that the failure cases of this tracker often occur in the following situations. First, when the motion of a target cannot be faithfully represented by the constant velocity model, the MAP solution prefers to split the track even though the appearance is still consistent; this issue can be addressed by using a more general motion model or by directly modeling the smoothness of a trajectory.

TABLE 1
Comparative Results on Three Real Data Sets

Method 1: JPDAF in [10]; Method 2: the Proposed Method.
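One common way to realize the affine background-motion compensation mentioned above is to track sparse corners between consecutive frames and fit a 2D affine model robustly; the OpenCV-based sketch below only illustrates that idea and is not the implementation used in [10].

```python
import cv2
import numpy as np

def compensate_background(prev_gray, curr_gray):
    """Estimate an affine background motion between two grayscale frames and
    warp the current frame into the previous frame's coordinates."""
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=7)
    pts_curr, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good_prev = pts_prev[status.ravel() == 1]
    good_curr = pts_curr[status.ravel() == 1]
    # Robust (RANSAC) affine fit; parallax and moving objects are treated as outliers.
    A, _inliers = cv2.estimateAffine2D(good_curr, good_prev, method=cv2.RANSAC)
    stabilized = cv2.warpAffine(curr_gray, A, prev_gray.shape[::-1])
    return stabilized, A
```

Background subtraction is then applied to the stabilized frames, so residual registration errors show up as the noisy foreground regions and false alarms discussed above.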


Fig. 12. Experimental results for real scenarios from both stationary cameras and Unmanned Aerial Vehicle (UAV) cameras.

Also, when the scene is very crowded and targets seldom separate from each other, the tracker may fail, either merging all of them together or regarding them as false alarms. To some extent, this issue can be addressed by incorporating object model information and using it to guide the spatial and temporal MCMC sampling.

6 CONCLUSION AND DISCUSSION

We have presented a framework to find the globally optimal spatiotemporal association that maximizes the consistency of motion and appearance of targets over time. Our method overcomes the problems encountered with a one-to-one mapping between observations and targets.

A data-driven MCMC method is used to sample the solution space efficiently, and the forward and backward inference enhances the search performance. Compared to other data association algorithms, the proposed method shows a remarkable improvement both temporally (i.e., consistency of labels) and spatially (i.e., accuracy of the outlined regions). The work can be extended along the following lines. First, the target motion model can be extended to a more general model. Second, our framework can naturally incorporate object model information in two ways: 1) we can assign a model likelihood to each node to extend our likelihood function, and 2) we can use model information to drive the MCMC proposals. Third, tracking failures caused by long-term occlusions can be resolved by data association at the level of tracklets.


Fig. 13. Comparison of the bidirectional inference and single-direction inference. (a) Frames at t = 50, 55, 65, (b) forward inference only, (c) backward inference only, (d) JPDAF in [10], and (e) the proposed method.

ACKNOWLEDGMENTS

This work was supported by MURI-ARO W911NF-06-1-0094.

REFERENCES

[1] http://www.clear-evaluation.org/, 2009.
[2] P. Green, "Trans-Dimensional Markov Chain Monte Carlo," Highly Structured Stochastic Systems, Oxford Univ. Press, 2003.
[3] D. Reid, "An Algorithm for Tracking Multiple Targets," IEEE Trans. Automatic Control, vol. 24, no. 6, pp. 84-90, Dec. 1979.
[4] P.S. Maybeck, Stochastic Models, Estimation, and Control. Academic Press, Inc., 1979.
[5] P.J. Green, Trans-Dimensional Markov Chain Monte Carlo. Oxford Univ. Press, 2003.
[6] A. Yilmaz, O. Javed, and M. Shah, "Object Tracking: A Survey," ACM Computing Surveys, vol. 38, 2006.
[7] T. Yu and Y. Wu, "Collaborative Tracking of Multiple Targets," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 834-841, 2004.
[8] R.T. Collins, "Mean-Shift Blob Tracking through Scale Space," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 234-240, 2003.
[9] T. Zhao and R. Nevatia, "Tracking Multiple Humans in Crowded Environment," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 406-413, 2004.
[10] J. Kang, I. Cohen, and G. Medioni, "Continuous Tracking within and across Camera Streams," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 267-272, June 2003.
[11] C. Morefield, "Application of 0-1 Integer Programming to Multitarget Tracking Problems," IEEE Trans. Automatic Control, vol. 22, no. 3, pp. 302-312, June 1971.
[12] Z. Khan, T. Balch, and F. Dellaert, "Multitarget Tracking with Split and Merged Measurements," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 605-610, 2005.
[13] I. Cox and S. Hingorani, "An Efficient Implementation of Reid's MHT Algorithm and Its Evaluation for the Purpose of Visual Tracking," Proc. Int'l Conf. Pattern Recognition, pp. 437-443, 1994.
[14] T. Fortman, Y. Bar-Shalom, and M. Scheffe, "Sonar Tracking of Multiple Targets Using Joint Probabilistic Data Association," IEEE J. Oceanic Eng., vol. OE-8, no. 3, pp. 173-184, July 1983.
[15] S. Oh, S. Russell, and S. Sastry, "Markov Chain Monte Carlo Data Association for General Multiple-Target Tracking Problems," Proc. 43rd IEEE Conf. Decision and Control, 2004.
[16] Z. Khan, T. Balch, and F. Dellaert, "An MCMC-Based Particle Filter for Tracking Multiple Interacting Targets," Proc. European Conf. Computer Vision, pp. 279-290, 2004.
[17] Z. Khan, T. Balch, and F. Dellaert, "MCMC-Based Particle Filtering for Tracking a Variable Number of Interacting Targets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1805-1819, Nov. 2005.
[18] Z. Tu and S. Zhu, "Image Segmentation by Data Driven Markov Chain Monte Carlo," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 657-673, May 2002.
[19] A.B. Poore, "Multidimensional Assignment Formulation of Data Association Problems Arising from Multitarget and Multisensor Tracking," Computational Optimization and Applications, vol. 3, pp. 27-57, 1994.
[20] Y. Bar-Shalom, T. Fortmann, and M. Scheffe, "Joint Probabilistic Data Association for Multiple Targets in Clutter," Proc. Conf. Information Sciences and Systems, 1980.


[21] K. Smith, D. Gatica-Perez, and J.-M. Odobez, "Using Particles to Track Varying Numbers of Interacting People," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 962-969, 2005.
[22] A. Mittal and L. Davis, "M2tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene," Int'l J. Computer Vision, vol. 51, pp. 189-203, 2003.
[23] F. Dellaert, S.M. Seitz, C.E. Thorpe, and S. Thrun, "Structure from Motion without Correspondence," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000.
[24] S. Cong, L. Hong, and D. Wicker, "Markov Chain Monte Carlo Approach for Association Probability Evaluation," Proc. IEE—Control Theory and Applications, vol. 151, no. 2, pp. 185-193, Mar. 2004.
[25] P. Green, "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, vol. 82, pp. 711-732, 1995.
[26] P.S.R. Kasturi, D. Goldgof, and V. Manohar, "Performance Evaluation Protocol for Text and Face Detection and Tracking in Video Analysis and Content Extraction (VACE-II)," technical report, Univ. of South Florida, 2004.

Qian Yu received the BEng and MEng degrees from the Department of Computer Science and Technology at Tsinghua University, Beijing, in 2001 and 2004, respectively, and the PhD degree from the Computer Science Department at the University of Southern California, Los Angeles. He is currently working as a member of the technical staff at Sarnoff Corporation in Princeton, New Jersey. His research interests include computer vision, machine learning, and pattern recognition. He is a member of the IEEE and the IEEE Computer Society.


Gérard Medioni received the Diplôme d'Ingénieur from the École Nationale Supérieure des Télécommunications (ENST), Paris, in 1977, and the MS and PhD degrees from the University of Southern California (USC) in 1980 and 1983, respectively. He has been with USC since then and is currently a professor of computer science and electrical engineering, a codirector of the Institute for Robotics and Intelligent Systems (IRIS), and a codirector of the USC Games Institute. He served as the chairman of the Computer Science Department from 2001 to 2007. He has made significant contributions to the field of computer vision. His research covers a broad spectrum of the field, such as edge detection, stereo and motion analysis, shape inference and description, and system integration. He has published books, more than 50 journal papers, and 150 conference articles. He is the holder of eight international patents. He is an associate editor of the Pattern Recognition and Image Analysis Journal and the International Journal of Image and Video Processing and is on the Advisory Board of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He served as program cochair of the 1991 IEEE Computer Vision and Pattern Recognition (CVPR) Conference in Maui and the 1995 IEEE International Symposium on Computer Vision in Coral Gables, general cochair of the 1997 IEEE CVPR Conference in Puerto Rico, conference cochair of the 1998 ICPR, general cochair of the 2001 IEEE CVPR Conference in Kauai, general cochair of the 2007 IEEE CVPR Conference in Minneapolis, general cochair of the 2009 IEEE CVPR Conference in Miami, and program cochair of the 2009 IEEE Workshop in Applications of Computer Vision in Snowbird. He is a fellow of the IEEE, IAPR, and AAAI.
