Multi-Target Tracking by Learning Class-Specific and Instance-Specific Cues

Min Li, Wei Chen, Kaiqi Huang, and Tieniu Tan
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{mli,wchen,kqhuang,tnt}@nlpr.ia.ac.cn

Abstract. This paper proposes a novel particle filtering framework for multi-target tracking that uses online learned class-specific and instance-specific cues, called Data-Driven Particle Filtering (DDPF). The learned cues include an online learned geometrical model for excluding detection outliers that violate geometrical constraints, global pose estimators shared by all targets for particle refinement, and online Boosting based appearance models which select discriminative features to distinguish different individuals. Targets are clustered into two categories. A separated-target is tracked by an ISPF (incremental self-tuning particle filtering) tracker, in which particles are incrementally drawn and tuned to their best states by a learned global pose estimator; a target-group is tracked by a joint-state particle filtering method in which occlusion reasoning is conducted. Experimental results on challenging datasets show the effectiveness and efficiency of the proposed method.

1 Introduction

Multi-target tracking (MTT) plays an important role in applications such as visual surveillance, intelligent transportation and behavior analysis. It remains a challenging problem in the computer vision community. Challenges include, but are not limited to, target auto-initialization, inter-object occlusions, and the intensive computational cost caused by joint-state optimization.

Extensive work has been done on multi-target tracking over the last decades. Some work formulates MTT as a data-association problem between candidate regions and target trajectories. The multiple hypothesis tracker (MHT) [1] and the joint probabilistic data association filter (JPDAF) [2] are two classical data-association methods widely used in the computer vision context, e.g. the MHT in [3] and the JPDAF in [4]. However, MHT and JPDAF have to exhaust all possible associations and are thus computationally intensive. Recently, sampling-based techniques have been proposed for data association. In [5], an MCMC (Markov Chain Monte Carlo) based variant of the auxiliary variable particle filter is proposed, in which the MCMC sampling simulates the probability of a data association. Since data-association based methods highly depend on reliable low-level detection results and often do not consider the interactions between targets, they are vulnerable to poor detection results and occlusions.

There is also much work that focuses on multi-target tracking by Bayesian inference in state space. Data association may be involved, but is not necessary. Isard et al. [6] propose a Bayesian multi-blob tracker which combines a multi-blob likelihood function with a particle filter. In [7], Adaboost and mixture particle filters [8], which model the filtering distribution as an M-component mixture model, are combined to detect and track multiple targets fully automatically. In [9], the joint posterior is approximated by a variational density, based on which a set of autonomous yet collaborative trackers is used to overcome the coalescence problem (several trajectories sticking to the same target). Qu et al. [10] propose an interactively distributed multi-target tracking algorithm using a magnetic-inertia potential model to solve the multiple object labeling problem in the presence of occlusions. Khan et al. [11] propose a novel MCMC sampling step to obtain an efficient multi-target filter in joint-state space; however, this model cannot deal with occlusion. Yang et al. [12] propose a game-theoretic multiple target tracking algorithm, in which the tracking problem is solved by finding the Nash equilibrium of a game. Zhang et al. [13] propose a species-based PSO (particle swarm optimization) algorithm for multiple object tracking, which brings a new view to the tracking problem from a swarm intelligence perspective. Hess et al. [14] present a discriminative training method for particle filters and attempt to directly optimize the filter parameters in response to observed errors. Breitenstein et al. [15] use the continuous confidence of pedestrian detectors and online trained, instance-specific classifiers as a graded observation model to track multiple persons in the particle filtering framework.

In this paper, a data-driven particle filtering (DDPF) method for multi-target tracking is proposed, and multi-human tracking is used as a case study. Object detection is not a necessary component, but for the sake of target auto-initialization, detection is still introduced into the tracking process. The proposed DDPF framework is not a pure stochastic state inference method; instead, a set of class-specific and instance-specific cues, i.e. models learned online from low-level data, is used to make the state inference process more efficient and reliable. Targets are clustered into two categories. A separated-target is tracked by an ISPF (incremental self-tuning particle filtering) [16] method, in which particles are incrementally drawn and tuned to their best states by a learned global pose estimator; a target-group is tracked by a joint-state particle filtering method in which occlusion reasoning between group members is conducted.

The remainder of the paper is organized as follows. Section 2 gives an overview of the proposed framework. Section 3 introduces our detection and data association method. The Bayesian inference process for multi-target tracking is described in Section 4. Experimental results and analysis are presented in Section 5. Finally, we draw our conclusions in Section 6.

2 Overview of the Proposed Framework

In the particle filtering framework, tracking can be cast as an inference task in a Markov model with hidden state variables [17]:


p(X_t | O_t) \propto p(o_t | X_t) \int p(X_t | X_{t-1}) \, p(X_{t-1} | O_{t-1}) \, dX_{t-1}    (1)

where X_t describes the state variable at time t and O_t = {o_1, o_2, ..., o_t} is the set of observations up to time t. The tracking process is governed by the observation model p(o_t | X_t) and the dynamic model p(X_t | X_{t-1}) between two states. In our framework, a separated-target is tracked by a single particle filter, and X_t contains only a single state S_t (S_t = {x, y, s}, i.e. x, y translation and scale); for a group with n targets, X_t contains the joint states of the group members S_t = {S_t^i}_{i=1}^n, as well as the occlusion matrix \pi_t = {\pi_t^{ij}}_{i<j}.

3 Detection and Data Association

Geometrical Model. Detections that violate the scene's geometrical constraints are excluded as outliers: a detection with height h and foot point (x_f, y_f) is rejected if

|h - H(x_f, y_f)| / H(x_f, y_f) > T_h    (2)

where T_h is a threshold (usually set to 0.3). An example of the learned geometrical model is shown in Fig. 2 (c), which is nearly a plane. The surface shape of the geometrical model depends on the number and locations of vanishing points of the scene structures, and a planar surface usually indicates one-point perspective. One advantage of the proposed geometrical model is that it can incrementally learn scene geometrical information without camera calibration.
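As an illustration, the following is a minimal sketch of this outlier test under stated assumptions: the learned height surface H(x_f, y_f) is approximated by a least-squares plane over incrementally collected (foot point, height) samples, and the Detection structure and helper names are hypothetical, not the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Detection:
    x_foot: float   # foot point x (bottom-center of the bounding box)
    y_foot: float   # foot point y
    height: float   # detected object height in pixels

class GeometricalModel:
    """Incrementally learned surface H(x_f, y_f): expected object height
    at a given foot point (a plane fit is assumed here for simplicity)."""
    def __init__(self):
        self.samples = []            # (x_f, y_f, h) triples from confident tracks

    def update(self, x_f, y_f, h):
        self.samples.append((x_f, y_f, h))

    def predict_height(self, x_f, y_f):
        # Least-squares plane h = a*x + b*y + c over the collected samples.
        A = np.array([[x, y, 1.0] for x, y, _ in self.samples])
        h = np.array([s[2] for s in self.samples])
        a, b, c = np.linalg.lstsq(A, h, rcond=None)[0]
        return a * x_f + b * y_f + c

def is_outlier(det: Detection, model: GeometricalModel, th: float = 0.3) -> bool:
    """Eq. (2): reject a detection whose height deviates from the learned
    height H(x_f, y_f) by more than th * H(x_f, y_f)."""
    H = model.predict_height(det.x_foot, det.y_foot)
    return abs(det.height - H) > th * H
```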

Fig. 2. An example of online learned geometrical model for excluding detection outliers that violate perspective projection. (a) Detection results. Object 2 (in red) is an outlier. (b) Definitions of foot point and object height. (c) Learned surface of object height vs object’s foot point. Blue points are online collected training samples.

Data Association. After object detection and outlier filtering, detection results are associated with the trajectories of existing targets. First, a similarity matrix over trajectory-detection pairs is computed. The similarity score Sim_da between a trajectory tr and a detection d is computed as the product of appearance similarity and distance similarity:

Sim_da(tr, d) = Sim_c(tr, d) \cdot p_N(|d - tr|)    (3)

where p_N(|d - tr|) \sim N(|d - tr|; 0, \sigma_{det}^2) is a Gaussian function used to measure the distance similarity between tr and d, and Sim_c(tr, d) is the output (scaled to (0, 1) by a sigmoid function) of an online Boosting classifier used to measure the appearance similarity. Then, a simple matching algorithm is conducted: the pair (tr*, d*) with the maximum score is iteratively selected, and the rows and columns belonging to trajectory tr* and detection d* in the similarity matrix are deleted. Note that only an association with a matching score above a threshold is considered a valid match.
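A minimal sketch of this greedy matching step is given below; the similarity matrix is assumed to be precomputed via Eq. (3), and the validity threshold value is illustrative.

```python
import numpy as np

def greedy_associate(sim, valid_th=0.5):
    """Greedy trajectory-detection assignment on a similarity matrix.

    sim[i, j] = Sim_da(trajectory i, detection j) as in Eq. (3).
    Repeatedly pick the highest-scoring pair, then remove its row and
    column; pairs below valid_th are left unassociated.
    Returns a list of (trajectory_index, detection_index) matches."""
    sim = sim.astype(float).copy()
    matches = []
    while sim.size > 0 and np.max(sim) > valid_th:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((i, j))
        sim[i, :] = -np.inf   # "delete" the trajectory row
        sim[:, j] = -np.inf   # "delete" the detection column
    return matches
```

For example, `greedy_associate(np.array([[0.9, 0.2], [0.3, 0.8]]))` returns `[(0, 0), (1, 1)]`.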

Fig. 3. Comparison of appearance similarity matrices. (a) Trajectories and detections. (b) Similarity matrix obtained by feature template matching. (c) Similarity matrix obtained by Boosting classifiers. In (b) and (c), a yellow block marks the maximum of a column and a red block the second maximum.

Online Boosting Classifiers. Instance-specific Boosting classifiers are used in data association, as well as in the observation model for tracking. Our Boosting classifier is similar to [20] and is trained online on one target against all others. Patches used as positive training examples are sampled from the bounding box of the optimal state given by the tracking process. The negative training set is sampled from all other targets, augmented by background patches. Fig. 3 shows a comparison of appearance similarity matrices obtained by feature template matching (see Section 4.2) and by Boosting classifiers. As shown in Fig. 3 (b) and (c), online Boosting is much more discriminative than feature template matching: the average ratio of column maximum to second maximum is 14.5 for online Boosting, while that of feature template matching is only 1.3.
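As a rough sketch of these instance-specific appearance classifiers, the snippet below trains a one-target-versus-rest booster on feature vectors; an offline AdaBoost from scikit-learn is used as a stand-in for the online boosting with feature selection of [20], and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_instance_classifier(pos_feats, neg_feats, n_estimators=50):
    """Train a one-target-vs-rest classifier.

    pos_feats: features of patches sampled from the target's optimal state.
    neg_feats: features of patches from all other targets plus background.
    An offline AdaBoost is used here as a stand-in for the online boosting
    with feature selection described in [20]."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.hstack([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    clf = AdaBoostClassifier(n_estimators=n_estimators)
    clf.fit(X, y)
    return clf

def appearance_similarity(clf, feat):
    """Sim_c: classifier margin squashed to (0, 1) by a sigmoid."""
    score = clf.decision_function(feat.reshape(1, -1))[0]
    return 1.0 / (1.0 + np.exp(-score))
```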

4 Tracking by Particle Filtering

To decompose the full joint-state optimization problem in multi-target tracking into a set of "easy-to-handle" low-dimensional optimization problems, we adopt a "divide-and-conquer" strategy. Targets are clustered into two categories: separated-target and target-group. A separated-target is tracked by a single-state particle filter and a target-group is tracked by a joint-state particle filter. A set of targets G = {tr_i}_{i=1}^n with n ≥ 2 is considered a target-group if:

Clustering Rule: \forall tr_i \in G, \exists tr_j \in G (j \neq i), s.t. |S_i \cap S_j| / min(|S_i|, |S_j|) > T_c

Fig. 4. An example of training and using the full-body pose estimator. (a) The training process. (b) The pose-tuning process: S_0 (red) is a random particle, S_3 (yellow) is the tuned particle with the maximum similarity, and the others (black) are intermediate states. (c) The curve of similarity vs. iteration number for (b).

where S_i is the optimal state of target tr_i at frame t-1 or its predicted state at frame t using only the previous velocity, |·| is the area of a state, and T_c is a threshold (usually set to 0.4). This rule puts target pairs that have, or are about to have, significant overlapping areas into the same group. Targets that do not belong to any group are separated targets.
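A minimal sketch of this clustering rule under stated assumptions: states are represented as axis-aligned boxes (x, y, w, h), and groups are formed as connected components of the pairwise overlap graph, which is one plausible way to realize the rule; helper names are illustrative.

```python
def overlap_ratio(a, b):
    """|S_i ∩ S_j| / min(|S_i|, |S_j|) for boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return (iw * ih) / min(aw * ah, bw * bh)

def cluster_targets(boxes, tc=0.4):
    """Group targets whose predicted boxes overlap by more than tc.
    Connected components of the overlap graph form target-groups;
    singleton components are separated targets."""
    n = len(boxes)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if overlap_ratio(boxes[i], boxes[j]) > tc:
                parent[find(i)] = find(j)   # union the two targets
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```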

4.1 Global Pose Estimators

Sampling-based stochastic optimization methods usually need to sample many particles, the number of which may grow exponentially with the state dimension. However, most particles contribute almost nothing to state estimation because of their extremely small weights, while they consume most of the computational resources. To substantially reduce this unnecessary computational cost, ISPF (Incremental Self-tuning Particle Filtering) [16] uses an online learned pose estimator to guide random particles towards the correct directions, so that every random particle becomes "intelligent" and dense sampling is unnecessary. Inspired by [16], we also use pose estimators in multi-target tracking. Unlike [16], which trains one pose estimator per target, we train two global pose estimators online for all targets: f_fb, trained with full-body human samples, and f_hs, trained with only the head-shoulder parts of human samples. f_hs is used when most parts of a target are occluded by other targets except the head-shoulder part. Fig. 4 shows an example of how to train and use a pose estimator. As shown in Fig. 4 (a), we perturb the optimal state S of each target by dS to obtain training samples, then HOG (Histogram of Oriented Gradients) [21] like features F are extracted and LWPR (Locally Weighted Projection Regression) is used to learn the regression function dS = f(F). By using the regression function, as shown in Fig. 4 (b) and (c), given an initial random state around a target, we can guide it to its best state with the maximum similarity in a few iterations (each iteration consists of two main steps: \Delta S_i = f(F(S_i)) and S_{i+1} = S_i - \Delta S_i; see [16] for details).
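A minimal sketch of this pose-tuning loop is given below; the feature extractor, the learned regressor f, and the similarity function are assumed to be supplied, and the iteration and convergence limits are illustrative.

```python
def tune_particle(s0, extract_features, pose_regressor, similarity,
                  max_iters=5, eps=1e-3):
    """Refine a random particle with the learned global pose estimator.

    Each iteration applies dS = f(F(S_i)) and S_{i+1} = S_i - dS,
    keeping the intermediate state with the highest similarity."""
    s = list(s0)                                   # state (x, y, s)
    best_s, best_sim = list(s), similarity(s)
    for _ in range(max_iters):
        ds = pose_regressor(extract_features(s))   # predicted perturbation dS
        if max(abs(d) for d in ds) < eps:
            break                                  # converged: negligible correction
        s = [si - di for si, di in zip(s, ds)]
        sim = similarity(s)
        if sim > best_sim:
            best_s, best_sim = list(s), sim
    return best_s, best_sim
```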

4.2 Observation Model

To make the appearance model robust to partial occlusions, we use a part-based representation for each target, shown as p_1 to p_7 in Fig. 5 (a). Each part is represented by an L2-normalized HOG histogram and an L2-normalized IRG (Intensity/Red/Green) color histogram: F_j = {F_j^{hog}, F_j^{irg}}. The final representation is the concatenation of all part features, F = {F_j}_{j=1}^m, where m is the number of parts. Note that this feature F is also the input of the online Boosting classifiers, and it is not the same F as used in the pose estimator.
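A minimal sketch of assembling this part-based representation; the per-part HOG and IRG histogram extractors are assumed to be provided, and the part layout and histogram sizes are left unspecified, as the paper does not fix them here.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def part_feature(patch, hog_hist, irg_hist):
    """F_j = {F_j^hog, F_j^irg}: L2-normalized HOG and IRG histograms of one
    body part; hog_hist / irg_hist are assumed histogram extractors."""
    return np.concatenate([l2_normalize(hog_hist(patch)),
                           l2_normalize(irg_hist(patch))])

def target_feature(part_patches, hog_hist, irg_hist):
    """F = {F_j}_{j=1..m}: concatenation over all m parts (m = 7 in Fig. 5)."""
    return np.concatenate([part_feature(p, hog_hist, irg_hist)
                           for p in part_patches])
```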

Fig. 5. Part-based object representation and calculation of the visibility vector. (a) The target is divided into m parts (denoted {p_j}_{j=1}^m, here m = 7), and each part is represented by a HOG histogram and an IRG (Intensity/Red/Green) histogram, denoted F_j = {F_j^{hog}, F_j^{irg}}. (b) An example of the calculation of the visibility vector. Elliptical regions are the valid areas of targets. Knowing the configuration of states {S_i} and their occlusion relations {\pi^{ij}}, the visibility vector of S_1 can be easily obtained: (1, 1, 1, 1, 0, 1, 1).
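Following Fig. 5 (b), the sketch below shows one way a per-part visibility vector could be derived from the joint states and pairwise occlusion relations; the part-center geometry and the elliptical valid-area test are assumptions filled in for illustration.

```python
def inside_ellipse(px, py, box):
    """Test whether point (px, py) lies in the elliptical valid area of a
    state box (x, y, w, h), as sketched in Fig. 5 (b)."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    return ((px - cx) / (w / 2.0)) ** 2 + ((py - cy) / (h / 2.0)) ** 2 <= 1.0

def visibility_vector(i, boxes, part_centers, occludes):
    """Part j of target i is visible (1) unless its center falls inside the
    valid area of some target k that occludes i (occludes[k][i] is True)."""
    vis = []
    for (px, py) in part_centers(boxes[i]):
        covered = any(k != i and occludes[k][i] and inside_ellipse(px, py, boxes[k])
                      for k in range(len(boxes)))
        vis.append(0 if covered else 1)
    return vis
```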

Part-based Feature Template Matching. Using only the online trained Boosting classifiers as the appearance model may suffer from occlusions, so a part-based template matching method is also introduced into our appearance model. Given the state configuration X = {S, \pi} of a target-group (the subscript t is dropped), where S = {S_i}_{i=1}^n is the joint state and \pi = {\pi^{ij}}_{1\le i<j\le n} denotes the occlusion relations between group members
