This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2015.2489438, IEEE Transactions on Circuits and Systems for Video Technology IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. XX, NO. XX, 2015


Multi-target Tracking Using Hough Forest Random Field

Jun Xiang, Nong Sang, Jianhua Hou, Rui Huang, Changxin Gao

Abstract—This paper presents a novel tracking-by-detection approach for multi-target tracking. There are two major steps in our framework: data association to form global tracklet association, followed by trajectory estimation to deal with the remaining gaps. In the first step, we formulate tracklet association as an inference problem in a Hough Forest Random Field (HFRF), which combines Hough Forest (HF) and Conditional Random Field (CRF), and allows us to model both local and global tracklet relationships in one unified model. In the second step, we improve the Reversible-Jump Markov Chain Monte Carlo (RJMCMC) particle filtering method with explicit mutual-occlusion reasoning to fill in the remaining gaps from the first step and increase the overall tracking precision. Extensive experiments have been conducted on five public datasets, and the performance is comparable to the state of the art, if not better.

Index Terms—multi-target tracking, Hough forest, CRF, tracking by detection, trajectory estimation, RJMCMC, occlusion reasoning.

I. INTRODUCTION

Multi-target tracking is a fundamental task in computer vision with a wide range of applications, including visual surveillance, human behavior analysis, and video retrieval. Although many effective approaches have been proposed, robust tracking remains a challenging problem due to poor image quality, high motion complexity, large appearance variation, frequent occlusions, etc.

Recently, tracking-by-detection approaches have become increasingly popular thanks to significant improvements in object detection techniques [1], [2], [3], [4]. In the realm of multi-target tracking, these approaches generally consist of two parts [5]: data association and trajectory estimation. In the data association stage, detection responses generated by a pre-trained detector are associated to unique IDs corresponding to different targets, based on position, size, motion, appearance, etc. The problem can usually be formulated as a Maximum-A-Posteriori (MAP) estimation problem or, equivalently, an energy minimization problem. Although much progress has been achieved in data association, many methods associate detector responses or tracklets (i.e., short and reliable tracks) independently. This independence assumption makes it difficult to locally distinguish detector responses with similar appearance, especially in crowded scenes with frequent occlusions. To tackle this issue, Conditional Random Fields (CRFs) have been applied to data association [6], [7], [8], because a CRF can model long-term dependency among the detections/tracklets. However, due to the complex structure of the CRF models for multi-target tracking, parameter learning and inference in such models are often intractable. Furthermore, because the energy functions of such CRFs are not submodular, it is difficult to find the globally optimal solution. In [9], a Hough Forest Random Field (HFRF) combining Hough Forest (HF) and CRF was successfully applied to the object detection and segmentation problem. In this work, we have formulated data association as an inference problem in an HFRF and obtained good performance in multi-target tracking.

Obviously, if the detector were flawless, the multi-target tracking problem could be solved by data association alone. However, the accuracy of state-of-the-art object detectors is still far from perfect. Common errors include missed detections, false alarms, and imprecise localizations due to projected shadows or partial occlusions. The detector's limitations will inevitably produce isolated responses that cannot be linked with any other responses. These isolated and missed detections lead to trajectory gaps during the data association stage. To address this problem, trajectory estimation has been used to reconstruct the entire trajectory of each target by filling in the gaps where no detections are present. Although interpolation, as a straightforward method, has been used in [6], [7], [10], it cannot handle situations where targets move unpredictably.

Manuscript received XXX XX, 2015; revised XXX XX, 2015. This work was supported by the National Natural Science Foundation of China under Grants 61271328, 61433007, and 61141010. (Corresponding author: C. Gao.) J. Xiang, N. Sang, R. Huang and C. Gao are with the National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: [email protected] (J. Xiang), [email protected] (N. Sang), [email protected] (R. Huang), [email protected] (C. Gao). J. Hou is with the Hubei Key Laboratory of Intelligent Wireless Communications, South-Central University for Nationalities, Wuhan 430074, China. Email: [email protected] (J. Hou).
1051-8215 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

More sophisticated techniques have been proposed for this purpose, e.g., particle-filtering-based prediction [11], trajectory extension [12], etc. However, these approaches still use relatively simple motion models and do not consider the interactions between targets. In this work, we adapted the Reversible-Jump Markov Chain Monte Carlo (RJMCMC) particle filtering framework [13], [14] to take target interactions and mutual occlusions into account for trajectory estimation.

To this end, we propose a multi-target tracking framework that combines HFRF-based data association and a modified RJMCMC algorithm for trajectory estimation. In the data association step, we adapted the HFRF framework, which has seen success in the joint object recognition and segmentation task [9]. There are at least two benefits of applying this framework to the multi-target tracking problem: 1) by adding a new type of binary label variable to the traditional CRF formulation for data association, more relative interactions in the spatio-temporal domain can be considered to improve the final association; 2) the CRF inference and learning procedures are unified within the Hough forest framework, which improves the convergence rate of the Swendsen-Wang cut (SW-cut) algorithm [15] employed for the MAP inference. In the trajectory estimation step, instead of simple approaches such as interpolation, we adapted the more sophisticated RJMCMC particle filtering algorithm [13], [14] to fill in the remaining gaps and refine the final trajectories, taking explicit mutual occlusions and target interactions into account. In particular, we propose a novel mutual-occlusion reasoning model to explicitly determine the occlusion relationships among targets, so that the observation likelihood can be exactly computed, as opposed to considering all possible occlusion configurations, for more efficient MAP inference in RJMCMC.

In summary, the main contributions of this work are as follows:





• We designed a multi-target tracking scheme consisting of a novel HFRF-based data association step, followed by an auxiliary visual tracking method to deal with the usual limitations of tracking-by-detection approaches, resulting in a state-of-the-art multi-target tracker;
• The data association problem is formulated as one of inference in the HFRF model, in which the CRF learning and inference are unified within the Hough forest computational framework;
• We modified the RJMCMC algorithm to handle mutual-occlusion reasoning more efficiently.

The flowchart of our approach is illustrated in Fig. 1. The input of our algorithm is a video sequence containing targets detected (e.g., as bounding boxes) by a detector. Low-level reliable tracklets are first generated by linking detection responses in consecutive frames using a rather conservative strategy [16]. In the training stage, a Hough Forest model is learned using training samples collected from the set of reliable tracklets; there is no separate traditional CRF learning in the HFRF framework. In the testing stage, an HFRF model is built on the reliable tracklets. There are two types of hidden random variables in the model: an association variable represents the relationship between a pair of reliable tracklets satisfying certain spatial and temporal constraints, and an indicator variable further indicates the relationship among certain association nodes. The SW-cut inference algorithm is then conducted under the guidance of the statistics stored in the trained HF model. Finally, the modified RJMCMC particle filtering algorithm is employed to fill in the remaining gaps among trajectories and improve the overall tracking precision.

The rest of the paper is structured as follows: Section II reviews related work. The HFRF-based data association framework for multi-target tracking is presented in Section III. In Section IV, we describe a modified RJMCMC algorithm for trajectory estimation to address trajectory gaps and refine the final tracking results. Experimental results are presented and discussed in Section V, and conclusions are drawn in Section VI.


II. RELATED WORK

In this section, we briefly review previous research relevant to our work. Most current approaches for multi-target tracking are based on the tracking-by-detection strategy [7], [10], [17], [18]. Given noisy observations, an important issue for these approaches is to design accurate association models that handle occlusions, false alarms, and missed detections, and produce final trajectories for each target. As opposed to approaches [13], [19] that only make frame-by-frame associations, the association process can also be performed over a batch of several frames [8], [20], [18]. These multi-frame association approaches can overcome the sparsity in the detection sets induced by missed detections, but the complexity of the optimization increases dramatically. To reduce the computation, Huang et al. [16] developed a hierarchical association framework in which low-level reliable tracklets, i.e., a set of short yet confident tracks, are first generated. The resulting tracklets are then fed into a MAP association problem solved by the Hungarian algorithm at the middle level, and are further refined at a higher level to model scene exits and occluders. Following this hierarchical method, Kuo et al. [21] proposed an algorithm for online learning of a discriminative appearance model to resolve ambiguities between different targets. Other hierarchical methods are considered in [22], [23].

More recently, CRF-based tracking methods [24], [7], [10], [8] have been proposed to better distinguish difficult pairs of targets in crowded scenarios by considering long-term dependency among the detections/tracklets. Under the CRF framework, the assumption that associations between detection/tracklet pairs are independent is relaxed, so the dependencies among local associations can be exploited to improve tracking performance.
In [24], local associations, including motion dependency and occlusion dependency between two tracklets, are considered; the affinities and dependencies are represented by unary and pairwise energy terms, respectively. Different from [24], Yang et al. [7] focused on better distinguishing difficult pairs of targets by considering discriminative features: global descriptors for distinguishing different targets, and pairwise descriptors for differentiating difficult pairs, are incorporated into an online-learned CRF framework. Instead of modeling the dependencies at the tracklet level as in [24], [7], Heili et al. [10] performed association at the detection level. Specifically, the association of detection pairs is modeled based not only on a similarity measure but also on a dissimilarity measure, and the model parameters are learned in an unsupervised, data-driven fashion. Following the CRF framework in [10], Heili et al. [8] defined a novel potential function based on visual motion features, and considered the long-term connectivity between pairs of detections on which energy potentials can depend. As mentioned in Section I, CRF model inference is often intractable for multi-target tracking because the models are not submodular. To address this issue, Yang et al. [24], [7] used a heuristic algorithm, and Heili et al. [10], [8] adopted an iterative approximate algorithm.

[Fig. 1 graphic: input video → object detector → low-level association → reliable tracklets; training path: sample collection from appearance and motion cues (+1/-1 labels) → HF learning; testing path: CRF building → SW-based inference → trajectories with gaps → RJMCMC → final trajectories]

Fig. 1. System overview. Given an input sequence, our system outputs trajectories for each target in 2D world coordinates. Several sets of observation cues are employed to generate training samples, and the learned HF guides the SW-cut inference. Finally, RJMCMC particle filtering efficiently fills the remaining gaps in the candidate trajectories.

Hough Forest Random Field is a new computational framework combining Hough Forest and CRF [9]. The main motivation of HFRF is to bypass complex parameter learning by integrating the CRF learning and inference into a unified Hough forest framework. Specifically, all of the statistics required for CRF inference are estimated in a nonparametric manner using the discriminative codebook formed by the leaf nodes of the learned Hough forest. Note that HFRF was first proposed to solve the object recognition and segmentation problem; we have adapted this framework to the data association problem in multi-target tracking. More specifically, we additionally introduce a new set of binary label variables, each associated with a pair of nodes in the traditional CRF formulation, to identify short trajectories belonging to the same target. More relative interactions in the spatio-temporal domain can then be considered to help guide the SW-cut inference algorithm.

Although data-association-based tracking approaches achieve state-of-the-art results, they rely heavily on the detector's accuracy. Isolated and missed detections lead to trajectory gaps during the data association stage, so a trajectory estimation step is often necessary to fill the gaps between trajectories. Traditional linear interpolation is often used for this purpose [6], [7], [10], but its simple mechanism makes it hard to obtain satisfactory trajectories; for instance, it may produce unrealistic or unreasonable ones. As stated in [11], [5], [25], another possible solution is to use Category Free Tracking (CFT) methods, as opposed to tracking-by-detection methods, when the detections are unreliable. A popular CFT framework is based on particle filtering and its variants [26], [13], [14], [27], in which tracking is formulated as Bayesian inference in the state space.
Another example is trajectory extension [12], in which a conservative CFT algorithm is used to safely track reliable targets so that missed head or tail parts of tracklets are partially recovered. However, these approaches [11], [12] use only very simple motion models and do not consider target interactions. Following the RJMCMC framework for multi-target tracking [13], Choi et al. [14] presented an MRF model that describes target interactions using repulsive and attractive forces, while [27] designed a new observation likelihood model by introducing a vectorial occlusion variable for mutual occlusions among targets. In our work, we first extend the mutual-occlusion reasoning model proposed in [28] to explicitly infer the occlusion relationships among targets, without employing the vectorial occlusion variable, so that the observation likelihood can be exactly computed for more efficient MAP inference in RJMCMC. In addition, after the data association step we have obtained relatively good trajectories (with a small number of gaps) and the number of trajectories (i.e., the number of targets) is known, so our RJMCMC-based trajectory estimation procedure has very low computational cost.

To summarize, our work follows the tracking-by-detection strategy. During the data association step, we introduce additional binary label variables associated with pairs of the traditional CRF nodes to take more spatio-temporal dependency structures into account for association using the HFRF framework. An additional trajectory estimation step based on RJMCMC with a novel occlusion handling method is used to fill in the gaps and improve tracking precision. To the best of our knowledge, there has been no report of similar work on multi-target tracking, and our framework achieves performance comparable to the state of the art, if not better.

III. HFRF-BASED DATA ASSOCIATION

Given an input video, we first detect targets in each frame with a pre-trained detector. Similar to [16], we adopt a low-level association process to connect detection responses into short but reliable tracklets. For better readability, we first introduce the general notation used throughout the paper. A tracklet $T_i = \{d_i^{t_i^s}, \ldots, d_i^{t_i^e}\}$ is defined as a set of detection responses in consecutive frames, where $t_i^s$ and $t_i^e$ denote the start and end frames of $T_i$, and $d_i^t = \{p_i^t, s_i^t\}$ denotes the response at time $t$, including position $p_i^t$ and size $s_i^t$.

A. HFRF Model

A graph $G = (V, E, R)$ for a traditional CRF is first created over the set of tracklets $\{T_i\}$, where $V$ and $E$ denote the set of


[Fig. 2 graphic: two temporal relationships between the nodes $v_i$ and $v_j$ of an edge — (a) sequential in time, (b) time overlapping]

Fig. 2. Examples of temporal relationships between the two nodes of an edge.

[Fig. 3 graphic: HFRF model — association nodes with hidden labels $y_i$, $y_j$ and observations $r_i$, $r_j$, linked by the indicator variable $w_{ij}$]

Fig. 3. Illustration of the HFRF model. Observations of nodes are denoted $r_i$, $r_j$; together with their hidden labels $y_i$, $y_j$ they form the association nodes. $w_{ij}$ is the binary indicator of whether the two nodes belong to the same target.

nodes and edges respectively, and $R$ represents the observations. Tracklet $T_i$ is linkable to $T_j$ if the gap between the end of $T_i$ and the beginning of $T_j$ satisfies $0 < t_j^s - t_i^e < T_{thr}$, where $T_{thr} = 8$ is the maximal gap threshold between any linkable pair of tracklets. A linkable pair of tracklets $(T_{i1} \to T_{i2})$ forms a graph node $v_i = (T_{i1} \to T_{i2})$, $i = 1, \ldots, |V|$. Each edge $e_j = (v_{j1}, v_{j2}) \in E$ represents a correlation between two nodes. For efficiency, in this paper we limit the number of edges by imposing $0 < t_{j2}^s - t_{j1}^e \le T_{int}$, where $T_{int}$ is the maximal time interval between two nodes. The edges in the CRF capture the temporal relationship between two nodes, and two temporal relationships are defined: 1) "sequential in time" means that, for the tracklets in node $v_i$ and the tracklets in node $v_j$, no time overlap exists, or some tracklets are shared by the two nodes, as shown in Fig. 2(a); 2) "time overlapping" means that time overlap exists, as shown in Fig. 2(b).

Given the observations $R$, we define two sets of hidden random variables $Y$ and $W$, denoted $\Omega = \{Y, W\}$. The association variables $Y = \{y_i\}$ are a set of binary variables associated with the nodes of the above CRF: $y_i = 1$ indicates that the two tracklets in node $v_i$ should be associated, while $y_i = 0$ means the opposite. In addition, the indicator variables $W = \{w_{ij}\}$ are a set of binary label variables associated with the pairs of nodes $(v_i, v_j)$ defined above, indicating whether the two nodes of an edge belong to the same target ($w_{ij} = 1$) or not ($w_{ij} = 0$). The resulting HFRF model, illustrated in Fig. 3, first appeared in [9]. It is worth pointing out that in [9] the indicator nodes were defined to be "associated with the edges" of the original CRF, which is not to be confused with the new HFRF model (Fig. 3). In other words, the indicator nodes in the new HFRF model are generated from the edges of the original CRF.
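The node and edge construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the dict-based tracklet representation, the helper name `build_graph`, the edge heuristic of also linking nodes that share a tracklet, and the illustrative `t_int` value are our assumptions; only the linkability rule $0 < t_j^s - t_i^e < T_{thr}$ with $T_{thr} = 8$ comes from the text.

```python
def build_graph(tracklets, t_thr=8, t_int=20):
    """Build the CRF graph: nodes are linkable tracklet pairs (T_a -> T_b);
    edges connect node pairs that share a tracklet or are close in time."""
    n = len(tracklets)
    # a node is an ordered pair (a, b) with 0 < start_b - end_a < t_thr
    nodes = [(a, b) for a in range(n) for b in range(n)
             if 0 < tracklets[b]["start"] - tracklets[a]["end"] < t_thr]
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            shared = set(nodes[i]) & set(nodes[j])        # shared tracklet
            gap = tracklets[nodes[j][0]]["start"] - tracklets[nodes[i][1]]["end"]
            if shared or 0 < gap <= t_int:                # temporal proximity
                edges.append((i, j))
    return nodes, edges

# three toy tracklets given by their start/end frames
tracklets = [{"start": 0, "end": 5}, {"start": 8, "end": 14}, {"start": 18, "end": 25}]
nodes, edges = build_graph(tracklets)
```

With these toy tracklets, the two linkable pairs (T0 -> T1) and (T1 -> T2) become nodes, and a single edge connects them because they share T1.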
When we discuss "sampling edges" using the SW-cut algorithm in later sections, this is equivalent to sampling the indicator variables in the new HFRF model. Let $p_i$ and $p_{ij}$ denote the posterior distributions over nodes and pairs of nodes in the HFRF, defined as:

$p_i = p(y_i \mid r_i)$   (1)

$p_{ij} = p(y_i, y_j \mid r_i, r_j) \cdot p(w_{ij} \mid y_i, y_j, r_i, r_j)$   (2)

Then the posterior distribution of the CRF is defined as:

$p(\Omega \mid G) = p(Y, W \mid G) \propto \prod_{i \in V} p_i \cdot \prod_{(i,j) \in E} p_{ij}$   (3)

Finally, our association-based tracking is formulated as a binary labeling problem using the joint MAP assignment:

$\Omega^* = (Y^*, W^*) = \arg\max_{Y,W} p(Y, W \mid G)$   (4)
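The factorized posterior of Eqs. (1)-(4) can be made concrete with a brute-force MAP search over a toy graph. This is a hedged sketch, not the paper's implementation: the function names, the dictionary-based probability tables, and the toy numbers are invented for illustration; a real system would obtain $p_i$ and $p_{ij}$ from the Hough forest and run SW-cut instead of enumerating all assignments.

```python
import math
from itertools import product

def log_posterior(y, w, p_node, p_pair, edges):
    """Unnormalized log p(Y, W | G): sum of log p_i over nodes
    plus log p_ij over edges (Eq. 3)."""
    lp = sum(math.log(p_node[i][y[i]]) for i in range(len(y)))
    lp += sum(math.log(p_pair[e][(y[e[0]], y[e[1]], w[e])]) for e in edges)
    return lp

def map_assignment(n_nodes, edges, p_node, p_pair):
    """Exhaustive argmax over (Y, W) of Eq. (4) -- feasible only for toy graphs."""
    best = None
    for y in product([0, 1], repeat=n_nodes):
        for wv in product([0, 1], repeat=len(edges)):
            w = dict(zip(edges, wv))
            lp = log_posterior(y, w, p_node, p_pair, edges)
            if best is None or lp > best[0]:
                best = (lp, y, w)
    return best

# toy tables: two nodes, one edge; the pair table strongly favors (y_i=1, y_j=1, w_ij=1)
p_node = [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}]
pair_table = {k: 0.05 for k in product([0, 1], repeat=3)}
pair_table[(1, 1, 1)] = 0.6
best = map_assignment(2, [(0, 1)], p_node, {(0, 1): pair_table})
```

Under these tables the MAP assignment links both tracklet pairs and marks them as the same target.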

B. HFRF Inference

For HFRF inference, like [9] we use the modified Swendsen-Wang cut (SW-cut) algorithm [15], which iterates Metropolis-Hastings (MH) reversible jumps to find a global MAP solution. SW-cut iterates the following two steps: 1) Graph clustering probabilistically samples edges based on $p(y_i, y_j \mid r_i, r_j)$ and samples their labels $w_{ij}$ based on $p(w_{ij} \mid y_i, y_j, r_i, r_j)$. This forms connected components (CCs). A CC is a subset of neighboring association nodes with the same label $y_i = y_j$ connected with $w_{ij} = 1$. In our tracking framework, a CC represents the trajectory of a particular target. Edges that are not sampled are said to have been randomly "cut"; the cut is the set of edges that would have linked the CC to external association nodes. 2) Graph relabeling first randomly selects one of the CCs obtained in graph clustering, then alters the labels of all association nodes in that CC, and finally cuts the edges linking to the rest of the graph nodes having the same node label.

A particular label assignment to the edges and nodes of the graph jointly defines one state in the space of inference solutions. In each iteration, SW-cut randomly decides whether to accept the new state or keep the previous one. Let $q(A \to B)$ be the proposal distribution from state $A$ to $B$, and $q(B \to A)$ the converse. The acceptance rate of the MH jump from state $A$ to $B$ is defined as

$\alpha(A \to B) = \min\left(1, \frac{q(B \to A)}{q(A \to B)} \cdot \frac{p(\Omega = B \mid G)}{p(\Omega = A \mid G)}\right)$   (5)

As shown in [9], [15], the acceptance rate can be further formulated as:

$\alpha(A \to B) = \min(1, qr(A \to B) \cdot pr(A \to B))$   (6)

$qr(A \to B) = \frac{q(B \to A)}{q(A \to B)} = \frac{\prod_{(i,j) \in Cut_B}(1 - p_{ij}^B)}{\prod_{(i,j) \in Cut_A}(1 - p_{ij}^A)}$   (7)

$pr(A \to B) = \frac{p(\Omega = B \mid G)}{p(\Omega = A \mid G)} = \prod_{i \in CC} \frac{p_i^B}{p_i^A} \cdot \prod_{j \in N(i)} \frac{p_{ij}^B}{p_{ij}^A}$   (8)

In equations (7) and (8), $Cut_A$ and $Cut_B$ respectively represent the cut edges in states $A$ and $B$, and $N(i)$ is the set of neighbors of node $v_i$. With probability $\alpha(A \to B)$, the label state moves from $A$ to $B$; otherwise it remains in state $A$, and a different CC is then randomly selected for another new state. One iteration of the SW-cut algorithm is shown in Fig. 4. As mentioned in [9], the jumps between two states are governed only by the two ratios in equations (7) and (8). These two ratios can be directly estimated by the Hough Forest (HF), which will be expounded in the following.
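The acceptance rate of Eqs. (6)-(8) is a product of simple ratios, which can be sketched directly. The function below is an illustrative assumption (the name and data layout are ours, not the paper's implementation); edge keys are assumed to be ordered `(min, max)` tuples.

```python
def acceptance_rate(cut_A, cut_B, cc, neighbors, p_i_A, p_i_B, p_ij_A, p_ij_B):
    """alpha(A->B) = min(1, qr * pr): qr over cut edges (Eq. 7),
    pr over the selected CC and its neighborhood (Eq. 8)."""
    qr = 1.0
    for e in cut_B:                      # product of (1 - p_ij^B) over Cut_B
        qr *= (1.0 - p_ij_B[e])
    for e in cut_A:                      # divided by (1 - p_ij^A) over Cut_A
        qr /= (1.0 - p_ij_A[e])
    pr = 1.0
    for i in cc:                         # node posterior ratio over the CC
        pr *= p_i_B[i] / p_i_A[i]
        for j in neighbors[i]:           # pair posterior ratio over N(i)
            e = (min(i, j), max(i, j))   # edges keyed as ordered tuples
            pr *= p_ij_B[e] / p_ij_A[e]
    return min(1.0, qr * pr)
```

For example, if the cuts are identical in both states and the node posterior of the CC halves when moving to B, the jump is accepted with probability 0.5.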


[Fig. 4 graphic: five-node graphs for states A and B with edge probabilities $p_{12}, p_{23}, p_{24}, p_{25}, p_{45}$; the CC formed by nodes 2 and 3 is selected, edges (2,4) and (2,5) are cut in state B, and panel (c) instantiates the ratios of equations (7) and (8)]

Fig. 4. One iteration of the SW-cut algorithm. The filled color circles denote the nodes of the CRF. (a) A connected component in initial state A is generated by probabilistically sampling edges (solid lines) or "cutting" edges (dashed lines). SW-cut randomly selects the CC constituted by nodes 2 and 3; the neighboring nodes of node 2 are 1, 4, and 5. (b) New state B is generated by randomly changing the color of the nodes in the CC. Now the CC has the same label as nodes 4 and 5, which results in cutting edges (2, 4) and (2, 5) in state B. (c) Computation of the two distribution ratios in equations (7) and (8).


C. Learning

As mentioned above, we employ an HF learning framework to estimate the two probability ratios formulated in (7) and (8). This section introduces the model in detail in five parts: feature description, sample collection, forest construction, the definition of three types of statistics, and finally the estimation of the two ratios.

1) Feature description: We describe features for tracklets in terms of appearance and motion. Considering a detection $d_i^t$ of tracklet $T_i$ at time $t$, for appearance description we extract 4-channel color histograms represented as $\{h_{i,c}\}$, $c \in \{whole, head, torso, legs\}$. Here $c$ is a channel variable indicating the parts of the target, defined by vertically partitioning the bounding box of the detection into three parts: the top 20% aims to capture the "head", and the middle 40% and bottom 40% aim to capture the "torso" and "legs", respectively. For each channel, we employ the Gaussian-weighted color histogram method proposed by Milan [28] to avoid taking into account too many pixels from the background. The appearance descriptor of $T_i$ is then described as:

$F_i^a = \{f_{i,a}^{t_i^s}, \cdots, f_{i,a}^{t_i^e}\}$   (9)

where $f_{i,a}^t = \{h_{i,whole}^t, h_{i,head}^t, h_{i,torso}^t, h_{i,legs}^t\}$, $t \in [t_i^s, t_i^e]$, represents the 4-channel histograms of $T_i$ at time $t$.

For the motion model, we utilize the point-based Median Flow tracking method [29] to extract a sparse motion flow histogram. A brief description of this method is given here; for more detail we refer the reader to [29]. For tracklet $T_i$, we consider each pair of frame images $I_t$, $I_{t+1}$, $t \in [t_i^s, t_i^e - 1]$, and let $\beta_t$, $\beta_{t+1}$ denote the corresponding bounding boxes of the detections. A set of points is initialized on a rectangular grid (with 10-pixel intervals) within $\beta_t$. These points are tracked by the Lucas-Kanade (LK) tracker [30], which generates a sparse motion flow between $I_t$ and $I_{t+1}$. Let $TP^{ini} = \{p_1^t, \ldots, p_n^t\}$ denote the positions of the initial points, and $TP_f = \{p_1^{t+1}, \ldots, p_k^{t+1}\}$ the tracking result of LK, where $f$ stands for forward and $k$ is the number of points successfully tracked and located within bounding box $\beta_{t+1}$. The points in $TP_f$ are then tracked backward from $I_{t+1}$ to $I_t$, resulting in a new point set $TP_b = \{\hat{p}_1^t, \ldots, \hat{p}_m^t\}$, successfully tracked from $TP_f$ and located within $\beta_t$. The Forward-Backward (FB) error of a point pair is defined as the distance between the corresponding points in $TP_b$ and $TP^{ini}$, and poorly tracked points are filtered out if their FB errors are larger than a threshold. Finally, the sparse motion flow is extracted from the remaining points by quantizing it into 12 orientation bins, and the motion feature of tracklet $T_i$ at time $t$ is described as $f_{i,m}^t = \{x_{i,ang}^t, x_{i,dis}^t\}$, where $x_{i,ang}^t$ is the average angle of the motion flows falling in the largest orientation bin and $x_{i,dis}^t$ is the average motion displacement. The motion set of $T_i = \{d_i^{t_i^s}, \cdots, d_i^{t_i^e}\}$ is then described as:

$F_i^m = \{f_{i,m}^{t_i^s}, \ldots, f_{i,m}^{t_i^e - 1}\}$   (10)

By combining the appearance model and motion model, the complete feature representation of $T_i$ is defined as:

$F_i = \{F_i^a, F_i^m\} = \{f_{i,a}^{t_i^s}, \ldots, f_{i,a}^{t_i^e}, f_{i,m}^{t_i^s}, \ldots, f_{i,m}^{t_i^e - 1}\}$   (11)
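The part-based, Gaussian-weighted appearance histograms can be sketched as follows. This is a simplified stand-in, not the authors' implementation: it histograms a single channel instead of full color, and the bin count and Gaussian width are illustrative assumptions; only the 20%/40%/40% vertical partition and the idea of down-weighting border (background) pixels come from the text.

```python
import numpy as np

def part_boxes(x, y, w, h):
    """Vertical 20/40/40 split of a detection box into head / torso / legs."""
    return {
        "whole": (x, y, w, h),
        "head":  (x, y, w, int(0.2 * h)),
        "torso": (x, y + int(0.2 * h), w, int(0.4 * h)),
        "legs":  (x, y + int(0.6 * h), w, h - int(0.6 * h)),
    }

def gaussian_weighted_hist(patch, bins=8, sigma_frac=0.5):
    """Histogram of the first channel of `patch`, each pixel weighted by a
    centered Gaussian so border pixels (likely background) count less."""
    hgt, wdt = patch.shape[:2]
    ys, xs = np.mgrid[0:hgt, 0:wdt]
    cy, cx = (hgt - 1) / 2.0, (wdt - 1) / 2.0
    wgt = np.exp(-(((ys - cy) / (sigma_frac * hgt)) ** 2
                   + ((xs - cx) / (sigma_frac * wdt)) ** 2) / 2.0)
    # quantize pixel values into `bins` and accumulate Gaussian weights per bin
    q = np.minimum((patch[..., 0].astype(float) * bins / 256).astype(int), bins - 1)
    hist = np.zeros(bins)
    np.add.at(hist, q.ravel(), wgt.ravel())
    return hist / hist.sum()
```

A uniform patch produces a histogram with all its mass in a single bin, regardless of the Gaussian weighting.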

2) Training samples and test samples: Similar to [16], [6], we collect positive and negative training samples using two spatio-temporal constraints: 1) detection responses in one tracklet describe the same target; 2) any responses in two different tracklets that overlap in time represent different targets. Consequently, positive samples are collected from pairs of responses within the same tracklet, and negative samples are extracted from pairs of responses coming from two different tracklets that overlap in time. We denote the feature vectors of a positive sample $r^+$ and a negative sample $r^-$ as

$r^+ = \{\rho(f_{i,a}^{t_1}, f_{i,a}^{t_2}), av(f_{i,m}^{t_1}, f_{i,m}^{t_2})\}, \quad \forall t_1, t_2 \in [t_i^s, t_i^e - 1]$
$r^- = \{\rho(f_{i,a}^{t_1}, f_{j,a}^{t_2}), 0, 0\}, \quad \forall t_1 \in [t_i^s, t_i^e], \forall t_2 \in [t_j^s, t_j^e], \ i \neq j$   (12)

where $\rho$ is the Bhattacharyya distance measuring the similarity between appearance features, $av$ is the average function, and the motion feature for a negative sample is defined as the 2D vector $(0, 0)$. For the sake of simplicity, we denote one training sample, positive or negative, in the unified form

$r^+ (\text{or } r^-) = r_i(x_{i,a}, x_{i,m}, y_i)$   (13)

where $x_{i,a} \in \Re^4$ represents the similarity between two appearance descriptors, $x_{i,m} \in \Re^2$ is the motion vector, and $y_i$ indicates the positive or negative class label.

Now we define the feature description for a CRF node. Considering a node $v_i = (T_{i1} \to T_{i2})$, the complete feature representation of $T_{i1}$ is given in Eq. (11) as $F_{i1} = \{F_{i1}^a, F_{i1}^m\}$. It is reasonable to believe that information around the center of a tracklet deserves more confidence than that from the head or tail of the tracklet. In this sense, we formally represent the whole tracklet observations in a Gaussian kernel

based weighted-average manner:

F'_{i1} = \{f'_{i1,a}, f'_{i1,m}\}
f'_{i1,a} = \frac{1}{N} \sum_{t \in [t_{i1}^s, t_{i1}^e]} f_{i1,a}^t \cdot \exp\left(-\left|t - In\left(t_{i1}^s + \tfrac{N-1}{2}\right)\right|^2 / 2\sigma^2\right)
f'_{i1,m} = \frac{1}{N-1} \sum_{t \in [t_{i1}^s, t_{i1}^e - 1]} f_{i1,m}^t \cdot \exp\left(-\left|t - In\left(t_{i1}^s + \tfrac{N-2}{2}\right)\right|^2 / 2\sigma^2\right)

(14)

Here In is the rounding operator and N = t_i^e - t_i^s + 1. The feature description of a CRF node can then be defined as:

r'_i = \{\rho(f'_{i1,a}, f'_{i2,a}), av(f'_{i1,m}, f'_{i2,m})\} = r'_i(x'_{i,a}, x'_{i,m})

(15)

Fig. 5. Illustration of the persistence constraint model, where p_{i2}^{t_{i2}^s} is the real position of T_{i2} at t_{i2}^s, p_{i1}^{t_{i1}^e} + f_{i1,m}^{t_{i1}^e - 1} \cdot (t_{i2}^s - t_{i1}^e) is the position estimated by the linear motion model of T_{i1} at time t_{i2}^s, and f_{i1,m}^{t_{i1}^e - 1} represents the velocity of T_{i1} at t_{i1}^e.
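The Gaussian-weighted tracklet feature of Eq. (14) and the node feature of Eq. (15) can be sketched as follows. This is a minimal illustration under simplifying assumptions: per-frame descriptors are indexed from 0 rather than t^s (so the centre is just round((N-1)/2)), \sigma is treated as a free parameter, and appearance descriptors are assumed to be histograms so a Bhattacharyya-style distance applies.

```python
import numpy as np

def gaussian_weighted_feature(feats, sigma=2.0):
    """Weighted average of per-frame features as in Eq. (14).

    feats: (N, d) array-like of per-frame descriptors. Frames near the
    tracklet centre get higher weight; the sum is divided by N (or N-1
    for motion, handled by the caller passing the shorter list)."""
    feats = np.asarray(feats, dtype=float)
    N = len(feats)
    centre = round((N - 1) / 2)                  # the rounding operator In(.)
    w = np.exp(-np.abs(np.arange(N) - centre) ** 2 / (2 * sigma ** 2))
    return (w[:, None] * feats).sum(axis=0) / N

def bhattacharyya(p, q):
    """Bhattacharyya distance between two (near-)normalized histograms."""
    return -np.log(np.sum(np.sqrt(p * q)) + 1e-10)

def node_feature(T1, T2, sigma=2.0):
    """CRF-node feature r'_i for v_i = (T1 -> T2) in the spirit of
    Eq. (15): distance of weighted appearance features, average of
    weighted motion features. T1/T2 are dicts with 'app'/'mot' lists."""
    a1 = gaussian_weighted_feature(T1['app'], sigma)
    a2 = gaussian_weighted_feature(T2['app'], sigma)
    m1 = gaussian_weighted_feature(T1['mot'], sigma)
    m2 = gaussian_weighted_feature(T2['mot'], sigma)
    return {'x_a': bhattacharyya(a1, a2), 'x_m': 0.5 * (m1 + m2)}
```

With a very large \sigma the weighting degenerates to a plain average, which is a convenient sanity check.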

For clarity, we use r_i to denote the feature description of both a training sample as in (13) and a node as in (15), unless the distinction is explicitly needed. So far, we have discussed the feature description of tracklets, based on which we explained the collection of training samples and defined the features of training samples as well as nodes. Next, we use the training samples to construct the Hough forest.
3) Constructing the Hough Forest: As a special case of random forests [31], Hough Forests (HF) [4] have recently attracted considerable attention in computer vision. Methodologically, we construct the HF in a similar way as [4], and define two optimal splitting measures based on class uncertainty and motion-flow uncertainty. Let R_k = \{r_i(x_{i,a}, x_{i,m}, y_i)\} denote the set of training examples reaching a tree node k. The class-label uncertainty U_1(R_k) = |R_k| \cdot H(Y) measures the impurity of labels, where H(Y) is the class entropy and |R_k| is the number of training examples. The motion-flow uncertainty

U_2(R_k) = \frac{1}{|R_k^+|} \sum_{i \in R_k^+} |x_{i,m} - \bar{x}_{R_k,m}|^2

measures the motion impurity of the training set, where \bar{x}_{R_k,m} is the mean of the motion vectors over R_k and R_k^+ denotes the set of positive samples.
One key element of HF learning is to find a test function that splits the training examples at a node so as to minimize the uncertainty measures. To split the samples at tree node k, we generate a pool of tests \{\tau^k\}, each defined by a channel variable c \in \{whole, head, torso, legs\} and a real threshold value \delta. First, a feature channel of the appearance descriptor is uniformly sampled, and a threshold \delta is chosen randomly within the range of feature values of the selected channel. Then, with equal probability, we pick the test \tau^{k*} that minimizes either U_1(R_k) or U_2(R_k):

\tau^{k*} = \arg\min_{\tau^k} \left( U_*(\{i \mid \tau^k(x_{i,a}) < \delta\}) + U_*(\{i \mid \tau^k(x_{i,a}) \geq \delta\}) \right)

(16)

where ∗ = 1 or ∗ = 2. In this manner, tree construction ensures that the samples arriving at the leaves have low variation in both class labels and motion flows.
4) The Statistics: After the HF is constructed, we compute three types of statistics from the evidence collected in every leaf node of the HF. These three statistics, together with a trajectory persistence constraint, are used to estimate the two ratios in Eqs. (7) and (8).
Firstly, supposing r_i arrives at a leaf node L in a tree of the HF, we compute the number of training examples that reached L and belong to class y, denoted \Phi_L = \{\phi_L(y) : y = 1 \text{ or } 0\}. We divide the statistic p_i = p(y_i|r_i) into two parts: a sample-label-based probability p_{lab}(\cdot) and a trajectory-persistence-constraint-based probability p_{per}(\cdot). The label posterior p_{lab}(\cdot) is estimated by normalizing \phi_L(y) by the total number of samples in L:

p_{lab}(y_i|r_i) = \frac{\phi_L(y = y_i)}{\sum_{y \in \{0,1\}} \phi_L(y)}

(17)

We define p_{per}(\cdot) as:

p_{per}(y_i|r_i) = \delta(y_i - 1) \cdot p_p(r_i) + \delta(y_i) \cdot (1 - p_p(r_i))

(18)

Here p_p(r_i) = G(\Delta p; 0, \sigma_p^2) is defined based on the position distance between the two tracklets of r_i. As shown in Fig. 5, \Delta p = p_{i2}^{t_{i2}^s} - p_{i1}^{t_{i1}^e} - f_{i1,m}^{t_{i1}^e - 1} \cdot (t_{i2}^s - t_{i1}^e) denotes the difference between the estimated position of T_{i1} and the real position of T_{i2} at time t_{i2}^s. G(\cdot\,; 0, \sigma_p^2) is the zero-mean Gaussian function, and f_{i1,m}^{t_{i1}^e - 1} represents the velocity of T_{i1} at time t_{i1}^e. The main purpose of the persistence term is to reduce identity switches, under the assumption that targets move slowly and, in most cases, smoothly. In other words, if y_i = 1, p_{per}(y_i|r_i) should be high when the locations of the two tracklets are close; if y_i = 0, p_{per}(y_i|r_i) should be high when the locations of the two tracklets are far apart. Finally, p_i = p(y_i|r_i) is expressed as:

p(y_i|r_i) = p_{lab}(y_i|r_i) \cdot p_{per}(y_i|r_i)

(19)
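The unary estimate of Eqs. (17)-(19) can be sketched as below. One simplifying assumption on our part: the Gaussian G(·; 0, \sigma_p^2) is left unnormalised (value 1 at \Delta p = 0), so that p_p lies in [0, 1] and 1 - p_p is a valid probability.

```python
import numpy as np

def p_lab(phi_L, y):
    """Label posterior from leaf counts, Eq. (17).
    phi_L maps class label (0 or 1) to its count at leaf L."""
    return phi_L[y] / (phi_L[0] + phi_L[1])

def p_per(y, pos2_start, pos1_end, vel1_end, dt, sigma_p=1.0):
    """Persistence term, Eq. (18): unnormalised Gaussian on the gap
    Delta-p between the linearly extrapolated end of T_i1 and the
    start of T_i2, dt frames later."""
    dp = np.linalg.norm(np.asarray(pos2_start) - np.asarray(pos1_end)
                        - np.asarray(vel1_end) * dt)
    pp = float(np.exp(-dp ** 2 / (2 * sigma_p ** 2)))
    return pp if y == 1 else 1.0 - pp

def p_unary(y, phi_L, pos2_start, pos1_end, vel1_end, dt, sigma_p=1.0):
    """Combined unary posterior p(y_i | r_i) of Eq. (19)."""
    return p_lab(phi_L, y) * p_per(y, pos2_start, pos1_end, vel1_end,
                                   dt, sigma_p)
```

When the extrapolated position of T_{i1} lands exactly on the start of T_{i2}, the persistence term is 1 for y = 1 and 0 for y = 0, matching the intuition stated above.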

Secondly, we estimate the joint posterior probability p(y_i, y_j|r_i, r_j). Let (r_i, r_j) \in \{sequential in time, time overlapping\} denote the temporal relationship between two CRF nodes, and let r_i and r_j arrive at leaf nodes L and L', respectively. For each pair of leaf nodes (L, L'), we compute the two-class counts \Psi_{L,L'} = \{\psi_{L,L'}(y, y', r, r')\}, where \psi_{L,L'}(y, y', r, r') is the number of training pairs belonging to class (y, y') that reached (L, L') and have the same type of temporal relationship as (r, r'). The joint posterior p(y_i, y_j|r_i, r_j) is estimated by normalizing \psi_{L,L'}(y_i, y_j, r_i, r_j) by the number of training pairs in (L, L') with the same relationship as (r_i, r_j):

p(y_i, y_j|r_i, r_j) = \frac{\psi_{L,L'}(y_i, y_j, r_i, r_j)}{M_{LL'}(r_i, r_j)}

(20)

Here M_{LL'}(r_i, r_j) is the number of pairs arriving at leaves L and L' with the same relationship as (r_i, r_j).
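A minimal sketch of the Eq. (20) estimate, under the assumption that the pairwise leaf statistics for a given leaf pair (L, L') are stored flat in a dictionary keyed by (relationship, y, y'); this indexing scheme is ours, not the paper's.

```python
def joint_posterior(pair_counts, rel, yi, yj):
    """Estimate p(y_i, y_j | r_i, r_j) as in Eq. (20).

    pair_counts maps (rel, y, y') to the number of training pairs that
    reached this leaf pair (L, L') with temporal relationship rel
    ('sequential' or 'overlapping'). The denominator M_{LL'} is the
    total count of pairs with that relationship at (L, L')."""
    M = sum(n for (r, _, _), n in pair_counts.items() if r == rel)
    return pair_counts.get((rel, yi, yj), 0) / M if M else 0.0
```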


Thirdly, we estimate p(w_{i,j} = 1|y_i = y_j, r_i, r_j), the probability that two graph nodes with the same class label belong to the same target instance. Intuitively, this probability is directly proportional to motion-flow similarity, and should be low if y_i = y_j = 0. In this paper, we empirically set p(w_{i,j} = 1|y_i = y_j = 0, r_i, r_j) = 0.1, and formulate p(w_{i,j} = 1|y_i = y_j = 1, r_i, r_j) as a Gaussian-kernel-based Parzen estimator:

p(w_{i,j} = 1|y_i = y_j = 1, r_i, r_j) = \frac{1}{N_{L,L'}^{(r_i,r_j)}} \sum_{\substack{n \in L^+,\, n' \in L'^+ \\ (n,n') \sim (r_i,r_j)}} \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(|x_{n,m} - x_{n',m}| - \Delta\bar{x}_m)^2}{2\sigma^2}\right)

(21)
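A sketch of this Parzen estimate, assuming the positive pairs stored at the leaf pair are passed in as a flat list of motion-vector pairs, and interpreting \Delta\bar{x}_m as the motion distance of the query pair itself; both are our assumptions about details the extraction leaves implicit.

```python
import numpy as np

def p_same_target(xi_m, xj_m, pos_pairs, sigma=1.0):
    """Gaussian-kernel Parzen estimate of p(w_ij = 1 | y_i = y_j = 1).

    pos_pairs: list of (x_m, x'_m) motion vectors of positive training
    pairs at the leaf pair (L+, L'+) sharing the temporal relationship
    of (r_i, r_j). The kernel compares each stored pair's motion
    distance with the query pair's motion distance."""
    d_query = np.linalg.norm(np.asarray(xi_m) - np.asarray(xj_m))
    if not pos_pairs:
        return 0.0
    norm = 1.0 / (2.0 * np.pi * sigma ** 2)
    dens = sum(norm * np.exp(
        -(np.linalg.norm(np.asarray(a) - np.asarray(b)) - d_query) ** 2
        / (2 * sigma ** 2)) for a, b in pos_pairs)
    return float(dens) / len(pos_pairs)
```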

First, the posterior distribution ratio of a single node is formulated as a product of two parts:

\frac{p_i^B}{p_i^A} = \frac{1}{H} \sum_{h=1}^{H} \frac{p_{lab}^h(y_i^B|r_i)}{p_{lab}^h(y_i^A|r_i)} \cdot \frac{p_{per}(y_i^B|r_i)}{p_{per}(y_i^A|r_i)}
= \frac{1}{H} \sum_{h=1}^{H} \underbrace{\frac{\phi_{L_i^h}(y_i^B)}{\phi_{L_i^h}(y_i^A)}}_{a} \cdot \underbrace{\frac{\delta(y_i^B - 1) \cdot p_p(r_i) + \delta(y_i^B) \cdot (1 - p_p(r_i))}{\delta(y_i^A - 1) \cdot p_p(r_i) + \delta(y_i^A) \cdot (1 - p_p(r_i))}}_{b}

(22)

An illustration of estimating part (a) of Eq. (22) for a tree is shown in Fig. 6(a). Here y_i^A and y_i^B are colored green and red, respectively. Then we obtain \phi_{L_i^h}(y_i^B) / \phi_{L_i^h}(y_i^A) = 3/6. Second, the posterior distribution ratio of a pair of nodes is formulated as a product of two parts:

\frac{p_{ij}^B}{p_{ij}^A} = \frac{1}{H} \sum_{h=1}^{H} \frac{p^h(y_i^B, y_j^B|r_i, r_j)}{p^h(y_i^A, y_j^A|r_i, r_j)} \cdot \frac{p^h(w_{ij}|y_i^B, y_j^B, r_i, r_j)}{p^h(w_{ij}|y_i^A, y_j^A, r_i, r_j)}
= \frac{1}{H} \sum_{h=1}^{H} \underbrace{\frac{\psi_{L_i^h L_j^h}(y_i^B, y_j^B, r_i, r_j)}{\psi_{L_i^h L_j^h}(y_i^A, y_j^A, r_i, r_j)}}_{a} \cdot \underbrace{\frac{p^h(w_{ij}|y_i^B, y_j^B, r_i, r_j)}{p^h(w_{ij}|y_i^A, y_j^A, r_i, r_j)}}_{b}

(23)

An illustration of estimating part (a) of Eq. (23) for a tree is shown in Fig. 6(b). Suppose y_i^A and y_i^B are labeled green and red, respectively, while y_j^A and y_j^B are labeled red and green, respectively. Considering the denominator in part (a) of Eq. (23), \psi_{L_i^h L_j^h}(y_i^A, y_j^A, r_i, r_j) denotes the number of pairs satisfying the following two conditions: 1) samples with green color in L_i^h correspond to samples with red color in L_j^h; and 2) these pairs of samples have the same relationship as (r_i, r_j). In this example, there are three pairs linked by green

Fig. 6. Illustration of estimating the count ratios for one tree: (a) the leaf counts \phi_{L_i^h}(y_i^B) = 3 and \phi_{L_i^h}(y_i^A) = 6 at node i; (b) the pair counts \psi_{L_i^h L_j^h}(y_i^B, y_j^B, r_i, r_j) = 2 and \psi_{L_i^h L_j^h}(y_i^A, y_j^A, r_i, r_j) = 3 at nodes i and j.
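Per tree, the ratios in Eqs. (22) and (23) reduce to a leaf-count ratio times a second factor; a minimal sketch follows, in which the second factor (the persistence ratio of Eq. (22), or the compatibility ratio of Eq. (23)) is passed in precomputed and, as a simplification on our part, assumed constant across trees.

```python
def posterior_ratio(tree_counts, second_factor=1.0):
    """Forest-averaged ratio in the style of Eqs. (22)/(23): the mean
    over the H trees of the leaf-count ratio (part a) times a second
    factor (part b).

    tree_counts: list of (count_B, count_A) per tree, i.e. the phi
    (or psi) leaf counts for the proposed labelling B vs. current A."""
    eps = 1e-10                    # guard against an empty-leaf count
    H = len(tree_counts)
    return sum(b / max(a, eps) for b, a in tree_counts) * second_factor / H
```

For the 3/6 example discussed in the text, a two-tree forest with those counts in both trees yields a ratio of 0.5.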