Online Multiple Object Tracking with the Hierarchically Adopted GM-PHD Filter using Motion and Appearance

Young-min Song and Moongu Jeon, Member, IEEE
Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, Korea
{sym, mgjeon}@gist.ac.kr

Abstract
This paper presents an online multiple object tracking (MOT) method based on tracking by detection. Tracking by detection suffers from inherent problems caused by false detections and missed detections. To deal with false detections, we employ the Gaussian mixture probability hypothesis density (GM-PHD) filter, because this filter is robust when processing noisy, random data containing many false observations, and we revise it for visual MOT. To handle missed detections, we propose a hierarchical tracking framework that associates fragmented or ID-switched tracklets. Experiments on the representative dataset PETS 2009 S2L1 show that our framework effectively reduces the errors caused by false and missed detections and achieves real-time capability.

Keywords: Multiple object tracking; tracking by detection; online

1. Introduction
Multiple object tracking (MOT) is one of the most important research areas in visual surveillance. High-end computers have become inexpensive, and their computing power has improved steadily for decades, so MOT techniques are now ready for commercial applications. Real-time applications require MOT algorithms that operate online. However, many state-of-the-art techniques [2-6] run offline: they achieve good performance but lack real-time capability. We therefore propose an online MOT method for real-time applications. Our method adopts the tracking-by-detection approach, which many online and offline MOT techniques use. Tracking by detection involves intrinsic problems arising from false detections and missed detections. Normally, false detections cause object ID switches and missed detections cause fragmented tracklets.

To solve these problems, we devised an online multi-object tracker with the hierarchically adopted Gaussian mixture probability hypothesis density (GM-PHD) filter using motion and appearance. The GM-PHD filter [8] was proposed for tracking random and noisy data such as radar observations, which means that it is well suited to dealing with false alarms. Hence, we revised it for visual MOT, expecting it to handle the false detection problem. We chose the data association method [9] for the GM-PHD filter because it associates n target states at time k-1 with m observations at time k in time proportional to n × m. To handle missed detections, we devised a two-level association. One level is the low-level association between detections and tracked targets, frame by frame. The other is the mid-level association between fragmented tracklets.

The proposed online MOT framework using two-level association is described in Section 2, including the revised parameters of the GM-PHD tracker for visual MOT. In Section 3, experiments show that our algorithm achieves real-time capability and that the performance improves stage by stage.

2. Algorithm Description

2-1. The GM-PHD tracker for visual MOT
The multi-target tracker based on the GM-PHD filter, called the GM-PHD tracker [10], is revised for visual multi-object tracking with the following steps:

Step 0: Initialization
At the initial time k = 0, the Gaussian mixture is initialized from the initial state set, which is derived from the observation set Z_0 = {z_0^(1), ..., z_0^(I)}, i.e., the detection results. An observation vector consists of (x^(i), y^(i), w^(i), h^(i)), which denote the position on the x-axis and y-axis and the width and height of the object. Each initial Gaussian component has weight w_0^(i), mean m_0^(i), and covariance matrix P_0^(i). Each Gaussian mean vector m_0^(i) consists of (x^(i), y^(i), vx^(i), vy^(i), w^(i), h^(i)), which denote the position on the x-axis and y-axis, the velocity along the x-axis and y-axis, and the width and height of the bounding box, respectively. A unique identifier is assigned to each Gaussian to form the set T_0 = {t_0^(1), ..., t_0^(I)}, where t_0^(i) denotes the tag of the i-th Gaussian component.

Step 1: Prediction
With the velocity values vx and vy at time k-1, we predict m_{k|k-1} using the Kalman filter.

Step 2: Update
After data association, each Gaussian is updated by its corresponding observation z. Then the weight w_k^(i), mean m_{k|k}^(i), and covariance matrix P_k^(i) of each Gaussian component are derived.

Step 3: Pruning
Given a threshold T_th, updated Gaussian components with weights w_k^(i) < T_th are pruned. Then the weights of the surviving Gaussians are re-normalized.

Step 4: Target State Estimation
Targets at time k are connected to the tracklets with the same labels collected until time k-1.

Fig. 1: A tree structure of data association in the low level, including prediction and update procedures at time k.

2-2. Data Association for the GM-PHD tracker
Data association is an essential process for matching observations with target states, which correspond to detections and tracked targets, respectively. Following the data association rules [9] for the GM-PHD filter, for each Gaussian we find the observation z in the observation set Z that updates the Gaussian with the maximum weight. If there are n target states at time k-1 and m observations at time k, the data association method has O(n × m) time complexity.

2-3. Hierarchical GM-PHD filter for visual MOT
In our tracking framework, we designed a two-level association procedure consisting of low-level association and mid-level association. The former associates detections at time k with tracked targets at time k-1. The latter associates dead tracklets with alive tracklets at time k. The framework is illustrated in Fig. 1 and 2.
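As a rough illustration of Steps 1 and 3 and of the association rule in Section 2-2, the following Python sketch predicts each Gaussian mean with a constant-velocity model, lets each state pick the observation that gives it the largest weight, and prunes low-weight components. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Gaussian`, `predict_mean`, `likelihood`, `associate`, and `prune` are hypothetical, covariance propagation is omitted, a single hand-picked `sigma` stands in for the innovation covariance, and the surviving weights are re-normalized to sum to one because the text does not state the normalization target.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Observation = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Gaussian:
    tag: int            # track identifier t^(i)
    weight: float       # w^(i)
    mean: List[float]   # (x, y, vx, vy, width, height)


def predict_mean(g: Gaussian, dt: float = 1.0) -> Gaussian:
    """Step 1: constant-velocity prediction of the mean (covariance omitted)."""
    x, y, vx, vy, w, h = g.mean
    return Gaussian(g.tag, g.weight, [x + vx * dt, y + vy * dt, vx, vy, w, h])


def likelihood(g: Gaussian, z: Observation, sigma: float = 10.0) -> float:
    """Unnormalized Gaussian likelihood of z given the predicted state.

    Assumption: one shared standard deviation `sigma` (in pixels) for all
    four observed dimensions replaces the full innovation covariance.
    """
    predicted = (g.mean[0], g.mean[1], g.mean[4], g.mean[5])
    sq_dist = sum((a - b) ** 2 for a, b in zip(predicted, z))
    return math.exp(-0.5 * sq_dist / sigma ** 2)


def associate(states: Sequence[Gaussian],
              observations: Sequence[Observation]) -> List[Optional[int]]:
    """For each state, return the index of the observation that gives it the
    maximum weight (None if there are no observations). The double loop
    evaluates every state/observation pair once, i.e. n * m evaluations."""
    result: List[Optional[int]] = []
    for g in states:
        best_idx, best_weight = None, -1.0
        for j, z in enumerate(observations):
            weight = g.weight * likelihood(g, z)
            if weight > best_weight:
                best_idx, best_weight = j, weight
        result.append(best_idx)
    return result


def prune(gaussians: Sequence[Gaussian], t_th: float) -> List[Gaussian]:
    """Step 3: drop components with weight below t_th, then re-normalize
    the surviving weights (assumed here to sum to one)."""
    kept = [g for g in gaussians if g.weight >= t_th]
    total = sum(g.weight for g in kept)
    if total == 0.0:
        return []
    return [Gaussian(g.tag, g.weight / total, g.mean) for g in kept]
```

This sketch does not resolve the case where two states select the same observation, and it leaves out the Kalman update of Step 2; both would be needed in a complete tracker.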

Fig. 2: A demonstration of data association in the mid level at time k=5.

In the low-level association, observations are detection bounding boxes containing position (x, y) and size (width, height) at time k. States are target information including position (x, y), velocity (vx, vy), and size (width, height) at time k-1. A state at time k-1 is associated with the observation z that gives the state the maximum weight, as seen in Fig. 1. If there are no missed detections, tracklets keep their IDs with few ID switches and fragments. However, if there are missed detections, tracklets can be cut off, and a tracklet ID can change even if the newly born object is actually the same object as one that was missed earlier.

To improve this frame-wise association tracking, we added the mid-level association. First, we eliminate unreliable tracklets whose length is shorter than the threshold L_th, because short tracklets are likely generated by false detections. We set L_th to 10. To judge whether a tracklet is shorter than L_th, we delay the output by L_th - 1 frames. A nine-frame delay is short, so it does not compromise real-time capability. Second, we classify the remaining reliable tracklets into two categories: alive tracklets, which are currently being tracked, and dead tracklets, which are no longer being tracked. For the mid-level association, alive tracklets serve as observations and dead tracklets serve as states. For a dead tracklet, the mean color histogram over all of its targets and the position, velocity, and size of its last target form the state vector. For an alive tracklet, the same elements of its first target form the observation vector. For instance, as seen in Fig. 2, there are one dead tracklet and two alive tracklets at time k=5. The mid-level association connects the dead tracklet with ID 1 to the alive tracklet with ID 3 and re-assigns the proper ID 1 to the alive tracklet. Also, the lost part of the tracklet caused by missed detections is approximated by linear interpolation between the last position of the dead tracklet (ID 1) and the first position of the alive tracklet (ID 3). Thus, the elimination and association procedures correct the false detection (false positive) and missed detection (false negative) problems, respectively.
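The short-tracklet elimination and the linear interpolation of the gap between a dead and an alive tracklet can be sketched as follows. This is an illustration under assumptions, not the authors' code: a tracklet is represented as a hypothetical dict mapping a frame index to an (x, y, width, height) box, `L_TH` mirrors the threshold L_th = 10 from the text, and interpolating width and height along with position is an added assumption (the text mentions only position).

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x, y, width, height)
Tracklet = Dict[int, Box]                 # frame index -> box

L_TH = 10  # minimum reliable tracklet length, as set in the text


def eliminate_short(tracklets: Dict[int, Tracklet],
                    l_th: int = L_TH) -> Dict[int, Tracklet]:
    """Keep only tracklets observed in at least l_th frames."""
    return {tid: t for tid, t in tracklets.items() if len(t) >= l_th}


def interpolate_gap(dead: Tracklet, alive: Tracklet) -> Tracklet:
    """Linearly interpolate boxes for the frames strictly between the last
    frame of the dead tracklet and the first frame of the alive tracklet."""
    k0, k1 = max(dead), min(alive)
    if k1 <= k0:
        raise ValueError("alive tracklet must start after the dead one ends")
    b0, b1 = dead[k0], alive[k1]
    filled: Tracklet = {}
    for k in range(k0 + 1, k1):
        a = (k - k0) / (k1 - k0)          # interpolation factor, 0 < a < 1
        filled[k] = tuple((1 - a) * p + a * q for p, q in zip(b0, b1))
    return filled


def merge(dead: Tracklet, alive: Tracklet) -> Tracklet:
    """Join a dead and an alive tracklet under one ID, filling the gap."""
    merged = dict(dead)
    merged.update(interpolate_gap(dead, alive))
    merged.update(alive)
    return merged
```

In the Fig. 2 scenario, `merge` would be called with the ID 1 dead tracklet and the ID 3 alive tracklet once the mid-level association has matched them, and the result would be stored under ID 1.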

Table 1: Performance comparison of each stage on PETS S2L1. The best are marked with bold font. *TA: tracklet association

Stage                 Rcll  Prcn  F1    FAF   MT     PT     FP    FN   IDs  FM   MOTA  MOTP  FPS
Stage 1               89.0  73.4  80.4  1.82  89.4%  10.6%  1444  492  298  196  50.1  70.7  60~63
Stage 2               82.2  86.4  84.4  0.70  68.4%  31.6%  558   797  72   193  68.1  70.7  59~62
Stage 3(a) (w/o TA*)  82.2  86.4  84.4  0.70  68.4%  31.6%  558   797  47   193  68.7  70.7  33~35
Stage 3(b) (w/t TA*)  84.7  79.8  82.1  1.21  68.4%  31.6%  961   684  46   168  63.6  70.8  32~34

Fig. 3: Tracking results: (a) stage 1 and (b) stage 2.

Our framework consists of three stages as follows:
Stage 1: Low-level association
Stage 2: Short tracklets elimination
Stage 3: Mid-level association
In the next section, we present the effectiveness of our framework.

3. Experiments
In this section, we present the qualitative results of each stage, as seen in Fig. 3 and 4, and the quantitative results in Table 1 in terms of the evaluation metrics [12-13]: recall (Rcll), precision (Prcn), F1 (the harmonic mean of recall and precision), false alarms per frame (FAF), mostly tracked (MT), partially tracked (PT), false positives (FP), false negatives (FN), ID switches (IDs), fragmented tracklets (FM), multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), and frames per second (FPS). We evaluated our algorithm using the tools provided by the MOT benchmark [1] on the representative dataset PETS 2009 S2L1 [11]. We used the detection results from PETS 2009 as tracking input.

Fig. 3 shows the tracking results of stage 1 and of stage 2 with short tracklet elimination. Stage 2 performs better than stage 1 on the metrics related to false positives, such as precision and FP. In particular, the comprehensive metric MOTA improved markedly (from 50.1 to 68.1), while MOTP stayed at 70.7, because many false tracklets caused by false detections were removed in stage 2, as seen in Table 1. In addition, with stage 3, i.e., mid-level association, our algorithm overcomes the missed detection problem by associating fragmented and ID-switched tracklets, for example by linking the highlighted objects in (a-1) and (a-2) of Fig. 4. On the other hand, without stage 3, the objects in (b-1) and (b-2) were not associated although they are identical to those in (a-1) and (a-2). Instead, an ID switch occurred.

Fig. 4: Tracking results at time k=530 before the missed detection: (a-1) with mid-level association and (b-1) without mid-level association. Tracking results at time k=552 after the missed detection: (a-2) with mid-level association and (b-2) without mid-level association. (c) Missed detection at time k=551.

Comparing stage 3(a) with stage 3(b), even though we approximated the lost tracklets between associated tracklets in stage 3(b), the overall performance was worse than without approximation in terms of F-score, FAF, and MOTA. Only FM was better. This may be due to the classic appearance model, i.e., the color histogram: false associations generated more false positives. As a result of the experiments, stage 3(a), which combines low-level association, short tracklet elimination, and mid-level association without lost-tracklet approximation, shows the best performance in terms of comprehensive metrics such as F-score, FAF, IDs, and MOTA compared with the earlier stages. In particular, IDs decreased considerably compared with stage 1 (from 298 to 47). In stage 2, the elimination also removed true positives, but stage 3 compensates for this by connecting ID-switched tracklets through mid-level association. With each added stage the processing speed dropped, to almost half that of the first stage by stage 3, but it remained above 30 fps. We implemented the tracker in C++ with OpenCV 2.4.11 (64-bit) on a computer with an Intel Core i7-4770K 3.5 GHz CPU and 32.0 GB of DDR3 RAM, and tested it on image sequences with 768 × 576 resolution.
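The count-based metrics in Table 1 follow directly from FP, FN, and IDs. The sketch below recomputes F1, FAF, and MOTA for the Stage 1 row, using the CLEAR MOT definition of MOTA [12]. Two inputs are assumptions that do not appear in this paper: 4476 ground-truth boxes and 795 frames, which are the figures commonly listed for the PETS09-S2L1 sequence in the MOTChallenge 2015 statistics. The function names are illustrative.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (same units as the inputs)."""
    return 2.0 * recall * precision / (recall + precision)


def false_alarms_per_frame(fp: int, num_frames: int) -> float:
    """Average number of false positives per frame."""
    return fp / num_frames


def mota(fp: int, fn: int, id_switches: int, num_gt: int) -> float:
    """CLEAR MOT accuracy in percent: 1 - (FN + FP + IDs) / GT."""
    return 100.0 * (1.0 - (fn + fp + id_switches) / num_gt)


# Stage 1 row of Table 1; NUM_GT and NUM_FRAMES are assumed, see above.
NUM_GT, NUM_FRAMES = 4476, 795
stage1_f1 = f1_score(89.0, 73.4)
stage1_faf = false_alarms_per_frame(1444, NUM_FRAMES)
stage1_mota = mota(fp=1444, fn=492, id_switches=298, num_gt=NUM_GT)
print(f"F1={stage1_f1:.2f}  FAF={stage1_faf:.2f}  MOTA={stage1_mota:.2f}")
```

Under these assumptions the recomputed values agree with the reported Stage 1 figures (80.4, 1.82, and 50.1) to within rounding, which also serves as a consistency check on the row.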

In summary, we verified that the MOT framework with hierarchical association is effective against false detections and missed detections and has real-time capability.

4. Conclusion
Multiple object tracking (MOT) techniques based on the tracking-by-detection approach have intrinsic problems because they depend on the detection results. Generally, the problems come from false detections and missed detections. To handle these problems, we proposed an online MOT method with the hierarchically adopted Gaussian mixture probability hypothesis density (GM-PHD) filter using motion and appearance. The GM-PHD filter [8] is robust to false alarms because it was devised for processing random and noisy point data, and a simple data association method [9] exists for the filter. The existing tracker [10] based on the GM-PHD filter was designed for tracking radar data. Hence, we revised the filter and the data association method for visual MOT and adopted the GM-PHD filter hierarchically in a two-level association to handle false and missed detections.

In the low-level association (stage 1), our tracker produces results with many false positives. The errors from false detections are then removed by short tracklet elimination in stage 2. This delays the output by a few frames but does not prevent online operation. After the elimination, in the mid-level association (stage 3), we link the ID-switched objects and the fragmented tracklets using the motion and appearance of those tracklets.

We evaluated our tracker using the tools provided by the MOT benchmark [1] on the dataset PETS 2009 S2L1 [11] in terms of the evaluation metrics [12-13]. The experiments show that the low-level association finds the affinity between detections and tracklets, that short tracklet elimination removes false positives, and that the mid-level association finds the affinity between ID-switched or fragmented tracklets and thereby decreases false negatives, ID switches, and fragments. In conclusion, the performance improves stage by stage, which shows that our tracker is a reasonable way to handle key problems in MOT research.

In future work, our algorithm could achieve performance more competitive with state-of-the-art MOT techniques if we use a more robust appearance feature instead of the color histogram. Also, because our tracker runs online at over 30 fps, we expect that it can be readily applied in commercial real-time applications.

Acknowledgment

This work was supported by the ICT R&D program of MSIP/IITP. [B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis]

References
[1] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking”, ArXiv e-prints, Apr. 2015.
[2] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese, “Learning an image-based motion context for multiple people tracking”, In CVPR, Jun. 2014.
[3] A. Milan, S. Roth, and K. Schindler, “Continuous Energy Minimization for Multitarget Tracking”, In IEEE TPAMI, Vol. 36, No. 1, pp. 58-72, Jan. 2014.
[4] C. Dicle, O. Camps, and M. Sznaier, “The Way They Move: Tracking Targets with Similar Appearance”, In ICCV, Dec. 2013.
[5] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun, “3D Traffic Scene Understanding from Movable Platforms”, In IEEE TPAMI, Vol. 36, No. 5, pp. 1012-1025, May 2014.
[6] H. Pirsiavash, D. Ramanan, and C. Fowlkes, “Globally-Optimal Greedy Algorithms for Tracking a Variable Number of Objects”, In CVPR, Jun. 2011.
[7] M. Hofmann, M. Haag, and G. Rigoll, “Unified Hierarchical Multi-Object Tracking using Global Data Association”, In PETS, Jan. 2013.
[8] B.-N. Vo and W.-K. Ma, “The Gaussian Mixture Probability Hypothesis Density Filter”, In IEEE TSP, Vol. 54, No. 11, pp. 4091-4104, Nov. 2006.
[9] K. Panta, D. E. Clark, and B.-N. Vo, “Data Association and Tracking Management for the Gaussian Mixture Probability Hypothesis Density Filter”, In IEEE Transactions on Aerospace and Electronic Systems, Vol. 45, Issue 3, pp. 1003-1016, Jul. 2009.
[10] K. Panta, B.-N. Vo, and D. E. Clark, “An Efficient Track Management Scheme for the Gaussian-Mixture Probability Hypothesis Density Tracker”, In ICISIP, Oct. 2006.
[11] IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS) 2009 Dataset.
[12] K. Bernardin and R. Stiefelhagen, “Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics”, Image and Video Processing, May 2008.
[13] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: HybridBoosted multi-target tracker for crowded scene”, In CVPR, Jun. 2009.
