CS 6350 TERM PROJECT ASSIGNMENT REPORT


TPA-1: Motion compensation based automatic tracking of object silhouette under camera movement with jitter

Abhishek Kumar, CS15M008, and Abhigyan, EE13B001

Abstract: This report investigates tracking of an unknown object in a video stream with jitter. The object is defined by its location and extent in a single frame. In every frame that follows, the task is to determine the object's location and extent, or to indicate that the object is not present. We use a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection, followed by GrabCut to segment the object in motion. The tracker follows the object from frame to frame; the detector localizes all appearances that have been observed so far and corrects the tracker if necessary.

Keywords: TLD, grab-cut, MOG, median filter

I. INTRODUCTION

With the increasing availability of small, inexpensive video recording devices, casual movie making is now within everyone's reach. A camera can be mounted on any moving platform, for instance a drone, for a variety of purposes. Automatically tracking a foreground object in a video shot with unconstrained, jittery camera movement therefore becomes a challenging task.


II. ALGORITHMS USED

The following standard algorithms were used in our implementations for tracking a moving object in jittery video. A brief introduction to each is given below.

A. Tracking-Learning-Detection (TLD)

TLD [1] is an award-winning, real-time algorithm for tracking unknown objects in video streams. The object of interest is defined by a bounding box in a single frame. TLD simultaneously tracks the object, learns its appearance, and detects it whenever it appears in the video. The result is real-time tracking that often improves over time. As the name suggests, it consists of three main parts:
• Tracking
• Detection
• Learning
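The interaction of the three parts can be illustrated with a structural sketch in plain Python. Note that `tld_step`, `track`, `detect`, and `learn` are hypothetical names standing in for the real TLD machinery, not the authors' code; the sketch shows only the control flow (tracker proposes, detector corrects, learner updates).

```python
# Structural sketch of one TLD iteration: the tracker proposes a box,
# the detector proposes candidate boxes, the integrator fuses them,
# and the learner (P-N experts) updates the detector's model.
# Stub components only -- not the real TLD machinery.

def overlap(a, b):
    """Intersection-over-union of two boxes (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def tld_step(frame, prev_box, track, detect, learn):
    """One frame of tracking-learning-detection."""
    tracked = track(frame, prev_box)      # frame-to-frame motion estimate
    detections = detect(frame)            # scan for previously learned appearances
    # Integrator: a detection far from (or replacing a lost) tracker re-initializes it.
    box = tracked
    for d in detections:
        if tracked is None or overlap(d, tracked) < 0.5:
            box = d
            break
    learn(frame, box)                     # update the detector's object model
    return box
```

This is why TLD can recover an object after total occlusion: even when `track` returns nothing, `detect` can re-find a learned appearance anywhere in the frame.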

B. Grab Cut

The GrabCut [2] algorithm is used for foreground extraction with minimal user interaction. We used OpenCV's existing implementation of GrabCut. The algorithm requires a rectangle around the object as input from the user; we automated this step. We saved the rectangle coordinates and its width and height obtained from TLD tracking of the object, and passed this rectangle when applying GrabCut to the original image.


Figure 1: Object segmentation using grabcut

C. Video Stabilization

For our experiments, we stabilized the video using the YouTube website: we uploaded the video and, after stabilization, downloaded it again to perform the experiments. We also tried other techniques but faced several issues, such as frame cropping. Internally, YouTube uses several stabilization techniques [4], [5], [6].

D. Mixture of Gaussians

The background of the scene contains many non-static objects, such as tree branches and bushes, whose movement depends on the wind. This kind of background motion causes the pixel intensity values to vary significantly with time, so a single-Gaussian assumption for the pdf of the pixel intensity does not hold. Instead, a generalization based on a mixture of Gaussians has been used to model such variations: the intensity of each pixel is modeled by a mixture of K Gaussian distributions, where K is a small number from 3 to 5. In our implementation we used 3 Gaussians each for background and foreground, a learning rate of 0.01, a threshold of 0.15-0.45, and a positive deviation threshold of 2.5, depending on the input data.

E. Median Filter

The median filter is a nonlinear digital filtering technique, often used to remove noise. The main idea is to run through the signal entry by entry, replacing each entry with the median of


neighboring entries. The pattern of neighbors is called the "window", which slides, entry by entry, over the entire signal.
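The sliding-window median described above can be sketched directly in NumPy (a naive reference implementation for clarity, not the routine used in our experiments):

```python
import numpy as np

def median_filter(img, k=3):
    """Slide a k x k window over img and replace each entry with the
    window median (edges handled by reflection padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

# Isolated salt-and-pepper speckles vanish, because the median of a
# mostly-dark window stays dark.
noisy = np.zeros((9, 9), np.uint8)
noisy[2, 2] = noisy[6, 6] = 255
clean = median_filter(noisy)
```

This is why the filter suits the noisy MOG masks: isolated misclassified pixels are outvoted by their neighborhood, while large connected foreground blobs survive.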

Figure 4: Noise removal using median filter (left: noisy MOG output; right: median-filtered image)

III. IMPLEMENTATION

We built two implementations for tracking in jittery video, in order to compare the two approaches: 1) Predator with grab-cut, and 2) MOG with median filter.

A. Predator with Grabcut

To perform object tracking, we extended an existing implementation [3] of Predator (TLD) [1]. In our implementation, we pass the input video and mark the object to be tracked in the first frame. Predator tracks the moving object with a rectangular box. For every frame, we store the coordinates (x, y), width (w), and height (h), which are used later when applying GrabCut to the original frames. Since grab-cut takes a long time, we kept its implementation separate from the Predator implementation. When passing the rectangle to grab-cut, we increased its width and height by an appropriate amount, so that most of the object falls inside the rectangle. Grab-cut segments the object from the frames, and we create the output video from these segmented frames. The flow chart for our implementation is shown in Figure 5.
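The rectangle-enlargement step can be sketched with a small helper. Here `expand_rect` is a hypothetical name, and the 20% margin is an illustrative choice, not the exact amount used in the report:

```python
# Hypothetical helper: grow the TLD rectangle by a margin on every side
# before handing it to GrabCut, clamped to the frame boundaries.
def expand_rect(x, y, w, h, frame_w, frame_h, margin=0.2):
    """Grow (x, y, w, h) by margin * size on each side, clipped to the frame."""
    dx, dy = int(w * margin), int(h * margin)
    nx, ny = max(0, x - dx), max(0, y - dy)
    nw = min(frame_w - nx, w + 2 * dx)
    nh = min(frame_h - ny, h + 2 * dy)
    return nx, ny, nw, nh
```

The clamping matters near frame edges: GrabCut treats everything outside its rectangle as definite background, so the enlarged box must stay inside the image while still covering the whole object.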


Figure 5: Flowchart for our implementation with Predator (TLD) and Grabcut for motion tracking

B. MOG with Median Filter

We first applied video stabilization to the jittery video, and then applied MOG. To remove the noise caused by jitter and the moving background, we applied median filtering to the result.

Figure 6: Flowchart for our implementation with MOG and Median Filter for motion tracking


IV. OUTPUT

A. Data-A

1) Predator with Grabcut on Jittery Video: Below are the outputs of Predator with grab-cut on jittery video data-A. Each figure shows the tracked output, segmented output, segmented binary output, and ground truth.

Figure 7: Frame-10

Figure 8: Frame-180

2) Predator with Grabcut on Stabilized Video: Outputs of Predator with grab-cut on stabilized video data-A (tracked output, segmented output, segmented binary output, ground truth).

Figure 9: Frame-10

Figure 10: Frame-150


3) Mixture of Gaussians on Stabilized Video: Results for Mixture of Gaussians without and with median filtering on the stabilized video input are shown below (original frame, MOG output, MOG after median filter, ground truth). The results are poor because of the motion of the background.

Figure 11: Frame-201

B. Data-C

1) Predator with Grabcut on Jittery Video: Below are the outputs of Predator with grab-cut on jittery video data-C (tracked output, segmented output, segmented binary output, ground truth).

Figure 12: Frame-10

Figure 13: Frame-100


Figure 14: Frame-249

2) Predator with Grabcut on Stabilized Video: Below are the outputs of Predator with grab-cut on stabilized video data-C (tracked output, segmented output, segmented binary output, ground truth).

Figure 15: Frame-70

Figure 16: Frame-248

3) MOG: Results for Mixture of Gaussians without and with median filtering on the stabilized video input are shown below (original frame, MOG output, MOG after median filter, ground truth).


Figure 17: Frame-10

Figure 18: Frame-200

C. Frog-Data

1) Predator with Grabcut on Jittery Video: Results for Predator with grab-cut on the jittery frog video are as follows (tracked output, segmented output, segmented binary output, ground truth).

Figure 19: Frame-10

Figure 20: Frame-190

2) Predator with Grabcut on Stabilized Video: Results for Predator with grab-cut on the stabilized frog video are as follows (tracked output, segmented output, segmented binary output, ground truth).

Figure 21: Frame-50


Figure 22: Frame-240

3) MOG: MOG was applied after video stabilization (original frame, MOG output, MOG after median filter, ground truth).

Figure 23: Frame-50

Figure 24: Frame-150

D. Observation

Ground truth for the moving object was manually extracted from the video for every 10th frame, and the observed (segmented) result was compared against it. The misclassification error measures the percentage of background pixels wrongly assigned to the foreground, and vice versa. It is calculated using the formula:

ME = 1 − (|B_O ∩ B_T| + |F_O ∩ F_T|) / (|B_O| + |F_O|)    (1)

where B_O and F_O are the background and foreground pixels, respectively, of the perfectly segmented (ground-truth) image, and B_T and F_T are the background and foreground pixels of the test image.
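Equation (1) can be computed directly from boolean foreground masks. A NumPy sketch (note that the denominator |B_O| + |F_O| is simply the total number of pixels):

```python
import numpy as np

def misclassification_error(gt_fg, test_fg):
    """Eq. (1): fraction of pixels whose background/foreground label
    disagrees with the ground truth. Inputs are boolean foreground masks."""
    agree = np.sum(~gt_fg & ~test_fg) + np.sum(gt_fg & test_fg)
    return 1.0 - agree / gt_fg.size    # |B_O| + |F_O| = all pixels

gt = np.array([[1, 1], [0, 0]], bool)     # ground-truth foreground mask
seg = np.array([[1, 0], [0, 0]], bool)    # segmented result
err = misclassification_error(gt, seg)    # 1 of 4 pixels wrong -> 0.25
```

A perfect segmentation gives ME = 0; labeling every pixel wrongly gives ME = 1.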


Model                                   Dataset-A   Dataset-C   Frog-data
Predator with Grabcut (Unstabilized)     5.86%      10.85%      10.15%
Predator with GC (Stabilized)            5.75%       8.6%        9.00%
MOG                                     23.3%       15.64%      21.47%
MOG with Median Filter                  17.28%      14.68%      17.65%

Table I: Avg. % misclassification on stabilized/unstabilized video

Model: Predator with Grabcut (Jittery)

Jitter Level-1   11.07%
Jitter Level-2   10.38%
Jitter Level-3   11.41%
Jitter Level-4   10.42%

Table II: Avg. % misclassification for different levels of jitter in frog-data

Figure 27: % misclassification per 10th frame for data-C (stabilized vs. jittery)


V. CONCLUSION

1) Our implementation works almost equally well on jittery and stabilized video; for stabilized video there is a small improvement in tracking results.
2) In data-C the object moves quickly between consecutive frames, so tracking accuracy depends on the initialization. In data-A the object moves slowly, so initialization does not affect the results as drastically as in data-C.
3) From Figure 27, we conclude that if the object moves rapidly between consecutive frames, stabilization helps Predator recover the lost object quickly, and the bounding-box size does not vary much.
4) A strength of Predator is that if the object is lost in some frame, it can be found again. In Figure 27, the peak shows the object being lost completely; after some frames it is found again, resulting in lower error.
5) The segmentation results can be improved by improving grab-cut and applying region growing on the tracked object.
6) MOG gave good results on the frog data after stabilization because the background is almost homogeneous in color and contains no distinguishable objects, unlike data-A and data-C, where the background has varied colors and objects.
7) If initialization is done properly, accuracy is almost the same for stabilized and jittery data; stabilizing the video does not significantly improve Predator with grab-cut.
8) Predator with grab-cut is significantly faster and more accurate than MOG. MOG fails drastically if the video is not stabilized, while Predator still gives good results.
9) Predator performs almost the same across the various jitter levels, because it tries to match the initial or last-seen model of the object over the whole image using a grid search.


10) The smaller the initial bounding box, the better the tracking results with Predator.

REFERENCES

[1] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-Learning-Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, July 2012.
[2] C. Rother, V. Kolmogorov, and A. Blake, ""GrabCut": Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (Proc. ACM SIGGRAPH), 2004.
[3] http://www.cs.tau.ac.il/projects/predator.
[4] http://googleresearch.blogspot.in/2012/05/video-stabilization-on-youtube.html.
[5] M. Grundmann, V. Kwatra, D. Castro, and I. Essa, "Calibration-Free Rolling Shutter Removal," IEEE International Conference on Computational Photography [Best Paper], 2012.
[6] M. Grundmann, V. Kwatra, and I. Essa, "Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.