
Int'l Conf. IP, Comp. Vision, and Pattern Recognition | IPCV'15 |

Enhanced Video Target Tracking using Kalman Filter Guided Covariance Descriptor with Gaussian Similarity Weighting

O. Akbulut1, S. Erturk2
1 Computer Engineering Department, Kocaeli University, Kocaeli, Turkey
2 Electronics and Telecommunications Engineering Department, Kocaeli University, Kocaeli, Turkey

Abstract – The region covariance descriptor, which includes statistical and spatial features as well as correlations between features, has been widely used for target representation in visual tracking. Robustness, the ability to fuse several features, and low computational load make the region covariance descriptor powerful for target representation. In this paper, we propose a novel approach in which isotropic Gaussian weighting and Kalman filtering are used together with the region covariance descriptor, increasing visual tracking performance in relatively complex situations such as occlusion and appearance changes. Experimental results demonstrate the effectiveness of this approach in terms of robust visual tracking.

Keywords: Covariance Descriptor, Gaussian weighting, Kalman filter

1 Introduction

Target tracking is an important topic in computer vision applications. In many cases, the performance of the tracking process depends on challenging conditions such as semi/full occlusion, illumination conditions, noise, appearance changes and dynamic backgrounds. For example, using a fixed appearance model for a target region may degrade visual tracking performance. In visual tracking, the selection of appropriate features, which represent different characteristic properties of the target, influences the quality of the result. Conventionally, a single feature descriptor such as the color histogram or a gradient-based histogram has been used to represent the target for tracking purposes. In [1], color histogram based object representation is used for tracking, and a kernel-based search is carried out via the mean shift algorithm. However, performance degradation may occur if there is abrupt motion in the scene. A more effective tracking scheme based on the particle filter is realized in [2], where color distributions are integrated into particle filtering to accommodate the scale changes and abrupt motion seen in tracking. Color histograms are robust to partial occlusion and are rotation and scale invariant. However, in most cases only statistical information is considered whereas spatial information is ignored. Furthermore,

using color histograms becomes ineffective for a high number of bins due to the exponentially growing histogram size. The histogram of oriented gradients (HOG) descriptor is a well-known single-feature descriptor used in visual tracking. In [3], the histogram of oriented gradients is utilized for multiple pedestrian tracking. The local binary pattern (LBP), a texture-based alternative single-feature descriptor, has been used in visual tracking in [4]: exploiting local binary codes for each pixel in the target region, a histogram-based representation is computed. However, the tracking performance of the LBP-based algorithm has not been investigated on challenging sequences. Instead of a single feature descriptor, multi-feature descriptors have also been used for target representation in the literature. These approaches can improve target tracking performance and achieve robustness at the expense of computational load, as shown in [5-7]. In [5], an elliptical head tracking approach based on intensity gradients and color histograms is proposed: intensity gradients are calculated along the person's head boundary while color histograms are computed over the interior of the head. Tuzel et al. [6,7] proposed a region covariance descriptor (RCD) based appearance model for target tracking and object detection. The RCD enables efficient fusion of multiple feature descriptors while maintaining the computational load of a single feature descriptor. The RCD has been used in different computer vision areas such as human detection [8,9] and face recognition [10]. Instead of RCD based deterministic search, RCD based probabilistic search [11-13] has also been utilized in tracking problems to achieve better performance. In [11], particle filter based probabilistic target tracking is proposed using the RCD to represent the target region, and an incremental model-updating scheme is integrated to handle appearance changes occurring in the target while keeping the computational load low.
A Monte Carlo method, a special case of the particle filter, combined with the RCD has been proposed to track non-rigid objects in [12]. They use multiple covariance matrices for a target region, instead of a single covariance matrix, to handle partial occlusion. In [13], system dynamic models related to the tracking process are defined on the Riemannian manifold and covariance based tracking is performed; the state samples are drawn on the manifold geodesics instead of in vector space. The selection and number of different types of feature vectors used in the RCD may affect the


performance of target tracking. In [14], a performance evaluation for choosing the feature vectors that construct the best covariance descriptor has been carried out; it is emphasized that color features are more important than the other features when constructing the covariance matrix. The RCD approach has also been utilized in multi-target tracking applications, as in [15], where a simple model update approach, based on the mean of the last covariance matrix and the current covariance matrix, is used to speed up the tracking process. In [16], covariance descriptors have been used for 3D shape matching and retrieval, taking into account the geometry of the Riemannian manifold. In this paper, a novel approach is proposed to improve the tracking performance of the RCD based approach. There are three fundamental contributions. The first is to use isotropic Gaussian-shaped weighting of the similarity metric used in covariance matrix matching, to give more weight to locations with higher probability. Additionally, Kalman filtering [17] is integrated into the proposed approach to determine the probable target region in the corresponding frame and guide the Gaussian weighting (and search) center. As a last contribution, occlusion detection is integrated by detecting suddenly increasing distance errors of the target object.

2 Region Covariance Descriptor

The RCD has a discriminative property to represent the appearance model of the object region in detection and tracking applications, and it enables efficient fusion of multiple feature vectors. It is possible to construct multivariate data by extracting different features for each pixel of the image, such as intensity, color, texture etc. Each column of the multivariate data represents an observation and each row a feature; note that each observation corresponds to a pixel. The multivariate data can be denoted by

F = [\,\mathbf{f}_1 \;\cdots\; \mathbf{f}_i \;\cdots\; \mathbf{f}_n\,],   (1)

where \{\mathbf{f}_i\}_{i=1}^{n} are d-dimensional feature column vectors associated with the pixel index i, and n is the total number of pixels in the image. Feature vectors generated for each pixel are illustrated for a sample object region in Fig. 1.

Fig. 1 Feature vectors for each pixel in an object region.
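As a concrete illustration, the per-pixel feature matrix F of Eq. (1) might be built as in the following numpy sketch. The feature set (x, y, R, G, B, |Ix|, |Iy|) mirrors the 7-dimensional vector the paper uses later in its experiments; the central-difference gradient is an assumed stand-in for the paper's derivative filters.

```python
import numpy as np

def feature_matrix(img):
    """Build the d x n multivariate data F of Eq. (1) for an RGB image
    region: one d-dimensional feature vector (column) per pixel."""
    h, w, _ = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    gray = img.astype(float).mean(axis=2)
    # central-difference gradients, a stand-in for the paper's
    # first-derivative filters (an assumption)
    ix = np.gradient(gray, axis=1)
    iy = np.gradient(gray, axis=0)
    feats = [x, y, img[..., 0], img[..., 1], img[..., 2],
             np.abs(ix), np.abs(iy)]
    # F: one row per feature, one column per pixel (d x n)
    return np.stack([f.ravel() for f in feats]).astype(float)
```

Any per-pixel features could be substituted here; the descriptor size in the next section depends only on d, not on the region size.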

It is clear that the distribution of the multivariate data can be interpreted by constructing the covariance matrix. From this point of view, a region covariance matrix can be defined for a sub-region R inside the image that represents the appearance descriptor of the target object. The region covariance matrix is given by

C_R = \frac{1}{s-1} \sum_{i=1}^{s} (\mathbf{f}_i - \boldsymbol{\mu}_R)(\mathbf{f}_i - \boldsymbol{\mu}_R)^T,   (2)

where C_R is a d × d symmetric positive definite (SPD) matrix, \boldsymbol{\mu}_R is the mean of the corresponding feature vectors and s is the number of pixels within the sub-region. The covariance descriptor encapsulates the statistical information of the various feature vectors, and its size depends only on the number of features, regardless of the object region size. The performance of visual object tracking is directly related to the similarity (or dissimilarity) metric and to the model update steps of the covariance matrix. Förstner et al. [18] proposed a dissimilarity criterion to compare two SPD covariance matrices in the form of

dist(C_1, C_2) = \sqrt{\sum_{i=1}^{d} \ln^2 \lambda_i(C_1, C_2)},   (3)

where \lambda_i are the generalized eigenvalues of C_1 and C_2, given by C_1 \mathbf{x}_i = \lambda_i C_2 \mathbf{x}_i with \mathbf{x}_i the generalized eigenvectors. The reader is referred to [6,7] for detailed information and the advantages of the RCD. In this paper, a single covariance matrix is used to model the appearance of the target object, and the model update method based on Riemannian geometry is utilized as presented in [7], where T previous covariance matrices are taken into consideration in order to obtain a sample mean covariance matrix by gradient descent.
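Equations (2) and (3) can be sketched in numpy as follows. For SPD C2, the generalized eigenvalues of (C1, C2) equal the eigenvalues of C2^{-1} C1, which is how this sketch computes them; a production implementation might use a dedicated generalized eigensolver instead.

```python
import numpy as np

def region_covariance(F):
    """Eq. (2): C_R = 1/(s-1) * sum_i (f_i - mu)(f_i - mu)^T for a
    d x s feature matrix F of a region (columns are pixels)."""
    mu = F.mean(axis=1, keepdims=True)
    Z = F - mu
    return Z @ Z.T / (F.shape[1] - 1)

def forstner_distance(C1, C2):
    """Eq. (3): sqrt(sum_i ln^2 lambda_i) over the generalized
    eigenvalues lambda_i of (C1, C2), i.e. eigenvalues of C2^{-1} C1."""
    lam = np.linalg.eigvals(np.linalg.solve(C2, C1)).real
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

The distance is zero for identical matrices and, for a uniform scaling C1 = a·C2, reduces to sqrt(d)·|ln a|, which is a quick sanity check.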

3 Proposed Method

High-performance tracking of an object is not always possible with RCD based tracking methods in complex situations. Performance degradation may occur during the template matching or the model update of covariance matrices. Different candidate RCDs may have similar feature descriptors or similar distributions with respect to the reference RCD. For example, most RCD based approaches do not take first order statistics of the distribution into account during template matching. When several local minima with close values exist, determining the best (correct) candidate RCD can be difficult. Occlusion is also a crucial issue in target tracking. RCD based methods may show relatively low tracking performance when full or semi occlusion occurs in video sequences. In general, RCD based methods use multiple covariance matrices to represent the target region, which is useful when the reference target is occluded by another object. However, there is no adaptive control mechanism to check for full occlusion and to handle the model update in that case. To deal with these shortcomings, three contributions are proposed in this paper. The flowchart of


the proposed approach is shown in Fig. 2. The contributions of this paper are highlighted in this figure.

Fig. 2 Flowchart of proposed approach.

3.1 Weighting of distance measures

In RCD based target tracking, the correct candidate position should provide a good similarity measure, i.e. a low distance measure with respect to the reference template. However, due to scene context it is possible that an incorrect candidate position also provides a low distance measure, which can sometimes even fall below that of the correct candidate. In this paper, it is proposed to weight the distance measures so as to give more emphasis to the candidates that are regarded as more likely considering the motion characteristics obtained from the previous frames. An isotropic and monotonically increasing Gaussian formed function (GFF), defined in (4), is used for the weighting process with a predetermined \sigma value:

GFF(x, y) = B - G_{2D}(x, y, \sigma) \cdot (B - S),   (4)

where G_{2D}(x, y, \sigma) is a 2D Gaussian function normalized to unit height. The minimum and maximum amplitudes of the upside-down Gaussian function are denoted S and B, respectively; they have been determined experimentally to achieve good tracking performance throughout all sequences. The GFF used in the proposed approach is shown in Fig. 3.

Fig. 3 Illustration of GFF.

The distance measures are weighted with the GFF so that more influence is given to the most likely target position, and the influence is reduced with distance from this point. Thanks to this approach, a candidate whose similarity measure has a slightly higher distance but lies close to the most likely position is preferred over a candidate with a slightly lower distance that is further away from the most likely position.

3.2 Kalman filter guidance

In order to determine the most likely position of the candidate target from previous motion characteristics, the KF is utilized in this paper with a constant-acceleration motion model. The best candidate template is then searched around the center location determined by the KF using a fast search approach. Note that only the prediction step of the KF is used to identify the center location (i.e. most likely position) of the target, in the following form:

\hat{x}_k^- = A_k \hat{x}_{k-1},   (5)

where \hat{x}_k^- is the a priori prediction at step k, A_k is the state-transition matrix, and \hat{x}_{k-1} is the a posteriori state estimate at step k-1. Note that the prediction step is a linear discrete-time approximation of the continuous-time system. The remaining parts of the KF are used without any change. The predicted measurement of the KF serves as the most likely target position, which is used for the weighting of the RCD distance measures as explained in the previous sub-section.

3.3 Occlusion Detection

In order to increase the accuracy of target tracking during occlusion, another contribution is introduced by means of the KF: the a priori state estimate is used instead of the actual best measurement of the target position whenever an unexpected deviation of the distance measure is encountered. This restriction has been found to be very useful in case of occlusions. A threshold is calculated adaptively from the moving average of the distance measures over a pre-determined number M of previous frames to detect unexpected deviations. The threshold at time t can be formulated as

TH_t = Coeff \cdot \frac{1}{M} \sum_{i=1}^{M} dist_{t-i}(C_{ref,t-i}, C_{target,t-i}),   (6)

where Coeff is a preset coefficient.

4 Experimental Results

Several real-world video sequences with static and dynamic backgrounds have been evaluated in order to assess the performance of the proposed approach against the reference COV [7], as well as the recent state-of-the-art techniques ICTL [11] and MC [12]. Note that the target within each sequence is initialized manually, and the proposed framework considers only single-target tracking. Visual tracking results corresponding to this work can be found at http://kulis.kocaeli.edu.tr/RCD_comp_res.php. Video results can also be viewed individually from the given link.

The parameters for the GFF are \sigma = 50 with minimum and maximum amplitudes of 0.5 and 1, respectively. A 7-dimensional feature vector, containing the x and y pixel coordinates, RGB color, and first-derivative gradients in the x and y directions, is used for each pixel of the target region. The best target template is searched around the most likely candidate template location at different scales (large/small). The search range is restricted to ±100 pixels and the number of search points is reduced uniformly by decimation, i.e. selecting every 4th pixel location of the possible target region, to decrease the computational load of the object detection stage. The number of previous covariance matrices (T) used to obtain the sample mean covariance matrix is set to 40, as in COV [7].

In order to assess tracking performance, the importance of parameter selection has been investigated. For this purpose, with the diagonal elements of the a priori error covariance matrix (PECM), the measurement noise covariance matrix (MNCM) and the number of previous frames M first set to constant values, a sub-optimal Coeff parameter is determined within a plausible range according to the False Alarm Ratio (FAR). FAR is the ratio of the number of false target detections to the total number of targets over the entire set of test sequences. Note that an estimated target whose Euclidean distance from the ground truth exceeds 20 pixels is marked as a false detection. In the next step, a sub-optimal M parameter is selected within a plausible range according to FAR, using the Coeff parameter determined in the previous step and preserving the PECM and MNCM values. Then, the PECM and MNCM values are selected within plausible ranges. Fig. 4 shows the selection of the method parameters from the Coeff to the MNCM parameter, respectively. It is clearly seen that the FAR values in each subfigure of Fig. 4 do not change much in the neighborhood of the best parameter value. Through this process, the PECM and MNCM values and the Coeff and M parameters are set to 12, 50, 2 and 30, respectively, to obtain good tracking performance with the lowest FAR. Finally, the Coeff parameter is recalculated using the previously estimated values and parameters to check robustness to parameter changes. Fig. 5 shows the relation between the a priori estimated Coeff parameter and the a posteriori estimated Coeff parameter; the FAR results for each Coeff value remain close throughout the pre-defined range. Note that 75% of all false target detections originate from the "Woman" sequence. The proposed method therefore achieves about 80% tracking performance according to the FAR over the entire set of test sequences shown in Fig. 4 and Fig. 5, and tracking performance reaches 95% when the "Woman" sequence is excluded. The ICTL approach achieves about 77% and 81% tracking performance for the entire set of test sequences with the "Woman" sequence included and excluded, respectively.
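With the parameter values above (σ = 50, minimum amplitude S = 0.5, maximum amplitude B = 1), the GFF weighting of Eq. (4) and the KF prediction of Eq. (5) can be sketched as follows. The 6-element state layout [x, y, vx, vy, ax, ay] and dt = 1 are assumptions, not taken from the paper.

```python
import numpy as np

# Parameter values from the text: sigma = 50, amplitudes S = 0.5, B = 1.
SIGMA, S_MIN, B_MAX = 50.0, 0.5, 1.0

def gff(x, y, cx, cy, sigma=SIGMA, s=S_MIN, b=B_MAX):
    """Eq. (4): upside-down Gaussian weight with minimum s at the
    predicted center (cx, cy), rising monotonically toward b."""
    g = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
    return b - g * (b - s)

def predict(x_prev, dt=1.0):
    """Eq. (5): a priori prediction x_k^- = A x_{k-1} with a
    constant-acceleration model over the assumed state layout."""
    A = np.eye(6)
    for i in range(2):
        A[i, i + 2] = dt          # position += velocity * dt
        A[i, i + 4] = 0.5 * dt ** 2  # position += 0.5 * accel * dt^2
        A[i + 2, i + 4] = dt      # velocity += accel * dt
    return A @ x_prev

# A candidate at the KF-predicted center keeps half its raw distance
# (weight = S = 0.5); far-away candidates are weighted by ~B = 1.
center = predict(np.array([100.0, 80.0, 2.0, 1.0, 0.0, 0.0]))[:2]
w_near = gff(center[0], center[1], center[0], center[1])  # = 0.5 (= S)
```

Multiplying each candidate's Förstner distance by this weight implements the preference for candidates near the predicted position described in Section 3.1.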


In addition, the MC approach achieves about 63% and 75% tracking performance in the same situations. Table 1 and Table 2 show the average (mean) and standard deviation values of the tracking error for several sequences recorded with stationary and moving cameras, respectively. In Table 1, our method provides tracking performance similar to the existing approaches in less challenging sequences such as the "Crowd" and "Subway" sequences. However, our method generally performs better than the other approaches in terms of mean and standard deviation values. As seen in Table 2, nearly all results of the proposed approach are better than those of the compared methods, with a clear success overall. Note that the proposed method significantly outperforms the competition in complex sequences such as "Couple" and "Woman". Furthermore, the MC and ICTL approaches, for example, lose the target due to full occlusion in the "Jogging1" and "Jogging2" sequences.

Table 1 Average (Mean) and Standard Deviation (SD) values of tracking error (pixels) for sequences recorded with a stationary camera.

Sequence                              COV [7]        ICTL [11]      MC [12]        Proposed
                                      Ave.    SD     Ave.    SD     Ave.    SD     Ave.    SD
Crowd (440*360), 250 frames           2.56   1.41    1.88   1.07    2.40   1.26    2.27   1.16
EnterExit1cor (384*288), 195 frames  11.05   9.18   13.31  20.71   11.61  19.07    8.88   7.46
Subway (352*288), 180 frames          5.06   3.84    5.15   3.59   123.8  83.72    3.99   4.13
Reenter1front (384*288), 239 frames   7.94   6.90    9.78   8.65   11.05  10.29    8.08   4.65

Table 2 Average (Mean) and Standard Deviation (SD) values of tracking error (pixels) for sequences recorded with a moving camera.

Sequence                              COV [7]        ICTL [11]      MC [12]        Proposed
                                      Ave.    SD     Ave.    SD     Ave.    SD     Ave.    SD
Couple (320*240), 140 frames         20.26  39.95   14.35  16.85   16.46  19.82    7.79   5.30
Jogging1 (352*288), 306 frames        8.58  12.40    8.15   5.52   90.10  51.82    7.25   4.02
Jogging2 (352*288), 306 frames        6.76  12.36  124.5   66.31    6.02   6.39    6.00   7.19
Woman (352*288), 542 frames          62.80  46.91   32.44  26.42  110.6   75.24   26.42  13.80

Fig. 6 shows the quantitative evolution of the tracking performance of the proposed method and of COV, ICTL and MC over all frames of the "EnterExit1cor" sequence. The tracking error is measured as the Euclidean distance between the center points of the predicted targets and the ground truth. The proposed method has considerably lower tracking errors, particularly in frames where the other approaches perform poorly, thanks to the motion prediction and GFF introduced in this paper. Note that the ICTL and MC approaches show tracking performance close to the proposed method up to about the 165th frame; after this point, their performance decreases dramatically due to particle degeneracy and the lack of sufficient samples.

Fig. 7 (a) shows visual tracking results of the proposed method and the other approaches for the "Couple" sequence, which contains abrupt motion and scale variations. Similar feature descriptors occurring in the background reduce the performance of the COV tracker. The MC and ICTL trackers, on the other hand, are unable to track the target efficiently because of the abrupt motion. This challenge does not affect the performance of the proposed method, thanks to the introduced KF prediction and GFF weighting. Note that the multiple objects to be tracked are treated as a single object at the beginning of the sequence. Two distinct target regions become clearly visible and are successfully detected by the proposed method up to the middle of the sequence. After the 60th frame, one of the targets is occluded by the other; the proposed approach is still able to track accurately. Tracking performance decreases slightly toward the end of the sequence, but remains better than that of the compared approaches. This degradation is simply due to a split event between the two target regions and the absence of information about the occluded target in the reference covariance matrix. Fig. 7 (b) shows visual tracking results for the "Woman" sequence, which contains several long-term semi-occlusions and large-scale variations. The reference appearance is clearly affected by occlusion over time. However, the proposed method reduces the influence of background clutter and noise, and hence outperforms the other trackers thanks to the occlusion detection stage and KF prediction.

Fig. 8 shows a performance evaluation of the proposed method using only the GFF, only the KF, or both together. The combination of GFF and KF performs better in almost all sequences and clearly outperforms the individual cases overall. Moreover, the combined case is less sensitive to occlusion and to feature descriptors similar to the target region. Using the GFF or KF individually performs worse in the presence of abrupt motion, large-scale variations and a high number of similar feature descriptor regions. The proposed algorithm takes about 0.74 ms per frame with non-optimized MATLAB code on a 3.4 GHz PC; the ICTL and MC trackers take 0.30 and 0.36 ms per frame, respectively, on the same platform. The number of search points in RCD based tracking, which also applies to COV [7], accounts for the relatively higher computational load. Note that the proposed GFF and KF stages, which constitute the novel part of this paper, have no significant effect on the complexity of the tracking process.
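The adaptive occlusion test of Eq. (6), with Coeff = 2 and M = 30 as selected in the parameter study above, can be sketched as follows. Whether occluded frames should update the moving average is an assumed implementation detail, not specified in the paper.

```python
from collections import deque

# Coeff = 2 and M = 30 follow the parameter study in the text.
COEFF, M = 2.0, 30

class OcclusionDetector:
    """Eq. (6): TH_t = Coeff * mean of the last M distance measures.
    A distance above TH_t is flagged as a likely occlusion, in which
    case the tracker would fall back on the KF a priori estimate."""
    def __init__(self, coeff=COEFF, m=M):
        self.coeff = coeff
        self.history = deque(maxlen=m)  # last M trusted distances

    def occluded(self, dist):
        if self.history:
            th = self.coeff * sum(self.history) / len(self.history)
        else:
            th = float("inf")  # no history yet: accept the measurement
        flag = dist > th
        if not flag:
            # assumed detail: only trusted measurements update the average
            self.history.append(dist)
        return flag
```

For example, after thirty frames with distances around 1.0 the threshold sits at 2.0, so a sudden distance of 5.0 is flagged and the a priori KF position is used instead of the measured one.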

Fig. 4 Parameter optimization according to the FAR value: a) sub-optimal Coeff selection, b) sub-optimal M selection, c) sub-optimal PECM selection, d) sub-optimal MNCM selection.


Fig. 5 The relation between the a priori estimated Coeff parameter and the a posteriori estimated Coeff parameter according to the FAR value.

Fig. 6 Comparison of tracking error performance of the proposed method and the COV [7], ICTL [11] and MC [12] approaches for the EnterExit1cor sequence.

5 Conclusions

This paper proposes a novel visual tracking approach that incorporates a Gaussian formed function and a Kalman filter with the RCD. Robust target tracking is achieved using Kalman filter guided Gaussian similarity weighting. Additionally, an adaptive occlusion detection phase is integrated into the proposed approach to ensure more reliable tracking results. Detailed experimental results demonstrate the robustness of the proposed method in relatively complex situations. The proposed approach is robust to abrupt as well as smooth motion in visual tracking, and the algorithm can easily be extended to RCD based multiple target tracking if desired.

6 References

[1] Comaniciu, D., Ramesh, V., and Meer, P., "Real-time tracking of non-rigid objects using mean shift," IEEE Conf. on Computer Vision and Pattern Recognition, pp. 142-149, 2000.
[2] Nummiaro, K., Koller-Meier, E., and Van Gool, L., "A color-based particle filter," In Proc. of Workshop on Generative-Model Based Vision, pp. 53-60, 2002.
[3] Sun, L., Liu, G., and Liu, Y., "Multiple pedestrians tracking algorithm by incorporating histogram of oriented gradient detections," IET Image Processing, 7, 7, 653-659, 2013.
[4] Rami, H., Hamri, M., and Masmoudi, L., "Object Tracking in Images Sequence using Local Binary Pattern," Int. Journal of Computer Applications, 63, 20, 19-23, 2013.
[5] Birchfield, S., "Elliptical head tracking using intensity gradients and colour histograms," In Proc. of IEEE Conf. on Comp. Vis. Pattern Recog., pp. 232-237, 1998.
[6] Tuzel, O., Porikli, F., and Meer, P., "Region Covariance: A fast descriptor for detection and classification," In Proc. 9th European Conf. on Computer Vision, pp. 18-25, 2005.
[7] Porikli, F., Tuzel, O., and Meer, P., "Covariance Tracking using model update based on Lie algebra," In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2006.
[8] Tuzel, O., Porikli, F., and Meer, P., "Pedestrian Detection via Classification on Riemannian Manifolds," IEEE Trans. on Pattern Analysis and Machine Intelligence, 30, 10, 1713-1727, 2008.
[9] Paisitkriangkrai, S., Shen, C., and Zhang, J., "Fast pedestrian detection using a cascade of boosted covariance features," IEEE Trans. Circuits Syst. Video Technol., 18, 8, 1140-1151, 2008.
[10] Pang, Y., Yuan, Y., and Li, X., "Gabor-based region covariance matrices for face recognition," IEEE Trans. Circuits Syst. Video Technol., 18, 7, 989-993, 2008.
[11] Wu, Y., Cheng, J., Wang, J., Lu, H., et al., "Real-Time Probabilistic Covariance Tracking with Efficient Model Update," IEEE Trans. on Image Processing, 21, 5, 2824-2837, 2012.
[12] Ding, X., Huang, C., Huang, F., Xu, L., and Fang, L. X., "Region covariance based object tracking using Monte Carlo method," 8th IEEE International Conf. on Control and Automation, pp. 1802-1805, 2010.
[13] Liu, Y., Li, G., and Shi, Z., "Covariance Tracking via Geometric Particle Filtering," EURASIP Journal on Advances in Signal Processing, 2010.
[14] Cortez, C. P., Undurraga, R. C., Mery, Q. D., and Soto, A., "Performance Evaluation of the Covariance Descriptor for Detection," International Conf. of the Chilean Computer Science Society, pp. 133-141, 2009.
[15] Palaio, H., and Batista, J., "A region covariance embedded in a particle filter for multi-objects tracking," 8th Int. Workshop on Visual Surveillance, France, 2008.
[16] Tabia, H., Laga, H., Picard, D., and Gosselin, P. H., "Covariance Descriptors for 3D Shape Matching and Retrieval," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 4185-4192, 2014.
[17] Kalman, R. E., "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME - Journal of Basic Engineering, 82, pp. 35-45, 1960.
[18] Förstner, W., and Moonen, B., "A metric for covariance matrices," Technical report, Dept. of Geodesy and Geoinformatics, 1999.

Fig. 7 Visual tracking results of COV [7], ICTL [11], MC [12] and the proposed method: a) Couple, b) Woman. (Best viewed in color.)

Fig. 8 Performance evaluation of the proposed method using only GFF, only KF, and both together (GFF-KF), against the ground truth: a) Jogging1 (#75), b) Jogging2 (#285), c) Reenter1front (#148), d) EnterExit1cor (#161), e) Race1 (#183). (Best viewed in color.)

Acknowledgments

We would like to thank F. Porikli for valuable discussions and for sharing the corresponding video sequence databases. O. Akbulut would also like to thank O. Urhan for constructive discussions.