Robust Visual Tracking Using an Effective Appearance Model Based on Sparse Coding

SHENGPING ZHANG, Harbin Institute of Technology, Science and Technology on Avionics Integration Laboratory

HONGXUN YAO, XIN SUN, and SHAOHUI LIU, Harbin Institute of Technology

Intelligent video surveillance is currently one of the most active research topics in computer vision, especially when facing the explosion of video data captured by a large number of surveillance cameras. As a key step in an intelligent surveillance system, robust visual tracking is very challenging for computer vision. However, it is a basic functionality of the human visual system (HVS). Psychophysical findings have shown that the receptive fields of simple cells in the visual cortex can be characterized as being spatially localized, oriented, and bandpass, and that they form a sparse, distributed representation of natural images. In this article, motivated by these findings, we propose an effective appearance model based on sparse coding and apply it to visual tracking. Specifically, we consider the responses of general basis functions extracted by independent component analysis on a large set of natural image patches as features and model the appearance of the tracked target as the probability distribution of these features. In order to make the tracker more robust to partial occlusion, camouflage environments, pose changes, and illumination changes, we further select features that are related to the target based on an entropy-gain criterion and ignore those that are not. The target is finally represented by the probability distribution of those related features. The target search is performed by minimizing the Matusita distance between the distributions of the target model and a candidate using Newton-style iterations. The experimental results validate that the proposed method is more robust and effective than three state-of-the-art methods.

Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Video analysis; I.4.6 [Image Processing and Computer Vision]: Segmentation—Region growing, partitioning; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Motion

General Terms: Algorithms

Additional Key Words and Phrases: Intelligent visual surveillance, appearance model, sparse coding

ACM Reference Format:
Zhang, S., Yao, H., Sun, X., and Liu, S. 2012. Robust visual tracking using an effective appearance model based on sparse coding. ACM Trans. Intell. Syst. Technol. 3, 3, Article 43 (May 2012), 18 pages.
DOI = 10.1145/2168752.2168757 http://doi.acm.org/10.1145/2168752.2168757

This work was supported by the National Natural Science Foundation of China (No. 61071180), Key Program (No. 61133003), and Aerospace Science Foundation (No. 20105577015).
Author's address: H. Yao, Visual Intelligence Lab, Harbin Institute of Technology, 92 Xidazhi Street, Harbin, China, 150001; email: [email protected].

1. INTRODUCTION

Due to the wide use of surveillance cameras in public places, a large number of videos need to be processed. However, it is impractical to analyze these data manually due to expensive labor costs. Intelligent surveillance is an effective way to solve this problem by letting computers automatically analyze video data without human participation. Visual tracking is a key preprocessing step for an intelligent surveillance system.


Its aim is to automatically find the same object in adjacent frames of a video sequence once it is initialized [Hu et al. 2004]. In addition, it can also be used in many other applications, such as video indexing, human-computer interfaces, and vehicle navigation. In the past few decades, although a number of elegant tracking algorithms have been proposed, visual tracking still faces many challenges, such as partial occlusion, camouflage environments, pose changes, and illumination changes. Such challenges cause substantial performance degradation of traditional tracking algorithms, which further restricts their usage in many practical applications.

Although visual tracking is very challenging for computer vision, it is a very basic functionality of the human visual system (HVS). Revealing and exploiting the perception mechanisms of the HVS in visual tracking should be an effective way to improve the robustness of a tracker. Although the details of the perception mechanisms of the HVS are still largely mysterious, studies in psychology, neuroscience, and cognitive science have obtained substantial evidence showing how the HVS processes the vast amount of incoming visual information and performs effective visual search. Some significant findings give us inspiration. First, the receptive fields of simple cells in the visual cortex can be characterized as being spatially localized, oriented, and bandpass, and they form a sparse, distributed representation of natural images [Olshausen and Field 1996]. Independent component analysis (ICA) [Bell and Sejnowski 1997; Hateren and Schaaf 1998] can obtain such a sparse representation, and the filters produced by ICA have properties similar to those of the receptive fields. Second, many psychophysical studies [Found and Muller 1996; Maunsell and Treue 2006; Rossi and Paradiso 1994] have demonstrated that feature-based visual attention—that is, the ability to improve target search by enhancing the representation of image components that are related to the target—should be particularly useful when searching for the target. Based on these findings, it is possible to improve the robustness of a tracker by modeling the appearance of the target in the response space of general basis functions and further selecting those responses (features) that are related to the target and ignoring those that are not, from the view of visual attention.

In this article, motivated by the aforementioned psychophysical findings, we present an effective appearance model based on sparse coding which simulates the sparse representation of simple cells in the visual cortex. An early version of the method was presented in Zhang et al. [2010]. We apply the proposed appearance model in visual tracking with a feature selection strategy that simulates the feature-based visual attention mechanism of the HVS. The proposed tracking algorithm is implemented in the framework of target representation-and-search. It has the following three steps.

— General basis functions extraction. We first collect a large number of training samples (patches) from natural images in advance. Using independent component analysis, we then train a set of general basis functions. Each basis function corresponds to a receptive field of simple cells in the visual cortex. Any natural image patch can be represented by a linear combination of all basis functions, and the resulting representation is sparse. Each combination coefficient is, in fact, a filter response.
We refer to the responses of all basis functions as features in this work.
— Target representation. After dividing the target region into multiple patches and representing each patch as a linear combination of the trained general basis functions, we can compute the probability distribution of all features. Subsequently, we compute the system's entropy and the entropy gain of each feature. We further select the features that are related to the tracked target by checking whether each feature's entropy gain is negative. The target is finally represented by the probability distribution of those selected features.


— Target search. The target search is performed by minimizing the Matusita distance between the distributions of the target model and a candidate using Newton-style iterations.

The article is organized as follows. In Section 2, related work is introduced to outline current research progress. Section 3 discusses general basis function extraction using independent component analysis. Section 4 provides details of the proposed tracking algorithm, including the proposed appearance model based on sparse coding, target representation, and target search. Section 5 presents experimental comparisons between the proposed tracking algorithm and three state-of-the-art methods, and Section 6 concludes the article.

2. RELATED WORK

2.1. Visual Tracking

In the past few decades, numerous tracking algorithms have been proposed. The three main frameworks into which most algorithms fall are (1) moving detection-and-track, (2) filter-and-data association, and (3) target representation-and-search. In the framework of moving detection-and-track [Collins et al. 2001; Stauffer and Grimson 2000], visual tracking is accomplished by detecting the moving objects in video frames. When there is only one moving object, it is easy to obtain the trajectory of the detected object. However, when multiple objects appear in the video, this framework is not effective for accurately obtaining the trajectory of each object. In the framework of filter-and-data association [Nummiaro et al. 2003; Isard and Blake 1998], the most widely used particle filter [Pérez et al. 2002] is also well known as the sequential Monte Carlo method [Doucet et al. 2001]. It recursively constructs the posterior probability density function of the state space using Monte Carlo integration. It has been developed in the computer vision community and applied to tracking problems under the name Condensation [Isard and Blake 1998]. In Zhou et al. [2004], an appearance-adaptive model is incorporated into a particle filter to realize robust visual tracking and classification. A hierarchical particle filter is used for multiple object tracking in Yang et al. [2005b].

In this work, we are concerned with target representation-and-search-based tracking algorithms, which have attracted significant attention and achieved considerable success in recent years [Dewan and Hager 2006; Yang et al. 2005a, 2007] since mean shift-based tracking was proposed in Comaniciu et al. [2003]. These methods usually have three steps: (1) target representation, (2) similarity measurement between the representations of the target model and a candidate, and (3) optimization of the similarity measure. Target representation is very critical for these methods; as the first step, it directly influences the performance of a tracker. A good target representation should not only effectively discriminate the target from its surrounding background but also be robust to uncertain appearance changes, such as partial occlusion, camouflage environments, pose changes, and illumination changes. This seems to lead to a paradox: effective discrimination between target and background requires that the representation be sensitive to appearance changes due to the presence of background pixels in the candidate region, as shown in Figure 1(b). On the other hand, robustness to uncertain appearance changes requires that the representation not be seriously affected by such changes, as shown in Figure 1(c). Developing a target representation method that meets these two requirements simultaneously is a very challenging task.

In previous approaches, great research efforts have been devoted to developing effective target representation algorithms. In principle, a wide range of features could be used for representing the target, including color, texture, and contour.


Fig. 1. Target search in candidate regions. (a) The target model. (b) The candidate region marked by the red rectangle is not similar to the model due to the presence of background pixels in the candidate region. (c) The candidate region marked by the green rectangle should ideally be similar to the model, even if the bush partially occludes the target.

The most typically used feature is color or intensity. In Comaniciu et al. [2003], a color histogram extracted in the target region is used to represent the target's appearance. The color spaces used in the aforementioned methods are fixed in advance and remain unchanged during the entire tracking process. In fact, for different scenes, adaptive color spaces are more effective for discriminating the target from the background. Collins et al. [2005] present an online discriminative feature selection algorithm to represent the target. The tracker performs well when the target has low contrast with the background. The global color histogram representation loses the spatial information of the target, which precludes the application of more elaborate motion models. In order to overcome this limitation, the target is represented using kernel density estimates in a joint spatial-feature space [Yang et al. 2005a]. This approach treats the location and features in an integrated way and searches for the target by a mean shift procedure. In addition to color features, more discriminative features, such as the scale invariant feature transform (SIFT) [Lowe 2004] and the histogram of oriented gradients (HOG) [Dalal and Triggs 2005], have also been successfully used in visual tracking. In Zhou et al. [2008], SIFT features are used to match regions of interest across frames. Avidan [2007] constructs a feature vector including HOG features and RGB features in an ensemble-tracking framework.

In addition to these methods that consider the target region as a whole, there are also some methods that divide the target region into patches. In order to be robust to partial occlusion, Adam et al. [2006] evenly divide the target region into multiple patches and represent each patch by a gray-level histogram. Each patch is used to vote on the candidate positions, and the final tracking result is obtained by combining the vote maps of all patches. Yang et al. [2007] present a visual tracking algorithm that selects attentional spatial regions to represent the target. These two approaches are robust to partial occlusion since they represent the target by multiple spatial patches. Nejhum et al. [2010] model the constantly changing foreground shape with a small number of rectangular blocks whose positions within the tracking window are adaptively determined. Under the general assumption of stationary foreground appearance, robust object tracking is possible by adaptively adjusting the locations of these blocks. In the case of partial occlusion, there will always be some patches that are not occluded. Robustness to partial occlusion can be achieved by exploiting such patches to localize the target. However, the strategy of selecting which patches represent the target is not intuitive. Furthermore, when the selected patches are occluded, these methods still cannot effectively track the target.

Our proposed method is quite different from those of Adam et al. [2006] and Yang et al. [2007].


Although our method also divides the target region into multiple patches, the division is only for extracting the essential features underlying the patches by ICA. We select those essential features that are related to the tracked target and ignore those that are not. The target is finally represented by the probability distribution of those selected features. With such a strategy, we simulate the feature-based visual attention mechanism of the HVS in visual tracking. In addition, unlike Adam et al.'s and Yang et al.'s methods, which use spatially divided regions to represent the target, the selected features used in this work are location independent. Such a location-independent property makes the proposed method particularly well suited to handling uncertain appearance changes, especially partial occlusion.

2.2. Sparse Representation in Computer Vision

Sparse representation theory has shown that sparse signals can be exactly reconstructed from a small number of linear measurements [Candès et al. 2006; Donoho 2006]. It has gained increasing attention due to its success in various computer vision applications. A pioneering attempt used sparse representation theory for face recognition [Wright et al. 2009]. Experimental results showed that the method is superior to other state-of-the-art methods, especially under occlusion. Considering that traditional sparse representation is sensitive to image-plane transformations, Huang et al. [2008] proposed a transform-invariant sparse representation and applied it to face recognition. Further, Huang et al. [2009] observed that in some practical sparse data, the nonzero coefficients are often not random but tend to be clustered, leading them to propose dynamic group sparsity. They applied the proposed method to background subtraction and obtained better performance than traditional methods. Other uses of sparse representation include media recovery [Gu et al. 2008], texture segmentation [Mairal et al. 2008], and lighting estimation [Mei et al. 2009].

Sparse representation has also been used in visual tracking [Zhang et al. 2009, 2010]. The most closely related work is the L1 tracker of Mei and Ling [2009]. The target candidate is represented as a linear combination of a template set composed of both target templates and trivial templates. The assumption is that a good target candidate can be sparsely represented by both the target templates and the trivial templates. This sparse optimization problem is solved as an L1 minimization problem with nonnegative constraints. The significant difference between our work and the L1 tracker is that we only use the extracted sparse basis functions as features to represent the tracked target, rather than using sparse representation to infer which candidate is most similar to the target template. In addition, because their method needs to perform L1 minimization for each target candidate, its time cost is expensive, which restricts its usage in real-time visual tracking. In contrast, our proposed method only uses sparse representation to extract features in advance and is more suitable for real-time visual tracking.

3. GENERAL BASIS FUNCTIONS EXTRACTION

This section explains how we extract general basis functions from natural image patches. We first review the foundation of independent component analysis (ICA) and then describe the general basis function extraction algorithm based on ICA.

3.1. Foundation of Independent Component Analysis

Independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents [Comon 1994]. Although it is a signal processing algorithm originally used for blind source separation, it has been widely used in pattern recognition [Bartlett et al. 2002; Liu and Wechsler 2003].


Now we will review the foundation of ICA from the viewpoint of image processing. An image patch, denoted by the vector I, can be represented by a linear combination of N basis functions $a_1, a_2, \ldots, a_N$. The basis functions form the columns of a matrix $A = (a_1, a_2, \ldots, a_N)$. The weights of this linear combination (which vary with each image patch) are given by a vector $s = (s_1, s_2, \ldots, s_N)$. Each element of this vector has its own associated basis function and represents an underlying component of the image patch. All basis functions and their corresponding weights form the linear image synthesis model, which can therefore be written as

I = \sum_{j=1}^{N} s_j a_j = A s.    (1)

The statistical model in Eq. (1) is called independent component analysis; it describes how the image patch I is generated by a set of basis functions. Under the assumption that the components $s_i$ are statistically independent, or as independent as possible, there are several popular methods [Bell and Sejnowski 1997; Hyvarinen 1999] for estimating both A and s given a large number of observations. After estimating the matrix A, its inverse W can also be obtained. The weight vector s for any observation I can therefore be obtained by the linear transformation

s = W I,    (2)

where W is also referred to as the filter matrix, each of whose rows is a linear filter.

The receptive fields of simple cells in the visual cortex have some intrinsic properties: they are localized in space and time, have bandpass characteristics in the spatial and temporal frequency domains, are oriented, and are often sensitive to the direction of motion of a stimulus [Devalois et al. 1982]. Studies [Bell and Sejnowski 1997; Hateren and Schaaf 1998] compared the receptive fields of simple cells with the independent component filters generated by ICA on a large set of natural image patches and showed that they have similar properties. On the other hand, independent component coefficients are statistically independent, which is a desirable property for a feature extraction method. Based on such findings, many pattern recognition algorithms use ICA as a feature extraction procedure. For example, independent component coefficients are used as features to discriminate faces from non-faces [Bartlett et al. 2002; Liu and Wechsler 2003], obtaining superior performance compared with other methods.

3.2. General Basis Functions Extraction

The human visual system (HVS) receives vast amounts of visual information from the external world all the time. It is impractical for the HVS to directly process all of this information. In fact, the receptive fields of simple cells in the visual cortex produce a sparse representation of the incoming data [Olshausen and Field 1996]. The great advantage of such a sparse representation is its representational efficiency. Since each basis function represents an intrinsic primitive in the cortex, a huge amount of incoming data essentially contains only a small number of primitives and can therefore be efficiently represented with few active basis functions. Studies [Bell and Sejnowski 1997; Hateren and Schaaf 1998] show that ICA can produce a sparse representation of an image patch and that each basis function corresponds to a receptive field of simple cells in the visual cortex. Based on such findings, using basis functions extracted by ICA as features to model the appearance of the tracked target will be more informative and effective. In this work, we exploit ICA to produce a sparse representation of any image patch. We divide an input RGB image into 8 × 8 image patches.


Fig. 2. The training results from natural images. (a) The first 64 basis functions and (b) the corresponding 64 filter functions.

In order to get the sparse representation of any image patch, we need to obtain the matrices A and W. We first collect 132,000 8 × 8 RGB image patches from natural images as the training set. Using the ICA method [Bell and Sejnowski 1997], we learn M = 8 × 8 × 3 basis functions $A = (a_1, a_2, \ldots, a_M)$ and the corresponding filter matrix W. The first 64 trained basis functions and corresponding filters are shown in Figure 2. The basis functions resemble short edges, and the filters are similar to receptive fields of simple cells [Hateren and Schaaf 1998]. For any RGB image patch I, its response $s_j$ to the jth filter $w_j$ can be obtained by

s_j = w_j I.    (3)
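As a concrete illustration of Eqs. (1)–(3), the sketch below learns a set of basis functions and filters from randomly sampled patches and then computes the filter responses of a new patch. It is a minimal stand-in, not the authors' implementation: FastICA replaces the infomax ICA of Bell and Sejnowski [1997], the training images are synthetic placeholders, and the helper names are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

PATCH = 8                  # 8x8 RGB patches, as in the paper
M = PATCH * PATCH * 3      # 192 basis functions / filters

def sample_patches(images, n_patches, rng):
    """Randomly crop 8x8 RGB patches and flatten each one into a 192-D vector."""
    out = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - PATCH)
        x = rng.integers(img.shape[1] - PATCH)
        out.append(img[y:y + PATCH, x:x + PATCH, :].reshape(-1))
    return np.asarray(out, dtype=np.float64)

rng = np.random.default_rng(0)
# Placeholder "natural images"; the paper collects 132,000 patches from real ones.
images = [rng.random((240, 320, 3)) for _ in range(10)]
X = sample_patches(images, 20000, rng)      # one training patch per row

ica = FastICA(n_components=M, max_iter=1000, random_state=0)
ica.fit(X)
A = ica.mixing_        # columns a_j: basis functions, I ≈ A s   (Eq. 1)
W = ica.components_    # rows w_j: filters,            s = W I   (Eq. 2)

# Eq. (3): responses of a new patch to all M filters (up to mean removal).
new_patch = sample_patches(images, 1, rng)
s = ica.transform(new_patch)[0]
print(s.shape)         # (192,)
```

In practice, the quality of the learned filters depends on training with real natural-image patches; on random noise the code runs but does not produce edge-like filters.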

4. VISUAL TRACKING USING AN APPEARANCE MODEL BASED ON SPARSE CODING

4.1. Appearance Model Based on Sparse Coding

In order to model the appearance of the tracked target, we should choose a feature space to represent the target. Color is the main source from which we obtain information about the external world, so traditional methods in many visual research fields model the appearance of an object in a color space. However, there are several issues of concern. First, there are many color spaces, for example, the widely used RGB and HSV spaces, and it is unclear which color space representation best reflects the perception mechanism of the human visual system. Second, color information is easily disturbed by noise or illumination changes. Third, representing each image with the color values of all pixels is very resource-consuming. Unlike traditional methods, which usually use a color space, in this work we represent the tracked target in the response space of basis functions. In other words, we use the responses of all basis functions as features to represent the target. The motivations for using responses of basis functions as features are threefold. (1) The sparse representation is efficient, as demonstrated in Section 3.2. (2) The features extracted by ICA are statistically independent. This reduces the redundancy and correlation among features, making them more suitable for discriminating the tracked target from the background. (3) The location-independent property of basis functions makes the features more robust to uncertain appearance changes, such as partial occlusion and illumination changes [Wright et al. 2009]. Suppose there are two images: one designated as the model image, which includes the tracked target, while the other is the target image in which we need to localize the target.


In the model image, the target model is centered at location x and divided into N 8 × 8 RGB image patches $I_1, I_2, \ldots, I_N$ with step l. The divided patches partially overlap when l < 8 and do not overlap when l ≥ 8. The corresponding center locations of the N patches are $x_1, x_2, \ldots, x_N$. Using the trained filter matrix W, we can obtain a response matrix

S_x = (s_{ij})_{N \times M},    (4)

where $s_{ij} = w_j I_i$ is the response of the ith patch to the jth filter. The square of the response, the energy, reflects the intensity of this response. We define the normalized energy matrix

E_x = (\xi_{ij})_{N \times M},    (5)

where

\xi_{ij} = \frac{s_{ij}^2}{\sum_{j=1}^{M} s_{ij}^2}.    (6)

The normalization operation makes the response energies of one patch to the M filters sum to 1. With these definitions, a kernel-weighted feature distribution $q = (q_1, q_2, \ldots, q_M)^T$ of the target model can be defined as

q_j = \frac{1}{C} \sum_{i=1}^{N} K\!\left(\frac{x_i - x}{h}\right) E_x(i, j),    (7)

where K is the kernel function, h is the bandwidth, and C is the normalization constant. The kernel function assigns higher weights to patches that are nearer to the target center. We choose a kernel K(x) with Epanechnikov profile k(x). The constant C is derived by imposing the condition $\sum_{j=1}^{M} q_j = 1$:

C = \sum_{i=1}^{N} K\!\left(\frac{x_i - x}{h}\right).    (8)

Define a column vector $K = (K_i)_{N \times 1}$, where $K_i = \frac{1}{C} K\!\left(\frac{x_i - x}{h}\right)$. Then the feature distribution q can also be written in matrix form as

q = E_x^T K.    (9)
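A minimal NumPy sketch of Eqs. (4)–(9) is given below. It assumes the filter matrix W from Section 3 and the flattened patches with their center locations are already available; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def epanechnikov_profile(u):
    """Epanechnikov profile k(u), applied to squared normalized distances u."""
    return np.maximum(1.0 - u, 0.0)

def feature_distribution(patches, centers, target_center, h, W):
    """Kernel-weighted feature distribution q of Eqs. (4)-(9).

    patches       : (N, D) flattened 8x8 RGB patches I_i (D = 192)
    centers       : (N, 2) patch center locations x_i
    target_center : (2,)   region center x
    h             : kernel bandwidth
    W             : (M, D) filter matrix whose rows are the filters w_j
    """
    S = patches @ W.T                       # Eq. (4): s_ij = w_j I_i
    E = S ** 2
    E /= E.sum(axis=1, keepdims=True)       # Eqs. (5)-(6): normalized energies xi_ij
    u = np.sum(((centers - target_center) / h) ** 2, axis=1)
    k = epanechnikov_profile(u)             # kernel weights K((x_i - x) / h)
    K = k / k.sum()                         # Eq. (8): divide by C = sum_i K(...)
    return E.T @ K                          # Eq. (9): q = E_x^T K, which sums to 1
```

Because each row of the normalized energy matrix sums to 1 and the kernel weights are normalized by C, the resulting q is automatically a probability distribution.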

In this work, we use the feature distribution q to represent the appearance of the target. We can interpret its rationality by analogy with the color histogram of a gray image. Because there are only 256 gray values, we can imagine that there are 256 basis functions, each corresponding to a gray value. For any pixel, its gray value can be represented by a linear combination of all 256 basis functions with only one nonzero combination coefficient (response). For example, if its gray value is 125, the response of the basis function corresponding to 125 is 1, and all other responses are zero. The color histogram of the gray image is in fact the probability distribution of the responses of all pixels in the image. The difference between the color histogram and our proposed appearance model is twofold. First, the color histogram characterizes the distribution of the responses of individual pixels, whereas the proposed appearance model characterizes the distribution of the responses of patches.


Second, in a color histogram, each pixel is in fact represented by only a single basis function, whereas in the proposed appearance model, each patch is represented by multiple basis functions. Because the proposed appearance model uses patch-based statistics and a sparse coding strategy, it is more robust and effective for representing the target.

4.2. Target Representation

Visual attention is widely studied in neuroscience. Recently, computer vision researchers have resorted to visual attention models to address some challenges in computer vision. Visual attention is a powerful mechanism that enables perception to focus on a small subset of the information picked up by our eyes [Maunsell and Treue 2006]. Most existing models [Assad 2003; Reynolds and Chelazzi 2004; Yantis and Serences 2003] study the effects of shifting attention between different locations in the visual field. In other words, these models address space-based attention. However, attention can be allocated not only to a particular location but also to a particular feature. Feature-based attention is the ability to enhance the representation of image components throughout the visual field that are related to a particular feature, which should be particularly useful when searching for a target with that feature [Maunsell and Treue 2006].

Since each feature corresponds to a receptive field and represents an intrinsic primitive, only a few features are activated when the HVS receives incoming information. The feature distribution q is, in fact, the probability distribution of feature activities [Olshausen and Field 1996], which reflects the statistics of feature responses. Many psychophysical studies [Found and Muller 1996; Maunsell and Treue 2006; Rossi and Paradiso 1994] have shown that not all features are important for searching for a particular target. In fact, the HVS improves the target search by enhancing those features that are related to the target and suppressing those that are not. Inspired by these studies, we attempt to select those features that are related to the target to represent it for visual tracking. We define a feature as related to the target when succeeding activations of that feature cannot increase the system's entropy. The intuition behind this definition is straightforward: from the viewpoint of signal coding, a feature is an unexpected feature and needs to be coded when succeeding activations of that feature can increase the system's entropy [Koch and Poggio 1999]. Such features correspond to uncertain appearance changes and are not suitable for robust tracking. In contrast, the related features by our definition are relatively unchanged features (predictable features in signal coding) and are best for robust tracking.

In this work, we introduce the entropy gain of each feature to judge whether the feature is related to the target or not. The joint entropy of the distribution q is $H(q) = -\sum_{j=1}^{M} q_j \log q_j$. We compute the entropy gain of the jth feature [Hou and Zhang 2008] as

g_j = \frac{\partial H(q)}{\partial q_j} = \begin{cases} -H(q) - q_j (1 + \log q_j) - \log q_j & \text{if } q_j \neq 0 \\ 0 & \text{otherwise} \end{cases}.    (10)

We define the related feature index set $F = \{j \mid g_j < 0\}$ and the nonrelated feature index set $\bar{F} = \{j \mid g_j \geq 0\}$. The size of the set F is $\hat{M}$. An example illustrating that related features are more suitable for tracking than nonrelated features is shown in Figure 3. We reconstructed the original image (Figure 3(a)) with related features and nonrelated features, respectively. The reconstructed images are shown in Figure 3(b) and Figure 3(c).


Fig. 3. The reconstructions of the original image with related features and nonrelated features. (a) The original image. (b) The reconstructed image with 58 related features. (c) The reconstructed image with 134 nonrelated features.

Even with a small subset of the features (58 related features out of 192 in our experiment), the reconstructed image preserves most of the information of the original image. However, the reconstructed image built from most of the features (the 134 nonrelated features) loses most of the information of the original image. The illustration shows that representing the target with related features is more effective and suitable for tracking.

With the defined related features, we correspondingly obtain a new filter matrix $\hat{W}$ consisting of the filters in W whose indexes are in the set F. With this new filter matrix $\hat{W}$, we can recompute the new normalized energy matrix $\hat{E}_x$ using a procedure similar to Eq. (5). Note that the size of the matrix $\hat{E}_x$ is $N \times \hat{M}$. We then obtain the probability distribution of the related features $\hat{q} = (\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_{\hat{M}})^T$ as

\hat{q} = \hat{E}_x^T K.    (11)

Similarly, we can also compute the feature distribution $\hat{p}(x + \Delta x)$ of the target candidate centered at location $x + \Delta x$ in the target image as

\hat{p}(x + \Delta x) = \hat{E}_{x + \Delta x}^T K,    (12)

where $\hat{E}_{x + \Delta x}$ is the normalized energy matrix computed at location $x + \Delta x$ in the target image with the new filter matrix.
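The following sketch implements the entropy-gain test of Eq. (10) and the restriction to related features used in Eq. (11). It reuses the hypothetical feature_distribution helper from the earlier sketch; the names are illustrative.

```python
import numpy as np

def entropy_gain(q):
    """Entropy gain g_j of Eq. (10) for every feature of the distribution q."""
    nz = q > 0
    H = -np.sum(q[nz] * np.log(q[nz]))                        # joint entropy H(q)
    g = np.zeros_like(q)
    g[nz] = -H - q[nz] * (1.0 + np.log(q[nz])) - np.log(q[nz])
    return g

def select_related_features(q, W):
    """Keep the features with negative entropy gain (set F) and restrict W to them."""
    related = np.flatnonzero(entropy_gain(q) < 0)
    return related, W[related]                                # W_hat: filters indexed by F

# Usage sketch (q, W, patches, centers, x, h as in the previous examples):
#   related, W_hat = select_related_features(q, W)
#   q_hat = feature_distribution(patches, centers, x, h, W_hat)   # Eq. (11)
```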

4.3. Target Search

Once we obtain the distribution of the target model $\hat{q}$ in the model image and that of a target candidate $\hat{p}(x + \Delta x)$ in the target image, we can perform the target search by checking all target candidates and selecting the candidate that best matches the target model as the tracking result. We use the Matusita distance [Hager et al. 2004] between the distributions of the target model and the candidate as the objective function:

O(\hat{q}, \hat{p}(x + \Delta x)) = \left\| \sqrt{\hat{q}} - \sqrt{\hat{p}(x + \Delta x)} \right\|^2.    (13)

The Matusita distance is conceptually simple and equivalent to the widely used Bhattacharyya coefficient metric [Comaniciu et al. 2003]. The aim of tracking is to find the displacement $\Delta x^*$ that minimizes the objective function $O(\hat{q}, \hat{p}(x + \Delta x))$:

\Delta x^* = \arg\min_{\Delta x} O(\hat{q}, \hat{p}(x + \Delta x)).    (14)
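A small sketch of the objective in Eq. (13); it also makes explicit its relation to the Bhattacharyya coefficient mentioned below, since both distributions sum to 1.

```python
import numpy as np

def matusita_distance(q_hat, p_hat):
    """Objective of Eq. (13): squared L2 distance between square-rooted distributions."""
    return np.sum((np.sqrt(q_hat) - np.sqrt(p_hat)) ** 2)

# With sum(q_hat) = sum(p_hat) = 1, expanding the square gives
#   matusita_distance(q_hat, p_hat) == 2 * (1 - rho),
# where rho = sum_j sqrt(q_hat_j * p_hat_j) is the Bhattacharyya coefficient,
# so minimizing the Matusita distance maximizes the Bhattacharyya coefficient.
```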

In order to solve Eq. (14), we use a Newton-style iterative procedure [Hager et al. 2004], which is more efficient than the widely used mean shift method [Comaniciu et al. 2003]. Specifically, in our work, we can localize the target by iteratively moving the current location x to a new location $x + \Delta x$, where

\Delta x = (J^T J)^{-1} J^T \left( \sqrt{\hat{q}} - \sqrt{\hat{p}(x)} \right),    (15)

with

J = \frac{1}{2} \, \mathrm{diag}(\hat{p}(x))^{-\frac{1}{2}} \, \frac{\partial \hat{p}(x)}{\partial x},    (16)

an $\hat{M} \times 2$ matrix whose jth row is $\frac{1}{2 C \sqrt{\hat{p}_j(x)}} \sum_{i=1}^{N} \hat{\xi}_{ij} \left(\frac{x_i - x}{h}\right)^T g\!\left(\left\|\frac{x_i - x}{h}\right\|^2\right)$. Here $\mathrm{diag}(\hat{p}(x))$ denotes the diagonal matrix with the column vector $\hat{p}(x)$ on its main diagonal and $g(x) = -k'(x)$. The complete target search algorithm is presented next.

ALGORITHM 1: Search the Target in the Current Frame
Given: The target model distribution $\hat{q}$ and its center location x in the previous frame.
(1) Initialize the location of the target candidate in the current frame with x.
(2) Compute the distribution of the target candidate $\hat{p}(x)$.
(3) Compute the matrix J according to Eq. (16).
(4) Compute the displacement $\Delta x$ according to Eq. (15); the next location of the target candidate is then $x = x + \Delta x$.
(5) If $\|\Delta x\| < \epsilon$, stop; else go to step (2).
Output: The location x of the target in the current frame.
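A rough sketch of Eqs. (15)–(16) and Algorithm 1 follows. It reuses the hypothetical feature_distribution helper from the Section 4.1 sketch, assumes an Epanechnikov profile (so g = -k' equals 1 on the kernel support), and uses an assumed extract_candidate callback that returns the candidate patches and their centers around a location; none of these names come from the paper, and the convergence threshold is arbitrary.

```python
import numpy as np

def jacobian(patches, centers, x, h, W_hat, p_hat):
    """Matrix J of Eq. (16) at the current candidate location x."""
    E = (patches @ W_hat.T) ** 2
    E /= E.sum(axis=1, keepdims=True)               # xi_hat_ij of the candidate region
    u = (centers - x) / h                           # (N, 2) normalized offsets
    u2 = np.sum(u ** 2, axis=1)
    C = np.maximum(1.0 - u2, 0.0).sum()             # C = sum_i K((x_i - x)/h), Eq. (8)
    g = (u2 <= 1.0).astype(float)                   # g(t) = -k'(t) = 1 on the support
    rows = (E * g[:, None]).T @ u                   # sum_i xi_ij g_i u_i^T, shape (M_hat, 2)
    return rows / (2.0 * C * np.sqrt(np.maximum(p_hat, 1e-12))[:, None])

def search_target(extract_candidate, q_hat, x0, h, W_hat, eps=0.5, max_iter=20):
    """Algorithm 1: Newton-style updates of Eq. (15) until the displacement is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        patches, centers = extract_candidate(x)     # candidate patches around x (assumed helper)
        p_hat = feature_distribution(patches, centers, x, h, W_hat)   # Eq. (12)
        J = jacobian(patches, centers, x, h, W_hat, p_hat)
        r = np.sqrt(q_hat) - np.sqrt(np.maximum(p_hat, 1e-12))
        dx = np.linalg.lstsq(J, r, rcond=None)[0]   # Eq. (15): (J^T J)^{-1} J^T r
        x = x + dx
        if np.linalg.norm(dx) < eps:                # stopping test of Algorithm 1
            break
    return x
```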

5. EXPERIMENT RESULTS

5.1. Settings

In order to validate the robustness of our tracking method, we present tracking results on four image sequences which respectively correspond to four challenges: partial occlusion, camouflage environments, pose changes, and illumination changes. We compare the proposed method with three state-of-the-art methods: the mean shift tracker [Comaniciu et al. 2003], the L1 tracker [Mei and Ling 2009], and the BH tracker [Nejhum et al. 2010]. For each sequence, the target model distribution is computed once on the initial frame and held fixed for the entire sequence. For all sequences, the division step is set to eight. The comparison is based on qualitative evaluation, by inspecting the tracking results produced by the algorithms, and on quantitative evaluation in terms of the relative distance [Yang et al. 2007]. The relative distance is defined as the normalized distance

\text{Relative Distance} = \sqrt{\left(\frac{x - x_0}{s_x}\right)^2 + \left(\frac{y - y_0}{s_y}\right)^2},

where (x, y) is the location of the tracked target and $(x_0, y_0)$ and $(s_x, s_y)$ are the location and size of the ground truth, respectively. A perfect tracking result is expected to have a relative distance close to 0.
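A one-line implementation of this metric, with a made-up numerical example (the 48 × 168 size matches the walking-sequence initialization in Section 5.2, but the center coordinates are hypothetical):

```python
import numpy as np

def relative_distance(x, y, x0, y0, sx, sy):
    """Tracking error normalized by the ground-truth target size (Section 5.1)."""
    return np.sqrt(((x - x0) / sx) ** 2 + ((y - y0) / sy) ** 2)

# Example: tracked center (105, 62) vs. ground truth (100, 60) of size 48 x 168
# gives sqrt((5/48)**2 + (2/168)**2) ≈ 0.105, i.e., close to the ideal value of 0.
```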


Fig. 4. Tracking results on the walking sequence for frames #20, #130, #156, #600, and #618. The first row contains results obtained by our tracker. The second to fourth rows are the results obtained by the mean shift tracker, L1 tracker, and BH tracker, respectively.

Fig. 5. Tracking results on the walking sequence for frames #150, #152, #154, #156, and #160. The first row contains results obtained by our tracker based on negative entropy gain. The second row contains results obtained by our tracker based on positive entropy gain.

5.2. Partial Occlusion

The first experiment uses the walking sequence [Yang et al. 2007]. The walking person is subjected to irregular, severe occlusion when passing behind a bush at the 156th frame and a guidepost at the 618th frame. We initialize the model with a region of size 48 × 168 and a bandwidth of h = 87. As shown in Figure 4, the mean shift tracker successfully tracks the person before the 156th frame, where there is no occlusion. However, it loses the target at the 156th frame due to the bush occlusion and then recovers the target at the 394th frame. At the 600th frame, it loses the target again due to severe occlusion by the guidepost. The L1 tracker is more robust to partial occlusion than the mean shift tracker and successfully tracks the person before the 600th frame. However, it loses the target when the guidepost occludes the person at the 600th frame. The BH tracker achieves more stable tracking results than the mean shift tracker and the L1 tracker before the 600th frame; however, it finally loses the target at the 600th frame. In contrast, our tracker reliably and successfully tracks the person in the entire sequence. The corresponding quantitative comparisons are shown in Figure 6. As we can see, our tracker achieves the lowest relative distances for the entire sequence and the best tracking results compared with the other methods.

In Section 4.2, we defined the related features based on negative entropy gain. In order to confirm the validity of this definition, we compare it with a related-feature definition based on positive entropy gain. The comparison results are shown in Figure 5, from which we can see that the tracker based on the positive entropy gain definition loses the person when the person passes behind the bush.


Fig. 6. The quantitative comparison of relative distance on the walking sequence.

Fig. 7. Tracking results on the zebra sequence for frames #8, #80, #104, #134, and #148. The first row contains results obtained by our tracker. The second to fourth rows are the results obtained by the mean shift tracker, L1 tracker, and BH tracker, respectively.

The reason for this is that, as mentioned in Section 4.2, a feature with positive entropy gain is an unexpected feature [Koch and Poggio 1999]. For example, the unexpected features in the walking sequence correspond to the color region of the right shoulder of the person. If we represent the target by such features, then when there is occlusion, especially when the bush occludes the right shoulder of the person, the tracker will lose the target. In contrast, if we represent the target by features with negative entropy gains (corresponding to predictable features), the tracker will be more robust to partial occlusion.

5.3. Camouflage Environment

The camouflage environment is very challenging for visual tracking since there are similar objects around the target. We validate whether our tracker is robust to camouflage backgrounds by tracking a zebra with another zebra nearby in the zebra sequence [Yang et al. 2007]. The target model is initialized with a region of size 180 × 46 and a bandwidth of h = 92. As shown in Figure 7, the mean shift tracker, the L1 tracker, and the BH tracker lose the target and track the other zebra when the target passes the other zebra at the 104th frame. Our tracker accurately tracks the target. The reason is that the mean shift tracker, the L1 tracker, and the BH tracker use RGB color as features to represent the target. When the nearby background presents a similar color, these trackers cannot accurately localize the target.


Fig. 8. The quantitative comparison of relative distance on the zebra sequence.

Fig. 9. Tracking results on the face sequence for frames #1, #16, #50, #96, and #208. The first row contains results obtained by our tracker. The second to fourth rows are the results obtained by the mean shift tracker, L1 tracker, and BH tracker, respectively.

However, our tracker represents the target with features that are related to the target. Such a representation can effectively discriminate the target from the background, even in a camouflage environment. The quantitative comparisons are shown in Figure 8. We can see that our tracker achieves the lowest relative distances for the entire sequence compared with the other methods.

5.4. Pose Changes

In the third experiment, we use the face sequence, which we recorded in an indoor office. This sequence involves strong pose changes. We initialize the target model of the frontal face with a region of size 76 × 84 and a bandwidth of h = 56. As shown in Figure 9, the mean shift tracker tracks the background region when the head turns left or right. Our tracker achieves better tracking results than the mean shift tracker and comparable tracking results with the L1 tracker and the BH tracker. The quantitative comparisons are shown in Figure 10, which also shows that our tracker is superior to the mean shift tracker.


Fig. 10. The quantitative comparison of relative distance on the face sequence.

Fig. 11. Tracking results on the trellis sequence for frames #10, #60, #120, #180, and #240. The first row contains results obtained by our tracker. The second to fourth rows are the results obtained by the mean shift tracker, L1 tracker, and BH tracker, respectively.

5.5. Illumination Changes

In order to validate whether the proposed tracking method is robust to illumination changes, we conducted the fourth experiment on the trellis sequence [Ross et al. 2008]. The sequence was recorded in an outdoor environment where lighting conditions change drastically: a person walks underneath a trellis covered by vines, resulting in significant variation in appearance due to illumination changes. The target model is initialized with a region of size 66 × 60 and a bandwidth of h = 40. The tracking results of the 10th, 60th, 120th, 180th, and 240th frames are presented in Figure 11. The mean shift tracker gradually converges to the hair of the person due to the target's appearance changes caused by illumination changes. The L1 tracker starts to drift at the 180th frame and finally loses the target. Our proposed method successfully tracks the head of the person during the entire sequence, and the BH tracker performs slightly worse than our tracker. The quantitative comparisons are shown in Figure 12. Our tracker is superior to the other methods.


Fig. 12. The quantitative comparison of relative distance on the trellis sequence.

Table I. The Average Processing Speed (Frame/s) and the Number of Iterations on the Walking Sequence

Method               Speed   Number of iterations
Mean shift tracker   9.0     5
Our tracker          8.7     2

5.6. Processing Speed

We compare the processing speed of our tracker with that of the mean shift tracker on the walking sequence, which consists of 671 frames of size 320 × 240. Note that the algorithms are implemented in Matlab on a Pentium-IV 3GHz PC. Although our algorithm involves matrix operations, the average processing speed of our tracker is very close to that of the original mean shift tracker, since our tracker uses Newton-style iterations, which require fewer iterations to converge than the mean shift procedure, as shown in Table I.

6. CONCLUSION

In this article, we propose a robust visual tracking algorithm by simulating the sparse coding and feature-based visual attention mechanisms of the human visual system. There are three innovations in this work. First, unlike previous methods that usually use RGB color features, we exploit independent component analysis to extract general basis functions and consider their responses as features. The obtained features are statistically independent and location independent, which makes them robust to uncertain appearance changes. Second, we represent the target with only those features that are related to the target and ignore those that are not. Such a feature selection mechanism provides the ability to effectively discriminate the target from the background. Finally, by optimizing the objective function using Newton-style iterations, the proposed tracking algorithm is computationally efficient. The comparison results between the proposed algorithm and three state-of-the-art methods on four challenging sequences validate that our algorithm is more robust and effective. Although visual tracking with the proposed appearance model is formulated under the mean shift tracking framework, which is not suitable for low frame-rate sequences, the appearance model itself is very general and can be used in other tracking frameworks, for example, particle filter tracking, or in other vision applications such as object detection and recognition.


Future work will focus on improving the selection mechanism of related features to better adapt it to robust visual tracking and on extending our method to handle adaptive scale changes. In addition, we will try to reduce the complexity of the algorithm by exploiting more efficient optimization methods.

REFERENCES

Adam, A., Rivlin, E., and Shimshoni, I. 2006. Robust fragments-based tracking using the integral histogram. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 798–805.
Assad, J. 2003. Neural coding of behavioral relevance in parietal cortex. Curr. Opin. Neurobiol. 13, 2, 194–197.
Avidan, S. 2007. Ensemble tracking. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2, 261–271.
Bartlett, M., Movellan, J., and Sejnowski, T. 2002. Face recognition by independent component analysis. IEEE Trans. Neural Netw. 13, 6, 1450–1464.
Bell, A. and Sejnowski, T. 1997. The independent components of natural scenes are edge filters. Vision Res. 37, 23, 3327–3338.
Candès, E., Romberg, J., and Tao, T. 2006. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59, 8, 1207–1223.
Collins, R. T., Lipton, A. J., Fujiyoshi, H., and Kanade, T. 2001. Algorithms for cooperative multisensor surveillance. Proc. IEEE 89, 10, 1456–1477.
Collins, R. T., Liu, Y., and Leordeanu, M. 2005. Online selection of discriminative tracking features. IEEE Trans. Pattern Anal. Mach. Intell. 27, 10, 1631–1643.
Comaniciu, D., Ramesh, V., and Meer, P. 2003. Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25, 5, 564–577.
Comon, P. 1994. Independent component analysis: A new concept. Signal Process. 36, 3, 287–314.
Dalal, N. and Triggs, B. 2005. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 886–893.
Devalois, R., Yund, E., and Hepler, N. 1982. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res. 22, 5, 531–544.
Dewan, M. and Hager, G. D. 2006. Toward optimal kernel-based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 618–625.
Donoho, D. 2006. Compressed sensing. IEEE Trans. Inf. Theory 52, 4, 1289–1306.
Doucet, A., de Freitas, N., and Gordon, N. 2001. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, NY.
Found, A. and Muller, H. 1996. Searching for unknown feature targets in more than one dimension: Investigating a 'dimension-weighting' account. Perception Psychophysics 58, 1, 88–101.
Gu, J., Nayar, S., Grinspun, E., Belhumeur, P., and Ramamoorthi, R. 2008. Compressive structured light for recovering inhomogeneous participating media. In Proceedings of the European Conference on Computer Vision. 845–858.
Hager, G. D., Dewan, M., and Stewart, C. V. 2004. Multiple kernel tracking with SSD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 790–797.
Hateren, J. and Schaaf, A. 1998. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Biol. Sci. 265, 1394, 359–366.
Hou, X. and Zhang, L. 2008. Dynamic visual attention: Searching for coding length increments. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 681–688.
Hu, W., Tan, T., Wang, L., and Maybank, S. 2004. A survey on visual surveillance of object motion and behaviors. IEEE Trans. Syst. Man Cybern. Part C 34, 3, 334–352.
Huang, J., Huang, X., and Metaxas, D. 2008. Simultaneous image transformation and sparse representation recovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
Huang, J., Huang, X., and Metaxas, D. 2009. Learning with dynamic group sparsity. In Proceedings of the International Conference on Computer Vision. 1–8.
Hyvarinen, A. 1999. Fast and robust fixed-point algorithm for independent component analysis. IEEE Trans. Neural Netw. 10, 3, 626–634.
Isard, M. and Blake, A. 1998. Condensation—conditional density propagation for visual tracking. Int. J. Comput. Vision 29, 5–28.
Koch, C. and Poggio, T. 1999. Predicting the visual world: Silence is golden. Nature Neurosci. 2, 1, 9–10.


Liu, C. and Wechsler, H. 2003. Independent component analysis of Gabor features for face recognition. IEEE Trans. Neural Netw. 14, 4, 919–928.
Lowe, D. G. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 2, 91–110.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. 2008. Discriminative learned dictionaries for local image analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
Maunsell, J. and Treue, S. 2006. Feature-based attention in visual cortex. Trends Neurosci. 29, 6, 317–322.
Mei, X. and Ling, H. 2009. Robust visual tracking using L1 minimization. In Proceedings of the International Conference on Computer Vision. 1436–1443.
Mei, X., Ling, H., and Jacobs, D. 2009. Sparse representation of cast shadows via L1-regularized least squares. In Proceedings of the International Conference on Computer Vision. 583–590.
Nejhum, S. M. S., Ho, J., and Yang, M.-H. 2010. Online visual tracking with histograms and articulating blocks. Comput. Vision Image Understand. 114, 8, 901–914.
Nummiaro, K., Koller-Meier, E., and Gool, L. V. 2003. An adaptive color-based particle filter. Image Vision Comput. 21, 1, 99–110.
Olshausen, B. and Field, D. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 6583, 607–609.
Pérez, P., Hue, C., Vermaak, J., and Gangnet, M. 2002. Color-based probabilistic tracking. In Proceedings of the European Conference on Computer Vision. 661–675.
Reynolds, J. and Chelazzi, L. 2004. Attentional modulation of visual processing. Ann. Rev. Neurosci. 27, 611–647.
Ross, D., Lim, J., Lin, R., and Yang, M. 2008. Incremental learning for robust visual tracking. Int. J. Comput. Vision 77, 125–141.
Rossi, A. and Paradiso, M. 1994. Feature-specific effects of selective visual attention. Vision Res. 35, 5, 621–634.
Stauffer, C. and Grimson, W. E. L. 2000. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22, 8, 747–757.
Wright, J., Yang, A. Y., Ganesh, A., Sastry, S., and Ma, Y. 2009. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2, 210–227.
Yang, C., Duraiswami, R., and Davis, L. 2005a. Efficient mean-shift tracking via a new similarity measure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 176–183.
Yang, C., Duraiswami, R., and Davis, L. 2005b. Fast multiple object tracking via a hierarchical particle filter. In Proceedings of the International Conference on Computer Vision. 212–219.
Yang, M., Yuan, J., and Wu, Y. 2007. Spatial selection for attentional visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–7.
Yantis, S. and Serences, J. 2003. Cortical mechanisms of space-based and object-based attentional control. Curr. Opin. Neurobiol. 13, 2, 187–193.
Zhang, J., Cai, W., Tian, Y., and Yang, Y. 2009. Visual tracking via sparse representation based linear subspace model. In Proceedings of the IEEE International Conference on Computer and Information Technology. 166–171.
Zhang, S., Yao, H., and Liu, S. 2010. Robust visual tracking using feature-based visual attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 1150–1153.
Zhang, S., Yao, H., Sun, X., and Liu, S. 2010. Robust object tracking based on sparse representation. In Proceedings of the Visual Communications and Image Processing Conference. 1–8.
Zhou, H., Yuan, Y., and Shi, C. 2008. Object tracking using SIFT features and mean shift. Comput. Vision Image Understand. 113, 3, 345–352.
Zhou, S., Chellappa, R., and Moghaddam, B. 2004. Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Trans. Image Process. 11, 1491–1506.

Received April 2010; revised October 2010, December 2010; accepted February 2011
