
Coupled Prediction Classification for Robust Visual Tracking

Ioannis Patras, Member, IEEE, and Edwin R. Hancock

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, September 2010.

Abstract—This paper addresses the problem of robust template tracking in image sequences. Our work falls within the discriminative framework in which the observations at each frame yield direct probabilistic predictions of the state of the target. Our primary contribution is that we explicitly address the problem that the prediction accuracy for different observations varies and, in some cases, can be very low. To this end, we couple the predictor to a probabilistic classifier which, when trained, can determine the probability that a new observation can accurately predict the state of the target (that is, determine the "relevance" or "reliability" of the observation in question). In the particle filtering framework, we derive a recursive scheme for maintaining an approximation of the posterior probability of the state in which multiple observations can be used and their predictions moderated by their corresponding relevance. In this way, the predictions of the "relevant" observations are emphasized, while the predictions of the "irrelevant" observations are suppressed. We apply the algorithm to the problem of 2D template tracking and demonstrate that the proposed scheme outperforms classical methods for discriminative tracking, both in the case of motions which are large in magnitude and also for partial occlusions.

Index Terms—Regression, tracking, state estimation, relevance determination, probabilistic tracking.

I. Patras is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Road, E1 4NS London, UK. E-mail: [email protected].
E.R. Hancock is with the Department of Computer Science, University of York, YO10 5DD York, UK. E-mail: [email protected].
Manuscript received 16 May 2008; revised 21 Oct. 2008; accepted 20 Aug. 2009; published online 6 Oct. 2009. Recommended for acceptance by P. Perez. Digital Object Identifier no. 10.1109/TPAMI.2009.175.

1 INTRODUCTION

Vision-based tracking is one of the fundamental and most challenging low-level problems in computer vision. Formally, it is defined as the problem of estimating the state x (e.g., position, scale, rotation, or 3D pose) of a target, given a set of noisy image observations Y = {..., y^-, y} up to the current frame. (Throughout, we denote with a^- an observation or an instantiation of a random variable a at the previous time instant; for example, y is the observation at the current frame and y^- the observation in the frame before.) Usually, an estimate of the state at each frame is the location of a minimum of a cost function or, in the probabilistic framework, the location of a maximum of the posterior p(x | Y). Alternatively, a representation of the posterior p(x | Y) can be maintained for each frame of the image sequence.

1.1 Literature Review

The large number of methods that have been proposed in the last decades for maintaining a representation of, or for finding a maximum of, the posterior p(x | Y) fall into two main categories. In the first category belong the generative methods (e.g., [11], [9]). Such methods require the inversion of the posterior p(x | Y) using the Bayes rule and the evaluation of the likelihood p(y | x) at certain sample states x. An important drawback is that at least some evaluations need to be performed at sample points in the state space that


are close to the "true" state, and therefore, a large number of solutions need to be examined. Detection-based methods [27] that exhaustively search all image locations for the presence of a target fall into this category. In order to reduce the computational complexity and to deal with likelihoods with multiple modes, a number of methods have been proposed for selecting candidate solutions in the generative tracking framework. A common choice utilizes a motion model (e.g., [11]) for the proposal distribution, that is, the distribution from which the candidate solutions will be sampled. Typical choices for the motion model range from general constant velocity/acceleration models to higher order models whose parameters can be learned from training data [11], [12] or online [5]. A motion model assists the estimation process so long as the temporal evolution of the state follows it. However, the residual between the model-based temporal prediction and the true target state can be significant in the general case of irregular motion, novel motion, or a moving camera. Other methods utilize the fact that in certain domains the true target state may lie on a manifold in the state space, and therefore (probabilistic) priors may exist on the state of the target. This is the case when the state encodes the position of multiple interacting targets (such as facial points [21]) or the position of the components of a constrained articulated structure such as the human body [23]. Alternative methods utilize the observations in the current frame in order to sample from areas where the likelihood is expected to be higher. This may be done by performing a two-stage propagation [22], or by using mixtures of learned detectors and dynamic models (e.g., [18]). In the latter case, a target detector needs to be applied at every image location and at various scales.

The second category consists of the discriminative (or prediction-based) methods. In contrast to generative methods, in discriminative tracking, an observation y delivers a direct prediction of the hidden state x. This alleviates the need for a good proposal distribution and multiple

evaluations. The predictor can be obtained in two ways. First, it can be derived analytically from modeling assumptions. This is the case with classical motion estimation schemes that utilize the optical flow equation, such as the method of Lucas and Kanade [16] and the work of Simoncelli et al. [24]. Many motion estimation schemes are posed as an optimization problem which is solved using the gradient ∂y/∂x of the observation y with respect to the state x, for example, within a gradient descent approach. While a point estimate is usually obtained [16], Simoncelli et al. [24] derive an estimate of the distribution of the posterior p(x | y) by explicitly modeling the distribution of the noise in the various terms that appear in the optical flow equation.

Second, the predictor may be learned in a supervised way from training data. In the learning framework, a number of researchers have recently proposed methods for deformable motion estimation, 2D template tracking, and 3D human pose estimation [8], [25], [2]. One of the first learning-based approaches is that of Cootes et al. [8], which estimates the parameters that optimally warp an image to an appearance model in an iterative way. The state update δx at each iteration is estimated from the intensity differences δy between the warped image and the appearance model; note that the intensity residual δy depends on the current state x. Instead of estimating δx using the gradient of the observation (i.e., ∂y/∂x), they learn in a supervised way a linear relation between them, that is, they learn a matrix A such that δx = A δy. For 3D human pose tracking, Sminchisescu et al. [25], [6] train a Bayesian Mixture of Experts in order to learn a multimodal posterior p(x | y). Agarwal and Triggs [2] use Relevance Vector Machines (RVMs) in order to learn mappings between vectors of image descriptors and the 3D pose of a human body. In [1], they use Nonnegative Matrix Factorization in order to remove parts of the observation vector that are due to noise or occlusion. For 2D tracking, Williams et al. [30] use RVMs in order to learn the posterior of the location of a visual target (e.g., a human face) given an observation at a certain image location. Finally, for 2D tracking, Jurie and Dhome [15] learn in a supervised way a linear relation between the intensity differences between two templates and the corresponding motion transformation.

In order to deal with possibly large prediction errors, most of the previous methods rely mainly on temporal filtering. Sminchisescu et al. [25] and Agarwal and Triggs [2] use as observations features that are extracted from a single-object silhouette. They address prediction errors by adopting a multiple hypotheses tracking framework that performs temporal filtering. On the other hand, Williams et al. [30] couple the regression-based tracking to a detection-based scheme that is employed to validate that the target is at the predicted position or pose. In the case of a validation failure, a full-scale detection phase is initiated. A Kalman filter is used for temporal filtering and leads to a reduction of the error by an order of magnitude. However, none of these methods addresses explicitly the problem of assessing in advance how well the observation y can predict the state x. Nor do they use multiple observations in order to increase robustness. Regression-based methods are known to be sensitive to observations that do not belong to the space that is sampled by the training data set.
Therefore, the accuracy of the prediction of the posterior


p(x | y) can deteriorate sharply for observations y that are contaminated with noise or come from areas that are uninformative concerning the state of the visual target (e.g., occluded areas). In particular, in the case of 2D tracking, when the motion magnitude is larger than in the training set, the prediction error is likely to be large and tracking is likely to fail. In Fig. 1, we illustrate this effect by plotting the prediction error as a function of the true displacement in an artificial example for a regression-based predictor. Similar observations are reported in [15] and [30] for regression-based schemes and also hold true for gradient-based predictors such as [16].

Fig. 1. Prediction error as a function of the true horizontal and the true vertical displacement. The performance deteriorates sharply outside the training area. Here, a Bayesian Mixture of Experts was trained for displacements in the interval [-11, 11].
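The degradation outside the training range can be reproduced with a toy experiment. The script below is a minimal sketch, not the authors' code: the 1D signal, the least-squares linear predictor (in the spirit of the learned linear mappings of [8], [15]), and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A smooth 1D "template": a fixed sum of random sinusoids.
amps = rng.uniform(0.5, 1.0, 5)
freqs = rng.uniform(0.05, 0.3, 5)
phases = rng.uniform(0.0, 6.28, 5)

def signal(t):
    return sum(a * np.sin(f * t + p) for a, f, p in zip(amps, freqs, phases))

coords = np.arange(50.0, 150.0)          # pixel coordinates of the template window
template = signal(coords)                # reference appearance at shift 0

def observation(shift):
    """Intensity-difference vector between the shifted signal and the template."""
    return signal(coords + shift) - template

# Training pairs: shifts drawn from the training interval [-11, 11].
train_shifts = rng.uniform(-11.0, 11.0, 400)
Y = np.stack([observation(s) for s in train_shifts])   # (400, 100)
A, *_ = np.linalg.lstsq(Y, train_shifts, rcond=None)   # least-squares linear map

# Evaluate inside and outside the training range.
for s in (5.0, 10.0, 15.0, 20.0):
    err = abs(A @ observation(s) - s)
    print(f"true shift {s:5.1f}   |prediction error| = {err:6.2f}")
```

For shifts inside the training interval the linear map typically recovers the displacement well; for larger shifts the error tends to grow, which is the behaviour Fig. 1 illustrates for the BME predictor.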

1.2 Contribution

In this paper, we propose a coupled prediction-classification scheme for prediction-based 2D visual tracking. The method allows the use of multiple observations in such a way that each observation y(r) (r ∈ {r_1, ..., r_R}) contributes to the prediction of the state of the target according to its relevance (or reliability). In our scheme, the corresponding reliability is determined by a probabilistic classifier. In this way, the contribution of predictions that originate from reliable observations is the most significant, while the contribution of predictions that originate from occluded areas or from otherwise unreliable observations is largely suppressed. In order to achieve this goal within a discriminative particle filtering framework [25], we introduce an additional random variable r that is used to obtain (or, in general, utilize) multiple observations denoted by y(r), together with a binary random variable z that is related to the relevance of the observation. We use a probabilistic classifier [26] in order to model the conditional probability p(z = 1 | y(r)) (the probability that the observation y(r) is relevant/reliable) and a probabilistic predictor [29] in order to model p(x | y(r)) (the posterior probability of the state x given an observation y(r)). Both the predictor and the classifier are trained in a supervised way using data that are generated by applying synthetic transformations to the target template from the first frame. Alternatively, both may be trained using data from an annotated database. During tracking (Fig. 2), multiple observations are generated by sampling r, and the prediction of each observation (as given by the probabilistic predictor) is moderated by the corresponding relevance/reliability weight (as given by the probabilistic classifier). Our overall contributions in this paper can be summarized as follows:

Fig. 2. Overview of the proposed method.

- We explicitly address the problem of the determination of the relevance/reliability of an observation to the state estimation process by learning in a supervised way the underlying conditional probability distribution.
- We devise a probabilistic framework that allows multiple observations y(r) to contribute to the prediction of the state of the target according to their corresponding relevance/reliability.
- We make explicit the relation between our framework and alternative discriminative and generative estimation/tracking schemes. More specifically, we show that under certain modeling assumptions (simplifications), our estimation scheme is practically equivalent to classical generative and discriminative estimation schemes.

The remainder of the paper is organized as follows: In Section 2, we provide an outline of the proposed discriminative tracking framework with data relevance determination. In Section 2.1, we briefly describe the Bayesian Mixture of Experts predictor, and in Section 2.2, we present our method for observation relevance determination. Section 2.3 presents a procedure which, given a predictor, selects an appropriate classifier, and in Section 2.4, we show the relation of the proposed scheme to alternative generative and discriminative tracking methods. Section 3 presents experimental results. Finally, in Section 4, we give some conclusions and directions for future work. An early version of the proposed scheme appears in [20].

2 PREDICTION-BASED TRACKING WITH RELEVANCE DETERMINATION

Filtering, such as Kalman filtering or particle filtering, has been the dominant framework for recursive estimation of the conditional probability of the unknown state x given a set of observed random variables Y = {..., y^-, y}. In the discriminative filtering framework (Fig. 3a), the filtered density can be derived as [25]:

    p(x | Y) = ∫ dx^- p(x^- | Y^-) p(x | x^-, y),        (1)

where y (y^-) is the observation at the current (previous) frame and x (x^-) is the state at the current (previous) frame, respectively. Similarly, Y is the set of observations up to the current frame and Y^- the set of observations up to the previous frame. The derivation of (1) ignores the fact that for certain problems, different parts of the observation y can give different predictions of the state of the target. For example, in [30], for 2D tracking where the evidence y is an image frame, the prediction of the state of the target (e.g., its 2D location) is based on the data y(r) extracted from a single

Fig. 3. Graphical models (a) for classical discriminative tracking and (b) for regression tracking with relevance determination.

window centered at a position r. In the absence of a motion model, r is the estimated position of the target in the previous frame, that is, r = \hat{x}^-. However, using data from a single window disregards the information that is available at other positions r. Similarly, for 3D tracking [2], [25], a single feature vector is extracted from the object silhouette. On the other hand, in the generative particle filtering framework for 2D tracking, it is common practice that several parts of the observation are examined. This is achieved by using multiple samples (particles) r and by assigning to each particle a weight λ(y, r) proportional to the likelihood p(y | r). The particles r are sampled from p(x | Y^-) using the transition probability p(x | x^-) and, most usually, the sampled r determines how the observation y will be utilized. The latter means that the likelihood p(y | r) is modeled as a function of y(r), that is, p(y | r) = p(y(r); c), where c are some model parameters. In the simplest case, a number of measurements y(r) at positions r (r ∈ {r_1, ..., r_R}) around the location of the target in the previous frame are utilized (in general, when the state x is not only a 2D displacement, obtaining y(r) requires warping). Given the above, the posterior is empirically approximated using a set of weighted particles, that is, a set of pairs {(λ(y, r_1), r_1), ..., (λ(y, r_R), r_R)}. Formally,

    p(x | Y) ≈ (1/Z) Σ_{r=r_1}^{r_R} λ(y, r) δ(x − r),        (2)

where Z is a scaling parameter and δ(·) is the Kronecker delta function.

Here, we propose a discriminative particle filtering method that utilizes the fact that several parts of the observation can yield predictions of the state of the target. We do so by introducing a random variable r that determines which parts of the observation y will be used or, in general, how it will be used. Without loss of generality, in the derivations that follow, we will assume that r has the dimensionality and the physical meaning of the hidden state x. For example, in the case of 2D template tracking where x ∈ R^2, the random variable r ∈ R^2 will determine the centers of the windows/patches at which we will extract observations y(r) that will give predictions of x. In general, r will be used for obtaining a set of candidate observations y(r) and does not need to have the dimensionality of x. We will also condition r on x^-, as we expect that the previous state can be sufficiently informative on how candidate observations can be obtained. Subsequently, we introduce a binary variable z and denote with p(z = 1 | y, r) the probability that the observation y(r) is relevant for the prediction of the unknown state x. The dependencies of the variables are depicted in Fig. 3b, where y is observed and

TABLE 1 Discriminative Filtering with Data Relevance Determination

the remainder are hidden. In this graphical model, the filtered density can be derived as

    p(x | Y) = ∫ dx^- p(x^- | Y^-) ∫ dr p(r | x^-) ∫ dz p(x | z, x^-, y, r) p(z | y, r).        (3)

In what follows, we will describe our modeling choices and a computational scheme for maintaining a representation that approximates the above posterior. We will show that an approximation of the posterior p(x | Y) is

    p(x | Y) ≈ (1/R) [ Σ_{r=r_1}^{r_R} λ(y, r) p(x | z = 1, x^-, y, r) + Σ_{r=r_1}^{r_R} (1 − λ(y, r)) p(x | z = 0, x^-, y, r) ],        (4)

where λ(y, r) ≜ p(z = 1 | y, r) is the relevance of the observation y(r), p(x | z = 1, x^-, y, r) is the probabilistic prediction for the state x given that the observation y(r) is relevant, and p(x | z = 0, x^-, y, r) is a probabilistic prediction of the state x given that y(r) is not relevant. The r_1, ..., r_R are samples of the hidden variable r and need to be sampled properly (in the way that is described below) so that (4) indeed becomes an approximation of the posterior. Finally, notice the similarity in form between our approximation (4) and the approximation in the generative framework (2). We will make the relationship explicit in Section 2.4.

In order to complete the specification of our framework, we need to define its three main formal components, that is, λ(y, r), p(x | z = 1, x^-, y, r), and p(x | z = 0, x^-, y, r). We assume that these probability distributions are either derived from modeling assumptions or learned in a training phase. For example, modeled probabilistic distributions have been used in the context of motion estimation [24].
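As a concrete illustration of (4), the following sketch is an assumption-laden toy, not the authors' implementation: the state is a scalar position, the Gaussian dataclass, posterior_mixture, the stand-in predict/relevance functions, and the value of S_0 are all illustrative choices. It combines per-observation predictions into a posterior mixture, weighting each prediction by its relevance and routing the complementary weight to a broad "irrelevant" component.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian:
    mean: float
    var: float
    weight: float

def posterior_mixture(samples_r, predict, relevance, x_prev, s0=100.0):
    """Build the mixture approximation of eq. (4).

    samples_r : candidate positions r_1..r_R at which observations are taken
    predict   : r -> (mean, var) of p(x | z=1, x_prev, y, r)   (stand-in predictor)
    relevance : r -> lambda(y, r) = p(z=1 | y, r) in [0, 1]    (stand-in classifier)
    x_prev    : previous state; the "irrelevant" prediction is N(x_prev, s0)
    """
    R = len(samples_r)
    components = []
    for r in samples_r:
        lam = relevance(r)
        mean, var = predict(r)
        components.append(Gaussian(mean, var, lam / R))           # relevant branch
        components.append(Gaussian(x_prev, s0, (1.0 - lam) / R))  # irrelevant branch
    total = sum(c.weight for c in components)
    for c in components:
        c.weight /= total                                          # normalize
    return components

# Toy usage with hypothetical predictor/classifier behaviour.
x_prev = 10.0
samples_r = np.linspace(x_prev - 5, x_prev + 5, 7)
predict = lambda r: (12.0 + 0.1 * (r - x_prev), 1.0)              # pretends the target moved to ~12
relevance = lambda r: np.exp(-0.5 * ((r - x_prev) / 3.0) ** 2)    # trust nearby windows more
mix = posterior_mixture(samples_r, predict, relevance, x_prev)
print("posterior mean:", sum(c.weight * c.mean for c in mix))
```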


TABLE 2 Modeling Choices

Here, we opt for a learning approach (as explained in Sections 2.1 and 2.2) in which a probabilistic classifier determines the observation relevance/reliability λ(y, r) and a Bayesian Mixture of Experts (BME) determines the prediction p(x | z = 1, x^-, y, r). In the tracking phase, given a triple (y, r, x^-), the trained BME yields a mixture of Gaussians that is our approximation of p(x | z = 1, x^-, y, r). For p(x | z = 0, x^-, y, r), that is, for the probabilistic prediction given that the observation is not relevant, we use a modeling approach and approximate it using a Gaussian with a large covariance matrix S_0 (alternatively, we could have used a uniform distribution). Our modeling choices lead to an approximation of p(x | Y) using a mixture of Gaussians. This allows us to deal with posteriors with multiple modes and also to recover from tracking failures. In order to keep the number of mixture components constant (equal to M), we devise, in the Appendix, a method for approximating an L-component Gaussian mixture with an M-component Gaussian mixture (M ≤ L). In Table 1, we describe the computational scheme that, given an approximation of the posterior p(x^- | Y^-) of the state at the previous frame by an M-component mixture of Gaussians, yields an approximation of the state posterior p(x | Y) at the current frame by an M-component mixture of Gaussians. In Table 2, we summarize our modeling choices.

Note that in (5), the integral is approximated using K + 1 Gaussian components. In practice, in order to reduce the number of components, we use the approximation:

    ∫ dz p(x | z, x^-, y, r) p(z | y, r) ≈ Σ_{i=1}^{K} π_i N(μ_i + r, S_i),   if λ > θ_z,
                                         ≈ N(x^-, S_0),                      otherwise,        (6)

where π_i, μ_i, and S_i denote the weight, mean, and covariance of the i-th component of the BME prediction (Section 2.1), λ = λ(y, r), and θ_z is a threshold. As a result, we approximate the term ∫ dr p(r | x^-) ∫ dz p(x | z, x^-, y, r) p(z | y, r) with a mixture of L unnormalized Gaussians (R ≤ L ≤ RK). We reduce it to an M-component mixture in step 4. The computational complexity of this scheme is O(RK + RJ + RKM), where O(RK) is the complexity of the BME predictor and O(RKM) the complexity of the EM algorithm for the reduction of the RK-component mixture of Gaussians to an M-component mixture of Gaussians. O(RJ) is the complexity of the kernel-based Relevance Vector Machine classifier (Section 2.2), where J is the number of support vectors.
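The EM-based reduction of the Appendix is not reproduced here. As a hedged illustration of the component-reduction step only, the sketch below collapses an L-component 1D Gaussian mixture to M components by greedily merging the two components with the closest means, using moment matching. This is a common stand-in technique with assumed helper names (merge, reduce_mixture), not the procedure the paper derives.

```python
import numpy as np

def merge(w1, m1, v1, w2, m2, v2):
    """Moment-matched merge of two weighted 1D Gaussian components."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return w, m, v

def reduce_mixture(weights, means, variances, M):
    """Greedily reduce an L-component mixture to M components."""
    comps = list(zip(weights, means, variances))
    while len(comps) > M:
        # Find the pair of components whose means are closest (a crude criterion;
        # KL-divergence-based criteria are more principled).
        best = None
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                d = abs(comps[i][1] - comps[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge(*comps[i], *comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    w, m, v = map(np.array, zip(*comps))
    return w / w.sum(), m, v

# Example: reduce a 6-component mixture to 3 components.
w = np.full(6, 1 / 6)
m = np.array([0.0, 0.2, 5.0, 5.1, 9.8, 10.0])
v = np.ones(6)
print(reduce_mixture(w, m, v, 3))
```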

2.1 Bayesian Mixture of Experts for Regression

In what follows, we describe a method that, given an observation y(r) and the target state at the previous frame x^-, yields a probabilistic prediction of the state x at the current frame. For notational simplicity, let us here denote with y the Cartesian pair (y(r), x^-). Our method follows the work of Sminchisescu et al. [25] and uses the Bayesian Mixture of Experts for regression. Given an observation y, the BME delivers a probabilistic prediction of x as a mixture of Gaussians. The rationale behind this choice over alternative regression methods (e.g., RVMs [26]) is that the BME can successfully model predictive distributions that are multimodal in character. Such distributions often arise in the case of 3D tracking due to, for example, front/back and left/right ambiguity [25], [2], [23]. They are also expected to arise in the case of 2D tracking due to the aperture problem [10]. However, this choice is not restrictive, and any linear [32] or nonlinear regression method [8] could be used as an alternative. More generally, any method that can deliver a prediction of the state x given an observation y can be used. In the case of 2D tracking, the Lucas-Kanade method [16] could be used to make an estimate of the 2D target location x by delivering an estimate of the displacement vector with respect to the position at which the observation y(r) was extracted. Similarly, the method of Simoncelli et al. [24] could deliver a probabilistic prediction (a Gaussian) for x.

The (Hierarchical) Mixture of Experts, which was introduced by Jordan and Jacobs [14], is a method for regression and classification that relies on a soft probabilistic partitioning of the input space. This partitioning is determined by gating coefficients g_i(y) (one for each expert i) that are input dependent and have a probabilistic interpretation; that is, the coefficients of the siblings at each level of the hierarchy sum up to one. The prediction of each expert i is then moderated by the corresponding gating coefficient. Formally, in the simple case of a flat hierarchy for regression,

    p(x | y) = Σ_{i=1}^{K} g_i(y) f_i(x | y),        (7)

where f_i(x | y) is a probability density function, usually a Gaussian centered around the prediction of expert i. In the simple case that linear experts are used,

    g_i(y) = exp(β_i^T y) / Σ_j exp(β_j^T y),        (8)

and

    f_i(x | y) = N(w_i^T y, S_i),        (9)

where the w_i and β_i are the unknowns to be estimated. Jordan and Jacobs [14] proposed a Maximum Likelihood method for the estimation of w_i and β_i, while in [29], a Bayesian approach is used. We adopt the approach in [29], in which a set of hyperparameters models the prior distributions of w_i and β_i,


and follow a variational approach for the estimation of their posterior distributions. As in [29], we make a Laplace approximation and estimate the mode and the variance of the posteriors, which (with a slight abuse of notation) we denote here as (w_i, Σ_{w_i}) and (β_i, Σ_{β_i}). In the process, we also estimate the optimal value of the hyperparameters that are associated with the noise covariance S_i of the prediction of expert i (see [29] for details). In [29], a procedure is described for scalar regression. In the case where the target is a vector x with dimensionality D, we may train D different Mixtures of Experts. Here, we have extended the methodology to experts that have multidimensional output (i.e., f_i(x | y) is a multidimensional Gaussian with diagonal noise covariance S_i). In this case, w_i is a matrix with the number of rows equal to the dimensionality of y and the number of columns equal to the dimensionality of x. For prediction, we marginalize over the parameters and hyperparameters as in [29]. For a new observation y, the predictive distribution is a mixture of Gaussians given by

    \hat{p}(x | y) = Σ_{i=1}^{K} g_i(y) N(w_i^T y, S'_i),        (10)

where the kth element of the diagonal covariance matrix S'_i, denoted with S'_ik, is given by

    S'_ik = y^T Σ_{w_ik} y + S_ik,        (11)

where S_ik is the corresponding element in the covariance matrix of the ith expert. Alternatively, we may straightforwardly use (7), or use only the prediction of the expert with the highest gating coefficient g_i, or approximate (10) with a single Gaussian.

For the problem of 2D visual tracking, we aim to estimate the transformation x (e.g., translation, rotation, and scaling) that a visual target undergoes in an image sequence. We train the BME in a supervised way with pairs (y(x), x) in which the observations y(x) are produced by synthetically transforming (e.g., translating) the visual target with the transformation x. In this case, we choose to ignore the state x^- at the previous frame when training the BME. Subsequently, in the test phase, an observation will give a probabilistic prediction according to (10).
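A minimal sketch of how a trained flat Mixture of Experts turns an observation vector into the predictive mixture of (7)-(10): softmax gates over linear gate parameters, linear experts for the means. The parameter arrays (beta, W, S) are random placeholders standing in for quantities the variational training of [29] would estimate, and the predictive variances are taken directly from the per-expert noise variances rather than from (11).

```python
import numpy as np

rng = np.random.default_rng(1)
D_y, D_x, K = 16, 2, 4            # observation dim, state dim, number of experts

# Placeholder parameters (in the paper these come from Bayesian training).
beta = rng.normal(size=(K, D_y))          # gate parameters, one row per expert
W = rng.normal(size=(K, D_y, D_x)) * 0.1  # linear expert weights
S = np.ones((K, D_x)) * 0.5               # diagonal noise variances per expert

def bme_predict(y):
    """Return the gating weights, means, and variances of the predictive mixture."""
    scores = beta @ y
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()                        # softmax gating, eq. (8)
    means = np.einsum('kdx,d->kx', W, y)        # expert means w_i^T y, eq. (9)
    return gates, means, S

y = rng.normal(size=D_y)
gates, means, variances = bme_predict(y)
print("expected state:", gates @ means)         # mean of the mixture of Gaussians
```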

2.2 Data Relevance Determination

For the determination of the relevance/reliability p(z | y, r) of an observation y(r), we use a classification scheme based on RVMs. The goal is to obtain an a priori assessment of whether the probabilistic prediction p(x | y(r)) (10) of the state of the target is expected to be good. To this end, we train an RVM classifier in a supervised way with a set of positive examples that yield good predictions and with a set of negative examples that yield bad predictions. Let us denote with sigm the sigmoid function, with {ỹ_i} the training set of the classifier, and with κ(y_i, y_j) a kernel function (e.g., a Gaussian or a linear one). Then, after training and when presented with a novel observation y(r), the RVM yields a prediction of the relevance of the observation y(r) as

    p(z = 1 | y(r)) = sigm( Σ_i w_i^{rvm} κ(y(r), ỹ_i) ),        (12)

Fig. 4. Positions r of the vectors selected by the RVM-based classifier. The inner (outer) window illustrates the range from which the training set of the BME-based predictor (RVM-based classifier) is constructed.

where w^{rvm} is a sparse weight vector that is learned in the training phase.

The training set {ỹ(r)} is constructed as follows: A candidate observation ỹ(r) is generated by artificially transforming (e.g., translating) the visual target with a transformation which we denote here with r. Then, for each of the candidate observations, a probabilistic prediction is made using (10). We place in the set of positive examples the candidate observations for which an appropriate norm of the difference between the true transformation r and the expected value (i.e., the mean) of the prediction p(x | y(r)) is less than a threshold. That is,

    || r − E_{p(x | y(r))}(x) || < θ_r.        (13)

As p(x | y(r)) is a mixture of Gaussians, the expected value of x in the above equation can be obtained in closed form. Alternative schemes for constructing the positive training set, such as thresholding the distance between the true transformation r and the mode of p(x | y(r)), or thresholding the probability of the ground-truth transformation r (i.e., p(r | y(r)) > θ_r), are also possible. The set of negative examples is composed of the observations for which (13) is not satisfied. Other examples, such as observations from regions in the background, could also be added to the negative training set. Clearly, the transformations r that generate the candidate training set need to explore larger parts of the state space than the ones used to construct the training set of the BME. In Fig. 4, we illustrate the 2D case in which y(r) is an observation taken at a 2D window around each position r. Superimposed on the original frame, we depict the range from which r is taken for constructing: 1) the training set of the BME-based predictor (inner window) and 2) the training set of the RVM-based classifier (outer window). In the same figure, we superimpose the positions r of the vectors that have been chosen by the RVM classifier. In Fig. 5, for the toy example that we used in Fig. 1, we illustrate the true prediction error of a BME with eight experts trained to predict 2D displacements in the interval [-11, 11]. Also shown is the corresponding 2D plot of p(z | y, r). Note that we test with observations that result from displacements both inside and outside the training interval. The RVM has been trained on positive examples that have been selected by thresholding the L1 error norm. It is clear that we can predict reasonably well which observations are associated with a low prediction error. Note that not all observations that fall outside the training range give high prediction errors. This indicates that, in this case, the BME is capable of extrapolating.
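A hedged sketch of the two pieces described above: labeling candidate observations via the error test of (13), and scoring a new observation with the kernel expansion of (12). The helpers build_relevance_labels, rvm_relevance, observe, predict_mean, theta_r, and gamma are illustrative assumptions; actually obtaining sparse weights would require RVM training, which is not shown.

```python
import numpy as np

def build_relevance_labels(transforms, observe, predict_mean, theta_r=2.0):
    """Label each candidate transformation r as relevant (1) or not (0), eq. (13)."""
    X, labels = [], []
    for r in transforms:
        y_r = observe(r)                          # observation after transforming by r
        err = np.linalg.norm(r - predict_mean(y_r))
        X.append(y_r)
        labels.append(1 if err < theta_r else 0)
    return np.array(X), np.array(labels)

def rvm_relevance(y, support_obs, weights, bias=0.0, gamma=0.05):
    """Score p(z=1 | y) with a Gaussian-kernel expansion, eq. (12)."""
    k = np.exp(-gamma * np.sum((support_obs - y) ** 2, axis=1))
    return 1.0 / (1.0 + np.exp(-(weights @ k + bias)))

# Toy usage with placeholder components.
rng = np.random.default_rng(2)
observe = lambda r: np.concatenate([r, rng.normal(scale=0.1, size=6)])
predict_mean = lambda y: np.clip(y[:2], -11, 11)  # pretend the predictor saturates outside its range
transforms = rng.uniform(-22, 22, size=(200, 2))  # explore beyond the BME training range
X, labels = build_relevance_labels(transforms, observe, predict_mean)
print("fraction labeled relevant:", labels.mean())

support_obs, weights = X[:10], rng.normal(size=10)  # placeholders for learned RVM weights
print("relevance of first candidate:", rvm_relevance(X[0], support_obs, weights))
```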


Fig. 5. (a) Prediction error and (b) p(z | y, r) as functions of the true displacement. Here, a BME was trained to predict displacements in the interval [-11, 11] for a template of size 11 × 11. An RVM was trained to classify as "relevant" observations in the interval [-22, 22] that can deliver accurate (+/- 2 pixels) predictions of the true displacement.

2.3 Classifier Selection

The classification scheme of Section 2.2 depends on the choice of the parameter θ_r in (13) and on a number of internal parameters that need to be set. The former can be interpreted as the desired level of prediction accuracy. Its selection has direct implications for the complexity and accuracy of the classifier, since certain levels of prediction accuracy might be difficult or impossible to achieve. The first consideration is the ability of the regression scheme to learn sufficiently well the true posterior p(x | z = 1, y, x^-, r), that is, to learn a pdf with most of its mass concentrated around the true target state. This depends on the size of the template, the range of x in the training set, and a number of parameters (such as the number of linear predictors and the data y itself). A second consideration is that the classification problem at a certain level of accuracy (i.e., for a certain θ_r) might be very difficult to solve, while, for a different value of θ_r, it might be considerably easier. A small θ_r can lead to an empty positive training set, while a large value of θ_r can lead to low accuracy. The threshold θ_r and the remaining parameters, such as the internal parameters of the classifier, can be selected in a cross-validation scheme.

In what follows, we propose an alternative way of determining the classification scheme. The main idea is that, ideally, the probabilistic classifier should rank the observations according to their prediction accuracy. In other words, the rank order of 1 − p(z = 1 | y, r) should coincide with the rank order of the error of the corresponding prediction. The divergence between the ideal ranking and the ranking obtained by a specific probabilistic classifier can therefore give a measure of the goodness of the classifier in question. Formally, let us assume that the observations are ranked according to some criterion. Let q vary between 0 and 1, and let e(q) be the mean error of the predictions of the observations that are in the upper fraction q according to the ranking used. Let e_o be the mean error under the "ideal" ranking, i.e., under the ranking according to the error itself, and e be the mean error under the ranking of a certain probabilistic classifier; that is, e(q) is the mean error of the top q fraction of observations ranked according to the learned p(z = 1 | y, r). We vary the threshold θ_r and the width of the kernel of the RVM classifiers and, in Fig. 6, we depict e_o(q) and e(q) for some of them; the differences in performance are clearly visible.
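The ranking diagnostic is straightforward to compute. The sketch below is a minimal illustration with synthetic errors and scores, not the paper's evaluation code: it builds e(q) for a classifier's scores and for the ideal error-based ranking, which is the comparison shown in Fig. 6.

```python
import numpy as np

def mean_error_curve(errors, ranking_scores, fractions):
    """e(q): mean prediction error of the top-q fraction under a ranking.

    ranking_scores: higher means 'more relevant'; the ideal curve uses -errors.
    """
    order = np.argsort(-ranking_scores)          # best-ranked observations first
    sorted_err = errors[order]
    n = len(errors)
    return np.array([sorted_err[:max(1, int(q * n))].mean() for q in fractions])

rng = np.random.default_rng(3)
errors = rng.gamma(2.0, 1.5, size=1000)                    # synthetic prediction errors
scores = 1.0 / (1.0 + errors) + rng.normal(0, 0.1, 1000)   # noisy relevance scores

q = np.linspace(0.05, 1.0, 20)
e_ideal = mean_error_curve(errors, -errors, q)   # ranking by the error itself
e_clf = mean_error_curve(errors, scores, q)      # ranking by the classifier's score
for qi, eo, ec in zip(q[::5], e_ideal[::5], e_clf[::5]):
    print(f"q={qi:.2f}  e_o={eo:.2f}  e={ec:.2f}")
```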

Fig. 6. Mean error versus the fraction q of positives for different classifiers. The curve e_o corresponds to the ideal classifier.

We make the selection according to the χ² test between e_o and e, that is, we select a curve that lies on the lower right part of the plot. Such a classifier generates a large number of positives for a given error level, a property that is important in our estimation scheme, which might rely on only a few observations. As Fig. 6 reveals, a number of classifiers (solid lines) have similar ranking properties. Among classifiers with similar ranking properties (within a 5 percent margin), we favor the one that delivers, on average, the lower weighted error when the positives are weighted according to the probabilistic weighting scheme.

2.4 Relation to Generative and Discriminative Tracking

In this section, we make explicit the relation between the proposed tracking framework on the one hand, and both discriminative and generative tracking methods on the other. More specifically, we show that under certain modeling assumptions, we derive estimation schemes that are practically equivalent to classical generative and discriminative estimation schemes.

The relation to classical discriminative methods (e.g., [25]) is rather straightforward. In the case that a single observation y(r) is used and the data relevance λ = p(z = 1 | y, r) is set to one (i.e., the single observation is considered relevant/reliable), our framework reduces to the discriminative tracking framework of [25]. Formally, (1) can be derived from (3) when the following three modeling choices are made:

- First, p(r | x^-) = δ(r − x^-). That is, the auxiliary variable r that controls multiple observations coincides with x^-, and therefore, effectively is not used.
- Second, p(z = 1 | y, r) = 1 (and therefore, p(z = 0 | y, r) = 0). That is, the single observation that is utilized is considered relevant/reliable.
- Third, p(x | z, x^-, y, r) = p(x | x^-, y). That is, the prediction does not depend on the auxiliary variables z and r.

We now show that, under suitable modeling choices, the proposed framework reduces to an estimation scheme that is equivalent to particle filtering in the generative framework. To commence, we note that, using the Bayes rule, the posterior is given by

    p(x | Y) = [ p(y | x) / p(y | Y^-) ] p(x | Y^-)        (14)
             = [ p(y | x) / p(y | Y^-) ] ∫ dx^- p(x | x^-) p(x^- | Y^-).        (15)

In the generative framework, a number of candidate solutions r_i are sampled from p(x | Y^-) and they are subsequently weighted using the likelihood p(y | r_i). (More generally, candidate solutions r ∈ {r_1, ..., r_R} are sampled from a proposal distribution g(r) and are subsequently weighted by p(y | r) p(r | Y^-) / g(r).) If we denote with λ(y, r) the weight that is assigned to sample r, then an approximation of the posterior is given by (2). Recall that in our framework, the approximation of the posterior is given by (4). Also, observe the relation between (2) and (4): In the generative case, the mass of the posterior is on the samples r_i, while, in the discriminative case, the mass of the posterior is on the predictions p(x | z = 1, x^-, y, r_i) (let us for the moment ignore the "outlier predictions" p(x | z = 0, x^-, y, r_i)). Therefore, if we choose the predictors such that p(x | z = 1, x^-, y, r_i) is equal to δ(x − r), and let r be sampled from p(x | Y^-), then the estimation schemes of the two frameworks will be equivalent. There are just two differences between the methods. First, in our framework, once a sample r is obtained in this way, it should be assigned a weight λ equal to p(z = 1 | r, y). By contrast, in the generative framework, the sample r should be assigned a weight equal to p(y | r) / p(y | Y^-), or a weight equal to p(y | r), since the term p(y | Y^-) is independent of r and is therefore canceled in the normalization. Second, in our case, the prediction of the outliers (i.e., observations that are irrelevant/unreliable) is made explicit in the form of p(x | z = 0, x^-, y, r_i). Formally, under the modeling choice p(x | z = 1, x^-, y, r) = δ(x − r), (3) becomes

    p(x | Y)        (16)
      = ∫ dx^- p(x^- | Y^-) ∫ dr p(r | x^-) ∫ dz p(x | z, x^-, y, r) p(z | y, r)        (17)
      = p(z = 1 | y, x) ∫ dx^- p(x | x^-) p(x^- | Y^-)
          + ∫ dx^- p(x^- | Y^-) ∫ dr p(r | x^-) p(x | z = 0, x^-, y, r) p(z = 0 | y, r)        (18)
      = p(z = 1 | y, x) p(x | Y^-)
          + ∫ dx^- p(x^- | Y^-) ∫ dr p(r | x^-) p(x | z = 0, x^-, y, r) p(z = 0 | y, r).        (19)

Let us now comment on the main differences between (18) and the filtering equation in the generative framework (15). The first difference is that instead of using the likelihood p(y | x), our framework uses the term p(z = 1 | y, x) in order to weight samples r that are sampled from p(x | Y^-). In our

framework, the variable z takes the interpretation of the class of the observation y, and p(z = 1 | y, x) is the probability that the observation y belongs to the target class. Essentially, a classification scheme is used within the particle filtering framework. This is similar to other works in classification-based tracking [3], [7] and to more classical target detectors [27]. Therefore, our scheme offers a formal framework in which classification-based approaches can be used for recursive estimation of the posterior density. The second main difference between the degenerate case of our scheme and generative particle filtering tracking (15) is that, in our case, the prediction of the outliers becomes explicit in the form of p(x | z = 0, x^-, y, r_i), that is, the second term of (18). This bears similarities to generative methods that introduce an occlusion process and condition the likelihood on it. In [31], two likelihood models are defined, one given that the target is occluded and one given that the target is visible. However, while in [31] the occlusion state is inferred, in our case the relevance of the observation is determined by the probabilistic classifier. Also, note that in the general case of our method, the relevance of an observation is not necessarily related to the degree of occlusion but rather to the degree to which a reliable prediction can be obtained from the observation in question. Three possible choices for the prediction given that the observation is irrelevant/unreliable are the following. The first is to use a Gaussian with large variance around x^-. A second choice is to use a uniform distribution.
