Bittracker – a Bitmap Tracker for Visual Tracking under Very General Conditions Ido Leichter, Student Member, IEEE, Michael Lindenbaum, Member, IEEE, and Ehud Rivlin, Member, IEEE
Abstract— This paper addresses the problem of visual tracking under very general conditions: a possibly non-rigid target whose appearance may drastically change over time; general camera motion; a 3D scene; and no a priori information except initialization. This is in contrast to the vast majority of trackers which rely on some limited model in which, for example, the target’s appearance is known a priori or restricted, the scene is planar, or a pan tilt zoom camera is used. Their goal is to achieve speed and robustness, but their limited context may cause them to fail in the more general case. The proposed tracker works by approximating, in each frame, a PDF (probability distribution function) of the target’s bitmap and then estimating the maximum a posteriori bitmap. The PDF is marginalized over all possible motions per pixel, thus avoiding the stage in which optical flow is determined. This is an advantage over other general-context trackers that do not use the motion cue at all or rely on the error-prone calculation of optical flow. Using a Gibbs distribution with respect to the first-order neighborhood system yields a bitmap PDF whose maximization may be transformed into that of a quadratic pseudo-Boolean function, the maximum of which is approximated via a reduction to a maximum-flow problem. Many experiments were conducted to demonstrate that the tracker is able to track under the aforementioned general context.
Index Terms— Tracking, Motion, Pixel classification.

The authors are with the Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel. E-mail: {idol,mic,ehudr}@cs.technion.ac.il.

I. INTRODUCTION

This paper is concerned with visual tracking under the very general conditions of a possibly non-rigid target whose appearance may change drastically over time, a 3D scene, general camera motion, and no a priori information regarding the target or the scene except for the target's bitmap in the first frame, used to initialize the tracker. Much research has been done in the field of visual tracking, bearing fruit to an abundance of visual trackers, but most trackers were not intended for such a general context. The majority of existing trackers, in order to reduce computational load and enhance robustness, are restricted to some a priori known context. These trackers use some (possibly updatable [14]) appearance model or shape model for the tracked object in the images, and they track in a low-dimensional state space of the target's parameters (e.g., via CONDENSATION [12]). For example, even in [34], in which the tracker is known to be very robust to deformations and changes in appearance, two assumptions are made: that the color histogram of the target does not change very much over time and that the changes in the target's 2D shape in the image may be approximated by scale alone. As long as the target obeys this model in terms of appearance and shape, the tracker will be robust. However, once the target ceases to obey the model, the tracking is likely to fail without recovery. This might be caused, for example, by heavy partial occlusions that cause drastic changes in the target's histogram, or by shape deformations that are poorly modeled by scale alone.

The seminal results by Julesz have shown that humans are able to visually track objects merely by clustering regions of similar motion [16]. As an extreme case, consider images (a) and (b) in Figure 1. These two images constitute a consecutive pair of images in a video displaying a random-dot object moving in front of a random-dot background that is moving as well. Since the patterns on the object, on the background, and around the object's enclosing contour are all alike, the object is indistinguishable from the background for an observer who is exposed to these images at nonconsecutive times. However, if these two images are presented one after the other in the same place, as in a video, the observer is able to extract the object in the two images, shown in (c) and (d).

Fig. 1. (a) and (b) constitute a consecutive pair of images in a video of a random-dot object moving in front of a random-dot background that is moving as well. The object in the two images is shown in (c) and (d).

On the basis of this observation, we propose an algorithm (Bittracker) for visual tracking that relies only on three basic visual characteristics:
• (Short-term) Constancy of Color – The color projected to the camera from a point on a surface is approximately similar in consecutive frames;
• Spatial Motion Continuity – The optical flow in the image region corresponding to an object is spatially piecewise-continuous. That is, the optical flow of the vast majority of the pixels in this area is spatially continuous;
• Spatial Color Coherence – It is highly probable that adjacent pixels of similar color belong to the same object.
The first two characteristics usually hold under a sufficiently high frame rate, and the third holds for natural images. In order to track non-rigid objects of general shape and motion without prior knowledge of their shape, the tracker proposed in
this paper uses the state space of bitmaps to classify whether each pixel in the image belongs to the target (hence the name Bittracker). Note that this state space is even more general than the state space of non-parametric contours, since the former may also accommodate holes in the target (although level set methods also allow topology changes). As no specific target-related, scene-related or camera motion-related assumptions are made, the resulting tracker is suitable for tracking under the aforementioned very general conditions. Moreover, a detailed state of the target is provided, making Bittracker suitable for tasks where precise target characterization is required. For example, it may be used to assist in the process of alpha matting by providing an estimate of the desired foreground object in the frames, so that only refinements of the object mask need to be performed manually. Note, however, that a lower-dimensional target state (e.g., a bounding box) may be easily extracted.

The tracker works by estimating in each frame the MAP (maximum a posteriori) bitmap of the target. The PDF (probability distribution function) of the target's bitmap in the current frame is conditional on the current and previous frames, as well as on the bitmap in the previous frame. A lossless decomposition of the information in the image into color information and pixel location information allows color and motion to be treated separately and systematically for the construction of the PDF. One important advantage of the proposed tracker is that the target's bitmap PDF is marginalized over all possible motions per pixel. This is in contrast to other general-context trackers, which cling to a sole optical flow hypothesis. These trackers perform optical flow estimation – which is prone to error and is actually a harder, more general problem than the mere tracking of an object – or do not use the motion cue at all. Another advantage of the proposed algorithm over other general-context trackers is that the target's bitmap PDF is formulated directly at the pixel level (rather than over image segments). Thus, the precursory confinement of the final solution to objects composed of preliminarily-computed image segments is avoided.

The rest of this paper is organized as follows. Related work is reviewed in Section II. Section III outlines the proposed algorithm. Experimental results are given in Section IV, and Section V concludes the paper.

II. PREVIOUS WORK

Relevant previous work is mainly in the area of video segmentation. However, very few video segmentation algorithms are intended for the very general context discussed here. Most were developed in the context of a stationary camera (e.g., [21], [31], [43]) or under the assumption that the background has a global, parametric motion (e.g., affine [35] or projective [40], [42]). Recently, the last restriction was relaxed to a planar scene with parallax [17]. Other algorithms were constrained to track video objects modeled well by parametric shapes (e.g., active blobs [36]) or motion (e.g., translation [6], 2D rigid motion [40], affine [8], [32], projective [11], small 3D rigid motion [30] and normally distributed optical flow [18], [41]). These algorithms are suitable only for tracking rigid objects or specific preset types of deformations. The algorithm proposed in this paper, however, addresses the tracking of potentially non-rigid objects in 3D scenes from an arbitrarily moving camera, without prior knowledge other than the object's bitmap in the first frame.
There are algorithms that address video segmentation and successfully track objects under general conditions as an aftereffect. That is, they do not perform explicit tracking in the sense of estimating a current state conditional on the previous one or on the previous frames. For example, in [37] each set of a few (five) consecutive frames is spatiotemporally segmented without considering the previous results (other than saving calculations). In [24] each frame is segmented into object/background without considering previous frames or previous classifications. (Furthermore, the classification requires a training phase, upon which the classification is performed, prohibiting major changes in the target’s appearance.) In the contour tracking performed in [13], an active contour is run in each frame separately, while the only information taken from previous frames is the previously estimated contour for initialization in the current frame. In this paper, the state (target’s bitmap) is explicitly tracked by approximating a PDF of the current state, which is conditional on the previous state and frame, as well as on the current frame, and by estimating the MAP state. (Of course, conditioning the filtering distribution on the entire history of images would be more precise, but it seems that the very high dimensionality of the state space and the complexity of the state distribution render this impractical.) Optical flow is an important cue for visually tracking objects, especially under general conditions. Most video segmentation algorithms make a point estimate of the optical flow, usually prior to segmentation (e.g., [6], [8], [11], [18], [26], [27], [30], [33], [41], [42]) and sometimes in conjunction with it (e.g, [32]). An exception is the work in [28] and [29], where motion segmentation is applied to consecutive image pairs by using a tensor voting approach, where each pixel may be assigned multiple flow vectors of equal priority. Since optical flow estimation is prone to error, other algorithms avoid it altogether (e.g., [13], [24], [25], [39], [45]), but these algorithms tend to fail when the target is in proximity to areas of similar texture, and may erroneously classify newly appearing regions with different textures. This is shown in an example in [25], where occlusions and newly appearing areas are prohibited due to the modeling of image domain relations as bijections. Another exception to the optical flow point-estimation is [37], where a motion profile vector that captures the probability distribution of image velocity is computed per pixel, and motion similarity of neighboring pixels is approximated from the resemblance of their motion profiles. In the work here, the optical flow is neither estimated as a single hypothesis nor discarded, but the bitmap’s PDF is constructed through a marginalization over all possible pixel motions (under a maximal flow assumption). One class of video segmentation and tracking algorithms copes with general object shapes and motions in the context of an arbitrarily moving camera by tracking a nonparametric contour influenced by intensity/color edges (e.g., [39]) and motion edges (e.g., [27]). However, this kind of algorithm does not deal well with cluttered objects and partial occlusions, and may cling to irrelevant features in the face of color edges or additional moving edges in proximity to the tracked contour. Many video segmentation and tracking algorithms perform spatial segmentation of each frame as a preprocessing step. 
The resulting segments of homogeneous color/intensity are then used as atomic regions composing objects (e.g., [6], [8], [32]). These algorithms also assign a parametric motion per segment. Rather than confining the final solution in a preprocessing step and making assumptions regarding the type of motion the segments
undergo, the algorithm proposed here uses the aforementioned spatial color coherence characteristic and works directly at pixel level.

Two works closely related to this one are [5] and [19]. These two papers deal with foreground/background segmentation in videos. Although the camera in both is required to be stationary and the input to the algorithm in the second work is enhanced by a synchronous stereo pair of videos, the probabilistic framework in these papers is similar to our approach in several ways. These will be discussed later in the paper. Also relevant is the approach in [15] and [7], which, like here, included the estimation of generic pixel masks to describe the shapes of moving objects. Each object's appearance and mask are learned off-line in an unsupervised manner from an entire input sequence for the purpose of post-processing the input sequence itself. This approach is thus more suitable for batch analysis and editing of video shots. Less object variability is allowed in [15], where the color and transparency of each pixel in an object are modeled as being normally distributed. In [7], greater object variability is accounted for by modeling the template mean (color and transparency) as varying in a low-dimensional affine subspace, where the subspace coordinate is modeled as normally distributed as well. In both [15] and [7], only simple geometric transformations of the objects and the background are allowed. (In fact, only translation was assumed in the experiments.)

III. THE Bittracker ALGORITHM

A. Overview

Every pixel in an image may be classified as belonging or not belonging to some particular object of interest according to the object projected to the pixel's center point. Bittracker aims to classify each pixel in the movie frame as belonging to the target or not. Thus, the tracker's state space is the space of binary images, i.e., bitmaps. The tracker works by estimating the target's bitmap at time t, given the movie frames at times t and t − 1 and the estimate of the previous bitmap at time t − 1 (t = 0, 1, 2, . . .). Thus, after being initialized by the target's bitmap in the first frame (for t = 0), the tracker causally propagates the bitmap estimate in an iterative fashion. At each time t, the tracker approximates the PDF of the bitmap X_t,

P(X_t) = \Pr\left(X_t \mid I_{t-1}, I_t, X_{t-1}\right),    (1)
(I_t denotes the frame at time t), and then estimates the MAP bitmap by maximizing its PDF. The estimated MAP bitmap may then be used to estimate the target's bitmap at time t + 1 in the same way, and so forth. Note that the initializing bitmap X_0 need not be exact, as the target's bitmap may be self-corrected with time using the spatial motion continuity and spatial color coherence characteristics, which are incorporated in (1). When the bitmap tracking problem is formulated as in (1), the solution is targeted directly towards the sought bitmap. Thus, the commonly performed intermediate step of determining optical flow is avoided. This is an important advantage since computing optical flow is a harder, more general problem than estimating the bitmap (given the one in the previous frame).

B. The Bitmap's PDF

Modeling the bitmap's PDF (1) is very complex. In order to simplify it we consider a discrete image I not in the usual sense of a matrix representing the pixel colors at the corresponding coordinates, but rather as a set of pixels with indices p = 1, 2, . . . , |I| (|I| denotes the number of pixels in image I), each one having a particular color c^p and location l^p (coordinates). (The pixels are indexed arbitrarily, regardless of their location in the image. To remove any doubt, there is no connection between the indexing of the pixels in I_t and the indexing in I_{t-1}. Specifically, if a pixel of index p in I_t and a pixel of index p' in I_{t-1} are such that p = p', it does not imply that the two pixels are related by their colors or locations.) Taking this alternative view, a discrete image I may be decomposed into the pair I = (C, L), where C = \{c^p\}_{p=1}^{|I|} and L = \{l^p\}_{p=1}^{|I|}. Note that no information is lost because the image may be fully reconstructed from its decomposition. This enables us to decompose I_t into I_t = (C_t, L_t) = \left(\{c_t^p\}_{p=1}^{|I_t|}, \{l_t^p\}_{p=1}^{|I_t|}\right). The bitmap's PDF (1) may now be written as

P(X_t) = \Pr\left(X_t \mid I_{t-1}, C_t, L_t, X_{t-1}\right).    (2)
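The decomposition above is purely a change of representation. As a concrete illustration, the following minimal NumPy sketch (our own; not part of the paper's formulation) splits an RGB image into its color set C and location set L under an arbitrary pixel indexing and reconstructs the image losslessly:

```python
import numpy as np

def decompose(image):
    """Split an H x W x 3 image into colors C and locations L (arbitrary pixel order)."""
    h, w, _ = image.shape
    locations = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"),
                         axis=-1).reshape(-1, 2)
    order = np.random.permutation(h * w)      # pixel indexing is arbitrary
    C = image.reshape(-1, 3)[order]           # color c^p of pixel p
    L = locations[order]                      # location l^p of pixel p
    return C, L

def reconstruct(C, L, h, w):
    """No information is lost: the image is recovered exactly from (C, L)."""
    image = np.zeros((h, w, 3), dtype=C.dtype)
    image[L[:, 0], L[:, 1]] = C
    return image

img = np.random.rand(4, 6, 3)
C, L = decompose(img)
assert np.allclose(reconstruct(C, L, 4, 6), img)
```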
We denote the Boolean random variable representing the bitmap's value at pixel p in I_t by x_t^p, which may receive one of the following values:

x_t^p = \begin{cases} 1 & \text{pixel } p \text{ in } I_t \text{ belongs to the target,} \\ 0 & \text{otherwise.} \end{cases}    (3)
Note that the notation X_t is the abbreviation of \{x_t^p\}_{p=1}^{|I_t|}. The bitmap's PDF (2) is modeled as a Gibbs distribution [9] with a potential consisting of functions of up to two bitmap bits,

P(X_t) \propto \prod_{S \in 2^{\{1,\ldots,|I_t|\}},\; |S| \le 2} \exp\left\{ -V_S\left(\{x_t^p\}_{p \in S}\right) \right\},
globally conditioned on the variables in the right-hand side of the conditioning line in (2). Such modeling, consisting of a multiplicand per pixel and per unordered pixel pair, is neither practical nor necessary. This PDF is later restricted to include multiplicands for all individual pixels and for a small subset of all unordered pixel pairs. That is, the bitmap's PDF is considered as a conditional random field [22] with respect to some neighborhood system. To simplify the modeling, the bitmap's PDF is factorized into a product of two simpler functions:

P(X_t) \propto \underbrace{F_1(X_t; I_{t-1}, C_t, X_{t-1})}_{F_1(X_t)} \cdot \underbrace{F_2(X_t; L_t, I_{t-1}, C_t, X_{t-1})}_{F_2(X_t)}.    (4)

The first factor consists of all the multiplicands that are independent of the pixel locations L_t. Thus, it may be modeled using only the constancy of color characteristic, without consideration of the spatial color coherence and motion continuity characteristics. The second factor consists of the rest of the multiplicands and is modeled in light of the latter two visual characteristics. To emphasize that X_t is the variable to be estimated while the rest of the variables are fixed and known, we often denote the two functions in the right-hand side of (4) by F_1(X_t) and F_2(X_t). As will be seen in what follows, these two components are easier to model due to the separation of the color information from the location information.

1) Modeling F_1(X_t): Considering only L_t as observations, the first factor in (4), F_1(X_t; I_{t-1}, C_t, X_{t-1}), is a "prior-like" part of the target's bitmap PDF, which does not depend on the observations. With L_t not given, there is no information on
the motion from frame t − 1 to frame t, and no information on the relative position between the pixels in I_t. Under these circumstances, the dependency of the bitmap's bits on the pixels' colors is much stronger than the dependency between the bits themselves. That is, the decision as to whether a pixel belongs to the target can be made mainly by examining its color with respect to the colors of already-classified pixels in the previous frame. Therefore, using only the pixels' colors, it is reasonable to associate with the product F_1(X_t) only multiplicands that depend on individual bitmap bits, and we model each such multiplicand as its marginal PDF:

F_1(X_t) = \prod_{p=1}^{|I_t|} \underbrace{\Pr\left(x_t^p \mid I_{t-1}, C_t, X_{t-1}\right)}_{f_1(x_t^p)}.    (5)
In practice the optical flow may be such that a pixel in I_t does not exactly correspond to a single pixel in I_{t-1}. Yet in our model, the correspondence of a pixel p in I_t is either to some pixel p' in I_{t-1} (denoted by p → p') or to a surface that was not visible at time t − 1 (denoted p → none). Approximating the optical flow by integer shifts in both axes is common in visual tracking and segmentation applications (e.g., [37]). Now, the PDF of a single bit x_t^p inside the product in (5) may be marginalized over all the potential correspondences of I_t's pixel p to pixels p' in I_{t-1}, including the event of its correspondence to none:

f_1(x_t^p) = \sum_{p' \in N_{t-1} \cup \{\text{none}\}} \Pr\left(x_t^p, p \rightarrow p' \mid I_{t-1}, C_t, X_{t-1}\right),    (6)
where N_t denotes the set \{1, 2, \ldots, |I_t|\}. Note that any hard decision about the optical flow is avoided when this marginalization is applied. Note that this marginalization is similar to the marginalization over all possible disparities along epipolar lines performed in [19]. We model the color of a pixel p in I_t as normally distributed with mean equal to the color of the corresponding pixel p' in I_{t-1}, or as uniformly distributed for pixels corresponding to none. This yields (after a detailed derivation described in Appendix I)

f_1(x_t^p) \propto (1 - P_{\text{none}}) \sum_{p' \in N_{t-1}} \Pr\left(x_t^p \mid p \rightarrow p', x_{t-1}^{p'}\right) \cdot \frac{1}{|I_{t-1}|} \cdot \mathcal{N}_{c_{t-1}^{p'}, C}(c_t^p) + P_{\text{none}} \cdot \Pr\left(x_t^p \mid p \rightarrow \text{none}\right) \cdot U(c_t^p),    (7)
where \mathcal{N}_{\mu,C} is the normal PDF of mean μ and covariance matrix C (C is set to an identity matrix scaled by a variance reflecting the degree of color similarity assumed in the constancy of color characteristic), and U is the uniform PDF on the color space (RGB in our implementation). P_none is a preset constant that estimates the prior probability of having no corresponding pixel in the previous frame. (P_none is typically set to 0.1, but as explained in Appendix I, it has only minor influence on the tracker.) We see that f_1(x_t^p) may be viewed as a mixture distribution with a component for having a corresponding pixel in the previous frame (with weight 1 − P_none) and a component for having no corresponding pixel (with weight P_none). \Pr(x_t^p \mid p → p', x_{t-1}^{p'}) is the probability distribution of the bitmap's bit at a pixel p when its corresponding pixel in the previous frame, along with its estimated classification bit, are known. Since the MAP bitmap estimated for the previous frame
may contain errors, we set this PDF to

\Pr\left(x_t^p \mid p \rightarrow p', x_{t-1}^{p'}\right) = \begin{cases} P_{\text{correct}} & x_t^p = x_{t-1}^{p'}, \\ 1 - P_{\text{correct}} & x_t^p \ne x_{t-1}^{p'}, \end{cases} \qquad p' \in N_{t-1},    (8)
where P_correct is a preset constant as well. P_correct, which is typically set to 0.9, approximates the prior probability of the estimated bitmap being correct for a pixel. Note that this parameter is equivalent to 1 − ν in [5], as there ν is the prior probability that a pixel is incorrectly classified. \Pr(x_t^p \mid p → none) is the prior probability distribution of the bitmap's bit at a pixel p with no corresponding pixel in the previous frame. This probability distribution is set to

\Pr\left(x_t^p \mid p \rightarrow \text{none}\right) = \begin{cases} P_{\text{target}} & x_t^p = 1, \\ 1 - P_{\text{target}} & x_t^p = 0, \end{cases}    (9)
where Ptarget is another preset constant that approximates the prior probability of a pixel, with no corresponding pixel in the previous frame, to belong to the target. (This constant is typically set to 0.4). While the location information Lt is not used at all for deriving (7) (as the conditioning is on Ct only), in practice we calculate (7) with two modifications, using pixel location information in a limited way: First, instead of evaluating pixel correspondences by merely comparing the candidate pixel themselves, as is realized by the Gaussian component in (7), we compare small image patches (5 pixels in diameter) centered around the candidate pixels. This is accomplished by modifying the normal and uniform PDFs in Equation (7) to products of the color PDFs of the pixels in the patches (see Appendix I for details). This is done in order to make the pixel correspondence distributions less equivocal.(While comparing single pixels might be satisfactory, we found that comparing image patches yielded better results.) Second, we restrict the maximal size of optical flow to M pixels (in our implementation M = 6), and thus compare only image patches that are distanced at most by M and sum over these correspondences only (137 potential correspondences per pixel), which reduces the number of computations. We would like to note that, in contrast to an actual optical flow computation consisting of an intermediate step where the patch around each pixel in one image is compared to all nearby patches in the other image (e.g., as in [44] and [38]), the remainder of the optical flow computations are avoided when the marginalization method is used. 2) Modeling F2 (Xt ): Considering only Lt as observations, the second factor in (4), F2 (Xt ; Lt , It−1 , Ct , Xt−1 ), is an “observation likelihood-like” part of the target’s bitmap PDF, where the pixels’ colors, as well as the previous frame with its corresponding bitmap, are known. Note that unless the image is observed, the pixel locations are random variables with unknown values. (The pixel locations should not be confused with the pixel indices, which are arbitrarily and deterministically ordered.) Given It−1 and Ct , PDFs of pixel correspondences between It and It−1 are induced (as in F1 (Xt )). On the basis of these correspondence PDFs, Lt induces PDFs of optical flow between these two frames. By the spatial motion continuity characteristic, for an adjacent pair of pixels in a region belonging to a single object (where the optical flow is spatially continuous), the discrete optical flow is likely to be the same, and for an adjacent pair of pixels belonging
to different objects, it is likely to differ. Thus, the likelihood of an unequal bit-assignment to similarly-moving adjacent pixels should be much lower than an equal bit-assignment, and vice versa for differently-moving adjacent pixels. By the spatial color coherence characteristic, the likelihood of an equal bit-assignment to similarly-colored adjacent pixels should be much higher than an unequal bit-assignment. Taking this view and noting that L_t determines pixel adjacency in I_t and pixel motion from time t − 1 to time t, we associate with F_2(X_t) multiplicands that depend on pairs of bitmap bits and assume interaction only between bits of pixels that turn out to be adjacent:

F_2(X_t) \propto \prod_{\substack{\text{unordered pairs } p_1, p_2 \in N_t \\ \text{of adjacent pixels in } I_t}} f_2\left(x_t^{p_1}, x_t^{p_2}\right).    (10)

This implies that all the other multiplicands of bit pairs are modeled as constants. We model multiplicands related to adjacent pixels using modeled probabilities of pixel adjacencies and coordinate differences:

f_2\left(x_t^{p_1}, x_t^{p_2}\right) = \underbrace{\Pr\left(\text{adj}(p_1, p_2) \mid x_t^{p_1}, x_t^{p_2}, c_t^{p_1}, c_t^{p_2}\right)}_{f_{\text{adj}}(x_t^{p_1}, x_t^{p_2})} \times \underbrace{\Pr\left(\Delta_t(p_1, p_2) \mid \text{adj}(p_1, p_2), x_t^{p_1}, x_t^{p_2}, I_{t-1}, C_t, X_{t-1}\right)}_{f_{\Delta}(x_t^{p_1}, x_t^{p_2})},    (11)
where \Delta_t(p_1, p_2) = l_t^{p_1} - l_t^{p_2} and adj(p_1, p_2) is the event of pixels p_1 and p_2 being adjacent (\|l_t^{p_1} - l_t^{p_2}\|_2 = 1). Thus, the bitmap's PDF is reduced to a Gibbs distribution with respect to the first-order neighborhood system [9].

We shall begin with the first multiplicand in the right-hand side of (11). By Bayes' rule,

f_{\text{adj}}\left(x_t^{p_1}, x_t^{p_2}\right) = p\left(c_t^{p_1}, c_t^{p_2} \mid x_t^{p_1}, x_t^{p_2}, \text{adj}(p_1, p_2)\right) \times \frac{\Pr\left(\text{adj}(p_1, p_2) \mid x_t^{p_1}, x_t^{p_2}\right)}{p\left(c_t^{p_1}, c_t^{p_2} \mid x_t^{p_1}, x_t^{p_2}\right)}.    (12)

We assume no prior information on the object shape and on the object/non-object color distribution. Therefore, the influence of the bitmap bits on f_adj(x_t^{p_1}, x_t^{p_2}) is dominated by the first multiplicand, and thus we approximate

f_{\text{adj}}\left(x_t^{p_1}, x_t^{p_2}\right) \propto p\left(c_t^{p_1}, c_t^{p_2} \mid x_t^{p_1}, x_t^{p_2}, \text{adj}(p_1, p_2)\right).    (13)

Applying the chain rule yields

f_{\text{adj}}\left(x_t^{p_1}, x_t^{p_2}\right) \propto p\left(c_t^{p_1} \mid x_t^{p_1}, x_t^{p_2}, \text{adj}(p_1, p_2)\right) \times p\left(c_t^{p_2} \mid c_t^{p_1}, x_t^{p_1}, x_t^{p_2}, \text{adj}(p_1, p_2)\right).    (14)

The first multiplicand on the right-hand side does not depend on the bitmap bits, which leaves only the second multiplicand, which we model as

f_{\text{adj}}\left(x_t^{p_1}, x_t^{p_2}\right) \propto \begin{cases} U\left(c_t^{p_2}\right) + \mathcal{N}_{c_t^{p_1}, C_{\text{adj}}}\left(c_t^{p_2}\right) & x_t^{p_1} = x_t^{p_2}, \\ U\left(c_t^{p_2}\right) & x_t^{p_1} \ne x_t^{p_2}. \end{cases}    (15)

This corresponds to modeling the colors of adjacent pixels as uniformly and independently distributed in the case that they belong to different objects. If these pixels belong to the same object, their color distribution is a mixture of a uniform distribution (corresponding to the case of belonging to different color segments) and a Gaussian in their color difference (corresponding to the case of belonging to the same segment of homogeneous color). C_adj is assigned an identity matrix scaled by a very small variance, reflecting the variance of the color differences between adjacent pixels belonging to a surface of homogeneous color. (In our implementation the variance was set to 0.01 for each RGB color channel, where the range of each color is [0,1].) We see that for differently-colored adjacent pixels the likelihood is approximately similar for equal and unequal bit-assignments, and for similarly-colored adjacent pixels the likelihood is much higher for equal bit-assignments, which is in keeping with the spatial color coherence characteristic. Equation (15) may be used to compute the four likelihoods \{f_{\text{adj}}(x_t^{p_1} = b_1, x_t^{p_2} = b_2)\}_{b_1, b_2 \in \{0,1\}} (up to a scaling, which is unimportant). Note that the term f_adj(x_t^{p_1}, x_t^{p_2}) causes a bias towards pixel class transitions where the color gradient is high. This effect is similar to the effect of certain corresponding terms in the energy functions in [5] and [19].

We turn now to the second multiplicand on the right-hand side of (11), f_Δ(x_t^{p_1}, x_t^{p_2}). After a detailed derivation, which is given in Appendix II,

f_{\Delta}\left(x_t^{p_1}, x_t^{p_2}\right) = \begin{cases} P_{\text{flow}_1} \cdot S_1\left(x_t^{p_1}, x_t^{p_2}; p_1, p_2\right) + (1 - P_{\text{flow}_1}) \cdot S_2\left(x_t^{p_1}, x_t^{p_2}; p_1, p_2\right) + 0.25 \cdot S_3\left(x_t^{p_1}, x_t^{p_2}; p_1, p_2\right), & x_t^{p_1} = x_t^{p_2}, \\ (1 - P_{\text{flow}_2}) \cdot S_1\left(x_t^{p_1}, x_t^{p_2}; p_1, p_2\right) + P_{\text{flow}_2} \cdot S_2\left(x_t^{p_1}, x_t^{p_2}; p_1, p_2\right) + 0.25 \cdot S_3\left(x_t^{p_1}, x_t^{p_2}; p_1, p_2\right), & x_t^{p_1} \ne x_t^{p_2}, \end{cases}    (16)
where S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2) is the probability that I_t's pixels p_1 and p_2 have identical discrete optical flows, S_2(x_t^{p_1}, x_t^{p_2}; p_1, p_2) is the probability that they have different discrete optical flows, and S_3(x_t^{p_1}, x_t^{p_2}; p_1, p_2) is the probability that at least one of the two pixels has no corresponding pixel in the previous frame (and thus has no optical flow). All these probabilities are conditional on 1) the two pixels' classification bits; 2) C_t; and 3) the previous frame along with its estimated bitmap. (See Appendix II for the method used to estimate these probabilities.) P_flow1 is a predefined constant approximating the prior probability that two equally classified, adjacent pixels have similar discrete optical flows (given that the corresponding pixels exist). P_flow2 is another predefined constant approximating the prior probability that two unequally classified, adjacent pixels have different discrete optical flows. Both constants are typically assigned 0.99. Examining (16), we see that the higher the probability of identical discrete optical flows, the higher the likelihood for x_t^{p_1} = x_t^{p_2}, and vice versa for the probability of different discrete optical flows, conforming to the spatial motion continuity characteristic. When at least one of the pixels has no corresponding pixel in the previous frame, there is no preference for any bit assignments, since the optical flow is undefined.

3) The Final Bitmap PDF: The multiplicands in (5) and in (10) may be written as

f_1(x_t^p) = c_1(p, t)\, x_t^p + c_2(p, t),
f_2\left(x_t^{p_1}, x_t^{p_2}\right) = c_3(p_1, p_2, t)\, x_t^{p_1} x_t^{p_2} + c_4(p_1, p_2, t)\, x_t^{p_1} + c_5(p_1, p_2, t)\, x_t^{p_2} + c_6(p_1, p_2, t),    (17)

where

c_1(p, t) = f_1(1) - f_1(0),
c_2(p, t) = f_1(0),
c_3(p_1, p_2, t) = f_2(1, 1) - f_2(1, 0) - f_2(0, 1) + f_2(0, 0),
c_4(p_1, p_2, t) = f_2(1, 0) - f_2(0, 0),
c_5(p_1, p_2, t) = f_2(0, 1) - f_2(0, 0),
c_6(p_1, p_2, t) = f_2(0, 0).    (18)
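As a sanity check on (17)–(18), the following short Python sketch (an illustration of ours, not code from the paper) computes the coefficients from the two values of f_1 and the four values of f_2 and verifies that the polynomial form reproduces them for every bit assignment:

```python
def unary_coeffs(f1_0, f1_1):
    """Coefficients of f1(x) = c1*x + c2, from its two values (eq. 18)."""
    return f1_1 - f1_0, f1_0                          # c1, c2

def pairwise_coeffs(f2):
    """Coefficients of f2(x1, x2) as a quadratic pseudo-Boolean term (eq. 18).
    f2 is a dict {(b1, b2): value} over the four bit assignments."""
    c3 = f2[1, 1] - f2[1, 0] - f2[0, 1] + f2[0, 0]
    c4 = f2[1, 0] - f2[0, 0]
    c5 = f2[0, 1] - f2[0, 0]
    c6 = f2[0, 0]
    return c3, c4, c5, c6

# Example with arbitrary (made-up) likelihood values.
f2 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.5}
c3, c4, c5, c6 = pairwise_coeffs(f2)
for b1 in (0, 1):
    for b2 in (0, 1):
        assert abs(c3 * b1 * b2 + c4 * b1 + c5 * b2 + c6 - f2[b1, b2]) < 1e-12
```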
Substituting (17) into (5) and (10), the bitmap's PDF (4) is finally

P(X_t) \propto \prod_{p=1}^{|I_t|} \left[ c_1(p, t)\, x_t^p + c_2(p, t) \right] \times \prod_{\substack{\text{unordered pairs } p_1, p_2 \in N_t \\ \text{of adjacent pixels in } I_t}} \left[ c_3(p_1, p_2, t)\, x_t^{p_1} x_t^{p_2} + c_4(p_1, p_2, t)\, x_t^{p_1} + c_5(p_1, p_2, t)\, x_t^{p_2} + c_6(p_1, p_2, t) \right].    (19)
C. MAP Bitmap Estimation

In order to estimate the MAP bitmap X_t^MAP, (19) should be maximized:

X_t^{\text{MAP}} = \arg\max_{X_t} P(X_t).    (20)

Replacing the objective function by its logarithm and utilizing the fact that the variables are 0-1 leads, after a straightforward derivation given in [23], to

X_t^{\text{MAP}} = \arg\max_{X_t} \sum_{\substack{\text{unordered pairs } p_1, p_2 \in N_t \\ \text{of adjacent pixels in } I_t}} \left[ \ln\!\left( \frac{f_2(x_t^{p_1}{=}1, x_t^{p_2}{=}1) \cdot f_2(x_t^{p_1}{=}0, x_t^{p_2}{=}0)}{f_2(x_t^{p_1}{=}1, x_t^{p_2}{=}0) \cdot f_2(x_t^{p_1}{=}0, x_t^{p_2}{=}1)} \right) x_t^{p_1} x_t^{p_2} + \ln\!\left( \frac{f_2(x_t^{p_1}{=}1, x_t^{p_2}{=}0)}{f_2(x_t^{p_1}{=}0, x_t^{p_2}{=}0)} \right) x_t^{p_1} + \ln\!\left( \frac{f_2(x_t^{p_1}{=}0, x_t^{p_2}{=}1)}{f_2(x_t^{p_1}{=}0, x_t^{p_2}{=}0)} \right) x_t^{p_2} \right] + \sum_{p=1}^{|I_t|} \ln\!\left( \frac{f_1(x_t^p = 1)}{f_1(x_t^p = 0)} \right) x_t^p.    (21)

After gathering common terms in the resulting polynomial, we obtain

X_t^{\text{MAP}} = \arg\max_{X_t} \sum_{\substack{\text{unordered pairs } p_1, p_2 \in N_t \\ \text{of adjacent pixels in } I_t}} \tilde{c}_1(p_1, p_2, t)\, x_t^{p_1} x_t^{p_2} + \sum_{p=1}^{|I_t|} \tilde{c}_2(p, t)\, x_t^p,    (22)

where \tilde{c}_1(p_1, p_2, t) and \tilde{c}_2(p, t) are the coefficients obtained after the common terms are gathered. Unfortunately, maximizing quadratic pseudo-Boolean functions is NP-hard [1], [20]. Thus, instead of maximizing the objective function in (22), we choose to replace each quadratic term \tilde{c}_1(p_1, p_2, t)\, x_t^{p_1} x_t^{p_2} with a negative coefficient by the term \frac{\tilde{c}_1(p_1, p_2, t)}{2} x_t^{p_1} + \frac{\tilde{c}_1(p_1, p_2, t)}{2} x_t^{p_2}. The resulting objective function has only nonnegative coefficients for the quadratic terms, and therefore its maximization may be reduced into a maximum-flow problem [10]. (Note that formulating the resulting maximization problem as a minimization of a regular energy function in F^2 [20], such as the one in [5], is straightforward.) Note that this heuristic modification to the objective function does not change the score contributed by quadratic terms corresponding to equal bit assignments (i.e., where both variables are assigned 1 or both variables are assigned 0). This modification only reduces the score for unequal bit assignments. In order to assess the quality of the approximation, let us denote the objective function in (22) by Q, the altered objective function by \hat{Q}, and its maximizer by \hat{X}_t^{\text{MAP}}. We then have the following lower bound for the approximation ratio α:

\alpha = \frac{Q\left(\hat{X}_t^{\text{MAP}}\right)}{Q\left(X_t^{\text{MAP}}\right)} \ge \frac{Q\left(\hat{X}_t^{\text{MAP}}\right)}{\hat{Q}\left(\hat{X}_t^{\text{MAP}}\right) + D} \ge \frac{\hat{Q}\left(\hat{X}_t^{\text{MAP}}\right)}{\hat{Q}\left(\hat{X}_t^{\text{MAP}}\right) + D},    (23)

where

D = \frac{1}{2} \sum_{\substack{\text{adjacent } p_1, p_2 \text{ s.t.} \\ \tilde{c}_1(p_1, p_2, t) < 0}} \left| \tilde{c}_1(p_1, p_2, t) \right|.

Calculating this ratio bound in our experiments showed that the approximated bitmaps are no worse than α-approximations to the MAP bitmaps, for α varying in the range between 0.6 and 0.95. Note that these ratios are only lower bounds to the exact ratios. (We could not compute the exact approximation ratios because the exact MAP solutions are inaccessible.) We believe that, as implied by the experimental results, the exact ratios are much higher and show that, in practice, the approximated bitmaps are meaningful.

Occasionally the estimated MAP bitmap may contain extraneous small connected components. This may happen after a small patch is erroneously attached to the target (due to very similar color or motion) and then disconnected from it as a set of non-target pixels separating the target from this patch is correctly classified. (In another scenario, the target may actually split into more than one connected component. Note that the bitmap's PDF does not assume any a priori topological information.) In this case, only the largest connected component in the estimated bitmap is maintained.

D. Considering Only Target-potential Pixels
Since the optical flow between adjacent frames is assumed to be limited by a maximal size M , there is no need to solve (22) for all the pixels in It . Instead, it is enough to solve only for the set of pixels with locations similar to the ones constituting the target in It−1 , dilated with a disc of radius equal to M pixels, and set the bitmap to zero for all other pixels. In other words, It is reduced to contain only the pixels that might belong to the target (see the left-hand diagram of Figure 2). The set of pixels in It−1 that may correspond to the pixels in the reduced It contains the set of pixels with locations similar to the ones in the reduced It , dilated with the aforementioned disc. That is, the reduced It−1 constitutes the target in It−1 , dilated twice with the aforementioned disc (see the right-hand diagram of Figure 2). Note that the reduced It−1 is larger than the reduced It , because the latter may include non-target pixels whose corresponding pixels in It−1 are of locations outside the reduced It . Note that changing the pixel-sets It and It−1 to the corresponding reduced versions affects some normalization constants in the formulae.
Fig. 2. The reduced It and the reduced It−1 . In practice, the bitmap is estimated in the reduced image and set to 0 outside of it.
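As an illustration of the reduction just described, the sketch below (our own, using SciPy; not part of the paper's implementation) builds the reduced pixel sets of Figure 2 by dilating the previous bitmap with a disc of radius M, and dilating the result once more for I_{t-1}:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def disc(radius):
    """Disc-shaped structuring element of the given radius (in pixels)."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x**2 + y**2 <= radius**2

def reduced_masks(prev_bitmap, M=6):
    """Masks of the reduced I_t and reduced I_{t-1} (Section III-D).
    prev_bitmap is the Boolean target bitmap estimated for frame t-1."""
    reduced_it = binary_dilation(prev_bitmap, structure=disc(M))          # target dilated once
    reduced_it_minus_1 = binary_dilation(reduced_it, structure=disc(M))   # dilated twice
    return reduced_it, reduced_it_minus_1
```

Only pixels inside the first mask are classified at time t; as stated in the caption of Figure 2, the bitmap is set to 0 outside of it.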
E. Parameters and Algorithm Outline

As all five P_* parameters are actually prior probabilities and the two others are simply color variances, all these parameters may be set according to their statistics, which may be learned from sequences labeled by the ground truth of each frame's bitmap and (discrete) optical flow: P_none may be set to the percentage of pixels having no corresponding pixel in the previous frame; P_target may be set to the percentage of the pixels having no corresponding pixel in the previous frame that belong to the target; P_flow1 (P_flow2) may be set to the percentage of the adjacent pixel pairs with equal (different) bit labeling that have similar (different) discrete optical flows; the variance in C may be set to the variance of the color differences between corresponding pixels; the variance in C_adj may be set to the variance of the color differences between adjacent pixels belonging to a segment of homogeneous color; and P_correct may be adjusted to equal the percentage of correctly classified pixels returned by the tracker. In our implementation, the noise variance in C was learned from pairs of consecutive frames with known motion. The variance in C_adj was set to that of the low-variance Gaussian component in the distribution of color differences between adjacent pixels. (This Gaussian component corresponds to adjacent pixel pairs belonging to segments of homogeneous color.) However, the five prior probabilities, which may be learned only from long, labeled sequences, might vary across videos. We have therefore set these five prior probabilities in an ad hoc fashion. Nonetheless, because they are prior probabilities of simple events with immediate meanings, setting them ad hoc to reasonable values was possible. Experimenting with the values of these parameters showed that Bittracker has only minor sensitivity to them, as long as the values are reasonable. This low sensitivity characteristic is reinforced by the fact that all these parameters were set to the exact same values for all sequences, although obviously the true values of these prior probabilities varied across sequences. Note in particular that the effect of P_target is restricted to "new" pixels (pixels that are revealed after being occluded and thus have no corresponding pixel in the previous frame). Therefore, this parameter may be set to zero, and the "new" pixel will be classified in the next frame (where it is no longer "new") according to the spatial motion continuity characteristic. Note also that P_none has only minor influence on the tracker, as is explained in Appendix I. We summarize Bittracker in the outline below. Note that some parts of the algorithm refer to equations given in the appendices. This was done for the sake of readability.
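For reference, the default values quoted throughout Sections III-B and III-E can be collected in one place. The snippet below is simply a consolidation of those quoted values (the variance of C is learned, as described above, and is therefore shown only as a placeholder):

```python
# Default parameter values as quoted in the text; var_C is learned from pairs of
# consecutive frames with known motion, so the entry below is only a placeholder.
BITTRACKER_DEFAULTS = {
    "P_none":    0.1,    # prior prob. of having no corresponding pixel in the previous frame
    "P_correct": 0.9,    # prior prob. that the previous bitmap is correct for a pixel
    "P_target":  0.4,    # prior prob. that a "new" pixel belongs to the target
    "P_flow1":   0.99,   # prior prob. of similar flows for equally classified adjacent pixels
    "P_flow2":   0.99,   # prior prob. of different flows for unequally classified adjacent pixels
    "M":         6,      # maximal optical-flow magnitude in pixels
    "patch_diameter": 5, # diameter of the image patches compared for pixel correspondences
    "var_C_adj": 0.01,   # per-RGB-channel variance in C_adj (colors in [0, 1])
    "var_C":     None,   # learned; scales the identity covariance C in eq. (7)
}
```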
IV. EXPERIMENTS

Bittracker was tested on several image sequences, the first two synthesized and the rest natural. All the experiments demonstrate the successful tracking of rigid and non-rigid targets moving in 3D scenes and filmed by an arbitrarily moving camera. As no prior knowledge is assumed regarding the scene or target, and the target's shape and appearance undergo heavy changes over time (due to deformations, changes in viewing direction or lighting, or partial occlusions), a tracker of a more restricted context such as [34] or [4], which rely on the target's histogram in the first frame, would not be suitable here. The unsuitability of the mean shift tracker in [4] for some of the sequences is demonstrated. As the tracker was implemented in MATLAB (except for a C++ implementation of [2] to solve the maximum-flow problem), the execution was slow. On a personal computer with a Pentium IV 3GHz processor, the running time was several seconds per frame, depending on the target size. In all experiments, the parameters were set to the values indicated before, and the tracking was manually initialized in the first frame. Although all the image sequences are colored, they are shown here as intensity images so that the estimated bitmaps, overlaid on top in green, will be clear. Video files of all presented Bittracker tracking results are given as supplementary material. Although all the initial bitmaps were set quite accurately, the tracker's performance in the case of inaccurate initialization is demonstrated in any sub-sequence beginning in an intermediate frame where the estimated bitmap is inaccurate. An estimated inaccurate bitmap in a frame in the middle of a sequence may be regarded as an initial bitmap for the sub-sequence beginning from this frame. This is true because the tracking is independent of frames and bitmaps at times earlier than the previous frame.
Summary of Bittracker

Input: I_t, I_{t-1}, X_{t-1}. Output: X_t.
1) I_t ← reduced I_t; I_{t-1} ← reduced I_{t-1}.
2) For all pixels p ∈ I_t compute the optical flow distribution f_11(p'; p, t) using (28), as well as the two marginal optical flow distributions f_Δ^marginal(p', x_t^p; p) conditional on x_t^p = 0 and x_t^p = 1 using (37).
3) For each pixel p ∈ I_t compute f_1(x_t^p = 1) and f_1(x_t^p = 0) using (31) and (32), respectively.
4) For each pair of adjacent pixels (4-neighborhood) p_1, p_2 ∈ I_t:
   a) Compute f_adj(x_t^{p_1}, x_t^{p_2}) for the four possible bit-assignments using (15).
   b) Compute S_3(x_t^{p_1}, x_t^{p_2}; p_1, p_2) for the four possible bit-assignments using (39).
   c) Calculate the bounds on S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2) for the four possible bit-assignments using (41).
   d) Obtain the four intervals of f_Δ(x_t^{p_1}, x_t^{p_2}) by substituting S_2(x_t^{p_1}, x_t^{p_2}; p_1, p_2) in (16) by the right-hand side of (42) and using the results from steps (b) and (c).
   e) Set the four values of f_Δ(x_t^{p_1}, x_t^{p_2}) within the corresponding intervals obtained in (d) using Algorithm MINIMIZE.
   f) Compute f_2(x_t^{p_1}, x_t^{p_2}) for the four different bit-assignments by substituting the results from steps (a) and (e) in (11).
5) Obtain the objective function in the right-hand side of (21) using the results from steps 3 and 4(f), transform it into the canonical form (22), and replace each quadratic term \tilde{c}_1(p_1, p_2, t) x_t^{p_1} x_t^{p_2} with a negative coefficient by the term \frac{\tilde{c}_1(p_1, p_2, t)}{2} x_t^{p_1} + \frac{\tilde{c}_1(p_1, p_2, t)}{2} x_t^{p_2}.
6) Find the bitmap X_t^MAP maximizing the objective function obtained in the previous step, as explained in Section III-C.
7) X_t ← X_t^MAP zero-padded to image size.
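Steps 5–6 above (the reduction of Section III-C) can be illustrated with a small, self-contained sketch. The code below is our own illustration, not the paper's implementation (which uses the dedicated max-flow solver of [2]): it maximizes an objective of the form of (22) whose quadratic coefficients have already been made nonnegative, by casting it as an s–t minimum cut and solving it with networkx, following the standard graph construction for such pseudo-Boolean maximization [10].

```python
import networkx as nx

def maximize_bitmap_objective(unary, pairwise):
    """Maximize sum_p unary[p]*x_p + sum_{(p,q)} pairwise[(p,q)]*x_p*x_q over x in {0,1},
    assuming every pairwise coefficient is nonnegative (as after the modification of
    Section III-C).  Solved exactly as an s-t minimum cut.  Returns {node: 0 or 1}."""
    gain = dict(unary)
    g = nx.DiGraph()
    for (p, q), w in pairwise.items():
        assert w >= 0.0
        # w*x_p*x_q = w*x_p - w*x_p*(1 - x_q): fold w into p's linear gain and
        # charge w whenever x_p = 1 while x_q = 0 (an edge p -> q of capacity w).
        gain[p] = gain.get(p, 0.0) + w
        gain.setdefault(q, 0.0)
        g.add_edge(p, q, capacity=w)
    for p, v in gain.items():
        if v >= 0.0:
            g.add_edge("s", p, capacity=v)    # charged if x_p = 0 (p ends on the sink side)
        else:
            g.add_edge(p, "t", capacity=-v)   # charged if x_p = 1 (p ends on the source side)
    g.add_node("s"); g.add_node("t")
    _, (source_side, _) = nx.minimum_cut(g, "s", "t")
    return {p: int(p in source_side) for p in gain}

# Toy usage: three pixels, one adjacency; the optimum is x1 = x2 = 1, x3 = 0.
labels = maximize_bitmap_objective({1: 2.0, 2: -1.0, 3: -0.5}, {(1, 2): 1.5})
```

In the full tracker the nodes would be the pixels of the reduced I_t and the coefficients would be \tilde{c}_2(p, t) and the (modified) \tilde{c}_1(p_1, p_2, t) of (22).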
Fig. 3. Tracking in the Random-dot sequence. The estimated bitmap is shown in green, overlaid on top of intensity version images containing only the target.
We repeated this experiment using the mean shift tracker in [4], which searches in each frame for the location in the image whose histogram is most similar to the target's histogram in the first frame. The tracking failed already in the beginning because the target's colors change over time and the target and background histograms are quite similar. It should be noted that the scale adaptation was turned off in this experiment, as the target did not undergo any changes in scale. As it turned out, the tracking failed even under this simplifying condition. Furthermore, in this sequence, the pixels added to the target during its enlargement phases were assigned colors independently of those of other target pixels. Therefore, trackers of nonparametric target shapes that do not use the motion cue ([25] or [45], for example) would not be suitable here either.

2) Cellotape Sequence: Here we tracked a rotating and moving reel of cellotape filmed by a moving camera. A few frames with the corresponding tracking results are shown in Figure 4. The hole in the reel, which was not revealed at the beginning of the video,
Fig. 4. Tracking a rotating reel of cellotape filmed by a moving camera. The hole in the reel, which was not revealed at the beginning of the video, was revealed and marked correctly as the video progressed.
was revealed and marked correctly as the video progressed. Note that this change in object topology could not have been coped with using a state space of object enclosing contours.

3) Man-in-Mall Sequence: In this experiment we tracked a man walking in a mall, filmed by a moving camera. A few frames with the tracking results overlaid are shown in Figure 5. Note the zoom-in and zoom-out near the end of the sequence, and the partial occlusion at the end.

4) Woman-and-Child Sequence: Here Bittracker was tested on a sequence of a woman walking in a mall, filmed by a moving camera. See Figure 6 for a few frames and the corresponding tracking results. Note that the tracking overcame lighting changes and long-term partial occlusions. Since the woman and the girl she takes by the hand were adjacent and walking at similar velocity over an extended time period (beginning around frame #100), the
Fig. 5. Tracking a man walking in a mall filmed by a moving camera. Note that Bittracker overcomes the zoom-in and zoom-out near the end of the sequence, as well as the partial occlusion at the end.
Fig. 6. Tracking a woman walking in a mall as filmed by a moving camera. The tracking algorithm overcame lighting changes and long-term partial occlusions. Since the woman and the girl she takes by the hand were adjacent and walking at similar velocity over an extended time period (beginning around frame #100), the girl and the woman were joined as the tracking proceeded.
girl was joined to the woman in the tracking process. As can be seen in Figure 7, the mean shift tracker in [4] yielded poor results for this sequence, for two reasons. First, the similarity between the histogram of the target (the woman) and the histogram of the target's surroundings (the floor) caused the tracker to fail around frame #70. Second, the partial occlusion near the end of the sequence caused the target's shape to deviate from the shape in the first image of the sequence in a manner that could not be captured by the tracker, which supports only changes in target scale.

5) Herd Sequence: In this experiment Bittracker was tested on one cow running in a herd filmed by a moving camera. A few frames with the tracking results overlaid are shown in Figure 8.
Fig. 7. Tracking results for the Woman-and-Child sequence obtained by the mean shift tracker in [4].
Note that the tracking overcame a severe partial occlusion.

6) Lighter Sequence: In this sequence we tracked a lighter undergoing general motion, filmed by a moving camera. Figure 9 shows a few frames along with the tracking results. Note that the areas of the lighter that were previously occluded by other objects or by the lighter itself are correctly classified upon exposure. This experiment was repeated using the mean shift tracker in [4]. The full rotation of the target, which is colored differently on each side, caused its histogram to change quite significantly throughout the sequence. In addition, while the target in the first frame was reasonably approximated by an ellipse of some axis ratio and of axes parallel to the image boundaries, such a shape approximation was too crude for the rest of the sequence, where the target's shape changes are far from being of only isotropic scale (as is approximated by the mean shift tracker). Thus, although the target was not lost during the sequence, the
Fig. 8. Tracking a cow in a running herd filmed by a moving camera. Although the tracked object underwent a severe partial occlusion, the tracking continued.
Fig. 9. Tracking a lighter undergoing general motion and severe partial occlusions as filmed by a moving camera.
tracking was very inaccurate. Neither were the two aforementioned problems addressed in [3], which suggested an enhanced scale adaptation through the use of negative weights in the kernel.

7) Ball Sequence: Here we tracked a ball, initially rolling in front of the moving camera, but then partially occluded by a toy. Results are shown in Figure 10. Note the correct classification of areas of the ball that appear during its roll behind the toy.

8) Camouflage Sequence: In this last experiment we used the Camouflage sequence from [25]. This sequence was constructed in [25] by cutting a disc-like shape from the textured vegetation in the center of the image and pasting it so as to create apparent motion from the lower left corner to the upper right corner. This can be seen in Figure 11, showing Bittracker's tracking results for this sequence. As the moving disc has similar texture to part of the background, distinguishing one from the other is impossible when the former is superimposed on the latter, unless the motion cue is used. The algorithm in [25] does not use motion at all, and thus, as reported in [25], failed in this sequence. The tracker in [45] is not appropriate for this sequence, for the same reason, which holds also for the mean shift tracker in [4]. Note that the target's
Fig. 10. Tracking a rolling ball filmed by a moving camera. Note the occlusion caused by the toy.
shape in this sequence is constant and circular (like the support of the Epanechnikov kernel used by the mean shift tracker) and of a constant histogram. Therefore, no enhancement of the mean shift tracker by updating the target's color model or by adding degrees of freedom to the target's shape will compensate for not using the motion cue. In contrast, the algorithm proposed here successfully tracks the moving disc throughout the sequence.

When the tracking results are examined, several types of imperfections are apparent. One type of imperfection is the (usually temporary) misclassification of small target regions as non-target due to specularities, discretization artifacts and sharp spatial lighting changes, which cause the constancy of color assumption to be violated. Nonetheless, most of these errors are corrected with time due to the spatial motion continuity and the spatial color coherence characteristics. A second type of imperfection is the misclassification of thin target or non-target branches, usually thinner than the diameter of the image patches used to estimate the pixel correspondences. Another type of imperfection might occur when a non-target area is exposed after being occluded by the target, and the color of the exposed area is very similar to the color of a nearby target area but very different from all nearby non-target areas. In such a scenario the exposed non-target area might be temporarily classified as belonging to the target. However, these errors are fixed after the erroneously classified non-target regions are disconnected from the ball, and thus eliminated by the single connected component constraint (Section III-C).

V. CONCLUSION

A novel algorithm for visual tracking under very general conditions was developed. The algorithm handles non-rigid targets whose appearance and shape in the image may change drastically, as well as general camera motion and 3D scenes. The tracking is conducted without any a priori target-related or scene-related information (except the target's bitmap in the first frame, given for initialization). The tracker works by maximizing in each frame a PDF of the target's bitmap, formulated at pixel level through a lossless decomposition of the image information into color information and pixel-location information. This image decomposition allows color and motion to be treated separately and systematically. The tracker relies on only three basic visual characteristics:
Fig. 11. Bittracker ’s tracking results for the Camouflage sequence, constructed in [25] to show the limits of the tracking algorithm proposed there.
approximate constancy of color in consecutive frames (short-term constancy of color characteristic), spatial piecewise-continuity in the optical flow of pixels belonging to the same object (spatial motion continuity characteristic), and the belonging of similarly-colored adjacent pixels to the same object (spatial color coherence characteristic). Rather than estimating optical flow by means of a point estimate, we construct the bitmap's PDF by marginalizing over all possible pixel motions. This is an important advantage, as optical flow estimation is prone to error, and is actually a harder and more general problem than target tracking. A further advantage is that the target's bitmap PDF is formulated directly at pixel level. Thus, the precursory confinement of the final solution to objects composed of preliminarily-computed image segments, as is common in video segmentation algorithms, is avoided. Experimental results demonstrate Bittracker's robustness to general camera motion and major changes in object appearance caused by variations in pose, configuration and lighting, or by long-term partial occlusions.

Finally, we note that since the proposed tracker estimates the MAP bitmap conditional only on the previous frame and its corresponding bitmap estimate (and of course also on the current frame), a target that reappears after a total occlusion will not be recovered and the estimated target will remain null. In future work, we may try to use the estimated target locations prior to the total occlusion to build some target model that can be used for target detection.

APPENDIX I
DERIVATION OF f_1(x_t^p)

Continuing from Equation (6) by using the chain rule yields

f_1(x_t^p) = \sum_{p' \in N_{t-1} \cup \{\text{none}\}} \underbrace{\Pr\left(p \rightarrow p' \mid I_{t-1}, C_t, X_{t-1}\right)}_{f_{11}(p'; p, t)} \times \underbrace{\Pr\left(x_t^p \mid p \rightarrow p', I_{t-1}, C_t, X_{t-1}\right)}_{f_{12}(x_t^p; p')}.    (24)
The first multiplicand inside the sum of (24) is the probability that I_t's pixel p corresponds to pixel p' in I_{t-1} (or the probability that it corresponds to none) when considering only the pixel colors in I_t and disregarding their exact placement in the frame. Using Bayes' rule, we have
f_{11}(p'; p, t) \propto \Pr(p \to p' \mid I_{t-1}, X_{t-1}) \cdot p(C_t \mid p \to p', I_{t-1}, X_{t-1}).   (25)

Since L_t is not given, the prior on the potentially corresponding pixels p' \in N_{t-1} is uniform, and we set the prior probability of having no corresponding pixel in the previous frame to P_none. Subject to this and under the constancy of color assumption, for p' \in N_{t-1} we approximate the first multiplicand inside the sum of (24) as

f_{11}(p'; p, t) \propto \frac{1 - P_\mathrm{none}}{|I_{t-1}|} \cdot \mathcal{N}_{c_{t-1}^{p'},\, C}\!\left(c_t^p\right), \qquad p' \in N_{t-1},   (26)
where \mathcal{N}_{\mu, C} is the normal PDF of mean \mu and covariance matrix C, which is set to a constant as well. (C is set to an identity matrix scaled by a variance reflecting the degree of color similarity assumed in the constancy of color characteristic.) For p' = none we approximate

f_{11}(\mathrm{none}; p, t) \propto P_\mathrm{none} \cdot U(c_t^p),   (27)

where U is the uniform PDF on the color space (RGB in our implementation). Note that modeling f_{11}(p'; p, t) in (26) as a multidimensional Gaussian causes this function (in p') to be very highly peaked, which makes the tracker's sensitivity to P_none minor. In our implementation P_none was set to 0.1.

In practice, in order to estimate the pixel correspondences more exactly, we compare small image patches (5 pixels in diameter) centered around the candidate pixels instead of merely comparing the candidate pixels themselves. This change is made by modifying the normal and uniform PDFs in Equations (26) and (27), respectively, to products of the color PDFs of the pixels in the patches. In addition, since pixels that are projections of different objects are likely to have different optical flows despite their adjacency in the image plane, we avoid comparing an image patch in I_t to an image patch in I_{t-1} that contains a mix of pixels assigned x_{t-1} = 1 and pixels assigned x_{t-1} = 0. In such cases, we compare only the pixels in the patch that are assigned the same bitmap value as the center pixel, which is the one for which the correspondence is sought. We also restrict the maximal size of the optical flow to M pixels (M = 6 in our implementation) and compare only image patches distanced by at most M, which reduces the number of computations. Thus, the sum in (24) is computed over a subset of feasible pixels in I_{t-1} (137 pixels for M = 6) and none, which reduces computation time. We conclude for the first multiplicand inside the sum of (24):

f_{11}(p'; p, t) \propto \begin{cases} \frac{1 - P_\mathrm{none}}{|I_{t-1}|} \cdot \mathcal{N}_{\bar{c}_{t-1}^{p'},\, C}\!\left(\bar{c}_t^p\right), & p' \in D_{t-1}(p), \\ P_\mathrm{none} \cdot U\!\left(\bar{c}_t^p\right), & p' = \mathrm{none}, \end{cases}   (28)

where D_{t-1}(p) = \{p' : \|l_{t-1}^{p'} - l_t^p\|_2 \le M\} is the index-set of pixels in I_{t-1} within a radius of M pixels from pixel p, and \bar{c}_t^p is the vector of colors of every pixel composing the image patch for pixel p in I_t, say in raster order. Since all the feasible cases for p' are covered by D_{t-1}(p) \cup \{\mathrm{none}\}, normalizing to a unit sum over p' \in D_{t-1}(p) \cup \{\mathrm{none}\} produces the estimated probabilities (although normalizing here is not necessary, as it will only scale P(X_t), which does not change its maximizing bitmap).
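To make the patch-based correspondence likelihood of (26)-(28) concrete, the following minimal Python sketch scores every candidate p' in D_{t-1}(p) together with the none hypothesis and normalizes the result. The helper name, the flattened 8-bit RGB patch representation, and the scalar sigma are illustrative assumptions, not part of the paper.

```python
import numpy as np

def f11_correspondence_probs(patch_t, cand_patches, n_prev, p_none=0.1, sigma=10.0):
    """Sketch of Eqs. (26)-(28): correspondence probabilities of pixel p in I_t to each
    candidate p' in D_{t-1}(p), plus the 'none' hypothesis, normalized to a unit sum.

    patch_t      -- colors of the patch around p in I_t, flattened to shape (d,)
    cand_patches -- iterable of flattened patches around each candidate p' in I_{t-1}
    n_prev       -- stand-in for |I_{t-1}|, the uniform-prior denominator in Eq. (26)
    sigma        -- per-channel standard deviation (so C = sigma^2 * I in Eq. (26))
    """
    d = patch_t.size
    log_w = []
    for patch_prev in cand_patches:
        diff = patch_t - patch_prev
        # log of the Gaussian color likelihood N_{c_{t-1}^{p'}, C}(c_t^p), patch version
        log_gauss = -0.5 * np.dot(diff, diff) / sigma**2 \
                    - 0.5 * d * np.log(2.0 * np.pi * sigma**2)
        log_w.append(np.log((1.0 - p_none) / n_prev) + log_gauss)
    # 'none' hypothesis: uniform color PDF over 8-bit channels, Eq. (27)
    log_w.append(np.log(p_none) - d * np.log(256.0))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())   # stabilize before exponentiation
    return w / w.sum()                # normalized over D_{t-1}(p) ∪ {none}; last entry is 'none'
```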
The second multiplicand inside the sum of (24) is the PDF of the bitmap's value at pixel p in I_t, conditional on this pixel's correspondence to pixel p' in I_{t-1}, whose bitmap value is given. Since the MAP bitmap estimated for the previous frame may contain errors, we set this PDF to

f_{12}(x_t^p; p') = \begin{cases} P_\mathrm{correct}, & x_t^p = x_{t-1}^{p'}, \\ 1 - P_\mathrm{correct}, & x_t^p \ne x_{t-1}^{p'}, \end{cases} \qquad p' \in N_{t-1},   (29)

where P_correct is a preset constant approximating the prior probability of the estimated bitmap being correct for a pixel. (P_correct is typically set to 0.9.) For p' = none we set this PDF to

f_{12}(x_t^p; \mathrm{none}) = \begin{cases} P_\mathrm{target}, & x_t^p = 1, \\ 1 - P_\mathrm{target}, & x_t^p = 0, \end{cases}   (30)
where P_target is also a preset constant approximating the prior probability that a pixel with no corresponding pixel in the previous frame belongs to the target. (P_target is typically set to 0.4.)

To conclude the steps for computing f_1(x_t^p) for pixel p in I_t, we first use Equation (28) to compute the probabilities f_{11}(p'; p, t) for p' \in D_{t-1}(p) \cup \{\mathrm{none}\}, that is, the probabilities for pixel p's different correspondences to pixels in I_{t-1} (feasible subject to the maximal optical flow assumed), including the probability of having no corresponding pixel. Then, by substituting Equations (29) and (30) into Equation (24), we derive

f_1(x_t^p = 1) = P_\mathrm{correct} \cdot \sum_{p' \in D_{t-1}(p) \cap \{q : x_{t-1}^q = 1\}} f_{11}(p'; p, t) + (1 - P_\mathrm{correct}) \cdot \sum_{p' \in D_{t-1}(p) \cap \{q : x_{t-1}^q = 0\}} f_{11}(p'; p, t) + P_\mathrm{target} \cdot f_{11}(\mathrm{none}; p, t),   (31)

and by complementing,

f_1(x_t^p = 0) = 1 - f_1(x_t^p = 1).   (32)
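Putting (28)-(32) together, the single-pixel evidence f_1 can be assembled from the correspondence probabilities and the previous bitmap values of the candidates. The sketch below is only an illustration and reuses the (hypothetical) normalized output of f11_correspondence_probs above.

```python
def f1_target_prob(corr_probs, prev_labels, p_correct=0.9, p_target=0.4):
    """Sketch of Eqs. (31)-(32): f_1(x_t^p) for one pixel p of I_t.

    corr_probs  -- f_11 values over D_{t-1}(p) ∪ {none}; the last entry is 'none'
    prev_labels -- previous-frame bitmap values x_{t-1}^{p'} (0/1), same order as the candidates
    """
    cand_probs, none_prob = corr_probs[:-1], corr_probs[-1]
    mass_target = sum(w for w, x in zip(cand_probs, prev_labels) if x == 1)
    mass_background = sum(w for w, x in zip(cand_probs, prev_labels) if x == 0)
    f1_one = (p_correct * mass_target
              + (1.0 - p_correct) * mass_background
              + p_target * none_prob)            # Eq. (31)
    return {1: f1_one, 0: 1.0 - f1_one}          # Eq. (32)
```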
APPENDIX II
DERIVATION OF f_\Delta(x_t^{p_1}, x_t^{p_2})

In the following we shall derive and show how we compute f_\Delta(x_t^{p_1}, x_t^{p_2}), which equals \Pr(\Delta_t(p_1, p_2) \mid \mathrm{adj}(p_1, p_2), x_t^{p_1}, x_t^{p_2}, I_{t-1}, C_t, X_{t-1}), where \Delta_t(p_1, p_2) = l_t^{p_1} - l_t^{p_2} and \mathrm{adj}(p_1, p_2) is the event of pixels p_1 and p_2 being adjacent.

Marginalizing f_\Delta(x_t^{p_1}, x_t^{p_2}) over all the potential correspondences of pixels p_1 and p_2 to pixels in I_{t-1}, including the event of corresponding to none, and then applying the chain rule, yields

f_\Delta(x_t^{p_1}, x_t^{p_2}) = \sum_{p_1', p_2' \in N_{t-1} \cup \{\mathrm{none}\}} \underbrace{\Pr(p_1 \to p_1', p_2 \to p_2' \mid \mathrm{adj}(p_1, p_2), x_t^{p_1}, x_t^{p_2}, I_{t-1}, C_t, X_{t-1})}_{f_\Delta^1(p_1', p_2', x_t^{p_1}, x_t^{p_2};\, p_1, p_2)} \times \underbrace{\Pr(\Delta_t(p_1, p_2) \mid \mathrm{adj}(p_1, p_2), p_1 \to p_1', p_2 \to p_2', x_t^{p_1}, x_t^{p_2}, I_{t-1}, C_t, X_{t-1})}_{f_\Delta^2(x_t^{p_1}, x_t^{p_2};\, \Delta_t(p_1, p_2), p_1', p_2')}.   (33)

The second multiplicand inside the sum of (33) is the likelihood of the relative position between adjacent pixels p_1 and p_2 in I_t, where the coordinates of their corresponding pixels in I_{t-1}, if any, are known (because the likelihood is conditional on the pixel correspondences and on I_{t-1}, which consists of L_{t-1}). In accordance with the spatial motion continuity characteristic, we approximate this likelihood as summarized in Table I. When x_t^{p_1} = x_t^{p_2} and both pixels have corresponding pixels in I_{t-1}, it is very likely that \Delta_t(p_1, p_2) = \Delta_{t-1}(p_1', p_2'). The probability of this event is assigned P_flow1, which is a preset constant of typical value 0.99; the complementary event \Delta_t(p_1, p_2) \ne \Delta_{t-1}(p_1', p_2') is thus assigned 1 - P_flow1. Equivalently, when x_t^{p_1} \ne x_t^{p_2} and both pixels have corresponding pixels in I_{t-1}, it is very likely that \Delta_t(p_1, p_2) \ne \Delta_{t-1}(p_1', p_2'). The probability of this event is assigned P_flow2, which is a preset constant as well, with a typical value of 0.99; complementing again, the event \Delta_t(p_1, p_2) = \Delta_{t-1}(p_1', p_2') is assigned 1 - P_flow2. When one or both of the pixels have no corresponding pixel in I_{t-1}, the spatial motion continuity characteristic is irrelevant and the four different values for \Delta_t(p_1, p_2) are assigned the same probability of 0.25.

TABLE I
The values of f_\Delta^2(x_t^{p_1}, x_t^{p_2}; \Delta_t(p_1, p_2), p_1', p_2')

                            p_1', p_2' \in N_{t-1} and                      p_1', p_2' \in N_{t-1} and                      p_1' = none or p_2' = none
                            \Delta_{t-1}(p_1', p_2') = \Delta_t(p_1, p_2)    \Delta_{t-1}(p_1', p_2') \ne \Delta_t(p_1, p_2)
x_t^{p_1} = x_t^{p_2}       P_flow1                                          1 - P_flow1                                     0.25
x_t^{p_1} \ne x_t^{p_2}     1 - P_flow2                                      P_flow2                                         0.25

Following the partitioning of the possibilities for p_1' and p_2' summarized in Table I, the sum in (33) may be split into three cases: \{(p_1', p_2') \in N_{t-1}^2 \mid \Delta_{t-1}(p_1', p_2') = \Delta_t(p_1, p_2)\}, \{(p_1', p_2') \in N_{t-1}^2 \mid \Delta_{t-1}(p_1', p_2') \ne \Delta_t(p_1, p_2)\} and \{(p_1', p_2') \in (N_{t-1} \cup \{\mathrm{none}\})^2 \mid p_1' = \mathrm{none} \text{ or } p_2' = \mathrm{none}\}. For x_t^{p_1} = x_t^{p_2}, Equation (33) becomes

f_\Delta(x_t^{p_1}, x_t^{p_2}) = P_\mathrm{flow1} \cdot \underbrace{\sum_{\substack{p_1', p_2' \in N_{t-1} \text{ s.t.} \\ \Delta_{t-1}(p_1', p_2') = \Delta_t(p_1, p_2)}} f_\Delta^1(p_1', p_2', x_t^{p_1}, x_t^{p_2}; p_1, p_2)}_{S_1(x_t^{p_1}, x_t^{p_2};\, p_1, p_2)} + (1 - P_\mathrm{flow1}) \cdot \underbrace{\sum_{\substack{p_1', p_2' \in N_{t-1} \text{ s.t.} \\ \Delta_{t-1}(p_1', p_2') \ne \Delta_t(p_1, p_2)}} f_\Delta^1(p_1', p_2', x_t^{p_1}, x_t^{p_2}; p_1, p_2)}_{S_2(x_t^{p_1}, x_t^{p_2};\, p_1, p_2)} + 0.25 \cdot \underbrace{\sum_{\substack{p_1', p_2' \in N_{t-1} \cup \{\mathrm{none}\} \text{ s.t.} \\ p_1' = \mathrm{none} \text{ or } p_2' = \mathrm{none}}} f_\Delta^1(p_1', p_2', x_t^{p_1}, x_t^{p_2}; p_1, p_2)}_{S_3(x_t^{p_1}, x_t^{p_2};\, p_1, p_2)},   (34)

and for x_t^{p_1} \ne x_t^{p_2},

f_\Delta(x_t^{p_1}, x_t^{p_2}) = (1 - P_\mathrm{flow2}) \cdot S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2) + P_\mathrm{flow2} \cdot S_2(x_t^{p_1}, x_t^{p_2}; p_1, p_2) + 0.25 \cdot S_3(x_t^{p_1}, x_t^{p_2}; p_1, p_2).   (35)
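The Table I likelihood itself is a small lookup; a minimal sketch follows (an illustration only, with None standing for the no-correspondence hypothesis).

```python
def f2_delta(same_label, p1_prime, p2_prime, delta_prev, delta_now,
             p_flow1=0.99, p_flow2=0.99):
    """Sketch of Table I: motion-continuity likelihood f^2_Delta for one pair of
    hypothesized correspondences (p1 -> p1_prime, p2 -> p2_prime)."""
    if p1_prime is None or p2_prime is None:
        # no correspondence for at least one pixel: all four relative positions equally likely
        return 0.25
    if same_label:                                        # hypothesis x_t^p1 == x_t^p2
        return p_flow1 if delta_prev == delta_now else 1.0 - p_flow1
    return 1.0 - p_flow2 if delta_prev == delta_now else p_flow2
```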
The term inside the summations (which is the first multiplicand inside the sum of (33)) is the joint probability that I_t's pixels p_1 and p_2 correspond to pixels p_1' and p_2' in I_{t-1}, respectively (or correspond to none). This term is similar to f_{11}(p'; p, t), which is the correspondence distribution for a single pixel, although now the conditioning is also on the Boolean variables of I_t's pixels. Calculating the sums in (34) and (35) for a single pair of pixels under one (x_t^{p_1}, x_t^{p_2})-hypothesis (out of four different hypotheses) would require estimating this term for a number of cases that is quadratic in the size of the image region for searching corresponding pixels, which we find to be too computationally demanding. (For M = 6 pixels, the number of such cases is (137 + 1)^2 = 19,044.) To reduce the computational cost, we replace the exact calculation of the three sums by an estimate based on the marginal PDFs

f_{\Delta\,\mathrm{marginal}}^1(p', x_t^p; p) = \Pr(p \to p' \mid x_t^p, I_{t-1}, C_t, X_{t-1}), \qquad p \in N_t,\ p' \in N_{t-1} \cup \{\mathrm{none}\},   (36)

and obtain estimates for the four likelihoods f_\Delta(x_t^{p_1} = b_1, x_t^{p_2} = b_2) (b_1, b_2 \in \{0, 1\}). In what follows, we show how f_{\Delta\,\mathrm{marginal}}^1(p', x_t^p; p) is calculated and used to obtain an estimate for f_\Delta(x_t^{p_1}, x_t^{p_2}).

1) Calculating f_{\Delta\,\mathrm{marginal}}^1(p', x_t^p; p): Marginalizing over the correctness of x_{t-1}^{p'} and following by a derivation similar to the one performed in Appendix I for obtaining f_{11}(p'; p, t) yields (a detailed derivation is given in [23])

f_{\Delta\,\mathrm{marginal}}^1(p', x_t^p; p) \propto \begin{cases} P_\mathrm{correct} \cdot A_{=} \cdot \mathcal{N}_{\bar{c}_{t-1}^{p'},\, C}\!\left(\bar{c}_t^p\right), & p' \in D_{t-1}(p) \text{ and } x_t^p = x_{t-1}^{p'}, \\ (1 - P_\mathrm{correct}) \cdot A_{\ne} \cdot \mathcal{N}_{\bar{c}_{t-1}^{p'},\, C}\!\left(\bar{c}_t^p\right), & p' \in D_{t-1}(p) \text{ and } x_t^p \ne x_{t-1}^{p'}, \\ P_\mathrm{none} \cdot U\!\left(\bar{c}_t^p\right), & p' = \mathrm{none}, \end{cases}   (37)

where

A_{=} = \frac{1 - P_\mathrm{none}}{|N_{t-1}| P_\mathrm{correct} + |N_{t-1} \cap \{q : x_{t-1}^q \ne x_t^p\}| (1 - 2 P_\mathrm{correct})} \quad \text{and} \quad A_{\ne} = \frac{1 - P_\mathrm{none}}{(|N_{t-1}| + 1) P_\mathrm{correct} + |N_{t-1} \cap \{q : x_{t-1}^q \ne x_t^p\}| (1 - 2 P_\mathrm{correct})}.

Normalizing to a unit sum over the "generalized neighborhood" of p, D_{t-1}(p) \cup \{\mathrm{none}\}, produces the estimated probabilities.
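Under the reconstruction of (37) above, the per-pixel marginal can be computed much like f_11, with each candidate additionally weighted by the agreement between its previous-frame bitmap value and the hypothesized x_t^p. The sketch below is illustrative only: the helper name, the inputs n_prev and n_mismatch (standing for |N_{t-1}| and |N_{t-1} ∩ {q : x_{t-1}^q ≠ x_t^p}|), and the color model mirror the earlier sketch and are our own choices.

```python
import numpy as np

def f1_delta_marginal(patch_t, cand_patches, cand_labels, label_t, n_prev, n_mismatch,
                      p_none=0.1, p_correct=0.9, sigma=10.0):
    """Sketch of Eq. (37): marginal correspondence PDF f^1_{Delta marginal}(p', x_t^p; p),
    normalized over D_{t-1}(p) ∪ {none}; the last entry of the result is 'none'."""
    d = patch_t.size
    # normalization factors A_= and A_!= of Eq. (37)
    a_eq = (1.0 - p_none) / (n_prev * p_correct + n_mismatch * (1.0 - 2.0 * p_correct))
    a_ne = (1.0 - p_none) / ((n_prev + 1) * p_correct + n_mismatch * (1.0 - 2.0 * p_correct))
    log_w = []
    for patch_prev, x_prev in zip(cand_patches, cand_labels):
        diff = patch_t - patch_prev
        log_gauss = -0.5 * np.dot(diff, diff) / sigma**2 \
                    - 0.5 * d * np.log(2.0 * np.pi * sigma**2)
        log_prior = np.log(p_correct * a_eq) if x_prev == label_t \
                    else np.log((1.0 - p_correct) * a_ne)
        log_w.append(log_prior + log_gauss)
    log_w.append(np.log(p_none) - d * np.log(256.0))   # 'none': uniform 8-bit color PDF
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```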
2) Estimating f_\Delta(x_t^{p_1}, x_t^{p_2}): The third sum in (34) and (35), which is the probability that at least one of the two pixels has no corresponding pixel in the previous frame, is

S_3(x_t^{p_1}, x_t^{p_2}; p_1, p_2) = \sum_{p_2' \in N_{t-1} \cup \{\mathrm{none}\}} f_\Delta^1(\mathrm{none}, p_2', x_t^{p_1}, x_t^{p_2}; p_1, p_2) + \sum_{p_1' \in N_{t-1} \cup \{\mathrm{none}\}} f_\Delta^1(p_1', \mathrm{none}, x_t^{p_1}, x_t^{p_2}; p_1, p_2) - f_\Delta^1(\mathrm{none}, \mathrm{none}, x_t^{p_1}, x_t^{p_2}; p_1, p_2) = f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_1}; p_1) + f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_2}; p_2) - f_\Delta^1(\mathrm{none}, \mathrm{none}, x_t^{p_1}, x_t^{p_2}; p_1, p_2),   (38)

and modeling the events p_1' = none and p_2' = none as independent, we obtain

S_3(x_t^{p_1}, x_t^{p_2}; p_1, p_2) = f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_1}; p_1) + f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_2}; p_2) - f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_1}; p_1) \cdot f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_2}; p_2).   (39)

Turning to S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2), which is the probability that the two pixels have identical discrete optical flows from the previous frame, and denoting

k(x_t^{p_1}, x_t^{p_2}; p_1, p_2) = 1 - f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_1}; p_1) \cdot f_{\Delta\,\mathrm{marginal}}^1(\mathrm{none}, x_t^{p_2}; p_2),   (40)

it is easy to verify the S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2) bounds

\sum_{\substack{p_1', p_2' \in N_{t-1} \text{ s.t.} \\ \Delta_{t-1}(p_1', p_2') = \Delta_t(p_1, p_2)}} \max\left\{0,\, f_{\Delta\,\mathrm{marginal}}^1(p_1', x_t^{p_1}; p_1) + f_{\Delta\,\mathrm{marginal}}^1(p_2', x_t^{p_2}; p_2) - k(x_t^{p_1}, x_t^{p_2}; p_1, p_2)\right\} \le S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2) \le \sum_{\substack{p_1', p_2' \in N_{t-1} \text{ s.t.} \\ \Delta_{t-1}(p_1', p_2') = \Delta_t(p_1, p_2)}} \min\left\{f_{\Delta\,\mathrm{marginal}}^1(p_1', x_t^{p_1}; p_1),\, f_{\Delta\,\mathrm{marginal}}^1(p_2', x_t^{p_2}; p_2)\right\}.   (41)

By complementing, the second sum in (34) and (35), which is the probability of having different discrete optical flows, is

S_2(x_t^{p_1}, x_t^{p_2}; p_1, p_2) = 1 - S_3(x_t^{p_1}, x_t^{p_2}; p_1, p_2) - S_1(x_t^{p_1}, x_t^{p_2}; p_1, p_2).   (42)

Equations (39)-(42) induce immediate bounds on f_\Delta(x_t^{p_1}, x_t^{p_2}):

\mathrm{low}(x_t^{p_1}, x_t^{p_2}) \le f_\Delta(x_t^{p_1}, x_t^{p_2}) \le \mathrm{up}(x_t^{p_1}, x_t^{p_2}).   (43)
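Combining (34), (35) and (39)-(43): once S_3 is fixed, f_\Delta is linear in S_1, so the interval [low, up] for a label hypothesis is obtained by evaluating f_\Delta at the two endpoints of the S_1 interval. The sketch below follows this reconstruction and assumes dictionary-valued marginals; it is an illustration, not the paper's implementation.

```python
def f_delta_interval(marg1, marg2, none1, none2, same_label, same_flow_pairs,
                     p_flow1=0.99, p_flow2=0.99):
    """Sketch of Eqs. (39)-(43): bounds [low, up] on f_Delta for one (x_t^p1, x_t^p2)-hypothesis.

    marg1, marg2    -- dicts p' -> f^1_{Delta marginal}(p', x_t^p; p) for p1 and p2 (without 'none')
    none1, none2    -- marginal probabilities of the 'none' hypothesis for p1 and p2
    same_label      -- True iff the hypothesis is x_t^p1 == x_t^p2
    same_flow_pairs -- pairs (a, b) of candidates with Delta_{t-1}(a, b) == Delta_t(p1, p2)
    """
    s3 = none1 + none2 - none1 * none2                                            # Eq. (39)
    k = 1.0 - none1 * none2                                                       # Eq. (40)
    s1_low = sum(max(0.0, marg1[a] + marg2[b] - k) for a, b in same_flow_pairs)   # Eq. (41), lower
    s1_up = sum(min(marg1[a], marg2[b]) for a, b in same_flow_pairs)              # Eq. (41), upper

    def f_delta(s1):
        s2 = 1.0 - s3 - s1                                                        # Eq. (42)
        if same_label:
            return p_flow1 * s1 + (1.0 - p_flow1) * s2 + 0.25 * s3                # Eq. (34)
        return (1.0 - p_flow2) * s1 + p_flow2 * s2 + 0.25 * s3                    # Eq. (35)

    endpoints = (f_delta(s1_low), f_delta(s1_up))
    return min(endpoints), max(endpoints)                                         # Eq. (43)
```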
Thus, for each unordered pair of adjacent pixels p_1 and p_2 in I_t, there are the four intervals (b_i \in \{0, 1\})

\mathrm{low}(x_t^{p_1} = b_1, x_t^{p_2} = b_2) \le f_\Delta(x_t^{p_1} = b_1, x_t^{p_2} = b_2) \le \mathrm{up}(x_t^{p_1} = b_1, x_t^{p_2} = b_2).   (44)

Avoiding additional computations, we take only these interval restrictions into account and set the four likelihoods f_\Delta(x_t^{p_1} = b_1, x_t^{p_2} = b_2) (b_1, b_2 \in \{0, 1\}), under these interval restrictions, to be as close to each other as possible so that the sum of their differences

\frac{1}{2} \sum_{(b_1, b_2) \in \{0,1\}^2} \sum_{(b_1', b_2') \in \{0,1\}^2} \left| f_\Delta(x_t^{p_1} = b_1, x_t^{p_2} = b_2) - f_\Delta(x_t^{p_1} = b_1', x_t^{p_2} = b_2') \right|   (45)

is minimized. (A minimization method is given below.) Note that if f_\Delta(x_t^{p_1}, x_t^{p_2}) is equal for all four possible bit-assignments, it will have no effect at all on the maximization of the bitmap PDF (4), which is proportional to it. Qualitatively, by closely clustering the four values of f_\Delta(x_t^{p_1}, x_t^{p_2}), the effect on the bitmap's PDF is minimized, while the interval restrictions are obeyed. Therefore, the larger the uncertainty (i.e., the interval) in the values of f_\Delta(x_t^{p_1}, x_t^{p_2}), the less the effect of this component on the bitmap's PDF. It is easily seen through (41) that the more unequivocal the marginal optical flows f_{\Delta\,\mathrm{marginal}}^1(p_i', x_t^{p_i}; p_i), the smaller these uncertainties. A typical histogram of these interval sizes is given in [23], showing that a large portion of the f_\Delta's have small intervals and thus significantly affect the bitmap's PDF.

The minimization of (45) within the intervals of (44) may be easily accomplished by the algorithm MINIMIZE below. The setting provided by the algorithm was proven to be optimal in [23].

Algorithm MINIMIZE
Input: the four intervals [low(x_t^{p_1} = b_1, x_t^{p_2} = b_2), up(x_t^{p_1} = b_1, x_t^{p_2} = b_2)], b_1, b_2 \in \{0, 1\}.
Output: the four likelihoods f_\Delta(x_t^{p_1} = b_1, x_t^{p_2} = b_2), b_1, b_2 \in \{0, 1\}.
1) Sort the eight interval bounds \{\mathrm{low}(x_t^{p_1} = b_1, x_t^{p_2} = b_2)\} \cup \{\mathrm{up}(x_t^{p_1} = b_1, x_t^{p_2} = b_2)\}, b_1, b_2 \in \{0, 1\}, in ascending order.
2) For each adjacent pair of bounds bound_1 and bound_2 (seven pairs), measure the sum of differences (45) obtained by setting each of the four f_\Delta(x_t^{p_1}, x_t^{p_2})'s most closely to (bound_1 + bound_2)/2 within its interval.
3) Out of the seven settings, choose the setting of the f_\Delta(x_t^{p_1}, x_t^{p_2})'s that had the smallest sum of differences.
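A direct transcription of MINIMIZE into Python might look as follows (a sketch; the dictionary keyed by (b_1, b_2) is our own representation of the four intervals).

```python
def minimize_f_delta(intervals):
    """Sketch of algorithm MINIMIZE: pick f_Delta(b1, b2) inside the four intervals of
    Eq. (44) so that the sum of pairwise differences of Eq. (45) is minimized.

    intervals -- dict mapping (b1, b2) in {0,1}^2 to an interval (low, up)
    """
    def clip(x, low, up):
        return min(max(x, low), up)

    def cost(setting):
        vals = list(setting.values())
        return 0.5 * sum(abs(a - b) for a in vals for b in vals)   # Eq. (45)

    # Step 1: sort the eight interval bounds in ascending order.
    bounds = sorted(b for low_up in intervals.values() for b in low_up)
    best_cost, best_setting = float("inf"), None
    # Step 2: for each adjacent pair of bounds, move every f_Delta as close as possible
    # to the midpoint of that pair while staying inside its own interval.
    for lo, hi in zip(bounds, bounds[1:]):
        mid = 0.5 * (lo + hi)
        setting = {key: clip(mid, *intervals[key]) for key in intervals}
        c = cost(setting)
        # Step 3: keep the setting with the smallest sum of differences.
        if c < best_cost:
            best_cost, best_setting = c, setting
    return best_setting
```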
ACKNOWLEDGMENT

We would like to thank the four anonymous reviewers, whose comments in their comprehensive reviews have led to an improved paper. We are also grateful for the discussions with Robert Adler, who helped us to improve the formalization of the probabilistic model.

REFERENCES

[1] E. Boros and P.L. Hammer. Pseudo-Boolean optimization. Discrete Applied Mathematics, 123:155–225, 2002.
[2] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
[3] R.T. Collins. Mean-shift blob tracking through scale space. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 234–240, 2003.
[4] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–577, 2003.
[5] A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov. Bilayer segmentation of live video. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 53–60, 2006.
[6] R. Cucchiara, A. Prati, and R. Vezzani. Real-time motion segmentation from moving cameras. Real-Time Imaging, 10(3):127–143, 2004.
[7] B.J. Frey, N. Jojic, and A. Kannan. Learning appearance and transparency manifolds of occluded objects in layers. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 45–52, 2003.
[8] M. Gelgon and P. Bouthemy. A region-level motion-based graph representation and labeling for tracking a spatial image partition. Pattern Recognition, 33(4):725–740, 2000.
[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[10] D.M. Greig, B.T. Porteous, and A.H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B (Methodological), 51(2):271–279, 1989.
[11] C. Gu and M.C. Lee. Semiautomatic segmentation and tracking of semantic video objects. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):572–584, 1998.
[12] M. Isard and A. Blake. CONDENSATION - conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[13] S. Jehan-Besson, M. Barlaud, and G. Aubert. DREAM2S: Deformable regions driven by an Eulerian accurate minimization method for image and video segmentation. International Journal of Computer Vision, 53(1):45–70, 2003.
[14] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296–1311, 2003.
[15] N. Jojic and B.J. Frey. Learning flexible sprites in video layers. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 199–206, 2001.
[16] B. Julesz. Foundations of Cyclopean Perception. The University of Chicago Press, 1971.
[17] J. Kang, I. Cohen, G. Medioni, and C. Yuan. Detection and tracking of moving objects from a moving platform in presence of strong parallax. In Proceedings of the 10th IEEE International Conference on Computer Vision, pages 10–17, 2005.
[18] S. Khan and M. Shah. Object based segmentation of video using color, motion and spatial information. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 746–751, 2001.
[19] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Probabilistic fusion of stereo with color and contrast for bilayer segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1480–1492, 2006.
[20] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[21] P. Kornprobst and G. Medioni. Tracking segmented objects using tensor voting. In Proceedings of the 2000 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 118–125, 2000.
[22] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289, 2001.
[23] I. Leichter, M. Lindenbaum, and E. Rivlin. Technical Report CIS-2006-3 (Revised), Computer Science Department, Technion - Israel Institute of Technology, 2006.
[24] Y. Liu and Y.F. Zheng. Video object segmentation and tracking using ψ-learning classification. IEEE Transactions on Circuits and Systems for Video Technology, 15(7):885–899, 2005.
[25] A.R. Mansouri. Region tracking via level set PDEs without motion computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):947–961, 2002.
[26] V. Mezaris, I. Kompatsiaris, and M.G. Strintzis. Video object segmentation using Bayes-based temporal tracking and trajectory-based region merging. IEEE Transactions on Circuits and Systems for Video Technology, 14(6):782–795, 2004.
[27] H.T. Nguyen, M. Worring, R. van den Boomgaard, and A.W.M. Smeulders. Tracking nonparameterized object contours in video. IEEE Transactions on Image Processing, 11(9):1081–1091, 2002.
[28] M. Nicolescu and G. Medioni. Motion segmentation with accurate boundaries - a tensor voting approach. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 382–389, 2003.
[29] M. Nicolescu and G. Medioni. Layered 4D representation and voting for grouping from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4):492–501, 2003.
[30] T. Papadimitriou, K.I. Diamantaras, M.G. Strintzis, and M. Roumeliotis. Video scene segmentation using spatial contours and 3-D robust motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 14(4):485–497, 2004.
[31] N. Paragios and R. Deriche. A PDE-based level-set approach for detection and tracking of moving objects. In Proceedings of the 6th IEEE International Conference on Computer Vision, pages 1139–1145, 1998.
[32] I. Patras, E.A. Hendriks, and R.L. Lagendijk. Video segmentation by MAP labeling of watershed segments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):326–332, 2001.
[33] I. Patras, E.A. Hendriks, and R.L. Lagendijk. Semi-automatic object-based video segmentation with labeling of color segments. Signal Processing: Image Communication, 18(1):51–65, 2003.
[34] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proceedings of the 7th European Conference on Computer Vision, pages 661–675, 2002.
[35] F. Precioso, M. Barlaud, T. Blu, and M. Unser. Robust real-time segmentation of images and videos using a smooth-spline snake-based algorithm. IEEE Transactions on Image Processing, 14(7):910–924, 2005.
[36] S. Sclaroff and J. Isidoro. Active blobs: region-based, deformable appearance models. Computer Vision and Image Understanding, 89(2):197–225, 2003.
[37] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proceedings of the 6th IEEE International Conference on Computer Vision, pages 1154–1160, 1998.
[38] C. Sun. Fast optical flow using 3D shortest path techniques. Image and Vision Computing, 20(13):981–991, 2002.
[39] S. Sun, D.R. Haynor, and Y. Kim. Semiautomatic video object segmentation using VSnakes. IEEE Transactions on Circuits and Systems for Video Technology, 13(1):75–82, 2003.
[40] H. Tao, H.S. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):75–89, 2002.
[41] Y.P. Tsai, C.C. Lai, Y.P. Hung, and Z.C. Shih. A Bayesian approach to video object segmentation via 3-D watershed volumes. IEEE Transactions on Circuits and Systems for Video Technology, 15(1):175–180, 2005.
[42] Y. Tsaig and A. Averbuch. Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Transactions on Circuits and Systems for Video Technology, 12(7):597–612, 2002.
[43] H.Y. Wang and K.K. Ma. Automatic video object segmentation via 3D structure tensor. In Proceedings of the 2003 IEEE International Conference on Image Processing, volume 1, pages 153–156, 2003.
[44] Q.X. Wu. A correlation-relaxation-labeling framework for computing optical flow - template matching from a new perspective. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):843–853, 1995.
[45] A. Yilmaz, X. Li, and M. Shah. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1531–1536, 2004.
Ido Leichter received the B.Sc. degree in Electrical and Computer Engineering from Ben-Gurion University of the Negev, Be'er-Sheva, Israel, in 1996, and the M.Sc. degree in Computer Science and Applied Mathematics from the Weizmann Institute of Science, Rehovot, Israel, in 2003. From 1996 to 2001 he served in the Intelligence Corps of the Israel Defense Forces (IDF). He is currently a Ph.D. candidate in the Computer Science Department at the Technion - Israel Institute of Technology, Haifa, Israel. His current research focuses on visual tracking in video sequences, a subject in which he has co-authored several papers published in leading computer vision conferences and journals. He is a student member of the IEEE.
Michael Lindenbaum received the B.Sc., M.Sc., and D.Sc. degrees from the Department of Electrical Engineering at the Technion, Israel, in 1978, 1987 and 1990, respectively. From 1978 to 1985 he served in the IDF. He did his post-doctoral research at the NTT Basic Research Labs in Tokyo, Japan, and has been with the Department of Computer Science at the Technion since 1991. He has also been a consultant to HP Labs Israel and spent a sabbatical year at NECI, NJ, USA (in 2001). He has served on several computer vision conference committees and co-organized the IEEE workshop on perceptual organization in computer vision. He was an associate editor of Pattern Recognition and is now an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence and of Pattern Recognition Letters. Michael Lindenbaum has worked in digital geometry, computational robotics, learning, and various aspects of computer vision and image processing. His current research interest is mainly in computer vision, with a focus on the statistical analysis of object recognition and grouping processes.
Ehud Rivlin (M’90) received the B.Sc. and M.Sc. degrees in computer science and the MBA degree from the Hebrew University, Jerusalem, Israel, and the Ph.D. degree from the University of Maryland, College Park. Currently, he is an Associate Professor in the Computer Science Department, Technion-Israel Institute of Technology, Haifa, Israel. His current research interests are in machine vision and robot navigation.