Exploiting Temporal Geometry for Moving Camera Background Subtraction

Daniya Zamalieva
Photogrammetric Computer Vision Lab.
The Ohio State University
Email: [email protected]

Alper Yilmaz
Photogrammetric Computer Vision Lab.
The Ohio State University
Email: [email protected]

James W. Davis
Computer Science and Engineering Dept.
The Ohio State University
Email: [email protected]
Abstract—In this paper, we introduce a new method for background subtraction with freely moving cameras and arbitrary scene geometry. Instead of relying on frame-to-frame estimation, we simultaneously estimate the epipolar geometries induced by a moving camera in a temporally consistent manner over a number of frames using the temporal fundamental matrix (TFM). The TFM is robustly estimated from tracklets generated by dense point tracking and is used to compute the probability of each tracklet belonging to the background. In order to ensure the color, spatial, and temporal consistency of tracklet labeling, we minimize a spatiotemporal labeling cost in the locality of the tracklets. Extensive experiments with challenging videos show that the proposed method is comparable to and, in most cases, outperforms the state-of-the-art.
I. INTRODUCTION
Detecting moving objects in videos is an active area of research and is a precursor to many applications, including but not limited to object detection, object recognition, pose reconstruction, tracking, and action recognition. It is generally performed by applying background subtraction, which separates moving objects, labeled as the foreground, from the stationary scene, labeled as the background. In the past two decades, the background subtraction problem has received much attention, and the proposed methods are summarized in [1]. However, most of the published literature is based on the assumption that the camera remains stationary. This assumption constrains the image pixels that correspond to the background to maintain their positions in consecutive frames throughout the video sequence. While most of these methods are applicable to automated surveillance with stationary cameras, this constraint inhibits their application to scenarios where the camera is moving, such as video acquired by mobile phones, handheld cameras, and cameras mounted on moving platforms.

The types of motion a camera can undergo complicate the task of moving object detection. A typical strategy for reducing this complexity is to introduce constraints on either the camera motion or the scene geometry. In order to restrict the camera motion, researchers use a pan-tilt-zoom (PTZ) camera, where the camera center does not translate [2]. This strong constraint imposes a rotational homography between consecutive frames and eliminates scene parallax. The methods following this strategy generally build a mosaic background model similar to the background models for stationary cameras but with wider scene coverage. The moving objects are then detected by registering the incoming frames with the constructed model.
As an alternative to restricting the camera motion, studies referred to as plane+parallax methods assume that the static scene can be approximated by a plane [3], [4]. The single-plane assumption provides one-to-one pixel mappings between consecutive frames by means of a planar homography. Plane+parallax methods register an incoming frame with the scene plane and resolve the parallax caused by the static scene using various constraints, such as the epipolar geometry [3] and structure consistency [4]. Liu et al. [5] model the global background motion between frames using a homography transform and locate a set of key frames that are conjectured to contain motion cues for detecting the moving objects.

The literature contains only a handful of papers that do not impose particular camera motions or scene geometries [5]–[9]. These methods predominantly assume that the dominant scene motion is caused by the camera movement and find objects by marking pixels that do not conform to the dominant motion. In [6], Lim et al. compute the fundamental matrix between consecutive frames and use the epipolar constraint to label pixels as foreground or background candidates. The initial labeling is refined by an iterative block-based approach that assumes each block contains a scene plane to model the background and the foreground. Kwak et al. [7] use the same block-based model but adopt nonparametric belief propagation with Gaussian mixtures to refine the results of the initial background/foreground labeling. As outlined by the authors, both of these methods are sensitive to complex scene geometry and to small moving objects that violate the planar scene assumption within each block. In another method, Sheikh et al. [8] use structure from motion to factorize the trajectory matrix generated from a set of tracked points. The authors label point trajectories as background or foreground based on the assumption that background trajectories form a 3D subspace. This factorization-based approach depends on dense, long-term, and reliable feature tracking and is highly affected by spatial and temporal sparsity of trajectories. A similar method is introduced by Cui et al. [10], which aids the trajectory matrix factorization with group sparsity constraints for the foreground. In [11], Elqursh et al. also exploit long-term point tracks and apply Bayesian filters for estimating the motion and appearance models. In an alternative method, Zhang et al. [9] explicitly estimate the camera motion parameters using a combination of structure from motion and bundle adjustment for full 3D recovery. The recovered depths are then used to label pixels. Their approach, however, requires a number of computationally expensive steps.
In this paper, instead of relying on frame-to-frame motion, we estimate the geometry of the scene from several consecutive frames using the temporal fundamental matrix introduced in [12]. As opposed to the traditional treatment, where the fundamental matrix is estimated from point correspondences, we estimate the temporal fundamental matrix from a set of tracklets, which enforces temporal consistency during estimation. Moreover, this formulation eliminates the problem of insignificant object and camera motion. The likelihood of a pixel belonging to the background is computed based on how well the corresponding tracklet fits the estimated geometry. In order to ensure spatiotemporally consistent results, we perform labeling based on groups of pixels with similar spatial and color content, which suppresses labeling noise. The labeling process is posed as a MAP-MRF optimization problem, followed by additional constraints that enforce spatial and temporal smoothing.

II. ESTIMATING TEMPORALLY CONSISTENT BACKGROUND
Given a sequence of images, 3D features in the scene generate short trajectories (tracklets) that impose a strong constraint on labeling pixels as background or foreground. One possibility for generating these tracklets is to use a set of interest points and a tracker such as KLT. Due to the sparseness of such tracklets, we instead use all pixels and their optical flows to generate a set of tracklets. From among these tracklets, we conjecture that a large subset belongs to the static background while the others belong to a distinct set of objects moving about the scene. Considering that the tracklets belonging to the static scene are drawn from an unknown distribution that satisfies a certain 3D geometry, their selection is not trivial.

Let the observations from the image sequence that generate background tracklets, X = {x1, x2, ..., xn}, define the epipolar geometry G, where n is the number of pixels in the image. The continuous random variables G and X are statistically related by some unknown dependency. In this setup, the distribution p(X|G) cannot be directly sampled, but p(G|X = xi) can be computed for any tracklet xi. The selection of p(G|xi), however, is not as straightforward. Considering that background tracklets are generated from non-moving 3D scene points, their selection can be facilitated by imposing 3D geometric constraints. An intuitive approach to selecting the background tracklets would be to recover the 3D positions of the points generating the tracklets while also recovering the 3D camera motion. This 3D information can be extracted by structure-from-motion followed by a bundle adjustment process. Considering the computational complexity of this process, one can alternatively use the epipolar geometry between consecutive frames to test whether a given tracklet is induced by a static scene point. A good estimate of the frame-to-frame epipolar geometry may, however, not be possible when small camera motion results in insufficient scene parallax [6], [7]. In addition, tracking noise increases the risk of estimating inconsistent fundamental matrices from pairs of images at consecutive time instants.

Temporally Stable Epipolar Geometry: In order to offset these problems, we adopt the temporal fundamental matrix
(TFM) [12]. The TFM, F(t), is defined for two cameras independently moving in the scene with overlapping fields of view. In this paper, we modify the TFM to a single moving camera and use it to represent the epipolar geometry G. For a moving camera, let Ω(0) and Θ(0) respectively denote the rotation and translation of the camera at t = 0, and let the camera move with rotational velocities ωx(t), ωy(t), and ωz(t) and translational velocities θx(t), θy(t), and θz(t). The rotation matrix Ω(t) and the translation vector Θ(t) of the camera at time instance t can be described as:

    \Omega(t) = \begin{pmatrix} 1 & \sum\omega_z(i) & -\sum\omega_y(i) \\ -\sum\omega_z(i) & 1 & \sum\omega_x(i) \\ \sum\omega_y(i) & -\sum\omega_x(i) & 1 \end{pmatrix} \Omega(0) + \epsilon(t),   (1)

and \Theta(t) = [\sum\theta_x(i), \sum\theta_y(i), \sum\theta_z(i)]^\top, where the sums run over i = 1, ..., t and \epsilon(t) collects the higher-order terms. Assuming that the camera motion between consecutive frames is small, \epsilon(t) is approximately 0 and can be neglected. With this formulation, the relative rotation R(t) between the camera at time instances t − 1 and t becomes R(t) = Ω(t)Ω^\top(t − 1). Similarly, the relative translation can be written as t(t) = R(t)Θ(t) − Θ(t − 1). As a result, the essential matrix at time t can be defined by E(t) = R(t)S(t), where S(t) is a rank-deficient matrix obtained from t(t). The corresponding image points x(t) = [x(t), y(t), 1]^\top and x(t − 1) from a selected tracklet are related by:

    x(t)^\top K^{-\top} R(t) S(t) K^{-1} x(t-1) = 0,   (2)

where K is the camera calibration matrix and F(t) = K^{-\top} R(t) S(t) K^{-1} is referred to as the temporal fundamental matrix (TFM). As discussed in [12], assuming that the rotational and translational velocities are polynomials of order nr and nt, respectively, the TFM becomes a matrix function which is a polynomial of order k = nr + nt:

    F(t) = \sum_{i=0}^{k} F_i t^i,   (3)
where Fi are the 3×3 coefficient matrices. Given n trajectories of length nf, the coefficients of the TFM can be estimated by rearranging (3) as a system of linear equations:

    M f = (M_1^\top, M_2^\top, \ldots, M_{n n_f}^\top)^\top f = 0,   (4)

where M_i = (m_0 \; m_1 \; \ldots \; m_k) and

    m_j = \big( x(t)x(t{-}1)t^j,\; x(t)y(t{-}1)t^j,\; x(t)t^j,\; y(t)x(t{-}1)t^j,\; y(t)y(t{-}1)t^j,\; y(t)t^j,\; x(t{-}1)t^j,\; y(t{-}1)t^j,\; t^j \big),   (5)
and f = (F_0(1,1), F_0(1,2), \ldots, F_k(3,2), F_k(3,3)). In (4), M is an (n n_f) × 9(k + 1) matrix and f is a 9(k + 1) vector. Assuming the existence of a non-zero solution, for n n_f ≥ 9(k + 1), the rank of M is at most 9(k + 1) − 1. Then, f is the unit eigenvector of M^\top M corresponding to the smallest eigenvalue. Once f is estimated, the TFM is computed by imposing the rank-two constraint using the SVD.

Background Likelihood of Tracklets: Ideally, the images of a static scene point in the tracklet, xi(0), xi(1), ..., xi(nf − 2), must lie on the respective epipolar lines xi(1)^\top F(1), xi(2)^\top F(2), ..., xi(nf − 1)^\top F(nf − 1). However, due to tracking noise, the distances between the points and the corresponding epipolar lines are generally greater than 0. The tracklet error with respect to the computed TFM F(t) can be realized as

    s(x_i, F) = \max_{t = 1, \ldots, n_f - 1} S(x_i(t), F(t), x_i(t-1)),   (6)

where

    S(x, F, y) = \frac{(x^\top F y)^2}{(Fy)_1^2 + (Fy)_2^2 + (F^\top x)_1^2 + (F^\top x)_2^2}   (7)
is the Sampson error. In (6), the max operator ensures that the trajectory score s(xi, F) is low only if all points in a given tracklet result in low Sampson errors.

From among the tracklets generated from all the pixels in the image sequence, a set of candidate background tracklets can be selected by minimizing a cost function using robust estimation schemes. In this paper, we employ the RANSAC algorithm [13]. At each iteration of RANSAC, a subset of tracklets is randomly selected from among all tracklets and is used to estimate the TFM by solving (4). The TFM induces a scene geometry defined by the selected subset, which is then verified against all the tracklets by computing the error s(xi, F). The tracklets with an error lower than the threshold tR constitute the inlier set. The TFM which results in the maximum number of inliers is then used to compute p(G|xi) as a zero-mean Gaussian distribution:

    p(G|x_i) \propto \frac{1}{\sigma_g \sqrt{2\pi}} \exp\!\left( -\frac{s(x_i, F)^2}{2\sigma_g^2} \right),   (8)

where σg is the standard deviation of the errors s(x1, F), s(x2, F), ..., s(xn, F). With this formulation, the background tracklets are expected to have a high p(G|xi).

There are several benefits to the sampling scheme discussed above. First, it ensures geometric consistency of the tracklets that result in high values of p(G|xi). In addition, the formulation guarantees temporal consistency between the fundamental matrices of consecutive frame pairs. The limitations related to a small baseline, such as small frame-to-frame camera motion, are alleviated because the TFM is estimated from several frames simultaneously. Hence, the resulting geometric relations significantly improve the background/foreground labeling process. In the next section, we elaborate on how p(G|xi) is used to segment moving objects in moving camera scenarios.
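The Python sketch below illustrates how the pieces above fit together: it builds the linear system of (4)-(5) from a random subset of tracklets, takes the TFM coefficients as the right singular vector of M associated with the smallest singular value, scores every tracklet with the Sampson-based error of (6)-(7), keeps the model with the most inliers, and converts errors to likelihoods as in (8). This is only an illustrative sketch, not the authors' C++/MATLAB implementation: it assumes tracklets stored as a NumPy array of shape (n, nf, 2), the sample size and iteration count are arbitrary, and the rank-two SVD projection mentioned in the text is omitted for brevity.

```python
import numpy as np

def build_rows(track, k):
    """Rows of M for one tracklet (eqs. 4-5): one row per frame pair (t-1, t)."""
    rows = []
    for t in range(1, track.shape[0]):
        xt, yt = track[t]
        xp, yp = track[t - 1]
        base = np.array([xt * xp, xt * yp, xt, yt * xp, yt * yp, yt, xp, yp, 1.0])
        rows.append(np.concatenate([base * (t ** j) for j in range(k + 1)]))
    return np.array(rows)

def estimate_tfm(tracks, k):
    """TFM coefficients F_0..F_k from the null-space direction of M (eq. 4)."""
    M = np.vstack([build_rows(tr, k) for tr in tracks])
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[-1].reshape(k + 1, 3, 3)   # singular vector of the smallest singular value

def tracklet_error(track, F_coeffs):
    """Max Sampson error over the tracklet (eqs. 6-7)."""
    worst = 0.0
    for t in range(1, track.shape[0]):
        F = sum(Fi * (t ** j) for j, Fi in enumerate(F_coeffs))   # F(t), eq. (3)
        x1 = np.append(track[t], 1.0)        # x(t)
        x0 = np.append(track[t - 1], 1.0)    # x(t-1)
        Fx0, Ftx1 = F @ x0, F.T @ x1
        denom = Fx0[0] ** 2 + Fx0[1] ** 2 + Ftx1[0] ** 2 + Ftx1[1] ** 2
        worst = max(worst, (x1 @ F @ x0) ** 2 / max(denom, 1e-12))
    return worst

def ransac_tfm(tracks, k=2, t_ransac=0.1, n_iter=25, sample_size=20, rng=None):
    """RANSAC over tracklets; returns the best TFM and background likelihoods (eq. 8)."""
    rng = rng or np.random.default_rng(0)
    best_errors, best_F = None, None
    for _ in range(n_iter):
        idx = rng.choice(len(tracks), size=sample_size, replace=False)
        F_coeffs = estimate_tfm(tracks[idx], k)
        errors = np.array([tracklet_error(tr, F_coeffs) for tr in tracks])
        if best_errors is None or (errors < t_ransac).sum() > (best_errors < t_ransac).sum():
            best_errors, best_F = errors, F_coeffs
    sigma = best_errors.std() + 1e-12
    return best_F, np.exp(-best_errors ** 2 / (2 * sigma ** 2))   # unnormalized p(G | x_i)

# Hypothetical dense tracklets: 200 pixels tracked over nf = 10 frames.
tracks = np.cumsum(np.random.randn(200, 10, 2), axis=1) + 100.0
F_best, p_bg = ransac_tfm(tracks)
print(p_bg.shape)   # (200,)
```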
III. MOVING OBJECT DETECTION

Traditional algorithms for detecting moving objects using stationary or pan-tilt-zoom cameras assume the existence of one-to-one mappings between consecutive frames. This assumption lets these methods model the appearance statistics of the static scene as the background model. This condition, however, may not be satisfied for a general moving camera, where scene parallax may leave some pixels without one-to-one mappings from one frame to the next. Such pixels may constitute regions that were invisible but become visible as the camera moves. Therefore, we do not explicitly model the appearance statistics of the background. Instead, we exploit the aforementioned motion-based TFM to label the pixels. All pixels forming the tracklet xi receive the same probability p(G|xi) computed from (8). We conjecture that tracklets with high p(G|xi) are likely to be background tracklets, while tracklets with low p(G|xi) are likely to belong to the foreground (moving objects).

A. Color Coherency

The noise in optical flow estimation may result in inaccuracies in the computed p(G|xi) values. In order to perform a more coherent labeling, we suggest that spatially connected pixels with similar color are likely to share the same label. This conjecture can be realized by assigning labels to groups of pixels rather than to single pixels, such that the likelihood of a pixel group having a particular label is computed from the likelihoods of the pixels composing the group. This implies that pixels proximal in color values and spatial coordinates belong to the same object in 3D and share the same local geometry. Such regions can be generated by oversegmenting the image, for example by estimating superpixels. The likelihood p(ri|G) of a region ri belonging to the background is then estimated by a majority vote:

    p(r_i|G) = \mathrm{median}\{\, p(G|x_j) \mid x_j \in r_i \,\}.   (9)

In this scheme, pixels with similar color values in the same proximity are enforced to share the same label despite the potential noise observed during optical flow estimation. This formulation can be further strengthened by posing the labeling as a MAP-MRF optimization problem, which can be solved efficiently using the graph-cut algorithm. The resulting scheme implicitly satisfies spatial and temporal smoothing of the labeling process. Note that by using regions instead of individual pixels, we significantly reduce the size of the graph constructed during the optimization.

B. Background/Foreground Labeling

Given the likelihoods p(r1|G), p(r2|G), ..., p(rn|G) for the regions in frame It, our objective is to estimate a binary labeling L*_t = {l1, l2, ..., ln}, which denotes whether the pixels in each region belong to the background or the foreground. The labels of the regions in frame It can be estimated by L*_t = argmin_{Lt} E(Lt, r), where the energy function E is defined by adopting the MAP-MRF framework [14]:

    E(L_t, r) = \sum_{r_i \in I_t} D(r_i) + \lambda_T \sum_{r_i \in I_t} T(r_i) + \lambda_S \sum_{r_i \in I_t} \sum_{r_j \in N(r_i)} V(r_i, r_j).   (10)

Here, D(ri), T(ri), and V(ri, rj) respectively represent the data, temporal, and spatial smoothness terms for the region ri in frame It, and the parameters λT and λS control the influence of the smoothness terms. The neighboring regions of ri are found by the neighborhood operator N(ri) = {rj | rj ∩ (ri ⊕ R) ≠ ∅}, where ⊕ is the morphological dilation operator and R is a disk structuring element with radius 1. The data term in (10) reflects the cost of labeling a region ri as li ∈ {0, 1} and can be written as:

    D(r_i) = \begin{cases} 1 - p(r_i|G) & \text{if } l_i = 0 \\ p(r_i|G) & \text{if } l_i = 1, \end{cases}   (11)

such that "0" and "1" respectively represent the background and foreground labels. Note that, since the normalization term in (8) has no effect on the minimization, we ignore it; this lets us compute the cost of assigning li = 0 as 1 − p(ri|G).
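A minimal sketch of (9) and (11) is given below, assuming a per-pixel likelihood map and an integer superpixel label map of the same size; the array names and sizes are hypothetical and this is not the authors' code.

```python
import numpy as np

def region_likelihoods(pixel_likelihood, superpixels):
    """p(r_i | G) as the median of p(G | x_j) over the pixels j inside region r_i (eq. 9)."""
    n_regions = superpixels.max() + 1
    return np.array([np.median(pixel_likelihood[superpixels == r])
                     for r in range(n_regions)])

def data_costs(region_likelihood):
    """Unary costs of eq. (11): column 0 = cost of label 0 (background), column 1 = foreground."""
    return np.stack([1.0 - region_likelihood, region_likelihood], axis=1)

# Hypothetical inputs: a dense likelihood map and a 300-region superpixel label map.
likelihood_map = np.random.rand(480, 640)
superpixel_map = np.random.randint(0, 300, size=(480, 640))
costs = data_costs(region_likelihoods(likelihood_map, superpixel_map))
print(costs.shape)   # (300, 2)
```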
Fig. 1. The spatial smoothing term is computed for adjacent regions and is proportional to the similarity of their color content. In this illustration, the thickness of the link is proportional to the spatial smoothness term V(ri, rj).
The spatial smoothing term V(ri, rj) penalizes the assignment of alternating labels to adjacent regions with similar color distributions. Assuming the histograms Hri and Hrj can be computed for regions ri and rj, the spatial smoothness term can be formulated as:

    V(r_i, r_j) = (1 - \delta(l_i - l_j))\, B(r_i, r_j),   (12)

where δ(·) is the Kronecker delta function and

    B(r_i, r_j) = \sum_{k=1}^{n} \sqrt{H_{r_i}(k)\, H_{r_j}(k)}   (13)

is the Bhattacharyya coefficient between the two color histograms, which is high for similar histograms. The regions generated by the superpixels and the histogram similarities between neighboring regions are illustrated in Figure 1, where the thickness of the lines denotes the similarity of the regions.

The term T(ri) penalizes assigning different labels to a region over time. Considering that the shape of a region may change from frame to frame, the temporal labeling consistency of the region ri at frame It can be achieved by compensating its motion, estimated from the optical flow of its pixels, and overlaying it with frame It−1. The result of this process is exemplified in Figure 2, where the region denoted by the red boundary in the top row is overlaid on the previous frame. As shown in the figure, the region of interest intersects with multiple regions in the previous frame. More formally, let R = {rj | ri^{t−1} ∩ rj ≠ ∅} be the set of regions in frame It−1 that overlap with the motion-compensated region ri^{t−1}. Using this set of intersecting regions, the temporal term T(ri) can be computed as:

    T(r_i) = \begin{cases} 1 - \tau(r_i) & \text{if } l_i = 0 \\ \tau(r_i) & \text{if } l_i = 1, \end{cases}   (14)

where τ(ri) is defined as

    \tau(r_i) = \sum_{j=1}^{|R|} \frac{|r_i \cap r_j|\, B(r_i, r_j)}{|r_i| \sum_{k=1}^{|R|} B(r_i, r_k)} \big( \alpha\, p(r_j|G) + (1 - \alpha)\, \tau(r_j) \big).   (15)

The ratio in (15) defines how much the history of the region rj should affect the labeling of the region ri, based on their overlap and color similarity. The solution of the energy minimization can be computed using the graph-cut algorithm [15].
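The minimization itself can be handed to any s-t min-cut solver. The sketch below shows how the unary costs of (11) and Potts-style pairwise weights map onto a graph; it assumes the third-party PyMaxflow package (module `maxflow`, a wrapper of the Boykov-Kolmogorov max-flow algorithm), uses toy costs and edge weights, omits the temporal term for brevity, and is an illustration rather than the authors' implementation.

```python
import numpy as np
import maxflow  # PyMaxflow, assumed to be installed: pip install PyMaxflow

def label_regions(data_cost, edges, edge_weights, lambda_s=3.0):
    """Minimize a two-label energy with unary terms `data_cost` (shape (n, 2)) and
    Potts-style pairwise terms; returns labels (0 = background, 1 = foreground)."""
    n = data_cost.shape[0]
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n)
    for i in range(n):
        # If region i receives label 0 we pay data_cost[i, 0], otherwise data_cost[i, 1].
        g.add_tedge(nodes[i], data_cost[i, 1], data_cost[i, 0])
    for (i, j), w in zip(edges, edge_weights):
        g.add_edge(nodes[i], nodes[j], lambda_s * w, lambda_s * w)
    g.maxflow()
    return np.array([g.get_segment(nodes[i]) for i in range(n)])

# Hypothetical toy problem: 4 regions in a chain, the last two clearly foreground.
data_cost = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
labels = label_regions(data_cost, edges=[(0, 1), (1, 2), (2, 3)], edge_weights=[0.9, 0.2, 0.9])
print(labels)   # expected: [0 0 1 1]
```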
Fig. 2. The τ (ri ) for a region marked with red is computed using the overlapping regions a, b, c and d from the previous frame. In this example, “region a” has the most effect since it has the largest overlap and the most similar color content.
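The pairwise and temporal terms of (12)-(15) reduce to a few lines of code. The sketch below computes the Bhattacharyya coefficient of (13), the pairwise penalty of (12), and the recursive temporal weight τ of (15); the histogram and overlap inputs are hypothetical, and this is one possible reading of the equations rather than the authors' implementation.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient of two normalized color histograms (eq. 13)."""
    return float(np.sum(np.sqrt(h1 * h2)))

def spatial_term(label_i, label_j, h_i, h_j):
    """V(r_i, r_j) of eq. (12): penalize different labels on similarly colored neighbors."""
    return 0.0 if label_i == label_j else bhattacharyya(h_i, h_j)

def temporal_tau(overlaps, area_i, hist_i, prev_hists, prev_likelihoods, prev_taus, alpha=0.2):
    """tau(r_i) of eq. (15): history of overlapping previous-frame regions, weighted by
    overlap ratio and color similarity; alpha balances likelihood against older history."""
    sims = np.array([bhattacharyya(hist_i, h) for h in prev_hists])
    weights = (np.asarray(overlaps, float) / area_i) * sims / max(sims.sum(), 1e-12)
    history = alpha * np.asarray(prev_likelihoods) + (1 - alpha) * np.asarray(prev_taus)
    return float(np.dot(weights, history))

# Hypothetical 16-bin histograms for a region and two overlapping previous-frame regions.
h = np.random.rand(3, 16); h /= h.sum(axis=1, keepdims=True)
tau = temporal_tau(overlaps=[300, 120], area_i=900.0, hist_i=h[0],
                   prev_hists=h[1:], prev_likelihoods=[0.8, 0.4], prev_taus=[0.7, 0.5])
print(tau)
```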
IV. EXPERIMENTS
The published literature does not offer a benchmark dataset for evaluating background subtraction with a moving camera. Due to this unavailability, some studies on the topic do not provide quantitative comparisons [5], [9]. In this paper, we use a set of sequences from the Hopkins dataset [16] and from [17], which have been used by recent papers on our topic [6]–[8]. These sequences contain multiple moving objects and introduce various challenges because they are acquired while the camera rotates and translates. For quantitative evaluation, we generated ground truth by manually extracting all moving objects in all frames of the videos. The pixel-based overlap between the detected moving regions and the ground truth is analyzed by F-score, precision, and recall, which are given in Tables I, II, and III, respectively. The tables contain results for different versions of the proposed approach as well as two state-of-the-art methods [8] and [6]. In particular, the variations of our approach include 1) the complete method (ours), 2) without temporal smoothness (w/o T), 3) without MRF optimization (w/o MRF), and 4) with the TFM replaced by the traditional fundamental matrix (ours FM). In addition, we show results for 5) the labeling obtained by thresholding the frame-to-frame fundamental matrix (FM), 6) [8], and 7) [6]. For fair comparison, all competing methods are implemented using the same code base where possible. The qualitative results are presented in Figure 3.

Implementation Details: For each test video, we first construct tracklets, xi = {xi(0), xi(1), ..., xi(nf − 1)}, of length nf = fps/3 for all image pixels using the optical flow method in [18]. The tracklets are used to estimate the TFM, F(t), with k = 2 for modeling the view geometry using RANSAC with threshold tR = 0.1. Once the geometry is established, the standard deviation of the errors, σg, in (8) is computed by evaluating the errors using (6). These errors facilitate computing the likelihood p(G|xi) for each point in the tracklet xi. Considering that tracklets are generated in a sliding window of nf frames, each pixel has nf different likelihood values, which are combined into a final likelihood by median filtering. The resulting pixel likelihoods are grouped into superpixels [19] for minimizing the energy function in (10). The parameters in the optimization framework are set to λT = 2, λS = 3, and α = 0.2 for all sequences.
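The sliding-window aggregation described above can be sketched as follows: each pixel accumulates one likelihood per window that covers it, and the final per-pixel likelihood is their median. The window length, image size, and array names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def aggregate_window_likelihoods(window_likelihoods):
    """Combine the n_f per-window likelihood maps covering a frame by a per-pixel median.

    window_likelihoods: array of shape (n_f, H, W), one map per sliding window."""
    return np.median(window_likelihoods, axis=0)

# Hypothetical setup: a 30 fps video gives n_f = fps/3 = 10 overlapping windows per frame.
n_f, H, W = 10, 480, 640
maps = np.random.rand(n_f, H, W)           # p(G | x_i) from each window containing the frame
final_likelihood = aggregate_window_likelihoods(maps)
print(final_likelihood.shape)              # (480, 640)
```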
TABLE I. AVERAGE F-SCORE VALUES FOR EACH SEQUENCE.

seq.      ours   w/o T   w/o MRF   ours FM   FM     [8]    [6]
cars      83.8   75.4    67.1      17.9      22.3   72.7   71.0
person    85.5   83.7    81.6      75.3      73.4   80.0   82.7
cars1     74.7   63.7    60.4      17.8      29.3   67.9   72.6
cars2     85.8   81.2    77.9      50.2      45.0   63.6   85.0
cars3     91.8   89.9    86.4      32.8      54.7   76.1   77.9
cars4     83.1   80.5    77.0      48.0      40.2   76.2   75.1
cars5     85.5   83.6    78.6      17.9      40.9   66.2   75.3
cars6     89.0   87.6    80.7      36.4      42.5   79.8   68.4
cars7     90.0   87.2    82.5      66.6      61.3   87.0   86.2
cars8     80.0   75.6    72.7      20.9      38.3   82.2   74.9
people1   72.8   71.6    62.1      48.4      48.6   51.7   49.7
people2   90.1   86.0    82.0      56.0      60.4   78.2   80.5

TABLE II. AVERAGE PRECISION VALUES FOR EACH SEQUENCE.

seq.      ours   w/o T   w/o MRF   ours FM   FM     [8]    [6]
cars      83.6   71.3    58.3      25.7      32.6   65.1   79.4
person    78.3   75.5    71.3      90.4      75.1   69.6   83.5
cars1     62.2   53.3    49.2      18.6      21.4   68.5   63.0
cars2     86.6   85.2    79.1      58.3      36.9   54.7   95.2
cars3     91.3   87.8    82.5      54.8      59.2   62.8   70.6
cars4     82.8   77.2    72.0      57.2      34.0   68.5   80.8
cars5     89.5   86.5    78.9      28.3      60.3   62.7   69.4
cars6     86.8   83.2    71.0      38.8      35.7   68.8   64.4
cars7     87.8   82.7    73.9      75.5      61.1   81.3   88.9
cars8     72.7   66.0    62.3      48.8      39.7   81.6   73.7
people1   83.9   68.6    49.5      74.3      43.2   40.5   38.5
people2   91.4   82.3    75.3      81.9      60.8   72.6   71.6

TABLE III. AVERAGE RECALL VALUES FOR EACH SEQUENCE.

seq.      ours   w/o T   w/o MRF   ours FM   FM     [8]    [6]
cars      84.2   82.8    82.4      15.5      32.1   84.6   64.4
person    95.0   94.9    96.5      70.5      75.1   95.1   83.1
cars1     94.8   83.2    82.0      17.0      47.9   74.3   87.2
cars2     85.3   79.6    79.3      45.8      66.3   81.7   77.2
cars3     92.7   92.6    91.4      27.5      59.1   97.4   87.7
cars4     84.0   85.2    84.1      43.0      60.4   88.3   73.1
cars5     81.9   81.5    79.4      14.4      35.3   79.7   82.5
cars6     91.5   92.8    94.5      35.0      61.6   96.9   73.1
cars7     92.4   93.3    94.9      65.0      70.9   94.4   84.2
cars8     90.6   91.0    89.3      17.2      40.2   85.4   76.2
people1   67.5   77.9    85.4      36.8      64.7   80.9   80.9
people2   89.1   91.0    91.6      45.9      68.0   88.0   93.8
We developed the methods using a combination of unoptimized C++ and MATLAB programs, and all experiments are conducted on a PC with a dual-core Intel i5 Ivy Bridge 1.8 GHz CPU. The average execution time for a 640 × 480 frame is 48.2 seconds.

Discussion: The quantitative results tabulated in Tables I, II, and III and the qualitative results illustrated in Figure 3 indicate that the proposed method provides a significant improvement over the fundamental matrix based methods (FM) and (ours FM). This is because the small frame-to-frame motion of a foreground object cannot be distinguished from the background motion when the geometry is estimated from a single frame pair. In most frames, the "FM" and "ours FM" methods detect only a small part of the object, and the results are not consistent over time. For most sequences, the performance of the proposed method is also superior to [8] and [6]. In some cases, we observed that [8] has higher recall and [6] has higher precision; however, for the same sequences, our approach scores the highest F-score. We have observed that the noise in tracklets can cause misdetections around the moving objects for [8]. This method also suffers from inconsistencies due to the lack of temporal constraints. On the other hand, [6] relies on the frame-to-frame fundamental matrix for an initial labeling that is refined by temporal and spatial constraints. This method performs very well when the initial labeling is accurate; otherwise, its performance decreases, especially when the object motion is large, such that occluded background is revealed quickly as the objects move and causes misdetections since it deviates from the previously learned background.

While the proposed method is superior to the state-of-the-art, we observed the following limitations during our experiments. Our approach requires the frame-to-frame camera motion to leave enough overlap between consecutive frames for optical flow computation. The TFM formulation also assumes that the frame-to-frame camera motion is smooth for at least nf frames. Otherwise, the proposed method may produce false positives. The use of RANSAC presumes that the majority of the scene is background; consequently, the performance may degrade when large objects with fast motion dominate the scene. Our method is also prone to failure for objects that move very slowly, as their motion becomes similar to the background motion. Some of these limitations are mitigated by the spatial and temporal smoothness terms in the MRF formulation. We also observed a discrepancy between the qualitative and quantitative results: for scores in the mid-80% range in the tables, the results appear qualitatively good. In many cases, this is due to small differences between the object borders in the ground truth and in the generated results. Figure 4 presents qualitative results of the proposed method on an exemplar sequence. Note that the moving objects are successfully detected despite appearance changes in the sequence.

V. CONCLUSIONS
In this paper, we introduce a method for background subtraction for freely moving cameras and arbitrary scene geometry. Our approach addresses the drawbacks of the traditional frame-to-frame fundamental matrix-based geometry formulation by estimating a temporally consistent epipolar geometry induced by a moving camera. The results are finalized within an optimization framework which ensures the temporal, spatial, and color consistency of the labeling. Extensive experiments performed using challenging videos demonstrate that the proposed algorithm is comparable to and, in most cases, superior to the state-of-the-art.

ACKNOWLEDGMENT

This work is partially supported by the NGA NURI program.
Fig. 3. Results of background subtraction on the cars, person, cars2, cars3, cars6, and people2 sequences (columns): the input frames (row 1), the proposed method (row 2), the proposed method without MRF (row 3), the traditional fundamental matrix (row 4), Sheikh et al. [8] (row 5), and Lim et al. [6] (row 6).
Fig. 4. Qualitative results of applying the proposed method to the cars4 sequence.

REFERENCES

[1] M. Piccardi, "Background subtraction techniques: a review," in Proc. IEEE Int. Conf. on Systems, Man and Cybernetics, 2004.
[2] Y. Ren, C.-S. Chua, and Y.-K. Ho, "Statistical background modeling for non-stationary camera," PR Letters, vol. 24, no. 1-3, p. 183, 2003.
[3] M. Irani and P. Anandan, "A unified approach to moving object detection in 2d and 3d scenes," PAMI, vol. 20, no. 6, p. 577, 1998.
[4] C. Yuan, G. Medioni, J. Kang, and I. Cohen, "Detecting motion regions in presence of strong parallax from a moving camera by multi-view geometric constraints," PAMI, vol. 29, no. 9, pp. 1627–1641, 2007.
[5] F. Liu and M. Gleicher, "Learning color and locality cues for moving object detection and segmentation," in CVPR, 2009.
[6] T. Lim, B. Han, and J. H. Han, "Modeling and segmentation of floating foreground and background in videos," PR, vol. 45, no. 4, pp. 1696–1706, 2012.
[7] S. Kwak, T. Lim, W. Nam, B. Han, and J. H. Han, "Generalized background subtraction based on hybrid inference by belief propagation and Bayesian filtering," in ICCV, 2011.
[8] Y. Sheikh, O. Javed, and T. Kanade, "Background subtraction for freely moving cameras," in ICCV, 2009.
[9] G. Zhang, J. Jia, W. Hua, and H. Bao, "Robust bilayer segmentation and motion/depth estimation with a handheld camera," PAMI, vol. 33, pp. 603–617, 2011.
[10] X. Cui, J. Huang, S. Zhang, and D. N. Metaxas, "Background subtraction using low rank and group sparsity constraints," in ECCV, 2012.
[11] A. Elqursh and A. M. Elgammal, "Online moving camera background subtraction," in ECCV, 2012.
[12] A. Yilmaz and M. Shah, "Matching actions in presence of camera motion," CVIU, vol. 104, no. 2, pp. 221–231, 2006.
[13] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Comm. of ACM, vol. 24, no. 6, p. 381, 1981.
[14] Y. Boykov, O. Veksler, and R. Zabih, "Markov random fields with efficient approximations," in CVPR, 1998, pp. 648–655.
[15] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," PAMI, vol. 23, no. 11, pp. 1222–1239, 2001.
[16] R. Tron and R. Vidal, "A benchmark for the comparison of 3-d motion segmentation algorithms," in CVPR, 2007.
[17] P. Sand and S. Teller, "Particle video: long-range motion estimation using point trajectories," in CVPR, 2006, pp. 2195–2202.
[18] M. J. Black and P. Anandan, "The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields," CVIU, vol. 63, no. 1, pp. 75–104, 1996.
[19] X. Ren and J. Malik, "Learning a classification model for segmentation," in ICCV, 2003.