Real-time Foreground Segmentation on GPUs using Local Online Learning and Global Graph Cut Optimization

Minglun Gong, Memorial University of Newfoundland ([email protected])
Li Cheng, National ICT Australia ([email protected])

Abstract

This paper addresses the problem of foreground separation from a background modeling perspective. In particular, we deal with the difficult scenarios where the background texture may change both spatially and temporally. A novel approach is proposed that incorporates a pixel-based online learning method to adapt promptly to temporal background changes, together with a graph cuts method to propagate per-pixel evaluation results over nearby pixels. Empirical experiments on a variety of datasets demonstrate the competitiveness of the proposed approach, which is also able to work in real time on the Graphics Processing Unit (GPU) of programmable graphics cards.

1 Introduction

A fundamental problem in video content analysis is to segment foreground objects from background scenes, which provides low-level visual cues that are necessary for further analysis. The background modeling perspective has been explored by [9, 3, 6] and has been shown to work in simple scenarios, where a static camera is commonly assumed. One important but rather difficult issue with background modeling, as pointed out by e.g. [9], is to be robust against changes of the background scene textures both spatially and over time. To deal with the issue of temporal changes, Kalman filters are employed in [10], which however relies on the strong and often unrealistic assumption that the state space is linearly structured. Similarly, [7] adopt an incremental subspace method. Attempts have also been made toward the issue of spatial variations [3, 6], which unfortunately are not suitable for real-time video analysis due to the intensive computational effort required.

In this paper, we propose a novel approach that aims to address the spatio-temporal changes jointly with real-time performance. The approach is derived from a principled online learning method [2] to adapt promptly to pixel-based temporal background changes, together with a graph cuts method to propagate per-pixel confidence values over nearby pixels. The approach is able to run in real time after being adapted and implemented on GPUs. Extensive experiments are conducted on a variety of datasets used by previous work [9, 2, 10, 8, 3]. The results demonstrate the competitiveness of the proposed approach.

2 The Proposed Approach

When processing a sequence, our approach maintains a separate background model for each pixel in the image. A buffer B is created to hold all the models, each of which consists of n (features, weight) pairs. All these pairs are set to zero initially, but as more frames are observed, a more accurate background model is obtained. To keep the most important observations, all pairs are sorted in descending order of their weights and only those with higher weights are kept in the buffer. In this paper we use pixel color features and fix n = 50. When a new frame t is observed, a three-stage process is carried out. The first stage calculates a confidence map F, which tells us how well the new observation fits the existing background model at each pixel location. In the second stage, F is used to construct a Markov Random Field (MRF) graph, based on which a near-optimal foreground mask M is computed using a simplified graph cuts method. Finally, based on M, the individual background models are updated using the online learning method [2] in the third stage. The following subsections discuss these three stages in detail.
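For concreteness, the following is a minimal CPU-side sketch of how the per-pixel model buffer could be organized, assuming RGB color features. The array names and dimensions are illustrative rather than taken from the paper; on the GPU the same data is packed into textures as described in Section 2.1.

```python
import numpy as np

# Illustrative CPU-side layout of the per-pixel background models.
# Each pixel keeps n (features, weight) pairs; on the GPU the same data
# lives in a single texture of n tiles of RGBA values (RGB = color, A = weight).
H, W, n = 240, 320, 50

# Features: n tiles of H x W RGB colors; weights: n tiles of H x W scalars.
model_colors  = np.zeros((n, H, W, 3), dtype=np.float32)
model_weights = np.zeros((n, H, W), dtype=np.float32)  # all zero until frames are observed
```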

2.1 Confidence Map

The confidence map is derived according to Equation (3) of the online learning method [2]. Here we use $B^i(x,y).c$ and $B^i(x,y).\alpha$ to denote the $i$th (features, weight) pair stored for pixel $(x,y)$ in buffer B. According to the online support vector machine representation [2], the confidence value $F^t(x,y)$ of a new observation $I^t(x,y)$ belonging to the background model for pixel $(x,y)$ is:

$$F^t(x,y) = \sum_{i=1}^{n} B^i(x,y).\alpha \times k\big(I^t(x,y),\ B^i(x,y).c\big) \qquad (1)$$

where $k(\cdot,\cdot)$ is a kernel function. The Gaussian kernel is used in this paper:

$$k(c_1, c_2) = e^{-\frac{\|c_1 - c_2\|^2}{2\sigma^2}} \qquad (2)$$

where $\|\cdot\|$ is the L2-norm of a vector and the standard deviation $\sigma = 8$. Clearly the above equation can be used even when the buffer is not fully filled, since all the weights are initialized to zero.

In our GPU implementation, each (features, weight) pair is kept in one RGBA color vector, where the alpha channel holds the weight. The whole buffer B is stored in a single texture that contains n tiles, with tile i holding the values of $B^i(x,y)$. A pixel shader is used to compute the confidence values for all pixels in parallel in a single rendering pass.
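The following NumPy sketch illustrates the arithmetic of Equations (1) and (2) for a whole frame at once. It is a CPU reference rather than the pixel-shader implementation, and the function and variable names are ours.

```python
import numpy as np

SIGMA = 8.0  # standard deviation of the Gaussian kernel, as in Eq. (2)

def confidence_map(frame, model_colors, model_weights):
    """Per-pixel confidence F_t via Eqs. (1) and (2).

    frame:         H x W x 3 array, the new observation I_t
    model_colors:  n x H x W x 3 array, the stored feature vectors B_i.c
    model_weights: n x H x W array, the stored weights B_i.alpha
    """
    # Squared L2 distance between the observation and every stored color.
    diff2 = ((model_colors - frame[None]) ** 2).sum(axis=-1)   # n x H x W
    kernel = np.exp(-diff2 / (2.0 * SIGMA ** 2))               # Gaussian kernel, Eq. (2)
    return (model_weights * kernel).sum(axis=0)                # weighted sum over pairs, Eq. (1)
```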

2.2 Global Optimization using Graph Cuts

A background pixel usually has a higher value in the confidence map, as it fits its current background model better; this is shown empirically in Fig. 2. The foreground masks (objects) in [2] are thus obtained by applying a global threshold over the confidence maps. Since spatial coherence among neighboring pixels is not enforced, the masks can be rather noisy, especially for datasets with spatial changes (shown in Fig. 2). To address this issue, a graph cuts method [1] is adapted here.

Similar to existing approaches [1, 4], the MRF graph we construct contains one node for each pixel in the image, plus a source node s and a sink node t. Here the source represents the background and the sink the foreground. Each pixel node (x, y) has two edges connected to s and t, respectively. The capacities of these two edges, $C_s(x,y)$ and $C_t(x,y)$, are determined based on the confidence value $F^t(x,y)$ as:

$$C_s(x,y) = \max\big(0,\ \omega_h - F^t(x,y)\big), \qquad C_t(x,y) = \max\big(0,\ F^t(x,y) - \omega_l\big) \qquad (3)$$

where $\omega_h = 0.5$ and $\omega_l = 0.1$ are the upper and lower bounds of the confidence value, respectively. The higher the confidence of pixel (x, y) belonging to the background, the smaller the cost of assigning (x, y) to the background, and hence the smaller $C_s(x,y)$ should be. On the other hand, the value of $C_t(x,y)$ should be higher since the pixel is unlikely to be foreground.

The graph also contains edges connecting neighboring pixels to enforce spatial coherence. Here the edge capacities between the current pixel and its four neighbors are adaptively determined based on the color differences between the corresponding pixel pairs:

$$\begin{aligned}
C_\Leftarrow(x,y) &= \lambda_h - \min\big(\lambda_l,\ |I^t(x,y) - I^t(x-1,y)|\big) \\
C_\Rightarrow(x,y) &= \lambda_h - \min\big(\lambda_l,\ |I^t(x,y) - I^t(x+1,y)|\big) \\
C_\Uparrow(x,y) &= \lambda_h - \min\big(\lambda_l,\ |I^t(x,y) - I^t(x,y-1)|\big) \\
C_\Downarrow(x,y) &= \lambda_h - \min\big(\lambda_l,\ |I^t(x,y) - I^t(x,y+1)|\big)
\end{aligned} \qquad (4)$$

where $|\cdot|$ is the L1-norm, and $\lambda_h = 0.2$ and $\lambda_l = 0.05$ are the upper and lower bounds of the jumping cost. This encourages the boundary of the foreground masks to follow object boundaries: the smaller the color difference, the higher the cost for two neighboring pixels to have different labels, and hence the higher the capacity of the corresponding edge should be.

The minimum cut of the above graph, i.e., the optimal solution of the MRF, can be found using the push-relabel algorithm [4, 5]. The algorithm starts by initializing the excess of each node, defined as the difference between the total amount flowing into and out of this node. It then pushes local flow excess toward the sink while updating the residual graph until all paths to the sink are saturated. Finally, excess that cannot be moved to the sink is pushed backward to the source. After all nodes have zero excess, the residual graph delivers the minimum cut.

In theory, $O(n^2)$ push-relabel steps are needed to compute the minimum cut. However, we empirically find that for our application a small number of push-relabel steps toward the sink effectively removes false positives in background areas. Similarly, a few backward steps effectively remove false negatives in foreground areas. To facilitate real-time performance, in practice both forward and backward push-relabel steps are bounded to 10.

In our GPU implementation, the residual graph is represented using two textures. At pixel (x, y), the first texture keeps the excess and the label of node (x, y) in its blue and alpha channels, and the residual capacities from (x, y) to s and t in its red and green channels, respectively. The second texture keeps the residual capacities from (x, y) to the four neighboring nodes in its four channels. Each push-relabel step is implemented using two rendering passes: the first pass computes the amount that can be pushed away from the current node, whereas the second pass updates the excess and the label at each node. Once all push-relabel steps are completed, a globally near-optimal foreground mask M can be obtained based on the labels of the nodes [5].
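As a rough CPU-side illustration of how the edge capacities of Equations (3) and (4) could be filled in (again, not the paper's shader code), the sketch below uses wrap-around image borders for simplicity and assumes colors normalized to [0, 1]; the minimum cut itself would still be computed by a separate push-relabel solver.

```python
import numpy as np

OMEGA_H, OMEGA_L = 0.5, 0.1     # confidence bounds in Eq. (3)
LAMBDA_H, LAMBDA_L = 0.2, 0.05  # jumping-cost bounds in Eq. (4)

def terminal_capacities(F):
    """Source/sink edge capacities, Eq. (3). F is the H x W confidence map."""
    C_s = np.maximum(0.0, OMEGA_H - F)   # cost of keeping the pixel with the background
    C_t = np.maximum(0.0, F - OMEGA_L)   # cost of assigning the pixel to the foreground
    return C_s, C_t

def neighbor_capacities(frame):
    """Capacities to the four neighbors, Eq. (4); frame is H x W x 3 with colors in [0, 1]."""
    def cap(shifted):
        d = np.abs(frame - shifted).sum(axis=-1)        # L1 norm of the color difference
        return LAMBDA_H - np.minimum(LAMBDA_L, d)
    left  = cap(np.roll(frame,  1, axis=1))   # neighbor at (x-1, y)
    right = cap(np.roll(frame, -1, axis=1))   # neighbor at (x+1, y)
    up    = cap(np.roll(frame,  1, axis=0))   # neighbor at (x, y-1)
    down  = cap(np.roll(frame, -1, axis=0))   # neighbor at (x, y+1)
    return left, right, up, down
```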

2.3 Update Background Models

Following [2], once pixels in the new frame t are classified into background and foreground, we update the individual background models for the corresponding pixels. This is accomplished by first computing a weight $\alpha^t(x,y)$ for each pixel $(x,y)$:

$$\alpha^t(x,y) = \min\!\left( \frac{1 - F^t(x,y)}{k\big(I^t(x,y),\, I^t(x,y)\big)},\ \mu \right) \qquad (5)$$

where $\mu = 0.2$ sets the maximum weight to prevent a single noisy observation from dominating the confidence value. Before inserting the new (features, weight) pair, $\big(I^t(x,y), \alpha^t(x,y)\big)$, into the buffer B, all existing pairs receive a geometric decay to remove outdated observations from the buffer. That is:

$$\tilde{B}^i.\alpha = \tau \times B^i.\alpha \qquad \forall i, x, y \qquad (6)$$

The pair $\big(I^t(x,y), \alpha^t(x,y)\big)$ is now inserted into the buffer using a parallel version of insertion sort:

$$B^i = \begin{cases} \tilde{B}^{i-1} & \text{if } i > 1 \text{ and } \alpha^t \geq \tilde{B}^{i-1}.\alpha \\ \tilde{B}^i & \text{if } \tilde{B}^i.\alpha > \alpha^t \\ \big(I^t, \alpha^t\big) & \text{otherwise} \end{cases} \qquad \forall i, x, y \qquad (7)$$

In practice, the geometric decay and insertion sort are implemented on the GPU using one rendering pass.
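The update step of Equations (5)-(7) might look as follows on the CPU. The decay factor τ is not specified in this excerpt, so the value used here is purely illustrative; for simplicity the sketch also updates every pixel rather than restricting the update according to the mask M, and all names are ours.

```python
import numpy as np

TAU = 0.98  # assumed decay factor for Eq. (6); the paper's value is not given in this excerpt
MU = 0.2    # maximum weight mu in Eq. (5)

def update_models(frame, F, model_colors, model_weights):
    """Decay existing pairs (Eq. 6) and insert the new observation (Eqs. 5 and 7)."""
    n = model_weights.shape[0]
    # Eq. (5): the Gaussian kernel gives k(I, I) = 1, so the weight reduces to min(1 - F, mu).
    alpha_t = np.minimum(1.0 - F, MU)                      # H x W

    dec_w = TAU * model_weights                            # decayed weights, Eq. (6)
    dec_c = model_colors                                   # colors are unchanged by the decay

    new_w = np.empty_like(model_weights)
    new_c = np.empty_like(model_colors)
    for i in range(n):                                     # Eq. (7): one buffer slot per iteration
        # Pair above gets pushed down into slot i when the new weight outranks it.
        shift = alpha_t >= dec_w[i - 1] if i > 0 else np.zeros(alpha_t.shape, dtype=bool)
        keep = dec_w[i] > alpha_t                          # slot keeps its own (decayed) pair
        new_w[i] = np.where(shift, dec_w[i - 1] if i > 0 else 0.0,
                            np.where(keep, dec_w[i], alpha_t))
        new_c[i] = np.where(shift[..., None], dec_c[i - 1] if i > 0 else 0.0,
                            np.where(keep[..., None], dec_c[i], frame))
    return new_c, new_w
```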

3 Experimental Results

The proposed approach is tested on the following four datasets, displayed in Fig. 1:

Trees:¹ This widely used sequence contains a person walking in front of a waving tree [9].
Lights: The lights in the scene are turned on and off during the capture [2].
Jug:² The background of this scene is rippling water and the foreground is a floating jug [10].
Railway:³ A strong breeze causes the camera to jitter [8].

The single set of parameters specified in this paper is used for all four datasets. The results (Fig. 3) validate that our approach can handle challenging conditions such as illumination changes, dynamic backgrounds, and camera jitter. For example, while the pixel-based online learning approach [2] adapts quickly to illumination changes and produces a reasonably good result for the "Lights" dataset, it yields noisy masks for the remaining three sequences (Fig. 2). Our approach effectively suppresses these isolated noises through graph cuts and generates clean and detailed foreground masks. Our results are also comparable to the ones reported for the same datasets by existing state-of-the-art algorithms [9, 10, 8, 3]. However, our approach has two important advantages: it does not require any additional training data, and it offers real-time online processing.

The speed of our approach is measured on a Lenovo T60 laptop with an Intel Centrino T2500 CPU and an ATI Mobility Radeon X1400 GPU. For image sequences at 320×240 resolution, the current implementation achieves 16 FPS. This suggests that a portable and affordable real-time foreground detection system can be built using the proposed approach.

¹ Downloaded from http://research.microsoft.com/~jckrumm/wallflower/testimages.htm
² Downloaded from http://www.cs.bu.edu/groups/ivc/data.php
³ Downloaded from http://www.cs.cmu.edu/~yaser/newbackgroundsubtraction.htm

4 Conclusions

A foreground segmentation approach is presented in this paper, which runs in real time on GPUs. The approach models the background using an online learning method, making it capable of adapting promptly to temporal changes. It also enforces the spatial coherence of the solutions by constructing and finding the minimum cut of an MRF graph. Experiments on a variety of challenging datasets show that our approach produces cleaner foreground masks than the pixel-based online learning method [2] does, and is comparable to existing non-real-time algorithms [9, 10, 8, 3] as well.

Acknowledgement

The authors thank Mr. G. Dalley, Dr. J. Krumm, Dr. Y. Sheikh and Dr. S. Sclaroff for sharing their datasets. This work is partially supported by NSERC.

References

[1] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11):1222–1239, 2001.
[2] L. Cheng, S. Wang, D. Schuurmans, T. Caelli, and S. V. N. Vishwanathan. An online discriminative approach to background subtraction. In AVSS, 2006.
[3] G. Dalley, J. Migdal, and W. E. L. Grimson. Background subtraction for temporally irregular dynamic textures. In WACV, 2008.
[4] N. Dixit, R. Keriven, and N. Paragios. GPU-cuts: Combinatorial optimisation, graphic processing units and adaptive object extraction. Technical report, CERTIS, 2005.
[5] A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. In STOC, 1986.
[6] J. Migdal and W. E. L. Grimson. Background subtraction using Markov thresholds. In WMVC, pages 58–65, 2005.
[7] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh. Background modeling and subtraction of dynamic scenes. In ICCV, 2003.
[8] Y. Sheikh and M. Shah. Bayesian object detection in dynamic scenes. In CVPR, 2005.
[9] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In ICCV, 1999.
[10] J. Zhong and S. Sclaroff. Segmenting foreground objects from a dynamic textured background via a robust Kalman filter. In ICCV, 2003.

Figure 1. The four datasets used, referred to (from left to right) as Trees, Lights, Jug, and Railway. Top row: the first frame in each sequence; Middle row: the testing frame; Bottom row: hand-labeled ground truth. No additional images are used for training. In the Jug and Railway sequences, the foreground objects cover the background in all frames used for testing.

Figure 2. Results of the principled online learning method [2]. Top row: confidence maps; Bottom row: foreground masks obtained through thresholding. For better visibility, higher confidence values are shown with lower intensity. Since spatial coherence is not enforced, the masks are rather noisy.

Figure 3. Results of our approach. Top row: foreground masks obtained; Bottom row: masked original images. The graph cuts optimization effectively removes noise while preserving the detailed shapes of foreground objects, e.g., the pedestrian in the "Railway" dataset.
