Region-Level Motion-Based Foreground Detection with Shadow Removal Using MRFs

Shih-Shinh Huang1, Li-Chen Fu1, and Pei-Yung Hsiao2

1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
[email protected]
2 Department of Electronics Engineering, Chang Gung University, Tao-Yuan, Taiwan, R.O.C.
[email protected]
Abstract. This paper presents a new approach to automatic segmentation of foreground objects with shadow removal from an image sequence by integrating techniques of background subtraction and motion-based foreground segmentation. First, a region-based motion segmentation algorithm is proposed to obtain a set of motion-coherent regions and the correspondence among regions at different time instants. Next, we formulate foreground detection as a graph labeling problem over a region adjacency graph (RAG) within the Markov random fields (MRFs) statistical framework. A background model representing the background scene is built and then used to define a likelihood energy. In addition to the background model, temporal and spatial coherence are maintained by modeling them as a prior energy. Finally, a labeling is obtained by maximizing the posterior probability of the MRFs. Experimental results for several video sequences are provided to demonstrate the effectiveness of the proposed approach.
1 Introduction
In many applications, successfully detecting foreground regions against a static background scene is an important step before high-level processing such as object identification and event understanding. However, in real-world situations, several kinds of environmental variation make foreground detection more difficult. To cope with this, the approach should be immune to these variations, i.e., either invariant or adaptive to them.
1.1 Related Works
Techniques for foreground detection can be grouped into two categories: background subtraction and motion-based foreground segmentation. We give a brief review of both kinds of techniques. In order to adapt to changes, the background is usually represented by a background model and updated over
time. This kind of technique rests on the assumption that the background scene is available and that the camera is subject only to minimal vibration. A simple method is to represent the gray-level or color intensity of each pixel in the image as an independent, uni-modal distribution [1, 2]. In the real world, however, the appearance of a pixel in most video sequences follows a multi-modal distribution, and a mixture of Gaussian distributions is commonly used to model it. To this end, [3] modeled the pixel intensity as a weighted mixture of three Gaussian distributions corresponding to road, vehicle, and shadow, respectively. The work in [4] models each pixel as a mixture of K Gaussian distributions, where K depends on the available memory. However, not all distributions are in Gaussian form [5]. In [5], a non-parametric background model based on kernel density estimation was proposed to handle situations where the background scene is not static but contains only small motions. These approaches represent the background scene by a set of independent per-pixel models without taking any semantic information into consideration, which makes false detection likely when changes or noise occur; this is where more sophisticated modeling or updating strategies are applied.

The technique of motion-based foreground segmentation is based on the idea that the appearance of foreground objects is always accompanied by motion. In general, such a technique consists of two steps, motion segmentation and region classification. The aim of motion segmentation is to divide an image into a set of regions with coherent motion, whereas region classification assigns a label, foreground or background, to each segmented region. To provide a semantically meaningful description of video, Wang and Adelson [6] employed a k-means clustering algorithm in the affine parameter space to find a small number of motion classes; each flow vector is then assigned to one of the resulting classes. Borshukov and Bozdagi [7] later improved Wang and Adelson's algorithm with a merging, multi-stage approach to perform motion segmentation more robustly. The aforementioned approaches suffer from inaccurate segmentation due to inexact motion estimation near object boundaries. To overcome this problem, color information is introduced to obtain more accurate segmentation. In [8, 9], an initial segmentation is produced by color segmentation, and regions are then merged on the basis of temporal or spatial similarity.
1.2 System Overview
In this paper, we integrate these two kinds of approaches to perform foreground detection in a more effective manner. Figure 1 shows the block diagram of the proposed algorithm. The main idea is to regard the background model as a source of knowledge for classification, while motion-based segmentation generates a set of regions to be classified at the semantic level. After segmentation, the MRFs statistical framework is introduced to formulate foreground detection as a labeling problem. Optimization over the MRFs model is then performed, i.e., a posterior probability is maximized, to obtain
a classification result. Regions that have the same classification label and similar colors are then merged to derive a more meaningful segmentation. Finally, the background model and the resulting region map are updated accordingly.
Fig. 1. The block diagram of the proposed algorithm (region-based motion segmentation of I(t−1) and I(t), MRFs classification, region merging, region updating, and background updating, producing the foreground mask FM(t))
The remainder of this paper is organized as follows. In Section 2, we introduce the region-based motion segmentation algorithm used to obtain a set of motion-coherent regions. Section 3 addresses the problems of background modeling and updating. The classification process based on the MRFs statistical framework is described in Section 4. In Section 5, we demonstrate the effectiveness of the developed approach with experimental results. Finally, we conclude the paper in Section 6 with some relevant discussion.
2 Region-Based Motion Segmentation
First, Horn and Schunck's method [10] is used to estimate a dense optical flow field describing the motion vector (u(x, y), v(x, y)) of every pixel (x, y) between two consecutive image frames I(t−1) and I(t). The segmented regions of the previous frame, I(t−1), are then projected onto the current frame, I(t). Regions with coherent motion are extracted as initial motion markers, and pixels not ascribed to any region are labeled as uncertain. Finally, a watershed algorithm [11] based on motion and color is used to join uncertain pixels to the nearest similar marker.
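For concreteness, the following is a minimal Horn–Schunck sketch in Python/NumPy showing how the dense flow (u(x, y), v(x, y)) can be estimated from two gray-level frames. The smoothness weight, iteration count, and the simple wrap-around border handling are illustrative choices, not taken from [10] or from this paper.

```python
import numpy as np

# Minimal Horn-Schunck sketch for the dense flow (u, v) between two
# gray-level frames I(t-1) and I(t). Illustrative parameters only.
def horn_schunck(prev, cur, alpha=10.0, iters=100):
    prev = prev.astype(float)
    cur = cur.astype(float)
    # Simple finite-difference estimates of the spatio-temporal gradients.
    Ix = np.gradient(cur, axis=1)
    Iy = np.gradient(cur, axis=0)
    It = cur - prev
    u = np.zeros_like(cur)
    v = np.zeros_like(cur)
    # Local average of the flow (wrap-around borders for brevity).
    avg = lambda f: (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                     np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0
    for _ in range(iters):
        u_avg, v_avg = avg(u), avg(v)
        common = (Ix * u_avg + Iy * v_avg + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_avg - Ix * common
        v = v_avg - Iy * common
    return u, v
```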
2.1 Region Projection
Because Horn and Schunck's method estimates motion inaccurately near region boundaries, a parametric affine motion model is adopted to represent the motion of each region. Let the affine motion model Ai represent the motion of a
region Ri. It is a six-parameter model that yields the parametric motion vector Ai(x, y; Ri) of any pixel (x, y) ∈ Ri. Specifically, Ai can be expressed as

Ai(x, y; Ri) = [Ui(x, y), Vi(x, y)].

Given the affine motion model, any pixel (x, y) ∈ Ri in the previous frame, I(t−1), is projected to the location (x′, y′), where (x′, y′) = (x + Ui(x, y), y + Vi(x, y)). To account for quantization error, we define the projection error ep(x, y) of the pixel (x, y) as

ep(x, y) = min_{(i,j) ∈ N4(x′, y′) ∪ {(x′, y′)}} |I(x, y; t−1) − I(i, j; t)|,    (1)

where N4(x′, y′) is the set of the four connected pixels surrounding (x′, y′), and |I(x, y; t−1) − I(i, j; t)| is the Euclidean distance between the RGB color vectors of the two pixels (x, y) and (i, j) at the two time instants. If ep(x, y) is less than a given threshold Thp, then (x′, y′) is assigned the same region label as (x, y). Otherwise, the pixel is labeled as uncertain to indicate that the projection from the previous frame to the current one has failed.
2.2 Motion Marker Extraction
The output of this step is a set of motion-coherent regions, that is, regions whose pixels all comply with a single motion model. Each such region is referred to as a motion marker. By growing from these markers, we can eventually obtain a segmentation. Motion markers are derived in two ways. First, the regions projected from the previous frame are one kind of motion marker, because each of them arises from an affine motion model. In addition, regions resulting from newly introduced objects may form another kind of motion marker; to handle this situation, a method similar to [12] is used to extract them. Next, an affine motion model Ai describing the motion of each region Ri is estimated by the least-squares method in [13]. We then exclude a pixel (x, y) from Ri if its motion error em(x, y) with respect to Ai is larger than a predefined threshold Thm, where the motion error is defined as

em(x, y) = |(u(x, y), v(x, y)) − Ai(x, y; Ri)|.    (2)

After this exclusion, the region is retained as a motion marker if its area is above a threshold. The set of motion markers is denoted as M = {Mi | i = 1, 2, ..., m}, where m is the number of motion markers; each motion marker Mi stands for a segmented region.
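The least-squares fit of the affine model and the motion error of Eq. (2) can be sketched as follows, again assuming the same six-parameter form; this is an illustrative reconstruction, not the implementation of [13].

```python
import numpy as np

# Least-squares fit of an affine motion model to the dense flow inside
# one region (Sec. 2.2). xs, ys, us, vs are 1-D arrays of pixel
# coordinates and their flow components.
def fit_affine(xs, ys, us, vs):
    """Fits U = a1 + a2*x + a3*y and V = a4 + a5*x + a6*y."""
    A = np.column_stack([np.ones_like(xs, dtype=float), xs, ys])
    pu, *_ = np.linalg.lstsq(A, us, rcond=None)
    pv, *_ = np.linalg.lstsq(A, vs, rcond=None)
    return np.concatenate([pu, pv])          # (a1, ..., a6)

def motion_error(params, x, y, u, v):
    """Eq. (2): distance between the observed flow and the affine prediction."""
    a1, a2, a3, a4, a5, a6 = params
    pred = np.array([a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y])
    return float(np.linalg.norm(np.array([u, v]) - pred))

# Pixels with motion_error > Th_m are excluded; the remaining region is
# kept as a motion marker only if its area exceeds an area threshold.
```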
2.3 Boundary Determination
After motion marker extraction, the number of regions to be segmented is known. However, a large number of pixels have not yet been assigned to any region; these uncertain pixels lie mainly around the contours of the regions. Through the use of the watershed algorithm [11], uncertain pixels are merged into one of the markers. Finally, we obtain a set of segmented regions with coherent motion.
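As a rough illustration of the marker-growing idea (not of the exact watershed of [11]), the following sketch absorbs uncertain pixels into markers through a priority queue keyed on a color-distance cost; the actual criterion also takes motion similarity into account.

```python
import heapq
import numpy as np

# Simplified marker-growing sketch in the spirit of the watershed of [11].
def grow_markers(image, labels):
    """image: (H, W, 3) array; labels: (H, W) int array with values > 0
    for marker pixels and 0 for uncertain pixels."""
    h, w = labels.shape
    # Mean color of each marker, used as its similarity reference.
    means = {l: image[labels == l].astype(float).mean(axis=0)
             for l in np.unique(labels) if l > 0}
    out = labels.copy()
    heap = [(0.0, y, x, int(out[y, x]))
            for y in range(h) for x in range(w) if out[y, x] > 0]
    heapq.heapify(heap)
    while heap:
        _, y, x, lab = heapq.heappop(heap)
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and out[ny, nx] == 0:
                cost = float(np.linalg.norm(image[ny, nx].astype(float)
                                            - means[lab]))
                out[ny, nx] = lab
                heapq.heappush(heap, (cost, ny, nx, lab))
    return out
```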
3 Background Modeling
In this paper, we use the method proposed in [4] to model and update the background scene. A brief description of Stauffer and Grimson's work is given first; we then introduce the Bhattacharyya distance as a measure of the difference between a region produced by the region-based motion segmentation and the corresponding region represented by the background model, which allows the shadow effect to be removed.
3.1 Adaptive Gaussian Mixture Models
The probability of a pixel having observation o(t) at time instant t is expressed as

P(o(t)) = Σ_{i=1}^{K} wi(t) η(o(t); µi(t), Σi(t)),    (3)

where wi(t) is the weight of the i-th Gaussian distribution at time t, µi(t) and Σi(t) are the mean vector and covariance matrix of the i-th Gaussian distribution at time t, and η(o; µ, Σ) is the Gaussian (normal) probability density function.
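A direct evaluation of Eq. (3) can be written as the sketch below; the on-line weight and parameter updates of Stauffer and Grimson [4] are omitted.

```python
import numpy as np

# Evaluation of Eq. (3): probability of an observation o under a
# K-component Gaussian mixture with full covariance matrices.
def gaussian(o, mu, cov):
    d = len(mu)
    diff = o - mu
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_probability(o, weights, means, covs):
    """P(o) = sum_i w_i * eta(o; mu_i, Sigma_i)."""
    return sum(w * gaussian(o, mu, cov)
               for w, mu, cov in zip(weights, means, covs))
```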
3.2 Bhattacharyya Distance
We now describe how to measure the similarity between a segmented region and the corresponding region represented by the background model. Let Rs be a region obtained from the region-based motion segmentation process, and let o(x, y) denote the color observation of a pixel p(x, y) belonging to Rs. The color of p(x, y) as represented by the background model is defined as the mean vector of the Gaussian distribution that has the minimum Mahalanobis distance [14] from o(x, y). Let Rb denote the region represented by the background model in this way; the colors of the regions Rs and Rb are both assumed to follow Gaussian distributions. Suppose µs and Σs are the mean vector and covariance matrix of Rs, and similarly µb and Σb those of Rb. The distance between Rs and Rb can be related to the probability of classification error in statistical hypothesis testing, which naturally leads to the Bhattacharyya distance [14, 15]. The Bhattacharyya distance, dbhat(·, ·), is formally defined as

dbhat(Rs, Rb) = (1/8)(µs − µb)^T [(Σs + Σb)/2]^{−1} (µs − µb) + (1/2) ln( |(Σs + Σb)/2| / √(|Σs||Σb|) ).    (4)
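Eq. (4) translates directly into code; the sketch below assumes the mean vectors and covariance matrices of the two region models are given as NumPy arrays.

```python
import numpy as np

# Eq. (4): Bhattacharyya distance between two Gaussian region models
# (mu_s, Sigma_s) and (mu_b, Sigma_b).
def bhattacharyya(mu_s, cov_s, mu_b, cov_b):
    cov = (cov_s + cov_b) / 2.0
    diff = mu_s - mu_b
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov_s) * np.linalg.det(cov_b)))
    return term1 + term2
```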
However, the region similarity defined in this way will lead to mis-classification of background regions in which direct light is blocked by a foreground object. A region of this kind is referred to as shadow. According to [16], the intensity of a pixel in shadow is scaled down by a factor λ with λf ≤ λ ≤ 1, where λf is a constant.
If the actual color vector of a pixel is v = (r, g, b), it becomes v′ = (r′, g′, b′) after being covered by shadow. In the ideal case, v′ = λv. Due to light fluctuation and noise, this ideal situation hardly ever holds exactly. Thus, the scaling factor is defined to be the λ* that minimizes f(λ) = (r′ − λr)² + (g′ − λg)² + (b′ − λb)². By differentiating f(λ) with respect to λ, we obtain λ* as follows:

df(λ*)/dλ = 0  ⇒  λ*(r² + g² + b²) = rr′ + gg′ + bb′  ⇒  λ* = (rr′ + gg′ + bb′) / (r² + g² + b²).    (5)
To obtain a region-similarity measure that is invariant to the shadow effect, every pixel in the current image would ideally be scaled before comparison, but this is impractical because of the computational cost. Instead, we use only the mean vectors of Rs and Rb to evaluate λ*, and then scale the distribution (µs, Σs) of Rs to (λ*µs, (λ*)²Σs).
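A small sketch of this shadow compensation follows, under the assumption that the background mean plays the role of the unshadowed color v in Eq. (5) and the segmented-region mean that of v′; the numeric values in the usage lines are hypothetical.

```python
import numpy as np

# Region-level shadow compensation (Sec. 3.2): lambda* is computed from
# the mean colors of R_s and R_b, clamped to the assumed shadow range
# [lambda_f, 1], and used to scale (mu_s, Sigma_s) before Eq. (4).
def shadow_scale(mu_s, mu_b, lambda_f=0.7):
    # lambda* = (rr' + gg' + bb') / (r^2 + g^2 + b^2), with v = mu_b, v' = mu_s.
    lam = float(mu_b @ mu_s) / float(mu_b @ mu_b)
    return min(max(lam, lambda_f), 1.0)

def shadow_compensate(mu_s, cov_s, mu_b, lambda_f=0.7):
    lam = shadow_scale(mu_s, mu_b, lambda_f)
    # Scaled distribution of R_s, to be compared with (mu_b, Sigma_b) via Eq. (4).
    return lam * mu_s, (lam ** 2) * cov_s

# Hypothetical usage with made-up region means:
mu_s = np.array([90.0, 85.0, 80.0])      # shadowed segmented region
mu_b = np.array([130.0, 120.0, 115.0])   # background model region
print(shadow_scale(mu_s, mu_b))
```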
4 MRFs-Based Classification
Next, we describe how to incorporate the background model to classify every region as either a foreground or a background region using Markov random fields (MRFs). The formal statistical framework of MRFs can be found in [17]. First, a graph called a region adjacency graph (RAG) is used to represent the set of segmented regions. Let G = (S, E) be an RAG, where S = {s1, s2, ..., sn} is the set of nodes in the graph, each node si corresponding to a region Ri, and E is the set of edges, with (si, sj) ∈ E if Ri and Rj are neighboring regions. We now describe how U(O|ω) and U(ω) are defined so as to incorporate the background model as well as temporal and spatial coherence under the MRFs framework. The terms U(O|ω) and U(ω) are the likelihood and prior energies over all sites, respectively; L = {ω1, ω2, ..., ωm} is the set of labels, which for the foreground detection problem is L = {F, B}, with F and B standing for foreground and background; and O = {o1, o2, ..., on} is the set of observations associated with the sites. The term U(oi | si = ωi) represents the likelihood energy of site si being classified with label ωi. Two functions, f^F_likelihood(·) and f^B_likelihood(·), are defined as depicted in Figure 2(a) to evaluate U(oi | ωi = F) and U(oi | ωi = B), where Ulikelihood and Thlikelihood in Figure 2(a) are two constants. Based on the background model, the distance used to measure the likelihood energy is dbhat(Ri, Rb(i)), where Rb(i) is the region represented by the background model as described in Section 3. The prior energy is composed of a singleton energy, U1(·), and a pairwise energy, U2(·, ·). The term U1(·) is related to temporal coherence and is defined as

U1(ωi) = −rB dbhat(Ri(t), Ri(t−1)) if ωi = B,   U1(ωi) = −rF dbhat(Ri(t), Ri(t−1)) if ωi = F,    (6)
Fig. 2. Energy functions: (a) the two functions f^B_likelihood(·) and f^F_likelihood(·) used to evaluate the likelihood energy U(oi|ωi); (b) the two functions f^neq_spatial(·) and f^eq_spatial(·) used to evaluate the spatial energy U2(·, ·)
where Ri(t−1) is the region at frame I(t−1) corresponding to Ri(t), obtained by using the affine motion model, rB is the ratio of pixels in Ri(t−1) that were classified as background at time instant t−1, and rF = 1 − rB. The purpose of this term is to impose temporal coherence: a region obtained at the current time instant tends to be classified with the same label as its corresponding region at the previous time instant. The term U2(·, ·) is related to spatial coherence, which means that two neighboring regions with similar colors should be assigned the same label. Therefore, U2(ωi, ωj) for two neighboring sites si and sj is evaluated by the two functions f^eq_spatial(·) and f^neq_spatial(·) depicted in Fig. 2(b), which are used in the cases ωi = ωj and ωi ≠ ωj, respectively. The optimization is carried out with the iterated conditional modes (ICM) algorithm to find the most appropriate label assignment for every region. After classification, neighboring regions with the same label are merged and used to update the background model and the region map.
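To make the labeling step concrete, the following sketch runs ICM over the RAG with the three energy terms. The piecewise-linear shapes of the likelihood and spatial functions are assumptions read off Fig. 2, and the unit constants and the input dictionaries are illustrative; only the overall structure follows the text.

```python
# ICM labeling over the RAG (Sec. 4). Hypothetical inputs:
#   neighbors[i]  : list of RAG neighbors of region i
#   d_bg[i]       : d_bhat between region i and its background region R_b(i)
#   d_prev[i]     : d_bhat between R_i(t) and its corresponding R_i(t-1)
#   r_b[i]        : fraction of R_i(t-1) labeled background at t-1
#   d_color[i][j] : color distance between neighboring regions i and j

def likelihood_energy(d, label, U_lik=1.0, Th_lik=1.0):
    ramp = min(d / Th_lik, 1.0) * U_lik          # assumed shape of Fig. 2(a)
    return ramp if label == "B" else U_lik - ramp

def temporal_prior(d_prev, label, r_b):          # Eq. (6)
    return -(r_b if label == "B" else 1.0 - r_b) * d_prev

def spatial_prior(d_color, same_label, U_sp=1.0, Th_sp=1.0):
    ramp = min(d_color / Th_sp, 1.0) * U_sp      # assumed shape of Fig. 2(b)
    return ramp if same_label else U_sp - ramp

def icm(neighbors, d_bg, d_prev, r_b, d_color, iters=10):
    labels = {i: "B" for i in neighbors}         # initial labeling
    for _ in range(iters):
        changed = False
        for i in neighbors:
            best_label, best_e = labels[i], float("inf")
            for cand in ("B", "F"):
                e = likelihood_energy(d_bg[i], cand)
                e += temporal_prior(d_prev[i], cand, r_b[i])
                e += sum(spatial_prior(d_color[i][j], cand == labels[j])
                         for j in neighbors[i])
                if e < best_e:
                    best_label, best_e = cand, e
            if best_label != labels[i]:
                labels[i], changed = best_label, True
        if not changed:
            break
    return labels
```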
5 Experiment
In this section, one standard MPEG-4 test sequence and two image sequences captured in the intelligent home (e-home) demo room of the Intelligent Robotics Laboratory at National Taiwan University are used to validate the proposed method. Additionally, we compare our algorithm with the one proposed in [18], which extracts foreground objects for subsequent human identification. Figure 3(a) shows original frames 15, 25, 50, and 75 of the Hall Monitoring sequence. The images in Fig. 3(b) and Fig. 3(c) show the detection results of Wang's approach and of ours, respectively. Frame 15 demonstrates that our algorithm can automatically detect newly introduced objects. The second case is an image sequence exhibiting gradual illumination variation and local motion. When a person enters, the background gradually
Fig. 3. Hall Monitoring sequence: (a) frames 15, 25, and 50 of the hall monitoring sequence; (b) detection results of Wang's approach; (c) detection results of our proposed approach
brightens. This is because radiance from the fluorescent lamp is reflected back into the background scene. When leaving the scene, the person waves the curtain to make it flutter. Some false positives produced by Wang's algorithm under gradual illumination variation and local motion are shown in Fig. 4(b). As shown in Fig. 4(c), the detection results of our approach are more robust in such situations.
Fig. 4. Gradual illumination variation in the e-home demo room: (a) original images; (b) detection results of Wang's approach; (c) detection results of our approach
Fig. 5. Shadow effect elimination: (a) original images; (b) foreground detection results of our approach
The final case features two persons entering the scene one after the other and crossing each other near the top of the image. Our proposed method eliminates most of the shadow effect by applying the aforementioned scaling factor λ* before evaluating the Bhattacharyya distance. Empirically, λf is set to 0.7 in this paper.
6 Conclusion
In this paper, we have performed foreground detection at the region level, which means that contextual information is taken into consideration. The MRFs statistical framework fuses cues from the background model with prior knowledge, including temporal and spatial coherence, to detect foreground objects in a more accurate and elegant way. Experimental results demonstrate that the proposed method can successfully extract foreground objects even under illumination variation, shadow, and local motion.
References

1. Gupte, S., Masoud, O., Martin, R.F.K., Papanikolopoulos, N.P.: Detection and Classification of Vehicles. IEEE Transactions on Intelligent Transportation Systems 3 (2002) 37–47
2. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Real-Time Surveillance of People and Their Activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 809–830
3. Friedman, N., Russell, S.: Image Segmentation in Video Sequences: A Probabilistic Approach. International Conference on Uncertainty in Artificial Intelligence (1997)
4. Stauffer, C., Grimson, W.: Adaptive Background Mixture Models for Real-Time Tracking. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2 (1999)
5. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance. Proceedings of the IEEE 90 (2002) 1151–1162
6. Wang, J.Y.A., Adelson, E.H.: Spatio-Temporal Segmentation of Video Data. Proceedings of the SPIE: Image and Video Processing (1994)
7. Borshukov, G.D., Bozdagi, G.: Motion Segmentation by Multistage Affine Classification. IEEE Transactions on Image Processing 6 (1997) 1591–1594
8. Tsaig, Y., Averbuch, A.: Automatic Segmentation of Moving Objects in Video Sequences: A Region Labeling Approach. IEEE Transactions on Circuits and Systems for Video Technology 12 (2002) 597–612
9. Altunbasak, Y., Eren, P.E., Tekalp, A.M.: Region-Based Parametric Motion Segmentation Using Color Information. Graphical Models and Image Processing: GMIP 60 (1998) 13–23
10. Horn, B.K.P., Schunck, B.G.: Determining Optical Flow. AI Memo 572, Massachusetts Institute of Technology (1980)
11. Salembier, P., Pardas, M.: Hierarchical Morphological Segmentation for Image Sequence Coding. IEEE Transactions on Image Processing 3 (1994) 639–651
12. Choi, J.G., Lee, S.W., Kim, S.D.: Spatio-Temporal Video Segmentation Using a Joint Similarity Measure. IEEE Transactions on Circuits and Systems for Video Technology 7 (1997) 279–286
13. Wang, J.Y.A., Adelson, E.H.: Representing Moving Images with Layers. IEEE Transactions on Image Processing 3 (1994) 625–638
14. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley Interscience (2000)
15. Mak, B., Barnard, E.: Phone Clustering Using the Bhattacharyya Distance. Fourth International Conference on Spoken Language Processing (ICSLP) 4 (1996) 2005–2008
16. Elgammal, A., Harwood, D., Davis, L.S.: Non-parametric Model for Background Subtraction. IEEE International Conference on Computer Vision Frame-Rate Workshop (1999)
17. Li, S.Z.: Markov Random Field Modeling in Computer Vision. Proceedings of European Conference on Computer Vision (1994)
18. Wang, L., Tan, T., Ning, H., Hu, W.: Silhouette Analysis-Based Gait Recognition for Human Identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 1505–1518