Robust Foreground Segmentation from Color Video Sequences Using Background Subtraction with Multiple Thresholds Hansung Kim†, Ryuuki Sakamoto†, Itaru Kitahara†‡, Tomoji Toriyama†, and Kiyoshi Kogure† †Knowledge Science Lab, ATR, Keihanna Science City, Kyoto, 619-0288, Japan ‡Dept. of Intelligent Interaction Technologies, Univ. of Tsukuba Tsukuba Science City, Ibaraki, 305-8573, Japan E-mail: †{hskim, skmt, toriyama, kogure}@atr.jp, ‡
[email protected]
Abstract  A new robust method to segment foreground regions from color video sequences using multiple thresholds and morphological processes is proposed. Background models are observed over a long period, and their mean and standard deviation are used for background subtraction. Shadow regions are eliminated using color components, and the final foreground silhouette is extracted by smoothing the boundaries of the foreground and eliminating errors inside and outside the regions. Experimental results show that the proposed algorithm works very well in various background and foreground situations.
Key words  Foreground segmentation, Background subtraction, Color model, Shadow elimination
1. Introduction
Object segmentation from a video sequence is an important problem in the image processing field, with applications in video surveillance, teleconferencing, video editing, human-computer interfaces, and so on. Conventional object segmentation algorithms are roughly classified into two categories based on their primary segmentation criterion. The first group of approaches uses spatial homogeneity as a criterion. Morphological filters are used to simplify the image, and then a watershed algorithm is applied to decide the region boundaries [1][2]. The segmentation results of these algorithms tend to track the object boundary more precisely than other methods because of the watershed step, but their main drawback is high computational complexity. The second group exploits change detection in video sequences. Some algorithms have used frame difference [3], but the most common approach is background subtraction [4-6], which subtracts a static background image, acquired in advance from multiple images over a period of time, from the current image. Since this technique works very fast and distinguishes semantic object regions from static backgrounds, it has been used for years in many vision systems. As an example of such a system, we have already developed an immersive free-viewpoint video system using multiple cameras [7]. The system reconstructs 3D models from captured video streams using a shape-from-silhouette method and generates realistic free-view video of those objects from a virtual camera.
The shape-from-silhouette method is a very common way of converting silhouette contours into 3D objects [8-10], but the segmentation process creates a bottleneck. Although background subtraction methods provide fairly good results, they still have limitations in practical applications. Most conventional approaches rely on the brightness or color information of the images, which causes errors that depend on the lighting conditions and the colors of the foreground objects. In addition, object shadows cast on the background and highlights caused by lighting can cause serious trouble. In this paper, a robust foreground segmentation algorithm is proposed. It is assumed that the background is static and that the lighting conditions do not change drastically. We classify the segmentation mask into four categories based on reliability and refine them with color information and morphological processes. In the next two sections, we describe the overall flow and the detailed algorithms of the proposed method. Experimental results are shown in Section 4, and we conclude in Section 5.
2. Background model
The background is modeled in two distinct parts: a luminance model and a color model. The input video stream has three RGB channels, but these are very sensitive to noise and changes in lighting conditions. Therefore, we use the luminance component of the color images for initial object segmentation. Image luminance is calculated with the following equation [11]:
\[ Y = 0.299R + 0.587G + 0.114B \qquad (1) \]
However, the luminance component changes drastically due to shadows of objects in the background regions and reflections of lighting in the foreground regions. We therefore construct a second background model from the color component of the image to remove such false segmentation. The color component H is extracted from the HSI model as follows [11]:
\[ I = \max(R, G, B) \]
\[ S = \begin{cases} \bigl(I - \min(R, G, B)\bigr)/I & \text{if } I \neq 0 \\ 0 & \text{otherwise} \end{cases} \]
\[ H = \begin{cases} (G - B) \times 60 / S & \text{if } I = R \\ 180 + (B - R) \times 60 / S & \text{if } I = G \\ 240 + (R - G) \times 60 / S & \text{if } I = B \end{cases} \qquad (2) \]
\[ \text{if } H < 0 \text{ then } H = H + 360. \]
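To make Eqs. (1) and (2) concrete, the following NumPy sketch computes the luminance and color components for one RGB frame. It is our own illustration, not the paper's code; the function name and the handling of gray pixels are assumptions, and the hue expression is a direct transcription of Eq. (2).

```python
import numpy as np

def luminance_and_hue(frame):
    """Compute Y (Eq. 1) and the H and S components of Eq. (2) for an RGB frame.

    frame: height x width x 3 float array with channel order R, G, B.
    Returns (Y, H, S); H is in degrees and wrapped to be non-negative.
    """
    R, G, B = frame[..., 0], frame[..., 1], frame[..., 2]

    # Eq. (1): luminance
    Y = 0.299 * R + 0.587 * G + 0.114 * B

    # Eq. (2): intensity I and saturation S
    I = np.maximum(np.maximum(R, G), B)
    mn = np.minimum(np.minimum(R, G), B)
    S = np.where(I != 0, (I - mn) / np.where(I == 0, 1, I), 0.0)

    # Eq. (2): hue H, selected by which channel attains the maximum
    S_safe = np.where(S == 0, 1.0, S)        # avoid division by zero for gray pixels
    H = np.where(I == R, (G - B) * 60.0 / S_safe,
        np.where(I == G, 180.0 + (B - R) * 60.0 / S_safe,
                          240.0 + (R - G) * 60.0 / S_safe))
    H = np.where(H < 0, H + 360.0, H)
    H = np.where(S == 0, 0.0, H)             # hue is undefined for achromatic pixels
    return Y, H, S
```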
Fig. 2 Block diagram of the proposed algorithm: image acquisition → (1) background subtraction → (2) shadow elimination → (3) object labeling → (4) silhouette extraction → (5) recovering inside holes → final foreground mask, driven by the luminance component, the color component, and the background model.
Even in a static scene, changes between frames occur due to noise, material characteristics, and varying illumination. Fig. 1 shows the distribution of the deviation from the mean of each pixel, calculated over 150 successive frames. The noise can be described as a zero-mean Laplacian distribution, which means that if the camera observes the same scene for a certain period and each pixel is averaged over that period, the noise can be canceled or at least suppressed. Therefore, we construct a background model from the mean and standard deviation of successive frames.
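Such a background model can be accumulated over N successive frames (N = 150 in the experiments); a minimal sketch, assuming the luminance_and_hue() helper above and ignoring the circular wrap-around of hue:

```python
import numpy as np

def build_background_model(frames):
    """Per-pixel mean and standard deviation of luminance and hue over
    a sequence of background-only RGB frames (N = 150 in the paper)."""
    Ys, Hs = [], []
    for frame in frames:
        Y, H, _ = luminance_and_hue(frame)
        Ys.append(Y)
        Hs.append(H)
    Ys = np.stack(Ys)                    # shape: (N, height, width)
    Hs = np.stack(Hs)
    return {
        "L_B": Ys.mean(axis=0),          # mean luminance background
        "sigma": Ys.std(axis=0),         # per-pixel luminance deviation
        "H_B": Hs.mean(axis=0),          # mean hue background
        "sigma_H": Hs.std(axis=0),       # per-pixel hue deviation
    }
```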
Fig. 1 Noise distribution in background: (a) luminance component, (b) color component (histograms of the number of pixels versus deviation)
Fig. 3 Generated background model: (a) luminance of Cams 5 and 6, (b) color component of Cams 5 and 6
3. Foreground segmentation
Figure 2 shows a block diagram of our proposed foreground segmentation algorithm. First, an initial foreground mask is constructed by background differencing with multiple thresholds; the initial foreground region is divided into four categories according to reliability. Then shadow regions are eliminated using the color component, and each object is labeled with its own identification number. In the fourth step, to smooth the boundaries of the foreground and to eliminate holes inside the regions, we apply a silhouette extraction technique to each object. However, the silhouette extraction technique covers all regions inside the object silhouette, including real holes in the object. Finally, we recover real holes in the object using a region growing technique and generate the final foreground masks.
Fig. 4 Test images: (a) Cam 5, frame 25; (b) Cam 6, frame 1067
We describe the detailed algorithms and illustrate the results of each step in the following subsections. Fig. 3 shows the background models generated for two cameras of our multiple-camera system [7]. In Fig. 3(b), the color component, defined over the range 0 ~ 360°, is mapped to gray-scale values from 0 to 255. Fig. 4 shows the test images used in this experiment.
3.1 Initial Background Generation (Steps (1) & (2))
Initial background subtraction is performed by subtracting the luminance component of the background model from that of the current frame. Fig. 5 shows the subtracted luminance components of the test images. As we can see, ambiguous regions such as shadows and foreground regions whose brightness is similar to the background cannot be determined with a single fixed threshold. Therefore, we classify the initial object region into four categories using multiple thresholds based on reliability, as in Eq. (3). L_I and L_B indicate the luminance components of the current frame and the background model, respectively, and σ is the standard deviation of the background model.
\[ BD(p) = \bigl| L_I(p) - L_B(p) \bigr| \qquad (3) \]
\[ \begin{cases} BD < K_1\sigma(p) & \Rightarrow \text{(a) Reliable background} \\ K_1\sigma(p) \le BD < K_2\sigma(p) & \Rightarrow \text{(b) Suspicious background} \\ K_2\sigma(p) \le BD < K_3\sigma(p) & \Rightarrow \text{(c) Suspicious foreground} \\ K_3\sigma(p) \le BD & \Rightarrow \text{(d) Reliable foreground} \end{cases} \]
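A minimal sketch of the four-way classification in Eq. (3), using the per-pixel standard deviation of the background model as the threshold base; the category codes and the background-model dictionary keys follow the sketches above and are our own naming, not the paper's:

```python
import numpy as np

# Category codes for the initial mask (our own convention)
RELIABLE_BG, SUSPICIOUS_BG, SUSPICIOUS_FG, RELIABLE_FG = 0, 1, 2, 3

def classify_pixels(Y, model, K1=5.0, K2=12.0, K3=60.0):
    """Classify each pixel into the four reliability categories of Eq. (3)."""
    BD = np.abs(Y - model["L_B"])               # background difference
    sigma = np.maximum(model["sigma"], 1e-3)    # guard against zero deviation
    mask = np.full(BD.shape, RELIABLE_BG, dtype=np.uint8)
    mask[BD >= K1 * sigma] = SUSPICIOUS_BG
    mask[BD >= K2 * sigma] = SUSPICIOUS_FG
    mask[BD >= K3 * sigma] = RELIABLE_FG
    return mask
```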
If the distribution of the background difference were Gaussian, the threshold could be determined from the standard deviation of the difference to achieve a desired detection rate. However, the distribution of the difference is not Gaussian; it has a very sharp peak at zero [6]. Thus, we use the standard deviation of the background model to select the thresholds. The deviation of each pixel differs according to the material characteristics and lighting conditions that compose the background, so the standard deviation of each pixel, observed over a long period, can serve as a per-pixel threshold. Figure 6 shows the results of classification using Eq. (3). The weighting factors K1 ~ K3 in Eq. (3) were set to 5, 12, and 60. In the results, the black, dark gray, light gray, and white regions indicate Reliable background, Suspicious background, Suspicious foreground, and Reliable foreground, respectively.
However, we found that a large amount of background on the floor was assimilated into the Suspicious foreground region, caused by object shadows that change the background brightness. We eliminate these shadows from the initial object mask using the color component, because a shadow changes only the brightness of the background, not its color. With Eq. (4), shadows in the Suspicious foreground region are merged into the Suspicious background region. H indicates the color component of the images, and σ_H is the standard deviation of the color component in the background model:
\[ \text{if } \Bigl( p \in \text{region (c)} \;\text{and}\; \bigl| H_I(p) - H_B(p) \bigr| < K_1\sigma_H(p) \Bigr) \text{ then } p \Rightarrow \text{region (b)} \qquad (4) \]
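Eq. (4) then amounts to reclassifying Suspicious foreground pixels whose hue agrees with the background model. A minimal sketch under the same assumptions as the snippets above (plain hue difference, ignoring wrap-around at 0/360°):

```python
import numpy as np

SUSPICIOUS_BG, SUSPICIOUS_FG = 1, 2     # category codes as in the sketch above

def eliminate_shadows(mask, H, model, K1=5.0):
    """Move Suspicious foreground pixels whose hue matches the background
    model (Eq. 4) into the Suspicious background category."""
    hue_diff = np.abs(H - model["H_B"])
    shadow = (mask == SUSPICIOUS_FG) & (hue_diff < K1 * model["sigma_H"])
    out = mask.copy()
    out[shadow] = SUSPICIOUS_BG
    return out
```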
Figure 7 shows the results of test images after the shadow elimination process.
Fig. 5 Results of background subtraction
Fig. 6 Results of initial classification
3.2 Labeling of initial objects (Step (3))
In the labeling step, all foreground regions ((c) and (d) in Eq. (3)) are labeled with their own identification numbers. All connected foreground pixels under 4-neighbor connectivity are assigned the same label using a region growing technique [12]. However, small noise regions also remain in the initial object regions. A conventional way of eliminating noise regions is to filter out small regions with morphological operations, so we refine the initial mask by a closing and an opening process [12]. Then we sort and relabel all labeled regions in descending order of size; in the relabeling process, regions smaller than the threshold THRG are eliminated. Fig. 8 shows the result of the labeling process. In the right image, two people are in the scene, but they are labeled as one object because they are holding hands.
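With SciPy's ndimage module, the morphological refinement, 4-neighbor labeling, and small-region filtering described above can be sketched as follows. This is our own illustration (category codes as in the earlier sketches); the paper's implementation uses its own region-growing routine [12]:

```python
import numpy as np
from scipy import ndimage

def label_objects(mask, min_size=300):
    """Label 4-connected foreground components, clean them with closing and
    opening, and discard regions smaller than min_size (THRG in Table 1)."""
    fg = (mask == SUSPICIOUS_FG) | (mask == RELIABLE_FG)

    # Closing then opening removes small holes and speckle noise
    fg = ndimage.binary_closing(fg, structure=np.ones((3, 3)))
    fg = ndimage.binary_opening(fg, structure=np.ones((3, 3)))

    # 4-neighbor connected-component labeling
    four_conn = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
    labels, n = ndimage.label(fg, structure=four_conn)

    # Sort regions by size (descending), relabel, and drop small regions
    sizes = ndimage.sum(fg, labels, index=np.arange(1, n + 1))
    relabeled = np.zeros_like(labels)
    new_id = 1
    for idx in np.argsort(sizes)[::-1]:
        if sizes[idx] < min_size:
            break
        relabeled[labels == idx + 1] = new_id
        new_id += 1
    return relabeled
```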
Fig. 7 Results of shadow elimination
Fig. 8 Labeling results
3.3 Silhouette extraction (Steps (4) & (5))
In Fig. 8, there are many erroneous regions designated as background or incorrectly labeled inside the object. To smooth the boundaries of the foreground and eliminate holes inside the regions, we propose a silhouette extraction technique that improves on Kumar's profile extraction technique [4]. A weighted one-pixel-thick drape is moved from one side of the image to the opposite side. Adjacent pixels of the drape are connected by elastic springs, so the drape covers the object without infiltrating gaps narrower than a threshold M. This process is performed from all four sides, and the region wrapped by the four drapes is the final foreground region. Fig. 9 shows the silhouette extraction process applied to an initial object. However, the silhouette extraction algorithm has a serious drawback: it can cause large errors in the area between objects when multiple objects exist in the scene. It may also cover real holes inside an object. To avoid the first problem, we apply the profile extraction technique independently to each labeled region. For the second problem, we perform region growing from Reliable background regions inside the silhouette whenever such a region is larger than the small-region threshold THRG. Figs. 10 and 11 show the results of the silhouette extraction technique and the final foreground masks generated by the proposed algorithm, respectively. The concave region in the left image and the holes made by the two men in the right image are recovered by region growing from the Reliable background regions.
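The elastic drape itself is not trivial to reproduce exactly. The sketch below is a much-simplified approximation of the idea that replaces the weighted spring model with a 1-D morphological opening of each sweep profile, so that the drape cannot descend into gaps narrower than M; it is our own approximation, not the authors' algorithm.

```python
import numpy as np
from scipy import ndimage

def silhouette_from_drapes(obj, gap_width=12):
    """Approximate drape-based silhouette extraction for one labeled object.

    obj: boolean mask of a single labeled region.
    A drape profile is computed from each of the four sides; a 1-D grey opening
    keeps the drape from entering gaps narrower than gap_width (M in Table 1).
    The returned silhouette is the area enclosed by the four drapes.
    """
    h, w = obj.shape

    def sweep(m):
        # For each column of m, the position where a drape dropped from the
        # top would stop (full depth if the column contains no foreground)
        depth = np.where(m.any(axis=0), m.argmax(axis=0), m.shape[0])
        # Remove narrow, deep valleys so the drape bridges small gaps
        return ndimage.grey_opening(depth, size=gap_width)

    top = sweep(obj)
    bottom = h - 1 - sweep(obj[::-1, :])
    left = sweep(obj.T)
    right = w - 1 - sweep(obj.T[::-1, :])

    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    inside = (rows >= top[None, :]) & (rows <= bottom[None, :]) \
           & (cols >= left[:, None]) & (cols <= right[:, None])
    return inside
```

A full implementation would also perform the hole-recovery step, growing Reliable background regions larger than THRG back out of the silhouette.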
Fig. 9 Silhouette extraction
Fig. 10 Results of silhouette extraction
Fig. 11 Final foreground mask
4. Experimental results
We applied the proposed algorithm to our multiple-camera system [7], composed of seven synchronized IEEE-1394 cameras oriented toward the center of the space so that they capture almost the same area. The cameras provide 1024×768 RGB streams at 30 frames/sec, and we simulated the algorithm on a PC with a Pentium IV 3.2 GHz CPU, 1.0 GByte of memory, and Visual C++ on a Windows XP operating system. The parameters used in the simulation are listed in Table 1. Most were selected experimentally, but the same set was used for all experiments.
Table 1 Parameters used in simulation
Parameter                                      Value
Number of frames for background modeling       N = 150
Weighting factors of σ for segmentation        K1 = 5, K2 = 12, K3 = 60
Minimum valid region size (pixels)             THRG = 300
Maximum gap width in silhouette extraction     M = 12
Figure 12 shows the segmentation results of the same frames from the other six cameras; the left column shows the captured image, and the right column shows the segmented foreground. The image sequences were captured in a typical indoor environment without any special conditions such as lighting equipment or background manipulation. The outer parts of the images were cropped to show the detailed boundaries of the segmentation results in Fig. 12. The foreground is segmented very well even though the object moves very fast. The images were captured at the same instant, but the cameras pointed in different directions with different backgrounds. Therefore, we conclude that the proposed algorithm works well under various conditions.
Table 2 Objective evaluation (%)
          Frame 25          Frame 1067
          FB      FF        FB      FF
Cam 1     0.08    0.42      0.16    1.19
Cam 2     0.17    0.78      0.20    1.57
Cam 3     0.04    0.57      0.08    0.49
Cam 4     0.02    0.54      0.00    0.51
Cam 5     0.09    0.56      0.05    1.67
Cam 6     0.08    0.83      0.17    1.04
Cam 7     0.05    0.60      0.14    1.47
Avg.      0.08    0.61      0.11    1.13
Table 2 shows an objective evaluation of the proposed algorithm. We compared the segmented results with ground-truth segmentation masks generated by manual segmentation and calculated the percentage of error pixels in each mask. In the table, FB denotes foreground regions falsely classified as background, and FF denotes background regions falsely classified as foreground. In all results, the FF errors are much larger than the FB errors because of blurring at object boundaries caused by fast motion. Generally, FF errors are more disturbing to the eye and less acceptable to many vision systems than FB errors. In all scenes, the average error rates were less than 1.25%, although they undoubtedly depend on the speed and size of the foreground objects.
Table 3 shows a runtime analysis of the proposed system. The times listed are the average processing times when one person moves in the experimental space. Considering the image resolution, the processing speed is quite high, although it cannot yet be used in real-time applications. If we apply this algorithm to sequences with lower resolution such as VGA or QVGA, it will work in real time, since its computational complexity is proportional to the square of the image size.
Table 3 Processing speed analysis
Stage                       Time (msec)
Background subtraction           15
Shadow elimination               47
Object labeling                 125
Silhouette extraction           235
Recovering inside holes         125
Total                           547
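For reference, the FB and FF percentages in Table 2 can be reproduced by comparing a segmented mask against a manually created ground-truth mask. The sketch below is our own helper; normalizing by the total number of pixels in the mask is an assumption, since the paper does not state the denominator explicitly.

```python
import numpy as np

def fb_ff_errors(segmented, ground_truth):
    """FB: foreground pixels falsely classified as background.
    FF: background pixels falsely classified as foreground.
    Both are returned as percentages of all pixels in the mask."""
    seg = segmented.astype(bool)
    gt = ground_truth.astype(bool)
    total = seg.size
    fb = np.count_nonzero(gt & ~seg) / total * 100.0
    ff = np.count_nonzero(~gt & seg) / total * 100.0
    return fb, ff
```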
5. Conclusion
In this paper, we proposed a segmentation algorithm that is robust to variations in lighting conditions and shadows. The background is modeled in advance using brightness and color components, and the foreground is segmented using multiple thresholds and morphological processes. Experimental results show that the proposed algorithm works very well in various background and foreground situations. Future work on this topic will follow two main directions. First, although the proposed algorithm is fast, it does not work in real time with XGA image sequences; real-time performance can be achieved by using hardware accelerators such as a Graphics Processing Unit (GPU) and by further optimizing the implementation. Second, the proposed method assumes that the background is static, so it may suffer from errors when part of the background scene changes. This problem can be addressed by updating the background model during the main process, as Chien proposed [3].
Acknowledgement This research was supported by the National Institute of Information and Communications Technology.
References
[1] J. Fan, G. Fujita, M. Furuie, T. Onoye, I. Shirakawa, and L. Wu, "Automatic moving object extraction toward compact video representation," Opt. Eng., vol. 39, pp. 438-452, 2000.
[2] S. Chien, Y. Huang, and L. Chen, "Predictive watershed: a fast watershed algorithm for video segmentation," IEEE Trans. CSVT, vol. 13, no. 5, pp. 453-461, 2003.
[3] S. Chien, S. Ma, and L. Chen, "Efficient Moving Object Segmentation Algorithm using Background Registration Technique," IEEE Trans. CSVT, vol. 12, no. 7, pp. 577-586, 2002.
[4] P. Kumar, K. Sengupta, and S. Ranganath, "Real time detection and recognition of human profiles using inexpensive desktop cameras," Proc. ICPR, pp. 1096-1099, 2000.
[5] S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler, "Detection and location of people in video images using adaptive fusion of color and edge information," Proc. ICPR, pp. 627-630, 2000.
[6] T. Horprasert, D. Harwood, and L.S. Davis, "A robust background subtraction and shadow detection," Proc. ACCV, 2000.
[7] H. Kim, I. Kitahara, R. Sakamoto, and K. Kogure, "An Immersive Free-Viewpoint Video System Using Multiple Outer/Inner Cameras," Proc. 3DPVT, 2006.
[8] M. Gross, S. Würmlin, M. Naef, E. Lamboray, C. Spagno, A. Kunz, E. Koller-Meier, T. Svoboda, L. Van Gool, S. Lang, K. Strehlke, A. Vande Moere, and O. Staadt, "blue-c: A Spatially Immersive Display and 3D Video Portal for Telepresence," Proc. ACM SIGGRAPH, pp. 819-827, 2003.
[9] T. Matsuyama, X. Wu, T. Takai, and T. Wada, "Real-Time Dynamic 3-D Object Shape Reconstruction and High-Fidelity Texture Mapping for 3-D Video," IEEE Trans. CSVT, vol. 14, no. 3, pp. 357-369, 2004.
[10] K.M. Cheung, T. Kanade, J.Y. Bouguet, and M. Holler, "A real time system for robust 3D voxel reconstruction of human motions," Proc. CVPR, pp. 714-720, 2000.
[11] L.G. Shapiro and G.C. Stockman, Computer Vision, New Jersey: Prentice Hall, 2001.
[12] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley, 1993.
Fig. 12 Segmentation results of Cam 1 ~ Cam 7: (a) Frame 25, (b) Frame 1067