Unsupervised background reconstruction based on iterative median blending and spatial segmentation

Slim Amri, Walid Barhoumi and Ezzeddine Zagrouba
Equipe de Recherche "Systèmes Intelligents en Imagerie et Vision Artificielle" (SIIVA), Institut Supérieur d'Informatique, 2 Rue Abou Rayhane El Bayrouni, 2080 Ariana, Tunisia.
[email protected],
[email protected],
[email protected]

wide high-quality panoramic image that synthesizes the visual content of the video background. Many existing solutions can reconstruct the background accurately for remote video surveillance, since a stationary camera with a fixed focal length is often used. In particular, in indoor surveillance applications, the scene is completely static (i.e., with no background motion) and the pixel intensity can then reasonably be modeled with a normal distribution [4]. However, the task is complicated in the general case of dynamic scenes, due to independent camera motion, background motion, lighting effects and the presence of independently deformable moving objects. Moreover, objects can be introduced into or removed from the scene, so the background must be adjusted adaptively. This paper describes a novel iterative approach for constructing the background of a complete video shot in the presence of random camera motion and complex foreground and background moving objects. The proposed approach is unsupervised, since neither pre-training of static objects in the background nor models of the background and foreground are needed. It starts by compensating the camera motion between some key-frames and the plane of one image chosen as a reference. Then, given the compensated key-frames, background reconstruction consists in deciding, for each pixel with multiple candidate background values (overlapping areas between at least two key-frames), which candidate is the most likely to represent the background, while preventing foreground objects from being blended into the background image. To do so, the oldest solutions assumed that the background would be visible for more than fifty percent of the sequence time and defined the median value of each pixel as its background intensity [5].
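The classical median baseline of [5] can be sketched as follows (a minimal NumPy sketch, under the assumption that the motion-compensated key-frames are already stacked into a single aligned array):

```python
import numpy as np

def median_background(frames):
    """Per-pixel temporal median over a stack of aligned frames.

    frames: array of shape (n, H, W) or (n, H, W, 3).
    Assumes the background is visible in more than half of the frames
    at every pixel, so the median rejects foreground intensities.
    """
    return np.median(np.asarray(frames), axis=0)
```

As the paper argues, this baseline fails precisely when that visibility assumption breaks, e.g. for slow or long-stationary foreground objects.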
More recent solutions assume that the background is the most frequently observed part of the sequence and define the background pixel intensity as the intensity value with the maximum frequency [6]. Other solutions assume that the background intensity is the most stable intensity value over time [1]. However, these solutions are not robust to complex situations such as dynamic scenes and highly cluttered scenes. In particular, they cannot handle scenes with many moving objects, especially if the objects move slowly and/or remain stationary for a long period. This kind of situation causes the pixel intensity values to vary significantly over time. The curve depicted in Fig. 1 shows an example of a background pixel, the pixel of coordinates (272,320) of the basketball sequence (c.f. section IV), whose intensity changes significantly over a short period of time (72 frames, 12 seconds).
Abstract—We propose in this paper a novel iterative approach for unsupervised reconstruction of a static background from a complex video shot. After aligning some key-frames of the video onto a reference plane in order to compensate the camera motion, the basic idea of the suggested scheme is to iteratively reconstruct a precise image of the background using median blending and spatial segmentation. In each iteration, coarse binary masks representing the foreground moving objects are estimated by comparing each motion-compensated key-frame with the corresponding part of the input background image. These masks are then refined by spatial segmentation, exploiting the semantic information offered by region maps. The iterative process allows the blending operator to eliminate the detected moving objects while reconstructing the output background image. Several experiments have been carried out to prove the effectiveness of the suggested unsupervised approach for precise background reconstruction of complex dynamic scenes after a relatively small number of iterations.

Keywords—background reconstruction; iterative median blending; spatial segmentation; motion compensation

I. INTRODUCTION
Motion detection and tracking are of great interest for a wide range of video processing applications. There are mainly three approaches to detecting and segmenting targets [1]: optical flow, frame difference and background subtraction. Solutions based on optical flow are generally sensitive to noise and often computationally expensive [2]. Frame-difference-based solutions detect the targets accurately even in dynamic environments, but the segmentation of the target is not integrated [1]. Since they take advantage of both spatial and temporal cues to identify and track objects of interest [3], background subtraction methods can extract an integrated object. Background subtraction is the easiest and most effective of the three aforementioned approaches. The simplest way to perform it is to assume that a background image, which does not include any moving object, is known in advance; target detection is then solved by comparing the current frame with this reference image. Thus, the assumption that an initial background image is given, or at least can be defined using a short training sequence without foreground objects, is often made. However, in the general case (e.g., monitoring of public areas and traffic), this reference image is not given and cannot be learned in a training stage, so it has to be estimated. Background reconstruction consists in constructing a
978-1-4244-6493-7/10/$26.00 ©2010 IEEE
Only from the 1st to the 3rd frame and from the 57th frame to the last one does the pixel represent background; for the rest of the sequence it is cluttered by moving objects. It is obvious that the intensity distribution is multi-modal, so the assumption of a normal distribution model for the pixel intensity would not hold. It is also clear that the background in this sequence exhibits very high-frequency variations, which occur in a very short time, so modeling the background variations with a mixture of Gaussian distributions [7] will not be accurate. We also notice that the frequency of foreground objects is the maximum and that the curve segment representing foreground objects is the longest interval. Thus, for this particular case, a foreground object would be blended into the background image by any of the solutions discussed above. We introduce in this paper a novel iterative median scheme for background reconstruction (Fig. 2). It starts by estimating a first version of the background image using a classical median blending of the motion-compensated key-frames. This preliminary background image often deviates from the correct background color in the case of complex videos. It still contains some parts of the moving objects, especially those occluded by other foreground objects, which partially appear as ghost-like traces. This preliminary background image is then iteratively refined in order to obtain a pure background image. In each iteration t, moving objects are coarsely localized inside each key-frame using hysteresis thresholding. Then, a region-based validation process is applied in order to define the foreground regions precisely. These regions are then ignored during the median blending, allowing the output background Bt to be reconstructed with the foreground objects removed.
The basic idea behind our iterative approach is to combine pixel-level information with the high-level semantic information offered by regions, which allows an accurate background reconstruction. Thus, after only a few iterations, the final background of a complete complex video can be precisely reconstructed, even in the presence of background motion and large background occlusions.
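The overall iteration described above can be sketched at a high level as follows (a NumPy sketch; the per-stage routines `detect_coarse`, `refine_masks` and `blend` are illustrative names standing in for the stages detailed in Section III, not functions from the paper's implementation):

```python
import numpy as np

def reconstruct_background(frames, detect_coarse, refine_masks, blend, n_iters=2):
    """Iterative background reconstruction (high-level sketch).

    frames: (n, H, W) stack of motion-compensated key-frames.
    detect_coarse(frame, background) -> coarse binary foreground mask.
    refine_masks(masks) -> region-validated foreground masks.
    blend(frames, masks) -> new background ignoring masked pixels.
    """
    background = np.median(frames, axis=0)            # initial estimate B0
    for _ in range(n_iters):
        masks = np.stack([detect_coarse(f, background) for f in frames])
        masks = refine_masks(masks)                   # region-based validation
        background = blend(frames, masks)             # masked median blending
    return background
```

The paper reports that two iterations of this loop usually suffice for complex sequences.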
Figure 2. Flowchart of the iterative background reconstruction scheme
II. MOTION COMPENSATION
The motion compensation stage aims to align some key-frames belonging to the video shot onto a reference plane in order to compensate the camera motion. We used our efficient approach based on multi-feature matching [5], which has proved robust even in the presence of moving objects. Considering the video shot as an ordered set {I1,...,Ik,...,In} of key-frames, the alignment stage amounts to projecting each key-frame Ik onto the plane of one frame Iref chosen as the reference. This is done by estimating the homography Hk approximating the global model of the camera motion between Ik and Iref. To minimize the average temporal distance between every key-frame and the reference image, the middle frame of the sequence is set as the reference frame, which allows accumulative long-term motion estimation [8]. The estimation of Hk then consists in the composition of the homographies describing the motion between each pair (Ii,Ii+1) of successive overlapping key-frames belonging to the subsequence between Ik and Iref. To estimate the homography between (Ii,Ii+1), a coarse region segmentation algorithm followed by a region matching process is applied to these frames in order to define the set (⊂ Ii×Ii+1) of region pairs achieving high correlation scores and verifying the relative position constraint. Then, the analysis of the relative positions of the matched regions permits a partial estimation of the camera translations in the x and y directions, the rotation angle of the camera, and the scale factor between the considered key-frames. The estimated motion model can only be used as an approximation of the camera motion, since it is only valid for very small pan and tilt
Figure 1. Intensity history plot (in RGB color space) of a background pixel
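The chaining of pairwise homographies into the global model Hk (Section II) amounts to a product of 3×3 matrices; a minimal sketch, assuming each pairwise homography is already estimated as a 3×3 array mapping frame i onto frame i+1:

```python
import numpy as np

def compose_homography(pairwise, k, ref):
    """Compose the homography mapping key-frame k onto the reference plane.

    pairwise[i]: 3x3 homography mapping frame i onto frame i+1.
    For k < ref we chain forward; for k > ref we chain inverses.
    """
    H = np.eye(3)
    if k < ref:
        for i in range(k, ref):
            H = pairwise[i] @ H                 # I_k -> ... -> I_ref
    else:
        for i in range(ref, k):
            H = H @ np.linalg.inv(pairwise[i])  # invert the chain back to I_ref
    return H / H[2, 2]                          # normalise the projective scale
```

Choosing the middle frame as `ref` keeps both chains short, which is why the paper uses it to limit accumulated estimation error.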
The rest of this paper is structured as follows. In the next section, we briefly present the motion compensation method used. In section 3, we describe the iterative background reconstruction scheme. The experimental study is presented in section 4. Finally, section 5 is devoted to a synthesis of the contributions of the proposed approach, with some ideas for future work.
angles at large focal lengths [5]. Hence, to refine the estimated alignment, Harris points are employed to derive a precise projective model approximating the camera motion between two frames. In fact, for each point of interest (x,y) in Ii, we used the already estimated camera translations to predict a reduced search space for its most likely homologous correspondents in Ii+1. This limitation of the search space speeds up interest point matching, while avoiding a blind search for correspondences. The matching is based on computing zero-mean normalized cross-correlation scores between the potential homologous points, followed by the verification of the uniqueness constraint. The RANSAC paradigm was also applied in order to remove matching outliers. Lastly, to estimate the projective motion parameters, we adopted the QR-factorization method followed by a relaxation algorithm that iteratively minimizes the sum of distances between all matching pairs.
low-pass filter is then applied to the obtained background image in order to reduce aliasing effects. Then, the difference image Diff^t_k between the resized versions of the images Ik and B^t_k is computed (1). Finally, we resize the difference image Diff^t_k back to the original size of the input image Ik using bi-cubic interpolation.
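A minimal sketch of this coarse difference step follows; block-mean downscaling and nearest-block upscaling are used here as stand-ins for the paper's low-pass filtering and bi-cubic resizing, which would typically be done with an image library:

```python
import numpy as np

def coarse_difference(frame, background, r=8):
    """Max-over-channels absolute difference, computed at 1/r resolution.

    frame, background: (H, W, 3) arrays with H and W divisible by r.
    Block-mean downscaling stands in for low-pass filtering + resizing;
    the coarse difference is then tiled back to full resolution
    (a stand-in for the paper's bi-cubic upscaling).
    """
    h, w, c = frame.shape
    def downscale(img):
        return img.reshape(h // r, r, w // r, r, c).mean(axis=(1, 3))
    diff = np.abs(downscale(frame) - downscale(background)).max(axis=-1)  # eq. (1)
    return np.kron(diff, np.ones((r, r)))   # tile back to the original size
```

Working at 1/r resolution both suppresses noise and removes the small regions inside ghost-like traces, as argued in the text.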
III. ITERATIVE BACKGROUND RECONSTRUCTION
Given the already estimated homographies, the key-frames are first aligned in the reference plane using a classical pixel-based temporal median blending. Although it can eliminate moving foreground objects that are not highly correlated over time (Fig. 3), median blending alone is incapable of eliminating those with complex time-varying motion. In fact, since various foreground objects can move very quickly (or very slowly) and can be hidden by other moving objects in a sub-sequence of the video, it is common for a part of the background to be never, or only partially, visible in this sub-sequence. Thus, the warped background image (B0) is only considered an initial approximation of the video background image. It is then used as input for a hysteresis thresholding algorithm in order to roughly detect moving foreground objects. In fact, each motion-compensated key-frame is compared with the corresponding part of B0 to define a binary mask, which coarsely isolates the foreground moving objects in the input key-frame. Each foreground mask is then refined using the corresponding region map produced during the segmentation process of the motion compensation stage. The region map is used to reliably filter each group of pixels already marked as a foreground object, and consequently to update the output background image without considering the filtered moving objects in each key-frame. The same iterative refinement process can be applied again in order to "discover" new foreground pixels, which refines the produced background image once more.
Figure 3. Initial background image B0

Diff^t_k(x,y) = max_{c∈{R,G,B}} | I_k(x,y,c) − B^t_k(x,y,c) |     (1)
Thresholding the produced difference image Diff^t_k yields a coarse localization of the moving objects for the key-frame Ik, producing a binary mask M^t_k (where 1 indicates that the pixel belongs to a moving object). We used a hysteresis thresholding technique, in which each pixel (x,y) is assigned to the foreground or to the background according to two thresholds Tl and Th (Tl < Th). In fact, if Diff^t_k(x,y) ≥ Th then the pixel (x,y) is associated with a moving object. Moreover, if Diff^t_k(x,y) ≥ Tl and there exists at least one pixel around (x,y) that has already been identified as foreground, then (x,y) is also considered a foreground pixel. This amounts to each pixel (x,y) competitively propagating its foreground mode to the neighboring pixels. This bi-thresholding guarantees that faithful silhouettes of the foreground moving objects in complex videos are accurately extracted, without any a priori information about these objects (Fig. 4).
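The hysteresis rule above can be sketched with NumPy alone (an iterated 4-neighbour dilation stands in for an explicit connected-propagation pass; the thresholds Tl and Th are inputs):

```python
import numpy as np

def hysteresis_threshold(diff, t_low, t_high):
    """Binary foreground mask via hysteresis thresholding.

    Pixels above t_high seed the mask; pixels above t_low join it
    only if they are 4-connected to an already-accepted pixel.
    """
    strong = diff >= t_high
    weak = diff >= t_low
    mask = strong.copy()
    while True:
        grown = mask.copy()
        # dilate the current mask by one pixel in the 4 directions
        grown[1:, :] |= mask[:-1, :]
        grown[:-1, :] |= mask[1:, :]
        grown[:, 1:] |= mask[:, :-1]
        grown[:, :-1] |= mask[:, 1:]
        grown &= weak                       # only weak pixels may join
        if np.array_equal(grown, mask):
            break
        mask = grown
    return mask
```

Note that isolated pixels between Tl and Th never enter the mask, which is exactly what suppresses weak noise while keeping complete silhouettes.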
B. Region-based moving objects refinement

The produced foreground silhouettes are still noisy, with jagged boundaries, since they are computed independently at the pixel level. In particular, noise inevitably produces some small insignificant foreground regions. The aim of this stage, in each iteration t, is to refine the moving-foreground-objects mask M^t_k relative to an aligned key-frame Ik (1 ≤ k ≤ n). In order to exploit the semantic information offered by regions, we used the already computed region map of Ik, after applying to it the corresponding homography Hk. To guarantee region connectivity in the motion-compensated region maps, we applied nearest-neighbour interpolation in order to handle pixels without corresponding values. Then, each region of the produced aligned segmentation is examined to decide whether it belongs to a moving object. If a great part (experimentally, we
A. Coarse moving objects detection

Given a motion-compensated key-frame (Ik = Hk·Ik) and its corresponding part B^t_k in the input background image Bt, the goal of this step is to coarsely localize the moving objects in the concerned key-frame. Instead of directly subtracting B^t_k from Ik, we start by resizing the two images (Ik and B^t_k) to 1/r of their original sizes. It was found that a value of r around 8 is sufficient for reliable background reconstruction at a relatively low computational cost. In addition to minimizing noise effects, the resizing reduces the presence of small regions inside the ghost-like traces of the moving objects in the input background Bt. A
chose the proportion 70%) of the region is covered by an object mask in M^t_k, the whole region is declared part of the foreground objects; otherwise, the whole region is declared part of the new background image Bt+1. We note that the extraction of the moving objects' silhouettes is much better in M^{t+1}_k
non-rigid deforming objects. These sequences are high resolution, with significant frame-to-frame motion. In Fig. 5, we illustrate the iterative evolution of the background reconstruction on the basketball sequence. We can see that the proposed scheme iteratively reduces the presence of moving-object traces in the produced background image, which was reconstructed without any visible errors after only two refinement iterations. In particular, some areas covered by moving foreground objects are initially impossible to reconstruct, since the background behind them is never visible during the studied sequence [9]. These zones were replaced by red points in Fig. 5. However, the iterative scheme permits a complete reconstruction of the background, without any blurring, after a reduced number of iterations.
compared to M^t_k. Indeed, by working at the region level, we enforce that connected foreground pixels with the same motion are considered as one moving object, while taking into account the spatial correlation of nearby pixels. Once all the foreground masks are refined, we proceed to blend all the aligned key-frames without considering the detected moving objects (2), in order to generate a new background image Bt+1. This amounts to replacing the pixels belonging to the moving objects with background pixels. The produced background image Bt+1 is much more precise than the input one Bt, since it reduces the presence of the ghost-like traces caused by moving objects (Fig. 5).
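The region-level validation above (declaring a whole region foreground when a large fraction of its pixels, experimentally 70%, is covered by the coarse mask) can be sketched as follows, assuming the aligned region map is given as an integer label image:

```python
import numpy as np

def refine_mask(coarse_mask, region_map, coverage=0.7):
    """Snap a coarse pixel-level foreground mask to segmentation regions.

    region_map: integer label image from the spatial segmentation.
    A whole region becomes foreground if at least `coverage` of its
    pixels are marked in the coarse mask; otherwise it becomes background.
    """
    refined = np.zeros_like(coarse_mask, dtype=bool)
    for label in np.unique(region_map):
        region = region_map == label
        if coarse_mask[region].mean() >= coverage:
            refined |= region
    return refined
```

This all-or-nothing decision per region is what removes both the jagged boundaries and the small spurious foreground blobs.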
B^{t+1}(x,y,c) = median_{ {1 ≤ k ≤ n / M^t_k(x,y) = 0} } { I_k(x,y,c) },  ∀c ∈ {R,G,B}     (2)
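Equation (2), a median taken over only those key-frames whose refined mask marks the pixel as background, can be sketched as:

```python
import numpy as np

def masked_median_blend(frames, masks, fallback=np.nan):
    """Per-pixel median over the frames where the pixel is NOT foreground.

    frames: (n, H, W, 3) aligned key-frames.
    masks:  (n, H, W) boolean foreground masks (True = moving object).
    Pixels masked in every frame (never-visible background) get `fallback`.
    """
    data = np.where(masks[..., None], np.nan, frames.astype(float))
    blended = np.nanmedian(data, axis=0)   # all-NaN pixels yield NaN
    return np.where(np.isnan(blended), fallback, blended)
```

The `fallback` value plays the role of the red points of Fig. 5: pixels whose background is never observed in the current iteration.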
Figure 4. Coarse detection of moving objects in key-frames 1 and 22. (a): original image, (b): the corresponding part in the preliminary background image, (c): coarse localisation of the moving objects.
IV. RESULTS AND EVALUATION

Figure 5. Background reconstruction. (a): incomplete initial background image (B0), (b): complete background image (B1) after one refinement iteration, (c): complete background image (B2) after two refinement iterations.
In this section, we describe a set of experiments performed to evaluate the accuracy of our background reconstruction scheme. The scheme was tested on a variety of complex sequences, both indoor and outdoor, containing a variety of
In Fig. 6, the proposed approach is tested on a volleyball sequence. Although the initial background image does not
model the background content very accurately, the refinement step makes the proposed framework capable of precisely reconstructing the background. Indeed, the referee, a foreground object with very slow movement, was iteratively removed from the background image. From one iteration to the next, additional regions are removed from the referee's silhouette, until the referee's traces are completely eliminated from the background image. This confirms the accurate reactivity of the iterative scheme to fast as well as slow changes over time. This is mainly due to the semantic information offered by the region maps, which allows each pixel to competitively propagate its foreground mode (c.f. key-frame #22 in Fig. 4) or background mode (c.f. key-frame #1 in Fig. 4) to its connected pixels. For example, for the background pixel of coordinates (272,320) of the basketball sequence (c.f. Fig. 1), the number of candidates for the background value decreases from one iteration to the next, thanks to the exclusion of moving objects. The iterative scheme retains only background information for the median blending (Fig. 7), which prevents foreground objects from being blended into the background image. We have also applied the suggested background reconstruction scheme to the popular hall-and-monitor test sequence (Fig. 8).
Figure 7. Evolution of multiple candidates of the background value for the pixel (272,320) of the basketball sequence. (a): after one refinement iteration, (b): after two refinement iterations.
In this sequence, for which the input data is rather noisy [10], two objects (the bag that the left man is placing on the pedestal and the monitor that the right man is carrying away) change their status during the sequence to background or foreground, respectively. The sequence shows two men walking in the direction of the optical axis (zoom effects) who stay at the same position for a long time. The initial background image does not model the background content very accurately, since the two men as well as the bag appear gradually in the reconstructed background. The iterative scheme nevertheless allows an accurate reconstruction of a clean background. Besides, since the real background image is available for the hall-and-monitor sequence (in the first frames of the sequence), we used this sequence to evaluate the quality of the background image by measuring the difference between the reconstructed background image and the ground-truth background image (the first frame, which is itself noisy). We measured the PSNR of the suggested scheme after each iteration, and we estimated the camera noise by calculating the PSNR between the first two frames of the sequence (Table I). We also evaluated our approach quantitatively by creating ground-truth segmentations of the basketball sequence. We computed the background PSNR between the original key-frames and the background image, using manually segmented ground-truth masks to only take
Figure 6. Background reconstruction. (a): initial background image (B0), (b): background image (B1) after one refinement iteration, (c): background image (B2) after two refinement iterations.
background pixels into account. It is clear that the proposed iterative scheme considerably increases the PSNR from one iteration to the next (Fig. 9).
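The background PSNR used in this evaluation can be computed as follows (the standard PSNR definition, restricted to ground-truth background pixels via an optional boolean mask):

```python
import numpy as np

def background_psnr(reconstructed, ground_truth, bg_mask=None, peak=255.0):
    """PSNR (dB) between two images, optionally over background pixels only.

    bg_mask: boolean array selecting the pixels to compare (True = background);
    None compares every pixel.
    """
    a = reconstructed.astype(float)
    b = ground_truth.astype(float)
    if bg_mask is not None:
        a, b = a[bg_mask], b[bg_mask]
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return np.inf                       # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Restricting the mean squared error to background pixels is what makes the measure insensitive to the (irrelevant) foreground content of the key-frames.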
V. CONCLUSION
We have proposed in this paper an efficient approach for unsupervised reconstruction of the background image of a complex sequence. The approach is mainly based on iterative median blending and spatial segmentation. The originality of the suggested iterative approach resides mainly in the combination of pixel-level information with the high-level semantic information offered by regions in order to localize moving objects, so that the static background can be reconstructed correctly, and thereby targets can be extracted and tracked successfully. Experimental tests have demonstrated the efficiency of the proposed scheme in terms of performance and computational cost. Indeed, given some key-frames (a quarter of the total sequence) of a complex video filmed with a randomly moving camera and containing several complicated non-rigid moving objects (rapid, slow, similar in appearance, partially/totally occluded, cluttered, ...), it is shown that two iterations are usually sufficient for accurate background reconstruction without visible errors. As further work, we aim to apply the suggested iterative scheme to person tracking. We also aim to apply the proposed solution for video indexing and retrieval purposes.
REFERENCES
[1] Z.Q. Hou and C.Z. Han, "A background reconstruction algorithm based on pixel intensity classification in remote video surveillance system," International Conference on Information Fusion, Sweden, pp. 754–759, June 2004.
[2] P.S. Singh and K.J. Waldron, "Motion estimation by optical flow and inertial measurements for dynamic legged locomotion," IEEE Conference on Robotics and Automation, Spain, April 2005.
[3] R. Noaman, M. Ali, and K. Jumari, "Background model initialization using modal algorithm for surveillance," International Journal of Intelligent Information Technology Application, vol. 2, pp. 294–297, 2009.
[4] A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," European Conference on Computer Vision, Ireland, pp. 751–76, June 2000.
[5] E. Zagrouba, W. Barhoumi, and S. Amri, "An efficient image mosaicing method based on multifeature matching," Machine Vision and Applications, vol. 20, pp. 139–162, 2009.
[6] P. Kornprobst, R. Deriche, and G. Aubert, "Image sequence analysis via partial difference equations," Journal of Mathematical Imaging and Vision, vol. 11, pp. 5–26, 1999.
[7] P. Kaewtrakulpong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," European Workshop on Advanced Video Based Surveillance Systems, September 2001.
[8] A. Krutz, A. Glantz, T. Borgmann, M. Frater, and T. Sikora, "Motion-based object segmentation using local background sprites," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1221–1224, 2009.
[9] D. Farin, P.H. de With, and W. Effelsberg, "Robust background estimation for complex video sequences," IEEE International Conference on Image Processing, pp. 145–148, 2003.
[10] E. Hayman and J. Eklundh, "Statistical background subtraction for a mobile observer," IEEE International Conference on Computer Vision, France, 2003.
Figure 8. Background reconstruction. (a): initial background image (B0), (b): background image (B2) after two refinement iterations.

Figure 9. Iterative evolution of the background-PSNR (after one and two refinement iterations)
TABLE I. PSNR between the reconstructed background image and the ground-truth background image

                                    PSNR (dB)
  classical median (B0)             30.78
  our scheme (two iterations, B2)   33.61
  camera noise                      38.57