2D Shape Estimation for Moving Objects Considering a Moving Camera and Cast Shadows

Roland Mech (1), Jurgen Stauder (2)

(1) University of Hannover, Institut fur Theoretische Nachrichtentechnik und Informationsverarbeitung, Appelstr. 9A, D-30167 Hannover, Germany, phone: +49-511-762-5308, fax: +49-511-762-5333, email: [email protected]

(2) IRISA/INRIA, Campus Universitaire de Beaulieu, F-35042 Rennes Cedex, France, phone: +33-29984-7504, fax: +33-29984-7171, email: [email protected]
ABSTRACT
The estimation of the 2D shape of moving objects in a video image sequence is required for many applications, e.g. for so-called content-based functionalities of ISO/MPEG-4, for object-based coding, and for automatic surveillance. Many real sequences are taken by a moving camera and show moving objects as well as their cast shadows. In this paper, an algorithm for 2D shape estimation for moving objects is presented that for the first time explicitly considers both a moving camera and moving cast shadows. The algorithm consists of five steps: estimation and compensation of possibly apparent camera motion, detection of possibly apparent scene cuts, generation of a binary mask by detection of temporal signal changes after camera motion compensation, elimination of mask regions corresponding to moving cast shadows and uncovered background, and finally, adaptation of the mask to luminance edges of the current frame. For identification of moving cast shadows, three criteria evaluate static background edges, uniform change of illumination, and shadow penumbra. The proposed algorithm yields accurate segmentation results for sequences taken by a static or moving camera, in absence and in presence of moving cast shadows. Parts of this algorithm have been accepted for the informative part of the description of the forthcoming international standard ISO/MPEG-4.

Keywords: 2D shape estimation, segmentation, VOP generation, moving shadows, illumination, camera motion estimation, MPEG-4
1. INTRODUCTION
2D shape estimation for moving objects is required for many applications, e.g. for object-based coding or for video analysis like automatic surveillance. Moreover, the upcoming ISO/MPEG-4 standard [9] considers such shape information for so-called content-based functionalities. Current methods for 2D shape estimation for moving objects in video sequences are either based on thresholding the luminance difference image of two successive frames [3][6][7][8][5][14][9], i.e. a change detection mask is estimated, or they segment each frame into regions with similar motion parameters with respect to a certain motion model [11][1][4]. One main problem in both strategies is the misclassification of moving cast shadows as part of moving objects, leading to incorrect estimates of object shape. In [10], a method for 2D shape estimation considering moving cast shadows and a static camera is presented, which is based on the estimation of a change detection mask. Pels where the luminance has changed due to a moving cast shadow are detected and eliminated from this mask. These pels are detected by two criteria:
Each pel inside the change detection mask is tested for membership of a static edge. A static edge may belong to background texture and thus hints to a moving shadow.
The neighborhood of each pel is tested for a spatially uniform change of illumination. Changes caused by homogeneous change of illumination hint to a moving shadow.
In [12], this method is improved by a third criterion for detection of pels changed due to a moving cast shadow:
Cast shadows are identified by their soft contours that are caused by the penumbra of the shadows.
A robust method for 2D shape estimation for moving objects considering also a moving camera has been presented in [8]. This method is currently investigated by the standardization activities of ISO/MPEG-4 [5][14][9]. The segmentation process is split into five steps: First, a possibly apparent camera motion is estimated and compensated. Then, a possibly apparent scene cut is detected in order to reset the algorithm in such cases. Afterwards, a change detection mask is estimated by a noise robust technique. After that, the uncovered background is eliminated from the mask by using an estimated displacement vector field. Finally, the mask is adapted to luminance edges of the current frame. For temporal coherency of the estimated object shape, a memory is used, which adapts automatically to the sequence. Although this method leads to quite accurate segmentation results, it happens that cast shadows are misclassified as moving objects. In this paper, a new method for 2D shape estimation for moving objects is proposed, which considers both a moving camera and cast shadows. To deal with sequences captured by a moving camera, the camera motion will be estimated and compensated first, as in [8]. Then, a change detection mask between the camera motion compensated previous frame and the current frame will be estimated. As this mask also contains pels where the luminance change is caused by a moving cast shadow, these pels will be detected by the techniques proposed in [10] and [12], and eliminated from the mask.
2. APPEARANCE OF CAST SHADOWS IN VIDEO IMAGES
In Fig. 1, the formation of a cast shadow on a scene background is shown. Cast shadows on moving objects are not considered in this paper. The light coming from a light source reaches the background only partially due to a moving object. The darkened area on the background is called cast shadow. It is illuminated only by some ambient diffuse light. A cast shadow consists of a center part without any light from the light source, called umbra, and a soft transition from dark to bright, the penumbra, where some light from the light source reaches the background [13].

Figure 1. Cast shadow generation: The scene taken by a camera contains a moving object and a moving cast shadow on the background. The shadow is caused by a light source of certain extent and exhibits a penumbra.
The appearance of a cast shadow in an image taken by a video camera can be described by an image signal model. It describes the image luminance

    s_k(x, y) = E_k(x, y) * rho_k(x, y)    (1)

at time instant k at the 2D image position (x, y) by the product of the irradiance E_k(x, y) and reflectance rho_k(x, y) of the object surface. The irradiance is the received light power per illuminated object surface.
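The multiplicative model of Eq. 1 can be illustrated with a small numeric sketch (values are invented for illustration, not taken from the paper): a cast shadow lowers the irradiance E while the background reflectance rho stays fixed, so the ratio of shadowed to lit luminance is spatially constant even over textured background.

```python
import numpy as np

# Textured background reflectance rho (arbitrary example values).
rho = np.array([[0.2, 0.8],
                [0.5, 0.5]])
E_lit = np.full_like(rho, 100.0)    # irradiance when fully lit
E_shadow = np.full_like(rho, 30.0)  # irradiance inside the umbra (ambient light only)

# Image signal model s = E * rho (Eq. 1).
s_lit = E_lit * rho
s_shadow = E_shadow * rho

# The reflectance cancels in the ratio, leaving a spatially constant value.
ratio = s_shadow / s_lit
print(ratio)  # every entry equals 0.3
```

This constancy of the luminance ratio under a shadow is exactly what the second criterion in Section 3.2 exploits.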
3. DETECTION OF LUMINANCE CHANGES DUE TO MOVING CAST SHADOWS
In this paper, three assumptions are made to detect image regions changed by moving cast shadows. These image regions are background regions covered or uncovered by a moving cast shadow from frame to frame. The three assumptions lead to three criteria evaluating two succeeding frames of an image sequence:

1. camera and background are static,
2. the background is planar and the light source position is distant from the background,
3. the light source extent is not negligible compared to its distance to the moving object.

For detection of luminance changes due to moving cast shadows, all pels being part of the change detection mask are evaluated by the three criteria introduced in the following subsections. Afterwards, the results of the criteria are combined into a binary mask for regions changed by moving cast shadows. The change detection mask is assumed to be available.
3.1. Detection of static background edges
The assumption (1) of camera and background being static can be guaranteed either by restriction to a static camera and static background, or by motion compensating the previous frame s_k with respect to the camera motion. As will be explained in Section 4, a camera motion compensated previous frame s_{k,CMC} is assumed in this paper. In case of textured background, assumption (1) can be used to distinguish possible moving cast shadows from moving objects. First, luminance edges are detected in the previous and current frame. Then, the edges in the previous and current frame are classified into moving and static edges, see Fig. 2. An edge is classified as static if the activity in high frequencies of the frame difference between current and previous frame is low [10]. Other edges are classified as moving edges. The threshold for the high frequency activity is adaptively calculated from the high frequency activities of the frame difference outside the change detection mask, assuming Gaussian noise [10]. Edges classified as static hint to possible regions of moving cast shadows.
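The static-edge criterion can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the high-frequency activity is approximated here by the deviation of the frame difference from its local 3x3 mean, and the threshold is passed in as a parameter rather than estimated from the noise statistics outside the change detection mask.

```python
import numpy as np

def classify_edges(prev, curr, edge_mask, hp_threshold):
    """Sketch of the static-edge criterion (names are illustrative):
    an edge pel is 'static' if the high-frequency activity of the frame
    difference around it is low, hinting that the luminance change there
    stems from illumination (a shadow) rather than object motion."""
    diff = curr.astype(float) - prev.astype(float)
    # Simple high-pass proxy: deviation of each pel from its local 3x3 mean.
    pad = np.pad(diff, 1, mode='edge')
    h, w = diff.shape
    local_mean = sum(pad[i:i + h, j:j + w]
                     for i in range(3) for j in range(3)) / 9.0
    activity = np.abs(diff - local_mean)
    static = edge_mask & (activity < hp_threshold)
    moving = edge_mask & ~static
    return static, moving
```

A spatially uniform luminance change (shadow-like) produces zero high-pass activity and thus static edges, while a textured change produces high activity and thus moving edges.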
3.2. Detection of uniform changes of illumination
The assumption (2) means that the background is planar and the light source is distant from the background. In this case, the irradiance on the background is spatially constant. To make use of this fact for shadow detection, the frame ratio

    FR(x, y) = s_{k+1}(x, y) / s_k(x, y) = ( E_{k+1}(x, y) / E_k(x, y) ) * ( rho_{k+1}(x, y) / rho_k(x, y) )    (2)

is evaluated inside the change detection mask. Then, the hypothesis is tested that the luminance at position (x, y) has changed due to a moving cast shadow. If the hypothesis is valid, the background reflectance does not change and rho_k(x, y) = rho_{k+1}(x, y) holds. If this is true, neglecting any camera noise, the frame ratio can be simplified to

    FR(x, y) = E_{k+1}(x, y) / E_k(x, y).    (3)

From Eq. 3 it can be concluded that the frame ratio is spatially constant in a neighborhood of (x, y) if the hypothesis holds, because the irradiance is constant. For shadow detection, this conclusion is used vice versa: if the frame ratio is locally spatially constant, a moving cast shadow can be assumed at position (x, y). The frame ratio is tested for spatial constancy by evaluating its local spatial variance in a 3x3 pel^2 window (for CIF image format), see Fig. 3. A variance smaller than a threshold indicates a uniform change of illumination and is assumed to hint to a moving cast shadow. The threshold is adaptively calculated from the local variances of the frame ratio outside the change detection mask.

Figure 2. Block diagram of the algorithm for detection of static edges: edge detection in current and previous image, highpass filtering of the frame difference, thresholding, and classification into static and moving edges.

There is a case where this criterion will fail. The criterion will erroneously detect a moving cast shadow if at position (x, y) a uniformly colored moving object is visible that rotates. In this case, the simplification from Eq. 2 to Eq. 3 holds and the frame ratio will be locally spatially constant. Such an error can be seen in Fig. 3, where two regions are detected in the facial area.
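A minimal sketch of the frame-ratio criterion may clarify the procedure. The function name, the window handling, and the epsilon guard against division by zero are illustrative assumptions; the paper computes the variance threshold adaptively from regions outside the change detection mask, while here it is passed in directly.

```python
import numpy as np

def uniform_illumination_mask(prev, curr, cdm, var_threshold, eps=1e-6):
    """Sketch of the frame-ratio criterion: inside the change detection
    mask cdm, a locally constant frame ratio s_{k+1}/s_k (low variance
    in a 3x3 window) hints at a uniform change of illumination, i.e.
    a moving cast shadow."""
    fr = (curr.astype(float) + eps) / (prev.astype(float) + eps)
    h, w = fr.shape
    shadow = np.zeros_like(cdm)
    for y in range(1, h - 1):          # 3x3 window as for CIF in the paper
        for x in range(1, w - 1):
            if cdm[y, x]:
                win = fr[y - 1:y + 2, x - 1:x + 2]
                if win.var() < var_threshold:
                    shadow[y, x] = True
    return shadow
```

A shadow scales the whole neighborhood by the same irradiance factor, so the ratio is flat there; a genuine object edge or texture change makes the ratio vary locally and fails the test.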
3.3. Penumbra detection
Assumption (3) says that the extent of the light source is not negligible compared to the distance between light source and moving object. Then, a cast shadow has a penumbra [13]. The idea of the third criterion is to detect shadows by their penumbra. The penumbra causes a soft luminance step at the border of a shadow. The luminance step in an image perpendicular to a shadow border is modeled by the luminance step model shown in Fig. 4. The luminance is assumed to rise linearly from a low luminance inside a shadow to a high luminance outside the shadow. The luminance step is characterized by its step height h, step width w and its gradient g. If the width of a luminance step caused by a penumbra is much bigger than that of edges caused by the camera aperture at object surface texture edges or object edges, it can be used for shadow detection.

Figure 3. Block diagram of the algorithm for detection of regions with uniform changes of illumination: ratio of current and previous image, local variance analysis, and thresholding.

Figure 4. Model of an image luminance step s_k(x, y = const.) in a frame s_k at time instant k, in direction perpendicular to a shadow contour: The luminance step is defined by the step height h and the amplitude g of the gradient. From h and g the width w can be calculated. In this figure, the shadow contour is assumed to lie in y-direction.
Figure 5. Block diagram of the algorithm for detection of shadow penumbra: penumbra candidates are selected at the border of the change detection mask from the frame difference of previous and current image, then evaluated by gradient/height measurement, width calculation, and thresholding.
For penumbra detection, edges are evaluated in the frame difference between the previous and the current frame. The luminance edge model from Fig. 4 is therefore applied to the frame difference. The pels at the border of the change detection mask are selected as penumbra candidates, see Fig. 5. Edges are considered in the frame difference because the question whether the relevant edges are in the previous or in the current frame depends on the unknown motion of cast shadows and objects. The penumbra candidates may be object or shadow edges, because the change detection mask contains image regions changed by moving objects or moving cast shadows. The candidate selection has two advantages. First, the number of candidates is low compared to the number of edges indicated by a standard edge detection algorithm as in [15]. Second, standard edge detection algorithms have difficulties in finding soft edges at the border of a shadow. The candidate selection is enhanced by two steps. First, the number of candidates is further reduced. For this, an object mask (see Section 4) for the moving objects in the previous frame is, if available, or-connected with the change detection mask before candidate selection to close some holes in the mask. Second, to enhance the precision, the penumbra candidates are moved perpendicular to the border of the change detection mask to a position of highest gradient in the frame difference. The gradient is measured perpendicular to the border of the change detection mask using a Sobel operator aligned perpendicular to the edge. To detect those candidates belonging to a penumbra, the height and gradient of the steps of the frame difference perpendicular to the edge are measured for each candidate, see Fig. 5. The height is measured by the difference of averaged frame differences from both sides of the edge. For this, 3x3 pel^2 averaging windows (for CIF image format) are placed 1 pel beside the edge.
The gradient is measured using a Sobel operator aligned perpendicular to the edge. The direction of the edge is measured by a regression line evaluating penumbra candidates in a neighborhood of 3 pel. For each penumbra candidate, from height h and gradient g the width

    w = h / g    (4)

of the luminance step is calculated. The width w is thresholded: each penumbra candidate having a width greater than 2.5 pel (for CIF image format) is detected as penumbra; other penumbra candidates are classified as object edges. The threshold for w depends on both the camera aperture and the 3D scene geometry. Theoretically, it should be larger than any edge width caused by the aperture and smaller than the width of the sharpest shadow edge. Here, the chosen threshold was optimized for the image sequences shown in Section 5 to yield few false alarms for penumbras.
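The width test of Eq. 4 reduces to a few lines. The helper names are illustrative, and the 2.5 pel default corresponds to the CIF threshold quoted above.

```python
def penumbra_width(step_height, gradient):
    """Width w = h / g of a luminance step (Eq. 4)."""
    return step_height / gradient

def is_penumbra(step_height, gradient, width_threshold=2.5):
    """Sketch of the penumbra test: candidates whose step width exceeds
    the threshold (2.5 pel for CIF in the paper) are classified as
    penumbra; sharper steps are classified as object edges."""
    return penumbra_width(step_height, gradient) > width_threshold
```

A soft penumbra edge has a large height relative to its gradient (wide step), whereas an aperture-blurred object edge is steep and therefore narrow.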
4. 2D-SHAPE ESTIMATION OF MOVING OBJECTS CONSIDERING MOVING CAST SHADOWS
Figure 6. Block diagram of the proposed algorithm for 2D shape estimation for moving objects in a video image sequence: camera motion estimation and compensation (yielding s_{k,CMC}), scene cut detection, estimation of the change detection mask CDM_{k+1}, estimation of the initial object mask OM^i_{k+1} using displacement information, and estimation of the final object mask OM_{k+1}; the previous object mask OM_k is fed back via a frame delay.
To deal with image sequences captured by a static or by a moving camera, with or without moving cast shadows, in this section the method for 2D shape estimation of moving objects from [8] will be extended by the algorithm for detection of image regions changed by moving cast shadows presented in Section 3. Fig. 6 gives an overview of the proposed segmentation algorithm. It can be subdivided into five steps: In the first step, apparent camera motion is estimated and compensated using an eight parameter motion model. Its eight parameters can reflect any kind of camera motion, assuming that the background is a plane. In the second step, a scene cut detector evaluates whether the mean square error between the current original frame s_{k+1} and the camera motion compensated previous frame s_{k,CMC} exceeds a given threshold. It causes a reset of the segmentation algorithm in these situations, i.e. all parameters are set to their initial values. The evaluation is only performed in background regions of the previous frame, which are taken from the previous object mask OM_k. In that mask, all pels are set to foreground which belong to a moving object in the previous frame.
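The paper does not spell out its eight parameter model, so the following sketch assumes the common perspective (projective) parameterization for the motion of a planar background; the parameter names a1..a8 are illustrative, not taken from the paper.

```python
def warp_coords(x, y, p):
    """Sketch of an eight-parameter (perspective) motion model, a standard
    choice for camera motion over a planar background:
        x' = (a1*x + a2*y + a3) / (a7*x + a8*y + 1)
        y' = (a4*x + a5*y + a6) / (a7*x + a8*y + 1)
    with p = (a1, ..., a8). Pure translation, rotation, zoom, and
    perspective effects are all special cases of this mapping."""
    a1, a2, a3, a4, a5, a6, a7, a8 = p
    denom = a7 * x + a8 * y + 1.0
    return (a1 * x + a2 * y + a3) / denom, (a4 * x + a5 * y + a6) / denom
```

Camera motion compensation then warps the previous frame with the estimated parameters so that the background aligns with the current frame before change detection.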
Figure 7. Block diagram of the algorithm for estimating the change detection mask CDM_{k+1} from the two frames s_{k,CMC} and s_{k+1}, using the object mask OM_k as memory: computation of the initial mask CDM^i, relaxation to CDM^s, detection and elimination of moving shadow regions to CDM^sh, consideration of the memory to CDM^u, and simplification and elimination of small regions.
In the third step, a change detection mask between two successive frames is estimated (Fig. 7). For that, first an initial change detection mask CDM^i between the two successive frames is generated by thresholding the frame difference using a global threshold. In this mask, pels with changing image luminance are labeled as changed, others are labeled as unchanged. After that, boundaries of changed image regions are smoothed by a relaxation technique using locally adaptive thresholds, resulting in a mask CDM^s. Thereby, the algorithm adapts frame-wise automatically to camera noise. The mask CDM^s contains not only pels where the luminance has changed due to a moving object, but also pels where the luminance change is caused by moving cast shadows. These pels are to be eliminated from the mask CDM^s. For the detection of image regions changed by moving cast shadows, the algorithm described in Section 3 is used. The mask after elimination of regions changed by moving cast shadows is denoted as CDM^sh. In order to finally get temporally stable object regions, a memory is used in the following way: The mask CDM^sh is or-connected with the previous object mask OM_k, resulting in a mask CDM^u. This is based on the assumption that all pels which belonged to the previous object mask should belong to the current change detection mask. However, in order to avoid infinite error propagation, a pel from OM_k is or-connected to CDM^sh only if it was labeled as changed in the mask CDM^sh of one of the last L frames. The value L denotes the depth of the memory and adapts automatically to the sequence by evaluating the size and motion amplitude of the moving objects in the previous frame. Finally, the mask CDM^u is simplified and small regions are eliminated, resulting in the final change detection mask CDM_{k+1}.
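The memory step can be sketched as follows. Function and variable names are illustrative; the paper additionally adapts the memory depth L to object size and motion amplitude, which is omitted here and passed in as a fixed value.

```python
import numpy as np

def apply_memory(cdm_sh, om_prev, changed_history, L):
    """Sketch of the object-mask memory: a pel of the previous object
    mask OM_k is or-connected into the shadow-free change detection
    mask CDM^sh only if it was labeled 'changed' in at least one of the
    last L frames, limiting infinite error propagation."""
    recently_changed = np.zeros_like(cdm_sh)
    for past_cdm in changed_history[-L:]:
        recently_changed |= past_cdm
    return cdm_sh | (om_prev & recently_changed)
```

The returned mask corresponds to CDM^u in Fig. 7, before simplification and small-region elimination.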
Figure 8. Example for the separation of changed areas between two frames s_k and s_{k+1} into moving objects and uncovered background (UCB): a moving object traverses the changed region, and displacement vectors connect its positions in s_k and s_{k+1}; pels of the changed region not reached by these vectors are uncovered background.
In the fourth step, the object mask OM_{k+1} is calculated by eliminating the uncovered background areas from CDM_{k+1}. For this, displacement information for pels within the changed regions is used. The displacement is estimated by a hierarchical block matcher (HBM) [2]. For a higher accuracy of the calculated displacement vector field, the change detection mask from the third step is considered by the HBM. Uncovered background is detected at pels whose foot- or top-point of the corresponding displacement vector lies outside the changed area in CDM_{k+1}. The example in Fig. 8 shows an object moving from the left to the right while uncovering background. In the mask OM_{k+1}, pels are labeled as foreground if they are labeled as changed in CDM_{k+1} but do not belong to uncovered background. Finally, the boundaries of the object mask OM_{k+1} are adapted to luminance edges in the current frame in order to improve the accuracy.
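The uncovered-background test can be sketched as follows. This is a simplified illustration, not the paper's HBM-based implementation: each changed pel keeps its foreground label only if the far endpoint of its displacement vector also lies inside the changed area; the foot point is the pel itself, which is inside by construction.

```python
import numpy as np

def eliminate_uncovered_background(cdm, disp):
    """Sketch: a changed pel in cdm is foreground only if the endpoint
    of its displacement vector lies inside the changed area; otherwise
    it is treated as uncovered background. disp holds per-pel (dy, dx)
    integer vectors (illustrative layout, not the paper's HBM output)."""
    h, w = cdm.shape
    om = np.zeros_like(cdm)
    for y in range(h):
        for x in range(w):
            if not cdm[y, x]:
                continue
            ty, tx = y + disp[y, x, 0], x + disp[y, x, 1]
            om[y, x] = 0 <= ty < h and 0 <= tx < w and cdm[ty, tx]
    return om
```

With zero displacement every changed pel stays foreground; a vector pointing out of the changed region marks its pel as uncovered background, matching the behavior illustrated in Fig. 8.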
5. EXPERIMENTAL RESULTS
The proposed segmentation algorithm was applied to the MPEG-4 test sequences Mother-Daughter, Akiyo, Hall-Monitor, Container-Ship, Coastguard and Table-Tennis. Especially the sequence Table-Tennis contains moving cast shadows. The algorithm was further tested on two additional sequences where a person causes a moving cast shadow, Eric and Jurgen. The segmentation technique from the MPEG-4 core experiment [8] is used as reference. In Fig. 9, experimental results for the sequences Eric, Jurgen, and Table-Tennis are shown. It can be seen that the accuracy of the estimated object shapes is improved with respect to the reference method. While the estimated object shape of the reference method is perturbed by moving cast shadows, the object shape of the proposed method is well estimated, due to the explicit consideration of moving cast shadows. If there is no moving cast shadow in the scene, the results of both methods are similar, i.e. the results of the proposed method look like the results of the reference method shown in [8][7].
Figure 9. Results of 2D shape estimation for frame 42 of the sequence "Eric", for frame 21 of the sequence "Jurgen", and for frame 50 of the sequence "Table-Tennis" (CIF, 10 Hz): (a) original image, (b) result of reference method, (c) result of proposed method.
6. CONCLUSIONS
In this paper, an algorithm for 2D shape estimation for moving objects in video sequences has been proposed. The algorithm considers for the first time both sequences taken by a moving camera and sequences where moving cast shadows appear. The proposed algorithm is based on an algorithm from an ISO/MPEG-4 core experiment on automatic segmentation that has been used as reference. The algorithm for 2D shape estimation can be split into five steps: In a first step, a possibly apparent camera motion is estimated and compensated. In the second step, a possibly apparent scene cut is detected, and if necessary the segmentation algorithm is reset. In the third step, a mask of changed image areas is estimated by a locally adaptive thresholding relaxation technique. In this step, luminance changes due to moving cast shadows are detected by three criteria and eliminated from the mask of changed image areas. The three criteria evaluate static background edges, uniform changes of illumination, and the penumbra of shadows. In the fourth step, areas of uncovered background are removed from the mask. The resulting object mask is finally improved by applying a luminance edge adaptation and an object mask memory. The proposed algorithm has been applied to MPEG-4 test sequences taken by a static camera (Mother-Daughter, Akiyo, Hall-Monitor, and Container-Ship) and sequences taken by a moving camera (Coastguard and Table-Tennis). Especially the sequence Table-Tennis contains moving cast shadows. Further, two test sequences where a person causes a moving cast shadow have been used, Eric and Jurgen. It has been shown that the segmentation results are improved with respect to the reference algorithm in cases where moving cast shadows appear in the scene. There, the moving objects are disturbed by moving cast shadows when using the reference technique, while the 2D shape of the moving objects is estimated quite accurately by the proposed algorithm.
For sequences without moving shadows, the results of the proposed algorithm are similar to those of the reference algorithm.
REFERENCES
1. J. Benois, L. Wu, "Joint Contour-based and Motion-based Image Sequence Segmentation for TV Image Coding at Low Bit-rate", in Proceedings of IEEE VCIP 94, Chicago, Illinois, September 1994.
2. M. Bierling, "Displacement Estimation by Hierarchical Blockmatching", in Proceedings of 3rd SPIE Symposium on Visual Communications and Image Processing, Cambridge, USA, November 1988.
3. S. Colonnese, U. Mascia, G. Russo, C. Tabacco, "New FUB Results on Core Experiment N2 on Automatic Segmentation Techniques", Doc. ISO/IEC JTC1/SC29/WG11 MPEG97/1633, Sevilla, Spain, February 1997.
4. C. Gu, T. Ebrahimi, M. Kunt, "Morphological Spatio-temporal Segmentation for Content-based Video Coding", in Proceedings of International Workshop on Coding Techniques for Very Low Bit-rate Video, Tokyo, Japan, November 1995.
5. R. Mech, P. Gerken, "Automatic Segmentation of Moving Objects (Partial Results of Core Experiment N2)", Doc. ISO/IEC JTC1/SC29/WG11 MPEG97/1949, Bristol, England, April 1997.
6. R. Mech, M. Wollborn, "A Noise Robust Method for Segmentation of Moving Objects in Video Sequences", in Proceedings of IEEE ICASSP 97, Munich, Germany, April 1997.
7. R. Mech, M. Wollborn, "A Noise Robust Method for 2D Shape Estimation of Moving Objects in Video Sequences Considering a Moving Camera", in Proceedings of WIAMIS 97, Louvain-la-Neuve, Belgium, June 1997.
8. R. Mech, M. Wollborn, "A Noise Robust Method for 2D Shape Estimation of Moving Objects in Video Sequences Considering a Moving Camera", Signal Processing: Special Issue on Video Sequence Segmentation for Content-based Processing and Manipulation, Vol. 66, No. 2, pp. 203-217, April 1998.
9. MPEG-4, "Information Technology - Coding of Audio-visual Objects: Visual Committee Draft", Doc. ISO/IEC JTC1/SC29/WG11 N2202, Tokyo, March 1998.
10. J. Ostermann, "Segmentation of Image Areas Changed due to Object Motion Considering Shadows", in Y. Wang et al. (Eds.): "Multimedia Communications and Video Coding", Plenum Press, New York, 1996.
11. F. Pedersini, A. Sarti, S. Tubaro, "Combined Motion and Edge Analysis for a Layer-based Representation of Image Sequences", in Proceedings of IEEE ICIP 96, Lausanne, Switzerland, September 1996.
12. J. Stauder, "Segmentation of Moving Objects in Presence of Moving Shadows", in Proceedings of VLBV 97, Linkoeping, Sweden, July 1997.
13. A. Watt, "3D Computer Graphics", Addison-Wesley, 1993.
14. M. Wollborn, R. Mech, S. Colonnese, U. Mascia, G. Russo, P. Talone, J. G. Choi, M. Kim, M. H. Lee, C. Ahn, "Description of Automatic Segmentation Techniques Developed and Tested for MPEG-4 Version 1", Doc. ISO/IEC JTC1/SC29/WG11 MPEG97/2702, Fribourg, Switzerland, October 1997.
15. W. Zhang, F. Bergholm, "An Extension of Marr's Signature Based Edge Classification and Other Methods Determining Diffuseness and Height of Edges, and Bar Edge Weight", in Proceedings of Intern. Conference on Computer Vision, Berlin, Germany, May 1993.