Ninth International Workshop on Image Analysis for Multimedia Interactive Services
Autonomous and Adaptive Learning of Shadows for Surveillance

Hasan Celik, Andoni Martin Ortigosa, Alan Hanjalic, Emile A. Hendriks
Delft University of Technology, Information and Communication Theory Group
{h.celik, a.hanjalic, e.a.hendriks}@tudelft.nl, [email protected]

Abstract
Object detection is a critical step in automating monitoring and surveillance tasks. To maximize its reliability, robust algorithms are needed to separate real objects from moving shadows. In this paper we propose a framework for detecting moving shadows cast by moving objects in video. The framework first learns, autonomously and on-line, the characteristic features of typical shadow pixels in various parts of the observed scene. The collected knowledge is then used to calibrate the system for the given scene and to identify shadow pixels in subsequent frames. Experiments show that our system performs well while being more adaptable than existing approaches and using brightness information only.

1. Introduction
The fast emergence of capturing and storage technology in recent years, combined with rapidly rising computational power, has opened the door to the development of automated, real-time surveillance and monitoring solutions. These solutions are intended not only to facilitate the implementation of public safety standards, but also to introduce paradigm shifts in surveillance, traffic monitoring (e.g. highway usage statistics, congestion monitoring, detection of dangerous situations) and human activity observation (e.g. in elderly homes). Considering the general definition of visual surveillance and monitoring as a concept that attempts to detect, recognize and track certain objects in image sequences, and more generally to understand and describe object behavior [1], object detection appears as a fundamental component of all surveillance solutions. In most static-camera setups, this detection task starts by extracting moving blobs, i.e. motion segmentation. Many solutions, ranging from frame differencing to Gaussian mixture modeling [1], have been proposed, but they focus primarily on pixel value changes and cannot handle problems caused by the shadows of moving objects. Such methods are thus not sufficient for the subsequent object detection and object behavior analysis steps. There is thus a need to remove shadows from blobs.

The theoretical basis of many proposed shadow detection approaches lies in color constancy and/or brightness distortion [2]. These approaches compare pixel values between the reference frame and other frames using off-line trained, fixed and often heuristic thresholds. While they can prove effective in specific cases, this effectiveness does not extrapolate straightforwardly to the general case. Other, more recent solutions exploit the fact that shadows appear frequently at a given pixel location and build statistical models [8, 9], e.g. multivariate Gaussians as in [8], to learn the characteristics of shadows in color images on-line. However, such methods cannot be applied to gray-level videos. In this paper, we present an autonomous and adaptive approach to shadow removal for both gray-level and color videos. The approach does not require the shadow detection parameters to be set beforehand; instead, they are learned automatically on-line by collecting the necessary information along the video. The main contributions of this paper can be summarized as follows:
• a method for automatically identifying shadow pixels based on long-term scene observation;
• independent learning of parameters for a given scene and for a specific location in the frame;
• reliance on brightness information only.
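As a toy illustration of why the motion-segmentation baselines mentioned above are insufficient, the frame-differencing idea can be sketched against a background reference. This is a hypothetical NumPy sketch, not code from the paper; the function name, threshold value and array sizes are illustrative assumptions:

```python
import numpy as np

def moving_blob_mask(frame, reference, thresh=25):
    """Naive frame-differencing motion segmentation (illustrative only).
    Pixels whose absolute brightness difference from the background
    reference exceeds `thresh` are marked as moving. Note that cast
    shadows also darken the background, so they end up in this mask --
    exactly the problem addressed in this paper."""
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    return diff > thresh

# Toy example: a flat background with one bright "object" pixel and
# one darker "shadow" pixel.
ref = np.full((4, 4), 100, dtype=np.uint8)
cur = ref.copy()
cur[0, 0] = 200   # moving object pixel
cur[3, 3] = 60    # shadow pixel (darker than background)
mask = moving_blob_mask(cur, ref)
# Both the object and the shadow pixel are flagged as "moving".
```

The sketch shows that a pure brightness-change detector cannot distinguish the shadow at (3, 3) from the object at (0, 0), which motivates a dedicated shadow-removal stage.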
978-0-7695-3130-4/08 $25.00 © 2008 IEEE DOI 10.1109/WIAMIS.2008.26
2. Proposed Approach

2.1. Design choices
We address the shadow detection problem in four general steps, as illustrated in Figure 1. Before elaborating on these steps, we briefly introduce the innovative elements and list the assumptions that we consider valid in the remainder of the paper. Aiming at fully autonomous and self-adaptable surveillance systems, we base our approach as much as possible on information learned from the observed scene. In other words, the camera is given sufficient time to observe a scene before it actually starts identifying shadow pixels. During this time, statistical information about the feature values of moving-object pixels appearing in the scene is collected
(the “Building shadow statistics” block). This information is extracted from the output of a standard background/foreground separation algorithm and is used to calibrate the system for the observed scene. Once reliable statistics on moving-object pixels in the scene have been collected, the typical feature value for shadows is estimated. As opposed to classical methods, which require manual setting of parameters, the parameters of our system are learned fully automatically for specific regions of the scene. In the last step (“Shadow detection”), shadows are detected in any new video frame based on the match between the characteristic features of pixels from moving blobs and the estimated models. During the decision process, the darkest shadow pixels (umbra) are identified first, as they are easily distinguished from the rest of the scene. Subsequently, the penumbra pixels (the brighter outer areas of shadows) connected to the detected umbra pixels are also recognized, improving the overall accuracy of the method.
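This calibrate-then-detect idea can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: the function names, histogram settings, tolerance value, and the convention that the frame ratio is the analyzed brightness divided by the reference brightness are all assumptions.

```python
import numpy as np

def learn_shadow_ratio(ratios, bins=100, lo=0.0, hi=1.0):
    """Calibration step: estimate the typical shadow brightness ratio
    for one scene region as the mode of the accumulated frame-ratio
    histogram (bin count and range are illustrative assumptions)."""
    hist, edges = np.histogram(ratios, bins=bins, range=(lo, hi))
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])   # center of the modal bin

def is_umbra(ratio, mode, tol=0.1):
    """Detection step: classify a pixel's frame ratio as umbra if it
    lies near the learned mode (tolerance is illustrative)."""
    return abs(ratio - mode) < tol

# Toy calibration: shadow pixels dim the background to ~55% brightness.
rng = np.random.default_rng(0)
shadow_ratios = rng.normal(0.55, 0.02, 2000)
mode = learn_shadow_ratio(shadow_ratios)
# A pixel at 54% of its reference brightness matches the shadow model;
# a pixel at 95% (slight noise on the background) does not.
```

The key design choice mirrored here is that nothing is hand-tuned per deployment: the shadow model (the mode) is recovered from accumulated observations of the scene itself.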
Figure 1: Block diagram of our approach (Video → Detecting moving blobs → Building shadow statistics → Detecting shadows (umbra) → Detecting shadows (penumbra)).

2.2. Detecting moving blobs
First, we apply background subtraction as in [6], in which every initial reference pixel is set to the median value of previous observations. The reference is then updated at each frame, also taking previous references into account. We emphasize, however, that the subsequent steps in the scheme of Figure 1 do not depend on the choice of background subtraction method; the robust approach of [3] was also tested. It is only important to maintain a reference, which is used to compute the brightness ratios that identify shadows. As no color information is used, the approach of [6] is more suitable in our case, since [3] is computationally more expensive.

2.3. Building shadow statistics
To maximize the applicability of our approach to general (also non-color) surveillance and monitoring videos, we rely on gray-level pixel information (brightness) only. As in the existing methods of [4], [5], [6] and [7], we use the ratio between the background model and the analyzed image to distinguish between shadows and actual moving objects. The rationale is that this ratio should be nearly constant on a homogenous background, independently of the objects that produce the shadow pixels. However, we adapt this idea to account for the fact that a scene is not necessarily homogenous, so the homogeneity assumption may fail in the general case. We therefore split the scene into maximally homogenous regions by dividing it into rectangular areas on which the analysis is performed (Figure 2a). In the subsequent steps, the parameters are learned and applied for each of these homogenous regions individually. Consequently, all processing steps mentioned in the remainder of this paper are performed per region, unless stated otherwise.

As indicated by the cases shown in Figure 2b, it can be assumed that the shadow part of a blob is attached to its object part. Thus, if we split all blobs into blocks, we can expect shadows to dominate in some of these blocks, so that the statistics collected there reveal the properties of the shadow part only. The number of blocks used to segment each blob can be set a priori. While different block sizes are possible, the best results were achieved by dividing each blob into eight blocks: smaller blocks may not contain enough blob pixels for reliable statistics, whereas larger blocks typically contain pixels from both the shadow and the object. We assume that each scene has its specific light sources, whose positions do not change abruptly and which determine the size, shape and positioning of the shadow with respect to “its” object, depending on the position of that object in the scene. Consequently, the index of the block in which the shadow dominates (e.g. block 4 in Figure 2b) can be assumed constant. To build the shadow statistics, we split all detected blobs into eight blocks. Then, for each block and for all blobs, frame ratios (the brightness ratio between the reference and the analyzed image) are computed and accumulated into a histogram. As the brightness of shadows is assumed to remain roughly constant, we expect the block with the lowest variance to be the shadow block (or the block where shadow dominates). This is visible in Figure 3, where the histogram on the upper left, corresponding to a shadow block, shows lower variance than the others. This assumption might fail for blocks that do not fit this setup, e.g. if an actual object falls into the shadow block. Therefore, the variance is computed on a thresholded version of the histogram, composed of all values in the range [m − ε, m + ε], where ε is a parameter in the range [0.075, 0.125] and m is the mode. The statistics are accumulated until they are considered reliable for the subsequent detection steps. Our experiments showed that 2000 to 3000 accumulated pixels per block are sufficient for this purpose. This cannot be converted into a fixed duration, as it depends on the scene content (number of objects crossing the scene, their size, etc.): the duration is shorter if the observed scene is crowded and longer if fewer objects are present.

Figure 2: Splitting of (a) a scene into homogenous regions and (b) a blob into blocks.

Figure 3: Top row: histograms of frame ratios for a shadow block (left) and another block (right). Bottom row: the corresponding thresholded histograms. The horizontal axis corresponds to frame ratio values, the vertical axis to their frequencies.

2.4. Umbra detection
Umbra detection is done by hysteresis thresholding of the frame ratios and consists of two steps. First, a seed binary image is created with the lower threshold ∆Th1. Second, a binary mask is obtained by a less restrictive thresholding (using the higher threshold ∆Th2), after which the seed image is grown morphologically within this mask. This yields the umbra part of the shadow. The lower and upper thresholds are chosen in the ranges [0.1, 0.15] and [0.2, 0.25], respectively. Growing is performed with a 2×2 pixel block. Umbra parts can be found reliably since umbra pixel frame ratios are mostly concentrated near the mode. To reduce false umbra detections in a given blob, only the largest connected part classified as umbra is kept for the later stage (penumbra detection).

2.5. Penumbra detection
The remaining shadow pixels are generally located next to the umbra and form small parts: the penumbra. Actual moving objects are generally not in contact with the penumbra, since it lies in the outer portions of shadows. These parts are classified as shadow if their area is below a given value; we experimentally found that a threshold of 100 pixels gives satisfactory results. Subsequently, a dilation with a 4×4 pixel block is applied to the obtained shadow image in order to capture further penumbra parts as shadow (especially those in contact with the actual moving object). This dilation is performed within the mask image, so as not to grow wrongly classified noisy pixels (see Figure 4, person on top).

Figure 4: (a) umbra; (b) penumbra and umbra.

3. Experiments and Results
A set of eight different videos was tested, forming a complete benchmark set in which outdoor and indoor sequences, as well as different kinds of shadows and objects of different sizes, are present. Five of these videos are publicly available and have been used in a benchmark paper [2], in which two performance measures are also proposed:
• shadow detection accuracy η,
• shadow discrimination accuracy ξ.
They are defined as follows:
η=
TPS TPS + FN S
ξ=
TPF TPF + FN F
where S stands for shadow and F for foreground. True positives (TP) are the pixels of a class correctly classified as that class, and false negatives (FN) are the shadow pixels that have not been identified as shadow [2]. Finally, TP_F is the number of ground-truth foreground pixels minus the number of those pixels classified as shadow in the
foreground. There, four conceptually different approaches were compared: a statistical non-parametric approach (SNP) [4], a statistical parametric method (SP) [5], and two deterministic non-model-based methods, DNM1 [6] and DNM2 [7]. We compare these methods and our approach on the five public sequences, as shown in Table 1. In addition, Table 2 shows the performance of our approach on three further videos, shown in the bottom row of Figure 5.

Figure 5: Test data. The first five sequences are publicly available. From left to right and top to bottom: Highway 1, Highway 2, Campus, Intelligent Room, Lab, Aula, EWI, and Mall.

Table 1: Comparison of our method to existing approaches (first four rows taken from [2]).

           Highway 1       Highway 2       Campus          Laboratory      Intell't Room
           η (%)  ξ (%)    η (%)  ξ (%)    η (%)  ξ (%)    η (%)  ξ (%)    η (%)  ξ (%)
SNP        81.59  63.76    51.20  78.92    80.58  69.37    84.03  92.35    72.82  88.90
SP         59.59  84.70    46.93  91.49    72.43  74.08    64.85  95.39    76.27  90.74
DNM1       69.72  76.93    54.07  78.93    82.87  86.65    76.26  89.87    78.61  90.29
DNM2       75.49  62.38    60.24  72.50    69.10  62.96    60.34  81.57    62.00  93.89
Proposed   79.74  90.07    41.09  75.57    77.36  98.18    67.18  96.52    86.24  98.96

Table 2: Performance on other sequences.

           Aula            EWI             Mall
           η (%)  ξ (%)    η (%)  ξ (%)    η (%)  ξ (%)
Proposed   76.40  96.34    82.45  96.06    72.51  79.66

The best results are obtained on the Intelligent Room sequence, on which our method outperforms the existing approaches on both metrics. On the remaining sequences, performance is comparable, meaning that we achieve both high detection accuracy and high discrimination accuracy. It is important to note, however, that our method does not use color, unlike the compared methods, and automatically learns the important parameters from the observed scene. The best results are obtained in videos with relatively large shadows: in these cases, adequate thresholds are easily identified, since one of the blocks can be expected to contain mainly shadow parts of blobs. We observed this in videos such as “Intelligent Room” and “Campus”.

4. Conclusions and Future Work
We presented in this paper a shadow removal method which is autonomous, adaptive and based on using the brightness component only. In other words, the applicability scope of our algorithm is very broad: color information is not used, yet the algorithm performs as well as other approaches. Moreover, we made no restrictive assumptions about shadows, i.e. we used no hard-wired models. Another important aspect we wish to emphasize is that the algorithm learns from the scene the information required to automatically select the parameters used to detect shadows. The only requirement is the acquisition of a sufficient amount of information from the scene in order to select the adequate thresholds and parameters. This requirement is not restrictive, as in a practical situation a surveillance system would be given enough time to observe the scene in which it is supposed to work. Future work concerns the segmentation of the scene into homogenous regions, for instance by a simple split algorithm that finds low-variance regions, as well as a more elaborate analysis of the data collected during the calibration (learning) phase.

5. References
[1] W. Hu, T. Tan, L. Wang, S. Maybank, “A Survey on Visual Surveillance of Object Motion and Behaviors”, IEEE Transactions on Systems, Man, and Cybernetics, 34(3), August 2004.
[2] A. Prati, I. Mikic, M.M. Trivedi, R. Cucchiara, “Detecting Moving Shadows: Algorithms and Evaluation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7), July 2003.
[3] C. Stauffer, W.E.L. Grimson, “Adaptive Background Mixture Models for Real-Time Tracking”, IEEE Conf. on Computer Vision and Pattern Recognition, 1999.
[4] T. Horprasert, D. Harwood, L.S. Davis, “A Statistical Approach for Real-Time Robust Background Subtraction and Shadow Detection”, Int’l Conf. on Computer Vision, 1999.
[5] I. Mikic, P. Cosman, G. Kogut, M.M. Trivedi, “Moving Shadow and Object Detection in Traffic Scenes”, Int’l Conference on Pattern Recognition, September 2000.
[6] R. Cucchiara, C. Grana, G. Neri, M. Piccardi, A. Prati, “The SAKBOT System for Moving Object Detection and Tracking”, Video-Based Surveillance Systems—Computer Vision and Distributed Processing, 2001.
[7] J. Stauder, R. Mech, J. Ostermann, “Detection of Moving Cast Shadows for Object Segmentation”, IEEE Transactions on Multimedia, 1(1), pp. 65–76, March 1999.
[8] F. Porikli, J. Thornton, “Shadow Flow: A Recursive Method to Learn Moving Cast Shadows”, Int’l Conference on Computer Vision, 2005.
[9] N. Martel-Brisson, A. Zaccarin, “Learning and Removing Cast Shadows through a Multidistribution Approach”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7), July 2007.