High-quality real time motion detection using PTZ cameras

Alessandro Bevilacqua, Pietro Azzari

ARCES - DEIS (Department of Electronics, Computer Science and Systems) University of Bologna, Viale Risorgimento 2 - Bologna, ITALY, 40136

Abstract

Approaches based on background difference are the most widely used with fixed cameras to perform motion detection, because of the high quality of the segmentation they achieve. However, real time requirements prevent most of the algorithms proposed in the literature from exploiting background difference with Pan-Tilt-Zoom (PTZ) cameras in real world applications. Nevertheless, using color information to detect motion yields an appreciable improvement in the accuracy of the segmented masks, by helping to reduce camouflage and to detect shadows. To our knowledge, the algorithm we have conceived is the first to exploit color information and to perform real time alignment and background difference based on background mosaicing with PTZ cameras. In addition, no prior information regarding scene or camera parameters has been used for either spatial or tonal alignment. Accurate experiments performed on indoor and outdoor sequences allow us to assess both the quality and the performance of the method we devised.

1 Introduction

Accurate and reliable motion segmentation of video sequences is widely recognized as the first stage of many video processing applications, such as visual surveillance and traffic monitoring. With conventional stationary cameras, the background subtraction technique is known to offer the best tradeoff between the quality of the detected moving masks and the computational cost. This approach basically means comparing the current frame with a reference scene (a previously computed background). Moving “blobs” (aggregates of pixels) are identified by thresholding these differences. Many attempts have been made to improve the overall performance of motion detection systems, by improving the background differencing techniques, exploiting color information or using PTZ cameras to widen the surveyed area. However, the methods usually employed to perform background difference cannot be extended “as is” to work with hinged Pan-Tilt-Zoom cameras. Extending background subtraction algorithms requires a background mosaic to be available.

Only a few algorithms have been proposed in the literature to perform mosaic background detection with PTZ cameras. Most of them need to exploit prior information about scene or camera settings in order to achieve a good segmentation while fulfilling the real time requirements. As a matter of fact, it is often difficult, or even impossible, to define or to extract prior information from the hardware utilized or the surrounding environment. Finally, due to the real time requirements of visual surveillance systems, gray scale image processing has been the preferred choice because of its lower computational cost with respect to color image processing. This work describes a general purpose framework that has been devised to perform high quality motion detection in visual surveillance applications by using a PTZ camera. This has been realized by developing a very effective, yet efficient, color based background mosaicing algorithm. The tonal alignment as well as the background difference is performed using color images, thus yielding an appreciable improvement in the segmented moving masks. Moreover, the algorithm has been conceived to be completely image based, meaning that we do not rely on any prior information regarding camera intrinsics (focal length, distortion coefficients), scene geometry or feedback signals coming from the imaging device (pan/tilt angular movements, exposure settings). This choice makes the algorithm hardware independent, thus resulting in a general purpose approach.

This paper is organized as follows. Section 2 gives an overview of the state of the art in image mosaicing and motion detection. Section 3 describes the basic assumptions on which the algorithm relies. Section 4 details the spatial and tonal alignment methods utilized. The motion detection stage when using moving cameras is examined in Section 5. The experimental results discussed in Section 6 prove the effectiveness of the conceived solution. Concluding remarks are given in Section 7.

2 Previous work

In the last few years, many solutions have been proposed to use background subtraction even with hinged pan/tilt/zoom (PTZ) cameras, provided that a mosaic of the background scene is available. Nevertheless, achieving high quality mosaics of the background in real time is computationally intensive. Therefore, the most common approaches deal with batch surveillance applications [1], simplify the geometric transform model from projective to rigid 2D or affine [2], or exploit specific information regarding camera signals such as pan/tilt angles [3]. Most algorithms are prone to registration errors when building the mosaic. Often, the problem is tackled by using either robust direct non linear minimization methods, which are not compatible with our real time requirements, or prior information regarding either moving objects [1] or the real time camera position [4]. Other proposals rely on a global minimization over multiple frames using bundle adjustment techniques, which are known to be unsuitable for real time processing. As a significant example, the global registration technique used by the authors in [5] to handle the errors caused by the large dynamic layers in the scene requires the entire sequence to be known in advance. In [6] a novel technique to achieve real time spatial registration without using prior information is proposed. Although this spatial registration approach has proved to work well for evenly illuminated scenes, its qualitative performance decreases in the case of heavy environmental illumination changes between the frames to be mosaiced.

While the spatial registration problem has been studied thoroughly, tonal alignment techniques have been introduced quite recently, as far as real time systems are concerned. Different approaches have been proposed to deal with images exhibiting different photometric properties because they were captured at different exposure levels or under changing environmental lighting conditions. The authors in [7] compute an accurate photometric calibration of the imaging device to obtain its response function. The knowledge of the response function, along with the exposure ratios, makes it possible to fuse multiple photographs into one high dynamic range radiance map whose pixel values are proportional to the true radiance values of the scene. While being very accurate, this approach requires a priori information that is not always available. The authors in [8] [9] implement a linear parametric regression over comparagrams (which basically are the images’ joint histograms) employing the so-called affine correction. Nevertheless, it has been demonstrated [9] that affine correction has a limited field of applicability and is not suitable for images that are highly under- or over-exposed. The authors in [4] propose a motion segmentation algorithm for extracting foreground objects with a pan-tilt camera using cylindrical mosaics. Here, the histogram specification technique is applied to correct possible tonal misalignments between the connecting head and tail frames only, leaving all the remaining frames out of consideration. Finally, using color-based image processing helps to reduce the effects of shadows and highlights on the detected objects. For instance, the authors in [10] suggest exploiting the HSV color space to improve the accuracy in the detection and suppression of moving shadows in systems for object detection and tracking.

3 Mosaicing principles

A mosaic is a compound image built by properly composing (aligning) a high number of frames and warping them into a common reference coordinate system, both spatial and tonal. The result is a single image of greater resolution or spatial extent. The approach is known to be mathematically well grounded if images are taken according to the following assumptions:

• images of an arbitrary scene acquired with a camera free to pan, tilt and rotate about its optical axis are taken from the same location;

• images of a planar scene are taken from arbitrary locations;

• images are acquired using an ideal pin hole camera.

Under these assumptions, projective coordinate transformations (homographies) represent the transformation occurring between frames captured by the imaging device. Choosing a projective transformation model makes it possible to treat every arbitrary displacement from the observation point in a homogeneous way. The homographic, or projective, mapping has eight degrees of freedom and is given by:

$$\mathbf{x}' = \frac{A\mathbf{x} + \mathbf{b}}{\mathbf{c}^T\mathbf{x} + 1} \qquad (1)$$

where $A = [a_{11}, a_{12}; a_{21}, a_{22}]$ is a 2 × 2 scaling and rotation matrix, $\mathbf{b} = [b_1, b_2]^T$ is a 2 × 1 translation vector and $\mathbf{c} = [c_1, c_2]^T$ is a chirping, or “perspective”, vector. The spatial coordinates are denoted by $\mathbf{x} = [x, y]^T$, and $\mathbf{x}'$ denotes the transformed coordinates. Other models have been used in the literature, such as pure translational, rigid or affine. However, none of them can describe the transformation occurring between frames captured by a PTZ camera as precisely as the projective model. These models have been mainly used to reduce the computational cost required by the parameter estimation or to determine the initial guess for non linear minimization.

So far, we have dealt with geometric alignment only. However, global illumination changes or fluctuations in the images’ exposure that might occur between the current frame and the corresponding region within the mosaic could strongly affect the registration process as well as further processing involving the mosaic. Such issues generate unwanted creases (color gradients) in the composite image which do not correspond to any physical structure of the scene. As a consequence, a mosaicing technique must deal with illumination changes. It is therefore a key issue to compute an intensity mapping function so as to perform a tonal, or photometric, registration jointly with the geometric spatial alignment.
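As a rough illustration of the projective mapping in Eq. (1), the following minimal sketch maps a single point through given parameters A, b and c. The function name `apply_homography` and the sample parameter values are assumptions made here for illustration only; they do not come from the paper.

```python
def apply_homography(A, b, c, x):
    """Map a 2D point x through Eq. (1): x' = (A x + b) / (c^T x + 1)."""
    (a11, a12), (a21, a22) = A
    px, py = x
    # Scalar denominator c^T x + 1 (the "chirping" or perspective term).
    denom = c[0] * px + c[1] * py + 1.0
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("point maps to infinity under this homography")
    return ((a11 * px + a12 * py + b[0]) / denom,
            (a21 * px + a22 * py + b[1]) / denom)


identity = ((1.0, 0.0), (0.0, 1.0))

# Pure translation: A is the identity and c is zero (hypothetical values).
shifted = apply_homography(identity, (5.0, -2.0), (0.0, 0.0), (10.0, 20.0))

# With a non-zero chirp vector the point is rescaled by the denominator.
chirped = apply_homography(identity, (0.0, 0.0), (0.01, 0.0), (100.0, 50.0))
```

Setting c to zero reduces the mapping to the affine model mentioned above, which is one way to see why the simpler models are special cases of the projective one.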

4 Mosaicing algorithm

Many methods are known to perform spatial image alignment, based on SSD minimization of pixel intensities, phase correlation or feature tracking. The first are too computationally intensive for real time applications. The second are mostly used to build mosaics by collecting frames coming from rigid or, at most, affine camera motion. In order to meet the real time requirements when dealing with projective transforms we have chosen the feature based approach. For each frame, a set of corner points is extracted and tracked in the subsequent frame using an accurate and efficient feature tracker. The corner points that are consistent with the 3D scene references (stationary with respect to the acquisition device) are useful to determine the camera movement from the observer’s point of view. A simple yet efficient clustering method [6] is used to filter out corner points which exhibit inconsistent motions (e.g. errors in feature matching or features belonging to moving objects). The redundancy of corner point motion vectors with respect to the number of model parameters requires an overconstrained linear system to be solved. In fact, we can model our problem as a system of n linear equations in p unknowns, where n is the number of corner points identified as inliers by the clustering method and p is the number of parameters of our preferred motion model, namely the projective model. We know from linear algebra that in general such a system has no exact solution. These problems are typically poorly conditioned, which makes the solution unstable. For this reason the system is solved using the Singular Value Decomposition (SVD) method after a prior normalization of the corner positions.

The approach just described works well in most cases. However, since the problem is intrinsically ill conditioned, frame to frame registration introduces small alignment errors that accumulate. These errors become more evident when a video sequence returns to a previously captured location (a problem known as “looping path”). To reduce this problem, we have conceived an effective approach based on a feedback signal coming from the frame to mosaic alignment. See [6] for more details.

As far as range alignment is concerned, tonal misalignments are common occurrences with digital photographs. If they are not properly taken into account, the resulting panorama will appear to have seams, even when the images are blended in overlapping regions. In our case, the motion detector based on background subtraction may erroneously interpret these artifacts as moving objects, thus generating false alarms. Once a successful spatial registration is achieved, we assume that the possible remaining discrepancies between corresponding pixels are due to photometric misalignments. The illumination changes we are interested in are:

• automatic camera exposure adjustments (e.g. AGC);

• environmental illumination changes (e.g. daytime changes).
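The spatial registration step described earlier in this section (an overconstrained linear system solved by SVD after normalizing the corner positions) can be sketched as follows. This is a generic normalized direct linear transform written with NumPy, not the authors' implementation: the function names, the particular normalization (zero mean, mean distance of sqrt(2)) and the synthetic test data are all assumptions, and the clustering step that selects the inliers is not reproduced.

```python
import numpy as np


def normalize_points(pts):
    """Shift points to zero mean and scale them to a mean distance of sqrt(2)."""
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return (T @ homog.T).T[:, :2], T


def estimate_homography(src, dst):
    """Least-squares homography from n >= 4 inlier correspondences via SVD."""
    src_n, T_src = normalize_points(src)
    dst_n, T_dst = normalize_points(dst)
    rows = []
    for (x, y), (u, v) in zip(src_n, dst_n):
        # Each correspondence contributes two linear equations.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The right singular vector of the smallest singular value minimizes |M h|.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H_n = vt[-1].reshape(3, 3)
    # Undo the normalization and fix the scale so that H[2, 2] = 1.
    H = np.linalg.inv(T_dst) @ H_n @ T_src
    return H / H[2, 2]


# Synthetic check with a hypothetical ground-truth homography.
H_true = np.array([[1.0, 0.02, 5.0],
                   [-0.01, 1.0, -3.0],
                   [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [30, 10], [20, 60]], dtype=float)
proj = (H_true @ np.column_stack([src, np.ones(len(src))]).T).T
dst = proj[:, :2] / proj[:, 2:3]
H_est = estimate_homography(src, dst)
```

The normalization matters because raw pixel coordinates mix values of very different magnitudes in the same matrix, which worsens the conditioning of the system.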

Tonal alignment has to be performed after the spatial registration stage. It has to be tolerant with respect to a wide variety of issues. Spatial registration inaccuracies (e.g. due to homographies not modeling the camera motion exactly or to small alignment errors) and the presence of moving objects are issues to account for. These considerations have prompted us to adopt a histogram-based approach, which makes it possible to partly overcome the above-mentioned inaccuracies. The histogram specification technique is a histogram-based approach that aims to find an intensity mapping function able to transform a given cumulative histogram H1 into a target cumulative histogram H2 belonging to a reference image. Practically speaking, in the case of gray level images only 256 pairs of points (u1, u2) derived from H1 and H2 are considered and the outcome is simply a look-up table (LUT):

$$u_2 = H_2^{-1}(H_1(u_1)) \qquad (2)$$

This method has been conceived to work with gray scale images. Its extension to color image processing requires handling the correlation between color channels, which could otherwise produce unexpected hues not present in the source images. More details are given in [11].
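The look-up table of Eq. (2) can be sketched for the gray level case as follows. This is a minimal illustration, assuming 8-bit intensities given as flat lists and a pseudo-inverse of H2 defined as the smallest level whose cumulative value reaches H1(u1). The function names are hypothetical, and the color extension described in [11] is not reproduced.

```python
def cumulative_histogram(pixels, levels=256):
    """Normalized cumulative histogram of integer gray levels in [0, levels)."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    total = float(len(pixels))
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf


def specification_lut(source_pixels, reference_pixels, levels=256):
    """Build the LUT u2 = H2^{-1}(H1(u1)) of Eq. (2)."""
    h1 = cumulative_histogram(source_pixels, levels)
    h2 = cumulative_histogram(reference_pixels, levels)
    lut, u2 = [], 0
    for u1 in range(levels):
        # H1 is non-decreasing, so the matching level u2 only moves forward.
        while u2 < levels - 1 and h2[u2] < h1[u1]:
            u2 += 1
        lut.append(u2)
    return lut


# Toy data: the reference is the source brightened by 5 gray levels.
source = [10, 10, 20, 30]
reference = [15, 15, 25, 35]
lut = specification_lut(source, reference)
remapped = [lut[p] for p in source]
```

Because only the histograms are compared, and not pixel-by-pixel differences, small spatial misalignments and a few moving objects have a limited effect on the resulting table.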

5 Motion detection

Having a reliable background mosaic at our disposal has allowed us to directly extend our background subtraction algorithm for stationary cameras (for more details see [12]) to moving PTZ cameras. Although one may think that the mosaicing algorithm is devoted only to the creation of the background mosaic during the bootstrap sequence and is then left unused, this is not the case. As a matter of fact, in order to perform background subtraction and maintenance, a key issue is how to find out which part of the mosaic the current frame’s background corresponds to. Since homographic mappings form a group, the functional composition of two (or more) homographic transformations is again a homographic mapping. Practically speaking, an “incremental” transform matrix is built up over time, “linking” all the transformation matrices by multiplication. In this way we record the current position of the camera with respect to the corresponding reference frame in the mosaic, which makes locating the corresponding region a trivial task. After the relevant portion of the background mosaic has been indexed, we backproject it using the inverse incremental transformation matrix to remove the distortion introduced by the projective transformation. Again, this approach is legitimate because a homography is a non singular linear transformation of the projective plane, which ensures that the inverse matrix always exists.

As a final remark, we want to stress that the need for real time mosaicing and the desire to exploit color introduce a huge demand for computational resources. Nevertheless, color image processing achieves important improvements in terms of shadow removal and reduction of camouflage, since different color tuples can map to the same gray level value. Therefore, we have tried subsampling incoming frames to speed up the overall computation. As the experimental results show, the comparable quality measures prove that using subsampled images can increase computation speed while preserving the motion detection quality.
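The incremental transform and its inverse can be sketched with plain 3 × 3 matrices in homogeneous coordinates. This is only an illustration under stated assumptions: each frame to frame matrix is assumed to map coordinates of frame k into frame k-1, so the product H_1 H_2 ... H_k maps the current frame into the reference frame. The helper names and the two translations used as sample motion are hypothetical.

```python
def matmul3(P, Q):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inverse3(M):
    """Inverse of a non singular 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular, so it is not a valid homography")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def apply3(H, p):
    """Apply a 3x3 homography to a 2D point using homogeneous coordinates."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def accumulate(transforms):
    """Chain frame to frame homographies into one frame to reference matrix."""
    total = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for H in transforms:
        total = matmul3(total, H)
    return total


def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


# Two hypothetical camera moves: frame 1 -> reference, then frame 2 -> frame 1.
incremental = accumulate([translation(3.0, 0.0), translation(0.0, 4.0)])
in_mosaic = apply3(incremental, (1.0, 1.0))          # current frame -> mosaic
back_in_frame = apply3(inverse3(incremental), in_mosaic)  # mosaic -> current frame
```

Keeping a single accumulated matrix means that locating the current frame inside the mosaic costs one matrix product per frame, whatever the length of the sequence.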

6 Experimental results

Extensive experiments have been carried out in order to compare the accuracy of gray scale and color mosaics, to assess the improvement in the quality of the detected moving masks introduced by color processing, and to evaluate computational time performance. To this purpose, two challenging indoor and outdoor sequences have been considered, which differ in illumination, scene structure and depth of field of view. Both sequences have a resolution of 320 × 240 pixels. The camcorder was panned by hand across the scene back and forth many times, to emphasize the usual looping path problem from which many mosaicing algorithms suffer. The camcorder was hinged on a tripod in order to make it rotate about its optical center, thus justifying the domain motion model employed (images related by homography). Also, in order to enable the algorithm to perform in real time while maintaining a high quality of results, the experiments have been carried out on both full resolution and half resolution sequences (size reduction is performed in real time). As the target machine, we used an AMD 2000+ equipped with 1GB RAM.

In Figure 1 one can see the half resolution mosaics obtained by processing the outdoor (top and bottom) and the indoor (middle) sequences. The first outdoor sequence DCOURT (Figure 1 top) has been acquired by manually scanning the scene along the horizontal direction (pure panning). The scene shows a close wall of a building on the left, a farther gate and one walking person. The ground is not planar, and the wide field of view captured during the panning and the rather structured scene (hedges and trees) can emphasize even small alignment errors. The second sequence DLAB (Figure 1 middle) deals with an indoor environment with very close objects and, again, a walking person. Here the problems are no longer concerned with the structure of the scene. Rather, the short distance between the observer and the surrounding environment means that the assumption of a plane at infinity (which underlies the use of homographies) is hardly fulfilled, thus leading to erroneous motion parameter estimation and, accordingly, to alignment errors. The third sequence (Figure 1 bottom) has been captured by both panning and tilting the camera. As one can see, the resulting mosaics are almost free of stitching artifacts, although they refer to half resolution processing.

Figure 1: Mosaics obtained by our algorithm processing at half resolution the outdoor (top and bottom) and the indoor (middle) sequences.

Table 1 reports numerical results regarding the mosaic’s quality as well as the algorithm’s time performance on gray scale (GS) and color (RGB) sequences, at full and half resolution (2X).

Table 1: Mosaicing performance for DLAB and DCOURT sequences

Sequence           fps    σ          ρ
DLAB GS            5.2    7.1563     0.9954
DLAB RGB           3.9    9.9815     0.9945
DLAB GS (2X)       12     7.2592     0.9957
DLAB RGB (2X)      11     10.0212    0.9951
DCOURT GS          5      12.5634    0.9925
DCOURT RGB         3.8    13.1663    0.9921
DCOURT GS (2X)     12     11.3669    0.9960
DCOURT RGB (2X)    11     11.2547    0.9958

Two indicators are used to assess the registration accuracy attained: ρ and σ. ρ is the average of the cross correlation coefficients obtained by comparing each frame with its corresponding background region. It remains roughly the same when comparing GS and RGB images at the same resolution and it even slightly improves for subsampled images. In all cases the values are very similar and close to 1, thus confirming the high quality of our stitching method. σ is the standard deviation computed over the history of values associated with each mosaic pixel, averaged over the entire mosaic. Table 1 reports the worst σ computed over the three color channels. This measure gives a hint of the dispersion, with respect to the mean, of the distribution of the pixels coming from different images and aligned within the same mosaic pixel. At first sight, it could seem an interesting indicator. However, the difficulty of extracting a measure of the statistical independence of the color channels makes the σ values of gray scale and color mosaics hardly comparable. Besides such indicators, time performance measures are reported in fps, since for real time applications this is an effective and direct indicator. As expected, the processing speed worsens for RGB frames with respect to GS images: by 25% at full resolution and by just about 10% for subsampled ones. In particular, a speed of 11 fps for RGB sequences allows full real time processing.

In the columns of Figures 2 and 3 one can see three frames extracted from two sequences showing the output of our motion detection for the same environments (DLAB and DCOURT, respectively) depicted in Figure 1. Here, the detected moving masks have been superimposed on the frames to make the visual inspection for accuracy assessment easier. On the left, one can see the quality of the detected masks using conventional gray scale frames. The images on the right show the improvements yielded by exploiting color information. Having a color mosaic at our disposal makes it possible to use different color spaces to get a significant improvement in the motion detection outcomes. Performing background subtraction in a different color space, such as HSV or YCrCb, makes it possible to reveal moving shadows and to discard them when detecting motion. Shadows can have a very detrimental effect, especially in outdoor environments, deforming the shape of moving objects, misleading object recognition and altering the results of further processing tasks such as tracking. Accordingly, exploiting color rather than gray level information only yields a remarkable improvement, as one can see by comparing the columns of Figures 2 and 3. Also, the loss in performance when exploiting color information is negligible considering the benefits in terms of accuracy of the moving masks (see Table 1). Moreover, background differencing using downsampled frames can be performed in real time, obtaining motion clues comparable to those of full resolution frames.

Figure 2: In column, two gray scale (left) and color (right) sequences of three frames each, coming from DLAB sequence processed at half resolution.

In the samples depicted in Figure 2, a person is entering the room and casting his shadow on the wall at the back (left). However, the shadow is removed when using chromaticity (right). In the second set of samples, depicted in Figure 3, a walking person is moving around a sunlit courtyard. In the outdoor scene, the cast shadow is even more pronounced than in the indoor one. Although it is marked and visible in the gray level sequence (left), it has been completely removed in the color one (right).

Figure 3: In column, two gray scale (left) and color (right) sequences of three frames each, coming from DCOURT sequence processed at half resolution.

7 Conclusion

In this work we reported the outcome of our research concerning a motion detection algorithm capable of working in real time with a PTZ camera, relying on color image-based mosaicing and without exploiting any prior information regarding either the scene or the acquisition device. Besides exploiting spatial registration, we conceived a method to perform tonal registration effectively in order to overcome the errors arising from photometric misalignments. Also, the high quality attained by mosaicing, and accordingly by the background differencing algorithm, makes it possible to achieve effective real time motion detection by working at low resolution. This has been proved by extensive experiments carried out indoors as well as outdoors. Besides, exploiting color information improves the overall quality of the detected masks, mainly in terms of reduction of camouflage and moving shadow detection. Research is being carried out to improve both spatial and tonal alignment, so as to make them more robust, more general purpose and faster. In addition, scale and rotation invariant features will provide more reliable tracking and mosaicing and, accordingly, more accurate motion detection.

References

[1] Y. Sugaya and K. Kanatani. Extracting moving objects from a moving camera video sequence. In Proc. of Symposium on Sensing via Image Information, pages 279–284, June 2004.

[2] F. Winkelman and I. Patras. Online globally consistent mosaicing using an efficient representation. In Proc. of IEEE Intl. Conf. on Multimedia Computing and Systems, pages 3116–3121, October 2004.

[3] E. Hayman and J. Eklundh. Statistical background subtraction for a mobile observer. In Proc. of ICCV, pages 67–74, 2003.

[4] K. S. Bhat, M. Saptharishi, and P. K. Khosla. Motion detection and segmentation using image mosaics. In Proc. of IEEE Intl. Conf. on Multimedia and Expo, volume 3, pages 1577–1580, 2000.

[5] A. Bartoli, N. Dalal, B. Bose, and R. Horaud. From video sequences to motion panoramas. In Proc. of Workshop on Motion and Video Computing, pages 201–207, December 2002.

[6] A. Bevilacqua, L. Di Stefano, and P. Azzari. An effective real-time mosaicing algorithm apt to detect motion through background subtraction using a PTZ camera. In Proc. of AVSS, volume 1, pages 511–516, 2005.

[7] Y. Tsin, V. Ramesh, and T. Kanade. Statistical calibration of CCD imaging process. In Proc. of ICCV, volume 1, pages 480–487, 2001.

[8] D. P. Capel. Image Mosaicing and Super-Resolution. University of Oxford, 2001.

[9] S. Mann. Pencigraphy with AGC: Joint parameter estimation in both domain and range of functions in same orbit of the projective Wyckoff group. In Proc. of ICIP, volume 3, pages 193–196, September 1996.

[10] R. Cucchiara, C. Grana, A. Prati, and S. Sirotti. Improving shadow suppression in moving object detection with HSV color information. In Proc. of IEEE Intelligent Transportation System, pages 360–365, 2001.

[11] P. Azzari and A. Bevilacqua. Joint spatial and tonal mosaic alignment for motion detection with PTZ camera. In Proc. of ICIAR, September 2006.

[12] A. Bevilacqua, L. Di Stefano, and A. Lanza. An efficient motion detection algorithm based on a statistical non parametric noise model. In Proc. of ICIP, pages 2347–2350, October 2004.
