Moving Object Segmentation Using Super-Resolution Background Models

Joshua Migdal, Tomas Izo and Chris Stauffer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, MA 02139
Email: {jmigdal,tomas,stauffer}@csail.mit.edu

Abstract

Focus-of-attention camera systems comprised of stationary, far-field master cameras and pan-tilt-zoom slave cameras offer a practical way of consistently tracking all objects across a large area while simultaneously obtaining detailed imagery of activity for wide-area surveillance. In order for such high-resolution data to be useful for motion-based activity analysis, these systems should be able to segment and track moving objects in a high-resolution view with minimal or no delay upon fixating on a given part of the scene. We present a method that exploits the same wide-area adaptive background model used for consistently tracking all objects in a scene to enable a high-resolution segmentation of moving objects. Assuming a known or learned homography between the two views, we use a technique akin to super-resolution to infer from a section of the low-resolution background model its corresponding high-resolution model, which we refer to as the super-resolution background model. Using this method, we are able to segment moving objects from the background in a single high-resolution image.

Figure 1: A focus-of-attention approach to tracking minimizes the tradeoff between coverage and resolution in wide-area surveillance: (a) Tiling a scene at 10x resolution with enough coverage to allow hand-off would require ≈ 100 cameras (25 are shown) and a larger amount of computational resources to constantly acquire and process 10x higher resolution imagery. (b) A detail of the area of interest marked by the white rectangle in the top image shows the poor resolution and dynamic range of the far-field sensor, whereas the high-resolution view in (c) provides a much higher level of detail.

1. Introduction
For wide-area surveillance applications, there is an inherent trade-off between obtaining a broad field of view and high local resolution. The approach that maximizes coverage at the expense of detail is to use a wide field of view (FOV) camera to reliably track objects over a large area. However, as Figure 1b illustrates, such systems can do little more than register the gross positions and sizes of objects. Conversely, one may wish to maximize detail at the expense of coverage. Since determining the numbers and locations of all objects present in the area under surveillance is a baseline capability that all tracking systems should possess, this approach requires tiling several narrow FOV cameras together. Unfortunately, this approach scales quadratically in the zoom factor. Therefore, in a surveillance setting such as that shown in Figure 1, about 100 cameras would be needed to cover the entire area under surveillance. This is simply not practical.

The introduction of pan-tilt-zoom (PTZ) cameras to the field of automatic video surveillance has led to the potential for maximizing both coverage and resolution. This paper advocates using the minimum number of wide FOV cameras needed to reliably track objects over the entire area and using taskable PTZ cameras to capture high-resolution imagery and segmentations of moving objects of interest. Unlike the case of camera tiling mentioned previously, the number of PTZ cameras needed for consistent coverage scales linearly with the number of moving objects of interest rather than being quadratic in the zoom factor. PTZ cameras also have the capability of dynamically changing zoom to capture objects at their appropriate scale. Thus, a camera can zoom in to capture the high-resolution detail of a pedestrian while also being able to zoom out sufficiently to track a large vehicle. Unfortunately, because the PTZ camera must zoom in far enough to capture the detail of a moving object, its field of view is necessarily diminished: moving objects do not spend much time in view before the camera must pan or tilt to keep them in sight. Since it takes a finite but non-negligible amount of time to build up a background model, and since the moving object is present in the camera's field of view at all times, object segmentation via background subtraction has, to the best of our knowledge, not yet been demonstrated for cameras undergoing arbitrary pans, tilts, and zooms.

Using a far-field overview camera and a collocated, taskable PTZ camera, we can simulate the capabilities of a very high-resolution overview camera, including the ability to perform background subtraction. We achieve this by using the low- and high-resolution current input images, along with the low-resolution background model continuously maintained in the overview camera, to infer a high-resolution background model within the field of view of the zoomed-in, tasked PTZ camera. We are thus able to obtain high-resolution silhouettes from the first stationary frame of the PTZ camera onwards.

2. Related Work

Our work touches upon several areas of computer vision. From a surveillance and tracking perspective, our work is closely related to single-camera background subtraction [6, 11, 14, 16] and PTZ camera control. Many systems have been implemented that can task one or more PTZ cameras to observe areas of high interest [2, 17, 10, 12, 3]. The idea in this case is to acquire a high-fidelity view or snapshot of the activity of interest. There has also been work on acquiring automatic segmentations of moving objects from a PTZ camera using techniques generalized from frame differencing [2, 17]. However, these approaches begin the segmentation process at PTZ fixation onset (and thus cannot segment temporarily static objects) and are also subject to the foreground aperture problem [16]. An interesting approach was taken in [9], where background subtraction was performed in a single panning and tilting camera by registering the background model across overlaps after a pan or tilt operation. While the results they show are good, their system operated at a constant zoom factor and could not handle pans and tilts that did not overlap with the previous view.

Our work is also related to super-resolution, or image interpolation as it is sometimes called when the problem is to infer a high-resolution version of a low-resolution input image [7, 1]. Prior knowledge about the mapping between low- and high-frequency image detail is learned beforehand from multiple high-res and low-res image pairs. The zoom factors typically dealt with are between 2-4x. Our super-resolution problem involves inferring a high-resolution probabilistic background model from a low-resolution background model and a pair of images taken from an overview camera and a zoomed-in PTZ camera. Because of the unique constraints of our problem and the similarity between the background model and its image counterpart, we can achieve good results at much higher zooms. Since the background does not differ significantly from the high-resolution input image, and since we have a single corresponding high-res/low-res image pair, we use these as nonparametric interpolants for hallucinating the high-res background. This is similar in spirit to image analogies [8] and example-based texture synthesis [5, 4], where in our case the correspondences between texture samples in the low-resolution image and background model are applied to the high-resolution image to infer the high-resolution background. We describe this process in detail in Section 4.

3. System Overview

Focus-of-attention camera systems employ an omnipresent low-resolution scene model to continuously track all objects in a large environment. In some cases, it may be necessary to coordinate multiple static overlapping cameras to accomplish blanket coverage of an entire area, as in [15]. This work employs a single camera to track over the entire region shown in Figure 1. Given a new far-field, low-resolution image If, our system simultaneously adapts the current far-field background model Mf and produces a foreground segmentation Sf, which denotes where If does not match our adaptive background model Mf. In this work, our background model is a per-pixel unimodal Gaussian model with diagonal covariance matrices. Thus, the current background model is comprised of a background mean image Bf and a variance image Vf. To reduce the effects of noise and produce better segmentations, we employed the Markov smoothing technique described in [11]. Sf can be exploited to track all moving objects in the scene over time with reasonable reliability, assuming the number of foreground pixels detected on an object is significant relative to the inherent noise in the foreground segmentation. Unfortunately, both If and Sf contain little information about the true appearance and shape of objects due to lack of contrast and resolution in the wide-area camera.
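Before moving on, here is a minimal sketch of the far-field model just described: a per-pixel Gaussian mean and variance, thresholded per channel. The class name, adaptation rate, and threshold are our own illustrative choices, and the Markov smoothing of [11] applied to Sf is omitted.

```python
import numpy as np

class GaussianBackgroundModel:
    """Per-pixel unimodal Gaussian background model (diagonal covariance)."""

    def __init__(self, first_frame, alpha=0.02, init_var=15.0 ** 2):
        self.mean = first_frame.astype(np.float64)     # background mean image Bf
        self.var = np.full(self.mean.shape, init_var)  # variance image Vf
        self.alpha = alpha                             # adaptation rate (illustrative)

    def segment_and_adapt(self, frame, thresh=2.5):
        frame = frame.astype(np.float64)
        diff = frame - self.mean
        # Foreground where the pixel lies more than `thresh` standard
        # deviations from the mean in any color channel.
        fg = (diff ** 2 > (thresh ** 2) * self.var).any(axis=-1)  # segmentation Sf
        # Adapt mean and variance only at pixels that matched the background.
        bg = ~fg[..., None]
        self.mean = np.where(bg, (1 - self.alpha) * self.mean + self.alpha * frame, self.mean)
        self.var = np.where(bg, (1 - self.alpha) * self.var + self.alpha * diff ** 2, self.var)
        return fg
```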

[Figure 2 diagram: If, Bf → homography → IL, BL; IL, BL, IH → patch matching → reconstruction → B̂H → segmentation → SS]
Figure 2: A schematic showing the processing in our system: If and Bf are the low-resolution far-field current frame and background mean image. IL and BL are the upsampled and rectified current low-resolution frame and low-resolution background. IH is the current high-resolution frame. B̂H is the estimated super-resolution background mean image. SS is a silhouette obtained from IH and B̂H.

Given the gross object class, position, and trajectory of the moving objects, the far-field system can task PTZ units to observe activities of interest, including unusually moving objects, dropped objects, objects in security-sensitive zones, or potentially hazardous activities. The particular set of triggers is application-specific and has been the focus of previous work [2, 17, 10, 12].

There are a number of previous approaches to PTZ-camera calibration [13, 12]. In this work, we assume an accurate PTZ-camera calibration, which results in a homography estimate of the form H(pp, pt, pz). This calibration enables estimating values of pp, pt, and pz that cover a region of interest, and enables conversion and rectification between low- and high-resolution images. Given a bounding region of an object or region of activity in the wide-area video, the appropriate pan-tilt-zoom parameters {pp, pt, pz} are calculated such that the entire region is within the field of view of the PTZ camera. Once the camera is fixated on the region of interest, a high-resolution image IH is captured. Using H(pp, pt, pz), it is possible to compute the registered and resampled low-res image IL, segmentation SL, and background model ML from their far-field counterparts If, Sf, and Mf. This is illustrated by the left-hand portion of Figure 2; a sketch of this rectification step follows.
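For concreteness, this is how the rectification might look with OpenCV, assuming the calibration yields a 3x3 homography matrix `H` mapping far-field coordinates into the PTZ view; the function name and argument list are our own placeholders, and the interpolation choices follow Section 5 (bicubic for images and the background mean and segmentation, linear for the variances).

```python
import cv2

def rectify_to_ptz_view(If, Sf, Bf, Vf, H, ptz_size):
    """Warp far-field data into the PTZ view. ptz_size is (width, height)."""
    IL = cv2.warpPerspective(If, H, ptz_size, flags=cv2.INTER_CUBIC)
    BL = cv2.warpPerspective(Bf, H, ptz_size, flags=cv2.INTER_CUBIC)
    SL = cv2.warpPerspective(Sf, H, ptz_size, flags=cv2.INTER_CUBIC)
    # Variances are brought through the homography with linear interpolation.
    VL = cv2.warpPerspective(Vf, H, ptz_size, flags=cv2.INTER_LINEAR)
    return IL, SL, BL, VL
```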

Our goal is to estimate the high-resolution background model MH, which will enable a robust foreground segmentation SH of the current foreground objects in the first high-resolution image after a PTZ saccade and in all subsequent images until the next PTZ saccade. Unfortunately, it is not possible to quickly build a high-resolution background model MH directly. Until an object moves completely away from where it was located when the camera was initially tasked to the area, it is simply not possible to estimate a background and reliably segment the object. As an example, imagine a system observing a military base. In a large courtyard with dozens of people, a dropped object appears in the far-field view. Upon focusing the PTZ camera on that object, there will never be information about the appearance of the scene behind the object, nor will it be possible to reliably segment the object from a single image.

By exploiting the current rectified low-resolution background model ML, the most recent rectified low-res image IL, and the current high-res image IH, it is possible to estimate an effective high-res background mean image B̂H which approximates the true background mean image BH, as shown in Figure 2. As we illustrate in the following sections, this enables robust high-resolution segmentation of moving objects from the first image taken after PTZ fixation. The following section describes a method of producing an estimate of the high-resolution background model M̂H from the low-resolution rectified background model ML, the low-resolution rectified image IL, and the high-resolution image IH.

4. The Algorithm

The problem of estimating a super-resolution background model M̂H consists of estimating, for every pixel, the mean and variance of the Gaussian model for that pixel. In this paper we refer to the collection of pixel means as the background mean image or, when the context is clear, simply the background. Given a low-resolution background mean image BL, the current low-resolution image IL, and the current, corresponding high-resolution image IH, our first task is to induce a super-resolution background mean image estimate B̂H suitable for the segmentation of moving objects in the high-resolution input frame. In more precise terms, we are interested in finding the maximum a posteriori (MAP) high-resolution background estimate B̂H such that

B̂H = argmax_{B̂H} Pr(B̂H | BL, IH, IL).    (1)

This is similar to the standard Bayesian approach to super-resolution (see [1]), in which the problem is to find the MAP super-resolution image given a set of low-resolution, unregistered input images. However, traditional super-resolution and the problem at hand differ in several important aspects: (1) the low-resolution input is presented to us in the form of the current low-resolution frame and the low-resolution background mean image (which can be expressed as a function of the low-resolution input history), rather than a set of low-resolution images; (2) given that our cameras are co-located and the homography between the two views is known (or previously learned), we can assume that the low-resolution input is registered with the high-resolution view, so no registration is needed; and (3) instead of a high-resolution image prior or a generic set of training examples, part of the high-resolution information is provided to us in the form of the current high-resolution frame.

We are thus dealing with a special case of the super-resolution problem, in which the aim is to preserve the high-frequency information available in the areas of the high-resolution image that do not contain any moving objects, and to induce the high-resolution background in the areas where it is occluded. To this end, it is useful to think of the low- and high-resolution frame pair as a single, perfectly registered training example, most of which directly corresponds to the low- and high-resolution background mean images, but some areas of which (where moving objects occur) have no direct correspondence.

Returning to the problem of estimating the high-resolution background mean image B̂H, we can apply Bayes' rule to (1) to express our optimization problem in terms of the likelihood of our data (the low-resolution background and the low- and high-resolution image pair) given the estimated super-resolution background:

B̂H = argmax_{B̂H} [ Pr(BL, IL, IH | B̂H) Pr(B̂H) / Pr(BL, IH, IL) ].    (2)

Since the denominator is a function of the given inputs, it is constant and can therefore be ignored. The first term of the numerator represents the conditional probability of the low-resolution background mean image and the low- and high-resolution current image pair, given the high-resolution background mean image. The second term of the numerator can be interpreted as a prior distribution on the super-resolution background mean image. In traditional super-resolution work, this prior is often used to enforce smoothness of the induced image. Because of the nature of our problem, we found it sufficient to use a uniform prior distribution on the super-resolution background mean image.

Finding a parametric form for the conditional distribution in the first term of the numerator in (2) would be difficult, if not impossible, so we take a non-parametric, locally greedy approach (similar to [4] and [8]) to maximizing this quantity. Consider a patch at location i in the low-resolution background mean image and another patch at location j in the current low-resolution image IL. If these two patches have a similar appearance, the pair of patches at the same locations in the high-resolution image and the super-resolution background should also be similar. Thus, if we find a mapping from patch locations in the low-resolution background to the low-resolution input image, we can use this mapping to assemble the super-resolution background from the high-resolution image. We accomplish this using the following algorithm (a code sketch follows the list):

1. For some step size s, we define P = {(1, 1), (1, 1 + s), ..., (m, n)} to be a regular grid of image pixel coordinates.

2. For each location i on the grid P in the low-resolution background BL, we take the square patch πi(BL) centered at location i in BL and find the location j in the low-resolution current image IL for which the patch πj(IL) centered at j in IL best matches the patch πi(BL). We define the distance measure between two patches to be the sum of squared differences between the pixels belonging to them:

   dist(πi, πj) = ssd(πi, πj).    (3)

   Thus, for each i ∈ P we want to find j ∈ P such that

   j = argmin_{j} dist(πi(BL), πj(IL)).    (4)

   This gives us the mapping φ : P → P between patch locations in the low-resolution background BL and the low-resolution image IL.

3. To assemble the super-resolution background B̂H, we place at each location i in B̂H the patch πφ(i)(IH), i.e. the patch from the high-resolution image IH whose location corresponds to location i under the learned mapping φ. The pixel value at each location in the reconstructed background is then the average of all the patches that overlap at that pixel.
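The following is a minimal, unoptimized sketch of these three steps, assuming (per Section 3 and Figure 2) that BL, IL, and IH have already been rectified to the same pixel dimensions and are color images. The function name is ours, and the default parameters echo the patch size 25 and step ≈ patch/4 chosen experimentally below.

```python
import numpy as np

def super_resolve_background(BL, IL, IH, patch=25, step=6):
    h, w = BL.shape[:2]
    r = patch // 2
    acc = np.zeros_like(IH, dtype=np.float64)       # sum of placed patches
    cnt = np.zeros(IH.shape[:2], dtype=np.float64)  # overlap counts
    # Step 1: regular grid P of patch centers.
    grid = [(y, x) for y in range(r, h - r, step) for x in range(r, w - r, step)]
    for iy, ix in grid:
        target = BL[iy - r:iy + r + 1, ix - r:ix + r + 1].astype(np.float64)
        # Step 2: find j in P whose patch in IL best matches (SSD) the
        # background patch at i. Brute force here, for clarity not speed.
        best, by, bx = np.inf, iy, ix
        for jy, jx in grid:
            cand = IL[jy - r:jy + r + 1, jx - r:jx + r + 1]
            d = np.sum((target - cand) ** 2)
            if d < best:
                best, by, bx = d, jy, jx
        # Step 3: place the high-resolution patch from the matched location.
        acc[iy - r:iy + r + 1, ix - r:ix + r + 1] += IH[by - r:by + r + 1, bx - r:bx + r + 1]
        cnt[iy - r:iy + r + 1, ix - r:ix + r + 1] += 1
    # Average overlapping patches (uncovered border pixels stay zero).
    return acc / np.maximum(cnt, 1)[..., None]
```

A practical implementation would restrict or index the search, for example with a tree over patch descriptors, since the brute-force loop above is quadratic in the number of grid locations.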
Figure 3 shows several examples of best-matching pairs of image patches. Some of the matches are semantically correct: for instance, the edge of the red car's hood (patch number 1) in the low-resolution background was matched to a different location on the edge of the hood in the input image. Other patches are merely similar in appearance: patch number 4 on the car's bumper was matched to the edge of the license plate of another car. However, the reconstructed super-resolution background is very similar to the reference high-resolution background (see Figure 9). Figure 4 shows a vector field plot of the mapping φ between patch locations in the low-resolution background and patch locations in the current, low-resolution input image. Notice that the largest concentration of significant displacements occurs in the area of the background that was occluded by the moving person.

The parameters in the above algorithm are the size of the patch (the length of its side) used for matching and the step size of the search grid. We determined the optimal patch size experimentally. For a number of different patch sizes, we ran our super-resolution background reconstruction algorithm and compared the estimated background to a reference background mean image learned directly on a sequence of high-resolution images from the PTZ camera. In Figure 5 we plot the reconstruction error, measured as the sum of squared differences between the reference and the estimated background, against the patch size. We chose a patch size of 25 for the 10x zoom experiment and a patch size of 19 for the 3x zoom experiment. The step size of the search grid was chosen to be approximately 1/4 of the patch size.

Figure 3: Examples of patch correspondences determined by our patch-matching algorithm. Black squares mark 25x25 image patches and correspondences are shown by numbers. (a) shows a section of the low-resolution background, (b) is a section of the current low-resolution frame, and (c) and (d) are sections of the hallucinated super-resolution background and the high-resolution current frame. Correspondences in patch locations were determined between (a) and (b). The super-resolution background (c) was assembled by placing at each patch location the corresponding patch from (d) and averaging overlapping patch areas.

Figure 4: Vector field showing the patch correspondence map between the low-resolution background and the current low-resolution frame. Arrows point towards the matching patch and are scaled by a factor of 5.

Figure 5: Reconstruction error (sum of squared differences between the estimated high-resolution background and the reference background) of an occluded section of the background, shown as a function of the patch size used for matching. Results are shown for both the 3x and 10x experiments.

4.1. Super-resolution background model variances

Thus far we have described the process of inducing the pixel values of the super-resolution background mean image, given a low-resolution background model and a low- and high-resolution current image pair. In order to have a complete super-resolution background model, it is necessary to also estimate the color channel variances at each pixel. We take the variances of the super-resolution background model to be the linearly interpolated variances of the low-resolution model. This is reasonable because we believe that most of the variance at a given pixel is due to factors outside the sensor, such as specularities and other lighting effects. A sketch combining the estimated mean and these variances into a working model follows.
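As a brief illustration (a sketch under our assumptions, not the authors' code), the complete model pairs the hallucinated mean with the interpolated variances; `super_resolve_background` refers to the sketch in Section 4, VL denotes the rectified, linearly interpolated low-resolution variances, and the 2.5-sigma threshold is our own illustrative choice.

```python
import numpy as np

def segment_high_res(IH, BL, IL, VL, thresh=2.5):
    # Estimate the super-resolution background mean B̂H (Section 4 sketch).
    BH_hat = super_resolve_background(BL, IL, IH)
    diff = IH.astype(np.float64) - BH_hat
    # Per-channel Gaussian test against the interpolated variances.
    SH = (diff ** 2 > (thresh ** 2) * VL).any(axis=-1)
    return SH, BH_hat
```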
5. Results

Our algorithm was tested on two scenarios involving overview/PTZ camera pairs corresponding to roughly a 3x zoom and a 10x zoom, respectively. For each experiment, image sequences from both the overview camera and the tasked PTZ camera were recorded. Background subtraction as described in [11] was performed on both the low-res overview camera's video and the corresponding high-res PTZ video, the latter segmentation being used only for comparison. For each of the 3x and 10x scenarios, a single frame, chosen for its interesting content, was selected to act as the first stable frame of the fixated PTZ camera. The content associated with this frame¹ is: the far-field overview image (If), the background mean image from the overview camera (Bf), the variances associated with Bf (Vf), the overview segmentation (Sf), and the high-res image (IH). If, Bf, and Sf were brought through the overview/PTZ camera homography H(·) using bicubic interpolation to form IL, BL, and SL, respectively. Vf was brought through H(·) using linear interpolation to produce VL. These are rectified, low-resolution, pixel-to-pixel counterparts of IH.

Two segmentation approaches were implemented to compare against our method. The first is a comparison against the rectified low-res silhouette SL. This corresponds to using the overview camera's segmentation to predict the location and shape of the moving objects in the PTZ camera. The second approach uses the bicubically interpolated and rectified far-field background mean image BL and its associated linearly interpolated and rectified variances VL as the high-res background model. Because the colors registered by the two cameras differ, we estimate a 3 × 4 color transformation matrix C that minimizes the mean squared error between all pairs of pixels IL(p) = C IH(p), where a translation term is added to the {r, g, b} pixel values of IH such that they are of the form {r, g, b, 1}. The color-corrected high-resolution image is then compared against this bicubically interpolated background model to produce the segmentation SB. A least-squares sketch of this color correction follows.
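A straightforward way to obtain such a matrix is an ordinary least-squares solve over all pixel pairs; the helper names below are ours, not the authors' implementation.

```python
import numpy as np

def estimate_color_transform(IL, IH):
    # Stack {r, g, b, 1} rows for every high-res pixel.
    X = IH.reshape(-1, 3).astype(np.float64)
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    Y = IL.reshape(-1, 3).astype(np.float64)
    # Least-squares solve of Y ≈ X @ C.T gives the 3x4 matrix C.
    sol, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (4, 3)
    return sol.T                                 # C, shape (3, 4)

def apply_color_transform(C, IH):
    X = IH.reshape(-1, 3).astype(np.float64)
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    return (X @ C.T).reshape(IH.shape).clip(0, 255)
```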

Our approach is also compared against the reference silhouette SH produced during the segmentation of the PTZ camera's video stream. Since SH is obtained using the high-resolution imagery and the actual high-resolution background model, this silhouette should be the best we can achieve² and is what we should aim for through the use of super-resolution.

Figures 6 and 8 show the comparative results for the 3x zoom scenario. For the super-resolution reconstruction depicted in Figure 8d, the region occluded by the FedEx truck was faithfully reconstructed with respect to the actual high-resolution background (Figure 8e). The accuracy with which we can reconstruct occluded areas is directly related to the capability of the unoccluded areas to represent the hidden part. In this case, there are several cars that can be used to patch together the back ends of the partially hidden ones, and the road itself is omnipresent outside the area occluded by the truck. The smoothing that occurs is the result of our integration process across patch overlaps.

Although we are pleased with the super-resolution results on a purely qualitative level, it is their ability to achieve high-fidelity motion silhouettes that is in fact important. To this end, the silhouettes produced by comparisons to the various background models are shown in Figure 6. Because bicubic interpolation smoothes edges and does not actually introduce any high-frequency spatial detail into the resulting imagery, there is significantly more noise present in the corresponding scene segmentation (Figure 6b) than in our method (Figure 6c). Also of interest is that the pedestrian in the center of the field of view was not segmented in the overview camera, so the PTZ camera could not have been tasked on the pedestrian. But had it been tasked on the FedEx truck, the person segmented in the zoomed-in view would have been an object simply not detectable in a standard far-field visual surveillance environment.

Figures 7 and 9 likewise show segmentations and super-resolution results for the 10x zoom case. This case is special because the level of detail present in the overview camera is remarkably small, as evidenced by a comparison of the rectified overview image to the high-resolution image (Figures 9a and 9b) as well as by the coarseness of the rectified low-resolution silhouette (Figure 7a). The grille of the truck was a difficult feature to match, due to its complete lack of character in the low-resolution image and background model. This results in a flatter reconstruction of that area in our super-resolution background (Figure 9d) and consequently a false-positive segmentation (Figure 7c). The bicubically interpolated background segmentation (Figure 7b) was not spared this false segmentation either. Interestingly, as much as we in the tracking community dislike segmenting shadows, it is nevertheless satisfying that our approach segments the shadow of the pedestrian very similarly to the high-resolution test case (Figure 7d), even though that level of detail is surely missing from the low-resolution image and background model.

¹ The overview camera was not synchronized with the PTZ camera, so the temporally closest high-res image was chosen.
² Given that the same subtraction algorithm is used.

Figure 6: Comparison of silhouettes obtained using different methods in the 3x zoom experiment. (a) Bicubic interpolation of the low-resolution silhouette obtained from the overview camera. (b) Silhouette obtained through bicubic interpolation of the overview camera's background model. (c) Segmentation obtained using our super-resolution background model. (d) High-resolution reference silhouette.

Figure 7: Comparison of silhouettes obtained with different methods in the 10x zoom experiment. (a) Bicubic interpolation of the low-resolution silhouette obtained from the overview camera. (b) Silhouette obtained through bicubic interpolation of the overview camera's background model. (c) Segmentation obtained using our super-resolution background model. (d) High-resolution reference silhouette. (e)-(h) Cropped areas around the person in (a)-(d).

PTZ camera’s next tasked saccade, standard background subtraction and modeling is performed to continuously segment the area of interest. There is much left to do to improve the overall approach. Instead of using a patch-based approach and ssd, novel features are being investigated to further enhance the quality of the super resolution background model as well as the overall speed and efficiency of the algorithm. Also, we would like to extend this approach to handle multi-modal Gaussian mixture models and, more generally, to have the capability of upsampling arbitrary fields of random variables given particular and corresponding configurations of such variables in low- and high- resolution.

References [1] S. Baker and T. Kanade. Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1167 – 1183, September 2002. [2] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa. A system for video surveillance and monitoring: VSAM final report. Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May 2000. [3] C. Costello, C. Diehl, A. Banerjee, and H. Fisher. Scheduling an active camera to observe people. In Visual Surveillance and Sensor Networks, page 46, ACM, October 2004. [4] A. Efros and W. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, 2001. [5] A. Efros and T. Leung. Texture synthesis by non-parametric sampling. In ICCV, September 1999. [6] A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtraction. In Proc. of the 6th European Conference on Computer Vision, Dublin, Ireland, June/July 2000. [7] W. Freeman, E. Pasztor, and O. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000. [8] A. Hertzmann, C.E.Jacobs, N. Oliver, B. Curless, and D. Salesin. Image analogies. In SIGGRAPH, 2001. [9] S. Kang, J. Paik, A. Koschan, B. Abidi, and M. A. Abidi. Real-time video tracking using PTZ cameras. Proc. of SPIE 6th International Conference on Quality Control by Artificial Vision, pages 103–111, May 2003. [10] S. Lim, L. Davis, and A. Elgammal. A scalable image-based multicamera visual surveillance system. In AVSS, pages 205–212, 2003. [11] J. Migdal and W. E. L. Grimson. Background subtraction using markov thresholds. In IEEE Workshop on Motion and Video Computing, January 2005. [12] A. Senior, A. Hampapur, and M. Lu. Acquiring multi-scale images by pan-tilt-zoom control and automatic multi-camera calibration. In WACV05, pages I: 433–438, 2005. [13] S. Sinha and M. Pollefeys. Towards calibrating a pan-tilt-zoom camera network. In Workshop on Omnidirectional Vision and Camera Networks at ECCV, 2004. [14] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 246–252, 1999. [15] C. Stauffer and K. Tieu. Automated multi-camera planar tracking correspondence modeling. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 259–266, July 2003. [16] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In ICCV (1), pages 255–261, 1999. [17] X. Zhou, R. Collins, T. Kanade, and P. Metes. A master-slave system to acquire biometric imagery of humans at distance. In ACM International Workshop on Video Surveillance, November 2003.

6.. Conclusions and Future Work We have presented a method whereby high fidelity segmentations of moving objects can be achieved on the first frame after fixation has been achieved in a PTZ camera. The process involves the PTZ camera being tasked by a cross-calibrated, collocated, far-field camera that is continuously maintaining its own low-resolution background model of the viewing area. With the rectified low-resolution image and background model from the far field camera and the high-resolution image from the PTZ camera, an estimate of the high-resolution background model is created through our super-resolution process. The high-resolution image from the PTZ camera is then compared to this background to form the high resolution segmentation. Until the

Figure 8: Comparison of high-resolution backgrounds obtained with various methods in the 3x zoom experiment. (a) Low-resolution input image (rectified through the homography and bicubically interpolated). (b) High-resolution input image. (c) Low-resolution background model (rectified through the homography and bicubically interpolated). (d) Our super-resolution background. (e) Reference high-resolution background learned on a high-resolution image sequence.

Figure 9: Comparison of high-resolution backgrounds obtained with various methods in the 10x zoom experiment. (a) Low-resolution input image (rectified through the homography and bicubically interpolated). (b) High-resolution input image. (c) Low-resolution background model (rectified through the homography and bicubically interpolated). (d) Our super-resolution background. (e) Reference high-resolution background learned on a high-resolution image sequence.
