UMD VDT, an Integration of Detection and Tracking Methods for Multiple Human Tracking

Son Tran, Zhe Lin, David Harwood, and Larry Davis

UMIACS, University of Maryland, College Park, MD 20740, USA

Abstract. We integrate human detection and regional affine invariant feature tracking into a robust human tracking system. First, foreground blobs are detected using background subtraction. The background model is built with a local predictive model to cope with large illumination changes. Detected foreground blobs are then used by a box tracker to establish stable tracks of moving objects. Human hypotheses are generated using both shape and region information through a hierarchical part-template matching method. The human detection results are then used to refine the tracks of moving people. Track refinement, extension and merging are carried out with a robust tracker based on regional affine invariant features. We show experimental results for the separate components as well as for the entire system.

1 Overview of UMD VDT

Most human activity analysis and understanding approaches in visual surveillance take human tracks as their input. Establishing accurate human tracks is therefore very important in many visual surveillance systems. Even though it has been studied for a long time, accurate human tracking remains a challenge for a number of reasons, such as shape and pose variation, occlusion, and object grouping.

We describe a multiple object tracking system that integrates human detection and general object tracking approaches. Our system detects people automatically using a probabilistic human detector and tracks them as they move through the scene. It is able to resolve partial occlusion and merging, especially when people walk in groups. Figure 1 shows a diagrammatic overview of our system. It consists of three main components: background subtraction, human detection, and human tracking. Their details are discussed in Sections 2, 3, and 4. In this section, we describe the overall procedure that is built on top of these components.

1.1 Algorithm Integration

Foreground Detection. First, background subtraction (Section 2) is applied to detect foreground blobs. To improve the detection rate, we combine background subtraction with a short-term frame difference. The background subtraction module is designed to cope with global illumination changes. However, in visual surveillance, lighting changes sometimes happen at a local scale, such as when a truck casts a large scattered shadow on streets and pavements. In such cases, subtraction from the long-term, trained background model will likely produce many false alarms, while frame differencing does not. The resulting detected foreground boxes are connected to form preliminary tracks, PT. Unstable tracks are then eliminated.
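The text above does not specify how the two cues are combined, so the following is only a minimal sketch under the assumption that a pixel is kept as foreground when both the long-term background model and the short-term frame difference flag it; the function names and thresholds are illustrative.

```python
# One plausible combination of long-term background subtraction with a
# short-term frame difference; the AND rule and the threshold are assumptions.
import cv2
import numpy as np

def combined_foreground(frame, prev_frame, bg_mask, diff_thresh=15):
    """frame, prev_frame: grayscale uint8 images a few frames apart;
    bg_mask: binary mask produced by the long-term background model."""
    frame_diff = cv2.absdiff(frame, prev_frame)
    motion_mask = (frame_diff > diff_thresh).astype(np.uint8) * 255
    # Require support from both cues to suppress false alarms caused by
    # local lighting changes (e.g., large moving shadows).
    return cv2.bitwise_and(bg_mask, motion_mask)
```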


Fig. 1. Main components of our human tracking system.

Box Filtering. The set of preliminary tracks, PT, corresponds to different moving objects in the scene (humans, human groups, cars, car groups, etc.). A cascade of filters is applied to eliminate irrelevant moving objects, retaining only tracks that correspond to humans or human groups. A filter on location classifies an object's position into road and non-road regions. It is trained for each view, for example, based on the positions of cars. A filter on object shape classifies objects into humans, human groups or cars. Its decision is based on the human width-to-height ratio. This filter gives high detection rates for single-person boxes even when shadow is significant. A filter on track properties distinguishes the tracks of cars from those of human groups; their tracks often differ in a number of aspects. For example, cars often move along the road with high velocity, while humans rarely walk in groups along the road. Stacking these filters in a decision-tree-like order, we obtain a composite filter that outputs a set of tracks, FT, that correspond only to single humans or human groups. Further elimination of false alarms can be done in combination with the human detection described below.
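A hypothetical sketch of this filter cascade is given below. The Track representation, the road mask, and all thresholds are illustrative assumptions standing in for the trained, per-view filters described above.

```python
# Hypothetical sketch of the box-filtering cascade; road_mask is assumed to be a
# binary image, and the numeric thresholds are made up for illustration.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Track:
    boxes: List[Tuple[int, int, int, int]]   # (x, y, w, h) per frame
    feet: List[Tuple[float, float]]          # foot positions per frame

def fraction_on_road(tr: Track, road_mask) -> float:
    hits = sum(road_mask[int(y), int(x)] > 0 for x, y in tr.feet)
    return hits / max(len(tr.feet), 1)

def median_aspect(tr: Track) -> float:
    ratios = sorted(h / max(w, 1) for _, _, w, h in tr.boxes)
    return ratios[len(ratios) // 2]

def mean_speed(tr: Track) -> float:
    (x0, y0), (x1, y1) = tr.feet[0], tr.feet[-1]
    return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / max(len(tr.feet) - 1, 1)

def keep_human_tracks(preliminary: List[Track], road_mask) -> List[Track]:
    """Decision-tree-like cascade: location, then shape, then track dynamics."""
    kept = []
    for tr in preliminary:
        if fraction_on_road(tr, road_mask) > 0.5 and mean_speed(tr) > 15.0:
            continue                       # fast object on the road: likely a car
        if 1.5 <= median_aspect(tr) <= 4.0 or mean_speed(tr) <= 15.0:
            kept.append(tr)                # single human or (slow) human group
    return kept
```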


Human Detection. In the tracks FT, each individual human in a human group has not yet been segmented and localized accurately. Localization inaccuracy may also exist in the track of a single person when a strong shadow is associated with it. To resolve these problems, a human detection technique (Section 3) is applied. It is difficult to apply frame-by-frame object detection directly to tracking, since the detection task is time consuming and can only be performed at very low frame rates (e.g., on every fifth or tenth frame of the original sequence). Furthermore, the image quality in surveillance videos is typically low, making it difficult to achieve a high detection rate and a low false alarm rate at the same time for every frame. We therefore use the frame-by-frame detection result, D, to refine the set FT = {FT_i} obtained above.

Let D_i = {D_i^1, . . . , D_i^n} ∈ D denote the set of relevant detected humans for FT_i. Each D_i^j is either empty or a set of detected boxes that significantly intersect FT_i at time j. Algorithm 1 describes the integration of the detection results into tracking.

Algorithm 1 Track Refinement using Human Detection
Input: set of tracks FT and human detection result D
Output: set of refined tracks RT
RT = ∅
For every track FT_i ∈ FT:
1. Compute D_i ∈ D, the set of relevant detected human boxes for FT_i.
2. RT_i = ∅
3. Forward Tracking. From frame t to t + 1,
   – For every track tr ∈ RT_i, extend tr to frame t + 1 using the tracker in Section 4. Stop at the end of FT_i.
   – Merge any detected box in D_i^(t+1) that intersects tr significantly.
   – For each un-mergeable box, start a new track tr and add it to RT_i.
4. Backward Tracking.
   – For each track tr ∈ RT_i, using the tracker in Section 4, track backward, starting from the first frame in tr. Stop at the beginning of FT_i.
5. Add RT_i to RT.
Merge similar tracks in RT.

In surveillance it is very common that people with similar appearance walk close to each other. In such cases, accurate object state prediction, which limits the tracker's search range, plays an important role in reducing track or identity switches. Therefore, in the forward and backward tracking steps of Algorithm 1, a prediction step is added to the tracker described in Section 4. The prediction is based on the movement of the bounding box in FT_i and on the relative location of the tracked human w.r.t. this bounding box. Since the prediction is only possible within the duration of FT_i, at this stage we stop the tracker at the temporal boundary of FT_i. In the last step of Algorithm 1, two tracks can be merged if they are not temporally far from each other and they spatially overlap. The merging process stops when no two tracks can be merged.
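To make the control flow of Algorithm 1 concrete, the sketch below gives one possible rendering of the forward/backward refinement loop. The tracker object stands in for the regional-feature tracker of Section 4, tracks are represented as frame-to-box dictionaries, and the IoU threshold tau is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)

def refine_track(FTi, Di, tracker, tau=0.5):
    """Refine one foreground track FTi using its relevant detections Di.

    Di maps a frame index t to the list of detected human boxes intersecting FTi.
    """
    RTi = []                                     # refined tracks: dicts frame -> box
    for t in range(FTi.start, FTi.end):          # forward tracking
        for tr in RTi:
            tr[t + 1] = tracker.extend(tr, t + 1)        # predict and track forward
        for det in Di.get(t + 1, []):
            hit = next((tr for tr in RTi if iou(tr[t + 1], det) > tau), None)
            if hit is not None:
                hit[t + 1] = det                 # merge the detection into the track
            else:
                RTi.append({t + 1: det})         # unmatched detection starts a new track
    for tr in RTi:                               # backward tracking toward the start of FTi
        for t in range(min(tr), FTi.start, -1):
            tr[t - 1] = tracker.extend(tr, t - 1)
    return RTi
```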


Track Extension. Each refined track RT_i ∈ RT is extended within the temporal limits of the original track FT_i ∈ FT. Usually each RT_i is quite stable and long, so the velocity of the tracked object can be estimated reliably. Using this estimate as the prediction of the object dynamics, we extend RT_i using the tracker in Section 4. Similar tracks are then merged to form the set of final tracks.

1.2 Results

Figure 2 shows sample results of our tracking system on a sequence from the VACE-CLEAR 2007 tracking evaluation dataset. Foreground detection, human detection, and final localization are shown overlaid. As we can see, the foreground detection results (green boxes) give rough locations of moving objects in the scene. The human detection results (blue boxes) give better estimates but with occasional misses (Figure 2, left, top). Based on the human detections, the tracker is able to track each individual more accurately even when they walk in groups (Figure 2, left, bottom). Figure 3 shows results from another sequence.

Fig. 2. Results on sequence PVTRA102a01 (VACE-CLEAR07). Foreground detection, human detection, and final localization are shown respectively in green, blue and red. On the left are some blowups and on the right is a full view.

Figure 3(b) shows some false alarms that occur when the shape of a detected foreground blob is similar to that of a human. For this sequence, there is no detection result in the neighborhood of frame #4999 (fifty frames before and after), but the system was still able to track the object accurately (Figure 3(c)).

a) frame #1498    b) frame #1748    c) frame #4999

Fig. 3. Results on sequence PVTRA201c05 (VACE-CLEAR07). Foreground detection, human detection (if present), and final localization are shown respectively in green, blue and red.


2 Background Subtraction

2.1 Introduction

Common approaches to background modeling construct a probability distribution for a pixel's color values or for some local image features. A single or multimodal distribution, such as a Gaussian or a mixture of Gaussians ([1]), may be used depending on the level of noise and the complexity of the background. Approaches proposed in the literature have difficulty responding quickly to illumination change. Adaptation using a Kalman filter ([2]) or caches ([3]) inevitably leads to a lag in accommodating illumination change, during which the system is effectively "blind". Here, we employ the codebook (VQ) approach to background modeling ([3]) and motivate the use of a locally linear model to predict and update pixel color codebook values. Our approach is able to deal with large illumination changes using only a small number of codewords per pixel.

2.2 Background Modeling using the Codebook Representation

Two main observations motivate our approach. First, in natural outdoor scenes, different surfaces have different irradiance responses when the illumination changes. To accurately adapt the background model to such changes, a single global set of adaptive parameters, as suggested by [4], is insufficient. Instead, for each pixel we use a separate set of parameter values that are updated independently as lighting changes.

¨

m

255 saturated

¨a

m

D

dB

m

e dC

saturated

H B 0

255

a)

x x

b)

Second, as lighting changes, the color values of individual pixels do not lie on a color line passing through the origin of color space, as assumed by [3] and others. Since camera sensors have a limited range [0, 255], the response necessarily saturates at the limits, so that locally, changes in color as a function of illumination change do not point toward the color-space origin. This is illustrated in Figure 4. At low light levels all pixels fade to black; at sufficiently high light levels they all tend to white. Between these two extremes they follow a curved trajectory from black to white that depends on the reflectance properties of the corresponding surface point. We point out that reaching saturation is quite common, especially for scenes under direct and strong sunlight.

Fig. 4. a) Illustration on the R-B plane of the response of the camera sensor to changes in lighting intensity. b) Prediction errors and distances in color dC and in brightness dB. ∆ is the local tangent and m is the mean at a pixel; a is the global gain; x denotes the observation.


Models for background adaptation under illumination change therefore have to take a typical camera's non-linear response into account.

Background Model Construction. We construct a locally linear model to update codeword colors as illumination changes. At each pixel, we use a tangent vector to that pixel's color trajectory to represent the rate and direction of its color variation. As the lighting changes, both are adjusted to better model the local behavior of the pixel's color. The overview of our background model adaptation is as follows.

– For each pixel, a codeword model is learned during a training period. Each codeword contains a principal color value and a tangent vector.
– When the scene illumination changes, the change is measured through a gain in a global index value. For each codeword, this gain, in combination with the mean and tangent of the codeword, produces a prediction of the codeword color under the changed lighting condition. A new codeword is created if no prediction is close to the observation.

With some additional updating based on secondary technical considerations, these predictions constitute our updated background model.

Foreground Detection. Scene illumination will, of course, continue to change after training of the codebook is completed. Therefore, we keep updating the background model in parallel with foreground detection. The algorithm is the same as above, with the exception that when the observation for a pixel is far from its prediction, instead of creating a new codeword we mark that pixel as belonging to the foreground.
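As an illustration, the sketch below implements one plausible form of the per-pixel prediction and foreground test. The exact locally linear update is not spelled out above, so the prediction m + (a − 1)∆ and the distance threshold r_D are assumptions.

```python
# Sketch of per-pixel codeword prediction and the foreground test; the
# prediction form and the threshold value are assumptions for illustration.
import numpy as np

class Codeword:
    def __init__(self, mean_rgb, tangent):
        self.mean = np.asarray(mean_rgb, dtype=float)     # principal color value m
        self.tangent = np.asarray(tangent, dtype=float)   # local color-trajectory tangent

    def predict(self, gain):
        """Predicted codeword color under a global illumination gain a (assumed form)."""
        return np.clip(self.mean + (gain - 1.0) * self.tangent, 0, 255)

def is_foreground(obs_rgb, codewords, gain, r_d=20.0):
    """A pixel is foreground when its observation is far from every prediction."""
    obs = np.asarray(obs_rgb, dtype=float)
    dists = [np.linalg.norm(obs - cw.predict(gain)) for cw in codewords]
    return min(dists, default=np.inf) > r_d
```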

2.3 Results

We show here an experiment on an outdoor sequence in which the lighting changed extensively during both the training and the detection periods. The first 150 frames are used for training; no foreground objects are present during that period. Figure 5 shows the results on three frames of the testing sequence. As we can see, with the same distance threshold, both the approach of [3] and that of [2] produce more false alarms than ours. Note that reducing r_D to exclude false alarms also reduces the detection rate (i.e., it produces more holes in the foreground regions). For example, in the result of [3] for frame #400, the foreground (true positives) starts to disappear while the background (false alarms) has not yet been completely eliminated. Note also that increasing the adaptation rate does not necessarily lead to improvements in detection.

3 Human Detection

3.1 Introduction

Local part-based approaches [5][6] detect human parts first and assemble the detected parts for final classification. Global template-based approaches, e.g. [7], use a more direct global shape matching method for human detection. Blob-based approaches [8][9] use an MCMC-based optimization to segment foreground blobs into humans.


Fig. 5. The first row shows three frames (#1, #400, #1000) of the testing sequence. The remaining rows respectively show the results of [3] and of our method on those three frames.

3.2 Human Detection

Bayesian MAP Framework. We formulate human detection as a Bayesian MAP optimization: $c^* = \arg\max_c P(c|I)$, where $I$ denotes the original image and $c = \{h_1, h_2, \ldots, h_n\}$ denotes a human configuration (a set of human hypotheses). Each individual hypothesis $h_i = (x_i, \theta_i)$ consists of a foot position $x_i$ and the corresponding model parameters $\theta_i$. Using Bayes' rule, the posterior can be decomposed into a joint likelihood and a prior: $P(c|I) = P(I|c)P(c)/P(I) \propto P(I|c)P(c)$. We assume a uniform prior, hence the MAP problem reduces to maximizing the joint likelihood. The joint likelihood is modelled as $P(I|c) = P_{ol}(c)P_{cl}(c)$, where $P_{ol}(c)$ and $P_{cl}(c)$ denote the object-level and configuration-level likelihoods, respectively. The object-level likelihood is further expressed as a product of the likelihoods of the individual hypotheses: $P_{ol}(c) = \prod_{i=1}^{n} P_{ol}(h_i) = \prod_{i=1}^{n} P_{ol}(I|x_i, \theta_i)$. The configuration-level likelihood is calculated as the global coverage density of the binary foreground regions: $P_{cl}(c) = \Gamma(c)/\Gamma_{fg} = \bigl(\bigcup_{i=1}^{n} \Gamma(h_i)\bigr)/\Gamma_{fg}$, where $\Gamma_{fg}$ denotes the foreground coverage and $\Gamma(h_i)$ denotes the coverage by hypothesis $h_i$.

Hierarchical Part-Template Matching. We first generate a flexible set of global shape models by part synthesis (Figure 6(a)). Next, silhouettes and boundaries are extracted from the set of generated global shape models and decomposed into three parts (head-torso, upper legs and lower legs). The part parameters are denoted $\theta_{ht}$, $\theta_{ul}$ and $\theta_{ll}$, where each parameter represents the index of the corresponding part in the part-template tree. Then, the tree-structured part-template hierarchy is constructed by placing the decomposed part regions and boundary fragments into a tree, as illustrated in Figure 6(b).
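The sketch below illustrates how the joint likelihood P(I|c) = P_ol(c) P_cl(c) defined above could be evaluated for a candidate configuration; the silhouette-mask representation of Γ(h_i) is an assumption made for illustration.

```python
# Evaluate the joint likelihood of a configuration from per-hypothesis
# likelihoods and silhouette masks; not the authors' code, only an illustration.
import numpy as np

def joint_likelihood(hyp_masks, hyp_likelihoods, fg_mask):
    """hyp_masks: list of boolean silhouette masks, one per hypothesis h_i;
    hyp_likelihoods: object-level likelihoods P_ol(h_i);
    fg_mask: binary foreground image from background subtraction."""
    p_ol = float(np.prod(hyp_likelihoods))        # P_ol(c) = product over hypotheses
    union = np.zeros(fg_mask.shape, dtype=bool)
    for m in hyp_masks:
        union |= m                                # Gamma(c) = union of Gamma(h_i)
    fg = fg_mask > 0
    p_cl = (union & fg).sum() / max(fg.sum(), 1)  # coverage density P_cl(c)
    return p_ol * p_cl
```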


Fig. 6. An illustration of the part-template tree and its construction process. (a) Generation of global shape models by part synthesis and decomposition of the global silhouette and boundary models into region and shape part-templates. (b) Part-template tree characterized by both shape and region information.

Algorithm 2 Hierarchical Part-Template Matching
For each pixel x in the image, we adaptively search over scales distributed around the expected human size (w0, h0), estimated by the foot-to-head plane homography and an average aspect ratio ∆.
1) We match the set of head-torso shape templates with edges and estimate the maximum likelihood solution θ*_ht.
2) Based on the part-template estimate θ*_ht, we match the upper leg and lower leg template models with edges to find the maximum likelihood solutions for the leg layers, θ*_ul and θ*_ll.
3) Finally, we estimate global human shapes by combining the maximum likelihood estimates of the local part-templates, and return the synthesized model parameters θ* = {θ*_ht, θ*_ul, θ*_ll}.
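The following sketch outlines the tree-ordered search of Algorithm 2 at a single candidate foot pixel. The part-template tree interface and the chamfer_score and coverage callbacks are placeholders for the likelihood terms defined below; all names are illustrative.

```python
def match_part_templates(x, scale, tree, edge_map, fg_mask, chamfer_score, coverage):
    """Greedy, tree-ordered matching at one candidate foot pixel x and one scale."""
    def part_likelihood(tpl):
        # Shape likelihood (chamfer matching) times region likelihood (coverage),
        # as defined by the likelihood equations below.
        return chamfer_score(edge_map, tpl, x, scale) * coverage(fg_mask, tpl, x, scale)

    # Step 1: maximum-likelihood head-torso template.
    theta_ht = max(tree.head_torso, key=part_likelihood)
    # Step 2: leg templates are matched conditioned on the chosen head-torso branch.
    theta_ul = max(tree.upper_legs(theta_ht), key=part_likelihood)
    theta_ll = max(tree.lower_legs(theta_ht), key=part_likelihood)
    # Step 3: the global shape estimate is synthesized from the three part estimates.
    return {"ht": theta_ht, "ul": theta_ul, "ll": theta_ll}
```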

For a part-template $j \in \{ht, ul, ll\}$, we denote the overall part-template likelihood by $P_{ol}(I|x,\theta_j)$, the chamfer matching score (the shape likelihood) by $P^s_{ol}(I|x,\theta_j)$, and the part foreground coverage density (the region likelihood) by $P^r_{ol}(I|x,\theta_j)$. Given the binary foreground image $I_f$ and the Canny edge map $I_e$, the likelihoods are calculated as follows:

$P_{ol}(I|x,\theta_j) = P_{ol}(I_e, I_f|x,\theta_j) = P^s_{ol}(I_e|x,\theta_j)\, P^r_{ol}(I_f|x,\theta_j)$,   (1)
$P^s_{ol}(I_e|x,\theta_j) = D_{chamfer}(x, T_{\theta_j})$,   (2)
$P^r_{ol}(I_f|x,\theta_j) = \gamma(x,\theta_j)$,   (3)

where $T_{\theta_j}$ represents the part-template defined by parameter $\theta_j$ and $D_{chamfer}(x, T_{\theta_j})$ represents the average chamfer distance. The foreground coverage density $\gamma(x,\theta_j)$ is defined as the proportion of the foreground pixels covered by the part-template $T_{\theta_j}$ placed at candidate foot pixel $x$.


Then we find the maximum likelihood estimate $\theta^*_j(x)$ as follows: $\theta^*_j(x) = \arg\max_{\theta_j \in \Theta_j} P_{ol}(I|x,\theta_j)$, where $\Theta_j$ denotes the parameter space of part $j$ and $P_{ol}(I|x,\theta_j)$ denotes the part-template likelihood for candidate foot pixel $x$ and part-template $T_{\theta_j}$. Finally, the global object-level likelihood $P_{ol}(I|x)$ for candidate foot pixel $x$ is estimated as $P_{ol}(I|x) = \sum_j w_j P_{ol}(I|x,\theta^*_j(x))$, where $w_j$ is an importance weight for part-template $j$. We smooth the resulting likelihood map adaptively with a 2D Gaussian filter and extract local maxima from it. The set of local maxima with likelihood larger than a threshold defines the set of initial human hypotheses $C = \{h_1, \ldots, h_N\}$.

Optimization. A fast and efficient greedy algorithm is employed for optimization. The algorithm works progressively: starting with an empty configuration $c_0 = \emptyset$, we iteratively add a new, locally best hypothesis from the residual set of possible hypotheses until a termination condition is satisfied. The iteration terminates when the joint likelihood stops increasing or no more hypotheses can be added. The greedy solution $c^*$ is used as the final estimate of the human configuration.
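A minimal sketch of this greedy search is shown below; the joint_likelihood callback corresponds to P(I|c), and the data structures are illustrative.

```python
# Greedy MAP search: add hypotheses from C one at a time while the joint
# likelihood keeps increasing; names and representations are illustrative.
def greedy_configuration(C, joint_likelihood):
    config, best = [], 0.0
    remaining = list(C)
    while remaining:
        # Pick the hypothesis whose addition increases the likelihood the most.
        score, h_star = max(((joint_likelihood(config + [h]), h) for h in remaining),
                            key=lambda g: g[0])
        if score <= best:
            break                        # likelihood stopped increasing: terminate
        config.append(h_star)
        remaining.remove(h_star)
        best = score
    return config                        # greedy estimate c* of the human configuration
```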

3.3 Implementation and Results

Given the calibration information and the binary foreground image from background subtraction, we first estimate binary foot candidate regions $R_{foot}$ in order to efficiently localize all candidate feet. The human vertical axis $\vec{v}_x$ is estimated for a foot candidate pixel $x$ from the calibration information. The foot candidate regions are then obtained as $R_{foot} = \{x \mid \gamma_x \ge \xi\}$, where $\gamma_x$ denotes the proportion of foreground pixels in an adaptive rectangular window $W(x, (w_0, h_0))$ determined by the candidate foot pixel $x$. The window coverage is computed efficiently using integral images. Figure 7 shows an example of our human detection process.
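The sketch below shows how the foot-candidate test γ_x ≥ ξ can be evaluated in constant time per pixel with an integral image (summed-area table). The placement of the window relative to the foot pixel and the threshold value are assumptions.

```python
# Foot-candidate regions via an integral image; window geometry and the
# threshold xi are illustrative assumptions.
import numpy as np

def foot_candidates(fg_mask, w0, h0, xi=0.6):
    """fg_mask: HxW binary array; (w0, h0): expected human size; xi: threshold."""
    fg = (fg_mask > 0).astype(np.int64)
    ii = np.zeros((fg.shape[0] + 1, fg.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = fg.cumsum(0).cumsum(1)              # integral image (summed-area table)
    H, W = fg.shape
    cand = np.zeros_like(fg, dtype=bool)
    for y in range(H):
        for x in range(W):
            x1, x2 = max(0, x - w0 // 2), min(W, x + w0 // 2)
            y1, y2 = max(0, y - h0), y               # window extends upward from the foot
            area = max((x2 - x1) * (y2 - y1), 1)
            s = ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]
            cand[y, x] = (s / area) >= xi            # gamma_x >= xi
    return cand
```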

4 Object Tracking

4.1 Introduction and Related Work

Developing robust tracking algorithms is challenging due to factors such as noisy input, illumination variation, cluttered backgrounds, occlusion, and object appearance change due to 3D motion and articulation. There is a vast literature on tracking, and we restrict ourselves to some recent approaches related to our research. To address appearance changes, adaptive modeling of object appearance is typically employed. In [10], object appearance is modeled with a mixture of a fixed number of color-spatial Gaussians. Each mixture is considered as a particle in a particle filter system. This representation is quite similar to the object model in [11], where a variable number of Gaussian kernels is used. The set of kernels is updated between frames using Bayesian sequential filtering. The Gaussian approximation makes these approaches applicable to gradual appearance changes. However, similar to [12], they have difficulties with rapid changes or when changes include both occlusion and dis-occlusion, such as when an object rotates (even slowly).



Fig. 7. An example of the detection process with background subtraction. (a) Adaptive rectangular window, (b) Foot candidate regions R_foot (lighter regions), (c) Object-level likelihood map from the hierarchical part-template matching, (d) The set C of human hypotheses overlaid on the Canny edge map, (e) Final human detection result, (f) Final human segmentation result.

We use in this work a regional-feature based tracking algorithm designed to cope with large changes in object appearance. The tracking algorithm estimates a time-varying occupancy map of the tracked object, which is updated based on local motion models of both the object and the background. The regional features we employ are the MSER features ([13]), which have been used previously for object recognition and wide-baseline stereo matching (see [14]). They are more stable from frame to frame than local features such as corners or lines, which makes them easier to match for motion modelling.
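As an illustration only, MSER features of the kind used here can be extracted with OpenCV as sketched below; the parameter values are assumptions and not those used by the authors.

```python
# Extract MSER regions from a grayscale frame; parameter values are assumed.
import cv2

def detect_mser_regions(gray_frame):
    """gray_frame: 8-bit grayscale image. Returns point lists and bounding boxes."""
    mser = cv2.MSER_create(5, 60, 14400)      # delta, min area, max area (assumed values)
    regions, bboxes = mser.detectRegions(gray_frame)
    return regions, bboxes
```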

4.2 Approach Summary

Our approach is motivated by the simple assumption that any image element (feature, pixel, etc.) that is close to the object and moves consistently with it is, with high probability, part of the object. Our approach to updating the model of a tracked object is therefore based on the motion of image elements, as opposed to appearance matching. The summary of our approach is as follows. We represent the object with a probabilistic occupancy map and reduce the problem of tracking the object from time t to t + 1 to that of constructing the occupancy map at t + 1 given the occupancy map at time t. We start by computing probabilistic motion models for the features detected at time t, conditioned on whether they belong to the foreground or the background. Feature motion distributions are computed from the occupancy map at time t and feature similarities across frames. From these feature motions, a probabilistic motion model for each pixel is constructed. The construction starts from the centers of the regional features and expands out to cover the entire image.


Finally, the pixel motion fields are used to construct the object occupancy map at t + 1. Figure 8 shows the flowchart of our tracking algorithm.

Fig. 8. Main components of our tracking algorithm: initialize the occupancy map OC0, evaluate feature motion models, compute probabilistic pixel motion fields, and update the occupancy map.
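As a simplified illustration of the update step in the flowchart above, the sketch below pushes the occupancy map through per-pixel displacements; the full method keeps probabilistic motion fields rather than a single displacement per pixel.

```python
# Propagate an occupancy map from t to t+1 along per-pixel displacements;
# collapsing the motion model to one displacement per pixel is a simplification.
import numpy as np

def propagate_occupancy(occ_t, flow):
    """occ_t: HxW occupancy probabilities at time t;
    flow: HxWx2 per-pixel displacement (dx, dy) from t to t+1."""
    H, W = occ_t.shape
    occ_next = np.zeros_like(occ_t)
    weight = np.zeros_like(occ_t)
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    np.add.at(occ_next, (yt, xt), occ_t)      # push each pixel's mass along its motion
    np.add.at(weight, (yt, xt), 1.0)
    return occ_next / np.maximum(weight, 1.0)
```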

4.3 Experimental Results

Figure 9 shows tracking results on a surveillance video from published domains¹ (rough bounding boxes are shown for better visualization). The challenges in this sequence include a cluttered background, partial occlusion, and large changes in object shape and size as the car turns and moves away.

References

1. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. IEEE Computer Society, Washington, DC, USA (1999) 245–252
2. Kilger, M.: A shadow handler in a video-based real-time traffic monitoring system. In: Proc. IEEE Workshop on Applications of Computer Vision, IEEE Computer Society (1992) 11–18
3. Kim, K., Chalidabhongse, T., Harwood, D., Davis, L.S.: Background modeling and subtraction by codebook construction. IEEE Computer Society, Washington, DC, USA (2004) 3061–3064
4. Koller, D., Weber, J., Malik, J.: Robust multiple car tracking with occlusion reasoning. Springer-Verlag, London, UK (1994) 189–196
5. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In: Proc. ICCV (2005)
6. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Proc. ECCV (2004)
7. Gavrila, D.M., Philomin, V.: Real-time object detection for smart vehicles. In: Proc. ICCV (1999)
8. Zhao, T., Nevatia, R.: MCMC-based approach for human segmentation. In: Proc. CVPR (2004)
9. Smith, K., Gatica-Perez, D., Odobez, J.M.: Using particles to track varying numbers of interacting people. In: Proc. CVPR (2005)
10. Wang, H., Suter, D., Schindler, K.: Effective appearance model and similarity measure for particle filtering and visual tracking. In: Leonardis, A., Bischof, H., Pinz, A. (eds.): Proc. ECCV 2006. LNCS, vol. 3 (2006) 606–618

¹ The ETI-SEO and VACE-CLEAR07 tracking evaluation projects.


Fig. 9. Tracking results on the car sequence at frames #1, #90, and #206 (frames #1, #90, #188 for the middle row). Results of [11] (top), [15] (middle) and our tracker (bottom).

11. Han, B., Davis, L.: On-line density-based appearance modeling for object tracking. In: Proc. IEEE ICCV 2005, IEEE Computer Society (2005) 1492–1499
12. Jepson, A., Fleet, D., El-Maraghi, T.: Robust online appearance models for visual tracking. IEEE Trans. PAMI 25(10) (2003)
13. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proc. BMVC 2002, London (2002) 384–393
14. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. Int'l J. Computer Vision 65(1-2) (2005) 43–72
15. Collins, R., Liu, Y., Leordeanu, M.: On-line selection of discriminative tracking features. IEEE Trans. PAMI 27(10) (2005) 1631–1643
