Tracking Multiple Humans using Fast Mean Shift Mode Seeking

Csaba Beleznai1,∗, Bernhard Frühstück2, Horst Bischof3

1 Advanced Computer Vision GmbH - ACV, Vienna, Austria [email protected]
2 Siemens AG Österreich, Programm- und Systementwicklung, Graz, Austria
3 Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria

Abstract

Tracking multiple targets - such as humans in a busy scene - is a non-trivial task due to the frequent occlusions occurring between the target objects. This work describes a novel way to detect human candidates directly from the non-thresholded difference image obtained by background differencing and to track detected candidates by a fast variant of the mean shift procedure. For occluding targets, a model-based search step looks for the local configuration that best explains the observed distribution in the difference image. The proposed technique shows real-time performance for challenging multi-target scenarios. Presented results are compared to a conventional blob-based tracker and evaluated in terms of detection performance.

1. Introduction

The ultimate goal of automated visual surveillance systems is to obtain a high-level representation - such as activity recognition or scene understanding - for a given scene. To achieve this goal, the underlying detection and tracking mechanisms have to perform reliably. Scenarios of practical interest, however, usually contain a large number of humans or other objects, rendering the task of robust detection and tracking significantly more difficult.

A commonly used low-level object detection strategy is to compute the difference between a frame and a reference image (such as an image representing a background model) and then apply thresholding, morphological filtering and subsequent labelling. Thresholds are sensitive parameters leading to an immediate decision whether a pixel belongs to a moving or non-moving region. Filtering and labelling are also prone to introduce errors - such as over- and undersegmented objects - which are propagated to higher levels of processing, where these errors are difficult to overcome. The thresholding and filtering steps eliminate a substantial amount of the original information on the spatial intensity distribution within the difference image, which is essential for reliable object detection and tracking.

Figure 1: Tracking humans in a crowded scene using a blob-based hypothesis tracker (left image) and the proposed method (right image).

An alternative is to operate directly on the difference image and obtain an object representation by clustering. Recently, several approaches have been proposed on clustering of image data in the context of object detection and tracking. Pece [1] proposed clustering using mixtures of Gaussians and tracking by propagating cluster parameters. Numerous similar object detection approaches rely on estimating spatial covariance [2, 3, 4]. The mean shift approach - delineating clusters without a specific assumption on the distribution of data - has been presented recently [5] and used for image segmentation [6] and object detection [7]. The difference image of a crowded scene might contain many nearby clusters. In this case unconstrained clustering usually fails to isolate individual clusters. Constraints imposed by a model can improve the clustering task significantly, as demonstrated by [8].

In this paper we present a novel way of performing object detection in complex scenes containing frequent occlusion events. We describe a fast mean shift clustering procedure using integral images. Furthermore, we present a simple model-based search step that exploits intermediate clustering results to narrow down the search space when searching for the optimum object configuration best explaining the underlying difference image intensity distribution.

The presented work is structured as follows: in Section 2 the main steps of fast mean shift computation are explained. Section 3 describes the clustering scheme using the fast mean shift procedure. Section 4 presents the cluster tracking algorithm. The proposed method is computationally efficient, so real-time detection and tracking of a large number of targets becomes feasible. Section 5 demonstrates the results of the method on different examples, and compares them to a manual annotation and to a blob tracker. Finally, the paper is concluded in Section 6.

∗ This work has been carried out within the K plus Competence Center ADVANCED COMPUTER VISION. This work was funded from the K plus Program.

2. Fast Mean Shift Computation

The mean shift algorithm is a nonparametric technique to locate density extrema or modes of a given distribution by an iterative procedure. Starting from a location $x$, the local mean shift vector represents an offset to $x'$, which is a translation towards the nearest mode along the direction of maximum increase in the underlying density function. The local density is estimated within the local neighborhood of a kernel by kernel density estimation where, at a data point $a$, the kernel weights $K(a)$ are combined with weights associated with the data, i.e. with sample weights $I(a)$. For digital images, sample weights are defined by the pixel intensities at pixel locations $a$. The new location vector $x'$ obtained after applying the mean shift offset is:

$$x' = \frac{\sum_a K(a - x)\, I(a)\, a}{\sum_a K(a - x)\, I(a)} \tag{1}$$

The difference image obtained by background subtraction can be thought of as a mixture of clusters, where cluster centers are defined by local density maxima of difference image intensities. Thus, to delineate clusters within the multimodal density function, a mean shift clustering procedure can be applied, as described in [5].

If the difference image contains a large number of clusters - a typical situation when having images of a crowded scene - a significant number of separate mean shift mode seeking procedures are needed to fully explore the multimodal density function. Such a clustering task usually becomes so computationally demanding that no real-time operation is possible. Therefore, we propose a fast way to compute the mean shift offset using a uniform kernel $K$ based on ideas originating from the boxlet work of Simard et al. [9]. The following identity is a fundamental property of the convolution operation:

$$(f * g)^{(n)} = f^{(n)} * g = f * g^{(n)} \tag{2}$$

where $*$ represents the convolution operator between the functions $f$ and $g$, and $f^{(n)}$ is the $n$-th integral of $f$ (or the $n$-th derivative for negative $n$). From this identity it follows that convolution also satisfies:

$$f * g = f^{(-n)} * g^{(n)} \tag{3}$$
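The identity in Eq. 3 can be checked numerically in one dimension. The following is a minimal sketch, not code from the paper: it assumes a causal box (uniform) kernel of width `w`, uses a cumulative sum as the discrete integral of the signal, and replaces the differentiated box by its two signed unit impulses. The function names `box_filter_direct` and `box_filter_sparse` are hypothetical.

```python
import numpy as np

def box_filter_direct(f, w):
    """Convolve f with a causal box kernel of width w (w multiply-adds per sample)."""
    g = np.ones(w)
    # Keep the first len(f) samples: out[n] = f[n-w+1] + ... + f[n]
    return np.convolve(f, g)[: len(f)]

def box_filter_sparse(f, w):
    """Same result from the integrated signal and the differentiated kernel.

    The discrete derivative of the box is +1 at offset 0 and -1 at offset w,
    so the convolution reduces to one subtraction per sample.
    """
    F = np.cumsum(f)                                     # discrete integral of f
    F_shifted = np.concatenate([np.zeros(w), F])[: len(f)]  # F[n - w], zero for n < w
    return F - F_shifted

f = np.arange(1.0, 9.0)
print(box_filter_direct(f, 3))
print(box_filter_sparse(f, 3))
```

Both calls print the same running sums. The cost of the second version does not depend on the kernel width, which is the property the fast mean shift computation relies on.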

From Eq. 3 it follows that the convolution operation can be carried out efficiently if the derivative of the filter becomes sparse and can be represented exclusively by delta functions. The mean shift vector computation (Eq. 1) for two-dimensional digital images involves three convolution operations: combining kernel and sample weights with the x- and y-pixel coordinates (numerator of Eq. 1) and the convolution between kernel and sample weights (denominator of Eq. 1). Using the convolution identity of Eq. 3 (with $n = -2$ under the convention stated for $f^{(n)}$, so that the kernel is differentiated twice and the image is integrated twice), the computation of the new location vector by mean shift can be rewritten as:

$$x' = \frac{\sum_a K''(a - x)\, ii_x(a)}{\sum_a K''(a - x)\, ii(a)} \tag{4}$$

where $K''$ represents the second derivative of the kernel $K$, differentiated with respect to each dimension of the image space, i.e. the x- and y-coordinates. Accordingly, the functions $ii_x$ and $ii$ are the double integrals, i.e. two-dimensional cumulative sums of the form:

$$ii_x(x) = \sum_{x_i} I(x_i)\, x_i \tag{5}$$
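The relation between Eq. 1 and Eq. 4 can be illustrated with a short sketch. It is written under assumptions that are not taken from the paper: the uniform kernel is a rectangle with half-sizes `half_w` and `half_h`, clipped at the image border, the integral images carry a leading zero row and column, and locations are rounded to whole pixels between iterations. All function names are hypothetical. For such a kernel, $K''$ reduces to four signed impulses at the window corners, so each sum in Eq. 4 becomes four table lookups.

```python
import numpy as np

def integral_image(values):
    """2-D cumulative sum with a leading zero row and column."""
    ii = np.zeros((values.shape[0] + 1, values.shape[1] + 1))
    ii[1:, 1:] = values.cumsum(axis=0).cumsum(axis=1)
    return ii

def build_tables(img):
    """Integral images of I, I*x and I*y (the roles of ii, ii_x and its y counterpart)."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    return integral_image(img), integral_image(img * xs), integral_image(img * ys)

def window_bounds(shape, cx, cy, half_w, half_h):
    """Support of the rectangular kernel, clipped to the image (assumption)."""
    h, w = shape
    return (max(cy - half_h, 0), min(cy + half_h + 1, h),
            max(cx - half_w, 0), min(cx + half_w + 1, w))

def box_sum(ii, top, bottom, left, right):
    """Window sum from four signed corner lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def mean_shift_step_direct(img, cx, cy, half_w, half_h):
    """Eq. 1 with a uniform kernel: intensity-weighted centroid of the window."""
    top, bottom, left, right = window_bounds(img.shape, cx, cy, half_w, half_h)
    ys, xs = np.mgrid[top:bottom, left:right]
    win = img[top:bottom, left:right]
    mass = win.sum()
    return (win * xs).sum() / mass, (win * ys).sum() / mass

def mean_shift_step_fast(tables, shape, cx, cy, half_w, half_h):
    """Same centroid in constant time per step, in the spirit of Eq. 4."""
    ii, ii_x, ii_y = tables
    bounds = window_bounds(shape, cx, cy, half_w, half_h)
    mass = box_sum(ii, *bounds)
    if mass <= 0:  # empty window: stay in place
        return float(cx), float(cy)
    return box_sum(ii_x, *bounds) / mass, box_sum(ii_y, *bounds) / mass

def seek_mode(img, cx, cy, half_w, half_h, max_iter=50):
    """Iterate the fast step until the rounded location stops changing."""
    tables = build_tables(img)
    for _ in range(max_iter):
        nx, ny = mean_shift_step_fast(tables, img.shape, cx, cy, half_w, half_h)
        nx, ny = int(round(nx)), int(round(ny))
        if (nx, ny) == (cx, cy):
            break
        cx, cy = nx, ny
    return cx, cy

# Synthetic difference image: one Gaussian blob centred at x=20, y=12.
ys, xs = np.mgrid[0:32, 0:40]
img = np.exp(-((xs - 20) ** 2 + (ys - 12) ** 2) / (2 * 3.0 ** 2))
print(seek_mode(img, cx=14, cy=9, half_w=5, half_h=5))
```

The three tables are built once per frame, after which every mean shift step costs a fixed number of lookups regardless of the window size. Rounding to whole pixels is a simplification of this sketch and can stop the iteration up to a pixel short of the true mode.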
