Multiple Object Tracking Using Local PCA

Csaba Beleznai∗
Advanced Computer Vision GmbH - ACV
Vienna, Austria
[email protected]

Bernhard Frühstück
Siemens AG Austria, Program and System Engineering
Graz, Austria

Horst Bischof
Institute for Computer Graphics and Vision
Graz University of Technology
Graz, Austria

Abstract

Tracking multiple interacting objects represents a challenging area in computer vision. The tracking problem can in general be formulated as the task of recovering the spatio-temporal trajectories of an unknown number of objects appearing and disappearing at arbitrary times. Observations are noisy and of unknown origin, generated either by true detections or by false alarms. Data association and the estimation of object states are two crucial tasks to be solved in this context. This work describes a novel, computationally efficient tracking approach to generate consistent trajectories. First, trajectory segments are created by analyzing the spatio-temporal data distribution using local principal component analysis. Subsequently, linking between trajectory segments is carried out relying on spatial proximity and kinematic smoothness constraints. Tracking results are demonstrated in the context of human tracking and compared to the results of a frame-to-frame-based tracking approach.

1. Introduction

Multiple object tracking is an extensively investigated subject in the field of visual surveillance [3]. Its main complexity stems from the fact that the observed data are usually contaminated by noise, missing observations or clutter. Typical examples of such situations are encountered, for instance, in blob-based object detection approaches relying on background subtraction [3, 6], where undersegmentation, oversegmentation and false detections are frequent problems that ultimately lead to failures when tracking is performed. Partitioning the observations such that each observation within a partition belongs to a single tracked object requires advanced techniques for data association and state estimation [1].

Figure 1. Illustration depicting typical situations in the spatio-temporal feature space containing the observations. Lines represent the true object trajectories.

A frame-to-frame-based association represents the simplest approach for relating subsequent observations; however, in the presence of noise it quickly leads to tracking failures. The multiple hypothesis tracker (MHT) [10] evaluates all association hypotheses over time. It copes better with noisy data and is also able to initiate and terminate trajectories; however, its computational complexity, caused by the underlying combinatorial optimization, makes it impractical for real-time applications. The Joint Probabilistic Data Association Filter (JPDAF) [1] prunes infeasible association hypotheses at each time step, but it assumes a fixed number of targets and cannot initialize new trajectories or terminate existing ones. Recently, a large variety of probabilistic approaches making use of Monte Carlo methods have appeared [11], with the ability to maintain multiple hypotheses at reasonable computational cost.

We propose a tracking approach which performs stable tracking in a computationally efficient manner by analyzing the spatio-temporal space of observations. Observations obtained at a uniform sampling rate for N objects over the duration T of the surveillance task can be represented

∗ This work has been carried out within the K plus Competence Center ADVANCED COMPUTER VISION and was funded by the K plus Program.

as a multi-structured distribution in a spatio-temporal feature space (see Figure 1). Such data form elongated, filament-like structures; spatially overlapping and subsequently separating objects produce branching structures. The slope of the filaments represents the velocity of the moving objects in the image space. The presented tracking framework operates faster than real time, is tolerant of noisy data, and is able to initiate and terminate trajectories and to reconnect broken ones. Our approach follows the "deferred logic" scheme [9], where the tracking process is delayed until enough observations have been acquired. The spatio-temporal analysis window typically spans several tens of frames; thus trajectories are updated with a delay of several seconds, or the process can be executed off-line.

The structure of the paper is as follows: Section 2 describes the local PCA-based curve fitting algorithm creating trajectory segments. Section 3 explains how segments are linked in order to bridge discontinuities. Section 4 demonstrates and discusses tracking results. Finally, Section 5 concludes the paper.

2. Tracking by local PCA

Principal component analysis (PCA) is a well-established approach for performing dimensionality reduction on multivariate data, assuming that the distribution is close to Gaussian. The motion of a real-world object is usually subject to kinematic constraints, such as limited acceleration or deceleration. In the tracking context this implies that motion data acquired for a given object at consecutive time instances are strongly correlated. Thus we can expect that the correlated data within a local neighborhood are well described by a local variant of the PCA algorithm. The term local PCA [12] (denoted LPCA in the following) refers to an extension of PCA which is applied to a subset of the data points, so that the distribution is locally approximated by a linear subspace. A similar approach for principal curve generation is described in [5].

Let us describe the detected objects by spatio-temporal coordinates X_i = (x_i, t_i), 1 ≤ i ≤ N, where x_i denotes the pixel coordinates of the centroid position in the image space R^2 and t_i represents the temporal coordinate (in units of seconds or frames), for all N objects detected over the full range of the video sequence. Many detectors also deliver a detection probability {p_i}_{i=1..N} for the data points, representing a likelihood measure that the observation is generated by a true detection. An object size model H(x) is obtained by calibrating the ground plane of the scene. A trajectory denotes a temporal sequence of object coordinates T = {(x_1, t_1), (x_2, t_2), ..., (x_Z, t_Z)}, where Z represents the duration.

A piecewise linear trajectory approximation is obtained by repeatedly relocating the analysis window along the first principal component and subsequently centering it on the local distribution of the data:

1. An initial point X_1 is chosen (see Figure 2a).

Figure 2. (a): Estimating trajectory segments by local PCA. (b): Trajectory segment linking.

2. Mean shift iterations [4] are performed starting from the point X_1 until convergence, leading to the nearby local density maximum X_1'. X_1' is added to the set of trajectory coordinates. The mean shift procedure uses an Epanechnikov kernel [4] of size H(x) and combines the kernel weights with the data weights p.

3. LPCA is performed at X_1' with a window size of H(x'). Denote the resulting eigenvectors u_1, u_2 and u_3, arranged in descending order of the eigenvalues λ1 ≥ λ2 ≥ λ3. If the local distribution of the data is elongated, λ1 is significantly larger than λ2 and λ3. Interacting objects might generate a distribution, such as the data in Figure 2a representing two intersecting trajectories, where no dominant principal direction can be estimated locally. The ratio λ1/λ2 is therefore used as a local anisotropy measure. If this measure approaches unity, the trajectory segment estimation procedure is stopped. Thus, trajectory segments are created only at locations of the feature space where the data support the reliable estimation of a principal direction. Trajectory segments are thereafter connected by the linking procedure described in Section 3.

4. The analysis window is moved along the first principal component to a new location X_1'' by the amount of H(x').

5. Step 2 is executed starting from the point X_1''. If no more data points are available (Step 2 does not converge to a new location), the procedure is stopped and the trajectory is terminated.

The eigenvector u_1 is interpreted as a velocity estimate. A velocity model is initialized and propagated during the tracking process: v(t + Δ) = α_s · v(t) + (1 − α_s) · u_1, where α_s denotes a constant which is positive and smaller than one. In successive LPCA steps, the velocity model is

used to render the estimation of the local principal component more robust by applying locally weighted PCA [7] to the data. The velocity model at each local density maximum X_j' (as determined in Step 2) provides a local motion estimate. Weights inversely proportional to the distance between the data points and the local motion estimate are assigned to the data using a Gaussian weighting kernel [8]. Subsequently, the product of the motion-based weights and the detection probabilities p is formed in order to favour data points which are likely to correspond to true detections and which support the local motion estimate at the same time.

The set of starting points is initially defined as the entire set of points {X_i}_{i=1..N}. The tracking procedure is initiated at the data point with the smallest available temporal coordinate in the spatio-temporal feature space. Data points which have been analyzed by LPCA are removed from the set of starting points.
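As a rough illustration of the core of one LPCA step, the following sketch fits a principal direction to the spatio-temporal points inside a window and checks the anisotropy ratio. This is a simplified sketch with hypothetical names (`lpca_step`, a fixed spherical window, a fixed anisotropy threshold); the mean shift seeding, kernel weighting and velocity model described above are omitted:

```python
import numpy as np

def lpca_step(points, center, window_size, anisotropy_min=2.0):
    """One local-PCA step on spatio-temporal observations.

    points: (N, 3) array of (x, y, t) observations
    center: (3,) current analysis-window position
    Returns (direction, ok): the unit first principal component and a
    flag that is False when no dominant direction exists locally
    (lambda1/lambda2 near 1, e.g. at intersecting trajectories).
    """
    local = points[np.linalg.norm(points - center, axis=1) < window_size]
    if len(local) < 3:
        return None, False
    centered = local - local.mean(axis=0)
    # Eigen-decomposition of the local covariance matrix (classic PCA)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(eigvals)[::-1]          # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    anisotropy = eigvals[0] / max(eigvals[1], 1e-12)
    direction = eigvecs[:, 0]
    if direction[2] < 0:                       # orient forward in time
        direction = -direction
    return direction, anisotropy >= anisotropy_min

# Noisy points along a straight spatio-temporal "filament"
t = np.linspace(0.0, 10.0, 50)
pts = np.stack([2.0 * t, 0.5 * t, t], axis=1)
pts += np.random.default_rng(0).normal(0.0, 0.05, pts.shape)
d, ok = lpca_step(pts, pts[25], window_size=3.0)
```

On such a filament the first eigenvector recovers the slope, i.e. the velocity estimate u_1; stepping the window along `d` by the local object size would trace the segment.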

3. Trajectory segment linking

The local approximation by PCA generates a set of trajectories {T_i}_{i=1..K}. Some of the generated trajectories might be discontinuous. The objective of the linking algorithm is to connect two trajectory segments T_m and T_n by inserting a link L between their end points such that a cost function C(L) is minimized over all links connecting segments (see Figure 2b). The cost function is defined as

C(L) = l(L) / H(x_c) + δ S(L),    (1)

where l(L) denotes the length of the inserted link, normalized by the object height model H(x_c) (see Figure 2b). The second term S(L) is a smoothness penalty, equal to the sum of the angles between the link and the connected segments, and δ is a weighting factor penalizing directional changes between trajectory segments. Trajectory segment pairs are linked using a greedy strategy until a cost limit C_max is reached. Temporally co-existing trajectories are excluded from the matching process.
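A minimal sketch of such a greedy linking pass might look as follows. All names (`link_cost`, `greedy_link`) and the reduction of each segment to its end points and overall direction are simplifying assumptions; the cost mirrors the structure of Eq. (1), with temporally co-existing segments excluded via an infinite cost:

```python
import math

def link_cost(seg_a, seg_b, height, delta):
    """Cost of linking the end of seg_a to the start of seg_b:
    normalized link length plus an angle (smoothness) penalty,
    in the spirit of Eq. (1). Segments are time-ordered lists
    of (x, y, t) points."""
    (x1, y1, t1), (x2, y2, t2) = seg_a[-1], seg_b[0]
    if t2 <= t1:
        return math.inf                     # temporally co-existing: never link
    length = math.hypot(x2 - x1, y2 - y1)

    def angle(v, w):
        dot = v[0] * w[0] + v[1] * w[1]
        n = math.hypot(*v) * math.hypot(*w)
        return math.acos(max(-1.0, min(1.0, dot / n))) if n else 0.0

    link = (x2 - x1, y2 - y1)
    dir_a = (seg_a[-1][0] - seg_a[0][0], seg_a[-1][1] - seg_a[0][1])
    dir_b = (seg_b[-1][0] - seg_b[0][0], seg_b[-1][1] - seg_b[0][1])
    smooth = angle(dir_a, link) + angle(link, dir_b)
    return length / height + delta * smooth

def greedy_link(segments, height=10.0, delta=1.0, c_max=2.0):
    """Repeatedly merge the cheapest admissible segment pair
    until the cost limit c_max is exceeded."""
    segs = [list(s) for s in segments]
    while True:
        cost, i, j = min(
            ((link_cost(a, b, height, delta), i, j)
             for i, a in enumerate(segs)
             for j, b in enumerate(segs) if i != j),
            default=(math.inf, -1, -1))
        if cost > c_max:
            return segs
        merged = segs[i] + segs[j]
        segs = [s for k, s in enumerate(segs) if k not in (i, j)] + [merged]

# Two collinear fragments separated by a small temporal gap
a = [(0, 0, 0), (5, 0, 5)]
b = [(7, 0, 7), (12, 0, 12)]
linked = greedy_link([a, b])
```

Here the collinear fragments incur zero angle penalty and a small normalized length, so they are merged into a single trajectory; a sideways or backwards-in-time candidate would be rejected by the angle term or the co-existence check.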

4. Results and discussion

Motion detection relying on an adaptive background-subtraction-based technique [3] was performed on two video sequences. Sequence A depicts a scene viewed from the top (2430 frames, 320×240 pixels); Sequence B (1940 frames, 360×288 pixels) shows a railway station scenario with a large number of moving and some standing humans (see Figures 3a and 3b). Moving objects are delineated from the difference image using a model-based clustering technique (see [2] for details). Typical distributions of data points in the spatio-temporal feature space are shown in the top part of Figure 3. The feature space contains many vertical structures, which correspond to observations primarily generated by noise or motion clutter.

Figure 3. Obtained trajectories in the spatio-temporal feature space (top) and backprojected into the image (bottom) for Sequence A (a) and Sequence B (b).

The LPCA-based tracking approach was applied to the data obtained for Sequences A and B. Examples of the obtained trajectories within the feature space and backprojected into the image space are shown in Figure 3. The density of humans in Sequence A is low, and due to the camera viewpoint there are no significant overlaps between tracked objects; therefore, the obtained trajectories are highly consistent with the underlying data. Humans in Sequence B exhibit frequent overlaps, many of them are only partially visible throughout the sequence, and there is a large amount of motion clutter. Most of the obtained trajectories agree well with the distribution of the observed data; however, in neighborhoods containing multiple observations and additional noise, the local estimation of the trajectory segments occasionally produces errors.

The quality of the LPCA tracking approach is examined in terms of stability and accuracy for a selected trajectory. In Sequence B a group of three humans moves jointly across the image. The motion path of one individual of the group (indicated by a bounding box in Figure 4a) was annotated manually. The tracking accuracy with respect to the annotated trajectory was computed for the proposed LPCA-based tracker and for a frame-to-frame-based tracker, the latter based on the work in [6]. To assess the tracking accuracy, the Euclidean distance between the coordinates of the ground truth and the obtained trajectories is computed for each frame. The distance is normalized by the object height model, yielding the normalized deviation D between trajectory coordinates at time instance j:

D(j) = ||x_j^m − x_j^0|| / H(x_j^0),    (2)

where x_j^m and x_j^0 denote the coordinates of the measured and the ground truth trajectories, respectively.
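The per-frame deviation of Eq. (2) reduces to a few lines of vectorized code. The sketch below uses synthetic coordinates and treats the height model as a plain per-frame scalar, which is an assumption for illustration only:

```python
import numpy as np

def normalized_deviation(measured, ground_truth, heights):
    """Eq. (2): per-frame Euclidean distance between measured and
    ground-truth image coordinates, normalized by the local object
    height model. measured, ground_truth: (T, 2); heights: (T,)."""
    return np.linalg.norm(measured - ground_truth, axis=1) / heights

# Synthetic example: two frames, object height model of 50 pixels
gt = np.array([[100.0, 50.0], [102.0, 52.0]])
meas = np.array([[103.0, 54.0], [102.0, 55.0]])
h = np.array([50.0, 50.0])
D = normalized_deviation(meas, gt, h)   # deviation as a fraction of height
```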

The evolution of the normalized deviation over the duration of the temporal overlap between the measured and ground truth trajectories is shown in Figure 4b. Frame-to-frame-based tracking (bright line) temporarily loses track due to missing observations (between the two arrows). Furthermore, the presence of several nearby humans leads to association ambiguities, and the tracking results deviate strongly from the ground truth (right half of plot 4b). The LPCA-based tracking approach (dark line in Figure 4) tracks the object until its disappearance. It is able to bridge the gap of missing observations where frame-to-frame-based tracking fails, although it is temporarily perturbed by the lack of observations. The trajectory obtained by LPCA deviates from the reference trajectory by about 10% of the local human height.

Figure 4. (a): Results of frame-to-frame tracking (gray line), the proposed approach (black line) and the ground truth trajectory (green line) for a selected human (rectangle). (b): The normalized spatial deviation of the tracking results w.r.t. the ground truth.

The trajectories of all moving humans were annotated in the first 1013 frames of Sequence B and the overall tracking performance was analyzed (Table 1). The proposed approach shows improved tracking stability (fewer trajectory fragments) and higher spatial accuracy compared to the frame-to-frame-based tracker.

Table 1. Tracking performance (based on comparisons to 42 annotated trajectories in the first 1013 frames of Sequence B)

Tracking method   # of detected trajectories   Avg. norm. dev.
LPCA              62                           0.19
Frame-to-frame    93                           0.3

Our approach was implemented in MATLAB. Tracking in the spatio-temporal volume of Sequence A, which contains on average 22 observations per frame, can be performed at a rate of 200 fps on a 3.2 GHz computer.

5. Conclusions

We present a novel tracking approach using local principal component analysis to partition observed data into trajectories describing moving objects. A simple trajectory segment linking step is subsequently carried out to cope with discontinuities in the distribution of the observed data. The algorithm is applied to two video data sets and compared to the results of a frame-to-frame-based tracker. It produces stable tracking results when tracking many interacting targets, while maintaining high computational speed. Further research will focus on improving tracking in the presence of noise by performing hierarchical grouping of multiple local trajectory estimates.

References

[1] Y. Bar-Shalom. Tracking and Data Association. Academic Press Professional, San Diego, CA, USA, 1987.
[2] C. Beleznai, B. Frühstück, and H. Bischof. Tracking multiple humans using fast mean shift mode seeking. In Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pages 25-32, Breckenridge, USA, January 2005.
[3] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa. A system for video surveillance and monitoring: VSAM final report. Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, 2000.
[4] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.
[5] J. Einbeck, G. Tutz, and L. Evers. Local principal curves. Statistics and Computing, 15(4):301-313, October 2005.
[6] L. M. Fuentes and S. A. Velastin. People tracking in surveillance applications. In Proceedings of the 2nd IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Kauai, Hawaii, 2001.
[7] P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
[8] J. Park, Z. Zhang, H. Zha, and R. Kasturi. Local smoothing for manifold learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 452-459, 2004.
[9] A. Poore. Multidimensional assignment and multitarget tracking. In Partitioning Data Sets, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 19:169-196, 1995.
[10] D. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24(6):843-854, 1979.
[11] J. Vermaak, A. Doucet, and P. Pérez. Maintaining multi-modality through mixture tracking. In Proceedings of the International Conference on Computer Vision (ICCV'03), Nice, France, June 2003.
[12] L. Xu. Multisets modeling learning: A unified theory for supervised and unsupervised learning. In Proceedings of IEEE ICNN'94, volume I, pages 315-320, Orlando, FL, June 1994.
