A Graph-Based, Multi-resolution Algorithm for Tracking Objects in Presence of Occlusions

Donatello Conte (1), Pasquale Foggia (2), Jean-Michel Jolion (3), and Mario Vento (1)

(1) Dipartimento di Ingegneria dell’Informazione ed Ingegneria Elettrica, Università di Salerno, Via Ponte don Melillo, I-84084 Fisciano (SA), Italy. {dconte, mvento}@unisa.it
(2) Dipartimento di Informatica e Sistemistica, Università di Napoli, Via Claudio 21, I-80125 Napoli, Italy. [email protected]
(3) Lyon Research Center for Images and Information Systems, FRE CNRS 2672, Bat. 403 INSA Lyon, 69621 Villeurbanne Cedex, France. [email protected]

Abstract. In a video surveillance system, object tracking is one of the most challenging problems. Objects in the world exhibit complex interactions; when captured in a video sequence, some of these interactions manifest themselves as occlusions. A visual tracking system must be able to track objects that are partially or even fully occluded. In this paper we present a novel method for tracking objects through occlusions, using a multi-resolution representation of the moving regions. The matching between objects in two consecutive frames, needed to recognize the trajectories, is performed with a graph-theoretic approach. Experimental results on the standard PETS2001 database show that the approach is promising.

1 Introduction

Real-time object tracking is becoming more and more important in the field of video analysis and processing. Applications like traffic control, user-computer interaction, on-line video processing and production, and video surveillance need reliable and economically affordable video tracking tools. In recent years this topic has received increasing attention from researchers; however, many of the key problems are still unsolved.

Occlusions represent one of the most difficult problems in motion analysis. Most early works in this field either do not deal with occlusions at all [14] or impose rigid constraints on camera positioning [2] in order to make the problem tractable. While there are applications where this is acceptable, in many real-world cases the ability to deal with occlusions is essential to the practical usefulness of a tracking system. Very often the camera positioning is dictated by architectural constraints (e.g., the inside of a building); furthermore, it may not be desirable to have a point of view that is too high with respect to the observed scene, since this may limit the ability to classify the tracked objects. Hence a tracking system should be prepared for the possibility that the object being followed becomes (at least partially) covered, either by a static element of the scene (e.g., a tree or a wall) or by another moving object. This is particularly true when the objects being tracked are persons: even in an uncrowded scene, two persons are likely to walk close enough to each other (say, because they are talking) to form an occlusion.

In this paper we present a novel algorithm for object tracking that is able to deal with partial occlusions of the objects. In particular, our algorithm is based on a pyramidal decomposition of the objects being tracked, using a multi-resolution approach. When an object is partially occluded by another, the algorithm descends to a more detailed level of representation in order to assign each part of the collapsed region to the proper object. In this paper we focus only on the tracking phase of the system. This phase is built upon a foreground detection technique described in another work [3], which identifies, for each frame of the video, the pixels (grouped into connected components) that belong to the moving regions.

The rest of the paper is organized as follows. A review of related research is presented in Section 2. Section 3 describes the architecture of our tracking system. In Section 4 and Section 5 our algorithm is described with some examples. Section 6 presents the experimental results on a standard database of video sequences. Finally, some discussion and conclusions can be found in Section 7.

2 Related Works

In the last decade there have been several works on tracking moving objects through occlusions. A first group of algorithms [4, 15] deals with occlusions by estimating the positions and velocities of the objects participating in the occlusion. After the occlusion, the correct object identities are re-established using the estimated object locations. In [15] the estimation is performed by Kalman filters. In these systems, if one of the objects suddenly changes its direction during the occlusion, the estimate will be inaccurate, possibly leading to a tracking error. In [4], when regions merge, each single region is given the position of the merged region; that is, the objects belonging to the region take the coordinates, the center, etc. of the latter. In all the works of this group, statistical features measured before the occlusion begins are used to resolve the labels after the occlusion; however, these systems are not able to decide which pixels belong to which object during the occlusion event.

A different framework used in object tracking is the layered image representation [18], in which image sequences are decomposed into a set of layers ordered by depth, along with associated maps defining motions, intensities and opacities. The layer ordering, together with the shape, motion and appearance of all layers, provides complete information for occlusion reasoning. The drawback of these algorithms is the high computational cost of assigning each pixel to the correct layer. In order to reduce this cost, approximate solutions to the layer assignment problem are described in [18].

Most of the published works [1, 5, 6, 9, 11, 12, 13, 16, 17] construct appearance models of the tracked objects, so that their identity can be preserved through occlusion events. In [1] a feature-based method is used: in order to handle occlusions, instead of tracking entire objects (specifically vehicles), object sub-features are detected and tracked. Such features are then grouped into objects, also using motion information derived from the tracking. This approach, however, is computationally very expensive; moreover, it cannot be easily extended to non-rigid objects (such as persons). The works of Haritaoglu et al. [6] and Wren et al. [17] present systems to recognize people. Both use silhouettes to model a person and his parts; however, [17] assumes that there is only a single person in the scene, while [6] allows multiple person groups and isolated people. During an occlusion, a group (a detected moving region) is segmented into its constituent individuals by finding the best matching between the people models and parts of the region.

The other works [5, 9, 11, 12, 13, 16] use color information to represent the objects while they are isolated (building the so-called appearance model of the object). This enables the correct segmentation of the objects when they overlap during an occlusion: the algorithms search for the labeling of the sub-regions that maximizes the likelihood of the appearance of the detected region given the models of the individual objects. In [16] the appearance model of an object is a template mask derived from shape and color information. In [11, 12, 13] an object is modelled as a set of vertically aligned blobs (since the objects to detect are people), where a blob is a region with a coherent color distribution; these works differ from one another in the definition of the coherence criterion. In [5] the tracking is performed at two levels: a blob-level tracking (in a graph-based framework), that is, the tracking of the connected moving regions, and an object-level tracking, in which an object consists of one or more blobs and a blob can be shared among several objects. In [9] simple shape and color information are used, because that work deals with low-resolution video sequences.

The method we propose belongs to the last category, since it is based on an appearance model. In particular, we model a region with a hierarchical decomposition represented as a graph pyramid; then, during an occlusion, for each region that represents a group of objects, we try to segment the region by matching its hierarchy with the hierarchies of the objects present before the occlusion. The advantage of our proposal with respect to similar techniques is that the matching is performed in a top-down fashion: the algorithm starts with the topmost level of the hierarchies (that is, the least detailed one), iteratively resorting to lower, more detailed levels only in case of ambiguities. In this way, the computational cost is kept low in the average case, while retaining the ability to perform a precise segmentation in the infrequent difficult cases.

3 The Proposed Representation and Algorithm

Let us first clarify some terms, to make the description of the algorithm easier to follow. By regions (or, equivalently, detected regions) we mean the connected regions produced by the foreground detection step. By objects we mean the real objects of interest (people, cars, etc.) present in a frame; thus, during an occlusion, a single detected region can be composed of several objects.

The simplest representation of a detected moving region is provided by its bounding box, which contains enough information for tracking when there are no occlusions. Our method is based on a more complex representation, a graph pyramid, that in the absence of occlusions retains the simplicity and effectiveness of the bounding box, but enables a more accurate object matching during an occlusion. Namely, each moving region is represented at different levels of resolution, using a graph for each level. At the topmost level, the apex of the pyramid, the graph is composed of a single node, containing as attributes the position and dimension of the bounding box and the average color. At the lowest level there is an adjacency graph, where the nodes represent single pixels and the edges encode the 4-connected adjacency relation. The intermediate levels are obtained by a bottom-up process, using the classical decimation-grouping procedure described in [10], where color similarity is used to decide which nodes must be merged. Notice that the number of levels in a pyramid is not fixed, but depends on the color uniformity of the region. Each intermediate node represents a sub-region of the whole region, and its attributes are the bounding box and average color of that sub-region.

Since the construction of the pyramids is an expensive task, it is performed from scratch only on the first frame for all the moving regions. Then, for objects that are not occluded, the pyramid at each successive frame is incrementally derived from the one at the previous frame. Only when an occlusion happens does the corresponding pyramid need to be recomputed from scratch.

During the tracking process, each node in the pyramid receives a label representing the identity of the object to which the corresponding (sub)region belongs. When a (sub)region contains parts of more than one object, the corresponding node is labelled MULTILABEL.

Let us now turn our attention to the tracking algorithm. Its goal can be stated as follows: given the labelled representation of a frame I_t and the representation of the next frame I_{t+1}, the algorithm should find a labelling for the latter that is consistent with the object identities.
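As an illustration, the following minimal Python sketch shows one possible encoding of this representation. All names (GraphPyramid, PyramidNode, MULTILABEL) are our own illustrative choices, not identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Sentinel label for nodes whose (sub)region spans more than one object.
MULTILABEL = "MULTILABEL"

@dataclass
class PyramidNode:
    """One node of the graph pyramid: a (sub)region at some resolution level."""
    bbox: Tuple[int, int, int, int]          # x, y, width, height of the bounding box
    mean_color: Tuple[float, float, float]   # average RGB color of the (sub)region
    label: Optional[str] = None              # object identity, or MULTILABEL
    children: List["PyramidNode"] = field(default_factory=list)  # next-level nodes

@dataclass
class GraphPyramid:
    """Multi-resolution representation of one detected moving region.

    levels[0] holds the single apex node; levels[-1] is the pixel adjacency
    graph. The number of levels depends on the color uniformity of the region.
    """
    levels: List[List[PyramidNode]]

    @property
    def apex(self) -> PyramidNode:
        return self.levels[0][0]
```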

Fig. 1. Scheme of the Tracking Algorithm

To achieve this result, the algorithm proceeds as follows. After a first initialization step, in which the graph pyramid is computed for each region and labels are assigned assuming that each region contains a single object, the algorithm compares the topmost levels of each pyramid in I_t with those in I_{t+1}. The way the comparison is performed is detailed in Section 4. If the comparison outcome is sufficient to assign a label to each node, the algorithm terminates. If instead some ambiguities arise (as is the case when two objects overlap), the comparison is repeated using the next levels of the pyramids, until either a consistent labelling is found or the bottom of the pyramids is reached. In this (infrequent) case, the algorithm outputs a “best effort” labelling. The whole process is sketched in Fig. 1, while Fig. 2 illustrates the pyramidal representation of a toy image; a code sketch of this coarse-to-fine loop is given after the figure.

Fig. 2. The hierarchical representation: a) the input image; b) the detected moving regions; c) the graph pyramids representing the detected regions
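The sketch below illustrates the coarse-to-fine control flow in Python. All names are ours, not from the paper, and match_level is the per-level comparison routine sketched in Section 4; it is assumed to label the nodes it can and to return, for both frames, the nodes whose labelling is still ambiguous.

```python
def track_frame_pair(pyramids_prev, pyramids_curr):
    """Coarse-to-fine labelling sketch for one pair of frames I_t, I_{t+1}.

    Assumes `match_level` (see the Section 4 sketch) labels what it can and
    returns the previous- and current-frame nodes that remain ambiguous.
    """
    # Initialization: start from the apexes, i.e. the least detailed level.
    prev_nodes = [p.apex for p in pyramids_prev]
    curr_nodes = [p.apex for p in pyramids_curr]

    while curr_nodes:
        ambiguous_prev, ambiguous_curr = match_level(prev_nodes, curr_nodes)
        if not ambiguous_curr:
            return  # every node received a consistent label
        # Descend one level: only children of ambiguous nodes are re-examined.
        prev_nodes = [c for n in ambiguous_prev for c in n.children]
        curr_nodes = [c for n in ambiguous_curr for c in n.children]

    # Bottom of the pyramids reached with ambiguities left: the "best effort"
    # labelling produced by the last matching step is kept (infrequent case).
```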

4 Comparison and Processing of Adjacent Frames

Let I_t and I_{t+1} be two consecutive frame images. At each level of the hierarchical representation there is a graph for each detected region, and the algorithm needs to compare each graph of I_t with each graph of I_{t+1}. In order to perform this comparison, we first introduce a similarity measure between graph nodes, and then explain how this similarity is used in the comparison and labelling process.

As stated in the previous section, each node is described by the dimension and position of its bounding box and by its color (or gray level, for monochrome videos). We have defined a color similarity Sc ∈ [0,1], a dimension similarity Sd ∈ [0,1] and a position similarity Sp ∈ [0,1]. These three functions are based on the Euclidean distance in the corresponding spaces, and are normalized so that their value is 1 if and only if the corresponding features are equal. Then, we define the similarity between two nodes n and m as the product S(n, m) = Sc · Sd · Sp. In order to avoid considering matchings between nodes that are too dissimilar, we have introduced a threshold T2: when S falls below T2, it is truncated to 0 (meaning that the nodes should never be matched).

We now explain in more detail how the comparison and labelling process is performed. First, at the current level of the comparison, the algorithm computes the similarities between all the nodes of the graphs of I_t and all those of I_{t+1} that have not yet been labelled in the previous steps. Then an association graph is built: the association graph is a bipartite graph where each node n of I_t is linked to each node m of I_{t+1} such that S(n, m) > 0. Finally, a weighted bipartite graph matching (WBGM, see [8]) is performed, finding the matching between the nodes of I_t and those of I_{t+1} that maximizes the sum of the corresponding similarities.

At this point, each pair (n, m) of matched nodes is examined, and the similarity S(n, m) is compared with a threshold T1 (which is, clearly, greater than the previously defined T2). If S > T1, then the label of n is assigned to m. If this label is not MULTILABEL, it is also propagated to the descendants of m at the successive levels; otherwise, the descendants of m receive the labels of the corresponding descendants of n, in a recursive process. If S < T1, then the algorithm attempts a refined matching at the next level, marking the children of m and the children of all the nodes n' of I_t for which S(n', m) > T2 as candidates for further processing at the next level of resolution. If a node m has no matching for which S > T2, and it is at the first level of resolution, then it is considered a new object, and a new label is generated for it. Otherwise, if the node is at a more detailed level, it is marked as a new sub-region, and will be labelled later. The algorithm terminates when all the nodes of the current level have received a label, or when the bottom level is reached. In this latter case, the algorithm uses the labelling derived from the found WBGM even when S < T1. A code sketch of this per-level comparison is given at the end of this section.

... while A122 is left unmatched. So B121 receives a label, which is propagated to B12. Now B1 can be labeled MULTILABEL.
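The sketch below shows one concrete way to realize the per-level comparison of Section 4, under stated assumptions: the scale constants and the thresholds T1 and T2 are illustrative values (the paper does not report them), new_label is an assumed helper that mints a fresh object identity, and for brevity the descent step keeps only the matched pair (n, m) rather than all nodes n' with S(n', m) > T2 as the paper prescribes. The WBGM is computed with the Hungarian method [8] via scipy.optimize.linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

T1, T2 = 0.8, 0.3    # illustrative thresholds (T1 > T2); not from the paper

def _norm_sim(a, b, scale):
    """Euclidean-distance similarity normalized to [0, 1]; equals 1 iff a == b."""
    d = np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return max(0.0, 1.0 - d / scale)

def similarity(n, m):
    """S(n, m) = Sc * Sd * Sp, truncated to 0 below T2 (never matched)."""
    sc = _norm_sim(n.mean_color, m.mean_color, scale=442.0)  # max RGB distance
    sd = _norm_sim(n.bbox[2:], m.bbox[2:], scale=100.0)      # width, height
    sp = _norm_sim(n.bbox[:2], m.bbox[:2], scale=100.0)      # x, y position
    s = sc * sd * sp
    return s if s > T2 else 0.0

def match_level(prev_nodes, curr_nodes):
    """One comparison level: WBGM on the association graph, then labelling.

    Returns the previous- and current-frame nodes still ambiguous, to be
    refined at the next, more detailed level of resolution.
    """
    S = np.array([[similarity(n, m) for m in curr_nodes] for n in prev_nodes])
    rows, cols = linear_sum_assignment(S, maximize=True)  # Hungarian method [8]

    ambiguous_prev, ambiguous_curr, matched = [], [], set()
    for i, j in zip(rows, cols):
        n, m, s = prev_nodes[i], curr_nodes[j], S[i, j]
        if s > T1:
            m.label = n.label      # confident match: propagate the label
            matched.add(j)         # (a MULTILABEL node propagates per-descendant)
        elif s > 0.0:
            ambiguous_prev.append(n)   # refine at the next level
            ambiguous_curr.append(m)
            matched.add(j)
    for j, m in enumerate(curr_nodes):
        if j not in matched:       # no match with S > T2
            m.label = new_label()  # assumed helper: new object (top level only)
    return ambiguous_prev, ambiguous_curr
```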

6 Experimental Results

The goal of our effort was to develop a tracking system able to handle occlusions. Given this focus, we report results on the PETS 2001 database [7]. The PETS dataset is a standard database for evaluating the performance of tracking systems. We performed our experimentation on test dataset 1, camera 1: in this video sequence there are moving people and cars that occlude each other. The same dataset is used by Fuentes [4] to evaluate his tracking algorithm. Tab. 1 shows the characteristics of the chosen video: it consists of 2688 frames, of which about 400 present occlusions.

Table 1. PETS 2001 characteristics

Dataset name                         Number of frames   Percentage of occlusions   Environment
PETS 2001 Test Dataset 1, Camera 1   2688               15%                        Outdoor

In order to provide a quantitative measure of the tracker performance, we use the performance index P presented in [3]. The experimental results are shown in Tab. 2.

Table 2. Experimental results: comparison with a standard tracking algorithm

Experiment                                                      TP     TN   FP   FN    P
#1 Whole dataset with our algorithm                             5776   14   0    124   0.979
#2 “Occlusions” sequence with our algorithm                     58     2    0    10    0.857
#3 “Occlusions” sequence with a standard tracking algorithm     37     2    0    31    0.557
#4 Whole dataset with a standard tracking algorithm             5744   14   7    156   0.972
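Although the exact definition of the index P is given in [3], the values reported in Tab. 2 are consistent with reading P as the overall accuracy, P = (TP + TN) / (TP + TN + FP + FN); for instance, for row #1, (5776 + 14) / (5776 + 14 + 0 + 124) ≈ 0.979.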

The first row gives the performance values for the whole dataset; over the whole video sequence the performance is very high. The second row shows the performance on a sequence of about one hundred frames containing only occlusions (Fig. 7 shows three frames of this sequence with the detected objects). These results are quite promising, considering that in real situations the percentage of occlusions is relatively low; moreover, the performance is much higher than that of a simple algorithm that does not handle occlusions (third row). Finally, the fourth row reports the results of the simple tracking algorithm on the whole dataset. The two algorithms are comparable here, since occlusions constitute only a small percentage of the entire video sequence; the effectiveness of our occlusion handling is evident on the occlusion-only sequence.

A more interesting comparison follows. We compared our algorithm with another algorithm (Senior et al. [16]) that deals with occlusions; [16] introduces some indexes suitable for the evaluation of this kind of algorithm, and uses the same dataset. We computed the values of these indexes for our algorithm and compared them with those reported by the authors themselves. Tab. 3 shows the effectiveness of our approach: all the error indexes are much lower than those of the algorithm in [16].

Table 3. Experimental results: comparison with another algorithm that deals with occlusions

                             Senior algorithm [16]   Our algorithm
Track error fp               16/14                   4/14
Track error fn               4/14                    1/14
Average position error       5.51                    0.24
Average area error           -346                    -8.4
Average detection lag        1.71                    0
Average track completeness   0.12                    0.04

7 Conclusions

In this paper we discussed an object tracking algorithm that handles occlusions through a graph-theoretic approach with a multi-resolution representation. We showed that by using a multi-level representation of the moving objects in the scene, together with a graph-based matching, it is possible to deal with occlusions: the algorithm can recognize, within a single connected moving region, the parts that belong to different objects.

Fig. 7. Three images from the sequence of experiment #2: a) before the occlusion; b) during the occlusion; c) after the occlusion

A future development of our method will investigate several enhancements of the algorithm; one of these is increasing its speed. In fact, the execution time of the algorithm is currently relatively high because of the construction of the pyramid. The idea, discussed above, of keeping the pyramid of an object for several frames is one possible solution, but we intend to investigate others as well.

References

1. D. Beymer, J. Malik. “Tracking vehicles in congested traffic”. SPIE Vol. 2902, Transportation Sensors and Controls: Collision Avoidance, Traffic Management, and ITS, pp. 8-18. 1996.
2. A. Bobick, S. Intille, J. Davis, F. Baird, C. Pinhanez, L. Campbell, Y. Ivanov, A. Schütte, A. Wilson. “The KidsRoom: A perceptually-based interactive and immersive story environment”. PRESENCE: Teleoperators and Virtual Environments, Vol. 8, No. 4, pp. 367-391. 1999.

3. D. Conte, P. Foggia, C. Guidobaldi, A. Limongiello, M. Vento. “An Object Tracking Algorithm Combining Different Cost Functions”. LNCS, Vol. 3212, pp. 614-622. 2004.
4. L. M. Fuentes, S. A. Velastin. “People tracking in surveillance applications”. Second IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
5. S. Gupte, O. Masoud, R. F. K. Martin, N. P. Papanikolopoulos. “Detection and Classification of Vehicles”. IEEE Transactions on Intelligent Transportation Systems, Vol. 3, No. 1, pp. 37-47. 2002.
6. I. Haritaoglu, D. Harwood, L. S. Davis. “W4: Real-Time Surveillance of People and Their Activities”. IEEE Transactions on PAMI, Vol. 22, No. 8, pp. 809-830. 2000.
7. http://www.cvg.cs.rdg.ac.uk/PETS2001/pets2001-dataset.html
8. H. W. Kuhn. “The Hungarian Method for the Assignment Problem”. Naval Research Logistics Quarterly, Vol. 2, pp. 83-97. 1955.
9. P. KaewTrakulPong, R. Bowden. “A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes”. Image and Vision Computing, Vol. 21, No. 10, pp. 913-929. 2003.
10. J. M. Jolion, A. Montanvert. “The Adaptive Pyramid: A Framework for 2D Image Analysis”. CVGIP: Image Understanding, Vol. 55, No. 3, pp. 339-348. 1992.
11. J. Li, C. S. Chua, Y. K. Ho. “Color Based Multiple People Tracking”. Seventh International Conference on Control, Automation, Robotics and Vision, pp. 309-314. 2002.
12. S. J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, H. Wechsler. “Tracking Groups of People”. Computer Vision and Image Understanding, Vol. 80, pp. 42-56. 2000.
13. N. M. Oliver, B. Rosario, A. Pentland. “A Bayesian computer vision system for modeling human interactions”. IEEE Transactions on PAMI, Vol. 22, No. 8, pp. 831-843. 2000.
14. T. J. Olson, F. Z. Brill. “Moving Object Detection and Event Recognition Algorithms for Smart Cameras”. Proceedings DARPA Image Understanding Workshop, pp. 159-175. 1997.
15. R. Rosales, S. Sclaroff. “Improved Tracking of Multiple Humans with Trajectory Prediction and Occlusion Modeling”. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, Workshop on the Interpretation of Visual Motion, 1998.
16. A. Senior, A. Hampapur, Y. L. Tian, L. Brown, S. Pankanti, R. Bolle. “Appearance Models for Occlusion Handling”. Proceedings of the Second International Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
17. C. R. Wren, A. Azarbayejani, T. Darrell, A. Pentland. “Pfinder: Real-Time Tracking of the Human Body”. IEEE Transactions on PAMI, Vol. 19, No. 7, pp. 780-785. 1997.
18. Y. Zhou, H. Tao. “A Background Layer Model for Object Tracking through Occlusion”. International Conference on Computer Vision, Nice, France, pp. 1079-1085. 2003.
