Pattern Recognition 39 (2006) 562 – 572 www.elsevier.com/locate/patcog
A graph-based, multi-resolution algorithm for tracking objects in presence of occlusions

Donatello Conte a,∗, Pasquale Foggia b, Jean-Michel Jolion c, Mario Vento a

a Dipartimento di Ingegneria dell'Informazione ed Ingegneria Elettrica, Università di Salerno, Via Ponte don Melillo, I-84084 Fisciano (SA), Italy
b Dipartimento di Informatica e Sistemistica, Università di Napoli, Via Claudio 21, I-80125 Napoli, Italy
c Lyon Research Center for Images and Information Systems, UMR CNRS 5205, Bat. J. Verne, INSA Lyon, 69621 Villeurbanne Cedex, France

∗ Corresponding author. Tel.: +39 089 96 42 27.
E-mail addresses: [email protected] (D. Conte), [email protected] (P. Foggia), [email protected] (J.-M. Jolion), [email protected] (M. Vento).
Received 7 October 2005; received in revised form 13 October 2005
Abstract

One of the most difficult problems in video analysis is tracking moving objects across a video sequence, especially in the presence of occlusions. Unfortunately, almost all the existing approaches work on a pixel-by-pixel basis, which makes them unusable in real-time situations and not very expressive at the semantic level of the video. In this paper we present a novel method for tracking objects through occlusions that exploits the wealth of information provided by the spatial coherence between pixels, using a graph-based, multi-resolution representation of the moving regions. The experimental results show that the approach is promising.

© 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Object tracking; Occlusion problem; Graph pyramid; Multi-resolution segmentation
1. Introduction

Real-time object tracking is becoming more and more important in the field of video analysis and processing. Applications like traffic control, user–computer interaction, on-line video processing and production, and video surveillance need reliable and economically affordable video tracking tools. In recent years this topic has received increasing attention from researchers; however, many of the key problems are still unsolved. Occlusions represent one of the most difficult problems in motion analysis. Most of the early works in this field either do not deal with occluding cases at all [1] or impose rigid constraints on camera positioning [2] in order to make the problem tractable. While it is possible to find applications where this is acceptable, in many real-world cases the ability to deal with occlusions is essential to the practical usefulness of a tracking system.
In fact, very often the camera positioning is dictated by architectural constraints (e.g., the inside of a building); furthermore, it may not be desirable to have a point of view that is too high with respect to the observed scene, since this may limit the ability to classify the tracked objects. Hence a tracking system should be prepared to face the possibility that the object being followed becomes (at least partially) covered, either by a static element of the scene (e.g. a tree or a wall) or by another moving object. This is particularly true when the objects being tracked are persons, since even in an uncrowded scene two persons are likely to walk close enough to each other (say, because they are talking) to form an occlusion. Most papers face this problem with pixel-by-pixel approaches. In this paper we present a novel algorithm for object tracking that is able to deal with partial occlusions of the objects. In particular, our algorithm is based on a graph pyramidal decomposition of the objects being tracked, using a multi-resolution approach, thus exploiting the spatial coherence between pixels. Furthermore, the graph-based representation allows a rich description of the objects and their mutual relations.
In this paper we focus our attention only on the tracking phase of the system. This phase is built upon a foreground detection technique described in another work [3], which identifies, for each frame of the video, the pixels (grouped into connected components) that belong to the moving regions. The rest of the paper is organized as follows. A review of related research is presented in Section 2. Section 3 describes the organization of the data used by the algorithm, while Section 4 describes its overall architecture. In Section 5 the algorithm is illustrated with some examples. Section 6 presents the experimental results on a standard database of video sequences, together with some discussion. Finally, Section 7 summarizes the paper and presents future directions.

2. Related works

The problem of tracking moving objects belongs to the more general framework of motion analysis. One of the most used descriptions of a video sequence is the spatio-temporal segmentation, i.e. the technique that attempts to extract the background and the independent objects together with their motion. In a recent work by Megret et al. [4] the spatio-temporal segmentation techniques are classified into three categories (see Fig. 1):

1. Segmentation-based approaches.
2. Trajectory-based approaches.
3. Joint spatial and temporal segmentation.

In this paper we address the first category; in fact, tracking of moving objects has usually been solved by segmentation with spatial priority. In the following, four groups of algorithms that deal with occlusions are described.

2.1. Features-based

In Ref. [5] a feature-based method is used (it follows the top-left process of Fig. 1). In order to handle occlusions, instead of tracking entire objects (specifically vehicles), object sub-features are detected and tracked. Such features are then grouped into objects, also using motion information derived from the tracking. This approach, however, cannot be easily extended to non-rigid objects (such as persons).

2.2. Measures estimation

Another group of algorithms [6,7] deals with occlusions by estimating the positions and velocities of the objects participating in the occlusion. The spatial segmentation is performed by background subtraction (see Wren et al. [8]). Each recognized connected region is considered as a moving object or a group of them. The regions are represented by a bounding box, evaluated as the smallest rectangle, with sides parallel to the edges of the frame, in which the object is inscribed.
The tracking is performed between two consecutive frames, say Ft−1 and Ft. In the tracking step, each bounding box of frame Ft is labelled by matching it with the most similar bounding box of frame Ft−1 (or it takes a new label if there is no match). When a box in frame Ft has two or more similarity values close to each other, it represents an occlusion between the boxes of frame Ft−1 similar to it. After the occlusion, the correct object identities are re-established using estimated object locations. In Ref. [7] the estimation is performed by Kalman filters. In these systems, if during the occlusion one of the objects suddenly changes its direction, the estimate will be inaccurate, possibly leading to a tracking error. In Ref. [6], when regions merge, the positions of the single regions are those of the merged region, i.e. the objects belonging to the region take the coordinates, centre, etc. of the latter. In all the works of this group, statistical features measured before the occlusion begins are used to resolve the labels after the occlusion; however, the system is not able to decide which pixels belong to which object during the occlusion event.

2.3. Layered image representation

A different framework used in object tracking is the layered image representation [9–11], in which image sequences are decomposed into a set of layers ordered by depth, along with associated maps defining motions, intensities and opacities. The layer ordering, together with the shape, motion and appearance of all layers, provides complete information for occlusion reasoning. With reference to the above classification [4], the segmentation step is performed by motion-similarity segmentation with model fitting methods. These methods seek the model parameters that optimize the overall quality of the fitting. Each pixel is associated with one motion model, the parameters of the models being unknown; so these algorithms need to estimate both the parameters of the models and the labels of the pixels (the segmentation). The joint estimation of motion parameters and labels is very complex and prone to being trapped in local minima, because of its high dimensionality (in Wang et al. [9] there are only motion parameters, whilst in Tao et al. [10] and Smith et al. [11] there are also shape parameters for the models). In order to reduce the cost, approximate solutions to the layer assignment problem are described in Refs. [10,11]. Furthermore, these approaches are too sensitive to the models chosen a priori.

2.4. Appearance models

The fourth group of algorithms [12–16,8] starts from a spatial segmentation similar to that of the algorithms of the second group; then a tracking is performed. Each recognized object is represented by its appearance model, so that its identity can be preserved through occlusion events. The works of Haritaoglu et al. [13] and Wren et al. [8] present systems to recognize people.
Fig. 1. Spatio-temporal techniques.
Both use silhouettes to model a person and his parts; however, Ref. [8] assumes that there is only a single person in the scene, while Ref. [13] allows multiple-person groups and isolated people. During an occlusion, a group (a detected moving region) is segmented into its constituent individuals by finding the best matching between the people models and the parts of the region. The other works [12,14–16] use colour and shape information to construct the appearance model. This enables the correct segmentation of the objects when they overlap during an occlusion: the algorithms search for the labelling of the sub-regions that maximizes the likelihood of the appearance of the detected region given the models of the individual objects. In Ref. [16] the appearance model of an object is a template mask derived from shape and colour information. In Refs. [14,15] an object is modelled as a set of vertically aligned blobs (since the objects to detect are people), where a blob is a region with a coherent colour distribution; these works differ from each other in the definition of the coherence criterion. In Ref. [12] the tracking is performed at two levels: a blob-level tracking (in a graph-based framework), i.e. the tracking of the connected moving regions, and an object-level tracking, in which an object consists of one or more blobs and a blob can be shared among several objects. This group of algorithms does not suffer from the problems of the previously described algorithms, since the appearance models arise from the input data and not from a priori models. However, in all the works presented in the literature, the correlation between the parts of a multiple-object region and the appearance models of the objects is always performed at the pixel level. Besides its high computational complexity, this approach does not exploit the wealth of information provided by the spatial relationships between pixels. Moreover, these algorithms work well only with relatively large targets, because of their complex colour and shape models. In the literature there is an interesting work by KaewTrakulPong [17] that deals with occlusions with low-resolution targets (commonly found in outdoor visual surveillance applications). The authors use simple shape and colour information to represent objects. In a normal situation, when each detected region represents one moving object, the region labelling is formulated as an assignment problem; this yields a considerable speed-up in processing the video. However, in case of occlusion the algorithm falls back to pixel-by-pixel processing: during an occlusion event some regions (multiple-object regions) are not labelled by the assignment problem, because of their difference from the models, and a pixel correlation with the models is performed to label the parts of these regions.

This latter work is the most closely related to our approach: in our work, too, the tracking is performed at a low resolution (and thus with high processing speed) until an occlusion occurs. However, our approach presents some remarkable improvements. We model a region with a hierarchical decomposition represented as a graph pyramid; then, during an occlusion, for each multiple-object region, we try to segment the region by matching its hierarchy with the hierarchies of the objects present before the occlusion. The first advantage is that the hierarchical representation allows labelling sub-regions at an intermediate level, at which we operate on regions and not on pixels. The algorithm never examines the hierarchy at the pixel level, because at an intermediate level the parts of a region are closer to the appearance model than isolated pixels. The construction of the pyramid is performed at the pixel level, but this step takes place only at the initialization step and in a few other cases. The second advantage resides in the representation of each level of the hierarchy: at each level, a detected region is represented by a graph in which the nodes are the sub-regions (with different resolutions according to the level; at the lowest level we have a pixel adjacency graph). This representation highlights the structural relationships between sub-regions. So we can
benefit not only from the spatial coherence within a region, but also from the spatial relationships between regions.
3. The proposed representation

Let us clarify some terms, to make the description of the algorithm easier to follow: by the term regions (or, equivalently, detected regions) we mean the connected regions produced by the foreground detection step; by the term objects we mean the real objects of interest (people, cars, etc.) present in a frame. A single detected region can thus be composed of several objects and, vice versa, several detected regions can belong to a single object.

The simplest representation of a detected moving region is provided by its bounding box, which contains enough information for tracking when there are no occlusions. Our method is based on a richer representation, a graph pyramid, which in the absence of occlusions retains the simplicity and effectiveness of the bounding box, but enables a more accurate object matching during an occlusion. Namely, each moving region is represented at different levels of resolution, using a graph for each level. At the topmost level, the apex of the pyramid, the graph is composed of a single node, containing as attributes the position and dimensions of the bounding box, and the average colour. At the lowest level there is an adjacency graph, where the nodes represent single pixels and the edges encode the 4-connected adjacency relation (see Fig. 2a). Notice that the nodes describe only the foreground pixels in the bounding box of the region. The intermediate levels are obtained by a bottom-up process, using the classical decimation-grouping procedure described in Ref. [18], where a colour similarity is used to decide which nodes must be merged. We decided to use the absolute difference between the mean R, G and B levels of
the regions: for two regions R1 and R2 the colour criterion is

$$C = |\mu_{1R} - \mu_{2R}| + |\mu_{1G} - \mu_{2G}| + |\mu_{1B} - \mu_{2B}|,$$

where μ1R, μ2R are the means of R for regions R1 and R2, μ1G, μ2G the means of G, and μ1B, μ2B the means of B. Note that this definition produced the best results in our experimentation, but the method is metric independent. The representation differs from the quadtree structure: the latter is a regular structure that cannot exactly represent an object, but only an approximation of it. With the irregular pyramid, on the contrary, we are able to represent the exact boundaries of the objects. The number of levels in a pyramid is not fixed, but depends on the colour uniformity of the region: for a 30 × 30 pixel detected region, the number of levels is about 7. For larger regions the number of levels remains almost unchanged, because it depends on the complexity of the image rather than on its size. Furthermore, the algorithm never examines all the levels (see later), but stops at the first level. Notice that, since we analyse the pyramid top-down, by “first level” we mean the top (i.e. the apex) of the pyramid. Each intermediate node represents a sub-region of the whole region, and its attributes are the position and size of the sub-region bounding box and the average colour of the sub-region. In particular, the attribute of a node is a set of 7 numbers (5 for grey-level images), {w, h, avR, avG, avB, cx, cy} ({w, h, avGL, cx, cy}), where w and h are the width and height of the sub-region bounding box, avR, avG, avB (avGL) are the average colour (the average grey level) of the sub-region, and cx, cy are the coordinates of the centre of the sub-region bounding box in the image (Fig. 2b). During the tracking process, each node in the pyramid receives a label representing the identity of the object to which the corresponding (sub)region belongs.
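To make the data structure concrete, here is a minimal Python sketch (hypothetical names; the paper publishes only pseudo-code) of a pyramid node carrying the attributes listed above, together with the colour difference C used by the decimation step; the merge threshold is an assumption, since the paper does not report its value:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PyramidNode:
    # Bounding box size and centre of the (sub)region in the image.
    w: int
    h: int
    cx: float
    cy: float
    # Average colour of the (sub)region (a single avGL for grey-level video).
    avR: float
    avG: float
    avB: float
    label: Optional[object] = None                   # object identity, or MULTILABEL
    children: Optional[List["PyramidNode"]] = None   # level below (None at pixel level)

def colour_difference(r1: PyramidNode, r2: PyramidNode) -> float:
    """Absolute difference of the mean R, G, B levels of two regions (the
    criterion C above), used during decimation to decide node merging."""
    return abs(r1.avR - r2.avR) + abs(r1.avG - r2.avG) + abs(r1.avB - r2.avB)

# During decimation, two adjacent nodes would be merged when
# colour_difference(r1, r2) falls below a threshold (value not reported).
```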
Fig. 2. An example of a graph pyramid: (a) the hierarchical structure; (b) the nodes attributes.
Fig. 3. An example of region labelling: (a) the region represents only one object: every node has the same label; (b) the region represents two objects: we can recognize the sub regions belonging to the objects using the second level.
When a (sub)region contains parts of more than one object, the corresponding node is labelled as MULTILABEL. In a MULTILABEL region the object parts can be reconstructed by looking through the levels of the graph pyramid (see Fig. 3).
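For illustration, a MULTILABEL region could be decomposed by descending the pyramid until singly-labelled nodes are reached, along these lines (a hypothetical sketch reusing the PyramidNode above; MULTILABEL is modelled as a sentinel value):

```python
MULTILABEL = object()  # sentinel label for nodes covering several objects

def decompose(node: "PyramidNode", parts=None) -> dict:
    """Recover the object parts of a region (cf. Fig. 3b): walk down from a
    MULTILABEL node and group the singly-labelled sub-regions by label."""
    if parts is None:
        parts = {}
    if node.label is MULTILABEL:
        for child in node.children or []:
            decompose(child, parts)
    else:
        parts.setdefault(node.label, []).append(node)
    return parts
```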
4. The proposed algorithm

The goal of the algorithm can be stated as follows: given the labelled representation of a frame It and the next frame It+1, the algorithm should find a representation of frame It+1 and a labelling of it that is consistent with the object identities. The pseudo-code of the algorithm is shown in Fig. 4. We perform an initialization on the first frame of the video in which there is at least one moving object. The initialization consists of computing a graph pyramid for each moving region; labels are assigned assuming that each region contains a single object. The algorithm compares (the way the comparison is performed is described below) the topmost levels of each pyramid in It with those in It+1. It is important to notice that in frame It+1 we have only the bounding boxes of the detected regions (i.e. we have a partial representation of the frame, namely the topmost level of the graph pyramids of the detected regions). If the comparison outcome is sufficient to assign a label to each node, the algorithm terminates; in this case each region of frame It+1 has received a label. Instead, if some ambiguities arise (as is the case when two objects overlap), the comparison is repeated using the next levels of the pyramids, until either a consistent labelling is found or the bottom of the pyramids is reached.
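This control flow can be rendered as follows (a hypothetical Python sketch of the loop in Fig. 4, not the authors' code; the regions are passed as their apex nodes, and match_and_label stands for the per-level procedure described next, returning the nodes still to be resolved):

```python
def track_frames(prev_apexes, curr_apexes, match_and_label):
    """Compare the pyramids of frames I_t and I_{t+1} level by level,
    starting from the apexes and descending only while some nodes of
    I_{t+1} remain ambiguously labelled."""
    pending_prev = list(prev_apexes)   # topmost-level nodes of I_t
    pending_curr = list(curr_apexes)   # topmost-level nodes of I_{t+1}
    while pending_curr:
        # per-level comparison and labelling ("Match&Label")
        pending_prev, pending_curr = match_and_label(pending_prev, pending_curr)
        if not pending_curr or all(n.children is None for n in pending_curr):
            break  # all labelled, or bottom reached: accept the best labelling
        # expand the still-ambiguous nodes to the next (finer) level
        pending_prev = [c for n in pending_prev for c in (n.children or [])]
        pending_curr = [c for n in pending_curr for c in (n.children or [])]
```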
Let us turn our attention to the “Match&Label” procedure. At each level of the hierarchical representation there is a graph for each detected region, and the algorithm needs to compare each graph G1 of It with each graph G2 of It+1. The “LevelMatching” procedure is devoted to this task. In order to perform the comparison we introduce a similarity measure between graph nodes. As mentioned in the previous section, each node is described by the dimensions and position of its bounding box and by its colour (or grey level, for monochrome videos). We have defined a colour similarity Sc ∈ [0, 1], a size similarity Sd ∈ [0, 1] and a position similarity Sp ∈ [0, 1]. For two generic nodes n and m:

$$S_c(n, m) = 1 - \frac{\sum_{C=R,G,B} |C_n - C_m|}{3\,\mathrm{MAX}_{color}} = 1 - \frac{\sum_{C=R,G,B} |C_n - C_m|}{3 \times 255} = 1 - \frac{\sum_{C=R,G,B} |C_n - C_m|}{765}, \qquad (1)$$

where Cn (Cm) is the average of colour component C for node n (m);

$$S_d(n, m) = 1 - \frac{|n_h - m_h| + |n_w - m_w|}{F_h + F_w}, \qquad (2)$$

where nh (mh) is the bounding box height of n (m), nw (mw) the bounding box width of n (m), Fh the height of the frame and Fw the width of the frame;

$$S_p(n, m) = 1 - \frac{|n_{cx} - m_{cx}| + |n_{cy} - m_{cy}|}{F_h + F_w}, \qquad (3)$$

where ncx (mcx) is the horizontal position of the bounding box centre of n (m), and ncy (mcy) its vertical position. These three functions are based on the Manhattan distance in the corresponding spaces, and are normalized so that their value is 1 if and only if the corresponding features are equal. The Manhattan distance is a very useful metric, as it is evaluated using only additions and subtractions and thus can be implemented very efficiently. We then define the similarity between two nodes n and m as the product S(n, m) = Sc · Sd · Sp. Note that one could also study the experimental distribution of the measurements and combine them in a more complex statistical scheme. In order to avoid considering matchings between nodes that are too dissimilar, we have introduced a threshold T2, and we truncate S to 0 below it (meaning that the nodes should never be matched). Then an association graph is built: a bipartite graph where each node n of G1 is linked to each node m of G2 such that S(n, m) > 0. Finally, a weighted bipartite graph matching (WBGM, see Ref. [19]) is performed, finding the matching between the nodes of G1 and those of G2 that maximizes the sum of the corresponding similarities.
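Under the node representation sketched in Section 3, the similarity of Eqs. (1)–(3) and the WBGM step could look as follows; scipy's linear_sum_assignment implements the Hungarian method of Ref. [19], and the T2 value here is a placeholder, since the paper does not report the thresholds it used:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method [19]

MAX_COLOR = 255.0

def similarity(n, m, frame_h, frame_w, t2=0.5):
    """Combined similarity S = Sc * Sd * Sp of Eqs. (1)-(3), truncated
    to 0 below the threshold T2 (t2=0.5 is a placeholder value)."""
    sc = 1.0 - (abs(n.avR - m.avR) + abs(n.avG - m.avG)
                + abs(n.avB - m.avB)) / (3 * MAX_COLOR)
    sd = 1.0 - (abs(n.h - m.h) + abs(n.w - m.w)) / (frame_h + frame_w)
    sp = 1.0 - (abs(n.cx - m.cx) + abs(n.cy - m.cy)) / (frame_h + frame_w)
    s = sc * sd * sp
    return s if s > t2 else 0.0

def level_matching(nodes_prev, nodes_curr, frame_h, frame_w):
    """Weighted bipartite graph matching between the nodes of two levels:
    maximise the summed similarities by minimising their negation."""
    if not nodes_prev or not nodes_curr:
        return []
    S = np.array([[similarity(n, m, frame_h, frame_w) for m in nodes_curr]
                  for n in nodes_prev])
    rows, cols = linear_sum_assignment(-S)
    # keep only the pairs whose similarity survived the T2 truncation
    return [(nodes_prev[i], nodes_curr[j], S[i, j])
            for i, j in zip(rows, cols) if S[i, j] > 0.0]
```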
Fig. 4. Pseudo-code of the tracking algorithm.
Typically the number of detected regions in a frame does not exceed 10, so at the first level the algorithm compares two graphs of at most about 10 nodes; at more detailed levels the number of nodes can reach about 20 per graph. The execution of “LevelMatching” leads to three situations for the nodes m of G2:

1. The node m is in the solution of the WBGM and the similarity with its corresponding node in G1 is above a threshold T1, which is, clearly, greater than the previously defined T2 (line 8 of “Match&Label” in the pseudo-code).
2. The node m is in the solution of the WBGM and the similarity with its corresponding node in G1 is below T1 and above T2 (line 16 of “Match&Label” in the pseudo-code).
3. The node m is not in the solution of the WBGM (line 20 of “Match&Label”).

Case 1: We have to define the pyramid of m and label it according to the corresponding node n. In a normal situation the region m inherits the information of the corresponding region n (we copy the children of n to the node m). We thus make the assumption that if the regions in the two frames are similar, then their hierarchical representations are also similar. This assumption is usually true, so in most cases the algorithm does not perform the pyramid construction process. It is important to notice, however, that there are two cases in which we construct the pyramid of m without considering the pyramid of n (the function “NeedConstruction”
is devoted to this check): (a) we rebuild the pyramids every N frames (N = 10 gave the best performance in our experiments); (b) we rebuild the pyramid of m if this region undergoes a heavy variation (in the similarity measure defined above) between two consecutive frames. Then the label of n is assigned to m as follows: if the label is not MULTILABEL, it is also propagated to the descendants of m at the successive levels; otherwise, the descendants of m receive the labels of the most similar descendants of n, in a recursive process.

Case 2: The node m is similar to n, but not enough (a probable occlusion event). First we construct the graph pyramid of m. Then the algorithm attempts a refined matching at the level below, marking the children of m and the children of all the nodes n′ of It for which S(n′, m) > T2 as candidates for further processing at the next level of resolution (function “Marking” in the pseudo-code).

Case 3: If the unmatched node m is at the first level of resolution, it is considered a new object and a new label is generated for it. Otherwise, if the node is at a more detailed level, it is marked as a new sub-region and will be labelled later.

Coming back to the “Tracking” procedure, the algorithm terminates when all the nodes of the current level have received a label, or when the bottom level is reached. In the latter case, the algorithm uses the labelling derived from the WBGM found, even when S < T1. At this point, the nodes that have been expanded for processing at a higher resolution are labelled in a bottom-up process, propagating to the parent the label of the children if they all have the same label, or labelling the parent as MULTILABEL. New sub-regions are also labelled during this process, using the label of the sibling with the greatest similarity: these new sub-regions represent a slight alteration of the detected region and take the label of the most similar sub-region.
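The three cases can be sketched as a dispatch function (hypothetical; the procedures of the pseudo-code are passed in as callables, so the sketch only fixes the branching logic described above):

```python
def dispatch_match(n, m, s, T1, T2,
                   need_construction, build_pyramid,
                   propagate_label, mark_for_next_level):
    """Branching of "Match&Label" for a node m of I_{t+1}, possibly matched
    to a node n of I_t with similarity s (s > T2 whenever a match exists)."""
    if n is None or m is None:
        return "new_or_deferred"        # case 3: new object or new sub-region
    if s >= T1:                         # case 1: confident match
        if need_construction(n, m):     # every N frames, or heavy variation
            build_pyramid(m)            # rebuild from the pixel level
        else:
            m.children = n.children     # reuse the hierarchy of n
        propagate_label(n, m)
        return "labelled"
    # case 2 (T2 < s < T1): probable occlusion, refine one level down
    build_pyramid(m)
    mark_for_next_level(n, m)
    return "expand"
```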
5. An example

In order to clarify the algorithm, we now illustrate its phases on a simple example. The four adjacent frames used for this example are shown in Fig. 5. As can be seen, there are two objects in the first frame (Ft−2) that undergo a small motion in frame Ft−1; they then overlap, forming a single region in the third frame (Ft); finally (frame Ft+1) the two objects separate again. Notice that the objects do not have linear trajectories, so any algorithm based on prediction fails in this case. First the algorithm performs the labelling of the regions in Ft−1, starting from frame Ft−2 (see Fig. 6a). At the beginning only the representations of A1 and A2, and the topmost level of the B1 and B2 pyramids, are available.
Fig. 5. The example: four adjacent frames with the moving object detected.
Fig. 6. The example: comparison between Ft−2 and Ft−1 .
Fig. 7. The example: comparison between Ft−1 and Ft at the first resolution level.
However, this is enough for the comparison at the first resolution level (Fig. 6b). Both S(A1, B1) and S(A2, B2) are above T1, so the node B1 can receive the ‘0’ label and B2 the ‘1’ label. Furthermore, B1 takes the A1 pyramid structure, and B2 the A2 one. The algorithm stops for these two consecutive frames: we have the representation and the labelling of Ft−1. The process then proceeds to the next frame, so that the tracking algorithm is performed on frames Ft−1 and Ft (Fig. 7a). Fig. 7b shows the comparison at the first resolution level. Both S(B1, C1) and S(B2, C1) are above T2, but neither is above T1, so the node representing C1 cannot yet receive a label and must be expanded together with B1 and B2. Only in this case does the process have to construct the pyramid of C1. Fig. 8 shows what happens at the second resolution level. Now the pairs (B11, C11), (B22, C14) and (B21, C13) have a similarity greater than T1, and so can be labelled. However, the similarity between B12 and C12 is still under the threshold T1, so these two nodes need to be further expanded. The third level matching is depicted in Fig. 9a. Now B121 is matched with C121 with S > T1, while B122 is left unmatched. So C121 receives a label that is propagated to C12. Now C1 can be labelled as MULTILABEL and, by looking at the labels of the levels below, can be divided into the objects belonging to it (Fig. 9b).
Fig. 8. The example: second resolution level.
Fig. 9. The example: last level matching and result for frame Ft.

Finally, the process elaborates frame Ft+1 (Fig. 10a). The steps are analogous to the processing of frame Ft. In fact, at the first level D1 and D2 cannot receive a label (Fig. 10b), and the process continues at the second level of resolution (Fig. 10c). Now the pairs (C11, D11), (C13, D21) and (C14, D22) have a similarity greater than T1, and so can be labelled. However, the nodes C12 and D12 need to be further expanded (Fig. 10d). At the last level the pair (C121, D121) has a similarity S > T1, so D121 takes the ‘0’ label. D122 is marked as a new sub-region and, in the last step of the algorithm, receives the label of its sibling with the greatest similarity (see Section 4); therefore, D122 receives the ‘0’ label. Note that no memory of the objects before the occlusion is used: the region D122 is simply considered a slight alteration, in frame Ft+1, of the object of frame Ft. This approach allows recognizing the real boundaries of the objects even when the occlusion lasts for a long time, or the boundaries of the objects undergo a large modification during the occlusion. Fig. 10e shows the (correct) result of the label assignment.

Fig. 10. The example: (a) matching between Ft and Ft+1; (b) comparison between Ft and Ft+1 at the first resolution level; (c) second resolution level; (d) last level; (e) final result.
6. Experimental results
The goal of our effort was to develop a tracking system able to handle occlusions. Given this focus, we report results on the PETS 2001 database [20]. The PETS dataset is a standard database for evaluating the performance of tracking systems. We performed our experimentation on test dataset 1, camera 1. In the video sequence there are moving people and cars that occlude each other. The same dataset is used by several authors (e.g. Fuentes et al. [6], Senior et al. [16]) to evaluate their tracking algorithms. The video consists of 2688 frames, of which about 400 present occlusions. In order to provide a quantitative measure of the tracker performance we use the performance index P presented in Ref. [3]:

$$P = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (4)$$

where
• a true positive (TP) is an association found by the tracker that is also present in the ground truth;
• a true negative (TN) is an object that is labelled as new by the tracker and is considered new also in the ground truth;
• a false positive (FP) is an association found by the tracker that is missing from the ground truth, either because the object in the ground truth is considered new, or because it is associated with a different previous-frame object;
• a false negative (FN) is an object that is labelled as new by the tracker but is not new in the ground truth.

Notice that these measures concern only the evaluation of the matching process between two frames; more significant indexes for evaluating the algorithm in the presence of occlusions are detailed later. In Table 1 a comparison with our previous algorithm [3] (which did not deal with occlusions) is shown. We also show the execution times (ET): the algorithm is still usable in real-time systems, even in the presence of occlusions.
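For reference, Eq. (4) amounts to the following one-liner; the example in the comment uses the whole-dataset counts of Table 1:

```python
def performance_index(tp: int, tn: int, fp: int, fn: int) -> float:
    """Performance index P of Eq. (4): fraction of correct tracker
    decisions (associations and new-object detections) over all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Whole dataset, our algorithm (Table 1, row 1):
# performance_index(5776, 14, 0, 124) -> 0.9790...  (reported as 0.979)
```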
Table 1. Experimental results: comparison with the tracking algorithm in Ref. [3]

Experiment                                                     TP     TN   FP   FN    P      ET
#1 Whole dataset with our algorithm                            5776   14   0    124   0.979  5 fps
#2 ‘Occlusions’ sequence with our algorithm                    58     2    0    10    0.857  3 fps
#3 ‘Occlusions’ sequence with a standard tracking algorithm    37     2    0    31    0.557  7 fps
#4 Whole dataset with a standard tracking algorithm            5744   14   7    156   0.972  7 fps
Fig. 11. Three images from the dataset of experiment #2: (a) before the occlusion; (b) during the occlusion; (c) after the occlusion.
The first row gives the performance values for the whole dataset; over the entire video sequence the performance is very high. The second row shows the performance on about one hundred frames containing only occlusions (three frames of this sequence, with the detected objects, are shown in Fig. 11). The third and fourth rows show the results of the algorithm of Ref. [3] on the same datasets. The results are quite promising, considering that in real situations the percentage of occlusions is relatively low; moreover, our performance is much higher than that of a simple algorithm that does not handle occlusions (third row). Finally, on the whole dataset the two algorithms are comparable. This confirms the effectiveness of our occlusion-handling algorithm, even though occlusions constitute only a small fraction of the entire video sequence.

In the following, another interesting comparison is shown. We compared our algorithm with another algorithm that deals with occlusions (Senior et al. [16]). The basic idea of Ref. [16] is to represent the moving regions with an appearance model (an RGB colour model with a probability mask) describing how the object appears in the image. These appearance models are used to solve a number of problems, including improved localization during tracking, track correspondence and occlusion resolution. Furthermore, Ref. [16] introduces some indexes suitable for the evaluation of this kind of algorithm; they are explained below.

Object centroid position error: Objects in the ground truth are represented as bounding boxes. The object centroid position error is approximated by the distance between the centroids of the bounding boxes of the ground truth and of the results. This error measure is useful in determining how close the automatic tracking is to the actual position of the object.

Object area error: Here again, the object area is approximated by the area of the bounding box. This of course leads to a bias. However, given the impracticality of manually identifying the boundary of the object in thousands of frames, the bounding box area error is a reasonable measure of the quality of the segmentation.

Object detection lag: This is the difference between the time at which the ground truth identifies a new object and the time at which the tracking algorithm does.

Track incompleteness factor: This measures how well the automatic track covers the ground truth:

$$\frac{F_{nf} + F_{pf}}{T_i}, \qquad (5)$$

where Fnf is the false negative frame count, i.e. the number of frames that are missing from the result track, Fpf is the false positive frame count, i.e. the number of frames that are reported in the result but are not present in the ground truth, and Ti is the number of frames present in both the results and the ground truth. The smaller this factor, the better the result.

Track error rates: These include the false positive rate fp and the false negative rate fn, defined as ratios of numbers of tracks:

$$f_p = \frac{\text{Results without corresponding ground truth}}{\text{Total number of ground truth tracks}}, \qquad (6)$$
$$f_n = \frac{\text{Ground truth without corresponding result}}{\text{Total number of ground truth tracks}}. \qquad (7)$$

Table 2. Experimental results: comparison with another algorithm that deals with occlusions

                                Senior algorithm [16]   Our algorithm
Track error fp                  16/14                   4/14
Track error fn                  4/14                    1/14
Average position error          5.51                    0.24
Average area error              −346                    −8.4
Average detection lag           1.71                    0
Average track incompleteness    0.12                    0.04
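The indexes of Eqs. (5)–(7) reduce to simple ratios; a sketch, with the Table 2 values of our algorithm as the usage example:

```python
def track_incompleteness(f_nf: int, f_pf: int, t_i: int) -> float:
    """Track incompleteness factor of Eq. (5): missing plus spurious frames
    of a result track, relative to the frames shared with the ground truth
    (the smaller, the better)."""
    return (f_nf + f_pf) / t_i

def track_error_rates(n_spurious: int, n_missed: int, n_gt_tracks: int):
    """Track error rates of Eqs. (6) and (7): result tracks with no
    ground-truth counterpart (fp) and ground-truth tracks with no result
    counterpart (fn), as fractions of the ground-truth tracks."""
    return n_spurious / n_gt_tracks, n_missed / n_gt_tracks

# Our algorithm in Table 2: track_error_rates(4, 1, 14) -> (4/14, 1/14)
```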
Besides, Senior et al. use the same dataset. We computed the values of the indexes for our algorithm and compared them with those presented by the authors themselves. Table 2 shows the effectiveness of our approach: all the error indexes are much lower than those of the algorithm of Ref. [16]. The low error rate fp means that only in a few cases has the algorithm lost the identity of an object. In fact, fp estimates the number of tracks produced by the tracker without a correspondence in the ground truth: when the algorithm loses an identity, it erroneously recognizes a new object that has no correspondence in the ground truth. The error rate fn estimates the number of ground truth tracks without a correspondence in the tracker output; its low value shows the effectiveness of our approach during occlusions, since during an occlusion we could lose some objects (and thus some ground truth tracks) within a multiple-object detected region. We also have a very low average area error; again, this means that within a multiple-object detected region the algorithm is effective in recognizing the real object boundaries. Finally, our algorithm succeeds in following the objects during all the frames in which they stay in the scene (as the last two indexes indicate).

7. Conclusions

In this paper we presented a novel approach to object tracking in the presence of occlusions. The approach is based on a graph pyramid representation of the moving regions. The tracking is performed by comparing the regions of adjacent frames; the comparison is carried out at different resolution levels, in order to label multiple-object regions (the case of occlusion). The experimental results show that the algorithm is effective. Future developments of the proposed method will investigate several enhancements of the algorithm, to increase the robustness and the speed of the process. With respect to the latter topic, the discussed idea of keeping the pyramid of an object for several frames is one possible solution, but we also have to investigate other ones.
References

[1] T.J. Olson, F.Z. Brill, Moving object detection and event recognition algorithms for smart cameras, Proceedings of the DARPA Image Understanding Workshop, 1997, pp. 159–175.
[2] A. Bobick, S. Intille, J. Davis, F. Baird, C. Pinhanez, L. Campbell, Y. Ivanov, A. Schütte, A. Wilson, The KidsRoom: a perceptually-based interactive and immersive story environment, PRESENCE: Teleoperators Virtual Environ. 8 (4) (1999) 367–391.
[3] D. Conte, P. Foggia, C. Guidobaldi, A. Limongiello, M. Vento, An object tracking algorithm combining different cost functions, Lecture Notes in Computer Science, vol. 3212, Springer, Berlin, 2004, pp. 614–622.
[4] R. Megret, J.M. Jolion, Tracking scale-space blobs for video description, IEEE Multimedia 9 (2) (2002) 34–43.
[5] D. Beymer, J. Malik, Tracking vehicles in congested traffic, SPIE Transportation Sensors and Controls: Collision Avoidance, Traffic Management, and ITS, vol. 2902, 1996, pp. 8–18.
[6] L.M. Fuentes, S.A. Velastin, People tracking in surveillance applications, Second IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
[7] R. Rosales, S. Sclaroff, Improved tracking of multiple humans with trajectory prediction and occlusion modeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on the Interpretation of Visual Motion, 1998.
[8] C.R. Wren, A. Azarbayejani, T.J. Darrell, A.P. Pentland, Pfinder: real-time tracking of the human body, IEEE Trans. PAMI 19 (7) (1997) 780–785.
[9] J.Y.A. Wang, E.H. Adelson, Representing moving images with layers, IEEE Trans. Image Process., Special Issue: Image Sequence Compression 3 (5) (1994) 625–638.
[10] H. Tao, H.S. Sawhney, R. Kumar, Object tracking with Bayesian estimation of dynamic layer representations, IEEE Trans. PAMI 24 (1) (2002) 75–89.
[11] P. Smith, T. Drummond, R. Cipolla, Layered motion segmentation and depth ordering by tracking edges, IEEE Trans. PAMI 26 (4) (2004) 479–494.
[12] S. Gupte, O. Masoud, R.F.K. Martin, N.P. Papanikolopoulos, Detection and classification of vehicles, IEEE Trans. Intell. Transp. Syst. 3 (1) (2002) 37–47.
[13] I. Haritaoglu, D. Harwood, L.S. Davis, W4: real-time surveillance of people and their activities, IEEE Trans. PAMI 22 (8) (2000) 809–830.
[14] S.J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, H. Wechsler, Tracking groups of people, Comput. Vision Image Understanding 80 (2000) 42–56.
[15] N.M. Oliver, B. Rosario, A. Pentland, A Bayesian computer vision system for modeling human interactions, IEEE Trans. PAMI 22 (8) (2000) 831–843.
[16] A. Senior, A. Hampapur, Y.L. Tian, L. Brown, S. Pankanti, R. Bolle, Appearance models for occlusion handling, Proceedings of the Second International Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
[17] P. KaewTrakulPong, R. Bowden, A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes, Image Vision Comput. 21 (10) (2003) 913–929.
[18] J.M. Jolion, A. Montanvert, The adaptive pyramid: a framework for 2D image analysis, CVGIP: Image Understanding 55 (3) (1992) 339–348.
[19] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Res. Logistics Quarterly 2 (1955) 83–97.
[20] PETS 2001 dataset.
About the Author—DONATELLO CONTE graduated in computer engineering in 2002. He is currently a Ph.D. student jointly at the University of Salerno, Italy, and INSA of Lyon, France. He is a member of the IAPR Technical Committee on Graph-Based Representations (TC15). His research interests concern graph matching, real-time video analysis and intelligent video surveillance.

About the Author—PASQUALE FOGGIA received a Laurea degree in computer engineering in 1995, and a Ph.D. in 1999. Currently, he is an Associate Professor at the University of Naples “Federico II”. His interests cover the areas of pattern recognition, graph-based representations, graph matching, machine learning, real-time video analysis, intelligent video surveillance and traffic monitoring.

About the Author—JEAN-MICHEL JOLION received the ‘Diplôme d’ingénieur’ in 1984, and the Ph.D. in 1987, both in Computer Science from the Institut National des Sciences Appliquées (INSA) of Lyon (France). His research interests belong to the computer vision and pattern recognition fields. He has studied robust statistics applied to clustering, multiresolution and pyramid techniques, and graph-based representations, mainly applied to the image retrieval domain.

About the Author—MARIO VENTO received the Laurea degree in electronic engineering in 1984, and a Ph.D. in 1988. Currently, he is a full Professor of computer science and artificial intelligence at the University of Salerno. His research interests are in artificial intelligence, pattern recognition and machine learning. He is especially dedicated to classification techniques, contributing to neural network theory, statistical learning, graph matching and multi-expert classification.