Performance Evaluation of Object Detection and Tracking in Video

Vasant Manohar(1), Padmanabhan Soundararajan(1), Harish Raju(2), Dmitry Goldgof(1), Rangachar Kasturi(1), and John Garofolo(3)

(1) University of South Florida, Tampa, FL. {vmanohar, psoundar, goldgof, r1k}@cse.usf.edu
(2) Advanced Interfaces Inc., State College, PA. [email protected]
(3) National Institute of Standards and Technology, Gaithersburg, MD. [email protected]

Abstract. The need for empirical evaluation metrics and algorithms is well acknowledged in the field of computer vision. The process yields precise insights into current technological capabilities and also helps in measuring progress. Hence, designing good and meaningful performance measures is critical. In this paper, we propose two comprehensive measures, one each for detection and tracking, for video domains where an object bounding approach to ground truthing can be followed. A thorough analysis explaining the behavior of the measures for different types of detection and tracking errors is presented. Face detection and tracking is chosen as a prototype task where such an evaluation is relevant. Results on real data comparing existing algorithms are presented, and the measures are shown to be effective in capturing the accuracy of the detection/tracking systems.

1 Introduction

Recent years have seen rapid development in the state of the art for computer vision problems. New approaches to these problems are frequently proposed with strong claims about their performance and robustness. Evaluation of algorithms is imperative so that a particular technology is not oversold. From a research point of view, well-established problems need standard databases with established benchmark performances, evaluation protocols and scoring methods. Object detection and tracking is a key computer vision topic which focuses on detecting the position of a moving object in a video sequence. It is the first step performed by an event recognition system that extracts semantic content from video. There have been many efforts towards empirical evaluation of object detection and tracking [1, 2, 3, 4, 5, 6, 7, 8]. These works either present a single measure that concentrates on a particular aspect of the task or a suite of measures that look at different aspects.

P.J. Narayanan et al. (Eds.): ACCV 2006, LNCS 3852, pp. 151-161, 2006. (c) Springer-Verlag Berlin Heidelberg 2006

While the former approach cannot capture the performance


of the system in its entirety, the latter results in a multitude of scores which cannot be easily assimilated when assessing the performance of the system. Similarly, while evaluating tracking systems, earlier approaches either concentrate on the spatial aspect of the task, i.e., assess correctness in terms of the number of trackers and their locations in frames [4, 7], or on the temporal aspect, which emphasizes maintaining a consistent identity over long periods of time [2]. In the very recent works of [3, 8], a spatio-temporal approach towards evaluation of tracking systems is adopted. However, these approaches do not provide the flexibility to adapt the relative importance of each of these individual aspects. Finally, the majority of these undertakings make little effort to actually compare the performance of existing algorithms on real world applications using the proposed measures.

In this paper, we propose two comprehensive measures that each capture the different aspects of the detection and the tracking task in a single score. While the detection measure is purely spatial, a spatio-temporal concept is the backbone of the tracking measure. By adopting a thresholded approach to evaluation (see Secs 3.1 and 3.2), the relative significance of the individual aspects of the task can be modified. Finally, face detection and tracking is picked as an exemplar task for evaluation and select algorithm performances are compared on a reasonable corpus.

The remainder of the paper is organized in the following manner. Section 2 describes the ground truth annotation process, which is vital to evaluation. Section 3 describes the proposed comprehensive measures for detection and tracking. Section 4 explains the one-to-one mapping which is an integral part of this evaluation. Section 5.1 details the experimental results describing the behavior of the measures for different types of detection and tracking errors.
Section 5.2 discusses and compares the results of three face detection and two face tracking algorithms on a data set containing video clips from boardroom meetings. We conclude and summarize the findings in Section 6.

2 Ground Truth Annotations

Clearly, the first step towards carrying out a scientific evaluation is to have a valid ground truth. More importantly, the approach taken towards annotation determines the evaluation technique. It has been well observed in the research community that a universal approach to annotation/evaluation cannot be adopted across domains, mainly because features that are rich in one domain might not be discernible in another. In this paper, the method used for ground truthing is one in which objects are bounded by a geometric shape, such as a rectangle, polygon or ellipse. Features of the object are used as guides for marking the limits of the edges. If the features are occluded, which is often the case, the markings are approximated. Unique IDs are assigned to individual objects and are consistently maintained over subsequent frames. Face, text and person detection/tracking in broadcast news segments and meeting videos are a few examples of task-domain pairs where such an approach is often adopted.


There are many free and commercially available tools which can be used for ground truthing videos, such as Anvil, VideoAnnex and ViPER [9]. In our case, we used ViPER (Video Performance Evaluation Resource), a ground truth authoring tool developed by the University of Maryland. Fig 1 shows a sample annotation of a face in a broadcast news segment using ViPER.

Fig. 1. Sample annotation of face in broadcast news using rectangular boxes. Facial features such as eyes and lower lip are used as guides to marking the edges of the box. Internal Data Structure maintains a unique Object ID for each of the faces shown which helps in measuring the tracking performance. Courtesy: CNN News.

A fact that has been well appreciated by the community is the need for reliable ground truth for genuine evaluations. To assure quality in the ground truth, 10% of the entire corpus was doubly annotated and checked for quality using the evaluation measures.
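As a concrete illustration, the record shape below (hypothetical names and layout, not the ViPER file format) shows one way such object-bounding ground truth can be held in memory, with unique object IDs, per-frame boxes, and the occlusion flags that exclude frames from evaluation:

```python
# Hypothetical in-memory shape for object-bounding ground truth: one record
# per unique object ID, a per-frame bounding box (x1, y1, x2, y2), and a set
# of occluded frames that are flagged during annotation and skipped in scoring.
ground_truth = {
    "face_01": {"boxes": {0: (120, 80, 180, 150), 1: (122, 81, 182, 151)},
                "occluded": set()},
    "face_02": {"boxes": {0: (300, 90, 350, 160)},
                "occluded": {1}},
}

def objects_in_frame(gt, t):
    """IDs of ground truth objects present and evaluable in frame t."""
    return sorted(oid for oid, rec in gt.items()
                  if t in rec["boxes"] and t not in rec["occluded"])

print(objects_in_frame(ground_truth, 0))  # ['face_01', 'face_02']
```

In frame 1, face_02 is occluded (and has no box), so only face_01 enters the evaluation.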

3 Performance Measures

The proposed performance measures are primarily area-based and depend on the spatial overlap between the ground truth and the system output objects to generate the score. In order to obtain the best score for an algorithm's performance, we perform a one-to-one mapping between the ground truth and the system output objects such that the metric scores are maximized. All the measure scores are normalized such that the best performance gets a score of 1 and the worst performance gets a score of 0. Secs 3.1 and 3.2 discuss the frame based detection measure and the sequence based tracking measure respectively, while Sec 4 describes the one-to-one matching strategy. The following notations are used in the remainder of the paper:

- $G_i$ denotes the $i$th ground truth object and $G_i^{(t)}$ denotes the $i$th ground truth object in the $t$th frame.
- $D_i$ denotes the $i$th detected object and $D_i^{(t)}$ denotes the $i$th detected object in the $t$th frame.
- $N_G^{(t)}$ and $N_D^{(t)}$ denote the number of ground truth objects and the number of detected objects in frame $t$ respectively.


- $N_G$ and $N_D$ denote the number of unique ground truth objects and the number of unique detected objects in the given sequence respectively. Uniqueness is defined by object IDs.
- $N_{frames}$ is the number of frames in the sequence.
- $N^i_{frames}$ is the number of frames the ground truth object ($G_i$) or the detected object ($D_i$), depending on the context, existed in the sequence.
- $N_{mapped}^{(t)}$ is the number of mapped ground truth and detected objects in frame $t$, while $N_{mapped}$ is the number of mapped ground truth and detected objects in the whole sequence.

3.1 Detection – Frame Based Evaluation

A good detection measure should capture the performance in terms of both overall detection (number of objects detected, missed detects and false alarms) and goodness of detection for the detected objects, i.e., spatial accuracy (how much of the ground truth is detected) and spatial fragmentation (object splits and object merges). The Sequence Frame Detection Accuracy (SFDA) is a frame-level measure that penalizes fragmentation in the spatial dimension while accounting for the number of objects detected, missed detects, false alarms and the spatial alignment of system output and ground truth objects. For a given frame, the Frame Detection Accuracy (FDA) measure calculates the spatial overlap between the ground truth and system output objects as the ratio of the spatial intersection of the two objects to their spatial union. The sum of all the overlaps is normalized by the average of the number of ground truth and detected objects. For a single frame $t$ where there are $N_G^{(t)}$ ground truth objects and $N_D^{(t)}$ detected objects, we define $FDA(t)$ as

\[ FDA(t) = \frac{\mathit{Overlap\_Ratio}}{\left( N_G^{(t)} + N_D^{(t)} \right)/2} \tag{1} \]

where

\[ \mathit{Overlap\_Ratio} = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|} \tag{2} \]

Here, $N_{mapped}^{(t)}$ is the number of mapped objects, where the mapping is done between objects which have the best spatial overlap in the given frame $t$. To measure the detection performance for the whole sequence, the FDA is calculated over all the frames in the sequence and normalized by the number of frames in which at least one ground truth or detected object exists. This way of normalization accounts for both missed detects and false alarms. We thus obtain the Sequence Frame Detection Accuracy (SFDA), which can be expressed as

\[ SFDA = \frac{\sum_{t=1}^{N_{frames}} FDA(t)}{\sum_{t=1}^{N_{frames}} \exists \left( N_G^{(t)} \; \mathrm{OR} \; N_D^{(t)} \right)} \tag{3} \]
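Eqs 1-3 can be sketched directly in code. The following Python sketch is an illustration, not the paper's evaluation tool; it assumes axis-aligned boxes given as (x1, y1, x2, y2) and a precomputed per-frame one-to-one mapping (the optimization of Sec 4).

```python
def iou(a, b):
    """Spatial overlap ratio of two boxes (x1, y1, x2, y2):
    intersection area over union area."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def fda(gt_boxes, det_boxes, mapping):
    """Frame Detection Accuracy (Eq. 1): summed overlap ratios of the mapped
    (gt index, det index) pairs, normalized by the average object count.
    Returns None for an empty frame, which Eq. 3 excludes from the denominator."""
    if not gt_boxes and not det_boxes:
        return None
    overlap = sum(iou(gt_boxes[i], det_boxes[j]) for i, j in mapping)
    return overlap / ((len(gt_boxes) + len(det_boxes)) / 2.0)

def sfda(frames):
    """SFDA (Eq. 3) over a list of (gt_boxes, det_boxes, mapping) frames."""
    scores = [s for s in (fda(g, d, m) for g, d, m in frames) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, one perfectly detected frame plus one fully missed frame yields an SFDA of 0.5, matching the linear degradation discussed below.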


Fig 2 shows the effect of spatial inaccuracies (missed object region) and temporal inaccuracies (missed object frames) on SFDA for a video sequence (approximately 2500 frames) containing one object (typically the case with close-up face videos). Object-ID mismatches have no effect on the detection measure as long as the detected object spatially aligns with the ground truth. Here, the spatial overlap ratio is defined as the ratio of the spatial intersection of the two boxes to their spatial union. The temporal overlap ratio is defined as the ratio of the number of frames in which the object was detected to the number of frames in which the ground truth object existed.

Fig. 2. Effect of spatial and temporal inaccuracies on the detection measure (SFDA) for a sequence containing a single object.

We can observe that, given a single object, the spatial and temporal inaccuracies (missed detects at the frame level) have a linear effect on the detection measure.

Relaxing Spatial Alignment. For many systems, it would be sufficient to just detect the presence of an object in a frame, and not be concerned with the spatial accuracy of detection. To evaluate such systems, we propose a thresholded approach to the evaluation of detection. Here, the detected object is given full credit even when it overlaps only a portion of the ground truth. $OLP\_DET$ is the spatial overlap threshold.

\[ \mathit{Overlap\_Ratio\_Thresholded} = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{\mathit{Ovlp\_Thres}(G_i^{(t)}, D_i^{(t)})}{|G_i^{(t)} \cup D_i^{(t)}|} \tag{4} \]

where

\[ \mathit{Ovlp\_Thres}(G_i^{(t)}, D_i^{(t)}) = \begin{cases} |G_i^{(t)} \cup D_i^{(t)}| & \text{if } \dfrac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)}|} \ge OLP\_DET \\ |G_i^{(t)} \cap D_i^{(t)}| & \text{otherwise} \end{cases} \]
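A minimal sketch of Eq 4, under the same box-format assumptions as before (illustrative, not the authors' implementation): once the detection covers at least the threshold fraction of the ground truth area, the case split returns the union, so that pair's term in the sum becomes 1.

```python
def box_area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def inter_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def ovlp_thres(g, d, olp_det):
    """Case split of Eq. 4: full credit (the union) once the detection covers
    at least OLP_DET of the ground truth area, the raw intersection otherwise."""
    inter = inter_area(g, d)
    union = box_area(g) + box_area(d) - inter
    if box_area(g) > 0 and inter / box_area(g) >= olp_det:
        return union
    return inter

def overlap_ratio_thresholded(pairs, olp_det):
    """Thresholded overlap ratio of Eq. 4 over the mapped (G_i, D_i) pairs."""
    total = 0.0
    for g, d in pairs:
        union = box_area(g) + box_area(d) - inter_area(g, d)
        total += ovlp_thres(g, d, olp_det) / union if union > 0 else 0.0
    return total
```

A detection covering 20% of the ground truth scores 1.0 with a 10% threshold but only 0.2 with a 50% threshold.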

The threshold for a given application is derived from the spatial disagreements between annotators in the 10% doubly annotated data. The motivation behind this is to eliminate the error in the scores induced by ground truth inconsistencies. This way of arriving at the spatial threshold also reflects the difficulty humans have in perceiving the task.

3.2 Tracking – Sequence Based Evaluation

In this paper, tracking consists of simply identifying detected objects across contiguous frames. The task is similar to detection, with detected objects linked by


a common identity (object IDs) across frames. Therefore, objects which leave the scene and return later in the sequence are not identified as the same object. Occluded objects, however, are to be treated as the same object, though tracking is optional during occlusion: frames in which the object is occluded are marked with special flags during annotation and are excluded from evaluation.

Unlike detection, this is a spatio-temporal task and its performance can be assessed with a measure similar to the Sequence Frame Detection Accuracy measure described in Sec 3.1. The significant difference between the measures is that in the detection task the mapping between the system output and reference annotation objects is optimized on a frame-by-frame basis, whereas for tracking the mapping is optimized at the sequence level. One advantage of making this task highly parallel to the detection task is that the SFDA measure can also be applied to the tracking output to quantify the performance degradation due to mis-identification of objects across frames.

A good tracking measure should capture the performance in terms of both overall tracking (number of objects detected and tracked, missed detects and false alarms) and goodness of track for the detected objects, i.e., spatial and temporal accuracy (how much of the ground truth is detected and in how many frames), spatial fragmentation (object splits, object merges) and temporal fragmentation (discontinuous tracking). The Sequence Track Detection Accuracy (STDA) is a spatio-temporal measure which penalizes fragmentation in both the temporal and the spatial dimensions while accounting for the number of objects detected and tracked, missed objects and false alarms. A one-to-one mapping between the ground truth and the system output objects is obtained by computing the measure over all the ground truth and detected object combinations and using an optimization strategy to maximize the overall score for the sequence (see Sec 4).
The STDA is then calculated as

\[ STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}}{N_{(G_i \cup D_i \neq \emptyset)}} \tag{5} \]

where $N_{(G_i \cup D_i \neq \emptyset)}$ is the number of frames in which either the ground truth object $G_i$ or the detected object $D_i$ exists.

Analyzing the numerator of Eq 5, we observe that it is merely the overlap of the detected object with the ground truth, which is very similar to Eq 2. The only difference is that in tracking we measure the overlap in the spatio-temporal dimension, while in detection the overlap is in the spatial dimension alone. The value of the TDA is influenced by the ability of an algorithm to detect and consistently track an object in the sequence. The STDA is a measure of tracking over all the objects in the sequence. It can take a maximum value of $N_G$, which is the number of ground truth objects in the sequence. We define the Average Tracking Accuracy (ATA), which can be termed the STDA per object, as

\[ ATA = \frac{STDA}{\left( N_G + N_D \right)/2} \tag{6} \]
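Eqs 5 and 6 can be sketched as follows, representing each object's track as a dict from frame number to box (an assumed, illustrative representation; `iou` is the same intersection-over-union ratio as in the detection measure):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tda(gt_track, det_track):
    """Per-object term of Eq. 5: spatio-temporal overlap summed over frames,
    normalized by the number of frames where either object exists."""
    frames = set(gt_track) | set(det_track)
    if not frames:
        return 0.0
    total = sum(iou(gt_track[t], det_track[t])
                for t in frames if t in gt_track and t in det_track)
    return total / len(frames)

def ata(gt_tracks, det_tracks, mapping):
    """ATA (Eq. 6): STDA over the mapped (gt, det) track pairs, divided by
    the average number of unique objects."""
    stda = sum(tda(gt_tracks[i], det_tracks[j]) for i, j in mapping)
    return stda / ((len(gt_tracks) + len(det_tracks)) / 2.0)
```

A track lost halfway through an object's lifespan halves that object's contribution, reflecting the temporal fragmentation penalty.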


It can be readily seen that, for a given object, the ATA exhibits a direct linear dependence on spatial and temporal imperfections, as was the case with the SFDA (see Fig 2).

Relaxing Detection Penalty. At times it is desirable to measure the tracking aspect of the algorithm without being concerned with the detection accuracy. In this case, we can relax the detection penalty by using an area thresholded approach similar to Sec 3.1. The equation below introduces a threshold, $OLP\_TRK$.

\[ TDA\_T(i) = \sum_{t=1}^{N_{frames}} \frac{\mathit{Ovlp\_Thres}(G_i^{(t)}, D_i^{(t)})}{|G_i^{(t)} \cup D_i^{(t)}|} \tag{7} \]

where

\[ \mathit{Ovlp\_Thres}(G_i^{(t)}, D_i^{(t)}) = \begin{cases} |G_i^{(t)} \cup D_i^{(t)}| & \text{if } \dfrac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)}|} \ge OLP\_TRK \\ |G_i^{(t)} \cap D_i^{(t)}| & \text{otherwise} \end{cases} \]

4 Matching Strategies

From Eqs 2 and 5, it is apparent that both the detection and the tracking measures distinguish between individual objects at the frame and sequence level respectively. A valid score can be obtained only when there is a unique one-to-one mapping of ground truth and detected objects, obtained by some optimization. Potential strategies to solve this assignment problem are weighted bipartite graph matching and the Hungarian algorithm [10]. There are many variations of the basic Hungarian strategy, most of which exploit constraints from specific problem domains. The algorithm proceeds through a series of steps applied iteratively and has polynomial time complexity; some implementations run in $O(N^3)$. Faster implementations are known to exist; the current best bound is $O(N^2 \log N + NM)$ [11]. In our case, the matrix to be matched is usually sparse, a fact that could be exploited by implementing a hash function for mapping sub-inputs from the whole set of inputs.
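For illustration, the sketch below solves the assignment by brute-force maximization over one-to-one mappings. This is exponential and only feasible for the handful of objects typical of a frame; the Hungarian algorithm [10] computes the same optimum in polynomial time.

```python
from itertools import combinations, permutations

def best_mapping(score):
    """Exhaustive one-to-one mapping maximizing the total of score[i][j],
    e.g. the overlap between ground truth object i and detected object j.
    The Hungarian algorithm returns the same optimum in polynomial time."""
    n_gt = len(score)
    n_det = len(score[0]) if n_gt else 0
    k = min(n_gt, n_det)
    best, best_total = [], float("-inf")
    for rows in combinations(range(n_gt), k):
        for cols in permutations(range(n_det), k):
            total = sum(score[r][c] for r, c in zip(rows, cols))
            if total > best_total:
                best_total, best = total, sorted(zip(rows, cols))
    return best

score = [[0.9, 0.1, 0.0],
         [0.2, 0.8, 0.3]]
print(best_mapping(score))  # [(0, 0), (1, 1)]
```

With two ground truth objects and three detections, the unmatched third detection is left over and counts as a false alarm under the measures above.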

5 Results and Analysis

5.1 Experiments

There are many aspects of an algorithm that affect the final scores of the detection and the tracking measure. For an object detection and tracking task, the errors that affect the metric scores can be due to one or a combination of the following: spatial inaccuracy, temporal inaccuracy, missed detects and false alarms. To measure the influence of all of these factors at the same time


will not reflect the behavior of the measures with respect to individual errors. Hence, in the following sections, we observe the performance of the measures by systematically examining one error at a time. We have developed an evaluation tool which, in addition to calculating the detection and tracking scores, also outputs the contribution of each of the above errors to the final score. This can be used for diagnostic purposes by algorithm developers to identify the strengths and weaknesses of an approach, and also for achieving optimal parameter settings for an algorithm. Since we already looked at the effect of spatial and temporal inaccuracies in Fig 2, we investigate only the effect of missed detects and false alarms in this section.

Effect of Missed Detects. In this experiment, we consider a video sequence (approximately 4500 frames) which has 75 objects that vary in their frame persistence. In contrast to the meeting room domain, where objects persist over a longer frame span, in this case the objects stay in the scene for a short duration of time. This is typical for face, text, person and vehicle detection/tracking in broadcast news domains. Fig 3 illustrates the performance of the measures for missed objects in the video sequence. Here, all objects other than the missed object are assumed to be detected and tracked ideally. Fig 3 also shows the frame persistence of each object that is missed from the ground truth. We can observe a uniform degradation of the ATA score, while the SFDA score exhibits a non-uniform behavior. Clearly, the SFDA score is influenced by temporally predominant objects (those existing in more frames) in the sequence, while the ATA score is independent of the frame persistence of objects. Given ideal detection and tracking for the remaining objects in the sequence, we can analytically


Fig. 3. Effect of missed detects on the comprehensive measures (SFDA, ATA) for a sequence containing 75 objects. The figure shows the corresponding object’s frame persistence which was missed from the ground truth. For all the objects not missed, we assume ideal detection and tracking.


characterize the SFDA and the ATA measures for missed detects, as shown in Eqs 8 and 9.

\[ SFDA = \frac{\sum_{i=1}^{N_D} N^i_{frames}}{\left( \sum_{i=1}^{N_D} N^i_{frames} + \sum_{j=1}^{N_G} N^j_{frames} \right)/2} \tag{8} \]

\[ ATA = \frac{N_D}{\left( N_G + N_D \right)/2} \tag{9} \]
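A quick numeric check of Eqs 8 and 9 with hypothetical persistence values makes the contrast concrete:

```python
# Hypothetical sequence: three ground truth objects persisting 100, 50 and
# 10 frames; the 10-frame object is missed entirely, the rest are detected
# and tracked ideally (the setting under which Eqs 8 and 9 hold).
gt_persistence = [100, 50, 10]
det_persistence = [100, 50]
n_g, n_d = len(gt_persistence), len(det_persistence)

# Eq. 8: SFDA weights each object by its frame persistence ...
sfda = sum(det_persistence) / ((sum(det_persistence) + sum(gt_persistence)) / 2)

# Eq. 9: ... while ATA depends only on the object counts.
ata = n_d / ((n_g + n_d) / 2)

print(round(sfda, 3), round(ata, 3))  # 0.968 0.8
```

Missing the short-lived object barely dents the SFDA but costs the ATA a full object's share, matching the behavior seen in Fig 3.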

Effect of False Alarms. Having looked at the effect of missed detects on the SFDA and the ATA, it is fairly straightforward to imagine the effect of false alarms on the measure scores. Given ideal detection and tracking for all the ground truth objects in the sequence, we can analytically characterize the SFDA and the ATA measures for false alarms as shown in Eqs 10 and 11.

\[ SFDA = \frac{\sum_{j=1}^{N_G} N^j_{frames}}{\left( \sum_{i=1}^{N_D} N^i_{frames} + \sum_{j=1}^{N_G} N^j_{frames} \right)/2} \tag{10} \]

\[ ATA = \frac{N_G}{\left( N_G + N_D \right)/2} \tag{11} \]

Just as missing a predominantly occurring object decreases the SFDA score to a greater extent, introducing a false object in a large number of frames affects the SFDA score more. The ATA score, in contrast, is affected by the number of unique objects (distinct object IDs) inserted into the sequence.

5.2 Face Detection and Tracking Evaluation

In this section, we describe the test bed that we use in our evaluation of face detection and tracking algorithms. We compared three face detection algorithms and two face tracking algorithms. The algorithm outputs were obtained from the original authors, so it can safely be assumed that the reported outputs correspond to the optimal parameter settings of each algorithm, without implementation errors. For anonymity, these algorithms will be referred to as Algo 1, Algo 2 and Algo 3. The source video was MPEG-2 in NTSC format, encoded at 29.97 frames per second at 720x480 resolution. The algorithms were trained on 50 clips, each averaging about 3 minutes (approx. 5400 frames), and tested on 20 clips whose average length was the same as that of the training data. The ground truth for the 50 training clips was provided to the algorithm developers to facilitate training of algorithm parameters.

Fig 4 shows the SFDA scores of the three face detection algorithms on the 20 test clips. It also reports the SFDA scores thresholded at 10% spatial overlap, along with the missed detects and false alarms associated with each sequence. By adopting a thresholded approach, we alleviate the effect of errors caused by spatial anomalies; thus, the errors in the thresholded SFDA scores are primarily due to missed detects and false alarms. One can observe a strong correlation between the SFDA scores and the missed detects/false alarms. Results show that Algo 1


Fig. 4. Evaluation results of three face detection systems. Missed Detects (MD) and False Alarms (FA) are normalized with respect to total number of evaluation frames.

outperforms the other algorithms on all the test clips. It has good localization accuracy in addition to low missed detection and false alarm rates. Fig 5 shows the ATA scores for the two face tracking systems on the test set. Additionally, the ATA scores thresholded at 10% spatial overlap, along with the missed detects and false alarms associated with each sequence, are reported. It can be observed that, though Algo 1 has fewer identification errors and lower false alarm rates, there is certainly scope for improvement in its performance. Results show that inconsistent identification and the introduction of sporadic false alarms are detrimental to the performance of tracking systems.


Fig. 5. Evaluation results of two face tracking algorithms. Missed Detects and False Alarms are normalized with respect to total number of unique ground truth objects in the sequence.

6 Conclusions

A comprehensive approach to the evaluation of object detection and tracking algorithms is proposed for video domains where an object bounding approach to ground truth annotation is followed. An area based metric, which depends on the spatial overlap between ground truth objects and system output objects to generate the score, is proposed for this style of annotation. For the detection task, the SFDA metric captures both the detection capability (number of objects detected) and the goodness of detection (spatial accuracy). Similarly, for the tracking task, both the tracking capability (number of objects detected and tracked) and the goodness of tracking (spatial and temporal accuracy) are accounted for by the ATA metric. By decomposing the performance in terms of its components, algorithm developers can analyze the robustness and shortcomings of a given approach. Evaluation results of face detection and tracking systems on meeting room video clips show the effectiveness of the metrics in capturing performance.

References

1. Antani, S., Crandall, D., Narasimhamurthy, A., Mariano, V.Y., Kasturi, R.: Evaluation of Methods for Detection and Localization of Text in Video. In: Proceedings of the International Workshop on Document Analysis Systems (2000)
2. Black, J., Ellis, T.J., Rosin, P.: A Novel Method for Video Tracking Performance Evaluation. In: Proceedings of the IEEE PETS Workshop (2003)
3. Brown, L.M., Senior, A.W., Tian, Y., Connell, J., Hampapur, A., Shu, C., Merkl, H., Lu, M.: Performance Evaluation of Surveillance Systems Under Varying Conditions. In: Proceedings of the IEEE PETS Workshop (2005)
4. Collins, R., Zhou, X., Teh, S.: An Open Source Tracking Testbed and Evaluation Web Site. In: Proceedings of the IEEE PETS Workshop (2005)
5. Fisher, R.B.: The PETS04 Surveillance Ground-Truth Data Sets. In: Proceedings of the IEEE PETS Workshop (2004)
6. Hua, X., Wenyin, L., Zhang, H.: Automatic Performance Evaluation for Video Text Detection. In: Proceedings of the International Conference on Document Analysis and Recognition (2001)
7. Nascimento, J., Marques, J.: New Performance Evaluation Metrics for Object Detection Algorithms. In: Proceedings of the IEEE PETS Workshop (2004)
8. Smith, K., Gatica-Perez, D., Odobez, J., Ba, S.: Evaluating Multi-Object Tracking. In: Proceedings of the IEEE Empirical Evaluation Methods in Computer Vision Workshop (2005)
9. Doermann, D., Mihalcik, D.: Tools and Techniques for Video Performance Evaluation. In: ICPR, Volume 4 (2000) 167-170
10. Munkres, J.R.: Algorithms for the Assignment and Transportation Problems. J. SIAM 5 (1957) 32-38
11. Fredman, M.L., Tarjan, R.E.: Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms. Journal of the ACM 34 (1987) 596-615
