Best Paper Award
Evaluation Schemes for Video and Image Anomaly Detection Algorithms
Shibin Parameswaran, Josh Harguess, Christopher Barngrover, Scott Shafer, Michael Reese
Space and Naval Warfare Systems Center Pacific, 53560 Hull Street, San Diego, CA 92152-5001
{shibin.parameswaran, joshua.harguess, chris.barngrover, scott.a.shafer, michael.c.reese}@navy.mil
ABSTRACT
Video anomaly detection is a critical research area in computer vision. It is a natural first step before applying object recognition algorithms. Many algorithms that detect anomalies (outliers) in videos and images have been introduced in recent years. However, these algorithms behave and perform differently depending on the domains and tasks to which they are applied. To better understand the strengths and weaknesses of outlier algorithms and their applicability to a particular domain or task of interest, it is important to quantify their performance using appropriate evaluation metrics. Many evaluation metrics have been used in the literature, such as precision curves, precision-recall curves, and receiver operating characteristic (ROC) curves. To construct these metrics, it is also important to choose an appropriate evaluation scheme that decides when a proposed detection is considered a true or a false detection. Choosing the right evaluation metric and the right scheme is critical, since the choice can introduce positive or negative bias in the measuring criterion and may favor (or work against) a particular algorithm or task. In this paper, we review evaluation metrics and popular evaluation schemes that are used to measure the performance of anomaly detection algorithms on videos and imagery with one or more anomalies. We analyze the biases these choices introduce by measuring the performance of an existing anomaly detection algorithm.
1. INTRODUCTION
Anomaly detection (AD) is an active area of research in computer vision. Numerous methods to detect and localize anomalies in videos and images have been proposed.1–4 Depending on the domain and task, the behavior and performance of anomaly detection algorithms can vary widely. The differences mainly arise from the particular characteristics of the anomalies encountered in various computer vision problem domains. For instance, anomaly detection algorithms used to detect man-made objects in satellite imagery may have different constraints and behaviors from algorithms designed to detect suspicious activity in a surveillance video. Similarly, the performance of AD algorithms can be affected by the scale and/or duration of anomalies. This performance disparity across operating conditions makes evaluation methodologies very important in this research area.
Object detection is another important and active area of computer vision. It is connected to AD but, however subtle the distinction, remains a separate research area. The difference matters in this paper because most current performance metrics in computer vision for detection and tracking tasks are built around a framework of object detection, not anomaly detection.5–8 We have found these performance metrics lacking in some ways for measuring the performance of anomaly detection algorithms. For instance, when ground truth is gathered for an object detection task, a common way to annotate the imagery is with a bounding box that covers the object of interest. Inevitably, however, the bounding box will contain pixels and patches unrelated to the object, which represent background or other information in the imagery. Therefore, when anomaly detection is performed on the same set of imagery and the anomalies are the size of pixels or small patches, the “ground truth” gathered for the object detection task contains information that is non-anomalous from the anomaly detection point of view. Our approach is to use multiple performance metrics to get a better picture of true anomaly detection performance, given the pitfalls of relying on any one object detection performance metric.
Automatic Target Recognition XXVI, edited by Firooz A. Sadjadi, Abhijit Mahalanobis, Proc. of SPIE Vol. 9844, 98440D · © 2016 SPIE · CCC code: 0277-786X/16/$18 · doi: 10.1117/12.2224667
Figure 1. Example images showing anomalies in images and videos: (a) the boat is an anomaly compared to its background; (b) the cart is an anomaly because its motion pattern differs from that of the pedestrians around it.
In this paper, we review the different evaluation methods used in the field of video and image anomaly detection. In addition to reviewing the characteristics of each method, we discuss its appropriateness and applicability in different video/image anomaly detection settings. The paper is organized as follows. Section 2 gives a detailed explanation of anomaly detection. Section 3 discusses various evaluation schemes, and Section 4 discusses several evaluation metrics. Section 5 gives a detailed explanation of our experiment and results, and Section 6 concludes.
2. ANOMALY DETECTION
Anomaly detection (AD) in videos and images is an important research area in the field of computer vision. In a surveillance setting, anomaly detection algorithms can be used to automatically alert data analysts to events or objects that deviate from normalcy. The definitions of anomaly and normalcy can differ widely according to the context and the task at hand, which makes anomaly detection a challenging problem. For instance, in Figure 1(a), the boat can be considered an anomaly because it looks different from its background. Similarly, Figure 1(b) is a frame from a video data set1 in which the cart is identified as an anomaly because its motion pattern differs from that of the pedestrians and the background. An algorithm that detects the boat will therefore behave differently from one that detects anomalies based on differences in motion. For a summary of anomaly detection algorithms used in different computer vision applications, please refer to the survey papers in the respective domains.9, 10
In spite of these underlying differences, the AD algorithms used in computer vision (and other domains) share the same fundamental goal: to classify data (objects, actions or events) as normal when they conform to a predefined normalcy and as anomalous otherwise. The algorithms essentially act as binary classifiers that assign data to a normal or anomalous category by thresholding the scores returned by the underlying classifier. For example, if we assume that all normal pixels in an image can be modeled using a Gaussian distribution, $\mathcal{N}(\mu, \sigma^2)$, then any pixels that deviate from normalcy can be identified by thresholding the log-likelihood values in the image. The threshold is chosen such that any pixel whose likelihood falls below it is considered not normal and, therefore, an anomaly.
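The Gaussian thresholding example above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the toy image, and the parameter values (`mu`, `sigma2`, `threshold`) are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Elementwise log of the Gaussian density N(mu, sigma2) evaluated at x."""
    x = np.asarray(x, dtype=float)
    return -0.5 * np.log(2.0 * np.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)

def detect_anomalies(image, mu, sigma2, threshold):
    """Boolean mask that is True where the log-likelihood falls below the threshold."""
    return gaussian_log_likelihood(image, mu, sigma2) < threshold

# Hypothetical 3x3 "image": background intensities near 0.5 and one bright outlier.
image = np.array([[0.50, 0.52, 0.48],
                  [0.49, 0.95, 0.51],
                  [0.50, 0.47, 0.53]])

# Assumed normalcy model (mean 0.5, standard deviation 0.02) and threshold.
mask = detect_anomalies(image, mu=0.5, sigma2=0.02 ** 2, threshold=-10.0)
```

With these assumed values only the center pixel is flagged. Raising the threshold flags more pixels and lowering it flags fewer, which is the trade-off that the evaluation metrics discussed in the paper are meant to quantify.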
$p(x; \mu, \sigma^2)$ [...]
[...] $\geq 0$ and $A + C > 0$
Frame is counted as false positive if $A + B > 0$ and $A + C = 0$
Similarly, TPR is the ratio of the number of frames correctly classified as anomalies to the total number of anomalous frames in the video. An evaluation with frame-level ground truth ignores the spatial localization of anomalies. Since frame-level evaluations do not check whether the location of the anomaly detected by the AD algorithm coincides with the ground-truth anomaly, they allow “lucky” detections, in which the algorithm misses the true anomaly but mislabels a normal pixel as an anomaly.
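The frame-level rule and the “lucky” detection it permits can be illustrated with a short Python sketch. The meaning of the regions is an assumption here, since the toy figure is not reproduced in this section: A is taken to be the number of pixels that are both detected and truly anomalous, B the number detected but normal, and C the number truly anomalous but missed. Under that reading, A + B is the size of the detection and A + C is the size of the ground-truth anomaly. The function names and the "FN"/"TN" branches are also illustrative additions.

```python
import numpy as np

def region_counts(detected, truth):
    """Return (A, B, C) pixel counts for one frame.

    Assumed reading of the toy example:
      A = pixels detected and truly anomalous,
      B = pixels detected but normal,
      C = truly anomalous pixels that were missed.
    """
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    a = int(np.count_nonzero(detected & truth))
    b = int(np.count_nonzero(detected & ~truth))
    c = int(np.count_nonzero(~detected & truth))
    return a, b, c

def frame_level_label(detected, truth):
    """Label a frame using frame-level ground truth only (no localization check)."""
    a, b, c = region_counts(detected, truth)
    alarm = (a + b) > 0       # the detector flagged at least one pixel
    anomalous = (a + c) > 0   # the frame contains a ground-truth anomaly
    if alarm and anomalous:
        return "TP"
    if alarm and not anomalous:
        return "FP"
    # The two branches below are the usual completion of the confusion matrix,
    # added here for illustration.
    return "FN" if anomalous else "TN"

# Hypothetical 4x4 frame: the true anomaly is at the top-left pixel,
# but the detector fires at the bottom-right pixel instead.
truth = np.zeros((4, 4), dtype=bool)
truth[0, 0] = True
lucky = np.zeros((4, 4), dtype=bool)
lucky[3, 3] = True

label = frame_level_label(lucky, truth)  # counted as a true positive
```

The detection misses the anomaly entirely (A = 0), yet the frame is still counted as a true positive because both A + B and A + C are positive.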
3.4 Frame-level with Localization Evaluation (Hybrid)
To test localization and to avoid counting “lucky” detections, this evaluation scheme requires pixel-level ground-truth masks. Labels are given at the frame level, but the correctness of each decision is validated at the pixel level. To be counted as a true positive, a detection must have at least α% overlap between the set of truly anomalous pixels and the set of pixels detected as anomalous by the AD algorithm. If this criterion is not met, the detection (at the frame level) is counted as a false positive. This scheme avoids cases where the detector classifies the frame correctly even when it misses the true anomaly completely. The realization of frame-level evaluation with localization in the toy example shown in Figure 3 is the following:
Frame is counted as true positive if $\frac{A}{A + C} \geq \alpha$
Frame is counted as false positive if
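A minimal Python sketch of the hybrid rule follows, under stated assumptions. A, B and C are read as the counts of pixels that are detected and truly anomalous, detected but normal, and truly anomalous but missed, respectively. The overlap is measured as A / (A + C), and `alpha` is expressed as a fraction in [0, 1] rather than a percentage. The function name, the toy masks, and the handling of frames with no detection are illustrative and not taken from the paper.

```python
import numpy as np

def hybrid_frame_label(detected, truth, alpha):
    """Frame-level label validated at the pixel level.

    A detection is a true positive only if it covers at least a fraction
    `alpha` of the truly anomalous pixels; any other detection is a false
    positive.
    """
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    a = int(np.count_nonzero(detected & truth))    # detected and anomalous
    b = int(np.count_nonzero(detected & ~truth))   # detected but normal
    c = int(np.count_nonzero(~detected & truth))   # anomalous but missed
    if a + b == 0:
        # No detection in this frame; this branch is an illustrative completion.
        return "FN" if a + c > 0 else "TN"
    if a + c > 0 and a / (a + c) >= alpha:
        return "TP"
    return "FP"

# Hypothetical 4x4 frame with a 2x2 anomaly in the top-left corner.
truth = np.zeros((4, 4), dtype=bool)
truth[0:2, 0:2] = True

lucky = np.zeros((4, 4), dtype=bool)
lucky[3, 3] = True          # fires far away from the anomaly

partial = np.zeros((4, 4), dtype=bool)
partial[0, 0:2] = True      # covers 2 of the 4 anomalous pixels

lucky_label = hybrid_frame_label(lucky, truth, alpha=0.4)      # "FP"
partial_label = hybrid_frame_label(partial, truth, alpha=0.4)  # "TP"
```

The same “lucky” detection that a purely frame-level evaluation would accept is rejected here, while a detection covering half of the anomaly passes at `alpha=0.4` and would fail at a stricter setting such as `alpha=0.75`.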