IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 12, DECEMBER 2015
Gaussian Process Regression-Based Video Anomaly Detection and Localization With Hierarchical Feature Representation

Kai-Wen Cheng, Yie-Tarng Chen, and Wen-Hsien Fang
Abstract— This paper presents a hierarchical framework for detecting local and global anomalies via hierarchical feature representation and Gaussian process regression (GPR), which is fully non-parametric, robust to noisy training data, and supportive of sparse features. While most research on anomaly detection has focused on detecting local anomalies, we are more interested in global anomalies that involve multiple normal events interacting in an unusual manner, such as car accidents. To simultaneously detect local and global anomalies, we cast the extraction of normal interactions from the training videos as a problem of finding the frequent geometric relations of nearby sparse spatio-temporal interest points (STIPs). A codebook of interaction templates is then constructed and modeled using GPR, based on which a novel inference method for computing the likelihood of an observed interaction is also developed. Thereafter, these local likelihood scores are integrated into globally consistent anomaly masks, from which anomalies can be succinctly identified. To the best of our knowledge, this is the first time GPR has been employed to model the relationship of nearby STIPs for anomaly detection. Simulations based on four widespread datasets show that the new method outperforms the main state-of-the-art methods with lower computational burden.

Index Terms— Video surveillance, anomaly detection, global anomaly, Gaussian process regression.
I. INTRODUCTION
VIDEO anomaly detection is an important yet challenging problem because normal events that suffer from partial occlusion, scale variation, and inter- and intra-class behavior variations are likely to be misclassified as abnormal. Moreover, abnormal events themselves can involve, for example, an individual object or human alone, such as a speeding car, or a group of objects, as in fighting or jaywalking. In general, video event anomalies can be classified as local and global anomalies [1]. While a local anomaly is defined as an event that is different from its spatio-temporal neighboring events, a global anomaly is defined as multiple
Manuscript received May 25, 2015; revised September 3, 2015; accepted September 7, 2015. Date of publication September 17, 2015; date of current version October 6, 2015. This work was supported by the Ministry of Science and Technology, Taiwan, under Contract MOST 104-2221-E-011-048 and Contract MOST 104-2221-E-011-112. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Martin Kleinsteuber. The authors are with the Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2015.2479561
events that globally interact in an unusual manner, even if each individual local event is normal. Most research on anomaly detection, see [2]–[5], has focused on detecting local anomalies such as objects with strange appearance or speed, but less on global anomalies, which are quite common in many scenarios such as traffic surveillance. To deal with global anomalies, the relationship of local observations has been utilized in many methods, examples of which include the Bayesian network method [6] and the probabilistic framework [7]. However, these methods concentrate on relatively simple abnormal events, such as a biker running into a pedestrian walkway [8], and an evaluation on more complicated datasets containing global anomalies is still missing. Moreover, these methods are devised based on modeling the relationship of dense features and do not work well for sparse observations; extracting dense features is, however, very expensive in both space and time. The social force model [9] and the interaction energy potential [10] have also been proposed to detect unusual interpersonal relationships. However, these methods are designed for human motion and may not be suitable for modeling other moving objects such as cars.

To tackle the above problems and simultaneously detect local and global anomalies, this paper proposes a unified framework, shown in Fig. 1, using a sparse set of spatio-temporal interest points (STIPs). Local anomalies are identified as low-likelihood STIP features with respect to a low-level codebook of visual features. As for global anomalies, which involve inter-event interactions, we collect an ensemble of nearby STIP features and consider an observed ensemble regular if the semantic (appearance) and structural (position) relations of its nearby STIP features occur frequently.
Global anomalies are then identified as interactions with either dissimilar semantics or misaligned structures with respect to the probabilistic normal models. Since recognizing global anomalies requires a set of normal interaction templates, we first pose the extraction of normal interactions from the training videos as a problem of finding the frequent geometric relations of nearby interest points. A high-level codebook of interaction templates is then built, each of which represents an interaction by an ensemble of STIPs. We next model the geometric relations of the STIP features and propose a novel inference method using Gaussian process regression (GPR). It is noteworthy that GPR is particularly suitable for anomaly detection since it is fully non-parametric, robust to noisy data, and supports missing input values like
1057-7149 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. Overview of the proposed method.
sparse STIPs. Finally, a consistent detection method is also addressed to amalgamate the results from the local and global anomaly detectors into a globally consistent mask. Simulations show that the proposed method outperforms the main state-of-the-art methods on four widespread datasets, yet with lower complexity. The contributions of this paper include: 1) a novel hierarchical event representation is built to simultaneously deal with local and global anomalies; 2) an efficient clustering method is employed to extract deformable templates of inter-event interactions from the training videos; 3) a GPR model, which is robust for modeling the deformable configuration of STIPs, is constructed for anomaly inference. To the authors' best knowledge, this is the first time GPR has been employed to model the relationship of nearby STIPs for anomaly detection. Note that since our model is built upon sparse STIPs rather than densely-sampled patches [6], [7], the space and time complexity can be greatly reduced; 4) a novel consistent detection scheme is devised, which not only removes false predictions, but also recovers missed detections. Part of this paper has been presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [11].

The rest of this paper is organized as follows. Sec. II gives an overview of related work. Sec. III introduces the hierarchical event representation. The high-level codebook construction is elaborated in Sec. IV. Sec. V details the GPR learning and inference, and global anomaly detection. Sec. VI describes a new consistent detection scheme. Extensive experiments are conducted in Sec. VII to validate the new method, and Sec. VIII concludes our work.

II. RELATED WORK

The area of visual anomaly analysis has received enormous attention in recent years, as evidenced by a detailed survey in [12].
Depending on the application, the notion of anomaly varies from background subtraction and fall detection [13] to intrusion detection [14]. Behavior representation,
understanding, and anomaly inference are the major issues in anomaly detection. Object profiles such as location, shape, and trajectory have been widely used in constrained scenes to find unusual events. To effectively detect anomalies in unconstrained scenes, normal event understanding is usually posed as a 3D pattern-learning problem. Suspicious events are treated as low-likelihood patterns with respect to either offline templates of normal events [4] or adaptive models learned from sequential data [2]. In addition to the one-class support vector machine [15], [16], Mahadevan et al. [3] proposed a mixture of dynamic textures (MDT) model [17] to detect temporal and spatial abnormalities. Saligrama and Chen [4] used a ranking scheme based on local k-nearest-neighbor detectors for anomaly detection. Approaches like [3] and [4] flag abnormal events based on independent location-specific statistical models and do not consider the relationship among local observations. To model group interactions, Cui et al. [10] proposed an interaction energy potential to model interpersonal relationships. The social force model was extended from physics to analyze crowd dynamics [9]. These models adhere strongly to human motion and have limitations in specific scenarios. An infinite hidden Markov model was used in [8] to model the time-evolving properties of visual features, but the spatial relationship of these features was ignored. Roshtkhari and Levine [7] encoded the spatio-temporal composition (STC) of densely-sampled 3D patches with a probability density function (pdf). However, the estimation of a high-dimensional pdf may suffer from the curse of dimensionality. Boiman and Irani [6] proposed an inference by composition (IBC) algorithm to compute the joint probability between a database and a query ensemble. However, the underlying graph, which expands substantially with the database size, can be a burden on the message-passing process.
Also, these methods [6], [7] model the spatio-temporal relations of densely-sampled 3D patches, whose extraction is computationally demanding.
Algorithm 1 GPR Based Anomaly Detection and Localization
The proposed method begins with constructing a novel hierarchical feature representation. In contrast to [7], we use the geometric relations of sparse STIP features rather than densely-sampled features to render data processing more efficient.

A. Low Level: Multi-Scale Event Representation
GPR has been applied to trajectory analysis [18] and human motion modeling [19]. For multi-object activity modeling, Loy et al. [20] formulated the non-linear relationship between decomposed image regions as a regression problem. As the normality of a specific region is predicted based on its complementary regions in the previous frame, spatial configurations between objects can be well characterized. However, the Markov assumption cannot handle complex causality.

Consistent detection, which integrates the local scores from a detector into a globally consistent mask, is beneficial in locating action targets. For anomaly detection, Benezeth et al. [21] used a Markov random field (MRF) model parameterized by a co-occurrence matrix to allow for spatially consistent detection. Li et al. [22] used a conditional random field (CRF) to synthesize the scores derived from multi-scale spatial/temporal detectors. However, these graphical-model-based approaches usually require an additional learning step. Consistent detection can also be implemented without learning by employing a search algorithm. For example, Lampert et al. [23] proposed an efficient window search for object detection. Yuan et al. [24] formulated action detection as a maximum sub-volume problem: a target action is located with a 3D bounding box via the maximum sub-volume search. To locate a target action/anomaly more precisely, Tran et al. [25] formulated anomaly localization as a maximum path problem. However, the search process for high-resolution videos is still time-consuming. Recently, Cheng et al. [16] combined the works of [23] and [25] and came up with an efficient maximum subsequence search for anomaly localization.

III. HIERARCHICAL FEATURE REPRESENTATION

The following sections describe the proposed algorithm, as summarized in Algorithm 1.
In the training procedure, we discuss how to cluster the low-level STIP features and high-level ensembles from the training videos into low-level and high-level codebooks, respectively, from which multiple GPR models can be established. In the testing procedure, we elaborate how to estimate the anomaly likelihoods of low-level and high-level features from a test video with respect to the multiple GPR models, from which a consistent detection can be performed to precisely locate a variety of local and global anomalies.
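As a toy illustration of the low-level portion of this pipeline (codebook construction during training, k-NN likelihood scoring during testing), the following sketch uses random vectors as stand-in STIP descriptors and a minimal k-means loop; it is not the authors' implementation, and the descriptor dimension and codebook size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_low_level_codebook(descriptors, n_words=8, n_iter=20):
    """Toy k-means: quantize STIP descriptors into a visual codebook C."""
    centers = descriptors[rng.choice(len(descriptors), n_words, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((descriptors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(n_words):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(0)
    return centers

def knn_score(d, codebook, k=3):
    """Low-level anomaly likelihood: mean distance to the k nearest codewords."""
    dists = np.sort(np.linalg.norm(codebook - d, axis=1))
    return dists[:k].mean()

# training: quantize "normal" descriptors; testing: score query descriptors
train = rng.normal(0.0, 1.0, size=(200, 16))
C = build_low_level_codebook(train)
y_normal = knn_score(train[0], C)
y_weird = knn_score(train[0] + 10.0, C)   # a far-away, never-seen descriptor
```

A descriptor far from every codeword receives a much larger score, which is the basic signal thresholded by the local anomaly detector.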
Since every event happens with dynamics, an STIP feature is used to represent an event. We use the STIP detector proposed by Dollar et al. [26], which utilizes a 2D Gaussian filter and a 1D Gabor filter in the spatial and temporal directions, respectively. To handle events with different scales due to camera perspective distortion, a two-level Gaussian video pyramid is built from the input video. Depending on the scenario, we empirically choose an appropriate descriptor from the interest point response (IPR) [26], 3DSIFT [27], and the 3D extensions of HOG [28] and HOF [29]. To establish a normal event model that handles the potential behavior variations, we next quantize the low-level STIP features into a visual codebook C using the k-means algorithm based on the Euclidean metric. The anomaly likelihood for each STIP feature is based on its k-nearest-neighbor (k-NN) distance with respect to the visual vocabulary C:

y_i^l = (1/k) Σ_{c_j ∈ C_i} ||d_i − c_j||_2   (1)
where ||·||_2 denotes the Euclidean norm, C_i ⊆ C is the subset of the top-k nearest codewords for the i-th interest point, STIP_i, and d_i is its feature vector. The rationale for choosing the k-NN detector is its recent success in anomaly detection [4] and its robustness to the effect of noise on classification for large k [30]. For the binary classification scenario, Hastie et al. [31] demonstrated that the decision boundary for k-NN is far more regular than that based on the single nearest neighbor. The choice of an appropriate k is discussed in more detail in Sec. VII-A. On the other hand, though we could also use a Gaussian mixture model to represent y_i^l as a likelihood, it requires estimating the means and covariances of the Gaussian components. In contrast, k-means clustering only needs to compute the cluster means, which involves fewer parameters and entails a faster convergence rate [32]. Thus, abnormal events with either strange appearances or unusual motions can be detected, efficiently and effectively, if y_i^l exceeds a threshold T_l. Note that we consider local and global anomalies separately. That is, we exclude the anomalous interest points detected by the local anomaly detector using Eq. 1, as we are more interested in the interactions of normal events.

B. High Level: Ensemble of STIP Features

To acquire the potential interactions in videos, we densely slide a 3D window over the video space to obtain ensembles of nearby STIP features. An ensemble centered at the j-th sampled point is given by

E_j = {(v_i, y_i^l, C_i) | ∀ STIP_i ∈ R_j}   (2)
Algorithm 2 Clustering Ensembles of STIP Features
Fig. 2. An example of measuring semantic and structural similarities: We partition the ensemble space into 3-by-3 regions. Four different spatial relations of STIPs (black dots) and their matched codewords are provided.
where R_j denotes the spatio-temporal neighborhood around the j-th sampled point. For each STIP_i ∈ R_j, its relative location v_i ∈ R^3, its k-NN distance y_i^l, and the subset of matched codewords C_i ⊆ C are stored. Since some ensembles contain only a few or no STIPs, we enforce a quality-control mechanism to filter out such ensembles and accelerate the construction of the high-level codebook in the next stage. The quality function of an ensemble E_j is defined as the ratio of the cuboid volumes V(STIP_i) to the ensemble volume V(E_j):

q(E_j) = Σ_{STIP_i ∈ R_j} V(STIP_i) / V(E_j)   (3)

An ensemble E_j is qualified if its quality value q(E_j) exceeds a user-specified threshold T_E. To efficiently calculate Eq. 3, we can apply the high-dimensional image integral technique in [33]. Suppose there is a 3D mask M with size equal to that of the input video, which flags all of the cuboid coverage surrounding the STIPs (i.e., for each location v ∈ R^3 in the 3D mask, M_v = 1 if v ∈ ∪_i V(STIP_i), and M_v = 0 otherwise). The quality function of an ensemble E_j in Eq. 3 then becomes the average of the binary values M_v located in the region R_j, q(E_j) = Σ_{v ∈ R_j} M_v / V(E_j), which can be efficiently computed by [33]

q(E_j) = ( Σ_{p ∈ {0,1}^3} (−1)^{3−||p||_1} I_M(x_p) ) / V(E_j)   (4)

where {x_p | p ∈ {0,1}^3} indicates the eight corner locations of the ensemble E_j, and I_M is the 3D integral image of the volumetric mask, which can evaluate the quality function with O(1) operations.

IV. HIGH-LEVEL CODEBOOK CONSTRUCTION

To find the frequent geometric relations among nearby STIP features from the training videos, we cluster the qualified ensembles to acquire a high-level codebook of implicit interaction templates. Specifically, given a set of qualified ensembles, we divide the ensembles into m sets S = {S_1, ..., S_m} so as to minimize the within-cluster distance given by

J = min_{S,m} Σ_{i=1}^{m} Σ_{E_j ∈ S_i} sim(E_j, Ē_i)   (5)
where Ē_i is the representative ensemble of S_i. Note that an ensemble here is represented by a set of STIPs, in contrast to the vector form in [7].

A. Semantic and Structural Similarities

A similarity measurement between two ensembles is required for clustering. Here, we employ a two-phase strategy for computational efficiency. It begins with partitioning an ensemble space into n_r 3D subregions, and then computes the difference between ensembles E and E' based on a newly defined similarity measure:

sim(E, E') = u^T Q u' / ||u + u'||_0   (6)

where the L_0-norm ||x||_0 denotes the total number of non-zero elements in x, the location occurrence u for an ensemble E is an n_r × 1 binary vector in which every entry indicates whether any STIP exists in the corresponding subregion, and the label co-occurrence matrix Q is an n_r × n_r binary diagonal matrix in which the i-th diagonal entry indicates whether any pair of matched codewords from ensembles E and E' coincides in the i-th subregion. Fig. 2 demonstrates the similarity computation in a 2D example. Note that ensembles E1 and E2 share similar semantic and structural relationships, while E1 and E3 have only a similar structural relationship, and E1 and E4 are quite different.
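The similarity measure above can be sketched as follows, assuming, for illustration only, that each ensemble is summarized by a binary location-occurrence vector u and a per-subregion set of matched codeword labels; the diagonal of Q is derived from the shared labels:

```python
import numpy as np

def ensemble_similarity(occ_a, labels_a, occ_b, labels_b):
    """Semantic/structural similarity between two ensembles (cf. Eq. 6).

    occ_*    : (n_r,) binary location-occurrence vectors u and u'
    labels_* : list of n_r sets, the codeword labels present in each subregion
    """
    # Q: binary diagonal matrix; Q[i, i] = 1 if the two ensembles share
    # at least one matched codeword in subregion i
    q = np.array([int(bool(la & lb)) for la, lb in zip(labels_a, labels_b)])
    denom = np.count_nonzero(occ_a + occ_b)        # ||u + u'||_0
    if denom == 0:
        return 0.0
    return float(occ_a @ (q * occ_b)) / denom      # u^T Q u' / ||u + u'||_0

# toy 4-subregion example
u1 = np.array([1, 1, 0, 0]); l1 = [{3}, {7}, set(), set()]
u2 = np.array([1, 1, 0, 0]); l2 = [{3}, {7}, set(), set()]   # same structure/labels
u3 = np.array([0, 0, 1, 1]); l3 = [set(), set(), {3}, {7}]   # shifted structure
```

Ensembles with matching occupancy and labels (u1 vs. u2) score 1, while structurally misaligned ones (u1 vs. u3) score 0, mirroring the E1/E2 versus E1/E4 cases of Fig. 2.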
B. Bottom-Up Greedy Clustering

As the number of ensembles grows in proportion to the size of the training data, it is advantageous to adopt a bottom-up clustering procedure for large datasets to reduce time and memory requirements. Algorithm 2 shows a greedy approach which sequentially updates an ever-growing codebook Ē by Ē_{i*} = Ē_{i*} ∪ E_j once a qualified ensemble E_j from the training videos is available. Note that input ensembles with a similar topology are agglomerated altogether, so that STIPs within a merged ensemble may appear in a deformable configuration. To reduce the variation in the number of STIPs among different templates, we enforce the quality control of Eq. 4 on each template, and a template stops agglomerating when its quality value exceeds a user-specified threshold T_q. After clustering, we also prune noise in each template by
Algorithm 3 Marginal Inference Algorithm
Fig. 3. The GPR model learning and inference: In the middle, squares represent observed variables and circles represent model prediction. The thick horizontal bar represents a set of fully connected nodes. For each normal template, its topology is formulated as a k-NN regression problem for GPR.
discarding subregions with low support (i.e., a small number of STIPs). The low-support suppression and quality-control mechanisms are straightforward but effective, as shown in Fig. 8.

V. GPR-BASED GLOBAL ANOMALY DETECTION

Next, GPR is used to model the spatio-temporal relationship of STIPs for each template in Ē as a k-NN regression problem, as shown in Fig. 3. The details are delineated in the following subsections.

A. GPR Model Learning

For a specific template, let V = {v_i ∈ R^3 | i = 1, ..., n} be a sequence of relative positions of STIPs, and let the k-NN distances y = {y_i^l ∈ R | i = 1, ..., n} serve as the target values. The goal of GPR is to learn the mapping from the inputs V to the continuous observable targets y. Assume that the target vector y follows a zero-mean Gaussian prior. According to [34], the predictive distribution of f_* = {f(v_*^(i)) ∈ R | i = 1, ..., n_*} at the test locations V_* = {v_*^(i) ∈ R^3 | i = 1, ..., n_*} is a multivariate Gaussian distribution given by

f_* | V, y, V_* ~ N(f̄_*, V(f_*))   (7)
where f̄_* = K_*^T (K + σ_n^2 I_n)^{-1} y and V(f_*) = K_** − K_*^T (K + σ_n^2 I_n)^{-1} K_*, in which I_n is an n × n identity matrix, and K(V, V), K(V, V_*), and K(V_*, V_*), denoted by K, K_*, and K_**, respectively, are the covariance matrices evaluated based on the radial basis function (RBF) kernel given by

k(x, x') = σ_f^2 exp(−||x − x'||_2^2 / (2l^2)) + σ_n^2 δ(x − x')   (8)

where δ(x − x') is a Kronecker delta which is one if x = x' and zero elsewhere. The RBF kernel relates predictions at nearby locations to each other, so GPR with this kernel tends to force outliers toward low likelihood. To handle noisy observations, additive independent, identically distributed Gaussian noise with variance σ_n^2 is imposed on each observation in the training set. The hyper-parameters of the RBF kernel include the length-scale l, the signal variance σ_f^2, and the noise variance σ_n^2, which can be determined by maximizing the marginal likelihood of the training data using the conjugate gradient method. Note that a properly optimized noise variance σ_n^2 leads to a smooth mapping curve and prevents GPR from over-fitting the training data. After the learning process, the GPR model for each template records the training data and the learned hyper-parameters, i.e., D = {V, y, l, σ_f, σ_n}.
B. GPR Model Inference

Given a new ensemble E_* = (V_*, y_*) from a test video, the likelihood with respect to a specific GPR model D_g = {V, y, l, σ_f, σ_n} is defined by the following marginal probability:

p(y_* | V_*, D_g) = ∫ p(f_* | V_*, D_g) p(y_* | f_*) df_*   (9)
where the probability p(f_* | V_*, D_g) accounts for the positional distribution and p(y_* | f_*) captures the appearance similarity. The likelihood p(y_* | V_*, D_g) in Eq. 9 is the conditional probability of the target values y_* given the test locations V_* and the GPR model D_g, with the GPR prediction values f_* marginalized out. Thus, it can jointly consider how likely the semantic and structural relationships of the test ensemble belong to the GPR model, as illustrated in Fig. 3. In Eq. 9, we use GPR to model the first term, which yields a multivariate Gaussian distribution. As for the similarity term p(y_* | f_*), there are many choices. For instance, Rasmussen and Williams [34] used a zero-mean Gaussian assumption and showed that the integral boils down to a multivariate Gaussian distribution. It is, however, inappropriate in our case because no prior information learned from a GPR model is used to model p(y_* | f_*). Therefore, we augment the Gaussian assumption by incorporating the prediction results:

y_* | f_* ~ N(f_*, σ_n^2 I_{n_*})   (10)

If we assume that the pattern residuals ε = y_* − f_* follow an independent, identical Gaussian distribution with variance σ_n^2, Eq. 9 becomes an integral of a Gaussian product. Substituting Eqs. 7 and 10 into Eq. 9 results in

p(y_* | V_*, D_g) = ∫ 1/√((2π)^{n_*} |V(f_*)| |σ_n^2 I_{n_*}|) · exp( −(1/2)(f_* − f̄_*)^T V(f_*)^{−1} (f_* − f̄_*) − (1/2)(y_* − f_*)^T (1/σ_n^2) I_{n_*} (y_* − f_*) ) df_*   (11)
Then, by making use of Sylvester's determinant theorem [35] and the Woodbury inversion lemma [36], the logarithmic form of Eq. 11 can be simplified as

log p(y_* | V_*, D_g) = −(1/2) f̄_*^T Σ_*^{−1} f̄_* − (1/2) y_*^T Σ_*^{−1} y_* + y_*^T Σ_*^{−1} f̄_* − (n_*/2) log 2π − (1/2) log |Σ_*|   (12)

where Σ_* := V(f_*) + σ_n^2 I_{n_*}. A practical implementation of Eq. 12 is shown in Algorithm 3, where the matrix inversion is replaced with the Cholesky decomposition for faster and numerically stable computation. In case of failure in the Cholesky decomposition, we relax the input dependence by discarding the off-diagonal entries in Σ_*. The computations in Algorithm 3 are dominated by matrix multiplication. Since L and α can be pre-computed in the training period, the overall running time is approximately O(n^2 n_*) provided that n ≫ n_*.

C. Global Anomaly Detection

In the test mode, we calculate the likelihood of a test ensemble E_* from a test video with respect to the multiple GPR models. The global negative log likelihood (GNLL) of a test ensemble E_* against the g-th GPR model is defined as the average of the point-wise negative log likelihoods:

G_g(E_*) = −(1/n_*) Σ_{i=1}^{n_*} log p(y_*^(i) | v_*^(i), D_g)   (13)
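The inference step can be sketched as below. For compactness, the code evaluates the mathematically equivalent grouped form of Eq. 12, −(1/2)(y_* − f̄_*)^T Σ_*^{−1} (y_* − f̄_*) − (n_*/2) log 2π − (1/2) log|Σ_*|, using a Cholesky factorization in the spirit of Algorithm 3; the GNLL of Eq. 13 is approximated point-wise from the diagonal of the predictive covariance:

```python
import numpy as np

def marginal_log_likelihood(mean, cov, y_star, sig_n):
    """log p(y_* | V_*, D_g) of Eq. (12): y_* ~ N(f_bar_*, V(f_*) + sig_n^2 I),
    computed with a Cholesky factorization as in Algorithm 3."""
    n = len(y_star)
    Sigma = cov + sig_n ** 2 * np.eye(n)     # Sigma_* = V(f_*) + sig_n^2 I
    L = np.linalg.cholesky(Sigma)
    r = y_star - mean
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))   # Sigma_*^-1 (y_* - f_bar_*)
    log_det = 2.0 * np.log(np.diag(L)).sum()              # log |Sigma_*|
    return -0.5 * r @ alpha - 0.5 * n * np.log(2.0 * np.pi) - 0.5 * log_det

def gnll(mean, cov, y_star, sig_n):
    """Eq. (13): average point-wise negative log likelihood (diagonal treatment)."""
    var = np.diag(cov) + sig_n ** 2
    ll = -0.5 * ((y_star - mean) ** 2 / var + np.log(2.0 * np.pi * var))
    return float(-ll.mean())

# sanity check: zero mean/covariance and unit noise reduce to a standard normal
mean = np.zeros(2); cov = np.zeros((2, 2)); y_obs = np.zeros(2)
mll = marginal_log_likelihood(mean, cov, y_obs, sig_n=1.0)
```

Falling back to the diagonal of Σ_* when the factorization fails, as the paper describes, amounts to calling the point-wise `gnll` variant instead.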
The best-matched GPR model for the ensemble E_* is determined by

g* = arg min_g G_g(E_*)   (14)
To precisely locate abnormal events, each STIP within the test ensemble is assigned its local negative log likelihood (LNLL) with respect to the best-matched GPR model:

y_i^h = −log p(y_*^(i) | v_*^(i), D_{g*}), ∀ STIP_i ∈ R_*   (15)

For point-wise likelihood calculation, most matrix manipulations in Algorithm 3 reduce from polynomial to linear time complexity. Though the computation order remains unchanged, a salient speedup can be achieved in practice. The time complexity of Algorithm 3 thus reduces to O(n_* Σ_g n_g^2), where n_g is the number of STIPs in the g-th GPR model. Fig. 4 shows the likelihoods of three test cases on the Subway dataset, where a large-scale ensemble is adopted to monitor short-term video clips. To combine the results from the local and global anomaly detectors, a weighted sum is applied:

ŷ_i = (1 − α) ŷ_i^l + α ŷ_i^h   (16)

where α ∈ [0, 1] is the preference factor and ŷ_i^l and ŷ_i^h are the standard scores of y_i^l and y_i^h as defined in Eqs. 1 and 15, respectively.
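The score fusion of Eq. 16 amounts to a weighted sum of standardized scores; a minimal sketch, with made-up score vectors, is:

```python
import numpy as np

def fuse_scores(y_local, y_global, alpha=0.5):
    """Eq. (16): weighted sum of standardized local and global anomaly scores."""
    z = lambda s: (s - s.mean()) / (s.std() + 1e-12)   # standard (z-) scores
    return (1.0 - alpha) * z(y_local) + alpha * z(y_global)

yl = np.array([0.1, 0.2, 0.15, 0.9])   # local k-NN scores; last STIP is unusual
yh = np.array([1.0, 1.1, 0.9, 4.0])    # global LNLLs; same STIP is unusual
fused = fuse_scores(yl, yh, alpha=0.5)
```

Standardizing both score streams before mixing keeps the preference factor α meaningful even when the two detectors produce scores on very different scales.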
Fig. 4. Visual example of global anomaly detection: A learned GPR model is shown in the first column while the test behaviors are shown in the remaining columns. We intentionally use the rotation-invariant 3DSIFT descriptor such that these behaviors cannot be distinguished solely using their patterns (k-NN distances) (the second row) unless the positional information (the third row) is considered.
VI. CONSISTENT DETECTION AND LOCALIZATION

Based on the aforementioned GPR-based anomaly detection framework, a test video is converted into anomaly likelihood maps using Eq. 16. To locate anomalies, we could perform a binary classification on the outputs of the GPR-based anomaly detector, reporting an STIP as anomalous if its likelihood exceeds a predetermined threshold. However, this may yield many false or missing detections due to an improper selection of the detector-dependent threshold or the neglect of temporal consistency, both of which impair anomaly localization. In light of this, this section proposes a consistent localization scheme that integrates the local scores of STIPs given by Eq. 16 into globally consistent masks from which abnormal events can be precisely located.

Consistent localization can be formulated as a constrained optimization problem. Suppose there is a sequence of 2D anomaly likelihood maps L = {L_1, ..., L_N}, L_t ∈ R^{w×h}, generated by evaluating Eq. 16 at each STIP of a test video. We further assume that each likelihood map L_t is associated with a spatial window x_t^T = [v_t^T, s_t^T], with location v_t ∈ R^2 and scale s_t ∈ R^2, and a function f_t(x_t) := Σ_{STIP_i ∈ x_t} ŷ_i, which is the sum of the local scores ŷ_i within the window x_t. The consistent localization problem can be formulated as

maximize_{x_1,...,x_N ∈ R_+^4}  Σ_{t=1}^{N} f_t(x_t)
subject to  x_t ⪯ c,  t = 1, ..., N
            ||v_{t+1} − v_t||_2 ≤ ρ,  t = 1, ..., N − 1   (17)

where ⪯ denotes the component-wise inequality. The first inequality constraint bounds the positions and scales of each window x_t to the video space, i.e., c = [w, h, w, h]^T, and the second inequality constraint specifies the spatial proximity between each pair of successive windows to ensure consistent detection, where ρ is a predetermined proximity threshold. The optimal variables of the constrained problem in Eq. 17 provide a set of windows over the video space that can, consistently and precisely, locate the abnormal events. Note that if the ranges of the functions f_t's
Algorithm 4 Consistent Detection Algorithm
Fig. 5. The proposed consistent detection: (a) The candidate regions (black rectangles) in each frame are found by using the efficient window search. (b) A 3D trellis is comprised of black-filled nodes, where the centers of these candidate regions locate, and empty nodes. (c) The optimal path (black solid line) is found by using the max path search, in which each pair of the successive nodes in the path satisfies the constraints in Eq. 17.
are non-negative, the optimal variables become a set of windows that fully covers every STIP in the test video. To avoid this, we offset the functions f_t by a small negative constant. As the scales, locations, and the starting and ending points of an abnormal event are unknown beforehand, solving this problem is challenging: an exhaustive search takes O(w^2 h^2 N (2ρ + 1)^{2N}) time. To solve the problem efficiently, we present a low-complexity, yet effective search algorithm [16], which decouples an input video into the spatial and temporal domains and applies two strategies, the efficient window search [23] and the max path search [25], respectively. The overall process is depicted in Algorithm 4 and illustrated in Fig. 5.

A. Efficient Window Search

In the spatial domain, the efficient window search is applied to each frame to locate the candidate regions where anomalies may occur. Each candidate region in the t-th frame is found by

x* = arg max_{x ∈ X} f_t(x)   (18)
where X denotes a set of windows. To efficiently solve this problem, we adopt a branch-and-bound algorithm [23] that hierarchically splits a set of windows into two smaller disjoint subsets and keeps an upper bound of the function value f_t for each subset. This procedure continues until the target region x* is found. Note that, after obtaining the target region x*, one can remove it from the search space and restart the process to search for the next candidate region. By doing so, as shown in Fig. 5(a), we can locate multiple candidate regions in each frame t, denoted as {(x_t^1, f_t(x_t^1)), (x_t^2, f_t(x_t^2)), ...}, whose centers are denoted as {v_t^1, v_t^2, ...}.

B. Max Path Search

In the temporal domain, we construct a 3D trellis T in which each pixel of the input video is denoted as a node v, as shown
in Fig. 5(b), the value of which, Tv , is given by f (x) if v ∈ {v1t , v2t , ...} Tv = φ other wi se
(19)
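As Algorithm 4 itself is not reproduced here, the following is a minimal sketch of the two search strategies under simplifying assumptions: the frame score f_t is taken to be a sum of per-pixel anomaly scores, and the max path step is written as a plain per-frame dynamic program over candidate centers rather than the full search of [25]. Function names (`efficient_subwindow_search`, `max_path`) are ours, not the authors' code.

```python
import heapq
import numpy as np

def _box_sum(ii, t, b, l, r):
    """Sum of scores in rows t..b, cols l..r (inclusive) via an integral image."""
    return ii[b + 1, r + 1] - ii[t, r + 1] - ii[b + 1, l] + ii[t, l]

def efficient_subwindow_search(scores):
    """Branch-and-bound subwindow search in the spirit of [23], assuming
    f_t(x) is the sum of per-pixel scores inside window x. A state
    (T, B, L, R) holds intervals for the top/bottom/left/right edges,
    i.e. a *set* of windows; its upper bound is the positive mass of the
    largest window plus the negative mass of the smallest one."""
    h, w = scores.shape
    ii_pos = np.pad(np.cumsum(np.cumsum(np.maximum(scores, 0), 0), 1), ((1, 0), (1, 0)))
    ii_neg = np.pad(np.cumsum(np.cumsum(np.minimum(scores, 0), 0), 1), ((1, 0), (1, 0)))

    def bound(T, B, L, R):
        up = _box_sum(ii_pos, T[0], B[1], L[0], R[1])      # largest window in the set
        if T[1] <= B[0] and L[1] <= R[0]:                  # smallest window is non-empty
            up += _box_sum(ii_neg, T[1], B[0], L[1], R[0])
        return up

    full = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-bound(*full), *full)]
    while heap:
        _, T, B, L, R = heapq.heappop(heap)
        ivs = [T, B, L, R]
        i = max(range(4), key=lambda j: ivs[j][1] - ivs[j][0])  # widest interval
        lo, hi = ivs[i]
        if hi == lo:  # all intervals are singletons: an exact window, optimal by best-first order
            t, b, l, r = T[0], B[0], L[0], R[0]
            return (t, b, l, r), _box_sum(ii_pos, t, b, l, r) + _box_sum(ii_neg, t, b, l, r)
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):             # split and re-bound
            cand = list(ivs)
            cand[i] = part
            if cand[0][0] <= cand[1][1] and cand[2][0] <= cand[3][1]:  # geometrically valid
                heapq.heappush(heap, (-bound(*cand), *cand))

def max_path(frames, rho):
    """Brute-force stand-in for the max path search [25]: `frames` is a list,
    over time, of [(center, score), ...] candidate regions; consecutive path
    nodes must lie within `rho` pixels (Chebyshev distance here)."""
    best_score, best_path, prev = float('-inf'), [], []
    for cands in frames:
        cur = []
        for (x, y), s in cands:
            acc, path = s, [(x, y)]                   # either start a new path here...
            for (px, py), (pacc, ppath) in prev:      # ...or extend the best reachable one
                if abs(x - px) <= rho and abs(y - py) <= rho and pacc + s > acc:
                    acc, path = pacc + s, ppath + [(x, y)]
            cur.append(((x, y), (acc, path)))
            if acc > best_score:
                best_score, best_path = acc, path
        prev = cur
    return best_score, best_path
```

The erase-and-restart step used in the paper to recover multiple candidate regions and multiple abnormal events is omitted for brevity.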
Afterward, let p = {v_t}_{t=1}^{N} be a path in the trellis T satisfying the constraints in Eq. 17, and let the likelihood of this path be T(p) := Σ_{t=1}^{N} T_{v_t}. Then, Eq. 17 can be posed as the max path problem

p* = arg max_{p ∈ path(T)} T(p)    (20)

where path(T) denotes the set of all possible paths in T. By performing the max path search [25] on T, we can find the optimal path p* and its corresponding regions {x_t^*}_{t=1}^{N} where the abnormal target is located, as shown in Fig. 5(c). Similarly, multiple abnormal events can be discovered by erasing the score of the best path from T and restarting the search. The proposed consistent scheme can remove false detections and, meanwhile, recover missed detections arising from the GPR detector. Moreover, it is time efficient even for high-resolution videos.

VII. EXPERIMENTAL RESULTS

In this section, we evaluate the proposed method on four widespread real-world datasets and compare it with the main state-of-the-art methods. In Sec. VII-A, we describe the experiment setup and the evaluation protocol. In Sec. VII-B, we demonstrate the advantages of using GPR and the proposed consistent scheme, and assess the proposed approach from various aspects of the GPR model. Thereafter, in Sec. VII-C, we first compare the proposed method with previous methods that treat each local observation independently, and then with methods that further consider the spatio-temporal relationship of local patterns for anomaly detection.

The four public datasets we employ include the Subway [2], the UCSDped1 [3], the Behave [37], and the QMUL Junction
TABLE II. THE PARAMETER SETTING OF FOUR DATASETS
Fig. 6. Dataset snapshots.
TABLE I. DATASET DESCRIPTION
datasets [20], all of which contain a variety of suspicious events (e.g., no payment, fighting, and jaywalking) and difficulties (e.g., illumination change, crowded scenes, and scale variation) that illustrate the versatility of the proposed method. Fig. 6 shows some snapshots of each dataset. Table I describes the scenarios and video lengths of each dataset, and the GT column indicates whether an official ground truth is provided (Y) or not (N).

A. Evaluation Protocol and Experiment Setup

To evaluate our work, we mainly follow the pixel-level and frame-level protocols [3]. Under the frame-level criterion, a frame is counted as a true positive if both the frame and its ground truth indicate anomalies, regardless of their location; under the pixel-level criterion, a true positive is counted only when a frame coincides with its ground truth such that at least 40% of the co-located pixels are identified. ROC curves are plotted by imposing multiple thresholds on the detection results. We quantify the ROC performance in terms of the equal error rate (EER) [3] and the area under the curve (AUC) [38], where the EER is the percentage of misclassified frames when the false positive rate equals the miss rate, and the AUC is the area under the ROC curve. The partitions of each dataset for the training and test procedures are also listed in Table I. For the Subway dataset, we use the first 20 minutes of the video for training, consistent with Adam et al. [2], and the rest for testing. For the UCSDped1 dataset, we use the original partition with 34 training and 36 test videos suggested by Mahadevan et al. [3], in consistence with previous works. As there are no official training and test partitions for the Behave and QMUL Junction datasets, both are split based on their event labels. For instance, the training data of the Behave dataset comprise randomly-picked video clips that exclude the abnormal events (i.e.
the chase, fight, and run together cases), and the remaining segments of the videos are assigned to the test set without
re-using the training videos. Similar partitions of the training and test sets are performed for the QMUL Junction dataset. Table II lists the parameters employed for the four datasets. For local anomaly detection, the type of STIP feature, the value of k, the size of vocabulary C, and the threshold T_l are determined by 2-fold cross-validation, which uses the training data and a subset of the test data to optimize the frame-level AUC performance. The ensemble size for grouping the nearby STIPs is dataset dependent. For the UCSDped1 and Behave datasets, which contain many local anomalies, the extent of an anomaly is approximately enclosed by the local-scale window size [0.17, 0.25, 4].¹ On the other hand, the incidents in the Subway and QMUL Junction datasets are dominated by global anomalies, which cannot be well detected with a local-scale ensemble, so we use a global-scale window that monitors the whole scene with a temporal scale of 10 seconds for both datasets. Also, the threshold T_E for pruning background ensembles is set to 0.2 for the local-scale ensembles and 0.1 for the global-scale ensembles, entailing at least 20% or 10% of the ensemble space where STIPs reside. In addition, the similarity threshold T_s in Algorithm 2 is adaptively adjusted based on 2-fold cross-validation to minimize the within-cluster cost in Eq. 5. The preference factor α is set to 0.5 in order to gain benefits from both the local and global anomaly detectors. The impact of the choice of α is investigated in Sec. VII-B.4. The proximity threshold ρ in Eq. 17 is set to 30 pixels, as the objects in consecutive frames may move at most within this distance.

B. Evaluation of the Proposed Approach

In this subsection, an intensive evaluation of the proposed approach, in terms of the kernel functions, time complexity, etc., is conducted. Due to space limitations, some comparisons are performed only on the UCSDped1 dataset.

¹The ensemble size contains three scales: the first/second entry indicates the ratio of the ensemble width/height to the image width/height, and the third denotes the temporal scale in seconds.
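As a concrete reference for the frame-level protocol described above, the ROC, AUC, and EER can be computed from per-frame anomaly scores and binary ground-truth labels roughly as follows; this is our own sketch, not the authors' evaluation code.

```python
import numpy as np

def frame_level_auc_eer(scores, labels):
    """Sweep a threshold over per-frame anomaly scores, build the ROC, and
    report the AUC and the equal error rate (EER), i.e., the operating
    point where the false positive rate equals the miss rate (1 - TPR)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    P, N = y.sum(), len(y) - y.sum()
    tpr = np.concatenate(([0.0], np.cumsum(y) / P))        # true positive rate
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / N))    # false positive rate
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoid rule
    i = np.argmin(np.abs(fpr - (1 - tpr)))  # sweep point closest to FPR = miss rate
    eer = (fpr[i] + (1 - tpr[i])) / 2
    return auc, eer
```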
Fig. 7. The performance of the Sparse Cuboids method under different parameters: (a) AUC performance vs. the number of clusters on several datasets. (b) AUC performance vs. the number of nearest neighbors k in the k-NN computation. For each dataset, the parameter that achieves the highest detection rate is used and marked with a filled square.
Compared with the other datasets, the UCSDped1 dataset contains a variety of abnormal events and difficulties incurred by crowded scenes, such as heavy occlusion, which can demonstrate the versatility and robustness of the proposed method. To evaluate the impact of the GPR modeling, the proposed approach with and without the GPR modeling, referred to as the GPR and the Sparse Cuboids methods, respectively, are both provided for comparison. Also, the GPR method in conjunction with the developed consistent detection is referred to as the consistent GPR.

1) Evaluation of the Sparse Cuboids Method: First, the performance with different vocabulary sizes of the k-means clustering on each dataset is shown in Fig. 7a. For the UCSDped1 and QMUL Junction datasets, we note that 100 clusters suffice to handle the slight scale variations of vehicles and pedestrians. In contrast, since passengers in the Subway dataset have a large variation in scale, more clusters (250) are required to reach 88.9% AUC. Also, fewer clusters (80) suffice for the Behave dataset, possibly because the same group of people repeatedly acted out ten scenarios to compose the dataset. Fig. 7b plots the AUC performance of the Sparse Cuboids method as a function of k in the k-NN computation. We observe from Fig. 7b that the nearest neighbor strategy achieves the best accuracy for the Subway and QMUL Junction datasets. This is because people moving in the same direction in the Subway dataset, and the rigid vehicle bodies in the QMUL Junction dataset, can be well discriminated from outliers. However, the nearest neighbor strategy is unreliable for the Behave and UCSDped1 datasets, so more neighbors are required in the Sparse Cuboids method to achieve more robust anomaly detection.

2) GPR With Data Pruning and Balance: Here, we assess the proposed filtering schemes described in Sec. IV-B on the UCSDped1 dataset.
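The k-NN scoring behind Fig. 7b can be sketched as follows: a test descriptor's anomaly score is its average distance to the k nearest training descriptors, so k = 1 recovers the nearest neighbor strategy. This is our illustrative formulation; the paper's exact scoring rule may differ.

```python
import numpy as np

def knn_anomaly_score(test_desc, train_descs, k=1):
    """Average Euclidean distance from a test descriptor to its k nearest
    training descriptors; larger scores suggest more anomalous observations."""
    d = np.linalg.norm(np.asarray(train_descs, dtype=float)
                       - np.asarray(test_desc, dtype=float), axis=1)
    return np.sort(d)[:k].mean()
```

A larger k averages over more neighbors and is therefore less sensitive to a single noisy training descriptor, which matches the observation that the Behave and UCSDped1 datasets benefit from more neighbors.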
After invoking the quality control (i.e., the data balance scheme), we can find from Fig. 8 that the first learned template has a more balanced number of STIPs compared with the other templates, and thus becomes more distinguishable. Moreover, the misclassification rate of the abnormal ensembles can be significantly reduced by using the low-support suppression (i.e., the data pruning scheme). By averaging the off-diagonal entries in the confusion matrices as a measure of noise, the GPR method without using the data
Fig. 8. Effect of data pruning and balance: In each confusion matrix, the rows indicate the 12 learned templates, the first 12 columns are the normal interactions selected from the members of each template in the corresponding rows, and the last 5 columns are the abnormal interactions from the UCSDped1 dataset. Note that the similarity scores of each template are normalized from 0 (black) to 255 (white) for better visualization. The bottom table provides an index for each confusion matrix.

TABLE III. THE LEARNING PROCESS OF GPR USING DIFFERENT KERNEL FUNCTIONS ON THE UCSDped1 DATASET
balance and pruning schemes (the leftmost matrix in Fig. 8) can have a noise level of 16.74%. By putting all of the aforementioned schemes together (the rightmost matrix in Fig. 8), the noise level is reduced to 4.32%.

3) GPR With Different Kernel Functions: As the choice of kernel function has a significant influence on the performance of the GPR modeling, apart from the isotropic RBF kernel in Eq. 8, we also investigate other stationary kernel functions on the UCSDped1 dataset, including the Matern kernel with ν = 3/2, the rational quadratic (RQ) kernel, and the RBF kernel with automatic relevance determination (ARD), which are given, respectively, by

k_Matern(x, x') = σ_f² (1 + √3 ||x − x'||₂ / l) · exp(−√3 ||x − x'||₂ / l)

k_RQ(x, x') = σ_f² (1 + ||x − x'||₂² / (2βl²))^(−β)

k_RBF(ARD)(x, x') = σ_f² exp(−(x − x')ᵀ Λ⁻² (x − x') / 2)    (21)

where l, σ_f, β, and Λ = diag(λ₁, λ₂, λ₃, λ₄) are the associated hyper-parameters. We initialize all of the hyper-parameters, denoted by θ, to 1, and then determine the optimal θ* by maximizing the marginal likelihood p(y|V; θ) of the training data (V, y) in GPR on the UCSDped1 dataset. The optimization is based on the conjugate gradient method, which converges after a finite number of iterations. Using the optimized marginal likelihood as a measure of how well GPR fits the training data, we can compare the different kernel functions, as shown in Table III, from which we note that the GPR models using the Matern, RQ, and RBF(ARD) kernels slightly outperform the one based on the isotropic RBF kernel. This is as expected because the isotropic RBF
Fig. 9. The comparison of the GPR model using different preference factors α: (a) UCSDped1. (b) QMUL Junction. (c) Frame-level AUC vs. α curve.
kernel is a degenerate case of the former ones. Also, the isotropic RBF kernel attains close log likelihoods, yet with far fewer iterations (101) than the Matern kernel (250), the RQ kernel (410), and the RBF(ARD) kernel (557). In this paper, we choose the isotropic RBF kernel mainly because 1) together with the GPR model it tends to force outliers to have low likelihood, and 2) it attains similar results with a faster convergence rate than the other kernels.

4) GPR With Different Preference Factors α: To provide more insight into the model's properties, a comparison of the model using different preference factors α is conducted in Fig. 9. Note that the choice of α leverages the local (α = 0) and global (α = 1) anomaly analyses. We can see that the global anomaly detector (α = 1) outperforms the local one (α = 0) for the jaywalking scenario, as shown in Fig. 9(b), while the opposite holds for the bike scenario, as shown in Fig. 9(a). By combining the local and global anomaly detectors with α = 0.5, the resulting detector shares the advantages of both and achieves more robust performance, as evidenced by the quantitative results in Fig. 9(c), where the performances for the Subway, UCSDped1, Behave, and QMUL Junction datasets are boosted from their lowest AUCs (88.9%, 71.4%, 80.7%, 27.3%) to their highest AUCs (92.7%, 83.8%, 93.4%, 81.6%), respectively.

5) Performance Improvement With the GPR Modeling and Consistent Detection: Next, we evaluate the impact of the GPR modeling and the developed consistent detection. Table IV shows the averaged detection rates for each anomaly class on the various datasets. As an illustration, some of the detection results are shown in Fig. 10.
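To make the kernels of Eq. 21 concrete, a NumPy sketch of the three covariance functions and of the log marginal likelihood used for hyper-parameter selection (following Rasmussen and Williams [34]) is given below. Function names and the noise term sigma_n are our own notation, not the paper's implementation.

```python
import numpy as np

# Stationary kernels of Eq. 21 for single input vectors x, xp;
# l, sigma_f, beta, and lam (the diagonal of Lambda) are hyper-parameters.
def k_matern32(x, xp, l, sigma_f):
    r = np.linalg.norm(x - xp)
    return sigma_f**2 * (1 + np.sqrt(3) * r / l) * np.exp(-np.sqrt(3) * r / l)

def k_rq(x, xp, l, sigma_f, beta):
    r2 = np.sum((x - xp)**2)
    return sigma_f**2 * (1 + r2 / (2 * beta * l**2))**(-beta)

def k_rbf_ard(x, xp, lam, sigma_f):
    d = (x - xp) / lam                       # per-dimension length-scales
    return sigma_f**2 * np.exp(-0.5 * np.dot(d, d))

def log_marginal_likelihood(K, y, sigma_n):
    """log p(y | V) for GP regression with i.i.d. Gaussian noise sigma_n^2,
    given the kernel matrix K of the training inputs (R&W [34], Eq. 2.30)."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))     # K_y = K + sigma_n^2 I
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K_y^{-1} y
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
```

The degeneracy noted in the text is visible here: `k_rbf_ard` with all length-scales equal to l reduces to the isotropic RBF kernel, and `k_rq` approaches it as β → ∞.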
While individual STIPs alone are useful for detecting abnormal events in the Subway and Behave datasets, the Sparse Cuboids method with GPR can enhance the performance to at least 89% AUC, as the GPR model is robust to noisy observations owing to the Gaussian noise model described in Eq. 8. For the UCSDped1 dataset, the Sparse Cuboids method relies on multi-scale STIP detection so that local anomalies (e.g., bikers, cars, and skaters) with scale variations can be well detected, with 72.2% AUC. Together with GPR, the performance is improved by an average of 9% AUC. Moreover, those abnormal events, such as walking
TABLE IV. AUC/EER PERFORMANCE OVER THE ANOMALY CLASSES
on grass, that are ignored by the Sparse Cuboids method are likely to be identified by the GPR method, as the latter takes the nearby local observations into account. However, the two-men-talk event in Fig. 10(c) is falsely detected by the GPR method since it exhibits an unusual interaction relative to the walking pedestrian. For the QMUL Junction dataset, since most of the abnormal events involve multiple humans and vehicles, the Sparse Cuboids method, which treats each local observation independently, cannot provide satisfactory results. In contrast, the GPR method considers an ensemble of STIPs all together rather than individual STIP features alone, thus improving upon the Sparse Cuboids method by up to a 70% gap in AUC. Table IV also indicates the effectiveness of the developed consistent detection scheme. For this, Fig. 11 illustrates the detection results of a car by the GPR method with and without the consistent detection scheme. We observe that the GPR method together with the consistent detection can remove the false detections and meanwhile recover the missed detections.

6) Computational Complexity: Since the computational cost is a major concern in anomaly detection, we measure the
Fig. 10. The anomaly likelihood maps for different scenarios (a)–(f): The input videos are shown in the first row, where the ground truth is marked with a black bounding box. The results of the consistent GPR, GPR, and Sparse Cuboids methods are shown in the second, third, and fourth rows, respectively.
Fig. 11. The advantages of consistent anomaly detection on the UCSDped1 dataset: (a) A car appearing on the walkway is considered abnormal in the UCSDped1 dataset. (b) The anomaly likelihood map of the proposed GPR method is comprised of the local likelihoods of STIPs. (c) A cleaner, nearly noise-free likelihood map is produced by the proposed consistent GPR method.
computational time of the proposed GPR method on the four datasets. Our method is implemented in the MATLAB environment on a computer with a Core i7-2600 CPU and 4 GB of RAM. No particular programming technique is used, except for the GPML toolbox [34]. Table V lists the processing time of each stage of the GPR method on the UCSDped1 dataset. The computation time of the Sparse Cuboids method comprises the interest point detection, feature extraction, vector quantization, and k-NN computation. We note that the proposed high-level codebook construction can efficiently process around 33 ensembles per second. In addition, since the learning process of GPR only involves hyper-parameter estimation, it is extremely efficient, with a speed of 0.6 ms per ensemble. The likelihood computation in GPR consumes more time than the learning process, taking 81% of the entire inference time. This is due to the point-wise likelihood estimation for each STIP, which is not necessary in the learning process. Nevertheless, it is still time-affordable, as there are only around 300 qualified ensembles per test video. Table V also compares the GPR method with two closely related models on the
TABLE V. COMPUTATIONAL TIME ON THE UCSDped1 DATASET (ms PER TRAIN/TEST ENSEMBLE)
UCSDped1 dataset: the STC [7] and IBC [6] methods. The Sparse IBC method [6] requires less learning time at the expense of a significant inference time (9 seconds). The Sparse STC [7] is efficient, but its dense version requires about five times the running time of our method. Table VI shows the average frame rates for the four datasets. For the Subway and Behave datasets, a higher
Fig. 12. The ROC performance of anomaly detection and localization: The ROC curves for the (a) Subway, (b) Behave, (c) UCSDped1, and (e) QMUL Junction datasets are based on the frame-level criterion. The ROC curve for the (d) UCSDped1 dataset is based on the pixel-level criterion.

TABLE VI. COMPUTATIONAL TIME (FRAMES PER SECOND)
frame rate can be achieved because both datasets contain a small number of events against simple backgrounds, which in turn reduces the number of STIP features. In contrast, the scenarios in the UCSDped1 and QMUL Junction datasets contain a large number of densely moving objects, so a larger number of STIP features needs to be processed.

C. Comparisons With Previous Works

To scrutinize the effectiveness of the proposed consistent GPR method, we compare it with the main state-of-the-art methods. For clarity, we classify these methods into two categories: methods that treat each local observation independently without considering their relationship, including the Local kNN [4], MDT [3], and OptiFlow Stat [2] methods; and methods that consider the spatio-temporal relationship of the local observations, including the Dense STC [7], Sparse STC [7], Sparse Recon [1], Interaction Energy Potentials [10], Social Force [9], Sparse IBC [6], and the method by Loy et al. [20]. We use the prefixes Dense or Sparse with STC [7] and IBC [6] to emphasize that these models characterize the relationship of either densely-sampled features or the STIP features provided by the Sparse Cuboids method, respectively.
The ROC curves for the four datasets are shown in Fig. 12, and the AUC/EER performance is summarized in Table VII. Since the aforementioned works do not provide source code in the public domain, the results of the Dense STC, Sparse STC, Local kNN, and Sparse IBC methods are from our own implementations, whereas the results of the other methods are obtained directly from their papers.

1) Comparison With Local Observation Methods: In terms of the ROC curves and the AUC/EER performance, some interesting observations can be drawn:
• The consistent GPR has better AUC performance (88.9%) than OptiFlow Stat (58.5%) and MDT (81.8%) on the UCSDped1 dataset, as it is more flexible in using different STIP features. In contrast, the OptiFlow Stat method uses only optical flow, so objects with an unusual appearance cannot be detected.
• The AUC performance of the consistent GPR (88.9%) on the UCSDped1 dataset is not as good as that of Local kNN (92.7%). This is because, as opposed to the dense features used in Local kNN, it is based on sparse features, which greatly reduces the space-time complexity at the cost of around 3% AUC.
• The consistent GPR substantially outperforms all of the other methods on the QMUL Junction dataset. Because anomalies in this dataset involve multiple objects, each of which is normal in terms of its track and appearance, methods like Local kNN (43.6%) and Sparse Cuboids (27.3%) cannot provide satisfactory results. On the contrary, since we further consider the geometric relations of the local observations by using GPR, the unusual interactions among vehicles and pedestrians can be identified, even though the individual objects themselves are normal.
TABLE VII. AUC/EER PERFORMANCE COMPARISON OF ANOMALY DETECTION
2) Comparison With Methods Modeling the Spatio-Temporal Relationship: Next, we compare the consistent GPR method with the previous methods that take the spatio-temporal relationship into account. Based on the results reported in Fig. 12, Table VII, and Fig. 10, we make the following observations:
• For all datasets, the consistent GPR outperforms the Dense STC and Sparse Recon methods, except that the Dense STC has a better frame-level AUC/EER value, but lower pixel-level performance (41.7%) than ours (72.4%), on the UCSDped1 dataset. This is because these methods treat an ensemble as an atomic unit and cannot identify whether each local observation in the ensemble is abnormal. In contrast, GPR can provide the prediction error for each STIP, which is crucial for precise anomaly localization. Moreover, the Dense STC and Sparse Recon were devised to model the relationship of dense features and thus incur a substantially higher computational burden.
• The consistent GPR is generally better than the Sparse STC and Sparse IBC by at least 6% AUC, which demonstrates that GPR is particularly suitable for dealing with sparse features. This also indicates that the STC [7] and IBC [6], which were originally devised to model dense local observations, cannot deal with sparse features well.
• For the QMUL Junction dataset, the consistent GPR achieves the best performance, with 85.4% AUC and 23.8% EER, and in particular outperforms the method by Loy et al. [20] by 8% AUC. Their method relied on image decomposition and then used GPR to model the configuration of the decomposed image regions from the previous and current frames. It is effective in identifying unusual spatial configurations, but cannot handle complex temporal relationships. In contrast, the consistent GPR provides a more versatile solution, since the proposed hierarchical feature representation can deal with both the spatial and the temporal relationships of local observations at any scale.
• The consistent GPR is generally better than the Social Force model [9] by an average of 11% AUC, but is slightly worse (95.5%) than Interaction Energy Potentials [10] (97.9%) on the Behave dataset. These methods fully exploit the nature of human motion and therefore have limited applications. In the proposed consistent GPR, since each STIP can account for either a human or a non-human object, we can cope with more complicated scenarios.
• Neither the consistent GPR nor the other methods work well in the case shown in Fig. 10(f), where a white van at the bottom stopped at the traffic light but passed the stopping line. As we assume that abnormal events involve movement, and the white van is stationary, no STIPs are detected there. To overcome this problem, both 2D and 3D interest point detectors could be combined in our GPR model to detect more potential anomalies.

VIII. CONCLUSION
This paper presents a hierarchical framework for local and global anomaly detection. We rely on a bottom-up greedy algorithm and GPR to cluster, learn, and infer the semantic (appearance) and structural (position) relationships of the nearby STIPs. A consistent detection scheme is also presented. The new method achieves at least an 85% detection rate on the four challenging datasets and provides competitive performance against the previous works while maintaining much lower space-time complexity, as only the sparse STIPs are dealt with.

ACKNOWLEDGMENT

The authors would like to express their gratitude to the anonymous reviewers for carefully reviewing the manuscript and for many thoughtful comments, which have enhanced the readability and quality of this manuscript.

REFERENCES

[1] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3449–3456.
[2] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, "Robust real-time unusual event detection using multiple fixed-location monitors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 555–560, Mar. 2008.
[3] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1975–1981.
[4] V. Saligrama and Z. Chen, "Video anomaly detection based on local statistical aggregates," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2112–2119.
[5] V. Kaltsa, A. Briassouli, I. Kompatsiaris, L. J. Hadjileontiadis, and M. G. Strintzis, "Swarm intelligence for detecting interesting events in crowded environments," IEEE Trans. Image Process., vol. 24, no. 7, pp. 2153–2166, Jul. 2015.
[6] O. Boiman and M. Irani, "Detecting irregularities in images and in video," Int. J. Comput. Vis., vol. 74, pp. 17–31, Aug. 2007.
[7] M. J. Roshtkhari and M. D. Levine, "Online dominant and anomalous behavior detection in videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2611–2618.
[8] I. Pruteanu-Malinici and L. Carin, "Infinite hidden Markov models for unusual-event detection in video," IEEE Trans. Image Process., vol. 17, no. 5, pp. 811–822, Mar. 2008.
[9] R. Mehran, A. Oyama, and M. Shah, "Abnormal crowd behavior detection using social force model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 935–942.
[10] X. Cui, Q. Liu, M. Gao, and D. N. Metaxas, "Abnormal detection using interaction energy potentials," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3161–3167.
[11] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang, "Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 2909–2917.
[12] O. P. Popoola and K. Wang, "Video-based abnormal human behavior recognition—A review," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 6, pp. 865–878, Nov. 2012.
[13] Y.-T. Chen, Y.-C. Lin, and W.-H. Fang, "A video-based human fall detection system for smart homes," J. Chin. Inst. Eng., vol. 33, no. 5, pp. 681–690, 2010.
[14] V. M. Kettnaker, "Time-dependent HMMs for visual intrusion detection," in Proc. Conf. Comput. Vis. Pattern Recognit. Workshop, Jun. 2003, p. 34.
[15] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Comput., vol. 13, no. 7, pp. 1443–1471, Jul. 2001.
[16] K.-W. Cheng, Y.-T. Chen, and W.-H. Fang, "Abnormal crowd behavior detection and localization using maximum sub-sequence search," in Proc. 4th ACM/IEEE Int. Workshop Anal. Retr. Tracked Event Motion Imag. Stream, Oct. 2013, pp. 49–58.
[17] A. B. Chan and N. Vasconcelos, "Modeling, clustering, and segmenting video with mixtures of dynamic textures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 909–926, May 2008.
[18] K. Kim, D. Lee, and I. Essa, "Gaussian process regression flow for analysis of motion trajectories," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1164–1171.
[19] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 283–298, Feb. 2008.
[20] C. C. Loy, T. Xiang, and S. Gong, "Modelling multi-object activity by Gaussian processes," in Proc. Brit. Mach. Vis. Conf., 2009, pp. 1–11.
[21] Y. Benezeth, P.-M. Jodoin, and V. Saligrama, "Abnormality detection using low-level co-occurring events," Pattern Recognit. Lett., vol. 32, pp. 423–431, Feb. 2011.
[22] W. Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 18–32, Jan. 2014.
[23] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Beyond sliding windows: Object localization by efficient subwindow search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[24] J. Yuan, Z. Liu, and Y. Wu, "Discriminative subvolume search for efficient action detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2442–2449.
[25] D. Tran, J. Yang, and D. Forsyth, "Video event detection: From subvolume localization to spatiotemporal path search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 404–416, Feb. 2014.
[26] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proc. 2nd Joint IEEE Int. Workshop Vis. Surveill. Perform. Eval. Tracking Surveill., Oct. 2005, pp. 65–72.
[27] P. Scovanner, S. Ali, and M. Shah, "A 3-dimensional SIFT descriptor and its application to action recognition," in Proc. ACM Multimedia, Sep. 2007, pp. 357–360.
[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886–893.
[29] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in Proc. 9th Eur. Conf. Comput. Vis., May 2006, pp. 428–441.
[30] B. S. Everitt, S. Landau, M. Leese, and D. Stahl, "Miscellaneous clustering methods," in Cluster Analysis, 5th ed. New York, NY, USA: Wiley, 2011.
[31] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer-Verlag, 2009.
[32] C. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[33] E. Tapia, "A note on the computation of high-dimensional integral images," Pattern Recognit. Lett., vol. 32, pp. 197–201, Jan. 2011.
[34] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[35] J. J. Sylvester, "On the relation between the minor determinants of linearly equivalent quadratic functions," Philos. Mag., vol. 1, no. 4, pp. 295–305, 1851.
[36] M. A. Woodbury, Inverting Modified Matrices. Princeton, NJ, USA: Princeton Univ., 1950.
[37] S. Blunsden and R. B. Fisher, "The BEHAVE video dataset: Ground truthed video for multi-person behavior classification," Ann. BMVA, vol. 4, nos. 1–12, May 2010.
[38] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 FPS in MATLAB," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2720–2727.
Kai-Wen Cheng received the B.S. degree in electrical engineering from the National Taiwan University of Science and Technology, in 2011, where he is currently pursuing the M.S. and Ph.D. degrees with the Multimedia Network Laboratory in Electronic and Computer Engineering. His research interests include computer vision and machine learning.
Yie-Tarng Chen received the B.S. degree from National Taiwan University, in 1984, the M.S. degree from Northwestern University, in 1989, and the Ph.D. degree from Purdue University, in 1993, all in electrical engineering. He is currently an Associate Professor with the Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology. His research interests include computer vision, multimedia systems, and machine learning.
Wen-Hsien Fang received the B.S. degree in electrical engineering from National Taiwan University, in 1983, and the M.S.E. and Ph.D. degrees from the University of Michigan, Ann Arbor, in 1988 and 1991, respectively, in electrical engineering and computer science. In Fall 1991, he joined the National Taiwan University of Science and Technology as a Faculty Member, where he is currently a Professor with the Department of Electronic and Computer Engineering. His research interests span various facets of signal processing applications, including statistical signal processing, signal processing for wireless communications, and multimedia signal processing.