arXiv:1609.00866v1 [cs.CV] 3 Sep 2016

Deep-Anomaly: Fully Convolutional Neural Network for Fast Anomaly Detection in Crowded Scenes

Mohammad Sabokrou1+, Mohsen Fayyaz1+, Mahmood Fathy2, Reinhard Klette3

1 Malek-Ashtar University of Technology
2 Iran University of Science and Technology
3 Auckland University of Technology

Abstract. We present an efficient method for detecting and localizing anomalies in videos of crowded scenes. Research on fully convolutional neural networks (FCNs) has shown the potential of this technology for object detection and localization, especially in images. We investigate how to involve temporal data, and how to transform a supervised FCN into an unsupervised one such that the resulting FCN supports anomaly detection. Altogether, we propose an FCN-based architecture for anomaly detection and localization in videos of crowded scenes. To reduce computations and, consequently, to improve performance with respect to both speed and accuracy, we investigate the use of cascaded outlier detection. Our architecture has two main components, one for feature representation and one for cascaded outlier detection. Experimental results on the Subway and UCSD benchmarks confirm that the detection and localization accuracy of our method is comparable to state-of-the-art methods, but at a significantly increased speed of 370 fps.

Keywords: Video anomaly detection, CNN, transfer learning, real-time processing

1 Introduction

Anomaly detection and localization is an important task in video analysis. The term "anomaly" has a wide range of definitions; in general, an event is called an anomaly when it occurs rarely. In crowded-scene anomaly detection problems, anomalies have either a rare shape or a rare motion. Looking for an unknown shape or motion in a video is difficult and time-consuming. State-of-the-art approaches learn regions or patches of normal videos as a reference model. These reference models cover the normal events or shapes of every region of the training data. In the testing phase, regions that differ from the normal model are considered to be an anomaly.

+ Equal contribution.


There are various types of normal regions in video data. Developing a feature set which efficiently describes these regions is difficult and requires an enormously large set of training samples. Often, trajectory properties have been used for representing the normal behavior of objects. Recent work typically prefers low-level features, such as the histogram of gradients (HoG) or the histogram of optical flow (HoF), for modeling spatio-temporal properties of video data. Trajectory-based methods cannot handle occlusion problems and are of high complexity for crowded scenes.

CNNs are widely used in computer vision and have achieved state-of-the-art results, for example for image classification [23], object detection [16], or activity recognition [45]. It has also been argued that hand-crafted features cannot efficiently represent normal video [41,56,42]. However, CNNs are very slow, especially when used in a block-wise manner [16,14]. Thus, dividing a video into a set of patches and representing each patch by a CNN requires further thought about possible speed-ups. The main difficulties in detecting anomalies using a CNN are: (1) it is very slow for patch-based methods, (2) training a CNN is completely supervised, but real-world anomaly detection cannot be based on large sets of samples from anomaly classes, which are usually unseen in the training phase, and (3) training even a small CNN is already a time-consuming procedure.

Recently, researchers have proposed methods in other areas for overcoming these difficulties. Object detection methods such as Faster R-CNN [38] use convolutional layers to obtain a feature map of every region of the input data. Semantic segmentation methods such as [27] use fully convolutional networks to turn traditional CNNs into regional feature extractors. Making a traditional classification CNN work as a fully convolutional network and regional feature extractor reduces the computational cost. As mentioned, CNNs cannot be directly embedded in anomaly detection tasks, nor can FCNs. To overcome these difficulties, we propose a new FCN-based structure for extracting discriminative features of video regions. Our FCN is a combination of several initial convolutional layers of a pre-trained CNN and an additional new convolutional layer. The pre-trained CNN considered by us follows the AlexNet model [61];1 similar to the CNN used in [61], it is pre-trained for image classification using ImageNet [18] and the MIT Places dataset [31]. We found that the features extracted by this CNN at its convolutional layers are sufficiently discriminative for anomaly detection in video data.

Patch-based methods typically convert video frames first into a set of patches, and then analyze these extracted patches one by one for feature extraction and modeling. In contrast to this approach, we feed complete frames to our FCN, and feature extraction for all regions of the input frames is done efficiently in a single pass. By analyzing the output of the FCN, we detect and localize anomalies. Convolution and pooling operations in each CNN layer can run completely concurrently. For analyzing frames of size 320 × 240 on a standard NVIDIA TITAN GPU, our method runs very fast, at ≈ 370 fps.

1 A version of the AlexNet model is available at https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.
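To make the idea of reusing a pre-trained classification network as a regional feature extractor concrete, here is a minimal sketch. It assumes PyTorch/torchvision rather than the Caffe reference model the paper uses, and torchvision's AlexNet has different channel counts, so this is an illustrative stand-in, not the authors' exact network.

```python
# Minimal sketch: truncate a pre-trained AlexNet after its second
# convolutional layer to obtain a fully convolutional regional feature
# extractor (PyTorch/torchvision stand-in; the paper used a Caffe model).
import torch
import torchvision.models as models

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# alexnet.features[:5] = conv1 -> ReLU -> pool1 -> conv2 -> ReLU.
# Dropping the fully connected head makes the network fully convolutional:
# it accepts frames of any size and emits one descriptor per receptive field.
fcn = torch.nn.Sequential(*list(alexnet.features.children())[:5]).eval()
for p in fcn.parameters():
    p.requires_grad = False      # the pre-trained part stays fixed

with torch.no_grad():
    frames = torch.randn(1, 3, 240, 320)   # one input sequence D_t
    feats = fcn(frames)                    # shape (1, 192, h_k, w_k)
print(feats.shape)                         # one feature vector per region
```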


Convolution and pooling operations in a CNN are patch-based procedures that extract regions from the input image data with a specific stride and size. These operations provide a description for every extracted region. Every feature in the output of a convolutional layer, including its descriptor, characterizes a potential region in a set of video frames. These operations are only approximately invertible; however, the receptive field (i.e. the region in a frame) which causes the generation of a given feature vector can be obtained by a roll-back operation from deeper layers to shallower layers of the network. Consequently, we propose a method for detecting and localizing anomalous regions in a frame based on analyzing the output of deep layers of the FCN. Our idea for localizing a receptive field is inspired by Faster R-CNN [38] and OverFeat [44]. We use the convenient patch-based structure of a CNN for extracting and representing all patches of a given set of frames concurrently.

A feature vector generated by the CNN for a detected region is fitted to the original image classification task. Similar to [34], we use transfer learning to obtain a better description of each region. We evaluate the intermediate convolutional layers of the CNN to find the best one; a new convolutional layer is then added after this best-performing layer. The kernels of the pre-trained CNN are adjusted by pre-training and kept constant in our FCN; the parameters of the final layer, which we add, are trained on our training frames. In other words, all regions generated by the pre-trained CNN are represented by a sparse auto-encoder as a feature vector of size h (h is the hidden size of the auto-encoder). We find that the feature set generated by the pre-trained CNN is sufficiently discriminative for modeling most regions. To be faster, only those regions which are not confidently classified are given to the final convolutional layer for re-representation. In the testing phase, regions which differ significantly from the first Gaussian model are labeled as a confident anomaly. Regions which fit this model completely are labeled as normal. Regions with a minor difference (i.e. suspicious ones) are re-represented by the sparse auto-encoder and evaluated more carefully by a second Gaussian model. Our approach is thus similar to a two-stage cascade classifier.

The main contributions of this paper are as follows:

– To the best of our knowledge, this is the first time that an FCN is used for anomaly detection.
– We adapt a pre-trained classification CNN to an FCN for generating regional features that describe motion and shape concurrently.
– We propose a new FCN architecture for time-efficient anomaly detection and localization.
– The proposed method performs as well as state-of-the-art methods, and it can also run in real time for typical applications.
– We achieve a processing speed of 370 fps on a standard GPU; this is about three times faster than the fastest state-of-the-art method reported elsewhere.


Section 2 provides a brief survey of related work. Section 3 presents the proposed method, including the overall scheme, details of anomaly detection and localization, and an evaluation of different CNN layers for performance optimization. Qualitative and quantitative experiments are reported in Section 4. Section 5 concludes the paper.

2 Related Work

Early work in the subject area focused on modeling object trajectories; see [19,53,36,3,8,32,17,46,35]. An object is labeled as an anomaly if it does not follow learned normal trajectories. The main weaknesses of these methods are (1) that they are not robust with respect to occlusions, and (2) that they are very complex for crowded scenes. For avoiding these weaknesses, researchers proposed methods using spatial-temporal low-level features, such as optical flow or gradients. Zhang et al. [60] model the normal pattern of a video with a Markov random field (MRF) with respect to a number of features, such as rarity, unexpectedness, or relevance. Boiman and Irani [6] consider an event to be an anomaly if its reconstruction is impossible using previous observations only. Adam et al. [1] use an exponential distribution for modeling the histograms of optical flow in local regions. Mahadevan et al. [29] use a mixture of dynamic textures (MDT) for representing the video and fit a Gaussian mixture model to the features. In [26], the MDT of [29] is extended and explained in more detail. Kim and Grauman [22] exploit a mixture of probabilistic PCA (MPPCA) models for representing local optical flow patterns; they also use an MRF for learning the normal patterns. A method based on the motion properties of pixels for behavior modeling is proposed by Benezeth et al. [4], who describe the video by learning a co-occurrence matrix for normal events across space-time. In [21], a Gaussian model is fitted to spatio-temporal gradient features, and a hidden Markov model (HMM) is exploited for detecting abnormal events. The social force (SF) model is introduced by Mehran et al. [30] as an efficient technique for abnormal motion modeling in crowds. In [59], a method based on spatial-temporal oriented energy filtering is proposed. Cong et al. [11] construct an over-complete normal basis set from normal data; if reconstructing a patch with this basis set is not possible, the patch is considered to be an anomaly. In [2], a scene-parsing approach is presented: all object hypotheses for the foreground of a frame are explained by normal training, and those hypotheses which cannot be explained this way are considered to be anomalies. The authors of [43] propose a method based on clustering the test data using optical flow features. [48] introduces an approach based on a cut/max-flow algorithm for segmenting the crowd motion; if a flow does not follow the regular motion model, it is considered to be an anomaly. Lu et al. [28] propose a very fast (140-150 fps) anomaly detection method based on sparse representation. In [40], an extension of the bag of video words (BOV) approach is used. In [62], a context-aware anomaly detection


algorithm is proposed, where the authors represent the video using its motions and context. In [12], a method for modeling both motion and shape with respect to a descriptor (named "motion context") is proposed; anomaly detection is treated as a matching problem. Roshtkhari et al. [39] introduce a method for learning the events of a video by constructing a hierarchical codebook for the dominant events. Ullah et al. [49] learn an MLP neural network using trained particles to extract the video behavior; a Gaussian mixture model (GMM) is exploited for learning the behavior of the particles using the extracted features. Also, in [50], an MLP neural network for extracting corner features from normal training samples is proposed; the authors also label the test samples using this MLP. The authors of [51] extract corner features and analyze them based on their motion properties using an enthalpy model; a random forest on the corner features is exploited for detecting anomalous samples. Xu et al. [55] propose a unified anomaly energy function based on hierarchical activity-pattern discovery for detecting anomalies. The work reported in [41,56] models normal events by a set of representative features which are learned with auto-encoders [52]; these authors use a one-class classifier for detecting anomalies as outliers with respect to the target (i.e. normal) class. In [33], the histogram of oriented tracklets (HOT) is used for video representation and anomaly detection; the authors also introduce a new strategy for improving HOT. Yuan et al. [58] introduce an informative structural context descriptor (SCD) to represent individuals in a crowd; the (spatial-temporal) SCD variation of a crowd is analyzed to localize anomalous regions. A hierarchical framework for local and global anomaly detection is proposed in [9]; normal interactions are extracted by finding frequent geometric relationships between sparse interest points, and the normal interaction template is modeled by Gaussian process regression. Xiao et al. [54] exploit sparse semi-nonnegative matrix factorization (SSMF) for learning local patterns of pixels; their method learns a probability model from the local patterns of pixels, considering both the spatial and the temporal context, is totally unsupervised, and detects anomalies with the learned model. In [24], an efficient method for representing human activities in video data with respect to motion characteristics, named the motion influence map, is introduced; the authors label those blocks of a frame which have a low occurrence as anomalies. [25] proposes an unsupervised framework for detecting anomalies based on learning global activity patterns and local salient behavior patterns via clustering and sparse coding.

3 Proposed Method

First we provide a general outline of the method before going into details for anomaly detection and localization.


3.1 Overall Scheme

Anomalous events in video data usually refer to irregular shapes or motions, or possibly a combination of both. Thus, identifying shape and motion is an essential task for anomaly detection and localization. A single frame does not include motion properties of events; it only provides shape information. For analyzing both shape and motion, we consider the average of the t-th frame I_t and the previous frame I_{t-1}, denoted by I'_t (not to be confused with a derivative). We consider the sequence D_t = ⟨I'_{t-4}, I'_{t-2}, I'_t⟩ for detecting anomalies in the t-th frame of the video with respect to motion or shape.

For representing video frames of size w × h, we start with sequences D_t defined on grids Ω_0 of size w_0 × h_0, with w_0 = w and h_0 = h. Sequences D_t are given to an FCN, defined by intermediate convolutional layers k = 0, 1, ..., L, each defined on a grid Ω_k of size w_k × h_k, with w_k > w_{k+1} and h_k > h_{k+1}. (We use L = 3 convolutional layers.) The outputs of the k-th intermediate convolutional layer of the FCN are feature vectors f_k ∈ R^{m_k} (i.e. m_k maps of real feature values), satisfying m_k ≤ m_{k+1}, starting with m_0 = 1. For input sequence D_t, the output of the k-th convolutional layer is a matrix of vector values

\[
\left\{ f_k^t(i,j,1{:}m_k) \right\}_{(i,j)=(1,1)}^{(w_k,h_k)}
= \left\{ \left( f_k^t(i,j,1), \ldots, f_k^t(i,j,m_k) \right)^\top \right\}_{(i,j)=(1,1)}^{(w_k,h_k)}
\tag{1}
\]
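As a concrete reading of this input construction, the following sketch builds D_t from grayscale frames; the array shapes and the toy video are placeholders for illustration.

```python
# Sketch: build the temporal input D_t = <I'_{t-4}, I'_{t-2}, I'_t>,
# where I'_i is the average of frames i and i-1.
import numpy as np

def make_input(frames: np.ndarray, t: int) -> np.ndarray:
    """frames: (T, h, w) array of grayscale frames; returns (3, h, w)."""
    def avg(i: int) -> np.ndarray:      # I'_i = (I_i + I_{i-1}) / 2
        return 0.5 * (frames[i] + frames[i - 1])
    return np.stack([avg(t - 4), avg(t - 2), avg(t)], axis=0)

video = np.random.rand(20, 240, 320).astype(np.float32)   # toy video
d_t = make_input(video, t=10)   # six raw frames enter one 3-channel input
```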

Every feature vector f_k^t(i,j,1:m_k) is derived from a specific receptive field (i.e. a sub-region of input D_t). In other words, at first a high-level description D_t is provided for the t-th frame of the video; then D_t is subsequently represented by the k-th intermediate convolutional layer of the FCN. This representation identifies in Ω_k a family of partially pairwise overlapping regions, the receptive fields. This way we represent frame I_t at first by sequence D_t on Ω_0, and then by m_k maps

\[
f_{k,l} = \left\{ f_k^t(i,j,l) \right\}_{(i,j)=(1,1)}^{(w_k,h_k)}, \quad l = 1, 2, \ldots, m_k
\tag{2}
\]

on Ω_k, where the size w_k × h_k decreases with increasing k, while the number m_k of maps increases with increasing k.

Suppose we have q training frames from a video, all of which are normal. When representing all of them with respect to the k-th convolutional layer of the FCN, we obtain w_k × h_k × q vectors of length m_k, defining our normal region descriptions. They are generated automatically by the pre-trained FCN. For modeling normal behavior, a Gaussian distribution is fitted as a one-class classifier to these descriptions of normal regions. This defines our normal reference model. In the testing phase, a test frame I_t is described in the same way by a set of regional features. Those regions which differ from the normal reference model are labeled as an anomaly. This procedure provides time-efficient anomaly detection.

Typically, the features generated by the pre-trained CNN are sufficiently discriminative. However, these features were learned on a set of independent images which are not specifically related to video surveillance applications.


Fig. 1. The influence of representing the receptive fields with the added convolutional layer as the (k+1)-th convolutional layer (here k = 2). Left: input frame. Middle: the heat-map with respect to the 2nd layer of the pre-trained FCN. Right: the heat-map generated from the representation of the 2nd layer by the added convolutional layer.

Consequently, suspicious regions are represented by a more discriminative feature set. This new representation leads to a better performance in distinguishing anomalous regions from normal ones. In other words, we transfer the features generated by the CNN to the anomaly detection problem. This is done by an auto-encoder which is trained on all normal regions. As a result, those f_k^t(i,j,1:m_k) regions which are suspicious are given to the auto-encoder for a better representation. This is implemented by the (k+1)-st convolutional layer, whose kernels are learned by a sparse auto-encoder. Let T_k^t(i,j,1:h) be the transformed representation of f_k^t(i,j,1:m_k) by the sparse auto-encoder; see Fig. 1. The anomalous region is more distinguishable on the heat-map when the regional descriptors are represented again by the auto-encoder (i.e. the final convolutional layer). Again, for the new feature space, those regions which differ from the normal reference model are labeled as an anomaly. The proposed approach ensures both accuracy and speed; see Fig. 2.

Suppose that f_k^t(i,j,1:m_k) ∈ R^{m_k} is the description of an anomalous region. By moving inversely from the k-th to the 1st layer of the FCN, we can identify the region of the input frames for which f_k^t(i,j,1:m_k) is the description. This is possible because the convolution as well as the mean-pooling operators of the FCN (from the 1st to the 2nd layer) are approximately invertible. For example, let the 1st and 2nd convolutional layers and the 1st sub-sampling layer be named C_1, C_2, and S_1, respectively, and let (·)^{-1} denote the inverse of a function. The location of description f_k^t(i,j,1:m_k) in the D_t sequence (the input of the FCN) is C_1^{-1}(S_1^{-1}(C_2^{-1}(f_k^t(i,j,1:m_k)))). For more details, see Sections 3.2 and 3.3.

Figure 2 shows the work-flow of our detection method. First, input frames are given to the pre-trained FCN. Then h_k × w_k regional feature vectors are generated at the output of the k-th layer. All feature vectors are verified using the Gaussian classifier G_1, which is fitted to all normal extracted regional feature vectors and acts as the normal reference model. Those regions which differ significantly from G_1 are labeled as an anomaly. Regions which are fitted with low confidence (i.e. suspicious ones) are given to the sparse auto-encoder. These regions are then labeled again based on the Gaussian classifier G_2, which works similarly to G_1.
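A minimal sketch of this auto-encoder stage follows, assuming PyTorch; the descriptor size, hidden size h, sparsity weight, training schedule, and stand-in data are placeholders rather than the paper's settings. Treating the learned encoder as a 1×1 convolution matches the per-descriptor re-representation described above.

```python
# Sketch: learn the kernels of the added convolutional layer with a sparse
# auto-encoder trained on normal regional descriptors (PyTorch stand-in).
import torch
import torch.nn as nn

m_k, h = 256, 500                            # descriptor size, hidden size
enc, dec = nn.Linear(m_k, h), nn.Linear(h, m_k)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

normal_feats = torch.randn(10000, m_k)       # stand-in for FCN descriptors
for _ in range(50):                          # a short training loop
    code = torch.sigmoid(enc(normal_feats))
    recon = dec(code)
    # reconstruction error plus an L1 sparsity penalty on the code
    loss = ((recon - normal_feats) ** 2).mean() + 1e-3 * code.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The trained encoder acts as a 1x1 convolution on the feature maps: each
# descriptor f_k^t(i,j,1:m_k) maps to T_k^t(i,j,1:h).
conv_t = nn.Conv2d(m_k, h, kernel_size=1)
conv_t.weight.data = enc.weight.data.view(h, m_k, 1, 1).clone()
conv_t.bias.data = enc.bias.data.clone()
```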


Fig. 2. Scheme of the proposed method. (1) Input video frame of size w_0 × h_0. (2, 3) Descriptions of regions of size h_k × w_k generated by the k-th layer of the FCN. (4) Transformed feature domain using a sparse auto-encoder (for enhancing the features). (5) Joint anomaly detector. (6) Positions of those descriptions which identify anomalies.

G_2 is trained on all regional feature vectors extracted from the training video data and re-represented by the auto-encoder. Finally, the locations of those regions identified as anomalous can be annotated by a roll-back on the FCN.

3.2 Anomaly Detection

We propose to represent the video by a set of densely extracted regional features, whose descriptions are the feature vectors at the output of the k-th convolutional layer; see Eq. (1). The Gaussian classifier G_1(·) is fitted to all normal regional features generated by the FCN. Those regional features whose distance to G_1(·) is larger than a threshold α are considered to be an anomaly. Those which fit G_1 completely (i.e. their distance is less than a threshold β) are labeled as normal. A region is suspicious if it has a minor distance to G_1 (i.e. between β and α). All suspicious regions are given to the next convolutional layer, which is trained on all normal regions generated by the pre-trained FCN. The new representation of these regions is much more discriminative and is denoted by

\[
T_{k,n} = \left\{ T_k^t(i,j,n) \right\}_{(i,j)=(1,1)}^{(w'_k,h'_k)}, \quad n = 1, 2, \ldots, h
\tag{3}
\]

where h is the size of the feature vectors generated by the auto-encoder, i.e. the size of its hidden layer. As we process in this step only the suspicious regions instead of all regions, the grid (w'_k, h'_k) omits many of the points (i, j) of the grid (w_k, h_k). Similar to G_1, we create a Gaussian classifier G_2 on all normal training regional features as represented by our auto-encoder. Those regions which do not fit G_2 sufficiently are considered to be an anomaly.


Equations (4) and (5) summarize the anomaly detection by the two fitted Gaussian classifiers:

\[
G_1(f_k^t(i,j,1{:}m_k)) =
\begin{cases}
\text{Anomaly} & \text{if } d(G_1, f_k^t(i,j,1{:}m_k)) \ge \alpha \\
\text{Normal} & \text{if } d(G_1, f_k^t(i,j,1{:}m_k)) \le \beta \\
\text{Suspicious} & \text{if } \beta < d(G_1, f_k^t(i,j,1{:}m_k)) < \alpha
\end{cases}
\tag{4}
\]

and, for a suspicious region represented by T_k^t(i,j,1:h),

\[
G_2(T_k^t(i,j,1{:}h)) =
\begin{cases}
\text{Anomaly} & \text{if } d(G_2, T_k^t(i,j,1{:}h)) \ge \varphi \\
\text{Normal} & \text{otherwise}
\end{cases}
\tag{5}
\]

Here, d(G, x) is the Mahalanobis distance of regional feature vector x from the model G.
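The two-stage decision can be read as the following runnable sketch; the Gaussians are fitted to stand-in descriptors, and the thresholds α, β, φ and the toy mapping standing in for the added layer are placeholders, not values from the paper.

```python
# Runnable sketch of the cascade in Eqs. (4)-(5) with Mahalanobis distances.
import numpy as np

class Gaussian:
    def __init__(self, feats: np.ndarray):       # feats: (n, d) normal data
        self.mu = feats.mean(axis=0)
        self.inv_cov = np.linalg.inv(np.cov(feats, rowvar=False))

    def dist(self, x: np.ndarray) -> float:      # Mahalanobis distance d(G, x)
        diff = x - self.mu
        return float(np.sqrt(diff @ self.inv_cov @ diff))

def classify(f, g1, g2, transform, alpha, beta, phi):
    d1 = g1.dist(f)
    if d1 >= alpha:
        return "anomaly"        # confident anomaly, Eq. (4)
    if d1 <= beta:
        return "normal"         # confidently normal, no further processing
    t = transform(f)            # suspicious: re-represent (stand-in for C_T)
    return "anomaly" if g2.dist(t) >= phi else "normal"   # Eq. (5)

rng = np.random.default_rng(0)
normal = rng.normal(size=(5000, 8))          # stand-in regional descriptors
W = rng.normal(size=(8, 4))                  # placeholder for the C_T mapping
g1, g2 = Gaussian(normal), Gaussian(normal @ W)
print(classify(rng.normal(size=8), g1, g2, lambda f: f @ W,
               alpha=5.0, beta=2.0, phi=4.0))
```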

3.3 Localization

The first convolutional layer has m_1 kernels of size x_1 × y_1. They are convolved with sequence D_t for processing the t-th frame. As a response to each kernel, a feature is extracted (i.e. each region of the FCN input is described by a feature vector of length m_1). Continuing, we have m_k maps at the output of the k-th layer. Consequently, a point in the output of the k-th layer is a description of a subset of overlapping receptive fields in the input of the FCN. The order of layers in the modified version of AlexNet is [C_1, S_1, C_2, S_2, C_3, fc_1, fc_2], where C and S denote a convolutional and a sub-sampling layer, respectively, and the two final layers are fully connected. Assume that n regional feature vectors at positions (i_1, j_1), ..., (i_n, j_n), generated in layer C_k on grid Ω_k, are detected as anomalous. A location (i, j) in Ω_k corresponds by

\[
C_1^{-1}(\cdots S_{k-1}^{-1}(C_k^{-1}(i,j)))
\tag{6}
\]

to a location (i.e. a rectangular region) in the original frame. Suppose we have m_k kernels of size x_k × y_k which are convolved with stride d on the output of the layer preceding C_k. Then C_k^{-1}(i,j) is the (rectangular) set of all locations in Ω_{k-1} which are mapped by the FCN onto (i,j) in Ω_k. The function S_k^{-1} is defined analogously. The sub-sampling (mean-pooling) layer can be considered as a convolutional layer with a single kernel. Any region detected as anomalous in the original frame (i.e. in Ω_0) is in this way a union of several large, overlapping patches. This alone leads to poor localization performance. For example, a detection in the 2nd layer corresponds to a 51 × 51 receptive field, overlapping those of neighbouring detections. For more accurate detections, we only identify those pixels in Ω_0 as anomalous which are covered by more than ζ related receptive fields (we set ζ = 3 experimentally).
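The roll-back of Eq. (6) reduces to standard receptive-field arithmetic. The sketch below assumes the AlexNet-style layer geometry (conv1 11×11 stride 4 pad 2, pool1 3×3 stride 2, conv2 5×5 pad 2) and reproduces the 51-pixel field quoted above; clipping to the frame border is left out for brevity.

```python
# Sketch: map a coordinate in the output grid Omega_k back to its
# receptive field in the input frame (1-D version; apply per axis).
def roll_back(i: int, layers) -> tuple:
    """layers: list of (kernel, stride, padding) from shallow to deep.
    Returns the [start, end) input interval feeding output index i."""
    start, end = i, i + 1
    for kernel, stride, padding in reversed(layers):
        start = start * stride - padding               # C^{-1} / S^{-1} of span
        end = (end - 1) * stride - padding + kernel
    return max(start, 0), end

layers = [(11, 4, 2), (3, 2, 0), (5, 1, 2)]            # conv1, pool1, conv2
lo, hi = roll_back(10, layers)
print(lo, hi, hi - lo)      # a 51-pixel-wide receptive field, as in Table 1
```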

3.4 FCN Structure for Anomaly Detection

In this section we analyze the suitability of the different layers of the pre-trained CNN for generating regional feature vectors. As mentioned, we adapt a classification CNN into an FCN by keeping only convolutional layers. Selecting the best layer for representing the video matters in two respects: (1) Deeper features are usually more discriminative, but also more time-consuming to compute; moreover, since the CNN is trained for image classification, going deeper may yield features over-fitted to image classification. (2) Going deeper leads to larger receptive fields on the input data and, consequently, to an increased likelihood of inaccurate localization, i.e. a reduced detection performance.

We use a modified version of AlexNet, named the Caffe reference model, for the first two convolutional layers of our FCN. It is trained on 1,183 categories: 205 scene categories from the MIT Places database [31], and 978 object categories from the training data of ILSVRC2012 (ImageNet) [18], with 3.6 million images in total. The implemented FCN has three convolutional layers. For finding the best k (i.e. the best convolutional layer) we initially set k = 1 and increase it gradually to 3. Once k is decided, deeper layers are ignored. We discuss our general findings at an abstract level. First we use the output of layer C_1. The corresponding receptive fields are small, and the generated features are too weak for distinguishing abnormal from normal regions; here we obtain many false positives. Going deeper and using the output of C_2, we achieve a much better performance than with C_1, probably for two reasons: (1) the corresponding receptive field in the input frames is now sufficiently large, and (2) the deeper features are more discriminative. With k = 3, i.e. using layer C_3 as output, the capacity of the network increases, but the results are not as good as for the 2nd convolutional layer. It appears that by adding one more layer we obtain deeper features, but these features are also likely over-fitted to the image classification task for which the network was trained on ImageNet. Consequently, we decided on the C_2 output for extracting regional features.

Similar to [34], we transform the description of each generated regional feature using an additional convolutional layer; the kernels of this layer are learned by a sparse auto-encoder. We call this new layer, placed on top of the C_2 layer of the CNN, C_T. The combination of the three initial layers of the pre-trained CNN (i.e. C_1, S_1, and C_2) with this additional (new) convolutional layer is our architecture for detecting anomalies; see Figure 3.

Table 1 shows the performance of the different layers of the pre-trained CNN. Table 2 reports the performance of the proposed architecture with different numbers of kernels in the (k+1)-th convolutional layer. We represent the video frames with our FCN, and a single Gaussian classifier is used at the final stage (see the performance for 100, 256, and 500 kernels in Table 2). We also evaluate the use of two Gaussian classifiers, as described before, in a cascade-like manner (see the last column of Table 2). The frame-level and pixel-level EER measures are introduced in the next section; here we just note that smaller values are better.


Fig. 3. Proposed FCN structure for detecting anomalies. This FCN is only used for regional feature extraction; two Gaussian classifiers are subsequently used for labeling anomalous regions, as discussed before.

Table 3 reports the performance when processing the network output at C_2 directly, and at C_T with the cascaded classifiers.

Table 1. Evaluating CNN convolutional layers for anomaly detection

Layer            Output in C1   Output in C2   Output in C3
Proposal size    11 × 11        51 × 51        67 × 67
Frame-level EER  40%            13%            20%
Pixel-level EER  47%            19%            25%

Table 2. Effect of the number of kernels of the (k+1)-th convolutional layer used for representing regional features, when using C2 for outputs

Number of kernels   100   256   500   500 + two classifiers
Frame-level EER     19%   17%   15%   11%

These evaluations confirm that, for the studied data, the proposed CNN architecture performs best.
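Putting the pieces of Fig. 3 together, a minimal sketch of the final architecture follows. The channel counts (96 and 256 kernels) follow the AlexNet-style figure, and treating C_T as a 1×1 convolution is an assumption consistent with the per-descriptor auto-encoder of Section 3; this is an illustration, not the deployed Caffe model.

```python
# Sketch of the final architecture: fixed pre-trained part (C1, S1, C2)
# followed by the trainable layer C_T learned via the sparse auto-encoder.
import torch.nn as nn

def build_deep_anomaly_fcn(h: int = 500) -> nn.Module:
    fixed = nn.Sequential(                       # pre-trained, kept constant
        nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
        nn.AvgPool2d(kernel_size=3, stride=2),   # mean pooling, as in the paper
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    )
    for p in fixed.parameters():
        p.requires_grad = False
    c_t = nn.Conv2d(256, h, kernel_size=1)       # trainable C_T layer
    return nn.Sequential(fixed, c_t)
```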

4 Experimental Results and Comparisons

We evaluate the performance of the proposed method on the UCSD [47] and Subway [1] benchmarks. We show that the proposed method detects anomalies at high speed, suitable for real-time video surveillance, with equal or even better performance than other state-of-the-art methods. For implementing our Deep-Anomaly architecture we use the Caffe library [20]. All experiments are done on a standard NVIDIA TITAN GPU with MATLAB 2014a.2

2 Our implementation and Caffe models are publicly available at blinded.


Table 3. Effect of adding the (k+1)-th convolutional layer used for representing regional features, when using C2 for outputs

Layer            C2    CT and two classifiers
Frame-level EER  13%   11%

4.1 UCSD and Subway Datasets

UCSD Ped2 [47]. The dominant dynamic objects in this dataset are walkers, with crowd density varying from low to high. Objects such as cars, skateboarders, wheelchairs, or bicycles are considered anomalies. All training frames in this dataset are normal and contain pedestrians only. The dataset has 12 sequences for testing and 16 video sequences for training, at 320 × 240 resolution. For evaluating localization, ground truth is available for all test frames. The total numbers of anomalous and normal frames are ≈2384 and ≈2566, respectively.

Subway [1]. This dataset has two sequences, recorded at the entrance (1 h 36 min, 144,249 frames) and exit (43 min, 64,900 frames) of a subway station. People entering and exiting the station mostly show normal behavior; anomalous events are defined by people moving in the wrong direction (i.e. exiting through the entrance or entering through the exit) or avoiding payment. This dataset has some limitations: (1) the number of anomalies is low, and (2) the spatial localization of anomalies is predictable (at the entrance and exit regions).

4.2 Evaluation Methodology

We compare our results with state-of-the-art methods using receiver operating characteristic (ROC) curves, the equal error rate (EER), and the area under the curve (AUC). We use two measures, at frame level and at pixel level, which were introduced in [29] and used in most previous work. Based on these measures, frames are considered anomalous (positive) or normal (negative). The measures are defined as follows:

1. Frame level: if at least one pixel of a frame is detected as anomalous, the whole frame is considered anomalous [29].
2. Pixel level: a detection counts as a true positive only if at least 40 percent of the ground-truth anomaly pixels are covered by pixels detected by the algorithm [29].
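A sketch of these two criteria, assuming boolean detection and ground-truth masks per frame; the toy masks are placeholders.

```python
# Sketch: frame-level and pixel-level criteria of [29] on boolean masks.
import numpy as np

def frame_level_positive(pred_mask: np.ndarray) -> bool:
    """Frame is anomalous if any pixel is detected as anomalous."""
    return bool(pred_mask.any())

def pixel_level_true_positive(pred_mask: np.ndarray, gt_mask: np.ndarray) -> bool:
    """Detection counts only if >= 40% of ground-truth pixels are covered."""
    gt_pixels = gt_mask.sum()
    if gt_pixels == 0:
        return False                  # no anomaly annotated in this frame
    covered = np.logical_and(pred_mask, gt_mask).sum()
    return covered / gt_pixels >= 0.40

pred = np.zeros((240, 320), bool); pred[50:90, 60:120] = True   # toy masks
gt = np.zeros((240, 320), bool); gt[55:95, 60:120] = True
print(frame_level_positive(pred), pixel_level_true_positive(pred, gt))
```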

4.3 Qualitative and Quantitative Results

Figure 4 illustrates the output of the proposed system on samples of the UCSD and Subway datasets; our method detects and localizes the anomalies correctly in these cases. The main problem of an anomaly detection system is a high rate of false positives. Figure 5 shows regions which are wrongly detected as anomalous by our system. These false positives have two causes: (1) very crowded scenes, and (2) people walking in different directions.


Fig. 4. Output of the proposed method on the UCSD Ped2 and Subway datasets. A-left and B-left: original frames. A-right and B-right: anomalous regions indicated in red.

Fig. 5. Some examples of false positives of our system. Left: a pedestrian walking in the opposite direction to all other people. Middle: a crowded region wrongly detected as an anomaly. Right: people walking in different directions.

As walking in the opposite direction to all other pedestrians is not seen in the training video, it is also considered an anomaly by our algorithm.

Frame-level and pixel-level ROCs of the proposed method in comparison with state-of-the-art methods are provided in Fig. 6; left and middle show the frame-level and pixel-level results on the UCSD Ped2 dataset, respectively. The ROCs show that the presented method performs much better on the UCSD dataset than the other considered methods. Table 4 shows the frame-level and pixel-level EER of our method and of other state-of-the-art methods. Our frame-level EER is 11%, where the best result is 10%, achieved by Xiao et al. [54]; we outperform all other considered methods except [54]. Our pixel-level EER is 15%, where the next best result is 17%; in this measure our method achieves a better performance than any other state-of-the-art method (by 2%).

The frame-level ROC for the Subway dataset is shown in Fig. 6 (right). We evaluate our method both on the entrance and exit scenes of the Subway dataset. The ROC confirms that our method performs better than the MDT [26] and SRC [11] methods. We also provide an AUC and EER comparison for this dataset in Table 5. For the exit scene, we outperform the other considered methods with respect to both the AUC and EER measures, by 0.5% and 0.4%, respectively. For the entrance scenes, the AUC of the proposed method is better than for all other methods (by 0.4%). Regarding EER, we are also better, except compared to Saligrama et al. [43] (where we are worse by 0.3%).


Fig. 6. ROC comparison with state-of-the-art methods. Left: frame level on UCSD Ped2. Middle: pixel level on UCSD Ped2. Right: Subway dataset.

Table 4. EER for frame-level and pixel-level comparisons on Ped2

Method                 Frame-level  Pixel-level
IBC [6]                13%          26%
Adam et al. [1]        42%          76%
SF [30]                42%          80%
MPPCA [22]             30%          71%
MPPCA+SF [29]          36%          72%
Zaharescu et al. [59]  17%          30%
MDT [29]               24%          54%
Reddy et al. [37]      20%          —
Bertini et al. [5]     30%          —
Saligrama et al. [43]  18%          —
Xu et al. [55]         20%          42%
Li et al. [26]         18.5%        29.9%
Xiao et al. [54]       10%          17%
Sabokrou et al. [41]   19%          24%
Ours                   11%          15%

4.4 Run-time Analysis

Processing one frame involves three steps: pre-processing (such as resizing the frames and constructing the input of the FCN), representing the input by the FCN, and finally checking the regional descriptors with the Gaussian classifiers. Run-time details of our method for these three steps on a single frame are provided in Table 6. The total time for detecting anomalies in a frame is ≈0.0027 s; thus we achieve 370 fps, which is much faster than any of the other considered state-of-the-art methods. Table 7 compares the speed of our method to other methods.

Several key points make our system fast. The proposed method benefits from fully convolutional networks, which perform feature extraction and localization concurrently; this reduces the computations. Also, by combining six frames into a three-channel input, we process a cubic batch of video frames in just one forward pass. As mentioned above, for detecting anomalous regions we process only two convolutional layers, and only some regions are re-represented by the sparse auto-encoder; processing at these shallow layers keeps the computational cost low. All of these choices, plus running the fully convolutional network in parallel, result in the fast processing speed of our system compared to other methods.
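The throughput figure can be checked with a simple timing loop. The sketch below uses a PyTorch stand-in for the FCN on CPU (on a GPU one would additionally synchronize before reading the clock), so the measured numbers are illustrative only.

```python
# Rough sketch: measure per-frame latency of the feature-extraction stage.
import time
import torch

fcn = torch.nn.Sequential(                 # stand-in for the deployed FCN
    torch.nn.Conv2d(3, 96, 11, stride=4, padding=2), torch.nn.ReLU(),
    torch.nn.AvgPool2d(3, 2),
    torch.nn.Conv2d(96, 256, 5, padding=2), torch.nn.ReLU(),
).eval()

batch = torch.randn(100, 3, 240, 320)      # 100 input sequences D_t
with torch.no_grad():
    fcn(batch[:2])                          # warm-up pass
    t0 = time.perf_counter()
    fcn(batch)
    dt = (time.perf_counter() - t0) / 100   # seconds per frame
print(f"{dt:.4f} s/frame -> {1.0 / dt:.0f} fps")
```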


Table 5. AUC/EER comparison on the Subway dataset

Scene     SRC [11]    MDT [29]    Saligrama et al. [43]   Ours
Exit      80.2/26.4   89.7/16.4   88.4/17.9               90.2/16
Entrance  83.3/24.4   90.8/16.7   –/–                     90.4/17

Table 6. Run-time details (seconds/frame)

Pre-processing   Representation   Classifying   Total
0.0010           0.0016           0.0001        0.0027

5 Conclusions

We presented an FCN architecture for generating and describing anomalous regions in videos. By exploiting the strength of the FCN architecture for patch-wise operations on input data, the generated regional features are context-free. The proposed FCN combines part of a pre-trained CNN (an AlexNet version) with a new convolutional layer whose kernels are trained on the chosen training videos. Only this final convolutional layer of the proposed FCN is trainable and must be learned. The proposed strategy is very fast and also overcomes the shortage of training samples that prevents learning a complete CNN from scratch. Our method runs, as a deep-learning-based method, at a speed of about 370 fps. Altogether, the proposed method is both fast and accurate for anomaly detection in video data.


Table 7. Run-time comparison on Ped2 (in seconds per frame)

IBC [6]   MDT [29]   Roshtkhari et al. [40]   Li et al. [26]   Xiao et al. [54]   Ours
66        23         0.18                     0.80             0.29               ≈0.0027


References

1. Adam, A., Rivlin, E., Shimshoni, I., Reinitz, D.: Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Trans. Pattern Analysis Machine Intelligence, 30(3) 555–560 (2008)
2. Antić, B., Ommer, B.: Video parsing for abnormality detection. In: ICCV, pp. 2415–2422 (2011)
3. Antonakaki, P., Kosmopoulos, D., Perantonis, S.J.: Detecting abnormal human behaviour using multiple cameras. Signal Processing, 89(9) 1723–1738 (2009)
4. Benezeth, Y., Jodoin, P.M., Saligrama, V., Rosenberger, C.: Abnormal events detection based on spatio-temporal co-occurrences. In: CVPR, pp. 1446–1453 (2009)
5. Bertini, M., Del Bimbo, A., Seidenari, L.: Multi-scale and real-time non-parametric approach for anomaly detection and localization. Computer Vision Image Understanding, 116(3) 320–329 (2012)
6. Boiman, O., Irani, M.: Detecting irregularities in images and in video. Int. J. Computer Vision, 74(1) 17–31 (2007)
7. Brunet, D., Vrscay, E.R., Wang, Z.: On the mathematical properties of the structural similarity index. IEEE Trans. Image Processing, 21(4) 1488–1499 (2012)
8. Calderara, S., Heinemann, U., Prati, A., Cucchiara, R., Tishby, N.: Detecting anomalies in people's trajectories using spectral graph analysis. Computer Vision Image Understanding, 115(8) 1099–1111 (2011)
9. Cheng, K.W., Chen, Y.T., Fang, W.H.: Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression. In: CVPR, pp. 2909–2917 (2015)
10. Coates, A., Ng, A.Y., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Int. Conf. Artificial Intelligence Statistics, pp. 215–223 (2011)
11. Cong, Y., Yuan, J., Liu, J.: Sparse reconstruction cost for abnormal event detection. In: CVPR, pp. 3449–3456 (2011)
12. Cong, Y., Yuan, J., Tang, Y.: Video anomaly search in crowded scenes via spatio-temporal motion context. IEEE Trans. Information Forensics Security, 8(10) 1590–1599 (2013)
13. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587 (2014)
14. Giusti, A., Ciresan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J.: Fast image scanning with deep max-pooling convolutional neural networks. In: ICIP, pp. 4034–4038 (2013)
15. Ghodrati, A., Diba, A., Pedersoli, M., Tuytelaars, T., Van Gool, L.: DeepProposal: Hunting objects by cascading deep convolutional layers. In: ICCV (2015)
16. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587 (2014)
17. Hu, W., Xiao, X., Fu, Z., Xie, D., Tan, T., Maybank, S.: A system for learning statistical motion patterns. IEEE Trans. Pattern Analysis Machine Intelligence, 28(9) 1450–1464 (2006)
18. ImageNet: image-net.org (2016)


19. Jiang, F., Yuan, J., Tsaftaris, S.A., Katsaggelos, A.K.: Anomalous video event detection using spatiotemporal context. Computer Vision Image Understanding, 115(3) 323–333 (2011)
20. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. ACM Int. Conf. Multimedia, pp. 675–678 (2014)
21. Kratz, L., Nishino, K.: Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In: CVPR, pp. 1446–1453 (2009)
22. Kim, J., Grauman, K.: Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates. In: CVPR, pp. 2921–2928 (2009)
23. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances Neural Information Processing Systems, pp. 1097–1105 (2012)
24. Lee, D.G., Suk, H.I., Park, S.K., Lee, S.W.: Motion influence map for unusual human activity detection and localization in crowded scenes. IEEE Trans. Circuits Systems Video Technology, 25(10) 1612–162 (2015)
25. Li, N., Wu, X., Xu, D., Guo, H., Feng, W.: Spatio-temporal context analysis within video volumes for anomalous-event detection and localization. Neurocomputing, 155, 309–319 (2015)
26. Li, W., Mahadevan, V., Vasconcelos, N.: Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Analysis Machine Intelligence, 36(1) 18–32 (2014)
27. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
28. Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in MATLAB. In: ICCV, pp. 2720–2727 (2013)
29. Mahadevan, V., Li, W., Bhalodia, V., Vasconcelos, N.: Anomaly detection in crowded scenes. In: CVPR, pp. 1975–1981 (2010)
30. Mehran, R., Oyama, A., Shah, M.: Abnormal crowd behavior detection using social force model. In: CVPR, pp. 935–942 (2009)
31. MIT Places Database: places.csail.mit.edu (2016)
32. Morris, B.T., Trivedi, M.M.: Trajectory learning for activity understanding: Unsupervised, multilevel, and long-term adaptive approach. IEEE Trans. Pattern Analysis Machine Intelligence, 33(11) 2287–2301 (2011)
33. Mousavi, H., Nabi, M., Galoogahi, H.K., Perina, A., Murino, V.: Abnormality detection with improved histogram of oriented tracklets. In: ICIAP (2015)
34. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: CVPR, pp. 1717–1724 (2014)
35. Piciarelli, C., Foresti, G.L.: On-line trajectory clustering for anomalous events detection. Pattern Recognition Letters, 27(15) 1835–1842 (2006)
36. Piciarelli, C., Micheloni, C., Foresti, G.L.: Trajectory-based anomalous event detection. IEEE Trans. Circuits Systems Video Technology, 18(11) 1544–1554 (2008)
37. Reddy, V., Sanderson, C., Lovell, B.C.: Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture. In: CVPRW, pp. 55–61 (2011)
38. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)


39. Roshtkhari, M., Levine, M.: Online dominant and anomalous behavior detection in videos. In: CVPR, pp. 2611–2618 (2013)
40. Roshtkhari, M.J., Levine, M.D.: An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions. Computer Vision Image Understanding, 117(10) 1436–1452 (2013)
41. Sabokrou, M., Fathy, M., Hoseini, M., Klette, R.: Real-time anomaly detection and localization in crowded scenes. In: CVPR Workshops (2015)
42. Sabokrou, M., Fathy, M., Hoseini, M.: Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder. Electronics Letters, pp. 1122–1124 (2016)
43. Saligrama, V., Chen, Z.: Video anomaly detection based on local statistical aggregates. In: CVPR (2012)
44. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR (2014)
45. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances Neural Information Processing Systems, pp. 568–576 (2014)
46. Tung, F., Zelek, J.S., Clausi, D.A.: Goal-based trajectory analysis for unusual behaviour detection in intelligent surveillance. Image and Vision Computing, 29(4) 230–240 (2011)
47. UCSD dataset: www.svcl.ucsd.edu/projects/anomaly/dataset.html
48. Ullah, H., Conci, N.: Crowd motion segmentation and anomaly detection via multi-label optimization. In: ICPR Workshop on Pattern Recognition and Crowd Analysis (2012)
49. Ullah, H., Tenuti, L., Conci, N.: Gaussian mixtures for anomaly detection in crowded scenes. In: IS&T/SPIE Electronic Imaging, 866303 (2013)
50. Ullah, H., Ullah, M., Conci, N.: Real-time anomaly detection in dense crowded scenes. In: IS&T/SPIE Electronic Imaging, 902608 (2014)
51. Ullah, H., Ullah, M., Conci, N.: Dominant motion analysis in regular and irregular crowd scenes. In: ECCV Workshop HBU, pp. 62–72 (2014)
52. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Int. ACM Conf. Machine Learning, pp. 1096–1103 (2008)
53. Wu, S., Moore, B., Shah, M.: Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes. In: CVPR (2010)
54. Xiao, T., Zhang, C., Zha, H.: Learning to detect anomalies in surveillance video. IEEE Signal Processing Letters, 22(9) 1477–1481 (2015)
55. Xu, D., Song, R., Wu, X., Li, N., Feng, W., Qian, H.: Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts. Neurocomputing, 143, 144–152 (2014)
56. Xu, D., Ricci, E., Yan, Y., Song, J., Sebe, N.: Learning deep representations of appearance and motion for anomalous event detection. In: BMVC (2015)
57. Yang, Y., Shu, G., Shah, M.: Semi-supervised learning of feature hierarchies for object detection in a video. In: CVPR, pp. 1650–1657 (2013)
58. Yuan, Y., Fang, J., Wang, Q.: Online anomaly detection in crowd scenes via structure analysis. IEEE Trans. Cybernetics, 45(3) 562–575 (2015)
59. Zaharescu, A., Wildes, R.: Anomalous behaviour detection using spatiotemporal oriented energies, subset inclusion histogram comparison and event-driven processing. In: ECCV, pp. 563–576 (2010)


60. Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I.: Semi-supervised adapted HMMs for unusual event detection. In: CVPR (2005)
61. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances Neural Information Processing Systems (2014)
62. Zhu, Y., Nayak, N., Roy-Chowdhury, A.: Context-aware modeling and recognition of activities in video. In: CVPR, pp. 2491–2498 (2013)
