ANOMALY DETECTION IN CROWD SCENES VIA ONLINE ADAPTIVE ONE-CLASS SUPPORT VECTOR MACHINES

Hanhe Lin, Jeremiah D. Deng, Brendon J. Woodford

Department of Information Science, University of Otago, PO Box 56, Dunedin 9054, New Zealand
Email: {hanhe.lin, jeremiah.deng, brendon.woodford}@otago.ac.nz

ABSTRACT

We propose a novel online adaptive one-class support vector machines algorithm for anomaly detection in crowd scenes. Integrating incremental and decremental one-class support vector machines with a sliding buffer offers an efficient and effective scheme that not only updates the model in an online fashion with low computational cost, but also discards obsolete patterns. Our method provides a unified framework to detect both global and local anomalies. Extensive experiments have been carried out on two benchmark datasets, and comparison with state-of-the-art methods validates the advantages of our approach.

Index Terms— anomaly detection, crowd scenes, support vector machines, online learning
Fig. 1: The flowchart of our proposed algorithm.
1. INTRODUCTION
Anomaly detection in crowd scenes has attracted increasing attention in the development of intelligent video surveillance systems, in the context of heightened awareness of national security. In this paper we propose a framework using online adaptive one-class Support Vector Machines (SVMs) [1] to detect anomalies in crowd scenes, extending the promising performance this model has shown in its original batch settings [2, 3].

The core of our algorithm is a novel incremental and decremental one-class SVMs approach deployed within a modified sliding buffer [4]. The incremental part updates the model in an online fashion with lower computational cost than batch retraining, whilst the decremental part forgets old patterns that no longer represent the current distribution. This improves on more computationally efficient algorithms such as [5], which adopts stochastic gradient descent but gives only approximate results.

The flowchart of our proposed algorithm is shown in Fig. 1. Given a training set of video segments, each segment is divided into a set of video events. We then form a visual vocabulary by performing k-means clustering on a random subset of descriptors extracted from the training set. By assigning each descriptor to its closest vocabulary word, the video events are represented as histograms and used to train one-class SVMs. When a new test event arrives, its histogram representation is verified by the learnt one-class SVMs model. If it is not detected as an anomaly and satisfies the update criterion, the one-class SVMs model is updated for further detection.
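To make the loop of Fig. 1 concrete, the following minimal Python sketch runs it on toy data. It uses scikit-learn's batch OneClassSVM as a stand-in, refitting on the sliding buffer at every accepted event; this repeated refit is exactly the cost the incremental scheme of Section 3 avoids. All data, parameters, and the threshold are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Batch stand-in for the detection loop of Fig. 1: score each incoming event,
# and on normal events update the sliding buffer and refit the model.
rng = np.random.default_rng(0)
buffer = list(rng.dirichlet(np.ones(20), size=300))   # toy event histograms
model = OneClassSVM(kernel="rbf", nu=0.1).fit(np.array(buffer))

for event in rng.dirichlet(np.ones(20), size=50):     # simulated test stream
    score = model.decision_function(event.reshape(1, -1))[0]
    if score >= 0:                                    # normal: adapt the model
        buffer.append(event)                          # add the new pattern ...
        buffer.pop(0)                                 # ... and drop the oldest
        model = OneClassSVM(kernel="rbf", nu=0.1).fit(np.array(buffer))
```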
2. FEATURE EXTRACTION AND EVENT REPRESENTATION
Fig. 2: The framework of event representation.

In each video segment, we first compute optical flow using Horn and Schunck's method [6]. In each spatial-temporal patch, all flow vectors are quantized into N bins using a soft-assignment scheme [7]. An N-dimensional histogram-of-optical-flow (HOF) descriptor is then formed by accumulating the soft contributions of all flow vectors.

Following [8], we classify anomalies in crowd scenes into two classes on the basis of scale: Global Anomaly (GA) and Local Anomaly (LA). A GA occurs when the behavior of the whole scene is anomalous even if local behaviors are normal, while an LA refers to an individual whose behavior differs from that of neighbouring individuals. To deal with the two anomaly categories, we propose two separate representations, as shown in Fig. 2. Video segments are first acquired using a sliding window. We then extract non-overlapping spatial-temporal patches and compute their HOF descriptors to form a codebook. The video segments are divided into spatial-temporal events or temporal events. For a spatial-temporal event, the video segment is partitioned into m × n cells; we compute a histogram for each cell separately and represent the event as the concatenation of these histograms. A temporal event differs in that only one histogram is computed over the whole segment.
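The two representations can be sketched in a few lines of numpy. For brevity the sketch uses hard codebook assignments (the paper uses soft assignment [7]), and the grid and patch counts are illustrative, not the experimental settings.

```python
import numpy as np

# Sketch of the two event representations in Fig. 2. `assignments` holds the
# codebook-word index of each spatio-temporal patch on the segment's cell grid.
K = 200                                   # codebook size
m, n, patches_per_cell = 4, 4, 9          # illustrative cell grid
rng = np.random.default_rng(0)
assignments = rng.integers(0, K, size=(m, n, patches_per_cell))

def spatial_temporal_event(assign, K):
    """Concatenate per-cell histograms -> (m*n*K)-dimensional code."""
    cells = [np.bincount(assign[i, j], minlength=K)
             for i in range(assign.shape[0]) for j in range(assign.shape[1])]
    return np.concatenate(cells)

def temporal_event(assign, K):
    """Pool a single histogram over all patches in the segment."""
    return np.bincount(assign.ravel(), minlength=K)

st_code = spatial_temporal_event(assignments, K)   # 4 * 4 * 200 = 3200 dims
t_code = temporal_event(assignments, K)            # 200 dims
```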
3. INCREMENTAL AND DECREMENTAL ONE-CLASS SUPPORT VECTOR MACHINES

One-class SVMs, which are trained on normal patterns only, aim to find a separating function $f(x) = \sum_j \alpha_j k(x_j, x) - \rho$ that contains most of the patterns in a compact region. Our proposed Incremental and Decremental One-Class SVMs (ID-OCSVMs) extend the incremental and decremental SVMs of [9]. Fig. 4 illustrates the procedure of our proposed approach.
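For concreteness, evaluating the decision function $f(x)$ takes only a kernel expansion over the support vectors; the sketch below uses an RBF kernel and toy support vectors and coefficients (all values illustrative).

```python
import numpy as np

# Evaluating f(x) = sum_j alpha_j k(x_j, x) - rho with an RBF kernel.
def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
alpha = np.array([0.4, 0.4, 0.2])        # coefficients sum to 1, cf. Eq. (2)
rho = 0.6

def f(x):
    return sum(a * rbf(sv, x) for a, sv in zip(alpha, support_vectors)) - rho

print(f(np.array([0.5, 0.5])))           # inside the support region if >= 0
```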
Fig. 4: The framework for ID-OCSVMs. From time t to t + 1, we add a newly arriving pattern (red circle) into the sliding buffer using the incremental procedure, while removing the obsolete pattern (blue circle) from the sliding buffer through the decremental procedure.

The incremental procedure is reversible, and in the following we only describe the incremental procedure of ID-OCSVMs due to the space limit.

3.1. Karush-Kuhn-Tucker conditions

The one-class SVMs training problem can be formulated as
$$\max_{\rho} \min_{0 \le \alpha_i \le C} W = \frac{1}{2} \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) - \rho \Big( \sum_i \alpha_i - 1 \Big).$$
The first-order conditions on W reduce to the Karush-Kuhn-Tucker (KKT) conditions:
$$g_i = \frac{\partial W}{\partial \alpha_i} = \sum_j \alpha_j k(x_i, x_j) - \rho = f(x_i) \begin{cases} \ge 0, & \text{if } \alpha_i = 0 \\ = 0, & \text{if } 0 < \alpha_i < C \\ \le 0, & \text{if } \alpha_i = C \end{cases} \tag{1}$$
$$\frac{\partial W}{\partial \rho} = \sum_j \alpha_j - 1 = 0. \tag{2}$$
The training data D is divided into three sets: margin support vectors S, error support vectors E, and the remaining set O. In the following, we abbreviate $k(x_i, x_j)$ to $k_{ij}$.

3.2. Derivation of the ID-OCSVMs

When a new data pattern $x_c$ arrives, its coefficient $\alpha_c$ is initially set to 0. If $g_c > 0$, we can put $x_c$ into set O and terminate the algorithm, because the pattern has no impact on the model. If $g_c \le 0$, we increase its coefficient $\alpha_c$ from 0 while updating the coefficients of the margin support vectors S and $\rho$, so that the KKT conditions remain satisfied for the enlarged data set:
$$\Delta g_i = k_{ic} \Delta\alpha_c + \sum_{j \in S} k_{ij} \Delta\alpha_j + \Delta\rho, \quad \forall i \in D \cup \{c\}, \tag{3}$$
$$0 = \Delta\alpha_c + \sum_{j \in S} \Delta\alpha_j. \tag{4}$$
For all margin support vectors $S = \{s_1, \dots, s_n\}$ we have $g_i \equiv 0$, so equations (3) and (4) can be re-written as
$$\underbrace{\begin{bmatrix} 0 & 1 & \cdots & 1 \\ 1 & k_{s_1 s_1} & \cdots & k_{s_1 s_n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & k_{s_n s_1} & \cdots & k_{s_n s_n} \end{bmatrix}}_{Q} \begin{bmatrix} \Delta\rho \\ \Delta\alpha_{s_1} \\ \vdots \\ \Delta\alpha_{s_n} \end{bmatrix} = - \begin{bmatrix} 1 \\ k_{s_1 c} \\ \vdots \\ k_{s_n c} \end{bmatrix} \Delta\alpha_c$$
$$\Longrightarrow \begin{bmatrix} \Delta\rho \\ \Delta\alpha_{s_1} \\ \vdots \\ \Delta\alpha_{s_n} \end{bmatrix} = \begin{bmatrix} \beta \\ \beta_{s_1} \\ \vdots \\ \beta_{s_n} \end{bmatrix} \Delta\alpha_c, \tag{5}$$
with coefficient sensitivities given by
$$\begin{bmatrix} \beta \\ \beta_{s_1} \\ \vdots \\ \beta_{s_n} \end{bmatrix} = -R \cdot \begin{bmatrix} 1 \\ k_{s_1 c} \\ \vdots \\ k_{s_n c} \end{bmatrix}, \tag{6}$$
where $R = Q^{-1}$. Substituting equation (5) into (3) gives
$$\Delta g_i = \gamma_i \Delta\alpha_c, \tag{7}$$
with margin sensitivities
$$\gamma_i = k_{ic} + \sum_{j \in S} k_{ij} \beta_j + \beta, \quad \forall i \notin S. \tag{8}$$

3.3. Online update of the ID-OCSVMs

We cannot obtain the new one-class SVMs state directly, since in equations (5) and (7) the composition of the sets S, E and O changes as $\Delta\alpha_c$ and $\Delta g_i$ change. Therefore, we have identified the following conditions that may occur:

1. $g_c$ becomes zero, i.e. $x_c$ joins S. The largest possible increment is $\Delta\alpha_c^{g} = -g_c / \gamma_c$.

2. $\alpha_c$ reaches C, i.e. $x_c$ becomes an error support vector. The largest step is $\Delta\alpha_c^{\alpha} = C - \alpha_c$.

3. Some $g_i$ in E becomes zero, which is equivalent to $x_i$ transferring from E to S. The corresponding largest increase is
$$\Delta\alpha_c^{E} = \min_{i \in E,\, \gamma_i > 0} \frac{-g_i}{\gamma_i}. \tag{9}$$

4. Some $g_i$ in O becomes zero, which is equivalent to $x_i$ transferring from O to S. The largest step is
$$\Delta\alpha_c^{O} = \min_{i \in O,\, \gamma_i < 0} \frac{-g_i}{\gamma_i}.$$

5. Some $\alpha_i$ in S reaches a bound: reaching 0 transfers $x_i$ from S to O, and reaching C transfers it from S to E. The largest possible increment is
$$\Delta\alpha_c^{S} = \min_{i \in S} \frac{\Delta\alpha_i^{\max}}{\beta_i}, \quad \text{where} \quad \Delta\alpha_i^{\max} = \begin{cases} C - \alpha_i, & \text{if } \beta_i > 0 \\ -\alpha_i, & \text{if } \beta_i < 0 \end{cases} \tag{10}$$

The largest possible increment $\Delta\alpha_c^{\max}$ is determined by taking the minimum over the above conditions:
$$\Delta\alpha_c^{\max} = \min(\Delta\alpha_c^{g}, \Delta\alpha_c^{\alpha}, \Delta\alpha_c^{E}, \Delta\alpha_c^{O}, \Delta\alpha_c^{S}).$$
Once $\Delta\alpha_c^{\max}$ is obtained, we can update $\rho$, $\alpha_i$ and $g_i$ through equations (5) and (7). The process repeats until the coefficient $\alpha_c$ reaches C or $g_c$ reaches zero.

3.4. Recursive update of R

It would be time-consuming to recompute the inverse matrix R whenever the set S changes. Fortunately, by applying the Sherman-Morrison-Woodbury formula [10] for block matrix inversion, the update rule for a data pattern $x_k$ added to S is
$$R \leftarrow \begin{bmatrix} R & \mathbf{0} \\ \mathbf{0}^{\top} & 0 \end{bmatrix} + \frac{1}{\eta_k} \begin{bmatrix} \beta \\ \beta_{s_1} \\ \vdots \\ \beta_{s_n} \\ 1 \end{bmatrix} \begin{bmatrix} \beta & \beta_{s_1} & \cdots & \beta_{s_n} & 1 \end{bmatrix},$$
where
$$\eta_k = k_{kk} + \sum_{j \in S} k_{kj} \beta_j + \beta. \tag{11}$$
Similarly, to remove a data pattern $x_k$ from the set S, the update rule is written as
$$R_{ij} \leftarrow R_{ij} - R_{kk}^{-1} R_{ik} R_{kj}, \quad \forall i, j \in S \cup \{0\};\ i, j \neq k.$$
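The following condensed numpy sketch walks through one increment step on toy values: build $R = Q^{-1}$, obtain the sensitivities $\beta$ (Eq. 6) and $\gamma$ (Eq. 8), collect the candidate steps of Section 3.3, and take their minimum. It is a sketch of the bookkeeping only; set O, the subsequent $\alpha$, $\rho$, $g$ updates, and the recursive R update of Section 3.4 are omitted, and all data and KKT values are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 2))                # margin support vectors (set S)
E = rng.normal(size=(2, 2))                # error support vectors (set E)
x_c = rng.normal(size=(1, 2))              # incoming pattern with g_c <= 0
C, alpha_c = 0.25, 0.0
alpha_S = np.full(len(S), 0.1)             # toy coefficients of S
g_c, g_E = -0.1, np.array([-0.05, -0.02])  # toy KKT values (g <= 0 on E)

# R = Q^{-1}, with Q bordered by ones as in Eq. (5)
Q = np.block([[np.zeros((1, 1)), np.ones((1, len(S)))],
              [np.ones((len(S), 1)), rbf(S, S)]])
R = np.linalg.inv(Q)

# Coefficient sensitivities (Eq. 6): [beta; beta_s1 ... beta_sn]
beta = -R @ np.concatenate(([1.0], rbf(S, x_c).ravel()))

# Margin sensitivities (Eq. 8): gamma_i = k_ic + sum_j k_ij beta_j + beta
def gammas(X):
    return rbf(X, x_c).ravel() + rbf(X, S) @ beta[1:] + beta[0]

gamma_c, gamma_E = gammas(x_c)[0], gammas(E)

# Candidate steps from Sec. 3.3 (conditions 1, 2, 3 and 5; set O omitted)
cands = [C - alpha_c]                                        # alpha_c reaches C
if gamma_c > 0:
    cands.append(-g_c / gamma_c)                             # g_c reaches zero
cands += [-g / gm for g, gm in zip(g_E, gamma_E) if gm > 0]  # E -> S moves
cands += [((C - a) if b > 0 else -a) / b                     # S hits a bound
          for a, b in zip(alpha_S, beta[1:]) if b != 0]
d_alpha_max = min(c for c in cands if c > 0)                 # Sec. 3.3 minimum
print("largest safe increment:", d_alpha_max)
```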
4. EXPERIMENTAL RESULTS

We report experimental results on two benchmark datasets: the UMN dataset (available from http://mha.cs.umn.edu) and the UCSD Ped2 dataset (available from http://www.svcl.ucsd.edu/projects/anomaly/dataset.html). The UMN dataset is used to verify the effectiveness of spatial-temporal events for GA detection, and the UCSD Ped2 dataset is used to test temporal events for LA detection. In the following experiments, unless otherwise specified, 50,000 HOF descriptors are selected randomly from the training set to form a codebook of size 200 using the k-means algorithm.
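As a concrete illustration of this codebook step, here is a short scikit-learn sketch on synthetic stand-in descriptors; the descriptor dimension of 16 is arbitrary, and MiniBatchKMeans is substituted for plain k-means purely for speed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Codebook formation: cluster sampled HOF descriptors into 200 visual words.
rng = np.random.default_rng(0)
hof = rng.random((50_000, 16))                 # stand-in N-bin HOF descriptors
hof /= hof.sum(axis=1, keepdims=True)          # normalise each histogram

codebook = MiniBatchKMeans(n_clusters=200, n_init=3, random_state=0).fit(hof)
words = codebook.predict(hof[:5])              # nearest-word assignments
print(words)
```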
4.1. UMN dataset

The UMN dataset contains eleven video clips of three different scenes with a resolution of 320 × 240 pixels. We use the first 300 frames of each scene to train the initial parameters with a batch one-class SVMs model, and the rest for testing. The buffer size is set to 100. The temporal window size of each video segment is set to 12 frames with no overlap, and the size of each spatial-temporal patch is 10 × 10 × 3. We split each frame into 4 × 4 cells, each using a histogram of 200 bins, and concatenate the histograms of all cells into a code of 3,200 dimensions.

The results are shown in Fig. 3, where the top row illustrates sample frames from the dataset. The green dots and red crosses in the bottom row represent normal and abnormal video events, respectively. Most abnormal events (red crosses) have lower decision values than normal events (green dots), which is consistent with the ground truth. The average Area Under the Curve (AUC) of our approach on the three scenes is 0.9947, 0.9827, and 0.9856 (0.9853 overall), which is comparable to Chaotic Dynamics [11] (0.99) and better than Sparse Reconstruction Cost [8] (0.97) and Social Force [12] (0.96).

Fig. 3: The experimental result of the online adaptive one-class SVMs model for 11 video clips from the UMN dataset (decision values plotted against the index of testing video segments for Scenes 1-3).
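For reference, AUC values of this kind can be computed from per-segment decision values in a few lines; the values below are toy stand-ins, not the paper's data. Since anomalies are expected to receive lower decision values, the scores are negated before ranking.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy frame/segment-level AUC: abnormal segments have lower decision values.
decision_values = np.array([-0.05, -0.31, -0.02, -0.28, -0.04])
is_abnormal = np.array([0, 1, 0, 1, 0])
print(roc_auc_score(is_abnormal, -decision_values))   # 1.0 for this toy case
```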
4.2. UCSD Ped2 dataset

The UCSD Ped2 dataset has 16 training clips and 12 testing clips with a resolution of 240 × 360 pixels. The common anomalies are bikers, skaters, and small carts. A spatial-temporal patch of size 10 × 10 × 3 is adopted in this experiment. We apply 40 × 40 × 15 temporal events with a 10 × 10 × 7 overlap in the training set and a 20 × 20 × 7 overlap in the testing set; the choice of event size trades off the ability to detect an anomaly against timely response. We use the conventional batch one-class SVMs [1] on events extracted from the first training clip (i.e., 120 frames) to obtain the initial parameters. Our online adaptive one-class SVMs approach is then applied to the remaining 15 training clips and the 12 testing clips, with the size of the sliding buffer set to 8,000. To reduce computational cost, we keep only a limited portion of the remaining set O, retaining data with 0 < g_i < ε and discarding all data with g_i ≥ ε.

As in [13], two measurements are used to evaluate the performance of anomaly detection: frame-level and pixel-level. The experimental results on the UCSD Ped2 dataset are reported in Fig. 5. Table 1 indicates that the Equal Error Rate (EER) of our proposed algorithm is comparable to that of Mixture of Dynamic Textures (MDT) [14] and Latent Dirichlet Allocation (LDA) [15] at the frame level, while outperforming MDT at the pixel level.
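A small sketch of the pruning rule on set O described above, with toy g values; the threshold ε is a free parameter.

```python
import numpy as np

# Keep only non-support patterns (alpha_i = 0, g_i > 0) with g_i < eps;
# patterns far inside the boundary (g_i >= eps) are discarded.
eps = 0.05
g = np.array([0.001, 0.2, 0.03, 0.9, 0.04])   # toy g_i values for set O
buffer_O = np.arange(len(g))                  # indices of buffered O patterns
kept = buffer_O[(g > 0) & (g < eps)]
print(kept)                                   # -> [0 2 4]
```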
Table 1: Quantitative comparison of our proposed method and the state-of-the-art approaches.

Method      Frame-level EER (%)   Pixel-level EER (%)
Ours        20                    24
MDT [14]    19                    30
LDA [15]    16                    −

Fig. 5: Results on the UCSD Ped2 dataset. (a) The ROC curves at the frame level (AUC = 0.8841) and the pixel level (AUC = 0.8119). (b) Examples of detected abnormal events, where the green rectangles are true negatives, and the blue and red rectangles are a false positive and a false negative, respectively.
Our algorithm is implemented in MATLAB on a 2.7 GHz Intel Core i5 with 8 GB RAM. The average computation time is 0.11 seconds/frame for the UMN dataset and 0.25 seconds/frame for the UCSD dataset.

5. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel framework to detect anomalies in crowd scenes. By keeping the KKT conditions satisfied for the enlarged data set, our approach effectively updates one-class SVMs models in an online fashion. The online algorithm, along with the use of a sliding buffer, can adapt to new patterns and forget obsolete patterns at the same time. Satisfactory performance is obtained for the detection of both global and local anomalies on benchmark datasets. So far, our approach can only update (add or remove) one pattern at a time; our future work is to update multiple data patterns simultaneously [16].
References

[1] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.

[2] Larry M. Manevitz and Malik Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, pp. 139–154, 2002.

[3] Junshui Ma and Simon Perkins, "Time-series novelty detection using one-class support vector machines," in Proceedings of the International Joint Conference on Neural Networks. IEEE, 2003, vol. 3, pp. 1741–1745.

[4] Frédéric Desobry, Manuel Davy, and Christian Doncarli, "An online kernel change detection algorithm," IEEE Transactions on Signal Processing, vol. 53, no. 8, pp. 2961–2974, 2005.

[5] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson, "Online learning with kernels," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2165–2176, 2004.

[6] Berthold K. Horn and Brian G. Schunck, "Determining optical flow," in 1981 Technical Symposium East. International Society for Optics and Photonics, 1981, pp. 319–331.

[7] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek, "Visual word ambiguity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1271–1283, July 2010.

[8] Yang Cong, Junsong Yuan, and Ji Liu, "Sparse reconstruction cost for abnormal event detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 3449–3456.

[9] Gert Cauwenberghs and Tomaso Poggio, "Incremental and decremental support vector machine learning," in Advances in Neural Information Processing Systems, 2001, pp. 409–415.

[10] Gene H. Golub and Charles F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[11] Shandong Wu, Brian E. Moore, and Mubarak Shah, "Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2054–2060.

[12] Ramin Mehran, Alexis Oyama, and Mubarak Shah, "Abnormal crowd behavior detection using social force model," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 935–942.

[13] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos, "Anomaly detection in crowded scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 1975–1981.

[14] Weixin Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 18–32, Jan. 2014.

[15] Daphna Weinshall, Gal Levi, and Dmitri Hanukaev, "LDA topic model with soft assignment of descriptors to words," in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 711–719.

[16] Masayuki Karasuyama and Ichiro Takeuchi, "Multiple incremental decremental learning of support vector machines," in Advances in Neural Information Processing Systems, 2009, pp. 907–915.