Interactive Browsing System for Anomaly Video Surveillance

Tien-Vu Nguyen, Dinh Phung, Sunil Gupta and Svetha Venkatesh
Centre for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia
{tvnguye,dinh.phung,sunil.gupta,svetha.venkatesh}@deakin.edu.au

Abstract—Existing anomaly detection methods in video surveillance exhibit a lack of congruence between the rare events detected by algorithms and what users consider anomalous. This paper introduces a novel browsing model to address this issue, allowing users to interactively examine rare events in an intuitive manner. Introducing a novel way to compute rare motion patterns, we estimate latent factors of foreground motion patterns through Bayesian nonparametric factor analysis. Each factor corresponds to a typical motion pattern. A rarity score is computed for each factor, and the factors are ordered in decreasing order of rarity, permitting users to browse events using any proportion of rare factors. Rare events correspond to frames that contain the chosen rare factors. We present the user with an interface to inspect events that incorporate these rarest factors in a spatial-temporal manner. We demonstrate the system on a public video data set, showing key aspects of the browsing paradigm.

Index Terms—abnormal detection, nonparametric factor analysis, spatial-temporal, user interface, rank-1 robust PCA

I. INTRODUCTION

Security and surveillance systems focus on the detection of rare and anomalous events. Typically, these events are detected by estimating statistics from the "normal" data: anything that deviates is termed rare. The problem, however, is that in surveillance data there is a semantic gap between the statistically rare events produced by detection algorithms and what the user would consider semantically rare. As an example, consider 6 cameras that we monitored over a 12-month period: each camera, using a current system (www.icetana.com), detects 25 anomalies a day, a total of 54600 alerts in a year. However, in this period there were fewer than 10 real criminal events. This means that although our systems focus attention on less than 1% of the monitored footage, they still produce many unnecessary alerts. Detecting only real criminal events is the Holy Grail, as yet unattainable. In this paper we ask the question: is there an alternative way to examine these anomalies, at least retrospectively? Consider security officers being given the location and time of an incident; they now wish to find footage that matches. We propose a novel interface that permits operators to specify such queries and retrieve potential footage of rare events that match. This geometric query can be either spatial (rare events in a region of interest) or spatial-temporal (rare events at location A, then B). Our solution first finds patterns in the scene through the application of nonparametric factor analysis to the extracted foreground

Fig. 1. Illustration of our learned factors overlaid on data from the MIT dataset. The left column presents three common patterns; three rare factors are displayed in the right column.

in each frame. Since the number of latent factors is unknown in advance, we employ recent advances in Bayesian nonparametric factor analysis. The generative process models non-negative count data with a Poisson distribution [1]. The presence or absence of a factor is modeled through a binary matrix, whose nonparametric distribution follows the Indian Buffet Process [2] and is modeled through a draw from a beta process, which allows infinitely many factors. The extracted factors correspond to patterns of movement in the scene. The rareness of each extracted factor is determined by how much it is used across the whole data set. The factors are then ordered in decreasing rarity, and the user is allowed to choose a proportion of rare factors for consideration. The three top candidate rare factors from the MIT dataset are visualized in the right column of Figure 1, while three common patterns are shown in the left column. Frames that contain these factors are considered potential candidates. The solution to a given geometric query is the set of candidate frames that satisfy the specified spatial or spatial-temporal constraints. We demonstrate this browsing paradigm with spatial and spatio-temporal queries on video data sets. The user interface of our system is displayed in Figure 2.

Fig. 2. Graphical user interface (GUI) for our browsing system. A: Adjusting the level of rareness; the indicator displays 10% of the total events. B: Two browsing schemes: spatial and spatio-temporal. C: Latent factors detected using the chosen percentage of rare factors. D: Filtered output. E: Spatially selected region. F: Video clip matching the filtered result. G: List of clips containing the latent factors. H: Number of consecutive frames.

The significance of this paradigm is that it allows an operator to browse rare events, either spatially or spatial-temporally, at different "scales" of rarity. The use of nonparametric factor analysis models allows the framework to gracefully adapt to the data, without the need for a priori intervention. The framework can also easily be extended to accommodate multiple cameras. To our knowledge, there is no such existing system in the literature. Our main contributions in this paper are:
• An anomaly detection framework based on part-based matrix decomposition that utilizes our recently introduced rank-1 robust background subtraction for motion video from a static camera, together with nonparametric pattern analysis.
• A new browsing scheme allowing users not only to control the degree of rareness but also to issue spatial or spatial-temporal queries, overcoming the limitation of the semantic gap.
The paper is organized as follows. Background information on anomaly detection techniques is described in section II. Section III briefly introduces our framework. Section IV provides the background subtraction step based on robust PCA. Next, we explain our proposed system and its analysis in detail. Our experimental results and conclusion are presented in sections VI and VII.

II. RELATED BACKGROUND

In recent years, abnormality and rare event detection in video surveillance has attracted considerable research attention. Detecting patterns in motion forms the core of most anomaly detection techniques. This task can be approached in three main settings: unsupervised, semi-supervised and supervised learning. Typical examples of successful supervised learning for anomaly detection include [3], which used an Extensible Markov Model to detect rare events relying on spatial and temporal values, and [4], [5], which aimed to detect abnormality from crowded video data. However, these techniques are restricted by the requirement that all events of interest be defined in advance. Unsupervised methods, such as PCA,

determine residual subspaces from which anomalies can be deduced [6]; these methods lend themselves to the discovery of novel events. The problem with PCA is that although abnormalities can be identified, their spatial location within the frame cannot be recovered, which is essential for spatial filtering. The work of [7] looked at activities whose states are permuted and/or partially ordered. Another example is [8], which derived an HMM-based statistical model to identify the spatio-temporal relationships of the patterns. The work of [9] exploited the Infinite Hidden Markov Model [10] to efficiently segment the motion data into multiple coherent sections before computing the residual subspaces for rare event detection. To extract motion features, early work used mixture models [11] or optical flow estimation [12], which depends heavily on thresholds and is time consuming; more recent approaches use robust PCA to separate out the foreground as the sparse part of a data matrix. The latter is robust to noise, has efficient implementations [13] and is our choice for feature extraction. We therefore adopt an unsupervised setting: factor analysis [14], which decomposes a matrix into key factors, or patterns, and components that combine them to reconstitute a frame. Factors naturally lend themselves to spatial filtering and are our choice for determining common and rare patterns. More significantly, our goal is to introduce a browsing paradigm that assists the user in controlling the search criteria through spatial or spatio-temporal options.

III. PROPOSED BROWSING FRAMEWORK

A schematic illustration of the proposed system is shown in Figure 3. The first step is to perform background subtraction, followed by a feature extraction step detailed in section IV. Once the features are extracted, latent factors are learned as detailed in section V, where nonparametric factor analysis is applied to recover the decomposition into factors (motion patterns) and their constituent factor weights. For each latent factor a rareness score is derived based on its overall contribution to the scene, and the factors are sorted in decreasing order of rareness. Since we follow a part-based decomposition approach for

Fig. 3. Foreground extraction and feature computation using rank-1 robust PCA: n frames form the data matrix M; rank-1 robust PCA yields the sparse foreground S; S is divided into sz × sz blocks, and the n blocked raw motion features are summarized into a single motion feature vector X.

scene understanding, each latent factor is a sparse image having the same dimensions as the original video frame. Therefore, a query for rare events at a spatial location can directly 'interact' with the latent factors. The user is then able to select a proportion of rare factors for consideration. Based on the rareness degree of each latent factor, we can return the corresponding footage to the user. We now describe these steps in detail.

IV. ROBUST FOREGROUND EXTRACTION AND DATA REPRESENTATION

Since our framework focuses on scene understanding, features are extracted directly from the foreground information. To do so, we require a robust foreground extraction algorithm which can operate incrementally and in real time. To this end, we utilize a recently proposed robust PCA approach [13], a special case of robust PCA theory [15], [16] developed specifically for static surveillance cameras. Given a short time window of size n, let M = [M1, M2, . . . , Mn] be the data matrix consisting of n consecutive frames; the goal of robust PCA is to decompose

M = L + S    (1)

where L is a low-rank matrix and S is a sparse matrix. The standard approach to robust PCA is principal component pursuit (PCP) [15], which involves an SVD at each optimization iteration and can be very costly to compute. A static camera, however, exhibits a strong rank-1 characteristic, since the background remains unchanged within a short time window. Under this assumption, an efficient algorithm for rank-1 robust PCA can be developed, which is shown to reduce to a robust version of the temporal median filter [13]. This makes the foreground extraction, contained in S, extremely efficient¹, since it avoids the costly SVD computation in the original formulation of [15]. Moreover, it can operate incrementally in real time. Next, using the sparse matrix S, a fixed sz × sz block grid is superimposed and the foreground count in each cell is accumulated to form a feature vector X summarizing the data matrix M over a short time window of size n. An illustration of this step is shown in Figure 3.

¹In practice, it is noted to be 10-20 times faster than a standard optical flow implementation.
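The following is a minimal sketch of this step, assuming a stack of grayscale frames and using a plain temporal median as a stand-in for the rank-1 robust PCA of [13]; the threshold and block size are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def foreground_features(frames, block=20, thresh=25):
    """frames: array of shape (n, H, W) holding n consecutive grayscale frames.
    Returns a block-count feature vector X summarizing the window."""
    # Rank-1 assumption: the background is (approximately) the temporal median,
    # a robust stand-in for the low-rank part L of M = L + S.
    background = np.median(frames, axis=0)
    # Sparse foreground S: pixels that deviate strongly from the background.
    S = (np.abs(frames - background) > thresh).astype(np.uint8)
    n, H, W = S.shape
    # Superimpose a block x block grid and accumulate foreground counts per cell.
    Hb, Wb = H // block, W // block
    S = S[:, :Hb * block, :Wb * block]
    counts = S.reshape(n, Hb, block, Wb, block).sum(axis=(0, 2, 4))
    return counts.ravel()  # one motion feature vector X for this window

# Example with synthetic data: a 200-frame window of 480x720 frames.
frames = np.random.randint(0, 256, size=(200, 480, 720)).astype(np.float32)
X = foreground_features(frames)
print(X.shape)  # (864,) = (480/20) * (720/20) block counts
```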

V. LEARNING OF LATENT FACTORS

Recall that for each short time window a foreground feature X is collected. Let X = [X1 X2 . . . XT] be the feature matrix over T such collections. Our next goal is to learn latent factors from X, each of which represents a 'part' or basis unit that constitutes our scene. Following a part-based decomposition approach, a straightforward choice is the nonnegative matrix factorization of [14], which factorizes X as

X ≈ WH    (2)

where W and H are nonnegative matrices, the columns of W contain K latent factors and H contains the corresponding coefficients of each factor's contribution to the original data in X. Due to the nonnegativity of H, a part-based or additive decomposition is achieved and each column of X is represented by Xj = Σ_{k=1}^{K} Wk Hkj. However, a limitation of NMF for our framework is that it requires a manual specification of the number of latent factors K in advance. This can severely limit the applicability of the proposed framework since such knowledge of K is very difficult to obtain. To address this issue, we employ recent advances in Bayesian nonparametric factor analysis, which can automatically infer the number of latent factors from the data [17], [18]. In particular, we use a recent work that models count data using a Poisson distribution (see [1] for details). For the sake of completeness, we briefly describe it here. A nonparametric Bayesian factor analysis can be written as follows:

X = W (Z ⊙ F) + E    (3)

wherein ⊙ denotes the Hadamard (element-wise) product and Z is a newly added binary matrix whose nonparametric prior distribution follows the Indian Buffet Process (IBP) [2]. Its binary values indicate the presence or absence of a factor (i.e. a column of the matrix W), and the matrix F contains the coefficients. Formally, Zkn = 1 implies that the k-th factor is used while reconstructing the n-th data vector, i.e. the n-th column of the matrix X.
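As a concrete illustration of the part-based decomposition in Equation 2 (the paper ultimately uses the nonparametric model of Equation 3 rather than plain NMF), the following sketch factorizes a toy motion-feature matrix with scikit-learn; the matrix sizes and the value of K are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy motion-feature matrix X: rows are block counts, columns are time windows.
rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(864, 300)).astype(float)  # 864 blocks x 300 windows

K = 40  # with plain NMF, K must be fixed in advance (the limitation noted above)
model = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)   # 864 x K: each column is a latent factor ("part")
H = model.components_        # K x 300: contribution of each factor to each window

# Each window is approximately an additive combination of the K factor images:
# X[:, j] ~ sum_k W[:, k] * H[k, j]
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative reconstruction error
```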

Fig. 4. Example of spatial-temporal browsing. A: the user drags two rectangles, red and blue, to search for events in the red one followed by events in the blue one. B, C: detected frames corresponding to the red/blue regions. Here the system caught a pedestrian whose motion direction is consistent with these zones.

In this nonparametric model, Z is modeled through a draw from a beta process, which allows infinitely many factors. Given the data, the number of active factors² is automatically discovered by the inference procedure. The distributions on the parameters W and F of the above nonparametric model are as follows:

Wmk ∼ Gamma(aw, bw),   Fi ∼ Π_{k=1}^{K} Gamma(aF, bF)    (4)

where aw, bw, aF and bF are the shape and scale parameters. Similarly, given the parameters, the data is modeled using a Poisson distribution in the following manner:

Xi | W, Zi, Fi ∼ Poisson(W (Zi ⊙ Fi) + λ)    (5)

where λ is a parameter which expresses the modeling error E such that Emn ∼ Poisson(λ). Gibbs sampling is used to infer W and F. Since the conditional posteriors are intractable, auxiliary variables are introduced to make the inference tractable. For example, the Gibbs update equation for the i-th row of W, denoted by W(i), is given as:

p(W(i) | Z, F, X, λ, s) ∝ Π_{k=1}^{K} (Wik)^{aw + Σ_{j=1}^{T} s_j^{ik} − 1} exp(−(bw + Σ_{j=1}^{T} Hkj) Wik)    (6)

where the auxiliary variables s = {s_j^{ik}}_{k=1}^{K+1} can be sampled from a multinomial distribution for each j ∈ {1, . . . , T} satisfying Σ_{k=1}^{K+1} s_j^{ik} = Xij:

p(s_j^{i1}, . . . , s_j^{iK}, s_j^{i(K+1)} | ·) ∝ (Xij! / Π_{k=1}^{K+1} s_j^{ik}!) Π_{k=1}^{K} (Wik Hkj)^{s_j^{ik}} λ^{s_j^{i(K+1)}}    (7)

The matrices F and Z can also be sampled in a similar manner, as proposed in [18].

²E.g., the k-th factor is an active factor if the k-th row of the matrix Z has at least one non-zero entry.
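For intuition, here is a minimal forward-simulation sketch of Equations 3-5, using the standard finite (truncated) approximation of the IBP prior on Z; the truncation level, hyperparameter values and sizes are illustrative assumptions, and no inference is performed.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, K = 864, 300, 40      # feature dimension, time windows, truncation level
aw = bw = aF = bF = 1.0     # illustrative Gamma hyperparameters
lam = 0.1                   # modeling-error rate (Poisson noise)
alpha = 2.0                 # IBP concentration

# Finite approximation of the IBP / beta-process prior on the binary matrix Z:
# pi_k ~ Beta(alpha/K, 1), Z[k, i] ~ Bernoulli(pi_k).
pi = rng.beta(alpha / K, 1.0, size=K)
Z = (rng.random((K, T)) < pi[:, None]).astype(int)

W = rng.gamma(shape=aw, scale=bw, size=(M, K))   # factor loadings (Eq. 4)
F = rng.gamma(shape=aF, scale=bF, size=(K, T))   # factor coefficients (Eq. 4)

# Observation model (Eqs. 3 and 5): X ~ Poisson(W (Z ⊙ F) + lambda)
rate = W @ (Z * F) + lam
X = rng.poisson(rate)

active = (Z.sum(axis=1) > 0).sum()   # factors with at least one non-zero row entry
print(X.shape, "active factors:", active)
```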

A. Browsing Functionalities

Using the latent factors learned in the previous steps, we propose the following functionalities for our system.

1) Discovering Rare Factors and Footages: For each factor Wk among the K factors discovered in the previous step, we define a score to measure its rareness based on its overall contribution to the scene. Since Xj = Σ_k Wk Hkj, it is clear that Hkj is the contribution of factor Wk to the reconstruction of Xj, hence Σ_j Hkj is its overall contribution to X. We define the rareness score of a factor as a function reciprocal to this quantity:

r-score(Wk) = −log(Σ_j Hkj)    (8)

In our system, we rank the scores of the factors learned in section V using Equation 8 and allow the user to interactively choose the percentage α of rare factors to be displayed and interacted with (cf. Figure 4 and Figure 2.A). The list of footages associated with a factor is also returned to the user (cf. Figure 2.G); denoting by I(Wk) the corresponding index set, we have

I(Wk) = {j | Hkj > ε, j = 1, . . . , T}    (9)

where ε is a small threshold, mainly used for the stability of the algorithm. Further, let Kα be the collection of all rare factors; then the index set of all detected footages is easily seen to be

⋃_{W ∈ Kα} I(W)    (10)
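Continuing with a factor-coefficient matrix H like the one produced by the NMF sketch above, the following shows one way Equations 8-10 could be computed; the threshold ε and percentage α are illustrative, and H stands in for the coefficients of whichever factor model is used.

```python
import numpy as np

def rare_factor_index_sets(H, alpha=0.10, eps=1e-3):
    """H: (K, T) factor-coefficient matrix. Returns the rare-factor indices
    (top alpha fraction by r-score) and their footage index sets."""
    # Equation 8: r-score(W_k) = -log(sum_j H_kj); a larger score means a rarer factor.
    usage = H.sum(axis=1)
    r_score = -np.log(usage + 1e-12)          # small constant guards against log(0)
    order = np.argsort(-r_score)              # factors in decreasing order of rarity
    n_rare = max(1, int(np.ceil(alpha * H.shape[0])))
    K_alpha = order[:n_rare]                  # indices of the chosen rare factors
    # Equation 9: I(W_k) = {j : H_kj > eps}, the windows that use factor k.
    index_sets = {int(k): np.flatnonzero(H[k] > eps) for k in K_alpha}
    # Equation 10: union of the footage index sets of all rare factors.
    detected = np.unique(np.concatenate(list(index_sets.values())))
    return K_alpha, index_sets, detected

H = np.random.default_rng(2).gamma(1.0, 1.0, size=(40, 300))
K_alpha, index_sets, detected = rare_factor_index_sets(H)
print(len(K_alpha), "rare factors;", detected.size, "candidate footages")
```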

2) Spatial Searching: Given a spatial region of interest R input to the system, spatial filtering of rare events can be carried out efficiently by analyzing the intersection between the region R and the set of rare factors. First we extend R to R′, of the full size of the video frame, by zero padding, and mask it with each rare factor W, which is selected if the resultant matrix is non-zero. Let SPα(R) be the set of output indices returned; then, formally,

SPα(R) = ⋃_{W ∈ SPF(Kα, R)} I(W)

where

SPF(Kα, R) = {W | W ∈ Kα, ||W ⊙ R′||0 > 0}

Here α is the percentage of rareness chosen in section V-A1, ⊙ is element-wise multiplication, and ||A||0 is the l0-norm, which counts the number of non-zero elements in the matrix A. A demonstration of this browsing capability is shown in Figure 4: the security officer can scrutinize the red rectangle region in the left window and inspect, in the right panel, unusual events such as a person crossing the street illegally.

3) Spatial-temporal Searching: More significantly, spatio-temporal search criteria are included in our model, as shown in Figure 4. The semantics can be understood as "show me the events here (red rectangle) followed by the events there (blue rectangle)", with the temporal constraint set to within ∆t seconds. Once again our filters reduce the frame data to potential candidates for rare frames. Initially, the user indicates a queue of regions of interest. For the purpose of description, we restrict this to two regions, say the red and blue rectangles; the spatial search of the previous section is applied to both rectangles, and those output patterns form the input to this process. Formally,

STα(R1, R2) = {(i, j) | i ∈ SPα(R1), j ∈ SPα(R2), |i − j| < ∆t}    (11)

A typical illustration of this search category can be found in Figure 4.
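A minimal sketch of these two filters follows, assuming each factor Wk has been reshaped to the frame's block grid and reusing footage index sets I(Wk) as in the previous sketch; the region coordinates, grid size and ∆t are illustrative assumptions.

```python
import numpy as np

def spatial_filter(factors, index_sets, region_mask):
    """factors: dict {k: (Hb, Wb) factor image on the block grid}.
    index_sets: dict {k: array of window indices I(W_k)}.
    region_mask: binary (Hb, Wb) mask R' (zero-padded region of interest).
    Returns SP_alpha(R): indices of windows whose rare factors overlap R'."""
    hits = set()
    for k, Wk in factors.items():
        # ||W ⊙ R'||_0 > 0: the factor has foreground energy inside the region.
        if np.count_nonzero(Wk * region_mask) > 0:
            hits.update(index_sets[k].tolist())
    return np.array(sorted(hits))

def spatio_temporal_filter(factors, index_sets, mask1, mask2, delta_t=50):
    """ST_alpha(R1, R2) of Eq. 11: pairs (i, j) with i in SP(R1), j in SP(R2),
    and |i - j| < delta_t (indices of consecutive time windows)."""
    sp1 = spatial_filter(factors, index_sets, mask1)
    sp2 = spatial_filter(factors, index_sets, mask2)
    return [(int(i), int(j)) for i in sp1 for j in sp2 if abs(int(i) - int(j)) < delta_t]

# Toy example on a 24 x 36 block grid with two hypothetical rare factors.
rng = np.random.default_rng(3)
factors = {0: rng.random((24, 36)) * (rng.random((24, 36)) > 0.95),
           7: rng.random((24, 36)) * (rng.random((24, 36)) > 0.95)}
index_sets = {0: np.array([12, 13, 40]), 7: np.array([41, 90])}
mask1 = np.zeros((24, 36))
mask1[5:10, 5:10] = 1      # red rectangle R1
mask2 = np.zeros((24, 36))
mask2[5:10, 20:30] = 1     # blue rectangle R2
print(spatio_temporal_filter(factors, index_sets, mask1, mask2))
```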

VI. RESULTS AND DEMONSTRATION

Fig. 5. The 40 factors learned from the MIT dataset, shown from top-left to bottom-right in increasing order of their 'rareness'.

We demonstrate the proposed system using the MIT dataset [19]. In this public dataset, a traffic scene is recorded by a static camera, capturing traffic flows such as trucks, cars, pedestrians and bicycles, as well as other noisy motions such as leaves flickering in the wind. These objects generate various motion patterns in the intersection area of the traffic scene. The image dimension is 480×720 pixels per frame, as shown in Figure 1. More interestingly, as mentioned in section IV, the static camera possesses the rank-1 property, the necessary condition for our background subtraction task. For the motion feature extraction stage, we choose a block size sz of 20×20 and a window of n = 200 consecutive frames. To deal with the matrix factorization problem when the number of latent factors is not known beforehand, a trivial solution would be to conduct model selection, in which the number of latent factors K is varied; we restricted this range to between 20 and 56 in our experiments. Instead, the parameter K is estimated as 40 from our nonparametric output (Figure 5). From the 40 learned patterns, we sort all factors in increasing order of rareness, as explained in section V-A1. For example, three candidate common factors and three rare factors are shown in Figure 1.

We establish the browsing paradigm by allowing the user to restrict the search region using spatial and spatial-temporal criteria. One typical example is presented in Figure 4. The user draws two regions, the red and blue rectangles, to investigate which patterns are followed by others in those windows. The system automatically detects suitable candidate patterns in those regions with regard to the proportion of rareness the user is querying. From the candidate factors, we trace back to all the consecutive frames and clips associated with the selected factors. Then the most appropriate events are discovered following Equation 11. In Figure 4, people who cross the zebra crossing (red rectangle) and turn right (blue rectangle) are caught by our system.

Fig. 6. Example of a false positive detection. Heavy traffic over 200 consecutive frames with large motion energy sometimes affects the system; the user can deal with this successfully through the user interface.

One false positive abnormality detection in our framework was recorded. Because of the heavy traffic flow during a period of n = 200 consecutive frames in the selected rectangle, the system treats it as an abnormal episode; when the user draws a spatial query in this area, the system returns this flow as a possible candidate for abnormality. Fortunately, the user can control the rareness level and adjust it to match what is semantically a true abnormality in the scene. For a given rareness rate, multiple patterns and clips are discovered, so the user can decide which ones are real incidents. Thus, our proposed framework successfully bridges the semantic gap between the statistical perspective and human perception.

The focus of our paper is on interactively browsing abnormal activities locally in a scene. There is no existing interactive system available for comparison. Moreover, the difficulty in evaluating our experimental results for interactivity is that there is no ground truth that can satisfy all possible spatial and temporal user queries, because the user can examine different locations (top left, bottom right, or the middle region) with query areas of different sizes and different time intervals. For that reason, we refer the reader to our previous paper [9] for a quantitative evaluation of the abnormality detection approach. The system is built in C# and Matlab and runs on a PC with an Intel Core i7 3.4 GHz CPU and 8 GB RAM; a query takes less than approximately 0.2 seconds, as the motion feature extraction step is pre-processed. As mentioned, rare patterns are ultimately a matter of human perception, so we select roughly p = 10% as the default proportion of rare events; the user can slide the bar to alter this proportion according to their interests.

VII. CONCLUSION AND FUTURE WORK

In this paper, we introduced a novel video surveillance browsing scheme for examining rare events. The motion features are computed by rank-1 robust PCA. We learn the hidden factors utilizing recent advances in nonparametric factor analysis. The anomalous and rare events are detected in an unsupervised manner and can be filtered interactively. We established the browsing paradigm with spatial and spatial-temporal queries to overcome the limitations of purely computational processing. To accommodate the model automatically to growing data for real-time purposes, the nonparametric framework is necessary, as it allows the model complexity to grow with the data. Our future work will aim to develop an advanced system utilizing multi-core processors, with enhanced techniques for motion extraction that avoid pre-processing stages, in accordance with the nonparametric approach.

REFERENCES

[1] S. Gupta, D. Phung, and S. Venkatesh, "A nonparametric Bayesian Poisson Gamma model for count data," in Proceedings of the International Conference on Pattern Recognition (ICPR), Japan, November 2012.
[2] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," Advances in Neural Information Processing Systems, vol. 18, p. 475, 2006.
[3] Y. Meng, M. Dunham, F. Marchetti, and J. Huang, "Rare event detection in a spatiotemporal environment," in Granular Computing, 2006 IEEE International Conference on. IEEE, 2006, pp. 629–634.

[4] T. V. Duong, H. H. Bui, D. Phung, and S. Venkatesh, "Activity recognition and abnormality detection with the Switching Hidden Semi-Markov Model," in IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 1. San Diego: IEEE Computer Society, 20-26 June 2005, pp. 838–845.
[5] T. Duong, D. Phung, H. Bui, and S. Venkatesh, "Human behavior recognition with generic exponential family duration modeling in the hidden semi-Markov model," in International Conference on Pattern Recognition, vol. 3, Hong Kong, 2006, pp. 202–207.
[6] S. Budhaditya, D. S. Pham, M. Lazarescu, and S. Venkatesh, "Effective anomaly detection in sensor networks data streams," in Proc. ICDM. IEEE, 2009, pp. 722–727.
[7] H. Bui, D. Phung, S. Venkatesh, and H. Phan, "The hidden permutation model and location-based activity recognition," in Proc. of the National Conference on Artificial Intelligence (AAAI), Chicago, USA, July 2008.
[8] L. Kratz and K. Nishino, "Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1446–1453.
[9] T. V. Nguyen, D. Phung, S. Rana, D. S. Pham, and S. Venkatesh, "Multimodal abnormality detection in video with unknown data segmentation," in Pattern Recognition (ICPR), 2012 21st International Conference on. Tsukuba, Japan: IEEE, November 2012, pp. 1322–1325.
[10] M. Beal, Z. Ghahramani, and C. Rasmussen, "The infinite hidden Markov model," in Advances in Neural Information Processing Systems (NIPS), vol. 1. MIT, 2002, pp. 577–584.
[11] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, "Using adaptive tracking to classify and monitor activities in a site," in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, 1998, pp. 22–29.
[12] L. Xu, J. Jia, and Y. Matsushita, "Motion detail preserving optical flow estimation," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1293–1300.
[13] D.-S. Pham, S. Rana, D. Phung, and S. Venkatesh, "Generalized median filtering - a robust matrix decomposition perspective," IEEE Trans. on Image Processing, (under review).
[14] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," Advances in Neural Information Processing Systems, vol. 13, 2001.
[15] E. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis," arXiv preprint arXiv:0912.3599, 2009.
[16] A. Eriksson and A. van den Hengel, "Efficient computation of robust low-rank matrix approximations in the presence of missing data using the l1 norm," in Proc. of the IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2010 (best paper award).
[17] J. Paisley and L. Carin, "Nonparametric factor analysis with beta process priors," in Proc. of the International Conference on Machine Learning (ICML). ACM, 2009, pp. 777–784.
[18] Y. Teh, D. Gorur, and Z. Ghahramani, "Stick-breaking construction for the Indian buffet process," in Proc. of the Int. Conf. on Artificial Intelligence and Statistics (AISTATS), vol. 11, 2007.
[19] X. Wang, X. Ma, and W. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), pp. 539–555, 2008.