A Prototype for Anomaly Detection in Video Surveillance Context

F. Persia¹, D. D'Auria¹, G. Sperlì¹, and A. Tufano²

¹ Dipartimento di Ingegneria Elettrica e Tecnologie dell'Informazione, University of Naples Federico II, Via Claudio 21, 80125, Naples, Italy
{fabio.persia,daniela.dauria4,giancarlo.sperli}@unina.it
² Università Telematica Pegaso
[email protected]
Abstract. Security has been raised at major public buildings in the most famous and crowded cities all over the world following the terrorist attacks of recent years, the latest being the one at the Bardo museum in the centre of Tunis. For that reason, video surveillance systems have become more and more essential for detecting and hopefully even preventing dangerous events in public areas. In this paper, we present a prototype for anomaly detection in the video surveillance context. The whole process is described, starting from the video frames captured by sensors/cameras up to the application of well-known reasoning algorithms for finding potentially dangerous activities. The conducted experiments confirm the efficiency and the effectiveness achieved by our prototype.

Keywords: Video Surveillance, Anomaly Detection, Activity Detection, Unexplained Activities
1 Introduction
In recent years, the modern world's need for safety has caused a rapid spread of video surveillance systems, which are deployed especially in the most crowded places. The main purpose of a video surveillance system is to provide automatic tools that extend the faculties of human perception, allowing the collection and real-time analysis of data coming from a large number of electronic "viewers" (sensors, cameras, etc.). One of the main limits of modern security systems is that most of them have been designed for specific functionalities and contexts: they generally use a single kind of sensor (such as cameras, motes, or scanners), which cannot capture all the possibly relevant phenomena connected to the observed context. A second and not negligible limit is that the "semantics" of the phenomena (events) that such systems can detect is quite limited and, likewise, these systems are not very flexible when new events to be identified have to be introduced. For example, a typical video surveillance system at a tunnel mouth uses a set of cameras monitoring train transit and the possible presence of objects in the
scene. When a person walks on the tracks, we want the system to automatically identify the anomalous event and signal it to a keeper. The commonest image processing algorithms (which can be implemented directly on a camera processor or hosted on a dedicated server that processes the information sent by a camera) can quite precisely identify the changes between one frame and the next and thereby discover the possible presence of anomalies (train transit, presence of a person, ...) in the scene. However, in the scene analysis such a system does not consider all the environmental parameters, such as brightness, temperature and so on, nor how these parameters can affect the detections (identifying a small object in the scene is more complex at night); likewise, it cannot identify higher-level semantic events (such as a package left near a track) with the same precision and reliability. Similarly, a traditional video surveillance system can discover, in a bank, the presence of objects near the safe, but cannot automatically notify an event of interest for the context, such as a "bank robbery": if an alarm were generated every time an object is near the safe, false alarms would also be raised when the bank clerk enters the safe room to take some money.

In the end, we expect a modern video surveillance system to meet the following requirements: it has to integrate heterogeneous information coming from different kinds of sensors, to be flexible enough to discover all possible events that can happen in the monitored environment, and to be adaptable to the context features of the observed scene. From a technological point of view, the main requirements of this kind of system are: heterogeneity of the adopted sensing devices, heterogeneity of the detected data and of the data to be processed, wiring of devices, and communication with the servers dedicated to processing.

In this paper, we present a prototype of a framework for anomaly detection in the video surveillance context. The whole process is described: we start from the video frames captured by sensors/cameras and then, after several steps, we apply some well-known reasoning algorithms [1, 2] for finding high-level unexplained activities in time-stamped observation data.

The paper is organized as follows. Section 2 deals with the evolution and the basic architecture of video surveillance systems. Section 3 describes in detail the proposed prototype for anomaly detection in the video surveillance context. Section 4 presents some experiments using real-world datasets. Eventually, Section 5 discusses some conclusions and possible future improvements.
2 State of the Art
The video surveillance systems proposed in the literature can be classified into three categories (or generations [3], [4]) from a technological point of view. The three generations, in fact, have followed the evolution of communication techniques, of image processing and of data storage, and they have been evolving as rapidly as these techniques.
2.1 Evolution of Surveillance Systems
First Generation Surveillance Systems (1GSS) extend the human perception capability from a spatial point of view: a set of cameras (sensor layer) is used to capture visual signals from different positions in a monitored environment. Such signals are then transmitted in analogue form and visualized by operators in a single location (the control room). The most considerable disadvantage of 1GSS is the operators' short attention span, which is responsible for a high rate of missed recordings of important events [5].

From the early 80's, because of the increasing interest of research in video processing and in order to improve the basic technologies in this area, a noticeable improvement in camera resolution was obtained and, at the same time, a reduction of hardware (computers, memories, ...) costs. Most of the research carried out during the Second Generation Surveillance Systems (2GSS) period has fostered the development of automatic techniques called automated event detection. These techniques have made the monitoring of very large areas easier, because they act as pre-filters of defined events [4]. Moreover, 2GSS systems are characterized by a good level of digitalization in signal transmission and processing, in the sense that they include digital components in some parts of their architecture.

The main goal of Third Generation Surveillance Systems (3GSS) is to obtain, manage and efficiently transmit real-time video events detected by a large set of sensors, through a fully digital approach; this approach uses digital components in all layers of the system architecture, from the sensor layer up to the visual and codified presentation of information to operators [3]. In a 3GSS system, cameras communicate with processing and transmission devices: in this way, intelligent cameras are built. A network layer, whose principal component is an intelligent hub, has the purpose of assembling data coming from the different cameras. So, we can say that the goal of an automatic video surveillance system is to act as a pre-filter for the human validation of suspicious events. Such pre-filtering is generally based on video processing and provides some important parameters for object localization and for the tracking of object trajectories within the monitored environment. The following is a first schematization of the logical architecture of a 3GSS video surveillance system:

Sensor Layer. The sensor layer is composed of one or more fixed or mobile cameras, whose purpose is to collect images to be sent to the image processing system.

Image Processing Layer. Images captured by cameras are stored in a specific video database (Video Repository); then, they are sent as input to the IPS system, which processes them. Such a module extracts low-level information (such as the presence of new objects in the scene, their position, and so on) through image processing algorithms; it also converts this information into a format conforming to the syntax used by the higher layers. As far as image processing algorithms are concerned, Dung et al. [13] propose an approach for object detection based on local
Fig. 1: Video surveillance system architecture
invariant features, exploiting binary features, which allow the requirements of a real-time system to be met, as well as color information. Moreover, in order to overcome the object segmentation problem without prior knowledge of the objects, Chaudhary et al. [14] present a method to estimate the same objects through sequential back-tracking, exploiting the affine relationships between consecutive frames. In [15], the authors develop a human detection approach based on a stereo vision system able to extract range images from the foreground, to which human detection techniques based on local features are applied in order to improve the detection performance for occluded humans. Then, an Event Description Language (EDL) has been defined on the basis of the image processing algorithms; through this language, it is possible to formalize a complex event in a strict way. The data, organized in this way, are stored in a specific area of the Database (Data Collection).

Composite Event Detection Layer. A post-processing framework, called HSRC (High Semantic Reasoning Center), is the part of the system responsible for detecting occurrences of complex events, through the processing of the low-level data made available by the IPS. We can identify the following components:

– Event Repository: the part of the Database in which predicates and complex event definitions are stored; information about the event occurrences detected in the video is stored in it, too.
– Data Collection: the part of the Database that collects the IPS framework output, organized according to the EDL language syntax.
– Agent Based Processor (ABP): its main aims are to capture the definition of the events of interest from the Event Repository, to capture the observation, that is, the video description in terms of predicates, from the Data Collection, and to verify the event occurrences during the observation.
– Subvideo Extractor: when an event is detected, this component extracts from the video the frame sequence of interest and saves it in the Video Repository
as a new file; in this way, the sequence is made available for on-line and off-line visualization.
– Query Builder: the component assigned to client service creation; it organizes the parameters on which the behaviour of the ABP processor is based. Management services are built on the capabilities of the language and on the algorithms available in the IPS.

User Layer. The system presents a set of services to final clients through user-friendly interfaces. Such services can be classified into two different categories:

1. Management services: the system manager can define new kinds of primitive and complex events and can extend the image processing algorithm suite of the IPS framework.
2. Client services: the client can specify the working system parameters, for example the alert mode. He can visualize on-line and off-line the video sequences corresponding to the alarms detected by the system; he can also visualize whole stored videos and compute some statistics on the detected event occurrences.

2.2 Related Work
In recent years, many frameworks have been developed to identify anomalies or suspicious events in video sequences. In [16], the authors present a framework for detecting complex events through an inference process based on Markov Logic Networks (MLNs) and rule-based event models. Another approach has been employed by Zin et al. [17], who propose an integrated framework for detecting suspicious behaviors in video surveillance systems exploiting multiple background modeling techniques, high-level motion feature extraction methods and embedded Markov chain models.
3 The Proposed Prototype
Fig. 2: The prototype architecture

Fig. 3: A video frame from ITEA-CANDELA dataset

In this Section, we describe the prototype designed and developed for finding anomalous activities in the video surveillance context. The architecture of the proposed prototype (Figure 2) consists of the following layers: an Image Processing Library, a Video Labeler, an Activity Detection Engine and the Unexplained Activities Problem (UAP) Engine, which implements the algorithms for video anomaly detection. In particular, the Image Processing Library analyzes the video captured by sensors/cameras and returns the low level annotations for each video frame as output; the Video Labeler fills the semantic gap between the low level annotations captured for each frame and the high level annotations, representing high level events that can be associated with the video frames; then, an Activity Detection Engine is used to find activity occurrences matching the well-known models, which can be classified into good and bad ones: such a module takes as inputs the high level annotations previously caught by the Video Labeler and the stochastic activity models; eventually, the UAP (Unexplained Activity Problem) Engine described in [1, 2] takes as input the activity occurrences previously found with the associated probabilities and the high level annotations, and discovers the Unexplained Video Activities.
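To make the data flow among the four layers concrete, the following is a minimal sketch of how they could be chained in Java (the language of our components). All type, method and field names here are hypothetical: they stand in for the prototype's internal APIs, which are not reported in this paper.

    import java.util.List;

    // Hypothetical types standing in for the prototype's internal ones.
    record Annotation(int frame, int objectId, int objectType) {}   // low level (Sect. 3.1)
    record ActionSymbol(String name, long timestamp) {}             // high level (Sect. 3.2)
    record Occurrence(String activity, long start, long end, double probability) {}

    interface ImageProcessingLibrary { List<Annotation> process(String videoPath); }
    interface VideoLabeler { List<ActionSymbol> label(List<Annotation> annotations); }
    interface ActivityDetectionEngine {
        List<Occurrence> detect(List<ActionSymbol> symbols, List<String> activityModels);
    }
    interface UAPEngine {
        // returns the [start, end] timestamps of the unexplained subsequences
        List<long[]> find(List<Occurrence> occurrences, List<ActionSymbol> symbols);
    }

    class Pipeline {
        static List<long[]> run(ImageProcessingLibrary ipl, VideoLabeler labeler,
                                ActivityDetectionEngine ade, UAPEngine uap,
                                String videoPath, List<String> models) {
            List<Annotation> lowLevel = ipl.process(videoPath);         // Image Processing Library
            List<ActionSymbol> symbols = labeler.label(lowLevel);       // Video Labeler
            List<Occurrence> occurrences = ade.detect(symbols, models); // Activity Detection Engine
            return uap.find(occurrences, symbols);                      // UAP Engine
        }
    }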
3.1 The Image Processing Library
The Image Processing Library used in our prototype implementation is the Reading People Tracker (RPT) [7, 8], which achieves a good accuracy in object detection and tracking. RPT takes the frame sequence of the video as input and returns an XML file describing the low level annotations captured in each frame, according to a standard schema defined in an XML Schema. We made only a few updates to the RPT source code, in order to be able to retrieve more easily the type of each object detected in a frame (person, package, car). For instance, figure 4 shows the low level annotations associated with frame 18 (figure 3) of a video belonging to the ITEA-CANDELA dataset³, which has been used to carry out some preliminary experiments. As we can see in figure 4, RPT correctly identifies two objects (represented by the XML elements called track) in the frame shown in figure 3: the former, identified by ID = 5, is a person
³ http://www.multitel.be/~va/candela/abandon.html
Fig. 4: The related low level annotations
(type = 5), while the latter, identified by ID = 100, is a package (type = 6). The XML attribute type of the track element denotes the type of the detected object.
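For illustration, a hand-written sketch of such an annotation fragment is given below; the track elements and the ID and type attributes follow the description above, while the enclosing element and any further attributes are assumptions, since the exact structure is dictated by the RPT XML Schema.

    <!-- hypothetical sketch of the low level annotations for frame 18 -->
    <frame number="18">
      <track ID="5" type="5"/>    <!-- type 5: person -->
      <track ID="100" type="6"/>  <!-- type 6: package -->
    </frame>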
3.2 The Video Labeler
As mentioned before, the Video Labeler fills the semantic gap between the low level annotations captured for each frame and the high level annotations. Through the Video Labeler, some high level events, called action symbols, are detected together with the related timestamps; thus, the output of the Video Labeler is the list of action symbols related to the considered video source. The Video Labeler has been implemented in the Java programming language: it uses the DOM libraries to parse the XML file containing the output of the Image Processing Library. The Video Labeler defines the rules that have to be checked to verify the presence of each high level atomic event of interest in the video. So, a Java method containing the related rules has been defined for each action symbol we want to detect (a minimal sketch of such a method is given after the list). Listed below are some examples of the rules defined to detect some atomic events (action symbols) in a video belonging to the ITEA-CANDELA dataset.

Action Symbol A: A person P goes into the central zone with the package
– There are at least two objects in the current frame
– At least one of the objects is a person
– At least one of the objects is a package
– The person identified appears on the scene for the first time
– The distance between the person's barycenter and the package's one is smaller than a specific distance threshold

Action Symbol B: A person P drops off the package
– There are at least two objects in the current frame
– At least one of the objects is a person
– At least one of the objects is a package
– The person was previously holding a package
– The distance between the person's barycenter and the package's one is smaller than a specific distance threshold

Action Symbol C: A person P goes into the central zone
– There is at least one object in the current frame
– At least one of the objects is a person
– The person identified appears on the scene for the first time
– If there are also some packages on the scene, their distances are greater than a specific distance threshold

Action Symbol D: A person P picks up the package
– There are at least two objects in the current frame
– At least one of the objects is a person
– At least one of the objects is a package
– The distance between the person's barycenter and the package's one is smaller than a specific distance threshold
– The person was not previously holding a package

Action Symbol E: A person P1 gives the package to another person P2
– There are at least three objects in the current frame
– At least two of the objects are persons
– At least one of the objects is a package
– P1 was previously holding a package
– In the current frame, both the distances of P1's and P2's barycenters from the package are smaller than a specific distance threshold
– In the next frames, P1's distance from the package is greater than the threshold, while P2's one is smaller (it means that P2 has got the package and P1 is not holding it anymore)

Action Symbol F: A person P goes out of the central zone with the package
– This symbol is detected when a person holding a package does not appear on the scene anymore for a specified TTL
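As an example, the following is a minimal sketch of how the rule-check method for Action Symbol A could look. The type codes for persons and packages are those of Section 3.1; the Track record, the distance threshold value and the bookkeeping of already-seen persons are illustrative assumptions, not the actual Video Labeler code.

    import java.util.List;
    import java.util.Set;

    class ActionSymbolRules {
        // Assumed representation of a tracked object with its barycenter.
        record Track(int id, int type, double cx, double cy) {}

        static final int PERSON = 5, PACKAGE = 6;    // type codes, cf. Sect. 3.1
        static final double DIST_THRESHOLD = 50.0;   // pixels; illustrative value

        // Action Symbol A: a person appearing for the first time is close
        // enough to a package in a frame containing at least two objects.
        static boolean isSymbolA(List<Track> frame, Set<Integer> seenPersons) {
            if (frame.size() < 2) return false;                  // rule 1
            for (Track p : frame) {
                if (p.type() != PERSON) continue;                // rule 2
                if (seenPersons.contains(p.id())) continue;      // rule 4
                for (Track k : frame) {
                    if (k.type() != PACKAGE) continue;           // rule 3
                    double d = Math.hypot(p.cx() - k.cx(), p.cy() - k.cy());
                    if (d < DIST_THRESHOLD) return true;         // rule 5
                }
            }
            return false;
        }
    }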
3.3 The Activity Detection Engine
An Activity Detection Engine is able to find activity occurrences matching the well-known models: such a module takes as inputs the list of action symbols previously caught by the Video Labeler and the stochastic activity models, and finally returns the list of the discovered activity occurrences with the related probabilities. To reach this goal, a specific software called tMAGIC, which is the implementation of the theoretical model presented in [9], has been used. As a matter of fact, the approach of [9] addresses the problem of efficiently detecting occurrences of high-level activities from such interleaved data streams. In this approach, a temporal probabilistic graph has been proposed, so that the elapsed time between observations also plays a role in defining whether a sequence of observations constitutes an activity. First, a data structure called temporal multiactivity graph, which stores multiple activities that need to be concurrently monitored, has been proposed. Then, an index called Temporal Multi-Activity Graph Index Creation (tMAGIC) has been defined which, based on this data structure, examines and links observations as they occur. Algorithms for insertion and bulk insertion into the tMAGIC index are also defined, showing that this can be efficiently accomplished. In this approach, the algorithms basically solve two problems: the evidence problem, which tries to find all occurrences of an activity (with probability over a threshold) within a given sequence of observations, and the identification problem, which tries to find the activity that best matches a sequence of observations. Some complexity-reducing restrictions and pruning strategies are introduced to make the problem, which is intrinsically exponential, linear in the number of observations. It is demonstrated that tMAGIC has time and space complexity linear in the size of the input, and can efficiently retrieve instances of the monitored activities. Moreover, this Activity Detection Engine has also been exploited in other works belonging to different contexts, such as [10–12].
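To give an intuition of the underlying model (a drastically simplified sketch of our own, not the tMAGIC index itself), an activity can be seen as a graph whose nodes are action symbols and whose edges carry a maximum allowed time gap and a transition probability, as in Figure 5; an occurrence is a time-stamped symbol sequence whose consecutive pairs match edges within their gaps, and its probability is the product of the traversed edge probabilities. All names below are ours, for illustration only.

    import java.util.*;

    class ActivityModelSketch {
        // Edge of the temporal probabilistic graph: "from" may be followed
        // by "to" within maxGap time units, with probability prob.
        record Edge(String from, String to, long maxGap, double prob) {}

        private final Map<String, List<Edge>> out = new HashMap<>();

        void addEdge(String from, String to, long maxGap, double prob) {
            out.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(from, to, maxGap, prob));
        }

        // Probability that a time-stamped symbol sequence is an occurrence
        // of this activity; 0 if some step matches no edge within its gap.
        double occurrenceProbability(List<String> symbols, List<Long> times) {
            double p = 1.0;
            for (int i = 0; i + 1 < symbols.size(); i++) {
                Edge matched = null;
                for (Edge e : out.getOrDefault(symbols.get(i), List.of()))
                    if (e.to().equals(symbols.get(i + 1))
                            && times.get(i + 1) - times.get(i) <= e.maxGap())
                        matched = e;
                if (matched == null) return 0.0;
                p *= matched.prob();          // multiply transition probabilities
            }
            return p;
        }
    }

Under this reading, the evidence problem amounts to keeping the occurrences whose probability exceeds the threshold; tMAGIC achieves this in linear time with its index rather than with a naive check like the one above.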
3.4 The UAP Engine
The UAP (Unexplained Activity Problem) Engine takes as input the activity occurrences previously found by the Activity Detection Engine with the associated probabilities and the list of the detected action symbols, and finally discovers the Unexplained Video Activities, that is, subsequences of the video source which are not sufficiently explained with a certain confidence by the activity models and that could thus be potentially dangerous. Such a module is based on the concept of possible worlds, has been developed in the Java programming language and provides the implementations of the theoretical algorithms FindTUA and FindPUA [1, 2].
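The possible-worlds machinery of FindTUA and FindPUA is beyond the scope of this paper, but the flavour of the output can be conveyed with a toy sketch of our own (this is not the algorithm of [1, 2]): given, for each position of the symbol sequence, the best probability with which any activity model explains it, flag the maximal stretches of at least L positions whose best explanation stays below the threshold τ.

    import java.util.ArrayList;
    import java.util.List;

    class UnexplainedSketch {
        // bestExplanation[t]: highest probability with which any activity
        // occurrence covers position t (assumed precomputed from the
        // Activity Detection Engine output).
        static List<int[]> findUnexplained(double[] bestExplanation, double tau, int minLength) {
            List<int[]> result = new ArrayList<>();
            int start = -1;
            for (int i = 0; i <= bestExplanation.length; i++) {
                boolean low = i < bestExplanation.length && bestExplanation[i] < tau;
                if (low && start < 0) start = i;                  // stretch opens
                if (!low && start >= 0) {                         // stretch closes
                    if (i - start >= minLength) result.add(new int[]{start, i - 1});
                    start = -1;
                }
            }
            return result;
        }
    }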
4 Experimental Evaluation
We generated a video by concatenating multiple videos from the ITEA-CANDELA dataset, a publicly available dataset depicting a number of staged package exchanges and object drop-offs and pick-ups. We evaluated precision and recall against a ground truth provided by human annotators. The annotators were informed about the known activities by providing them with a graphical representation of the activity models (see Figure 5). They were asked to watch the video and identify the video segments where totally (resp. partially) unexplained activities occurred. Figure 6 shows the processing time of FindTUA and FindPUA as a function of the length of the video. Note that both axes are on a logarithmic scale. It is clear that both algorithms run in time linear in the length of the video, and significantly outperform naive algorithms that do not exploit the optimization properties described in [1].
Fig. 5: Example of stochastic activity: ATM deposit
Fig. 6: Processing times
In order to assess accuracy, we compared the output of our algorithms against the ground truth provided by 8 human annotators who were taught the meaning of the graphical representations of the activities in A (the set of graphs representing our knowledge base) (e.g., Figure 5). They were asked to identify the totally and partially unexplained activities w.r.t. A. We ran FindTUA and FindPUA with values of the probability threshold τ ranging from 0.4 to 0.8, looking for all totally and partially unexplained activities (the minimum length L of an unexplained sequence was set to 200). We use $\{S_i^a\}_{i \in [1,m]}$ to denote the unexplained sequences returned by our algorithms and $\{S_j^h\}_{j \in [1,n]}$ to denote the sequences flagged as unexplained by the human annotators. Precision and recall were computed as follows:

$$P = \frac{|\{S_i^a \mid \exists S_j^h \text{ s.t. } S_i^a \approx S_j^h\}|}{m} \quad (1)$$

and

$$R = \frac{|\{S_j^h \mid \exists S_i^a \text{ s.t. } S_i^a \approx S_j^h\}|}{n} \quad (2)$$
where $S_i^a \approx S_j^h$ means that $S_i^a$ and $S_j^h$ overlap by a percentage no smaller than 75%. Precision and recall for τ = 0.4, 0.6, 0.8 are shown in Tables 1a and 1b: we can easily notice that the higher the probability threshold, the higher the precision and the lower the recall, and vice versa. That is exactly what we reasonably expected. In summary, we can say that our framework achieved a good accuracy.

(a) FindTUA
τ     Precision  Recall
0.4   62.5       89.17
0.6   66.67      82.5
0.8   72.22      71.67

(b) FindPUA
τ     Precision  Recall
0.4   59.65      77.38
0.6   64.91      74.6
0.8   70.18      71.83

Table 1: Precision and recall values
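As an illustrative reading of Eq. (1) (the values of m and n are not reported here, so the figures below are ours): if FindTUA returned m = 8 unexplained sequences and 5 of them overlapped some human-flagged sequence by at least 75%, the precision would be P = 5/8 = 62.5%, the value observed for τ = 0.4 in Table 1a.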
5 Conclusions and Future Work
This work presented a framework for anomaly detection in the video surveillance context. In more detail, we started by describing how the video frames are captured by sensors/cameras and then analyzed; we then showed the different steps applied in order to finally discover high-level activities which are not sufficiently explained by the well-known activity models and that could be potentially dangerous in the video surveillance context. Future work will be devoted to comparing this framework with others, which can be built, for instance, by replacing the components used at each layer with others either already well-known in the literature or specifically designed and developed following innovative approaches. For instance, we plan to try another Image Processing Library, which would hopefully improve the overall effectiveness of the framework and allow the whole process to work as automatically as possible. Moreover, we may try to exploit a different UAP Engine for discovering unexplained activities in the video surveillance context, no longer based on the concept of possible worlds, but on game theory.
References

1. M. Albanese, C. Molinaro, F. Persia, A. Picariello and V. S. Subrahmanian, "Discovering the Top-k Unexplained Sequences in Time-Stamped Observation Data", IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 26, no. 3, pp. 577-594, Mar. 2014.
2. M. Albanese, C. Molinaro, F. Persia, A. Picariello and V. S. Subrahmanian, "Finding Unexplained Activities in Video", International Joint Conference on Artificial Intelligence (IJCAI), pp. 1628-1634, 2011.
3. J. K. Petersen, "Understanding Surveillance Technologies", Boca Raton, FL: CRC Press, 2001.
4. R. Collins, A. Lipton and T. Kanade, "Special issue on video surveillance and monitoring", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, 2000.
5. C. Regazzoni, V. Ramesh, "Scanning the Issue/Technology: Special Issue on Video Communications, Processing, and Understanding for Third Generation Surveillance Systems", IEEE, 2001.
6. T. Lacey, S. A. DeLoach, "Verification of Agent Behavioral Models", The 2000 International Conference on Artificial Intelligence (ICAI'2000), June 26-29, Las Vegas, Nevada, 2000.
7. N. T. Siebel and S. Maybank, "Fusion of multiple tracking algorithms for robust people tracking", in Proc. of ECCV 2002, pp. 373-387, 2002.
8. N. T. Siebel and S. Maybank, "The ADVISOR visual surveillance system", in ECCV 2004 Workshop on Applications of Computer Vision (ACV), 2004.
9. M. Albanese, A. Pugliese, and V. S. Subrahmanian, "Fast Activity Detection: Indexing for Temporal Stochastic Automaton based Activity Models", IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 25, no. 2, pp. 360-373, Feb. 2013.
10. F. Persia, D. D'Auria, "An application for finding expected activities in medical context scientific databases", SEBD 2014, pp. 77-88.
11. D. D'Auria, F. Persia, "Automatic evaluation of medical doctors' performances while using a cricothyrotomy simulator", IRI 2014, pp. 514-519.
12. D. D'Auria, F. Persia, "Discovering Expected Activities in Medical Context Scientific Databases", DATA 2014, pp. 446-453.
13. Phan Dung, Oh Chi-Min, Kim Soo-Hyung, Na In-Seop, Lee Chil-Woo, "Object Recognition by Combining Binary Local Invariant Features and Color Histogram", 2013 2nd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 466-470.
14. Chaudhary, Y. Mae, M. Kojima, T. Arai, "Autonomous acquisition of generic handheld objects in unstructured environments via sequential back-tracking for object recognition", 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4953-4958.
15. T. Ubukata, M. Shibata, K. Terabayashi, A. Mora, T. Kawashita, G. Masuyama, K. Umeda, "Fast Human Detection Combining Range Image Segmentation and Local Feature Based Detection", 22nd International Conference on Pattern Recognition (ICPR), pp. 4281-4286.
16. I. Onal, K. Kardas, Y. Rezaeitabar, U. Bayram, M. Bal, I. Ulusoy, N. K. Cicekli, "A framework for detecting complex events in surveillance videos", 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1-6.
17. T. T. Zin, P. Tin, H. Hama, T. Toriu, "An integrated framework for detecting suspicious behaviors in video surveillance", Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series.