Support Vector Machine Approach for Detecting Events in Video Streams

Ahlem Walha, Ali Wali, and Adel M. Alimi

REGIM: REsearch Group on Intelligent Machines, University of Sfax, ENIS, BP W, 3038, Sfax, Tunisia
{walha.ahlem,ali.wali,adel.alimi}@ieee.org
Abstract. Object recognition is an important topic in image processing. In this paper we present an overview of a robust approach for event detection in video surveillance. Our event detection system consists of three modules: learning, extraction, and detection. The extraction of video characteristics is based on MPEG-7, while the detection module uses SVMs to recognize events.
1 Introduction
Multimedia information retrieval is becoming increasingly useful for many applications. The existence of networks of excellence (PASCAL, Muscle, DELOS, etc.), European projects (Media, etc.), evaluation campaigns (TRECVid, ImagEVAL, etc.), and the involvement of major industrial players (Google, IBM, etc.) are clear evidence of an emerging market for multimedia search tools. These tools generally follow the same pattern: on the one hand, visual characteristics are extracted from the raw content; on the other hand, these characteristics are exploited to solve a more or less generic retrieval task. We usually distinguish between two types of video indexing systems. Generic systems classify the different available videos without taking contextual information into account, for example according to the scene (indoor or outdoor) or the camera (static or moving). Specific systems, by contrast, index only a particular type of video, such as TV news, surveillance video, or sporting events such as football or tennis. Although their use is limited to one type of video, specific systems can nevertheless answer many requests from users of video indexing systems. The goal of this paper is to continue improving the genericity of such models in the context of kernel methods. Figure 1 shows a positive example, in which the event of interest is present once or more in the video, while Figure 2 shows a negative example, in which the event of interest is not present in the video [6]. The problem is similar to weakly supervised object detection in images, where a positive example is an image in which the object of interest is present once or more. Given our expertise in this area, only adaptation work to this context is needed on the general principle of detector construction.
Fig. 1. Positive example for the event PersonRun
Fig. 2. Negative example for the event PersonRun
On the other hand, significant work remains to be done on managing real-time detection in the video stream [8]. A new detection method is developed iteratively: the detection results are updated after each arrival of a new frame, based on the results obtained during previous iterations [18]. The indexing of video sequences can be viewed as a binary classification of all images: either the sequence of images corresponds to the predefined event, or it does not. The problem is then to detect predefined events or scenarios in video sequences, as in [3]. The ideal indexing tool for a given problem is software that works in two stages. After a learning phase on video sequences where the events are not predefined, a recognition step classifies each sequence of images into two classes: those that match the predefined event and those that do not. This recognition step is of course based on the learning conducted initially [12]. In [21], a technique for event detection based on a generic learning system called M-SVM (Multi-SVM) is proposed. The objective of this method is to allow the detector to recognize different types of events by presenting a significant number of examples in an incremental way. Examples of applications of this technique are intelligent video surveillance of airports and shopping areas.
In [22], a hybrid method that combines HMM with SVM to detect semantic events in video is proposed. The advantage of this method is that it captures the temporal structure of events thanks to Hidden Markov Models (HMM) while guaranteeing high classification accuracy thanks to Support Vector Machines (SVM). In this paper we describe our approach to event detection in video streams for surveillance applications. The rest of this paper is organized as follows. Section 2 gives a brief background on Support Vector Machines. Section 3 describes our detection method. The results are presented and discussed in Section 4. Finally, concluding remarks are drawn in Section 5.
2 Support Vector Machine: A Brief Background
Support vector machines are a concept from supervised learning, based on statistical learning theory. The main idea of the SVM algorithm is to choose and tune a kernel function so that the input data can be optimally separated. A support vector machine constructs an N-dimensional hyperplane that optimally separates the input data into two classes, and provides an alternative training method for linear, polynomial, radial basis function, and multi-layer perceptron classifiers. The input of an SVM is a set of vectors, each of which belongs to one of the two classes. For the linear kernel, the goal is to find the hyperplane that leaves the largest possible number of cases of the same class on the same side. The linear separating hyperplane is defined by equation (1):

$y_i(w \cdot x_i + b) \ge 1$   (1)

where $w$ is the normal vector and $b$ a bias satisfying the inequalities; $x_i$ is an input vector and $y_i \in \{-1, +1\}$ its class label. SVM selects the hyperplane that maximizes the distance $1/\|w\|$ to the closest cases, which are called support vectors; the quantity $2/\|w\|$ is called the margin. For non-separable data, the optimization problem is solved by introducing slack variables $\xi_i$ and a penalization for misclassified cases. The problem is then the minimization shown in equation (2):

$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} \xi_i$   (2)

With this constraint, the separation condition of equation (1) becomes equation (3):

$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$   (3)

The parameter C is the trade-off between the width of the margin and the training error. There are other widely used kernel functions, such as the polynomial kernel and the Gaussian kernel, shown in equations (4) and (5), respectively:

$k(x, x') = (x \cdot x' + 1)^p$   (4)

$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$   (5)
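To make the roles of C and the kernel concrete, here is a minimal sketch (our illustration, not the authors' code) that trains soft-margin SVMs with the linear, polynomial, and Gaussian kernels of equations (1)–(5) on toy data using scikit-learn; the data and every parameter value are assumptions made for demonstration only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two toy 2-D classes with labels y in {-1, +1}, as in equation (1).
X = np.vstack([rng.normal(-1.0, 0.6, (50, 2)),
               rng.normal(+1.0, 0.6, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Linear kernel: C is the trade-off of equation (2) between the margin
# width 2/||w|| and the slack penalties xi_i of equation (3).
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", linear_svm.support_vectors_.shape[0])
print("margin 2/||w|| =", 2.0 / np.linalg.norm(linear_svm.coef_))

# Polynomial kernel of equation (4) (coef0=1 gives (x.x' + 1)^p) and
# Gaussian kernel of equation (5); gamma plays the role of 1/(2*sigma^2).
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("poly accuracy:", poly_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))
```

Increasing C penalizes slack more heavily and narrows the margin; decreasing it widens the margin at the cost of more training errors.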
3 Approach
A video surveillance system is characterized by event detection in uncontrolled environments. It faces one main problem: the enormous diversity of an event seen from different angles, at different scales, and to different degrees. The proposed approach has three major phases: visual feature extraction, learning by SVM, and event detection. These three phases are described in detail in this section, along with the steps involved and the characteristic features of each phase. Figure 3 shows an overview of the event detection approach.
Fig. 3. The proposed event detection approach phases
3.1 Extracting Descriptors
A video sequence can be decomposed into a hierarchy of shots. A shot is a video segment taken from one camera that lasts for a given event. Each video shot can then be characterized by one or more keyframes extracted from it; these images are used by the learning module [12].
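As an illustration of this decomposition, here is a minimal sketch, not the paper's method: a simple histogram-difference shot-boundary detector that keeps the middle frame of each shot as a keyframe. The use of OpenCV, the Bhattacharyya distance, and the threshold value are all assumptions.

```python
import cv2

def shots_and_keyframes(video_path, threshold=0.5):
    """Split a video into shots; return (start, end, keyframe) triples."""
    cap = cv2.VideoCapture(video_path)
    frames, shots, prev_hist, start, idx = [], [], None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # A large color-histogram change is assumed to mark a shot boundary.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                shots.append((start, idx - 1))
                start = idx
        prev_hist = hist
        frames.append(frame)  # kept in memory only for brevity of the sketch
        idx += 1
    cap.release()
    shots.append((start, idx - 1))
    # Keyframe: the middle frame of each shot.
    return [(s, e, frames[(s + e) // 2]) for s, e in shots]
```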
The next step is to define the descriptors of each image. We use the MPEG-7 standard to obtain these descriptors. For each image we obtain the following characteristics: (1) color descriptors, (2) shape descriptors, (3) texture descriptors, and (4) motion descriptors. Extracting the information contained in an image requires a choice between the different types of possible descriptors: color, motion, shape, and texture. A descriptor is a vector of values calculated from an image, which summarizes the information the image contains (color, shape, texture, ...). Each MPEG-7 descriptor is a summary of the characteristics of the image. The following descriptors had the top overall performance [16]:

– LayoutColor: obtained by sampling the image into blocks of 8×8 pixels. The information of the three color channels of each block is treated separately to obtain a decomposition into baseband signals by applying the DCT.
– StructureColor: this descriptor is mostly used for image retrieval; it expresses the local color structure of an image [3].
– ScalableColor: a color histogram in HSV color space, encoded by a Haar transform [3].
– DominantColor: obtained by vector quantization of the colors of the image into classes. Each dominant color of the image is represented by three parameters: the color value (the three color components) of the class, the percentage of pixels associated with the class, and the variance of the color within the class [3].
– Variance: characterizes the texture using the gray level of the pixels (the V channel of HSV). For each pixel, we calculate the variance of all pixels contained in a neighborhood of size n × n around it (n = 3, 5, 7, ...), and we take the histogram of these variances as a descriptor of the image.
– Co-occurrence matrices: contain second-order spatial statistics of the gray levels. Fourteen indices (defined by Haralick) corresponding to descriptive characteristics of textures can be calculated from these matrices; we use only six of them: uniformity, contrast, entropy, correlation, directivity, and consistency.
– Contour shape: this descriptor uses the closed contour of an object or of a 2D region of uniform color. It consists of a list of points (coordinate pairs) defining the contour of the object [18].
– Region shape: the shape of an object may consist of either a single region or a set of regions, as well as some holes in the object.
– Keypoint descriptor: the local image gradients are measured at the selected scale in the region around each keypoint [4][5]; they are transformed into a representation that allows for significant levels of local shape distortion and change in illumination [7].
– Optical flow: the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between the camera and the scene [6][12][13]. Figure 4 shows the optical flow computed for two successive images; a minimal sketch of this descriptor and of the variance histogram is given after the figure.
Fig. 4. Optical flow calculated for two successive images
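The sketch below (ours, under stated assumptions) illustrates two of the descriptors above: the local-variance texture histogram computed on the V channel of HSV, and dense optical flow between two successive frames. The paper does not specify its optical flow algorithm; Farneback's method, the window size n, and the bin count are assumptions.

```python
import cv2
import numpy as np

def variance_histogram(frame_bgr, n=3, bins=32):
    # Gray level = V channel of HSV, as described above.
    v = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[:, :, 2].astype(np.float32)
    # Local variance over an n x n neighborhood: E[x^2] - E[x]^2.
    mean = cv2.blur(v, (n, n))
    mean_sq = cv2.blur(v * v, (n, n))
    var = np.clip(mean_sq - mean * mean, 0, None)
    # The histogram of local variances is the texture descriptor.
    hist, _ = np.histogram(var, bins=bins, range=(0.0, float(var.max()) + 1e-6))
    return hist / hist.sum()

def optical_flow(prev_bgr, next_bgr):
    prev_g = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_g = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Dense flow field between two successive images (cf. Figure 4).
    return cv2.calcOpticalFlowFarneback(prev_g, next_g, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```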
3.2 Learning by SVM
Support Vector Machines (SVM) are a set of supervised learning techniques for solving discrimination and regression problems. SVM is a generalization of linear classifiers and has been applied to many fields (bioinformatics, information retrieval, computer vision, finance, etc.) [1,2]. Depending on the data, the performance of SVM is similar to that of a neural network or of a Gaussian mixture model. SVMs directly implement the principle of structural risk minimization and work by mapping the training points into a high-dimensional feature space. This module begins by performing learning with SVM. For each event we use representative images, from which we define a descriptor vector; this vector is a combination of the different descriptor values. After obtaining all the vectors that characterize each event, we move on to the testing stage. In our approach, we build two models for each class: a model for the beginning of the event and another for its end. We work with six events: PersonRuns, OpposingFlow, ElevatorNoEntry, PeopleMeet, Embrace, and PeopleSplitUp.
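A minimal sketch of this learning module under stated assumptions: descriptor values are concatenated into one vector per keyframe, and for each of the six events two binary SVM models are trained, one for the event beginning and one for its end. The data layout, the RBF kernel, and all helper names are our assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EVENTS = ["PersonRuns", "OpposingFlow", "ElevatorNoEntry",
          "PeopleMeet", "Embrace", "PeopleSplitUp"]

def concat_descriptors(descriptors):
    # descriptors: per-image descriptor vectors (color, texture, shape,
    # motion); the combined feature vector is their concatenation.
    return np.concatenate([np.asarray(d, dtype=np.float32).ravel()
                           for d in descriptors])

def train_event_models(begin_data, end_data):
    # begin_data / end_data: dict mapping event -> (X, y), where X holds
    # concatenated descriptor vectors and y in {0, 1} marks positives.
    models = {}
    for event in EVENTS:
        Xb, yb = begin_data[event]
        Xe, ye = end_data[event]
        models[event] = {
            "begin": make_pipeline(StandardScaler(),
                                   SVC(kernel="rbf", probability=True)).fit(Xb, yb),
            "end": make_pipeline(StandardScaler(),
                                 SVC(kernel="rbf", probability=True)).fit(Xe, ye),
        }
    return models
```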
3.3 Detection of Events
In this module, we detect the event from different angles. Each image is tested against the learned models and assigned a membership class; we then compute the probability of each event presented by the sequence of images and choose the closest event, provided a defined threshold is not exceeded [12].
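A minimal sketch of this step, reusing the models dictionary from the previous sketch: per-frame positive-class probabilities are averaged over the sequence, and the best-scoring event is retained only if it passes a threshold. The probability-based scoring and the threshold value are our assumptions.

```python
import numpy as np

def detect_event(models, sequence_features, threshold=0.6):
    # sequence_features: 2-D array, one concatenated descriptor vector per frame.
    scores = {}
    for event, m in models.items():
        # Probability of the positive class for every frame of the sequence.
        p = m["begin"].predict_proba(sequence_features)[:, 1]
        scores[event] = float(np.mean(p))
    best = max(scores, key=scores.get)
    # Return the closest event only if its score passes the threshold.
    return (best, scores[best]) if scores[best] >= threshold else (None, 0.0)
```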
4 Experimental Results
The dataset consists of surveillance camera video acquired at London Gatwick airport; it is provided by the TREC Video Retrieval Evaluation (TRECVid) [20].
Fig. 5. Results for the treated events
The dataset is large in scale: approximately 100 hours of video from five fixed-view surveillance cameras in the airport. We aim at treating six events, which are grouped into two categories. The experiment applies our detection algorithm to a set of 1282 manually extracted subsequences of known classes, covering the six events "PersonRuns", "OpposingFlow", "ElevatorNoEntry", "PeopleMeet", "Embrace", and "PeopleSplitUp". Table 1 shows our event detection results and a comparison with other systems [19], where #Ref is the number of annotated events, #Sys is the number of events detected, #Cordet is the number of correct detections, #Fa is the number of false detections, and #Miss is the number of missed events. From the results, we can see that we obtain good performance on the events PersonRun, PeopleSplitUp, and ObjectPut, while performance on OpposingFlow, Embrace, and ElevatorNoEntry is relatively low; we consider that the SVM classifier model does not have enough discriminative ability for these events.

Table 1. Event Detection Scoring Analysis Report

Event            System                    #Ref  #Sys  #Cordet  #Fa  #Miss
PersonRun        Our System                 107    20        7   13      8
PersonRun        Informedia@TRECVid 2010    107   532       19  513     88
OpposingFlow     Our System                 150    24        4   20      9
PeopleSplitUp    Our System                 187    32        6   26      9
PeopleSplitUp    eSur@TRECVid 2009          187   198        7  191    180
PeopleSplitUp    eSur@TRECVid 2010          187   167       16  136    171
Embrace          Our System                 187    23        5   17     10
Embrace          eSur@TRECVid 2009          175    80        1   79    174
Embrace          eSur@TRECVid 2010          175   925        6   71    169
ElevatorNoEntry  Our System                  30     0        0    0      3
ObjectPut        Our System                 621     9        7    2      8
ObjectPut        IPG-BJTU@TRECVid 2010      621     8        1    7    620
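For reference, a small sketch (our helper, not the official TRECVid scoring tool) showing how derived rates can be computed from the counts reported in Table 1.

```python
def score_report(n_ref, n_sys, n_cordet, n_fa, n_miss):
    # Precision over system outputs and miss rate over annotated events.
    precision = n_cordet / n_sys if n_sys else 0.0
    miss_rate = n_miss / n_ref if n_ref else 0.0
    return {"precision": precision, "miss_rate": miss_rate, "false_alarms": n_fa}

# Example: our system on PersonRun (first row of Table 1).
print(score_report(107, 20, 7, 13, 8))
```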
5 Conclusions and Future Works
We have presented a method to detect events in video recordings. This technique has applications in indexing very large video surveillance archives and detecting specific events (human intrusion into a prohibited area, someone leaving a package, etc.) in a video database. The main idea of the method is to extract the various descriptors of the images and then learn the desired event from a number of relevant examples presented by the user. The user can then provide more relevant sequences before the test is run over the entire database.
References

1. Li, H., Bao, L., Gao, Z., Overwijk, A., Liu, W., Zhang, L.-F., Yu, S.-I., Chen, M.-Y., Metze, F., Hauptmann, A.: Informedia@TRECVID 2010 (2010), http://www-nlpir.nist.gov/projects/tvpubs/tv10.papers/cmu-informedia.pdf
2. Kaihua, J., Zhipeng, H., Zhongwei, C., Guochen, J., Ten, X., Qiong, H., Guangcheng, Z., Yaowei, W., Lei, Q., Yonghong, T., Xihong, W., Wen, G.: PKU-IDM @ TRECVid 2010: Pair-Wise Event Detection in Surveillance Video (2010), http://www-nlpir.nist.gov/projects/tvpubs/tv10.papers/pku-idm.pdf
3. Swain, M., Ballard, D.: Color Indexing. International Journal of Computer Vision 7(1), 11–32 (1991)
4. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
5. Chen, M.-Y., Li, H., Hauptmann, A.: Informedia@TRECVID 2009: Analyzing Video Motions (2009), http://www-nlpir.nist.gov/projects/tvpubs/tv9.papers/cmu.pdf
6. Takahashi, M., Kawai, Y., Fujii, M., Shibata, M., Babaguchi, N., Satoh, S.: NHK STRL at TRECVID 2009: Surveillance Event Detection and High-Level Feature Extraction (2009), http://www-nlpir.nist.gov/projects/tvpubs/tv9.papers/nhkstrl.pdf
7. Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. In: Proc. ICCV 1999, vol. 2, pp. 1150–1157 (1999)
8. Viola, P., Jones, M.: Robust Real-time Object Detection. International Journal of Computer Vision 57(2), 137–154 (2004)
9. Han, F., Shan, Y., Cekander, R.: A Two-Stage Approach to People and Vehicle Detection with HOG-Based SVM. In: PerMIS Proceedings, pp. 133–140 (2006)
10. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the Support of a High-Dimensional Distribution. Neural Computation 13, 1443–1471 (2001)
11. Chen, P.H., Lin, C.J., Schölkopf, B.: A Tutorial on ν-Support Vector Machines. Applied Stochastic Models in Business and Industry 21, 111–136 (2005)
12. Yang, W., Lan, T., Mori, G.: SFU@TRECVid 2009: Event Detection, http://www.sfu.ca/~wya16/trecvid/trecvid.html
13. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing Action at a Distance. In: Proc. IEEE International Conference on Computer Vision, pp. 726–733 (2003)
14. Wali, A., Alimi, A.M.: Event Detection from Video Surveillance Data Based on Optical Flow Histogram and High-Level Feature Extraction. In: IEEE DEXA Workshops 2009, pp. 221–225 (2009)
15. Macan, T., Loncaric, S.: Hybrid Optical Flow and Segmentation Technique for LV Motion Detection. In: Proceedings of SPIE Medical Imaging, San Diego, USA, pp. 475–482 (2001)
16. Xiaokang, Y., Yi, X., Rui, Z., Erkang, C., Qing, Y., Bo, X., Zhou, Y., Ning, L., Zuo, H., Cong, Z., Xiaolin, C., Anwen, L., Zhenfei, C., Kai, G., Jun, H., Tong, S.J.: Shanghai Jiao Tong University Participation in High-Level Feature Extraction and Surveillance Event Detection at TRECVID 2009 (2009), http://www-nlpir.nist.gov/projects/tvpubs/tv9.papers/sjtu-iicip.pdf
17. Wali, A., Ben Aoun, N., Karray, H., Ben Amar, C., Alimi, A.M.: A New System for Event Detection from Video Surveillance Sequences. In: Blanc-Talon, J., Bone, D., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2010, Part II. LNCS, vol. 6475, pp. 110–120. Springer, Heidelberg (2010)
18. Chan, T., Vese, L.: An Active Contour Model without Edges. In: Nielsen, M., Johansen, P., Fogh Olsen, O., Weickert, J. (eds.) Scale-Space 1999. LNCS, vol. 1682, pp. 141–151. Springer, Heidelberg (1999)
19. Yuan, S., Ping, G., Shu, W., Lin, Y., Haifeng, D., Liang, L., Zhenjiang, M.: Event Detection: IPG-BJTU @ TRECVid 2010 (2010), http://www-nlpir.nist.gov/projects/tvpubs/tv10.papers/ipg_bjtu.pdf
20. National Institute of Standards and Technology (NIST): TRECVID 2009 Evaluation for Surveillance Event Detection (2009), http://www.nist.gov/speech/tests/trecvid/2009/ and http://www.itl.nist.gov/iad/mig/tests/trecvid/2009/doc/eventdet09-evalplan-v03.htm
21. Wali, A., Alimi, A.M.: Incremental Learning Approach for Events Detection from Large Video Dataset. In: Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 555–560 (2010)
22. Bae, T.M., Kim, C.S., Jin, S.H., Kim, K.-H., Ro, Y.M.: Semantic Event Detection in Structured Video Using Hybrid HMM/SVM. In: Leow, W.-K., Lew, M., Chua, T.-S., Ma, W.-Y., Chaisorn, L., Bakker, E.M. (eds.) CIVR 2005. LNCS, vol. 3568, pp. 113–122. Springer, Heidelberg (2005)