Ninth International Workshop on Image Analysis for Multimedia Interactive Services

Event Detection and Clustering for Surveillance Video Summarization

Uros Damnjanovic
Queen Mary University of London
[email protected]

Virginia Fernandez
Universidad Autónoma de Madrid
virginia.fernandeza@estudiante.uam.es

Ebroul Izquierdo
Queen Mary University of London
[email protected]

José María Martinez
Universidad Autónoma de Madrid
josem.martinez@uam.es

Abstract

The goal of surveillance summarization is to identify high-value information events in a video stream and to present them to a user. In this paper we present a surveillance summarization approach based on the detection and clustering of important events. Assuming that events are the main source of energy change between consecutive frames, a set of interesting frames is extracted and then clustered. Based on the structure of the clusters, two types of summaries are created: static and dynamic. The static summary is built of key frames organized in clusters. The dynamic summary is created from short video segments representing each cluster and is used to lead the user to the events of interest captured in the key frames. We describe our approach and present experimental results.

1. Introduction

Nowadays, interest in civil, military and commercial surveillance is growing due to the increasing demand for security. Thousands of video cameras can be found in public places, public transport, banks, airports, etc., resulting in a huge amount of data that is difficult to process in real time. In order to efficiently organize the growing stocks of surveillance video, it is necessary to organize the data automatically using signal-based representations. Video summarization techniques can be a very useful tool when applied to surveillance videos. The main objective of video summarization is to identify interesting segments in the video and present them to the user. Applied to the surveillance domain, summarization techniques can provide the user both with an overview of the events that occurred and with faster browsing capabilities. By detecting and organizing events, the essence of the surveillance video is captured in the summary, decreasing the time needed for browsing the content.


Even though surveillance systems have been in use for decades, most publications in the surveillance domain have appeared only in the last few years. Detection and classification of events is the task addressed most often in the literature. An object detection technique based on wavelet coefficients is used to detect frontal and rear views of pedestrians in [1]. In [2], two different architectures that employ summarization techniques in the surveillance domain are described. Video summarization based on the optimization of viewing time, frame skipping and bit-rate constraints is presented in [3]: for a given temporal rate constraint, the optimal video summary problem is defined as finding a predefined number of frames that minimize the temporal distortion. In [4], the authors present a tool that utilizes MPEG-7 visual descriptors and generates a video index for summary creation. The resulting index generates a preview of the movie and allows non-linear access to the content; the approach is based on hierarchical clustering that merges shot segments which have similar features and neighbor each other in the time domain. In [5], Rasheed and Shah construct a shot similarity graph and use normalized-cut graph partitioning to cluster shots into scenes. Video motion analysis can also be used for creating video summaries, as in [6], where Wang et al. showed that by analyzing global/camera motion and object motion it is possible to extract useful information about the video structure. A more complete overview of existing techniques and of the available literature on intelligent surveillance systems can be found in [7] and [8].

In this paper we present an event detection and clustering approach for building both static and dynamic summaries. The main idea of our approach is to combine a video skim with a set of key frames organized in clusters, to enable fast browsing of the whole video. To create the summary we first detect events using the energy difference between frames. We then cluster the events based on their visual appearance and, finally, based on the cluster structure, we build the summary and present it to the user. The video summary is used to lead the user to the key frame cluster containing a specific event; the key frame summary then summarizes the clusters containing interesting events.

The structure of the paper is as follows. In section 2 the energy-difference event detection approach is presented. Section 3 presents the spectral clustering approach for clustering events and building the summary. Section 4 presents experiments and results, and section 5 concludes the paper.

2. Event detection with frame energy value estimation

Figure 1. Frame energy difference estimation. A fixed threshold is used to detect frames with high energy values.

For the development of this algorithm, it has been assumed that the surveillance cameras are deployed pointing at a fixed place. Quantization of the redundancy between frames, in the form of frame energy values, is used as the criterion for event detection. Event detection comprises three steps that can be grouped into two stages. In the first stage, the "difference frame" and its energy are calculated for each frame. In the second stage, frames showing an event are found and the reference frame is refreshed. The algorithm steps are described in the following:


Step 1: Calculation of the "difference frame". The "difference frame" is calculated for every frame as the absolute value of the difference between the pixel intensities of the current frame and the reference frame.


Step 2: Calculation of the energy of the "difference frame". Once the "difference frame" has been calculated, its energy value E(n) is found using equation (1), where d_n(i, j) is the differential intensity of the pixels belonging to the "difference frame" of the current frame and M x N is the frame size:

E(n) = \frac{1}{M \cdot N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} d_n(i, j)    (1)

Step 3: Selection of frames using a threshold. In this step the set I of interesting frames is extracted from the set F of all frames using a fixed threshold. When an energy value exceeds the threshold, its corresponding frame becomes the reference frame for the next iteration (see figure 1). Further video analysis concentrates on the set I, since it contains most of the "event" frames.

Let f_i and f_{i+1} be two consecutive frames of the set I, and let d(f_i, f_{i+1}) be their absolute distance, i.e. their distance in the set F. The set of event candidates E' is extracted by merging consecutive frames with d(f_i, f_{i+1}) < τ into one event, where τ is a fixed threshold. We now define two length measures describing a potential event e_k. The first length measure, taken with respect to the set F, l_F(e_k), is the total number of frames between the first and last frame belonging to the same candidate e_k. The second length measure, taken with respect to the set I, l_I(e_k), is the total number of frames belonging to the set I between the first and last frame of the event candidate. An event is properly detected if condition (2) is met:

\frac{l_I(e_k)}{l_F(e_k)} > \theta    (2)

where θ is a fixed threshold. Finally, the set of key frames S ⊂ I is created from the frames belonging both to the set I and to the set of detected events.
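To make the detection pipeline concrete, the following is a minimal sketch in Python, assuming grayscale frames given as NumPy arrays; the function names and the way the three thresholds are passed are illustrative choices, not taken from the paper.

```python
import numpy as np

def difference_energy(frame, reference):
    # Eq. (1): mean absolute pixel difference between the current
    # frame and the reference frame ("difference frame" energy).
    d = np.abs(frame.astype(np.float64) - reference.astype(np.float64))
    return d.sum() / d.size

def interesting_frames(frames, energy_threshold):
    # Step 3: keep frames whose energy exceeds the threshold; each
    # detected frame becomes the reference for the next iteration.
    reference, selected = frames[0], []
    for i, frame in enumerate(frames[1:], start=1):
        if difference_energy(frame, reference) > energy_threshold:
            selected.append(i)
            reference = frame  # refresh the reference frame
    return selected

def event_candidates(selected, tau):
    # Merge consecutive interesting frames whose temporal distance
    # is below tau into one event candidate.
    candidates, current = [], [selected[0]]
    for prev, cur in zip(selected, selected[1:]):
        if cur - prev < tau:
            current.append(cur)
        else:
            candidates.append(current)
            current = [cur]
    candidates.append(current)
    return candidates

def is_event(candidate, theta):
    # Eq. (2): density of interesting frames over the candidate's
    # span in F must exceed the threshold theta.
    span = candidate[-1] - candidate[0] + 1
    return len(candidate) / span > theta
```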

3. Building the summary with spectral clustering

The graph-based data model considers every point in the dataset as a node of a graph, while the edges of the graph correspond to a similarity measure defined over a set of features. For the specific problem of surveillance video summarization, the set of frames extracted in section 2 is used as input for the spectral clustering algorithm. Each frame is described using a set of low-level features. The similarity matrix W = [w_ij] is formed with similarities that are a linear combination of the individual feature similarities w_ij^(k), as in formula (3):

w_{ij} = \alpha_1 \cdot w_{ij}^{(1)} + \alpha_2 \cdot w_{ij}^{(2)} + \alpha_3 \cdot w_{ij}^{(3)}    (3)

We used three MPEG-7 low-level features for frame description: color layout, edge histogram and homogeneous texture. The weights α_k define the importance of each feature. The similarity w_ij^(k) is calculated using the following equation:

w_{ij}^{(k)} = \exp\left( -\frac{\| f_i - f_j \|^2}{2\sigma^2} \right)    (4)

where ||f_i - f_j|| is the Euclidean distance between the two feature vectors. The parameter σ serves as a scaling factor which determines the sensitivity of the similarity function to changes in the scene.
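As an illustration of equations (3) and (4), the following sketch builds the combined similarity matrix, plugging in the weights and scaling factors reported in section 4; the array layout and function names are assumptions.

```python
import numpy as np

def feature_similarity(features, sigma):
    # Eq. (4): Gaussian similarity over squared Euclidean distances.
    # `features` is an (n, d) array, one MPEG-7 descriptor per frame.
    diff = features[:, None, :] - features[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def combined_similarity(color_layout, edge_hist, texture):
    # Eq. (3) with the weights (0.5, 0.3, 0.2) and scaling factors
    # (8, 32, 320) used in the experiments of section 4.
    return (0.5 * feature_similarity(color_layout, 8.0)
            + 0.3 * feature_similarity(edge_hist, 32.0)
            + 0.2 * feature_similarity(texture, 320.0))
```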

After the similarity matrix is formed, the next step is to cluster the frames. We use the normalized cut clustering criterion [10] to optimize the following objective function:

\mathrm{NCut}(A, B) = \mathrm{Cut}(A, B) \left( \frac{1}{\mathrm{Vol}(A)} + \frac{1}{\mathrm{Vol}(B)} \right)    (5)

where Cut(A, B) is the total connection between the two clusters A and B, and Vol(A) is the total connection between cluster A and all nodes in the dataset. Minimizing this objective separates the dataset into clusters with small inter-cluster and high intra-cluster similarities. By using spectral relaxation it is possible to minimize NCut in the continuous domain by solving the generalized eigensystem:

L x = \lambda D x    (6)

where D = [d_ii] is the diagonal matrix with entries d_ii = \sum_{j} w_{ij}, and L = D - W is the graph Laplacian matrix. The eigenvector x^(2) corresponding to the second smallest eigenvalue λ^(2) is used to recursively bipartition the frames. After each step the NCut value is used to test the quality of the clustering, indicating whether clustering should be stopped or continued. A high NCut value, in our case higher than a predefined threshold T_NC, means that the two clusters are similar and that their frames should stay in one cluster. The algorithm is applied recursively until all clusters are found.

The extracted clusters are then used to create a short video skim for dynamic browsing and sets of key frames for static browsing (see figure 2). The video segments that represent the clusters in the skim are found by analyzing the structure of the eigenvectors corresponding to a specific cluster, using the fact that the structure of the eigenvectors corresponds to the structure of the clusters [11]. The connection between the level of change within a scene and the eigenvector structure is used to build the video skim, by searching for segments that correspond to the eigenvector entries with the highest variation. The starting frame f_s of the segment representing cluster C_k is found using the following equation:

f_s = f_i, \quad i = \arg\max_{j \in C_k} \left( x_j^{(2)} - \bar{x}_{C_k}^{(2)} \right)    (7)

where \bar{x}_{C_k}^{(2)} denotes the mean of the second-eigenvector entries over cluster C_k.
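A compact sketch of the recursive normalized-cut bipartitioning could look as follows, assuming SciPy's generalized symmetric eigensolver and a sign-based split of the second eigenvector; the parameter t_nc plays the role of the threshold T_NC.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_value(w, mask):
    # Eq. (5) for the bipartition (mask, ~mask).
    cut = w[mask][:, ~mask].sum()
    return cut * (1.0 / w[mask].sum() + 1.0 / w[~mask].sum())

def bipartition(w):
    # Eq. (6): generalized eigensystem L x = lambda D x; split on the
    # sign of the eigenvector of the second smallest eigenvalue.
    d = np.diag(w.sum(axis=1))
    _, vecs = eigh(d - w, d)
    return vecs[:, 1] >= 0

def recursive_ncut(w, indices, t_nc, clusters):
    # Stop splitting when the NCut value exceeds T_NC, i.e. the two
    # halves are too similar to separate.
    if len(indices) < 2:
        clusters.append(indices)
        return
    mask = bipartition(w)
    if mask.all() or (~mask).all() or ncut_value(w, mask) > t_nc:
        clusters.append(indices)
        return
    for m in (mask, ~mask):
        recursive_ncut(w[np.ix_(m, m)], indices[m], t_nc, clusters)
```

Starting from an empty list and the full index range, e.g. clusters = []; recursive_ncut(W, np.arange(len(W)), t_nc, clusters), the list ends up holding one index array per cluster; a suitable value of t_nc is data-dependent.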

Figure 2. Event detection and summarization process overview. (I) Set of frames F of the original video. (II) Set of frames I from the frame energy analysis, containing interesting events. (III) Set of key frames S representing different events. (IV) Clusters of key frames used for building the final summary (V).

Every cluster C_k will be included in the summary with some duration T(C_k), which may comprise more than one segment from the cluster. The number of different segments coming from cluster C_k that are included in the summary depends on the size of the cluster, l(C_k), and on the number of temporally detached events within the cluster. The first step is to determine each cluster's contribution to the video skim. If T is the final length of the video skim and T_O is the total length of the original video, the contribution of cluster C_k to the summary is calculated by:

T(C_k) = T \cdot \frac{l(C_k)}{T_O}    (8)

Since one cluster is represented by several segments, we define a maximal length T_max and a minimal length T_min for a segment representing one event. Individual event lengths in the summary are chosen such that they lie between T_min and T_max, and such that the number of different events from cluster C_k used in the skim corresponds to the number of temporally detached events.
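The allocation just described can be sketched as follows; measuring cluster sizes and lengths in frames, and the function and parameter names, are our assumptions.

```python
def segment_length(cluster_size, total_frames, skim_length,
                   n_detached_events, t_min, t_max):
    # Eq. (8): the cluster's share of the skim is proportional to
    # its size; the share is divided among the temporally detached
    # events, and each segment is clamped to [t_min, t_max].
    share = skim_length * cluster_size / total_frames
    per_event = share / max(1, n_detached_events)
    return min(t_max, max(t_min, per_event))
```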

4. Experiments and results

For evaluation purposes we tested our algorithm with two surveillance videos, taken from fixed surveillance cameras overlooking a parking lot. A manually generated ground truth is used to evaluate the performance of the algorithm. Events belong to one of the following groups: moving cars, moving people, and highlight events around the gate being monitored. The parameter σ in (4), adjusting the sensitivity of the similarity measure, is set to 8, 32 and 320 for the color layout, edge histogram and homogeneous texture similarities, respectively. The contribution of each descriptor to the elements of the similarity matrix is determined by the weights α_1, α_2 and α_3, which are set to 0.5, 0.3 and 0.2, respectively. We define a measure of information concentration to describe how informative a set of frames is:

IC = \frac{N_E}{N}    (9)

where N_E is the number of frames showing some interesting event and N is the total number of frames in the set. The ground truth for the two surveillance videos is shown in table 1.




Table 1. Ground truth

                              Video 1    Video 2
  Length                      140 min    150 min
  Number of events            82         96
  Number of highlights        5          12
  Information concentration   13%        21%
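Equation (9) is a simple ratio; a minimal helper (names illustrative) makes it explicit:

```python
def information_concentration(n_event_frames, n_total_frames):
    # Eq. (9): fraction of frames in a set showing an interesting event.
    return n_event_frames / n_total_frames
```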


The key frame summary captured a high percentage of the events, as shown in table 2. The dynamic summary also captured a high percentage of the highlights, as shown in table 3, and has a high information concentration.


Table 2. Static summary results

  Key frame summary           Video 1    Video 2
  Number of events            80         91
  Number of highlights        5          12
  Information concentration   96%        98%


Table 3. Dynamic summary results

  Skim summary                Video 1    Video 2
  Length                      46 sec     65 sec
  Number of highlights        5          11
  Number of events            29         38
  Information concentration   98%        99%


5. Conclusion


We have shown that summarization using both dynamic and static summaries can be used efficiently for summarizing surveillance video coming from static cameras. Since every significant change between consecutive frames comes from a possible event, the frame energy difference proved to be a reliable event indicator. One drawback of this approach is that changes in the background and illumination can be falsely detected as events. Clustering the detected events and building the summary significantly increases browsing efficiency, thanks to the summary's high concentration of useful information. Focusing on important areas within frames, together with the use of event classifiers, will be addressed in future work.


6. Acknowledgments

This research was supported by the European Commission (FP6-027685-MESH). The expressed content is the view of the authors and not necessarily the view of the MESH project as a whole. The work was also partially supported by the Cátedra UAM-Infoglobal, the Spanish Government (TEC2007-65400 - Semantic Video) and the Comunidad de Madrid (S-0505/TIC-0223 - ProMultiDis-CM).

7. References

[1] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian Detection Using Wavelet Templates", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Puerto Rico, 1997, pp. 193-199.

[2] S. Shipman, R. Radhakrishnan, and A. Divakaran, "Architecture for Video Summarization Services over Home Networks and the Internet", Mitsubishi Electric Research Labs, Cambridge, MA, USA.

[3] Z. Li, G.M. Schuster, and A.K. Katsaggelos, "MINMAX Optimal Video Summarization", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, issue 10, 2005, pp. 1245-1256.

[4] J. Lee, G. Lee, and W.Y. Kim, "Automatic Video Summarizing Tool Using MPEG-7 Descriptors for Personal Video Recorder", IEEE Transactions on Consumer Electronics, Vol. 49, 2003, pp. 742-749.

[5] Z. Rasheed and M. Shah, "Detection and Representation of Scenes in Videos", IEEE Transactions on Multimedia, Vol. 7, issue 5, 2005, pp. 1097-1105.

[6] Y. Wang, T. Zhang, and D. Tretter, "Real Time Motion Analysis Towards Semantic Understanding of Video Content", Conference on Visual Communications and Image Processing, 2005.

[7] M. Valera and S.A. Velastin, "Intelligent Distributed Surveillance Systems: a Review", IEE Proceedings - Vision, Image and Signal Processing, 152(2), pp. 192-204.

[8] A.R. Dick and M.J. Brooks, "Issues in Automated Visual Surveillance", Proceedings of VII Digital Image Computing: Techniques and Applications, Sydney, Australia, 10-12 Dec. 2003, pp. 195-204.

[9] H. Zhi-Qiang and H. Chong-Zhao, "A Background Reconstruction Algorithm Based on Pixel Intensity Classification in Remote Video Surveillance System", Proceedings of the 7th International Conference on Information Fusion, Stockholm, Sweden, 2004, pp. 754-759.

[10] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, 2000, pp. 888-905.

[11] U. Damnjanovic, T. Piatrik, D. Djordjevic, and E. Izquierdo, "Video Summarization for Surveillance and News Domain", SAMT 2007, Genoa, Italy, pp. 99-112.

