Human Intrusion Detection using Texture Classification in Real-Time

N. Buch and S.A. Velastin
Digital Imaging Research Centre, Kingston University, UK
Transport for London, London, UK

Abstract

The task of detecting people entering a sterile zone is a common scenario for visual surveillance systems. We propose a novel texture classifier that detects a person in a video frame in real-time, without temporal information, by identifying salient texture regions. An extension of this classifier that fuses it with simple motion information significantly outperforms standard motion tracking. Shorter detection times can be achieved by combining texture classification with Kalman filtering. F1 measures are given for the i-LIDS sterile zone dataset of the UK Home Office. The fusion approach, running at 10 frames per second, gives the highest result of F1=0.92 on the 24 hour test dataset.
1 Introduction

The use of visual surveillance and cameras for security applications is continuously growing. One example scenario is detecting people entering a sterile zone. We present a system which alarms on intrusions in the i-LIDS sterile zone test dataset of the UK Home Office [1]; see Figure 1 for example views of the outdoor scene. The main challenge for the academic and industrial communities is to demonstrate robust operation over a wide range of environmental conditions.

A common solution for this problem is to generate a background model to estimate a motion foreground and perform tracking. The background model can be a single Gaussian, as in the OpenCV 1.0 blob tracker [5]. A background model based on the mode of the temporal histogram is given in [8]. The seminal paper of Stauffer and Grimson [6] introduces a Mixture of Gaussians background model to deal with multiple backgrounds, trading off computational speed against memory size. Periodically changing backgrounds are modelled in [2]. A background model based on texture is proposed in [4]. All background modelling approaches are affected by camera shake or fast scene changes; causes include an auto iris function, colour mode changes of the camera, or fast moving shadows. Detection on single frames overcomes those problems; however, it increases the difficulty of detection as no temporal information is available. Regression trees are used in [3] to classify pixels of vehicle-mounted camera images into road and non-road, assuming known road and non-road seed areas. A review of invariant pattern features, commonly used in classifying images and in content-based retrieval, is given in [7].

We propose a novel texture classifier for intrusion detection in still images. Based on spectral texture features of image regions, salient objects are detected in real-time.
Figure 1 Correctly detected intrusions (true positives, TP) of the texture classifier with motion extension in View1 (top) and View2 (bottom). Note the snow in the middle image of View1 and the people crawling on the right.

The algorithm is evaluated on the i-LIDS sterile zone dataset. Improved performance is demonstrated with a novel fusion approach combining the texture classifier with a simple inter-frame difference motion mask. A second extension to the texture classifier uses Kalman filtering and allows motion silhouettes to initialise tracks, reducing detection time. The next section introduces the texture classifier and the two extensions. Section 3 describes the testing framework and section 4 discusses the results. Conclusions are given in section 5.
2 Texture Classifier

This section introduces our novel texture classifier. Two extensions of the basic principle follow: one incorporates motion cues into the classifier, and one performs Kalman filtering to shorten detection times.
2.1 Basic Texture Classifier

The intrusion detection system is based on a region texture classifier for still images; refer to Figure 2 for a block diagram. A grey-level input frame (CIF, 360x288) is first divided into image patches $P_{ij}$. The patches are 16 pixels square with 20 percent overlap. Two regions $R_i$ are defined, one for the grass and one for the fence area. A Fast Fourier Transform (FFT) is performed for each patch, $\tilde{P}_{ij} = \mathrm{FFT}(P_{ij})$. Low and high frequency components are removed from the spectrum: low frequencies carry the illumination conditions of the patch, and high frequencies the noise from the analogue video feed. The sum over the remaining spectrum gives a scalar feature $f_{ij}$ for each image patch:

$$f_{ij} = \sum_{x,y} \tilde{P}_{ij}(x, y). \qquad (1)$$
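To make the patch feature concrete, here is a minimal sketch in Python/NumPy (the authors' implementation is in Matlab); the band-pass cut-offs `lo` and `hi` are illustrative assumptions, as the exact frequency bands are not given in the paper:

```python
import numpy as np

def patch_features(frame, patch=16, overlap=0.2, lo=2, hi=6):
    """Scalar band-pass spectral feature f_ij for every patch (Eq. 1).

    lo/hi are illustrative band-pass cut-offs in FFT bins; the paper
    does not quote the exact frequency bands.
    """
    step = int(patch * (1.0 - overlap))            # 20% overlap -> 12 px step
    feats = []
    for y in range(0, frame.shape[0] - patch + 1, step):
        for x in range(0, frame.shape[1] - patch + 1, step):
            p = frame[y:y + patch, x:x + patch].astype(np.float64)
            spec = np.abs(np.fft.fftshift(np.fft.fft2(p)))
            r = np.hypot(*np.mgrid[-patch // 2:patch // 2,
                                   -patch // 2:patch // 2])   # radial frequency
            band = (r >= lo) & (r <= hi)            # drop low (illumination) and
            feats.append((x, y, spec[band].sum()))  # high (analogue noise) bands
    return feats
```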
[Block diagram: image patches → FFT → band-pass sum → clustering (x, y, f_ij) → classifier (F_i/B_i) → texture F_i → merge regions → position list → intrusion rule → alarm]
Figure 2 Block diagram of the region texture classifier

The feature $f_{ij}$ discriminates people from background due to their different appearance, but gives a similar response over the whole background. This is in contrast to the aim of describing texture discriminatively as in [7]. The patches $P_{ij}$ are clustered with respect to the coordinates $x, y$ and the feature $f_{ij}$. Euclidean distance is used to find $N$ clusters with minimal variance (the Matlab function 'linkage' is used). The resulting clusters $C_{ik}$ are classified into foreground $F_i$ and background $B_i$. It is assumed that most of the image contains background and only $M$ clusters are potential foreground $\tilde{F}_i$. The values for $M$ and $N$ can be obtained from training data. The smallest foreground object should occupy approximately one cluster, which gives $N = 15$ by considering the area ratio between this object and the background. The number of foreground clusters $M = 4$ is determined by the ratio between the smallest and largest object. The mean feature $f_{ik}$ for every cluster $C_{ik}$ is calculated by
$$f_{ik} = \sum_{P_{ij} \subset C_{ik}} f_{ij}. \qquad (2)$$
The potential foreground Fɶ i contains M clusters with the highest feature fik leaving all other clusters as background Bi = Cik Cik ⊄ Fɶ i . The mean features f i of the background clusters and their variance σ i2 is calculated
{
}
{
}
{
}
f i = mean fik Cik ⊂ Bi
(3)
σ i2 = var f ik Cik ⊂ Bi .
(4)
The foreground Fi consists of salient clusters of Fɶ i
{
Fi = Cik Cik ⊂ Fɶ i ∧ f ik ≥ T ⋅ σ i2 + f i
}
(5)
with T = 5 in our case. Spatially close foreground clusters are merged into single objects, which occurs for objects close to the camera. The positions of objects are logged over time to accumulate paths. The intrusion rule framework triggers an alarm if a path shows consistent movement towards the target (the fence) for a minimum time. A sketch of this stage follows.
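A compact sketch of the clustering and classification stage (equations 2 to 5), again in Python; SciPy's hierarchical clustering stands in for the Matlab 'linkage' call, and N = 15, M = 4 and T = 5 follow the values above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def classify_patches(feats, n_clusters=15, m_fg=4, t_sal=5.0):
    """Split patch features (x, y, f_ij) into salient foreground clusters.

    In practice x, y and f_ij may need scaling to comparable ranges
    before clustering; omitted here for brevity.
    """
    data = np.asarray(feats, dtype=np.float64)          # rows: (x, y, f_ij)
    labels = fcluster(linkage(data, method='ward'),
                      n_clusters, criterion='maxclust')
    # Cluster feature f_ik: sum of patch features in the cluster (Eq. 2)
    f_ik = np.array([data[labels == k, 2].sum()
                     for k in range(1, n_clusters + 1)])
    order = np.argsort(f_ik)
    bg, fg_cand = order[:-m_fg], order[-m_fg:]          # M highest -> candidates
    f_bg, var_bg = f_ik[bg].mean(), f_ik[bg].var()      # Eqs. 3 and 4
    salient = [k for k in fg_cand
               if f_ik[k] >= t_sal * var_bg + f_bg]     # saliency test, Eq. 5
    return [np.flatnonzero(labels == k + 1) for k in salient]
```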
[Block diagram: image patches → FFT → band-pass sum → clustering (x, y, f_ij) → classifier (F_i/B_i) → texture F_i; in parallel, inter-frame difference → morphological opening → motion mask; both feed merge regions → position list → intrusion rule → alarm]
Figure 3 Block diagram for texture classifier with motion extension
2.2 Motion Extension

The algorithm described in section 2.1 does not use any temporal information for detecting objects; the only noise source is the texture of the image. An extension to the algorithm uses simple inter-frame differencing for motion estimation, which is fused with the texture information (see Figure 3). This motion information is affected by camera shake, changing illumination conditions and the like, but not by texture. Valid objects in the 'merge regions' block require both a detection by the texture classifier and significant motion in the same area. This reduces false detections, because the noise sources for texture and motion are independent and therefore less likely to fire jointly, which in turn allows lower detection thresholds for each detector.
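A minimal sketch of this fusion, assuming OpenCV for the image operations; the difference threshold, structuring element size and required motion fraction are illustrative values not quoted in the paper:

```python
import numpy as np
import cv2

def motion_mask(prev, curr, thresh=15, kernel=3):
    """Inter-frame difference followed by morphological opening.

    thresh and kernel are illustrative; the paper does not quote the
    exact threshold or structuring element size.
    """
    diff = cv2.absdiff(curr, prev)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    k = np.ones((kernel, kernel), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, k)

def fuse(texture_boxes, mask, min_motion=0.1):
    """Keep texture detections whose region also shows significant motion."""
    out = []
    for (x, y, w, h) in texture_boxes:
        roi = mask[y:y + h, x:x + w]
        if roi.size and (roi > 0).mean() >= min_motion:  # moving-pixel fraction
            out.append((x, y, w, h))
    return out
```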
2.3 Kalman Filtering

Both of the above systems suffer from some delay until the first detection of a person. Slow moving people stay partly occluded for a significant time, which potentially delays detection. This second extension overcomes the problem with a Kalman filter that allows tracks to be initialised purely by motion; refer to Figure 4 for a block diagram. The position list block, which accumulated object positions, is now replaced by a Kalman filter. This allows both texture objects and the motion mask to be used for building tracks: new tracks are started for texture objects and for silhouettes in the motion mask. Allowing silhouettes to start tracks requires some motion filtering, compared to the earlier two systems, to eliminate frequent false motion detections; the texture classifier, in comparison, has a much higher precision and does not require this filtering. Measurements for the Kalman filter update are taken from the following sources, provided a measurement matches the prediction (see the sketch below):

• Detected texture objects
• Silhouettes in the motion mask, if no texture object matches the prediction
• The prediction itself, if no measurement is available
The update with the motion mask allows tracks to start at the first appearance of a person and to fill temporal gaps in the texture object detection.
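The measurement selection can be sketched as follows; the gating distance and the centroid representation of objects are assumptions for illustration:

```python
import numpy as np

def select_measurement(pred, texture_objs, motion_sils, gate=40.0):
    """Pick the Kalman update source in the priority order described above.

    pred, texture_objs and motion_sils are (x, y) centroids; gate is an
    illustrative matching distance in pixels.
    """
    def nearest(cands):
        if not cands:
            return None
        c = min(cands, key=lambda p: np.hypot(p[0] - pred[0], p[1] - pred[1]))
        return c if np.hypot(c[0] - pred[0], c[1] - pred[1]) <= gate else None

    m = nearest(texture_objs)        # 1. texture objects first
    if m is None:
        m = nearest(motion_sils)     # 2. else motion silhouettes
    return m if m is not None else pred  # 3. else coast on the prediction
```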
[Block diagram: image patches → FFT → band-pass sum → clustering (x, y, f_ij) → classifier (F_i/B_i) → texture objects; in parallel, inter-frame difference → morphological opening → motion mask; both feed merge regions → Kalman filter → intrusion rule → alarm]
Figure 4 Block diagram for texture classifier with Kalman Filter extension
Figure 5 True positive examples of the Kalman extension showing smooth tracks. Note the person rolling sideways in the image on the left.
3 i-LIDS Testing

The proposed system is tested on the i-LIDS sterile zone dataset published by the UK Home Office [1]. The test dataset consists of 24 hours of video for two views, covering day and night and various weather conditions (snow, fast moving shadows, etc.). Each view is split into a sequence with alarms (208 in total) and a sequence without alarms but with various distractions (birds, rabbits, etc.). See Figure 1 and Figure 5 for detection examples.
3.1 Framework

The system was designed according to the i-LIDS requirements, receiving an analogue video input and providing a relay alarm output. A Trimedia frame grabber is used to sample the video and provide it to a Windows application. The image processing is performed in a Matlab library, which is compiled and integrated into the application. Real-time performance of 10 frames per second is achieved on a Pentium 4 with 2.4 GHz and 1 GB RAM for the texture classifier and the extended texture and motion classifier. The Kalman filter extension reduces the frame rate to 9 frames per second.
[Bar chart: F1 measure for Motion Tracker, Texture Classifier, Texture + Motion and Kalman Filter over View 1 alarm, View 1 total, View 2 alarm, View 2 total and TOTAL]
Figure 6 Performance for the 10 second alarm window. Results are shown for the alarming sequences, the per-view totals and the total over the dataset.
Figure 7 Two false positives and one false negative example
4 Results

The F1 measure used for i-LIDS testing combines recall R and precision P with a weight of 65%. Results for the specified alarm window of 10 seconds are shown in Figure 6 for four different systems. The first system ('Motion Tracker'), achieving F1=0.75, is a standard Kalman filter blob tracker with Gaussian background modelling based on the OpenCV library [5]. The main distractions are camera shake, fast illumination changes due to clouds, birds, and the camera switching between black & white and colour modes. The second system is the proposed texture classifier ('Texture Classifier'), which does not use any motion estimation and achieves F1=0.78. It performs better than the motion tracker; however, there are problems with detected shadows of the fence and with low image contrast. View2 has lower performance for that reason; see Figure 7 for false positive and false negative examples. The detection threshold is set high to eliminate false positives.
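For reference, one common weighted form of the F-measure (the van Rijsbergen effectiveness measure) is given below; the paper does not reproduce the exact i-LIDS definition, so the placement of the 65% weight $\alpha$ is an assumption:

$$F = \frac{P \cdot R}{\alpha\,R + (1 - \alpha)\,P}, \qquad \alpha = 0.65,$$

which reduces to the usual harmonic mean of P and R for $\alpha = 0.5$.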
[Bar chart: F1 measure for Motion Tracker, Texture Classifier, Texture + Motion and Kalman Filter over View1 alarm, View1 total, View2 alarm, View2 total and TOTAL]
Figure 8 Performance for the 20 second alarm window. An improvement compared to 10 seconds is noticeable for both texture classifiers due to later correct detections of slow moving people.

System              Seq    TP   FP   FN    F1
Motion Tracker      A1     85   17   28   0.80
                    T1      -    5    -   0.78
                    A2     62   10   33   0.76
                    T2      -    9    -   0.72
                    Total   -    -    -   0.75
Texture Classifier  A1    107    2    6   0.97
                    T1      -    9    -   0.92
                    A2     66    3   29   0.83
                    T2      -   30    -   0.68
                    Total   -    -    -   0.81
Texture + Motion    A1    108    4    5   0.96
                    T1      -    2    -   0.95
                    A2     83    3   12   0.93
                    T2      -    7    -   0.88
                    Total   -    -    -   0.92
Kalman Filter       A1     97   11   16   0.88
                    T1      -    7    -   0.85
                    A2     78    8   17   0.87
                    T2      -    4    -   0.85
                    Total   -    -    -   0.85

(A1/A2: alarm sequences of View1/View2; T1/T2: per-view totals including the distraction sequences.)
Table 1 Detailed numbers of TP, FP and FN with F1 measures for all four systems

The different nature of the noise in the above systems suggests the first extension of the texture classifier with simple motion estimation, which is system three. It outperforms both individual systems with F1=0.89 by exploiting the independence of the noise sources. A low threshold for texture and motion detection allows a reduction of false negatives, while the false positives are still kept low due to the fusion of the two information sources. One downside of the fusion is an increased time to generate an alarm, which sometimes extends past 10 seconds for slow moving people. This increased alarm time inspired the second extension, using Kalman filtering and initialising tracks from motion silhouettes in the inter-frame difference mask; it is the last system shown in the figures. Its overall performance degrades compared to the texture classifier with motion extension, due to a larger number of false positives, particularly during the snow sequence. To keep those false positives down, the catch area for tracks is kept small, which causes some short lost tracks for people. Those tracks are too short to alarm on, which causes false negatives. The average alarm time within the 10 second window is lowered from 3.4 seconds for the motion extension to 3 seconds for Kalman
filtering. Both texture classifier based performance figures increase for the larger alarm window of 20 seconds (Figure 8 and Table 1), which is caused by late correct detections. The best overall performance is F1=0.92 for the texture classifier with motion extension, with the best per-view performance of F1=0.95 for View1. The particular problem of low contrast in View2 is compensated in the joint system, which gives a minimum performance increase of 0.08 with respect to the other systems.
5 Conclusions

We demonstrate good real-time performance for a novel texture classifier that detects salient objects in still images of the i-LIDS sterile zone dataset. A first extension, fusing texture and motion information, increases performance significantly to F1=0.92 on the 24 hour test dataset. A second extension using a Kalman filter shortens the alarm response time, but degrades the overall performance due to more false positives. Future work will focus on applying the classifier to moving camera platforms, where no background estimation is possible.
6 Acknowledgements

We are grateful to the Directorate of Traffic Operations at Transport for London for funding this work. We thank Fei Yin for providing the test results for the OpenCV blob tracker.
7 References

[1] Home Office Scientific Development Branch. Imagery library for intelligent detection systems (i-LIDS). http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/ [accessed 17 June 2008].
[2] A. Colombo, V. Leung, J. Orwell, and S.A. Velastin. Markov models of periodically varying backgrounds for change detection. In Visual Information Engineering 2007, London, July 2007. IET.
[3] Bob Davies and Rainer Lienhart. Using CART to segment road images. In Proceedings of SPIE Multimedia Content Analysis, Management and Retrieval, San Jose, 15-19 January 2006.
[4] Marko Heikkila and Matti Pietikainen. A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):657-662, 2006.
[5] OpenCV. Open source computer vision library. http://www.intel.com/technology/computing/opencv/ [accessed 17 June 2008].
[6] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 246-252, June 1999.
[7] Jianguo Zhang and Tieniu Tan. Brief review of invariant texture analysis methods. Pattern Recognition, 35(3):735-747, March 2002.
[8] Jianyang Zheng, Yinhai Wang, Nancy L. Nihan, and Mark E. Hallenbeck. Extracting roadway background image: Mode-based approach. Transportation Research Record, (1944):82-88, 2005.