Measurement and Control, Vol. 32, Issue 9 (Special Issue on "Intelligent Vision Systems"), pp. 261-264

Image Processing System for Pedestrian Monitoring using Neural Classification of Normal Motion Patterns

B. A. Boghossian and S. A. Velastin
Department of Electronic Engineering, King's College London
Strand, London WC2R 2LS
[email protected], [email protected]

ABSTRACT

Automated surveillance of crowded dynamic scenes requires prompt detection and classification of unusual activities as a means of alerting operators to potentially dangerous situations as they arise. Motion is a strong cue that can be used to classify dynamic scenes and hence to detect abnormal movements that may be related to critical situations. Here we propose a method to detect such unusual movements by learning the normal motion characteristics of a dynamic scene and using the acquired knowledge to detect the unusual cases. We consider a typical CCTV scenario in Liverpool Street Underground Station in the City of London as an application example, and we evaluate the system performance on more than 1800 events to demonstrate its practicality and reliability.

1   Introduction

In recent years, interest in automated surveillance systems has grown dramatically, as advances in image processing and computer hardware have made it possible to design intelligent incident-detection algorithms and to implement them as real-time systems. The need for such equipment has been evident for some time, as human operators are unreliable, fallible and expensive to employ. However, the main challenge that automated surveillance systems must meet is the human operator's ability to form a conceptual understanding of scene dynamics and object interactions. This ability is vital for achieving good performance, especially in complex and crowded scenes where more than one agent contributes to the events.

The concept of automated incident detection is based on finding suitable image cues that can represent the specific event of interest with minimal overlap with other event classes. These cues should be easy to compute or extract from the video sequence to allow on-line operation, they should be robust to noise, and their representation of the event should be time and position independent.

In this paper we adopt motion as the main cue for abnormal incident detection. We start by extracting scene motion information with a hardware-implemented exhaustive block-matching motion detector (a minimal software sketch is given at the end of this section), followed by a stage of motion segmentation and filtering. The motion information extracted from sequences containing normal events is fed to a neural network as training data. The system is thus first set to learn the usual motion patterns in the scene so that abnormal motion patterns can be detected as they arise.

In section 2 we present some of the previous work in this field. In section 3 the neural network model is described, and section 4 describes the run-time techniques employed to detect abnormal agents. Section 5 describes a typical implementation example, followed by results and conclusions in section 6.
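For illustration only, here is a minimal software sketch of the exhaustive block-matching idea. The function name, block size and search range are our assumptions; the paper's detector is a dedicated hardware unit whose parameters are not specified.

```python
import numpy as np

def block_match(prev: np.ndarray, curr: np.ndarray,
                block: int = 16, search: int = 16) -> np.ndarray:
    """Exhaustive block matching: for every block of `curr`, find the
    displacement (dy, dx) into `prev` that minimises the sum of
    absolute differences (SAD). Returns a (rows, cols, 2) field."""
    h, w = curr.shape
    rows, cols = h // block, w // block
    field = np.zeros((rows, cols, 2), dtype=np.int32)
    for r in range(rows):
        for c in range(cols):
            y, x = r * block, c * block
            ref = curr[y:y + block, x:x + block].astype(np.int32)
            best_sad, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue  # candidate block falls outside the frame
                    cand = prev[yy:yy + block, xx:xx + block].astype(np.int32)
                    sad = int(np.abs(ref - cand).sum())
                    if best_sad is None or sad < best_sad:
                        best_sad, best_d = sad, (dy, dx)
            field[r, c] = best_d
    return field
```

An exhaustive search of this kind is far too slow in software at video rates, which is precisely why the paper delegates it to hardware.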

2   Related work

Motion-based automated surveillance, or Intelligent Scene Monitoring (ISM), systems were introduced in the eighties. Video Motion Detection (VMD) and Video Non-Motion Detection have come a long way since then. The main aim of these systems is to alert operators or to start a high-resolution video recording when the motion conditions in a specific area of the scene change. Their operation is purely based on detecting changes in the grey-level patterns between successive video frames or against a predefined reference background image. However, recently developed systems have more sophisticated region-tracking algorithms that aim at reducing the false alarms caused by scene brightness variations, shadows and other non-human moving objects. An overview of these systems can be found in [1].

Neural and Bayesian networks have been used in motion-based approaches to estimate optical flow, to approximate or correct a sparse vector field, to learn and recognise moving-object trajectories, to perform lip reading, and to identify gestures, postures, interactions and behaviour. Learning object dynamics has mostly taken one of two routes: calculation, estimation and classification of motion trajectories to infer higher-level concepts, as in [2] [3], or spatio-temporal model-based tracking and movement recognition, as in [4]. Most of these approaches calculate optical flow as an intermediate stage from which higher-level features are deduced. Other researchers, on the other hand, suggest the direct use of raw motion vectors as cues to identify higher-level concepts, as in [5] [6] [7] [8].

Rosenblum et al. in [5] use a radial basis function network architecture to learn the correlation between the motion patterns of facial features and human emotions. They adopt a hierarchical approach that recovers motion directions at the low level, determines the motion of facial features at the middle level and identifies emotions at the highest level. A separate network is trained to identify each emotion, and each network is divided into sub-networks that specialise in representing the temporal image features at the lower levels of the hierarchy. At the lowest level, the sub-networks are further divided into four networks trained to identify the direction of motion from raw motion vectors calculated by correlation in image space.

Miyauchi et al. in [7] propose a motion interpretation network that enables optical-flow interpretation and describes motions on a plane through a neural network with complex back-propagation learning. The network architecture allows the system to recover motion parameters by identifying displacement, expansion/contraction and rotation, or any combination of these, in the input vector field. In [8] the authors extend this idea to interpret 3D object motion by establishing correspondence between motion patterns from three viewpoints using neural networks.

Heikkonen et al. in [6] employ a self-organising map (SOM) to learn the motion patterns in the image sequence. Feature vectors representing the position and displacement of the motion vectors that typically occur in the sequence are fed as inputs to the SOM at the learning stage. The authors assume smooth object motion, as the motion of physical objects cannot change abruptly. Hence, at run-time, the SOM is used to correct the linear prediction of object position and movement direction across successive frames.

Our approach is similar to that of Heikkonen et al. However, a more general network architecture was adopted, because the assumption of linearly predictable smooth motion does not hold for human motion analysis. Pedestrian motion is classified as articulated motion; however, the articulated-motion model has to be relaxed to take clothing and hair motion into account. In this paper, we present an approach to detect abnormal behaviour in Closed Circuit Television (CCTV) sequences in underground stations by training a network to recognise the raw motion-vector patterns present in the scene. The generality of the adopted network architecture accommodates any combination of motion patterns that might arise in the scene. Moreover, an accurate, noise-free motion field is not necessary, as during the training stage the system will learn the motion patterns related to shadows, scene brightness changes or any other unrelated sources of motion, and will not unduly trigger false alarms. At the same time, the system can deal with crowded scenes as long as the training samples include such examples.


3   Learning normality

Scene motion information is extracted at a rate of 8.33 frames per second. Spatial filters are applied to the motion vector field to remove outliers. A combination of background removal and inter-frame differencing is employed to disregard motion vectors corresponding to the stationary background (assuming a fixed CCTV camera and small changes in scene brightness, given the indoor nature of underground stations).

A complex neural network architecture is adopted to model the motion patterns in the scene, with the network states and weights represented as complex numbers. A 2D network of 64x64 cells is employed, with each cell connected to its closest 24 neighbours via complex weights. This allows motion displacements of up to 16 pixels between frames in a 640x512-pixel image, a displacement empirically found to be sufficient to represent the motion patterns in the video sequences processed; see Figure 1.

Figure 1: Difference image resulting from the subtraction of two video frames separated by 120 milliseconds (captured at a frame rate of 8.33 frames per second), illustrating the amount of displacement (up to about 16 pixels) typically observed during this period.
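As a rough sketch of the pre-processing described above, the motion field can be stored as a complex array (x + jy), spatially filtered per component, and masked against the background. The function name, filter size and threshold are our assumptions, and the sketch assumes all arrays are floats sampled on the same grid:

```python
import numpy as np
from scipy.ndimage import median_filter

def clean_motion_field(vx, vy, frame, prev_frame, background,
                       diff_thresh=10.0):
    """Encode the motion field as complex numbers (x + jy), median-filter
    each component to suppress outlier vectors, and zero out vectors at
    positions that match both the background model and the previous frame
    (i.e. stationary background). All inputs are float arrays on the same
    grid; the threshold is an illustrative assumption."""
    v = median_filter(vx, size=3) + 1j * median_filter(vy, size=3)
    static = (np.abs(frame - background) < diff_thresh) & \
             (np.abs(frame - prev_frame) < diff_thresh)
    v[static] = 0  # discard vectors on the stationary background
    return v
```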


Each cell in the network corresponds to a motion vector in motion space. The cell state describes the x and y components of the corresponding motion vector as a complex number. During the training stage, correspondence between motion vectors in successive video frames is integrated based on measures of position, direction and absolute strength. The network parameters are updated to best describe the observed motion patterns and the temporal relationship between connected cells. In other words, the network weights describe the relationship between the complex state of each cell at time t and the complex states of its neighbouring cells at time t+1. Training consists of two stages: first, a video sequence containing the usual motion patterns for the specific CCTV camera view is arbitrarily selected as a training set, and the network weights are adjusted to recognise and predict those patterns. Once the network converges, the second stage of training starts, in which the network weights are fine-tuned at run-time to best describe the new examples observed (on-line training). Although the first stage has the more significant effect on the network weights, the second stage is also necessary to eliminate false alarms.
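The paper does not give the exact learning rule, so the following is only a plausible sketch of such a complex-weighted predictor: each cell forecasts its next state as a weighted sum of its 24 neighbours' current states, and a complex LMS-style update (our assumption) nudges the weights towards the observed field. The border wraparound introduced by np.roll is a simplification.

```python
import numpy as np

# 24 closest neighbours: the 5x5 neighbourhood minus the cell itself.
NEIGH = [(dy, dx) for dy in range(-2, 3) for dx in range(-2, 3)
         if (dy, dx) != (0, 0)]

class ComplexMotionNet:
    """64x64 grid of cells; each cell predicts its next complex motion
    state from its 24 neighbours through per-connection complex weights."""

    def __init__(self, size=64, lr=0.01):
        self.w = np.zeros((size, size, len(NEIGH)), dtype=np.complex128)
        self.lr = lr

    def predict(self, state):
        """Predicted field at t+1 from the complex field at time t."""
        pred = np.zeros_like(state)
        for k, (dy, dx) in enumerate(NEIGH):
            # rolled[y, x] is the state of the neighbour at (y-dy, x-dx)
            pred += self.w[..., k] * np.roll(state, (dy, dx), axis=(0, 1))
        return pred

    def train_step(self, state_t, state_t1):
        """Complex LMS-style update: move weights to reduce prediction
        error; returns the per-cell error magnitude."""
        err = state_t1 - self.predict(state_t)
        for k, (dy, dx) in enumerate(NEIGH):
            shifted = np.roll(state_t, (dy, dx), axis=(0, 1))
            self.w[..., k] += self.lr * err * np.conj(shifted)
        return np.abs(err)
```

The same train_step could in principle serve the second, on-line stage, applied to fields that the operator has confirmed as false alarms.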

4   Detecting abnormality

After the network has been trained to recognise and predict the usual motion patterns for the specific camera view, the system is capable of detecting moving objects that violate the network predictions beyond a threshold. At this stage, further processing is required to segment the motion information and identify the moving object(s) in the scene. Region-growing segmentation is adopted to partition the vector field into groups on the basis of direction, speed and positional connectivity. The network prediction for each moving object in the scene is compared with the actual motion patterns extracted from the scene to calculate an error measure for that object. Hence, objects whose motion patterns deviate significantly from the network prediction become candidates for detection. A further stage of temporal filtering of the error measures allows the system to build up a confidence measure before triggering the alarm.

It should be mentioned that the motion segmentation stage is not required during training, as the network cells learn the scene motion patterns corresponding to their positions in the network; segmentation is only necessary to calculate an error measure for each moving object. Moreover, the segmentation stage must not slow the detection frame rate below the training frame rate, as this would introduce a larger time gap between frames and hence change the motion properties of the scene from those observed during training.
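A minimal sketch of this run-time stage, using the same complex-field representation as before; the connectivity tolerances, the error aggregation and the smoothing factor are all our assumptions:

```python
import numpy as np
from collections import deque

def segment_motion(v, speed_tol=2.0, angle_tol=0.5):
    """Region growing over a complex motion field: a neighbouring cell
    joins the current segment when its speed and direction are close to
    the current cell's. Returns integer labels; 0 means 'no motion'."""
    h, w = v.shape
    labels = np.zeros((h, w), dtype=np.int32)
    next_label = 0
    for y, x in zip(*np.nonzero(np.abs(v) > 0)):
        if labels[y, x]:
            continue
        next_label += 1                      # start a new segment
        labels[y, x] = next_label
        queue = deque([(y, x)])
        while queue:
            cy, cx = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = cy + dy, cx + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    if labels[ny, nx] or abs(v[ny, nx]) == 0:
                        continue
                    # similar speed and direction -> same object
                    if (abs(abs(v[ny, nx]) - abs(v[cy, cx])) < speed_tol and
                            abs(np.angle(v[ny, nx] / v[cy, cx])) < angle_tol):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
    return labels

def object_error(labels, pred_err, label):
    """Mean prediction-error magnitude over one segment."""
    return float(pred_err[labels == label].mean())

def update_confidence(conf, err, alpha=0.2):
    """Temporal (exponential) filtering of an object's error measure;
    an alarm fires only once the smoothed value crosses a threshold."""
    return (1 - alpha) * conf + alpha * err
```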


In the case of a false alarm at run-time, the specific segment of the vector field that contradicted the network prediction is captured and used to update the network weights, eliminating that source of false alarms in the future. The user interface allows the operator to interrupt the processing easily in the case of a false alarm and to visualise the source of the alarm before deciding whether to use the incident as a training example.

5   Application to a one-way corridor

Here we employ the approach described above to detect pedestrians walking in the opposite direction in a one-way corridor. The CCTV camera is mounted in a corridor leading to ticket hall A in Liverpool Street Underground station in the City of London. Because of the particular architecture of underground stations, most of the premises have low ceilings; CCTV cameras are therefore mounted low, which results in a very poor perspective view, see Figure 3.

Previously, in [9], motion segmentation and temporal filtering were used to detect the main direction of motion of each moving object in the scene. However, the poor viewing perspective introduces undesired motion components, such as the up-down movement of pedestrians during walking and running, or movements of the upper limbs. These movements become significant enough to trigger false alarms, especially when the targeted walking direction is projected as vertical motion in the image plane. The proposed approach addresses these limitations, as the learning set includes all possible movement patterns in the scene, including those not directly related to the pedestrian's walking direction. The learning set does not, however, have to describe the scene's motion patterns comprehensively, as learning can continue at run-time (on-line learning): the operator can instruct the network to learn the motion patterns associated with a specific event that has caused a false alarm.

Accordingly, an arbitrary sequence showing the normal directions of movement in the corridor was recorded, and the extracted motion vectors were applied to the network as a learning set. Figure 2 shows one instance of motion vectors extracted from the video sequence and the interface used to train the network. At run-time the system operates directly on live video data (live camera or VCR) and the results are superimposed on the live video sequence: moving objects that violate the network predictions are highlighted and an alarm is sounded.


Figure 2: Motion vectors extracted from the scene are used to train the network. (The figure shows three subjects walking away from the camera.)

After training the network on an arbitrarily chosen video sequence, the run-time performance was weak: nine false alarms were triggered during the three-hour test. However, on-line training improved the performance dramatically. Figure 3 shows the system in operation, where the highlighted area indicates the motion vectors that did not comply with the network prediction. Owing to the viewing perspective, pedestrians moving at the far end of the corridor do not exhibit significant motion components. Although it is possible to magnify these components by reducing the processing frame rate linearly with scene depth, we chose not to do so because of the large processing power this would require in a real-time implementation. Classification is therefore performed once the moving object exhibits motion components significant enough to provide the confidence to trigger the alarm.


Figure 3: Detection of pedestrians moving in the opposite direction at run-time. (Walking away from the camera is the usual direction of motion.)

6   Practical results and conclusions

The performance of the system was evaluated against manually generated ground truth. A total of 1818 events were considered. An event is defined as the process of detecting the motion direction of a pedestrian passing through the camera's field of view; in the corridor scenario, an event represents a pedestrian entering and exiting the scene. An event lasts approximately ten seconds (i.e. 250 video frames in CCIR/PAL mode and hence about 80 frames processed by the system). Table 1 shows the performance figures for the system. The total number of correctly classified events represents the true positive percentage; the false alarm rate and the no-detection rate represent the false positive and false negative percentages respectively.

Performance measure                 Number of events    Percentage
True detection (walking away)                   1082    99.83% (both directions)
True detection (walking towards)                 733
False alarms                                       5     0.29%
No detection                                       3     0.17%

Table 1: Performance figures for 1818 events, where moving away from the camera is the normal case.
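As a quick arithmetic check, the percentages follow from the raw counts over the 1818 events (a minimal computation; it reproduces the reported figures to within rounding):

```python
total = 1818
true_det = 1082 + 733          # walking away + walking towards
false_alarms, missed = 5, 3
print(f"true detection: {100 * true_det / total:.2f}%")      # 99.83%
print(f"false alarms:   {100 * false_alarms / total:.2f}%")
print(f"no detection:   {100 * missed / total:.2f}%")        # 0.17%
```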


It should be noted that none of the false alarms was caused by noise, shadows or scene brightness changes. Pedestrians exhibiting sudden changes in their walking direction were detected by the system as abnormal agents. We did not train the network to ignore such incidents, as they may be significant to the CCTV operator as a means of detecting fights or other activities whose motion patterns diverge from the normal. To validate the performance measurement process, and to discount the effect of any special-case scenario, the network was also trained to perform the reverse task (i.e. to detect pedestrians moving away from the camera); similar performance figures were obtained, as shown in Table 2.

Performance measure                 Number of events    Percentage
True detection (walking away)                   1080    99.89% (both directions)
True detection (walking towards)                 736
False alarms                                       6     0.33%
No detection                                       2     0.11%

Table 2: Performance figures for 1818 events, where moving towards the camera is the normal case.

During run-time the crowding level in the scene varied widely (up to 22 pedestrians in the scene), particularly because the test sequence was recorded during a rush hour. Large crowding levels of the kind that might occur in a corridor due to a blockage were, however, not observed. As the crowding level increases, the negative effect of occlusion on the segmentation performance becomes more significant; nevertheless, none of the missed detections was caused by occlusion, see Figure 4.

Since the system operates at a processing rate of 8.33 frames per second within a 25 frames-per-second video system, it processes every third video frame. There is therefore the potential for slightly different motion patterns at run-time from those observed during training if a different sequence of sampled video frames is acquired. We have observed that in normal operation these differences are practically insignificant; however, they become more significant as the speed of moving objects in the scene increases, and they may then cause false alarms. Hence, owing to the relatively slow processing rate, the system performance is not time independent in scenes with fast-moving objects. This limitation is to be addressed by accelerating the processing rate to reduce or eliminate the time-dependent differences in the motion patterns.


Figure 4: Detection of a partially occluded pedestrian.

To conclude, we have proposed in this paper the use of raw motion vectors, extracted on-line from CCTV video sequences, to detect abnormal motion patterns as a means of alerting operators to potentially dangerous situations. We have applied the proposed method to a typical CCTV scenario and evaluated the system performance to demonstrate its practicality.

7   Acknowledgements

The authors would like to thank London Underground Ltd and Liverpool Street Station GSM for their co-operation and support.

References

[1] R. M. Roger, I. J. Grist, G. A. Peskett, 'Video motion detection systems: A review for the nineties', Proceedings of the IEEE 28th Annual Carnahan Conference, 1994, pp. 92-97.
[2] B. Carswell, V. Chandran, 'Automated recognition of drunk driving on highway from video sequences', IEEE International Conference on Image Processing ICIP'94, vol. 2, pp. 306-310.


[3] S. J. McKenna, S. Gong, 'Gesture recognition for visually mediated interaction using probabilistic event trajectories', British Machine Vision Conference 1998, pp. ??
[4] L. Xu, D. Hogg, 'Using neural networks to learn spatial-temporal models for moving deformable objects tracking', International Workshop on Neural Networks for Identification, Control, Robotics and Signal/Image Processing, 1996, pp. 145-153.
[5] M. Rosenblum, Y. Yacoob, L. Davis, 'Human emotion recognition from motion using a radial basis function network architecture', Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1994, pp. 43-49.
[6] J. Heikkonen, P. Koikkalainen, C. Schnorr, 'Learning motion trajectories via self-organization', Proceedings of the 12th Image Analysis and Pattern Recognition International Conference (IAPR'94), 1994, vol. 2, pp. 554-556.
[7] M. Miyauchi, M. Seki, 'Interpretation of optical flow through neural network learning', ICCS/ISITA'92, Singapore, ????
[8] A. Miyauchi, A. Watanabe, M. Miyauchi, 'A method to interpret 3D motion using neural networks', IEEE International Conference on Image Processing ICIP'94, 1994, vol. 3, pp. 83-87.
[9] B. Boghossian, S. Velastin, 'Evaluation of motion-based algorithms for automated crowd management', Workshop on Performance Characterisation and Benchmarking of Vision Systems, Las Palmas, Gran Canaria, Spain, January 1999, pp. 80-96.