WIDE AREA SURVEILLANCE WITH A MULTI CAMERA NETWORK

J. Black, T.J. Ellis, D. Makris
Digital Imaging Research Centre, Kingston University, United Kingdom
{J.Black, T.Ellis, D.Makris}@kingston.ac.uk
Abstract

This paper describes a visual surveillance system for outdoor environments using an intelligent multi camera network. Each intelligent camera uses robust techniques for detecting and tracking moving objects. The system architecture supports the real time capture and storage of object track information in a surveillance database. The tracking data stored in the surveillance database is analysed in order to learn semantic scene models, which describe entry zones, exit zones, links between cameras, and major routes in each camera view. This set of models provides a robust framework for coordinating the tracking of objects between overlapping and non-overlapping cameras, and for recording the activity of objects detected by the system.
1 Introduction

With reduced hardware costs and faster processor speeds it has become feasible to deploy large intelligent networks of cameras for visual surveillance applications. There are several key issues that need to be addressed by existing technology. Firstly, calibration of an intelligent camera network can be a time consuming process, particularly if the spatial relationships between the views are unknown. A surveillance system should support a ‘plug and play’ concept, where new cameras can easily be added to or removed from the network. Secondly, once the spatial geometry between the cameras is known, a mechanism is required to reliably integrate tracking information from each camera view, in order to reduce the information load on human operators monitoring the scene. The system should assign unique identities to objects visible in several overlapping camera views, and preserve the identities of objects that move between non-overlapping camera views. Thirdly, an intelligent camera network generates a large volume of tracking data during extended periods of monitoring. The system architecture should make provision for storage that supports fast indexing and retrieval. This is an important task, since a human operator may need to review object activity in the scene over a specific time interval, or recall certain types of events that occurred within the surveillance region. In this paper we present a distributed system architecture that addresses each of these issues.

The remainder of this paper has the following structure. In section two we describe the overall system architecture. The major components of the system are discussed, along with their key functions. In section three we
describe the semantic models of the scene and the camera topology that are automatically learned by analysis of the tracking data. Each of the semantic models is stored in the surveillance database and used by the system to augment the multi view object tracking process. In section four the methods used to coordinate tracking between multiple camera views are discussed. Inter camera tracking between non-overlapping camera views is performed using a combination of spatial and object appearance cues. In section five we describe how the semantic scene models are used to generate compact summaries of object activity in the surveillance database. This high level information can be used for event recall and activity analysis. Section six is a summary of the main contributions of the paper and of the future work planned.
2 System Architecture

The Kingston University Experimental Surveillance system (KUES) comprises several components:

• Network of Intelligent Camera Units (ICU)
• Surveillance Database
• Offline Calibration and Learning Module
• Multi View Tracking Server (MTS)
• Video Summary and Annotation Module (VSA)
• Query Human Computer Interface (QHCI)
The system components are implemented on separate PC workstations in order to distribute the data processing load. In addition, each system component is connected to a 100Mb/s Ethernet network to allow the exchange of data, as shown in Figure 1. The main functionality of each ICU is to robustly detect motion [10] and track moving objects in 2D [11]. The tracking data generated by the ICU network is stored in the surveillance database. The system uses the PostgreSQL database engine, which supports geometric primitives and the storage of binary data. The stored data can be efficiently retrieved and used to reconstruct and replay the tracking video to a human operator. Transferring the raw video from each ICU to the surveillance database would require a large network bandwidth. Instead, only the raw pixels associated with each detected foreground object are transmitted over the network. This MPEG4-like encoding strategy results in considerable savings in terms of the load on the Ethernet network and the storage requirements. An example of the tracked objects stored in the surveillance database is
shown in Figure 2. The foreground objects are plotted every five frames to show each track's motion history. By plotting the foreground objects onto a background image of the camera view it is possible to reconstruct the tracking video.
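To make the foreground-only transfer strategy concrete, the following sketch packs and replays a single detected object. It is a minimal illustration, not the ICU's actual encoder: the array layouts, the packing format, and the function names are assumptions.

```python
import numpy as np

def pack_foreground(frame, mask, label):
    """Pack only the foreground pixels of one detected object.

    frame : HxWx3 uint8 colour image from the camera.
    mask  : HxW boolean foreground mask from motion detection.
    label : integer track identity assigned by the ICU tracker.

    Returns a compact record (bounding box, mask bits, masked pixels)
    that can be written to the surveillance database instead of the
    whole frame.
    """
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = frame[y0:y1, x0:x1]
    crop_mask = mask[y0:y1, x0:x1]
    return {
        "label": label,
        "bbox": (int(x0), int(y0), int(x1), int(y1)),
        "mask": np.packbits(crop_mask),   # 1 bit per pixel
        "pixels": crop[crop_mask],        # foreground pixels only
    }

def unpack_onto_background(record, background):
    """Replay step: paint the stored foreground pixels onto a background
    image of the camera view, as done when reconstructing the video."""
    x0, y0, x1, y1 = record["bbox"]
    h, w = y1 - y0, x1 - x0
    crop_mask = np.unpackbits(record["mask"])[: h * w].reshape(h, w).astype(bool)
    out = background.copy()
    region = out[y0:y1, x0:x1]
    region[crop_mask] = record["pixels"]
    return out
```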
The tracking data in the surveillance database is analysed to learn semantic models of the scene [2,6]. The models include information that describes: entry zones, exit zones, stop zones, major routes, and the camera topology. The semantic models are stored in the surveillance database, so they can be accessed by the various system components. The models provide two key functions for the surveillance system. Firstly, the camera topology defines a set of calibration models that are used by the MTS to integrate the object track information generated by each ICU. The model identifies overlapping and non-overlapping views that are spatially adjacent in the surveillance region. In the former case, homography relations are used to correspond features between pairs of overlapping views [1]. In the latter case, the topology defines the links between entry and exit zones of the non-overlapping camera views. This model allows the MTS to reason about where and when an object should reappear after leaving one camera view through a specific exit zone. Secondly, the semantic scene models provide a method to annotate the activity of each tracked object. It is possible to compactly express an object's motion history in terms of the major routes identified in each camera view. This data can be generated online and stored in the surveillance database. The metadata is a high level representation of the tracking data, which enables a human operator to execute spatial-temporal activity queries with faster response times than would be possible using the raw tracking data.
3 Offline Learning

The system performs a training phase to learn information about the scene by post analysis of the 2D trajectory data stored in the surveillance database, as discussed in section one. The information learned includes the semantic scene models and details of the camera topology. These are both learned automatically without supervision, and stored in the surveillance database to support the functionality of the MTS, VSA, and QHCI system components.

3.1 Semantic Scene Models
Figure 1: System Architecture of KUES
The semantic scene models define regions of activity in each camera view. In Figure 3 the entry zones, exit zones, and routes identified for one of the camera views are shown. The entry zones are represented by black ellipses, while the exit zones are represented by white ellipses. Each route is represented by a sequence of nodes, where the black points represent the main axis of the route and the white points define the envelope of the route. Routes one and two represent lanes of vehicle traffic in the scene. It can be observed that the entry and exit zones are consistent with driving on the left hand side of the road in the UK. The third route represents the flow of pedestrian traffic along the pavement.
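One way to picture the stored route models is the sketch below: each route carries an ordered list of axis nodes with an envelope half-width, and a trajectory is assigned to the route that contains most of its points. The data structure, the nearest-node test, and the 0.8 threshold are illustrative assumptions, not the published learning algorithm [6].

```python
from dataclasses import dataclass
import math

@dataclass
class RouteNode:
    x: float           # main-axis position of the node (image coordinates)
    y: float
    half_width: float  # distance from the axis to the route envelope

@dataclass
class Route:
    route_id: int
    nodes: list        # ordered RouteNode list along the main axis

def nearest_node(route, x, y):
    """Return (index, distance) of the axis node closest to point (x, y)."""
    best_i, best_d = 0, float("inf")
    for i, n in enumerate(route.nodes):
        d = math.hypot(x - n.x, y - n.y)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

def on_route(route, x, y):
    """True if (x, y) falls inside the envelope of the route."""
    i, d = nearest_node(route, x, y)
    return d <= route.nodes[i].half_width

def classify_track(routes, trajectory, min_fraction=0.8):
    """Assign a trajectory (list of (x, y) points) to the route that
    contains most of its points, or None if no route fits well enough."""
    best = None
    for route in routes:
        frac = sum(on_route(route, x, y) for x, y in trajectory) / len(trajectory)
        if frac >= min_fraction and (best is None or frac > best[1]):
            best = (route.route_id, frac)
    return best
```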
Figure 2: Example of object video data stored in the surveillance database
Figure 3: Example of semantic models stored in the surveillance database
3.2 Camera Topology

The major entry and exit zones in the semantic scene models are used to derive the camera topology. The links between cameras are learned automatically by temporal correlation of the object entry and exit events that occur in each view [2]. This approach can be used to calibrate a large network of cameras without supervision, which is a critical requirement for a surveillance system. The camera topology is visually depicted in Figure 4, where the major entry and exit zones are plotted for each camera, along with the links identified between each camera view.
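The following sketch conveys the idea of temporal correlation between exit and entry events: delays between every exit in one zone and every subsequent entry in another are accumulated into a histogram, and a strong peak indicates a link with its typical transition time. The window, bin width, and peak test are illustrative assumptions, simplified from the unsupervised method of [2].

```python
import numpy as np

def transition_histogram(exit_times, entry_times, max_delay=30.0, bin_width=0.5):
    """Accumulate delays between every exit event in one zone and every
    entry event in another zone that follows within max_delay seconds.

    exit_times, entry_times : 1-D arrays of event timestamps (seconds).
    Returns (bin_centres, counts).
    """
    entry_times = np.sort(np.asarray(entry_times))
    edges = np.arange(0.0, max_delay + bin_width, bin_width)
    counts = np.zeros(len(edges) - 1)
    for t in exit_times:
        later = entry_times[(entry_times > t) & (entry_times <= t + max_delay)]
        counts += np.histogram(later - t, bins=edges)[0]
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, counts

def detect_link(centres, counts, peak_factor=4.0):
    """Declare a link between the two zones if the delay histogram has a
    clear peak well above its mean level; return the estimated transition
    time, or None if no link is found."""
    if counts.sum() == 0:
        return None
    peak = counts.argmax()
    if counts[peak] > peak_factor * counts.mean():
        return centres[peak]
    return None
```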
Figure 4: The camera topology learnt for a network of six cameras

4 Inter Camera Tracking and Calibration

One of the key issues with multi view tracking systems is defining calibration models that enable features to be corresponded between camera views. Khan and Shah [5] devised an unsupervised method that automatically finds the limits of the field of view between pairs of cameras. This enables the system to hand over the tracking of objects between overlapping views. Stauffer [9] automatically learns a tracking correspondence model (TCM) between overlapping cameras. The TCM defines a homography transformation between pairs of overlapping camera views. The homography is a point based planar transformation, which assumes there is a dominant ground plane present in the scene. Our surveillance system uses a homography based approach to coordinate tracking between overlapping camera views. The homography transformations are automatically learned for each pair of overlapping camera views defined by the camera topology [1]. This enables the system to correctly correspond multiple objects between cameras. Recently, interest has been shown in tracking objects between non-overlapping camera views [3,4,8]. This problem is more challenging, since the motion of the object is unknown in the transit region between the camera views. This increases the uncertainty of where and when the object should reappear in the surveillance region. The KNIGHT system [3] uses a combination of spatial-temporal and appearance cues to coordinate tracking between disjoint camera views. Parzen windows are used to estimate the inter camera space-time probabilities from a set of training data using some supervision. Kettnaker and Zabih [4] use a Bayesian formulation to coordinate tracking between non-overlapping views, although they assume the camera topology, transition times, and transition probabilities are known in advance. Porikli [8] also uses a combination of spatial-temporal and appearance cues to track objects within a multi camera framework. A novel method is employed for colour calibration, which allows an object's appearance to be predicted between non-overlapping camera views. A correlation matrix is computed between histograms of the same object observed in different views. Dynamic programming is used to compute the minimum cost path to align the appearance histograms of objects seen in different cameras.

4.1 Non-Overlapping Camera Views
One significant advantage of our system over the methods previously discussed [3,4,5,8,9] is that we can automatically learn the camera topology in a correspondence free manner, using a large set of observation data [2]. The camera topology also describes a set of object handover regions (OHR) between non-overlapping views. Each OHR defines related entry and exit zones, along with the transition times and transition probabilities. The OHRs identified for the six cameras in our surveillance network are shown in Figure 4. The models are extended to include colour calibration information, which allows an object's appearance to be predicted when it emerges from an OHR. We employ the method described in [8], since the approach can handle non-linear and non-parametric models. The training data consists of pairs of histograms that represent the object's appearance on entry to and exit from the OHR. The appearance histograms are constructed using the foreground pixels of the object. The HSV colour space is used, with 8, 8, and 4 bins allocated to the H, S, and V channels respectively. The training data is derived from correlated 2D object tracks found during the learning phase of the camera topology. The colour calibration model defines the path P = \{(m_{ch,i}, n_{ch,i})\} that minimises the inter-bin distances between object histograms taken from each camera view, where ch = 1..K and K is the number of colour channels in the object histogram, and i = 1..N where N is the length of the minimum cost path. The values m_{ch,i} and n_{ch,i} are the indices of bins in the object histograms of the relevant colour channel in the first and second views, respectively. Given two appearance histograms p and q, they can be compared using a modified Bhattacharyya distance (1), which is a popular similarity measure for colour based tracking [3,7]:

D_c(p, q) = \sqrt{1 - \sum_{ch=1}^{K} \sum_{i=1}^{N} \sqrt{p_{ch}(m_{ch,i}) \, q_{ch}(n_{ch,i})}}    (1)
Given the set of training examples, the value of D_c(p, q) is modelled as a Gaussian variable with mean \mu_{D_c} and variance \sigma^2_{D_c}. Two histograms are identified as matching if the distance measure is within two standard deviations of the mean distance.
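The appearance matching step can be sketched as follows. This is a minimal illustration under several assumptions: the histograms are built with numpy rather than the system's own code, pixel values are taken to lie in 0-255, and the colour-calibration path P is encoded as one list of (m, n) bin pairs per channel. The distance follows the reconstructed form of equation (1), and the match test applies the two-standard-deviation rule described above.

```python
import math
import numpy as np

# Bins per HSV channel, as described in section 4.1.
HSV_BINS = (8, 8, 4)

def appearance_histograms(hsv_pixels):
    """Build one normalised histogram per HSV channel from the foreground
    pixels of an object (N x 3 array, values assumed scaled to 0-255)."""
    hists = []
    for ch, bins in enumerate(HSV_BINS):
        h, _ = np.histogram(hsv_pixels[:, ch], bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return hists

def modified_bhattacharyya(p, q, path):
    """Distance (1): the bin correspondences (m, n) for each channel come
    from the colour-calibration path learned for the handover region."""
    s = 0.0
    for ch, pairs in enumerate(path):
        for m, n in pairs:
            s += math.sqrt(p[ch][m] * q[ch][n])
    # Clamp for numerical safety in case the aligned sums exceed one.
    return math.sqrt(max(1.0 - s, 0.0))

def histograms_match(p, q, path, mean_d, std_d):
    """Match test from section 4.1: accept if the distance lies within two
    standard deviations of the mean distance seen in training."""
    return abs(modified_bhattacharyya(p, q, path) - mean_d) <= 2.0 * std_d
```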
4.2 Object Handover Agents

Given the camera topology and colour calibration models for each OHR it is possible to coordinate tracking between non-overlapping camera views. The object handover mechanism only needs to be activated when an object is terminated within an exit zone of an OHR. When new objects are created within the entry zone of the OHR, the system attempts to match them against the objects that are currently in transit between the pair of views.

Handover Initiation. The handover agent is activated when an object is terminated within an exit zone of an OHR defined by the camera topology. The handover agent records the exit location, exit time, and appearance histogram of the object when it leaves the field of view of the first camera. Allowing the object handover agent to be activated only when an object is terminated in an OHR eliminates the case where an object is prematurely terminated within the field of view due to tracking error. In addition, once the handover agent has been invoked, the OHR model can be used to determine the most likely regions where the object is expected to reappear, hence reducing the computational cost of completing the handover process.

Handover Completion. The handover process is completed when an object is created within the entry zone of the OHR. The object handover agent's task is only complete if the new object satisfies the following data association constraints: the transition time of the object must be consistent with the expected temporal delay of the OHR, and the objects must match according to the appearance histogram similarity measure (1) discussed in section 4.1.

Handover Termination. The handover agent is terminated once an object has not been matched after a maximal temporal delay, which is determined from the statistical properties of the OHR. The maximal temporal delay in a handover region is an important characteristic, since the surveillance regions are not constrained in such a way that an object must re-appear in a field of view once it enters an OHR. It is possible that once the object enters the OHR it will not be visible to any of the views in the network of cameras. In this situation it is not possible for the system to locate the object.

An example of the object handover reasoning process is given in Figure 5. The identity of the vehicle is correctly preserved as it moves through four of the cameras in the surveillance network. The black lines represent the handover regions used to coordinate the tracking of objects between each of the views. The sequence of images (reading from left to right) represents the vehicle's motion through each of the four camera views.
Figure 5: Object handover between four camera views
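The initiation, completion, and termination steps above can be pictured as one agent per object in transit, as in the sketch below. The OHR and candidate field names (mean_delay, std_delay, entry_zone, colour_path, and so on) and the mean-plus-three-standard-deviations timeout are illustrative assumptions, not the system's actual interface; the appearance test reuses histograms_match from the sketch in section 4.1.

```python
class HandoverAgent:
    """One agent per object in transit through an object handover region
    (OHR) between two non-overlapping views.  The OHR statistics and the
    colour-calibration path are assumed to have been learned offline."""

    def __init__(self, ohr, exit_time, exit_zone, appearance):
        self.ohr = ohr               # linked entry/exit zones and statistics
        self.exit_time = exit_time   # handover initiation data, recorded
        self.exit_zone = exit_zone   # when the object leaves the first view
        self.appearance = appearance # per-channel appearance histograms

    def expired(self, now):
        """Handover termination: give up after a maximal temporal delay
        derived from the OHR statistics (here mean + 3 standard deviations,
        one plausible choice)."""
        return now - self.exit_time > self.ohr.mean_delay + 3.0 * self.ohr.std_delay

    def try_complete(self, candidate):
        """Handover completion: a newly created object matches if it appears
        in the linked entry zone, after a plausible transit time, with a
        matching appearance histogram."""
        if candidate.entry_zone != self.ohr.entry_zone:
            return False
        dt = candidate.entry_time - self.exit_time
        if abs(dt - self.ohr.mean_delay) > 2.0 * self.ohr.std_delay:
            return False
        return histograms_match(self.appearance, candidate.appearance,
                                self.ohr.colour_path,
                                self.ohr.mean_distance, self.ohr.std_distance)
```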
5 Video Summaries and Annotation

The surveillance system generates a large volume of tracking data during long periods of continuous monitoring. The surveillance database typically requires between 5 and 10 GB to store twenty-four hours of tracking data. The vast majority of this space is consumed by the image data of each detected foreground object. A general task of a human operator is to review certain types of activity. For example, a request can be made to replay video of all instances where an object moved within a restricted area during a specific time interval. Even with appropriate indices the tracking data is not suitable for these types of queries, since the response times are of the order of minutes, which would not be acceptable to the operator if many similar queries had to be run. We resolve this problem by generating metadata that describes the tracking data in terms of the routes contained in the semantic scene models. Each tracked object is summarised in terms of its entry location, entry time, exit location, exit time, and appearance information. To support spatio-temporal queries the complete object trajectory is stored as a path database geometric primitive. This storage format simplifies the classification of an object's path in terms of the route models. Once the routes taken by the object have been determined, its complete motion history is expressed in terms of the nodes along each route. Metadata is generated that describes the entry time and location along the route, and the exit time and location along the route. When an object traverses several routes, multiple entries are generated. The data flow of the metadata generation process is shown in Figure 6. The operator can use the query interface to execute spatial-temporal queries on the metadata with faster response times (of the order of seconds). The tracking data can then be accessed to replay the tracking video. The current system supports the following types of activity queries:

• Select all objects that have used a specific entry or exit zone
• Retrieve all objects that have used a specific route in a specific camera
• Retrieve all objects that have visited a combination of routes in a specific order
• Retrieve all objects that have passed through a section of a route
• Count the route popularity by time of day
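A minimal sketch of the second query type against the metadata is given below. The table and column names (route_metadata, object_id, and so on) and the use of the psycopg2 client are assumptions for illustration, not the actual KUES schema; the point is that the query runs against the compact route metadata rather than the raw tracking data.

```python
import psycopg2

# Illustrative schema only: table and column names are assumptions.
QUERY = """
    SELECT object_id, entry_time, exit_time
    FROM route_metadata
    WHERE camera_id = %s
      AND route_id  = %s
      AND entry_time >= %s
      AND exit_time  <= %s
    ORDER BY entry_time;
"""

def objects_on_route(dsn, camera_id, route_id, t_start, t_end):
    """Retrieve all objects that used a given route in a given camera during
    a time interval.  Because the query touches only the metadata, the
    response is of the order of seconds rather than minutes."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY, (camera_id, route_id, t_start, t_end))
            return cur.fetchall()
```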
Figure 6: The data flow for online metadata generation
6 Conclusion

We have demonstrated a framework for tracking using an intelligent camera network and a multi view tracking server. The key novelty of our system is that we automatically learn scene models, which define the camera topology, major entry/exit zones, and routes in each view. The camera topology models are used by the multi view tracking server to coordinate object tracking. Both spatial-temporal and appearance cues are used by the system to coordinate the tracking of objects between non-overlapping views. The semantic models also provide an efficient method for generating summaries of object activity. Each tracked object can be compactly expressed in terms of the routes of the semantic model, allowing human operators to easily execute high-level activity queries with fast responses. In the current system it is possible to create spatial-temporal queries on single object activity. In future work we will add more sophisticated data mining functionality to identify different types of object interactions and behaviours.
References
1. Black J., Ellis T.J., Multi View Image Surveillance and Tracking, IEEE Workshop on Motion and Video Computing, Orlando, December 2002, pp 169-174.
2. Ellis T.J., Makris D., Black J., Learning a Multicamera Topology, Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, October 2003, pp 165-171.
3. Javed O., Rasheed Z., Shafique K., Shah M., Tracking Across Multiple Cameras With Disjoint Views, International Conference on Computer Vision (ICCV 2003), Nice, France, October 2003, pp 952-957.
4. Kettnaker V., Zabih R., Bayesian Multi-Camera Surveillance, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 1999), Fort Collins, Colorado, June 1999, pp 2253-2561.
5. Khan S., Shah M., Consistent Labeling of Tracked Objects in Multiple Cameras with Overlapping Fields of View, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), October 2003, Vol. 25, No. 10, pp 1355-1360.
6. Makris D., Ellis T.J., Automatic Learning of an Activity-Based Semantic Scene Model, IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2003), Miami, July 2003, pp 183-188.
7. Nummiaro K., Koller-Meier E., Van Gool L., Color Features for Tracking Non-Rigid Objects, Special Issue on Visual Surveillance, Chinese Journal of Automation, May 2003, Vol. 29, No. 3, pp 345-355.
8. Porikli F., Divakaran A., Multi-Camera Calibration, Object Tracking and Query Generation, International Conference on Multimedia and Expo (ICME), Baltimore, July 2003, pp 653-656.
9. Stauffer C., Tieu K., Automated Multi-Camera Planar Tracking Correspondence Modeling, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, Wisconsin, June 2003, Vol. 1, pp 259-266.
10. Xu M., Ellis T.J., Illumination-Invariant Motion Detection Using Color Mixture Models, British Machine Vision Conference (BMVC 2001), Manchester, September 2001, pp 163-172.
11. Xu M., Ellis T.J., Partial Observation vs Blind Tracking through Occlusion, British Machine Vision Conference (BMVC 2002), Cardiff, September 2002, pp 777-786.
Javed O., Rasheed Z., Shafique K., Shah M., Tracking Across Multiple Cameras With Disjoint Views, International Conference on Computer Vision (ICCV 2003), Nice France, October 2003, pp 952-957 Kettnaker V., Zabih R. Bayesian Multi-Camera Surveillance. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 1999), Fort Collins, Colorado, June 1999, pp 2253-2561 Khan S., Shah M, Consistent Labeling of Tracked Objects in Multiple Cameras with Overlapping Fields of View. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), October 2003, Vol. 25, No. 10, pp 1355-1360 Makris D, Ellis T.J., Automatic Learning of an ActivityBased Semantic Scene Model, IEEE Conference on Advanced Video and Signal Based Surveillance (AVSB 2003), Miami, July 2003, pp183-188 Nummiaro K., Koller-Meier E., Van Gool L. Color Features for Tracking Non-Rigid Objects. Special Issue on Visual Surveillance, Chinese Journal of Automation, May 2003, Vol 29, No. 3, pp 345-355 Porikli F., Divakaran A., Multi-Camera Calibration, Object Tracking and Query Generation, International Coneference on Multimedia and Expo (ICME), Baltimore, July 2003, pp 653-656 Stauffer C., Tieu K. Automated multi-camera planar tracking correspondence modeling. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison Wisconsin, June 2003, Vol 1, pp 259-266 Xu M, Ellis T.J., Illumination-Invariant Motion Detection Using Color Mixture Models, British Machine Vision Conference (BMVC 2001), Manchester, September 2001, 163-172. Xu M, Ellis T.J., Partial Observation vs Blind Tracking through Occlusion, British Machine Vision Conference (BMVC 2002), Cardiff, September 2002, 777-786.