Hierarchical Database for a Multi-Camera Surveillance System

James Black, Digital Imaging Research Centre, Kingston University, Kingston-upon-Thames, Surrey, KT1 1LQ, UK, [email protected]
Dimitrios Makris, Digital Imaging Research Centre, Kingston University, Kingston-upon-Thames, Surrey, KT1 1LQ, UK, [email protected]
Tim Ellis, Digital Imaging Research Centre, Kingston University, Kingston-upon-Thames, Surrey, KT1 1LQ, UK, Tel: +44 20 8547 7659, [email protected]

The original version can be downloaded at: http://www.springerlink.com/content/vt127692213u252l/
Hierarchical Database for a Multi-Camera Surveillance System
Abstract
This paper presents a framework for event detection and video content analysis for visual surveillance applications. The system is able to coordinate the tracking of objects between multiple camera views, which may be overlapping or non-overlapping. The key novelty of our approach is that we automatically learn a semantic scene model for the surveillance region, and have defined data models to support the storage of different layers of abstraction of the tracking data in a surveillance database. The surveillance database provides a mechanism to generate video content summaries of objects detected by the system across the entire surveillance region in terms of the semantic scene model. In addition, the surveillance database supports spatio-temporal queries, which can be applied for event detection and notification applications.
1 Introduction

A primary requirement for a visual surveillance system is to record and archive the activity detectable in the viewfields of the surveillance cameras and to provide a convenient means of retrieval. In traditional systems this is done by continuously recording the full bandwidth video, or by sampling the data using a trigger generated by a video motion detection (VMD) system. More recent technologies replace the analogue video recorder with digital recording, taking advantage of standard video compression algorithms (e.g. MJPEG, MPEG) to reduce the storage requirements. Whilst these systems provide some limited capability to detect 'events' in the video data, they are incapable of identifying more complex sequences of events that can be used to classify complex behaviours of the detected objects. With the advent of multi-view video tracking surveillance systems, the requirement is for an information management system that can support the diverse variety of data associated with the surveillance task, and also enable sophisticated queries involving the interaction between tracked objects, the environment and other objects; for instance, to detect two people meeting at a park seat during a given time period, and to trace their route through the environment prior and subsequent to the meeting event. In order to pose such queries it is important to be able to represent these spatio-temporal events in a form that is efficient to store and search.

We describe the design and usage of a database that is used to maintain information associated with the tracking of multiple objects (mainly pedestrians and vehicles) in an outdoor environment viewed by multiple video cameras. Information is represented at multiple levels of abstraction, describing pixel video data, object features, trajectory information and semantically labelled scene regions [3]. The database is queried to extract trajectory data that is used to analyse motion-related events and create spatio-temporal probabilistic models of typical activities in the scene, supporting a semantic interpretation of the detected activities. These models can be used to support the detection of atypical activities and to interpret simple object behaviours.

Section 2 overviews some recent developments in multi-camera surveillance systems, and examines the most popular methods for event detection and annotation. Section 3 considers some of the forms of semantic labelling that can be applied to outdoor surveillance systems. Section 4 describes how the trajectory data stored in the database can be used to learn spatio-temporal and probabilistic models that encode common activities. Section 5 describes the overall system architecture and the database model we have used for our online surveillance system; the learned models are used to construct a surveillance database that contains descriptions of semantically meaningful scene features, which can be used to enhance the capabilities of the object tracking algorithms. Section 6 gives a summary of the paper and planned future work.
2 Background

2.1 Multiview Tracking Systems

With reduced hardware costs and increased processor speeds it has become feasible to deploy intelligent multi-camera surveillance networks to robustly track object activity in both indoor and outdoor environments [4,6,7,10,12,13]. Cai and Aggarwal presented an extensive distributed surveillance framework for tracking people in indoor video sequences [4]. Appearance and geometric cues were used for object tracking, and tracking was coordinated between partially overlapping views by applying epipole line analysis for object matching. Chang and Gong created a multi-view tracking system for cooperative tracking between two indoor cameras [6]. A Bayesian framework was employed for combining geometric and recognition-based modalities for tracking in single and multiple camera views.

Outdoor image surveillance presents different challenges, with greater variability in the lighting conditions and the vagaries of weather to contend with. The Video Surveillance and Monitoring (VSAM) project developed a system for continuous twenty-four hour monitoring [7]. The system made use of model-based geo-location to coordinate tracking between adjacent camera views. In [13] the ground plane constraint was used to fit a sparse set of object trajectories to a planar model; a robust image alignment technique could then be applied in order to align the ground plane between overlapping views. A test-bed infrastructure has been demonstrated for tracking platoons of vehicles through multiple camera sites [12]. The system made use of two active cameras and one omnidirectional camera overlooking a traffic scene. Each camera was connected to a gigabit Ethernet network to facilitate the transfer of full-size video streams for remote access. More recently the KNIGHT system [10] has been presented for real-time tracking in outdoor surveillance. The system can track objects between overlapping and non-overlapping camera views by using spatial, temporal and appearance cues.
2.2 Video Annotation and Event Detection

One application of a continuous twenty-four hour surveillance system is event detection and recall. The general approach to solving this problem is to employ probabilistic frameworks in order to handle the uncertainty of the data that is used to determine if a particular event has occurred. A combination of Bayesian classification and Hidden Markov Models (HMMs) was used in the VIGILANT project for object and behavioural classification [21]. The Bayesian classifier was used for identification of object types, based on the object velocity and bounding box aspect ratio. An HMM was used to perform behavioural analysis to classify object entry and exit events. By combining the object classification probability with the behavioural model the VIGILANT system achieved substantial improvement in both object classification and event recognition. The VIGILANT system also made use of a database to allow an untrained operator to search for various types of object interactions in a car park scene.

Once the models for object classification and behavioural analysis have been determined it is possible to automatically annotate video data. In [19] a generic framework for behavioural analysis is demonstrated that provides a link between low-level image data and symbolic scenarios in a systematic way. This was achieved by using three layers of abstraction: image features, mobile object properties, and scenarios. In [8] a framework has been developed for video event detection and mining, which uses a combination of rule-based methods and HMMs to model different types of events in a video sequence. Their framework provides a set of tools for video analysis and content extraction, by querying a database. One problem associated with standard HMMs is that in order to model temporally extended events it is necessary to increase the number of states in the model, which increases the complexity and the time required to train the model. This problem has been addressed by modelling temporally extended activities and object interactions between multiple agents using a probabilistic syntactic approach [9]. The recognition problem is subdivided into two levels: the lower level uses an HMM to propose candidate detections of low-level temporal features, while the higher level feeds these detections into a stochastic context-free parser. The grammar and parser can provide a longer range of temporal constraints, incorporate prior knowledge about the temporal events given the domain, and resolve ambiguities of uncertain low-level detection.

The use of database technology for surveillance applications is not completely new. The 'Spot' prototype system is an information access system that can answer interesting questions about video surveillance footage [11]. The system supports various activity queries by integrating a motion tracking algorithm and a natural language system. The generalised framework supports event recognition, querying using a natural language, event summarisation, and event monitoring. In [20] a collection of distributed databases was used for networked incident management of highway traffic, and a semantic event/activity database was used to recognise various types of vehicle traffic events. The key distinction between these systems and our approach is that an MPEG-4-like strategy is used to encode the underlying video, and the semantic scene information is automatically learned using a set of offline processes [14,15,16,17].
3 Environment Modelling

3.1 Camera Network

In order to integrate the track data from multiple cameras, it is useful to consider the visibility of targets within the entire environment, and not just each camera view separately. Four region visibility criteria can be identified to define the different fields-of-view (FOV) available from the network of cameras:
Figure 1. Visibility criteria for camera network (legend: camera location, building, viewfield, overlapped viewfield).
• visible FOV - defines the regions that an individual camera will image. In cases where the camera view extends to the horizon, a practical limit on the view range is imposed by the finite spatial resolution of the camera or by a practical limit on the minimum size of reliably detectable objects.
• camera FOV - encompasses all the regions within the camera view, including occluded regions.
• network FOV - encompasses the visible FOVs of all the cameras in the network. Where a region is occluded in one camera's visible FOV, it may be observable within another FOV (i.e. overlap).
• virtual FOV - covers the network FOV and all the spaces in between the camera FOVs within which the target must exist. The "boundaries" of the system represent locations from which previously unseen targets can enter the network.
Figure 1 illustrates the camera network visibility regions for a simple environment projected onto the ground plane. Occluded regions are shown in white (if within the expected viewfield of a camera).
Figure 2 shows an example of a manually constructed scene occlusion model. In the single camera tracking software we combine knowledge of the location of occlusion regions with the predicted locations of moving objects to determine the likelihood of an object being detected in the next image frame [22,23]. We have used a simple representation of these static occlusion regions using a rectangular bounding box. The boxes are labelled to identify the following occlusion types:
1. Border occlusions (BO) - these express the limits of the camera FOV and are trivially determined in image coordinates, but require knowledge of the extent of the visible ground plane if they are to be expressed in 3D coordinates.
2. Long-term occlusions (LO) - these represent regions where objects may 'prematurely' leave the scene due to occlusion regions that overlap with the image border. Typical sources include buildings (and doors into buildings) and vegetation.
3. Short-term occlusions (SO) - these represent regions where the object may temporarily disappear from the view of the camera. Apart from predicting the disappearance, the tracker prediction can also determine the expected re-appearance of the object (assuming it holds to the motion model used by the tracker).
It should be noted that, for the long- and short-term occlusion regions in 2D image plane coordinates, this occlusion reasoning mechanism only determines the possibility of an occlusion, since the object may actually move in front of the occluding region and hence remain visible. Generating a more reliable prediction would be left to a 3D reasoning process, necessitating a 3D representation of the occlusion region.
Figure 2. Depiction of typical static occlusion-generating features in the scene.
The occlusion reasoning process is also applied to reasoning about dynamic occlusions (i.e. moving objects occluding each other) but this process does not require any interaction with the database, and is determined purely from the forward location predictions of the Kalman filter tracker [1].
3.2 Scene Modelling

We wish to be able to identify the relationship between activity (moving objects) in the camera viewfield and features of the scene. These features are associated with visual motion events, for example locations where objects may appear (or disappear) from view, either permanently or temporarily. Figure 3 illustrates two representations of this spatial information, showing the topographical and topological relationships between a specific set of activity-related regions: entry and exit zones (labelled A, C, E, G and H), paths (AB, CB, BD, DE, DF, FG, FH), path junctions (B, D, F) and stopping zones (I, J).
Figure 3. a) Topographical map showing the spatial relationships between activity-related scene features. b) Topological graph of the same scene.
We can categorise these activity-related regions into two types. The first are fixed with respect to the scene, and are linked to genuine scene features such as the pavement or road, where we expect to locate objects that are moving in well-regulated ways, or at the doorway into a building. The second are view-dependent, associated with the particular viewpoint of a camera, and may change if the camera is moved.
Figure 4. Manually constructed region map locating different motion-related activities in a camera viewfield. Entry and exit zones are indicated by the yellow rectangles, paths are shown by the green polygons, and stop zones by the red regions.
Figure 4 indicates a number of these activity zones that can be identified by observing the scene over long periods of time (typically hours or days). Entry/exit zones 1 and 3 are view-dependent; zone 2 is a building entrance and hence scene-related. The stop regions are examples of locations where people stop, in this case, sitting on the wall and the flag plinth.
4 Learning Scene Features

4.1 Entry/Exit Zones

In most environments, targets enter and exit the scene through predefined entry/exit zones. Entry/exit zones depend on the scene structure and the camera view, so they can be associated with scene features like doors, gates, etc. (scene-based features), or they can be placed on the borders of the visible FOV (view-based features). They can also be both scene-based and view-based, e.g. the section of the FOV border that coincides with a road. Entry and exit zones can be coincident, as is usually the case for pedestrian activity. In other environments, which contain road traffic, entry and exit zones can be different.

Entry/exit zones are learnt from observations. The first and the last point of each trajectory in the dataset are used for the entry point dataset and the exit point dataset respectively. A multi-step approach based on the Expectation-Maximisation (EM) algorithm is used to learn gaussian mixture models (GMMs) of the entry/exit zones [17]. Two main problems are encountered in our approach: the unavoidable presence of noise, and model order selection. Noise in the dataset is associated with false entry/exit points, and results from failures of the motion tracker. Two kinds of noise are recognised:
i) "Activity" noise is caused by false initiations/terminations of trajectories. This is distributed over all the activity areas and has high values where activity is high (mainly where the rate of dynamic occlusions is high).
ii) "Stationary" noise results from irrelevant motion in the scene, e.g. trees, curtains, computer screens, windows and other reflective surfaces. Such noise sources are usually spatially localised.
We describe the algorithm for the entry zones; learning the exit zones is similar:
i) The entry point dataset is derived from all the trajectory data.
ii) Apply EM to the entry point dataset. The model order is an arbitrary number (10-15), larger than the real number of entry zones.
iii) Use the results of the previous step to detect stationary noise trajectories: if a trajectory is characterised as stationary, its entry point is excluded from the entry point dataset. At the end, a new clean entry point dataset is formed.
iv) Re-apply EM to the clean entry point dataset.
v) Separate "signal" clusters from "noise" using their density, keeping only the clusters with density above a threshold.
In step (ii), an initial EM learning is performed. However, the results may not be fully representative, due to the presence of noise. Stationary noise is removed in step (iii): trajectories all of whose points belong to a single cluster of step (ii) are characterised as stationary and are discarded. In step (iv) EM is applied again to the clean dataset. "Activity" noise is characterised by wide gaussian distributions of low density over the activity areas, and can be eliminated by discarding components whose density falls below a threshold. The density d_i of the i-th gaussian distribution is calculated using the formula:

d_i = \frac{w_i}{\pi \sqrt{|\Sigma_i|}}

where w_i is the prior probability of the gaussian (i.e. its popularity), \Sigma_i is its covariance and \pi \sqrt{|\Sigma_i|} is the area of the gaussian ellipse. The threshold T used to suppress "activity" noise is estimated from:

T = \frac{a \cdot N}{\pi \sqrt{|\Sigma|}}

where a is a user-defined constant (typical value a = 0.1), N is the model order and \Sigma is the covariance matrix of the entire dataset.
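For reference, the mixture fitted by EM in steps (ii) and (iv) has the standard gaussian mixture form. The restatement below is added here for clarity rather than taken verbatim from the original text; it uses the same symbols as the density and threshold formulas above, with \mu_i denoting the component means (a symbol introduced here):

p(\mathbf{x}) = \sum_{i=1}^{N} w_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i), \qquad \sum_{i=1}^{N} w_i = 1

A component i is retained as a genuine entry (or exit) zone only if d_i > T; otherwise it is treated as "activity" noise and discarded.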
Figure 5. Left upper and lower are 2D spatial histograms of the initialisation and termination of approximately 13223 trajectories accumulated over a 12 hour period. Right upper and lower are the extracted entry and exit zones.
Figure 5 shows the 2D histograms for trajectory initialisation and termination taken from a 12 hour tracking period comprising approximately 13223 pedestrian trajectories. The results of applying the EM algorithm to these histograms are overlaid onto the image, showing the high-density gaussian clusters that are detected. Figure 6 shows similar results in a road traffic environment.
Figure 6. Entry (a) and exit (b) zones derived from a set of 12746 trajectories in a car traffic environment. Note that traffic drives on the left-hand side of the road in the UK.
4.2 Path Modelling

Having identified zones where objects enter and leave the camera field of view, we now model the paths along which they travel. The model uses a discrete, spline-like representation to encode the main axis of the path, constructing an envelope that defines the spatial extrema of the path [14,15]. The paths are learnt in an unsupervised fashion from sets of trajectories extracted from the database. Overlaid onto the spatial model, a probabilistic model of route usage is generated. This uses a 1D gaussian at each spline node, perpendicular to the path direction, to model the pattern of trajectories that contribute to the path. Adding the probabilistic information supports the use of an HMM to model the overall usage [16].
Figure 7. 752 sample trajectories used to learn path models (left) and the resulting path models (right).
The left image in Figure 7 shows a set of 752 pedestrian trajectories observed over a twenty-four hour period. The right image in Figure 7 shows the set of paths and associated envelopes of the paths automatically learned from the trajectory data.
4.3 Stop Zones

Although the majority of surveillance systems concentrate on the task of tracking moving objects through the scene, in many cases it is the objects that become stationary (or semi-stationary) that may be of particular interest to a security surveillance task. As a consequence, we consider next the task of locating regions in the image that might be specifically associated with such behaviour. Stop zones are defined as the regions where the targets are considered stationary. A variety of different areas can be characterised as stop zones, such as areas where people rest (e.g. a seat), wait for the opportunity to continue their journey (e.g. at a pedestrian crossing or a road traffic junction), wait to access a particular resource (e.g. at an ATM or a bus stop), access a resource (e.g. a computer or a printer), or merely observe the scene. Moving objects may also become stationary when they meet and interact with other moving objects: two people meeting in a park and sitting at a bench to chat, or someone getting into or out of a motor vehicle. From the above description, stop zones can be associated with scene features and their detection contributes to the semantic description of the scene.

Stop zones are learnt in a similar way to entry/exit zones. The dataset consists of trajectory points that correspond to a stopping event, which is detected when the target speed falls below a threshold. Because the apparent speed of the targets on the image plane is strongly affected by the perspective of the view, speed is more reliably estimated in ground plane coordinates. Figure 8 shows a 2D histogram of the stationary objects taken from camera 1 (see Figure 6) and Figure 9 shows the detected stop zones overlaid onto the reprojected image of this camera view.
Figure 8: Histogram of the 9455 stop event points
Figure 9: Stop zones detected
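The stop-event dataset described above can be assembled directly from the stored track data once ground-plane velocities are available. The following is a hedged sketch only: the table name, the column layout and the 0.3 m/s speed threshold are illustrative assumptions that loosely follow the object motion layer attributes described later in Section 5, not the exact schema used by the system.

-- Illustrative 3D track table: object_motion_3d(trackid int, frame int,
--   position point, velocity point), with velocity holding ground-plane (vx, vy).
select trackid, frame, position
from object_motion_3d
where sqrt(velocity[0] ^ 2 + velocity[1] ^ 2) < 0.3;  -- assumed stop threshold in m/s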
A stop event is characterised not only by its location, but also by its duration. Therefore, it is desirable to model the stop event duration for each stop zone. In Figure 10 the stop event duration distribution is shown for two selected zones. As illustrated, the stop event duration distribution can be approximated by an exponential function.
Figure 10. Stop event duration distribution for two selected stop zones (in seconds).
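The exponential approximation can be written out explicitly. The parametrisation and maximum-likelihood estimate below are standard results stated here for clarity; they are our restatement rather than formulas taken from the original text, and the symbols \lambda_z and \bar{\tau}_z are introduced here:

p(\tau) = \lambda_z \, e^{-\lambda_z \tau}, \qquad \hat{\lambda}_z = \frac{1}{\bar{\tau}_z}

where \tau is the stop duration, z indexes the stop zone and \bar{\tau}_z is the mean observed stop duration in zone z.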
4.4 Temporal Dependencies of Models

We have so far assumed that the models we have learnt are derived from trajectories generated by a stationary stochastic process. In reality, the models need to accommodate a temporal component, either based on creating separate models for different time periods, or building temporal dependencies into the model.
Figure 11. a) Probability of a trajectory terminating at the University entrance and b) terminating at the bottom left exit.
Figure 11 a) shows the probability of a pedestrian entering the scene in Figure 7 terminating at the University entrance, and b) starting at the University entrance and exiting the scene. The models are generated over a 24-hour period. From 8-9am, almost all trajectories occur towards the entrance of the University, whilst the number of people leaving peaks around 5pm.
4.5 Evaluation

We ran a set of 70 experiments to test the algorithm that detects the entry/exit zones, based on the dataset shown in Figure 5. Specifically, the algorithm was tested for different values of N=3..16 and, for each value, it was run five times with different initialisation data. The output gaussian ellipses were manually labelled as "signal" and "noise" and these labels were compared to the ones that our method automatically provides. The outcome of the automatic method in each experiment was characterised as successful only when all the labels coincided with the manual ones. Table 1 illustrates the success rate for different values of N and the overall success rate, which is 85.7%. We have run the algorithm for a variety of scenes and datasets and in most cases the results are successful. If N is too large, then some of the zones may be multimodal. The algorithm fails to discard the stationary motion noise if its source is coincident with a real entry zone, or if its source is multimodal.

The dataset of Figure 7 was used to evaluate the route-learning algorithm. The algorithm was run for different values of its parameters. Because the algorithm we developed is incremental, we reordered the trajectory dataset to test for any dependency on the dataset order. In the majority of the experiments, the output of the route learning algorithm was consistent with a semantic interpretation of the routes given manually. For further validation, a test dataset of 100 previously unseen trajectories was labelled both manually and using the route matching algorithm with a specific set of routes, which is pictured in Figure 27. The results were compared, and our automatic method correctly classified 97 out of the 100 trajectories.
N        3    4    5    6    7    8    9   10   11   12   13   14   15   16   Overall
%      100  100  100  100  100  100   80   60   80   80   80   60   80   40   85.7

Table 1: Success rate of identifying entry zones from noise, for different values of N.
5 Database Design

Multi-camera surveillance systems can accumulate vast quantities of data over a short period of time when running continuously. In this paper we address several issues associated with data management, which include: how can object track data be stored in real-time in a surveillance database? How can we construct different data models to capture multiple levels of abstraction of the low-level tracking data, in order to represent the semantic regions in the surveillance scene? How can each of these data models support high-level video annotation and event detection for visual surveillance applications?
5.1 System Architecture

The surveillance system comprises a set of intelligent camera units (ICUs) that utilise vision algorithms for detecting and robustly tracking moving objects in 2D image coordinates. It is assumed that the viewpoint of each ICU is fixed and has been calibrated using a set of known 3D landmark points. Each ICU communicates with a central multi-view tracking server (MTS), which integrates all the information received in order to generate global 3D tracking information. Each individual object, along with its associated tracking details, is stored in the surveillance database.
Figure 12. System architecture of the surveillance system. (The diagram shows the intelligent camera network of units ICU 1..N connected over the LAN to receiver threads RT 1..N on the multi-view tracking server, which performs temporal alignment (NTP daemon), viewpoint integration and 3D tracking, and updates the PostgreSQL surveillance database. Offline calibration/learning processes (homography alignment, path learning, 3D world calibration) populate the image framelet, object motion and semantic layers and generate object summaries.)
The system architecture employs a centralised control strategy as shown in Figure 12. The multi-view tracking server (MTS) creates separate receiver threads (RT) to process the data transmitted by each intelligent camera unit (ICU) connected to the surveillance network. Each ICU transmits tracking data to its RT in the form of symbolic packets, and the system uses TCP/IP sockets to exchange data between each ICU and RT. Once the object tracking information has been received it is loaded into the surveillance database.
5.2 Data Abstraction and Representation

The key design consideration for the surveillance database was that it should be possible to support a range of low-level and high-level queries. At the lowest level it is necessary to access the raw video data in order to observe some object activity recorded by the system. At the highest level a user would execute database queries to identify various types of object activity observed by the system. In order to address each of these requirements we decided to use a multi-layered database design, where each layer represents a different abstraction of the original video data. The surveillance database comprises three layers of abstraction:

• Image framelet layer
• Object motion layer
• Semantic description layer

This three-layer hierarchy supports the requirements from real-time capture and storage of detected moving objects at the lowest level, to the online query of activity analysis at the highest level. Computer vision algorithms are employed to automatically acquire the information at each level of abstraction. The physical database is implemented using PostgreSQL running on a Linux server, which provides an efficient mechanism for the real-time storage of each object detected by the surveillance system. In addition to providing fast indexing and retrieval of data, the surveillance database can be customised to offer remote access via a graphical user interface and also to log each query submitted by each user.
5.2.1 Image Framelet Layer

The image framelet layer is the lowest level of representation of the raw pixels identified as a moving object by each camera in the surveillance network. Each camera view is fixed, and background subtraction is employed to detect moving objects of interest [22]. The raw image pixels identified as foreground objects are transmitted via a TCP/IP socket connection to the surveillance database for storage. This MPEG-4-like [5] coding strategy enables considerable savings in disk space, and allows efficient management of the video data. Typically, twenty-four hours of video data from six cameras can be condensed into only a few gigabytes of data. This compares to an uncompressed volume of approximately 4 terabytes for one day of video data in the current format we are using, representing a compression ratio of more than 1000:1.

In Figure 13 an example is shown of some objects stored in the image framelet layer. The images show the motion of two pedestrians as they move through the field of view of the camera. Information stored in the image framelet layer can be used to reconstruct the video sequence by plotting the framelets onto a background image. We have developed a software suite that uses this strategy for video playback and review, and to construct pseudo-synthetic sequences for performance analysis of tracking algorithms [2].
Figure 13. Example of objects stored in the image framelet layer.

Field Name     Description
Camera         The camera view
Videoseq       The identification of the captured video sequence
Frame          The frame where the object was detected
Trackid        The track number of the detected object
Bounding_box   The bounding box describing the region where the object was detected
Data           A reference to the raw image pixels of the detected object

Table 2. Attributes stored in the image framelet layer.
The main attributes stored in the image framelet layer are described in Table 2. An entry in the image framelet layer is created for each object detected by the system. It should be noted that additional information, such as the time when the object was detected, is stored in other underlying database tables. The raw image pixels associated with each detected object are stored internally in the database. The PostgreSQL database compresses the framelet data, which has the benefit of conserving disk space.
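The attributes of Table 2 map directly onto a relational table. The sketch below is a hedged illustration of what such a table could look like in PostgreSQL; the column types, and the use of bytea for the framelet pixels, are our assumptions rather than the exact schema used by the system.

-- Illustrative framelet table (column names follow Table 2; types assumed).
create table framelet (
    camera        integer not null,   -- camera view
    videoseq      integer not null,   -- captured video sequence id
    frame         integer not null,   -- frame number within the sequence
    trackid       integer not null,   -- track number of the detected object
    bounding_box  box     not null,   -- detected region in image coordinates
    data          bytea   not null    -- raw foreground pixels (compressed by PostgreSQL TOAST storage)
);

-- Example playback query: framelets of one tracked object, in frame order.
select frame, bounding_box, data
from framelet
where camera = 1 and videoseq = 87 and trackid = 3
order by frame;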
5.2.2 Object Motion Layer

The object motion layer is the second level in the hierarchy of abstraction. Each intelligent camera in the surveillance network employs a robust 2D tracking algorithm to record an object's movement within the field of view of each camera [23]. Features are extracted from each object including: bounding box, normalised colour components, object centroid, and the object pixel velocity. Information is integrated between cameras in the surveillance network by employing a 3D multi-view object tracker [1], which tracks objects between partially overlapping and non-overlapping camera views separated by a short spatial distance. Objects in overlapping views are matched using the ground plane constraint. A first order 3D Kalman filter is used to track the location and dynamic properties of each moving object. When an object moves between a pair of non-overlapping views we treat this scenario as a medium-term static occlusion, and use the prediction of the 3D Kalman filter to preserve the object identity when it reappears in the field of view of the adjacent camera. The 2D and 3D object tracking results are stored in the object motion layer of the surveillance database.

The object motion layer can be accessed to execute offline learning processes that can augment the object tracking process. For example, we use a set of 2D object trajectories to automatically recover the homography relations between each pair of overlapping cameras. The multi-view object tracker robustly matches objects between overlapping views by using these homography relations. The object motion and image framelet layers can also be combined in order to review the quality of the object tracking in both 2D and 3D. The key attributes stored in the object motion layer are described in Table 3 and Table 4.

In Figure 14 results from both the 2D tracking and the multi-view object tracker are illustrated. The six images represent the viewpoints of each camera in the surveillance network. Cameras 1 and 2, 3 and 4, and 5 and 6 have partially overlapping fields of view. It can be observed that the multi-view tracker has assigned the same identity to each object. Figure 15 shows the field of view of each camera plotted onto a common ground plane generated from a landmark-based camera calibration. 3D motion trajectories are also plotted on this map in order to allow the object activity to be visualised over the entire surveillance region.

Field Name     Description
Camera         The camera view
Videoseq       The identification of the captured video sequence
Frame          The frame where the object was detected
Trackid        The track number of the detected object
Bounding_box   The bounding box describing the tracked region of the object
Position       The 2D location of the object in the image
Appearance     The normalised colour components of the tracked object

Table 3. Attributes stored in the object motion layer (2D tracker).

Field Name     Description
Multivideoseq  The identification of the captured multi-view video sequence
Frame          The frame where the object was detected
Trackid        The track number of the detected object
Position       The 3D location of the tracked object in ground plane coordinates
Velocity       The velocity of the object

Table 4. Attributes stored in the object motion layer (multi-view tracker).
Figure 14. Camera network on University campus showing 6 cameras distributed around the building, numbered 1-6 from top left to bottom right, raster-scan fashion.
Figure 15. Re-projection of the camera views from Figure 14 onto a common ground plane, showing tracked object trajectories plotted into the views (white, red, blue and green trails).
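The 2D and 3D track records of Tables 3 and 4 can be held in two tables along the following lines. This is a hedged sketch with assumed names and types; in particular the use of the point type for positions and velocities, and of an array for the colour components, is illustrative rather than the system's actual definition.

-- Illustrative object motion layer tables (columns follow Tables 3 and 4).
create table track2d (
    camera        integer,
    videoseq      integer,
    frame         integer,
    trackid       integer,
    bounding_box  box,
    position      point,    -- 2D image location of the object centroid
    appearance    real[]    -- normalised colour components
);

create table track3d (
    multivideoseq integer,
    frame         integer,
    trackid       integer,
    position      point,    -- ground-plane location
    velocity      point     -- ground-plane velocity
);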
5.2.3 Semantic Description Layer

The object motion layer provides input to a machine-learning algorithm that automatically learns a semantic scene model, which contains both spatial and probabilistic information. Regions of activity can be labelled in each camera view, for example entry/exit zones, paths, routes and junctions, as was discussed in Section 4. These models can also be projected onto the ground plane, as is illustrated in Figure 16. These paths were constructed by using 3D object trajectories stored in the object motion layer. The green lines represent the hidden paths between cameras; these are automatically defined by linking entry and exit regions between adjacent non-overlapping camera views.

These semantic models enable high-level queries to be submitted to the database in order to detect various types of object activity. For example, we can generate spatial queries to identify any objects that have followed a specific path between an entry and an exit zone in the scene model. This allows any object trajectory to be compactly expressed in terms of the routes and paths stored in the semantic description layer.
Figure 16. Re-projection of routes onto ground plane
Field Name     Description
Camera         The camera view of the entry or exit zone
Zoneid         The identification of the entry or exit zone
Position       The 2D centroid of the entry or exit zone
Cov            The covariance of the entry or exit zone
Poly_zone      A polygonal approximation of the entry or exit zone

Table 5. Attributes stored in the semantic description layer (entry/exit zones).

Field Name     Description
Camera         The camera view of the route
Routeid        The identification of the route
Nodes          The number of nodes in the route
Poly_zone      A polygonal approximation of the envelope of the route

Table 6. Attributes stored in the semantic description layer (routes).

Field Name     Description
Camera         The camera view of the route node
Routeid        The identification of the route
Nodeid         The identification of the route node
Position       The central 2D position of the route node
Position_left  The left 2D position of the route node
Position_right The right 2D position of the route node
Stddev         The standard deviation of the gaussian distribution of object trajectories observed at the route node
Poly_zone      A polygonal representation of the region between this route node and its successor

Table 7. Attributes stored in the semantic description layer (route nodes).
The main properties stored in the semantic description layer are described in Table 5, Table 6, and Table 7. Each entry and exit zone is approximated by a polygon that represents the covariance of the region. Using this internal representation in the database simplifies spatial queries to determine when an object enters an entry or exit zone. The polygonal representation is also used to approximate the envelope of each route and route node, which reduces the complexity of the queries required for the online route classification that will be demonstrated in the next section. An example of the routes, route nodes, and entry and exit regions is shown in Figure 17. The black and white ellipses indicate entry and exit zones, respectively. Each route is represented by a sequence of nodes, where the blue points represent the main axis of each route, and the red points define the envelope of each route.
Figure 17. Example of routes (numbered 1-3), entry and exit zones stored in the semantic description layer.
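Tables 5-7 lend themselves to PostgreSQL's built-in geometric types, which is what makes spatial operators such as ?# and && (used in the queries later in this section) available directly in SQL. The definitions below are a hedged sketch; the names and types are illustrative, chosen to match the attribute tables and the query of Figure 19, and are not copied from the actual implementation.

-- Illustrative semantic description layer tables (columns follow Tables 5-7).
create table entryexitzones (
    camera    integer,
    zoneid    integer,
    position  point,     -- 2D centroid of the zone
    cov       real[],    -- covariance of the zone
    poly_zone polygon    -- polygonal approximation used in spatial queries
);

create table routes (
    camera    integer,
    routeid   integer,
    nodes     integer,   -- number of nodes in the route
    poly_zone polygon    -- polygonal envelope of the route
);

create table routenodes (
    camera         integer,
    routeid        integer,
    nodeid         integer,
    position       point,
    position_left  point,
    position_right point,
    stddev         real,
    polyzone       polygon  -- region between this node and its successor (named as in Figure 19)
);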
5.3 Metadata Generation

Metadata is data that describes data. The multi-layered database allows the video content to be annotated using an abstract representation. The key benefit of the metadata is that it can be queried much more efficiently for high-level activity queries than the low-level data. It is possible to generate metadata online when detected objects are stored in the image framelet and object motion layers. In Figure 18 the data flow is shown from the input video data to the metadata generated online. Initially, the video data and object trajectory are stored in the image framelet and object motion layers. The object motion history is then expressed in terms of the model stored in the semantic description layer to produce a high-level compact summary of the object's history. The metadata contains information for each detected object including: entry point, exit point, time of activity, appearance features, and the route taken through the FOV. This information is tagged to each object detected by the system. The key properties of the generated metadata are summarised in Table 8 and Table 9. Each tracked object trajectory is represented internally in the database as a path geometric primitive, which facilitates online route classification.

Field Name     Description
Videoseq       The identification of the captured video sequence in the image framelet layer
Trackid        The trackid of the object
EntryTime      The time when the object was first detected
ExitTime       The time when the object was last seen
EntryPosition  The 2D entry position of the object
ExitPosition   The 2D exit position of the object
Path           A sequence of points used to represent the object's 2D trajectory
Appearance     The average normalised colour components of the tracked object

Table 8. Metadata attributes generated (object_summary).

Field Name     Description
Videoseq       The identification of the captured video sequence in the image framelet layer
Trackid        The trackid of the object
Routeid        The identification of the route
EntryTime      The time the object entered the route
Entrynode      The first node the object entered along the route
EndTime        The time the object left the route
ExitNode       The last node the object entered along the route

Table 9. Metadata attributes generated (object_history).
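The two metadata tables could be declared as follows. This is a hedged sketch with assumed types; it follows the attribute names of Tables 8 and 9 and the statement above that the trajectory is held as a PostgreSQL path primitive, but it is not the exact schema of the deployed system.

-- Illustrative metadata tables (columns follow Tables 8 and 9; types assumed).
create table object_summary (
    videoseq      integer,
    trackid       integer,
    entrytime     timestamp,
    exittime      timestamp,
    entryposition point,
    exitposition  point,
    path          path,      -- 2D trajectory stored as a PostgreSQL path primitive
    appearance    real[]     -- average normalised colour components
);

create table object_history (
    videoseq   integer,
    trackid    integer,
    routeid    integer,
    entrytime  timestamp,
    entrynode  integer,
    endtime    timestamp,
    exitnode   integer
);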
Figure 19 illustrates how the database is used to perform route classification for two of the tracked object trajectories. Four routes stored in the semantic description layer of the database are shown in Figure 19(a). In this instance the first object trajectory is assigned to route 4, since this is the route with the largest number of intersecting nodes, and the second object trajectory is assigned to route 1. The corresponding SQL query used to classify routes is shown in Figure 19(b). Each node along the route is modelled as a polygon primitive provided by the PostgreSQL database engine. The query counts the number of route nodes that the object's trajectory intersects, which allows a level of discrimination between ambiguous choices for route classification. The '?#' operator in the SQL statement is a logical operator that returns true if the object trajectory intersects the polygon region of a route node. Additional processing of the query results allows the system to generate the history of the tracked object in terms of the route models stored in the semantic description layer. A summary of this information generated for the two displayed trajectories is given in Figure 19(c). It should be noted that if a tracked object traversed multiple routes during its lifetime then an entry would be created for each route visited.
Figure 18. Information flow for online metadata generation. (The image framelet and object motion layers, together with the semantic description layer, feed a generate-metadata process that produces the metadata.)
(b)
select routeid, count(nodeid)
from routenodes r, objects o
where camera = 2
  and o.path ?# r.polyzone
  and o.videoseq = 87
  and o.trackid in (1,3)
group by routeid

(c)
Videoseq   Trackid   Start Time   End Time   Route
87         1         08:16:16     08:16:27   4
87         3         08:16:31     08:16:53   1

Figure 19. (a) Example of route classification (routes numbered 1-4 in the camera view), (b) SQL query to find routes that intersect with an object trajectory, (c) the object history information for both of the tracked objects.
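The query in Figure 19(b) returns one node count per candidate route; the classification rule described above then keeps, for each trajectory, the route with the most intersected nodes. That final selection can also be pushed into the query itself. The sketch below reuses the table and column names from Figure 19 and relies on PostgreSQL's DISTINCT ON clause; it is an illustrative variant, not a query taken from the system.

-- For each tracked object, keep only the route with the largest node count.
select distinct on (o.trackid)
       o.trackid, r.routeid, count(r.nodeid) as intersected_nodes
from routenodes r, objects o
where r.camera = 2
  and o.path ?# r.polyzone
  and o.videoseq = 87
group by o.trackid, r.routeid
order by o.trackid, count(r.nodeid) desc;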
In Figure 20 the activity query used to identify objects moving between a specific entry and exit zone is shown. The query would return all objects entering the FOV at entry zone B and leaving the FOV at either exit zone A or exit zone C. The underlying database table 'object_summary' is metadata that provides an annotated summary of all the objects detected within the FOV. The '&&' symbol in the SQL statement is a geometric operator in PostgreSQL that determines if the object's entry or exit position lies within a given entry or exit region. The last clause of the activity query results in only objects detected between 10:30am and 11:00am on 24 May 2004 being returned in the output. This example demonstrates how the metadata can be used to generate spatio-temporal queries of object activity.

select o.videoseq, o.trackid, o.start_time
from object_summary o, entryzone e, exitzone x
where o.start_pos && e.region
  and o.end_pos && x.region
  and e.label = 'B'
  and x.label in ('A','C')
  and o.start_time between '2004-05-24 10:30' and '2004-05-24 11:00'

Figure 20. SQL query to retrieve all objects moving from entry zone B to exit zone A or C.
Figure 21. Sample of results returned by spatial temporal activity queries: objects moving from entry zone B to exit zone A, and objects moving from entry zone B to exit zone C.
The metadata provides better indexing of the object motion and image framelet layers of the database, which results in improved performance for various types of activity queries. The use of spatio-temporal SQL would allow a security operator to answer the following types of queries:

• Retrieve all objects moving within a restricted area over a specific time period
• Retrieve a list of all objects leaving a building within a specific time period
• Retrieve a list of all objects entering a building within a specific time period
• Show examples of two or more objects moving along a route simultaneously

The response times of spatio-temporal queries can be reduced from several minutes or hours to only a few seconds, because the metadata is much lighter to query than the original video. This would be of practical use for real-world applications where a security operator may be required to review and browse the content of large volumes of archived CCTV footage.
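The last query type in the list above can be expressed directly against the object_history metadata as a self-join over overlapping time intervals on the same route. This is a hedged sketch using the illustrative column names of Table 9, with the route entry and exit times assumed to be stored as timestamps:

-- Pairs of objects that were on the same route at overlapping times.
select a.trackid as first_object, b.trackid as second_object, a.routeid
from object_history a, object_history b
where a.routeid   = b.routeid
  and a.trackid   < b.trackid          -- avoid self-pairs and duplicate pairs
  and a.entrytime <= b.endtime
  and b.entrytime <= a.endtime;        -- the two intervals on the route overlap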
5.4 Network Bandwidth Requirements

In order to determine the feasibility of using a hierarchical database we have run the system continuously over a twenty-four hour period using a camera network consisting of six intelligent cameras. The amount of network traffic generated over 10 minute time intervals by the intelligent camera network is shown in Figure 22. The majority of the tracking data is generated by cameras 5 and 6, which overlook a road that has regular flows of vehicle traffic. The peak data transmission rate is at around 5pm, which is consistent with the time of rush hour traffic in London. The network traffic is much lower than that required to transmit the original video from six cameras over the network.
Figure 22. Plot of the network traffic generated by six intelligent cameras (tracking data transmission rates, in MB per 10 minutes, against time in hours).
The majority of the network traffic generated by the intelligent camera network is stored in the image framelet layer. This is expected, since the image framelet layer consists of the raw pixels identified as foreground objects. During the 24 hour period the network traffic relating to data stored in the image framelet layer of the database was 19GB, while the amount of network traffic related to data saved in the object motion layer was 163MB. It should be noted that in the current system the framelets are transmitted in the form of raw pixels; by employing an appropriate compression algorithm the required network bandwidth could be further reduced.
6 Conclusions and Further Work

We have described the use of a hierarchical database to store a variety of data extracted from a multi-camera visual surveillance system. In addition, the database serves a post-tracking analysis that generates semantic labels associated with primitive activities. These models are used to augment the database, and can subsequently be employed to support and enhance the online analysis operation. We are currently extending the system to link the semantic labels between both overlapping and adjacent, non-overlapping camera views to create a set of environment-wide scene models. One of the limitations of our approach is that a high-level representation can be derived only for trajectories that are consistent with the activity-based scene model and not for unusual trajectories; however, the mid-level and low-level representations can still be used for all trajectories.

In future work we plan to make the metadata MPEG-7 compliant [18]. MPEG-7 is a standard for video content description that is expressed in XML; it only describes content and is not concerned with the methods used to extract features and properties from the originating video. By adopting these standards we can future-proof our surveillance database and ensure compatibility with other content providers.
7 Acknowledgements

We would like to acknowledge support from the UK Engineering and Physical Sciences Research Council (EPSRC) under grant number GR/M58030. Thanks also to Ming Xu.
8 References
[1] Black J, Ellis TJ. Multi View Image Surveillance and Tracking. IEEE Workshop on Motion and Video Computing, Orlando, December 2002, 169-174.
[2] Black J, Ellis TJ, Rosin P. A Novel Method for Video Tracking Performance Evaluation. Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, October 2003.
[3] Black J, Ellis TJ, Makris D. A Hierarchical Database for Visual Surveillance Applications. IEEE International Conference on Multimedia and Expo (ICME 2004), Taipei, Taiwan, June 2004.
[4] Cai Q, Aggarwal JK. Tracking Human Motion in Structured Environments Using a Distributed Camera System. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 21, No. 11, November 1999, 1241-1247.
[5] Cai X, Ali F, Stipidis E. MPEG4 Over Local Area Mobile Surveillance System. IEE Intelligent Distributed Surveillance Systems, London, UK, February 2003.
[6] Chang T, Gong S. Tracking Multiple People with a Multi-Camera System. IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, Canada, July 2001, 19-28.
[7] Collins RT, Lipton AJ, Fujiyoshi H, Kanade T. Algorithms for Cooperative Multisensor Surveillance. Proceedings of the IEEE, Vol. 89, No. 10, October 2001, 1456-1476.
[8] Guler S, Liang WH, Pushee I. A Video Event Detection and Mining Framework. IEEE Workshop on Event Mining: Detection and Recognition of Events in Video, June 2003.
[9] Ivanov YA, Bobick AF. Recognition of Visual Activities and Interactions by Stochastic Parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 22, No. 8, August 2000, 852-872.
[10] Javed O, Shah M. KNIGHT: A Multi-Camera Surveillance System. IEEE International Conference on Multimedia and Expo (ICME 2003), Baltimore, July 2003.
[11] Katz B, Lin J, Stauffer C, Grimson E. Answering Questions about Moving Objects in Surveillance Videos. Proceedings of the 2003 Spring Symposium on New Directions in Question Answering, March 2003.
[12] Kogut GT, Trivedi MM. Maintaining the Identity of Multiple Vehicles as They Travel Through a Video Network. IEEE Workshop on Multi-Object Tracking (WOMOT01), Vancouver, Canada, July 2001, 29-34.
[13] Lee L, Romano R, Stein G. Monitoring Activities from Multiple Video Streams: Establishing a Common Coordinate Frame. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 22, No. 8, August 2000, 758-767.
[14] Makris D, Ellis T. Finding Paths in Video Sequences. British Machine Vision Conference (BMVC 2001), University of Manchester, UK, Vol. 1, September 2001, 263-272.
[15] Makris D, Ellis T. Path Detection in Video Surveillance. Image and Vision Computing Journal, Elsevier, Vol. 20, October 2002, 895-903.
[16] Makris D, Ellis T. Spatial and Probabilistic Modelling of Pedestrian Behaviour. British Machine Vision Conference (BMVC 2002), Cardiff University, UK, Vol. 2, September 2002, 557-566.
[17] Makris D, Ellis T. Automatic Learning of an Activity-Based Semantic Scene Model. IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2003), Miami, July 2003.
[18] Martinez JM, Koenen R, Pereira F. MPEG-7: The Generic Multimedia Content Description Standard, Part 1. IEEE Multimedia, Vol. 9, No. 2, April-June 2002, 78-87.
[19] Medioni G, Cohen I, Bremond F, Hongeng S, Nevatia R. Event Detection and Analysis from Video Streams. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 23, No. 8, August 2001, 873-889.
[20] Trivedi M, Bhonsel S, Gupta A. Database Architecture for Autonomous Transportation Agents for On-scene Networked Incident Management (ATON). International Conference on Pattern Recognition (ICPR), Barcelona, Spain, 2000, 4664-4667.
[21] Remagnino P, Jones GA. Classifying Surveillance Events from Attributes and Behaviour. British Machine Vision Conference (BMVC 2001), Manchester, September 2001, 685-694.
[22] Xu M, Ellis TJ. Illumination-Invariant Motion Detection Using Color Mixture Models. British Machine Vision Conference (BMVC 2001), Manchester, September 2001, 163-172.
[23] Xu M, Ellis TJ. Partial Observation vs Blind Tracking through Occlusion. British Machine Vision Conference (BMVC 2002), Cardiff, September 2002, 777-786.