Finding Camera Overlap in Large Surveillance Networks

Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, and Rhys Hill
School of Computer Science, University of Adelaide, Adelaide, 5005, Australia
{anton,ard,henry,alex,rhys}@cs.adelaide.edu.au

Abstract. Recent research on video surveillance across multiple cameras has typically focused on camera networks of the order of 10 cameras. In this paper we argue that existing systems do not scale to a network of hundreds, or thousands, of cameras. We describe the design and deployment of an algorithm called exclusion that is specifically aimed at finding correspondence between regions in cameras for large camera networks. The information recovered by exclusion can be used as the basis for other surveillance tasks such as tracking people through the network, or as an aid to human inspection. We have run this algorithm on a campus network of over 100 cameras, and report on its performance and accuracy over this network.

1 Introduction

Manual inspection is an inefficient and unreliable way to monitor large surveillance networks (see Figure 1 for example), particularly when coordination across observations from multiple cameras is required. In response to this, several systems have been developed to automate inspection tasks that span multiple cameras, such as following a moving target, or grouping together related cameras. A key part of any multi-camera surveillance system is to understand the spatial relationships between cameras in the network. In early surveillance systems, this information was manually specified or derived from camera calibration, but recent systems at least partly automate the process by analysing video from the cameras. These systems are demonstrated on networks containing of the order of 10 cameras, but have requirements that mean they do not scale well to networks an order of magnitude larger. For example: [1] requires manually marked correspondences between images; [2] requires a training stage where only one object is observed; and [3,4,5] require many correct detections of objects as they appear and disappear from cameras over a long period of time.

An important step towards recovering spatial camera layout is to determine where cameras overlap. The approach taken in [6] is to estimate motion trajectories for people walking on a plane, and then match trajectories between cameras. However, this assumes planar motion, and accurate tracking over long periods of time.


Fig. 1. Snapshot of video feeds from the network. Some cameras are offline.

It also does not scale well, since track matching complexity increases as O(n^2) with the number of cameras n. In [7] evidence for overlap is accumulated by estimating the boundary of each camera's field of view in all other cameras. Again, this does not scale well to large numbers of cameras, and assumes that all cameras overlap.

Because they start with an assumption of non-connectedness, and gradually accumulate evidence for connections, most methods for determining spatial layout rely on accurately detecting and/or tracking objects over a long time period. They also require comparisons to be made between every pair of cameras in a network. The number of pairs of cameras grows with the square of the number of cameras in the network, rendering exhaustive comparisons infeasible.

This paper describes the implementation of a method called exclusion for determining camera overlap that is designed to quickly home in on cameras that may overlap. The method is computationally fast, and does not rely on accurate tracking of objects within each camera view. In contrast to most existing methods, it does not attempt to build up evidence for camera overlap over time. Instead, it starts by assuming all cameras are connected and uses observed activity to rule out connections over time. This is an easier decision to make, especially when a limited amount of data is available. It is also based on the observation that it is impossible to prove a positive connection between cameras (any correlation of events could be coincidence), whereas it is possible to prove a negative connection by observing an object in one camera while not observing it at all in another.

2 The Exclusion Algorithm

Consider a set of c cameras that generates c images at time t. By applying foreground detection [8] to all images we obtain a set of foreground blobs, each of which can be summarised by an image position and camera index. Each image is partitioned into a grid of windows, and each window can be labelled "occupied" or "unoccupied" depending on whether it contains a foreground object. Exclusion is based on the observation that a window which is occupied at time t cannot be an image of the same area as any other window that is simultaneously unoccupied. Given that windows tend to be unoccupied more often than they are occupied, this observation can be used to eliminate a large number of window pairs as potentially viewing the same area. The process of elimination can be repeated for each frame of video to rapidly reduce the number of pairs of image windows that could possibly be connected. This is the opposite of most previous approaches: rather than accumulate positive information over time about links between windows, we seek negative information allowing the instant elimination of impossible connections. Such connections are referred to as having been excluded [9].
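The elimination step can be made concrete with a short sketch. This is not the authors' code; the window identifiers and the set representation are our own choices, but the logic of ruling out pairs from a single frame is as described above:

```python
# Illustrative sketch of one exclusion step: windows are identified by
# (camera_index, window_index), and 'occupied' is the set of windows that
# contain a foreground detection in the current frame.

def eliminate_pairs(candidate_pairs, occupied, all_windows):
    """Remove window pairs that cannot view the same area, given this frame.

    A window that is occupied at time t cannot overlap a window that is
    simultaneously unoccupied, so any such pair is excluded.
    """
    unoccupied = all_windows - occupied
    return {
        (a, b) for (a, b) in candidate_pairs
        if not ((a in occupied and b in unoccupied) or
                (b in occupied and a in unoccupied))
    }

# Exclusion starts from the assumption that every pair may overlap and prunes
# the candidate set frame by frame:
#   candidates = {(a, b) for a in all_windows for b in all_windows if a < b}
#   for frame in video:
#       candidates = eliminate_pairs(candidates, detect(frame), all_windows)
```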

2.1 Exclusion over Multiple Timesteps

Rather than calculate exclusion separately at each timestep, it is more efficient to gather occupancy information over multiple frames and then calculate exclusion over all of them at once. Let the set of windows over all cameras be W = {w1, ..., wn}. Corresponding to each window wi is an occupancy vector oi = (oi1, ..., oiT) with oit set to 1 if window wi is occupied at time t, and 0 if not. If two windows are images of exactly the same region in the world, we would expect their corresponding occupancy vectors to match exactly. This can be tested by applying the exclusive-or operator ⊕ to the elements of the occupancy vectors:

$$a \oplus b = \max_{k=1}^{K} \, (a_k \oplus b_k)$$

It can be inferred that two windows wi and wj do not overlap if oi ⊕ oj = 1. This comparison is very fast to compute, even for long vectors.
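As a minimal illustration of this test (the variable names are ours, not the paper's), occupancy vectors can be stored as 0/1 arrays and compared with an elementwise exclusive-or:

```python
import numpy as np

def excluded(o_i, o_j):
    """True if windows i and j cannot overlap, i.e. max_k (o_ik XOR o_jk) = 1."""
    return bool(np.bitwise_xor(o_i, o_j).max())

o_1 = np.array([0, 1, 0, 0, 1], dtype=np.uint8)
o_2 = np.array([0, 1, 0, 0, 1], dtype=np.uint8)  # identical occupancy: not excluded
o_3 = np.array([0, 0, 0, 1, 1], dtype=np.uint8)  # disagrees at t=1 and t=3: excluded

print(excluded(o_1, o_2))  # False
print(excluded(o_1, o_3))  # True
```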

2.2 Exclusion with Tolerance

Exclusion as described so far assumes that:
1. corresponding windows in overlapping cameras cover exactly the same visible area in the scene,
2. all cameras are synchronised, so they capture frames at exactly the same time, and
3. the foreground detection module never produces false positives or false negatives.


In reality none of these assumptions is likely to hold completely. It is thus possible that two overlapping windows might simultaneously register as occupied and vacant, and therefore that the exclusive-or of the corresponding occupancy vectors might incorrectly indicate that they do not overlap.

Assumptions 1 and 2 can be relaxed by including the neighbours of a particular window when registering its occupancy. We use a padded occupancy vector pi which has element pit set to 1 when window wi or any of its neighbours is occupied at time t. A more robust mechanism for determining whether two windows wi and wj overlap is thus to calculate oi ⊖ pj on the basis of the occupancy vector oi and the padded occupancy vector pj. The ⊖ operator is a uni-directional version of the exclusive-or, defined such that

$$a \ominus b = \max_{k=1}^{K} \, (a_k \ominus b_k) \qquad (1)$$

where $a_k \ominus b_k$ is 1 if and only if ak is 1 and bk is 0. Note that this means the exclusion calculation is no longer symmetric.

To account for detection errors (assumption 3), we calculate exclusion based on accumulated results over multiple tests, rather than relying on a single contradictory observation. Assuming that the detector has a constant failure rate, the evidence for exclusion is directly related to the number of contradictory observations in a fixed time period t = 1, ..., T [9], which we call the exclusion count:

$$E_{ij} = \sum_{t=1}^{T} o_{it} \ominus p_{jt} \qquad (2)$$
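A sketch of the tolerant test in Equations (1) and (2) follows; the array layout, the 3 × 3 neighbourhood used for padding, and the SciPy call are our own assumptions rather than the paper's implementation:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def padded_occupancy(occ_grid):
    """occ_grid: (T, rows, cols) 0/1 array of per-window occupancy for one camera.
    Element (t, y, x) of the result is 1 if that window or any of its
    8-neighbours is occupied at time t (the padded occupancy p)."""
    return maximum_filter(occ_grid, size=(1, 3, 3))

def unidirectional_exclusion(o_i, p_j):
    """Elementwise version of Eq. (1): 1 iff o_i is 1 and p_j is 0."""
    return o_i & (1 - p_j)

def exclusion_count(o_i, p_j):
    """E_ij of Eq. (2): number of contradictory observations in the interval."""
    return int(unidirectional_exclusion(o_i, p_j).sum())
```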

2.3 Normalised Exclusion

The exclusion count has two main shortcomings as a measure for deciding window overlap/non-overlap:
– As the operator a ⊖ b can only return true when a is true, the exclusion count Eij between windows wi and wj is bounded by the number of detections in wi, and is likely to be higher for windows wi that register more detections.
– In a large network, it will frequently occur that data sent from a camera will be lost, or will not arrive in time to be included in the exclusion calculation, or that a camera will go offline. Thus the maximum value of Eij also depends on how often data from wj is available.

To address these problems we define a padded availability vector v for each window that is set to 1 when occupancy data for the window and its neighbours is available, and 0 otherwise. We can then define an exclusion opportunity count between each pair of windows:

$$O_{ij} = \sum_{t=1}^{T} o_{it} \wedge v_{jt} \qquad (3)$$


Based on this we define an overlap certainty measure from each window with opportunity count at least 1 to every other window:

$$C_{ij} = \frac{O_{ij} - E_{ij}}{O_{ij}} \qquad (4)$$

which measures the number of times that an exclusion was not found between wi and wj as a proportion of the number of times an exclusion could possibly have been found given the available data. In general, exclusion estimates for windows that are only occupied a small number of times are dominated by noise such as erroneous detection. We therefore include a penalty term for such windows:

$$C'_{ij} = C_{ij} \times \min\left(1, \frac{\log(O_{ij})}{\log(O_{\mathrm{ref}})}\right) \qquad (5)$$

where Oref is a number of detections empirically determined to result in reliable exclusion calculation. We set this to 20 in our experiments.
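The normalisation can be sketched as follows for a single window pair; the variable names and the guard on the opportunity count are ours, not the authors':

```python
import numpy as np

O_REF = 20  # reference detection count used in the penalty term (Eq. 5)

def opportunity_count(o_i, v_j):
    """O_ij of Eq. (3): timesteps where window i is occupied and padded
    availability data for window j is present."""
    return int(np.sum(o_i & v_j))

def overlap_certainty(E_ij, O_ij):
    """C'_ij of Eqs. (4)-(5): fraction of opportunities that produced no
    exclusion, down-weighted when few opportunities have been observed."""
    if O_ij < 1:
        return 0.0  # no opportunities yet: no evidence either way (our choice)
    C = (O_ij - E_ij) / O_ij
    penalty = min(1.0, np.log(O_ij) / np.log(O_REF))
    return C * penalty
```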

3 Implementing Exclusion

In this section we describe how the exclusion algorithm is implemented in order to find overlap in a large network of cameras. This is done in two steps:
– Object detection (Section 3.1): after each frame is captured, it is processed to detect objects within it. These detections are then converted to occupancy data for each window and sent to a central server. The main challenge for large camera networks is to detect objects quickly and reliably.
– Exclusion calculation/update (Sections 3.2 and 3.3): at regular intervals of the order of several seconds, the stored occupancy data is used to calculate exclusion between each window pair. This exclusion result is then merged with exclusion results from earlier time periods, resulting in an updated estimate of camera overlap. The main challenges here are to synchronise data from different cameras and to mitigate the memory requirements of exclusion data.

3.1 Distributed Foreground Detection

We detect foreground objects within each camera image using the Stauffer and Grimson background subtraction method [8]. To derive a single position from a foreground blob, we use connected components and take the midpoint of the low edge of the bounding box of each blob. This corresponds approximately to the lowest visible extent of the object in the image, assuming that the camera is approximately upright. Foreground detection is the most computationally intensive part of exclusion, but is also the stage that is easiest to parallelise. Presently, cameras are assigned to one of several processors that perform background subtraction on each image they capture. Eventually, though, we aim to implement detection on the cameras themselves.
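The sketch below shows one way the detection stage could be written. It uses OpenCV's MOG2 background subtractor as a stand-in for the Stauffer and Grimson model of [8], and the grid size, blob-area threshold, and shadow handling are our assumptions, not the authors' settings:

```python
import cv2
import numpy as np

GRID_ROWS, GRID_COLS = 9, 12  # window grid used in the experiments of Section 4
subtractor = cv2.createBackgroundSubtractorMOG2()

def frame_occupancy(frame):
    """Return a (GRID_ROWS, GRID_COLS) 0/1 occupancy grid for one frame."""
    h, w = frame.shape[:2]
    mask = subtractor.apply(frame)
    mask = (mask == 255).astype(np.uint8)       # keep foreground, drop shadow pixels
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    occupancy = np.zeros((GRID_ROWS, GRID_COLS), dtype=np.uint8)
    for i in range(1, n):                       # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area < 50:                           # ignore tiny blobs (assumed threshold)
            continue
        # midpoint of the lower edge of the bounding box: the lowest visible
        # extent of the object, assuming an approximately upright camera
        foot_x, foot_y = x + bw // 2, y + bh - 1
        row = min(GRID_ROWS - 1, foot_y * GRID_ROWS // h)
        col = min(GRID_COLS - 1, foot_x * GRID_COLS // w)
        occupancy[row, col] = 1
    return occupancy
```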


3.2 Calculating Exclusion

Each occupancy result is tagged with the timestamp of the frame of video on which it is based and sent to a central server. After a fixed time interval, typically several seconds, these results are assembled to form an occupancy vector oi for each window wi. Each element of oi is indexed by a time offset t within the time interval, and can take one of three values:
– oit = 2 if no occupancy data is available for wi within the time interval [t − t̂, t + t̂)
– oit = 1 if wi is occupied within the time interval [t − t̂, t + t̂)
– oit = 0 if wi is not occupied within the time interval [t − t̂, t + t̂)
where t̂ is a tolerance to account for inaccuracies in camera synchronisation. These occupancy vectors are then used to calculate exclusion and opportunity counts for each window pair within the time interval, as described in Sections 2.2 and 2.3. These counts are then added to counts from previous time intervals, giving an updated estimate of exclusion confidence for each window pair.
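The assembly step might look like the following; the report format, tolerance handling, and sentinel values are our own illustration of the three-valued scheme above:

```python
NO_DATA, UNOCCUPIED, OCCUPIED = 2, 0, 1

def build_occupancy_vector(reports, interval_times, tolerance):
    """reports: list of (timestamp, occupied) pairs received for one window.
    interval_times: the frame times t making up the current interval.
    Returns o_i with each element in {0, 1, 2} as defined above."""
    vector = []
    for t in interval_times:
        near = [occ for (ts, occ) in reports if t - tolerance <= ts < t + tolerance]
        if not near:
            vector.append(NO_DATA)
        elif any(near):
            vector.append(OCCUPIED)
        else:
            vector.append(UNOCCUPIED)
    return vector
```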

3.3 Exclusion Data Compression

The central server stores both an exclusion count Eij and an exclusion opportunity count Oij for each pair of windows. Both counts are stored as a byte. This means that for a network of 100 cameras, each containing a 10 × 10 window grid, the counts require approximately 2 × 10^8 bytes of storage.

Initially, Eij = 0 for all i and j. Consider how the exclusion counts are affected when a single person is observed in one window wD, and no other person is detected across the network. This will result in the exclusion count EDj being incremented for all windows j ≠ D in the network. If the exclusion counts are stored in a matrix whose (i, j)-th element is the exclusion count between wi and wj, this results in an entire row of the matrix being incremented. Situations similar to this are quite common and suggest that a run length encoding scheme could effectively compress the matrix.

Similarly, exclusion opportunity counts Oij are initially 0 for all i and j. Like exclusion counts, neighbouring opportunity counts are likely to be incremented at identical times, since an increment to Oij requires that wi is occupied and all data in the neighbourhood of wj is available. Again, this suggests the use of a run length encoding scheme to store exclusion opportunity data.
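A minimal run-length encoding sketch illustrates why such rows compress well; the paper does not specify its exact encoding, so this is only an illustration:

```python
def rle_encode(row):
    """Encode a row of byte counters as [value, run_length] pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def rle_increment_all(runs):
    """Increment every counter in a row, saturating at 255: this is what a lone
    detection in window w_D does to row D, and it preserves the run structure."""
    return [[min(value + 1, 255), length] for value, length in runs]

# A mostly-constant row compresses to a handful of runs, and the whole-row
# increment keeps the number of runs unchanged:
print(rle_increment_all(rle_encode([0, 0, 0, 5, 5, 1])))  # [[1, 3], [6, 2], [2, 1]]
```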

4 Testing Exclusion

We tested our exclusion implementation on a network containing 100 Axis IP cameras, distributed across a university campus. Frames are captured from each camera as JPEG compressed 320 × 240 images using the Axis API. Each frame is divided into a 9 × 12 grid of windows, for a total of 10800 windows across the network. As previously mentioned, the computational cost of foreground detection over a large number of


cameras far outweighs that of exclusion. This coarse level of foreground detection is well suited to implementation on board a camera, but for the purposes of testing a cluster of 16 dual core Opteron PCs has been used to process the footage from the 100 cameras in real time. By contrast, the central server, where occupancy results are assembled and exclusion is calculated, is a single desktop PC (Dell Dimension 4700, 3.2GHz Pentium 4, 1GB memory).

4.1 Performance Testing

We first test how the performance of exclusion scales, both over long time periods and large numbers of cameras. It was found that due to the optimisations described previously, the performance of exclusion does not depend strongly on the number of cameras on the network. Rather, it depends on the amount of activity in the network. Thus we observed the performance of exclusion during high and low activity periods, over a period of one hour.

The memory required by exclusion increases over time, as shown in Figure 2. This is largely due to the decreased effectiveness of RLE compression of the exclusion counts (EC) as more activity is observed. The opportunity counts (EOC) are still well compressed by RLE after one hour, as camera availability changes rarely during this time. However, notice that the increase in EC elements, and corresponding increase in memory usage, is less than linear. Even after an hour of observation, only 29.56MB of memory is being used, compared to over 200MB that would be required to store the uncompressed data.

Figure 3 shows the time taken to calculate exclusion at intervals over the one hour period. Notice that the time to compute exclusion remains fairly constant over the time period, and is consistently faster than real time, even using a standard desktop PC. In fact the exclusion is calculated for the hour's footage using less than 13 minutes of processor time.

Fig. 2. Memory usage over one hour of processing. The exclusion element count (EC) shows how the RLE compression becomes less effective over time. (Plot: frame time in minutes against memory use in MB and number of elements; series: Data Size, Allocated Size, Compressed EC Elements, Compressed EOC Elements.)


Fig. 3. Timing information for one hour of processing. The time required to process each frame remains approximately constant over time, although it increases slightly during periods of higher activity. Exclusion for 100 cameras is consistently calculated at over 4 times real time on a desktop PC, and an hour’s video takes under 13 minutes to process.

It is also evident that the time taken to calculate each exclusion count does depend on the amount of activity, measured by the number of occupancies detected per available camera. This can be seen by the slight increase in "Avg Occupancy Count per Camera" between about 30 and 50 minutes, and the corresponding decrease in "Speed Relative To Real Time".

4.2 Ground Truth Verification

It is difficult to verify that exclusion captures all overlap in a large camera network, and excludes all non-overlap. For example, Figure 1 shows a set of images captured from across the network at one moment. After some exclusion processing, the grid is rearranged to group together related cameras as shown in Figure 4. Connections are drawn between a window pair when the overlap certainty measure (Equation 5) exceeds a threshold C*. The link must pass the threshold in both directions for the connection to be established, i.e. a link is drawn between wi and wj if and only if C'ij > C* and C'ji > C*. In our experiments we set C* = 0.8.

To verify the exclusion results we manually inspected the groups that were found. Close up views of some groups can be seen in Figure 5. It can be seen that overlap has been detected correctly in a variety of cases despite widely differing viewpoints and lighting conditions.
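Forming links from the certainty measure can be sketched as below; C_prime is assumed to be a square matrix holding C'ij for all window pairs (our own representation, not the paper's):

```python
import numpy as np

C_STAR = 0.8  # threshold used in the experiments

def window_links(C_prime, threshold=C_STAR):
    """Return window pairs (i, j), i < j, whose certainty exceeds the
    threshold in both directions."""
    both = (C_prime > threshold) & (C_prime.T > threshold)
    i_idx, j_idx = np.nonzero(np.triu(both, k=1))
    return list(zip(i_idx.tolist(), j_idx.tolist()))
```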


Fig. 4. Video feeds from Figure 1 after running exclusion on one hour of footage. The cameras are arranged on screen so that related cameras are near each other, to aid human inspection.

Fig. 5. Overlapping groups detected by exclusion

These are correspondences that would be very difficult to detect by tracking people and attempting to build up correlations between tracks: the lighting conditions are often very poor, and the size of people in each camera varies greatly. Figure 4 also includes four camera groups that have been erroneously linked. Each of these groups has only one or two links between windows in each image, and views low traffic areas. These errors would thus disappear as more traffic is viewed. Until enough traffic has been seen to correct these groups, a filter can be applied that only links cameras when more than one window in each camera is linked. Alternatively, a human operator can sever the links manually.
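Such a filter could be sketched as follows, with the mapping from windows to cameras and the minimum pair count as assumed inputs:

```python
from collections import Counter

def camera_links(window_pairs, window_to_camera, min_pairs=2):
    """Link two cameras only when more than one window pair connects them."""
    counts = Counter()
    for i, j in window_pairs:
        a, b = window_to_camera[i], window_to_camera[j]
        if a != b:
            counts[tuple(sorted((a, b)))] += 1
    return [pair for pair, n in counts.items() if n >= min_pairs]
```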


Some camera overlap was not detected because of low traffic during the hour that footage was captured. However, all overlap between cameras monitoring areas with enough detections (relative to Oref ) to calculate exclusion was correctly determined. This leads us to believe that remaining overlap can be detected when the system is run over a longer time period.

5 Conclusion

This paper describes a method for automatically determining camera overlap in large surveillance networks. The method is based on the process of eliminating impossible connections rather than the slower process of building up positive evidence of activity. We describe our implementation of the method, and show that it runs faster than real time on an hour of footage from a 100 camera network, using a single desktop PC. Future work includes testing the system over a period of several days, adding more cameras to the network, and implementing a more efficient foreground detector.

References

1. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: ICCV 2003, pp. 952–957 (2003)
2. Dick, A.R., Brooks, M.J.: A stochastic approach to tracking objects across multiple cameras. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 160–170. Springer, Heidelberg (2004)
3. Ellis, T.J., Makris, D., Black, J.K.: Learning a multi-camera topology. In: Joint IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pp. 165–171. IEEE Computer Society Press, Los Alamitos (2003)
4. Stauffer, C.: Learning to track objects through unobserved regions. In: IEEE Computer Society Workshop on Motion and Video Computing, pp. 96–102. IEEE Computer Society Press, Los Alamitos (2005)
5. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: ICCV 2005, pp. 1842–1849 (2005)
6. Lee, L., Romano, R., Stein, G.: Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 758–767 (2000)
7. Khan, S., Javed, O., Rasheed, Z., Shah, M.: Human tracking in multiple cameras. In: IEEE International Conference on Computer Vision, pp. 331–336 (2001)
8. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000)
9. van den Hengel, A., Dick, A., Hill, R.: Activity topology estimation for large networks of cameras. In: Proc. IEEE International Conference on Video and Signal Based Surveillance (AVSS 2006), pp. 44–49. IEEE Computer Society Press, Los Alamitos (2006)