To be presented at the ASCI conference, Lochem, The Netherlands, June 19-21 2002
Object Detection and Tracking Using a Likelihood Based Approach

Paul Withagen 1,2    Klamer Schutte 1    Frans Groen 2

1 TNO Physics and Electronics Laboratory, P.O. Box 96864, 2509 JG The Hague, The Netherlands
2 IAS group, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

[email protected], [email protected], [email protected]
Keywords: Background Modeling, Expectation Maximization, Detection, Tracking
Abstract

Many surveillance algorithms use both background modeling to detect moving objects and object tracking to analyze the motion patterns of the detected objects. In our case, Expectation Maximization (EM) is used to model the background and detect moving objects, while tracking is based on the object's color histogram. Using EM we can calculate the probability that a pixel value belongs to the background. Simultaneously, we use the color histogram of an object as a feature for tracking it, which allows us to calculate the probability that the pixel belongs to the object. In this paper we report the integration of background modeling using EM and object tracking using color histograms. The classification of pixels between background and objects is based on these probabilities. We show the advantages for both the object detection part and the tracking part.
1 Introduction
In recent years, a lot of work has been done in the area of video surveillance (see for example [1, 2, 3]). Because of the difficult circumstances in which the applications are often used, the problem remains a challenge. Changing illumination, waving trees, water, scene changes and shadows are typical complications. The objects under surveillance are not the easiest thinkable either: people are not at all rigid bodies, and their movements are in many cases hard to predict. People change direction, speed up, stop, interact, come together and move away from each other. In the process they occlude each other, leave the scene to appear at another location, etc. That is why the “perfect” algorithm has not been found yet, and will not be found soon either.
A surveillance application usually consists of some sort of moving object detection, object tracking, and higher order processing, such as the detection of someone entering a prohibited area, face recognition for identification, or gesture recognition to determine what a person is doing. This paper concentrates on the two low-level steps: moving object detection and object tracking.

In many surveillance applications, background modeling is used to detect moving objects. Toyama compares a number of these algorithms in [2]. One of the problems with most algorithms is the need for an empty scene for initialization. Often this is hard to obtain, and each time something changes in the scene, the initialization needs to be redone. The use of a self-adapting system makes this initialization redundant. This can be done by using Expectation Maximization (EM, [4]) to estimate a mixture of Gaussian kernels for each pixel. For practical applications, an online version of the EM algorithm ([5]) should be used. This approach has become more common in recent years because the increasing speed of computers makes it feasible for real-time implementations [6, 7, 8]. The algorithm models the background by creating a model of the appearance of colors for each pixel over time, described by a mixture of Gaussian kernels. Each time a new frame is recorded, the new color value of each pixel is compared to the mixture model, and in this way the likelihood that the pixel is background can be calculated. If all objects were known at this point, one could also calculate the probability for background, object 1, object 2, ..., object N, but this does not allow for the appearance of new objects.

Another way to use the EM algorithm for object detection is to create one feature space, with one feature vector per pixel, and use the EM algorithm to find clusters in this space, which should then relate to objects. See for example [9], or [3], where not only color values but also their spatial locations were used as features. A disadvantage of this approach is that objects consisting of different colors will not be found as one object.

By thresholding the likelihood image produced by the background estimation, foreground objects can be detected. The detected objects then need to be tracked. This is possible by using a Kalman filter [10] on their positions, and letting the Kalman filter solve the correspondence problem between frames. For tracking humans, this alone is not a good solution, because they easily change direction and speed, and tend to interact with each other. That is why additional features are necessary to keep the different objects separated. Two frequently used features are shape and color. We choose to use the color histogram of the object as feature. The use of color histograms for object location was introduced in [11] and is frequently used to find objects in image archives. Histogram-based object recognition has the advantage that it is invariant to limited deformations of the object itself. It also allows for an efficient algorithm: the histogram-based “template matching” algorithm (in the remainder of this paper called histogram matching) does not use least squares to compare the image and the template, but compares the two histograms. It can be implemented very efficiently because no spatial information is incorporated in the model; when the region of interest is shifted over the image, for each new shift position only the pixels which “shift in” and those which “shift out” need to be processed to find the new match error at that position.

Section 2 shows how the tracking of objects results in a likelihood whether a certain pixel belongs to a certain object. Using these likelihoods, and the likelihood for background, all pixels can be classified without the use of thresholds. The specifics of our implementation will be explained in section 3. In section 4 we describe some experiments performed with the algorithm and finally, conclusions are given in section 5.
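To make the per-pixel mixture model concrete, the sketch below shows how such a model could compute a background likelihood and be updated online. It is a minimal illustrative sketch, not the exact scheme of [8]: it covers a single 1D channel (e.g. normalized r, as used in section 3) with K = 4 kernels as in section 3; the learning rate and the responsibility-weighted update rule are our own assumptions.

```python
import numpy as np

K = 4          # kernels per channel (section 3)
ALPHA = 0.01   # online learning rate (assumed value)

class PixelMixture:
    """Mixture of K 1D Gaussian kernels for one color channel of one pixel."""

    def __init__(self):
        self.pi = np.full(K, 1.0 / K)        # kernel priors
        self.mu = np.linspace(0.1, 0.9, K)   # kernel means (arbitrary init)
        self.var = np.full(K, 0.01)          # kernel variances

    def _densities(self, c):
        """Gaussian density of value c under each kernel."""
        return (np.exp(-0.5 * (c - self.mu) ** 2 / self.var)
                / np.sqrt(2.0 * np.pi * self.var))

    def likelihood(self, c):
        """Mixture likelihood of observing channel value c at this pixel."""
        return float(np.sum(self.pi * self._densities(c)))

    def update(self, c):
        """Online EM-style step; unlike Stauffer and Grimson, every kernel
        is updated each frame, weighted by its responsibility (cf. [8, 13])."""
        resp = self.pi * self._densities(c)
        resp /= resp.sum() + 1e-12           # responsibilities of each kernel
        self.pi = (1.0 - ALPHA) * self.pi + ALPHA * resp
        rho = ALPHA * resp
        self.mu += rho * (c - self.mu)
        self.var += rho * ((c - self.mu) ** 2 - self.var)
        self.var = np.maximum(self.var, 1e-6)  # keep variances positive
```

A pixel is then likely to be background when `likelihood(c)` is high; section 2 describes how this value is compared against the object likelihoods.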
2 Likelihood Integration
As a result of the background modeling, for each pixel we get a prediction of how likely it is that the pixel is background. Provided that we also have a likelihood for the pixel being object, no thresholds are necessary in the classification of the pixels between background and any of the objects: we assign each pixel to the class with the highest likelihood.

Looking at the physics of the objects we want to track, we see that between two frames objects can only move a certain distance, and their shape can also only change a little. Thus the core of the object (by the word core we mean the entire object except for a shell of certain thickness around the boundary) will not change much by deformation of the object, and because of the limited speed of the objects, it will not have a major change in its amount of occlusion. So it is relatively safe to get the position of the core of the object by histogram matching. At the new location, all pixels which lie in the core of the object are almost certain to belong to the object, so their likelihood is close to one, depending on the match error. For the pixels near the boundary of the object, this does not hold. The likelihood that they belong to the object is determined by three factors:

• How far does the pixel lie from the edge of the core of the object?
• What is the likelihood ratio of pixels with this color, based on the histogram of the object?
• What is, given the pixels we already have in the core of the object, the need for a pixel with this color to make the histogram of the entire object the same as in the previous frame?

The first item can easily be calculated by a distance transform [12] of the core of the object (see figure 1(f)). The second item is given by the number of pixels in the associated bin of the object histogram H in relation to the total number of pixels in the histogram (see figure 1(g)). The third item is the most difficult one. We already have all pixels in the core of the object assigned to the object. These can be removed from the object histogram by subtraction of the core-object histogram C. This results in a wish-histogram W of missing values. Normalization so the values lie between zero and one gives:

W = \frac{1}{2} \left( 1 + \frac{H - C}{H + C} \right)    (1)

Locations where W has a high value are the color values we still need to make our object histogram complete (see figure 1(h)). The three probabilities mentioned above can be combined by multiplication into one likelihood:

P(x, y, \vec{c}) = P_D(x, y) \cdot P(\vec{c} \mid H) \cdot P(\vec{c} \mid W)    (2)

with P_D a probability related to the distance and \vec{c} the color values of the pixel at location (x, y) (see figure 1(i)). The maximum speed and deformability of the objects determine which pixels need to be processed (the width of the boundary); all other pixels get a likelihood of zero for this object. After calculating a likelihood image for each object, classification can be performed by taking, for each pixel, the maximum over the set of background likelihood and object likelihoods (see figure 1(k)). Then some post-processing is necessary in the case of objects which (for example due to occlusion) split, and to deal with small isolated regions of one class inside another class (see figure 1(l)).
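As a concrete illustration of equations (1) and (2), the sketch below computes the wish-histogram from the object and core-object histograms and evaluates the combined likelihood for one boundary pixel. The 8x8x8 RGB binning follows section 3; the function names and the surrounding data layout are our own assumptions.

```python
import numpy as np

BINS = 8  # bins per RGB channel (section 3)

def wish_histogram(H, C):
    """Equation (1): W = (1/2) * (1 + (H - C) / (H + C)), elementwise.
    H is the (unnormalized) object histogram, C the core-object histogram;
    bins that are empty in both are given a wish value of zero (an assumed
    convention for the undefined 0/0 case)."""
    W = np.zeros_like(H, dtype=float)
    nz = (H + C) > 0
    W[nz] = 0.5 * (1.0 + (H[nz] - C[nz]) / (H[nz] + C[nz]))
    return W

def bin_index(c):
    """Map an RGB triple with values in 0..255 to histogram bin indices."""
    return tuple(int(v) * BINS // 256 for v in c)

def object_likelihood(x, y, c, P_D, H, W):
    """Equation (2): P(x, y, c) = P_D(x, y) * P(c | H) * P(c | W)."""
    b = bin_index(c)
    p_color = H[b] / max(H.sum(), 1)  # normalized object-histogram term
    return P_D[y, x] * p_color * W[b]
```

Classification then assigns each boundary pixel to the background or to the object with the highest such likelihood, with no threshold involved.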
3 Implementation
In this section, some details of our implementation will be described. First the overall pseudo-code is given, then some parts of the implementation are explained in more detail. Pseudo-code of the algorithm, for each frame:

1. Start with all old objects (see figures 1(a) and 1(b)), and calculate their core (see figure 1(c)).
2. Read a new image.
3. Use a Kalman prediction to get estimated locations and search ROIs for all objects.
4. Do histogram matching to find the location of the core of each object (see figure 1(e)).
5. For each pixel, calculate the likelihood for object 1, ..., object L (see figures 1(f), 1(g), 1(h), and 1(i)) and for the background (see figure 1(j)). The likelihood is only calculated for those objects for which the pixel lies within a certain distance from the edge of the core-object.
6. Classify all pixels outside the cores of the objects, based on the likelihoods (see figure 1(k)).
7. For those pixels classified as background, threshold their likelihood in order to detect possible new objects. Group foreground pixels and initialize new objects.
8. Perform post-processing on all objects to find objects which are too small, split, etc. (see figure 1(l)).
9. Update the background model for all background pixels.
10. Calculate the center of gravity of each object, and update the Kalman model and the object and core-object color histograms (see figure 1(m)).
The EM background modeling algorithm we used is described in more detail in [8]. This work is closely related to that of Stauffer and Grimson [6, 7], but as opposed to Stauffer and Grimson, it updates all kernels, which results in a more accurate estimate of the variance of the kernels [13]. In order to reduce artifacts from shadows and changes in light intensity, we use the r and g channels from normalized rgb as features of the background model. An additional advantage is that normalized color space is two-dimensional and practically without any correlation, which results in a faster algorithm, because the two channels can be regarded as independent of each other. In the (r, g)-space, we use the EM algorithm to model the data per pixel by two separate 1D mixtures of Gaussian kernels (although the use of 2D kernels is being considered), one for r and one for g. Four kernels are used per color. The kernel with the highest prior is regarded as the foreground kernel. As initialization, 100 frames (4 seconds) are processed before object detection and tracking is started.

Similar to the background model, the object templates (a color histogram in our case) need to be updated from time to time. To prevent the template from being corrupted by one bad estimate of the location, or by occlusion, it is updated with only a small fraction γ of the object in the new image, and only when the match error is not too high, as a high match error is a clue for occlusion. The update of the color histogram is done by

H_{t+1} = (1 - \gamma) H_t + \gamma N    (3)

where H is the template histogram and N is the histogram of the object in the new frame. We choose to use the full RGB histogram, and not normalized rgb, because of its higher selectivity. We can do so because the update speed for the objects is much greater than that for the background model, as the view of objects tends to change faster than the background; we update fast enough to cope with slow intensity changes due to changes in illumination. The number of bins of the histograms used is eight for each of the three dimensions.

As can be seen, there are some differences between modeling the background and modeling the objects. As the number of objects in the data is limited, it is feasible to memorize the full histograms of the objects, as opposed to memorizing a full histogram for each pixel, which would require too much memory. The update speed of the background should also be much slower than the update of the histograms using γ, because different time scales are involved. The modeling of the background should not be affected by passing objects, so its time constant should be large. The update of the objects, on the other hand, should be much faster, as it should deal with changes in the appearance of objects between subsequent frames.

The likelihood calculation of the objects is done in two stages. First histogram matching is used to find the best location of the core of the object. Then the pixels which could belong to the object are determined, and their likelihood is calculated. To perform histogram matching, first a binary erosion of the object in the previous frame is applied (see figures 1(b) and 1(c)).
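A minimal sketch of the template update of equation (3) above, with the occlusion gate just described; the values of γ and of the match-error threshold are assumptions, as the paper does not state them.

```python
import numpy as np

GAMMA = 0.05    # update fraction gamma (assumed value)
ERR_MAX = 0.3   # maximum normalized match error before we suspect occlusion
                # (assumed value)

def update_template(H, N, match_error):
    """Equation (3): H_{t+1} = (1 - gamma) * H_t + gamma * N.
    H and N are unnormalized 8x8x8 RGB histograms of the template and of
    the object in the new frame; on a high match error (a clue for
    occlusion) the template is left untouched."""
    if match_error > ERR_MAX:
        return H
    return (1.0 - GAMMA) * H + GAMMA * N
```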
Figure 1: Step-by-step algorithm description. (a) Object in frame t; (b) binary object; (c) binary object-core; (d) object-core in frame t; (e) object-core in frame t+1; (f) probability based on distance; (g) probability based on color histogram; (h) probability based on wish-histogram; (i) total object probability; (j) background probability in frame t+1; (k) maximal probability; (l) segmented object after post-processing; (m) segmented object in frame t+1.
The size of the structuring element is based on the maximal speed of the objects, and is in our case 5x5 pixels. This provides the set of core-object pixels. At different locations in the new image, the histogram of the core-object is compared to the histogram we would obtain if the core-object were at that location. The sum of absolute differences is used as a measure of fit (a lower value means a better fit). The location of the best fit is chosen as the new core-object location, and the pixels covered by the core-object at this location are given a likelihood of one. The histogram of the core-object is updated using equation (3).

A binary dilation of the core-object (with twice the size of the erosion which was used to create the core-object from the object) gives us those pixels which could belong to the object. We use a distance transform [12] to get a distance-based likelihood (P_D in equation (2), see figure 1(f)). For all pixels found using the dilation step, the color-based likelihood is given by the number of items in the corresponding bin of the object histogram; it is calculated by indexing the color value of the current pixel in the normalized object histogram (see figure 1(g)). The final contribution is given by the wish-histogram: we index the color value in the wish-histogram calculated in equation (1) (see figure 1(h)). This histogram is calculated by subtraction of two unnormalized histograms.

New objects are detected using a threshold on the likelihood of pixels classified as background. The physics of the objects determines the minimum size of an object, providing a criterion for whether or not a blob can be an object. Objects which are too small are re-labeled as background. To improve the speed of the histogram matching, an estimate of the object location is calculated using a Kalman filter; this reduces the area in which we have to do (histogram-based) template matching. The new locations are used to update the Kalman filter. Finally, the background model is updated for all pixels labeled as background.
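The sketch below illustrates this stage under our own assumptions: scipy.ndimage supplies the morphology and distance transform, the 5x5 erosion and double-sized dilation follow the text, and the linear decay of P_D with distance is a choice of ours, since the paper does not specify the exact mapping.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation, distance_transform_edt

def core_mask(obj_mask):
    """Core-object: binary erosion with the 5x5 structuring element."""
    return binary_erosion(obj_mask, structure=np.ones((5, 5), dtype=bool))

def match_error(h_candidate, h_core):
    """Sum of absolute differences between histograms; lower is better."""
    return float(np.abs(h_candidate - h_core).sum())

def best_core_location(core_hist, hist_at, candidates):
    """Pick the candidate position whose local histogram best matches the
    core-object histogram. hist_at(x, y) is assumed to return the histogram
    of the core footprint placed at (x, y); an efficient implementation
    would update it incrementally from the pixels that shift in and out."""
    return min(candidates, key=lambda p: match_error(hist_at(*p), core_hist))

def distance_likelihood(core):
    """P_D of equation (2): one inside the core, decaying (here linearly,
    an assumption) with distance outside it, and zero beyond the dilated
    support. The 11x11 dilation approximates twice the 5x5 erosion size,
    rounded to an odd width."""
    support = binary_dilation(core, structure=np.ones((11, 11), dtype=bool))
    d = distance_transform_edt(~core)   # distance to the nearest core pixel
    dmax = float(d[support].max()) or 1.0
    P_D = np.clip(1.0 - d / dmax, 0.0, 1.0)
    P_D[~support] = 0.0
    return P_D
```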
4 Experiments
For our experiments we used data recorded at a railway station using a digital video camera. The camera was located on the second floor of the main hall. In figure 2(a), one of the input images is shown. Figure 2(b) shows the segmentation into foreground and background using only the EM algorithm, while figure 2(c) shows the segmentation using our integration algorithm.
Figure 2: Results of the background estimation and tracking after 150 frames. (a) Input image; (b) foreground by EM; (c) foreground by the proposed algorithm. Two objects are shown in these images: one actual object, and one old object, shown to illustrate the trajectory it made.

5 Conclusions
In this paper we have shown that two important steps of video surveillance, namely object detection
and tracking, can be integrated using a probability-based algorithm. This approach has the advantage that no thresholds are necessary to re-find previously detected objects. It also leads to better object segmentation and easier detection of (partial) occlusion. The core of the object is tracked using template matching. Pixels near the object boundary are detected in each frame using a probability-based algorithm, which strives to keep the color information of the object constant while allowing the object pixels to shift relative positions. Experiments show this is an advantage when tracking non-rigid bodies such as humans.
References

[1] I. Haritaoglu, D. Harwood, and L.S. Davis, “W4: Real-time surveillance of people and their activities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809–830, August 2000.

[2] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: principles and practice of background maintenance,” in IEEE International Conference on Computer Vision, Kerkyra, Greece, 1999, pp. 255–261.

[3] C.R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.

[4] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[5] C. Priebe, “Adaptive mixtures,” Journal of the American Statistical Association, vol. 89, no. 427, pp. 796–806, 1994.

[6] W.E.L. Grimson, C. Stauffer, R. Romano, and L. Lee, “Using adaptive tracking to classify and monitor activities in a site,” in IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, 1998, pp. 22–29.

[7] C. Stauffer and W.E.L. Grimson, “Adaptive background mixture models for real-time tracking,” in IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, 1999, vol. 2, pp. 246–252.

[8] G. Stijnman and R. van den Boomgaard, “Background estimation in video sequences,” Tech. Rep. 10, Intelligent Sensory Information Systems Group, University of Amsterdam, http://www.science.uva.nl/research/reportsisis/ISISreport.html, January 2000.

[9] N. Oliver, A. Pentland, and F. Bérard, “LAFTER: Lips and face real time tracker with facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Puerto Rico, 1997.

[10] R. Kalman, “A new approach to linear filtering and prediction problems,” Transactions of the ASME, Journal of Basic Engineering, pp. 35–45, March 1960.

[11] M.J. Swain and D.H. Ballard, “Indexing via color histograms,” in DARPA90, 1990, pp. 623–630.

[12] K.R. Castleman, Digital Image Processing, Prentice Hall, Englewood Cliffs, NJ, USA, 1996.

[13] F. Zwarthoed, “Real-time object detection in video using mixture models for background estimation” (in Dutch), M.S. thesis, IAS group, University of Amsterdam, April 2002.