A Robust Real-time Multi-level Model-based Pedestrian Tracking System
Osama Masoud
Nikolaos P. Papanikolopoulos*
[email protected]
[email protected]
(612) 625-8544
(612) 625-0163
Artificial Intelligence, Robotics, and Vision Laboratory Department of Computer Science University of Minnesota 4-192 EE/CS Building 200 Union St. SE Minneapolis, MN 55455 FAX: (612) 625-0572
* Author to whom all correspondence should be sent
Abstract
This paper describes a real-time system for tracking pedestrians in sequences of grayscale images acquired by a stationary camera. The output of the system is the spatio-temporal coordinates of each pedestrian during the period the pedestrian is visible. The system is model-based in the sense that it uses a simple rectangular model, with its own location and velocity, for each pedestrian. Our system uses three levels of abstraction. The lowest level, the image level, deals with raw images. At this level, background subtraction is performed and the result is passed to the second level. This level, which we call the blobs’ level, deals with blobs obtained by segmenting the subtracted image. It performs robust blob tracking with no regard to what or whom the blobs represent. The output of this level is the tracked blobs, which are in turn passed to the final level, the pedestrian level. The pedestrian level deals with pedestrian models and depends on the tracked blobs as its only source of input. By doing this, we avoid trying to infer information about pedestrians directly from raw images, a process that is highly sensitive to noise. The pedestrian level makes use of Kalman filtering to predict and estimate pedestrian attributes. The filtered attributes constitute the output of this level, which is also the output of the system. Our system was designed to be robust to high levels of noise and, in particular, to deal with difficult situations such as partial or full occlusions of pedestrians.
The system was implemented on a Datacube MaxVideo 20 equipped with a Datacube Max860 and was able to achieve a peak performance of over 20 frames per second. Experimental results based on indoor and outdoor scenes have shown that the system is robust under many difficult traffic situations.
1 Introduction There is an increasing number of applications where pedestrian monitoring is of high importance. Traffic control, security monitoring, pedestrian flow analysis, and pedestrian counting are some applications which rely heavily on pedestrian tracking. We have developed a real-time system that reliably tracks pedestrians in the field of view of a fixed camera.
In our system, we use three levels of abstraction. Figure 1 depicts these levels along with the data flows among them. Each level can be thought of as a module operating on a certain data type, and all the modules run in parallel. Moreover, each level retains a state of the data it produces to be used in conjunction with the data received from the lower level. The lowest level deals with raw images. It receives a sequence of images and produces a sequence of difference images. As will be explained below, a difference image is the result of subtracting the input image from the background. In the second level, which deals with blobs, difference images are segmented to obtain blobs. Then, blobs are associated with the blobs that were generated previously. In essence, the blobs’ level tracks blobs. Tracked blobs are passed on to the pedestrian level, where relations between pedestrians and blobs, as well as information about pedestrians, are inferred using previous information about pedestrians at that level.
In the next section, we give a review of related work in pedestrian tracking. Section 3 describes image subtraction which is all that is done at the image level. Sections 4 and 5 describe the processing done at the blobs’ level. The pedestrian level is presented in Section 6. Experimental results follow in Section 7.
2 Related Work Research work in pedestrian tracking has aimed to recover different types of information, including 2D location, 3D location, posture, and action such as walking, running, etc. Rohr [8] used generalized cylinder models to describe the human body. The objective was to recover the pedestrian location and posture. The system assumed that the pedestrian walks parallel to the image plane and did not deal with occlusions. Rashid [7] presented a system which interprets moving light displays (MLDs). The objective was to recover the structure of the pedestrian using the tracked points. Some methods [5,6] tried to infer the location and the action by analyzing the motion patterns of the legs, for example. A group in England has developed the TULIP system [11]. Their system works under specific environmental conditions. A transputer-based system has also been developed in England by Ali and Dagless [1]. Some vision-based systems were developed for the detection of border intruders [3], secure perimeter violators [4], and unauthorized safety zone trespassers. Segen and Pingali [9] tracked features on the contour of pedestrians and then performed clustering on feature paths. Their system, which ran in real time, was tested on indoor image sequences. The system failed in certain situations involving occlusions. Cai and Aggarwal [2] developed a pedestrian tracking system that uses cameras mounted at various locations whose relative positions are assumed to be known. The system was robust, but processing time was 0.3 seconds per frame. Smith et al. [10] performed pedestrian detection in real time. The system used several simplistic criteria to judge whether a detected object is a pedestrian or not.
3 Image Subtraction A stationary camera has the advantage that the scene background remains constant to a certain extent. This allows us to get a good approximation of the location of changes in the image by subtracting the image from the background and then thresholding the result using an appropriate threshold value. The resulting image, which we call the difference image, is a binary image with 1’s in places where there are changes and 0’s everywhere else. Of course, changes captured this way may or may not correspond to moving objects. Shadows, image noise, changes in lighting conditions, changes in weather conditions, and small changes in the camera’s intrinsic parameters are some examples of changes that show up in the difference image but do not correspond to moving objects. It is also possible that some moving objects do not introduce any change to the difference image, such as a moving object which is very similar in color to the background beneath it. This problem is commonly encountered in computer vision; for example, the optical flow and the motion field do not always correspond to each other.
There is no perfect solution to this problem, but several researchers have attempted to overcome some of its aspects. For example, some research groups have used adaptive background updates to capture new global changes in the image sequence, such as changes in lighting conditions. Noise reduction is usually taken care of by applying a low-pass filter to the image and the background before they are subtracted and by ignoring small clusters of 1’s in the difference image. We adopt the latter technique in our system. The input to the system is a stream of grayscale images of size 512 × 480. For efficiency reasons, the images are subsampled to half the number of pixels on each side. A threshold of 20 is used to obtain the difference image.
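As a sketch, the subtraction and thresholding step might look like the following. This is pure Python operating on images represented as lists of rows; the function name and toy values are ours, with the paper's threshold of 20 as the default:

```python
def difference_image(image, background, threshold=20):
    """Binary difference image: 1 where the pixel differs from the
    background by more than `threshold`, 0 elsewhere.  Images are
    lists of rows of gray levels; 20 is the paper's threshold."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(img_row, bg_row)]
            for img_row, bg_row in zip(image, background)]

# A 1x4 toy image against a flat background: only changes larger
# than the threshold survive.
bg = [[100, 100, 100, 100]]
img = [[100, 130, 90, 121]]
diff = difference_image(img, bg)
```

A real implementation would also apply the low-pass filtering and subsampling mentioned above before thresholding.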
4 Blob Extraction Once a difference image is computed, connected segments of 1’s are extracted using border following. Another way to extract connected components is to use the raster-scan algorithm. The advantage of the latter method is that it extracts holes inside the blobs while border following does not. However, for the purpose of our system, holes do not constitute a major issue. Moving pedestrians usually form solid blobs in the difference image, and if these blobs have holes, they may still be considered part of the pedestrian. Border following has the extra advantage of being more efficient. While the raster-scan algorithm has to traverse every pixel in the image, with border following the interior of blobs need not be considered. Thus, the larger the total area of all the blobs in the image, the faster the segmentation process becomes.
The following parameters are computed for each blob b:
1. Perimeter,
2. Area, denoted by A(b): the number of pixels inside the blob,
3. Bounding box: the smallest rectangle surrounding the blob,
4. Density, denoted by D(b): A(b) / bounding box area.
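A minimal sketch of computing the last three of these per-blob parameters, assuming the blob is already given as a set of pixel coordinates (this representation and the names are ours, standing in for the border-following output):

```python
def blob_parameters(pixels):
    """Compute area A(b), bounding box, and density D(b) for a blob
    given as a set of (x, y) pixel coordinates."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    area = len(pixels)                            # A(b): pixel count
    bbox = (min(xs), min(ys), max(xs), max(ys))   # smallest enclosing rectangle
    bbox_area = (bbox[2] - bbox[0] + 1) * (bbox[3] - bbox[1] + 1)
    density = area / bbox_area                    # D(b) = A(b) / bbox area
    return area, bbox, density

# An L-shaped blob of 3 pixels inside a 2x2 bounding box.
area, bbox, density = blob_parameters({(0, 0), (0, 1), (1, 0)})
```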
5 Blob Tracking When a new set of blobs is computed for frame i, an association with frame (i − 1)’s set of blobs is sought. In the most general case, this association can be an unrestricted relation. With each new frame, blobs can split, merge, appear, or disappear. The relation among blobs can be represented by an undirected
bipartite graph, G_i(V_i, E_i), where V_i = B_i ∪ B_{i−1}. B_i and B_{i−1} are the sets of vertices associated with the blobs in frames i and i − 1, respectively. Since there is a 1-to-1 correspondence between the blobs in frame i and the elements of B_i, we will use the terms blob and vertex interchangeably. Figure 2 shows how the blobs in two consecutive frames are associated. The graph in the figure expresses the fact that blob 1 split into blobs 4 and 5, blob 2 and part of blob 1 merged to form blob 4, blob 3 disappeared, and blob 6 appeared.
The process of blob tracking is equivalent to computing G_i for i = 1, 2, …, n, where n is the total number of frames. Let N_i(u) denote the set of neighbors of vertex u ∈ V_i, N_i(u) = { v | (u, v) ∈ E_i }. To simplify graph computation, we will restrict the generality of the graph to those graphs which do not have more than one vertex of degree more than one in every connected component of the graph. Mathematically, ∀(u, v) ∈ E_i, |N_i(u)| > 1 ⇒ |N_i(v)| = 1. This is equivalent to saying that from one frame to the next, a blob may not participate in a splitting and a merging at the same time. We refer to this as the “parent structure constraint.” According to this constraint, the graph in Figure 2(c) is invalid. If, however, we eliminate the arc between 1 and 5 or the arc between 2 and 4, it will be a valid graph. This restriction is reasonable assuming a high frame rate where such simultaneous split and merge occurrences are rare.
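The parent structure constraint can be checked directly from vertex degrees; a sketch (blob identifiers and the function name are illustrative, not from the paper):

```python
from collections import Counter

def satisfies_parent_structure(edges):
    """Check the parent structure constraint: for every edge (u, v),
    at least one endpoint must have degree exactly 1.  Edges are
    (old_blob, new_blob) pairs; vertex names must be distinct across
    the two frames."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return all(deg[u] == 1 or deg[v] == 1 for u, v in edges)

# Blob 1 splitting into 4 and 5 is valid; adding a simultaneous merge
# of blob 2 into 4 (as in Figure 2(c)) violates the constraint.
valid = satisfies_parent_structure([("1", "4"), ("1", "5")])
invalid = satisfies_parent_structure([("1", "4"), ("1", "5"), ("2", "4")])
```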
There are exponentially many ways a general bipartite graph can be constructed. In fact, given two sets of vertices of sizes m and n, there are 2^(mn) different possible graphs. This number is reduced with the parent structure constraint but still remains exponential. To further reduce this number, we use another reasonable constraint which we call the “locality constraint.” With this
constraint, vertices can be connected only if their corresponding blobs have a bounding box overlap area which is at least half the size of the bounding box of the smaller blob. This constraint, which significantly reduces possible graphs, relies on the assumption that a blob is not expected to be too far from where it was in the previous frame. This is also reasonable to assume if we have a relatively high frame rate. We refer to a graph which satisfies both the parent structure and locality constraints as a valid graph.
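A sketch of the locality test on two bounding boxes (the coordinate convention and names are ours):

```python
def locality_allows(bb1, bb2):
    """Locality constraint: two blobs may be associated only if their
    bounding boxes (x0, y0, x1, y1) overlap by at least half the area
    of the smaller box."""
    x0 = max(bb1[0], bb2[0])
    y0 = max(bb1[1], bb2[1])
    x1 = min(bb1[2], bb2[2])
    y1 = min(bb1[3], bb2[3])
    overlap = max(0, x1 - x0) * max(0, y1 - y0)
    area1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
    area2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
    return overlap >= min(area1, area2) / 2

# A box fully containing a smaller one passes; disjoint boxes do not.
near = locality_allows((0, 0, 10, 10), (2, 2, 6, 6))
far = locality_allows((0, 0, 10, 10), (20, 20, 30, 30))
```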
To find the optimum G_i, we need to define a cost function, C(G_i), so that different graphs can be compared. A graph with no arcs, i.e., E_i = ∅, is one extreme solution in which all blobs in B_{i−1} disappear and all blobs in B_i appear. This solution has no association among blobs and should therefore have a high cost. In order to proceed with our formulation of the cost function, we define two disjoint sets, which we call parents, P_i, and descendants, D_i, whose union is V_i such that D_i = ∪_{u ∈ P_i} N_i(u). P_i can be easily constructed by selecting from V_i all vertices of degree more than one, all vertices of degree 0, and all vertices of degree one which are only in B_i. Furthermore, let S_i(u) = Σ_{v ∈ N_i(u)} A(v) be the total area occupied by the neighbors of u. We now write the formula for the cost function as

C(G_i) = Σ_{u ∈ P_i} (A(u) − S_i(u))² / max(A(u), S_i(u))
This function favors associations in which blobs do not change much in size. It also penalizes a given size change less for large blobs than for small ones, which is intuitively reasonable.
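The cost function can be evaluated directly from the parent set, as in this sketch (the adjacency representation and names are ours):

```python
def graph_cost(neighbors, area):
    """C(G) = sum over parents u of (A(u) - S(u))**2 / max(A(u), S(u)),
    where S(u) is the total area of u's neighbors.  `neighbors` maps
    each parent to its associated vertices (empty list for a blob that
    appears or disappears); `area` maps every vertex to its blob area."""
    cost = 0.0
    for u, nbrs in neighbors.items():
        s = sum(area[v] for v in nbrs)
        cost += (area[u] - s) ** 2 / max(area[u], s)
    return cost

# A 100-pixel blob splitting into blobs of 60 and 45 is cheap; the
# same blob simply disappearing costs its whole area.
split_cost = graph_cost({"a": ["b", "c"]}, {"a": 100, "b": 60, "c": 45})
vanish_cost = graph_cost({"a": []}, {"a": 100})
```

Note that an isolated parent (a blob appearing or disappearing) is charged exactly its area, which gives the no-association graph its high cost.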
Using this cost function, we can proceed to compute the optimum graph. First, we notice that
given a valid graph G(V, E) and two vertices u, v ∈ V such that (u, v) ∉ E, the graph G′(V, E ∪ {(u, v), (v, u)}) has a lower cost than G, provided that G′ is a valid graph. If it is not possible to find such a G′, we call G dense. Using this property, we can avoid useless enumeration of graphs which are not dense. In fact, this observation is the basis of our algorithm to compute the optimum G.
Our algorithm to compute the optimum graph works as follows: a graph G is constructed such that the addition of any edge to G would make it violate the locality constraint. There can be only one such graph. Note that G may violate the parent structure constraint at this point. The next step in our algorithm systematically eliminates just enough edges from G to make it satisfy the parent structure constraint. The resulting graph is valid and also dense. The process is repeated so that all possible dense graphs are generated. The optimum graph is the one with the minimum cost.
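For illustration, the search can be written as a brute force over all subsets of locality-admissible edges; the paper's algorithm enumerates only dense graphs, which is far cheaper. The function, the parent-selection details for the 1-to-1 case, and the tiny example below are our assumptions:

```python
from itertools import combinations

def best_association(old_area, new_area, candidate_edges):
    """Brute-force search for the minimum-cost valid blob association
    between two frames.  `old_area` and `new_area` map blob ids (which
    must be distinct across frames) to areas; `candidate_edges` are
    (old, new) pairs that already pass the locality constraint."""
    area = {**old_area, **new_area}

    def cost(edges):
        deg = {b: 0 for b in area}
        nbr = {b: [] for b in area}
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
            nbr[u].append(v)
            nbr[v].append(u)
        # Reject graphs violating the parent structure constraint.
        if not all(deg[u] == 1 or deg[v] == 1 for u, v in edges):
            return float("inf")
        # Parents: vertices of degree != 1, plus degree-one vertices of
        # the new frame whose single neighbor also has degree one.
        parents = [b for b in area
                   if deg[b] != 1
                   or (b in new_area and deg[nbr[b][0]] == 1)]
        # C(G) = sum over parents of (A(u) - S(u))^2 / max(A(u), S(u)).
        return sum((area[u] - sum(area[v] for v in nbr[u])) ** 2
                   / max(area[u], sum(area[v] for v in nbr[u]))
                   for u in parents)

    subsets = [c for r in range(len(candidate_edges) + 1)
               for c in combinations(candidate_edges, r)]
    return set(min(subsets, key=cost))

# Old blob "a" (area 100) against new blobs "x" (60) and "y" (45):
# the split association beats the appear/disappear alternatives.
best = best_association({"a": 100}, {"x": 60, "y": 45},
                        [("a", "x"), ("a", "y")])
```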
6 Pedestrian Tracking A walking pedestrian in the image sequence will cause some blobs to appear. The number of resulting blobs due to this pedestrian may not be only one. Partially occluded pedestrians, for example, may show up as more than one blob. Similarity of color between the pedestrian's clothes and the background may also result in more than one blob. If the pedestrian is fully occluded, there will be no blobs. This demonstrates that the relation between blobs and our definition of a pedestrian need not be one-to-one. The final level of abstraction deals with pedestrian data. Each pedestrian has a location and a velocity. Tracking pedestrians depends on the state of pedestrians as
well as the tracked blobs. To achieve this, we define a many-to-many relationship between pedestrians and blobs. In other words, a pedestrian may consist of zero or more blobs and a blob may participate in zero or more pedestrians. The purpose of tracking pedestrians is to compute this relationship as well as the location and velocity of pedestrians whenever a new blob graph, as defined in the previous section, is computed. Pedestrians usually walk with a constant speed. Moreover, the speed of a pedestrian usually changes gradually when the pedestrian desires to stop or start walking. For these reasons, we choose to use Kalman filtering to estimate the location and velocity of pedestrians. In the next section we describe the Kalman filter that we used. In Section 6.2, we show how pedestrians and blobs are related. Section 6.3 describes the pedestrian location measurement method. Finally, we describe how all these pieces are put together in Section 6.4.
6.1 Kalman Filtering We used a constant speed model as our pedestrian model. In this model, we assume that pedestrians walk with a constant velocity. Only the horizontal component equations will be presented here. The vertical component is treated in the exact same manner. In effect, we will be using two independent filters for each pedestrian, one for the horizontal component and one for the vertical component. In the following equations, t represents time. The system equation is given by

x_{t+1} = F x_t + w_t

where x = [x, v]ᵀ is the state vector consisting of the location, x, and velocity, v; F = [1 Δt; 0 1] is the transition matrix; and w = [0, w]ᵀ is a Gaussian system noise. The measurement model equation is given by

z_t = H x_t + u_t

where z is the measured location, H = [1 0] is the measurement matrix, and u is a Gaussian measurement error. We denote the variance in system error by σ_h² for the horizontal filter and σ_v² for the vertical filter. Measurement error variances for the horizontal and vertical filters are denoted by δ_h² and δ_v², respectively.

The horizontal Kalman filter equations for the prediction and correction phases become:
• Prediction:
x̂_{t+1} = F x_t
P̂_{t+1} = F P_t Fᵀ + [0 0; 0 σ_h²]
• Correction:
K_{t+1} = P̂_{t+1} Hᵀ (H P̂_{t+1} Hᵀ + δ_h²)⁻¹
x_{t+1} = x̂_{t+1} + K_{t+1} (z_{t+1} − H x̂_{t+1})
P_{t+1} = (I − K_{t+1} H) P̂_{t+1}

In the above equations, x̂ and x are the predicted and estimated system states, respectively. P̂ and P are the predicted and estimated system covariance matrices, respectively. In our system, we used the values {1, 1, 5, 5} for {σ_h², σ_v², δ_h², δ_v²}, which bias the filter towards the history of the state since we assume that measurements can be highly noisy.
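One predict/correct cycle of this one-dimensional constant-velocity filter can be written out directly. The sketch below hard-codes H = [1 0], uses the noise variances above as defaults, and works the 2×2 matrix algebra out by hand (the function name and matrix layout are ours):

```python
def kalman_step(x, P, z, dt=1.0, sigma2=1.0, delta2=5.0):
    """One predict/correct cycle of the 1-D constant-velocity filter.
    x = [location, velocity], P is the 2x2 state covariance, z the
    measured location; sigma2 and delta2 default to the paper's
    system and measurement noise variances (1 and 5)."""
    # Prediction: x^ = F x,  P^ = F P F' + [0 0; 0 sigma2],  F = [1 dt; 0 1].
    xp = [x[0] + dt * x[1], x[1]]
    Pp = [[P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1],
           P[0][1] + dt * P[1][1]],
          [P[1][0] + dt * P[1][1],
           P[1][1] + sigma2]]
    # Correction with H = [1 0]: the innovation covariance is a scalar.
    s = Pp[0][0] + delta2
    K = [Pp[0][0] / s, Pp[1][0] / s]           # Kalman gain
    innov = z - xp[0]
    x_new = [xp[0] + K[0] * innov, xp[1] + K[1] * innov]
    # P = (I - K H) P^
    P_new = [[(1 - K[0]) * Pp[0][0], (1 - K[0]) * Pp[0][1]],
             [Pp[1][0] - K[1] * Pp[0][0], Pp[1][1] - K[1] * Pp[0][1]]]
    return x_new, P_new

# Starting at rest with unit covariance and measuring z = 2, the
# estimate moves only part of the way toward the measurement,
# reflecting the bias toward state history.
x_est, P_est = kalman_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 2.0)
```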
6.2 Relating Pedestrians to Blobs We represent the relationship between pedestrians and blobs as a directed bipartite graph,
GP_i(VP_i, EP_i), where VP_i = B_i ∪ P. B_i is the set of blobs resulting from the i-th image, which we defined in the previous section. P is the set of pedestrians. An edge (p, u), p ∈ P and u ∈ B_i, denotes the blob u participating in pedestrian p. We call GP_i a pedestrian graph. Given a blob graph, G_i(V_i, E_i), as computed in the previous section, and a pedestrian graph, GP_{i−1}, we compute EP_i as follows:

EP_i = { (p, u) | (u, v) ∈ E_i ∧ (p, v) ∈ EP_{i−1} }
This transitive definition says that if a pedestrian was related to a blob in frame (i − 1) and that blob is related to another blob in the i-th frame (through a split, merge, etc.), then the pedestrian is also related to the latter blob.
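The transitive definition translates almost literally into a set comprehension. In this sketch, edges of E_i are stored as (new blob, old blob) pairs (the representation and names are ours):

```python
def propagate_pedestrian_edges(blob_edges, prev_ped_edges):
    """EP_i = {(p, u) | (u, v) in E_i and (p, v) in EP_{i-1}}.
    `blob_edges` holds (new blob, old blob) pairs from the blob graph;
    `prev_ped_edges` holds (pedestrian, old blob) pairs."""
    return {(p, u)
            for (u, v) in blob_edges
            for (p, w) in prev_ped_edges
            if w == v}

# Pedestrian "ped" owned blob 1 in frame i-1; blob 1 split into
# blobs 4 and 5, so the pedestrian now relates to both.
new_edges = propagate_pedestrian_edges({(4, 1), (5, 1)}, {("ped", 1)})
```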
6.3 Calculating Pedestrian Positions We use the following rule to update the location of pedestrians: move each pedestrian, p, as little as possible so that it covers as much as possible of its blobs, { u | (p, u) ∈ EP_i }; and if a number of pedestrians share some blobs, they should all participate in covering all these blobs. There are many vague terms in this rule which need to be specified. First of all, the amount by which a pedestrian covers a blob implies a measure of overlap area. We will use a rectangle of fixed size to represent the shape of pedestrians, and we have already used the bounding box as a shape representation of blobs. However, since the blob bounding box area may be quite different from the actual blob area, we will include the blob density as computed in the previous section in the computation of the pedestrian-blob overlap area. Let BB(p) be the bounding box of a pedestrian p, and BB(b) be the bounding box of a blob b. The intersection of BB(p) and BB(b) is a rectangle
which we denote as X(BB(p), BB(b)). The overlap area between p and b is computed as (area of X(BB(p), BB(b))) × D(b). When more than one pedestrian shares a blob, the overlap area is computed this way for each pedestrian only if the other pedestrians do not also overlap the intersection area. If they do, that particular overlap area is divided by the number of pedestrians whose boxes overlap the area. Figure 3 illustrates this situation. The overlap area for p_1 is computed as a × D(b_1) + b × D(b_2) + (c × D(b_2))/2. For p_2, the overlap area is d × D(b_2) + (c × D(b_2))/2.
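As an illustration, the shared-overlap rule can be evaluated pixel by pixel inside the blob's bounding box, which sidesteps the rectangle-intersection case analysis; this pixel-level evaluation is our simplification, not the paper's implementation:

```python
def overlap_area(ped_boxes, blob_box, density, target):
    """Overlap-area measure of Section 6.3 for pedestrian `target`,
    evaluated pixel by pixel inside the blob's bounding box: each pixel
    contributes D(b) divided by the number of pedestrian boxes covering
    it.  Boxes are (x0, y0, x1, y1) with exclusive upper bounds."""
    def covers(box, x, y):
        return box[0] <= x < box[2] and box[1] <= y < box[3]

    total = 0.0
    for x in range(blob_box[0], blob_box[2]):
        for y in range(blob_box[1], blob_box[3]):
            owners = [p for p, box in ped_boxes.items() if covers(box, x, y)]
            if target in owners:
                total += density / len(owners)
    return total

# Two pedestrian boxes sharing the middle of a 4-pixel-wide blob of
# density 1.0: the shared column is split evenly between them.
peds = {"p1": (0, 0, 3, 1), "p2": (2, 0, 4, 1)}
a1 = overlap_area(peds, (0, 0, 4, 1), 1.0, "p1")
a2 = overlap_area(peds, (0, 0, 4, 1), 1.0, "p2")
```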
The problem of finding the optimum locations of pedestrians can be stated in terms of the overlap area measure that we just defined. We would like to place pedestrians such that the total overlap area of each pedestrian is maximized. It can easily be seen that there are at least N! optimum solutions, where N is the number of pedestrians. That is because, given any arrangement, we can interchange any two pedestrians without any change in cost since all pedestrians have the same size. We therefore restate the optimization problem as that of finding, among the maximum total overlap area arrangements of pedestrians, the one with the least sum of squared distances between the old and new locations of the pedestrians.
We do not attempt to solve the problem optimally because of its complexity. Instead, we resort to a heuristic solution using relaxation. First, a small step size is chosen. Then, each pedestrian is moved in all possible directions by the step size, and the location which maximizes the overlap area is recorded. Pedestrian locations are then updated according to the recorded locations. This completes one iteration. In each following iteration, the step size is increased. In our implementation, we start with a step of one pixel and double the step size in each iteration up to a maximum of 32 pixels.
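A sketch of this relaxation schedule for a single pedestrian box follows; the paper updates all pedestrians jointly each iteration, while here the four axis-aligned moves, the function name, and the toy score are our assumptions:

```python
def relax_position(pos, score, max_step=32):
    """Relaxation schedule of Section 6.3 for one box: try the current
    position and a step in each of the four axis-aligned directions,
    keep the best, and double the step each iteration
    (1, 2, 4, ..., max_step).  `score` maps an (x, y) position to the
    overlap area being maximized."""
    step = 1
    while step <= max_step:
        x, y = pos
        candidates = [pos, (x + step, y), (x - step, y),
                      (x, y + step), (x, y - step)]
        pos = max(candidates, key=score)
        step *= 2
    return pos

# A toy score peaked at (21, 0): the doubling steps move the box from
# the origin most of the way toward the peak.
best_pos = relax_position((0, 0), lambda p: -abs(p[0] - 21) - abs(p[1]))
```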
6.4 Pedestrian Tracking as a Complete System As a blob graph G i is computed, it is used to compute a pedestrian graph GP i , as we discussed in Section 6.2. The Kalman filter prediction phase is then executed followed by calculating pedestrian positions as in Section 6.3. These positions constitute the measurements that will be fed back into the filter to find the new state estimates. The state estimates make up the outputs of this module.
At the end of this stage, we do some checks to refine the pedestrian-blob relationships. These can be summarized as follows:
1. If the overlap area between a pedestrian and one of its blobs becomes less than 10% of the size of both, the blob will no longer be considered as belonging to this pedestrian. This serves as the splitting procedure when two pedestrians walk past each other.
2. If the overlap area between a pedestrian and a blob that does not belong to any pedestrian becomes more than 10% of the size of either one, the blob will be added to the pedestrian's blobs. This makes the pedestrian re-acquire blobs that may have disappeared due to occlusion.
3. Select one of the blobs that do not belong to any pedestrian. If the area of this blob is large enough (1000 pixels in our implementation), create a new pedestrian centered around the bounding box of the blob. This step serves as the initialization stage.
7 Experimental Results The system was implemented on the Minnesota Vision Processing System (MVPS), which is the image processing component of the Minnesota Robotic Visual Tracker (MVRT). MVPS consists of a Motorola MVME-147 SBC running the OS-9 real-time operating system, a Datacube MaxVideo 20 video processor, and a Datacube Max860 vector processor.
The system was tested on several indoor and outdoor image sequences ranging from simple to complex pedestrian scenarios. In most cases, pedestrians were tracked correctly throughout the period they appeared in the scene. Scenarios included pedestrians moving at slow or very high speeds, partial and full occlusions, and several pedestrian interactions. Interactions between pedestrians included occlusion of one another, repeated merging and splitting of blobs corresponding to two or more pedestrians walking together, pedestrians walking past each other, and pedestrians meeting and then walking back in the direction they came from. The system has a peak performance of over 20 frames per second. In a relatively cluttered image with about 6 pedestrians, the frame processing rate dropped to about 14 frames per second. Figure 4 shows 16 snapshots taken from a sequence of about 8 seconds. The sequence demonstrates the system behavior under occlusions, both partial and full.
There were cases when the system failed. These include highly crowded images with people walking away from the camera. In this case, the blob size becomes very small as the pedestrians get more distant. As a result, all pedestrian boxes collapse on top of each other in order to cover this blob, which is not considered a failure yet. But when a new blob appears nearby, some of these boxes will latch onto it and stay with it. Other inevitable failures occur when a pedestrian is almost identical in color to the background. In this case, if a pedestrian box is tracking this pedestrian, it will be prone to latch onto other nearby pedestrians having bigger blobs. Also, when a pedestrian becomes totally occluded but then reappears at an unexpected location, the pedestrian box will lose track. Finally, in cases where two pedestrians walk very closely around each other, their pedestrian boxes may get interchanged erroneously.
8 Conclusions and Future Research We presented a real-time pedestrian tracking system capable of working robustly under many difficult circumstances such as occlusions and ambiguities. For each pedestrian in the view of the camera, the system produces location and velocity information as long as the pedestrian is visible. There are several issues that still need to be addressed. These include using blob models and possibly using statistical filtering in blob tracking for more robust tracking. Another issue is allowing variable pedestrian size instead of a fixed one. In a typical scene, pedestrians who are far from the camera can be much smaller and can move much slower than pedestrians who are near the camera due to perspective projection. This can be taken into consideration in our system. Another aspect that needs to be dealt with is spatial interpretation of blobs. In the current system, only one pedestrian will be created for a very wide blob unless it splits. We may, however, be able to infer the existence of more than one pedestrian by analyzing the blob dimensions. Use of a priori known scene topology is another issue that can be considered. This would be particularly useful to make the system more robust in cases of occlusions. Finally, the use of multiple cameras is an interesting extension that can be considered.
9 Acknowledgements This work has been supported by the Minnesota Department of Transportation through Contracts #71789-72983-169 and #71789-72447-159, the Center for Transportation Studies through Contract #USDOT/DTRS 93-G-0017-01, the National Science Foundation through Contracts #IRI-9410003 and #IRI-9502245, the Department of Energy (Sandia National Laboratories) through Contracts #AC-3752D and #AL-3021, the McKnight Land-Grant Professorship Program at the University of Minnesota, and the Department of Computer Science of the University of Minnesota.
10 References
1. A. T. Ali and E. L. Dagless. Vehicle and pedestrian detection and tracking. In Proc. of the IEE Colloquium on “Image Analysis for Transport Applications,” page 48, 5/1-7, 1990. 2. Q. Cai and J. K. Aggarwal. Tracking human motion using multiple cameras. In Proc. of the 13th International Conference on Pattern Recognition, pages 68-72, 1996. 3. H. Frankel, S. Riter, and A. Bernat. An automated imaging system for border control. In Proc. of the 1986 International Carnahan Conference on Security Technology: Electronic Crime Countermeasures, pages 169-173, 1986. 4. J. C. Matter. Video motion detection for physical security applications. In Proc. of the 1990 Winter Meeting of the American Nuclear Society, page 396, 1990. 5. H. Mori, N. M. Charkari, and T. Matsushita. On-line vehicle and pedestrian detections based
on sign patterns. IEEE Trans. on Industrial Electronics, 41(4):384-390, August 1994. 6. S. A. Niyogi and E. H. Adelson. Analyzing and recognizing walking figures in XYT. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 469-474, 1994. 7. R. F. Rashid. Towards a system for the interpretation of moving light displays. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-2(6):574-581, November 1980. 8. K. Rohr. Towards model-based recognition of human movements in image sequences. CVGIP: Image Understanding, 59:94-115, January 1994. 9. J. Segen and S. Pingali. A camera-based system for tracking people in real time. In Proc. of the 13th International Conference on Pattern Recognition, pages 63-67, 1996. 10. C. Smith, C. Richards, S. A. Brandt, and N. P. Papanikolopoulos. Visual tracking for intelligent vehicle-highway systems. To appear, IEEE Trans. on Vehicular Technology, November 1996. 11. C. L. Wan, K. W. Dickinson, A. Rourke, M. G. H. Bell, X. Zhang, and N. Hoose. Low-cost image analysis for transportation applications. In Proc. of the IEE Colloquium on “Image Analysis for Transport Applications,” page 48, 1/10, 1990.
11 Illustrations
[Diagram: Image Sequence → Images → Difference Images → Blobs → Tracked Blobs → Pedestrians → Tracked Pedestrians]
Figure 1. The three levels of abstraction and data flows among them.
Figure 2. (a) Blobs in frame ( i – 1 ) . (b) Blobs in frame i . (c) Relationship among blobs.
Figure 3. Overlap area. p 1 and p 2 share blob b 2 while b 1 is only part of p 1 . See Section 6.3 for overlap area computation.
[Snapshots: frames 10, 41, 52, 69, 82, 92, 100, 109]
Figure 4. A number of snapshots from the input sequence overlaid with pedestrian boxes shown in white and blob boxes shown in black.
[Snapshots: frames 120, 129, 141, 156, 164, 182, 206, 262]
Figure 4. (continued).