Multi-Object Trajectory Tracking

Mei Han    Wei Xu    Hai Tao‡    Yihong Gong

NEC Laboratories America, Cupertino, CA, USA
{meihan, xw, ygong}@sv.nec-labs.com

‡ University of California, Santa Cruz, CA, USA
[email protected]

Abstract

The majority of existing tracking algorithms are based on the maximum a posteriori (MAP) solution of a probabilistic framework using a Hidden Markov Model, where the distribution of the object state at the current time instant is estimated from current and previous observations. However, this approach is prone to errors caused by distractions such as occlusions, background clutter and multi-object confusion. In this paper we propose a multiple object tracking algorithm that seeks the optimal state sequence maximizing the joint multi-object state-observation probability. We call this algorithm trajectory tracking since it estimates the state sequence, or "trajectory", instead of the current state. The algorithm is capable of tracking an unknown, time-varying number of objects. We also introduce a novel observation model composed of the original image, the foreground mask given by background subtraction, and the object detection map generated by an object detector. The image provides the object appearance information. The foreground mask enables the likelihood computation to consider the multi-object configuration in its entirety. The detection map consists of pixel-wise object detection scores, which drive the tracking algorithm to perform joint inference on both the number of objects and their configurations efficiently. The proposed algorithm has been implemented and tested extensively in a complete CCTV video surveillance system to monitor entries and detect tailgating and piggy-backing violations at access points for over six months. The system achieved 98.3% precision in event classification, with a violation detection rate of 90.4% and a detection precision of 85.2%. The results clearly demonstrate the advantages of the proposed detection-based trajectory tracking framework.

1. Introduction

Multiple object tracking has been a challenging topic in computer vision. Generally speaking, it has to solve two problems jointly: the estimation problem, as in traditional tracking, and the data association problem, especially when multi-object interactions exist. Many tracking algorithms solve the estimation problem in a maximum marginal a posteriori (MAP) formulation [3]; that is, the tracking result is the current object state with the highest posterior probability given current and previous observations. The marginal MAP formulation can be simplified if a Hidden Markov Model (HMM) [33] is assumed. Estimation algorithms that find the MAP solution and propagate the posterior probability of the object state over time are equivalent to the forward algorithm in the HMM literature. However, this approach may fail due to background clutter, occlusions and multi-object confusions.

Another type of tracking algorithm estimates the joint state-observation sequence distribution. The tracking result corresponds to the state sequence that maximizes the joint probability of the state sequence and the observation sequence. The state sequence indicates the trajectories of the multiple objects being tracked. We call this type of tracking method trajectory tracking. In a discrete HMM, the trajectory tracking problem can be solved using the Viterbi algorithm [33], a dynamic programming algorithm that keeps the best sequences ending at each possible state in the current frame.

A well-known early work in trajectory tracking is the multiple hypothesis tracking (MHT) algorithm developed by Reid [34]. In MHT, the multi-target tracking problem is decomposed into state estimation and data association components. The data association problem is NP-hard, though some work has been done to generate the k-best hypotheses in polynomial time [10]. The joint probabilistic data association filter (JPDAF) [13] finds the state estimate by evaluating the measurement-to-track association probabilities. Some methods model the data associations as random variables that are estimated jointly with the states by EM iterations [38, 14]. Most of these methods come from the small-target tracking community, where the object representation is simple.

There has been a great amount of work on multiple object visual tracking. MacCormick and Blake used a sampling algorithm for tracking a fixed number of objects [31]. Tao et al. presented an efficient hierarchical algorithm to approximate the inference process in a high dimensional space [40]. Isard and MacCormick proposed a Bayesian multiple-blob tracker and used a particle filter to perform the inference [23]. Hue et al. described an extension of the classical particle filter where the stochastic assignment vector is estimated by a Gibbs sampler [21]. These tracking algorithms are based on the marginal MAP framework.

Ideally, a tracking algorithm should pursue the best joint state-observation sequence instead of the marginal distribution of the current object state. However, when the state dynamics and likelihood are Gaussian, the joint state-observation distribution is also Gaussian. In this case, the MAP solutions of the joint distribution and the marginal distribution are identical.
Therefore, there is no advantage in using a trajectory tracking scheme. For general distributions, however, the marginal MAP tracker is no longer a good approximation to the trajectory tracker. This problem is more prominent for multi-object tracking. Another reason why trajectory tracking has not been widely used is that state variables are mostly continuous, such as positions, deformation parameters and appearance values. Even though these variables can be quantized, the number of all possible combinations is often prohibitively large. The situation becomes even worse when multiple objects are tracked jointly.

Another important component in a tracking scheme is the object representation, which models the characteristics of the object being tracked: the combined outcome of the object's shape, pose, motion, reflectance properties and illumination conditions. Over the years, steady progress has been made in developing various representations to facilitate effective and robust object tracking. Approaches include template-based methods [16, 17, 32], subspace-based methods [5, 9, 11, 15, 20, 30], layer-based methods [24, 25, 41, 47], image-statistics-based methods [8, 6, 7, 12, 4], feature-based methods [36, 39], contour-based methods [22, 26], and discriminative-feature-based methods [7, 2]. Our work in this respect is motivated by several recently published papers on the integration of object detection and tracking. The SVM tracker uses a detection algorithm to efficiently track vehicles [1, 43]. Several multiple people detection and tracking systems use image information such as aspect ratio [18], silhouette [19, 45], and human shape models [46, 35]. Verma et al. described a method that propagates face detection results over time [42].

The algorithm proposed in this paper is based on the trajectory tracking framework. The key to its success is a detection-based hypothesis generation process that limits the solution space in each frame to a small number of locations. In addition, multiple objects are considered jointly over time. This dramatically alleviates the problems caused by occlusions and background clutter that plague many existing tracking algorithms.

In this paper, we also introduce a novel observation model for efficient multiple object trajectory tracking. The observation is composed of the image itself, the foreground mask computed using a background model, and the object detection map generated by an object detector. The image provides the object appearance information for linking object detection results across frames. The foreground mask, which is generated using a Gaussian-mixture background model [37], enables the likelihood computation to consider a multi-object configuration in its entirety. The detection score map, which consists of pixel-wise object detection scores, provides cues on where the objects are. Any object detection method can be used to generate the detection map. In our implementation, we use a fine-tuned convolutional neural network for this purpose [29].

2. Multiple object trajectory tracking

2.1 The trajectory tracking framework

The multiple object trajectory tracking method uses a Hidden Markov Model to obtain the best joint probability of the state sequence and the observation sequence. Each object is identified by its index i, and its state at time t is represented by

    x_t^i = (p_t^i, v_t^i, a_t^i, s_t^i)    (1)

where p_t^i is the image location, v_t^i is the 2D image velocity, and a_t^i and s_t^i denote the appearance and scale of object i at time t, respectively. p_t^i and v_t^i are continuous image coordinates. We use the color histogram a_t^i to represent the object appearance, and the size of the bounding box within which the color histogram is computed to represent the scale s_t^i, as described in Section 2.2.

We describe all K objects in the image at time t using a multi-object configuration X_t = {x_t^i | i = 1, ..., K}, where K is the number of objects. If M is the maximum possible number of objects in an image, the configuration space is

    ∪_{K=0}^{M} X_t^K

representing all possible configurations with 0 to M objects, where X_t^K denotes the configuration space composed of K objects. For trajectory tracking, a configuration sequence that maximizes P(X_t, ..., X_0, Z_t, ..., Z_0) is pursued, where Z_0, ..., Z_t are the image observations in frames 0 to t. It should be noted that the number of objects in X_t may change over time.
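As a concrete illustration, the per-object state of Eq. (1) and a multi-object configuration can be written out directly. This is a minimal sketch with names of our own choosing (`ObjectState`, the field names), not the authors' implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectState:
    """State x_t^i of Eq. (1): position, velocity, appearance, scale."""
    p: Tuple[float, float]   # image location p_t^i (continuous coordinates)
    v: Tuple[float, float]   # 2D image velocity v_t^i
    a: List[float]           # color histogram a_t^i (m quantization bins)
    s: Tuple[int, int]       # bounding-box size s_t^i

# A configuration X_t is simply the list of active object states; its length K
# (the number of objects) may change from frame to frame, up to a maximum M.
Configuration = List[ObjectState]

x = ObjectState(p=(120.5, 88.0), v=(1.2, -0.3), a=[0.25] * 4, s=(32, 96))
X_t: Configuration = [x]
```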

The joint probability P(Z, X) of a given state sequence X = {X_t, ..., X_0} and an observation sequence Z = {Z_t, ..., Z_0} can be written as

    P(Z, X) = P(Z|X) P(X)    (2)

With the Markovian assumption, it can be expanded as

    P(Z, X) = P(X_0) P(Z_0|X_0) ∏_{i=1}^{t} P(Z_i|X_i) P(X_i|X_{i−1})    (3)
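In log space, Eq. (3) is just a sum of per-frame terms, which is how such scores are usually accumulated in practice. A minimal sketch (the function name and list-based interface are ours, not the paper's):

```python
def joint_log_prob(log_prior0, log_liks, log_dyns):
    """log P(Z, X) per Eq. (3): log P(X0) + log P(Z0|X0)
    + sum over i >= 1 of [log P(Zi|Xi) + log P(Xi|Xi-1)].
    log_liks[i] = log P(Zi|Xi) for i = 0..t; log_dyns[i-1] = log P(Xi|Xi-1)."""
    assert len(log_dyns) == len(log_liks) - 1
    return log_prior0 + log_liks[0] + sum(l + d for l, d in zip(log_liks[1:], log_dyns))

# a 3-frame sequence: prior, per-frame likelihoods, per-step dynamics
score = joint_log_prob(-0.5, [-1.0, -2.0, -1.5], [-0.2, -0.3])
```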

The configuration space of multiple object tracking has a very high dimensionality. The key to the success of trajectory tracking for multi-object configurations is to develop an efficient inference algorithm in the configuration space [40, 23, 44, 28]. In the multiple hypothesis tracking (MHT) algorithm [34, 10], this problem is solved for small targets by finding all possible combinations of current observations and existing trajectories within certain clusters. Inspired by this approach, in this paper, we propose a detection-driven inference algorithm that uses a detection module to generate object hypotheses and exploits image information to track object identities and handle multi-object interactions.
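For a fully discretized state space, the Viterbi recursion mentioned in the introduction can be sketched as below. This is a generic log-domain dynamic-programming implementation, not the paper's detection-driven variant:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence of a discrete HMM (log domain).
    log_init: (S,) initial log-probs; log_trans: (S, S) indexed [from, to];
    log_emit: (T, S) per-frame emission log-probs."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]         # best log-prob of paths ending at each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # extend each path to each next state
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # backtrack the best sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# two states, three frames: emissions favor state 0 early and state 1 at the end
best = viterbi(np.array([0.0, -10.0]),
               np.array([[-0.1, -2.3], [-2.3, -0.1]]),
               np.array([[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0]]))
```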

2.2 Observation

The observation Z_t at time t is composed of the original image frame Z_It, the foreground mask image Z_Mt derived using a background model, and the object detection map Z_Dt generated by an object detector, as shown in Figure 1. A Gaussian-mixture background model [37] is used to generate a binary foreground mask as shown in Figure 1(b). The black pixels represent the regions of the foreground objects. This image serves two purposes. First, it enables the likelihood computation to consider the multi-object configuration in its entirety: it gives a measure of how well a configuration, including the number of objects and their states, explains the foreground pixels, as described in Section 3.3.2. Second, we can restrict the object detector to search only in the foreground areas to reduce computation.

Figure 1: Image observations: (a) original image, (b) foreground mask image, where black pixels represent the area of the foreground objects, (c) human detection score map, where darker pixels represent higher detection scores.

The image provides the object appearance information that helps in tracking object identities. For each location in the image, the object detector provides a bounding box whose size corresponds to the scale given by the best detection score at that location. The object appearance at the location is represented by the color histogram calculated within the bounding box.

The detection map, shown in Figure 1(c), consists of pixel-wise object detection scores. Any object detection method can be used. In our implementation, we employ a convolutional neural network based object detection module [29] to detect pedestrians. As shown in Figure 1(c), the darker spots represent higher detection scores. The neural network searches over each pixel at several different scales, and the detection score corresponds to the best score among all scales. Intuitively, the detection score map converts the continuous object state space into a discrete one. The proposed algorithm generates object hypotheses only at local maxima in the detection score map, with the additional requirement that the detection scores at those locations are higher than an empirically pre-determined threshold.
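Hypothesis generation from the score map, i.e. keeping thresholded local maxima, can be sketched as follows (the 3×3 neighborhood and the threshold value are our assumptions, not stated in the paper):

```python
import numpy as np

def detection_hypotheses(score_map, thresh=0.5):
    """Object hypotheses: local maxima of the detection score map (3x3
    neighborhood) whose score also exceeds the threshold."""
    H, W = score_map.shape
    peaks = []
    for r in range(1, H - 1):
        for c in range(1, W - 1):
            v = score_map[r, c]
            # keep (r, c) only if it dominates its 3x3 neighborhood
            if v >= thresh and v == score_map[r - 1:r + 2, c - 1:c + 2].max():
                peaks.append((r, c))
    return peaks

m = np.zeros((5, 5))
m[2, 2] = 0.9   # a strong detection peak
m[1, 1] = 0.4   # a weak response below the threshold
hyps = detection_hypotheses(m, 0.5)
```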

Combining the three likelihood terms, P(Z_It|X_t), P(Z_Dt|X_t) and P(Z_Mt|X_t), the proposed multiple object trajectory tracking algorithm prefers tracks that connect object detections with high detection scores, have similar appearance across frames, and explain the foreground regions. These strong visual cues let the configuration sequences with the best joint state-observation probability survive temporary missing/false detections, occlusions and cluttered backgrounds.

3. The Algorithm

In this section we describe the four main components of the proposed tracking algorithm: hypothesis generation, multi-object dynamics, likelihood computation, and hypothesis management.

3.1 Hypothesis generation

A graph structure is maintained in the proposed multiple object tracking algorithm for each single multi-object trajectory, as shown in Figure 2. The graph nodes represent the object detection results. Each node is composed of the object detection probability, object size or scale, location and appearance representation. We use the color histogram in each object bounding box to represent the object appearance. The strength of each link in the graph is computed from position closeness, size similarity and appearance similarity between two nodes (detected objects). The graph is continuously extended across time (frame number) during tracking.

Given the object detection results from each image, the hypothesis generation step first calculates the connections between the maintained graph nodes and the nodes in the current image. The maintained nodes include the ending nodes of all the object tracks in a maintained hypothesis. They are not necessarily from the previous image since object detection may have missing detections.

Figure 2: Graph structure of a single multi-object trajectory over frames i to i+4. Nodes are human detection results and edges are associations between detection results; the diagram illustrates initialization, missing detection, false detection, split, merge, and disappearance events.

The hypothesis generation process handles object occlusions by node splitting and merging; that is, when an object appears from occlusion, the previous node splits into two object tracks, and when an object becomes occluded, the corresponding node is merged with the occluding node. The process deals with missing data naturally by skipping nodes in graph extensions; therefore, a connection is not necessarily built on two nodes from consecutive image frames. It handles false detections by ignoring some nodes. It initializes new object tracks for appearing nodes if their similarity scores to existing nodes are low and if they appear in regions with high prior probabilities of objects entering the scene, such as door areas. The process is shown in Figure 2.

It should be emphasized that the graph shown in Figure 2 is a single hypothesis representing a multi-object trajectory. The proposed multiple object tracking algorithm keeps many hypotheses using such spatio-temporal graphs. At each time instance, it expands these graphs and prunes trajectories in a balanced way to keep the hypotheses as diversified as possible, and delays the decision on the most likely hypothesis to a later time instance.
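A link strength combining position closeness, size similarity and appearance similarity might be scored as below. The Gaussian forms, the Bhattacharyya appearance distance, and all bandwidth parameters are illustrative assumptions on our part; the paper does not give the exact formula:

```python
import math

def link_strength(n1, n2, sigma_p=20.0, sigma_s=0.3, sigma_a=0.2):
    """Edge weight between two detection nodes; each node is a dict with
    'pos' (x, y), 'scale' (float), and 'hist' (normalized color histogram)."""
    dp2 = (n1["pos"][0] - n2["pos"][0]) ** 2 + (n1["pos"][1] - n2["pos"][1]) ** 2
    ds = math.log(n1["scale"] / n2["scale"])            # relative size change
    # Bhattacharyya coefficient between the two appearance histograms
    bc = sum(math.sqrt(a * b) for a, b in zip(n1["hist"], n2["hist"]))
    da = 1.0 - bc                                       # appearance distance
    return (math.exp(-dp2 / (2 * sigma_p ** 2))
            * math.exp(-(ds ** 2) / (2 * sigma_s ** 2))
            * math.exp(-da / sigma_a))

a = {"pos": (0.0, 0.0), "scale": 1.0, "hist": [0.5, 0.5]}
b = {"pos": (40.0, 0.0), "scale": 1.0, "hist": [0.5, 0.5]}
```

A node pair identical in all three cues scores 1.0, and the strength decays as any cue diverges.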

3.2 Multi-object dynamics

To compute the joint state-observation probability P(Z, X), we need to formulate the state sequence prior P(X), which is determined by the object dynamics. Let A(t) be the set of indices of the active objects, i.e., the objects appearing in the view, at time t. Assuming the motions of the objects are independent, P(X_t|X_{t−1}) can be factorized as

    P(X_t | X_{t−1}) ∝ τ_1 τ_2 τ_3    (4)

where

    τ_1 = ∏_{i ∈ A(t) ∩ A(t−1)} P(x_t^i | x_{t−1}^i)
    τ_2 = ∏_{i ∈ A(t) \ A(t−1)} P_a(p_t^i)
    τ_3 = ∏_{i ∈ A(t−1) \ A(t)} P_d(p_{t−1}^i)    (5)
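In log space, the factorization of Eqs. (4)-(5) reduces to sums over three index sets. A minimal sketch (the dict-based interface is our assumption; each map supplies the per-object log terms):

```python
def log_transition_prior(active_prev, active_cur,
                         log_p_state, log_p_appear, log_p_disappear):
    """log P(X_t | X_{t-1}) up to a constant: log tau_1 + log tau_2 + log tau_3.
    active_prev/active_cur are sets of object ids A(t-1) and A(t)."""
    tau1 = sum(log_p_state[i] for i in active_cur & active_prev)      # persisting
    tau2 = sum(log_p_appear[i] for i in active_cur - active_prev)     # appearing
    tau3 = sum(log_p_disappear[i] for i in active_prev - active_cur)  # disappearing
    return tau1 + tau2 + tau3

# object 1 persists, object 2 appears, object 3 disappears
lp = log_transition_prior({1, 3}, {1, 2},
                          log_p_state={1: -0.5},
                          log_p_appear={2: -2.0},
                          log_p_disappear={3: -1.0})
```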

The first term τ_1 represents the state transition dynamics for objects appearing in both frames t−1 and t,

    x_t^i = Φ x_{t−1}^i + Σ w_{t−1}    (6)

where Φ is the state transition matrix, Σ is the disturbance matrix and w_{t−1} is normal random noise with zero mean and covariance Q. For the object state x_t^i = (p_t^i, v_t^i, a_t^i, s_t^i), we employ a constant velocity model and assume that the object appearance and scale remain constant across frames; therefore, Φ is defined as

        | I_{2×2}  I_{2×2}  0        0 |
    Φ = | 0        I_{2×2}  0        0 |    (7)
        | 0        0        I_{m×m}  0 |
        | 0        0        0        1 |

where I_{2×2} and I_{m×m} denote the 2-by-2 and m-by-m identity matrices, and m is the number of quantization bins in the color histogram representation.
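The block structure of Eq. (7) can be built and applied directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def make_phi(m):
    """Constant-velocity transition matrix of Eq. (7) for state (p, v, a, s).
    p and v are 2-vectors, a has m histogram bins, s is a scalar."""
    n = 2 + 2 + m + 1
    phi = np.eye(n)
    phi[0:2, 2:4] = np.eye(2)   # p_t = p_{t-1} + v_{t-1}; v, a, s stay constant
    return phi

# propagate a state with m = 4 histogram bins: p=(10,20), v=(1,-2), s=3
x = np.concatenate([[10.0, 20.0], [1.0, -2.0], np.full(4, 0.25), [3.0]])
x_next = make_phi(4) @ x
```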

The second term τ_2 models object addition, or new track initialization, and the third term τ_3 models object deletion, or track ending. The function P_a gives the probability that a trajectory does, in fact, begin (appear) in the current frame, as a function of the object location p_t^i. The probability is higher around the edges of the field of view and the door areas than in other areas because people typically do not appear "out of nowhere" [40]. P_d is defined in a similar way. It should be noted that, strictly speaking, the dynamics of a group of objects are sometimes not independent. Completely modelling the interactions among objects may reduce the solution space and make the tracking algorithm more robust. In this paper, we limit ourselves to modelling the multiple object interactions explicitly only in the likelihood function.

3.3 Observation likelihood

The probability P(Z_t|X_t) describes how well the underlying state X_t of the system fits the observation Z_t. An object-based likelihood function is often computed as the matching score of the object representation with the image at the object location. Such a likelihood function does not explain the whole image. On the other hand, an image-based likelihood function explains every pixel in the image using the object state. The advantage of using the image-based likelihood is that when the tracker is trapped at a wrong location, such as background clutter, the likelihood is low because the true target cannot be explained by other objects. We propose a likelihood function that is composed of an object-based likelihood term for the original image, and image-based likelihood terms for the foreground mask and the detection map. More formally, the likelihood function is

    P(Z_t | X_t) = P(Z_It | X_t) P(Z_Mt | X_t) P(Z_Dt | X_t)    (8)

where P(Z_It|X_t) represents the image observation likelihood, P(Z_Mt|X_t) denotes the foreground mask likelihood, and P(Z_Dt|X_t) is the detection likelihood.

3.3.1 Image observation likelihood

We obtain the object size and appearance from the original image according to the object position p_t^i and the detection map Z_Dt at time t. The size I_s^i, which is a function of p_t^i and Z_Dt, is the scale with the best human detection score at the discrete position closest to p_t^i, because the human detector only works on pixel-wise coordinates. The appearance from the image I_a^i, which is a function of p_t^i and I_s^i, is computed within the bounding box of size I_s^i at p_t^i. Therefore,

    P(Z_It | X_t) = ∏_i P(I_a^i | a_t^i) P(I_s^i | s_t^i)    (9)

where P(I_a^i | a_t^i) and P(I_s^i | s_t^i) are Gaussian distributions N(a_t^i, Σ) and N(s_t^i, σ).
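One per-object term of Eq. (9) in log form might look like the sketch below, with the Gaussian normalization constants dropped; the squared-distance form and the bandwidths are our assumptions:

```python
import numpy as np

def log_image_likelihood(obs_hist, obs_size, state_hist, state_size,
                         sigma_a=0.1, sigma_s=5.0):
    """Per-object log-likelihood of Eq. (9) up to constants: Gaussian match of
    the observed histogram I_a and size I_s to the state's appearance and scale."""
    da = float(np.sum((np.asarray(obs_hist) - np.asarray(state_hist)) ** 2))
    ds = (obs_size - state_size) ** 2
    return -da / (2 * sigma_a ** 2) - ds / (2 * sigma_s ** 2)

h = [0.5, 0.5]
perfect = log_image_likelihood(h, 10.0, h, 10.0)      # exact appearance and size match
mismatch = log_image_likelihood([1.0, 0.0], 10.0, [0.0, 1.0], 10.0)
```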

3.3.2 Foreground mask likelihood

P(Z_Mt|X_t) is the likelihood of the foreground mask, which is composed of coverage and compactness terms, as in [40]. The main idea is to use the background subtraction result as the cue for detecting foreground regions. The areas without significant image changes are considered to be well explained by the background model. The likelihood is computed from the percentage of the covered foreground regions, or the coverage, and the efficiency of the multi-object coverage, or the compactness. It is formulated as

    P(Z_Mt | X_t) = e^{−(l_cov + l_comp)}    (10)

where

    l_cov = log( (|A ∩ (∪_{j=1}^m B_j)| + c) / (|A| + c) )
    l_comp = log( (|A ∩ (∪_{j=1}^m B_j)| + c) / (∑_{j=1}^m |B_j| + c) )    (11)

l_cov calculates the hypothesis coverage of the foreground pixels and l_comp measures the hypothesis compactness. A denotes all foreground pixels and B_j represents the pixels covered by the jth object; m is the number of different nodes in this hypothesis; ∩ denotes set intersection and ∪ denotes set union. The numerators in both l_cov and l_comp represent the foreground pixels covered by the combination of multiple objects in the current hypothesis. Therefore, l_cov represents the foreground coverage of the hypothesis: the higher the value, the larger the coverage. l_comp measures how much the nodes overlap with each other: the larger the value, the less the overlap and the more compact the hypothesis. c is a constant for ensuring numerical stability. These two values give a spatially global explanation of the image (foreground) information; they measure the combined effect of multiple tracks in a hypothesis. This computation is similar to the image likelihood computation in [40]. Intuitively, P(Z_Mt|X_t) prefers multi-object configurations that cover all foreground regions with minimal overlapping areas among objects.
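The two terms of Eq. (11) can be computed directly on pixel sets. This sketch represents the foreground A and each object's region B_j as Python sets of (row, col) pixels, an assumption of ours for illustration:

```python
import math

def mask_log_terms(fg, boxes, c=1.0):
    """l_cov and l_comp of Eq. (11); fg is the set of foreground pixels A,
    boxes is a list of pixel sets B_j covered by each object in the hypothesis."""
    union = set().union(*boxes) if boxes else set()
    covered = fg & union                                  # A ∩ (∪_j B_j)
    l_cov = math.log((len(covered) + c) / (len(fg) + c))
    l_comp = math.log((len(covered) + c) / (sum(len(b) for b in boxes) + c))
    return l_cov, l_comp

fg = {(0, 0), (0, 1), (1, 0), (1, 1)}
# two non-overlapping boxes that exactly tile the foreground
full = mask_log_terms(fg, [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}])
# a single box covering only half the foreground
partial = mask_log_terms(fg, [{(0, 0), (0, 1)}])
```

With perfect coverage and no overlap both terms reach their maximum value of 0, matching the intuition stated above.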

3.3.3 Detection likelihood

The object detector calculates the score of object presence at location j, where j is a discrete coordinate. We assume the following noisy detection process. An object at position p is detected at location j with probability f(j|p). A single object can cause detection responses at multiple locations, and the detection response at a single location can be caused by multiple objects. The response of object detection at location j depends on a binary random variable H_j indicating whether j is a response caused by some object(s). Furthermore, we assume that the detection responses Z_Dt^j at different locations are independent. We have

    P(Z_Dt | X_t) ∝ ∏_j P(Z_Dt^j | X_t) = ∏_j ∑_{H_j} P(Z_Dt^j | H_j) P(H_j | X_t)    (12)

and

    P(H_j = 0 | X_t) = ∏_i (1 − f(j | p_t^i)),    P(H_j = 1 | X_t) = 1 − P(H_j = 0 | X_t)    (13)
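Eq. (13) in code: the probability that location j fires is one minus the probability that no object triggers it. A minimal sketch, where `f` stands in for the detection kernel f(j|p):

```python
def p_response(positions, f, j):
    """Eq. (13): P(H_j = 1 | X_t) = 1 - prod_i (1 - f(j | p_i))."""
    p_none = 1.0
    for p in positions:
        p_none *= 1.0 - f(j, p)     # no object i triggers location j
    return 1.0 - p_none

# two objects, each triggering location j with probability 0.5
prob = p_response([(0, 0), (1, 1)], lambda j, p: 0.5, (0, 0))
```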

The distributions P(Z_Dt^j|H_j) and f(j|p_t^i) encode the prior knowledge about how the observation Z_Dt^j relates to the internal state of the system. In our implementation, P(Z_Dt^j|H_j) is modelled by two 1D Gaussian distributions N(d_i, σ), where d_1 = 0.7 and d_2 = 0.1. f(j|p_t^i) is modelled as a 2D Gaussian distribution centered at the object location.

To make the object detection results more temporally coherent, we also linearly combine the raw detection result of the previous frame with that of the current frame to form Z_Dt. More specifically, the object detection score map of the previous frame, S_Dt−1, is warped to the current frame using backward warping; the warped map is denoted S′_Dt−1. Backward optical flow is computed in the foreground region of the current frame for this purpose. Then the detection scores S′_Dt−1 and S_Dt are combined according to the following formula to form Z_Dt:

    Z_Dt = α S′_Dt−1 + (1 − α) S_Dt    (14)

where α ∈ [0, 1] is a weighting parameter. We used α = 0.3 in our implementation. It should be pointed out that an additional benefit of the proposed trajectory tracking method is that it provides feedback to the object detection module to improve the object detection performance. According to the best multi-object hypotheses, the tracking module predicts the most likely locations at which to detect objects. This interaction tightly integrates object detection and tracking, making both components much more reliable.
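Eq. (14) itself is a simple per-pixel blend of the flow-warped previous score map with the current one:

```python
import numpy as np

def blend_scores(warped_prev, cur, alpha=0.3):
    """Z_Dt = alpha * S'_{Dt-1} + (1 - alpha) * S_Dt, Eq. (14), with alpha = 0.3."""
    return alpha * warped_prev + (1.0 - alpha) * cur

z = blend_scores(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```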

3.4 Hypothesis management

The design of the proposed multiple object tracking system follows two principles:

1. We keep as many multi-object trajectory hypotheses as possible and make them as diversified as possible, to represent all the possible explanations of the image sequences. The decision is made very late to guarantee that it is an informed and global decision.

2. We perform local pruning to remove unlikely connections and keep only a manageable number of hypotheses. This principle helps the system achieve real-time performance.

The graph structure is used to generate and manage multiple hypotheses and to perform reasonable pruning for both reliable performance and efficient computation. The hypothesis management module performs local pruning using a likelihood threshold θ. In addition, the algorithm maintains at most N multi-object trajectory hypotheses during inference. We also enforce a maximum number M of object tracks that are active in each multi-object trajectory hypothesis. With reasonable choices of these thresholds, we can achieve real-time performance in a not-too-crowded environment. A thresholding scheme certainly has the risk of eliminating correct sequences, especially when the measurements are ambiguous. Our empirical results show that our pruning technique gives optimal or close-to-optimal tracking results given reasonable performance of the background modelling and object detection.

The proposed tracking algorithm shares a similar idea of data association with [34, 27]. It is designed to explore many possible associations between target detections. However, the previous algorithms are not image-based tracking methods: the only information they consider is target locations. Their association does not depend on object appearance, which is one of the key elements our algorithm uses to accomplish object identification. The previous methods also lack a scheme to integrate image-based likelihood computation.

The proposed hypothesis management algorithm is summarized as follows:

• Hypothesis expansion: for n = 1 to N, expand Hypothesis(n) according to the current observations; if the number of object tracks in a new hypothesis exceeds M, the new hypothesis is pruned; if P(Z, X) of a new hypothesis falls below θ, the new hypothesis is pruned.

• Hypothesis removal: check every object track in each multi-object trajectory hypothesis; if a track has had no new expansion for T seconds, it is deleted from the hypothesis. Check every multi-object hypothesis; if it has no object track left, the hypothesis is pruned.

• Hypothesis sorting: keep the top N hypotheses according to their probabilities P(Z, X), with the constraint that at least R hypotheses are kept for each number of object tracks. The number of object tracks in each multi-object trajectory hypothesis can range from 1 to M, so we maintain at least R 1-track hypotheses, R 2-track hypotheses, and so on. Clearly N ≥ MR.

In our implementation, the number of trajectory hypotheses N ranges from 20 to 30, the maximum number of object tracks in each hypothesis is M = 6, and R = 2. An object track expires after T = 20 seconds. The computational complexity of the proposed algorithm lies mainly in hypothesis generation and likelihood computation. In theory, the number of new hypotheses created from each old hypothesis grows exponentially with the number of objects. However, with some simple pruning in our implementation, we reduce this to around 20 new hypotheses for each 6-object configuration. These hypotheses are further pruned at a later stage.
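The sorting step above, keeping the top N hypotheses while guaranteeing at least R survivors per track count, can be sketched as follows (the dict-based hypothesis records are an illustrative assumption):

```python
def prune_hypotheses(hyps, N, R):
    """Keep the top-N hypotheses by log-probability while retaining at least
    R hypotheses for each distinct number of object tracks (diversity constraint)."""
    ranked = sorted(hyps, key=lambda h: h["logp"], reverse=True)
    by_count = {}
    for h in ranked:
        by_count.setdefault(h["tracks"], []).append(h)
    keep = []
    for hs in by_count.values():
        keep.extend(hs[:R])                     # guaranteed per-count survivors
    rest = [h for h in ranked if h not in keep]
    keep.extend(rest[: max(0, N - len(keep))])  # fill remaining slots by score
    return sorted(keep, key=lambda h: h["logp"], reverse=True)[:N]

hyps = [
    {"tracks": 1, "logp": -1.0},
    {"tracks": 1, "logp": -2.0},
    {"tracks": 1, "logp": -3.0},
    {"tracks": 2, "logp": -5.0},
]
top = prune_hypotheses(hyps, N=3, R=1)
```

Note how the weak 2-track hypothesis survives ahead of a stronger 1-track one: the diversity constraint deliberately delays committing to a single object count.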

4. Experiments

4.1 A multiple pedestrian tracking system

We implemented and tested the multiple object tracking algorithm in a complete video surveillance system, which is depicted in Figure 3. The system works with generic fixed cameras and has been tested extensively on CCTV cameras installed to monitor entries with access controls. The two test environments are semi-outdoor. Since object detection is included in the observation sequence, the requirement for accurate background modelling is relaxed, and the method can potentially be applied to a non-stationary camera environment. The adaptive background modelling module deals with changing illumination and does not require objects to be constantly moving. An object detection module takes the foreground pixels generated by background modelling as input and outputs the probabilities of object detection. It searches over the foreground pixels and gives, for each location, the probability that an object is present there. Any object detection approach can be fitted into this part to provide initial, local detection results. In our system, we use a convolutional neural network to detect humans in each image. The output of this module is the probability of each pixel being the center of a human. These probabilities undergo non-maximum suppression to keep only the locations with locally maximal probabilities. The detection results are then sent to the proposed multi-object trajectory tracking algorithm. Based on the tracking results, the alert reasoning module detects violations automatically according to predefined rules.

4.2 Tracking results

In this section, we describe our experimental results of multiple object trajectory tracking. The first example demonstrates the importance of trajectory tracking, especially when the human detection module makes mistakes. The second and third examples show how the observations, i.e., the images, the foreground masks and the detection maps, jointly help to achieve good tracking results.

Figure 3: Diagram of the proposed video surveillance system. Camera frames are digitized and passed through background subtraction; the resulting foreground blobs feed CNN human detection, whose per-pixel detection probabilities are combined with optical flow projection to form the detection observations. Hypothesis generation, likelihood computation and hypothesis management produce ranked hypotheses; the top hypothesis drives alert reasoning (emitting an alert code) and provides feedback to the detection module.

The first test sequence is taken inside the secure area of an access control door. Three persons are involved. Person A first goes up to the door and opens the secure door from inside for persons B and C to come in. Then the three of them continue to walk. Figure 4(a) shows five images from the sequence with overlaid bounding boxes indicating the human detection results. A darker bounding box indicates a higher detection score, and the size of the bounding box represents the scale with the best detection score. Figure 4(b) shows the multi-track hypothesis with the largest joint state-observation sequence probability generated by the multiple object trajectory tracking algorithm. The tracks are overlaid on the detection score map; different intensities represent different tracks. Due to lighting changes and the semi-outdoor environment, the foreground mask is sometimes incomplete and/or has false masks on background clutter. The human detection based on each single image is certainly not perfect either. There are false detections in the fourth and fifth images caused by background noise and people interactions. In the third image, the human detector misses the person in the back due to occlusions and, in the fifth image, it misses the person in the front due to distortions. However, the trajectory tracking algorithm manages to maintain the correct number of tracks and their configurations, as shown in Figure 4(b), because it searches for the best explanation

of the observations over time. The fifth image shows three tracks, the turning track at the bottom (the darkest one) corresponds to person A who walks to the door and comes back, the track in the middle corresponds to the person wearing a hat and the short track on the top (the lightest one) represents the last person coming in. The first four images in Figure 4(b) show the corresponding multi-track or multi-object configurations at the moments when the images in (a) are taken. Therefore, we know when and where each track begins and how these tracks interact. Since we are performing trajectory tracking, many other less optimal configuration sequences are also maintained. We show one of them in Figure 4(c) which has a lower joint probability at the end of the sequence. This sequence starts with one object track, as shown in the first two images, which is quite similar to the optimal solution. Therefore, the joint probability is also very high up to the moment when more people appear. In the third image, the second person appears and the configuration chooses to maintain one track for both detections. The detection-based observation likelihood, at this moment, is lower than the optimal one in Figure 4(b) where a new track is initialized for the person who just walks in. In the forth and fifth images, the configuration only keeps two tracks to interpret the observations. Due to occlusions among the three people, there have been quite a few missing detections as well as false detections. Therefore, this configuration sequence generates two jumpy tracks which are switching between different people to cover the shaky detections. However, the foreground mask likelihood and the image likelihood, which is based on appearance and size matching, are both lower than the optimal one. Figure 5 demonstrates an example of multiple people tracking at another access control area. This camera is mounted in a non-secure area where the card swiping machine can be seen in the images. 
The example shows that the lady first opens the door for the person in the gray shirt, and then the person in the dark shirt follows and goes into the area. Figure 5(a) shows the images from the sequence and (b) shows the tracking results. Since people stop and talk in the sequence, the state dynamics cannot guarantee that the correct track identities are maintained when people cross each other. The image likelihood, composed of appearance and size, however, is able to maintain the track identities, as shown in Figure 5(b). Interestingly, there is one short track close to the upper-left corner of the images because one person is standing inside the secure area and the human detector consistently detects him through the glass window. Therefore, four tracks are shown in Figure 5(b): the short track for the standing person, the long track for the lady, the dark track for the person in the gray shirt, and the light track for the person in the dark shirt.

The last example demonstrates that object detection not only improves computational efficiency, but also makes the tracking system insensitive to irrelevant objects. Figure 6 shows a cleaning lady walking in through the door, loitering around a plant and then walking away. It is difficult to obtain the outline of the person from the foreground mask by contour matching [19, 46] or aspect ratio [18] because of occlusions and the presence of other large objects, such as the trash can in this example. Even the foreground mask likelihood computation can be fooled, because the foreground pixels of the large object cannot be covered by the human hypothesis. However, the human detector provides the information needed to separate the person from irrelevant objects. Our trajectory tracking takes advantage of the convolutional neural network human detector, so it identifies the human form and maintains one clean track.

In these experiments, we assume that the maximum number of people in the scene is 6 and keep the 30 best configurations. With these parameter settings, we achieve real-time performance of 10 fps.
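The N-best search over configuration sequences used in these experiments can be viewed, in simplified form, as a beam search that keeps the best multi-track hypotheses per frame and scores each by an approximate joint state-observation log-probability. The sketch below is illustrative only, not the paper's implementation: each detection carries a single log-likelihood score standing in for the combined image, foreground-mask and detection-map likelihoods, and the penalty constants are assumed values, not learned parameters.

```python
import itertools

def beam_track(frames, beam_width=30, new_track_logp=-2.0, miss_logp=-3.0):
    """Beam search over multi-object configuration sequences (hypotheses).

    frames: list of frames; each frame is a list of detections, where a
            detection is {'pos': (x, y), 'score': log-likelihood}.
    Returns the best hypothesis as (log_prob, per-frame track positions, #tracks).
    """
    beam = [(0.0, [], 0)]  # (joint log-prob, [{track_id: pos} per frame], tracks so far)
    for dets in frames:
        candidates = []
        for logp, assign, n_tracks in beam:
            prev = assign[-1] if assign else {}
            # Each detection either extends an existing track or starts a new one.
            for labels in itertools.product(list(prev) + [None], repeat=len(dets)):
                used = [l for l in labels if l is not None]
                if len(used) != len(set(used)):
                    continue  # a track explains at most one detection per frame
                lp, nt, step = logp, n_tracks, dict(prev)
                for det, lab in zip(dets, labels):
                    if lab is None:                   # start a new track
                        nt += 1
                        lab = nt
                        lp += new_track_logp
                    else:                             # motion-continuity term
                        dx = det['pos'][0] - prev[lab][0]
                        dy = det['pos'][1] - prev[lab][1]
                        lp += -0.1 * (dx * dx + dy * dy)
                    lp += det['score']
                    step[lab] = det['pos']
                lp += miss_logp * (len(prev) - len(used))  # unmatched tracks
                candidates.append((lp, assign + [step], nt))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_width]
    return beam[0]
```

With a single detection drifting slightly across two frames, the highest-scoring hypothesis extends one track rather than paying the new-track and miss penalties, which mirrors how less optimal hypotheses (such as the two-track sequence in Figure 4(c)) fall behind in joint probability.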

4.3. Anomaly detection

The trajectory tracking results provide information accumulated over image frames, including the number of objects, their motion and interaction history, and the timing of their behaviors. In practice, such information can be best extracted by a vision system. For example, it is possible to set up a sensor to indicate the opening and closing of the door; however, without a camera system, there is no easy way to confirm whether, and how many, objects (people, vehicles, animals, luggage) actually move through the door. In essence, the multiple object visual tracking system provides sufficient information to interpret image sequences in terms of WHO (how many), WHEN, WHERE and WHAT. Based on the number of objects and the corresponding multiple tracks, we can analyze the behaviors of the objects. The goal is to detect abnormal behaviors related to object counting, interactions, motion and timing. Since objects are tracked constantly, the system can automatically analyze and report traffic information: how many objects enter or leave through the door and loiter or stay at sensitive areas at a given time.

We have tested the trajectory tracking system on detecting tailgating and piggy-backing violations at access points. If more than one person goes into the secure area with only one card swipe, it is a tailgating violation. If the person going into the secure area is not the person who just swiped the card, it is piggy-backing. We also define rules to detect suspicious behaviors, such as multiple swipes by one person, and people loitering at the door or swiping areas. It is straightforward to integrate the signals from card readers, door-opening sensors and other sensors into the system. As mentioned above, even with these signals we still need images and robust tracking to verify whether people are at the door or swiping areas, when they appear and disappear, and how they have been moving. In cases where we do not have door-opening or card-swiping information, we can obtain the related information from the images.

The system has been running at a few access control areas, where only visual information is available, for more than six months. On average, the system achieves 98.3% precision in event classification. The violation detection rate is 90.4% and the detection precision is 85.2%. Table 1 lists a subset of the tests, showing the results of one month's data at two access doors. On this subset, the system achieved an overall event classification precision of 99.2%, and the violation recall and precision rates are 93.9% and 89.5%, respectively. The results satisfy the requirements of a real system.

                 videos   events   violations   detected violations   false alarms
access door 1       772        -           37                    35              4
access door 2       943        -           45                    42              5

Table 1: Recall and precision of violation detection on one month's real videos for two access control doors.
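The tailgating and piggy-backing rules above reduce to simple checks on the tracker's output once entries and swipes are associated with a door-opening event. The function below is a hedged sketch: the event representation (swipe count, the track ids crossing the entry line, and the track id observed at the card reader) is a hypothetical interface to the tracker, not the system's actual one. The aggregate figures at the end simply recompute the Table 1 recall and precision.

```python
def classify_entry(num_swipes, entering_track_ids, swiper_track_id=None):
    """Classify one door-opening event from trajectory tracking output."""
    n = len(entering_track_ids)
    if num_swipes >= 1 and n > num_swipes:
        return 'tailgating'      # more people entered than cards were swiped
    if (swiper_track_id is not None and n >= 1
            and swiper_track_id not in entering_track_ids):
        return 'piggy-backing'   # the entering person is not the swiper
    return 'normal'

# Aggregate Table 1 figures over both access doors.
violations = 37 + 45                              # 82 true violations
detected = 35 + 42                                # 77 detected
false_alarms = 4 + 5                              # 9 false alarms
recall = detected / violations                    # 77/82, about 93.9%
precision = detected / (detected + false_alarms)  # 77/86, about 89.5%
```

The same event-level interface extends naturally to the other rules mentioned above (multiple swipes by one person, loitering), which are checks on swipe counts and track dwell times rather than on entry counts.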

5. Conclusion

In this paper we have described a multiple object trajectory tracking method. We use a Hidden Markov Model as the probabilistic framework and maximize the joint probability of the state sequence and the observation sequence. The trajectory tracking method can therefore deal with temporal difficulties caused by cluttered backgrounds, multi-object interactions and occlusions. We use an observation model composed of the original image, the foreground mask from background subtraction and the object detection score map generated by an object detector. Neither the background modelling nor the object detection is required to be accurate: both are used to make image-based decisions, while the trajectory tracking makes a global decision on the number of objects and their configurations. We present a detection-driven inference scheme that performs reliable and efficient multi-object tracking without requiring the number of objects to be known or fixed.

The configuration space of multiple object trajectory tracking has a high dimensionality, and much work remains to be done to design an efficient inference algorithm. Learning the model parameters is another challenging task. Future work includes the improvement of the current inference algorithm and the development of learning algorithms for the HMM described in this paper.

References

[1] S. Avidan. Support vector tracking. PAMI, 26(8):1064–1072, August 2004.
[2] S. Avidan. Ensemble tracking. In CVPR05, pages II: 494–501, 2005.
[3] Y. Bar-Shalom and X. Li. Multitarget Multisensor Tracking: Principles and Techniques. YBS Publishing, 1995.
[4] S. Birchfield and R. Sriram. Spatiograms versus histograms for region-based tracking. In CVPR05, pages II: 1158–1163, 2005.
[5] M. Black and A. Jepson. EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. In ECCV96, pages II: 329–342, 1996.
[6] R. Collins. Mean-shift blob tracking through scale space. In CVPR03, pages 234–240, 2003.
[7] R. Collins, Y. Liu, and M. Leordeanu. On-line selection of discriminative tracking features. PAMI, 27(10):1631–1643, October 2005.
[8] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. PAMI, 25(5):564–577, May 2003.
[9] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. PAMI, 23(6):681–685, June 2001.
[10] I. Cox and S. Hingorani. An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. PAMI, 18(2):138–150, February 1996.
[11] F. De la Torre and M. Black. A framework for robust subspace learning. IJCV, 54(1-3):117–142, August-September 2003.
[12] Z. Fan and Y. Wu. Multiple collaborative kernel tracking. In CVPR05, pages II: 502–509, 2005.
[13] T. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering, OE-8:173–184, July 1983.
[14] H. Gauvrit and J. Le Cadre. A formulation of multitarget tracking as an incomplete data problem. IEEE Trans. on Aerospace and Electronic Systems, 33(4):1242–1257, October 1997.
[15] G. Hager and P. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. PAMI, 20(10):1025–1039, October 1998.
[16] G. Hager, M. Dewan, and C. Stewart. Multiple kernel tracking with SSD. In CVPR04, pages I: 790–797, 2004.
[17] B. Han and L. Davis. On-line density-based appearance modeling for object tracking. In ICCV05, pages II: 1492–1499, 2005.
[18] I. Haritaoglu, D. Harwood, and L. Davis. W4S: A real-time system for detecting and tracking people in 2 1/2-D. In ECCV98, 1998.
[19] I. Haritaoglu, D. Harwood, and L. Davis. Hydra: Multiple people detection and tracking using silhouettes. In IEEE Workshop on Visual Surveillance, 1999.
[20] J. Ho, K. Lee, M. Yang, and D. Kriegman. Visual tracking using learned subspaces. In CVPR04, pages I: 782–789, 2004.
[21] C. Hue, J. Le Cadre, and P. Perez. Tracking multiple objects with particle filtering. IEEE Trans. on Aerospace and Electronic Systems, 38(3):791–812, July 2002.
[22] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. IJCV, 29(1):5–28, 1998.
[23] M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In ICCV01, pages II: 34–41, 2001.
[24] A. Jepson, D. Fleet, and T. El-Maraghi. Robust online appearance models for visual tracking. In CVPR01, pages I: 415–422, 2001.
[25] N. Jojic, N. Petrovic, B. Frey, and T. Huang. Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences. In CVPR00, pages II: 26–33, 2000.
[26] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
[27] T. Kirubarajan, Y. Bar-Shalom, and K. Pattipati. Multiassignment for tracking a large number of overlapping objects. IEEE Trans. on Aerospace and Electronic Systems, 37(1), January 2001.
[28] O. Lanz and R. Manduchi. Hybrid joint-separable multibody tracking. In CVPR05, 2005.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[30] K. Lee and D. Kriegman. Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In CVPR05, pages I: 852–859, 2005.
[31] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. In ICCV99, pages 572–578, 1999.
[32] I. Matthews, T. Ishikawa, and S. Baker. The template update problem. PAMI, 26(6):810–815, June 2004.
[33] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, February 1989.
[34] D. Reid. An algorithm for tracking multiple targets. IEEE Trans. on Automatic Control, 24(6):843–854, December 1979.
[35] J. Rittscher, P. Tu, and N. Krahnstoever. Simultaneous estimation of segmentation and shape. In CVPR05, pages II: 486–493, 2005.
[36] J. Shi and C. Tomasi. Good features to track. In CVPR94, pages 593–600, 1994.
[37] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. PAMI, 22(8):747–757, August 2000.
[38] R. Streit and T. Luginbuhl. Maximum likelihood method for probabilistic multi-hypothesis tracking. In Proceedings of SPIE International Symposium, Signal and Data Processing of Small Targets, 1994.
[39] F. Tang and H. Tao. Object tracking with dynamic feature graphs. In VS-PETS05, pages 25–32, 2005.
[40] H. Tao, H. Sawhney, and R. Kumar. A sampling algorithm for tracking multiple objects. In International Workshop on Vision Algorithms, September 1999.
[41] H. Tao, H. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. PAMI, 24(1):75–89, January 2002.
[42] R. Verma, C. Schmid, and K. Mikolajczyk. Face detection and tracking in a video by propagating detection probabilities. PAMI, 25(10):1215–1228, October 2003.
[43] O. Williams, A. Blake, and R. Cipolla. A sparse probabilistic learning algorithm for real-time tracking. In ICCV03, pages 353–360, 2003.
[44] T. Yu and Y. Wu. Collaborative tracking of multiple targets. In CVPR04, pages I: 834–841, 2004.
[45] L. Zhao and L. Davis. Closely coupled object detection and segmentation. In ICCV05, pages I: 454–461, 2005.
[46] T. Zhao, R. Nevatia, and F. Lv. Segmentation and tracking of multiple humans in complex situations. In CVPR01, pages II: 194–201, 2001.
[47] Y. Zhou and H. Tao. A background layer model for object tracking through occlusion. In ICCV03, 2003.

Figure 4: Trajectory tracking results with missing/false human detections. Left: original images with overlaid bounding boxes showing the human detection results; Middle: the tracking results with the best joint state-observation probability (three tracks are included); Right: the tracking results with a lower joint probability (only two tracks are maintained).

Figure 5: Trajectory tracking results of crossing tracks: (a) original images with overlaid bounding boxes showing the human detection results, (b) multiple object tracking results with correct track identities.

Figure 6: Trajectory tracking results with large objects: (a) original images with overlaid bounding boxes showing the human detection results, (b) multiple object tracking results showing a track which involves only the person.
