A Multi-Camera Multi-Target Tracker based on Factor Graphs
Francesco Castaldo and Francesco A.N. Palmieri
Dipartimento di Ingegneria Industriale e dell'Informazione
Seconda Università degli Studi di Napoli, Aversa (CE), 81031 Italy
Email: [email protected] - [email protected]
Abstract—System modeling with Probabilistic Graphical Models (PGM) has become increasingly popular in recent years. In this paper we design a Multiple Target Tracker based on the probabilistic architecture of Normal Factor Graphs. Belief propagation makes best use of data coming from the different branches of the graph and yields the tracks via message fusion. The issues of data association, track life-cycle management and data fusion from heterogeneous sensor modalities are resolved at each time step by propagating and combining forward and backward probabilistic messages. Inexpensive cameras deployed in the scene under surveillance are the primary sensor modality, although the framework has been designed to receive data from a wide range of sensors such as radars and infrared cameras. The framework has been tested by calculating the tracks of different ships moving in a harbour framed by three cameras.
I. INTRODUCTION

Multiple target tracking [1] is a very challenging problem. Tracked objects appear and disappear from the area under surveillance, sensors yield unreliable and unsynchronized estimates, and even when the latter are correct there is still the issue of matching each estimate to one of the targets in the scene (the so-called data association problem). If no match is found, we could infer that a new target has been detected, but we have to discriminate between real targets and clutter (ocean waves for harbours, foliage movements on streets, etc.). Different policies have to be scheduled in order to efficiently start, maintain and cancel the tracks. In recent years there has been great interest in robust target tracking [2]. Along with classical approaches, very appealing recent solutions involve the use of Bayesian probabilistic models [3] [4], or frameworks based on Gaussian mixtures [5] [6] in which the number of targets becomes another parameter to estimate. Recent target tracking methods share the basic principle of fusing data [7] [8] coming from different and heterogeneous sensors. All these trackers, however, are severely put to the test when asked to produce realistic tracks in real applications and not just on simulated data. As a matter of fact, yielding stable and realistic tracks is crucial to develop a system capable of using the track data to recognize abnormal behaviors and situations in complex areas. For this reason we propose a framework based on Forney-style Factor Graphs (FFGs) [9] in which track birth, death and management are naturally addressed within a modular system in which each part of the process is wired into a different subsystem.
More specifically, the flexibility of FFGs allows the designer to divide the algorithm into blocks of different nature (linear/non-linear functions, pieces of code, etc.) and to re-use information from other blocks to enhance track quality. All the user has to do is coherently define the block interfaces and the types of (forward and backward) messages; belief propagation and combination does the rest. More generally, factor graphs seem to provide a simple but rigorous way to resolve composite problems, among which we focus here on multiple target tracking. Furthermore, the object-based structure of factor graphs can be easily reproduced in software, using object-oriented programming languages, or realized in hardware, by wiring the probabilistic network into application-specific circuits such as FPGAs. The system presented in this paper fuses information from previous estimation steps and from the sensors. Here the reference sensors are visual cameras [10] [11] [12], but, given the modular nature of the framework, other sensors such as radars and infrared cameras could be integrated without much effort. We leave this extension to future work. Information can also flow towards the sensors to increase their accuracy, or to ease their work in locating objects in the image. This paper can be considered an extension of [13], in which single-object tracking with FFGs is achieved. The paper is structured as follows. In Section II the factor graph underlying the framework is presented and the blocks composing the system are analyzed. Section III shows a preliminary test of multiple target tracking using real data coming from a harbour. Section IV draws conclusions and suggests future developments.

II. THE FACTOR GRAPH

In this section we present the general architecture of our framework, depicted in Figure 1 for a discrete time step $k \in \mathbb{N}$. We denote the state of the system as $s_k = \{s_k^1, s_k^2, \ldots, s_k^O\}$, where $O$ is the number of tracks and $s_k^i = (X_k^i, Y_k^i, \dot{X}_k^i, \dot{Y}_k^i)^T$ is the $i$-th object state at time $k$, with $(X_k^i, Y_k^i)$ positional coordinates and $(\dot{X}_k^i, \dot{Y}_k^i)$ speed variables. The system is synchronous and at each discrete time step $k$ all the information coming from the dynamic model and the vision systems is embedded into a unique pipeline. The great feature of factor graphs [14], especially in their Forney-style normal form [9] [15], is that, through the use of diverter blocks, new information can be easily added without having to re-work the rest of the system.
Fig. 1. The FFG architecture for time $k$. The estimates from the visual sensors $C^1, \ldots, C^N$ are fused with the tracks updated by the target dynamic model.

Fig. 2. The model block, based on the nearly-constant-velocity dynamic model.
In a factor graph, edges represent variables and nodes represent factors. There is a direction for each branch that allows unambiguous definition of forward ($f$) and backward ($b$) messages. Each factor block describes the conditional probability function that maps the input into the output variable. We assume that all messages and state variables are Gaussian pdfs, i.e. fully described by a mean vector and a covariance matrix. The means typically represent our best estimates of the variables and the covariances the multidimensional regions of confidence. Forward and backward messages for a multi-dimensional variable $s^i$ (we omit here the $k$ subscript for readability) are denoted with mean and covariance as $f_{s^i} = \{m_{f_{s^i}}, \Sigma_{f_{s^i}}\}$ and $b_{s^i} = \{m_{b_{s^i}}, \Sigma_{b_{s^i}}\}$, respectively. For a multi-object variable $s$ we have a set of means and covariances $f_s = \{f_{s^1}, f_{s^2}, \ldots, f_{s^O}\}$ and $b_s = \{b_{s^1}, b_{s^2}, \ldots, b_{s^O}\}$. In single target tracking [13], $f_{s^i}$ and $b_{s^i}$ at each edge can be combined with the product rule [9] to obtain the Gaussian distribution of the variable defining that edge, $p_{s^i} \propto f_{s^i} b_{s^i}$, with mean $m_{s^i} = \Sigma_{s^i}(\Sigma_{f_{s^i}}^{-1} m_{f_{s^i}} + \Sigma_{b_{s^i}}^{-1} m_{b_{s^i}})$ and covariance $\Sigma_{s^i} = (\Sigma_{f_{s^i}}^{-1} + \Sigma_{b_{s^i}}^{-1})^{-1}$. For the multi-object variable $s$ we need to combine each forward message $f_{s^i}$ with the backward message $b_{s^i}$ that is most likely to represent the same target, otherwise the fusion generates redundant and wrong tracks. By checking the distances between the mean of the forward message, $m_{f_{s^i}}$, and all the means of the backward messages, $m_{b_{s^i}}$, we combine each $f_{s^i}$ with its nearest neighbor in the backward direction.
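As a minimal MATLAB-style sketch of this step, the fragment below pairs each forward message with its nearest backward message and applies the product rule; the struct layout, the numeric values and the comparison on the full state mean are our own illustrative assumptions, not the framework's actual code.

% Sketch: forward/backward Gaussian fusion with nearest-neighbor pairing.
% Each message is a struct with fields m (4x1 mean) and Sigma (4x4 covariance).
f(1) = struct('m', [10; 5; 1; 0],  'Sigma', eye(4));
f(2) = struct('m', [40; 20; 0; 1], 'Sigma', eye(4));
b(1) = struct('m', [41; 19; 0; 1], 'Sigma', 2*eye(4));
b(2) = struct('m', [11; 6; 1; 0],  'Sigma', 2*eye(4));
% An uninformative message (e.g. the prior f_{s_0}) would simply be
% struct('m', zeros(4,1), 'Sigma', 1e9*eye(4)).
for i = 1:numel(f)
    % pair f(i) with the nearest backward message (distance between the means)
    d = arrayfun(@(bb) norm(f(i).m - bb.m), b);
    [~, j] = min(d);
    % product rule: p proportional to f * b
    Sp = inv(inv(f(i).Sigma) + inv(b(j).Sigma));
    mp = Sp * (f(i).Sigma \ f(i).m + b(j).Sigma \ b(j).m);
    fprintf('track %d fused with backward message %d, mean = [%s]\n', ...
            i, j, num2str(mp', '%.2f '));
end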
Our model describes a fixed time window on time steps $k = 0, \ldots, K$. Therefore we have to inject into the graph an initial forward message $f_{s_0} = \{m_{f_{s_0}}, \Sigma_{f_{s_0}}\}$ at the beginning of the chain and a backward message $b_{s_K} = \{m_{b_{s_K}}, \Sigma_{b_{s_K}}\}$ at the end. If no a priori information about the tracked objects is available, the message $f_{s_0}$ can be taken as $f_{s_0} = \{f_{s_0^1}\}$, with $m_{f_{s_0^1}} = 0$ and a very high-valued covariance matrix ($\Sigma_{f_{s_0^1}} \to \infty$). The same can be done for $b_{s_K}$. More generally, distributions with very large variances are injected wherever no data is available.

Three macro-blocks are defined in our architecture, named Model Block, Camera Block and Fusion Block, which deal with the dynamic model, the camera estimates and the track fusion, respectively. All these operations are performed by propagating bidirectionally all the information coming from the different branches of the system. Our architecture is built to be cycle-free, to avoid the indeterminacies caused by loops. In the following we detail the content of the three macro-blocks just defined. To improve readability, all the operations are described for a single target $s_k^i \in s_k$, and it is assumed that at the exit of a block all the state estimates are gathered into the corresponding multi-object state variables. In this paper, the state variable $s_k$ contains the target states (the tracks) at time $k$, $z_k$ represents the tracks passed through the target dynamic model, and $c_{1_k}, \ldots, c_{N_k}$ group the estimates coming from the $N$ cameras. The indexes are $k = 0, \ldots, K$ for time, $i = 1, \ldots, O$ for the targets (and tracks), $j = 1, \ldots, N$ for the cameras, and $t = 1, \ldots, \zeta_j$ for the moving objects detected by the $j$-th camera. The model and camera blocks do not differ from those presented in [13], but are extended here to multiple target tracking.

A. Model Block

We assume a standard nearly-constant-velocity model [16] for the $i$-th target, with the accelerations along $X$ and $Y$ modeled as small white noises

$$\begin{pmatrix} X_k^i \\ Y_k^i \\ \dot{X}_k^i \\ \dot{Y}_k^i \end{pmatrix} = A \begin{pmatrix} X_{k-1}^i \\ Y_{k-1}^i \\ \dot{X}_{k-1}^i \\ \dot{Y}_{k-1}^i \end{pmatrix} + w_k, \qquad (1)$$

with

$$A = \begin{pmatrix} 1 & 0 & T & 0 \\ 0 & 1 & 0 & T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad (2)$$
where $T$ is the sampling period and $w_k = (w_{X_k}, w_{Y_k}, w_{\dot{X}_k}, w_{\dot{Y}_k})^T$ is white Gaussian noise $\sim \mathcal{N}(0_4, \mathrm{diag}(\sigma_X^2, \sigma_Y^2, \sigma_{\dot{X}}^2, \sigma_{\dot{Y}}^2))$ that models the uncertainties in object position and speed. The noise covariance allows us to tune the confidence we have in the model (more detailed models can be used [16], but we leave them to future papers). Forward and backward messages are computed as follows [15]

$$f_{z_k^{i\prime}} = \{A\, m_{f_{s_{k-1}^i}},\; A\, \Sigma_{f_{s_{k-1}^i}} A^T\}, \qquad (3)$$

$$f_{z_k^i} = \{m_{f_{z_k^{i\prime}}} + m_{f_{w_k}},\; \Sigma_{f_{z_k^{i\prime}}} + \Sigma_{f_{w_k}}\}, \qquad (4)$$

$$b_{z_k^{i\prime}} = \{m_{b_{z_k^i}} - m_{f_{w_k}},\; \Sigma_{b_{z_k^i}} + \Sigma_{f_{w_k}}\}, \qquad (5)$$

$$b_{s_{k-1}^i} = \{(A^T \Sigma_{b_{z_k^{i\prime}}}^{-1} A)^{-1} A^T \Sigma_{b_{z_k^{i\prime}}}^{-1} m_{b_{z_k^{i\prime}}},\; (A^T \Sigma_{b_{z_k^{i\prime}}}^{-1} A)^{-1}\}, \qquad (6)$$

where $z_k^{i\prime}$ denotes the intermediate variable at the output of the $A$ block, before the noise $w_k$ is added.
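As a concrete illustration of (3)-(6), the following MATLAB sketch propagates a single track through the model block; the sampling period, variances and numeric values are purely illustrative.

% Nearly-constant-velocity model block: forward and backward messages (sketch).
T  = 1.0;                                   % sampling period [s]
A  = [1 0 T 0; 0 1 0 T; 0 0 1 0; 0 0 0 1];  % transition matrix, eq. (2)
Qw = diag([0.5 0.5 0.1 0.1].^2);            % process noise covariance (illustrative)
mw = zeros(4,1);                            % zero-mean process noise
% forward: previous track estimate f_{s_{k-1}} -> predicted track f_{z_k}, eqs (3)-(4)
m_fs = [100; 50; 2; -1];  S_fs = eye(4);
m_fz = A*m_fs + mw;
S_fz = A*S_fs*A' + Qw;
% backward: message b_{z_k} from the fusion block -> b_{s_{k-1}}, eqs (5)-(6)
m_bz = [103; 48; 2; -1];  S_bz = 4*eye(4);
S_bz1 = S_bz + Qw;                          % eq. (5), through the noise adder
m_bz1 = m_bz - mw;
G     = A'/S_bz1;                           % A^T * inv(S_bz1)
S_bs  = inv(G*A);                           % eq. (6), covariance
m_bs  = S_bs*G*m_bz1;                       % eq. (6), mean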
B. Camera Block

In Figure 3 we depict the FFG architecture for a single camera $C_j$. The two blocks $B_{j1}$ and $B_{j2}$ represent, respectively, the extraction of the target data from the camera raw video flow and the transformation of that data into a Gaussian estimate. In this macro-block two types of variables therefore coexist, namely $c_{j_k} = \{c_{j_k}^t\}$ and $p_{j_k} = \{p_{j_k}^t\}$, with $t = 1, \ldots, \zeta_j$. More specifically, $c_{j_k}^t = (X_k^t, Y_k^t, \dot{X}_k^t, \dot{Y}_k^t)^T$ is the state estimate of a single object defined in "world" coordinates, $p_{j_k}^t = (x_{j_k}^t, y_{j_k}^t, \dot{x}_{j_k}^t, \dot{y}_{j_k}^t)^T$ is the pixel information of the same object, with $\dot{x}_{j_k}^t$ and $\dot{y}_{j_k}^t$ speed variables in the image reference system, and $\zeta_j$ is the number of objects detected by the camera. The translation from pixels to world coordinates depends on the camera parameters and on its positioning. The camera model is a standard pinhole model, which defines the mapping between world points and pixel coordinates [10] [12] [17]. In this framework the world points are bound to move on 2D surfaces (the sea plane for harbours, the street for parking lots, etc.), therefore the mapping can be described by a homography matrix. In this paper we assume the homography matrices are known exactly, having used the calibration methods detailed in [17] [10]. The pinhole model for each camera $C_j$, $j = 1, \ldots, N$, corresponds to the following equation

$$\lambda_{j_k}^t \begin{pmatrix} x_{j_k}^t \\ y_{j_k}^t \\ 1 \end{pmatrix} = H_j \begin{pmatrix} X_k^t \\ Y_k^t \\ 1 \end{pmatrix}, \qquad (7)$$

where $x_{j_k}^t$ and $y_{j_k}^t$ are the target pixel coordinates (assumed real-valued) and

$$H_j = \begin{pmatrix} h_{j_{11}} & h_{j_{12}} & h_{j_{13}} \\ h_{j_{21}} & h_{j_{22}} & h_{j_{23}} \\ h_{j_{31}} & h_{j_{32}} & h_{j_{33}} \end{pmatrix} \qquad (8)$$

is the $3 \times 3$ camera homography matrix [10]. The scale parameter $\lambda_{j_k}^t$ is a factor that depends on $(X_k^t, Y_k^t)$ and renders the transformation non-linear. Manipulating (7), we obtain

$$x_{j_k}^t = \frac{\lambda_{j_k}^t x_{j_k}^t}{\lambda_{j_k}^t} = \frac{h_{j_{11}} X_k^t + h_{j_{12}} Y_k^t + h_{j_{13}}}{h_{j_{31}} X_k^t + h_{j_{32}} Y_k^t + h_{j_{33}}} = q_{j_1}^t(X_k^t, Y_k^t),$$
$$y_{j_k}^t = \frac{\lambda_{j_k}^t y_{j_k}^t}{\lambda_{j_k}^t} = \frac{h_{j_{21}} X_k^t + h_{j_{22}} Y_k^t + h_{j_{23}}}{h_{j_{31}} X_k^t + h_{j_{32}} Y_k^t + h_{j_{33}}} = q_{j_2}^t(X_k^t, Y_k^t),$$
$$\dot{x}_{j_k}^t = \frac{d x_{j_k}^t}{dt} = q_{j_3}^t(X_k^t, Y_k^t, \dot{X}_k^t, \dot{Y}_k^t),$$
$$\dot{y}_{j_k}^t = \frac{d y_{j_k}^t}{dt} = q_{j_4}^t(X_k^t, Y_k^t, \dot{X}_k^t, \dot{Y}_k^t). \qquad (9)$$
Fig. 3. The camera block. The information extracted by $B_{j1}$ from the camera video flow is passed to the block $B_{j2}$, which performs the translation between pixel and world coordinates using the pinhole model.
The four equations in (9) link the object position and speed in the world Cartesian coordinate system with its pixel position and speed in the image plane framed by the sensor. For the opposite direction, i.e. from pixels to world points, the equation is rewritten as

$$\begin{pmatrix} X_k^t / \lambda_{j_k}^t \\ Y_k^t / \lambda_{j_k}^t \\ 1 / \lambda_{j_k}^t \end{pmatrix} = R_j \begin{pmatrix} x_{j_k}^t \\ y_{j_k}^t \\ 1 \end{pmatrix}, \qquad (10)$$

where
$$R_j = \begin{pmatrix} r_{j_{11}} & r_{j_{12}} & r_{j_{13}} \\ r_{j_{21}} & r_{j_{22}} & r_{j_{23}} \\ r_{j_{31}} & r_{j_{32}} & r_{j_{33}} \end{pmatrix} = H_j^{-1}. \qquad (11)$$
Manipulating (10) we get

$$X_k^t = \frac{X_k^t / \lambda_{j_k}^t}{1 / \lambda_{j_k}^t} = \frac{r_{j_{11}} x_{j_k}^t + r_{j_{12}} y_{j_k}^t + r_{j_{13}}}{r_{j_{31}} x_{j_k}^t + r_{j_{32}} y_{j_k}^t + r_{j_{33}}} = g_{j_1}^t(x_{j_k}^t, y_{j_k}^t),$$
$$Y_k^t = \frac{Y_k^t / \lambda_{j_k}^t}{1 / \lambda_{j_k}^t} = \frac{r_{j_{21}} x_{j_k}^t + r_{j_{22}} y_{j_k}^t + r_{j_{23}}}{r_{j_{31}} x_{j_k}^t + r_{j_{32}} y_{j_k}^t + r_{j_{33}}} = g_{j_2}^t(x_{j_k}^t, y_{j_k}^t),$$
$$\dot{X}_k^t = \frac{d X_k^t}{dt} = g_{j_3}^t(x_{j_k}^t, y_{j_k}^t, \dot{x}_{j_k}^t, \dot{y}_{j_k}^t),$$
$$\dot{Y}_k^t = \frac{d Y_k^t}{dt} = g_{j_4}^t(x_{j_k}^t, y_{j_k}^t, \dot{x}_{j_k}^t, \dot{y}_{j_k}^t). \qquad (12)$$
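As a small numerical sketch of the mappings (9) and (12), the following MATLAB fragment projects a ground-plane point to pixels through a homography and back through its inverse; the matrix entries are placeholders chosen for illustration, not calibration values from the paper.

% Homography mapping between ground-plane (world) points and pixels (sketch).
Hj = [30   0 320;        % illustrative homography for camera j
       0 -30 480;
       0 0.05  1];
Rj = inv(Hj);            % eq. (11): inverse mapping, pixels -> world
% world -> pixel, eq. (9): project and normalize by the scale lambda
XY  = [12.5; 7.0];                   % ground-plane position (X, Y)
hom = Hj * [XY; 1];                  % homogeneous pixel coordinates
xy  = hom(1:2) / hom(3);             % (x, y) = (q1, q2)
% pixel -> world, eq. (12): back-project and normalize by 1/lambda
hom2 = Rj * [xy; 1];
XY2  = hom2(1:2) / hom2(3);          % recovers (X, Y) up to numerical error
fprintf('pixel = (%.1f, %.1f), reprojected world = (%.2f, %.2f)\n', ...
        xy(1), xy(2), XY2(1), XY2(2));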
The information flow at time $k$ for each camera system is injected at block $B_{j1}$ through a frame-by-frame background-subtraction algorithm wired into that block. The analysis of the video tracker is not the topic of this paper, therefore all the issues related to illumination, clutter, etc. are not considered here. As a matter of fact, we assume that at each time $k$, for the moving object $t$, the algorithm provides the estimate $(\alpha_{j_k}^t, \beta_{j_k}^t)$, in pixel coordinates, of the object reference point (the barycenter, or the rightmost point in touch with the ground, or else). This information is included into a forward message $f_{p_{j_k}^t} = \{m_{f_{p_{j_k}^t}}, \Sigma_{f_{p_{j_k}^t}}\}$, with

$$m_{f_{p_{j_k}^t}} = (\alpha_{j_k}^t, \beta_{j_k}^t, \dot{\alpha}_{j_k}^t, \dot{\beta}_{j_k}^t)^T, \qquad (13)$$

and

$$\Sigma_{f_{p_{j_k}^t}} = \mathrm{diag}(\sigma_{\alpha_{j_k}^t}^2, \sigma_{\beta_{j_k}^t}^2, \sigma_{\dot{\alpha}_{j_k}^t}^2, \sigma_{\dot{\beta}_{j_k}^t}^2), \qquad (14)$$

with $(\sigma_{\alpha_{j_k}^t}^2, \sigma_{\beta_{j_k}^t}^2)$ accuracy parameters (variances) chosen according to the camera quality and the algorithm precision. Image-plane velocity estimates and their confidence values could also be provided. If no speeds are estimated on the image plane, $(\dot{\alpha}_{j_k}^t, \dot{\beta}_{j_k}^t)$ and $(\sigma_{\dot{\alpha}_{j_k}^t}^2, \sigma_{\dot{\beta}_{j_k}^t}^2)$ are given arbitrary and very large values, respectively.

Following the flow upward in Figure 3 through block $B_{j2}$, we can write the forward message $f_{c_{j_k}^t}$ for the object $t$. The non-linear nature of the pinhole model defined in (7) makes $f_{c_{j_k}^t}$ non-Gaussian and difficult to evaluate. Therefore we resort to a localized linear approximation, computing mean and covariance and approximating $f_{c_{j_k}^t}$ with a Gaussian density (as typical in extended Kalman filters). Assuming a Gaussian error model on the image, $(x - \alpha_{j_k}^t,\; y - \beta_{j_k}^t,\; \dot{x} - \dot{\alpha}_{j_k}^t,\; \dot{y} - \dot{\beta}_{j_k}^t)^T \sim \mathcal{N}(0, \Sigma_{f_{p_{j_k}^t}})$, we can write

$$f_{c_{j_k}^t} = \{g_j^t(\alpha_{j_k}^t, \beta_{j_k}^t, \dot{\alpha}_{j_k}^t, \dot{\beta}_{j_k}^t),\; J_j^t(\alpha_{j_k}^t, \beta_{j_k}^t, \dot{\alpha}_{j_k}^t, \dot{\beta}_{j_k}^t)\, \Sigma_{f_{p_{j_k}^t}}\, J_j^t(\alpha_{j_k}^t, \beta_{j_k}^t, \dot{\alpha}_{j_k}^t, \dot{\beta}_{j_k}^t)^T\}, \qquad (15)$$
where

$$g_j^t(x, y, \dot{x}, \dot{y}) = \begin{pmatrix} g_{j_1}^t(x, y) \\ g_{j_2}^t(x, y) \\ g_{j_3}^t(x, y, \dot{x}, \dot{y}) \\ g_{j_4}^t(x, y, \dot{x}, \dot{y}) \end{pmatrix} \qquad (16)$$

groups all the functions defined in (12), and

$$J_j^t(x, y, \dot{x}, \dot{y}) = \begin{pmatrix} \frac{\partial}{\partial x} g_{j_1}^t & \frac{\partial}{\partial y} g_{j_1}^t & 0 & 0 \\ \frac{\partial}{\partial x} g_{j_2}^t & \frac{\partial}{\partial y} g_{j_2}^t & 0 & 0 \\ \frac{\partial}{\partial x} g_{j_3}^t & \frac{\partial}{\partial y} g_{j_3}^t & \frac{\partial}{\partial \dot{x}} g_{j_3}^t & \frac{\partial}{\partial \dot{y}} g_{j_3}^t \\ \frac{\partial}{\partial x} g_{j_4}^t & \frac{\partial}{\partial y} g_{j_4}^t & \frac{\partial}{\partial \dot{x}} g_{j_4}^t & \frac{\partial}{\partial \dot{y}} g_{j_4}^t \end{pmatrix} \qquad (17)$$

is the Jacobian matrix. By the same reasoning, traveling in the backward direction from world to pixel coordinates through $B_{j2}$, we obtain

$$b_{p_{j_k}^t} = \{q_j^t(\xi_{j_k}^t, \eta_{j_k}^t, \dot{\xi}_{j_k}^t, \dot{\eta}_{j_k}^t),\; \Gamma_j^t(\xi_{j_k}^t, \eta_{j_k}^t, \dot{\xi}_{j_k}^t, \dot{\eta}_{j_k}^t)\, \Sigma_{b_{c_{j_k}^t}}\, \Gamma_j^t(\xi_{j_k}^t, \eta_{j_k}^t, \dot{\xi}_{j_k}^t, \dot{\eta}_{j_k}^t)^T\}, \qquad (18)$$

where $(\xi_{j_k}^t, \eta_{j_k}^t, \dot{\xi}_{j_k}^t, \dot{\eta}_{j_k}^t)$ is the estimate coming from the upstream fusion block,

$$q_j^t(X, Y, \dot{X}, \dot{Y}) = \begin{pmatrix} q_{j_1}^t(X, Y) \\ q_{j_2}^t(X, Y) \\ q_{j_3}^t(X, Y, \dot{X}, \dot{Y}) \\ q_{j_4}^t(X, Y, \dot{X}, \dot{Y}) \end{pmatrix} \qquad (19)$$

groups all the equations defined in (9), and

$$\Gamma_j^t(X, Y, \dot{X}, \dot{Y}) = \begin{pmatrix} \frac{\partial}{\partial X} q_{j_1}^t & \frac{\partial}{\partial Y} q_{j_1}^t & 0 & 0 \\ \frac{\partial}{\partial X} q_{j_2}^t & \frac{\partial}{\partial Y} q_{j_2}^t & 0 & 0 \\ \frac{\partial}{\partial X} q_{j_3}^t & \frac{\partial}{\partial Y} q_{j_3}^t & \frac{\partial}{\partial \dot{X}} q_{j_3}^t & \frac{\partial}{\partial \dot{Y}} q_{j_3}^t \\ \frac{\partial}{\partial X} q_{j_4}^t & \frac{\partial}{\partial Y} q_{j_4}^t & \frac{\partial}{\partial \dot{X}} q_{j_4}^t & \frac{\partial}{\partial \dot{Y}} q_{j_4}^t \end{pmatrix} \qquad (20)$$

is the Jacobian for the backward direction. The backward flow can be used to improve object localization, as already shown in [13] for single target tracking; we will address the analysis of the backward flow to the cameras in the multiple target domain elsewhere. It is also worth pointing out that, since the modeling for each camera is a homography, even a single camera can perform the tracking. However, a setup with at least two or three cameras is suggested, both to have redundant information for better performance and to cover larger areas.

Fig. 4. The fusion block. Estimates from the cameras and tracks from the model are merged to obtain the new tracks.
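The linearization in (15)-(17) can be sketched numerically as follows (position components only); we approximate the Jacobian of $g_j^t$ by finite differences rather than the analytic derivatives of (17), and the homography, the detection and the pixel variances are illustrative assumptions.

% Camera block forward message (position part only), eqs (15)-(17) sketch.
Hj = [30 0 320; 0 -30 480; 0 0.05 1];  % illustrative homography (as above)
Rj = inv(Hj);
% g: pixel (x,y) -> world (X,Y), i.e. the position part of eq. (12)
g = @(p) (Rj(1:2,:) * [p; 1]) / (Rj(3,:) * [p; 1]);
% detection provided by the background-subtraction stage, eqs (13)-(14)
alpha_beta = [514.8; 200.0];          % (alpha, beta) in pixels
Sigma_p    = diag([4 4]);             % pixel variances (illustrative)
% numerical Jacobian of g at the detection (central finite differences)
eps_ = 1e-3;
J = zeros(2,2);
for c = 1:2
    dp = zeros(2,1); dp(c) = eps_;
    J(:,c) = (g(alpha_beta + dp) - g(alpha_beta - dp)) / (2*eps_);
end
% Gaussian approximation of the forward message f_{c_j^t}, eq. (15)
m_fc = g(alpha_beta);                 % mean in world coordinates
S_fc = J * Sigma_p * J';              % linearized covariance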
C. Fusion Block

The fusion block receives all the messages from the graph branches and yields the tracks. More in detail, the block is reached by the model tracks and the camera estimates, namely $z_k = \{z_k^1, \ldots, z_k^O\}$ and $c_k = \{c_k^1, \ldots, c_k^{\zeta_j}\}$. The issues this block has to resolve are: 1) data fusion between the estimates; 2) pruning of incorrect and redundant tracks; 3) management of the life-cycle of the tracks (birth, existence and death); 4) data association, i.e. the association between tracks $z_k^i$ and sensor data $c_{j_k}$. The general scheme of the block is depicted in Figure 4. The indeterminacies caused by loops in the graph [3] are set aside by considering the fusion block as a single system with its own definitions of forward and backward messages. In this way we can ignore all the loops caused by the connections of the blocks composing the fusion block. In the following we calculate the forward and backward messages yielded by the fusion block. Following the forward flow from left to right in Figure 4, the message $f_{z_k}$ from the model is injected through the $P'$ block, which in the forward direction is just an identity block (output equals input). $P$ and $P'$ are two pruning blocks, i.e. their task is to cut incorrect tracks after the fusion; $P$ works in the forward direction and is simply an identity block in the backward direction, while the opposite holds for $P'$. Therefore the forward message $f_{z_k}$ travels unchanged to the equal block. The equal block is defined in [9] and can act in two ways. If the entry is a single message, the block acts as a diverter and routes the input to the different branches of the graph (one for each sensor), as shown in Figure 4. If there are several entries,
they are multiplied together by the block and returned as the output message. The same $f_{z_k}$ message is therefore routed to the dot ($\cdot$) blocks. Each of these blocks crosses the model tracks with the estimate $f_{c_{j_k}}$ transmitted by the $j$-th camera. The data association issue is resolved here using, again, a proximity criterion. More specifically, each target mean $m_{f_{c_{j_k}^t}}$ is compared
with the means of all the model tracks $m_{f_{z_k^i}}$, and the nearest neighbor is chosen for the matching. The message coming out of the block is then the product of the camera estimate and the closest model track, i.e. $f_{c_{j_k}^t} \cdot f_{z_k^i}$. If no match is found (i.e. all the distances between model and camera means are above a threshold $\tau_{neigh}$), the sensor data could be related to a new target in the scene or to clutter, and the output of the block will be the sensor estimate as it is, i.e. $f_{c_{j_k}^t}$. Such operations can be carried out by a simple program wired into the dot block. The operations just described are summarized in the pseudo-code shown in Listing 1.

Listing 1. MATLAB-style pseudo-code showing how the data association is carried out for a camera $C_j$; min($\cdot$) returns the minimum value of the dis vector together with its index.
for t = 1:length(fc)                       % camera estimates f_{c_j^t}
    for i = 1:length(fz)                   % model tracks f_{z_k^i}
        dis(i) = norm(m_fc{t} - m_fz{i});  % distance between the means
    end
    [mindis, idis] = min(dis);
    if (mindis < tau_neigh)                % match: fuse camera estimate and track
        out{t} = gaussian_product(fc{t}, fz{idis});  % product rule of Section II (helper assumed)
    else                                   % no match: possible new target or clutter
        out{t} = fc{t};
    end
end