Incorporating Statistical Background Model and Joint Probabilistic Data Association Filter into Motorcycle Tracking

Phi-Vu Nguyen
Hoai-Bac Le
Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
Abstract— Multi-target tracking is an attractive research field due to its widespread application areas and challenges. Every point tracking method comprises two mechanisms: object detection and data association. This paper combines a statistical background modeling method for foreground object detection with the Joint Probabilistic Data Association filter (JPDAF) in the context of motorcycle tracking. A major limitation of JPDAF is its inability to adapt to changes in the number of targets; in this work it is modified so that JPDAF can be applied successfully with a known number of targets at each time instant. The experimental system works well with fewer than 10 targets per frame and is able to self-evolve under gradual and "once-off" background changes.

Keywords— Multi-target tracking, point tracking, data association, JPDA, JPDAF, foreground object detection, statistical background model, motorcycle tracking.
I. INTRODUCTION
Motion understanding is an essential function of human vision; consequently, object tracking plays a crucial role in computer vision. Multi-target tracking has widespread applications in both military areas (air defense, air traffic control, ocean surveillance) and civilian areas (automated surveillance of public or sensitive places), especially as human labour becomes more and more expensive. Object tracking, in general, is a challenging problem. Its complexities arise from the following factors [4]: loss of information caused by projection from 3D to 2D space, complex object motions, complex object shapes, partial and full object occlusions, scene illumination changes, and real-time processing requirements. There are three main categories of object tracking [4]: point tracking, kernel tracking, and silhouette tracking. While kernel and silhouette tracking concern object shapes, point tracking considers an object as a point and focuses only on its position and motion, which can be represented by a state vector. Filtering is a class of methods suited to the dynamic state estimation problems of point tracking. In multi-target tracking, we face the task of finding a correspondence between the current targets and measurements, called data association. Data association is a complicated problem, especially in the presence of occlusions,
misdetections, entries, and exits of objects. There are many statistical techniques for data association [7]; among them, Joint Probabilistic Data Association (JPDA) aims to find a correspondence between measurements and objects at the current time step by enumerating all possible associations and computing the association probabilities, and it is a widely used technique ([8], [9]). However, a good JPDA filter (JPDAF) requires accurate measurements, which means we need good object detection results. Every tracking method requires an object detection mechanism: some need detection only when objects first appear, while others, including point tracking, need it in every frame. One effective way to detect foreground objects is to build an accurate background model. Recently, L. Li et al. proposed a foreground object detection method based on statistical modeling of complex backgrounds [1]. That work used a Bayesian framework to incorporate three types of features (spectral, spatial, and temporal) into a representation of complex backgrounds containing both stationary and nonstationary objects. With the statistics of background features, the method is able to represent the appearances of both static and dynamic background pixels, and to self-evolve under gradual as well as sudden "once-off" background changes. Taking advantage of the excellent object detection results of this method, this paper employs JPDAF for vehicle tracking in the motorcycle lane. A major limitation of JPDAF is its inability to adapt to changes in the number of targets, because it cannot distinguish a measurement originating from a newly appearing object from a false alarm. However, in the context of motorcycle surveillance, we have proposed a strategy to detect new objects entering and old objects leaving the observation area, so that JPDAF can be applied successfully with a known number of targets at each time instant. The experimental system gives good results with fewer than 10 targets per frame, including the detection and tracking of motorcycles driven in the wrong direction. Motorcycle tracking in particular, and traffic tracking in general, is an interesting but challenging application. Its main difficulties can be
enumerated as: severe occlusions when traffic density is high (especially in rush hours), the shadows of big vehicles, and the real-time processing demand of a traffic surveillance system. This paper is the next step (after [6]) in the effort to find the most satisfactory approach for automated traffic surveillance in the big cities of Vietnam. The remainder of this paper is organized as follows: section II presents the main ideas of statistical modeling of complex backgrounds proposed in [1]; section III reviews the background of JPDAF, with a complete algorithm and experimental results on simulated data presented at the end of the section; section IV combines the statistical background model with the modified JPDAF so that they can be applied to motorcycle tracking; the experimental results of this combination are presented in section V; and the conclusion follows.

II. STATISTICAL BACKGROUND MODELING FOR FOREGROUND OBJECT DETECTION
A. Bayesian framework for classifying background and foreground points

Let s = (x, y) be a pixel in a video frame at time t, given by its Cartesian coordinates, and let v be the feature vector extracted at s. Using the Bayes formula, the probability that s belongs to the background given v is:

\[ P_s(b \mid v) = \frac{P_s(v \mid b)\, P_s(b)}{P_s(v)} \quad (1) \]

where b denotes that s belongs to the background. Similarly, the probability that s belongs to a foreground object given v is:

\[ P_s(f \mid v) = \frac{P_s(v \mid f)\, P_s(f)}{P_s(v)} \quad (2) \]

where f denotes that s is a foreground point. According to the Bayesian decision rule, s is classified as a background point if:

\[ P_s(b \mid v) > P_s(f \mid v) \quad (3) \]

Since P_s(b | v) + P_s(f | v) = 1, (3) is equivalent to P_s(b | v) > 1/2, that is:

\[ 2\, P_s(v \mid b)\, P_s(b) > P_s(v) \quad (4) \]

where P_s(b) is the probability that s is classified as background, P_s(v) is the probability that v is observed at s, and P_s(v | b) is the probability that v is observed when s has already been classified as background. Thus, we can use P_s(v | b), P_s(b) and P_s(v), which are modeled and estimated from statistics in subsections B and C, to judge whether a point comes from the background or the foreground.

B. Statistics for background features and feature selection

To estimate P_s(v | b), P_s(b) and P_s(v), we need a data structure that accumulates the statistical information relevant to the feature vector v at s over a sequence of frames. Each feature type at s has a table of statistics defined as:

\[ T_v(s) = \big\{\, p_v^t(b),\ \{S_v^t(i)\}_{i=1,\ldots,M(v)} \,\big\} \quad (5) \]

where p_v^t(b) records P_s(b) at time t based on the classification results at s up to time t, and {S_v^t(i)}_{i=1,...,M(v)} records the statistics of the M(v) feature vectors with the highest frequencies at s. Each S_v^t(i) contains:

\[ S_v^t(i) = \big\{\, p_{v_i}^t = P_s(v_i),\quad p_{v_i|b}^t = P_s(v_i \mid b),\quad v_i = (v_{i1}, \ldots, v_{iD(v)})^T \,\big\} \quad (6) \]

where D(v) is the dimension of v_i. In the table T_v(s), the {S_v^t(i)}_{i=1,...,M(v)} are kept sorted in descending order with respect to p_{v_i}^t, the frequency of v_i; the first N(v) elements (N(v) < M(v)) are then maintained as the learned background features.

1) Feature selection for static background pixels: for a static pixel, the color feature c and the gradient feature e are used together, and s is classified as background if:

\[ 2\, P_s(c \mid b)\, P_s(e \mid b)\, P_s(b) > P_s(c)\, P_s(e) \quad (7) \]

With color and gradient features, we need a quantization measure that is less sensitive to illumination changes, so a normalized distance measure based on the inner product of two vectors is adopted [2]:

\[ d(v_1, v_2) = 1 - \frac{2\,\langle v_1, v_2 \rangle}{\|v_1\|^2 + \|v_2\|^2} \quad (8) \]

where v ∈ {c, e}, and v_1 and v_2 are identified with each other if d(v_1, v_2) < δ.

2) Feature selection for dynamic background pixels: the motion of a dynamic background object is usually small in range (so that it is still regarded as background) and periodic, for example waving tree branches and their shadows. Hence, the color co-occurrence feature is used to exploit these properties. Let c_{t-1} = (R_{t-1}, G_{t-1}, B_{t-1})^T and c_t = (R_t, G_t, B_t)^T be the color features at times t-1 and t at pixel s; the color co-occurrence vector at time t and pixel s is then defined as cc_t = (R_{t-1}, G_{t-1}, B_{t-1}, R_t, G_t, B_t)^T. In this case, another distance measure is used:

\[ d(cc_t, cc_j) = \max_{k \in [1..6]} \big| cc_{tk} - cc_{jk} \big| \quad (9) \]
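To make the bookkeeping concrete, the following is a minimal Python sketch of the per-pixel table of statistics (5)-(6) and the decision rule (4). The class and parameter names (M = 50, N = 30, δ = 0.005) are our own illustration guided by Table I, not code from [1]:

```python
import numpy as np

def normalized_distance(v1, v2):
    """Inner-product distance of eq. (8); small when the vectors agree."""
    denom = float(np.dot(v1, v1) + np.dot(v2, v2))
    return 1.0 - 2.0 * float(np.dot(v1, v2)) / denom if denom > 0 else 0.0

class FeatureTable:
    """Per-pixel table T_v(s) of eq. (5): the background prior p_v^t(b) plus
    the statistics S_v^t(i) of eq. (6) for the M(v) most frequent vectors."""
    def __init__(self, dim, M=50, N=30, delta=0.005):
        self.p_b = 0.5              # p_v^t(b)
        self.freq = np.zeros(M)     # p_{v_i}^t
        self.freq_b = np.zeros(M)   # p_{v_i|b}^t
        self.vecs = np.zeros((M, dim))
        self.N, self.delta = N, delta

    def match(self, v):
        """Indices of the first N(v) stored vectors identified with v (eq. 16)."""
        return [i for i in range(self.N)
                if normalized_distance(v, self.vecs[i]) < self.delta]

    def is_background(self, v):
        """Dynamic-pixel decision rule (4): 2 P_s(v|b) P_s(b) > P_s(v),
        with the probabilities read off the table as in eq. (15)."""
        idx = self.match(v)
        return 2.0 * self.freq_b[idx].sum() * self.p_b > self.freq[idx].sum()
```

Note that an unseen feature vector yields P_s(v) = P_s(v | b) = 0 and is therefore classified as foreground, matching the rule in subsection D.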
C. Learning the statistics of background features

With the data structure for statistics in place, we turn to the feature learning procedure. There are two kinds of background changes, and each requires a different learning strategy.

1) Gradual background changes: once the classification result at pixel s (subsection D) and time t is available, its statistics at the next time instant are updated as follows:

\[ p_v^{t+1}(b) = (1-\alpha)\, p_v^t(b) + \alpha L_b^t \]
\[ p_{v_i}^{t+1} = (1-\alpha)\, p_{v_i}^t + \alpha L_{v_i}^t \quad (10) \]
\[ p_{v_i|b}^{t+1} = (1-\alpha)\, p_{v_i|b}^t + \alpha \big( L_b^t L_{v_i}^t \big) \]

where v ∈ {c, e, cc} and 0 < α < 1 is the learning rate. If s is classified as a background point at time t, then L_b^t = 1, otherwise L_b^t = 0. If the input feature vector v_t is identified with v_i, then L_{v_i}^t = 1, otherwise L_{v_i}^t = 0. Besides, if there is no v_i in the table T_v(s) identified with v_t, the last component in {S_v^{t+1}(i)}_{i=1,...,M(v)} is replaced by a new one:

\[ v_{M(v)} = v_t, \quad p_{v_{M(v)}}^{t+1} = \alpha, \quad p_{v_{M(v)}|b}^{t+1} = \alpha \quad (11) \]

2) "Once-off" background changes: an "once-off" background change occurs when there is a sudden change in illumination, or when a moving foreground object stops and becomes a background instance; that is, when background suddenly becomes foreground or vice versa. When this happens, we have:

\[ P_s(f) \sum_{i=1}^{N(v)} P_s(v_i \mid f) = \sum_{i=1}^{N(v)} P_s(v_i) - P_s(b) \sum_{i=1}^{N(v)} P_s(v_i \mid b) > M \quad (12) \]

or

\[ \sum_{i=1}^{N(v)} p_{v_i}^t - p_v^t(b) \sum_{i=1}^{N(v)} p_{v_i|b}^t > M \quad (13) \]

where M is a high percentage threshold (80% to 90%). Thus, (13) can be used as a condition to check whether an "once-off" background change is happening. In that case, the foreground statistics are turned into background statistics:

\[ p_v^{t+1}(b) = 1 - p_v^t(b), \quad p_{v_i}^{t+1} = p_{v_i}^t, \quad p_{v_i|b}^{t+1} = \frac{p_{v_i}^t - p_v^t(b)\, p_{v_i|b}^t}{p_v^{t+1}(b)} \quad (14) \]

for i = 1, ..., N(v). It is also proved that under this learning process \(\sum_{i=1}^{N(v)} p_{v_i|b}^{t+1}\) converges to 1 as long as the background features are observed frequently [1].

D. Foreground object detection

1) Change detection: in order to perform the feature selection described in subsection B, we need to know whether a pixel is static or dynamic. Therefore, color-based background differencing and interframe differencing are applied to detect changes. Background differencing computes the difference between the background B(s, t) and the input frame, while interframe differencing performs the same operation on consecutive frames. Let F_bd(s, t) and F_td(s, t) be the background difference and the interframe difference, respectively. If F_bd(s, t) = 0 and F_td(s, t) = 0, pixel s is regarded as a non-change background point. If F_td(s, t) = 1, s is classified as a dynamic point, and color co-occurrence features are used for background/foreground classification; otherwise, s is a static point, and color and gradient features are used in the next step.

2) Background/foreground classification: let v_t be the input feature at pixel s and time t. The probabilities are estimated as follows:

\[ P_s(b) = p_v^t(b), \quad P_s(v_t) = \sum_{v_j \in U(v_t)} p_{v_j}^t, \quad P_s(v_t \mid b) = \sum_{v_j \in U(v_t)} p_{v_j|b}^t \quad (15) \]

where v ∈ {c, e, cc} and U(v_t) is the set of v_j ∈ T_v(s) that are identified with v_t:

\[ U(v_t) = \big\{ v_j \in T_v(s) : d(v_t, v_j) \le \delta \ \text{and} \ j \le N(v) \big\} \quad (16) \]

If there is no v_j ∈ T_v(s) identified with v_t, then P_s(v_t) = P_s(v_t | b) = 0. As noted above, if s is a static pixel, we have v = c and v = e, so T_c(s) and T_e(s) are used as the tables of statistics. After calculating the probabilities as in (15), (7) is used to classify s as background or foreground; note that in this case P_s(b) = p_c^t(b) = p_e^t(b). Similarly, if s is a dynamic pixel, v = cc and (4) is used as the classification criterion.

3) Foreground object segmentation: after background/foreground classification has been completed for all pixels, an "oil spreading" algorithm is applied to find connected regions of foreground pixels. Some heuristic techniques are then used to separate objects stuck to each other due to shadows.

E. Background maintenance

To make the background differencing in the change detection step more accurate, the background image should be updated regularly. Let B(s, t) and I(s, t) be the background and the input frame at s and time t. If s is regarded as a non-change background point, the background at s is updated as:

\[ B(s, t+1) = (1-\beta)\, B(s, t) + \beta\, I(s, t) \quad (17) \]

where 0 < β < 1. Otherwise, if s is classified as a background point (static or dynamic), the background at s is replaced by the new value:

\[ B(s, t+1) = I(s, t) \quad (18) \]
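The gradual update (10)-(11) can be grafted onto the FeatureTable sketch above; this is again our own minimal rendering of the rules in [1], not the authors' implementation:

```python
    # method of the FeatureTable sketch above
    def update_gradual(self, v, is_bg, alpha=0.005):
        """Gradual learning, eqs. (10)-(11): is_bg is the classification
        result at this pixel and frame (L_b^t), alpha the learning rate."""
        Lb = 1.0 if is_bg else 0.0
        idx = self.match(v)
        Lv = np.zeros_like(self.freq)
        if idx:
            Lv[idx[0]] = 1.0                 # matched vector: L_{v_i}^t = 1
        self.p_b = (1 - alpha) * self.p_b + alpha * Lb            # eq. (10)
        self.freq = (1 - alpha) * self.freq + alpha * Lv
        self.freq_b = (1 - alpha) * self.freq_b + alpha * Lb * Lv
        if not idx:                          # unseen vector: replace the last
            self.vecs[-1] = v                # slot, eq. (11)
            self.freq[-1] = self.freq_b[-1] = alpha
        order = np.argsort(-self.freq)       # keep slots sorted by frequency
        self.vecs = self.vecs[order]
        self.freq, self.freq_b = self.freq[order], self.freq_b[order]
```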
Figure 1 presents the complete algorithm of foreground object detection.
Figure 1. The complete algorithm of foreground object detection.
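The per-pixel flow of that algorithm reduces to a few branches; a sketch under the assumption that the binary change flags F_bd, F_td and the three feature vectors have already been computed:

```python
def joint_rule(tab_c, tab_e, c, e):
    """Static-pixel rule (7): 2 P_s(c|b) P_s(e|b) P_s(b) > P_s(c) P_s(e),
    with P_s(b) shared by the color and gradient tables."""
    ic, ie = tab_c.match(c), tab_e.match(e)
    return (2.0 * tab_c.freq_b[ic].sum() * tab_e.freq_b[ie].sum() * tab_c.p_b
            > tab_c.freq[ic].sum() * tab_e.freq[ie].sum())

def classify_pixel(F_bd, F_td, tables, c, e, cc):
    """One pass of the Figure 1 pipeline for a single pixel. `tables` maps
    "c", "e", "cc" to FeatureTable instances; returns True for background."""
    if F_bd == 0 and F_td == 0:
        return True                       # non-change point, eq. (17) update
    if F_td == 1:                         # dynamic point: co-occurrence, rule (4)
        is_bg = tables["cc"].is_background(cc)
        tables["cc"].update_gradual(cc, is_bg)
    else:                                 # static point: color + gradient, rule (7)
        is_bg = joint_rule(tables["c"], tables["e"], c, e)
        tables["c"].update_gradual(c, is_bg)
        tables["e"].update_gradual(e, is_bg)
    return is_bg
```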
III. JOINT PROBABILISTIC DATA ASSOCIATION FILTER

Let M(t) be the number of objects at time t and N(t) be the number of measurements received. The sets of objects and measurements at time t are denoted respectively as:

\[ X_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,M(t)}\} \quad (19) \]
\[ Z_t = \{z_{t,1}, z_{t,2}, \ldots, z_{t,N(t)}\} \quad (20) \]

Let θ = {θ_{j,i}, j = 1..N(t), i = 1..M(t)} denote a joint association event between objects and measurements, where θ_{j,i} is the particular event that assigns measurement z_{t,j} to object i. The joint association event probability is:

\[ p(\theta \mid Z_{1:t}) = p(\theta \mid Z_t, N(t), Z_{1:t-1}) = \frac{1}{c}\, p(Z_t \mid \theta, N(t), Z_{1:t-1})\, p(\theta \mid Z_{1:t-1}, N(t)) = \frac{1}{c}\, p(Z_t \mid \theta, N(t), Z_{1:t-1})\, p(\theta \mid N(t)) \quad (21) \]

where Z_{1:t} is the sequence of measurements up to time t and c is a normalization constant. The first term, p(Z_t | θ, N(t), Z_{1:t-1}), is the likelihood function of the measurements, given by:

\[ p(Z_t \mid \theta, N(t), Z_{1:t-1}) = \mu_F(\phi) \prod_{\theta_{j,i} \in \theta} g_t(x_{t,i} \mid z_{t,j}) \quad (22) \]

where φ is the number of false alarms, μ_F(φ) is the probability of that number of false alarms, usually Poisson distributed, and g_t(x_{t,i} | z_{t,j}) is the likelihood that measurement z_{t,j} originated from target x_{t,i}. The second term, p(θ | N(t)), is the prior probability of a joint association event, given by:

\[ p(\theta \mid N(t)) = \frac{\phi!}{N(t)!}\, p_D^{N(t)-\phi} (1-p_D)^{M(t)-N(t)+\phi}\, \mu_F(\phi) \quad (23) \]

where p_D is the probability of detecting an object, under the assumption that target detection occurs independently over time with known probability. Thus, the probability of a joint association event is:

\[ p(\theta \mid Z_{1:t}) = \frac{1}{c}\, \frac{\phi!}{N(t)!}\, p_D^{N(t)-\phi} (1-p_D)^{M(t)-N(t)+\phi} \times \big(\mu_F(\phi)\big)^2 \prod_{\theta_{j,i} \in \theta} g_t(x_{t,i} \mid z_{t,j}) \quad (24) \]

The state estimate of object i is:

\[ \hat{x}_{t,i} = E(x_{t,i} \mid Z_{1:t}) = \int x_{t,i}\, p\Big(x_{t,i}, \bigcup_{j=1}^{N(t)} \theta_{j,i} \,\Big|\, Z_{1:t}\Big)\, dx_{t,i} = \sum_{j=1}^{N(t)} \int x_{t,i}\, p(x_{t,i} \mid \theta_{j,i}, Z_{1:t})\, p(\theta_{j,i} \mid Z_{1:t})\, dx_{t,i} \quad (25) \]

Let the association probability for a particular association between measurement z_{t,j} and object i be defined by:

\[ \beta_{j,i} = p(\theta_{j,i} \mid Z_{1:t}) = \sum_{\theta : \theta_{j,i} \in \theta} p(\theta \mid Z_{1:t}) \quad (26) \]

Hence, (25) becomes:

\[ \hat{x}_{t,i} = \sum_{j=1}^{N(t)} E(x_{t,i} \mid \theta_{j,i}, Z_{1:t})\, \beta_{j,i} = \sum_{j=1}^{N(t)} \hat{x}_{t,i}^{\,j}\, \beta_{j,i} \quad (27) \]

where \(\hat{x}_{t,i}^{\,j}\) is the state estimate from the Kalman filter [10] under the assumption that measurement z_{t,j} is associated with object i. Note that \(\sum_{j=1}^{N(t)} \beta_{j,i} < 1\), and in practice it is difficult to estimate β_{j,i} in (26) exactly as the theory prescribes, so we normalize β_{j,i} so that \(\sum_{j=1}^{N(t)} \beta_{j,i} = 1\) before using it in (27). Hence:

\[ \bar{\beta}_{j,i} = \frac{\beta_{j,i}}{\sum_{j'=1}^{N(t)} \beta_{j',i}} \quad (28) \]

and (27) becomes:

\[ \hat{x}_{t,i} = \sum_{j=1}^{N(t)} \hat{x}_{t,i}^{\,j}\, \bar{\beta}_{j,i} \quad (29) \]
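A compact sketch of the association step, under the simplifying assumptions that joint events are enumerated exhaustively over one-to-one assignments and that g_t is supplied externally (e.g., from a Kalman filter); parameter names are ours:

```python
import math
from itertools import product
import numpy as np

def jpdaf_betas(g, p_D=0.98, lam=0.1):
    """Normalized association probabilities (26) and (28) for one time step.
    g[j, i] is the likelihood g_t(x_{t,i} | z_{t,j}) of eq. (22). Joint events
    are enumerated exhaustively, so the cost is exponential in the number of
    targets, which is why the system is restricted to < 10 targets/frame."""
    N, M = g.shape
    beta = np.zeros((N, M))
    # Each joint event assigns every target either a measurement index or -1
    # (undetected); leftover measurements are the phi false alarms.
    for assign in product(range(-1, N), repeat=M):
        used = [j for j in assign if j >= 0]
        if len(set(used)) != len(used):
            continue                           # a measurement feeds one target only
        phi = N - len(used)
        prior = (math.factorial(phi) / math.factorial(N)           # eq. (23)
                 * p_D ** (N - phi) * (1 - p_D) ** (M - N + phi))
        mu_F = math.exp(-lam) * lam ** phi / math.factorial(phi)
        p = prior * mu_F ** 2                  # eq. (24), up to the constant 1/c
        for i, j in enumerate(assign):
            if j >= 0:
                p *= g[j, i]
        for i, j in enumerate(assign):
            if j >= 0:
                beta[j, i] += p                # eq. (26): sum over matching events
    return beta / np.maximum(beta.sum(axis=0, keepdims=True), 1e-300)  # eq. (28)
```

The per-target estimate (29) is then the β-weighted sum of one Kalman update per (j, i) pair.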
Figure 2 presents the complete algorithm of JPDAF at each time instant t.
Figure 2. The complete algorithm of JPDAF at each time instant.

Figure 3 shows the experimental results of JPDAF on simulated data with 8 targets over 100 time steps; for each target, the left panel shows the simulated data and the right panel shows the estimated track. Targets' positions are initialized in the area [0..500] x [0..50], false alarms are taken randomly in the area [0..200] x [0..200], p_D = 0.98, and μ_F(φ) ~ Poisson(λ = 0.1).

Figure 3. JPDA filter results on simulated data for 8 targets (left of each pair: simulated data; right: estimated tracks).
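For reference, a minimal driver that generates this kind of simulated scenario and feeds it to the jpdaf_betas sketch above; the random-walk motion model and noise levels are our assumptions, and we use 3 targets here because the exhaustive enumeration grows quickly (the paper simulates 8):

```python
import numpy as np

rng = np.random.default_rng(0)
M, steps = 3, 100
targets = np.column_stack([rng.uniform(0, 500, M), rng.uniform(0, 50, M)])
for t in range(steps):
    targets += rng.normal(0, 1, targets.shape)      # assumed random-walk motion
    detected = rng.random(M) < 0.98                 # p_D = 0.98
    meas = [x + rng.normal(0, 1, 2) for x in targets[detected]]
    meas += [rng.uniform(0, 200, 2) for _ in range(rng.poisson(0.1))]
    if not meas:
        continue
    Z = np.array(meas)
    d2 = ((Z[:, None, :] - targets[None, :, :]) ** 2).sum(-1)
    beta = jpdaf_betas(np.exp(-0.5 * d2))           # stand-in Gaussian likelihood
    # beta[j, i] now weights measurement j in the update of target i, eq. (29)
```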
IV. COMBINING STATISTICAL BACKGROUND MODEL AND JPDA FILTER FOR MOTORCYCLE TRACKING
A. Moving object detection

The statistical background model is applied to detect moving objects in the motorcycle lane with the parameters in Table I.

TABLE I. PARAMETERS FOR STATISTICAL BACKGROUND MODEL
The color and gradient vectors are obtained by quantizing their domains to 256 resolution levels, while for color co-occurrence vectors the number of quantized levels is 32; δ = 0.005 is used for the distance measure in (8), and δ = 2 for (9).

B. Multi-target tracking

Using the measurements obtained from the detection stage, JPDAF performs data association between the current measurements and targets. At each time t, relying on the accuracy of the detection results, we can propose a strategy to detect new objects entering the observation area. If

\[ \exists\, z_{t,j} \in Z_t \ \text{such that} \ \forall\, x_{t-1,i} \in X_{t-1}: \ \big\| z_{t,j} - x_{t-1,i} \big\| > \varepsilon \quad (30) \]

where x_{t-1,i} = (x_{t-1,i}, y_{t-1,i}) and z_{t,j} = (z^x_{t,j}, z^y_{t,j}) are the Cartesian coordinates of object i at time t-1 and of measurement j at time t, and ε is a small positive number, then z_{t,j} is considered a measurement originating from a new object. In other words, if a measurement is not "too close" to any target from the previous time instant, a new target is assumed to have appeared. Conversely, if an object is at the end of the observation area and is not a new object, or if it has been misdetected for more than 3 time instants, it is removed.
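This entry/exit bookkeeping reduces to a simple distance test; a sketch, with illustrative threshold values of our own:

```python
import numpy as np

def manage_targets(Z, X_prev, miss_count, eps=15.0, exit_y=220.0, max_miss=3):
    """Entry/exit bookkeeping around rule (30). Z is an (N, 2) array of current
    measurements, X_prev an (M, 2) array of previous target positions, and
    miss_count[i] the consecutive misdetections of target i; eps and exit_y
    are illustrative pixel values, not the paper's calibrated thresholds."""
    # Rule (30): a measurement far from every previous target starts a new track.
    new_tracks = [z for z in Z
                  if np.all(np.linalg.norm(X_prev - z, axis=1) > eps)]
    # Removal: a target past the end of the observation area, or misdetected
    # for more than max_miss instants, is dropped (the paper additionally
    # exempts just-entered objects from the exit test).
    keep = [i for i, x in enumerate(X_prev)
            if x[1] <= exit_y and miss_count[i] <= max_miss]
    return np.array(new_tracks), keep
```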
To increase the accuracy of JPDAF, besides the spatial distance, color histogram information is incorporated into the likelihood g_t(x_{t,i} | z_{t,j}) in (22). The Bhattacharyya distance is employed to calculate the "distance" between the reference color model K* = {k*(n; x_0)}_{n=1,...,N} and the candidate color model K(x_k) = {k(n; x_t)}_{n=1,...,N} of each target (details in [3]):

\[ \xi\big[K^*, K(x_k)\big] = \Big[ 1 - \sum_{n=1}^{N} \sqrt{k^*(n; x_0)\, k(n; x_k)} \Big]^{1/2} \quad (31) \]

where the reference color model of a target is taken from its last state and the candidate color model from its current measurement. Moreover, to increase accuracy, the reference and candidate models are divided into two sub-regions (Figure 4), and the color likelihood of a candidate model is computed as:

\[ cl_t(x_{t,i} \mid z_{t,j}) \propto e^{-\lambda \sum_{w=1}^{2} \xi^2 [K^*_w,\, K_w(x_t)]} \quad (32) \]

Figure 4. The reference color model.

Let dl_t(x_{t,i} | z_{t,j}) be the spatial distance likelihood (p(z_t | X_t)) obtained from the Kalman filter between measurement j and target i [10]; the likelihood g_t(x_{t,i} | z_{t,j}) in (24) is then defined as:

\[ g_t(x_{t,i} \mid z_{t,j}) = \gamma\, dl_t(x_{t,i} \mid z_{t,j}) + (1-\gamma)\, cl_t(x_{t,i} \mid z_{t,j}) \quad (33) \]

where 0.5 < γ < 1 because spatial distance information has a higher priority than color in this context. In our application, we chose γ = 0.7.

V. EXPERIMENTAL RESULTS

A. Object detection results

Figure 5 (a) shows some results of object detection; the left image of each pair is the input frame and the right one is the detection result. The experimental sequences were taken from the motorcycle lane in cloudy weather with clearly visible illumination changes, yet the detection algorithm still works very well. The background is learned rapidly: Figure 5 (b) shows a background learned after 60 frames. Together with the statistics of background features, the background/foreground classification step is very accurate, with almost no misclassified background points. The difficulty lies in the segmentation step: when the object density at the end of the observation area is high, occlusions occur frequently and the segmentation step often makes mistakes (Figure 6). Figure 5 (c) shows an example of an "once-off" background change: a motorbike stopped close to the pavement for a while and became background soon after.

Figure 5. (a) Some results of object detection; (b) Learned background image at frame 60; (c) An example of "once-off" background change.

Figure 6. Some successful and failed results of segmenting objects stuck to each other due to shadows.

Table II gives the quantitative results of object detection. The system was tested on ten sequences with object density below 10 objects/frame; each sequence has an average length of 10 seconds and uses its first 30 frames (1 second) for initial background learning ("+30" in the Length column). The precision rate of 100% demonstrates that no background object is classified as foreground, and the errors in the recall rate are caused by incorrect segmentation.

TABLE II. THE STATISTICS OF OBJECT DETECTION RESULTS

Figure 7. Some results of tracking.
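Returning to the likelihood construction of section IV.B, a sketch of (31)-(33) with histograms as normalized arrays; λ and the two-sub-region split follow the text, everything else is illustrative:

```python
import math
import numpy as np

def bhattacharyya_xi(k_ref, k_cand):
    """Eq. (31): Bhattacharyya distance between two normalized histograms."""
    return math.sqrt(max(0.0, 1.0 - float(np.sum(np.sqrt(k_ref * k_cand)))))

def color_likelihood(ref_halves, cand_halves, lam=20.0):
    """Eq. (32) over the two sub-regions of Figure 4; lam is illustrative."""
    xi2 = sum(bhattacharyya_xi(r, c) ** 2
              for r, c in zip(ref_halves, cand_halves))
    return math.exp(-lam * xi2)

def g_likelihood(dl, cl, gamma=0.7):
    """Eq. (33): convex combination of the spatial (Kalman) likelihood dl and
    the color likelihood cl; gamma = 0.7 as chosen in the paper."""
    return gamma * dl + (1.0 - gamma) * cl
```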
B. Tracking results

The results of JPDAF depend on the object detection results: if objects are correctly detected, the tracking algorithm works very well. In general, the system works well with a reasonable number of targets per frame (< 10 targets/frame). With the strategy for detecting objects entering and exiting the observation area, JPDAF can also detect and track motorcycles driven in the wrong direction. Figure 7 shows some tracking results, including the tracking of a wrong-way motorcycle (Figure 7 (d), object 10). Table III shows the statistics of fully correct tracks in the ten sequences above (objects mis-tracked in any frame are not counted). Since JPDAF is an NP-hard problem (the number of possible joint association events at each time instant t is \(\sum_{i=1}^{\min(M(t), N(t))} C_{M(t)}^{i} A_{N(t)}^{i}\)), the computational cost of JPDAF is one of its major weak points. All experiments were run on a Pentium IV 2.4 GHz with 512 MB RAM; due to the high cost of the object detection and tracking algorithms, the processing rate is 2 s/frame for a frame size of 360x240 and a sequence rate of 30 frames/s.
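The quoted event count can be checked directly, which gives a feel for how quickly the enumeration grows:

```python
from math import comb, perm

def num_joint_events(M, N):
    """Number of joint association events for M targets and N measurements:
    sum over i detected targets of C(M, i) * A(N, i)."""
    return sum(comb(M, i) * perm(N, i) for i in range(1, min(M, N) + 1))

print(num_joint_events(5, 5))    # 1545
print(num_joint_events(10, 10))  # 234662230: combinatorial growth
```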
TABLE III. THE STATISTICS OF TRACKING RESULTS
VI. CONCLUSION
This paper is the next step in the search for an efficient approach to a motorcycle surveillance system, after the use of the Particle filter in [6]. Improvements have been achieved in the object detection step, which yields more accurate results over the whole observation area and adapts efficiently to illumination changes and "once-off" changes. However, occlusions have not been handled rigorously, and the computational cost remains one of the major limitations. In the future, we hope that more multi-target tracking methods will be applied in this context so that the best one can be selected.

REFERENCES
[1] L. Li, W. Huang, I. Y. Gu, and Q. Tian, "Statistical modeling of complex backgrounds for foreground object detection," IEEE Transactions on Image Processing, Vol. 13, No. 11, Nov. 2004.
[2] L. Li and M. Leung, "Integrating intensity and texture differences for robust change detection," IEEE Transactions on Image Processing, Vol. 11, pp. 105-112, Feb. 2002.
[3] K. Okuma, A. Taleghani, N. de Freitas, J. J. Little, and D. G. Lowe, "A boosted particle filter: Multitarget detection and tracking," Proceedings of ECCV 2004, Vol. I, pp. 28-39, 2004.
[4] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: a survey," ACM Computing Surveys, Vol. 38, No. 4, Dec. 2006.
[5] O. Frank, "Multiple Target Tracking," Dipl. El.-Ing. ETH thesis, Swiss Federal Institute of Technology Zurich, Feb. 2003.
[6] Hoai Bac Le, Nam Trung Pham, and Tuong Vu Le Nguyen, "Applied Particle Filter in Traffic Tracking," Proceedings of the IEEE International Conference on RIVF, 2006.
[7] I. J. Cox, "A review of statistical data association techniques for motion correspondence," International Journal of Computer Vision, Vol. 10, No. 1, pp. 53-66, 1993.
[8] C. Rasmussen and G. Hager, "Probabilistic data association methods for tracking complex visual objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, pp. 560-576, 2001.
[9] D. Schulz, W. Burgard, D. Fox, and A. B. Cremers, "Tracking multiple moving targets with a mobile robot using Particle filters and statistical data association," Proceedings of the IEEE International Conference on Robotics & Automation (ICRA), Seoul, Korea, 2001.
[10] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter, Artech House, 2004.