Bayesian Segmentation and Motion Estimation in Video Sequences using a Markov-Potts Model

Patrice BRAULT and Ali MOHAMMAD-DJAFARI
LSS, Laboratoire des Signaux et Systèmes, CNRS UMR 8506, Supélec, Plateau du Moulon, 91192 Gif-sur-Yvette Cedex, FRANCE.
[email protected] , [email protected]

Abstract: - The segmentation of an image can be presented as an ill-posed inverse problem. The segmentation problem is stated as: knowing an observed image g, how to recover an original image f in which a classification into statistically homogeneous regions can be established. The inversion technique we use is set in a Bayesian probabilistic framework. Prior hypotheses made on the different parameters of the problem lead to the expression of posterior probability laws from which we can estimate the desired, realistic segmentation. We report here a new approach for the segmentation of video sequences based on recent works on Bayesian segmentation and fusion [3, 4, 5]. In our Bayesian framework we make the assumption that the images follow a Potts-Markov random field model. The final implementation is done with Markov Chain Monte Carlo (MCMC) and Gibbs sampling algorithms. The originality here resides in the fact that the sequence segmentation is done sequentially from one frame to the next, in a "temporal-like" Markov framework. The result is also used for a first approach to motion estimation.

Key-Words:- Bayesian Segmentation of Video Sequences, Hidden Markov Model, Markov-Potts Random Field model, Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Motion Estimation and Quantification.

1 Introduction

2D and 3D Bayesian segmentation, as well as fusion and source separation, are currently studied and performed in our group in the framework of inverse problems, in particular with probabilistic Bayesian approaches. Along with the work on still image segmentation, we are developing a Bayesian implementation for the segmentation of 2D+T video sequences. We recall here that a video sequence can be presented as a series of still images, called frames, exhibiting an inter-relation, called "motion", between the objects or the regions pertaining to two successive frames. This motion is of course closely related to the notion of time, and particularly of similar time intervals in video. We thus call these sequences "2D+T" to distinguish them from 3D sequences, where no regular time relationship, or no time relationship at all, exists between the 2D frames.

The paper is organized as follows:
- Section 2 summarizes the recently developed Bayesian segmentation of still images with a Potts-Markov model [3, 4, 5].
- Section 3 presents and comments the results of this segmentation approach w.r.t. other segmentation techniques leading to quantization-based compression.
- Section 4 summarizes the extension of the Bayesian method for segmenting 2D+T sequences.
- Section 5 shows how to use the results of 2D+T segmentation to obtain quantified motion parameters of different objects in the video scene.

2 Modeling for Still Image Segmentation

2.1 General expression

The description of still image segmentation and fusion by a Bayesian approach and a Hidden Markov Model (HMM) has been developed and well described by Féron et al. in [3, 4, 5]. We summarize in this section the main steps of the computation of the posterior laws needed by the final segmentation-fusion algorithm. We write down the basic relations, explaining that the inverse problem is to find a non-noisy form f of the observed image g and also to extract a segmentation of f in statistically homogeneous regions. The basic expression is:

g = f + ε    (1)

If we assume that ε is a Gaussian noise, we can write:

p(ε) = N(0, σ_ε² I)    (2)

so

p(g|f) = p_ε(g − f) = N(g − f, σ_ε² I)    (3)

Because the goal is to have a reconstructed image f segmented in a limited number of homogeneous regions, a hidden variable z, which can take the discrete values k ∈ {1, ..., K}, is introduced. It represents the image f classified in K classes, thus segmenting the image in regions that can be gathered, or not, to retrieve a segmentation in homogeneous regions.

• In order to estimate the pair (f, z) using the Bayesian approach (see [4, 5]), we write:

p(f, z|g) = p(f|z, g) · p(z|g)    (4)

We thus need to define:
A : p(g|f) and p(f|z) for p(f|z, g), and
B : p(g|z) and p(z) for p(z|g).

• For the A case, we assume first that the additive global noise ε is white and Gaussian, and second that all the pixels r of f that pertain to the same class k have a mean m_k and a variance σ_k². We thus write:

A1) p(f(r)|z(r) = k) = N(m_k, σ_k²)    (5)

and

A2) p(g(r)|f(r)) = N(f(r), σ_ε²)    (6)

• For the B case, we make the same assumptions for p(g|z) and, using (3) and (5), we obtain:

B1) p(g(r)|z(r) = k) = N(m_k, σ_k² + σ_ε²)    (7)

In order to find homogeneous regions, we introduce a spatial dependency via a Potts-Markov random field, which fixes the dependency between the homogeneous zones of an image by means of an attractive/repulsive parameter α:

B2) p(z(r), r ∈ R) = (1/T(α)) exp{ α Σ_{r∈R} Σ_{s∈V(r)} δ(z(r) − z(s)) }    (8)

where V(r) denotes the neighborhood of r.

2.2 Prior selection of the hyperparameters

The prior laws defined in the former paragraph need some parameters, namely σ_ε, σ_k and m_k. So we also assign prior laws to these parameters by means of the "hyper-hyperparameters" α_0, β_0, m_0 and σ_0. We refer to [6] for the choice of these hyper-hyperparameters. So θ = (σ_ε², {m_k, σ_k²}) is the set of hyperparameters, with laws:

p(σ_ε²) ∼ IG(α_0, β_0)    (9)

p(m_k) ∼ N(m_0, σ_0²), k ∈ {1, ..., K}
p(σ_k²) ∼ IG(α_0, β_0), k ∈ {1, ..., K}    (10)
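The Potts prior of Eq. (8) can be made concrete with a small numerical sketch. The helper below is ours, not part of the paper's implementation: it computes the unnormalized log-prior α Σ_r Σ_{s∈V(r)} δ(z(r) − z(s)) of a label image z over a 4-connexity neighborhood.

```python
import numpy as np

def potts_log_prior(z, alpha):
    """Unnormalized log of the Potts prior: alpha times the number of
    4-connected neighbor pairs (r, s) carrying the same label."""
    same_horiz = np.sum(z[:, :-1] == z[:, 1:])  # agreements with right neighbor
    same_vert = np.sum(z[:-1, :] == z[1:, :])   # agreements with bottom neighbor
    return alpha * (same_horiz + same_vert)

# Two homogeneous zones: most neighbor pairs agree, so this labeling is favored.
z = np.array([[0, 0, 1],
              [0, 0, 1]])
print(potts_log_prior(z, alpha=1.0))  # 5 agreeing pairs -> 5.0
```

A larger α enforces larger homogeneous zones, while α near 0 decouples the pixels; T(α) in Eq. (8) is the normalizing constant and is not needed for the label-sampling ratios.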

2.3 A posteriori distributions and Gibbs sampling algorithm

The Bayesian approach now consists in estimating the whole set of variables (f, z, θ) following the joint a posteriori distribution p(f, z, θ|g), according to (4). It is difficult to simulate a joint sample directly from this joint a posteriori distribution. Nevertheless, considering the prior laws defined before, we are able to simulate the conditional a posteriori laws p(f|g, z, θ), p(z|g, θ) and p(θ|g, f, z).

In order to estimate the set of variables (f, z, θ), the following Gibbs sampling algorithm is proposed:

f^(n) ∼ p(f | g, z^(n), θ^(n−1))
z^(n) ∼ p(z | g, θ^(n−1))
θ^(n) ∼ p(θ | f^(n), z^(n), g)    (11)

We refer here to [4, 5] for details on the expressions and a more detailed computational implementation of the conditional posterior laws. This algorithm is iterated a "sufficient" number of times (itermax) in order to reach the convergence of the segmentation. By convergence we mean that the segmented image does not change significantly any more from one iteration to the next. For now we have not fixed any "convergence criterion", but our experience on the images analyzed has shown us that itermax depends essentially on the complexity of the image and the number of luminance levels, as well as on the number of classes taken for the segmentation. In general, for a number of classes K = 4, we take itermax between 20 and 50.

Figure 1. One frame of the sequence Tennis.
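As an illustration of the sampling scheme (11), here is a minimal, self-contained sketch of the z-sampling step. For brevity it reduces θ to the class means m_k (refreshed from the current labeling instead of being sampled) and keeps σ_ε fixed; all names and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_segment(g, K=4, itermax=20, alpha=1.0, sigma_eps=5.0):
    """Sketch of the Gibbs sweep: draw each label z(r) from its conditional
    posterior, i.e. Gaussian likelihood times the Potts neighbor term."""
    H, W = g.shape
    # Spread the initial class means over the gray-level range of g.
    means = np.quantile(g, (np.arange(K) + 0.5) / K)
    z = rng.integers(0, K, size=(H, W))  # Z0: no prior knowledge
    for _ in range(itermax):
        for r in range(H):
            for c in range(W):
                logp = -0.5 * (g[r, c] - means) ** 2 / sigma_eps ** 2
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < H and 0 <= cc < W:
                        # Potts attraction toward the neighbor's label
                        logp += alpha * (np.arange(K) == z[rr, cc])
                p = np.exp(logp - logp.max())
                z[r, c] = rng.choice(K, p=p / p.sum())
        # Crude theta step: refresh the class means from the current labeling.
        means = np.array([g[z == k].mean() if np.any(z == k) else means[k]
                          for k in range(K)])
    return z
```

On a synthetic two-level image, with a likelihood this sharp, the sampler separates the two regions within the first few sweeps.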

3 Still Image Segmentation

We present here the result of the Bayesian segmentation, described in the former section, for one frame of a sequence. We consider here the frame as a still, isolated image. The segmentation is done in 4 classes.

Figure 2. Segmentation of one frame in 4 classes, in the Bayesian Markov-Potts framework.

4 Video Sequences Segmentation

As recalled in the introduction, a video sequence can be presented as a series of still images, called frames, exhibiting an inter-relation, called "motion", between the objects or the regions pertaining to two successive frames. The algorithm for segmenting a video sequence is now described with the following steps:
1) We select the number K of labels (or classes). For the Tennis sequence we take K = 4. For more complex frames we would take K = 6 or K = 8, according to the higher number of different objects or regions in the frames.
2) We run the algorithm on the first frame of the sequence with a high number of iterations. This is because we have no prior information on Z_n(r), the segmented image: Z_0(r), the initialization taken for Z_1(r), is made without any prior knowledge, and the segmentation is totally unsupervised.
3) Each frame f_n is then segmented on the basis of the segmentation of frame f_{n−1}, i.e. the image Z_{n−1}(r). Thanks to this knowledge of Z_{n−1}(r), the number of iterations between frames is reduced to a low value (e.g. 6).
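The three-step schedule above can be sketched as a small driver. The segmenter below is a deliberately simplified stand-in (an iterative relabeling toward class means, not the Gibbs sampler of Section 2), used only to illustrate the warm start of frame n from Z_{n−1}; all names and the iteration counts are illustrative.

```python
import numpy as np

def segment(frame, K, itermax, z_init=None):
    """Toy segmenter: relabel each pixel to the nearest class mean.
    With z_init given, the class means start from the previous segmentation."""
    if z_init is None:
        means = np.quantile(frame, (np.arange(K) + 0.5) / K)  # blind start
    else:
        means = np.array([frame[z_init == k].mean() if np.any(z_init == k)
                          else frame.mean() for k in range(K)])
    for _ in range(itermax):
        z = np.abs(frame[..., None] - means).argmin(axis=-1)
        means = np.array([frame[z == k].mean() if np.any(z == k) else means[k]
                          for k in range(K)])
    return z

def segment_sequence(frames, K=4, first_iters=20, next_iters=6):
    zs = [segment(frames[0], K, first_iters)]           # step 2: no prior on Z
    for f in frames[1:]:                                # step 3: warm start
        zs.append(segment(f, K, next_iters, z_init=zs[-1]))
    return zs
```

The point of the warm start is that frame f_n only has to refine Z_{n−1}(r) rather than rebuild a segmentation from scratch, which is what allows the low per-frame iteration count.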

Figure 4. Bayesian segmentation of 6 frames of the sequence Tennis. Parameters are K = 4; iterations for the first frame = 20; iterations for successive frames = 6.

5 Quantification of kinematic parameters

Through a good segmentation of a 2D+T sequence in homogeneous regions, we are able to compute the kinematic parameters related to different regions or objects of the sequence. If the sequence, for example Tennis here, is segmented in K = 4 classes and we want to quantify the motion of the tennis ball between two frames, we consider only the image z_k(r), where k ∈ {1, ..., K} is the class to which the tennis ball pertains. We then display only the objects that pertain to this class. This provides a binary image where all the regions pertaining to this class k are serially numbered according to the homogeneity of the region (4-connexity here). We assign each region a label number; thus each pixel of a region carries the same number. We compute the "mass" of the object of interest and its centroid (barycenter) by using the label numbers assigned to the pixels which compose this object. We repeat the same operation for the next frame. We then look for objects (of class k) between frame n and frame n+1 that pertain to a limited "motion window" and have a close mass (a maximum difference of 5 percent between the masses for the object we "track"). This enables us to follow the object of interest in an unsupervised way.

It is now easy to get the motion vector of the object between two frames by computing the distance between the centroids of the object in frame n and frame n+1. This is shown in Figure 7, where the centroids B1(X1, Y1) and B2(X2, Y2) are represented by circles at the center of the ball for frames n and n+1. In the example shown here we can accurately give the speed: 1 pixel/frame along the Ox axis and 16 pixels/frame along the Oy axis. For large motions between two frames, we intend to use all the iterations of the Gibbs algorithm between two frames, which enables us to restrict the displacement between two iterations and to facilitate the "object tracking".

Figure 3. Six frames of the original Tennis sequence. One frame size is 320 × 240.
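The tracking procedure described above (4-connexity labeling, mass and centroid computation, matching by close mass inside a motion window) can be sketched as follows. The helper names are ours, and a plain breadth-first flood fill stands in for whatever labeling routine the paper actually used.

```python
import numpy as np
from collections import deque

def label_regions(mask):
    """Serially number the 4-connected regions of a binary class image z_k."""
    labels = np.zeros(mask.shape, dtype=int)
    n = 0
    for r, c in zip(*np.nonzero(mask)):
        if labels[r, c]:
            continue
        n += 1
        labels[r, c] = n
        q = deque([(r, c)])
        while q:  # breadth-first flood fill of one region
            y, x = q.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                if (0 <= yy < mask.shape[0] and 0 <= xx < mask.shape[1]
                        and mask[yy, xx] and not labels[yy, xx]):
                    labels[yy, xx] = n
                    q.append((yy, xx))
    return labels, n

def objects(labels, n):
    """(mass, centroid-y, centroid-x) of each labeled region."""
    out = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)
        out.append((len(ys), ys.mean(), xs.mean()))
    return out

def motion_vector(objs_n, objs_n1, mass_tol=0.05, window=20.0):
    """Match an object across frames by close mass (5% here) within a limited
    motion window, and return the centroid displacement (dy, dx)."""
    for m0, y0, x0 in objs_n:
        for m1, y1, x1 in objs_n1:
            if abs(m1 - m0) <= mass_tol * m0 and np.hypot(y1 - y0, x1 - x0) <= window:
                return (y1 - y0, x1 - x0)
    return None
```

On two synthetic binary frames where a 3 × 3 "ball" moves by 16 pixels along Oy and 1 pixel along Ox, `motion_vector` returns (16.0, 1.0), i.e. the kind of speed measurement reported for Figure 7.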

Figure 7. Frames 1 and 2 superimposed, with a zoom on the ball. The centroid of the ball is indicated by the two circles for frames 1 and 2, and the arrow materializes the motion vector.

Figure 5. The start frame (1) of the sequence plus the five successive iterations that lead to the segmentation of the next frame (2) of the sequence. We notice that the final segmentation is already reached at the 4th iteration (the segmentations of iterations 4 and 5 are identical).

6 Conclusion

We have described and applied a Bayesian method for the segmentation of a video (2D+T) sequence. The interest of such a work resides in different aspects:
- A low number of classes for the segmentation makes it possible to quantize an 8-bit image on only 2 bits, while keeping a good homogeneity between regions. This shows the possibility of video compression through segmentation.
- The analysis of the motion of regions pertaining to the same class, in the final segmented frames, is used for the motion computation of specific regions or objects. This is done in a totally unsupervised way, by using the displacement of the mass center of an object in a limited "motion window".
- Our framework of applications is rather slow, but it targets accurate and unsupervised motion computation and modeling. In particular, we have started applying these methods to biocellular medical imaging analysis.
- This work will soon be compared, and possibly combined, with the faster methods described in [2], where the motion is first estimated by wavelets tuned to motion, and which could conversely lead to a segmentation based on motion.

Figure 6. The regions of class 4, to which the ball pertains.


References

[1] P. Brault, "On the Performances and Improvements of Motion-Tuned Wavelets for Motion Estimation", WSEAS04 ICOPLI, Jan. 2004. Invited paper. (Published in Transactions on Electronics, Issue 1, Vol. 1, Jan. 2004.)

[2] P. Brault, "A New Scheme For Object-Oriented Video Compression And Scene Analysis, Based On Motion Tuned Spatio-Temporal Wavelet Family And Trajectory Identification", IEEE ISSPIT03, Int'l Conf. on Signal Proc. and Info. Techno., Darmstadt, Dec. 2003.

[3] O. Féron and A. Mohammad-Djafari, "A Hidden Markov model for Bayesian data fusion of multivariate signals", presented at the Fifth Int. Triennial Calcutta Symposium on Probability and Statistics, 28-31 December 2003, Dept. of Statistics, Calcutta University, Kolkata, India.

[4] O. Féron and A. Mohammad-Djafari, "A hidden Markov model for image fusion and their joint segmentation in medical image computing", submitted to MICCAI, St. Malo, France, 2004.

[5] O. Féron and A. Mohammad-Djafari, "Image fusion and unsupervised joint segmentation using a HMM and MCMC algorithms", submitted to J. of Electronic Imaging, April 2004.

[6] H. Snoussi and A. Mohammad-Djafari, "Fast Joint Separation and Segmentation of Mixed Images", Journal of Electronic Imaging, Vol. 13(2), April 2004.

