APPLYING LEVEL SET THEORY TO DIGITAL VIDEO SEGMENTATION

Proceedings of the International Workshop on Very Low Bitrate Video Coding, Athens, October 2001

Patrick Kehoe, Richard B. Reilly

Dept. Electronic & Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland.

ABSTRACT

The emergence of the MPEG-4 standard has brought the area of video segmentation and contour tracking to the forefront of research. In MPEG-4, video sequences are decomposed into various video object planes (VOPs), which are coded separately into the video stream. This has increased the need for good video segmentation techniques. Inspired by recent advances in level set theory that give excellent results for image segmentation, albeit at very high complexity, this paper looks at how level sets can be applied to video segmentation. A framework is proposed and implemented to reduce the complexity of using level sets for video segmentation. Finally, results are given that show excellent segmentation with significant complexity reductions. Future research directions are then discussed.

Figure 1: Segmentation of a video frame into a set of VOPs

1. INTRODUCTION

Emerging video coding standards such as MPEG-4 have made significant progress over previous video coding standards [1,2]. While coding efficiency is improved, the standard also addresses the increasing importance of content-based functionality, interactivity and universal accessibility. Instead of representing video sequences as rectangular arrays of pixels, MPEG-4 represents each scene as a composition of video object planes (VOPs), with each VOP corresponding to one semantically meaningful object (Figure 1). The use of VOPs allows for improved coding efficiency and flexibility across a variety of different rates. It permits, for example, the transmission of variable frame rate video coding, with the perceptually more important objects being transmitted at higher rates. Separating the video content into various objects also provides increased flexibility in the reconstruction by allowing for selective decoding based on available bandwidth, which is very attractive for applications such as video telephony.

To date, segmentation of images or video has been addressed using a variety of approaches. Low-level segmentation techniques, in which the image sequence is simply modelled as a statistical distribution of pixel values in both space and time, employ features such as colour, optical flow, gradient, texture and motion to segment images [3,4,5,6]. One such low-level approach that has gained popularity in image segmentation is the use of level set theory. Level sets have been applied to numerous problems in physics, medical imaging, image processing and computer vision, all yielding excellent results [7,9]. They do, on the other hand, come at very high complexity [8]. They are nevertheless very useful for video and image processing because they handle topological changes (splitting and merging) very easily and a quite stable numerical approximation scheme is available for their use. Although generic segmentation techniques are the long-term goal of segmentation research, this paper focuses on tracking objects that show high gradient in an image. In particular, faces are tracked in various video sequences.

In section 2 the level set approach is looked at in more detail. In section 3, methods for reducing the complexity of this approach are proposed. In sections 4, 5 and 6 we explain how this can be applied to video segmentation and propose a method to exploit the temporal changes in video, which is important for reducing the computation needed to track moving objects. In section 8 the results are given for the proposed framework, and in section 9 future ideas and directions for research are proposed.

2. THE LEVEL SET APPROACH

The level set method is an approach that was devised by Osher and Sethian [9]. It is a simple and versatile method for computing and analysing the motion of an interface Γ in two dimensions. Γ bounds a region Ω (possibly multiply connected). The goal is to compute and analyse the subsequent motion of Γ under a velocity field F. This velocity can depend on position, time, the geometry of the interface and the external physics. Topological merging and breaking are well defined and easily performed. The speed is made dependent on the curvature, as it can be shown that this adds stability to the propagating interface and prevents the interface from crossing over itself. The movement of the interface is described by a Hamilton-Jacobi type equation (equation 1) for a function in which the interface is a particular level set.

$$\Phi_t + F\,|\nabla\Phi| = 0 \qquad (1)$$

The central idea of the level set approach is to represent the front, Γ, as the zero level set of a higher dimensional function Φ (Figure 2).

Previous schemes that work on moving two-dimensional curves suffer from instability [9]. Moving the problem to one higher dimension gives the method tremendous generality and makes it possible to build highly accurate schemes to approximate the equations of motion. In such a setting, topological changes are handled in a natural manner, and the technique extends trivially to three dimensions. The way the propagating front moves over the image depends on the speed term F(κ), where κ represents the curvature. In general, the speed function can be split into two components:

$$F = F_a + F_g(\kappa) \qquad (2)$$

The first term, $F_a$, is the advection term. $F_a$ is a constant, independent of the geometry of the moving front. The front uniformly expands or contracts depending on the sign of $F_a$. The second term, $F_g(\kappa)$, depends on the geometry of the front, such as its local curvature κ. It is usually set equal to −εκ, where ε is a small positive constant. This diffusion term smooths out the high-curvature regions of the front. It has a regularising effect on the front and is necessary for stability, allowing the entropy condition to hold [9]. The level set method presented above is relatively straightforward to program. However, it is not particularly fast, nor does it make efficient use of computational resources. In the next section a variety of modifications are looked at to speed up the algorithm.
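As a minimal sketch of equations 1 and 2, one explicit time step might look as follows. It assumes NumPy, unit grid spacing and a level set function Φ that is negative inside the front; these conventions, the function names and the parameter values are illustrative assumptions and are not taken from the paper. Central differences are used for brevity, whereas Sethian [9] describes the upwind schemes that a robust implementation would use for the advection term.

```python
import numpy as np

def curvature(phi, tiny=1e-8):
    """Curvature of the level sets of phi: div(grad(phi) / |grad(phi)|)."""
    phi_y, phi_x = np.gradient(phi)            # axis 0 is y (rows), axis 1 is x (columns)
    norm = np.hypot(phi_x, phi_y) + tiny       # avoid division by zero on flat regions
    return np.gradient(phi_x / norm, axis=1) + np.gradient(phi_y / norm, axis=0)

def evolve(phi, f_a=1.0, epsilon=0.1, dt=0.1, steps=1):
    """Forward-Euler steps of phi_t + F |grad(phi)| = 0 with F = f_a - epsilon * kappa."""
    phi = np.array(phi, dtype=float)           # work on a copy of the input
    for _ in range(steps):
        phi_y, phi_x = np.gradient(phi)
        speed = f_a - epsilon * curvature(phi)        # equation 2 with F_g = -epsilon * kappa
        phi -= dt * speed * np.hypot(phi_x, phi_y)    # equation 1
    return phi

# Usage: a circle of radius 10 grows for f_a > 0 and shrinks for f_a < 0.
yy, xx = np.mgrid[0:64, 0:64]
phi0 = np.hypot(xx - 32.0, yy - 32.0) - 10.0   # signed distance, negative inside
grown = evolve(phi0, f_a=1.0, steps=20)
shrunk = evolve(phi0, f_a=-1.0, steps=20)
```

With this sign convention a positive $F_a$ moves the front along its outward normal, and the −εκ term slows convex parts of the front, which is the smoothing effect described above.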

3. REDUCTIONS IN COMPLEXITY

3.1 Narrow Band Method

The level set approach has been applied to image and video processing with excellent results, but at an extremely high computational cost. Performing calculations over an n×n image requires $O(n^2)$ operations per time step. An efficient modification leads to the Narrow Band approach, in which computation is performed only within a neighbourhood of the zero level set [9]. In this method only pixels within a certain distance d of the zero level set are updated at each time step, reducing the complexity from $O(n^2)$ to $O(kn)$, where k is the width of the narrow band in cells.
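The narrow-band idea can be sketched as below, again assuming NumPy and a Φ that approximates a signed distance function, so that |Φ| < d selects the band. The function name, the in-place update and the border handling are assumptions made for illustration. Rebuilding the band as the front approaches its edge, which a full implementation needs, is omitted.

```python
import numpy as np

def narrow_band_step(phi, speed, d=3.0, dt=0.1):
    """Update phi in place, only where |phi| < d. Returns the number of band pixels."""
    rows, cols = np.nonzero(np.abs(phi) < d)        # pixels inside the narrow band
    h, w = phi.shape
    up, down = np.clip(rows - 1, 0, h - 1), np.clip(rows + 1, 0, h - 1)
    left, right = np.clip(cols - 1, 0, w - 1), np.clip(cols + 1, 0, w - 1)
    # Central differences evaluated at band pixels only (halved one-sided at the border)
    phi_y = 0.5 * (phi[down, cols] - phi[up, cols])
    phi_x = 0.5 * (phi[rows, right] - phi[rows, left])
    phi[rows, cols] -= dt * speed[rows, cols] * np.hypot(phi_x, phi_y)
    return rows.size

# Usage: only the ring of pixels around the circle is touched.
yy, xx = np.mgrid[0:64, 0:64]
phi = np.hypot(xx - 32.0, yy - 32.0) - 10.0
before = phi.copy()
n_band = narrow_band_step(phi, np.ones_like(phi), d=3.0, dt=0.1)
```

The work per step is proportional to `n_band` rather than to the number of pixels in the image, which is the source of the saving described above.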

Figure 2: Φ, the level set function derived from Γ.

3.2 Sub-Sampling

Another simple yet effective method proposed to reduce the complexity of the algorithm is to sub-sample the original n×n image by a factor k. The image is reduced to size n/k × n/k, so the area is reduced by a factor of $k^2$. This also reduces the complexity by a factor of $k^2$. The level set method can be applied to the sub-sampled image, and the resulting boundary can be extrapolated onto the larger n×n image. The resulting zero level set is then very close to the boundary of the object in the n×n image (within k pixels). A narrow-band level set method can then be applied to adapt it to the nearby boundary.

4. APPLICATIONS TO VIDEO SEGMENTATION

Video sequences consist of time-varying images stored as rectangular arrays. The main challenge in video segmentation is to quickly identify the object to be segmented in the image and then to track it successfully over the entire sequence of images. The first problem is to find the object in the image. This is done using the level set methods discussed in section 2 with the complexity reductions from section 3. The initial level set, Γ, starts at the edges of the video frame and propagates inwards with speed F. Section 4.1 shows how the speed function F is generated and section 4.2 explains the stopping criterion.

4.1 Speed Properties

Features such as colour, gradient, motion vectors and texture can be used to influence the speed of the propagating level set and so help identify objects in image sequences. In our work, the gradient information present in the image was used to adapt the evolving level set to the object surface. Useful image features are edge boundaries. By detecting areas of high gradient in an image, a stopping criterion for the propagating level set can be determined [6,7]. This is done by multiplying the speed function F from equation 2 by the quantity g:

$$g(x,y) = \frac{1}{1 + |\nabla u(x,y)|} \qquad (3)$$

Here |∇u(x,y)| represents the magnitude of the gradient of the image pixel values u(x,y). The function g(x,y) is close to unity away from boundaries and drops towards zero where the image gradient is large. So, by multiplying the speed function by g, the interface speed can be reduced to almost zero near the edges of an object. The problem with g(x,y), though, is that although it slows the propagation near the edges of the desired object, it never actually reduces the speed fully to zero. It is more desirable to make the speed exactly zero in areas of high gradient. Thus a piece-wise threshold function h is applied to g (equation 4, Figure 3).

$$h(g(x,y)) = \begin{cases} 0 & \text{if } g(x,y) < 0.3 \\ g(x,y) & \text{if } 0.3 < g(x,y) < 0.7 \\ 1 & \text{if } 0.7 < g(x,y) \le 1 \end{cases} \qquad (4)$$

Figure 3: Piece-wise linear stopping function h(g(x, y)).

4.2 Stopping Criterion

Ideally the zero level set would stop moving when it reached the surface of the desired object after applying h∘g, but in practice, due to varying lighting conditions or shadows in an image, the speed of the zero level set is not reduced to zero at all points, as the gradient may not be steep enough. This can cause a portion of the level set to move inside the desired object and fill it. To overcome this problem, a stopping criterion that measures the average speed of the zero level set is critical in video segmentation. It is useful to define an indicator function I over the whole image (m×n pixels), derived from the level set function Φ. Thus,

$$I(i,j) = \begin{cases} 255 & \text{if } \Phi(i,j) = 0 \\ 0 & \text{if } \Phi(i,j) \neq 0 \end{cases} \qquad (5)$$

Figure 4: Indicator function I, generated from the level set function Φ.
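Equations 3 to 5 can be sketched in a few lines of NumPy. The function names are illustrative. Two choices here are assumptions rather than statements from the paper: values of g exactly equal to 0.3 or 0.7 are passed through unchanged, and, since a sampled Φ is rarely exactly zero, a pixel is treated as lying on the zero level set when Φ is zero or changes sign towards its right or lower neighbour.

```python
import numpy as np

def edge_speed(u):
    """g of equation 3: 1 / (1 + |grad u|), using central differences."""
    u_y, u_x = np.gradient(np.asarray(u, dtype=float))
    return 1.0 / (1.0 + np.hypot(u_x, u_y))

def threshold_speed(g, low=0.3, high=0.7):
    """h of equation 4: 0 below low, 1 above high, g itself in between."""
    return np.where(g < low, 0.0, np.where(g > high, 1.0, g))

def indicator(phi):
    """I of equation 5: 255 on the zero level set, 0 elsewhere (sign-change test)."""
    inside = phi < 0
    on_front = phi == 0
    on_front[:, :-1] |= inside[:, :-1] != inside[:, 1:]   # sign change to the right
    on_front[:-1, :] |= inside[:-1, :] != inside[1:, :]   # sign change downwards
    return np.where(on_front, 255, 0).astype(np.uint8)

# Usage: a vertical step edge of height 100 stops the front completely.
u = np.zeros((8, 8))
u[:, 4:] = 100.0
g = edge_speed(u)
h = threshold_speed(g)
```

In this example g is 1 far from the step and 1/51 next to it, so h is 1 in the flat regions and exactly 0 at the edge, which is the behaviour that g alone cannot provide.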

This means that I acts like a scaled indicator function of the zero level set over the whole image. This is shown in Figure 4, with black pixels representing 255 and white pixels representing 0. How I varies over time indicates whether the curve is still moving fast or is approaching the edges of the desired object. Since the speed of the zero level set does not always reduce to zero at all points (due to the presence of light and shadows on the object we wish to track), it is desirable to use a measure that takes account of the total zero-level propagation. The function $J_k$ measures the movement of the curve between the (k−1)th and kth steps. Thus,

$$J_k = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| I_k(i,j) - I_{k-1}(i,j) \right| \qquad (6)$$

$J_k$ is a measure of the change of the level set at the kth time step. This is a good measure of whether the curve is slowing down and approaching the desired object. To avoid reacting to instantaneous changes, $J_k$ is averaged over the last four time steps. When this average is less than a certain threshold, Thresh, the zero level set has reached the surface of the desired object. Thus when

$$\frac{1}{4} \sum_{j=0}^{3} J_{k-j} \le \text{Thresh} \qquad (7)$$

the zero level set has approached the surface of the desired object.

5. TEMPORAL CORRELATION

In video sequences there is huge correlation between successive frames. This fact is exploited in video coding schemes (H.261, H.263, MPEG-1, 2 and 4). The same holds in video segmentation, where objects generally move only a small amount between successive frames. Some previous work on level set methods in video segmentation has not exploited this fact [8]. In that work the initial curve, Γ, for each frame starts at the boundary of the frame and propagates inwards. In this section an algorithm is proposed to exploit the temporal correlation between frames so that the initial curve Γ is near the object. The algorithm consists of two parts. Firstly, once an object has been tracked, there exists a curve, Γ, from the previous frame. From this a level set function Φ can be generated. This is moved downward by 2-3 units, causing Γ to move outward.

Figure 5: Moving Γ outwards by lowering the level set function Φ.

This new curve ϒ is outside the object but still close to it (Figure 5). For the next frame, k+1, it is reasonable to guess that the object is near where it was in the kth frame, so ϒ is a good initial curve from which the level set can propagate. But the object is often moving in some direction over a sequence of frames. To get a better estimate of where the object will be in the (k+1)th frame, the centroids $c_k$ and $c_{k-1}$ of the kth and (k−1)th frames are calculated. The difference between these two values gives a motion vector v. The initial condition for the level set method in the (k+1)th frame is then ϒ shifted by the motion vector v. In the case of the first frame, v is set to zero and the initial curve is the edges of the image.

6. IMPLEMENTATION OF VIDEO SEGMENTATION

The implementation of the proposed video segmentation framework consists of two stages. The initial stage consists of reading in frame k, which is down-sampled by a factor d. The narrow-band level set method is then applied to track the object in the image, starting from a curve Γ. If it is the first frame then Γ starts from the boundary of the image; otherwise it is generated from the previous segmented frame. Once the object has been tracked, the zero level set is up-sampled by the same factor d (Figure 6). The zero level set can then be adapted further to the object's boundary by using the narrow-band method at high resolution. Because the up-sampled zero level set is very close to the object, the narrow-band method converges very quickly. The complexity of this method is mainly determined by the intermediate stage (applying the narrow-band method to the down-sampled image), as this is more computationally intensive than the simple operations of up- and down-sampling. The complexity of this method is roughly $C_d$, where $C_d$ is the complexity of applying the narrow-band method to the down-sampled image. The relationship between $C_d$ and $C_1$ (the complexity associated with applying the level set method to the original image) is given in equation 8. This relationship holds regardless of the definition of complexity, as down-sampling the image (and so the level set function Φ) by a factor of d scales the computations involved in the level set algorithm of equation 1 by a factor of $1/d^2$.
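The stopping test of equations 6 and 7 and the temporal initialisation of section 5 can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: Φ is taken to be negative inside the object, the centroid is taken over that region, the shift is rounded to whole pixels, and the test waits for a full window of four values before it can fire. All names are hypothetical.

```python
from collections import deque
import numpy as np

def movement_measure(i_now, i_prev):
    """J_k of equation 6: mean absolute change of the indicator image."""
    return float(np.mean(np.abs(i_now.astype(float) - i_prev.astype(float))))

class StoppingTest:
    """Equation 7: stop once the mean of the last `window` values of J_k is <= thresh."""
    def __init__(self, thresh, window=4):
        self.thresh = thresh
        self.history = deque(maxlen=window)

    def update(self, j_k):
        self.history.append(j_k)
        if len(self.history) < self.history.maxlen:   # assumption: wait for a full window
            return False
        return sum(self.history) / len(self.history) <= self.thresh

def centroid(phi):
    """Centroid (row, col) of the region enclosed by the front, taken as phi < 0."""
    rows, cols = np.nonzero(phi < 0)
    return np.array([rows.mean(), cols.mean()])

def next_frame_init(phi_k, c_k, c_prev=None, lower_by=2.0):
    """Lower phi (pushing the front outward) and shift it by v = c_k - c_prev."""
    phi = phi_k - lower_by
    if c_prev is None:                # no motion estimate yet, so v = 0
        return phi
    d_row, d_col = (int(s) for s in np.rint(c_k - c_prev))
    # np.roll wraps around the image border, acceptable only for small shifts
    return np.roll(phi, shift=(d_row, d_col), axis=(0, 1))

# Usage: an object that moved 3 columns to the right between frames k-1 and k.
yy, xx = np.mgrid[0:40, 0:40]
phi_k = np.hypot(xx - 20.0, yy - 20.0) - 5.0
init = next_frame_init(phi_k, centroid(phi_k), c_prev=np.array([20.0, 17.0]))
```

The returned `init` encloses a slightly larger region than the tracked object, displaced along the estimated motion, so the narrow-band iteration for the next frame starts close to the expected boundary.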

$$C_d \approx \frac{C_1}{d^2} \qquad (8)$$

[Figure 6 panels: original "Miss America" image, 288×360 pixels; image down-sampled by 4; level set iterated, locating the face boundary using the gradient as image speed; result up-sampled by 4.]

Figure 6: The first frame of the "Miss America" sequence is down-sampled by a factor of 4 and then the face is tracked using the narrow-band level set method.

The above covers the first part of the segmentation framework. The second stage is to exploit the temporal correlation in the video sequence. The initial curve Γ for the next frame is calculated from the final zero level set of the previous frame using the algorithm from section 5, rather than starting from the edges of the image. This should reduce the complexity considerably, as the curve is already very near the object before the narrow-band method is applied, compared to starting at the edges. This gain is achieved for each frame in the video sequence and can be quite considerable for long video sequences.

8. RESULTS

Sub-sampling was used in conjunction with the narrow-band level set method to segment the first frame of the "Miss America" sequence (Figure 6). The first frame (288×360) was down-sampled by different factors d (1, 2, 3, 4 and 6). The narrow-band level set method was then applied to the sub-sampled image, with the speed function multiplied by the image speed term g(x,y) of equation 3 above. The resulting zero level set was then extrapolated to the original-sized image. As can be seen from Figure 7, this reduced computational complexity considerably. Complexity was measured by both CPU time and floating-point operations. Both data sets were normalised against the maximum complexity (the complexity associated with the un-sampled image, for which the down-sample factor is 1). The theoretical curve is generated from equation 8. There was an 80% reduction in computation time/complexity in the Miss America sequence (288×360 pixels, 150 frames, 30 fps) at a down-sample factor of 4 when the temporal algorithm of section 5 was used, as opposed to restarting from the edges of the image.

Figure 7: Relative complexity vs. down-sample factor, d

9. DISCUSSION & CONCLUSION

The level set approach provides a useful tool for image and video processing applications. While computationally intensive, it yields good results. A variety of methods can be applied to reduce complexity on both a per-frame and an inter-frame basis. In this paper sub-sampling is suggested as a way to reduce the complexity of the problem considerably. When combined with the Narrow Band Method the reduction in complexity is considerable, showing that there may be a case for the use of level sets in video segmentation. A temporal algorithm is also presented that reduces the complexity of tracking objects over a sequence of frames.
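The $1/d^2$ scaling of equation 8 can be tabulated for the down-sample factors used in the experiments. The short sketch below only evaluates the theoretical relation for a 288×360 frame; it does not reproduce the measured CPU times or operation counts, and the function name is illustrative.

```python
def relative_complexity(d):
    """C_d / C_1 from equation 8."""
    return 1.0 / d ** 2

rows, cols = 288, 360                      # frame size quoted in the results
table = [(d, rows // d, cols // d, relative_complexity(d)) for d in (1, 2, 3, 4, 6)]
for d, r, c, rel in table:
    print(f"d={d}: {r}x{c} pixels, C_d/C_1 = {rel:.3f}")
```

For d = 4 the relation gives 1/16, that is, roughly 6% of the full-resolution cost before the final high-resolution narrow-band refinement is added.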

As ongoing research, it would be interesting to apply even faster and more recent level set algorithms, such as the fast marching method [12], to find the object in each frame. Recent work on metamorphosis of level sets is being looked at as a method for adapting the level set to a changing object boundary [6,10,11], for even better tracking of objects. For high-action sequences it may be necessary to reset the initial curve, Γ, for each frame to the border of the frame to ensure stability. The use of other features such as texture, colour and motion in conjunction with gradient information to give an improved speed function for detecting object boundaries is currently being researched.

10. REFERENCES

[1] F. Pereira, "MPEG-4: Why, What, How and When?", Signal Processing: Image Communication, Vol. 15, 2000, pp. 271-279.
[2] T. Sikora, "The MPEG-4 Video Standard Verification Model", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 7, No. 1, Feb. 1997, pp. 19-31.
[3] P. Campadelli, D. Medici, R. Schettini, "Colour Image Segmentation Using Hopfield Networks", Image and Vision Computing, Vol. 15, pp. 161-166.
[4] J. Benois-Pineau, F. Morier, D. Barba, H. Sanson, "Hierarchical Segmentation of Video Sequences for Content Manipulation and Adaptive Coding", Signal Processing, Special Issue on Video Sequence Segmentation, Vol. 66(2), 1998.
[5] P. Kehoe, P. Sharma, R.B. Reilly, "Digital Video Segmentation Using Level Sets", Proc. of ISSC, Maynooth, June 2001.
[6] N. Paragios, R. Deriche, "Geodesic Active Contours and Level Sets for the Detection and Tracking of Moving Objects", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 3, March 2000.
[7] S. Osher, J.A. Sethian, "Fronts Propagating with Curvature-Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations", Journal of Computational Physics, Vol. 79, pp. 12-49, 1988.
[8] P. Harper, R.B. Reilly, "Color Video Segmentation using Level Sets", Proc. IEEE International Conf. on Image Processing, Vancouver, September 2000.
[9] J.A. Sethian, Level Set Methods, Cambridge University Press, 1996.
[10] D.E. Breen, R.T. Whitaker, "A Level-Set Approach for the Metamorphosis of Solid Models", IEEE Transactions on Visualization and Computer Graphics, October 2000.
[11] T. Chan, L.A. Vese, "Active Contours Without Edges", submitted to IEEE Trans. on Image Processing, 2001.
[12] J.A. Sethian, "A Fast Marching Level Set Method for Monotonically Advancing Fronts", Proceedings of the National Academy of Sciences, 1995.
