Temporal Edges and Spatial Classification for Video Object Segmentation

Yuh Ren Choo, Pau-Choo Chung, Chich-Ling Huang and Jar-Ferr Yang
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
[email protected]

Tachun Wang and Chen-Chiung Hsieh
Institute for Information Industry, Taipei, Taiwan, R.O.C.

Abstract

Extracting semantic video object planes (VOPs) is an important step toward the success of MPEG-4. In video object plane segmentation, combining spatial and temporal information has been regarded as a promising approach. This paper proposes a spatio-temporal algorithm for extracting the VOPs of image sequences. The method differs from traditional algorithms in that it is based on the combination of temporal edges with an asymmetric fuzzy-C-mean spatial region classification. With the proposed temporal edges, the generality of temporal information is retained while, in contrast to motion vectors, computation time is saved. The proposed spatial classification scheme, the asymmetric fuzzy-C-mean, takes into account the degrees of dispersion and the orientations of the pattern distributions, so more accurate classification results can be obtained. Experiments have shown that the proposed spatio-temporal algorithm can effectively segment video object planes against a nearly static background, without being affected by lighting sources and shadows.

I. Introduction

With the increasing popularity of multimedia applications, new coding techniques that allow variable bit-rate transmission and content-based interactivity are necessary. To meet this need, the MPEG-4 standard was developed around the concept of video object planes (VOPs) in order to provide region-based coding along with content-based interaction. For this reason, the decomposition of video images into VOPs has been considered one of the essential steps for the success of MPEG-4.

Related literature on VOP segmentation has been published [1-6]. Some existing methods used the watershed transform for region partition and mapped the regions onto the subsequent frame [2][7], while others used joint features of intensity and motion vectors to divide the image into disjoint regions. These approaches decompose the sequence images into individual regions rather than into complete, meaningful contents. Furthermore, these decomposition methods are easily affected by the inhomogeneity of the applied low-level features. On the other hand, motion fields have been adopted as a basic feature for combining regions into VOPs and for tracking regions in the subsequent frame [1-3][6].

Though motion fields can in theory be used, they are extremely sensitive to noise. Two further problems are inherent to motion fields: their computation is very time-consuming, and their accurate estimation is hindered when a region has very smooth or periodic features.

This paper proposes a new approach for VOP segmentation by extracting moving objects from an image sequence. Since a moving object is characterized by a motion different from that of the background, some type of motion information has to be incorporated into the segmentation procedure. However, instead of using motion fields, our approach uses temporal edges, thereby avoiding the problems associated with the estimation of motion fields.

Besides the motion information from the temporal domain, a spatial segmentation result is also combined into the proposed VOP segmentation method. An asymmetric fuzzy-C-mean algorithm is proposed for the spatial-domain segmentation. Embedded in the asymmetric fuzzy-C-mean algorithm are the orientations and degrees of dispersion of the feature distributions. In a real image sequence, it is common to see noise and intensive details that deteriorate the segmentation results. In contrast to the conventional fuzzy-C-mean, the proposed asymmetric fuzzy-C-mean is also designed to take contextual information into its operations, which is an effective approach to reducing the interference resulting from noise and image details.

This paper is organized as follows. Section II describes the employment of the proposed temporal edges in VOP generation. The complete algorithm of VOP generation using the proposed asymmetric fuzzy-C-mean approach is introduced in Section III. The results are presented in Section IV. Finally, we state conclusions in Section V.

II. VOP generation from temporal edges

Motion is a basic characteristic associated with moving objects and thus with VOPs. For this reason, temporal information is regarded as an important feature for segmenting VOPs. Temporal information is commonly represented by motion vectors or optical-flow fields. To compute the motion vectors, steepest-descent or various block-matching algorithms have been employed. Whichever approach is used, the computational load is extremely heavy and the result is highly sensitive to noise or luminance changes; a slight interference can produce largely incorrect results. Furthermore, moving regions must contain sufficient texture to generate optical flow, since the luminance in the interior of homogeneous moving regions remains constant. All of these reasons make accurate computation of motion fields a very challenging task.

As an alternative, this paper proposes the use of temporal edges as the temporal information. Temporal edge detection is performed on the edge images throughout the video sequence. Common edge detectors, such as the high-pass Sobel or Laplacian operators, can be applied to obtain the edge image by capturing the high-frequency components, i.e., the abrupt changes in gray levels. Alternatively, in this paper a "DownLevel" operator performed on the image luminance is adopted to obtain the edge image. The DownLevel operation is performed as follows. The gray value of the image, I(x,y), is quantized from 256 scales (0-255) to 16 scales (0-15) by the following equation:

    Y(x,y) = I(x,y) >> 4                                      (1)

where the operation I(x,y) >> 4 represents right-shifting the image I(x,y) by 4 bits.
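As an illustration, the DownLevel quantization of Eq. (1), combined with the 8-neighbor comparison that marks edge pixels, can be sketched in NumPy. The paper gives no implementation, so the function below is a hypothetical sketch; handling the image border by edge replication is our assumption.

```python
import numpy as np

def downlevel_edges(image: np.ndarray) -> np.ndarray:
    """Mark edge pixels via the DownLevel quantization of Eq. (1).

    `image` is a 2-D uint8 luminance array. A pixel is marked as an
    edge when its quantized value differs from that of any of its
    8-connected neighbors, per the description in the text.
    """
    # Eq. (1): quantize 256 gray levels down to 16 by a 4-bit right shift.
    y = image >> 4

    # Replicate the border so edge-of-image pixels compare against themselves.
    p = np.pad(y, 1, mode="edge")
    h, w = y.shape
    edges = np.zeros((h, w), dtype=bool)
    # Compare each pixel against its 8 neighbors using shifted views.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = p[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
            edges |= neighbor != y
    return edges
```

Because the comparison works on 16-level values, small luminance fluctuations within one quantization bin produce no edges, which is consistent with the thin edges reported for this operator.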

The pixel at (x, y) is marked as an edge pixel if its Y(x,y) value differs from that of any of its 8-connected neighbors. One result of the "DownLevel" operation is shown in Figure 1(b). While the "DownLevel" edge detector obtains results similar to those of other filters, it is far more efficient than the Sobel and Laplacian operators. Furthermore, the edges obtained by the "DownLevel" operator were found to be much thinner than those obtained by the Sobel operator. In the subsequent temporal change detection, these thinner edges thus offer more accurate detection results.

Figure 1: (a) Original, (b) Y DownLevel

III. Spatial Feature Generation

The features used for spatial region classification are the YUV values of the image. To reflect their relative importance, these features are normalized. Let

    F(x,y) = [f_1(x,y), ..., f_3(x,y)]                        (2)

be the feature vector for a pixel at location (x,y), where f_i(x,y), 1 ≤ i ≤ 3, represent the YUV feature values. The normalization is performed as

    z_i = (f_i − f_i,min) / (f_i,max − f_i,min)               (3)

where f_i represents feature i, and f_i,max and f_i,min represent the maximum and minimum values of feature i, respectively. Therefore, the normalized feature vector at pixel k is

    x_k = [z_1(x,y), ..., z_3(x,y)]                           (4)

The normalized features serve as the input to the following spatial classification.

III.I Asymmetric Fuzzy-C-mean for classification

The fuzzy-C-mean algorithm is an iterative learning algorithm that adjusts each cluster center to best partition an image [8-9]. However, this algorithm is effective only when the patterns are symmetrically distributed in terms of their between-class and within-class distributions. When the patterns form an elongated class distribution, or when some class patterns are distributed more densely than others, the fuzzy-C-mean, which places the decision boundary midway between two neighboring classes, fails to discriminate the two classes.

Compared with the fuzzy-C-mean, the proposed asymmetric fuzzy-C-mean incorporates the covariance matrix when computing the membership degree of a pixel in each class. For pixel k with feature vector x_k, the membership degree in the i-th class is computed as

    u*_ik = [(x_k − V_i)^T Σ_i^(−1) (x_k − V_i)]^(−1/(m−1)) / Σ_{j=1..C} [(x_k − V_j)^T Σ_j^(−1) (x_k − V_j)]^(−1/(m−1))    (5)

where V_i is the i-th cluster center and u_ik represents the belongingness of pixel k to cluster i. Here m is the weighting exponent, which defines the fuzziness of the membership values. When m is extremely large, the fuzzy memberships of all the classes tend to become equal to 1/C, except when the input vector is positioned exactly at one class center; therefore, large values of m should be avoided in practical applications. The matrix Σ_i takes into account the denseness and orientation of the i-th class distribution, and can be obtained as the fuzzy covariance matrix of the i-th class.
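As a sketch of Eq. (5) (the paper provides no code, so the names and the vectorized formulation below are ours), the asymmetric memberships can be computed from the Mahalanobis-like distances induced by each class's fuzzy covariance matrix Σ_i:

```python
import numpy as np

def afcm_memberships(X, centers, covs, m=2.0, eps=1e-12):
    """Membership degrees per Eq. (5) - illustrative sketch.

    X: (n, d) feature vectors; centers: (C, d) cluster centers V_i;
    covs: (C, d, d) fuzzy covariance matrices Sigma_i.
    Returns an (n, C) matrix whose rows sum to 1.
    """
    n, C = X.shape[0], centers.shape[0]
    d2 = np.empty((n, C))
    for i in range(C):
        diff = X - centers[i]              # (n, d) deviations from V_i
        inv = np.linalg.inv(covs[i])       # Sigma_i^{-1}
        # Quadratic form (x_k - V_i)^T Sigma_i^{-1} (x_k - V_i) per row.
        d2[:, i] = np.einsum("nd,de,ne->n", diff, inv, diff)
    # Eq. (5): raise each distance to -1/(m-1), then normalize over classes.
    w = np.power(d2 + eps, -1.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True)
```

The small `eps` guards against a zero distance when a feature vector coincides with a class center; with identity covariances the formula reduces to the conventional fuzzy-C-mean membership.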

Once the belongingness of every pixel to each class is obtained with Eq. (5), the memberships are used to re-adjust each class center. The center of the i-th class is calculated as

    V_i = Σ_{k=1..n} (u_ik)^m x_k / Σ_{k=1..n} (u_ik)^m,   i = 1, 2, ..., C    (6)

Equations (5) and (6) are computed iteratively until the algorithm reaches a stable state, in which the class centers of two consecutive iterations show no significant change; that is, until the error Δ = ||V^t − V^(t−1)|| is smaller than a threshold ε, where V^t is the matrix containing the class centers as its column vectors.

Using a noise class and contextual information to reduce noise interference

In unsupervised learning, the most challenging problem is the interference caused by noise, which not only is itself difficult to classify but also often biases the computation of each class center. To increase the robustness of the asymmetric fuzzy-C-mean algorithm to noise, a noise class is designed. This noise class is designed to attract noise-suspect pixels that reside far outside every class. With this main idea, the noise-class membership degree is computed as

    u_(C+1),k = 2 Mas_i(u_ik) / (Max_i(u_ik) + Mas_i(u_ik))    (7)

where the Mas() operation returns the second-largest membership:

    Mas_i(u_ik) = Max_{i ≠ j*}(u_ik),   j* = arg max_i (u_ik)    (8)

After convergence, a relabeling step is conducted for all regions to ensure that each disconnected region has its own label instead of several regions sharing the same label. After that, the pixels in the noise-suspect class are reassigned to a proper class based on their image features, color and luminance, and their spatial adjacency to neighboring classes.

For ease of explanation, the labeled regions will be called regular regions; similarly, regions belonging to the noise-suspect class are called noise-suspect regions. For a pixel in a noise-suspect region R, its surrounding regular regions are defined as those regular regions exactly adjacent to R. Let A_k be the set containing the labels of the regions surrounding a noise-suspect pixel x_k. In the relabeling process, the degree of the pixel x_k belonging to its neighboring regular region i is computed according to

    u_ik(t+1) = [u_ik(t) + Σ_{l∈Ω_k} w_l u_il(t)] / Σ_{j∈A_k} [u_jk(t) + Σ_{l∈Ω_k} w_l u_jl(t)]    (9)

where Ω_k contains the neighboring pixels of x_k in the noise-suspect region and the w_l are weighting factors determining the contribution ratios of the neighboring pixels to x_k. The initial u_ik(0) is obtained by applying x_k to Eq. (5). The computation of Eq. (9) is repeated until the membership degrees of the noise-suspect pixels in each region reach a stable state.

III.II. VOP region mapping and generation

As mentioned above, noise-generated pseudo temporal edges are sparse; on the contrary, real temporal edges are dense and highly connected. Let P_i be the set containing the pixels labeled as region i, and let #(.) be the operator that receives a set and returns its number of elements. Also let set B contain the pixels of the temporal edges of the image frame.
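The noise-class construction of Eqs. (7)-(8) above admits a compact sketch: the noise degree compares the largest membership with the second largest, so a pixel roughly equidistant from every class scores near 1 while a confidently classified pixel scores near 0. This is an illustrative sketch, not the authors' code:

```python
import numpy as np

def noise_class_degree(U: np.ndarray) -> np.ndarray:
    """Noise-class membership of Eqs. (7)-(8).

    U: (n, C) membership matrix from Eq. (5). Mas is the second-largest
    membership over the C regular classes; the noise degree
    2*Mas/(Max + Mas) approaches 1 when a pixel lies far from every
    class (all memberships comparable) and 0 when one class dominates.
    """
    top2 = np.sort(U, axis=1)[:, -2:]   # two largest memberships per pixel
    mas, mx = top2[:, 0], top2[:, 1]    # Mas = 2nd largest, Max = largest
    return 2.0 * mas / (mx + mas)       # Eq. (7)
```

Pixels whose noise degree exceeds their regular-class memberships would be assigned to the noise class and later relabeled via Eq. (9).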

Then the combination of the temporal edges and the asymmetric fuzzy-C-mean classification results is conducted as:

    retain P_i if #(P_i ∩ B) ≥ β #(P_i)                      (10)

The combination process is repeated on each labeled region. The VOP is then obtained by merging all retained regions.

IV. Experimental results

The proposed method was tested on several image sequences. Figure 2(a) shows the AKIYO sequence, in which the effect of apparent background motion is not severe. The single-pixel-wide noisy background edges can be easily eliminated, so using only the temporal edges is sufficient to retrieve the VOP shown in Figure 2(b). The required computation time is very small: for the sequence of 300 frames, the method took only 5 seconds, in contrast to the motion-vector approach, which took 26.25 seconds with the steepest-descent method [2], disregarding the post-processing needed to obtain the VOPs. Furthermore, in this experiment we also found that the motion of the broadcaster Akiyo is not apparent and that several regions, such as the dress and face areas, are very smooth. These two factors make motion vector detection very challenging.

Figure 2. (a) Original image (b) Extracted VOP (c) Edge

Another test sequence, MOTHER AND DAUGHTER, is shown in Figure 3(a). Unlike the AKIYO sequence, its images contain significant shadows on the background area around the mother and on the daughter's shoulder area. These shadows cause significant interference to the extraction of temporal information. However, since the temporal edges are extracted from edges instead of luminance changes, the effects of this interference are significantly reduced. Even so, some sparse pseudo temporal edges still exist. Combining the temporal edges with the spatially partitioned results of the AFCM completely eliminates the effects resulting from the pseudo temporal edges. The generated VOP is shown in Figure 3(d).

Figure 3: (a) Clustering (b) Morphological Open-closing (c) Relabeling (d) VOP

Figure 4 demonstrates the improvement over the traditional fuzzy-C-mean approach obtained by applying the asymmetric fuzzy-C-mean. The explicitness of the partitioned background is significantly refined because the orientations and degrees of dispersion are embedded in the classification, which therefore yields a robust VOP result.

Figure 4: (a) Traditional FCM (b) AFCM

V. Conclusions

An algorithm based on the combination of temporal and spatial information is proposed for video object extraction. Differing from conventional methods, this paper proposes the use of temporal edges. Several experiments have shown that temporal edges provide more accurate temporal information for stationary-background image sequences, especially in head-and-shoulder cases such as AKIYO, in which accurate motion vector estimation is doubtful. Furthermore, their computation is more efficient.

The spatial classification is conducted with the proposed AFCM, in which the orientations and degrees of dispersion are embedded. With these embedded, the AFCM exhibits higher classification robustness to variations in the pattern distributions. The combination of the temporal domain and the spatial domain is then conducted on the classification results.

The algorithm was tested on several standard benchmark image sequences, and the resulting VOPs are provided to demonstrate the feasibility of our approach.

Acknowledgment

The work was supported by MOEA under contract number 88-EC-2-A-17-0208.

References

[1] T. Meier and K. Ngan, "Automatic Segmentation of Moving Objects for Video Object Plane Generation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 525-538, Sept. 1998.
[2] D. Wang, "Unsupervised Video Segmentation Based on Watersheds and Temporal Tracking", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 539-546, Sept. 1998.
[3] C. Gu and M.C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 572-584, Sept. 1998.
[4] L. Vincent, "Morphological Grayscale Reconstruction in Image Analysis: Applications and Efficient Algorithms", IEEE Trans. on Image Processing, vol. 2, pp. 176-201, April 1993.
[5] Y. Rui, T. Huang, M. Ortega and S. Mehrotra, "Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 644-655, Sept. 1998.
[6] Y. Deng and B.S. Manjunath, "NeTra-V: Toward an Object-Based Video Representation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 616-627, Sept. 1998.
[7] L. Vincent and P. Soille, "Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, June 1991.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Upper Saddle River, NJ: Prentice-Hall International.
[9] J.-S. R. Jang, C.T. Sun and E. Mizutani, Neuro-Fuzzy and Soft Computing, Upper Saddle River, NJ: Prentice-Hall International, 1997.
