Temporal Edges and Spatial Classification for Video Object Segmentation
Yuh Ren Choo, Pau-Choo Chung, Chich-Ling Huang and Jar-Ferr Yang
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
[email protected]

Tachun Wang and Chen-Chiung Hsieh
Institute for Information Industry, Taipei, Taiwan, R.O.C.

Abstract
Extracting semantic video object planes (VOPs) is an important step toward the success of MPEG-4. In video object plane segmentation, combining spatial and temporal information has been regarded as a promising approach. This paper proposes a spatio-temporal algorithm for extracting the VOPs of image sequences. The method differs from traditional algorithms in that it combines temporal edges with an asymmetric fuzzy-C-mean spatial region classification. With the proposed temporal edges, the generality of temporal information is retained while computation time is saved in comparison with motion vectors. The proposed spatial classification scheme, the asymmetric fuzzy-C-mean, takes into account the degrees of dispersion and the orientations of the pattern distributions, so more accurate classification results can be obtained. Experiments have shown that the proposed spatio-temporal algorithm can effectively segment video object planes against a nearly static background, without being affected by lighting sources and shadows.

I. Introduction

With the increasing popularity of multimedia applications, new coding techniques that allow variable bit-rate transmission and content-based interactivity are necessary. To meet this need, the MPEG-4 standard was developed around the concept of video object planes (VOPs) in order to provide region-based coding along with content-based interaction. For this reason, the decomposition of video images into VOPs has been considered one of the essential steps for the success of MPEG-4.

Related literature [1-6] on VOP segmentation has been published. Some existing methods used watersheds for region partitioning and mapped the regions onto the subsequent frame [2][7], while others used joint intensity and motion-vector features to divide the image into disjoint regions. These approaches decompose the sequence images into individual regions rather than into complete, meaningful contents. Furthermore, these decomposition methods are easily affected by the inhomogeneity of the applied low-level features. On the other hand, motion fields have been adopted as a basic feature for combining regions into VOPs and for tracking regions in the subsequent frame [1-3][6]. Though motion fields could in theory be used, they are extremely noise sensitive. Inherent in motion fields are two problems: their computation is very time-consuming, and their accurate estimation is restricted when a region has very smooth or periodic features.

This paper proposes a new approach for VOP segmentation by extracting moving objects from an image sequence. Since a moving object is characterized by a motion different from that of the background, some type of motion information has to be incorporated into the segmentation procedure. However, instead of using motion fields, our approach uses temporal edges, thereby avoiding the problems associated with the estimation of motion fields.

Besides the motion information from the temporal domain, a spatial segmentation result is also combined in the proposed VOP segmentation method. An asymmetric fuzzy-C-mean algorithm is proposed for spatial-domain segmentation. Embedded in the asymmetric fuzzy-C-mean algorithm are the orientations and degrees of dispersion of the feature distributions. In a real image sequence, it is common to see noise and intensive details that deteriorate the segmentation results. In contrast to the conventional fuzzy-C-mean, the proposed asymmetric fuzzy-C-mean is also designed to take contextual information into its operations, which is an effective approach to reduce the interference resulting from noise and image details.

This paper is organized as follows. Section II describes the employment of the proposed temporal edges in VOP generation. The complete algorithm of VOP generation using the proposed asymmetric fuzzy-C-mean approach is introduced in Section III. The results are presented in Section IV. Finally, we state conclusions in Section V.

II. VOP generation from temporal edges

Motion is one basic characteristic associated with moving objects and thus with VOPs. For this reason, temporal information is regarded as an important feature for segmenting VOPs. Temporal information has commonly been represented by motion vectors or optical flow fields. To compute the motion vectors, steepest descent or various block-matching algorithms have been employed. Whichever approach is used, the computational load is extremely heavy and the result is highly sensitive to noise and luminance change; a slight interference can cause largely incorrect results. Furthermore, moving regions must contain sufficient texture to generate optical flow, since the luminance in the interior of homogeneous moving regions remains constant. All of these reasons make accurate computation of motion fields a very challenging task.

As an alternative, this paper proposes the use of temporal edges as the temporal information. Temporal edge detection is performed on the edge image throughout the video sequence. Common edge detectors, such as the high-pass Sobel or Laplacian operators, can be applied to obtain the edge image by capturing the high-frequency components, i.e., the abrupt changes in gray levels. Alternatively, in this paper a "DownLevel" operator performed on the image luminance is adopted to obtain the edge image. The DownLevel operation works as follows. The gray value of the image, I(x,y), is quantized from 256 levels (0-255) to 16 levels (0-15) by the following equation:

Y(x,y) = I(x,y) >> 4   (1)

where the operation I(x,y) >> 4 represents right-shifting the image I(x,y) by 4 bits. The pixel at (x,y) is then marked as an edge pixel if its Y(x,y) value differs from that of any of its 8-connected neighbors. One result of the "DownLevel" operation is shown in Figure 1(b).
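As an illustration only (the paper gives no code, and the exact construction of the temporal-edge set from the per-frame edge images is not spelled out here), the DownLevel edge map of Eq. (1) can be sketched in NumPy as follows; the `temporal_edges` helper encodes one plausible reading, namely edge pixels of the current frame absent from the previous frame's edge map:

```python
import numpy as np

def downlevel_edges(img):
    """Edge map from the DownLevel operator: quantize 0-255 gray levels
    to 16 levels (Eq. (1): right shift by 4), then mark pixels whose
    quantized value differs from any 8-connected neighbour."""
    y = img.astype(np.uint8) >> 4
    p = np.pad(y, 1, mode="edge")   # border pixels compare against themselves
    edges = np.zeros_like(y, dtype=bool)
    h, w = y.shape
    for dy in (-1, 0, 1):           # scan all 8 neighbour offsets
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            edges |= y != p[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return edges

def temporal_edges(prev_img, curr_img):
    """One plausible (assumed) definition of temporal edges: edge pixels
    of the current frame that were not edges in the previous frame."""
    return downlevel_edges(curr_img) & ~downlevel_edges(prev_img)
```

The shift-and-compare loop touches each pixel a constant number of times, which is consistent with the paper's claim that DownLevel is cheaper than convolution-based Sobel or Laplacian filtering.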
While the "DownLevel" edge detector obtains results similar to those of other filters, it is far more efficient than the Sobel and Laplacian operators. Furthermore, the edges obtained by the "DownLevel" operator are much thinner than those obtained by the Sobel operator; in the subsequent temporal change detection, these thinner edges yield more accurate detection results.

Figure 1: (a) Original, (b) Y DownLevel

III. Spatial Feature Generation

The features used for spatial region classification are the YUV values of the image. To reflect their relative importance, these features are normalized. Let

F(x, y) = [f_1(x, y), ..., f_3(x, y)]   (2)

be the feature vector for the pixel at location (x, y), where f_i(x, y), 1 ≤ i ≤ 3, represent the Y, U, and V feature values. The normalization is performed as

z_i = (f_i − f_i,min) / (f_i,max − f_i,min)   (3)

where f_i represents feature i, and f_i,max and f_i,min are the maximum and minimum values of feature i, respectively. Therefore, the normalized feature vector at pixel k is

x_k = [z_1(x, y), ..., z_3(x, y)]   (4)

The normalized features serve as the input to the following spatial classification.

III.I Asymmetric Fuzzy-C-mean for classification

The fuzzy-C-mean algorithm is an iterative learning algorithm that adjusts each cluster center to best partition an image [8-9]. However, it is effective only when the patterns are symmetrically distributed in terms of between-class and within-class distribution. When the patterns form an elongated class distribution, or when some class patterns are distributed more densely than others, the fuzzy-C-mean, which places the decision boundary midway between two neighboring classes, fails to discriminate the two classes.

Compared with the fuzzy-C-mean, the proposed asymmetric fuzzy-C-mean incorporates the covariance matrix when computing the membership degree of a pixel in each class. For pixel k with feature vector x_k, the membership degree in the i-th class is computed as

u*_ik = [(x_k − V_i)^T Σ_i^(−1) (x_k − V_i)]^(−1/(m−1)) / Σ_{j=1..C} [(x_k − V_j)^T Σ_j^(−1) (x_k − V_j)]^(−1/(m−1))   (5)

where V_i is the i-th cluster center and u_ik represents the belongingness of pixel k to cluster i. Here m is the weighting exponent that defines the fuzziness of the membership values. When m is extremely large, the fuzzy memberships of all classes tend to the same value 1/C, except when the input vector lies exactly at a class center; therefore, large values of m should be avoided in practice. The matrix Σ_i takes into account the denseness and orientation of the i-th class distribution and can be obtained as the fuzzy covariance matrix of the i-th class.

Once the memberships of all pixels in each class are obtained from Eq. (5), they are used to readjust each class center. The center of the i-th class is calculated as

V_i = Σ_{k=1..n} (u_ik)^m x_k / Σ_{k=1..n} (u_ik)^m,   i = 1, 2, ..., C   (6)

Equations (5) and (6) are computed iteratively until the algorithm reaches a stable state in which the class centers no longer change significantly between two consecutive iterations; that is, until the error Δ = ||V^t − V^(t−1)|| is smaller than a threshold ε, where V^t is the matrix containing the class centers as its column vectors.

Using a noise class and contextual information to reduce noise interference

In unsupervised learning, the most challenging problem is the interference caused by noise, which not only is difficult to classify but also often biases the computation of each class center. To increase the robustness of the asymmetric fuzzy-C-mean algorithm to noise, a noise class is designed to attract noise-suspect pixels that lie far outside every class. With this idea, the noise-class membership degree is computed as

u_(C+1),k = 2 Mas_i(u_ik) / (Max_i(u_ik) + Mas_i(u_ik))   (7)

where the Mas() operation is defined as

Mas_i(u_ik) = Max_{i ≠ j*}(u_ik),   j* = arg max_i (u_ik)   (8)

that is, Mas() returns the largest membership degree excluding that of the winning class.

After convergence, a relabeling step is conducted for all regions so that each disconnected region receives its own label instead of several regions sharing the same label. After that, the pixels in the noise-suspect class are reassigned to a proper class based on their image features, color and luminance, and on their spatial adjacency to neighboring classes.

For ease of explanation, the labeled regions are called regular regions; similarly, regions belonging to the noise-suspect class are called noise-suspect regions. For a pixel in a noise-suspect region R, its surrounding regular regions are defined as those regular regions directly adjacent to R. Let A_k be the set containing the labels of the regions surrounding a noise-suspect pixel x_k. In the relabeling process, the degree to which the pixel x_k belongs to its neighboring regular region i is computed according to

u_ik(t+1) = [u_ik(t) + Σ_{l∈Ω_k} w_l u_il(t)] / Σ_{j∈A_k} [u_jk(t) + Σ_{l∈Ω_k} w_l u_jl(t)]   (9)

where Ω_k contains the neighboring pixels of x_k in the noise-suspect region, and the w_l are weighting factors determining the contribution ratios of the neighboring pixels to x_k. The initial u_ik(0) is obtained by applying x_k to Eq. (5). Computation of Eq. (9) is repeated until the membership degrees of the noise-suspect pixels in each region reach a stable state.

III.II VOP region mapping and generation

As mentioned above, noise-generated pseudo temporal edges are sparse, whereas real temporal edges are dense and highly connected. Let P_i be the set containing the pixels labeled as region i, and let #(.) denote the operator that receives a set and returns the number of its elements. Also let set B contain the temporal-edge pixels of the image frame. Then the combination of the temporal edges and the asymmetric fuzzy-C-mean classification results is conducted as:

retain P_i if #(P_i ∩ B) ≥ β #(P_i)   (10)

The combination process is repeated for each labeled region. Then, the VOP is obtained by merging all retained regions.
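The retention rule of Eq. (10) amounts to keeping a labeled region whenever a sufficient fraction of its pixels coincide with temporal edges. A minimal NumPy sketch follows; it is illustrative only, and the value of β is not specified in the text, so the default `beta=0.05` below is purely an assumption:

```python
import numpy as np

def extract_vop(labels, temporal_edge_mask, beta=0.05):
    """Merge into the VOP every labeled region P_i whose overlap with the
    temporal-edge set B satisfies #(P_i ∩ B) >= beta * #(P_i)  (Eq. (10)).

    labels             : integer label image from the spatial classification
    temporal_edge_mask : boolean image, True at temporal-edge pixels (set B)
    beta               : retention threshold (illustrative value)
    """
    vop = np.zeros(labels.shape, dtype=bool)
    for i in np.unique(labels):
        region = labels == i                                  # pixel set P_i
        overlap = np.count_nonzero(region & temporal_edge_mask)  # #(P_i ∩ B)
        if overlap >= beta * np.count_nonzero(region):
            vop |= region                                     # retain region i
    return vop
```

Because sparse pseudo temporal edges cover only a small fraction of any sizable region, they fall below the β threshold and are discarded, which is the behavior the paper relies on.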
IV. Experimental results

The proposed method was tested on several image sequences. Figure 2(a) shows the AKIYO sequence, in which the effect of apparent background motion is not severe, and some single-pixel-wide noisy background edges can be easily eliminated. Therefore, using temporal edges alone is sufficient to retrieve the VOP shown in Figure 2(b). The required computation time is very small: for the 300-frame sequence, our method took only 5 seconds, in contrast to 26.25 seconds for the motion-vector approach using the steepest descent method [2], disregarding the post-processing needed to obtain VOPs. Furthermore, in this experiment we also found that the motion of the broadcaster Akiyo is not apparent and that several regions, such as the dress and face areas, are very smooth. These two factors make motion vector detection very challenging.

Figure 2: (a) Original image (b) Extracted VOP (c) Edge

Another test sequence, MOTHER AND DAUGHTER, is shown in Figure 3(a). Unlike the AKIYO sequence, these images contain significant shadows on the background around the mother's and daughter's shoulders. These shadows cause significant interference in the extraction of temporal information. However, since the temporal edges are extracted from edges rather than from luminance changes, the effects of this interference are significantly reduced. Even so, some sparse pseudo temporal edges still exist. The combination of the temporal edges and the spatially partitioned AFCM results completely eliminates the effects of these pseudo temporal edges. The generated VOP is shown in Figure 3(d).

Figure 3: (a) Clustering (b) Morphological Open-closing (c) Relabeling (d) VOP

Figure 4 demonstrates the improvement over the traditional fuzzy-C-mean approach achieved by the asymmetric fuzzy-C-mean. The clarity of the partitioned background is significantly refined because the orientations and degrees of dispersion are embedded in the classification, which in turn yields a robust VOP result.

Figure 4: (a) Traditional FCM (b) AFCM

V. Conclusions
An algorithm based on the combination of temporal and spatial information is proposed for video object extraction. Differing from conventional methods, this paper proposes the use of temporal edges. Several experiments have shown that temporal edges provide more accurate temporal information for stationary-background image sequences, especially in head-and-shoulder cases such as AKIYO, where accurate motion vector estimation is doubtful. Furthermore, the computation of temporal edges is more efficient.

The spatial classification is conducted with the proposed AFCM, in which the orientations and degrees of dispersion of the feature distributions are embedded. With these embeddings, the AFCM exhibits higher classification robustness to variation in the pattern distributions. The combination of the temporal and spatial domains is then conducted on the classification results.

The algorithm has been tested on several standard benchmark image sequences, and the resulting VOPs are provided to demonstrate the feasibility of our approach.

Acknowledgment

The work was supported by MOEA under contract number 88-EC-2-A-17-0208.

References

[1] T. Meier and K. Ngan, "Automatic Segmentation of Moving Objects for Video Object Plane Generation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 525-538, Sept. 1998.
[2] D. Wang, "Unsupervised Video Segmentation Based on Watersheds and Temporal Tracking", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 539-546, Sept. 1998.
[3] C. Gu and M.C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 572-584, Sept. 1998.
[4] L. Vincent, "Morphological Grayscale Reconstruction in Image Analysis: Applications and Efficient Algorithms", IEEE Trans. on Image Processing, vol. 2, pp. 176-201, April 1993.
[5] Y. Rui, T. Huang, M. Ortega and S. Mehrotra, "Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 644-655, Sept. 1998.
[6] Y. Deng and B.S. Manjunath, "NeTra-V: Toward an Object-Based Video Representation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 616-627, Sept. 1998.
[7] L. Vincent and P. Soille, "Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, June 1991.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Upper Saddle River, NJ: Prentice-Hall International.
[9] J.-S. R. Jang, C.T. Sun and E. Mizutani, Neuro-Fuzzy and Soft Computing, Upper Saddle River, NJ: Prentice-Hall International, 1997.