A Novel Spatio-Temporal Video Object Segmentation

A Novel Spatio-Temporal Video Object Segmentation Algorithm Shiping Zhu Xi Xia Qingrong Zhang Department of Measurement Control & Information Technology, School of Instrumentation Science and Optoelectronics Engineering, Beihang University, Beijing 100083, China. Email: [email protected]

Kamel Belloulata Département d’Électronique, Faculté des Sciences de l’Ingénieur, Université Djillali Liabès de Sidi Bel Abbès, Sidi Bel Abbès, 22000, Algérie. Email: [email protected]

Abstract—Video object segmentation is a key technology in video processing, which is the foundation of the newly objectbased video compression standard MPEG-4. A novel spatiotemporal video object segmentation algorithm is proposed in this paper, which is very efficient to acquire the video object moving in a static background. Firstly, video sequence is segmented with change detection algorithm to get segmentation result in temporal domain; Secondly, the spatial segmentation result is obtained by a segmentation algorithm based on local thresholds, which means each pixel in the image has its own threshold. In this algorithm, the threshold of a pixel in an image is estimated by calculating the mean of the grayscale values of its neighbor pixels, and the square variance of the grayscale values of the neighbor pixels is also calculated as an additional judge condition; Thirdly, the video object is extracted by spatiotemporal integration video segmentation algorithm. The experimental results show that the proposed spatio-temporal integration video object segmentation algorithm can extract the moving object in a video sequence with a static background accurately and efficiently. Index Terms—image segmentation, video object segmentation, change detection, spatio-temporal integration

I. INTRODUCTION With the growth of internet and computer science technologies, the interest in advanced interactivity with audio-visual contents is increasing rapidly. To meet these requirements, new video coding standards such as MPEG-4 was established. The MPEG-4 standard defines the conception of VO (Video Object) to support its interactive function based on content, the segmentation of video sequence combined into arbitrary region is required by the standard. So video object segmentation is critical to the MPEG-4 standard. Recently, there are many video object segmentation algorithms are proposed. Such as the methods by semiautomatic segmentation and tracking of sementic video objects [1], moving object segmentation algorithm using background registration technique [2], automatic segmentation of moving objects in image sequences based on spatio-temporal information [3], video object segmentation using Bayes-based temporal tracking and trajectory-based region merging [4], video object segmentation and tracking using I learning classification [5].

* This work is supported by the National Science Foundation of China (NSFC) under grant No. 60675018.

978-1-4244-1706-3/08/$25.00 ©2008 IEEE.

Based on the above algorithms, a novel spatio-temporal video object segmentation algorithm is proposed in this paper. Section II introduces the change detection algorithm to get segmentation result in temporal domain; Section III presents the spatial segmentation algorithm that is obtained by a segmentation algorithm based on local thresholds, which means each pixel in the image has its own threshold, the threshold of a pixel in an image is estimated by calculating the mean of the grayscale values of its neighbor pixels, and the square variance of the grayscale values of the neighbor pixels is also calculated as an additional judge condition. Experimental results are given in Section IV in which the video object is extracted by the proposed spatio-temporal integration algorithm. From the given results, it is obvious that the proposed video object segmentation algorithm can obtain the moving object in a video sequence with a static background accurately and efficiently. II. TEMPORAL SEGMENTATION ALGORITHM Temporal segmentation is to segment video object in temporal domain. Change detection algorithm is selected as the proposed temporal segmentation algorithm. Each frame of a video sequence will be separated into moving object and uncovered background. To decrease the computing complexity, only the scenario with static background in the video sequence is considered. So the region where moving object exists is obtained by comparing two adjacent frames. A frame is subtracted by its neighboring frame, which means the gray-level values of the pixels in a frame is subtracted by the gray-level values of the corresponding pixels in its neighboring frame. Then the FDM (Frame Difference Mask) can be calculated as following,

FDM d mn ( x, y )

255 | d ( x, y ) |t k ® else ¯ 0 f m ( x, y ) f n ( x, y ), m ! n

(1) (2)

Theoretically the background is static, but in reality there are always differences between two backgrounds in two adjacent frames for the influence of noises and illumination changes. So a threshold K is needed to decrease the influence of noises and illumination changes. If the gray-level value difference of a point is lower than the threshold, it is believed

as a background point, or it is believed as a point of moving object. The threshold K varies from the different environments. III. SPATIAL SEGMENTATION ALGORITHM In classical threshold image segmentation [6][7], an image is usually segmented and simply sorted to object and background by setting a threshold. There have been several upgraded algorithms based on threshold segmentation which give the optimum threshold, such as Otsu algorithm [8], 2-D Otsu algorithm [9][10], 3-D Otsu algorithm [11]. Otsu algorithm provides a way to set a self-adaptive threshold. Nevertheless, it has a limitation that it gives only one threshold, which is difficult to extract all the useful information in the image. A new segmentation algorithm is proposed in this paper that each pixel in the image has its own threshold by calculating the statistical information of the grayscale values of its neighborhood pixels. An additional judge condition is given that it is possible to get the edge of the image as the result of the proposed algorithm. Usually, it is believed that the gray-level values of the object in an image are lower than the gray-level values of the background. So in this paper, it is assumed that the pixels in an image ranked top 10% in gray-level values have little influence in the result of this segmentation algorithm. A threshold can be set by analyzing the histogram of an image. If the distribution function of the histogram is f(z) which is assumed to be continuous, then a threshold T will be calculated by Eq. (3).

³

255

T

f ( z ) dz

sum /10

G Gmin u 255 T Gmin

Here P1 and P2 are the probabilities of occurrences of the two classes of pixels. P1 is the probability (a number) that a random pixel with value z is an object pixel. Similarly, P2 is the probability that the pixel is a background pixel. It is assumed that any given pixel belongs either to an object or to the background, so that P1 P2 1 (6) An image is segmented by classifying as background pixels with gray levels greater than a threshold T. All other pixels are called object pixels. The main objective is to select the value of T that minimizes the average error in making the decision that a given pixel belongs to an object or to the background. P1(z) and P2(z) are the PDF of the object and the PDF of the background respectively, the probability of erroneously classifying a background point as an object point is

E1 (T )

(4)

A square-shaped operator (which is called structural operator as below in this paper) whose size may be 3×3, 5×5 or 7×7 pixels would be used to calculate the statistical information of every single pixel in an image. There may be several objects in an image. When the image is divided into hundreds of blocks by the operator, the information in each block becomes much fewer. So it is assumed that there are two gray-level regions in each block at most: object and background. If the size of the structural operator is small enough, the assumption is reasonable. And the experiment will give the proof. In this paper, the thresholds of the pixels in an image are estimated in following method [12]. Suppose that an image contains only two principal graylevel regions and let z denote gray-level values. These values

³

T

f

p2 ( z )dz

(7)

Similarly, the probability of erroneously classifying an object point as a background point is

E2 (T )

(3)

Sum is the total number of the pixels in the image. All the pixels with their gray-level values higher than T will be set to white (255). Then gray-level stretch is necessary that the smoother the histogram spreads, the more entropy we can extract. It would be accomplished according to Eq. (4).

Gstr

are often viewed as random quantities, and their histogram may be considered as an estimate of their probability density function (PDF), p(z). This overall density function is the sum or mixture of two densities, one for the bright and the other for the dark regions in the image. The mixture probability density function describing the overall gray-level variation in the image is p( z ) P1 p1 ( z ) P2 p2 ( z ) (5)

³

f

T

p1 ( z )dz

(8)

Then the overall probability of error is

E (T )

P2 E1 (T ) P1 E2 (T )

(9)

To find the threshold value for which this error is minimal requires differentiating E(T) with respect to T (using Leibniz’s rule) and equating the result to 0. The result is P1 p1 (T ) P2 p2 (T ) (10) Obtaining an analytical expression of T requires to know the equations for the two PDFs. Estimating these densities in practice is not always feasible, and an approach used often is to employ densities whose parameters are reasonably simple to obtain. One of the principal densities used in this manner is the Gaussian density. In this case,

p( z )

p1 e 2S V1

( z P1 )2 2 V12

p2 e 2S V2

( z P 2 )2 2 V2 2

(11)

Where ȝ1 and ı12 are the mean and variance of the Gaussian density of the object pixels, and ȝ2 and ı22 are the mean and variance of the Gaussian density of the background pixels. Using Eq. (10) in the general solution of Eq. (11) results in the following solution for the threshold T.

AT 2 BT C where

0

(12)

A V12 V2 2 B

2(P1V2 2 P 2V12 )

C

P1V 2 2 P 2 V12 4V12 V 2 2 ln(V 2 P1 V1 P2 )

(13)

algorithm, and the size of structural operator is 3×3 pixels. From Fig. 1, it is obvious to get better result by the proposed algorithm than by Canny operator, as the influence of the noise is restrained, meanwhile it is easy to get a segmentation result with higher precision.

If the variances are equal, ı2˙ı12˙ı22, then,

T

P1 P 2 P V2 ln( 1 ) 2 P1 P 2 P2

(14)

If P1= P2, the optimal threshold is the average of the means. The analysis above is also available in the small blocks in an image. So in this paper, the threshold of a pixel can be estimated by calculating the mean of the gray-level values of its neighbor pixels instead of the mean of ȝ1 and ȝ2.

Tij

m n 1 z (i x, j y ) (15) ¦ ¦ (2m 1)(2n 1) x m y n

In Eq. (15), z(i,j) is the gray-level value of the pixel which is located in (i,j). And m and n are all natural numbers. If z(i,j) is lower than Tij, it is supposed that the pixel belongs to the object, else it belongs to the background. Actually if there is a small deviation in the estimation of the threshold Tij, the result of the segmentation algorithm would not change much. And the experiment will give the proof. If there is a block in an image that the object and the background exist in the same time, the variance of the graylevel values of the pixels in this block is definitely greater than the variance in a block that contains only the object or only the background. So the edge of the image can be obtained as the variance is also considered in the proposed algorithm.

Vij 2

m n 1 ( z (i x, j y ) Tij ) 2 (16) ¦ ¦ (2m 1)(2n 1) x m y n

If the gray-level value of a pixel is lower than its threshold, while the variance of the gray-level values of the pixels in its neighbor is higher than the given value Delta, this pixel is defined as the boundary, and its gray-level value is evaluated as 0, else it is defined as the background and its gray-level value becomes 255. As to decreasing the complexity of the proposed algorithm, the variance of the gray-level values of the pixels in this block can be replaced by the gray-level values difference between the brightest pixel and the darkest pixel in this block. To verify the efficiency of the proposed algorithm, the video sample “Mother and daughter” is selected to do the experiment. The experiment is proceeded in a notebook PC (CPU: InterCoreTM2 T7100, 1.8GHz; RAM: 1G, DDR2). The algorithm is implemented in Visual C++ 6.0. The 24th frame in this video is picked as in random, and the image is resized to 272×229 pixels. The result of the segmentation is shown in Fig. 1. Fig. 1(a) is the original image, to make a comparison, this image is segmented by Canny operator [13]. During dozens of experiments in Canny operator with different parameters, a rather good result is picked out, which is shown in Fig. 1(b). Fig. 1 (c) is the result of the proposed

(a) Original image of “Mother and daughter”

(b) Result with Canny operator. (c) Result with the proposed algorithm Fig. 1 Comparison of the segmentation results between Canny operator and the proposed algorithm.

Then, the relationship between the size of the structural operator and the segmentation result is researched. In Fig. 2, (a), (b), (c) and (d) are the segmentation results by the proposed algorithm, and the sizes of the structural operators are 3×3, 5×5, 7×7 and 9×9 pixels, respectively.

(a) Result 1 (the size of the structural operator is 3×3 pixels)

(b) Result 2 (the size of the structural operator is 7×7 pixels)

(c) Result 3 (the size of the (d) Result 4 (the size of the structural operator is 11×11 pixels) structural operator is 15×15 pixels) Fig. 2 Experimental results with different sizes of structural operators by the proposed algorithm

From Fig. 2, it can be concluded that the larger the size of the structural operator is, the longer the time it takes to run the algorithm, and the thicker the edge of the image is. In this part, an edge detection algorithm in image processing based on threshold segmentation is proposed. In this algorithm, each pixel in an image has its own threshold, which is estimated by calculating the statistical information of its neighbor pixels. Experimental results show that it is apparent to obtain better results by the proposed algorithm than by Canny operator. In this algorithm, it takes little time to get a precise result when small size structural operator is selected. And the continuous contour of an image is easy to obtain when large size structural operator is selected. This algorithm also has an obvious effect on noise restraining, which is a good edge-detecting algorithm with wide applicability. This algorithm is used as the spatial segmentation algorithm in the part of the proposed spatio-temporal video object segmentation algorithm. IV. EXPERIMENTAL RESULTS OF THE SPATIOTEMPORAL VIDEO OBJECT SEGMENTATION ALGORITHM In this experiment, the video sequence “Silent” is chosen. The background in this video is irregular picture, and there isn’t apparent difference between the background and the object. So it is suitable to test the performance of the proposed algorithm. Firstly, each frame of this video should be transformed from color image into gray image, for it is convenient to handle gray-level image. In this video, the moving parts of the object mainly concentrate on the head and the shoulder. But generally the movement is not apparent. Actually if there is several frames’ interval between two frames, the difference between these two frames is definitely greater than the difference between two frames which are neighbors. In this experiment, if we calculate the difference between two frames with several frames’ interval between them, the region that contains the complete object is obtained. The region includes the moving object in a video can be obtained by temporal segmentation, as the static background will be erased. Using spatial segmentation, the precise edge of the simple frame can be gotten. Then an “exclusive-or” operation is run with the result of temporal segmentation and the result of spatial segmentation, the preliminary edge of the moving object is obtained. The result of this process is shown in Fig. 3 as below. In Fig. 3, (a) is the 118th frame picked from the video “Silent”. (b) is the temporal segmentation result of (a), which is realized by FDM of the 118th frame and the 116th frame. (c) is the spatial segmentation result of (a), where the size of the structural operator is 5 pixel (to obtain continues edge), and Delta is 45. (d) is the preliminary edge of the moving object. But it is obvious that the edge is not continues. If closed region is formed, moving object can be gotten easily. So

closed region is needed, that the edge of the object must be continues. To solve this problem, the solution is described as following.

(a) The 118th frame of “Silent” (b) The temporal segmentation result of (a)

(c) The spatial segmentation result of (a)

(d) Initial object segmentation result of (a)

Fig. 3 Initial object segmentation process

Scan all the pixels whose gray-level values are 0. Suppose P fulfill the condition, then check every line that goes through pixel P. If there are two pixels P1 and P2 on the line, and they meet the requirements as below: (1) The gray-level values of P1, P2 are both 255; (2) P1 and P2 are located in separate side of the pixel P; (The pixel P is located between P1 and P2) (3) |P- P1|