Z. Zhu et al.: Statistical Image Modeling for Semantic Segmentation


Statistical Image Modeling for Semantic Segmentation

Zhongjie Zhu, Yuer Wang, and Gangyi Jiang, Member, IEEE

Abstract: Semantic image segmentation (SIS) is one of the most crucial steps toward image understanding. In this paper, a novel framework to enable SIS is proposed by modeling images automatically. The statistical model for an image is obtained automatically by using a finite mixture model to approximate the underlying class distributions of image pixels. To accurately characterize the principal visual properties of the underlying dominant image compounds, a novel improved Expectation-Maximization (EM) algorithm is presented to select the model structure and estimate the model parameters simultaneously. Experiments were conducted, and convincing results were obtained.

Index Terms: Statistical image modeling, semantic image segmentation, image understanding, finite mixture model, improved EM algorithm.

I. INTRODUCTION

Semantic image segmentation (SIS) is a technique for partitioning an image into a finite number of non-overlapping, meaningful regions. It is one of the most crucial steps toward image understanding and has a variety of applications, such as object recognition, image annotation, image coding, and image indexing [1]-[3]. Many techniques have been proposed for image segmentation [4]-[6]. However, SIS remains a very challenging task because of the semantic gap between low-level visual features and the high-level human interpretation of image semantics [7]-[8]. Therefore, there is a need to develop new frameworks that are able to support efficient SIS.

Statistical image modeling is one promising approach that can be used to characterize image semantics by performing implicit image content analysis [9]-[10]. Many techniques have been proposed for statistical image modeling. The most prevalent framework uses the Markov random field model or the Gaussian Markov random field model [11]-[16]. However, the major limitation of Markov random fields is that their complexity increases rapidly when a larger neighborhood of pixels is incorporated to model the local joint statistics of image pixels.

In this paper, a novel framework is proposed to enable efficient SIS, which can be separated into four stages. Firstly, the raw image pixels are converted to a collection of feature vectors in a 10-dimensional feature space. Secondly, the class distribution of image pixels is approximated by using a finite mixture model. Thirdly, a novel improved Expectation-Maximization (EM) algorithm is proposed to select the model structure and estimate the model parameters simultaneously. Finally, segmentation results are obtained by determining the correspondence between each mixture component and the relevant dominant image compound.

The main contributions and advantages of our work can be summarized as:
(1) Each image is replaced by a collection of feature vectors in a 10-dimensional feature space before statistical modeling. Compared with the raw pixel values, pixel-based visual features can characterize the principal visual properties of images more effectively.
(2) A penalty term based on the minimum description length (MDL) principle is incorporated into the maximum likelihood criterion to learn the finite mixture model, which makes the model structure and parameters more compact.
(3) A novel improved EM algorithm is introduced to select the model structure and estimate the model parameters simultaneously by performing automatic merging, splitting, and removing of mixture components.
(4) Novel criteria are proposed to select mixture components for merging, splitting, and removing, which can provide a better approximation of the real class distribution of the image pixels.

As a whole, compared with the existing techniques, our approach can provide more compact representations of image contents, and the segmentation results are more consistent with human perception.

This paper is organized as follows: in Section II, a new framework is proposed for statistical image modeling; in Section III, a novel improved EM algorithm is presented for model selection and parameter estimation; in Section IV, some experimental results are given to evaluate the performance of the proposed techniques; Section V concludes the paper.

This work was supported in part by the Natural Science Foundation of Ningbo, China under Grant No. 2008A610015; the Natural Science Foundation of Zhejiang Province, China under Grant No. Y107740; the National Natural Science Foundation of China under Grant No. 60902066; and the Open Project Foundation of Ningbo Key Laboratory of DSP under Grant No. 2007A22002.
Zhongjie Zhu and Yuer Wang are with the Ningbo Key Laboratory of DSP, Zhejiang Wanli University, Ningbo, 315100, China (e-mail: [email protected]).
Gangyi Jiang is with the Institute of Circuits and Systems, Ningbo University, Ningbo, 315211, China (e-mail: [email protected]).
Contributed Paper. Manuscript received December 15, 2009. Current version published 06 29 2010; electronic version published 07 06 2010.
IEEE Transactions on Consumer Electronics, Vol. 56, No. 2, May 2010. 0098 3063/10/$20.00 © 2010 IEEE

II. STATISTICAL IMAGE MODELING

Studies in computational vision have shown that the raw pixel values of images cannot characterize the underlying dominant image compounds compactly and accurately [23]. Thus, some qualitative descriptions are required to characterize the principal visual properties of images effectively. To exploit the visual properties for statistical image modeling, in this paper each image pixel is characterized by 8-dimensional visual features and 2-dimensional location descriptors (i.e., x and y). The 8-dimensional pixel-based visual features include 3-dimensional color descriptors (i.e., R, G, and B), 3-dimensional color deviations of R, G, and B calculated by using a 5×5 window, and 2-dimensional gradients of the luminance channel along the horizontal and vertical axes. Following this feature extraction strategy, each pixel is represented by a 10-dimensional feature vector, and the image as a whole is represented by a collection of feature vectors in the 10-dimensional feature space.

For a given image $I$, we use a finite mixture model to approximate the class distribution of image pixels in the 10-dimensional feature space:

$$P(X \mid I, \Theta) = \sum_{i=1}^{K} P(X \mid S_i, \theta_i)\,\omega_i, \tag{1}$$

where $\Theta = \{K, \omega, \theta\}$ is the parameter set of model structure, weights, and model parameters; $P(X \mid S_i, \theta_i)$ is the $i$th mixture component, which approximates the class distribution of the connected image pixels with similar visual properties; and $K$ is the model structure (i.e., the number of mixture components), which characterizes the number of dominant image compounds for image semantics interpretation. $\theta = \{\theta_i \mid i = 1, \ldots, K\}$ is the set of model parameters for the $K$ mixture components, and $\omega = \{\omega_i \mid i = 1, \ldots, K\}$ is the set of relative weights among the $K$ mixture components.

To learn the finite mixture models for statistical image modeling, the maximum likelihood criterion can be used to determine the underlying model parameters. However, it prefers complex models with more free parameters and tends to over-fit the image pixels [16]-[17], because a complex model with a larger number of parameters is generally better at describing the class distribution of image pixels. Thus, an MDL penalty term is added to determine the model structure and parameters compactly. According to the MDL principle, the model with the minimal description length should be chosen [17]. When the penalty term is added, the parameter set $\hat{\Theta} = (\hat{K}, \hat{\omega}, \hat{\theta})$ for the given image $I$ is determined by

$$\hat{\Theta} = \arg\max_{\Theta} \{ L(I, \Theta) \},$$

where

$$L(I, \Theta) = \sum_{X \in I} \log P(X \mid I, \Theta) + \log p(\Theta) \tag{2}$$

is the objective function, $\sum_{X \in I} \log P(X \mid I, \Theta)$ is the log-likelihood function, and $\log p(\Theta)$ is the MDL term that penalizes complex models, which is defined as:

$$\log p(\Theta) = -\frac{13 + K}{2} \sum_{i=1}^{K} \log \frac{N \omega_i}{12} - \frac{K}{2} \log \frac{N}{12} - \frac{K(N + 1)}{2}, \tag{3}$$

where $N$ is the number of image pixels.

III. NOVEL IMPROVED EM ALGORITHM

The maximum likelihood estimation described in (2) can be achieved by using the EM algorithm with a pre-defined $K$ [18]-[20]. However, setting a fixed number of mixture components in $P(X \mid I, \Theta)$ may mismatch the real class distribution of image pixels, because different images may consist of various image compounds with diverse visual properties.

To address this mismatch, we propose a novel improved EM algorithm. Our improved EM algorithm starts from a reasonably large value of $K$ and performs automatic merging, splitting, and removing of mixture components to select the model structure and re-organize the distribution of mixture components according to the real class distribution of image pixels. To select appropriate components for merging, splitting, and removing, the Kullback-Leibler (KL) divergence is employed to measure the divergence between two mixture components or between one mixture component and the local pixel density.

The inter-component KL divergence $D_{kl}(S_i, S_j)$ between the $i$th mixture component $P(X \mid S_i, \theta_i)$ and the $j$th mixture component $P(X \mid S_j, \theta_j)$ is defined as:

$$D_{kl}(S_i, S_j) = \int P(X \mid S_i, \theta_i) \log \frac{P(X \mid S_i, \theta_i)}{P(X \mid S_j, \theta_j)}\, dX. \tag{4}$$

The local KL divergence $D_{kl}(I, S_i)$ between the $i$th mixture component $P(X \mid S_i, \theta_i)$ and the corresponding local pixel density $P(X \mid I, \theta_i)$ is defined as:

$$D_{kl}(I, S_i) = \int P(X \mid S_i, \theta_i) \log \frac{P(X \mid S_i, \theta_i)}{P(X \mid I, \theta_i)}\, dX. \tag{5}$$

The local pixel density $P(X \mid I, \theta_i)$ is calculated by

$$P(X \mid I, \theta_i) = \frac{\sum_{j=1}^{N} \delta(X - X_j)\, P(S_i, \theta_i \mid X_j)}{\sum_{j=1}^{N} P(S_i, \theta_i \mid X_j)}, \tag{6}$$

where $P(S_i, \theta_i \mid X)$, the posterior probability, is defined as:

$$P(S_i, \theta_i \mid X) = \frac{\omega_i\, P(X \mid S_i, \theta_i)}{\sum_{l=1}^{K} \omega_l\, P(X \mid S_l, \theta_l)}. \tag{7}$$
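As a rough illustration of the mixture density in (1) and the posterior in (7), the following minimal sketch evaluates both for a toy two-component model. It assumes Gaussian components with diagonal covariance, which the paper does not state explicitly; the function names (`diag_gaussian_pdf`, `mixture_density`, `posteriors`) and all numeric values are hypothetical, and 2-D vectors stand in for the 10-dimensional features.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (assumed component form)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def mixture_density(x, weights, means, variances):
    """Eq. (1): weighted sum of the component densities at x."""
    return sum(w * diag_gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def posteriors(x, weights, means, variances):
    """Eq. (7): posterior probability of each component given x."""
    joint = [w * diag_gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # equals mixture_density(x, ...)
    return [j / total for j in joint]


# Toy model: two components in a 2-D feature space (made-up numbers).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

post_near_first = posteriors([0.2, -0.1], weights, means, variances)
post_midpoint = posteriors([1.5, 1.5], weights, means, variances)
density = mixture_density([0.2, -0.1], weights, means, variances)
```

At the midpoint between two components with equal variances, the component densities coincide, so the posteriors reduce to the weights themselves.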

If two mixture components, $P(X \mid S_i, \theta_i)$ and $P(X \mid S_j, \theta_j)$, overlap strongly (i.e., $D_{kl}(S_i, S_j)$ is small), they may provide similar densities and can potentially be merged into a single mixture component $P(X \mid S_{ij}, \theta_{ij})$. To prevent poor merging operations, the local KL divergence $D_{kl}(I, S_{ij})$ between the merged mixture component $P(X \mid S_{ij}, \theta_{ij})$ and the local pixel density $P(X \mid I, \theta_{ij})$ is further calculated. Only the mixture component pair with the minimum value of the inter-component and local KL divergences is selected as the candidate for merging.

Two types of mixture components may be split: (a) elongated mixture components, which underpopulate the relevant image pixels (characterized by the local KL divergence); and (b) tailed mixture components, which overlap with other mixture components and thus overpopulate the relevant dominant image compounds (characterized by the inter-component KL divergence). To select the mixture component for splitting, two criteria are combined: (1) the local KL divergence $D_{kl}(I, S_i)$, which characterizes the divergence between the $i$th mixture component $P(X \mid S_i, \theta_i)$ and the local pixel density $P(X \mid I, \theta_i)$; and (2) the inter-component KL divergence $D_{kl}(S_{ip}, S_{iq})$, which characterizes the overlap between the two newly generated mixture components $P(X \mid S_{ip}, \theta_{ip})$ and $P(X \mid S_{iq}, \theta_{iq})$.

If one specific mixture component is supported by only a few image pixels, it may be removed from the underlying image model $P(X \mid I, \Theta)$. To determine such an unrepresentative mixture component for removing, our improved EM algorithm uses the local KL divergence $D_{kl}(I, S_i)$ to characterize how well the mixture component $P(X \mid S_i, \theta_i)$ represents the relevant image pixels. The mixture component with the maximum value of the local KL divergence is selected as the candidate for removing.

To jointly optimize these three operations of merging, splitting, and removing, their probabilities are defined as:

$$P_m(i, j, \theta_{ij}) = D_{kl}(I, S_{ij}) + \gamma\, D_{kl}(S_i, S_j), \tag{8}$$

$$P_s(i, p, q, \theta_i) = D_{kl}(S_{ip}, S_{iq})\, D_{kl}(I, S_i), \tag{9}$$

$$P_r(i, \theta_i) = \frac{1}{D_{kl}(I, S_i)}. \tag{10}$$
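The inter-component divergence used in the merging criterion can be sketched as follows. This is only an illustration under an assumption the paper does not state: the components are taken to be Gaussians with diagonal covariance, for which the KL divergence in (4) has a closed form. The local divergence term of the merging criterion, which needs the local pixel density, is deliberately left out, so the sketch only finds the most strongly overlapping pair. The function names are hypothetical.

```python
import math
from itertools import permutations


def kl_diag_gaussians(mean_a, var_a, mean_b, var_b):
    """Closed-form KL(a || b) for two diagonal-covariance Gaussians (assumed form)."""
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        total += va / vb + (mb - ma) ** 2 / vb - 1.0 + math.log(vb / va)
    return 0.5 * total


def most_overlapping_pair(means, variances):
    """Return the ordered pair (i, j) with the smallest inter-component divergence.

    Only the inter-component term is evaluated here; the local divergence
    of the merged component is omitted from this sketch.
    """
    best_pair, best_div = None, math.inf
    # KL divergence is asymmetric, so both orderings of each pair are checked.
    for i, j in permutations(range(len(means)), 2):
        div = kl_diag_gaussians(means[i], variances[i], means[j], variances[j])
        if div < best_div:
            best_pair, best_div = (i, j), div
    return best_pair, best_div


# Toy model: components 0 and 1 nearly coincide, component 2 is far away.
means = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
pair, div = most_overlapping_pair(means, variances)
```

With unit variances the divergence reduces to half the squared distance between the means, so the pair (0, 1) has divergence 0.5 × (0.04 + 0.01) = 0.025.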

The acceptance probability to prevent poor operations of merging, splitting, or removing is defined as:

$$T_h = \exp\bigl(-\lambda\, \lvert L(I, \Theta_1) - L(I, \Theta_2) \rvert\bigr), \tag{11}$$

where $L(I, \Theta_1)$ and $L(I, \Theta_2)$ are the objective functions for the finite mixture models before and after performing the merging, splitting, or removing operation, respectively, and $\lambda$ is a constant that is determined experimentally.

By performing automatic merging, splitting, and removing of mixture components, our proposed EM algorithm is able to achieve more accurate image modeling by selecting the model structure and estimating the model parameters simultaneously. The major operations of our improved EM algorithm are shown in TABLE I. Compared with the EM algorithm and its variants, our improved EM algorithm has the following advantages: (a) it is able to select the model structure automatically, capturing the essential structure of image regions by performing automatic merging, splitting, and removing of mixture components; (b) it is able to achieve more accurate estimation of the model parameters and provide a better approximation of the real class distribution of the image pixels by running the local search from many different starting points; (c) it is able to address the mismatching problem by automatically estimating the model structure and re-organizing the distribution of mixture components according to the underlying class distribution of the image pixels. Thus, our proposed framework is able to obtain an effective model for capturing and representing the underlying principal visual properties of images accurately.

TABLE I
NOVEL IMPROVED EM ALGORITHM

Inputs: image $I$, $K_{\max}$
Initialization is done by $k$-means clustering;
for each $K$ do
  1. Calculate the probabilities for the three operations: merging, splitting, and removing;
     $P_m(i, j, \theta_{ij}) = D_{kl}(I, S_{ij}) + \gamma\, D_{kl}(S_i, S_j)$
     $P_s(i, p, q, \theta_i) = D_{kl}(S_{ip}, S_{iq})\, D_{kl}(I, S_i)$
     $P_r(i, \theta_i) = 1 / D_{kl}(I, S_i)$
  2. Select one optimal operation of the three to perform;
  3. With the selected model structure, estimate the model parameters via EM iteration;
  4. Go back to step 1 if the selected operation meets the following acceptance probability; otherwise, go to step 5.
     $T_h = \exp\bigl(-\lambda\, \lvert L(I, \Theta_1) - L(I, \Theta_2) \rvert\bigr)$
  5. Output $\hat{\Theta}$
end for

IV. EXPERIMENTAL RESULTS

To segment an image, the pixels are first classified into different clusters based on the posterior probabilities once the statistical modeling process is complete; that is, the $i$th pixel is classified into the $j$th cluster if

$$j = \arg\max_{l}\; \omega_l\, P(S_l, \theta_l \mid X_i). \tag{12}$$
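The classification rule in (12) can be sketched in a few lines. The sketch assumes the per-pixel posteriors have already been computed by the modeling stage and are passed in as plain lists; the function name `classify_pixels` and the toy numbers are hypothetical.

```python
def classify_pixels(posteriors, weights):
    """Assign each pixel to the cluster maximizing weight * posterior, as in (12).

    posteriors[i][l] is the posterior of component l for pixel i;
    weights[l] is the mixture weight of component l.
    """
    labels = []
    for post in posteriors:
        scores = [w * p for w, p in zip(weights, post)]
        # Index of the largest score; ties resolve to the lowest index.
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy input: three pixels, three mixture components (made-up numbers).
weights = [0.5, 0.3, 0.2]
posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
labels = classify_pixels(posteriors, weights)
```

For the third pixel the scores are 0.10, 0.06, and 0.12, so it is assigned to the last cluster even though that component has the smallest weight.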

Then the segmentation results can be achieved by spatially grouping and connecting pixels.

To evaluate the overall performance of the proposed framework, experiments were conducted. In our current experiments, the maximum value of the model structure ($K_{\max}$) for statistical image modeling is set to 30, and the average computational time for a single image of size 352×288 is about 60 seconds. Some statistical image modeling results are shown in Fig. 1, Fig. 2, and Fig. 3, where the mixture components for statistical image modeling are visualized as multiple elements with different colors and sizes. Each element is drawn based on the mean vector and the covariance matrix of the corresponding mixture component. The element's color and location are determined by the color features and the location features of the mean vector. The shape and the size are calculated based on the location features of the covariance matrix. Some comparison results in terms of segmentation are also shown in Fig. 4, Fig. 5, and Fig. 6 for our proposed approach, normalized cut [21], and the JSEG image segmentation technique [22].

From the statistical modeling results, we can observe that each of the mixture components is relevant to one of the underlying dominant image compounds, and thus the procedure for statistical image modeling achieves automatic image content analysis implicitly. From the comparison results, we can conclude that the proposed statistical image modeling approach can provide a more compact representation of image content, and the segmentation results closely correspond to the dominant image compounds for image semantics interpretation.

However, the proposed approach also has several limitations. One problem is the computational complexity: it is rather time-consuming to segment large images. TABLE II compares the average segmentation time of the three approaches, where more than 300 images with three image sizes were tested. From TABLE II, it can be seen that, for images of size 176×144, the average segmentation time of the proposed approach is intermediate among the three approaches, while for images of sizes 352×288 and 640×480, it is the most time-consuming one. One major reason for the high computational load is the complex iterative process of merging, splitting, and removing operations in the improved EM algorithm. One way to mitigate this problem is to select a more accurate $K_{\max}$ to reduce the number of iterations of merging, splitting, and removing operations. However, it is not easy to automatically select accurate values of $K_{\max}$ for different images.

The other problem of the proposed approach is the occurrence of over- or under-segmentation on some images. This problem is also not easy to handle because it depends on many aspects of the algorithm. Two measures can be considered to overcome this drawback. One is to select more accurate visual features, and the other is to use more reasonable criteria for the merging, splitting, and removing operations.
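The spatial grouping and connecting step mentioned at the start of this section is not specified in detail. One common way to realize it, given here purely as an assumption, is 4-connected component labeling on the per-pixel cluster map, so that spatially separate areas of the same cluster become distinct regions. The function name `connected_regions` and the toy map are hypothetical.

```python
from collections import deque


def connected_regions(label_map):
    """Label 4-connected regions of equal cluster label (assumed grouping rule).

    Returns a region-id map of the same shape and the number of regions.
    """
    rows, cols = len(label_map), len(label_map[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill
                cr, cc = queue.popleft()
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and region[nr][nc] == -1
                            and label_map[nr][nc] == label_map[cr][cc]):
                        region[nr][nc] = next_id
                        queue.append((nr, nc))
            next_id += 1
    return region, next_id


# Toy 3x3 cluster map with two cluster labels (made-up data).
cluster_map = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
]
regions, count = connected_regions(cluster_map)
```

In the toy map, cluster 0 and cluster 1 each split into two spatially separate regions, giving four regions in total.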

Fig. 1. Experimental results for statistical image modeling: (a) original image; (b) visualization of statistical image model.

Fig. 2. Experimental results for statistical image modeling: (a) original image; (b) visualization of statistical image model.

Fig. 3. Experimental results for statistical image modeling: (a) original image; (b) visualization of statistical image model.

Fig. 4. Comparison of segmentation performance among the three approaches: (a) JSEG segmentation technique; (b) normalized cut; (c) our proposed approach.

Fig. 5. Comparison of segmentation performance among the three approaches: (a) JSEG segmentation technique; (b) normalized cut; (c) our proposed approach.

Fig. 6. Comparison of segmentation performance among the three approaches: (a) JSEG segmentation technique; (b) normalized cut; (c) our proposed approach.


TABLE II
COMPARISON OF AVERAGE SEGMENTATION TIME AMONG THE THREE APPROACHES

Image size   Approach            Average segmentation time (s)
640×480      JSEG                 42.06
             Normalized cut      139.07
             Proposed approach   163.74
352×288      JSEG                 10.35
             Normalized cut       43.72
             Proposed approach    57.61
176×144      JSEG                  1.94
             Normalized cut       10.06
             Proposed approach     5.21

V. CONCLUSION

A novel statistical image modeling framework has been proposed to enable efficient semantic image segmentation. A novel improved EM algorithm has been introduced to select the model structure and estimate the model parameters simultaneously by performing automatic merging, splitting, and removing of mixture components, which is able to accurately characterize the principal visual properties of the underlying dominant image compounds. Compared with the normalized cut and the JSEG image segmentation techniques, the proposed approach can provide more compact representations of image contents and acquire more accurate and perceptually consistent segmentation results. Our future work will focus on how to handle the limitations and improve the results.

REFERENCES

[1] M. Acharyya and M. K. Kundu, “Image segmentation using wavelet packet frames and neuro-fuzzy tools”, International Journal of Computational Cognition, vol. 5, no. 4, pp. 27-43, Dec. 2007.
[2] S. Wang, T. Kubota, J. Siskind, and J. Wang, “Salient closed boundary extraction with ratio contour”, IEEE Trans. on PAMI, vol. 27, no. 4, pp. 546-561, Apr. 2005.
[3] R. Li, S. Yu, and X. Yang, “Efficient spatio-temporal segmentation for extracting moving objects in video sequences”, IEEE Transactions on Consumer Electronics, vol. 53, pp. 1161-1167, Aug. 2007.
[4] K. M. Pohl, J. Fisher, W. E. L. Grimson, R. Kikinis, and W. Wells, “A Bayesian model for joint segmentation and registration”, NeuroImage, vol. 31, pp. 228-239, Feb. 2006.
[5] P. Stelldinger, K. Ullrich, and H. Meine, “Topologically correct image segmentation using alpha shapes”, LNCS, vol. 4245, pp. 542-554, Oct. 2006.
[6] D. Letscher and J. Fritts, “Image segmentation using topological persistence”, LNCS, vol. 4673, pp. 587-595, Aug. 2007.
[7] Y. Rui, T. S. Huang, and S. F. Chang, “Image retrieval: current techniques, promising directions and open issues”, Journal of Visual Communication and Image Representation, vol. 10, pp. 39-62, Jan. 1999.
[8] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years”, IEEE Trans. on PAMI, vol. 22, no. 12, pp. 1349-1380, Dec. 2000.
[9] J. Besag, “Spatial interaction and the statistical analysis of lattice systems”, Journal of the Royal Statistical Society, Series B, vol. 36, no. 2, pp. 192-236, Mar. 1974.
[10] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images”, IEEE Trans. on PAMI, vol. 6, no. 6, pp. 721-741, Nov. 1984.
[11] M. Haindl and S. Mikeš, “Model-based texture segmentation”, LNCS, vol. 3212, pp. 306-313, Oct. 2004.
[12] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S. C. Zhu, “On advances in statistical modeling of natural images”, Journal of Mathematical Imaging and Vision, vol. 18, no. 1, pp. 17-33, Jan. 2003.
[13] C. Carson, S. Belongie, H. Greenspan, and J. Malik, “Blobworld: Image segmentation using expectation-maximization and its application to image querying”, IEEE Trans. on PAMI, vol. 24, no. 8, pp. 1026-1038, Aug. 2002.
[14] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr, “Interactive image segmentation using an adaptive GMMRF model”, LNCS, vol. 3021, pp. 428-441, May 2004.
[15] H. Greenspan, J. Goldberger, and L. Ridel, “A continuous probabilistic framework for image matching”, Computer Vision and Image Understanding, vol. 84, no. 3, pp. 384-406, Dec. 2004.
[16] S. C. Zhu, “Statistical modeling and conceptualization of visual patterns”, IEEE Trans. on PAMI, vol. 25, no. 6, pp. 691-712, June 2003.
[17] A. Nikseresht and M. Gelgon, “Low-cost distributed learning of a Gaussian mixture model for multimedia content-based indexing on a peer-to-peer network”, in Proceedings of the 7th ACM International Workshop on Multimedia Information Retrieval, pp. 41-48, Nov. 2005.
[18] G. McLachlan and T. Krishnan, The EM algorithm and extensions, John Wiley & Sons Press: New York, 1997, pp. 1-288.
[19] N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, “SMEM algorithm for mixture models”, Neural Computation, vol. 12, no. 9, pp. 2109-2128, Dec. 1999.
[20] B. Zhang, C. Zhang, and X. Yi, “Competitive EM algorithm for finite mixture models”, Pattern Recognition, vol. 37, no. 1, pp. 131-144, Jan. 2004.
[21] J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Trans. on PAMI, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[22] Y. Deng and B. S. Manjunath, “Unsupervised segmentation of color-texture regions in images and video”, IEEE Trans. on PAMI, vol. 23, no. 8, pp. 800-810, May 2001.
[23] W. Stefan, “Vision models and quality metrics for image processing applications”, Ph.D. Thesis, Swiss Federal Institute of Technology, Lausanne, Ecublens, Switzerland, Dec. 2000, pp. 5-25.

BIOGRAPHIES

Zhongjie Zhu received the Ph.D. degree in electronics science and technology from Zhejiang University, China, in 2004. He is currently a professor with the Faculty of Electronics and Information Engineering, Zhejiang Wanli University, China. His research interests mainly include digital video compression and communication, watermarking and information hiding, 3D image processing, and image understanding.

Yuer Wang received her M.S. degree from Shanghai Fisheries University, China, in 2007. She is currently working at Zhejiang Wanli University, China, and her research interests include digital video compression and signal processing, watermarking, and information hiding.

Gangyi Jiang received his M.S. degree from Hangzhou University, China, in 1992, and his Ph.D. degree from Ajou University, Korea, in 2000. He is now a professor at the Faculty of Information Science and Engineering, Ningbo University, China. He is a member of the IEEE. His research interests mainly include digital video compression and communication, multi-view video coding, image-based rendering, and image processing.
