Saliency-Seeded Region Merging: Automatic Object Segmentation

Junxia Li

Runing Ma

Jundi Ding

College of Science, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China. Email: [email protected]

College of Science, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China. Email: [email protected]

School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China. Email: [email protected]

Abstract—Interactive object segmentation has been an active research area in recent decades. The common practice is to have users set the interactions manually in advance. Often, to obtain good interactions, one has to struggle with laborious local editing and re-correction. Given the ever larger image databases in use today, it is impractical to draw manual interactions for each image. In this paper, we build a saliency-seeded mechanism to automatically capture good prior interactions. Our motivation is simple: pixels that carry different visual cues yet belong to the same object are often good candidates for prior interactions, and such pixels usually have high salience that attracts human attention. Adopting a recently proposed idea, maximal similarity based region merging, we develop a framework of saliency-seeded region merging for 'automatic' interactive segmentation. Extensive experiments and comparisons are conducted on a wide variety of natural images. The results show that our framework can reliably segment many objects out from their surrounding backgrounds.


I. INTRODUCTION

Object segmentation, especially of natural objects, is a hard task in pattern recognition [1]. The major difficulty stems from the large real-world variation in object category, pose, position, and so on. Different categories of objects often have characteristic coat patterns with distinct visual attributes [2][3]. Yet even within the same category, the coat pattern of each object is as unique as a fingerprint. For example, zebras are known for their black-and-white stripes, but the striped coats of two zebras are never exactly alike. More importantly, any single object can yield an essentially unlimited number of different images, depending on its position, pose or background. Such variations are uncontrolled and inextricably mixed together.

Early methods segment an image into regions that are homogeneous with respect to one low-level attribute (e.g., intensity, color or texture). Representative examples include k-means [4], fuzzy c-means (FCM) [4], mean shift [5], watersheds [6], normalized cuts [8] and tree-structured algorithms [7][9]. Though sound in theory, they often fail to give experimentally satisfying results; over-segmentation and under-segmentation are usually unavoidable. One attempt to overcome such limitations is cue integration [10][11]: multiple cues are used to reach a combined similarity for segmentation, with each cue handled in a separate module that assesses the homogeneity of pixels or regions with


Fig. 1. Segmented results on ‘Toucan’ by MSRM. 1st row: input image (left) and initial over-segmentation by mean-shift (right); 2nd row: five different interactive inputs of the object (green) and background (blue); 3rd row: the corresponding segmented toucan, and the rightmost one is our saliency-seeded automatic segmentation (e).

respect to that cue. Each module comes with its own set of parameters, and careful tuning of each parameter is always required. However, owing to real-world variation, most natural objects do not adhere to a piecewise-constant or piecewise-smooth homogeneity in low-level visual cues (see 'Toucan' in Fig. 1). In consequence, the segmented homogeneous regions often do not coincide with the actual objects in the image.

In recent decades, interactive segmentation techniques have attracted an explosion of research interest [13][19]. The motivation is to exploit some prior information about the image, in the form of interactions, to enhance these purely homogeneity-based methods. One common practice is for users to manually provide rough object/background information as prior interactions. In this manner, a novice often fails to provide effective priors, and further interactions are required for re-correction. In some situations, to get the desired result, one has to struggle with laborious local editing. This is tedious, time-consuming and


Fig. 2. The results of four popular saliency maps: (a) input images; (b) Itti's model [14]; (c) GB [15]; (d) IG [16]; (e) SR [17].

tricky. No type of manual input (strokes, scribbles, a lasso, a boundary brush or a bounding box) guarantees good results in every case; they all depend entirely on the user's intentions and intuitions at the moment. Given the ever larger image databases in use today, it is impractical to draw manual interactions for each image.

In this paper, we build a saliency-seeded mechanism to automatically determine prior interactions. Our motivation is twofold:
• The positions where pixels carry different visual cues yet belong to the same object are good candidates for prior interactions. Note that, for the toucan in Fig. 1, besides the black pixels there are also pixels in white and orange; they cannot be segmented into one homogeneous black region and therefore have to be marked as prior interactions.
• The positions where pixels carry different visual cues yet belong to the same object are also the salient parts that attract more human attention. As shown in Fig. 2, pixels near high-contrast locations that belong to one object appear bright, i.e., with higher salience, in the saliency maps.

Figure 1(e) shows one of our saliency-seeded automatic interactions on 'Toucan'. As can be seen, the saliency-seeded object marks fall onto the few positions where pixels belong to the toucan but are clearly of different colors (mainly white and orange). In principle, our automatic interactions can be flexibly embedded into any interactive method. For lack of space, we focus here on a recently proposed interactive method [13], maximal similarity based region merging (MSRM), and thus establish a framework of saliency-seeded region merging (SSRMf) for object segmentation. Extensive results on a wide variety of natural images show that SSRMf can automatically and reliably segment many objects out from their surrounding backgrounds. One successful case is given in Fig. 1(e), in which the toucan is merged into one region as a whole by SSRMf.

The remainder of this paper is organized as follows. Section II presents SSRMf, together with a review of MSRM and the saliency maps involved. Section III conducts extensive experiments on a variety of natural images to evaluate SSRMf. Section IV concludes the paper.


Fig. 3. An overview of the general schematic flowchart of SSRMf.

II. SALIENCY-SEEDED REGION MERGING

In this section, we first briefly review the interactive MSRM method and then describe our SSRMf in detail.

A. A Brief Review of MSRM

MSRM is a recently proposed region-merging-based interactive segmentation method. Its novelty lies in its mutually-maximal-similarity merging principle rather than in the interactive inputs themselves. Specifically, two adjacent regions are merged if they have the mutually highest similarity, i.e., each is the most similar to the other among all of its neighboring regions. By this principle, MSRM can merge the initially over-segmented regions adaptively to the image content and does not need a similarity threshold set in advance. As reported in [13], with good manual inputs MSRM performs well in many situations. However, the results are not as reliable as one might expect, because MSRM is sensitive to the manual inputs: as is often the case, good segmentation occurs only when good interactions are available (see Fig. 1(a)-(d)). But within an image, which parts make good prior interactions for object segmentation, and how can one find them? In what follows, we demonstrate how our method answers these questions automatically.

B. Our proposed framework

A general schematic flowchart of SSRMf is given in Fig. 3. It consists of three separate steps: 1) initial over-segmentation (B); 2) salience-seeded automatic interaction (C-D); 3) region merging (E). Step 2 is our major focus here.

1) initial over-segmentation: In theory, any existing low-level segmentation method, e.g., k-means [4], FCM [4], normalized cuts [8], watersheds [6], super-pixels [13] or mean shift [5], can be used for this step. In practice, however, we require a very large number of segments that are homogeneous enough to preserve the object boundaries well. Methods such as k-means, FCM and normalized cuts are somewhat expensive for reaching such an over-segmentation, while others such as watersheds and super-pixels tend to favor small regions of nearby pixels. So, in terms of both performance and computational efficiency, we choose the mean shift method for this step. As done in [13], the EDISON system (a mean shift implementation) is run on each image to obtain the initial over-segmentation (see Fig. 4(b)).
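To make the pipeline concrete, the following is a minimal Python/NumPy/OpenCV sketch of the over-segmentation step and of one pass of the mutually-maximal-similarity rule reviewed in Section II-A, using the Bhattacharyya coefficient between normalized color histograms as the region similarity, as in [13]. EDISON itself is not reproduced here; OpenCV's mean-shift filtering plus connected-component grouping serves as a rough stand-in, and all function names, bin counts and parameters are our own illustrative choices rather than the authors' code.

```python
import numpy as np
import cv2

def oversegment(img_bgr, sp=8, sr=8):
    # Step 1 stand-in: mean-shift filtering, then group identically colored
    # connected pixels into small regions, returned as an integer label map.
    filtered = cv2.pyrMeanShiftFiltering(img_bgr, sp, sr)
    _, color_ids = np.unique(filtered.reshape(-1, 3), axis=0, return_inverse=True)
    color_ids = color_ids.reshape(filtered.shape[:2])
    labels = np.zeros(color_ids.shape, dtype=np.int32)
    for cid in np.unique(color_ids):
        _, comp = cv2.connectedComponents((color_ids == cid).astype(np.uint8))
        labels[comp > 0] = comp[comp > 0] + labels.max()
    return labels

def region_histograms(labels, img_bgr, bins=16):
    # One normalized color histogram per region (16 bins per channel here).
    quant = (img_bgr // (256 // bins)).astype(np.int32)
    codes = quant[..., 0] * bins * bins + quant[..., 1] * bins + quant[..., 2]
    hists = {}
    for r in np.unique(labels):
        h = np.bincount(codes[labels == r], minlength=bins ** 3).astype(float)
        hists[r] = h / h.sum()
    return hists

def adjacency(labels):
    # Pairs of region labels that touch horizontally or vertically.
    adj = {}
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        diff = a != b
        for p, q in zip(a[diff].tolist(), b[diff].tolist()):
            adj.setdefault(p, set()).add(q)
            adj.setdefault(q, set()).add(p)
    return adj

def bhattacharyya(h1, h2):
    # Similarity of two normalized histograms (higher = more similar), as in [13].
    return float(np.sum(np.sqrt(h1 * h2)))

def merge_once(labels, img_bgr):
    # One pass of the mutually-maximal-similarity rule: merge R and Q only if
    # Q is R's most similar neighbor AND R is Q's most similar neighbor.
    hists = region_histograms(labels, img_bgr)
    adj = adjacency(labels)
    best = {r: max(adj[r], key=lambda q: bhattacharyya(hists[r], hists[q]))
            for r in adj}
    merged = labels.copy()
    for r, q in best.items():
        if best.get(q) == r and r < q:        # mutual best match: merge q into r
            merged[merged == q] = r
    return merged
```

In the full MSRM procedure the rule is applied iteratively and combined with the object/background markers obtained in step 2, so that the remaining non-marker regions are eventually absorbed into the object or the background; see [13] for the complete algorithm.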


Fig. 4. Illustration of the three steps of SSRMf: (a) input images; (b) initial over-segmentation by mean shift; (c) saliency-seeded interactions; (d) object segmentation by MSRM.

2) salience-seeded automatic interaction: Salience is a perceptual quality that makes an item pop out relative to its neighbors. For an object or a pixel in an image, the more salient it is, the less time we need to find it at first glance. Recent research in visual attention [14] indicates that the objects of interest in an image tend to be much more salient than the background. Inspired by this, we make use of salience detection techniques for the automatic acquisition of prior interactions. In particular, among the many current saliency detectors, we select four popular ones: Itti's model (IT) [14], Harel's graph-based (GB) salience [15], Achanta's frequency-tuned salience (termed IG in [16]) and Hou's spectral residual (SR) approach [17]. The reasons for this choice are as follows: citation in the literature (IT is classic and widely cited), recency (GB, IG and SR are recent) and variety (IT is biologically motivated, GB is hybrid, IG outputs full-resolution salience maps and SR works in the frequency domain).

As illustrated in Fig. 2, in each saliency map the pixel values are represented in gray and normalized to the range [0, 1]: the brighter a pixel is, the higher its value and the more salient it is. Clearly, the pixels that lie near high-contrast positions but belong to one object of interest have the highest salience in the scene. These pixels, as argued above, are apt to be isolated from their own object region by homogeneity-based methods; that is, salient pixels are inhomogeneous with their neighbors, yet they largely belong to the object regions. On the other hand, pixels in the background mostly have the lowest salience and appear black. Intuitively, it therefore makes sense to take the pixels with the highest salience as object markers (denoted by $O$) and the pixels with the lowest salience as background markers (denoted by $B$).


Fig. 5. Object segmentation of SSRMf based on different saliency maps. From top to bottom, they are respectively input images, the results by IT, GB, IG and SR (2nd-5th row), and human segmentation (last row).

One possible way is to pick pixels with salience above a large threshold $T_O$ as object markers and pixels with salience below a small threshold $T_B$ as background markers:
$$O = \{(x, y) \mid s(x, y) \ge T_O\} \qquad (1)$$
$$B = \{(x, y) \mid s(x, y) \le T_B\} \qquad (2)$$
where $(x, y)$ is the position of a pixel in the image and $s(x, y)$ is the salience value of that pixel. In practice, however, the same pixel has different salience values in different salience maps, so there is no general-purpose value that can be assigned to $T_O$ or $T_B$. We therefore do not try to fix a particular threshold. Instead, we take the set of pixels $O'$ as object markers and the set of pixels $B'$ as background markers:
$$O' = \{(x, y) \mid \Pr(s(x, y) \ge T_O) = P_O\} \qquad (3)$$
$$B' = \{(x, y) \mid \Pr(s(x, y) \le T_B) = P_B\} \qquad (4)$$
where $\Pr(\cdot)$ is a probability taken over the pixels of the image. In our experiments, a good candidate for $P_O$ is often found in the range $[0.95, 0.98]$, and $P_B$ is set to $0.5$. That is, in each salience map, the pixels whose salience exceeds that of 95% of all image pixels are set as object markers, while the pixels whose salience is below that of 50% of all image pixels are set as background markers (see Fig. 4(c) for some typical interaction results). The remaining pixels are the non-markers that need to be labeled as object or background in the next step.
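As a concrete illustration, the sketch below implements this quantile-style marker selection of Eqs. (3)-(4) in Python with NumPy/OpenCV, using a compact rendition of Hou's spectral residual detector [17] as the saliency source (any of the four detectors could be substituted). The function names, the 64x64 working resolution and the blur sizes are our own illustrative choices, not taken from the paper.

```python
import numpy as np
import cv2

def spectral_residual_saliency(gray):
    # Rough rendition of the SR approach [17]: work on a small image, keep the
    # phase, replace the log amplitude by its residual, invert, then smooth.
    small = cv2.resize(gray, (64, 64)).astype(np.float32)
    spectrum = np.fft.fft2(small)
    log_amp = np.log(np.abs(spectrum) + 1e-8)
    residual = log_amp - cv2.blur(log_amp, (3, 3))
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * np.angle(spectrum)))) ** 2
    sal = cv2.GaussianBlur(sal, (9, 9), 2.5)
    sal = cv2.resize(sal, (gray.shape[1], gray.shape[0]))
    # Normalize to [0, 1], as in Fig. 2.
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

def saliency_markers(sal, p_o=0.95, p_b=0.5):
    # Eqs. (3)-(4): the thresholds are quantiles of the saliency map itself,
    # so no fixed T_O / T_B has to be chosen per detector.
    t_o = np.quantile(sal, p_o)     # top (1 - p_o) fraction -> object markers O'
    t_b = np.quantile(sal, p_b)     # bottom p_b fraction    -> background markers B'
    return sal >= t_o, sal <= t_b   # boolean masks for O' and B'

# Usage (hypothetical file name):
# gray = cv2.imread("toucan.png", cv2.IMREAD_GRAYSCALE)
# sal = spectral_residual_saliency(gray)
# obj_mask, bg_mask = saliency_markers(sal)
```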


3) region merging: Here we follow the maximal-similarity-based principle in [13] for region merging. As reviewed above, two neighboring regions are merged if they are mutually the most similar to each other. The similarity between two regions $R$ and $Q$ is measured by the well-known Bhattacharyya coefficient between their normalized color histograms, $\rho(R, Q) = \sum_{u} \sqrt{\mathrm{Hist}_R(u)\,\mathrm{Hist}_Q(u)}$: the higher the coefficient, the more similar the two regions. For detailed information, interested readers can refer to MSRM [13].

III. EXPERIMENTS AND COMPARISONS

With the help of manual inputs, MSRM is able to automatically merge regions and label the non-marker regions as object or background. Nevertheless, it still requires some effort on the part of the user and is not a fully automatic approach. Our SSRMf realizes this goal through the saliency-seeded automatic interactions. In this section, we conduct extensive experiments to evaluate SSRMf on a wide variety of images.

A. Qualitative Result Comparisons

Figure 5 shows the results of SSRMf based on IT, GB, IG and SR. In these cases, most objects of interest are isolated from the background. With respect to the human segmentation, however, GB-seeded interactions appear to be the best suited for object region merging: each object is accurately merged into one region as a whole. The interactions driven by the other saliency maps may be slightly too many or too few, so some merging errors occur. To make up for this weakness, we advocate combining any two saliency maps when determining the prior interactions.

Fig. 6. Result comparisons: (a) interactive graph cuts; (b) GrabCut; (c) saliency cuts; (d) our proposed SSRMf.

Figure 6 compares SSRMf with interactive graph cuts [18], GrabCut [19] and saliency cuts [20]. In these three methods, object segmentation is cast as a minimal graph-cut problem, and they all require some regions labeled as priors, i.e., seeds. In graph cuts, the seeds are obtained from a few marked strokes or lines of object/background interactions. GrabCut lets the user drag a rectangle around the desired object and derives the seeds from it. Both involve manual interaction. In saliency cuts, SR is employed to automatically provide some 'professional labels', and the objects of interest are assumed to lie at the center of the image, so the left and right sides of the image are forced to be background markers. Shrink and expansion operations are also used to suppress noisy saliency areas detected by SR. Although these interactions are all carefully designed, their results do not appear to be smoother or more accurate than our saliency-seeded segmentation.


Fig. 7. Experimental results of SSRMf on a variety of natural images.

More results on a variety of natural images collected from the Internet are illustrated in Fig. 7. These images cover a wide range of distinct objects in the natural world, e.g., leopard, giraffe, flower, butterfly and wildcat. Their coat patterns exhibit a rich diversity of texture appearances such as spots, rosettes, stripes, blocks or patches, and the objects appear in different poses: crouching, sitting, standing, walking, running, etc. Despite these difficulties, the objects of interest, together with their long but thin legs or tails, are all segmented in one piece; the integrity of each object is well preserved by SSRMf.

B. Quantitative Result Evaluations

To further evaluate SSRMf, we use Alpert's database [12] for our test. The entire database is available online at http://www.wisdom.weizmann.ac.il/vision/. It contains 100 gray-level images along with ground-truth segmentations. Each image depicts one object against the background and is segmented manually by three people; a pixel is declared foreground only when it was marked as foreground by at least two of them. On this database, we compare our SSRMf with GBVS [15] and MSRM [13]. Note that GBVS is an adaptive-threshold-based salience detection method, and for MSRM we manually provide modest prior inputs. Average precision (P), recall (R) and the F-measure, defined as the harmonic mean of precision and recall, $F = 2PR/(P + R)$, are compared over the entire ground-truth database.
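For reference, a minimal sketch of how these per-image scores can be computed from binary foreground masks is given below; the function name and the mask convention are illustrative choices, not part of the benchmark's tooling.

```python
import numpy as np

def precision_recall_f(pred, gt):
    # pred, gt: boolean masks of the same shape where True marks foreground pixels.
    tp = np.logical_and(pred, gt).sum()          # true-positive pixel count
    precision = tp / max(pred.sum(), 1)          # fraction of predicted foreground that is correct
    recall = tp / max(gt.sum(), 1)               # fraction of true foreground that is recovered
    f = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f
```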


Fig. 9. A failure example of SSRMf. (a) original image; (b) initial mean shift segmentation, object markers (green) and background markers (blue).

Fig. 8. F-measure evaluations: (1) GBVS; (2) SSRMf; (3) MSRM.

The F-measure is used to quantitatively assess the consistency of each method's segmentation with the ground truth. Figure 8 shows the F-measure evaluations. Our method clearly obtains the highest F-measure score, which confirms the effectiveness of SSRMf.

C. Failure of SSRMf

As described above, SSRMf is effective for object extraction in a variety of natural images. However, it may fail when there are low-contrast edges and ambiguous areas in the image. Such a case is shown in Fig. 9, for which no salience map can predict accurate object/background interactions. In addition, the initial segmentation, as one of the key steps in SSRMf, is also very important: if the initial segmentation fails, SSRMf may fail as well.

IV. CONCLUSION

In this paper, we have developed a framework of saliency-seeded region merging for automatic object segmentation. With the aid of saliency detection, useful knowledge about the object of interest and the background can be found automatically and used as prior interactions. Experiments show that our framework is comparable to existing state-of-the-art interactive work. In many situations, the objects of interest are well segmented out from their backgrounds. The framework involves no manual interaction and is fully automatic. Our future work will focus on how to overcome the failures of SSRMf in difficult situations.

ACKNOWLEDGMENT

This work is supported by NUAA Research Funding, No. NS2010196.

REFERENCES

[1] N. Pinto, D. D. Cox and J. J. DiCarlo, Why is Real-World Visual Object Recognition Hard? PLoS Computational Biology 4(1): 0151-0156, January 2008.
[2] Y. Pang, Q. Hao, Y. Yuan, T. Hu, R. Cai and L. Zhang, Summarizing tourist destination by mining user-generated travelogues and photos. Computer Vision and Image Understanding 115(3): 52-363, 2011.


[3] Y. Pang, X. Lu, Y. Yuan and X. Li, Travelogue Enriching and Scenic Spot Overview Based on Textual and Visual Topic Models. IJPRAI 25(3): 373-390, 2011.
[4] S. Chen and D. Zhang, Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34(4): 1907-1916, 2004.
[5] D. Comaniciu and P. Meer, Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5): 603-619, 2002.
[6] J. Cousty, G. Bertrand, L. Najman and M. Couprie, Watershed Cuts: Minimum Spanning Forests and the Drop of Water Principle. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(8): 1362-1374, August 2009.
[7] J. Ding, R. Ma and S. Chen, A Scale-based Connected Coherence Tree Algorithm for Image Segmentation. IEEE Transactions on Image Processing 17(2): 204-216, February 2008.
[8] J. Shi and J. Malik, Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8): 888-905, 2000.
[9] J. Ding, J. Shen, H. Pang, S. Chen and J. Yang, Exploiting Intensity Inhomogeneity to Extract Textured Objects from Natural Scenes. In: Asian Conference on Computer Vision, Part III, LNCS 5996: 1-10, Xi'an, China, 2009.
[10] L. Grady, Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(6): 1768-1783, 2006.
[11] N. Paragios and R. Deriche, Geodesic active regions and level set methods for supervised texture segmentation. International Journal of Computer Vision 46(3): 223-247, 2002.
[12] S. Alpert, M. Galun, R. Basri and A. Brandt, Image segmentation by probabilistic bottom-up aggregation and cue integration. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[13] J. Ning, L. Zhang, D. Zhang and C. Wu, Interactive image segmentation by maximal similarity based region merging. Pattern Recognition 43(2): 445-456, 2010.
[14] L. Itti, C. Koch and E. Niebur, A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11): 1254-1259, November 1998.
[15] J. Harel, C. Koch and P. Perona, Graph-Based Visual Saliency. Advances in Neural Information Processing Systems, vol. 19, pp. 545-552, 2007.
[16] R. Achanta, S. Hemami, F. Estrada and S. Susstrunk, Frequency-tuned Salient Region Detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597-1604, 2009.
[17] X. Hou and L. Zhang, Saliency Detection: A Spectral Residual Approach. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[18] Y. Boykov and V. Kolmogorov, An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9): 1124-1137, September 2004.
[19] C. Rother, V. Kolmogorov and A. Blake, "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23(3): 309-314, 2004.
[20] Y. Fu, J. Cheng, Z. Li and H. Lu, Saliency Cuts: An Automatic Approach to Object Segmentation. In: International Conference on Pattern Recognition, 2008.
