Segmentation of Natural Scenes Based on Visual Attention and Gestalt Grouping Laws

R. G. Mesquita and C. A. B. Mello, IEEE Member
Centro de Informática, Universidade Federal de Pernambuco
Recife, Brazil - http://www.cin.ufpe.br/~viisar
{rgm, cabm}@cin.ufpe.br
Abstract—Detection of salient regions in images of natural scenes can be applied as a pre-processing step for computer vision algorithms such as image segmentation, content-based image retrieval, object recognition, or image compression. This paper presents a visual attention method that analyses the input image at multiple scales using stability information of image regions. The saliency maps constructed have the advantage of preserving well-defined boundaries and achieving a better separation between the background and the salient object, without suffering from the undesired effects of multi-scale approaches. Furthermore, a segmentation approach that models Gestalt grouping laws is also applied. Experiments on a database of 1,000 images showed that our saliency maps outperform the results obtained by seven other visual attention algorithms in terms of F-measure. In addition, our segmentation approach obtained better results than three classic thresholding algorithms.

Keywords—Saliency; visual attention; image segmentation; Gestalt; visual perception
I.
INTRODUCTION
Segmentation is one of the most important steps in several image processing and computer vision applications [1]. It aims to divide an image into its major objects. This is clearly not a simple task, as the concept of an object can vary from one image domain to another. Moreover, segmentation usually produces more objects than the user is actually interested in. If a person stands in front of a tree, a segmentation algorithm will probably separate the person and the tree (in the best case), yet it is most likely that the region of interest is the person only. This is where elements of visual attention [2][3] and saliency maps [3] can be added to segmentation.

At every moment we receive a large amount of stimuli (not only visual, but through all of our senses). Visual attention is the part of the visual perception system [4] that filters the information received, allowing only part of it to reach visual awareness. It has been a necessary step in our perception of the world since the first human beings (originally for reasons of personal safety). This analysis happens within a few milliseconds, making the process even more complex to model computationally.

This paper presents a visual attention model for segmentation of natural scenes. In the next section, we summarize some visual attention models along with applications to image segmentation. Section 3 details our proposal, experiments are presented in Section 4, and Section 5 concludes the paper.
II.
VISUAL ATTENTION AND SEGMENTATION
There are several visual attention models [5]. Developing such models is important to define the computational principles that guide selection and object recognition [6]. In the development of a visual attention model, one should understand several aspects of the visual system. The work of Posner et al. [7] is very important, as it analyzes the relationship between attention and the structure of the visual system. In the search for a reasonable model, the work of Wolfe and Horowitz [8], which discusses the attributes that can guide attention, is mandatory reading.

One of the best-known models is the one by Itti, Koch and Niebur [9][10], improved from the early model of Koch and Ullman [11]. In their model, an image is represented in different forms considering only low-level visual features (such as luminance, edges, colors, and pre-attentive texture discriminators [12][13]). All these different representations of the same image are analyzed at different scales, normalized and compared, resulting in a single final grayscale image with the most interesting areas shown in light gray, while the less attractive regions are represented by dark tones. Their model also allows the inclusion of top-down knowledge, which can improve the accuracy of the visual attention model. This response (degrees of interest) in the form of a grayscale image is called a saliency map. It is strongly related to visual attention, as the more salient areas are supposed to be the more attractive areas of the image; they appear as the light gray areas in the map.

In [14], a method for image segmentation based on low-level primitives is presented. It performs segmentation by grouping similar pixels based on properties from Gestalt theory [15] and Dempster-Shafer theory [16]. Grouping hypotheses are generated from a Region Adjacency Graph (RAG) [17] and the grouping is achieved based on Gestalt theory.
The RAG algorithm is an iterative process and its final result depends on a similarity parameter (called 'belief') that is not automatically defined. Achanta et al. [18] define a saliency method using luminance and color extracted at different scales; each result is added and normalized, and the main object lies within the areas whose gray values are above a certain threshold T. The same researchers improved their method in [19]. In this
case, the model was defined so that some requirements are satisfied: the largest salient object must be emphasized, salient regions must be uniformly highlighted, the borders of the salient objects must be well defined, texture and noise must not be considered, and full-resolution saliency maps must be generated. The new model meets every requirement by computing the saliency map as

S(x, y) = ||I_μ − I_G(x, y)||, (1)

where I_μ is the average image feature vector and I_G(x, y) is a Gaussian-blurred version of the original image, represented in the Lab color space. A thresholding algorithm, also defined in [19], concludes the segmentation of the saliency map, resulting in the most salient object. In [20] this method is improved to decrease the saliency of the background.

The Graph-Based Visual Saliency method [21] uses ideas from graph theory to compute saliency maps. Initially, activation maps are created on certain feature channels; they are then normalized to highlight conspicuity and to allow combination with maps from different features. The work in [22] proposes a contrast-based method that constructs a saliency map using colors in the LUV space as stimulus: the saliency of each pixel is defined as the sum of its differences to the surrounding pixels, and a fuzzy-growing method is then applied to segment salient regions from the saliency map. A spectral residual approach is proposed in [23]. This method analyzes the log-spectrum of the image and extracts its spectral residual in the spectral domain to construct the saliency map; the map is then binarized with a threshold defined as three times its average intensity.

III.
PROPOSED METHOD
Many saliency detection methods, like the approaches in [9] and [21], are based on a multi-scale analysis, commonly performed by successively low-pass filtering and downsampling the original image. The success of this kind of approach often comes from the competition for saliency between different scales, which increases the robustness of those methods. On the other hand, resizing and blurring the original image can lead to a saliency map with ill-defined object boundaries and, at the same time, the competition between scales may concentrate saliency on edge pixels (as these pixels may be present in most scales) instead of covering the whole region. This makes such maps unsuitable for image segmentation [19]. Fig. 1 illustrates saliency maps affected by these problems alongside the saliency maps generated by our method. In the middle column, the first row shows a result in which the saliency is concentrated mainly on the edges of the object; in the second row, one can notice that the location of the edge pixels is not well defined.

To avoid the undesired consequences described above, the method proposed in [19] computes the saliency map in a way that retains more frequency content from the original image, working on a single scale instead of performing downsizing and blurring, which is a very effective approach for image segmentation. In (1), the mean feature vector can be understood as a representation of the main background of the image. However, this approach is not suitable when the salient object strongly influences the mean of the image (which frequently happens when the image has a large salient object). In that situation, the object's low-level features excite the background. Moreover, as no spatial information is considered, the distance between pixels does not influence the saliency map. Fig. 2 shows saliency maps generated by the method proposed in [19]: despite the fact that the salient regions were correctly detected, the main background is also salient. In this situation it is desirable to separate the background (the sky and the grass) from the salient object as much as possible. Thus, we propose herein an algorithm that improves the method presented in [19] by using a multi-scale analysis without suffering from the undesired effects (illustrated in Fig. 1) that usually affect multi-scale approaches.

A. Saliency Map Generation

Initially, the image, represented in the Lab color space (an opponent model suitable to represent human vision), is filtered with a 3 × 3 Gaussian filter to remove high-frequency noise, as proposed in [19]. Then we perform a multi-scale analysis to represent many possible backgrounds of the image. To represent the background at different scales, the image is downsized using bicubic interpolation and resized back to its original size. At each scale s, the dimensions of the downsized image are defined as h = H/2^s and w = W/2^s, where h and w are the height and width of the downsized image and H and W are the height and width of the original image, respectively. We use 8 scales. Fig. 3 shows the background images B_s for s = {4, 6, 8}. It is important to mention that all images in Fig. 3 are shown in the RGB color space for visualization purposes only. Next, we evaluate the difference between the original Lab image I and the background image of each scale as

Diff_s(x, y) = ||I(x, y) − B_s(x, y)||, (2)
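As an illustrative sketch, the multi-scale background construction and the per-scale differences of (2) might be implemented as follows. This is a hypothetical re-implementation under stated assumptions, not the authors' code: the function names are ours, scipy's `zoom` with `order=3` stands in for the paper's bicubic interpolation, and a small Gaussian filter stands in for the 3 × 3 smoothing step.

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def background_images(lab_img, scales=range(1, 9)):
    """One 'background' image per scale s: downsize the image to
    (H/2^s, W/2^s) with bicubic interpolation and resize it back
    to the original dimensions, as described in Section III.A."""
    H, W, _ = lab_img.shape
    backgrounds = []
    for s in scales:
        h, w = max(1, H // 2 ** s), max(1, W // 2 ** s)
        small = zoom(lab_img, (h / H, w / W, 1), order=3)            # downsize
        back = zoom(small, (H / small.shape[0], W / small.shape[1], 1),
                    order=3)                                         # resize back
        backgrounds.append(back[:H, :W, :])
    return backgrounds

def scale_differences(lab_img, backgrounds):
    """Eq. (2): per-pixel Euclidean distance between the (lightly
    smoothed) Lab image and each background image B_s."""
    smoothed = gaussian_filter(lab_img, sigma=(1, 1, 0))  # noise-removal step
    return [np.linalg.norm(smoothed - b, axis=2) for b in backgrounds]
```

A caller would feed a Lab-converted float image and obtain one difference map per scale, each the same size as the input, which is what allows a full-resolution saliency map later on.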
Figure 1. First column: original images; second column: saliency maps generated by [22] (first row) and [21] (second row); third column: saliency maps generated by the method proposed herein.
where I_μbg is the arithmetic mean feature vector of the pixels in the Gaussian-blurred input image that are marked as background, and E_r is a disk structuring element with radius r. In the first iteration a larger radius is used and, as the algorithm proceeds, the radius decreases. The resulting background-marked images are shown in Fig. 3 (f), (i) and (l). Finally, our saliency map is evaluated as

S(x, y) = ||I_μbg − I_G(x, y)||. (3)

IV.

EXPERIMENTS

Our saliency maps are compared to the ones generated by other methods in terms of precision, recall and F-measure, the latter defined as

F_β = ((1 + β²) × precision × recall) / (β² × precision + recall). (4)
To weigh precision and recall equally we set β = 1. Fig. 5 shows the F-measure, precision and recall curves obtained in this experiment. Analyzing these curves, we can see that our method achieves a better F-measure for the vast majority of possible thresholds compared to the saliency maps generated by the other methods. It has worse precision but better recall compared to the method proposed in [20].
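A minimal sketch of this evaluation protocol, binarizing the saliency map at every threshold t ∈ [0, 255] and scoring it against a binary ground-truth mask, could look like the following (function names are ours; this is not the authors' evaluation code):

```python
import numpy as np

def f_measure(precision, recall, beta2=1.0):
    """Weighted harmonic mean of precision and recall; beta^2 = 1
    weighs both equally, as in the experiments."""
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0

def pr_curves(saliency, ground_truth, thresholds=range(256)):
    """Sweep every threshold, binarize the saliency map and score the
    result against the binary ground-truth mask."""
    gt = ground_truth.astype(bool)
    curves = []
    for t in thresholds:
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()          # true positives
        precision = tp / pred.sum() if pred.sum() else 0.0
        recall = tp / gt.sum() if gt.sum() else 0.0
        curves.append((precision, recall, f_measure(precision, recall)))
    return curves
```

Averaging these per-threshold scores over a database of annotated images yields curves of the kind plotted in Fig. 5.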
TABLE I. PRECISION, RECALL AND F-MEASURE VALUES USING THE BINARIZATION METHOD HEREIN PROPOSED

Method     Precision   Recall     F-Measure
proposed   0.8364      0.74872    0.75949
msss       0.90009     0.76523    0.79381
ftsr       0.81968     0.71455    0.73461
mz         0.60719     0.79986    0.66652
it         0.76983     0.57991    0.61817
gb         0.67187     0.7677     0.68152
ac         0.74604     0.56754    0.60048
sr         0.69022     0.59076    0.59103
Figure 5. F-Measure, precision and recall curves.

B. Evaluation of the Segmentation Using Gestalt Principles

This experiment evaluates the quality of our segmentation algorithm based on Gestalt principles. Table 1 shows the average precision, recall and F-measure of our segmentation applied to our saliency maps and to all the other saliency maps used herein. As one can see, excluding the method in [20], our saliency maps outperform the other methods in F-measure. Our saliency method achieved worse results than msss under our Gestalt-based segmentation, even though it showed better F-measure values in the first experiment. This can be explained as follows: msss has worse recall than our method, indicating a higher false negative rate. However, false negative pixels (after the first thresholding applied in our segmentation) can be recovered by the Gestalt grouping laws used in our approach. On the other hand, our saliency maps showed worse precision than the ones generated by msss (in the first experiment), which indicates a higher false positive rate. As false positive pixels (after the first thresholding applied in our segmentation) cannot be excluded by our Gestalt approach, the proposed segmentation is less effective on our saliency maps than on the ones generated by msss. The fact that msss achieved worse F-measure values in the first experiment but better results in the second can be seen as evidence of the good performance of our segmentation method.

Concluding this experiment, we compare our segmentation approach with classic thresholding techniques. Table 2 shows the results of the Otsu [26], Kittler [29] and Johannsen [30] methods applied to the saliency maps generated by our method and by msss. Comparing the F-measure values from Tables 1 and 2, we can see that our segmentation outperforms all the other thresholding methods.

V.
CONCLUSIONS AND FUTURE WORK
We presented a visual attention method suitable for image segmentation in natural scenes. Our work improves the approach presented in [19] by analyzing the input image across different scales based on the concept of stability [24]. This analysis allows our method to define a more meaningful average feature vector of the image background without affecting the final saliency map with the undesirable effects caused by multi-scale approaches, such as ill-defined boundaries and high saliency concentrated at edge pixels. Our saliency maps were compared with those of seven other methods, showing a better F-measure. Moreover, a segmentation algorithm that binarizes saliency maps based on Gestalt grouping laws was also presented; compared with three classic thresholding techniques, it showed better results. As future work, we plan to model other Gestalt grouping laws, such as continuity, and to improve the similarity law with information beyond color (shape, for example). Moreover, an improvement to exclude false positive pixels from saliency maps during the segmentation process is also desired.
TABLE II. PRECISION, RECALL AND F-MEASURE VALUES USING OTHER BINARIZATION METHODS

Saliency   Binarization   Precision   Recall    F-Measure
Proposed   Otsu           0.79008     0.69418   0.70688
Proposed   Kittler        0.76873     0.68604   0.69282
Proposed   Johannsen      0.81937     0.60322   0.62442
MSSS       Otsu           0.83002     0.64459   0.69773
MSSS       Kittler        0.79802     0.65143   0.68436
MSSS       Johannsen      0.85884     0.41121   0.46716
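For reference, the Otsu baseline [26] used in Table II can be sketched in plain NumPy; this is an illustrative re-implementation of the classical algorithm, not the code used in the experiments:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method [26]: choose the gray level that maximizes the
    between-class variance of the histogram; pixels above the returned
    threshold would be taken as salient when binarizing a saliency map."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()                  # gray-level probabilities
    omega = np.cumsum(p)                   # class-0 probability up to level t
    mu = np.cumsum(p * np.arange(256))     # first-order cumulative moment
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_total * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b2[~np.isfinite(sigma_b2)] = 0.0
    return int(np.argmax(sigma_b2))
```

On a strongly bimodal saliency map this picks a level separating the two modes, which is why it serves as a natural baseline for saliency-map binarization.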
ACKNOWLEDGMENT

This research is partially sponsored by CNPq under grant 141190/2013-2.

REFERENCES

[1] E. R. Davies, Machine Vision: Theory, Algorithms, Practicalities, 3rd ed. Morgan Kaufmann, 2005.
[2] S. Frintrop and E. Rome, "Computational Visual Attention Systems and Their Cognitive Foundations: A Survey," ACM Transactions on Applied Perception, vol. 7, no. 1, pp. 6.1–6.39, 2010.
[3] J. K. Tsotsos, A Computational Perspective on Visual Attention. MIT Press, 2011.
[4] J. Wolfe, K. Kluender, and D. Levi, Sensation and Perception, 2nd ed. Sinauer Associates, 2008.
[5] B. Follet, O. Le Meur, and T. Baccino, "Modeling visual attention on scenes," Studia Informatica Universalis, vol. 8, no. 4, pp. 150–167, 2010.
[6] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, 1982.
[7] M. I. Posner, C. R. R. Snyder, and B. J. Davidson, "Attention and the Detection of Signals," Journal of Experimental Psychology, vol. 109, no. 2, pp. 160–174, 1980.
[8] J. M. Wolfe and T. S. Horowitz, "What attributes guide the deployment of visual attention and how do they do it?," Nature Reviews Neuroscience, vol. 5, no. 6, pp. 1–7, Jun. 2004.
[9] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[10] L. Itti and C. Koch, "Computational modelling of visual attention," Nature Reviews Neuroscience, vol. 2, no. 3, pp. 194–203, Mar. 2001.
[11] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, vol. 4, no. 4, pp. 219–227, Jan. 1985.
[12] B. Julesz, "Textons, the elements of texture perception, and their interactions," Nature, vol. 290, pp. 91–97, 1981.
[13] J. Malik and P. Perona, "Preattentive texture discrimination with early vision mechanisms," Journal of the Optical Society of America A, vol. 7, no. 5, pp. 923–932, 1990.
[14] N. Zlatoff, B. Tellez, and A. Baskurt, "Combining local belief from low-level primitives for perceptual grouping," Pattern Recognition, vol. 41, pp. 1215–1229, 2008.
[15] K. Koffka, Principles of Gestalt Psychology. Harcourt, 1935.
[16] G. Shafer, A Mathematical Theory of Evidence. Princeton University Press, 1976.
[17] A. Tremeau and P. Colantoni, "Regions adjacency graph applied to color image segmentation," IEEE Transactions on Image Processing, vol. 9, no. 4, pp. 735–744, 2000.
[18] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, "Salient Region Detection and Segmentation," Springer Lecture Notes in Computer Science, pp. 66–75, 2008.
[19] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned Salient Region Detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597–1604.
[20] R. Achanta and S. Süsstrunk, "Saliency Detection Using Maximum Symmetric Surround," in IEEE 17th International Conference on Image Processing, 2010, pp. 2653–2656.
[21] J. Harel, C. Koch, and P. Perona, "Graph-Based Visual Saliency," in Neural Information Processing Systems (NIPS), 2006.
[22] Y. Ma and H. Zhang, "Contrast-based Image Attention Analysis by Using Fuzzy Growing," in ACM International Conference on Multimedia, 2003, pp. 374–381.
[23] X. Hou and L. Zhang, "Saliency Detection: A Spectral Residual Approach," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[24] P. L. Rosin, "A simple method for detecting salient regions," Pattern Recognition, vol. 42, no. 11, pp. 2363–2371, Nov. 2009.
[25] W. F. Bischof and T. Caelli, "Parsing scale-space and spatial stability analysis," Computer Vision, Graphics, and Image Processing, vol. 42, pp. 192–205, 1988.
[26] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[27] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. 2010.
[28] G. Borgefors, "Distance transforms in arbitrary dimensions," Computer Vision, Graphics, and Image Processing, vol. 27, pp. 321–354, 1984.
[29] M. Sezgin and B. Sankur, "Survey over image thresholding techniques and quantitative performance evaluation," Journal of Electronic Imaging, vol. 13, no. 1, pp. 146–168, 2004.
[30] G. Johannsen and J. Bille, "A threshold selection method using information measures," in Proceedings of the 6th International Conference on Pattern Recognition, 1982, pp. 140–143.