Automatic Identification of Perceptually Important Regions in an Image

Wilfried Osberger
Space Centre for Satellite Navigation, Queensland University of Technology
GPO Box 2434, Brisbane 4001, Australia
[email protected]

Anthony J. Maeder
School of Engineering, University of Ballarat
PO Box 663, Ballarat 3353, Australia
[email protected]

Abstract

We present a method for automatically determining the perceptual importance of different regions in an image. The algorithm is based on human visual attention and eye movement characteristics. Several features known to influence human visual attention are evaluated for each region of a segmented image, producing an importance value for each factor and region. These are combined to produce an Importance Map, which classifies each region of the image according to its perceptual importance. The results indicate that the calculated Importance Maps correlate well with human perception of visually important regions. Importance Maps can be used in a variety of applications, including compression, machine vision, and image databases. Our technique is computationally efficient and flexible, and can easily be extended to specific applications.

1. Introduction

A challenging problem in image processing is the emulation of human vision tasks. The ease with which humans perform complex visual tasks often masks the tremendous computation involved in performing these functions. Despite the inherent difficulties of automatically emulating human visual processing, the benefits of doing so have led to continued widespread research in the area. In many image processing applications (e.g. region-based compression), it is desirable to perform an image segmentation in a manner analogous to humans. Following this segmentation, it is useful to be able to identify which regions a human observer considers to be of the highest perceptual importance (i.e. the areas likely to attract our attention). Studies of visual attention and eye movements [3, 9, 11] have shown that humans generally attend to only a few areas in an image. Even when given unlimited viewing time, subjects will continue to focus on these few areas rather than scan the whole image. These areas are often highly correlated amongst different subjects, when viewed in the same context. In order to automatically determine the parts of an image that a human is likely to attend to, we need to understand the operation of human visual attention and eye movements. Research into the attentional mechanisms of the Human Visual System (HVS) has revealed several low level and high level factors which influence attention and eye movements. In this paper, we use these factors to develop an Importance Map (IM) [5]. The IM predicts, for each region in the image, its perceptual importance. To obtain the IM, a segmentation is first performed. Each region in the image is then assessed with regard to a number of different low and high level factors, and these factors are combined to produce the overall IM for the image. Such a map is useful for a wide range of image processing applications, including image compression and machine vision. The method is flexible and is amenable to extension to image sequences.
2. Detection of important regions by the Human Visual System In order to deal efficiently with the mass of information present in the surrounding environment, the HVS operates at variable resolution. Although our field of view is around 180 degrees horizontally and 140 degrees vertically, we possess a high degree of visual acuity over only a very small area (around 2 degrees in diameter) called the fovea. Thus, in order to inspect the various objects in our environment accurately, eye movements are required. Rapid shifts in the eye's focus of attention (saccades) occur every 100–500 milliseconds. Visual attention mechanisms are used to control these saccades. Our pre-attentive vision [7] operates in parallel, searching the periphery for important or uncertain areas for the eye to foveate on at the next saccade. Thus a very strong relationship exists between eye movements and attention.
2.1. Eye movement correlation between subjects If there were not a strong correlation between the directions of gaze of different people, eye movements would be impossible to predict, and it would be difficult to make general use of eye movement information. However, studies of human eye movement patterns for both images and video indicate that eye movements are indeed highly correlated amongst subjects. The original work of Yarbus [11] showed that a strong correlation between viewers' eye movements exists, as long as the subjects view the image in the same context (i.e. with the same instructions and motivation). Yarbus also demonstrated that even when given unlimited viewing time, we do not scan all areas of a scene, but instead attend to a handful of important regions which continually attract our attention. Similar results have been found for video by Stelmach [10]. This suggests that eye movements are not idiosyncratic, and that a strong relationship exists between the direction of gaze of different
subjects, viewing an image in the same context.
2.2. Factors which influence eye movements and attention

In order to automatically determine the importance of the different regions in an image, we need to determine the factors which influence human visual attention. Our attention is controlled by both high and low level factors. High level factors generally involve some feedback process from memory and may involve template matching. Low level processes are generally fast, feed-forward mechanisms involving relatively simple processing. A general observation is that objects which stand out from their surrounds are likely to attract our attention, since one of the main goals of the HVS is to minimise uncertainty. This also agrees with Gestalt theories of organisation. Low level factors which have been found to influence visual attention include:

Contrast. The HVS converts luminance into contrast at an early stage of processing. Region contrast is therefore a very strong low-level visual attractor. Regions which have a high contrast with their surrounds attract our attention and are likely to be of greater visual importance [3, 9, 11].

Size. Findlay [3] has shown that region size also has an important effect in attracting attention. Larger regions are more likely to attract our attention than smaller ones. However, a saturation point exists, after which the importance due to size levels off.

Shape. Regions whose shape is long and thin (edge-like) have been found to be visual attractors [4, 9]. They are more likely to attract attention than rounder regions of the same area and contrast.

Colour. Colour has been found to be important in attracting attention [1, 7]. Some particular colours (e.g. red) have been shown to attract our attention. A strong influence occurs when the colour of a region is distinct from the colour of its background.

Motion. Motion has been found to be one of the strongest influences on visual attention [7]. Our peripheral vision is highly tuned to detecting changes in motion, and our attention is involuntarily drawn to peripheral areas whose motion is distinct from that of their surrounds.

Other low level factors which have been found to influence attention include brightness, orientation, and line ends. Several high level factors have also been determined:

Location. Eye-tracking experiments have shown that viewers' eyes are directed at the central 25% of the screen for a majority of viewing material [2].

Foreground / Background. Viewers are more likely to be attracted to objects in the foreground than to those in the background [1].

People. Many studies [4, 9, 11] have shown that we are drawn to focus on people in a scene, in particular their faces, eyes, mouths, and hands.

Context. Viewers' eye movements can change dramatically, depending on the instructions they are given whilst observing the image [4, 11].

2.3. Considerations for modeling visual attention

Although many factors which influence visual attention have been identified, little quantitative data exists regarding the exact weighting of the different factors and their interrelationship. Some factors are clearly of very high importance (e.g. motion), but it is difficult to determine exactly how much more important one factor is than another. A particular factor may be more important than another factor in one image, while in another image the opposite may be true. Due to this lack of information, it is necessary to consider a large number of factors when modeling visual attention [6, 7, 12]. This caters for cases where not all of the factors are in use at any one time. It is also desirable that the factors used be independent, so that no single type of factor exerts undue influence on the overall importance. High level factors such as context can be very useful in determining a region's importance. In situations where a template of a target is known a priori, viewer eye movements can be modeled with high accuracy. However, in the general case, little is known about the context of viewing or the content of the scene, so such high level information cannot be used.

Figure 1. Block diagram for Importance Map calculation: the original image is segmented; each region is rated by the contrast, size, shape, location, and background factors; and the factor outputs are combined to form the Importance Map.

3. Algorithm for Importance Map calculation

The basic operation of our automatic region importance classifier is shown in Figure 1. The original grey-scale image is input and segmented into homogeneous regions. We use a recursive split and merge technique to perform the segmentation; it gives satisfactory results and is computationally inexpensive. Our split and merge technique uses the local region variance as the splitting and merging criterion. We have found a variance threshold of 250 to work well for most images. Regions smaller than 16 pixels are merged with their most similar neighbour, to avoid excessively small regions. The segmented image is then analysed with respect to a number of factors known to influence attention, and an importance is assigned to each region for each factor. We have chosen 5 importance factors for this algorithm. However, the flexible structure of our approach allows additional factors to be incorporated easily (e.g. if information about the image or viewing context were known a priori). The factors that we have chosen are:

Contrast of region with background. The contrast importance I_{contrast} of a region R_i is calculated as:

I_{contrast}(R_i) = gl(R_i) - gl(R_{i,neighbours})    (1)

where gl(R_i) is the mean grey level of region R_i, and gl(R_{i,neighbours}) is the mean grey level of all of the
neighbouring regions of R_i. Subtraction is used rather than division, since the grey levels are assumed to form a perceptually linear space. I_{contrast} is scaled to the range [0, 1].

Size of region. Importance due to region size is calculated as:

I_{size}(R_i) = \min( A(R_i) / A_{max},\; 1.0 )    (2)

where A(R_i) is the area of region R_i in pixels, and A_{max} is a constant used to prevent excessive weighting being given to very large regions. We have set A_{max} equal to 1% of the total image area.

Shape of region. Importance due to region shape is calculated as:

I_{shape}(R_i) = bp(R_i)^{sp} / A(R_i)    (3)

where bp(R_i) is the number of pixels in region R_i which border other regions, and sp is a constant. We found a value of sp of 1.75 to provide good discrimination of shapes. Thus long, thin regions will have a high shape importance, while rounder regions will have a lower one. The final shape importance is normalised to the range [0, 1].

Location of region. Importance due to the location of a region is calculated as:

I_{location}(R_i) = centre(R_i) / A(R_i)    (4)

where centre(R_i) is the number of pixels in region R_i which lie in the central 25% of the image. Thus regions contained entirely within the central quarter of the image will have a location importance of 1.0, and regions with no central pixels will have a location importance of 0.0.

Foreground / Background region. We detect background regions by determining the proportion of the total image border that is contained in each region. Regions with a high number of image border pixels are classified as belonging to the background and will have a low background/foreground importance, as given by:

I_{bg}(R_i) = 1.0 - \min( borderpix(R_i) / (0.5 \, tot\_borderpix),\; 1.0 )    (5)

where borderpix(R_i) is the number of pixels in region R_i which lie on the image border, and tot_borderpix is the total number of image border pixels.

By this stage we have assigned, for each region in the image, a rating for each of the 5 importance factors, each normalised to the range [0, 1]. We now need to combine these 5 factors for each region to produce an overall IM for the image. As mentioned in Section 2.3, there is little quantitative data indicating the relative importance of these different factors, and this relation is likely to change from one image to the next. We therefore choose to treat each factor as being of equal importance. However, if a particular factor were known to be of higher importance, a scale factor could easily be incorporated. We would also like to assign a higher importance to regions which rank very strongly in some factors; a simple averaging of the importance factors would not provide this. We have therefore chosen to square and sum the factors to produce the final IM:

IM(R_i) = \sum_{k=1}^{5} ( I_k(R_i) )^2    (6)

where k sums over the 5 importance factors. The final IM is produced by scaling the result so that the region of highest importance has an importance value of 1.0.

4. Results

We have tested our technique on a wide variety of scene types, and the results have been promising. For simple scenes, our method typically identifies the most salient regions in an image. Results for more complex scenes have also been good, provided an accurate segmentation of the image was possible. Figure 2 demonstrates the processes involved in obtaining the IM for the relatively simple Miss America image. As can be seen in Figure 2(b), the split and merge segmentation slightly oversegments the image, producing 147 separate regions. The outputs of the 5 importance factors are shown in Figures 2(c)–(g). These factors are combined to produce the final IM in Figure 2(h). This map gives a good estimate of the subjectively important areas in the picture. It can however be seen that the fine segmentation has resulted in an IM which is blotchy around the face. This problem could be solved by a more accurate segmentation, or by a post-smoothing operation. IMs for more complex scenes have also been generated. Figure 3 shows the IM for the Lighthouse image. This scene has a perceptually dominant lighthouse which attracts our attention, but it also contains several areas of secondary perceptual importance, such as the houses, the horizon, and the crashing wave. These areas of secondary visual importance have been accurately identified by our IM.

5. Discussion

This paper has presented a novel method for identifying perceptually important regions in an image. The technique is computationally efficient and has been designed in a flexible manner to easily accommodate modifications and application-specific requirements. As demonstrated in Section 4, the results have been promising for both simple and complex images. However, many areas for improvement remain. The results obtained by our technique are limited by the success of the segmentation, so we are currently investigating improved segmentation methods. Our method currently uses only grey-scale images as input, so the inclusion of a colour importance factor is also desirable. We have investigated extending the method to image sequences by including a motion importance factor, and have recently adapted this technique to control an MPEG encoder [8]. In specific applications, a priori information may be available regarding the content of the images or the context of viewing, so additional high level importance factors may easily be included (e.g. face detection in a videophone application). IMs can also be used in other areas such as machine vision, human perception research, and image databases. Any application where it is desirable to focus on the perceptually relevant areas of an image (or sequence of images) and to discard areas of lower perceptual importance could readily benefit from the use of IMs.
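The recursive split and merge segmentation described in Section 3 can be sketched as follows. This is a minimal illustration under our own design choices, not the authors' implementation: it uses a quadtree-style split driven by the 250 grey-level variance threshold from the text, and merges regions smaller than 16 pixels into the 4-connected neighbour with the closest mean grey level. All function names are ours.

```python
import numpy as np

VAR_THRESH = 250.0  # grey-level variance threshold from the text
MIN_SIZE = 16       # regions smaller than this are merged away

def quadtree_split(img, y0, y1, x0, x1, labels, next_label):
    """Recursively split a block while its grey-level variance exceeds VAR_THRESH."""
    h, w = y1 - y0, x1 - x0
    if img[y0:y1, x0:x1].var() <= VAR_THRESH or h < 2 or w < 2:
        labels[y0:y1, x0:x1] = next_label[0]  # homogeneous enough: one region
        next_label[0] += 1
        return
    ym, xm = y0 + h // 2, x0 + w // 2
    for (a, b, c, d) in ((y0, ym, x0, xm), (y0, ym, xm, x1),
                         (ym, y1, x0, xm), (ym, y1, xm, x1)):
        quadtree_split(img, a, b, c, d, labels, next_label)

def merge_small_regions(img, labels):
    """Merge regions below MIN_SIZE into the 4-connected neighbour with the closest mean."""
    H, W = labels.shape
    changed = True
    while changed:
        changed = False
        ids, counts = np.unique(labels, return_counts=True)
        means = {i: img[labels == i].mean() for i in ids}
        for i, c in zip(ids, counts):
            if c >= MIN_SIZE:
                continue
            mask = labels == i
            neighbours = set()
            for y, x in zip(*np.nonzero(mask)):
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W and labels[yy, xx] != i:
                        neighbours.add(labels[yy, xx])
            if neighbours:
                best = min(neighbours, key=lambda j: abs(means[j] - means[i]))
                labels[mask] = best
                changed = True
                break  # region statistics changed; restart the scan
    return labels

def split_and_merge(img):
    """Variance-driven split-and-merge segmentation; returns an integer label map."""
    img = np.asarray(img, dtype=float)
    labels = np.zeros(img.shape, dtype=int)
    quadtree_split(img, 0, img.shape[0], 0, img.shape[1], labels, [0])
    return merge_small_regions(img, labels)
```

A quadtree split is only one way to realise a recursive split; the paper does not specify the split geometry, so the block subdivision here should be read as an assumption.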
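Given a grey-scale image and a region label map from any segmentation, the five importance factors and their square-and-sum combination, equations (1) through (6), might be computed as in the following sketch. The absolute-value contrast and the max-normalisation of the contrast and shape factors are our assumptions about how the paper's [0, 1] scaling is realised; `importance_map` and the other names are illustrative.

```python
import numpy as np

SP = 1.75  # shape exponent from the text

def importance_map(img, labels):
    """Per-region importance: five factors in [0, 1], squared, summed, top region -> 1.0."""
    img = np.asarray(img, dtype=float)
    H, W = labels.shape
    ids = np.unique(labels)
    area = {i: np.count_nonzero(labels == i) for i in ids}
    mean = {i: img[labels == i].mean() for i in ids}

    # Adjacency, region-boundary pixels, and image-border pixels (4-connectivity).
    nb = {i: set() for i in ids}
    bp = {i: 0 for i in ids}        # pixels bordering another region
    edgepix = {i: 0 for i in ids}   # pixels on the image border
    for y in range(H):
        for x in range(W):
            i = labels[y, x]
            if y in (0, H - 1) or x in (0, W - 1):
                edgepix[i] += 1
            on_boundary = False
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W and labels[yy, xx] != i:
                    nb[i].add(labels[yy, xx])
                    on_boundary = True
            if on_boundary:
                bp[i] += 1

    # Central 25% of the image: a half-width, half-height centred rectangle.
    centre = np.zeros_like(labels, dtype=bool)
    centre[H // 4:3 * H // 4, W // 4:3 * W // 4] = True

    A_max = 0.01 * H * W            # 1% of the image area, as in the text
    tot_border = 2 * H + 2 * W - 4  # total number of image border pixels
    factors = {}
    for i in ids:
        # (1) contrast: mean grey-level difference with neighbours (scaled below)
        c = abs(mean[i] - np.mean([mean[j] for j in nb[i]])) if nb[i] else 0.0
        # (2) size, clamped so very large regions do not dominate
        s = min(area[i] / A_max, 1.0)
        # (3) shape: long thin regions score high (scaled below)
        sh = bp[i] ** SP / area[i]
        # (4) location: fraction of the region inside the central quarter
        loc = np.count_nonzero(centre & (labels == i)) / area[i]
        # (5) background: regions holding much of the image border score low
        bg = 1.0 - min(edgepix[i] / (0.5 * tot_border), 1.0)
        factors[i] = [c, s, sh, loc, bg]

    # Scale contrast (index 0) and shape (index 2) into [0, 1].
    for k in (0, 2):
        m = max(f[k] for f in factors.values())
        if m > 0:
            for f in factors.values():
                f[k] /= m

    # (6) square-and-sum, then scale the most important region to 1.0.
    im = {i: sum(v ** 2 for v in factors[i]) for i in ids}
    top = max(im.values())
    return {i: v / top for i, v in im.items()} if top > 0 else im
```

On a toy image consisting of a bright centred square on a dark background, the square region receives the maximum importance of 1.0, and the background region scores lower on the shape, location, and background factors, as the text predicts.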
Figure 2. Importance Map for the Miss America image, with intermediate importance factor calculations also shown: (a) original image, (b) segmented image, (c) contrast factor, (d) size factor, (e) shape factor, (f) location factor, (g) background factor, and (h) final Importance Map produced by summing (c)–(g). For (c)–(h), lighter regions represent higher importance.
Figure 3. Importance Map for the Lighthouse image: (a) original image, and (b) final Importance Map, with lighter regions representing higher importance.
References
[1] B. L. Cole and P. K. Hughes. Drivers don't search: they just notice. In D. Brogan, editor, Visual Search, pages 407–417. Taylor and Francis, 1990.
[2] G. Elias, G. Sherwin, and J. Wise. Eye movements while viewing NTSC format television. SMPTE Psychophysics Subcommittee white paper, Mar 1984.
[3] J. Findlay. The visual stimulus for saccadic eye movement in human observers. Perception, 9:7–21, Sept. 1980.
[4] A. Gale. Human response to visual stimuli. In W. Hendee and P. Wells, editors, The Perception of Visual Information, pages 127–147. Springer-Verlag, 1997.
[5] A. Maeder, J. Diederich, and E. Niebur. Limiting human perception for image sequences. In Proceedings SPIE 2657, pages 330–337, San Jose, Feb 1996.
[6] X. Marichal, T. Delmot, V. De Vleeschouwer, and B. Macq. Automatic detection of interest areas of an image or a sequence of images. In ICIP, pages 371–374, Lausanne, Switzerland, Sep 1996.
[7] E. Niebur and C. Koch. Computational architectures for attention. In R. Parasuraman, editor, The Attentive Brain. MIT Press, 1997.
[8] W. Osberger, A. Maeder, and N. Bergmann. An MPEG encoder incorporating perceptually based quantisation. In Proceedings SPIE 3299, San Jose, Jan 1998.
[9] J. Senders. Distribution of attention in static and dynamic scenes. In Proceedings SPIE 3016, pages 186–194, San Jose, Feb 1997.
[10] L. Stelmach, W. Tam, and P. Hearty. Static and dynamic spatial resolution in image coding: An investigation of eye movements. In Proceedings SPIE 1453, pages 147–152, San Jose, Feb 1992.
[11] A. Yarbus. Eye Movements and Vision. Plenum Press, New York NY, 1967.
[12] J. Zhao, Y. Shimazu, K. Ohta, R. Hayasaka, and Y. Matsushita. An outstandingness oriented image segmentation and its application. In ISSPA, pages 45–48, Gold Coast, Australia, Aug 1996.