Computational versus Psychophysical Bottom-Up Image Saliency: A Comparative Evaluation Study

Alexander Toet, Senior Member, IEEE

Abstract—The predictions of 13 computational bottom-up saliency models and a newly introduced Multiscale Contrast Conspicuity (MCC) metric are compared with human visual conspicuity measurements. The agreement between human visual conspicuity estimates and model saliency predictions is quantified through their rank order correlation. The maximum of the computational saliency value over the target support area correlates most strongly with visual conspicuity for 12 of the 13 models. A simple multiscale contrast model and the MCC metric both yield the largest correlation with human visual target conspicuity (> 0.84). Local image saliency largely determines human visual inspection and interpretation of static and dynamic scenes. Computational saliency models therefore have a wide range of important applications, like adaptive content delivery, region-of-interest-based image compression, video summarization, progressive image transmission, image segmentation, image quality assessment, object recognition, and content-aware image scaling. However, current bottom-up saliency models do not incorporate important visual effects like crowding and lateral interaction. Additional knowledge about the exact nature of the interactions between the mechanisms mediating human visual saliency is required to develop these models further. The MCC metric and its associated psychophysical saliency measurement procedure are useful tools to systematically investigate the relative contribution of different feature dimensions to overall visual target saliency.

Index Terms—Saliency, image analysis, visual search.
1 INTRODUCTION

This study investigates the agreement between the predictions of several recently developed computational bottom-up saliency models and human visual saliency estimates. Human visual fixation behavior is driven both by sensorial bottom-up mechanisms [1], [2] and by higher order task-specific or goal-directed top-down mechanisms [3], [4]. Visual saliency refers to the physical, bottom-up distinctness of image details [5]. It is a relative property that depends on the degree to which a detail is visually distinct from its background [6]. Visual saliency is believed to drive human fixation behavior during free viewing by attracting visual attention in a bottom-up way [7]. As such, it is an important factor in our everyday functioning. Human observer studies have indeed shown that saliency can be a strong predictor of attention and gaze allocation during free viewing, both for static scenes [8], [9], [10] and for dynamic scenes [11]. Moreover, saliency also appears to determine which details humans find interesting in visual scenes [12]. Since saliency is a crucial factor in most everyday visual tasks, it has long been a research topic of great interest. Despite a large amount of research effort, the nature of the underlying neural mechanisms mediating human visual
saliency remains elusive. One of the main reasons is that human saliency estimates are typically derived from fixation behavior in conditions involving task demands and contextual knowledge, and therefore reflect target relevance instead of pure bottom-up saliency [5]. Human visual saliency has been operationalized through the concept of “conspicuity area,” which is defined as the spatial region around the center of gaze where the target can be detected or identified in the background, within a single fixation [13], [14], [15]. Geisler and Chou [16] showed that much of the variation in search times is indeed accounted for by variations in the visual conspicuity area. Thus, visual conspicuity determines human visual behavior and performance when freely inspecting and interpreting images. Visual conspicuity reflects the uncompromised physical, bottom-up distinctiveness of a target, and involves no high-level top-down processes such as memory, task demands, attention, and strategies. It is therefore inherently different from the concept of relevance [5]. Until recently, the practical value of the visual conspicuity concept was limited by the fact that the associated psychophysical measurement procedures were intricate and time-consuming [13], [14], [15], [16]. However, we developed a psychophysical procedure that alleviates this problem by allowing a quick in situ assessment of the extent of the conspicuity area, with full prior knowledge of the target and its location in the scene [17], [18]. The method measures the maximal lateral distance between target and eye fixation at which the target can be resolved from its background. This measure characterizes the degree to which the target stands out from its immediate surroundings [6], and correlates with human visual search and detection performance in realistic and complex scenarios [17], [18]. In
this study, human observers use this procedure to measure the visual saliency of targets in complex rural backgrounds. These measurements are then compared with the predictions of 13 recently developed computational saliency models and with a new saliency metric introduced here which mimics our conspicuity measurement procedure. The automatic detection of salient image regions is important for applications like adaptive content delivery [19], adaptive region-of-interest-based image compression [20], [21], video summarization [22], progressive image transmission [23], image segmentation [24], [25], image and video quality assessment [26], [27], [28], object recognition [29], and content-aware image resizing [30]. Based on the notion that being a local outlier makes a point salient, Koch and Ullman [2] introduced the concept of a saliency map, which is a two-dimensional topographic representation of saliency for each pixel in an image. Over the past decade, many different algorithms have been proposed to compute visual saliency maps from digital imagery [2], [19], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49]. These algorithms typically transform a given input image into a scalar-valued map in which local signal intensity corresponds to local image saliency [36], [48]. They can be broadly classified as biologically based [2], [44], [45], [46], purely computational [19], [48], [50], [51], [52], or a combination of both [31], [49]. Some algorithms detect saliency at multiple spatial scales [44], [45], [46], [52], while others operate on a single scale [19], [50]. They either separately create individual maps, which are then combined to obtain the final saliency map [19], [50], [53], [54], or they directly construct a feature-combined saliency map [19], [52]. Recently, saliency algorithms have been extended to produce spatiotemporal saliency maps for the analysis of dynamic imagery [11], [21], [38], [47], [55], [56], [57], [58], [59], [60]. A limiting factor of applications that depend on the automatic detection of salient image regions or the prediction of human fixation locations is their reliance on the availability and ease of implementation of existing saliency models and the assumption that such models do a sufficiently good job of predicting fixation selection [61]. Unfortunately, there is no consensus on any assessment method for bottom-up saliency models [62]. In the literature, the quality (validity) of computational saliency maps is usually assessed through comparison with corresponding human observer data. Psychophysical saliency estimates are typically derived from studies investigating how visual cues affect human performance: Cues are said to have a high saliency when they effectively draw attention to the target’s location [63], [64], [65]. Metrics presented in the literature to assess the quality of computational saliency models through comparison with human observer data include:

. The correlation between human fixation order and the ranking induced on image details by computational saliency values [8], [10], [21], [60] or simulated fixation order [66], [67], [68].
. The agreement between the subjective and computational saliency ranking orders [12], [69].
. The precision, recall, and F measures computed from the overlap of, respectively, binary human labeled mask images and binary (thresholded) computational saliency maps [43], [48], [70].
. The hit and false alarm rates as the mean computational saliency values in the nonzero valued regions of, respectively, binary human labeled mask images and their complement, using the assumption that mean saliency is high in object areas and low in background areas [51].
. The mean absolute difference or correlation between each pixel in human fixation density maps and their corresponding computational saliency maps [27], [34], [62].
. Binary (thresholded) saliency masks derived from the computational saliency map and used to compute both the segmentation order and the hit and miss rates from a comparison with human segmentations [21], [57].
. ROC area metrics to evaluate human fixation prediction [9], [21], [31], [32], [36], [47], [49], [56], [59], [71], [72], [73], [74], [75]. In this approach, the saliency map is treated as a binary classifier on every pixel in the image: Pixels with saliency values exceeding a given threshold are classified as fixated, while the remaining pixels are classified as nonfixated. Human fixations serve as ground truth. By varying the threshold, an ROC curve is obtained and the area under the curve indicates how well the saliency map predicts actual human eye fixations.

However, most of these metrics reflect target relevance instead of target saliency [5], since the human visual performance measures used in their construction (e.g., fixation order, subjective ranking, and search and detection time) involve high-level top-down processes, such as memory, task demands, attention, and strategies. In most cases, it is not bottom-up saliency, but the most “interesting” or “meaningful” object in an image that attracts attention [76]. In the presence of phase-dependent higher order image features (e.g., lines, edges, and symmetries), lower order features only have a correlative effect on human fixation behavior [77], [78], [79]. Moreover, task demands override sensory-driven (bottom-up) saliency [10], [80], [81], [82]. It is only in the absence of high-level features (e.g., in natural scenes where the global scene layout lacks diagnostic properties) and prescribed tasks that lower order features guide attention to some extent [78]. As a result, bottom-up saliency models are limited in their ability to predict fixation behavior [10], [67], [81], [83], [84], [85], [86]. In realistic visual search conditions, top-down attention dominates, and people are rarely distracted by visually salient but irrelevant items [83]. But even in free viewing conditions, when top-down processing is minimal, bottom-up saliency models are poor predictors of human fixation behavior [61], [87], [88], [89]. When viewing objects in natural images, the above-mentioned performance metrics therefore provide only limited insight into the quality of bottom-up saliency models and the underlying mechanisms mediating human visual saliency [77]. We argue that visual conspicuity is a useful and efficient metric for the further development and calibration of computational bottom-up saliency models, which can serve to systematically investigate the relative contribution of different feature dimensions to overall visual target saliency. In this study, we perform a straightforward evaluation of 13 previously published computational bottom-up saliency models by correlating their output with human visual conspicuity (saliency) estimates. The visual conspicuity estimates are obtained using a previously developed simple
and efficient psychophysical measurement procedure [17], [18], which has recently been the subject of an extensive validation study [6]. We also introduce a new saliency metric which mimics our psychophysical measurement procedure, and compare its predictions with the human observer data.
2 METHODS
This study investigates the agreement between the predictions of several recently developed computational bottom-up saliency models and human visual saliency estimates. In the next sections, we first present the psychophysical measurement procedure used to obtain the human visual saliency (conspicuity) estimates. Then, we briefly describe the state-of-the-art computational saliency models from the literature which we used in this study. Next, we present a new bottom-up saliency metric, inspired by our psychophysical measurement procedure. Finally, we compare the output of the computational saliency models with the human visual saliency estimates.
2.1 Visual Conspicuity

Human visual target conspicuity has operationally been defined as the peripheral area around the center of gaze from which specific target information can be extracted in a single glimpse [13], [14], [15]. Thus, for fixations falling within this area, the target is capable of attracting visual attention. The size and shape of the conspicuity area have been measured for a range of static targets in static scenes [13], [14], [15], [16], [90], [91], [92], [93]. The conspicuity area is small if the target is embedded in a background with high feature or spatial variability. The conspicuity area is large if the target stands out clearly from its background. Psychophysical procedures for measuring the conspicuity area are typically rather intricate and time-consuming [13], [14], [15], [16]. Here, we define human visual conspicuity as the maximum angular gaze deviation at which targets can still be distinguished from their immediate surroundings. We previously developed and validated an efficient psychophysical measurement procedure to quantify this concept [6], [17], [18]. Since saliency is reliably sustained in static stimuli [94], the procedure can be used with full prior knowledge of the target and its location in the scene. The resulting saliency estimates determine human visual search and detection performance in realistic and complex scenarios [17], [18]. The conspicuity measurement procedure is as follows [6], [17], [18]: First, the observer foveates the target to ensure that he has full knowledge of its location and appearance. This is especially relevant for complex scenes where many details compete for the observer’s attention. Next, the observer fixates a point in the scene that is both 1) at a large angular distance from the target location and 2) in the frontoparallel plane through the target. This initial fixation point should be sufficiently remote from the target so that it cannot be distinguished at this stage. The observer then successively fixates locations in the scene that are progressively closer to the target position until he can perceive (distinguish) the target in his peripheral field of view. The successive fixation points are along a line through the initial fixation point and the center of the target. The outward-in fixation procedure simulates free search without a priori knowledge of target presence, and prevents hysteresis
effects, which may occur for inward-out procedures. The angular distance between the fixation location at which the target is first noted and the center of the target is then recorded. The measurement is repeated at least three times. Subjects typically make a setting within one minute. The mean of the angular distances thus obtained is adopted as the characteristic spatial extent of the conspicuity area of the target, in the direction of the initial fixation point. This procedure yields reliable measurements (for three repetitions, the standard measurement error is typically less than 10 percent), and accounts for important visual effects like crowding [95] and lateral masking [6], [96]. Depending on the criterion used by the subjects to assess a target’s conspicuity, we distinguish two types of conspicuity estimates: detection and identification conspicuity [18]. Detection conspicuity corresponds to the criterion whether the image structure in the target area differs noticeably from the local background. In this case, previous target fixation merely serves to familiarize the observer with the location of the target. The observer is explicitly instructed not to note the target features. Thus, detection conspicuity is likely to reflect bottom-up saliency. Identification conspicuity corresponds to the criterion whether the image details at the location of the target indeed represent the target, or can actually be identified as the target. In this case, the observer is explicitly instructed to use the target features in the discrimination task. Thus, identification conspicuity also contains a top-down saliency component. This is probably the reason why identification conspicuity predicts mean search time more closely than detection conspicuity since both tasks depend on a priori target information: An image detail is more likely to be fixated when it resembles a target [17]. Identification typically requires a larger feature contrast than detection [97]. Thus, for a given feature contrast, identification saliency is typically smaller than detection saliency. It was recently found that detection and identification indeed involve different cortical mechanisms [97].
2.2 Computational Saliency Models

Computational saliency models transform the input image into a saliency map in which local signal intensity corresponds to local image saliency. In the following sections, we briefly introduce the saliency models used in this study (for full details, we refer to the original papers), and we provide a brief comparative overview of the model features.

2.2.1 AIM: Self-Information-Based Saliency

The Attention based on Information Maximization (AIM) model [31], [32], [33], [34] uses Shannon's self-information measure [98] to transform the image feature plane into a dimension that closely corresponds to visual saliency (available at http://web.me.com/john.tsotsos/Visual_Attention/ST_and_Saliency.html). The underlying hypothesis is that the saliency of a visual feature is related to the information it provides relative to its local surround, or more specifically, how unexpected the local image content in a local patch is relative to its surroundings. In this sense, the self-information-based saliency measure has an inherent contrast component (as opposed to local entropy, which is a measure of local activity). The information $I(x)$ conveyed by an image feature $x$ is inversely proportional to the likelihood of observing $x$. Shannon's relationship may be written explicitly as $I(x) = -\log(p(x))$. Informativeness in a feature domain
requires the estimation of a one-dimensional probability density function. In the more general case that requires computing a joint likelihood measure based on local statistics, estimation of a 3 × M × N-dimensional probability density function is required for a local window size of M × N in RGB space. In practice, estimation of such high-dimensional probability functions proves unfeasible, requiring prohibitively large degrees of computation and data. For dimensionality reduction, Bruce and Tsotsos [31], [32], [33], [34] employ a representation based on independent component analysis (ICA) [99], [100]. ICA reduces the problem of estimating a 3 × M × N-dimensional probability density function to the problem of estimating 3 × M × N one-dimensional probability density functions. The associated data transformation is lossless and invertible. To determine a suitable basis, ICA is performed on a large sample of small RGB patches drawn from natural images [31], [34]. For a given image, an estimate of the distribution of each basis coefficient is learned across the entire image through nonparametric density estimation. The probability of observing the RGB values corresponding to a patch centered at any image location is then evaluated by independently considering the likelihood of each corresponding basis coefficient. The product of such likelihoods yields the joint likelihood of the entire set of basis coefficients.
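The following Python fragment is a minimal sketch of this self-information computation, not the authors' AIM implementation: as a simplification it learns the ICA basis from patches of a gray-scale test image itself and uses histogram-based density estimates, whereas AIM uses a fixed ICA basis learned from a large collection of natural RGB image patches. The function name and all parameter values are illustrative.

```python
# Minimal sketch of AIM-style self-information saliency (illustration only).
# Simplification: the ICA basis is learned from patches of the test image itself
# and densities are histogram estimates. Intended for small/downscaled images.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.image import extract_patches_2d

def aim_saliency(gray, patch=7, n_components=25, bins=100):
    pts = extract_patches_2d(gray.astype(float), (patch, patch))  # all overlapping patches
    X = pts.reshape(len(pts), -1)
    X -= X.mean(axis=0)
    coeffs = FastICA(n_components=n_components, max_iter=500).fit_transform(X)
    info = np.zeros(len(coeffs))
    for k in range(coeffs.shape[1]):      # -log p(a_k), one 1D density per basis coefficient
        hist, edges = np.histogram(coeffs[:, k], bins=bins, density=True)
        idx = np.clip(np.digitize(coeffs[:, k], edges) - 1, 0, bins - 1)
        info += -np.log(hist[idx] + 1e-12)
    h, w = gray.shape
    out = np.zeros((h, w))
    out[patch // 2:h - patch // 2, patch // 2:w - patch // 2] = \
        info.reshape(h - patch + 1, w - patch + 1)
    return out
```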
2.2.2 SUN: Saliency Using Natural Image Statistics

Saliency Using Natural (SUN) image statistics (available at http://cseweb.ucsd.edu/~l6zhang) computes bottom-up saliency as the self-information of local image features [35], [36]. The underlying hypothesis of SUN is that the visual system aims to detect potential targets. To achieve this, it must actively estimate the probability of a target at every location given the visual features observed. SUN is based on the assumption that this probability, or a monotonic transformation thereof, is equal to visual saliency [36]. In SUN, the overall saliency contains both a bottom-up and a top-down component. Bottom-up saliency corresponds to the self-information of the local image features, and drives fixation behavior in the case of free viewing when no particular target is of interest. Top-down saliency corresponds to the amount of mutual information between local image features and target features and drives goal-directed search behavior. Most other information-based saliency measures only use the statistics of the current test image. In contrast, SUN compares the features observed at each point in a test image to the statistics of natural images.

SUN model. Let z denote a point in the visual field. In the current context, a point corresponds to a single image pixel, but in other contexts, a point may refer to more complex image details. Let the binary random variable C denote whether or not a point belongs to a target class, let the random variable L denote the location (i.e., the pixel coordinates) of a point, and let the random variable F denote the visual features of a point. The saliency $s_z$ of a point z is then defined as $p(C = 1 \mid F = f_z, L = l_z)$, where $f_z$ represents the feature values observed at z and $l_z$ represents the location (pixel coordinates) of z. This probability can be calculated using Bayes' rule:

\[ s_z = p(C = 1 \mid F = f_z, L = l_z) = \frac{p(F = f_z, L = l_z \mid C = 1)\, p(C = 1)}{p(F = f_z, L = l_z)}. \tag{1} \]
Assuming that features and locations are independent, and conditionally independent given C = 1, (1) can be rewritten as

\[ s_z = \frac{p(F = f_z \mid C = 1)\, p(C = 1 \mid L = l_z)}{p(F = f_z)}. \tag{2} \]
Taking the logarithm of this saliency function allows its interpretation in information-theoretic terms:

\[ \log s_z = -\log p(F = f_z) + \log p(F = f_z \mid C = 1) + \log p(C = 1 \mid L = l_z). \tag{3} \]

Since the logarithm is a monotonically increasing function, it does not affect the ranking of salience across the image. The first term on the right side of (3), $-\log p(F = f_z)$, depends only on the visual features observed at the point z and is independent of any knowledge about the target class. In information theory, this term is known as the self-information of the random variable F when it takes the value $f_z$. Self-information increases when the probability of a feature decreases; in other words, rarer features are more informative. The second term on the right side of (3), $\log p(F = f_z \mid C = 1)$, is a log-likelihood term that favors feature values that are consistent with a priori knowledge of the target. For example, if the target is known to be green, then the log-likelihood term will be much larger for a green point than for a blue point. This corresponds to the top-down saliency effect when searching for a known target. The third term in (3), $\log p(C = 1 \mid L = l_z)$, is independent of visual features and reflects prior knowledge about the locations where the target is likely to appear. This corresponds to the fact that observers are more likely to attend to locations where targets are most likely to appear. In many cases, there is no a priori knowledge about potential target locations, so that the third term can be omitted and (3) reduces to

\[ \log s_z = -\log p(F = f_z) + \log p(F = f_z \mid C = 1) = \log \frac{p(F = f_z \mid C = 1)}{p(F = f_z)}. \tag{4} \]
The resulting expression, called the pointwise mutual information between the visual feature and the presence of a target, expresses overall saliency. Intuitively, it favors feature values that are more likely in the presence of a target than in a target's absence. In the case of free viewing, attention will be drawn to any potential target in the visual field, despite the fact that the features associated with the target class are unknown. In that case, the log-likelihood term in (3) is unknown and can be omitted from the saliency calculation (note that this is equivalent to the assumption of a uniform likelihood distribution over feature values), so that the overall saliency reduces to just the self-information term [101]:

\[ \log s_z = -\log p(F = f_z). \tag{5} \]

In summary, the calculation of image saliency leads naturally to the estimation of information content. In free viewing, when there is no specific search target, saliency reduces to the self-information of a feature. When searching for a particular target, the most salient features are those that have the most mutual information with the target.

SUN implementation. Following Bruce and Tsotsos [31], [32], [33], [34], SUN employs a representation based on independent component analysis, using 362 linear ICA features that were derived from a large natural image data set [35], [36] to compute the overall saliency given by (5). Thus, F is a vector of 362 filter responses, $F = [F_1, F_2, \ldots, F_{362}]$, where $F_i$ represents the response of the $i$th filter at a point z and $f = [f_1, f_2, \ldots, f_{362}]$ are the values of these filter responses at this point.
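As an illustration of the bottom-up term in (5), the sketch below estimates $-\log p(F = f_z)$ for a gray-scale image; it is not the SUN implementation. A small difference-of-Gaussians filter bank stands in for the 362 ICA filters, and the response distributions are estimated from the test image itself by histograms, whereas SUN uses feature statistics learned from natural images.

```python
# Minimal sketch of the bottom-up SUN term of eq. (5), log s_z = -log p(F = f_z).
# Simplifications: a difference-of-Gaussians filter bank stands in for the ICA
# filters, and densities come from the test image rather than natural image statistics.
import numpy as np
from scipy.ndimage import gaussian_filter

def sun_bottom_up(gray, sigmas=(1, 2, 4, 8), bins=100):
    gray = gray.astype(float)
    self_info = np.zeros_like(gray)
    for s in sigmas:
        f = gaussian_filter(gray, s) - gaussian_filter(gray, 1.6 * s)   # DoG response
        hist, edges = np.histogram(f, bins=bins, density=True)
        idx = np.clip(np.digitize(f, edges) - 1, 0, bins - 1)
        self_info += -np.log(hist[idx] + 1e-12)       # sum of -log p(F_i = f_i) over filters
    return self_info
```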
2.2.3 Rarity-Based Saliency

GR: Global rarity. Mancas' Global Rarity (GR) model [37], [38], [39], [40], [41], [42] (available at http://tcts.fpms.ac.be/attention) computes saliency as the self-information of the mean and variance of local image intensity. It assumes that visual attention is attracted by features which are in minority in a given image. This hypothesis appears supported by the fact that human fixation is attracted by, e.g.,

. homogeneous areas in heterogeneous scenes, but also by heterogeneous areas in homogeneous scenes,
. bright areas in dark scenes, and by dark areas in bright scenes, and
. low-contrast areas in high-contrast scenes and by high-contrast areas in low-contrast scenes.

In this context, rarity is a synonym for anomalous or interesting areas. The rarity concept is closely related to Shannon's self-information measure [98]. In general, the details or features composing an image M can be considered as a set of messages $m_i$. Thus, messages can be anything, like edges, orientation or motion vectors, texture descriptors, etc. Here, we will assume that the messages are equal to the image intensity levels. A message rarity $I(m_i)$ is defined as

\[ I(m_i) = -\log(p(m_i)), \tag{6} \]

where $p(m_i)$ is the probability that a message $m_i$ is chosen from all possible choices in the message set M (its occurrence
likelihood). $I(m_i)$ is large when a message is rare. The attention map is obtained by replacing each message $m_i$ by its corresponding rarity value $I(m_i)$. The probability $p(m_i)$ is estimated as a combination of two terms:

\[ p(m_i) = A(m_i) \cdot B(m_i), \tag{7} \]

with

\[ A(m_i) = \frac{H(m_i)}{\mathrm{Card}(M)}, \tag{8} \]

where $H(m_i)$ is the bin value of the histogram for message $m_i$ and $\mathrm{Card}(M)$ the cardinality of M. The quantification of the set M determines the sensitivity of $A(m_i)$: For a smaller set cardinality, more messages will be indistinguishable. $B(m_i)$ quantifies the global contrast of a message $m_i$ on the context M:

\[ B(m_i) = 1 - \frac{\sum_{j=1}^{\mathrm{Card}(M)} |m_i - m_j|}{(\mathrm{Card}(M) - 1) \cdot \mathrm{Max}(M)}. \tag{9} \]
For a very distinct message, $B(m_i)$ will be small (near zero), so the occurrence likelihood $p(m_i)$ will be low and the message rarity will be high. $B(m_i)$ was introduced to avoid cases where two messages have the same occurrence probability and thus the same value of $A(m_i)$, while one message is highly distinct whereas the other is quite similar to the other messages. While $A(m_i)$ is a probability, $B(m_i)$ is a factor that represents the influence of distinctness. Hence, the final $p(m_i)$ is not strictly equal to a probability and the rarity $I(m_i)$ is not identical to the self-information as it is commonly defined. To take spatial information into account, both the local mean and variance of the image intensity are calculated over a sliding 3 × 3 window and are used as the messages $m_i$. Equation (8) is then computed by counting pixel neighborhoods with similar mean and variance, while (9) is solved by computing the distance between the mean of a given pixel neighborhood and the mean of all other neighborhoods. The two resulting saliency maps (one for the mean and the other for the variance) are simply added to obtain the overall saliency map. Color images are handled in a luminance, red-green, and blue-yellow opponent color space. First, individual saliency maps are calculated for each of the three color channels. Then, the three individual saliency maps are fused into a single overall saliency map by using a maximum operator. The rationale for this approach is the fact that a feature which is salient in at least one of the channels will also be salient in an absolute sense.

LR: Local rarity. Local rarity (LR) computes local contrast as a measure of saliency [39]. The underlying hypothesis is that fixation is attracted by high-contrast image details. An image detail with low global rarity may still have high local contrast and thus appear salient [39]. A problem with the computation of local contrast is that its value is highly scale-dependent (i.e., it depends strongly on the size of the local area over which it is calculated). This yields the unwanted side effect that textured regions may have high local contrast on a small scale, whereas they are not salient when observed at larger scales due to their highly predictable regular structure. To avoid this problem,
mean local contrast is computed over a range of spatial scales. Mancas [39] uses six Gaussian filters of increasing spatial scales to produce a set of six low-pass filtered images $R_k$, with k ranging from one (the lowest resolution or largest kernel) to six (the highest resolution or no filtering). Local contrast is then defined as

\[ LC = R_6 - \frac{1}{5} \sum_{k=1}^{5} R_k. \tag{10} \]

In this approach, a background model is constructed as the mean over multiple spatial scales, and the difference between the input (unfiltered) image and this model highlights salient details (i.e., details with high local contrast).
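A minimal sketch of the local contrast measure of (10) is given below. The Gaussian filter widths are an assumption made here for illustration; Mancas [39] only specifies that six filters of increasing spatial scale are used.

```python
# Minimal sketch of the LR local contrast measure of eq. (10). The Gaussian widths
# (sigmas) are an illustrative assumption for the five low-pass filtered images.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_rarity_contrast(gray, sigmas=(1, 2, 4, 8, 16)):
    R6 = gray.astype(float)                            # highest resolution: no filtering
    Rk = [gaussian_filter(R6, s) for s in sigmas]      # progressively low-pass filtered images
    return R6 - np.mean(Rk, axis=0)                    # LC = R6 - (1/5) * sum_k R_k
```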
2.2.4 EDS: Edge Distance Saliency

Rosin's [43] Edge Distance Saliency (EDS, available at http://users.cs.cf.ac.uk/Paul.Rosin) computes image saliency as the inverse of the multiscale distance to the nearest edge. The underlying hypothesis is that fixation is attracted by intensity edges. First, a simple Sobel operator [102] is used to produce an edge transform of the original input image. Then, the resulting gray-level edge image is thresholded at multiple gray-scale levels to produce a set of binary edge images. Next, a distance transform [103] is applied to each of the binary edge images to propagate the edge information. The resulting set of gray-level distance transforms is then summed to obtain the overall saliency map. The pixel value in this map is inversely related to saliency: Pixels close to edges have low values, while pixels farther from edges have higher values. Note that the threshold decomposition step is introduced to approximate a gray-level distance transform by a set of binary distance transforms at multiple threshold levels. Attempts to extend the method to color images have not been successful [43].

2.2.5 SaliencyToolbox 2.2

Walther's SaliencyToolbox 2.2 [44], [45] (ST, available at http://www.saliencytoolbox.net) computes a saliency map from individual color, luminance, and orientation contrast maps at different spatial scales. The underlying hypothesis is that fixation is attracted by local feature contrast. It is based on the Itti et al. [46] implementation of the physiologically inspired saliency-based model of visual attention presented by Koch and Ullman [2]. The input image I is repeatedly convolved with a 6 × 6 separable Gaussian kernel [1 5 10 10 5 1]/32 and subsampled by a factor of two, resulting in a pyramidal image representation with levels $\sigma = [0, \ldots, 8]$. Thus, level $\sigma$ has $1/2^{\sigma}$ times the resolution of the original input image I. Let r, g, and b, respectively, be the red, green, and blue values of the input RGB color image. The intensity map is then computed as

\[ M_I = \frac{r + g + b}{3}. \tag{11} \]
This operation is repeated for each level of the input pyramid to obtain an intensity pyramid with levels $M_I(\sigma)$. Each level of the input pyramid is further decomposed into maps for red-green (RG) and blue-yellow (BY) opponencies:
\[ M_{RG} = \frac{r - g}{\max(r, g, b)} \tag{12} \]

and

\[ M_{BY} = \frac{b - \min(r, g)}{\max(r, g, b)}. \tag{13} \]
To avoid large fluctuations of the color opponency values at low luminance, $M_{RG}$ and $M_{BY}$ are set to zero at locations with $\max(r, g, b) < 1/10$, assuming a dynamic range of $[0, 1]$. Note that these definitions of color opponency deviate from those of Itti et al. [46]; see [44] for details. Local orientation maps are computed by convolving the intensity pyramid with Gabor filters:

\[ M_{\theta}(\sigma) = \| M_I(\sigma) * G_0(\theta) \| + \| M_I(\sigma) * G_{\pi/2}(\theta) \|, \tag{14} \]

where

\[ G_{\psi}(x, y, \theta) = \exp\left( -\frac{x'^2 + \gamma^2 y'^2}{2\delta^2} \right) \cos\left( 2\pi \frac{x'}{\lambda} + \psi \right) \tag{15} \]

represents a Gabor filter with aspect ratio $\gamma$, width $\delta$, wavelength $\lambda$, phase $\psi$, and coordinates $(x', y')$ transformed with respect to orientation $\theta$:

\[ x' = x \cos(\theta) + y \sin(\theta), \tag{16} \]
\[ y' = -x \sin(\theta) + y \cos(\theta). \tag{17} \]

Default parameter values are $\gamma = 1$, $\delta = 7/3$ pixels, $\lambda = 7$ pixels, and $\psi \in \{0, \pi/2\}$, while filters are truncated to 19 × 19 pixels [45]. Center-surround receptive fields are simulated by across-scale subtraction $\ominus$ between two maps at the center (c) and surround (s) levels in these pyramids, yielding "feature maps":

\[ F_{l,c,s} = N\big( | M_l(c) \ominus M_l(s) | \big) \quad \forall l \in L = L_I \cup L_C \cup L_O, \tag{18} \]

with

\[ L_I = \{ I \}, \quad L_C = \{ RG, BY \}, \quad L_O = \{ 0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ} \}. \tag{19} \]

$N(\cdot)$ is an iterative, nonlinear normalization operator, simulating local competition between neighboring salient image locations [104]. In our current study, we deactivated the normalization operator since it effectively performs as a threshold operator, resulting in large zero-valued areas in the resulting saliency map. The feature maps are summed over the center-surround combinations using across-scale addition $\oplus$, followed by normalization:

\[ \tilde{F}_l = N\left( \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} F_{l,c,s} \right) \quad \forall l \in L. \tag{20} \]

Color and orientation saliency maps are obtained by summing the subfeatures (opponency and orientation channels, respectively) and normalizing the result. The intensity saliency map is equal to $\tilde{F}_I$ in (20):
\[ C_I = \tilde{F}_I, \quad C_C = N\left( \sum_{l \in L_C} \tilde{F}_l \right), \quad C_O = N\left( \sum_{l \in L_O} \tilde{F}_l \right). \tag{21} \]

The individual saliency maps are finally combined into a single overall saliency map as follows:

\[ S = \frac{1}{3} \sum_{k \in \{I, C, O\}} C_k. \tag{22} \]
Here, we used the default values for all model parameters except for the size of the output saliency maps, which is set to the maximum value of 60 × 60 pixels.
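The sketch below illustrates the center-surround principle of (11), (18), and (20) for the intensity channel only. It is not the SaliencyToolbox code: the dyadic pyramid is approximated by Gaussian blurs of increasing width at full resolution, and the normalization operator N(·) is omitted (as in our use of the model).

```python
# Illustration of the center-surround intensity channel (eqs. (11), (18), (20)).
# The dyadic pyramid is approximated by Gaussian blurs of increasing width at full
# resolution, and N(.) is omitted; not Walther's SaliencyToolbox implementation.
import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_conspicuity(rgb):
    MI = rgb.astype(float).mean(axis=2)                # eq. (11): (r + g + b) / 3
    acc = np.zeros_like(MI)
    for c in (2, 3, 4):                                # center scales
        for s in (c + 3, c + 4):                       # surround scales
            center = gaussian_filter(MI, 2.0 ** c)
            surround = gaussian_filter(MI, 2.0 ** s)
            acc += np.abs(center - surround)           # |M_I(c) (-) M_I(s)|, eq. (18)
    return acc                                         # across-scale sum, eq. (20)
```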
2.2.6 GBVS: Graph-Based Visual Saliency

Harel's Graph-Based Visual Saliency (GBVS) model [49] (available at http://www.klab.caltech.edu/~harel/share/gbvs.php) computes saliency from distance-weighted multiscale feature dissimilarity maps, using a similar approach as Itti et al. [46] and Walther [44] to create feature maps at multiple spatial scales. The input image is first transformed into a pyramidal representation (typically consisting of three levels), and at each level, a set of three feature maps are computed representing, respectively, the intensity, color, and orientation distribution. The algorithm then builds a fully connected graph over all grid locations of each feature map and assigns weights between nodes that are inversely proportional to the similarity of feature values and their spatial distance. Let $M(i, j)$ and $M(p, q)$ represent, respectively, the values of feature map M at positions (nodes) $(i, j)$ and $(p, q)$. The dissimilarity between $M(i, j)$ and $M(p, q)$ can then be defined as

\[ d\big((i, j) \,\|\, (p, q)\big) = \left| \log \frac{M(i, j)}{M(p, q)} \right|. \tag{23} \]

The directed edge from node $(i, j)$ to node $(p, q)$ is then assigned a weight proportional to their dissimilarity and their distance on the lattice M:

\[ w\big((i, j), (p, q)\big) = d\big((i, j) \,\|\, (p, q)\big) \cdot F(i - p, j - q), \quad \text{where } F(a, b) = \exp\left( -\frac{a^2 + b^2}{2\sigma^2} \right). \tag{24} \]
The resulting graphs are treated as Markov chains by normalizing the weights of the outbound edges of each node to 1, and by defining an equivalence relation between nodes and states and between edge weights and transition probabilities. Their equilibrium distribution is adopted as the activation or saliency maps. Note that in the equilibrium distribution, nodes that are highly dissimilar to surrounding nodes will be assigned large values. The activation maps are finally normalized to emphasize conspicuous details and combined into a single overall saliency map. Here, the size of the saliency maps was set at their maximum of 60 × 60 pixels.
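The activation step of (23) and (24) can be sketched as follows for a single small feature map; Harel's full model applies this to several multiscale feature channels and adds a second, mass-concentrating normalization pass that is omitted here. The value of sigma and the use of plain power iteration are illustrative assumptions.

```python
# Sketch of the GBVS activation step (eqs. (23)-(24)) on one small feature map
# (e.g., 32 x 24 nodes): dissimilarity-times-distance edge weights, row-normalized
# into a Markov chain, whose equilibrium distribution is the activation map.
import numpy as np

def gbvs_activation(feature_map, sigma=5.0, eps=1e-12, iters=200):
    h, w = feature_map.shape
    M = feature_map.ravel().astype(float) + eps
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ys, xs = ys.ravel(), xs.ravel()
    d = np.abs(np.log(M[:, None] / M[None, :]))                        # eq. (23)
    F = np.exp(-((ys[:, None] - ys[None, :]) ** 2 +
                 (xs[:, None] - xs[None, :]) ** 2) / (2 * sigma ** 2))
    W = d * F                                                          # eq. (24)
    P = W / (W.sum(axis=1, keepdims=True) + eps)                       # outbound weights sum to 1
    pi = np.full(h * w, 1.0 / (h * w))
    for _ in range(iters):                                             # power iteration
        pi = pi @ P
        pi /= pi.sum()
    return pi.reshape(h, w)                                            # equilibrium distribution
```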
2.2.7 SW: Spectral Whitening

Bian and Zhang [47] use spectral whitening (SW) as a normalization procedure in the construction of a map that only represents salient features and localized motion, while effectively suppressing redundant (noninformative) background information and ego-motion. This reflects a basic principle of the human visual system which suppresses the response to redundant (frequently occurring, noninformative) features while responding to rare (informative) features. For scalar (gray-scale) images, the procedure is as follows: First, the input image is resized (through low-pass filtering and downsampling) such that the maximal dimension is 64 pixels, which is the heuristically optimized size [47]. The rationale for this step is the fact that the spatial resolution of the preattentive visual system is very limited [105]. Next, a windowed Fourier transform of the input image $I(x, y)$ is calculated as

\[ f(u, v) = \mathcal{F}[w(I(x, y))], \tag{25} \]

where $\mathcal{F}$ denotes the Fourier transform and w is a windowing function. The normalized (flattened or whitened) spectral response n is then computed as

\[ n(u, v) = f(u, v) / \| f(u, v) \|. \tag{26} \]

The normalized spectral response is transformed into the spatial domain through the inverse Fourier transform $\mathcal{F}^{-1}$, squared to emphasize salient regions, and finally convolved with a Gaussian low-pass filter $g(u, v)$ (typically of width 8 pixels) to model the spatial pooling operation of complex cells:

\[ S(x, y) = g(u, v) * \| \mathcal{F}^{-1}[n(u, v)] \|^2. \tag{27} \]

For vector-valued (color) input images, the standard Fourier transform is replaced by a quaternion vector Fourier transform [106] to account for dependencies between each of the color channels. In this approach, each color is treated as a vector and the color space is transformed as a whole. For a simple YUV color space transformation, the quaternion representation $q(x, y)$ of the input image $I(x, y)$ can be written as

\[ q(x, y) = w_y\, y(x, y)\, i + w_u\, u(x, y)\, j + w_v\, v(x, y)\, k, \tag{28} \]

where y, u, v denote the luminance and two color components, respectively, and $w_y, w_u, w_v$ are weights for the individual color channels representing the relative sensitivity of the visual system for each of these components. To compute the quaternion vector Fourier transform, we used Sangwine and Le Bihan's Quaternion toolbox for Matlab Version 1.6 ([106], available at http://sourceforge.net/projects/qtfm).
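For a gray-scale image, the whole SW pipeline of (25)-(27) amounts to only a few lines. The sketch below is an illustration rather than the authors' implementation; it omits the initial resizing to a maximal dimension of 64 pixels and the windowing function w.

```python
# Minimal sketch of spectral whitening saliency for a gray-scale image (eqs. (25)-(27)).
# The initial resizing and the windowing function are omitted for brevity.
import numpy as np
from scipy.ndimage import gaussian_filter

def spectral_whitening_saliency(gray, smooth_sigma=8):
    f = np.fft.fft2(gray.astype(float))          # eq. (25), without the window w
    n = f / (np.abs(f) + 1e-12)                  # eq. (26): whitened (unit-amplitude) spectrum
    s = np.abs(np.fft.ifft2(n)) ** 2             # inverse transform and squaring
    return gaussian_filter(s, smooth_sigma)      # eq. (27): spatial pooling
```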
2.2.8 PQFT: Phase Spectrum of Quaternion Fourier Transform

Guo and Zhang [21] calculate the spatiotemporal saliency map of an image from the phase spectrum of its quaternion Fourier transform (PQFT). The rationale for this approach is the fact that the phase spectrum of an image indicates the location of interesting details [57]. In this approach, each pixel of the image is represented by a quaternion that
consists of color, intensity, and motion features. The resulting spatiotemporal saliency map considers not only salient spatial features like color, orientation, and others in a single frame but also temporal features like motion. The PQFT model is independent of prior knowledge and is parameter-free, and requires only minimal computational effort, which makes it suitable for real-time implementation. Let $F(t)$ represent the input color image captured at time t, where $t = 1, 2, \ldots, T$ with T the total number of frames. The red, green, and blue channels of $F(t)$ are, respectively, represented by $r(t)$, $g(t)$, and $b(t)$. Four broadly tuned color channels can then be defined as follows [46]:

\[ R(t) = r(t) - \frac{g(t) + b(t)}{2}, \tag{29} \]
\[ G(t) = g(t) - \frac{r(t) + b(t)}{2}, \tag{30} \]
\[ B(t) = b(t) - \frac{r(t) + g(t)}{2}, \tag{31} \]
\[ Y(t) = \frac{r(t) + g(t)}{2} - \frac{|r(t) - g(t)|}{2} - b(t). \tag{32} \]

Similarly to the human visual system, opponent color channels are formed as follows [46]:

\[ RG(t) = R(t) - G(t), \tag{33} \]
\[ BY(t) = B(t) - Y(t), \tag{34} \]

and intensity and motion channels are calculated by

\[ I(t) = \frac{r(t) + g(t) + b(t)}{3}, \tag{35} \]
\[ M(t) = | I(t) - I(t - \tau) |, \tag{36} \]

where $\tau$ is a predefined latency coefficient. The quaternion image representation is then given by

\[ q(t) = M(t) + RG(t)\,\mu_1 + BY(t)\,\mu_2 + I(t)\,\mu_3, \tag{37} \]

where $\mu_i$, $i = 1, 2, 3$, satisfies $\mu_i^2 = -1$, $\mu_1 \perp \mu_2$, $\mu_2 \perp \mu_3$, $\mu_1 \perp \mu_3$, $\mu_3 = \mu_1 \mu_2$. In symplectic form, $q(t)$ can be written as

\[ q(t) = f_1(t) + f_2(t)\,\mu_2, \tag{38} \]

where

\[ f_1(t) = M(t) + RG(t)\,\mu_1, \tag{39} \]
\[ f_2(t) = BY(t) + I(t)\,\mu_1. \tag{40} \]

The quaternion image representation at each pixel is denoted by $q(n, m, t)$, where $(n, m)$ represent the spatial location of each pixel and t is the temporal coordinate. The quaternion Fourier image transform is then given by

\[ Q[u, v] = F_1[u, v] + F_2[u, v]\,\mu_2, \tag{41} \]
\[ F_i[u, v] = \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} e^{-\mu_1 2\pi (mv/M + nu/N)} f_i[n, m], \tag{42} \]

where $(u, v)$ represents the coordinates in the frequency domain, N and M are the image dimensions (width and height), $f_i$, $i \in \{1, 2\}$, are calculated by (39) and (40), and t is omitted for simplicity. The inverse quaternion Fourier transform is given by

\[ f_i[n, m] = \frac{1}{\sqrt{MN}} \sum_{v=0}^{M-1} \sum_{u=0}^{N-1} e^{\mu_1 2\pi (mv/M + nu/N)} F_i[u, v]. \tag{43} \]

The frequency domain representation $Q(t)$ of $q(t)$ can then be written in polar form as

\[ Q(t) = \| Q(t) \|\, e^{\mu \Phi(t)}, \tag{44} \]

where $\Phi(t)$ is the phase spectrum of $Q(t)$ and $\mu$ is a pure unit quaternion. After setting $\| Q(t) \| = 1$, only the phase information of $q(t)$ remains. Using (43) to compute the inverse $q'(t)$ of the phase-only representation of $Q(t)$ yields

\[ q'(t) = \alpha_0(t) + \alpha_1(t)\,\mu_1 + \alpha_2(t)\,\mu_2 + \alpha_3(t)\,\mu_3. \tag{45} \]

The spatiotemporal saliency map is finally computed as

\[ sM(t) = g * \| q'(t) \|^2, \tag{46} \]

where g is a 2D smoothing Gaussian filter (with a typical variance of 8 pixels). When applied to static images, the motion channel $M(t)$ is simply set to zero. Although PQFT can be applied to images of arbitrary size, it was found that input images of 64 × 64 pixels yield results that agree most with common viewing conditions [51]. To compute the quaternion vector Fourier transform, we used Sangwine and Le Bihan's Quaternion toolbox for Matlab Version 1.6 ([106], available from http://sourceforge.net/projects/qtfm). The saliency map was computed at the resolution of the input images (1,436 × 924).

2.2.9 FTS: Frequency-Tuned Saliency

Achanta et al. [48] (software available at http://ivrgwww.epfl.ch/supplementary_material/RK_CVPR09/index.html) compute bottom-up saliency as local multiscale color and luminance feature contrast. The underlying hypothesis is that fixation is driven by local center-surround feature contrast. First, the input sRGB image I is transformed to CIE Lab color space. Then, the scalar saliency map S for image I is computed as

\[ S(x, y) = \| I_{\mu} - I_{\omega hc}(x, y) \|, \tag{47} \]

where $I_{\mu}$ is the arithmetic mean image feature vector, $I_{\omega hc}$ is a Gaussian blurred version of the original image, using a 5 × 5 separable binomial kernel, $\|\cdot\|$ is the L2 norm (Euclidean distance), and x, y are the pixel coordinates. Blurring with a 5 × 5 Gaussian kernel serves to eliminate noise and fine texture details from the original image while still retaining a sufficient amount of high-frequency details. The mean image feature vector corresponds to blurring with a Gaussian of infinite extent. The difference $I_{\mu} - I_{\omega hc}$ effectively represents the output of a range of bandpass (DoG) filters at several image scales. Since the norm of the
difference is used, only the magnitude of the local differences contributes to the saliency of an image detail.
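Equation (47) translates almost directly into code. The sketch below uses skimage's rgb2lab for the sRGB to CIE Lab conversion and approximates the 5 × 5 separable binomial kernel by a small Gaussian blur; it is an illustration, not the authors' software.

```python
# Minimal sketch of frequency-tuned saliency (eq. (47)). A small Gaussian blur
# approximates the 5 x 5 separable binomial kernel used by Achanta et al.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.color import rgb2lab

def frequency_tuned_saliency(rgb):
    lab = rgb2lab(rgb)                                       # L, a, b feature vector per pixel
    mean_vec = lab.reshape(-1, 3).mean(axis=0)               # arithmetic mean image feature vector
    blurred = np.stack([gaussian_filter(lab[..., c], 1.0) for c in range(3)], axis=-1)
    return np.linalg.norm(mean_vec - blurred, axis=-1)       # S(x, y) = ||I_mu - I_whc(x, y)||
```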
2.2.10 AWS: Adaptive Whitening Saliency

Garcia-Diaz et al. [74], [75] adopt the variability in local energy as a measure of salience. Like the primate visual system [107], their Adaptive Whitening Saliency Model (available at http://merlin.dec.usc.es/agarcia/AWSmodel.html) computes a decorrelated representation of the input image as follows: First, the input color image is transformed into Lab color coordinates. The luminance image is then transformed into a multi-oriented multiresolution representation by means of a bank of log Gabor filters [108]. In the frequency domain, the log Gabor functions are as follows:

\[ \mathrm{logGabor}(\rho, \theta; \rho_i, \theta_i) = e^{-\frac{(\log(\rho/\rho_i))^2}{2(\log(\sigma_{\rho}/\rho_i))^2}}\; e^{-\frac{(\theta - \theta_i)^2}{2\sigma_{\theta}^2}}, \tag{48} \]

where $(\rho, \theta)$ are polar frequency coordinates and $(\rho_i, \theta_i)$ is the central frequency of the filter. In the spatial domain, these filters are complex valued functions, whose components can be represented by a pair of phase quadrature filters f and h [109], [110]. For each scale s and orientation o, a measure of the local energy $e_{so}$ in the input signal can be obtained by taking the modulus of the complex filter response:

\[ e_{so}(x, y) = \sqrt{ (L * f_{so})^2 + (L * h_{so})^2 }. \tag{49} \]

Since orientation selectivity is only weakly associated with color selectivity, the opponent color channels a and b simply undergo a multiscale decomposition, obtained as the responses to a bank of log Gaussian filters:

\[ \mathrm{logGauss}(\rho) = e^{-\frac{(\log \rho)^2}{2(\log(2^n \sigma))^2}}. \tag{50} \]

Thus, a real-valued response is obtained for each scale and color component. Typically, eight spatial scales spaced by one octave, four orientations (for local energy), a minimum wavelength of 2 pixels, $\sigma_{\theta} = 37.5^{\circ}$, and a frequency bandwidth of two octaves are used. For each orientation, the multiscale information is then decorrelated by means of a Principal Component Analysis (PCA). Then, the statistical Hotelling's $T^2$ distance is computed for each feature to the center of the distribution. This represents the distinctness between the local and global image structure. The Hotelling's $T^2$ statistic at each point is given by

\[ X = \begin{pmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & \ddots & \vdots \\ x_{S1} & \cdots & x_{SN} \end{pmatrix} \;\rightarrow\; (\mathrm{PCA}) \;\rightarrow\; T^2 = \big( T_1^2, \ldots, T_N^2 \big), \tag{51} \]

where S is the number of scales and N is the number of pixels. Given the covariance matrix W, $T^2$, defined as

\[ T_i^2 = (x_i - \bar{x})' W^{-1} (x_i - \bar{x}), \tag{52} \]

represents a measure of local feature contrast. The final saliency map is obtained by normalizing, smoothing, and summing the extracted maps, first within
each feature dimension and then over the resulting local energy saliency and color saliency maps. Finally, normalization and Gaussian smoothing are applied to gain robustness. The resulting maps are first summed, yielding local energy and color conspicuities, and then normalized and averaged, producing the final saliency map.
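The decorrelation-plus-distance step of (51) and (52) can be sketched as follows for a single feature channel. Computing T² with the full covariance inverse is equivalent to PCA whitening followed by a Euclidean distance. The log Gabor front end is replaced here by Gaussian derivative filters of increasing width, and the final fusion into conspicuity maps is omitted; both are simplifications for illustration only.

```python
# Sketch of the decorrelation + Hotelling T^2 step of eqs. (51)-(52) for one channel.
# Gaussian derivative filters of increasing width stand in for the log Gabor bank;
# illustration only, not the AWS implementation.
import numpy as np
from scipy.ndimage import gaussian_filter

def t2_distance(gray, sigmas=(1, 2, 4, 8, 16, 32)):
    h, w = gray.shape
    X = np.stack([gaussian_filter(gray.astype(float), s, order=(0, 1)) for s in sigmas])
    X = X.reshape(len(sigmas), -1)                       # S scales x N pixels
    Xc = X - X.mean(axis=1, keepdims=True)
    W = np.cov(Xc)                                       # S x S covariance over pixels
    T2 = np.einsum("sn,st,tn->n", Xc, np.linalg.pinv(W), Xc)   # (x_i - xbar)' W^-1 (x_i - xbar)
    return T2.reshape(h, w)
```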
2.2.11 SDSR: Saliency Detection by Self-Resemblance

The bottom-up Saliency Detection by Self-Resemblance (SDSR) model of Seo and Milanfar [56], [73] (available at http://users.soe.ucsc.edu/~rokaf/SaliencyDetection.html) computes a saliency map by measuring the similarity of a feature matrix at a pixel of interest relative to its neighboring feature matrices. Again, the underlying hypothesis is that fixation is driven by local feature contrast. First, local image structure at each pixel is represented by a matrix of local descriptors (local regression kernels), which are robust in the presence of noise and image distortions. Then, matrix cosine similarity (a generalization of cosine similarity) is used to quantify the likeness of each pixel to its surroundings. The saliency value $S_i$ at each pixel i represents the statistical likelihood of its feature matrix $F_i$ given the feature matrices $F_j$ of the surrounding pixels:

\[ S_i = \frac{1}{\sum_{j=1}^{N} \exp\left( \frac{-1 + \rho(F_i, F_j)}{\sigma^2} \right)}, \tag{53} \]

where $\rho(F_i, F_j)$ is the matrix cosine similarity between two feature matrices $F_i$ and $F_j$, and $\sigma$ is a local weighting parameter. The columns of the local feature matrices represent the output of the local steering kernels, which are modeled as

\[ K(x_l - x_i) = \frac{\sqrt{\det(C_l)}}{h^2} \exp\left\{ -\frac{(x_l - x_i)^T C_l (x_l - x_i)}{2h^2} \right\}, \tag{54} \]

where $l = \{1, \ldots, P\}$, P is the number of pixels in a local window, h is a global smoothing parameter, and the matrix $C_l$ is a covariance matrix estimated from a collection of spatial gradient vectors within the local analysis window around a sampling position $x_l = [x_1, x_2]_l^T$. SDSR can be applied on images of any size. In practice, the input images are downscaled (64 × 64) for computational efficiency. It was found that the inclusion of higher spatial scales did not improve the performance [56].
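The self-resemblance measure of (53) is sketched below, intended for small downscaled images. For brevity, L2-normalized raw gray-level patches stand in for the local steering kernel feature matrices of (54), plain cosine similarity stands in for the matrix cosine similarity, and the neighborhood layout and weighting parameter are illustrative assumptions.

```python
# Sketch of the self-resemblance measure of eq. (53) for small (e.g., 64 x 64) images.
# Raw normalized patches replace the local steering kernel features of eq. (54).
import numpy as np

def self_resemblance(gray, patch=7, neigh=3, sigma2=0.07):
    g = gray.astype(float)
    h, w = g.shape
    r, step = patch // 2, patch
    out = np.zeros((h, w))

    def feat(y, x):                                    # flattened, L2-normalized patch
        p = g[y - r:y + r + 1, x - r:x + r + 1].ravel()
        return p / (np.linalg.norm(p) + 1e-12)

    lo = r + neigh * step
    for y in range(lo, h - lo):
        for x in range(lo, w - lo):
            fi = feat(y, x)
            denom = 0.0
            for dy in range(-neigh, neigh + 1):        # surrounding patch centers
                for dx in range(-neigh, neigh + 1):
                    rho = float(fi @ feat(y + dy * step, x + dx * step))
                    denom += np.exp((-1.0 + rho) / sigma2)
            out[y, x] = 1.0 / denom                    # eq. (53)
    return out
```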
2.2.12 Esaliency: Extended Saliency

The Esaliency algorithm [111] computes saliency as the global distinctness of small image segments in feature space (available at http://www.cs.technion.ac.il/Labs/isl/Research/Attention/Attention.htm). The underlying hypothesis is that globally unique details attract attention. It assumes that 1) natural scenes are composed of small segments with uniform feature properties, 2) two visually similar segments are either both targets or both nontargets, and 3) the number of targets is small. After segmenting the input image, the algorithm applies a stochastic model to compute the probability that a segment is of interest (is a target), and adopts this probability as the saliency of the segment. Thus, saliency is a context-sensitive and global property and not merely based on local feature contrast.
Let $(c_1, \ldots, c_n)$ denote the set of image segments, and let $l = (l_1, \ldots, l_n)$ represent the associated vector of labels that are binary random variables which take the value 1 if a segment corresponds to a target and 0 if it is a nontarget. The Esaliency algorithm estimates the probability $P(l_i) = p(l_i = 1)$ that a segment corresponds to a target. Image segments are represented by their feature vectors, which may, for instance, consist of texture and color descriptors. Let $d_{ij}$ denote the feature space distance between two image segments i and j. The visual similarity between these segments can then be defined as the correlation coefficient between their labels

\[ \rho(l_i, l_j) = \frac{\mathrm{cov}(l_i, l_j)}{\sqrt{\mathrm{var}(l_i)\,\mathrm{var}(l_j)}} = \psi(d_{ij}), \tag{55} \]

where $\psi(d_{ij})$ is 1 for zero feature space distance and approaches 0 for increasing $d_{ij}$. Given this correlation matrix, the Esaliency algorithm uses a tree-based Bayesian network that takes only the strongest correlations into account to specify a probability distribution on the joint segment labels. Let G be the graph with n nodes representing the random segment labels, and edges representing the pairwise dependency between the random variables. The edges of G are then weighted by the mutual information between the corresponding nodes, and the maximal weighted spanning tree of G is selected. Selecting a root node in this tree makes it a directed tree, which is then converted into a Bayesian network by specifying the two conditional probabilities $p(l_i = 1 \mid l_p = 0)$ and $p(l_i = 1 \mid l_p = 1)$ for each of the other nodes in the tree. For a vector of labels $l = (l_1, \ldots, l_n)$, its probability is then given by

\[ p(l) = p(l_r) \prod_{i=1,\ldots,n;\, i \neq r} p(l_i \mid l_{\mathrm{par}(i)}), \tag{56} \]

where $\mathrm{par}(i)$ is the parent node of node i in the tree. Next, the distribution on the joint assignment vectors is obtained from the N most likely label assignments (using Nilsson's algorithm [112]):

\[ p'(l) = \frac{p(l)}{\sum_{j=1}^{N} p(l^j)}, \tag{57} \]

where $l^j$ represents the vector of labels corresponding to the $j$th most likely assignment. Typically, for scenes consisting of 100-500 segments, the first 100 most probable assignments are taken. Finally, the saliency of the image segment i is calculated as

\[ p_T(c_i) = \sum_{j=1}^{N} p'(l^j)\, l_i^j. \tag{58} \]
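The sketch below illustrates only the graph construction underlying Esaliency: pairwise feature-space distances between segments are converted into label correlations, and a maximal-correlation spanning tree is extracted. The choice ψ(d) = exp(-d²) is an illustrative assumption, the correlation is used here as a proxy for the mutual-information edge weights of the original algorithm, and the tree-structured Bayesian inference of (56)-(58) is not shown.

```python
# Sketch of the graph construction used by Esaliency: label correlations from
# pairwise feature-space distances and the maximal-correlation spanning tree.
# psi(d) = exp(-d**2) and the tree weighting are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def esaliency_graph(segment_features):
    d = squareform(pdist(segment_features))        # pairwise feature-space distances d_ij
    rho = np.exp(-d ** 2)                          # psi(d): 1 at d = 0, toward 0 for large d
    np.fill_diagonal(rho, 0.0)
    dd = d + 1e-9                                  # strictly positive off-diagonal edge weights
    np.fill_diagonal(dd, 0.0)
    # psi is monotonically decreasing, so the maximum-correlation spanning tree
    # equals the minimum spanning tree of the distances.
    tree = minimum_spanning_tree(dd).toarray()
    return rho, tree != 0                          # correlation matrix and tree edge mask
```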
2.2.13 MCC: Multiscale Contrast Conspicuity

We now introduce a simple bottom-up saliency metric, based on target intensity contrast. Our rationale is the fact that intensity contrast is a strong predictor of human fixation [79], particularly for viewing natural landscapes [8]. The computational method mimics our human visual conspicuity measurement procedure (Section 2.1). Although additional features can be incorporated in the metric, it only accounts for intensity contrast in its current form. This is no
restriction for the present study since the targets in the Search_2 data set are primarily contrast-defined. In the following, we assume that the metric is applied to the intensity component of the color input images. The method is as follows: First, the mean target intensity is computed, using the binary target mask supplied with the Search_2 data set. Next, the mean image intensity is computed over the combination of the target area and a surrounding border area. Target contrast is then defined as the ratio of mean target intensity and mean intensity over the target+surround area. Target contrast as a function of surround area is obtained by progressively increasing the width of the target surround. Target contrast has a value of 1 for zero surround width (i.e., target and target + surround areas are equal), and initially either increases (lower mean intensity local background) or decreases (larger mean intensity local background) with increasing surround width. We define the Multiscale Contrast Conspicuity (MCC) metric as the maximal surround width up to which the contrast ratio either monotonically increases or decreases. The rationale for this approach is as follows: Minimal filter (receptive field) size in the human visual system increases rapidly toward the periphery. In a gaze contingent filtering paradigm, the MCC metric corresponds to the largest separation between fixation and the target at which a fixation shift toward the target induces no appreciable change in filter activity. This filtering paradigm suggests a multiresolution implementation of the MCC metric to produce a bottom-up saliency map. First, construct a series of progressively low-pass filtered versions of the input image. Next, for each pixel in the input image, observe the variation of its value as a function of the level of resolution, and determine the lowest resolution level (largest filter size) at which this value no longer changes monotonically. The resolution parameter of this level is then adopted as the value of the MCC metric or the pixel's saliency value. The method is closely related to other computational multiscale saliency algorithms in the literature [113], [114]. In this study, we only investigate the saliency of a single target in each image. The progressive increase in target surround width in the computation of the MCC metric can be implemented as a progressive dilation [115] of the binary target mask. This is again efficiently implemented by using the two-dimensional Euclidean distance transform of the binary target mask with a square 3 × 3 structuring element [116]. The result of this transform is a gray-level image in which the value of each background pixel represents the Euclidean distance to the nearest target pixel. Thresholding this distance image at a given level provides a dilated version of the binary mask image, representing the target plus a surrounding border with a width equal to the given threshold. This mask (resembling an inflated version of the target mask) is then used to compute the mean image intensity over the target area and its local surround. Progressively larger threshold values correspond to progressively larger (dilated) surround areas. The maximal surround width up to which the contrast ratio either monotonically increases or decreases is adopted as the MCC metric.
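A minimal sketch of this single-target MCC computation is given below, following the distance-transform implementation just described; the maximum surround width of 200 pixels and the function names are illustrative choices.

```python
# Minimal sketch of the single-target MCC metric described above: grow the surround
# via the Euclidean distance transform of the binary target mask and return the
# largest surround width over which the contrast ratio changes monotonically.
import numpy as np
from scipy.ndimage import distance_transform_edt

def mcc(intensity, target_mask, max_width=200):
    intensity = intensity.astype(float)
    mask = target_mask.astype(bool)
    dist = distance_transform_edt(~mask)            # distance of each pixel to the nearest target pixel
    target_mean = intensity[mask].mean()
    ratios = [1.0]                                  # width 0: target and target+surround coincide
    for w in range(1, max_width + 1):
        region = dist <= w                          # dilated mask: target plus surround of width w
        ratios.append(target_mean / intensity[region].mean())
    diffs = np.diff(ratios)
    nz = np.nonzero(diffs)[0]
    if len(nz) == 0:
        return 0
    sign = np.sign(diffs[nz[0]])
    width = 0
    for dval in diffs:                              # largest width of monotone increase or decrease
        if np.sign(dval) != sign:
            break
        width += 1
    return width
```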
2.2.14 Model Characteristics
In this section, we briefly review some relevant characteristics of each of the 14 computational saliency models
investigated in this study (i.e., AIM [31], [32], [33], [34], SUN [35], [36], GR [37], [38], [39], [40], [41], [42], LR [39], EDS [43], ST [44], [45], GBVS [49], SW [47], PQFT [21], FTS [48], AWS [74], [75], SDSR [56], [73], Esaliency [111], MCC). All models used in this study transform a given input image into a two-dimensional intensity distribution that represents the saliency distribution over the image support, according to Koch and Ullman’s [2] original saliency map concept. Some models are inspired by biological mechanisms (ST, GBVS, and AWS), while others are based on information-theoretical or statistical principles (AIM, SUN, GR, LR, and Esaliency). However, this is not a fundamental distinction since a wide range of statistical computations are biologically plausible through implementation in simple and complex cells [117], [118]. Some models compute saliency at multiple spatial scales (LR, ST, GBVS, FTS, and AWS), while others operate on a single spatial scale (AIM, SUN, GR, EDS, and SDSR) or in the spatial frequency domain (SW and PQFT). Some models have several parameters (AIM, GR, ST, GBVS, AWS, SDSR, and Esaliency), while others are parameter-free (SUN, LR, EDS, SW, PQFT, FTS, and MCC). All models operate on color images. Some models yield a saliency map that has a lower resolution than the input image (LR, GBVS, SW, PQFT, and SDSR). Except for the models that operate in the Fourier domain (SW and PQFT), most models compute saliency from local feature contrast. Esaliency first segments the input image and then computes saliency as the global distinctness of these segments in feature space.
2.3 Stimuli
The images used to validate the computational saliency models are 44 digital color images, each representing a military vehicle (the target) in a complex rural background. These images are part of the TNO Human Factors Search_2 image data set (see [119] for a full description of the image data set and the associated psychophysical measurements). The image data set provides, for each target, both human visual detection and identification saliency measurements, as well as the mean search time (averaged over 64 observers). The data set also includes, for each image, a binary mask that represents the location of the visible parts of the target in the image. These masks enable the computation of different saliency value metrics over the target area. For this study, the original high-resolution (6,144 × 4,096 pixels) Search_2 images were subsampled by a factor of 4 to 1,536 × 1,024 pixels, which was the maximal image size that could be used as input for all models tested in this study.

2.4 Computing Target Saliency
We applied each of the 13 computational saliency models investigated in this study (i.e., AIM [31], [32], [33], [34], SUN [35], [36], GR [37], [38], [39], [40], [41], [42], LR [39], EDS [43], ST [44], [45], GBVS [49], SW [47], PQFT [21], FTS [48], AWS [74], [75], SDSR [56], [73], and Esaliency [111]) to each of the 44 images from the Search_2 data set [119]. Four of the models (FTS, PQFT, EDS, and LR) are parameter-free. The other models were used with their default parameters (as optimized by their authors). For all 13 computational models and for each of the 44 images in the data set, we computed both the average saliency and the maximum saliency over the target area, using the
binary target masks that are provided with the Search_2 data set. Average target saliency values were computed from the corresponding saliency maps by taking the ratio between the sum of saliency values within the target support and the number of pixels in that region. In addition, we computed the MCC saliency metric for the target in each of the 44 images.
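As a concrete illustration of this bookkeeping step, both statistics can be obtained directly from the binary mask. The sketch below assumes the saliency map and mask are available as NumPy arrays of the same size; the function and variable names are illustrative.

import numpy as np

def target_saliency_stats(saliency_map, target_mask):
    """Mean and maximum model saliency over the target support.

    saliency_map : 2D float array produced by a saliency model
    target_mask  : 2D boolean array marking the visible target pixels
    """
    values = saliency_map[target_mask]           # saliency values inside the target support
    mean_saliency = values.sum() / values.size   # sum over the support / number of support pixels
    max_saliency = values.max()                  # peak saliency over the target support
    return mean_saliency, max_saliency

For models whose saliency maps have a lower resolution than the input image, the mask (or the map) would first have to be resampled to a common size.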
3 RESULTS
We computed saliency maps for all 44 images in the Search_2 data set, using each of the 13 models from the literature. Using the binary target masks supplied with the data set, we computed both the mean and the maximal predicted saliency values over the target areas. We also computed the MCC metric for each of the 44 targets. With SPSS 17.0, we computed the Spearman's rank order correlations between the model predictions and the human visual (detection and identification) conspicuity estimates. Preliminary analyses were performed to ensure no violation of the assumptions of normality, linearity, and homoscedasticity. The results are presented in Table 1. We also computed the correlation between the saliency predictions and mean search time to quantify the agreement between the model predictions and human search performance (fixation behavior). Note that the EDS model values are negatively correlated with the human observer data because these values are inherently inversely related to saliency [43].

Table 1 shows that the correlation between the maximum computational saliency over the target support and the psychophysical values is typically higher than the correlation between the mean computational saliency over the target support and the psychophysical saliency estimates, except for Rosin's EDS [43] model. A Wilcoxon signed rank test reveals that the Spearman's rank order correlations between the maximum computed saliency values over the target support and the human saliency estimates are indeed significantly larger than the correlations between the mean computed saliency values and the human saliency estimates, both for detection and for identification saliency (Z = 3.181; p = 0.001), with a large effect size (r = 0.88). The mean value over the target support of the saliency computed with Mancas' Global Rarity [37], [38], [39], [40], [41], [42] is inversely correlated with the human saliency data and positively correlated with mean search time.

The overall largest correlations between model output and all psychophysical estimates (i.e., with both psychophysical target saliency measures and with mean search time) are obtained for Achanta's Frequency Tuned Saliency (FTS) model [48] and for the MCC metric. The largest overall correlations are obtained for human identification saliency and, respectively, the maximum saliency value computed by FTS over the target support area (r = 0.848, p < 0.01, 1-tailed) and MCC (r = 0.843, p < 0.01, 1-tailed). Since r² is, respectively, 0.72 and 0.71 for these models, they both explain more than 70 percent of the original variance, leaving less than 30 percent of the variability unexplained.
TABLE 1
Spearman's Rank Order Correlation between the Logarithmic Values of, Respectively, the Computational and Psychophysical (Detection and Identification) Target Saliency Values and Mean Search Time (N = 44)
Mean and max refer to, respectively, the mean and maximum saliency value over the target support. ** and *, respectively, indicate that the correlation is significant at the 0.01 or 0.05 level (1-tailed). Bold values represent overall maxima.
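The rank order correlations reported in Table 1 were computed with SPSS; the same computation can be reproduced with a standard statistics library. The following is a minimal sketch assuming SciPy, with an illustrative function name; the one-tailed p value is derived from the two-sided value for a hypothesized positive correlation.

import numpy as np
from scipy.stats import spearmanr

def rank_correlation(model_values, human_values):
    """Spearman rank order correlation with a one-tailed p value.

    model_values : per-image model saliency statistic (e.g., the maximum over
                   the target support) for the 44 Search_2 targets
    human_values : corresponding human conspicuity estimates or search times
    """
    rho, p_two_sided = spearmanr(model_values, human_values)
    # One-tailed p value for the directional hypothesis of a positive correlation.
    p_one_sided = p_two_sided / 2.0 if rho > 0 else 1.0 - p_two_sided / 2.0
    return rho, p_one_sided

# Variance explained as quoted in the text: a correlation of 0.848
# corresponds to rho**2 of about 0.72, i.e., roughly 72 percent.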
Computational saliency typically correlates more strongly with identification saliency than with detection saliency, except for the mean target saliency computed with Mancas' Local Rarity model [39] and Seo and Milanfar's SDSR model [56], [73]. A Wilcoxon signed rank test reveals that the Spearman's rank order correlations between the maximum model outputs over the target support and the human identification saliency estimates are significantly larger than the correlations between the maximum model outputs and the detection saliency estimates (Z = 2.481; p = 0.013), with a large effect size (r = 0.69).
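The paired comparison between the two sets of per-model correlations can be sketched with a Wilcoxon signed rank test. The effect size below uses the common convention r = Z/√N, with N the number of paired model correlations, which appears consistent with the reported values (e.g., 3.181/√13 ≈ 0.88); this convention and the helper function are our own assumptions, not taken from the paper.

import numpy as np
from scipy.stats import wilcoxon, norm

def paired_wilcoxon_effect_size(corr_a, corr_b):
    """Compare two paired sets of correlation values across models.

    corr_a, corr_b : 1D arrays of equal length, e.g., the per-model Spearman
                     correlations with identification and detection conspicuity.
    Returns the Wilcoxon two-sided p value and an effect size r = Z / sqrt(N).
    """
    corr_a = np.asarray(corr_a, dtype=float)
    corr_b = np.asarray(corr_b, dtype=float)
    stat, p = wilcoxon(corr_a, corr_b)   # paired test, two-sided by default
    z = norm.isf(p / 2.0)                # |Z| recovered via the normal approximation
    r = z / np.sqrt(len(corr_a))         # effect size convention r = Z / sqrt(N)
    return p, r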
4 DISCUSSION
For all computational saliency models except one (Rosin's EDS [43] model), we find that the maximum computed saliency value over the target support area correlates best with human detection saliency, identification saliency, and mean search time. Achanta's Frequency Tuned Saliency model [48] and our MCC metric yield the largest correlations between model output and all psychophysical estimates. We find that the mean value over the target support of the saliency computed with Mancas' Global Rarity [37], [38], [39], [40], [41], [42] is inversely related to the human saliency data and positively related to mean search time. This somewhat unexpected result is probably due to the fact that the human saliency measures used here inherently encode
local saliency, whereas the mean Global Rarity measure represents the overall image saliency of a given detail. We find that the maximum value of Mancas' Global Rarity computed over the target support shows a positive correlation with the human saliency data and a negative correlation with mean search time, as expected, and similar to Mancas' Local Rarity measure [39]. This is a result of the fact that the maximum values of Mancas' Global Rarity and Local Rarity are both dominated by local edge content. The underlying hypothesis of the Global Rarity model is that visual attention is attracted by features that are in the minority in a given image. In the present image set, it is evident that the target will not be located in rare areas like open spots or sky, so this hypothesis does not hold. This a priori or top-down knowledge may affect human search behavior such that areas containing many details similar to the target (rocks, shrubs) are inspected more intensively than relatively rare areas. As a result, the mean search time correlates more strongly with Local Rarity than with Global Rarity.

For Rosin's EDS [43] model, large mean output values correspond to low overall target saliency. Replacing the computation of the maximal saliency value over the target support by the minimal value is not an option, since this value is by definition equal to zero and thus provides no information about the relative target saliency. We therefore also computed the maximal saliency value in this case, since this measure reflects to some extent the internal target structure. Larger maximal saliency values over the target support typically correspond to targets with well-defined boundaries and less pronounced internal structure, whereas smaller values correspond to targets with more prominent internal structure. The results show that the mean EDS value is more strongly correlated with the human observer data than the maximal value. This suggests that targets with well-defined boundaries and little internal structure are more salient than targets containing more internal detail.

The highest overall correlations obtained in this study are between identification conspicuity and both 1) the maximum saliency value over the target support computed by Achanta's Frequency Tuned Saliency model [48] (r = 0.848, p < 0.01, 1-tailed), and 2) the MCC metric (r = 0.843, p < 0.01, 1-tailed). These methods both use a simple multiscale contrast approach to compute bottom-up saliency, and are therefore computationally efficient. In a previous study [17], we found that identification saliency correlates more strongly with mean search time (r = 0.897, p < 0.01, 1-tailed) than detection saliency (r = 0.802, p < 0.01, 1-tailed). The present finding that the output of most models correlates more strongly with identification saliency than with detection saliency implies that their output may be useful as input for bottom-up, saliency-driven visual search models. This is also reflected by the strong correlation between mean search time and maximum computed saliency over the target support observed in this study. The highest overall correlation between mean search time and computed saliency is obtained both for 1) the maximum saliency value computed by FTS over the target support area (r = 0.741, p < 0.01, 1-tailed), and 2) the MCC metric (r = 0.735, p < 0.01, 1-tailed). This means that these models explain about 55 percent of the variance in mean search time, while about 45 percent residual variance remains. In this sense, FTS and MCC
both outperform more complex visual search and detection models for predicting mean search time, like ORACLE [120], [121], Visdet [122], [123], and Travnikova [124], [125], [126], none of which explains more than 45 percent of the variance in the mean search time data [127]. Since FTS is computationally extremely simple, it may be a valuable alternative to the more complex search models.

A limitation of the present study is the small size of the Search_2 data set. However, it is currently the only available natural image database providing low-level bottom-up human saliency estimates that range over more than two decades (i.e., for targets ranging from hardly visible to very salient). The suitability of this data set for the evaluation of computational bottom-up saliency models is apparent from the statistical significance of the present results.

The models investigated in this study quantify bottom-up saliency through local image distinctness, either in a statistical sense (AIM, SUN, GR, SDSR, and Esaliency), in a local feature contrast paradigm (LR, ST, GBVS, SW, PQFT, FTS, and AWS), or as a measure of edginess (EDS). Hence, they all mimic to some degree the human visual system, which analyzes local feature differences in size, shape, luminance, color, texture, binocular disparity, and motion to achieve figure-ground segregation [92], [93], [128], [129], [130], [131], [132], [133]. It has indeed been shown that all models predict human fixation well above chance level [21], [31], [36], [39], [43], [45], [47], [48], [49], [56], [74], [111]. Here, we also find that several of these models (AIM, SUN, ST, GBVS, PQFT, AWS, and Esaliency) correlate more strongly with mean search time (fixation behavior) than with conspicuity. Thus, although they appear to model fixation behavior quite well, it is not clear to what degree they model visual conspicuity. A limitation of all models is that they do not represent important visual effects, like gaze-contingent filtering [134], lateral interaction [96], and crowding [95]. In addition, they represent saliency as a single topographic map, while there is mounting evidence that saliency has a hierarchical representation in the primate brain [135].

The resolution of the human visual system is highest in its center (fovea) and drops rapidly toward the periphery. Eye movements serve to direct the fovea to potentially interesting peripheral locations for closer inspection. Hence, human fixation behavior is inherently driven by eccentric vision and thus by low-resolution features [78], [136]. This may explain why human fixation correlates more strongly with luminance contrast for low-pass filtered images than for their full-resolution counterparts [78], [136]. It may also explain why several authors [21], [45], [47], [49], [56] previously found that their bottom-up saliency models (ST, GBVS, SW, PQFT, and SDSR) predict human performance better when the saliency maps are computed at lower levels of resolution (typically 64 × 64 pixels). The models investigated here do not explicitly represent the momentary eye position (fixation), and thus neglect the variation in visual resolution with retinal location. The resulting saliency maps can be regarded as the union of the saliency maps corresponding to all possible instantaneous fixation locations over the image area. To assess the influence of resolution on our present results, we repeated all calculations for input image sizes of, respectively, 3/4, 1/2,
and 1/4 of the original image size (1,536 × 1,024 pixels). We found that the Spearman rank order correlations varied by less than 10 percent. We therefore conclude that our present results are quite robust against resolution variations.

Human attentional resolution is dramatically lower than peripheral visual resolution [105], probably because visual saliency maps are mediated by parietal mechanisms with large receptive fields [82]. As a result, target saliency may be severely compromised in the presence of other visual details, even at separations up to 0.5 times the target eccentricity [137]. This effect, known as crowding [95], dominates peripheral resolution degradation in visual search [96]. However, crowding is not represented in any of the models investigated here. Several models (ST, GBVS, SW, PQFT, and SDSR) simulate low-resolution attentional maps by computing saliency maps at a very low resolution (about 60 × 60 pixels), independent of the size of the input image. The other models (AIM, SUN, GR, LR, EDS, FTS, AWS, and Esaliency) generate saliency maps at the resolution of the input images. The fact that attentional resolution dominates peripheral resolution degradation may explain why gaze-contingent filtering does not improve intensity-contrast-based saliency predictions [138]. Since the targets in the present study are primarily defined by their intensity contrast, it is unlikely that gaze-contingent image filtering would affect the present results. However, gaze-contingent filtering may become relevant when other data sets are used in which targets are defined by features like color, orientation, flicker, or motion.

It is currently not known how different feature dimensions contribute to the overall visual saliency of image details [139], [140]. As a result, most computational models rather arbitrarily adopt some kind of (weighted) average [44], [45], [47], [74], [75] or maximum operator [37], [38], [39], [40], [41], [42] over the individual feature channels to compute an overall scalar saliency value. However, it has been shown that dynamic weighting of individual feature maps [50] and the inclusion of nonlinear feature-specific interactions in these models [62], [140], [141] can significantly improve predictions of human performance. Hence, knowledge of the interactions between the mechanisms mediating human visual saliency is required for the further development of computational saliency models. This requires a systematic psychophysical investigation of the relative contribution of different feature dimensions to overall target saliency. Our psychophysical saliency measurement procedure may be an efficient tool for obtaining the required visual saliency estimates. Ultimately, a robust, objective, and computationally efficient bottom-up saliency algorithm that takes into account important visual effects like lateral masking and crowding (scene clutter) will be of great practical value, allowing, for instance, real-time on-site optimization of target saliency (e.g., for road layout, placement of warning or traffic signs, billboards, etc.) using mobile camera-equipped computing devices.
ACKNOWLEDGMENTS
The author wishes to thank Radhakrishna Achanta, Peng Bian, Neil D. Bruce, Pedro F. Felzenszwalb, Antón Garcia-Diaz, Chenlei Guo, Jonathan Harel, Daniel P. Huttenlocher,
Nicolas Le Bihan, Matei Mancas, Ruth Rosenholtz, Paul L. Rosin, Steve Sangwine, Eero P. Simoncelli, John K. Tsotsos, Dirk B. Walther, and Lingyun Zhang for sharing their software and for helpful suggestions, and Hae Jong Seo and Peyman Milanfar for calculating the saliency maps for his images with their model and for sharing their software. He is also indebted to the referees for their valuable suggestions and comments on earlier versions of this manuscript. This research was supported by NWO-VICI grant 639.023.705 “Color in Computer Vision.”
REFERENCES
[1] S.J. Dickinson et al., "Active Object Recognition Integrating Attention and Viewpoint Control," Computer Vision and Image Understanding, vol. 67, no. 3, pp. 239-260, 1997.
[2] C. Koch and S. Ullman, "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry," Human Neurobiology, vol. 4, no. 4, pp. 219-227, 1985.
[3] M. Land, N. Mennie, and J. Rusted, "The Roles of Vision and Eye Movements in the Control of Activities of Daily Living," Perception, vol. 28, no. 11, pp. 1311-1328, 1999.
[4] A. Yarbus, Eye Movements and Vision. Plenum Press, 1967.
[5] J.H. Fecteau and D.P. Munoz, "Salience, Relevance, and Firing: A Priority Map for Target Selection," Trends in Cognitive Sciences, vol. 10, no. 8, pp. 382-390, 2006.
[6] A.H. Wertheim, "Visual Conspicuity: A New Simple Standard, Its Reliability, Validity and Applicability," Ergonomics, vol. 53, no. 3, pp. 421-442, 2010.
[7] E. Niebur and C. Koch, "Computational Architectures for Attention," The Attentive Brain, R. Parasuraman, ed., pp. 163-186, MIT Press, 1998.
[8] D. Parkhurst, K. Law, and E. Niebur, "Modeling the Role of Salience in the Allocation of Overt Visual Attention," Vision Research, vol. 42, no. 1, pp. 107-123, 2002.
[9] B.W. Tatler, R.J. Baddeley, and I.D. Gilchrist, "Visual Correlates of Fixation Selection: Effects of Scale and Time," Vision Research, vol. 45, no. 5, pp. 643-659, 2005.
[10] G. Underwood and T. Foulsham, "Visual Saliency and Semantic Incongruency Influence Eye Movements When Inspecting Pictures," The Quarterly J. Experimental Psychology, vol. 59, no. 11, pp. 1931-1949, 2006.
[11] L. Itti, "Quantifying the Contribution of Low-Level Saliency to Human Eye Movements in Dynamic Scenes," Visual Cognition, vol. 12, no. 6, pp. 1093-1123, 2005.
[12] L. Elazary and L. Itti, "Interesting Objects Are Visually Salient," J. Vision, vol. 8, no. 3, pp. 1-15, 2009.
[13] F.L. Engel, "Visual Conspicuity and Selective Background Interference in Eccentric Vision," Vision Research, vol. 14, no. 7, pp. 459-471, 1974.
[14] F.L. Engel, "Visual Conspicuity, Visual Search and Fixation Tendencies of the Eye," Vision Research, vol. 17, no. 1, pp. 95-108, 1977.
[15] F.L. Engel, "Visual Conspicuity, Directed Attention and Retinal Locus," Vision Research, vol. 11, no. 6, pp. 563-575, 1971.
[16] W.S. Geisler and K.-L. Chou, "Separation of Low-Level and High-Level Factors in Complex Tasks: Visual Search," Psychological Rev., vol. 102, no. 2, pp. 356-378, 1995.
[17] A. Toet et al., "Visual Conspicuity Determines Human Target Acquisition Performance," Optical Eng., vol. 37, no. 7, pp. 1969-1975, 1998.
[18] A. Toet and P. Bijl, "Visual Conspicuity," Encyclopedia of Optical Eng., R.G. Driggers, ed., pp. 2929-2935, Marcel Dekker Inc., 2003.
[19] Y.-F. Ma and H.-J. Zhang, "Contrast-Based Image Attention Analysis by Using Fuzzy Growing," Proc. 11th ACM Int'l Conf. Multimedia, pp. 374-381, 2003.
[20] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 Still Image Coding System: An Overview," IEEE Trans. Consumer Electronics, vol. 46, no. 4, pp. 1103-1127, Nov. 2000.
[21] C. Guo and L. Zhang, "A Novel Multiresolution Spatiotemporal Saliency Detection Model and Its Applications in Image and Video Compression," IEEE Trans. Image Processing, vol. 19, no. 1, pp. 185-198, Jan. 2010.
[22] S. Marat, M. Guironnet, and D. Pellerin, “Video Summarization Using a Visual Attentional Model,” Proc. 15th European Signal Processing Conf., pp. 1784-1788, 2007. [23] R. Rodriguez-Sa´nchez et al., “The Relationship between Information Prioritization and Visual Distinctness in Two Progressive Image Transmission Schemes,” Pattern Recognition, vol. 37, no. 2, pp. 281-297, 2004. [24] B.C. Ko and J.-Y. Nam, “Object-of-Interest Image Segmentation Based on Human Attention and Semantic Region Clustering,” J. Optical Soc. of Am. A, vol. 23, no. 10, pp. 2462-2470, 2006. [25] J. Han et al., “Unsupervised Extraction of Visual Attention Objects in Color Images,” IEEE Trans. Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 141-145, Jan. 2006. [26] Q. Ma and L. Zhang, “Saliency-Based Image Quality Assessment Criterion,” Proc. Fourth Int’l Conf. Intelligent Computing: Advanced Intelligent Computing Theories and Applications—with Aspects of Theoretical and Methodological Issues, pp. 1124-1133, 2008. [27] A. Niassi et al., “Does Where You Gaze on an Image Affect Your Perception of Quality? Applying Visual Attention to Image Quality Metric,” Proc. IEEE Int’l Conf. Image Processing, vol. 2, pp. 169-172, 2007. [28] Q. Ma, L. Zhang, and B. Wang, “New Strategy for Image and Video Quality Assessment,” J. Electronic Imaging, vol. 19, pp. 1-14, 2010. [29] D. Walther et al., “Selective Visual Attention Enables Learning and Recognition of Multiple Objects in Cluttered Scenes,” Computer Vision and Image Understanding, vol. 100, nos. 1/2, pp. 41-63, 2005. [30] S. Avidan and A. Shamir, “Seam Carving for Content-Aware Image Resizing,” ACM Trans. Graphics, vol. 26, no. 3, pp. 1-9, 2007. [31] N.D.B. Bruce and J.K. Tsotsos, “Saliency, Attention, and Visual Search: An Information Theoretic Approach,” J. Vision, vol. 9, no. 3, pp. 1-24, 2009. [32] N.D.B. Bruce and J.K. Tsotsos, “Saliency Based on Information Maximization,” Advances in Neural Information Processing Systems, Y. Weiss, B. Scho¨lkopf, and J. Platt, eds., vol. 18, pp. 155-162, MIT Press, 2009. [33] N.D.B. Bruce and J.K. Tsotsos, “An Information Theoretic Model of Saliency and Visual Search,” Proc. Fourth Int’l Workshop Attention in Cognitive Systems: Attention in Cognitive Systems. Theories and Systems from an Interdisciplinary Viewpoint, L. Paletta and E. Rome, eds., pp. 171-183, 2008. [34] N.D.B. Bruce, “Features That Draw Visual Attention: An Information Theoretic Perspective,” Neurocomputing, vols. 65/66, pp. 125-133, 2005. [35] C. Kanan et al., “SUN: Top-Down Saliency Using Natural Statistics,” Visual Cognition, vol. 17, nos. 6/7, pp. 979-1003, 2009. [36] L. Zhang et al., “SUN: A Bayesian Framework for Saliency Using Natural Statistics,” J. Vision, vol. 8, no. 7, pp. 1-20, 2008. [37] M. Mancas et al., “A Rarity-Based Visual Attention MapApplication to Texture Description,” Proc. IEEE Int’l Conf. Image Processing, pp. 445-448, 2006. [38] M. Mancas, B. Gosselin, and B. Macq, “Perceptual Image Representation,” J. Image and Video Processing, vol. 2007, pp. 1-9, 2007. [39] M. Mancas, Computational Attention: Towards Attentive Computers. Presses Universitaires de Louvain, 2007. [40] M. Mancas et al., “Computational Attention for Event Detection,” Proc. Fifth Int’l Conf. Computer Vision Systems, 2007. [41] M. Mancas, “Image Perception: Relative Influence of Bottom-Up and Top-Down Attention,” Proc. Int’l Workshop Attention and Performance in Computer Vision in conjunction with the Int’l Conf. Computer Vision Systems, pp. 94-106, 2008. [42] M. 
Mancas et al., “Tracking-Dependent and Interactive Video Projection,” Proc. Fourth Int’l Summer Workshop Multi-Modal Interfaces, pp. 73-83, 2009. [43] P.L. Rosin, “A Simple Method for Detecting Salient Regions,” Pattern Recognition, vol. 42, no. 11, pp. 2363-2371, 2009. [44] D. Walther, “Interactions of Visual Attention and Object Recognition: Computational Modeling, Algorithms, and Psychophysics,” PhD thesis, California Inst. of Technology, 2006. [45] D. Walther and C. Koch, “Modeling Attention to Salient ProtoObjects,” Neural Networks, vol. 19, no. 9, pp. 1395-1407, 2006. [46] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[47] P. Bian and L. Zhang, “Biological Plausibility of Spectral Domain Approach for Spatiotemporal Visual Saliency,” Advances in NeuroInformation Processing, pp. 251-258, Springer, 2009. [48] R. Achanta et al., “Frequency-Tuned Salient Region Detection,” Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, June 2009. [49] J. Harel, C. Koch, and P. Perona, “Graph-Based Visual Saliency,” Advances in Neural Information Processing Systems, B. Scho¨lkopf, J. Platt, and T. Hoffman, eds., vol. 19, pp. 545-552, MIT Press, 2007. [50] Y. Hu et al., “Salient Region Detection Using Weighted Feature Maps Based on the Human Visual Attention Model,” Proc. Fifth Pacific Rim Conf. Multimedia, pp. 993-1000, 2005. [51] X. Hou and L. Zhang, “Saliency Detection: A Spectral Residual Approach,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007. [52] R. Achanta et al., “Salient Region Detection and Segmentation,” Proc. Int’l Conf. Computer Vision Systems, pp. 66-75, 2008. [53] L. Itti and C. Koch, “Comparison of Feature Combination Strategies for Saliency-Based Visual Attention Systems,” Proc. SPIE Conf. Human Vision and Electronic Imaging IV, B.E. Rogowitz and N.P. Thrasyvoulos, eds., pp. 473-482, 1999. [54] S. Frintrop, M. Klodt, and E. Rome, “A Real-Time Visual Attention System Using Integral Images,” Proc. Fifth Int’l Conf. Computer Vision Systems, 2007. [55] K. Rapantzikos et al., “Spatiotemporal Saliency for Video Classification,” Signal Processing: Image Comm., vol. 24, no. 7, pp. 557-571, 2009. [56] H.J. Seo and P. Milanfar, “Static and Space-Time Visual Saliency Detection by Self-Resemblance,” J. Vision, vol. 9, nos. 12-15, pp. 127, 2009. [57] C. Guo and L. Zhang, “Spatio-Temporal Saliency Detection Using Phase Spectrum of Quaternion Fourier Transform,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008. [58] H. Shi and Y. Yang, “A Computational Model of Visual Attention Based on Saliency Maps,” Applied Math. and Computation, vol. 188, no. 2, pp. 1671-1677, 2007. [59] D. Gao, V. Mahadevan, and N. Vasconcelos, “On the Plausibility of the Discriminant Center-Surround Hypothesis for Visual Saliency,” J. Vision, vol. 8, no. 7, pp. 1-18, 2008. [60] V.K. Singh, S. Maji, and A. Mukerjee, “Confidence Based Updation of Motion Conspicuity in Dynamic Scenes,” Proc. Third Canadian Conf. Computer and Robot Vision, pp. 13-20, 2006. [61] G. Harding and M. Bloj, “Real and Predicted Influence of Image Manipulations on Eye Movements during Scene Recognition,” J. Vision, vol. 10, nos. 2-8. pp. 1-17, 2010. [62] O. Le Meur et al., “A Coherent Computational Approach to Model Bottom-Up Visual Attention,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 802-817, May 2006. [63] L. Huang and H. Pashler, “Quantifying Object Salience by Equating Distractor Effects,” Vision Research, vol. 45, no. 14, pp. 1909-1920, 2005. [64] W. van Zoest and M. Donk, “Bottom-Up and Top-Down Control in Visual Search,” Perception, vol. 33, no. 8, pp. 927-937, 2004. [65] W. van Zoest and M. Donk, “The Effects of Salience on Saccadic Target Selection,” Visual Cognition, vol. 12, no. 2, pp. 353-375, 2005. [66] L. Itti and C. Koch, “A Saliency-Based Search Mechanism for Overt and Covert Shifts of Visual Attention,” Vision Research, vol. 40, nos. 10-12, pp. 1489-1506, 2000. [67] L. Itti, C. Gold, and C. Koch, “Visual Attention and Target Detection in Cluttered Natural Scenes,” Optical Eng., vol. 40, no. 9, pp. 1784-1793, 2001. [68] A.D. 
Clarke et al., “Visual Search for a Target against a 1/f(Beta) Continuous Textured Background,” Vision Research, vol. 48, no. 21, pp. 2193-2203, 2008. [69] C.M. Masciocchi et al., “Everyone Knows What Is Interesting: Salient Locations Which Should Be Fixated,” J. Vision, vol. 9, no. 11, pp. 1-22, 2009. [70] T. Liu et al., “Learning to Detect a Salient Object,” Proc. IEEE CS Conf. Computer and Vision Pattern Recognition, pp. 1-8, 2007. [71] N.D.B. Bruce, D.P. Loach, and J.K. Tsotsos, “Visual Correlates of Fixation Selection: A Look at the Spatial Frequency Domain,” Proc. IEEE Int’l Conf. Image Processing, vol. 3, pp. III-289-III-292, 2009. [72] D. Gao, V. Mahadevan, and N. Vasconcelos, “The Discriminant Center-Surround Hypothesis for Bottom-Up Saliency,” Proc. Neural Information Processing Systems, pp. 497-504, 2007.
[73] H.J. Seo and P. Milanfar, “Nonparametric Bottom-Up Saliency Detection by Self-Resemblance,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, First Int’l Workshop Visual Scene Understanding, 2009. [74] A. Garcia-Diaz et al., “Saliency Based on Decorrelation and Distinctiveness of Local Responses,” Proc. 13th Int’l Conf. Computer Analysis of Images and Patterns, pp. 261-268, 2009. [75] A. Garcia-Diaz et al., “Decorrelation and Distinctiveness Provide with Human-Like Saliency,” Proc. 11th Int’l Conf. Advanced Concepts for Intelligent Vision Systems, J. Blanc-Talon et al., eds., pp. 343-354, 2009. [76] W. Einha¨user, M. Spain, and P. Perona, “Objects Predict Fixations Better Than Early Saliency,” J. Vision, vol. 8, no. 14, pp. 1-26, 2008. [77] R.J. Baddeley and B.W. Tatler, “High Frequency Edges (But Not Contrast) Predict Where We Fixate: A Bayesian System Identification Analysis,” Vision Research, vol. 46, no. 18, pp. 2824-2833, 2006. [78] A. Ac¸ik et al., “Effects of Luminance Contrast and Its Modifications on Fixation Behavior during Free Viewing of Images from Different Categories,” Vision Research, vol. 49, no. 12, pp. 15411553, 2009. [79] R. Carmi and L. Itti, “Visual Causes versus Correlates of Attentional Selection in Dynamic Scenes,” Vision Research, vol. 46, no. 26. pp. 4333-4345, 2006. [80] W. Einha¨user, U. Rutishauer, and C. Koch, “Task-Demands Can Immediately Reverse the Effects of Sensory-Driven Saliency in Complex Visual Stimuli,” J. Vision, vol. 8, no. 2, pp. 1-19, 2008. [81] T. Foulsham and G. Underwood, “How Does the Purpose of Inspection Influence the Potency of Visual Salience in Scene Perception?” Perception, vol. 36, no. 8, pp. 1123-1138, 2007. [82] C. Roggeman, W. Fias, and T. Verguts, “Salience Maps in Parietal Cortex: Imaging and Computational Modeling,” Neuroimage, vol. 52, no. 3, pp. 1005-1014, 2010. [83] X. Chen and G.J. Zelinsky, “Real-World Visual Search is Dominated by Top-Down Guidance,” Vision Research, vol. 46, no. 24, pp. 4118-4133, 2006. [84] M. Nystro¨m and J. Holmquist, “Semantic Override of Low-Level Features in Image Viewing—Both Initially and Overall,” J. Eye Movement Research, vol. 2, pp. 1-11, 2008. [85] J.A. Stirk and G. Underwood, “Low-Level Visual Saliency Does not Predict Change Detection in Natural Scenes,” J. Vision, vol. 7, no. 10, pp. 1-10, 2007. [86] G. Underwood et al., “Is Attention Necessary for Object Identification? Evidence from Eye Movements during the Inspection of Real-World Scenes,” Consciousness and Cognition, vol. 17, no. 1, pp. 159-170, 2008. [87] J.M. Henderson et al., “Visual Saliency Does Not Account for Eye Movements during Visual Search in Real-World Scenes,” Eye Movements: A Window on the Mind and Brain, R.P.G. van Gompel et al., eds., pp. 537-562, Elsevier, 2007. [88] T. Foulsham and G. Underwood, “What Can Saliency Models Predict about Eye Movements? Spatial and Sequential Aspects of Fixations during Encoding and Recognition,” J. Vision, vol. 8, no. 2, pp. 6-17, 2008. [89] G. Underwood, T. Foulsham, and K. Humphrey, “Saliency and Scan Patterns in the Inspection of Real-World Scenes: Eye Movements during Encoding and Recognition,” Visual Cognition, vol. 17, nos. 6/7, pp. 812-834, 2009. [90] J.R. Bloomfield, “Visual Search in Complex Fields: Size Differences between Target Disc and Surrounding Discs,” Human Factors, vol. 14, no. 2, pp. 139-148, 1972. [91] Y.M. Bowler, “Towards a Simplified Model of Visual Search,” Visual Search, D. Brogan, ed., pp. 303-309, Taylor & Francis, 1990. [92] B.L. Cole and S.E. 
Jenkins, “The Effect of Variability of Background Elements on the Conspicuity of Objects,” Vision Research, vol. 24, no. 3, pp. 261-270, 1984. [93] S.E. Jenkins and B.L. Cole, “The Effect of the Density of Background Elements on the Conspicuity of Objects,” Vision Research, vol. 22, no. 10, pp. 1241-1252, 1982. [94] F.J.A.M. Poirier, F. Gosselin, and M. Arguin, “Perceptive Fields of Saliency,” J. Vision, vol. 8, no. 15, pp. 1-19, 2008. [95] D.M. Levi, “Crowding—An Essential Bottleneck for Object Recognition: A Mini-Review,” Vision Research, vol. 48, no. 5, pp. 635-654, 2008. [96] A.H. Wertheim et al., “How Important Is Lateral Masking in Visual Search?” Experimental Brain Research, vol. 170, no. 3, pp. 387-402, 2006.
[97] S. Straube and M. Fahle, “Visual Detection and Identification are Not the Same: Evidence from Psychophysics and fMRI,” Brain and Cognition, vol. 75, pp. 29-38, 2011. [98] C.E. Shannon, “A Mathematical Theory of Communication,” The Bell Systems Technical J., vol. 27, pp. 93-154, 1948. [99] T.-W. Lee, M. Girolami, and T.J. Sejnowski, “Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources,” Neural Computation, vol. 11, no. 2, pp. 417-441, 1999. [100] J.F. Cardoso, “High-Order Contrasts for Independent Component Analysis,” Neural Computing, vol. 11, no. 1, pp. 157-192, 1999. [101] N.D.B. Bruce, “Image Analysis through Local Information Measures,” Proc. 17th Int’l Conf. Pattern Recognition, J. Kittler, M. Petrou, and M.S. Nixon, eds., vol. 1, pp. 616-619, 2004. [102] W.K. Pratt, Digital Image Processing, second ed., Wiley, 1991. [103] G. Borgefors, “Distance Transformations in Digital Images,” Computer Vision, Graphics, and Image Processing, vol. 34, no. 3, pp. 344-371, 1986. [104] L. Itti and C. Koch, “Feature Combination Strategies for SaliencyBased Visual Attention Systems,” J. Electronic Imaging, vol. 10, no. 1, pp. 161-169, 2001. [105] J. Intriligator and P. Cavanagh, “The Spatial Resolution of Visual Attention,” Cognitive Psychology, vol. 43, no. 3, pp. 171-216, 2001. [106] T.A. Ell and S.J. Sangwine, “Hypercomplex Fourier Transforms of Color Images,” IEEE Trans. Image Processing, vol. 16, no. 1, pp. 2235, Jan. 2007. [107] A.S. Ecker et al., “Decorrelated Neuronal Firing in Cortical Microcircuits,” Science, vol. 327, no. 5965, pp. 584-587, 2010. [108] D.J. Field, “Relations between the Statistics of Natural Images and the Response Properties of Cortical Cells,” J. Optical Soc. of Am. A, vol. 4, no. 12, pp. 2379-2394, 1987. [109] M.C. Morrone and D.C. Burr, “Feature Detection in Human Vision: A Phase-Dependent Energy Model,” Proc. Royal Soc. of London B, vol. 235, no. 1280, pp. 221-245, 1988. [110] P. Kovesi, “Image Features from Phase Congruency,” Videre: J. Computer Vision Research, vol. 1, no. 3, pp. 2-26, 1999. [111] T. Avraham and M. Lindenbaum, “Esaliency (Extended Saliency): Meaningful Attention Using Stochastic Image Modeling,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 4, pp. 693-708, Apr. 2010. [112] D. Nilsson, “An Efficient Algorithm for Finding the M Most Probable Configurations in Probabilistic Expert Systems,” Statistics and Computing, vol. 8, no. 2, pp. 159-173, 1998. [113] F. Liu and M. Gleicher, “Region Enhanced Scale-Invariant Saliency Detection,” Proc. IEEE Int’l Conf. Multimedia and Expo, pp. 1477-1480, 2006. [114] T. Kadir and M. Brady, “Saliency, Scale and Image Description,” Int’l J. Computer Vision, vol. 45, no. 2, pp. 83-105, 2001. [115] J. Serra, Image Analysis and Mathmatical Morphology. Academic Press, 1982. [116] R. Lotufo and F. Zampirolli, “Fast Multidimensional Parallel Euclidean Distance Transform Based on Mathematical Morphology,” Proc. 14th Brazilian Symp. Computer Graphics and Image Processing, T. Wu and D. Borges, eds., pp. 100-105, 2001. [117] D. Gao and N. Vasconcelos, “Decision-Theoretic Saliency: Computational Principles, Biological Plausibility, and Implications for Neurophysiology and Psychophysics,” Neural Computation, vol. 21, no. 1, pp. 239-271, 2009. [118] S. Han and N. Vasconcelos, “Biologically Plausible Saliency Mechanisms Improve Feedforward Object Recognition,” Vision Research, vol. 50, no. 22, pp. 2295-2307, 2010. [119] A. Toet, P. Bijl, and J.M. 
Valeton, “Image Data Set for Testing Search and Detection Models,” Optical Eng., vol. 40, no. 9, pp. 1760-1767, 2001. [120] K.J. Cooke, P.A. Stanley, and J.L. Hinton, “The ORACLE Approach to Target Acquisition and Search Modelling,” Vision Models for Target Detection and Recognition, E. Peli, ed., pp. 135-171, World Scientific, 1995. [121] K.J. Cooke, “The ORACLE Handbook,” BAe SRC JS12020, British Aerospace, Sowerby Research Center, 1992. [122] G. Waldman and J. Wootton, Electro-Optical Systems Performance Modeling. Artech House, 1993. [123] G. Waldman, J. Wootton, and G. Hobson, “Visual Detection with Search: An Empirical Model,” IEEE Trans. Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 596-606, May/June 1991. [124] N.P. Travnikova, “Efficiency of Visual Search,” Mashinostroyeniye, Moscow, USSR, 1984.
[125] N.P. Travnikova, “Search Efficiency of Telescopic Instruments,” Soviet J. Optical Technology, vol. 51, no. 2, pp. 63-66, 1984, [126] N.P. Travnikova, “Visual Detection of Extended Objects,” Soviet J. Optical Technology, vol. 44, no. 3, pp. 166-169, 1977. [127] A. Toet, P. Bijl, and J.M. Valeton, “Test of Three Visual Search and Detection Models,” Optical Eng., vol. 39, no. 5, pp. 1344-1353, 2000. [128] W.F. Alkhateeb, R.J. Morris, and K.H. Rudock, “Effects of Stimulus Complexity on Simple Spatial Discriminations,” Spatial Vision, vol. 5, no. 2, pp. 129-141, 1990. [129] H.C. Nothdurft, “Texture Segregation and Pop-Out from Orientation Contrast,” Vision Research, vol. 31, no. 6, pp. 1073-1078, 1991. [130] H.C. Nothdurft, “Feature Analysis and the Role of Similarity in Preattentive Vision,” Perception & Psychophysics, vol. 52, no. 4, pp. 355-375, 1992. [131] H.C. Nothdurft, “Saliency Effects across Dimensions in Visual Search,” Vision Research, vol. 33, nos. 5/6, pp. 839-844, 1993. [132] H.C. Nothdurft, “The Role of Features in Preattentive Vision: Comparison of Orientation, Motion and Color Cues,” Vision Research, vol. 33, no. 14, pp. 1937-1958, 1993. [133] H.C. Nothdurft, “The Conspicuousness of Orientation and Motion Contrast,” Spatial Vision, vol. 7, no. 4, pp. 341-363, 1993. [134] M. Pomplun et al., “Peripheral and Parafoveal Cueing and Masking Effects on Saccadic Selectivity in a Gaze-Contingent Window Paradigm,” Vision Research, vol. 41, no. 21, pp. 2757-2769, 2001. [135] K.G. Thompson and N.P. Bichot, “A Visual Salience Map in the Primate Frontal Eye Field,” Progress in Brain Research, vol. 147, pp. 249-262, 2005. [136] U. Rajashekar et al., “Foveated Analysis of Image Features at Fixations,” Vision Research, vol. 47, no. 25, pp. 3160-3172, 2007. [137] A. Toet and D.M. Levi, “The Two-Dimensional Shape of Spatial Interaction Zones in the Parafovea,” Vision Research, vol. 32, no. 7, pp. 1349-1357, 1992. [138] L. Itti, “Quantitative Modelling of Perceptual Salience at Human Eye Position,” Visual Cognition, vol. 14, nos. 4-8, pp. 959-984, 2006. [139] S. Straube and M. Fahle, “The Electrophysiological Correlate of Saliency: Evidence from a Figure-Detection Task,” Brain Research, vol. 1307, no. 1, pp. 89-102, 2010. [140] R.J. Peters et al., “Components of Bottom-Up Gaze Allocation in Natural Image,” Vision Research, vol. 45, no. 18, pp. 2397-2416, 2005. [141] A.R. Koene and L. Zhaoping, “Feature-Specific Interactions in Salience from Combined Feature Contrasts: Evidence for a Bottom-Up Saliency Map in V1,” J. Vision, vol. 7, no. 7, pp. 1-14, 2007. Alexander Toet received the PhD degree in physics from the University of Utrecht, The Netherlands, in 1987, where he worked on visual perception and image processing. He is currently employed at the Intelligent System Laboratory Amsterdam (University of Amsterdam), where he investigates the visual saliency and affective appraisal of dynamic textures, and at TNO Defense, Safety and Security in Soesterberg, where he investigates multimodal image fusion, image quality, computational models of human visual search and detection, and the quantification of visual target distinctness. Recently, he started investigating cross-modal perceptual interactions between the visual, auditory, tactile, and olfactory senses. He has published more than 50 papers in refereed journals and 60 papers in refereed conference proceedings. 
He was coeditor of a book on the mathematical description of shape in images, and was a guest editor of the special issue of Optical Engineering on advances in target acquisition modeling II. He organized and directed three international workshops, on search and target acquisition, on combinatorial algorithms for military applications, and on the mathematical description of shape in images. He is a senior member of the IEEE and the International Society for Optical Engineering (SPIE), and is a member of The Institution of Electrical Engineers (IEE) and the Dutch Association for Pattern Recognition (DAPR).