Signal Processing: Image Communication 39 (2015) 386–404
Perceived interest and overt visual attention in natural images

Ulrich Engelke a,*, Patrick Le Callet b

a Commonwealth Scientific and Industrial Research Organisation (CSIRO), 7005 Hobart, Australia
b LUNAM Université, Université de Nantes, IRCCyN UMR CNRS 6597 (Institut de Recherche en Communications et Cybernétique de Nantes), Polytech Nantes, France

* Corresponding author. Tel.: +61 3 6237 5650. E-mail address: [email protected] (U. Engelke).
Available online 23 March 2015

Abstract
Region of interest (ROI) based image and video processing has attracted increased research efforts in recent years. The concept of perceptual ROI, however, is not always clearly defined, leading to different interpretations between researchers related to bottom-up saliency (signal-driven visual attention), top-down attention (subconscious, driven by higher cognitive factors, e.g. interest), or perceived interest. While all of these concepts are likely meaningful in the context of perceptual ROI based image and video processing, it is worth understanding how they are linked to one another. In this paper, the relationship between perceived interest and overt visual attention (which can cover both bottom-up and top-down attention) is studied. Towards this goal, a dedicated ROI selection experiment was performed and is analysed in detail, revealing deep insight into perceived interest in natural images. The outcomes are compared to an eye gaze tracking experiment representing overt visual attention in natural images. It is shown that there is indeed a strong relationship between perceived interest and overt visual attention for a wide range of natural scenes. We show that the relationship depends strongly on image content and on the presentation time during the eye gaze tracking experiment. Furthermore, eye gaze tracking data is revealed to have a high predictive value for primary ROI, particularly when the primary ROI dominates the remainder of the image. Both the ROI and the eye gaze tracking databases are made publicly available to the research community.
Keywords: Overt visual attention; Perceived interest; Region-of-interest; Eye gaze tracking; Image processing; Video processing
1. Introduction

Digital image and video processing systems are typically designed as a trade-off between technical constraints and the quality of the output. To satisfy these constraints, technical parameters are chosen so as to optimise the system performance with respect to, for instance, the perceived visual quality or the intelligibility of the image or video. Similar to the technical limitations of these systems, our human visual system (HVS) also exhibits strong limitations regarding the computational complexity and the bandwidth with which we process visual information. The amount of
visual information available at any instant in time, however, is vast and goes beyond the processing capabilities of the HVS. Therefore, several mechanisms are in place that allow us to focus on only the most relevant information in any given context. First and foremost, these mechanisms consist of preprocessing in the early visual system and higher-order cognitive processes in the later stages of the HVS. The former is realised as non-uniform sampling in the retina of the human eye, which allows information to be processed with high accuracy only in the central point of focus, the fovea. The latter mechanism is referred to as visual attention [1] and shifts the focus of attention across the visual scene to the most salient or interesting locations. Together, these mechanisms ensure that, in any given context, the most relevant information is constantly favoured at the cost of less relevant information [2].
The fact that not all visual information is equally relevant to the observer has been found to be an instrumental tool for further optimisation of image and video processing systems [3,4]. Typically, these systems first identify regions-of-interest (ROI) in the content and subsequently utilise these to improve system performance. In ROI-based image and video coding [5,6], for instance, the quality of perceptually relevant regions is prioritised over the quality of less relevant regions through non-uniform bit allocation, thus aiming to improve the overall perceived visual quality. In a communication context, ROI-based error resilience [7] takes into account the visual content in addition to the bit stream relevance in order to maximise protection of perceptually relevant information. In image retargeting [8,9], ROI information is used instead of, or in addition to, energy measures in the seam carving algorithm to identify the most relevant parts of the scene. In image retrieval [10,11], ROI information serves to improve database queries by taking into account the relevant and suppressing the irrelevant visual information. Finally, in quality assessment [12,13], ROI information is integrated into quality models to take into account the impact of distortions in relation to the relevance of the content. It has been found that ROI information is often beneficial in improving overall system performance for these and likely other image and video processing applications.

The perceptual relevance of the visual content is typically determined using computational models that are trained and validated using eye gaze tracking data [14]. Amongst the most common models are the biologically plausible models [15,16] following the neural-based architecture by Koch and Ullman [17], which incorporate characteristics of the HVS known to contribute to visual attention. Recently, there has also been a strong trend towards content-based models [18-21], which often incorporate high-level semantic factors in addition to low-level features. Other models have been proposed based on statistical [22,23], information theoretic [24], or learning-based [25] approaches. One class of models that has attracted particular interest is based on Bayesian methods [26-28]. In addition to saliency information, these models often incorporate prior information related to contextual effects and semantic information [27].

Despite their common goal of identifying the most relevant information in a visual scene, the type of relevance information that is predicted by the above models can be very different [3]. Some of the models focus on the prediction of saliency-driven attention locations [15,16,24], whereas others aim at predicting ROI at an object level [18,19]. The former relates to visual locations that stand out in relation to the remainder of the scene with respect to some low-level features. Saliency-driven bottom-up attention is usually fast and involuntarily controlled (exogenous attention). On the other hand, decisions on ROIs are strongly driven by top-down attention mechanisms that usually involve voluntary control of the gaze shift (endogenous attention) and are strongly influenced by context and semantic information. The ground truth upon which the success of saliency and ROI prediction models is validated is typically based on overt visual attention measured through eye gaze tracking experiments.
The recorded gaze patterns, however, do not allow for a clear distinction between the various attentional mechanisms, as they account for both bottom-up and top-down driven visual attention. They further do not allow direct insight into which objects or regions are perceived to be most interesting in the visual scene. Understanding the perceived interest in a scene, however, is instrumental for successful augmentation of image and video processing systems. We argue that dedicated experiments are needed to determine perceived interest in a visual scene.

In this paper, we present a novel study to contribute to a better understanding of overt visual attention and perceived interest in natural scenes. The goals of this study are twofold. Firstly, we aim to quantify the relationship between overt visual attention and perceived interest for a wide range of natural images. This allows us to identify whether or not the locations that humans look at are necessarily also the most interesting ones. Secondly, we determine the success with which gaze patterns can predict ROI, which allows deeper insight into the validity of using eye gaze tracking data for the creation of ROI maps and as a ground truth for the design of ROI prediction models.

In order to address these issues, we conducted two dedicated experiments, an eye gaze tracking experiment and an ROI selection experiment, to measure, respectively, the overt visual attention and the perceived interest of a number of observers when viewing natural scenes. The focus here is on the ROI selection experiment, as it describes a rather unconventional approach in comparison to the more commonly performed eye gaze tracking experiments. We therefore provide an extensive discussion and a detailed analysis of the ROI experiment to identify the perceived interest of human observers in natural image content. We then analyse and discuss the outcomes of the ROI experiment with regard to the results from the eye gaze tracking experiment. Furthermore, we classify the ROI maps into primary ROI, secondary ROI, and background, to determine which of these are best predicted by eye gaze tracking data. We are particularly interested in two factors: the image content and the presentation time during the eye gaze tracking experiment.

The remainder of the paper is structured as follows. In Section 2 we discuss in more detail some of the conceptual differences between overt visual attention and perceived interest and the related experiment methodologies: eye gaze tracking and ROI selection. Section 3 briefly summarises the eye gaze tracking experiment and Section 4 discusses and analyses in more detail the ROI selection experiment that we conducted. The relationship between the eye gaze tracking data and the ROI selections is then analysed in detail in Section 5 and the success of using eye gaze tracking data for the prediction of ROI is evaluated in Section 6. Finally, the results are summarised and conclusions are drawn in Section 7. An overview of all images used and created in this work is presented in Appendix A.

2. Overt visual attention versus perceived interest

In this work we study the relationship between overt visual attention and perceived interest based on experimental data. While overt visual attention is commonly measured using eye gaze tracking, measuring perceived interest based on an ROI selection experiment is a rather novel and unconventional approach. In the following, we
therefore elaborate on the conceptual and methodological differences between eye gaze tracking and ROI selections. The discussion focuses on our particular case of hand-labelled ROI; we realise that there may be other means of selecting ROI, which are out of the scope of this discussion. We will also highlight related work to put our contribution into the context of existing research.

2.1. Conceptual and methodological aspects

Several processes are thought to be involved in making the decision for an ROI, including attending to and selecting a number of candidate visual locations, recognising the identity and a number of properties of each candidate, and finally evaluating these against intentions and preferences, in order to judge whether or not an object or a region is interesting [29]. We would like to note that in the context of this work, the term ROI synonymously refers to an object or a region within the visual scene that is of particular interest to the observer. The eye movements contributing to this selection process are all part of the gaze patterns recorded through eye gaze tracking. As such, there exist some fundamental differences between gaze patterns recorded from eye gaze tracking and ROI that are consciously selected by an observer. In addition to the differences in cognitive processes, there are also methodological differences during the experimental collection of the data. Some of the major cognitive and methodological differences are summarised in Table 1 and are discussed in the following.

Probably the most important difference between eye movement recordings and ROI selections is related to the cognitive functions they account for. We distinguish between three attentional processes:
- Bottom-up attention: exogenous process, mainly based on signal-driven visual attention; very fast, involuntary, task-independent.
- Top-down attention: endogenous process, driven by higher cognitive factors (e.g. interest); slower, voluntary, task-dependent, mainly subconscious.
- Perceived interest: strongly related to endogenous top-down attention, but involving conscious decision making about interest in a scene.
Eye gaze tracking data is strongly driven by both bottom-up and top-down attention, whereas ROI selections can be assumed to be mainly based on top-down attention and especially perceived interest. This is motivated by the fact that the observer consciously selects the ROI given a particular task. The gaze patterns therefore do not provide
direct knowledge about the perceived interest of human observers, as the gaze is strongly driven by a number of other influencing factors, including low-level features, context, and semantic information. The motivation of observers to look at particular locations in the visual scene can therefore not be clearly determined. The ROI selections facilitate a more direct insight into the perceived interest of the observers, as they represent the final selections and suppress the 'noisy' information involved in the selection process, as is contained in gaze patterns.

Another difference relates to the spatial coverage and the different levels of the relevance information. Here, the gaze patterns provide spatial locations whereas the ROI selections determine particular regions or objects in the image. As such, the ROI provide relevance information at a spatial extent (i.e. regions or objects) whereas the gaze patterns provide information about locations, often within or as part of objects. As will be shown later (Section 3.2), the gaze patterns are typically transformed into fixation density maps (FDM). In this form, continuous levels of perceptual relevance are given in the form of a landscape-like map. The individual ROI selections considered in this work, on the other hand, have only binary levels (ROI or background). As a combination of all observers, as we will see in Section 4.2, the ROI maps can also be transformed into landscape-like maps.

Furthermore, gaze patterns contain temporal information about the viewing behaviour of the observer during the presentation time of the image. As such, one can clearly identify in which order the observers attended the different locations. The hand-labelled ROI selections do not provide such information; only the location of the selected region is known.

Finally, there is a difference in the accuracy of the relevance maps. As with every measurement methodology, a certain level of inaccuracy is involved. In the case of eye gaze tracking, this depends on the accuracy with which the eye tracker records the gaze patterns and the accuracy of the calibration to each individual observer. As for the ROI maps, the accuracy is mainly determined by the degree of diligence of the observers when conducting the manual selections as well as the precision of the selection tool used. In both cases, careful set-up of the hardware and clear instructions to the observers improve the quality and accuracy of the final results.
Table 1. Conceptual differences between gaze patterns recorded from eye gaze tracking and manually selected ROI.

Concept                 Gaze patterns             ROI selections
Volition control        Exogenous + endogenous    Mainly endogenous
Perceived interest      No direct insight         Direct insight
Spatial coverage        Attention locations       Regions/objects
Levels of relevance     Continuous                Discrete
Temporal information    Yes                       No
Measurement accuracy    Eye tracker               Selections

2.2. Related work

Investigations into the relationship between eye movements and ROI in visual stimuli go back to the 1960s.
Mackworth and Morandi [30] conducted an eye gaze tracking experiment with 20 participants who viewed two different images. The images were then divided into 64 squares and used in a second experiment, where another observer population judged the recognisability of the squares. The results showed that the maps created from both experiments peaked at locations with 'unusual details'. The authors concluded from these results that a visual scene can be divided into informative and redundant regions. These results provide interesting insight into the intelligibility, but not the perceived interest, of visual scenes. The low number of images used also does not allow the impact of content on the outcomes to be evaluated.

Henderson et al. [31] conducted a study on images comprising line drawings. In this study, contextual manipulations were performed by the authors to produce scenes with consistent objects (e.g. a beer glass on a bar) and inconsistent objects (e.g. a microscope on a bar). From an eye gaze tracking experiment it could be concluded that the initial fixation placement is not controlled by a peripheral semantic analysis of the objects in the scene. It was further found that fixations are longer on semantically informative (inconsistent) objects than on uninformative (consistent) objects, and that informative objects are returned to more often. The intentionally manipulated images do not reflect globally semantically meaningful scenes, and the objects were selected by the experimenters rather than the participants.

An extensive study on a large database of photographic images was performed by Elazary and Itti [29]. The authors evaluated the similarity between hand-labelled objects from the LabelMe database [32] and saliency maps created from a visual saliency model [15]. The authors conclude that the visual saliency maps predict the labelled objects above chance, despite the top-down attentional processes that are assumed during labelling of the objects. The authors state that the full intention behind the manually labelled objects is not known, as the database is open for anyone to label images. However, the authors infer that many of the selections were performed with respect to image segmentation.

Masciocchi et al. [33] conducted an experiment in which the participants selected 5 points of interest in each image. It was found that the participants generally were in high agreement with respect to the locations of the interest points. The accumulated results over all observers were compared to eye gaze tracking recordings and it was concluded that the interest points were strongly correlated with the gaze patterns. It should be emphasised that the interest points did not denote any particular ROI but, as the name suggests, singular points instead.

In Engelke et al. [34,35], manually labelled ROI were compared to gaze patterns. The gaze patterns were recorded during an image quality assessment task and, as such, the eye movements do not reflect viewing behaviour under task-free conditions. Interestingly, however, it was found that these gaze patterns were also strongly correlated with the selected ROI, and it was concluded that quality assessment mainly takes place in the most interesting regions of an image, despite a large variety of distortions being present in the images. Indications were also found that especially the early fixations predict the ROI well. The ROI used in this study were of simple rectangular shape and only seven different grey scale images were used.

A more elaborate study comparing manually labelled ROI and gaze patterns has been conducted by Wang et al. [36]. In a first experiment, the participants were instructed to rate the importance of segmented regions in images from the Berkeley Segmentation Dataset [37]. A different panel of observers then took part in a free-viewing eye gaze tracking experiment in which they were shown the same images. The authors confirm the conclusions from [34,35] that the gaze patterns, and in particular the early gaze points, predict the main ROI well. The authors state that the predefined segmentations are a limitation of the study and that some regions would have benefited from further segmentation into sub-regions. Furthermore, the segmentations were performed for image segmentation purposes and not with respect to the perceived interest of observers.

Despite the interesting and valuable conclusions drawn from these previous works, none of them explicitly addresses the relationship between overt visual attention and perceived interest in a task-free context, using dedicated experiment methodologies and a wide range of natural image content. The work presented in this paper aims to fill this gap.

3. Experiment 1: overt visual attention through eye gaze tracking

As a ground truth of overt visual attention to natural scenes, we utilise the results of an eye gaze tracking experiment that we conducted earlier. The eye gaze tracking data is made publicly available in the Visual Attention for Image Quality (VAIQ) database [38] and the experiment is explained in detail in [39]. For completeness, we briefly review the experiment here.

3.1. Experiment procedures

The eye gaze tracking experiment was performed at the University of Western Sydney, Australia. Fifteen people participated, of which nine were male and six were female. Their ages ranged from 20 to 60 years, with an average age of 42 years. In a task-free context, participants were shown the reference images from three well-known image quality databases: the MICT database [40], the IRCCyN/IVC database [41], and the LIVE database [42]. These databases contain 14, 10, and 29 images, respectively. The MICT and the LIVE database have 11 images in common and hence a total of 42 different images were presented to the participants. The names of the images, along with a number assignment and the related databases, are listed in Table 2. All images are also presented in the left column of Figs. 13-18. A 19-inch Samsung SyncMaster monitor with a screen resolution of 1280 × 1024 was used for image presentation. An EyeTech TM3 eye tracker [43] was installed under the screen to record the eye movements of the human observers. The participants were seated at a distance of approximately 60 cm from the screen. Each image was shown for 12 s, with a mid-grey screen shown between images for 3 s.
3.2. Post-processing of the recorded data

The recorded gaze points are post-processed using the methodology described in detail in [34]. The procedure involves two main steps. Firstly, the gaze patterns are processed into visual fixation patterns (VFP) F by neglecting saccadic eye movements, which do not contribute to active vision, and by clustering the remaining gaze points into fixations. A Gaussian filter kernel, which accounts for non-linear retinal processing, is then used in a second step to filter the VFP into a fixation density map (FDM) D. The VFP and FDM are used in the remainder of this paper as the overt visual attention ground truth. The FDM for all images are presented in the second column of Figs. 13-18.

4. Experiment 2: perceived interest through ROI selections

We conducted a dedicated experiment to determine the perceived interest in the same set of images that were shown in the eye gaze tracking experiment (see Section 3). The experiment is presented in [44]. In the following, we provide a summary of the experiment and a detailed analysis of the experiment results.

4.1. Experiment details

4.1.1. Experiment procedures

The experiment was performed at the Blekinge Institute of Technology, Sweden. Twenty people participated, of which 12 were male and 8 were female. The ages of the participants ranged between 23 and 37 years, with an average age of 30 years. The subject pool was mutually exclusive to that of the eye gaze tracking experiment. The 42 images from the eye gaze tracking experiment (see Table 2) were presented in random order on a 19-inch DELL LCD screen with a resolution of 1280 × 1024 and a black background behind the images. The viewing distance was approximately 50-60 cm. No time limits were imposed, but in general the participants took between 30 and 60 min to perform the ROI selections. Before the test images, 1 training image and 3 stabilisation images were presented to explain the ROI selection procedure and for the participants to get used to the selection process, respectively.

4.1.2. ROI selection procedure

The participants were instructed to select an object or a region in the image that was of highest interest to them. If, and only if, there were multiple regions of highest interest, the participants were allowed to select all of them. This was, for instance, often the case when two eyes, multiple humans, or multiple similar objects were selected. No limitations on the size of the ROI were imposed, except that they needed to constitute a subset of the entire image. Participants could make a new selection if they were unsatisfied with the choice or accuracy of their selection. The participants used a paint brush in Photoshop CS5 to perform the ROI selections. The brush had a circular shape and three different diameters for the participants to choose from: 20 pixels, 40 pixels, and 80 pixels. The small brush size was to be used for the selection of smaller objects and
object contours, whereas the larger brush sizes mainly facilitated fast filling of larger areas. The colour of the brush was chosen to be pink, with RGB values [255, 0, 240], as this colour is absent in all images and thus facilitates easy segmentation of the ROI from the background (a sketch of this segmentation is given after Fig. 1 below). Some examples of ROI selections are presented in Fig. 1 for images 28, 37, and 40. Contrary to the interest points in [33], it can be observed that the ROI from our experiment have a spatial extent over entire objects or parts thereof. Furthermore, the ROI shapes are freely defined by the observers, unlike the pre-defined segmentations used in [36].

4.1.3. Dominance ratings

After completing the ROI selection for each individual image, the participants were asked to rate, on a scale from 1 to 3, how dominant their interest in the ROI selection is in relation to the background. The rating scale is shown in Table 3. The participants were asked to provide a rating of 3 if the selected ROIs were very dominant compared to the background. In case there were other objects of interest in the background that were not selected, a rating of 2 was requested. If the participant struggled to decide between several objects of highest interest, but decided to select only a subset of them, then a rating of 1 was to be given. These dominance ratings, in the following denoted as r_D, facilitate a better understanding of whether or not the selected ROIs dominate the image content. They can further serve as a weighting factor for the construction of ROI maps over all participants.

Table 3. Dominance rating scale.

Dominance          Score
Very dominant      3
Dominant           2
Slightly dominant  1

4.1.4. Post-experiment survey

As we discussed in Section 2.1, the ROI selection involves a number of different cognitive processes and is impacted by a number of different factors driven by low-level visual features as well as higher-level semantic information and personal preference. To gain a better understanding of the degree to which different factors impact the ROI selections, we sent out a survey to each of the participants shortly after the experiment.

Table 2. Numbers (#) and names of the images in the IVC (I), LIVE (L), and MICT (M) databases (DB).

#   Name               DB      #   Name                DB
1   avion              I       22  lighthouse          L
2   barba              I       23  lighthouse2/kp21    L/M
3   boats              I       24  manfishing          L
4   clown              I       25  monarch             L
5   fruit              I       26  ocean/kp16          L/M
6   house              I       27  paintedhouse/kp24   L/M
7   isabe              I       28  parrots/kp23        L/M
8   lenat              I       29  plane/kp20          L/M
9   mandr              I       30  rapids              L
10  pimen              I       31  sailing1/kp06       L/M
11  bikes/kp05         L/M     32  sailing2            L
12  building2          L       33  sailing3            L
13  buildings/kp08     L/M     34  sailing4            L
14  caps/kp03          L/M     35  statue              L
15  carnivaldolls      L       36  stream/kp13         L/M
16  cemetry            L       37  studentsculpture    L
17  churchandcapitol   L       38  woman               L
18  coinsinfountain    L       39  womanhat            L
19  dancers            L       40  kp01                M
20  flowersonih35      L       41  kp07                M
21  house/kp22         L/M     42  kp12                M
Fig. 1. Example ROI selections for images 28 (left column), 37 (center column), and 40 (right column).
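The pink-brush annotations illustrated in Fig. 1 can be converted back into binary selection masks by matching the brush colour. The following Python snippet is a minimal sketch of this segmentation; the file name and the tolerance parameter are illustrative assumptions, not part of the original experiment tooling.

```python
# A minimal sketch of recovering a binary ROI map from an annotated
# image such as those in Fig. 1, assuming the selections were painted
# with the pure brush colour RGB = [255, 0, 240].
import numpy as np
from PIL import Image

BRUSH_RGB = np.array([255.0, 0.0, 240.0])

def roi_mask_from_annotation(path, tol=30.0):
    """Return a binary map: 1 where the pink brush was applied, else 0.

    A small tolerance accommodates interpolation at brush edges; with
    lossless storage, tol=0 suffices because the brush colour is
    absent from all source images.
    """
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    dist = np.linalg.norm(rgb - BRUSH_RGB, axis=-1)
    return (dist <= tol).astype(np.uint8)

# Hypothetical file name for illustration only.
mask = roi_mask_from_annotation("image28_participant01.png")
```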
In this survey, we asked the participants to rate how strongly each of 19 different factors influenced their selection, using a three-point scale: 'weakly', 'medium', 'strongly'. An additional rating, 'no opinion', was to be used when the participants were not sure whether or not a factor influenced their decision. The factors covered in the survey are broadly classified into:

- Low-level features: colour, brightness, size, shape, location, texture.
- Semantic information: humans, animals, faces, eyes, written text.
- Personal preferences: image composition, art, camera focus.
- Personal attributes: personal taste, personal passion, personal emotions, personal ownership, cultural background.

4.2. Post-processing of ROI selections
We created binary ROI maps for the ith image and jth participant by segmenting the image according to the manual selections of the participants as follows:

$$M_{i,j}(m,n) = \begin{cases} 1, & \text{ROI} \\ 0, & \text{Background} \end{cases} \qquad (1)$$

with $(m,n)$ denoting the pixel locations in the images. To highlight the perceived interest as an accumulation over all observers, the binary maps $M_{i,j}(m,n)$ are combined over all participants as follows:

$$M'_i(m,n) = \sum_{j=1}^{N} \omega\, M_{i,j}(m,n) \qquad (2)$$

with $N = 20$ being the number of participants. The weighting factor $\omega$ determines the contribution of the individual ROI maps to the combined map and could, for instance, be set to $\omega = r_D$, with $r_D$ being the dominance ratings collected during the experiment. In the scope of this work, we chose a simple uniform weighting of $\omega = 1$. For further processing we normalise $M'_i(m,n)$ as follows:

$$M_i(m,n) = \frac{M'_i(m,n)}{\max\left(M'_i(m,n)\right)} \qquad (3)$$

with $M_i(m,n) \in [0, 1]$.
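The accumulation and normalisation of Eqs. (1)-(3) are straightforward to implement. Below is a minimal sketch, assuming the per-participant binary masks are already available as arrays; the function name and the optional weights argument (for the dominance-weighted variant with ω = r_D) are our own.

```python
# A minimal sketch of Eqs. (1)-(3): combine the binary per-participant
# maps M_{i,j} into an accumulated ROI map and normalise it to [0, 1].
import numpy as np

def combined_roi_map(masks, weights=None):
    """masks: list of N binary arrays of shape (H, W).
    weights: length-N sequence, e.g. the dominance ratings r_D,
    or None for the uniform weighting (omega = 1) used in the paper.

    Returns the normalised map M_i with values in [0, 1].
    """
    masks = np.stack(masks).astype(float)            # shape (N, H, W)
    if weights is None:
        weights = np.ones(len(masks))                # omega = 1
    m_prime = np.tensordot(weights, masks, axes=1)   # Eq. (2)
    return m_prime / m_prime.max()                   # Eq. (3)
```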
Fig. 2. Examples of uniformly weighted ROI maps (top row) and dominance weighted ROI maps (bottom row) for images 4 (left column), 7 (center column), and 9 (right column).
Fig. 3. Maxima of the ROI maps ϕmax for all 42 images.
Examples of uniformly weighted ROI maps (ω = 1) for images 4, 7, and 9 are shown in the top row of Fig. 2. For visual comparison, dominance weighted ROI maps (ω = r_D) are presented in the bottom row of Fig. 2. The effect of imperfect ROI selections has been smoothed using a Gaussian kernel for visualisation purposes only. For all uniformly weighted ROI maps used in this work, please refer to the third column of Figs. 13-18.

4.3. Experiment results

In the following sections we analyse the ROI maps to provide insight into the perceived interest of human observers in natural image content.

4.3.1. Quantitative analysis of the ROI maps

The maxima ϕmax of the combined ROI maps before normalisation, M'_i(m,n), are illustrated in Fig. 3 for all 42 images. The maxima represent the maximum number of overlapping ROI selections from all 20 observers. A higher number therefore relates to a higher agreement between the participants regarding a particular ROI selection. It can be observed that the maxima range from as low as 8 to as high as 19. In fact, there were three images for which 19 people agreed on an ROI selection: 'avion' (image 1), 'clown' (image 4), and 'monarch' (image 25).
These three images contain very distinct ROI in terms of written text, a face, and an animal. The latter two additionally exhibit unusual and beautiful patterns, which appear to be of high interest to many observers. On the other extreme are the images 'kp01' (image 40), 'fruit' (image 5), 'bikes/kp05' (image 11), and 'lighthouse' (image 22), which all have several ROI, resulting in lower agreement between the participants. Image 'flowersonih35' (image 20) may have received its low maximum due to no strong ROI being present at all.

In addition to the maxima, we compute the ROI coverage ϕcov, which we define as the percentage of pixels within the ROI relative to the total number of pixels in the image. The ROI coverage for all images, averaged over all participants, is presented in Fig. 4. The average ROI coverage varies strongly with the image content, with the smallest ROI coverage of approximately 4% for the image 'manfishing' (image 24) and the largest ROI coverage of approximately 20% for the image 'paintedhouse/kp24' (image 27). Naturally, the variation amongst observers, as indicated by the standard error, is generally higher for larger mean coverage areas. The largest standard error is given for the image 'sailing3' (image 33). This image has a strong, small ROI on the people in the boat, but also exhibits very large ROI selections of the sails. An exception is the image 'monarch' (image 25), which has a large coverage area but a low standard error. This is because of the high agreement of the observers on the large wing of the butterfly, as also indicated by the maximum shown in Fig. 3.

4.3.2. Dominance ratings

Fig. 5 presents the dominance ratings D for all images, averaged over all participants. The average ratings range approximately from 1.4 to 2.6. The images 'clown' and 'monarch', which exhibited the highest maxima in the combined ROI maps, have also received the highest averaged dominance ratings. Similarly, the image 'kp01' received both the lowest maximum in the ROI map and the lowest average dominance rating. These observations indicate a relationship between the ROI map maxima and the dominance ratings and, indeed, these two factors exhibit a linear correlation coefficient of CC = 0.595. The ROI coverage is nearly uncorrelated (CC = 0.017) with the dominance ratings, indicating that the observers did not relate the size of the ROI selections to their dominance over the background.
Fig. 4. ROI coverage ϕcov averaged over all 20 participants. The bars denote the related standard errors.

Fig. 5. Dominance ratings D averaged over all 20 participants. The bars denote the related standard errors.

Table 4. Number of ratings for each factor contained in the survey.

Class                  Factor                Weakly  Medium  Strongly  No opinion
Low-level features     Object colour         1       7       12        0
                       Object brightness     5       6       9         0
                       Object size           8       8       4         0
                       Object shape          9       7       3         1
                       Object location       6       6       8         0
                       Object texture        6       7       5         2
Semantic information   Humans                6       6       8         0
                       Animals               3       12      5         0
                       Faces                 2       3       15        0
                       Eyes                  2       1       17        0
                       Written text          2       13      5         0
Personal preference    Image composition     2       11      7         0
                       Art                   7       6       6         1
                       Camera focus          2       5       11        2
Personal attributes    Personal taste        7       9       4         0
                       Personal passion      4       9       7         0
                       Personal emotions     3       12      5         0
                       Personal ownership    14      4       2         0
                       Cultural background   12      6       2         0
4.3.3. Survey ratings

The factors, the rating levels, and the number of ratings from the survey are summarised in Table 4. For better comprehension and illustration of the results, we assign a numerical value to each of the three ratings as follows: weakly = 1, medium = 2, strongly = 3. The mean over all observer ratings using this numerical scale is given in Fig. 6.

The results show that within each of the four classes of factors, there are one or two factors dominating over the others. Within the 'low-level features', colour has received a considerably higher average rating compared to the other features. It is well known that colour is a strong attractor of attention [45] and the results from the experiment reveal that the participants also consider it to be an influential factor on which to base their ROI selection. Within the 'semantic information' factors, eyes and faces have outperformed humans, animals, and written text. Although it does not come as a surprise that eyes and faces have been rated highly, one would have expected a higher rating for written text, which was selected extensively throughout the experiment (see images 1/3/12/13/14/15/16/29/33). However, the image 'rapids' (image 30) shows that when human faces and written text occur together, the former still dominates over the latter. As can be seen from the FDM for the image 'rapids', this observation does not hold for the eye gaze tracking data. Within the category 'personal preference', the camera focus received by far the highest rating, which is plausible since blurred objects cannot be fully comprehended and may thus be of reduced interest. The higher rating of this factor would also be influenced by the fact that there are several images in the set that contain de-focus blur (see images 8/25/28/32/35) and fewer images that contain, for instance, artistic objects. Finally, the 'personal attributes' have received on average lower ratings than the other factors, with personal taste, passion, and emotion being rated considerably higher than ownership and cultural background. These factors, however, are considered to be highly dependent on the observer population, whereas the first three categories are expected to be more dependent on the test images presented in the experiment.

Overall, colour, eyes, faces, and camera focus are the factors that stand out. This insight into the perceived importance of the different influential factors during ROI selection is considered to be instrumental in developing and improving ROI prediction models.

Fig. 6. Mean survey ratings with numerical assignments as follows: weakly = 1, medium = 2, strongly = 3.
Fig. 7. NSS for all images and presentation times t. The red line indicates the mean NSS over all presentation times for the respective image.

Fig. 8. NSS for all 42 images and for t = 12 s.
5. Relationship between overt visual attention and perceived interest

We investigate the relationship between overt visual attention and perceived interest based on the VFP and the ROI maps. In our analysis we focus particularly on the effect of two factors on the similarity between the VFP and ROI maps: image content and the presentation time during the eye gaze tracking experiment.
Fig. 9. NSS for all presentation times t, averaged over all 42 images.
5.1. Normalised scanpath saliency (NSS)

The normalised scanpath saliency (NSS) [46] was originally designed to compare fixation locations with saliency maps computed with computational visual attention models. We adopt it here to compare the ROI maps to the VFP from the eye gaze tracking experiment. For this purpose, the ROI maps are normalised to have zero mean and unit standard deviation as follows:

$$M_i^{(NSS)}(m,n) = \frac{M_i(m,n) - \mu_{M_i}}{\sigma_{M_i}} \qquad (4)$$

The magnitudes of the ROI maps $M_i^{(NSS)}$ are then extracted at each fixation location in the corresponding VFP as

$$\kappa_j = M_i^{(NSS)}(m,n) \quad \text{for } F_i(m,n) \neq 0 \qquad (5)$$

with $j \in \{1,\dots,J\}$ and $J$ being the number of fixations. The final NSS value is then computed as the average over all extracted ROI magnitudes as

$$\mathrm{NSS}\left(M_i \mid F_i\right) = \frac{1}{J}\sum_{j=1}^{J}\kappa_j \qquad (6)$$
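Eqs. (4)-(6) amount to sampling a z-scored ROI map at the fixation locations. The following is a minimal sketch, assuming fixations are given as pixel coordinates; the function name is our own.

```python
# A minimal sketch of Eqs. (4)-(6): normalise the ROI map M_i to zero
# mean and unit standard deviation, then average its values at the
# fixation locations (the non-zero entries of F_i).
import numpy as np

def nss(roi_map, fixations):
    """roi_map: (H, W) float array; fixations: iterable of (row, col)."""
    z = (roi_map - roi_map.mean()) / roi_map.std()   # Eq. (4)
    kappa = [z[r, c] for r, c in fixations]          # Eq. (5)
    return float(np.mean(kappa))                     # Eq. (6)
```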
As a result of the ROI map normalisation, NSS values larger than zero reflect a correspondence greater than what would be expected by chance. On the contrary, values smaller than zero indicate an inverse correspondence, whereas values close to zero relate to chance agreement.

Table 5. Correlation coefficients ρ between the NSS and the ROI maxima ϕmax, ROI coverage ϕcov, and dominance ratings D.

                  ϕmax     ϕcov      D
NSS (t = 0.5 s)   0.485    -0.381    0.374
NSS (t = 12 s)    0.533    -0.394    0.605
5.2. Comparison between ROI maps and VFP

The NSS between the VFP and ROI maps is shown for all images in Fig. 7. For each image, we computed the NSS for VFP of 13 different presentation times t ∈ {0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} s. This was implemented by using only the gaze points of the first t seconds of the 12 s of eye gaze tracking data that are available for each image (see the sketch at the end of this subsection). Computing the NSS for various presentation times allows us to investigate the relation between the VFP and ROI maps both for early fixations and for longer observations of the images. The red lines in the figures indicate the mean NSS over all presentation times for the respective image.

From the figures it can be observed that the NSS values range widely, from close to 0 to almost 6. There appears to be a strong dependence of the NSS values on both image content and presentation time t. Most NSS are well above 0, indicating that there are strong relationships between the VFP and the ROI maps for most images. Images 5, 11, 20, and 36 are exceptions, with NSS values close to 0 for all presentation times. These images constitute interesting cases, as the fixations do not agree with the ROI maps. For instance, in image 11 the fixations were mainly drawn to the helmets of the bike riders, whereas the tires were chosen as the ROI (see Fig. 14). Similarly, in image 5 the fruit that was fixated the most does not agree with the fruit perceived to be of most interest.

Contrary to the images with low NSS, there are also several images that exhibit particularly high NSS. To illustrate this, we show in Fig. 8 the NSS for all images for t = 12 s. Some of the images with particularly high NSS include images 2, 4, 8, 28, 39, and 42. These images share the property of having either faces, humans, or animals in their content. Faces had been rated particularly highly by the participants as being an important factor for the ROI selections. As expected, they are also a strong attractor of overt visual attention, along with humans and animals. Image 1 also exhibits a high NSS value but contains a plane instead of a face or a human. In this case, the foreground object dominates the central part of the image, hence the strong relation between the VFP and ROI maps.

Not only the absolute NSS values differ between the images but also the trend of the NSS values with presentation time. For all images, the NSS values fluctuate more for shorter presentation times and converge with longer presentation time. For some images, however, the NSS values start considerably higher compared to their respective converged values at t = 12 s. This is particularly true for images 7, 19, 23, 24, and 41. These images have a central ROI in common and hence centre bias may cause these early fixations to relate well with the ROI selections.

In Fig. 9 we present the NSS for all presentation times t, averaged over all images. One can see the general trend of higher NSS for early fixations; however, the NSS of the individual images show that this trend is caused by only a subset of images. Please also note the narrower scale for the NSS values in Fig. 9 compared to Fig. 7.
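The presentation-time truncation described above can be implemented by filtering the recorded fixations on their timestamps. A small sketch follows; the (timestamp, row, col) tuple format is an assumption for illustration, not the storage format of the VAIQ database.

```python
# A minimal sketch of the presentation-time truncation: keep only the
# fixations recorded within the first t seconds of viewing.
def fixations_up_to(fixations, t_max):
    """fixations: iterable of (timestamp_s, row, col) tuples.
    Returns the subset with timestamp_s <= t_max, e.g. t_max in
    {0.5, 1, ..., 12} as used in the paper."""
    return [(ts, r, c) for ts, r, c in fixations if ts <= t_max]
```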
Fig. 10. Examples of quantised ROI maps for image 21 (top) and image 30 (bottom). The white area indicates the (a) primary ROI M^P_{i,τ}(m,n), (b) secondary ROI M^S_{i,τ}(m,n), and (c) background M^B_{i,τ}(m,n).
Fig. 11. AUC for all images and presentation times t and for primary ROI (black circles), secondary ROI (dark grey squares), and background (light grey diamonds).
Fig. 12. AUC between the FDM and primary ROI (ROI_P), secondary ROI (ROI_S), and background (BG), averaged over all images for all presentation times.
5.3. Relation of NSS to ϕmax, ϕcov, and D
The ROI maxima ϕmax, ROI coverage ϕcov, and dominance ratings D defined in Sections 4.3.1 and 4.3.2 represent important properties of the ROI selections. We correlate these measures with the NSS values for both t = 0.5 s and t = 12 s to evaluate whether these measures are predictive of the relation between VFP and ROI maps. The Pearson linear correlation coefficients ρ are presented in Table 5. The correlations are consistently higher for t = 12 s than for t = 0.5 s. The NSS values are positively correlated with both the ROI maxima ϕmax and the dominance ratings D for both presentation times. Both of these measures are indicators of the dominance of an ROI selection. It appears that more dominant ROI selections result in stronger relationships between VFP and ROI maps. From this we may deduce that perceived interest in images is more likely to be predicted from fixations when dominant ROI are present in the image. The ROI coverage ϕcov is weakly negatively correlated with the NSS values.

6. Predictive value of overt visual attention for perceived interest

Digital imaging applications, such as the ones discussed in Section 1, may require coarsely quantised ROI levels instead of the continuous levels of interest provided by the ROI maps. Given that modern mobile computing devices are often equipped with a camera and most recently even with an integrated eye gaze tracker, these quantised ROI maps could be estimated from eye gaze tracking data. In the following, we therefore further investigate the predictive value of eye gaze tracking data for quantised ROI maps. Towards this end, we quantise the continuous ROI maps into 'Primary ROI', 'Secondary ROI', and 'Background' and evaluate how the FDM relate to these quantised levels of perceived interest.

6.1. Thresholding of the ROI maps

The ROI maps presented in the third column of Figs. 13-18 have normalised values between 0 (black) and 1 (white). We create three binary maps by quantising these ROI maps into 'Primary ROI', $M^P_{i,\tau}(m,n)$, 'Secondary ROI', $M^S_{i,\tau}(m,n)$, or 'Background', $M^B_{i,\tau}(m,n)$, as follows:

$$M_{i,\tau}(m,n) = \begin{cases} M^P_{i,\tau}(m,n) & \text{for } 0.5 \le M_i(m,n) \le 1 \\ M^S_{i,\tau}(m,n) & \text{for } 0.05 \le M_i(m,n) < 0.5 \\ M^B_{i,\tau}(m,n) & \text{for } 0 \le M_i(m,n) < 0.05 \end{cases} \qquad (7)$$
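Eq. (7) is a simple two-threshold quantisation. A minimal sketch follows, using the paper's thresholds of 0.5 and 0.05; the function name is our own.

```python
# A minimal sketch of Eq. (7): quantise the normalised ROI map M_i
# into primary ROI, secondary ROI, and background.
def quantise_roi_map(m, t_primary=0.5, t_secondary=0.05):
    """m: (H, W) array with values in [0, 1].

    Returns three mutually exclusive boolean maps:
    (primary, secondary, background).
    """
    primary = m >= t_primary                     # 0.5 <= M_i <= 1
    secondary = (m >= t_secondary) & ~primary    # 0.05 <= M_i < 0.5
    background = m < t_secondary                 # 0 <= M_i < 0.05
    return primary, secondary, background
```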
Fig. 13. Images 1–7: (a) original image, (b) FDM, (c) ROI map, and (d) thresholded ROI map.
Examples of quantised binary maps are shown in Fig. 10 for images 21 and 30 with the white areas indicating the areas under assessment (primary ROI, secondary ROI, background).
The quantised maps are shown for all images in the rightmost column of Figs. 13-18. Owing to space limitations, we visualise all three maps in the same image, with white representing the primary ROI, mid-grey the secondary ROI, and black the background.
Fig. 14. Images 8–14: (a) original image, (b) FDM, (c) ROI map, and (d) thresholded ROI map.
6.2. Area under the ROC curve (AUC)

We perform receiver operating characteristics (ROC) [47,48] analysis to evaluate the degree to which overt visual attention agrees with the three levels of perceived interest in terms of the quantised ROI maps, M_{i,τ}(m,n). Traditionally, ROC analysis is used for binary classification of a performance measure into one of two classes. Here, we make an unconventional use of ROC analysis to quantify the level of agreement of the FDM (analogous to the performance measure) with the primary ROI, secondary ROI, and background. The Area Under the ROC Curve (AUC) is defined on a scale from 0 to 1, with higher values indicating a stronger agreement between the FDM and the respective quantised ROI map (the white areas in Fig. 10). Values around 0.5 indicate chance agreement between the FDM and ROI maps and values below 0.5 represent an inverse relationship between the FDM and ROI maps.
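One plausible reading of this unconventional ROC use is to treat the FDM value at each pixel as a score for that pixel's membership in the respective binary map. The sketch below follows that assumption and also shows a Gaussian-filtered FDM construction in the spirit of Section 3.2; the smoothing width, function names, and use of scikit-learn are our own assumptions, not the authors' implementation.

```python
# A minimal sketch of the AUC analysis: build an FDM by
# Gaussian-filtering a fixation count map, then score it against one
# of the quantised binary maps from Eq. (7).
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def fdm_from_fixations(fixations, shape, sigma_px=50):
    """fixations: iterable of (row, col); returns a smoothed FDM.
    The sigma of 50 pixels is an illustrative assumption."""
    counts = np.zeros(shape)
    for r, c in fixations:
        counts[r, c] += 1
    return gaussian_filter(counts, sigma=sigma_px)

def auc_fdm_vs_mask(fdm, binary_mask):
    """Treat the FDM as a per-pixel score for membership in the mask."""
    return roc_auc_score(binary_mask.ravel().astype(int), fdm.ravel())
```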
6.3. Relationship between the FDM and quantised ROI maps

Fig. 11 shows the AUC values between the FDM and the respective quantised ROI maps for all images and presentation times. For almost all images, the agreement is best between the FDM and the primary ROI, followed by the secondary ROI and the background.
Fig. 15. Images 15–21: (a) original image, (b) FDM, (c) ROI map, and (d) thresholded ROI map.
The AUC values for the primary ROI are mostly close to 1, showing a strong agreement with the respective FDM. This indicates that the FDM can in fact serve as a predictor of the most interesting objects within a wide range of image content. There are several exceptions to this observation, the most prominent one being image 11. Here, the secondary ROI actually agrees best with the FDM. This is in line with the earlier observation that the participants of the eye gaze tracking experiment focused mainly on the helmets of the bike riders, whereas the bike wheels were labelled as most interesting in the ROI experiment.
There are several other images for which the primary and secondary ROI have a similar agreement with the FDM, namely, images 5, 17, 19, 22, 29, and 32. These images tend to have ROI selections that were rated as not very dominant (see Fig. 5). This is reflected in the eye gaze tracking data, which shows that the observers inspected the secondary ROI almost as much as the primary ROI.

To better understand the overall trend of the agreement between the FDM and quantised ROI maps, we present in Fig. 12 the AUC for all presentation times, averaged over all images. The plots show AUC values between the FDM and the primary ROI of about 0.9, thus confirming the strong agreement for a wide range of images. The agreement for the secondary ROI is also well above chance, with AUC values around 0.65.
Fig. 16. Images 22–28: (a) original image, (b) FDM, (c) ROI map, and (d) thresholded ROI map.
The low AUC values of about 0.2 reveal an inverse relationship of the FDM with the background of the images, showing that these areas do not receive much visual attention from the observers.

As with the NSS, we compute the correlation coefficient ρ between the AUC at t = 12 s for the primary ROI and the ROI maxima ϕmax, ROI coverage ϕcov, and dominance ratings D. The ROI maxima have a notable correlation of ρ = 0.505 with the primary ROI. The dominance ratings are only weakly correlated (ρ = 0.34) and the ROI coverage is negatively correlated (ρ = -0.428) with the primary ROI. These results confirm that overt visual attention has a higher predictive value for perceived interest in the case of dominant ROI.
7. Discussion and conclusions

We presented a novel discourse and study on the relationship between overt visual attention and perceived interest in natural scenes. We addressed the conceptual differences between overt visual attention and perceived interest and how these mechanisms are accounted for in eye gaze tracking and ROI selection experiments. We discussed related work to put our study into the context of earlier contributions and to highlight its significant contribution to the existing body of research.

To obtain solid ground truths for our study, we conducted two dedicated experiments: an eye gaze tracking experiment and an ROI selection experiment. Especially the latter experiment was discussed in great detail given its novel and rather unconventional approach.
Fig. 17. Images 29–35: (a) original image, (b) FDM, (c) ROI map, and (d) thresholded ROI map.
Analysis of the ROI selections revealed deep insight into the perceived interest in natural images. Image content was shown to have a strong impact on the agreement between observers regarding the ROI selections. The most influential factors upon which the observers based their ROI selections were identified in the experiment to be colour, eyes, faces, and camera focus.

Quantitative comparison of the VFP and ROI maps showed that overt visual attention indeed has a strong relationship to perceived interest, with NSS values well above 0 for a wide range of images. We revealed, however, that the image content strongly influences the agreement between VFP and ROI maps, with images containing faces, humans, and animals exhibiting particularly high agreement.
These factors were already identified in the ROI selection experiment to be of great interest to the observers, and this appears to translate into similarity with the VFP. Image presentation time during the eye gaze tracking experiment was also found to have a strong impact on the relationship between VFP and ROI maps, with shorter presentation times generally resulting in higher agreement in terms of NSS. We correlated the NSS values with the ROI maxima ϕmax, ROI coverage ϕcov, and dominance ratings D to evaluate whether these measures are predictive of
Fig. 18. Images 36–42: (a) original image, (b) FDM, (c) ROI map, and (d) thresholded ROI map.
the relation between VFP and ROI maps. The ROI maxima ϕmax and dominance ratings D were found to be more strongly correlated with NSS than the ROI coverage ϕcov, which is further evidence that dominant ROI maps that a number of observers agree on are more strongly related to eye gaze tracking data. Interestingly, these results are more prevalent for presentation time t = 12 s than for t = 0.5 s. We believe that this is because VFP created from longer presentation times are dominated by fixations related to perceived interest, whereas VFP created from short presentation times are dominated by saliency-driven attention.
We further determined the success with which eye gaze patterns can predict quantised ROI levels based on ROC analysis. Primary ROI were shown to be predicted significantly better by the FDM than secondary ROI and background. These observations hold for a wide range of images, with only few exceptions (e.g. image 11). As with the NSS measures, the AUC were shown to be correlated with the ROI maxima ϕmax.

We believe that the research presented in this paper provides a valuable contribution to the perception-driven image and video processing research community. Despite the thorough and extensive implementation of the experiments
and the analytical methodology, there are certain limitations to this study that may be addressed in ongoing and future work. Firstly, we implemented a specific methodology to experimentally assess perceived interest in natural images by means of hand-labelled ROI selections. We believe that this is a valid approach to address this problem, but we acknowledge that there may be other approaches that are equally valid and may be implemented instead. Secondly, despite the wide range of image content that provided insight into the effect of different factors on overt visual attention and perceived interest, one may design an experiment specifically to validate certain factors in more depth. For instance, faces and written text are two factors that attract attention and interest; however, their relative contribution when appearing within the same image needs further investigation. Similar observations hold for other factors identified in this paper. Finally, task has a strong impact on overt visual attention and is expected to also have a strong impact on perceived interest. In this work, we considered a task-free context, but future work should investigate the relationship between overt visual attention and perceived interest under different tasks.

The experimental data of the ROI selection experiment and the eye gaze tracking experiment are freely available to the research community in the ROI-D [44] and VAIQ [39] databases, respectively. To receive access to the databases, the lead author of this paper can be contacted.

Appendix A. Images and perceptual relevance maps

See Figs. 13-18.

References

[1] J. Wolfe, Visual attention, in: K.K.D. Valois (Ed.), Seeing, Academic Press, San Diego, 2000, pp. 335-386.
[2] S. Treue, Visual attention: the where, what, how and why of saliency, Curr. Opin. Neurobiol. 13 (4) (2003) 428-432.
[3] P. Le Callet, E. Niebur, Visual attention and applications in multimedia technologies, Proc. IEEE 101 (9) (2013) 2058-2067.
[4] S. Frintrop, E. Rome, H.I. Christensen, Computational visual attention systems and their cognitive foundations: a survey, ACM Trans. Appl. Percept. 7 (1) (2010).
[5] K.H. Park, H.W. Park, Region-of-interest coding based on set partitioning in hierarchical trees, IEEE Trans. Circuits Syst. Video Technol. 12 (2) (2002) 106-113.
[6] W.J. Chien, N.G. Sadaka, G.P. Abousleman, L.J. Karam, Region-of-interest-based ultra-low-bit-rate video coding, in: Proceedings of SPIE Visual Information Processing XVII, vol. 6978, 2008, http://dx.doi.org/10.1117/12.778216.
[7] F. Boulos, W. Chen, B. Parrein, P. Le Callet, A new H.264/AVC error resilience model based on regions of interest, in: Proceedings of the International Packet Video Workshop, 2009.
[8] V. Setlur, T. Lechner, M. Nienhaus, B. Gooch, Retargeting images and video for preserving information saliency, IEEE Comput. Graph. Appl. 27 (5) (2007) 80-88.
[9] D. Wang, G. Li, W. Jia, X. Luo, Saliency-driven scaling optimization for image retargeting, Vis. Comput. 27 (9) (2011) 853-860.
[10] K. Vu, K.A. Hua, W. Tavanapong, Image retrieval based on regions of interest, IEEE Trans. Knowl. Data Eng. 15 (4) (2003) 1045-1049.
[11] O. Marques, L.M. Mayron, G.B. Borba, H.R. Gamba, Using visual attention to extract regions of interest in the context of image retrieval, in: Proceedings of the 44th Annual Southeast Regional Conference, 2006, pp. 638-643.
[12] U. Engelke, H. Kaprykowsky, H.-J. Zepernick, P. Ndjiki-Nya, Visual attention in quality assessment: theory, advances, and challenges, IEEE Signal Process. Mag. 28 (6) (2011) 50-59.
28 (6) (2011) 50–59. [13] U. Engelke, H.-J. Zepernick, A framework for optimal region-ofinterest based quality assessment in wireless imaging, J. Electron. Imaging 19 (1) (2010) 011005.
[14] U. Engelke, H. Liu, J. Wang, P. Le Callet, I. Heynderickx, H.-J. Zepernick, A. Maeder, Comparative study of fixation density maps, IEEE Trans. Image Process. 22 (3) (2013) 1121–1133.
[15] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[16] O. Le Meur, P. Le Callet, D. Barba, D. Thoreau, A coherent computational approach to model bottom-up visual attention, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 802–817.
[17] C. Koch, S. Ullman, Shifts in selection in visual attention: towards the underlying neural circuitry, Hum. Neurobiol. 4 (4) (1985) 219–227.
[18] W. Osberger, A.M. Rohaly, Automatic detection of regions of interest in complex video sequences, in: Proceedings of the IS&T/SPIE Human Vision and Electronic Imaging VI, vol. 4299, 2001, pp. 361–372.
[19] D. Liu, T. Chen, DISCOV: a framework for discovering objects in video, IEEE Trans. Multimed. 10 (2) (2008) 200–208.
[20] U. Rajashekar, I. van der Linde, A.C. Bovik, L.K. Cormack, GAFFE: a gaze-attentive fixation finding engine, IEEE Trans. Image Process. 17 (4) (2008) 564–573.
[21] C. Guo, Q. Ma, L. Zhang, Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[22] D. Gao, V. Mahadevan, N. Vasconcelos, On the plausibility of the discriminant center-surround hypothesis for visual saliency, J. Vis. 8 (7) (2008) 1–18.
[23] H.J. Seo, P. Milanfar, Static and space-time visual saliency detection by self-resemblance, J. Vis. 9 (12:15) (2009) 1–27. 〈http://journalofvision.org/9/12/15/〉.
[24] N.D.B. Bruce, J.K. Tsotsos, Saliency, attention, and visual search: an information theoretic approach, J. Vis. 9 (3:5) (2009) 1–24. 〈http://journalofvision.org/9/3/5/〉.
[25] W. Kienzle, M.O. Franz, B. Schölkopf, F.A. Wichmann, Center-surround patterns emerge as optimal predictors for human saccade targets, J. Vis. 9 (5:7) (2009) 1–15.
[26] L. Itti, P. Baldi, A principled approach to detecting surprising events in video, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 631–637.
[27] A. Torralba, A. Oliva, M. Castelhano, J. Henderson, Contextual guidance of eye movements and attention in real-world scenes: the role of global features on object search, Psychol. Rev. 113 (4) (2006) 766–786.
[28] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, G.W. Cottrell, SUN: a Bayesian framework for saliency using natural statistics, J. Vis. 8 (7:32) (2008) 1–20. 〈http://journalofvision.org/8/7/32/〉.
[29] L. Elazary, L. Itti, Interesting objects are visually salient, J. Vis. 8 (3:3) (2008) 1–15. 〈http://journalofvision.org/8/3/3/〉.
[30] N.H. Mackworth, A.J. Morandi, The gaze selects informative details within pictures, Percept. Psychophys. 2 (11) (1967) 547–551.
[31] J.M. Henderson, P.A. Weeks, A. Hollingworth, The effects of semantic consistency on eye movements during complex scene viewing, J. Exp. Psychol.: Hum. Percept. Perform. 25 (1) (1999) 210–228.
[32] B. Russell, A. Torralba, W.T. Freeman, LabelMe: The Open Annotation Tool, 〈http://labelme.csail.mit.edu〉, 2011.
[33] C.M. Masciocchi, S. Mihalas, D. Parkhurst, E. Niebur, Everyone knows what is interesting: salient locations which should be fixated, J. Vis. 9 (11:25) (2009) 1–22. 〈http://journalofvision.org/9/11/25/〉.
[34] U. Engelke, H.-J. Zepernick, A.J. Maeder, Visual attention modeling: region-of-interest versus fixation patterns, in: Proceedings of the IEEE Picture Coding Symposium, 2009, pp. 1–4.
[35] U. Engelke, H.-J. Zepernick, A. Maeder, Visual fixation patterns in subjective quality assessment: the relative impact of image content and structural distortions, in: Proceedings of the International Symposium on Intelligent Signal Processing and Communications Systems, 2010.
[36] J. Wang, D.M. Chandler, P. Le Callet, Quantifying the relationship between visual salience and visual importance, in: Proceedings of the IS&T/SPIE Human Vision and Electronic Imaging XV, vol. 7527, 2010.
[37] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Berkeley Segmentation Datasets and Benchmarks 500, 〈http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html〉, 2011.
[38] U. Engelke, A. Maeder, H.-J. Zepernick, VAIQ: The Visual Attention for Image Quality Database, 〈http://www.bth.se/tek/rcg.nsf/pages/vaiq-db〉, 2009.
[39] U. Engelke, A.J. Maeder, H.-J. Zepernick, Visual attention modelling for subjective image quality databases, in: Proceedings of the IEEE International Workshop on Multimedia Signal Processing, 2009, pp. 1–6.
[40] Z.M.P. Sazzad, Y. Kawayoke, Y. Horita, Image Quality Evaluation Database, 〈http://mict.eng.u-toyama.ac.jp/mict/index2.html〉, 2000.
[41] P. Le Callet, F. Autrusseau, Subjective Quality Assessment IRCCyN/IVC Database, 〈http://www.irccyn.ec-nantes.fr/ivcdb/〉, 2005.
[42] H.R. Sheikh, Z. Wang, L. Cormack, A.C. Bovik, LIVE Image Quality Assessment Database Release 2, 〈http://live.ece.utexas.edu/research/quality〉, 2005.
[43] EyeTech Digital Systems, TM3 Eye Tracker, 〈http://www.eyetechds.com/〉, 2009.
[44] U. Engelke, H.-J. Zepernick, Psychophysical assessment of perceived interest in natural images: the ROI-D database, in: Proceedings of the SPIE/IEEE International Conference on Visual Communications and Image Processing, 2011.
[45] J.M. Wolfe, T.S. Horowitz, What attributes guide the deployment of visual attention and how do they do it? Nat. Rev. Neurosci. 5 (2004) 1–7.
[46] R.J. Peters, A. Iyer, L. Itti, C. Koch, Components of bottom-up gaze allocation in natural images, Vis. Res. 45 (18) (2005) 2397–2416.
[47] D. Green, J. Swets, Signal Detection Theory and Psychophysics, John Wiley, New York, 1966.
[48] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.