Evaluation of Perceptually-Based Selective Rendering Techniques using Eye-Movements Analysis

Veronica Sundstedt and Alan Chalmers
University of Bristol, Department of Computer Science, United Kingdom
Abstract

In recent years, models of the human visual system, in particular bottom-up and top-down visual attention processes, have become commonplace in the design of perceptually-based selective rendering algorithms. Although psychophysical experiments have been performed to assess the perceived quality of selectively rendered imagery, little work has focused on validating how well the predicted regions of interest (ROIs), as determined by perceptually-based metrics, correlate with the actual eye-movements of human observers. In this paper we present a novel eye tracking study that investigates how accurately ROIs predict where participants direct their gaze while watching an animation. Our experimental study investigated the validity of using saliency and task maps as ROI predictors. The study involved 64 participants in four conditions: participants performing a task, or free-viewing a scene, while being either naive or informed about the purpose of the experiment. The informed participants knew that they were going to assess rendering quality. Our overall results indicate that the task map does act as a good predictor of ROI.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Color, shading, shadowing, and texture;

Keywords: Perception, eye-movements, attention, selective rendering, saliency, task map, eye tracking
1 Introduction
High-fidelity rendering is the process of computing physically accurate images of real scenes, as perceived by a human observer. Since the introduction of high-complexity computer models and high-fidelity lighting simulations there has been an increasing demand for time-efficient rendering algorithms. One method of reducing computation, while maintaining a high perceptual result, is to adapt the rendering parameters of the image to the human visual system. This results in an image with a spatially varying degree of accuracy. If we can faithfully predict where visual attention will be directed in an image, or know that the observer will be performing a certain task, only these perceptually important regions need to be rendered with high quality. Rendering time is therefore not spent on perceptually less important regions. We refer to this approach as region-of-interest (ROI) rendering. Visual attention-based ROI rendering has previously been exploited in coding algorithms where the bit-rate was varied spatially across the image [Bradley and Stentiford 2003].

In this paper we validate, for the first time, the selective ROI rendering possibilities in high-fidelity applications. The ROI is detected manually or automatically, resulting in task and saliency maps. These ROIs are supported by two general visual attention processes. The validation determined whether the participants focused on the predicted ROIs. The participants were either performing a task or free-viewing an animation. We also explored whether a participant naive to the purpose of the experiment would direct attention to different areas than a participant informed about the experiment's purpose. The naive participants did not know they were going to be asked to make a two-alternative forced choice (2AFC) preference judgement about rendering quality on completion of the trial. If the maps act as accurate ROI predictors, the eye-movements should correlate well with the task map while performing a task and with the saliency map while free-viewing a scene. The participants' eye-movements were recorded using a Tobii x50 eye tracker [Tobii 2005] to validate how well the fixations correlated with the pre-determined ROIs within our experimental selective rendering framework [Sundstedt et al. 2005].

The rest of this paper is organised as follows. We survey the related work in the field in Section 2, then we introduce our eye tracking validation and methodology in Section 3. We present the results and statistical analysis of the eye-movements validation in Section 4. Finally, we conclude and discuss future work in Section 5.
2 Related Work

2.1 Visual Attention
The amount of information in the environment that reaches our eyes is much greater than our brain is able to process. Visual attention is a complex action, composed of conscious and unconscious processes in the brain, which is used to find and focus on relevant information quickly and efficiently [Rensink 2002]. The retina and its photoreceptors convert light into neural activity which our brain can process. In the centre of the retina is the fovea, which provides the highest spatial and chromatic resolution. Peripheral vision, outside this area, has far less visual acuity. The visual angle covered by the fovea is only about 2°. In order to obtain detailed information from different parts of the scene, our eyes are redirected so that the relevant parts fall on the fovea. Instead of scanning the scene in a raster-like fashion, our eyes saccade to relevant objects. There are two general visual attention processes, bottom-up and top-down, which determine where humans direct their visual attention [Braun 1994]. Bottom-up, or feature-based, processes are driven automatically by the visual stimulus and have been the main focus of recent attempts at computational modelling of the human visual system. Top-down processes, on the other hand, focus on the observer's goal, i.e. they are task-dependent.
Figure 1: Region-of-interest (ROI) examples from the corridor and art gallery scenes used in the validation (Frame 1): (a) high quality rendering, (b) task objects, (c) task map with foveal gradient angle, and (d) saliency map.
2.1.1 Saliency Models
Bottom-up, feature-based models extract features from an image that humans would automatically direct attention to. Koch and Ullman [1985] introduced the idea of a saliency map, a two-dimensional map encoding these salient areas. Itti et al. [1998] later developed this into a computational model, which analyses pixels within an image and uses filters for low-level attributes to identify ROIs, Figure 1 (d). The saliency map is a combination of three conspicuity maps of colour, intensity, and orientation. The conspicuity maps are computed using feature maps at varying spatial scales. The features can be seen as responses to stimuli at varying scales, and conspicuity as a summary of a specific stimulus across all scale levels combined. Saliency can then be thought of as a summary of the conspicuity of all the stimuli combined. The model also contains an inhibition-of-return process to determine which location the eyes will shift to next. The saliency model has been used in different applications, including virtual vision, scene evaluation and rendering algorithms. However, the model does not consider dynamic scenes. Subsequent work by Marmitt and Duchowski [2002] also showed, using eye tracking, that such a bottom-up visual attention model predicted fixations over a wide area of the display, whereas real observers concentrated more on the centre. The low-level approach has also been shown to be influenced by size, shape, brightness, orientation, edges and motion.
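To make the structure of such a model concrete, the following minimal sketch computes a centre-surround conspicuity map for the intensity channel only, using a Gaussian pyramid; the full model of Itti et al. [1998] additionally uses colour-opponency and orientation channels, per-map normalisation and across-scale competition. The function names and the simplified normalisation are our own illustrative choices, not part of the published model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def _resize(arr, shape):
    """Bilinear resize of a 2D array to an exact target shape."""
    out = zoom(arr, (shape[0] / arr.shape[0], shape[1] / arr.shape[1]), order=1)
    out = out[:shape[0], :shape[1]]
    pad = ((0, shape[0] - out.shape[0]), (0, shape[1] - out.shape[1]))
    return np.pad(out, pad, mode="edge")

def intensity_conspicuity(image, centre_levels=(2, 3), deltas=(3, 4)):
    """Simplified centre-surround conspicuity for the intensity channel.

    image: RGB array with values in [0, 1]. Centre and surround levels of a
    dyadic Gaussian pyramid are differenced, and the absolute differences are
    summed into a single conspicuity map, normalised to [0, 1].
    """
    intensity = image.mean(axis=2)

    # Dyadic Gaussian pyramid: each level is blurred, then subsampled by 2.
    pyramid = [intensity]
    for _ in range(8):
        pyramid.append(gaussian_filter(pyramid[-1], sigma=1.0)[::2, ::2])

    out_shape = pyramid[centre_levels[0]].shape
    conspicuity = np.zeros(out_shape)
    for c in centre_levels:
        for d in deltas:
            centre = pyramid[c]
            surround = _resize(pyramid[c + d], centre.shape)
            feature = np.abs(centre - surround)          # centre-surround difference
            conspicuity += _resize(feature, out_shape)   # accumulate at one scale

    return (conspicuity - conspicuity.min()) / (np.ptp(conspicuity) + 1e-8)
```

In the full model the colour and orientation conspicuity maps would be computed analogously and the three maps combined, after normalisation, into the final saliency map.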
2.1.2 Task Models
The top-down approach is supported by the work of the visual psychologist Yarbus [1967], who showed that a strong correlation exists between an observer's eye-movements and the task currently being performed. Land et al. [1999] monitored viewers' eye-movements with a portable eye tracker while they performed the ordinary task of making a cup of tea in an unfamiliar kitchen. Their experiments showed that the eyes monitored almost every step that was necessary for completing the task. Over 95% of the fixations were directed towards objects related to the task being performed. Models which include task-dependency are proposed in, for example, [Laar et al. 1997; Navalpakkam and Itti 2002]. The saliency model was extended by Navalpakkam and Itti [2002] to include scene semantics, for example "people" eat "food", to create goal-oriented attention guidance. In their model the user specified tasks or task-relevant objects. Canosa [2003] showed that low-level maps do not always correlate well with the fixation locations of viewers, and suggested a map of perceptual conspicuity which incorporates high-level information, such as figure/background segmentation, potential object detection, and task-specific location bias. Her experimental validation also showed that, for natural scenes, the extended model correlated better with the fixations of human observers. The idea that low-level information can be important for determining which location will be fixated upon next is supported, although it is less important in task-based applications. More recent work that has also focused on modelling the influence of task on attention is described in [Navalpakkam and Itti 2005].
2.2 Attention-Based Selective Rendering
The main goal of perception-based rendering algorithms is to save computation while obtaining an image that is perceptually indistinguishable from a fully converged solution. Recently, models of the human visual system, in particular those based on visual attention processes, have been used in perceptually assisted renderers to reach this goal. An extensive overview of perceptually adaptive graphics techniques can be found in [O'Sullivan et al. 2004]. Our review of related work focuses on rendering algorithms which have taken these visual attention models into account. Yee et al. [2001] proposed a computational model of attention, for dynamic environments, to improve the efficiency of indirect lighting computations. A saliency model, termed the Aleph Map, was exploited to adjust the search radius accuracy for the interpolation of irradiance cache values. The computational model included a motion predictor which analysed images and tracked objects in order to calculate their motion. Cater et al. [2002] showed, using psychophysical experiments, that conspicuous objects in a scene that would normally attract the viewer's attention were ignored if they were not relevant to the task at hand. This failure of humans to see unattended items in a scene, even if the objects are within the foveal region, is known as inattentional blindness [Mack and Rock 1998; Simons and Chabris 1999]. Cater et al. [2002] also confirmed, using eye tracking, that the effect was indeed caused by inattentional blindness and not peripheral vision. The concept of task maps, two-dimensional maps highlighting the task at hand, was introduced to exploit the task-dependent approach [Cater et al. 2003]. Saliency maps and the notion of task objects were also used in a real-time renderer to identify the most salient objects for which to render the glossy and specular components [Haber et al. 2001]. In [Cater et al. 2003; Sundstedt et al. 2004] both task maps and saliency maps were used to vary the number of rays traced per pixel in a global illumination environment. Debattista et al. [2005] exploited attention models in combination with a selective component-based rendering framework. Finally, in [Hill et al. 2002; Peters and Sullivan 2003] both task maps and low-level visual attention were used to guide virtual humans in a complex environment, by making some perceptual objects more salient than others and letting the virtual agents pay attention to their surroundings. Although visual attention models have been included in various rendering algorithms, there is a need for a validation of the accuracy of these models based on eye-movements from human observers.
2.3 Region-of-Interest (ROI) Rendering
We have developed a region-of-interest (ROI) rendering framework that exploits visual attention processes, allowing us to generate task and saliency maps. These maps can then be validated using eye tracking. For the convenience of the reader, we briefly review the technical background from [Sundstedt et al. 2005] that underpins this work. Our framework is composed of two major processes. ROI guidance uses a combination of saliency and a measure of task relevance to direct the rendering computation. Selective rendering corresponds to the traditional rendering computation [Ward 1994]; however, computational resources are focused on parts of the image which are deemed more important by the ROI guidance. The process begins with a rapid image estimate (in the order of milliseconds) of the scene using a quick rasterisation pass in hardware [Longhurst et al. 2005]. This estimate can be used in two ways: firstly for building the task map by identifying user-selected task objects, and secondly as input to a saliency generator. The map generation stage is composed of two processes. In the creation of the task map, the program reads in the geometry information and a list of predefined task objects. It then produces a map with task objects in white and the other geometry in black, Figure 1 (b). A foveal angle gradient is then applied around task-related objects in the task map. When the foveal gradient is added in our validation we use an angle of 2°, and then blend from higher to lower quality up to an angle of 4°, Figure 1 (c). In the creation of the saliency map, the image estimate serves to locate areas where an observer will be most likely to look. The estimate of the scene contains only direct lighting, but a simple shadow and reflection calculation is also included. The saliency estimation is carried out using the existing method proposed in [Itti et al. 1998] and is computed in 2-3 seconds per frame. A hardware implementation can generate a saliency map in the order of tens of milliseconds [Longhurst et al. 2006].

Figure 2: Tobii x50 eye tracking system [Tobii 2005].
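As an illustration of the task map construction described above, the sketch below assumes the task objects have already been rasterised into a binary mask and applies the foveal gradient with a distance transform, blending linearly from full importance at 2° to zero at 4°. The function name and the pixels_per_degree parameter are illustrative assumptions; the framework's actual implementation may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_task_map(task_object_mask, pixels_per_degree,
                   inner_deg=2.0, outer_deg=4.0):
    """Sketch of a task map with a foveal angle gradient.

    task_object_mask: boolean array, True where a task object was rasterised
    (white in Figure 1 (b)). Returns a map in [0, 255]: 255 inside the task
    objects and within inner_deg of visual angle, falling linearly to 0 at
    outer_deg, as in Figure 1 (c).
    """
    # Distance (in pixels) from each pixel to the nearest task-object pixel.
    dist_px = distance_transform_edt(~task_object_mask)
    dist_deg = dist_px / pixels_per_degree

    # 1.0 up to inner_deg, linear falloff to 0.0 at outer_deg and beyond.
    falloff = np.clip((outer_deg - dist_deg) / (outer_deg - inner_deg), 0.0, 1.0)
    return (falloff * 255.0).astype(np.float32)
```

The pixels_per_degree value depends on the viewing geometry; a conversion for the setup used in this study is sketched in Section 3.4.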
3 Eye Tracking Validation
Sundstedt et al. [2005] examined experimental conditions under which attention-based selective rendering techniques produce a response indistinguishable from a fully rendered solution. Using psychophysical 2AFC preference experiments, it was shown that participants performing a fire safety task could not distinguish a selectively rendered animation, based on a task map, from a high quality solution, Figure 1 (a). One of their hypotheses was that viewers would be able to see the quality differences while free-viewing the two animations. The statistical results indicated that this was not the case. Since fire safety items are designed to be salient in real life, the participants might have been focusing on these objects, which were rendered at high quality. The saliency map does show that these task objects were the most salient objects within this scene, Figure 1 (d). Although these experiments focused on what the participants perceived, they did not give information about what the participants directed their attention towards.

In this paper we focus on validating the accuracy of the first part of our framework, the ROI guidance. It is important to validate the task and saliency map approach. Since these experiments rely on the assumption that the ROI guidance captures both top-down and bottom-up processes of the human visual system, there is a need for an eye tracking validation [Duchowski 2003]. The main goal of this validation was to investigate how well the participants' eye-movements correlated with the pre-determined ROIs, while performing a task or free-viewing the scene. This is a novel approach which extends previous subjective experiments described in the field of selective rendering validation. By performing eye tracking analysis while a participant is taking part in an experiment, we can gather not only information regarding perceived differences, but also where participants actually directed their attention while making subjective judgements. If the ROI predictors are accurate, the eye-movements should correlate well with the task map while performing a task and with the saliency map while free-viewing a scene. The participants' eye-movements were recorded using a Tobii x50 eye tracker, Figure 2.
3.1 Stimuli
The first stimulus used in the eye tracking validation was the corridor scene used by Sundstedt et al. [2005]. This scene contains highly salient objects such as fire extinguishers, emergency exit signs, action-in-case-of-fire instruction posters, and fire alarm buttons, Figure 1 (a) top image. The task objects were located mostly in the centre of the scene. The scene also contains other salient objects, such as doors, tables and paintings, but with less salient features than the fire safety items. The second stimulus was the art gallery scene. This scene contains highly salient objects such as colourful paintings, plants, benches and garbage bins, Figure 1 (a) bottom image. This scene also contains non-salient architectural buildings on display. These task objects have the same material properties as the art gallery walls, thus blending in with the background. The task objects were located mostly in the centre of the scene, whereas the salient objects were located more in the periphery. Two animations were used for each scene to avoid familiarity effects that might influence the scan path of the observers. Each animation contained the same number of task-related objects.
3.2 Subjects
The eye tracking validation experiment had 64 participants in total (51 men and 13 women; age range: 20-37). Subjects had a variety of experience with computer graphics, and all self-reported normal or corrected-to-normal vision. There were four groups in total, with 16 participants in each group. Two groups were performing a task, while the other two were free-viewing the scenes. To study the effect of prior knowledge on attention focus, each participant performed two experiments consecutively. Half the participants performed an experiment with the corridor scene first, while the other half performed an experiment with the art gallery scene first. Upon completion of each experiment the participants were asked which of the two consecutively displayed animations they thought contained the better rendering quality. Before each participant took part in their first experiment, they were naive, or uninformed, as to the 2AFC preference [Sundstedt et al. 2005]. In the second experiment they were aware of the quality preference judgement between the animations, and were thus informed as to the purpose of the experiment.
Figure 3: Experimental setup: X = 60 cm, D = 11.4 cm, and Θ = 59°. Modified from [Tobii 2005].

3.3 Equipment
The Tobii x50 eye tracking system consists of hardware (Tobii x50) and software which controls the hardware and is used to generate the eye-movements data (the Tobii Eye Tracker Server, TET Server) [Tobii 2005]. The eye tracker was connected to a dual Intel Pentium 3.2 GHz PC with 2 GB RAM running Windows XP. The eye tracker uses near infra-red light-emitting diodes (NIR-LEDs) and a high-resolution camera with a large field of view. These are used to generate and collect reflection patterns from the corneas of both of the participant's eyes and to capture images of the user which are needed for tracking the eyes. The software consists of image processing algorithms that extract important features, such as the eyes and the reflection patterns generated by the NIR-LEDs. Using this information the system calculates the three-dimensional position of the eyes in space to determine the gaze point on the monitor at a certain point in time. The eye tracker is a stand-alone unit which does not introduce any restraints on the test subjects. The field of view of the high-resolution camera is 20 × 15 × 20 cm (width × height × depth) at 60 cm from the screen. This allows the participants some head movement; head movement was not measured in the experiments. The system samples the position of the participant's eyes at a constant rate of 50 Hz (every 20 ms). The eye tracker has an average accuracy of about 0.5-0.7°. This degree of accuracy corresponds to an average error of about 0.5-0.7 cm between the measured and actual gaze point at a distance of 60 cm between the user and the monitor. The system also resumes tracking within 100 ms if a participant blinks or moves completely out of range of the camera. Tobii's ClearView software was used to create the studies with the experiment stimuli, to calibrate each participant's eyes, and to export the output numerical data. This data was then analysed in our validation, Section 4.
3.4 Setup
The Tobii x50 system requires some configuration before the start of the experiment. The distance from the participant to the eye tracker should be around 60 cm. The eye tracker was positioned straight in front of the stimuli at a certain angle below the user, Figure 3. The width and height of the screen display area were given as input to the system together with D and Θ. The participants were seated on a chair in front of a 17" LCD monitor. The effects of ambient light were minimised and the lighting conditions were kept constant throughout the experiment. All animations were displayed in the centre of the screen at 600 × 600 resolution with a black background. A calibration was carried out for each participant, prior to each experimental trial, to ensure the collected data would be reliable. To calibrate the eye tracker, the participants fixated on a shrinking dot that appeared at different positions on the screen. In total a 3 by 3 grid of points was shown to each participant. Each calibration took 30 seconds on average. If ClearView indicated a bad calibration point, this point was fixated upon again until the system reported sufficient calibration quality. The system was unable to calibrate one participant wearing glasses; in this case contact lenses were used instead.
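As a rough worked example of this geometry (our own approximation, assuming a physical screen width of about 33.7 cm for a 17" 5:4 panel), the visual angles used for the foveal gradient can be converted to on-screen pixels as follows:

```python
import math

def degrees_to_pixels(angle_deg, viewing_distance_cm=60.0,
                      screen_width_cm=33.7, horizontal_resolution=1280):
    """Approximate on-screen size in pixels of a given visual angle.

    Uses s = 2 * d * tan(angle / 2) for the physical extent at the screen,
    then converts centimetres to pixels using the horizontal pixel density.
    """
    extent_cm = 2.0 * viewing_distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    pixels_per_cm = horizontal_resolution / screen_width_cm
    return extent_cm * pixels_per_cm

print(round(degrees_to_pixels(2.0)))  # the 2 degree fovea spans roughly 80 pixels
print(round(degrees_to_pixels(4.0)))  # the 4 degree blend region, roughly 159 pixels
```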
3.5 Procedure
Following the calibration of each participant, half the subjects read a sheet of instructions on the procedure of the particular task they were to perform. The participants in the other half were simply shown the animations without any prior instructions. Each participant was shown two animations in each trial, one of animation A and one of animation B. One of the animations was always high quality while the other one was selectively rendered using the settings described in [Sundstedt et al. 2005]. Each animation was viewed only once. The participants performing a task in the corridor scene were asked to take the role of a fire security officer whose task was to count the total number of fire safety items. For the art gallery scene the participants performing a task were asked to count the total number of architectural buildings on display in the gallery. Each instruction sheet contained an example of what kind of task items the scene could contain. A pre-study was run to determine that the observers would have enough time to perform the task. Counterbalancing was used for the order in which the subjects saw their two animations to avoid any bias.

Table 1: Experimental results for the corridor scene: Percentage of mean average fixation points (± standard error) in the task map correlation.

                          TASK NAIVE       TASK INF.        FREE NAIVE       FREE INF.
Task objects (255)        11.79% ± 1.31    12.03% ± 1.12    8.87% ± 1.21     10.76% ± 1.09
Fovea, 2° - TO (255)      32.08% ± 1.48    30.12% ± 1.51    23.63% ± 2.39    21.12% ± 1.59
Fovea, 4° (< 255, > 0)    33.33% ± 1.75    33.84% ± 1.89    35.28% ± 1.56    28.64% ± 1.43
Outside (0)               20.70% ± 1.50    21.67% ± 1.49    29.31% ± 3.52    36.62% ± 2.45
Screen (-255)             2.09% ± 0.39     2.33% ± 0.43     2.91% ± 0.72     2.86% ± 0.62

Table 2: Experimental results for the art gallery scene: Percentage of mean average fixation points (± standard error) in the task map correlation.

                          TASK NAIVE       TASK INF.        FREE NAIVE       FREE INF.
Task objects (255)        9.16% ± 1.05     5.52% ± 0.73     5.02% ± 0.73     4.15% ± 0.56
Fovea, 2° - TO (255)      23.56% ± 2.24    16.29% ± 1.66    14.79% ± 1.23    10.56% ± 0.99
Fovea, 4° (< 255, > 0)    46.97% ± 2.58    49.08% ± 2.34    40.59% ± 1.84    35.02% ± 1.92
Outside (0)               18.08% ± 2.03    27.43% ± 2.97    37.03% ± 2.55    47.90% ± 2.50
Screen (-255)             2.24% ± 0.72     1.68% ± 0.34     2.56% ± 0.42     2.37% ± 0.45
4 Eye-Movements Analysis

The numerical text data retrieved from ClearView consisted of a time stamp in milliseconds, a frame number, gaze point (x,y), cam (x,y), validity, distance, and pupil information for each eye. ClearView contains many tools for analysing the gaze points of the participants. One of these tools is a plot of which visual fixations fall into a specified area-of-interest within the experiment stimuli. ClearView only provided this feature for still images, so a program was written to match the participants' eye-movements data to the task and saliency maps throughout the animations. The numerical text data were then used to estimate how many average fixation points lay within the predefined ROIs, and specifically within the task areas. The analysis was conducted while the observers performed a task or free-viewed a scene. The eye-movements data used for the analysis were the obtained time stamp for each sample and its corresponding left and right eye fixation values. For each frame an average fixation point was calculated from the left and right eye position values, assuming a constant frame rate of 25 fps for playing the animations.
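The following sketch illustrates the kind of post-processing described above: it reads a tab-separated export, averages the left- and right-eye gaze points of each valid sample, and groups the samples into animation frames assuming 25 fps playback. The column names and the validity convention are illustrative assumptions, not the actual ClearView export format.

```python
import csv
from collections import defaultdict

FPS = 25.0  # animations were assumed to play back at a constant 25 fps

def average_fixations_per_frame(export_path):
    """Map eye-tracker samples to per-frame average fixation points.

    Returns {frame_number: (x, y)} where (x, y) is the mean of all valid
    averaged left/right gaze points whose timestamps fall in that frame.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    with open(export_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Hypothetical column names; adapt to the actual export header.
            if int(row["ValidityLeft"]) != 0 or int(row["ValidityRight"]) != 0:
                continue  # skip samples where either eye was not tracked
            x = (float(row["GazePointXLeft"]) + float(row["GazePointXRight"])) / 2.0
            y = (float(row["GazePointYLeft"]) + float(row["GazePointYRight"])) / 2.0
            frame = int(float(row["Timestamp"]) / 1000.0 * FPS)
            acc = sums[frame]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    return {frame: (sx / n, sy / n) for frame, (sx, sy, n) in sums.items()}
```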
4.1 Task Map Correlation

The average gaze point was used to look up a relevance value from each frame's corresponding task map. The relevance value was the brightness value of the pixel location in the task map; brighter areas in the map corresponded to higher importance or more salient areas. A value of 255 indicates that the average fixation point lies within the task objects or the foveal area of 2°, whereas lower values correspond to the decreasing values in the foveal region of 2-4°. Values of 0 lie outside the angle of 4°. The average fixation point was set to -255 if the eye movement lay outside the 600 × 600 pixel animation or outside the 1280 × 1024 screen area. Figure 4 shows, clustered by experiment condition, the percentage of mean average fixation points within and outside the task map, with error bars representing standard error.
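A minimal sketch of this per-frame lookup (function and constant names are ours; the animation window size and screen resolution follow Section 3.4):

```python
import numpy as np

ANIMATION_SIZE = (600, 600)    # animation shown centred on the screen
SCREEN_SIZE = (1280, 1024)     # full screen resolution

def relevance_value(task_map, gaze_x, gaze_y):
    """Return the task-map value under an average fixation point.

    task_map is the 600x600 map for the current frame, with 255 inside the
    task objects / 2 degree fovea, decreasing values out to 4 degrees, and 0
    beyond. Gaze coordinates are assumed to be screen pixels with the origin
    at the top-left corner; -255 marks samples falling outside the animation
    or the screen.
    """
    if not (0 <= gaze_x < SCREEN_SIZE[0] and 0 <= gaze_y < SCREEN_SIZE[1]):
        return -255
    # Offset of the centred animation window within the screen.
    off_x = (SCREEN_SIZE[0] - ANIMATION_SIZE[0]) // 2
    off_y = (SCREEN_SIZE[1] - ANIMATION_SIZE[1]) // 2
    u, v = int(gaze_x) - off_x, int(gaze_y) - off_y
    if not (0 <= u < ANIMATION_SIZE[0] and 0 <= v < ANIMATION_SIZE[1]):
        return -255
    return int(task_map[v, u])
```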
Figure 4: Eye tracking results: Percentage of mean average fixation points within each task map category (independent of rendering quality and animation order), for the corridor scene and the art gallery scene. Error bars represent standard error.

4.2 Results

From Table 1 we get a better indication of which areas the observers directed their attention towards while performing the experiment with the corridor scene. Around 76% of all average fixations lie within the foveal angle of 4° for both naive and informed subjects performing a task. Of these, around 43% lie within an angle of 2°. The number within the 4° foveal angle decreased for naive and informed subjects free-viewing the scene, to 68% and 61% respectively. Of these, around 32% lie within the angle of 2°. It appears that the naive and informed participants were focusing on a similar amount of task-related areas when they were performing a task. Only around 21% of the average fixation points for participants performing a task were within the animations and
in areas not covered by the gradient. These numbers increased for naive and informed subjects free-viewing the scene, to 29% and 37% respectively. In all conditions within the corridor scene no more than 3% of the average fixation points were outside the animation or the screen area.

Table 2 shows the results for the art gallery scene. Around 80% of all average fixations for this scene lie within the 4° angle for naive participants performing a task. Of these, 33% lie within an angle of 2°. For informed participants these values decreased to 71% and 22% respectively. The number within the 4° foveal angle decreased for naive and informed subjects free-viewing the scene, to 60% and 50% respectively. Of these, around 20% and 15% lie within the angle of 2°. Only around 18-27% of the average fixation points for naive and informed participants performing a task were within the animations and in areas not covered by the gradient. These numbers increased for naive and informed subjects free-viewing the scene, to 37% and 48% respectively. In all conditions within the art gallery scene no more than 3% of the average fixation points were outside the animation or the screen area.

Figure 5: Eye tracking results: Percentage of mean average fixation points within and outside the selectively rendered areas (independent of animation and rendering quality), (left) corridor scene, (right) art gallery scene. Error bars represent standard deviation.
4.3 Statistical Analysis
The results were analysed statistically to determine any significance. The statistical analysis was repeated for both the corridor scene, with salient task objects, and the art gallery scene, with non-salient task objects. The experiment conditions were either naive or informed participants performing a task, or naive or informed participants free-viewing the scene (four in total). In the validation, the average fixation points within selectively rendered areas as a function of experiment condition were investigated, Figure 5. The graphs show the mean average fixation points within the selectively rendered areas (task map value 1-255) and outside them (task map value 0 or -255) in the task map correlation. The means between four pairs of groups were also tested for significance, to investigate whether participants' eye-movements were altered with a change in viewing instructions. The four pairs that were tested were task naive-task informed, free naive-free informed, task naive-free naive, and task informed-free informed, Table 4. We decided to compare the difference between task naive and free naive to validate whether the free-viewing condition would produce a similar result as performing a task, due to the non-salient and salient nature of the task objects. An ANOVA test was first used to identify whether the four different conditions, for each scene, resulted in significantly different scores. ANOVA handles data involving more than two conditions, but it cannot tell us exactly which pairs of conditions are significantly different. An independent t-test was therefore used to compare whether the means of two groups were statistically different from each other [Brace et al. 2003]. In order for the difference in means between two groups to be significant, the calculated p-value has to be lower than 0.05 (at a 95% confidence interval).
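For reference, the same style of analysis can be reproduced with standard statistical routines; the sketch below uses randomly generated placeholder values in place of the study's measured percentages, purely to show the shape of the tests.

```python
import numpy as np
from scipy import stats

# Placeholder data standing in for the per-participant percentages of average
# fixation points inside the selectively rendered areas (16 per condition).
rng = np.random.default_rng(0)
task_naive, task_informed, free_naive, free_informed = (
    rng.normal(loc=m, scale=5.0, size=16) for m in (75, 74, 66, 60))

# One-way ANOVA over the four conditions: is any condition different?
f_value, p_anova = stats.f_oneway(task_naive, task_informed,
                                  free_naive, free_informed)
print(f"ANOVA: F = {f_value:.1f}, p = {p_anova:.3f}")

# Independent t-test between two groups of different participants.
t, p = stats.ttest_ind(task_naive, free_naive)
print(f"task naive vs. free naive: t = {t:.1f}, p = {p:.2f}")

# Paired t-test within one condition, e.g. the high quality animation versus
# the selectively rendered one seen by the same participants.
high_quality = rng.normal(loc=75, scale=5.0, size=16)
selective = rng.normal(loc=74, scale=5.0, size=16)
t, p = stats.ttest_rel(high_quality, selective)
print(f"paired (rendering quality): t = {t:.1f}, p = {p:.2f}")
```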
4.3.1 Rendering Quality and Animation Ordering
Firstly, the effect of rendering quality (high or selective) and animation order (animation 1 or animation 2) on the average fixation points within the selectively rendered areas was evaluated. This was evaluated for each condition within each scene, Table 3. For this analysis the appropriate test was a paired two-sample t-test for means, since each participant within each condition was shown two animations, one high quality and one selectively rendered, using counterbalancing. The statistical outcome of this test can be seen in Table 3. The results show that there was no statistical difference between the quality groups within each condition for either of the scenes (p > 0.05). This indicates that rendering quality did not affect the number of average fixation points inside the foveal region. The results also show that there was no statistical difference between the animation order groups within each condition for the corridor scene (p > 0.05). This indicates that, at least for the corridor scene, animation order did not affect the number of average fixation points inside the selectively rendered areas. Interestingly, the results of the ordering analysis were different for the art gallery scene. For this scene a significant difference was found for the informed participants performing a task (t(15) = 2.5, p = 0.02). This indicates that the attention focus was altered for informed participants performing a task between the two animations.

Table 3: Output for the paired t-test analysis (df=15, 0.05 level of significance). Values in bold indicate a significant difference.

                              Corridor Scene        Art Gallery Scene
                              t-value   p-value     t-value   p-value
Task Naive - Quality          0.8       0.41        1.7       0.11
Task Informed - Quality       0.5       0.60        0.5       0.65
Free Naive - Quality          0.2       0.87        0.9       0.36
Free Informed - Quality       0.1       0.90        1.2       0.25
Task Naive - Order            0.7       0.51        1.0       0.32
Task Informed - Order         0.1       0.95        2.5       0.02
Free Naive - Order            0.1       0.95        0.1       0.95
Free Informed - Order         0.2       0.81        1.6       0.13

4.3.2 The Corridor Scene
The ANOVA result for the corridor scene showed a significant difference over all conditions (F(3,124) = 10.1, p < 0.05). Due to this significant difference between conditions, the independent t-test was used to identify where the difference occurred, Table 4. Firstly we compared the naive and informed participants performing a task. The result showed that there was no significant difference between the two groups (t(62) = 0.6, p = 0.58). This was expected, since both groups were focused on the task of counting the fire safety items. The difference between the naive and informed participants free-viewing the corridor scene was also not significant (t(62) = 1.7, p = 0.10). Interestingly, there was a statistical difference between naive participants performing a task and naive participants free-viewing the scene (t(62) = 2.4, p = 0.02). This indicates that free-viewing participants who were naive to the experiment did not direct attention to the selectively rendered areas to the same extent as the participants performing a task. One explanation of this result could be that there were other salient objects in the corridor scene. The result indicates that a task map is distinct from a saliency map. Finally, we compared the informed participants performing a task with the informed participants free-viewing the scene. The result showed a highly significant difference between the means (t(62) = 5.4, p = 0.00).

4.3.3 The Art Gallery Scene
The ANOVA result for the art gallery scene showed a significant difference over all conditions (F(3,28) = 15.5, p < 0.05). The independent t-test was again used to identify where the differences occurred, Table 4. Firstly we compared the naive and informed participants performing a task. The result showed that there was a significant difference between the means (t(62) = 2.4, p = 0.02). Figure 5 (art gallery scene) indicates that the average fixation points within the selectively rendered areas decreased for informed participants performing a task. One explanation could be that attention was drawn to more salient features of the scene in this condition after having seen the non-salient task objects once and identified them. Table 2 indicates that this was what occurred. The statistical analysis within the task informed condition also indicated that animation order had an effect on the location of the average fixation points, Table 3. The difference between the naive and informed participants free-viewing the art gallery scene was also highly significant (t(62) = 2.9, p = 0.00). Our results seem to agree with the results of [Marmitt and Duchowski 2002]: the naive participants in the free-viewing condition concentrated more on the centre, whereas the saliency model of [Itti et al. 1998] predicts fixations over a wider area. It appears that naive participants free-viewing the animations focus on objects of interest, especially those located near the centre of the screen. Thus saliency maps should be augmented to take this into account. When participants were assessing rendering quality while watching the animations, they did focus more on highly salient areas towards the periphery. We also compared the difference between naive participants performing a task and naive participants free-viewing the scene, to validate whether free-viewing would differ from performing a task. The statistical analysis showed a highly significant difference between these two conditions (t(62) = 5.9, p = 0.00). This indicates that free-viewing participants who were naive did not direct attention to the selectively rendered areas to the same extent as the participants performing a task. Finally, we compared the informed participants performing a task with the informed participants free-viewing the scene. The result again showed a highly significant difference (t(62) = 5.3, p = 0.00). This correlates well with the fact that participants performing a task were attending to the task areas, whereas free-viewing participants focused more on the salient areas in the periphery.

Table 4: Output for the independent t-test analysis (df=62, 0.05 level of significance). The results are order and quality independent. Values in bold indicate a significant difference.

                                     Corridor Scene        Art Gallery Scene
                                     t-value   p-value     t-value   p-value
Task Naive - Task Informed           0.6       0.58        2.4       0.02
Free Naive - Free Informed           1.7       0.10        2.9       0.00
Task Naive - Free Naive              2.4       0.02        5.9       0.00
Task Informed - Free Informed        5.4       0.00        5.3       0.00
5 Conclusions and Future Work
We have presented an original validation of perceptually-based selective rendering algorithms using eye tracking. In order to complete a visual task in an environment, a user's eyes focus on specific parts of the animation at the expense of other details in the scene. The validation indicates that when participants are performing a task while watching an animation, the task map correlates well with their eye-movements. The analysis of the eye-movements data indicates that naive participants performing a task have around 80% of their average fixation points within the foveal region of 4°. This holds for both non-salient and highly salient task objects. By exploiting these results it is possible to render only the perceptually important regions with high quality rendering settings, resulting in a substantially reduced computational cost for high-fidelity rendering. Our results also indicate that the selective rendering quality settings chosen for this study did not significantly affect the location of the average fixation points. In the task informed condition for the art gallery scene a significant animation ordering effect was found. Future work will analyse where participants performing a task focused outside the selectively rendered areas. This analysis will allow the accuracy of the task maps to be improved. As the geometry information is known, it would be possible to cluster the average fixation points per object. We argue that an object-based approach augmented with low-level saliency would make interesting future research in perceptually-based selective rendering algorithms. Our results also agree with previous findings that naive participants free-viewing a scene direct their attention more towards the central areas than a saliency map predicts. Thus the saliency map approach needs to be augmented to take this into account.
6 Acknowledgements
We would like to thank Patrick Ledda for use of the original corridor model. We would also like to thank Tim Dixon, Kurt Debattista, and Amoss for useful discussions. Finally, we would like to thank everyone that participated in the experiments. This work was supported by the Rendering on Demand (RoD) project within the 3C Research programme of convergent technology research for digital media processing and communications. For more information please visit www.3cresearch.co.uk.
References

Brace, N., Kemp, R., and Snelgar, R. 2003. SPSS for Psychologists. Palgrave Macmillan Ltd.

Bradley, A., and Stentiford, F. 2003. Visual attention for region of interest coding in JPEG2000. In Journal of Vision Communications and Image Representation, vol. 14, 232–250.

Braun, J. 1994. Visual search among items of different salience: Removal of visual attention mimics a lesion in extrastriate area V4. In Journal of Neuroscience, 554–567.

Canosa, R. 2003. Seeing, Sensing, and Selection: Modeling Visual Perception in Complex Environments. PhD thesis, Rochester Institute of Technology, Rochester College of Science.

Cater, K., Chalmers, A., and Ledda, P. 2002. Selective quality rendering by exploiting human inattentional blindness: looking but not seeing. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 17–24.

Cater, K., Chalmers, A., and Ward, G. 2003. Detail to attention: Exploiting visual tasks for selective rendering. In Proceedings of the Eurographics Symposium on Rendering, 270–280.

Debattista, K., Sundstedt, V., Santos, P., and Chalmers, A. 2005. Selective component-based rendering. In GRAPHITE 2005, 3rd International Conference on Computer Graphics and Interactive Techniques, ACM.

Duchowski, A. 2003. Eye Tracking Methodology: Theory and Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Haber, J., Myszkowski, K., Yamauchi, H., and Seidel, H.-P. 2001. Perceptually guided corrective splatting. In Computer Graphics Forum, vol. 20, 142–152.

Hill, R., Kim, Y., and Gratch, J. 2002. Anticipating where to look: predicting the movements of mobile agents in complex terrain. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems, ACM Press, 821–827.

Itti, L., Koch, C., and Niebur, E. 1998. A model of saliency-based visual attention for rapid scene analysis. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, 1254–1259.

Koch, C., and Ullman, S. 1985. Shifts in selective visual attention: towards the underlying neural circuitry. In Human Neurobiology, vol. 4, 219–227.

Laar, V. D., Heskes, and Gielen. 1997. Task-dependent learning of attention. In Neural Networks, vol. 10, 6, 981–992.

Land, M., Mennie, N., and Rusted, J. 1999. The roles of vision and eye movements in the control of activities of daily living. In Perception, vol. 28, 1311–1328.

Longhurst, P., Debattista, K., and Chalmers, A. 2005. Snapshot: A rapid technique for selective global illumination rendering. In The 13th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision.

Longhurst, P., Debattista, K., and Chalmers, A. 2006. A GPU based saliency map for high-fidelity selective rendering. In AFRIGRAPH 2006, ACM.

Mack, A., and Rock, I. 1998. Inattentional Blindness. MIT Press.

Marmitt, G., and Duchowski, A. 2002. Modeling visual attention in VR: Measuring the accuracy of predicted scanpaths. In Eurographics 2002, Short Presentations, 217–226.

Navalpakkam, V., and Itti, L. 2002. A goal oriented attention guidance model. In BMCV '02: Proceedings of the Second International Workshop on Biologically Motivated Computer Vision, Springer-Verlag, London, UK, 453–461.

Navalpakkam, V., and Itti, L. 2005. Modeling the influence of task on attention. Vision Research 45, 2, 205–231.

O'Sullivan, C., Howlett, S., Morvan, Y., McDonnell, R., and O'Conor, K. 2004. Perceptually adaptive graphics. In Eurographics 2004, STAR, 141–164.

Peters, C., and Sullivan, C. O. 2003. Bottom-up visual attention for virtual human animation. In Computer Animation for Social Agents.

Rensink, R. 2002. Visual attention. In Encyclopedia of Cognitive Science, London: Nature Publishing Group.

Simons, D., and Chabris, C. 1999. Gorillas in our midst: Sustained inattentional blindness for dynamic events. In Perception, vol. 28, 1059–1074.

Sundstedt, V., Chalmers, A., Cater, K., and Debattista, K. 2004. Top-down visual attention for efficient rendering of task related scenes. In Vision, Modeling and Visualization.

Sundstedt, V., Debattista, K., Longhurst, P., Chalmers, A., and Troscianko, T. 2005. Visual attention for efficient high-fidelity graphics. In Spring Conference on Computer Graphics, 162–168.

Tobii. 2005. Tobii Technology User Manual. http://www.tobii.com.

Ward, G. 1994. The RADIANCE lighting simulation and rendering system. In SIGGRAPH, vol. 40, 459–472.

Yarbus, A. 1967. Eye movements during perception of complex objects. In Eye Movements and Vision, 171–196.

Yee, H., Pattanaik, S., and Greenberg, D. 2001. Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments. Master's thesis.