12th Annual AeroSense, Proceedings of SPIE, Vol. 3375, Orlando, FL, 1998

Modeling cognitive effects on visual search for targets in cluttered backgrounds

Magnús Snorrasonᵃ, Harald Rudaᵇ, and James Hoffmanᶜ

ᵃ,ᵇ Charles River Analytics, One Alewife Center, Cambridge, MA 02140
ᶜ Department of Psychology, University of Delaware, Newark, DE 19716

ᵃ Email: [email protected]; Telephone: (617) 491-3474 x524; http://cns-web.bu.edu/~snorraso
ᵇ Email: [email protected]
ᶜ Email: [email protected]

ABSTRACT

To understand how a human operator performs visual search in complex scenes, it is necessary to take into account top-down cognitive biases in addition to bottom-up visual saliency effects. We constructed a model to elucidate the relationship between saliency and cognitive effects in the domain of visual search for distant targets in photo-realistic images of cluttered scenes. In this domain, detecting targets is difficult and requires high visual acuity. Sufficient acuity is only available near the fixation point, i.e. in the fovea. Hence, the choice of fixation points is the most important determinant of whether targets get detected. We developed a model that predicts the 2-D distribution of fixation probabilities directly from an image. Fixation probabilities were computed as a function of local contrast (saliency effect) and proximity to the horizon (cognitive effect: distant targets are more likely to be found close to the horizon). For validation, the model’s predictions were compared to ensemble statistics of subjects’ actual fixation locations, collected with an eye-tracker. The model’s predictions correlated well with the observed data. Disabling the horizon-proximity functionality of the model significantly degraded prediction accuracy, demonstrating that cognitive effects must be accounted for when modeling visual search.

Keywords: Visual search, fixation, fovea, cognitive, context, saliency, contrast, eye tracking, clutter

1. INTRODUCTION

The goal of this project was to construct a model that elucidates the relationship between bottom-up (stimulus-driven) and top-down (cognitive) effects on visual search for distant, hard-to-detect targets, with the ultimate aim of predicting the average search performance of a trained operator on a given scene.

The task of visual search—locating objects of a known category based on their visual appearance in unconstrained surroundings—is among the many functions of the human visual system that are not well understood. Given the multitude of stimulus variables that affect visual search, modeling this task is not possible without further constraints. Past approaches have generally been constrained by fitting psychophysical data from very simple scenes: a uniform background with an array of many similar distractor items and one or a few target items (see [1] for a review). Unfortunately, such models do not necessarily scale with increased scene complexity and are not applicable to photo-realistic imagery. Models specifically designed to handle photo-realistic imagery, such as the TARDEC vision model [2] and the Georgia Tech vision model [3], have not included cognitive effects. There is a great need for models capable of handling imagery of realistic complexity, yet designed with modeling constraints that enable both bottom-up and top-down factors of visual search to be better understood.

We chose to constrain the problem by focusing on hard-to-see targets: targets that cannot be detected without foveation. This is a reasonable constraint because detecting a hard-to-see target requires high visual acuity, and visual acuity is a function of retinal eccentricity. Acuity is highest in the fovea, with rapid fall-off towards the periphery. Just 2.5° away from the point of fixation, acuity is down to 50% of its foveal value, and at 6° eccentricity it is down to 25%. Figure 1 below shows an example of vehicle targets, most of which cannot be detected without foveation (there are six targets in a horizontal band through the center of the image).


Acuity falloff is nature’s way of solving a resource allocation problem: the fovea is far more useful than the periphery, but high acuity requires resource-intensive high-resolution sampling and processing of the image. In fact, it is so resource intensive that even though a large portion of the human brain is dedicated to foveal processing, the fovea only covers a region the size of the tip of your finger at arm’s length.¹

Figure 1: Sample image of realistic complexity containing six distant vehicle targets

With such a small fovea, most of the visual scene never gets foveated. This led us to the conclusion that the choice of fixation points in a given scene is the most important determinant of whether hard-to-see targets get detected. Hence, our modeling approach focused on the different factors that affect the choice of fixation points.

We chose to focus on cognitive top-down effects and to model them as modulations of early vision system parameters. This choice is based on two known properties of the human visual system:

• The existence of extensive feedback paths from later to earlier visual areas
• The subsymbolic² nature of early vision

The former implies that top-down effects must be important, as they have significantly shaped brain anatomy. The latter dictates that such top-down effects must be parametric rather than symbolic since the early visual system does not have the capability to process symbolic information.

2. MODEL ARCHITECTURE

Our model has four main components: Peripheral Processing, Fixation Selection, Foveal Processing, and Cognitive Processing. The block diagram in Figure 2 below shows these four components and their relationships. The whole retinal image is continuously sampled at low resolution by the Peripheral Processing component, while at the same time one small region centered on the fixation point is processed at full resolution by the Foveal Processing component. The Fixation Selection component processes feature maps from the Peripheral component to generate coordinates specifying the next fixation point. The Cognitive Processing component receives information about what and where each object is, as soon as it is recognized by either the Peripheral or Foveal Processing components. Information from Cognitive Processing is reported as the model’s final output, but it is also used to bias future fixation point choices.

¹ It has been estimated that processing the whole retinal image at foveal resolution would increase the computational cost by four orders of magnitude [4].
² Most current early vision models are subsymbolic: low-level (“early”) vision is primarily stimulus-driven and relatively isolated from higher-level (“later”) processes, which are possibly symbolic.


Figure 2: Overall model architecture. The whole image feeds Peripheral Processing (Peripheral Blurring, Peripheral Feature Extractor, Preattentive Detector); the resulting Feature Maps feed Fixation Selection, where a Saliency Map and a Context Bias Map (derived via the Context Gating Map) are summed into a Fixation Probability Map that drives the Saccade Generator. The foveated image region feeds Foveal Processing (Attentive Detector), and Cognitive Processing receives detections from both pathways and reports Detected Targets.

The Peripheral Processing component receives input from the peripheral regions of the retinal image. Since that is most of the image, the simplifying assumption was made to use the whole image as input to Peripheral Processing. The following three modules operate in sequence:

• Gaussian convolution (blurring) of the whole image models low-resolution sampling by peripheral retinal areas
• Peripheral Feature Extractor computes Peripheral Feature Maps (PFM) of the same dimensions as the input image, one map per feature
• Preattentive Detector detects objects that are highly salient—such as the horizon—from the combined PFM, reports their category and location to Cognitive Processing, and generates a Context Gating Map as specified by cognitive feedback

It is commonly assumed that the visual system must contain at least one subsystem specialized for very rapid processing of visual input. We argue that an important function of this subsystem would be to determine where to apply the spatially limited foveal processing, i.e., continuously producing the destination coordinates for the next saccade. This is the function modeled by the Fixation Selection component, incorporating the following four modules (a minimal sketch of the Saccade Generator step follows the list):

• Threshold & Sum Operator combines all PFM into one Saliency Map
• Multiply Operator gates an average of the PFM with the Context Gating Map generated by the Preattentive Detector to produce a Context Bias Map (CBM)
• Sum Operator sums the Saliency Map and CBM to generate a Fixation Probability Map
• Saccade Generator performs peak detection on the Fixation Probability Map to determine destination coordinates for the next saccade
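The paper does not detail the peak-detection algorithm used by the Saccade Generator. The sketch below is a minimal interpretation that simply selects the global maximum of the Fixation Probability Map; the inhibition radius around previously visited locations is an added assumption (so that successive calls yield different fixations), and the function name and radius value are illustrative rather than taken from the paper.

```python
import numpy as np

def next_saccade(fixation_prob_map, visited, inhibition_radius=20):
    """Pick the coordinates of the highest peak in the Fixation Probability Map,
    suppressing locations already visited (the suppression rule is an assumption,
    not described in the paper)."""
    fpm = fixation_prob_map.copy()
    h, w = fpm.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for (vy, vx) in visited:
        # Zero out a disc around each previous fixation so the same peak
        # is not selected repeatedly.
        fpm[(yy - vy) ** 2 + (xx - vx) ** 2 <= inhibition_radius ** 2] = 0.0
    y, x = np.unravel_index(np.argmax(fpm), fpm.shape)
    return int(y), int(x)

# Usage sketch: three successive fixations on a random map.
fpm = np.random.rand(480, 640)
visited = []
for _ in range(3):
    visited.append(next_saccade(fpm, visited))
print(visited)
```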

The Foveal Processing component was not included in the results reported here, but it contains the following modules:

• Attentive Detector predicts the probability that the currently fixated object is designated a target, based on full-resolution retinal input from the foveated image region and task-specific stored object templates
• Fixation duration is also predicted, based on rate of motion, information density, distracting covert attention to non-foveal regions, and overall workload

2.1 Peripheral feature extraction

We assume that the probability of attracting attention to any given image region depends to a large degree on the feature saliency of that region with respect to the associated background. The Peripheral Feature Extractor measures feature saliency by computing Peripheral Feature Maps (PFM) from the filtered input image.


The concept of a feature map is attractive for a number of reasons. Feature maps are biologically plausible: they are topographic representations of the retinal image, just like the maps that have already been identified in the visual cortex of primates and cats. In fact, there is significant evidence that topographic maps are the primary representation for all sensory information in the brain. Feature maps are fully parallel, both across spatial locations within each map and across all maps. Such parallelism is needed to explain the very rapid effects of visual “pop out”, whereby one can detect a target object among distractors in as little as 50 ms (after response delays have been factored out). Finally, from a modeling point of view, feature maps are easy to work with. They retain all spatial relationships (unlike frequency-domain representations), so they can always be viewed as images.

Using feature maps raises the question: what are the features? After detailed analysis of past feature integration approaches (see [1] for a review), we decided that local contrast was the feature with the best potential in our problem domain, i.e., visual search for small hard-to-see targets in gray-scale still images. We implemented and evaluated a number of local, unoriented contrast measures. Through analysis and empirical testing, we found the following four contrast measures most successful:

• Absolute-difference-of-medians
• Dispersion
• Difference-of-dispersions
• Doyle

Figure 3 illustrates the results of blurring the input image shown in Figure 1 and then applying each contrast measure.

Figure 3: Blurred image (left) and Peripheral Feature Maps (right, clockwise): Absolute-difference-of-medians, Dispersion, Doyle, and Difference-of-dispersions

The absolute-difference-of-medians contrast saliency measure at a point was defined as the absolute difference between the median brightness of a rectangular window A₁ centered around the point P and the median brightness of a larger rectangular window A₂, as illustrated in Figure 4 below, and computed via equation (1):

$C_P = \left| \mathrm{Med}(A_1) - \mathrm{Med}(A_2) \right|$    (1)

where C_P is the contrast metric at point P and Med(A) is the median brightness value inside area A. The median is chosen over other low-pass filters due to its preservation of edges. Other filters, such as the Gaussian filter or other linear smoothing filters, were considered for this function, but were found less desirable since they result in the presence of “shadows” near the edges. An additional advantage of using the median is that for standard byte-per-pixel images, the median value can be computed inexpensively using a histogram technique.
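To make equation (1) concrete, here is a minimal sketch using scipy's median filter with square windows. The window sizes are placeholders, and the paper additionally runs several A₁ sizes in parallel and combines them linearly, which is omitted here.

```python
import numpy as np
from scipy.ndimage import median_filter

def abs_diff_of_medians(image, small=7, large=31):
    """Equation (1): |Med(A1) - Med(A2)| at every pixel.
    A1 is a small window, A2 a larger window; the sizes here are illustrative."""
    img = image.astype(np.float32)
    med_small = median_filter(img, size=small)
    med_large = median_filter(img, size=large)
    return np.abs(med_small - med_large)

# Usage sketch on a random 8-bit image.
img = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
contrast_map = abs_diff_of_medians(img)
```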


Figure 4: Two-window contrast measure kernel (a small window A₁ centered on point P inside a larger window A₂)

This contrast measure depends on the choice of two free parameters, the areas A₁ and A₂. An object that matches the size of A₁ will give the maximum contrast response. Since the size of objects of interest in the image is not known a priori to the operator, we implemented the contrast filter with multiple different sizes for A₁, computed in parallel. The size of A₂ was kept constant since it was significantly larger than even the largest A₁. This is effectively equivalent to having multiple contrast filters operate in parallel. The outputs of the different contrast filters are then linearly combined to generate an overall contrast measure. The reason for taking the absolute value of the difference was to make the measure equally sensitive to either contrast polarity, i.e., a light object on a dark background or a dark object on a light background.

Dispersion is the sum of absolute values of differences between the pixel intensity at point P and at all other points inside a window A centered on P. This is an L₁ measure of the local variance in intensity values around a pixel, in the same way that σ² is the L₂ measure of local variance. This measure of local contrast has one free parameter: the window size A. Unlike the difference-of-medians, this method is sensitive to intensity variations inside the window, hence only one window is used. For small window sizes, the dispersion value is a measure of abrupt variation in intensity levels, such as is caused by edges running at any orientation through the window.

We also implemented a difference-of-dispersions contrast saliency measure using the same two-window approach shown in Figure 4 above. The two window sizes are combined by subtracting the dispersion value for the larger window from the one for the smaller window and half-wave-rectifying the result. This is equivalent to saying “show me the locations where dispersion on a small scale is greater than dispersion on a large scale.” This is a selective measure of contrast, but one that correlates well with some highly salient image regions. An efficient way of computing dispersion, used for both dispersion contrast measures, is based on the histogram of pixel intensities computed over window A. The dispersion at point P is then given by:

$D_P = \sum_{k=1}^{N} \left| z_k - \mu \right| \, p(k)$    (2)

where N is the number of histogram bins (normally chosen to be 256), z_k is the center value of bin k, p(k) is the normalized histogram frequency at bin k, and µ is the mean intensity value inside the window.

The Doyle contrast measure is a measure of first-order statistics that combines the mean and standard deviation as follows:

$\left[ (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2 \right]^{1/2}$    (3)

where µᵢ and σᵢ are the mean and standard deviation computed over area Aᵢ, referring once again to windows A₁ and A₂ in Figure 4 above. This measure—unlike the difference-of-dispersions—is equally responsive to either contrast polarity. It responds equally well to a small high-contrast region in a larger low-contrast area and to a small low-contrast region in a larger high-contrast area. This insensitivity to contrast polarity is similar to the absolute-difference-of-medians, but whereas that method is only sensitive to variations larger than the smaller window, the Doyle method is also sensitive to intensity variations inside the smaller window, due to the standard deviation terms.

To summarize, four different unoriented local contrast measures were selected for this model due to their complementary sensitivity domains and empirically demonstrated effectiveness. They all have one or two free parameters—the window sizes—which were empirically determined. In future work, we intend to bring those parameters under the control of the Cognitive Processing module. The symbolic knowledge of the expected target size in the retinal image (tiny, small, medium, etc.) will be translated into a subsymbolic scalar value that gets interpreted by the contrast filters as representing a particular window size. In this modified model, objects of the expected target size will be more salient than larger or smaller objects.
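For completeness, here is a minimal sketch of the dispersion measure of equation (2) and the Doyle measure of equation (3). The window sizes are placeholders, and the direct sliding-window computation shown for dispersion is a straightforward (slow) alternative to the histogram technique described in the text.

```python
import numpy as np
from scipy.ndimage import generic_filter, uniform_filter

def dispersion(image, window=15):
    """Equation (2): mean absolute deviation of window intensities from the
    window mean, computed directly per pixel (explicit but slow)."""
    def mad(values):
        return np.mean(np.abs(values - values.mean()))
    return generic_filter(image.astype(np.float32), mad, size=window)

def doyle(image, small=7, large=31):
    """Equation (3): sqrt((mu1 - mu2)^2 + (sigma1 - sigma2)^2), with means and
    standard deviations taken over the small and large windows around each pixel."""
    img = image.astype(np.float32)
    stats = []
    for size in (small, large):
        mu = uniform_filter(img, size=size)
        var = uniform_filter(img ** 2, size=size) - mu ** 2
        stats.append((mu, np.sqrt(np.clip(var, 0.0, None))))
    (mu1, sigma1), (mu2, sigma2) = stats
    return np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)
```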

2.2 Threshold & Sum Operator

The Threshold & Sum Operator thresholds each Peripheral Feature Map (PFM) with an empirically determined threshold value and then sums the results into one Saliency Map (SM), as shown in Figure 5 below.


The notion behind thresholding each PFM is well established in modeling human perception: small differences in local features such as color or gray-level do not contribute to stimulus driven saliency [5]. A small difference in gray-levels translates to small values of contrast, hence a threshold is used to null out all regions of each feature map with “small” values.

The threshold also serves another purpose: all feature map values greater than the threshold are replaced by a uniform value. This is to model categorical perception, the highly non-linear response found in most forms of human perception [5]. The thresholded feature map can therefore be interpreted as a map that specifies for each pixel location whether that feature is peripherally perceived as “present” or “absent.”

Combining the thresholded feature maps is done by summing them, similar to the original feature integration theory models of Treisman [6]. Due to this summing, feature values that survived thresholding in the individual PFM all end up as non-zero values in the SM. If a particular image location activated multiple PFM strongly, then that location will have a high value in the SM. In other words, that location is highly salient and very likely to attract attention. The Saliency Map in Figure 5 below is the result of thresholding and summing the four PFM shown in Figure 3. Note how the highly salient objects in the lower left corner are brightest in this saliency map.
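To make the Threshold & Sum Operator concrete, here is a minimal sketch. The paper says only that thresholds were determined empirically, so the fraction-of-maximum thresholds shown here are placeholders; the uniform post-threshold value of 1 reflects the categorical "present"/"absent" interpretation described above.

```python
import numpy as np

def threshold_and_sum(feature_maps, thresholds):
    """Binarize each Peripheral Feature Map (categorical 'present'/'absent')
    and sum the results into a single Saliency Map."""
    saliency = np.zeros_like(feature_maps[0], dtype=np.float32)
    for pfm, thr in zip(feature_maps, thresholds):
        saliency += (pfm > thr).astype(np.float32)
    return saliency

# Usage sketch: thresholds as a fixed fraction of each map's maximum (an assumption).
pfms = [np.random.rand(480, 640) for _ in range(4)]
sm = threshold_and_sum(pfms, [0.3 * p.max() for p in pfms])
```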

Figure 5: Saliency Map

2.3 Preattentive detection

This module detects objects that are highly salient in the Peripheral Feature Maps (despite the low resolution). Since the objective of this research is to model visual search for objects that are difficult to see, this component is mostly interesting for its ability to detect and recognize “obvious” objects that provide relevant spatial clues to the locations of “non-obvious” objects, such as targets. We demonstrated this concept by designing and implementing a Preattentive Horizon Detector.

In an image such as the one shown in Figure 1, the horizon can be considered an “obvious” object³. Being able to detect the horizon in outdoor scenes is an ability that has probably not changed during the evolutionary period of human beings. It is therefore intuitive that the brain should have evolved ways to take advantage of this highly consistent visual element. We reason that knowing that the horizon is visible in the image, and where in the image it is, provides valuable guidance for interpreting the rest of the image. If the ground is assumed to be reasonably flat and the horizon is visible, then the observer must be viewing a plane receding towards the horizon. If the prior probability of target distribution is uniformly random over that plane, then the image of this plane as seen from the observer’s point of view has a gradient of target probabilities that is highest at the horizon and lowest at the bottom of the image, as seen in Figure 6 below. Note that determining whether the ground plane is receding via strict bottom-up analysis (such as detecting depth from texture gradients or depth-of-focus) is also possible, but would be much more complicated, since gray-level statistics of the ground plane vary in complex ways with depth. It seems likely that being able to preattentively detect the horizon would be valuable because it helps to form a rough estimate of the prior probabilities of target distribution. We have found this idea to be well supported by the data in our research. Nearly all fixations are located in a horizontal band; a region that is clearly bounded above by the horizon but less clearly bounded below, approximately half-way between the horizon and the bottom of the image.

³ One might argue that it is not an object per se, but the boundary between two objects (the sky and the ground). This distinction is not important for the purpose of our model.


To put things in perspective, even though the prior probability of targets above the horizon is near zero (for ground-based vehicle targets), the visual system probably does not have to be instructed to ignore the sky. In other words, even though a preattentive horizon or sky detector could be used for this purpose, it would be superfluous. This is because the PFM are very unlikely to have peaks in the region representing the sky; the naturally low variance of gray-levels in the sky region ensures that the sky is mostly ignored by the local contrast features. However, this is not the case for the close-range region at the bottom of the image. There is enough natural variance in this region to produce significant local contrast.

When the Preattentive Horizon Detector detects the horizon, it queries the Cognitive Processor about what to do next. Given the task of searching for ground-vehicle targets, the Cognitive Processor instructs the detector how to construct a gating mask—as shown in Figure 6—representing the prior probability gradient of target distribution. The assumption is that the Cognitive Processor has learned to recognize how the spatial layout of a scene with a visible horizon can be estimated with a simple piece-wise-linear gradient. Since this gradient will be used as a gating mask, each pixel in the mask has a value between 0.0 (near the bottom of the image and above the horizon) and 1.0 (just below the horizon). The effect on corresponding pixels in the image-map to be gated is that they will be completely masked out where mask pixels are 0 and left unaffected where mask pixels are 1.

In summary, the Preattentive Horizon Detector and the Cognitive Processor together employ cognitive knowledge about the prior target probability distribution to prevent the near-ground image region from receiving an unwarranted number of fixations.

Figure 6: Horizon Gating Map

We developed a five-step process to automatically detect the horizon and generate a horizon gating mask, such as the one shown in Figure 6 (a simplified code sketch follows the list):

1) Combine the PFM into a single image (max PFM value at each pixel)
2) Threshold the image to binary (black if pixel value < 10% of image max, otherwise white)
3) Remove noise (15x15 median filter)
4) Merge all small regions until two contiguous regions remain, one black (above horizon) and one white (below horizon):
   a) Starting with a small value for the maximum allowable Euclidean distance D, merge pixels of the same color (black or white) that are within D of each other
   b) Check the size of each region of uniform color; if it is < 10% of the total image area, then eliminate the region by changing its color to match the surrounding area
   c) Count the remaining regions of uniform color; if there are > 2, then increase the value of D and iterate the merging process
5) Fill in the white region with a vertical linear gradient (from white at the horizon to black in the bottom 10%)
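The sketch below follows steps 1, 2, 3, and 5 directly; the iterative distance-based region merging of step 4 is simplified to a single horizon-row estimate, and the function choices and the 0.5 row-activity cutoff are illustrative assumptions rather than the implementation used in the paper.

```python
import numpy as np
from scipy.ndimage import median_filter

def horizon_gating_map(pfms, bottom_fraction=0.10):
    """Build a gating mask that is 1.0 just below the horizon and falls to 0.0
    at the bottom of the image and above the horizon (compare Figure 6)."""
    combined = np.maximum.reduce([p.astype(np.float32) for p in pfms])  # step 1
    binary = combined >= 0.10 * combined.max()                          # step 2
    cleaned = median_filter(binary.astype(np.uint8), size=15)           # step 3
    # Simplification of step 4: take the horizon as the first row where
    # above-threshold activity becomes substantial.
    row_activity = cleaned.mean(axis=1)
    horizon_row = int(np.argmax(row_activity > 0.5))
    h, w = cleaned.shape
    mask = np.zeros((h, w), dtype=np.float32)
    bottom_start = int(h * (1.0 - bottom_fraction))
    rows = np.arange(horizon_row, bottom_start)
    if len(rows) > 1:
        # Step 5: linear gradient from 1.0 at the horizon to 0.0 near the bottom.
        gradient = np.linspace(1.0, 0.0, len(rows))
        mask[rows, :] = gradient[:, None]
    return mask
```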


2.4 Combining saliency and context

We assume that for any given region in an image, the probability that this region attracts attention depends both on the degree of feature saliency of that region with respect to the surrounding background (the SM) and on the proximity to preattentively detected objects that provide spatial context (the Horizon Gating Map).

Three methods were explored for applying the Horizon Gating Map. The first option—treating the horizon map as a “prototype” fixation probability map (rather than a gating map) and summing it with the SM—was tested and rejected because it produced Fixation Probability Maps with large regions of artificially high probability. The horizon map does not take into account any bottom-up information other than the location of the horizon. Hence, treating the horizon map as a prototype fixation probability map produced artificially high probabilities near the horizon in cases where large regions of uniform gray-scale are close to the horizon. Such regions have low saliency and tend to receive few fixations, even if they are close to the horizon.

The second option—applying the horizon map as a gating map to the SM—was tested and rejected because it produced Fixation Probability Maps that were missing regions of high probability. In cases where there were compact regions of high contrast near the bottom of the image (such as seen in Figure 5), the resulting high-saliency regions in the SM were attenuated by the gating process, because the gating mask values are near zero at the bottom of the image. Consequently, the Fixation Probability Map ended up with artificially low probabilities for such regions.

The third option we tested was applying the horizon map as a gating map to an average of the PFM (Figure 7, left) to produce a Context Bias Map (Figure 7, right). The Context Bias Map is then summed with the SM to produce the final Fixation Probability Map (Figure 8). This option was chosen because it solved both of the previous two problems. It does not artificially inflate probabilities for low-saliency regions near the horizon, because the average PFM has low values in those regions. It also does not excessively decrease the probabilities for compact regions of high contrast near the bottom of the image, because the other component in the sum, the ungated SM, has high values in those regions.
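A minimal sketch of the chosen third option, combining the earlier sketches; how (or whether) the maps are rescaled before and after summing is not specified in the paper, so the final normalization here is an assumption.

```python
import numpy as np

def fixation_probability_map(pfms, saliency_map, horizon_gating_map):
    """Third option: gate the average PFM with the horizon map to get the
    Context Bias Map, then sum it with the Saliency Map."""
    avg_pfm = np.mean(np.stack(pfms), axis=0)
    context_bias_map = avg_pfm * horizon_gating_map   # Multiply Operator
    fpm = saliency_map + context_bias_map             # Sum Operator
    return fpm / fpm.max() if fpm.max() > 0 else fpm  # scaling is an assumption
```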

Figure 7: Averaged PFM before gating (left) and after gating, the Context Bias Map (right)


Figure 8: Fixation Probability Map

3. MODEL VALIDATION

The model’s ability to predict fixation locations when searching for hard-to-see targets was validated. This required a set of images and associated eye fixation data, collected with an eye-tracker from observers searching for targets in those images. The images show model military vehicle “targets” on the Army’s terrain-board at the Night Vision and Electronic Sensors Directorate (NVESD), Ft. Belvoir, VA. Each scene simulates either a desert or a sparse forest location, under simulated daylight or night-time flare lighting (see Figures 1 and 12). Each scene contains 1 to 8 targets at a simulated distance of 1 to 3 km from the observer. Each image shows a 15° field of view, seen through a black-and-white camera that digitizes the image to 640 by 480 pixels with 256 possible gray levels. The observers were 15 enlisted men with previous experience viewing such scenes.

The eye-tracker data was collected with an I-Scan video-based system that collects eye position data at a rate of 60 Hz with an accuracy of about 0.5°. This raw data was post-processed by defining a fixation as at least six consecutive eye-position samples falling within 0.5° of a running mean location. The fixation data was further processed by dividing each image into 48 square regions (8 by 6 grid, 80x80 pixels per square) and reporting for each region the number of fixations, the average fixation duration, and the total fixation duration. Figures 9-11 below show on the far left examples of these number-of-fixations maps.
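A minimal sketch of this fixation definition and the 8 by 6 binning; the running-mean clustering rule, the degrees-to-pixels conversion (derived from the stated 15° field of view over 640 pixels), and the handling of region borders are assumptions beyond the 0.5° and six-sample criteria given in the text.

```python
import numpy as np

def extract_fixations(samples, deg_per_pixel=15.0 / 640, radius_deg=0.5, min_samples=6):
    """Group consecutive 60 Hz eye-position samples (x, y) in pixels into fixations:
    at least `min_samples` consecutive samples within `radius_deg` of a running mean."""
    radius_px = radius_deg / deg_per_pixel
    fixations, cluster = [], []
    for xy in samples:
        if cluster:
            mean = np.mean(cluster, axis=0)
            if np.linalg.norm(np.asarray(xy) - mean) > radius_px:
                if len(cluster) >= min_samples:
                    fixations.append(tuple(np.mean(cluster, axis=0)))
                cluster = []
        cluster.append(xy)
    if len(cluster) >= min_samples:
        fixations.append(tuple(np.mean(cluster, axis=0)))
    return fixations

def fixation_count_map(fixations, grid=(6, 8), image_size=(480, 640)):
    """Count fixations per cell of the 8 by 6 grid (80x80-pixel regions)."""
    counts = np.zeros(grid, dtype=int)
    cell_h, cell_w = image_size[0] // grid[0], image_size[1] // grid[1]
    for x, y in fixations:
        r = min(int(y) // cell_h, grid[0] - 1)
        c = min(int(x) // cell_w, grid[1] - 1)
        counts[r, c] += 1
    return counts
```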

To qualitatively evaluate the model, we visually compared model predictions with the number-of-fixations data from the human observer experiments, as demonstrated in the following figures. Saliency, context bias, and fixation probability maps were collected from the model for each image. Each map was then subsampled down to the same 8 by 6 grid as the observed data, by averaging the values in each 80x80-pixel region. Once the observed and predicted data were in the same format of 8 by 6 data-bins or “region-maps,” we expanded each region-map by a factor of 24 (via pixel replication) and then individually normalized the gray values in each resulting image to maximize utilization of the 256 available levels and to enable the final step: combining the images-of-region-maps into one figure. The following figures demonstrate these visual comparisons by presenting four maps side-by-side for a sampling of input images. Each figure shows, from left to right: observed number-of-fixations map, Saliency Map (SM), Context Bias Map (CBM), and Fixation Probability Map (FPM). Dark grays represent low fixation count/saliency/bias/probability; light grays represent high fixation count/saliency/bias/probability.
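The subsampling and normalization steps described above can be sketched as follows; the normalization shown is an assumption about how the visualizations were prepared.

```python
import numpy as np

def to_region_map(full_map, grid=(6, 8)):
    """Average a 480x640 map over 80x80-pixel regions, yielding a 6x8 region-map."""
    h, w = full_map.shape
    gh, gw = grid
    return full_map.reshape(gh, h // gh, gw, w // gw).mean(axis=(1, 3))

def to_gray_image(region_map, levels=256):
    """Normalize a region-map to use the full range of gray levels (visualization assumption)."""
    rm = region_map - region_map.min()
    if rm.max() > 0:
        rm = rm / rm.max()
    return (rm * (levels - 1)).astype(np.uint8)
```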


Panels in Figures 9-11, left to right: Observed fixations, Saliency Map, Context Bias Map, Fixation Probability Map.

Figure 9: Observed fixation data and model predictions for input image 7 (see Figure 1)

Figure 10: Observed fixation data and model predictions for input image 3 (see Figure 12, left)

Figure 11: Observed fixation data and model predictions for input image 17 (see Figure 12, right)

Figure 12: Input images 3 (left) and 10 (right)

A few conclusions can be drawn from these visual comparisons. First of all, it is obvious that all three types of maps produced by the model are in fairly good agreement with the observed data. Second, just looking at the observed data, it is clear that there is structure to the overall spatial distribution of fixations: the fixations tend to be concentrated in a horizontal band. Third, this band tends to be near the middle of the image, but it does shift around and its width varies—and the three prediction maps track those changes. Fourth, there is a qualitative difference between the Saliency Maps (SM) and Context Bias Maps (CBM): the SM produce large areas of no fixation probability, but they produce compact probability regions which in some cases are missed by the CBM (such as at the bottom left of the image in Figure 9). Finally, it should be noted that there was no obvious correlation between the quality of the model’s predictions and scene type or lighting conditions.

Figure 13 shows a plot over the 20 evaluation images of the chi-square error values of the SM (circles), CBM (triangles), and FPM (squares). It is apparent that the CBM is overall a more valuable contributor than the SM. For images 14 and 17 the SM is marginally better than the CBM, but for all other images it performs somewhat worse, and for images 8 and 16 it is quite far off. However, as shown earlier in the visual comparisons, there are cases where the SM caught isolated regions of non-zero probability that were entirely missed by the CBM; because those regions are very small, they do not affect the chi-square error measure significantly. In summary, the SM and CBM are designed for complementary purposes and should be used together. However, as indicated by the results for images 8 and 16 in the plot below, the CBM performs more consistently than the SM, indicating that the horizon is indeed a reliable contextual cue.
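The paper reports chi-square error values without giving the exact formulation. The sketch below assumes the standard statistic, with expected counts obtained by scaling the predicted region-map to the observed total; that scaling and the small floor value are assumptions rather than the authors' stated procedure.

```python
import numpy as np

def chi_square_error(observed_counts, predicted_map, eps=1e-6):
    """Chi-square error between an observed 6x8 fixation-count map and a model
    prediction; the predicted map is rescaled to the observed total (an assumption)."""
    expected = predicted_map / (predicted_map.sum() + eps) * observed_counts.sum()
    return float(np.sum((observed_counts - expected) ** 2 / (expected + eps)))
```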


Figure 13: Comparing predictive errors for SM (circles), CBM (triangles), and FPM (squares)

The performance of our model was also compared with that of two alternative models. The first comparison model was very simple: the standard deviation of gray values in each 80x80-pixel region. This model was inspired by visual comparison of the input images and fixation data; there is a visible correlation between local contrast in the image and regions of frequent fixation.

The second model was even simpler: assume that fixation probabilities are identical for every image, 0% for the top and bottom rows in the 8x6 map and 100% for the middle four rows. The inspiration for this model came from the observation that very few fixations fell in the top (sky) or bottom (near ground) regions of each image.


Figure 14: Comparing model performance (squares) with that of two alternative models


Figure 14 above plots the performance of the three models across the 20 input images. The squares represent our model, the circles represent the standard deviation model, and the triangles represent the fixed model. The figure indicates that our model performed significantly better. There are indeed a few cases (images 4, 16, and 17) where our model is not the best performer, but Table 1 below shows that our model performs best overall by more than a factor of two. It is also the most consistent, as seen by its significantly lower standard deviation.

Table 1: Ensemble comparison of our model and two alternative models

                  Our Model    Std. Deviation Model    Fixed Model
Mean              1627         4074                    3227
Median            1203         3180                    2982
Std. Deviation     997         2655                    1626

4. CONCLUSIONS AND FUTURE WORK

Our model has been validated against one set of images that vary significantly in clutter level and lighting conditions. The “feature integration” approach to constructing a Saliency Map was moderately successful in predicting fixation locations, but the horizon gating approach to constructing a Context Bias Map was consistently better. This indicates that a receding ground plane does indeed bias visual search for distant targets. In other words, observers seem to have an innate or learned understanding of the transformation of a uniform probability distribution into a gradient probability distribution due to the compression of perspective. Furthermore, this understanding of a high-level concept is available to low-level vision, at least at the level of fixation selection.

Our model is currently being expanded and improved. For example, the Gaussian blurring used to simulate peripheral resolution is too simplistic, since it ignores the lateral inhibition found in many processing layers of the retina and in cortex. We are therefore switching to a difference-of-Gaussians operator (bandpass filter) for this processing step, with all parameters (center and surround Gaussian widths and heights) chosen to fit known physiological and anatomical data (see the sketch at the end of this section).

It should be noted that the model is designed to be functional for other types of contextual gating maps as well as the horizon map. There are undoubtedly other “obvious objects” that provide valuable cues for spatial context, such as roads, mountains, and lakes. We anticipate that each of those object types can be preattentively detected and a task-specific gating map produced. Each gating map can be treated in the same way as the horizon map: gate a combination of Peripheral Feature Maps and add the result into the total sum that produces the final Fixation Probability Map.
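As a pointer to that planned change, a minimal difference-of-Gaussians sketch is shown below; the center and surround widths and gains are placeholders, whereas the paper states that these parameters will be fit to physiological and anatomical data.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(image, sigma_center=1.0, sigma_surround=3.0,
                            gain_center=1.0, gain_surround=1.0):
    """Center-surround (bandpass) filtering intended to replace plain Gaussian
    blurring; all four parameters are illustrative, not fitted values."""
    img = image.astype(np.float32)
    center = gain_center * gaussian_filter(img, sigma_center)
    surround = gain_surround * gaussian_filter(img, sigma_surround)
    return center - surround
```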

5. ACKNOWLEDGEMENTS

This research was supported by U.S. Army contracts DAAE07-96-C-X030 (TARDEC), DAAB07-96-C-J006 (NVESD), and DAAH04-96-C-0086 (NVESD).

6. REFERENCES

1) Wolfe, J.M. “Visual Search: A Review”, in H. Pashler (ed.), Attention, London, UK: University College London Press, 1996.
2) Gerhart, G., Meitzler, T., Sohn, E., Witus, G., Lindquist, G., & Freeling, J.R. “Early Vision Model for Target Detection”, Proceedings of SPIE, Orlando, FL, 1995.
3) Doll, T., McWhorter, S., Schmieder, D., Hetzler, M., Stewart, J., Wasilewski, A., Owens, W., Sheffer, A., Galloway, G., & Harbert, S. “Biologically-based vision simulation for target-background discrimination and camouflage/LO design”, Proceedings of SPIE, Vol. 3062, pp. 231-242, Orlando, FL, 1997.
4) Rojer, A.S. & Schwartz, E.L. “Design considerations for a space-variant visual sensor with complex-logarithmic geometry”, 10th International Conference on Pattern Recognition, Vol. 2, pp. 278-285, 1990.
5) Wolfe, J.M. “Guided Search 2.0: A revised model of visual search”, Psychonomic Bulletin & Review, 1(2), pp. 202-238, 1994.
6) Treisman, A. & Gelade, G. “A Feature-Integration Theory of Attention”, Cognitive Psychology, 12, pp. 97-136, 1980.

