A Probabilistic Estimation Approach for Dynamic Visual Search
Tamar Avraham and Michael Lindenbaum
Computer Science Department, Technion, Haifa, Israel 32000
tammya, [email protected]
Abstract
This work proposes a stochastic model for efficient visual search in computer vision. Given an image, we extract a number of sub-images which are the candidates for attention. The identity of the candidates is modelled as a set of correlated random variables taking target/non-target values. We assume that the correlation between two candidate identities depends on the similarity between them: more similar candidates (sub-images) tend to have more similar identities. The proposed search algorithm uses the estimated identities obtained from a linear estimation procedure which relies on this assumption. The proposed model may be considered a quantification of Duncan and Humphreys' qualitative model of the HVS attention mechanism [2]. Experiments show the efficiency of the visual search procedure and validate the correlation model.
1. Introduction
Visual search is required in situations where a person or a machine views a scene with the goal of finding one or more familiar entities. The highly effective visual search (or, more generally, attention) mechanisms in the human visual system were extensively studied from psychophysical and physiological points of view. Yarbus [24] studied human perception of stationary images by tracking eye movements. His experiments show that the eyes rest much longer on some elements of an image, while other elements may receive little or no attention. Neisser [11] suggested that visual processing is divided into pre-attentive and attentive stages. The first consists of parallel processes that simultaneously operate on large portions of the visual field and form the units to which attention may then be directed. The second stage consists of limited-capacity processes that focus on a smaller portion of the visual field. Treisman and Gelade's feature integration theory [18] formulates a hypothesis about how the human visual system performs pre-attentive processing. They hypothesized that some visual properties (such as color, orientation, spatial frequency, and movement) are extracted early, automatically, and in parallel
across the visual field. Therefore, if a target is unique from its surround in such a feature, it pops out without the need for focal attention. If there is no feature in which the target is unique from its surround, e.g., when the target is unique only in a conjunction of such features, the distinction cannot be detected pre-attentively, forcing subjects to search serially through the image until the attention spotlight focuses on the target, which enables the use of integrated features (conjunctions) for recognition. While several aspects of the feature integration theory were criticized, the theory was dominant in visual search research, and much work was carried out to understand which features are pre-attentive (e.g. [6],[19],[10]), which features are preferred by the HVS (e.g. [1]), how feature integration occurs, etc. (e.g. [7],[23],[21]). The HVS study most related to the work described in this paper was introduced by Duncan and Humphreys [2]. They rejected the dichotomy of parallel vs. serial search, central to feature integration theory, and proposed an alternative hypothesis based on similarity. According to them, two types of stimulus similarity are involved in a visual search task. One, denoted the inter-alternative similarity, is the similarity between objects in the scene and the prior knowledge about the possible targets. The other is the similarity between the objects in the scene. They suggest the search procedure starts with a hierarchical segmentation, defining structural units, which are the candidates competing for the attention resource. Units which are similar to the hypothesized target get higher weights, which are used as priorities. The search is facilitated if several structural units are similar, because these are linked, and when a weight is changed in one unit, the change propagates to the linked units as well. This way suppression may be spread among the units without treating every one of them individually. Thus, if all non-targets are homogeneous, they may be rejected together, resulting in a fast (pop-out like) detection process, while if they are more heterogeneous the search is slower. Several models and computer implementations of attention mechanisms and visual search have been developed. Many of them focus on how and which pre-attentive features to extract and how to combine them into one saliency map that specifies the attention order, e.g. Koch and Ullman [7], Tsotsos et al. [21], Itti, Koch, and Niebur [5],
Wolfe [23]. In particular, a connectionist model, SERR, demonstrating the similarity-based theory was suggested by Humphreys and Muller [3]. Some computer vision attention systems use other sources of knowledge to direct visual search: if we know, for example, that the sought-for object is likely to be close to another, easier-to-find object, then we may try to find the second object first and then look for the target only in a limited image region (Rimey and Brown [15], Wixson and Ballard [22]). Another method uses matching between image and object parts (Tagare, Toyama, and Wang [17]). Minut and Mahadevan [9] improve the search performance using on-line training in a limited environment (a room). This paper aims to develop a general, probabilistic framework for computer vision attention. Specifically, we set the goal as finding the target(s) as fast as possible, in minimal expected time. Interestingly, while we did not aim at modelling the HVS attention system, it turns out that our framework shares many of its properties and in particular is similar to Duncan and Humphreys' model [2]. Given a scene image, we propose to extract a number of candidate sub-images. The identity of the candidates is modelled as a set of correlated random variables taking target/non-target values. We assume that the correlation between the candidate identities depends on the similarity between them: more similar candidates tend to have more similar identities. The algorithm we consider begins with a static priority map, indicating the initial likelihood of each candidate to be a target. Iteratively, the candidate with the highest priority receives the attention. The relevant sub-image is examined by a high-level object recognition system. Based on the recognizer responses and the stochastic dependency between candidates, the priority map is dynamically updated using linear estimation techniques. We consider the cost of the algorithm to be the expected number of queries to the recognizer module before the target/targets are found, and aim to minimize this cost. The context for visual search and some basic intuitive assumptions are described in section 2. Section 3 describes the underlying probabilistic model and the resulting linear estimation based algorithm. Some experimental examples demonstrating the algorithm's effectiveness and validating the model are shown in section 4. Conclusions and several directions for future work are considered in section 5.
2. Framework
2.1. The context - A visual search task involving candidate selection and classification
The task of looking for one or more objects of a certain identity in a visual scene is often divided into two subtasks: one is to select sub-images which should be considered as possible candidates.
The other, the object recognition task, is to decide whether a candidate is a sought-for object or not. The input to the candidate selection task is the whole-scene image, and its output is a set of sub-images which we call candidates. The candidate selection task can be performed by a segmentation process or even by a simple division of the image into small rectangles. It can be based on top-down or bottom-up processes, meaning that it uses prior knowledge about the target's character or just image-based knowledge. The candidates may be of different sizes, may be bounded or unbounded [20], and can also overlap. The input to the object recognition task is a sub-image (a candidate). The recognizer should decide whether the given candidate is the required object or not. It can be based on statistical modelling, part decomposition, functional description, or any other method. This stage is usually considered a high-level vision task and is commonly computationally expensive, as the possible appearances of the object may vary due to changes in shape, color, pose, illumination, and other imaging conditions. The object recognizer may need to recognize a category of objects (and not a specific model), which usually makes it even more complex. For many algorithms, the candidate selection stage may be model dependent and may be based, for example, on invariants. Still, a classification/verification stage is usually needed to get reasonable performance. The object recognition process gets the candidates, one by one, after some ordering. This ordering is the attentional mechanism on which we focus here. Many orderings are possible, but some are better than others. One goal of this work is to find a method for specifying a good ordering so that the number of calls to the recognition stage is minimized.
2.2. Sources of information for directing the search
The knowledge for directing the search may be different in different contexts. In the simplest case no knowledge is available, and the strategy is to scan the candidates in some arbitrary order. Several sources of information enabling a more efficient search are possible:
1. Bottom-up saliency of candidates - In modelling the attention mechanism in the HVS, it is often claimed, for example, that dominant orientation, size, and color are used to assess a saliency measure [18][7][5]. This measure usually quantifies how one candidate is different from the other candidates in the scene. Often, saliency is a major factor in directing attention to objects in a scene. Nevertheless, it can sometimes be misleading and may also not be applicable when, say, the target's uniqueness is defined by a combination of features, or when there are a few resembling targets in the
scene.
2. Similarity to the object we look for (top-down approach) [23],[4] - When prior knowledge about the targets is available, both the candidate selection stage and the attention stage may benefit from it. The candidate selection stage can filter out irrelevant candidates. The remaining candidates may be ranked by their degree of consistency, or similarity, with the target description. In many cases, however, it is hard to characterize the objects of interest visually in a way which is effective for discrimination and inexpensive to evaluate. Then, using top-down information may cost more than actually applying the high-level recognition module to each candidate.
3. Mutual similarity of the candidates [2] - Usually a higher inner-scene visual similarity implies a higher likelihood of similar (or equal) identity. Under this assumption, after the identity of one (or a few) candidates is already known, it can affect the likelihood of the remaining candidates to have the same/different identity.
In this work we focus only on the last source of information, mutual similarity between candidates, and assume that other information is not given. We leave the goal of integrating it quantitatively with other sources of information to future work.
2.3. Dynamic vs. static search
Usually, systems based on bottom-up or top-down approaches calculate a saliency map before the search starts, pre-specifying the scan order. In this case, the attentional process can be called static. (Usually, the changes in the maps during the search include only inhibition of the attended locations as a part of an algorithmic mechanism, preventing attention from returning to the same place again.) The attention mechanism proposed here is dynamic in the sense that it allows the saliencies, or priorities, to be changed based on the results of the object recognition process. We show that the dynamic attention mechanism is preferable as it requires fewer calls to the recognition process to find the target(s).
3. Dynamic stochastic search algorithm
In this section we suggest a stochastic framework for dynamic visual search, as well as a specific algorithm based on linear estimation minimizing the mean square error [13]. We start with some notations.
3.1. Notations
A search task is described by a pair (X, l), where X = {x_1, x_2, ..., x_n} is a set of partial descriptions associated with the set of candidates (for instance, feature vectors), and l : X -> {0, 1} is a function assigning identity labels to the candidates: l(x_i) = 1 if the candidate x_i is a target, and l(x_i) = 0 if x_i is a non-target (or a distractor). An attention, or search, algorithm is provided with the set X, but not with the labels l. The cost of a search algorithm A, cost_g(A, X, l), is defined as the number of queries to the recognizer oracle until the goal g of the search is achieved. The goal of the search can differ depending on the application: in many search tasks finding one target is the goal (a customer looking for a certain product on a supermarket shelf, or a guy in a cafeteria looking for a date for Saturday), sometimes a certain number or a certain percentage of the targets is the goal (a host looking for cups in his kitchen cabinet in order to prepare coffee for his three guests), and sometimes one will not be satisfied unless he knows all the targets are detected (a mother looking for her five children). We aim to minimize the expected value of this cost.
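To make the cost definition concrete, the following minimal sketch (ours, with names of our own choosing, not the authors' implementation) wraps the recognizer as a query-counting oracle and measures cost_g for the goal of finding a given number of targets.

```python
# Illustrative sketch only: the oracle wraps the (expensive) recognizer and
# counts queries; cost_g(A, X, l) is the number of queries a search order
# spends until the goal (here, a required number of targets) is achieved.
from typing import List, Sequence


class RecognizerOracle:
    """Reveals the true 0/1 label of a candidate and counts how many were attended."""

    def __init__(self, labels: Sequence[int]):
        self._labels = list(labels)
        self.queries = 0

    def query(self, k: int) -> int:
        self.queries += 1
        return self._labels[k]


def search_cost(order: List[int], labels: Sequence[int], targets_wanted: int) -> int:
    """Number of oracle queries until `targets_wanted` targets are found (or all candidates are scanned)."""
    oracle = RecognizerOracle(labels)
    found = 0
    for k in order:
        found += oracle.query(k)
        if found >= targets_wanted:
            break
    return oracle.queries
```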
3.2. Object labels as dependent random variables
Taking a stochastic approach, we model the object identities as binary random variables with possible values 0 (for non-target) or 1 (for target). These values are initially unknown, and are revealed one by one by the recognition oracle. Estimates of these variables may be available and help in directing the search. Intuitively we know that objects of similar identity (or category) tend to be visually more similar than objects of different identities. Quantifying this intuition in a probabilistic way, we suggest that the random variables associated with two candidates are more dependent if the visual similarity between these candidates is higher. Specifically, we characterize the dependency behavior as follows.
Basic probabilistic assumption: similarity-dependent stochastic dependency. The covariance between two labels is a monotonically ascending function of their visual similarity, i.e., a descending function of the feature-space distance between them:

cov(l(x_i), l(x_j)) = γ(d(x_i, x_j))

where l(x_i) and l(x_j) are the labels of candidates x_i and x_j, d(x_i, x_j) is the feature-space distance between the two candidates, and γ is a monotonically descending function.
In section 4.2 we experimentally challenge this assumption and find that, indeed, such a monotonic dependency is found in most cases. From the experiments we found that an exponentially descending γ is a good approximation to the actual dependency in many cases. This assumption is one way to quantify the intuition presented above. The advantage of knowing the second order statistics is that it allows unknown labels to be inferred from known ones using common estimation techniques.
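As a concrete illustration of this assumption (a sketch under our own naming, not code from the paper), the covariance of two labels can be computed from their feature-space distance with an exponentially descending γ; the scale μ(1-μ), the variance of a Bernoulli(μ) label, matches the choice made later in section 4.1.

```python
# A minimal sketch of the similarity-dependent covariance assumption:
# cov(l(x_i), l(x_j)) = gamma(d(x_i, x_j)), with gamma monotonically descending.
import numpy as np


def gamma(d: float, mu: float) -> float:
    """Exponentially descending covariance at feature-space distance d (mu = prior target probability)."""
    return mu * (1.0 - mu) * np.exp(-d)


def label_covariance(x_i: np.ndarray, x_j: np.ndarray, mu: float) -> float:
    """cov(l(x_i), l(x_j)) under the basic probabilistic assumption, using the Euclidean distance."""
    return gamma(np.linalg.norm(x_i - x_j), mu)
```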
3.3. Dynamic search framework
We propose the following greedy approach to dynamic search: at each iteration, estimate the probability of each unlabelled candidate to be a target using all the knowledge available, choose the candidate for which the estimated probability is the highest, and apply the object recognition oracle to the corresponding sub-image. After the m-th iteration, m candidates, x_1, x_2, ..., x_m, were already attended and m labels, l(x_1), l(x_2), ..., l(x_m), are known. We use them to estimate the conditional probability of the label l(x_k) of each unlabelled candidate x_k to be 1:

p_k = p(l(x_k) = 1 | l(x_1), ..., l(x_m))
3.4. Minimum mean square error linear estimation
Now, note that the random variable l(x_k) is binary and therefore its expected value is equal to its probability of taking the value 1. Estimating the expected value, conditioned on the known data, is generally a complex problem and requires knowledge of the joint distribution of the labels. We chose to use a linear estimator minimizing the mean square error criterion, which needs only second order statistics. Given the measured random variables l(x_1), l(x_2), ..., l(x_m), we seek a linear estimate l̂_k of the unknown random variable l(x_k),

l̂_k = a_0 + Σ_{i=1}^{m} a_i l(x_i)

which minimizes the mean square error e = E[(l(x_k) - l̂_k)^2]. This is a standard task with a known solution:

l̂_k = E[l(x_k)] + a^T (l - E[l])

where l = (l(x_1), l(x_2), ..., l(x_m)) is the vector of the known labels, and the vector a, composed of the coefficients a_i, i = 1, ..., m, is specified by solving the following set of (Yule-Walker) equations [13]:

a = R^{-1} · r

where r_i, i = 1, ..., m, and R_ij, i, j = 1, ..., m, are given by

R_ij = E[(l(x_i) - E[l(x_i)])(l(x_j) - E[l(x_j)])] = cov(l(x_i), l(x_j))
r_i = E[(l(x_k) - E[l(x_k)])(l(x_i) - E[l(x_i)])] = cov(l(x_k), l(x_i))

The estimate l̂_k approximates the conditional mean of the label l(x_k) of an unclassified candidate x_k, and therefore may be interpreted as an estimate of the probability of l(x_k) to be 1:

p_k = p(l(x_k) = 1 | l(x_1), ..., l(x_m)) ≈ l̂_k

3.5. The algorithm
• Given a scene image, choose n sub-images to be candidates.
• Extract the set X = {x_1, x_2, ..., x_n} of feature vectors.
• Calculate pairwise feature space distances and the implied covariance for each pair.
• Select the first candidate(s) randomly (or based on some prior knowledge).
• In iteration m + 1:
  – For each candidate x_k out of the n - m remaining candidates, estimate l̂_k ∈ [0, 1] based on the known labels l(x_1), ..., l(x_m).
  – Query the oracle on the candidate x_k for which l̂_k is maximum.
  – If enough targets were found, abort.
Notes:
1. The extraction of the candidates, and the choice of the feature vector and the feature space distance, may depend on the application (see section 4.1).
2. Our goal is to minimize the expected search time, and the proposed algorithm, being greedy, cannot achieve an optimal solution. It is, however, optimal with respect to all other greedy methods, as it uses all the information collected in the search to make the decision. It would also be interesting to consider algorithms that at each iteration aim to minimize the total search time by choosing the candidate that minimizes the future uncertainty. Such methods are used for selectively choosing samples for supervised learning algorithms. There the task is only exploratory, aiming to build a good classifier, while in our case the task is basically exploitative.
3. The first candidate may be chosen based on different considerations. Choosing, for example, the first candidate from a large subset of similar candidates has the potential to reduce the number of potential candidates significantly, and therefore may be best if the prior probability of all candidates to be targets is uniform. This, however, seems to be an unjustified assumption, as targets usually do not appear in large groups. (If they did, then saliency would never work.)
4. The estimation of the label associated with any unknown candidate is accelerated if only its nearest (classified) neighbors are used. As the covariance decreases with distance, ignoring far neighbors may be justified.
5. In tasks where the target differs significantly from the non-targets, and the non-targets are similar, we get a pop-out like behavior, and the target is found after one or two iterations.
6. If the non-targets form k clusters in the feature space, then at most k candidates are tested before the target is attended. If the clustering is not strong, performance degrades smoothly and more attempts are required before the target is found.
7. If the targets themselves form one cluster, then after the first is detected, the rest follow immediately.
An illustrative sketch of the procedure is given below; in the next section we describe some visual search experiments.
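The following sketch (ours, not the authors' code) puts sections 3.4 and 3.5 together. It assumes feature vectors, a uniform prior target probability μ for every candidate, the exponential covariance form of section 4.1, and an oracle given as any callable that returns the true 0/1 label of a candidate index; all names are our own.

```python
# Greedy linear-estimation search: at each iteration the unlabelled candidate
# with the largest MMSE label estimate is sent to the recognition oracle.
import numpy as np


def greedy_linear_estimation_search(X, oracle, mu, targets_wanted=1, first=None):
    Xa = np.asarray(X, dtype=float)
    n = len(Xa)
    # pairwise feature-space distances and the implied label covariance matrix
    d = np.sqrt(((Xa[:, None, :] - Xa[None, :, :]) ** 2).sum(axis=2))
    cov = mu * (1.0 - mu) * np.exp(-d)          # the exponential gamma used in section 4.1
    attended, labels = [], []                   # candidates already shown to the oracle
    remaining = set(range(n))
    k = np.random.randint(n) if first is None else first
    found = 0
    while remaining:
        lbl = oracle(k)                         # one (expensive) recognizer call
        attended.append(k)
        labels.append(lbl)
        remaining.discard(k)
        found += lbl
        if found >= targets_wanted or not remaining:
            break
        # linear MMSE estimate l_hat = mu + a^T (l - mu), with a = R^{-1} r (Yule-Walker)
        R = cov[np.ix_(attended, attended)]
        centered = np.asarray(labels, dtype=float) - mu
        best_k, best_est = None, -np.inf
        for j in remaining:
            a = np.linalg.solve(R, cov[attended, j])
            est = mu + a @ centered
            if est > best_est:
                best_k, best_est = j, est
        k = best_k
    return attended, labels
```

In line with note 4 above, R and r could be restricted to the nearest classified neighbors of each candidate to keep the linear solve small; the full-matrix version is shown only for clarity.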
4. Experiments
4.1. Visual search experiments
The first set of experiments uses the Columbia Object Image Library COIL-100 [12], containing 100 gray level images of various objects. We can think about these images as candidates extracted from some larger image, or refer to a mosaic of them, denoted columbia100 (figure 1), as the image we look at.
Figure 1: columbia100 - test image constructed from the COIL-100 database images.

The extracted features are Gaussian derivatives, based on the filters described in the work of Rao and Ballard [14].
We use two first-order Gaussian derivatives (at 0 and 90 degrees), three second-order Gaussian derivatives (at 0, 60 and 120 degrees), and four third-order Gaussian derivatives (at 0, 45, 90 and 135 degrees). Each of them is applied at 5 scales, giving a total of 45 filters and resulting in a feature vector of length 45 per candidate. The Euclidean metric is used for measuring the distance between two feature vectors. We model the dependency of the covariance on the feature-space distance using

cov(l(x_i), l(x_j)) = γ(d(x_i, x_j)) = μ(1 - μ)e^{-d(x_i, x_j)}

where l(x_i) is the label of candidate x_i, d(x_i, x_j) is the feature-space distance between x_i and x_j, and μ = (expected number of targets) / (number of candidates). Taking toy cars as the target category (10 out of the 100 candidates are cars), we perform all possible runs of the linear estimation based algorithm. (There are 100 different possible runs, each starting from a different candidate.) It is possible to stop the search at any time (for instance after a certain number of targets are detected). For the sake of this study, we do not stop the run, and let the algorithm scan all the candidates. See results in figure 2(a). Still using the columbia100 image, we repeat the experiment for the case where the target category is cups, and for the case where the target category is toy animals. Results are described in figure 2(b) and (c).
The next two experiments use different input images, a different method for candidate selection, and a different type of feature vector. The two images, the elephants image and the parasols image, are taken from the Berkeley hand segmented database [8]. See figure 3 for the two images and the hand segmentations we chose to use. Segments under a certain size are ignored, leaving us with 24 candidates in the elephants image and 30 candidates in the parasols image. The targets are the segments containing elephants and parasols, respectively. For these color images we use color histograms as feature vectors. In each segment (candidate), we extract the values r/(r+g+b) and b/(r+g+b) from each pixel, where r, g, and b are the values from the RGB representation of a color image. Each of these two dimensions is divided into 8 bins. This results in a 2D histogram with 64 bins, i.e., a feature vector of length 64. Again, we use the Euclidean metric as the distance measure. (We also tried other histogram comparison methods, such as the ones suggested by Swain and Ballard in [16], but the results were similar.) The results are shown in figure 4.
In all the above experiments the candidates of each task can be divided into a few weakly homogeneous groups (or clusters), which seems to be the situation in most natural scenes as well. In order to compare the results of the experiments qualitatively, we focus on the two following difficulty factors: d(T,NT), the relative feature-space distance between targets and non-targets, and d(T,T), the relative feature-space distance between targets and targets. The qualitative search behavior is summarized in Table 1.
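For the Berkeley images, the color-histogram feature described above can be sketched as follows (our own illustration; the histogram normalization is an assumption we add so that segments of different sizes remain comparable under the Euclidean distance).

```python
# 64-bin chromaticity histogram per candidate segment: each pixel contributes
# r/(r+g+b) and b/(r+g+b), quantized into an 8x8 two-dimensional histogram.
import numpy as np


def chromaticity_histogram(rgb_pixels: np.ndarray, bins: int = 8) -> np.ndarray:
    """rgb_pixels: N x 3 array holding the RGB values of the pixels of one segment."""
    rgb = rgb_pixels.astype(float)
    s = rgb.sum(axis=1)
    s[s == 0] = 1.0                              # guard against division by zero on black pixels
    r_chrom = rgb[:, 0] / s
    b_chrom = rgb[:, 2] / s
    hist, _, _ = np.histogram2d(r_chrom, b_chrom, bins=bins, range=[[0, 1], [0, 1]])
    hist = hist.ravel()                          # feature vector of length bins*bins = 64
    return hist / hist.sum()                     # normalization is our addition, not from the paper
```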
Figure 2: Linear estimation algorithm results for the columbia100 image. (In each panel the horizontal axis is time and the vertical axis is the number of targets found.) The solid lines describe one typical run. Other runs, starting each time from a different candidate, are cumulatively described by the size of the gray spots as a distribution in the (time, number of targets found) space. In (a) the cars are the targets; the first car is quite hard to find, but once it is found the rest are detected fast. In (b) the cups are the targets. It is easy to find the first cup since most cups are different from non-targets. Most cups resemble each other and follow pretty fast, but there are two cups (one without a handle and one with a special pattern) that are different from the rest of the cups and are found rather late. In (c) the toy animals are the targets. It is easy to find the first animal since one animal differs significantly from all non-targets. There is no special resemblance between the different toy animals, therefore the rest of the search behaves similarly to a random search procedure.

Figure 3: The elephants and parasols images taken from the Berkeley hand segmented database and the segmentations we have chosen to use in our experiments (color images).

Figure 4: Linear estimation algorithm results for the Berkeley hand segmented color images. The axes are as in figure 2. The solid lines describe one typical run. Other runs, starting each time from a different candidate, are cumulatively described by the size of the gray spots as a distribution in the (time, number of targets found) space. In (a) finding the first elephant is of medium difficulty since the color of the elephants is not too salient. Once the first is found, the rest of the elephants, having a similar color, follow. In (b) all the parasols are detected very fast, since their color is similar and differs from that of all other candidates.
Table 1: Qualitative search performance dependencies on the target-target similarity and target-nontarget similarity.

            d(T,NT)        d(T,T)                 Finding first target   Finding successive targets
cars        small          small                  hard                   easy
cups        big            some small, some big   easy                   easy to find some, hard to find all
animals     big            big                    easy                   hard
elephants   intermediate   small                  easy                   intermediate
parasols    big            small                  easy                   easy
4.2. Covariance vs. distance experiments
The linear estimation based search relies on a given covariance between the labels of every pair of candidates. We use a covariance function that depends only on the feature-space distance, and we claim that for many search tasks this function is monotonically descending. In this section we study the behavior of the label covariance vs. the feature-space distance for these search tasks, and check whether the choice made in section 4.1 is justified. Given a search task, i.e., the set of candidates and their labels, we compute the distances between every pair of candidates. The distances are sorted and divided into h intervals. The label covariance γ_i, i = 1, ..., h, is calculated separately for every interval I_i:

γ_i = cov_i(l_1, l_2) = E(l_1 l_2) - E(l_1)E(l_2) ≈ (n_TT · 1 · 1 + n_TN · 1 · 0 + n_NN · 0 · 0) / (n_TT + n_TN + n_NN) - μ^2 = n_TT / (n_TT + n_TN + n_NN) - μ^2

where n_TT, n_TN, and n_NN are the numbers of target-target, target-nontarget, and nontarget-nontarget pairs in interval I_i, respectively, and μ = E(l(x)) is the total number of targets divided by the total number of candidates. The estimated covariances are plotted as a function of distance in figure 5. (Non-uniform intervals containing an equal number of target-target pairs result in less erratic estimates.) This procedure was performed on each of the search tasks described in section 4.1, and the results are displayed in figure 5. Except for the task of finding toy animals in the columbia100 image (in which there is almost no resemblance between the targets), all the results show a monotonically descending behavior. Note the different range of the estimated covariances in each case. For instance, compare the results for the cars targets with those for the cups targets: in both cases the covariance decreases monotonically with distance, but the covariance for short distances is about 0.3 for the cars and about 0.004 for the cups. A short sketch of this estimation procedure is given below.

Figure 5: Estimation of the label covariance vs. feature-space distance for (a) the cars, (b) the cups, (c) the animals, (d) the parasols, and (e) the elephants tasks. For each distance interval I_i for which we estimated the covariance γ_i, the point (d_i, γ_i) is plotted, where d_i is the center of I_i.
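The estimation above can be sketched as follows (an illustration with our own names; for simplicity the intervals here hold an equal number of pairs overall, whereas the paper prefers intervals with an equal number of target-target pairs).

```python
# Estimate the label covariance per distance interval:
# gamma_i ~= n_TT / (n_TT + n_TN + n_NN) - mu^2 for the pairs in interval I_i.
import numpy as np


def covariance_vs_distance(X: np.ndarray, labels: np.ndarray, h: int = 10):
    n = len(labels)
    mu = labels.mean()                                    # fraction of targets among the candidates
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    dists = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
    tt = np.array([labels[i] * labels[j] for i, j in pairs])   # 1 only for target-target pairs
    order = np.argsort(dists)
    centers, gammas = [], []
    for idx in np.array_split(order, h):                  # h intervals along the sorted distances
        if len(idx) == 0:
            continue
        centers.append(dists[idx].mean())                 # representative distance d_i of the interval
        gammas.append(tt[idx].mean() - mu ** 2)           # E(l1*l2) - E(l1)E(l2) within the interval
    return np.array(centers), np.array(gammas)
```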
5. Discussion and possible extensions
In this work we proposed a new mechanism for efficient visual search in computer vision. Unlike most previous approaches, we do not try to model the HVS attention mechanism but rather set a quantitative goal: minimizing the expected search time. We propose a (greedy and thus approximate) algorithm relying on stochastic modelling. Experiments show the efficiency of the visual search procedure and validate the stochastic assumptions. As mentioned, the proposed visual search mechanism is strongly related to Duncan and Humphreys' model for visual search in the HVS [2], as both works focus on the benefits of inner-scene similarities for search efficiency. The main difference is that, contrary to Duncan and Humphreys' model, which is explicitly qualitative, we aim here at an en-
gineering goal, leading us to a quantitative approach using a quantification of the similarities and estimation techniques. This allows us not only to optimize the search but also to, say, quantitatively predict the influence of changes in non-target appearance. It is interesting, but probably not surprising, that the proposed model shares some search characteristics with Duncan and Humphreys' model: the continuous influence of non-target heterogeneity on the search time and the explanation of pop-out behavior without feature distinction. In a sense, our work can be considered a quantification of their model. As we did not aim to model HVS search, we did not check whether our model predicts human behavior. We expect it to be less accurate than models designed for modelling the HVS, and in particular the SERR model proposed by Humphreys and Muller [3], which relies on Duncan and Humphreys' model, uses a hierarchical network and a Boltzmann machine, and was able to match reaction times measured in psychophysics experiments. The search mechanism described here is partial, as in this initial stage we put aside saliency and top-down knowledge and focused only on the benefits of using inner-scene similarity. We show that in many cases this sort of information can help speed up the search. For example, the parasols search task described in section 4.1 is a good example of a case in which we may know that all our targets are similar and differ from the non-targets in a certain feature (or features), but we do not know the targets' value of this feature, so we do not have a pre-specified preference for a certain feature value. (In this case, we know all parasols have the same color, but we do not know which color.) We cannot use a bottom-up approach, giving priority to salient regions, since the targets cover a significant area of the image and therefore are not salient. The knowledge available in this case is based only on inner-scene relationships, and therefore the search strategy suggested in this paper is appropriate. More generally, considering a scene containing several objects of the same category, we argue that their sub-images are more similar than images of such objects taken at different times and places. This happens because the imaging conditions are more uniform and because the variability of objects is smaller in one place (e.g., two randomly chosen trees are more likely to be of the same type if they are taken from the same area). Needless to say, we intend to integrate saliency and knowledge about the target in our future work. Other possible extensions are the optimization of features, the development of candidate extraction methods, a generalization to non-binary responses of the oracle, and the prediction of search behavior.
References
[1] T.C. Callaghan. Interference and domination in texture segregation. In Visual Search: Proceedings of the First International Conference on Visual Search, pages 81–87, September 1988.
[2] J. Duncan and G.W. Humphreys. Visual search and stimulus similarity. Psychological Review, 96:433–458, 1989.
[3] G.W. Humphreys and H.J. Muller. Search via recursive rejection (SERR): A connectionist model of visual search. Cognitive Psychology, 25:43–110, 1993.
[4] L. Itti. Models of bottom-up and top-down visual attention. PhD thesis, January 2000.
[5] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. PAMI, 20(11):1254–1259, November 1998.
[6] B. Julesz. A brief outline of the texton theory of human vision. Trends in Neuroscience, 7(2):41–45, 1984.
[7] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4:219–227, 1985.
[8] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[9] S. Minut and S. Mahadevan. A reinforcement learning model of selective visual attention. In Proceedings of the Fifth International Conference on Autonomous Agents, pages 457–464, Montreal, Canada, 2001. ACM Press.
[10] K. Nakayama and G.H. Silverman. Serial and parallel processing of visual feature conjunctions. Nature, 320:264–265, 1986.
[11] U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, New York, 1967.
[12] S. Nene, S. Nayar, and H. Murase. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, Department of Computer Science, Columbia University, February 1996.
[13] A. Papoulis and S.U. Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, NY, USA, fourth edition, 2002.
[14] R.P.N. Rao and D.H. Ballard. An active vision architecture based on iconic representations. Artificial Intelligence, 78(1–2):461–505, 1995.
[15] R.D. Rimey and C.M. Brown. Control of selective perception using Bayes nets and decision theory. International Journal of Computer Vision, 12:173–207, 1994.
[16] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7:11–32, 1991.
[17] H. Tagare, K. Toyama, and J.G. Wang. A maximum-likelihood strategy for directing attention during visual search. IEEE PAMI, 23(5):490–500, 2001.
[18] A. Treisman and G. Gelade. A feature integration theory of attention. Cognitive Psychology, 12:97–136, 1980.
[19] A. Treisman and S. Gormican. Feature analysis in early vision: Evidence from search asymmetries. Psychological Review, 95(1):15–48, 1988.
[20] J.K. Tsotsos. On the relative complexity of active versus passive visual search. IJCV, 7(2):127–141, 1992.
[21] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, and F.J. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78(1–2):507–545, 1995.
[22] L.E. Wixson and D.H. Ballard. Using intermediate objects to improve the efficiency of visual search. IJCV, 12(2–3):209–230, April 1994.
[23] J.M. Wolfe. Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 1(2):202–238, 1994.
[24] A.L. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.