Neural Information Processing-Letters and Reviews

Vol. 2, No. 1, January 2004

LETTER

Biologically Motivated Visual Attention System Using Bottom-up Saliency Map and Top-down Inhibition

Sang-Bok Choi
Dept. of Sensor Engineering, Kyungpook National University
1370 Sankyuk-Dong, Puk-Gu, Taegu 702-701 Korea
E-mail: [email protected]

Sang-Woo Ban and Minho Lee
School of Electronic and Electrical Engineering, Kyungpook National University
1370 Sankyuk-Dong, Puk-Gu, Taegu 702-701 Korea
E-mail: [email protected], [email protected]

(Submitted on November 18, 2003)

Abstract—In this paper, we propose a trainable selective attention model that can inhibit an unwanted salient area and focus only on an interesting area in a static natural scene. The proposed model combines a bottom-up saliency map model with a modified adaptive resonance theory (ART) network. The bottom-up saliency map model generates salient areas based on intensity, edge, color, and symmetry feature maps, and a human supervisor decides whether each selected salient area is important. If a selected area is not interesting, the ART network learns and memorizes that area and generates an inhibition signal so that the bottom-up saliency map model does not pay attention to areas with similar characteristics in the subsequent visual search process. Computer simulation results show that the proposed model successfully generates a plausible sequence of salient regions that does not include unwanted areas.

Keywords—Selective attention, bottom-up saliency map model, adaptive resonance theory network, trainable selective attention

1. Introduction

The human eye can focus on an attentive location in an input scene and select an interesting object for further processing in the brain. These mechanisms are very effective in processing high-dimensional data of great complexity. If we apply a human-like selective attention function to an active vision system, an efficient and intelligent active vision system can be developed.

Considering the human-like selective attention function, top-down or task-dependent processing can affect how a saliency map is determined, as well as bottom-up or task-independent processing [1]. In a top-down manner, the human visual system determines salient locations through perceptive processing such as understanding and recognition. It is well known that this perceptual mechanism is one of the most complex activities in our brain; moreover, top-down processing is so subjective that it is very difficult to model the processing mechanism in detail. With bottom-up processing, on the other hand, the human visual system determines salient locations from features based on the basic information of an input image, such as intensity, color, and orientation [1]. Bottom-up processing can be considered a function of primitive selective attention in the human visual system, since humans selectively attend to salient areas according to the various stimuli in the input scene.

In previous work, Itti and Koch introduced a brain-like model to generate a saliency map. Based on Treisman's results [2], they used three types of bases (intensity, orientation, and color information) to construct a saliency map of a natural scene [1]. Koike and Saiki proposed that a stochastic WTA enables a saliency-based search model in which variation of the relative saliency changes search efficiency through stochastic shifts of attention [3]. In a hierarchical selectivity mechanism, Sun and Fisher integrated visual salience from bottom-up groupings and a top-down attentional setting [4]. Ramström and Christensen calculated saliency with respect to a given task using a multi-scale pyramid and multiple cues; their saliency computations were based on game-theory concepts, specifically a coalitional game [5].

However, the weight values of the feature maps used to construct the saliency map in these models are still determined artificially. In addition, none of these models interacts with the environment, so they cannot confidently determine whether a selected area is really interesting.

On the other hand, Barlow suggested that our visual cortical feature detectors might be the end result of a redundancy reduction process [6], in which the activation of each feature detector is supposed to be as statistically independent from the others as possible. We suppose that the saliency map is one result of the redundancy reduction performed by our brain, and that the scan path, a sequence of salient locations, may reflect the brain's drive toward information maximization. Bell and Sejnowski showed, using independent component analysis (ICA), that redundancy reduction of natural scenes yields edge filters [7]. Buchsbaum and Gottschalk found opponent coding to be the most efficient way to encode human photoreceptor signals [8]. Wachtler and Lee applied ICA to hyperspectral color images and derived a color-opponent basis from the analysis of trichromatic image patches [9]. It is well known that our retina performs preprocessing such as cone-opponent coding and edge detection [10], and that the extracted information is delivered to the visual cortex through the lateral geniculate nucleus (LGN). Symmetry is also an important feature for determining a salient object, and it is related to the function of the LGN. Our bottom-up saliency map model therefore considers the preprocessing performed by cells in the retina and the LGN, with an on-set and off-surround mechanism, before the redundancy reduction in the visual cortex. The saliency map is constructed by integrating the feature maps and applying ICA, which is the best way to reduce redundancy.

Using the bottom-up saliency map model, we can obtain a sequence of salient areas. However, the bottom-up saliency map model may select an unwanted area, because it generates salient areas based only on primitive features such as intensity, edge, color, and symmetry. Human beings, in contrast, can learn and memorize the characteristics of an unwanted area and inhibit attention to that area in subsequent visual searches. In this paper, we propose a new selective attention model that mimics such a human-like selective attention mechanism, combining a truly bottom-up process with an interactive process that skips unwanted areas in the subsequent visual search.

To implement the trainable selective attention model, we use the bottom-up saliency map model in conjunction with a modified adaptive resonance theory (ART) network. It is well known that the ART model maintains the plasticity required to learn new patterns while preventing the modification of patterns that have been learned previously [11]. The characteristics of an unwanted salient area selected by the bottom-up saliency map model are therefore used as input data for the ART model, which learns and generalizes the features of unwanted areas in a natural scene. In the training process, the ART network learns about uninteresting areas that are identified interactively by a human supervisor, which differs from the conventional ART network.
In the testing mode, the vigilance parameter in the ART network determines whether a new input area is interesting, because the ART network memorizes the characteristics of the unwanted salient areas. If the vigilance value is larger than a threshold, the modified ART network generates a signal to inhibit the selected area in the bottom-up saliency map model so that the area will be ignored in the subsequent visual search process.

Section 2 describes the biological background of the proposed model. In Section 3, we explain our bottom-up saliency map model. Section 4 presents the proposed trainable selective attention model using the ART network. Simulation results and our conclusions follow.

2. Biological Background

In the vertebrate retina, three types of cells are important processing elements for performing edge extraction: photoreceptors, horizontal cells, and bipolar cells [12][13]. Horizontal cells spatially smooth the transformed optical signal, while bipolar cells yield the differential signal, i.e., the difference between the optical signal and the smoothed signal. The edge signal is thus detected at the output of the bipolar cells, and this edge information is delivered to the visual cortex through the ganglion cells and the LGN.

On the other hand, a neural circuit in the retina creates opponent cells from the signals generated by the three types of cone receptors [10]. The R+G- cell receives inhibitory input from the M cone and excitatory input from the L cone; its opponent response arises from these opposing inputs. The B+Y- cell receives inhibitory input from the summed M and L cone signals and excitatory input from the S cone. These preprocessed signals are transmitted to the LGN through the ganglion cells, and the on-set and off-surround mechanism of the LGN and the visual cortex intensifies the opponency [10]. Additionally, the LGN plays a role in detecting the shape and pattern of an object [10].

In general, the shape or pattern of an object carries symmetrical information, and as a result symmetry is one of the most important features for constructing a saliency map. Even though the role of the visual cortex in finding a salient region is important, it is very difficult to model its detailed functions. Following Barlow's hypothesis, we simply regard the role of the visual cortex as redundancy reduction. Moreover, humans can inhibit attention to an area even when primitive features such as intensity, edge, and color are dominant there, and can memorize those features in order to skip the unwanted area during successive visual searches.
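To make this description concrete, the sketch below (in Python, using NumPy and SciPy) approximates the two preprocessing steps described above: the bipolar-cell edge signal as the difference between the intensity signal and its horizontal-cell-like smoothed version, and the R+G- and B+Y- opponent channels with the image's R, G, B values standing in for the L, M, S cone responses. The function name, the Gaussian smoothing, and the cone approximation are illustrative assumptions rather than the circuit model used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def retina_preprocess(rgb, sigma=2.0):
    """Sketch of the retinal preprocessing described above.

    Edge signal: bipolar-cell-style difference between the intensity signal
    and its horizontal-cell-like smoothed version.
    Opponent signals: R+G- and B+Y- channels, with R, G, B standing in for
    the L, M, S cone responses (an approximation, not the real cone spectra).
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    intensity = (r + g + b) / 3.0
    smoothed = gaussian_filter(intensity, sigma)  # horizontal cells: spatial smoothing
    edge = intensity - smoothed                   # bipolar cells: differential signal

    rg = r - g                # R+G-: excitatory L (R), inhibitory M (G)
    by = b - (r + g) / 2.0    # B+Y-: excitatory S (B), inhibitory yellow (L + M)
    return intensity, edge, rg, by
```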

3. Bottom-up Saliency Map Model

Fig. 1 (a) shows the bottom-up SM model. In order to model the human-like bottom-up visual attention mechanism, we use four bases, edge (E), intensity (I), color (RG and BY), and symmetry (Sym), shown in Fig. 1 (a), in which the roles of the retina cells and the lateral geniculate nucleus (LGN) are reflected as in the previously proposed attention model [4]. The feature maps (I, E, Sym, and C) are constructed by center-surround difference and normalization (CSD & N) of the four bases, which mimics the on-center and off-surround mechanism in the human brain; they are then integrated by an independent component analysis (ICA) algorithm that models the role of the primary visual cortex in redundancy reduction [5]. The reason we use ICA is that it is the best way to reduce redundancy [6]. A detailed description of the bottom-up saliency map (SM) model is found in [4].

Fig. 1 (b) shows the experimental results for a complex natural image. The preprocessed feature maps (I, E, Sym, and C) obtained from a color image are convolved with the ICA filters to construct a saliency map. First, we compute the most salient region. Then an appropriate focus area centered at the most salient location is masked off, and the next salient location in the input image is calculated from the saliency map. This means that a previously selected salient location is not considered a second time. Fig. 1 (b) also shows the successive salient regions.
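As a rough illustration of one common way to realize the CSD & N stage described above, the sketch below forms an on-center/off-surround response for a single base channel by subtracting a coarse-scale Gaussian blur (surround) from a fine-scale one (center) and normalizing the result. The particular scales and the min-max normalization are our assumptions, not necessarily those used by the authors.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def csd_and_n(base, center_sigma=1.0, surround_sigma=4.0):
    """Center-surround difference and normalization (CSD & N) for one base
    channel (e.g. intensity, edge, symmetry, RG or BY).

    The on-center/off-surround response is approximated by subtracting a
    coarse (surround) Gaussian blur from a fine (center) one; the result is
    then normalized to [0, 1]. The scale choices are assumptions.
    """
    base = np.asarray(base, dtype=np.float64)
    center = gaussian_filter(base, center_sigma)
    surround = gaussian_filter(base, surround_sigma)
    fmap = np.abs(center - surround)
    rng = fmap.max() - fmap.min()
    return (fmap - fmap.min()) / rng if rng > 0 else fmap
```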

Figure 1. Bottom-up saliency map model. (a) The architecture of the bottom-up saliency map model. (b) Experimental result of the proposed saliency map model on a natural image: the four generated feature maps, the saliency map, and the successive salient regions. I: intensity feature, E: edge feature, Sym: symmetry feature, RG: red-green opponent coding feature, BY: blue-yellow opponent coding feature, CSD & N: center-surround difference and normalization, I: intensity feature map, E: edge feature map, Sym: symmetry feature map, C: color feature map, ICA: independent component analysis, SM: saliency map, Max: max operator.
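The integration and scan-path steps summarized in Fig. 1 can likewise be sketched in code: each feature map is convolved with its ICA filters and the rectified responses are summed into a saliency map, after which salient locations are selected one at a time, masking off a focus area around each selection so that it cannot be chosen again. The filter data, the summation rule, and the square mask are hypothetical placeholders for illustration, not the authors' exact procedure.

```python
import numpy as np
from scipy.ndimage import convolve


def saliency_map(feature_maps, ica_filters):
    """Combine feature maps into a saliency map by convolving each map with
    its ICA filters and summing the rectified responses.

    `ica_filters[i]` is assumed to be a list of 2-D filters learned by ICA
    for the i-th feature map; the summation rule is an assumption.
    """
    sm = np.zeros_like(feature_maps[0], dtype=np.float64)
    for fmap, filters in zip(feature_maps, ica_filters):
        for w in filters:
            sm += np.abs(convolve(fmap, w, mode="nearest"))
    return sm


def scan_path(sm, n_regions=5, radius=20):
    """Select the first `n_regions` salient locations, masking off a square
    focus area around each one so it is not selected a second time
    (the square mask and its radius are assumptions)."""
    sm = sm.copy()
    path = []
    for _ in range(n_regions):
        y, x = np.unravel_index(np.argmax(sm), sm.shape)
        path.append((y, x))
        sm[max(0, y - radius):y + radius + 1, max(0, x - radius):x + radius + 1] = 0.0
    return path
```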

4. Trainable Selective Attention Model

Although the proposed bottom-up saliency map model generates plausible salient areas, the selected areas may not be interesting to humans, because the saliency map uses only primitive features such as intensity, edge, color, and symmetry information. In order to implement a more plausible selective attention model, we need an interactive procedure together with the bottom-up information processing. Human perception ignores an uninteresting area even if it has salient primitive features, and can memorize the characteristics of an unwanted area; we do not attend to a new area with characteristics similar to a previously learned unwanted area. We propose a new selective attention model that mimics such a human-like selective attention mechanism by considering not only the primitive input features but also interaction with the environment.

Moreover, the human brain can learn and memorize many new things without catastrophic forgetting of existing ones. It is well known that the ART network can easily be trained on additional input patterns and can solve the stability-plasticity dilemma that affects other neural network models. Therefore, we use the ART network together with the bottom-up saliency map model to implement a trainable selective attention model that can interact with a human supervisor. During the training process, the ART network learns and memorizes the characteristics of the uninteresting areas selected by the bottom-up saliency map model; the uninteresting areas are identified by the human supervisor. After successful training of the ART network, an unwanted salient area is inhibited according to the vigilance value of the ART network. Fig. 2 shows the architecture of the trainable attention model during the training process.

Figure 2. The architecture of the proposed trainable selective attention model during training mode (I: intensity feature map, E: edge feature map, Sym: symmetry feature map, C: color feature map, ICA: independent component analysis, SM: saliency map). Square blocks 1 and 3 in the SM are interesting areas, but block 2 is an uninteresting area.

In Fig. 2, the attention area obtained from the saliency map is input to the ART model, and a supervisor then decides whether it is interesting or unwanted. If the selected area is unwanted, even though it has salient features, the ART model learns and memorizes that area. However, if the selected area is interesting, it is not involved in training the ART network. In the proposed model we use the ART 1 model, so all inputs to the ART network are transformed into binary vectors.
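As a minimal sketch of this training mode, the code below implements a simplified ART-1-style matcher with fast learning: an attention patch is thresholded into a binary vector, and only patches rejected by the supervisor are learned, either refining an existing category prototype or creating a new one. The vigilance value, the binarization scheme, and the `supervisor_says_unwanted` callback are illustrative assumptions, not the exact modified ART network of the paper.

```python
import numpy as np


class ART1:
    """Simplified ART-1-style matcher with fast learning (an illustrative
    sketch, not the paper's exact modified ART network)."""

    def __init__(self, vigilance=0.75):
        self.rho = vigilance
        self.prototypes = []  # one binary prototype per learned category

    def match(self, x):
        """Return (index, match value) of the best-matching category for the
        binary input vector x, or (-1, 0.0) if nothing has been stored."""
        best, best_match = -1, 0.0
        for j, w in enumerate(self.prototypes):
            m = np.logical_and(x, w).sum() / max(x.sum(), 1)
            if m > best_match:
                best, best_match = j, m
        return best, best_match

    def train(self, x):
        """Learn the binary pattern x of an unwanted attention area."""
        j, m = self.match(x)
        if j >= 0 and m >= self.rho:
            self.prototypes[j] = np.logical_and(x, self.prototypes[j])  # refine prototype
        else:
            self.prototypes.append(np.asarray(x, dtype=bool).copy())    # new category


def binarize_patch(patch, n_bits=256):
    """Turn an attention area into a fixed-length binary vector by
    thresholding at its mean (an assumed encoding)."""
    flat = np.resize(np.asarray(patch, dtype=np.float64), n_bits)
    return flat > flat.mean()


# Training mode (usage sketch): only supervisor-rejected salient areas are learned.
# art = ART1(vigilance=0.75)
# for patch in salient_patches:               # patches from the saliency scan path
#     if supervisor_says_unwanted(patch):     # hypothetical interactive decision
#         art.train(binarize_patch(patch))
```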

Figure 3. The architecture of the proposed trainable selective attention model during testing mode (I: intensity feature map, E: edge feature map, Sym: symmetry feature map, C: color feature map, ICA: independent component analysis, SM: saliency map). Square blocks 1 and 3 in the SM are interesting areas, but block 2 is an uninteresting area.

Fig. 3 shows the architecture of the proposed model during the testing mode. After training on the unwanted salient areas has finished successfully, the ART network has memorized the characteristics of the unwanted areas. If a salient area selected by the bottom-up saliency map model in a test image has characteristics similar to those in the ART memory, it is inhibited in the saliency map model and therefore ignored. In the proposed model, the vigilance value of the ART model is used as the decision parameter that determines whether the selected area is interesting. When an unwanted salient area is input to the ART model, the vigilance value is higher than the threshold, which means that the area has characteristics similar to the trained unwanted areas; the ART model therefore inhibits such unwanted salient areas so that no attention is given to them. In contrast, when an interesting salient area is input to the ART model, the vigilance value is lower than the threshold, which means that such an area has not been trained and is an interesting attention area. As a result, the proposed model can focus on a desired attention area while not focusing on a salient area with unwanted features.
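Under the same assumptions as the Section 4 sketch (the hypothetical ART1 class and binarize_patch helper), the testing-mode decision can be written as a single comparison: a salient area whose match with a memorized unwanted-area category reaches the vigilance-like threshold is inhibited, otherwise it is passed on as an attention area.

```python
def attend_or_inhibit(patch, art, threshold=0.75):
    """Testing-mode decision sketch: inhibit the selected salient area if it
    matches a memorized unwanted-area category at or above the threshold,
    otherwise attend to it. Reuses the hypothetical ART1 and binarize_patch
    sketches from Section 4."""
    _, match_value = art.match(binarize_patch(patch))
    return "inhibit" if match_value >= threshold else "attend"


# Usage sketch: filter a bottom-up scan path so that unwanted areas are skipped.
# attended = [p for p in salient_patches if attend_or_inhibit(p, art) == "attend"]
```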

5. Computer Simulation and Results

Fig. 4 shows the simulation process of the proposed trainable selective attention model during the training process. The numbers in Fig. 4 represent the sequence of the scan path according to the degree of saliency. In Fig. 4, the 5th salient area is judged an unwanted salient region by the human supervisor, and the ART network learns this uninteresting 5th salient area through the modification of its weights, or long-term memory (LTM) traces. The four other, interesting, salient areas are not used to train the ART network.

Fig. 5 shows the simulation results of the proposed trainable selective attention model during the testing mode, after the training process has finished successfully. In Fig. 5, t_k^i and t_k^o represent the input and output attention sequences at the k-th salient area, respectively. The 5th selected salient area has a high vigilance value because it was already trained and memorized in the weights of the ART network, as shown in Fig. 4. Therefore, Fig. 5 shows that the uninteresting salient area at t_5^o is not excited as an attention area, even though it is selected as one of the salient areas by the bottom-up saliency map model.

Figure 4. Simulation example of the proposed trainable selective attention model during the training process

Fig. 6 shows the simulation results for several different test images that were not used in the training process. The proposed trainable selective attention model shows reasonable performance for many natural scenes. In Fig. 6, the first column shows the input natural scenes, the second column shows the attention areas selected using only the bottom-up saliency map model, and the last column shows the attention areas selected using the proposed trainable selective attention model. The results in Fig. 6 show that our proposed model can focus on more reasonable salient areas, in a manner similar to the human visual system.

Figure 5. Simulation example of the proposed trainable selective attention model during the testing process

Figure 6. Comparison of the simulation results obtained using only the bottom-up saliency map model and using the ART network together with the bottom-up saliency map model, for three test scenes (a)-(c). The first column: input natural scene; the second column: attention areas selected by the bottom-up saliency map model; the last column: attention areas selected by the proposed trainable selective attention model.

6. Conclusion

We have proposed a trainable selective attention model that can inhibit an unwanted salient area and focus only on an interesting area in a static natural scene. The proposed model was implemented using an adaptive resonance theory (ART) network in conjunction with a biologically motivated bottom-up saliency map model. Computer simulation results show that the proposed method generates reasonable salient regions and a scan path that does not give attention to unwanted areas.

Acknowledgment

This research was supported by the Brain Neuroinformatics Research Program of the Korean Ministry of Science and Technology and by grant No. R05-2003-000-11399-0 (2003) from the Basic Program of the Korea Science & Engineering Foundation.

References

[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Patt. Anal. Mach. Intell., Vol. 20, No. 11, pp. 1254-1259, 1998.
[2] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, Vol. 12, No. 1, pp. 97-136, 1980.
[3] T. Koike and J. Saiki, "Stochastic Guided Search Model for Search Asymmetries in Visual Search Tasks," BMCV 2002, LNCS 2525, pp. 408-417, 2002.
[4] Y. Sun and R. Fisher, "Hierarchical Selectivity for Object-Based Visual Attention," BMCV 2002, LNCS 2525, pp. 427-438, 2002.
[5] O. Ramström and H. I. Christensen, "Visual Attention Using Game Theory," BMCV 2002, LNCS 2525, pp. 462-471, 2002.
[6] H. B. Barlow and D. J. Tolhurst, "Why do you have edge detectors?," Optical Society of America Technical Digest, Vol. 23, p. 172, 1992.
[7] A. J. Bell and T. J. Sejnowski, "The independent components of natural scenes are edge filters," Vision Research, Vol. 37, pp. 3327-3338, 1997.
[8] G. Buchsbaum and A. Gottschalk, "Trichromacy, opponent colours coding and optimum colour information transmission in the retina," Proc. R. Soc. London Ser. B, Vol. 220, pp. 89-113, 1983.
[9] T. Wachtler, T. W. Lee, and T. J. Sejnowski, "Chromatic structure of natural scenes," J. Opt. Soc. Am. A, Vol. 18, No. 1, 2001.
[10] E. B. Goldstein, Sensation and Perception, 4th edn., An International Thomson Publishing Company, USA, 1995.
[11] P. D. Wasserman, Neural Computing: Theory and Practice, Van Nostrand Reinhold International Company Limited, Australia, 1989.
[12] E. Majani, R. Erlanson, and Y. Abu-Mostafa, The Eye, Academic, New York, 1984.
[13] S. W. Kuffler, J. G. Nicholls, and J. G. Martin, From Neuron to Brain, Sinauer Associates, Sunderland, 1984.
[14] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison-Wesley Publishing Company, USA, 1993.
[15] T. W. Lee, Independent Component Analysis: Theory and Applications, Kluwer Academic Publishers, 1998.
[16] K. S. Seo, C. J. Park, S. H. Cho, and H. M. Choi, "Context-Free Marker-Controlled Watershed Transform for Efficient Multi-Object Detection and Segmentation," IEICE Trans. Fundamentals, Vol. E84-A, No. 6, pp. 1066-1074, 2001.

Sang-Bok Choi is currently a Ph.D. candidate in the Department of Sensor Engineering, Kyungpook National University, Taegu, Korea. His research interests include biologically motivated active vision systems, intelligent sensor systems, pattern recognition techniques, fingerprint recognition systems, and embedded systems.

Sang-Woo Ban is currently a Ph.D. candidate in the School of Electronic and Electrical Engineering, Kyungpook National University, Taegu, Korea. His research interests include brain science and engineering, intelligent sensor systems, neural networks, pattern recognition techniques, and biologically motivated active vision systems.

Minho Lee graduated from the Korea Advanced Institute of Science and Technology in 1995 and is currently a professor in the School of Electronic and Electrical Engineering, Kyungpook National University, Taegu, Korea. His research interests include active vision systems based on human eye movements, selective attention, independent component analysis, active noise control, and intelligent sensor systems. (Home page: http://abr.knu.ac.kr)
