Selective Attention for Robust Recognition of Noisy and Superimposed Patterns

Soo-Young Lee
Brain Science Research Center, Korea Advanced Institute of Science & Technology, Yusong-gu, Taejon 305-701, Korea

Michael Mozer
Department of Computer Science, University of Colorado at Boulder, Boulder, CO, USA
Abstract

Based on the “early selection” filter model, a new selective attention algorithm is developed to improve recognition performance for noisy patterns and superimposed patterns. The selective attention algorithm incorporates the error backpropagation rule to adapt the attention filters for a test input pattern and an attention cue for a specific class. For superimposed test patterns an attention-switching algorithm is also developed to recognize both patterns one by one. The developed algorithms demonstrated much higher recognition rates for corrupted digit recognition tasks.
1 Introduction
In many pattern recognition tasks, test input patterns may be corrupted by noise or background images, which greatly reduces recognition accuracy. For small perturbations, the robustness of neural classifiers may be improved by enforcing low input-to-output mapping sensitivity during the learning process. [1] However, humans utilize selective attention to improve recognition accuracy even at very noisy cocktail parties. Therefore, it is natural to incorporate a selective attention mechanism into pattern recognition systems for noisy real-world applications.

Fortunately, psychologists have studied the selective attention mechanism extensively. [2]-[4] However, there is also controversy among the different theories, and only a few models are defined well enough to start engineering applications in the real world.

Only recently have pattern recognition researchers utilized the selective attention mechanism. Fukushima incorporated selective attention and attention-switching algorithms into his Neocognitron model, and demonstrated good recognition of superimposed binary numerals one by one. [5] However, the Neocognitron model has many unknown parameters that must be determined by heuristics or from psychological experiments, and its performance is sensitive to the parameter values. Also, its computational requirements are prohibitively expensive for many real-time applications. Recently Rao introduced a selective attention model based on Kalman filters and demonstrated classification of superimposed patterns. [6] However, his model uses linear systems, and a nonlinear extension is not straightforward.

In this paper a simple selective attention algorithm is developed based on the attenuating “early filter” concept and multilayer Perceptron neural classifiers, and tested on binary digit recognition tasks. Both noisy patterns and superimposed patterns are considered.
2 Psychological Attention Models
The modern theory of selective attention started from Broadbent’s pioneering book. [7] Broadbent observed that, although subjects could not recall most of the unattended channel of a dichotic tape, they often could recall the most recent several seconds of that channel. Therefore, he suggested that the brain temporarily retains information about all stimuli but that the information fades, and is neither admitted to the conscious mind nor encoded in a way that would permit later recollection, unless attention is turned quickly to a particular memory trace. This is known as the “early filtering” or “early selection” model.

Several important modifications of Broadbent’s model have been proposed. To account for the early findings of automatic analysis of unattended input, Treisman proposed that the filter merely attenuates the input rather than totally eliminating it from further analysis. [8] Broadbent also modified his model to allow stimulus selection on the basis of semantic properties and not just on the basis of physical properties. [9] This amendment was supported by later evidence, which showed that selection on the basis of physical cues is less effortful than selection on the basis of semantic cues. [10]

The most extreme alternative to Broadbent’s “early selection” theory is the “late selection” theory. Some theorists noticed that recognition of familiar objects proceeds unselectively and without any capacity limitations, and proposed that all information is completely analyzed automatically and that attention is used only to determine which of the analyzed information will enter into the subject’s response. [11] However, later experiments provided several pieces of counter-evidence against the “late selection” theory. Cowan suggested that the “selective filter” is actually intrinsic to the long-term memory activation process. [12] The activated memory is a subset of long-term memory, and the focus of attention is a subset of activated memory. Any stimulus or any feature in long-term memory could be selected by a voluntary attentional focus or spotlight. [13]

In summary, the attenuating “early filter” model is commonly agreed upon, but many unknown and controversial issues remain on how the filter is actually formed. We will just borrow this “early filter” concept, and develop an algorithm to find the filter by adaptation.
3 Selective Attention for Noisy Patterns
The network architecture for selective attention is shown in Figure 1. The dotted box is a standard neural network classifier, and an attention layer is added in front of the input layer. Although the approach is applicable to general feed-forward networks, a multi-layer Perceptron (MLP) with only one hidden layer is shown here for simplicity. Each input value $x_k$ is gated to the k’th input of the MLP by an attention gain or filtering coefficient $a_k$, which is usually set to 1 during the training phase. With the $a_k$ adapted for a training data set, this architecture has been investigated for improving the learning performance of given trained networks [14]-[16] and for speaker adaptation [17]. Here the attention gains are adapted during the test phase only.
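The gating described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the tanh activations, the bias terms, the random weights, and the function name `mlp_forward` are assumptions made for the example; only the layer sizes (256, 30, 10) and the gating $\hat{x}_k = a_k x_k$ come from the text.

```python
import numpy as np

def mlp_forward(x, a, W1, b1, W2, b2):
    """One-hidden-layer MLP with an attention layer in front of the inputs."""
    x_hat = a * x                    # attended input: x_hat_k = a_k * x_k
    h = np.tanh(W1 @ x_hat + b1)     # hidden layer (tanh is an assumption)
    y = np.tanh(W2 @ h + b2)         # output layer (tanh is an assumption)
    return x_hat, h, y

rng = np.random.default_rng(0)
N, K, M = 256, 30, 10                # input, hidden, and output sizes from the text
W1, b1 = rng.normal(scale=0.1, size=(K, N)), np.zeros(K)   # random stand-in weights
W2, b2 = rng.normal(scale=0.3, size=(M, K)), np.zeros(M)
x = (rng.random(N) < 0.16).astype(float)                   # stand-in binary pattern

# With all gains equal to 1 the attention layer is transparent.
x_hat_plain, _, y_plain = mlp_forward(x, np.ones(N), W1, b1, W2, b2)

# Gating the inputs is the same as scaling the columns of the first-layer weights.
a = rng.random(N)
_, _, y_gated = mlp_forward(x, a, W1, b1, W2, b2)
_, _, y_folded = mlp_forward(x, np.ones(N), W1 * a, b1, W2, b2)
```

The last two calls give the same output, because multiplying input k by $a_k$ is equivalent to multiplying every first-layer weight attached to input k by $a_k$.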
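[Figure 1 diagram: each input $x_k$ (k = 1, ..., N) is multiplied by an attention gain $a_k$ to give the attended input $\hat{x}_k$; the attended inputs feed the hidden units $h_1, \ldots, h_K$ through the weights $W^{(1)}$, and the hidden units feed the outputs $y_1, \ldots, y_M$ through the weights $W^{(2)}$.]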
Figure 1: Network architecture for selective attention

Unless the selective attention process is turned on, all attention gains are fixed at one and the attention module does not play any role. When the attention process is turned on for a specific output class, the attention gains are adapted to minimize the output error between the actual output and a target output specified by the attention class. In this case the first-layer synaptic weights multiplied by the attention gains may be considered as dynamic synapses. All other synaptic weights are frozen during this attention process.

The selective attention process is based on error backpropagation (EBP). When an attention class is introduced, the corresponding output target vector $\mathbf{t}^s \equiv [t_1^s \; t_2^s \; \cdots \; t_M^s]^T$ is defined. For bipolar binary output systems, $t_i^s = 1$ for the attention class and $-1$ for the others. Then the attention gains $a_k$ are adapted to minimize the output error $E^s \equiv \frac{1}{2} \sum_i (t_i^s - y_i)^2$ with the given input $\mathbf{x} \equiv [x_1 \; x_2 \; \cdots \; x_N]^T$ and pre-trained synaptic weights W. The update rule is based on a gradient-descent algorithm with error back-propagation. At the (n+1)’th iterative epoch, the attention gain $a_k$ is updated as

$$a_k[n+1] = a_k[n] - \eta \, (\partial E / \partial a_k)[n] = a_k[n] + \eta \, x_k \, \delta_k^{(0)}[n] \tag{1a}$$

$$\delta_k^{(0)} = \sum_j W_{jk}^{(1)} \delta_j^{(1)} \tag{1b}$$

where $E$, $\delta_j^{(1)}$, and $W_{jk}^{(1)}$ denote the attention output error, the j’th component of the back-propagated error at the first hidden layer, and the synaptic weight between the input $\hat{x}_k$ and the j’th neuron at the first hidden layer, respectively. Also, $\eta$ denotes a learning rate. The attention gains are thresholded to be in the range [0, 1].

The selective attention and classification process in the test phase is summarized as follows:

Step 1: Apply a test input pattern to the trained MLP and compute the output values.
Step 2: For each of the classes with the top m activation values,
  (1) Initialize all attention gains $a_k$ to 1 and set the target vector $\mathbf{t}^s$.
  (2) Apply the test pattern and attention gains to the network and compute the output.
  (3) Apply the selective attention algorithm in Eqs. (1) to adapt the attention gains.
  (4) Go to (2) until the attention process converges.
  (5) Compute an attention measure M from the converged attention gains.
Step 3: Select the class with the minimum attention measure M as the recognized class.

The attention measure is defined as

$$M \equiv D_I \, E_O \tag{2a}$$

$$D_I \equiv \sum_k (x_k - \hat{x}_k)^2 / 2N = \sum_k x_k^2 (1 - a_k)^2 / 2N \tag{2b}$$

$$E_O \equiv \sum_i [t_i - y_i(\hat{\mathbf{x}})]^2 / 2M \tag{2c}$$
where $D_I$ is the squared Euclidean distance between the input patterns before and after the selective attention, and $E_O$ is the output error after the selective attention. Here, $D_I$ and $E_O$ are normalized by the number of input pixels and the number of output classes, respectively. The superscript s for attention classes is omitted here for simplicity. To make the measure M a dimensionless quantity, one may normalize $D_I$ and $E_O$ by the input energy ($\sum_k x_k^2$) and the training output error, respectively. However, this does not affect the selection process in Step 3.

The attended input $\hat{\mathbf{x}}$ is actually the most probable input pattern in the vicinity of the test input pattern for the attended class, and the Euclidean distance between $\mathbf{x}$ and $\hat{\mathbf{x}}$ is a good measure of the classification confidence. In fact, $D_I$ is basically the same quantity minimized by Rao. [6] However, the MLP classifier in our model is capable of nonlinear mapping between the input and output patterns. The nearest-neighbor pattern matching method also finds the class with a minimum distance. Our model with the MLP classifier performs a similar function without large memory and computational requirements.

The proposed selective attention algorithm was tested on recognition of noisy numeral patterns. The numeral database consists of 48 sets of hand-written numerals collected from 48 people. Each set is composed of 10 numerals with 16x16 binary pixels. On average about 16 percent of the pixels are black and coded as 1, while the white pixels are coded as 0. Four experiments were conducted with different training sets of 280 training patterns each. A standard one-hidden-layer Perceptron was trained by the error backpropagation algorithm. The numbers of input, hidden, and output neurons were 256, 30, and 10, respectively.
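The test-phase procedure of this section (Steps 1 to 3 with Eqs. (1) and (2)) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the tanh units, the biases, the learning rate `eta`, the fixed iteration count `n_iter` (used in place of a convergence test), the random untrained weights, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def forward(x_hat, W1, b1, W2, b2):
    """One-hidden-layer MLP with tanh units (bipolar outputs in [-1, 1])."""
    h = np.tanh(W1 @ x_hat + b1)
    y = np.tanh(W2 @ h + b2)
    return h, y

def output_error(a, x, t, W1, b1, W2, b2):
    """E = 1/2 * sum_i (t_i - y_i)^2 for the gated input a * x."""
    _, y = forward(a * x, W1, b1, W2, b2)
    return 0.5 * np.sum((t - y) ** 2)

def input_delta(a, x, t, W1, b1, W2, b2):
    """delta^(0) of Eq. (1b): the error back-propagated to the gated inputs."""
    h, y = forward(a * x, W1, b1, W2, b2)
    delta2 = (t - y) * (1.0 - y ** 2)           # output-layer delta (tanh derivative)
    delta1 = (W2.T @ delta2) * (1.0 - h ** 2)   # first hidden-layer delta
    return W1.T @ delta1                        # sum_j W_jk^(1) * delta_j^(1)

def adapt_attention(x, t, W1, b1, W2, b2, eta=0.1, n_iter=200):
    """Adapt the gains with Eq. (1a); all synaptic weights stay frozen."""
    a = np.ones_like(x)                         # Step 2-(1): start from a_k = 1
    for _ in range(n_iter):                     # fixed count replaces a convergence test
        a = a + eta * x * input_delta(a, x, t, W1, b1, W2, b2)
        a = np.clip(a, 0.0, 1.0)                # keep the gains in [0, 1]
    return a

def attention_measure(a, x, t, W1, b1, W2, b2):
    """M = D_I * E_O, Eqs. (2a)-(2c)."""
    _, y = forward(a * x, W1, b1, W2, b2)
    d_in = np.sum(x ** 2 * (1.0 - a) ** 2) / (2 * x.size)
    e_out = np.sum((t - y) ** 2) / (2 * t.size)
    return d_in * e_out

def classify_with_attention(x, W1, b1, W2, b2, m=3):
    """Steps 1-3: try the top-m classes and keep the one with minimum M."""
    _, y = forward(x, W1, b1, W2, b2)           # Step 1
    candidates = np.argsort(y)[::-1][:m]        # classes with the top-m activations
    measures = []
    for c in candidates:                        # Step 2
        t = -np.ones_like(y)
        t[c] = 1.0                              # bipolar target for the attended class
        a = adapt_attention(x, t, W1, b1, W2, b2)
        measures.append(attention_measure(a, x, t, W1, b1, W2, b2))
    best = int(candidates[int(np.argmin(measures))])   # Step 3
    return best, candidates, measures

# Toy run with random (untrained) weights, only to show the call pattern.
rng = np.random.default_rng(0)
N, K, M_out = 256, 30, 10
W1, b1 = rng.normal(scale=0.1, size=(K, N)), np.zeros(K)
W2, b2 = rng.normal(scale=0.3, size=(M_out, K)), np.zeros(M_out)
x = (rng.random(N) < 0.16).astype(float)
label, candidates, measures = classify_with_attention(x, W1, b1, W2, b2)
```

Because the update in Eq. (1a) is proportional to $x_k$, the gains of white pixels ($x_k = 0$) never move from 1 in this sketch; only black pixels can be attenuated. With a trained classifier in place of the random weights, `label` would be the recognized class.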
Three noisy test patterns were generated from each training pattern by randomly flipping each pixel value with a probability $P_f$, and the 840 test patterns were fed to the network for classification. In Figure 2, false recognition rates are plotted as functions of the candidate number m. Results for 3 different pixel inversion probabilities, i.e., $P_f$ = 0.05, 0.1, and 0.15, are shown. Considering that on average 16% of the pixels in the data are black, the noisy input patterns with $P_f$ = 0.15 are equivalent to approximately 0 dB SNR. For each candidate number, the 4 false rates from the 4 different training data sets are marked with ‘o’, and the average values are connected by a solid line.
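The noise model and the SNR estimate above can be checked with a short sketch. The text does not define the SNR; the code assumes it is the ratio of signal energy per pixel (the fraction of black pixels, since pixels are 0 or 1) to noise energy per pixel (the flip probability, since each flip changes a pixel by 1). The function names and the random stand-in pattern are hypothetical.

```python
import numpy as np

def flip_pixels(x, p_flip, rng):
    """Invert each binary pixel independently with probability p_flip."""
    flips = rng.random(x.shape) < p_flip
    return np.where(flips, 1 - x, x)

def expected_snr_db(black_fraction, p_flip):
    """Assumed definition: 10*log10(signal energy / noise energy) per pixel."""
    return 10.0 * np.log10(black_fraction / p_flip)

rng = np.random.default_rng(0)
x = (rng.random((16, 16)) < 0.16).astype(int)        # stand-in 16x16 binary pattern
noisy = [flip_pixels(x, 0.15, rng) for _ in range(3)]  # three noisy copies per pattern
snr = expected_snr_db(0.16, 0.15)
```

Under this assumed definition, `snr` is about 0.3 dB for 16% black pixels and $P_f$ = 0.15, in line with the "approximately 0 dB" figure quoted above.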
The numeral data consist of 10 classes, and only the top 5 candidates are considered here. In fact, the performance does not improve with more than 3 top candidates. The false recognition rates with the top 3 candidates are much smaller than those with the top 1 candidate, i.e., without the selective attention process. The improvement is much greater for test data with lower false rates.

[Figure 2 plots: three panels, (a) Pf = 0.05, (b) Pf = 0.10, and (c) Pf = 0.15, each showing False Recognition Rates versus Number of Candidates (1 to 5).]

Figure 2: False recognition rates for noisy patterns versus the number of top candidates. Each binary pixel of training patterns is randomly inverted with a probability Pf.

4 Attention Switching for Superimposed Patterns

Suppose two unipolar [0,1] binary patterns are superimposed to form a new test pattern, and one would like to recognize the two patterns in sequence. Once a pattern is recognized with the selective attention process of Section 3, one may switch attention from the recognized pattern to the remaining pixels for further classification. This is accomplished by removing attention from the pixels of the already-recognized pattern. However, it is desirable not to turn off attention to the black pixels common to the two patterns. One may accomplish this by setting the attention gains $a_k$ to 0 if and only if $a_k = 1$. From these attention gains as initial conditions, another selective attention process may go on to recognize the second pattern.

The proposed selective attention and attention switching algorithm was tested on recognition of 2 superimposed numeral patterns. Again, four experiments were conducted with different training data sets. In each experiment 40 patterns were selected from the 280 training patterns, and 720 test patterns were generated by superimposing 2 patterns from different number classes. The test patterns were still maintained as binary. Recognition rates for three different candidate numbers, i.e., m = 1, 2, and 3, were compared with those of the simple selection method by output activation only.
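The superposition and the attention-switching rule of this section can be sketched as below. Several points are assumptions rather than statements from the text: superposition is taken as a pixel-wise OR (which keeps the pattern binary), gains other than 1 are left at their converged values as the starting point of the second round, and the 40 selected patterns are assumed to contain 4 patterns per class. The function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def superimpose(x1, x2):
    """Pixel-wise OR of two unipolar binary patterns; the result stays binary."""
    return np.maximum(x1, x2)

def switch_attention(a_converged):
    """Set a_k to 0 if and only if a_k = 1; other gains are kept (assumption)."""
    return np.where(a_converged >= 1.0, 0.0, a_converged)

# Converged gains from a first round (made-up values for illustration).
a_first = np.array([1.0, 0.3, 1.0, 0.0, 0.7])
a_second_init = switch_attention(a_first)

# Two tiny stand-in patterns and their superposition.
p1 = np.array([1, 1, 0, 0, 0])
p2 = np.array([0, 1, 1, 0, 1])
mixed = superimpose(p1, p2)

# Pair count, assuming the 40 patterns hold 4 examples of each of 10 classes.
labels = np.repeat(np.arange(10), 4)
pairs = [(i, j) for i, j in combinations(range(40), 2) if labels[i] != labels[j]]
n_pairs = len(pairs)
```

Under the 4-per-class assumption, the number of unordered pairs from different classes is 780 - 60 = 720, which matches the number of test patterns reported above.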
Figure 3: Examples of Selective Attention and Attention Switching

Figure 3 shows six examples of the selective attention and attention switching. Each group of four rectangles in Figure 3 shows the results for one test input pattern and, from left to right, represents the original input pattern $\mathbf{x}$, the attended input $\hat{\mathbf{x}}$ for the first-round classification, a masking pattern for the attention switching, and the residual input pattern for the second-round classification, respectively. The 6 test input patterns are superimposed from the (6,3), (9,0), (6,4), (9,3), (2,6), and (5,2) patterns. The attended input $\hat{\mathbf{x}}$ has analogue values, but is thresholded at 0.5 for display in the second rectangle. The third rectangle is actually the attended input pattern thresholded at 1.0. As shown in Figure 3, the attention switching is done effectively, and the remaining input patterns for the second classification are quite visible.

In Table 1 the recognition rates are summarized. As expected, the selective attention increases recognition rates for the first pattern, and the attention switching also improves recognition rates for the second pattern. With the top 3 candidates, the false recognition rate drops from 8.7% to 4.1% for the first pattern, and from 37.4% to 22.6% for the second pattern. The reduction factors are 53% and 40%, respectively. Although the reduction of false recognition rates is smaller for the second patterns due to bigger perturbations, it is still a very significant improvement.

Table 1: Recognition Rates (%) of Two Superimposed Numeral Patterns

                                       First Pattern          Second Pattern
                                     Mean   Max    Min      Mean   Max    Min
Top 2 Output Activations             91.3   91.7   90.8     62.7   65.7   60.1
Selective Attention & Switching
  - Top 1 candidate                  91.3   91.7   90.8     75.4   77.2   74.7
  - Top 2 candidates                 95.4   96.1   94.9     76.9   80.3   75.3
  - Top 3 candidates                 95.9   96.3   95.1     77.4   79.7   76.0
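As a check on the reduction factors quoted for the top 3 candidates, the relative reduction can be worked out from the false recognition rates given in the text, where a false rate is 100% minus the recognition rate (for example, 100 - 91.3 = 8.7 and 100 - 95.9 = 4.1 for the first pattern):

```latex
\[
\frac{8.7 - 4.1}{8.7} \approx 0.53, \qquad
\frac{37.4 - 22.6}{37.4} \approx 0.40
\]
```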
5 Conclusion
In this paper a selective attention algorithm is demonstrated to improve recognition rates for noisy patterns and superimposed patterns. Also, a simple attention switching algorithm results in much better recognition rates for the second pattern of 2 superimposed patterns. The algorithms are simple, and easily extendable to general feed-forward neural networks. These algorithms will be useful for extracting and recognizing multiple patterns in complex backgrounds.

Acknowledgments

S.Y. Lee acknowledges support from the Korean Ministry of Science and Technology.

References

[1] Jeong, D.G., and Lee, S.Y. (1996). Merging backpropagation and Hebbian learning rules for robust classification, Neural Networks, 9:1213-1222.
[2] Cowan, N. (1997). Attention and Memory: An Integrated Framework, Oxford Univ. Press.
[3] Pashler, H.E. (1998). The Psychology of Attention, MIT Press.
[4] Parasuraman, R. (ed.) (1998). The Attentive Brain, MIT Press.
[5] Fukushima, K. (1987). Neural network model for selective attention in visual pattern recognition and associative recall, Applied Optics, 26:4985-4992.
[6] Rao, R.P.N. (1998). Correlates of attention in a model of dynamic visual recognition. In Neural Information Processing Systems 10, MIT Press.
[7] Broadbent, D.E. (1958). Perception and Communication. Pergamon Press.
[8] Treisman, A. (1960). Contextual cues in selective listening, Quarterly Journal of Experimental Psychology, 12:242-248.
[9] Broadbent, D.E. (1971). Decision and Stress, Pergamon Press.
[10] Johnston, W.A., and Heinz, S.P. (1978). Flexibility and capacity demands of attention. Journal of Experimental Psychology: General, 107:420-435.
[11] Deutsch, J., and Deutsch, D. (1963). Attention: some theoretical considerations, Psychological Review, 70:80-90.
[12] Cowan, N. (1988). Evolving conceptions of memory storage, selective attention, and their mutual constraints within the human information processing system. Psychological Bulletin, 104:163-191.
[13] Posner, M.I., and Snyder, C.R.R. (1975). Attention and cognitive control. In R.L. Solso (Ed.), Information Processing and Cognition. Erlbaum.
[14] Lee, H.J., Lee, S.Y., Shin, S.Y., and Koh, B.Y. (1991). TAG: A neural network model for large-scale optical implementation, Neural Computation, 3:135-143.
[15] Lee, S.Y., Jang, J.S., Shin, S.Y., and Shim, C.S. (1988). Optical Implementation of Associative Memory with Controlled Bit Significance, Applied Optics, 27:1921-1923.
[16] Kruschke, J.K. (1992). ALCOVE: An Exemplar-Based Connectionist Model of Category Learning, Psychological Review, 99:22-44.
[17] Lee, S.Y., Kim, D.S., Ahn, K.H., Jeong, J.H., Kim, H., Park, S.Y., Kim, L.Y., Lee, J.S., and Lee, H.Y. (1997). Voice Command II: a DSP implementation of robust speech recognition in real-world noisy environments, International Conference on Neural Information Processing, pp. 1051-1054, Nov. 24-27, 1997, Dunedin, New Zealand.