Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 12-17, 2007

FEBAM: A Feature-Extracting Bidirectional Associative Memory

Sylvain Chartier, Gyslain Giguère, Patrice Renaud, Jean-Marc Lina & Robert Proulx

Abstract—In this paper, a new model that can ultimately create its own set of perceptual features is proposed. Using a bidirectional associative memory (BAM)-inspired architecture, the resulting model inherits properties such as attractor-like behavior and successful processing of noisy inputs, while being able to achieve principal component analysis (PCA) tasks such as feature extraction and dimensionality reduction. The model is tested by simulating image reconstruction and blind source separation tasks. Simulations show that the model fares particularly well compared to current neural PCA and independent component analysis (ICA) algorithms. It is argued that the model possesses more cognitive explanatory power than any other nonlinear/linear PCA and ICA algorithm.

I. INTRODUCTION

In everyday life, humans are constantly exposed to situations in which they must discriminate, identify, classify and categorize perceptual patterns (such as visual objects) in order to act upon the identity and properties of the encountered stimuli. To achieve this task, cognitive scientists have historically taken for granted that the human cognitive system uses sets of properties (or dimensions) which crisply or fuzzily define a particular object or category [1]. These properties are thought to strictly contain information that is invariant and relevant to future decision-making situations. These characteristics of human properties tend to accelerate cognitive processes by reducing the amount of information to be taken into account in a given situation.

While most researchers view properties and dimensions as symbolic (or logically based), some have argued for perceptual symbol or feature systems, in which properties would be defined in an abstract iconic fashion and created in major part by associative bottom-up processes [2, 3]. To achieve this “perceptual feature creation” scheme, humans are sometimes hypothesized to implicitly use an information reduction strategy.

Sylvain Chartier, School of psychology, University of Ottawa (Ottawa, Canada). Institut Philippe-Pinel de Montréal (Montréal, Canada). (E-mail: [email protected]). Gyslain Giguère, Département d’informatique, Université du Québec à Montréal (Montréal, Canada). (E-mail: [email protected]). Patrice Renaud, Département de psychologie, Université du Québec en Outaouais (Gatineau, Canada). (E-mail: [email protected]). Jean-Marc Lina, Département de génie électrique, École de Technologie Supérieure (Montréal, Canada). (E-mail : [email protected]). Robert Proulx, Département de psychologie, Université du Québec à Montréal (Montréal, Canada). (E-mail: [email protected]).


Unfortunately, while the idea of a system creating its own “atomic” perceptual representations has been deemed crucial to understanding many cognitive processes [3], few efforts have been made to model perceptual feature creation processes in humans. When faced with the task of modeling this strategy, one statistical solution is to use a principal component analysis (PCA) approach [4]. PCA neural networks [5] can deal quite effectively with input variability by creating (or “extracting”) statistical low-level features that represent the data’s intrinsic information to a certain degree. The features are selected in such a way that the input data’s dimensionality is reduced while important information is preserved.

While widely used for still and moving picture compression, as well as many other computer science and engineering applications, PCA networks have not made a breakthrough in the world of cognitive modeling, even though they possess many interesting properties: psychologically plausible local rules, flexible architecture growth, the ability to deal with noisy inputs, and the extraction of multiple components in parallel. This is probably because cognitive science researchers are looking for solutions that apply to multiple perceptual/cognitive tasks without assuming a high number of different modules (needing ad hoc modifications of the rules and architecture) or adding multiple fitting parameters. This can hardly be done with PCA networks, since this class of neural network models is feedforward by definition, and therefore does not lead to attractor-like (or “categorical”) behavior. Also, the absence of explicitly depicted connections in the model forces supervised learning to be done using an external teacher. Moreover, some PCA models cannot even perform online or local learning [5].

A class of neural networks known as recurrent autoassociative memory (RAM) models does respond to these last requirements. In psychology, AI and engineering, autoassociative memory models are widely used to store correlated patterns. One characteristic of RAMs is the use of a feedback loop, which allows for generalization to new patterns, noise filtering and pattern completion, among other uses [6]. Feedback enables a given network to shift progressively from an initial pattern towards an invariable state (namely, an attractor). Because the information is distributed among the units, the network’s attractors are global. If the model is properly trained, the attractors should then correspond to the learned patterns [6].

A direct generalization of RAM models is the development of bidirectional associative memory (BAM) models [7].

BAMs can associate any two data vectors of equal or different lengths (representing, for example, a visual input and a prototype or category). These networks possess the advantage of being both autoassociative and heteroassociative memories [7], and therefore encompass both unsupervised and supervised learning. A problem with BAM models is that, contrary to what is found in real-world situations, they store information using noise-free versions of the input patterns. In comparison, to overcome the possibly infinite number of stimuli stemming, for instance, from multiple perceptual events, humans must regroup these unique noisy inputs into a finite number of stimuli or categories. This type of learning in the presence of noise should therefore be taken into account by BAM models.

To sum up, on one hand, PCA models can deal with noise but do not have the properties of attractor networks. On the other hand, BAM models exhibit attractor-like behavior but have difficulty dealing with noise. In this paper, it will be shown that unifying both classes of networks under the same framework leads to a model that possesses a clear advantage in explanatory power for memory and cognition modeling over its predecessors. More precisely, this paper will show that a BAM-type architecture can be made to encompass some of the PCA networks’ properties. The resulting model will thus exhibit attractor-like behavior, supervised and unsupervised learning, feature extraction, and other nonlinear PCA properties such as blind source separation (BSS) capacities.

II. THEORETICAL BACKGROUND

A. Principal component analysis and neural nets

PCA neural networks are based on a linear output function defined by:

y = Wx                                                  (1)

where x is the original input vector, W is the stabilized weight matrix, and y is the output. When the weight matrix maximizes preservation of relevant dimensions, then the reverse operation:

\hat{x} = W^T y                                         (2)

should produce a result that is close to the original input. Thus, the norm of the difference vector x − \hat{x} should be minimized. This translates into finding a W which minimizes:

E = \sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2 = \sum_{i=1}^{N} \| x_i - W^T W x_i \|^2          (3)

where E is the error function to minimize, and N is the input’s dimensionality. Several solutions exist for Equation 3 (for a review of most of the methods see [5]).
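To make Equations 1-3 concrete, the following NumPy sketch builds W from the leading eigenvectors of the sample covariance (one closed-form minimizer of Equation 3) and evaluates the reconstruction error E. The toy data, the dimensionality and the number of retained components are arbitrary choices made for illustration, not values taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 centered input vectors of dimension 8, with most of the
# variance concentrated in 3 directions (so that compression is meaningful).
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 8)) \
    + 0.05 * rng.standard_normal((200, 8))
X -= X.mean(axis=0)

k = 3                                    # number of retained components (arbitrary)
cov = X.T @ X / len(X)                   # sample covariance matrix
_, eigvecs = np.linalg.eigh(cov)         # eigenvectors, ascending eigenvalue order
W = eigvecs[:, -k:].T                    # rows = top-k principal directions

Y = X @ W.T                              # Eq. (1): y = W x, for every input
X_hat = Y @ W                            # Eq. (2): x_hat = W^T y
E = np.sum(np.linalg.norm(X - X_hat, axis=1) ** 2)   # Eq. (3)
print(f"reconstruction error E = {E:.4f}")

Setting k equal to the input dimensionality makes W a complete orthonormal basis, so E drops to numerical zero, which is a quick sanity check on the implementation.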

Generalization to nonlinear PCA [8] is straightforward; in this case, a nonlinear output function is to be used:

y = f(Wx)                                               (4)

The corresponding minimization function is thus given by

E = \sum_{i=1}^{N} \| x_i - W^T f(Wx_i) \|^2            (5)
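The nonlinear case changes only the output computation. Below is a minimal sketch of Equations 4 and 5; the choice of tanh as the output function f, and the random data and weights, are assumptions made purely for illustration.

import numpy as np

def nonlinear_pca_error(X, W, f=np.tanh):
    """Eq. (5): E = sum_i || x_i - W^T f(W x_i) ||^2, with rows of X as the x_i."""
    Y = f(X @ W.T)        # Eq. (4): y = f(W x)
    X_hat = Y @ W         # reconstruction back through W^T
    return np.sum(np.linalg.norm(X - X_hat, axis=1) ** 2)

# Illustration only: random data and an untrained random weight matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
W = 0.1 * rng.standard_normal((3, 8))
print(nonlinear_pca_error(X, W))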

Depending on the type of nonlinear output function, different types of optimization can be achieved. If the output function uses cubic or higher nonlinearity, then the network directly relates to independent component analysis (ICA) and other BSS algorithms [8, 9].

B. Bidirectional associative memories

In Kosko’s standard BAM [7], learning is accomplished using a simple Hebbian rule, according to the following equation:

W = YX^T                                                (6)

In this expression, X and Y are matrices representing the sets of bipolar pairs to be associated, and W is the weight matrix which, in fact, corresponds to a correlation measure between the input and output. The model uses a recurrent nonlinear output function in order to allow the network to filter out the different patterns during recall. Kosko’s BAM uses a signum output function to recall noisy patterns [7]. The nonlinear output function typically used in BAM networks is expressed by the following equations:

y(t+1) = sgn(Wx(t))                                     (7)

and

x(t+1) = sgn(W^T y(t))                                  (8)

where y(t) and x(t) represent the vectors to be associated at time t, and sgn is the signum function defined by

 1 if z >0  sgn( z ) =  0 if z =0 -1 if z 1   (10) ∀ i, ..., N , y i (t + 1) =  −1, If Wxi (t ) < −1 (δ + 1) Wx (t ) − δ ( Wx)3 (t ), Else i i  and 1, If Vy i (t ) > 1   (11) ∀ i, ..., M , xi (t + 1) =  −1, If Vy i (t ) < −1 (δ + 1)Vy (t ) − δ (Vy )3 (t ), Else i i 

where N and M are the number of units in each layer, i is the index of the respective vector element, y(t+1) and x(t+1) represent the network outputs at time t + 1, and δ is a general output parameter that should be fixed at a value lower than 0.5 to ensure fixed-point behavior [16]. Figure 3 illustrates the shape of the output function when δ = 0.4. This output function possesses the advantage of exhibiting continuous-valued (gray-level) attractor behavior (for a detailed example see [10]). Such properties contrast with networks using a standard nonlinear output function, which can only exhibit bipolar attractor behavior (e.g. [7]).
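To make the recall dynamics concrete, here is a small NumPy sketch of the output function of Equations 10 and 11 (with δ = 0.4, as in Figure 3), compared on one pass with the signum output of Equation 9. The layer sizes, the random weight matrices W and V, and the input pattern are placeholders invented for this sketch, since no learning has taken place at this point.

import numpy as np

def febam_output(a, delta=0.4):
    """Element-wise output function of Eqs. (10)-(11): a saturating cubic.

    Activations above 1 (below -1) are mapped to 1 (-1); in between,
    the value is (delta + 1) * a - delta * a**3.
    """
    cubic = (delta + 1.0) * a - delta * a ** 3
    return np.where(a > 1.0, 1.0, np.where(a < -1.0, -1.0, cubic))

# One bidirectional pass with made-up sizes and untrained random weights.
rng = np.random.default_rng(2)
N, M = 5, 8                                  # y-layer and x-layer sizes (arbitrary)
W = 0.2 * rng.standard_normal((N, M))
V = 0.2 * rng.standard_normal((M, N))
x0 = np.sign(rng.standard_normal(M))         # an arbitrary bipolar input x(0)

print("standard BAM output, Eq. (9):", np.sign(W @ x0))
print("FEBAM y(0), Eq. (10):        ", febam_output(W @ x0))
print("FEBAM x(1), Eq. (11):        ", febam_output(V @ febam_output(W @ x0)))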

Fig. 1. FEBAM network architecture.

In a standard BAM, the external links given by the initial y(0) connections would also be available to the W layer. Here, the external links are provided solely through the initial x(0) connections on the V layer. Thus, in contrast with the standard architecture [10], the initial output y(0) is not obtained externally, but is instead acquired by iterating once through the network, as illustrated in Figure 2.

Fig. 2. Output iterative process used for learning updates (x(0) → W → y(0) → V → x(1) → W → y(1)).

In the context of feature extraction, the W layer will be used to extract the features whose dimensionality will be determined by the number of y units; the more units in that layer, the less compression will occur. If the y layer’s

Fig. 3. Output function for δ = 0.4.

C. Learning function

Learning is based on time-difference Hebbian association [10, 16-19], and is formally expressed by the following equations:

W(k+1) = W(k) + \eta (y(0) - y(t))(x(0) + x(t))^T       (12)

and

V(k+1) = V(k) + \eta (x(0) - x(t))(y(0) + y(t))^T       (13)

where η represents a learning parameter, y(0) and x(0) the initial patterns at t = 0, y(t) and x(t) the state vectors after t iterations through the network, and k the learning trial. The learning rule is thus very simple, and can be shown to constitute a generalization of Hebbian/anti-Hebbian correlation in its autoassociative memory version [10, 16].
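As an illustration of how Equations 12 and 13 combine with the iterative process of Figure 2, here is a minimal sketch of one learning trial in NumPy. It assumes t = 1 network cycle for the state vectors y(t) and x(t); the layer sizes, parameter values and training patterns are arbitrary choices, not the settings used in the paper's simulations.

import numpy as np

def febam_output(a, delta=0.4):
    """Output function of Eqs. (10)-(11)."""
    cubic = (delta + 1.0) * a - delta * a ** 3
    return np.where(a > 1.0, 1.0, np.where(a < -1.0, -1.0, cubic))

def learning_trial(W, V, x0, eta=0.01, delta=0.4):
    """One FEBAM learning trial: the Fig. 2 iteration followed by Eqs. (12)-(13)."""
    y0 = febam_output(W @ x0, delta)          # y(0) obtained internally (Fig. 2)
    x1 = febam_output(V @ y0, delta)          # x(1), Eq. (11)
    y1 = febam_output(W @ x1, delta)          # y(1), Eq. (10)
    W = W + eta * np.outer(y0 - y1, x0 + x1)  # Eq. (12)
    V = V + eta * np.outer(x0 - x1, y0 + y1)  # Eq. (13)
    return W, V

# Illustrative usage: repeatedly present a few random bipolar patterns.
rng = np.random.default_rng(3)
M, N = 16, 4                                  # x-layer and y-layer sizes (arbitrary)
W = 0.1 * rng.standard_normal((N, M))
V = 0.1 * rng.standard_normal((M, N))
patterns = [np.sign(rng.standard_normal(M)) for _ in range(5)]

for epoch in range(100):
    for x0 in patterns:
        W, V = learning_trial(W, V, x0)

Because the updates are proportional to y(0) − y(1) and x(0) − x(1), the weights stop changing once a single network cycle reproduces the initial states, which is what makes the learned patterns fixed points of the dynamics.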

For weight convergence to occur, η must be set according to the following condition [16]:

η
