The Emergence of Structured Receptive Fields in a Constructive Neural Network Model
Christian Bering
Centre for Cognitive Science
September 1997
A thesis submitted for the degree of MSc in Cognitive Science and Natural Language at the University of Edinburgh
Acknowledgements

I would like to thank David Willshaw for his continuous advice and encouragement, and for his helpful comments on earlier drafts of this manuscript. I would also like to thank Gert `D.' Westermann and Sam Joseph for many interesting discussions and critical comments.
Contents

1 Introduction
2 Constructive Network Algorithms
  2.1 Performance
  2.2 Cognitive Development
3 Self-Organising Models of the Visual System
  3.1 A Rough Sketch of the Visual System
  3.2 Network Models of the Emergence of Structure
4 A Constructivist Linsker Network
  4.1 The Feature Generation Mechanism
  4.2 Merging FGM and Linsker
  4.3 Experiments
5 Conclusions
A Receptive Fields in the CLN
B Feature Extraction
1 Introduction

In this thesis, I shall be concerned with a comparatively new family of artificial neural networks, namely constructive network algorithms. Originally, these algorithms were mostly designed to deal with convergence problems encountered in stationary architectures, or to improve the performance of such conventional networks. Over the last decade or so, they have increasingly attracted attention for certain advantageous properties they display as implementations of constructivist models of cognitive development, by virtue of which they provide evidence for the plausibility of the cognitive paradigm of constructivism. However, most of these network models are rather abstract models of cognitive abilities, and there seems to be a certain lack of `lower level' constructivist models of the dynamics of neural circuitry, probably partly because "very few [approaches to the on-line architectural design of neural networks] attempt to be particularly biologically plausible" (Joseph, in preparation). In order to provide a brief investigation into this neglected area, I will take a look at the well-studied mammalian visual system, and I will examine two models of the emergence of specific structures in the visual system. One of these will turn out to be a constructivist model; for the other, I will propose a neurobiologically motivated constructivist extension.

The rest of this paper is structured as follows: in the following section, I will introduce constructive network algorithms, on the one hand reviewing performance considerations, problems of conventional networks and the solutions offered by constructive architectures, and on the other hand introducing the new paradigm of constructivism, outlining the learning theoretical and psychological reasons which make constructive algorithms interesting within that new theoretical framework. In Section 3, I will begin with a brief sketch of the mammalian visual system and the specific structural properties that have been found in it. I will then examine the self-organising models presented by Willshaw and von der Malsburg (1976) and Linsker (1986a,b,c). Of these, the model by Willshaw and von der Malsburg will prove constructivist. Consequently, Section 4 will present an attempt to extend the Linsker model in a constructivist way, as well as an attempt to reproduce in this new architecture some of the remarkable results Linsker had achieved. It will also briefly consider the properties of the extended Linsker algorithm with regard to feature extraction.
2 Constructive Network Algorithms

Until recent years, a "normal" artificial neural network (ANN) would have a static architecture, i.e., the number of units and their connectivity patterns would be specified and fixed by the network's `designer' prior to training. Training the ANN would then merely involve modifying the weights according to the chosen learning algorithm. In contrast, a Constructive Network Algorithm (CNA) employs a dynamic architecture: the learning algorithm will not only modify the weights, but it can (and normally will) also alter the network's architecture by adding and/or removing units and/or connections.1

At first glance, one could think that not much is gained by this approach, since it replaces the burden of having to find an appropriate architecture (by trial and error and/or other means such as information about the data) with the possibly much more complicated task of finding adequate algorithmic principles along which the network's architecture is to be modified, so as to allow the learning algorithm to develop an appropriate architecture by itself. However, CNAs have several properties for which they have increasingly attracted attention over the last decade. Below, I will outline these properties along two main strands.

The first strand concerns general issues of problem solving. Although these aspects are not of immediate importance from a cognitive point of view, they are elaborated here because performance-related problems can be said to have formed the initial motivation for most CNAs, and they serve well to give a brief overview of what methods have been employed in CNAs and for what reasons. Of course, this overview cannot pretend to be complete; a broader discussion of different techniques, along with a taxonomy of a number of CNAs, can be found in Kwok and Yeung (1997). The second strand will outline learning theoretic and psychological considerations which have recently led to increased emphasis on the merit of these approaches for models of cognitive abilities and development. Again, it cannot be complete; Joseph (in preparation) provides a more comprehensive overview with a focus on neurobiological issues.

1 Note that sometimes the term Constructive Network Algorithm is only used for learning algorithms which add units or connections, while those that remove them are termed Pruning Algorithms. In what follows, I will mainly discuss constructive algorithms in the narrower sense, but much of what will be said applies to both approaches. I will make the distinction explicit where necessary.
  n       2^(2^n)    lin. sep. (abs.)    lin. sep. (%)
  1             4                   4              100
  2            16                  14             87.5
  3           256                 104             40.6
  4         65536                1772              2.7
  5    4.29 x 10^9              94572      2.2 x 10^-5
  6   1.84 x 10^19            5028134     2.7 x 10^-13

Table 1: After the results from Widner (1989): for n input units, the total number of binary functions (2^(2^n)) is shown along with how many of them (in absolute terms and as a percentage) are linearly separable.
2.1 Performance
The Beginning
One broad field where CNAs have demonstrated various advantages over networks with static architectures is their general performance in problem solving. A network's architecture influences its performance in many ways (see, e.g., Hertz et al. (1991), Ch. 6.4): choosing an architecture can be seen as implicitly fixing a vast number of parameters, of which one often will not know to what degree they will determine the results produced with the net, and which will in extreme cases even prevent the net from doing what one wants it to do.

For instance, the standard (i.e., single-layer) perceptron as introduced by Rosenblatt (1962) is an example of where the architecture rather drastically limits what the network can possibly accomplish. On the one hand, Rosenblatt (1962) had demonstrated that if a perceptron can learn a classification, it will eventually learn it (the perceptron convergence theorem). On the other hand, Minsky and Papert (1969) demonstrated that what the perceptron can possibly learn is the rather restricted group of classifications which are linearly separable. Many interesting functions, such as the well-known binary exclusive-OR function, are not linearly separable. Furthermore, Widner (1989) showed that the ratio between the linearly separable binary functions and all possible binary functions decreases very quickly as the number of input units increases (see Table 1).

The standard perceptron's limitation does not apply to architectures with hidden units. Indeed, feedforward networks with one layer of hidden units using a sigmoidal activation function have been shown to be universal approximators, i.e., to be able to approximate arbitrarily well any continuous function or decision boundary, provided the number of hidden units is sufficiently large. (For a proof of this universality property see for instance Bishop (1995), p. 130.)
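The linear separability limit is easy to demonstrate numerically. The following minimal sketch (my own illustration, not code from the thesis) trains a single threshold unit with the perceptron rule on AND and on exclusive-OR; it converges on the former and, as Minsky and Papert's analysis predicts, never on the latter:

    import numpy as np

    def train_perceptron(patterns, targets, epochs=100, eta=0.1):
        # Single threshold unit; w[0] is the bias weight.
        w = np.zeros(patterns.shape[1] + 1)
        for _ in range(epochs):
            errors = 0
            for x, t in zip(patterns, targets):
                o = 1.0 if w[0] + w[1:] @ x > 0 else 0.0
                if o != t:
                    w[0] += eta * (t - o)       # perceptron learning rule
                    w[1:] += eta * (t - o) * x
                    errors += 1
            if errors == 0:
                return True                     # converged: zero error
        return False

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    print(train_perceptron(X, np.array([0., 0., 0., 1.])))  # AND: True
    print(train_perceptron(X, np.array([0., 1., 1., 0.])))  # XOR: False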
However, making such more powerful nets work did not seem straightforward for a couple of years, because the delta learning rule used with the perceptron only allows for the training of output units. The weights w of these are modified in accordance with the difference between the target output t and the observed output o:

    Dw ∝ d,    d = (t - o)

But for hidden units, the target values are generally not known.
Constructive Perceptrons

One group of algorithms has solved this problem in a constructive way, by incrementing the perceptron during training in such a fashion that the delta rule can be used nevertheless. Such a constructive algorithm (in the narrower sense of the term) normally starts with a standard perceptron, which is trained until the error stagnates. Note that in order to be able to rely on error stagnation occurring in cases where the net cannot learn the classification entirely correctly, a modified perceptron learning rule has to be used, for instance the thermal perceptron learning rule (Frean, 1992) or the pocket algorithm (Gallant, 1986a). The pocket algorithm has the theoretical advantage of having been proven by Gallant (1986a) to find the optimal weight set with any given probability. After the error has settled, one or more units are added, and the modified perceptron is trained again. The manner in which new units are integrated into the existing structure depends on the specific algorithm. For example, the tower algorithm by Gallant (1986b) inserts each new unit `on top' of the previous ones, thereby using the output from the trained units as a decision help. In the upstart algorithm (Frean, 1990), new units are interpolated between input units and trained units such that they correct the errors made by the old units. Figure 1 depicts how the two algorithms increment a network. The procedure of training and adding is repeated until the error finally drops to zero. In the case of the upstart algorithm, Frean (1990) demonstrated that the error necessarily goes to zero, by showing that the insertion of new units always decreases the number of errors made.
Back-Propagation

Another approach to circumventing the perceptron's limitation is to employ a different learning rule which can also train hidden units. That way, one can employ a multi-layer architecture from the very beginning of training. This solution has become very popular since the introduction of the generalised delta rule by Rumelhart et al. (1986). The generalised delta rule allows for the computation of the error values, d, for hidden units by using the d-values of the
Figure 1: Insertion of new units (dashed lines) in the tower and the upstart algorithms after the old unit (solid lines) has been trained. In the tower algorithm, the old output unit (*) henceforth provides additional input to the new output unit; the upstart algorithm trains new units to correct specific errors of the old ones, i.e., to make them turn on in certain cases (+) or off (-).
output units or, recursively, of other hidden units to which the units in question feed activation. The weights can then be modified according to the normal delta rule. The idea of the d-values thus being `propagated backwards' through the net has earned the algorithm the name back-propagation, by which it is generally referred to. Rumelhart et al. (1986) were able to demonstrate that the generalised delta rule implements a gradient descent method with regard to a suitable error measure such as

    E = sum_{j=1..n} (t_j - o_j)^2
This E gives the squared Euclidean distance between a target pattern t_1, ..., t_n and the actual output pattern o_1, ..., o_n, and it can be depicted as a function of the weights in the network, giving an error surface in weight space (see Figure 2 for a one-dimensional example). With gradient descent methods, weights are modified in the direction of the steepest descent of this surface. In other terms, back-propagation minimises the error function, subject to the constraints the learning rule imposes on how the weights are changed.
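How strongly the outcome depends on the shape of the error surface is easy to see in one dimension. The following toy sketch (the surface is invented for illustration and is not taken from the thesis) runs gradient descent on an error function with one global and one local minimum; which minimum is reached depends entirely on the initial weight, a problem taken up in the next subsection:

    # E(w) = (w^2 - 1)^2 + 0.3*w has a global minimum near w = -1
    # and a local one near w = +1.
    E = lambda w: (w**2 - 1) ** 2 + 0.3 * w
    dE = lambda w: 4 * w * (w**2 - 1) + 0.3

    def descend(w, eta=0.01, steps=2000):
        for _ in range(steps):
            w -= eta * dE(w)       # step in the direction of steepest descent
        return w

    for w0 in (-1.5, 0.5, 1.5):
        w = descend(w0)
        print(f"start {w0:+.1f} -> w = {w:+.3f}, E = {E(w):.3f}")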
Moving through Error Surfaces

However, it does not follow from the theoretical capabilities of such a multi-layer architecture mentioned earlier that using one will automatically result in a zero-error solution, or even in a minimal-error solution. As a first reason,
Figure 2: Problems of gradient descent methods: a) a local minimum in the error surface; b) a plateau with near-zero gradient; c) oscillation in steep valleys; d) jumping over a narrow minimum. (Adapted from Zell (1994), p. 113.)
the architecture can influence the performance of the learning rule. In back-propagation, whether or not the minimal error can be found basically depends on how well the learning rule is able to `navigate' the error surface in its search for the global minimum, at which the weights would be optimally configured. Figure 2 depicts some of the problems it can encounter. They have been known since the introduction of the algorithm, but are normally taken to have no practical influence on performance. As Rumelhart et al. (1986) noted (p. 332): "We do not know the frequency of such local minima, but our experience with this and other problems is that they are quite rare." On the other hand, according to more recent findings, the error surface gets increasingly rugged in weight spaces of higher dimensionality, whereby the probability of being trapped in a local minimum steadily grows (Hecht-Nielsen, 1990). In this sense one can say that the bigger the architecture, the more it interferes with the progress of the learning rule. Such an effect can be influenced (intensified as well as weakened) by other factors, e.g., by the settings of different network parameters. For instance, in experiments conducted by Hirose et al. (1991), the ability of back-propagation to converge even for small problems like the exclusive-OR with two inputs crucially depended on the weights being initialised in a specific interval (here [-0.5, 0.5] or [-1, 1]).

A number of variations of the back-propagation principle have since been designed with the intention of eliminating these problems. (Notably the recent resilient propagation, which has been shown to be very robust with regard to
parameter choices (Riedmiller and Braun, 1992).)

On the side of CNAs dealing with these problems, Hirose et al. (1991) examined a simple constructive variant of back-propagation, in which a single hidden layer is built up during training. In the first phase, the algorithm adds a unit to the hidden layer whenever the error stagnates; this is repeated until the net converges. At that point, the algorithm's second phase starts, and one of the hidden units is removed, normally the last one added; then the network is trained again. This is repeated until the network no longer converges; the final architecture is then the last (and smallest) one to converge. The approach proved much more robust than standard back-propagation for the weight initialisations and learning rates tested. The influence of weight initialisation was tested with the exclusive-OR problem. As mentioned above, back-propagation would not necessarily converge, depending on the initial range of weight values. For an initial random weight range of [-5, 5], back-propagation only converged in 50% of the trials, whereas the constructive variant always converged. The influence of the learning rate was tested with a simple character recognition task, in which binary input patterns of 64 pixels were each to be mapped to one of 36 output units. For back-propagation, the percentage of nonconvergent calculations increased with increased learning rates; for the constructive version, this percentage grew much more slowly when the weights of new units were initialised with zero (for a learning rate of 0.75: 100% nonconvergence with back-propagation as opposed to 30% with the constructive algorithm), and there were no cases of nonconvergence at all when new weights were initialised with random values.
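The first, growing phase of such a scheme can be sketched compactly. The following is a loose, runnable illustration on the exclusive-OR problem (my own simplification, not Hirose et al.'s exact procedure or parameters; it follows their zero-initialisation variant for new weights and omits the second, pruning phase):

    import numpy as np

    rng = np.random.default_rng(0)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0.], [1.], [1.], [0.]])

    def train(W1, b1, W2, b2, epochs=1000, eta=1.0):
        # Plain batch back-propagation (generalised delta rule).
        for _ in range(epochs):
            H = sig(X @ W1 + b1)
            O = sig(H @ W2 + b2)
            dO = (O - T) * O * (1 - O)
            dH = (dO @ W2.T) * H * (1 - H)
            W2 -= eta * H.T @ dO; b2 -= eta * dO.sum(0)
            W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(0)
        return ((O - T) ** 2).sum()

    W1 = rng.normal(0, 0.5, (2, 1)); b1 = np.zeros(1)  # one hidden unit
    W2 = rng.normal(0, 0.5, (1, 1)); b2 = np.zeros(1)
    prev = np.inf
    while True:
        err = train(W1, b1, W2, b2)
        if err < 0.05 or W1.shape[1] >= 8:
            break
        if prev - err < 1e-3:                  # error stagnates: grow layer
            W1 = np.hstack([W1, np.zeros((2, 1))])   # new unit, zero weights
            b1 = np.append(b1, 0.0)
            W2 = np.vstack([W2, np.zeros((1, 1))])
        prev = err
    print(W1.shape[1], "hidden units, error", round(float(err), 4))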
Moving Targets

A different problem entailed by a fixed architecture is the so-called moving target problem noted by Fahlman and Lebiere (1990). In a feedforward net, the d-values for weights of units in the same hidden layer are determined independently of each other. Therefore, the units are likely to react in a similar manner to error signals from higher layers. This can lead to the situation that all units try to compensate for the same high error induced by some pattern, and thereby all partially unlearn other patterns. As learning proceeds, this herd effect will vanish with the increasing specialisation of units, but until then, the net can go through time-consuming phases of error oscillation in which its units learn in too similar a manner.

Fahlman and Lebiere (1990) addressed this problem with the Cascade Correlation (CasCor) algorithm. CasCor starts with a fully connected input and output layer, without any hidden units. The connections are trained as well as possible with a single-layer learning algorithm such as back-propagation. Then, the algorithm sets up a number of candidate units and performs candidate training with them. During candidate training, the candidate units receive input
from the network as if they had been inserted as a new hidden layer, but they do not feed activation into the network. They are trained independently of each other in such a fashion as to maximise the absolute value of the correlation of their output with the network's error. This way, they are supposed to turn into highly specialised feature detectors. As a side-effect, using candidates with different initial weights allows for a better coverage of the weight space, thereby reducing the danger of ending up in local minima. The most successful unit is inserted into the net, and the new network is trained again. This procedure is repeated until performance meets a chosen criterion.

CasCor was compared with two static feedforward networks, one running with standard back-propagation, one using an improved variant thereof (Quickprop by Fahlman (1988)). Two benchmarks were tested, namely the two-spirals problem and the N-bit-parity problem. In the two-spirals problem, the net has to distinguish between two intertwined spirals which are drawn onto a two-dimensional pixel grid. In N-bit-parity, the net has to distinguish bit patterns with an even number of bits turned on from those with an uneven number turned on. Table 2 shows the results of the comparison.

  Method       2-Spirals    N-bit parity
  Back-Prop        20000            2000
  Quickprop         8000              86
  CasCor            1700             357

Table 2: Epochs needed by CasCor and stationary networks on the two-spirals and N-bit-parity problems.

In another test with `real-world' data from a medical database, CasCor outperformed all fourteen competing stationary variations of back-propagation (Schiffmann et al., 1992). Since the moving target problem supposedly shows itself mainly in a lengthened training cycle due to error oscillations, these test results suggest that CasCor is rather successful at reducing the problem. Interestingly, Hirose et al. (1991) also noticed error oscillation when using standard back-propagation with certain parameter configurations. With regard to their constructive variant, they noted that "varying the number of hidden units suppresses the oscillation and shortens the calculation time. We do not yet fully understand why the oscillation is suppressed in our method." (p. 65) If one interprets this in the light of the moving target problem, it could be an indication that the main means against error oscillation is not necessarily the specific way in which CasCor trains its candidates, but that a unit's ability to specialise is already significantly enhanced when it is inserted into an already trained net rather than trained from the very start. This would seem plausible, because the previous training would already have introduced a certain degree of specialisation into such a net, hence making the newly initialised unit look different from the trained units and facilitating its subsequent specialisation.
Figure 3: Different network incrementing styles of CasCor and FlexNet: FlexNet uses sets of candidate units instead of single units, and considers different insertion points.
A variant of CasCor is FlexNet (Mohraz and Protzel, 1996), which overcomes two major limitations of CasCor:

1. While CasCor only considers single units as candidates, FlexNet uses sets of candidate units, "which has positive effects on both convergence speed and generalization" (Mohraz and Protzel, 1996).

2. CasCor only ever considers one insertion point for the candidate units, namely as a new hidden layer of their own. (But note a variant described in Baluja and Fahlman (1994) which also considers insertion into previous layers.) FlexNet sets up candidate sets for different possible insertion points. For instance, it will also test whether it might be favourable to add the new units to an existing layer.

These differences are shown schematically in Figure 3. For both the spiral and the N-parity problems, FlexNet generally took marginally longer than CasCor, while outperforming it with regard to the final error rate. Similar results were found for a real-world classification task run on a breast cancer database (Mohraz and Protzel, 1996), although FlexNet in general used more hidden units than the algorithms it was competing with (CasCor and a standard feedforward net).
Optimising Growth

The principle of choosing candidates used in CasCor and FlexNet can be varied along a number of dimensions:
Connectivity: Although FlexNet allows for three different connection strategies (ranging from full connectivity to only connecting adjacent layers), the chosen strategy is used uniformly for all candidates during the whole training. It might be favourable to use different connection strategies, especially in later stages of construction, when some of the errors made might originate from the influence of specific layers.
Size of candidate sets: The number of units in a candidate set is also kept constant in FlexNet. Again, it might prove favourable to vary this number. For instance, using smaller numbers of units in later construction stages might help contain the problems caused by highly dimensional weight spaces, and might therefore prove more efficient.
A principled problem of this approach is the choice of candidates to try out; although candidates can be trained in parallel because they do not interact, there is obviously a limit to how many can be tested in practice. It would therefore be a great advantage if one could use some error-related measurements to identify potentially successful candidates. For example, when learning with a method based on back-propagation, one might select the connectivity of the candidates according to the d-values of the hidden layers.

An example of where error-related hints for the insertion of new units have been employed with Gaussian units is the growing cell structure (GCS) described in Fritzke (1994c)2, which develops the Gaussian layer of a radial basis function network during training. Broadly speaking, Fritzke uses the accumulated errors of the units and an incrementally built topology, which defines each unit's neighbouring units, to determine where to insert new units: after a fixed number of epochs, a new unit is inserted in the vicinity of the unit with the highest accumulated error. As to the exact position, Fritzke proposes to choose a topological neighbour unit and insert the new unit half-way between the two. Although Fritzke does not use candidate training, the algorithm does offer the option. Discussing which neighbouring unit to choose to determine the position of a new unit, he says (p. 261): "Currently [the chosen unit] is the neighbor with the maximum accumulated error. Other choices, however, have shown good results as well, e.g., the neighbor with the most distant center position or even a randomly picked neighbor."
2 This is the supervised version. For the unsupervised counterpart, see Fritzke (1994b).
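The insertion step just described is compact enough to sketch directly. The following is a loose illustration (the centres, errors and neighbourhood topology are invented toy values; handing half of the worst unit's accumulated error to the newcomer is one common variant, and updating the topology's edges is omitted):

    import numpy as np

    def insert_unit(centres, errors, neighbours):
        q = int(np.argmax(errors))                       # highest accumulated error
        f = max(neighbours[q], key=lambda j: errors[j])  # its worst neighbour
        new_centre = 0.5 * (centres[q] + centres[f])     # half-way between the two
        centres = np.vstack([centres, new_centre])
        errors[q] *= 0.5                                 # redistribute q's error
        errors = np.append(errors, errors[q])            # ... to the new unit
        return centres, errors

    centres = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    errors = np.array([0.2, 0.9, 0.4])
    neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    print(insert_unit(centres, errors, neighbours)[0])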
An indication of the success of this error-based approach may be seen in the fact that GCS outperformed CasCor and FlexNet on the two-spirals problem by a wide margin: CasCor took on average 1700 epochs to achieve an error below 10%, FlexNet took approximately 4500 for an error of around 6%, and GCS took 180 epochs for zero-error classification. A number of other comparisons with Kohonen feature maps and conventional radial basis function networks yielded similar successes (Fritzke, 1995a,b).
Bias and Variance

Apart from the issues discussed so far, which concern the possible impact of static architectures on the learning rule, there is another problem directly inherent to static architectures, the so-called bias/variance dilemma (Geman et al., 1992). This dilemma is related to the important ability of neural networks to generalise from what they have learnt: they should not only learn the data they have been trained on, but should also be able to correctly identify new data afterwards. While the architecture restricts what can be learned, it determines at the same time how well the solution the net comes up with will generalise, and this relation entails a trade-off: on the one hand, a network with a very simple architecture might not be able to solve a given task well enough, because its architecture may not have the necessary freedom of parameters (it has a high bias) to adapt appropriately. On the other hand, a net with too complex an architecture, which allows for too much adaptation (hence, it has a high variance), might overfit the data: instead of capturing the main features and distinctions, it becomes too sensitive to the specific set of training patterns (which will probably be noisy and contain errors), and it will generalise badly. The ideal architecture, hidden somewhere between these extremes, would represent an adequate compromise between the requirements of a small bias and a small variance, returning a small though not necessarily zero error on the training set, and performing comparably on a new test set. Figure 4 shows schematic examples of the three cases.

CNAs (in the narrower sense of the word) like those described above have frequently been stated (e.g., Bishop (1995), Ch. 9.5; Redding et al. (1993)) to offer a solution to this dilemma in that they start with a very limited architecture and thereby a high bias, which is gradually decreased as the network and the number of free parameters grow. The principle also works in the opposite direction, by using pruning algorithms, which start with a highly flexible net and then remove connections or units to increase the bias. A slight disadvantage of pruning algorithms is the added computational effort: one has to decide what to prune in a potentially very large architecture, by using an adequate salience criterion to measure the importance of connections or units (broadly the counterpart of error-based incrementing in constructive nets). For
Figure 4: Effects of network architecture on learning and generalisation. a) shows how a highly constrained net such as a single-layer perceptron would classify the (two-dimensional) input patterns. c), at the other extreme, shows a classification boundary learnt by a highly complex architecture which can achieve perfect separation of the training data, but which is very likely not to generalise well. b) depicts a decision boundary corresponding to an intermediate architecture. (Adapted from Bishop (1995), p. 13f.)
instance, the optimal brain damage (Cun et al., 1990) and optimal brain surgeon (Hassibi and Stork, 1993) algorithms derive such a measure for connections by considering their contributions to the overall error. Both algorithms proved capable of considerably reducing network architectures, with slightly improved generalisation abilities.
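For optimal brain damage, the salience measure has a simple closed form: with training stopped at an error minimum and a diagonal approximation of the Hessian, the salience of weight w_k is s_k = h_kk * w_k^2 / 2. A minimal sketch of one pruning step (the weights and second derivatives below are invented toy values):

    import numpy as np

    def obd_prune(weights, hessian_diag, fraction=0.2):
        saliency = 0.5 * hessian_diag * weights ** 2   # s_k = h_kk * w_k^2 / 2
        k = int(len(weights) * fraction)
        prune = np.argsort(saliency)[:k]               # least salient first
        pruned = weights.copy()
        pruned[prune] = 0.0                            # remove the connection
        return pruned

    w = np.array([0.8, -0.1, 1.5, 0.02, -0.6])   # toy weights
    h = np.array([1.0, 2.0, 0.5, 3.0, 1.2])      # toy diagonal Hessian terms
    print(obd_prune(w, h))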
Summary

In conclusion, it may be said that constructive algorithms, for one thing, seem able to overcome some of the problems that learning rules can meet in static architectures during training, and that they are not subject to the same limitation as static architectures, which are fixed a priori to a restricted range of problems for which they can perform adequately in terms of learning as well as generalisation. Furthermore, the `constructive burden' added by CNAs actually seems to offer diverse dimensions along which a network can be made more effective, such as candidate training and error-targeted construction. Just how efficient these methods are can probably not be judged adequately on the basis of the few existing comparisons; more thorough tests should perhaps include systematic variation of combinations of the different methods.
2.2 Cognitive Development

"How do neural mechanisms participate in, or underlie, cognitive development? In what ways do cognitive and neural processes interact during development, and what are the consequences of this interaction for theories of learning? In short, how is the mind built from the developing brain?" (Quartz and Sejnowski, in press)
Neural Constructivism

Artificial neural networks have been employed in a wide range of modelling tasks, from models of low-level interaction in neural circuits, like those to be introduced in Section 3, to more abstract models of cognitive abilities and their development. In recent years, the paradigm of neural constructivism (Quartz and Sejnowski, in press) has emerged, which makes a two-fold claim about the relation between the two domains and, as a result thereof, about certain conditions that realistic models at both levels have to satisfy:

1. (Neuro)biological development and cognitive development interact closely. On the one side, a realistic model of cognitive abilities or of their development thus has to take into account certain characteristics not only of the brain, but also of its development. The most important corollary of this is the perspective on humans as nonstationary learners: the biological architecture is modified in the course of learning, and these modifications are crucial for learning. This point of view contradicts opinions expressed for instance by Chomsky (1980), according to which learning can be idealised as an "instantaneous" process independent of issues of development. For models of the formation of neural structures, on the other side, this postulate predicts that certain developmental neural processes will be of a cognitive nature or function, i.e., they will be influenced by environmental input and will in turn influence the developing biological structures to adapt to the environment, e.g., to maximise cognitive efficiency.

2. Cognitive development proceeds in a constructivist manner. With regard to cognitive abilities, neurobiological growth is "a progressive increase in the representational properties of cortex" (Quartz and Sejnowski, in press; my emphasis). This view is opposed by the selectionist paradigm (e.g., Changeux and Dehaene (1989)), which shares with the constructivist approach the emphasis on nonstationary learning, but which sees cognitive development as a continuous constraining of an initially overcomplex biological architecture through regressive neural mechanisms.
Constructivist Neurobiology

A (biased) investigation into the neurobiological evidence related to cognitive development has been undertaken in the "Constructivist Manifesto" by Quartz and Sejnowski (in press). Broadly, their findings were the following:
- Much of the data that had been taken to support the selectionist position had been overrated: "[t]he link between cognitive development and synaptic elimination in cerebral cortex is questionable." (Quartz and Sejnowski, in press) Mainly, such data concerned quantitative changes in synaptic numbers and density over time. In the "most influential" (Quartz and Sejnowski, in press) of these studies, Rakic et al. (1986) had observed a uniform overproduction and subsequent elimination of synapses in different cortical layers of the rhesus monkey. They inferred that "formation of synapses throughout the entire cortical mantle may be regulated by common genetic [...] signals" (Rakic et al., 1986, p. 234). They continue: "[B]ehavioral competence continues to increase beyond the stage of excess synapses. This suggests that full functional maturation may be related to synapse elimination" (ibid.). Quartz and Sejnowski pointed out that this picture of uniformity is disturbed when the development is related to specific cell types. Apart from that, more recent findings (Bourgeois et al., 1994) cast doubt on the validity of the previous results: "[T]hey found that synaptic density reached a peak around two months of age and did not begin to decline until puberty." (Quartz and Sejnowski, in press) Similarly, they criticised that results on synaptic overproduction and subsequent elimination in human development (reviewed, for instance, in Huttenlocher (1992)) had been drawn from very sparse data, with gaps at crucial developmental stages.
- A broad variety of recent studies emphasise the importance of activity-dependent and constructive processes in the development of neural structures. For instance, some investigations found the development of the visual cortex to be characterised by extensive growth of dendritic extensions and an increase in branching. Other studies in turn indicate that this dendritic outgrowth depends on patterned activity. Katz et al. (1989) had found that dendrites from the same cell feeding into the visual cortex prefer to grow into the same ocular dominance column, even in the vicinity of column boundaries (for a brief outline of the visual cortex's structural properties, see Section 3.1). They suggested that a likely source of this behaviour is the correlation of activity in cells within the same ocular dominance stripe; hence cells would tend to avoid uncorrelated activation (like that of cells from columns activated by different eyes). A supplementary piece of evidence was provided by Herrmann and Shatz (1995), who found that specific thalamocortical connections could be suppressed by blocking activity.
Note that neural constructivism does not, of course, postulate that regressive processes play no role in cognitive development. It is well established for various parts of the nervous system that their functionality crucially depends on what might even be called selectionist principles. For instance, the well-studied neuromuscular junction of the mammalian skeletal muscle develops from an initial state of polyinnervation, in which multiple axon terminals connect to each muscle fibre, to a state of single innervation, where each muscle fibre is connected to exactly one axon terminal. (For a recent detailed model of the underlying processes, see Joseph et al. (1996).) Ooyen (1994) provides a review of investigations into similar effects in the central nervous system.3 He states: "The survival of neurons depends to a great extent on whether or not they have obtained adequate amounts of specific neurotrophic factors [...], which seem to act by suppressing an intrinsic cell suicide program. [...] The synthesis of neurotrophic factors is regulated by electrical activity. [...] The rapid regulation of neurotrophic substances by neuronal activity [...], allowing the conversion of short-term signalling into long-term changes, suggests that their function is much more than merely regulating neuronal survival during development." (p. 402)

Processes of activity-dependent change of network structure are obviously just what CNAs implement, both constructive and pruning algorithms; constructive networks, possibly with regressive elements, would hence be predicted to form a promising basis for models of cognitive development. This prediction has found theoretical support from learning theory as well as experimental support from models of language development.

3 Note that the review also includes evidence of activity-dependent neural outgrowth.
Learning Theory

Learning can theoretically be regarded as moving through a hypothesis space, which is continuously narrowed down as hypotheses are eliminated on the basis of the data presented to the learner. In the influential probably approximately correct (PAC) model of learning by Valiant (1984), a function f in a class F of Boolean functions over some domain {0,1}^n has to be learned from example vectors x, chosen according to an arbitrary probability distribution D, and their associated classifications f(x) = 0 or 1. f is said to be learned by an algorithm exactly if, on the basis of the examples provided, the algorithm arrives with probability 1 - delta at a hypothesis g from a representation class G such that g(x) = f(x) with probability at least 1 - epsilon. Here delta is a confidence parameter and epsilon the error tolerance: if the algorithm learns f successfully, then with probability 1 - delta it will make an error of no more than epsilon.
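In one formula (a standard restatement of the criterion just given, in the notation above, where the outer probability is over the randomly drawn training examples):

    \Pr\Big[\; \Pr_{\vec{x} \sim D}\big[\, g(\vec{x}) \neq f(\vec{x}) \,\big] \le \epsilon \;\Big] \;\ge\; 1 - \delta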
Quartz (1993) employed PAC to analyse the learning theoretic properties of feedforward ANNs. The central observation he made was that an arbitrary, fixed architecture of a feedforward ANN can be identified with the learner's representation class G in Valiant's model. If each node i in the net can instantiate a function f_i from some set of functions F_i (parametrised by a threshold, weights and so on), then the connectivity of the net and the node function sets form the net's initial hypothesis space. Learning in the net then means finding one specific set of functions <f_1, ..., f_n> for the nodes i = 1, ..., n, which represents the learner's final hypothesis g. Quartz related this characterisation to the claim made by Fodor (1980) that whatever a learner can possibly learn must initially lie within his conceptual repertoire (or, in this context, in his hypothesis space), because "we simply have no idea of what it would be like to get from a conceptually impoverished system to a conceptually richer system by anything like a process of learning" (p. 149), as is required by the constructivist account of cognitive development.4 Quartz (1993) noted two things:

1. The interpretation of stationary ANNs within the PAC framework shows that they do not refute Fodor's claim (as had been asserted before), but that they indeed serve as an illustration of it, since the net's hypothesis g cannot but lie within the initial hypothesis space G fixed by the connectivity and the function sets <F_1, ..., F_n>. Read the other way around, this also means that the choice of architecture commits an ANN to a "highly constrained hypothesis space" (ibid., p. 232), which explains why the choice of architecture plays a central role in the application of ANNs.

2. A network algorithm with the ability to alter the net's architecture in a constructive fashion can extend the hypothesis space and come to learn concepts which were not learnable in the initial state. As Quartz (1993) points out, this is exactly what constructive algorithms do, and they thus provide an example of how a system can achieve a "conceptually richer" state. In this way, they demonstrate the plausibility of the constructivist viewpoint, and can be called constructivist networks.
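The effect of architecture on the hypothesis space can be made concrete for the smallest case. The following toy experiment (my own illustration) samples random weights for a single threshold unit and for a small two-hidden-unit threshold network on two Boolean inputs, and counts how many distinct Boolean functions each architecture realises; the single unit can reach at most the 14 linearly separable functions of Table 1, while the larger architecture's hypothesis space contains all 16:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    step = lambda z: (z > 0).astype(int)

    single, mlp = set(), set()
    for _ in range(50000):
        w, b = rng.normal(size=2), rng.normal()
        single.add(tuple(step(X @ w + b)))      # truth table of one unit
        W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
        w2, b2 = rng.normal(size=2), rng.normal()
        mlp.add(tuple(step(step(X @ W1 + b1) @ w2 + b2)))

    print(len(single), len(mlp))   # expect 14 and 16, given enough samples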
4 The claim was made in the context of the nativism/constructivism debate. For a review of the debate, see for instance Westermann (1996).

Starting Small

Experimental evidence for the importance of such incremental mechanisms in learning has since been provided by different models of language development. One such model was devised by Elman (1993) to learn to predict words in sentences containing embedded clauses. Embedded clauses are an example of a language phenomenon which is not regular in the formal sense established by Chomsky (e.g., Chomsky (1959)). Gold (1967) (amongst others) had shown
that languages of higher than regular complexity cannot be learned by an unrestricted learner if she is only given positive data, as seems to be the case in a child's course of development. One of the suggestions made by Gold to resolve this paradox was that children might draw on innate knowledge or, in learning theoretic terminology, that they profit from an initially (highly) constrained hypothesis space.

The network used by Elman (1993) had a recurrent architecture: the activation of one of the hidden layers (here the second of altogether three) was fed into a context layer, from where it was fed back into the source hidden layer one time step later; so for each input pattern, that specific hidden layer also received information about past patterns. The context layer thus helps the net encode sequential information. Elman trained the net using back-propagation on sentences from an artificial language with certain critical properties such as subject/verb agreement, different object requirements and multiple embeddings of relative clauses. In his first trials, he trained the net on sentences of varying complexity. Some of the examples Elman gives are (p. 75f):

    cats chase dogs.
    boys who chase dogs see girls.
    dogs see boys who cats who mary feeds chase.

In these trials the net failed, as Elman summarises, even to learn the training data. However, it produced strikingly different results when trained in phases of increasing data complexity. Comparing the network predictions with likelihood estimates derived from the probability distributions of the training data, Elman measured an error of 0.177. The network was also able to generalise "to a variety of novel sentences which systematically test the capacity to predict grammatically correct forms across a range of different structures." (p. 77)

The problem with these results, however, is the lack of psychological plausibility of the training process, as Elman notes, since a child obviously does not grow up in an environment of carefully sorted and portioned language input. He therefore ran a third series of trials, in which the amount of memory the net could use was staged, rather than the input. Elman achieved this by resetting the feedback from the context layer to a fixed value after a given number of words, thereby deleting the accumulated memory in the context layer. This was done after every third or fourth word in the first stage, and not at all in the fifth stage. The net managed to produce results comparable to those from the previous trials. In his discussion of the results, Elman argues that in the initial phase of learning, a network's learning process is more likely to be influenced by unrepresentative data, which might be fatal for later learning, all the more because "networks are most sensitive during the early period of learning" (p. 94), i.e., their weights are then subject to more and greater changes than at later stages. Initially limiting capacities might protect the net from the problems of highly dimensional error surfaces, and allow later learning with added capacities not only to refine earlier effects but also to correct them if necessary.
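The staged-memory mechanism is simple to sketch. Below is a minimal forward pass of such a simple recurrent network (my own illustration: the layer sizes, the reset interval and the reset value of 0.5 are illustrative stand-ins, not Elman's actual settings; training is omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = 5, 8
    W_ih = rng.normal(0, 0.1, (n_in, n_hid))    # input -> hidden
    W_ch = rng.normal(0, 0.1, (n_hid, n_hid))   # context -> hidden

    def run_sequence(inputs, reset_every=4):
        context = np.full(n_hid, 0.5)
        states = []
        for t, x in enumerate(inputs):
            if reset_every and t % reset_every == 0:
                context = np.full(n_hid, 0.5)   # staged memory: wipe context
            hidden = np.tanh(x @ W_ih + context @ W_ch)
            context = hidden.copy()             # copy back for the next step
            states.append(hidden)
        return states

    words = [rng.normal(size=n_in) for _ in range(12)]
    print(len(run_sequence(words, reset_every=4)))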
                        R&M     M&L     SPA     Constructivist
  Verb types            420     1,650   1,038   1,066
  Percentage correct:
    Total               97.0    98.0    95.0    99.3
    Regular             99.2    100.0   99.6    99.8
    Irregular           90.7    96.6    99.8    100.0

Table 3: Performance of four different models of learning the English past tense; R&M: Rumelhart & McClelland; M&L: MacWhinney & Leinbach; SPA: Symbolic Pattern Associator; Constructivist: Westermann. (Adapted from Westermann (in press).)
The English Past Tense

Other experimental evidence that constructive models can render language development more realistically comes from models of learning the past tense of English verbs. This somewhat restricted task has become "a landmark test for the validity of theories of human language learning in general" (Westermann, 1996, p. 7) since Rumelhart and McClelland (1986) presented the first neural network model for the task, thereby challenging the traditional, rule-based view that learning the regular and irregular past tense relies on separate mechanisms. Because of a number of flaws, this model is now of little but historical value. More importantly, it sparked off a number of other attempts to tackle the problem with different models: MacWhinney and Leinbach (1991) presented an improved network model, and Ling and Marinov (1993) devised a symbolic counterpart, the Symbolic Pattern Associator.5 The most recent addition to this line is the constructivist network model introduced in Westermann (in press). Westermann employed the growing neural gas algorithm by Fritzke (1994a), a variant of the growing cell structures introduced earlier on. Table 3 shows the training results of the models mentioned. As can be seen, the constructivist model outperforms the competing models on the irregular verbs. Westermann attributes this to the network's ability to specifically allocate hidden units for these exceptional cases.

The net also generalised well to a set of sixty pseudo-verbs, which had been
5 For a more detailed account of the debate, see for instance Westermann (1996).
devised by Prasada and Pinker (1993) and tested on humans with regard to their inflectional properties. The verbs were grouped into pseudo-regular and pseudo-irregular, and the verbs in each group were in turn categorised into blocks of ten, namely distant, intermediate and prototypical, according to how similar they were to existing verbs. Figure 5 shows the results of the models in comparison with human performance.

Figure 5: Generalisation performance of different models for different classes of pseudo-verbs. P: prototypical; I: intermediate; D: distant. (Reprinted from Westermann (in press) with kind permission.)

Apart from achieving good results, the model also developed in a psychologically plausible way in that it displayed U-shaped learning. This notion refers to psychological evidence that children initially tend to produce irregular verbs correctly, but subsequently go through a phase of overregularisation (e.g., they might say "goed" instead of "went") before they again produce them correctly. The network behaved in a comparable way, initially producing most irregular verbs correctly before temporarily overgeneralising them in different ways (e.g., "knowed" as well as "knewed" instead of "knew"). Westermann points out that the constructivist model is actually the first in this line in which U-shaped learning was observable and a result solely of the internal network dynamics. The model by MacWhinney and Leinbach (1991) failed to produce the initial phase of correct production. The Symbolic Pattern Associator displayed the different phases, but, as Westermann (in press) notes, it did so owing to the manipulation of a psychologically unmotivated learning parameter which directly controlled how often a verb had to be presented to the algorithm before it would be recognised as an exceptional verb rather than a regular one. This parameter was initially set to two, later to six, thereby producing the desired effects through the frequency of the different verbs. Westermann (1996) points out that this actually means
"hard-wiring" into the model what it should explain, namely how children come to initially store verbs as exceptions and then make the transition to treating them as regular forms.

These results not only show that a constructivist model learns better than networks with fixed architectures, but they also indicate that constructivist models can display psychologically plausible characteristics during development. Westermann (in press) notes that the SPA, which similarly outperforms the stationary networks, is in fact also a (symbolic) constructivist model, as it gradually builds up a decision tree in response to the data presented to it. Consequently, he argues that probably "the dichotomy constructivist/fixed-architecture is more fundamental than the symbolic/subsymbolic distinction" emphasised by previous models.
Stability/Plasticity

Lastly, another, more abstract characteristic of human cognitive abilities which can be realised in constructive networks is the stability/plasticity duality (or dilemma, as it was originally termed) (Grossberg, 1976a). Basically, plasticity is the requirement that a system be able to adapt to new input (e.g., to learn or to memorise it). Stability, on the other side, is the requirement that the system protect previous states, or certain important characteristics thereof, from being erased by adaptation to new data. The two requirements thus present a trade-off: a completely plastic system will not be able to remember anything old, while an entirely stable one cannot learn anything new. Most ANNs can be characterised as being completely plastic during training, and completely stable afterwards. Clearly, this strict separation, and indeed limitation, does not apply to human memory capabilities: throughout life, knowledge is accumulated and incorporated into existing memories. Whatever processes underlie our memorisation abilities, they provide plasticity and stability at the same time, presumably in some well-balanced and perhaps not entirely fixed manner.

The theory of adaptive resonance (ART) (Grossberg, 1976a,b) constitutes an attempt to harmonise the two requirements. It has been implemented in different variations in a number of networks. The first version, ART 1 (Grossberg, 1976a; Carpenter and Grossberg, 1988), is an unsupervised net limited to binary input data. Subsequent models were developed to deal with continuously valued input, like ART 2 (Carpenter, 1987) and an improved version thereof, ART 2-A (Carpenter et al., 1991a); another model to the same end is FUZZY ART (Carpenter et al., 1991b), which integrates fuzzy logic into ART 1. Furthermore, there exists a model for supervised learning (ARTMAP, described in Carpenter and Grossberg (1988)); another variant is ART 3 (Carpenter and Grossberg, 1990), which includes more detailed modelling of synaptic chemical processes. For
Figure 6: A simplified schema of the architecture of ART 1. The comparison layer (units C1 ... Cn) connects `up' to the recognition layer (units R1 ... Rm) via real-valued weights w, and the recognition layer connects back `down' via binary weights b; a reset unit U inhibits recognition units in certain cases. See text for explanations.
this discussion, I will mainly draw on ART 1, which is sufficient to illustrate the principle.

Leaving aside a number of details concerning the complicated synchronisation of the network, the architecture of ART 1 consists of two layers: the comparison layer with units C_1, ..., C_n, which is the layer input patterns are fed into, and the recognition layer with units R_1, ..., R_m, which can be understood to store prototypical binary patterns in their weights. The two layers are fully connected in both directions, with real-valued weights w on the connections from the comparison layer to the recognition layer (I will term this direction of flow of activation `up') and binary weights b in the other direction (which I will term `down'). Apart from that, there is a reset unit U which receives inhibitory input from the comparison layer and serves to inhibit units in the recognition layer in certain cases. Figure 6 depicts these elements.

Learning proceeds as follows: an input pattern is presented to the comparison layer. Then the net input for each unit R in the recognition layer (as activity flows `up') is computed as the scalar product of the input pattern and the weight vector w_R. The unit with the highest net input is chosen as the winner. The process of choosing can be implemented as a winner-takes-all network, but is for practical reasons normally done by direct comparison. Note that, geometrically, the winner unit is the one closest to the input pattern in the input space. This unit's activation is set to one, all others are set to zero, and the activation is fed back down to the comparison layer through the binary weights; thus, exactly the binary pattern stored in the weights of the winning unit is fed back. The activation of each of the comparison layer's units is then determined as a logical AND performed on the input pattern and the pattern returned from the recognition layer.
ows `up') is computed as the scalar product of the input pattern and the weight vector wR . The unit with the highest netinput is chosen to be the winner. The process of choosing can be implemented as a winner-takes-all network, but is for practical reasons normally done by direct comparison. Note that, geometrically, the winner unit is the one which is the closest to the input pattern in the input space. This unit's activation is set to one, all others are set to zero, and the activation is fed back down to the comparison layer through the binary weights; thus, exactly the binary pattern stored in the weights of the winning unit is fed back. The activation of each of the comparison layer's units then is determined as a logical AND performed on the input pattern and the pattern returned from the recognition layer. At this point, two things can happen: If the original input pattern diers
very much from the pattern returned from the recognition layer, the activation of many units in the comparison layer will be zero, and thereby the activation fed back `up' again will be low. If it falls below a certain threshold, called the vigilance, the previously inhibited reset unit becomes active and inhibits the unit in the recognition layer which won the last time. That way, another unit in the recognition layer will win. In this manner, the net can search through all the prototypes stored in the binary weights until one is found which is similar enough to the input to keep up the inhibition of the reset unit. If no prototype is close enough, a new unit in the recognition layer is assigned to the specific input.6

The other possibility, when a pattern has been fed back down, is that enough units stay turned on. The system then switches into the training cycle: the weights feeding into the winning unit are set to normalised values of the activations in the comparison layer, and the weights coming back from the winning unit are set to just that pattern. Altogether, the prototypes in the recognition layer are only modified to accommodate input patterns which resemble them enough in the first place. The degree to which they have to be similar is determined by the vigilance parameter, which can be chosen intuitively in ART 1 in the range [0, 1], because it is compared against the normalised Hamming distance between the input pattern I and the prototype pattern P, which yields a value in the same range. In pseudo-code:

    if vigilance <= 1 - hamming(I, P)/n then
        adapt P to I
    else
        insert new unit for I

where hamming(I, P) counts the bits on which I and P differ, and n is the number of units in the comparison layer.

6 Originally, ART is assumed to have a finite number of unused units available in the recognition layer, but for the purposes of this discussion this is equivalent to assuming the integration of new units.
The lower the vigilance, the more a pattern may mismatch a prototype and still be accommodated by it. So ART's solution to the stability/plasticity dilemma basically consists in introducing a simple similarity criterion and assigning new units whenever the criterion is not met; in other terms, allocating resources in accordance with an activity-dependent criterion, so as to prevent unlearning in those already allocated. This solution is obviously constructivist by its very nature.
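A compact, runnable sketch of this matching cycle (my own simplification: the bottom-up ranking uses the binary prototypes directly rather than separately normalised real-valued weights, and the synchronisation machinery of the full model is omitted):

    import numpy as np

    def art1_present(I, prototypes, vigilance=0.7):
        order = np.argsort([-(I @ P) for P in prototypes])  # 'up' net inputs
        for r in order:                                     # search with resets
            P = prototypes[r]
            if vigilance <= 1 - np.mean(I != P):            # similar enough?
                prototypes[r] = np.logical_and(I, P).astype(int)  # adapt: AND
                return r
        prototypes.append(I.copy())                         # no match: new unit
        return len(prototypes) - 1

    protos = [np.array([1, 1, 0, 0]), np.array([0, 0, 1, 1])]
    print(art1_present(np.array([1, 0, 0, 0]), protos))     # adapts prototype 0
    print(art1_present(np.array([1, 1, 1, 1]), protos))     # inserts a new unit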
between the input pattern and the nearest Gaussian unit exceeds a chosen resolution parameter and, at the same time, the output error exceeds a given accuracy threshold. This can be seen as a simpler and computationally much more effective implementation of the principle; or, the other way around, the ART network can be considered an attempt to provide a neuropsychologically plausible interpretation of it. RAN has not been applied to classification problems like ART, but it should perform similarly when configured adequately.
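A minimal sketch of RAN's two-part novelty criterion, under simplifying assumptions: Platt's full algorithm also adapts the centres by gradient descent and shrinks the resolution over time, which is omitted here, and all names are illustrative.

```python
# Sketch of RAN's allocation decision: insert a unit when the input is far
# from all existing centres AND the output error is large; otherwise adapt.
import numpy as np

def ran_step(x, target, centres, heights, resolution, accuracy, lr=0.05):
    output = sum(h * np.exp(-np.sum((x - c) ** 2) / resolution ** 2)
                 for c, h in zip(centres, heights))
    error = target - output
    dists = [np.linalg.norm(x - c) for c in centres]
    if not centres or (min(dists) > resolution and abs(error) > accuracy):
        centres.append(np.array(x, dtype=float))   # allocate a new Gaussian unit
        heights.append(error)                      # its height absorbs the error
    else:
        nearest = int(np.argmin(dists))
        heights[nearest] += lr * error             # adapt instead of inserting
    return output
```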
Summary

The outlined paradigm of neural constructivism predicts constructive network algorithms to be promising models of cognitive abilities, and a diversity of evidence seems to support this claim. Constructive algorithms have advantages over static-architecture networks from a principled learning-theoretical point of view, they can display psychologically plausible characteristics during development, and they inherently offer an approach to tackling the stability/plasticity problem.
3 Self-Organising Models of the Visual System

A second prediction is entailed by neural constructivism's emphasis on constructive processes and the asserted close relationship between cognitive and neurobiological development: constructivist mechanisms should play a vital role in the development of neural structures central to cognitive abilities. One major problem with testing this prediction is that the cognitive functions of larger neural modules in the nervous system are far from well known, with the possible exception of the visual system. There, various structural characteristics have so far been identified which, although their precise functional properties have yet to be determined, lend themselves to interpretations of their contribution to our cognitive abilities. (For a review, see for instance Churchland and Sejnowski (1992), Ch. 4.) Below, I will first give a brief overview of the visual system and these structures. I will then proceed to examine two self-organising models of the emergence of some of these structures.
3.1 A Rough Sketch of the Visual System

Before the Visual Cortex
Raw visual data are processed and transformed in a multitude of ways on their way from the retina to the visual cortex.7 To start with, the human retina itself already possesses a complex neural structure. Incoming light is first analysed by different sorts of photoreceptors, the rods and cones: with respect to brightness by the rods, and with respect to its composition out of the three basic colours blue, green, and red by different, wavelength-specific cones. The transformed signals then pass through a few more interacting retinal layers of different cell types before reaching the ganglion cells, which transmit them to the brain. These later retinal layers serve functions such as contrast enhancement through lateral inhibitory connections; however, their functionality is not yet completely known, as they consist of many different cell types, which in turn seem to serve a variety of functions. The central role of this preprocessing may be anticipated when considering that a total of approximately 100 million rods and 3 million cones together feed into only 1.5 million ganglion cells. An added degree of complexity is introduced through differences between rod vision and cone vision. Rod vision is thought to be evolutionarily older than cone vision, and the two systems differ from one another in a number of ways. To start with, rods and cones are distributed in an almost complementary way across the retina: while in the very centre of the retina there are only cones, the rods begin to outnumber them towards the periphery by up to a factor of
7 The description given in this and the following section is mainly based on Guyton (1991), Ch. 50-51, and Churchland and Sejnowski (1992), Ch. 4.
ten. The inner-retinal circuitry of rods and cones also differs in what cell types they transmit their signals to, and, as indicated by the difference in absolute numbers, in general a higher number of rods converges onto the same optic nerve fibre. Lastly, activation is conducted through the different systems at different speeds, the cone system delivering information to the brain between two and five times more rapidly than the rod system. Moving on from the retina, signals next reach the ganglion cells, which are also more than just pathways to the following stations: there are three different types of ganglion cells (designated W, X, and Y), each with a characteristic velocity of signal transmission (8 m/sec in W cells, 14 m/sec in X cells and 50 m/sec in Y cells), characteristic retinal receptive fields and thereby characteristic response behaviour. In this context it is important to note that the main visual information is transmitted by the X cells, which have very small receptive fields and thus convey activity corresponding to rather precise locations on the retina; Y cells, on the other hand, have large receptive fields and thus cannot convey spatial information about retinal activation very accurately. The different types of ganglion cells furthermore project onto different regions in the brain. The W cells follow a path into older, precortical regions of the brain which, for instance, control eye movement and reflexes; only in lower animals are they used for direct processing of aspects like visual form or colour. For humans, visual processing continues with the X and Y cells, which project onto different layers of the dorsal lateral geniculate nucleus (LGN). Before these pathways reach the LGN, they bifurcate, with the result that nerves corresponding to the same half of the visual field from the left and right eye run together. (In other terms, the nerve fibres are no longer bundled according to the eye they originate from, but according to the half of the retina they come from.) The LGN works as a sophisticated "relay station" (Guyton, 1991, p. 560) on the way to the visual cortex, consisting of six layers. These layers provide a segregation of the incoming connections according to the two different types of ganglion cells, X and Y, and the eye of origin.8
The Primary Visual Cortex

The LGN feeds the signals into different layers of the primary visual cortex, also known as V1. Like almost all regions of the cerebral cortex, V1 has six distinct layers. Layer 4, to which the LGN connects, is in turn subdivided into four sublayers 4a, 4b, 4cα and 4cβ. While the fast, low-accuracy activations which originated from the Y ganglion cells feed into 4cα, the LGN connections conveying the slower and, with regard to their origin in the retina, spatially
8 Note, however, that the LGN may well have a more complex function, since it does not only feed activation into the visual cortex, but is also the destination of inhibitory connections from the visual cortex.
Figure 7: a) A centre-surround cell with a receptive field showing excitatory connections in a circular, inner region (white), surrounded by inhibitory connections (shaded region). This cell will be maximally activated by activation occurring only in the inner circle. b) A simple cell which maximally responds to vertically oriented, bar-like activity patterns.
more accurate activation from the X cells terminate in 4a and 4cβ. Both types of activation then feed `outward' in both directions, i.e., into layers 1 to 3 and into layers 5 and 6. The horizontal organisation of neurons is retinotopic: the spatial relation of cells in the retina is preserved all the way through the ganglion cells and the LGN into layer 4. This means that activation in the retina will be mirrored in the visual cortex with regard to its `shape'. The structural principle of topography-preserving mapping is a frequently encountered characteristic of neural activation transmission; it has also been found in other sensory systems and in the motor cortex. The vertical ordering of layers corresponds roughly to the complexity of the cell responses found in each, with the simplest ones in layer 4 and increasingly complex ones towards the outer and inner layers. Starting with layer 4, one finds mainly simple and centre-surround cells. A centre-surround cell will respond maximally when activated in a limited, symmetrical region in the centre of its receptive field; a simple cell has a similarly organised receptive field, with the difference that it is not symmetrical, but will respond maximally to activation in the form of a bar or edge of specific orientation (see Figure 7). In the outer layers, cells display more complex response behaviour: their `favourite' activity pattern will also be asymmetrical and bar-like, but will evoke maximal activation only if it is moving across the receptive field in some particular direction. These assemblies of orientation-selective cells are in turn organised into clusters of similar orientations. This principle can be found through all the layers, resulting in orientation columns, the overall `cell orientation' of which varies
smoothly as one moves through different columns, yet also shows breaks and reversals in its change. At a still higher level, groups of orientation columns for different orientations are clustered together into `zebra-like' vertical columns of alternating, monocular response, the ocular dominance columns.
3.2 Network Models of the Emergence of Structure

Topographical Mappings
The development of the visual cortex's structural characteristics, such as the retinotopic mapping, has often been explained by epigenetic mechanisms like chemical labelling, through which cells are to find their appropriate successor cells. One problem of these models is that they have to assume complicated mechanisms of relabelling to account for the results of experiments in which a correct mapping developed despite various surgical manipulations of the neural structures involved.9 As an alternative to such models, Willshaw and von der Malsburg (1976) presented a model for the emergence of topographical mappings based on an activity-dependent mechanism of synapse formation. Their model shows how an initially random synaptic mapping between a pre- and a postsynaptic layer of neural cells develops into a topographical mapping, in which neighbouring clusters of cells in the presynaptic layer connect to neighbouring clusters in the postsynaptic layer (see Figure 8a). The synaptic mapping changes through mechanisms of local lateral excitatory and inhibitory interactions in the postsynaptic layer and a simple local Hebbian rule for the modification of synaptic strength between the two layers. To explain the model's dynamics in more detail, I will first describe its architecture and then consider what happens when activation is presented to the network. To begin with the presynaptic layer: this can be thought of as an array of sources of activation, which serve no other function in the process of development than providing activation. The important part of the architecture is the postsynaptic layer. Each postsynaptic cell has two types of connections. On the one hand, it has synapses to some or all presynaptic cells, which are of variable strength w > 0 and will be modified during development. On the other hand, each postsynaptic cell is connected to other postsynaptic cells in a specific, fixed manner: it has excitatory connections to its neighbouring cells up to a certain, chosen distance, and inhibitory connections to cells which lie within a certain distance beyond the excitatory neighbourhood (see Figure 8b). As mentioned earlier, similar lateral connectivity patterns are observable, for instance, in the retina. The function of these connections is to provide additional excitation when nearby cells are simultaneously activated, and (to a certain degree) to cancel co-occurring activity in cells which are not close together, thereby working as a
9 These are briefly reviewed in Willshaw and von der Malsburg (1976).
Figure 8: a) Left: the initial structure of the network: connections to the neighbouring postsynaptic cell clusters X and Y are scattered randomly throughout the presynaptic layer. Right: after development, the two clusters receive activation from neighbouring presynaptic cell clusters X′ and Y′. (Adapted from Willshaw and von der Malsburg (1976).) b) The connectivity of a postsynaptic cell x: A) synapses to various cells in the presynaptic layer, B) a neighbourhood of postsynaptic cells which is excited by x, and C) a second neighbourhood of inhibited cells.
contrast filter against scattered activation. Development in this architecture occurs through a succession of trials, in each of which 1. small clusters of presynaptic cells are activated, 2. the activations of the postsynaptic cells are computed, and 3. their synapses are changed accordingly. The steps in detail:

1. A small cluster of presynaptic cells c_i is chosen randomly, and their activation levels A_i are set to 1, whereas the activation levels of all other presynaptic cells are set to zero.

2. All postsynaptic cells then in parallel go into a state of repeatedly determining their activities on the basis of the incoming signals. The activity H*_j of a postsynaptic cell j depends on its internal membrane depolarisation H_j (conventionally also known as the net input) through a simple threshold function:

H*_j = H_j − θ   if H_j > θ,
H*_j = 0          otherwise.
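In code, this thresholding amounts to a one-line rectification; a trivial but hopefully clarifying transcription (the function name is illustrative):

```python
# thresholded activity H* of a postsynaptic cell, as defined above
def activity(H, theta):
    return H - theta if H > theta else 0.0
```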
A cell's membrane depolarisation in turn changes according to the weighted sum of presynaptic activation, excitatory and inhibitory activation from other postsynaptic cells, and a decay term:

dH_j/dt = −α H_j + Σ_{i=1..n} w_{c_i j} A_i + e Σ_p H*_p − f Σ_q H*_q,

where the four terms represent decay, presynaptic input, lateral excitation and lateral inhibition, respectively. The first term of the equation makes the cell's depolarisation tend to go to zero at a rate determined by α (0 < α < 1; in their trials, Willshaw and von der Malsburg (1976) used α = 0.5). The second term sums up the presynaptic input, A_i being the input of the presynaptic cell i (out of n), which is weighted by the strength w_{c_i j} of the synapse connecting it to j. The third and fourth terms represent the summed excitatory and inhibitory contributions from other postsynaptic cells p (excitatory) and q (inhibitory). e and f are constant weights (e, f > 0) balancing the effects of the lateral interactions against the presynaptic influence.10 With the postsynaptic cells repeatedly computing their depolarisations and sending the resulting activation to other postsynaptic cells, the layer goes through a phase of oscillating activation. The computational cycle is repeated until the depolarisation levels of the units (and thereby their activations) meet a specified constancy criterion. In the trials run by Willshaw and von der Malsburg (1976), the average change per unit had to become less than 0.5 percent.

3. Once the activations of the postsynaptic cells have been determined, the strengths of the synaptic connections to the presynaptic cells are modified, provided the postsynaptic activation exceeds a preset modification threshold:

Δw_{c_i j} = h A_i H*_j.

In this equation, the preset parameter h determines the speed of structural changes. After the weights have been modified, they are normalised such that the total synaptic strength for each postsynaptic cell equals a preset constant S, i.e.:

w_{c_i j} ← S w_{c_i j} / Σ_i w_{c_i j}.
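The following is a minimal, illustrative Python transcription of one such trial (steps 1 to 3 above). Array shapes, parameter values, the relaxation loop and the convergence test are assumptions for demonstration only, not the authors' exact settings; `excite` and `inhibit` are assumed binary neighbourhood matrices over the postsynaptic layer.

```python
# A sketch of one Willshaw-von der Malsburg development trial.
import numpy as np

def trial(W, excite, inhibit, alpha=0.5, e=0.1, f=0.1, theta=1.0,
          h=0.01, S=1.0, cluster_size=4, dt=0.1, tol=0.005):
    n_pre, n_post = W.shape
    # 1. activate a small random cluster of presynaptic cells
    A = np.zeros(n_pre)
    start = np.random.randint(n_pre - cluster_size)
    A[start:start + cluster_size] = 1.0
    # 2. relax the postsynaptic depolarisations H until roughly constant
    H = np.zeros(n_post)
    for _ in range(500):
        Hstar = np.maximum(H - theta, 0.0)          # thresholded activity H*
        dH = (-alpha * H + A @ W
              + e * excite @ Hstar - f * inhibit @ Hstar)
        H += dt * dH
        if np.mean(np.abs(dt * dH)) < tol * (np.mean(np.abs(H)) + 1e-12):
            break
    Hstar = np.maximum(H - theta, 0.0)              # modification threshold is
    # 3. Hebbian update, then normalise each cell's total strength to S
    W += h * np.outer(A, Hstar)                     # implicit in H* being zero
    W *= S / (W.sum(axis=0, keepdims=True) + 1e-12)
    return W
```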
The weight modification rule used is the most direct mathematical transcription of the synaptic adaptation principle proposed by Hebb (1949):
10 Note that this is a marginal simplification of the original model, where the weights were assigned individually to other postsynaptic cells, thereby allowing for varying excitatory/inhibitory influence, e.g., according to increasing distance.
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." (p. 62)
This is generally interpreted to mean that correlated high activity in a connected pair of neural cells should lead to an increase in the strength of their connection. Note that the Hebb rule in this form neither allows for a decrease of weight strength nor limits the weights in their growth. Normalising the weights does both: it directly limits the maximal value a weight can assume, and a weight's value can decrease (though not go to or below zero) if it stagnates while some other weight increases in strength. Including the Hebb rule, three correlation-based mechanisms work together in this model to make the desired connectivity emerge. The first one is introduced by the clustered activation of presynaptic cells. This correlation of the activity of nearby cells is translated into the postsynaptic layer by the Hebb rule, since in the postsynaptic layer the lateral connections promote correlated activity among nearby cells. So assume that, at some point during development, some postsynaptic cell does not yet display the desired connection pattern to some of the presynaptic cells `beneath' it; this cell might still be activated through lateral connections from neighbouring cells which have already partially formed `correct' connections, and if the level of activation thus achieved is high enough, it will lead to an increase of the strength of the correct weights through the correlation of the postsynaptic activity with the presynaptic activity. Figuratively speaking, neighbouring postsynaptic cells point out current sources of activation to each other. These dynamics produce the desired effect:
"Our model is based on the idea that the geometrical proximity of presynaptic cells is coded in the form of correlations in their electrical activity. These correlations can be used in the postsynaptic sheet to recognize axons of neighbouring presynaptic cells and to connect them to neighbouring postsynaptic cells, hence producing a continuous mapping." (Willshaw and von der Malsburg, 1976, p. 433)
A marginal note to add about the network mechanism: as outlined so far, it is sufficient to create topographic mappings, but with the restriction that the orientation of the pre- and postsynaptic cell arrays might be reversed (e.g., vertically: the lower presynaptic cells feed into the upper postsynaptic cells, and vice versa). To ensure that the dimensions of the two cell layers map in a specific way, Willshaw and von der Malsburg (1976) introduce a slight bias in the form of polarity markers: four pairs of pre- and postsynaptic cells with stronger initial connections than the other, randomly initialised weights. As the
authors note, setting the presynaptic cells up in this way does not commit them to project onto a particular postsynaptic cell cluster in the final mapping. The bias only specifies the spatial relation which the postsynaptic target clusters formed by the presynaptic markers will have to each other, and thereby the orientation of the mapping. Willshaw and von der Malsburg (1976) ran a number of different simulations with this model. In the beginning, a six-by-six array of presynaptic cells was to map onto an equally sized postsynaptic layer. After approximately 15,000 presentations of random activity clusters, a perfect topographical mapping had emerged. To model surgical operations, Willshaw and von der Malsburg then manipulated the resulting network by adding cells, initialising them with random connections and letting the net develop further. This was done in three different ways, namely by adding rows of cells to the presynaptic layer, adding cells to the postsynaptic layer, and enlarging both layers simultaneously. In all three cases, the mapping extended adequately to accommodate the added cells. This shows that the ordering of projective fields takes place on a global scale: the planes are mapped as wholes, and the mappings do not depend on specific cell connections. Lastly, Willshaw and von der Malsburg also demonstrated that the development of the mapping is tolerant of noisier activation patterns: they minimised the active cluster size to two adjacent presynaptic cells, and activated two such randomly chosen clusters in each trial. A perfect mapping developed none the less. So the model proved robust with regard to noise and reliably reproduced the results of the experiments mentioned above. What makes the model interesting in the present discussion is its synaptic plasticity. As Willshaw and von der Malsburg (1976) explicitly remarked, the emergence of the correct mapping does not depend on how the connectivity is set up initially. Even if, at the beginning, a postsynaptic cell is not connected to a presynaptic cell it should connect to according to the topographical ordering, it can come to form this synapse during development through the mechanisms outlined above. This can happen because an absent connection is treated as one of strength zero, and is thus liable to be increased through the Hebb rule when an appropriate constellation of pre- and postsynaptic activity is given. Willshaw and von der Malsburg regarded this as an unrealistic assumption of the two planes being fully connected, with a part of the connections having strengths of zero or near zero, and their justification for this simplification was that the model is not supposed "to represent the complex three dimensional geometry of axonal growth" (p. 441). However, one could argue that the connections of zero strength are indeed not grown synapses, but just possible pathways for synapses. From this point of view, the model does include an (admittedly very simple) representation of neuronal outgrowth in the possible transition of a zero-strength connection to an activity-conducting, `grown' connection. What
this viewpoint highlights is that the model is actually constructivist: the presynaptic units grow connections as needed. Thus, in this model the emergence of a topographic mapping is the result of the constructivist reorganisation of synaptic connectivity, and the constructive flexibility of the process is indeed a necessary ingredient if one assumes that the two planes are not fully connected from the beginning. Willshaw and von der Malsburg (1976) briefly consider how sparse an initial random mapping can be and still allow for correct development. On the grounds of a few estimates, they calculate that for a postsynaptic plane of 10^6 cells, there would have to be about 200 initial contacts from each presynaptic unit. The density could be still lower, they note, as their calculation does not take into account the lateral excitatory effects in the postsynaptic layer.
Structured Receptive Fields

The network described in Linsker (1986a,b,c) (see Linsker (1988) for an overview) was targeted at modelling a different phenomenon, namely "the origin and organization of feature-analyzing (spatial-opponent and orientation-selective) cells in simple systems governed by biologically plausible development rules" (Linsker, 1986a, p. 7508). As mentioned earlier, structures which work in a centre-surround fashion or which extract basic features like bars can be found in the intra-retinal neural system as well as in the primary visual cortex. Linsker employed successive neural layers which were connected in a feedforward, roughly two-dimensional topographic manner: each unit received activation from a random selection of units in the preceding layer, distributed with Gaussian density around the target unit's vertically projected location (depicted in Figure 9); these source units make up the unit's receptive field. The response m_j of a unit j was computed by the simple linear function

m_j = a_1 + Σ_i w_{ij} m_i,    (1)
where a_1 is an arbitrary constant, and w_{ij} is the strength of the connection from a unit i in the previous layer to unit j. The weights were changed according to a modified version of the Hebb rule:

Δw_{ij} = a_2 m_i m_j + a_3 m_i + a_4 m_j + a_5.    (2)

Again, the a's are constants. These Linsker took to be the same for all cells of the same layer; they could, for instance, be the result of epigenetic processes. To avoid the Hebbian problem of unbounded weight growth, Linsker employed simple weight clipping: the absolute value of a weight was not allowed to grow beyond a chosen threshold, e.g., −0.5 ≤ w_{ij} ≤ 0.5 for all w_{ij}. Developing the layers took place in a step-by-step fashion: the connections between two layers were modified until maturation, i.e., until they had acquired
Figure 9: Flow of activation in the Linsker model (input layer A, then layers B and C). Units have approximately circular receptive fields with higher density towards the middle.
stable values, and then the subsequent connections were modified according to the output computed through the earlier, mature weights. The first layer (layer A in Figure 9) was fed with random input. This was meant to reflect findings according to which some of the structural properties of the visual cortex outlined above are present in kittens and macaque monkeys prior to any structured visual input (Hubel and Wiesel, 1963, 1974). Linsker (1986a) was able to demonstrate that the receptive fields of units in subsequent layers will display certain characteristics under specific learning-parameter configurations and sizes of receptive fields. It is possible to configure the parameters such that, for the units in layer B, the weights will either all reach the positive limiting value or all reach the negative limiting value. This in turn will, again with the correct parameter regime, lead to the emergence of centre-surround receptive fields in layer C, where the weights of all units reach one limiting value near the centre of the receptive field, and reach the other limiting value in the outer region of the receptive field. Linsker (1986b) went on to show that this can be extended similarly to further layers. At the same time, a different choice of parameters in subsequent layers can lead to different types of receptive fields, like "bi-lobed" cells with receptive fields similar to those of the simple cells found in the visual cortex. Notably, it is merely a choice of learning parameters and sizes of receptive fields whether units in layers from D onward acquire centre-surround or simple receptive fields. Linsker (1986b) chose to `delay' the emergence of bi-lobed cells until layer G, for one because the bi-lobed features become increasingly pronounced in these later layers, and also because this allows for a rough comparison of the model with the visual system:
"In the visual system, we can loosely identify layer A of our system with the photoreceptor cell layer; layer B with the intraretinal convergence of inputs; opponent-cell layers C-F with bipolar and retinal ganglion cells, and possibly [...] cells of lateral geniculate and of layer IVc of primary cortex; and layer G with the orientation-specific `simple' cells of layer IVb and other layers of primary cortex." (Linsker, 1986b, p. 8394)
Extending this comparison, Linsker (1986c) showed that by introducing lateral connections in a later layer with bi-lobed cells, the originally arbitrarily oriented bi-lobed cells can be made to arrange themselves such that the orientation of the detected bar varies monotonically across the layer, with some disruptions in the variation, as is the case with the orientation columns found in the visual cortex. The lateral connections are developed along the same general lines as the feedforward connections between the layers. The underlying force which the learning rule and the parameter settings exploit to develop the desired structures is the varying correlation of the activities of units in the same layer. In this paragraph, I will give a short, non-mathematical, but in this context sufficient outline of the mechanisms involved in the emergence of the centre-surround cells in layer C. In layer B, two mature units with all weights excitatory and at the positive threshold will tend to show very similar activities for different random input patterns from layer A if they lie close together within B, because they will presumably share many incoming connections; similarly, the further two units are apart in B, the less correlated their activities will be, depending on the overlap of their receptive fields, and two units at opposite ends of the layer will probably have completely uncorrelated activities. Considering the connectivity of a unit in layer C to layer B, it will have a core region where many connections emerge from a comparatively small area in layer B, and an increasingly sparse distribution of weights from units lying further apart in B. Whenever high activation is propagated along one of the connections in the `inner' region, it is likely that many other connections in the vicinity will propagate an equally high amount of activation, because they originate from units in layer B with similar receptive fields. This means that activation in the core region in layer B will be highly correlated with the activation of the layer C unit. On the other side, the activity of the unit will not correlate to an equal degree with activation being propagated along the connections from the `outer' regions of layer B, because these connections are more sparse and the activation they convey is not necessarily correlated with the major stream of activation from the inner region. Since the Hebbian learning rule changes a weight's strength in accordance with the product of pre- and postsynaptic activation (for the Linsker version, this depends on
the parameters chosen), it rewards correlation between the two, and thus the weights in the dense inner regions will be affected more than those in the sparse outer regions. Presupposing the correct choice of parameters, this will result in the predicted centre-surround cells. All in all, the Linsker model presents a very simple and general architecture which develops features strikingly similar to those found in the visual cortex, even though the most specific assumptions it commits itself to are the Hebbian learning rule and the roughly topographical mapping between the layers. The observed structures emerge through self-organisation in the absence of structured input, but they do not depend on random input, as Linsker (1988) has noted. Structured input would in fact `shift' the observed receptive fields so that they emerge in earlier layers (ibid., p. 109). Apart from that, varying parameter settings produce different sequences of structured receptive fields. Linsker (1988) proposes that such mechanisms may in part account for the fact that the numbers of centre-surround layers preceding the orientation-selective cells differ in the cat and macaque monkey visual systems (ibid., p. 115). Note that von der Malsburg (1973) had presented a different model for the emergence of orientation columns. Being a forerunner of the Willshaw and von der Malsburg (1976) model, it employed basically the same architecture and development mechanisms as the model presented above, the difference being that it was trained on structured input of bars with varying orientation. The Linsker model seems to have the advantage of being more general in that it does not have to rely on structured input, and of providing a more unified view of structure emergence, since the lateral connections it introduces to model orientation columns are basically the same as those assumed in the feedforward processing. However, it should be clear that the Linsker model, as opposed to the other two, is not a constructivist model: the initially specified, randomly chosen connectivity of the units cannot be modified in the course of development.
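For concreteness, here is a minimal sketch of the Linsker update per pattern, combining the response rule (equation 1) with the Hebb-type update and clipping (equation 2). The dense weight matrix and the single-step loop are simplifying assumptions; in the original model, only the randomly selected receptive-field connections exist, and each layer is developed to maturation before the next.

```python
# Illustrative one-step Linsker update for a layer of linear units.
import numpy as np

def linsker_update(m_in, W, a, clip=0.5):
    """W[i, j] connects input unit i to unit j; m_in is the input activity."""
    a1, a2, a3, a4, a5 = a
    m_out = a1 + m_in @ W                              # equation (1)
    dW = (a2 * np.outer(m_in, m_out)                   # Hebbian product term
          + a3 * m_in[:, None] + a4 * m_out[None, :] + a5)
    W = np.clip(W + dW, -clip, clip)                   # simple weight clipping
    return m_out, W
```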
Summary

While the model by Willshaw and von der Malsburg (1976) does illustrate how constructivist mechanisms can play an important part in developing the described neural structures, the Linsker model, presupposing a fixed architecture, does not lend itself to an interpretation within this framework. In the following section, I will test whether the results of the Linsker model are compatible with a constructively developing architecture.
4 A Constructivist Linsker Network

In this section, I will develop and test a constructivist version of the Linsker model. Instead of presupposing the fixed architecture of the original model, this algorithm presupposes only an arrangement of successive, topographically related layers of feedforward units. The receptive fields of the units have to develop during the presentation of patterns (or noise, as in the original experiments). To devise a suitable construction mechanism, I will draw upon the Feature Generation Mechanism (FGM) described by Joseph (in preparation), which is explicitly neurobiologically motivated.
4.1 The Feature Generation Mechanism

In its standard version, the FGM consists of two topographically related layers with equal numbers of Boolean units, i.e., units which can have activation states of either +1 (`on') or −1 (`off'). Activation is fed into the input layer, and the units in the following primary layer are in charge of developing receptive fields by forming and retracting Boolean connections to the input units. The activation P_i of a primary unit i is calculated according to a dynamic threshold function: the activation is one if the weighted activation from the input units equals the total number of connections, and minus one otherwise:

P_i = +1   if Σ_j w_{ij} I_j = c_i,
P_i = −1   otherwise (i.e., if Σ_j w_{ij} I_j < c_i),

the sum running over the input units connected to i.
Here, w_{ij} is the connection from the primary unit i to the input unit j, which has the activation I_j; c_i is the total number of connections formed by i. This function will make the unit turn on for exactly one specific configuration of input values along its incoming connections, i.e., when the input activations equal their respective weights, because only then will each connection contribute a weighted activation of one to the total net input. This constellation of input values is the "feature" detected by the unit, and each connection detects what can be called an "elementary feature". In these terms, the feature detected by the unit is the logical AND of all the elementary features. Note that this means that by adding a connection to a unit, the newly detected feature cannot occur more often than the previously detected one. The construction mechanism is designed such that each unit seeks, or `generates', a feature which occurs in exactly half of the patterns. The rationale behind this is that a unit which is turned on half the time is maximally informative in the sense of information theory. Initially, the two layers are connected in a one-to-one fashion, with each primary unit connected to the input unit `beneath' it. Development of the units' receptive fields proceeds in epochs and independently for each unit. An
epoch is divided into a phase of pattern presentation and a growth phase. During the first phase, each (Boolean) pattern is fed into the input layer. At the same time, each unit, input as well as primary, stores its summed activity over all patterns; for the input units, this is simply the summed input activation. After all patterns have been processed in this way, the net changes into the growth phase. In the growth phase, each primary unit i changes its number of connections depending on the activation sum Σ_S P_i accumulated over the given pattern set S:
- If Σ_S P_i equals zero, then the unit has been turned on for half the patterns and off for the other half. Since this is the desired equilibrium, i will not change its connectivity.
- If Σ_S P_i > 0 (i.e., the feature detected by the unit occurs too often), i will add a new connection. The new input unit is picked randomly according to a Gaussian probability distribution around i's projection onto the input layer. Input units which are already connected to i are not considered. If the summed activation of the chosen input unit is less than or equal to zero, the connection will be inhibitory, otherwise excitatory. This choice of sign makes sure that the elementary feature detected by the connection appears in at least half of the patterns, so as not to automatically generate a feature with too low a frequency.
- If Σ_S P_i < 0 (i.e., the feature detected by the unit does not occur often enough), i will delete a connection (usually the last one added). In the first epoch, when i has just one connection, this will lead to the disconnection and functional disposal of i.
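The three cases translate directly into a small decision procedure; the sketch below is an illustrative assumption about data structures (a list of `(input_index, weight)` pairs in order of addition, and a caller-supplied Gaussian picking routine), not Joseph's implementation.

```python
# Sketch of the FGM growth phase for a single primary unit.
def growth_phase(unit_sum, input_sums, connections, pick_gaussian):
    """unit_sum: the unit's summed activity over the pattern set;
    input_sums: summed activations of the input units;
    connections: list of (input_index, weight) in order of addition."""
    if unit_sum == 0:
        return connections              # feature occurs in half the patterns: done
    if unit_sum > 0:                    # feature too frequent: add a connection
        j = pick_gaussian(exclude=[c for c, _ in connections])
        sign = -1 if input_sums[j] <= 0 else +1   # elementary-feature sign rule
        connections.append((j, sign))
    elif connections:                   # feature too rare: retract the last one
        connections.pop()
    return connections
```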
The two phases are repeated until all primary units which still have connections to the input layer have settled on a feature which occurs in half the patterns. Note that the algorithm will not necessarily converge, and that primary units can end up detecting the same feature, especially those located close to one another. The FGM demonstrates how a condition imposed by the desired configuration of the units can directly govern synaptic outgrowth and deletion. It basically adds and tries out connections until the unit has developed satisfactorily, deleting connections if they prove a hindrance. To be able to judge a unit's performance, the FGM computes its summed activation, a mechanism which can be regarded as drawing on a sort of neural memory. This notion of a neural memory is neurobiologically plausible, as pointed out by Joseph (in preparation) with reference to findings which suggest that synaptic development might depend on a cell's internal calcium concentration, which can be thought of "as
reflecting a time average of the cell's previous activity" (ibid.). In the next section, I will show how these general principles of synaptic development can be directly incorporated into the Linsker architecture.
4.2 Merging FGM and Linsker

The constructivist Linsker net (CLN) has all its stationary features in common with the original Linsker model. Its architecture is a succession of topographically mapping feedforward layers, the units' responses are calculated by the simple linear response rule, and the units' weights are modified according to Linsker's Hebbian-like learning rule. In those aspects where the CLN differs from the original, it resembles the FGM: the units' receptive fields are not set from the beginning, but have to develop through constructive and regressive mechanisms. Like the FGM, the CLN needs a criterion to decide when to stop growing connections, and under what circumstances to retract them. A convenient measure for this purpose can be found in Linsker (1988), where Linsker demonstrated that the Hebb-type learning rule employed in the model maximises the variance of a unit's output activity if its weights (or their absolute values) are constrained to sum to a fixed constant, in addition to being subject to the clipping constraint. The duty of the growth processes in the CLN will thus be to optimise synaptic development with regard to the maximisation of the unit's output variance. To do so, the algorithm borrows the two principles used in the FGM:

1. Add connections until the output variance is optimal. The source units for new connections are determined by a Gaussian probability distribution, as in the FGM. Generally, a unit i's output variance V_i is calculated as

V_i = (1/N) Σ_{p=1..N} (a_i^p − ā_i)²,
where N is the number of patterns presented, a_i^p is the output of i for pattern p, and ā_i is i's average activation over all patterns. Assuming normalised input activation in the range [−1, 1], and the weights being normalised such that their absolute values add up to a constant β, the output range for a unit will be [−β, β]. In this case, maximal variance will occur when the unit's activation is −β half the time and β the other half; the output variance is then equal to β². This value can serve as a threshold which determines when development should terminate, since the unit could not get any better with respect to its output variance. However, it is unlikely that a unit's output variance will ever get near that value if
the unit does not happen to have perfect initial input. A way to ensure that a unit will meet the threshold sooner or later is to gradually anneal the threshold as new connections are added. A new connection should initially be set to zero in order to avoid the conduction of random activity, which could influence the learning rule in such a way that it would reverse the development of other weights. When a new weight is initially set to zero, it will not influence the unit's activity, but it will be modified according to the unit's output activity and its own input activity, thereby being able to gradually tune in with the settings of the other weights.

2. Retract connections which have shown either to have a negative effect on the output variance, or to have too little influence on it to be of interest. It is not guaranteed that adding a new connection will increase a unit's variance, so a new connection will be retracted again if it has not shown a positive effect on the unit's output variance after a certain tolerance period. How strict a `positive effect' is can be controlled by a tolerance parameter λ: a new connection is to be pruned if V_i < λ V_i′, where V_i′ is the unit's output variance prior to the insertion of the new connection. For λ > 1, a new connection must have increased the output variance in order not to be pruned, while setting λ < 1 will tolerate some decrease in the output variance. It might also be desirable to control the maximally possible size of a unit's receptive field. To do so, the CLN will every so often compare the variances of the input activations along the units' connections. It will compute the total input activation variance Σ_j V_j for a unit i, and the fraction of each connection j's input activation variance V_j of this sum is compared against a threshold ε, a very small positive value. A connection with a variance fraction below the threshold (i.e., V_j / Σ_j V_j < ε) will be pruned. Thus, ε directly regulates the maximal number of connections a unit can form, because as the number of connections increases, each connection's variance fraction will tend to decrease, until at least one necessarily falls below the threshold. Apart from that, this mechanism will tend to filter out uninteresting connections, i.e., those with a low input variance relative to those of the unit's other connections. Similarly to the FGM, the CLN's construction and pruning mechanisms require a cellular memory. For instance, each unit will have to remember its present output variance threshold, and it will have to be able to compute its actual output variance. However, to be able to compute its output variance, a unit would literally have to remember all its past output activations, which
would seem a very implausible biological assumption. Instead, the CLN computes a cumulative variance V_c, which can be updated for each new activation a by remembering the overall activation sum A and the total number N of patterns processed:

V_c = V_c + (a − A/N)².

With increasing N, this cumulative variance tends towards the correct output variance, and it requires only assumptions about neural abilities to reflect past development which are not beyond biological possibility. The CLN's pruning mechanism similarly requires specific values to be remembered, in this case within the different connections. Again, the value needed is the cumulative variance, this time of each connection's input activation, in order to be able to compute the variance fractions mentioned above. The assumption of a specific synaptic memory is not necessarily too bold neurophysiologically. As Ooyen (1994) notes: "The influences of electrical activity and neurotransmitters act on the level of both single neurite and whole cell." (p. 408) It would not seem unlikely that single synapses can incorporate `memory' mechanisms similar to those which are thought to work at the neuronal level. What has not been treated so far is the question of when to add connections and when to prune them. Since the Linsker rule presumably needs a few input patterns to develop a weight before any reliable prediction about its value can be made, there has to be a `tolerance time' for new connections during which they cannot be pruned. Mainly to keep things simple, in the CLN presented here this tolerance time more generally controls when the net can undergo changes in connectivity, constructive as well as regressive. An intuitive choice for this tolerance time might be a multiple of the number of patterns presented to the net. A neurophysiologically more plausible version of this mechanism could, for instance, be regulated by some energy resource, which would be spent in processes of growth and retraction and would then slowly have to be restored within the unit before the next change in connectivity could take place. The whole algorithm in pseudo-code notation, describing the development of some unit i:

1. Start with one connection to the input unit immediately `below' i. Initialise i's variance threshold v_i with 1, and its tolerance time t_i with a chosen value t_m, t_m ∈ ℕ⁺. Set i's memory values P_i (number of patterns presented), A_i (sum of output activation) and V_i (cumulative output variance) to zero, and do this as well for the initial connection. For i, also set V_i′ to zero.

2. Feed a pattern into the input layer, and:

(a) Compute the activation a_i according to the linear response rule given in equation 1.
(b) Modify the weights according to the Linsker learning rule (equation 2), then clip and normalise them, i.e.,

−κ < w_{ji} < κ,   w_{ji} = β w_{ji} / Σ_j |w_{ji}|,

for all connections w_{ji} from an input unit j to i, where κ and β are chosen constants, κ, β > 0.

(c) Increase i's pattern counter P_i, update the summed activation A_i = A_i + a_i, and update the cumulative variance V_i = V_i + (a_i − A_i/P_i)².

(d) Update the memories of all connections w_{ji} to i as done for i in step 2c, with a connection's activation a_{w_{ji}} being the activation a_j of the input unit j.

3. Decrease the unit's tolerance time t_i by 1. If t_i = 0, reset it to t_m and go into growth mode:

(a) If V_i/P_i > β² v_i, exit from learning.

(b) If V_i/P_i < λ V_i′, then delete the last-added connection w_{ji} (i.e., the one with the lowest P memory value), where λ > 0. Set V_i′ to V_i/P_i.

(c) For each weight w_{ji}, check whether V_{w_{ji}} / Σ_j V_{w_{ji}} < ε (with 0 < ε ≪ 1) and delete w_{ji} if so.

(d) Select an input unit k which is not yet connected to i according to a Gaussian probability distribution centred around the vertical projection of i onto the input layer. Initialise the weight w_{ki} with zero (or a random value near zero), and set the memory values P_{w_{ki}}, A_{w_{ki}} and V_{w_{ki}} to zero.

(e) Decrease the threshold v_i: v_i = γ v_i, where γ determines the speed of convergence, 0 < γ < 1.
4. Repeat from step 2.
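To make the control flow concrete, here is a compact, illustrative Python transcription of this loop for a single unit. The helpers `get_pattern` (pattern supply) and `pick_source` (Gaussian source selection) are assumptions, as are the default parameter values; the only deliberate simplification is that V′ is updated on every growth cycle, i.e., it always holds the variance prior to the latest insertion.

```python
# A sketch of the CLN development loop above for one unit.
import numpy as np

def develop_unit(get_pattern, pick_source, a=(0.0, 0.25, 0.15, 0.15, 0.0),
                 kappa=0.5, beta=1.0, lam=1.0, eps=0.05, gamma=0.99, t_m=200):
    a1, a2, a3, a4, a5 = a
    sources = [pick_source(exclude=[])]      # step 1: a single initial connection
    w = np.zeros(1)
    v, t = 1.0, t_m                          # annealed threshold, tolerance time
    P, A, V, V0 = 0, 0.0, 0.0, 0.0           # unit memory (counts, sums, variance)
    conn = [[0, 0.0, 0.0]]                   # per-connection [P, A, V] memories
    while True:
        x = get_pattern()                    # step 2: present one input pattern
        inp = x[sources]
        ai = a1 + float(w @ inp)             # (a) linear response (equation 1)
        w = np.clip(w + a2*inp*ai + a3*inp + a4*ai + a5, -kappa, kappa)
        w *= beta / (np.abs(w).sum() + 1e-12)    # (b) normalise |w| sum to beta
        P += 1; A += ai; V += (ai - A/P)**2      # (c) cumulative unit memory
        for m, xj in zip(conn, inp):             # (d) connection memories
            m[0] += 1; m[1] += xj; m[2] += (xj - m[1]/m[0])**2
        t -= 1
        if t:
            continue
        t = t_m                              # step 3: growth mode
        if V/P > beta**2 * v:                # (a) variance goal met: stop
            return sources, w
        if V/P < lam * V0 and len(sources) > 1:  # (b) prune the newest connection
            new = min(range(len(sources)), key=lambda c: conn[c][0])
            sources.pop(new); conn.pop(new); w = np.delete(w, new)
        V0 = V/P                             # remember variance before insertion
        total = sum(m[2] for m in conn) + 1e-12  # (c) prune low-variance inputs
        keep = [c for c in range(len(sources)) if conn[c][2]/total >= eps]
        sources = [sources[c] for c in keep]; conn = [conn[c] for c in keep]
        w = w[keep]
        k = pick_source(exclude=sources)     # (d) grow a new zero-weight connection
        sources.append(k); conn.append([0, 0.0, 0.0]); w = np.append(w, 0.0)
        v *= gamma                           # (e) anneal the variance threshold
```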
4.3 Experiments
Emergence of Centre-Surround Cells

An ANSI-C implementation of the algorithm was run for two successive layers (B and C in the model) of around twenty by twenty units each, with the units in layer C developing between 90 and 100 synapses, about twice as many as the units in layer B formed. The input layer (A) was instantiated with random activation in the range [−1, 1]. The rather severe limitation in size had to be imposed in order to keep the computational effort within acceptable bounds; in his original experiments, Linsker drew upon a number of simplifying
assumptions about statistical properties of the input activations which allowed him to simulate units with up to 600 synaptic connections. For layer B, the algorithm reliably produced all-excitatory receptive fields for the generally used learning parameters a_1 = 0, a_2 = 0.25, a_3 = 0.15, a_4 = 0.15 and a_5 = 0 (corresponding to the naming in equations 1 and 2), in conjunction with various construction parameter settings. For large parts of the learning-parameter space, the algorithm developed only very ragged receptive fields in layer C; this was partly predicted by Linsker (1986a). However, for parameter configurations similar to those with which the original model had reliably achieved the desired results, the CLN displayed only partial successes, with receptive fields of different structures emerging only occasionally in layer C for different runs with varying learning and construction parameters (on average between five and ten for the whole layer). Figure 10 in appendix A shows examples of the two different types of centre-surround receptive field. Often, receptive fields would bear a similar but slightly perturbed structure, on rare occasions even resembling the receptive fields of orientation-selective cells, as shown in Figure 11a (appendix A). Still, the major part of the receptive fields showed no particular structure; an example is given in Figure 11b. These results have to be judged carefully and in the light of the difficult conditions under which the simulations ran. While Linsker used ideal statistical assumptions about the input activity at all stages of the network, the simulations had to work with a comparatively small number of pseudo-random input patterns (usually no more than 6000 before convergence occurred). Moreover, the idealisations allowed Linsker to assume very large receptive fields with a nearly optimal Gaussian distribution, which, as outlined in the previous section, plays a central role in the development of the receptive fields. In the simulations, on the other hand, the receptive fields normally had no more than 100 synapses, distributed in a pseudo-Gaussian fashion over an at most twenty-five by twenty-five input matrix. These considerations suggest that the performance of the CLN was impaired by external constraints in various respects, and that the results can perhaps not be compared entirely with those produced by Linsker. From this perspective, it could be argued that a unit in the CLN had to be very lucky with the various conditions if it was to develop a structured receptive field, and that the cases where such fields emerged (with varying degrees of quality) might be just such cases of `luck'. It would be revealing to see how the original model performs under these restrictive conditions; and yet more revealing to let the CLN run with larger layers, slower annealing and bigger receptive fields.
Feature Generation

As a child of the Linsker model and the FGM, the CLN can be tested not only with regard to its neurobiological properties, but also with regard to its performance as a feature generation mechanism. To be of value as such, the units should configure their receptive fields so as to detect interesting features in the training data when developed under the influence of structured input patterns. Just how `interesting' the features detected by the units are generally depends on what the learning rule configures them for. As said above, using the Linsker rule with the additional constraint of weight normalisation makes a unit develop its weights such that the unit's output variance is maximised. Using other Hebb-type rules produces other results. A classic example is the variant introduced by Oja (1982), which imposes the constraint that the squared values of a unit's weights add up to 1. Oja demonstrated that the weight vector then converges on the first principal component of the input data, which is the direction of maximal spread of the data. More recently, learning rules have been devised which optimise a unit's weights with regard to information-theoretic measures. For example, Bell and Sejnowski (1995) present a simple anti-Hebbian rule which maximises the output information of a continuous-valued unit with a sigmoid activation function. It would lead too far to give an overview of different unsupervised learning rules here; a review which highlights the biological considerations involved is given in Linsker (1990), and for a more current and comprehensive coverage see Harpur (1997). As a speculative outlook, however, note that the construction principles used in the FGM and in the CLN can be applied to architectures with very different learning rules. For instance, the Linsker rule in the CLN could be replaced by Oja's rule or indeed any other learning rule which maximises a unit's output variance subject to some constraint. The condition which has to be satisfied in order to incorporate the principles is that it must be possible to define various thresholds with regard to the optimisation goal, such as an upper bound at which to stop the growth process. Equally, it has to be possible to decide at what stage to add a new connection, and the same applies to the process of eliminating connections. Of course, other issues would also have to be addressed, for instance to what extent a new connection might disturb a unit's previous weight constellation. I have applied the CLN to a `toy' problem of extracting features from a set of grey-valued two-dimensional representations of digits; the patterns, shown in Figure 12 (in appendix B), were presented to a square layer of 121 units in random order. Interestingly, the resulting connectivity patterns proved very robust with regard to different learning parameters. On the other hand, varying the size restriction on the receptive field yielded surprisingly
different results. Figure 13 (in appendix B) shows typical receptive fields of the top left thirty units for ε = 0.15, and Figure 14 (in appendix B) depicts the receptive fields of the same units when developed with ε = 0.08. This would suggest that an increased frequency of weight deletion can indeed drastically alter a unit's weight configuration in cases where the total number of weights is comparatively small. Unfortunately, it was beyond the given limits of time to test how well the extracted features allow for a discrimination of the patterns. To do so, one could add a second layer and train it to discriminate the input for increasingly noisy input patterns; the results would indicate how robust the representation of the patterns in the first layer is. A similar approach has been taken by Joseph (in preparation) to test the FGM. It is also worth noting that many of the input pixels were not connected to any unit at the end of the training. For ε = 0.15, normally between 69 and 72 of the 121 input pixels were connected to at least one unit. This means that, apart from the outer forty pixels (which have zero input variance), around ten other pixels were also without influence. In architectures with more than one layer, this could prove a useful approach to pruning units.
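As an illustration of the rule-swapping outlook above, here is a minimal sketch of Oja's (1982) update, which could replace the Linsker rule while keeping the CLN's growth and pruning criteria intact. The learning rate, epoch count and loop structure are assumptions for demonstration.

```python
# Oja's rule: the weight vector of a linear unit converges on the first
# principal component of the input data, with ||w|| tending to 1.
import numpy as np

def oja(patterns, lr=0.01, epochs=50):
    w = np.random.randn(patterns.shape[1]) * 0.01
    for _ in range(epochs):
        for x in patterns:
            y = w @ x                        # linear unit output
            w += lr * y * (x - y * w)        # Hebbian term plus decay term
    return w
```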
5 Conclusions

In the foregoing sections, it has been shown that constructive network algorithms offer a variety of advantages over network algorithms working with stationary architectures. These advantages have been reviewed along the lines of general problem solving on the one hand and of modelling cognitive development on the other. With regard to the first point, constructive algorithms have been shown not only to be able to overcome some of the problems the use of static architectures entails, but also to offer different options for the optimisation of a network's architecture. With regard to the second point, experimental evidence has been reviewed which suggests that constructive network models can develop in psychologically more plausible ways than static networks, and that the ability to acquire some cognitive abilities might even critically depend on a gradual enlargement of an initially highly constrained learning architecture. Considerations of learning-theoretical origin have been presented along similar lines, as well as the demonstration that the principled discrepancy between the stability/plasticity equilibrium observed in human learning on the one side, and the strict dissociation of the two in conventional neural network learning on the other, can be resolved by a constructive solution. Together with the fact that constructive neural networks provide an illustration of how a system can increase its complexity in response to input received from its environment, the above considerations make constructive network algorithms back the plausibility of claims about the nature of cognitive development which have been brought forward by exponents of the neural constructivist paradigm. However, by declaring the inseparability of neural and cognitive development, neural constructivism also explicitly makes the assumption that the nature of structural development in cognitively relevant neural circuitry is constructivist. The visual cortex, with its multitude of peculiar neural formations, seems a convenient part of the brain on which to test this assumption. When looking at the Willshaw and von der Malsburg (1976) model of the emergence of topographic mappings, it was found that the model actually establishes its final state via a constructive reorganisation of neural connectivity. In contrast, the Linsker (1986a,b,c) model of the emergence of structured receptive fields proved not to be constructive. This led to the proposal of a constructivist extension of the model, developed along biologically motivated principles introduced in the Feature Generation Mechanism by Joseph (in preparation). The constructivist Linsker network yielded ambiguous results when used to model the same phenomenon as its original: on the one hand, the desired result was the exception; on the other hand, these exceptions may have been the best results the model could yield, given the computational limitations imposed on the simulations for economical reasons. Using the dimensions of the original model would then seem likely to yield better results.
In summary, it would seem that the paradigm of neural constructivism is backed not only by constructive networks employed as models of cognitive development, but also by self-organising models of developmental dynamics in neural circuitry. Clearly, the evidence provided here is not a result on which to rest; but it makes the prospect of further studies in this direction seem promising.
A Receptive Fields in the CLN
Figure 10: a) The receptive field of an off-centre cell located at (17/7), the leftmost corner of the surface being (0/0). The grey value at the left side of the surface denotes zero-strength connections; lighter shading means positive strength, darker means negative. b) The receptive field of an on-centre cell located at (2/14).
Figure 11: a) A distorted receptive field for a cell located at (9/10), vaguely resembling an orientation-selective receptive field for a bar stretching from the middle to the far right corner. b) A distorted receptive field for a cell located at (10/7), which lies just in the middle of the two positive peaks and the two negative valleys. In this plot, zero-strength connections are of the dark grey visible at the left border of the surface.
B Feature Extraction
Figure 12: A set of eleven-by-eleven-pixel patterns in seven grey values representing the ten digits. The patterns were created by fuzzifying a set of binary digit representations.
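The fuzzification procedure is not spelled out here; purely as an illustration, the following sketch shows one way such patterns could be produced, assuming a 3x3 box blur followed by uniform quantization to seven grey levels (both assumptions, not the procedure actually used).

```python
import numpy as np

def fuzzify(binary_digit, n_levels=7):
    # Blur a binary 11x11 pattern with a 3x3 box filter, then quantize
    # the result to `n_levels` evenly spaced grey values in [0, 1].
    padded = np.pad(binary_digit.astype(float), 1)
    blurred = sum(padded[i:i + 11, j:j + 11]
                  for i in range(3) for j in range(3)) / 9.0
    return np.round(blurred * (n_levels - 1)) / (n_levels - 1)
```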
Figure 13: The receptive fields of the thirty top left units, developed with a growth limit parameter of 0.15. Black circles denote positive weight values, white circles denote negative weights. The area of a circle is proportional to the weight's absolute strength.
Figure 14: The receptive fields of the thirty top left units, developed with a growth limit parameter of 0.08. Black circles denote positive weight values, white circles denote negative weights. The area of a circle is proportional to the weight's absolute strength.
References

Baluja, S. and Fahlman, S. (1994). Reducing network depth in the cascade-correlation learning architecture. Technical Report CMU-CS-94-209, Carnegie Mellon University, Pittsburgh.
Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Bourgeois, J., Goldman-Rakic, P., and Rakic, P. (1994). Synaptogenesis in the prefrontal cortex of rhesus monkeys. Cerebral Cortex, 4, 78-96.
Carpenter, G. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919-4930.
Carpenter, G. and Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. IEEE Computer, pages 77-88.
Carpenter, G. and Grossberg, S. (1990). ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, 3, 129-152.
Carpenter, G., Grossberg, S., and Rosen, D. (1991a). ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition. Neural Networks, 4, 493-504.
Carpenter, G., Grossberg, S., and Rosen, D. (1991b). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759-771.
Changeux, J. and Dehaene, S. (1989). Neuronal models of cognitive functions. Cognition, 33, 63-109.
Chomsky, N. (1959). On certain formal properties of grammars. Information and Control, 2, 137-167.
Chomsky, N. (1980). Rules and representations. Behavioral and Brain Sciences, 3, 1-61.
Churchland, P. and Sejnowski, T. (1992). The Computational Brain. MIT Press.
Le Cun, Y., Denker, J., and Solla, S. (1990). Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann.
Elman, J. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48, 71-99.
Fahlman, S. (1988). An empirical study of learning speed in back-propagation networks. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann.
Fahlman, S. and Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann.
Fodor, J. (1980). Fixation of belief and concept acquisition. In M. Piattelli-Palmarini, editor, Language and Learning: The Debate between Chomsky and Piaget. Harvard University Press.
Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2, 198-209.
Frean, M. (1992). A "thermal" perceptron learning rule. Neural Computation, 4, 946-957.
Fritzke, B. (1994a). Fast learning with incremental RBF networks. Neural Processing Letters, 1, 2-5.
Fritzke, B. (1994b). Growing cell structures - a self-organizing network for unsupervised and supervised learning. Neural Networks, 7, 1441-1460.
Fritzke, B. (1994c). Supervised learning with growing cell structures. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann.
Fritzke, B. (1995a). Kohonen feature maps and growing cell structures - a performance comparison. In C. Giles, S. Hanson, and J. Cowan, editors, Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann.
Fritzke, B. (1995b). Growing self-organizing networks - why? In M. Verleysen, editor, ESANN'96: European Symposium on Artificial Neural Networks. D-Facto.
Gallant, S. (1986a). Optimal linear discriminants. In Proceedings of the 8th IEEE Conference on Pattern Recognition. IEEE Computer Society.
Gallant, S. (1986b). Three constructive algorithms for network learning. In Proceedings of the 8th Annual Conference of the Cognitive Science Society, pages 652-660.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
Gold, E. (1967). Language identification in the limit. Information and Control, 10, 447-474.
Grossberg, S. (1976a). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
Guyton, A. (1991). Textbook of Medical Physiology. Saunders, eighth edition.
Harpur, G. (1997). Low Entropy Coding with Unsupervised Neural Networks. Ph.D. thesis, University of Cambridge.
Hassibi, B. and Stork, D. (1993). Second order derivatives for network pruning: optimal brain surgeon. In S. Hanson, J. Cowan, and C. Giles, editors, Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann.
Hebb, D. (1949). The Organization of Behaviour. Wiley.
Hecht-Nielsen, R. (1990). Neurocomputing. Addison-Wesley.
Herrmann, K. and Shatz, C. (1995). Blockade of action potential activity alters initial arborization of thalamic axons within cortical layer 4. Proceedings of the National Academy of Sciences, 92, 11244-11248.
Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley.
Hirose, Y., Yamashita, K., and Hijiya, S. (1991). Back-propagation algorithm which varies the number of hidden units. Neural Networks, 4, 61-66.
Hubel, D. and Wiesel, T. (1963). Receptive fields of cells in striate cortex of very young, visually inexperienced kittens. Journal of Neurophysiology, 26, 994-1002.
Hubel, D. and Wiesel, T. (1974). Ordered arrangement of orientation columns in monkeys lacking visual experience. Journal of Comparative Neurology, 158, 307-318.
Huttenlocher, P. (1992). Neural plasticity. In A. Asbury, G. McKhann, and W. McDonald, editors, Diseases of the Nervous System, volume 1. Saunders, second edition.
Joseph, S. (in preparation). Theory of Adaptive Neural Growth. Ph.D. thesis, Centre for Cognitive Science, University of Edinburgh.
Joseph, S., Steuber, V., and Willshaw, D. (1996). The dual role of calcium in synaptic plasticity of the motor endplate. In Computational Neuroscience: Trends in Research.
Katz, L., Gilbert, C., and Wiesel, T. (1989). Local circuits and ocular dominance columns in monkey striate cortex. Journal of Neuroscience, 9, 1389-1399.
Kwok, T.-Y. and Yeung, D.-Y. (1997). Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3), 630-645.
Ling, C. and Marinov, M. (1993). Answering the connectionist challenge: A symbolic model of learning the past tenses of English verbs. Cognition, 49, 235-290.
Linsker, R. (1986a). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proceedings of the National Academy of Sciences, 83, 7508-7512.
Linsker, R. (1986b). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proceedings of the National Academy of Sciences, 83, 8390-8394.
Linsker, R. (1986c). From basic network principles to neural architecture: Emergence of orientation columns. Proceedings of the National Academy of Sciences, 83, 8779-8783.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, pages 105-117.
Linsker, R. (1990). Perceptual neural organization: Some approaches based on network models and information theory. Annual Review of Neuroscience, 13, 257-281.
MacWhinney, B. and Leinbach, J. (1991). Implementations are not conceptualizations: Revising the verb learning model. Cognition, 40, 121-157.
Minsky, M. and Papert, S. (1969). Perceptrons. MIT Press.
Mohraz, K. and Protzel, P. (1996). FlexNet - a flexible neural network construction algorithm. In Proceedings of the 4th European Symposium on Artificial Neural Networks.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3), 267-273.
van Ooyen, A. (1994). Activity-dependent neural network development. Network, 5, 401-423.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213-225.
Prasada, S. and Pinker, S. (1993). Generalization of regular and irregular morphological patterns. Language and Cognitive Processes, 8(1), 1-56.
Quartz, S. (1993). Neural networks, nativism, and the plausibility of constructivism. Cognition, 48, 223-242.
Quartz, S. and Sejnowski, T. (in press). The neural basis of cognitive development: A constructivist manifesto.
Rakic, P., Bourgeois, J.-P., Eckenhoff, M., Zecevic, N., and Goldman-Rakic, P. (1986). Concurrent overproduction of synapses in diverse regions of the primate cerebral cortex. Science, 232, 232-235.
Redding, N., Kowalczyk, A., and Downs, T. (1993). Constructive higher-order algorithm that is polynomial time. Neural Networks, 6, 997-1010.
Riedmiller, M. and Braun, H. (1992). Rprop - a fast adaptive learning algorithm. Technical report, University of Karlsruhe.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan.
Rumelhart, D. and McClelland, J. (1986). On learning the past tenses of English verbs. In D. Rumelhart and J. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2: Psychological and Biological Models. MIT Press.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart and J. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations. MIT Press.
Schiffmann, W., Joost, M., and Werner, R. (1992). Synthesis and performance analysis of multilayer neural network architectures. Technical Report 16/1992, University of Koblenz.
Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27, 1134-1142.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100.
Westermann, G. (1996). Learning the English past tense in a constructivist neural network. Summer Research Project.
Westermann, G. (in press). A constructivist neural network learns the past tense of English verbs. In Proceedings of GALA 1997.
Widner, R. (1989). Single-stage logic. In Neural Computing, Theory and Practice. Van Nostrand Reinhold.
Willshaw, D. and von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194, 431-445.
Zell, A. (1994). Simulation Neuronaler Netze. Addison-Wesley.