Notes on Learning to Compute and Computing to Learn
Khurshid Ahmad
Department of Computing, University of Surrey, Guildford, Surrey, GU2 7XH
([email protected])
Abstract. One key message in modern neuroscience is multimodality: the ability of uni-modal areas in the brain, such as those for speech and vision, to interact with each other and with hetero-modal areas – areas activated by two or more input modalities – converging the outputs of the uni-modal systems to produce 'higher cognitive' behaviour. Such behaviour includes the ability to quantify and the ability to retrieve images when linguistic cues are provided, and vice versa. Multi-net neural computing systems that can simulate such behaviour are reported. Multi-net systems comprise modules that take unimodal input and one or more modules that facilitate cross-modal interaction through Hebbian connections. These systems have achieved a modicum of success.
Keywords: Multi-net neural computing, multi-modal systems, competitive learning, image retrieval, information extraction.
1 Introduction
Neural computing systems were inspired by developments in neuroscience in the 1940s, and with few exceptions such systems work as cellular networks and attempt to simulate learning at the cellular level. Developments in the neurosciences, especially during the 1980s, have provided a unique insight into how the brain functions and, by implication, how humans learn and behave. A brief review of these developments is in order here (Section 2) before I describe some of the work in neural computing that has been inspired by looking at the cooperative behaviour of two or more cellular networks for simulating intelligent behaviour. The title of this paper is an imitation of Elaine Rich and Kevin Knight's introduction to neural computing in their book Artificial Intelligence (1999), where they suggest that if the network can compute then it will learn to compute. This is perhaps at some variance with other cognitivist approaches in machine learning, where the emphasis appears to be on 'computing to learn' – the insistence being on representation schemata and reasoning strategies, whether rule induction or rule deduction, learning by synthesis or learning by analysis. More recent approaches to machine learning, for instance case-based reasoning, despite their welcome departure from an immutable rule base, still bear the structuralist influence of early AI. My interpretation of 'learning to compute' is that learning emerges alongside changes in the structure of the neural substrate; in neural computing, weight changes do suggest changes in the neural substrate. Work in neurobiology and neuropsychology suggests that areas within the brain become interconnected, in addition to the interconnections within an area. The more important lesson is that some human behaviour can only be explained on the basis of distinct and large areas of the brain interacting in unison (Section 3). Some neural computing systems developed by my students over the last decade, in an attempt to train not only individual networks but also networks that learn to behave cooperatively, are discussed in Section 4; Section 5 comprises an afterword. Before we embark on our discussion it may be useful to brush up on the techniques used to observe the brain 'in action' and to elaborate on the terminology used in neuroanatomy.
1.1 Seeing the Brain in Vivo
The human brain specifically, and the animal brain generally, is a complex whole which interacts with its immediate environment, reacting here and, through its muscular appendages, changing the environment there. The brain at work is a difficult place to observe. It is not possible to measure the activity of single neurons or of populations of neurons directly, except during surgery. One indirect measure of neuronal activity is the observation that active areas of the brain invariably have increased blood flow compared with when these areas are not active. Positron emission tomography (PET) uses radio-labelled positron-emitting isotopes, and areas of the brain that are active for durations of 1-1000 seconds, localised to within 10-100 mm, can be 'seen' by a PET scanner. Functional magnetic resonance imaging (fMRI) is used to visualise neuronal activity by observing how changing flow rates affect the magnetic signals to which MRI images are sensitive. Many researchers now confidently report neuroanatomical correlates of perceptual and cognitive behaviour, and there is speculation about the evolution of the cortical areas of the brain as well. These indirect observations have to be interpreted carefully due to the 'great variability' in brain anatomy between individuals, which poses technical and conceptual problems as to the specific location of an activation in the brain. These problems are exacerbated because the 'positions of these [functional] areas [of the brain] are not well predicted by anatomical landmarks in many regions' [8].
1.2 300 Words on Neuroanatomy
Our physical and mental environment appears 'seamless' despite the seemingly independent modalities of vision, speech, hearing, olfaction and touch, amongst others. These modalities interact and intermingle to manifest as unities at a certain level of description: these unities include objects, events, concepts and emotions. The unities are dynamic in that they move, appear and disappear. The system that deals with each of these apparently independent modalities, the human nervous system, does appear to comprise many interacting and diffusely delineated parts. The human brain, part of the central nervous system, is not a uniform mass and comprises two interconnected hemispheres, left and right. The cerebral cortex, the outermost layer of grey matter of the cerebral hemispheres, is critically involved in how we not only use the cortex, and other areas of the nervous system, to understand, exploit and sustain our environment, but learn to do so as well. This interplay of sensing and learning to make sense of the environment is most obvious in infancy and when humans encounter novelties in their environment. Each sensory modality, and the associated motor movements, appears to be localised in one or more of the four overlapping lobes of the brain: the frontal lobe, everything in front of the so-called central fissure of the brain; the parietal lobe, which is caudal to (towards the tail of) the frontal lobe; the temporal lobe, jutting forward from the base of the brain and ventral to (at the base of) the frontal and parietal lobes; and the occipital lobe, which lies at the very back and is caudal to the parietal and temporal lobes. Each of the lobes is further subdivided into posterior (back) and anterior (front), or medial and lateral, parts. The cerebral cortex is convoluted, and as much as two-thirds of the cortex is 'hidden' in small grooves (sulci, singular sulcus), large grooves (fissures), and gyri (singular gyrus), the bulges between adjacent sulci or fissures. For instance, the fissure between the temporal and frontal lobes is called the Sylvian fissure.
[Figure 1 near here: a lateral view of the cerebral cortex showing the frontal, parietal, temporal and occipital lobes, the primary and association motor, somatosensory, auditory and visual cortices, and numbered Brodmann areas.]
Figure 1. An approximate lateral view of the various regions of the cerebral cortex. The numbered boxes refer to Brodmann's cyto-architectural areas, originally discussed in the early 20th century. (P. refers to Primary and Ass. to Association.) More accurate views of the cerebral cortex are available at http://www.driesen.com/.
2 Neural Correlates of Behaviour
The literature on physiological psychology suggests that the left hemisphere is involved in 'controlling serial behaviors', including talking, understanding the speech of others, reading and writing. The right hemisphere appears to be more active when humans engage, in laboratory conditions, in 'synthesis' – drawing sketches, reading maps, making wholes from parts – and it is also involved in 'the expression of and recognition of emotion in the tone of voice' [11, 29]. Right-hemisphere lesions to the parietal cortex are possibly responsible for deficits in spatial attention – attention being the cognitive faculty which enables humans to focus on certain features of the environment to the (relative) exclusion of others [16].
Each primary sensory area of the cerebral cortex sends information to adjacent regions, called the sensory association cortex (also known as memory cortex): all motor areas have some sensory input. Association cortices, parts of the cerebral cortex that do not have a primary motor or sensory role, are involved in the so-called 'higher order processing of sensory information'. There are three major cortices of interest: the posterior parietal cortex (PPC), the prefrontal cortex, and the temporal cortex. The temporal cortex has the visual and auditory association cortices; the prefrontal cortex comprises 'motor memories [specifically] memories of muscular movement that are needed to articulate words' [11] and is supposed 'to respond to complex sensory stimuli of behavioural relevance' [6]; and the PPC comprises areas involved in combining information on personal (somatic) awareness – where am I? – with information on extra-personal (visual) space.
2.1 Cortical Correlates of Perception
The visual cortex has at least 25 different subregions arranged in a hierarchical fashion, processing everything from colour (human extrastriate cortex) to object perception (ventral stream), and from the movement, size and orientation of objects (inferior temporal cortex) to movement perception and object location (posterior parietal cortex). Various left-hemisphere (perisylvian) areas – temporal, parietal and frontal – have been identified as being involved with the 'complexities of linguistic processing, ranging from semantic to syntactic, morphological, phonological and phonetic analysis and synthesis' [20]. The 'language areas' of the brain were the first to be identified when the doctrine of cerebral localization was launched in the 19th century: Broca's area, or the 'motor speech area', comprising portions of the inferior frontal gyrus, for the coordinated action of the diverse muscles used in speech; and the 'sensory or ideational speech area' – distributed over the left temporal lobe and inferior parietal lobe – responsible for the 'comprehension of language, naming of objects' [7], amongst others. Wernicke's area, in the parietal lobe, contains 'the auditory entries of words; the meanings are contained as memories in the sensory association areas' [11]. Cerebral localization data from 58 word production experiments – involving measurements of activation levels in parts of the brain to within 10-100 mm, performed with neuroimaging techniques as diverse as PET, EEG and fMRI – suggest that there are 28 major regions, mainly in the cerebral cortex, which are highly activated; a finer-grained analysis suggests that there may be as many as 104 regions involved in language production [18]. The experiments were performed by different research teams but focussed mainly on 7 word production tasks, including picture naming and word generation. These tasks involve 'conceptual preparation, lexical selection, phonological code retrieval and encoding': the first is completed within about 150 ms, the second in the next 125 ms, and the last two within a further 125 ms, all told. There are identifiable regions that are activated in word generation but not in picture naming, and vice versa. Lesion studies indicate that when local specialist areas in the brain suffer damage, the brain cannot perform one or more functions that it can normally perform: damage to various parts of the parietal and temporal lobes results in the loss of naming capability, and damage to the frontal cortex rostral to (in front of) and at the base of the primary motor cortex leads to non-fluent speech. (Section 4 briefly describes two multi-net computing systems that attempt to simulate language development and language deficit respectively.) Vision and speech processing are good examples of unimodal processing: a single modality is processed across a network, comprising units at different locations in the brain, with each component more or less specialised in processing that modality. The output of one unit becomes the input of the next, and the process continues until manifestations of cognitive behaviour are finally produced: identification of an object in an image, understanding of a word or phrase, articulation of linguistic output in response to a linguistic stimulus.
2.2 Cortical Processing and Cortical Development
The above discussion is implicitly based on the behaviour of an adult human. It is reasonable to expect that humans have learnt how to react to the various modalities. Neonates, infants and young children, almost irrespective of their intelligence level, appear to learn to react to their external environment at a considerable pace whilst their nervous system is rapidly evolving as well [5]. During the first 12-38 weeks of antenatal life the structure of the central nervous system takes shape, accompanied by neuronal differentiation, through axonal growth and dendritic 'ontogeny', and by the development of synapses ('synaptogenesis') and the encasing of the synapses in the myelin sheath (myelinogenesis). Even during the genesis of the antenatal nervous system the infant responds to stimuli, and when born appears to have capabilities as diverse as being able to enumerate visually, and so appears to possess a substrate for language as well. The first year of post-natal life involves the establishment of connectivity in the infant's brain together with gliogenesis – the encasing of the brain in white matter. The child's gaze stabilizes early in postnatal development, and the child goes from the 'one-word' language stage to two words during the first 18-24 months of life.
2.3 Modality and Neuronal Correlation
The child can integrate different modalities. Lewkowicz and Turkewitz [19] report an experiment involving 32 infants between the 'ages' of 11 hours and 48 hours, in which the infants' visual preferences for 'light patches of different intensity' were examined with and without an auditory stimulus (white noise from a noise generator). The authors conclude that 'visual preference in human newborns can be modified by immediately preceding exposure to sound' [19]. Such inter-sensory interaction between auditory and visual stimulation can be observed when we watch television or films, where presenters and actors appear to be talking through their lips although the sound is actually emitted from a loudspeaker. Ventriloquism likewise depends on the dominance of the visual stimulus over the auditory stimulus. This has led to the claim that 'illusions remind us that our visual experience is not a measurement of the physical stimuli, but rather a neural computation' [31]. There are, then, degrees of involvement of other modalities: naming an object in a visual scene, for example, requires two modalities, vision and speech, although one can segregate the two by arguing that after the initial visual stimulus speech takes over. However, attention and numerosity studies indicate that certain behaviour can only be explained through multi-sensory integration. This integration has been defined as a statistically significant difference between a neuron's response to a combination of stimuli and its response to the individual component stimuli [21]. There is some evidence, based on experiments on cats, that certain areas of the cat's nervous system comprise unimodal neurons until at least 12 days after birth, but that these neurons then develop the capability for integrating multi-sensory information [30]. It has been suggested that there are 'many areas in the mammalian brain […] where the different sensory streams converge onto individual neurons responsive to stimulation in more than one modality' [10]; such heteromodal neurons have been found in the prefrontal cortex, the posterior parietal cortex, the lateral temporal cortex and the mid-brain (specifically in the superior colliculus – the site where Meredith and Stein [21] found multi-sensory integration in cats).
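To make the integration criterion concrete, the sketch below computes a simple multisensory enhancement index – a formulation in the spirit of Meredith and Stein's measure, comparing a model neuron's response to a combined stimulus against its best unimodal response. The spike counts and helper names are illustrative assumptions, not data from [21].

    import statistics

    def enhancement_index(combined_rate: float, best_unimodal_rate: float) -> float:
        """Percentage change of the multisensory response over the best
        unimodal response; positive values indicate enhancement."""
        return 100.0 * (combined_rate - best_unimodal_rate) / best_unimodal_rate

    # Illustrative spike counts (spikes per trial) for a hypothetical neuron.
    visual_only = [3, 4, 2, 3, 4]
    auditory_only = [2, 2, 3, 1, 2]
    combined = [8, 7, 9, 6, 8]

    best_unimodal = max(statistics.mean(visual_only), statistics.mean(auditory_only))
    print(enhancement_index(statistics.mean(combined), best_unimodal))
    # A combined response well above the best unimodal response (here roughly
    # +137%) is the signature of multi-sensory integration.

In the actual studies the difference between the combined and unimodal responses is, of course, also tested for statistical significance across trials.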
3 Learning to Compute: Cross-Modal Interaction and 'Numerical' Neurons
There are two developments which indicate that certain 'higher level' cognitive tasks can be understood in terms of interaction between different sensory modalities (and motor movement). The first development relates to the orientation of human spatial attention – our ability to 'focus selectively on sensory information from particular locations in external space' either voluntarily (endogenous attention) or 'reflexively', by salient events (exogenous attention). In a number of neuropsychological experiments it has been found that deficits in spatial attention correlate strongly with lesions to the parietal cortex in the right hemisphere and with lesions to the frontal cortex. Amongst many like phenomena, visual illusions, actors speaking on a film or TV screen, or a ventriloquist's dummy talking indicate that human attention has to be co-ordinated cross-modally. This makes it possible to 'select information from a common external source across several modalities despite the differences' in the initial coding of the source in each modality [17]. Frequently, one or more senses substitute for the (temporary) loss of one of the senses: the heightened textural and spatial awareness that results when looking for a light source in a darkened room is a good example of such substitution. This cross-cueing has been exploited in the multi-net simulation reported in Section 4.1, where texts help in the retrieval of images. The second development relates to topics variously labelled numerosity, numerons, single-neuron arithmetic and number sense in humans and some primates. The intuitive argument here is that judgements related to quantities – an animal guessing how many are in a herd or the extent of a 'foraging area', or infants and monkeys making 'accurate' judgements about whether two quantities are equal to each other or very different, irrespective of physical attributes – must have a neuronal correlate. Observations of enumeration without having been taught a number system (subitisation, or visual enumeration), and of approximate calculation without rigorously carrying out arithmetic procedures, lead to the speculation that there may be areas in the brain where visuo-spatial information about objects is processed in such a way that number information is preserved [15]. The development of numerosity has been simulated using a multi-net computing system described later in this paper (Section 4.2). We now discuss the two developments in turn.
3.1 Cross-Modal Interaction and Spatial Attention
The key to spatial attention is that different stimuli, visual and auditory, help to identify the spatial location of the object generating the stimuli. One argument is that there may be a neuronal correlate of such cross-modal interaction between two stimuli. Information related to locating the stimulus (where) and identifying the stimulus (what) appears to have correlates at the neuronal level in the so-called dorsal and ventral streams in the brain. In a number of neuroimaging studies of audio-visual speech perception, researchers have attempted to identify 'putative' neuroanatomical sites where multimodal integration actually takes place [10] – these studies were inspired, in part, by the earlier work on cats [21, 22]. Two experiments are of note here: one dealing with subjects' mouth movements whilst looking at a videotape of the lower half of a face silently mouthing numbers (silent lip-reading), and the second with subjects listening to numbers being spoken with the videotape switched off (auditory speech perception). The neuroimages of the two experiments were compared and contrasted. The areas activated in both experiments include the primary auditory cortex and the auditory association cortex. The putative heteromodal cortex that integrates the two stimuli straddles the auditory and visual association cortices and appears to include Wernicke's ideational language area (Brodmann areas 37, 39, 40, 21/22), lying in the region 'proximal to the superior temporal sulcus (STS)'. In the silent lip-reading test the visual stimulus provided to the brain appears to generate auditory cues from the auditory association cortex, and the convergence takes place in the STS. Calvert and colleagues claim that 'activation of the primary auditory cortex by visible speech cues might proceed via back projections from heteromodal cortex' [10]. The authors point to other regions of the brain as well, including the prefrontal cortex, the posterior parietal cortex, and possibly the midbrain region of the superior colliculus. There is a concomitant claim that there may exist multimodal neurons active in 'many areas in the mammalian brain […] where the different sensory streams converge onto individual neurons responsive to stimulation in more than one modality' [10]. One related cross-modal phenomenon is that of synesthesia: a condition in which a sensory experience normally associated with one modality occurs when another modality is stimulated. This is a (congenital) condition in which synesthetic humans recall colour when shown letters or numbers, for example. Such cross-modal behaviour may be attributed to cross-wiring of the otherwise unimodal brain or, more specifically, of cortical regions. Grapheme-colour synesthesia (involuntary cross-activation) has been explained by arguing that the colour areas of the brain are in the fusiform gyrus and that the visual grapheme area is also in the fusiform gyrus, especially in the left hemisphere [27].
3.2 Numerosity, 'Numerons' and Number Sense
The observation that neonates and monkeys have a number sense and other mathematical skills, like estimation and trajectory computation, without being formally educated in arithmetic or any other branch of mathematics, has given rise to significant interest in this area. Neuroimaging techniques have contributed significantly here, and some researchers even claim to have found neural correlates of our ontogenetic numeracy. Furthermore, educational psychologists have observed that numeracy and related skills can be acquired through training, and that there are stages in which the skills are acquired – there is an evolutionary process involved here, for neonates and primates only acquire numeracy when they come into contact with their physical environment. Whether ontogenetic or evolutionary, our number sense involves 'the coding and internal manipulation of an abstract semantic content, the meaning of number words' [26]. Information about putative neural correlates of numerosity has come from the so-called lesion studies. Studies of brain-damaged individuals showing mathematical deficits in their behaviour, when compared with their intact counterparts, suggest that lesions to specific regions of the parietal and temporal cortices may be the reason for the deficits. The parietal cortex appears to play a critical role in the representation of magnitudes, and the temporal cortex is involved in the representation of the visual form of numbers [9, 13]. Number sense has played a major role in psychology, where many earlier studies were dedicated to 'the mathematical description of how a continuum of sensation, such as loudness or duration' is represented in the brain/mind. The 19th-century psychophysicist Gustav Fechner had observed that 'the intensity of subjective sensation increases as the logarithm of the stimulus intensity'. One 21st-century rendition of this 'law' is that the 'external stimulus is scaled into a logarithmic internal representation of sensation' [15]. Number-related behaviours 'depend on the capacity to abstract information from sensory inputs and to retain it in memory'; in monkeys this capacity resides in the 'prefrontal cortex' [24], and there are reports of activation in proximate regions of the human brain. As predicted by Fechner, there is a compressed scaling of numerical information, and this information is stored in the prefrontal cortex of the monkey [24] and the parietal cortex of the human [26]. Nieder et al. report that over a third of 352 randomly selected neurons from the lateral prefrontal cortex of two monkeys 'showed activity that varied significantly with the number of items in the sample display' [24]: this suggests that certain neurons specialise as 'number detectors' – the elusive numerons have perhaps been found.
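As an illustration of such compressed scaling, the fragment below places numerosities on a logarithmic 'number line' and shows that the distance between adjacent numbers shrinks as the numbers grow – a minimal sketch of the Weber-Fechner relation discussed above, with the scaling constant chosen arbitrarily.

    import math

    def internal_position(n: int, k: float = 1.0) -> float:
        """Fechner-style coding: subjective magnitude grows with the
        logarithm of the objective numerosity n."""
        return k * math.log(n)

    for n in range(1, 6):
        print(n, round(internal_position(n), 3))
    # Successive distances |pos(n+1) - pos(n)| shrink (0.693, 0.405, 0.288, 0.223):
    # small numerosities are well separated and easy to discriminate, while
    # larger ones crowd together on the internal number line.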
The compressed number line theory can be used to explain the observation that neonates and monkeys, and adults in a hurry, can accurately enumerate quantities of less than 5 without recourse to overt counting. Higher numbers cannot be enumerated with any accuracy through visual enumeration or subitisation, and even within the numbers 1-5 there is a diminution in accuracy as we approach the higher numbers. Subitisation is sometimes related to the existence of 'preverbal numerical abilities' [34]. Recent findings about approximate calculations performed by healthy volunteers show activity in the bilateral inferior parietal cortex, in the prefrontal cortex and in the cerebellum [14]; perhaps the preverbal numerical abilities are localised in these areas. These cortices were less active when the same volunteers carried out tasks in which they were asked to perform exact arithmetic calculations. During exact arithmetic the left inferior parietal cortex was highly activated, together with the left angular gyrus. The cortical regions active in approximate arithmetic 'fall outside of traditional perisylvian language areas and are involved in various visuo-spatial and analogical mental transformations' (ibid:971). Exact calculations, it appears, depend on language-based representations. Dehaene et al. [14] recall from previous lesion studies that lesions to the left parietal area result in the loss of the sense of numerical quantity but preserve rote, language-based arithmetic; conversely, damage to the left hemisphere resulted in the loss of language abilities while the sense of numerical quantity was unimpaired. Contrasts in response times to stimulation by numbers presented verbally and in Arabic notation show clear differences: the response to visual stimulation is on average 100-200 ms faster than to verbal stimulation [26]. Studies involving fMRI and event-related potential measurements show indications of localisation: higher activations are reported in the left and right parietal and prefrontal cortex for visual stimulation, while for verbal stimulation the left and right occipital regions show greater activation (ibid:1020). Simon et al. [28] have conducted fMRI experiments to examine the organisation of the parietal cortex by looking at fMRI images of subjects performing six tasks: pointing, grasping, attention, saccades, calculation and phoneme detection. They found that 'number processing tasks call for generic processes of attention orientation and spatial manipulation that operate upon a "spatial map" of numerical quantities in the middle IPS [intraparietal sulcus]' [28]; the IPS shows a 'fine grained anatomical specialisation' which may include a region for the manipulation of numerical quantities.
4 Computing to Learn: Co-operating Neural Networks and Competing (Inhibiting) Neurons
In this section I present multi-net neural computing systems that attempt to simulate aspects of behaviour which invariably – and, from what we have discussed above, inevitably – involve interaction between two or more modalities (see Figure 2).
[Figure 2 schematic: Unimodal Input1 feeds Unimodal Net1, and Unimodal Input2 feeds Unimodal Net2; the two nets are linked by a cross-modal net, a bi-directional Hebbian network.]
Figure 2: An architecture for learning behaviour encoded in two modalities, and for learning, cross-modally, the relationships between the two unimodal representations.
Our early work during the 1990s focussed on language processing, specifically child language development and language disorder. Both of these systems have individually trained networks for representing concepts and the linguistic description of concepts, cross-linked through a Hebbian network. The language development system learnt to simulate how young children (18-24 months) produce one-word utterances in response to audio-visual cues, and how these children learn the word order of their language based on inputs from their adult caretakers. The word-order learning system learnt concepts and words using two independent SOFMs, cross-linked by a Hebbian network, together with an additive Grossberg network that related semantic relations (for example, agents to possessions, objects to containers, objects to spatial locations). A back-propagation network was used to teach the 'child' the two-word organisation found in adult language. What the network produced was a cross-modal output (concept-word) that guided its production of candidate two-word collocates. The system's learning outcomes were compared with standard child language productions reported in the literature, and there was good agreement between the observations and our system [1]. The language disorder system was similarly constructed, with a conceptual and a word lexicon using SOFMs, and the system learnt to cross-link concepts and words through a Hebbian network. The system was trained on normal associations between words and concepts. The conceptual SOFM was then systematically ablated, and the cross-modal output showed increasing semantic errors – the so-called naming errors found in a certain group of aphasics. The results of the language disorder multi-net were in good agreement with findings about aphasics in the literature [33]. In this paper we report on our more recent work: (i) a multi-net content-based image retrieval system that can store images with their collateral textual descriptions and can learn to retrieve images by their collateral linguistic features, even images for which there is no collateral text (Section 4.1); the link between image and linguistic features was established through a Hebbian network; and (ii) a multi-net system that can learn to subitise, and another that can learn to count (Section 4.2); both of these systems rely on a Hebbian connection that is learnt whilst two individual networks each learn a single modality – quantities in a visual scene and the verbal representation of the quantities. The co-operative multi-net architecture we report is essentially an extension of the original idea of Willshaw and von der Malsburg [32], in which two layers of neurons were connected via Hebbian synapses; Hebbian synapses have the useful properties of local modification and time dependence, and they correlate pre- and post-synaptic signals. Kohonen's self-organising feature map is also based on a principle similar to that of Willshaw and von der Malsburg. The links are initially weighted at zero or set at random, and as training proceeds the connection strengths are changed. Once the correlation is established between pre- and post-synaptic signals – two uni-modal inputs in our case – we can use one modality as a cue for the other. In all the systems we have developed, our emphasis has been on the use of unsupervised learning algorithms, except where it was necessary to use supervised learning algorithms. This preferential use of unsupervised learning is based on the view that ontogenesis plays a key role in learning, and that occasions where environmental input acts as a teacher are rather limited, though nonetheless important. We use self-organising feature maps (SOFMs), due to Teuvo Kohonen, that transform a continuous n-dimensional input signal onto a one- or two-dimensional discrete space of 'neurons'. A discriminant function relates the weight vectors that connect the input layer to the output layer: the neuron whose weight vector is closest to the input is regarded as the 'winner', and selected neighbourhood neurons in the output layer form a halo and are activated when the winner is activated.
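A minimal sketch of this co-operative arrangement is given below: two toy SOFMs, each trained on one modality, with a Hebbian matrix that strengthens links between simultaneously winning neurons. The map sizes, learning rates and synthetic paired inputs are illustrative assumptions, not the parameters of the systems reported here.

    import numpy as np

    rng = np.random.default_rng(0)

    class ToySOFM:
        """One-dimensional self-organising feature map (Kohonen)."""
        def __init__(self, n_units: int, dim: int):
            self.w = rng.random((n_units, dim))

        def winner(self, x: np.ndarray) -> int:
            # Best-matching unit: the weight vector closest to the input.
            return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

        def train_step(self, x: np.ndarray, lr: float = 0.3, radius: float = 1.0) -> int:
            win = self.winner(x)
            for j in range(len(self.w)):
                # Neighbourhood 'halo' around the winner.
                h = np.exp(-((j - win) ** 2) / (2 * radius ** 2))
                self.w[j] += lr * h * (x - self.w[j])
            return win

    # Two unimodal maps and a Hebbian cross-modal matrix linking them.
    net1, net2 = ToySOFM(10, 4), ToySOFM(10, 3)
    hebb = np.zeros((10, 10))

    for _ in range(200):
        x1, x2 = rng.random(4), rng.random(3)   # paired unimodal inputs
        w1 = net1.train_step(x1)
        w2 = net2.train_step(x2)
        hebb[w1, w2] += 1.0                      # correlate co-active winners

    # Cross-modal cueing: given an input in modality 1, retrieve the most
    # strongly associated neuron in modality 2 (and vice versa via hebb.T).
    cue = rng.random(4)
    print(np.argmax(hebb[net1.winner(cue)]))

The point of the sketch is the division of labour: each map organises its own modality unsupervised, and only the Hebbian matrix carries the cross-modal knowledge.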
4.1 Collateral Image and Text System
Images have traditionally been indexed with short texts describing the objects within the image. In some cases it takes a specialist to go, literally, behind the image and discover objects not apparent to a layperson: radiologists, forensic professionals and good art critics are amongst those specialists. The specialists also excel because they are succinct in their descriptions – sometimes this succinctness is a gift, at other times it is acquired through experience and training. The accompanying text is sometimes described as collateral to the image. The ability to use collateral texts for building computer-based image retrieval systems will help in dealing with the image collections that can now be stored digitally. Theoretically, the manner in which we grasp the relationship between the 'features' of the image and the 'features' of the collateral text relates back to cross-modality. The use of a multi-net system comprising independent yet interacting neural networks also contributes to the debate on multiple classifier systems. We have developed a multi-net system that learns to classify images within an image collection, where each image has a collateral text, based on the common visual features and on the verbal features of the collateral text. The multi-net can also learn to correlate images and their collateral texts using Hebbian links – this means that one image may be associated with more than one collateral text, and vice versa [3, 4]. The details of the system are given below:
Module                                    Network Architecture    Network Topology
Image Feature Representation              Kohonen SOFM            15×15
Collateral Text Feature Representation    Kohonen SOFM            15×15
Image-Text Cross Modality                 Hebbian Links           15×15
We have had access to 66 scene-of-crime images used in training scene-of-crime officers: 58 of these images were used for training and 8 for testing purposes. Image features were represented on a 112-dimensional vector – 21 dimensions for colour distribution, 19 for shape and 72 for texture – with each of the features extracted automatically; the collateral texts were represented by a 50-dimensional input vector of frequently occurring and semantically relevant keywords, extracted automatically from the collateral texts. Both the image and text vectors were mapped onto the 15×15 output layers of two Kohonen maps. During training, a Hebbian network learnt to associate the most active neurons in the two maps, thereby establishing a degree of cross-modality. Another system was created in which a single SOFM was trained to recognise both the images and the texts, using a combined 162-dimensional (112+50) vector – we call this a monolithic system. The trained systems were then tested on the 8 images and their collateral texts. The multi-net system classified 7 of the 8 images correctly, whereas the monolithic system classified only 3 of the 8 correctly. By correctly we mean that the images fell into clusters, on the output image and text SOFMs, deemed similar by the forensic experts working with us. A closer examination of how the test images and their collateral texts were classified in the image and text SOFMs showed that the image SOFM alone classified 4 out of 8 images correctly, whereas 5 out of 8 collateral texts were in the 'right' clusters. The cross-modal interaction between the two thus improves the classification significantly – crudely, from 4 correct to 7.
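In the same spirit, the retrieval step of the image-text system can be sketched as follows: present the text vector to the text map, follow the strongest Hebbian link to the image map, and return the images whose best-matching unit falls at that location. The vector dimensions match the description above, but the function and variable names are our own illustrative choices, not the implementation of [3, 4].

    import numpy as np

    def retrieve_images(text_vec, text_map, image_map, hebb, image_index):
        """Cue the image map with a collateral-text vector.

        text_map, image_map : trained ToySOFM-like maps (see earlier sketch)
        hebb                : (text units x image units) Hebbian weight matrix
        image_index         : dict mapping image-map unit -> list of image ids
        """
        t_win = text_map.winner(text_vec)       # winner on the text SOFM
        i_unit = int(np.argmax(hebb[t_win]))    # strongest cross-modal link
        return image_index.get(i_unit, [])      # images clustered at that unit

Because the Hebbian matrix is bi-directional, the symmetric routine (using the transpose of hebb) retrieves collateral texts from an image cue, including for images that were stored without any collateral text.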
4.2 Numerosity Development
The Surrey Subitization System was developed to study how an artificial neural network can learn to enumerate approximately, that is, to subitise [2]. The system comprises three interacting modules. First, a mapping module responds to the presence of objects in a visual scene, irrespective of their size and location, and represents that response. Second, a magnitude representation module learns to represent small numerosities as magnitudes along a number line, in accordance with Fechner's law and the so-called distance effect. Third, an output module comprises a word lexicon for representing numbers verbally and a cross-modal module that learns the relationship between the magnitude representation and the verbal representation. The details of the system are shown below:

Module                             Network Architecture    Network Topology
Mapping: Scale Invariance          Second-order            648×72
Mapping: Translation Invariance    Weight sharing          72×15
'Magnitude' Representation         Kohonen SOFM            36×36
Verbal Representation              Kohonen SOFM            16×64
Magnitude-Verbal Cross Modality    Hebbian Links           36×64
The mapping modules transform a simple visual scene onto a simple record of the whole entities in the scene. The magnitude representation module, an SOFM, receives its scale- and translation-invariant input from the mapping module, and this input is mapped onto the output layer of the SOFM. Initially, each magnitude is assigned a random position on the output layer (in one run, for instance, '1' and '4' fell together, as did '2', '3' and '5'). In one example we trained the network to learn the numbers up to 5; after 100 training cycles the output layer appeared 'compliant' with Fechner's Law – the numerosities 1 to 5 were organised, in order, on a compressed number line – and reflected the distance effect: the larger the difference between two numerosities, the further apart they lie on the map.
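The invariance required of the mapping module can be illustrated with a toy stand-in: the function below reduces a binary visual scene to a count-preserving record that is unchanged when the objects are moved or resized. The connected-component flood fill used here is only an illustrative shortcut for what the second-order and weight-sharing networks achieve in the actual system, not the authors' mechanism.

    def count_objects(scene):
        """Count connected blobs of 1s in a binary grid, irrespective of
        their position or size (4-connectivity flood fill)."""
        seen = set()
        def flood(r, c):
            stack = [(r, c)]
            while stack:
                i, j = stack.pop()
                if (i, j) in seen or not (0 <= i < len(scene)) \
                   or not (0 <= j < len(scene[0])) or scene[i][j] == 0:
                    continue
                seen.add((i, j))
                stack.extend([(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)])
        count = 0
        for r, row in enumerate(scene):
            for c, v in enumerate(row):
                if v == 1 and (r, c) not in seen:
                    count += 1
                    flood(r, c)
        return count

    # Two objects, then the same two objects shifted and enlarged: same count.
    a = [[1, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]
    b = [[0, 0, 1, 1],
         [1, 0, 1, 1],
         [1, 0, 0, 0]]
    print(count_objects(a), count_objects(b))  # 2 2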
The results of our simulation compare well with those of [12], with the difference that those authors used a hard-wired network whereas we trained ours. Our subitisation system also performs marginally better than the supervised subitisation system reported in [25]: the supervised system has difficulty representing intermediate numerosities – for example, '4' is not represented well when the system learns the numerosities '1' to '5' – and it depends on the rather arbitrary choice of hidden layers typically used in a supervised network [2]. In another experiment the system was trained to subitise numbers between 1 and 22, except for a randomly selected set of six numbers – 2, 3, 10, 14, 15 and 19 [4]. The magnitude and verbal representations were trained on the 16 numbers in the training set. The trained system was then tested on the six numbers in the test set: the results were encouraging in that each of the numbers was recognised by the neuron in the magnitude representation closest to its value, and the corresponding verbal output was identified as well. In contrast, a monolithic SOFM combining both the magnitude and verbal representations, trained on the same set of 16 numbers, failed to recognise most of the test set. Another advantage of the cross-modal system is that input in one modality (say, magnitude) can cue output in another modality (say, verbal articulation) and vice versa.
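The generalisation test just described – unseen numerosities being captured by the magnitude neuron closest in value – can be sketched as a nearest-prototype lookup on the compressed number line. The 'preferred numerosities' below are invented for illustration and merely echo the held-out design reported in [4].

    import math

    # Hypothetical preferred numerosities of trained magnitude neurons
    # (the six held-out values 2, 3, 10, 14, 15 and 19 are absent).
    trained = [1, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 18, 20, 21, 22]

    def respond(n: int) -> int:
        """Return the trained numerosity whose log-scaled position on the
        compressed number line is closest to the test numerosity n."""
        return min(trained, key=lambda m: abs(math.log(m) - math.log(n)))

    for held_out in (2, 3, 10, 14, 15, 19):
        print(held_out, "->", respond(held_out))
    # Each unseen numerosity is caught by the neuron with the nearest value
    # (e.g. 3 -> 4, 10 -> 11), which can then cue the verbal output through
    # the Hebbian links.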
5 Afterword
Based on observations from the neuroimaging studies, lesion studies and neuropsychological experiments, I can claim, retrospectively, that all the systems reported above have a heteromodal region by way of an established training algorithm: the Hebbian learning algorithm. The strategy my colleagues and I adopted was to simultaneously train two uni-modal neural networks – say, one dealing with learning image features and the other with the linguistic description of a visual scene. During this training, we interposed a Hebbian network that learnt the association between the image features and the linguistic features of the accompanying description. The Hebbian links are the putative cross-modal links, and the Hebbian networks the heteromodal area. It is important that the neural computing community is aware of some fascinating developments in neurobiology and neuropsychology, as such awareness will show the scope and limitations of neural computing systems. There have been words of warning about reading too much into the neuroimages [8], and there have been numerous warnings about reading too much into the results reported by the neural computing community. Yes, each human is unique, and perhaps this is reflected in each human's neuroanatomy; but we all believe, at least most of the time, that an actor is speaking on the TV or cinema screen, and we do use modalities interchangeably whilst looking up objects and artefacts. How do we explain such shared behaviour? Neural networks reported in the literature may be reduced to a collection of switches or to regression analysis systems – both of these statements are true in that the whole purpose of a reductive argument is to do just that. But there is, according to our current scientific thinking and practice, electrical, chemical and metabolic activity observed in the brain; perhaps there is no harm in starting with switches and regression algorithms.
References
[1] Abidi, S.S.R. & Ahmad, K. (1997). Conglomerate Neural Network Architectures: The Way Ahead for Simulating Early Language Development. Journal of Information Systems Engineering, vol. 13(2), pp. 235-266.
[2] Ahmad, K., Casey, M. & Bale, T. (2002). Connectionist Simulation of Quantification Skills. Connection Science, vol. 14(3), pp. 165-201.
[3] Ahmad, K., Vrusias, B. & Tariq, M. (2002). Cooperative Neural Networks and 'Integrated' Classification. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 2, pp. 1546-1551.
[4] Ahmad, K., Casey, M., Vrusias, B. & Saragiotis, P. (2003). Combining Multiple Modes of Information using Unsupervised Neural Classifiers. Proc. MCS 03, LNCS 2709. Heidelberg: Springer-Verlag.
[5] Anderson, V., Northam, E., Hendy, J. & Wrennall, J. (2001). Developmental Neuropsychology: A Clinical Approach. Hove: Psychology Press. pp. 5 & 251.
[6] Barker, R.A. & Barasi, S. (1999). Neuroscience at a Glance. Oxford: Blackwell Science.
[7] Barr, M.L. & Kiernan, J.A. (1998). The Human Nervous System: An Anatomical Point of View (5th Edition). Philadelphia: J.B. Lippincott Company.
[8] Brett, M., Johnsrude, I.S. & Owen, A.M. (2002). The problem of functional localization in the human brain. Nature Reviews Neuroscience, vol. 3 (March 2002), pp. 243-249.
[9] Butterworth, B. (1999). The Mathematical Brain. London: Macmillan.
[10] Calvert, G.A., Brammer, M.J. & Iversen, S.D. (1998). Crossmodal identification. Trends in Cognitive Sciences, vol. 2(7), pp. 247-253.
[11] Carlson, N.R. (1999). Foundations of Physiological Psychology. Boston: Allyn and Bacon.
[12] Dehaene, S. & Changeux, J.P. (1993). Development of Elementary Numerical Abilities: A Neuronal Model. Journal of Cognitive Neuroscience, vol. 5(4), pp. 390-407.
[13] Dehaene, S. (2000). The Cognitive Neuroscience of Numeracy: Exploring the Cerebral Substrate, the Development, and the Pathologies of Number Sense. In Fitzpatrick, S.M. & Bruer, J.T. (Eds), Carving Our Destiny: Scientific Research Faces a New Millennium, pp. 41-76. Washington: Joseph Henry Press.
[14] Dehaene, S., Spelke, E., Pinel, P., Stanescu, R. & Tsivkin, S. (1999). Sources of mathematical thinking: behavioral and brain-imaging evidence. Science, vol. 284(5416), pp. 970-974.
[15] Dehaene, S. (2003). The neural basis of the Weber-Fechner law: a logarithmic mental number line. Trends in Cognitive Sciences, vol. 7(4), pp. 145-157.
[16] Driver, J. & Spence, C. (2000). Multisensory perception: beyond modularity and convergence. Current Biology, vol. 10, pp. R731-R735.
[17] Driver, J. & Spence, C. (1998). Attention and the cross-modal construction of space. Trends in Cognitive Sciences, vol. 2, pp. 254-262.
[18] Indefrey, P. & Levelt, W.J.M. (2000). The Neural Correlates of Language Production. In Gazzaniga, M.S. (Ed.), The New Cognitive Neurosciences, pp. 845-865.
[19] Lewkowicz, D.J. & Turkewitz, G. (1981). Intersensory Interactions in Newborns: Modification of Visual Preferences following Exposure to Sound. Child Development, vol. 52, pp. 827-832.
[20] Levelt, W.J.M. (2000). Introduction (to Chapters 56-65 on Language). In Gazzaniga, M.S. (Ed.), The New Cognitive Neurosciences, pp. 843-844.
[21] Meredith, M.A. & Stein, B.E. (1983). Interactions among converging sensory inputs in the superior colliculus. Science, vol. 221, pp. 389-391.
[22] Meredith, M.A. & Stein, B.E. (1993). The Merging of the Senses. Boston & London: The MIT Press.
[23] Nieder, A. & Miller, E.K. (2003). Coding of Cognitive Magnitude: Compressed Scaling of Numerical Information in the Primate Prefrontal Cortex. Neuron, vol. 37 (Jan 9, 2003), pp. 149-157.
[24] Nieder, A., Freedman, D.J. & Miller, E.K. (2002). Representation of the Quantity of Visual Items in the Primate Prefrontal Cortex. Science, vol. 297 (6 Sept. 2002), pp. 1708-1711.
[25] Peterson, S.A. & Simon, T.J. (2000). Computational Evidence for the Subitizing Phenomenon as an Emergent Property of the Human Cognitive Architecture. Cognitive Science, vol. 24(1), pp. 93-122.
[26] Pinel, P., Dehaene, S., Riviere, D. & LeBihan, D. (2001). Modulation of parietal activation by semantic distance in a number comparison task. NeuroImage, vol. 14(5), pp. 1013-1026.
[27] Ramachandran, V.S. & Hubbard, E.M. (2001). Synaesthesia – A Window into Perception, Thought and Language. Journal of Consciousness Studies, vol. 8(12), pp. 3-34.
[28] Simon, O., Mangin, J.F., Cohen, L., Le Bihan, D. & Dehaene, S. (2002). Topographical layout of hand, eye, calculation, and language-related areas in the human parietal lobe. Neuron, vol. 33 (Jan 31, 2002), pp. 475-487.
[29] Springer, S.P. & Deutsch, G. (1998). Left Brain, Right Brain: Perspectives from Cognitive Neuroscience (5th Edition). New York: W.H. Freeman and Company.
[30] Stein, B.E., Wallace, M.T. & Stanford, T.R. (2000). Merging sensory signals in the brain: The development of multisensory integration in the superior colliculus. In Gazzaniga, M.S. (Ed.), The New Cognitive Neurosciences, pp. 55-71.
[31] Wandell, B.A. (2000). Computational Neuroimaging: Color Representations and Processing. In Gazzaniga, M.S. (Ed.), The New Cognitive Neurosciences, pp. 291-303.
[32] Willshaw, D.J. & von der Malsburg, C. (1976). How Patterned Neural Connections can be set up by Self-Organization. Proceedings of the Royal Society of London, Series B, vol. 194, pp. 431-445.
[33] Wright, J.F. & Ahmad, K. (1997). The Connectionist Simulation of Aphasic Naming. Brain and Language, vol. 59, pp. 367-389.
[34] Wynn, K., Bloom, P. & Chiang, W.-C. (2002). Enumeration of collective entities by 5-month-old infants. Cognition, vol. 83, pp. B55-B62.
Acknowledgements
This work has been supported in part by the UK Engineering and Physical Sciences Research Council (Scene of Crime Information Systems Project, Grant No. GR/M89041) and by 5 UK Police Forces (Surrey, Metropolitan, Kent, Hampshire and South Yorkshire). The author is grateful to colleagues Matthew Casey and Bogdan Vrusias for stimulating discussions, and to Mariam Tariq and Matthew for helping me to proofread this paper. Errors and omissions are entirely mine.