IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 3, MAY 2004
Classification Capacity of a Modular Neural Network Implementing Neurally Inspired Architecture and Training Rules
Panayiota Poirazi, Costas Neocleous, Constantinos S. Pattichis, Senior Member, IEEE, and Christos N. Schizas, Senior Member, IEEE
Abstract—A three-layer neural network (NN) with a novel adaptive architecture has been developed. The hidden layer of the network consists of slabs of single neuron models, where neurons within a slab—but not between slabs—have the same type of activation function. The network activation functions in all three layers have adaptable parameters. The network was trained using a biologically inspired, guided-annealing learning rule on a variety of medical data. Good training/testing classification performance was obtained on all data sets tested. The performance achieved was comparable to that of SVM classifiers. It was shown that the adaptive network architecture, inspired by the modular organization often encountered in the mammalian cerebral cortex, can benefit classification performance.
Index Terms—Dissimilar neuron models, enhanced guided-annealing learning rule, medical data, neural networks (NNs).
I. INTRODUCTION
The ability of the brain to learn, store, and retrieve huge amounts of information has been a source of inspiration for the connectionist community for many decades. Although the mechanisms of neural learning are still under investigation, part of this learning capacity could be attributed to very efficient learning rules implemented by the brain and part to the elegant structural organization of neural tissue. It has been suggested that neural structures have evolved to maximize rapid and efficient learning, while at the same time enabling a system to better generalize its learned behavior to new instances [16]. In brain areas with high connectivity and processing power, such as the ones involved in vision, this organization takes the form of well-defined columnar modules, to which the high information processing capacity of these regions could be partially attributed. Cortical columns are vertically arrayed modules in which neurons share a common physiology. Physiological properties
change relatively sharply at the boundary between adjacent columns. There are at least two intensively studied columnar systems in the mammalian brain: the visual cortex of cats and primates, and the barrel cortex of rodents. In 1975, Mountcastle [31] and coworkers discovered a tendency for somatosensory neurons that respond to skin stimulation (hair, light touch) to alternate with those specializing in joint and muscle receptors about every 0.5 mm. It was shown later that there is a similar mosaic organization within each macrocolumn, where adjacent neurons have receptive fields optimized for the same patch of body surface [10], [11]. Parallel work by Hubel and Wiesel in the 1970s in the monkey visual cortex revealed curtain-like clusters ("ocular dominance columns") specializing in the left eye, with an adjacent cluster about 0.4 mm away specializing in the right eye. The authors also described cells that responded to the image of oriented bars and edges and found that neurons selective to a continuous range of different orientations were grouped together in a smaller cortical column [24], termed the orientation column [51]. There are many such minicolumns, separated by cell-sparse gaps and each specializing in a different angle, within an ocular dominance macrocolumn [24]. Such minicolumns, found in various neural systems, have been proposed to serve as the basic functional modular unit of the cerebral cortex [8], [31], [47] by cooperating in the execution of cortical functions [2]. A more suitable candidate for a true module, however, was the "hypercolumn" [24]: two adjacent ocular dominance columns, each containing a complete set of orientation columns, suggesting similar internal wiring whatever the patch of visual field being represented. Given the topographic and functional specialization of neuronal elements in the aforementioned brain regions, one wonders what the role of such an architecture is. Why does the brain evolve into these functionally distinct modules? Is it possible that such an architecture maximizes the learning capacity of these systems while at the same time minimizing the time and energy expenditure required for memory storage and retrieval? It has been suggested that the modular organization of the brain might form the necessary neural substrate for decomposing tasks and processing subtasks independently, as well as allow for specific interactions between these subprocesses [16]. If we were to accept such a hypothesis, then one should explore how the various cell types differ in the form of their inputs, the patterning of their outputs, and the transformations they perform from one column to another; how the diversity of
cell types and synapses in the cortex contributes to computation within a column; and what learning rules facilitate the information capacity of such a modular neural structure.
In addition to the neural architecture, which may play a crucial role in information processing in these brain regions, the learning rules employed by such powerful systems should be of at least equal importance. Perhaps the most popular learning theory to this day was expressed by Donald Hebb in the 1940s [18]: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." In other words, Hebb suggested that information is stored in the strength of synaptic connections between neurons, which are modified based on locally available signals. Recent findings, however, suggest that synaptic strength modifications could also be partially stochastic and partially regulated by feedback messages sent from the cell body to the modifiable sites [44], [46]. Previous computational work has shown that such a complex learning mechanism could significantly boost the memory capacity of a biologically inspired neural network (NN) [37].
The present paper mainly investigates two important questions via the use of computational models that mimic the columnar architecture and implement biologically plausible learning rules. Specifically, we are interested in investigating the effects of 1) using a modular structure composed of neurons with different activation functions between hidden layer slabs and 2) using a biologically motivated guided annealing learning rule on the classification performance of an artificial NN (ANN). Furthermore, by making adaptive not only the network weights but also some important characteristics of the different activation functions, the procedure extends previous work reported by Trentin [49]. Though explicit modeling of columnar organization and functionality is not the purpose of this paper, we hope that this brain-derived approach may lead to the development of more powerful machine learning tools by stressing the possible benefits that neurally inspired architectures and training rules can have on classification performance.
A modular NN architecture composed of multiple hidden layer slabs is introduced. Each slab consists of neurons with the same type of activation function, reflecting the similarity of neurons within a cortical column. Neurons belonging to different slabs have a different type of activation function, reflecting the different physiological properties between cortical columns. Note that the term "modular" refers to the subdivision of hidden layer neurons into groups, and not to classical definitions of modular NNs where systems of NN modules cooperate with each other to complete a global task [12], [19].
The main goals of this paper are, first, to develop a novel ANN implementing a biologically inspired modular architecture and learning rule and, second, to explore the performance of such a model on pattern recognition problems. Our efforts also aim at shedding some light on the developmental reasons behind the columnar organization of cortical structures. Toward these goals, we study information capacity via the analysis of the classification
performance of the aforementioned NN model on a variety of medical data. In Section II we discuss the network architecture and the implemented training rules, while in Section III we describe the public medical data sets used to validate the proposed model. In Section IV we analyze the performance of the NN classifier, and in Section V we discuss the advantages and limitations of the proposed model.

II. ARTIFICIAL NEURAL NETWORK

A. Architecture

The model architecture follows a columnar organization inspired by the neuronal organization of the visual and somatosensory cortices of cats [23] and rodents [31], respectively. Previous computational studies suggested that the columnar structure of the visual system serves the purpose of achieving efficient coding of sensory information [25], [26]. This approach was based on ideas of optimal coding from information theory, for example, decorrelation of sensory inputs to make statistically independent representations. The functional differentiation of macrocolumns and minicolumns has been suggested to enhance associative learning and pattern recognition by promoting decorrelation of the information learned by different parts of the physical network [6], [7]. Based on this evidence, this paper investigates the classification performance of an alternative NN architecture, where the network hidden layer is organized into groups of different single neuron models. Earlier work on a similar concept, the multislab architecture, was found to yield high classification scores in different applications [33], [38]. The proposed model, shown in Fig. 1, is a three-layer feedforward NN that consists of:
1) an input layer where input features are preprocessed with the use of generalized Gaussian activation functions;
2) a hidden layer where neurons are divided into slabs, each with a different type of single neuron activation function;
3) an output layer with neurons activated by logistic functions.

1) Input Layer: The first layer of preprocessing neurons is inspired by the functionality of the receptive fields (RFs) of the retinal ganglion cells and lateral geniculate cells found early in the visual processing system [3]. These receptive fields have a Gaussian-like (center-surround) shape, where stimuli that lie in the center of the receptive field excite the cell while stimuli in the periphery of the RF do not cause the neuron to fire. Such RFs were successfully modeled as difference-of-Gaussian filters [30] about two decades ago. Following a similar concept, the preprocessing layer of the proposed network is composed of generalized Gaussian filters, given in (1), that lie on the one-dimensional (1-D) line covering the range of input feature values. The general form of the Gaussian activation function, as applied to a training pattern (observation) $x^{(p)}$, is

$G_j\big(x_i^{(p)}\big) = A_j \exp\!\big(-B_j \big(x_i^{(p)} - C_j\big)^2\big) \qquad (1)$

where $A_j$, $B_j$, and $C_j$ are adaptable parameters: $A_j$ is the amplitude of the filter, $B_j$ is a measure of the steepness of the Gaussian, and $C_j$ is its center; $x_i^{(p)}$ is the value of the $i$th feature in training pattern $p$, and $j$ is the index over the filters used.
Fig. 1. The network architecture.
Thus, $G_j(x^{(p)})$ is the postprocessed vector obtained after the training pattern $x^{(p)}$ is passed (element-wise) through the Gaussian filter $G_j$. For every training pattern $x^{(p)}$, the postprocessed vectors of all filters form the combined pattern

$\tilde{x}^{(p)} = \big[\,G_1(x^{(p)}),\ G_2(x^{(p)}),\ \ldots,\ G_J(x^{(p)})\,\big]$

which is used as the input to the hidden layer neurons. To illustrate this windowing effect, imagine a training pattern passed through three Gaussian filters whose centers are spread along the range of feature values. The combined output vector is then the concatenation of the three filter outputs: it has small nonzero values at the locations near the most important features, ones at the most important features themselves, and zeros everywhere else. The aim of this layer is to selectively enhance the effect of the important, or most discriminant, features in the training set while at the same time suppressing the unimportant ones.
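To make the windowing operation concrete, the following short Python sketch shows one way such a preprocessing layer could be implemented. The function names and the exact exponential parameterization are our own illustrative assumptions, consistent with the amplitude/steepness/center description of (1) but not taken verbatim from the paper.

```python
import numpy as np

def gaussian_filter(x, amplitude, steepness, center):
    """Generalized Gaussian window of Eq. (1), applied element-wise.

    amplitude, steepness and center play the roles of the adaptable
    parameters A_j, B_j and C_j; the exact exponent form is an assumption.
    """
    return amplitude * np.exp(-steepness * (x - center) ** 2)

def preprocess(x, filters):
    """Concatenate the outputs of all Gaussian filters for one pattern."""
    return np.concatenate([gaussian_filter(x, a, b, c) for (a, b, c) in filters])

# Example: a 5-feature pattern and three filters centered across the feature
# range (illustrative values only).
x = np.array([0.1, 2.0, 4.5, 6.0, 8.0])
filters = [(1.0, 0.5, 0.0), (1.0, 0.5, 4.0), (1.0, 0.5, 8.0)]
x_tilde = preprocess(x, filters)   # length 15 = 3 filters x 5 features
print(np.round(x_tilde, 3))
```

Features close to a filter center are passed through near their full magnitude, while distant features are attenuated toward zero, which is the "windowing" effect described above.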
The Gaussian filters are initially centered at equal intervals covering the entire range of input feature values, with adaptive amplitudes initialized to unity and steepness initialized to a fixed fraction of the length of their in-between interval. As learning proceeds, the filters shift along the input space and adapt their shape (center, amplitude, and steepness) so as to apply the windowing effect to the most important feature region. The idea behind the adaptable design of the preprocessing Gaussians is to allow each hidden layer neuron to focus on a specific subset of the data, thus promoting the decorrelation of patterns learned by different parts of the network and enhancing information learning [26]. This is achieved by promoting the development of highly dissimilar preprocessing filters, each of which focuses on a different region in feature space, thus zooming in on different important features. As the number of input features increases and the magnitude of the most important features is amplified relative to that of the unimportant ones, the discriminability of the input data is greatly enhanced. At the same time, the number of connection weights between the input and hidden layers also grows, increasing the network's flexibility, although this can sometimes have a negative effect on the network's generalization performance.

2) Hidden Layer: The hidden layer of the network consists of neurons grouped in slabs based on the form of their activation functions. The biological analogy of this layer corresponds to the simple cells in the primary visual cortex, which are orientation-selective and have characteristic spatial frequencies. Cells with similar orientation preference respond to stimuli that lie along similar orientations in the visual field. These cells are localized within the same minicolumn and fire only at a specific stimulus orientation and width, which defines their characteristic spatial frequency. Previous work by Daugman [4] proposed that these cells could be modeled as complex 2-D Gabor filters. Based on the same concept of similar cell functionality within a minicolumn, neurons within a slab share the same form of activation function, which is different for each slab. All activation functions have adaptive parameters, and their forms were selected based on the findings of previous work [33], [38]. The idea of adaptive activation functions in NN models is not new. It was previously shown that ANN performance can be improved when the parameters of activation functions are adaptive [32]–[34], [49]. However, the idea of using completely different activation functions, organized in separate groups of neurons and with adaptive parameters, was inspired by the results of earlier work [33], [38] and is investigated and validated further in the present study, in combination with a biologically inspired learning rule.

Following the preprocessing performed at the first layer of the network, the combined pattern $\tilde{x}^{(p)}$ is weighted by connection weights and used as input to the hidden layer neurons. More explicitly, for any given training pattern $p$, the input to the hidden layer neurons is given by the weighted sum of the postprocessed pattern

$h^{(p)} = W^{(1)} \tilde{x}^{(p)} \qquad (2)$

where $W^{(1)}$ is the weight matrix between the input and hidden layers. The outputs of the hidden layer neurons, grouped into slabs, are given by one of the following forms of activation function, one form per slab:
I) a generalized logistic form (3);
II) a generalized Gaussian form (4);
III) a generalized second-degree polynomial form (5);
IV) a threshold (step) function (6).
Each form has its own set of adaptable parameters, which are initialized to fixed values at the start of training. The output of the entire hidden layer for pattern $p$ is thus given by the vector $y^{(p)}$, which combines all slab outputs.

3) Output Layer: The output layer could be thought of as a complex cell in the visual system that combines the outputs of simple cells in order to derive a decision about a given input pattern $x^{(p)}$. The activation functions for the neurons in the output layer are generalized logistic functions (7), the parameters of which are also adapted during learning. Their input is the vector $s^{(p)}$ of subtotals of linearly combined hidden layer outputs, given by

$s^{(p)} = W^{(2)} y^{(p)} \qquad (8)$

with $y^{(p)}$ the hidden layer output and $W^{(2)}$ the weight matrix between the hidden and output layers.
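As a rough illustration of the multi-slab forward pass implied by (2), (7), and (8), the sketch below uses fixed, simplified stand-ins for the four slab activation types; the paper's generalized forms carry several adaptable parameters each, which are omitted here, so this is a structural sketch rather than the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple stand-ins for the four slab activation types (the paper uses
# generalized forms with adaptable parameters; these fixed versions only
# illustrate the multi-slab structure).
def logistic(h):   return 1.0 / (1.0 + np.exp(-h))
def gaussian(h):   return np.exp(-h ** 2)
def poly2(h):      return 0.5 * h ** 2 + h          # second-degree polynomial
def threshold(h):  return (h > 0).astype(float)     # step function

SLAB_FUNCS = [logistic, gaussian, poly2, threshold]  # one function per slab
NEURONS_PER_SLAB = 5                                 # 4 slabs x 5 = 20 hidden neurons

def forward(x_tilde, W1, W2):
    """Preprocessed input -> multi-slab hidden layer -> logistic output."""
    h = W1 @ x_tilde                      # Eq. (2): weighted sum per hidden neuron
    y = np.empty_like(h)
    for s, f in enumerate(SLAB_FUNCS):    # apply each slab's activation to its neurons
        idx = slice(s * NEURONS_PER_SLAB, (s + 1) * NEURONS_PER_SLAB)
        y[idx] = f(h[idx])
    s_out = W2 @ y                        # Eq. (8): linear combination of slab outputs
    return logistic(s_out)                # Eq. (7): logistic output neuron(s)

x_tilde = rng.normal(size=15)             # e.g., 3 filters x 5 features
W1 = rng.normal(size=(20, 15))
W2 = rng.normal(size=(1, 20))
print(forward(x_tilde, W1, W2))
```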
B. Learning

Activation function parameters in all three layers are modified during learning by a change in their magnitude, as suggested by [20], [21], [28], [35]. Specifically, one neuron in the preprocessing layer, one neuron in each hidden layer slab, and one neuron in the output layer are randomly selected in each epoch. For each selected neuron, one of its activation function parameters is then selected at random and increased or decreased by 10% of its value; the direction of the change is again chosen at random. For each of the selected neurons, a parameter modification is kept only if it leads to a lower sum of squared errors (SSE) over the entire NN, as defined in (9); otherwise, the old parameter value is restored:

$\mathrm{SSE} = \sum_{p}\sum_{k} \big(O_{kp} - T_{kp}\big)^{2} \qquad (9)$

where $O$ is the matrix with the network outputs for all training patterns and $T$ is the matrix with the respective class labels (i.e., 0s and 1s).

In each training epoch, activation parameter changes are preceded by changes in the values of both connection weight matrices, as defined by one of two different learning rules: 1) a random modifications rule [20], [21], [28], [33], [35], [38] or 2) a novel enhanced guided annealing rule.

Random Modifications Rule (RM): The random modifications rule consists of the following steps:
• initial weight values are drawn from a normal distribution;
• one neuron in each of the hidden and output layers is randomly selected;
• for each selected neuron, one of its connection weights is randomly selected and modified by a random change in its magnitude;
• for each single weight change, the new SSE over all patterns is calculated; if the new error is lower than the old one, the change is kept, otherwise the old weight value is restored.

Enhanced Guided Annealing Rule (EGA): The enhanced guided annealing rule consists of the following steps.
• Initial weight values are drawn from a normal distribution.
• One neuron in each of the hidden and output layers is randomly selected.
• For each selected neuron in the hidden layer, 10% of its input weights are selected at random; for each selected neuron in the output layer, 50% of its input weights are selected at random. The increased percentage is used to ensure the selection of an adequate number of candidate weights, given the much smaller size of the weight matrix between the hidden and output layers compared to that between the input and hidden layers. A score $G$ is calculated for each of these weights, which measures a correlation coefficient as shown in (11). The correlation coefficient between two vector variables $\mathbf{a}$ and $\mathbf{b}$ used here is given by

$\rho(\mathbf{a}, \mathbf{b}) = \dfrac{(\mathbf{a} - \bar{a}) \cdot (\mathbf{b} - \bar{b})}{\lVert \mathbf{a} - \bar{a} \rVert \, \lVert \mathbf{b} - \bar{b} \rVert} \qquad (10)$

where the bar denotes the mean and "$\cdot$" the dot product. The weight scoring function [36] is

$G = \rho\big(\mathbf{x}_I, \mathbf{T}_I\big) \qquad (11)$

where $\mathbf{x}$ is the input to the neuron in the network layer associated with the selected weight, $\mathbf{T}$ is the vector with the class labels (0s and 1s), and $I$ is an index used to identify all misclassified patterns. Thus, $\mathbf{x}_I$ and $\mathbf{T}_I$ indicate that the correlation is measured only over the patterns that were misclassified in the previous epoch. An example of the estimation of $\mathbf{x}_I$ and $\mathbf{T}_I$ for weights in the input-hidden layer matrix $W^{(1)}$ is shown in Fig. 2; the calculations for weights in the hidden-output layer matrix $W^{(2)}$ are carried out in an analogous way. Training on the misclassified subset of the input data was previously shown to accelerate the convergence of the algorithm [36], [37] as compared to using the entire data set, since the number of patterns used by the scoring algorithm decreases steeply with convergence.
• The weight with the smallest $G$ score is selected and its magnitude is modified by a random change. The change is kept or rejected based on a Boltzmann equation

$P = \exp\!\left(-\dfrac{\Delta\mathrm{SSE}}{T}\right) \qquad (12)$

A change is kept if the new SSE is smaller than or equal to the one prior to the weight modification, or if a uniform random number is smaller than $P$.
• The temperature variable $T$ in (12) is lowered over the course of learning by a scaling factor, so that fewer changes that cause an increase in SSE are accepted as the algorithm converges to a minimum.
• To avoid local minima, if the error rate is unchanged for 400 consecutive weight changes, $T$ is increased: by one fixed factor if the number of epochs is less than 2000, or by a different fixed factor if the number of epochs is larger.
For the experiments reported here, the initial temperature was set to 20 and the temperature was reduced by a fixed scaling factor. Learning was terminated after a maximum number of epochs (10 000) or when no further improvement in the SSE was observed after 1500 consecutive epochs.

The enhanced guided annealing rule combines the properties of a delta rule (by incorporating an error reduction procedure, i.e., a feedback signal based on the network's output) with the stochastic nature of simulated annealing. The biological inspiration for this learning rule comes from evidence regarding backpropagating action potentials in various types of neurons, which are thought to provide a feedback signal to dendritic synapses about the output of the cell [44]–[46]. This feedback signal could, potentially, be incorporated into the plasticity rules that govern synapse potentiation and depression, thus eliminating the locality of the (so-called) Hebbian rules, which form the most popular theory for synaptic plasticity. In addition to a biologically inspired delta signal, the EGA rule also implements the stochastic nature of synaptic plasticity, which, besides neuronal excitation frequency, depends heavily on the probability of vesicle docking and neurotransmitter release at the synaptic terminal [5], [39], [40].
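A schematic Python sketch of one EGA weight update follows. It assumes a generic sse_fn callback for evaluating (9), uses the correlation score of (10)-(11) over the misclassified patterns, and applies the Boltzmann acceptance of (12); the names and the ±10% weight step are our own assumptions (the step mirrors the one used for activation parameters), so this is a sketch of the procedure rather than the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def corr(a, b):
    """Correlation coefficient of Eq. (10): centred dot product over norms."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom > 0 else 0.0

def ega_step(W, X, targets, miscls, sse_fn, temperature, frac=0.1):
    """One EGA weight update for a randomly selected neuron (row of W).

    X        : inputs feeding this layer (patterns x inputs)
    targets  : class labels (0/1) per pattern
    miscls   : boolean mask of patterns misclassified in the previous epoch
    sse_fn   : callable returning the network SSE of Eq. (9) for a given W
    """
    if not miscls.any():
        return W
    neuron = rng.integers(W.shape[0])
    candidates = rng.choice(W.shape[1], max(1, int(frac * W.shape[1])), replace=False)
    # Eq. (11): score each candidate weight by the correlation between its
    # input and the targets, restricted to the misclassified patterns.
    scores = [corr(X[miscls, j], targets[miscls]) for j in candidates]
    j = candidates[int(np.argmin(scores))]            # smallest score wins

    old_w, old_sse = W[neuron, j], sse_fn(W)
    W[neuron, j] = old_w * (1 + 0.1 * rng.choice([-1, 1]))  # random +/-10% change (assumed size)
    delta = sse_fn(W) - old_sse
    # Eq. (12): keep downhill moves; accept uphill moves with Boltzmann probability.
    if delta > 0 and rng.random() >= np.exp(-delta / temperature):
        W[neuron, j] = old_w                          # reject the change
    return W
```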
Fig. 2. Illustration of the scoring function calculations for a selected weight in the input-hidden layer weight matrix $W^{(1)}$. The dashed box contains the outputs of the preprocessing function (for all training patterns) that get multiplied by the selected weight. Target values for these patterns are shown in the dashed box labeled "Target vector." The index numbers (I = 2, I = 3, ..., I = 49) on the right side of the box indicate the training patterns that were misclassified in the previous epoch. Subsets of both vectors are formed which contain only the misclassified patterns, as indicated by the index vector I. These reduced vectors are used to calculate the score G for the selected weight. Calculations for weights in the hidden-output matrix $W^{(2)}$ are done in a similar way, by replacing the preprocessing outputs with the hidden layer outputs and the hidden neurons with the output neuron.
C. Pruning of Redundant Hidden Layer Neurons

Due to the adaptivity of the preprocessing and activation function parameters, the proposed network has an inherent capacity to force hidden layer neurons to become redundant when their contribution to performance is negligible. This happens when a neuron's function parameters are learned such that its output is identical (but not necessarily zero) over all training and test patterns, irrespective of their class labels. Such redundant neurons can thus be deleted without affecting the network performance. This is done by examining the output of each hidden layer neuron over the test set after learning, as follows.
• Hidden layer neurons with an identical response for all training patterns are considered redundant, since they offer no information for the correct categorization of input patterns into their respective classes. Their only contribution to the network output is an additive shift of the hidden layer output range to the one that maximized classification performance as a result of activation function parameter tuning (see Section IV-A for a detailed explanation).
• All redundant neurons and the associated hidden-output layer weights are pruned. To ensure that the classification score is maintained, a constant is added to the reduced hidden layer output (which is calculated using only the nonredundant neurons) that linearly shifts its range to that of the entire hidden layer output (which is calculated using all hidden layer neurons)

$\tilde{y}^{(p)} = y_{\mathrm{reduced}}^{(p)} + c \qquad (13)$

where the constant $c$ is simply given by the difference of the means between the entire and the reduced hidden layer outputs over the training set

$c = \overline{y_{\mathrm{entire}}} - \overline{y_{\mathrm{reduced}}} \qquad (14)$

In other words, the contribution of the pruned hidden layer neurons is replaced with an additive constant that shifts the mean of the reduced hidden layer response to be the same as that of the entire hidden layer response. The classification score was verified to remain unchanged after redundant neuron pruning for all data sets presented here. This additional step reduces the complexity of the proposed model significantly since, on average in this application, 75% of the hidden layer neurons turn out to be redundant at the end of training.
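A minimal sketch of this post-training pruning step is given below, under the assumption that an "identical response" is detected as near-zero variance over the training patterns and that the constant of (13)-(14) is computed as the difference between the mean full and mean reduced hidden responses; the function name and tolerance are ours.

```python
import numpy as np

def prune_redundant(Y_train, W2, tol=1e-9):
    """Prune hidden neurons whose response is constant over all training patterns.

    Y_train : hidden layer outputs, shape (patterns, hidden_neurons)
    W2      : hidden-to-output weight matrix, shape (outputs, hidden_neurons)
    Returns the mask of kept neurons, the pruned weight matrix, and the additive
    constant c of Eqs. (13)-(14), computed here as the difference between the
    mean full and mean reduced hidden responses over the training set (one
    possible reading of the text).
    """
    keep = Y_train.std(axis=0) > tol          # constant-output neurons are redundant
    c = Y_train.mean() - Y_train[:, keep].mean()
    return keep, W2[:, keep], c

# Usage: shift the reduced hidden output by c before the output layer,
# e.g., s = W2_kept @ (y_kept + c) for a new pattern's reduced hidden response.
```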
III. MEDICAL DATA SETS

The NN model is used for the classification of four medical data sets from the University of California at Irvine Machine Learning Repository (http://www.uci.edu), and its performance is compared to that of a support vector machine (SVM) classifier. Two of the data sets contain data related to heart disease diagnosis, one set contains breast cancer data related to disease recurrence, and the last set contains echocardiogram data related to the survival of patients who experienced heart attacks.
TABLE I HEART RATE DATA SETS ATTRIBUTES
TABLE II BREAST CANCER DATA SET ATTRIBUTES
A more detailed description of each data set used in this paper follows. The Veterans Administration heart rate data set was donated by the V.A. Medical Center, Long Beach, CA, and the Cleveland Clinic Foundation, Cleveland, OH. The database was created by Dr. R. Detrano in 1988 and consists of 200 patients, with (149) and without (51) heart disease. The H heart rate data set was created by Dr. A. Janosi in 1988 at the Hungarian Institute of Cardiology, Budapest, and contains data from 294 patients, with (106) and without (188) heart disease. Both data sets have the same set of 14 attributes for each patient record, which are shown in Table I. The breast cancer data set was provided by Dr. M. Zwitter and Dr. M. Soklic from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, in 1988. The database contains a total of 286 patients, 85 of whom had disease recurrence while the remaining 201 were cancer-free at the end of the testing period. The data set has nine attributes, which are shown in Table II.
The echocardiogram data set was donated by Dr. S. Salzberg ([email protected]) in 1989 and collected by Dr. E. Kinney at the Reed Institute, Miami. The database contains patients who suffered heart attacks at some point in the past (132), with some of them still alive (24) and some not (108). The survival and still-alive variables, when taken together, indicate whether a patient survived for at least one year following the heart attack. The problem addressed here is to predict whether or not a patient will survive for at least one year. The data set has 12 attributes, which are shown in Table III.
Due to the rather limited number of patients (observations) in some of the data sets, and in order to verify the correctness of the classification results, a bootstrapping procedure was used. For each data set, the system was trained and evaluated using ten different bootstrap runs. In each run, a set of observations was first selected at random, with replacement, to be used for evaluation. From the remaining patterns, another set was selected at random, with replacement, and used for training. Details of the bootstrap partitioning of each data set are shown in Table IV. All bootstrap runs are performed on the same partition of the data sets, with connection weights initialized at random.
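For illustration, a possible implementation of the described bootstrap partitioning (evaluation set drawn with replacement first, then a training set drawn with replacement from the remaining patterns) is sketched below; the sizes shown are hypothetical, with the actual per-set sizes listed in Table IV.

```python
import numpy as np

def bootstrap_partition(n_patterns, n_eval, n_train, rng):
    """Draw one evaluation/training bootstrap partition as described in the text."""
    all_idx = np.arange(n_patterns)
    eval_idx = rng.choice(all_idx, size=n_eval, replace=True)     # evaluation set, with replacement
    remaining = np.setdiff1d(all_idx, eval_idx)                   # patterns never drawn for evaluation
    train_idx = rng.choice(remaining, size=n_train, replace=True) # training set, with replacement
    return train_idx, eval_idx

rng = np.random.default_rng(0)
# Hypothetical sizes for a 286-pattern data set.
train_idx, eval_idx = bootstrap_partition(n_patterns=286, n_eval=86, n_train=200, rng=rng)
```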
TABLE III ECHOCARDIOGRAM DATA SET ATTRIBUTES
TABLE IV DATA SETS USED FROM THE UCI REPOSITORY AND THE 10 BOOTSTRAP PARTITIONINGS OF EACH SET
The NN classifiers are trained for a maximum of 10 000 epochs or until no improvement in the classification error was seen over 1500 consecutive epochs. In the following experiments, the network architecture consists of three preprocessing neurons, four hidden layer slabs, and one output neuron, since all data sets tested are two-class classification problems.

IV. RESULTS

A. Significance of Network Architecture

To assess the importance of the different activation functions used in the proposed NN model, we performed a series of experiments using different combinations of the available choices on the Breast Cancer data set. This data set was selected because its size is significantly larger than that of the remaining data sets used in this study. Table V shows the performance of the proposed network, in terms of misclassification error, when trained with the enhanced guided annealing rule for different model architectures. In order to explore the effect of each activation function on classification performance, a homogeneous NN with 20 hidden layer neurons was formed and trained over 10 bootstrap runs for each of the four types of activation functions. The results shown in Table V indicate that the most important activation functions were the generalized logistic (row 1), the polynomial (row 3), and the Gaussian (row 2) types, the use of which led to small classification errors. On the contrary, the step function (row 4) was found to be the least
important with respect to training and classification errors. The next question addressed the effect of various combinations of these activation functions on the network performance. When a combination of the generalized Gaussian, logistic, and polynomial functions (Table V, row 5) was tested, the network performance was not different from the performance of the networks utilizing the functions individually. A paired t-test performed between the three-activation-function network and each of the single-activation-function networks verified that the resulting differences were not statistically significant. A combination of only logistic and Gaussian activation functions (Table V, row 7) was subsequently tested. The performance (both learning and testing) of the combined network appeared to be slightly worse, but was again not statistically different from that of the single-activation-function networks. Interestingly, the use of the step function—which performed badly in a single-activation-function network—in combination with generalized logistic and Gaussian functions (Table V, row 6) led to a network performance comparable to that of all other "good" combinations (Table V, rows 5 and 7). To investigate this further, the network was trained using all four activation functions (Table V, rows 8 and 9) with a total of 16 (Table V, row 8) or a total of 20 (Table V, row 9) hidden layer neurons. The generalization performance of this latter classifier was significantly better than that of all previous combinations, as verified by a paired t-test.
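The paired comparisons referred to above can be reproduced with a standard paired t-test over the ten matched bootstrap runs, as in the sketch below; the error values are hypothetical placeholders, not the values reported in Table V.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-bootstrap test errors (%) for two architectures evaluated
# on the same ten bootstrap partitions.
errors_modular    = np.array([24.1, 25.3, 23.8, 26.0, 24.9, 25.5, 23.9, 24.4, 25.1, 24.7])
errors_polynomial = np.array([27.2, 28.0, 26.5, 27.9, 28.3, 27.1, 26.8, 27.5, 28.1, 27.0])

# Paired t-test across the matched bootstrap runs.
t_stat, p_value = ttest_rel(errors_modular, errors_polynomial)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p -> statistically significant difference
```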
TABLE V EFFECT OF ACTIVATION FUNCTION TYPE ON NETWORK PERFORMANCE
Fig. 3. Role of hidden layer activation function dissimilarity on preprocessing modularity. Data shown are from posttraining testing of 94 patterns from the H-Heart data set. A, B. Shape and output range of the preprocessing filters. Curves show the mean filter output and standard deviation (y-axis) over 94 inputs with 13 features (x-axis). Circles show filter responses for patients belonging to class 1 (36), while stars show filter responses for patients belonging to class 2 (58). Gaussian filters in a network with four types of hidden layer activation functions (A) develop into more distinct, curvy shapes and locate and transform input features differently for the two classes, as compared to a network with polynomial-only hidden layer neurons (B).
Toward a deeper exploration of the possible reasons for this result, the outputs of the preprocessing and hidden layer units were analyzed under two conditions: when a) a single type or b) all four types of activation functions were used in the hidden layer. Results for neuron responses in different layers of a network implementing all four activation functions (hereafter termed the "modular network") and a network with second-degree polynomial functions only (termed the "polynomial-only network") on the H-heart rate data set are shown in Figs. 3–5. The polynomial-only network was selected because the difference between its generalization performance and that of the modular network was statistically significant in the previous experiment (paired t-test on the breast cancer data set). Both networks are trained with a set of 200 patterns and tested on an unseen set of 94 patterns from the H-heart rate data set. Results are shown for the test set, as results for the training set were very similar. The four-slab architecture was maintained in both networks to show that modularity in network performance is not a result of physical slab separation. Similar results were obtained for all four data sets and the different types of single neuron activation functions used in this paper.
Fig. 3 shows the average response and standard deviation over 94 patterns of the three preprocessing layer functions (y-axis) for each of the 13 input features (x-axis). Circles show
filter responses for patients belonging to class 1 (36) while stars show filter responses for patients belonging to class 2 (58). The preprocessing filter response curves for the modular network [Fig. 3(a)] develop into relatively different curvy shapes over the range of input features. A curvy shape indicates that the filter transforms input features differently, thus suggesting it succeeds in differentiating between important and unimportant features. Furthermore, different response curves between the three filters suggest that each one of them focuses on a different subset of input features. An additional point can be made by comparing the response of any given filter to observations in each of the two classes (circles versus stars). In the modular network, all three filters show differential responses for class 1 versus class 2 for a small number of features. These features, which are different for each of the filters, are presumably the most important ones for discrimination. More specifically, if the average response of a filter for a given feature is clearly different between the two classes, it suggests that the filter was able to identify and enhance an important feature by increasing its magnitude discrepancy between the two classes. Taken together, these findings suggest that each preprocessing
filter in the modular network can amplify/suppress different input data features. On the contrary, in the polynomial-only network [Fig. 3(b)] all filters have relatively flat shapes, while responses for the two classes appear very similar, suggesting a limited ability of the model to locate, differentiate, and effectively boost or suppress the various input features and possibly improve network classification performance.

Fig. 4. Effect of activation function dissimilarity on hidden layer output. Data shown are from postlearning testing of 94 patterns from the H-heart data set. A. Two out of 20 neurons in a modular network are nonredundant after training, and their responses over the test set are not significantly correlated. B. A network with polynomial-only hidden layer activation functions has more nonredundant neurons (8), and many of their pairwise responses over the test set are highly correlated.

Fig. 5. Effect of activation function dissimilarity on classification performance. Data shown are from postlearning testing of 94 patterns from the H-heart data set. Thick symbols indicate misclassified patterns. Classification performance of a modular network (A) is better than that of a polynomial-only network (B).
Fig. 4 shows the hidden layer neuron responses (average and STD) over the 94 patterns (y-axis). Neuron indices are organized on the x-axis to show the four-slab architecture: points 1–5 on the x-axis indicate neurons 1–5, which belong to the first slab; neurons 6–10 belong to the second slab, 11–15 to the third slab, and 16–20 to the fourth slab. Interestingly, 18 out of 20 hidden layer neurons in the modular network [Fig. 4(a)] give a
single response over all 94 test patterns, while only neurons 4 and 9 show variability in their responses over the test set. This suggests that after training, only a few hidden layer neurons that can discriminate between the two classes are needed to ensure high classification performance. This was verified by pruning redundant hidden layer neurons after training and showing that the classification score remains unchanged for this and all other runs and data sets. However, when pruning the same number of neurons before training, the network performance was significantly worse in terms of both training and generalization. A possible explanation for these findings is that hidden layer neurons may be needed to assist in tuning the preprocessing Gaussians to find the most discriminative features; as the algorithm converges, their contribution weakens incrementally. Future work could explore progressive pruning of hidden layer neurons as their responses converge to a single value over training, but this is not explicitly addressed in the present study.
To further explore the role of preprocessing in the hidden layer's degree of redundancy and the network's performance, we performed a series of experiments using the modular network with and without preprocessing on the Breast Cancer data set. We found that the use of preprocessing filters had no effect on the modular network's learning or generalization performance, as verified by a t-test comparison. What was different between these cases, however, was the number of resulting redundant hidden layer neurons. When all three preprocessing Gaussian filters were used, most hidden layer neurons became redundant over the ten bootstrap runs; with just one preprocessing filter the number of redundant neurons was smaller, and without any preprocessing filters the number of nonredundant neurons more than doubled compared to the three-filter case. These results further support the notion that the preprocessing layer serves the purpose of locating and effectively boosting the most important input features while at the same time suppressing the unimportant ones. Consequently, once the input features are nonlinearly transformed in this manner, the network requires a smaller number of hidden layer neurons to achieve the same performance as a network without any preprocessing.
An additional finding in Fig. 4 refers to the correlation coefficients between the pairwise outputs of nonredundant hidden layer neurons. In the modular network these coefficients were found to be slightly negative or near zero over multiple runs, a behavior that indicates the modularity achieved by the system. More specifically, most nonredundant neurons in the modular network were found to learn different subsets of the input data, thus achieving the decorrelation of information learned by different parts of the network. On the contrary, using a single activation function, despite the flexibility provided by its adaptivity, seems to prevent the network from fully developing its modular function, as suggested by the correlated responses of multiple hidden layer neurons.
Fig. 5 shows the classification performance for both network models. The modular network is shown to have a lower classification error and a clear separation of network class predictions, as seen in Fig. 5(a). On the contrary, the polynomial-only network [Fig. 5(b)] has poorer classification performance and a more uniform distribution of class predictions.
In summary, the results of Figs. 3–5 suggest that the utilization of different types of activation functions for the different hidden layer slabs may promote the network's modular function by enhancing the decorrelation of the input data and forcing different parts of the network to learn different information. This hypothesis is supported by three main differences in the training and testing behavior of the two models.
1) Preprocessing filters in the polynomial-only network have relatively flat response curves over most input features, which are very similar for observations belonging to either class, suggesting a limited ability of the model to effectively focus on different input features. On the contrary, when all four activation functions are implemented, preprocessing filter responses appear more curvy, transform certain features differently for each class, and focus on different subsets of input features. Furthermore, when the modular network is trained without any preprocessing, the number of nonredundant hidden layer neurons increases dramatically (more than doubles on the Breast Cancer data set), indicating a synergy between preprocessing and hidden layer neurons.
2) More neurons remain nonredundant after training in the polynomial-only network, possibly in an attempt to compensate for the poor preprocessing, which does not enhance the discriminability of the data. Furthermore, in the polynomial-only network, more hidden layer neurons are needed to achieve a generalization performance that is lower than the one achieved by the modular network. This behavior further suggests that when the important features in the input data set are successfully amplified, the number of hidden layer neurons that can distinguish between patterns belonging to different classes and can ensure high classification performance becomes very small (2 out of 20 for the case presented here). This hypothesis is also supported by the findings regarding the role of preprocessing in the modular network: while performance per se is not affected by the use or lack of preprocessing, the preprocessing filters seem to play a significant role in reducing the number of hidden layer neurons needed to achieve maximum performance.
3) In the polynomial-only network, multiple nonredundant neurons have correlated responses over the entire test set, suggesting that they learn similar information about the input patterns. The correlations between the outputs of nonredundant hidden layer neurons in the modular network, however, were found to be very small on average, a behavior that further indicates that different neurons learn different subsets of the input data.
As suggested by the previous experiments, the multi-slab model architecture that promotes modular learning (Figs. 3 and 4) and minimizes both the classification error and its deviation from the mean over ten bootstrap experiments (Table V) consists of 20 hidden layer neurons organized into four equally sized slabs. Each slab in the modular network utilizes a distinct type of neuron activation function, of the forms listed in (3)–(6). This architecture is therefore used in the remaining classification problems addressed in this paper. The random modifications rule was not used in the assays shown in Table V, since its performance in the aforementioned modular
network was significantly worse than that of the EGA rule, and thus there was no need to explore alternative architectures for this rule.

B. Learning and Classification Performance

Representative error curves for the proposed modular NN, implementing both the RM and EGA learning rules and trained on all four data sets, are shown in Fig. 6.

Fig. 6. SSE training curves for both the EGA and RM learning rules.

Experimental results on all medical data sets show that the EGA algorithm speeds up the convergence and modifies the search path in the parameter space, possibly reaching deeper minima and therefore improving generalization performance. The proposed network's performance is compared to an SVM classifier with Gaussian radial basis function (RBF) kernels and exponential radial basis function (ERBF) kernels. Details about the implementation of the SVM algorithm used can be found in [15]. The mean and standard deviation of the misclassification scores, i.e., diagnostic yield, for the 10 bootstrap sets for each method and data set are tabulated in Table VI.

TABLE VI TRAINING AND EVALUATION RESULTS FOR THE LEARNING RULES TESTED IN THE MODULAR NETWORK, COMPARED TO AN SVM CLASSIFIER

Results show that the enhanced guided annealing rule outperforms the random modifications rule in all cases. Both the training and evaluation performance of the biologically inspired learning rule are significantly better than those of the RM rule. Moreover, the mean classification error as well as its standard deviation are consistently lower in all data sets for the EGA than for the RM rule. Finally, the generalization performance of the EGA rule shows no statistically significant difference from that of either type of SVM classifier. However, it should be emphasized that the SVM classifier used on average 60 (RBF) or 61 (ERBF) activation functions (kernels) for the data sets tested, a number which is nearly three times larger than the total number of adaptive activation functions used in the modular network (3 in the preprocessing layer, 20 in the hidden layer, and 1 in the output layer). Thus, while both classifiers appear to have the same performance, the utilization of different types of activation functions such as those used in the modular network seems to achieve that performance with a lower computational complexity.

V. DISCUSSION

An alternative NN model was developed, the architecture of which combines some of the integrative properties of cortical neurons with a biologically inspired learning rule. The network architecture is loosely analogous to that of a hypercolumn in the visual cortex, where parallel minicolumns (implemented as hidden layer slabs) composed of neurons with similar physiological properties and processing capabilities (described here by the use of a common type of activation function) are combined to achieve optimal processing of a given stimulus form, e.g., orientation-invariant object recognition. While the biological motivation is very strong in this paper, the connection to cortical processing is rather weak, since the network architecture does not realistically model the cortical circuitry. To point out just a few differences, there are no connections between neurons within hidden layer slabs, as there are in visual minicolumns, and no lateral inhibition between the different slabs is explicitly modeled.
The aim of this paper, however, was not to design a computational model of cortical processing but rather to study the possible effects of cortical structural organization and learning rules on the classification capacity of an ANN, inspired by the power and robustness of cortical systems. Toward that target, the modular architecture seen in the cerebral cortex was implemented in a NN in combination with a biologically inspired learning rule. While Hebbian learning has, thus far, been the most prominent mechanism for both the development and the plasticity of cortical columns, the present paper proposes a different training rule, where a feedback error signal is combined with annealing properties and both the preprocessing and activation function parameters are allowed to change over the course of learning.

A. Comparison to Previous Studies

The present approach differs in many aspects from previous studies, where the emphasis was mostly on the adaptation of activation function parameters while keeping the form of the function fixed for all neurons within layered networks. Previous work that resembles the proposed architecture in such ways is briefly discussed next. Yamada and Yabuta [52] introduced an autotuning method for controlling the shape of activation functions in MLPs. In that work, all neurons within the network were designed to share the same nonlinear function, which was sigmoidal in shape. The sigmoidal function depended on a single global adaptable parameter, the value of which determined both its slope and amplitude at each time step. In a similar work, Hu and Shao [22] introduced an algorithm to learn simultaneously the parameters of another special sigmoidal function form, which was again shared among all neurons in the network. Chen and Chang [1] used a steepest descent algorithm for autotuning the shape of sigmoidal activation functions, including a parameter for the "saturation level." In this latter case, the well-defined sigmoidal activation function parameters were trained on a neuron-by-neuron basis, but trainable parameters were only allowed at the output layer. A special case of the proposed network architecture was recently published in Neural Networks [49]. In that work, novel training algorithms were introduced to learn the amplitudes of nonlinear activation functions in layered networks. The training algorithms proposed in [49] can be viewed as particular double-step gradient descent procedures, as gradient descent learning rate schemes, or as weight-grouping techniques. The author analyzed three different network architectures in which: 1) all neurons in the network shared
the same adaptable activation function amplitude; 2) all neurons in a layer shared the same adaptable activation function amplitude; and 3) neuron-specific adaptable amplitudes were allowed. All cases were shown to have better performance than an MLP trained with standard backpropagation, backpropagation with an adaptive learning rate, or conjugate gradient. Case 3) was shown to perform better than cases 1) and 2) in most of the examples shown in [49], owing to its increased flexibility. The network architecture of case 3) can be seen as a special case of the architecture proposed in this paper, corresponding to a multislab network with one or more neurons per slab and no preprocessing elements. However, while in [49] just one adaptable parameter was allowed per activation function, namely its amplitude, the architecture proposed in the present work allows adaptivity in the generalized shape of the activation functions as well as in the preprocessing filters. Furthermore, in [49] only logistic and Gaussian forms of activation functions were used, while the present work implements two additional types of activation functions. The learning rules for the activation function parameters also differ: instead of the more complex rules used in [49], where the estimation of function derivatives is required in each epoch, our paper uses a simpler rule with random modifications of parameter values. Moreover, network weights in the present work are modified according to a biologically inspired learning rule, as opposed to the traditional rules used in [49] and other previous studies [1], [22], [52]. Experimental findings in the present work suggest that using activation functions that vary both in their general form and in their adaptive parameter values enhances the network performance notably more than when identical functions with adaptable parameters are used.
Finally, the architecture of the proposed NN should be contrasted with that of Committee Machines, where several elementary NNs are combined to produce a response. There are two classes of Committee Machines: those with Static Structures, where the combined committee response does not involve the input signal, and those with Dynamic Structures, where the input signal is directly involved in actuating the mechanism used to combine the individual expert outputs [17]. The Static Structure category includes: a) the Ensemble Averaging method, where the outputs of different predictors are linearly combined to produce an overall response; in this method, each predictor is trained on the same data set with different initial conditions; and b) the Boosting method, where a weak learning algorithm is converted into one with arbitrarily high accuracy; each predictor in this method is trained on
610
a data set with entirely different distribution. The Dynamic Structure category includes: a) The Mixture of Experts method, in which each expert network learns a different subregion of the input data and a gating network is used to control the degree to which each “expert” will contribute to the network’s response and b) the Hierarchical Mixture of Experts method, in which the individual responses of experts are nonlinearly combined by means of several gating networks arranged in a hierarchical fashion. The proposed multi-slab network described in this paper shares some common characteristics with a hybrid between the Ensemble averaging and Mixture of Experts methods. Specifically, each hidden layer neuron in the proposed architecture could be considered as a generalized “expert” with a nonlinear activation function. Like the “experts” in a Committee Machine, hidden layer neurons in the proposed network are trained with the help of feedback signals to learn different subsets of the input data and produce modular elements but unlike the Mixture of Experts method no gating mechanism is used to control the contribution of each “expert.” In the proposed work, the methodology for combining hidden layer responses is analogous but more generalized than the Ensemble averaging method. To produce the network’s response, a linear combination of hidden layer outputs is formed, followed by a thresholding nonlinearity at the output layer logistic neurons. Furthermore, the modularity of the system in the proposed architecture is achieved via the use of dissimilar hidden layer neurons as opposed to dissimilar initial conditions. Thus, the proposed architecture is more efficient than the Ensemble averaging method as it does not require training multiple classifiers to average the resulting solutions. Moreover, the proposed network is trained to decompose the classification task into smaller problems by learning different subsets of the data via the use of different dissimilarity measures. These include the various types of hidden layer neurons as well as the adaptivity of all activation function parameters. B. Redundancy in Hidden Layer Neuron Responses As seen by the experimental results presented in this paper, the proposed model promotes the development of redundant hidden layer neurons whose responses do not differ for input patterns belonging to different categories. This behavior is more prominent when the most discriminant input features are correctly identified and boosted with the use of preprocessing filters and possibly as a result of the modularity of the hidden layer slabs that utilize different types of adaptable activation functions. The formation of such modular elements may result from grouping together neurons with the same type of activation function, though this is not explicitly investigated here. Another possible explanation can be provided from the organization of the cortical circuitry. Cortical columns in the cortex are designed to contain redundant neurons. In an orientation hypercolumn for example, cells within a vertical minicolumn have almost identical receptive fields while cells between neighboring minicolumns have overlapping receptive fields. Thus, more than one neurons respond—in different degrees—to a given stimulus orientation and their responses are often highly correlated. This design serves the purpose of achieving orientation-independent object recognition at higher processing levels and ensuring recognition of noisy inputs [27]
B. Redundancy in Hidden Layer Neuron Responses
As seen in the experimental results presented in this paper, the proposed model promotes the development of redundant hidden layer neurons whose responses do not differ for input patterns belonging to different categories. This behavior is more prominent when the most discriminant input features are correctly identified and boosted by the preprocessing filters, and possibly also results from the modularity of the hidden layer slabs, which utilize different types of adaptable activation functions. The formation of such modular elements may result from grouping together neurons with the same type of activation function, though this is not explicitly investigated here. Another possible explanation comes from the organization of the cortical circuitry. Cortical columns are designed to contain redundant neurons. In an orientation hypercolumn, for example, cells within a vertical minicolumn have almost identical receptive fields, while cells in neighboring minicolumns have overlapping receptive fields. Thus, more than one neuron responds (to different degrees) to a given stimulus orientation, and their responses are often highly correlated. This design serves the purpose of achieving orientation-independent object recognition at higher processing levels and of ensuring recognition of noisy inputs [27], but it also introduces a significant degree of redundancy that can be unnecessary for the recognition of “easy” stimuli, such as a simple object lying in a specified direction in the visual field. To deal with this overrecruitment problem, the visual system uses competitive lateral connections between neurons to silence redundant cells. Lateral connections primarily connect areas with similar properties, such as neurons with the same orientation or eye preference in the visual cortex [13], [14], [29]. They can be excitatory or inhibitory and have been suggested to play an important role in modulating and controlling cortical responses by: 1) permitting amplification of weak stimuli and suppression of strong stimuli, thus providing a mechanism for activity normalization [43]; 2) assisting in the development of sharp orientation tuning and hyperacuity [9], [41], [43]; and 3) playing a role in implementing attention and control [48]. Furthermore, lateral connections may play a role in representing information in the cortex. Evidence indicates that they could be used to store information for the decorrelation of visual input and to filter out known statistical redundancies in the cortical representations [7], [42]. Findings also indicate that they may mediate the perceptual learning processes observed as early as the primary visual cortex by encoding local associations [7], [9], [50]. Thus, lateral connections appear to be used judiciously by neural tissue to maximize information processing while keeping energy expenditure low, by limiting the recruitment of neurons to those necessary for task completion.

While the columnar organization of the visual cortex is represented by the multislab architecture of the proposed network, lateral connections are not implemented, and this could explain why redundant hidden layer neurons are not automatically pruned during learning. Future work will explore this issue by incorporating lateral connections between hidden layer neurons and investigating whether a) different slabs develop into different processing modules and b) redundant neurons are forced to be silent (by autopruning) instead of responding identically to all input patterns. Future work will also explore the importance of hidden layer neurons for the optimal tuning of the preprocessing layer onto the most important features. This can be tested by progressively pruning hidden layer neurons as their responses converge to a single output value over the course of training; a simple redundancy criterion of this kind is sketched after the conclusions below.

In summary, the following conclusions can be drawn from this paper.
1) The proposed adaptable design of the preprocessing and activation function parameters serves a dual purpose: a) to promote the decorrelation of i) the input features learned by the different neurons in the first layer and ii) the input patterns learned by different hidden layer neurons, thus creating a functionally modular network; and b) in doing so, to help maximize information processing and classification capacity. As is evident from the results, the use of different activation functions in the hidden layer neurons, in combination with the preprocessing filters, boosted the robustness and classification performance of the proposed model.
2) Using adaptable preprocessing filters does not affect the network’s learning or generalization performance. The role of preprocessing is rather to better identify and boost the most important input features, thus reducing the work of the hidden layer and requiring fewer nonredundant neurons to achieve maximum performance.
3) Using a single type of adaptable hidden layer activation function may hinder the decorrelation of information learned by different parts of the network.
4) Using different forms of adaptable activation functions for each hidden slab promotes the development of multiple redundant hidden layer neurons, whose subsequent deletion reduces the size and complexity of the network without worsening its performance. Using functions of the same form with adaptable parameters leads to a much larger network with fewer redundant neurons.
5) The enhanced guided stochastic rule with biologically inspired attributes significantly outperforms a rule whose weight modifications are made according to a constrained random algorithm. Experimental results on all medical data sets show that the EGA algorithm speeds up network convergence and modifies the search path in the parameter space, possibly reaching deeper minima and improving generalization performance.

While the enhanced guided annealing rule does not obey the classical Hebbian association rules often mentioned in cortical modular processing, the incorporation of a delta signal could potentially account for the error-correction signals provided by feedback connections from higher layers in a cortical column, which are not taken into account in the proposed model. Furthermore, the idea that feedback signals travel backward within many neuron types and can greatly affect synaptic plasticity continues to gain support in light of recent physiological evidence [44], [45]. Thus, the biological analogy of the proposed network, while not direct, suggests that “mimicking” cortical architecture and information processing rules can greatly benefit the field of classical pattern recognition. Furthermore, in a neurobiological context, our findings suggest that the columnar organization of neural structures may have evolved as a result of the increased learning and memory capacity that such a modular organization can provide.
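As a concrete illustration of the redundancy check suggested above, the short sketch below applies a hypothetical pruning criterion, not a procedure specified in this paper: hidden neurons whose responses are nearly constant across a batch of input patterns are removed, and their nearly constant contribution is folded into a bias term so that the network output is essentially unchanged. The function name and variance threshold are assumptions.

import numpy as np

def prune_redundant_hidden_units(H, v, tol=1e-3):
    """Drop hidden neurons whose responses barely vary across input patterns.

    H   : (n_patterns, n_hidden) matrix of hidden-layer responses
    v   : (n_hidden,) output weights attached to those neurons
    tol : variance threshold below which a neuron is treated as redundant
          (a hypothetical criterion; the paper does not fix a specific rule)
    """
    variances = H.var(axis=0)
    keep = variances > tol
    # Fold the (almost) constant contribution of pruned neurons into a bias term
    # so that the linear combination feeding the output neuron is preserved.
    bias = float(v[~keep] @ H[:, ~keep].mean(axis=0))
    return H[:, keep], v[keep], bias, np.flatnonzero(~keep)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H = rng.normal(size=(200, 9))
    H[:, [2, 5]] = 0.7            # two neurons respond identically to every pattern
    v = rng.normal(size=9)
    H_kept, v_kept, bias, pruned = prune_redundant_hidden_units(H, v)
    print("pruned neurons:", pruned, "absorbed bias:", round(bias, 3))

Variance across a batch is only one possible measure of redundancy; run periodically during training, such a check would progressively remove neurons whose responses converge to a single output value, in the spirit of the future work outlined above.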
REFERENCES
[1] C. T. Chen and W. D. Chang, “A feedforward neural network with function shape autotuning,” Neural Networks, vol. 9, no. 4, pp. 627–641, 1996.
[2] O. D. Creutzfeldt, “Generality of the functional structure of the neocortex,” Naturwissenschaften, vol. 64, pp. 507–517, 1977.
[3] C. Daqing, C. G. Gregory, and R. D. Freeman, “Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens,” J. Neurophysiol., vol. 78, pp. 1045–1061, 1997.
[4] J. G. Daugman, “Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1169–1179, 1988.
[5] L. E. Dobrunz and C. F. Stevens, “Heterogeneity of release probability, facilitation and depletion at central synapses,” Neuron, vol. 18, pp. 995–1008, 1997.
[6] D. Dong, “Associative decorrelation dynamics: A theory of self-organization and optimization in feedback networks,” in Advances in Neural Information Processing Systems, G. Tesauro, D. S. Touretzky, and D. K. Leen, Eds. Cambridge, MA: MIT Press, 1995, p. 925.
[7] D. Dong, “Associative decorrelation dynamics in visual cortex,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[8] J. C. Eccles, Neurosci., vol. 6, pp. 1839–1855, 1981.
[9] S. Edelman, “Why have lateral connections in the visual cortex?,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[10] O. V. Favorov and D. G. Kelly, “Minicolumnar organization within somatosensory cortical segregates, II: Emergent functional properties,” Cerebral Cortex, vol. 4, pp. 428–442, 1994.
[11] O. V. Favorov and D. G. Kelly, “Minicolumnar organization within somatosensory cortical segregates, I: Development of afferent connections,” Cerebral Cortex, vol. 4, pp. 408–427, 1994.
[12] P. Gallinari, “Training of modular neural net systems,” in The Handbook of Brain Theory and Neural Networks, M. Arbib, Ed. London, U.K.: MIT Press, 1995, pp. 582–585.
[13] C. D. Gilbert and T. N. Wiesel, “Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex,” J. Neurosci., vol. 9, pp. 2432–2442, 1989.
[14] C. D. Gilbert, J. A. Hirsch, and T. N. Wiesel, “Lateral interactions in visual cortex,” in Proc. Cold Spring Harbor Symp. Quantitative Biology, vol. LV, 1990, pp. 663–677.
[15] S. R. Gunn, “Support Vector Machines for Classification and Regression,” Image Speech and Intelligent Systems Research Group, Univ. Southampton, 1997.
[16] B. L. M. Happel and J. M. J. Murre, “The design and evolution of modular neural network architectures,” Neural Networks, vol. 7, pp. 985–1004, 1994.
[17] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[18] D. O. Hebb, The Organization of Behavior: A Neurophysiological Theory. New York: Wiley, 1949.
[19] S. Haykin, Neural Networks, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[20] R. Horst and H. Tuy, Global Optimization. New York: Springer-Verlag, 1990.
[21] R. Horst and P. M. Pardalos, Eds., Handbook of Global Optimization. Dordrecht: Kluwer, 1995.
[22] Z. Hu and H. Shao, “The study of neural network adaptive control systems,” Contr. Decision, vol. 7, pp. 361–366, 1992.
[23] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” J. Physiol., vol. 160, pp. 106–154, 1962.
[24] D. H. Hubel and T. N. Wiesel, “Functional architecture of the macaque monkey visual cortex,” in Proc. Royal Society, vol. 198, London [B], U.K., 1977, pp. 1–59.
[25] G. Indiveri, L. Raffo, S. P. Sabatini, and G. M. Bisio, “A neuromorphic architecture for cortical multi-layer integration of early visual tasks,” Mach. Vis. Appl., vol. 10, pp. 305–314, 1995.
[26] G. Indiveri, L. Raffo, S. P. Sabatini, and G. M. Bisio, “A recurrent neural architecture mimicking cortical preattentive vision systems,” Neurocomputing, vol. 11, pp. 155–170, 1996.
[27] M. W. Jung, Y. Qin, D. Lee, and I. Mook-Jung, “Relationship among discharges of neighboring neurons in the rat prefrontal cortex during spatial working memory tasks,” J. Neurosci., vol. 20, no. 16, pp. 6166–6172, 2000.
[28] A. H. Kan and G. T. Timmer, “Stochastic global optimization methods part I: Clustering methods,” Mathemat. Program., vol. 39, pp. 27–78, 1987.
[29] S. Löwel and W. Singer, “Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity,” Science, vol. 255, pp. 209–212, 1992.
[30] D. Marr and E. Hildreth, “Theory of edge detection,” in Proc. Royal Society, vol. B207, London, U.K., 1980, pp. 187–217.
[31] V. B. Mountcastle, “An organizing principle for cerebral function: The unit module and the distributed system,” in The Mindful Brain, G. M. Edelman and V. B. Mountcastle, Eds. Cambridge, MA: MIT Press, 1975.
[32] M. C. Nechyba and Y. Xu, “Improved ANN performance when using different activation functions in problems of system identification and control. Neural network approach to control system identification with variable activation functions,” in Proc. IEEE Int. Symp. Intelligent Control, vol. 1, 1994, pp. 358–363.
[33] C. Neocleous, “A neural network architecture composed of adaptively defined dissimilar single-neurons: Applications in engineering and design,” Ph.D. dissertation, Brunel Univ., U.K., 1997.
[34] C. Neocleous, I. Esat, and C. Schizas, “A neural network architecture composed of adaptively defined dissimilar single-neuron units,” in Proc. 2nd Int. Conf. Computational Intelligence and Neuroscience, Durham, NC, 1997.
[35] J. D. Pinter, Global Optimization in Action. Dordrecht: Kluwer, 1996.
[36] P. Poirazi, “Contributions of active dendrites and structural plasticity to the neural substrate for learning and memory,” Ph.D. dissertation, Univ. Southern California, Los Angeles, CA, 2000.
[37] P. Poirazi and B. W. Mel, “Impact of active dendrites and structural plasticity on the memory capacity of neural tissue,” Neuron, vol. 29, pp. 779–796, 2001.
[38] P. Poirazi, C. Neocleous, C. Pattichis, and C. Schizas, “A biologically inspired neural network composed of dissimilar single neuron models,” in Proc. 23rd Ann. Int. Conf. IEEE Eng. Med. Biol. Soc., Oct. 2001.
[39] S. J. Pyott and C. Rosenmund, “The effects of temperature on vesicular supply and release in autaptic cultures of hippocampal neurons,” J. Physiol., vol. 539.2, pp. 523–535, 2002.
[40] C. Rosenmund, A. Sigler, I. Augustin, K. Reim, N. Brose, and J. S. Rhee, “Differential control of vesicle priming and short-term plasticity by Munc13 isoforms,” Neuron, vol. 33, pp. 411–424, 2002.
[41] S. P. Sabatini, “Recurrent inhibition and clustered connectivity as a basis for Gabor-like receptive fields in the visual cortex,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[42] J. Sirosh, R. Miikkulainen, and J. A. Bednar, “Self-organization of orientation maps, lateral connections, and dynamic receptive fields in the primary visual cortex,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[43] D. Somers, L. J. Toth, E. Todorov, S. C. Rao, D.-S. Kim, S. B. Nelson, A. G. Siapas, and M. Sur, “Variable gain control in local cortical circuitry supports context-dependent modulation by long-range connections,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[44] N. Spruston, Y. Schiller, G. Stuart, and B. Sakmann, “Activity-dependent action potential invasion and calcium influx into hippocampal CA1 dendrites,” Science, vol. 286, pp. 297–300, 1995.
[45] G. Stuart and B. Sakmann, “Active propagation of somatic action potentials into neocortical pyramidal cell dendrites,” Nature, vol. 367, pp. 69–72, 1994.
[46] G. Stuart and N. Spruston, “Determinants of voltage attenuation in neocortical pyramidal neuron dendrites,” J. Neurosci., vol. 18, pp. 3501–3510, 1998.
[47] J. Szentágothai, “The neuron network of the cerebral cortex: A functional interpretation,” in Proc. Royal Society, vol. 201, London, U.K., 1997, pp. 219–248.
[48] J. G. Taylor and F. N. Alavi, “A basis for long-range inhibition across cortex,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[49] E. Trentin, “Networks with trainable amplitude of activation functions,” Neural Networks, vol. 14, pp. 471–493, 2001.
[50] M. Usher, M. Stemmler, and E. Niebur, “The role of lateral connections in visual cortex: Dynamics and information processing,” in Lateral Interactions in the Cortex: Structure and Function, J. Sirosh, R. Miikkulainen, and Y. Choe, Eds. Austin, TX: UTCS Neural Networks Res. Group, 1996.
[51] R. Vogels and G. A. Orban, “How well do the response changes of striate neurons signal differences in orientation: A study in the discriminating monkey,” J. Neurosci., vol. 10, pp. 3543–3558, 1990.
[52] Y. Yamada and Y. Yabuta, “Neural network controller using autotuning method for nonlinear functions,” IEEE Trans. Neural Networks, vol. 3, pp. 595–601, 1992.
Panayiota Poirazi was born in Cyprus on August 6, 1974. She received the Diploma in Mathematics with honors from the University of Cyprus, in 1996, and the M.S. and Ph.D. degrees in biomedical engineering in 1998 and 2000, respectively, from the University of Southern California (USC), Los Angeles, both with honors. In January 2002, she received a Marie Curie Fellowship from the EC for conducting research in the field of bioinformatics. She currently holds a tenure-track Research Assistant Professor position in Bioinformatics at IMBB-FORTH. Her general research interests lie in the area of computational modeling of biological systems, with emphasis on the fields of neuroscience, machine learning, artificial intelligence, and functional genomics. She has published more than 16 refereed journal and conference papers in these areas. Dr. Poirazi was awarded the “Fred S. Grodins Graduate Research Award” for her Ph.D. thesis.
Costas Neocleous received the B.E. degree in mechanical engineering from the American University of Beirut, the M.Sc. degree in naval architecture and marine engineering from the University of Michigan, and the Ph.D. degree in neural networks from Brunel University, U.K. He is a Senior Lecturer with the Higher Technical Institute in Nicosia, Cyprus, and a Research Associate with the University of Cyprus. His research interests are in computational intelligence, artificial neural networks, control systems, diagnostic systems, and computational creativity.
Costantinos S. Pattichis (S’88–M’88–SM’99) was born in Cyprus on January 30, 1959. He received the diploma as Technician Engineer from the Higher Technical Institute, Cyprus, the B.Sc. degree in electrical engineering from the University of New Brunswick, Canada, the M.Sc. degree in biomedical engineering from the University of Texas at Austin, the M.Sc. degree in neurology from the University of Newcastle Upon Tyne, U.K., and the Ph.D. degree in electronic engineering from the University of London, U.K. He is a Senior Scientist and co-director of the Department of Computational Intelligence of the Cyprus Institute of Neurology and Genetics, as well as an Associate Professor with the Department of Computer Science of the University of Cyprus. He has published 30 refereed journal and 80 conference papers. His current research interests include medical imaging, biosignal analysis, intelligent systems, and health telematics. Dr. Pattichis was guest co-editor of the Special Issue on Emerging Health Telematics Applications in Europe of the IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, and co-chairman of the MEDICON’98 and IEEE Region 8 MELECON’2000 conferences. He also served as chairperson of the Cyprus Association of Medical Physics and Biomedical Engineering (1996–1998) and of the IEEE Cyprus Section (1998–2000).
Christos N. Schizas (M’83–SM’91) received the B.Sc. degree in electronic engineering from the University of London, U.K., in 1978, the M.B.A. degree from the University of Indianapolis, IN, in 1988, and the Ph.D. degree in systems theory from the University of London, U.K., in 1981. He served as Professor of Computer Information Systems with the University of Indianapolis. Since 1991, he has been with the Department of Computer Science, University of Cyprus, Cyprus, where he served as Interim Chair of the Department until 1994. He was elected Vice Rector of the University in October 2002. He is the founder of the Multimedia Research and Development Laboratory of the University of Cyprus, and the Director of Computational Intelligence Research at the Cyprus Institute of Neurology and Genetics. He has published more than 100 refereed journal and conference papers. His research interests include computational intelligence, decision support systems, medical informatics, diagnostic and prognostic systems, telematics applications, and multimedia. Prof. Schizas is a Fellow of the Institution of Electrical Engineers and a Fellow of the British Computer Society.