2011 IEEE International Conference on Robotics and Automation, Shanghai International Conference Center, May 9-13, 2011, Shanghai, China
Modular Neural Networks for Multi-Class Object Recognition

Yuhua Zheng and Yan Meng
Abstract—Multi-class object recognition is a critical capability for an intelligent robot to perceive its environment. In this paper, a new approach consisting of a number of modular neural networks is proposed to recognize multiple classes of objects for a robotic system. The number of modular neural networks equals the number of object classes to be recognized, and each modular network focuses on learning only one object class. For each modular neural network, the bottom-up (sensory-driven) and top-down (expectation-driven) pathways are fused together, and a supervised learning algorithm is applied to update the corresponding weights of both pathways. Furthermore, two different training strategies are evaluated: positive-only training and positive-and-negative training. Experiments on visual image recognition demonstrate the efficiency of the proposed approach and the corresponding training strategies.
Yuhua Zheng and Yan Meng are with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA. (Email: [email protected], [email protected])

I. INTRODUCTION

Recognizing multi-class objects remains a challenging problem for an intelligent robot trying to understand its environment. Generally, the objective of object recognition is to find the correlations between the raw data of objects with various appearances and the relatively invariant features of the corresponding object class. There are two basic approaches to learning and describing such relationships: generative models and discriminative models. Generative models presume a distribution of features and learn the related parameter values from data, such as Bayesian networks [1], the constellation model [2], and bag-of-words [3]. On the other hand, discriminative approaches aim to find separation boundaries between different object classes without modeling their distributions, such as nearest neighbors [4], neural networks [5], and support vector machines (SVM) [6].

In this paper, we propose a new approach using multiple modular neural networks to learn and recognize multiple objects, where each modular neural network is learned for one object class. For example, if there are 5 different object classes to be recognized, then 5 modular neural networks will be applied.

Artificial Neural Networks (ANN) have been studied and applied to object recognition in different ways. Most approaches adopt feed-forward structures and supervised learning with error back-propagation. However, evidence from cognitive brain research and neuroscience suggests that the nervous system responsible for object recognition has distributed cortical structures containing both bottom-up and top-down pathways [7, 8]. Many works have been conducted to address this issue, such as Adaptive Resonance Theory (ART) [9] and biased competition [10]. In our previous work [11], we first proposed a neural network model that fuses the information from both the bottom-up (stimulus) and top-down (expectation) pathways, called the neural network with fusing bottom-up and top-down pathways (FBTP-NN). Different from other spatial-attention-based approaches, the top-down expectation in FBTP-NN is generated from training samples and focuses on interpreting object appearances instead of searching for and localizing them. In other words, our FBTP-NN model focuses on solving the "what" problem instead of the "where" problem.

When applied to multi-class object recognition, a monolithic classifier may be limited by its capacity. For example, as the number of classes increases, it becomes harder for a monolithic model with a single neural network, fixed hidden neurons, and fixed weights to learn the boundaries between multiple classes. Therefore, to improve the capacity and efficiency of the FBTP-NN model, a modular neural network mechanism is proposed in this paper. This new approach consists of a number of modular neural networks, where each modular network is a FBTP-NN model that aims at learning just one object class. Such a system is called the FBTP-MNN model, i.e., modular neural networks with fusing bottom-up and top-down pathways. Some modular algorithms [12] have been proposed in the literature, such as the Hierarchical Mixture of Experts (HME) [13] and different variants of local learning algorithms [14, 15]. Multi-class recognition naturally fits into the FBTP-MNN algorithm by assigning each modular FBTP-NN to learn one object class individually and to compete for the final output of the whole system.

On the other hand, for multi-class object recognition, the complexity of learning and recognition may differ from one class to another. A monolithic classifier, however, can only treat all object classes with the same learning method and network parameters. Therefore, the proposed modular FBTP-MNN system provides a good framework to explore adaptive learning methods and parameter settings for the specific needs of various classes. Although the learning of each modular FBTP-NN model is independent, the training data can be different: each modular network can either learn only from its own class or learn from other classes as well. Similar training strategies have been discussed in [16], and we will use both training strategies in this paper.

The rest of this paper is organized as follows. The proposed FBTP-MNN approach is presented in Section II, as well as the description of the modular FBTP-NN model. The learning process of the FBTP-NN model and the training strategies of FBTP-MNN are described in Section III. Section IV presents experimental results to demonstrate the efficiency of the proposed FBTP-MNN approach and the corresponding learning process. Conclusions and future work are provided in Section V.

II. MODULAR NEURAL NETWORKS FOR MULTI-CLASS RECOGNITION
A. The System Framework

The modular neural networks with fusing bottom-up and top-down pathways (i.e., the FBTP-MNN model) contain a number of modular neural networks. Each modular neural network is a FBTP-NN model which fuses the information of bi-directional pathways to learn object features. The major motivation for using modular neural networks instead of a monolithic neural network is to improve the capacity of the networks. Within a monolithic neural network, the neuron values and weights are shared by all the object classes. As the number of object classes increases, the neuron values and weights are affected by the cross information between different classes, which may degrade the classification performance due to the limited network capacity. One simple solution to this issue is to separate the multiple object classes and learn each object class individually. As shown in Fig. 1, the gating module assigns data of different classes to the corresponding modular neural networks. Each modular neural network learns and updates its weights and neuron activities independently. Their outputs are fed into the competition module to compete for the final output of the system.
On the other hand, each modular network can be trained either with data from its own class only or with data from other classes as well. The first training strategy is called the positive-only (PO) training strategy. The second is called the positive-and-negative (PN) training strategy, where both positive data from its own class and negative data from other classes are used for learning. Both strategies will be applied and compared in the experiments. For each new test sample, all modular networks perform recognition and their modular outputs compete for the final output in the competition module using the winner-takes-all approach. In the following sections, the structure of the modular FBTP-NN model is introduced along with the corresponding learning algorithms.

B. The Modular Network of the FBTP-NN Model

Different from the general feed-forward neural network (FFNN), we proposed the FBTP-NN model in our previous work [11] to fuse the information of the bi-directional pathways of a neural network for better object learning and recognition. The top-down signals contain a priori knowledge or the memory of related objects, which helps to modulate the bottom-up pathway and generate more confident outputs. During the training of the FBTP-NN model, both the sensory input data and the output labels are treated equally as environmental constraints to update the corresponding network weights, and the bi-directional data flows are fused via modulations of the neuron activities of the network. In this way, both the information in the current input stimulus and the previously learned knowledge can be integrated into the neural network to improve its learning and recognition capability.
Fig. 1. The framework of the FBTP-MNN model.
To keep the problem simple, we use prior knowledge of the number of classes to construct the modular neural networks, i.e., each class has one modular network and each modular network only learns its own class. Therefore, the gating module just needs to correlate the modular networks with their corresponding classes. After being assigned their object classes, the modular networks can learn from the data of their respective classes. All weights are intra-weights within each network; there are no inter-weights between different modular networks.
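To make the gating and competition concrete, the following minimal Python sketch keeps one independent module per class and routes training samples by their class label. The module interface (`learn`, `score`) and the factory function `make_module` are illustrative assumptions, not the authors' implementation.

```python
class FBTPMNN:
    """Hypothetical container: one independent FBTP-NN module per object class (sketch)."""

    def __init__(self, class_names, make_module):
        # One modular network per class; no inter-weights between modules.
        self.modules = {name: make_module() for name in class_names}

    def train(self, samples, labels):
        # Gating module: route each sample to the module of its own class (PO training).
        for x, y in zip(samples, labels):
            self.modules[y].learn(x, target=1.0)

    def recognize(self, x):
        # Competition module: every module scores the input, winner takes all.
        scores = {name: m.score(x) for name, m in self.modules.items()}
        return max(scores, key=scores.get)
```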
Fig. 2. The framework of the modular FBTP-NN model. The bottom-up process is represented by solid lines and the top-down process is represented by dashed lines. At every hidden layer, the bottom-up stimuli (in solid circles) are fused with top-down expectations (in dashed circles) and vice versa.
As shown in Fig. 2, the FBTP-NN model may contain multiple hidden layers but only one input layer and one output layer, which are the interfaces of the network to the environment (i.e., the input data and output labels). A number of hidden
layers exist in between. Each FBTP-NN model has only one output neuron on the output layer when it is used as a modular network in the FBTP-MNN model. The input layer receives the sensory input and generates a hypothesis at the output layer through the bottom-up pathway; the output layer then produces expectations on the sensory stimulus via the top-down pathway. The expectation information is fused with the sensory stimuli to update the neuronal activities of the hidden layers. The updated neuronal activities then generate new hypotheses and send them to the output layer accordingly. Such procedures repeat until a certain stop condition (e.g., the maximal number of loops) is met. During the learning process, the fusion of the neural dynamics in both pathways is conducted on every hidden layer.

C. Neuronal Dynamics of the FBTP-NN Model

The FBTP-NN model may contain multiple hidden layers. Since the updates of neuronal activities and synaptic weights depend only on adjacent layers, for the sake of simplicity, we discuss the neuronal dynamics of the basic two-layer structure shown in Fig. 3. The bottom layer X is called the data layer and the top layer Y is the feature layer. Between any two connected neurons, there are bottom-up (solid lines), top-down (dotted lines), and correlation (horizontal lines) connections. The network (inside the dotted square) is stimulated by the environment, which may contain both data information D and feature information L. The dynamics of a neuron x on the data layer are defined as:

\[
\Delta x_i(t+1) = -\alpha_1 x_i(t) + \beta_1\, g\!\left(\sum_{u=1}^{M} p_{ui}(t)\, y_u(t)\right),
\qquad
x_i(t+1) = x_i(t) + \Delta x_i(t+1)
\tag{1}
\]
where \(\Delta x_i(t+1)\) is the change of the i-th data neuron activity at time t+1, which consists of two terms: self-decay and the top-down expectations from all feature neurons. \(y_u(t)\) is the u-th feature neuron activity at time t and M is the number of feature neurons. \(p_{ui}(t)\) is the top-down synaptic weight at time t, and \(\sum_{u=1}^{M} p_{ui}(t)\, y_u(t)\) represents the sum of top-down expectations from all the feature neurons. The expectations are then fed into the activation function g, a sigmoid function defined as \(g(x) = 1/(1+e^{-x})\), which represents the activation characteristic of neurons. \(\alpha_1\) and \(\beta_1\) are the decay constant and stimulus coefficient, respectively. Similarly, the dynamics of a feature neuron y are defined as:

\[
\Delta y_u(t+1) = -\alpha_2 y_u(t) + \beta_2\, g\!\left(\sum_{i=1}^{N} w_{iu}(t)\, x_i(t) + \sum_{v=1}^{M} c_{uv}(t)\, y_v(t)\right),
\qquad
y_u(t+1) = y_u(t) + \Delta y_u(t+1)
\tag{2}
\]
where \(\Delta y_u(t+1)\) is the change of the u-th feature neuron activity at time t+1, which consists of three terms. The first term is self-decay. The second is the bottom-up stimulus from the data layer, \(\sum_{i=1}^{N} w_{iu}(t)\, x_i(t)\), where \(w_{iu}(t)\) is the weight in the bottom-up pathway from the i-th data neuron to the u-th feature neuron, and N is the number of neurons in the data layer. The third term is the correlation among the feature neurons, \(\sum_{v=1}^{M} c_{uv}(t)\, y_v(t)\), which can be either inhibitory or excitatory, where \(c_{uv}(t)\) is the correlation weight between neuron u and neuron v, and M is the number of feature neurons. \(\alpha_2\) and \(\beta_2\) are the decay constant and stimulus coefficient of the feature neurons, respectively.
Fig. 3. A basic two-layer FBTP-NN model.
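As a concrete reading of Eqs. (1) and (2), the vectorized sketch below performs one update of the data-layer and feature-layer activities. It is a minimal interpretation that assumes W stores the bottom-up weights \(w_{iu}\) as an N-by-M matrix, P stores the top-down weights \(p_{ui}\) as an M-by-N matrix, and C stores the M-by-M correlation weights.

```python
import numpy as np

def g(z):
    """Sigmoid activation used in Eqs. (1) and (2)."""
    return 1.0 / (1.0 + np.exp(-z))

def update_data_layer(x, y, P, alpha1, beta1):
    # Eq. (1): change = self-decay + top-down expectation from all feature neurons.
    dx = -alpha1 * x + beta1 * g(P.T @ y)          # (P.T @ y)[i] = sum_u p_ui * y_u
    return x + dx

def update_feature_layer(x, y, W, C, alpha2, beta2):
    # Eq. (2): self-decay + bottom-up stimulus + lateral correlation among feature neurons.
    dy = -alpha2 * y + beta2 * g(W.T @ x + C @ y)  # (W.T @ x)[u] = sum_i w_iu * x_i
    return y + dy
```

In the full model these two updates are iterated layer by layer until the stop condition described above is met.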
III. THE LEARNING ALGORITHMS

A. Weight Updates of the FBTP-NN Model

The supervised learning algorithm for the FBTP-NN model uses the information from both the bottom-up and top-down pathways. Given a number of data-label constraints \((D, L)\) and applying squared errors to measure the distances, the cost function can be written as:

\[
E = \sum_{i=1}^{N} (d_i - x_i)^2 + \sum_{u=1}^{M} (l_u - y_u)^2
\tag{3}
\]
where \(d_i\) and \(l_u\) are the target neuron values of the sensory data and the label vector. For simplicity, the time index is omitted hereafter unless otherwise stated. Combining with Eqn. (2), the derivative of the cost function with respect to the bottom-up weight \(w_{iu}\) can be obtained as:

\[
\frac{dE}{dw_{iu}} = 0 - 2\beta_2\, g'\, x_i (l_u - y_u),
\tag{4}
\]

where \(\beta_2\) is the stimulus coefficient of the feature neurons and \(g'\) is the derivative of the activation function. \(x_i\) represents the related data neuron activity, and \(l_u\) and \(y_u\) are the desired and actual output neuron activities, respectively. Therefore, to minimize the cost function E, the change of the weight \(w_{iu}\) is defined as:

\[
\Delta w_{iu} = r_1\, x_i (l_u - y_u),
\tag{5}
\]

where \(r_1\) is the learning rate of the bottom-up weights. Eqn. (5) is a Hebbian-like, error-driven learning rule. Similarly, we can take the derivative of the cost function with respect to the top-down weights P, and the update rule for a specific \(p_{ui}\) can be derived as:
\[
\frac{dE}{dp_{ui}} = 0 - 2\beta_1\, g'\, y_u (d_i - x_i),
\tag{6}
\]

\[
\Delta p_{ui} = r_2\, y_u (d_i - x_i),
\tag{7}
\]
where \(r_2\) is the learning rate of the top-down weights. From the above cost function, it can be seen that each neuron tries to converge to its desired value under the environmental constraints. The desired values of the neurons on the input layer are the input data; the desired values of the neurons on the output layer are the true labels. The desired values of the neurons on the hidden layers are fused from the bottom-up stimulus and the top-down expectation as:

\[
l_i(t) = y_i(t) + f\,[\,x_i(t-1) - y_i(t)\,]
\tag{8}
\]

where \(l_i(t)\) represents the desired value of the i-th neuron at time step t. It depends on the current latent value \(y_i(t)\) provided by the bottom-up process and the top-down expectation \(x_i(t-1)\) of the same neuron from the last time step. Similarly, when a neuron acts as a data neuron during the top-down process, its desired value is defined as:

\[
d_i(t) = x_i(t) + f\,[\,y_i(t-1) - x_i(t)\,]
\tag{9}
\]

where \(d_i(t)\) is the desired value of the i-th neuron at time step t. It relies on the current data value \(x_i(t)\) from the top-down process and the bottom-up stimulus \(y_i(t-1)\) from the last step. On the other hand, the correlation connections C reflect the dependencies among the feature neurons and can be updated by applying the Hebbian rule of "fire together, wire together".
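A minimal sketch of the weight updates in Eqs. (5) and (7) and of the fusion of desired values in Eqs. (8) and (9) is given below; the matrix layout follows the earlier sketch (W as N-by-M, P as M-by-N), and the function names are illustrative assumptions.

```python
import numpy as np

def update_bottom_up_weights(W, x, l, y, r1):
    # Eq. (5): Hebbian-like, error-driven update of the bottom-up weights w_iu.
    return W + r1 * np.outer(x, l - y)      # W[i, u] += r1 * x_i * (l_u - y_u)

def update_top_down_weights(P, y, d, x, r2):
    # Eq. (7): symmetric rule for the top-down weights p_ui.
    return P + r2 * np.outer(y, d - x)      # P[u, i] += r2 * y_u * (d_i - x_i)

def fused_desired_value(current, expectation, f):
    # Eqs. (8)/(9): a hidden neuron's desired value blends its current activity
    # with the expectation produced by the opposite pathway at the previous step.
    return current + f * (expectation - current)
```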
B. Learning the Modular FBTP-NN Model

Fig. 4 shows the diagram of the FBTP-NN learning procedure. Given data/label pairs and the initialized FBTP-NN, the network continues learning until the maximal number of training loops is reached. Inside each training loop, one bottom-up process is executed from the input layer up to the output layer, followed by one top-down process from the output layer down to the input layer. At any specific moment, only one basic sub-network consisting of two adjacent layers (as shown in Fig. 3) is evaluated, and the corresponding weights are updated. By moving up or down one layer, another basic sub-network is learned. In this way the whole network is adjusted iteratively.

C. The Training Strategies of the FBTP-MNN Model

The FBTP-MNN model can be trained with different strategies. As shown in Fig. 5, under the positive-only (PO) training strategy, each modular neural network only accepts the data of its own class and has no awareness of the existence of the other networks for other classes. If a modular neural network learns not only from its own class but also from other object classes, the training strategy is called positive-and-negative (PN) training: the label of its own class is one and the labels of other classes are zero.

Once the FBTP-MNN model has been trained (using either the PO or the PN training strategy), it has both discriminative and generative capabilities. When unseen data are presented, object recognition starts by running only the bottom-up discriminative process of each modular FBTP-NN model. All networks whose output neuron value exceeds some threshold are considered for the final competition. For these competing modular neural networks, top-down propagation generates expectations at their input layers. These expectations are compared with the true input data to estimate the generative difference. Combining the discriminative confidence on the output layer and the generative difference on the input layer, the overall decision for object classification is made by choosing the modular network with the highest combined score, i.e., the winner-takes-all approach.
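The recognition-time competition described above can be sketched as follows. The per-module interface (`bottom_up`, `top_down_expectation`), the confidence threshold, and the way the two cues are weighted are assumptions for illustration, since the paper does not specify the exact combination rule.

```python
import numpy as np

def recognize(modules, x, threshold=0.5, w_disc=1.0, w_gen=1.0):
    """Winner-takes-all competition over the modular FBTP-NN outputs (sketch)."""
    best_name, best_score = None, float("-inf")
    for name, module in modules.items():
        conf = module.bottom_up(x)                    # discriminative confidence of the output neuron
        if conf < threshold:                          # only sufficiently confident modules compete
            continue
        x_hat = module.top_down_expectation()         # expectation generated at the input layer
        gen_diff = float(np.mean((x - x_hat) ** 2))   # generative difference to the true input
        score = w_disc * conf - w_gen * gen_diff      # combine both cues; higher is better
        if score > best_score:
            best_name, best_score = name, score
    return best_name                                  # None if no module passed the threshold
```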
Fig. 5. The training strategies for the FBTP-MNN model. (Solid lines for the PO training and the dashed lines for the PN training.)
Fig. 4. The diagram of the FBTP-NN learning procedure.
IV. EXPERIMENTS AND DISCUSSIONS

A. Experimental Settings

To demonstrate the learning process and the recognition ability of the proposed FBTP-MNN algorithm, a four-class classification experiment on visual object recognition has been performed. The data of bicycle, pan, revolver, and treadmill are taken from Caltech 256, as shown in Fig. 6. The original images are transformed into gray images, where objects are presented as white pixels and the background as black pixels. No advanced feature extraction is conducted; only the pixel values of the images are used as inputs to the FBTP-MNN model. Four modular FBTP-NN models need to be built for the four different classes. For simplicity, they all have the same three-layer structure. The number of neurons in the input layer equals the size of the training images, i.e., 32x24 = 768. The hidden layer has 0.5x768 = 384 neurons and the output layer has 1 neuron. Neurons of adjacent layers are fully connected. All modular FBTP-NN models also adopt the same learning parameters, as shown in Table I; they are chosen empirically by trial and error. Under the modular mechanism, different modular networks could actually use different parameters according to the characteristics of their own classes, which might improve the overall performance. However, in this paper we keep identical settings and will investigate class-specific settings in our future work.
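The image preparation described above can be approximated by the following sketch. The paper's exact silhouette extraction (white object on black background) is not specified, so simple grayscale conversion and rescaling are assumed here, with Pillow used only as a convenient stand-in.

```python
import numpy as np
from PIL import Image

def image_to_input_vector(path, size=(32, 24)):
    """Grayscale, resize to 32x24, and flatten into the 768-dimensional input vector
    fed to each modular FBTP-NN (assumed preprocessing, not the paper's exact pipeline)."""
    img = Image.open(path).convert("L").resize(size)
    pixels = np.asarray(img, dtype=np.float32) / 255.0   # scale pixel values to [0, 1]
    return pixels.flatten()                              # 24 * 32 = 768 inputs
```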
Fig. 6. Experimental data taken for 4-class object recognition.

TABLE I. PARAMETER SETUPS
Coefficient                         Value
Self-decay rate (α1, α2)            0.2
Stimulus rate (β1, β2)              0.8
Learning rate (r1, r2, r3)          0.05
Fusion ratio (f)                    0.01

Batch learning is applied to train each modular FBTP-NN model, which means all the training data are presented to the network at one time. After training, the testing data are sequentially classified by the proposed FBTP-MNN model.

B. Learning and Recognition Performance

First, we show the fusion process of the modular FBTP-NN model. Fig. 7 shows the evolvement of the top-down expectation on the input layer of the FBTP-NN model for the bike images. Since we applied raw pixel values of images as the network inputs, the expectation naturally looks like the original images. At the beginning, the network has learned nothing and the expectation is just noise, as shown at the top of Fig. 7. With more learning samples (shown on the left of Fig. 7), the network can better capture the features and generate the expectation prototype for the object (bottom of Fig. 7).

Fig. 7. Illustrative example of the evolvement of the top-down expectation. With more samples (on the left), the network is able to generate a better expectation of the object (on the right) from top to bottom.

For this four-class recognition problem, each class has 50 samples with various sizes, appearances and orientations. 50% of the data of each class are used for training and the rest for testing. Fig. 8 shows the average recognition rates over the 4 classes using the two training strategies (PO and PN) with different numbers of training loops. A good average recognition rate (up to 92%) can be achieved with as few as 200 learning loops. From Fig. 8, it can be seen that the PN training strategy outperforms the PO training strategy for all numbers of training loops. This result is reasonable since, under PN training, the modular FBTP-NN models learn not only from their own classes but also the separations from other classes, which may help recognition over time.
Fig. 8. The average recognition rate for the 4-class object problem under 2 training strategies (i.e. PO and PN).
Table II lists the average recognition rates of each class over different training loops (50, 100, and 200, as in Fig. 8). Generally, the FBTP-MNN model with PN training performs better than the FBTP-MNN model with PO training for all classes, which is consistent with the results in Fig. 8. For example, the recognition rate on the bike images improves from 42% to 91%.

TABLE II. RECOGNITION RATES FOR EACH CLASS
Class        FBTP-MNN (PO)    FBTP-MNN (PN)    FF-MNN (PN)
Bike         42%              91%              87%
Pan          77%              88%              84%
Revolver     58%              81%              80%
Treadmill    96%              97%              95%
In Fig. 8 and Table II, we also compare our algorithm with feed-forward modular neural networks (FF-MNN), implemented using the Netlab open-source toolbox. Similar to the FBTP-MNN model, the FF-MNN model contains multiple feed-forward neural networks (FF-NN), and each FF-NN learns one class. Both positive and negative data are applied to train each FF-NN. It can be seen from Fig. 8 that the performance of FBTP-MNN with PN training is better than that of FF-MNN with PN training. On the other hand, Table II also shows that some classes are more difficult to learn than others with the FBTP-MNN model. For example, the recognition rates of the pan and revolver objects are under 80% in the worst case, while the treadmill object reaches 96% even in the worst case. This observation indicates that adaptive parameter selection and adaptive learning for different classes may provide better recognition performance, which will be investigated in our future work.

V. CONCLUSION AND FUTURE WORK

In this paper, a modular neural network system fusing both bottom-up and top-down pathways, called FBTP-MNN, is proposed. Each modular network is one FBTP-NN model, which is able to learn one specific class through bi-directional data flows using the proposed supervised learning algorithm. To solve the multi-class problem, two training strategies are proposed and evaluated on visual objects. Experimental results demonstrate the efficiency of the proposed algorithms.

In future work, we will continue to investigate the dynamics of the FBTP-NN model, especially the top-down pathway, through which better top-down expectations might be produced as well as a better learning algorithm. Although the PN training strategy performs better than the PO training strategy, the PO training strategy has its own advantages in feasibility, especially as the number of classes increases: it would consume extensive resources to present all negative data to train every modular network. Therefore, one of our future goals is to improve the performance of PO training.

For robotics applications, real-world environments will contain more complex and more variable objects than the data adopted in this paper. New problems will be introduced as appearances change due to object deformation. Therefore, we will also evaluate the proposed model on more complex datasets such as Caltech 101, or collect data via real robotic navigation, and compare the performance of our algorithm with state-of-the-art approaches on these complex data. With more object classes, we will investigate algorithms that locally select adaptive training parameters for different classes, so that optimized resource assignment can be achieved as well as better recognition performance.

REFERENCES
[1] B. Ommer and J. M. Buhmann, "Learning the compositional nature of visual objects," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.
[2] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, pp. 264-271.
[3] W. Gang, Z. Ye, and F.-F. Li, "Using dependent regions for object categorization in a generative framework," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 1597-1604.
[4] G. Shakhnarovich, P. Viola, and T. Darrell, "Fast pose estimation with parameter-sensitive hashing," in IEEE International Conference on Computer Vision, 2003, pp. 750-757.
[5] M. A. Arbib, The Handbook of Brain Theory and Neural Networks. The MIT Press, 2002.
[6] Z. Hao, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 2126-2136.
[7] A. K. Engel, P. Fries, and W. Singer, "Dynamic predictions: Oscillations and synchrony in top-down processing," Nature Reviews Neuroscience, vol. 2, pp. 704-716, 2001.
[8] S. Treue, "Visual attention: the where, what, how and why of saliency," Current Opinion in Neurobiology, pp. 428-432, 2003.
[9] S. Grossberg, "Competitive learning: from interactive activation to adaptive resonance," in Connectionist Models and Their Implications: Readings from Cognitive Science. Ablex Publishing Corp., 1988, pp. 243-283.
[10] D. M. Beck and S. Kastner, "Top-down and bottom-up mechanisms in biasing competition in the human brain," Vision Research, vol. 49, pp. 1154-1165, 2009.
[11] Y. Zheng, Y. Meng, and Y. Jin, "Fusing bottom-up and top-down pathways in neural networks for visual object recognition," presented at the IEEE International Joint Conference on Neural Networks, 2010.
[12] A. J. C. Sharkey, "Modularity, combining and artificial neural nets," Connection Science, vol. 9, pp. 3-10, 1997.
[13] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," in Proceedings of the 1993 International Joint Conference on Neural Networks (IJCNN '93-Nagoya), 1993, pp. 1339-1344, vol. 2.
[14] L. Bottou and V. Vapnik, "Local learning algorithms," Neural Computation, vol. 4, pp. 888-900, 1992.
[15] E. Alpaydin and M. I. Jordan, "Local linear perceptrons for classification," IEEE Transactions on Neural Networks, vol. 7, pp. 788-794, 1996.
[16] A. Goltsev and V. Gritsenko, "Modular neural networks with Hebbian learning rule," Neurocomputing, vol. 72, pp. 2477-2482, 2009.