Generalization and Modularity in Feed-Forward Boolean Networks


Leonardo Franco and Sergio A. Cannas

Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, Ciudad Universitaria, (5000) Córdoba, Argentina. email: {franco,cannas}@fis.uncor.edu

Abstract

We construct a family of architectures that implement the parity function in feed-forward neural networks. The resulting networks have a modular architecture in which the degree of modularity can be controlled by the maximum fan-in allowed. Among other features, we analyze the generalization ability of these structures, obtaining analytical results for an arbitrary number of input bits. Both analytical and numerical simulation results show that the generalization ability is greatly improved by modularity.

1 Introduction

One central theme in neural networks is the design of optimal architectures to solve a given problem. The implementation of an optimal feed-forward network for a given problem involves several aspects, such as the complexity of the problem, the depth of the network (i.e., the number of hidden layers), the number of neurons in the hidden layers, the convergence of the learning algorithms, the ability to generalize, etc.

The parity function is one of the most used functions for testing learning algorithms, because of both its simple definition and its great complexity (any change of a single input bit produces a change in the output). The simplest and best known architecture (see Minsky & Papert (1969), McClelland & Rumelhart (1987) and Hertz (1986)) using linear threshold functions has N input bits, N fully connected neurons in a single hidden layer and a single output, which has to be ON whenever an odd number of input bits are ON and OFF otherwise. Being one of the hardest problems, many different architectures have been constructed to compute parity, essentially by adding neurons in the hidden layer in order to reduce the number of local minima where the learning algorithms could get stuck [see Tesauro, 1988].

In this work we construct a family of modular architectures that implement the parity function. Every member of the family can be characterized by the fan-in max (fmax) of the network, i.e., the maximum number of connections that a neuron can receive. This parameter controls the degree of modularity, the number of hidden layers (which grows logarithmically with N and decreases with fmax) and the total number of synapses n_s.

An interesting fact is that n_s diminishes as fmax increases. Besides several computational advantages (see Haykin, 1991, and references therein), modular architectures are of particular interest from the biological point of view. Modularity seems to be an important principle in the architecture of vertebrate nervous systems, while fully connected networks seem to be very unrealistic in nature (Haykin, 1991). The family of networks and its general properties are presented in section II.

Concerning learning in neural networks, one of the most interesting topics is generalization, i.e., the capacity to infer answers to questions that have never been seen by the network. In section III we analyze the generalization ability of the modular networks introduced in section II, first by computing analytically the minimum number of examples needed for full generalization and second by numerical simulations. Both analytical and numerical results show that the generalization ability of these networks is systematically improved by the degree of modularity of the network. We also analyze the influence of the selection of examples on the generalization capacity. Some conclusions and remarks are presented in section IV.

2 A family of architectures that solve the parity problem

The simplest known solution for the N-bit parity function using linear threshold units has one hidden layer with a number of neurons equal to the number of inputs. The network is fully connected and the values of the weights that solve the problem are very simple: all the weights connecting the input to the hidden layer are set to 1, the thresholds of the hidden neurons are the semi-integer numbers ranging from 0.5 to N − 1/2, the connections from the hidden neurons to the output are set alternately to 1 and −1, and the output threshold is set to 0.5. Figure 1 shows this architecture implemented for the parity function with 4 input bits. The network works as follows: when i of the N input neurons are ON, the first i hidden neurons, those with thresholds less than i, are ON; since the synapses connecting the hidden neurons to the output have alternating weights 1 and −1, only when an odd number of input neurons are ON does the sum of the weights arriving at the output exceed its threshold, turning the output neuron ON; when an even number of input neurons are ON, the sum of the weights reaching the output is 0 and consequently the output neuron remains inactive.
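As a concrete illustration (not part of the original text), the following Python sketch builds this single-hidden-layer threshold network and checks exhaustively that it computes parity for 4 input bits; the function name and the use of Python are our own choices.

```python
import itertools

def simplest_parity(bits):
    """One-hidden-layer parity network with linear threshold units:
    all input-to-hidden weights are 1, the hidden thresholds are
    0.5, 1.5, ..., N - 0.5, the hidden-to-output weights alternate
    +1, -1, +1, ..., and the output threshold is 0.5."""
    n = len(bits)
    s = sum(bits)                                   # all input weights equal 1
    hidden = [1 if s > (i + 0.5) else 0 for i in range(n)]
    out = sum(((-1) ** i) * h for i, h in enumerate(hidden))
    return 1 if out > 0.5 else 0

# Exhaustive check for N = 4: the output is ON iff an odd number of bits are ON.
for x in itertools.product((0, 1), repeat=4):
    assert simplest_parity(x) == sum(x) % 2
```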

We now construct a family of architectures solving the parity function whose structure depends on the allowed fmax. The lower fmax is, the more hidden layers the structure has; the construction basically couples together many copies of the simplest solution mentioned above. For example, for the case of 4 input bits and an fmax of 2 the construction leads to a structure with 3 hidden layers, with 4 neurons in the first hidden layer and 2 neurons in each of the two remaining hidden layers, as shown in figure 2. The network first computes the parity of pairs of input bits, using the simplest architecture for the 2-bit case, and then computes the parity of the results in the same way, until a single output is obtained.

For the general case of N input bits and an fmax of m, the structure is similar to the example mentioned above. In the first two hidden layers we compute the parity of N/m groups of m neurons; in the following two hidden layers we compute the parity of the results obtained before, again in groups of m neurons, and we continue in the same way until we arrive at a single output neuron. The resulting network has a modular architecture based on the simplest structure, with the following features:

$$\text{number of neurons} = 1 + 2m\,\frac{N-1}{m-1} \qquad (1)$$

$$\text{number of synapses} = m(m+1)\,\frac{N-1}{m-1} \qquad (2)$$

From eqs. (1)-(2) we see that for large networks the number of neurons increases linearly with N and is almost independent of m, and that the number of synapses increases linearly as a function of both N and m.
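To make the recursive construction concrete, here is a minimal Python sketch of our own (reusing the `simplest_parity` helper defined in the previous sketch) that computes parity in groups of at most fmax bits, together with the neuron and synapse counts of eqs. (1)-(2); the example values checked are those of the 4-bit, fmax = 2 network of figure 2.

```python
import itertools

def modular_parity(bits, fmax):
    """Parity via the modular construction: apply the simplest network
    to groups of at most fmax bits, then recurse on the group parities
    until a single output remains."""
    if len(bits) == 1:
        return bits[0]
    groups = [bits[i:i + fmax] for i in range(0, len(bits), fmax)]
    return modular_parity(tuple(simplest_parity(g) for g in groups), fmax)

def n_neurons(N, m):
    """Eq. (1): total number of neurons, inputs included.
    Assumes (N - 1) is divisible by (m - 1), as in the construction."""
    return 1 + 2 * m * (N - 1) // (m - 1)

def n_synapses(N, m):
    """Eq. (2): total number of synapses."""
    return m * (m + 1) * (N - 1) // (m - 1)

# The N = 4, fmax = 2 network of figure 2 has 13 neurons and 18 synapses,
# and the construction computes parity on every input.
assert (n_neurons(4, 2), n_synapses(4, 2)) == (13, 18)
for x in itertools.product((0, 1), repeat=4):
    assert modular_parity(x, 2) == sum(x) % 2
```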

3 Generalization ability

We analyze the generalization ability of the networks constructed above by calculating the minimum number of examples needed to obtain full generalization, an approach used before by Franco and Cannas (1998). We first concentrate on the analysis of the simplest architecture, because it will be one of the keys to understanding the rest of the results. We start by analyzing a particular network with four input bits, four hidden neurons and one output, like the one shown in figure 1. To simplify the procedure we also restrict the synaptic weights to be {1, −1}; later we generalize the results to N input bits and to the use of non-restricted weights.

Consider the network of figure 1, where I_i (i = 1, 2, 3, 4) are the input bits, h_i (i = 1, 2, 3, 4) are the hidden neurons and O is the single output neuron. With the restriction of the weights to {1, −1} there are just two possible internal representations that solve the problem, but to simplify matters we set the thresholds of all the neurons so as to allow a single internal representation. We set the thresholds of the hidden neurons to 0.5, 1.5, 2.5, 3.5 respectively and the output threshold to 0.5. Our procedure is to give the network some selected examples to learn and to analyze the consequences of this learning procedure, supposed successful, in terms of the restrictions on the weights that we obtain. The examples, constructed with all the combinations of {0, 1} in the input bits, are denoted by the values of the input bits followed by the expected output value.

We start the learning procedure with the example [1000:1], observing that learning it implies that the synapse connecting the input neuron I1 to the hidden neuron h1 has to be 1, and also that the synapse between neuron h1 and the output has to be set to 1. We follow the same procedure with the rest of the examples that have one input bit ON, the examples [0100:1], [0010:1], [0001:1], to set to 1 the remaining synapses that connect the input with h1. We continue with the examples that have two input bits ON, noting that just 2 of the 6 possible examples are needed, with the restriction that the selected ones have to use all the input bits; for instance we could take the examples [1100:0] and [0011:0]. The learning of these two examples ensures the setting to 1 of the synapses connecting the input to h2, and also the setting to −1 of the synapse between h2 and the output. The procedure goes on with the examples containing three input bits ON, taking just any two of them, because any such combination ensures the use of all the input bits. These two examples restrict the values of the synapses connecting the input neurons with h3 to 1, and also set to 1 the weight connecting this neuron to the output. Finally we take the single example with 4 input bits ON, [1111:0], which sets the rest of the synapses: the ones connecting the input to h4 are set to 1, and the synapse connecting h4 to the output is set to −1. In this way we see that for the case of 4 input bits the learning of 9 selected examples is enough to obtain full generalization, as all the synapses have been set to their correct values. The generalization of this result to the case of N input bits is straightforward: we consider the complete set of examples in groups formed by the examples having i bits ON and take from every group just int(N/i) of the possible examples, where int(x) denotes x rounded up to the nearest integer; the total number of examples needed to obtain full generalization then results in

$$\sum_{i=1}^{N}\mathrm{int}\!\left(\frac{N}{i}\right) \qquad (3)$$
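The selection procedure described above can be written down explicitly. The following Python sketch is our own illustration (the greedy covering choice is one valid selection among several): for each number i of bits ON it picks int(N/i) examples whose ON bits cover all inputs, and checks that the resulting training-set size matches eq. (3) for N = 4.

```python
from itertools import combinations
from math import ceil

def selected_examples(n):
    """For each number i of bits ON, pick ceil(n/i) examples whose ON bits
    together cover all n input positions, as in the procedure of the text."""
    examples = []
    for i in range(1, n + 1):
        needed = ceil(n / i)
        chosen, covered = [], set()
        pool = list(combinations(range(n), i))
        while len(chosen) < needed:
            # greedily pick the candidate covering the most not-yet-covered bits
            best = max(pool, key=lambda c: len(set(c) - covered))
            chosen.append(best)
            covered |= set(best)
            pool.remove(best)
        for positions in chosen:
            bits = tuple(1 if k in positions else 0 for k in range(n))
            examples.append((bits, sum(bits) % 2))   # input bits, parity label
    return examples

training_set = selected_examples(4)
assert len(training_set) == sum(ceil(4 / i) for i in range(1, 5))   # eq. (3): 9 examples
```

For N = 4 this reproduces the selection enumerated in the text: the four one-bit examples, two two-bit examples such as [1100:0] and [0011:0], two three-bit examples, and [1111:0].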

We now analyze the generalization ability of the family of architectures described in the previous section. We start with a particular example, the case of 9 input bits and an fmax of 3, which results in a network with 3 hidden layers, composed of 9 neurons in the first hidden layer and 3 neurons in each of the two remaining hidden layers (see figure 3 for a picture of the network structure). The structure consists essentially of three coupled networks, each corresponding to the simplest architecture for the 3-input-bit case, with 3 hidden neurons and one output. The three resulting outputs, together with another layer of 3 neurons and an output, form another similar structure. As we have three coupled architectures, we select all the examples corresponding to these architectures, converted to the new problem; for instance the example [010:1] of the simplest architecture now becomes the example [010-000-000:1]. In this way we take the $3\sum_{i=1}^{3}\mathrm{int}(3/i) = 18$ examples that set all the synapses between the input and the first hidden layer, as well as those between this layer and the second hidden layer. The rest of the synapses, corresponding to the last part of the network (another three-bit simplest parity network), are set considering that the input bits of this subnetwork are the outputs of the other three networks. We therefore need another set of $\sum_{i=1}^{3}\mathrm{int}(3/i) = 6$ examples corresponding to the inputs of the second hidden layer; but since all the examples with one bit ON have already been used among the examples selected before, the total number of examples results in

$$3\sum_{i=1}^{3}\mathrm{int}\!\left(\frac{3}{i}\right) + \sum_{i=2}^{3}\mathrm{int}\!\left(\frac{3}{i}\right) = 9 + 4\sum_{i=2}^{3}\mathrm{int}\!\left(\frac{3}{i}\right) = 21 \qquad (4)$$

The generalization of this result to the case of N input bits and an architecture with an fmax of m proceeds along the same lines, leading to a minimum of

$$N + \frac{N-1}{m-1}\sum_{i=2}^{m}\mathrm{int}\!\left(\frac{m}{i}\right)$$

examples in the restricted-weights case. This and other features of the family of networks are summarized in Table 1.

Table 1: Some features of the networks used to compute the N-bit parity function with a fan-in max equal to m.

Number of neurons: $1 + 2m\,\frac{N-1}{m-1}$
Number of synapses: $m(m+1)\,\frac{N-1}{m-1}$
Examples needed for generalization (restricted weights): $N + \frac{N-1}{m-1}\sum_{i=2}^{m}\mathrm{int}\!\left(\frac{m}{i}\right)$
Examples needed for generalization (non-restricted weights): $N + \frac{N-1}{m-1}\,(2^m - m - 1)$
Depth: $2\log_m N$
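As a quick numerical check (our own illustration, with int taken as rounding up, as above), the closed-form expressions of Table 1 can be evaluated directly; for N = 9 and m = 3 the restricted-weights count reproduces the 21 examples of eq. (4), and for the simplest 4-bit network (m = N = 4) it reproduces eq. (3).

```python
from math import ceil

def min_examples_restricted(N, m):
    """Table 1, restricted weights: N + (N-1)/(m-1) * sum_{i=2..m} int(m/i).
    Assumes (N - 1) is divisible by (m - 1), as in the construction."""
    return N + (N - 1) // (m - 1) * sum(ceil(m / i) for i in range(2, m + 1))

def min_examples_unrestricted(N, m):
    """Table 1, non-restricted weights: N + (N-1)/(m-1) * (2**m - m - 1)."""
    return N + (N - 1) // (m - 1) * (2 ** m - m - 1)

assert min_examples_restricted(9, 3) == 21     # eq. (4)
assert min_examples_restricted(4, 4) == 9      # simplest 4-bit network, eq. (3)
```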
