A NEW APPROACH TO CLASSICAL BACKPROPAGATION ALGORITHM FOR NEURO-FUZZY-GA SYSTEMS LEARNING

L. M. BRASIL1*, F. M. DE AZEVEDO*, J. M. BARRETO**, M. NOIRHOMME-FRAITURE***

* Dept. of Electrical Engineering, ** Dept. of Informatics and Statistics, Federal University of Santa Catarina, University Campus, Florianópolis, Brazil, 88040-900
*** Institut d'Informatique, FUNDP, Namur, Belgium

[email protected], [email protected], [email protected], [email protected]

ABSTRACT

The goal of this paper is to present a new learning algorithm for feedforward neural networks that also addresses the complexity problem of optimizing the topology of this type of network. We use this approach not only during the learning phase of the network, but also to optimize the number of hidden neurons. The learning algorithm is inspired by the classical backpropagation algorithm, with some variations due to the kind of network used. The algorithm developed was applied to a particular network built from AND/OR fuzzy neurons.

Keywords: Fuzzy Neural Networks, Genetic Algorithm, Complexity.

INTRODUCTION

Learning is often formulated as an optimization problem in machine learning. For example, the backpropagation algorithm, one of the most popular learning algorithms, is often used to train feedforward Artificial Neural Networks (ANN) [1]. Backpropagation is basically a gradient-based optimization used to minimize an error function, e.g., the mean square error of an ANN. The learning algorithm presented here is a typical optimization problem in numerical analysis.

Many improvements to ANN learning algorithms are carried out through optimization algorithms [2]-[8]. Nevertheless, most of the research in this direction has dealt with the complexity of optimizing the weights of the neuron connections of an ANN. In this paper, in addition to the choice of the learning process of the ANN, we are also concerned with the complexity problem of optimizing the topology of the ANN. Depending on the problem, we may know the number of neurons in the first and last layers of an ANN, but not in the hidden one. The approach developed in this work provides a solution in this sense [2]-[7].

NEURAL MODEL OF A FUZZY ANN

The mathematical model of the neuron is given by [8]. Mathematically, the neural nonlinear mapping function N can be divided into two parts: a confluence function and a nonlinear activation operation [2]-[8]. The confluence function ⊗ is the name given to a general function having the synaptic weights and inputs as arguments; a particular case widely used is the inner product. Let X(t) be the n-dimensional input vector of the ANN, or the output of the neurons exciting the neuron considered:


X(t) = [x_1(t), x_2(t), \ldots, x_i(t), \ldots, x_n(t)]^T \in \Re^n    (1)

where y(t) = o(t) is the scalar output of each neuron, o(t) \in \Re^1, and N is the nonlinear mapping function, N: X \to O, X(t) \mapsto o(t), with

X: Z^+ \to \Re^n, \quad O: Z^+ \to \Re^1    (2)

o(t) = N[X(t) \in \Re^n] \in \Re^1    (3)

This mapping yields a scalar output u(t), which is a measure of the similarity between the neural input vector X(t) and the knowledge stored in the synaptic weight vector W(t). So W(t) and u(t) are given by

W(t) = [w_0(t), w_1(t), \ldots, w_i(t), \ldots, w_n(t)]^T \in \Re^{n+1}, \quad u(t) \in \Re^1    (4)

Redefining X(t) to include the bias x_0(t), we have

u(t) = X(t) \otimes W(t)    (5)

The nonlinear activation function Ψ[.] maps the confluence value u(t) ∈ (-∞, ∞), occasionally called the activation value, to a bounded neural output. That is, the nonlinear activation operator transforms the signal u(t) into a bounded neural output o(t):

o(t) = \Psi[u(t)]    (6)

o(t) = \Psi[X(t) \otimes W(t)] \in \Re^1    (7)

Applying equations (1), (4), (5), and (7) to a multilayer ANN (e.g., three layers), we have

O(t) = N_3[N_2[N_1[X(t) \in \Re^n]]] \in \Re^m    (8)

1 Currently at the Institut d'Informatique, FUNDP, Namur, Belgium. This work was sponsored by CAPES (Coordination for the Improvement of Higher Education Personnel), Brazil.

O(t) = \Psi_3[W_3(t) \otimes \Psi_2[W_2(t) \otimes \Psi_1[W_1(t) \otimes X(t)]]]    (9)

where \Psi_i[.] is the nonlinear activation operator, \otimes is the confluence operator, and W_1(t), W_2(t), and W_3(t) are the synaptic weight vectors of the input, hidden, and output layers, respectively.

If we express the neural input signals in terms of their membership functions, each over the interval [0,1], rather than in their absolute amplitudes, we can perform mathematical operations on these signals using logical operations such as AND/OR [8]. Let us express the inputs x_1 and x_2 over [0,1]. We define the generalized AND (T-norm) as a mapping function T and the generalized OR (T-conorm) as a mapping function S:

o_1 = [x_1 \text{ AND } x_2] = T[x_1, x_2]    (10)

o_2 = [x_1 \text{ OR } x_2] = S[x_1, x_2]    (11)
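To make equations (10)-(11) and the OR/AND neurons defined below in (12)-(14) concrete, here is a minimal Python sketch (ours, not part of GENBACK) that takes max as the S-conorm, min as the T-norm, and the algebraic product where (13) requires it; all names and values are illustrative.

```python
# Minimal sketch of fuzzy AND/OR neurons (equations (10)-(14) below), assuming
# max as the S-conorm and min as the T-norm; names are illustrative only.

def s_conorm(a, b):          # generalized OR
    return max(a, b)

def t_norm(a, b):            # generalized AND
    return min(a, b)

def or_neuron_confluence(w, x):
    # u(t) = S_i [ w_i T x_i ], all values assumed in [0, 1]   -- cf. (12)
    u = 0.0
    for wi, xi in zip(w, x):
        u = s_conorm(u, t_norm(wi, xi))
    return u

def and_neuron_confluence(w, x):
    # u(t) = T_i [ w_i * x_i ]  (algebraic product inside)     -- cf. (13)
    u = 1.0
    for wi, xi in zip(w, x):
        u = t_norm(u, wi * xi)
    return u

if __name__ == "__main__":
    w = [0.9, 0.4, 0.7]
    x = [0.2, 0.8, 0.5]
    print(or_neuron_confluence(w, x))   # 0.5
    print(and_neuron_confluence(w, x))  # 0.18 -> then o(t) = Psi(u) as in (14)
```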

For OR neurons, replacing in (5) the ⊗-operation by the T-operation and the ∑-operation by the S-operation gives

u(t) = S_{i=0}^{n} [w_i(t) \, T \, x_i(t)] \in [0,1]    (12)

For AND neurons, replacing in (5) the ⊗-operation by the algebraic product and the ∑-operation by the T-operation gives

u(t) = T_{i=0}^{n} [w_i(t) * x_i(t)] \in [0,1]    (13)

o(t) = \Psi[u(t)] \in [0,1]    (14)

LEARNING ALGORITHM

The learning algorithm proposed, called the Genetic-Backpropagation Based Learning Algorithm (GENBACK), modifies not only the connection weights but also the ANN structure, by generating and/or eliminating connections. In this case, it is suggested that the optimization of the hidden layer can be accomplished using a Genetic Algorithm (GA) [2]-[9]. In the first stage of GENBACK it is assumed that an ANN already exists: the numbers of neurons in the input and output layers are determined by a given set of patterns, the ANN structure defines whether the neurons are fuzzy or crisp, and the number of neurons in the hidden layer can be set by some heuristic, as suggested in [10]. In the next stage, GENBACK uses the set of examples to modify not only the weight of each connection but also the ANN structure. In this last case, connections are included or excluded among neurons, and neurons are included in or excluded from the hidden layer by the application of the GA [2]-[9]. GAs are based on the work of Holland [9], inspired by the evolution of a population subject to reproduction, mutation, and crossover in a selective environment. We chose a GA to optimize the size of the hidden layer and to determine which weights should be set to zero.

Because of the difficulty of analyzing min and/or max operations, the training of fuzzy ANN, especially min-max ANN, appears not to be approachable rigorously and systematically. Therefore, in practice, one tends to choose bounded addition and multiplication to replace the min and max operations just to bypass the difficulty. Despite the fact that the modified ANN is readily trainable owing to its analytical nature, it is functionally very different from the original one. The lack of an appropriate analytical tool for the min and max operations greatly limits their applicability.

In [5]-[7][11], the authors have made another attempt at developing a rigorous theory for the differentiation of min-max functions by means of functional analysis, and derived the delta rule for training min-max ANN based on the differentiation theory of min-max functions. Before stating the mathematical development for this stage, let us define, based on [7], the function lor(.) on the real number field \Re as

lor(x) = \begin{cases} 1 & \text{if } x > 0 \\ 1/2 & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}

According to one of the theorems given in [7], suppose that f(x), g(x), h_1(x) = f(x) S g(x), and h_2(x) = f(x) T g(x) are real functions. If they are all differentiable at a point x, then

dh_1(x)/dx = d[f(x) \, S \, g(x)]/dx = lor[f(x) - g(x)] \, df(x)/dx + lor[g(x) - f(x)] \, dg(x)/dx    (15)

dh_2(x)/dx = d[f(x) \, T \, g(x)]/dx = lor[g(x) - f(x)] \, df(x)/dx + lor[f(x) - g(x)] \, dg(x)/dx    (16)
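The following sketch (our illustration, not code from [7] or [11]) implements lor(.) and the differentiation rules (15)-(16) for the particular choice S = max and T = min, and checks one of them against a finite-difference estimate away from the non-smooth points.

```python
# Sketch of lor(.) and the min-max differentiation rules (15)-(16),
# taking S = max and T = min; written only to illustrate the theory.

def lor(x):
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return 0.5

def d_max(f, g, df, dg):
    # (15): d[f S g]/dx = lor(f - g) f' + lor(g - f) g'
    return lor(f - g) * df + lor(g - f) * dg

def d_min(f, g, df, dg):
    # (16): d[f T g]/dx = lor(g - f) f' + lor(f - g) g'
    return lor(g - f) * df + lor(f - g) * dg

if __name__ == "__main__":
    # f(x) = x**2, g(x) = 1 - x at x = 0.3  ->  f < g: min follows f, max follows g
    x = 0.3
    f, g = x**2, 1.0 - x
    df, dg = 2*x, -1.0
    print(d_max(f, g, df, dg))   # -1.0  (max = g branch)
    print(d_min(f, g, df, dg))   #  0.6  (min = f branch)

    # finite-difference check for the min
    h = 1e-6
    fd = (min((x+h)**2, 1-(x+h)) - min((x-h)**2, 1-(x-h))) / (2*h)
    print(abs(fd - d_min(f, g, df, dg)) < 1e-4)   # True
```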

Suppose we have a classical ANN with one input layer, one output layer, N hidden layers, and the activation function \Psi(.). Let w_{jin} denote the weight between node i in layer n and node j in layer n+1, let T_s and O_s denote the desired output and the actual output of node s in the output layer N+1, and let U_{in} and O_{in} denote the input activation and the output of node i in layer n (n = 0, 1, \ldots, N+1). We then have the following formulas linking U_{in} and O_{in}:

U_{in} = \sum_{k} w_{ki(n-1)} O_{k(n-1)}    (17)

O_{in} = \Psi(U_{in})    (18)

where k goes through all nodes in layer n-1 (n = 1, 2, \ldots, N+1). We also define the cost function J as

J = \frac{1}{2} \sum_{p=1}^{P} \sum_{s} (T_s - O_s)^2    (19)

where s goes through all the nodes in the output layer N+1, and P is the number of data samples. Suppose P = 1; then, by differentiating the cost function J, we arrive at the delta rule learning algorithm for the ANN as follows:

\delta_{(N+1)s} = (T_s - O_s) \Psi'[U_{(N+1)s}]    (20)

\delta_{in} = \Psi'(U_{in}) \sum_{j} w_{jin} \delta_{j(n+1)}    (21)

Then the delta rule is readily derived as

\Delta w_{jin} = -\eta \, \partial J / \partial w_{jin} = \eta \, \delta_{j(n+1)} O_{in}    (22)
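For concreteness, here is a small sketch of one delta-rule update (17)-(22) on a toy 2-3-1 network with tanh activation; the network sizes, data, and learning rate are arbitrary choices of ours, assuming NumPy is available.

```python
import numpy as np

# One delta-rule update (equations (17)-(22)) on a tiny 2-3-1 network
# with tanh activation; sizes, data and learning rate are arbitrary.

rng = np.random.default_rng(0)
x = np.array([0.2, 0.8])          # input pattern
t = np.array([0.5])               # desired output T_s
W1 = rng.normal(0, 0.5, (3, 2))   # input -> hidden weights
W2 = rng.normal(0, 0.5, (1, 3))   # hidden -> output weights
eta = 0.1

# forward pass: U_in = sum_k w_ki O_k,  O_in = Psi(U_in)        (17)-(18)
U1 = W1 @ x;  O1 = np.tanh(U1)
U2 = W2 @ O1; O2 = np.tanh(U2)

# backward pass: deltas and weight increments                   (20)-(22)
d2 = (t - O2) * (1.0 - O2**2)             # delta at output layer
d1 = (1.0 - O1**2) * (W2.T @ d2)          # delta at hidden layer
W2 += eta * np.outer(d2, O1)              # Delta w = eta * delta * O
W1 += eta * np.outer(d1, x)

print(0.5 * np.sum((t - O2)**2))          # cost J for this pattern (19)
```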

Now our task is to train a min-max ANN (or fuzzy ANN) so that it fits a given set of desired ANN input and output pairs to a given precision. For this purpose, we use the conventional idea of gradient descent to design a delta rule to minimize J. For a given p, according to the theorems proved in [7], all partial derivatives of J with respect to w_{jin} exist almost everywhere in \Re, and they have the following representations [2]-[7]:

Forward Pass:

Input Layer/Hidden Layer

U_{in} = T_{l} [w_{il(n-1)} * O_{l(n-1)}] \in [-1,1]    (23)

Hidden Layer/Output Layer

U_{j(n+1)} = S_{i} [w_{jin} \, T \, O_{in}] \in [-1,1]    (24)

The neural output is defined as

O_{j(n+1)} = \Psi[U_{j(n+1)}] \in [-1,1]    (25)

\Psi[U_{j(n+1)}] = \tanh[U_{j(n+1)}]    (26)
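A minimal Python sketch of the forward pass (23)-(26) follows, again with min as the T-norm and max as the S-conorm; the toy dimensions and array names are ours.

```python
import numpy as np

# Forward pass of a min-max (fuzzy) ANN, equations (23)-(26):
# hidden AND neurons take a T (min) over weight*input products,
# output OR neurons take an S (max) over T(w, O) terms, then tanh.

def and_layer(W, o_prev):
    # U_in = T_l [ w_il * O_l ]                                  (23)
    return np.min(W * o_prev, axis=1)

def or_layer(W, o_prev):
    # U_j(n+1) = S_i [ w_ji T O_i ]                              (24)
    return np.max(np.minimum(W, o_prev), axis=1)

def forward(x, W_hidden, W_out):
    U_hidden = and_layer(W_hidden, x)
    O_hidden = np.tanh(U_hidden)                               # (25)-(26)
    U_out = or_layer(W_out, O_hidden)
    return np.tanh(U_out)

if __name__ == "__main__":
    x = np.array([0.3, -0.6, 0.9])
    W_hidden = np.array([[0.5, 0.2, 0.8],
                         [0.7, 0.9, 0.1]])   # 2 hidden AND neurons
    W_out = np.array([[0.4, 0.6]])           # 1 output OR neuron
    print(forward(x, W_hidden, W_out))
```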

Backward Pass:

The learning rules to modify w_{jin} may be developed as follows, for a specific case: a three-layer network in which the hidden layer has AND neurons and the output layer has OR neurons, with a bipolar sigmoidal activation function, and where β is the parameter called momentum.

Hidden Layer/Output Layer

\Psi[U_{j(n+1)}] = \tanh[U_{j(n+1)}]    (27)

\Psi'[U_{j(n+1)}] = 4 \exp(-2U_{j(n+1)}) / [1 + \exp(-2U_{j(n+1)})]^2    (28)

\Delta w_{ji(n+1)} = -\eta \, \partial J / \partial w_{ji(n+1)}    (29)

\partial J / \partial w_{ji(n+1)} = (\partial J / \partial U_{j(n+1)}) \cdot (\partial U_{j(n+1)} / \partial w_{ji(n+1)})    (30)

\partial U_{j(n+1)} / \partial w_{ji(n+1)} = lor[(w_{ji(n+1)} \, T \, O_{in}) - S_{i' \neq i}(w_{ji'(n+1)} \, T \, O_{i'n})] \cdot lor(O_{in} - w_{ji(n+1)})    (31)

\delta_{j(n+1)} = (T_j - O_j) \cdot \Psi'[U_{j(n+1)}]    (32)

\partial J / \partial U_{j(n+1)} = -\delta_{j(n+1)}    (33)

\Delta w_{ji(n+1)} = \delta_{j(n+1)} \cdot \partial U_{j(n+1)} / \partial w_{ji(n+1)}    (34)

w_{ji}(n+1) = w_{ji}(n) + \Delta w_{ji}(n+1)    (35)

w_{ji}(new) = w_{ji}(old) + \eta (-\partial J / \partial w_{ji}) + \beta \Delta w_{ji}(old)    (36)
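The output-layer update (27)-(36) could be coded as in the following sketch; this is our reading of the equations, with illustrative names and without the full GENBACK machinery.

```python
import numpy as np

# Sketch of the output-layer (OR neuron) update, equations (27)-(36).
# W_out[j, i] connects hidden AND neuron i to output OR neuron j.
# lor(.) is the function defined above; names and sizes are illustrative.

def lor(x):
    return 1.0 if x > 0 else (0.0 if x < 0 else 0.5)

def update_output_layer(W_out, O_hidden, T_target, eta=0.1, beta=0.7, dW_old=None):
    terms = np.minimum(W_out, O_hidden)                 # w_ji T O_i  (T = min)
    U = terms.max(axis=1)                               # (24)
    O = np.tanh(U)                                      # (25)-(27)
    dPsi = 4*np.exp(-2*U) / (1 + np.exp(-2*U))**2       # (28)
    delta = (T_target - O) * dPsi                       # (32)

    dW = np.zeros_like(W_out)
    for j in range(W_out.shape[0]):
        for i in range(W_out.shape[1]):
            rivals = np.delete(terms[j], i)             # S over i' != i
            s_rest = rivals.max() if rivals.size else -np.inf
            dU_dw = lor(terms[j, i] - s_rest) * lor(O_hidden[i] - W_out[j, i])  # (31)
            dW[j, i] = eta * delta[j] * dU_dw           # (29)-(30), (33)-(34)

    if dW_old is not None:
        dW += beta * dW_old                             # momentum term of (36)
    return W_out + dW, dW                               # (35)-(36)
```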

Input Layer/Hidden Layer

\Psi[U_{in}] = \tanh[U_{in}]    (37)

\Psi'[U_{in}] = 4 \exp(-2U_{in}) / [1 + \exp(-2U_{in})]^2    (38)

\Delta w_{il(n-1)} = -\eta \, \partial J / \partial w_{il(n-1)}    (39)

\partial J / \partial w_{il(n-1)} = (\partial J / \partial U_{in}) \cdot (\partial U_{in} / \partial w_{il(n-1)})    (40)

\delta_{in} = \Psi'[U_{in}] \sum_{j} \delta_{j(n+1)} \, \partial U_{j(n+1)} / \partial O_{in}    (41)

\partial U_{in} / \partial w_{il(n-1)} = lor[T_{l' \neq l}(w_{il'(n-1)} * O_{l'(n-1)}) - (w_{il(n-1)} * O_{l(n-1)})] \cdot O_{l(n-1)}    (42)

\partial J / \partial U_{in} = -\delta_{in}    (43)

\Delta w_{il(n-1)} = \delta_{in} \cdot \partial U_{in} / \partial w_{il(n-1)}    (44)

w_{il}(n) = w_{il}(n-1) + \Delta w_{il}(n)    (45)

w_{il}(new) = w_{il}(old) + \eta (-\partial J / \partial w_{il}) + \beta \Delta w_{il}(old)    (46)

SIMULATIONS

We provide an example to show that the delta rule developed in this work is effective in training min-max ANN. The example is the classification of epileptic crises, with data obtained at the University Hospital of the Federal University of Santa Catarina, Brazil. We classified about 77 symptoms and 4 diagnostic classes. Consider a min-max NNES with three layers, i.e., an input layer, an output layer, and a hidden layer. Since the range of values of the min-max NNES is constrained to [-1,1], the activation function of each neuron is chosen as the hyperbolic tangent. We describe some simulations for this example. We used: input layer = 6 neurons, output layer = 3 neurons, hidden layer = 6 neurons, Generation Number = 10, Initial Population (P) = 8, Gaussian distribution, Crossover Rate (C) = 0.6, Mutation Rate (M) = 0.1, Learning Rate = 0.1, Momentum = 0.7, Tolerance = 0.1, Maximum Epochs (ME) = 5, and Total Epochs (TE) = 50. With this training data, 5 neurons were obtained in the hidden layer. As the number of most important data items in the set of patterns was 4, this simulation shows that a basic ANN with 6 neurons in the hidden layer can be minimized to a number close to the number of most important data items obtained from the knowledge-extraction process. Taking the same data as above but with ME = 50 and TE = 500, 5 neurons were again obtained in the hidden layer; nevertheless, the quality of the ANN improved. In this case, the value of the Relative Variance Coefficient (RVC), i.e., RVC = Standard Deviation/Average Value, decreased, while the fitness increased. In another simulation, we kept almost all the variables used in the previous ones but set C = 0.8; the number of neurons in the hidden layer was optimized to 4. We observed that the standard deviation curve reached higher values than in the previous simulations, so the dispersion of the fitness variable around its average (the average fitness) increased. A large number of crossover and mutation events occurred in some generations.
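The GA setup just described could be collected as in the sketch below; the parameter values follow the simulation above, but evaluate_fitness, the chromosome encoding, and the selection scheme are hypothetical placeholders rather than GENBACK's actual implementation.

```python
import random

# Hypothetical sketch of the GA setup reported above (population, crossover
# and mutation rates, etc.); evaluate_fitness and the chromosome encoding of
# the hidden layer are placeholders, not the actual GENBACK implementation.

GA_PARAMS = {
    "generations": 10, "population": 8,
    "crossover_rate": 0.6, "mutation_rate": 0.1,
    "learning_rate": 0.1, "momentum": 0.7,
    "tolerance": 0.1, "max_epochs": 5, "total_epochs": 50,
}

def evaluate_fitness(hidden_size):
    # placeholder: train the min-max ANN with this hidden size and
    # return a score (e.g., based on the training error)
    return 1.0 / (1 + abs(hidden_size - 4))

def run_ga(params, max_hidden=6):
    pop = [random.randint(2, max_hidden) for _ in range(params["population"])]
    for _ in range(params["generations"]):
        scored = sorted(pop, key=evaluate_fitness, reverse=True)
        parents = scored[: len(pop) // 2]
        children = []
        while len(children) < len(pop):
            a, b = random.sample(parents, 2)
            child = a if random.random() > params["crossover_rate"] else (a + b) // 2
            if random.random() < params["mutation_rate"]:
                child = random.randint(2, max_hidden)      # keep at least 2 neurons
            children.append(child)
        pop = children
    return max(pop, key=evaluate_fitness)

print(run_ga(GA_PARAMS))   # best hidden-layer size found
```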

Nevertheless, we made other simulations in which we changed, for example, M = 0.2, keeping the values of all other variables unaltered. The final ANN presented 4 neurons in the hidden layer, and when we changed only P = 30 the NNES continued to have 4 neurons in the hidden layer. We observed that a greater diversity in the chromosomes of the initial population was brought about both by the change of the P value and by the change of the M value.

Moreover, another simulation was carried out with the following values: input layer = 39 neurons, output layer = 4 neurons, hidden layer = 20 and 77 neurons. We observed in these last simulations that the proposed connectionist architecture reached a good performance.

CONCLUSIONS

The GENBACK training algorithm for the fuzzy ANN was inspired by the classical backpropagation algorithm, with some alterations: the optimization of the hidden layer is supported by a GA, the logical operators AND/OR take the place of the weighted sum, and the ANN is formed by fuzzy logic. Besides, it was observed that in the backward pass the error propagated among the layers reached the expected values.

In this work one of the limitations of a feedforward multilayer ANN trained with a backpropagation-based algorithm was also reached: it requires an activation function that is nonlinear and differentiable. In the development of this work we used the hyperbolic tangent function. However, when we carried out the necessary operations in the backward pass there was a problem: the ANN is AND/OR, and these functions are not differentiable a priori. Nevertheless, as shown in [11], this difficulty was overcome using the function lor(x). Another result concerns the optimization of the topology adopted for the fuzzy ANN. The optimization of the hidden layer was supported by a GA. The solution adopted for the maximization of the hidden layer considered the following points: the values of the crossover and mutation rates were chosen empirically, e.g., 0.6 for the crossover operator and 0.1 for the mutation operator; using these values, the system performed well. For the minimization of the hidden layer the following points were considered: given a set of examples, we can determine which values are more predominant than others, and the minimum number of chromosomes generated by the selection process and the genetic operators should be decreased to at most the number of important data items related to the set of examples. This case was shown in the last section. However, in some cases the number of neurons in the hidden layer reached values below those expected; that is, we expected that the winner ANN would not have fewer neurons than the number of important data items related to the set of examples. We believe this happened because of the mutation process: the mutation operator must have provoked an irregular perturbation in the chromosome chain. We know that for an ANN to accomplish a given pattern-classification task the number of neurons in the hidden layer must be at least two. We therefore chose, for this case, that if there is a perturbation of the chromosome chain, the value generated must be chosen randomly, but it must respect this lower bound, while the upper bound must respect the Kolmogorov theorem; that is, this value is taken to be, for example, twice the number of neurons in the input layer plus one.

We also observed the generalization ability of the connectionist paradigm. When the ANN has fewer neurons in the hidden layer than the number of important data items related to the set of examples, the system minimizes the number of neurons in the hidden layer and loses a little generalization ability; otherwise, the system generalizes. We also observed that when the learning rate value is increased, the generalization ability of the ANN decreases.

In conclusion, after the optimization of the ANN its refinement is accomplished. Once we have obtained the winner ANN, we have both the optimized number of neurons in the hidden layer and the trained ANN. We also present other sets of tests for the ANN whose objective is to analyse the ANN refinement; we observed that the majority of the test patterns presented to the ANN were recognized. In addition, we expect that we can minimize the complexity problem with respect to this learning algorithm to optimize the topology of the ANN.
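A minimal sketch of the repair rule just described for a mutated hidden-layer size (our illustration; the bounds follow the text above, and the function name is hypothetical):

```python
import random

# Repair rule for a mutated hidden-layer size, as described above:
# at least 2 neurons, at most 2 * n_inputs + 1 (Kolmogorov-style bound).
def repair_hidden_size(mutated_size, n_inputs):
    lower, upper = 2, 2 * n_inputs + 1
    if lower <= mutated_size <= upper:
        return mutated_size
    return random.randint(lower, upper)   # redraw randomly within the bounds

print(repair_hidden_size(1, n_inputs=6))    # redrawn into [2, 13]
print(repair_hidden_size(5, n_inputs=6))    # 5 (already valid)
```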

REFERENCES

[1] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning internal representations by error propagation, in D.E. Rumelhart, J.L. McClelland, and the PDP group, editors, Parallel Distributed Processing, Cambridge, MA: MIT Press, 1(2), 1987, 319-362.
[2] L.M. Brasil, F.M. de Azevedo, R. Garcia Ojeda, and J.M. Barreto, A methodology for implementing hybrid expert systems, in Proc. IEEE Mediterranean Electrotechnical Conference, Bari, Italy, 1996, 661-664.
[3] L.M. Brasil, F.M. de Azevedo, and J.M. Barreto, A hybrid expert architecture for medical diagnosis, in Proc. 3rd International Conference on Artificial Neural Networks and Genetic Algorithms, Norwich, England, 1997, 176-180.
[4] L.M. Brasil, F.M. de Azevedo, and J.M. Barreto, Uma arquitetura híbrida para sistemas especialistas, in Proc. III Congresso Brasileiro de Redes Neurais, Florianópolis, Brasil, UFSC, 1997, 167-172.
[5] L.M. Brasil, F.M. de Azevedo, and J.M. Barreto, Uma arquitetura para sistema neuro-fuzzy-ga, in Proc. XII Congreso Chileno de Ingeniería Eléctrica, Universidad de La Frontera, Temuco, Chile, 1997, 712-717.
[6] L.M. Brasil, F.M. de Azevedo, and J.M. Barreto, Learning algorithm for connectionist systems, in Proc. XII Congreso Chileno de Ingeniería Eléctrica, Universidad de La Frontera, Temuco, Chile, 1997, 697-702.
[7] L.M. Brasil, F.M. de Azevedo, J.M. Barreto, and M. Noirhomme, Training algorithm for neuro-fuzzy-ga systems, in Proc. 16th IASTED International Conference on Applied Informatics, AI'98, Garmisch-Partenkirchen, Germany, 1998, (In Press).
[8] M.M. Gupta and D.H. Rao, On the principles of fuzzy neural networks, Fuzzy Sets and Systems, 61(1), 1994, 1-18.
[9] J.H. Holland, Adaptation in Natural and Artificial Systems, Cambridge, MA: MIT Press, 1975.
[10] R.C. Eberhart and R.W. Dobbins, Neural Network PC Tools - A Practical Guide, Laurel, Maryland: Academic Press, 1990.
[11] X. Zhang, C. Hang, S. Tan, and P.Z. Wang, The min-max function differentiation and training of fuzzy neural networks, IEEE Transactions on Neural Networks, 7(5), 1996, 1139-1150.
