Improvement in the learning process as a function of distribution characteristics of binary data set

Halis Altun, Tankut Yalçınöz, Member, IEEE, and Bekir Sami Tezekici
University of Nigde, Department of Electrical & Electronics Eng., Nigde, Turkey

Abstract— Improvements in neural learning achieved through input data manipulation have been reported in the literature, based on entirely experimental studies. No theoretical background is supplied for these studies, and the neural networks are treated as a "black box" model. In this work, this problem is highlighted and the impact of modified training sets is evaluated in order to establish a theoretical background for the phenomenon. To this end, a number of binary training data sets are employed to show how the learning process depends on the data distribution within the training sets.

Index Terms: Neural Learning, Data encoding

I. INTRODUCTION

The neural network community has long been aware that the representation of the training data set has a great impact on neural learning [1]. Input data encoding has been used in the literature to improve neural network training. One example is training a neural network to solve the two-spiral problem [2]. This problem is difficult to learn and has become a common benchmark for learning algorithms. It has been shown that a neural network can be trained successfully when an input encoding is employed [3]. For this purpose binary coding, Gray coding and temperature coding have been used along with the raw data. The results have shown that temperature coding is the optimal choice for this problem [3]. Despite these attempts, the literature largely lacks theoretical studies on the effect of input data encoding. This paper attempts to shed light on the theoretical background of the phenomenon in the case of a binary representation of the input data.

II. OPTIMAL ENCODING OF INPUT DATA

Theoretical findings for analog training data sets have shown that the optimum input pattern activation level for faster neural learning is around 0.5 when the activation domain is bounded within the range [0, 1] and the condition K/M >> 1 is met, where K is the number of nodes in the input layer and M is the number of nodes in the output layer [4]. Figure 1 illustrates the optimum production of the weight update signal as a function of the hidden layer activation level and the ratio of the number of input and output nodes. This result shows that, for optimum production of the weight update signal, it is necessary to determine the behavior of the activation characteristics of the hidden nodes. Statistical analysis has shown that this requirement may be partly met when the input data distribution is modified [4]. The finding points out that the statistical dependency of the hidden node activation level on the input node activation is approximately linear. As a result, the input data representation can be chosen so as to meet the condition for optimum production of the weight update signal.

The results reported above for continuous data are now applied to binary input data and its encoding. For a binary data set it will be shown that the probability of 1's, P(1), should be greater than the probability of 0's, P(0), for a faster learning process when the number of input nodes is higher than the number of output nodes. These findings give insight into the experimental results reported in [3], where the neural network was unable to learn the two-spiral problem even after 30000 iterations, whereas convergence was obtained within 400 iterations when a temperature coding was employed instead of a binary representation of the input patterns. According to the theoretical findings presented in this paper, this improvement can be explained as a result of the increase in the number of input nodes K and in the probability of 1's, P(1), in the binary data set.
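As a concrete illustration of how the choice of encoding changes both the number of input nodes K and the fraction of 1's in the training set, the short Python sketch below encodes the same integer range with plain binary, Gray and temperature (thermometer) codes and reports K, P(1) and P(0) for each resulting binary training matrix. The 4-bit value range, the 16-level temperature code and the helper names are illustrative assumptions, not taken from the paper; the resulting statistics naturally depend on the distribution of the data being encoded.

```python
import numpy as np

def binary_encode(v, n_bits):
    """Plain binary code of integer v on n_bits bits (MSB first)."""
    return [(v >> b) & 1 for b in reversed(range(n_bits))]

def gray_encode(v, n_bits):
    """Binary-reflected Gray code of integer v on n_bits bits."""
    return binary_encode(v ^ (v >> 1), n_bits)

def temperature_encode(v, n_levels):
    """Temperature (thermometer) code: v leading 1's followed by 0's."""
    return [1 if k < v else 0 for k in range(n_levels)]

values = range(16)  # hypothetical 4-bit input range
codings = {
    "binary":      np.array([binary_encode(v, 4) for v in values]),
    "gray":        np.array([gray_encode(v, 4) for v in values]),
    "temperature": np.array([temperature_encode(v, 16) for v in values]),
}

for name, patterns in codings.items():
    K = patterns.shape[1]   # number of input nodes after encoding
    p1 = patterns.mean()    # probability of 1's, P(1), over the encoded set
    print(f"{name:12s} K = {K:2d}   P(1) = {p1:.2f}   P(0) = {1 - p1:.2f}")
```

The temperature code trades a much larger K for a data-dependent shift in P(1), which is exactly the kind of change in training-set distribution that the analysis above relates to the speed of learning.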

III. THEORETICAL FOUNDATION

In order to investigate the effect of the statistical characteristics of binary input pattern vectors on an MLP neural network, let a neural network with a single hidden layer have a vector space S. Let xi, xh and xo be activation vectors in this space, which represent the node activation levels of the layers. Assume that the input activation vectors xi have dimension K, the hidden activation vectors xh have dimension L and the output activation vectors xo have dimension M:

$x_i(s) = [x_1(s), x_2(s), \ldots, x_K(s)]^T$   (1)

$x_h(s) = [x_1(s), x_2(s), \ldots, x_L(s)]^T$   (2)

$x_o(s) = [x_1(s), x_2(s), \ldots, x_M(s)]^T$   (3)

where s indexes the training patterns. The weighted connections between the input-hidden and hidden-output layers are wih and who. Training continues until wih and who are optimized so that a predefined error threshold between xo and the target to is met for every pattern:

$x_o(s) = t_o(s) \pm e, \qquad s = 0, 1, \ldots, n$   (4)

where e is the predefined error tolerance and n is the number of patterns. For the sake of clarity, the input, hidden and output node activations, namely xi, xh and xo, are referred to as "activation levels" rather than as elements of the activation vectors xi, xh and xo. Using this notation, each interconnection between the nodes is adjusted by the weight update value as follows:

$\Delta w_{ho} = -\eta\, x_o' \,\Delta x_o\, x_h$   (5)

$\Delta w_{ih} = -\eta\, x_h' \sum_{o}^{M} x_o' w_{ho}\, \Delta x_o\, x_i$   (6)

$\delta_o = x_o (1 - x_o)(t_o - x_o)$   (7)

$\delta_h = x_h (1 - x_h) \sum_{o}^{M} \delta_o w_{ho}$   (8)

$x_o = f_{sig}\left( \sum_{h} x_h w_{ho} \right)$   (9)

$x_h = f_{sig}\left( \sum_{i} x_i w_{ih} \right)$   (10)

where
$\Delta x_o = (t_o - x_o)$
fsig( ) : sigmoid activation function
δ : delta error term
η : learning rate
i, h, o : input, hidden and output layer indices
K, L, M : number of input, hidden and output nodes, respectively
to : target value
xo : output activation level
x'o : derivative of the output activation level
xh : hidden layer activation level
x'h : derivative of the hidden layer activation level
xi : input activation level
who : hidden-output layer weight
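To make the role of Eqs. (5)-(10) explicit, the following NumPy sketch trains a single-hidden-layer MLP pattern by pattern: the forward pass follows Eqs. (9)-(10), the delta error terms follow Eqs. (7)-(8), and training stops when the criterion of Eq. (4) is satisfied. The layer sizes, the toy binary training set, the learning rate and the tolerance are illustrative assumptions, and the weight updates are written in the familiar gradient-descent form Δw = η·δ·(activation), which matches Eqs. (5)-(6) up to the sign convention used there.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_sig(z):
    """Sigmoid activation function f_sig of Eqs. (9)-(10)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer sizes: K inputs, L hidden nodes, M outputs.
K, L, M = 8, 6, 2
w_ih = rng.uniform(-0.5, 0.5, size=(K, L))   # input-hidden weights
w_ho = rng.uniform(-0.5, 0.5, size=(L, M))   # hidden-output weights

# Toy binary training patterns and targets, purely illustrative;
# convergence to the tolerance below is not guaranteed for random data.
n = 16
X = rng.integers(0, 2, size=(n, K)).astype(float)
T = rng.integers(0, 2, size=(n, M)).astype(float)

eta, e_tol = 0.5, 0.1   # learning rate eta and error tolerance e of Eq. (4)

for epoch in range(10000):
    max_err = 0.0
    for s in range(n):
        x_i, t_o = X[s], T[s]
        x_h = f_sig(x_i @ w_ih)                         # Eq. (10)
        x_o = f_sig(x_h @ w_ho)                         # Eq. (9)
        delta_o = x_o * (1.0 - x_o) * (t_o - x_o)       # Eq. (7)
        delta_h = x_h * (1.0 - x_h) * (w_ho @ delta_o)  # Eq. (8)
        # Weight updates corresponding to Eqs. (5)-(6), gradient-descent sign.
        w_ho += eta * np.outer(x_h, delta_o)
        w_ih += eta * np.outer(x_i, delta_h)
        max_err = max(max_err, float(np.max(np.abs(t_o - x_o))))
    if max_err <= e_tol:   # stopping criterion of Eq. (4)
        print(f"converged after {epoch + 1} epochs")
        break
```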

[Figure 1. Optimum activation level for hidden layer node as a function of rnode. Horizontal axis: hidden node activation level xh (0 to 1.0); vertical axis: strength of the weight update signal ∆wih + ∆who; curves shown for rnode = 8/2, 16/1 and 1/4.]
