Optimization of Multi-layer Artificial Neural Networks Using Delta Values of Hidden Layers

N. M. Wagarachchi
Dept. of Computational Mathematics, University of Moratuwa, Sri Lanka
[email protected]

A. S. Karunananda
Dept. of Computational Mathematics, University of Moratuwa, Sri Lanka
[email protected]

Abstract— The number of hidden layers is crucial in multilayer artificial neural networks. In general, the generalization power of the solution can be improved by increasing the number of layers. This paper presents a new method to determine the optimal architecture by using a pruning technique. The unimportant neurons are identified by using the delta values of the hidden layers. The modified network contains fewer neurons and shows better generalization. Moreover, it improves the training speed relative to back-propagation training. Experiments have been carried out on a number of test problems to verify the effectiveness of the new approach.

Keywords—Artificial neural networks, delta values, hidden layers, hidden neurons, multilayer.

I. INTRODUCTION

The generalization power of a multilayer artificial neural network depends very much on the number of hidden layers; hence, the number of hidden layers is significant in multilayer artificial neural networks. In other words, the generalization power of a neural network can be increased by adding more layers to the network. Nevertheless, such solutions may not be computationally optimal, and they are time consuming. Not only the layers but also the number of neurons in each layer is crucial for the performance of the network. Generally, too many hidden neurons over-train the network, while too few provide insufficient degrees of freedom [1], [2]. Therefore, determining the most appropriate architecture for a multilayer artificial neural network is very important. To date, however, there is no exact solution for the optimal hidden layer architecture in multilayer artificial neural networks, so designing an appropriate architecture for a given task is still an open problem.

As such, a large number of researchers have modeled the hidden layers of multilayer artificial neural networks. For example, Giovanna et al. [3] presented a pruning algorithm for the hidden layer architecture based on a genetic algorithm. In [4] the authors suggested a method to minimize the network using sensitivity analysis and swarm optimization. Moreover, Burkitt proposed a pruning technique by introducing an additional cost function on a set of auxiliary linear response units [1], [2]. In [5] and [6] the authors optimized the hidden layer architecture by using the mean square error. Further, [8] introduced a constructive method based on the cascade correlation defined by Fahlman [9]. Md. Monirul Islam et al. proposed a new adaptive merging and growing algorithm to design artificial neural networks in [10]. Sridhar [11] proposed an improved version of an adaptive learning algorithm for structured multilayer networks. In addition, some diverse algorithms for constructive neural networks have been presented in [16]. Usually, solutions to the hidden layer architecture problem have been based on pruning methods, constructive methods and evolutionary methods (based on optimization and their combinations). However, these approaches have various limitations, especially when the network comprises a large number of layers and neurons. Hence, there is no exact solution for the problem of hidden layer architecture in multilayer artificial neural networks, and it remains an unsolved problem.

We have researched the mathematical modeling of hidden layers considering various mathematical concepts. For example, we studied the possibility of modeling through the eigenvalues of the matrices of connection weights. However, this was not successful due to serious computational limitations when the size of the network is considerably large. As such, we started to work on an alternative approach to model the hidden layer architecture, and in our study it was noted that back propagation has become the most successful training algorithm for multilayer artificial neural networks. Therefore, we decided to model the hidden layer architecture based on back-propagation training [13]. Accordingly, we chose to model the hidden layers through the delta (δ) values of the hidden layer neurons. Our insight was based on the fact that the delta values of the hidden layers are computed using the error involved in the output layer. Hence, the delta values can be used to minimize that error over the training cycles. Generally, the summation of such delta values shows a correlation with the error of the training cycle. Therefore, we postulated that identifying the hidden layer neurons through which the error of the training cycle can be minimized would be significant for modeling the hidden layer architecture effectively, and that such neurons may be removed to prune the hidden layers. Experiments have been conducted to identify the best way to decide which neurons are the most appropriate to remove. For example, we may set a threshold value for the δ of the hidden layers and remove neurons according to the defined threshold value.

The current approaches to hidden layer architecture presented by other researchers are given in the next section. Section III gives the mathematical background of the new approach, while the proposed algorithm is presented in Section IV. The research methodology and results are presented in Sections V and VI respectively. Finally, the conclusion is given in Section VII.

II. CURRENT APPROACHES TO HIDDEN LAYER ARCHITECTURE

A. Pruning Algorithms

The objective of pruning algorithms is to obtain the optimum network structure by incrementally removing low-saliency neurons from the architecture. Optimal Brain Damage (OBD), presented by Le Cun and John Denker [14], is a commonly used pruning algorithm in which parameters with low saliency are detected based on the second derivative of the error and removed from the network. The intention of OBD is that it is possible to obtain a network which performs in the same manner (or better) by eliminating about half of the weights. However, a major issue arises from the enormous size of the Hessian matrix: it causes computational barriers and also takes considerable time to train. Hence, OBD assumes that the Hessian matrix is diagonal. In contrast to Le Cun et al., Hassibi [19] introduced Optimal Brain Surgeon (OBS) and claimed that the Hessian matrix is non-diagonal for every problem, and hence that the weight elimination of OBD may not be accurate. Even though it is argued that OBS is better than OBD and yields better generalization, considerable computational limitations arise, especially when working with the inverse of the Hessian matrix. Giovanna et al. [3] proposed a conjugate gradient algorithm, in a least-squares sense, to prune neurons after the network has been trained. It reaches the optimal size by successively removing neurons from a large trained network and then adjusting the remaining weights to maintain the original input–output behavior.
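As a rough, hedged illustration of the OBD-style saliency test mentioned above (not the authors' implementation), the sketch below scores each weight with the diagonal-Hessian approximation s_i = h_ii w_i^2 / 2 and zeroes out the lowest-saliency half; the array names, toy values and the 50% pruning fraction are assumptions made for the example.

    import numpy as np

    def obd_saliency(weights, hessian_diag):
        # Diagonal-Hessian OBD saliency: s_i = 0.5 * h_ii * w_i^2
        return 0.5 * hessian_diag * weights ** 2

    rng = np.random.default_rng(0)
    w = rng.normal(size=100)               # flattened network weights (toy values)
    h_diag = np.abs(rng.normal(size=100))  # assumed diagonal second derivatives
    saliency = obd_saliency(w, h_diag)
    keep = saliency > np.quantile(saliency, 0.5)   # drop roughly half of the weights
    w_pruned = np.where(keep, w, 0.0)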

They claim that this algorithm yields better results relative to other existing methods. However, that work does not consider multi-layered architectures and hence deviates from our aim. Moreover, our goal is to identify and prune neurons before training the network, and we concentrate on removing a cluster of neurons rather than removing one neuron at a time.

Faisal et al. [4] used a hybrid sensitivity analysis and adaptive particle swarm optimization (SAAPSO) algorithm to determine the minimal architecture of artificial neural networks. First, the less salient neurons are pruned using the impact factor defined by

ImF_i = \sum_{j} w_{ji}^2 \, \sigma_i^2

The neurons with ImF less than a certain threshold value σ_ir are identified as removable, less salient neurons. At each removal, the algorithm tries to replace the removed neuron with some other suitable neuron which has a similar effect; the similarity is defined using the correlation coefficient, and if two neurons are highly correlated they can be merged. However, this research is restricted to two-layer networks, and there is no rule that a two-layer network can solve almost all problems; indeed, there are some problems which require three-layer networks [15].
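A minimal sketch of the impact-factor test just described, assuming w_out[j, i] holds the outgoing weight from hidden neuron i to neuron j of the next layer and sigma[i] is the standard deviation of neuron i's output; the variable names, toy values and the threshold are illustrative and are not taken from [4].

    import numpy as np

    def impact_factor(w_out, sigma):
        # ImF_i = sigma_i^2 * sum_j w_ji^2 for each hidden neuron i
        return (sigma ** 2) * np.sum(w_out ** 2, axis=0)

    # w_out[j, i]: weight from hidden neuron i to neuron j of the next layer
    w_out = np.array([[0.80, 0.02, -0.50],
                      [-0.30, 0.01, 0.70]])
    sigma = np.array([0.90, 0.05, 1.10])      # std. deviation of each hidden neuron's output
    imf = impact_factor(w_out, sigma)
    threshold = 0.01                          # illustrative threshold, not the value used in [4]
    removable = np.where(imf < threshold)[0]  # low-salience neurons: candidates to remove or merge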

In [1] Antony et al. claim that for a particular network configuration there is a continuous set of weights and biases that have the same value of the cost function

\sum_{n,p} \bigl( d_{pn} - o_{pn} \bigr)^2

where the sum is over the n outputs and p training patterns, while d and o are the desired and observed outputs respectively. This set of weights defines a contour of the cost function in the space of all possible weights. In order to determine the optimum architecture, the network is trained by back propagation and the required network is obtained by eliminating the weights which are close to zero.
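A small sketch of that final step — discarding trained weights that are close to zero; the tolerance value and the helper name are assumptions made for illustration, not part of [1].

    import numpy as np

    def prune_small_weights(weight_matrices, tol=1e-3):
        # Zero out trained weights whose magnitude is below tol.
        return [np.where(np.abs(w) < tol, 0.0, w) for w in weight_matrices]

    # Toy example with two already-trained weight matrices.
    w1 = np.array([[0.42, -0.0004], [0.0007, -0.95]])
    w2 = np.array([[1.30, 0.0002]])
    w1_pruned, w2_pruned = prune_small_weights([w1, w2])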

B. Constructive Algorithms

In constructive neural networks the network structure is built during the training process by adding hidden layers, nodes and connections to a minimal neural network architecture. However, determining whether a node should be added, and whether it should be added to an existing hidden layer or to a new layer, is not straightforward. Therefore, in most algorithms a pre-defined, fixed number of nodes is added to the first hidden layer, the same number of nodes is added to the second layer, and so on [16]. This number is crucial for good performance of the network, and it is kept as small as possible to avoid complex computations during training.
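Purely as a generic illustration of the fixed-increment growth scheme mentioned above (not any specific algorithm from the literature), the sketch below adds a fixed number of nodes at a time until a toy error estimate reaches a target; train_and_evaluate is a placeholder that a real implementation would replace with actual network training.

    import numpy as np

    def train_and_evaluate(hidden_sizes):
        # Placeholder: pretend to train a network with these hidden-layer sizes
        # and return its validation error (stand-in values only).
        rng = np.random.default_rng(sum(hidden_sizes))
        return 1.0 / (1 + sum(hidden_sizes)) + 0.01 * rng.random()

    def grow_network(step=2, max_nodes=20, target_error=0.05):
        # Add `step` nodes at a time to a single hidden layer until the error
        # target is met or the size limit is reached.
        hidden = [step]
        error = train_and_evaluate(hidden)
        while error > target_error and hidden[0] < max_nodes:
            hidden[0] += step
            error = train_and_evaluate(hidden)
        return hidden, error

    sizes, err = grow_network()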


The cascade correlation algorithm (CCA), first proposed by S. E. Fahlman [9], is a well-known and widely used constructive algorithm, and it was proposed as a solution to problems such as the local minima problem. The dynamic node creation (DNC) algorithm [17] is considered to be the first constructive algorithm for dynamically designing single-layer feed-forward networks. Sridhar et al. [11] improved the adaptive learning algorithm for multi-category tiling constructive neural networks for pattern classification problems.

C. Evolutionary Methods

Apart from the pruning and constructive methods, some researchers have suggested evolutionary methods to model the hidden layer architecture based on various mathematical concepts. For instance, an alternative method was proposed by Stathakis [5] to obtain a near-optimal solution by searching with a genetic algorithm. This research was carried out to determine the optimum architecture, and it claims that the obtained network is very efficient in classification problems. Although it provides a very small error in the training cycle, the topology of the network is comparatively large; hence it is difficult to apply, especially when there is a large number of inputs to the network. Shuxiang et al. [6] presented an approach for selecting the best number of hidden units based on some mathematical evidence. According to their suggestion, the number of hidden nodes can be given as

n = C \left( \frac{N}{\log N} \right)^{1/2}

where N is the number of training pairs and C is a constant. The approach is to increase C, train the feed-forward network for each n, and then determine the optimal value of n which gives the smallest root mean square error. However, they have applied this only to medium-sized data sets, and there is no guarantee that the theory holds for large numbers of hidden neurons.

Even though methods to optimize the hidden layer architecture have been developed, they have not fully addressed the problem of the topology of multilayer artificial neural networks, namely determining the number of layers required for better performance and then minimizing the error in the training cycle.
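As a worked illustration of the heuristic above, the following snippet computes candidate hidden-layer sizes for a few values of C; the values of C and N and the base-10 logarithm are assumptions, since the text does not state them.

    import math

    def hidden_nodes(N, C):
        # n = C * (N / log N)^(1/2); the base of the logarithm is not stated in
        # the text, so base 10 is assumed here.
        return C * math.sqrt(N / math.log10(N))

    for C in (0.5, 1.0, 2.0):   # sweep C and keep the n giving the smallest RMSE after training
        print(C, round(hidden_nodes(1000, C)))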

III. MATHEMATICAL BACKGROUND ON THE NEW APPROACH

The error term E(n) of the nth training cycle of a feed-forward multilayer artificial neural network is defined as

E(n) = \frac{1}{2} \sum_{k=1}^{m} \bigl( t_k(n) - o_k(n) \bigr)^2        (1)

where m is the number of neurons in the output layer and t_k and o_k represent the desired and actual outputs of the kth neuron respectively. This can be written as

E(n) = \frac{1}{2} \sum_{k=1}^{m} \bigl( e_k(n) \bigr)^2        (2)

where e_k is the difference between the desired and actual outputs of neuron k in the output layer, that is,

e_k(n) = t_k(n) - o_k(n)        (3)

When there are N input/output training patterns, the total error of the network becomes

E_N = \frac{1}{2N} \sum_{j=1}^{N} \sum_{k=1}^{m} \bigl( t_{jk} - o_{jk} \bigr)^2        (4)

An algorithm has been proposed to optimize the size of a multilayer artificial neural network. It prunes as many neurons as possible from the hidden layers of an over-sized network while maintaining the same error rate as back-propagation training. Pruning is done by using the delta values of the hidden layers. The delta value of the jth neuron of the hidden layer immediately before the output layer is given by

\delta_j = f_h'(net_j) \sum_{k} \delta_k w_{kj}        (5)

where f_h' is the derivative of the pre-defined activation function of the hidden layer, w_kj is the connection weight between neuron j of the last hidden layer and neuron k of the output layer, and δ_k is the delta value of the kth neuron in the output layer, defined by

\delta_k = f_o'(net_k)(t_k - o_k)        (6)

where f_o' is the derivative of the activation function defined for the output layer and t_k and o_k are the desired and actual outputs respectively. The reason for choosing the delta value is that the delta value δ_j^h(n) of the jth neuron in the last hidden layer at the nth training cycle is calculated using the error e_k(n). This delta is used to update the connection weights as follows:

w_{kj}^h(n+1) = w_{kj}^h(n) + \eta \, \delta_j^h(n) f_h(n)        (7)

where h is the number of hidden layers in the network and η is the learning rate.
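The following NumPy sketch shows how the quantities in (5) and (6) can be computed for one training pattern, finishing with a weight update written in the conventional delta-rule form; the array shapes, the sigmoid activation, the toy values and the learning rate are assumptions made for illustration only.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_deriv(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # Assumed toy values: net_h holds the pre-activations of the last hidden layer,
    # net_o those of the output layer, and W[k, j] is the weight from hidden
    # neuron j to output neuron k.
    net_h = np.array([0.2, -1.1, 0.5])
    net_o = np.array([0.7, -0.3])
    W = np.array([[0.4, -0.6, 0.1],
                  [0.9,  0.3, -0.2]])
    t = np.array([1.0, 0.0])                          # desired outputs
    o = sigmoid(net_o)                                # actual outputs

    delta_o = sigmoid_deriv(net_o) * (t - o)          # eq. (6)
    delta_h = sigmoid_deriv(net_h) * (W.T @ delta_o)  # eq. (5)

    # Weight update in the conventional delta-rule form; eq. (7) in the text
    # expresses the same step in terms of the hidden-layer delta and f_h.
    eta = 0.1
    W_new = W + eta * np.outer(delta_o, sigmoid(net_h))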


Then the error of the next training cycle, E(n + 1), is calculated. Hence there is a strong relation between the error E(n + 1) and the delta δ_j^h(n), and these delta values can be used to minimize the error of the training cycle. Indeed, experimental results show that the summation of the δ values of a hidden layer and the error of the training cycle are highly correlated. For example, Table 2 shows the correlation between the summation of the delta values of the hidden layers and the error of each training cycle for several well-known classification problems obtained from [20]. N refers to the number of input/output training patterns, n and m are the numbers of nodes in the input and output vectors respectively, and corr(H_i, E) denotes the correlation coefficient between the summation of the delta values of hidden layer i and the error E of the training cycle. Hence, the error in (1) can be minimized by using the summation of the δ values in (5).

The removable neurons are identified by considering two factors. The first is the sign of the delta value: a positive correlation indicates that eliminating neurons with positive deltas from a hidden layer minimizes the error E, whereas if the correlation is negative, the error can be minimized by removing neurons with negative delta values from that layer. The second is the size of the delta. According to (7),

w_{new} = w_{old} + \eta \, \delta f(net)

which implies that zero deltas are not significant in updating the weights. Therefore, neurons with zero delta values can be removed from the architecture, as their effect on minimizing the error is negligible. Consequently, by removing neurons whose positive or negative delta values are very close to zero, the error E in (1) can be reduced; whether positive or negative deltas are removed depends on the correlation coefficient between the error and the summation of the hidden-layer delta values. In other words, if corr(H_i, E) denotes the correlation between the error of the training cycle and the summation of the delta values of hidden layer i, then we introduce a threshold value τ_α such that

\tau_\alpha = \alpha \left( \frac{\text{range of } \delta}{K} \right)

where α is a positive integer and K is a large positive real number (e.g., in Problem 3, K = 20,000), such that removing the hidden neurons with τ_α < δ_h < 0 minimizes E when corr(H_i, E) < 0. If corr(H_i, E) > 0, the minimal network can be obtained by deleting the neurons with τ_α > δ_h > 0.
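A small sketch of this selection rule, assuming the correlation corr(H_i, E) has already been estimated over earlier training cycles; the delta values, α and K below are illustrative (K = 20,000 follows the example given for Problem 3).

    import numpy as np

    def removable_neurons(deltas, corr_HE, alpha=1, K=20000.0):
        # tau_alpha = alpha * (range of delta) / K, as described above.
        tau = alpha * (deltas.max() - deltas.min()) / K
        if corr_HE < 0:
            # Negative correlation: remove neurons with small negative deltas.
            return np.where((deltas > -tau) & (deltas < 0))[0]
        # Positive correlation: remove neurons with small positive deltas.
        return np.where((deltas < tau) & (deltas > 0))[0]

    deltas = np.array([0.012, -0.00001, 0.00002, -0.4, 0.00001])  # one layer's deltas (toy values)
    corr_HE = 0.97   # correlation of the layer's summed deltas with the cycle error (assumed)
    print(removable_neurons(deltas, corr_HE))   # -> indices of neurons selected for removal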

IV. PROPOSED ALGORITHM FOR HIDDEN LAYER MODELING

In order to find the optimal architecture we follow the procedure of back propagation. The main aim of this paper is to propose a method to trim the network while maintaining the same error rate as the back-propagation training given in (4). The resulting network tends to the required error faster than back-propagation training. According to the new algorithm, the network is first trained once for the first input. Obtain the difference e_k between the desired and actual outputs of the kth node of the output layer. This error is used to determine the delta values of the output layer as follows:

\delta_k = e_k \, f'(net_k)

Calculate the delta values of the last hidden layer using

\delta_j^h = f_h'(net_j) \sum_{k=1}^{m} w_{kj} \delta_k

Decide the threshold value τ_α. Remove, from the last (hth) hidden layer, all the nodes whose deltas lie in the interval [τ_α, 0] or [0, τ_α], according to the correlation corr(H_i, E). If the number of remaining nodes in this hidden layer is n_h', the δ values of the (h − 1)th hidden layer can be computed according to

\delta_i^{h-1} = f_{h-1}'(net_i) \sum_{j=1}^{n_h'} w_{ji} \delta_j

After removing all possible nodes, the weights must be updated. Apply the next input and continue the procedure until the network becomes stable. The complete procedure is given by the following algorithm.

New Algorithm

Step 1: Initializing the MLANN
Determine the number of input nodes (p), output nodes (m) and the number of training patterns (N).
Insert the input and desired vectors.
Determine the activation functions f_{h1}, f_{h2}, ..., f_o for the hidden layers 1, 2, ..., h and the output layer respectively.
Determine the values of α and η.

Step 2: Weight Initialization
Initialize all connection weights.

Step 3: Error Calculation
Apply the first input and calculate the error

E = \frac{1}{2} \sum_{k=1}^{m} (t_k - o_k)^2

Step 4: Stopping Condition
While the stopping condition is not satisfied (E > E_max), repeat Steps 2 and 3.


Step 5: Remove Nodes from the Last Hidden Layer
Find the correlation corr(H_i, E). (Without loss of generality, let corr(H_i, E) < 0.)
Compute the δ values of the output layer:

\delta_k = e_k \, f'(net_{ok})

Compute the δ values of the last hidden layer:

\delta_j^h = f_h'(net_j) \sum_{k=1}^{m} w_{kj} \delta_k

Remove each node whose δ_j satisfies τ_α < δ_j < 0.
Let the number of remaining nodes in this layer be n_h'.

Step 6: Remove Nodes from the (h − 1)th Hidden Layer
Use the δ values of the last hidden layer to compute the δ values of the (h − 1)th hidden layer:

\delta_i^{h-1} = f_{h-1}'(net_i) \sum_{j=1}^{n_h'} v_{ji} \delta_j

Remove all possible nodes as in Step 5. Continue the procedure for all the hidden layers.

Step 7: Update the Weights
Update the weights as follows:

w_{kj}(new) = w_{kj}(old) + \eta \, \delta f_j(net)

Step 8: Apply the Next Input
Insert another input chosen randomly and repeat Steps 5–7.

Step 9: Repeat Steps 5–8 until the network remains unchanged.

Step 10: Repeat the procedure for different values of α to obtain the most efficient architecture.
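The sketch below condenses Steps 5–7 for the last hidden layer of a network whose pre-activations and output-layer deltas are already available: it applies the τ_α rule, drops the corresponding weight columns, and performs a weight update written in the conventional delta-rule form. All shapes, toy values and helper names are assumptions, and Step 6 would simply repeat the same computation for the preceding layers.

    import numpy as np

    def dsig(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)

    def prune_last_hidden(W_out, net_h, delta_out, corr_HE, alpha=1, K=20000.0):
        # Step 5: hidden-layer deltas (eq. (5)), tau_alpha threshold, then drop
        # the weight columns of the selected neurons.
        delta_h = dsig(net_h) * (W_out.T @ delta_out)
        tau = alpha * (delta_h.max() - delta_h.min()) / K
        if corr_HE < 0:
            drop = (delta_h > -tau) & (delta_h < 0)
        else:
            drop = (delta_h < tau) & (delta_h > 0)
        keep = ~drop
        return W_out[:, keep], delta_h[keep], keep

    # Toy quantities for a single training pattern.
    net_h = np.array([0.30, -0.20, 1.10, 0.00001])   # last hidden layer pre-activations
    delta_out = np.array([0.08, -0.03])              # output-layer deltas, eq. (6)
    W_out = np.random.default_rng(1).normal(size=(2, 4))

    W_pruned, delta_kept, kept = prune_last_hidden(W_out, net_h, delta_out, corr_HE=-0.9)
    # Step 6 would repeat the same computation for the (h-1)th layer using delta_kept.
    # Step 7: update of the surviving weights (learning rate assumed).
    eta = 0.1
    W_pruned = W_pruned + eta * np.outer(delta_out, 1.0 / (1.0 + np.exp(-net_h[kept])))

In a full implementation these quantities would come from the forward pass of Step 3, and the loop over randomly chosen inputs in Steps 8–9 would repeat this pass until the architecture stops changing.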

V. EXPERIMENTAL METHOD

In this section we discuss the application of the new algorithm to several well-known problems, namely car evaluation, breast cancer and Iris.


TABLE 1: DETAILS OF THE PROBLEMS

Problem      Problem description   N     p    m    h
Problem 01   Car                   100   6    1    3
Problem 02   Car                   40    6    1    3
Problem 03   Car                   60    6    1    2
Problem 04   Cancer                60    30   1    2
Problem 05   Cancer                40    30   1    3
Problem 06   Iris                  80    4    1    3
Problem 07   Iris                  60    4    1    4

The detailed descriptions of these data sets can be obtained from [20], and these problems have been used in many studies in artificial neural networks and machine learning. In each problem, several architectures have been considered. The main characteristics of the networks, namely the number of input/output data sets (N), the numbers of nodes in the input and output layers (p and m respectively) and the number of hidden layers used (h), are shown in Table 1. To evaluate the new model, different data sets were chosen for each problem. The data sets were partitioned into two groups, a training set and a testing set, as shown in Table 3. Using the training set, the network was trained by back propagation and by the new algorithm respectively until the desired error E was reached. Then the testing set was used to compare the performance of the two algorithms. The results are shown in Table 4.

VI. EXPERIMENTAL RESULTS

The log-sigmoid and linear functions were chosen as the activation functions of the hidden layers and the output layer respectively. Fig. 1 and Fig. 2 show how the error rate of the training cycles and the sizes of the hidden layers change for different values of α.

TABLE 2: CORRELATION BETWEEN ERROR AND SUMMATION OF DELTA OF HIDDEN LAYERS

Problem      Corr(H4,E)   Corr(H3,E)   Corr(H2,E)   Corr(H1,E)
Problem 01   -            0.8752       -0.9231      0.6327
Problem 02   -            -0.9224      0.7698       0.8145
Problem 03   -            -            0.9750       -0.5343
Problem 04   -            -            -0.97845     0.9994
Problem 05   -            -0.9804      0.8256       0.7893
Problem 06   -            0.8660       -0.4558      0.4950
Problem 07   -0.8933      0.7652       0.4520       -0.7459


TABLE 3: NUMBER OF TESTING SETS AND TRAINING EPOCHS

Problem      Training sets (N)   Testing sets   Epochs in BP   Epochs in NM
Problem 01   100                 500            1201           993
Problem 02   40                  200            2457           2103
Problem 03   60                  200            3254           2921
Problem 04   60                  250            651            584
Problem 05   40                  150            861            622
Problem 06   80                  70             1253           967
Problem 07   60                  90             1130           1028

Here we have considered the car problem with 40 training sets. Initially the network contained 20 neurons in each hidden layer. Training this initial configuration by back propagation required 2457 training cycles. The network was then trained for different values of α using the new algorithm. It can be observed that, as α increases, the network certainly becomes smaller, but the number of training cycles oscillates and, beyond some level, increases rapidly. For this particular example the number of training cycles reaches its minimum (2103) when the hidden layers are 12 – 7 – 7. The performance of the new algorithm is presented in Tables 3 and 4. Epochs in BP refers to the number of iterations required to train the network in its initial configuration using the back-propagation algorithm. Epochs in NM refers to the number of iterations needed to reach the required error while training the network and removing the unimportant neurons. The car evaluation, breast cancer and Iris problems were tested with different architectures.

TABLE 4: COMPARISON OF NEW METHOD WITH BACK PROPAGATION TRAINING

Problem      Hidden layers       Hidden layers in NM   Accuracy in BP   Accuracy in NM
Problem 01   50 – 50 – 50        38 – 35 – 31          82%              84%
Problem 02   20 – 20 – 20        12 – 7 – 7            80%              84%
Problem 03   35 – 35             28 – 21               76%              77%
Problem 04   35 – 35             27 – 25               80%              80%
Problem 05   15 – 15 – 15        11 – 9 – 6            80%              81%
Problem 06   30 – 30 – 30        25 – 22 – 22          82%              85%
Problem 07   20 – 20 – 20 – 20   16 – 13 – 13 – 14     85%              85%

According to the results, it is clear that there is an architecture with fewer neurons which shows better performance than back propagation of an arbitrarily sized network. For example, in Problem 01 a network with 3 hidden layers, each containing 50 neurons, was trained by the back-propagation algorithm. To obtain the required error (