
Optimization of Neural Network Weights and Architectures for Odor Recognition using Simulated Annealing

A. Yamazaki, M. C. P. de Souto and T. B. Ludermir
Center of Informatics – Federal University of Pernambuco
P. O. Box 7851, Cidade Universitária, Recife – PE, Brazil, 50.732-970
{ay,mcps,tbl}@cin.ufpe.br

Abstract – This paper shows results of using simulated annealing for optimizing neural network architectures and weights. The algorithm generates networks with good generalization performance (mean classification error of 5.28%) and low complexity (mean number of connections of 11.68 out of 36) for an odor recognition task in an artificial nose.

I. INTRODUCTION

Architecture design is a crucial issue for the successful application of Artificial Neural Networks (ANNs), because the architecture has a significant impact on the network's information processing capabilities. Given a learning task, a neural network with only a few connections may not be able to perform the task at all due to its limited capability. In contrast, a network with a large number of connections may overfit noise in the training data and fail to generalize well.

The design of the optimal architecture for a neural network can be formulated as a search problem in the architecture space, where each point represents an architecture. Given some cost measure, such as the training error and the network complexity (e.g., the number of weight connections), the cost of all architectures forms a discrete surface in the space. Designing the optimal architecture is equivalent to finding the lowest point on this surface.

A major problem with the design of neural architectures, when the weight connections and their respective values are not considered, is the noisy fitness evaluation [1]. This happens when a network with a full set of weights is used to approximate the fitness of a solution, where the solution is a neural network without any weight information. For instance, different weight initializations and training parameters can produce different results for the same topology.

This design problem can be mitigated by optimizing neural network architectures and connection weights (with their respective values) simultaneously [2]. In this case, each point in the search space is a fully specified neural network with complete weight information. Since there is a one-to-one mapping between a neural network and a point in the search space, fitness evaluation is accurate.

One of the optimization methods that can be used to deal with the design problem is simulated annealing, first presented by Kirkpatrick, Gellat and Vecchi [3]. They were inspired by the annealing (cooling) process of crystals, which reach the lowest energy state, corresponding to the perfect crystal structure, if cooled sufficiently slowly.


This method has been widely and successfully used to solve global optimization problems in many fields, including the training of ANNs [4]. However, simulated annealing has not been popular for the simultaneous optimization of network weights and architectures. In fact, this problem has most often been handled with Genetic Algorithms (GAs) [2], in spite of the simplicity of simulated annealing when compared to GAs. Thus, the main goal of this paper is to present results of using simulated annealing for the optimization of architectures and weights of Multi-Layer Perceptron (MLP) networks [5]. The networks are trained to classify three different vintages of a given wine (the data set was generated by an artificial nose). The results show that simulated annealing is able to produce networks with good generalization performance (mean classification error on the test set of 5.28%) and low complexity (mean number of connections of 11.68 out of 36) for the odor classification task.

The remainder of this paper is divided into four sections. Section II describes the odor classification task. Section III gives important details about the implementation of the simulated annealing method, such as the representation of solutions and the cooling schedule. The main contribution of this paper is presented in Section IV, where the results of the experiments with simulated annealing for the optimization of architectures and weights (including the weight values) are analyzed. Finally, Section V presents some final remarks.

II. PROBLEM AND DATA DESCRIPTION

The two main components of an artificial nose are the sensor system and the pattern recognition system. Each odorant substance presented to the sensor system generates a pattern of resistance values that characterizes the odor. This pattern is usually pre-processed and then given to the pattern recognition system, which in turn classifies the odorant stimulus [6].

Sensor systems have often been built with polypyrrol-based gas sensors. Advantages of this kind of sensor include [7]: (1) rapid adsorption kinetics at ambient temperature; (2) low power consumption, as no heating element is required; (3) resistance to poisoning; and (4) the possibility of building sensors tailored to particular classes of chemical compounds.

ANNs have been widely applied as pattern recognition systems in artificial noses. Implementing the pattern recognition system with ANNs has advantages such as [8]:

(1) the ability to handle non-linear signals from the sensor array; (2) adaptability; (3) fault and noise tolerance; and (4) inherent parallelism, resulting in high-speed operation. The type of neural network most commonly used for odor classification in artificial noses has been the MLP, trained with the backpropagation learning algorithm [5].

The aim here is to classify odors from three different vintages (years 1995, 1996 and 1997) of the same wine (Almadém, Brazil). A prototype of an artificial nose was used to acquire the data. This prototype is composed of six distinct polypyrrol-based gas sensors, built by electrochemical deposition of polypyrrol using different types of dopants. Three disjoint data acquisitions were performed for every vintage of wine, recording the resistance value of each sensor every half second during five minutes. This experiment therefore yielded three data sets with equal numbers of patterns: 1800 patterns each (600 from each vintage). A pattern is a vector of six elements representing the resistance values recorded by the sensor array.

In this work, the data for training and testing the network were divided as follows: 50% of the patterns from each vintage were assigned to the training set, 25% to the validation set, and 25% to the test set, as suggested by Proben1 [9]. The patterns were normalized to the range [-1.0, +1.0], since all network processing units implemented hyperbolic tangent activation functions.

This data set was used in previous work [10][11], in which two approaches were compared: MLP networks and TDNNs (Time Delay Neural Networks) [12], with 2, 4, 8, 12, 16 and 20 hidden nodes. The results showed that the TDNN approach achieves better generalization performance (mean classification error of 4.32%) than the MLP networks (42.79%).
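For illustration, the pre-processing described above can be sketched as follows (Python with NumPy; the array and function names are illustrative and are not taken from the original implementation):

import numpy as np

def prepare_data(patterns, labels, rng):
    """Normalize sensor patterns to [-1, +1] and split 50/25/25 per vintage.

    `patterns`: (N, 6) array of sensor resistance values (illustrative name).
    `labels`:   (N,) array of vintage indices 0, 1, 2 (illustrative name).
    """
    # Scale each sensor channel linearly into [-1, +1], matching the range
    # of the hyperbolic tangent activations used by the networks.
    lo, hi = patterns.min(axis=0), patterns.max(axis=0)
    scaled = 2.0 * (patterns - lo) / (hi - lo) - 1.0

    train, valid, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        n = len(idx)
        train.append(idx[: n // 2])               # 50% for training
        valid.append(idx[n // 2 : 3 * n // 4])    # 25% for validation
        test.append(idx[3 * n // 4 :])            # 25% for testing
    return (scaled,
            np.concatenate(train), np.concatenate(valid), np.concatenate(test))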


III. IMPLEMENTATION DETAILS

The simulated annealing algorithm consists of a sequence of iterations. Each iteration randomly changes the current solution to create a new solution in its neighborhood, which is defined by the choice of the generation mechanism. Once a new solution is created, the corresponding change in the cost function is computed to decide whether the new solution can be accepted as the current solution. If the change in the cost function is negative, the new solution is taken directly as the current solution. Otherwise, it is accepted according to the Metropolis criterion [13]: if the difference Δ between the cost function values of the new and current solutions is greater than or equal to zero, a random number in [0,1] is drawn from a uniform distribution; if this number is less than or equal to exp(-Δ/T), where T is the current temperature, the new solution is accepted as the current solution; otherwise, the current solution remains unchanged [14].

Given a set S of solutions and a real-valued cost function f : S → ℝ, the algorithm searches for the global solution s*, such that f(s*) ≤ f(s') for all s' ∈ S. The search stops after I epochs, and a cooling schedule updates the temperature Ti of epoch i. The basic structure of the simulated annealing algorithm is as follows:

    s0 ← initial solution in S
    For i = 0 to I - 1
        Generate neighbor solution s'
        If f(s') ≤ f(si)
            si+1 ← s'
        else
            si+1 ← s' with probability exp(-[f(s') - f(si)]/Ti+1)
            otherwise si+1 ← si
    Return sI

In order to implement the simulated annealing algorithm for a problem, four principal choices must be made [14]: (1) the representation of solutions; (2) the definition of the cost function; (3) the definition of the generation mechanism for the neighbors; and (4) the definition of the cooling schedule.

A. Representation of Solutions

In this work, each MLP is specified by an array of connections, and each connection is specified by two parameters: (a) the connectivity bit, which is equal to 1 if the connection exists and 0 otherwise; and (b) the connection weight, which is a real number. If the connectivity bit is 0, its associated weight is not considered, since the connection does not exist in the network. The maximal network structure is a one-hidden-layer MLP with six input nodes (one for each sensor), four hidden nodes and three output nodes (one for each vintage of wine, since the output is represented by a 1-of-m code). Since the networks have all possible feedforward connections between adjacent layers, and no connections between non-adjacent layers, the maximum number of connections is 36.

B. Cost Function

For each solution, the cost function is the mean of two quantities: (1) the classification error on the training set (the percentage of incorrectly classified training patterns); and (2) the percentage of connections used by the network. Therefore, the algorithm tries to minimize both the network's training error and its complexity. Only valid networks (i.e., networks with at least one unit in the hidden layer) were considered. If the solution created is an invalid network, a new neighbor solution is generated.

C. Generation Mechanism for the Neighbors

The generation mechanism acts as follows: first, the connectivity bits of the current solution are changed according to a given probability, which in this work is set to 20%. This operation deletes some network connections and creates new ones. Then, a random number between -1.0 and +1.0 is added to each connection weight. These two steps can change both the topology and the connection weights to produce a new neighbor solution.
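The following Python sketch illustrates choices (1) to (3): the connectivity-bit-plus-weight representation, the cost function that averages the training classification error and the percentage of connections used, and the neighbor generation mechanism. It is a minimal illustration using NumPy; the data layout (two masked weight matrices), the function names, and the exact validity test for hidden units are assumptions, not details taken from the original implementation.

import numpy as np

N_IN, N_HID, N_OUT = 6, 4, 3   # maximal structure: 6*4 + 4*3 = 36 connections

def random_solution(rng):
    # A solution holds, for every possible feedforward connection, a
    # connectivity bit (mask) and a real-valued weight.
    return {"w1": rng.uniform(-1, 1, (N_IN, N_HID)),
            "m1": np.ones((N_IN, N_HID), dtype=int),
            "w2": rng.uniform(-1, 1, (N_HID, N_OUT)),
            "m2": np.ones((N_HID, N_OUT), dtype=int)}

def forward(sol, x):
    # Only connections whose bit is 1 contribute; all units use tanh.
    h = np.tanh(x @ (sol["w1"] * sol["m1"]))
    return np.tanh(h @ (sol["w2"] * sol["m2"]))

def cost(sol, x, y):
    # Mean of the training classification error (%) and of the percentage
    # of connections used; y holds integer class labels (illustrative).
    err = 100.0 * np.mean(np.argmax(forward(sol, x), axis=1) != y)
    used = 100.0 * (sol["m1"].sum() + sol["m2"].sum()) / 36.0
    return 0.5 * (err + used)

def is_valid(sol):
    # Assumption: a network is valid if at least one hidden unit keeps
    # both incoming and outgoing connections.
    active = (sol["m1"].sum(axis=0) > 0) & (sol["m2"].sum(axis=1) > 0)
    return bool(active.any())

def neighbour(sol, rng, p_flip=0.2):
    # Flip each connectivity bit with probability 20%, then add a uniform
    # random value in [-1, +1] to every connection weight.
    new = {}
    for w, m in (("w1", "m1"), ("w2", "m2")):
        flip = rng.random(sol[m].shape) < p_flip
        new[m] = np.where(flip, 1 - sol[m], sol[m])
        new[w] = sol[w] + rng.uniform(-1, 1, sol[w].shape)
    return new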

D. Cooling Schedule

Several cooling schedules are available in the literature [15]. They are characterized by the different temperature updating schemes used. Examples are the stepwise temperature reduction schemes, which are widely used [14]. Stepwise schemes include very simple cooling strategies, such as the geometric cooling rule, in which the new temperature is the current temperature multiplied by a temperature factor (smaller than, but close to, 1) [14].

In this work, the cooling strategy chosen is the geometric cooling rule. The initial temperature is set to 1, and the temperature factor is set to 0.9. The temperature is decreased every 10 iterations, and the maximum number of iterations allowed is 1000. The classification error on the validation set is measured after every tenth iteration. The algorithm stops if: (1) the GL5 criterion defined in Proben1 [9] is met (based on the classification error on the validation set) after 300 iterations; or (2) the maximum number of 1000 iterations is reached. The GL5 criterion is a good way to avoid overfitting the network to the particular training examples used, which would reduce generalization performance. The generalization loss parameter (GL) at iteration t is the relative increase of the validation error over the minimum so far (in percent). The GL5 criterion stops the training process as soon as the generalization loss exceeds 5% [9].
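Putting the pieces together, the sketch below shows the annealing loop with the geometric cooling rule and the GL5 stopping criterion described above. It reuses the hypothetical cost, neighbour, is_valid and forward helpers from the previous sketch; the exact point at which the validation error is measured and the handling of a zero minimum error are assumptions.

import numpy as np

def simulated_annealing(s0, x_tr, y_tr, x_va, y_va, rng,
                        t0=1.0, factor=0.9, max_iter=1000):
    # Geometric-cooling simulated annealing with GL5 early stopping.
    s, t = s0, t0
    best_val = np.inf
    for i in range(max_iter):
        # Draw a neighbor; invalid networks (no hidden unit) are redrawn.
        s_new = neighbour(s, rng)
        while not is_valid(s_new):
            s_new = neighbour(s, rng)
        delta = cost(s_new, x_tr, y_tr) - cost(s, x_tr, y_tr)
        # Metropolis criterion: always accept improvements, otherwise
        # accept with probability exp(-delta / T).
        if delta <= 0 or rng.random() <= np.exp(-delta / t):
            s = s_new
        if (i + 1) % 10 == 0:
            t *= factor                              # geometric cooling rule
            val_err = 100.0 * np.mean(
                np.argmax(forward(s, x_va), axis=1) != y_va)
            best_val = min(best_val, val_err)
            # GL = 100 * (E_va / E_opt - 1); stop when GL > 5,
            # but only after 300 iterations, as in the paper.
            gl = 100.0 * (val_err / best_val - 1.0) if best_val > 0 else 0.0
            if i + 1 >= 300 and gl > 5.0:
                break
    return s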


IV. EXPERIMENTS AND RESULTS

In this work, three different architectures were taken as initial topologies: one-hidden-layer MLP networks with 2, 3 and 4 hidden nodes, having all possible feedforward connections between adjacent layers. For each initial topology, 10 distinct random weight initializations were used, with the initial weights drawn from a uniform distribution between -1.0 and +1.0. For each weight initialization, 30 runs of simulated annealing were performed. The best 10 runs (those with the lowest classification error on the validation set) and the worst 10 runs (those with the highest classification error on the validation set) were excluded, and the 10 remaining runs were used for the final results.

For each weight initialization of the 2-hidden-node initial topology, the mean and standard deviation over the 10 runs considered are presented in Table I. The results in this table show that the networks achieve good generalization performance (small classification errors on the validation and test sets) and use fewer nodes and connections than the maximal architecture allowed. The mean classification error on the test set was 5.19%, and the mean number of connections used was 11.75, much fewer than the maximum number of connections allowed (i.e., 36). The same experiments were performed for initial topologies with 3 and 4 hidden nodes; the results are shown in Tables II and III, respectively.
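The selection of the 10 "middle" runs described above, and the per-initialization statistics reported in the tables, can be computed as in this short sketch (Python/NumPy; the array names and the use of the sample standard deviation are assumptions):

import numpy as np

def trimmed_run_stats(val_errors, test_errors):
    # Rank the 30 runs by validation classification error, discard the
    # best 10 and the worst 10, and report statistics of the middle 10.
    order = np.argsort(val_errors)
    middle = order[10:20]
    return test_errors[middle].mean(), test_errors[middle].std(ddof=1)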

TABLE I
RESULTS FOR THE 2-HIDDEN-NODE INITIAL TOPOLOGY

Weight            Training Set          Validation Set        Test Set
Initialization    Classif. Error        Classif. Error        Classif. Error
                  Mean      St.dev.     Mean      St.dev.     Mean      St.dev.
 1                4.578     1.038       4.830     0.822       4.430     0.538
 2                4.070     1.686       4.274     1.588       3.926     1.365
 3                4.656     2.075       5.037     2.411       4.933     2.221
 4                5.393     1.865       5.874     1.606       5.259     1.548
 5                6.122     2.258       6.326     1.863       6.030     2.191
 6                7.922     2.532       8.348     2.487       7.889     2.455
 7                5.556     1.958       6.481     2.199       5.770     1.964
 8                2.811     0.795       3.370     0.946       2.941     0.858
 9                4.348     2.183       4.356     2.025       4.104     1.946
10                6.544     2.957       7.556     3.197       6.585     2.909

Weight            Number of             Number of             Number of
Initialization    Input Nodes           Hidden Nodes          Connections
                  Mean      St.dev.     Mean      St.dev.     Mean      St.dev.
 1                5.100     0.994       2.800     0.422       13.300    2.791
 2                4.600     0.699       2.800     0.919       11.500    1.581
 3                4.600     0.843       2.900     0.738       12.300    3.945
 4                4.600     0.966       3.300     0.823       11.200    2.098
 5                4.500     1.080       2.900     0.568       11.000    1.491
 6                5.000     0.943       2.900     0.876       11.900    2.601
 7                4.600     0.843       3.200     0.632       12.100    2.378
 8                4.800     0.789       3.000     0.816       12.700    2.406
 9                3.900     0.994       2.800     0.789       10.100    2.846
10                4.400     0.966       3.100     0.738       11.400    2.171

TABLE II
RESULTS FOR THE 3-HIDDEN-NODE INITIAL TOPOLOGY

Weight            Training Set          Validation Set        Test Set
Initialization    Classif. Error        Classif. Error        Classif. Error
                  Mean      St.dev.     Mean      St.dev.     Mean      St.dev.
 1                2.715     1.285       2.711     1.126       2.600     1.197
 2                4.293     2.898       4.889     3.010       4.185     2.740
 3                5.370     2.218       5.170     2.178       5.059     1.930
 4                5.567     2.528       6.222     3.314       5.578     2.814
 5                2.648     1.829       3.126     2.330       2.859     2.349
 6                5.593     3.057       5.726     2.699       5.311     2.714
 7                3.833     1.666       4.526     1.915       3.926     1.757
 8                4.274     1.496       4.341     1.521       4.037     1.428
 9                5.044     1.976       6.022     2.221       5.444     2.155
10                7.311     2.065       7.985     2.429       7.533     1.947

Weight            Number of             Number of             Number of
Initialization    Input Nodes           Hidden Nodes          Connections
                  Mean      St.dev.     Mean      St.dev.     Mean      St.dev.
 1                4.900     0.568       2.900     0.738       10.900    1.595
 2                4.700     0.823       3.100     0.568       12.300    2.312
 3                5.100     0.876       3.300     0.483       11.900    1.729
 4                4.600     0.699       3.000     0.816       10.900    2.025
 5                5.000     1.054       2.800     0.632       10.600    2.271
 6                4.600     1.075       2.900     0.316       11.500    2.635
 7                4.600     1.265       3.400     0.516       12.900    2.767
 8                4.200     1.135       2.800     0.789       10.400    2.547
 9                5.200     0.789       3.200     0.632       13.500    2.838
10                5.100     0.876       3.100     0.738       12.800    1.989

TABLE III
RESULTS FOR THE 4-HIDDEN-NODE INITIAL TOPOLOGY

Weight            Training Set          Validation Set        Test Set
Initialization    Classif. Error        Classif. Error        Classif. Error
                  Mean      St.dev.     Mean      St.dev.     Mean      St.dev.
 1                5.274     2.995       5.733     3.082       5.274     3.274
 2                5.919     3.318       6.252     3.059       5.815     3.136
 3                3.889     1.932       4.163     1.851       3.874     1.562
 4                8.278     1.888       9.267     2.228       8.274     1.868
 5                7.100     3.326       7.578     3.673       7.237     3.272
 6                3.211     1.554       3.319     1.627       3.067     1.427
 7                5.296     1.831       5.615     1.760       5.474     2.166
 8                6.404     3.268       7.385     3.760       6.496     3.356
 9                6.807     3.590       7.185     3.763       6.793     3.773
10                7.426     2.251       8.163     2.018       7.770     2.125

Weight            Number of             Number of             Number of
Initialization    Input Nodes           Hidden Nodes          Connections
                  Mean      St.dev.     Mean      St.dev.     Mean      St.dev.
 1                4.700     0.823       2.900     0.738       11.000    2.261
 2                4.700     1.160       3.400     0.699       10.800    2.348
 3                4.500     0.850       3.000     0.471       11.600    2.836
 4                5.000     1.054       2.900     0.568       12.000    1.826
 5                4.300     1.160       3.200     0.632       11.500    2.991
 6                4.500     0.850       3.100     0.568       11.600    1.838
 7                4.500     0.850       3.000     0.943       10.700    2.111
 8                4.400     0.516       3.100     0.738       12.000    2.404
 9                4.900     0.568       3.100     0.876       12.300    1.494
10                4.400     0.843       3.100     0.568       11.600    2.171

For the 3-hidden-node initial topology, the mean classification error on the test set was 4.65%, and the mean number of connections used was 11.77. For the 4-hidden-node initial topology, the mean classification error on the test set was 6.01%, and the mean number of connections used was 11.51.

The initial topologies were also trained with the backpropagation learning algorithm [5]. For each topology, the same 10 sets of initial weights used in the simulated annealing runs were used in the backpropagation experiments. In all cases, the learning rate was set to 0.001 and the momentum term to 0.8. The training process was stopped if: (1) the GL5 criterion defined in Proben1 [9] was met (based on the sum squared error on the validation set); or (2) the maximum number of 1000 iterations was reached. The results for the topologies with 2, 3 and 4 hidden nodes are shown in Tables IV, V and VI, respectively.
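For reference, the baseline training just described can be sketched as a fully connected one-hidden-layer tanh MLP trained by backpropagation with momentum and stopped by GL5 on the validation sum squared error. The batch update, the absence of bias units, and all names below are assumptions made for illustration, not details taken from the original implementation.

import numpy as np

def train_backprop(x_tr, t_tr, x_va, t_va, n_hidden, rng,
                   lr=0.001, momentum=0.8, max_iter=1000):
    # Fully connected 6-n_hidden-3 MLP; targets t_* are 1-of-3 coded in [-1, +1].
    w1 = rng.uniform(-1, 1, (x_tr.shape[1], n_hidden))
    w2 = rng.uniform(-1, 1, (n_hidden, t_tr.shape[1]))
    dw1, dw2 = np.zeros_like(w1), np.zeros_like(w2)
    best_val = np.inf
    for _ in range(max_iter):
        h = np.tanh(x_tr @ w1)
        y = np.tanh(h @ w2)
        e = y - t_tr                                   # output error
        g2 = h.T @ (e * (1 - y ** 2))                  # gradient wrt w2
        g1 = x_tr.T @ (((e * (1 - y ** 2)) @ w2.T) * (1 - h ** 2))
        dw2 = momentum * dw2 - lr * g2                 # momentum updates
        dw1 = momentum * dw1 - lr * g1
        w2 += dw2
        w1 += dw1
        # GL5 stopping based on the validation sum squared error.
        val = np.sum((np.tanh(np.tanh(x_va @ w1) @ w2) - t_va) ** 2)
        best_val = min(best_val, val)
        if best_val > 0 and 100.0 * (val / best_val - 1.0) > 5.0:
            break
    return w1, w2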

TABLE IV
RESULTS FOR THE 2-HIDDEN-NODE TOPOLOGY TRAINED WITH THE BACKPROPAGATION ALGORITHM

Weight            Training Set      Validation Set    Test Set
Initialization    Classif. Error    Classif. Error    Classif. Error
 1                55.074            52.963            54.222
 2                49.741            49.259            49.778
 3                66.481            66.519            66.519
 4                44.815            45.556            46.000
 5                70.815            70.296            70.444
 6                33.333            33.333            33.333
 7                33.778            33.778            33.630
 8                66.667            66.667            66.667
 9                33.333            33.333            33.333
10                54.370            54.074            53.185

TABLE V
RESULTS FOR THE 3-HIDDEN-NODE TOPOLOGY TRAINED WITH THE BACKPROPAGATION ALGORITHM

Weight            Training Set      Validation Set    Test Set
Initialization    Classif. Error    Classif. Error    Classif. Error
 1                33.333            33.333            33.333
 2                40.444            41.407            40.963
 3                59.185            60.667            60.296
 4                43.741            45.926            44.370
 5                33.333            33.333            33.630
 6                33.333            33.333            33.333
 7                46.815            46.815            46.222
 8                66.667            66.667            66.667
 9                10.111            11.778            10.148
10                60.111            61.259            60.815

TABLE VI
RESULTS FOR THE 4-HIDDEN-NODE TOPOLOGY TRAINED WITH THE BACKPROPAGATION ALGORITHM

Weight            Training Set      Validation Set    Test Set
Initialization    Classif. Error    Classif. Error    Classif. Error
 1                73.630            72.296            72.519
 2                33.370            33.333            33.333
 3                55.556            56.296            54.074
 4                66.667            66.667            66.667
 5               100.000           100.000           100.000
 6                66.667            66.667            66.667
 7                 0.000             0.000             0.000
 8                 9.556            11.111             9.481
 9                56.889            57.185            57.556
10                66.667            66.667            66.667

For the 2-hidden-node topology trained with backpropagation, the mean classification error on the test set was 50.71%. For the topologies with 3 and 4 hidden units, the mean classification errors on the test set were 42.98% and 52.70%, respectively. Comparing these results with those obtained with the simulated annealing algorithm, one can see that the maximal networks do not have to be used in order to achieve good performance (the mean classification error on the test set with simulated annealing was 5.28%). Therefore, in the context of the odor recognition problem described in Section II, the simulated annealing method has proved very effective at finding small network architectures with better performance than the fully connected MLPs trained with the backpropagation algorithm (mean classification error of 48.80%, the average over the three topologies).

V. FINAL REMARKS

In this paper, results regarding the use of simulated annealing for the optimization of neural network weights and architectures have been presented. Simulated annealing produced networks with low complexity (mean number of connections used was 11.68 out of 36) and better generalization performance (mean classification error on the test set was 5.28%) than MLP networks trained with the backpropagation algorithm (mean classification error of 48.80%) for the odor classification task. Thus, this work shows that low classification errors can be achieved without using all the connections of a standard fully connected one-hidden-layer MLP. This is important in a wide range of applications, including hardware implementations of neural networks for artificial noses.

One possible future work is the application of other optimization techniques to the same problem, such as genetic algorithms [2] and tabu search [14], in order to produce a comparative study of optimization methods for neural networks.

Acknowledgments

The authors would like to thank CNPq and Finep (Brazilian agencies) for their financial support.

References

[1] X. Yao and Y. Liu, "A new evolutionary system for evolving artificial neural networks," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 694-713, 1997.
[2] X. Yao, "Evolving Artificial Neural Networks," Proceedings of the IEEE, vol. 87, no. 9, pp. 1423-1447, September 1999.
[3] S. Kirkpatrick, C.D. Gellat Jr. and M.P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671-680, 1983.


[4] S. Chalup and F. Maire, "A Study on Hill Climbing Algorithms for Neural Network Training," Proceedings of the 1999 Congress on Evolutionary Computation (CEC'99), Washington, D.C., USA, vol. 3, pp. 2014-2021, July 1999.
[5] D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing (Edited by D.E. Rumelhart and J.L. McClelland), vol. 1, pp. 318-362, Cambridge, MIT Press, 1986.
[6] J.W. Gardner and E.L. Hines, "Pattern Analysis Techniques," in Handbook of Biosensors and Electronic Noses: Medicine, Food and the Environment (Edited by E. Kress-Rogers), pp. 633-652, CRC Press, 1997.
[7] K.C. Persaud and P.J. Travers, "Arrays of Broad Specificity Films for Sensing Volatile Chemicals," in Handbook of Biosensors and Electronic Noses: Medicine, Food and the Environment (Edited by E. Kress-Rogers), pp. 563-592, CRC Press, 1997.
[8] M.A. Craven, J.W. Gardner and P.N. Bartlett, "Electronic noses - development and future prospects," Trends in Analytical Chemistry, vol. 5, no. 9, 1996.
[9] L. Prechelt, "Proben1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules," Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany, September 1994.
[10] A. Yamazaki and T.B. Ludermir, "Classification of vintages of wine by an artificial nose with neural networks," Proceedings of the 8th International Conference on Neural Information Processing (ICONIP'2001), vol. 1, pp. 184-187, Shanghai, China, November 2001.
[11] A. Yamazaki, T.B. Ludermir and M.C.P. de Souto, "Classification of vintages of wine by an artificial nose using time delay neural networks," IEE Electronics Letters, 2001, to be published.
[12] K.J. Lang and G.E. Hinton, "The development of the time-delay neural network architecture for speech recognition," Technical Report CMU-CS-88-152, Carnegie-Mellon University, Pittsburgh, PA, 1988.
[13] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087-1092, 1953.
[14] D.T. Pham and D. Karaboga, "Introduction," in D.T. Pham and D. Karaboga (eds.), Intelligent Optimisation Techniques, pp. 1-50, Springer-Verlag, 2000.
[15] I.H. Osman, "Metastrategy Simulated Annealing and Tabu Search Algorithms for Combinatorial Optimisation Problems," PhD Thesis, Imperial College, University of London, UK, 1991.

