Federal University of Pernambuco
Informatics Center

Model Selection of RBF Networks via Genetic Algorithms

By E. G. M. de Lacerda

THESIS

Recife, PE, Brazil
March 2003

Federal University of Pernambuco
Informatics Center

Model Selection of RBF Networks via Genetic Algorithms

By E. G. M. de Lacerda

THESIS

Submitted to the Informatics Center of the Federal University of Pernambuco in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2003.

Abstract

One of the main obstacles to the widespread use of artificial neural networks is the difficulty of adequately defining values for their free parameters. This work discusses how Radial Basis Function (RBF) neural networks can have their free parameters defined by Genetic Algorithms (GAs). To this end, it first presents an overall view of the problems involved and of the different approaches used to genetically optimize RBF networks. It also proposes a genetic algorithm for RBF networks with a non-redundant genetic encoding based on clustering methods. Secondly, this work addresses the problem of finding the adjustable parameters of a learning algorithm via GAs, a problem also known as the model selection problem. Some model selection techniques (e.g., cross-validation and bootstrap) are used as objective functions of the GA. The GA is adapted to this problem by means of Occam's razor, growing, and other heuristics. Some modifications exploit features of the GA, such as its ability to solve multiobjective optimization problems and to handle noisy objective functions. Experiments on a benchmark problem are performed, and the results achieved by the proposed GA are compared to those achieved by other approaches. The proposed techniques are quite general and may also be applied to a large range of learning algorithms.
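To make the central idea concrete, the sketch below shows, in outline, how a model selection criterion such as k-fold cross-validation can serve as a GA objective function for an RBF network. It is a minimal illustration and not the algorithm proposed in this thesis: the RBF training step is deliberately simplified (centers drawn at random from the training patterns, a single shared width, ridge-regularized output weights), and all names are illustrative.

```python
# Minimal sketch: a model selection criterion (k-fold cross-validation error)
# used as the GA's objective function for an RBF network. Simplified on
# purpose; not the exact algorithm proposed in this thesis.
import numpy as np

def rbf_design_matrix(X, centers, width):
    # z_j(x_i) = exp(-||x_i - c_j||^2 / (2 width^2)); one column per basis function
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, y, m, width, beta, rng):
    # Centers chosen among training patterns; ridge-regularized output weights.
    centers = X[rng.choice(len(X), size=m, replace=False)]
    Z = rbf_design_matrix(X, centers, width)
    w = np.linalg.solve(Z.T @ Z + beta * np.eye(m), Z.T @ y)
    return centers, w

def cv_fitness(X, y, m, width, beta, k=5, seed=0):
    # The quantity a model-selection GA would minimize for a candidate (m, width, beta).
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        centers, w = train_rbf(X[train], y[train], m, width, beta, rng)
        pred = rbf_design_matrix(X[test], centers, width) @ w
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Toy usage: compare two candidate configurations on noisy 1-D data.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(80)
print(cv_fitness(X, y, m=5, width=0.5, beta=1e-3))
print(cv_fitness(X, y, m=30, width=0.1, beta=1e-3))
```

In this framing, a GA individual would encode the candidate configuration (here, m, width, and beta), and selection, crossover, and mutation would search that space with `cv_fitness` as the objective.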


Resumo

One of the main obstacles to the large-scale use of neural networks is the difficulty of defining values for their adjustable parameters. This work discusses how Radial Basis Function neural networks (RBF networks, for short) can have their adjustable parameters defined by genetic algorithms (GAs). To achieve this goal, it first presents a comprehensive view of the problems involved and of the different approaches used to genetically optimize RBF networks. A genetic algorithm for RBF networks with a non-redundant genetic encoding based on clustering methods is also proposed. Next, this work addresses the problem of finding the adjustable parameters of a learning algorithm via GAs, a problem also known as the model selection problem. Some model selection techniques (e.g., cross-validation and bootstrap) are used as objective functions of the GA. The GA is modified to adapt to this problem by means of heuristics such as Occam's razor and growing, among others. Some modifications exploit features of the GA, for example, its ability to solve multiobjective optimization problems and to handle noisy objective functions. Experiments using a benchmark problem are carried out, and the results achieved by the proposed GA are compared with those achieved by other approaches. The proposed techniques are generic and may also be applied to a large set of learning algorithms.


For God and for my father.


Acknowledgments

I would like to express my gratitude to my supervisors, Dr. Teresa Ludemir and Dr. André de Carvalho, for their permanent guidance and patience throughout my research. I also thank them, together with Dr. Aluísio Araújo and Dr. Francisco Carvalho, for revising the manuscript of this thesis. I thank my parents for their love and support, and my brothers for their encouragement. Finally, I wish to acknowledge CNPq, FAPESP, and FACEPE for their support.


Contents

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Organization

2 Genetic Algorithms
  2.1 Introduction
    2.1.1 The Binary Chromosome
    2.1.2 Selection
    2.1.3 Crossover and Mutation
    2.1.4 Elitism
    2.1.5 n-Point and Uniform Crossovers
    2.1.6 GA Terminology
    2.1.7 Schema Theorem
    2.1.8 Which Crossover is the Best?
  2.2 Optimization via GA
    2.2.1 A Brief Introduction
    2.2.2 GA and the Other Methods
    2.2.3 Exploration-Exploitation Trade-off
  2.3 RCGA - The Real-Coded Genetic Algorithm
    2.3.1 Binary vs. Real Encoding
    2.3.2 Genetic Operators
  2.4 Practical Aspects
    2.4.1 Initial Population
    2.4.2 Objective Function
    2.4.3 Stopping Criteria
    2.4.4 Generational and Steady State Replacement
    2.4.5 Convergence Problems
    2.4.6 Remapping the Objective Function
    2.4.7 Selection Methods
  2.5 Summary

3 Learning and RBF Networks
  3.1 The Learning Problem
  3.2 The True Prediction Error
  3.3 Estimating the True Prediction Error
  3.4 Introduction to Radial Basis Function Networks
  3.5 Hybrid Learning of RBF Networks
  3.6 Computational Considerations
  3.7 Ridge Regression

4 Combining RBF Networks and Genetic Algorithms
  4.1 Combining Neural Networks and Genetic Algorithms
    4.1.1 Encoding Issues
    4.1.2 Desirable Properties of Genetic Encodings
    4.1.3 Redundancy and Illegality in RBF Network Encodings
  4.2 Review of Previous Works
    4.2.1 Selecting Centers from Patterns
    4.2.2 Crossing Hypervolumes
    4.2.3 Functional Equivalence of RBFs
    4.2.4 Other Models
  4.3 Comments

5 The Proposed Genetic Encodings and their Operators
  5.1 Model I - Multiple Centers per Cluster
    5.1.1 Encoding
    5.1.2 Partitioning the Input Space
    5.1.3 Decoding
    5.1.4 The Cluster Crossover
    5.1.5 Mutation Operators
  5.2 Model II - One Center per Cluster
    5.2.1 Encoding
    5.2.2 Decoding
    5.2.3 Genetic Operators
  5.3 Comments

6 Model Selection via Genetic Algorithms
  6.1 Training and Adjustable Parameters
  6.2 Model Selection
  6.3 A Simple GA for Model Selection
    6.3.1 The Holdout Fitness Function
    6.3.2 The Occam Elitism
    6.3.3 Experimental Studies
  6.4 Other Fitness Functions for Model Selection
    6.4.1 The k-Fold Cross-Validation Fitness Function
    6.4.2 The Generalized Cross-Validation Fitness Function
    6.4.3 The .632 Bootstrap Fitness Function
    6.4.4 Experimental Studies
  6.5 Other Heuristics for Model Selection via GAs
    6.5.1 Shuffling
    6.5.2 Growing
  6.6 The Multiobjective Optimization for Model Selection
    6.6.1 The Multiobjective Genetic Algorithm
    6.6.2 The Choice of the Fitness Functions
    6.6.3 Experimental Studies
  6.7 Experimental Studies with Other Techniques

7 Conclusion

List of Figures

2.1 Function f(x) = x sin(10πx) + 1
2.2 A Simple Genetic Algorithm
2.3 Initial population
2.4 Roulette wheel parent selection algorithm
2.5 Chromosomes of the first generation
2.6 First generation
2.7 Eighth generation
2.8 Twenty-fifth generation
2.9 The maximum and mean values of the objective function as a function of the generation
2.10 GA with and without elitism
2.11 2-point crossover
2.12 4-point crossover
2.13 Uniform crossover
2.14 Strings and schemata
2.15 A population
2.16 Chromosome interpreted as a ring
2.17 Infeasibility
2.18 The downhill method
2.19 Arithmetical crossover
2.20 BLX-α applied to a unidimensional space
2.21 BLX-α applied to a multidimensional space
2.22 Heuristic crossover
2.23 Premature convergence
2.24 Ranking and the selection pressure
2.25 Linear scaling
2.26 Procedure for calculating the linear scaling coefficients a and b
2.27 Stochastic Universal Sampling

3.1 A dataset
3.2 The k-fold cross-validation method for k = 5
3.3 Hypothesis selected by k-fold cross-validation with k = 10
3.4 A Radial Basis Function network

4.1 Genotype-to-phenotype mapping
4.2 Repairing an illegal network
4.3 Redundant RBF networks
4.4 Overlapped Gaussian functions
4.5 Lucasius and Kateman's variable-length crossover
4.6 Trade mutation
4.7 Crossing hypervolumes
4.8 Decoding the chromosome formed by a list of 5-gene sequences

5.1 Partitioning the input space by means of clusters of patterns
5.2 Chromosome decoding
5.3 The cluster crossover
5.4 Traditional crossover generates duplicated genes
5.5 Cluster crossover using the template bit string (0,1,1)

6.1 The data set partitioned by the holdout method
6.2 MacKay's Hermite polynomial
6.3 Comparing the performance of Occam and traditional elitism in terms of the number of basis functions
6.4 Comparing the performance of Occam and traditional elitism in terms of the error on the test set
6.5 The k-fold cross-validation method with k = 5
6.6 Comparing the performance of several kinds of fitness in terms of the number of basis functions
6.7 Comparing the performance of several kinds of fitness in terms of the error on the test set
6.8 Pareto ranking method
6.9 Comparing the performance of the multiobjective GA in terms of the number of basis functions
6.10 Comparing the performance of the multiobjective GA in terms of the error on the test set
6.11 Comparing the performance of the multiobjective GA with other techniques in terms of the number of basis functions
6.12 Comparing the performance of the multiobjective GA with other techniques in terms of the error on the test set

Notation of Chapter 2

S : feasible set
D : solution set
N : population size
l : bit string (chromosome) length
f_i : fitness
f̄ : average fitness
f(·), g(·) : objective (or fitness) functions
H : schema
O(H) : order of a schema H
δ(H) : defining length of a schema H
U(x, y) : uniform distribution (x and y being its lower and upper limits)
N(µ, σ) : normal distribution with mean µ and standard deviation σ
σ : standard deviation
r ∼ F : indicates that r is a random number drawn from a distribution F
c_j = [c_{j1}, c_{j2}, ..., c_{jn}]^T : jth offspring (real-coded GA)
p_j = [p_{j1}, p_{j2}, ..., p_{jp}]^T : jth parent (real-coded GA)
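As a concrete use of this notation, the sketch below implements the BLX-α crossover of the real-coded GA (illustrated in Figures 2.20 and 2.21): each offspring gene is drawn from U(min − αI, max + αI), where I is the gene-wise distance between the parents. This is a sketch under the usual definition of the operator, not a quotation from the chapter; parameter names are illustrative.

```python
import numpy as np

def blx_alpha(p1, p2, alpha=0.5, rng=None):
    # Offspring gene c_i ~ U(lo_i - alpha*I_i, hi_i + alpha*I_i),
    # where lo/hi are the parents' gene-wise min/max and I_i = hi_i - lo_i.
    rng = rng if rng is not None else np.random.default_rng()
    lo, hi = np.minimum(p1, p2), np.maximum(p1, p2)
    spread = alpha * (hi - lo)
    return rng.uniform(lo - spread, hi + spread)

p1 = np.array([0.2, 1.0, -0.5])   # parent p_1
p2 = np.array([0.4, 0.0,  0.5])   # parent p_2
child = blx_alpha(p1, p2)         # offspring c, each gene sampled independently
```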

Notation of Chapter 3

D : dataset
(x_i, y_i) : ith example of the dataset
x_i : input vector
y_i : desired output (output, for short)
p : number of examples in the dataset
H : hypothesis space
h : hypothesis
e(h) : true error of hypothesis h
ê(h) : estimate of the true error of hypothesis h
m : number of hidden units
‖·‖ : Euclidean norm
⟨·⟩ : average vector
c_j = [c_{1j}, c_{2j}, ..., c_{nj}]^T : center of the jth hidden unit
n : dimension of the input space (or the number of input units)
w = [w_1, ..., w_m]^T : weight vector; w_j is the weight connecting the jth hidden unit and the output unit
w_0 : bias
σ_j : width of the jth hidden unit
α : overlap factor for widths
z_j(·) : activation function (or basis function) of the jth hidden unit
Z : design matrix, whose jth column is [z_j(x_1), z_j(x_2), ..., z_j(x_p)]^T
Z^+ : pseudo-inverse of Z
y = [y_1, y_2, ..., y_p]^T : desired output vector
β : ridge or regularization parameter
I : identity matrix
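To show how these symbols fit together, the output-weight step of the hybrid learning scheme can be written as a linear least-squares problem, with its ridge-regularized variant (a standard formulation consistent with the notation above, not a quotation from the chapter):

```latex
% Network output, least-squares weights, and the ridge (regularized) variant:
\hat{y}(x) = w_0 + \sum_{j=1}^{m} w_j \, z_j(x), \qquad
\min_{w} \, \lVert Z w - y \rVert^2 \;\Rightarrow\; w = Z^{+} y,
\qquad
w_{\beta} = \left( Z^{\top} Z + \beta I \right)^{-1} Z^{\top} y .
```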

Notation of Chapters 4 and 5

D : decoding function
E : problem-specific genetic encoding
F : feasible set
P : phenotype space
G : genotype space
L : legal set
P, Q : chromosomes (or individuals)
p_i, q_i : ith component of a chromosome (in general, it encodes parameters associated with a basis function)
S_i : ith cluster (K-means algorithm)
R_i : ith region of the input space (associated with cluster S_i)
K : number of regions (or clusters)
r_j : identifier indicating that the jth center lies inside region R_{r_j} of the input space
l_i = [l_{i1}, ..., l_{in}]^T : lower limits of region R_i
u_i = [u_{i1}, ..., u_{in}]^T : upper limits of region R_i
b_j : boolean flag; if b_j = TRUE then p_j is valid, otherwise p_j is discarded during decoding
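The role of the b_j flags can be illustrated with a small decoding sketch: the genotype carries a fixed number of components, but only those flagged valid contribute basis functions to the decoded network, so one genotype length can represent networks of different sizes. The structure below is illustrative only, not the thesis's exact encoding.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gene:
    center: np.ndarray  # candidate center for one basis function
    b: bool             # validity flag b_j

def decode(chromosome):
    # D: genotype -> phenotype. Components with b_j = False are discarded,
    # so a fixed-length genotype can decode to networks of varying size.
    return [g.center for g in chromosome if g.b]

genotype = [Gene(np.array([0.1, 0.2]), True),
            Gene(np.array([0.7, 0.9]), False),  # discarded during decoding
            Gene(np.array([0.4, 0.5]), True)]
centers = decode(genotype)  # phenotype: two valid centers
```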

Notation of Chapter 6

L, E, X, Y, θ, τ, λ, h(λ, D), h(x; λ, D), e(λ), ê(λ, D), δ(x, y), D_t, D_h, k, n, p, B, f(λ), a
