FACULTY OF ENGINEERING
THESIS SUBMITTED FOR THE PROGRAMME MASTER OF ARTIFICIAL INTELLIGENCE
ACADEMIC YEAR 2006-2007
KATHOLIEKE UNIVERSITEIT LEUVEN

HETEROGENEOUS EVOLUTION OF SURROGATE MODELS

Dirk Gorissen
Promotor: Prof. Dirk Roose
External Promotor: Prof. Tom Dhaene
Co-readers: Prof. Johan Suykens, Prof. Adhemar Bultheel

Master of Artificial Intelligence
Master's Thesis: Heterogeneous Evolution of Surrogate Models
Dirk Gorissen {[email protected]}
June 6, 2007

Abstract

Due to the scale and computational complexity of current simulation codes, surrogate models have become indispensable tools for exploring and understanding the design space. Consequently, there is great interest in techniques that facilitate the construction and evaluation of such approximation models while minimizing the computational cost and maximizing surrogate model accuracy. Many surrogate model types exist (polynomials, Kriging models, RBF models, ...), but no single type is optimal in all circumstances, nor is there any hard theory available to help make this choice. The same is true for setting the surrogate model parameters (the bias-variance trade-off). Traditionally, the solution to both problems has been a pragmatic one, guided by intuition, prior experience or simply the available software packages. This thesis presents a more principled approach to these problems. It describes an adaptive surrogate modeling environment, driven by speciated evolution, that automatically determines the optimal model type and complexity. Its performance is illustrated on a number of benchmark problems.

Copyright

Java and Jini are registered trademarks of Sun Microsystems in the United States and other countries. Linux is a trademark of Linus Torvalds and Matlab of The Mathworks Inc. All other trademarks are the property of their respective owners.


Acknowledgments

First and foremost I would like to thank my promoters. From the Katholieke Universiteit Leuven, I would very much like to thank Prof. Dirk Roose for agreeing to be my promoter on a subject of my own choosing. From Antwerp University, thanks go out to Prof. Tom Dhaene for allowing me to pursue the Master of Artificial Intelligence programme in general and for the many valuable comments, ideas, and corrections. Many thanks also to Prof. Johan Suykens and Prof. Adhemar Bultheel for their willingness to act as co-promoters.

I would also like to thank Wouter Hendrickx for his valuable feedback on many of the issues and for the help with the rational and polynomial models. Likewise, credit goes to Wim van Aarle for the help with Kriging. Finally, a special thank you to the researchers who contributed code and data for the examples in chapter 5:

• Matthias Ihme from Stanford University for providing the data and information on the chemistry example.
• NASA, and a very special thank you to Robert Gramacy from the University of California, Santa Cruz for his patience and for releasing the data on the LGBB example.
• Adam Lamecki and Michael Mrozowski from the Technical University of Gdansk for providing the Iris EM simulation code. Though it is not described in this thesis, it was a very helpful, real-world example to test with.

Naturally, all possible mistakes and/or inaccuracies are exclusively of my own making.


Contents

1 Introduction
   1.1 Preamble
   1.2 Problem domain
   1.3 Goal
   1.4 Challenges
   1.5 Related work
   1.6 Conclusion

2 Surrogate Modeling
   2.1 Introduction
      2.1.1 History
      2.1.2 What is a model
      2.1.3 Problem domain
      2.1.4 Modeling process
      2.1.5 What is a metamodel
         2.1.5.1 Definition
         2.1.5.2 Theoretical remarks
   2.2 Modeling approaches
      2.2.1 Model-driven
      2.2.2 Data-driven
      2.2.3 Hybrid
      2.2.4 Comparison
   2.3 The need for surrogate modeling
      2.3.1 Global vs local
      2.3.2 Forward vs inverse
      2.3.3 Motivation
         2.3.3.1 Reduction of the computational cost
         2.3.3.2 Large scale simulations
         2.3.3.3 Others
   2.4 Words of caution
   2.5 Surrogate modeling applications
   2.6 Building surrogate models
      2.6.1 Surrogate modeling ingredients
         2.6.1.1 Surrogate model requirements
         2.6.1.2 Reference model
         2.6.1.3 Feature selection
         2.6.1.4 Data collection strategy
         2.6.1.5 Model type
         2.6.1.6 Model selection criteria
      2.6.2 Classical surrogate modeling algorithm
      2.6.3 Improvements to the classical algorithm
         2.6.3.1 Data gathering improvements
         2.6.3.2 Model building improvements
   2.7 Conclusion

3 The M3-Toolbox
   3.1 Introduction
      3.1.1 History
      3.1.2 Philosophy
      3.1.3 Design goals
   3.2 Control flow
   3.3 Extensibility
   3.4 Configuration
   3.5 Architecture
      3.5.1 Sample selection
      3.5.2 Sample evaluation
      3.5.3 Model building
      3.5.4 Others
   3.6 Related work
   3.7 Limitations - Scope
   3.8 Critique
   3.9 Conclusion

4 Evolutionary Modeling
   4.1 Biological Foundations
      4.1.1 Heritable variation
         4.1.1.1 Mutation
         4.1.1.2 Recombination
      4.1.2 Differential survival and reproduction
   4.2 Evolutionary Algorithms
      4.2.1 History
      4.2.2 Important remarks
      4.2.3 Types
   4.3 The Genetic Algorithm (GA)
      4.3.1 The Canonical GA
      4.3.2 Extensions to the CGA
      4.3.3 Theoretical foundations
      4.3.4 Applications
   4.4 Parallel Genetic Algorithms
      4.4.1 Introduction
      4.4.2 Island model
      4.4.3 Cellular model
      4.4.4 Fitness sharing
      4.4.5 Others
      4.4.6 Applications
   4.5 Evolution of surrogate models
      4.5.1 Motivation
      4.5.2 Related work
   4.6 Genetic model builder
   4.7 Heterogeneous evolution
      4.7.1 Algorithm
      4.7.2 Extinction prevention
      4.7.3 Heterogeneous recombination
   4.8 Conclusion

5 Performance Results
   5.1 Objective
   5.2 Test problems
      5.2.1 Ackley Function (AF)
      5.2.2 Langermann Function (LF)
      5.2.3 Chemistry Example (CE)
      5.2.4 LGBB Example (LE)
   5.3 Model types
      5.3.1 ANN
      5.3.2 Rational Functions
      5.3.3 RBF Models
         5.3.3.1 Creation function
         5.3.3.2 Mutation function
         5.3.3.3 Crossover function
      5.3.4 Kriging models
      5.3.5 SVM
      5.3.6 LS-SVM
      5.3.7 Ensembles
   5.4 Tests
   5.5 Experimental setup
      5.5.1 Sampling settings
      5.5.2 GA settings
      5.5.3 Termination criteria
      5.5.4 Others
   5.6 Results
      5.6.1 Ackley Function
      5.6.2 Langermann Function
      5.6.3 Chemistry Example
      5.6.4 LGBB Example
   5.7 Discussion
      5.7.1 Ackley Function
      5.7.2 Langermann Function
      5.7.3 Chemistry Example
      5.7.4 LGBB Example
      5.7.5 Extinction prevention
   5.8 Conclusion

6 Conclusion
   6.1 Summary
   6.2 Critique
      6.2.1 Theoretical remarks
      6.2.2 Implementation remarks
   6.3 Future Work

Bibliography

Chapter 1

Introduction

Any sufficiently advanced technology is indistinguishable from magic.
- Clarke's Third Law, Arthur C. Clarke, famous science fiction writer and inventor (1917-...)

1.1 Preamble

Of the many scientific fields that have held my interest throughout the years, evolutionary biology has always been a recurring topic. So much so that, inspired by the work of Darwin, Dawkins and Gould, I almost considered it as a career path. What has always fascinated me is the elegance of its core theory, the theory that has now become the cornerstone of modern biology: adaptation through natural selection. With hindsight, the theory seems conceptually very simple, yet the mental leap required to fathom its far-reaching implications is considerable. Contemplating how deep its influence reaches, it actually becomes hard to fault Darwin's Victorian contemporaries for their outrage.

In this thesis, I have the privilege of applying my interest in evolutionary biology to the domain that did turn out to be my career path: computer science. Though the field is some 100 years younger, computation has occupied humans ever since their existence. The first mechanical adding machine (the Pascaline) was developed by the famous French mathematician Blaise Pascal in 1642, and the first four-function calculator saw the light around Darwin's time, in 1893. The real revolution started in 1943 when Alan Turing and others completed Colossus (the first all-electronic calculating device), followed by the invention of the transistor at Bell Labs four years later. Though the first personal computer (the IBM 5150 PC) did not appear until 1981 and the computing paradigm was still very new, the Norwegian scientist Nils Aall Barricelli was already applying it to the simulation of evolution in 1954. Though many researchers picked up on the idea, it wasn't until work in the early 1970s by researchers at the University of Michigan, led by John Holland, that the use of evolution as a problem-solving method became widely recognized. Since then, evolutionary algorithms (EA) have been widely used in many diverse domains, with applications ranging from language processing to hull optimization in aerospace. The work presented in this thesis will build upon such results and describe a new implementation of the ideas from evolutionary biology and its application to a number of real-world problems from science and engineering.

1.2 Problem domain

For many problems in science and engineering it is impractical to perform experiments on the physical world directly (e.g., airfoil design, earthquake propagation). Instead, complex, physics-based simulation codes are used to run experiments on computer hardware. While such codes allow scientists more flexibility to study phenomena under controlled conditions, computer experiments require a substantial investment of computation time. This is especially evident for routine tasks such as optimization, sensitivity analysis and design space exploration. To quote [200]: "...it is reported that it takes Ford Motor Company about 36-160 hrs to run one crash simulation [64]. For a two-variable optimization problem, assuming on average 50 iterations are needed by optimization and assuming each iteration needs one crash simulation, the total computation time would be 75 days to 11 months, which is unacceptable in practice."

As a result, researchers have turned to various approximation methods that mimic the behavior of the simulation model as closely as possible while being computationally cheap(er) to evaluate. Different types of approximation methods exist, each with their advantages and disadvantages. This work concentrates on the use of data-driven, global approximations using compact surrogate models (also known as emulators, metamodels or response surface models (RSM)). Examples of surrogate model types include rational functions, Kriging models, Support Vector Machines (SVM), Multivariate Adaptive Regression Splines (MARS), and Artificial Neural Networks (ANN).

The difficulties arise when actually constructing the needed surrogate models. There are a large number of options available to the designer: different surrogate model types, different data collection strategies (both static and incremental), different model selection criteria, different model parameter optimization algorithms, etc. In addition, in most cases there is no theory available, besides some simple rules of thumb, that can be used to help make these decisions.

1.3 Goal

The primary goal of this thesis is to tackle the surrogate model type selection and surrogate model parameter optimization problems through the use of evolutionary algorithms. Evolutionary optimization of model parameters has been done many times before [112, 66, 30, 181, 52, 213, 31, 76, 132, 73]. What is novel in this work is that it considers heterogeneous evolution instead of homogeneous evolution. By letting different model types compete against each other using speciated evolution, the model type selection problem is potentially solved as well.

1.4 Challenges

In general, the scientific challenge of global surrogate modeling is the generation of a surrogate that is as accurate as possible (according to some, possibly physics-based, measure) using as few simulator evaluations as possible. More specifically for this work, the following challenges need to be overcome:

1. combine different surrogate model implementations in one common framework
2. identify a suitable candidate among the available speciated evolutionary algorithm variants
3. design and implement suitable genetic operators for each of the model types
4. overcome the problem of heterogeneous recombination
5. determine robust hyperparameters for the evolution process, to ensure the model selection problem is not simply shifted to a parameter selection problem

Challenge (1) should not be underestimated. Simply getting all the available software to interoperate coherently with a clean object-oriented design is no trivial task. About half of the work for this thesis was devoted to getting this right.

1.5 Related work

As stated in the previous section, the application of evolutionary algorithms to the optimization of the model parameters of a single model type (homogeneous evolution) is common (e.g., [26]). The same is true for the use of surrogate models in evolutionary-based optimization of expensive simulators (see the work by Ong et al. [142, 143]). However, the author was unable to find any evidence of the use of evolutionary algorithms for the evolution of multiple surrogate model types simultaneously (heterogeneous evolution). In all the related work considered, speciation was always constrained to one particular model type (e.g., neural networks [181]); the model type selection problem was still left as an a priori choice for the user. The author believes this to be the first attempt to tackle the model parameter optimization and model type selection problems in one speciated evolutionary approach.

1.6 Conclusion

There are two core topics to this thesis: surrogate modeling and evolutionary computing. The next chapter will introduce surrogate models and motivations for their use. Chapter 3 then describes the software platform used as a basis for implementing the evolutionary algorithms discussed in chapter 4. The performance results are presented in chapter 5 and the thesis concludes with some critical remarks in chapter 6.


Chapter 2

Surrogate Modeling

All models are wrong but some are useful.
- George Box, famous statistician (1919-...)

This chapter provides the theoretical background needed to understand the subsequent chapters. We first sketch a brief history of modeling and introduce the problem domain. After explaining the different modeling approaches, we introduce the concept of surrogate modeling and motivate its use. A number of applications of surrogate models are given and we conclude with the different steps needed to build an accurate model.

2.1 Introduction

2.1.1 History

Since the dawn of civilization, humans have used abstractions of the real world in order to be able to reason about it and the phenomena that occur within it. Indeed, anthropologists think that the ability to build abstract models is the most important feature that gave Homo sapiens a competitive edge over other hominids such as Homo neanderthalensis [168]. The very first 'models' were numbers and the writing of numbers (e.g., as marks on bones), dating back to about 30,000 BC. With the development of astronomy and architecture around 4,000 BC, models slowly became more complex and were in common (algorithmic) use by about 2,000 BC. From then on, the complexity of models (and their applications) continued to increase through the Hellenic and Roman ages (Thales of Miletus, Aristotle, Euclid, Ptolemy, ...), the Middle Ages (Abu Abd-Allah ibn Musa al-Khwarizmi, one of the most famous Arabian mathematicians, whose name is still preserved in the modern word algorithm; Fibonacci; Vieta; ...) and modern times (Newton, Russell, Einstein, ...). While the emphasis was mostly on mathematical models, other types of models were (and are) used as well. Examples include visual models (for example the anatomy models developed by Vesalius in the early 16th century), structural models (e.g., a scale model of an aircraft to test in a wind tunnel), and the more modern biologically inspired models such as neural networks (originally proposed by Warren McCulloch and Walter Pitts in 1943). A detailed history of modeling is available in [168].

2.1.2 What is a model

The word model comes from the Latin word modellus, a diminutive form of modulus, the word for measure or standard. The old Italian derivation modello referred to the mould for producing things. In the sixteenth century the word was assimilated in French (modele), taking its meaning as a small representation of some object, spreading into other languages like English and German [168]. The Compact Oxford Dictionary has the following definitions for model:

1. a three-dimensional representation of a person or thing, typically on a smaller scale.
2. (in sculpture) a figure made in clay or wax which is then reproduced in a more durable material.
3. something used as an example.
4. a simplified mathematical description of a system or process, used to assist calculations and predictions.
5. an excellent example of a quality.
6. a person employed to display clothes by wearing them.
7. a person employed to pose for an artist.
8. a particular design or version of a product.

In this context definition (4) is the most relevant.

2.1.3 Problem domain

What domain of problems does modeling attempt to solve, and when should one look toward approximation models? The general answer is: when there is a particular real-world phenomenon that one wants to understand or reproduce (e.g., what are the conditions for it to occur, how does it behave over time, ...). The restriction is that the phenomenon should be measurable (i.e., it can be captured in a number of data tuples). The goal is then to take the measured data and any additional context information (laws, rules, constraints, ...) and use it as a basis for:

• Prediction (e.g., weather forecasting)
• Interpolation (e.g., obtaining values for missing data points)
• Extrapolation (e.g., life expectancy in 10 years)
• Decision making (e.g., classifying tumors)
• Communication (e.g., visual models, the 3D structure of DNA)
• Dimensionality reduction (e.g., clustering of genetic data)
• Denoising (e.g., image recognition)

In order to make these actions possible, however, one requires a tool that can take said information and generalize from the given knowledge to new data values. A tool that succeeds in doing this is called a model with good generalization. This stands in contrast with a model with poor generalization, or even no generalization at all (i.e., a look-up table). Generalization says something about a model's performance, but models also vary in (following [168]) their level of formality, explicitness, richness and relevance.

2.1.4 Modeling process

Different paradigms exist for describing the process of constructing a model. We follow the simplified model development process as introduced by Sargent [165] and depicted in figure 2.1.

Figure 2.1: Simplified version of the modeling process [165]


The problem entity is the system or real world phenomena to be modeled. In order to come to a computerized model of the problem, the problem entity must first be captured in a mathematical/logical/verbal/... conceptual model through an analysis and modeling phase. The computerized model is then the conceptual model implemented through a computer programming and implementation phase. Once a computerized model is available, it can be used for inference about the problem entity in the experimentation phase. Checking that the theories and assumptions underlying the conceptual model are correct and ‘reasonable’ for the intended purpose of the model is represented by the conceptual model validity arc. Likewise, the computerized model verification arc is defined as ensuring that the computer programming and implementation of the conceptual model is correct. Determining that the model’s output behavior has sufficient accuracy over the domain of the model’s intended applicability is the purpose of operational validity. Finally, data validity is defined as ensuring that the data necessary for model building, model evaluation and testing, and conducting the model experiments to solve the problem are adequate and correct [165]. For the purpose of this thesis we will mainly focus on the computerized model, the problems it introduces, and how we can further abstract it to alleviate these problems.

2.1.5 What is a metamodel

2.1.5.1 Definition

The focus of this thesis is not so much modeling as it is metamodeling. The prefix meta implies a second level of abstraction, a "model of a model". This hierarchy is illustrated in figure 2.2.

Figure 2.2: Modeling Hierarchy

The term metamodel was originally coined by Jack Kleijnen in [94] and implies there is something inadequate about the original model (also called the reference model or the simulator). The main motivation for metamodeling is that the simulator is too computationally expensive to run. A simpler approximation of the simulator is needed to make optimization, design space exploration, etc. feasible (see also subsection 2.3.3). We adopt the definition given by [35]:

A metamodel is a relatively small, simple model intended to mimic the behavior of a large complex model, called the object model - that is, to reproduce the object model's input-output relationships.

The mathematical formulation of the problem is as follows (notation adopted from [18]): approximate an unknown multivariate function $f : \Omega \mapsto \mathbb{C}^n$, defined on some domain $\Omega \subset \mathbb{R}^d$, whose function values $f|_X = (f(x_1), \ldots, f(x_k)) \in \mathbb{C}^n$ are known at a fixed set of pairwise distinct sample points $X = \{x_1, \ldots, x_k\} \subset \Omega$. Constructing an approximation then requires finding a suitable function $s$ from an approximation space $S$, with $s : \mathbb{R}^d \mapsto \mathbb{C}^n$, such that $s$ closely resembles $f$ as measured by the approximation error $\|f - s\|_\nu$ according to some norm $\|\cdot\|_\nu$. The task is then to find the best approximation $s^* \in S$ such that $s^*$ satisfies $\min_{s \in S} \|f - s\|_\nu = \|f - s^*\|_\nu$. Additional assumptions are that $f$ is expensive to compute (thus the number of function evaluations $f(X)$ needs to be minimized) and that the dimensionality of the input variables can be large, i.e., $d \gg 1$.

Figure 2.2: Modeling Hierarchy The term metamodel2 was originally coined by Jack Kleijnen in [94] and implies there is something inadequate about the original model (also called the reference model or the simulator3 ). The main motivation for metamodeling is that the simulator is too computationally expensive to run. A simpler approximation of the simulator is needed to make optimization, design space exploration, etc. feasible (see also subsection 2.3.3). We adopt the definition given by [35]: A metamodel is a relatively small, simple model intended to mimic the behavior of a large complex model, called the object model -that is, to reproduce the object model’s input-output relationships The mathematical formulation of the problem is as follows (notation adopted from [18]): approximate an unknown multivariate function f : Ω 7→ Cn , defined on some domain Ω ⊂ Rd and whose function values f |X = (f (x1 ), ..., f (xk ))∈ Cn are known at a fixed set of pairwise distinct sample points X = {x1 , ..., xk } ⊂ Ω. Constructing an approximation then requires finding a suitable function s from an approximation space S such that s : Rd 7→ Cn ∈ S and s closely resembles f as measured by the approximation error ||f − s||ν according to some norm ||.||ν . The task is then to find the best approximation s∗ ∈ S such that s∗ satisfies mins∈S ||f −s||ν = ||f −s∗ ||ν . Additional assumptions are that f is expensive to compute (thus the number of function evaluations f (X) needs to be minimized) and the dimensionality of the input variables can be large, i.e., d  1. 2 The term metamodel as used here should not be confused with the metamodels used in Software Engineering. There metamodels have nothing to do with mimicking a physical system but all to do with ontologies and modeling languages such as UML. 3 From now on the terms ’original model’, ’reference model’ or ’simulator’ will be used interchangeably.

2.1.5.2 Theoretical remarks

As stated above, the objective of metamodeling is to generate an approximation surface, based on a limited set of samples, for an unknown function $f(x)$. Considering this objective, we recall some interesting theoretical discoveries by Kolmogorov and others. In 1900 the famous German mathematician David Hilbert gave a memorable lecture at the Second International Congress of Mathematicians in Paris. During his lecture he listed 23 conjectures, hypotheses concerning unsolved problems which he considered the most important outstanding mathematical problems of the 20th century. His 13th conjecture stated that there exist continuous multivariate functions which cannot be decomposed into a finite superposition of continuous functions of fewer variables [190]. In 1957 the eminent Russian mathematician Vladimir Arnold disproved Hilbert's hypothesis [5], shortly followed by Kolmogorov [98], who gave a constructive proof that any continuous function of $n$ variables can be completely characterized by a one-dimensional continuous function. Mathematically, Kolmogorov's original theorem can be stated as follows [190, 106]:

Theorem: For all $n \geq 2$, and for any continuous real function $f$ of $n$ variables on the domain $[0,1]$, $f : [0,1]^n \to \mathbb{R}$, there exist $n(2n+1)$ continuous, monotone increasing univariate functions on $[0,1]$, by which $f$ can be reconstructed according to the following equation:

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \phi_q\!\left( \sum_{p=1}^{n} \psi_{pq}(x_p) \right) \qquad (2.1)$$

The functions $\psi_{pq}(x_p)$ are universal for the given dimension $n$ and independent of $f$. $\phi_q$ does depend on $f$ and is a continuous, one-dimensional function which totally characterizes $f(x_1, \ldots, x_n)$ ($\phi_q$ is typically highly non-smooth). Consequently, "we see that the approximation problem is not so much the dimensionality, but the complexity of the function (high dimensional functions typically have the potential to be more complex)" [106]. An intriguing result.

2.2 Modeling approaches

Roughly speaking, the different methods for approximating a complex simulation code (simulator) can be divided into three categories: Model-driven, data-driven and hybrid.

2.2.1 Model-driven

Model-driven approximation is commonly known as Model Order Reduction (MOR) [65, 155] or phenomenological approximation [35]. Taking a top down approach, MOR starts from the original simulator equations and derives approximations using rigorous mathematical techniques [154, 206]. There is a tight relation to the field of numerical linear algebra and most methods are based on Krylov subspace methods (e.g., [147, 87]). Another important aspect is the calculation of dominant eigenvalues and singular values.

2.2.2 Data-driven

In contrast, data-driven (or statistical) approximation takes a bottom-up approach. The exact inner workings of the simulation code are not assumed to be known (or even understood); only the input-output behavior is important (e.g., [119]). A model is constructed based on modeling the response of the simulator to intelligently chosen input configurations. This approach is also known as Reduced Order Modeling (ROM) or behavioral modeling, though the terminology is not always consistent. The terms surrogate models, response surface models and emulators refer to this modeling approach; we shall use these terms interchangeably from now on.

2.2.3 Hybrid

Finally, there is a large gray zone where the two overlap: data-driven modeling may include problem-specific rules and constraints (e.g., the enforcement of passivity when modeling a passive electronic component or circuit), and model-driven modeling may incorporate simulation data as a further approximation or validation step. A well-known example is the technique known as space mapping [7, 102, 224, 101]. If a simplified approximate simulator is available (referred to as the coarse model) as well as the original, high-fidelity simulator (referred to as the fine model), a surrogate model can be used to map the former onto the latter. Different ways have been developed to do this; references include [39, 152].

2.2.4 Comparison

Model-driven modeling has the advantage of staying true to the 'real' simulation model since physical laws are conserved (though verifying this is not a trivial task). In contrast, the advantage of data-driven modeling is its generality: it can be applied to any problem where the process can be described as a data-generating black box. This is useful for systems where the governing equations are not yet fully understood or known, or when the available simulation code cannot be altered in a domain-specific way (proprietary or legacy code). In addition, at the end of the day it is often the global input-output behavior that is important (e.g., will the total energy in the system be conserved, what is the maximal total force that a structure will hold, etc.). In this case, all the approximation intricacies of the different subsystems in the global system no longer play a role and a full MOR may not be worthwhile. A related point is made by [25] in the context of hydrological modeling:

Despite the strength and a growing interest in application of ANNs to hydrological modeling, the weakness associated with traditional applications of ANNs in which the networks essentially function as black box models is obvious. On the other hand, considering that conceptual models are usually formulated on the basis of a simple arrangement of a relatively small number of elements, each of which is itself a simple representation of a physical relationship (Dooge, 1977), most original conceptual models may be better understood as a lumped nonlinear 'total-response' model, having quasi-physical mechanisms and parameter thresholds.

Finally, a nice summary is given by [34]:

Some statisticians, operations researchers, and computer scientists prefer the first approach and want to know nothing about the "innards" of the model whose behavior they are attempting to replicate. They may have a purist philosophy of "allowing the data to speak," without "contaminating it" with theoretical assumptions. Or they may simply prefer not having to deal with the complexities of the model's innards: they may wish to turn the problem over to automated software. At the other extreme, some theoretically inclined academicians clearly prefer the second approach because it allows rigorous tying together of phenomena at different levels of detail (as when classical thermodynamics is understood from quantum mechanics). These, then, are the extremes. Most scientists, engineers, and analysts, however, should prefer something in between.

2.3 The need for surrogate modeling

2.3.1 Global vs local

An important distinction must be made between two different applications of surrogate models. The first is by far the most popular [160, 58, 28, 141, 199, 124, 142, 171, 211, 153, 84, 200] and involves building small and simple surrogates for use in optimization: simple surrogates are used to guide the search towards a global optimum, and once the optimum is found the surrogates are discarded. In the second case one is not interested in finding the optimal parameter vector but rather in the global behavior of the system. Here the surrogate is tuned to mimic the underlying model as closely as needed over the complete design space. Such surrogates are a useful, cheap way to gain insight into the global behavior of the system. Optimization can still occur as a post-processing step. In addition, they can cope with varying boundary conditions. This enables them to be chained together in a model cascade in order to approximate large-scale systems (e.g., a full circuit board versus a single component). In this case, simple surrogate-driven optimization is not enough, since a new optimization would need to be performed for each set of boundary conditions.


Finally, even if optimization is the goal, one could argue that a global model is less useful, since significant time savings could be achieved if more effort were directed at finding the optimum rather than modeling regions of poor designs. However, this is the logic of purely local models, which forgo any wider exploration of radical designs [51]. Both approaches are illustrated in figure 2.3. In this thesis we are concerned with the latter case.

Figure 2.3: Surrogate modeling versus Design Optimization

General surrogate model references include [92, 166, 10, 12, 200] and the excellent (Dutch) report by Janssen et al. [79].

2.3.2 Forward vs inverse

There is also a distinction between forward metamodeling and inverse metamodeling. In the former, metamodels provide estimates of simulation outputs as a function of design parameters (the ’classic’ way). However, often in the design of a system or product, one has performance targets in mind, and would like to identify system design parameters that would yield the target performance vector. Typically, this is handled iteratively through an optimization search procedure. As an alternative, one could map system performance requirements to design parameters via an inverse metamodel [9]. In this thesis we shall only discuss forward metamodeling.

2.3.3 Motivation

The two main motivations for using surrogate models are (see also section 2.2.4):

1. Reduction of the computational cost
2. Large scale simulations

2.3.3.1 Reduction of the computational cost

The principal reason driving surrogate model use is that the simulator is too time-consuming to run for a large number of simulations [214]. One model evaluation may take many minutes, hours, days or even weeks [64, 116, 134, 152]. Nevertheless, one could argue that in order to obtain an accurate global surrogate one still needs to perform numerous simulations, thus running into the same problem. However, this is not the case since: (1) building a global surrogate is a one-time, up-front investment (assuming the problem stays the same), (2) distributed computing can speed up the evaluation time, and (3) adaptive modeling and adaptive sampling (sequential design) can drastically decrease the number of data points required to produce a good model.

2.3.3.2 Large scale simulations

A second argument advocating the use of metamodels arises when simulating large-scale systems (e.g., global climate change, electronic devices, complex mechanical machines, ...). Modeling a complex system like the Earth [111], for example, with accurate simulation models would require a huge number of simulators to work together coherently. Not only is this impossible from a computational point of view, simply getting all the software to interoperate is an equally daunting task. Therefore second (or even third or fourth) order simplifications are necessary to keep everything manageable. A classic example is the full-wave simulation of an electronic circuit board. Electromagnetic modeling of the whole board in one run is almost intractable. Instead the board is modeled as a collection of small, compact, accurate surrogates that represent the different functional components (capacitors, transmission lines, resistors, ...) on the board. Surrogate models can be chained together relatively easily, so large-scale simulations can literally be pieced together. Examples of such applications can be found in [11, 216, 111].

2.3.3.3 Others

Other reasons include (from [34, 35]):

• the reference model is old, opaque, and difficult to work with
• as a way to reduce the number of variables
• cognitive: the scientist or engineer wants to 'understand' why the reference model behaves as it does
• exploratory analysis: often there is a need to explore the behavior of a model over a large part of its domain

2.4 Words of caution

As with any tool or technique, surrogate modeling has its restrictions. The use of some types of surrogate models makes little sense if:

• The simulator is simple and cheap to evaluate. There is no need for a second level of abstraction.
• There is little or no information available about the simulator. It makes little sense to put effort into building a metamodel if you lack important information like noise level, dynamics, robustness, domain, ...

Consequently, the very first step in the surrogate modeling process should always be a critical one: is surrogate modeling the best way to accomplish the necessary goals? Maybe faster machines, parallel computing, more manpower, data reduction, a faster implementation of the simulator, simple linear regression, ... are more cost-effective solutions. The answer should depend on a thorough evaluation of:

• the available time, money, manpower, software and expertise for surrogate modeling
• the available expertise about the original simulator
• the need for extrapolation
• how often the surrogate model will be used
• the available examples of typical inputs
• how often the surrogate model will need to be updated to reflect changes in the simulator

In addition, surrogate modeling has its own set of challenges: experimental design, sample selection, model type, model tuning, black box vs gray box vs white box, ... Finally, it should not be forgotten that a surrogate model is only as good as the available data and designer. Surrogate models are still models, making model assessment and selection crucial steps in the design process.

2.5 Surrogate modeling applications

Surrogate modeling has found its way into many fields where it is used to approximate some complex and/or expensive reference model. Some examples of applications are:

• Economics: economic validation of capital projects [23]
• Robotics: evolution of gait patterns of four-legged walking robots [33]
• Electronics: mobile antenna design [210]
• Physics: study of proton beams [107]
• Chemistry: prediction of fibrinogen absorption onto polymer surfaces [178]
• Engineering: study of the effect of a frontal impact on a vehicle [212]
• Environmental Science: studying the vulnerability of ground water to pesticide leaching [191]
• Biology: prediction and explanation of biodiversity data [183, 182]
• Geology: modeling of (oil, gas, water, ...) reservoirs [133]
• Meteorology: studying the effect of emission reduction on ozone concentrations [129]
• Sociology: modeling innovation diffusion [123]
• Medicine: modeling colon coloration [70]

2.6 Building surrogate models

Now that we understand the philosophy of surrogate modeling and the motivations that drive its use, we discuss how surrogates are actually built. We list the different ingredients and the algorithms that combine the ingredients into a usable surrogate.

2.6.1 Surrogate modeling ingredients

So how does one go about building a scalable, compact surrogate model? What information needs to be gathered, and what design choices need to be made? We list these below.

2.6.1.1 Surrogate model requirements

Arguably the most important ingredient is knowing what minimum requirements the surrogate has to meet in order for it to be useful within the final application. This includes answering the following questions:

• What software, time, budget, expertise, ... is available for building and testing the surrogate model? For example, [197] report 152.6 hours necessary to construct an RBF neural network vs 3 minutes for a regression tree on the same data set. Depending on the situation this may or may not be a problem.
• What deviations with respect to the reference model are acceptable and how will they be measured (accuracy)?
• How will the quality of the surrogate model be assessed (formal analysis, benchmark scenarios, ...)?
• What are the resource restrictions on the final surrogate model (execution speed, memory usage, ...)? For example, this could be an issue if the surrogate model were integrated into a hardware controller.
• What level of traceability is required? What process knowledge, physics, parameter interactions, ... do you want to see in the final surrogate model? This also includes things like adhering to documentation and reporting protocols.
• Should the surrogate model be able to interoperate easily with other systems (expert system, database, ...)?
• Will the surrogate model be used within a chain of other surrogate models? If so, what are the requirements on data scaling, software platform, communication protocol, etc.?

As always, this depends completely on the user and the target application. The process of building a surrogate model for use in missile control will be completely different from one built for stock market prediction.

2.6.1.2 Reference model

Obviously some kind of reference model (also referred to as the simulator, object model or high-fidelity model) is needed, be it in the form of simulation code and its dependencies, a set of equations, or a pre-generated data set. More importantly, what is needed is as much information about this simulator as possible. This is where interaction with the domain experts is paramount. Necessary information includes:

• the availability of existing surrogate models and their restrictions
• the dimensionality and domain of inputs and outputs
• the system type (deterministic, stochastic, dynamic, ...)
• information about discontinuities, non-linearities, sensitive parameters, epistasis, non-excited modes (sleeping dynamics), feedbacks, ...
• an estimation of the noise level and distribution (Gaussian, Poisson, Cauchy, ...), including the likelihood of outliers

Remember that, in principle, one should be able to use the reference model directly; for practical reasons, however, an explicit choice is made to use a surrogate instead. Therefore, if the quality of the reference model is inadequate for the task at hand, one should seriously question the usefulness of a surrogate model (Garbage-In-Garbage-Out).

2.6.1.3 Feature selection

In many realistic problems the system under consideration has a large number of parameters, ranging from fewer than 10 to over 1000 (e.g., for problems in bio-informatics). To understand the significance of all the different parameters would require an exceedingly large number of experiments (one for each parameter assignment). As the dimensionality increases this quickly becomes intractable. This is the infamous curse of dimensionality, a term originally coined by Bellman in 1961. Another way of formulating it is given by [106]: in high dimensions, the less data points we have, the simpler the [target] function has to be in order to represent it accurately. For this reason there is a large body of research that deals with (automatically) determining which variables (features) can be ignored based on a limited sample size (Automatic Relevance Determination, (Kernel) PCA, Genetic Algorithms, ...).

2.6.1.4 Data collection strategy

Since data is computationally expensive to obtain, it is impossible to use traditional one-shot, full factorial or space-filling designs. Data points (also known as support points or design sites) must be selected iteratively, at those locations where the information gain will be the greatest. Mathematically this means defining a sampling function

$$\phi(X_{i-1}) = X_i, \quad i = 1, \ldots, L \qquad (2.2)$$

that constructs a data hierarchy

$$X_0 \subset X_1 \subset X_2 \subset \ldots \subset X_L \subset X \qquad (2.3)$$

of nested subsets of $X = \{x_1, \ldots, x_k\}$, where $L$ is the number of levels. $X_0$ is referred to as the initial experimental design and is constructed using one of the many algorithms available from the theory of Design of Experiments (DOE) (see the work by Kleijnen et al. [95]). Once the initial design $X_0$ is available, it can be used to seed the sampling function $\phi$. An important requirement of $\phi$ is to minimize the number of sample points $|X_i| - |X_{i-1}|$ selected each iteration ($f$ is expensive to compute), yet maximize the information gain of each successive data level. This process is called adaptive sampling, but is also known as active learning [42], query learning [62], selective sampling [62], reflective exploration [40], Optimal Experimental Design (OED) [158] and sequential design [86]. Adaptive sampling was pioneered by the Finnish mathematician Gustav Elfving in the early 1950s. Elfving gave the first optimality criterion according to which one can decide if an observation is relevant or not [48]. Since then the theory has been extended and widely applied in such diverse fields as aerospace engineering and medicine (e.g., optimal subject selection in clinical trials). In the context of computer experiments, a large variety of adaptive sampling methods have been developed; examples can be found in [18, 37, 81, 86, 127, 62, 167, 225].
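As a purely illustrative sketch of this idea (the disagreement-based selection rule, the 1-D test function and all constants below are assumptions of this example, not one of the cited algorithms), the loop below grows a nested design $X_0 \subset X_1 \subset \ldots$ by adding, at each level, the candidate point where two cheap surrogates of different flexibility disagree most, so that only one new (expensive) simulator call is made per level:

```python
import numpy as np

def simulator(x):                        # stand-in for the expensive code
    return np.sin(8 * x) / (1 + x)

def fit(X, y, deg):                      # cheap polynomial surrogate of a given degree
    return np.poly1d(np.polyfit(X, y, deg))

# Level 0: a small initial experimental design X0 on Omega = [0, 1]
X = np.linspace(0.0, 1.0, 5)
y = simulator(X)

candidates = np.linspace(0.0, 1.0, 401)
L = 6                                    # number of levels in the data hierarchy
for level in range(1, L + 1):
    s_lo = fit(X, y, 3)                  # a stiff surrogate ...
    s_hi = fit(X, y, min(len(X) - 1, 7)) # ... and a more flexible one
    disagreement = np.abs(s_lo(candidates) - s_hi(candidates))
    # never re-select a point that is (numerically) already in the design
    too_close = np.min(np.abs(candidates[:, None] - X[None, :]), axis=1) < 1e-6
    disagreement[too_close] = 0.0
    x_new = candidates[np.argmax(disagreement)]
    X = np.append(X, x_new)
    y = np.append(y, simulator(x_new))   # the only new expensive evaluation this level
    print(f"level {level}: added x = {x_new:.3f}, design size = {len(X)}")
```

Real sequential designs replace the two throw-away polynomials with the actual surrogate(s) under construction and use far more careful exploration-exploitation criteria, but the nested structure of equations (2.2)-(2.3) is the same.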

2.6.1.5 Model type

Another ingredient is the type of model that will be used (analytical, neural network, regression tree, ...). This is an important choice that depends on the factors above and on the required:

• interpretability
• adaptability
• target implementation platform
• resource constraints
• amount of available data
• prediction speed
• composability (will there be multiple models that need to be combined)
• traceability (how well can relationships, behavior, dynamics, ... be traced back explicitly to the reference model)
• simplicity (how complex is the implementation and use of the surrogate)

Many studies are available that compare surrogate model types [173, 80, 35, 137, 171, 153, 212, 27, 200]. Unfortunately, few meaningful conclusions can be drawn from such studies since the results are very dependent on the problem, the experimental setup and the expertise of the experimenter. Together with some theoretical foundations from the so-called No-Free-Lunch theorems [207], this means that selecting an appropriate model type is still very much an art. This problem shall be revisited in chapter 4 and an evolutionary solution proposed.

2.6.1.6 Model selection criteria

A crucial step in the surrogate modeling process is identifying suitable model selection criteria: raw accuracy, scalability (in dimensions or data), robustness, resource usage, ... This will again differ per application. In the case of accuracy, common metrics are:

• Empirical: validation error, cross validation, leave-one-out, jackknifing, bootstrapping, R-squared ($R^2$), ...
• Theoretical: Akaike information criterion (AIC), Minimum Description Length (MDL), Bayesian information criterion (BIC), VC-dimension, ...

Note that the choice of performance measure is not completely independent of the choice of metamodel type or sampling algorithm. For example, when using Least Squares SVMs (LS-SVM) the accuracy measure used should preferably also be a least-squares variant. If instead a maximum error is used, the search for a good model (meaning a low maximum error) will be inefficient since the two error measures are not aligned.
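As a minimal illustration of the empirical route (the candidate "model types" here are just polynomial families of different degree, standing in for the richer types used in this thesis, and the data set is synthetic), k-fold cross validation can be used to score each candidate and keep the one with the lowest estimated error:

```python
import numpy as np

def cv_rmse(X, y, degree, k=5):
    """Mean validation RMSE of a polynomial model of the given degree over k folds."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(X)), fold)
        model = np.poly1d(np.polyfit(X[train], y[train], degree))
        errors.append(np.sqrt(np.mean((model(X[fold]) - y[fold]) ** 2)))
    return float(np.mean(errors))

np.random.seed(0)
X = np.random.uniform(0, 1, 40)
y = np.sin(6 * X) + 0.05 * np.random.normal(size=X.size)   # noisy 'simulator' data

# Score each candidate model family and keep the one with the lowest CV error
candidates = [1, 3, 5, 9]
scores = {d: cv_rmse(X, y, d) for d in candidates}
best = min(scores, key=scores.get)
print("cross-validation RMSE per degree:", scores)
print("selected model complexity:", best)
```

The same scheme applies unchanged when the candidates are genuinely different model types (Kriging vs RBF vs SVM, ...), which is exactly the situation in which the alignment of error measures discussed above starts to matter.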


For discussions on the different performance measures see [91, 80, 44, 128, 13, 176, 34, 35]. For references on model selection see [8, 85, 176] and the excellent collection of resources at [75]. For comparisons among different model selection criteria see [97, 4, 196, 24, 176]. For a more theoretical approach to the problem of model verification and validation see the work by Robert G. Sargent et al. (e.g., [165]).

2.6.2 Classical surrogate modeling algorithm

There are many different ways metamodels can be constructed, but in the majority of the cases they boil down to the very simple algorithm depicted in figure 2.4 [35].

Figure 2.4: Classical Algorithm

The workflow is as follows:

1. By studying the real world, a set of rules and/or equations is identified that captures the essence of the phenomenon as accurately as possible.
2. These rules and equations are implemented in a simulation model, which we will refer to as the simulator. This work, and the work in the previous step, is carried out by the domain experts.
3. Next, in order to gather sufficient data, the simulator is evaluated a number of times according to a predefined data distribution (e.g., a full factorial design). The size of the data set is constrained by the available computational resources.
4. Having all the data, one then builds and tunes a metamodel that fits the data as closely as possible (taking care not to overfit).
5. Once the metamodel performance is acceptable it can be used within an application.

While already very useful, there are a number of obvious improvements that can be made. In the next subsection we list some of these improvements, with examples from the scientific literature.
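To make steps 3-5 concrete, here is a minimal, self-contained sketch of the classical one-shot workflow (the two-variable simulator, the 5x5 full factorial design and the quadratic response surface are illustrative assumptions, not taken from this thesis):

```python
import numpy as np
from itertools import product

def simulator(x1, x2):                          # step 2: the implemented simulator
    return np.sin(x1) * np.cos(x2) + 0.1 * x1 * x2

# Step 3: one-shot data gathering on a predefined full factorial design (5 x 5 grid)
levels = np.linspace(0.0, 1.0, 5)
design = np.array(list(product(levels, levels)))
responses = np.array([simulator(a, b) for a, b in design])

# Step 4: build the metamodel -- here a quadratic response surface fitted by least squares
def features(pts):
    x1, x2 = pts[:, 0], pts[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

beta, *_ = np.linalg.lstsq(features(design), responses, rcond=None)

# Step 5: check the fit on fresh points before using the metamodel in an application
test = np.random.rand(200, 2)
pred = features(test) @ beta
true = np.array([simulator(a, b) for a, b in test])
print("validation RMSE:", np.sqrt(np.mean((pred - true) ** 2)))
```

Note the strict separation between data gathering (step 3) and model building (step 4): all simulator runs are decided up front, which is exactly the limitation addressed in the next subsection.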

2.6.3 Improvements to the classical algorithm

2.6.3.1 Data gathering improvements

The first major drawback of the classical algorithm is its one-shot approach. Why the strict dichotomy between the gathering of data and the building of a model? An obvious improvement is to make this iterative, as already presented in subsection 2.6.1. There are two types of sequential design:

• passive sequential design: each step in the sequential design is treated as the last one to be performed
• active sequential design: takes into account the fact that further observations will become available when tuning the model

For more information on these two types and their relative differences see [53, 54, 156]. Regardless of type, sequential design still allows for a large degree of variation. The design decisions that need to be made are:

• When to select new data points: every iteration, only when the model cannot be improved, ...


• How to select new data points: randomly, error based, Bayesian [47], ...
• Where to select new data points: which input/output variables, corner points, around optima, slightly outside of the domain, ...
• The number of data points to select

For these questions there is again no clear answer. The choices depend on many factors such as the available time and resources, the degree of non-linearity, the type of model used, etc. See [86, 167] for some examples. Further references on sampling algorithms and sequential design include [172, 192, 82, 83, 86, 118, 137, 171, 62, 95, 153, 47, 69, 27]. A minimal sketch of one such (error-based) selection criterion is given below.
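The sketch ranks a set of candidate points by the disagreement between two competing surrogates and returns the most uncertain ones for evaluation. All names are illustrative and not tied to any specific toolbox.

function Xnew = selectByDisagreement(surrogate1, surrogate2, candidates, nNew)
% One error-based (model comparison) sequential design step: where two
% competing surrogates disagree most, the true response is least certain,
% so those candidate points are evaluated by the simulator next.
    disagreement = abs(surrogate1(candidates) - surrogate2(candidates));
    [~, order] = sort(disagreement, 'descend');
    Xnew = candidates(order(1:nNew), :);
end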

2.6.3.2 Model building improvements

A further area of improvement concerns the building of models. Possibilities include:

• Trying different model types and picking the best one
• Letting model parameters be chosen automatically
• Adopting a divide-and-conquer approach
• Using ensembles or committees
• Knowledge based modeling
• Combinations of the above

We will now treat each of these items in turn.

Multiple model types
Given the range of different model types available it makes sense to try out the different alternatives, possibly in parallel. A number of model selection criteria can then be used to select among the different types. While this seems an obvious thing to do, in practice one rarely tries out more than one or two model types. Mostly this is due to practical reasons:

1. The expertise is not available.
2. The designer sticks to what he knows best (“Not invented here” syndrome).
3. Trying out more model types simply requires more time (finding/installing packages, tuning model parameters, ...), time that may not be available.
4. Lack of user friendly software packages.
5. The final application restricts the designer to one particular type. For example, in control systems neural networks are often used due to their natural implementation in hardware.

Adaptive modeling
Selecting the model type is only part of the problem; the complexity of the model needs to be chosen as well. Each model type has a set of tunable parameters θ that control the complexity of the model M and thus the bias-variance trade-off. For example, with rational functions this could be the degrees dn, dd of the numerator and denominator, with ANNs the number of units per hidden layer Hi, and with SVMs the kernel function K(x1, ..., xn), the regularization constant γ, and the kernel parameters (e.g., the spread σ in the case of an RBF kernel). In general, finding the optimal bias-variance trade-off, i.e., s∗ as defined in section 2.1.5, is hard. All too often this is done by trying out different parameters manually and using those that work best (see for example [214]). A better approach is to use an optimization algorithm guided by a performance metric (e.g., an external validation set, the leave-one-out error, an approximation of the posterior p(θ, M | data), ...). In this way a succession of approximation models M1, M2, ..., Mm is generated that converges towards a local minimum of the optimization landscape, as determined by the performance metric. The challenge here is to converge to a good local optimum, since the landscape can be expected to be highly multi-modal, high dimensional, deceptive and epistatic. Examples from the literature include ANN structure optimization with pattern search [77], SVM hyperparameter optimization with genetic algorithms [112], pattern search [135] or heuristics [15], and automatic trimming of redundant polynomial terms [117]. A minimal sketch of such a metric-driven search is given below.
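The sketch illustrates the idea on the simplest possible case: the tunable parameter is the degree of a polynomial and the performance metric is the cross validation error from the earlier sketch (kfoldError). For an SVM the same loop would run over (γ, σ), and the exhaustive sweep would normally be replaced by pattern search or a genetic algorithm; everything here is illustrative, with x and y assumed to be column vectors of samples and responses.

% Metric-driven complexity selection for a 1-D polynomial model
degrees = 1:10;
score = zeros(size(degrees));
for d = degrees
    fitFcn     = @(xtr, ytr) polyfit(xtr, ytr, d);
    predictFcn = @(m, xte) polyval(m, xte);
    score(d)   = kfoldError(x, y, 5, fitFcn, predictFcn);   % the metric drives the search
end
[~, best] = min(score);                    % lowest estimated generalization error
bestModel = polyfit(x, y, degrees(best));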


Hierarchical modeling4
Often the systems to model are highly non-linear and may contain discontinuities. Trying to model the whole system in one go may prove too difficult. Instead a divide-and-conquer approach is adopted, either on the system level or on the data level. At the system level, the system itself is decomposed into smaller components and each is modeled separately. See also section 2.3.3.2. At the data level, the problem space is partitioned recursively until a model with acceptable performance can be built on each partition. The final surrogate model then consists of a tessellation of simpler submodels. The advantage of this approach is that, if one allows the model complexity to vary, the final model will consist of a set of submodels that are only as complex as the data requires them to be. For example, in the case of polynomial models, plateaus will be modeled by simple first-degree polynomials while more non-linear partitions will be fitted with higher degree polynomials. Depending on the size and dimensionality of the data set, though, such algorithms may be prohibitively expensive in terms of CPU and memory requirements. A more important problem is how to ensure continuity at the boundaries of the different partitions. While this is possible for analytic models by enforcing equality up to the nth derivative, it is harder to do if the models are not directly differentiable (e.g., SVMs). Hierarchical modeling has nevertheless been used successfully; references include [199, 142, 222, 223].

Ensembles
Ensemble methods are closely related to hierarchical modeling and have been very popular within the neural network literature (cf. committee networks). In ensemble modeling a group of models (the ensemble) is constructed and trained on different subsets of the data. The final prediction is then based on the combination of the predictions of the different members of the ensemble. Bagging, for example, is one of the more popular ensemble methods; a minimal bagging sketch is given at the end of this subsection. Under certain conditions it can be proven that by combining models in this way a superior prediction may be achieved [67, 103]. The difficulty with ensemble methods is finding the optimal ensemble composition and identifying the best combination expression (weighted linear, non-linear, ...). There is a huge amount of literature on ensembles; some references include [120, 19, 121, 201, 89, 186, 202, 220, 114, 218, 217, 219, 205]. A good overview reference is [169].

Knowledge based modeling
Part of the difficulty of surrogate modeling originates from the fact that one tries to tune a generic technique in order to reproduce the results of some specific physical system5. In order to help close the gap, researchers have taken problem specific rules and equations and integrated them into the generic modeling technique. An example is the Knowledge Based Neural Networks (KBNN) used for microcircuit modeling by [198, 39]. It should come as no surprise that these techniques perform better than their generic counterparts. The disadvantage is that these approaches only work on problems where the underlying process is at least partially known and knowledge can be encapsulated into standalone rules and/or equations. See also subsection 2.2.3.
A good overview of integrating problem knowledge with statistical metamodels can be found in the work by Davis and Bigelow [34, 35], who strongly believe in the need for incorporating knowledge:

Finally, because of our personal interdisciplinary inclinations, we found pure statistical metamodeling to be distasteful, given that much is known about the real-world systems being described. Why should we not be using some of that information? And why shouldn’t explanations be in causal terms if at all possible?

Hybrid modeling
Finally, all the methods mentioned above may be combined in endless different ways; ideas include using a genetic algorithm to evolve an ensemble of models [221, 78, 66] and a combination of Regression Trees and RBF neural networks [144]. In general the downside of hybrid methods is that they are computationally expensive and demand a lot from the designer with respect to implementation and tuning complexity. Furthermore, it remains to be seen whether an ultra-adaptive, hybrid approach performs more than marginally better than a single properly tuned ‘classical’ model. Performance data on a wide range of problems is needed in order to decide whether the added implementation complexity is worth the effort.

4 This should not be confused with multilevel modeling, which is a generalization of linear and generalized linear modeling in which the regression coefficients are themselves given a model, whose parameters are also estimated from the data [55, 56]. In a sense this can also be seen as a metamodel.
5 Expert systems are the extreme example of this.
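To make the ensemble idea from this subsection concrete, here is a minimal bagging sketch: B members are trained on bootstrap resamples of the data and their predictions are averaged. The polynomial base learner is a placeholder; the weighted or non-linear combination schemes mentioned above would replace the final mean.

function yhat = baggedPredict(x, y, xq, B, degree)
% Bagging: train B models on bootstrap resamples and average the predictions.
    n = numel(y);
    memberPred = zeros(numel(xq), B);
    for b = 1:B
        idx = randi(n, n, 1);                     % bootstrap resample (with replacement)
        m = polyfit(x(idx), y(idx), degree);      % member b trained on its resample
        memberPred(:, b) = polyval(m, xq(:));
    end
    yhat = mean(memberPred, 2);                   % simple unweighted combination
end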


2.7 Conclusion

The reader should now have a thorough understanding of surrogate modeling, the motivations for its use, and the advantages and disadvantages it entails. In the next chapter we take the considerations mentioned here and translate them into a concrete implementation.


Chapter 3

The M3-Toolbox

Civilization advances by extending the number of important operations which we can perform without thinking about them - Alfred North Whitehead, famous mathematician and philosopher (1861-1947)

Now that we have laid the theoretical foundation in the previous two chapters we are ready to take the gained insights and discuss them in relation to a concrete metamodeling toolbox, the M3-Toolbox. The discussion of the M3-Toolbox is important since about half of the total time spent on this thesis was devoted to extending and enhancing the toolbox to make the experiments in chapter 5 possible. In this chapter we shall introduce the toolbox, its history and its design, and compare it with other research efforts within the domain of adaptive surrogate modeling.

3.1 Introduction

3.1.1 History

In 2004 research within the Computational Modeling and Simulation (COMS) research group at the University of Antwerp concentrated on developing efficient, adaptive and accurate algorithms for polynomial and rational modeling of electro-magnetic data. On the multivariate front this work resulted in a set of Matlab scripts that were used as a testing ground for new ideas and techniques. Research progressed, and with time (early 2005) these scripts were re-worked and refactored into one coherent Matlab toolbox, tentatively named the Multivariate MetaModeling Toolbox. The first public release (v2.0) of the toolbox occurred in November 2006. At the time of writing the current version is v3.4.

3.1.2 Philosophy

The M3-Toolbox was developed when research made clear that there was room for an adaptive tool that integrated different modeling approaches and did not tie the user down to one particular set of problems [69]. More concretely, the authors were interested in a fully automated, adaptive global surrogate model construction algorithm. Given a simulation model, the software should produce an accurate surrogate model with as little user interaction as possible. At the same time, it must be kept in mind that there is no such thing as a ‘one-size-fits-all’ solution. Different problems need to be modeled differently and require different a priori process knowledge. Therefore the software should be modular and extensible but not too cumbersome to use or configure. Given this design philosophy, the toolbox caters to both the scientists working on novel surrogate modeling techniques and the engineers who need the surrogate model as part of their design process. For the former, the toolbox provides a common platform on which to deploy, test, and compare new modeling algorithms and sampling techniques. For the latter, the software functions as a highly configurable and flexible component to which surrogate model construction can be delegated, easing the burden of the user and enhancing productivity. Note that there is no inherent bias towards a particular application domain. Therefore it can just as easily be used for surrogate model construction in chemistry as in meteorology or electronics.


Of course, since many problems do have additional constraints or requirements, the toolbox can be tuned or extended to cater for those cases where necessary, without compromising generality. The research domain of the toolbox is depicted in figure 3.1.

Figure 3.1: Research Domain

3.1.3 Design goals

The following design goals served as guidelines for the design of the M3-Toolbox:

1. Development of a fully automated, adaptive metamodel construction algorithm. Given a simulation model, the software should produce a metamodel with as little user interaction as possible (“one button approach”).
2. There is no such thing as a ‘one-size-fits-all’; different problems need to be modeled differently. Therefore the software should be modular and extensible but not too cumbersome to use or configure (sensible defaults).
3. The toolbox should minimize the required prior knowledge of the system to be modeled.
4. The algorithm should minimize the number of samples required to arrive at an acceptable metamodel.
5. The algorithm should terminate only when the predefined accuracy (set by the user) has been reached or the maximum number of iterations has been exceeded.

Studying these requirements it is obvious that some of them are mutually contradictory. It is impossible to maximize problem generality, maximize model accuracy and minimize the required domain knowledge all at the same time. The more knowledge and data you possess about a system, the faster and more accurately you can model it, but the less generic your approach will be. As a result we place a strong emphasis on extensibility, pluggability, configurability and adaptivity.

3.2 Control flow

The detailed control flow of the toolbox is illustrated in figure 3.2, or more formally in algorithm 1. The algorithm is conceptually very simple, but in its simplicity lies its power and flexibility. The execution of each step of the control flow is as follows:



Algorithm 1 Adaptive Surrogate Modeling Algorithm

01. target = getAccuracyTarget();
02. X = initialSampleDesign();
03. f|X = evaluateSamples(X);
04. M = [];
05. while(target not reached)
06.     M = buildModels(X, f|X);
07.     while(improving(M)) do
08.         M = optimizeModels(M);
09.     end
10.     Xnew = sampleSelection(X, f|X, M);
11.     f|Xnew = evaluateSamples(Xnew);
12.     [X, f|X] = merge(X, f|X, Xnew, f|Xnew);
13. end
14. return bestModel(M);

Figure 3.2: M3-Toolbox Control Flow

1. The user sets up the experiment in an XML configuration file. Among other things he specifies: the problem to model, the modeler to use, the model scoring mechanism, how to build new models, whether sample evaluation should occur locally or on a cluster, etc. These configuration settings are then merged with the default configuration options and the result is used to configure each component of the toolbox.
2. The control code starts by selecting an initial experimental design and generating a number of samples (data points).
   (a) The data points are added to the input queue of the Sample Evaluator (SE), which starts scheduling them on the available hardware.
3. The main loop waits until the number of evaluated samples has reached a certain threshold.
4. Once the threshold has been reached, the evaluated samples are retrieved from the output queue and passed to the Model Builder (MB). Based on the available samples the MB starts building models that fit the samples as accurately as possible. This is done within an adaptive modeling loop:



   (a) An optimization algorithm is used to tune the model parameters in order to optimally fit the available data.
   (b) Each model produced as part of the optimization process is given a score that depends on a number of measures. It is this score that drives the optimization algorithm.
   (c) As long as the optimization process is successful, the tuning process is continued.
   (d) Once the optimization has converged, the adaptive modeling terminates and the k best models are returned to the control loop.
5. The control loop checks whether the models found meet the requirements set out by the user. If so, the loop terminates and the best model is returned to the user.
6. If the best model found is not good enough, a new set of samples is selected (sequential design) and control returns to step 3 (unless the maximum number of iterations has been exceeded or a timeout occurs; in that case the control loop terminates with the best intermediate result).

The specifics of the steps above (which algorithms to use, which simulator outputs to model, whether noise should be added, whether complex data should be split into its real and imaginary components, etc.) are completely determined by two XML configuration files. The first defines the interface of the simulation code: the number of input and output parameters, the type of each parameter (real or complex), the simulator executable and its dependencies, and/or one or more data sets. The second XML file contains the configuration of the toolbox itself: which outputs to model, which model type to use, whether sample evaluation should occur on a grid or cluster, etc. This file also allows the user to specify multiple runs in the form of a plan. Each run may be configured completely independently or inherit the configuration specified for all runs. The advantage of this is that it makes it easy to set up experiments where different model types, algorithms, data sets, ... need to be compared.

The core algorithm presented here is comparable to the Hierarchical Nonlinear Approximation algorithm presented in [18]. It too integrates adaptive modeling with adaptive sampling, but it is strongly biased towards Kriging models and does not take the integration and extensibility as far as the work presented here. Likewise, [47] presents a Bayesian framework for optimal sample selection and model adaptation, but is also restricted to a single model type, modeling algorithm and sampling algorithm.

3.3 Extensibility

In light of the No-Free-Lunch theorems, a primary design goal of the toolbox was to allow the user maximum flexibility in composing a surrogate modeling run. This was achieved by designing a plugin-based infrastructure using standard object oriented design patterns. Different plugins (model types, measures, ...) can easily be composed into various configurations or replaced by custom, more problem specific plugins. Some of the plugins currently available include (see figure 3.3):

• Data sources: flat file, native code, Java class, Matlab script
• Model types: Polynomial/Rational functions, Multi Layer Perceptrons, RBF models, RBF Neural Networks, Support Vector Machines (ε-SVM, ν-SVM, LS-SVM), Kriging models, splines
• Modeling algorithms: hill climbing, BFGS, pattern search, genetic algorithm, simulated annealing
• Initial experimental designs: random, Latin hypercube, full factorial, pre-calculated data set, central composite
• Adaptive sample selection: error-based (model comparison), density-based (Voronoi), hybrid (gradient), random
• Model selection: cross validation, validation set, AIC, in-sample error, model difference

In addition, the toolbox includes built-in support for high performance computing. Sample evaluation can occur locally (with the option to take advantage of multi-CPU or multi-core architectures) or through a distributed middleware (possibly accessed through a remote head-node). Currently the Sun Grid Engine (SGE), A Parameter Sweep Tool (APST) and LHC Computing Project (LCG) middlewares are supported, though other interfaces (e.g., for Condor) may be added. This is graphically illustrated in figure 3.4.



Figure 3.3: M3-Toolbox Plugins

As presented so far, the toolbox provides a whole suite of components that can be easily composed and benchmarked on different problems. However, even more innovative would be to provide a second level of adaptivity that makes the selection of components itself an automatic step. For example, the model type need not be specified a priori but could be changed dynamically as the modeling progresses. Another example is the integration of expert knowledge about the problem (e.g., using fuzzy theory). Some promising results have already been achieved in this respect (chapter 5) but much research remains.

3.4 Configuration

Each of the components available in the toolbox has its own set of parameters and options. In order for power-users to retain full flexibility these options should not be hidden, but at the same time they should not confuse more casual users. The M3-Toolbox provides an extensive configuration framework based on XML. Sensible defaults are provided but every modeling aspect can be adjusted as needed. Figure 3.5 shows an example toolbox configuration file. The corresponding simulator file is shown in figure 3.6. Figure 3.5 is purely an illustrative example. It describes the modeling of a native simulation code for a passive electric component. The outputs of the code are the four complex scattering parameters S11, S12, S21, S22. One run is defined that models the first output twice using rational functions (once with added noise) and the second output using Kriging models. Finally, the last output tag specifies that S11 and S12 should be modeled together in one ANN model (evolved using a genetic algorithm). Also, since the ANN model type cannot handle complex numbers directly, the modulus is modeled instead. Since training an ANN is expensive, a validation error is used instead of the globally defined cross validation measure. In addition, ANN models will be scored using an absolute RMS error function instead of the default maximum relative error.

3.5 Architecture

Now that we understand how the control passes through the different components of the toolbox we can discuss how each component is implemented. A detailed discussion of the structure of each class is out of scope for this thesis and would consume too much space considering the size of the M3-Toolbox code base.


Therefore we shall only discuss the three main parts: sample selection, sample evaluation and model building. These are discussed below together with a high level class diagram for each part. Note that the class diagrams shown in figures 3.7, 3.8, 3.9, and 3.10 are simplified versions; for reasons of clarity, not all classes and interactions are shown.

Figure 3.4: Sample Evaluation Backend

3.5.1 Sample selection

The first core subsystem of the toolbox is the one responsible for sample selection (see figure 3.7). Different initial experimental designs are available, each one sub-classing from the InitialDesign base class. New initial designs can be added by sub-classing this base class and implementing the generateSamples(.) method. In the case of sequential design the structure is exactly the same, except that the base class is now SampleSelector.
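As an illustration, the body of a custom generateSamples implementation might look like the fragment below, here producing a simple Latin hypercube. The exact method signature and the surrounding class scaffolding (sub-classing InitialDesign) are assumptions and are omitted; only the sampling logic is shown.

function samples = generateSamples(nSamples, nDims)
% Toy Latin hypercube design: each dimension is divided into nSamples
% equally sized strata, one point is placed per stratum, and the strata
% are randomly paired across dimensions. Points lie in the unit hypercube.
    samples = zeros(nSamples, nDims);
    for d = 1:nDims
        strata = randperm(nSamples)';                      % random stratum order
        samples(:, d) = (strata - rand(nSamples, 1)) / nSamples;
    end
end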

3.5.2 Sample evaluation

The subsystem that takes care of the evaluation of data points within the toolbox is called the Sample Evaluator (SE) (see figure 3.8). It is represented by the interface SampleEvaluator, and an abstract implementing class BasicSampleEvaluator is provided for convenience. This class is then further sub-classed to allow for different types of SEs that evaluate samples using: a data set, a local executable, a local executable that is distributed through some (remote) grid middleware, etc. Further SEs, like for example one that uses a remote database to evaluate data points, can easily be added by sub-classing BasicSampleEvaluator. SEs run in their own thread of control and maintain an output queue and a (pluggable) input queue (SampleSource) to communicate with the rest of the toolbox. The SE will take all points arriving on the input queue, evaluate them, and post the results on the output queue. Support for distributed computing is realized through an SE-Backend pair. Thus support for new distributed middlewares can easily be added by adding a new SE-Backend pair.

3.5.3 Model building

The building of models is coordinated by the class AdaptiveModelBuilder (see figure 3.9). Depending on the subclass, a different optimization algorithm is used to optimize the model parameters. Since the implementation of the model building algorithm depends on the type of model used, an intermediate class is needed to interface between the two (e.g., LSSVMGeneticInterface). In this way the concrete model type is decoupled from the abstract model optimization algorithm. Adding support for a new model type (e.g., regression trees) means having to implement a subclass of Model in addition to an Interface version of at least one of the model building algorithms.




Figure 3.5: Example toolbox configuration file




Figure 3.6: Example simulator configuration file



Figure 3.7: Sample Selection (simplified)

3.5.4 Others

Another subsystem not yet mentioned is that of the measures (= model selection algorithms). Its structure is shown in figure 3.10 and follows the usual design consisting of one base class, Measure, with a number of concrete sub-classes. Many other classes are available besides those mentioned here. Of primary importance is the M3 class; it is the core component of the toolbox and links together all three subsystems. Note that even here pluggability was kept in mind. While the M3 class is responsible for generating global, standalone surrogate models, an alternative implementation (M3Optim) is available as well. M3Optim links the other subsystems for surrogate driven optimization. As with all other components, other implementations (e.g., for time series prediction) can be added as well. The configuration subsystem also deserves mention. It is responsible for reading the configuration options from the XML files and ensuring that each class is instantiated with a ContextConfig and a NodeConfig object. The first holds the global context configuration relevant to the entire toolbox (e.g., which outputs are being modeled), while the second holds only the properties needed for that specific object (e.g., the numHiddenLayers property for an ANN model). Finally, the toolbox also incorporates a profiling framework that implements a generic way of collecting and plotting data about the modeling process (e.g., how the regularization parameter γ evolves versus the number of samples as an SVM is optimized using pattern search).

3.6 Related work

When it comes to tools for adaptive global surrogate modeling, the M3-Toolbox seems to be unique. Similar algorithms and tools exist but are heavily biased towards optimization only. Well known examples in this respect are the proprietary tools developed by LMS and Noesis (Optimus, Virtual.Lab). From academia, the more prominent projects are Geodise [148] from the University of Southampton, Nimrod/O [1] from Monash University, and the DAKOTA toolkit from Sandia National Labs [45]. Perhaps DAKOTA comes closest to the research presented here. Through the Surfpack approximation library [59], and some of its own classes, the toolkit can build different (global) surrogate model types and use them to drive the optimization of a complex simulation code (possibly relying on high performance computing through the GridApplicInterface class). For this DAKOTA defines a number of core abstractions: Strategy (the optimization strategy, i.e., direct or surrogate-driven), Minimizer (the optimization algorithm), Analyzer (the sampling algorithm), and Model (maps variables onto responses). These abstractions can be combined in many ways to allow nested and multilevel optimization.


3.7 Limitations - Scope

As with any system, there are a number of limitations that need to be kept in mind. Firstly, the toolbox can only be applied to input-output systems, or in other words, systems which can be represented as a function f : Rd → Cn for fixed d and n. Secondly, as is, the toolbox cannot be used to model dynamical systems where the objective is to predict the next time steps up to xk+l based on the previous values x1, ..., xk (i.e., time series prediction); in this setting, sampling as defined here would not make sense. However, it would not require much effort to extend the toolbox to provide this capability in the same, pluggable manner. Finally, the scope of the toolbox is to build global, standalone metamodels. While this is where the core focus of the toolbox lies, the popularity of surrogate driven optimization has led to the addition of a new component (see section 3.5.4) that caters for this. While the default surrogate-driven optimization algorithm implementation is not state-of-the-art, it provides a generic platform that can be extended to more complex algorithms.

3.8 Critique

Of course no approach is without criticism; the M3-Toolbox approach can be found lacking in the following ways:

• Non-trivial implementation. The design must be well thought out to minimize reliance on a specific model type, simulator or model generation algorithm. In addition, care must be taken to ensure the implementation as a whole remains user friendly and intuitive. This is very hard to get right.
• Number of parameters. Every addition of an adaptive feature means the addition of at least one extra parameter that needs to be set. Care must be taken that the problem is not simply moved from optimizing the model parameters to optimizing the algorithm hyper-parameters.
• High accuracy is not always needed. Often engineers are perfectly happy with the 1 or 2 significant digits that a straightforward polynomial regression can give them.
• There is no guarantee that an ultra-adaptive, self tuning approach will always outperform a simple full factorial design together with a slightly hand tuned model. In some cases the benefit of using the former may simply be nil or too small to be worth it.
• The availability of HPC resources weakens the argument in favor of sequential design (and metamodeling in general). If enough resources are available a brute force approach, though less elegant, can easily be used.

These criticisms are valid, though one should always keep in mind that:

• There is no silver bullet for software design; it is a hard problem that every project has to deal with.
• Not everybody has access to huge, state-of-the-art HPC resources. Even so, time is still money.
• Many problems have non-polynomial time or space complexity; no amount of Moore's law will ever change this.
• There will always be a need for small, scalable metamodels, be it to integrate into larger simulations or hardware controllers, or simply as a way to quickly respond to what-if type questions.

So, as always, the mantra should be “use the right tool for the right job”.

3.9 Conclusion

Having outlined the philosophy behind the toolbox, the general flow of control and the software architecture, we have come to the end of this chapter. This chapter was important since a large part of this thesis was dedicated to designing, updating, testing and bug-fixing the toolbox in order to make the extensions presented in chapter 5 and the tests in chapter 6 possible. Finally, while the M3-Toolbox is still under development, stable and snapshot releases are freely available for academic use at http://www.coms.ua.ac.be.



Figure 3.8: Sample Evaluation (simplified)



Figure 3.9: Model Building (simplified)

Figure 3.10: Measures (simplified)


Chapter 4

Evolutionary Modeling

It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change. - Charles Darwin, famous English naturalist (1809-1882)

Arguably the hardest problems in science and engineering are those that involve mimicking or understanding biological systems: auto-correction during DNA transcription, the muscle spindle reflex, language processing, vision and cognitive reasoning. The complexity of such systems is staggering, yet at the same time their implementation has often turned out to be breathtakingly elegant. This is especially true for seemingly trivial actions like walking or picking up a pen. Confucius' dictum rightfully comes to mind: “A common man marvels at uncommon things; a wise man marvels at the commonplace.” The mechanism nature has used to achieve this is evolution through natural selection. Man has attempted to crudely replicate this process through the use of Evolutionary Algorithms (EA) for global search. This chapter will introduce the different kinds of EA and their application to surrogate modeling.

4.1 Biological Foundations

Since EAs were inspired by processes in nature, an appreciation of these mechanisms is necessary. Given a population of organisms, evolution can occur only if the following two conditions are satisfied:

1. In the population there must exist variation for some trait and this variation must be heritable. Examples are beak size, skin complexion, eye color, ...
2. There must be differential survival and reproduction associated with the possession of that trait.

The first condition requires each trait to have a genetic basis (genotype), that this genotype varies between individuals, and that it can be passed on to offspring. The second condition states that the genetically encoded trait must have a phenotypic expression that incurs some advantage or disadvantage for survival and reproduction in an environment. Together these conditions form the basis for adaptation through Natural Selection:

I have called this principle, by which each slight variation, if useful, is preserved, by the term Natural Selection. - Charles Darwin, The Origin of Species

Though the terms “Natural Selection” and “Evolution” are inseparably linked with Darwin, he was not the only one to think along these lines. While Darwin had been developing the idea of natural selection for many years, it was only when he received a draft of a paper by Alfred Russel Wallace, detailing very much the same ideas, that he decided to come out into the open. Darwin and Wallace jointly presented the theory of evolution by natural selection to the Linnean Society of London in separate papers in 1858. However, it was not until the publication of Darwin's The Origin of Species a year later that the theory gained the full attention of the scientific world.


4.1.1 Heritable variation

It cannot be stressed enough that variation must be heritable (the first condition for evolution). Each phenotypic effect must have a matching code in the organism's genome, and this code can be passed on to future generations. Traits with no genetic basis at all (acquired traits) play no role in evolution since they have no effect on offspring and are thus not susceptible to selection. The theory in which acquired traits do play a role in evolution is known as Lamarckian evolution1 (vs Darwinian evolution), but no evidence for this theory has ever been found and the scientific consensus is that there never will be. While this genetic aspect may be obvious in retrospect, in Darwin's time it posed a great problem since he, unaware of Mendel's work of 1865, could not explain how traits were inherited or blended. It wasn't until Mendel's work on genetics was rediscovered in 1900 and reconciled with Darwinian evolution in 1930 that the foundations for the current theory, the modern evolutionary synthesis or neo-Darwinism, were laid. Of course, simply requiring that traits have a heritable, genetic basis is not enough to explain the enormous diversity of species that exist in the world. If each parent simply passed on an exact copy of its genes, there would never be any new variation; the gene pool would remain static and no new species would arise. Therefore, as genetic information is passed on to offspring it undergoes mutation and, in some cases, recombination.

4.1.1.1 Mutation

Mutation is a change in the genetic information and can be caused by:

• copying errors in the genetic material during cell division
• exposure to ultraviolet or ionizing radiation
• chemical mutagens or viruses
• deliberate change under cellular control during processes such as meiosis or hypermutation

Mutation can occur in any bodily (somatic) cell, but it only contributes to evolution if it occurs in the germ cells responsible for reproduction (we only consider multi-cellular organisms, excluding plants). Such mutations are called germline mutations (vs somatic mutations). Mutation is a mechanism by which new information can be added to the gene pool. Figure 4.1 illustrates some examples of mutation.

4.1.1.2 Recombination

The second genetic operator is recombination, also referred to as crossover. It only occurs in sexual reproduction (as opposed to asexual reproduction). In humans, recombination occurs during meiosis, the process during which a diploid cell (= a cell containing the full 46 chromosomes, 23 from each parent) divides into four haploid cells, each containing only 23 chromosomes. During sexual reproduction, some of these haploid cells will then fuse with haploid cells of the other sex to form a zygote and, eventually, a new individual. Recombination occurs during prophase I of meiosis: the chromosomes from each parent pair up and randomly exchange information through crossing over. The 23 pairs then break up again and the 46 chromosomes divide themselves randomly (but equally) over two new haploid cells, each containing only 23 chromosomes. These cells are then duplicated in a way similar to somatic cell division (mitosis) to form the final four haploid cells. An important feature of these four cells is that the combination of genes they carry on their 23 chromosomes is a unique mix of the genes present in the original single cell. Thus recombination is a mechanism that combines existing variation in the gene pool but cannot create new variation. Crossover is illustrated in figure 4.2.

4.1.2 Differential survival and reproduction

Having heritable variation available for a trait is only one side of the equation. It is how the expression of this trait helps the organism gain reproductive success that drives evolution. The advantage of having a particular trait is completely determined by the environment (climate, geography, predators, available resources, number and species of other organisms, ...) the organism finds itself in.

1 As a historical side note, the inheritance of acquired characteristics is not the aspect of his theory that Lamarck himself emphasized; he simply took over the conventional wisdom of his time and grafted onto it other principles like ‘striving’ and ‘use and disuse’ [36].



Figure 4.1: Examples of mutation

Traits that make the organism better adapted to its environment, i.e., increase the chances of successful reproduction, will have a larger probability of being passed on to the next generation. The classical example is that of the peppered moth in England. Prior to 1800, the moth typically had a light pattern which camouflaged it against the light tree trunks and lichens it rested upon. With the advent of the industrial revolution, however, soot and other industrial waste darkened the trees and killed off the lichens. The lightly colored moths had suddenly become more visible to predators, decreasing their chances of surviving until reproductive age. In contrast, the darker colored moths suddenly found themselves better camouflaged and as a result their percentage of the population increased. When discussing such examples it is common to use the following terminology: “Natural selection selected against light color”, “There is positive selective pressure for dark wings”, ... To the uninformed reader this may make it seem as if Natural Selection were a directed process, working towards some unidentified goal. Nothing could be further from the truth. At its core, evolution through natural selection is a stochastic process with no intrinsic direction or preference whatsoever. This undirectedness implies that the solutions found by natural selection are by no means guaranteed to be optimal in any way. Or as Darwin so vividly put it in a letter to his friend Joseph Hooker: “What a book a devil's chaplain might write on the clumsy, wasteful, blundering, low, and horribly cruel works of nature!”

4.2 Evolutionary Algorithms

4.2.1 History

Given the success of evolution by natural selection, scientists were quick to try to replicate this success for man-made problems. The use of EAs concentrated on two domains: modeling and validation of biological evolution, and global search. This thesis is only concerned with the latter. The computer simulation of evolution dates back to the early 1950s, when the Norwegian scientist Nils Aall Barricelli was studying artificial life at the Institute for Advanced Study in Princeton, NJ [49]. A few years later, in 1958, the Australian quantitative geneticist Alex Fraser published his seminal work "Simulation of genetic systems by automatic digital computers".



Figure 4.2: Example of crossover

Fraser's efforts in the 1950s and 1960s had a profound impact on the development of computational models of evolutionary systems. Another key player at the time was the American scientist Lawrence J. Fogel, who is known as the father of evolutionary programming [60]. Though many researchers picked up on the idea, it wasn't until the early 1970s, when John Holland et al. at the University of Michigan introduced genetic algorithms (GA) and Ingo Rechenberg and Hans-Paul Schwefel from the Technical University of Berlin introduced evolution strategies, that EAs became widely recognized. These areas developed separately for about 15 years and were joined by genetic programming in the 1980s (Stephen F. Smith (1980) [177], Nichael L. Cramer (1985) [32], D. Dickmanns (1987) [41]) and 1990s (John R. Koza [100]). EAs are part of a wider class of biologically inspired algorithms (sometimes referred to as soft computing). Other members of this class include neural networks, fuzzy theory, Bacteriologic Algorithms, Harmony Search, Ant Colony Optimization and Particle Swarm Optimization.

4.2.2 Important remarks

Before we continue the reader should be reminded that EAs, like other soft computing techniques (e.g., neural nets), are extreme simplifications of their biological counterparts and results/conclusions obtained in the artificial setting can usually not be generalized to the biological setting. In addition it cannot be stressed too strongly that an EA is not a random search for a solution to a problem. EAs use stochastic processes, but the result is distinctly non-random (better than random) [68]. Finally, a common misconception is that EAs don’t require any structure in the search space. This is definitely not the case, especially if a recombination operator is involved.

4.2.3 Types

Five major types of EAs can be identified [68]:

1. Genetic Algorithms (GA)
2. Evolutionary Programming (EP)
3. Evolution Strategies (ES)
4. Classifier Systems (CS)
5. Genetic Programming (GP)

Gray zones exist between the different classes but all share a common conceptual base of simulating the evolution of individual structures via processes of selection, recombination, mutation and reproduction. These processes are driven by the performance of the individual structures as defined by an environment (fitness function).


Algorithm 2 The Canonical Genetic Algorithm

//start with an initial time
t := 0;
//initialize a random population of individuals
initpopulation P(t);
//evaluate fitness of all initial individuals of population
evaluate P(t);
//test for termination criterion (time, fitness, etc.)
while not done do
    //increase the time counter
    t := t + 1;
    //select a sub-population for offspring production
    P' := selectparents P(t);
    //recombine the "genes" of selected parents
    recombine P'(t);
    //perturb the mated population stochastically
    mutate P'(t);
    //evaluate its new fitness
    evaluate P'(t);
    //select the survivors from actual fitness
    P := survive P,P'(t);
end

The art of applying EAs is finding a good balance between exploration (global search) and exploitation (local search) when combining the different processes. In this thesis we are concerned with genetic algorithms: a population of individuals, represented by their genome, is evolved through the use of selection, recombination (crossover) and mutation operators for a fixed number of generations.

4.3 The Genetic Algorithm (GA)

4.3.1 The Canonical GA

The GA is probably the most well known EA and is used as an algorithm for global search, with optimization being the most obvious application. The core algorithm, as introduced by Holland [71], is referred to as the Canonical Genetic Algorithm (CGA) and is presented in pseudo code in algorithm 2 ([68]). We adopt the notation from [162]. The population of the CGA consists of an n-tuple of binary strings bi of length l, where the bits of each string are considered to be the genes of an individual chromosome. Each individual bi represents a feasible solution in the search space, with the quality of the solution determined by a fitness function f. Selection of individuals to reproduce is performed proportional to their fitness. The probability that individual bi is selected from the tuple (b1, b2, ..., bn) to be a member of the next generation is given by

\[
P\{b_i \text{ is selected}\} = \frac{f(b_i)}{\sum_{j=1}^{n} f(b_j)} > 0 \tag{4.1}
\]

The population is initialized with random bit strings and individuals are modified by crossover and mutation operators. Mutation operates independently on each bi by randomly flipping one or more bits. The event that the j-th bit of the i-th individual is flipped is stochastically independent and occurs with probability pm. Crossover is applied to randomly paired individuals with probability pc. Usually single-point crossover is used: a recombination point is chosen at random, and crossover on the following two parents



111|11111
000|00000

produces the following two offspring

11100000
00011111
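The canonical GA described above fits in a few lines of Matlab. The sketch below uses a toy fitness function (the number of ones in the string); the parameter values are arbitrary and roulette selection is written out directly so that no extra toolboxes are needed.

% Canonical GA: binary strings, fitness-proportional selection (eq. 4.1),
% single-point crossover with probability pc, bit-flip mutation with pm.
n = 50; l = 20; pc = 0.7; pm = 0.01; nGenerations = 100;
fitness = @(P) sum(P, 2);                    % toy fitness: count the ones
P = randi([0 1], n, l);                      % random initial population
for t = 1:nGenerations
    prob = fitness(P) / sum(fitness(P));     % selection probabilities, eq. (4.1)
    cumProb = cumsum(prob);
    sel = arrayfun(@(r) find(cumProb >= r, 1), rand(n, 1));
    P = P(sel, :);                           % fitness-proportional (roulette) selection
    for i = 1:2:n-1                          % single-point crossover on pairs
        if rand < pc
            cut = randi(l - 1);
            tmp               = P(i,   cut+1:end);
            P(i,   cut+1:end) = P(i+1, cut+1:end);
            P(i+1, cut+1:end) = tmp;
        end
    end
    P = xor(P, rand(n, l) < pm);             % independent bit-flip mutation
end
[bestFit, bestIdx] = max(fitness(P));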

4.3.2 Extensions to the CGA

The CGA presented in the previous subsection is the GA in its simplest form. As such it is useful for studying the theoretical properties of GAs, but for most applications it is extended in one or more of the following ways:

• non-bit string representations (e.g., integers, floating point numbers, character strings, ...)
• adaptive parameters (e.g., varying mutation rates, crossover rates, population size, ...)
• other selection functions (e.g., tournament selection, stochastic universal sampling, ...; a small sketch of tournament selection is given below)
• speciation (see section 4.4)
• ...

In addition the GA may be combined with Lamarckian learning by performing a local optimization on every generated individual. This approach is referred to as a memetic algorithm [130].
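The sketch below shows tournament selection, one of the alternative selection functions listed above: k individuals are drawn at random and the fittest of them is selected. The function name and signature are illustrative.

function winner = tournamentSelect(fitness, k)
% Return the index of the winner of a size-k tournament.
    contestants = randi(numel(fitness), k, 1);   % k random entrants (with replacement)
    [~, best] = max(fitness(contestants));
    winner = contestants(best);
end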

4.3.3 Theoretical foundations

The fundamental theorem of GAs is Holland's Schema Theorem [71]. A schema is a bit string with one or more don't care values. For example, the schema 100*10 has one don't care value and represents two possible bit strings, 100110 and 100010. The order of a schema s is defined as the number of non-don't care positions, e.g., in the previous example the order o(s) = 5. The defining length δ of a schema is defined as the distance between the first and the last fixed string positions. It characterizes the compactness of information in a schema. For example, for s = ***001*110, δ(s) = 10 − 4 = 6. Given these definitions the Schema Theorem can be stated as follows:

Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.

Mathematically, this can be formulated as:

\[
\xi(S, t+1) \geq \frac{\xi(S,t)}{\bar{F}(t)} \cdot \left(1 - p_c \cdot \frac{\delta(S)}{l-1} - o(S) \cdot p_m\right)
\]

where F̄(t) is the average fitness of the population and ξ(S, t) is defined as the expectation of the number of bit strings matching schema S at time t. Based on the Schema Theorem, Goldberg [60] proposes the Building Block Hypothesis [131]:

A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.

Goldberg states in [60]:

Short, low-order, and highly fit schemata are sampled, recombined, and re-sampled to form strings of potentially higher fitness. In a way, by working with these particular schemata (the building blocks), we have reduced the complexity of our problem; instead of building high-performance strings by trying every conceivable combination, we construct better and better strings from the best partial solutions of past samplings.
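The two schema measures are easy to compute; the helper below returns the order and defining length of a schema given as a character string with '*' as the don't care symbol (for '***001*110' it returns o = 6 and δ = 10 − 4 = 6, matching the example above). The function name is illustrative.

function [o, delta] = schemaStats(s)
% Order and defining length of a schema string, e.g. schemaStats('***001*110').
    fixed = find(s ~= '*');          % indices of the fixed (non don't care) positions
    o     = numel(fixed);            % order: number of fixed positions
    delta = fixed(end) - fixed(1);   % defining length: distance between first and last fixed position
end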


So, the schema theorem says that the GA will produce increasingly fit individuals, where the better individuals match short, low-order schemata. However, the main question is: does the CGA converge to the globally best solution? Intuitively we can already see a problem if we consider a search space where the best individual is a long, high-order schema (e.g., 01111011). Crossover and mutation will easily break the schema, thus the GA may never find the best individual. Work by Rudolph in [162] confirms that this is the case: convergence of the CGA cannot be guaranteed. [162] proves that, to ensure convergence, an elitist GA must be used where the best individual found over time is manually preserved. Note though, that Rudolph's proofs only show that the global solution can be found; they say nothing about the time needed to reach it. For more information on the convergence properties of the CGA see [113]. In sum, the Schema Theorem proves that the CGA will make progress when searching the parameter space. However, the theorem is plagued by a number of problems that limit its practical application:

• No proof of convergence.
• Only applicable to the CGA (bit string representation).
• The Schema Theorem is an inequality instead of an equality2; it only provides a lower bound for the expected number of schemata. This makes it difficult to use schema theories to predict the future behavior of a GA even a single generation ahead [151].

Though many extensions to the theorem have been developed, the schema theorem and its variants have been widely criticized, with many researchers believing that “schema theorems are nothing more than trivial tautologies of no use whatsoever” [151]. This is arguably even more so for the Building Block Hypothesis, where Goldberg himself admits that “While these claims seem perfectly reasonable, how do we know whether they hold true or not” [60]. In conclusion, while many of these criticisms may be unjustified, the theoretical foundations of GAs are shaky to say the least and the debate is far from settled.

4.3.4 Applications

Theoretical foundations aside, GAs have found widespread use in many domains. The main disadvantage of GAs is the high number of function evaluations needed for convergence. Therefore they are best suited to problems where there is little problem specific information that can be exploited and where traditional, usually gradient-based, algorithms perform poorly. Such problems are typically characterized by high noise, NP time complexity, many local minima (multi-modality), and dependencies between variables (epistasis). Within this domain GAs have performed very successfully, with many applications in transportation [185], electronics [74], vehicle design [215], scheduling [174], data fitting [30], and many others.

4.4 Parallel Genetic Algorithms

4.4.1 Introduction

The fact that GAs are population based makes them inherently parallel. The total population can be divided into different sub-populations evolving in parallel, each scheduled on a different CPU. The motivation for dividing up the population need not be a purely computational one. For example, from a biological standpoint it makes sense to consider speciation: genomes that differ considerably from the rest of the population are automatically split off into a separate sub-population and continue to evolve independently, thus forming a new species. In this way the parameter space is searched more efficiently. The idea of speciation, like many other concepts and operators, was pioneered by Holland in the early seventies [71]. The terms Parallel Genetic Algorithms (PGA) or Distributed Genetic Algorithms (DGA) [189] usually refer to any case in which the population is divided up in some way, for whatever reason. Strictly speaking the terms refer to the actual implementation of the GA on (massively) parallel hardware or on a grid (e.g., [115]). Unfortunately though, the terminology for the different models varies between authors and can be very confusing [139]. In this thesis we are only interested in PGAs on the model level (i.e., different speciation models); parallelism for computational reasons, including scheduling, will not be considered.

2 The theorem neglects the small probability that a string belonging to the schema s will be created from nothing by a mutation of a string that did not belong to s in the previous generation.


4.4.2 Island model

The island model [204] is probably the most well known PGA. Different sub-populations exist (initialized differently) and sporadic migration can occur between islands allowing for the exchange of genetic material between species and inter-species competition for resources. This model is also known as the migration model [6] or stepping stone model [139], depending on the migration constraints. As stated above, the population is divided into a number of independent subpopulations, so-called demes, with inter-deme migration. Selection and recombination are restricted per deme, such that each sub-population may evolve towards different locally optimal regions of the search space (called niches in the terminology of Goldberg [60]). Depending on the size and number of demes the model can be coarse grained or fine grained [139]. The migration model introduces five new parameters: the migration topology, the migration frequency, the number of individuals to migrate, a strategy to select the emigrants, and a replacement strategy to incorporate the immigrants. The island model is illustrated in figure 4.3 for two topologies.
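A minimal sketch of the island model with a ring migration topology is given below. evolveOneGeneration stands for any single-population GA step (for instance the canonical GA loop sketched earlier), the toy fitness again counts ones, and all names and parameter values are illustrative.

% Island model: nIslands sub-populations evolve independently; every
% migInterval generations the best individual of each island replaces the
% worst individual of its neighbour on a ring topology.
nIslands = 4; migInterval = 10; nGenerations = 100;
fitness = @(P) sum(P, 2);                          % toy fitness: count the ones
islands = cell(nIslands, 1);
for k = 1:nIslands
    islands{k} = randi([0 1], 30, 20);             % independently initialized demes
end
for t = 1:nGenerations
    for k = 1:nIslands
        islands{k} = evolveOneGeneration(islands{k});   % selection/crossover/mutation
    end
    if mod(t, migInterval) == 0                    % sporadic migration step
        for k = 1:nIslands
            dst = mod(k, nIslands) + 1;            % next island on the ring
            [~, bestIdx]  = max(fitness(islands{k}));
            [~, worstIdx] = min(fitness(islands{dst}));
            islands{dst}(worstIdx, :) = islands{k}(bestIdx, :);
        end
    end
end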

Figure 4.3: Ring and grid migration topologies in the Island Model

A famous real-world example of this is Darwin's finches (also known as the Galapagos finches). These are 13 or 14 different but closely related species of finches Charles Darwin collected on the Galapagos Islands during the Voyage of the Beagle. Darwin later established that each species was uniquely related to individual islands. The geographical isolation was such that each species could adapt to the environment on its specific island, from a common ancestor, while still being able to migrate to a different island. The following quote from chapter 17 of Darwin's The Voyage of the Beagle illustrates this:

The remaining land-birds form a most singular group of finches, related to each other in the structure of their beaks, short tails, form of body and plumage: there are thirteen species, which Mr. Gould has divided into four subgroups. All these species are peculiar to this archipelago; and so is the whole group, with the exception of one species of the sub-group Cactornis, lately brought from Bow Island, in the Low Archipelago. Of Cactornis, the two species may be often seen climbing about the flowers of the great cactus-trees; but all the other species of this group of finches, mingled together in flocks, feed on the dry and sterile ground of the lower districts. The males of all, or certainly of the greater number, are jet black; and the females (with perhaps one or two exceptions) are brown. The most curious fact is the perfect gradation in the size of the beaks in the different species of Geospiza, from one as large as that of a hawfinch to that of a chaffinch, and (if Mr. Gould is right in including his sub-group, Certhidea, in the main group) even to that of a warbler. The largest beak in the genus Geospiza is shown in Fig. 1, and the smallest in Fig. 3; but instead of there being only one intermediate species, with a beak of the size shown in Fig. 2, there are no less than six species with insensibly graduated beaks. The beak of the sub-group Certhidea, is shown in Fig. 4. The beak of Cactornis is somewhat like that of a starling, and that of the fourth subgroup, Camarhynchus, is slightly parrot-shaped. Seeing this gradation and diversity of structure in one small, intimately related group of birds, one might really fancy that from an original paucity of birds in this archipelago, one species


had been taken and modified for different ends. In a like manner it might be fancied that a bird originally a buzzard, had been induced here to undertake the office of the carrion-feeding Polybori of the American continent.

"Mr. Gould" in the quote refers to John Gould, the famous English ornithologist.

4.4.3 Cellular model

Another model is the cellular model [61] (also known as the diffusion model [6] or massively parallel GA [139]). Instead of parallelism on the population level, the diffusion model concentrates on interactions of individuals within a single population. In this case parallelism is performed on the level of individuals. Communication (selection, recombination) of individuals is restricted to a local neighborhood structure. This type of separation is referred to as isolation by distance [209]. This way, advantageous genetic information may arise at different points in the topological interaction structure and spread slowly over the population. In this case the neighborhood size and the interaction structure play an important role for maintaining diversity [6]. While there are no explicit islands, there is the possibility of similar effects. The cellular model is illustrated in figure 4.4 for a neighborhood distance of one.

Figure 4.4: Cellular Model

4.4.4 Fitness sharing

A third model, the one originally proposed by Holland [71] and applied to the 2-arm bandit problem, revolves around the sharing concept. It is inspired by the observation of positive assortative mating in nature (like mates with like). The model is commonly known as fitness sharing. A sharing function s(d) is defined to determine the neighborhood and degree of similarity between individuals in the population (e.g., if individuals are bit strings, s(d) can be defined as s : N → [0, 1], with d proportional to the Hamming distance). Each individual that belongs to the same species (as determined by s(d)) then receives the same fitness value. See for example the overview in [164].
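To make the sharing idea concrete, the following Python sketch (purely illustrative; the triangular form of s(d), the sharing radius and the fitness convention are assumptions, not toolbox code) computes shared fitness values in the classic Goldberg-Richardson form, dividing each raw fitness by a niche count derived from pairwise Hamming distances:

import numpy as np

def sharing(d, sigma_share=5.0, alpha=1.0):
    # triangular sharing function: 1 at d = 0, decays to 0 at d = sigma_share
    return np.where(d < sigma_share, 1.0 - (d / sigma_share) ** alpha, 0.0)

def shared_fitness(raw_fitness, genomes):
    # divide each raw fitness by its niche count (sum of sharing values)
    genomes = np.asarray(genomes, dtype=float)
    dist = (genomes[:, None, :] != genomes[None, :, :]).sum(axis=2)  # Hamming distances
    niche_count = sharing(dist).sum(axis=1)
    return np.asarray(raw_fitness) / niche_count

# toy usage: four 8-bit individuals, the first two form one niche
population = [[0]*8, [0]*7 + [1], [1]*8, [1]*7 + [0]]
print(shared_fitness([4.0, 4.0, 3.0, 3.0], population))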

4.4.5 Others

The three model types listed here are the major categories, though many variations exist such as: inbreeding with intermittent crossbreeding, overlapping demes, dynamic demes, segregative GA, crowding, preselection, co-evolutionary algorithms, hierarchical GA, Cohort GA, the community model and the plant pollination model. Additionally, many hybrid schemes are possible where different aspects of each model are combined. See for example [179].

4.4.6 Applications

The different speciation models have also found widespread use. For example, [88] uses speciated evolution for the inference of Bayesian networks. Another example is NEAT [181], a platform that uses fitness sharing to


evolve neural networks. Other uses include scheduling [136, 14], surrogate driven optimization [57] (using a hierarchy of island models), and vehicle concept selection in aerospace [17]. An extensive treatment of all the applications of EAs is out of the scope of this thesis; excellent references can be found in [3, 139, 99].

4.5 Evolution of surrogate models

4.5.1 Motivation

In this section we apply speciated evolution, in the form of the island model, to the problem of surrogate model selection. Popular surrogate model types include Radial Basis Functions (RBF), rational functions, Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Kriging models. Different model types are preferred in different domains. For example, rational functions are widely used by the EM community [110, 38] while ANNs are preferred for hydrological modeling [90, 180]. Differences in model type usage are mainly due to practical reasons³:

1. Trying out more model types simply requires more time (installing packages, tuning model parameters, ...), time that may not be available.
2. The expertise is not available.
3. The designer sticks to what he knows best ("Not invented here" syndrome).
4. Lack of user friendly software packages.
5. The final application restricts the designer to one particular type.

Reason (2) is related to an important point: there is no such thing as an inherently 'good' or 'bad' model. A model is only as good as the data it is generated from and the expert that built it. Some models are more sensitive to changes in their parameters than others, and it usually takes a great deal of experience to know how these parameters should be set. To complicate things even further, the usefulness of a model and the setting of its parameters also depend on the problem at hand. Therefore, claims that a particular model type is superior to others should always be taken with a grain of salt. This is related to the so-called No-Free-Lunch theorems [207]. One of the many formulations is as follows⁴:

'...it is impossible to justify a correlation between reproduction of a training set and generalization error of the training set using only a priori reasoning. As a result, the use in the real world of any generalizer which fits a hypothesis function to a training set (e.g., the use of back-propagation) is implicitly predicated on an assumption about the physical universe.' [208]

In essence, the theorems explain why, averaged over the set of all possible learning problems and making no prior assumptions, each algorithm performs as well as any other: any difference on a particular problem is due to the bias built into the algorithm.

Reason (4) is also worth emphasizing. Often the designer is willing to try out different models but lacks the time, expertise, ... to implement these models from scratch. Consequently there is a need for ready-made tools and libraries that are easy to use and easy to integrate within an existing design process. There is ample literature available that benchmarks different model types on toy problems [173, 80, 137, 171, 153, 212, 27, 200]. However, examples where trying out different models is actually part of the modeling cycle, towards solving a real-world problem, are less numerous [197, 107].

Selecting the model type is only part of the problem; the complexity of the model needs to be chosen as well. Each model type has a set of tunable parameters θ that control the complexity of the model M and thus the bias-variance trade-off. For example, with rational functions this could be the degrees dn, dd of the numerator and denominator, with ANNs the number of units per hidden layer Hi, and with SVMs the kernel function K(x1, ..., xn), the regularization constant γ, and the kernel parameters (e.g., the spread σ in the case of an RBF kernel). In general, finding the optimal bias-variance trade-off is hard. All too often this is done by trying out different parameters manually and using those that work best (e.g., [214]). Thus, in both cases there is little theory that can be used as a guide. It is in this setting that the evolutionary approach can be expected to do well. In this contribution we describe the application of a single GA with speciation to both problems: the selection of the surrogate type and the optimization of the surrogate model parameters.

³ In some cases, however, knowledge of the physics of the underlying system can make a particular model type preferred. For example, rational functions are popular in electro-magnetic applications since theory is available that can be used to prove that a rational model conserves certain physical quantities (e.g., enforcement of passivity [63]).
⁴ See also http://www.no-free-lunch.org/.

4.5.2 Related work

The evolutionary generation of regression models for given input-output data has been widely studied in the Genetic Programming community [104, 184, 140]. Given a set of mathematical primitives (+, sin, exp, /, x, y, ...), the space of symbolic expression trees is searched to find the best function approximation. The application of GAs to the optimization of the model parameters of a single model type (homogeneous evolution) has also been common [26, 112]. The same is true for the use of surrogate models in GA-based optimization of expensive simulators (see the work by Ong et al. [142, 143]). However, the authors were unable to find any evidence of the use of EAs for the evolution of multiple surrogate model types simultaneously (heterogeneous evolution). In all the related work that the authors considered, speciation was always constrained to one particular model type (e.g., neural networks [181]); the model type selection problem was still left as an a priori choice for the user. The authors therefore believe they are the first to tackle the model parameter optimization and model type selection problems in one speciated GA.

4.6 Genetic model builder

Recall from section 3.2 that we made a distinction between the surrogate model and the algorithm used to optimize its parameters (= the model builder), for example, using simulated annealing to optimize the kernel parameters of an SVM. Refer to the previous chapter for details on the toolbox software architecture. In order to enable speciated evolution, the following tasks needed to be tackled:

1. Reworking of the software architecture to make the implementation of a speciated GA possible
2. Identification or implementation of a suitable GA variant
3. Implementation of a GeneticModelBuilder class
4. Implementation of different interface classes, one for each supported model type (i.e., picking a representation, designing genetic operators, ...)
5. Testing and debugging of homogeneous evolution
6. Implementation of a heterogeneous genetic interface class
7. Testing and debugging of heterogeneous evolution

These tasks were completed and the results can be found in chapter 5. A core task was identifying a suitable GA variant. After an extensive survey of the available software it was decided to use the GA as implemented in the Matlab Genetic Algorithm and Direct Search Toolbox (GADS). This choice was primarily a pragmatic one. The way the GA is implemented does not allow as much flexibility as would be preferred (see section 6.2), yet it provides an implementation of the island model, offers a large number of selection functions, and integrates easily with the M3-Toolbox. Thus the advantages outweighed the disadvantages; implementing an equivalent GA from scratch would have taken far too much time.

4.7 Heterogeneous evolution

4.7.1 Algorithm

We now discuss the concrete GA for heterogeneous evolution as implemented in the M3-Toolbox. As stated above, the algorithm is based on the Matlab GADS toolbox and is shown in Algorithm 3.


Algorithm 3 Implemented Genetic Algorithm (island model)
01. M = {M1, ..., Mn}
02. demei = initPop(Mi), i = 1, ..., n
03. P = ∪(i=1..n) demei
04. scores = ∅
05. gen = 1
06. while (¬terminationCriteriaReached) do
07.   foreach demei ⊆ P do
08.     scoresi = fitness(demei, X, E)
09.     elite = sort([scoresi; demei])|1:k
10.     parents = select(scoresi, demei)
11.     [mutChildren, xoChildren] = createOffspring(parents, pm, pc)
12.     demei = elite ∪ mutChildren ∪ xoChildren
13.   end
14.   if (mod(gen, mi) = 0)
15.     P = migrate(P, scores, mf, md)
16.   end
17.   gen = gen + 1
18. end

The algorithm works as follows. An initial sub-population demei is created for each model type Mi. Each deme is then allowed to evolve according to an elitist GA. The fitness function calculates the quality of the model fit on a set of n-dimensional data points, according to an error measure E, i.e., fitness : R^n × C^n → R. Parents are selected according to some selection algorithm (e.g., tournament selection) and offspring are generated through the mutation and recombination genetic operators (with respective probabilities pm, pc). The current deme population is then replaced with its offspring together with the k elite individuals. Once every deme has gone through a generation, migration between demes is allowed to occur at migration interval mi, with migration fraction mf and migration direction md (a ring topology is used). The migration strategy is as follows: the l = |demei| · mf fittest individuals of demei replace the l worst individuals in the next deme (defined by md). As in [149], migrants are duplicated, not removed from the source population. The algorithm iterates until some stopping criterion has been reached. Note that in this contribution we are primarily concerned with inter-model speciation (speciation as in different model types). Intra-model speciation (e.g., through the use of fitness sharing within one model type) was not done but could easily be incorporated.
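To make the control flow of Algorithm 3 concrete, a compact Python sketch is given below. This is not the GADS/M3-Toolbox implementation: the operator callables are placeholders, migration is simplified to a single-direction ring, and the fitness callable is assumed to already wrap the data set X and the error measure E.

def island_ga(model_types, init_pop, fitness, select, create_offspring,
              n_gen=15, k_elite=1, p_m=0.2, p_c=0.7, m_i=12, m_f=0.1):
    # one deme (sub-population) per surrogate model type
    demes = [init_pop(mt) for mt in model_types]
    for gen in range(1, n_gen + 1):
        for i, deme in enumerate(demes):
            scores = [fitness(ind) for ind in deme]          # lower score = better fit
            ranked = [ind for _, ind in sorted(zip(scores, deme), key=lambda t: t[0])]
            elite = ranked[:k_elite]
            parents = select(scores, deme)
            mut_kids, xo_kids = create_offspring(parents, p_m, p_c)
            demes[i] = elite + mut_kids + xo_kids
        if gen % m_i == 0:                                   # ring migration every m_i generations
            snapshots = [sorted(d, key=fitness) for d in demes]
            for i, src in enumerate(snapshots):
                n_mig = max(1, int(round(m_f * len(src))))
                migrants = list(src[:n_mig])                 # migrants are copied, not removed
                j = (i + 1) % len(demes)
                target = sorted(demes[j], key=fitness)
                demes[j] = target[:-n_mig] + migrants        # replace the worst individuals
    return demes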

4.7.2 Extinction prevention

Initial tests with this algorithm exposed a major shortcoming, specifically due to the fact that we are evolving models. Since not all data is available at once but trickles in, k samples at a time, models that need a reasonable-to-large number of samples to work well are at a huge disadvantage initially. Since they perform badly in the beginning, they quickly get overwhelmed by models that do not have this problem. In the extreme case, where they are driven extinct, they never get a fair chance to compete once sufficient data does become available, even though they may have been the superior choice had they still been around⁵. Therefore an extinction prevention (EP) algorithm was introduced. EP is a custom extension added to the GA by the author to combat this problem. It works by monitoring the population and recording, each generation, the number of individuals of each model type. If this number falls below a certain threshold (2 was used here) for a certain model type, the EP algorithm steps in and ensures that the model type has its numbers replenished up to the threshold. This is done by re-inserting the last models of that type that disappeared (making copies if necessary). The re-inserted models evenly replace the worst individuals of the other model types (those that do have sufficient numbers). Strictly speaking, EP goes completely against the philosophy of an evolutionary algorithm: by using it we are manually working against selection, preserving weak individuals. However, in this setting it seems a fair measure to take and should improve results in some cases.

⁵ As an example, this observation was often made when using rational models on electro-magnetic data.
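The bookkeeping behind extinction prevention can be sketched as follows (illustrative Python only; the archive of last-seen models, the lower-is-better fitness convention and the replacement order are assumptions rather than the exact toolbox behaviour):

from collections import defaultdict
import copy

def extinction_prevention(population, archive, model_type_of, fitness, threshold=2):
    # archive maps model type -> previously seen individuals of that type (most recent last);
    # fitness is an error measure, so higher values are worse
    counts = defaultdict(list)
    for ind in population:
        counts[model_type_of(ind)].append(ind)

    # individuals of well-represented types, worst first, are the replacement victims
    victims = sorted((ind for t, inds in counts.items() if len(inds) > threshold
                      for ind in inds), key=fitness, reverse=True)

    for mtype, saved in archive.items():
        missing = threshold - len(counts.get(mtype, []))
        for j in range(max(0, missing)):
            if not victims or not saved:
                break
            victim = victims.pop(0)
            clone = copy.deepcopy(saved[-(j % len(saved)) - 1])   # last model that disappeared
            population[population.index(victim)] = clone
    return population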


4.7.3 Heterogeneous recombination

The attentive reader will have noticed that one major problem remains with the implementation as discussed so far. The problem lies in the genetic operators, more specifically in the crossover operator. Migration between demes means that model types will mix, so a set of parents selected for reproduction may contain more than one model type. The question then arises: how to perform recombination between two models of completely different types? For example, how does one meaningfully cross an Artificial Neural Network with a rational function? The solution we have employed here is to use ensembles. If two models of different types are selected to recombine, an ensemble is created with the two models as ensemble members. The danger with this approach, however, is that the population may quickly be overwhelmed by large ensembles containing duplicates of the best models. To counter this phenomenon we apply the similarity concept from Holland's sharing as described in section 4.4.4: individual models will try to mate only with individuals of the same type. Only when selection has made this impossible will different model types combine to form an ensemble. This leaves two cases left to explain:

1. ensemble - ensemble crossover: a single-point crossover is made between the ensemble members of each parent
2. ensemble - model crossover: the model replaces a randomly selected ensemble member with probability pswap or gets absorbed into the ensemble with probability 1 − pswap (ensuring the maximum ensemble size set by the user is not exceeded and preventing exact duplicates)

Using ensembles in this way has the additional benefit of allowing a particular model type to lie dormant in an ensemble, with the possibility of re-emerging later, thus delaying extinction.
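The dispatch logic can be summarised in the following sketch (illustrative Python, not the toolbox implementation; the Ensemble class, the value of pswap and the size limit are stand-ins for the corresponding toolbox settings):

import random

class Ensemble:
    def __init__(self, members):
        self.members = list(members)
    def predict(self, x):
        # simple average of the member responses
        return sum(m.predict(x) for m in self.members) / len(self.members)

def crossover(a, b, xover_same_type, p_swap=0.5, max_size=3):
    # dispatch recombination depending on the types of the two parents
    a_is_ens, b_is_ens = isinstance(a, Ensemble), isinstance(b, Ensemble)
    if not a_is_ens and not b_is_ens:
        if type(a) is type(b):
            return xover_same_type(a, b)           # homogeneous: normal crossover
        return Ensemble([a, b])                    # heterogeneous: form an ensemble
    if a_is_ens and b_is_ens:                      # single-point crossover on the member lists
        cut_a = random.randrange(1, len(a.members) + 1)
        cut_b = random.randrange(len(b.members))
        return Ensemble((a.members[:cut_a] + b.members[cut_b:])[:max_size])
    ens, model = (a, b) if a_is_ens else (b, a)
    if random.random() < p_swap:                   # model replaces a random ensemble member
        members = list(ens.members)
        members[random.randrange(len(members))] = model
        return Ensemble(members)
    if len(ens.members) < max_size and model not in ens.members:
        return Ensemble(ens.members + [model])     # model gets absorbed into the ensemble
    return ens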

4.8 Conclusion

With this chapter we have come to the end of the theoretical part of this thesis. The reader should now have a clear understanding of the goals set out by this thesis and of the algorithms used to achieve them. Whether those algorithms actually meet the goals is the subject of the next chapter.


Chapter 5

Performance Results

The path of precept is long, that of example short and effectual. - Seneca, Roman philosopher (5 BC - 65 AD)

This chapter applies the heterogeneous GA described in section 4.7 to a number of test problems in order to validate whether the best model type is indeed determined automatically. The problems include two predefined mathematical functions and real-world problems from chemistry and aerodynamics.

5.1 Objective

The purpose of this chapter is to test whether the heterogeneous approach presented in the previous chapter works as expected on a number of test problems. More concretely, the objective is to empirically test the following hypothesis: if an initial heterogeneous population of surrogate models is evolved using the speciated GA shown in Algorithm 3, the algorithm will consistently and unambiguously converge to the surrogate model type that is most suited to the given problem. In addition we hope to see evidence of a 'battle' between model types: while initially one species may have the upper hand, as more data becomes available (a dynamically changing optimization landscape) a different species may become dominant. This should result in clearly noticeable population dynamics, a kind of oscillatory stage before convergence.

5.2 Test problems

The following test problems shall be used to test the speciated GA:

• Predefined 2D mathematical functions
  – Ackley's Path function
  – Langermann's function
• A data set from a combustion experiment in chemistry
• A data set from aerodynamics

Each of these will be described in more detail below.


5.2.1 Ackley Function (AF)

The first test problem is Ackley's Path, a well known benchmark problem from optimization. The function is shown in figure 5.1. Its mathematical definition for n dimensions is:

$F(\vec{x}) = -20 \cdot \exp\left(-0.2\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\right) - \exp\left(\frac{1}{n}\sum_{i=1}^{n}\cos(2\pi x_i)\right) + 20 + e$

with $x_i \in [-2, 2]$ for $i = 1, \ldots, n$.

Figure 5.1: Ackley's Path Function

For this function a validation set and a test set of 5000 scattered points each are available.
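For reference, a direct NumPy implementation of the formula above (an illustrative re-implementation, not the toolbox code; it is vectorized over the last axis):

import numpy as np

def ackley(x):
    # Ackley's Path function; x may be a single point or an array of points
    x = np.atleast_2d(x)
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.mean(x ** 2, axis=-1)))
    term2 = -np.exp(np.mean(np.cos(2.0 * np.pi * x), axis=-1))
    return term1 + term2 + 20.0 + np.e

print(ackley([0.0, 0.0]))   # global minimum, approximately 0 at the origin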

5.2.2 Langermann Function (LF)

The Langermann function is another difficult multi-modal test function from optimization. It is extremely irregular, as can be seen from figure 5.2. Mathematically it is defined as

$F(\vec{x}) = -\sum_{i=1}^{m} c_i \, e^{-\frac{\|\vec{x}-A(i)\|^2}{\pi}} \cos\left(\pi\,\|\vec{x}-A(i)\|^2\right)$

with m = 5 and $x_i \in [0, 10]$ for $i = 1, \ldots, n$. See the implementation in the toolbox for the values of A and c. For this function a validation set and a test set of 10000 scattered points each are available.
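An illustrative NumPy implementation is given below. The A matrix and c vector shown are the values commonly used for this 2-D benchmark and are an assumption here, since the text defers to the toolbox implementation for the exact values:

import numpy as np

# commonly used 2-D Langermann constants (assumed; see the toolbox for the exact values)
A = np.array([[3, 5], [5, 2], [2, 1], [1, 4], [7, 9]], dtype=float)
c = np.array([1, 2, 5, 2, 3], dtype=float)

def langermann(x):
    # Langermann function with m = 5 for a single 2-D point
    x = np.asarray(x, dtype=float)
    d2 = np.sum((A - x) ** 2, axis=1)              # squared distances ||x - A(i)||^2
    return -np.sum(c * np.exp(-d2 / np.pi) * np.cos(np.pi * d2))

print(langermann([2.0, 1.0]))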

5.2.3 Chemistry Example (CE)

This example and its description are taken from [77], where the authors describe the generation of an optimal ANN using a pattern search algorithm. The chemical process under consideration describes methane/air combustion. The GRI 2.11 chemical mechanism, containing 277 elementary chemical reactions among 49 species, is used. The steady laminar flamelet equations [150] are often employed to describe the reaction-diffusion balance in non-premixed flames. The solutions to these equations provide the temperature and the mass fractions of all species in terms of two parameters: the mixture fraction z and the reaction progress variable c.


Figure 5.2: Langermann Function

Temperature and the chemical source term of c, which can be viewed as a measure of heat release, are shown as functions of these parameters in figure 5.3. The solid lines represent the system boundary; flame states outside this boundary are inaccessible. It can be seen that the chemical source term is very localized around z ≈ 0.2 and 0.15 ≤ c ≤ 0.23. For the approximation 1000 data samples are available, half of which will be used for training and the other half for validation. Sample data were obtained by applying an acceptance-rejection method [161], which results in a better resolution of the important regions with a high chemical source term and temperature. In addition to the training and validation sets, a separate dense data set of 13959 samples is available for testing.

5.2.4 LGBB Example (LE)

NASA's Langley Research Center is developing a small launch vehicle (SLV) [146, 159, 145] that can be used for the rapid deployment of small payloads to low earth orbit at significantly lower launch costs and with improved reliability and maintainability. The vehicle is a three-stage system with a reusable first stage and expendable upper stages (the schematic is shown in figure 5.4). The reusable first stage booster, which glides back to the launch site after staging at around Mach 3, is named the Langley Glide-Back Booster (LGBB). In particular, NASA is interested in the aerodynamic characteristics of the LGBB from subsonic to supersonic speeds when the vehicle re-enters the atmosphere during its gliding phase (see figure 5.5). More concretely, NASA wants to learn about the response in lift, drag, pitch, side-force, yaw, and roll of the LGBB as a function of three inputs: Mach number, angle of attack, and side-slip angle. For each input configuration the Cart3D flow solver is used to solve the inviscid Euler equations over an unstructured mesh of 1.4 million cells. Each run of the Euler solver takes on the order of 5-20 hours on a high-end workstation [159]. The geometry of the LGBB used in the experiments is shown in figure 5.6. Figure 5.7 shows the lift response plotted as a function of speed (Mach) and angle of attack (alpha), with the side-slip angle (beta) fixed at zero. From the figure it can be seen that there is a marked phase transition between flows at subsonic and supersonic speeds (notice the ridge in the response surface at Mach 1). This transition is distinctly non-linear and may even be non-differentiable or non-continuous [62]. Given the computational cost of the CFD solvers, the LGBB example is an ideal application for metamodeling techniques. Unfortunately, access to the original simulation code is restricted.


Figure 5.3: Solution of the steady laminar flamelet equations as a function of mixture fraction z and progress variable c; (a) temperature (K) and (b) chemical source term (kg/(m³·s)) (Source: [77])

Figure 5.4: LGBB Schematic [145]

Instead, two data sets are available:

• a hand-crafted gridded data set of 3167 points, mostly focused on inputs near Mach 1 and large alpha. Due to poor detection of convergence in the Euler solver in this initial experiment, the six responses (lift, drag, ...) appear "noisy".
• a data set of 780 points chosen adaptively according to the method described in [62]. In addition to being adaptively designed, these experiments used an updated version of the Euler solver and a more robust convergence detection algorithm.

The latter data set will be used here.

Figure 5.5: LGBB flight profile [145]

Figure 5.6: LGBB geometry [159]

5.3 Model types

For the experiments the following model types are used: artificial neural networks (ANN), rational functions, RBF models, Kriging models, SVMs, and LS-SVMs, each with their own genetic operator implementations. As stated in subsection 4.7.3, the result of a heterogeneous recombination is an ensemble, so in total seven model types will be in the running for approximating the data. We briefly describe each of these below.

5.3.1 ANN

Since their initial conception as crude models of biological neurons, ANNs have proven to be very useful, flexible and powerful tools for solving a wide range of complex problems: classification, pattern recognition, system control, time series prediction, function approximation, regression, optimization, reasoning, etc. In the context of surrogate modeling they constitute a particularly attractive method considering their universal, black-box nature. ANNs have been proven, under reasonable conditions, to be able to approximate any computable function with arbitrary accuracy through learning [203, 193], without requiring any a priori knowledge of the underlying system. As model parameters we consider the network topology (number of hidden layers, number of units per hidden layer). The weights w are not evolved but trained using Levenberg-Marquardt backpropagation in conjunction with Bayesian regularization [50] for 300 epochs. From a theoretical perspective, not evolving the weights and topology together is known to be sub-optimal [213]. However, from a practical viewpoint a hybrid approach is much faster, even though the ANN training problem is known to be ill-conditioned [163].

Genetic operators:
• creation function: mutated default configuration (In − 5 − 5 − Out network complexity)
• mutation function: randomly perturbs the model parameters
• crossover function: modified 1-point crossover

In addition, extra logic is added to the genetic operators to ensure that enough (partial) re-initializations of w occur (possibly inheriting trained weights from parents) to reduce the chance of converging to a poor local minimum.


Figure 5.7: Lift plotted as a function of Mach (speed) and alpha (angle of attack) with beta (side-slip angle) fixed to zero. The ridge at Mach 1 separates subsonic from supersonic cases [62].

5.3.2 Rational Functions

Linear regression models have always been a key tool for modeling a wide variety of systems. In particular, using polynomials for the interpolation and approximation of scattered data has been popular for many years. In the context of surrogate modeling, several researchers have modeled simulation outputs with polynomials and ratios of polynomials (rational functions) [37, 105, 109]. The degrees which appear in the numerator and denominator of a rational model are governed by three (sets of) parameters: the variable weights (w1, ..., wd), the denominator flags (f1, ..., fd) and the degrees of freedom P. The number of degrees of freedom (unknowns) in the model is given by P, expressed as a percentage of the number of samples. The weights wj assign a certain importance to each input parameter, causing the monomials in the numerator and denominator to contain higher degrees in the variables with lower weights. The flags fj indicate which variables should appear in both the numerator and the denominator, and which variables should only occur in the numerator. The genetic algorithm is configured with bounds for the weights and degrees of freedom, and with a probability F at which the flags will be set to 1. The operators are as follows (a small code sketch is given after this list):

• creation function: to generate an individual in the initial population, flags are set at random (respecting the probability F), weights are assigned random values within their range, and P is set to a random value, again within the assigned range.
• mutation function: the mutation operator is quite simple. One-third of the time, the weights are changed to random values close to their original values. One-third of the time, the flags are changed to new random values according to the probability F. The remaining third of the time, the percentage P is changed to a new value close to its old one, respecting the bounds.
• crossover function: the crossover operator takes the weights from one parent, the flags from the other, and sets P to the average of the P's of the two parents.
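A minimal sketch of this representation and two of the operators follows (the dictionary layout, bounds and helper names are assumptions for illustration, not the toolbox encoding):

import random

def create_rational(d, F=0.5, w_bounds=(1.0, 5.0), p_bounds=(0.1, 1.0)):
    # random individual: per-variable weights, denominator flags and degrees of freedom P
    return {"w": [random.uniform(*w_bounds) for _ in range(d)],
            "f": [1 if random.random() < F else 0 for _ in range(d)],
            "P": random.uniform(*p_bounds)}

def crossover_rational(p1, p2):
    # weights from one parent, flags from the other, P averaged
    return {"w": list(p1["w"]),
            "f": list(p2["f"]),
            "P": 0.5 * (p1["P"] + p2["P"])}

parent1, parent2 = create_rational(3), create_rational(3)
print(crossover_rational(parent1, parent2))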

5.3.3 RBF Models

Radial basis function (RBF) models are a popular surrogate modeling technique, closely related to Kriging and DACE, and often used in surrogate-driven optimization. An RBF model is built upon three main parameters: the basis function, the shape parameters and the regression polynomial degree. So for an RBF model


$\sum_{k=1}^{M} \beta_k f_k(x) \;+\; \sum_{j=1}^{N} \phi\!\left(\|x - x_j\|_2,\, \alpha_1, \ldots, \alpha_n\right)$

the model parameters are the function φ, the extra shape parameters αi for that φ, and the degree M up to which monomials fk are added to the model. The default RBFs that were used, together with their parameter ranges, are given in figure 5.8.

RBF                 Formula                       Lower bound   Upper bound   Scale
Gaussian            φ(x) = exp(−α1·x²)            0.1           5             log
Multiquadric        φ(x) = sqrt(α1² + x²)         0.1           5             log
Thin plate spline   φ(x) = log(α1·|x|)·x²         0.1           5             log
Exponential         φ(x) = exp(−α1·|x|^α2)        0.1, 0.5      5, 2          log, lin

Figure 5.8: Basis functions and shape parameters

The genetic operators must provide the three parameters described above, based on the previous generation. To restrict which basis functions and which shape parameters may be used, the genetic operators use a lookup table that specifies which basis functions are allowed and, for each basis function, suitable ranges for the shape parameters.
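To make the model structure concrete, the sketch below evaluates a Gaussian-basis RBF surrogate; the explicit weight vector on the radial terms and the degree-one monomial part are assumptions made for illustration (in practice these coefficients are fitted from the data), and this is not the toolbox code:

import numpy as np

def rbf_model(x, centres, lam, beta, alpha=1.0):
    # evaluate a Gaussian-basis RBF surrogate at the points x
    #   centres : (N, d) sample points, lam : (N,) radial weights,
    #   beta    : (d+1,) coefficients of the monomial part 1, x_1, ..., x_d
    x = np.atleast_2d(x)
    r2 = np.sum((x[:, None, :] - centres[None, :, :]) ** 2, axis=-1)   # squared distances
    radial = np.exp(-alpha * r2) @ lam                                 # Gaussian basis phi
    poly = np.hstack([np.ones((x.shape[0], 1)), x]) @ beta             # regression part
    return poly + radial

# toy usage with three centres in 2-D
centres = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(rbf_model([[0.5, 0.5]], centres,
                lam=np.array([1.0, -0.5, 0.2]), beta=np.array([0.1, 0.0, 0.0])))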

5.3.3.1 Creation function

The initial population is created by randomly selecting a basis function φ and a regression degree M for each individual. Each individual gets shape parameters randomly selected within the allowable range for its basis function.

5.3.3.2 Mutation function

A mutation only changes the shape parameters and not the basis function, because the same shape parameters do not mean anything when used with a different basis function. The mutation operator changes only one shape parameter 60% of the time, and changes all shape parameters 20% of the time. The remaining 20% of the time a completely random model is generated. On top of the shape parameter changes, the regression degree is changed 25% of the time. Figure 5.9 depicts this graphically.

[Exponential, (2, 1.5), R] mutates into:
  20% of the time (random):        [Gaussian, (x), R']
  60% of the time (mild mutation): [Exponential, (x, 1.5), R'] or [Exponential, (2, y), R']
  20% of the time (mutation):      [Exponential, (x, y), R']
where R' = R 75% of the time, else R' is chosen randomly.

Figure 5.9: The mutation operator visualized

5.3.3.3 Crossover function

The crossover operation tries to incorporate as much intelligence as possible. If both parents use the same basis function, either an average is taken of the shape parameters (in case this basis function has only one parameter) or a straightforward 1-point crossover is done between the shape parameter vectors. When the basis functions of both parents differ, it is difficult to find a sensible crossover function. Our operator takes the regression degree and basis function of one parent, and then rescales the shape parameters of the second parent into the range of the shape parameters of the first parent's basis function. The reasoning behind this is that if there is a tendency to select large shape parameters (flatter functions) for one basis function, it is probably the same for the other basis function. When doing such a scaling, a linear or logarithmic scale is used depending on which parameter has to be scaled. Figure 5.10 visualizes the crossover operation.

Crossing [Exponential, (3, 1), 4] with [Exponential, (2, 1.5), 5] gives [Exponential, (3, 1.5), 4].

Crossing [Gaussian, (3), 4] with [Exponential, (2, 1.5), 5] gives [Gaussian, (z), 4], where
z = exp( (1.5 − 0.5)/(2 − 0.5) · (log 5 − log 0.1) + log 0.1 )

Figure 5.10: The crossover operator visualized

5.3.4 Kriging models

Kriging models (also known as Gaussian Process models) are a well-established statistical method introduced by the South African mining engineer Krige in the 1950s. Some years later the technique was formalized as a fundamental geostatistical method by Matheron [126] in 1963, and later used by Sacks et al., Kleijnen, and others [125, 93, 194] in the context of computer experiments. Kriging models have been proven, under specific conditions, to be equivalent to Bayesian methods and RBF interpolation models [18]. Given samples xi ∈ R^m and corresponding responses yi ∈ R^q for i = 1, ..., n, statisticians [96] have suggested that the responses can be modeled using

$y = \sum_{j}^{o} \beta_j x_j + \sum_{i}^{n} \alpha_i\, \phi(\theta, x_i, x)$   (5.1)

where the α's and β's are the free parameters that need to be learned from the data, φ is a correlation function and θ a correlation constant (one for each factor). The first part of equation 5.1 corresponds to a linear polynomial regression of order o, thus providing a global approximation of the design space. A correlation model is then added that creates local deviations around the evaluated sample points. An example correlation function is

$\phi(\theta, w, x) = \prod_{k}^{m} e^{-\theta_k |w_k - x_k|}$   (5.2)

The closer an unknown point x is to a known point xi, the more they are considered alike and the more the corresponding model parameter αi will alter the response surface model in x. Defining the β's is usually done through a least squares fitting technique. Once they are known, the correlation model can be defined as the difference between the actual model and the regression model. This results in a system of n equations in n unknowns that can be solved to find the α's.

• creation function: each correlation constant is given a lower and upper bound, θl and θu. The initial population is created by choosing random values within these bounds. In addition, the individual θi = sqrt(θl · θu) is included as well, since it has been shown to be a good initial guess [122].
• mutation function: the correlation constants are altered randomly, but always remain inside the interval [θl, θu].
• crossover function: arithmetic crossover with a random perturbation.

The implementation of Kriging models is based on the DACE toolbox from the Technical University of Denmark [122].
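For illustration, a small NumPy sketch of the correlation function of equation 5.2 and the resulting correlation matrix (fitting the α's, β's and θ's is omitted; this is not the DACE toolbox code):

import numpy as np

def corr_exp(theta, W, X):
    # R[i, j] = prod_k exp(-theta_k * |W[i, k] - X[j, k]|)   (equation 5.2)
    theta = np.asarray(theta, dtype=float)
    diff = np.abs(W[:, None, :] - X[None, :, :])          # (n, p, m) pairwise differences
    return np.exp(-np.sum(theta * diff, axis=-1))         # product of exponentials

# correlation matrix of four random 2-D samples with themselves
X = np.random.rand(4, 2)
print(np.round(corr_exp([1.0, 2.0], X, X), 3))            # symmetric, ones on the diagonal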


5.3.5 SVM

The Support Vector Machine (SVM) was introduced by Boser, Guyon and Vapnik at COLT '92 [16], as a result of research into statistical learning theory going back to the early 1960s [195]. In the past few years SVMs and related techniques (commonly referred to as kernel methods) have become increasingly popular due to their solid theoretical foundations and good empirical performance on many non-linear classification and regression tasks in various fields. While originally devised for classification tasks, a formulation for regression was presented in 1996 [43, 21]. Their use as a surrogate modeling technique is limited compared to other techniques, but steadily growing. There are already many reports on their use for global surrogate modeling [29, 46, 72, 170, 22] as well as for surrogate-driven optimization [138, 2]. A major advantage of SVMs is that their mathematical formulation is dimension independent. This makes them an attractive way to cope with the curse of dimensionality, which sometimes discourages researchers from using surrogate models [107]:

"An alternative approach to direct use of the simulator is to build a fast surrogate model (also referred to as a meta-model or a statistically equivalent model), a statistical model of the simulator which can then be run cheaply Sacks et al. (1989); Kennedy and O'Hagan (2001); Santner et al. (2003); Fang et al. (2006). In our case, this approach is not tractable because of the high dimension of the input space, which is two bivariate distributions and so theoretically of infinite dimension. Typical surrogate models are built on relatively low-dimensional input spaces, so that approach is not realistic here."

In our experiments we only consider SVMs with an RBF kernel since this usually gives the best results. In this case an SVM is defined by three parameters: the regularization constant γ, the width of the loss function ε, and the spread of the RBF kernel σ. Since we are dealing with noise-free data, we set ε = 0. Thus, only γ and σ need to be set by the GA. The termination threshold t for the SVM optimization was set to 0.00001.

Genetic operators:
• creation function: a two-dimensional Latin hypercube design, with as many points as the size of the population
• mutation function: a random perturbation of the model parameters within the parameter bounds ln(γ) ∈ [−4, 4], ln(σ) ∈ [−7, 7]
• crossover function: 1-point crossover and/or arithmetic crossover, each with probability 0.5

When designing these operators, an attempt was made to reuse existing operators described in the literature. Though such literature is available [112, 30, 52, 31, 76, 73], all papers considered lacked vital information to allow a sufficiently similar implementation. Therefore, in the end a custom implementation was used. The implementation of SVMs is based on the libSVM toolbox from National Taiwan University [20].
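A sketch of the (γ, σ) operators in log space, using the bounds listed above (the permutation-based Latin hypercube and the Gaussian perturbation scale are illustrative assumptions, not the exact toolbox operators):

import numpy as np

LN_GAMMA = (-4.0, 4.0)
LN_SIGMA = (-7.0, 7.0)

def create_population(n):
    # 2-D Latin hypercube over (ln gamma, ln sigma), one row per individual
    u = (np.random.permutation(n) + np.random.rand(n)) / n      # stratified in [0, 1)
    v = (np.random.permutation(n) + np.random.rand(n)) / n
    lg = LN_GAMMA[0] + u * (LN_GAMMA[1] - LN_GAMMA[0])
    ls = LN_SIGMA[0] + v * (LN_SIGMA[1] - LN_SIGMA[0])
    return np.column_stack([lg, ls])

def mutate(ind, scale=0.5):
    # random perturbation, clipped to the parameter bounds
    lg, ls = ind + np.random.normal(0.0, scale, size=2)
    return np.array([np.clip(lg, *LN_GAMMA), np.clip(ls, *LN_SIGMA)])

pop = create_population(15)
print(np.exp(mutate(pop[0])))    # back-transform to (gamma, sigma)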

5.3.6 LS-SVM

Here we use the Least Squares SVM (LS-SVM) formulation developed by Suykens [187] instead of the classical Vapnik SVM. In a nutshell, the LS-SVM simplifies the formulation of the SVM as follows: first, it employs equality constraints (instead of inequalities) and error variables ek instead of (though similar to) the original slack variables ξk. Second, a squared loss function is taken for these error variables instead of the Vapnik ε-insensitive loss function (hence the prefix 'least squares'). It turns out that these modifications greatly simplify the problem: instead of having to solve a quadratic programming problem, a simpler system of linear equations needs to be solved, which speeds up the execution. See [188] for more information. The genetic operator settings were the same as for the SVM case. The implementation of LS-SVMs is based on the LS-SVMlab toolbox from the Katholieke Universiteit Leuven [188].

5.3.7 Ensembles

We have not yet mentioned what type of ensemble will be used. There are several methods for combining the outputs of models, such as the average, a weighted average, Dempster-Shafer methods, using rank-based information, voting, the supra-Bayesian approach, stacked generalization, etc. [108, 169]. To keep the complexity (number of parameters) low we have opted for a simple average ensemble. Of course, a different combination method can easily be used instead. In addition, an ensemble has a minimum size of 2, a maximum size of 3, and the ensemble members must differ at least 5% in their response.

Genetic operators:
• creation function: ensembles are never created directly, only as a result of a heterogeneous recombination
• mutation function: one ensemble member is randomly deleted
• crossover function: see subsection 4.7.3
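For completeness, a minimal sketch of the simple average ensemble (member models are assumed to expose a predict method; the 5% response-difference check and the duplicate check are left to the caller):

import numpy as np

class AverageEnsemble:
    # simple average ensemble with 2 or 3 members
    def __init__(self, members):
        assert 2 <= len(members) <= 3, "ensemble size must be 2 or 3"
        self.members = list(members)

    def predict(self, x):
        return np.mean([m.predict(x) for m in self.members], axis=0)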

5.4 Tests

The following tests will be conducted:

• all examples with the same default settings (see the next section)
• all examples with extinction prevention enabled (EP = true)

For each test we record:

• the composition of the final population
• the accuracy of the best model
• the number of samples evaluated at termination

Each test is run 4 times to average out random effects.

5.5 Experimental setup

5.5.1 Sampling settings

For the CE and LE examples a fixed, medium-size data set is available, so selecting samples adaptively makes little sense. For these examples the adaptive sampling loop was therefore switched off and only adaptive modeling was performed. For the other examples the settings were as follows: an initial Latin hypercube experimental design of size 50, augmented with the corner points, is used, and modeling is allowed to commence once at least 90% of the initial samples are available. Each iteration, 50 new samples (per output) are selected using the gradient adaptive sampling algorithm. The gradient algorithm is a state-of-the-art sampling method, shown in Algorithm 4, whose design was inspired by previous research into ray tracing [157]. The algorithm works as follows: for each sample point xi it approximates the gradient around that point by fitting a hyperplane through xi's neighborhood. The neighborhood of a sample in a d-dimensional Euclidean space is defined as the set of 2d samples closest to it. The error of the plane with respect to the neighborhood is then calculated and scaled with the size of the Voronoi cell around xi. In this way a trade-off between error-based and density-based sampling is achieved. New samples are selected in those Voronoi cells where the scaled error is highest.
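The scoring step of the gradient algorithm can be sketched as follows (illustrative Python; estimating the Voronoi cell sizes by Monte-Carlo counting over a unit hypercube is a simplifying assumption made here and not necessarily how the toolbox computes them):

import numpy as np

def gradient_scores(X, y, n_mc=2000):
    # score each sample: local plane-fit error scaled by an estimate of its Voronoi cell size
    n, d = X.shape
    mc = np.random.rand(n_mc, d)                                    # Monte-Carlo points in [0,1]^d
    owner = np.argmin(((mc[:, None, :] - X[None, :, :]) ** 2).sum(-1), axis=1)
    cell_size = np.bincount(owner, minlength=n) / n_mc              # approximate Voronoi cell sizes

    scores = np.empty(n)
    for i in range(n):
        idx = np.argsort(((X - X[i]) ** 2).sum(-1))[1:2 * d + 1]    # the 2d nearest neighbours
        A = np.hstack([X[idx], np.ones((len(idx), 1))])
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)           # fit a hyperplane
        err = np.abs(A @ coef - y[idx]).mean()                      # error w.r.t. the neighbourhood
        scores[i] = err * cell_size[i]                              # scale by the cell size
    return scores                                                   # sample where scores are highest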

5.5.2 GA settings

The GA is run for a maximum of 15 generations. It terminates if one of the following conditions is satisfied: (1) the maximum number of generations is reached, or (2) 5 generations pass with no improvement. The size of each deme is set to 15. The migration interval mi is set to 12, the migration fraction mf to 0.1, and the migration direction to 'both' (copies of the mf best individuals from island i replace the worst individuals in islands i − 1 and i + 1). A stochastic uniform selection function was used. The fitness of an individual was defined as the maximum relative error:


Algorithm 4 Gradient sample selection algorithm
01. X = currentSamples();
02. v = approximateVoronoi(X);
03. Xnew = {};
04. k = 1;
05. foreach xi of X do
06.   nb = calculateNeighborhood(xi);
07.   plane = fitPlane(nb);
08.   errorxi = calculateError(nb, plane);
09.   errorxi = scale(errorxi, area(vxi));
10. end
11. [error, X] = sort([error, X], 'descending');
12. while(k