Design of Experiments for the Tuning of Optimisation Algorithms

Enda Ridge
PhD Thesis
The University of York
Department of Computer Science

October 2007

Abstract

This thesis presents a set of rigorous methodologies for tuning the performance of algorithms that solve optimisation problems. Many optimisation problems are difficult and time-consuming to solve exactly. An alternative is to use an approximate algorithm that solves the problem to an acceptable level of quality and provides such a solution in a reasonable time. Using optimisation algorithms typically requires choosing the settings of tuning parameters that adjust algorithm performance subject to this compromise between solution quality and running time. This is the parameter tuning problem. This thesis demonstrates that the Design Of Experiments (DOE) approach can be adapted to successfully address the parameter tuning problem for algorithms that find approximate solutions to optimisation problems. The thesis introduces experiment designs and analyses for (1) determining the problem characteristics affecting algorithm performance, (2) screening and ranking the most important tuning parameters and problem characteristics, and (3) tuning algorithm parameters to maximise algorithm performance for a given problem instance. Desirability functions are introduced for tackling the compromise of achieving satisfactory solution quality in reasonable running time. Five case studies apply the thesis methodologies to the Ant Colony System and the Max-Min Ant System algorithms for the Travelling Salesperson Problem. New results are reported and open questions are answered regarding the importance of both existing tuning parameters and proposed new tuning parameters. A new problem characteristic is identified and shown to have a very strong effect on the quality of the algorithms' solutions. The tuning methodologies presented here yield solution quality that is as good as or better than the general parameter settings from the literature. Furthermore, the associated running times are orders of magnitude faster than the results obtained with the general parameter settings. All experiments are performed with publicly available algorithm code, publicly available problem generators and benchmarked experimental machines.

Contents

Abstract
List of Figures
List of Tables
Acknowledgments
Author's Declaration

I   Preliminaries

1   Introduction and motivation
    1.1   Hypothesis Statement
    1.2   Thesis structure
    1.3   Chapter summary

2   Background
    2.1   Combinatorial optimisation
    2.2   The Travelling Salesperson Problem (TSP)
    2.3   Approaches to solving combinatorial optimisation problems
    2.4   Ant Colony Optimisation (ACO)
    2.5   Design Of Experiments (DOE)
    2.6   Chapter summary

II  Related Work

3   Empirical methods concerns
    3.1   Is the heuristic even worth researching?
    3.2   Types of experiment
    3.3   Life cycle of a heuristic and its problem domain
    3.4   Research questions
    3.5   Sound experimental design
    3.6   Heuristic instantiation and problem abstraction
    3.7   Pilot Studies
    3.8   Reproducibility
    3.9   Benchmarking
    3.10  Responses
    3.11  Random number generators
    3.12  Problem instances and libraries
    3.13  Stopping criteria
    3.14  Interpretive bias
    3.15  Chapter summary

4   Experimental work
    4.1   Problem difficulty
    4.2   Parameter tuning of other metaheuristics
    4.3   Parameter tuning of ACO
    4.4   Chapter summary

III Design Of Experiments for Tuning Metaheuristics

5   Experimental testbed
    5.1   Problem generator
    5.2   Algorithm implementation
    5.3   Benchmarking the machines
    5.4   Chapter summary

6   Methodology
    6.1   Sequential experimentation
    6.2   Stage 1a: Determining important problem characteristics
    6.3   Stage 1b: Screening
    6.4   Stage 2: Modelling
    6.5   Stage 3: Tuning
    6.6   Stage 4: Evaluation
    6.7   Common case study issues
    6.8   Chapter summary

IV  Case Studies

7   Case study: Determining whether a problem characteristic affects heuristic performance
    7.1   Motivation
    7.2   Research question and hypothesis
    7.3   Method
    7.4   Analysis
    7.5   Results
    7.6   Conclusions
    7.7   Chapter summary

8   Case study: Screening Ant Colony System
    8.1   Method
    8.2   Analysis
    8.3   Results
    8.4   Conclusions and discussion
    8.5   Chapter summary

9   Case study: Tuning Ant Colony System
    9.1   Method
    9.2   Analysis
    9.3   Results
    9.4   Conclusions and discussion
    9.5   Chapter summary

10  Case study: Screening Max-Min Ant System
    10.1  Method
    10.2  Analysis
    10.3  Results
    10.4  Conclusions and discussion
    10.5  Chapter summary

11  Case study: Tuning Max-Min Ant System
    11.1  Method
    11.2  Analysis
    11.3  Results
    11.4  Conclusions and discussion
    11.5  Chapter summary

V   Conclusions

12  Conclusions
    12.1  Overview
    12.2  Advantages of DOE
    12.3  Hypothesis
    12.4  Summary of main thesis contributions
    12.5  Thesis strengths
    12.6  Thesis limitations
    12.7  Future work
    12.8  Closing

Appendices

A   Design Of Experiments (DOE)
    A.1   Terminology
    A.2   Regions of operability and interest
    A.3   Experiment Designs
    A.4   Experiment analysis
    A.5   Hypothesis testing
    A.6   Error, Significance, Power and Replicates

B   TSPLIB Statistics

C   Calculation of Average Lambda Branching Factor

D   Example OFAT Analysis
    D.1   Motivation
    D.2   Method
    D.3   Analysis
    D.4   Results
    D.5   Conclusions and discussion

References

List of Figures

2.1   Growth of TSP problem search space.
2.2   Special cases and generalisations of the TSP.
2.3   Experiment setup for the double bridge experiment.
2.4   An example of a graph data structure.
2.5   The ACO Metaheuristic.
2.6   Common tuning parameters and recommended settings for the ACO algorithms.
2.7   Tuning parameters and recommended settings for MMAS.
2.8   Tuning parameters and recommended settings for MMAS.
5.1   Relative frequencies of normalised edge lengths for several TSP instances.
5.2   Results of the DIMACS benchmarking of the experiment testbed.
5.3   Data from the DIMACS benchmarking of the experiment testbed.
6.1   The sequential experimentation methodology.
6.2   Schematic for the Two-Stage Nested Design with r replicates.
6.3   A sample overlay plot.
7.1   Number of outliers deleted during each problem difficulty experiment.
7.2   Relative Error response for ACS on problems of size 300, mean 100.
7.3   Relative Error response for ACS on problems of size 700, mean 100.
7.4   Relative Error response for MMAS on problems of size 300, mean 100.
7.5   Relative Error response for MMAS on problems of size 700, mean 100.
8.1   Descriptive statistics for the ACS screening experiment.
8.2   Descriptive statistics for the confirmation of the ACS screening ANOVA.
8.3   95% Prediction intervals for the ACS screening of Relative Error.
8.4   95% Prediction intervals for the ACS screening of ADA.
8.5   95% Prediction intervals for the ACS screening of Time.
8.6   Summary of ANOVAs for Relative Error, ADA and Time.
9.1   Descriptive statistics for the full ACS FCC design.
9.2   Descriptive statistics for the screened ACS FCC design.
9.3   Descriptive statistics for the confirmation of the ACS tuning.
9.4   95% Prediction intervals for the full ACS response surface model of RelativeError-Time.
9.5   95% Prediction intervals for the screened ACS response surface model of RelativeError-Time.
9.6   RelativeError-Time ranked ANOVA of Relative Error response from full model.
9.7   RelativeError-Time ranked ANOVA of Time response from full model.
9.8   Full RelativeError-Time model results of desirability optimisation.
9.9   Screened RelativeError-Time model results of desirability optimisation.
9.10  Evaluation of Relative Error response in the RelativeError-Time model of ACS.
9.11  Evaluation of Time response in the RelativeError-Time model of ACS.
9.12  Evaluation of ADA response in the ADA-Time model of ACS.
9.13  Evaluation of Time response in the ADA-Time model of ACS.
10.1  Descriptive statistics for the MMAS screening experiment.
10.2  Descriptive statistics for the confirmation of the MMAS screening ANOVA.
10.3  95% Prediction intervals for the MMAS screening of Relative Error.
10.4  95% Prediction intervals for the MMAS screening of ADA.
10.5  95% Prediction intervals for the MMAS screening of Time.
10.6  Summary of ANOVAs for Relative Error, ADA and Time for MMAS.
11.1  Descriptive statistics for the full MMAS experiment design.
11.2  Descriptive statistics for the screened MMAS experiment design.
11.3  Descriptive statistics for the MMAS confirmation experiments.
11.4  95% prediction intervals of Relative Error by the full RelativeError-Time model of MMAS.
11.5  95% prediction intervals of Relative Error by the screened RelativeError-Time model of MMAS.
11.6  Predictions of Time by the full RelativeError-Time model of MMAS.
11.7  Predictions of Time by the screened RelativeError-Time model of MMAS.
11.8  95% prediction intervals of ADA by the full ADA-Time model of MMAS.
11.9  95% prediction intervals of ADA by the screened ADA-Time model of MMAS.
11.10 95% prediction intervals of Time by the full ADA-Time model of MMAS.
11.11 95% prediction intervals of Time by the screened ADA-Time model of MMAS.
11.12 RelativeError-Time ranked ANOVA of Relative Error response from full model.
11.13 RelativeError-Time ranked ANOVA of Time response from full model.
11.14 Full RelativeError-Time model results of desirability optimisation.
11.15 Screened RelativeError-Time model results of desirability optimisation.
11.16 Evaluation of the Time response in the RelativeError-Time model of MMAS.
11.17 Evaluation of Relative Error response in the RelativeError-Time model of MMAS.
11.18 Evaluation of the Time response in the RelativeError-Time model of MMAS.
11.19 Evaluation of the Time response in the ADA-Time model of MMAS.
11.20 Evaluation of Relative Error response in the RelativeError-Time model of MMAS.
A.1   Region of operability and region of interest.
A.2   Fractional Factorial designs for two to twelve factors.
A.3   Effects and alias chains.
A.4   Savings in experiment runs when using a fractional factorial design instead of a full factorial design.
A.5   Central composite designs for building response surface models.
A.6   Individual desirability functions.
A.7   Examples of possible main and interaction effects.
B.1   Some descriptive statistics for the symmetric Euclidean instances in TSPLIB.
B.2   Histogram of the bier127 TSPLIB instance.
B.3   Histogram of the Oliver30 TSPLIB instance.
B.4   Histogram of the pr1002 TSPLIB instance.
C.1   Pseudocode for the calculation of the average lambda branching factor.
D.1   Fixed parameter settings for the OFAT analysis.
D.2   Descriptive statistics for the six OFAT analyses.
D.3   Summary of results from the six OFAT analyses.
D.4   Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 10.
D.5   Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 40.
D.6   Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 70.
D.7   Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 10.
D.8   Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 40.
D.9   Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 70.

List of Tables

2.1   A selection of ant heuristic applications.
3.1   The state of the art in nature-inspired heuristics from 10 years ago.
4.1   Evolved parameter values for ACS.
6.1   A full factorial combination of two problem characteristics.
7.1   Parameter settings for the problem difficulty experiments.
8.1   Design factors for the screening study with ACS.
9.1   Design factors for the tuning study with ACS.
10.1  Design factors for the screening study with MMAS.
11.1  Design factors for the tuning study with MMAS.
11.2  Amount of outliers removed from MMAS tuning analyses.
A.1   Numbers of each effect estimated by a full factorial design of 10 factors.
A.2   Some common response transformations.

Acknowledgments

I am very grateful to my family and friends whose support and encouragement helped me through the PhD process.

I thank my supervisor Daniel Kudenko at the University of York for his supervision. He promptly reviewed my writing and was always available when I had questions or doubts. I thank my departmental assessor, Professor John Clark, for his advice and Dimitar Kazakov for reviewing some of my early papers. The thorough examination of the thesis and constructive criticisms by Thomas Stützle, at l'Université Libre de Bruxelles, and John Clark greatly improved the thesis.

I also wish to thank Daniel and my colleagues Leonardo Freitas and Arturo Servin for allowing me to use their machines for my experimental work. Pierre Andrews, Rania Hodhod, Silvia Quarteroni, Juan Perna and Sergio Mena also made their machines available when a deadline required some additional computing power. I am very grateful to my colleague and friend Jovan Cakic, who passed away during the course of my research. Jovan helped me quickly set up the original C program on which much of this thesis is based.

I am grateful to the anonymous reviewers of my publications whose comments helped shape my research. The research was also greatly improved by discussions with Rubén Ruiz, Thomas Bartz-Beielstein, Holger Hoos, David Woodruff, Marco Chiarandini, and Mike Preuss at various international conferences and with Simon Poulding at York. Major parts of the thesis were proofread by Leonardo Freitas.

I thank Pauline Greenhough, Filomena Ottaway, Judith Warren, Diane Neville, Richard Selby, Carol Lock, Nicholas Black, Ian Patrick and all the administrative and technical support staff at the Department for their help during the PhD.

I am very grateful to Michael Madden, my MSc supervisor, and Colm O' Riordan at the National University of Ireland, Galway, who encouraged and supported my application to York. Finally, I gratefully acknowledge the financial support of my scholarship from the Department of Computer Science at the University of York and its support of my research travel.

Admhálacha (Acknowledgements)

I am very grateful to my family and friends for their support and encouragement, which helped me greatly throughout the PhD process. I thank my supervisor Daniel Kudenko at the University of York for his feedback. He corrected my drafts quickly and was there whenever I had a question or doubt. I thank my departmental assessor at York, Professor Seán Ó Cléirigh, for his advice, and Dimitar Kazakov for his review of some of my drafts. The thesis was greatly improved as a result of the thorough examination by Thomas Stützle of l'Université Libre de Bruxelles and by Seán Ó Cléirigh.

I also thank Daniel and my colleagues Leonardo Freitas and Arturo Servin for letting me use their computers for my experiments. Pierre Andrews, Rania Hodhod, Silvia Quarteroni, Juan Perna and Sergio Mena also made their computers available to me when I was under deadline pressure. I am very grateful to my friend and colleague Jovan Cakic, who passed away during my research. Jovan helped me quickly create the computer program on which much of this thesis is based.

I am grateful to the anonymous reviewers of my publications, whose reports helped me greatly in shaping my research. The research was greatly improved through discussions with Rubén Ruiz, Thomas Bartz-Beielstein, Holger Hoos, David Woodruff, Marco Chiarandini and Mike Preuss at various international conferences, and with Simon Poulding in York. Leonardo Freitas read and corrected large parts of the thesis.

I thank Pauline Greenhough, Filomena Ottaway, Judith Warren, Diane Neville, Richard Selby, Carol Lock, Nicholas Black, Ian Patrick and all the administrative staff in the Department for their help during the PhD.

I am very grateful to Michael Madden, my MSc supervisor, and to Colm O' Riordan at the National University of Ireland, Galway, for their encouragement and support of my application to York. Finally, I thank the Department of Computer Science at the University of York for its financial support of my research travel.

Author's Declaration

This thesis describes original research carried out by the author, Enda Ridge, under the supervision of Dr. Daniel Kudenko at the University of York. This research has not been previously submitted to the University of York or to any other university for the award of any degree.

Some chapters of the thesis are based on articles that the author published or submitted for publication in the peer-reviewed scientific literature during the course of the thesis research. The details of these publications follow.

The early ideas in this research arose from explorations into parallel and decentralised versions of Ant Colony Optimisation (ACO) algorithms.

1. Enda Ridge, Daniel Kudenko, Dimitar Kazakov, Edward Curry. Parallel, Asynchronous and Decentralised Ant Colony System, in Proceedings of AISB 2006: Adaptation in Artificial and Biological Systems. First International Symposium on Nature-Inspired Systems for Parallel, Asynchronous and Decentralised Environments, vol. 2, T. Kovacs and J. A. R. Marshall, Eds. AISB, 2006, pp. 174-177.

2. Enda Ridge, Daniel Kudenko, and Dimitar Kazakov, A Study of Concurrency in the Ant Colony System Algorithm, in Proceedings of the IEEE Congress on Evolutionary Computation, 2006, pp. 1662-1669.

3. Enda Ridge, Edward Curry, Daniel Kudenko, Dimitar Kazakov, Nature-Inspired Systems for Parallel, Asynchronous and Decentralised Environments, in Multi-Agent and Grid Systems, vol. 3, H. Tianfield and R. Unland, Eds. IOS Press, 2007.

It quickly became obvious that these experiments would have a large amount of experimental noise arising from the parallel and asynchronous nature of the software. This prompted a search for how experiments with Ant Colony Optimisation (ACO) had been conducted in the literature and how the original sequential single-machine versions of the algorithms were set up. An examination of the literature revealed there were few guidelines and no rigorous approaches to setting up ACO algorithms. The original research direction changed. 'Roadmap' publications called for, among other things, recommended experiment designs and analyses for experiments with metaheuristics such as ACO.

4. Enda Ridge and Edward Curry, A Roadmap of Nature-Inspired Systems Research and Development, Multi-Agent and Grid Systems, vol. 3, IOS Press, 2007.

5. Marco Chiarandini, Luís Paquete, Mike Preuss, Enda Ridge, Experiments on Metaheuristics: Methodological Overview and Open Issues, Institut for Matematik og Datalogi, University of Southern Denmark, Technical Report IMADA-PP-2007-04 (http://bib.mathematics.dk/preprint.php?id=IMADA-PP-2007-04), March 2007, ISSN 0903-3920.

A preliminary version of the screening and tuning methodologies of the thesis appeared in the following publication.

6. Enda Ridge and Daniel Kudenko, Sequential Experiment Designs for Screening and Tuning Parameters of Stochastic Heuristics, in Workshop on Empirical Methods for the Analysis of Algorithms at the Ninth International Conference on Parallel Problem Solving from Nature, L. Paquete, M. Chiarandini, and D. Basso, Eds., 2006, pp. 27-34.

A refined version of this methodology is described in Chapter 6 and is used in the case studies of Chapters 8 to 11. Initial attempts to apply this methodology were not performing as well as expected and so an investigation was conducted into possible unknown problem characteristics that might be interfering with the methodology's models. This led to the following publications, the second of which contains the updated data of Chapter 7.

7. Enda Ridge and Daniel Kudenko, An Analysis of Problem Difficulty for a Class of Optimisation Heuristics, in Proceedings of the Seventh European Conference on Evolutionary Computation in Combinatorial Optimisation (EvoCOP), vol. 4446, Lecture Notes in Computer Science, C. Cotta and J. Van Hemert, Eds. Springer-Verlag, 2007, pp. 198-209. ISBN 978-3-540-71614-3.

8. Enda Ridge and Daniel Kudenko, Determining whether a problem characteristic affects heuristic performance. A rigorous Design of Experiments approach, in Recent Advances in Evolutionary Computation for Combinatorial Optimization. Springer, Studies in Computational Intelligence, 2008. ISSN 1860-949X.

The first application of the methodology was published in the following papers, updated versions of which appear in Chapters 8 and 9. The third of these was nominated for best paper in its track at the Genetic and Evolutionary Computation Conference 2007.

9. Enda Ridge and Daniel Kudenko, Screening the Parameters Affecting Heuristic Performance, in Proceedings of the Genetic and Evolutionary Computation Conference, vol. 1, D. Thierens, H.-G. Beyer, M. Birattari, et al., Eds. ACM, 2007. ISBN 978-1-59593-697-4.

10. Enda Ridge and Daniel Kudenko, Screening the Parameters Affecting Heuristic Performance, The Department of Computer Science, The University of York, Technical Report YCS 415 (www.cs.york.ac.uk/ftpdir/reports/index.php), April 2007.

11. Enda Ridge and Daniel Kudenko, Analyzing Heuristic Performance with Response Surface Models: Prediction, Optimization and Robustness, in Proceedings of the Genetic and Evolutionary Computation Conference, D. Thierens, H.-G. Beyer, M. Birattari, et al., Eds. ACM, 2007, pp. 150-157. ISBN 978-1-59593-697-4.

Finally, the methodology was applied to the MMAS heuristic and published in the following paper, an updated version of which appears in Chapter 11. The paper was winner of the best paper award at the Engineering Stochastic Local Search Algorithms workshop.

12. Enda Ridge and Daniel Kudenko, Tuning the Performance of the MMAS Heuristic, in Engineering Stochastic Local Search Algorithms. Designing, Implementing and Analyzing Effective Heuristics, vol. 4638, Lecture Notes in Computer Science, T. Stützle and M. Birattari, Eds. Berlin / Heidelberg: Springer, 2007, pp. 46-60. ISBN 978-3-540-74445-0.

Part I

Preliminaries


1 Introduction and motivation

This thesis presents rigorous empirical methodologies for modelling and tuning the performance of algorithms that solve optimisation problems.

Consider the very common problem of efficiently assigning limited indivisible resources to meet some objective. For example, a manufacturing plant must schedule machines for a particular job in the correct order so that machine utilisation is maximised and a product is manufactured as quickly as possible. Low-cost airlines must assign cabin crew shifts from a minimum-sized workforce to as many aircraft as possible. Logistics companies need to deliver products to a set of locations in an order that minimises delivery cost. Many similar problems occur in management, finance, engineering and physics. These problems are known as Combinatorial Optimisation (CO) problems.

CO problems are notoriously difficult to solve because a large number of potential solutions must be considered. Constraints on the available resources will limit the feasible alternatives that need to be considered. However, most CO problems still contain sufficient alternatives to make the best choices of available options difficult. CO problems typically require exponential time for solution in the worst case. In plain terms, as the problem gets larger, the difficulty of finding an exact solution increases extremely quickly. This has led to the use of heuristic solution methods: methods that sacrifice the guarantee of finding an exact solution in order to find a satisfactory solution in reasonable time. We term this reduction in solution quality in exchange for a reduction in solution time the heuristic compromise.

Metaheuristics¹ are a more recent attempt to combine basic heuristics into a flexible higher-level framework in order to better solve CO problems. Some of the most popular metaheuristics for combinatorial optimisation are Ant Colony Optimisation (ACO), Evolutionary Computation (EC), Iterated Local Search (ILS), Simulated Annealing (SA) and Tabu Search (TS).

¹ The terms metaheuristic and heuristic are used interchangeably throughout the thesis.

Many of these metaheuristics have achieved notable successes in solving difficult and important problems. Industry is taking note of this. Several companies incorporate metaheuristics into their solutions of complex optimisation problems [12]. These include ILOG (www.ilog.com), SAP (www.sap.com), NuTech Solutions (www.nutechsolutions.com), AntOptima (www.antoptima.com) and EuroBios (www.eurobios.com). Metaheuristics are therefore a research area of growing importance.

The flexibility of the metaheuristic framework comes at a cost. Metaheuristics typically require a relatively large amount of 'tuning' in order to adjust them to the particular problem at hand. This tuning involves setting values of many tuning parameters, much as one would adjust the dials on an old-fashioned television set to find a given station. This situation is exacerbated if one considers parameterising internal components of the metaheuristic and then adding or removing these parameterised components to modify performance. We term these design parameters. Some metaheuristics can have anything from five to more than twenty-five of these tuning parameters [33] and the scope for design parameters is effectively limitless. It quickly becomes very difficult to search through all possible tuning parameter settings and thus the potential performance of the metaheuristic is not realised. This is the parameter tuning problem.

The parameter tuning problem is one of the most important research challenges for any given metaheuristic². The main elements of this research challenge are as follows.

1. Screening problem characteristics to determine which problem characteristics affect metaheuristic performance.

2. Screening tuning parameters to determine which tuning parameters affect metaheuristic performance.

3. Modelling the relationship between tuning parameters, problem characteristics and performance.

4. Predicting metaheuristic performance for a given problem instance or set of instances given particular tuning parameter settings.

5. Tuning metaheuristic performance for a given problem instance or set of instances by recommending appropriate tuning parameter settings.

6. Assessing robustness of the tuned metaheuristic performance to variations in problem instance characteristics. That is, determining whether tuned parameter settings for a given combination of problem instance characteristics deteriorate significantly when applied to similar problem instance characteristics.

² There are, of course, other very important research challenges. Comparing heuristics, for example, is an important challenge that is fraught with its own difficulties.

The key obstacles to addressing these challenges are as follows:

• Problem space. All of the important problem characteristics are generally not known and may be difficult to determine.

• Parameter space³. The number of tuning parameters and the possible combinations of values they can take on is large or even infinite.

• Multiple performance metrics. Performance must be analysed in terms of both solution quality and solution time because of the heuristic compromise.

• Application scenario. The emphasis in a particular parameter tuning problem will depend on the specific application scenario. If problem instances are likely to be similar in characteristics then it is advantageous to have a general model of the relationship between parameters, instances and performance. If a small number of problem instances are likely to be tackled and those instances require significant resources for their solution then a relatively fast tuning approach is to be preferred.

³ We use the term parameter space when considering all the possible combinations of tuning parameters. We use the term design space when considering all the possible combinations of both tuning parameter settings and problem characteristics.

These challenges and obstacles must be addressed for every new metaheuristic that is proposed, for every modification to an existing metaheuristic that is proposed, and for every new problem type that is addressed. The parameter tuning problem is ubiquitous. Without addressing these challenges and overcoming these obstacles, the metaheuristic is of little use in practice as its user cannot set it up for maximum performance.

So how does one address these challenges? One can distinguish two broad approaches [28]. Analytical approaches attempt to analytically prove characteristics of the algorithm such as its worst-case and average-case behaviour. Empirical analyses implement the algorithm in computer code and evaluate its behaviour on selected problems. Both of these approaches have been unsatisfactory to date.

The analytical approach is the more ideal of the two in principle because of its potential generality and pure mathematical foundation. While it is to be expected that analytical approaches will improve with time and effort, they are far from ideal at their current level of maturity. The mathematical tools do not yet exist to successfully formalise and theorise about the behaviour and performance of existing cutting-edge metaheuristics. While early attempts at analysis are emerging, they generally resort to extreme simplifications to the metaheuristic description to render the analyses tractable. There is also a lack of comparisons of the theoretical predictions to actual implementations to determine whether the theory predicts the reality. These simplifications make the majority of conclusions inapplicable for practical purposes.

An empirical approach would seem an attractive alternative by virtue of its simplicity: collect enough data and interpret it without bias. The reality is very different. Which data should be collected? What issues affect the measurement and collection of the data? How much data is enough data? How should data be interpreted?

Can data interpretation be backed by mathematical precision or must we be limited to subjective interpretation? How do we ensure that an empirical analysis is both repeatable and reproducible?⁴

⁴ A repeatable experiment is one which the original experimenter can redo and get very similar results. A reproducible experiment is one which another experimenter can reproduce independently and get similar results that lead to the same conclusions.

An examination of the professional research journals shows that while empirical analyses of metaheuristics are often large and broad ranging, they are seldom backed by the scientific rigour that one would expect in more mature disciplines such as the physical, medical and social sciences. Few of the questions from the previous paragraph regarding empirical methodology are either recognised or clearly addressed by researchers. Proper experimental designs are seldom used. Interpretations of results are subjective opinions rather than sound statistical analyses. Parameters are selected without justification, or based on the reports from other studies without verification of their appropriateness for the current scenario [2]. This leaves the metaheuristic ill-defined and experiments irreproducible, and leads to an underestimation of the time needed to deploy the metaheuristic [69]. The list of failings is long and has often been lamented in the literature of the last two decades [69, 64, 101, 7, 48, 65].

While these criticisms in the literature are justified, others point out that few publications go further and explicitly illustrate the application of sound established scientific methodology to the analysis of metaheuristics [28]. Without research that sets a good example, the impoverished state of the field's methodology has thus persisted. Researchers in the natural sciences have available an extensive lore of laboratory techniques to guide the development of rigorous and conclusive experiments. This has not been the case in algorithmic research [79]. Attempts to improve this situation with illustrative case studies and to educate researchers with tutorials are emerging [122, 27, 9, 90].

A comprehensive methodology for addressing the aforementioned research challenges is needed. A comprehensive illustration of the application of such a methodology is needed. Fortunately, a good candidate methodology already exists. The field of Design of Experiments (DOE) is defined as:

. . . a systematic, rigorous approach to engineering problem-solving that applies principles and techniques at the data collection stage so as to ensure the generation of valid, defensible, and supportable engineering conclusions. In addition, all of this is carried out under the constraint of a minimal expenditure of engineering runs, time, and money. [1]

As well as providing this rigorous and efficient approach to data collection, DOE also provides statistically designed experiments. A statistically designed experiment offers a number of advantages over a design that does not use statistical techniques [89]. Attention is focussed on measuring sources of variability in results. The required number of tests is determined reliably and may often be reduced. Detection of effects is more precise and the correctness of conclusions is known with the mathematical precision of statistics.

DOE is a well-established field that has existed for over eighty years. It evolved for the manufacturing industry and is now well supported by commercial software. The National Institute of Standards and Technology describes four general engineering problem areas to which DOE may be applied [1]:

• Screening/Characterizing: the engineer is interested in understanding the process as a whole in the sense that he/she wishes to rank factors that affect the process in order of importance.

• Modelling: the engineer is interested in modelling the process with the output being a good-fitting (high predictive power) mathematical relationship.

• Optimizing: the engineer is interested in optimising the process by adjusting the factors that affect the process.

• Comparative: the engineer is interested in assessing whether a given choice is preferable to an alternative.

The first three of these application areas map directly to the parameter tuning research challenges for metaheuristics identified earlier⁵. The metaheuristic being studied is the 'process' to which DOE is applied. The rigour of DOE provides the framework to address the concerns regarding the methodology of empirical analyses. The statistically designed experiments address any concerns about the subjective nature of the interpretation of results.

⁵ Comparative DOE studies are appropriate for the comparison of heuristics, typically answering questions such as whether one heuristic is better than another. The difficulties of comparative studies are covered in the literature. Comparative studies are appropriate once all other issues regarding design, setup and running have been addressed. This thesis focuses on tuning and so should facilitate fairer comparative studies.
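To make the scale of the parameter space concrete, the short sketch below enumerates a small full factorial design, the simplest of the statistically designed experiments mentioned above, for a handful of ACO-style tuning parameters. The parameter names and level values are purely illustrative assumptions made for this sketch; they are not the designs or settings used in the thesis.

    # Illustrative only: parameter names and candidate levels are invented here,
    # not taken from the thesis case studies.
    from itertools import product

    levels = {
        "alpha": [0.5, 1.0, 2.0],   # pheromone weight
        "beta": [1.0, 2.0, 5.0],    # heuristic weight
        "rho": [0.1, 0.5, 0.9],     # pheromone evaporation rate
        "ants": [10, 25, 50],       # colony size
        "q0": [0.0, 0.5, 0.9],      # exploitation probability
    }

    # A full factorial design crosses every level of every parameter.
    design = list(product(*levels.values()))
    print(len(design))               # 3^5 = 243 design points

    # Each design point is usually replicated because the heuristics are
    # stochastic, so the number of algorithm runs multiplies again.
    replicates = 10
    print(len(design) * replicates)  # 2430 runs for even this tiny grid

Even five parameters at three levels each already require hundreds of runs, which is one reason the screening and fractional factorial designs discussed later in the thesis matter.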

1.1 Hypothesis Statement

We can now identify the central hypothesis of this research:

The problem of tuning a metaheuristic can be successfully addressed with a Design Of Experiments approach.

If the parameter tuning problem is addressed successfully, then we can expect

• to make verifiably accurate predictions of metaheuristic performance with a given confidence.

• to make verifiably accurate recommendations on the most important tuning parameters with a given confidence.

• to make verifiably accurate recommendations on tuning parameter settings with a given confidence.

• to make all of these recommendations in terms of solution quality and solution time.

The specific metaheuristic studied in this thesis is Ant Colony Optimisation (ACO) [47]. The CO problem domain to which ACO will be applied is the Travelling Salesperson Problem [75]. The importance of and need for this research has already been highlighted in the ACO field [118].
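Because each of these expectations involves both solution quality and solution time, the thesis (as noted in the abstract) uses desirability functions to fold the two responses into a single measure. The sketch below is only a minimal illustration of that general idea, in the style of a Derringer-Suich smaller-is-better desirability; the bounds, weights and response values are invented for the example and are not those used in the case studies.

    # Minimal, hypothetical illustration of a smaller-is-better desirability function.
    def desirability_smaller_is_better(y, low, high, weight=1.0):
        """Map a smaller-is-better response onto the interval [0, 1]."""
        if y <= low:
            return 1.0   # at or below the target: fully desirable
        if y >= high:
            return 0.0   # at or above the acceptable limit: undesirable
        return ((high - y) / (high - low)) ** weight

    # Two responses from a single (made-up) algorithm run.
    d_error = desirability_smaller_is_better(y=1.2, low=0.0, high=5.0)     # relative error (%)
    d_time = desirability_smaller_is_better(y=30.0, low=10.0, high=120.0)  # running time (s)

    # The overall desirability is the geometric mean, so a poor value on
    # either response drags the combined score down.
    overall = (d_error * d_time) ** 0.5
    print(round(d_error, 3), round(d_time, 3), round(overall, 3))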

1.2 Thesis structure

The thesis is divided into four main parts. Preliminaries are the necessary topics that must be covered to place the research in context. The second part, Related Work, presents a synthesis of the methodological issues that arise in empirical analyses of metaheuristics and critically reviews the literature on parameter tuning in light of these issues. The third part, Design Of Experiments for Tuning Metaheuristics, is concerned with methodology. It introduces the experimental testbed and presents one of the thesis' main contributions, a Design of Experiments methodology for metaheuristic parameter tuning. The final part, Case Studies, contains several examples of the successful application of the methodology. The specific chapters are now summarised.

Chapter 2 gives a background on combinatorial optimisation and the Travelling Salesperson Problem, the type of optimisation and problem domain studied in this thesis. Various approaches to solving combinatorial optimisation problems are covered. The discussion then focuses on Ant Colony Optimisation (ACO), the family of metaheuristics used to illustrate the methodology advocated in the thesis. The chapter concludes with an overview of the Design Of Experiments field.

Chapter 3 brings together and discusses many of the issues that arise when performing empirical analyses of heuristics. Some of these issues have often been raised in the research literature but are scattered across a range of related research fields. This chapter therefore draws on literature from fields such as Operations Research, Heuristics, Performance Analysis, Design of Experiments and Statistics.

Chapter 4 is a critical review of the literature on parameter tuning in light of the methodological issues highlighted in the previous chapter. It begins with approaches to analysing problem difficulty for heuristics. Parameter tuning is addressed in terms of metaheuristics other than ACO and in terms of the ACO metaheuristic. For the treatment of ACO parameter tuning, the chapter reviews analytical, automated and empirical approaches.

Chapter 5 describes the experimental testbed. It covers the problem generator and metaheuristic code used. It also details the benchmarking of the experiment machines. All topics are covered in light of the empirical analysis issues discussed in Chapter 3. This chapter is key to the reproducibility of the results the thesis presents.

Chapter 6 is a detailed step-by-step description of the Design Of Experiments methodology that the thesis introduces. The methodology is crafted in terms of the empirical analysis concerns of Chapter 3. This chapter serves as a template for all the case studies reported in the final part of the thesis.

Chapters 7 to 11 are the thesis case studies. They illustrate all aspects of the thesis' Design Of Experiments methodology of Chapter 6. Case studies cover the two best performing members of the ACO metaheuristic family, Ant Colony System and Max-Min Ant System. Many new results for the ACO field are presented and open questions from the literature are answered. This underscores the benefits of adopting the thesis' rigorous Design Of Experiments methodology.

The thesis concludes with Chapter 12.

Appendix A is an overview of Design Of Experiments (DOE) terminology and concepts. It is provided for the convenience of the reader who is unfamiliar with DOE. It should not be taken as a replacement for comprehensive textbooks on the subject [89, 84, 85]. Appendix B contains some statistics related to the TSP. Appendix C is an important complexity calculation related to the MMAS heuristic.

1.3 Chapter summary

This chapter has introduced and motivated the main thesis of this research.

• Problems of combinatorial optimisation were introduced and the difficulty of solving them was explained.

• Metaheuristics were introduced as a popular emerging approach for solving CO problems.

• The parameter tuning problem was identified as a key research challenge that will always be faced when dealing with newly proposed metaheuristics, proposed changes to existing metaheuristics and new problem types. The difficulty of the parameter tuning problem was explained and the importance of solving the problem in terms of the heuristic compromise of solution time and solution quality was emphasised. Approaches to addressing the parameter tuning problem were categorised as either analytical or empirical and the current deficiencies in the state-of-the-art of both approaches were explained.

• Design Of Experiments (DOE) was identified as a well-established field that may be a very good candidate for empirically solving the parameter tuning problem in a rigorous fashion. This led to the central hypothesis of this thesis:

The problem of tuning a metaheuristic can be successfully addressed with a Design Of Experiments approach.


2 Background

The previous chapter introduced combinatorial optimisation problems, discussed their importance across academia and industry and explained why they are typically difficult to solve. Metaheuristics were introduced as a general framework for solving such problems and the parameter tuning problem was presented as one of the key obstacles to the successful deployment of metaheuristics. The chapter highlighted the lack of experimental rigour in the field's attempts to analyse and understand its heuristics, particularly its lack of a rigorous approach to the parameter tuning problem. This led to the hypothesis that rigorous techniques can be adapted from the Design Of Experiments (DOE) field to successfully tackle the parameter tuning problem.

This chapter gives a more detailed background to the areas mentioned in the previous chapter's motivation and hypothesis. It begins with a general description of combinatorial optimisation before focussing on the particular combinatorial optimisation problem addressed in this thesis. The approaches to solving combinatorial optimisation problems are reviewed. The chapter then focuses on the particular family of metaheuristics that is studied in this thesis. The chapter concludes with some background on the Design Of Experiments techniques that will be adapted to the parameter tuning problem in this thesis.

2.1 Combinatorial optimisation

Optimisation problems in general divide naturally into two classes: those where solutions are encoded with real-valued variables and those where solutions are encoded with discrete variables. Combinatorial Optimisation (CO) problems are of the latter type.

An illustrative example of a CO problem is that of class timetabling. Such timetabling typically involves assigning a group of teachers and students to classrooms. This assignment is subject to the constraints that a teacher cannot teach all subjects, students are only taking a limited number of all the available classes, teachers and students cannot be in two classrooms at once and no more than one class can be taught in a classroom at a given time. The variables are discrete because we cannot consider some fraction of a student, room or teacher. The difficulty of the problem lies in the large number of possible solutions that have to be searched and the constraints on keeping all teachers and students satisfied.

Some other popular examples of CO problems are the Travelling Salesperson Problem (TSP) [75], the Quadratic Assignment Problem (QAP) [55, p. 218] and the Job Shop Scheduling Problem (JSP) [55, p. 242]. The ubiquity of CO problems and their importance for logistics, manufacture, scheduling and other industries has resulted in a large body of research devoted to their understanding, analysis and solution. This thesis is concerned with a particular type of combinatorial optimisation problem called the Travelling Salesperson Problem.

2.2 The Travelling Salesperson Problem (TSP)

Informally, the Travelling Salesperson Problem (TSP) can be described in the following way. Given a number of cities and the costs of travelling from any city to any other city, what is the cheapest round-trip route that visits each city exactly once? [121]

The most direct solution would be to try all the ordered combinations of cities and see which combination, or tour, is cheapest. Using such a brute force search rapidly becomes impractical because the number of possible combinations of n cities to consider is the factorial of n. This rapid growth in problem search space size is illustrated in Figure 2.1.

Figure 2.1: Growth of TSP problem search space. The horizontal axis is the number of cities in a TSP problem. The vertical axis is the number of combinations of cities that have to be considered. The figure shows an exponential growth in search space size with problem size.
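To make this factorial growth concrete, the short sketch below (an illustration only, not taken from the thesis) prints the number of ordered arrangements of the cities for a few problem sizes, and the usual (n - 1)!/2 count of distinct tours for the symmetric TSP.

    # The brute-force view of the TSP: the search space grows factorially.
    from math import factorial

    for n in (5, 10, 20, 50, 100):
        print(n, factorial(n))       # ordered arrangements of n cities

    # For the symmetric TSP, fixing the start city and ignoring tour direction
    # still leaves (n - 1)!/2 distinct tours.
    n = 20
    print(factorial(n - 1) // 2)     # roughly 6.1e16 tours for just 20 cities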

In fact, the TSP has been shown to be Nondeterministic Polynomial-time hard (NP-hard). Informally, this means that it is contended that the TSP cannot be solved to optimality within polynomially bounded computation time in the worst case. A detailed examination of the TSP, computational complexity theory and NP-hardness is beyond the scope of this thesis. The reader is referred to the literature for a discussion of this important topic [70]. For practical purposes, the difficulty of the TSP means that a sophisticated approach to its solution is required.

The difficulty of solving the TSP to optimality, despite its conceptually simple description, has made it a very popular problem for the development and testing of combinatorial optimisation techniques. The TSP "has served as a testbed for almost every new algorithmic idea, and was one of the first optimization problems conjectured to be 'hard' in a specific technical sense" [70, p. 37]. This is particularly so for algorithms in the Ant Colony Optimisation (ACO) field where 'a good performance on the TSP is often taken as a proof of their usefulness' [47, p. 65].

The type of TSP described at the start of this section can be termed the general asymmetric TSP. It is asymmetric because the cost of travelling between two given cities can be different depending on the direction of travel. The cost from city 1 to city 2 can be different to the cost from city 2 to city 1. There are several further categories of TSP problem that we can distinguish [70, p. 58-61]. Their relationship to one another in terms of generalisations and specialisations of the general asymmetric TSP is illustrated in Figure 2.2.

This thesis focuses exclusively on symmetric TSP instances. The reader is referred to the literature for details of the other TSP types [70, p. 58-61]. The symmetric TSP specialisation was chosen as a problem domain because the heuristics researched in this thesis were originally developed for this TSP type. This thesis follows the usual convention of using the term problem to describe a general problem such as the Travelling Salesperson Problem and an instance to be a particular case of a problem.

2.3 Approaches to solving combinatorial optimisation problems

Algorithms to tackle combinatorial optimisation problems can be classified as either exact or approximate. Exact methods are guaranteed to find an optimal solution in bounded time. Unfortunately, many problems are NP-hard like the TSP and so may require exponential time in the worst case. This impracticality of exact methods has led to the use of approximate (or heuristic) methods—methods that sacrifice the guarantee of finding an optimal solution in order to find a satisfactory solution in reasonable time. We term this the heuristic compromise. This compromise is even mentioned implicitly in some definitions of CO problems [42, p. 244]. Approximate methods (or heuristics) can be distinguished as being either constructive methods or local search (or improvement) methods. Constructive methods start from scratch and add solution components until a complete solution is found. The nearest neighbour heuristic for the TSP is an example of a constructive heuristic. It begins at some city and repeatedly chooses the nearest unvisited city

Figure 2.2: Special cases and generalisations of the TSP. The diagram relates the K-Salesman TSP, Dial-a-ride, Stacker Crane, Mixed Chinese Postman, General Asymmetric TSP, Asymmetric Triangle Inequality TSP, Symmetric TSP, Symmetric Triangle Inequality TSP, Directed Hamiltonian Cycle, Hamiltonian Cycle, Hamiltonian Cycle for Grid Graphs, Rectilinear TSP and Euclidean TSP. The TSP type studied in this thesis is highlighted in bold. Image adapted from [70, p. 59].

until a complete tour has been constructed. Local search (or improvement) heuristics start with an initial solution and iteratively try to improve on this solution by searching within an appropriately defined neighbourhood of the current solution¹. Others [11] provide an overview of local search approaches. This research focuses on constructive methods rather than local search.

It is clear that there is a myriad of possible combinations of constructive heuristics and local search heuristics. Some are more suited to particular types of combinatorial optimisation problems and instances than others. Metaheuristics try to combine these more basic heuristics (both constructive and local search) into higher-level frameworks in order to better search a solution space. Some examples of metaheuristics for combinatorial optimisation [16, p. 270] are Ant Colony Optimisation [47], Evolutionary Computation [83], Simulated Annealing [73], Iterated Local Search [66], and Tabu Search [56].

The general high-level framework of the metaheuristic makes it difficult to define exactly what a metaheuristic is. Many definitions have been summarised in the literature [16, p. 270]:

    A metaheuristic is an iterative master process that guides and modifies the operations of subordinate heuristics to efficiently produce high-quality solutions. It may manipulate a complete (or incomplete) single solution or a collection of solutions at each iteration. The subordinate heuristics may be high (or low) level procedures, or simple local search, or just a construction method. [120]

    Metaheuristics are typically high-level strategies which guide an underlying, more problem specific heuristic, to increase their performance. . . . Many of the metaheuristic approaches rely on probabilistic decisions made during the search. But, the main difference to pure random search is that in metaheuristic algorithms randomness is not used blindly but in an intelligent, biased form. [117, p. 23]

Interestingly, the first definition permits a stand-alone constructive heuristic without local search to be considered as a metaheuristic.

It can be useful to consider the different dimensions along which metaheuristics can be classified ([16, p. 272] and [117, p. 33-35]). The metaheuristic studied in this research can then be classified in relation to other metaheuristics.

• Nature-inspired versus non nature-inspired. There are nature-inspired algorithms, like Genetic Algorithms and Ant Algorithms, and non nature-inspired ones such as Tabu Search and Iterated Local Search. This dimension is of little use as most modern metaheuristics are hybrids that fit in both classes.

• Population-based versus trajectory methods. This describes whether an algorithm works on a population of solutions or a single solution at any time.

¹ We have used the terms local search and improvement together here. Henceforth, we only use the term local search since this is now the more fashionable term for such heuristics.

Population-based methods evolve a set of points in the search space. Trajectory methods focus on the trajectory of a single solution in the search space.

• Dynamic versus static objective function. Dynamic metaheuristics modify the fitness landscape, as defined by the objective function, during search to escape from local minima.

• One versus various neighbourhood structures. Some metaheuristics allow swapping between different fitness landscapes to help diversify search. Others operate on one neighbourhood only.

• Memory usage versus memory-less methods. Some metaheuristics use adaptive memory. This involves keeping track of recent decisions made and solutions found or generating synthetic parameters to describe the search. Metaheuristics without adaptive memory determine their next action solely on the current state of their search process.

Having described the need for heuristic (approximate) methods for tackling combinatorial optimisation problems and the concept of a metaheuristic, we can now describe the metaheuristic family examined in this thesis.

2.4

Ant Colony Optimisation (ACO)

Ant Colony Optimisation (ACO) [47] is a metaheuristic based on the way many species of real ants forage for food. It is helpful to consider this process in a little detail before describing the actual ACO heuristics. Real ants manage to find paths between their nest and food sources over large distances relative to a single ant’s size. They manage to coordinate this foraging for large swarms of ants despite individual ants having only rudimentary vision and no centralised swarm leader. It turns out that real ants communicate between themselves by leaving chemical markers in their environment. These markers are called pheromones. By laying down pheromones and sensing existing pheromones, real ants can locate and converge on trails leading to food sources. These pheromones also evaporate over time so that as a food source is exhausted and fewer ants visit it, the trail eventually disintegrates. The original ant algorithm was inspired by the so-called ‘double bridge experiment’, an experiment in biology that demonstrated this pheromone-laying behaviour for real ants. The experiment is summarised here to help in understanding the subsequent algorithm descriptions. It is described in more detail in the ACO literature [47, p. 1-5]. A double bridge was set up to connect a nest of ants to a food source. One bridge was twice as long as the other (Figure 2.3 on the next page). Ants leave their nest, encounter the fork in their path at point 1 and randomly choose one of the two bridges. Ants choosing the shorter bridge will arrive at the food source and start returning to the nest sooner. Because more ants can make the journey along the shorter bridge in the same time as ants on the longer bridge, the pheromone markers build up more quickly on the shorter bridge. Subsequent

ants, leaving the nest and encountering the fork at point 1 in Figure 2.3, sense a higher level of pheromone on the shorter bridge and therefore favour choosing the shorter bridge. This positive feedback of attractive pheromone trails enables the swarm of ants to successfully find the shorter path to the food source without any centralised leader and without any global vision of the two bridges.

Two points are particularly noteworthy about this experiment. Firstly, when the vast majority of the ants had converged on the shorter bridge, a small proportion continued to choose the longer bridge because of the random decision process. This can be considered as a type of continuous exploration of the environment. Secondly, when presented with an even shorter bridge after convergence, the ants were unable to move to the new shortest bridge because pheromone levels were so high on the original bridge on which they had already converged. The natural evaporation of the pheromone chemical was too slow to allow ants to ‘forget’ the first bridge.


Figure 2.3: Experiment setup for the double bridge experiment. Ants leave the nest and move towards the food source. One bridge is longer than the other (adapted from [47, p. 3]).

This ability of real ant swarms to find the shortest route along constrained paths using pheromone markers and random decision processes was the inspiration for the ant colony heuristic.

2.4.1

The Ant Colony Heuristic

The idea proposed in the original ant heuristic [44] and developed in many ACO heuristics since then can be summarised in a general sense as follows. A combinatorial optimisation problem consists of a set of components. A solution of the problem is an ordering of these components and a cost is associated with the ordering of each solution. This situation is represented by a data structure called a graph (Figure 2.4 on the following page). Nodes in the graph (the black dots) are solution components and a directed edge between two nodes is the cost of ordering those components one after the other in the problem’s solution.

A number of artificial ants construct solutions by moving on the problem’s fully connected graph representation. A movement from one node to another represents a given ordering of those nodes in the constructed solution. Movements are governed by stochastic decisions. The constraints of the problem are built into the ants’ decision processes. Ants can be constrained to only construct feasible solutions or can also be allowed to construct infeasible solutions when this is beneficial. The edges of the graph have an associated pheromone value and heuristic value.

Figure 2.4: An example of a graph data structure. Nodes (the black dots) represent solution components and edges (the lines joining the nodes) represent the costs of ordering the connected nodes one after the other in the solution. The graph is fully connected because every node is connected to every other node by an edge (adapted from [44]).

The pheromone value is updated by the ants while the heuristic value comes from knowledge of the problem or the problem instance. Both the pheromone value and the heuristic value of an edge are components of an ant’s stochastic decision process when it considers the edges to move along. Stützle and Dorigo provide a more formal discussion of how the ant heuristic works [47, p. 34-36]. Ant Colony heuristics have been applied to a wide range of problems that can

be represented by such a graph structure (Table 2.1).

#    Problem                             References
1    Travelling Salesman Problem         [44], [46], [118]
2    Quadratic Assignment Problem        [77], [118]
3    Scheduling                          [39], [51], [80], [17]
4    Vehicle Routing                     [25]
5    Set Packing                         [54]
6    Graph Colouring                     [32]
7    Shortest Supersequence Problem      [81]
8    Sequential Ordering                 [53]
9    Constraint Satisfaction Problems    [116]
10   Data Mining                         [93]
11   Edge disjoint paths problem         [88]
12   Bioinformatics                      [113]
13   Industrial                          [10], [59]
14   Dynamic                             [62, 61]

Table 2.1: A selection of ant heuristic applications (from [15]).

Since this thesis focuses on the TSP problem, we now explain how ant colony heuristics are applied specifically to the TSP.

2.4.2

Application to the Travelling Salesperson Problem

The application of the general ant colony heuristic of the previous section to the TSP of Section 2.2 on page 30 is straightforward. A solution is an ordering of all the graph nodes because the TSP is a tour of all cities. Ants are therefore restricted to constructing feasible tours only. Pheromone values are associated with each edge. Higher pheromone values reflect a greater desirability for visiting one node after another. The heuristic associated with each edge is simply the inverse of the cost of adding that edge to the constructed solution. This cost is typically the distance between the two nodes where distances can be calculated in several ways. These costs are typically stored in a cost matrix.
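As an illustration of this setup, the following minimal Python sketch (our own illustrative code, not taken from the ACOTSP implementation; all function and variable names are ours) builds the cost matrix for a Euclidean TSP instance and the corresponding heuristic values as the inverse of each edge cost.

import math

def build_cost_matrix(coords):
    """Euclidean distances between every pair of cities (the cost matrix)."""
    n = len(coords)
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                (x1, y1), (x2, y2) = coords[i], coords[j]
                cost[i][j] = math.hypot(x1 - x2, y1 - y2)
    return cost

def build_heuristic_matrix(cost):
    """Heuristic value of an edge is the inverse of its cost (eta = 1/d)."""
    n = len(cost)
    return [[0.0 if i == j else 1.0 / cost[i][j] for j in range(n)]
            for i in range(n)]

# Example: four cities on a unit square.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
cost = build_cost_matrix(coords)
eta = build_heuristic_matrix(cost)

The same matrices would be computed once per instance and shared by all ants during a run.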

2.4.3

The ACO Metaheuristic

Since the introduction of the original ant colony heuristic, Ant System [44], a pattern in implementation has emerged that has allowed Ant System and many of the subsequent ant colony heuristics to be grouped within a metaheuristic framework. This metaheuristic is called Ant Colony Optimisation (ACO) [43]. The ACO metaheuristic consists of several stages, illustrated in Figure 2.5.

1. Initialise pheromone trails
While (stopping criterion is not yet met)    [Main Loop]
    For (each ant)
        2. Construct solutions using a probabilistic decision rule
    End For
    For (each ant)
        Apply local search
    End For
    For (graph edges)
        3. Update pheromones
    End For
    4. Daemon actions
End While

Figure 2.5: The ACO Metaheuristic. The four main stages described in the text are numbered.

1. Initialise Pheromone Trails. An initial pheromone value is applied to each edge in the problem instance.

2. Construct Solutions. Ants visit adjacent states of the problem by moving between nodes on the problem graph. Once solutions have been constructed, local search (Section 2.3 on page 31) can be applied to the solutions.

3. Update Pheromones. Pheromone trails are modified by evaporating pheromone from and depositing pheromone onto the problem graph’s edges. Evaporation decreases the pheromone associated with an edge and deposition increases the pheromone.

4. Daemon Actions. Centralised actions occur that are not part of the individual ant actions. These typically involve global information such as determining the best solution found by the construct solutions phase.

All ACO stages are scheduled by a Schedule Activities construct. The metaheuristic does not impose any detail on how this scheduling might occur. Solution construction, for example, might occur asynchronously [111], in sequence or in parallel.

This thesis investigates two ant colony heuristics within the ACO metaheuristic framework. These are Max-Min Ant System [118] and Ant Colony System [46]. Stützle and Dorigo [47, p. 69] state that one may distinguish between those heuristics that descend directly from the original Ant System and those that propose significant modifications to the structure of Ant System. Of the heuristics studied in this thesis, MMAS is of the former type and Ant Colony System is of the latter. The following sections provide a detailed description of the ACO heuristics studied. The descriptions follow the four stages in the ACO metaheuristic description. Ant System is described first for completeness.

2.4.4

Ant System (AS)

Ant System (AS) was first introduced to the peer-reviewed literature in 1996 [44] as a heuristic for the TSP. The details of its four stages within ACO are as follows.

Stage 1: Initialise Pheromone Trails

An initial pheromone value τ0 is applied to all edges in the problem. In the ACOTSP code [47] this initial value was calculated according to the following equation:

\tau_0 = \frac{1}{\rho \cdot NNTour} \qquad (2.1)

where ρ is a heuristic tuning parameter related to the update pheromones stage described in Section 2.4.4 on the next page and NNTour is the length of a single tour generated using the nearest neighbour heuristic. If local search has been specified for the algorithm then this local search is applied to the solution from the nearest neighbour heuristic.

Stage 2: Construct Solutions

AS ants apply the following so-called random proportional rule when choosing the next TSP city to visit. The probability of an ant at a city i choosing a next city j is given by

p_{ij} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in F_i} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}} \quad \text{if } j \in F_i \qquad (2.2)

where Fi is the set of cities that the ant has not yet visited, τij is the pheromone level on the graph edge connecting cities i and j and ηij is the heuristic value for that edge. α and β are heuristic tuning parameters that adjust the relative influence of pheromone and heuristic values respectively.

Stage 3: Update Pheromones

Pheromones are updated with both evaporation and deposition once all ants have constructed a solution.

Evaporation. In general, evaporation of pheromone occurs on all edges in the problem graph. In the original source code, evaporation was limited to edges in the candidate list (see Section 2.4.8 on page 44) if local search was used. For any given edge connecting nodes i and j, the new pheromone value τij after evaporation is given by

\tau_{ij} = (1 - \rho)\,\tau_{ij} \qquad (2.3)

where ρ is a heuristic tuning parameter controlling the rate of pheromone evaporation.

Deposition. After evaporation, all ants deposit pheromone along the problem graph edges belonging to their constructed solution. For any given edge in an ant’s solution connecting nodes i and j, the new pheromone value τij after deposition is given by

\tau_{ij} = \tau_{ij} + 1/C \qquad (2.4)

where C is the cost of the solution built by the ant. Since better solutions have lower costs, Equation (2.4) means that better solutions receive a larger deposition of pheromone.

Stage 4: Daemon Actions

There are no actions in this stage of AS.
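To make the AS stages concrete, the following Python sketch shows one simplified AS iteration: the nearest neighbour tour used to set τ0 (Equation 2.1), tour construction with the random proportional rule (Equation 2.2) and the evaporation and deposition updates (Equations 2.3 and 2.4). It is an illustrative reimplementation rather than the ACOTSP code; it uses a hard-coded four-city instance and omits candidate lists and local search.

import random

def nearest_neighbour_tour_length(cost, start=0):
    """Length of a tour built by repeatedly visiting the nearest unvisited city."""
    n = len(cost)
    unvisited = set(range(n)) - {start}
    current, length = start, 0.0
    while unvisited:
        nxt = min(unvisited, key=lambda c: cost[current][c])
        length += cost[current][nxt]
        unvisited.remove(nxt)
        current = nxt
    return length + cost[current][start]            # close the tour

def construct_tour(cost, eta, tau, alpha, beta, rng):
    """Build one tour with the AS random proportional rule (Equation 2.2)."""
    n = len(cost)
    start = rng.randrange(n)
    tour, unvisited = [start], set(range(n)) - {start}
    current = start
    while unvisited:
        feasible = list(unvisited)
        weights = [(tau[current][j] ** alpha) * (eta[current][j] ** beta)
                   for j in feasible]
        nxt = rng.choices(feasible, weights=weights, k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    return tour

def tour_cost(cost, tour):
    return sum(cost[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))

def as_iteration(cost, eta, tau, n_ants, alpha, beta, rho, rng):
    """One AS iteration: construct solutions, evaporate, then deposit."""
    tours = [construct_tour(cost, eta, tau, alpha, beta, rng) for _ in range(n_ants)]
    n = len(cost)
    for i in range(n):                               # evaporation (Equation 2.3)
        for j in range(n):
            tau[i][j] *= (1.0 - rho)
    for tour in tours:                               # deposition (Equation 2.4)
        deposit = 1.0 / tour_cost(cost, tour)
        for k in range(len(tour)):
            i, j = tour[k], tour[(k + 1) % len(tour)]
            tau[i][j] += deposit
            tau[j][i] += deposit                     # symmetric TSP
    return tours

# Tiny 4-city example (unit square; 1.414 approximates the diagonals).
cost = [[0.0, 1.0, 1.414, 1.0],
        [1.0, 0.0, 1.0, 1.414],
        [1.414, 1.0, 0.0, 1.0],
        [1.0, 1.414, 1.0, 0.0]]
eta = [[0.0 if i == j else 1.0 / cost[i][j] for j in range(4)] for i in range(4)]

rng = random.Random(1)
rho, alpha, beta, n_ants = 0.5, 1.0, 2.0, len(cost)
tau0 = 1.0 / (rho * nearest_neighbour_tour_length(cost))   # Equation 2.1
tau = [[tau0] * len(cost) for _ in range(len(cost))]
for _ in range(10):
    as_iteration(cost, eta, tau, n_ants, alpha, beta, rho, rng)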

2.4.5

Max-Min Ant System (MMAS)

Max-Min Ant System [118] makes several modifications within the AS structure. These modifications involve the use of limits on pheromone values (τmax and τmin) and the occasional reinitialisation of edge pheromone levels.

Stage 1: Initialise Pheromone Trails

The maximum pheromone value τmax for all edges is initialised as per Equation (2.1 on page 38). The initial value of the pheromone minimum τmin is then set according to

\tau_{min} = \frac{\tau_{max}}{2n} \qquad (2.5)

where n is the TSP problem size (the number of nodes in the graph). All edges are initialised to the maximum trail value.

Stage 2: Construct Solutions

The Construct Solutions phase is the same as for AS (Section 2.4.4 on page 38).

Stage 3: Update Pheromones

Before any evaporation occurs, the trail limits are updated according to the following equations. The trail maximum is always calculated as follows:

\tau_{max} = \frac{1}{\rho \cdot C_{best\ so\ far}} \qquad (2.6)

where C_{best so far} is the tour length of the best so far ant, the ant that produced the best solution during the course of the heuristic so far. The calculation of the trail minimum in the original source code [47] on which the thesis experiments are based was confounded with whether local search was used. The first method, used when local search was in use, calculated the new trail minimum as in Equation (2.5). This method is described in the book [47]. However, when local search was not in use, the source code accompanying the book used another calculation:

\tau_{min} = \frac{\tau_{max}\left(1 - e^{\log(p)/2}\right)}{\frac{candlist + 1}{2}\, e^{\log(p)/2}} \qquad (2.7)

where p is another possible heuristic tuning parameter and candlist is the candidate list length (Section 2.4.8). This was the calculation used in this thesis research with p fixed at 0.05, a value that was hard-coded in the source code. Equation (2.7) is similar in form to a version used in the literature [118].

Evaporation. Pheromone evaporation is the same as for AS (Section 2.4.4 on the preceding page and Equation (2.3 on the previous page)). After evaporation, the trail limits are checked. Any pheromone value less than the trail minimum is reset to be equal to the trail minimum. Any pheromone value greater than the trail maximum is reset to be equal to the trail maximum.

Deposition. Pheromone deposition is also very similar to deposition for AS (Section 2.4.4 on the preceding page and Equation (2.4 on the previous page)) except that only a single ant is allowed to deposit pheromone. The choice of whether this ant is the best

ant so far (best so far) or the best ant from the current iteration (best of iteration) is rather complicated. The best so far ant is used every u_gb heuristic iterations. In all other iterations the best of iteration ant is used. This frequency of the best so far ant can vary. For example, in the original source code and one piece of literature [118], the frequency is varied according to a schedule. The schedule approach was used when local search was also used. Alternatively, when local search was not in use, the best so far ant was used every u_gb iterations and this value was fixed at 25 in the original code. Clearly there are many possible schedules that can be applied to the frequency of pheromone deposition with the best so far ant. In this research we take a simpler approach of having a fixed frequency restart_freq with which the best so far ant is used, as in the case of no local search in the original source code. This fixed frequency is a heuristic tuning parameter.

Stage 4: Daemon Actions

In MMAS, the daemon actions involve occasionally reinitialising the pheromone trail levels. Reinitialisation occurs if both of two conditions are met. In the literature [47, p. 76], determining the reinitialisation was described as one condition or the other being met. This research uses an and condition to maintain backwards compatibility with the original source code. The first condition is whether a given threshold number of iterations since the last solution improvement has been exceeded. The second condition (which is expensive to calculate relative to the first condition) is whether the branching factor has dropped below a given threshold. Branching factor is a measure of the uniformity of pheromone levels on all edges in the problem’s graph. Its calculation and its expense are discussed in more detail in Appendix C on page 233. The check is done after a fixed number of iterations because of the expense of calculating the branching factor.

There are therefore three tuning parameters controlling pheromone reinitialisation: threshold iterations (reinit_iters), threshold branching factor (reinit_Branch) and check frequency (reinit_freq). In the original source code, these were hard-coded to 250, 1.0001 and 100 respectively. However, at least one case in the literature [118] uses a reinit_iters of 50. In the research reported in this thesis, we fix reinit_freq = 1 so that checks on these conditions are made in every iteration. We made this decision because the nesting of the other two parameters within the checking frequency made it impossible to combine properly all combinations of these tuning parameters in an experiment design. When a reinitialisation is due, trails are reinitialised to the trail max value that is calculated as:

\tau_{max} = \frac{1}{\rho \cdot C_{best\ so\ far}} \qquad (2.8)
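A minimal sketch of the MMAS-specific bookkeeping described above is given below. It is illustrative only (the function names are ours and the scheduling details of the original code are omitted): it shows the clamping of pheromone values to [τmin, τmax] after evaporation and a simplified version of the two-condition reinitialisation check.

def clamp_trail_limits(tau, tau_min, tau_max):
    """Reset any pheromone value that has drifted outside [tau_min, tau_max]."""
    n = len(tau)
    for i in range(n):
        for j in range(n):
            if tau[i][j] < tau_min:
                tau[i][j] = tau_min
            elif tau[i][j] > tau_max:
                tau[i][j] = tau_max

def update_trail_limits(rho, best_so_far_cost, n):
    """Equations 2.6 and 2.5: trail max from the best-so-far tour, min from max."""
    tau_max = 1.0 / (rho * best_so_far_cost)
    tau_min = tau_max / (2.0 * n)
    return tau_min, tau_max

def should_reinitialise(iters_since_improvement, branching_factor,
                        reinit_iters=250, reinit_branch=1.0001):
    """Simplified daemon check: both conditions must hold (an 'and' condition)."""
    return (iters_since_improvement > reinit_iters and
            branching_factor < reinit_branch)

def reinitialise_trails(tau, tau_max):
    """Reset every edge's pheromone to the current trail maximum (Equation 2.8)."""
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] = tau_max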

2.4.6

Ant Colony System (ACS)

Ant Colony System [46] differs significantly from AS in its solution construction and pheromone evaporation procedures.

Stage 1: Initialise Pheromone Trails

The initial pheromone value for all edges is given by

\tau_0 = \frac{1}{n \cdot NNTour} \qquad (2.9)

where n is the problem size and NNTour is the length of a nearest neighbour tour. This is different from AS and MMAS (Equation (2.1 on page 38)) in that the n term has replaced the pheromone evaporation term ρ. As with AS and MMAS, if local search is in use, it is applied to the solution generated by the nearest neighbour heuristic.

Stage 2: Construct Solutions

Solution construction is notably different from the previous algorithms. An ant at a city i chooses a next city j as follows:

j = \begin{cases} \arg\max_{l \in F_i}\left\{[\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}\right\} & \text{if } q \le q_0 \\ J & \text{otherwise} \end{cases} \qquad (2.10)

where q is a random variable uniformly distributed in the range [0, 1]. q0 is a tuning parameter that determines the threshold q value below which exploitation occurs and above which exploration occurs in Equation (2.10). Fi is the set of feasible cities (cities not yet visited by the ant). J is a randomly chosen city using the same Equation (2.2 on page 39) as AS and MMAS, repeated below for convenience:

p_{ij} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in F_i} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}} \quad \text{if } j \in F_i \qquad (2.11)

ACS was the first ACO algorithm to use a different decision process for exploration and exploitation. The original source code facilitated applying this decision process to solution construction in all heuristics provided with ACOTSP [47]. This thesis takes advantage of this detail to apply the exploration/exploitation threshold option to all heuristics studied.

When a given ant moves between two nodes, a local pheromone evaporation is immediately applied to the edge connecting those nodes. After a movement from node i to node j, the new pheromone level on the connecting edge τij is given by

\tau_{ij} = (1 - \rho_{local})\,\tau_{ij} + \rho_{local}\,\tau_0 \qquad (2.12)

where ρlocal is a heuristic tuning parameter and τ0 is the initial pheromone value of Equation (2.9). Because of this local pheromone evaporation, the order in which ants construct solutions in ACS may affect the pheromone levels presented to a subsequent ant and ultimately the solutions produced by the swarm. There are two distinct methods to construct solutions given a set of ants. Firstly, one can iterate through the set of ants, allowing each ant to make one move and the associated local pheromone evaporation. This can be considered parallel solution construction and was the default implementation in the source code. Secondly, one can move through the set of ants only once, allowing each ant to build a full tour and apply all associated local pheromone evaporations. This can be considered as sequential solution construction. It was an open question in the literature [47, p. 78] whether there was a difference between sequential and parallel solution construction, so this research included solution construction type as a tuning parameter for ACS. This thesis will answer that open question.

Stage 3: Update Pheromones

Evaporation. There is no evaporation in the update pheromones phase of ACS because pheromone has already been evaporated in the construct solutions phase.

Deposition. Pheromone deposition occurs along the trail of a single ant according to the following:

\tau_{ij} = (1 - \rho)\,\tau_{ij} + \rho\,\frac{1}{C_{chosen\ ant}} \qquad (2.13)

where C_{chosen ant} is the tour length of the chosen ant and ρ is a tuning parameter. It is claimed in the literature that the use of the best so far ant is preferable for instances greater than size 100 [47, p. 77]. We wished to investigate this claim methodically and so created a tuning parameter that determines the chosen ant used in ACS pheromone deposition. The tuning parameter determines whether the chosen ant is the best so far ant or the best of iteration ant.

Stage 4: Daemon Actions

There are no daemon actions for the ACS algorithm.
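The following Python fragment sketches the two mechanisms that distinguish ACS: the exploitation/exploration decision of Equation (2.10) and the local and global pheromone updates of Equations (2.12) and (2.13). It is an illustrative reimplementation rather than the ACOTSP code, and the function names are our own.

import random

def acs_choose_next_city(current, unvisited, tau, eta, alpha, beta, q0, rng):
    """ACS pseudo-random proportional rule (Equation 2.10)."""
    feasible = list(unvisited)
    if rng.random() <= q0:
        # Exploitation: take the best edge by pheromone and heuristic value.
        return max(feasible,
                   key=lambda j: (tau[current][j] ** alpha) * (eta[current][j] ** beta))
    # Exploration: fall back to the random proportional rule (Equation 2.11).
    weights = [(tau[current][j] ** alpha) * (eta[current][j] ** beta)
               for j in feasible]
    return rng.choices(feasible, weights=weights, k=1)[0]

def acs_local_update(tau, i, j, rho_local, tau0):
    """Local pheromone evaporation applied as soon as an ant crosses edge (i, j)."""
    tau[i][j] = (1.0 - rho_local) * tau[i][j] + rho_local * tau0
    tau[j][i] = tau[i][j]            # symmetric TSP

def acs_global_update(tau, chosen_tour, chosen_cost, rho):
    """Deposition along the single chosen ant's tour (Equation 2.13)."""
    for k in range(len(chosen_tour)):
        i, j = chosen_tour[k], chosen_tour[(k + 1) % len(chosen_tour)]
        tau[i][j] = (1.0 - rho) * tau[i][j] + rho * (1.0 / chosen_cost)
        tau[j][i] = tau[i][j]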

2.4.7

Other ACO heuristics

There are of course many other variants within the ACO metaheuristic framework. Best-Worst Ant System (BWAS) [31], although included in the original source code, was omitted from our studies. This was because some of the behaviours of BWAS were triggered by a CPU time measure. This made it impossible to guarantee backwards compatibility of our code with the original source code of Stützle and Dorigo. Ant System (AS), Elitist Ant System and Rank-based Ant System were omitted because they do not perform as well as MMAS and ACS and so have become less popular. Applying the methodology, experiment designs and analyses introduced by this thesis to AS, EAS, RAS and BWAS would be a straightforward matter to explore as future work.

2.4.8

Additional tuning parameters

There are other possible tuning parameters that have been suggested in the literature, are implicitly used in the original source code, or have been introduced in our implementation of the original source code. We describe these parameters here and discuss our decisions on their inclusion in the subsequent thesis research.

Exploration and exploitation

It was mentioned in the description of tour construction for ACS (Section 2.4.6 on page 42) that a random decision is made between exploration and exploitation based on an exploration/exploitation threshold and that the original source code allowed the application of this decision to all ACO algorithms. We decided to include the use of the exploration/exploitation threshold in all ACO algorithms investigated. If the threshold is not important, then a q0 threshold of 0 will be recommended by the tuning methodology and tour construction will default to the original case of only using the random proportional rule (see Equation (2.10 on page 42)).

Candidate lists

A speed-up trick known as a candidate list was first used in ACS. A candidate list restricts the number of available choices at each tour construction step to a list of choices that are rated according to some heuristic. For the TSP, one possible candidate list for a given city is a list of some number of neighbouring cities, sorted into increasing distance from the current city. This number of neighbours, the candidate list length, is a possible tuning parameter. For a static TSP problem, candidate lists can be constructed for each TSP city at the start of the heuristic run. This was the case in the original source code.

Candidate lists simplify tour construction as follows. When an artificial ant makes a decision on the next city to visit, it first checks its current city’s candidate list. If all cities in the list have been visited, the ant applies the usual tour construction rules to the remaining cities. If, however, there are unvisited cities in the candidate list, the ant chooses from its current city’s candidate list according to the usual rule. For this research, candidate lists were applied to all ACO heuristics. List length was expressed as a percentage of the problem size. A short sketch of candidate list construction and use is given at the end of this section.

Computation limit

The previous section described candidate lists and how they are used to limit ant decisions in the ACS heuristic tour construction. An examination of the original source code reveals that candidate lists also influence the update pheromones stages. Specifically, evaporation of pheromone and subsequent update of pheromone levels on edges are limited to the edges in each node’s candidate list. However, the influence of candidate lists was further complicated by its confounding with the use of local search. Specifically, if local search was applied, then evaporation and

update were limited to the candidate list. If local search was not specified, evaporation and update were applied to all edges.

The decision was taken in this research to introduce a new heuristic tuning parameter, called the computation limit, that specifies whether pheromone updates should be limited to the candidate lists or applied to all edges leading from every node. This tuning parameter can be applied independently of whether local search was specified. Applying computation to all edges is obviously extremely expensive and so in this research, computation limit is always set to be limited to the node candidate lists. In Design of Experiments (DOE) terms it is a held-constant factor (Section A.1.2 on page 211).

For MMAS, the candidate list length was also involved in the calculation of the updated trail minimum (Equation (2.7 on page 40) in Section 2.4.5 on page 40) and the computation of the branching factor for the trail reinitialisation decision. These calculations are not affected by the new computation limit parameter and can be specified independently.
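The sketch below illustrates the candidate list idea for the TSP. The function names are ours and the real ACOTSP data structures differ; each city's list holds its nearest neighbours sorted by increasing distance, and the next-city choice is restricted to unvisited members of that list, falling back to the full set of remaining cities only when the list is exhausted.

def build_candidate_lists(cost, list_length):
    """For each city, the list_length nearest cities sorted by increasing distance."""
    n = len(cost)
    return [sorted((j for j in range(n) if j != i),
                   key=lambda j: cost[i][j])[:list_length]
            for i in range(n)]

def restrict_choices(current, unvisited, candidate_lists):
    """Cities the ant may consider next: unvisited candidates if any remain,
    otherwise all remaining unvisited cities."""
    candidates = [j for j in candidate_lists[current] if j in unvisited]
    return candidates if candidates else list(unvisited)

# Usage: the returned set of cities is then fed into the usual construction rule,
# for example the random proportional rule of Equation (2.2).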

2.4.9

Summary of tuning parameters

Figure 2.6 summarises the various tuning parameters that are common to all the ACO heuristics in this research and, where available, the recommended tuning parameter settings from the literature [47, p. 71]. Some of these settings were hard-coded from the ACOTSP author’s experience with the algorithms. In the absence of such experience it is useful to parameterise these hard-coded values and experiment with tuning them.

#   Parameter            Description                                                      AS       MMAS     ACS
1   α                    Exponent of pheromone term in the random proportional rule.      1        1        1
2   β                    Exponent of heuristic term in the random proportional rule.      2 to 5   2 to 5   2 to 5
3   ρ                    Global pheromone evaporation term.                               0.5      0.02     0.1
4   m                    Number of ants.                                                  n        n        10
5   q0                   Exploration/exploitation threshold.                              None     None     0.9
6   c                    Length of candidate list.
7   Placement            Type of ant placement on the TSP graph.                          random   random   random
8   Local search type    The type of local search to use.
9   Don't look bits      A parameter related to the local search routine.
10  Neighbourhood size   The number of neighbours examined by the local search routine.
11  Computation limit    Whether certain computations are limited to the candidate list or applied to the whole problem.

Figure 2.6: Common tuning parameters and recommended settings for the ACO algorithms. These tuning parameters are common to all ACO algorithms in this research. n is the size of problem in terms of number of nodes.

The parameters and recommended settings from the literature for MMAS and ACS are given in Figure 2.7 on the next page and Figure 2.8 on the following page respectively. The nested parameter column in the MMAS table is for parameters that only make sense as part of their parent parameter.

12. Trail min update type. The calculation used when a new trail minimum is set. Recommended: none; varies in the literature.
13. p (nested within trail min update type). A term used in one particular type of trail min update calculation. Recommended: none; hard-coded to 0.05 in the original source code.
14. restart_freq. The frequency with which the best-so-far ant is used to deposit pheromone. Recommended: none; varies between a fixed frequency and more complicated scheduled frequencies.
15. reinit_freq. The frequency, in terms of iterations, with which a check is done on the need for trail reinitialisation. Recommended: none; hard-coded in the original source to 100.
16. reinit_iters. The threshold iterations without solution improvement after which a trail reinitialisation is considered. Recommended: none; hard-coded in the original source to 250.
17. reinit_Branch. The threshold branching factor after which a trail reinitialisation is considered. Recommended: none; hard-coded to 1.00001 in the original source code.
18. lambda. Used to determine the cut-off point for inclusion of an edge in the branching factor calculation. Recommended: none; hard-coded to 0.05 in the original source code.

Figure 2.7: Tuning parameters and recommended settings for the MMAS algorithm.

12. ρ_local. A term in the local pheromone evaporation equation. Recommended: 0.1.
13. Const. The solution construction method. Recommended: parallel.
14. Pheromone deposition ant. The choice of whether to use the best_so_far ant or the best_of_iteration ant in pheromone deposition. Recommended: none.

Figure 2.8: Tuning parameters and recommended settings for the ACS algorithm.

It is clear from these summaries that the ACO algorithms have many tuning parameters². Looking at the common parameters alone, there are eleven tuning parameters. When specific heuristics are considered, the number of tuning parameters increases to potentially 18 for MMAS and 14 for ACS. We say that there is a large parameter space. Moreover, there are no recommendations for many of the parameter settings. Sophisticated techniques are required so that it is feasible to experiment with such large numbers of tuning parameters. Such techniques exist in a field called Design of Experiments.

2.5

Design Of Experiments (DOE)

This section may contain some terminology unfamiliar to the reader. Further details on the specific DOE techniques and issues encountered in this thesis are summarised in Appendix A and are detailed in the literature [1, 89, 84, 85].

The National Institute of Standards and Technology defines Design Of Experiments (DOE or sometimes DEX) as:

    . . . a systematic, rigorous approach to engineering problem-solving that applies principles and techniques at the data collection stage so as to ensure the generation of valid, defensible, and supportable engineering conclusions. In addition, all of this is carried out under the constraint of a minimal expenditure of engineering runs, time, and money. [1]

The systematic approach comes from the clear methodologies and experiment designs used by DOE. The analysis of the designs is supported with statistical methods, providing the user with defensible conclusions and mathematically precise statements about confidence in those conclusions. The DOE principles of data collection ensure that only sufficient data of a high quality is collected, improving the efficiency and cost of the experiment. The main capabilities of DOE are as follows [112]:

1. Quantify multiple variables simultaneously: many factors and many responses can be investigated in a single experiment.

2. Identify variable interactions: the joint effect of factors on a response can be identified and quantified.

3. Identify high impact variables: the relative importance of all factors on the responses can be ranked.

² It is possible to draw a distinction between tuning parameters and what we shall term design parameters. Tuning parameters are known to affect heuristic performance and so must be specified for every deployment of the heuristic. Design parameters, by contrast, are alternative heuristic components that have been parameterised so that they can be plugged into the heuristic. The aim is to determine whether any of the alternative designs have a favourable effect on performance. An example in the current research is the use of sequential or parallel solution construction in ACS (Section 2.4.6 on page 42). If an alternative value of the design parameter is shown not to affect performance then that alternative is removed as a parameter and the improved design is thereby fixed.

4. Predictive capability within the design space: performance at new points in the design space can be predicted.

5. Extrapolation capability outside the design space: occasionally and with some caution, performance outside the design space can be extrapolated.

These capabilities make DOE an essential approach for any research dealing with large and expensive experiments. In industry, users of DOE include NASA³ and Google⁴. The Operations Research community has been aware of these advantages for some time, acknowledging that the risk of not adopting DOE is that the absence of a statistically valid, systematic approach can result in the drawing of insupportable conclusions [3]. Adenso-Díaz and Laguna [2] give a brief list of OR papers that have used statistical experiment design over the past 30 years. Some discussions are quite general and offer guidelines on the subject [7, 35, 60]. Experimental design techniques have been used to compare solution methods [3] and to find effective parameter values [33, 123]. None of these papers’ techniques or methodologies, however, have become so widespread that they approach being the standard for experimental work in OR. None have been applied to ACO heuristics.

More often than not, ACO research uses a trial-and-error approach to answering its research questions. Birattari [12, p. 34-35] identifies two disadvantages of the trial-and-error approach to parameter tuning and relates these to an industrial and academic context. From an industrial perspective, the trial-and-error approach is time-consuming and requires a very specialised practitioner. From an academic perspective, the approach does not facilitate a methodical scientific analysis.

When the need for a more methodical approach to parameter tuning is acknowledged, researchers may attempt a One-Factor-At-A-Time (OFAT) analysis. OFAT involves tuning a single parameter while all others are held fixed, repeating this process with each parameter one at a time. However, it is well recognised outside the heuristics field that DOE has many advantages over OFAT. Czitrom [37] illustrates these advantages by taking three real-world engineering problems that were tackled with an OFAT analysis and re-analysing them with designed experiments. In summary, the following advantages are clearly illustrated:

• Efficiency. Designed experiments require fewer resources, in terms of experiments, time and material, for the amount of information obtained.

• Precision. The estimates of the effects of each factor are more precise. Full factorial and fractional factorial designs use all observations to estimate the effects of each factor and each interaction. OFAT experiments typically use only two treatments at a time to estimate factor effects. (A small sketch contrasting the number of runs in a two-level factorial design with an OFAT study is given at the end of this section.)

³ Obtained by searching the NASA Technical Reports server (http://ntrs.nasa.gov/search.jsp) with the phrase “Design Of Experiments”.
⁴ Web page of Peter Norvig, current Director of Research at Google (http://norvig.com/experimentdesign.html).

• Interactions. Designed experiments can estimate interactions between factors but this is not the case with OFAT experiments.

• More information. There is experimental information in a larger region of the design space. This makes process optimisation more efficient because the whole factor space can be studied and searched.

Despite all the advantages and capabilities of DOE presented above, we still encounter several common excuses for not using DOE [112]. We list these here with our own refutation of those excuses.

• Claim of no interactions: it may indeed be the case that there are no interactions between factors. This claim can only be defended after a rigorous DOE analysis has shown it to be true.

• OFAT is the standard: we have seen that trial-and-error approaches and OFAT approaches are the norm. However, the comparison of DOE to OFAT that we presented from another field [37] shows that if OFAT is the standard then it is a seriously deficient standard that must be improved.

• Statistics are confusing: it is true that we cannot expect heuristics researchers to become experts in statistics and Design Of Experiments. That is the job of statisticians. It is also true that becoming an expert is not necessary for leveraging the power and capabilities of DOE. In other fields such as medicine and engineering, the research questions are often repetitive. Is this drug effective? Can this manufacturing process be improved? This permits identifying a small set of experiment designs and analyses that serve to answer those common research questions. We will see in Chapter 3 that heuristic tuning involves a similarly small set of research questions. This thesis will demonstrate the use of the designs and analyses to answer the most important of those questions. Even the statistical analyses themselves can be performed in software that shields the user from unnecessary statistical details and guides the user in interpreting statistical analyses.

• Experiments are too large: it is true that experiments with tuning heuristics are large. We have already mentioned the prohibitive size of the design space in Chapter 1’s list of obstacles to the parameter tuning problem. This thesis will introduce new experiment designs that permit answering the common research questions with an order of magnitude fewer experiments.
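To make the contrast with OFAT concrete, the sketch below enumerates the runs of a two-level full factorial design over three ACO tuning parameters. The parameter levels are purely illustrative and the code is our own, not drawn from any DOE package. With k factors a two-level full factorial needs 2^k runs, every observation contributes to every effect estimate, and fractional factorial designs reduce the run count further; an OFAT study of the same factors varies one factor at a time around a baseline and cannot estimate interactions.

from itertools import product

# Two levels (low, high) for each of three illustrative tuning parameters.
levels = {
    "alpha": [1, 3],
    "beta": [2, 5],
    "rho": [0.1, 0.5],
}

# Full factorial: every combination of levels, 2**k runs in total.
factor_names = list(levels)
full_factorial = [dict(zip(factor_names, combo))
                  for combo in product(*(levels[f] for f in factor_names))]
print(len(full_factorial))   # 8 runs; all main effects and interactions estimable

# OFAT: start from a baseline and change one factor at a time.
baseline = {f: levels[f][0] for f in factor_names}
ofat = [baseline] + [dict(baseline, **{f: levels[f][1]}) for f in factor_names]
print(len(ofat))             # 4 runs, but no interaction information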

2.6

Chapter summary

This chapter covered the following topics.

• Combinatorial optimisation. Combinatorial optimisation (CO) was introduced and described. The Travelling Salesperson Problem was highlighted as a particular type of CO problem.

• The Travelling Salesperson Problem. The reasons for the popularity of the TSP were given along with a summary of the various types of TSP. The Symmetric TSP is the focus of this thesis.

• Heuristics. The difficulty of finding exact solutions to CO problems necessitates the use of approximate methods or heuristics.

• Metaheuristics. Metaheuristics were introduced as an attempt to gather various heuristics into common frameworks.

• Ant Colony Optimisation. Ant Colony Optimisation is a particular metaheuristic based on the foraging behaviour of real ants. The ACO heuristics have been applied to many CO problems that can be represented by a graph data structure. Several types of ACO heuristic were described in detail.

• Design of Experiments. The field of Design Of Experiments was introduced and its capabilities highlighted. The advantages of DOE over its alternatives, trial-and-error or One-Factor-At-A-Time, were described. Some common excuses for not adopting DOE were refuted.

The next chapter will review the issues that arise when using DOE and describe how DOE should be adapted for experiments with tuning metaheuristics.

Part II

Related Work

3 Empirical methods concerns

Thus far, this thesis has motivated rigorous empirical research on the parameter tuning problem for metaheuristics. It was hypothesised that the parameter tuning problem could be successfully addressed by adapting techniques from the field of Design Of Experiments. A background on combinatorial optimisation and the Travelling Salesperson Problem was detailed. The Ant Colony Optimisation (ACO) family of metaheuristics was introduced and described.

Criticisms of the lack of experimental rigour in the experimental analysis of heuristics have been made on several occasions in the operations research field [64, 65]. Such criticisms and calls for increased rigour have also appeared in the evolutionary computation field [48, 122, 103]. While there has been much useful and creative research in the ACO field, the issue of experimental rigour has never been to the fore. In the following, we bring together the most relevant criticisms, suggestions and general issues relating to the design and analysis of experiments with heuristics that have appeared in the heuristics and operations research literature over the previous three decades. We relate these to the relatively new ACO field. This will facilitate a critical review of the literature on ACO and the parameter tuning of ACO in the next chapter. It will also strongly influence the development of the thesis methodology in subsequent chapters.

The material in this chapter is presented approximately in the order an experimenter would encounter the issues when working in the field. Some issues, such as reproducibility and responses for example, have an unavoidable overlap: a poor choice of response or poor reporting will reduce reproducibility, for example. A familiarity with statistics and Design Of Experiments is assumed. A necessary background on these topics is given in Appendix A and in the literature [89, 84].

3.1

Is the heuristic even worth researching?

A question that heuristics research often neglects is whether the heuristic is even worth researching. It is tempting to expend effort on extensions of nature-inspired metaphors and refinements of algorithm details. In fact, these endeavours were identified as important goals in the early stages of the ACO field [30]. While much useful work is done in this direction, it is important not to lose sight of the purpose of optimisation heuristics, which is to address the heuristic compromise and solve difficult optimisation problems to a satisfactory quality in reasonable time. Johnson [69] lists some questions that should be asked before beginning a heuristics research project:

• What are the questions you want your experiments to address?

• Is the algorithm implemented correctly and does it generate all the data you will need? We add that, all else being equal, the analytical tractability of the algorithm changes the conclusions that can be made from our experiments.

• What is an adequate set of test instances and runs?

• Given current computer specifications, which problem instances are too small to yield meaningful distinctions and which are too large for feasible running times?

• Who will care about the answers given the current state of the literature?

This final question of ‘care’ ties in with the analysis of Barr et al [7, p. 12] who state that a heuristic method makes a contribution if it is:

• Fast: produces higher quality solutions quicker than other approaches.

• Accurate: identifies higher quality solutions than other approaches.

• Robust: less sensitive to differences in problem characteristics, data quality and tuning parameters than other approaches.

• Simple: easy to implement.

• Generalisable: can be applied to a broad range of problems.

• Innovative: new and creative in its own right.

An examination of the literature reveals that not only do researchers often fail to ask questions regarding speed, accuracy and robustness, they often fail even to collect the necessary data that would permit answering these questions. We can speak of one heuristic dominating another heuristic in terms of one or more of these qualities when the dominating heuristic scores better on these qualities than the dominated heuristic. For example, we often find that a given heuristic may dominate another in terms of speed and accuracy but is in turn dominated in terms of its generalisability. In general, a highly dominated heuristic is not worth studying given the aforementioned overarching aim of heuristics research. It is

nonetheless worthwhile to study a dominated algorithm in some circumstances [69]. Firstly, the algorithm may be in widespread use or its dominating rival may be so complicated that it is unlikely to enter into widespread use. Secondly, the algorithm may embody a general approach applicable to many problem domains and studying how best to adapt it to a given domain may be of interest. ACO certainly does embody a general approach for combinatorial optimisation problems that can be represented by graphs (Section 2.4 on page 34).

The version of ACO studied in this thesis does not incorporate local search and therefore it could be argued that the thesis experiments with a dominated algorithm. This argument is easily countered in several ways. Firstly, much research in ACO is still conducted without local search. Secondly, and more importantly, the emphasis of this thesis is on parameter tuning rather than on the design of new ACO versions that improve the state-of-the-art in TSP solving. ACO is a useful subject of study because of its large number of tuning parameters. Although the thesis’ DOE approach to tuning will later be shown to improve ACO performance, no claims are made about the competitiveness of this performance in relation to state-of-the-art TSP solution methods. This does not preclude applying the thesis’ DOE methodologies to such state-of-the-art methods.

Assuming that it is worthwhile to study the heuristic in question, the experimenter must then determine what type of study will be conducted.

3.2

Types of experiment

Barr et al [7] distinguish between just two types of computational experiments with algorithms: (1) comparing the performance of different algorithms for the same class of problems or (2) characterising an algorithm’s performance in isolation. In fact, there are several types of experiment identified in the literature.

• Dependency study [79] (or Experimental Average-case study [69]). This aims to discover a functional relationship between factors and algorithm performance measures. It focuses on average behaviour, generating evidence about the behaviour of an algorithm for which direct probabilistic analysis is too difficult. For example, one may investigate whether and how the tuning parameters α and β (Section 2.4) increase the convergence rate of ACO.

• Robustness study [79]. A robustness study looks at the distributional properties observed over several random trials. Typical questions that a robustness study addresses are: how much deviation from average is there? What is the range in performance at a given design point? Are there unusual values in the measurements?

• Probing study [79] (or Experimental Analysis paper [69]). These studies ‘open up’ an algorithm and measure particular internal features of its operation, attempting to explain and understand the strengths, weaknesses and workings of an algorithm. For example, an ACO probing study might investigate whether different types of trail reinitialisation schedule improve the

performance of MMAS (Section 2.4.5 on page 39).

• Horse race study [69] (or Competitive Testing [65]). A horse race study attempts to demonstrate the superiority of one algorithm over another by running the algorithm on benchmark problem instances. This is typical of the majority of research in the ACO field. The horse race study has its place towards the latter stages of a heuristic’s life cycle (Section 3.3). However, its scientific merits have been strongly criticised [69, 65].

• Application study [69]. An application study uses a particular code in a particular application and describes the impact of the code in that context. For example, one might report the application of ACS to a scheduling problem in a manufacturing plant. We consider the application study as a specific context for the other types of study. Dependency studies, robustness studies, probing studies and horse race studies could all conceivably be conducted with a particular code in a particular application.

The choice of experiment type will depend very much on the life cycles of both the heuristic and the problem domain in question. This thesis is primarily a dependency study as it studies the relationship between tuning parameters and performance. It also has some characteristics of a probing study in that design factors, factors that represent parameterised design decisions, are also experimented with.

3.3

Life cycle of a heuristic and its problem domain

The heuristic life cycle consists of two main phases: (1) research and (2) development [101]. Research aims to produce new heuristics for existing problems or to apply existing heuristics in creative ways to new problems. Development aims to refine the most efficient heuristic for a specific problem. Software implementation details and the application domain become more important in this situation. Birattari [12] breaks development into two phases. There is what he also terms a development phase in which the algorithm is coded and tuned. This phase relies on past problem instances. The second phase is the production phase in which the algorithm is no longer developed and is deployed to the user. This phase is characterised by the need to cope with new problem instances.

The research phase of the heuristic lifecycle requires dependency, robustness and probing studies. The development phase requires horse race and application studies. Although ACO is still in the research phase of its life cycle, the majority of work reported on it is more appropriate for the development phase as it focuses on the typical horse race and application study issues.

The problem domain also has a life cycle [101, p. 264] and this impacts heuristic research. For some problems in the early stages of their life cycle, there exist few if any solution algorithms. In these cases, being able to consistently construct a feasible solution is a significant achievement. Later in the problem’s life cycle, a body of consistent algorithms that produces feasible solutions already exists. At

At this stage, research must demonstrate either an insight into the algorithm’s behaviour (probing study) or must demonstrate that the algorithm performs better than other existing methods (horse race and application studies). The TSP as used in this thesis is undoubtedly in the later stages of its life cycle. Over ten years ago in 1996, Colorni et al [30] identified 4 progressions in nature-inspired heuristics and used these to compare the state of the art of 6 types of heuristic. Their stages, in order of progress, are:

1. the presence of practical results,
2. the definition of a theoretical framework,
3. the availability of commercial packages, and
4. the study of computational complexity and related principles.

We disagree with this ordering, although it is often encountered in computer science research. The meaning of ‘practical results’ is vague. Assuming the authors mean results on real problems rather than small scale abstractions, then complexity studies and theoretical frameworks can certainly precede such ‘practical results’. Their stages make no distinction between problem and heuristic life cycle. Their view on the state-of-the-art with respect to these stages is summarised in Table 3.1.

                          Results          Theory           Packages     Complexity
Simulated Annealing       Well developed   Well developed   Developing   Developing
Tabu Search               Developing       Developing       Developing   Developing
Neural Nets               Developing       Developing       Developing   Emerging
Genetic Algorithms        Developing       Developing       Emerging     Emerging
Sampling and Clustering   Developing       Emerging         Emerging     -
Ant Systems               Emerging         Emerging         -            -

Table 3.1: The state of the art in nature-inspired heuristics from 10 years ago. Adapted from [30].

With hindsight, we see that this assessment was overly optimistic. A robust theory and complexity analysis for ant colony algorithms has yet to be established and research has only recently moved in this direction [42]. There are still no commercial packages in widespread use, although we have mentioned anecdotal evidence for the use of ant colony approaches within several companies (Chapter 1). Even some of the supposedly established results that we will review in the next chapter may have to be revised in light of the experiment design issues we review in this section and the results this thesis reports.



3.4

Research questions

Having decided on the type of experimental study that is required, based on the heuristic and problem life cycles, the experimenter can then proceed to refine the study into one or more specific research questions. There are two main issues in research with heuristics: how fast can solutions be obtained and how close do the solutions come to being optimal [101]? These questions cannot be answered in isolation. Rather we must consider the trade off between feasibility and solution quality [7, p. 14], a trade off that this thesis terms the heuristic compromise. This is of course a simplification and other authors have tried to enumerate the various research questions that one can investigate within this heuristic compromise of quality and speed. In the following, we have categorised these questions within the types of experimental study identified previously.

3.4.1

Dependency Study

• What are the effects of type and degree of parametric change on the performance of each solution methodology [3, p. 880]?
• What are the effects of problem set and size on the performance of each method [3, p. 880]?
• What are the interaction effects on the solution techniques when the above factors are changed singly or in combination [3, p. 880], [30]?
• How does running time scale with instance size and is there any dependence on instance structure [69, 30]?

3.4.2

Robustness

• How robust is the algorithm [7, p. 14]?
• Does a new class of instances cause significant changes in the behaviour of a previously studied algorithm [69]?
• For a given machine, how predictable are running times/operation counts for similar problem instances [69]?
• How is running time affected by machine architecture [69]?
• How far is the best solution from those more easily found [7, p. 14]?
• What are the answers to these questions for other performance measures [69]?

3.4.3

Probing study

• How do implementation details, heuristics and data structure choices affect running time [69]?


• What are the computational bottlenecks of the algorithm and how do they depend on instance size and instance structure [69]?
• What algorithm operation counts best explain running time [69, 30]?

3.4.4

Horse race

• Is there a best overall method for solving the problem? [3, p. 880]
• What is the quality of the best solution found? [7, p. 14]
• How long does it take to determine the best solution? [7, p. 14]
• How quickly does the algorithm find good solutions? [7, p. 14]
• How does an algorithm’s running time compare to those of its top competitors and are those comparisons affected by instance size and structure? [69]

3.5

Sound experimental design

Once one or more research questions have been identified, an experiment can be planned and executed. A general procedure for experimentation has three steps [28]. 1. Design. An experimental design is conceived. This is the general plan the experimenter uses to gather data. Crowder et al [34] quote a definition of good experimental design. The requirements for a good experiment are that the treatment comparisons should as far as possible be free from systematic error, that they should be made sufficiently precise, that the conclusions should have a wide range of validity, that the experimental arrangement should be as simple as possible, and finally that the uncertainty in the conclusions should be assessable. [4] 2. Data gathering and exploration. When all data have been gathered, an exploratory analysis is conducted. This involves looking for patterns and trends in the data using plots and descriptive statistics. Appropriate transformations of the data may have to be done. 3. Analysis. Formal statistical analyses are performed. There is some recognition in the literature that formal statistical analyses are an integral part of an experimental procedure. Attempts have been made to detail the specifics of the experimental procedure for heuristics. Developing a sound experimental design involves identifying the variables expected to be influential in determining code performance (both those which are controllable and those which are not), deciding the appropriate measures of performance and evaluating the variability of 59

CHAPTER 3. EMPIRICAL METHODS CONCERNS these measures, collecting an appropriate set of test problems, and finally, deciding exactly what questions are to be answered by the experiment [35]. The identification of a methodical and ordered experimental design procedure is to be welcomed. However, there is a problem with the ordering presented in that a decision on the research question is left to the very end of the design process. We posit that this should be the very first step in any procedure because the nature of the questions the experimenter wants to ask will determine all subsequent decisions in the design. A research question involving comparisons (Section 3.4.4 on the preceding page) requires a different design to a research question concerning relationships (Section 3.4.1 on page 58). A more comprehensive seven step outline of the design and analysis of an experiment comes from outside the heuristics field [84, p. 14]. 1. Recognition of and statement of the problem. Although it seems obvious, it is often difficult to develop a statement of the problem. If the process is new then a common initial objective is factor screening, determining which factors are unimportant and need not be investigated. A better understood process may require optimisation. A system that has been modified may require confirmation to determine whether it performs the same way as it did in the past. A discovery objective occurs when we wish to explore new variables such as an improved local search component. Robustness studies are needed when there are circumstances in which the responses may seriously degrade. Clearly, these objectives are reflected in the types of study categorised in Section 3.4 on page 58. 2. Selection of the response variable and the need for replicates. The response variable(s) chosen must provide useful information about the process under study. Measurement error, or errors in the measuring equipment, must be considered and may require the use of repeated measurements of the response. This is typically the case with measurements of CPU time. 3. Choice of factors, levels, and range. Factors are either potential factors or nuisance factors [84, p. 15]. Because there is often a large number of potential factors, they are classified as either: • Design factors: these are the factors that are actually selected for study. • Held-constant factors: these factors may have an effect on the response(s) but because they are not of interest, they are held constant at a specific level during the entire experiment. Nuisance factors are not of interest in the study but may have large effects on the response. They therefore must be accounted for. Nuisance factors can be classified [84, p. 16] as: • Controllable: A controllable nuisance factor is one whose levels can be set by the experimenter. In traditional design of experiments, a batch 60

CHAPTER 3. EMPIRICAL METHODS CONCERNS of raw material is a common controllable nuisance factor. In heuristics DOE, the random seed for the heuristic’s random number generator is a very common one. • Uncontrollable: These factors are uncontrollable but can be measured. Techniques such as analysis of covariance can then be used to compensate for the nuisance factor’s effect. In traditional DOE, operating conditions such as ambient temperature may be an uncontrollable but measurable nuisance factor. In heuristics DOE, CPU usage by background processes is a good example. Once the design factors have been selected, the experimenter chooses both the ranges over which the factors are varied and the specific factor levels at which experiments will be conducted. The region of interest (Section A.2 on page 213) is usually determined using practical experience and theoretical understanding, when available. When there is no knowledge of the heuristic, a pilot study can give quick and useful guidelines on appropriate factor ranges. The specific factor levels are often a function of the experimental design. 4. Choice of experimental design: choice of design depends on the experimental objectives. Some designs are more appropriate for modelling, some for optimisation, some for screening. Some designs can better fit into a sequential experimental procedure and so are more efficient in terms of experimental resources. Decisions are also made on the number of replicates and the use of blocking. An emphasis is clearly being placed on the importance of deciding on the research question(s) early on in the experiment procedure and not at the end. 5. Performing the experiment: In the traditional DOE environment of manufacturing it is often difficult to plan and organise an experiment. The process must be carefully monitored to ensure that everything is done according to plan. This is less of an issue in the majority of experiments with heuristics for combinatorial optimisation. However, experiments involving heuristics and humans, in some visual recognition task say, would have to pay very careful attention in this step. It goes without saying that all code should be checked for correctness and bugs. 6. Analysis of the data: Statistical methods are required so that results and conclusions are objective rather than judgmental. Statistical methods do not prove cause but rather provide guidelines to the reliability of a result [84, p. 19]. They should be used in combination with engineering knowledge. Of particular importance here is the danger of misinterpretation of hypothesis tests and p values. This is addressed in Appendix V on page 211.


CHAPTER 3. EMPIRICAL METHODS CONCERNS 7. Conclusions and recommendations: Graphical methods are most useful to ensure that results are practically significant as well as statistically significant. Conclusions should not be drawn without confirmation testing. That is, new independent experiment runs must be conducted to confirm the conclusions of the main experiment. Confirmation testing is rare in the ACO literature. Birattari [12] draws attention to the need for independent confirmation as is typical in machine learning. This thesis places a strong emphasis on independent confirmation of all its statistical analyses. Cohen [29] identifies several tips for performance analysis of which the most relevant to heuristics performance analysis are reproduced here. 1. Use bracketing standards. The tested program’s anticipated performance should exceed at least one standard and fall short of another. A typical upper standard in heuristics is an optimal solution. Often times however the optimal solution is not known. The use of optimal solutions is discussed in Section 3.10 on page 68. A typical lower standard is a randomly generated solution. Alternative lower standards for heuristics are simple reproducible heuristics such as a greedy search. Adjusted Differential Approximation (Section 3.10.2 on page 69), a quality response used in this thesis, incorporates a comparison to an expected random solution to a problem. 2. Many measures. It is not expensive to collect many performance measures. If collecting relatively few measures, a pilot study can once again help, determining which are highly correlated. Highly correlated measures are redundant. 3. Conflicting measures. Collect opponent performance measures. Conflicting measures are unavoidable with heuristics due to the heuristic compromise of lower solution time and higher solution quality. These design of experiment steps and tips provide metaheuristics research with some much needed procedural rigour. However there remain many pitfalls for the experimenter.

3.5.1

Common mistakes

Many of the issues that arise in designed experiments for other fields such as manufacture are thankfully not an issue for the heuristic engineer. Consequently, metaheuristics researchers have few excuses for poor methodology. Measurement errors due to gauge calibration, for example do not arise. No human data entry with the possibility of mistakes is generally required. Nonetheless, some of the lessons from traditional DOE [67] do translate to designed experiments for heuristics.


CHAPTER 3. EMPIRICAL METHODS CONCERNS • Too narrow factor ranges. Running too narrow a range from high to low for the factors can make it seem that key factors do not affect the process. The reality is that they do not affect the process in the narrow range examined. • Too wide factor ranges. Running too wide a range of factors may recommend results that are not usable in the real process. • Sample size and effect size. The sample size must be large enough to detect the effect size that the experimenter has deemed to be significant and yet not so large as to detect the tiniest of effects of no practical significance. The risk of all of these mistakes being made can be greatly mitigated by investing a small amount of resources in a pilot study. Jain [68, p. 14-25] lists further common mistakes made in performance evaluation. The most relevant of these are as follows. 1. Biased Goals. The definition of a performance evaluation project’s goals can implicitly bias its methodology and conclusions. A goal such as showing that “OUR system is better than THEIRS” can turn a problem into one of finding metrics such that OUR system turns out better rather than finding the fair metrics for comparison [68, p. 15]. This is the danger that others also highlight [65]. The problem of bias is discussed in further detail in Section 3.14 on page 74. 2. Unsystematic approach [68]. Analysts sometimes select parameter values, performance metrics and problem instances arbitrarily, making it difficult to draw any conclusions. This is, unfortunately, very common in the metaheuristics field. This thesis provides methodical guidelines and steps from the Design Of Experiments field that replace this unsystematic approach. 3. Incorrect Performance Metrics. Changing the choice of metrics can change the conclusions of a study. It is important to conduct a study with several competing metrics so that any effect of choice of metric can be understood and accounted for. This is not an issue if the experimenter has followed the recommendations of recording many performance measures (Section 3.5 on page 59). 4. Ignoring Significant Factors. Not all parameters have the same effect on performance and so it is important to identify the most important parameters. This is dealt with in the screening step mentioned in Section 3.5 on page 59. 5. Inappropriate Experimental Design. Proper selection of parameters and number of measurements can lead to more information from the same number of experiments. Jain [68] also highlights the problem with the OFAT approach and his preference for factorial and fractional factorial designs as introduced in this thesis. 6. No Sensitivity Analysis. Without a sensitivity analysis, one cannot be sure whether the conclusions would change if the analysis were done in a slightly 63

different setting. Furthermore, a sensitivity analysis can help confirm the relative importance of factors.

7. Omitting Assumptions and Limitations. This can lead a reader of the research to apply an analysis to another context where the assumptions are no longer valid.

Even within a well-defined experimental framework, the experimenter must beware of these many common pitfalls.

3.6

Heuristic instantiation and problem abstraction

A research question has been identified and an experiment design has been selected to answer this question. The experimenter must now think about the implementation of the algorithm that is the subject of the experiment and the problem domain to which the algorithm will be applied. We can consider both algorithms and problems at different levels of instantiation. Several authors [28, 65, 79] discuss how different levels of algorithm instantiation are appropriate for different types of analyses. A general description may be enough to determine whether an algorithm has a running time that is exponential in the length of its input. Hooker [65] likens this to an astronomer who tests a hypothesis about the behaviour of galaxies by creating a simulation. This simulation can improve our understanding even though the running time is much faster than the real phenomenon. Further algorithm instantiation, such as details of data structures, is needed to count critical operations as a function of the input. A complete instantiation in a particular language with a particular compiler and running on a particular machine is needed to generate CPU times for particular inputs. This thesis uses fully instantiated algorithms. As instantiation increases, so too does the importance of implementation issues. There are three main advantages to using efficient algorithm implementations [69]. Such implementations better support claims of practicality and competitiveness. There is less possibility for the distortion of results achieved by algorithms that are significantly slower than those used in practice. Finally, faster implementations allow one to experiment with more and larger problem instances. Clearly there is a balance between code that has been implemented efficiently for research purposes and code that has been fine-tuned as for a competitive industrial product. The problem domain can also be treated at several levels of abstraction [12]. • Lowest level. This is a mathematical model of a well defined practical problem. This level is most often used in industry and application studies where it is desired to solve a particular instance rather than make generalisations across a class of instances. • Intermediate level. At this level, abstractions such as the Travelling Salesperson Problem and Quadratic Assignment Problem capture the features and


constraints of a class of problems. This thesis is focussed at this level of problem instantiation.

• Highest level. The highest level of abstraction includes high level ideas such as deceptive problems [57] but does not represent a specific real world problem.

Once the appropriate levels of algorithm instantiation and problem abstraction have been agreed, the experimenter can begin pilot studies.

3.7

Pilot Studies

The discussions of common mistakes in experiment design already mentioned the usefulness of pilot studies (Section 3.5.1 on page 62). A pilot study is simply a small scale set of experiment runs that are used for the exploratory analysis of a process. Pilot studies help refine a full blown experiment design in several ways [101]. 1. Pilot studies can indicate that some factors initially thought important actually have little effect or have a single best level that can be fixed in all runs. They help identify design factors (Section 3.5 on page 59). They also can indicate where two or more factors can be collapsed into a single one. 2. Pilot studies help determine the number and values of levels to use in determining whether the factor has practical significance. 3. Pilot studies reveal how much variability we can expect in outcomes. This influences the number of replicates that will be necessary in a sample in order to obtain reliable results. 4. Pilot studies can help design the algorithm itself, by highlighting appropriate output data and stopping criteria. Pilot studies are therefore an important part of the early stages of an experiment design, reducing the risk of some common design mistakes (Section 3.5.1 on page 62). They can never be a replacement for designed experiments with sufficient sample sizes and correct statistical analyses. Conclusions should not be drawn from pilot studies.

3.8

Reproducibility

The reproducibility of research results is of course fundamentally important to all sciences. Computer science and research in metaheuristics should be no different. Reproducing research with computers in general and metaheuristics in particular presents some unique challenges. 1. Differences between machines [65, 35]: It is difficult to guarantee that algorithms being tested by different researchers are run on machines with the 65

CHAPTER 3. EMPIRICAL METHODS CONCERNS exact same specifications. Specifying the processor speed, memory etc. is not enough. What other CPU processes may have run throughout the experiment or for periods during the experiment? Even if a researcher goes to all the trouble of setting up a clean environment, how reproducible is that environment going to be for other researchers who do not have access to the same machines? How reproducible will that environment remain as technology advances with new hardware and operating system versions for example? We will see in Section 3.9 on the next page that many of these concerns can be overcome with benchmarking but there is as yet no discussion of appropriate benchmarks for ACO heuristics and their problem domains. This thesis devotes a whole chapter (Chapter 5) to benchmarking its code. 2. Differences in coding skill [65]: It is often unclear what coding technique is best for a given algorithm. Even if a given technique could be agreed on, it is difficult to guarantee that different programmers have applied the same technique fairly. This can be mitigated by using and sharing code. This the¨ sis uses code that was made available online by Stutzle [47]. However, the porting of this code from C to Java and the associated refactoring into an object-oriented implementation undoubtedly introduces further implementation differences. 3. Degree of tuning of algorithm parameters [65]: Given that it is possible to adjust parameters so that an algorithm performs well on a set of problems, we must ask how much adjustment should be done and whether this adjustment has been done in the same way as in the original research. This thesis introduces a methodical approach to tuning metaheuristics and therefore greatly improves this aspect of reproducibility of research. Strictly then, the reproducibility of an algorithm means that ‘if you ran the same code on the same instances on the same machine/compiler/operating system/system load combination you would get the same running time, operation counts, solution quality (or the same averages, in the case of a randomised algorithm)’ [69]. This is impossible in practice. A broader notion of reproducibility [69] is required that is acceptable in classical scientific studies. This notion recognises that while the classical scientist will use the same methods, he will typically use different apparatus, similar but distinct materials and possibly different measuring techniques. The experiment is deemed reproduced if it produces data consistent with the original experiment and reaches the same conclusions. Such a notion of reproducibility must be expected from metaheuristics research.

3.8.1

Reporting results for reproducibility

Even if methods are reproduced exactly for a heuristic experiment, the way results are calculated and reported can reduce reproducibility. Many of the common approaches to reporting the performance of an algorithm have drawbacks from the perspective of reproducibility [69].

• Report the solution value: This is not reproducible since we cannot perform similar experiments on similar instances and determine if we are getting similar results. Furthermore, it provides no insight into the quality of the algorithm.
• Report the percentage excess over best solution currently known: this is reproducible only if the current best solution is explicitly stated. Unfortunately, current bests are a moving target and so leave us in doubt about the algorithm’s true quality.
• Report the percentage excess over an estimate of a random problem’s expected optimal: this is reproducible if the estimate and its method of computing are explicitly stated. It is meaningful only if the estimate is consistently close to the expected optimal.
• Report the percentage excess over a well-defined lower bound: this is reproducible when the lower bound can be feasibly computed or reliably approximated.
• Report the percentage excess over some other heuristic: this is reproducible so long as the other heuristic is completely specified. This involves more than naming the heuristic or citing a reference. Johnson [69] recommends using a simple algorithm as the standard. This standard is preferably easily specified and deterministic.

3.9

Benchmarking

A machine can be fully described when results are originally reported. Over time, it becomes increasingly difficult to estimate the relative speeds between the earlier system and the current one because of changes in technology. This has two consequences. Firstly, the existing results cannot be reproduced. Secondly, new results cannot even be easily related to the existing results. The solution to this is benchmarking. Benchmarking is the process of running standard tests on standard problems on a set of machines so that the machines can be fairly compared in terms of performance. Johnson [69] advocates benchmarking code in the following way. The benchmark source code is distributed with the experiment code. The benchmark is compiled and run on the same machine and with the same compiler as used for the experiment implementations. The run times for a specified set of problem instances of varying sizes is reported. Future researchers can calibrate their own machines in the same way and attempt to normalise existing results to their newer results. Benchmarking is common in scientific computing and was introduced to the heuristics community at the DIMACS challenges1 . Benchmarking for ACO TSP algorithms has never been reported to our knowledge. The benchmarking process for this thesis is reported in Chapter 5.

1 http://public.research.att.com/~dsj/chtsp/download.html



3.10

Responses

The issue of responses was already touched on in our discussion of reproducibility (Section 3.8 on page 65). The choice of performance measure depends on the questions that motivate the research [79]. A study of how growth rate is affected by problem size (‘big O’ studies) would count the dominant operation identified in a theoretical analysis. A study to recommend strategies for data structures might measure the number of data structure updates. The literature offers some general guidelines for choosing good performance measures [79]. • Data should not be summarised too early. Algorithms should report outputs from every trial rather than the means over a number of trials. This is especially important when data have unusual distributional properties. • A good performance measure will exhibit small variation within a design point compared to the variation between distinct design points. Barr et al [7, p. 14] observe that research questions can broadly be categorised as questions of quality, computational effort and robustness. They advise that measures from each category should be used in a well-rounded study. The literature also examines more specific responses.
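As a concrete illustration of the guideline above that data should not be summarised too early, the sketch below records one row of raw responses per trial in a CSV file, leaving all averaging to the later analysis. The class name, column layout and choice of responses are illustrative assumptions and do not describe the logging format actually used in this thesis.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;

/** Writes one unsummarised row of responses per experiment trial (illustrative sketch). */
public final class TrialLogger implements AutoCloseable {

    private final PrintWriter out;

    public TrialLogger(Path csvFile) throws IOException {
        this.out = new PrintWriter(Files.newBufferedWriter(csvFile));
        // One header row; every subsequent row is a single trial, never a mean over trials.
        out.println("designPoint,replicate,seed,instanceId,relativeError,ada,cpuMillis");
    }

    public void logTrial(int designPoint, int replicate, long seed, String instanceId,
                         double relativeError, double ada, long cpuMillis) {
        out.printf("%d,%d,%d,%s,%.6f,%.6f,%d%n",
                designPoint, replicate, seed, instanceId, relativeError, ada, cpuMillis);
    }

    @Override
    public void close() {
        out.close();
    }
}
```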

3.10.1

CPU Time

Johnson [69] advocates always reporting CPU times, even if they are not the subject of a study. He presents some reasons why running times are not reported and his counter arguments. • The main subject of the study is one component of the running time, for example local optimisation. Readers will still want to know how important this component is relative to the overall running time. • The main subject of the study is a combinatorial count related to the algorithm’s operation. To establish the meaningfulness of this count, readers will need to study its correlation with running time. For example, an investigation of pruning schemes used by an algorithm could mislead a reader if it did not report that the better scheme took significantly longer to run. This issue also arises in research with ACO where there is often a temptation to extend the ant metaphor without examining the real cost of this added complexity. • The main concern of the study is to investigate the quality of solutions produced by an approximate algorithm. The main motivation of using an approximate algorithm is that it trades quality of solution for reduced running time. Readers will want to know what the trade off is. McGeoch [79] acknowledges that it is often difficult to find combinatorial measures that predict running times well when an algorithm is highly instantiated. 68

CHAPTER 3. EMPIRICAL METHODS CONCERNS Coffin and Saltzman [28] argue that CPU time is an appropriate comparison criterion for algorithms when the algorithms being compared have significantly different architectures and no comparable fundamental operations. Barr et al [7, p. 15-16] advise recording the following times. • Time to best-found solution: this is the time required by the heuristic to find the solution the author reports. This should include all pre-processing. • Total run time: this is the total algorithm execution time until the execution of its stopping rule. • Time per phase: the timing and quality of solution at each phase should be reported. One should exercise caution with the time to best solution response. It is only after the experiment has concluded that we know this was the best solution found. It can be deceptive to report this value in isolation if the reader is not told how long the algorithm actually ran for. This is related to the issue of best solution from a number of runs (Section 3.10.5 on the next page).

3.10.2

Relative Error and Adjusted Differential Approximation

According to Barr et al [7, p. 15], comparison should be made to the known optimal solution. We have already mentioned some criticisms of the specifics of how this comparison is made (Section 3.8 on page 65). Birattari [12] discusses measures of performance in terms of solution quality. He rightly dismisses absolute error as it is not invariant with a scaling of the cost function. He also dismisses the use of relative error since it is not invariant under some transformations of the problem, as first noted by Zemel [124]. An example is given of how an affine transformation2 of the distance between cities in the TSP leaves a problem that is essentially the same but has a different relative error of solutions. Birattari uses a variant of Zemel’s differential approximation measure [125] defined as:

c_{de}(c, i) = \frac{c - \bar{c}_i}{c^{rnd}_i - \bar{c}_i}    (3.1)

where c_{de}(c, i) is the differential error of a solution to instance i with cost c, \bar{c}_i is the cost of the optimal solution and c^{rnd}_i is the expected cost value of a random solution to instance i. An additional feature of this Adjusted Differential Approximation (ADA) is that its value for a random solution is 1, so the measure indicates how good a method is relative to a trivial method which in this case is a random solution. It can therefore be considered as incorporating a lower bracketing standard (Section 3.5 on page 59).

2 An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). Geometric contraction, expansion, dilation, reflection, rotation, and shear are all affine transformations.


ADA is not yet a widely used solution quality response. This thesis measures and analyses both relative error and ADA in keeping with Cohen’s [29] recommendation on multiple performance measures (Section 3.5 on page 59).
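Both quality responses follow directly from Equation 3.1 and the usual relative error definition. The sketch below is a minimal illustration; the class and method names are assumptions, and the numbers in the usage example are hypothetical.

```java
/**
 * Solution quality responses: relative error and the Adjusted Differential
 * Approximation (ADA) of Equation 3.1. Costs are assumed to be tour lengths
 * where smaller is better.
 */
public final class QualityResponses {

    private QualityResponses() { }

    /** Relative error: (c - c_opt) / c_opt. Not invariant under affine transformations. */
    public static double relativeError(double cost, double optimalCost) {
        return (cost - optimalCost) / optimalCost;
    }

    /** ADA (Equation 3.1): 0 for an optimal solution, 1 for the expected random solution. */
    public static double adjustedDifferentialApproximation(double cost, double optimalCost,
                                                           double expectedRandomCost) {
        return (cost - optimalCost) / (expectedRandomCost - optimalCost);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: optimal tour 100, expected random tour 250, heuristic tour 115.
        double relErr = relativeError(115, 100);                        // 0.15
        double ada = adjustedDifferentialApproximation(115, 100, 250);  // 0.10
        System.out.printf("relative error = %.2f, ADA = %.2f%n", relErr, ada);
    }
}
```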

3.10.3

Relative Terms

Relative terms are responses expressed as some kind of quotient, such as the number of iterations divided by the average number of iterations. Crowder et al [35] are not in favour of using relative terms when reporting performance. While relative terms do make comparison more difficult, Johnson [69] argues that relative performance indicators are often enlightening. It is important that enough information is provided so that the original components of the relative term can be recovered for reproducibility (Section 3.8 on page 65).

3.10.4

Frequency of Optimum

While it is of interest to determine the probability with which an algorithm will find an optimal solution for a given instance, it has limitations when used as a metric [69]. Firstly, it limits analysis to instances for which optima are actually known. Secondly, it ignores how near the algorithm gets when it doesn’t find the optimum. Thirdly, it cannot distinguish between algorithms on larger problem instances where the probability of finding the optimal solution is usually 0. Moreover, this response overemphasises finding an optimum when the heuristic compromise is about finding a good enough solution in reasonable time.

3.10.5

Best Solution from a number of runs

Birattari and Dorigo [13] criticise the use of the best solution from a number of runs as advocated by others [48]. They dismiss this measure as ‘not of any real interest’ since it is an over optimistic measure of a stochastic algorithm. The authors also counter the reasoning that in a real world scenario one would always use the best of several runs [48]. Firstly, it leads to an experiment measuring the performance of a random restart version of the algorithm. Secondly, this random restart version is so trivial (repeated run of the same algorithm with no improvement or input from the previous run) that it would not be a sound restart strategy anyway with the given resources. Johnson [69] levels two further criticisms at the reporting of the best solution found from multiple runs on a problem instance. Because the best run is a sample from the tail of a distribution it is necessarily less reproducible than the average. Also, if running time is reported, it is generally for that best run of the algorithm and not for the entire number of runs that yielded the reported best solution (Section 3.10.1 on page 68). This obscures the time actually required to find the reported solution. If the number of runs is not stated, there is no way to determine the real running time. Even when the number of runs is reported, multiplying the number of runs by the reported run time would overestimate the time needed. Actions such as setting up data structures need only be done once when multiple runs are performed. 70


3.10.6

Use of Averages

Reports of averages should be accompanied at least by a measure of distribution. Any scaling or normalising of averages should be carefully explained so that raw averages can be recovered if necessary.

3.11

Random number generators

Several problems can occur with the use of pseudo-random number generators and differences in numerical precision of machines [79]. These problems can be identified with replication. Firstly, a faulty implementation of a generator can introduce a bias in the stream of numbers produced and this can interact with the algorithm. Treatments should be replicated with more than one random number generator. Secondly, differences in numerical precision of machines can introduce biases into an algorithm’s behaviour. Treatments should be replicated with the same generator and seeds on different machines. It is difficult to implement a good generator correctly [92]. The source code in this thesis uses the minimal generator of Park and Miller [92] described in the literature [99, p. 279]. This is the generator used in the original source code by Stützle and Dorigo [47].
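For illustration, the following is a bare sketch of the Park and Miller ‘minimal standard’ recurrence, x_{n+1} = 16807 x_n mod (2^31 - 1), written with 64-bit arithmetic for simplicity. The thesis code follows the implementation described in [99], so this class should be read only as an outline of the idea, not as the implementation used in the experiments.

```java
/** Park and Miller's minimal standard linear congruential generator (illustrative sketch). */
public final class MinimalStandardGenerator {

    private static final long MULTIPLIER = 16807L;     // 7^5
    private static final long MODULUS = 2147483647L;   // 2^31 - 1, a prime

    private long state;

    public MinimalStandardGenerator(long seed) {
        long s = seed % MODULUS;
        if (s < 0) {
            s += MODULUS;
        }
        // A state of 0 is the one fixed point of the recurrence, so avoid it.
        this.state = (s == 0) ? 1 : s;
    }

    /** Returns the next value, uniformly distributed in (0, 1). */
    public double nextDouble() {
        // 64-bit arithmetic avoids the 32-bit overflow that Schrage's trick works around.
        state = (MULTIPLIER * state) % MODULUS;
        return (double) state / MODULUS;
    }

    public static void main(String[] args) {
        MinimalStandardGenerator rng = new MinimalStandardGenerator(12345L);
        for (int i = 0; i < 3; i++) {
            System.out.println(rng.nextDouble());
        }
    }
}
```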

3.12

Problem instances and libraries

There are two basic types of test instance: (1) instances from real-world applications and (2) randomly generated instances. The former are found in libraries such as TSPLIB [102] or come from private sources. The latter come from instance generators. A generator is software that, given some parameters, produces a random problem instance consistent with those parameters. Real-world data sets are desirable because the instances automatically represent many of the patterns and structures inherent in the real world [101]. However, real-world data sets are often proprietary and may not span all the problem characteristics of interest. Randomly-generated test instances offer many conveniences. • Control of problem characteristics [101]. If the problem generator is properly designed, then the problem characteristics are explicitly under the researcher’s control. This enables the researcher to cover regions of the design space that may not be well covered by available real-world data or libraries. This control can be a necessity with the experiment designs in the Design Of Experiments approach. When problems can be distinguished by some parameter, these parameters should be treated as independent variables in the analysis [28, p. 28]. • Replicates [101]. The problem generator can create an unlimited supply of problem instances. This is particularly valuable in high variance situations for which statistical methods demand many replicates.


• Known optimum [101]. Some problem generators can generate problem instances with a known optimal solution. Knowing the optimum is important both for bracketing standards (Section 3.5 on page 59) and for the calculation of some response measures (Section 3.10 on page 68). However, knowing an optimum may bias an experiment.
• Stress testing [69]. Problem generators can be used to determine the largest problem size that can be feasibly run on a given machine. This is important when deciding on ranges of problem sizes to experiment with in the pilot study phase (Section 3.7 on page 65). Barr et al [7, p. 18] also support this argument. They state that many factors do not show up on small instances but do appear on larger instances. Experiments with smaller instances therefore may not lead to accurate predictions for larger more realistic instances.

A poorly designed generator can lead to misleading unstructured random problem instances. Johnson [69] refers to Asymmetric TSP papers that report codes that easily find optimal solutions to generated unstructured problems with sizes of the order of thousands of cities yet struggle to solve structured instances from TSPLIB of sizes less than 53 cities. Online libraries of problem sets, be they real-world or randomly generated, should be used with caution [101].

• Quality [101]. It is sometimes unclear where a particular instance originated from and whether the instance actually models a real-world problem. Inclusion in a library generally does not make any guarantees about the quality of the instance.
• Not Representative [101, 65]. Some instances appearing in publications may be contrived to illustrate a particular feature of an algorithm or to illustrate an algorithm’s pathological behaviour. They are therefore not suitable as representative instances and may even be misleading.
• Biased [101]. Problem instances are often published precisely because an algorithm performs well specifically on those instances. The broader issue of bias is covered in Section 3.14 on page 74.
• Misdirected research focus [101, 65]. The availability of benchmark test instances can draw researchers into making algorithms perform well on those instances. As Hooker [65] puts it, ‘the tail wags the dog’ as problems begin to design algorithms. This changes the context of a study from one of research to one of development and encourages the premature publication of horse race studies (Section 3.2 on page 55) before an algorithm is completely understood.

In summary, it would seem that problem generators are a necessity for designed experiments. It is preferable to have access to a generator rather than relying on benchmark libraries. Generators that are well-established and tested are preferable to developing one’s own. This thesis uses a generator from a large community research competition [58].
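To make the idea of an instance generator concrete, the sketch below draws city coordinates uniformly at random from a square and builds a symmetric Euclidean distance matrix. It is an illustrative toy only, not the competition generator [58] used in this thesis, and the seeding scheme shown is an assumption.

```java
import java.util.Random;

/** A minimal random uniform Euclidean TSP instance: n cities in a square (illustrative sketch). */
public final class RandomEuclideanTsp {

    public final double[] x;
    public final double[] y;
    public final double[][] distance;

    public RandomEuclideanTsp(int cities, double squareSide, long seed) {
        Random rng = new Random(seed);    // seeded so the instance is reproducible
        x = new double[cities];
        y = new double[cities];
        for (int i = 0; i < cities; i++) {
            x[i] = rng.nextDouble() * squareSide;
            y[i] = rng.nextDouble() * squareSide;
        }
        distance = new double[cities][cities];
        for (int i = 0; i < cities; i++) {
            for (int j = 0; j < cities; j++) {
                distance[i][j] = Math.hypot(x[i] - x[j], y[i] - y[j]);
            }
        }
    }

    public static void main(String[] args) {
        RandomEuclideanTsp instance = new RandomEuclideanTsp(100, 400.0, 42L);
        System.out.println("d(0,1) = " + instance.distance[0][1]);
    }
}
```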


3.13

Stopping criteria

Heuristics can run for impractically long time periods. A stopping criterion is some condition that causes the heuristic to halt execution. One typically sees several types of stopping criteria in the heuristics literature. We term these (1) CPU time stopping criterion, (2) computational count stopping criterion and (3) quality stopping criterion respectively. In the first two types, a heuristic is halted after a given amount of time or after a given number of computational counts (such as the number of iterations). In the third type, the heuristic is halted once a given solution quality (typically the optimum solution) is achieved. Running experiments with a time stopping criterion has been criticised on the grounds of reproducibility [69]. A run on a different machine or with a different implementation will have a distinctly different quality because of differences between the experimental material (Section 3.8 on page 65). Johnson goes so far as to state that ‘the definition of an algorithm in this way is not acceptable for a scientific paper’ [69]. Using a computational count as a stopping criterion is preferred by some authors [69] and is generally the most common type of stopping criterion in the literature. Furthermore, one can report running time alongside computational count. This permits other researchers to reproduce the work (using the computational count) and observe differences in run times caused by their machines and implementations. Johnson [69] objects to the use of attaining an optimal value as a stopping criterion on the grounds that in practice one does not typically run an approximate algorithm on an instance for which an optimal solution is known. In addition, we argue that this overemphasises the search for optima when this is not the purpose of a heuristic. There is some evidence that the choice of stopping criterion could affect the appropriate choice of tuning parameter settings for a heuristic. Socha [115] investigated the influence of the variation of a running time stopping criterion on the best choice of parameters for the Max-Min Ant System (Section 2.4.5 on page 39) heuristic applied to the University Course Timetabling Problem. Three levels of a local search component, ten levels of pheromone evaporation rate, eleven levels of pheromone lower bound and four levels of fixed run-time were investigated. The local search levels were not varied with the other two parameters so we do not know whether these interact. Furthermore, the parameter levels used with the separate local search investigation were not reported and analyses were performed on only two instances. The remaining three algorithm parameters were compared in a full factorial type design on a single instance with 10 replicates. The motivation for this number of replicates was not mentioned. A fractional factorial design would have been sufficient to determine an effect due to stopping criterion and this would have offered huge savings in the number of experiment runs. Despite this, the work does seem to indicate that different parameter settings are more appropriate for different run-times of MMAS for one instance of the UCTP. This is intuitive when one realises that the parameters investigated, pheromone evaporation and pheromone lower bound, have an influence on the explore/exploit nature of the MMAS algorithm.


Obviously, exploration is a more sensible strategy when a greater amount of run-time is available. Pellegrini et al [96] attempt an analysis of the effect of run-time on solution parameters but this has many flaws that we discuss in Section 4.3.1 on page 82. The result of Socha [115] has the following implication for parameter tuning experiment designs: results are restricted to the specific stopping criterion used. Either (1) the stopping criterion (and a range of its settings) should be included as a factor in the experiments or (2) the analyses should be conducted at several levels of the stopping criterion settings. For example, if a fixed iteration stopping criterion were used then the number of fixed iterations could be included as a factor or analyses should be conducted after several different fixed iterations. The former approach permits the most general conclusions at the cost of greatly increased experimental resources.
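The three stopping criterion types discussed above can be placed behind a single abstraction so that the criterion, or its setting, can itself be varied as an experimental factor. The sketch below shows one possible way to do this; the interface and method signatures are assumptions and do not reflect the thesis implementation.

```java
/** One abstraction over the three stopping criterion types (illustrative sketch). */
public interface StoppingCriterion {

    boolean shouldStop(int iteration, long elapsedMillis, double bestQuality);

    /** CPU/wall-clock time stopping criterion. */
    static StoppingCriterion afterMillis(long limitMillis) {
        return (iteration, elapsedMillis, bestQuality) -> elapsedMillis >= limitMillis;
    }

    /** Computational count stopping criterion (here, a fixed number of iterations). */
    static StoppingCriterion afterIterations(int limit) {
        return (iteration, elapsedMillis, bestQuality) -> iteration >= limit;
    }

    /** Quality stopping criterion: stop once a target quality is reached (smaller is better). */
    static StoppingCriterion onQuality(double targetQuality) {
        return (iteration, elapsedMillis, bestQuality) -> bestQuality <= targetQuality;
    }
}
```

Treating the criterion as a factor in this way makes it straightforward to include, for example, the iteration limit itself as one of the design factors in a tuning experiment.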

3.14

Interpretive bias

The issue of bias is well recognised in the medical research field [71]. Its dangers are equally relevant to the heuristics field. Bias is probably unavoidable given the nature of science. Good science inevitably embodies a tension between the empiricism of concrete data and the rationalism of deeply held convictions. Unbiased interpretation of data is as important as performing rigorous experiments. This evaluative process is never totally objective or completely independent of scientists’ convictions or theoretical apparatus. [71, p. 1453] There are several types of bias that can affect the interpretation of results and we relate these to the heuristics field here. • Confirmation bias. Researchers evaluate research that supports their prior beliefs differently from research challenging their convictions. Higher standards are expected of the research that challenges convictions. This bias is often unintentional. • Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade evidence. • Auxiliary hypothesis bias. This is a form of rescue bias in which the original hypothesis is modified in order to imply that results would have been different had the experiment been different. • Mechanism bias. Evidence is more easily accepted when it is supported by accepted scientific mechanisms.


CHAPTER 3. EMPIRICAL METHODS CONCERNS • ‘Time will tell’ bias. Scientific scepticism necessitates a judicious attitude of requiring more evidence before accepting a result. This bias affects the amount of such evidence that is deemed necessary. “A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.” Max Planck [97] • Orientation bias. This reflects a phenomenon of experimental and recording error being in the direction that supports the hypothesis. This arises in the pharmaceuticals industry, for example, where trials consistently favour the new pharmaceutical treatments. Clearly these biases can affect interpretation of results regardless of the attention paid to the aforementioned issues.

3.15

Chapter summary

This chapter has covered the following topics. • Concerns regarding many aspects of experiment design for heuristics have appeared throughout the heuristics literature over the past three decades. These concerns have not been addressed in the Ant Colony Optimisation literature. • It is important to ask whether the heuristic is even worth researching. The temptation to invent creative extensions to algorithms or explore new natureinspired metaphors can distract us from the real task of producing optimisation heuristics that produce feasible solutions in acceptable time. • There are several types of experiment one can conduct. The appropriate type will depend on the life cycle of the heuristic and the problem domain. Each type of experiment can answer several types of research question. • There are clearly defined steps to good design and analysis of experiments. Nonetheless, there are many potential pitfalls for the analyst and many design and analysis decisions that must be made and justified. • Different levels of heuristic instantiation and problem domain abstraction are appropriate for different types of study and research question. This thesis studies a highly instantiated metaheuristic applied to a problem type of medium abstraction. • Machines should be benchmarked so that results can be correctly interpreted by other researchers and can be scaled to different types of experiment material (machine architecture, compiler, programming language etc).


CHAPTER 3. EMPIRICAL METHODS CONCERNS • A broad notion of reproducibility for empirical research with metaheuristics states that an experiment is reproducible if others can produce consistent data that leads to the same conclusions. • There are many types of performance responses one can measure and report. • One should exercise caution in the choice of random number generator. It is difficult to implement a generator well and poor implementations can bias research results. • Problem instances can be so-called real-world instances or randomly generated instances. Both have their advantages and disadvantages. Randomly generated instances are probably more appropriate when one needs explicit control of problem instance characteristics. It is difficult to implement a generator well and so is preferable to use an established and well tested generator. The are several potential dangers of online libraries of instances, be they real world instances or randomly generated ones. • Because heuristics can run for a significant time, continuously improving their solution, one needs to choose a stopping criterion to halt an experiment. Stopping criteria are generally based on a computation count, a clock time or when a predefined solution quality is attained. • There are several types of interpretive bias that can affect the researcher’s assessment of results, even from the most rigorously designed experiment. The next chapter will review experiments on tuning metaheuristics in light of the concerns summarised in this chapter.


4 Experimental work

The previous chapter summarised the most important experiment design and analysis concerns that have been raised in the heuristics literature. The discussion of these concerns was related to Ant Colony Optimisation (ACO) research at a general level. This chapter examines research that is relevant to parameter tuning of ACO in the context of the concerns that have been identified. It begins with a review of the most significant attempts to analyse problem difficulty for algorithms. The chapter then continues with approaches to tuning heuristics and metaheuristics other than ACO. This is necessary because the fields of operations research and heuristics in general have been better than the ACO field at recognising and addressing the parameter tuning problem. Lessons can be learned from these fields. Finally, this chapter addresses parameter tuning approaches for the ACO metaheuristic, the focus of this thesis. Of course, parameter tuning should be a major part of any ACO research effort. It is integral to the effective application of the heuristic. A comprehensive review of parameter tuning would therefore necessitate reviewing almost all ACO literature. A glance through the ACO literature should convince the reader that methodical and reproducible parameter tuning of ACO is rarely addressed, despite its identification as an open research topic [47]. This chapter will therefore limit its scope to papers that have explicitly proposed and investigated methods for the parameter tuning of ACO.

4.1

Problem difficulty

Some problem instances are more difficult for an algorithm (exact or heuristic) to solve than other instances. It is critically important to understand which instances can be expected to be more difficult for a given algorithm. Essentially this involves investigating which levels of one or more problem characteristics (and combinations of levels) have a significant effect on problem difficulty. Fischer et al [49] investigated the influence of Euclidean TSP structure on the performance of two algorithms, one exact and one heuristic.


The exact algorithm was branch-and-cut [5] and the heuristic was the iterated Lin-Kernighan algorithm [63]. In particular, the TSP structural characteristic investigated was the distribution of cities in Euclidean space. The authors varied this distribution by taking a structured problem instance and applying a perturbation operator to the city distribution until the instance resembled a randomly distributed problem. There were two perturbation operators. A reduction operator removed between 1% to 75% of the cities in the original instance. A shake operator offset cities from their original location. Using 16 original instances, 100 perturbed instances were created for each of 8 levels of the perturbation factor. Performance on perturbed instances was compared to 100 instances created by uniformly randomly distributing cities in a square. Predictably, increased perturbation led to increased solution times that were closer to the times for a completely random instance of the same size. It was therefore concluded that random Euclidean TSP instances are relatively hard to solve compared to structured instances. Unfortunately, it is unavoidable that the reduction operator confounds changed problem structure with a reduction in problem size, a known factor in problem difficulty. Nonetheless, the research of Fischer et al leads us to suspect that structured instances possess some feature that algorithms can exploit in their solution whereas completely random instances are lacking that feature and consequently may be unrealistically difficult. These results tie in with arguments over the merits of problem instance generators discussed previously (Section 3.12 on page 71). Van Hemert [119] evolved TSP instances of a fixed size that were difficult to solve for two heuristics: Chained Lin-Kernighan and Lin Kernighan with Cluster Compensation. TSP instances of size 100 were created by uniformly randomly selecting 100 coordinates from a 400x400 grid. An initial population of such instances was evolved for each of the algorithms where higher fitness was assigned to instances that required a greater effort to solve. This effort was a combinatorial count of the algorithms’ most time-consuming procedure. This is an interesting approach that side-steps the difficult issues related to CPU time measurement discussed in Section 3.10.1 on page 68 while still acknowledging the relevance and importance of CPU time. Van Hemert then analysed the evolved instances using several interesting approaches. His aim was to determine whether the evolutionary procedure made the instances more difficult to solve and whether that difficulty was specific to the algorithm. The first approach considered was box plots of the mean, median and 5 and 95 percentile range. Secondly, the author looked at the frequency with which each algorithm found an optimal solution in each of the problem sets and the average discrepancy between the algorithm solution and the known optimum. The problems with the first of these responses have already been discussed (Section 3.10.4 on page 70). The average number of clusters in each set was measured with a deterministic clustering algorithm. The average distribution of tour segment lengths was measured for both problem sets as well as the average distance between pairs of nodes. Finally, to verify whether difficult properties were common to both algorithms, each algorithm was run on the other algorithm’s evolved problem set.
A set evolved for one algorithm was less difficult for the other algorithm. However, the alternative evolved set still required more effort than the random set, indicating that some difficult instance properties were shared by both evolved problem sets. Van Hemert's conclusions may have been limited by the lack of a rigorous experiment design. The approach can be summarised as evolving instances and then looking for characteristics that might explain any observed differences in problem hardness. This offers no control over problem characteristics. Ideally, one should hypothesise a characteristic that affects hardness and then test that hypothesis while controlling for all other characteristics. This was exactly the approach taken in the next piece of research and in this thesis. Cheeseman et al [26] explored the idea of defining an 'order parameter' for NP-complete problem instances such that critical values of this parameter describe instances that are particularly hard to solve. The basic idea is that such a critical value divides the space of problems into two regions. One region is underconstrained and so has a high density of solutions. This makes it relatively easy to find a solution. The other region is overconstrained and so has very few solutions. However, these solutions typically have very distinct local maxima/minima and so again are relatively easy to find. The difficult problems occur at the boundary between these two regions, where there are many minima/maxima corresponding to almost complete solutions. In essence, the algorithm is forced to investigate many 'false leads'. In some ways, this concept of critical values of an order parameter resembles that of phase transitions used in statistical mechanics and physics. Cheeseman et al [26] investigated the presence of these transitions when various algorithms were applied to four problems: finding Hamiltonian circuits, graph colouring, k-satisfiability and the Travelling Salesperson Problem. In the TSP investigations, three problem sizes of 16, 32 and 48 were considered. For each problem size, many instances were generated such that each instance had the same mean cost but a varying standard deviation of cost. Mean and standard deviation of edge lengths were controlled by drawing edge lengths from a Log-Normal distribution. The computational effort for an exact algorithm to solve each of these instances was measured and plotted against standard deviation of TSP edge lengths. The plots showed an increase in the magnitude and sharpness of the phase transition with increasing problem size. Although conducted only with an exact algorithm on relatively small instance sizes, this research leads us to expect that edge length standard deviation may have a significant influence on problem difficulty for other heuristics. This is an important research question that this thesis will answer for the ACO algorithms.

4.2 Parameter tuning of other metaheuristics

Adenso-Díaz and Laguna [2] have used a factorial design combined with a local search procedure to systematically find the best parameter values for a heuristic. Their method, CALIBRA, was demonstrated on 6 different combinatorial optimisation applications, mostly related to machine scheduling. CALIBRA begins by finding a set of 'optimal' parameter values using the Taguchi methodology [95] in a 2^k factorial design. The Taguchi methodology is based on a linear assumption that can lead to large differences between the predicted 'optimal' values and the true optimum. CALIBRA therefore uses this analysis only as a guideline to focus the search through the parameter space. An iterative local search is then used to improve on the parameter values within a refined region of the parameter space. The parameter values found by CALIBRA led to better algorithm performance than the values used by some other authors. In all other situations, the CALIBRA parameter values did not perform significantly better or worse than the parameter values used by the original authors. One limitation of CALIBRA is that it can only tune five algorithm parameters. A more serious limitation is that CALIBRA does not examine interactions between parameters and so cannot be used in situations where such interactions might be significant. Later chapters will demonstrate that interactions are present in ACO tuning parameters and so the more sophisticated experiment designs of this thesis are required.

Coy et al [33] present a systematic procedure for finding good heuristic parameter settings on a range of Vehicle Routing Problems (VRP). This methodology was applied to two local search heuristics with 6 tuning parameters and a total of 34 VRPs. The new parameter settings produced results that were, on average, 1% to 4% better than the best known solutions. Broadly, Coy et al's procedure works by finding high quality parameter settings for a small number of problems in the problem set and then combining these settings to achieve a good set of parameters for the complete problem set. The procedure is as follows:

1. A subset of the problem set is chosen for analysis. This subset should be representative of the key problem characteristics in the entire set. In their case, key VRP characteristics included demand distribution and customer distribution.

2. A starting value and range for each parameter is determined. This requires either a judgement based on previous experience with the heuristic or a pilot study.

3. A factorial or fractional factorial design is used to determine the parameter settings. Linear regression gives a linear approximation of the response surface. The path of steepest descent along this surface is calculated, beginning at the starting point identified from the parameter study.

4. Step 3 is repeated for each problem in the analysis set.

5. The parameter vectors determined in step 4 are averaged to obtain the final parameter settings for the heuristic over all problem instances.

Coy et al's approach does not use higher order regression models (such as quadratic). The authors chose the simpler linear approach because different response surfaces are averaged over all test instances. The authors believed their approximate approach would not be significantly enhanced by a more complicated response surface; however, this comparison was not performed.
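To make step 3 of this procedure concrete, the sketch below derives a path of steepest descent from the coefficients of a fitted first-order model in coded units. It is a minimal illustration only: the class name, the hard-coded slope values and the step size are assumptions made for the example and are not taken from Coy et al.

```java
/**
 * Minimal sketch: points along the path of steepest descent for a fitted
 * first-order model y = b0 + b1*x1 + ... + bk*xk in coded (-1..+1) units.
 * Slope values and step size below are illustrative assumptions only.
 */
public final class SteepestDescentSketch {

    /** Returns numSteps points along the steepest-descent direction,
     *  starting from the design centre (all coded factors at 0). */
    static double[][] path(double[] slopes, double stepSize, int numSteps) {
        // The direction of steepest descent is opposite to the gradient.
        double norm = 0.0;
        for (double b : slopes) norm += b * b;
        norm = Math.sqrt(norm);

        double[][] points = new double[numSteps][slopes.length];
        for (int step = 1; step <= numSteps; step++) {
            for (int j = 0; j < slopes.length; j++) {
                points[step - 1][j] = -slopes[j] / norm * stepSize * step;
            }
        }
        return points;
    }

    public static void main(String[] args) {
        double[] slopes = {1.8, -0.6, 0.3};   // hypothetical regression coefficients
        for (double[] p : path(slopes, 0.5, 5)) {
            System.out.printf("x1=%+.3f  x2=%+.3f  x3=%+.3f%n", p[0], p[1], p[2]);
        }
    }
}
```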


In Coy et al's work, the VRP problems were chosen based on three characteristics: distribution of customers, distribution of demand and problem size. These choices were based on a graphical analysis of the characteristics of each instance. In their conclusions, Coy et al acknowledge that the method will perform poorly in two scenarios:

• if the representative test problems are not chosen correctly, or

• if the problem class is so broad that it requires very different parameter settings.

While they recommend creating problem subclasses based on the 'significant' problem characteristics, they give no detail of how such significance could be determined. The first of these shortcomings could conceivably be mitigated in an application scenario by building up a repository of instances to which the tuning procedure has been applied. The second shortcoming is more troublesome. It is likely that many problems have instances that require quite different parameter settings. Coy et al's method does not build a model of the relationship between instances, parameter settings and performance. It therefore cannot recommend parameter settings for varying combinations of problem characteristics. The designs introduced in this thesis can make such recommendations.

Parsons and Johnson [94] used a replicated 2^4 full factorial design to screen the best parameter settings for four genetic algorithm parameters applied to a data search problem (specifically DNA sequencing). Their stopping criterion was a fixed number of trials. They used a form of sequential experimentation, running first one half fraction, then the other half fraction and finally the replicates. Only two parameters were deemed important and so a steepest ascent approach with these parameters was used to determine the centre point of a central composite design for building a response surface. This response surface then allowed the authors to improve the genetic algorithm performance on the tested data set. Experiments on larger data sets with these parameter settings showed improvements in both solution quality and computational effort. Unfortunately, no analysis of the data set characteristics was made, so we cannot determine why parameters tuned on one data set worked so well for larger data sets. The authors could have halved the number of experiment runs by using a resolution IV 2^(4-1) fractional factorial (Section A.3.2 on page 215) instead of the 2^4 full factorial.

Analysis of Variance (ANOVA) and response surface models have been used for parameter tuning on several occasions in the heuristics literature. For example, Van Breedam [22] attempts to find significant parameters for a genetic algorithm and a simulated annealing algorithm applied to the vehicle routing problem using an analysis of variance technique. Seven GA and eight SA parameters are examined. Park and Kim [91] used a non-linear response surface method to find parameter settings for a simulated annealing algorithm. None of these methods have been applied to ACO.
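As a concrete illustration of the run saving from a half fraction mentioned above, the sketch below enumerates a 2^4 full factorial in coded -1/+1 levels and the 2^(4-1) half fraction defined by the generator D = ABC (defining relation I = ABCD, resolution IV). The factor names are generic placeholders rather than any particular algorithm's parameters.

```java
/** Sketch: a 2^4 full factorial and its resolution IV 2^(4-1) half fraction
 *  defined by the generator D = ABC (coded levels -1 and +1). */
public final class FactorialSketch {

    public static void main(String[] args) {
        int fullRuns = 0;
        int fractionRuns = 0;
        for (int a = -1; a <= 1; a += 2)
            for (int b = -1; b <= 1; b += 2)
                for (int c = -1; c <= 1; c += 2)
                    for (int d = -1; d <= 1; d += 2) {
                        fullRuns++;
                        // The half fraction keeps only runs satisfying D = A*B*C,
                        // i.e. the defining relation I = ABCD.
                        if (d == a * b * c) {
                            fractionRuns++;
                            System.out.printf("A=%+d B=%+d C=%+d D=%+d%n", a, b, c, d);
                        }
                    }
        System.out.println("Full factorial runs: " + fullRuns);      // 16
        System.out.println("Half fraction runs:  " + fractionRuns);  // 8
    }
}
```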



4.3 Parameter tuning of ACO

Existing approaches to understanding the relationship between tuning parameter settings, problem characteristics and performance of ACO fall into three categories: (1) Analytical Approaches, (2) Automated Approaches and (3) Empirical Approaches. Recent analytical approaches attempt to understand parameters and recommend parameter settings based on mathematical proof. Automated approaches attempt to use some other algorithm or heuristic to tune the ACO heuristic. The automated approach may use the heuristic itself in a kind of introspective manner, in which case we term the approach Automated Self-tuning. Alternatively, Heuristic Tuning uses some other heuristic to search for good parameter settings for the tuned heuristic. The final category is Empirical approaches. These gather data about the heuristic and attempt to analyse those data to draw conclusions about the heuristic. Empirical approaches in ACO are either Trial-and-Error or, occasionally, the One-Factor-At-a-Time (OFAT) approach. This thesis uses the Design Of Experiments approach, bringing the experiment designs and analysis procedures from DOE to bear on the parameter tuning problem.

4.3.1 Analytical Approaches

Dorigo and Blum [42] provide a survey of analytical results relating to ant colony optimisation. They acknowledge that the convergence proofs they summarise for ACO are of little use to a practitioner since the proofs often assume the availability of either infinite time or infinite space [42, p. 246]. Their discussion is preceded by a simplification of the transition probabilities of the Construct Solutions phase (Section 2.4.3 on page 37) so that heuristic information is omitted. Their motivation for this simplification is to 'ease the derivations' [42, p. 256]; however, this simplification ignores the reality that heuristic information is well-established as a highly significant contributor to algorithm performance. The authors also admit that none of the convergence proofs discussed make reference to the 'astronomically large' time to find the optimal solution [42, p. 260]. Neumann and Witt attempt an analysis of run time and evaporation rate on the OneMax problem [86] and the LeadingOnes and BinVal problems [41]. An abstract ACO algorithm is analysed in both cases. However, no comparison with empirical data is given and so it is impossible to tell whether any of the proofs' assumptions have affected their analyses' application to instantiated ACO algorithms. Despite this, the authors claim that 'It is shown that the so-called evaporation factor ρ, the probably most important parameter in ACO algorithms, has a crucial impact on the runtime' [41, p. 34]. These claims are far too general given that their analyses are at such an early stage and apply only to a single abstract algorithm with no test of predictions on real instantiated algorithms. These claims will be tested later in this thesis.

Pellegrini et al [96] attempt an analytical prediction of suitable parameter settings for MMAS for a given run time. The most important parameters are deemed to be the number of ants, the pheromone evaporation rate and the exponents of the pheromone and heuristic terms. This is a subjective judgement as no screening or other analysis is done to verify it. Dorigo and Stützle's MMAS code [47] was used and so we should expect results to be consistent with results from this thesis. Experiments were performed without local search and assuming pheromone update is not time consuming. This assumption is likely incorrect if one examines the code used. This shows that when no local search is used, pheromone evaporation takes place on all edges in the problem rather than being limited to the edges in the candidate list. This issue was described in Section 2.4.8 on page 44 and a computation limit parameter was introduced. In general, the reasoning about the parameters is reported quite vaguely. For example, 'It is easy to see that the number of iterations is the leading force, at least until a certain threshold. Nonetheless, the number of nodes has a remarkable impact as well.' We have no measure of 'easy', 'leading force', 'a certain threshold' or 'remarkable'. The values recommended by the analysis are then compared to the values recommended by an automated tuning algorithm, F-Race [14]. Available solution time was set to six arbitrarily chosen levels varying between 5 and 120 seconds. The F-Race recommendations were observed to match the predicted trends of the analysis for the exponent parameters and the number of ants, but not the pheromone decay parameter, for just two distinct levels of instance size, 300 and 600. The failure on prediction of the pheromone value may be due to the incorrect assumption highlighted earlier. Regardless, the results do not confirm the authors' analysis but rather confirm that the authors' analysis would appear to agree with some of the F-Race results. Once again, parameters were treated in isolation and the unstated assumption is that there is no possibility of many different combinations of parameter settings achieving the same performance under the constraint of a fixed solution time. Times were not benchmarked properly so we cannot compare other tuning procedures to Pellegrini et al's analysis.

Hooker [64] highlights the failings of theoretical analysis of algorithm performance on several fronts.

• Not practical. The results do not usually tell us how an algorithm will perform on practical problems.

• Not representative. Complexity results are asymptotic or apply to a worst case that seldom occurs. Worst case analyses, by definition, do not give an indication of how a heuristic will perform in more representative scenarios [101]. Average case results presuppose a probability distribution over randomly generated problems that is typically unreflective of reality.

• Too simplified. Results are usually obtained for the simplest kinds of algorithms and so do not apply to the complex algorithms used in practice.

All the analyses mentioned demonstrate some or all of Hooker's failings. In general, the results are still too preliminary to recommend parameter settings for an ACO algorithm when presented with a given problem instance and the complete heuristic compromise. Analytical approaches are not ready to address the parameter tuning problem.


4.3.2 Automated Approaches

Self-tuning

Randall [100] has examined how ACS (Section 2.4.6 on page 42) can use its own mechanisms to tune its parameters at the same time as it is solving TSP and QAP problems. Four tuning parameters were examined: β, the local and global pheromone update factors ρlocal and ρglobal, and the exploration/exploitation threshold q0. The number of ants m was arbitrarily fixed at 10. Each ant maintained its own parameter values. A separate pheromone matrix was used to learn new parameter values. The self-tuning test ACS was compared to a control ACS with fixed parameter values taken from another author's implementation, published 7 years previously and applied to different problem instances [46]. Twelve instances from the TSPLIB (sizes 48 to 442) and QAPLIB (sizes 12 to 64) were used for comparison of the test and control. Experiments on each instance were repeated 10 times with different random seeds and were halted after 3000 iterations. The choice of number of replicates was not justified. Only a single fixed iteration stopping criterion was examined. Randall claims that this number of iterations 'should give the ACS solver sufficient means to adequately explore the search space of these problems.' [46, p. 378]. No evidence was given to support this claim. Furthermore, because problem size varies, a fixed iteration budget will give a less adequate exploration of the search space of larger instances. Two responses were measured: the percentage relative error of the solution and the CPU time until the best solution was found. Note that there is some disagreement over the use of this measure as it does not account for the 3000 iterations that were actually run (Section 3.10.1 on page 68). Responses were listed in a table with a row for each problem instance. No attempt was made to qualify the significance of differences between control and test responses with a statistical test. Although it is difficult to interpret the results in these circumstances, it seems that the parameter tuning strategy had little practically significant effect on the quality of solutions found and therefore is not better than a set of parameter values recommended by other authors in different circumstances. Part of the difficulty in detecting differences between test and control is that many of the problem instances were so small as to be solved relatively easily in the set-up phase of ACS, when a nearest neighbour heuristic is applied to the TSP graph (Section 2.4). Although the self-tuning approach is intuitively appealing, it has two major deficiencies. Firstly, it offers no understanding of the relationship between parameters and performance: the algorithm tunes itself and it is hoped that performance improves. An understanding of this relationship necessitates building some model (analytical or empirical) of the relationship. Of course, the particular application scenario will determine whether modelling this relationship is advantageous (see Chapter 1). The second deficiency is related to the first. Without a model, there is no understanding of the relative importance of tuning parameters. This runs the risk of wasting resources tuning parameters that actually have no effect on performance.


Heuristic Tuning

Botee and Bonabeau [19] investigated the use of a simple Genetic Algorithm to tune 12 parameters of a modified version of the Ant Colony System algorithm on two problems from TSPLIB, Oliver30 and Eil51. The parameters they evolved are summarised in Table 4.1 on the next page. The ACS equations were modified in several ways to give the genetic algorithm more flexibility. The trail evaporation parameter ρ, which was the same for local and global pheromone updates in the original ACS, was separated into ρlocal and ρglobal for the local and global pheromone update equations respectively. A new numerator Q and a new exponent γ were introduced into the pheromone deposit Equation (2.13 on page 43):

τij = (1 − ρ) τij + ρ · Q / (Cchosen ant)^γ    (4.1)

The chosen ant was always the best-so-far ant. The ACS implementation was augmented with 2-opt local search. The number of repetitions was of the form σ = a · n^b, where a and b determine how σ scales with problem size n. The original trail value was set to a small value rather than according to the nearest neighbour approach advocated in previous literature [46]. The candidate list length was fixed at 1. Each colony, characterised by the given parameters, was treated as an individual. The population of 40 colonies was randomly generated and run for 100 generations. The GA found a set of parameter values that always found the optimal solution to Oliver30 in fewer ant cycles than ACS. The solution found for Eil51 was comparable to the best known solution. The modified algorithm, tuned by the GA, found an optimal solution to Oliver30 after 4928 ant cycles (14 ants for 352 iterations averaged over 30 repetitions). The original algorithm found the optimal solution in 8300 ant cycles (10 ants for 830 iterations). The results for Eil51 are not comparable to the original paper [45] as that paper used a different but similar problem, Eil50. The evolved parameter values are summarised in Table 4.1 on the next page. While the reported results are interesting, the general conclusions we can draw from them are limited. Firstly, the experiments introduced 4 new ACS parameters at once: the γ exponent and Q numerator in the global pheromone update equation, and the separation of the pheromone decay parameter into local and global versions. The use of σ repetitions of a local 2-opt search procedure was also introduced and the candidate list length was removed by fixing it at a value of 1. This makes it impossible to tell whether the improvements in ant cycles that the authors report are due to the genetic algorithm's tuning, the introduced parameters or the removed parameter. These factors are confounded (Section A.1.6 on page 212). The authors then carry all the parameter values from the Oliver30 problem instance to the Eil51 instance, except for the number of ants, which is changed from 14 to 25. Neither the use of the same parameters nor the arbitrary change in the number of ants is justified. Secondly, while the performance improvement of 40% on Oliver30 seems very large, only two simple problem instances were tested and the differences in performance between both algorithms were not tested for statistical significance.

Symbol     Parameter                              Values
m          Number of ants                         14
q0         Exploration/exploitation threshold     0.38
α          Influence of pheromone trails          0.37
β          Influence of heuristic                 6.68
ρlocal     Local pheromone decay                  0.30
ρglobal    Global pheromone deposition            0.31
Q          Global pheromone update term           78.04
γ          Global pheromone update term           0.67
τ0         Initial pheromone levels               0.41
a          Local search term                      5
b          Local search term                      0.97

Table 4.1: Evolved parameter values for ACS. Results are from Botee and Bonabeau [19, p. 154] applied to Oliver30 and Eil51.

A 40% improvement on such small problems is probably of little practical significance. Overall, the approach of tuning ACS with the GA has some other weak points. The GA is itself a heuristic and so probably needs its own tuning. We have seen that this is best done with a DOE approach (Section 4.2 on page 79). The authors do not specify where their GA parameter settings, such as population size, come from. This introduces another parameter tuning problem on top of the ACS parameter tuning problem. Their methodology does not incorporate any screening of parameters. The GA therefore expends time tuning parameters that may not have any effect on ACS performance.

Birattari [12] uses algorithms derived from a machine learning technique known as racing [78] to incrementally tune the parameters of several metaheuristics. Tuning is achieved with a fixed time constraint where the goal is to find the best configuration of an algorithm within this time. While the dual problem of finding a given threshold quality in as short a time as possible is acknowledged, the author does not pursue the idea of a simultaneous bi-objective optimisation of both time and quality. Solution times were subsequently investigated by others [38]. Comparisons are made between 4 types of racing algorithm and a baseline algorithm that uses a brute force approach to tuning. These racing algorithms were used to tune Iterated Local Search for the Quadratic Assignment Problem and Max-Min Ant System for the Travelling Salesperson Problem. Experiments were run on Dorigo and Stützle's code [12, p. 120], the same original source code on which this thesis is based.



4.3.3 Empirical Approaches

One Factor at a Time

Even if the previous self-tuning (Section 4.3.2 on page 84) and heuristic tuning (Section 4.3.2 on page 85) approaches were successful, they are of limited use when attempting to understand the all-important relationship between tuning parameters, problem characteristics and performance. Understanding this relationship requires a model. Building a model involves sampling various points in the space of parameter settings and problem instances (the design space) and then measuring performance at those points. When there are many parameters or problem characteristics, the researcher must confront a vast high-dimensional design space. One way to tackle this is to use a One-Factor-At-a-Time (OFAT) approach. OFAT involves fixing the values of all but one of the tuning parameters. The remaining parameter is varied until performance is maximised. Another parameter is then chosen to be varied and all other parameter values are fixed. This process continues one factor at a time until all parameters have been tuned. While OFAT may occasionally be quick, there are limitations to the conclusions that one may draw from an OFAT analysis (Section 2.5 on page 47). However, this approach occurs in the literature and so an illustrative case is reviewed here.

Stützle studied the three modifications to Ant System introduced for Max-Min Ant System [118, p. 18] with an OFAT approach. Default parameter values were β = 2, α = 1, m = n, ρ = 0.98 and candidate list lengths were 20. ρ was varied between 0.7 and 0.99 for two small instances from TSPLIB, KroA100 and d198. The response measured was quality of solution. Stützle's recommendation was a low value of ρ when a low number of tour constructions are performed and a high value of ρ when a high number of tour constructions are performed. The trade-off of initialisation to τmin or τmax was also examined, along with the use of the global or solution best ant for pheromone update. An informal examination of a table of the differences between the trail initialisations showed a negligible practical difference in solution quality (0.9% maximum) on small instances of size 51 to 318. However, only a single problem characteristic, problem size, was reported. Interestingly, the difference between trail initialisation methods increased with problem size but this trend was unfortunately not explored further. A similar result was obtained for the choice of ant for pheromone update. MMAS was then compared to Ant System [44], Elitist Ant System [44] and Rank-based Ant System [24], where the parameter settings for these algorithms are listed but no motivation for the choice of these parameter settings was given. Comparisons were made on three small instances of size 51, 100 and 198 with a fixed-tours stopping criterion. The results were listed in absolute terms without a statistical test for significance. We can express that results table in terms of relative solution quality. Although MMAS did indeed find the best solutions, we see that the difference from the next best solution provided by ACS was only 0.6%. It is claimed that since MMAS outperforms ACS, and ACS was demonstrated to outperform other nature-inspired algorithms [46], MMAS is therefore competitive with other algorithms. However, this claim ignores the resources used in tuning these algorithms before their performance was measured. The investigation of the benefits of several variants of MMAS with local search could have been done differently. Firstly, some of the parameters were inexplicably changed. Specifically, the number of ants was now fixed at m = 25 and the pheromone decay held constant at ρ = 0.8. Solution CPU time was reported but without any accompanying benchmarking. The instance sizes were varied from 198 to 1577.

Design Of Experiments

The Design of Experiments (DOE) approach is preferable to OFAT. Surprisingly, before the publication of results from this thesis [104, 106, 109, 105, 107], DOE had been almost completely absent from the ACO literature. Silva and Ramalho [114] give a small summary of the use of DOE techniques in ACO. However, their categorisation of the techniques is unusual. They include One-Factor-At-a-Time analysis as a 'simple' type of DOE and the general category of 'data analysis' is seen as separate from DOE. Only one reference [52] is listed as applying DOE but a reading of the reference shows that it does not actually use DOE. The authors then illustrate the use of a full 2^k factorial with 7 factors and 5 replicates on a single instance of the Sequential Ordering Problem. A single solution quality response that is independent of CPU time is measured. Normal plots and residual plots are used to check model quality. The authors then use what they term the 'observation method' to recommend tuning parameter values. It is not clear what this method is. Non-integer values of α = 0.25 and β = 1.5 are recommended. This recommendation would actually lead to extremely high CPU times because of non-integer exponentiation in the ant decisions (Equation (2.2 on page 39)). The authors did not measure the CPU response and so would have been unaware of this problem. This dramatically supports the argument for recording CPU time, regardless of the focus of the experiment (Section 3.10.1 on page 68).

Gaertner and Clark [50] attempted to find optimal parameter settings for three parameters of a single ACO heuristic, Ant Colony System, using a full factorial design. While there were many flaws in the execution and analysis of their research, we include it in this section because of its use of a factorial design. Although the authors identified 6 tuning parameters, α, β, ρ, q0, m and Q, they immediately argued that all but 3 of these could be omitted from consideration. Firstly, they claimed that it is sufficient to fix α and only vary β. They claimed that Q is a constant despite listing it as a parameter. Finally, they claimed that m could 'reasonably' be set to the number of cities in the problem. This left only three parameters that were actually considered: β, ρ and q0. We know from our review in Section 2.4.9 on page 45 that the number of tuning parameters is actually far greater for ACS. The authors then partitioned the three parameters β, ρ and q0 into 14, 9 and 11 values respectively. No reasoning was given for this granularity of partitioning or why the number of partitions varied between parameters. Each 'treatment' was run 10 times with a 1000-iteration or optimum-found stopping criterion on a single 30-city instance, Oliver30. This resulted in 13,860 experiment runs that took the authors several weeks to execute on a dual 2.2GHz processor with 2Gb RAM.


While the excessive running time may have been due to poor implementation, the approach was nonetheless incredibly inefficient: a response surface design (Section A.3.4 on page 218) for 3 factors with a full factorial and 10 replicates would have required approximately 150 runs, about 1% of the number used by the authors. The study was also expensive, although we cannot relate their figures to present-day values because no benchmarking was reported. CPU time was not reported for the various parameter settings and so the heuristic compromise was ignored. The authors also make an unfair comparison with other authors' work, claiming that they find the optimal solution faster on average without local search. This claim fails to acknowledge the authors' significant effort (13,860 runs over several weeks) to find their conclusion and the use of prior knowledge of an optimum to stop an experiment once the optimum was found. Section 3.13 on page 73 discussed the problem with using an optimum as a stopping criterion. The authors claim that their parameter setting is robust because their empirical search found the same parameter setting for 3 values of relative error: 0%, 1% and 5%. This is not a robustness analysis. The authors made no attempt to see how the response varies when the input parameters are perturbed. This thesis presents a far more rigorous experimental approach that draws contradictory conclusions to those of Gaertner and Clark, is an order of magnitude more efficient in terms of experiment runs and deals with all ACS tuning parameters across a space of problem instances rather than a single instance.
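To make the run-count comparison above concrete: the grid search used 14 × 9 × 11 = 1,386 treatments, each replicated 10 times, giving 13,860 runs. A central composite design in 3 factors, assuming the standard layout of a 2^3 factorial portion, 2 × 3 axial points and a single centre point, has 8 + 6 + 1 = 15 design points, or 150 runs with 10 replicates. The exact total depends on how many centre points are chosen, so 150 should be read as an approximate figure.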

4.4 Chapter Summary

This chapter covered the following topics.

• Problem Difficulty. It is critically important to determine the characteristics that affect the difficulty of a problem presented to a heuristic. Without this, it is impossible to generalise the relationship between instances, tuning parameters and heuristic performance.

  – Some authors have tried manipulating instances and examining the resulting effect on problem difficulty. Others have tried to evolve difficult instances and then determine why those instances were difficult.

  – Some authors have hypothesised that a particular problem characteristic made an instance difficult and then generated many instances with different levels of that hypothesised characteristic. This approach is preferable since it fits within the more scientific approach of hypothesise and test. Such an approach has not yet been applied to ACO heuristics for the TSP.

  – Results with exact algorithms for the TSP suggest that the standard deviation of edge lengths in the TSP instance has an effect on the difficulty of the instance.


• Parameter Tuning of other heuristics. Other heuristics have been tuned using basic Design Of Experiments techniques such as factorial and fractional factorial designs.

• Parameter tuning of ACO heuristics.

  – Approaches to tuning ACO can be categorised as either (1) Analytical, (2) Automated or (3) Empirical. Analytical approaches attempt to prove properties about the parameter-problem-performance relationship using mathematical proof. Automated approaches use an algorithm to automatically tune the heuristic. This algorithm may be the heuristic itself. Empirical approaches gather data from actual algorithm runs and attempt to build a model to reason about the data and draw conclusions.

  – Automated Tuning with another heuristic, such as a genetic algorithm, is inefficient for two reasons. Firstly, there is no ability to screen out parameters that are not affecting performance and so effort is wasted on tuning potentially ineffective parameters. Secondly, the tuning heuristic does not build up a model of the relationship between parameters, problem instances and performance of the tuned heuristic. This severely limits what can be learned from running the tuning procedure.

  – Automated Self-tuning involves applying the heuristic's own optimisation mechanisms to the heuristic's tuning parameters. This is intuitively a sensible approach to parameter tuning. However, it suffers from the same lack of screening and modelling as the automated tuning approach. Examples from the literature have been poorly executed experimentally and so we cannot determine whether this is a viable approach to parameter tuning.

  – Researchers often use a One-Factor-At-a-Time (OFAT) approach. While this does give useful insights into the importance and effects of various parameter settings, the OFAT approach has many recognised deficiencies for parameter tuning.

  – The Design of Experiments approach has been used on two occasions to recommend ACS parameter settings. However, there were several flaws in the execution of the DOE methods. Furthermore, the authors did not examine CPU time and so recommended setting exponent tuning parameters to non-integer values.

This concludes the first part of the thesis. Chapter 2 on page 29 gave a background on combinatorial optimisation and the Travelling Salesperson Problem. Metaheuristics were introduced as an approach to finding approximate solutions to these difficult and important problems and the most important Ant Colony Optimisation (ACO) heuristics were described in detail. A review of related work in Chapter 3 began by collecting and organising the many experiment design and analysis issues that arise in empirical research with metaheuristics. Several approaches to determining the problem characteristics that affect performance and to parameter tuning were reviewed in this chapter. However, in light of the concerns raised in Chapter 3, the vast majority of these approaches and their execution have been deficient in several ways. The next part of this thesis will address these deficiencies, comprehensively describing an adapted DOE approach for addressing the parameter tuning problem.


Part III

Design Of Experiments for Tuning Metaheuristics


5 Experimental testbed

Before detailing the adapted DOE methodology that this thesis introduces, we must first address the thesis' 'apparatus'. This chapter covers all issues related to the experimental testbed. The experimental testbed in metaheuristic research comprises three items. These are:

1. the code for the problem generator that creates the test instances that the algorithms then solve,

2. the code for the metaheuristics on which the experimenter is conducting research, and

3. the machines that run all experiments in the research.

Of course, either the machines or the problem generators could be the subject of the research (Section 3.4 on page 58). One often asks whether a problem generator is appropriate for an algorithm. In an industrial context, one may be concerned about the machine characteristics that best suit the algorithms and problems. This chapter deals with these three experimental testbed items in order.

5.1 Problem generator

We have already discussed the difficulty in creating a problem generator and the arguments for and against the use of problem generators (Section 3.12 on page 71). Problem generators are a large area of research because of these difficulties and so are beyond the scope of this research. It is therefore desirable to choose a problem generator that is acceptable to other researchers. Preferably, the generator will already have been subjected to extensive use so that any peculiarities with the generator are more likely to have become known. We have chosen to use a problem generator provided with the 8th DIMACS Implementation Challenge: The Travelling Salesman Problem¹. The DIMACS challenge was a large competition held within the TSP research community with the stated goals of:

• creating 'a reproducible picture of the state of the art in the area of TSP heuristics (their effectiveness, their robustness, their scalability, etc.), so that future algorithm designers can quickly tell on their own how their approaches compare with already existing TSP heuristics' and

• enabling 'current researchers to compare their codes with each other, in hopes of identifying the more effective of the recent algorithmic innovations that have been proposed...'.

One way the DIMACS challenge facilitated these goals was to provide researchers with problem generators to generate instances on which their codes could be tested. Problem generators were provided for several of the possible types of TSP instance (Section 2.2). In particular, the DIMACS challenge provided a generator called portmgen to generate symmetric instances with a given number of nodes where edge lengths were chosen with a uniform random distribution. Other researchers have used instances with edges drawn from a Log-Normal distribution (Section 3.1 on page 54) so that the standard deviation of edge lengths could be treated as a factor and controlled in the experimental sense. This was shown to have an important effect on problem difficulty for an exact algorithm [26]. This same factor is investigated later in this thesis (Chapter 7). The Log-Normal distribution is the probability distribution of any random variable whose logarithm is normally distributed. There is a good introduction to the Log-Normal distribution online². The distribution has the probability density function:

f(x; µ, σ) = exp(−(ln x − µ)² / (2σ²)) / (x σ √(2π))    (5.1)

for x > 0, where µ and σ are the mean and standard deviation of the variable's logarithm. For our purposes of controlling edge length standard deviation, we note that relationships can be derived to solve for the Log-Normal parameters µ and σ of Equation (5.1) given a desired expected mean E(x) and expected variance Var(x) of the resulting distribution:

µ = ln(E(x)) − (1/2) ln(1 + Var(x)/E(x)²)    (5.2)

σ² = ln(1 + Var(x)/E(x)²)    (5.3)

For example, if we want a Log-Normal distribution with a certain standard deviation and certain mean, these equations will tell us what values of parameters µ and σ to use when creating our distribution.
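A minimal sketch of this conversion and of drawing Log-Normal edge lengths is shown below. It follows Equations (5.2) and (5.3) and samples exp(µ + σZ) with Z standard normal; the class and method names are illustrative assumptions and this is not the actual Jportmgen code.

```java
import java.util.Random;

/** Sketch: drawing TSP edge lengths from a Log-Normal distribution with a
 *  desired mean and standard deviation, using Equations (5.2) and (5.3).
 *  Illustrative only; not the actual Jportmgen implementation. */
public final class LogNormalEdgeLengths {

    private final double mu;     // mean of the underlying normal distribution
    private final double sigma;  // standard deviation of the underlying normal
    private final Random rng;

    LogNormalEdgeLengths(double desiredMean, double desiredStdDev, long seed) {
        double ratio = 1.0 + (desiredStdDev * desiredStdDev) / (desiredMean * desiredMean);
        this.mu = Math.log(desiredMean) - 0.5 * Math.log(ratio);   // Equation (5.2)
        this.sigma = Math.sqrt(Math.log(ratio));                   // Equation (5.3)
        this.rng = new Random(seed);
    }

    /** One edge length: exp(mu + sigma * Z) with Z ~ N(0, 1). */
    double nextEdgeLength() {
        return Math.exp(mu + sigma * rng.nextGaussian());
    }

    public static void main(String[] args) {
        LogNormalEdgeLengths edges = new LogNormalEdgeLengths(100.0, 30.0, 42L);
        double sum = 0.0;
        int samples = 100_000;
        for (int i = 0; i < samples; i++) sum += edges.nextEdgeLength();
        System.out.println("Sample mean: " + sum / samples);  // should be close to 100
    }
}
```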

¹ http://www.research.att.com/~dsj/chtsp/
² http://en.wikipedia.org/w/index.php?title=Log-normal_distribution&oldid=136064053


For this research, the DIMACS portmgen generator [58] was ported to Java and refactored into an Object-Oriented implementation we call Jportmgen. The generator's behaviour was preserved during the porting using unit tests. The DIMACS portmgen and the thesis' Jportmgen produced identical instances for a given pseudo-random generator seed. The code was then modified such that chosen edge lengths exhibited a Log-Normal distribution with a desired mean and standard deviation, as per Cheeseman et al [26]. This new implementation therefore allows the experimenter to control problem size, edge length mean and edge length standard deviation while remaining true to the DIMACS generator accepted for the TSP community's largest research project. Different distributions can be plugged into the generator, including the original DIMACS uniform distribution. Although Cheeseman et al [26] did not state their motivation for using the Log-Normal distribution, a plot of the relative frequencies of the normalised edge lengths of Euclidean instances from the online benchmark library, TSPLIB [102], shows that the majority have a Log-Normal shape (Appendix B). Figure 5.1 shows relative frequencies of the normalised edge lengths of several instances created by Jportmgen.

[Plot for Figure 5.1: normalised relative frequency against normalised edge length for three instances with mean edge length 100 and standard deviations 70, 30 and 10.]

Figure 5.1: Relative frequencies of normalised edge lengths for several TSP instances of the same size and same mean cost. Instances are distinguished by their standard deviation. All instances demonstrate the characteristic Log-Normal shape.

Unless otherwise stated, all future references to the problem generator will refer to the Jportmgen generator. All generated instances in the thesis are created with this Log-Normal version of the DIMACS portmgen.

5.2 Algorithm implementation

The next important aspect of the testbed is the algorithm implementation. Reproducible algorithm implementations are both extremely important and yet difficult to achieve (Section 3.8 on page 65). The best way to overcome these issues is to provide the source code on which all experiments are conducted. Furthermore, in the interest of advancing the field, research and its results should be both consistent with previous research and extensible in future research by others. Meeting these basic demands of a scientific field requires the community's adoption of a standard implementation of its ACO algorithms. The closest thing to a standard implementation for ACO algorithms is the C code written by Stützle and Dorigo for their definitive book on the field [47]. This was made available to the community on the world wide web³ and is recommended for experiments with ACO [42, p. 275]. For the reasons mentioned above (reproducibility, relevance to previous research and extensibility in future research) we made the decision to use the ACOTSP code of Stützle and Dorigo. ACOTSP was a procedural C implementation. We translated ACOTSP into a Java implementation that we will refer to henceforth as JACOTSP. The Java implementation now benefits from the usual Object-Oriented advantages⁴, in particular its extensibility. The class hierarchy in JACOTSP ensures that algorithm subclasses share the same data structures and differ only in the implementation details of individual methods. The Template design pattern [72] proves particularly useful in this regard. The Delegator pattern allows different termination conditions, for example, to be 'plugged in' to the algorithms without disrupting the rest of their structure. JACOTSP runs on symmetric TSP problems using 6 ACO algorithms, namely Ant System, Ant Colony System, Rank-based Ant System, Elitist Ant System, Best-Worst Ant System and Max-Min Ant System. One may question the impact of Java on computation times when compared to the original ACOTSP C implementation. While the early releases of Java were indeed slow, subsequent releases addressed this issue in the context of scientific computing [23]. Java is now an acceptable choice for scientific computing according to performance benchmarks⁵ and is used for high performance scientific computing in laboratories such as CERN⁶. To focus too closely on relative running times would be to miss the aim of the thesis, which is to demonstrate effective tuning of a heuristic resulting in improvements in solution quality and solution time. As discussed in Chapter 3, these solution times will always be dependent on the particular implementation details, regardless of the programming language used. The random number generator used in ACOTSP and ported to JACOTSP is the Minimal Random Number Generator of Park and Miller [92]. Its implementation and a discussion of its merits can be found in the literature [99, p. 278-279]. The reimplementation of the random number generator, in particular, ensures that JACOTSP produces the same behaviour (and ultimately the same solutions) as its ACOTSP predecessor. This backwards compatibility was ensured with unit tests that compare the output files of ACOTSP with those of JACOTSP for a variety of input parameters. Such compatibility does not make sense when new tuning parameters are identified or when aspects of the algorithm's internal design are parameterised and varied. Breaking such compatibility is inevitable if the ACO algorithms are to evolve. Given that the random number generator is well-established and that backwards compatibility was important, we did not investigate alternative generators as is sometimes advisable (Section 3.11 on page 71).

³ ACOTSP, available at http://iridia.ulb.ac.be/~mdorigo/ACO/aco-code/public-software.html
⁴ In general, the use of an object model leads to systems with the following attributes of well-structured complex systems: abstraction, encapsulation, modularity and hierarchy [18].
⁵ http://shootout.alioth.debian.org/
⁶ http://dsd.lbl.gov/~hoschek/colt/
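For reference, the sketch below shows the core of the Park-Miller 'minimal standard' generator (multiplier 16807, modulus 2^31 − 1), using Schrage's factorisation to avoid overflow. It is a textbook rendering of the algorithm rather than a copy of the ACOTSP or JACOTSP source, and the class name and seed handling are illustrative choices.

```java
/** Sketch of the Park-Miller 'minimal standard' generator:
 *  seed <- 16807 * seed mod (2^31 - 1), using Schrage's factorisation.
 *  A textbook rendering, not the ACOTSP/JACOTSP source. */
public final class MinimalStandardRng {

    private static final long A = 16807L;        // multiplier
    private static final long M = 2147483647L;   // modulus, 2^31 - 1
    private static final long Q = M / A;         // 127773
    private static final long R = M % A;         // 2836

    private long seed;

    MinimalStandardRng(long seed) {
        // Map any long seed into the required range [1, M-1].
        this.seed = (seed % (M - 1) + (M - 1)) % (M - 1) + 1;
    }

    /** Next uniform double in (0, 1). */
    double nextDouble() {
        long hi = seed / Q;
        long lo = seed % Q;
        seed = A * lo - R * hi;       // Schrage's method: no intermediate overflow
        if (seed <= 0) seed += M;
        return (double) seed / M;
    }

    public static void main(String[] args) {
        MinimalStandardRng rng = new MinimalStandardRng(12345L);
        for (int i = 0; i < 3; i++) System.out.println(rng.nextDouble());
    }
}
```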


CHAPTER 5. EXPERIMENTAL TESTBED The original ACOTSP contained a timer that reported the CPU time for which ACOTSP was running. A slightly different approach was taken in JACOTSP because accessing CPU times in Java was problematic in the Java version in which the JACOTSP project was started. Newer versions have since overcome this. For this reason, the decision was taken to use a timer supplied with the Colt project7 . Colt is a set of high performance scientific computing libraries used at the CERN labs. JACOTSP therefore measures elapsed time rather than CPU time. The timer was paused during the calculation and output of data that is not essential to the functioning of the JACOTSP ant algorithms. For example, branching factor calculation is not timed for ACS but is timed for MMAS because it is used in trail reinitialisation. Any concerns over the interruption of the timer by other operating system processes are easily allayed by randomising experiment running orders. Unless otherwise stated, the times reported in this thesis’ case studies are elapsed times rather than CPU times.

5.2.1 Profiling

Exponentiation is a mathematical operation, written a^n, involving two numbers, the base a and the exponent n. When n is a whole number (an integer), the exponentiation operation corresponds to repeated multiplication. However, when the exponent is a real number (say 1.73) a different approach to calculation is required and this approach is computationally very expensive. In Java, the language of the JACOTSP implementation, the natural logarithm method is used for real exponents. The details of this method are beyond the scope of this discussion. A simple profiling of the JACOTSP code showed that real exponent values caused the vast majority of computational effort to be expended on exponentiation. Recall from the design of ACO (Section 2.4.3 on page 37) that two exponentiations are involved in every ant movement decision (see Equation (2.2 on page 39) for example). The implication of this, and the common knowledge of the expense of exponentiation, is that tuning parameters that are exponents should be limited to integer values only. However, we have seen at least one case in the literature [114] where authors looking only at solution quality and not recording CPU time actually recommend non-integer values of these exponents (Section 4.3.3 on page 88). Any gain in quality from using a non-integer α and β will most likely be offset by the huge deterioration in solution time. This is further evidence to support the recommendation of measuring CPU time (Section 3.10.1 on page 68) and for this thesis' emphasis on the heuristic compromise.
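The crude micro-benchmark sketched below illustrates the point by comparing repeated multiplication for an integer exponent against Math.pow with a non-integer exponent. It is not the thesis' profiling code, the loop sizes are arbitrary, and absolute timings will vary with the platform and JIT warm-up.

```java
/** Crude micro-benchmark sketch: integer exponentiation by repeated
 *  multiplication versus Math.pow with a non-integer exponent.
 *  Not the thesis' profiling code; timings are platform dependent. */
public final class ExponentCostSketch {

    static double intPow(double base, int exponent) {
        double result = 1.0;
        for (int i = 0; i < exponent; i++) result *= base;
        return result;
    }

    public static void main(String[] args) {
        final int n = 5_000_000;
        double sink = 0.0;   // accumulate results so the JIT cannot discard the work

        long t0 = System.nanoTime();
        for (int i = 1; i <= n; i++) sink += intPow(1.0 + i * 1e-9, 3);
        long integerNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 1; i <= n; i++) sink += Math.pow(1.0 + i * 1e-9, 3.17);
        long realNanos = System.nanoTime() - t1;

        System.out.printf("integer exponent: %d ms, non-integer exponent: %d ms (sink=%.1f)%n",
                integerNanos / 1_000_000, realNanos / 1_000_000, sink);
    }
}
```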

⁷ http://dsd.lbl.gov/~hoschek/colt/index.html



5.3 Benchmarking the machines

The remaining aspect of our experimental testbed is the physical machines on which all experiments are conducted. Clearly, all machines can differ widely. There are differences in processor speeds, memory sizes, chip types, operating systems, operating system versions and, in the case of Java, different versions of different virtual machines. Even if machines are identical in terms of all of these aspects, they may still differ in terms of running background processes such as virus checkers. This is the unfortunate reality of the majority of computational research environments. Furthermore, such differences will almost certainly occur in the computational resources of other researchers who attempt to reproduce or extend previous work of others. Ultimately, such differences mean that experiments that are identical in terms of the two previous testbed issues of algorithm code and problem instances will still differ when run on supposedly identical machines. These differences necessitate the benchmarking of the experimental testbed (Section 3.9 on page 67). Reproducibility of results (Section 3.8 on page 65) is a second important motivation for benchmarking. Other researchers can reproduce the benchmarking process on their own experimental machines. They can thus better interpret the CPU times reported in this research by scaling them in relation to their own benchmarking results. This mitigates the decline in relevance of reported CPU times with inevitable improvements in technology. It is hoped that the benchmarking advocated in this thesis becomes commonplace in reported ACO and metaheuristics research.

5.3.1 Benchmarking method

The clear and simple benchmarking procedure of the DIMACS [58] challenge is applied here and its results described below.

1. A set of TSP instances is generated with one of the DIMACS problem generators. These instances range in size from one thousand nodes to one million nodes.

2. The DIMACS greedy search, a deterministic algorithm, is applied to each instance for a given number of repetitions and the total time for all repetitions is recorded. The number of repetitions varies inversely with the size of the instance. For example, the instance of size 1 million is solved only once by the greedy search while the instance of size 1 thousand is solved 1 thousand times.
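A skeleton of this loop is sketched below. The instance sizes and repetition counts follow Figure 5.3; solveGreedy is a stub standing in for the actual DIMACS greedy search applied to the generated instances.

```java
/** Skeleton of the DIMACS-style benchmarking loop described above.
 *  solveGreedy(...) is a stub standing in for the DIMACS greedy search
 *  applied to the corresponding generated instance. */
public final class BenchmarkHarnessSketch {

    // Instance sizes and repetition counts; repetitions vary inversely with size.
    private static final int[] SIZES = {1_000, 3_000, 10_000, 31_000, 100_000, 316_000, 1_000_000};
    private static final int[] REPS  = {1_000,   316,     100,     32,      10,       3,         1};

    /** Stub: replace with a call to the DIMACS greedy search on the instance of this size. */
    private static void solveGreedy(int size) {
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            long start = System.nanoTime();
            for (int rep = 0; rep < REPS[i]; rep++) {
                solveGreedy(SIZES[i]);
            }
            double totalSeconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("size=%d reps=%d total=%.2f s%n", SIZES[i], REPS[i], totalSeconds);
        }
    }
}
```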

5.3.2 Results and discussion

The results of the DIMACS benchmarking of our experimental machines are illustrated in Figure 5.2 on the facing page and the corresponding data are presented in Figure 5.3 on the next page.



[Bar chart for Figure 5.2, 'DIMACS Benchmarking of experiment machines': total benchmark time in seconds against instance (E1k.0 to E1M.0), with one bar per experiment machine.]

Figure 5.2: Results of the DIMACS benchmarking of the experiment testbed.


Instance   Size       Repetitions   Time (s) by machine ID
                                    116      253      111      156      188      136      96
E1k.0      1000       1000          5.45     4.38     5.25     5.00     3.31     4.81     3.37
E3k.0      3000       316           5.67     4.61     7.61     5.25     3.78     5.23     3.75
E10k.0     10000      100           6.91     7.25     8.81     6.44     4.99     6.64     5.32
E31k.0     31000      32            11.77    16.52    13.41    11.20    9.48     11.00    10.84
E100k.0    100000     10            22.86    26.53    26.77    21.03    10.87    19.85    12.82
E316k.0    316000     3             28.61    32.05    34.50    27.61    12.56    25.86    14.70
E1M.0      1000000    1             39.52    44.80    49.23    38.03    16.55    35.44    19.31

Figure 5.3: Data from the DIMACS benchmarking of the experiment testbed.

The horizontal axis represents the different instances for which the benchmarking was conducted, where instances are arranged in order of increasing size. The vertical axis is the total time in seconds for the benchmarking run. Each bar represents a different machine, identified by a machine ID. It is evident that every machine's benchmark time increases with instance size. The differences between machines become more pronounced as instance size increases. Machine 188 is almost always the fastest. The benchmarking times indicate that despite the similarity of the specification of most of the machines, there are still differences in CPU times and these differences seem to amplify for larger instances. The benchmarking has thus identified a nuisance factor in the experimental testbed and the need to randomise experiment runs across the experiment testbed. Efforts to use completely identical machines for all experiments will still encounter this nuisance factor. Any successful performance analysis methodology will have to cope with this reality.
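One simple way to randomise the running order of experiments and their assignment to machines is sketched below; the treatment and replicate labels are illustrative, with the machine IDs taken from Figure 5.3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: randomising the order of experiment runs and their assignment to
 *  machines so that machine differences act as noise rather than a systematic
 *  bias. Treatment and replicate labels are illustrative. */
public final class RandomisedRunOrder {

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>();
        for (int treatment = 1; treatment <= 8; treatment++) {
            for (int replicate = 1; replicate <= 3; replicate++) {
                runs.add("treatment-" + treatment + "/rep-" + replicate);
            }
        }

        String[] machines = {"116", "253", "111", "156", "188", "136", "96"};
        Collections.shuffle(runs, new Random(2007L));  // fixed seed gives a reproducible schedule

        for (int i = 0; i < runs.size(); i++) {
            System.out.println(runs.get(i) + " -> machine " + machines[i % machines.length]);
        }
    }
}
```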

5.4 Chapter summary

This chapter covered the following:

• Decision to use a publicly available problem generator. There are concerns over the difficulties of developing reliable problem generators. This research uses a publicly available problem generator that was also used in the DIMACS challenge, a large research competition within the TSP community.

• Modified generator to control a problem characteristic. Problem instance standard deviation of edge lengths may be an important problem characteristic affecting problem difficulty. To this end, a modified DIMACS problem generator draws its edge lengths such that they exhibit a Log-Normal distribution with a desired mean and standard deviation. The choice of this distribution is in keeping with previous work by Cheeseman et al [26].

• OOP implementation of publicly available algorithm source code. There is a suite of publicly available C code of the main ACO algorithms to accompany the field's main book [47]. We have reimplemented this code in Java and refactored it into an extensible Object-Oriented (OO) implementation. Our Java code continues to reproduce the solutions of the original C code by Dorigo and Stützle. Results from this research are therefore applicable to other research that has used their publicly available C code.

• Highlighting the computationally expensive exponentiation calculation. We saw from our overview of the ACO metaheuristics (Section 2.4.3 on page 37) that they all involve a very large number of exponentiation calculations in which the tuning parameters α and β are the exponents. It is well known that exponentiation with non-integer exponents is computationally very expensive. This was highlighted in our profiling of the code but is often missed by researchers who ignore the heuristic compromise and do not record CPU times. Tuning of the α and β parameters will therefore be restricted to integer values.

• Benchmarked experimental machines. The DIMACS benchmarking approach was applied to all machines used during the course of this research. The emphasis of the thesis research is not to compare algorithms in a ‘horserace’ study (Section 3.4.4 on page 59). Benchmarking the machines benefits future research in that CPU times reported in the thesis can be interpreted and scaled by other researchers using the same accepted benchmarking approach, regardless of the inevitable differences in their experimental testbed.
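To make the cost of the exponentiation calculation concrete, the following minimal sketch (written in the spirit of, but not taken from, our Java implementation) contrasts the general-purpose Math.pow call required for non-integer exponents with the handful of multiplications that suffice once α and β are integers. The variable names and sample values are purely illustrative.

    // Minimal sketch (not the thesis's ACO code): why restricting alpha and beta
    // to integers avoids the expensive general-purpose Math.pow call.
    public class IntegerExponentSketch {

        // General case: non-integer exponents require Math.pow.
        static double weightGeneral(double pheromone, double heuristic,
                                    double alpha, double beta) {
            return Math.pow(pheromone, alpha) * Math.pow(heuristic, beta);
        }

        // Integer case: a handful of multiplications suffice.
        static double weightInteger(double pheromone, double heuristic,
                                    int alpha, int beta) {
            return intPow(pheromone, alpha) * intPow(heuristic, beta);
        }

        static double intPow(double base, int exp) {
            double result = 1.0;
            for (int i = 0; i < exp; i++) {
                result *= base;
            }
            return result;
        }

        public static void main(String[] args) {
            double tau = 0.37, eta = 1.0 / 152.0; // illustrative pheromone and heuristic values
            System.out.println(weightGeneral(tau, eta, 1.0, 2.0));
            System.out.println(weightInteger(tau, eta, 1, 2)); // same value, cheaper to compute
        }
    }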


6 Methodology

Chapter 3 brought together high-level concerns spanning all aspects of Design Of Experiments (DOE) with particular emphasis on DOE for metaheuristics. The reader is referred to Appendix A for a background on DOE. This chapter focuses on a sequential experimentation methodology that deals with these concerns. The methodology efficiently takes the experimenter from the initial situation of almost no knowledge of the metaheuristic’s behaviour to the desired situation of a modelled metaheuristic with recommendations on tuning parameter settings for given problem characteristics. The methodology is based on a well-established procedure from DOE that this thesis modifies for its application to metaheuristics. It was first introduced to the ACO community only recently by the author [104]. The design generation and statistical calculations can be performed with most modern statistical software packages including SPSS, NCSS PASS, Minitab and Microsoft Excel. Examples of the methodology’s successful application to the parameter tuning problem are presented in the chapters in Part IV of the thesis. This chapter begins with a relatively high-level overview of the whole sequential experimentation methodology before detailing all its stages and decisions.

6.1 Sequential experimentation

Experimentation for process modelling and process improvement is inherently iterative. This is no different when the studied process is a metaheuristic. Box [20] gives some sample questions that often arise after an experiment has been conducted. “That factor doesn’t seem to be doing anything. Wouldn’t it have been better if you had included this other variable?” “You don’t seem to have varied that factor over a wide enough range.”


“The experiments with high factor A and high factor B seem to give the best results; it’s a pity you didn’t experiment with these factors at even higher levels.”

In sequential experimentation, there are six directions in which a subsequent experiment commonly moves [20]. These depend on the results from the first experiment.

1. Move to a new location in the design space because the initial results suggest a trend that is worth pursuing.

2. Stay at the current location in the design space and add further treatments to the design to resolve ambiguities that may exist in the design. Such ambiguities are typically due to effects that cannot be separated from one another due to the nature of the experiment design. Such ‘entanglement’ of effects is termed aliasing (Section A.3 on page 214).

3. Rescale the design if it appears that certain variables have not been scaled over wide enough ranges.

4. Remove or add factors to the experiment.

5. Repeat some runs to better estimate the replication error.

6. Augment the design to assess the curvature of the response. This is particularly important when large two-factor interactions occur between factors.

These questions and the decisions for subsequent experiments are part of a larger sequential experimentation methodology. This ‘bigger picture’ is illustrated in Figure 6.1. The main advantage of the sequential experimentation methodology is its efficiency of resources. It is rare that the experimenter begins with a full knowledge of the metaheuristic (hence the point of experimentation). A revision of experiment design decisions is therefore inevitable as more is learned about the metaheuristic. The sequential methodology avoids the risky and oft unsuccessful approach of running a large all-encompassing expensive experiment up front. Instead, many of the designs and their existing data are incorporated into subsequent experiments so that no experimental resources go to waste. Factors are carefully examined before a decision on their inclusion in an experiment. Calculations of statistical power reveal when a sufficient number of replicates have been gathered. We now describe each of the stages in this sequential experimentation methodology of this thesis with reference to modelling and tuning the performance of a metaheuristic.

6.2 Stage 1a: Determining important problem characteristics

The first stage in the sequential experimentation approach is to determine all problem characteristics that affect at least one of the responses of interest.

Figure 6.1: The sequential experimentation methodology. The methodology covers four main stages: screening, modelling, tuning and evaluation of results. (Flowchart: the inputs are the heuristic tuning parameters and the known and suspected problem characteristics; screening estimates main effects and interactions and may be augmented with foldover runs to confirm the model; a check for curvature leads to the modelling stage, where the design is augmented with axial points or a new design space for Response Surface Methods; tuning applies numerical optimisation; evaluation uses runs to confirm the model and results, and overlay plots for optimal parameter settings.)

Without sufficient problem characteristics, the response surface models from later in the procedure will not make good predictions of performance on new instances.

6.2.1 Experiment Design

The main difficulty encountered in attempting to experiment with problem instance characteristics is that of the uniqueness of instances. That is, while several instances may have the same characteristic that is hypothesised to affect the response, these instances are nonetheless unique. For example, there is a potentially infinite number of possible instances that all have the same characteristic of problem size. The uniqueness of instances will therefore cause different values of the response despite the instances having identical levels of the hypothesised characteristic. The experimenter’s difficulty is one of separating the effect (if any) due to the hypothesised characteristic from the unavoidable variability between unique instances.

A given heuristic encounters instances with different levels of some characteristic. The experimenter wishes to determine whether there is a significant overall difference in heuristic performance response for different levels of this characteristic. The experimenter also wishes to determine whether there is a significant variability in the response when unique instances have the same level of the problem characteristic. There is a well-established experiment design to overcome this difficulty. It is termed a two-stage nested (or hierarchical) design. Figure 6.2 illustrates the two-stage nested design schematically.

Problem characteristic        1                          2
Instance                 1      2      3            4      5      6
Observations           y111   y121   y131         y241   y251   y261
                       y112   y122   y132         y242   y252   y262
                       ...    ...    ...          ...    ...    ...
                       y11r   y12r   y13r         y24r   y25r   y26r

Figure 6.2: Schematic for the Two-Stage Nested Design with r replicates (adapted from [84]). There are only two levels of the parent problem characteristic factor. Note the instance numbering to emphasise the uniqueness of instances within a given level of the problem characteristic.

Note that this thesis applies the two-stage nested design to the heuristics researcher’s aim of determining whether a problem characteristic merits inclusion in the sequential experimentation methodology. This design cannot capture possible interactions between more than one problem characteristic. Capturing such interactions would require a more complicated crossed nested design or the factorial designs encountered later in the sequential experimentation procedure. The goal here is to quickly determine whether the hypothesised characteristic should be included in subsequent experiments in the sequential methodology. This design, first introduced into ACO research by the author [105, 110], is now receiving wider attention and theoretical backing in the community [6].

6.2.2 Method

This is an overview of the method for determining whether a problem characteristic should be included in subsequent stages of the methodology. A case study illustrates this method with real data in Chapter 7.

1. Response variables. Identify the response(s) to be measured. For experiments with metaheuristics, these must reflect some measure of solution quality and some measure of solution time.

2. Design factors and factor ranges. Choose the problem characteristic hypothesised to affect the response of interest and the range over which that factor will be varied. The null hypothesis is that changes in this characteristic cause no significant change in the metaheuristic performance. The alternative hypothesis is that changes in the characteristic do indeed cause changes in the performance.

3. Held-constant factors. Fix all tuning parameters and all other problem characteristics at some values that remain fixed for the duration of the experiment. These values may be values commonly encountered in the literature, values in the middle of the range of interest or values determined to be of interest by pilot studies. As noted above, this does not permit an examination of interactions between the hypothesised characteristic and both other characteristics and the metaheuristic’s tuning parameters. These interactions are examined later in the sequential methodology.

4. Experiment design. Generate a two-stage nested design where the hypothesised characteristic is the parent factor (Factor A) and the unique instances (Factor B) are nested within a given level of this parent. This design is illustrated schematically in Figure 6.2. Preferably, there should be at least three levels of the parent factor so that it can be determined whether the hypothesised effect of the problem characteristic is linearly related to the performance metric (Section 3.5 on page 59). We can have as many instances nested within each parent level as we like. The number of instances must be the same within each level of the parent. A treatment is then a run of the metaheuristic on an instance within a given level of the problem characteristic.

5. Replicates and randomise. Replicate each treatment and randomise the run order. Collect the data in the randomised run order (a small sketch of such a run layout follows this list).

6. Analysis. The data can now be analysed with the General Linear Model. The statistical technicalities behind this analysis have recently been explained in the context of metaheuristics research [6]. These translate into the following settings in most statistical software:

   • Factor A is entered into the model as the parent. Factor B is nested within Factor A.
   • Factor B is set as a random factor since the unique instances are randomly generated.
   • Factor A is a fixed factor because its levels were chosen by the experimenter.

7. Diagnostics. The usual diagnostic plots (Section A.4.2 on page 221) are examined to verify that the assumptions for the application of ANOVA have not been violated.

8. Response Transformation. A transformation of the response may be needed (Section A.4.3 on page 222). If so, the experimenter returns to step 5 and reanalyses the transformed response.

9. Outliers. If outliers are identified in the data, these are deleted and the experimenter returns to step 5.

10. Power. The gathered data are analysed to determine the statistical power. If insufficient power has been reached for the study’s level of significance then further replicates are added and the experimenter returns to the Replicates and randomise step.

11. Interpretation. When satisfied with the model diagnostics, the ANOVA table from the analysis can be interpreted.

12. Visualisation. The box plot is also useful for visualising the practical significance of any statistically significant effects.

This approach can be repeated for any problem characteristic that is hypothesised to affect performance. Once satisfied with a set of characteristics, the experimenter proceeds to do a larger scale screening experiment with all the tuning parameters and all the relevant problem characteristics. Of course, if the set of problem characteristics turns out to be insufficient, the experimenter returns to this stage of the overall sequential experimentation methodology.
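To make the structure of the nested design and its randomised run order concrete, the following illustrative Java sketch lays out (characteristic level, instance, replicate) treatments and shuffles them. The choice of three levels, three instances per level and two replicates, and all names, are assumptions made purely for illustration, not the settings used in the case studies.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Illustrative sketch of laying out runs for a two-stage nested design: each level
    // of the problem characteristic has its own, uniquely generated instances, and every
    // (level, instance, replicate) treatment is executed in a randomised order.
    public class NestedDesignSketch {

        record Run(int characteristicLevel, int instanceWithinLevel, int replicate) {}

        public static void main(String[] args) {
            int levels = 3, instancesPerLevel = 3, replicates = 2;
            List<Run> runs = new ArrayList<>();
            for (int level = 1; level <= levels; level++) {
                for (int instance = 1; instance <= instancesPerLevel; instance++) {
                    for (int rep = 1; rep <= replicates; rep++) {
                        runs.add(new Run(level, instance, rep));
                    }
                }
            }
            Collections.shuffle(runs);          // randomised run order
            runs.forEach(System.out::println);  // collect the data in this order
        }
    }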

6.3 Stage 1b: Screening

The next stage in the sequential experimentation methodology is screening. Screening aims to determine which factors have a statistically significant effect and practically significant effect on each response as well as the relative size of these effects. Chapters 8 on page 143 and 10 on page 169 present detailed case studies of screening. The experiment design and methodology presented here were first introduced to the ACO community by the author [104, 107, 108]. A similar DOE screening methodology was applied to Particle Swarm Optimisation a year later [74].



6.3.1 Motivation

A detailed motivation for screening heuristic tuning parameters and problem characteristics has been covered in Chapter 1. It is important to screen heuristic tuning parameters and problem characteristics for several reasons. We learn which parameters have no effect on performance and this saves experimental resources in subsequent performance modelling experiments. It also improves the efficiency of other heuristic tuning methods (Section 4.3.2 on page 85) by reducing the search space these methods must examine. This was already discussed on several occasions as one of the major advantages of the DOE approach to parameter tuning over alternatives like automated tuning (Section 4.3.2 on page 84). Screening experiments also provide a ranking of the importance of the tuning parameters. In a case of limited resources, this arms the experimenter with knowledge of the most important factors to examine and those parameters that one may afford to treat as held-constant factors. Finally, screening is a useful design tool. Alternative new heuristic features can be ‘parameterised’ into the heuristic and a screening analysis will reveal whether these new features have any significant effect on performance.

6.3.2 Research questions

The research questions in any heuristic screening study can be phrased as follows.

Screening. Which of the given set of heuristic tuning parameters and problem characteristics have an effect on heuristic performance in terms of solution quality and solution time?

Ranking. Of the tuning parameters and problem characteristics that affect heuristic performance, what is the relative importance of each in terms of solution quality and solution time?

Adequacy of a Linear Model. Is a linear model of the responses adequate to predict performance or is a higher order model required?

These research questions lead to a potentially large number of hypotheses and it would not be useful to exhaustively list them all here. Some illustrative examples follow. These examples can be specified for any tuning parameter or problem characteristic and must be analysed for all performance metrics in the study. A screening hypothesis would look like the following.

• Null Hypothesis H0: the tuning parameter A has no significant effect on performance measure X.
• Alternative Hypothesis H1: the tuning parameter A has a significant effect on performance measure X.

A ranking hypothesis would look like the following.

• Null Hypothesis H0: the tuning parameters A and B have an equally important effect on performance measure X.
• Alternative Hypothesis H1: the tuning parameter A has a stronger effect than tuning parameter B on performance measure X.

6.3.3 Experiment Design

Full factorial designs are expensive in terms of experimental resources and provide more information than is needed in a screening experiment (Section A.3.2 on page 215). For screening purposes it is sufficient to use a fractional factorial (FF) design. The minimum appropriate design resolution for screening is resolution IV, in which no main effects are aliased with two-factor interactions. A resolution V design is preferable when possible since it also provides unaliased estimates of two-factor interactions. Fractional factorials, aliasing and resolution are discussed in more detail in Appendix A.2.
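As a concrete illustration of how a fractional factorial is constructed, the sketch below generates a small two-level 2^(4-1) design in which the fourth factor’s column is the product of the other three (generator D = ABC). This is not the specific design used in the case studies; it only shows how deriving one column from others produces the aliasing that the resolution of a design describes.

    // Illustrative construction of a 2^(4-1) fractional factorial with generator D = ABC.
    public class FractionalFactorialSketch {

        public static void main(String[] args) {
            int[] levels = {-1, +1};
            System.out.println(" A  B  C  D=ABC");
            for (int a : levels) {
                for (int b : levels) {
                    for (int c : levels) {
                        int d = a * b * c; // D is aliased with the ABC interaction
                        System.out.printf("%2d %2d %2d %2d%n", a, b, c, d);
                    }
                }
            }
        }
    }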

6.3.4 Method

The methodology for factor screening is described below.

1. Response variables. Identify the response(s) to be measured. For experiments with metaheuristics, these must reflect some measure of solution quality and some measure of solution time.

2. Design factors and factor ranges. Choose the algorithm tuning parameters and problem characteristics that will be screened as well as the ranges over which these factors and problem characteristics will vary. Sometimes factors have a restricted range due to their nature. Alternatively, factors may have an open range. In either case, a pilot study (Section 3.7 on page 65) may be required to determine sensible factor ranges.

3. Held-constant factors and values. Because of limited resources, it may not be possible to experiment with all potential design factors. A pilot study may have revealed that some factors have a negligible effect on performance. These factors must be held at a constant value for the duration of the experiments.

4. Experiment design. For the given number of factors, choose an appropriate fractional factorial design. If the number of factors and resource limitations prevent the use of a resolution V design then examine the available resolution IV designs for the given number of factors. Where there are several available designs of resolution IV, check whether the design requiring a smaller number of treatments has a satisfactory aliasing structure. Note that any resolution IV design will have aliased two-factor interactions. However, knowledge of the system and its most likely interactions may make some of these aliases negligible. Furthermore, judicious assignment of factors will result in the most important factors having the least aliasing in the generated design. The choice of design will fix the number and value of the factor levels. The chart provided in Section A.3.2 on page 215 can help in making a decision between resolution, number of runs and aliasing.

5. Run Order. This is now the minimum design with a single replicate. Generate a random run order for the design.

6. Gather Data. Collect data according to the treatments and their random run order.

7. Significance level, effect size and replicates. Choose an appropriate significance (alpha) level for the study. Choose an appropriate effect size that the screening must detect. For the study’s chosen significance level (5% in this thesis), examine the design’s power to detect the chosen effect size. If the power is not greater than 80%, add replicates to the design and return to the Run Order step to gather the extra data. Sufficient data has been gathered when power reaches 80%.

At this stage, sufficient data has been gathered with which to build a model of each response. The steps in this model building stage of the methodology are the subject of the next section.

6.3.5 Build a Model

The following model building steps must be repeated for each of the responses separately. Before this however, it is necessary to check that none of the responses are highly correlated. Only one response in a set of highly correlated responses needs to be analysed. A scatter plot of each response against each other response visually demonstrates correlation. Recall that two solution quality responses are measured in this research (Section 3.10.2 on page 69). These responses are of course highly correlated. However, both responses are always analysed separately. It is an open question for the ACO field as to which solution quality response is the more appropriate and we wished to examine whether the conclusions using one would differ from the conclusions using the other.

1. Find important effects. Various techniques such as half-normal plots can be used to identify the most important effects that should be included in a model of the data. This thesis uses backwards regression (Section A.4.1 on page 220) with an alpha-out value of 0.1.

2. ANOVA test. The result of the backwards regression is an ANOVA on the model containing these most important effects.

3. Diagnosis. The usual diagnostic tools (Section A.4.2 on page 221) are used to verify that the ANOVA model is correct and that the ANOVA assumptions have not been violated.

4. Response Transformation. The diagnostics may reveal that a transformation of the response is required. In this case, perform the transformation and return to the Find Important Effects step.

5. Outliers. If the transformed response is still failing the diagnostics, it may be that there are outliers in the data. These should be identified and removed before returning to the Find Important Effects step.

6. Model significance. Check that the overall model is significant.

7. Model Fit. Check that the predicted R-squared value is in reasonable agreement with the adjusted R-squared value and that both of these are close to 1. Check that the model has a signal-to-noise ratio greater than about 4.

At this stage in the procedure, it may be necessary to augment the design depending on whether a resolution IV or resolution V design had been used and depending on whether aliased effects are statistically significant. This is one of the common iterative experimentation situations identified earlier (Section 6.1 on page 105).

6.3.6 Augment model

In a resolution IV design, some second order effects will be aliased with other effects. If the experimenter deems these effects to be important then the design must be augmented with additional treatments so that the aliasing of these effects can be removed. There is a methodical approach to fractional factorial augmentation that is termed foldover. The details of foldover are beyond the scope of this thesis but are covered in the literature [84]. In essence, foldover is a way to add specific treatments to a design such that a target effect will no longer be aliased in the new augmented design. If foldover is performed, the new design can be analysed as per Sections 6.3.4 on page 112 and 6.3.5 on the previous page. At this point, we have reduced models for each response that pass the diagnostics and in which no important effects are aliased. However, diagnostics involve some subjective judgements. While the ANOVA procedure can be robust to slight violations of these diagnostics, it is still good practice to independently confirm the models’ accuracy on some real data.

6.3.7 Confirmation

Before drawing any conclusions from a model, it is important to confirm that the model is sufficiently accurate. As in traditional DOE, confirmation is achieved by running experiments at new randomly chosen points in the design space and comparing the actual data to the model’s predictions. Confirmation is not a new rigorous experiment and analysis in itself but rather a quick informal check. In the case of a heuristic, these randomly chosen points in the design space equate to new problem instances and new randomly chosen combinations of tuning parameters. The methodology that this thesis proposes is as follows:

1. Treatments. A number of treatments are chosen where a treatment consists of a new problem instance and a new set of tuning parameter values with which the instance will be solved.

2. Generate Instances. The required problem instances are generated.

3. Select Tuning Parameters. It is important to remember that the screening design uses only high and low values of the factors. It therefore can only produce a linear model of the response. This will not be an accurate predictor of the response within the centre of the design space if the response actually exhibits curvature. However, the model should still be accurate near the edges of the design space (at the high and low values of the factors). The randomly chosen tuning parameters should therefore be restricted to be within a certain percentage of the edges of the factors’ ranges. In this research, a limit of within 10% of the factor high and low values is used (a small sketch of this sampling follows the list).

4. Random run order. A random run order is generated for the treatments and a given number of replicates. In this research, 3 replicates are used, which is enough to give an estimate of how variable the response is for a given treatment. We are conducting this confirmation to ensure our subjective decisions in the model building were correct. We are not conducting a new statistically designed experiment that would introduce further subjective diagnostic decisions as per the previous section.

5. Prediction Intervals. The collected data is compared to the model’s 95% high and low prediction intervals (Section A.3.5 on page 219). We identify two criteria upon which our satisfaction with the model (and thus confidence in its predictions) can be judged [106].

   • Conservative: we should prefer models that provide consistently higher predictions of relative error and higher solution time than those actually observed. We typically wish to minimise these responses and so a conservative model will predict these responses to be higher than their true value.

   • Matching Trend: we should prefer models that match the trends in heuristic performance. The model’s predictions of the parameter combinations that give the best and worst performance should match the combinations that yield the actual metaheuristic’s observed best and worst performance.

6. Confirmation. If the model is not a satisfactory predictor of the actual algorithm then the experimenter must return to the model building phase and attempt to improve the model.

At this stage, we have built models of each response from the gathered data and the models have been confirmed to be good predictors of the algorithm responses around the edges of the design space. We can now analyse the models and rank and screen the factors.
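The sampling of confirmation settings near the edges of the design space (step 3 above) can be illustrated with the following Java sketch. The factor name rho and its range are assumptions used only for illustration.

    import java.util.Random;

    // Sketch of drawing a confirmation setting for one numeric factor, restricted to
    // within a given fraction (here 10%) of its low or high end.
    public class EdgeSamplingSketch {

        static double sampleNearEdge(double low, double high, double fraction, Random rng) {
            double band = (high - low) * fraction;
            // choose the low band or the high band with equal probability
            if (rng.nextBoolean()) {
                return low + rng.nextDouble() * band;
            }
            return high - rng.nextDouble() * band;
        }

        public static void main(String[] args) {
            Random rng = new Random();
            // e.g. a pheromone parameter rho varied over (0.1, 0.9) -- illustrative range only
            for (int i = 0; i < 5; i++) {
                System.out.printf("rho = %.3f%n", sampleNearEdge(0.1, 0.9, 0.10, rng));
            }
        }
    }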

6.3.8 Analysis

The following steps for analysing a screening model must be repeated for each response independently.

1. Rank most important factors. The terms in the model should be ranked according to their ANOVA Sum of Squares values (Section A.4 on page 220). These ranks can then be studied alongside the corresponding p values for the model terms. The most important model terms will have large sum of squares values and a p value that shows they are statistically significant.

2. Screen factors. Factors that are not statistically significant and have a relatively low ranking can be removed immediately from the subsequent experiments as they do not have an important influence on the response. Furthermore, factors that are statistically significant but practically insignificant can also be considered for removal from subsequent experiments. The extent to which we screen out factors will depend on the experimental resources available for subsequent experiments, our knowledge of the metaheuristic’s tuning parameters and our confidence in the screening experiment’s recommendations. Note also that if a factor is important for even one response then it must remain in the subsequent experiments. Subsequent experiments combine all responses into a single model of performance.

3. Model graphs. Graphs of the response for each factor should be examined. This serves two purposes. It confirms our decision to screen factors in the previous step. Furthermore, it shows us whether statistically significant and highly ranked factors actually have a practically significant effect on the response.

At this stage, the relative importance of each term in the model has been assessed. A linear relationship between the factors and the response has been assumed. The screening design has therefore yielded a planar relationship between the factors and the response. The relationship between the factors and response is often of a higher order than planar. A higher order relationship will exhibit some curvature. It is therefore important to determine whether such curvature exists so that we can establish the need to use a more sophisticated response surface and associated experiment design in subsequent experiments.

6.3.9 Check for curvature

Adding centre points to a design allows us to determine whether the response surface is not planar but actually contains some type of curvature. A centre point is simply a treatment that is a combination of all factors’ values at the centre of the factors’ ranges (centre points must therefore be replicated for each level of each categoric factor, since the ‘middle’ of a categoric factor does not make any sense). The average response value from the actual data at the centre points is compared to the estimated value of the centre point that comes from averaging all the factorial points. If there is curvature of the response surface in the region of the design, the actual centre point value will be either higher or lower than predicted by the factorial design points. If no curvature exists then the screening experiment’s planar models should be sufficient to predict responses. This should first be confirmed as per Section 6.3.7 on page 114 but with the difference that treatments are drawn from throughout the design space rather than just from the edges. If the analysis with centre points reveals the possibility of curvature in the actual data then further experiments are required to predict the responses throughout the whole design space. These experiments are the subject of the next stage in the methodology.
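The following small sketch illustrates the basic comparison behind the curvature check: the mean response at the centre points against the mean of the factorial points. The data values are invented purely for illustration; a formal significance test of this difference is typically provided by the statistical software.

    import java.util.Arrays;

    // Sketch of the basic curvature check: compare the average response observed at the
    // centre points with the average of the factorial (corner) points.
    public class CurvatureCheckSketch {

        static double mean(double[] values) {
            return Arrays.stream(values).average().orElse(Double.NaN);
        }

        public static void main(String[] args) {
            double[] factorialPoints = {2.1, 2.4, 3.9, 4.2, 2.0, 2.5, 4.0, 4.3}; // illustrative data
            double[] centrePoints = {2.2, 2.3, 2.1};                              // illustrative data

            double curvatureEstimate = mean(centrePoints) - mean(factorialPoints);
            // A value near zero suggests the planar model is adequate; a large difference
            // suggests curvature, so the design should be augmented towards a response surface.
            System.out.printf("centre mean - factorial mean = %.3f%n", curvatureEstimate);
        }
    }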

6.4 Stage 2: Modelling

The response surface methodology is similar to the screening methodology of Section 6.3.4 on page 112 in many regards. It involves a fractional factorial type design, analysed with ANOVA. Its data are collected following the good practices for experiments with heuristics developed in Chapter 3 and illustrated in the screening methodology of Section 6.3 on page 110. The most significant difference in the response surface methodology is in the experiment design it requires. The screening design used a simple linear model to determine whether there was a significant difference between high and low levels of the factors. For this purpose, only the edges of the design space were of interest and only these were examined when confirming the model (see Section 6.3.7 on page 114). A Response Surface Model, by contrast, requires a more sophisticated design since it attempts to build a model of the factor-problem-performance relationship across the whole design space. The experiment designs and methodology presented here were first introduced to the ACO field by the author [104, 106].

6.4.1 Motivation

A detailed motivation for performance modelling has already been presented in Chapter 1. Modelling heuristic performance is a sensible way to explore the vast design space of tuning parameter settings and their relationship to problem instances and heuristic performance. A good model can be used to quickly recommend tuning parameter settings that maximise performance given an instance with particular characteristics. Performance models can also provide visualisations of the robustness of parameter settings in terms of changing problem characteristics.

6.4.2 Research questions

This parameter tuning study addresses the following research questions.

• Screening. Which tuning parameters and which problem characteristics have no significant effect on the performance of the metaheuristic in terms of solution quality and solution time? If a screening study has already been conducted correctly and the tuning study is performed on the screened parameter set, one would expect all remaining parameters to have a significant effect on performance. The screening study is a more efficient method of answering screening questions but the tuning study can nonetheless identify further unimportant parameters that may have been missed in the screening study.

• Ranking. What is the relative importance of the most important tuning parameters and problem characteristics?

• Sufficient order model. What is the minimum order model that satisfactorily models performance? We know from the screening study whether or not a linear model is sufficient. Tuning studies offer more advanced fit analyses that recommend the minimum order model that is required for the data.

• Relationship between tuning, problems and performance. What is the relationship between tuning parameters, problem characteristics and the responses of solution quality and solution time? A tuning study yields a mathematical equation describing this relationship for each response.

• Tuned parameter settings. What is a good set of tuning parameter settings given an instance with certain characteristics? Are these settings better than what can be achieved with randomly chosen settings? Are these settings better than alternative settings from the literature?

The first two research questions are identical to questions in the screening study of Chapter 8. The tuning study does not obviate the need for a screening study. A screening study is a simpler and more efficient method for answering these particular research questions.

6.4.3 Experiment Design

Building a response surface requires a particular type of experiment design. There are several alternatives available and these are discussed in Section A.3.4 on page 218. In this thesis, the Face-Centred Composite (FCC) design is used for all models. The FCC is most appropriate for situations where design factors are restricted to be within a certain range. This is the case with many metaheuristic tuning parameters. For example, in ACO, the pheromone related parameter ρ must be within the range 0 < ρ < 1.
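For concreteness, the sketch below enumerates the points of a Face-Centred Composite design for two coded factors: the factorial corners, the axial points placed on the faces of the design region (so no factor is pushed outside its range), and the centre point. This is illustrative only; the actual designs in the case studies involve more factors.

    // Sketch of the points in a Face-Centred Composite (FCC) design for two coded factors.
    // Coded value -1/+1 maps to a factor's low/high setting; 0 is the centre of its range.
    public class FaceCentredCompositeSketch {

        public static void main(String[] args) {
            int[][] corners = {{-1, -1}, {+1, -1}, {-1, +1}, {+1, +1}};
            int[][] axial = {{-1, 0}, {+1, 0}, {0, -1}, {0, +1}}; // on the faces, not outside them
            int[][] centre = {{0, 0}};

            System.out.println("Factorial points:");
            print(corners);
            System.out.println("Axial (face-centre) points:");
            print(axial);
            System.out.println("Centre point (replicated in practice):");
            print(centre);
        }

        static void print(int[][] points) {
            for (int[] p : points) {
                System.out.printf("  (%2d, %2d)%n", p[0], p[1]);
            }
        }
    }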

6.4.4 Method

The method for response surface modelling is detailed below. As already noted, it is very similar to the screening methodology of Section 6.3 on page 110.

1. Response variables. Identify the response(s) to be measured. For experiments with metaheuristics, these must reflect some measure of solution quality and some measure of solution time.

2. Design factors and factor ranges. Choose the algorithm tuning parameters and problem characteristics whose relationship to performance metrics will be modelled by the response surfaces. Note that problem characteristics must be included in the model so that we can correctly model their relationship to the tuning parameters and the heuristic performance. If the response surface design has been augmented from a previous screening design by adding star points then the factors and factor ranges are already determined.

3. Held-constant factors and values. Because of limited resources, it may not be possible to experiment with all potential design factors. A pilot study may have revealed that some factors have a negligible effect on performance. These factors must be held at a constant value for the duration of the experiments.

4. Screened out factors. Factors that have been screened out in the previous screening study can be set to any values with impunity.

5. Experiment design. For the given number of factors, choose an appropriate fractional factorial design for the factorial part of the Face-Centred Composite design. Where there are several available designs of resolution IV, check whether the design requiring a smaller number of treatments has a more satisfactory aliasing structure. The choice of the FCC design determines the location of the design’s star points and consequently the levels of the factors.

6. Run Order. This is now the minimum design with a single replicate. Generate a random run order for the design.

7. Gather Data. Collect data according to the treatments and their run order.

8. Significance level, effect size and replicates. Choose an appropriate significance (alpha) level for the study. Choose an appropriate effect size that the study must detect. For the study’s chosen significance level (5% in this thesis), examine the design’s power to detect the chosen effect size. If the power is not greater than 80%, add replicates to the design and return to the Run Order step to gather the extra data. Sufficient data has been gathered when power reaches 80%, by convention.

At this stage, we have sufficient data to build a response surface model of each response.

6.4.5 Build a Model

Building the models for a response surface differs in several ways from building a model for screening. When screening, a model of each response was analysed separately. This meant that outliers removed from one response’s model would not necessarily be removed from another response’s model. With the response surface models however, we will ultimately be combining all the models’ recommendations into a simultaneous tuning of all the responses. It is therefore more appropriate that outlier runs deleted from the analysis of one response will remain deleted from the analysis of other responses.

When screening, it was sufficient to use two levels of each factor and so we were limited to a linear/planar model of the responses. The response surface model is more complicated in that it can model higher-order surfaces. The first step is therefore to determine the most appropriate model for the data.

1. Model Fitting. The highest order model that can be generated from the FCC design is quadratic. All lower order models (linear and 2-factor interaction) are generated and then assessed on two aspects: their significance and their R-squared values.

   (a) Begin with the lowest order model, the linear model. If the model is not significant, it is removed from consideration and the next highest order model is examined for significance.

   (b) Examine the adjusted R-squared and predicted R-squared values. These should be within 0.2 of one another and as close to 1 as possible. If the R-squared values are not satisfactory, the model is removed from consideration and the next highest order model is examined.

   If models of a higher order than quadratic are required then an alternative design to the FCC will have to be used.

2. Find important effects. A stepwise linear regression is performed on the chosen model to estimate its coefficients. Note that this is different from the screening stage of Section 6.3 on page 110. Here, terms are being removed from the model to give the most parsimonious model possible. Screening, by contrast, determines which factors (and all associated model terms) do not even take part in the experiment design. Of course, if screening has been accurate then few main effect terms should be removed from the response surface model by stepwise linear regression.

3. Diagnosis. The usual diagnostics of the linear regression model are performed. If the model passes these tests then its proposed coefficients can be accepted.

4. Response Transformation. The diagnostics may reveal that a transformation of the response is required. In this case, perform the transformation and return to step 2.

5. Outliers. If the transformed response is still failing the diagnostics, it may be that there are outliers in the data. These should be identified and removed before returning to step 2.

6. Model significance. Check that the overall model is significant.

7. Model Fit. Check that the predicted R-squared value is in reasonable agreement with the adjusted R-squared value and that both of these are close to 1. Check that the model has a signal-to-noise ratio greater than about 4.

As with the screening procedure, it is good practice to independently confirm the models’ accuracy on some real data as recommended by the DOE approach.



6.4.6 Confirmation

The methodology for confirming our response surface models differs in one important way from the previous method for confirming our screening models (Section 6.3.7 on page 114). The randomly chosen instances and parameter settings are now drawn from across the entire design space rather than being limited to the edges of the design space. The methodology is the same as that of Section 6.3.7 on page 114 in all other aspects and so is not repeated here.

6.4.7 Analysis

The following analysis is performed for each model separately.

1. Rank most important factors. The terms in the model equations should be ranked according to their ANOVA F values. These ranks can then be studied alongside the corresponding p values for the model terms. The rankings should be in approximate agreement with the previous screening study.

2. Model graphs. Graphs of the responses for each factor should be examined. This shows us whether statistically significant and highly ranked factors actually have a practically significant effect on the response. It also shows the likely location of optimal response values. Surface plots of pairs of factors are particularly insightful.

At this stage, the experimenter has a model of each of the responses over the entire design space and these models have been confirmed to be accurate predictors of the actual metaheuristic. It is now possible to use this model for tuning the actual metaheuristic.

6.5 Stage 3: Tuning

Screening has provided us with a reduced set of the most important problem characteristics and the most important algorithm tuning parameters. The Response Surface Model has given us mathematical functions relating these characteristics and tuning parameters to each of the responses of interest. The accuracy of these models’ predictions has been methodically confirmed and we are satisfied that the models meet our quality criteria (see Section 6.4.6). Since the Response Surface Models are mathematical functions of the factors, it is possible to numerically optimise the responses by varying the factors. This allows us to produce the most efficient process. There are several possible optimisation goals. We may wish to achieve a response with a given value (target value, maximum or minimum). Alternatively, we may wish that the response always falls within a given range (for example, relative error less than 10%). More usually, we may wish to optimise several responses because of the heuristic compromise. We have seen in the literature review of Chapter 4 on page 77 that such tuning rarely deals with both solution quality and solution time simultaneously and so neglects the heuristic compromise. We introduce here a technique from Design Of Experiments that allows multiple response models to be simultaneously tuned.

6.5.1 Desirability Functions

The multiple responses are expressed in terms of desirability functions [40] (see also the NIST/SEMATECH e-Handbook of Statistical Methods, available at http://www.itl.nist.gov/div898/handbook/pri/section5/pri5322.htm). Desirability functions are described in more detail in Section A.3.6 on page 220. The overall desirability for all responses is the geometric mean of the individual desirabilities. The well-established Nelder-Mead downhill simplex [98, p. 326] is then applied to the response surface model’s equations such that the desirability is maximised. We specify the optimisation in this research with the dual goals of minimising both solution error and solution time, while allowing all algorithm-related factors to vary within their design ranges. Equal priority is given to the dual goals. This does not preclude running an optimisation that favours solution quality or favours solution time, as determined by the tuning application scenario.

It is important to note that optimisation of desirability does not necessarily lead to parameter recommendations that yield optimal metaheuristic performance. Desirability functions are a geometric mean of the desirability of each individual response, and a response surface model is an interpolation of the responses from various points in the design space. There is therefore no guarantee that the recommended parameters result in optimal performance; they only result in tuned performance that is better than performance in most of the design space. One must be careful to distinguish between optimised desirability and tuned parameters.

Problem characteristics are also factors in the model since we want to establish the relationship between these problem characteristics, the algorithm parameters and the performance responses. It does not make sense to include these problem characteristic factors in the optimisation: the optimisation process would naturally select the easiest problems as part of its solution. We therefore needed to choose fixed combinations of the problem characteristics and perform the numerical optimisations for each of these combinations. A sensible choice of such combinations is a three-level factorial of the characteristics. A more detailed description of these methods follows.
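As an illustration of how the individual responses combine, the following sketch computes a simple ‘smaller is better’ desirability for a predicted relative error and a predicted solution time and combines them by a geometric mean, as described above. The bounds and response values are assumptions chosen purely for illustration; the actual desirability functions are those described in Section A.3.6.

    // Sketch of Derringer-style desirability for two minimisation goals (relative error
    // and solution time) combined by a geometric mean with equal priority.
    public class DesirabilitySketch {

        // Desirability for a "smaller is better" response: 1 at/below the target,
        // 0 at/above the upper acceptable limit, linear in between.
        static double minimiseDesirability(double y, double target, double upper) {
            if (y <= target) return 1.0;
            if (y >= upper) return 0.0;
            return (upper - y) / (upper - target);
        }

        public static void main(String[] args) {
            double relativeError = 2.5;  // percent, as predicted by the quality model (illustrative)
            double timeSeconds = 12.0;   // as predicted by the time model (illustrative)

            double dQuality = minimiseDesirability(relativeError, 0.0, 10.0);
            double dTime = minimiseDesirability(timeSeconds, 0.0, 60.0);

            // Overall desirability is the geometric mean of the individual desirabilities.
            double overall = Math.sqrt(dQuality * dTime);
            System.out.printf("d_quality=%.3f d_time=%.3f overall=%.3f%n",
                    dQuality, dTime, overall);
        }
    }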

6.5.2 Method

These are the steps used to simultaneously optimise the desirability of the solution quality and solution time responses.

1. Combinations of problem characteristics. A three-level factorial combination of the problem characteristics is created. In the case of two characteristics, this creates 9 combinations of problem characteristics. These combinations are illustrated in Table 6.1 below (a small sketch of enumerating these combinations follows this section).

2. Numerical optimisation. For each of these combinations, a numerical optimisation of desirability is performed using the Nelder-Mead Simplex with the following settings:

   • Cycles per optimisation is 30.
   • Simplex fraction is 0.1.
   • Design points are used as the starting points.
   • The maximum number of solutions is 25.

   The optimisation goal is to minimise the solution error response and the solution run time response. These goals can be given different priorities if, for example, quality is deemed more important than time. The problem characteristics are fixed at the values corresponding to the three-level factorial combination.

3. Choose best solution. When the optimisation has completed, the solution with the highest desirability is taken and the others are discarded. Note that there may be several solutions of very similar desirability but with differing factor settings. This is due to the nature of the multiobjective optimisation and the possibility of many regions of interest (Section A.2 on page 213).

4. Round off integer-valued parameters. If a non-integer value of an exponent tuning parameter was recommended, this is rounded to the nearest integer value for the reasons explained in Section 5.2.1 on page 99.

Standard Order    Characteristic A    Characteristic B
1                 Level 1 of A        Level 1 of B
2                 Level 2 of A        Level 1 of B
3                 Level 3 of A        Level 1 of B
4                 Level 1 of A        Level 2 of B
5                 Level 2 of A        Level 2 of B
6                 Level 3 of A        Level 2 of B
7                 Level 1 of A        Level 3 of B
8                 Level 2 of A        Level 3 of B
9                 Level 3 of A        Level 3 of B

Table 6.1: A full factorial combination of two problem characteristics with three levels of each characteristic. This results in nine treatments.

This optimisation procedure has given us recommended parameter settings for 9 locations covering the problem space. Of course, a user requiring more refined parameter recommendations will have to run this optimisation procedure for the problem characteristics of the scenario to hand. Optimisation of desirability is not expensive but requires access to appropriate tools. Alternatively, an interpolation of the recommendations across the design space is possible. The experimenter can now evaluate the tuning recommendations.
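The sketch below illustrates the enumeration of the nine fixed problem-characteristic combinations and the rounding of an exponent parameter recommendation to the nearest integer. The characteristic levels shown are assumptions for illustration (loosely based on the instance ranges of Section 6.7.1), and the print statement stands in for the desirability optimisation itself.

    // Sketch of the nine problem-characteristic combinations of Table 6.1 and of rounding
    // a recommended exponent parameter to an integer. Names and levels are illustrative.
    public class TuningCombinationsSketch {

        public static void main(String[] args) {
            double[] sizes = {300, 400, 500};   // characteristic A levels (illustrative)
            double[] stdDevs = {10, 40, 70};    // characteristic B levels (illustrative)

            int standardOrder = 1;
            for (double stdDev : stdDevs) {
                for (double size : sizes) {
                    System.out.printf("%d: size=%.0f, stdDev=%.0f -> run desirability optimisation here%n",
                            standardOrder++, size, stdDev);
                }
            }

            double recommendedBeta = 2.37;                    // e.g. a non-integer recommendation
            int betaUsed = (int) Math.round(recommendedBeta); // exponents restricted to integers
            System.out.println("beta used: " + betaUsed);
        }
    }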



6.6 Stage 4: Evaluation

There are two main aspects to the evaluation of recommended tuning parameter settings. Firstly, we wish to assess how well the presented method’s recommended parameter settings perform in comparison to alternative recommended settings. Secondly, we wish to determine how robust the recommended settings are when used with other problem characteristics.

6.6.1 Comparison

We wish to determine by how much the results obtained with the tuned parameter settings are better than the results obtained with randomly chosen parameter values and the results obtained with alternative parameter settings. We compare with randomly chosen parameters to demonstrate that the effort involved in the methodology does indeed offer an improvement in performance over not using any method at all. We compare with alternative parameter settings to see whether the methodology is competitive with the literature or supports values recommended in the literature. Such alternative settings may include those recommended by others or those determined using some other tuning methodology. In this research, all results using DOE tuned parameters are compared to the results obtained with the parameter settings recommended in the literature [47].

The methodology is described below. It is similar in terms of set-up to the previous confirmation experiments for the screening models (Section 6.3.7 on page 114) and response surface models (Section 6.4.6 on page 121). It must be repeated for every combination of problem characteristics used to cover the design space (see Section 6.5.2 on page 122).

1. Generate problem instances. A number of problem instances are created at randomly chosen locations in the problem space.

2. Randomly choose combinations of tuning parameters. For each instance, a number of sets of parameter settings are chosen randomly within the design space.

3. Use models to choose parameter settings. For each instance, a parameter setting is chosen using the desirability optimisation of the response surface model. In this research, two models were built: a relative error vs time model and an ADA vs time model. There are therefore two sets of parameter recommendations, one for each model.

4. Solve instances. The instances are solved using the various parameter settings.

5. Plot. A scatter plot is used to illustrate the differences, if any, between the solutions obtained with the various parameter settings.

It is important once again to draw the reader’s attention to the heuristic compromise. We may find that recommended parameter settings offer a similar solution quality to the DOE method’s recommendations. However, the DOE methodology is constrained to also find settings that offer good solution times. Only when both responses are examined do we have a realistic comparison of the performance of the parameter settings.

6.6.2 Robustness

At this point, we have tuned parameter settings for each member of a set of combinations of problem characteristics. This set was chosen so that it covers the whole space of possible problems in some sensible fashion. We also have the model equations of the responses that allow us to calculate tuned parameter settings for new combinations of problem instance characteristics. It may not always be convenient or necessary to perform these calculations. A technique from Design Of Experiments called overlay plots is adapted in this thesis so that statements can be made about the robustness of tuned parameter settings across a range of different problem instance characteristics [106].

Overlay plots are a visual representation of regions of the design space in which the responses fall within certain bounds. They can be considered as a more relaxed optimisation than that of tuning, where the experimenter was looking for maxima and minima. For example, the experimenter might query the response models for ranges of some tuning parameters within which the solution time response is less than a certain value. Overlay plots are very useful in the context of robustness of tuned parameters. Recall that the researcher has several sets of tuned parameter values for various combinations of problem characteristics. For any one of these combinations of problem characteristics, the associated tuned parameter settings should result in maxima or minima of the responses. Now one may also wonder whether adjacent combinations of problem characteristics in the problem space could also be quite well solved with the same tuned parameter settings. Overlay plots allow one to visualise the answer to this question.

Figure 6.3 illustrates an overlay plot. The horizontal and vertical axes represent values of two problem instance characteristics. The tuned parameter settings are listed to the left. The white area represents all instance combinations that are solvable with these parameter settings within given relaxed bounds on the performance responses. Clearly, this area will be larger for more robust tuned parameter settings. Overlay plots are a powerful tool that is only available when one has a model of metaheuristic performance across the whole design space. They are backed by the same rigour and statistical confidence as all DOE methods.

Figure 6.3: A sample overlay plot. The horizontal and vertical axes represent two problem characteristics. The tuning parameter settings are listed on the left. Two constraints on solution have been specified: the time must be less than 5 seconds and the relative error must be less than 1%. The white area is the region of the problem space that can be solved within these constraints on time and solution quality.

6.7 Common case study issues

Thus far, this chapter has detailed the methodologies that the thesis has adapted from DOE and will apply to the parameter tuning problem. Chapter 3 highlighted how even within a good methodology there are many experimental design decisions that must be taken. This section summarises those common issues and decisions taken across all subsequent case studies. They are reported here to avoid repetition in the case studies but a stand-alone case study should report all of the following for completeness and to aid reproducibility.

6.7.1 Instances

All TSP instances were of the symmetric type. In the Euclidean TSP, cities are points with integer coordinates in the two-dimensional plane. For two cities at coordinates $(x_1, y_1)$ and $(x_2, y_2)$, the distance between the cities is computed according to the Euclidean distance $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. However this definition of distance must be modified slightly. In the given form, this equation produces irrational numbers that can require infinite precision to be described correctly. This causes problems when comparing tour lengths produced by solution techniques. In all problems encountered in this thesis, distance is calculated using $(\mathrm{int})\left(\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} + 0.5\right)$, i.e. the Euclidean distance rounded to the nearest integer. This is the so-called EUC_2D distance type as specified in the online TSP benchmark library TSPLIB [102] and as used in the original ACOTSP code (Section 5.2 on page 97).

The TSP problem instances ranged in size from 300 cities to 500 cities with cost matrix standard deviation ranging from 10 to 70. All instances had a mean of 100. The same instances were used for each replicate of a design point. Instances were generated with a version of the publicly available portmgen problem generator from the DIMACS challenge [58] as described in Section 5.1 on page 95.
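A minimal sketch of this EUC_2D distance calculation in Java is shown below; it is a direct transcription of the formula above rather than an excerpt from the ACOTSP code.

    // Sketch of the TSPLIB EUC_2D distance: Euclidean distance rounded to the nearest integer.
    public class Euc2dDistanceSketch {

        static int euc2d(int x1, int y1, int x2, int y2) {
            double dx = x1 - x2;
            double dy = y1 - y2;
            return (int) (Math.sqrt(dx * dx + dy * dy) + 0.5);
        }

        public static void main(String[] args) {
            System.out.println(euc2d(0, 0, 3, 4));     // prints 5
            System.out.println(euc2d(10, 20, 13, 24)); // prints 5
        }
    }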



6.7.2 Stopping criterion

All experiments except those in Chapter 7 were halted after a stagnation stopping criterion. Stagnation was defined as a fixed number of iterations in which no improvement in solution value had been obtained. Responses were measured at several levels of stagnation during an experiment run: 50, 100, 150, 200 and 250 iterations. This facilitated examining the data at alternative stagnation levels to ensure that conclusions were the same regardless of stagnation level.
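The following sketch shows one way such a stagnation rule can be implemented; it is illustrative only and not the thesis’s actual implementation.

    // Sketch of a stagnation stopping rule: stop once a fixed number of consecutive
    // iterations has passed with no improvement in the best solution found so far.
    public class StagnationSketch {

        private int iterationsWithoutImprovement = 0;
        private double bestTourLength = Double.POSITIVE_INFINITY;

        // Call once per iteration with the best tour length found in that iteration.
        // Returns true when the stagnation limit has been reached.
        boolean stagnated(double iterationBest, int stagnationLimit) {
            if (iterationBest < bestTourLength) {
                bestTourLength = iterationBest;
                iterationsWithoutImprovement = 0;
            } else {
                iterationsWithoutImprovement++;
            }
            return iterationsWithoutImprovement >= stagnationLimit;
        }

        public static void main(String[] args) {
            StagnationSketch stop = new StagnationSketch();
            double[] iterationBests = {105, 103, 103, 103, 103, 103}; // illustrative values
            for (double best : iterationBests) {
                System.out.println(stop.stagnated(best, 3));
            }
        }
    }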

6.7.3 Response variables

Three response variables were measured. The time in seconds to the end of an experiment reflects the solution time. The adjusted differential approximation (ADA) and relative error from a known optimum reflect the solution quality. These were described in Section 3.10.2 on page 69. Concorde [5] was used to calculate the optima of the instances. Expected random solution values of the instances, as used in the ADA calculation, were generated by randomly permuting the order of cities in a TSP instance 200 times and taking the average tour length from these permutations.
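The sketch below illustrates two ingredients described above: the relative error from a known optimum, and the expected random solution value estimated by averaging the tour length of random permutations of the cities. The tiny instance, the assumed optimum and the helper names are illustrative only; the ADA calculation of Section 3.10.2 additionally combines the known optimum with this random baseline.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch of the relative error response and of the random-permutation baseline used in ADA.
    public class ResponseSketch {

        // Relative error (in percent) of a tour from a known optimum.
        static double relativeErrorPercent(double tourLength, double optimum) {
            return 100.0 * (tourLength - optimum) / optimum;
        }

        // Expected random solution value: average EUC_2D tour length over random permutations.
        static double randomBaseline(int[][] coords, int permutations) {
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < coords.length; i++) {
                order.add(i);
            }
            double total = 0.0;
            for (int p = 0; p < permutations; p++) {
                Collections.shuffle(order);
                total += tourLength(coords, order);
            }
            return total / permutations;
        }

        static double tourLength(int[][] coords, List<Integer> order) {
            double length = 0.0;
            for (int i = 0; i < order.size(); i++) {
                int[] a = coords[order.get(i)];
                int[] b = coords[order.get((i + 1) % order.size())];
                double dx = a[0] - b[0];
                double dy = a[1] - b[1];
                length += (int) (Math.sqrt(dx * dx + dy * dy) + 0.5); // EUC_2D rounding
            }
            return length;
        }

        public static void main(String[] args) {
            int[][] coords = {{0, 0}, {0, 5}, {5, 5}, {5, 0}, {2, 3}}; // illustrative instance
            System.out.printf("expected random tour length ~ %.1f%n", randomBaseline(coords, 200));
            // Illustrative tour length 10500 against an assumed optimum of 10000.
            System.out.printf("relative error: %.2f%%%n", relativeErrorPercent(10500, 10000));
        }
    }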

6.7.4 Replicates

The design points were replicated in a work-up procedure (Section A.6.1 on page 227) until a power of 80% was reached for detecting a given effect size with an alpha level of 5% for all responses. The 80% power and 5% significance level were chosen by convention. The size of effect that could feasibly be detected depended on the particular response and the particular experiment design.
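The work-up can be illustrated with a simplified one-way analogue using statsmodels (the thesis itself used Lenth's power calculator for the actual designs); the effect size here is Cohen's f and all names and numbers are illustrative.

```python
from statsmodels.stats.power import FTestAnovaPower

def replicates_for_power(effect_size, k_groups, runs_per_replicate,
                         target_power=0.80, alpha=0.05, max_replicates=50):
    """Work up the number of replicates until the estimated power reaches
    the target (simplified one-way ANOVA analogue of the work-up procedure)."""
    analysis = FTestAnovaPower()
    power = 0.0
    for replicates in range(2, max_replicates + 1):
        nobs = replicates * runs_per_replicate
        power = analysis.power(effect_size=effect_size, nobs=nobs,
                               alpha=alpha, k_groups=k_groups)
        if power >= target_power:
            return replicates, power
    return max_replicates, power

# Example: Cohen's f of 0.2, 5 factor levels, 10 runs per replicate.
print(replicates_for_power(effect_size=0.2, k_groups=5, runs_per_replicate=10))
```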

6.7.5 Benchmarks

All experimental machines were benchmarked as per the DIMACS benchmarking procedure described in Chapter 5. The results of this benchmarking are presented in Section 5.3 on page 100. These benchmarks should be used when scaling CPU times in future research.

6.7.6 Factors, levels and ranges

Held-Constant Factors There are several held-constant factors common to all case studies. Local search, a technique typically used in combination with ACO heuristics, was omitted. There are two reasons for this omission. Firstly, there is a large number of local search alternatives from which to choose, and choosing one would have restricted the thesis's conclusions to a particular local search implementation. Secondly, when local search is used, the overwhelming contribution to ACO solution quality comes from the local search; including it would therefore mask the effects of the tuning parameters and defeat the thesis aim of evaluating and demonstrating the tuning of a heuristic with a large number of parameters. All instances had a cost matrix mean of 100.

The computation limit parameter (Section 2.4.8 on page 44) was fixed so that computation was limited to the candidate list length, as this resulted in significantly lower solution times.

Nuisance Factors A limitation on the available computational resources necessitated running experiments across a variety of machines with slightly different specifications. There was no control over background processes running on these machines. Runs were executed in a randomised order across these machines to counteract any uncontrollable nuisance factors due to the background processes and differences in machine specification.
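A minimal sketch of such a randomised run order (illustrative; the treatment and machine names are invented):

```python
import random

def randomised_run_order(treatments, replicates, machines, seed=42):
    """Replicate each treatment, shuffle the complete run list, and assign runs
    round-robin to machines so that no treatment is tied to a particular machine
    or to a particular time slot."""
    runs = [t for t in treatments for _ in range(replicates)]
    rng = random.Random(seed)
    rng.shuffle(runs)
    return [(run, machines[i % len(machines)]) for i, run in enumerate(runs)]

schedule = randomised_run_order(["T1", "T2", "T3"], replicates=2,
                                machines=["machineA", "machineB"])
for run, machine in schedule:
    print(run, "->", machine)
```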

6.7.7 Outliers

This research used the approach of deleting outliers from an analysis until the analysis passed the usual ANOVA diagnostics (Section A.4.2 on page 221).

6.8 Chapter summary

This chapter has detailed the sequential experimentation methodology that is used in the rest of the thesis. The following topics were covered.

• Iterative experimentation. Experimentation for modelling any process is inevitably iterative. Metaheuristics are no different. The experimenter often knows little about the heuristic at the start of experimentation. Once data is gathered and understanding of the heuristic deepens, the original experiment design decisions may have to be revised. Any procedure for algorithm modelling must efficiently incorporate the iterative nature of experimentation.

• Sequential experimentation procedure. There is a well-established sequential experimentation procedure in traditional DOE. This thesis modifies that procedure so that it can be applied to metaheuristics.

• Choosing problem characteristics. Experimentation with metaheuristics is different from traditional DOE in many regards. In particular, both problem characteristics and algorithm tuning parameters must be incorporated into the model so that parameter settings for new instances can be selected. In the spirit of sequential experimentation, we would like to determine quickly which problem characteristics should be included in subsequent experiment designs rather than building a large all-encompassing and expensive design. The two-stage nested design was introduced in Section 6.2 on page 106 as a methodical way to generalise performance effects due to a problem characteristic despite the individual uniqueness of each problem instance.

• Performing confirmation runs. Confirmation runs are important when validating conclusions in traditional DOE. Many DOE analyses involve some subjective decisions. It is important then to confirm conclusions from the DOE procedures. This increases our confidence in our analysis: if the methods give good predictions of actual performance then they are sufficient for our engineering purposes. There are two major types of confirmation runs encountered here.

  1. Model confirmation runs are used to confirm that the ANOVA equation is a good prediction of the actual response.

  2. Tuning confirmation runs are used to confirm that the recommended tuning parameter settings are indeed as good as or better than alternatives.

• Factor Screening. Screening experiments allow us to rank the importance of each tuning parameter and problem characteristic, determining those that matter most to performance and those that have no statistically significant or practically significant effect. Screening experiment designs do not have to be of a high enough resolution to make statements about higher order interactions. They are a quick way to reduce the size of subsequent response surface designs.

• Response Surface Modelling. Response surface modelling determines the relationship between tuning parameters, problem characteristics and performance responses. A surface is built for each response separately.

• Desirability functions. Desirability functions are a DOE technique for combining multiple responses into a single response. We have introduced desirability functions to the ACO community [104, 106, 109]. This permits easy tuning of factors while observing the heuristic compromise of high solution quality in reasonable solution time.

• Tuning. Once all responses are expressed in a single desirability function, the multi-objective optimisation of all the responses in terms of the tuning parameters can be performed using well-established numerical optimisation methods. Optimisation can only be performed for fixed combinations of problem characteristics. Including problem characteristics in the optimisation would not make sense as the optimisation would select the easiest combination of problem characteristics. We therefore perform the optimisations at combinations of problem characteristics that span the design space. The recommendations from tuning for these combinations of problem characteristics can be interpolated across the design space. Alternatively, tuning can be performed for every specific combination of problem characteristics that the user is presented with.

• Overlay plots. Since tuned parameter settings relate to specific combinations of problem characteristics, it is useful to determine how robust those parameter settings are to changes in the original problem characteristic values. Overlay plots provide a useful visual tool for doing this. Given a bound on the responses, we can plot all the combinations of problem characteristics that the algorithm will solve within these bounds using the given tuned parameter settings.

The chapters in the next part of this thesis will apply this sequential experimentation methodology and its procedures to several ACO algorithms to test the thesis hypothesis.


Part IV

Case Studies


7 Case study: Determining whether a problem characteristic affects heuristic performance

This Chapter reports a case study on how to determine whether a problem characteristic affects the performance of a metaheuristic. The methodology for this case study was proposed and described in Chapter 6. The Chapter reports a new result for ACO. The standard deviation of TSP edge lengths has a significant effect on the difficulty of TSP instances for two ACO algorithms, ACS and MMAS¹. The results reported in this Chapter have been published [105, 110]. The results support conclusions from similar experiments with exact algorithms [26] and provide a detailed illustration of the application of techniques of which the research community is becoming increasingly aware [6].

¹ Recall from Section 6.7.6 on page 127 that edge length mean was held constant during all experiments. The results in terms of standard deviation of edge lengths can therefore be interpreted in terms of the scale-free ratio of standard deviation to mean edge length.

7.1 Motivation

An integral component of the construct solutions phase (Section 2.4.3 on page 37) of ACO algorithms depends on the relative lengths of edges in the TSP. These edge lengths are often stored in a TSP cost matrix. The probability with which an ant chooses the next node in its solution depends, among other things, on the relative length of the edges connecting to the nodes being considered (Equation (2.2) on page 39). Intuitively, it would seem that a high variance in the distribution of edge lengths would result in a different problem from one with a low variance in the distribution of edge lengths. This has already been investigated for exact algorithms for the TSP (Section 4.1 on page 77, [26]). It was shown that the standard deviation of edge lengths in a TSP instance has a significant effect on the problem difficulty for an exact algorithm. This leads us to suspect that the standard deviation of edge lengths may also have a significant effect on problem difficulty for the ACO heuristics.

This research is worthwhile for several reasons. Current research on ACO algorithms for the TSP does not report the problem characteristic of standard deviation of edge lengths. Assuming that such a problem characteristic affects performance, this means that for instances of the same or similar sizes, differences in performance are confounded (Section A.1.6 on page 212) with possible differences in standard deviation of edge lengths. Consequently, too much variation in performance is attributed to problem size and none to problem edge length standard deviation. Furthermore, in attempts to model ACO performance, all important problem characteristics must be incorporated into the model so that the relationship between problems, tuning parameters and performance can be understood. With this understanding, performance on a new instance can be satisfactorily predicted given the salient characteristics of the instance.
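For concreteness, the following sketch shows an ACS-style pseudo-random proportional construction step in which the heuristic information is the inverse edge length. It is a simplified illustration of the decision rule referred to above (Equation 2.2), not the ACOTSP implementation; the parameter names mirror those used later in the thesis.

```python
import random

def choose_next_city(current, unvisited, pheromone, dist,
                     alpha=1.0, beta=2.0, q0=0.9, rng=random):
    """ACS-style choice of the next city from `unvisited`.
    With probability q0, exploit the best-weighted option; otherwise sample a city
    with probability proportional to pheromone^alpha * (1/distance)^beta."""
    weights = {
        j: (pheromone[current][j] ** alpha) * ((1.0 / dist[current][j]) ** beta)
        for j in unvisited
    }
    if rng.random() < q0:
        return max(weights, key=weights.get)   # exploitation
    total = sum(weights.values())
    r = rng.random() * total                   # biased exploration
    acc = 0.0
    for j, w in weights.items():
        acc += w
        if acc >= r:
            return j
    return j                                   # fallback for rounding at the boundary
```

Because the weight of each candidate edge is a power of its inverse length, the spread of edge lengths in the cost matrix directly shapes how sharply the choice probabilities are concentrated, which motivates the research question below.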

7.2 Research question and hypothesis

The research question of this case study can be phrased as follows: Does the variability of edge lengths in the Travelling Salesperson Problem affect the difficulty of the problem for the ACO metaheuristic? This can be refined to the following research hypotheses, phrased in terms of either MMAS or ACS:

• Null Hypothesis H0: the standard deviation of edge lengths in TSP instances' cost matrices has no effect on the average quality of solutions produced by the algorithm.

• Alternative Hypothesis H1: the standard deviation of edge lengths in TSP instances' cost matrices affects the average quality of solutions produced by the algorithm.

7.3 Method

7.3.1 Response Variable

The response variable was solution quality, measured as per Section 6.7.2 on page 127. The response was measured after 1000, 2000, 3000, 4000 and 5000 iterations.

7.3.2 Instances

Instances were generated as per Section 6.7.1 on page 126. Standard deviation of the TSP cost matrix was varied across five levels: 10, 30, 50, 70 and 100. Three problem sizes, 300, 500 and 700, were used in the experiments. The same instances were used for the two algorithms and the same instance was used for replicates of a design point.

7.3.3 Factors, Levels and Ranges

Design Factors There were two design factors. The first was the standard deviation of edge lengths in an instance. This was a fixed factor, since its levels were set by the experimenter. Five levels, 10, 30, 50, 70 and 100, were used. The second factor was the individual instances with a given level of standard deviation of edge lengths. This was a random factor, since instance uniqueness was caused by the problem generator and so was not under the experimenter's direct control. Ten instances were created within each level of edge length standard deviation.

Held-constant Factors There were many common held-constant factors as per Section 6.7.6 on page 127. This study also contained further held-constant factors. Problem size was fixed for a given experiment. Sizes of 300, 500 and 700 were investigated. Two ACO algorithms were investigated: MMAS (Section 2.4.6 on page 42) and ACS (Section 2.4.5 on page 39). These algorithms were chosen because they are claimed to be the best performing of the ACO heuristics and because they are representative of the two main types of ACO heuristic: Ant System descendants and non-Ant System descendants respectively. The held-constant tuning parameter settings for the heuristics are listed in Table 7.1.

Parameter                Symbol     ACS          MMAS
Ants                     m          10           25
Pheromone emphasis       α          1            1
Heuristic emphasis       β          2            2
Candidate list length               15           20
Exploration threshold    q0         0.9          N/A
Pheromone decay          ρglobal    0.1          0.8
Pheromone decay          ρlocal     0.1          N/A
Solution construction               Sequential   Sequential

Table 7.1: Parameter settings for the problem difficulty experiments with the ACS and MMAS algorithms. Values are taken from the original publications [118, 47]. See Section 2.4 on page 34 for a description of these tuning parameters and the MMAS and ACS heuristics.

It is important to stress that this research's use of parameter values from the literature by no means implies support for such a 'folk' approach to parameter selection in general. Selecting parameter values as done here strengthens any conclusions in two ways. It shows that results were not contrived by searching for a unique set of tuning parameter values that would demonstrate the hypothesised effect. Furthermore, it makes the research conclusions applicable to all other research that has used these tuning parameter settings without the justification of a methodical tuning procedure, as proposed in this thesis. Recall from the motivation (Section 7.1 on page 133) that demonstrating an effect of edge length standard deviation on performance with even one set of tuning parameter values is sufficient to merit the factor's consideration in parameter tuning studies. The results from this research will therefore directly affect the results on parameter tuning in later chapters.

7.3.4 Experiment design, power and replicates

The experiment design is a two-stage nested design (Section 6.2.1 on page 108). The standard deviation of edge lengths is the parent factor. The individual instances factor is nested within this. The heuristics are probabilistic (Section 2.4), so repeated runs with identical inputs (instances, parameter settings, etc.) produce different results. All treatments were therefore replicated in a work-up procedure (Section A.6.1 on page 227) until a power of 80% was reached to detect an effect at the study's significance level of 1%. Power was calculated with Lenth's power calculator [76]. For all experiments, 10 replicates were sufficient to meet these requirements.

7.3.5 Performing the Experiment

Randomised run order. Available computational resources necessitated running experiments across a variety of similar machines. Runs were executed in a randomised order across these machines to counteract any uncontrollable nuisance factors. While such randomising is strictly not necessary when measuring a machine-independent response, it is good practice nonetheless.

Stopping Criterion. Experiments were halted after a fixed iteration stopping criterion (Section 3.13 on page 73). The number of fixed iterations was 5000. A potential problem with this approach is that the choice of combinatorial count can bias the results. Should we stop after 1000 iterations or 1001? Taking response measurements after 1000, 2000, 3000, 4000 and 5000 iterations mitigated this concern. The data were separately analysed at the 1000 and 5000 measurement points.

7.4 Analysis

7.4.1 ANOVA

The two-stage nested designs were analysed with the General Linear Model. Standard deviation was treated as a fixed factor, since we explicitly chose its levels, and instance was treated as a random factor. The technical reasons for this decision in the context of experiments with heuristics have recently been well explained in the metaheuristics literature [6]. To make the data amenable to statistical analysis, a transformation (as per Section A.4.3 on page 222) of the responses was required for each analysis. The transformations were either a log10, an inverse square root or a square root transformation. Outliers were deleted and the model building repeated until the models passed the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality, constant variance, time-dependent effects and leverage (Section A.4.2 on page 221). Figure 7.1 lists the number of data points deleted in each analysis.

Algorithm   Problem size   Relative Error   ADA
ACS         300            3                3
ACS         500            0                2
ACS         700            5                4
MMAS        300            3                7
MMAS        500            1                2
MMAS        700            4                2

Figure 7.1: Number of outliers deleted during the analysis of each problem difficulty experiment. Each experiment had a total of 500 data points.

The number of outliers deleted in each case is very small in comparison to the total number of 500 data points. Further details on these analyses and diagnostics are available in many textbooks [84] and in Appendix A.
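As an illustration of how such a two-stage nested analysis can be expressed, the sketch below fits a nested model on a small, made-up data set with statsmodels. Because the nested instance factor is random, the parent factor is tested against the nested mean square rather than the residual. This is a simplified stand-in for the General Linear Model analysis described above, not the software used in the thesis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Made-up data: 2 stdev levels x 3 instances each x 4 replicates.
rng = np.random.default_rng(1)
rows = []
for s in [10, 70]:
    for inst in range(3):
        instance_effect = rng.normal(0, 0.2)
        for _ in range(4):
            rows.append({"stdev": s, "instance": f"{s}_{inst}",
                         "quality": 0.05 * s + instance_effect + rng.normal(0, 0.3)})
df = pd.DataFrame(rows)

# Nested model: instance is nested within stdev (patsy's "/" operator).
model = smf.ols("quality ~ C(stdev) / C(instance)", data=df).fit()
table = anova_lm(model)
print(table)

# With a random nested factor, test the fixed parent factor against the
# nested mean square: MS(stdev) / MS(instance within stdev).
ms_parent = table.iloc[0]["sum_sq"] / table.iloc[0]["df"]   # C(stdev)
ms_nested = table.iloc[1]["sum_sq"] / table.iloc[1]["df"]   # C(stdev):C(instance)
print("F for stdev against nested instances:", ms_parent / ms_nested)
```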

7.5 Results

In all cases, the effect of standard deviation of edge length on solution quality was deemed statistically significant at the p < 0.01 level. The effect of individual instance was also deemed statistically significant at the p < 0.01 level, however, an examination of the data shows that this effect was not of practical significance. The following figures illustrate box plots of the data for the problem sizes of 300 and 700 for ACS and MMAS. The same trends were observed for problems of size 500 and so these plots are omitted. In each box-plot, the horizontal axis shows the five levels of standard deviation of the instances’ edge lengths at the five measurement points, 1000, 2000, 3000, 4000 and 5000 iterations. The vertical axis shows the solution quality response in its original scale. There is a separate plot for each algorithm and each problem size. Vertical axes have not been set to the same scale in the various plots. This is to discourage performance comparisons between plots because parameters had not been tuned to the different problem sizes. Outliers have been included in these plots. An examination of the plots shows that only standard deviation had a practically significant effect on solution quality. At each measurement point, there was a slight improvement in the response.

Figure 7.2: Relative Error response for ACS on problems of size 300, mean 100. (Box plots of the Relative Error response against the five levels of problemStDev at the 1000, 2000, 3000, 4000 and 5000 iteration measurement points.)

Figure 7.3: Relative Error response for ACS on problems of size 700, mean 100. (Box plots of the Relative Error response against the five levels of problemStDev at the 1000, 2000, 3000, 4000 and 5000 iteration measurement points.)

Figure 7.4: Relative Error response for MMAS on problems of size 300, mean 100. (Box plots of the Relative Error response against the five levels of problemStDev at the 1000, 2000, 3000, 4000 and 5000 iteration measurement points.)

Figure 7.5: Relative Error response for MMAS on problems of size 700, mean 100. (Box plots of the Relative Error response against the five levels of problemStDev at the 1000, 2000, 3000, 4000 and 5000 iteration measurement points.)

This is to be expected since the metaheuristic has a larger number of iterations in which to solve the problems. In all cases, problem instances with a lower standard deviation had a significantly lower relative error value than instances with a higher standard deviation. In all cases, there was a higher variability in the relative error response between instances with a higher standard deviation. The same conclusions were drawn from an analysis of the ADA quality response.

7.6 Conclusions

For ACS and MMAS, applied to TSP instances generated with log-normally distributed edge lengths such that all instances have a fixed cost matrix mean of 100 and a cost matrix standard deviation varying from 10 to 100:

1. a change in cost matrix standard deviation leads to a statistically and practically significant change in the difficulty of the problem instances for these algorithms.

2. there is no practically significant difference in difficulty between instances that have the same size, cost matrix mean and cost matrix standard deviation.

3. there is no practically significant difference between the difficulty measured after 1000 algorithm iterations and 5000 algorithm iterations.

Difficulty here means relative error from an optimum and the adjusted differential approximation. We therefore reject the null hypothesis of Section 7.2 on page 134.

7.6.1 Implications

These results are important for the ACO community for the following reasons:

• They demonstrate, in a rigorous, designed experiment fashion, that the quality of solution of an ACO TSP algorithm is affected by the standard deviation of the cost matrix.

• They demonstrate that cost matrix standard deviation must be considered as a factor when building predictive models of ACO TSP algorithm performance.

• They clearly show that performance analysis papers using ACO TSP algorithms must report instance cost matrix standard deviation as well as instance size, since two instances with the same size can differ significantly in difficulty.

• They motivate an improvement in benchmark libraries so that they provide a wider crossing of both instance size and instance cost matrix standard deviation. Plots of instances in the TSPLIB show that generated instances generally have the same shaped distribution of edge costs (Appendix B).


7.6.2 Assumptions and restrictions

For completeness and for clarity, we state that this case study did not examine the following issues.

• It did not examine clustered problem instances or grid problem instances. These are other common forms of TSP in which nodes appear in clusters and in a very structured grid pattern respectively. The conclusions should not be applied to other TSP types without a repetition of this case study.

• Algorithm performance was not being examined, since no claim was made about the suitability of the parameter values for the instances encountered. Rather, the aim was to demonstrate an effect for standard deviation and so argue that it should be included as a factor in experiments that do examine algorithm performance. These experiments are the subject of subsequent case studies in the thesis.

• We cannot make a direct comparison between algorithms since algorithms were not tuned methodically. That is, we are not entitled to say that ACS did better than MMAS on, say, instance X with a standard deviation of Y.

• We cannot make a direct comparison of the response values for different sized instances. Clearly, 3000 iterations explores a bigger fraction of the search space for 300-city problems than for 500-city problems. Such a comparison could be made if it were clear how to scale iterations with problem size. Such scaling is an open question.

7.7 Chapter summary

This Chapter presented a case study on determining whether a problem characteristic has an effect on the difficulty of problems for a given heuristic. Specifically, it investigated whether the standard deviation of edge lengths in TSP instances affects the quality of solutions produced by the MMAS and ACS heuristics. The result, that symmetric TSP edge length standard deviation affects problem difficulty, is a new result for the ACO community. This result is particularly important for approaches to modelling and analysing ACO performance. This will be illustrated in subsequent chapters of the thesis.

8 Case study: Screening Ant Colony System

This Chapter reports a case study on screening the factors affecting the performance of a heuristic. The methodology for this case study was described in Chapter 6. The particular heuristic studied is Ant Colony System for the Travelling Salesperson Problem (Section 2.4.6 on page 42). This chapter reports many new results for ACS. Established tuning parameters previously thought to affect performance are actually shown to have no effect at all. New tuning parameters that were thought to affect performance are investigated. A new TSP problem characteristic is shown to have a very strong effect on performance, confirming the results of Chapter 7. All analyses are conducted for two performance measures, quality of solution and solution time. This provides an accurate measure of the heuristic compromise that is rarely seen in the literature. Finally, it is shown that models of ACS performance must be of a higher order than linear. The results reported in this Chapter have been published in the literature [108, 107].

8.1 Method

8.1.1 Response Variables

Three responses were measured as per Section 6.7.2 on page 127. These responses were percentage relative error from a known optimum (henceforth referred to as Relative Error), adjusted differential approximation (henceforth referred to as ADA) and solution time (henceforth referred to as Time).

8.1.2 Factors, levels and ranges

Design Factors There were 12 design factors, 10 representing the ACS tuning parameters and 2 representing the TSP problem characteristics being investigated. The design factors and their high and low levels are summarised in Table 8.1. A description of the ACS tuning parameter factors was given in Section 2.4.9 on page 45. H-solutionConstruction and J-antPlacement could be considered as parameterised design features as mentioned in Section 6.3.1 on page 111. The antsFraction and nnFraction factors are expressed as a percentage of problem size. The factor ranges were chosen to encompass common parameter values encountered in the literature. The experience of other researchers using ACS has been that 10 ants is a good parameter setting. If this is the case, our experiments will confirm this recommendation methodically. Recall that this is a sequential experimentation methodology and it does not preclude incorporating the experimenter's prior knowledge into the chosen factor ranges.

Factor   Name                   Type        Low         High
A        alpha                  Numeric     1           13
B        beta                   Numeric     1           13
C        antsFraction           Numeric     1.00        110.00
D        nnFraction             Numeric     2.00        20.00
E        q0                     Numeric     0.01        0.99
F        rho                    Numeric     0.01        0.99
G        rhoLocal               Numeric     0.01        0.99
H        solutionConstruction   Categoric   parallel    sequential
J        antPlacement           Categoric   random      same
K        pheromoneUpdate        Categoric   bestSoFar   bestOfIteration
L        problemSize            Numeric     300         500
M        problemStDev           Numeric     10.00       70.00

Table 8.1: Design factors for the screening study with ACS. There are two problem characteristic factors (L-problemSize and M-problemStDev). The remaining 10 factors are ACS tuning parameters.

Held-Constant Factors The held constant factors are as per Section 6.7.6 on page 127.

8.1.3 Instances

All TSP instances were of the symmetric type and were created as per Section 6.7.1 on page 126. The TSP problem instances ranged in size from 300 cities to 500 cities with cost matrix standard deviation ranging from 10 to 70. All instances had a mean of 100. The same instances were used for each replicate of a design point.

8.1.4 Experiment design, power and replicates

The experiment design was a Resolution IV 2^(12−5) fractional factorial with 24 centre points. The number of replicates was 8, determined using the work-up procedure (Section A.6.1 on page 227) until a power of 80% was achieved for a significance level of 5% when detecting an effect size of 0.18 standard deviations. This yielded a total of 1048 runs. Figure 8.1 gives the descriptive statistics for the collected data and the actual effect size for each response.

Iterations   Statistic                       Time      Relative Error   ADA
50           Mean                            51.79     7.77             4.15
50           StDev                           103.90    10.36            4.06
50           Max                             1173.42   56.63            20.21
50           Min                             0.13      0.45             0.64
50           Effect size (0.18 st. dev.)     18.70     1.87             0.73
100          Mean                            106.48    7.55             4.06
100          StDev                           241.33    10.23            4.04
100          Max                             3211.93   56.37            20.21
100          Min                             0.20      0.45             0.64
100          Effect size (0.18 st. dev.)     43.44     1.84             0.73
150          Mean                            161.81    7.42             4.00
150          StDev                           391.29    10.13            4.01
150          Max                             5813.52   56.37            20.21
150          Min                             0.29      0.40             0.43
150          Effect size (0.18 st. dev.)     70.43     1.82             0.72
200          Mean                            238.37    7.35             3.95
200          StDev                           673.29    10.08            3.97
200          Max                             7828.67   56.37            20.21
200          Min                             0.37      0.40             0.43
200          Effect size (0.18 st. dev.)     121.19    1.81             0.71
250          Mean                            285.87    7.30             3.91
250          StDev                           786.47    10.02            3.94
250          Max                             9898.04   56.37            20.21
250          Min                             0.44      0.37             0.43
250          Effect size (0.18 st. dev.)     141.56    1.80             0.71

Figure 8.1: Descriptive statistics for the ACS screening experiment. Statistics are given at five stagnation points ranging from 50 to 250 iterations. The actual effect size equivalent to 0.18 standard deviations is also listed for each response variable.

8.1.5 Performing the experiment

Responses were measured at five stagnation levels. An examination of the descriptive statistics in Figure 8.1 verifies that the stagnation level did not have a large effect on the response values and therefore the conclusions after a 250-iteration stagnation should be the same as after lower iteration stagnations. The two solution quality responses show a small but practically insignificant decrease in solution error as the number of stagnation iterations increases. The Time response increases with increasing stagnation iterations because the experiments take longer to run. For all cases, the level of stagnation iterations has little practically significant effect on the three responses. ACS did not make any large improvements when allowed to run for longer. It is therefore sufficient to perform analyses at the 250-iteration stagnation level.

8.2 Analysis

8.2.1 ANOVA

Effects for each response model were selected using stepwise regression (Section A.4.1 on page 220) applied to a full two-factor interaction model with an alpha-out threshold of 0.10. Some terms removed by stepwise regression were added back into the final model to preserve hierarchy. To make the data amenable to statistical analysis, a transformation of the responses was required for each analysis. The transformation was a log10 for all three responses. Outliers were deleted and the model building repeated until the models passed the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality, constant variance, time-dependent effects and leverage (Section A.4.2 on page 221). 35 data points (3% of the total data) were removed when analysing Relative Error, 36 data points (3%) when analysing ADA and 32 data points (3%) when analysing Time.
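The backward-selection step can be sketched as follows (an illustrative stand-in for the stepwise regression used in the thesis, assuming numerically coded factors): terms whose p-value exceeds the alpha-out threshold are removed one at a time, except that a main effect is retained while any interaction containing it remains in the model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, response, terms, alpha_out=0.10):
    """Backward elimination on an OLS model, preserving hierarchy:
    a main effect is never removed while an interaction containing it is present."""
    terms = list(terms)
    while True:
        model = smf.ols(f"{response} ~ " + " + ".join(terms), data=df).fit()
        pvals = model.pvalues.drop("Intercept", errors="ignore")

        def protected(t):
            # A main effect is protected if any remaining interaction contains it.
            return ":" not in t and any(t in other.split(":")
                                        for other in terms if ":" in other)

        candidates = [t for t in terms if t in pvals.index
                      and pvals[t] > alpha_out and not protected(t)]
        if not candidates:
            return model, terms
        worst = max(candidates, key=lambda t: pvals[t])
        terms.remove(worst)

# Tiny synthetic example with coded (-1/+1) numeric factors.
rng = np.random.default_rng(0)
n = 64
data = pd.DataFrame({"beta": rng.choice([-1, 1], n), "q0": rng.choice([-1, 1], n)})
data["y"] = 2 * data["q0"] + 0.5 * data["beta"] * data["q0"] + rng.normal(0, 1, n)
model, kept = backward_eliminate(data, "y", ["beta", "q0", "beta:q0"])
print(kept)   # 'beta' is kept despite a weak main effect, because 'beta:q0' remains
```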

8.2.2 Confirmation

Confirmation experiments were run according to the methodology detailed in Section 6.3.7 on page 114 in order to confirm the accuracy of the ANOVA models. Recall that the general idea is to run the algorithm on new randomly chosen combinations of parameter settings and problem characteristics and compare performance to the ANOVA models’ predictions. The randomly chosen treatments produced actual algorithm responses with the descriptives listed in the following figure.

           Relative Error 250   ADA 250   Time 250
Max        85.59                39.37     16141
Min        0.91                 0.93      1
Mean       11.78                7.57      477
StDev      17.60                9.69      1869

Figure 8.2: Descriptive statistics for the confirmation of the ACS screening ANOVA. The response data is from runs of the actual algorithm on the randomly generated confirmation treatments with a stagnation stopping criterion of 250 iterations.

The large ranges of each response reinforce the motivation for correct parameter tuning as there is clearly a high cost in incorrectly tuned parameters. The next three figures illustrate the 95% prediction intervals (Section A.3.5 on page 219) and actual data for the three response models, Relative Error, ADA and Time respectively.

Figure 8.3: 95% Prediction intervals for the ACS screening of Relative Error. (Plot of the Relative Error 250 response and its 95% PI low/high bounds against the confirmation treatment number.)

Figure 8.4: 95% Prediction intervals for the ACS screening of ADA. (Plot of the ADA 250 response and its 95% PI low/high bounds against the confirmation treatment number.)

Figure 8.5: 95% Prediction intervals for the ACS screening of Time. (Plot of the Time 250 response, on a logarithmic scale, and its 95% PI low/high bounds against the confirmation treatment number.)

The two solution quality responses are well predicted by their models. The models match the trends of the actual data, successfully picking up the extremely low and extremely high response values which vary over a range of 85% for relative error and 39 for ADA. Both quality models tend to overestimate high values. The time response is also well predicted by its model. Time is subject to the nuisance factor of different experiment machines and is a more variable response due to the nature of the ACS algorithm. The extremely high times and extremely low times are predicted well for all the 50 treatments. This was achieved over a range of over 16000s (see Figure 8.1 on page 145). The three ANOVA models are therefore satisfactory predictors of the three ACS performance responses for factor values within 10% of the factor range limits listed in Section 8.1 on page 144.

8.3 Results

Figure 8.6 on the next page gives a summary of the Sum of Squares ranking and the ANOVA F and p values for the three responses. Only the main effects are listed. Those main effects that rank in the top 12 are highlighted in bold. Rankings are based on the full two factor interaction model of 78 terms, before stepwise regression was applied.


                         Relative Error                    ADA                               Time
Main Effect              Rank   F value    p value         Rank   F value   p value          Rank   F value    p value
A-alpha                  62     1.15       0.2840          61     1.38      0.2405           16     12.01      0.0006
B-beta                   3      2103.73    < 0.0001        3      2128.02   < 0.0001         13     16.70      < 0.0001
C-antsFraction           11     305.13     < 0.0001        10     310.74    < 0.0001         1      37566.17   < 0.0001
D-nnFraction             4      1739.78    < 0.0001        4      1744.51   < 0.0001         3      1680.96    < 0.0001
E-q0                     2      6920.35    < 0.0001        2      6955.10   < 0.0001         7      41.76      < 0.0001
F-rho                    45     4.63       0.0317          44     5.08      0.0244           56     0.94       0.3313
G-rhoLocal               13     227.57     < 0.0001        12     226.37    < 0.0001         41     2.88       0.0901
H-solutionConstruction   14     151.18     < 0.0001        13     154.72    < 0.0001         4      80.62      < 0.0001
J-antPlacement           75     0.01       0.9386          75     0.00      0.9831           54     1.28       0.2576
K-pheromoneUpdate        47     4.41       0.0360          46     4.85      0.0278           5      49.39      < 0.0001
L-problemSize            12     289.62     < 0.0001        18     52.52     < 0.0001         2      3334.25    < 0.0001
M-problemStDev           1      43076.29   < 0.0001        1      9426.17   < 0.0001         43     2.57       0.1096

Figure 8.6: Summary of ANOVAs for Relative Error, ADA and Time. Only the main effects are shown. Rankings are based on the full two factor interaction model of 78 terms, before backward selection was applied.

This table contains several results and answers to some open questions in the ACO literature.

8.3.1 Screened factors

J-antPlacement is statistically insignificant at the 0.05 level for both quality responses and the time response. J-antPlacement can therefore be directly screened out. Factor F-rho is statistically insignificant at the 0.05 level for the time response and statistically significant for both quality responses. Despite its significance, the rankings of F-rho are very low (45, 44 and 56 out of 78 effects) across all three responses. F-rho is therefore screened out. A-alpha is statistically insignificant at the 0.05 level for both quality responses but statistically significant at the 0.05 level for the time response. This is reflected in the low rankings for the quality responses and the high ranking (almost within the top 12) for the time response. A-alpha should be set to 1, since this requires marginally less time to compute the ACS decision equation than a higher value of A-alpha. K-pheromoneUpdate is statistically significant for all three responses but has a very low ranking for both quality responses. An examination of the plot of time for the K-pheromoneUpdate factor shows that the effect on time is not practically significant. K-pheromoneUpdate can therefore be screened out. In summary, 4 factors are screened out: A-alpha, F-rho, J-antPlacement and K-pheromoneUpdate.

8.3.2 Relative Importance of Factors

Of the two problem characteristics, the factor with the larger effect on solution quality is the problem standard deviation, M-problemStDev. The problem size, L-problemSize, has a stronger effect on solution time than M-problemStDev. ACS takes longer to reach stagnation on larger problem instances than on smaller instances. Of the remaining unscreened tuning parameters, the heuristic exponent B-beta, the number of ants C-antsFraction, the length of candidate lists D-nnFraction and the exploration/exploitation threshold E-q0 have the strongest effects on solution quality. The same is true for solution time. These results are important because they highlight the most important tuning parameters and problem characteristics in terms of both heuristic performance dimensions. These factors are the minimal set of design factors one should experiment with when modelling and tuning ACS.

8.3.3 Adequacy of a Linear Model

A test for curvature as per Section 6.3.9 on page 116 shows there is a significant amount of curvature in all three responses. This means that the linear model for screening is not adequate to explore the whole design space. A higher order model of all three responses is required. This is an important result because it confirms that a One-Factor-At-a-Time (Section 4.3.3 on page 87) approach is insufficient for investigating the performance of ACS.
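One common form of this curvature check compares the mean response at the centre points with the mean at the factorial points, using the single-degree-of-freedom curvature sum of squares; a sketch with made-up numbers is given below (in a full analysis the curvature term is usually tested against the model error rather than pure error alone).

```python
import numpy as np
from scipy import stats

def curvature_test(factorial_y, centre_y):
    """Single-degree-of-freedom curvature test for a two-level design with
    centre points: SS_curv = nF*nC*(mean_F - mean_C)^2 / (nF + nC),
    tested here against the pure-error estimate from the centre points."""
    yF, yC = np.asarray(factorial_y, float), np.asarray(centre_y, float)
    nF, nC = len(yF), len(yC)
    ss_curv = nF * nC * (yF.mean() - yC.mean()) ** 2 / (nF + nC)
    ss_pe = ((yC - yC.mean()) ** 2).sum()     # pure error from centre points
    df_pe = nC - 1
    f_value = ss_curv / (ss_pe / df_pe)
    p_value = stats.f.sf(f_value, 1, df_pe)
    return f_value, p_value

# Made-up factorial and centre-point responses.
print(curvature_test([7.2, 7.9, 6.8, 8.1, 7.5, 7.7, 6.9, 8.0],
                     [5.1, 5.3, 4.9, 5.2]))
```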

8.3.4 Comparison of Solution Quality Measures

It was already mentioned that two solution quality responses were measured to investigate whether either response led to different conclusions. An examination of the ANOVA summaries reveals that both Relative Error and ADA have almost the same rankings and the same statistical significance or insignificance for all factors except L-problemSize. While L-problemSize has a significant effect on both responses, the ADA response gives it a lower ranking of 18 compared to the Relative Error response ranking of 12. This is due to the nature of how the two responses are calculated (Section 3.10.2 on page 69). This result shows that, for screening of ACS, the choice of solution quality response will not affect the conclusions. However, ADA may be preferable as it exhibits a lower range and variability than Relative Error (Figure 8.1 on page 145). The advantage of a lower variability was discussed in the context of statistical power (Section A.6 on page 225).

8.4 Conclusions and discussion

The following conclusions are drawn from the ACS screening study. These conclusions apply for a significance level of 5% and a power of 80% to detect the effect sizes listed in Figure 8.1 on page 145. These effect sizes are a change in solution time of 140s, a change in Relative Error of 1.8% and a change in ADA of 0.71. Issues of power and effect size are discussed in Section A.6 on page 225.

• Tuning Ant placement not important. The type of ant placement has no significant effect on ACS performance in terms of solution quality or solution time. This was an open question in the literature. It is remarkable because intuitively one would expect a random scatter of ants across the problem graph to explore a wider variety of possible solutions. This result shows that this is not the case.

• Tuning alpha not important. Alpha has no significant effect on ACS performance in terms of solution quality or solution time. This confirms the common recommendation in the literature of setting alpha equal to 1. Alpha has also been analysed with an OFAT approach in Appendix D on page 235.

• Tuning Rho not important. Rho has no significant effect on ACS performance in terms of solution quality or solution time. This is a new result for ACS. It is a surprising result since Rho is a term in the ACS pheromone update equations and other analytical results in a much simplified scenario suggested it was important [?].

• Tuning Pheromone Update Ant not important. The ant used for pheromone updates is practically insignificant for all three responses. An examination of the plot of time for the K-pheromoneUpdate factor shows that the effect on time is not practically significant. K-pheromoneUpdate can therefore be screened out.

• Most important tuning parameters. The most important ACS tuning parameters are the heuristic exponent B-beta, the number of ants as a fraction of problem size C-antsFraction, the length of candidate lists as a fraction of problem size D-nnFraction and the exploration/exploitation threshold E-q0. These are the factors one should focus on as design factors when experimental resources are limited.

• Problem standard deviation is important. This confirms the main result of Chapter 7 in identifying a new TSP problem characteristic that has a significant effect on the difficulty of a problem for ACS. ACO research should be reporting this characteristic in the literature.

• Higher order model needed. A higher order model, greater than linear, is required to model ACS solution quality and ACS solution time. This is an important result because it demonstrates for the first time that simple OFAT approaches seen in the literature are insufficient for accurately modelling and tuning ACS performance.

• Comparison of solution quality responses. There is no difference in the conclusions of the screening study using either the ADA or Relative Error solution quality responses. ADA has a slightly smaller variability and so results in more powerful experiments than Relative Error.

8.5 Chapter summary

This chapter has presented a case study on screening the tuning parameters and problem characteristics that affect the performance of ACS. This illustrated the application of the methodology in Section 6.3 on page 110 with a fully instantiated ACO heuristic, Ant Colony System. Many new results were presented and existing recommendations in the literature were confirmed in a rigorous fashion. In the next chapter, these results will be used to efficiently build an accurate model of ACS performance. The full model and the reduced model using the screening results will be compared. This will confirm that the screening decisions recommended in this study were correct.

9 Case study: Tuning Ant Colony System

This Chapter reports a case study on tuning the factors affecting the performance of a heuristic. The methodology for this case study was described in Chapter 6. The particular heuristic studied is Ant Colony System (ACS) for the Travelling Salesperson Problem (Section 2.4.6 on page 42). This chapter reports many new results for ACS. All analyses are conducted for two performance measures, quality of solution and solution time. This provides an accurate measure of the heuristic compromise that is rarely seen in the literature. It is shown that models of ACS performance must be at least quadratic. ACS is tuned using a full parameter set and a screened parameter set resulting from the case study of the previous chapter. This verifies that the screening decisions from the previous chapter are correct. The results reported in this Chapter have been published in the literature [106].

9.1 Method

9.1.1 Response Variables

Three responses were measured as per Section 6.7.2 on page 127. These responses were percentage relative error from a known optimum (henceforth referred to as Relative Error), adjusted differential approximation (henceforth referred to as ADA) and solution time (henceforth referred to as Time).

9.1.2 Factors, levels and ranges

Design factors In the full parameter set RSM, there were 12 design factors as per the screening study of Chapter 8. The factors and their high and low levels are repeated in Table 9.1 for convenience.

Factor   Name                   Type        Low Level   High Level
A        alpha                  Numeric     1           13
B        beta                   Numeric     1           13
C        antsFraction           Numeric     1.00        110.00
D        nnFraction             Numeric     2.00        20.00
E        q0                     Numeric     0.01        0.99
F        rho                    Numeric     0.01        0.99
G        rhoLocal               Numeric     0.01        0.99
H        solutionConstruction   Categoric   parallel    sequential
J        antPlacement           Categoric   random      same
K        pheromoneUpdate        Categoric   bestSoFar   bestOfIteration
L        problemSize            Numeric     300         500
M        problemStDev           Numeric     10.00       70.00

Table 9.1: Design factors for the tuning study with ACS. The factor ranges are also given.

Tuning parameters for the Screened model, based on the results of the previous Chapter, did not include A-alpha, F-rho, J-antPlacement and K-pheromoneUpdate. These four screened factors took on randomly chosen values within their ranges for each experiment run.

Held-Constant Factors The held-constant factors are as per Section 6.7.6 on page 127.

9.1.3 Instances

All TSP instances were of the symmetric type and were created as per Section 6.7.1 on page 126. The TSP problem instances ranged in size from 300 cities to 500 cities with cost matrix standard deviation ranging from 10 to 70. All instances had a mean of 100. The same instances were used for each replicate of a design point.

9.1.4 Experiment design, power and replicates

The experiment design was a Minimum Run Resolution V Face-Centred Composite (Section A.3.4 on page 218) with six centre points. The number of replicates was increased in a work-up procedure (Section A.6.1 on page 227) until a power of 80% was achieved for a significance level of 5% when detecting a given effect size. The next two figures give the descriptive statistics for the collected data and the actual effect size for each response in the full and screened experiments with the FCC design.

Iterations   Statistic                      Time       Relative Error   ADA
50           Mean                           65.33      11.01            5.60
50           StDev                          194.96     19.49            7.45
50           Max                            3131.77    125.84           41.75
50           Min                            0.17       0.55             0.66
50           Effect size (0.2 st. dev.)     38.99      3.90             1.49
100          Mean                           136.38     10.73            5.44
100          StDev                          469.81     19.09            7.24
100          Max                            7825.44    124.20           41.51
100          Min                            0.28       0.55             0.66
100          Effect size (0.2 st. dev.)     93.96      3.82             1.45
150          Mean                           204.39     10.57            5.36
150          StDev                          681.54     18.95            7.18
150          Max                            12075.97   124.20           41.51
150          Min                            0.38       0.47             0.66
150          Effect size (0.2 st. dev.)     136.31     3.79             1.44
200          Mean                           270.25     10.47            5.31
200          StDev                          906.16     18.85            7.15
200          Max                            15423.77   124.20           41.42
200          Min                            0.50       0.46             0.60
200          Effect size (0.2 st. dev.)     181.23     3.77             1.43
250          Mean                           341.36     10.40            5.27
250          StDev                          1121.20    18.81            7.13
250          Max                            15573.66   123.74           41.42
250          Min                            0.61       0.46             0.60
250          Effect size (0.2 st. dev.)     224.24     3.76             1.43

Figure 9.1: Descriptive statistics for the full ACS FCC design. The actual detectable effect size of 0.2 standard deviations is shown for each response and for each stagnation point. There is little practical difference in effect size of the solution quality responses for an increase in the stagnation point.

The full design could achieve sufficient power with 5 replicates while detecting an effect of size 0.2 standard deviations. The screened design could achieve sufficient power with 10 replicates while detecting an effect of size 0.29 standard deviations. Unfortunately, experimental resources did not permit using a larger number of replicates. This is further motivation for the use of DOE and fractional factorials. Without the vast savings of fractional factorials (Section A.3.3 on page 218) this experiment would have been completely infeasible.

Iterations   Statistic                       Time      Relative Error   ADA
50           Mean                            53.70     10.70            6.80
50           StDev                           73.01     16.85            9.89
50           Max                             718.67    92.85            42.59
50           Min                             0.18      0.65             0.90
50           Effect size (0.29 st. dev.)     21.17     4.89             2.87
100          Mean                            106.44    10.48            6.69
100          StDev                           143.81    16.69            9.82
100          Max                             1274.12   92.85            42.59
100          Min                             0.36      0.62             0.81
100          Effect size (0.29 st. dev.)     41.70     4.84             2.85
150          Mean                            160.90    10.35            6.63
150          StDev                           225.24    16.60            9.79
150          Max                             1653.75   92.85            42.46
150          Min                             0.50      0.62             0.68
150          Effect size (0.29 st. dev.)     65.32     4.82             2.84
200          Mean                            208.49    10.28            6.57
200          StDev                           278.26    16.57            9.74
200          Max                             1801.84   92.45            42.46
200          Min                             0.60      0.55             0.64
200          Effect size (0.29 st. dev.)     80.70     4.81             2.83
250          Mean                            262.77    10.22            6.54
250          StDev                           363.71    16.52            9.72
250          Max                             3721.72   92.45            42.15
250          Min                             0.69      0.55             0.64
250          Effect size (0.29 st. dev.)     105.47    4.79             2.82

Figure 9.2: Descriptive statistics for the screened ACS FCC design. The actual detectable effect size of 0.29 standard deviations is shown for each response and for each stagnation point. There is little practical difference in effect size of the solution quality responses for an increase in the stagnation point.

9.1.5 Performing the experiment

Responses were measured at five stagnation levels. An examination of the descriptive statistics verifies that the stagnation level did not have a large effect on the response values and therefore the conclusions after a 250-iteration stagnation should be the same as after lower iteration stagnations. The two solution quality responses show a small but practically insignificant decrease in solution error as the number of stagnation iterations increases. The Time response increases with increasing stagnation iterations because the experiments take longer to run. For all cases, the level of stagnation iterations has little practically significant effect on the three responses. ACS did not make any large improvements when allowed to run for longer. It is therefore sufficient to perform analyses at the 250-iteration stagnation level.

9.2 Analysis

9.2.1 Fitting

A fit analysis was conducted for each response in the full experiments and the screened experiments. For both the full and screened cases, at least a quadratic model was required to model the responses. For the Minimum Run Resolution V Face-Centred Composite, cubic models are aliased and so were not considered.
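A quadratic response surface of this kind can be sketched as an ordinary least squares fit of linear, interaction and squared terms (illustrative only; the thesis models were built with a DOE package on transformed responses and coded factors, and the data below are made up).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up coded factor settings (columns: beta, q0, nnFraction) and a response.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 3))
y = (1.5 - 2.0 * X[:, 1] + 0.8 * X[:, 0] * X[:, 1]
     + 1.2 * X[:, 1] ** 2 + rng.normal(0, 0.1, 60))

# degree=2 generates the linear, two-factor interaction and squared terms.
quad = PolynomialFeatures(degree=2, include_bias=False)
Xq = quad.fit_transform(X)
model = LinearRegression().fit(Xq, y)

print(dict(zip(quad.get_feature_names_out(["beta", "q0", "nnFraction"]),
               model.coef_.round(2))))

# Predicting the (transformed) response at new coded parameter settings:
new = np.array([[0.5, -0.2, 0.0]])
print(model.predict(quad.transform(new)))
```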

9.2.2 ANOVA

Effects for each response model were selected using stepwise regression (Section A.4.1 on page 220) applied to a full quadratic model with an alpha-out threshold of 0.10. Some terms removed by stepwise regression were added back into the final model to preserve hierarchy. To make the data amenable to statistical analysis, a transformation of the responses was required for each analysis. The transformation was a log10 for all three responses. Outliers were deleted and the model building repeated until the models passed the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality, constant variance, time-dependent effects and leverage (Section A.4.2 on page 221). 138 data points (∼5% of the total data) were removed when analysing the full model of ADA-Time, 122 data points (∼5%) for the full model of RelativeError-Time, 47 data points (∼5%) for the screened model of ADA-Time and 15 data points (∼2%) for the screened model of RelativeError-Time.

9.2.3 Confirmation

Confirmation experiments were run according to the methodology detailed in Section 6.4.6 on page 121 in order to confirm the accuracy of the ANOVA models. The randomly chosen treatments produced actual algorithm responses with the descriptive statistics listed in Figure 9.3. The large ranges of each response reinforce the motivation for correct parameter tuning, as there is clearly a high cost in incorrectly tuned parameters. The next two figures illustrate the 95% prediction intervals (Section A.3.5 on page 219) and actual confirmation data for the full and screened response surface models of Relative Error and Time.

Iterations   Statistic   Time      Relative Error   ADA
100          Mean        70.14     7.01             3.83
100          StDev       84.14     3.48             3.27
100          Max         528.28    17.65            16.15
100          Min         1.55      3.12             1.00
150          Mean        109.80    6.88             3.77
150          StDev       130.37    3.41             3.24
150          Max         774.39    17.65            16.15
150          Min         2.17      2.77             1.00
200          Mean        169.66    6.75             3.71
200          StDev       220.76    3.38             3.22
200          Max         1084.58   16.92            15.48
200          Min         2.83      2.77             1.00
250          Mean        220.19    6.69             3.67
250          StDev       287.57    3.35             3.18
250          Max         1652.54   16.84            15.41
250          Min         3.45      2.77             1.00

Figure 9.3: Descriptive statistics for the confirmation of the ACS tuning. The response data is from runs of the actual algorithm on the randomly generated confirmation treatments.

Figure 9.4: 95% Prediction intervals for the full ACS response surface model of Relative Error-Time. The horizontal axis is the randomly generated treatment. The vertical axis is the Relative Error or Time response.

Figure 9.5: 95% Prediction intervals for the screened ACS response surface model of Relative Error-Time. Screening was conducted in the previous Chapter. The horizontal axis is the randomly generated treatment. The vertical axis is the Relative Error or Time response.

For both the screened and the full models, the predictions are very similar for both the Relative Error and Time responses. This shows that the screening decisions from the previous case study were correct. Looking at the predictions in general, we see that Time was better predicted than Relative Error. The Time models match all the trends in the actual data. The Relative Error models, however, exhibit some false peaks and miss some actual peaks. Similar results were observed for the full and screened models of ADA-Time. The ADA model failed to predict one more peak than the Relative Error model. The RelativeError-Time and ADA-Time models are therefore deemed good predictors of the ACS responses.

9.3 Results

9.3.1 Screening and relative importance of factors

The following two figures give the ranked ANOVAs of the Relative Error and Time models from the RelativeError-Time analysis. The terms have been rearranged in order of decreasing sum of squares so that the largest contributor to the models comes first.

Rank   Term                     Sum of squares   F value     p value
1      J-problemStDev           182.23           121362.13   < 0.0001
2      E-q0                     64.87            43205.00    < 0.0001
3      D-nnFraction             22.01            14656.74    < 0.0001
4      DE                       17.71            11794.89    < 0.0001
5      BD                       8.86             5901.77     < 0.0001
6      EJ                       4.61             3068.80     < 0.0001
7      B-beta                   4.46             2967.32     < 0.0001
8      AD                       4.36             2904.26     < 0.0001
9      BE                       3.84             2554.84     < 0.0001
10     H-problemSize            3.66             2439.84     < 0.0001
11     AJ                       3.38             2250.48     < 0.0001
12     AB                       3.26             2174.28     < 0.0001
13     CE                       2.76             1836.57     < 0.0001
14     G-rhoLocal               2.57             1714.47     < 0.0001
15     D^2                      2.45             1631.98     < 0.0001
16     DJ                       2.31             1540.95     < 0.0001
17     AF                       2.26             1504.79     < 0.0001
18     E^2                      2.24             1492.39     < 0.0001
19     BJ                       2.21             1468.81     < 0.0001
20     B^2                      2.13             1417.47     < 0.0001
21     C-antsFraction           1.93             1286.26     < 0.0001
22     EG                       1.88             1253.72     < 0.0001
23     CD                       1.69             1125.00     < 0.0001
24     CJ                       1.50             1001.96     < 0.0001
25     EF                       1.30             862.88      < 0.0001
26     EH                       1.23             820.42      < 0.0001
27     DG                       0.98             649.57      < 0.0001
28     K-solutionConstruction   0.84             562.08      < 0.0001
29     AG                       0.42             279.16      < 0.0001
30     CF                       0.38             254.96      < 0.0001
31     H^2                      0.32             215.21      < 0.0001
32     M-pheromoneUpdate        0.28             188.89      < 0.0001
33     G^2                      0.26             172.66      < 0.0001
34     A^2                      0.24             162.14      < 0.0001
35     FH                       0.21             141.53      < 0.0001
36     EM                       0.21             137.62      < 0.0001
37     F-rho                    0.20             134.68      < 0.0001
38     BH                       0.20             132.31      < 0.0001
39     DF                       0.20             131.48      < 0.0001
40     GH                       0.19             124.68      < 0.0001
41     F^2                      0.18             122.13      < 0.0001
42     J^2                      0.15             100.38      < 0.0001
43     AH                       0.14             94.24       < 0.0001
44     A-alpha                  0.12             80.01       < 0.0001
45     FJ                       0.12             77.53       < 0.0001
46     EK                       0.10             63.99       < 0.0001
47     HJ                       0.09             58.21       < 0.0001
48     DK                       0.06             42.16       < 0.0001
49     JK                       0.06             42.03       < 0.0001
50     DH                       0.06             41.72       < 0.0001
51     AE                       0.06             37.78       < 0.0001
52     FM                       0.05             33.60       < 0.0001
53     HK                       0.05             32.45       < 0.0001
54     CG                       0.04             28.05       < 0.0001
55     BM                       0.04             25.03       < 0.0001
56     DM                       0.04             25.02       < 0.0001
57     GK                       0.04             24.44       < 0.0001
58     JM                       0.03             22.86       < 0.0001
59     BC                       0.03             22.76       < 0.0001
60     GJ                       0.03             19.37       < 0.0001
61     CM                       0.03             17.58       < 0.0001
62     BG                       0.03             17.42       < 0.0001
63     CH                       0.02             16.61       < 0.0001
64     FG                       0.02             9.99        0.0016
65     KM                       0.01             7.98        0.0048
66     GL                       0.01             7.51        0.0062
67     BK                       0.01             7.49        0.0062
68     GM                       0.01             6.05        0.0140
69     EL                       0.01             5.58        0.0183
70     L-antPlacement           0.01             4.39        0.0363
71     FL                       0.00             2.92        0.0878
72     BF                       0.00             2.75        0.0972

Figure 9.6: RelativeError-Time ranked ANOVA of the Relative Error response from the full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

Looking first at the Relative Error rankings, we see that the least important


[Ranked ANOVA table: 46 model terms listed with their Sum of Squares, F value and p value. The top-ranked terms are C-antsFraction, C^2, H-problemSize and D-nnFraction; A-alpha and F-rho are not statistically significant.]

Figure 9.7: RelativeError-Time ranked ANOVA of time response from full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

main effects are L-antPlacement, A-alpha, F-rho and M-pheromoneUpdate. These are exactly the terms that were deemed unimportant in the screening study of the previous chapter. By far the most important terms are the main effects of the candidate list length tuning parameter and the exploration/exploitation tuning parameter, as well as their interaction. This is a very important result because it shows that candidate list length, a parameter that we have often seen set at a fixed value or not used at all, is actually one of the most important parameters to set correctly. Looking at the Time rankings, we see that L-antPlacement was completely removed from the model. The least important main effects were then F-rho and A-alpha. These results also confirm the screening decisions. However, the M-pheromoneUpdate term has now risen in importance in its effect on Time. By far the most important tuning parameters are the number of ants and the lengths of their candidate lists. This is quite intuitive, as the amount of processing is directly related to these parameters. The result regarding the cost of the number of ants is particularly important because the number of ants does not have a relatively strong effect on solution quality. The extra time cost of using more ants will not result in gains in solution quality. This is an important result because it methodically confirms the often-recommended parameter setting of fixing the number of ants at a small value (usually 10). An examination of the ranked ANOVAs from the ADA-Time model gives the same ranking of tuning parameter contributions to Time and ADA. As with the previous ACS screening study, the choice of solution quality response does not change the conclusions about the relative importance of the tuning parameters.
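The ranked-ANOVA presentation used here can be reproduced with standard regression tooling. The following is a minimal sketch, assuming Python with statsmodels rather than the statistics package used for the thesis experiments; the data frame, factor names and coefficients are invented stand-ins, not the ACS data.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Simulated stand-in for the experiment runs: three coded factors and a response.
    rng = np.random.default_rng(3)
    runs = pd.DataFrame(rng.uniform(-1, 1, size=(400, 3)),
                        columns=["q0", "nnFraction", "beta"])
    runs["rel_error"] = (5 + 3 * runs["q0"] + 1.5 * runs["nnFraction"]
                         + 0.8 * runs["q0"] * runs["nnFraction"]
                         + rng.normal(0, 0.5, 400))

    # Fit a two-factor-interaction model and rank its terms by Sum of Squares.
    fit = smf.ols("rel_error ~ (q0 + nnFraction + beta)**2", data=runs).fit()
    table = sm.stats.anova_lm(fit, typ=2)                  # per-term sum_sq, F, PR(>F)
    ranked = table.drop("Residual").sort_values("sum_sq", ascending=False)
    print(ranked[["sum_sq", "F", "PR(>F)"]])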


9.3.2 Tuning

A desirability optimisation is performed as per Section 6.5.2 on page 122. Recall that equal preference is given to the minimisation of relative error and solution time. The results from the full and screened RelativeError-Time models are presented in the following two figures.

[Table: recommended parameter settings and the predicted Time, Relative Error and Desirability for each combination of problem size (300, 400, 500) and problem standard deviation (10, 40, 70).]

Figure 9.8: Full RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.


Figure 9.9: Screened RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.

The rankings of the ANOVA terms have already highlighted the factors that have little effect on the responses. These screened factors can take on any values in the desirability optimisations. This is confirmed by examining the desirability recommendations from the full and screened models. The most important factors, comprising the screening model, have recommended settings that strongly agree with


the recommended settings from the full model. For example, beta is always low, except when the problem standard deviation is high. The exploration/exploitation threshold q0 is always at a maximum of 0.99, implying that exploitation is always preferred to exploration. AntsFraction is always low. The remaining unimportant factors take on a variety of values in the full model desirability optimisation. The predicted values of time from both models agree with one another to within a second. The predictions of relative error are higher from the screened model. The quality of the desirability optimisation recommendations can now be evaluated.
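As a rough illustration of the mechanics behind the desirability optimisation of Section 6.5.2, the sketch below (Python, with invented targets, bounds and candidate points rather than the fitted ACS models) maps each predicted response to a desirability in [0, 1] using the "smaller is better" Derringer-Suich form and combines the two responses with an equal-weight geometric mean.

    import numpy as np

    def d_minimise(y, target, worst, weight=1.0):
        """Desirability: 1 at or below target, 0 at or above worst, power-law in between."""
        ratio = (worst - np.asarray(y, dtype=float)) / (worst - target)
        return np.clip(ratio, 0.0, 1.0) ** weight

    def overall_desirability(rel_error, time_s):
        d_err = d_minimise(rel_error, target=0.0, worst=20.0)   # 20% error treated as unacceptable
        d_time = d_minimise(time_s, target=1.0, worst=600.0)    # 10 minutes treated as unacceptable
        return np.sqrt(d_err * d_time)                          # equal-weight geometric mean

    # Score some candidate predicted (relative error %, time in seconds) pairs.
    for err, t in [(0.5, 5.0), (2.0, 60.0), (8.0, 3.0)]:
        print(err, t, round(float(overall_desirability(err, t)), 3))

In the thesis, candidate settings are scored using predictions from the fitted response-surface models; the numbers above are purely illustrative.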

9.3.3 Evaluation of tuned settings

The tuned parameter recommendations from the desirability optimisation are evaluated as per the methodology of Section 6.6 on page 124. Some illustrative plots are given in the following two figures. On each plot, the horizontal axis lists the randomly generated treatments and the vertical axis lists the response value. Each plot contains the data for the response recorded using the settings from the desirability optimisation of the full and screened experiments. The responses produced by using parameter settings recommended in the literature (Section 2.4.9 on page 45) and some randomly chosen parameter settings are also listed.


Figure 9.10: Evaluation of the Relative Error response in the RelativeError-Time model of ACS on problems of size 400 and standard deviation 70. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full RelativeError-Time model, the settings from a desirability optimisation of the screened RelativeError-Time model, the settings recommended in the literature and randomly generated settings.

For both Relative Error and Time, the parameter settings from the full and screened models perform about the same as the parameter settings from the literature. Interestingly, on a small number of occasions, randomly chosen settings perform better than all other settings. Similar results were found with all eight other combinations of problem characteristics for the full and screened RelativeErrorTime models. This result confirms the recommendation of generally good ACS settings in the literature summarised in Section 2.4.9 on page 45. In particular, both the literature and the desirability optimisation agree on the recommended settings



Figure 9.11: Evaluation of the Time response in the RelativeError-Time model of ACS on problems of size 400 and standard deviation 70. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full RelativeError-Time model, the settings from a desirability optimisation of the screened RelativeError-Time model, the settings recommended in the literature and randomly generated settings.

for the most important factors according to the screening study. Both recommend low values of Beta and AntsFraction and high values of the exploration/exploitation threshold q0. Results from the ADA-Time desirability optimisation were different, as the recommended parameter settings from the literature were chosen with a Relative Error response in mind rather than an ADA response. The following two figures illustrate representative results for ADA and Time. Again, on a few occasions, the randomly chosen settings perform better than all alternatives. There is little difference in solution times between the desirability settings and the literature settings. However, there is a very large difference when one considers ADA. This shows that one should not use the literature-recommended parameter settings if one is measuring an ADA solution quality response.

9.4 Conclusions and discussion

The following conclusions are drawn from the ACS tuning study. The first of these relate to screening and ranking and serve to confirm the conclusions from the screening study of the previous chapter (Section 8.4 on page 150). These screening and tuning conclusions apply for a significance level of 5% and a power of 80% to detect the effect sizes listed in Figure 9.1 on page 155. These effect sizes are a change in solution time of 224s, a change in Relative Error of 3.76% and a change in ADA of 1.43. Issues of power and effect size are discussed in Section A.6 on page 225.

• Tuning Ant placement not important. The type of ant placement has no significant effect on ACS performance in terms of solution quality or solution time.



Figure 9.12: Evaluation of the ADA response in the ADA-Time model of ACS on problems of size 500 and standard deviation 10. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full ADA-Time model, the settings from a desirability optimisation of the screened ADA-Time model, the settings recommended in the literature and randomly generated settings.


Figure 9.13: Evaluation of the Time response in the ADA-Time model of ACS on problems of size 500 and standard deviation 10. The horizontal axis is the randomly generated treatment. There are plots of the results from four parameter settings: the settings from a desirability optimisation of the full ADA-Time model, the settings from a desirability optimisation of the screened ADA-Time model, the settings recommended in the literature and randomly generated settings.


• Tuning Alpha not important. Alpha has no significant effect on ACS performance in terms of solution quality or solution time. This confirms the common recommendation in the literature of setting alpha equal to 1. Alpha has also been analysed with an OFAT approach in Appendix D on page 235.

• Tuning Rho not important. Rho has no significant effect on ACS performance in terms of solution quality or solution time. This is a new result for ACS.

• Tuning Pheromone Update Ant not important. The ant used for pheromone updates is ranked highly for solution time. However, omitting this factor from the screened model did not affect the performance of the screened model.

• Most important tuning parameters. The most important ACS tuning parameters are the heuristic exponent B-beta, the number of ants C-antsFraction, the length of candidate lists D-nnFraction, the exploration/exploitation threshold E-q0 and rhoLocal.

The tuning study also provides further results that the screening study of the previous case study could not.

• Minimum order model. A model of at least quadratic order is required to model ACS solution quality and ACS solution time. This is a new result for ACS and shows that an OFAT approach is not an appropriate way to tune the performance of ACS.

• Relationship between tuning, problems and performance. Both the RelativeError-Time and ADA-Time models were good predictors of ACS performance across the entire design space. The prediction intervals for the full and screened models were very similar, confirming that the decisions from the screening study in the previous chapter were correct.

• Tuned parameter settings. There was much similarity between the recommended tuned parameter settings from the full and screened models, and both sets of settings resulted in similar ACS performance. Recommended settings from the desirability optimisation resulted in solution quality and solution time similar to the settings in the literature. There are immense performance gains to be achieved, as evidenced by the relatively poor performance of many randomly chosen parameter settings. The reader may have intuitively expected randomly chosen values to perform poorly, but we emphasise that their evaluation is nonetheless an important control for the test of the DOE methodology.

• Comparison of solution quality responses. There is no difference in the screening and ranking conclusions from using the ADA or Relative Error solution quality responses for ACS.


9.5 Chapter summary

This chapter presented a case study applying the methodology of Chapter 6 to the tuning of the Ant Colony System (ACS) heuristic. Many new results were presented and existing recommendations in the literature were confirmed in a rigorous fashion. The conclusions of the screening study in the previous chapter were also confirmed.


10 Case study: Screening Max-Min Ant System

This chapter presents a case study on the screening of the Max-Min Ant System (MMAS) heuristic. Several new results for MMAS are presented. These results have not yet been published in the literature. The chapter follows the sequential experimentation procedure of Chapter 6, beginning with a screening study. The tuning study will follow in the next chapter. Established tuning parameters previously thought to affect performance are actually shown to have no effect at all. New tuning parameters that were thought to affect performance are investigated. A new TSP problem characteristic is shown to have a very strong effect on performance, confirming the results of Chapter 7. All analyses are conducted for two performance measures, quality of solution and solution time. This provides an accurate measure of the heuristic compromise that is rarely seen in the literature. Finally, it is shown that models of MMAS performance must be of a higher order than linear.

10.1 Method

10.1.1 Response variables

Three responses were measured as per Section 6.7.2 on page 127. These responses were percentage relative error from a known optimum (henceforth referred to as Relative Error), adjusted differential approximation (henceforth referred to as ADA) and solution time (henceforth referred to as Time).


10.1.2 Factors, Levels and Ranges

Design Factors There were 12 design factors, 10 representing the MMAS tuning parameters and 2 representing the TSP problem characteristics being investigated. The design factors and their high and low levels are summarised in Table 10.1. A description of the MMAS tuning parameters was given in Section 2.4.5 on page 39. The parameter M-antPlacement could be considered as a parameterised design feature, as mentioned in Section 6.3.1 on page 111. As with the ACS case studies, we acknowledge that an experimenter's prior experience with MMAS may suggest narrower ranges for these factors. When this experience was not available in this thesis, we chose ranges around the values hard-coded in the original source code. For example, restartFreq was hard-coded to 25 and so a range of 2 to 40 was experimented with here. It is a simple matter to rerun this case study with different ranges if desired.

Factor  Name             Type       Low      High
A       alpha            Numeric    1        13
B       beta             Numeric    1        13
C       antsFraction     Numeric    1.00     110.00
D       nnFraction       Numeric    1.00     20.00
E       q0               Numeric    0.01     0.99
F       rho              Numeric    0.01     0.99
G       reinitBranchFac  Numeric    0.50     2.00
H       reinitIters      Numeric    2        80
J       problemStDev     Numeric    10       70
K       problemSize      Numeric    300      500
L       restartFreq      Numeric    2        40
M       antPlacement     Categoric  random   same

Table 10.1: Design factors for the screening study with MMAS. There are two problem characteristic factors (J-problemStDev and K-problemSize). The remaining 10 factors are MMAS tuning parameters.

Held-Constant Factors The held constant factors are as per Section 6.7.6 on page 127.

10.1.3 Instances

All TSP instances were of the symmetric type and were created as per Section 6.7.1 on page 126. The TSP problem instances ranged in size from 300 cities to 500 cities with cost matrix standard deviation ranging from 10 to 70. All instances had a mean of 100. The same instances were used for each replicate of a design point.


10.1.4 Experiment design, power and replicates

The experiment design was a Resolution IV 2^(12-5) fractional factorial (Section A.3.2 on page 215) with 6 centre points. The number of replicates was 8, determined using the work-up procedure of Section 6.7.4 on page 127 for a power of about 80%, a significance level of 5% and an effect size of 0.18 standard deviations. This yielded a total of 1030 runs. The following figure summarises the descriptive statistics of the three response variables across all treatments at the 5 stagnation measuring points, together with the actual effect size that is equivalent to 0.18 standard deviations.

Iterations  Statistic                  Time      Relative Error  ADA
50          Mean                       102.29    8.15            6.08
            StDev                      378.70    13.34           10.36
            Max                        4549.08   118.75          43.46
            Min                        0.18      0.41            0.34
            Effect size (0.18 StDev)   68.17     2.40            1.86
100         Mean                       181.28    7.66            5.59
            StDev                      591.26    12.77           9.87
            Max                        6799.15   116.22          43.46
            Min                        0.29      0.41            0.28
            Effect size (0.18 StDev)   106.43    2.30            1.78
150         Mean                       231.11    7.46            5.37
            StDev                      654.76    12.57           9.45
            Max                        6933.65   116.22          43.19
            Min                        0.43      0.41            0.28
            Effect size (0.18 StDev)   117.86    2.26            1.70
200         Mean                       307.51    7.29            5.09
            StDev                      865.00    12.46           8.75
            Max                        8144.02   116.22          43.19
            Min                        0.54      0.41            0.25
            Effect size (0.18 StDev)   155.70    2.24            1.58
250         Mean                       376.37    7.19            4.93
            StDev                      1038.89   12.39           8.31
            Max                        9312.89   116.22          42.97
            Min                        0.65      0.41            0.21
            Effect size (0.18 StDev)   187.00    2.23            1.50

Figure 10.1: Descriptive statistics for the MMAS screening experiment. Statistics are given at five stagnation points ranging from 50 to 250 iterations. The actual effect size equivalent to 0.18 standard deviations is also listed for each response variable.
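The power figure quoted above can be sanity-checked with a crude simulation. The sketch below is not the thesis' work-up procedure: it considers a single two-level factor in a 2^(12-5) design with 8 replicates (so 512 runs per level), ignores the centre points and the other model terms, and uses a normal-approximation two-sided test at the 5% level.

    import numpy as np

    rng = np.random.default_rng(0)
    n_per_level = (128 * 8) // 2   # 2^(12-5) = 128 design points x 8 replicates, split over two levels
    effect = 0.18                  # main effect, in units of the response standard deviation
    z_crit = 1.96                  # two-sided 5% critical value (normal approximation)
    n_sims = 20_000

    rejections = 0
    for _ in range(n_sims):
        low = rng.normal(0.0, 1.0, n_per_level)       # runs at the factor's low level
        high = rng.normal(effect, 1.0, n_per_level)   # runs at the factor's high level
        se = np.sqrt(low.var(ddof=1) / n_per_level + high.var(ddof=1) / n_per_level)
        rejections += abs(high.mean() - low.mean()) / se > z_crit

    print(f"estimated power ~ {rejections / n_sims:.2f}")   # comes out at roughly 0.8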

10.1.5 Performing the experiment

Responses were measured at five stagnation levels. An examination of the descriptive statistics verifies that the stagnation level did not have a large effect on the response values and therefore the conclusions after a 250 iteration stagnation should be the same as after lower iteration stagnations. The two solution quality responses show a small but practically insignificant decrease in solution error

as the number of stagnation iterations is increased. The Time response increases with increasing stagnation iterations because the experiments take longer to run. For all cases, the level of stagnation iterations has little practically significant effect on the three responses. MMAS did not make any large improvements when allowed to run for longer. It is therefore sufficient to perform analyses at the 250 iteration stagnation level.

10.2 Analysis

10.2.1 ANOVA

Effects for each response model were selected using stepwise regression (Section A.4.1 on page 220) applied to a full 2 factor interaction model with an alpha-out threshold of 0.10. Some terms removed by backward selection were added back into the final model to preserve hierarchy. To make the data amenable to statistical analysis, a transformation of the responses was required for each analysis. The transformation was a log10 (Section A.4.3 on page 222) for all three responses. Outliers were deleted and the model building repeated until the models passed the usual diagnostics for the ANOVA assumptions of model fit, normality, constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221). 24 data points (2% of the total data) were removed when analysing Relative Error, 24 data points (2%) were removed when analysing ADA, and 10 data points (1%) were removed when analysing Time.
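The model-building step just described (a full two-factor-interaction model, a log10-transformed response and backward elimination with an alpha-out of 0.10) can be sketched as follows. This is an illustrative Python/statsmodels version, not the software used in the thesis; the data frame and factor names are simulated, and the hierarchy-preservation and outlier/diagnostic loop are omitted for brevity.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def backward_eliminate(df, response, factors, alpha_out=0.10):
        # Start from all main effects plus all two-factor interactions.
        terms = factors + [f"{a}:{b}" for i, a in enumerate(factors) for b in factors[i + 1:]]
        while True:
            formula = f"np.log10({response}) ~ " + " + ".join(terms)
            fit = smf.ols(formula, data=df).fit()
            pvals = fit.pvalues.drop("Intercept")
            worst = pvals.idxmax()
            if pvals[worst] <= alpha_out or len(terms) == 1:
                return fit, terms
            terms.remove(worst)          # drop the least significant term and refit

    # Illustrative use on simulated runs (the real data comes from the design above).
    rng = np.random.default_rng(1)
    runs = pd.DataFrame(rng.uniform(-1, 1, size=(200, 3)),
                        columns=["beta", "q0", "nnFraction"])
    runs["rel_error"] = 10 ** (1.0 + 0.8 * runs["q0"]
                               + 0.3 * runs["beta"] * runs["q0"]
                               + rng.normal(0, 0.1, 200))
    fit, kept = backward_eliminate(runs, "rel_error", ["beta", "q0", "nnFraction"])
    print(kept)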

10.2.2 Confirmation

Confirmation experiments were run according to the methodology detailed in Section 6.3.7 on page 114 in order to confirm the accuracy of the ANOVA models. The randomly chosen treatments produced actual algorithm responses with the descriptives listed in the following figure.

        Relative Error 250  ADA 250  Time 250
Max     107.25              37.62    4954
Min     0.27                0.79     1
Mean    7.21                5.25     239
StDev   15.35               8.00     599

Figure 10.2: Descriptive statistics for the confirmation of the MMAS screening ANOVA. The response data is from runs of the actual algorithm on the randomly generated confirmation treatments.

The large ranges of each response reinforce the motivation for correct parameter tuning as there is clearly a high cost in incorrectly tuned parameters. The following three figures illustrate the 95% prediction intervals and actual data for the three response models, Relative Error, ADA and Time respectively. The two solution quality responses are well predicted by their models. The models match the trends of the actual data, successfully picking up the extremely low



Figure 10.3: 95% Prediction intervals for the MMAS screening of Relative Error. The horizontal axis shows the randomly generated treatment number.


Figure 10.4: 95% Prediction intervals for the MMAS screening of ADA. The horizontal axis shows the randomly generated treatment number.


and extremely high response values which vary over a range of 107% for relative error and 37 for ADA.


Figure 10.5: 95% Prediction intervals for the MMAS screening of Time. The horizontal axis shows the randomly generated treatment number.

The time response is also well predicted by its model. This was achieved over a range of 5000s. The three ANOVA models are therefore satisfactory predictors of the three MMAS performance responses for factor values within 10% of the factor range limits listed in Section 10.1.2 on page 170.
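The 95% prediction intervals plotted in the three figures above can be generated directly from a fitted model. The sketch below is a Python/statsmodels illustration on simulated data (the column names and coefficients are invented); it fits a log10-transformed response, predicts at randomly generated confirmation treatments and back-transforms the interval limits to the original scale, which is how intervals such as those shown would be obtained.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    runs = pd.DataFrame(rng.uniform(-1, 1, size=(300, 2)), columns=["q0", "nnFraction"])
    runs["rel_error"] = 10 ** (0.5 + 0.7 * runs["q0"] + rng.normal(0, 0.1, 300))
    fit = smf.ols("np.log10(rel_error) ~ q0 + nnFraction", data=runs).fit()

    # Randomly generated confirmation treatments inside the design space.
    confirmation = pd.DataFrame(rng.uniform(-1, 1, size=(50, 2)), columns=["q0", "nnFraction"])
    frame = fit.get_prediction(confirmation).summary_frame(alpha=0.05)
    pi_low = 10 ** frame["obs_ci_lower"]     # back-transform from log10 to the original scale
    pi_high = 10 ** frame["obs_ci_upper"]
    print(pi_low.head(), pi_high.head())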

10.3 Results

The next figure gives a summary of the Sum of Squares ranking and the ANOVA F and p values for the three responses. Only the main effects are listed. Those main effects that rank in the top 12 are highlighted in bold. Rankings are based on the full two factor interaction model of 78 terms, before backward selection was applied.

Factor               Relative Error              ADA                         Time
                     Rank  F value    p value    Rank  F value   p value     Rank  F value    p value
A-alpha              23    59.80      < 0.0001   24    59.80     < 0.0001    6     342.47     < 0.0001
B-beta               3     1270.75    < 0.0001   3     1270.75   < 0.0001    40    5.79       0.0163
C-antsFraction       8     760.90     < 0.0001   8     760.90    < 0.0001    1     28012.34   < 0.0001
D-nnFraction         5     1006.33    < 0.0001   5     1006.33   < 0.0001    3     2941.57    < 0.0001
E-q0                 2     2378.04    < 0.0001   2     2378.04   < 0.0001    5     352.99     < 0.0001
F-rho                13    547.03     < 0.0001   13    547.03    < 0.0001    8     172.10     < 0.0001
G-reinitBranchFac    22    65.12      < 0.0001   23    65.12     < 0.0001    10    114.19     < 0.0001
H-reinitIters        16    111.65     < 0.0001   17    111.65    < 0.0001    63    0.97       0.3255
J-problemStDev       1     21133.33   < 0.0001   1     6690.14   < 0.0001    72    0.17       0.6833
K-problemSize        44    7.56       0.0061     16    120.56    < 0.0001    2     3489.94    < 0.0001
L-restartFreq        63    0.44       0.5078     63    0.44      0.5078      35    7.88       0.0051
M-antPlacement       53    2.97       0.0853     54    2.97      0.0853      50    2.61       0.1066

Figure 10.6: Summary of ANOVAs for Relative Error, ADA and Time for MMAS. Only the main effects are shown. Effects with a top 12 ranking are in bold. Rankings are based on the full two factor interaction model of 78 terms, before stepwise regression was applied.

The screening study of MMAS and the associated ANOVAs yield several important results and answers to open questions from the ACO literature.

10.3.1 Screened factors

Factor L-RestartFreq is statistically insignificant at the 0.05 level for the two quality responses but significant for the time response. The factor has low rankings across all three responses. M-AntPlacement is statistically insignificant at the 0.05 level for all three responses. It has a low ranking for all three responses, leading us to expect this factor to have little effect on performance. In summary, 2 factors are screened out: M-AntPlacement and L-RestartFreq.

10.3.2 Relative Importance of Factors

Of the two problem characteristics, the factor with the larger effect on solution quality is the problem standard deviation J-problemStDev. The problem size K-problemSize has a stronger effect on solution time than J-problemStDev. This is because MMAS takes longer to reach stagnation on larger problem instances than on smaller instances. Of the remaining unscreened tuning parameters, the heuristic exponent B-beta, the amount of ants C-antsFraction, the length of candidate lists D-nnFraction, the exploration/exploitation threshold E-q0 and F-Rho have the strongest effects on solution quality. The same is true for solution time except for B-beta, which has a low ranking for solution time. G-ReinitBranchFac is statistically significant for all three responses but is only ranked in the top third for the quality responses. It has a high ranking for Time. This highlights that G-ReinitBranchFac should be considered as a tuning parameter rather than being hard-coded, as is typically the case. Although statistically significant for all responses, A-Alpha only has a high ranking for solution Time. Alpha could possibly be considered for screening. These results are important because they highlight the most important tuning parameters and problem characteristics in terms of both performance dimensions of the heuristic compromise. These factors are the minimal set of design factors one should experiment with when modelling and tuning MMAS performance.

10.3.3 Adequacy of a Linear Model

A test for curvature as per Section 6.3.9 on page 116 shows there is a statistically significant amount of curvature in all three responses. This means that the linear model from the screening study is not adequate to explore the whole design space. A higher order model of all three responses is required. This is an important result because it confirms that a One-Factor-At-a-Time (OFAT) approach is insufficient for investigating the performance of MMAS.
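The centre-point curvature test referred to above compares the mean response at the factorial (corner) points with the mean at the centre points; a single-degree-of-freedom sum of squares is tested against the pure-error mean square. A minimal sketch, with illustrative numbers rather than the MMAS data:

    import numpy as np
    from scipy import stats

    def curvature_test(y_factorial, y_centre, mse, df_error):
        nF, nC = len(y_factorial), len(y_centre)
        diff = np.mean(y_factorial) - np.mean(y_centre)
        ss_curv = nF * nC * diff ** 2 / (nF + nC)   # single degree of freedom
        f = ss_curv / mse
        return f, stats.f.sf(f, 1, df_error)

    yF = np.array([7.2, 6.9, 8.1, 7.5, 6.4, 7.8])   # illustrative factorial-point responses
    yC = np.array([5.1, 5.3, 4.9, 5.2, 5.0, 5.1])   # illustrative centre-point responses
    print(curvature_test(yF, yC, mse=0.15, df_error=40))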


10.3.4 Comparison of Solution Quality Measures

It was already mentioned that two solution quality responses were measured to investigate whether the choice of one response over the other led to different conclusions. An examination of the ANOVA summaries reveals that both Relative Error and ADA have the same rankings (±1 places) and the same statistical significance or insignificance for all factors except K-problemSize. K-problemSize has a statistically significant effect on Relative Error only. This is due to the nature of how the two quality responses are calculated (Section 3.10.2 on page 69). This result shows that, for screening of MMAS, the choice of solution quality response will not affect the conclusions. However, ADA may be a preferable response as it exhibits a lower range and variability than Relative Error. The advantage of a lower variability was discussed in the context of statistical power (Section A.6 on page 225).

10.4 Conclusions and discussion

The following conclusions are drawn from the MMAS screening study. The first results concern tuning and design parameters that have no impact on MMAS performance. These screening and tuning conclusions apply for a significance level of 5% and a power of 80% to detect the effect sizes listed in Figure 10.1 on page 171. These effect sizes are a change in solution time of 187s, a change in Relative Error of 2.2% and a change in ADA of 1.5. Issues of power and effect size are discussed in Section A.6 on page 225.

1. Tuning Restart frequency tuning parameter not important. The number of iterations used in the restart frequency has no significant effect on MMAS performance in terms of solution quality or solution time. This is a highly unexpected result as the restart frequency is a fundamental feature of MMAS (Section 2.4.5 on page 39).

2. Tuning Ant placement design parameter not important. The type of ant placement has no significant effect on MMAS performance in terms of either solution quality or solution time. This was an open question in the literature. The result is remarkable because intuitively one would expect a random scatter of ants across the problem graph to explore a wider variety of possible solutions. This result shows that this is not the case. MMAS design can be fixed with either a random scatter method or a single node method of ant placement.

Other results were as follows.

3. Alpha only important for solution time. The choice of Alpha only significantly affects solution time. Although statistically significant for the quality responses, it has a low ranking.


4. Problem standard deviation is important. This confirms the main result of Chapter 7 in identifying a new TSP problem characteristic that has a significant effect on the difficulty of a problem for MMAS. ACO research should be reporting this characteristic in the literature.

5. Most important parameters. The rankings show that the most important tuning parameters affecting solution quality or solution time or both are beta, antsFraction, the length of candidate lists, the exploration/exploitation threshold and the pheromone update term rho.

6. Beta not important for solution time. The choice of Beta only affects solution quality and not solution time. It has always been known that Beta strongly affects solution quality.

7. New tuning parameter. The Reinitialisation Branching Factor tuning parameter has a strong effect on time and a moderate but statistically significant effect on quality.

8. Higher order model of MMAS behaviour needed. A higher order model, greater than linear, is required to model MMAS solution quality and MMAS solution time. This is an important result because it demonstrates for the first time that simple OFAT approaches seen in the literature are insufficient for accurately modelling and tuning MMAS performance.

9. Comparison of solution quality responses. There is no difference in conclusions from the ADA and Relative Error solution quality responses for MMAS. The ADA response is therefore preferable for screening because it exhibits a lower variability than Relative Error and therefore results in more powerful experiments.

The result regarding alpha confirms the literature's general recommendation that Alpha be set to 1 [47, p. 71]. It also contradicts Pellegrini et al.'s [96] analysis of the most important MMAS parameters, in which alpha was one of these parameters and the length of candidate list and the exploration/exploitation threshold were omitted. This result is all the more remarkable given that Pellegrini et al.'s research used the ACOTSP code and this study used the backwards-compatible JACOTSP code. It highlights the danger of using only intuitive reasoning to rank the importance of tuning parameters rather than a rigorous DOE approach with the fully instantiated algorithm. The ranking of the tuning parameters contradicts the claim of Doerr and Neumann [41, p. 38] about rho being the most important parameter affecting ACO solution run time. This thesis' screening study shows that, excluding problem characteristics, rho is in fact the 7th most important factor to affect solution time to stagnation for MMAS. The Reinitialisation Branching Factor is usually held constant (Section 2.4.5 on page 39) in reported research, but this study has shown that it is an important tuning parameter to be considered.


10.5 Chapter summary

This chapter presented a case study applying the methodology of Chapter 6 to the screening of the Max-Min Ant System (MMAS) heuristic. Many new results were presented. Existing recommendations in the literature were confirmed and other claims were refuted in a rigorous fashion. The next chapter will use the results of this screening to tune the performance of MMAS.


11 Case study: Tuning Max-Min Ant System

This chapter reports a case study on tuning the factors affecting the performance of a heuristic. The methodology for this case study was described in Chapter 6. The particular heuristic studied is Max-Min Ant System (MMAS) for the Travelling Salesperson Problem (Section 2.4.5 on page 39). This chapter reports many new results for MMAS. All analyses are conducted for two performance measures, quality of solution and solution time. This provides an accurate measure of the heuristic compromise that is rarely seen in the literature. It is shown that models of MMAS performance must be at least quadratic. MMAS is tuned using a full parameter set and a screened parameter set resulting from the case study of the previous chapter. This verifies that the screening decisions from the previous chapter are correct. The results reported in this chapter have been published in the literature [109].

11.1 Method

11.1.1 Response Variables

Three responses were measured as per Section 6.7.2 on page 127. These responses were percentage relative error from a known optimum (henceforth referred to as Relative Error), adjusted differential approximation (henceforth referred to as ADA) and solution time (henceforth referred to as Time).


11.1.2 Factors, levels and ranges

Design Factors In the full RSM, there were 12 design factors, as per the screening study of the previous chapter. The factors and their high and low levels are repeated in the following table for convenience.

Factor  Name             Type       Low      High
A       alpha            Numeric    1        13
B       beta             Numeric    1        13
C       antsFraction     Numeric    1.00     110.00
D       nnFraction       Numeric    1.00     20.00
E       q0               Numeric    0.01     0.99
F       rho              Numeric    0.01     0.99
G       reinitBranchFac  Numeric    0.50     2.00
H       reinitIters      Numeric    2        80
J       problemStDev     Numeric    10       70
K       problemSize      Numeric    300      500
L       restartFreq      Numeric    2        40
M       antPlacement     Categoric  random   same

Table 11.1: Design factors for the tuning study with MMAS. The factor ranges are also given.

Tuning parameters for the Screened model, based on the results of Section 10.3 on page 174, did not include L-RestartFreq and M-antPlacement. These two factors took on randomly chosen values within their range for each experiment run.

Held-Constant Factors The held-constant factors are as per Section 6.7.6 on page 127. There were additional held-constant factors for MMAS. The p term in the trail minimum update (Section 2.4.5 on page 39) was fixed at 0.05, as hard-coded in the original ACOTSP. The lambda value used in the Trail Reinitialisation of the daemon actions calculation (Section 2.4.5 on page 39) was also fixed at 0.05.

11.1.3 Instances

All TSP instances were of the symmetric type and were created as per Section 6.7.1 on page 126. The TSP problem instances ranged in size from 300 cities to 500 cities with cost matrix standard deviation ranging from 10 to 70. All instances had a mean of 100. The same instances were used for each replicate of a design point.


11.1.4 Experiment design, power and replicates

The experiment design for both models was a Minimum Run Resolution V Face-Centred Composite (Section A.3.4 on page 218) with six centre points. The work-up procedure was slightly different to that specified in Section 6.7.4 on page 127 due to a limitation on experimental resources. In this study, a target power of 80% and a significance level of 5% were fixed by convention. The number of replicates was fixed at 8 according to the number of feasible experiment runs that could be conducted with the given resources. The collected data was then examined to determine the smallest effect size that could be detected given these constraints. Fortunately, the variability of the data was low enough to permit a reasonable effect size to be detected with this power, significance level and number of replicates. The descriptive statistics for the full experiment and screened experiment data are given in the following two figures.

Iterations  Statistic                  Time      Relative Error  ADA
50          Mean                       110.68    9.16            4.73
            StDev                      455.98    16.75           6.62
            Max                        6493.67   105.23          43.27
            Min                        0.19      0.41            0.25
            Effect size (0.25 StDev)   113.99    4.19            1.66
100         Mean                       205.88    8.63            4.50
            StDev                      885.52    15.64           6.41
            Max                        11166.51  104.18          40.42
            Min                        0.32      0.39            0.16
            Effect size (0.25 StDev)   221.38    3.91            1.60
150         Mean                       274.53    8.17            4.38
            StDev                      1161.26   14.33           6.32
            Max                        16378.81  101.68          40.42
            Min                        0.44      0.37            0.16
            Effect size (0.25 StDev)   290.32    3.58            1.58
200         Mean                       355.73    8.00            4.32
            StDev                      1562.46   13.88           6.28
            Max                        20094.57  101.68          40.42
            Min                        0.58      0.37            0.16
            Effect size (0.25 StDev)   390.62    3.47            1.57
250         Mean                       426.54    7.84            4.27
            StDev                      1898.49   13.44           6.24
            Max                        25757.98  99.00           40.42
            Min                        0.71      0.37            0.12
            Effect size (0.25 StDev)   474.62    3.36            1.56

Figure 11.1: Descriptive statistics for the full MMAS experiment design. The actual detectable effect size of 0.25 standard deviations is shown for each response and for each stagnation point. There is little practical difference in effect size of the solution quality responses for an increase in the stagnation point.

The full design could achieve sufficient power while detecting an effect of size 0.25 standard deviations. The screened design could only achieve sufficient power while detecting an effect of size 0.41 standard deviations.


Iterations  Statistic                  Time      Relative Error  ADA
50          Mean                       88.71     7.97            5.20
            StDev                      334.78    10.45           7.74
            Max                        3983.14   96.23           42.54
            Min                        0.23      0.41            0.46
            Effect size (0.41 StDev)   137.26    4.29            3.17
100         Mean                       179.99    7.22            4.71
            StDev                      658.07    8.30            6.92
            Max                        6673.04   93.47           42.54
            Min                        0.41      0.41            0.46
            Effect size (0.41 StDev)   269.81    3.40            2.84
150         Mean                       284.49    6.81            4.48
            StDev                      1024.61   6.78            6.43
            Max                        10113.55  93.47           42.54
            Min                        0.56      0.41            0.46
            Effect size (0.41 StDev)   420.09    2.78            2.64
200         Mean                       348.11    6.45            4.35
            StDev                      1170.99   4.58            6.22
            Max                        11535.46  32.22           42.54
            Min                        0.69      0.41            0.46
            Effect size (0.41 StDev)   480.11    1.88            2.55
250         Mean                       398.42    6.40            4.33
            StDev                      1280.86   4.55            6.22
            Max                        11656.30  32.22           42.54
            Min                        0.83      0.41            0.46
            Effect size (0.41 StDev)   525.15    1.87            2.55

Figure 11.2: Descriptive statistics for the screened MMAS experiment design. The actual detectable effect size of 0.41 standard deviations is shown for each response and at each stagnation point. There is little practical difference in the detectable effect size of the solution quality responses for different stagnation iterations levels.


11.1.5 Performing the experiment

Responses were measured at five stagnation levels. An examination of the descriptive statistics verifies that the stagnation level did not have a large effect on the response values and therefore the conclusions after a 250 iteration stagnation should be the same as after lower iteration stagnations. The two solution quality responses show a small but practically insignificant decrease as the number of stagnation iterations is increased. The Time response increases with increasing stagnation iterations because the experiments take longer to run. For all cases, the level of stagnation iterations has little practically significant effect on the three responses. MMAS did not make any large improvements when allowed to run for longer. It is therefore sufficient to perform analyses of the full and screened models at the 250 iteration stagnation level.

11.2 Analysis

11.2.1 Fitting

A fit analysis was conducted for each response in the full experiment and the screened experiment. In both cases, at least a quadratic model was required to model the responses. For the Minimum Run Resolution V Face-Centred Composite, cubic models are aliased and so were not considered.
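A fit analysis of this kind amounts to comparing nested models. The following Python/statsmodels sketch, on simulated data with invented factor names, fits a linear and a quadratic model and applies a nested-model F-test; a small p-value indicates that at least a quadratic model is needed, as was found here.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    runs = pd.DataFrame(rng.uniform(-1, 1, size=(300, 2)), columns=["antsFraction", "q0"])
    runs["time_s"] = (30 + 20 * runs["antsFraction"] + 15 * runs["antsFraction"] ** 2
                      + rng.normal(0, 2, 300))

    linear = smf.ols("time_s ~ antsFraction + q0", data=runs).fit()
    quadratic = smf.ols("time_s ~ antsFraction + q0 + I(antsFraction**2) + I(q0**2)"
                        " + antsFraction:q0", data=runs).fit()
    print(sm.stats.anova_lm(linear, quadratic))   # low Pr(>F) => quadratic terms are needed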

11.2.2 ANOVA

Effects for each response model were selected using backward selection applied to a full quadratic model with an alpha threshold of 0.10. Some terms removed by backward selection were added back into the final model to preserve hierarchy. To make the data amenable to statistical analysis, a transformation of the responses was required for each analysis (Section A.4.3 on page 222). The transformation was a log10 in all but one case: for ADA in the ADA-Time model, a square root transformation with d=0.5 was used. Outliers were deleted and the model building repeated until the models passed the usual diagnostics for the ANOVA assumptions of model fit, normality, constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221). The next table summarises the outliers deleted from each analysis.

Experiment  Model               Number  % of runs
Full        ADA-Time            56      4
Screened    ADA-Time            50      8
Full        RelativeError-Time  41      3
Screened    RelativeError-Time  53      9

Table 11.2: Amount of outliers removed from MMAS tuning analyses. The percentage of the total runs deleted is greater in the screened experiments than the full experiments.


For both models, the screened experiments required more outliers to be deleted than the full experiments. The proportion of outliers (8% and 9%) for the two screened experiments may be cause for concern. This concern is addressed with some confirmation experiments.

11.2.3 Confirmation

Confirmation experiments were run according to the methodology detailed in Section 6.3.7 on page 114. The randomly chosen treatments produced actual algorithm responses with the descriptives listed in the following figure.

Iterations  Statistic  Time    Relative Error  ADA
100         Mean       65.30   5.89            3.62
            StDev      81.26   4.35            3.92
            Max        380.82  25.68           25.31
            Min        3.27    0.69            0.43
150         Mean       88.83   5.87            3.61
            StDev      103.21  4.35            3.91
            Max        474.20  25.68           25.31
            Min        4.64    0.69            0.43
200         Mean       112.13  5.84            3.59
            StDev      122.51  4.35            3.90
            Max        568.09  25.68           25.31
            Min        6.00    0.69            0.43
250         Mean       137.01  5.83            3.58
            StDev      149.95  4.35            3.90
            Max        662.35  25.68           25.31
            Min        7.37    0.69            0.43

Figure 11.3: Descriptive statistics for the MMAS confirmation experiments. This is the actual data produced from the randomly generated confirmation treatments. All three responses vary over wide ranges, highlighting the cost of incorrectly chosen tuning parameter settings.

The same confirmation runs were used for the screened and full experiments. This allows a direct comparison between the predictive capabilities of the models from the screened and full experiments. The following two figures compare the predictive capabilities of the full and screened RelativeError-Time models for the Relative Error response over fifty randomly generated treatments. Both models predict the relative error response well for almost all of the 50 treatments extending over a range of 25%. There are about three poorly performing predictions, two where the models overpredicted and one where the models underpredicted the actual MMAS performance. Both the full and screened models have the same shape, confirming the accuracy of the screening study in the previous chapter.



Figure 11.4: 95% prediction intervals of Relative Error by the full RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.


Figure 11.5: 95% prediction intervals of Relative Error by the screened RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.


The following two figures compare the predictive capabilities of the full and screened RelativeError-Time models for the Time response. Both the full and screened models of RelativeError-Time are excellent predictors of the Time response. All extreme points are well predicted. Both models have the same shape, confirming the accuracy of the screening study from the previous chapter.


Figure 11.6: Predictions of Time by the full RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.


Figure 11.7: Predictions of Time by the screened RelativeError-Time model of MMAS. The horizontal axis is the randomly generated treatment number.

The next two figures compare the predictive capabilities of the full and screened ADA-Time models for the ADA response. Both models are good predictors of ADA across the majority of confirmation treatments. The full model fails to predict the exact values of two peaks but does nonetheless identify them as peaks. The screened model has a slightly different shape to the full model. This may be due either to decisions in the modelling process or an incorrect screening decision. The next two figures compare the predictive capabilities of the full and screened ADA-Time models for the Time response. Both models are excellent predictors of the Time response over a range of thousands of seconds and there is little difference in the models' predictions.



Figure 11.8: 95% prediction intervals of ADA by the full ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.


Figure 11.9: 95% prediction intervals of ADA by the screened ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.


Figure 11.10: 95% prediction intervals of Time by the full ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.



Figure 11.11: 95% prediction intervals of Time by the screened ADA-Time model of MMAS. The horizontal axis is the randomly generated treatment number.

In general, we conclude that both models are good predictors of the responses and that the screening decisions from the screening study were correct. The concerns raised in the previous section regarding effect size and outlier deletion have been mitigated. The models can therefore be used to make recommendations on good tuning parameter settings for a given instance.

11.3 Results

11.3.1 Screening and relative importance of factors

The following two figures give the ranked ANOVAs of the Relative Error and Time models from the RelativeError-Time analysis. The terms have been rearranged in order of decreasing sum of squares so that the largest contributor to the models comes first. Looking first at the Relative Error rankings, we see that the least important main effects are A-alpha, M-antPlacement and H-reinitIters. The first two of these are the terms that were deemed unimportant in the screening study of the previous chapter. However, in the screening study it was L-restartFreq that was deemed unimportant rather than H-reinitIters. This may adversely affect the desirability optimisation of the next section. By far the most important terms are the main effects of the exploration/exploitation tuning parameter, the candidate list length tuning parameter and the exponent B-beta, as well as their interactions. This is a very important result because it shows that candidate list length, a parameter that we have often seen set at a fixed value or not used at all, is actually one of the most important parameters to set correctly. G-reinitBranchFac, which is not normally considered a parameter at all, is also very important for solution quality. Looking at the Time rankings, we see that antPlacement was completely removed from the model. The least important main effects were then L-restartFreq and H-reinitIters. These results also confirm the screening decisions.


[Ranked ANOVA table: 73 model terms listed with their Sum of Squares, F value and p value. The top-ranked terms are J-problemStDev, B-beta, E-q0 and the BE interaction; the lowest-ranked main effects include H-reinitIters, F-rho, M-antPlacement and A-alpha.]

Figure 11.12: RelativeError-Time ranked ANOVA of Relative Error response from full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.


[Ranked ANOVA table: 60 model terms listed with their Sum of Squares, F value and p value. The top-ranked terms are C-antsFraction, K-problemSize, D-nnFraction and E-q0; H-reinitIters and L-restartFreq are not statistically significant.]

Figure 11.13: RelativeError-Time ranked ANOVA of time response from full model. The table lists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

By far the most important tuning parameters for Time are the number of ants and the length of their candidate lists. This is quite intuitive, as the amount of processing is directly related to these parameters. The result regarding the cost of the number of ants is particularly important because the number of ants does not have a relatively strong effect on solution quality: the extra time cost of using more ants will not result in gains in solution quality. This contradicts the often-recommended parameter setting of making the number of ants equal to the problem size. An examination of the ranked ANOVAs from the ADA-Time model gives the same ranking of tuning parameter contributions to Time and ADA. As with the previous screening study, the choice of solution quality response does not change the conclusions about the relative importance of the tuning parameters.

11.3.2 Tuning

A desirability optimisation is performed as per Section 6.5.2 on page 122. The results from the full and screened RelativeError-Time models are presented in the following two figures. The rankings of the ANOVA terms have already highlighted the factors that have little effect on the responses. These screened factors can take on any values in the desirability optimisations. This is confirmed by examining the desirability recommendations from the full and screened models. The most important factors, which make up the screened model, have recommended settings that usually agree

[Table: recommended settings of alpha, beta, antsFraction, nnFraction, q0, rho, reinitBranchFac, reinitIters, restartFreq and antPlacement for each combination of problem size (300, 400, 500) and problem standard deviation (10, 40, 70), together with the predicted Time, Relative Error and Desirability.]

Figure 11.14: Full RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.

[Table: recommended settings of the screened factors (alpha, beta, antsFraction, nnFraction, q0, rho, reinitBranchFac, reinitIters) for each combination of problem size and problem standard deviation, together with the predicted Time, Relative Error and Desirability.]

Figure 11.15: Screened RelativeError-Time model results of desirability optimisation. The table lists the recommended parameter values for combinations of problem size and problem standard deviation. The expected time and relative error are listed with the desirability value.


with the recommended settings from the full model. For example, antsFraction and nnFraction are always low in both models. The remaining unimportant factors take on a variety of values in the full model desirability optimisation, for example antPlacement. The predicted values of both relative error and time from both models agree closely with one another. The quality of the desirability optimisation recommendations can now be evaluated.
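To make the mechanics of the desirability optimisation concrete, the sketch below shows a standard Derringer-style 'smaller is better' desirability and its geometric-mean combination for the two responses of the heuristic compromise. This is an illustrative sketch only: the numeric targets and upper limits, and the helper names smaller_is_better and overall_desirability, are hypothetical and do not reproduce the exact transformations used by the statistical analysis software in this case study.

```python
import math

def smaller_is_better(y, target, upper, weight=1.0):
    """Derringer-type desirability for a 'smaller is better' response.

    Returns 1 when y is at or below the target, 0 when y is at or above
    the upper limit, and a weighted interpolation in between.
    """
    if y <= target:
        return 1.0
    if y >= upper:
        return 0.0
    return ((upper - y) / (upper - target)) ** weight

def overall_desirability(desirabilities):
    """Combine individual desirabilities with a geometric mean."""
    product = math.prod(desirabilities)
    return product ** (1.0 / len(desirabilities))

# Hypothetical predictions for one candidate parameter setting:
# 1.2% relative error (target 0%, upper limit 5%) and a 4 s run time
# (target 1 s, upper limit 500 s).
d_error = smaller_is_better(1.2, target=0.0, upper=5.0)
d_time = smaller_is_better(4.0, target=1.0, upper=500.0)
print(overall_desirability([d_error, d_time]))
```

The numerical optimiser then searches the fitted response surfaces for the parameter settings that maximise this overall desirability, trading off quality and time in a single objective.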

11.3.3 Evaluation of tuned settings

The tuned parameter recommendations from the desirability optimisation are evaluated as per the methodology of Section 6.6 on page 124. Some illustrative plots are given in the following figures. On each plot, the horizontal axis lists the randomly generated treatments and the vertical axis lists the response value. Each plot contains the data for the response recorded using the settings from the desirability optimisation of the full and screened experiments. The responses produced by using parameter settings recommended in the literature and some randomly chosen parameter settings are also plotted.

[Plot: Relative Error after 250-iteration stagnation for the Full Desirability, Screened Desirability, Book and Random parameter settings across the randomly generated treatments.]

Figure 11.16: Evaluation of Relative Error response in the RelativeError-Time model of MMAS. Problems are of size 500 and standard deviation 10.

For Relative Error, the parameter settings from the full and screened models perform slightly better than the parameter settings from the literature. Interestingly, on a small number of occasions, randomly chosen settings perform better than all other settings. It is not until we examine the other side of the heuristic compromise that we see the advantage of the DOE approach to parameter tuning. Here, the results from the DOE desirability optimisation are two orders of magnitude better than the results from the settings recommended in the literature (Section 2.4.9 on page 45). For both models, there is little difference between the performance in terms of time or quality, supporting the conclusions made in the screening experiment of the previous chapter. As with ACS, results were quite different when evaluating the ADA-Time model.


[Plot: Time after 250-iteration stagnation for the Full Desirability, Screened Desirability, Book and Random parameter settings across the randomly generated treatments.]

Figure 11.17: Evaluation of the Time response in the RelativeError-Time model of MMAS. Problems are of size 500 and standard deviation 10.

In the next two figures we see that the parameter settings from the full and screened models outperform the settings from the literature by an order of magnitude in solution quality and three orders of magnitude in solution time. Again, the comparison of the performance of the DOE settings with the literature settings is not appropriate for ADA, as the literature settings were recommended in the context of relative error. However, this highlights the importance of not applying the literature-recommended settings without an understanding of the solution quality response. There is no practically significant difference between the parameter setting recommendations from the full and screened models. This indicates that the decisions from the screening study were correct.

[Plot: ADA after 250-iteration stagnation for the Full Desirability, Screened Desirability, Book and Random parameter settings across the randomly generated treatments.]

Figure 11.18: Evaluation of the ADA response in the ADA-Time model of MMAS. Problems are of size 300 and standard deviation 10.


[Plot: Time after 250-iteration stagnation for the Full Desirability, Screened Desirability, Book and Random parameter settings across the randomly generated treatments.]

Figure 11.19: Evaluation of the Time response in the ADA-Time model of MMAS. Problems are of size 300 and standard deviation 10.

Not all parameter recommendations performed so well. The next figure shows the evaluation of the Relative Error performance from the RelativeError-Time model for problems of size 400 and standard deviation 70. The parameter settings from the literature perform as well as or better than those obtained with the DOE approach. Randomly chosen parameter settings perform better than all others on many occasions.

[Plot: Relative Error after 250-iteration stagnation for the Full Desirability, Screened Desirability, Book and Random parameter settings across the randomly generated treatments.]

Figure 11.20: Evaluation of Relative Error response in the RelativeError-Time model of MMAS. Problems are of size 400 and standard deviation 70.

Similar results of poor relative error performance despite excellent time performance were also observed for other combinations of problem size and standard deviation. This was not an issue with ACS and may be due to the more complicated nature of MMAS or an interaction between the stagnation stopping criterion and the less aggressive behaviour of MMAS relative to ACS.


11.4 Conclusions and discussion

The following conclusions are drawn from the MMAS tuning study. These screening and tuning conclusions apply for a significance level of 5% and a power of 80% to detect the effect sizes listed in Figure 11.1 on page 181. These effect sizes are a change in solution time of 474 s, a change in Relative Error of 3.36% and a change in ADA of 1.56. Issues of power and effect size are discussed in Section A.6 on page 225.

• Tuning Ant Placement not important. This factor had a very low ranking in both full models. The screening study correctly identified the lack of importance of Ant Placement.

• Tuning Restart Frequency not important. The number of iterations used in the restart frequency has no significant effect on MMAS performance in terms of solution quality or solution time. This is a highly unexpected result as the restart frequency is a fundamental feature of MMAS (Section 2.4.5 on page 39).

• Alpha only important for solution time. The choice of Alpha only affects solution time. Although statistically significant for the quality responses, it has a low ranking. This confirms the literature's general recommendation that Alpha be set to 1.

• Beta not important for solution time. The choice of Beta only affects solution quality and not solution time. This is a new result in the ACO literature.

• Reinitialisation Branching Factor is important. This parameter has a strong effect on Time and a moderate but statistically significant effect on quality. This highlights the importance of tuning this parameter, which is usually held constant.

• Sufficient order model. A model of at least quadratic order is required to model MMAS solution quality and MMAS solution time. This confirms the result of the screening study that simple OFAT approaches seen in the literature are insufficient for accurately modelling and tuning MMAS performance.

• Relationship between tuning, problems and performance. Both the RelativeError-Time and ADA-Time models were good predictors of MMAS performance across the entire design space. The prediction intervals for the full and screened models were very similar, confirming that the decisions from the screening study were correct. However, a ranking of the full model terms from the tuning study suggests it may not have been appropriate to screen out restart frequency.

• Tuned parameter settings. There was little similarity between the recommended tuned parameter settings from the full and screened models, but both settings resulted in similar MMAS performance. This indicates that there may be many combinations of parameter settings that give similar performance for MMAS. Finding one of these settings is important, however, as the recommended settings resulted in similar or better solution quality than random settings and two orders of magnitude better time than random settings. There are immense performance gains to be achieved. For some combinations of problem size and problem standard deviation, the RelativeError-Time recommendations resulted in worse solution quality than some randomly chosen parameter settings. Solution time was nonetheless improved. These poor recommendations may be due to the complex nature of MMAS, which may be difficult to model with an interpolative DOE approach. This complexity is particularly evident in the daemon actions phase of MMAS (Section 2.4.5 on page 39) where reinitialisations occur. This 'restart' nature of MMAS may be difficult to model with the DOE approach.

11.5 Chapter summary

This chapter presented a case study applying the methodology of Chapter 6 to the tuning of the Max-Min Ant System (MMAS) heuristic. Many new results were presented and existing recommendations in the literature were confirmed in a rigorous fashion. The conclusions of the screening study in the previous chapter were also confirmed.


12 Conclusions

This thesis presents a rigorous Design Of Experiments (DOE) approach for the tuning of heuristic algorithms for Combinatorial Optimisation (CO). The thesis therefore draws on well-established fields such as Operations Research, Design Of Experiments, Empirical Analysis and Empirical Methodology and contributes a much-needed rigour [64, 103, 48, 65] to fields such as Heuristics, Metaheuristics, and Ant Colony Optimisation. This chapter summarises the thesis. It begins with a brief overview of the main problem that the thesis addresses. It then examines the advantages of the approach taken in the thesis, leading to the main hypothesis of the work. The contributions from this thesis are listed along with the thesis strengths and limitations. The chapter closes with a discussion of possibilities for future work.

12.1 Overview

CO is a ubiquitous and extremely important type of optimisation problem that occurs in scheduling, planning, timetabling and routing. However, these are typically very difficult problems to solve exactly and so are generally tackled with approximate (heuristic) approaches and more general frameworks of heuristic approaches called metaheuristics. Heuristics sacrifice finding an exact solution to a problem and instead find an approximate solution in reasonable time. They therefore inherently involve a tradeoff between solution quality and solution time called the heuristic compromise. Popular and successful metaheuristics include Evolutionary Computation, Tabu Search, Iterated Local Search and Ant Colony Optimisation. The flexibility of metaheuristics to deal with a range of problems comes at a cost: their application to a particular problem typically requires setting the values of a large number of tuning parameters. These tuning parameters are inputs to the metaheuristic that govern its behaviour. Exploring this large parameter space and relating it to problem characteristics and performance in terms of the heuristic


compromise is called the parameter tuning problem. Methodically solving the parameter tuning problem for metaheuristics is probably the single most important obstacle to producing metaheuristics that are well understood scientifically and easily applied in practical scenarios.

12.2 Advantages of DOE

This thesis takes an empirical approach to parameter tuning, adapting methodologies, experiment designs and analyses from the field of Design Of Experiments [84, 85]. It therefore fits between analytical and automated tuning approaches. It efficiently provides the raw data and trends to which the analytical camp should fit their models. It recommends good quality parameter tuning settings against which the automated tuning camp can compare their tuners' recommendations.

DOE was chosen for several reasons. It is well-established theoretically and well supported in terms of software tools. This means it should be relatively easy to convince metaheuristics researchers of the thesis' methodology and to encourage them to adopt it. While a familiarity with statistical analyses and their interpretation is of course required, many of the more involved aspects can be left to the statistical analysis software. This is certainly the case in other fields such as psychology and medicine where researchers have a tradition of following good practice in experiment design, data collection and data analysis without necessarily being expert statisticians. The traditional areas to which DOE is applied in engineering map almost directly to the common research questions that one asks in metaheuristics research so DOE's power and maturity can be transferred directly to metaheuristics research. DOE offers efficiency in terms of the amount of data that needs to be gathered. This is critical when attempting to understand immense design spaces. All DOE conclusions are based on statistical analyses and so are supported with mathematical precision. This allays any concerns regarding subjective interpretation of results.

The main advantage of DOE over automated approaches is that it produces a model of the data. If the model is verified to be a good quality model then it can be used to explore, understand and ask questions of the actual algorithm performance. Automated approaches [12, 8] provide relatively fast solutions but do not provide this power to explore the parameter-performance relationship. They generally provide only a tuned solution. Understanding where this solution came from requires understanding the complex tuning algorithm code and its dynamics. Understanding the results from a DOE analysis requires only understanding a clear and well-established methodology. While some application scenarios will not require this understanding, scientific reproducibility certainly demands it. These advantages of DOE over alternatives for tuning metaheuristics lead to the thesis hypothesis.


12.3 Hypothesis

The main hypothesis of this thesis is:

The problem of tuning a metaheuristic can be successfully addressed with a Design Of Experiments approach.

This hypothesis was tested in a multitude of ways. DOE nested designs were used to investigate the importance of problem characteristics on heuristic performance. DOE screening designs were used to rank the importance of tuning parameters and problem characteristics. DOE response surface models were used to model the relationship between tuning parameters, problem characteristics and performance. Desirability functions were used to tune performance while simultaneously addressing the two aspects of the heuristic compromise. The key finding was that the Design of Experiments approach is indeed an excellent method for modelling and tuning metaheuristics like Ant Colony Optimisation. This was demonstrated with independent confirmation experiments and with comparisons to tuning parameter settings taken from the literature. This was recognised by the community through peer-reviewed publications at the field's main conferences with best paper nominations and best paper awards [105, 106, 110, 108, 104, 109]. This result is now covered in more detail.

12.4 Summary of main thesis contributions

The following is a summary of the main contributions from this thesis.

1. Synthesis. The poor quality of methodology in the metaheuristics field is probably due to a lack of awareness of the issues involved. Chapter 3 surveyed the literature to gather together these issues as they have been discussed over the past 30 years in fields such as heuristics and operations research. Issues covered included:

(a) the types of research question and study,
(b) the stages in sound experiment design and some common mistakes,
(c) the issues of heuristic instantiation and problem abstraction,
(d) the importance of pilot studies,
(e) reproducibility of results,
(f) benchmarking of machines,
(g) the advantages and disadvantages of various performance responses,
(h) random number generators,
(i) problem instances,
(j) stopping criteria, and

(k) interpretative bias.

This thesis strived to apply best practice regarding these issues and should serve as a much-needed illustration of good experiment design and analysis.

2. Heuristic code JACOTSP. All experiments were run with our Java version (JACOTSP) of the original C source code (ACOTSP) accompanying the literature [47]. JACOTSP was informally verified to produce the same behaviour as ACOTSP by comparing their outputs on a variety of problem instances and a variety of tuning parameter settings. JACOTSP also offers the usual advantages of an Object-Oriented (OO) design, namely extensibility and reuse. It is intended to make JACOTSP available to the community, creating a focal point for more reproducible research with ACO.

3. Problem generator code Jportmgen. All problem instances were generated with our Java port (Jportmgen) of a generator used in a large open competition in the optimisation community (portmgen) [58]. Jportmgen was informally verified to produce the same instances as the original portmgen. Again, the OO design permitted instances to be generated with edge lengths that follow a plugged-in distribution. This was important for experiments with problem characteristics.

4. Methodology for investigating if a problem characteristic affects performance. A detailed methodology was presented for determining whether a given problem characteristic affects heuristic performance. This involved the introduction of the nested design and its analysis to the ACO field. The methodology was illustrated with a published case study of ACS and MMAS [105, 110]. This method and experiment design are now attracting attention in the stochastic local search field [6].

5. Methodology for screening tuning parameters and problem characteristics. A detailed methodology and efficient experiment designs were presented for ranking the most important tuning parameters and problem characteristics that affect performance and for screening out those that do not affect performance. This methodology was published along with illustrative case studies of its application [108, 104]. A further methodology was presented for independently confirming the accuracy of the screening model and its recommendations. The thesis is the first use of fractional factorial designs (Section A.3.2 on page 215) for screening ACO tuning parameters and problem characteristics.

6. Methodology for modelling the relationship between tuning parameters, problem characteristics and performance. A detailed methodology and efficient experiment designs were presented for modelling the relationship between tuning parameters, problem characteristics and performance. This methodology was published along with illustrative case studies of its application [106, 109]. The thesis is the first use of response surface models and fractional factorial designs for modelling ACO.

CHAPTER 12. CONCLUSIONS 7. Desirability functions and emphasis on the heuristic compromise. Throughout the thesis, the heuristic compromise has been emphasised. The multiobjective problem of reducing running time while improving solution quality was tackled by the introduction of desirability functions. Using desirability functions and tuning heuristic desirability does not exclude the analysis of heuristic solution time and heuristic solution quality separately. It is a convenient approach to deal with the heuristic compromise. Confirmation experiments demonstrated that tuning parameter settings found with the desirability approach offered orders of magnitude savings in solution time over parameters taken from the literature. 8. A new important problem characteristic for ACO. The analysis of problem difficulty from the case study in Chapter 7 showed that the standard deviation of edge lengths in a TSP instance has a significant effect on problem difficulty for ACS and MMAS. This means that research should report the standard deviation of instances. This result was confirmed in subsequent screening and modelling case studies in which it was shown that problem instance standard deviation had a very large effect on solution quality. This may extend to other metaheuristics. 9. New results from screening ACS and MMAS. The screening experiments answered many open questions regarding the importance of various tuning parameters and problem characteristics. From screening ACS, it was shown that: • Tuning Ant placement not important. The type of ant placement has no significant effect on ACS performance in terms of solution quality or solution time. This was an open question in the literature. It is remarkable because intuitively one would expect a random scatter of ants across the problem graph to explore a wider variety of possible solutions. This result shows that this is not the case. • Tuning Alpha not important. Alpha has no significant effect on ACS performance in terms of solution quality or solution time. This confirms the common recommendation in the literature of setting alpha equal to 1. An OFAT analysis of alpha for ACS is reported in Appendix D. • Tuning Rho not important. Rho has no significant effect on ACS performance in terms of solution quality or solution time. This is a new result for ACS. It is a surprising result since Rho is a term in the ACS update pheromone equations and analytical approaches in very simplified scenarios have concluded that rho is important [41]. • Tuning Pheromone Update Ant not important. The ant used for pheromone updates is practically insignificant for all three responses. An examination of the plot of time for the K-pheromoneUpdate factor shows that the effect on time is not practically significant. K-pheromoneUpdate can therefore be screened out.


• Most important tuning parameters. The most important ACS tuning parameters are the heuristic exponent B-beta, the number of ants C-antsFraction, the length of candidate lists D-nnFraction and the exploration/exploitation threshold E-q0.

• Problem standard deviation is important. This confirms the main result of Chapter 7 in identifying a new TSP problem characteristic that has a significant effect on the difficulty of a problem for ACS. ACO research should be reporting this characteristic in the literature.

• Higher order model needed. A higher order model, greater than linear, is required to model ACS solution quality and ACS solution time. This is an important result because it demonstrates for the first time that simple OFAT approaches seen in the literature are insufficient for accurately tuning ACS performance.

• Comparison of solution quality responses. There is no difference in conclusions from the ADA and Relative Error solution quality responses. ADA has a slightly smaller variability and so results in more powerful experiments than Relative Error.

From screening MMAS, it was shown that:

(a) Tuning Restart Frequency not important. The tuning parameter Restart Frequency is statistically insignificant for solution quality and solution time in the factor ranges experimented with. It may, however, become important when very high solution quality is required.

(b) Tuning Ant Placement not important. As with ACS, the design parameter Ant Placement does not have a significant effect on solution quality or solution time. Either random scatter or single random node placement can be used when placing ants on the TSP graph.

(c) Tuning Alpha only important for solution time. The choice of the Alpha tuning parameter value only affects solution time. Although statistically significant for the quality responses, it has a low ranking. This confirms the literature's general recommendation that Alpha be set to 1 [47, p. 71].

(d) Problem difficulty results confirmed. The result of the study of problem characteristics affecting performance was confirmed, with problem edge length standard deviation having a very strong effect on solution quality for MMAS.

(e) Important tuning parameters. Of the remaining unscreened tuning parameters, the heuristic exponent Beta, the number of ants antsFraction, the length of candidate lists nnFraction, the exploration/exploitation threshold q0 and the pheromone decay term Rho have the strongest effects on solution quality. The same is true for solution time, except for Beta, which has a low ranking for solution time.


(f) New parameter. Reinitialisation Branching Factor (ReinitBranchFac) is statistically significant for all three responses but is only ranked in the top third for the quality responses. It has a high ranking for Time. This highlights that ReinitBranchFac should be considered as a tuning parameter rather than being hard-coded, as is typically the case.

(g) Higher order model of MMAS behaviour needed. A higher order model, greater than linear, is required to model MMAS solution quality and MMAS solution time. This is an important result because it demonstrates for the first time that simple OFAT approaches seen in the literature are insufficient for accurately tuning MMAS performance.

(h) Comparison of solution quality responses. There is no difference in conclusions from the ADA and Relative Error solution quality responses for MMAS. The ADA response is therefore preferable for screening because it exhibits a lower variability than Relative Error and therefore results in more powerful experiments.

The modelling case studies in Chapters 9 and 11 both confirmed results from the earlier screening case studies and yielded new results in their own right.

• Confirmation of screening study results. The models of ACS and MMAS performance were built using the full set of tuning parameters and the reduced set resulting from the previous screening experiments. Both the full and reduced models were good predictors of performance in terms of both solution quality and solution time. This confirmed the accuracy of the previous screening studies' recommendations. In general, the ranking of the importance of the tuning parameters was in broad agreement with the ranking from the screening study. Some small differences are to be expected because the screening analyses are conducted on each response separately while the modelling analyses are conducted on the responses simultaneously. These results confirm that screening studies for ACO can be trusted for screening out parameters that do not affect performance and so reducing the parameter space to explore in more expensive modelling studies.

• Possibility of multiple regions of interest. The recommended parameter settings from the desirability optimisation of full and screened models were not the same for MMAS. However, when the recommendations were independently evaluated, both gave similarly competitive solution qualities and similarly huge savings in solution time on new problem instances. This highlights the possibility of multiple regions of interest in the parameter space of MMAS. This is important because it confirms the futility of attempting to recommend 'optimal' parameter settings. It also illustrates the more complicated parameter-performance relationship that emerges when one considers the two aspects of the heuristic compromise simultaneously. There is probably more than one region


in the parameter space where a similar compromise in solution quality and solution time can be found.

• Quadratic models needed. Fit analyses for the ACS and MMAS response surface models showed that a surface of order at least quadratic is required to model these metaheuristics. This rules out the use of OFAT approaches (Section 2.5 on page 47) for tuning these heuristics. The quadratic models were independently confirmed to be good predictors of performance across the parameter space. It is an open question whether higher order models and the associated increase in experiment expense would yield even better predictions of performance.

Note that all the aforementioned results were obtained at a significance level of 5% and the largest effect size that could be detected with a power of 80%. Please refer to the individual case studies for the details of these effect sizes. Note that these effect sizes were limited by the number of replicates that could be run with the available experimental resources. In an ideal situation, the experimenter would determine the effect size from the experimental objectives and then increase the replicates until sufficient power was achieved.

12.5 Thesis strengths

12.5.1 Rigour and efficiency

The strengths of this thesis come from the strengths of DOE (Section 2.5 on page 47). The thesis' methodologies are adapted from well-established and tested methodologies used in other fields such as manufacturing. They are therefore proven by decades of scientific and industrial experience. The experiment designs allow for a very efficient use of experiment resources while still obtaining all of the most important information from the data. Until this thesis, there has been little awareness of the potential of these designs in the ACO literature. Their efficiency is critically important when experiments are expensive due to large parameter spaces and difficult problem instances. In particular, the designs provide a vast saving in experiment runs (Section A.3.3 on page 218).

Because DOE and Response Surface Models build a model of performance across the whole design space, many research questions can be explored. Numerical optimisation of this surface can recommend tuning parameter settings for different weightings of the responses of interest. One may obtain settings appropriate for long run times and high quality or short run times and lower levels of solution quality. All of these questions are answered on the same model without the need to rerun experiments.

12.5.2 Generalizability

The methodology and results from this thesis are of interest to both those who design heuristics and engineers who wish to deploy a heuristic. Designers can use the thesis methodology to rank the contribution of new additions to a heuristic


(design parameters) as well as to understand and model the contribution of tuning parameters to changes in performance. DOE provides a rigorous approach for testing hypotheses about a new heuristic, categorically determining whether new techniques/components make a significant impact on performance. For engineers, DOE provides a verifiably accurate model of behaviour, allowing the heuristic to be quickly retuned to new problem instances without running a large set of new experiments. Although all case studies illustrate the application of the thesis' methodologies to ACO, there is no reason why the methodologies cannot be applied to other heuristics.

12.5.3 Reproducibility and empirical best practice

An effort has been made throughout the thesis to address the methodological issues raised and discussed in Chapter 3. The thesis uses algorithm and problem generator code that is backwards compatible with codes commonly used in the field. This strengthens the reproducibility of its results and makes its conclusions applicable to all previous work that has used these codes. Experiment machines were properly benchmarked according to an established procedure [58]. This improves the thesis’ reproducibility and applicability for all subsequent research work that may refer to this thesis.

12.6 Thesis limitations

DOE is not a panacea for the myriad difficulties that arise in the empirical analysis of heuristics, although it goes a long way towards overcoming many of them. The conclusions and contributions of this thesis are necessarily limited in a few ways.

• Computational expense. Despite the efficiency of the DOE designs that this thesis introduced, running enough experiments to gather sufficient data is still computationally expensive. Of course, the experiments would have been orders of magnitude more expensive had a less sophisticated approach been used. This expense is increased when there are many categorical tuning parameters, due to the nature of how the designs are built. However, any expense is mitigated by the amount of useful and structured information obtained. DOE yields a full model of the data that can be explored in many ways and used to make new predictions about heuristic performance across the entire design space. The DOE designs and methods in this thesis are the state-of-the-art approach for building such models. If a user is concerned about quickly tuning parameters in a one-off scenario, an automated approach may be a preferable alternative. Recall, however, that use of an automated method implies that the user is content to trust its black-box approach and requires no understanding of the parameter, problem and performance relationship.


• Categorical factors. The previous point mentioned how categorical tuning parameters increase the size, and consequently the expense, of the experiments. It must also be pointed out that the two-level fractional factorial designs used in screening allow only two values for each factor. This is not a severe limitation when one considers the main motivation of a screening design: to determine whether a factor should be included in the more expensive Response Surface Model design.

• Nested parameters. A parameter type that we term a nested parameter arose in the analysis of MMAS parameters in Section 2.4.9 on page 45. These are parameters that only make sense within their parent parameter. Factorial experiment designs cannot analyse these types of parameters directly. Values of the nested parameter and its parent could be lumped into a single categorical parameter. Our summary of tuning parameters in Section 2.4.9 on page 45 suggests that these types of parameter might not be so common anyway.

12.7 Future work

The research presented in this thesis should be developed and extended along the following lines.

• Further ACO algorithms. The original ACOTSP and our JACOTSP contain further ACO algorithms: Rank-based Ant System [24], Elitist Ant System and Best-Worst Ant System [31]. These could easily be investigated with the thesis methodologies to see whether similar results are obtained regarding the importance of various tuning parameters. It would also be of interest to extend these algorithms with local search.

• Comparison to OFAT. It would strengthen the argument for the use of DOE in favour of OFAT if a comprehensive comparison of the two methods were conducted, as has been done in other fields [37].

• Comparison to other tuning methods. It would be interesting to compare DOE to the results from other tuning approaches such as automated tuning.

• Further heuristics. The methodology could also be applied to other heuristics. Screening studies, for example, have already been independently applied to Particle Swarm Optimisation [74].

• Useful tools. A strength of DOE is its support in software. There is no excuse for the metaheuristics practitioner to claim that statistics are too time-consuming or complicated to use (Section 2.5 on page 47). Modern statistical analysis software shields the user from much of this complexity. A greater awareness of this software, and tutorials on how to use it with metaheuristics, is urgently needed.


12.8 Closing

It is hoped that this thesis has convinced the reader of the merits of the DOE approach when applied to the problem of tuning metaheuristics. The parameter tuning problem is ubiquitous in the field and must be tackled in every new piece of metaheuristics research. The methodologies of this thesis, or some appropriate adaptation of them, should be used when setting up ACO heuristics. There is no longer any excuse for inheriting values from other publications or for fuzzy reasoning with words and intuition about the parameters that need to be tuned. Adopting this thesis' methodologies will add to the expense of metaheuristics experiments. However, this is the unavoidable reality of dealing with metaheuristics with a large number of tuning parameters. The researcher who embraces this thesis' methodologies will have at their disposal an established, efficient, rigorous and reproducible approach for making strong conclusions about the relationship between metaheuristic tuning parameters, problem characteristics and performance.


Part V

Appendices


A Design Of Experiments (DOE)

This appendix provides a basic background on the main Design Of Experiments (DOE) and statistics concepts used in this thesis and introduced in Section 2.5 on page 47. The material in this appendix is adapted and compiled from the literature [84, 1, 89, 85] for the reader's convenience and is not intended to replace a detailed study of those texts. The appendix begins with an introduction to the terminology for the DOE field. It then provides a short explanation of the various DOE topics referred to in the thesis.

A.1 Terminology

The following terms are encountered frequently in the design and analysis of experiments.

A.1.1 Response variable

The response variable is the measured variable of interest. In the analysis of metaheuristics, one typically measures the solution quality and solution time required by a heuristic, as these are reflections of the heuristic compromise. The DOE approach can be used in heuristic design as well as heuristic performance analysis and so the choice of response variable is limited only by the experimenter's imagination. In some cases, it may be appropriate to measure, for example, the frequency of some internal heuristic operation.

A.1.2 Factors and Levels

A factor is an independent variable manipulated in an experiment because it is thought to affect one or more of the response variables. The various values at which the factor is set are known as its levels. In heuristic performance analysis,


the factors include both the heuristic tuning parameters and the most important problem characteristics. These are sometimes distinguished by referring to them as tuning factors and problem factors respectively. Factors can also be new heuristic components that are hypothesised to improve performance. Sometimes factors are distinguished as being either design factors (or primary factors) or held-constant factors (or secondary factors). Design factors are those factors that are being studied because we are interested in their effects on the responses. Held-constant factors are those factors that are known to affect the responses but are not of interest in the present study. They should be held at a constant value throughout all experiments.

A.1.3 Treatments

A treatment is a specific combination of factor levels. The particular treatments will depend on the particular experiment design and on the ranges over which factors are varied.

A.1.4 Replication

Replicates are repeated runs of a treatment. Replicates are needed when a studied process produces different response measurements for identical runs of a treatment. This is always the case with stochastic heuristics. The number of replicates required in an experiment is linked to the statistical concept of power discussed later.

A.1.5 Effects

An effect is a change in the response variable due to a change in one or more factors. We can define main effects as follows:

The main effect of a factor is a measure of the change in the response variable in response to changes in the level of the factor, averaged over all levels of all the other factors. [89]

Higher order effects (or interactions) are effects that occur when the combined change in two factors produces an effect greater (or less) than the sum of the effects expected from either factor alone. An interaction occurs when the effect of one factor depends on the level of another factor. A second order effect is due to two factors, a third order effect to three, and so on.
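As a small illustration of these definitions, the sketch below computes the main effects and the two-factor interaction from a toy 2^2 factorial with coded -1/+1 levels. The response values and the helper function are hypothetical; they simply show the "average at the high level minus average at the low level" arithmetic described above.

```python
# Toy 2^2 factorial: factors A and B in coded -1/+1 units, one response per treatment.
runs = [
    {"A": -1, "B": -1, "y": 50.0},
    {"A": +1, "B": -1, "y": 62.0},
    {"A": -1, "B": +1, "y": 54.0},
    {"A": +1, "B": +1, "y": 90.0},
]

def effect(runs, contrast):
    """Average response where the contrast is +1 minus the average where it is -1."""
    high = [r["y"] for r in runs if contrast(r) > 0]
    low = [r["y"] for r in runs if contrast(r) < 0]
    return sum(high) / len(high) - sum(low) / len(low)

main_A = effect(runs, lambda r: r["A"])                  # 24.0
main_B = effect(runs, lambda r: r["B"])                  # 16.0
interaction_AB = effect(runs, lambda r: r["A"] * r["B"])  # 12.0
```

The non-zero AB interaction in this toy data shows exactly the situation described above: the effect of A depends on the level at which B is set.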

A.1.6 Confounding

Two or more effects are said to be confounded if it is impossible to separate the effects when the subsequent statistical analysis is performed. This is best described with an example. A computer scientist has developed a new algorithm and wishes to compare it with an established algorithm. He has two machines available to him. The

established algorithm will be run on one machine and the experimental algorithm on the other. The characteristic to be measured as an index of performance will be the run time of the algorithm to solve a particular problem. However, when the two run times are compared, it is impossible to say how much of the difference is due to the algorithms and how much is due to inherent differences (age, operating system version, memory) between the two machines. The effects of algorithm and machine are thus confounded. Confounding is due to poor experimental planning and execution, particularly poor control of factors. It is important to stress the difference between confounding and aliasing. Aliasing is an inability to distinguish several effects due to the nature of the experiment design rather than poor execution. It is a deliberate and known price that we pay for using more efficient designs such as fractional factorials, as discussed later.

A.2 Regions of operability and interest

There are two regions within an experimental design space [85]. The region of operability is the region in which the equipment, process etc. works and in which it is theoretically possible to conduct an experiment and measure responses. In ACO, the region of operability is sometimes bounded, as with the pheromone decay parameter ρ, which must lie in the range 0 < ρ < 1. For other tuning parameters, such as α, the region of operability is, in theory, unbounded. Within this region of operability, there may be one or more regions of interest. A region of interest is a region to which an experimental design is confined. The region of interest is typically chosen because we believe it contains the optimal process settings. These regions are illustrated schematically in the following figure.


Figure A.1: Region of operability and region of interest (adapted from [85]).

The region of operability is often not known until the process has been well studied and may change depending on circumstances. Myers and Montgomery [85] offer the following comments on the difficulty of choosing an experimental region of interest.

. . . in many situations the region of interest (or perhaps even the region of operability) is not clear cut. Mistakes are often made and adjustments adopted in future experiments. Confusion regarding type of design should never be an excuse for not using designed experiments. Using [a design for the wrong type of region] for example, will still provide important information that will, among other things, lead to more educated selection of regions for future experiments. [85, p. 317]

A.3 Experiment Designs

There are many experiment designs to choose from. The design an experimenter uses will depend on many factors, including the particular research question, whether experiments are in the early stages of research and the experimental resources available. This section focuses on the advanced designs that appear in this thesis. It begins with a simpler, more common design as this provides the necessary background for understanding the subsequent designs.

A.3.1 Full and 2^k Factorial Designs

A full factorial design consists of a crossing of all levels of all factors. The number of levels of each factor can be two or more and need not be the same for each factor. These levels may be quantitative (scalar), such as values of the pheromone decay constant; or they may be qualitative, such as types of algorithm. This is an extremely powerful but expensive design. A more useful type of factorial for DOE uses k factors, each at only 2 levels. The so-called 2^k factorial design provides the smallest number of runs with which k factors can be studied in a full factorial design. Factorials have some particular advantages and disadvantages [89]. These are worth noting given the importance that factorials play in experimental design. The advantages are that:

• greater efficiency is achieved in the use of available experimental resources in comparison to what could be learned from the same number of experiment runs in a less structured context such as an OFAT analysis [37],

• information is obtained about the interactions, if any, of factors because the factor levels are all crossed with one another, and

• results are more comprehensive over a wider range of conditions due to the combining of factor levels in one experiment.

Of course, these advantages come at a price. As the number of factors grows, the number of treatments in a 2^k design rapidly overwhelms the experiment resources. Consider the case of 10 continuous factors. A naïve full factorial design for these ten factors will require a prohibitive 2^10 = 1024 treatments. The full factorial experiment is the ideal design for many of the research questions in this thesis but the size of metaheuristic design spaces limits its applicability. A more efficient design is required.
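For illustration, the 2^k treatments can be enumerated directly as the Cartesian product of the coded -1/+1 levels. The sketch below is hypothetical (the factor names are borrowed from the tuning parameters discussed in the case studies purely as an example) and simply demonstrates how quickly the number of treatments grows with k.

```python
from itertools import product

def full_factorial_2k(factor_names):
    """All treatments of a 2^k full factorial in coded -1/+1 units."""
    k = len(factor_names)
    return [dict(zip(factor_names, levels)) for levels in product((-1, +1), repeat=k)]

factors = ["beta", "antsFraction", "nnFraction", "q0"]  # 4 illustrative factors
design = full_factorial_2k(factors)
print(len(design))   # 2^4 = 16 treatments
print(design[0])     # {'beta': -1, 'antsFraction': -1, 'nnFraction': -1, 'q0': -1}
```

With ten factors the same enumeration already produces 1024 treatments, before any replication, which is exactly the expense problem that motivates the fractional designs of the next section.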


A.3.2 Fractional Factorial Design

The previous section mentioned the exponential increase in expense of factorial designs with an increase in the design factors. There are benefits to this expense. A 2^10 full factorial will provide data to evaluate all the effects listed in the next table. For screening, however, the experimenter is interested only in the main effects (the design factors) and perhaps the two-factor effects. This makes the full factorial inefficient for screening purposes.

Effect         Number estimated
Main           10
Two-factor     45
Three-factor   120
Four-factor    210
Five-factor    252
Six-factor     210
Seven-factor   120
Eight-factor   45
Nine-factor    10
Ten-factor     1

Table A.1: Numbers of each effect estimated by a full factorial design of 10 factors.

If it is assumed that higher-order interactions are insignificant, information on the main effects and lower-order interactions can be obtained by running a fraction of the complete factorial design. This assumption is based on the sparsity of effects principle. This states that a system or process is likely to be most influenced by some main effects and low-order interactions and less influenced by higher order interactions. A judiciously chosen fraction of the treatments in a full factorial will yield insights into only the lower order effects. This is termed a fractional factorial. The price we pay for the fractional factorial's reduction in the number of experimental treatments is that some effects are indistinguishable from one another. They are aliased. Additional treatments, if necessary, can disentangle these aliased effects should an alias group be statistically significant. The advantage of the fractional factorial is that it facilitates sequential experimentation. The additional treatments and associated experiment runs need only be performed if aliased effects are statistically significant. Depending on the number of factors, and consequently the design size, a range of fractional factorials can be produced from a full factorial. The extent to which higher order effects are aliased is described by the design's resolution. For Resolution III designs, all effects are aliased. Resolution IV designs have unaliased main effects but second-order effects are aliased. Resolution V designs estimate main and second-order effects without aliases. The details of how to choose a fractional factorial's treatments are beyond the

scope of this thesis. It is an established algorithmic procedure that is well covered in the literature [84] and is provided in all modern statistical analysis software. The fractional factorials used in this research are summarised in the next figure, which shows the relationship between number of factors, design resolution and associated number of experiment treatments.

[Chart: the available 2^(k-p) fractional factorial designs for 2 to 12 factors, arranged by the required number of treatments (4 to 512) and coloured by design resolution.]

Figure A.2: Fractional Factorial designs for two to twelve factors. The required number of treatments is listed on the left. Resolution III designs (do not estimate any terms) are coloured darkest followed by Resolution IV designs (estimate main effects only) followed by Resolution V and higher (estimate main effects and second order interactions).

The minimum appropriate fractional factorial design resolution for screening is therefore resolution IV, since screening aims to remove factors (main effects) that do not affect the responses. A resolution V design is preferable when resources allow because it also tells us what second order effects are present without the need for additional treatments and experiment runs. It is informative to consider the two available resolution IV designs for 9 factors in the next figure as examples of the importance of examining alias structure. The 2^(9-4) design requires 32 treatments while the 2^(9-3) is more expensive with 64 treatments. The cheaper 2^(9-4) design has 8 of its 9 main effects aliased with 3 third order interactions. The 2^(9-3) design has only 4 of its 9 main effects aliased with a single third-order interaction. The second order interactions are almost all aliased in the more expensive 2^(9-3) design but the aliasing is more favourable than in the cheaper 2^(9-4) design. Resources permitting, the more expensive 2^(9-3) design is therefore more desirable for screening main effects. Screening designs based on 2^k factorials and fractional factorials can only produce linear models of a response because each factor appears at only two levels.

[Table: the alias chains of the 2^(9-3) and 2^(9-4) resolution IV designs; each estimated main effect and two-factor interaction is listed with the higher-order interactions it is aliased with.]

Figure A.3: Effects and alias chains for a 2^(9-3) resolution IV design and a 2^(9-4) resolution IV design.


A.3.3 Efficiency of Fractional Factorial Designs

The following figure makes explicit the huge savings in experiment runs when using a fractional factorial design instead of a full factorial design.

Screening designs (left of Figure A.4):

    Design         Treatments    % saving of treatments
    Full           512           -
    2^(9-5) III    16            97
    2^(9-4) IV     32            94
    2^(9-3) IV     64            88
    2^(9-2) VI     128           75

Response surface designs (right of Figure A.4):

    Design*        Treatments    % saving of treatments
    Full           531           -
    Half           275           50
    Quarter        147           75
    Min Run        65            91

    * FCC with 1 centre point

Figure A.4: Savings in experiment runs when using a fractional factorial design instead of a full factorial design. The savings for screening designs are on the left and the savings for response surface designs are on the right. In both cases, fractional factorial designs offer enormous savings in number of treatments over the full factorial alternative.

A.3.4 Response Surface Designs

There are several types of experiment design for building response surface models. This research uses Central Composite Designs (CCD). A CCD contains an embedded factorial (or fractional factorial) design. This is augmented with both centre points and a group of so-called 'star points' that allow estimation of curvature. Let the distance from the centre of the design space to a factorial point be ±1 unit for each factor. Then, the distance from the centre of the design space to a star point is ±α where |α| ≥ 1. The value of α depends on certain properties desired for the design and on the number of factors involved. The number of centre point runs the design is to contain also depends on certain properties required for the design. There are three types of central composite design, illustrated in Figure A.5.


Figure A.5: Central composite designs for building response surface models. From left to right these designs are the Circumscribed Central Composite (CCC), the Face-Centred Composite (FCC) and the Inscribed Central Composite (ICC). The design space is represented by the shaded area. The factorial points are black circles and the star points are grey squares.


The designs differ in the location of their axial points. The choice of design depends on the nature of the factors being experimented with.

• Circumscribed Central Composite (CCC). In this design, the star points establish new extremes for the low and high settings for all factors. These designs require 5 levels for each factor. Augmenting an existing factorial or resolution V fractional factorial design with star points can produce this design.

• Inscribed Central Composite (ICC). For those situations in which the limits specified for factor settings are truly limits, the ICC design uses the factor settings as the star points and creates a factorial or fractional factorial design within those limits. This design also requires 5 levels of each factor.

• Face-Centred Composite (FCC). In this design, the star points are at the centre of each face of the factorial space, so α = ±1. This design requires just 3 levels of each factor.

An existing factorial or resolution V design from the screening stage can be augmented with appropriate star points to produce the CCC and FCC designs. This is not the case with the ICC and so it is less useful in the sequential experimentation scenario (see Section 6.1). Many of the tuning parameters encountered in ACO algorithms have a restrictive range of values they can take on. For example, the exploration/exploitation threshold q0 (Section 2.4.6 on page 42) must be greater than or equal to 0 and less than or equal to 1. The problem instance characteristics also have a restrictive range imposed by the user. We cannot hope to model all possible instances and so must restrict our instance characteristic ranges to those that will be encountered in the application of the algorithm. The FCC is designed for scenarios where such restrictions on factor ranges are enforced. Clearly, it is the most appropriate design in the current ACO parameter tuning scenario. The FCC is used in all response surface modelling in this thesis.
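To make the geometry of the FCC concrete, the sketch below enumerates the design points of a face-centred composite design in coded units for an arbitrary number of factors: the 2^k factorial corners, the 2k face-centred star points (α = 1), and a single centre point. This is only an illustrative construction; it does not reproduce the embedded fractional factorials or the centre-point replication used in the thesis designs.

from itertools import product

def face_centred_composite(k):
    """Return the coded design points of an FCC with k factors."""
    corners = [list(p) for p in product((-1, 1), repeat=k)]   # 2**k factorial points
    stars = []                                                 # 2k face-centred star points
    for i in range(k):
        for level in (-1, 1):
            point = [0] * k
            point[i] = level
            stars.append(point)
    centre = [[0] * k]                                         # one centre point
    return corners + stars + centre

design = face_centred_composite(3)
print(len(design))        # 8 corners + 6 star points + 1 centre = 15 treatments
for point in design:
    print(point)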

A.3.5 Prediction Intervals

A regression model from the response surface design is used to predict new values of the response given values of the tuning parameter and problem characteristic input variables. The model's p% prediction interval is the range within which an individual value from the actual heuristic can be expected to fall p% of the time. The prediction interval will be larger (a wider spread) than a confidence interval since there is more scatter in individual values than in averages. Montgomery [84, p. 394-396] describes the mathematical formulation of prediction intervals and their applications. In particular, prediction intervals should be used in confirmation experiments to verify that models of the heuristic behaviour are correct. This thesis is the first use in the heuristics literature of prediction intervals and independent confirmation runs to verify conclusions.
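As a minimal sketch of how a prediction interval differs from a confidence interval in practice, the following Python fragment fits an ordinary least squares model with statsmodels and requests both intervals for new design points. The data are synthetic and the single-predictor model is an assumption for illustration only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)                      # a single coded factor, for illustration
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)     # synthetic response

X = sm.add_constant(x)                              # design matrix with intercept
fit = sm.OLS(y, X).fit()

X_new = sm.add_constant(np.array([0.25, 0.75]))
pred = fit.get_prediction(X_new)
frame = pred.summary_frame(alpha=0.05)              # 95% intervals

# mean_ci_* is the confidence interval for the mean response;
# obs_ci_* is the wider prediction interval for an individual observation.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])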


A.3.6 Desirability functions

The concept of a desirability function can be briefly described as follows [84, 1]. The desirability function approach is a widely used industrial method for optimising multiple responses. The basic idea is that a process with many quality characteristics is completely unacceptable if any of those characteristics are outside some desired limits. For each response Y_i, a desirability function d_i(Y_i) assigns a number between 0 and 1 to the possible values of the response Y_i. d_i(Y_i) = 0 is a completely undesirable value and d_i(Y_i) = 1 is an ideal response value. These k individual desirabilities are combined into an overall desirability D using a geometric mean:

D = \left( d_1(Y_1) \times d_2(Y_2) \times \cdots \times d_k(Y_k) \right)^{1/k} \qquad (A.1)

A particular class of desirability function was proposed by Derringer and Suich [40]. Let L_i and U_i be the lower and upper limits respectively of response i, and let T_i be the target value. If the target value is a maximum then

d_i = \begin{cases} 0 & y_i < L_i \\ \left( \dfrac{y_i - L_i}{T_i - L_i} \right)^r & L_i \le y_i \le T_i \\ 1 & y_i > T_i \end{cases} \qquad (A.2)

If the target is a minimum value then

d_i = \begin{cases} 1 & y_i < T_i \\ \left( \dfrac{U_i - y_i}{U_i - T_i} \right)^r & T_i \le y_i \le U_i \\ 0 & y_i > U_i \end{cases} \qquad (A.3)

The value r adjusts the shape of the desirability function. A value of r = 1 gives a linear function. A value of r > 1 increases the emphasis on being close to the target value, while a value of 0 < r < 1 decreases this emphasis. These cases are illustrated in the following figure.

Figure A.6: Individual desirability functions. On the left is a maximise function and on the right is a minimise function. Figure adapted from [84, p. 426].
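The following short Python sketch implements Equations A.2, A.3 and A.1 directly, so the effect of the limits, target and shape parameter r can be checked numerically. The response values and limits in the example call are made up purely for illustration.

def desirability_max(y, lower, target, r=1.0):
    """Derringer-Suich desirability when the target is a maximum (Equation A.2)."""
    if y < lower:
        return 0.0
    if y > target:
        return 1.0
    return ((y - lower) / (target - lower)) ** r

def desirability_min(y, target, upper, r=1.0):
    """Derringer-Suich desirability when the target is a minimum (Equation A.3)."""
    if y < target:
        return 1.0
    if y > upper:
        return 0.0
    return ((upper - y) / (upper - target)) ** r

def overall_desirability(individual):
    """Geometric mean of the individual desirabilities (Equation A.1)."""
    product = 1.0
    for d in individual:
        product *= d
    return product ** (1.0 / len(individual))

# Illustrative values only: minimise relative error (target 0%, limit 5%)
# and minimise solution time (target 1 s, limit 30 s).
d_quality = desirability_min(y=1.2, target=0.0, upper=5.0, r=1.0)
d_time = desirability_min(y=8.0, target=1.0, upper=30.0, r=0.5)
print(overall_desirability([d_quality, d_time]))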

A.4 Experiment analysis

Experiment analysis comprises the steps one takes after designing an experiment and gathering data. The analysis steps used in this thesis are listed in Chapter 6 on methodology. Some of these steps are covered in more detail here.

A.4.1 Stepwise regression

Various techniques can be used to identify the most important terms that should be included in a regression model. This thesis uses an automated approach called stepwise regression. Usually, this takes the form of a sequence of F-tests, but other techniques are possible. The two main stepwise regression approaches are:

1. Forward selection. This involves starting with no variables in the model, trying out the variables one by one and including them if they are statistically significant.


2. Backward selection. This involves starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant according to an alpha-out value.

This thesis uses backward selection for the choice of terms in all its analyses, with an alpha-out value of 0.1. There are several criticisms of stepwise regression methods worth noting.

1. A sequence of F-tests is often used to control the inclusion or exclusion of variables, but these are carried out on the same data and so there will be problems of multiple comparisons, for which many correction criteria have been developed.

2. It is difficult to interpret the p-values associated with these tests, since each is conditional on the previous tests for inclusion and exclusion.

Nonetheless, the accuracy of all models in this thesis is independently analysed with confirmation runs. This should allay any engineer's concerns over the use of stepwise regression for metaheuristic screening and modelling.
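As a minimal sketch of backward selection with an alpha-out threshold of 0.1, the fragment below repeatedly refits an ordinary least squares model with statsmodels and drops the least significant remaining term. The synthetic data and the main-effects-only model are assumptions for illustration, not the full set of candidate terms used in the thesis analyses.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.uniform(-1, 1, size=(60, 4)),
                 columns=["beta", "q0", "rho", "alpha"])
y = 1.0 + 2.0 * X["beta"] - 1.5 * X["q0"] + rng.normal(0, 0.3, size=60)  # rho, alpha inert

def backward_select(X, y, alpha_out=0.10):
    """Drop the term with the largest p-value until all remaining p-values are below alpha_out."""
    terms = list(X.columns)
    fit = sm.OLS(y, sm.add_constant(X[terms])).fit()
    while terms:
        pvalues = fit.pvalues.drop("const")      # never consider dropping the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha_out:
            break                                # every remaining term is significant
        terms.remove(worst)
        fit = sm.OLS(y, sm.add_constant(X[terms]) if terms else np.ones(len(y))).fit()
    return fit

final = backward_select(X, y)
print(final.model.exog_names)   # expected to retain beta and q0 for this synthetic data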

A.4.2 ANOVA diagnostics

Once an Analysis of Variance (ANOVA) has been calculated, some diagnostics must be examined to ensure that the assumptions on which ANOVA depends have not been violated.

• Normality. A Normal Plot of Studentised Residuals should be approximately a straight line. Deviations from this may indicate that a transformation of the response is appropriate.

• Constant Variance. A plot of Studentised Residuals against predicted response values should be a random scatter. Patterns such as a 'megaphone' may indicate the need for a transformation of the response.

• Time-dependent effects. A plot of Studentised Residuals against run order should be a random scatter. Any trend indicates the influence of some time-dependent nuisance factor that was not countered with randomisation.

• Model Fit. A plot of predicted values against actual response values will identify particular treatment combinations that are not well predicted by the model. Points should align along the 45° axis.

• Leverage and Influence. Leverage measures the influence of an individual design point on the overall model. A plot of leverage for each treatment indicates any problem data points.

• Cook's Distance. A plot of Cook's distance against treatment measures how much the regression changes if a given case is removed from the model.

It is an open question how much a violation of these diagnostics invalidates the conclusions from an ANOVA. Coffin and Saltzman [28] believe that the ANOVA F-test is extremely robust to unequal variances provided that there is approximately the same number of observations in each treatment group. Diagnostics were always examined and passed in the analyses of this thesis. Furthermore, any concerns regarding these diagnostics are allayed with the use of independent confirmation runs.
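A minimal sketch of how several of these diagnostics can be produced for a fitted regression model with statsmodels and matplotlib is given below; the `fit` argument is assumed to be an OLS results object such as the one produced in the stepwise regression sketch above.

import matplotlib.pyplot as plt
import statsmodels.api as sm

def anova_diagnostics(fit, run_order=None):
    """Residual diagnostics for a fitted statsmodels OLS results object."""
    influence = fit.get_influence()
    studentised = influence.resid_studentized_external

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    sm.qqplot(studentised, line="45", ax=axes[0, 0])           # normality
    axes[0, 0].set_title("Normal plot of studentised residuals")

    axes[0, 1].scatter(fit.fittedvalues, studentised)          # constant variance
    axes[0, 1].set_title("Studentised residuals vs predicted")

    order = run_order if run_order is not None else range(len(studentised))
    axes[1, 0].plot(order, studentised, marker="o")            # time-dependent effects
    axes[1, 0].set_title("Studentised residuals vs run order")

    axes[1, 1].scatter(fit.fittedvalues, fit.model.endog)      # model fit
    axes[1, 1].set_title("Actual vs predicted")

    fig.tight_layout()
    return fig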

A.4.3 Response Transformation

If a model is correct and the assumptions are satisfied, the residuals should be unrelated to any other variable, including the predicted response. This can be verified by plotting the residuals against the fitted values. The plot should be unstructured. However, sometimes nonconstant variance is observed. This is where the variance of the observations increases as the magnitude of the observations increases. The usual approach to dealing with this problem is to transform the data and run the ANOVA on the transformed data. The next table gives some popular transformations.

    Name                   Equation
    Logarithmic            Y' = log10(Y + d)
    Square root            Y' = sqrt(Y + d)
    Inverse square root    Y' = 1 / sqrt(Y + d)

Table A.2: Some common response transformations.

The appropriate transformation is chosen based on the shape of the data or, in the case of this thesis, using an automated technique called a Box-Cox plot [21].
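For reference, a Box-Cox analysis is readily available in SciPy. The sketch below estimates the transformation parameter lambda for a synthetic, positively skewed response and applies the suggested transformation; the data are made up for illustration, and lambda values near 0, 0.5 or -0.5 correspond roughly to the logarithmic, square root and inverse square root transformations in Table A.2.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.0, sigma=0.6, size=200)   # synthetic, positively skewed response

transformed, lam = stats.boxcox(y)                 # maximum-likelihood estimate of lambda
print(f"estimated lambda: {lam:.2f}")

# A lambda close to zero suggests a logarithmic transformation.
if abs(lam) < 0.1:
    y_prime = np.log10(y)
else:
    y_prime = transformed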


A.4.4 Outliers

An outlier is a data value that is much larger or smaller than the majority of the experiment data. Outliers are important because they affect the ANOVA assumptions and can render conclusions from a statistical analysis invalid. Outliers are easily identified with the ANOVA diagnostics. In this thesis, we take the approach of deleting outliers. Some would disagree with this approach because outliers in responses such as solution quality are not due to any random noise but are instead actual repeatable data values. This is true but we must still somehow deal with the outliers and make the data amenable to statistical analysis. If the proportion of outliers deleted is reported then the reader can be assured that the outliers represented a very small proportion of the total data. If confirmation runs are reported then the reader is reassured that the model was accurate despite the deletion of outliers.

A.4.5 Dealing with Aliasing

Once a model has been successfully built and analysed, allowing for necessary transformations and outliers, there may still be obstacles to interpreting the ANOVA results. Some designs such as the fractional factorial reduce the number of experiment runs required at the cost of some of the model's effects being aliased. Aliased effects are those effects that cannot be distinguished from one another. We say that these effects form an alias chain. For example, if the main effect A and a second order effect AB are aliased then we cannot tell whether it is A or AB that contributes to the model. When a significant effect is aliased, several approaches are available to the experimenter to determine the correct model term to which the effect should be attributed [84, p. 289].

1. Engineering judgement. A first attempt is to use engineering judgement to justify ignoring some terms in the alias. It may be known from experience that one of the aliased effects is not important and can be discarded.

2. Ockham's razor. Consider a 2^(4-1) resolution IV experiment. It has four main effects that we shall call A, B, C and D. Suppose that the significant main effects are A, C and D and the significant aliased interactions are [AC] and [AD], aliased as follows: [AC] → AC + BD and [AD] → AD + BC. Because AC and AD are the interactions composed only of significant main effects, it is more likely that these are the significant interactions in the alias chains. Montgomery [84] cites this as an application of Ockham's razor: the simplest explanation of the effect is most likely the correct one.

Failing the application of either of these two approaches, one must augment the design to de-alias the significant effects.

3. Augment design. A foldover procedure is a methodical and efficient way to introduce more treatments into a fractional design so that a particular effect can be de-aliased. The foldover procedure produces double the number of new treatments for which data must be gathered (once for each replicate). The augmented design should fold over on those most significant model terms that the experimenter wishes to de-alias. Once de-aliasing has been completed, the experimenter can interpret significant main effects and interactions.

A.4.6 Interpreting interactions

There are several possibilities when plotting two-factor interactions from a two-way analysis of variance [29]. These possibilities are illustrated in the following figure.

There are two factors, denoted A and B. Factor A was tested at two levels, A1 and A2, and factor B at three levels, B1, B2 and B3.

Figure A.7: Examples of possible main and interaction effects [29]. The possibilities are numbered 1 to 6.

The interpretation of these possibilities is as follows.

• Example 1: There is a main effect for A, represented by the increasing slope. There is a main effect for B, represented by the vertical distance between lines. There is no interaction AB since the lines are always parallel.

• Example 2: There is a main effect for A but now there is no main effect for B since the lines are no longer separated by a vertical distance. There is no interaction AB either.

• Example 3: There are no effects whatsoever.


• Example 4: There is a main effect for A, represented by the increasing slopes. There is no main effect for B because the vertical distances between levels of B are reversed at the two levels of A. There is an interaction between A and B.

• Example 5: There is no main effect for A as the slopes at different levels of B cancel one another out. There is a main effect of B. There is an interaction AB.

• Example 6: There is a main effect of A. There is a main effect of B. There is also an interaction AB.

The presence of interactions clearly complicates an analysis because it means that a main effect cannot be interpreted in isolation. The inability to detect interactions is one of the most important shortcomings of the OFAT approach (Section 2.5 on page 47); the ability to detect them is one of the main strengths of the DOE approach.
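An interaction plot of the kind shown in Figure A.7 is straightforward to produce from raw data by plotting, for each level of B, the mean response at each level of A. The sketch below does this with pandas and matplotlib on a small made-up data set; parallel lines would indicate no AB interaction.

import matplotlib.pyplot as plt
import pandas as pd

# Made-up responses for a 2x3 layout: factor A at two levels, factor B at three.
data = pd.DataFrame({
    "A": ["A1", "A1", "A1", "A2", "A2", "A2"] * 2,
    "B": ["B1", "B2", "B3"] * 4,
    "response": [10, 12, 14, 11, 16, 22, 9, 13, 15, 12, 15, 23],
})

# Mean response for every combination of A and B levels.
means = data.groupby(["B", "A"])["response"].mean().unstack("A")

fig, ax = plt.subplots()
for b_level, row in means.iterrows():
    ax.plot(["A1", "A2"], [row["A1"], row["A2"]], marker="o", label=b_level)
ax.set_xlabel("Level of A")
ax.set_ylabel("Mean response")
ax.legend(title="Level of B")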

A.5 Hypothesis testing

Hypothesis testing (sometimes called significance testing) is an objective method of making comparisons with a knowledge of the risks associated with reaching the wrong conclusions. A statistical hypothesis is a conjecture about the problem situation. One may conjecture, for example, that the mean heuristic performances at two levels 1 and 2 of a factor are equal. This is written as:

H0: μ1 = μ2
H1: μ1 ≠ μ2

The first statement is the null hypothesis and the second statement is the alternative hypothesis. A random sample of data is taken and a test statistic is computed. The null hypothesis is rejected if the test statistic falls within a certain rejection region for the test. It is extremely important to emphasise that hypothesis testing does not permit us to conclude that we accept the null hypothesis. The correct conclusion is always either a rejection of the null hypothesis or a failure to reject the null hypothesis. The p-value is the probability, computed under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. Smaller p-values indicate that the data are inconsistent with the assumption that the null hypothesis is true.
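As a small concrete example of this procedure, the following Python fragment uses Welch's two-sample t-test from SciPy to compare mean relative error at two parameter levels; the two samples are synthetic and stand in for replicated runs of the heuristic at each level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
level_1 = rng.normal(loc=1.0, scale=0.3, size=10)   # relative errors at level 1 (synthetic)
level_2 = rng.normal(loc=1.4, scale=0.3, size=10)   # relative errors at level 2 (synthetic)

statistic, p_value = stats.ttest_ind(level_1, level_2, equal_var=False)  # Welch's t-test
print(f"t = {statistic:.2f}, p = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean performances differ at the 5% significance level.")
else:
    print("Fail to reject H0.")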

A.6 Error, Significance, Power and Replicates

Two types of error can be committed when testing hypotheses [84, p. 35]. If the null hypothesis is rejected when it is actually true, then a Type I Error has occurred. If the null hypothesis is not rejected when it is false then a Type II Error has occurred. These error probabilities are given special symbols


• α = P(Type I error) = P(reject H0 | H0 true)
• β = P(Type II error) = P(fail to reject H0 | H0 false)

In the context of Type II errors, it is more convenient to use the power of a test, where

Power = 1 − β = P(reject H0 | H0 false)

It is therefore desirable to have a test with a low α and a high power. The probability of a Type I error is often called the significance level of a test. The particular significance level depends on the requirements of the experimenter and, in a research context, on the conventional acceptable level. Unfortunately, with so little adaptation of statistical methods to the analysis of heuristics, there are few guidelines on what value to choose. Norvig cites a value as low as 0.0000001% in research work at Google [87]. All experiments in this thesis use a level of either 1% or 5%.

The power of a test is usually set to 80% by convention. The reason for this choice is diminishing returns: it requires an exponentially increasing number of replicates to increase power beyond about 80% and there is little advantage to the additional power this confers. Miles [82] describes the relationship between significance level, effect size, sample size and power using an analogy with searching.

• Significance Level: This is the probability of thinking we have found something when it is not really there. It is a measure of how willing we are to risk a Type I error.

• Effect Size: The size of the effect in the population. The bigger it is, the easier it will be to find. This is a measure of the practical significance of a result, preventing us claiming a statistically significant result that has little consequence [101].

• Sample size: A larger sample size leads to a greater ability to find what we were looking for. The harder we look, the more likely we are to find it.

The critical point regarding this relationship is that what we are looking for is always going to be there; it might just be there in such small quantities that we are not bothered about finding it. Conversely, if we look hard enough, we are guaranteed to find what we are looking for. Power analysis allows us to make sure that we have looked reasonably hard to find it. A typical experiment design approach is to agree the significance level and choose an effect size based on practical experience and experiment goals. Given these constraints, the sample size is increased until sufficient power is reached. If a response has a high variability then a larger sample size will be required.


Different statistical tests and different experiment designs involve different power calculations. These calculations can become quite involved and the details of their calculation are beyond the scope of this thesis. Power calculations are supplied with most good quality statistical analysis software. Some are even provided online [76]. Power considerations have had limited exposure in the heuristics field [28] but play a strong role in this thesis.
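For a simple two-sample comparison, the sample size needed to reach a given power can be obtained from the statsmodels power module, as in the sketch below. The effect size, significance level and power values are the conventional ones discussed above and are used here purely as an example.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Replicates per group needed to detect a one-standard-deviation effect
# at a 5% significance level with 80% power.
n_per_group = analysis.solve_power(effect_size=1.0, alpha=0.05, power=0.80)
print(round(n_per_group))

# Conversely, the power achieved with 10 replicates per group.
achieved = analysis.power(effect_size=1.0, nobs1=10, alpha=0.05)
print(f"power with 10 replicates: {achieved:.2f}")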

A.6.1 Power work-up procedure

Sufficient power is achieved with a so-called work-up procedure [36]. This is an iterative procedure whereby data is gathered for a design with a given number of replicates, power is calculated, and replicates are added if sufficient power was not achieved. This process repeats until sufficient power is reached. The work-up procedure is an efficient way to ensure the experiment has enough power without wasting resources on unnecessary replicates.
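The work-up procedure can be expressed as a short loop around a power calculation. The sketch below uses the same statsmodels power object as before and a hypothetical `run_replicate` function standing in for executing the heuristic and measuring its responses; both the function and the stopping thresholds are assumptions for illustration.

from statsmodels.stats.power import TTestIndPower

def work_up(run_replicate, effect_size, alpha=0.05, target_power=0.80,
            start_replicates=3, max_replicates=50):
    """Add replicates until the target effect size is detectable with the target power."""
    analysis = TTestIndPower()
    data = [run_replicate() for _ in range(start_replicates)]   # hypothetical data-gathering call
    n = start_replicates
    while analysis.power(effect_size=effect_size, nobs1=n, alpha=alpha) < target_power:
        if n >= max_replicates:
            raise RuntimeError("target power not reached within the replicate budget")
        data.append(run_replicate())   # gather one more replicate of every treatment
        n += 1
    return data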


B TSPLIB Statistics

This appendix reports statistics and plots of some TSPLIB [102] instances. The instances are symmetric Euclidean instances as described in Section 2.2 on page 30. These statistics and plots are referenced in the conclusions of Chapter 7. The table gives some descriptive statistics of the symmetric Euclidean instances. All instances have approximately the same ratio of standard deviation to mean. Figure B.2 on the next page to Figure B.4 on page 231 are histograms illustrating the normalised frequency of normalised edge lengths in several of the symmetric Euclidean TSP instances. All histograms have a shape that can be represented by a Log-Normal Distribution.
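The coefficient reported in the table below is the ratio of the standard deviation of edge lengths to their mean, and it can be reproduced for any instance from the city coordinates with a few lines of Python. The short coordinate list here is a made-up placeholder for coordinates read from a TSPLIB file.

import math
from itertools import combinations
from statistics import mean, pstdev

# Placeholder coordinates; in practice these come from a TSPLIB instance file.
cities = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 1.0), (2.0, 7.0)]

edge_lengths = [math.dist(a, b) for a, b in combinations(cities, 2)]

mu = mean(edge_lengths)
sigma = pstdev(edge_lengths)
print(f"mean edge length: {mu:.2f}")
print(f"std deviation:    {sigma:.2f}")
print(f"coefficient:      {sigma / mu:.2f}")

# Normalised edge lengths, as used for the histograms in Figures B.2 to B.4.
longest = max(edge_lengths)
normalised = [length / longest for length in edge_lengths]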


    Instance    Standard Deviation    Mean       Coefficient
    Oliver30    21.08                 43.93      0.48
    kroA100     916.04                1710.70    0.54
    kroB100     912.90                1687.54    0.54
    kroC100     910.74                1700.55    0.54
    kroD100     867.21                1631.10    0.53
    kroE100     933.55                1732.15    0.54
    eil101      16.35                 33.92      0.48
    lin105      670.85                1177.35    0.57
    pr107       3105.26               5404.24    0.57
    pr124       2848.45               5623.35    0.51
    bier127     3082.05               4952.47    0.62
    ch130       169.98                356.22     0.48
    pr136       2945.43               6073.99    0.48
    pr144       2813.44               5639.51    0.50
    ch150       169.35                359.31     0.47
    kroA150     919.03                1717.35    0.54
    kroB150     922.40                1711.61    0.54
    pr152       3668.43               6914.83    0.53
    kroA200     917.36                1701.17    0.54
    kroB200     892.36                1664.18    0.54
    ts225       3321.76               7080.03    0.47
    tsp225      95.21                 183.58     0.52
    pr226       3708.92               7503.01    0.49
    gil262      48.25                 101.92     0.47
    pr264       2557.95               4248.45    0.60
    pr1002      3160.87               6435.61    0.49
    vm1084      4149.34               7907.79    0.52
    rl1304      3670.93               7190.12    0.51
    rl1323      3724.98               7403.58    0.50
    nrw1379     543.67                1032.34    0.53
    vm1748      4235.17               8548.22    0.50
    rl1889      4012.50               7834.80    0.51
    pr2392      3125.49               6374.92    0.49

Figure B.1: Some descriptive statistics for the symmetric Euclidean instances in TSPLIB. Instances are presented in order of increasing size. The columns are the standard deviations of edge lengths, the mean edge lengths and the ratio of standard deviation to mean.


Figure B.2: Histogram of normalised frequency of normalised edge lengths of the bier127 TSPLIB instance.



Figure B.3: Histogram of normalised frequency of normalised edge lengths of the Oliver30 TSPLIB instance.


Figure B.4: Histogram of normalised frequency of normalised edge lengths of the pr1002 TSPLIB instance.


C Calculation of Average Lambda Branching Factor

Average lambda branching factor was first introduced as a descriptive statistic for ACS performance [46] but is now an integral part of trail reinitialisation in the MMAS daemon actions (Section 2.4.5 on page 39). It is discussed in detail in the literature [47, p. 87]. The following figure represents its calculation in pseudocode adapted from JACOTSP. Broadly, the calculation can be described as follows. For every city in the TSP, go through the city's candidate list calculating a branching factor cut-off. For each edge in the city's candidate list, count the edges that are above this cut-off value. Average the counts for every city in the TSP. If we let the TSP size be n and we let the candidate list length for all cities be cl then the complexity of the calculation is given by:

O(n \cdot 2\,cl + n) \approx O(2n) \quad \text{if } cl \ll n \qquad (C.1)

Clearly the calculation is expensive and this is exacerbated if the candidate list length approaches the problem size. However, this expense does not show when CPU time is not recorded. The expense of the calculation is mitigated in three ways:

• The branching factor is often not computed at each iteration.
• The complexity is the same as that of construction of one solution by an ant.
• The candidate list length cl is typically very small.


/**
 * Method to calculate the average lambda branching factor.
 */
public double computeAverageBranchingFactor() {

    /* An array to store the number of branches from each TSP city. */
    double[] num_branches = new double[tspSize];

    /* O(tspSize) iterations, each costing O(cl) twice. */
    for (each aCity in TSP) {
        /* O(cl): scan the candidate list to find the branching cut-off for this city. */
        final double cutoff =
            calculateCutOffForLambdaBranchingFactorForACity(aCity);
        /* O(cl): count candidate-list edges above the cut-off. */
        num_branches[aCity] = countEdgesAboveCutOffFromCity(aCity, cutoff);
    }

    /* O(tspSize): average the per-city branch counts over the whole array. */
    final double averageNumberOfBranches = calculateAverageOf(num_branches);

    double result = averageNumberOfBranches / (tspSize * 2);
    return result;
}

Figure C.1: Pseudocode for the calculation of the average lambda branching factor.


D Example OFAT Analysis

This appendix reports a One-Factor-at-A-Time (OFAT) approach to tuning a single parameter from the ACS heuristic (Section 2.4 on page 34). The particular parameter is alpha, which plays a role in an artificial ant's solution building decisions.

D.1 Motivation

The ACS screening study of Chapter 8 and the ACS tuning study of Chapter 9 both reported that alpha did not have a statistically significant effect on either solution quality or solution time. This is an interesting and important result because it is intuitively unexpected and contradicts the accepted view of the importance of alpha [96]. The methods and experiment designs introduced in this thesis are new to the ACO field and the metaheuristics field in general. It is of interest, therefore, to attempt the same analysis, in so far as is possible, using a more familiar empirical technique called the One-Factor-at-A-Time (OFAT) approach. An OFAT approach involves taking one of the algorithm tuning parameters and allowing it to vary while all other tuning parameters are held fixed at some other values. The free parameter is tuned until performance is maximised. The procedure then moves on to another of the tuning parameters, allowing it to vary while all others are held fixed. This study applies an OFAT analysis to the alpha tuning parameter. The aim of the study is to determine whether an OFAT approach will lead to a different conclusion from the DOE approach. It is not an endorsement of the OFAT approach, the demerits of which were highlighted in Section 2.5 on page 47. In keeping with the thesis’ strong emphasis on experimental rigor, the OFAT analysis is conducted with a designed experiment and supporting statistical analyses.


D.2 Method

D.2.1 Response Variables

Two responses were measured, relative error from a known optimum and elapsed solution time, as per Section 6.7 on page 125.

D.2.2 Factors, Levels and Ranges

Design Factors

Being an OFAT analysis, there was one design factor. This factor was the alpha tuning parameter for the ACS algorithm, described in Section 2.4 on page 34. Alpha was set at the following five levels: 1, 3, 5, 7, 12.

Held-Constant Factors

The held-constant factors are as per Section 6.7.6 on page 127. There were additional held-constant factors required of the OFAT approach. All other tuning parameters were fixed at six different settings. These settings came from the desirability optimisation results from the full ACS response surface model, given in the following figure.


Figure D.1: Fixed parameter settings for the OFAT analysis. These are reproduced from the results of the desirability optimisation of the ACS full response surface model. The response predictions from the tuning have also been included.

Note that in practice, one may not have access to these tuned parameter settings. A researcher conducting an OFAT analysis without any prior knowledge would have no guidelines on the values at which the other parameters should be fixed.

D.2.3 Instances

All TSP instances were of the symmetric type and were created as per Section 6.7.1 on page 126. The TSP problem instances ranged in size from 400 cities to 500 cities, with cost matrix standard deviation ranging from 10 to 70. All instances had a mean of 100. The same instances were used for each replicate of a design point. For each OFAT analysis, a single problem instance was used. These were the same instances used in the ACS tuning case study.

D.2.4 Experiment design, power and replicates

The experiment design for each of the OFAT analyses is a single-factor, 5-level factorial. All 5 treatments were replicated 10 times. A work-up procedure was not needed in this case. The next figure gives the descriptive statistics for the collected data and the actual detectable effect size for the quality and time responses with a significance level of 5% and a power of 80%.

Relative Error:

    Problem size    Problem StDev    Range    Min     Max      Mean     Std. Dev.
    400             10               0.30     0.74    1.04     0.89     0.07
    400             40               1.72     3.29    5.01     4.14     0.43
    400             70               6.30     4.17    10.47    6.73     1.73
    500             10               0.35     0.62    0.97     0.80     0.09
    500             40               1.55     3.25    4.79     3.91     0.36
    500             70               4.52     6.03    10.55    8.11     1.25

Time:

    Problem size    Problem StDev    Range    Min     Max      Mean     Std. Dev.
    400             10               3.67     1.52    5.19     2.95     0.96
    400             40               6.05     2.50    8.55     4.13     1.30
    400             70               13.15    4.38    17.53    8.20     3.32
    500             10               4.03     1.97    6.00     3.15     0.89
    500             40               7.97     2.60    10.56    4.97     1.58
    500             70               31.83    4.63    36.45    14.08    7.67

Figure D.2: Descriptive statistics for the six OFAT analyses. Relative Error is reported above and Time below. There are six combinations of problem size and problem standard deviation.

D.2.5 Performing the experiment

Responses were measured at a 250 iteration stagnation stopping criterion. Available computational resources necessitated running experiments across a variety of similar machines. Runs were executed in a randomised order across these machines to counteract any uncontrollable nuisance factors. The experimental machines are benchmarked as per Section 5.3 on page 100.

D.3 Analysis

D.3.1 ANOVA

To make the data amenable to statistical analysis, a transformation of the responses was required for some of the analyses. These transformations were a log10, inverse or inverse square root.

No outliers were detected. The models passed the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality, constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221).

D.4 Results

The next figure summarises each of the OFAT analyses for relative error and time. It reports the detectable effect size for a significance threshold of 5% and a power of 80% and the statistical significance result.

                                              Relative Error          Time
    Problem    Problem    St Devs at         Effect    ANOVA         Effect    ANOVA
    size       StDev      80% power          size      sig?          size      sig?
    400        10         2                  0.14      No            1.91      No
    400        40         0.35               0.15      Yes           0.45      Yes
    400        70         0.35               0.61      Yes           1.16      Yes
    500        10         0.4                0.04      Yes           0.35      Yes
    500        40         2                  0.71      No            3.16      Yes
    500        70         2                  2.50      Yes           15.34     No

Figure D.3: Summary of results from the six OFAT analyses. The figure gives the detectable effect size in terms of both the number of standard deviations and the actual response units for the relative error response and the time response.

Some of the analyses showed a statistically significant effect for alpha on the responses of solution quality and solution time. Note that the detectable effect sizes are small relative to those of the screening and tuning case studies. This is due to the lower variability in the responses when varying only a single tuning parameter. The following figures show the plots of the relative error response for the OFAT analyses. Each plot shows the five levels of alpha on the horizontal axis and includes a 95% Fisher's Least Significant Difference interval [84, p. 96]. Alpha had a statistically significant effect on solution quality in four out of the six experiments. An examination of the range over which average relative error varied in these significant cases shows that the largest difference was approximately 3.9% for problems of size 400 and standard deviation 70 and approximately 0.1% for problems of size 500 and standard deviation 10. All analyses except that of size 400 and standard deviation of 70 recommended an alpha value of 1 to minimise relative error.

D.5 Conclusions and discussion

We draw the following conclusions from these results. For ACS, with all tuning parameters except alpha set to the values in Figure D.1 on page 236:

• alpha has a statistically significant effect on solution quality for instances with a size and standard deviation combination of 400-40, 400-70, 500-10 and 500-70.


Figure D.4: Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 10. Alpha was not statistically significant in this case.

Figure D.5: Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 40. Alpha was statistically significant in this case.


Figure D.6: Plot of the effect of alpha on relative error for a problem with size 400 and standard deviation 70. Alpha was statistically significant in this case.

Figure D.7: Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 10. Alpha was statistically significant in this case.


Figure D.8: Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 40. Alpha was not statistically significant in this case.

Figure D.9: Plot of the effect of alpha on relative error for a problem with size 500 and standard deviation 70. Alpha was statistically significant in this case.


• Apart from one anomalous result, an alpha value of 1 is recommended to minimise relative error.

The first of these conclusions appears to contradict the results of Chapters 8 and 9. These concluded that alpha had a relatively unimportant effect on solution quality. However, there are several important differences between the previous experiments and the current OFAT analysis. Firstly, the fractional factorial and response surface designs experimented with many more factors. This resulted in a larger variability in the response, as listed in the descriptive statistics of Figure 9.1 on page 155, for example. The OFAT analysis, varying only alpha and conducted on a single instance, had a much smaller variance in its response measurements, as listed in Figure D.2 on page 237. The consequence is that the OFAT analysis could detect much smaller effects for a given significance level and power than the fractional factorial screening and the response surface. This does not mean that OFAT is a better approach than DOE. The OFAT conclusions are more accurate in their context. This context is the particular fixed values of the other parameter settings and a single problem instance. As discussed in Section 2.5, the OFAT analysis tells us nothing about interactions and is inefficient in comparison to DOE approaches in terms of the information gained from a given number of experiments. Most importantly, for some response surface shapes, an incorrect OFAT starting point can lead to incorrect tuning recommendations. Unfortunately, since the response surface shape cannot be deduced with OFAT, the experimenter does not know if these incorrect tuning recommendations are being made. The only safe option in this case is to use a DOE approach.


References [1] NIST/SEMATECH Engineering Statistics Handbook, 2006. [2] A DENSO -D IAZ , B., AND L AGUNA , M. Fine-Tuning of Algorithms Using Fractional Experimental Designs and Local Search. Operations Research 54, 1 (2006), 99–114. [3] A MINI , M. M., AND R ACER , M. A Rigorous Computational Comparison of Alternative Solution Methods for the Generalized Assignment Problem. Management Science 40, 7 (1994), 868–890. [4] A NDERSON , V. L., AND M C L EAN , R. A. Design of experiments: a realistic approach. M. Dekker Inc., New York, 1974. [5] A PPLEGATE , D., B IXBY, R., C HVATAL , V., AND C OOK , W. Implementing the Dantzig-Fulkerson-Johnson algorithm for large traveling salesman problems. Mathematical Programming Series B 97, 1-2 (2003), 91–153. [6] B ANG -J ENSEN , J., C HIARANDINI , M., G OEGEBEUR , Y., AND J ØRGENSEN , B. Mixed Models for the Analysis of Local Search Components. In Engineering Stochastic Local Search Algorithms. Designing, Implementing and Analyzing ¨ Effective Heuristics, T. Stutzle, M. Birattari, and H. Hoos, Eds., vol. 4638. Springer, Berlin / Heidelberg, 2007, pp. 91–105. [7] B ARR , R. S., G OLDEN , B. L., K ELLY, J. P., R ESENDE , M. G. C., AND S TEW AR T , W. R. Designing and Reporting on Computational Experiments with Heuristic Methods. Journal of Heuristics 1 (1995), 9–32. [8] B AR TZ -B EIELSTEIN , T. Experimental Research in Evolutionary Computation. The New Experimentalism. Natural Computing Series. Springer, 2006. [9] B AR TZ -B EIELSTEIN , T., AND P REUSS , M. Experimental Research in Evolutionary Computation. Tutorial at the genetic and evolutionary computation conference, June 2005. [10] B AUTISTA , J., AND P EREIRA , J. Ant Algorithms for Assembly Line Balancing. In Proceedings of the Third International Workshop on Ant Algorithms, M. Dorigo, G. D. Caro, and M. Sampels, Eds., vol. 2463 of Lecture Notes in Computer Science. Springer, 2002, p. 65. [11] B ENTLEY, J. L. Fast algorithms for the geometric traveling salesman problem. ORSA Journal on Computing 4 (1992), 387–411. [12] B IRATTARI , M. The Problem of Tuning Metaheuristics. Phd, Universit´e Libre de Bruxelles, 2006. [13] B IRATTARI , M., AND D ORIGO , M. How to assess and report the performance of a stochastic algorithm on a benchmark problem: Mean or best result on a number of runs? Optimization Letters (2006). ¨ [14] B IRATTARI , M., S T UTZLE , T., P AQUETE , L., AND VARRENTRAPP, K. A Racing Algorithm for Configuring Metaheuristics. In GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA, W. B. Langdon, E. Cant-Paz, K. E. Mathias, R. Roy, D. Davis, R. Poli, K. Balakrishnan, V. Honavar, G. Rudolph, J. Wegener, L. Bull, M. A. Potter, A. C. 243

REFERENCES Schultz, J. F. Miller, E. K. Burke, and N. Jonoska, Eds. Morgan Kaufmann, 2002, pp. 11–18. [15] B LUM , C. Ant colony optimization: Introduction and recent trends. Physics of Life Reviews 2, 4 (2005), 353–373. [16] B LUM , C., AND R OLI , A. Metaheuristics in combinatorial optimization: Overview and conceptual comparison. ACM Computing Surveys 35, 3 (2003), 268–308. [17] B LUM , C., AND S AMPELS , M. An ant colony optimization algorithm for shop scheduling problems. Journal of Mathematical Modelling and Algorithms 3, 3 (2004), 285–308. [18] B OOCH , G. Object-oriented Analysis and Design with Applications, second ed. The Benjamin/Cummings Publishing Company, Inc., 1994. [19] B OTEE , H. M., AND B ONABEAU , E. Evolving Ant Colony Optimization. Advances in Complex Systems 1 (1998), 149–159. [20] B OX , G. E. P. Sequential experimentation and sequential assembly of designs. Quality Engineering 5, 2 (1992), 321–330. [21] B OX , G. E. P., AND C OX , D. R. An Analysis of Transformations. Journal of the Royal Statistical Society Series B (Methodological) 26, 2 (1964), 211–252. [22] B REEDAM , A. V. Improvement Heuristics for the Vehicle Routing Problem Based on Simulated Annealing. European Journal of Operations Research 86, 3 (1995), 480–490. [23] B ULL , J. M., S MITH , L. A., B ALL , C., P OTTAGE , L., AND F REEMAN , R. Benchmarking Java against C and Fortran for scientific applications. Concurrency and Computation: Practice and Experience 15, 3-5 (2003), 417–430. [24] B ULLNHEIMER , B., H AR TL , R. F., AND S TRAUSS , C. A New Rank Based Version of the Ant System: A Computational Study. Central European Journal for Operations Research and Economics 7, 1 (1999), 25–38. [25] B ULLNHEIMER , B., H AR TL , R. F., AND S TRAUSS , C. An Improved Ant System Algorithm for the Vehicle Routing Problem. Annals of Operations Research 89 (1999), 319–328. [26] C HEESEMAN , P., K ANEFSKY, B., AND T AYLOR , W. M. Where the Really Hard Problems Are. In Proceedings of the Twelfth International Conference on Artificial Intelligence, vol. 1. Morgan Kaufmann Publishers, Inc., USA, 1991, pp. 331–337. [27] C HIARANDINI , M., P AQUETE , L., P REUSS , M., AND R IDGE , E. Experiments on Metaheuristics: Methodological Overview and Open Issues. Tech. Rep. IMADA-PP-2007-04, Institut for Matematik og Datalogi, University of Southern Denmark, 20 March. [28] C OFFIN , M., AND S ALTZMAN , M. J. Statistical Analysis of Computational Tests of Algorithms and Heuristics. INFORMS Journal on Computing 12, 1 (2000), 24–44. [29] C OHEN , P. R. Empirical Methods for Artificial Intelligence. The MIT Press, Cambridge, Massachusetts, 1995. [30] C OLORNI , A., D ORIGO , M., M AFFIOLI , F., M ANIEZZO , V., R IGHINI , G., AND T RUBIAN , M. Heuristics from Nature for Hard Combinatorial Problems. International Transactions in Operational Research 3, 1 (1996), 1–21. [31] C ORDN , O., F ERNANDEZ , I., H ERRERA , F., AND M ORENO , L. A New ACO Model Integrating Evolutionary Computation Concepts: The Best-Worst Ant System. In Proceedings of ANTS’2000. From Ant Colonies to Artificial Ants: Second Interantional Workshop on Ant Algorithms, Brussels, Belgium, September 7-9, 2000. 2000, pp. 22–29. 244

REFERENCES [32] C OSTA , D., AND H ER TZ , A. Ants Can Colour Graphs. The Journal of the Operational Research Society 48, 3 (1997), 295–305. [33] C OY, S., G OLDEN , B., R UNGER , G., AND WASIL , E. Using Experimental Design to Find Effective Parameter Settings for Heuristics. Journal of Heuristics 7, 1 (2001), 77–97. [34] C ROWDER , H. P., D EMBO , R. S., AND M ULVEY, J. M. Reporting Computational Experiments in Mathematical Programming. Mathematical Programming 15 (1978), 316–329. [35] C ROWDER , H. P., D EMBO , R. S., AND M ULVEY, J. M. On Reporting Computational Experiments with Mathematical Software. ACM Transactions on Mathematical Software 5, 2 (1979), 193–203. [36] C ZARN , A., M AC N ISH , C., V IJAYAN , K., T URLACH , B., AND G UPTA , R. Statistical Exploratory Analysis of Genetic Algorithms. IEEE Transactions on Evolutionary Computation 8, 4 (2004), 405–421. [37] C ZITROM , V. One-Factor-at-a-Time versus Designed Experiments. The American Statistician 53, 2 (1999), 126–131. [38] DEN B ESTEN , M. L. Simple Metaheuristics for Scheduling: An empirical investigation into the application of iterated local search to deterministic scheduling problems with tardiness penalties. Phd, Germany. ¨ [39] DEN B ESTEN , M. L., S T UTZLE , T., AND D ORIGO , M. Ant colony optimization for the total weighted tardiness problem. In Proceedings of PPSN-VI, sixth international conference on parallel problem solving from nature, M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J. J. Merelo, and H.-P. Schwefel, Eds., vol. 1917 of Lecture Notes in Comput Science. Springer, Berlin, 2000, pp. 611–620. [40] D ERRINGER , G., AND S UICH , R. Simultaneous Optimization of Several Response Variables. Journal of Quality Technology 12, 4 (1980), 214–219. [41] D OERR , B., N EUMANN , F., S UDHOLT, D., AND W ITT, C. On the Runtime Analysis of the 1-ANT ACO Algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference, vol. 1. ACM, 2007, pp. 33–40. [42] D ORIGO , M., AND B LUM , C. Ant colony optimization theory: A survey. Theoretical Computer Science 344, 2-3 (2005), 243–278. [43] D ORIGO , M., AND C ARO , G. D. The Ant Colony Optimization Meta-Heuristic. In New Ideas in Optimization, D. Corne, M. Dorigo, F. Glover, D. Dasgupta, P. Moscato, R. Poli, and K. V. Price, Eds., Mcgraw-Hill’S Advanced Topics In Computer Science. McGraw-Hill, 1999, pp. 11–32. [44] D ORIGO , M., AND C OLORNI , A. The Ant System: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics Part B 26, 1 (1996), 1–13. [45] D ORIGO , M., AND G AMBARDELLA , L. M. Ant Colonies for the Travelling Salesman Problem. BioSystems 43, 2 (1997), 73–81. [46] D ORIGO , M., AND G AMBARDELLA , L. M. Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem. IEEE Transactions on Evolutionary Computation 1, 1 (1997), 53–66. ¨ [47] D ORIGO , M., AND S T UTZLE , T. Ant Colony Optimization. The MIT Press, Massachusetts, USA, 2004. [48] E IBEN , A., AND J ELASITY, M. A critical note on experimental research methodology in EC. In Proceedings of the 2002 IEEE Congress on Evolutionary Computation. IEEE, 2002, pp. 582–587.


REFERENCES ¨ [49] F ISCHER , T., S T UTZLE , T., H OOS , H., AND M ERZ , P. An Analysis Of The Hardness Of TSP Instances For Two High Performance Algorithms. In Proceedings of the Sixth Metaheuristics International Conference. 2005, pp. 361– 367. [50] G AER TNER , D., AND C LARK , K. L. On Optimal Parameters for Ant Colony Optimization Algorithms. In Proceedings of the 2005 International Conference on Artificial Intelligence, vol. 1. CSREA Press, 2005, pp. 83–89. [51] G AGN E´ , C., P RICE , W. L., AND G RAVEL , M. Comparing an ACO algorithm with other heuristics for the single machine scheduling problem with sequence-dependent setup times. Journal of the Operational Research Society 53 (2002), 895–906. [52] G AMBARDELLA , L. M., AND D ORIGO , M. HAS-SOP: hybrid Ant System for the Sequential Ordering Problem. Tech. Rep. IDSIA-11-97, IDSIA, 19 April. [53] G AMBARDELLA , L. M., AND D ORIGO , M. HAS-SOP: An Ant Colony System Hybridized with a New Local Search for the Sequential Ordering Problem. INFORMS Journal on Computing 12, 3 (2000), 237–255. [54] G ANDIBLEUX , X., D ELORME , X., AND T’K INDT, V. An Ant Colony Optimisation Algorithm for the Set Packing Problem. In Proceedings of the Fourth International Workshop on Ant Colony Optimization and Swarm Intelligence, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, and ¨ T. Stutzle, Eds., vol. 3172 of Lecture Notes in Computer Science. 2004, pp. 49– 60. [55] G AREY, M. R., AND J OHNSON , D. S. Computers and Intractability : A Guide to the Theory of NP-Completeness. Books in the Mathematical Sciences. W. H. Freeman, 1979. [56] G LOVER , F. Tabu Search - Part I. ORSA Journal on Computing 1, 3 (1989), 190–206. [57] G OLDBERG , D. E. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Publishing Company, Inc., 1989. [58] G OLDWASSER , M., J OHNSON , D. S., AND M C G EOCH , C. C., Eds. Proceedings of the Fifth and Sixth DIMACS Implementation Challenges. American Mathematical Society, 2002. [59] G OTTLIEB , J., P UCHTA , M., AND S OLNON , C. A Study of Greedy, Local Search, and Ant Colony Optimization Approaches for Car Sequencing Problems. In Proceedings of EvoWorkshops 2003: Applications of Evolutionary Computing, S. Cagnoni, J. R. Cardalda, D. Corne, J. Gottlieb, A. Guillot, E. Hart, C. Johnson, E. Marchiori, J.-A. Meyer, M. Middendorf, and G. Raidl, Eds., vol. 2611 of Lecture Notes in Computer Science. Springer, Berlin, 2003, pp. 246–257. [60] G REENBERG , H. Computational Testing: Why, how and how much? ORSA Journal on Computing 2, 1 (1990), 94–97. [61] G UNTSCH , M., AND M IDDENDORF, M. Pheromone Modification Strategies for Ant Algorithms Applied to Dynamic TSP. In Proceedings of EvoWorkshops 2001: Applications of Evolutionary Computing, E. J. W. Boers, J. Gottlieb, P. L. Lanzi, R. E. Smith, S. Cagnoni, E. Hart, G. R. Raidl, and H. Tijink, Eds., vol. 2037 of Lecture Notes in Computer Science. Springer, Berlin, 2001, p. 213. [62] G UNTSCH , M., AND M IDDENDORF, M. Applying Population Based ACO to Dynamic Optimization Problems. In Ant Algorithms : Third International Workshop,, . Dorigo, G. D. Caro, and M. Sampels, Eds., vol. 2463 of Lecture Notes in Computer Science. Springer, 2002, p. 111.


REFERENCES [63] H ELSGAUN , K. An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research 126, 1 (2000), 106–130. [64] H OOKER , J. N. Needed: An Empirical Science of Algorithms. Operations Research 42, 2 (1994), 201–212. [65] H OOKER , J. N. Testing heuristics: We have it all wrong. Journal of Heuristics 1 (1996), 33–42. ¨ [66] H OOS , H., AND S T UTZLE , T. Stochastic Local Search, Foundations and Applications. Morgan Kaufmann, 2004. [67] H YBARGER , J. The Ten Most Common Designed Experiment Mistakes. Stat Teaser (December 2006). [68] J AIN , R. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation and modeling. John Wiley and Sons Inc., 1991. [69] J OHNSON , D. S. A Theoretician’s Guide to the Experimental Analysis of Algorithms. In Proceedings of the Fifth and Sixth DIMACS Implementation Challenges, M. Goldwasser, D. S. Johnson, and C. C. McGeoch, Eds. American Mathematical Society, 2002, pp. 215–250. [70] J OHNSON , D. S., AND P APADIMITRIOU , C. H. Computational Complexity. In The Traveling Salesman Problem, E. L. Lawler, J. K. Lenstra, A. H. G. R. Kan, and D. B. Shmoys, Eds., Wiley Series in Discrete Mathematics and Optimization. John Wiley and Sons, 1995, pp. 37–85. [71] K APTCHUK , T. J. Effect of interpretive bias on research evidence. British Medical Journal 326 (2003), 1453–1455. [72] K ERIEVSKY, J. Refactoring to Patterns. The Addison-Wesley Signature Series. Addison-Wesley, 2005. [73] K IRKPATRICK , S., G ELATT, C. D., AND V ECCHI , M. P. Optimization by Simulated Annealing. Science 220, 4598 (1983), 671–680. [74] K RAMER , O., G LOGER , B., AND G OEBELS , A. An experimental analysis of evolution strategies and particle swarm optimisers using design of experiments. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2007, pp. 674–681. [75] L AWLER , E. L., L ENSTRA , J. K., K AN , A. H. G. R., AND S HMOYS , D. B., Eds. The Traveling Salesman Problem - A Guided Tour of Combinatorial Optimization. Wiley Series in Discrete Mathematics and Optimization. John Wiley and Sons, New York, USA. [76] L ENTH , R. V. Java Applets for Power and Sample Size. 2006. [77] M ANIEZZO , V., AND C OLORNI , A. The Ant System Applied to the Quadratic Assignment Problem. IEEE Transactions on Knowledge and Data Engineering 11, 5 (1999), 769–778. [78] M ARON , O., AND M OORE , A. Hoeffding races: Accelerating model selection search for classification and function approximation. Advances in Neural Information Processing Systems 6 (1994), 59–66. [79] M C G EOCH , C. C. Toward an experimental method for algorithm simulation. INFORMS Journal on Computing 8, 1 (1996), 1–15. [80] M ERKLE , D., M IDDENDORF, M., AND S CHMECK , H. Ant colony optimization for resource-constrained project scheduling. IEEE Transactions on Evolutionary Computation 6, 4 (2002), 333–346.


REFERENCES [81] M ICHELS , R., AND M IDDENDORF, M. An Ant System for the Shortest Common Supersequence Problem. In New Ideas in Optimization, D. Corne, M. Dorigo, and F. Glover, Eds. McGraw-Hill, 1999, pp. 51–61. [82] M ILES , J. Getting the Sample Size Right: A Brief Introduction to Power Analysis, 2007. [83] M ITCHELL , M., AND T AYLOR , C. E. Evolutionary Computation: An Overview. Annual Review of Ecology and Systematics 20 (1999), 593–616. [84] M ONTGOMERY, D. C. Design and Analysis of Experiments, 6 ed. John Wiley and Sons Inc, 2005. [85] M YERS , R. H., AND M ONTGOMERY, D. C. Response Surface Methodology. Process and Product Optimization Using Designed Experiments. Wiley Series in Probability and Statistics. John Wiley and Sons Inc., 1995. [86] N EUMANN , F., AND W ITT, C. Runtime Analysis of a Simple Ant Colony Optimization Algorithm. In Theory of Evolutionary Algorithms, D. V. Arnold, T. Jansen, M. D. Vose, and J. E. Rowe, Eds., Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, Dagstuhl, Germany, 2006. [87] N ORVIG , P. Mistakes in Experimental Design and Interpretation, 2007. [88] N OWE , A., V ERBEECK , K., AND V RANCX , P. Multi-type Ant Colony: The Edge Disjoint Paths Problem. In Proceedings of the Fourth International Workshop on Ant Colony, Optimization and Swarm Intelligence, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, and T. Stutzle, Eds., vol. 3172 of Lecture Notes in Computer Science. Springer, 2004, pp. 202–213. [89] O STLE , B. Statistics in Research, 2nd ed. Iowa State University Press, 1963. [90] P AQUETE , L., C HIARANDINI , M., AND B ASSO , D. Proceedings of the Workshop on Empirical Methods for the Analysis of Algorithms. In International Conference on Parallel Problem Solving From Nature (Reykjavik, Iceland, 2006). [91] P ARK , M.-W., AND K IM , Y.-D. A systematic procedure for setting parameters in simulated annealing algorithms. Computers and Operations Research 25, 3 (1998), 207–217. [92] P ARK , S. K., AND M ILLER , K. W. Random number generators: good ones are hard to find. Communications of the ACM 31, 10 (1988), 1192–1201. [93] P ARPINELLI , R., L OPES , H., AND F REITAS , A. Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation 6 (2002), 321–332. [94] P ARSONS , R., AND J OHNSON , M. A Case Study in Experimental Design Applied to Genetic Algorithms with Applications to DNA Sequence Assembly. American Journal of Mathematical and Management Sciences 17, 3 (1997), 369–396. [95] P EACE , G. S. Taguchi Methods: A Hands-On Approach. Addison-Wesley, 1993. [96] P ELLEGRINI , P., F AVARETTO , D., AND M ORETTI , E. On Max-Min Ant System’s parameters. In Fifth International Workshop on Ant Colony Optimization and Swarm Intelligence, vol. 4150 of Lecture Notes in Computer Science. Springer Berlin, 2006, pp. 203–214. [97] P LANCK , M. Scientific autobiography and other papers. Williams and Norgate, London, 1950. [98] P RESS , W. H., F LANNERY, B. P., T EUKOLSKY, S. A., AND V ETTERLING , W. T. Numerical Recipes in Pascal: the art of scientific computing. Cambridge University Press, 1989. 248

[99] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. Numerical Recipes in C: the art of scientific computing. Cambridge University Press, Cambridge, 1992.
[100] Randall, M. Near Parameter Free Ant Colony Optimisation. In Proceedings of the Fourth International Workshop on Ant Colony Optimization and Swarm Intelligence, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, and T. Stützle, Eds., vol. 3172 of Lecture Notes in Computer Science. Springer, Berlin, 2004, pp. 374–381.
[101] Rardin, R. L., and Uzsoy, R. Experimental Evaluation of Heuristic Optimization Algorithms: A Tutorial. Journal of Heuristics 7 (2001), 261–304.
[102] Reinelt, G. TSPLIB - A traveling salesman problem library. ORSA Journal on Computing 3 (1991), 376–384.
[103] Ridge, E., and Curry, E. A Roadmap of Nature-Inspired Systems Research and Development. Multi-Agent and Grid Systems 3, 1 (2007).
[104] Ridge, E., and Kudenko, D. Sequential Experiment Designs for Screening and Tuning Parameters of Stochastic Heuristics. In Workshop on Empirical Methods for the Analysis of Algorithms at the Ninth International Conference on Parallel Problem Solving from Nature, L. Paquete, M. Chiarandini, and D. Basso, Eds. 2006, pp. 27–34.
[105] Ridge, E., and Kudenko, D. An Analysis of Problem Difficulty for a Class of Optimisation Heuristics. In Proceedings of the Seventh European Conference on Evolutionary Computation in Combinatorial Optimisation, C. Cotta and J. van Hemert, Eds., vol. 4446 of Lecture Notes in Computer Science. Springer-Verlag, 2007, pp. 198–209.
[106] Ridge, E., and Kudenko, D. Analyzing Heuristic Performance with Response Surface Models: Prediction, Optimization and Robustness. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2007, pp. 150–157.
[107] Ridge, E., and Kudenko, D. Screening the Parameters Affecting Heuristic Performance. Technical Report YCS 415 (www.cs.york.ac.uk/ftpdir/reports/index.php), The Department of Computer Science, The University of York, April 2007.
[108] Ridge, E., and Kudenko, D. Screening the Parameters Affecting Heuristic Performance. In Proceedings of the Genetic and Evolutionary Computation Conference, D. Thierens, H.-G. Beyer, M. Birattari, J. Bongard, J. Branke, J. A. Clark, D. Cliff, C. B. Congdon, K. Deb, B. Doerr, T. Kovacs, S. Kumar, J. F. Miller, J. Moore, F. Neumann, M. Pelikan, R. Poli, K. Sastry, K. O. Stanley, T. Stützle, R. A. Watson, and I. Wegener, Eds., vol. 1. ACM, 2007.
[109] Ridge, E., and Kudenko, D. Tuning the Performance of the MMAS Heuristic. In Engineering Stochastic Local Search Algorithms. Designing, Implementing and Analyzing Effective Heuristics, T. Stützle and M. Birattari, Eds., vol. 4638 of Lecture Notes in Computer Science. Springer, Berlin / Heidelberg, 2007, pp. 46–60.
[110] Ridge, E., and Kudenko, D. Determining whether a problem characteristic affects heuristic performance. A rigorous Design of Experiments approach. In Recent Advances in Evolutionary Computation for Combinatorial Optimization, Studies in Computational Intelligence. Springer, 2008.
[111] Ridge, E., Kudenko, D., and Kazakov, D. A Study of Concurrency in the Ant Colony System Algorithm. In Proceedings of the IEEE Congress on Evolutionary Computation. 2006, pp. 1662–1669.
[112] Scott, L. DOE Strategies: An Overview of the Methodology and Concepts, 2006.


[113] Shmygelska, A., and Hoos, H. An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics 6, 1 (2005), 30.
[114] Silva, R. M. A., and Ramalho, G. L. Going the Extra Mile in Ant Colony Optimization. In Proceedings of the Fourth Metaheuristics International Conference. 2001, pp. 361–366.
[115] Socha, K. The Influence of Run-time Limits on Choosing Ant System Parameters. In Proceedings of the Genetic and Evolutionary Computation Conference, E. Cantu-Paz, J. A. Foster, K. Deb, L. Davis, R. Roy, U.-M. O'Reilly, H.-G. Beyer, R. K. Standish, G. Kendall, S. W. Wilson, M. Harman, J. Wegener, D. Dasgupta, M. A. Potter, A. C. Schultz, K. A. Dowsland, N. Jonoska, and J. F. Miller, Eds., vol. 2723. Springer, 2003, pp. 49–60.
[116] Solnon, C. Ants can solve constraint satisfaction problems. IEEE Transactions on Evolutionary Computation 6, 4 (2002), 347–357.
[117] Stützle, T. Local Search Algorithms for Combinatorial Problems - Analysis, Algorithms and New Applications. PhD thesis, TU Darmstadt, 1998.
[118] Stützle, T., and Hoos, H. H. Max-Min Ant System. Future Generation Computer Systems 16, 8 (2000), 889–914.
[119] van Hemert, J. I. Property Analysis of Symmetric Travelling Salesman Problem Instances Acquired Through Evolution. In Proceedings of the Fifth Conference on Evolutionary Computation in Combinatorial Optimization, G. R. Raidl and J. Gottlieb, Eds., vol. 3448. Springer-Verlag, Berlin, 2005, pp. 122–131.
[120] Voss, S., Martello, S., Osman, I. H., and Roucairol, C., Eds. Meta-Heuristics - Advances and Trends in Local Search Paradigms for Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.
[121] Wikipedia, T. F. E. Travelling salesman problem, 2007.
[122] Wineberg, M., and Christensen, S. An Introduction to Statistics for EC Experimental Analysis. Tutorial at the IEEE Congress on Evolutionary Computation, 2004.
[123] Xu, J., Chiu, S., and Glover, F. Fine-tuning a tabu search algorithm with statistical tests. International Transactions in Operational Research 5, 3 (1998), 233–244.
[124] Zemel, E. Measuring the quality of approximate solutions to zero-one programming problems. Mathematics of Operations Research 6 (1981), 319–332.
[125] Zlochin, M., and Dorigo, M. Model based search for combinatorial optimization: a comparative study. In Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature, J. J. M. Guervós, P. Adamidis, and H.-G. Beyer, Eds., vol. 2439. Springer-Verlag, Berlin, Germany, 2002, pp. 651–661.
