Czech Technical University in Prague Faculty of Electrical Engineering
Fully Automated Knowledge Extraction using
Group of Adaptive Models Evolution

by

Pavel Kordík
A thesis submitted to the Faculty of Electrical Engineering, Czech Technical University in Prague, in partial fulfilment of the requirements for the degree of Doctor.
PhD program: Electrical Engineering and Information Technology
September 2006
Thesis Supervisor: Miroslav Šnorek
Department of Computer Science and Engineering
Faculty of Electrical Engineering
Czech Technical University in Prague

Copyright © 2006 by Pavel Kordík
Abstract and contributions

Keywords like data mining (DM) and knowledge discovery (KD) have appeared in several thousands of articles in recent years. Such popularity is driven mainly by the demand of private companies. They need to analyze their data effectively to get new useful knowledge that can be capitalized. This process is called knowledge discovery, and data mining is a crucial part of it. Although several methods and algorithms for data mining have been developed, there are still many gaps to fill. The problem is that real world data are so diverse that no universal algorithm has been developed to mine all data effectively. Also, the stages of the knowledge discovery process need the full time assistance of an expert on data preprocessing, data mining and knowledge extraction.

These problems can be solved by a KD environment capable of automatic data preprocessing, generating regressive and predictive models and classifiers, automatic identification of interesting relationships in data (even in complex and high-dimensional ones) and presenting the discovered knowledge in a comprehensible form. In order to develop such an environment, this thesis focuses on the research of methods in the areas of data preprocessing, data mining and information visualization.

The Group of Adaptive Models Evolution (GAME) is a data mining engine able to adapt itself and perform optimally on a large (but still limited) group of real-world data sets. The Fully Automated Knowledge Extraction using GAME (FAKE GAME) framework is proposed to automate the KD process and to eliminate the need for the assistance of a data mining expert. The GAME engine is the only GMDH-type algorithm capable of solving very complex problems (as demonstrated on the Spiral data benchmarking problem). It can handle irrelevant inputs and short and noisy data samples. It uses an evolutionary algorithm to find the optimal topology of models. Ensemble techniques are employed to estimate the quality and credibility of GAME models. Within the FAKE framework we designed and implemented several modules for data preprocessing, knowledge extraction and visual knowledge discovery.

Keywords: Data Mining, Knowledge Discovery, GMDH, Continuous Optimization, Niching Genetic Algorithm, Ensemble Models, Data Preprocessing, Visualization, Feature Ranking
Acknowledgements

First of all, I would like to express my gratitude to my thesis supervisor, Dr. Miroslav Šnorek. He managed to create a great environment where many ideas arise, evolve and are shared by people who are interested in soft computing. Thank you, Mirek, also for your personality. I thank prof. Jiřina, who read the early version of this thesis; his comments fundamentally influenced it. Thanks also to prof. Tvrdík for his constructive comments and for pushing me to finish the thesis. Thanks to the GMDH community (prof. Ivakhnenko, his son Gregorij, prof. Stepashko, many other "GMDH" people from Kiev and Dr. Frank Lemke from Berlin) for always being friendly and willing to discuss new ideas. Thanks to Phil Prendergast from Hort Research New Zealand for giving me the chance to model mandarin tree water consumption instead of picking mandarins. This initiated my interest in real world applications of the GMDH theory. Thanks to my friends from our department for their will to stay, collaborate on the research and not complain about bad weather during canoeing trips.

I would like to thank the following people, who collaborate(d) on the FAKE GAME project:

Jan Saidl - classification plots, scatterplot matrix, GA search for interesting projections
Miroslav Čepek - application of the GAME engine to the Sleep stages classification data
Jiří Novák - application of the GAME engine to signal filtering
Jiří Kopsa - 3D visualization of classification boundaries
Jiří Nožka - 3D visualization of regression manifolds
Jan Šimáček - module for distribution transformation to uniform one and back
Tomáš Černý - module for missing values imputation
Lukáš Trlida - parser for the PMML standard of GMDH models
Oleg Kovařík - ACO* and CACO optimization methods
Samuel Ferenčík - HGAPSO optimization method
Jan Drchal - Java implementation of the SADE algorithm
Miroslav Jánošík - DE (version 1) optimization method
Ondřej Filípek - SCG, OS, SOS, palDE training modules
Aleš Pilný - SinusNeuron unit and experiments with the optimal Sin transfer function
Ondřej Zicha - PolyFractNeuron unit with rational transfer function
Michal Semler - ExpNeuron unit with exponential transfer function
David Sedláček - visualization of GAME models' topology (connections)
Pavel Staněk - experiments with GAME engine settings (niching enabled/disabled)

Finally, my greatest thanks to my father, sister and our big family, whose support was of crucial importance.
Dedication
To my wife Jana and our daughter Anička.
Contents

1 Introduction
  1.1 Problem statement
  1.2 Goals of the thesis
  1.3 Contributions of the thesis
  1.4 Organization of the thesis
2 Background and survey of the state-of-the-art
  2.1 The theory related to FAKE
    2.1.1 Automated data preprocessing
      2.1.1.1 Dealing with alpha variables
      2.1.1.2 Imputing missing values
      2.1.1.3 Data normalization
      2.1.1.4 Distribution transformation
      2.1.1.5 Data reduction
    2.1.2 Visual data mining
    2.1.3 Credibility estimation
    2.1.4 Feature ranking
  2.2 The theory related to GAME
    2.2.1 Inductive modeling and the GMDH
      2.2.1.1 The philosophy behind inductive modeling
      2.2.1.2 Group method of data handling (GMDH)
      2.2.1.3 The state of the art in the GMDH related research
    2.2.2 Neural networks
    2.2.3 Optimization methods
      2.2.3.1 Quasi-Newton method (QN)
      2.2.3.2 Conjugate gradient method (CG)
      2.2.3.3 Orthogonal Search (OS)
      2.2.3.4 Genetic algorithms (GA)
      2.2.3.5 Niching methods in the evolutionary computation
      2.2.3.6 Differential Evolution (DE)
      2.2.3.7 Simplified Atavistic Differential Evolution (SADE)
      2.2.3.8 Particle swarm optimization (PSO)
      2.2.3.9 Ant colony optimization (ACO)
      2.2.3.10 Hybrid of the GA and the particle swarm optimization (HGAPSO)
    2.2.4 Ensemble methods
      2.2.4.1 Bagging
      2.2.4.2 Bias-variance decomposition
      2.2.4.3 Simple ensemble and weighted ensemble
      2.2.4.4 Ensembles - state of the art
  2.3 Previous results and related work
3 Overview of our approach - the FAKE GAME framework
  3.1 The goal of the FAKE GAME environment
    3.1.1 Research of methods in the area of data preprocessing
    3.1.2 Automated data mining
    3.1.3 Knowledge extraction and information visualization
4 The design of the GAME engine and related results
  4.1 Heterogeneous units
    4.1.1 Experiments with heterogeneous units
  4.2 Optimization of GAME units
    4.2.1 The analytic gradient of the Gaussian unit
    4.2.2 The analytic gradient of the Sine unit
    4.2.3 The experiment: analytic gradient saves error function evaluations
  4.3 Heterogeneous learning methods
    4.3.1 Experiments with heterogeneous learning methods
  4.4 Structural innovations
    4.4.1 Growth from a minimal form
    4.4.2 Interlayer connections
  4.5 Regularization in GAME
    4.5.1 Regularization of Combi units on real world data
    4.5.2 Evaluation of regularization criteria
  4.6 Genetic algorithm
    4.6.1 Niching methods
      4.6.1.1 Evaluation of the distance computation
      4.6.1.2 The performance tests of the Niching GA versus the Regular GA
      4.6.1.3 The inheritance of unit properties - experimental results
  4.7 Evolving units (active neurons)
    4.7.1 CombiNeuron - evolving polynomial unit
  4.8 Ensemble techniques in GAME
  4.9 Benchmarking the GAME engine
    4.9.1 Internet advertisements
    4.9.2 Pima Indians data set
    4.9.3 Spiral data benchmark
  4.10 Summary
5 The FAKE interface and related results
  5.1 Automated data preprocessing
    5.1.1 Imputing missing values
    5.1.2 Distribution transformation
      5.1.2.1 The design of the transformation function
      5.1.2.2 Experiments with artificial data sets
      5.1.2.3 Mandarin data set distribution transformation
    5.1.3 Data reduction
  5.2 Knowledge extraction and information visualization
    5.2.1 Math formula extraction
    5.2.2 Feature ranking
      5.2.2.1 Extracting significance of features from niching GA used in GAME
    5.2.3 Relationship of variables
      5.2.3.1 Relationships in the Dyslexia data set
      5.2.3.2 Relationships in the Building data set
    5.2.4 Boundaries of classes
      5.2.4.1 Classification boundaries and regression plots in 3D
    5.2.5 GAME classifiers in the scatterplot matrix
    5.2.6 Credibility estimation of GAME models
      5.2.6.1 Credibility estimation - artificial data
      5.2.6.2 Credibility estimation - real world data
      5.2.6.3 Uncertainty signaling for visual knowledge mining
    5.2.7 Credibility of GAME classifiers
    5.2.8 The search for interesting behavior
      5.2.8.1 Ensembling: what do we mean by "interesting behavior"?
      5.2.8.2 Evolutionary search on simple synthetic data
      5.2.8.3 Experiments with diversity
      5.2.8.4 Study with more complex synthetic data
      5.2.8.5 Experiments with real world data
6 Applications of the FAKE GAME framework
  6.1 Noise cancelation by means of GAME
    6.1.1 Finite impulse response filter (FIR)
    6.1.2 Replacing FIR by the GAME network
    6.1.3 Experiment with synthetic data
  6.2 Sleep stages classification using the GAME engine
    6.2.1 GMDH for classification purposes
    6.2.2 Data acquisition and preprocessing
    6.2.3 Classification of sleep stages
      6.2.3.1 The configuration of GAME engine
      6.2.3.2 The configuration of WEKA methods
      6.2.3.3 Comparison of different methods
    6.2.4 Experiments with GAME configurations
7 Summary and conclusions
8 Suggestions for the future work
9 Bibliography
10 Publications of the author
A Standardization of GMDH clones
  A.1 PMML description of GMDH type polynomial networks
B Data sets used in this thesis
  B.1 Building data set
  B.2 Boston data set
  B.3 Mandarin data set
  B.4 Dyslexia data set
  B.5 Antro data set
  B.6 UCI data sets
C The FAKE GAME environment
  C.1 The GAME engine - running the application
    C.1.1 Loading and saving data, models
    C.1.2 How to build models
  C.2 Units so far implemented in the GAME engine
  C.3 Optimization methods in the GAME engine
  C.4 Configuration options of the GAME engine
  C.5 Visual knowledge extraction support in GAME
D Results of additional experiments
List of Figures

1.1 Fully Automated Knowledge Extraction (FAKE) using Group of Adaptive Models Evolution (GAME)
2.1 Real world data set with missing values and alpha variables
2.2 Pixel oriented scatterplot for variables relationship analysis
2.3 Some climatic models also use ensembling to estimate the uncertainty of the prediction
2.4 The original MIA GMDH network
2.5 Models are trained for the same task and then combined
2.6 The Bagging scheme: models are constructed on bootstrapped samples
2.7 The Bias-Variance Decomposition
2.8 Reducing the variance and bias part of the error by ensembling
2.9 The comparison of GMDH methods with various levels of modification
3.1 FAKE GAME environment for the automated knowledge extraction
4.1 Group of Adaptive Models Evolution (GAME)
4.2 The comparison: original MIA GMDH network and the GAME network
4.3 List of units implemented in the FAKE GAME environment
4.4 Units' competition on the Building data set
4.5 Units' competition on the Spiral data set
4.6 The process of GAME units optimization
4.7 Exponentially growing computational complexity eliminated by the analytic gradient
4.8 Learning methods on the Ecoli data set
4.9 Learning methods on the Boston data set
4.10 Learning methods on the Building data set
4.11 Models complexity, a minimum of the regularization criterion and the noise
4.12 The expected and the measured error for different regularization and noise
4.13 Regularization of the Combi units on the Antro data set
4.14 Regularization of the Combi units on the Building data set
4.15 The validation on both the training and the validation set is better for complex data with low noise
4.16 The relative performance of regularization methods on various levels of noise
4.17 Encoding of units when optimized in GAME layers
4.18 Regular GA versus Niching GA: non-correlated units can be preserved
4.19 The fitness can be higher when non-correlated inputs are used
4.20 The distance of two units in the GAME network
4.21 The visualization of units correlation and distances during the evolution
4.22 Results of the experiment with different distance computation between GAME units
4.23 The GA versus the Niching GA with DC: the experiment proved our assumptions
4.24 For the WBHW and WBE variables, the GA with DC is significantly better than the regular GA
4.25 For a simple data set, the GA with DC attained a superior performance in all output variables
4.26 The fifty percent inheritance level is a reasonable choice for all three data sets
4.27 Encoding of the transfer function for the CombiNeuron unit
4.28 Bagging the GAME models
4.29 The ensemble of suboptimal models tops their accuracy
4.30 For optimally trained models the ensemble does not have superior performance
4.31 Ensemble of two models exhibiting diverse errors can provide a significantly better result
4.32 Two GAME networks solving the intertwined spirals problem
5.1 The comparison of imputing methods on the Stock market prediction data set
5.2 The principle of the distribution transformation
5.3 How to create an artificial distribution function
5.4 For data that are closer to the uniform distribution, results are not significant
5.5 Histograms of the original data set and the data set transformed by the artificial distribution function
5.6 For a highly non-uniform distribution, models built on the transformed data are significantly better
5.7 The artificial distribution functions for input features of the Mandarin data set
5.8 The scatterplot matrix of the Mandarin data before and after the transformation
5.9 How to extract the math equation from the GAME model
5.10 The extraction of a math formula from the GAME model on the Anthrop data
5.11 The number of units connected to a particular input signifies its importance
5.12 The feature ranking derived during the construction of the GAME inductive model
5.13 Visualizing relationship of variables derived by a model
5.14 Data projection into a regression plot
5.15 With data vectors displayed, the quality of models can be evaluated
5.16 An overfitted nonlinear inductive model on the Dyslexia data set
5.17 The group of linear inductive models shows the relation of a reading speed to dyslexia
5.18 Relationship plots on the Building data set
5.19 The classification plot of the Pima Indians Diabetes data set
5.20 Visualization of 3D manifolds representing decision boundaries of a GAME model on the Iris data set
5.21 3D visualization of GAME models' regression manifolds together with data vectors
5.22 The visualization of a class membership into a scatterplot matrix for the Ecoli data set
5.23 Responses of GAME models for a testing vector lying in the area insufficiently described by the training data set
5.24 The dependence of ρ on dy'_i is linear for artificial data without noise
5.25 The dependence of ρ on dy'_i is quadratic for real world data
5.26 GAME models on the Building data set with the uncertainty signified by the envelope in the background
5.27 The explanation how to combine ensemble models to get better defined class memberships
5.28 When models are multiplied, the classification into classes is better constrained
5.29 Multiplication is sensitive to anomalies in model behavior
5.30 Interesting behavior of models
5.31 Synthetic training data and ensemble models approximating it
5.32 The plot of the fitness function for all possible individuals
5.33 The individual with the highest fitness dominated the population of the genetic algorithm
5.34 Diversity in the population for the standard genetic algorithm
5.35 Diversity in the population for the niching genetic algorithm (Deterministic Crowding employed)
5.36 Three solutions found by the niching genetic algorithm
5.37 Input vectors used for generating training data are concentrated in clusters
5.38 The best individuals after ten and fifty generations for features x1, x2, x3
5.39 Plots showing the relationship of the feature temp and outputs wbc, wbhw and wbe
6.1 The architecture of the FIR filter
6.2 The GAME network functioning as a filter
6.3 Frequency response of the reference signal (left) and the input signal (right)
6.4 The signal filtered by the GAME network (left) corresponds better to the reference signal
6.5 GAME and adaptive FIR doing a feature ranking
6.6 An example of the GAME classifier of the REM Sleep Stage
A.1 An example of a simple GMDH type model that is described in PMML below
C.1 The configuration of units in the GAME engine
D.1 The behavior of a GAME model consisting of Gaussian units almost resembles fractals
D.2 Characteristic response of the FIR filter
D.3 Characteristic response of the GAME network filter - noise is better inhibited in regions
D.4 The percentage of surviving units in the GAME network according to their type (Motol, Ecoli data)
D.5 The type of surviving units - Mandarin and Iris data sets
D.6 Relationship plots on the Antro data set
D.7 The classification of the Spiral data by the GAME model evolved with all units enabled
D.8 Units' competition on the Boston data set
D.9 Units' competition on the Ecoli data set
D.10 Units' competition on the Mandarin data set
D.11 Units' competition on the Iris data set
D.12 The performance comparison of learning methods on the Mandarin data set
D.13 Regularization of the Combi units on the Antro data set
D.14 Performance of GAME ensembles on the Advertising data set depending on the number of member models
D.15 The configuration window of the CombiNeuron unit
1 Introduction

The Group Method of Data Handling (GMDH) was invented by A.G. Ivakhnenko in the late sixties [47]. He was looking for a computational instrument that would allow him to model real world systems characterized by data with many inputs (dimensions) and few records. Such ill-posed problems could not be solved traditionally (ill-conditioned matrices) and therefore a different approach was needed. Prof. Ivakhnenko proposed the GMDH method, which avoided the solution of ill-conditioned matrices by decomposing them into submatrices of lower dimensionality that could be solved easily. The more important idea behind the GMDH is the adaptive process of combining these submatrices back into the final solution. The original GMDH method is called the Multilayered Iterative Algorithm (MIA GMDH). Many similar GMDH methods based on the principle of induction (problem decomposition and combination of partial results) have been developed since then.

The only possibility to model real world systems before the GMDH was to manually create a set of math equations mimicking the behavior of a real world system. This involved a lot of time, domain expert knowledge and also experience with the synthesis of math equations. The GMDH allowed automatic generation of a set of these equations. A model of a real world system can also be created by Data Mining (DM) algorithms, particularly by artificial Neural Networks (NNs). Some DM algorithms, such as decision trees, are simple to understand, whereas NNs often have such a complex structure that they are necessarily treated as black-box models. The MIA GMDH is something in between - it generates polynomial equations which are less comprehensible than a decision tree, but better interpretable than an NN model. In this thesis, we propose the Group of Adaptive Models Evolution
Figure 1.1: Fully Automated Knowledge Extraction (FAKE) using Group of Adaptive Models Evolution (GAME). The scheme covers problem identification, data collection, data inspection, data cleaning, data integration and data warehousing, followed by automated data preprocessing of the input data, the GAME engine producing a group of models, and the FAKE interface offering classification, prediction, identification and regression, class boundaries and relationships of variables, credibility estimation, math equations, interesting behaviour and feature ranking.
(GAME) engine, which evolved from the MIA GMDH. The ensemble of GAME models is more accurate than models generated by the GMDH, and it also outperforms NN models on the data sets we have been experimenting with. The consequence of improved accuracy is greater complexity and reduced interpretability of models. To deal with this problem, we propose several techniques for knowledge extraction from GAME models. The goal of this thesis is to propose a framework for Fully Automated Knowledge Extraction (FAKE) using the GAME engine (Figure 1.1). It should help domain experts to extract useful knowledge from real world data without the assistance of data miners or statisticians. Figure 1.1 shows that data collection, warehousing, integration, etc. are not in the scope of this thesis. These steps cannot be automated in general; they are highly dependent on the specific conditions of the data provider. Within the FAKE interface, we explore methods for automated data preprocessing for the GAME engine. Preprocessing methods are necessary to increase the number of data sets that can be processed by means of the GAME engine. The GAME engine itself is designed to mine data automatically, without experimenting with the proper topology and optimal parameter settings. The knowledge extraction is supported by several methods (information visualization, feature ranking, formula extraction, etc.) belonging to the FAKE interface. The motivation for proposing the FAKE GAME framework can be found in the next section.
1.1 Problem statement
With the continual development of sensors and computers, the amount of collected data increases dramatically. Data sets obtained from various domains of human activity are growing in size and diversity. Global data repositories include tiny data sets as well as extremely large multivariate ones having several billions of instances. The collected data sets can also differ in the complexity of the problem they describe. Many data sets include missing values and outliers and are affected by noise. In order to extract some useful knowledge from these data, they need to be manually processed first. The traditional techniques for knowledge extraction, like Exploratory Data Analysis followed by statistical techniques, are very demanding in terms of both statistical skills and time. Therefore most of the data collected by companies and institutions remain unexplored and the valuable knowledge hidden inside is lost.

During the last decade, Data Mining and Knowledge Discovery became hot topics for private companies. They know the price of the knowledge in their data and they do not wish to lose it. Data Mining [32] employs machine learning algorithms to analyze real world data. Probably the most popular DM methods are artificial neural networks (MLP, etc.). They are popular mainly thanks to their usability as a black box, without the need to know how they work inside. They can also often quickly deliver superb results. However, there are also problems with DM methods and particularly with NNs. Some of the biggest drawbacks are listed below.

The first drawback is that one has to be experienced in NN theory to be able to get really
reliable results. One has to choose a proper neural network architecture and a suitable learning paradigm, preprocess the data for the neural network and interpret the results. Using NNs as a black box is often a source of serious mistakes (data overfitting, recalling patterns far from the training data, etc.).

The second drawback is that the knowledge of the system is hidden in the network weights (black box), as opposed to DM methods that generate formulas interpretable by humans. A neural network can of course also be written in the form of a math equation, but even for simple networks the equation is several pages long and contains nested nonlinear functions.

The third major drawback is that users do not know when they can trust a neural network. Whereas for some input patterns the response is perfect, for other patterns the output is far from the target value. Especially for real world data, it is very hard to distinguish regions where the NN is trained well from regions where the output is almost random.

This thesis targets all these drawbacks. We build the FAKE GAME environment to automate the process of knowledge extraction from data. Firstly, a user of the FAKE GAME environment does not need to be experienced in the theory of neural networks or the GMDH. The GAME engine incorporates a genetic algorithm to evolve optimal models with the proper type of units (hybrid models) and optimal learning methods. GAME models grow during the learning process (their topology is not given in advance), so a size of models proportional to the complexity of the problem is guaranteed. Secondly, the models evolved by GAME can be written in the form of math equations. To overcome the black-box disadvantage of more complex models, we use the visualization of models' behavior. Within the FAKE interface, we developed several visualization techniques that can be directly used for visual knowledge mining. Plots showing models' behavior are very useful especially for complex systems, where math equations are not interpretable any more. We also automated the search for "interesting" plots of variables' relationships in the multidimensional space. The third above mentioned drawback is the questionable plausibility of a neural network. The problem is that an NN can give a plausible output just for the cases it has been successfully trained for. A good output of the NN model cannot be ensured by constraining its input features or by computing the distance from the training data. The GAME engine solves this problem by evolving an ensemble of diverse models. The more they differ in responses to the same input, the less plausible their output is. The difference in responses of ensemble models is also used in the definition of the interesting behavior of models (see Section 5.2.8).
1.2 Goals of the thesis
The goals of this thesis are the following:

• Propose the FAKE GAME framework.
• Describe the GAME engine.
• Evaluate the functionality of the improvements proposed.
• Benchmark the GAME engine with other data mining tools.
• Describe the FAKE interface.
• Present some applications of the described techniques.
1.3 Contributions of the thesis
The original results presented in this thesis are:

• The FAKE GAME framework for automated knowledge extraction from data.
• The GAME engine for automated data mining (it evolves ensembles of models for the purpose of regression, prediction and classification).
  – Heterogeneous GAME units (hybrid models perform better than uniform ones) - Section 4.1.
  – Optimization of GAME units (the analytic gradient significantly reduces the number of error evaluation calls needed to reach the optimum) - Section 4.2.
  – Heterogeneous learning methods (several optimization methods compared on diverse real world problems) - Section 4.3.
  – Regularization of GAME units (regularization prevented the CombiNeuron unit from overfitting noisy data) - Section 4.5.
  – Genetic algorithm evolving GAME units layer by layer (it evolves input connections as well as transfer functions, properties and the type of learning method used) - Sections 4.6, 4.7.
  – Niching scheme employed (maintaining diversity increased the accuracy of models) - Section 4.6.1.
  – Ensemble of GAME models generated (the ensemble response is more reliable and often more accurate than single models) - Section 4.8.
  – GAME benchmarks: GAME achieved superior results in all benchmarks we have performed; it solved the intertwined spirals problem as the only GMDH-type method we know of - Section 4.9.
• The FAKE interface consisting of modules for automated data preprocessing and knowledge extraction support.
  – Missing values imputing (promising performance of the Euclidean distance neighbor replacement method) - Section 5.1.1.
  – Transformation to uniform distribution (transformation of data using an artificial distribution function significantly improved the accuracy of GAME models on a simple synthetic data set) - Section 5.1.2.
  – Math formula extraction (regularized CombiNeuron GAME models can be serialized into simple polynomial equations suitable for knowledge extraction) - Section 5.2.1.
  – Feature ranking (three novel algorithms for the feature ranking) - Sections 5.2.2, 5.2.2.1, 5.2.8.5.
  – Regression plots allow studying the relationship of variables in particular conditions - Section 5.2.3.
  – Classification plots enable studying the decision boundaries of classes estimated by models - Section 5.2.4.
  – Interactive 3D regression (classification) plots are helpful when there exists more than one (two) important feature(s) in the modeled system - Section 5.2.4.1.
  – For multivariate problems we proposed the scatterplot matrix enriched by information on models' classification boundaries - Section 5.2.5.
  – The credibility of models was empirically set to be inversely proportional to the dispersion of the ensemble models' responses - Section 5.2.6.
  – We use an ensemble of classifiers, where member models are multiplied, to visualize just the credible areas of class membership - Section 5.2.7.
  – To locate interesting regression plots in the multidimensional space automatically, we use a genetic search with a specific fitness function - Section 5.2.8.
• The GAME engine outperformed the FIR filter and also classifiers from the WEKA environment on the Sleep stages classification problem - Sections 6.1, 6.2.
1.4 Organization of the thesis
This thesis is organized as follows. After the introduction, the second chapter summarizes the state of the art in domains connected to this thesis. The chapter is subdivided into two sections. The first section deals with the theory related to the Fully Automated Knowledge Extraction concept; namely, the advances in Data Preprocessing, Visual Data Mining, Feature Ranking and other methods related to the Knowledge Discovery process are briefly mentioned. The second section focuses on the theory related to the core of the FAKE GAME framework - the Group of Adaptive Models Evolution. The state of the art in the areas of Inductive Modeling, Continuous Optimization Techniques, Neural Networks and Ensemble Methods is described in this section.

In the third chapter we propose the FAKE GAME framework for automated knowledge extraction. The fourth chapter describes the design of the GAME engine, the core of the framework. Several improvements of the state of the art methods and their empirical evaluation can be found in separate sections of this chapter. Benchmarks of the GAME engine conclude this chapter. The FAKE interface is described in chapter five: several methods are described and their application to real world problems is demonstrated. Chapter six presents two case studies, in which the GAME engine and feature ranking methods from the FAKE interface were applied to real world problems.

After the conclusion, future work and bibliography chapters, there are three chapters in the appendix of this thesis. The first proposes the PMML standard for exchanging GMDH models. In the second chapter of the appendix we document the FAKE GAME environment. The last chapter contains additional figures and graphs extending the experiments made within the thesis.
Figure 2.1: An example of the real world data set with missing values and alpha variables.
2 Background and survey of the state-of-the-art

We divided this chapter into two sections. The first section discusses the background and state of the art related to the FAKE interface. The second section deals with the knowledge needed to understand the process of design and evolution of the GAME engine.
2.1 The theory related to FAKE
Fully Automated Knowledge Extraction is an interface to the GAME engine. It combines methods from several scientific domains. We designed this interface because the extraction of knowledge is a time consuming task demanding expert skills. For the knowledge extraction, we need a data set describing the behavior of the system under investigation. This raw data set needs to be preprocessed first. The task of data preprocessing is of crucial importance and it often takes more time to preprocess the data than to mine it [80].
2.1.1 Automated data preprocessing
The data preprocessing phase can be divided into several steps [38]. Not all steps necessarily need to be performed; it depends on the quality of the data and on the requirements of the data mining method used which steps are demanded. To preprocess data for the GAME engine, the following steps can be performed.
2.1.1.1 Dealing with alpha variables
The FAKE GAME environment works with numerical data. All characters, strings or symbols (alpha variables) have to be encoded into numerical variables. The encoding can be performed automatically (adding a new binary variable for each discovered string in a data set). However, the dimensionality of the resulting data set might become huge when there are many unique strings in the data set. Then it is wise to let the domain expert reduce the dimensionality by utilizing his expert knowledge of the strings (several strings can be encoded in one new variable, e.g. the variable Pressure: "low" = 0.1, "medium" = 0.4, etc.). A minimal sketch of the automatic binary encoding is given below.
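The automatic binary (one-hot) encoding mentioned above can be illustrated by the following sketch; the class and method names are made up for this example and are not part of the FAKE GAME code base.

```java
import java.util.*;

/** Illustrative one-hot encoder for alpha (string) variables; not the FAKE GAME implementation. */
public class AlphaEncoder {
    /** Encodes a column of string values into binary columns, one per unique string. */
    public static int[][] encode(String[] column) {
        // Collect the distinct strings in order of first appearance.
        List<String> categories = new ArrayList<>();
        for (String value : column) {
            if (!categories.contains(value)) categories.add(value);
        }
        // Build one binary variable per category.
        int[][] encoded = new int[column.length][categories.size()];
        for (int row = 0; row < column.length; row++) {
            encoded[row][categories.indexOf(column[row])] = 1;
        }
        return encoded;
    }

    public static void main(String[] args) {
        int[][] result = encode(new String[]{"COIL", "SHEET", "COIL"});
        System.out.println(Arrays.deepToString(result)); // [[1, 0], [0, 1], [1, 0]]
    }
}
```

With many unique strings this encoding explodes the dimensionality, which is exactly why the expert-driven mapping of several strings into one numeric variable is preferred in such cases.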
2.1.1.2 Imputing missing values
We assume that missing values have been identified by the domain expert and replaced e.g. by the "?" symbol, as shown in Figure 2.1. Some neural networks, such as the Self Organizing Map (SOM) [55], can work with data containing missing values [23]. Neither the GMDH nor the GAME engine can deal with an incomplete data set. The problem of missing data can be overcome by the following techniques. The easiest way to deal with missing data is to delete records containing missing values. However, this will not work for data with a higher percentage of missing values. Also, some potentially useful information can be lost by leaving out all records with missing values. A better approach is to fill in missing values (missing values imputing). We can replace all missing values by zero or, if we assume that the distribution of missing values is the same as that of the non-missing values, we can replace them by a mean value. These techniques are fast and simple to implement, but they introduce bias and they do not take into account interrelationships in data [25]. Another technique is based on similarity measures. It assumes that if two records match in non-missing values, they probably have all values identical. For ordinal variables, the similarity can be computed e.g. by the Euclidean distance. Missing values are then imputed by the values taken from the most similar records. Even more sophisticated techniques model the relationship between attributes (features) and use these models (Decision Tree, Linear Regression, Neural Networks or GMDH) to impute the missing values. Other methods we can utilize for imputing are Markov Chain Monte Carlo (MCMC) or Propensity Scores, which use the approximate Bayesian bootstrap to estimate missing values from non-missing ones [25]. A minimal sketch of the similarity-based imputation is given below.
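The similarity-based imputation can be sketched as follows; this is an illustrative simplification (missing values marked by Double.NaN), not the FAKE GAME module described in Section 5.1.1.

```java
/** Illustrative nearest-neighbour imputation using the Euclidean distance on non-missing values.
 *  A simplified sketch, not the FAKE GAME module; missing values are marked by Double.NaN. */
public class NeighbourImputer {
    public static void impute(double[][] data) {
        for (double[] record : data) {
            for (int col = 0; col < record.length; col++) {
                if (Double.isNaN(record[col])) {
                    record[col] = nearestValue(data, record, col);
                }
            }
        }
    }

    /** Finds the most similar record with a known value in the given column. */
    private static double nearestValue(double[][] data, double[] target, int col) {
        double best = Double.MAX_VALUE, value = 0.0;
        for (double[] candidate : data) {
            if (candidate == target || Double.isNaN(candidate[col])) continue;
            double dist = 0.0;
            for (int i = 0; i < target.length; i++) {
                // Compare only features known in both records.
                if (i != col && !Double.isNaN(target[i]) && !Double.isNaN(candidate[i])) {
                    dist += (target[i] - candidate[i]) * (target[i] - candidate[i]);
                }
            }
            if (dist < best) { best = dist; value = candidate[col]; }
        }
        return value;  // falls back to 0.0 when no complete candidate exists
    }
}
```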
2.1.1.3 Data normalization
For neural networks with a nonlinear sigmoid transfer function in neurons (Multi Layered Perceptron, Cascade Correlation Neural Network, etc.), at least the output variable has to be normalized into the $\langle 0, 1 \rangle$ range. The GMDH does not require normalization because it uses polynomial transfer functions with an unlimited range. The GAME engine uses normalization to be able to utilize units with limited-range transfer functions, but it can also work without it. A variable can be normalized by the Min-max normalization $v_{norm} = \frac{v - v_{min}}{v_{max} - v_{min}}$, by the Z-score normalization $v_{norm} = \frac{v - v_{mean}}{v_{std\,dev}}$, or by the Decimal scaling $v_{norm} = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v_{norm}|) < 1$ [38]. The Softmax Scaling, which uses a logistic function to project variables into the $\langle 0, 1 \rangle$ interval, is also frequently used. A minimal sketch of the first two normalizations is given below.
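The Min-max and Z-score normalizations of a single variable can be sketched as follows; this is illustrative only and not the GAME engine's own preprocessing code.

```java
/** Illustrative Min-max and Z-score normalization of one variable (one data column). */
public class Normalizer {
    /** Projects values linearly into the <0,1> interval. */
    public static double[] minMax(double[] v) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double x : v) { min = Math.min(min, x); max = Math.max(max, x); }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (v[i] - min) / (max - min);
        return out;
    }

    /** Centers values to zero mean and unit standard deviation. */
    public static double[] zScore(double[] v) {
        double mean = 0.0;
        for (double x : v) mean += x;
        mean /= v.length;
        double var = 0.0;
        for (double x : v) var += (x - mean) * (x - mean);
        double std = Math.sqrt(var / v.length);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (v[i] - mean) / std;
        return out;
    }
}
```

Note that a single extreme outlier shrinks the Min-max result of all other records towards zero, which is why outlier detection is discussed next.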
According to the article [61], a logarithmic normalization improved the clustering results of gene expression data (it was found superior to other normalization techniques such as Min/Max scaling). However, on different data the results are likely to be different. When normalizing variables, it is important to check that the data set does not contain outliers. An outlier is a record that is extremely abnormal and, for certain types of normalization (e.g. Min-max), it can project almost all records to a tiny interval near zero. Outliers can be detected for example by means of the Semi-Discrete Decomposition (SDD), as described in [68]. The Softmax Scaling, when properly configured, can also deal with outliers.
2.1.1.4 Distribution transformation
According to [80], the best distribution for data mining is the uniform one. The more the distribution of a data set differs from the uniform distribution, the worse results we are likely to encounter. In this thesis we propose a technique that can transform data from whatever distribution to a distribution close to the uniform one (see Section 5.1.2). A common baseline for such a transformation is sketched below.
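For illustration, a rank-based (empirical CDF) transformation that maps any distribution approximately to the uniform one is shown below. This is not the artificial distribution function proposed in Section 5.1.2, which is also invertible back to the original distribution; it is only a common baseline.

```java
import java.util.Arrays;

/** Rank-based (empirical CDF) transformation to an approximately uniform distribution.
 *  A common baseline, not the artificial distribution function proposed in Section 5.1.2. */
public class UniformTransform {
    public static double[] toUniform(double[] v) {
        Integer[] order = new Integer[v.length];
        for (int i = 0; i < v.length; i++) order[i] = i;
        // Sort indices by value; the rank of a value estimates its cumulative probability.
        Arrays.sort(order, (a, b) -> Double.compare(v[a], v[b]));
        double[] out = new double[v.length];
        for (int rank = 0; rank < order.length; rank++) {
            out[order[rank]] = (rank + 0.5) / v.length;   // maps into (0,1)
        }
        return out;
    }
}
```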
2.1.1.5 Data reduction
When a data set contains several thousands of records, it is likely that some of the records are redundant and some are almost identical. The more records we use for a data mining method, the more time it consumes. The computational time of certain DM methods grows almost exponentially with an increasing number of records. Therefore some data reduction mechanism has to be applied to large data sets in the preprocessing stage. We need to reduce the data set in volume but at the same time preserve the knowledge hidden inside. We have implemented an application that reduces the data set in order to get a uniform distribution of the output variable (see Section 5.1.3); the sketch below illustrates the idea.
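The idea can be illustrated by the following simplified sketch, which keeps at most a fixed number of records per bin of the output variable; it is illustrative only, not the actual application from Section 5.1.3.

```java
import java.util.*;

/** Illustrative data reduction: keep at most 'perBin' records for each interval (bin) of the
 *  output variable, so its distribution becomes roughly uniform. Not the module from Section 5.1.3. */
public class UniformOutputReducer {
    public static List<double[]> reduce(List<double[]> records, int outputIndex, int bins, int perBin) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double[] r : records) {
            min = Math.min(min, r[outputIndex]);
            max = Math.max(max, r[outputIndex]);
        }
        int[] counts = new int[bins];
        List<double[]> reduced = new ArrayList<>();
        for (double[] r : records) {
            int bin = (int) Math.min(bins - 1, (r[outputIndex] - min) / (max - min) * bins);
            if (counts[bin] < perBin) {       // discard redundant records from over-represented bins
                counts[bin]++;
                reduced.add(r);
            }
        }
        return reduced;
    }
}
```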
2.1.2 Visual data mining
Real world data sets are usually of medium size (5+ attributes, 1000+ observations). When viewing such a data set as columns of numbers, the knowledge can hardly be extracted. The most straightforward technique making the knowledge accessible is to visualize the data vectors as points in a plot. The problem is the dimensionality of the data set. The scientific discipline dealing with multivariate data is called Information Visualization and it has become very popular. One area of Information Visualization is Visual Data Mining (VDM). The task of VDM [108, 12] is to let the user explore data in the multidimensional system space to find relationships among the system attributes. VDM utilizes several methods to display the data in a form that is easier to comprehend. Here, evolutionary computation can also be employed to display data [24]. Plots and scenes are the most effective views for the exploration and the knowledge extraction process. There are several methodologies how these plots or scenes can be rendered. The common visualization techniques for VDM are:
• Scatterplots
• Parallel coordinates
• Grand tour techniques
• Pixel mapping
• Relationship-based visualization ([99])

Real data sets describing complex systems have many more than three features. Each feature adds one degree of freedom (one dimension) to the data space. It is hard for a human to cope with more than a three-dimensional space. Therefore a data set with more than three attributes has to be somehow mapped or projected into two or three dimensions. Techniques that can be used for multidimensional data visualization are listed below:

• Scatterplot matrix
• Parallel coordinate plots (see [44])
• 3-D stereoscopic scatterplots
• Grand tour on all plot devices
• Circle Segments (see [14])
• Density plots
• Linked views
• Saturation brushing
• Pruning and cropping

Figure 2.2: Pixel oriented scatterplot for variables relationship analysis (PRICE versus PROMOTION)
All these techniques allow studying complex systems utilizing the information carried by the location of data vectors. The most convenient area of application for these methods is cluster analysis [13]. However, when one would like to study the relationship among input and output variables of the system, the yield of these methods is not very high. Figure 2.2 shows a pixel oriented scatterplot that can be used for variables relationship analysis. The relation between the price and promotion variables is visible, but we can only guess how the exact relationship looks in particular conditions. In this thesis we describe an approach to streamline the VDM process. Together with the projection of data vectors, we employ the GAME engine to construct an ensemble of inductive models from these data vectors. The behavior of these models is expressed by the background color in existing scatterplots (see Figures 5.8 and 5.19). The behavior of models enriches pixel oriented scatterplots, so the knowledge extraction is faster and more efficient. In our approach, the advantages of inductive models (generalization, dimensionality reduction, error resistance, complex relationship expression, etc.) can be exploited. By employing the ensemble of inductive models we are also able to automatically locate interesting areas of the multidimensional space of system variables (see Section 5.2.8). This can also be used as the projection pursuit for the Data Driven Guided Tours [15]. A very popular method of multidimensional data projection is the Self Organizing Map (SOM) [54].
2.1.3 Credibility estimation
Figure 2.3: Some climatic models also use ensembling to estimate the uncertainty of the prediction.

In environmental science, there are attempts to estimate the credibility of models. They combine several models of different kinds and plot the empirical probability of their predictions (Figure 2.3). Another possibility is to test models for sensitivity to their initial conditions and parameters. Several models with slightly different values can be produced and the credibility estimated (for further information see [97]). These techniques have a tight connection to the theory of ensembles discussed below. The credibility estimation for GAME models and classifiers (Section 5.2.6) is also based on similar ideas; a minimal sketch of the idea is given below.
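The same idea in a minimal form: the dispersion of the responses of ensemble members for a given input can serve as an (inverse) credibility estimate. The Model interface below is a hypothetical placeholder, not a class of the GAME engine.

```java
import java.util.List;

/** Illustrative credibility estimate: the larger the dispersion of ensemble members' responses
 *  for a given input, the less credible the ensemble output. The Model interface is hypothetical. */
public class EnsembleCredibility {
    public interface Model { double getOutput(double[] input); }

    /** Returns the standard deviation of member responses; a small value signals a credible region. */
    public static double dispersion(List<Model> ensemble, double[] input) {
        double mean = 0.0;
        for (Model m : ensemble) mean += m.getOutput(input);
        mean /= ensemble.size();
        double var = 0.0;
        for (Model m : ensemble) {
            double diff = m.getOutput(input) - mean;
            var += diff * diff;
        }
        return Math.sqrt(var / ensemble.size());
    }
}
```

A small dispersion indicates a region well covered by the training data, whereas a large dispersion indicates that the ensemble output should not be trusted there.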
2.1.4 Feature ranking
When modeling a real-world system, it is necessary to preselect a set of features from the available information that may have an impact on the behavior of the system. By recording these features in particular cases, a data set suitable for modelling can be produced [53]. The goal of feature selection is to avoid selecting more or fewer variables than necessary. In practical applications, it is impossible to obtain a complete set of relevant variables. Siedlecki and Sklansky [91] used genetic algorithms for feature selection by encoding the initial set of n variables into a chromosome, where 1 and 0 represent the presence and absence, respectively, of a variable in the final subset. They used the classification accuracy as the fitness function and obtained good neural network results (a minimal sketch of this encoding is given below). These methods are usually used as a data preprocessing tool for further system modeling and classification by means of neural networks. Some neural networks are very sensitive to the presence of irrelevant features in a data set. The GAME engine is, on the other hand, designed to deal with irrelevant features (by ignoring them). Some features can be relevant just in a small subspace of the state space of all input features. Later in this thesis, we present three algorithms ranking features according to their significance (see Sections 5.2.2, 5.2.2.1 and 5.2.8.5).
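A minimal sketch of the chromosome encoding and fitness evaluation used in such GA-based feature selection follows. The classifier is passed in as a placeholder function returning classification accuracy; this is an illustration of the encoding, not the implementation from [91].

```java
import java.util.function.ToDoubleFunction;

/** Illustrative fitness evaluation for GA-based feature selection: a chromosome is a binary mask
 *  over the n input variables and its fitness is the accuracy of a classifier trained on the
 *  selected subset. The classifier accuracy function is a placeholder. */
public class FeatureSelectionFitness {
    public static double fitness(boolean[] chromosome, double[][] data,
                                 ToDoubleFunction<double[][]> classifierAccuracy) {
        int selected = 0;
        for (boolean gene : chromosome) if (gene) selected++;
        double[][] reduced = new double[data.length][selected];
        for (int row = 0; row < data.length; row++) {
            int k = 0;
            for (int col = 0; col < chromosome.length; col++) {
                if (chromosome[col]) reduced[row][k++] = data[row][col];  // keep features with gene = 1
            }
        }
        // Fitness = classification accuracy obtained with the selected feature subset.
        return classifierAccuracy.applyAsDouble(reduced);
    }
}
```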
2.2 The theory related to GAME
This section describes the core algorithm for building the ensemble of inductive models - the GAME engine. The models are subsequently used in the FAKE GAME knowledge extraction process. Firstly, the theory of inductive modeling has to be explained. It is necessary for the reader of this thesis to understand where the GAME method comes from; therefore, related inductive algorithms will be described in a broader scope.

2.2.1 Inductive modeling and the GMDH
Inductive modeling uses machine learning techniques to derive models from data. Deductive modeling, on the other hand, uses domain expertise to derive a mathematical model of a system. There have been lots of discussions between supporters of both approaches about which approach is better. The answer is: "There is enough data for both of them, but only inductive modeling can be applied massively with limited human resources."

2.2.1.1 The philosophy behind inductive modeling
The capability of induction is fundamental for human thinking. It is the next human ability that can be utilized in soft computing, besides learning and generalization. Induction means gathering small pieces of information, combining them and using the already collected information at a higher abstraction level to get a complex overview of the studied object or process. Inductive modeling methods utilize the process of induction to construct models of studied systems. The construction process is highly efficient: it starts from a minimal form and the model grows according to the system complexity. It also works well for systems with many inputs. Where the traditional modeling methods fail due to the "curse of dimensionality" phenomenon, the inductive methods are capable of building reliable models. The problem is decomposed into small subtasks. At first, the information from the most important inputs is analyzed in a subspace of low dimensionality; later the abstracted information is combined to get a global knowledge of the relationship of the system variables.

Figure 2.4: The original MIA GMDH network: layers ("populations") of two-input units with the polynomial transfer function y = a·i + b·j + c·i·j + d·i² + e·j² + f connect the input variables (features) to the output variable.
Group method of data handling (GMDH)
There are several methods for inductive models construction commonly known as Group Method of Data Handling (GMDH) introduced by Ukrainian scientist Ivachknenko in 1966 [47, 31, 63]. The GMDH theory or polynomial networks are called Statistical Learning Networks [17] in the United States of America. They were developed more or less independently. Their disadvantage is that they do not use the external regularization criterion and therefore can overfit noisy data. The GAME engine presented in this thesis was inspired by the Multilayered Iterative Algorithm (MIA GMDH). It uses a data set to construct a model of a complex system. The model is represented by a network (see Figure 2.4). Layers of units transfer input signals to the output of the network. The coefficients of units transfer functions are estimated using the data set describing the modeled system. Networks are constructed layer by layer during the learning stage. The original MIA algorithm works as follows. First initial population of units with given polynomial transfer function is generated. Units have two inputs and therefore all pair-wise combinations of input variables are employed. Then coefficients of unit’s transfer functions are estimated using stepwise regression or any other optimization method. Units are sorted by their error of output variable modeling. Few of the best performing units are selected and function as inputs for next layer. Next layers are generated identically until the error of modeling decreases. Which units are performing best and therefore should survive in a layer is decided using an external criterion [46] of regularity (CR). There are several possible criteria applicable. The most popular is the criterion of regularity
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
13
based on the validation using an external data set: AB =
NB 1 X (yi (A) − di )2 → min NB i=1
(2.1)
where the yi (A) is the output of a GMDH model trained on the A data set. Additional criterion to discriminate units that will be deleted is the variation accuracy criterion (VAC) [19, 48] PN (yi − di )2 2 δ = Pi=1 → min (2.2) N ¯2 (di − d) i=1
where the yi is the output of a GMDH model, di is the target variable and d¯ is the mean of the target variable. With δ 2 < 0.5 the model is good and when δ 2 > 1, the modeling failed (the output unit should be deleted). The proper regularization [18] is of crucial importance in the GMDH theory. More information on the GMDH topic can be found in the book of A. Muller and F. Lemke [69] or on the GMDH website [1]. 2.2.1.3
The state of the art in the GMDH related research
Polynomial Neural Networks (PNN) [77] are also GMDH type networks. The units called partial descriptions having different transfer function of polynomial type [75] are evolved by a genetic algorithm (GA). In [76] the structural optimization of the fuzzy polynomial neural network (FPNN) is realized via standard GA whereas in the case of the parametric optimization a standard least square method based learning is used. The article [70] uses GA to optimize the structure of original MIA GMDH neural network whereas the coefficients are solved by the Singular Value Decomposition (SVD) method. The hybrid architecture of the network is employed in the polynomial harmonic GMDH (phGMDH) [72], where the harmonic inputs are passed to a polynomial network whose architecture is built using the MIA GMDH algorithm. In this thesis, we do not limit the use of harmonic functions to the input layer of the GMDH network. Any neuron in the GAME network can have harmonic transfer function. A novel algorithm based on GMDH for designing MLP neural networks can be found in [34]. This idea is similar to the one presented in [85], where Cascade Correlation Networks are enhanced by the GMDH. In [72], the iterative gradient descent training algorithm is offered for improving the performance of polynomial neural networks. The Back-Propagation algorithm is derived for multilayered networks with polynomial activation functions. We believe that if ”powerful enough” optimization techniques are used during the construction stage, it is not necessary to readjust parameters after the polynomial network is built. This readjustment of parameters might suggest they were not set optimally. The AIC and PSS criterion are used in revised GMDH-type algorithm [56] to find the optimal number of neurons and layers of the GMDH networks. Such regularization takes into account just the complexity of GMDH network. Outputs from neurons in a layer can be highly
14
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
correlated resulting to a redundant GMDH network. We propose a different approach in the Section 4.6.1 of this thesis. The recent article [9] introduces a GMDH-based feature ranking and selection algorithm. This algorithm builds GMDH networks of gradual complexity, rewarding features selected by smaller networks. In this thesis we propose three different algorithms of feature ranking that can also supply the proportional significance of features. 2.2.2
Neural networks
Neural networks are closely connected to the GMDH theory, although they have different background. The GMDH evolved from the mathematical description of a system by means of Kolmogorov-Gabor polynomial [63]. Neural networks were at the beginning biologically oriented. Later, powerful optimization methods for neural networks (Back-Propagation of error) were invented, allowing to build multilayered networks of neurons (MLP) capable of solving nonlinear problems. It has been shown [59] that MLPs are equivalent to mathematical description of a system by means of the Koglomorov theorem1 . In recent time both neural networks and GMDH algorithms are optimized by genetic algorithms and it is even harder to distinguish the boundary between both theories. Of course, some neural networks are very different from GMDH (recurrent, modular, spiking neural networks, etc.) [57]. The article [66] shows that the problem of two intertwined spirals can be successfully solved by the MultiLayered Perceptron (MLP), where weights are evolved by the Genetic Algorithm. Number of function evaluation is in this case much higher than when using standard BackPropagation, because the information about gradient of error is not utilized in the GA. On the other hand, in some applications, Genetic approach gives better results than BackPropagation [89]. The Cascade Correlation algorithm [30] is capable of solving extremely difficult problems. It performs optimally on ”spiral” benchmarking problem (a network consisting of less than 20 neurons is generated). According to experiments on real-world data performed in [109], the algorithm has difficulties with avoiding premature convergence to complex topological structures. The main advantage of the Cascade Correlation algorithm is also its main disadvantage. It easily solves extremely difficult problems therefore it is likely to overfit. Also the GAME engine is able to generate models solving ”spiral” problem only when the build-in validation and regularization mechanisms are disabled. The recent article [85] proposes an improvement of the Cascade Correlation Algorithm [30]. The original algorithm assumes fully connected network. Each neuron is connected to all features and all previously built neurons. The improvement called Evolving Cascade Correlation Networks (ECCN) [85] uses techniques from GMDH theory [47] to choose just relevant inputs for each neuron. Cascade networks evolved by ECCN overfit data less than fully connected cascade networks. Recently, very interesting algorithm for designing recurrent neural networks was proposed in [94, 95]. The NeuroEvolution Through Augmenting Topologies (NEAT) is designed for solving reinforcement learning tasks [93], but can be applied also to supervised learning prob1
although inner functions are very complex and they have almost fractal character
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
15
lems. Similarly to the GMDH, NEAT networks grow from a minimal structure up to optimal complexity. The topology and also weights of the NEAT networks are evolved using niching genetic algorithm [65]. We have applied the NEAT to Two intertwined spiral problem [51], but it failed. The reason why NEAT is unable to evolve successful networks solving the spiral problem is probably that a) the chromosome is too big when evolving architecture and weights of network simultaneously, b) niching alone is unable to protect complex structures. 2.2.3
Optimization methods
The understanding of optimization methods is crucial in both mathematical and machine learning modeling. To build successful models, it is necessary to adjust their parameters. The GAME engine uses optimization methods to adjust weights and coefficients of units. Optimization of weights and coefficients leads to the following problem of nonlinear programming: min f (~x), ~x ∈ ℜn , (2.3) where f (~x) is differentiable function of vector (weights or coefficients) defined in ℜn . From the initial value of ~x0 the sequence of elements ~xk+1 = ~xk + αk d~k
(2.4)
computed when iterating k ∈ 1, 2, 3, ...,, where αk is the step length. To be able to achieve the convergence, we need to find the proper direction d~k of the search in a state space of all possible coefficients values. The easiest way how to find this direction is to use some gradient method of steepest descent d~k = −∇f (~xk ),
∇f (~xk ) =
µ
¶T
∂ ∂ f (~x1 ), ..., f (~xn ) ∂~x1 ∂~xn
.
(2.5)
Theoretical analysis proved that the ”first order” gradient methods are not very effective. Especially when the function f (~x) is simple, many iterations are needed to find the optimal solution. The more effective ”second order” methods compute (or estimate) second derivations of f (~x) revealing more information about extremes of the error function and they often provide faster convergence to the optimal solution. 2.2.3.1
Quasi-Newton method (QN)
The most popular second order optimization method of nonlinear programming is QuasiNewton method [88, 86]. It computes the search directions d~k from the equation 2.4 as d~k = −∇2 f (~xk )−1 ∇f (~xk )
(2.6)
so we can expect quadratic convergence of the method and that is perfect. On the other hand to compute second derivations of the function f is both computationally expensive and inaccurate. Better approach is to make a compromise and use first derivations of f to compute
16
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
the search direction d~ more precisely. This can be realized e.g. by the following formula d~k = −Hk ∇f (~xk ),
Hk =
Ã
!
∂2 f (~x) ∂~xi ∂~xj
,
(2.7)
i,j=1,...,n
where Hk ∈ ℜn×n is so called Hessian matrix. This matrix can be computed from the first derivations in case that all the following assumptions are fulfilled: 1. Hk must be positively definite given that we start H0 which is positively definite. 2. Hk fulfill so called quasi-Newton condition: Hk+1 (∇f (~xk+1 ) − ∇f (~xk )) = ~xk+1 − ~xk . 3. Hk+1 can be computed from Hk , ~xk+1 , ~xk , ∇f (~xk+1 ), ∇f (~xk ) in the following way: Hk+1 = Hk + βk ~uk ~uTk + γk~vk~vkT , where βk , γk ∈ ℜ, ~uk , ~vk ∈ ℜn . There exists several formulas fulfilling these conditions. Very popular is Davidon-FletcherPowell (DFP) formulae [83]: Hk+1 = Hk +
Hk ~qk ~qkT Hk p~k p~Tk + , p~Tk ~qk ~qkT Hk ~qk
(2.8)
where ~qk = ∇f (~xk+1 ) − ∇f (~xk ), p~k = ~xk+1 − ~xk . Another popular approach is to construct the approximate Hessian matrix by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [16]. In our application, ~x is the vector of weights or coefficients2 of a unit in the GAME network we are optimizing. The function f (~x) is the error of the GAME unit on the training data and therefore has to be minimized. By computing gradients and Hessian matrix of the function f for each learning iteration, we get the optimal direction d~ in the state space of coefficients ~x. From the initial setting of coefficients ~x0 with probably large error on training data f (~x0 ), we change the coefficients ~xk of the GAME unit in the direction d~k that is computed from the equation 2.7. After several steps (learning iterations), the error of the GAME unit would be much lower than the initial error (f (~xn ) ≪ f (~x0 )). Number of learning iterations needed to find an optimal solution ~xn depends on the complexity of data set and on the transfer function of the GAME unit. 2.2.3.2
Conjugate gradient method (CG)
The Conjugate gradient method [107] is a non-linear iterative method for solving Ax = b, where A is an n × n matrix, AT = A > 0, and b ∈ Rn is given. The pseudocode of the CG algorithm is given bellow. 2
~x.
Note that we use different notation. Coefficients of units are labeled as ~a. The vector of inputs is labeled
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
17
Given x0 , generate xk , k = 1, 2, . . . by: r0 = b − Ax0 , p0 = r0 . For k = 0, 1, 2, . . . : If p0 6= 0, then: ak = (rkT rk )/(pTk Apk ) xk+1 = xk + ak pk rk+1 = rk − ak Apk T r T bk = (rk+1 k+1 )/(rk rk ) pk+1 = rk+1 + bk pk . End If. End For. End. For detailed explanation of CG algorithm principles see [90]. 2.2.3.3
Orthogonal Search (OS)
The Orthogonal Search (OS) optimizes multivariate problem by selecting one dimension at a time, minimizing the error at each step. The OS can be used [11] to train single layered neural networks. We use minimization of a real-valued function of several variables without using gradient, optimizing variables one by one. The Stochastic Orthogonal Search differs from OS just by random selection of variables. 2.2.3.4
Genetic algorithms (GA)
The Genetic Algorithms (GA) [36, 41] are inspired by Darwin’s theory of evolution. Population of individuals are evolved according simple rules of evolution. Each individual has a fitness that is computed from its genetic information. Individuals are crossed and mutated by genetic operators and the most fit individuals are selected to survive. After several generations the mean fitness of individuals is maximized. The GA can be used as optimization method e.g. for learning of neural structures or to setup weights and architecture of ANN. The Inductive Genetic Programming (iGP) applied to construct Multivariate Trees as described in [73] has a tight connection to the topic of this thesis (although it uses a different terminology). Multivariate Trees are in fact inductive polynomial models similar to those generated by the Multilayered Iterative Algorithm GMDH. They use second order polynomials (programs) to construct the final model. The topology of models (trees, with polynomials in the internal nodes, features xi in the leaves and dependent variable y as the root) is evolved by Genetic Algorithm. Later, the niching version of the genetic algorithm [71] proved to evolve better preforming models. This findings fully correspond to results obtained by [95] and also to the ones presented in this thesis (see Section 4.6.1). Unfortunately, just maintaining diversity by niching is not enough to preserve more complex topologies that are able to solve very hard problems such as two intertwined spirals [51]. Our experiments showed [26] that NEAT is unable to evolve models capable of solving spiral problem. In this case the evolution layer by layer as implemented in GAME (see Section
18
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
4.6) enables to evolve topologies complex enough. The problem with NEAT is that it evolves topology and weights simultaneously. The dimensionality of chromosomes in the Topology and Weight Evolving Artificial Neural Networks(TWEANN) [94] is growing fast with increasing complexity of networks. Solving the spiral problem involves very complex network with extremely long chromosome. The ”curse of dimensionality” phenomenon prevents TWEANN networks from solving the spiral problem.
2.2.3.5
Niching methods in the evolutionary computation
Niching methods [65] extend genetic algorithms to domains that require the location of multiple solutions. They promote the formation and maintenance of stable subpopulations in genetic algorithms (GAs). Niching method Fitness Sharing was recently used to maintain diverse individuals when evolving neural networks [95]. Similar niching method is the Deterministic Crowding (DC) [64]. The big advantage of this method is that it does not need any extra parameters as many of others do. The basic idea of deterministic crowding is that offspring is often most similar to parents. We replace the parent who is most similar to the offspring with higher fitness. DC works as follows. First it groups all population elements into n=2 pairs. Then it crosses all pairs and mutates the offsprings. Each offspring competes against one of the parents that produced it. For each pair of offspring, two sets of parent-child tournaments are possible. DC holds the set of tournaments that forces the most similar elements to compete. Similarity can be measured using either genotypic or phenotypic distances. The pseudocode of simple genetic algorithm and for the niching GA with deterministic crowding can be compared bellow. Genetic algorithm (no niching) Generate initial population of n individuals repeat for m generations repeat n/2 times Select two fit individuals p1 ,p2 Cross them, yielding c1 ,c2 Apply mutation, yielding c′1 ,c′2 if f (c′1 ) > f (p1 ) then replace p1 with c′1 if f (c′2 ) > f (p2 ) then replace p2 with c′2 end end
Niching GA (Deterministic Crowding) Generate initial population of n individuals repeat for m generations repeat n/2 times Select two individuals p1 ,p2 randomly Cross them, yielding c1 ,c2 Apply mutation, yielding c′1 ,c′2 if [d(p1 , c′1 )+d(p2 , c′2 )] ≤ [d(p1 , c′2 )+d(p2 , c′1 )] if f (c′1 ) > f (p1 ) then replace p1 with c′1 if f (c′2 ) > f (p2 ) then replace p2 with c′2 else if f (c′2 ) > f (p1 ) then replace p1 with c′2 if f (c′1 ) > f (p2 ) then replace p2 with c′1 end end
The distance of individuals e.g. d(p1 , c′1 ) can be based on their phenotypic or genotypic difference. In case of neurons the difference can be computed from connection of their inputs or from difference of their weights.
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
19
There exist several other niching strategies such as islands, restrictive competition, semantic niching [50], etc. 2.2.3.6
Differential Evolution (DE)
The Differential Evolution (DE) [98] is a genetic algorithm with special crossover scheme. It adds the weighted difference between two individuals to a third individual. For each individual in the population, an offspring is created using the weighted difference of parent solutions. The offspring replaces the parent in case it is fitter. Otherwise, the parent survives and is copied to the next generation. The pseudocode how offsprings are create can be found e.g. in [106]. 2.2.3.7
Simplified Atavistic Differential Evolution (SADE)
The Simplified Atavistic Differential Evolution (SADE) algorithm [42] is a genetic algorithm improved by one crossover operator taken from differential evolution. It also prevents premature convergence by using so called radiation fields. These fields have increased probability of mutation and they are placed to local minima of the energy function. When individuals reach a radiation field, they are very likely to be strongly mutated. At the same time, the diameter of the radiation field is decreased. The global minimum of the energy is found when the diameter of some radiation field descend to zero. This algorithm was also applied to optimize weights of neural network [27]. 2.2.3.8
Particle swarm optimization (PSO)
The entry in Wikipedia, free encyclopedia defines basic, canonical PSO algorithm as follows. Let f : Rm → R be the objective function. Let there be n particles, each with associated positions xi ∈ Rm and velocities vi ∈ Rm , i = 1, . . . , n. ˆ i be the current best position of each particle and let g ˆ be the global best. Let x Initialize xi and vi for all i. One common choice is to take xij ∈ U [aj , bj ] and vi = 0 For all i and j = 1, . . . , m, where aj , bj are limits of the search domain in each dimension. ˆ i ← xi , i = 1, . . . , n. x ˆ to the position with the smallest objective value. Set g While not converged: For 1 ≤ i ≤ n: xi ← xi + v i . vi ← ωvi + c1 r1 (ˆ xi − xi ) + c2 r2 (ˆ g − xi ). ˆ i ← xi . If f (xi ) < f (ˆ xi ), x ˆ ← xi . If f (xi ) < f (ˆ g), g End For. End While. End For.
20
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
Note the following about the above algorithm: • ω is an inertial constant. Good values are usually slightly less than 1, • c1 and c2 are constants that say how much the particle is directed towards good positions. They represent a ”cognitive” and a ”social” component, respectively. Usually, we take c1 , c2 ≈ 1, • usually, r1 , r2 ∈ U [0, 1]. By studying this algorithm, we see that we are essentially carrying out something like a discrete-time simulation where each iteration of it represents a ”tic” of time. The particles ”communicate” information they find about each other by updating their velocities in terms of local and global bests; when a new best is found, the particles will change their positions accordingly so that the new information is ”broadcast” to the swarm. The particles are always drawn back both to their own personal best positions and also to the best position of the entire swarm. They also have stochastic exploration capability via the use of the random constants r1 , r2 . 2.2.3.9
Ant colony optimization (ACO)
The Ant colony optimization (ACO) algorithm is primary used for discrete problems (e.g. Traveling Salesman Problem, packet routing). Recently, several approaches have been proposed to extend the application of this algorithm for continuous space problems [103, 104]. We have so far implemented two of them. The first algorithm Continuous Ant colony optimization (CACO) was proposed in [21]. It works as follows. There is an ant nest in a center of a search space. Ants exits the nest in a direction given by quantity of pheromone. When an ant reaches the position of the best ant in the direction, it moves randomly (the step is limited by decreasing diameter of search. If the ant find better solution, it increases the quantity of pheromone in the direction of search [58]. The second algorithm is similarly called Ant Colony Optimization for Continuous Spaces (ACO*) [20]. It was designed for the training of feed forward neural networks. Each ant represents a point in the search space. The position of new ants is computed from the distribution of existing ants in the state space [21, 92]. 2.2.3.10
Hybrid of the GA and the particle swarm optimization (HGAPSO)
The HGAPSO algorithm was proposed in [49] and it is based on the following ideas. Since PSO and GA both work with a population of solutions, combining the searching abilities of both methods seems to be a good approach. Originally, PSO works based on social adaptation of knowledge, and all individuals are considered to be of the same generation. On the contrary, GA works based on evolution from generation to generation, so the changes of individuals in a single generation are not considered.
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
21
Input variables
Model 1
Model 2
Model 3
Output variable
…
Model N
Output variable combination Ensemble output
Figure 2.5: Models are trained for the same task and then combined
In the reproduction and crossover operation of GAs, individuals are reproduced or selected as parents directly to the next generation without any enhancement. However, in nature, individuals will grow up and become more suitable to the environment before producing offspring. To incorporate this phenomenon into GA, PSO is adopted to enhance the topranking individuals on each generation. Then, these enhanced individuals are reproduced and selected as parents for crossover operation. Offsprings produced by the enhanced individuals are expected to perform better than some of those in original population, and the poorperformed individuals will be weeded out from generation to generation. For detailed description of the HGAPSO algorithm see [49].
2.2.4
Ensemble methods
Ensemble techniques [22] are based on the idea that a collection of a finite number of models (eg. neural networks) is trained for the same task. Neural network ensemble [111] is a learning paradigm where a collection of a finite number of neural networks is trained for the same task. It originates from Hansen and Salamons work [39], which shows that the generalization ability of a neural network system can be significantly improved through ensembling a number of neural networks, i.e., training many neural networks and then combining their predictions. Since this technology behaves remarkably well, recently it has become a very hot topic in both neural networks and machine learning communities, and has already been successfully applied to diversified areas such as face recognition, optical character recognition, scientific image analysis, medical diagnosis, seismic signals classification, etc. In general, a neural network ensemble is constructed in two steps, i.e., training a number of component neural networks and then combining the component predictions (see Figure 2.5). As for training component neural networks, the most prevailing approaches are Bagging and Boosting. Boosting generates a series of component neural networks whose training sets are determined by the performance of former ones. Training instances that are wrongly predicted by former networks will play more important roles in the training of later networks. For detailed description of the boosting approach see [33].
22
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
Sampling with replacement
Training data
Sample 1
Learning algorithm
Model 1 (classifier)
Sample 2
Learning algorithm
Model 2 (classifier)
.. .
.. .
.. .
Sample M
Learning algorithm
Model M (classifier)
Averaging or voting
Ensemble model (classifier)
output
Figure 2.6: The Bagging scheme: Models are constructed on bootstrapped samples 2.2.4.1
Bagging
Bagging is based on bootstrap sampling [45]. It generates several training sets from the original training set and then trains a component neural network from each of those training sets. Not every ensemble of models gives more accurate prediction. When e.g. identical models are combined, we cannot achieve any improvement. Models in the ensemble must be diverse - they must exhibit diverse errors [22]. Bagging introduces diversity by varying the training data. 2.2.4.2
Bias-variance decomposition
The theoretical tool to study how a training data affects the performance of models is Biasvariance decomposition. It decomposes the error of the ensemble to the bias and variance part [22]. Bias is the part of the expected error caused by the fact the model is not perfect, where variance is the part of the expected error due to the nature of the training set. The proof of the following formal definition can be found in [22]. ET {(f − d)2 } = ET {(f − ET {f })2 } + (ET {f } − d)2
(2.9)
where ET is the Expectation operator in the decomposition, with respect to all possible training sets of fixed size N , and all possible parameter initializations d is the target value of a testing data point3 . During the training, the part of the error caused by model bias decreases, as the model approaches the training data. At the same time the part of error caused by variance increased due to the overfitting phenomena. Bagging reduces mainly variance (see Figure 2.8), where Boosting reduces mainly the bias part (training vectors that are wrongly modeled by former models play more important roles in the training of later models). 3
The noise level of zero assumed, in the case of a non-zero noise component, d in the decomposition would be replaced by its expected value hdi, and a constant (irreducible) term σe2 would be added, representing the variance of the noise.
Error of the model
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
23
Optimum
Bias
Variance
Training time
Figure 2.7: The training should be stopped in the minimum of generalization error. Training time can be replaced by model complexity as well.
a)
b) training data model 1
model 1 model 2 ensemble model ensemble model
training data model 2
Figure 2.8: Ensemble methods reduce variance a) and bias b) giving the resulting model better generalization abilities, thus improving accuracy on testing data.
2.2.4.3
Simple ensemble and weighted ensemble
A single output can be created from a set of model outputs via simple averaging, or by means of a weighted average that takes account of the relative accuracy of the models to be combined [37]. The output Φm (x) of the weighted ensemble is given according to the equation bellow.
Φm (x) =
m X i=1
wi fi (x),
e−αei wi = P −αej je
(2.10)
Where f n (x) are outputs of member models given that the vector x is provided to their inputs. Weights wi are determined according to the performance ei of individual models on the training and validation set.
24
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
2.2.4.4
Ensembles - state of the art
There are concerns about complexity of the final solution. Building ensemble of models is in the contradiction to the Occam’s razor [102], because the model should be kept as simple as possible, when the accuracy is comparable. The study [29] shows that in some cases, the output function of the ensemble is in fact simpler than the output function of individual models (although it has much more parameters). Therefore from this point of view the ensemble of models is in the accordance with Occam’s razor. Error-Correcting Output Codes (ECOC) according to [67] improves the accuracy of ensemble classifiers. This method uses a different output coding for each base classifier. Instead of one classifier with k outputs in a k-class problem (one-per-class coding), b classifiers with only one output are used, each classifier only deciding between two super-classes that partition the k classes. The output of the single base classifiers can then be interpreted as bits in a codeword of length b, transmitting the class of the classified pattern. In [40] we can find similar result to the one obtained in this section. If the base classifiers are very accurate or the number of input features is too large, Adaboost.ECC cannot improve the classification performance on the test patterns compared to ECOC classifiers of comparable size and complexity. Only if the number of input features is small and few hidden neurons are used some improvements can be achieved. In [10] MLP neural network committees proved to be superior to individual MLP networks. The committee of GMDH networks generated by the Abductory Inductive Mechanism (AIM) gave better performance just when individual networks varied in complexity - a result similar to the conclusion in this thesis (see Section 4.8).
2.3
Previous results and related work
Our previous method ModGMDH incorporated some improvements (heterogeneous units, interlayer connections, growing complexity, etc.) of the MIA GMDH. It proved to generate more accurate models on various data sets. We published these results in the paper [A.15] (conference on Inductive Modeling ICIM 2002 in Lviv). Bellow we present the most important results from the paper. We designed an experiment and evaluated results of the original and the Modified GMDH method showing the advantages of the ModGMDH topology. For the purpose of the experiment we introduced more methods with partial modifications of the structure to explore the effects of particular modifications of the original method. We run the experiment with the following methods: • Original version of GMDH method (OGMDH) • Extended Original GMDH method (EOGMDH) - homogenous network with polynomial transfer function and extra inputs in units • Perceptron GMDH method (HPGMDH) - homogenous network with units of perceptron type and extra inputs
SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART
25
Figure 2.9: The comparison of GMDH methods with various levels of modification. • Modified GMDH method (MGMDH) - heterogeneous units of growing complexity (all modifications) Figure 2.9 compares the root mean square error (RMS) of models for full data set. The error was averaged for three runs. We can see that the response of models constructed by the original GMDH method is the worst for all data sets, whereas the modified GMDH method gives models with the best results. The EOGMDH method differs from OGMDH just in using extra inputs. This modification generally improves approximation attributes of the original GMDH method, according to results showed in Figure 2.9. Methods HPGMDH and EOGMDH differ in the type of unit, both are homogenous and using extra inputs. The HPGMDH gives better results than EOGMDH on the Mandarin data set but for the other data sets it is exactly the opposite. It seems that perceptron units are suitable for higher density of data in the input space (mandarin data set), whereas units with polynomial transfer function give better result on thin data sets. The idea of data-dependent approximation ability is supported by the fact that surviving units in the network constructed by MGMDH (initial population of a layer contains units of all types) on the Mandarin data set are almost just of the perceptron type. On the Artificial data set there is a majority of units with polynomial transfer function. When developing networks by MGMDH on the ”thinnest” Nuclear data set, the winning units are mostly that with polynomial or linear transfer function. The presence of heterogeneous units in MGMDH networks has positive influence on the stability of the learning process and the function approximation. This initial experiment opened the door. We entered and followed the way for three years. In this thesis we present what we have found on the way.
26SECTION 3. OVERVIEW OF OUR APPROACH - THE FAKE GAME FRAMEWORK
3 Overview of our approach - the FAKE GAME framework Knowledge discovery and data mining are popular research topics in recent times. It is mainly due to the fact that the amount of collected data significantly increases. Manual analysis of all data is no longer possible. This is where the data mining and the knowledge discovery (or extraction) can help. The process of knowledge discovery [32] is defined as the non-trivial process of finding valid, potentially useful, and ultimately understandable patterns. The problem is that this process still needs a lot of human involvement in all its phases in order to extract some useful knowledge. This thesis focuses on methods aimed at significant reduction of expert decisions needed during the process of knowledge extraction. Within the FAKE GAME environment we develop methods for automatic data preprocessing, adaptive data mining and for the knowledge extraction (see Figure 3.1). The data preprocessing is very important and time consuming
KNOWLEDGE EXTRACTION and INFORMATION VISUALIZATION
INPUT DATA
AUTOMATED DATA PREPROCESSING
FAKE INTERFACE
KNOWLEDGE
AUTOMATED DATA MINING
GAME ENGINE
Figure 3.1: FAKE GAME environment for the automated knowledge extraction. phase of the knowledge extraction process. According to [80] it accounts for almost 60% of total time of the process. The data preprocessing involves dealing with non-numeric variables (alpha values coding), missing values replacement (imputing), outlier detection, noise reduction, variables redistribution, etc. The data preprocessing phase cannot be fully automated for every possible data set. Each data have unique character and each data mining method requires different preprocessing. Existing data mining software packages support just very simple methods of data preprocessing [6]. There are new data mining environments [7, 4] trying to focus more on data preprocessing, but their methods are still very limited and give no hint which preprocessing would be the best for your data. It is mainly due to the fact that the theory of data preprocessing is not very developed. Although some preprocessing methods seem to be simple, to decide which method would be the most appropriate for some data might be very complicated.
SECTION 3. OVERVIEW OF OUR APPROACH - THE FAKE GAME FRAMEWORK27
Within the FAKE interface we develop more sophisticated methods for data preprocessing and we study which methods are most appropriate for particular data. The final goal is to automate the data preprocessing phase as much as possible. In the knowledge extraction process, the data preprocessing phase is followed by the phase of data mining. In the data mining phase, it is necessary to choose appropriate data mining method for your data and problem. The data mining method usually generates a predictive, regressive model or a classifier on your data. Each method is suitable for different task and different data. To select the best method for the task and the data, the user has to experiment with several methods, adjust parameters of these methods and often also estimate suitable topology (e.g. number of neurons in a neural network). This process is very time consuming and presumes strong expert knowledge of data mining methods by the user. In the new version of one commercial data mining software [96], an evolutionary algorithm is used to select the best data mining method with optimal parameters for actual data set and a problem specified. This is really significant step towards the automation of the data mining phase. We propose a different approach. The ensemble of predictive, regressive models or classifiers is generated automatically using the GAME engine. Models adapt to the character of a data set so that they have an optimal topology. We develop methods eliminating the need of parameters adjustment so that the GAME engine performs independently and optimally on bigger range of different data. The results of data mining methods can be more or less easily transformed into the knowledge, finalizing the knowledge extraction process. Results of methods such as simple decision tree are easy to interpret. Unfortunately majority of data mining methods (neural networks, etc.) are almost black boxes - the knowledge is hidden inside the model and it is difficult to extract it. Almost all data mining tools bound the knowledge extraction from complex data mining methods to statistical analysis of their performance. More knowledge can be extracted using the techniques of information visualization. Recently, some papers [105] on this topic had been published. In this thesis, we propose techniques based on methods such as scatterplot matrix, regression plots, multivariate data projection, etc. to extract additional useful knowledge from the ensemble of GAME models. We also develop evolutionary search methods to deal with the state space dimensionality and to find interesting projections automatically.
3.1
The goal of the FAKE GAME environment
The ultimate goal of our research is to automate the process of knowledge extraction from data. It is clear that some parts of the process still need the involvement of expert user. We build the FAKE GAME environment to limit the user involvement during the process of knowledge extraction. To automate the knowledge extraction process, we research in the following areas: data preprocessing, data mining, knowledge extraction and information visualization (see Figure 3.1).
28SECTION 3. OVERVIEW OF OUR APPROACH - THE FAKE GAME FRAMEWORK 3.1.1
Research of methods in the area of data preprocessing
In order to automate the data preprocessing phase, we develop more sophisticated methods for data preprocessing.We focus on data imputing (missing values replacement), that is in existing data mining environments [7, 4] realized by zero or mean value replacement although more sophisticated methods already exist [80]. We also developed a method for automate nonlinear redistribution of variables (see Section 5.1.2). Dealing with non-numeric variables (alpha values coding), outlier detection, noise reduction and other techniques of data preprocessing will be in the scope of our further research. This thesis is not focusing on data warehousing, because this process is very difficult to automate in general. It is very dependent on particular conditions (structure of databases, information system, etc.) We assume that source data are already collected cleansed and integrated (Figure 3.1). 3.1.2
Automated data mining
To automate the data mining phase, we develop an engine that is able to adapt itself to the character of data. This is necessary to eliminate the need of parameter tuning. The GAME engine autonomously generates the ensemble of predictive, regressive models or classifiers. Models adapt to the character of data set so that they have optimal topology. Unfortunately, the class of problems where the GAME engine performs optimally is still limited. To make the engine more versatile, we need to add more types of building blocks, more learning algorithms, improve the regularization criteria, etc. 3.1.3
Knowledge extraction and information visualization
To extract the knowledge from complex data mining models is very difficult task. Visualization techniques are promising way how to achieve it. Recently, some papers [105] on this topic had been published. In our case, we need to extract information from an ensemble of GAME inductive models. To do that we enriched methods such as scatterplot matrix, regression plots by the information about the behavior of models (see Section 5.2.3). For data with many features (input variables) we have to deal with curse of dimensionality. The state space is so big, that it is very difficult to find some interesting behavior (relationship of system variables) manually. For this purpose, we developed evolutionary search methods to find interesting projections automatically (see Section 5.2.8). Along with the basic research, we implement proposed methods in Java programming language and integrate it into the FAKE GAME environment [8] so we can directly test the performance of proposed methods, adjust their parameters, etc. Based on the research and experiments performed within this dissertation, we would like to develop an open source software FAKE GAME. This software should be able to automatically preprocess various data, to generate regressive, predictive models and classifiers (by means of GAME engine), to automatically identify interesting relationships in data (even in highdimensional ones) and to present discovered knowledge in a comprehensible form. The software should fill gaps which are not covered by existing open source data mining environments [6, 7].
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
29
4 The design of the GAME engine and related results The Group of Adaptive Models Evolution (GAME) is a data mining method. It can generate models for classification, prediction, identification or regression purposes (see Figure 4.1). It works with both continuous and discrete variables. The topology of GAME models adapts to the nature of a data set supplied. The GAME is highly resistant to irrelevant and redundant features, suitable for short and noisy data samples. The GAME engine further develops
REGRESSION MO DEL
OR
MO DEL
MO DEL
GAME
PREPROCESSED DATA
MO DEL MO DEL
CLASSIFICATION OR IDENTIFICATION OR
MO DEL
PREDICTION
Figure 4.1: Group of Adaptive Models Evolution (GAME)
the MIA GMDH algorithm [47] and is even more sophisticated than it’s predecessor the ModGMDH [A.9]. The GAME generates group of inductive models adapting themselves on data set character and on it’s complexity. An inductive model (network) grows as big as needed to solve a problem with sufficient accuracy. It consists of units (neurons) that have been most successful in modeling interrelationships within the data set. A GAME model (see Figure 4.2) has more degrees of freedom (units with more inputs, interlayer connections, transfer functions etc.) than MIA GMDH models. When a data set is multidimensional, it is impossible to search the huge state space of model’s possible topologies without any heuristic, as implemented in the ModGMDH. Therefore the GAME method incorporates niching genetic algorithm to evolve models of optimal structure. Improvements of the MIA GMDH are discussed bellow in more detailed form. Improvements of the GAME method over the original MIA GMDH:
30
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
MIA GMDH
GAME
input variables
P
P P
unified units
non-heuristic P search
P P
P
input variables
P P
2 inputs
P output variable
P
L
P diversified units
C
P interlayer connections
C
2 inputs G max
P
P L
genetic search
3 inputs C max 4 inputs max
output variable
Figure 4.2: The comparison: original MIA GMDH network and the GAME network Heterogeneous units Optimization of units Heterogeneous learning methods Structural innovations Regularization Genetic algorithm Niching methods Evolving units (active neurons) Ensemble of models generated
4.1
Several types of units compete to survive in GAME models. Efficient gradient based training algorithm developed for hybrid networks. Several optimization methods compete to build most successful units. Growth from a minimal form, interlayer connections, etc. Regularization criteria are employed to reduce a complexity of transfer functions. A heuristic construction of GAME models. Inputs of units are evolved. Diversity promoted to maintain less fit but more useful units. Units such as the CombiNeuron evolve their transfer functions. Ensemble improves accuracy, the credibility of models can be estimated.
Heterogeneous units
In MIA GMDH models, all units have the same polynomial transfer function. In the PNN [75] models, there are more types of polynomials used within single model. Our previous research showed that employing heterogeneous units within a model gives better results than when using units just of single type [A.15]. Hybrid models are often more accurate than homogeneous ones, even if the homogeneous model has an optimal transfer function
31
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
x1 x2 ... xn x1
x1
Linear (LinearNeuron)
Sin (SinusNeuron)
n y = an +1 ∗ sin an + 2 ∗ ∑ ai xi + an +3 + a0 i =1
x2
y = ∑ ai xi + an +1 n
...
i =1
xn x1
Gaussian (GaussianNeuron)
x2 ... xn x1 x2 ... xn x1
y = (1 + an +1 ) * e
( xi − ai )2 − i=1 (1+ an+2 )2
Logistic (SigmNeuron)
1
y=
n
1+ e
− ai xi
+ a0
Exponential (ExpNeuron)
m n y = ∑ ai ∏ x rj + a0 i =1 j =1
...
+ a0
xn
x1
Rational (PolyFractNeuron)
y=
x2 ...
an2 +2
∑ ai xi +∑∑ an*i+ j xi x j + an2 +1 n
i =1
xn
i =1
Polynomial (CombiNeuron)
x2
n
x1
... xn
y = an+2 * e
an+1* ai xi i =1
+ a0
+ a0
i =1 j =1
y = ∑ψ q 2 n +1
...
n
n
Universal (BPNetwork)
x2
x2
n
xn
q =1
(∑
n p =1
φ pq ( x p )
)
Figure 4.3: Units are building blocks of GAME models. Transfer functions of units can be combined in a single model (then we call it a hybrid model with heterogeneous units). The list of units includes some of units implemented in the FAKE GAME environment. appropriate for modeled system (see results in the Section 2.3). In GAME models, units within single model can have several types of transfer functions (Hybrid Inductive Model). Transfer functions can be linear, polynomial, logistic, exponential, gaussian, rational, perceptron network, etc. (see Table 4.3 and Figure 4.3). The motivation, why we implemented so many different units was following. Each problem or data set is unique. Our previous experiments showed (see Section 2.3) that for simple problems, models with simple units were superior. Whereas for complex problems winning models were these with units having complex transfer function. The best performance on all tested problems was achieved by models, where units were mixed.
4.1.1
Experiments with heterogeneous units
To prove our assumptions and to support our preliminary results [A.15], we designed and performed following experiments. We used several real world data sets of various complexity and noise levels. For each data set, we built ensembles of 10 models. Each ensemble had a different configuration. In homogeneous ensembles, there was just single type of unit allowed (eg. Exp stands for an ensemble of 10 models consisting of ExpNeuron units only). The ensemble of 10 heterogeneous inductive models, where all units are allowed to participate in the evolution stage of models is labeled all (all-simple and all-fast are configurations, where
32
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
Table 4.1: The summary of unit types appearing in GAME models Class of the unit LinearNeuron CombiNeuron CombiNeuron PolySimpleNeuron PolySimpleNRNeuron SigmNeuron ExpNeuron PolyFractNeuron SinusNeuron GaussNeuron GaussianNeuron MultiGauss BPNetwork NRBPNetwork
Abbrevation Linear Combi CombiR300 Polynomial PolyNR Sigm Exp Fract Sin Gauss Gaussian MultiGauss Perceptron PerceptNR
Transfer function Linear Polynomial Polynomial Polynomial Polynomial Sigmoid Exponential Rational Sinus Gaussian1 Gaussian2 Gaussian3 Universal Universal
Learning method - any method - any method - any method + R300 - any method - any method + GL5 - any method - any method - any method - any method - any method - any method - any method BackPropagation algorithm BP alg. + GL5 stop.crit.
all units with simple transfer function respectively all units with short learning time, were allowed). The first experiment was performed on the Building data set. This data set has three output variables. One of these variables is considerably noisy (Energy consumption) and the other two output variables have a low noise level. The results are consistent with this observation. The Combi and the Polynomial ensembles perform very well on variables with low noise levels, but for the third ”noisy” variable they overfitted the training data having huge error on the testing data set. Notice, that the configuration all has an excellent performance for all three variables, no matter the level of noise (Figure 4.4). In the Figure 4.5, we present results of the experiment on the Spiral data set [51]. As you can see, Perceptron ensemble learned to tell two spirals apart without any mistake. The second best performing configuration was all with almost hundred percent accuracy1 . The worst performing ensembles were Linear and Sigm (units with linear and logistic transfer functions). Their 50% classification accuracy signifies that these units absolutely failed to learn the Spiral data set. This is not surprising for linear units, but the failure of units with logistic transfer function is under our investigation (there might be a bug in the implementation of the analytic gradient for the SigmNeuron unit). The behavior of the all ensemble on the Spiral data set is demonstrated in the appendix (see Figure D.7). We performed number of similar experiments with other real world data sets. You can find results of these experiments in the appendix. We can conclude the all ensemble performed extremely well for almost all data sets under the investigation. It showed the best performance for the Mandarin data set (Figure D.10), Boston data set (Figure D.8) and the well known Iris data set (Figure D.11). Only for the Ecoli data set, the error of the all ensemble was 1
We have to mention that building the ensemble of all models took just a fraction of time needed to build the ensemble of Perceptron models (consisting of BPnetwork units)
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
Hot water consumption
Cold water consumption
Energy consumption
17 .0119 .0121 .0123 0.01 0 0 0
3 35 0.014 .0145 0.01 0.01 0
2.3
Fract
Combi
all
all-P
all
all-PF
all
all-P
Fract
Polynomial
Polynomial
Exp
Combi
all-PF
all-PF
Fract Perceptron
Sin
2.5
2.7
2.9
Sin Perceptron CombiR300
CombiR300
Sigm
Exp
Sigm
Linear
CombiR300
Exp
all-P
Sigm
Sin
Polynomial
1132.1
Linear
Linear
Combi
7861.2
Perceptron
33
4.48
Figure 4.4: The performance comparison of GAME units on the Building data set. In the all-PF configuration all units except the Perceptron and Fract unit were enabled, similarly in all-P just the Perceptron unit was excluded.
Classification accuracy on the Spiral data set 100%
80%
60%
n ro ep t Pe rc
al l Fr al a ct lPo sim ly ple n M om ul tiG ial au ss Si G n au al ss l-f as C t om G b a C uss i om ia bi n R 30 0 Ex p Si gm Li ne ar
40%
Figure 4.5: The performance comparison of GAME units on the Spiral data set.
34
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
relatively high (see Figure D.9). It is not clear, why it apparently overfitted the data, when other ensembles which are more likely to overfitt (Combi, Polynomial ) did not. We will focus on this data set in our further research. The conclusion of our experiments is that for the best results we recommend to enable all units which have be so far implemented in the GAME engine. The more types of transfer function we have, the more different relationships we can model. Which types of units are selected to make up a model depends just on nature of data modeled. The main advantage of using units of various types in single model is that models automatically adapt themselves to the character of modelled system. Only units with appropriate transfer function survive. Hybrid models also better approximate relationships that can be expressed by a superposition of different functions (e.g. polynomial * sigmoid * linear). In the same sense, we also use several types of learning algorithms. Many authors agree that gradient methods are convenient for simpler data, for more difficult data (e.g. multidimensional with many local optima) it is better to estimate weights or coefficients of unit’s transfer function by a genetic algorithm. We study this problem in details in the Section 4.3.
4.2
Optimization of GAME units
The GMDH methods can be divided into two classes. The parametric and non-parametric methods. The MIA GMDH and also the GAME belong to the parametric class. Parametric models contain parameters that are optimized during the learning (training) stage. The optimal values of parameters are values minimizing the difference between a behavior of a real system and its model. This difference is typically measured by a root mean squared error. The error of a model2 on a training data is the sum of errors on the particular training vectors E=
m X
(yj − dj )2 ,
(4.1)
j=0
where yj is the output of the model for the j th training vector and dj is the corresponding target output value. The coefficients of a GAME unit a1 , a2 , · · · , an can be estimated during the training stage by any optimization method implemented (see Table 4.3). In each iteration, the optimization method proposes values of coefficients and the GAME unit returns its error on training data with these proposed coefficients (see Figure 4.6a). If the analytical gradient of the error can be computed, the number of iterations would be significantly reduced, because we know in which direction coefficients should be adjusted (see Figure 4.6b). The gradient of the error ~ in the error surface of GAME unit can be written as ∇E ~ = ∇E
Ã
~ ~ ~ ∂E ∂E ∂E , ,···, ∂a1 ∂a2 ∂an
!
,
(4.2)
~
∂E is a partial derivative of the error in the direction of the coefficient ai . It tell us how where ∂a i to adjust the coefficient to get smaller error E on the training data. This partial derivative 2
Output of an optimized unit can be treated as an output of the model so the terms ”error of the model” and ”error of the unit” can be confused.
35
SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS
Figure 4.6: Optimization of the coefficients can be performed without the analytic gradient (a) or with the gradient supplied (b). Utilization of the analytic gradient significantly reduces the number of iterations needed for the optimization of coefficients.

This partial derivative can be computed as

\frac{\partial E}{\partial a_i} = \sum_{j=0}^{m} \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial a_i},    (4.3)
where m is the number of training vectors. The first factor of the summand is easily derived from Equation 4.1 as

\frac{\partial E}{\partial y_j} = 2 (y_j - d_j).    (4.4)
The second factor of the summand in Equation 4.3 is unique for each unit, because it depends on its transfer function. We demonstrate the computation of the analytic gradient for the Gaussian unit and for the Sine unit. For other units the gradient is computed in a similar way.

4.2.1 The analytic gradient of the Gaussian unit
Gaussian functions are very important and can be found almost everywhere. The most common distribution in nature follows the Gaussian probability density function f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}. Neurons with a Gaussian transfer function are typically used in Radial Basis Function Networks. We have modified the function for our purposes by adding coefficients that allow us to scale and shift it. The first version of the transfer function, as implemented in the GaussianNeuron unit, is the following:

y_j = (1 + a_{n+1}) \cdot e^{-\sum_{i=1}^{n} \left( \frac{x_{ij} - a_i}{(1 + a_{n+2})^2} \right)^2} + a_0    (4.5)

The second version (GaussNeuron) proved to perform better on several low-dimensional real world data sets:

y_j = (1 + a_{n+1}) \cdot e^{-\frac{\sum_{i=1}^{n} (a_i x_{ij} - a_{n+3})^2}{(1 + a_{n+2})^2}} + a_0    (4.6)

Finally, the third version (MultiGaussNeuron), a combination of the transfer functions above, showed the best performance, but sometimes exhibited almost fractal behavior (see Figure D.1):
y_j = (1 + a_{2n+1}) \cdot e^{\rho_j} + a_0, \qquad \rho_j = -\frac{\sum_{i=1}^{n} (a_i x_{ij} - a_{n+i})^2}{(1 + a_{2n+2})^2}    (4.7)
We computed gradients for all these transfer functions. In this thesis, we derive the gradient of the error (see Equation 4.2) for the third version of the Gaussian transfer function (Equation 4.7). We need to derive the partial derivatives of the error function according to Equation 4.3. The easiest partial derivative to compute is the one in the direction of the a_0 coefficient: the second factor \partial y_j / \partial a_0 equals 1, therefore we can write \frac{\partial E}{\partial a_0} = 2 \sum_{j=0}^{m} (y_j - d_j). In the case of the coefficient a_{2n+1}, the equation becomes more complicated:

\frac{\partial E}{\partial a_{2n+1}} = 2 \sum_{j=0}^{m} (y_j - d_j) \cdot e^{-\frac{\sum_{i=1}^{n} (a_i x_{ij} - a_{n+i})^2}{(1 + a_{2n+2})^2}}.    (4.8)
The remaining coefficients appear in the exponential part of the transfer function. Therefore the second factor in Equation 4.3 cannot be formulated directly; we have to rewrite Equation 4.3 as

\frac{\partial E}{\partial a_i} = \sum_{j=0}^{m} \left[ \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial \rho_j} \cdot \frac{\partial \rho_j}{\partial a_i} \right],    (4.9)

where \rho_j is the exponent of the transfer function 4.7. Now we can formulate the partial derivatives with respect to the remaining coefficients as

\frac{\partial E}{\partial a_{2n+2}} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j)(1 + a_{2n+1}) e^{\rho_j} \cdot 2 \frac{\sum_{i=1}^{n} (a_i x_{ij} - a_{n+i})^2}{(1 + a_{2n+2})^3} \right],    (4.10)

\frac{\partial E}{\partial a_i} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j)(1 + a_{2n+1}) e^{\rho_j} \cdot (-2) \frac{a_i x_{ij}^2 - a_{n+i} x_{ij}}{(1 + a_{2n+2})^2} \right],    (4.11)

\frac{\partial E}{\partial a_{n+i}} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j)(1 + a_{2n+1}) e^{\rho_j} \cdot (-2) \frac{a_{n+i} - a_i x_{ij}}{(1 + a_{2n+2})^2} \right].    (4.12)
We have derived the gradient of the error on the training data for the Gaussian transfer function unit. An optimization method typically requests these partial derivatives in every iteration to be able to adjust the parameters in the proper direction. This mechanism (described in Figure 4.6b) can significantly reduce the number of error evaluations needed (see Figure 4.7).
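To make the derivation concrete, the following is a minimal Java sketch of how the error (Equation 4.1) and its analytic gradient (Equations 4.8 and 4.10-4.12) could be computed for the MultiGauss transfer function (Equation 4.7). The class name, the coefficient layout a[0..2n+2] and the data layout x[j][i], d[j] are assumptions made for this illustration; this is not the GAME source code.

```java
/**
 * Illustrative sketch (not the GAME implementation) of the MultiGauss unit of
 * Equation 4.7 and its analytic error gradient, Equations 4.8 and 4.10-4.12.
 * Assumed coefficient layout: a[0] bias, a[1..n] weights, a[n+1..2n] shifts,
 * a[2n+1] output scale, a[2n+2] width. x[j][i] is the i-th input of the j-th
 * training vector, d[j] its target value.
 */
public final class MultiGaussGradientSketch {

    /** Exponent rho_j of Equation 4.7. */
    static double rho(double[] a, double[] xj, int n) {
        double s = 0.0;
        for (int i = 1; i <= n; i++) {
            double t = a[i] * xj[i - 1] - a[n + i];
            s += t * t;
        }
        double w = 1.0 + a[2 * n + 2];
        return -s / (w * w);
    }

    /** Unit output y_j, Equation 4.7. */
    static double output(double[] a, double[] xj, int n) {
        return (1.0 + a[2 * n + 1]) * Math.exp(rho(a, xj, n)) + a[0];
    }

    /** Training error E, Equation 4.1. */
    static double error(double[] a, double[][] x, double[] d, int n) {
        double e = 0.0;
        for (int j = 0; j < x.length; j++) {
            double diff = output(a, x[j], n) - d[j];
            e += diff * diff;
        }
        return e;
    }

    /** Analytic gradient dE/da_k, Equations 4.8 and 4.10-4.12 (plus dE/da_0). */
    static double[] gradient(double[] a, double[][] x, double[] d, int n) {
        double[] g = new double[2 * n + 3];
        double w = 1.0 + a[2 * n + 2];
        double w2 = w * w;
        for (int j = 0; j < x.length; j++) {
            double[] xj = x[j];
            double r = rho(a, xj, n);
            double ex = Math.exp(r);
            double yj = (1.0 + a[2 * n + 1]) * ex + a[0];
            double ej = 2.0 * (yj - d[j]);                   // dE/dy_j
            double common = ej * (1.0 + a[2 * n + 1]) * ex;  // shared factor of Eqs. 4.10-4.12
            g[0] += ej;                                      // dE/da_0
            g[2 * n + 1] += ej * ex;                         // Equation 4.8
            g[2 * n + 2] += common * (-2.0 * r / w);         // Equation 4.10 (sum term = -rho_j * w^2)
            for (int i = 1; i <= n; i++) {
                double t = a[i] * xj[i - 1] - a[n + i];
                g[i] += common * (-2.0 * t * xj[i - 1] / w2);  // Equation 4.11
                g[n + i] += common * (2.0 * t / w2);           // Equation 4.12
            }
        }
        return g;
    }
}
```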
4.2.2 The analytic gradient of the Sine unit
We also list the gradient for the Sine unit with the following transfer function:

y_j = a_{n+1} \sin(\rho_j) + a_0, \qquad \rho_j = a_{n+2} \left( \sum_{i=1}^{n} a_i x_{ij} + a_{n+3} \right)    (4.13)
Table 4.2: Number of evaluations saved by supplying the gradient, depending on the complexity of the energy function.

Complexity of energy fnc. | Avg. evaluations without grad. | Avg. evals. with grad. | Avg. gradient calls | Evaluations saved | Computation time saved
1 | 45.825  | 20.075 | 13.15  | 56.19% | 13.15%
2 | 92.4    | 29.55  | 21.5   | 68.02% | 33.12%
3 | 155.225 | 44.85  | 34.875 | 71.11% | 37.41%
4 | 273.225 | 62.75  | 51.525 | 77.03% | 48.75%
5 | 493.15  | 79.775 | 68.9   | 83.82% | 62.87%
The partial derivatives of the error function can be derived similarly to the Gaussian unit above:

\frac{\partial E}{\partial a_0} = 2 \sum_{j=0}^{m} (y_j - d_j),    (4.14)

\frac{\partial E}{\partial a_{n+1}} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j) \sin(\rho_j) \right],    (4.15)

\frac{\partial E}{\partial a_{n+2}} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j) \, a_{n+1} \cos(\rho_j) \left( \sum_{i=1}^{n} a_i x_{ij} + a_{n+3} \right) \right],    (4.16)

\frac{\partial E}{\partial a_{n+3}} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j) \, a_{n+1} \cos(\rho_j) \, a_{n+2} \right],    (4.17)

\frac{\partial E}{\partial a_i} = 2 \sum_{j=0}^{m} \left[ (y_j - d_j) \, a_{n+1} \cos(\rho_j) \, a_{n+2} \, x_{ij} \right].    (4.18)

4.2.3 The experiment: analytic gradient saves error function evaluations
We performed an experiment to evaluate the effect of the analytic gradient computation. The Quasi-Newton optimization method was used to optimize the SigmNeuron unit (a logistic transfer function). In the first run the analytic gradient was provided; in the second run it was not, so the QN method was forced to estimate the gradient itself. We measured the number of error function evaluation calls and, for the first run, we also recorded the number of gradient computation requests. The results are displayed in Figure 4.7 and in Table B.1. In the second run, without the analytic gradient provided, the number of error function evaluation calls increased exponentially with the rising complexity of the error function. In the first run, with the analytic gradient provided, the number of error function evaluation calls increases only linearly, and the number of gradient computations also grows linearly. The computation of the gradient is almost as time-consuming as the error function evaluation. When we sum up these two numbers for the first run, we still get growth that is linear in the number of layers (increasing complexity of the error surface). This is a superb result, because some models of complex problems can have 20 layers, so the computational time saved by providing the analytic gradient is huge.
Figure 4.7: When the gradient has to be estimated by the optimization method, the number of function evaluation calls grows exponentially with increasing complexity of the problem. When the analytic gradient is computed, the growth is almost linear.

Unfortunately, some optimization methods, such as genetic algorithms and swarm methods, are not designed to use the analytic gradient of the error surface. On the other hand, for some data sets the usage of the analytic gradient can worsen the convergence characteristics of optimization methods (getting stuck in local minima). The training algorithm described in this section enables efficient training of hybrid neural networks.
4.3 Heterogeneous learning methods
The question "Which optimization method is the best for our problem?" does not have a simple answer. There is no method superior to all others across all possible optimization problems. However, there are popular methods that perform well on a whole range of problems. Among these we can include the so-called gradient methods: the Quasi-Newton method, the Conjugate Gradient method and the Levenberg-Marquardt method. They use the analytic gradient (or its estimate) of the problem's error surface. The gradient brings them faster convergence, but when the error surface is jagged, they are likely to get stuck in a local optimum. Other popular optimization methods are genetic algorithms. They search the error surface by jumping over it with several individuals. Such a search is usually slower, but less prone to getting stuck in local minima. The search performed by swarm methods can be imagined as a swarm of birds flying over the error surface, looking for food in deep valleys; for certain types of terrain, they might miss the deepest valley. Each data set has a different complexity. The surface of a model's RMS error depends on the data set, on the transfer function of the optimized unit and also on the preceding units in the network.
Table 4.3: Learning methods summary

Name of the class | Abbrv. | Search | Learning method
UncminTrainer | QN | Gradient | Quasi-Newton method
SADETrainer | SADE | Genetic | SADE genetic method
PSOTrainer | PSO | Swarm | Particle Swarm Optimization
HGAPSOTrainer | HGAPSO | Hybrid | Hybrid of GA and PSO
PALDifferentialEvolutionTr. | PalDE | Genetic | Differential Evolution version 1
DifferentialEvolutionTrainer | DE | Genetic | Differential Evolution version 2
StochasticOSearchTrainer | SOS | Random | Stochastic Orthogonal Search
OrthogonalSearchTrainer | OS | Empirical | Orthogonal Search
ConjugateGradientTrainer | CG | Gradient | Conjugate Gradient method
ACOTrainer | ACO | Swarm | Ant Colony Optimization
CACOTrainer | CACO | Swarm | Cont. Ant Colony Optimization
Therefore we might expect that there is no universal optimization method performing optimally on every data set. Each unit has a different error surface, even within a single network. In GAME, each unit can use an arbitrary learning algorithm to estimate its coefficients (Quasi-Newton, Conjugate Gradient, Differential Evolution, the SADE genetic algorithm [27], Particle Swarm Optimization, the Back-Propagation algorithm, etc.). Learning methods are also evolved by the niching genetic algorithm (see Section 4.6.1); therefore methods training successful units are selected more often than methods training poor performers on a particular data set. The learning methods implemented in the GAME engine so far are summarised in Table 4.3.

4.3.1 Experiments with heterogeneous learning methods
To validate the assumption that no universal optimization method exists that performs optimally on any data set, we designed the following experiments. Again, several different real world data sets were involved. For each data set, we generated models where only units with simple transfer functions (Linear, Sigm, Combi, Polynomial, Exp) were enabled. The coefficients of these units were optimized by a single method from Table 4.3. In the all configuration, all methods were enabled. Because these experiments were computationally expensive (optimization methods not utilizing the analytic gradient need many more iterations to converge), we built an ensemble of 5 models for each configuration and data set. For the Ecoli data set, the ensembles are formed by just three member models³. Therefore the results are not as significant as in the previous section, where we experimented with GAME units.
³ The Ecoli data set has six output variables, so six times more time was needed to produce the models. Therefore we reduced the number of models in the ensemble to three.
Figure 4.8: The performance comparison of learning methods on the Ecoli data set (classification accuracy of single models on the training and testing sets, in two trials).
Figure 4.9: The performance comparison of learning methods on the Boston data set (RMS error on the training and testing sets).
Compare the results of the two trials on the Ecoli data set in Figure 4.8. If we focus on the Quasi-Newton method (QN) in the first trial (Figure 4.8 left), the ensemble of models optimized by the QN method overfitted the data, but in the second trial (with the same configuration) its result on the testing data set was much better. The hybrid of the Genetic Algorithm and Particle Swarm Optimization (HGAPSO) performed best in both trials; ensembles optimized by HGAPSO did not overfit the Ecoli data at all. We obtained very different results from the experiments with the Boston data set (see Figure 4.9). For all optimization methods, the difference between the error on the training and on the testing data set was almost the same. This signifies that this data set is not very noisy, so overfitting did not occur. The best performance was achieved by the Conjugate Gradient method, but all methods except the worst performing one (Orthogonal Search) achieved similar results. The results on the Building data set for its three output variables are shown in Figure 4.10. There is no significant difference between the results for the noisy variable (Energy consumption) and the other two. We can divide the optimization methods into well performing and badly performing classes. Good performers are Conjugate Gradient, Quasi-Newton, the SADE genetic algorithm, Differential Evolution, and the all configuration standing for the participation of all methods in the evolution of models.
Figure 4.10: The performance comparison of learning methods on the Building data set. Methods ordered from the best to the worst: Energy consumption: CG, DE, QN, SADE, all, SOS, CACO, PSO, HGAPSO, ACO, OS, palDE; Cold water consumption: QN, all, DE, SADE, CG, OS, HGAPSO, CACO, SOS, palDE, ACO, PSO; Hot water consumption: QN, CG, SADE, DE, all, HGAPSO, CACO, SOS, palDE, ACO, PSO, OS.

On the other hand, the badly performing optimization methods for the Building data set are Particle Swarm Optimization, PAL-Differential Evolution⁴ and Ant Colony Optimization. In accordance with the results published in [106], our version of differential evolution outperformed the swarm optimization methods. The conclusion from our experiments with optimization methods is not fully in accordance with our expectations. We predicted that the all configuration (all methods employed) would be the best performing one for all data sets. The experiments showed that gradient methods such as Quasi-Newton and Conjugate Gradient performed very well on all data sets we experimented with (including the Mandarin data set, Figure D.12 in the appendix). When all methods are used, good performance is guaranteed, but the computation is significantly slower (some methods need many iterations to converge). At this stage of the research and implementation, we recommend using the Quasi-Newton (QN) optimization method only, because it is the fastest and very reliable. In our further research we plan to use the analytic gradient to improve the performance of gradient and swarm methods. We also plan to experiment with switching optimization methods (switching to a different method when convergence is slow).
4.4 Structural innovations
The structure of the original MIA GMDH was adapted to the computational capabilities of the early seventies. Thirty years later, computing has taken a big step forward: experiments that can nowadays be run on every personal computer were intractable even on the most advanced supercomputers. To make the computation of an inductive model possible, several restrictions on the structure of the model had to be imposed. Because of growing computational power and the development of heuristic methods capable of solving NP-hard problems, we can drop some of these restrictions. The first restriction of the original MIA GMDH is the fixed number of unit inputs (two) and a polynomial transfer function that is constant (except for its coefficients) for all units in a model.
⁴ palDE is the second version of the Differential Evolution algorithm implemented in the GAME engine. The result that the first version of DE performed well while the second version performed badly is peculiar. It signifies that the implementation and the proper configuration of a method are of crucial importance.
The second restriction is the absence of layer-breakthrough connections: in the original version, a unit's inputs can come from the previous layer only.
4.4.1 Growth from a minimal form
The GAME models grow from a minimal form. There is a strong parallel with state of the art neural networks such as NEAT [93]. In the default configuration of the GAME engine, units are restricted to have at least one input, and the maximum number of inputs must not exceed the number of the hidden layer the unit belongs to. The number of inputs to a unit thus increases together with the depth of the unit in the model, and the transfer functions of units reflect the growing number of inputs. We showed in [A.15] that the increasing number of unit inputs and the allowing of interlayer connections play a significant role in improving the accuracy of inductive models. The growing limit on the number of inputs a unit is allowed to have is crucial for inductive networks: it helps to overcome the curse of dimensionality. According to the induction principle, it is easier to decompose a problem into one-dimensional interpolation problems and then combine the solutions in two and more dimensions than to start with multi-dimensional problems (for fully connected networks, the dimensionality is proportional to the number of input features). To improve the modeling accuracy of neural networks, artificial input features are often added to the training data set. These features are synthesized from the original features by mathematical operations and can possibly reveal more about the modeled output. This is exactly what GAME does automatically (units of the first hidden layer serve as additional synthesized features for the units deeper in the network). Our additional experiments showed that the restriction on the maximum number of inputs to a unit has a negative effect on the accuracy of models. However, when the restriction is enabled⁵, the process of model generation is much faster and the accuracy of the produced models is more stable than without the restriction. Without the restriction, we would need many more epochs of the genetic algorithm evolving units in a layer for the model accuracy to be stable and for the feature ranking algorithm, which derives significance from the proportional numbers of units connected to a particular feature, to work properly (see Section 5.2.2.1).
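The growing limit on the number of inputs can be illustrated by a sketch of the random initialization of the "Inputs" part of a chromosome. The helper name and the representation (one boolean gene per possible connection) are assumptions made only for this example.

```java
import java.util.Random;

// Illustrative sketch: a unit in hidden layer `layer` (1-based) is connected to
// at least one and at most `layer` of the `candidates` possible inputs
// (input features plus units from preceding layers).
final class InputGeneSampler {
    static boolean[] randomInputs(int candidates, int layer, Random rnd) {
        boolean[] genes = new boolean[candidates];
        int maxInputs = Math.min(layer, candidates);   // growing limit of inputs
        int inputs = 1 + rnd.nextInt(maxInputs);       // at least one input
        while (inputs > 0) {
            int c = rnd.nextInt(candidates);
            if (!genes[c]) {                           // connect to a candidate not yet used
                genes[c] = true;
                inputs--;
            }
        }
        return genes;
    }
}
```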
4.4.2 Interlayer connections
Units no longer take inputs only from the previous layer. Inputs can be connected to the output of any unit from any of the preceding layers, as well as to any input feature. This modification greatly increases the state space of possible model topologies, but the improvement in the accuracy of models is rather big [A.15]. The GMDH algorithm implemented in the KnowledgeMiner software [69] can also generate models with layer-breakthrough connections.
⁵ No restriction on the maximal number of inputs does not mean a fully connected network!
4.5 Regularization in GAME
Regularization is a methodology that allows creating models that are neither too simple nor too complex for the task at hand. Without any form of regularization, models often overfit the training data, losing their generalization abilities, and their performance on new, unseen data becomes extremely bad. The GMDH methods usually regularize models using an external data set called a validation set⁶. The criterion of regularity (in this case the error on the validation set) should be minimized:

CR_{RMS-val} = \frac{1}{N_B} \sum_{i=1}^{N_B} (y_i(A) - d_i)^2    (4.19)
In some cases (e.g. few data records), it is possible to validate also on the training set:

CR_{RMS-tr&val} = \frac{1}{N} \sum_{i=1}^{N} (y_i(A) - d_i)^2    (4.20)
For the experiments we performed in the previous sections, the criterion CR_{RMS-tr&val} was used. If you look at the results in Figure 4.8, you can see that for some models the accuracy on the training data is far better than that on the testing data. Also, the error of the Polynomial and Combi units when modeling the Energy consumption (Figure 4.4) is huge. These are indicators of overtraining: the CR_{RMS-tr&val} regularization was unable to prevent overtraining for noisy data. Another, very straightforward form of regularization is the penalization for complexity. There are several criteria (AIC, BIC, etc.) developed in information theory that can be applied to our problem. We experimented with a regularization that can be written as

CR_{RMS-R-val} = \frac{1}{N_B} \sum_{i=1}^{N_B} (y_i(A) - d_i)^2 \cdot \left( 1 + \frac{1}{R} \cdot unit.PenaltyForComplexity() \right),    (4.21)

and with the version validating also on the training data, CR_{RMS-R-tr&val}. The value of the R coefficient is very important. When you look at Figure 4.11, you can see that the minimum of the criterion changes its position with the value of the R parameter. For noisy data it is better to have a stronger regularization (R = 12), while for data without noise no regularization is needed (R → ∞). To validate our assumptions, we designed an experiment with an artificial data set where we adjusted the level of noise from 0% up to 200%. We also generated 30 models for each noise level and each strength of the regularization (from R = 12 to R = 3000). Theoretically (Figure 4.12 left), the lowest error on the testing data should be achieved when the strength of the regularization matches the level of noise in the data. The experimental results (Figure 4.12 right) matched our expectations, except that for low regularization (R = 3000) the testing error was low also for data with a medium level of noise. This deviation from the theoretical expectations can be caused by the fact that we used just the error of the simple ensemble of 30 models for each configuration (instead of the mean and standard deviation of their errors). The ensemble techniques reduced the overfitting of models for medium noise, but were unable to correct extremely overfitted models trained on highly noisy data.
⁶ The validation set can be created by splitting the training data set into a part used for the optimization of the model and a part used for validation.
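The criteria of Equations 4.19-4.21 can be summarized by a small sketch. The method name penaltyForComplexity and the parameter R follow Equation 4.21; the class and variable names are assumptions of this illustration.

```java
// Illustrative computation of the regularity criteria of Equations 4.19-4.21.
// y[i] is the model output for the i-th validation vector, d[i] the target value.
final class RegularityCriterion {

    /** CR_RMS-val, Equation 4.19: mean squared error on the validation set. */
    static double crVal(double[] y, double[] d) {
        double sum = 0.0;
        for (int i = 0; i < y.length; i++) {
            double diff = y[i] - d[i];
            sum += diff * diff;
        }
        return sum / y.length;
    }

    /** CR_RMS-R-val, Equation 4.21: the same error scaled by a penalty for complexity.
     *  A smaller R means a stronger penalization of complex units. */
    static double crRVal(double[] y, double[] d, double R, double penaltyForComplexity) {
        return crVal(y, d) * (1.0 + penaltyForComplexity / R);
    }
}
```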
Figure 4.11: As described in [48], the regularization criterion (CR) should be sensitive to the presence of noise in the data. The complexity of models increases during training as new layers are added; training should be stopped at the minimum of the CR. The penalization for complexity (controlled by the R parameter) can be part of the CR, but its value should be adjusted according to the noise level in the data.
Figure 4.12: The stronger the penalization for complexity, the worse the performance we can expect on complex problems with low noise. On the other hand, stronger penalization should perform better for noisy data (left graph). Experimental measurements (right graph) are in accordance with the theoretical assumptions, except for low regularization and medium noise.
Figure 4.13: Regularization of the Combi units by various strengths of the penalization for complexity. The error on the Antro training data decreases and the best result is achieved for almost no penalization (R3000). The optimal regularization on the testing data set is R300, where the errors of individual models are the lowest and their variance is still reasonable.
4.5.1 Regularization of Combi units on real world data
We were also interested in how the regularization affects the performance of the GAME models on real world data sets. We chose the Antro data set because our previous experiments showed that this data set is considerably noisy. We used two different splits into training and testing sets (training1, testing1, training2, testing2) to reduce the bias of selecting unrepresentative data. For our experiments we enabled just the CombiNeuron units (see Section 4.7 for a detailed description of this unit). We generated 30 models for each strength of penalization on both training sets. In Figure 4.13 you can see the minimum, maximum and mean error of these models. The results signify that on the Antro data set the optimal value of the R parameter in Equation 4.21 is around 300. When you look at Figure D.13 in the Appendix, you can see the results when an ensemble is used instead of 30 individual models; the overfitting was reduced according to expectations. The same experiment with the Building data set turned out quite differently for the two variables with a low level of noise (Cold and Hot water consumption): the best accuracy was achieved for the lowest regularization (R3000). For the third output variable (Energy consumption), which is considerably noisy, the error also decreased with a lower penalty for complexity. In the configuration R1600, two of the 30 models significantly overfitted the data and the output of the simple ensemble also demonstrated a large error on the testing data. Therefore a stronger penalization (R725) is optimal for the Energy consumption variable. Both experiments with real world data sets showed that each output variable requires a different degree of regularization, depending on the amount of noise present in the data vectors.
Figure 4.14: The error of the GAME ensemble on the Building training data decreases with declining penalization. The results on the testing data set show that no regularization is needed for the WBHW and WBCW variables. The WBE variable is much noisier than the other two, and for low penalization levels the models are overfitted.

4.5.2 Evaluation of regularization criteria
In [48] it was proposed that the criterion of regularity should change its minimum with a changing level of noise in the data set. Such a "clever" criterion involves taking into account the noise level of the output variable. However, the level of noise can hardly be estimated; the problem is that we cannot say whether the variance is caused by noise or by a complex relationship. We can assume that the noise level is correlated with the variance of the output variable in the data set⁷

\sigma^2 = \sum_{i=1}^{N} (d_i - \bar{d})^2,    (4.22)

where d_i are the target values of the output variable and \bar{d} is the mean of these values. Then the regularity criterion can be written as

CR_{RMS-p-n-val} = \frac{1}{N_B} \sum_{i=1}^{N_B} (y_i(A) - d_i)^2 \cdot \left( 1 + \sigma^2 \cdot unit.PenaltyForComplexity() \right).    (4.23)
The penalty for complexity is then stronger for a higher variance in the data set. We have been experimenting with the above proposed criteria on the artificial data set with various levels of noise. The results in Figure 4.15 (left) indicate that for low levels of noise in the data, it is better to validate on both the training and the validation set.
⁷ This assumption is not true for several regression data sets and for almost all classification problems.
Figure 4.15: For low noise levels in data, it is better to validate on both the training and the validation set. For noisy data, just the validation set should be used to prevent overtraining.

For highly noisy data, this regularization (Equation 4.20) fails and it is better to validate on the validation set only (Equation 4.19), or to use other criteria with a penalization for complexity (e.g. CR_{RMS-p-n-tr&val}). The difference between the criteria with and without the variance considered is not significant (Figure 4.15 right). An overview of the performance of the criteria, relative to the best results found during the experiment in Figure 4.12, can be found in Figure 4.16. The results showed that the regularization taking the variance of the output variable into account (Equation 4.23) is not better than the medium penalization for complexity (Equation 4.21 with R = 300). The problem is that we cannot say whether the variance of the output variable is caused by noise or by a complex relationship. Additional research is needed to improve the results of the "adaptive" regularization. At this state of the research, we recommend using the medium penalization for complexity. An even better option would be if the noise level in the data set were supplied by a domain expert as external information; then we could adjust the coefficient R from Equation 4.21 to the appropriate value.
4.6 Genetic algorithm
The genetic algorithm is frequently used to optimize the topology of neural networks [66, 89, 95]. Also in GMDH-related research, recent papers [77, 70] report improving the accuracy of models by employing genetic search for their optimal structure. In the GAME engine, we also use genetic search to optimize the topology of models, as well as the configuration and shapes of the transfer functions within their units. An individual in the genetic algorithm represents one particular unit of the GAME network. The inputs of a unit are encoded into a binary string chromosome. The transfer function can also be encoded into the chromosome (see Figure 4.17 and the next section "Evolving units"). The chromosome can also include configuration options, such as strategies utilized during the optimization of parameters.
Figure 4.16: We can observe similar results as in the previous figure. Regularization methods using both the training and validation sets are better with no noise but fail at high noise levels in the data. The R300 regularization criterion proved to perform surprisingly well for all levels of noise. The RMS-p-n criterion should be further adjusted to penalize less at low noise levels and more at high noise levels.
Figure 4.17: If two units of different types are crossed, just the "Inputs" part of the chromosome comes into play. If two CombiNeuron units cross over, the second part of the chromosome (encoding the transfer function) is also involved.
Figure 4.18: GAME units in the first layer are encoded into chromosomes, then the GA is applied to evolve the best performing units. After a few epochs, all units will be connected to the most significant input and will therefore be correlated. When the niching GA with Deterministic Crowding is used instead of the basic variant of the GA, units connected to different inputs survive.
The length of the "inputs" part of the chromosome equals the number of input variables plus the number of units from previous layers the unit can be connected to. An existing connection is represented by a "1" in the corresponding gene. The number of ones is restricted to the maximal number of the unit's inputs. An example of how the transfer function can be encoded is shown in Figure 4.27. Note that the coefficients of the transfer functions (a_0, a_1, ..., a_n) are not encoded in the chromosome (Figure 4.17); these coefficients are set separately by the optimization methods (Section 4.3). This is a crucial difference from the Topology and Weight Evolving Artificial Neural Network (TWEANN) approach [95]. The fitness of an individual (e.g. f(p1)) is inversely proportional to the criterion of regularity described in the previous section. The application of the genetic algorithm in the GAME engine is depicted in Figure 4.18. The left schema describes the evolution of a single GAME layer when the standard genetic algorithm (Section 2.2.3.5) is applied. Units are randomly initialized and encoded into chromosomes. Then the genetic algorithm is run. After several epochs of the evolution, individuals with the highest fitness (units connected to the most significant input) dominate the population. The best solution, represented by the best individual, is found. The other individuals (units) have very similar or even the same chromosomes as the winning individual. This is also the reason why all units surviving in the population (after several epochs of evolution by the regular genetic algorithm) are highly correlated. The regular genetic algorithm finds one best solution. We also want to find multiple suboptimal solutions (e.g. units connected to the second and the third most important input). By using less significant features we get more additional information than by using several best individuals connected to the most significant feature, which are in fact highly correlated (as shown in Figure 4.19). Therefore we employ a niching method, described below, which maintains diversity in the population so that units connected to less significant inputs are allowed to survive, too (see Figure 4.18 right).
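The rule from Figure 4.17, that the crossover of two units of different types exchanges only the "Inputs" part of the chromosome, can be sketched as follows. The class and field names are assumptions made for this illustration, not the GAME implementation.

```java
import java.util.Random;

// Illustrative sketch of a unit chromosome and of a crossover that follows the
// rule of Figure 4.17: units of different types exchange only the "Inputs" part.
final class UnitChromosome {
    final String unitType;             // e.g. "LinearNeuron", "CombiNeuron"
    final boolean[] inputs;            // one gene per possible connection
    final int[] transferFunctionGenes; // present only for units that evolve their transfer function

    UnitChromosome(String unitType, boolean[] inputs, int[] transferFunctionGenes) {
        this.unitType = unitType;
        this.inputs = inputs;
        this.transferFunctionGenes = transferFunctionGenes;
    }

    /** One-point crossover of the "Inputs" part; the transfer-function part is
     *  exchanged only when both parents are of the same type. */
    static UnitChromosome crossover(UnitChromosome p1, UnitChromosome p2, Random rnd) {
        boolean[] childInputs = p1.inputs.clone();
        int point = rnd.nextInt(childInputs.length);
        for (int i = point; i < childInputs.length; i++) {
            childInputs[i] = p2.inputs[i];
        }
        int[] childTf = p1.transferFunctionGenes;
        if (p1.unitType.equals(p2.unitType) && p2.transferFunctionGenes != null
                && rnd.nextBoolean()) {
            childTf = p2.transferFunctionGenes.clone();  // same type: second part may also cross
        }
        return new UnitChromosome(p1.unitType, childInputs, childTf);
    }
}
```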
Figure 4.19: The fitness of unit Z is higher than that of unit C, although Z has less fit inputs (f(A) = 8, f(B) = 7.99, f(C) = 8 versus f(X) = 8, f(Y) = 5, f(Z) = 9).

4.6.1 Niching methods
The major difference between the regular genetic algorithm and a niching genetic algorithm is that in the niching GA a distance between individuals is defined. The distance of two individuals (e.g. d(p1, c01) from the pseudocode of Deterministic Crowding in Section 2.2.3.5) can be based on the phenotypic or the genotypic difference of units. In the GAME engine, the distance of units is computed from both differences. Figure 4.20 shows that the distance of units is partly computed from the correlation of their errors and partly from their genotypic difference. The genotypic difference consists of the obligatory part "difference in inputs"; some units add a "difference in transfer functions", and a "difference in configurations" can also be defined.

Figure 4.20: The distance of two units in the GAME network. Distance(P1, P2) is the sum of the correlation of errors (computed from the units' deviations on the training and validation set) and the genotypic distance: the normalized distance of inputs (Hamming distance of the "Inputs" parts), the normalized distance of the transfer functions (features used plus the Euclidean distance of coefficients) and the normalized distance of other attributes (configuration variables).

Units that survive in the layers of GAME networks are chosen according to the following algorithm. After the niching genetic algorithm finishes the evolution of units, a multi-objective algorithm sorts the units according to their RMS error, genotypic distance and correlation of errors. Surviving units have low RMS errors, high mutual distances and low correlations of errors. Niches in GAME are formed by units with similar inputs, similar transfer functions, similar configurations and a high correlation of errors.
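A minimal sketch of the combined distance from Figure 4.20 follows. The normalization and the way the error correlation is turned into a distance are assumptions made for this illustration; the GAME implementation may differ in both respects.

```java
// Illustrative sketch of the distance of two units used by the niching GA
// (Figure 4.20): a normalized Hamming distance of the "Inputs" parts plus a
// distance derived from the correlation of the units' errors.
final class UnitDistanceSketch {

    static double distance(boolean[] inputs1, boolean[] inputs2,
                           double[] deviations1, double[] deviations2) {
        return normalizedHamming(inputs1, inputs2)
                + (1.0 - Math.abs(correlation(deviations1, deviations2)));
    }

    static double normalizedHamming(boolean[] a, boolean[] b) {
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) diff++;
        }
        return (double) diff / a.length;
    }

    /** Pearson correlation of the units' deviations on the training and validation vectors. */
    static double correlation(double[] e1, double[] e2) {
        int n = e1.length;
        double m1 = 0, m2 = 0;
        for (int i = 0; i < n; i++) { m1 += e1[i]; m2 += e2[i]; }
        m1 /= n; m2 /= n;
        double cov = 0, v1 = 0, v2 = 0;
        for (int i = 0; i < n; i++) {
            cov += (e1[i] - m1) * (e2[i] - m2);
            v1 += (e1[i] - m1) * (e1[i] - m1);
            v2 += (e2[i] - m2) * (e2[i] - m2);
        }
        return cov / Math.sqrt(v1 * v2 + 1e-12);  // small epsilon avoids division by zero
    }
}
```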
The next idea is that units should inherit their type and the optimization method used to estimate their coefficients. This improvement reduces the time wasted on optimizing units with an improper transfer function, or with optimization methods not suited to the processed data.

4.6.1.1 Evaluation of the distance computation
The GAME engine enables the visual inspection of complex processes that are normally impossible to control. One of these processes is displayed in Figure 4.21. From the left, we can see the matrix of genotypic distances computed from the chromosomes of individual units during the evolution of a GAME layer. Note that this distance is computed as a sum of three components: the distance of inputs, of transfer functions and of configuration variables, where the last two components are optional. A darker background signifies a higher distance of the corresponding individuals and vice versa. The next matrix visualizes distances of units based on the correlation of their errors; a darker background signifies less correlated errors. The next graph shows the deviations of the units' outputs from the target values of individual training vectors; from these deviations the correlation is computed. The rightmost graph of Figure 4.21 shows the normalized RMS error of the units on the training data. All these graphs are updated as the evolution proceeds from epoch to epoch. When the niching genetic algorithm finishes, one can observe how the units are sorted (a multi-objective sorting algorithm based on Bubble sort) and which units are finally selected to survive in the layer.

Figure 4.21: During the GAME layer evolution, the distances of units can be visually inspected. The first graph shows their distance based on the genotypic difference. The second graph derives the distance from their correlation. The third graph shows the deviations of units on individual training vectors, and the rightmost graph displays their RMS error on the training data. The view is shown at the start of the niching genetic algorithm (units randomly initialized, trained, errors computed), after 30 epochs when the algorithm terminates, and finally after the units are sorted according to their RMSE, chromosome difference and correlation.

Using this visual inspection tool, we have evaluated and tuned the distance computation in the niching genetic algorithm. The next goal was to evaluate whether the distance computation is well defined. The results in Figure 4.22 show that the best performing models can be evolved with the proposed combination of genotypic difference and correlation as the distance measure. The worst results are achieved when the distance is set to zero for all units. Medium accuracy models are generated by either the genotypic-difference-based distance or the correlation-of-errors-based distance.
Figure 4.22: The best results are obtained with the setting where the distance of units is computed as a combination of their genotypic distance and the correlation of their errors on the training vectors (RMS error on the Boston testing data set of simple and weighted ensembles for the distance settings None, Genome, Correlation and Gen&Corr.).

4.6.1.2 The performance tests of the Niching GA versus the Regular GA
Figure 4.23 shows a comparison of the regular genetic algorithm and the niching GA with the Deterministic Crowding scheme. The data set used to model the output variable (Mandarin tree water consumption) has eleven input features. Units in the first hidden layer of the GAME network have a single input, so they are connected to a single feature. The population of 200 units in the first layer was initialized randomly (genes are uniformly distributed, i.e. approximately the same number of units is connected to each feature). After 250 epochs of the regular genetic algorithm, the fittest individuals (units connected to the most significant feature) dominated the population. On the other hand, the niching GA with DC maintained diversity in the population and individuals of three niches survived. As Figure 4.23 shows, the functionality of the niching genetic algorithm in the GAME engine has been proved. In Figure 4.23 you can also observe that the number of individuals (units) in each niche is proportional to the significance of the feature the units are connected to. From each niche the fittest individual is selected and the construction goes on with the next layer. The fittest individuals in the next layers of the GAME network are those connected to features which bring the maximum of additional information. Individuals connected to features that are significant, but highly correlated with features already used, will not survive. By monitoring which individuals endured in the population, we can estimate the significance of each feature for modeling the output variable. This can subsequently be used for feature ranking (see Section 5.2.2). We also compared the performance (the inverse of the RMS error on the testing data) of GAME models evolved by means of the regular GA and the niching GA with Deterministic Crowding, respectively. Extensive experiments were executed on the complex data (Building data set) and on the small, simple data (On-ground nuclear tests data set). A statistical test proved that, at the 95% level of significance, the GA with DC performs better than the simple GA for the energy and hot water consumption.
Figure 4.23: The experiment proved that the regular Genetic Algorithm approaches an optimum relatively quickly. Niching preserves different units for many more iterations, so we can choose the best unit from each niche at the end. Niching also increases the probability that the global minimum is not missed. The graphs show, for the regular Genetic Algorithm and for the GA with Deterministic Crowding, the number of units connected to each of the eleven input features (Time, Day, Rs, Rn, PAR, Tair, RH, u, SatVapPr, VapPress, Battery) over 250 epochs.
Figure 4.24: RMS error of GAME models evolved by means of the regular GA and the GA with Deterministic Crowding, respectively (on the complex data). For the hot water and the energy consumption, the GA with DC is significantly better than the regular GA.
Figure 4.25: Average RMS error of GAME models evolved by means of the simple GA (DC off) and the GA with Deterministic Crowding (DC on), respectively, on the simple data (output attributes: crater depth, crater diameter, fire radius, instant radiation, suma radiation, wave pressure). Here, for all variables, Deterministic Crowding attained superior performance.

Figure 4.24 shows the RMS errors of several models evolved by means of the regular GA and the GA with Deterministic Crowding, respectively. The results are even more significant for the On-ground nuclear tests data set. Figure 4.25 shows the average RMS error of 20 models evolved for each output attribute. Leaving out the models of the fire radius attribute, the performance of all other models is significantly better with Deterministic Crowding enabled. We can conclude that the niching strategies significantly improve the evolution of GAME models; the generated models are more accurate than models evolved by the regular GA, as our experiments with real world data showed.
4.6.1.3 The inheritance of unit properties - experimental results
We designed experiments to test our assumption that a configuration of the GAME engine where offspring units inherit the type and the optimization method used by the more successful parent would perform better than a configuration where the type and the optimization method are assigned randomly. We prepared configurations of the GAME engine with several different inheritance settings. In the p0% configuration, no offspring gets its type or optimization method assigned randomly (everything is inherited). In the p50% configuration, offspring have a 50% chance of getting a random type or method assigned. In the p100% configuration nothing is inherited; all types and optimization methods are set randomly. We experimented with the Mandarin, Antro and Boston data sets. For each configuration, 30 models were evolved. The maximum, minimum and mean of their RMS errors for each configuration are displayed in Figure 4.26. The results are very similar for all configurations and data sets; no configuration is significantly better than the others. For all data sets we can observe that the p50% and p100% configurations have slightly better mean error values and a lower dispersion of errors. We chose the p50% configuration as the default in the GAME engine: offspring units have a 50% chance of getting a random type and optimization method assigned, otherwise their type and method are inherited from the parent units.
Figure 4.26: The experiments with the inheritance of the transfer function and learning method on the Mandarin, Antro and Boston data sets. For all three data sets, the fifty percent inheritance level is a reasonable choice.
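The default inheritance behaviour discussed above can be summarized by a tiny sketch; the method and parameter names are assumptions made for this illustration.

```java
import java.util.List;
import java.util.Random;

// Illustrative sketch of the inheritance setting: with probability p the offspring
// gets a random unit type and optimization method, otherwise both are inherited
// from the parent (p = 0.5 corresponds to the default p50% configuration).
final class InheritanceSketch {
    static String[] chooseTypeAndTrainer(String parentType, String parentTrainer,
                                         List<String> allTypes, List<String> allTrainers,
                                         double p, Random rnd) {
        if (rnd.nextDouble() < p) {
            return new String[] {
                allTypes.get(rnd.nextInt(allTypes.size())),
                allTrainers.get(rnd.nextInt(allTrainers.size()))
            };
        }
        return new String[] { parentType, parentTrainer };  // inherit from the parent
    }
}
```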
4.7 Evolving units (active neurons)
The input connections of units are evolved by means of the niching genetic algorithm described above. It is also possible to evolve the transfer functions of units and their configuration at the same time. In the current version of the GAME engine, only the CombiNeuron unit supports the evolution of its transfer function. We are working on extending the support of transfer function evolution to other GAME units.

4.7.1 CombiNeuron - evolving polynomial unit
The GAME engine, in the configuration generating homogeneous models with PolySimpleNeuron or CombiNeuron units only, can be classified as a Multiplicative-additive (generalized) GMDH algorithm [63]. The transfer function of a polynomial unit can be either pseudo-randomly generated (as implemented in the PolySimpleNeuron unit⁸) or evolved, as implemented in the CombiNeuron unit. To be able to evolve the transfer function, we have to encode it into a chromosome first. The encoding designed for the CombiNeuron unit is displayed in Figure 4.27. The advantage of the encoding is that it keeps track of the degrees of input features (the degree field) even when particular features are disabled for some units. This encoding is also used in the proposal of the PMML GMDH standard (see Appendix A). When the transfer function was added into the chromosome⁹, it was necessary to define evolutionary operators. The mutation operator can add or delete one element of the transfer function. It can also mutate the degree of an arbitrary input feature in an arbitrary element of the transfer function. The crossover operator simply combines elements of the transfer functions of both parents. We plan to experiment with more sophisticated crossover techniques utilizing e.g. "historical marking" [95]. In the case of noisy data, the CombiNeuron unit should be penalized for complexity to avoid the training data overfitting phenomenon (see Section 4.5).

⁸ In PolySimpleNeuron units, multiplicative-additive polynomials, whose complexity increases with the number of the layer in the network, are generated pseudo-randomly.
⁹ The Genome class was overridden by the CombiGenome class, encoding both the inputs and the transfer function.
Figure 4.27: Encoding of the transfer function for the CombiNeuron unit. The example polynomial y = 8.94·x2³·x4² − 2.37·x1·x4⁵ + 7.12·x4 is encoded as three elements; each element stores its coefficient, a used_field marking which of the inputs x1...x5 participate (e.g. 01010) and a degree_field with the degree of every input (e.g. 23124).
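Under the encoding of Figure 4.27, the value of a CombiNeuron-style transfer function can be computed by iterating over its elements, as the following sketch shows. The class and field names, and the presence of the bias term a0 (taken over from the CombiNeuron examples of Figure 4.17), are assumptions made for this illustration.

```java
// Illustrative sketch of evaluating a polynomial transfer function encoded as in
// Figure 4.27: each element has a coefficient, a used_field marking the inputs
// that participate, and a degree_field with the degree of every input.
final class CombiElement {
    double coefficient;
    boolean[] used;  // used_field, one flag per input feature
    int[] degree;    // degree_field, one degree per input feature
}

final class CombiTransferFunctionSketch {
    /** y = a0 + sum over elements of (coefficient * product of x_i^degree_i for used inputs). */
    static double value(CombiElement[] elements, double a0, double[] x) {
        double y = a0;  // bias term, as in the examples of Figure 4.17
        for (CombiElement e : elements) {
            double term = e.coefficient;
            for (int i = 0; i < x.length; i++) {
                if (e.used[i]) {
                    term *= Math.pow(x[i], e.degree[i]);
                }
            }
            y += term;
        }
        return y;
    }
}
```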
When more types of units are enabled and the number of epochs of the niching GA, together with the size of the population, is small, CombiNeuron units do not have the opportunity to evolve their transfer functions, which results in their poor performance and absence in the final models. In this case we advise reducing the number of unit types and changing the inheritance configuration to p0% (all offspring inherit their type from their parents).

Figure 4.28: The Bagging approach is used to build an ensemble of GAME models: the training data is sampled with replacement into M samples, a GAME model is built on each sample, and the models are then combined by simple or weighted averaging (or voting) into the GAME ensemble.
4.8 Ensemble techniques in GAME
The GAME method generates models of similar accuracy on the training data set. They are built and validated on random subsets of the training set (this technique is known as bagging [45]). The models also have similar types of units and similar complexity. It is difficult to choose the best model, since several models have the same (or very similar) performance on the testing data set. We therefore do not choose one best model, but several optimal models - an ensemble of models [22]. Figure 4.28 illustrates how GAME ensemble models are generated using bootstrap samples of the training data and later combined into a simple ensemble or a weighted ensemble. This technique is called Bagging and it helps the member models to demonstrate
diverse errors on the testing data. Other techniques that promote diversity in the ensemble of models also play a significant role in increasing the accuracy of the ensemble output. The diversity in the ensemble of GAME models is supported by the following techniques:
• Input data varies (Bagging)
• Input features vary (using a subset of features)
• Initial parameters vary (random initialization of weights)
• Model architecture varies (heterogeneous units used)
• Training algorithm varies (several training methods used)
• A stochastic method is used (the niching GA used to evolve models)
We assumed that the ensemble of GAME models would be more accurate than any of the individual models. This assumption turned out to be true only for GAME models whose construction was stopped before they reached the optimal complexity (Figure 4.29 left).
Figure 4.29: The Root Mean Square error of the simple ensemble is significantly lower than the RMS of the individual suboptimal models on the testing data (left graph). For optimal GAME models this is not the case (right graph).

We performed several experiments on both synthesized and real world data sets. These experiments demonstrated that an ensemble of optimal GAME models is seldom significantly better than the single best performing model from the ensemble (Figure 4.29 right). The problem is that we cannot say in advance which single model will perform best on the testing data. The best performing model on the training data can be the worst performing one on the testing data and vice versa. Usually, models performing badly on the training data also perform badly on the testing data; such models can impair the accuracy of the ensemble model. To limit the influence of bad models on the output of the ensemble, models can be weighted according to their performance on the training data set. Such an ensemble is called the weighted ensemble and we discuss its performance below.
Contrary to the approach introduced in [37], we do not use the whole data set to determine the performance (Root Mean Square Error) of the individual models in the weighted ensemble. Also, the optimal value of the coefficient α was experimentally determined to be 2·10⁶, instead of the value 1 used in [37].
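A minimal sketch of combining member models into a simple and a weighted ensemble follows. The inverse-error weighting used below is one common scheme chosen only for this example; it is not necessarily the exact weighting formula (with the coefficient α) used in GAME or in [37].

```java
// Illustrative sketch of combining member models into a simple or weighted ensemble.
final class EnsembleSketch {

    /** Simple ensemble: plain average of the member model outputs. */
    static double simpleEnsemble(double[] memberOutputs) {
        double sum = 0.0;
        for (double y : memberOutputs) sum += y;
        return sum / memberOutputs.length;
    }

    /** Weighted ensemble: members with a lower training RMS error get a higher weight. */
    static double weightedEnsemble(double[] memberOutputs, double[] trainingRmse) {
        double weightSum = 0.0;
        double output = 0.0;
        for (int i = 0; i < memberOutputs.length; i++) {
            double w = 1.0 / (trainingRmse[i] + 1e-12);  // epsilon guards against division by zero
            output += w * memberOutputs[i];
            weightSum += w;
        }
        return output / weightSum;
    }
}
```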
Figure 4.30: Performance of the simple ensemble and the weighted ensemble on a very noisy data set (skeleton age estimation based on senescence indicators).

In Figure 4.30 you can see that the weighted ensemble has a stronger tendency to overfit the data than the simple ensemble. While its performance is superior on the training and validation data, on the testing data there are several individual models performing better. The theoretical explanation for such behavior might be the following. Figure 4.31a shows an ensemble of two models that are not complex enough to reflect the variance of the data (weak learners). The error of the ensemble is lower than that of the individual models, similarly to the first experiment mentioned above.
Figure 4.31: An ensemble of two models exhibiting diverse errors can provide a significantly better result.

In Figure 4.31b, there is an ensemble of three models having the optimal complexity. It is apparent that the accuracy of the ensemble cannot be significantly better than that of the individual models. The negative result of the second experiment is therefore caused by the fact that the bias of optimal models cannot be further reduced. We can conclude that by using the simple ensemble, instead of a single GAME model, we can in some cases improve the accuracy of modeling. The accuracy improvement is not the only advantage of using ensembles. There is highly interesting information encoded in the
Table 4.4: Experiments with the Internet advertisements data set in the JavaNNS software.

Data set | RMSE | Yes [%] | No [%] | Correct [%] | Winning MLP topology
ica2.dat | 0.169746 | 68.14 | 57.30 | 61.51 | 2+4+2+1
ica5.dat | 0.079646 | 78.33 | 91.81 | 85.74 | 5+10+4+1
ica5.dat | 0.078200 | 80.83 | 92.98 | 87.97 | 5+3+2+1
ica7.dat | 0.071500 | 79.58 | 75.17 | 77.32 | 7+4+2+1
ica7.dat | 0.066800 | 77.46 | 79.87 | 78.69 | 7+15+5+1
ensemble behavior: the information about the credibility of the member models. Section 5.2.6 describes how this information can be extracted. The member models approximate the data similarly, and their behavior differs outside the areas where the system can be successfully modeled (insufficient data vectors present, etc.). In well defined areas, all models give a compromise response. We use this fact to evaluate the quality of models and to estimate the plausibility of modeling in particular areas of the input space. The applications of these techniques to real-world problems are discussed in the Chapter ?? of this report.
4.9 Benchmarking the GAME engine
In this section we compare the performance of the GAME engine to the most popular NN techniques and state of the art DM methods.
4.9.1 Internet advertisements
The Internet advertisements data set is available in the UCI archive. It has many records and many input features. To be able to experiment with this data set in the JavaNNS software [2], we used Independent Component Analysis (ICA) [43] to reduce the number of input features. Table 4.4 shows the accuracy of the classification (MLP in the JavaNNS software) on data where 2, 5 and 7 input features were extracted from several hundred original binary features by means of ICA. Of the several MLP topologies tested, the best results were obtained with the 5+3+2+1 topology (numbers of neurons in layers). The best advertisement classification (87.9% accuracy) was achieved for the data set with five input features. The same data set (ica5.dat) was used in our experiments with the GAME engine. The best classification accuracy (93.13%) was achieved when the All-2¹⁰ and All-fast-5¹¹ configurations were used.
¹⁰ All-2: all units enabled, ensemble of two models for both the Yes and No classes.
¹¹ All-fast-5: all units with an analytic gradient implemented were enabled, ensemble of five models for both the Yes and No classes.
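To make the figures in Tables 4.4 and 4.5 concrete, the sketch below shows one way the outputs of the per-class ensembles (a "Yes" model and a "No" model) can be combined into a single decision and scored as Yes[%], No[%] and Correct[%]. This is an illustration only: the combination rule (the ensemble with the stronger mean response wins) and all identifiers are assumptions of this sketch, not the GAME implementation.

public class TwoClassEnsembleEvaluator {

    /** Mean output of an ensemble of models for one input vector. */
    static double ensembleMean(double[] memberOutputs) {
        double sum = 0;
        for (double o : memberOutputs) sum += o;
        return sum / memberOutputs.length;
    }

    /** Returns true if the vector is classified as an advertisement ("Yes"). */
    static boolean classify(double[] yesOutputs, double[] noOutputs) {
        // the class whose ensemble responds more strongly wins (assumed rule)
        return ensembleMean(yesOutputs) >= ensembleMean(noOutputs);
    }

    /** yesOut[i], noOut[i] hold the ensemble member outputs for vector i;
     *  isAd[i] is the true label. Prints Yes[%], No[%] and Correct[%]. */
    static void evaluate(double[][] yesOut, double[][] noOut, boolean[] isAd) {
        int yesHit = 0, yesTotal = 0, noHit = 0, noTotal = 0;
        for (int i = 0; i < isAd.length; i++) {
            boolean predicted = classify(yesOut[i], noOut[i]);
            if (isAd[i]) { yesTotal++; if (predicted) yesHit++; }
            else         { noTotal++;  if (!predicted) noHit++; }
        }
        double yesPct = 100.0 * yesHit / yesTotal;
        double noPct = 100.0 * noHit / noTotal;
        double correct = 100.0 * (yesHit + noHit) / isAd.length;
        System.out.printf("Yes: %.2f%%  No: %.2f%%  Correct: %.2f%%%n", yesPct, noPct, correct);
    }

    public static void main(String[] args) {
        // tiny synthetic example: two models per class, four vectors
        double[][] yes = {{0.9, 0.8}, {0.2, 0.1}, {0.7, 0.6}, {0.3, 0.4}};
        double[][] no  = {{0.1, 0.2}, {0.9, 0.8}, {0.4, 0.3}, {0.6, 0.7}};
        boolean[] label = {true, false, true, false};
        evaluate(yes, no, label);
    }
}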
Table 4.5: Experiments with the Internet advertisements data set (ica5.dat) in the GAME software.

GAME config.   "Yes" model RMSE   "Yes" model accuracy   "No" model RMSE   "No" model accuracy   Correct
All-2          0.014689           92.44%                 0.015285          91.75%                93.13%
All-fast-5     0.014559           92.44%                 0.015915          92.10%                93.13%
Standard       0.015827           91.41%                 0.015218          92.10%                92.44%
Combi-5        0.016153           90.72%                 2473.229          91.41%                91.41%
Table 4.6: The classification accuracy [%] of the GAME engine on each fold of the Pima data set (for various configurations).

GAME cfg.    fold1   fold2   fold3   fold4   fold5   fold6   fold7   fold8   fold9   fold10   Avg
Std-1-tv     77.9    80.5    84.4    75.3    77.9    77.9    72.7    85.7    80.5    80.5     79.35
All-1-tv     77.9    81.8    80.5    80.5    75.3    71.4    76.6    75.3    84.4    84.4     78.83
Std-5-tv     77.9    80.5    80.5    81.8    74.0    74.0    76.6    74.0    84.4    81.8     78.57
Std-5-v      76.6    80.5    80.5    81.8    72.7    74.0    76.6    74.0    85.7    83.1     78.57
CombiR300    80.5    80.5    79.2    77.9    71.4    72.7    74.0    75.3    85.7    85.7     78.31
We also experimented with the optimal number of models in the GAME ensemble (see Figure D.14 in the appendix). The results indicate that 5 models in the ensemble is the optimum, but the experiment should be repeated several times to get more reliable results. On the Internet advertisements data set the GAME engine generates classifiers about 5% more accurate than the best classifiers from the JavaNNS software.
4.9.2
Pima Indians data set
The Pima Indians data set is a well-known data set used in many machine learning benchmarks. It can be obtained from the UCI repository. We used 10-fold cross-validation [52] to find out which configuration of the GAME engine generates more accurate classifiers. The accuracy of the GAME classifiers (the percentage of correctly classified examples) for each fold can be found in Table 4.6. The winning configuration is the default one of the GAME engine¹². It is followed by the All-1-tv configuration¹³. The three top configurations validated models on both the training and the testing set. The winning configurations do not use an ensemble of classifiers, just individual models. We compared our results to the results recently published in [82]. Table 4.7 shows that the GAME engine can classify the Pima Indians data set better than other state-of-the-art soft computing methods.
¹² Units LinearNeuron, PolySimpleNeuron, ExpNeuron, SigmNeuron, SinusNeuron, GaussianNeuron, MultiGaussian and GaussNeuron enabled.
¹³ All units enabled.
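For readers unfamiliar with the protocol, the sketch below outlines how per-fold accuracies such as those in Table 4.6 can be obtained with 10-fold cross-validation. The Classifier interface is a placeholder for any learner (here the GAME engine would be plugged in); the shuffling seed, the fold assignment and the absence of stratification are assumptions of this sketch, not a description of the exact procedure used in the thesis.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CrossValidation {

    /** Hypothetical learner interface; GAME would implement it. */
    interface Classifier {
        void train(List<double[]> inputs, List<Integer> labels);
        int predict(double[] input);
    }

    /** Returns the accuracy [%] on each of k folds. */
    static double[] kFoldAccuracy(List<double[]> inputs, List<Integer> labels,
                                  Classifier classifier, int k, long seed) {
        int n = inputs.size();
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) order.add(i);
        Collections.shuffle(order, new java.util.Random(seed));

        double[] accuracy = new double[k];
        for (int fold = 0; fold < k; fold++) {
            List<double[]> trainX = new ArrayList<>(); List<Integer> trainY = new ArrayList<>();
            List<double[]> testX = new ArrayList<>();  List<Integer> testY = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                int idx = order.get(i);
                if (i % k == fold) { testX.add(inputs.get(idx)); testY.add(labels.get(idx)); }
                else               { trainX.add(inputs.get(idx)); trainY.add(labels.get(idx)); }
            }
            classifier.train(trainX, trainY);
            int hits = 0;
            for (int i = 0; i < testX.size(); i++)
                if (classifier.predict(testX.get(i)) == testY.get(i)) hits++;
            accuracy[fold] = 100.0 * hits / testX.size();
        }
        return accuracy;
    }
}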
Table 4.7: The 10-fold cross-validation classification accuracy and the standard deviation for the Pima data set.

Pima data set    Bayes   MLP-BP   Lazy training   Soft propagation   GAME
Accuracy [%]     72.2    77.47    76.69           76.69              79.35
Std. dev. [%]    6.9     3.75     5.22            2.37               3.89
4.9.3
Spiral data benchmark
The Spiral data set [5] is frequently used for benchmarking machine learning methods. It describes a very difficult classification problem: two intertwined spirals are to be told apart. The work [30] shows that simple backpropagation MLP networks are not capable of solving this problem. More advanced neural networks such as the Cascade Correlation Network (CCN) [30] are able to solve it. We have found that the original MIA GMDH is not capable of solving this complex problem at all. The motivation of this section is to show that GAME models solve the Spiral data set problem successfully. Inductive methods traditionally use part of the training data set as the validation set (to decide which units have the best generalization ability). For this experiment, all training data were used to modify the weights and coefficients of GAME units, and the fitness of these units was computed from their performance on the training data. This decreases the generalization ability of the network, but in this case it is the only way to get one hundred percent classification accuracy (there are few training data for a complex problem).
Figure 4.32: Two GAME networks solving the intertwined spirals problem. The Figure 4.32 depicts two GAME networks solving the intertwined spirals problem. The
dark background signifies ”1” on the output of the network. All points making up one spiral are covered by the dark background, whereas all points on the second spiral are classified as ”0” (white background). The majority of units in the GAME networks are ”small perceptron networks optimised by the BP algorithm”. The average network has 12 layers with approximately four units in each layer. This signifies that solving the intertwined spirals problem is very difficult: a small perceptron network optimized by the BP algorithm is not able to solve this problem individually. GAME demonstrates the power of induction - cooperating networks interconnected in one GAME network are able to solve the Spiral problem successfully. When the complexity of the GAME network on this benchmarking problem is compared to the complexity of the Cascade Correlation Network, we must state that GAME networks are redundant. The CCN network can solve this problem with approx. 16 hidden sigmoid neurons. The solution is to implement the CCN network as a unit of the GAME network. This would reduce the complexity of the resulting GAME network when solving difficult problems. Most important is that GAME allows solving complex multivariate problems where CCN fails because of the curse of dimensionality.
4.10
Summary
In this chapter we listed results related to the core of our research - the GAME engine. We proposed and evaluated the usefulness of heterogeneous units in GAME models. A wide range of unit types is evolved and only units adapted to the nature of a data set survive. We derived and implemented an analytic gradient for several types of units. We showed that it dramatically reduces the number of error function evaluation calls and saves computational time. We also focused on various optimization methods and performed experiments showing that there is no universally superior method, but that the Quasi-Newton and Conjugate Gradient methods perform sufficiently well on a wide range of different problems. Experiments with regularization of GAME units showed that a modest penalization for complexity can prevent overfitting even on highly noisy data. Penalized models are simpler and can be interpreted by a human. We showed that the niching genetic algorithm is superior to the regular genetic algorithm when generating the topology of GAME models. The polynomial transfer functions of units are evolved, too. When combined with proper regularization, we can get really simple and accurate polynomial models. We showed that the simple ensemble of GAME models is seldom better than the best of the individual models. However, it has very stable behavior and its error is among the lowest. Therefore it is often worth using an ensemble instead of the individual model with the best performance on the training and validation data.
5 The FAKE interface and related results
In this chapter we present results related to the Fully Automated Knowledge Extraction (FAKE) interface. It is a long way from a raw data set to the knowledge that is encoded inside it. To extract the knowledge from data, several techniques are proposed and discussed in this chapter. Some limited knowledge can be extracted directly from raw data. More valuable knowledge can be accessed by means of sophisticated data mining techniques such as the GAME engine. Without data preprocessing, data mining is often impossible. If we aim at automating the knowledge extraction process, we need to automate the data preprocessing stage as well.
5.1
Automated data preprocessing
The results in this section are related to a very important part of the FAKE interface: the interface between raw data and the GAME engine (see Figure 3.1). We have not yet addressed many problems from this area; the research of automated data preprocessing techniques is at an early stage.
5.1.1
Imputing missing values
The majority of real-world data sets contain some missing values. To be able to process these data in the FAKE GAME environment, we need to deal with the missing values. We have implemented an application allowing us to impute missing values in data by the different techniques summarized below (a sketch of two of them follows the list).
• Leave out - Records containing missing values are deleted.
• Replace by zero - All missing values are replaced by zeros.
• Replace by mean - Missing values are replaced by the mean value of the corresponding variable (their column).
• Text match similarity - If the text of the non-missing values in the record matches another similar record, missing values are replaced by values from the similar record.
• Euclid distance similarity - Missing values are replaced by values from the most similar record, based on the Euclid distance.
• Dot product similarity - The same as the previous item, except that the distance is computed as a dot product of n-dimensional vectors (where n is the number of variables or columns).
The implemented techniques can solve the missing values problem automatically, without user involvement. To examine the properties of these techniques, we designed the following experiment. The Stock market prediction data set does not contain any missing values, so we artificially introduced missing values into it. Various mechanisms that produce missing values can be distinguished [62].
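The following is a minimal sketch of two of the listed techniques, "replace by mean" and "Euclid distance similarity". It assumes missing values are encoded as NaN and that only complete records serve as donors; the actual FAKE GAME module may differ in these details.

public class MissingValueImputer {

    /** Replaces every NaN by the mean of the non-missing values in its column. */
    static void replaceByMean(double[][] data) {
        int cols = data[0].length;
        for (int j = 0; j < cols; j++) {
            double sum = 0; int count = 0;
            for (double[] row : data)
                if (!Double.isNaN(row[j])) { sum += row[j]; count++; }
            double mean = count > 0 ? sum / count : 0.0;
            for (double[] row : data)
                if (Double.isNaN(row[j])) row[j] = mean;
        }
    }

    /** Replaces NaNs in each record by values from the most similar complete
     *  record, similarity being the Euclid distance over the known values. */
    static void replaceByNearestRecord(double[][] data) {
        for (double[] row : data) {
            double[] best = null; double bestDist = Double.MAX_VALUE;
            for (double[] other : data) {
                if (other == row || hasMissing(other)) continue;
                double dist = 0; int shared = 0;
                for (int j = 0; j < row.length; j++)
                    if (!Double.isNaN(row[j])) { double d = row[j] - other[j]; dist += d * d; shared++; }
                if (shared > 0 && dist < bestDist) { bestDist = dist; best = other; }
            }
            if (best != null)
                for (int j = 0; j < row.length; j++)
                    if (Double.isNaN(row[j])) row[j] = best[j];
        }
    }

    static boolean hasMissing(double[] row) {
        for (double v : row) if (Double.isNaN(v)) return true;
        return false;
    }
}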
[Figure 5.1 residue: RMS error axis (0.022 to 0.522; "Infinity - no data available"); legend: leave out, replace by zero, replace by mean, text match similarity, euclid distance neighbour, dot product neighbour; baseline: RMS error if trained on the complete data set; groups: 5%, 20%, 50% and 80% missing.]
Figure 5.1: The performance of the imputing methods on the Stock market prediction data set with different volumes of missing values.
We used the MCAR (Missing Completely At Random) mechanism to produce data sets with 5%, 20%, 50% and 80% of missing values in the Stock market prediction data. MCAR [62] means that the probability that a value is missing is independent of the values of any of the variables. After that, we used the above-listed imputing techniques to correct the data sets with the various degrees of missing values. Then we trained GAME ensembles on these corrected data sets. Finally, the error of the ensembles on the original Stock market prediction data is shown in Figure 5.1. The results signify that replacement by zero is not suitable for the Stock market prediction data; it is much better to replace by the mean value. The leave-out strategy is superior to the other methods up to 20% of missing values. The imputing based on the Euclid distance gives very promising results, specifically for high percentages of missing values in the data set. It is hard to draw general conclusions from the results of this small experiment; with a data set of different properties we would likely get different results. Our further work is to explore the relationship between the properties of a data set and a proper imputing method. We also plan to develop more sophisticated algorithms for missing values replacement.
5.1.2
Distribution transformation
The best distribution of data for data mining techniques should be the uniform one (as suggested in [80]). We proposed and implemented an algorithm that transforms data from an arbitrary distribution to a distribution close to the uniform one. The inverse transformation is also possible. The transformation of the distribution should be transparent to an expert user, because it changes the physical meaning of features. Therefore we use the transformation as depicted in Figure 5.2. The transformation algorithm is based on an artificial distribution function (see Figure 5.3). The design of the distribution function is inspired by [84], where receptive fields are used for local approximations of the function; summing them together using Gaussian weights produces the final approximation.
[Figure 5.2 diagram: inputs (both training and recall phases) pass through the Data Transformation Module (T) into the neural network (e.g. GAME), which is treated as a black box; during the recall phase the network outputs pass through the Inverse Transformation Module (IT).]
Figure 5.2: The principle of the distribution transformation.
[Figure 5.3 residue: legend - slope of lines, offset of lines, transformation function.]
Figure 5.3: The artificial distribution function is computed from a histogram of values in a data set.
5.1.2.1
The design of the transformation function
For each feature, we compute its own transformation function. The slope of the transformation function should at every point be equal to the probability density function. This function is unknown, but it can be approximated by the histogram of the processed data set. We divide the $\langle 0, 1\rangle$ range into $M$ bins. The empirical probability in each bin, $r_i = N_i / N$, is computed from the number $N_i$ of values of the feature hitting the $i$-th bin and the total number of data vectors $N$. Then we have to map the range of definition of $r_i$, $\langle 0, 1\rangle$, onto the $\langle 0, \infty)$ interval. This can be realized e.g. by the tangent function:
$$k_i = \tan\left(r_i \cdot \frac{\pi}{2}\right).$$
The $k_i$ is the slope of the transformation function in the center of the $i$-th bin. We can construct the whole transformation function as a weighted sum of lines representing the probability density of each bin. The only thing missing is the offset $o_i$ of these lines. We can compute it step by step
Figure 5.4: The distribution of the data set plotted in the left graph is displayed in Figure 5.3. The results of the experiment with this data set were not statistically significant - in other words, the transformation has no impact on the quality of models. The reason might be that this distribution is closer to a uniform one and the difference between the data sets is not important for the GAME engine.
assuming that $o_0 = 0$ and
$$o_i = \sum_{j=0}^{i-1} \left( o_j + k_j \cdot \frac{1}{M} \right).$$
Having the offsets $o_i$ and the slopes $k_i$ of the lines, we can finally sum them up to create the transformation function
$$f'(x) = \sum_{i=0}^{M-1} \left\{ (k_i \cdot x + o_i) \cdot H\!\left(x - \frac{i}{M}\right) \cdot H\!\left(\frac{i+1}{M} - x\right) \right\},$$
where $H$ is the Heaviside step function. The final transformation function must be normalized to the $\langle 0, 1\rangle$ range; therefore we divide it by its maximal value: $f(x) = \frac{1}{f'(1)} \cdot f'(x)$.
5.1.2.2
Experiments with artificial data sets
From the artificial data set in Figure 5.4 (left), we have synthesized the transformation function shown in Figure 5.3. You can see the slopes and offsets of the lines computed from the empirical probability density of the data set. The right plot in Figure 5.4 shows the original data set transformed using our function (Figure 5.3). Histograms of both the original and the transformed data sets are displayed in Figure 5.5. The distribution of the transformed data set is almost uniform. You can also observe minor disturbances of the spiral shapes in Figure 5.4 caused by the linear segmentation of the transformation function. We performed experiments to check whether the transformed data set with uniform distribution improves the quality (classification accuracy) of models. Several GAME ensembles were produced for the original data set and for the transformed data set and their classification accuracies were compared. Surprisingly, the difference was not significant. We repeated the experiments with a data set with a highly non-uniform distribution (see Figure 5.6). Here the GAME engine generated significantly more accurate models for the transformed data than for the original, highly non-uniform data¹.
Figure 5.5: The histogram of the original data set (left) and the histogram of the data set transformed from the original using the artificial distribution function (right). You can observe that the distribution of the transformed data set is almost uniform.
Figure 5.6: The transformation has a clearly positive influence on the results of GAME models: the classification accuracies of models trained on the original and on the transformed data are compared. The result is significant at the 98% confidence level.
[Figure 5.7 residue: legend - Day, Time, Rs, Rn, PAR, Tair, RH, u, SatVap, VapPress, Battery.]
Figure 5.7: The artificial distribution functions for the input features of the Mandarin data set.
Figure 5.6 shows that the mean classification accuracy over 30 GAME ensembles for the transformed data was about 4% higher than for ensembles generated on the original data.
5.1.2.3
Mandarin data set distribution transformation
The idea to apply the distribution transformation to the Mandarin data set was motivated by the highly nonlinear character of the distributions of some of its input features. In particular, the distributions of the features Rs, PAR and Rn (see Figure 5.7) are highly nonlinear. On the other hand, some input features have a uniform distribution (Day, Time) and their transformation functions in Figure 5.7 are the identity. Again, we generated GAME ensembles for the original and for the transformed Mandarin data set. We compared their RMS errors on testing data and the difference was not significant. To explore the properties of the transformation, we used the scatterplot matrix to project the original and the transformed data (see Figure 5.8). The reason why the transformation does not improve the performance of models on the Mandarin data might be the following. The data set contains measurements of mandarin tree water consumption. During the night, the tree does not consume any water and the input features related to solar radiation are also close to zero (PAR, Rs). By transforming these features towards the uniform distribution, we only amplify the noise around zero (Figure 5.8). Our future work is to experiment with real-world data sets containing clusters; such data sets seem to be more suitable for our approach to the transformation of distributions. We also work on smoothing the transformation function by using Gaussian kernels instead of Heaviside functions. Weak learners should benefit more from the distribution transformation, therefore we have scheduled some experiments in this direction. Improved performance of GAME ensemble models on the Mandarin data set could be achieved by a better selection of the training data set by means of an intelligent data reduction.
¹ Note that this sort of data transformation can hardly be achieved by standard transformation techniques such as Softmax scaling (there is more than one dense cluster in the data set).
Figure 5.8: The scatterplot matrix of the Mandarin data generated by the Sumatra TT software [4] before and after the transformation. Some disturbances caused by the linearization of the artificial distribution function can be observed.
5.1.3
Data reduction
A data set can be reduced in both dimensionality and size. The dimensionality reduction is important for several data mining methods. For example, a fully connected MLP network suffers from the ”curse of dimensionality” when too many input features and few data vectors are provided. To solve this problem, the input features can be reduced by a feature selection or a feature extraction algorithm. Feature selection algorithms look for the best subset of input features in order to preserve maximum information; other, less interesting features are removed from the data set. Feature extraction algorithms such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA) project the data to a low-dimensional space while preserving as much information as possible. The GAME engine selects the most interesting input features automatically, while ignoring irrelevant ones. The dimensionality reduction is therefore necessary only for data sets with hundreds of input features or more².
A different problem is the reduction of size in the case of large data sets. The number of vectors can be reduced by randomly selecting a subset of data vectors; this is, however, not the best approach. The selected data subset has to be representative, which means it should contain data vectors representing all possible system states. We have implemented a data filter in the FAKE GAME environment allowing us to choose a representative data subset. The main criterion is the value of the output variable in the record. A record is added to the subset with a probability inversely related to the empirical probability of its output value.
² We are working on a version of the GAME engine allowing us to process data sets with more than a thousand input features. For such large data sets it is necessary to increase the population size and the number of epochs of the genetic algorithm evolving GAME units. Some changes in the memory structures of the application also have to be made.
Later, the probability should also respect the differences in input features among the records. Unfortunately, the results of experiments with the data filter were not available at the time of writing this thesis.
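Below is a minimal sketch of such a representative sampling filter. It estimates the empirical probability of the output value from a histogram and accepts each record with a probability inversely related to it; the number of bins, the rescaling by a target fraction and all identifiers are assumptions of this illustration, not the FAKE GAME implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RepresentativeDataFilter {

    /** Selects a subset of rows; column outputColumn holds the output variable. */
    static List<double[]> reduce(List<double[]> data, int outputColumn,
                                 int bins, double targetFraction, long seed) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double[] row : data) {
            min = Math.min(min, row[outputColumn]);
            max = Math.max(max, row[outputColumn]);
        }
        int[] hist = new int[bins];
        for (double[] row : data) hist[binOf(row[outputColumn], min, max, bins)]++;

        Random random = new Random(seed);
        List<double[]> subset = new ArrayList<>();
        for (double[] row : data) {
            double p = (double) hist[binOf(row[outputColumn], min, max, bins)] / data.size();
            // acceptance probability ~ 1/p, rescaled so that for a uniformly
            // distributed output roughly targetFraction of the rows survive
            double accept = Math.min(1.0, targetFraction / (bins * p));
            if (random.nextDouble() < accept) subset.add(row);
        }
        return subset;
    }

    static int binOf(double value, double min, double max, int bins) {
        if (max == min) return 0;
        int bin = (int) ((value - min) / (max - min) * bins);
        return Math.min(bin, bins - 1);
    }
}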
5.2
Knowledge extraction and information visualization
This section summarizes results related to the most important part of the FAKE interface - the interface between the GAME engine and an expert user. We aim at extracting as much knowledge as possible in a comprehensible form.
5.2.1
Math formula extraction
Math formulas are often used to describe the behavior of a system. The knowledge can be decoded from math formulas by users with some mathematical background. The extraction
[Figure 5.9 diagram: inputs x1, x2, x3 feed units P1 (connected to x1, x2) and P2 (connected to x1, x3); unit P3 combines their outputs. Pi are units with a linear transfer function (e.g. LinearNeuron).]
$$y = P_3\big(P_2(x_1, x_3),\, P_1(x_1, x_2)\big) = a_{31}(a_{11}x_1 + a_{12}x_3 + a_{13}) + a_{32}(a_{21}x_1 + a_{22}x_2 + a_{23}) + a_{33} = (a_{31}a_{11} + a_{32}a_{21})\,x_1 + (a_{32}a_{22})\,x_2 + (a_{31}a_{12})\,x_3 + a_{31}a_{13} + a_{32}a_{23} + a_{33}$$
Figure 5.9: How to extract the math equation from the GAME model.
of math formulas from inductive models is depicted in Figure 5.9. For a model with linear units, a simple linear equation can be extracted. Some formulas are too complicated to be useful. The example below is a math formula extracted from the GAME classifier built on the Pima Indians data set.
diabetes=-8.598+9.375*sin(-0.901*-0.395*0.311*(0.353*e^(-(-0.381*0.850* e^(-(0.866*-18.815/(-4.898+ 3.378*age-4.236*age*age)-3.828+ 1.560*plasma_conc-1.778)^2/(0.640)^2)+-0.044-0.593*plasma_conc-1.159*diab_pedigree-0.337)^2/(0.866)^2)+-0.806)+ 0.706*(0.491+0.412*sin(-10.542*-0.310*0.850* e^(-(0.866*-18.815/(-4.898+ 3.378*age-4.236*age*age)-3.828+ 1.560*plasma_conc-1.778)^2/(0.640)^2)+-0.044-0.041*serum_insulin-0.258*diab_pedigree+ 0.200))-0.029*(triceps_thickness)-0.171*(serum_insulin)+ 0.022-0.075* BPNetwork(0.850*e^(-(0.866*-18.815/(-4.898+3.378*age-4.236*age*age)-3.828+ 1.560*plasma_conc-1.778)^2/(0.640)^2)+-0.044,diab_pedigree)-1.260)
The only information we can gain from this equation is that diabetes depends on the features age, plasma_conc, diab_pedigree and serum_insulin; the other features were not selected and are therefore less significant or irrelevant. We can also see that the classifier contains linear, polynomial, rational, sine, Gaussian and BPNetwork units. A short equation with a few complex units signifies that the decision boundary is not very complex. For the Spiral data set, the equation would be several pages long and would contain many complex units (e.g. BPNetwork).
Age = 0,782*0,676*12,622*PUSB+ 18,618+ 7,862*SSPIB2,991+ 0,68*14,256*SSPIC+ 30,76+ 10,975*Spain-23,805
Figure 5.10: The formula extracted from a GAME model on the Anthrop data set is not in optimal form (it can be simplified). When you copy/paste it into a cell of some Excel-like application, you can integrate it into your calculations.
The second equation was generated from the GAME model of the Age variable in the Anthrop data set. This data set is considerably noisy, so linear or simple polynomial models are of the optimal complexity. Figure 5.10 shows the simple polynomial equation extracted from the GAME network depicted there; this formula is quite easy to interpret. When a simple equation is needed, it is possible to use a strong penalization for complexity (e.g. the CombiR12 configuration) to get a simple model even for a complex problem. Of course, the accuracy of such a model will be proportional to its complexity. For complex problems it is better to extract knowledge by means of the visualization techniques introduced below.
5.2.2
Feature ranking
The knowledge of which input features play the most significant role in influencing the output variable is very valuable. In this thesis we propose three different feature ranking algorithms. The first technique uses information extracted from the inductive model: it counts how many units are connected to a particular feature. The ranking of the feature is subsequently derived taking into account the attributes of the connected units (e.g. the error of the units on the validation set). The algorithm is shown below in the form of pseudocode.
Feature ranking from inductive models

function compute_significance_of_features(InductiveModel model) {
    for (j) all features do
        feature_significance[j] = 0;
    layer_index = model.get_last_layer_index();
    while (layer_index >= 0) {
        layer = model.layer[layer_index];
        for (i) all units in layer do {
            unit = layer.unit[i];
            for (j) all features do
                if (feature[j].is_used_by(unit))
                    feature_significance[j] += unit.get_unit_score();
        }
        layer_index = layer_index - 1;
    }
}

The most important function, get_unit_score(), can be implemented in various ways. It should take into account the importance of the unit in the model. The score can, for example, be computed as (best error in the layer) / (unit's error on the validation set) + 1 / (result of the statistical correlation or mutual information test with other units in the layer) + ... We have implemented this algorithm in the FAKE GAME environment [8] and it has also already been implemented in the well-known software for inductive modeling, KnowledgeMiner [3].
5.2.2.1
Extracting significance of features from niching GA used in GAME
The second approach utilizes the information gained during the run of the niching genetic algorithm in the GAME engine. The significance of features can be extracted as follows. The number of individuals (units) in each niche is proportional to the significance of the feature the units are connected to. From each niche the fittest individual is selected and the construction goes on with the next layer. The fittest individuals in the next layers of the GAME network are those connected to features which bring the maximum of additional information. Individuals connected to features that are significant, but highly correlated with features already used, will not survive. By monitoring which individuals endured in the population we can estimate the significance of each feature for modeling the output variable. This can subsequently be used for the feature extraction. In each layer of the network, after the last epoch of the GA with DC, before the best individual from each niche is selected, we count how many individuals are connected to each feature. This number is accumulated for each feature and, when divided by the sum of the accumulated numbers for all features, we get the proportional significance of each feature. The ranking of features is extracted from their proportional significance. A more detailed description of this method can be found in [A.12]. As an example we apply the above-described method to the selection of features significant for the classification of dyslectic children. Figure 5.11 shows the number of units connected to particular features while evolving the first layer of the GAME network. At the beginning,
Figure 5.11: The significance of each feature is correlated with the number of units connected to this feature during the evolution process.
all features had on average the same number of units connected to them. After several epochs, irrelevant features were not used any more. The number of units connected to a feature was proportional to its significance. In the next layers of the GAME network, the proportional significance of features is updated as less significant features appear in the population of the niching genetic algorithm (see Figure 5.12). In this case the six most significant features (EE1, WE3, WW3, RS3, FV2, and FV3) for the classification of dyslectic children were selected. Using these features, the SOM clustering of patients was performed and the separation of dyslectic patients from the other subjects improved. For a detailed description of the problem see [A.7]. The proposed method of feature extraction takes into account three factors. The first is the significance of a feature for modeling the output variable. The second is the correlation of features (the distance of units in the niching GA is computed from their correlation). The third is the amount of information additional to the information carried by the already extracted features. This resembles state-of-the-art methods based on mutual information analysis; these methods extract the set of features with the highest mutual information with the output variable while minimizing the mutual information among the selected variables [78]. The third feature ranking algorithm is described in the section 5.2.8.5. We are comparing the performance of our methods with the state-of-the-art feature selection algorithm [100].
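A minimal sketch of the counting step described above is given below: the units of the last GA epoch in each layer are inspected, the number of units connected to each feature is accumulated over layers, and the counts are normalized into proportional significances. The Unit interface and all names are placeholders standing in for the corresponding GAME structures.

public class NichingFeatureSignificance {

    /** Hypothetical interface for an evolved GAME unit. */
    interface Unit {
        boolean usesFeature(int featureIndex);
    }

    private final double[] accumulated;

    public NichingFeatureSignificance(int featureCount) {
        accumulated = new double[featureCount];
    }

    /** Call once per layer with the population of the last GA epoch. */
    public void accumulateLayer(Unit[] population) {
        for (Unit unit : population)
            for (int f = 0; f < accumulated.length; f++)
                if (unit.usesFeature(f)) accumulated[f]++;
    }

    /** Proportional significance of each feature (entries sum to 1). */
    public double[] proportionalSignificance() {
        double total = 0;
        for (double a : accumulated) total += a;
        double[] significance = new double[accumulated.length];
        for (int f = 0; f < accumulated.length; f++)
            significance[f] = total > 0 ? accumulated[f] / total : 0;
        return significance;
    }
}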
5.2.3
Relationship of variables
Instead of trying to extract a math formula to get some useful knowledge from complex models, visualization techniques can be successfully utilized.
Figure 5.12: The significance of each feature developed during the construction of the GAME inductive model for dyslectic patients. In total 9 layers were used (axis x - features, axis y - percentage of overall connections for each feature, axis z - inductive model layers 1 to 9).
The GAME, GMDH and neural network models are often complex and multidimensional, which significantly complicates understanding the influence of input features on the output variable. Knowledge concerning the relationship of variables cannot be extracted from such models, nor from their math description. To overcome this problem, we visualize the behavior of models. By visualization of model responses we can access the information abstracted by the model from a data set. The easiest way to visualize how the model approximates the data
Figure 5.13: The IO relationship scatterplot produced by GAME model sensitivity analysis.
set is to change the values of the input variables and record the output of the network. When we vary just one input variable whereas the others stay constant, we can plot a curve (see Figure 5.13). The curve shows us the influence of the selected input variable on the output variable in the configuration specified by the other input variables. If we change the input configuration,
Figure 5.14: The projection of data vectors into the IO relationship plot.
the shape of the curve often changes too. If we vary two of the input variables whereas the others stay constant, we can plot a surface (see Figure 5.13). The surface represents the relationship between two input variables and the response of the model in the configuration defined by the constant inputs. More precisely, the curve in Figure 5.13 expresses the relationship between the input variable x2 and the dependent output variable y for the configuration x1 = X1, x3 = X3. To be able to see how models approximate the training data, we need to find a projection of the training vectors into the input-output relationship plot. Each data vector $\vec{V} = ([A1, A2, A3]; Y)$ consists of two parts: the first is the input vector and the second is the target output for this input vector. We plot crosses representing the data vectors into the graph 5.14. The position of the cross [A2, Y] is given by the value of the vector in the dimension of intersection (x2) and by the target value Y. We compute the Euclid distance of the input part of each data vector $\vec{V}$ from the axis of the input space intersection. The size of the cross that represents the data vector is inversely related to its distance from the axis of intersection:
$$Size = \frac{1}{\max(H, Dist)}, \quad H > 0, \qquad Dist = \sqrt{(A1 - X1)^2 + (A3 - X3)^2},$$
where
H is a small number limiting the maximal cross size,
Size is the size of the cross in the graph,
Dist is the Euclid distance of the vector from the axis of intersection,
A1, ..., A3 are the values of the data vector in the input dimensions,
X1, ..., X3 are the constant values of the model inputs (the input configuration).
The upper edge of the curve in the right graph of Figure 5.14 shows the responses of the model for input vectors located on the axis of intersection. The thickness of the curve represents the density of data vectors in the input space: the more vectors are present in the defined neighborhood, the higher the density is.
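The sketch below illustrates how the data for such a plot can be produced: one input is swept over its range while the other inputs are held at the chosen configuration, and each training vector is assigned a cross size according to the formula above. The Function type stands in for any trained model (e.g. a GAME model); the sampling density and all identifiers are assumptions of this illustration.

import java.util.function.Function;

public class RelationshipPlot {

    /** Samples the model response while varying input 'sweptIndex' over <0,1>. */
    static double[][] sensitivityCurve(Function<double[], Double> model,
                                       double[] fixedInputs, int sweptIndex, int samples) {
        double[][] curve = new double[samples][2];
        for (int s = 0; s < samples; s++) {
            double x = (double) s / (samples - 1);     // samples >= 2 assumed
            double[] input = fixedInputs.clone();
            input[sweptIndex] = x;
            curve[s][0] = x;
            curve[s][1] = model.apply(input);
        }
        return curve;
    }

    /** Cross size for a data vector: inversely related to its Euclid distance
     *  from the axis of intersection (all inputs except the swept one). */
    static double crossSize(double[] dataInputs, double[] fixedInputs,
                            int sweptIndex, double h) {
        double dist = 0;
        for (int i = 0; i < dataInputs.length; i++)
            if (i != sweptIndex) {
                double d = dataInputs[i] - fixedInputs[i];
                dist += d * d;
            }
        dist = Math.sqrt(dist);
        return 1.0 / Math.max(h, dist);
    }
}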
Figure 5.15: With data vectors displayed, quality of models can be evaluated.
You can assess the quality of the model simply by looking at the IO relationship graph and checking how it approximates the training or testing vectors. In Figure 5.15, you can study the relationship of variables of an artificial problem defined as $y = \frac{1}{2}\sinh(x_1 - x_2) + x_1^2 (x_2 - 0.5)$, where the training data vectors were distributed uniformly in the area $x_1, x_2 \in \langle 0, 1\rangle$. An ensemble of 9 models was generated. Notice that their responses differ in areas where training data are absent. This property will later in this thesis be utilized for credibility estimation and also as a criterion in the search for interesting behavior of models.
5.2.3.1
Relationships in the Dyslexia data set
As an example we demonstrate relationship plots for models trained on the Dyslexia data set. The full data set (26 input variables, 3 output states corresponding to 3 classes, 49 vectors corresponding to 49 patients) was used to build 3 ensembles of linear inductive models. The first ensemble classified healthy subjects, the second patients suffering from a reading dysfunction and the third patients with dyslexia. In Figure 5.16 there is a relationship plot for the dyslexia output variable and the most significant input feature - the reading speed. This inductive model apparently overfitted the data: there are some dyslectic patients reading faster than patients with a reading dysfunction. Therefore we decided to use just linear units in the inductive models. The classification accuracy dropped significantly, but we were able to study the influence of the input variables on the output. Figure 5.17 shows that, according to all linear inductive models, a growing reading speed increases the probability of the patient being healthy. We showed that using relationship plots we can extract knowledge from GAME classifiers. However, these plots are more suitable for regression data sets.
Figure 5.16: Nonlinear inductive model (majority of sigmoidal units) - some healthy patients have lower reading speed than dyslectic patients.
Figure 5.17: The group of linear inductive models. Lines are the responses of models, crosses data vectors mapped into the scatterplot.
5.2.3.2
Relationships in the Building data set
On the Building data set we evolved an ensemble of GAME models for each output (wbcw, wbhw, wbe). For the wbcw output variable (cold water consumption), the left plot in Figure 5.18
Figure 5.18: Relationship of temperature and solar radiation with the cold water consumption (left) and of solar radiation with the energy consumption variable (right).
shows the relationship of wbcw with the TEMP (temperature of the air) and SOLAR (solar radiation) input features for low humidity and medium wind strength. In these conditions, increasing temperature clearly leads to growing cold water consumption. The change in solar radiation does not affect the output variable at all; this input feature can therefore be considered irrelevant (for low humidity and medium wind strength). For models of wbe the noise level is extremely high (see Figure 5.18). It is partly caused by the fact that the information about the time of measurement was excluded from the data set. As you can see, the models properly generalized the problem and avoided overfitting. In the appendix D.6, you can see that this technique is also applicable to data sets with discrete features, although the best performance can be expected for continuous features.
5.2.4
Boundaries of classes
The GAME method is very versatile, because the generated models adapt to the character of the modeled system. In the previous section we gave a short description of GAME models used as regression models. The water consumption variable that is modeled is continuous and its relationships with the input variables can easily be expressed in polynomial form; that is the reason why many of the surviving units in these GAME models have a polynomial transfer function. On the other hand, modeling the membership to a class involves a binary encoding of the output variable (e.g. 1 = member, 0 = non-member). A sigmoid function is more suitable for classification tasks, because it can simply divide two groups with a decision boundary. This is the reason why the majority of successful units in GAME membership models (classifiers) are of the sigmoid type. This further complicates the extraction of math formulas from a GAME model (a formula with nested sigmoid functions is not comprehensible and cannot be simplified). We developed a visualization technique for GAME membership modeling. It is just slightly different from the one used for regression models in the previous section. Data are projected
Figure 5.19: The Pima Indians Diabetes data set [5] - crosses represent healthy/patient subjects, the dark background signifies membership to the class ”diabetics” modeled by the GAME network.
into a two-dimensional plane as crosses or rectangles whose color indicates the membership to a particular class. The size of a rectangle is inversely proportional to the distance of the corresponding vector from the plane of intersection in the input space. To indicate the responses of the inductive models, we added a background color to the scatterplot planes. If the model works well, then big rectangles of a particular color should have the same color in their background; a background of a different color under a big rectangle signifies a misclassified vector. Figure 5.19 shows how a GAME model separates one class (diabetics) from another (healthy). We can clearly see the decision boundary of the model. Visual knowledge mining as well as quality evaluation of the model is straightforward.
5.2.4.1
Classification boundaries and regression plots in 3D
We have extended the visualization techniques described above into the third dimension. One extra degree of freedom can be utilized, so we can study the relationship of the output variable with two (regression) or three (classification) input features. We implemented all 3D visualization modules in 3D Java, so the user can turn, zoom and shift objects in the 3D space. This allows a better investigation of decision boundaries and regression manifolds. Figure 5.20 shows the decision boundaries of two models of the classes Versicolor and Virginica of the Iris data set. You can study the behavior of these models in 3 dimensions (petal length, petal width and sepal width). Regression manifolds of models on the Iris data set and the Boston data set are shown in Figure 5.21. The class Setoza is linearly separable from the other two, as visible in the projection. With increasing crime and distance from big employment centers, the value of housing in Boston increases (Figure 5.21 right).
Figure 5.20: Visualization of 3D manifolds representing decision boundaries of a GAME model on the Iris data set. Left plot shows the decision boundary of the Virginica class. Right plot shows the decision boundary of the Versicolor class. The tube character of the boundary indicates that the sepal width feature is not very significant.
Figure 5.21: Left plot shows behavior of GAME model of the Setoza class from the Iris data set. You can see the decision boundary dividing member records of the Setoza class from records of Virginica and Versicolor classes. Right plot displays the relationship of housing value, crime and distance to employment centers as modeled by the GAME network trained on the Boston data set.
5.2.5
GAME classifiers in the scatterplot matrix
The scatterplot matrix [12] is a popular technique for multivariate data visualisation. Data are projected into several two-dimensional graphs (the axes are all pair-wise combinations of input features). Again, crosses represent data vectors and their color signifies the membership to a particular class. The GAME model is trained to separate the member vectors of its class from the other vectors. The dark background signifies areas where the output of the GAME model is close to ”1” - all vectors in these areas are classified as members of the modeled class.
Figure 5.22: The scatterplot matrix showing the GAME network modeling the membership to the class ”im” (the Ecoli data set).
Figure 5.22 shows one GAME model of the ”im” class in the Ecoli data set [5]. By looking at the scatterplot matrix, we can decide which scatterplot best separates the classes (axes alm2, aac) and choose the more detailed graph.
5.2.6
Credibility estimation of GAME models
The traditional techniques for the credibility estimation of models (e.g. a testing set, cross-validation) are not sufficient to evaluate the quality of inductive models. Inductive models often deal with irrelevant features and short, noisy data sets, and it is hard to estimate their credibility. The main disadvantage of a black-box model is that it generates a random output for patterns it has not been taught on. It is hard to determine whether a configuration of inputs lies within the region where the model was taught properly, or outside of this area. Especially for black-box models with irrelevant or redundant inputs it is unacceptable to state that the model is invalid whenever the values of its inputs are far from the training data. Therefore we developed a technique allowing us to estimate the credibility of GAME models
for any configuration of inputs. Let us have an ensemble of GAME models for a single output variable. These models disagree where they were not taught properly, giving a compromise response in areas well defined by the training data. Random behavior of models can also be expected far from the areas of data presence. The dispersion of the responses of GAME models can therefore give us an estimate of the models’ credibility for any configuration of input features. The following experiments helped us to explore the relationship between the dispersion of models’ responses in the GAME ensemble and the credibility of the models. Given a training data set $L$ and a testing data set $T$, suppose that $(x_1, x_2, ..., x_m, y)$ is a single testing vector from $T$, where $x_1, ..., x_m$ are the input values and $y$ is the corresponding output value. Let $G$ be an ensemble of $n$ GAME models evolved on $L$ using the Bagging technique [111]. When we apply the values $x_1, x_2, ..., x_m$ to the inputs of each model, we receive the models’ outputs $y'_1, y'_2, ..., y'_n$. Ideally, all responses would match the required output ($y'_1 = y'_2 = ... = y'_n = y$). This can hold just for certain areas of the input space that are well described by $L$, and just for data without noise. In most cases the models’ outputs will differ from the ideal value $y$.
Figure 5.23: Responses of GAME models for a testing vector lying in an area insufficiently described by the training data set.
Figure 5.23 illustrates a case when a testing vector lies in an area insufficiently defined by $L$. The responses of the GAME models $y'_1, y'_2, ..., y'_n$ differ significantly. The mean of the responses is defined as $\mu = \frac{1}{n}\sum_{i=1}^{n} y'_i$, the distance of the $i$-th model from the mean is $dy'_i = \mu - y'_i$, and the distance of the mean response from the required output is $\rho = \mu - y$ (see Figure 5.23). We observed that there may be a relationship between $dy'_1, dy'_2, ..., dy'_n$ and $\rho$. If we could express this relationship, we would be able to compute for any input vector not just the estimate of the output value ($\mu$), but also the interval $\langle \mu - \rho', \mu + \rho' \rangle$ in which the real output should lie with a certain significant probability.
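The sketch below computes these quantities for one input vector and turns the dispersion of the ensemble responses into an interval estimate. It uses the quadratic bound derived for real-world data in Section 5.2.6.2 (the linear variant found for noise-free artificial data is analogous); the coefficient aMax and all identifiers are assumptions of this illustration.

public class EnsembleCredibility {

    /** mu = (1/n) * sum of the member outputs for one input vector. */
    static double mean(double[] memberOutputs) {
        double sum = 0;
        for (double y : memberOutputs) sum += y;
        return sum / memberOutputs.length;
    }

    /** rhoBound = (aMax / n) * sum of (dy'_i)^2, the estimated half-width of the
     *  interval in which the real output should lie. */
    static double rhoBound(double[] memberOutputs, double aMax) {
        double mu = mean(memberOutputs);
        double sum = 0;
        for (double y : memberOutputs) {
            double dy = mu - y;
            sum += dy * dy;
        }
        return aMax * sum / memberOutputs.length;
    }

    /** Returns {mu - rhoBound, mu + rhoBound} for one input vector. */
    static double[] credibilityInterval(double[] memberOutputs, double aMax) {
        double mu = mean(memberOutputs);
        double rho = rhoBound(memberOutputs, aMax);
        return new double[] { mu - rho, mu + rho };
    }
}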
5.2.6.1
Credibility estimation - artificial data
We designed an artificial data set to explore this relationship by means of inductive modeling. We generated 14 random training vectors $(x_1, x_2, y)$ in the range $x_1, x_2 \in \langle 0.1, 0.9\rangle$ and 200 testing vectors in the range $x_1, x_2 \in \langle 0, 1\rangle$, with $y = \frac{1}{2}\sinh(x_1 - x_2) + x_1^2 (x_2 - 0.5)$. Then, using the training data and the Bagging scheme, we evolved $n$ inductive models $G$ by the GAME method. The density of the training data in the input space is low, therefore there are several testing vectors far from the training vectors. For these vectors the responses of the GAME models differ considerably (similarly to the situation depicted in Figure 5.23). For each testing vector, we
Figure 5.24: The dependence of ρ on dy'_i is linear for artificial data without noise.
Figure 5.25: The dependence of ρ on dy'_i is quadratic for real-world data.
computed $dy'_1, dy'_2, ..., dy'_n$ and $\rho$. This data ($x_1 = dy'_1, ..., x_n = dy'_n$, $y = \rho$) we used to train a GAME model $D$ to explore the relationship between $dy'_i$ and $\rho$. In Figure 5.24 there are responses of the model $D$ for input vectors lying on the dimension axes of the input space. Each curve expresses the sensitivity of $\rho$ to the change of one particular input whereas the other inputs are zero. We can see that with a growing deviation of the model $G_i$ from the required value $y$, the estimate of $\rho$ given by the model $D$ increases with a linear trend. There exists a coefficient $a_{max}$ that limits the maximal slope of the linear dependence of $\rho$ on $dy'_i$. If we present an input vector³ to the models from $G$, we can approximately limit (upper bound) the maximal deviation of their mean response from the real output as $\rho \le \frac{a_{max}}{n} \sum_{i=1}^{n} |dy'_i|$.
5.2.6.2
Credibility estimation - real world data
We repeated the same experiment with a real-world data set (the mandarin tree water consumption data - see Appendix B.3 for the description). Again, a GAME model $D$ was trained on the distances of the GAME models from $\mu$ on the testing data (2500 vectors). Figure 5.25 shows the sensitivity of $D$ to the deviation of each GAME model from the mean response. Contrary to the artificial data set, the dependence of $\rho$ on $dy'_i$ is quadratic. We can approximately limit the maximal error of the models’ mean response for this real-world data set as $\rho \le \frac{a_{max}}{n} \sum_{i=1}^{n} (dy'_i)^2$, so the credibility of the models is inversely proportional to the size of this interval.
³ We show the relationship between $dy'_i$ and $\rho$ just on the dimension axes of the input space (Fig. 5.24), but the linear relationship was observed for the whole input space (the inputs are independent); therefore the derived equation can be considered valid for any input vector.
Figure 5.26: GAME models of the cold water (left) and hot water (right) consumption variable. Left: models are not credible for low temperatures of the air; with increasing temperature, the consumption of cold water grows. Right: when it is too hot outside, the consumption of hot water drops down. Nothing else is clear - we need more data or have to include more relevant inputs.
5.2.6.3
Uncertainty signaling for visual knowledge mining
To demonstrate the credibility estimation on a real-world problem, we have chosen the Building data set again. On this data set we evolved an ensemble of GAME models for each output (wbcw, wbhw, wbe). Figure 5.26 left shows the relationship of the cold water consumption variable to the temperature outside the building under the conditions given by the other weather variables. The relationship of the variables can be clearly seen within the area of the models’ compromise response. We have insufficient information for modeling in the areas where the models’ responses differ. By the thickness of the dark background (in the y direction) we signal the uncertainty of the models’ response for particular values of the inputs. It is computed according to the equation derived above for real-world data as the interval
$$\left\langle\, y_{wbc} - \frac{1}{n}\sum_{i=1}^{n}(dy'_i)^2,\;\; y_{wbc} + \frac{1}{n}\sum_{i=1}^{n}(dy'_i)^2 \,\right\rangle.$$
The real value of the output variable should lie in this interval with a significant degree of probability. In Figure 5.26 right we show that the level of uncertainty for wbhw is significantly higher than for wbcw. For these specific conditions (values of humid, solar, wind) our GAME models are credible just in the thin area where the consumption of hot water drops down. We presented a method for the credibility estimation of GAME models. The proposed approach can also be used for many similar algorithms generating black-box models. The main problem of black-box models is that the user does not know when one can trust the model. The proposed techniques allow estimating the credibility of the model’s response, reducing the risk of misinterpretation of the model outputs.
5.2.7
Credibility of GAME classifiers
Techniques similar to those for the regression models can be applied to GAME classifiers. Again, a group of adaptive models is evolved for a single output variable (the membership to a particular class). Each model should have ”1” on its output when it classifies patterns of its member
Figure 5.27: When we multiply responses of several GAME networks modeling the same class, we get the membership area just for those configurations of inputs, where the output of all models is ”1”. class, and ”0” when the input vector belongs to another class. For regions far from training vectors, the output of the model is random. But the random output is usually close to ”1” or ”0”. This behavior has the following reason. When GAME networks are used to classify data into membership classes (output ”1” or ”0”), units with sigmoid transfer function outweigh other types of units (e.g. polynomial) so GAME networks are mainly formed by sigmoidal units. It affects their behavior in regions far from training vectors. Far from the decision boundary the sigmoid function is either close to ”1” or ”0” and so are outputs of networks. Consider the data about apples and pears. If we evolve the ensemble of GAME classifiers for the apple class, their outputs are ”1” for objects similar to apples, ”0” for these similar to pears. For object that is different from both apple and pear, each model from the group can give different output. Some can classify it as an apple (”1”); some can respond it is not apple (”0”). When outputs of all models are multiplied, the result is ”1” just for objects classified as an apple by the whole group. This simple idea is extended bellow to filter out artefacts and unimportant information from the classification by GAME models. The Iris data set [5] is often used to test the performance of classifiers. Iris plants are to be classified into three classes (Setoza, Virginica and Versicolor) given measurements of their sepal width, length and petal width, length. We evolved three ensembles of GAME models - one ensemble for each class. Figure 5.27 shows three models from each ensemble. Dark background signifies ”1” on the output of the model, light background the output ”0”. When these three models are multiplied for each class, the results can be observed in scatterplots
Figure 5.28: Three groups of twelve GAME models for the classes Iris Setoza, Virginica and Versicolor. When all models are displayed in one scatterplot (left), the regions of class membership overlap. The right scatterplot shows the proposed improvement, where the outputs of the models within one group are first multiplied and then the result for each class is displayed.
Figure 5.29: The same plots as in the case of the Iris data set in the previous picture. Ten GAME models classifying the Advertising data set are multiplied. The behavior of the resulting classifier is sensitive to anomalies in the behavior of individual models; a majority vote instead of multiplication might be a better approach to combining the individual models.
of the fourth column. Each resulting scatterplot classifies as members of the class just the plants similar to those present in the training set. In Figure 5.28, you can see how the proposed method improved the classification. The outputs of twelve GAME models for each class are displayed in one scatterplot (left); especially plants more distant from those present in the training data are classified as members of several classes. When the twelve models for each class were multiplied first and then the results for the three classes were displayed in one scatterplot (right), the boundaries of the membership areas become clearly visible. The second example shows how this technique works on the Advertising data set (Figure 5.29). The multiplication of models should probably be changed to a majority operator; it would reduce the disturbances caused by the misbehavior of a few individual models. We are going to experiment with the properties of the multiplication and the majority operators on several real-world data sets.
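A minimal sketch contrasting the two combination rules is given below: the product of the membership outputs (strict agreement of the whole group) versus a simple majority vote, which is more robust to one misbehaving model. The 0.5 threshold and the example numbers are illustrative assumptions.

public class MembershipCombiner {

    /** Product of model outputs: close to 1 only where all models agree on membership. */
    static double multiply(double[] modelOutputs) {
        double product = 1.0;
        for (double y : modelOutputs) product *= y;
        return product;
    }

    /** Majority vote: fraction of models whose output exceeds the threshold. */
    static double majority(double[] modelOutputs, double threshold) {
        int votes = 0;
        for (double y : modelOutputs)
            if (y > threshold) votes++;
        return (double) votes / modelOutputs.length;
    }

    public static void main(String[] args) {
        double[] outputs = {0.95, 0.90, 0.10, 0.92};  // one model misbehaves
        System.out.println("product  = " + multiply(outputs));       // about 0.08
        System.out.println("majority = " + majority(outputs, 0.5));  // 0.75
    }
}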
5.2.8
The search for interesting behavior
A very useful property of a neural network model is that the relationship of the input and output variables can be plotted, revealing some potentially interesting information about the modeled system. However, this approach is not used often, because several problems appear on a closer look. First, there is the problem of the ”curse of dimensionality”; second, the problem of model credibility arises when the system state space is not fully covered by the training data. There are also problems with irrelevant input variables, with the time needed to find some useful plot in a multidimensional state space, etc. In this section we show that all these problems can be successfully overcome using modern techniques of evolutionary computation and ensemble modeling. The result of our research is an application that is able to automatically locate interesting plots of system behavior.
5.2.8.1
Ensembling: what do we mean by ”interesting behavior”?
Real-world data usually do not cover the whole state space of the features (we do not have measurements for all possible combinations of input variables). Features are seldom independent and the correlated data are normally distributed around some cluster centers; the rest of the input state space is empty. When we use such training data to build a model, the responses of the model will be random in areas where training data are absent. The problem is that we cannot simply find the border of model credibility: if we limited credibility just to the areas of training data presence, we could not deal with irrelevant features. Therefore we use ensemble techniques to find out the areas where the models are credible and their behavior is ”interesting”.
Figure 5.30: Interesting behavior of models.

Figure 5.30 shows how we define the term "interesting". Each curve represents the output of one ensemble model y_i when a feature x_i is changed. The more rapid the change we observe (y_size → max), the more interesting the displayed behavior. The second criterion for the importance of a projection is the credibility of the models. We need to look for rapid changes of the output variable only within the areas where the models are credible. This can be achieved by a simple assumption: the random (different) output of ensemble
models signifies that we are outside the area of credible behavior. Equal outputs mean that all models converged to a single (credible) value based on the training data. The second criterion can therefore be computed from the dispersion of the ensemble model outputs; the envelope p should be minimized:

p = \sum_{j=x_{start}}^{x_{start}+x_{size}} \Bigl( \max_i \, y_i(j) - \min_i \, y_i(j) \Bigr)

where y_i(j) is the output of the i-th ensemble model at the j-th sampling point of the examined interval.
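Both criteria can be sketched in a few lines of Java. The responses matrix, the sampling window and the use of the mean ensemble output for y_size are illustrative assumptions that follow the formula above; this is not the actual FAKE GAME implementation.

public final class InterestingnessCriteria {

    // responses[i][j] = output y_i(j) of the i-th ensemble model at the j-th
    // sampling point along the examined input feature.

    // Envelope p: summed dispersion of the ensemble outputs inside the window;
    // a small value means the models agree, i.e. the region is credible.
    static double envelope(double[][] responses, int xStart, int xSize) {
        double p = 0.0;
        for (int j = xStart; j <= xStart + xSize; j++) {
            double max = Double.NEGATIVE_INFINITY;
            double min = Double.POSITIVE_INFINITY;
            for (double[] model : responses) {
                max = Math.max(max, model[j]);
                min = Math.min(min, model[j]);
            }
            p += max - min;
        }
        return p;
    }

    // ySize: how much the mean ensemble output changes inside the window;
    // a large value indicates potentially interesting behavior.
    static double ySize(double[][] responses, int xStart, int xSize) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (int j = xStart; j <= xStart + xSize; j++) {
            double mean = 0.0;
            for (double[] model : responses) {
                mean += model[j];
            }
            mean /= responses.length;
            max = Math.max(max, mean);
            min = Math.min(min, mean);
        }
        return max - min;
    }
}

A projection is then ranked as interesting when ySize is large and the envelope p is small within the same window.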
Table B.1: Features of the Boston housing data set.

Feature   Description
CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centers
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes in $1000's
B Data sets used in this thesis

B.1 Building data set
The "Building data set" is frequently used for benchmarking modeling methods [79]. It consists of more than 3000 measurements taken inside a building (hot water consumption (wbhw), cold water consumption (wbc), energy consumption (wbe)) under specific weather conditions outside (temperature (temp), humidity of the air (humid), solar radiation (solar), wind strength (wind)). We excluded the information about the time of measurement and transformed this data set from the time-series domain to the regression domain.
B.2 Boston data set
The Boston housing data set was taken from the StatLib library, which is maintained at Carnegie Mellon University. It concerns housing values in suburbs of Boston. The number of instances in the data set is 506. There are 13 continuous input features and one discrete output variable, "MEDV".
B.3 Mandarin data set
The Mandarin tree data set (provided by Hort Research, New Zealand) describes the water consumption of a mandarin tree.
The mandarin tree is a complex system influenced by many input variables (water, temperature, sunshine, humidity of the air, etc.). Our data set consists of measurements of these input variables and one output variable: the water consumption of the tree. It describes how much water the tree needs under specific conditions. We used 2500 training vectors (11 input variables, 1 output variable).
B.4 Dyslexia data set
The eye movements of 49 female subjects were recorded using the iView 3.0 video-oculography system at the Department of Neurology, 2nd Medical Faculty, Charles University, Czech Republic. The sampling frequency of the camera was 100 Hz. 22 subjects were healthy, 18 subjects suffered from a reading dysfunction and 9 subjects were dyslexic children. The average group age was 11 years; the age variance was 0.5 years. The subjects were stimulated by two non-verbal stimuli and one verbal stimulus. The non-verbal stimuli consisted of two images with different graphical patterns. The first image was a grid of blue dots (stimulus number 1); the subject was asked to inspect the image dot by dot. The second image was composed of a grid of faces; the subject was asked to count all smiling faces (stimulus number 2). The verbal stimulus was a Czech text to read (stimulus number 3).
B.5 Antro data set
The data¹ represent a set of observations of the skeletal indicators studied for the proposal of methods of age-at-death assessment from the human skeleton (see [87]). They are the results of visual scoring of the morphological changes of features on two pelvic joint surfaces, defined and described by a text accompanied by photos. The material consists of 955 subjects from 9 human skeletal series of subjects of known age and sex. These collections (populations) are dispersed over 4 continents (Europe, North America, Africa, Asia). The age at death of the individuals varies between 19 and 100 years. Three features are scored on the pubic symphysis in the pelvis: (A) posterior plate (PUSA), scored in three phases (1-2-3); (B) anterior plate (PUSB), observed in three phases (1-2-3); (C) posterior lip (PUSC), scored in two phases (1-2). Four features on the sacro-pelvic surface of the ilium were observed: (A) transverse organisation (SSPIA), evaluated in two phases (1-2); (B) modification of the articular surface (SSPIB), scored in four phases (1-2-3-4); (C) modification of the apex (SSPIC), observed in two phases (1-2); (D) modification of the iliac tuberosity (SSPID), estimated in two phases (1-2).
¹ Thanks to Dr. Jaroslav Bruzek, who provided us with this data set and with many valuable comments on our results.
B.6 UCI data sets
The other data sets used in this thesis come from the UCI Machine Learning Repository [5]. Detailed descriptions of the Pima Indian Diabetes data set, the Iris data set, the Ecoli data set, etc. can be found in [5].
C The FAKE GAME environment

The FAKE GAME environment is still under development; do not expect a "single green button" application (this is our vision). The core of the environment is the GAME engine, which offers many configuration options. These options were necessary during the development of the engine; in future versions they will be available in expert mode only. Many components of the system are developed and tested independently (the 3D visualization engine, the search for interesting plots, data preprocessing modules, etc.). They will be released in later versions of the FAKE GAME environment. Below we shortly describe how to work with the core of the environment, the GAME engine. We also mention the visual knowledge extraction support in the environment.
C.1 The GAME engine - running the application
The GAME engine is implemented in the Java programming language. To run the application, a Java virtual machine must be installed on your computer. You can download the Java runtime environment (JRE) from the Sun website for free. To run the application, type

java -Xmx512M -Xms128M -jar autogen.jar

in the directory you have unpacked the archive to.
C.1.1 Loading and saving data, models
Select File → Load raw data from the menu of the application window to import a data set in the following format (the name of an output starts with !):

sepal_l  sepal_w  petal_l  petal_w  !Setosa  !Versicolor  !Virginica
5.1      3.5      1.4      0.2      1        0            0
4.9      3        1.4      0.2      1        0            0
4.7      3.2      1.3      0.2      1        0            0
4.6      3.1      1.5      0.2      1        0            0
5        3.6      1.4      0.2      1        0            0
5.4      3.9      1.7      0.4      1        0            0
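A data set can be exported to this raw format with a few lines of Java; the space-separated layout, the file name and the sample rows below are assumptions for illustration, not part of the FAKE GAME code base.

import java.io.PrintWriter;

public final class RawDataExport {
    public static void main(String[] args) throws Exception {
        String[] inputs  = {"sepal_l", "sepal_w", "petal_l", "petal_w"};
        String[] outputs = {"!Setosa", "!Versicolor", "!Virginica"};   // output names start with '!'
        double[][] rows = {
            {5.1, 3.5, 1.4, 0.2, 1, 0, 0},
            {4.9, 3.0, 1.4, 0.2, 1, 0, 0}
        };
        try (PrintWriter out = new PrintWriter("iris_raw.txt")) {
            // header line: input names first, then the 1-of-N encoded outputs
            out.println(String.join(" ", inputs) + " " + String.join(" ", outputs));
            for (double[] row : rows) {
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) {
                        line.append(' ');
                    }
                    line.append(row[i]);
                }
                out.println(line);
            }
        }
    }
}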
Select File → Load to import a data set (data with models) in the following format:

Input_factor_name   type        max   min   med
sepal_length        continuous  7.9   4.3   0.0
sepal_width         continuous  4.4   2.0   0.0
petal_length        continuous  6.9   1.0   0.0
petal_width         continuous  2.5   0.1   0.0

Attribute_name      polarity    significance  max   min
Iris-setosa         positive    0             1.0   0.0
Iris-versicolor     positive    0             1.0   0.0
Iris-virginica      positive    0             1.0   0.0

Group_name   input_factors_values   output_attributes_values
             5.1 3.5 1.4 0.2        1.0 0.0 0.0
             ...                    ...
The advantage of this format is that you can specify the minimum and maximum value of a particular input or output. For the previous format, both values are computed from the data. It is possible that some vectors in the data you plan to use have values out of this range; in such a case, you need to use this data format.
Select File → Save to save a data set. When File → Save responses is enabled, the responses of models are appended to the data vectors. Models are serialized to an independent file with the extension ".net".
Select File → Save training and testing set to split your data set randomly into a training part (file "training") and a testing part (file "testing"). Then you can load the training file, build models, load the testing file and save the responses of the models on the testing set.
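The random split into a training and a testing part can be mimicked outside the GUI by a short sketch; the 70/30 ratio, the method name and the List<double[]> representation of the data vectors are assumptions for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class TrainTestSplit {

    // Shuffle the data vectors and copy the first part into 'training'
    // and the rest into 'testing' (e.g. trainFraction = 0.7).
    static void split(List<double[]> data, double trainFraction, long seed,
                      List<double[]> training, List<double[]> testing) {
        List<double[]> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(trainFraction * shuffled.size());
        training.addAll(shuffled.subList(0, cut));
        testing.addAll(shuffled.subList(cut, shuffled.size()));
    }
}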
C.1.2 How to build models
Select File → Create → Single GAME model to build one GAME model for each output variable selected in the right panel of the application. The item File → Create → Repeat generating GMDH allows you to build an ensemble of GAME models for each selected output. The Repeat generating option is much faster than single model generation (the graphics are reduced). Prefer this option unless you need to examine the process of GAME model evolution.
C.2 Units so far implemented in the GAME engine
LinearNeuron – unit with a simple linear transfer function; coefficients can be estimated by any training method (very fast, accurate)
LinearGJNeuron – unit with a simple linear transfer function; coefficients are estimated by the Gauss-Jordan method (extremely fast, inaccurate)
CombiNeuron – unit with a polynomial transfer function that is evolved during the run of the GA (for full functionality it needs all other units to be disabled, so that the evolution of the chromosomes encoding its transfer function is not disturbed); coefficients can be estimated by any training method (fast in the beginning, slowing with the increasing complexity of the transfer function; accurate only if all other units are disabled, otherwise it won't survive)
PolyHornerNeuron – unit with a simple polynomial transfer function computed by the Horner scheme; coefficients can be estimated by any training method (fast, accuracy limited by the simplicity of the transfer function)
PolySimpleNeuron – unit with a randomly generated polynomial transfer function whose complexity depends on the number of the layer; coefficients can be estimated by any training method (fast, accurate)
PolySimpleGJNeuron – unit with a randomly generated polynomial transfer function whose complexity depends on the number of the layer; coefficients are estimated by a simplified Gauss-Jordan method (extremely fast, inaccurate – survives just by accident :-))
PolySimpleNRNeuron – unit with a randomly generated polynomial transfer function whose complexity depends on the number of the layer; coefficients can be estimated by any training method; training can be stopped early to prevent overtraining – GL5 criterion (fast, accurate)
ExpNeuron – unit with an exponential transfer function; coefficients can be estimated by any training method (fast, accurate)
SigmNeuron – unit with a logistic transfer function (sigmoid); coefficients can be estimated by any training method (fast, accurate – especially for classification problems)
SinusNeuron – unit with a sine transfer function; coefficients can be estimated by any training method (fast, accurate)
GaussianNeuron – unit with a Gaussian transfer function (see Equation 4.5); coefficients can be estimated by any training method (fast, accurate)
GaussNeuron – unit with a Gaussian transfer function (see Equation 4.6); coefficients can be estimated by any training method (fast, accurate)
MultiGaussianNeuron – unit with a Gaussian transfer function (see Equation 4.7); coefficients can be estimated by any training method (fast, accurate)
GaussPDFNeuron – unit with a Gaussian transfer function (Gaussian conditional probability); coefficients can be estimated by any training method (slow, accurate)
PolyFractNeuron – unit with a randomly generated rational transfer function whose complexity depends on the number of the layer; coefficients can be estimated by any training method (slow learning, very accurate)
BPNetwork – unit with a small MLP network trained by the backpropagation algorithm; the topology of the networks is generated randomly and can be configured – by regulating the number of hidden neurons or the number of training epochs, a compromise between slowness and accuracy can be found (extremely slow, very accurate especially for complex data – can solve the two intertwined spirals problem)
BPNRNetwork – the same as the BPNetwork unit; training can be stopped early to prevent overtraining – GL5 criterion (extremely slow, very accurate)
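As an illustration of what such a unit computes, the following sketch implements a logistic (sigmoid) transfer function corresponding to the SigmNeuron. The GameUnit interface and the field names are hypothetical; the real GAME class hierarchy differs.

interface GameUnit {
    double output(double[] inputs);
}

final class SigmUnit implements GameUnit {
    private final double[] weights;   // one coefficient per connected input
    private final double bias;        // coefficients are estimated by a training method

    SigmUnit(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    @Override
    public double output(double[] inputs) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return 1.0 / (1.0 + Math.exp(-sum));   // logistic transfer function
    }
}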
C.3 Optimization methods in the GAME engine
QuasiNewtonTrainer (Quasi-Newton method) – a gradient method that can be even faster if explicitly supplied with the gradient and the Hessian matrix of the error surface (very fast, accurate) – translated to Java from Fortran by Steve Verrill
SADETrainer (SADE genetic method) – a genetic method mixed with Tabu search (the concept of radiation fields) to increase the search radius, preventing it from getting stuck in local optima (moderate speed, accurate) – translated to Java by Jan Drchal
PSOTrainer (Particle Swarm Optimization) – a swarm optimization method inspired by birds; needs some parameter tuning (slow, moderate accuracy)
HGAPSOTrainer (Hybrid of GA and PSO) – a genetic method mixed with swarm optimization – for a few generations chromosomes fly as birds, then they are mutated and crossed over, and so on (moderate speed, accurate)
PALDifferentialEvolutionTrainer (Differential Evolution version 1) – a genetic method using special crossover and mutation; each offspring has four parents. This version was implemented by the PAL Development Core Team and is distributed under the GPL (moderate speed, moderate accuracy)
DifferentialEvolutionTrainer (Differential Evolution version 2) – a genetic method using special crossover and mutation; each offspring has four parents (moderate speed, accurate)
StochasticOSearchTrainer (Stochastic Orthogonal Search) – an orthogonal search with dimensions selected randomly. This version was implemented by the PAL Development Core Team and is distributed under the GPL (moderate speed, accurate)
OrthogonalSearchTrainer (Orthogonal Search) – a simple iterative search method that does not use the gradient; it performs the search in the directions of the dimensions (one by one). This version was implemented by the PAL Development Core Team and is distributed under the GNU GPL (moderate speed, accurate)
ConjugateGradientTrainer (Conjugate Gradient method) – an iterative method with good convergence properties. This version was implemented by the WEKA development team and is distributed under the GNU GPL (moderate speed, accurate)
ACOTrainer (Ant Colony Optimization) – this nature-inspired method was extended for continuous problems (moderate speed, moderate accuracy)
CACOTrainer (Continuous Ant Colony Optimization) – this nature-inspired method was extended for continuous problems (moderate speed, moderate accuracy)
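The idea behind the simplest of these trainers, the orthogonal (coordinate) search, can be sketched as follows. This is not the PAL implementation used in GAME, just a minimal illustration with an arbitrary error function and a shrinking step.

import java.util.function.ToDoubleFunction;

public final class OrthogonalSearchSketch {

    // Minimize error(x) by probing one coordinate at a time with a step
    // that shrinks after every sweep over all dimensions.
    static double[] search(ToDoubleFunction<double[]> error, double[] start,
                           double step, int sweeps) {
        double[] x = start.clone();
        for (int s = 0; s < sweeps; s++) {
            for (int d = 0; d < x.length; d++) {
                double bestValue = x[d];
                double bestError = error.applyAsDouble(x);
                for (double candidate : new double[]{x[d] - step, x[d] + step}) {
                    x[d] = candidate;
                    double e = error.applyAsDouble(x);
                    if (e < bestError) {
                        bestError = e;
                        bestValue = candidate;
                    }
                }
                x[d] = bestValue;       // keep the best value found for this dimension
            }
            step *= 0.7;                // refine the search in later sweeps
        }
        return x;
    }
}

A trainer for a unit would call search with the unit's error on the training data as the error function and the unit's coefficients as the search vector.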
C.4 Configuration options of the GAME engine
The GAME engine is still under development and it offers several options that can be configured. In this section we describe some options that can be accessed by selecting Options → Configure GAME from the main menu of the GAME application. The configuration window will appear; it contains several tabs.
The first tab is called "Complexity". The meaning of the abbreviations is the following:
Populat. size - the size of the population in each layer (how many chromosomes (units) are evolved within one mating pool by the genetic algorithm)
Max.surv.units - the maximal number of units that are selected to survive in the layer after the GA finishes
Epochs - the number of epochs of the genetic algorithm in each layer of the GAME network
The next tab in the configuration window is the "Unit types" tab (see Figure C.1). Its purpose is to allow selecting the units that will be used during the construction of the GAME network. You can enable or disable units with a certain type of transfer function (linear, polynomial, perceptron networks and others). Disabled units won't be generated into the population of the GA during the construction of the GAME network. To select an individual type of unit, choose the proper transfer function in the tree of configuration options. To configure a unit, you have to
Figure C.1: The configuration of units in the GAME engine.
expand the tree of configuration options and click on the name of the unit. The types of units that have been implemented so far are listed in Section C.2.
In the next tab, you can enable or disable the training methods listed in Section C.3. The enabled methods are used to estimate the coefficients of units that support training by "any method".
In the "Evolution" tab it is possible to configure the parameters of the genetic algorithm that optimizes units in the layers of the GAME network. If Deterministic Crowding is enabled, the distance of two units needs to be computed - either from the difference of their inputs (first checkbox), or from the correlation of the units' responses on the training data (second checkbox), or from a combination of both. Diversity is derived from the distance of units; a threshold of minimal diversity can be set to prevent the population from becoming uniform. When two units are crossed, their offspring can be of the same type as the parents, or purely random. This allows evolving just the successful types of units, but some amount of randomness is recommended. If the importance of the individuals' distance is set to zero, the algorithm works faster - especially the computation of the units' correlation is time-consuming. The more important the distance, the more diverse units with lower accuracy are preferred.
It is possible to set the sizes of the training and the validation set in the "Data" configuration tab. The training set is used to estimate the coefficients/weights of units. The validation set is used to compute the fitness of units when evolving them. The default size of the validation set is one third of the training data. The noisier the data you process, the bigger the proportion of data you should use as the validation set. The percentage of the data set used for learning can be decreased to make the computation faster - but remember that the accuracy of the resulting model can drop dramatically when just a fraction of the data is used for training.
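The correlation-based distance used by Deterministic Crowding (second checkbox in the "Evolution" tab) can be sketched as follows; using 1 - |r| of the Pearson correlation of the two units' responses is an illustrative choice, not necessarily the exact formula used in GAME.

public final class UnitDistance {

    // Distance of two units derived from the correlation of their responses
    // on the training data: strongly correlated units are considered "close",
    // so crowding lets diverse units survive.
    static double responseDistance(double[] responsesA, double[] responsesB) {
        int n = responsesA.length;
        double meanA = 0.0, meanB = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += responsesA[i];
            meanB += responsesB[i];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < n; i++) {
            double da = responsesA[i] - meanA;
            double db = responsesB[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double r = cov / Math.sqrt(varA * varB + 1e-12);   // Pearson correlation
        return 1.0 - Math.abs(r);                          // 0 = identical behavior
    }
}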
If the validate on training set, too checkbox is enabled, the fitness of units is computed on both the training and the validation set (this can sometimes lead to overtraining, but it is useful if just a small sample of data is available). Bootstrap sampling is designed to introduce diversity into the ensemble of models, sometimes giving a more accurate ensemble output.
In the "Connection" tab, you can set whether the number of inputs to a unit should be limited by the number of the layer. GAME networks consisting of units with an unlimited number of inputs can be more accurate than networks where the number of inputs grows with the number of the layer. On the other hand, in this case the genetic algorithm needs far more epochs to evolve a proper topology (a far bigger search space).
The "Others" tab allows you to configure whether layers of the GAME network will be added regardless of how big the increase in accuracy is. Sometimes you can disable the normalization of variables - units with linear and polynomial transfer functions are able to handle data that are not normalized. You can then use the Model equation button to serialize the model into a string and use it directly in a spreadsheet processor (Excel).
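The bootstrap sampling mentioned above is a standard technique; a minimal sketch (with an assumed List<double[]> representation of the training vectors) looks like this:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class Bootstrap {

    // Draw a training sample of the same size with replacement, so that each
    // ensemble member is trained on a slightly different view of the data.
    static List<double[]> sample(List<double[]> data, Random rnd) {
        List<double[]> sample = new ArrayList<>(data.size());
        for (int i = 0; i < data.size(); i++) {
            sample.add(data.get(rnd.nextInt(data.size())));
        }
        return sample;
    }
}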
C.5 Visual knowledge extraction support in GAME
There are several visualization modules accessible via the Graph menu item. For optimal graph properties, tune their parameters in Options → Graph Properties. Visualization modules can be simply added to the FAKE GAME environment. The modules implemented so far are summarized below.

Module                 Functionality                                        Dim.       Models
G2D                    IO relationship plots                                2          single
G2Dmulti               IO relationship plots                                2          multiple
Graph3D                Simple fast projection of model's output             3          single
Cut3D                  IO relationship regression plots                     3          multiple
Clasification3d        Decision manifold visualization                      3          single
Starplot               Behavior of model in the neighborhood                N          single
Clasification2D        Decision boundaries visualization                    2          single
ClasificationMulti2D   Decision boundaries visualization                    2          multiple
ScatterplotMatrix      Matrix of decision boundaries                        N^2/2 x 2  single
GA                     Visualization of the search for interesting plots    -          multiple
The additional configuration options are the following. With Graph → Combine models for clasif. enabled, the responses of the models of the same class are multiplied; the responses in uncertain areas drop to zero. The result is similar to using neural networks with local units (RBF), but with immunity to the curse of dimensionality and to irrelevant inputs. Select Graph → Visible response areas → Models responses accuracy to show the estimate of uncertainty, signified by a dark background. You can change the values of the inputs and study the input-output relationship under conditions given by the values of all input features except the one studied.
D Results of additional experiments
Figure D.1: The behavior of a GAME model consisting of Gaussian units almost resembles fractals.
[Plot: magnitude (dB) and phase (degrees) against normalized angular frequency (×π rad/sample).]
Figure D.2: Characteristic response of the FIR filter.
[Plot: magnitude (dB) and phase (degrees) against normalized angular frequency (×π rad/sample).]
Figure D.3: Characteristic response of the GAME network filter - noise is better inhibited in regions.
[Table of the percentage of surviving units by unit type (sinus 1-12, gauss, linear, polynomial, perceptron, rational, exponential, sigmoid) on the Ecoli data and the Motol-questionary data.]
Figure D.4: The percentage of surviving units in the GAME network according to their type. This experiment was performed to find the best form of the transfer function in the SinNeuron unit.
[Table of the percentage of surviving units by unit type (sinus 1-12, gauss, linear, polynomial, perceptron, rational, exponential, sigmoid) on the Mandarin data and the Iris data (classes Setosa, Versicolor and Virginica).]
Figure D.5: The percentage of surviving units in the GAME network according to their type. This experiment was performed to find the best form of the transfer function in the SinNeuron unit.
Figure D.6: Relationship of age at death and features obtained from the skeleton. Higher values of the SSPIB feature signify older people. The SSPIA feature is considered irrelevant. Values of the PUSC feature increase with growing age, but for older people the relationship is not clearly defined.
Figure D.7: The classification of the Spiral data by the GAME model evolved with all units enabled.
[Bar chart: RMS error on the Boston data set for individual unit types (training and testing set).]
Figure D.8: The performance comparison of GAME units on the Boston data set.
[Bar chart: classification accuracy on the Ecoli data set for individual unit types (training and testing set).]
Figure D.9: The performance comparison of GAME units on the Ecoli data set.
[Bar chart: RMS error on the Mandarin data set for individual unit types (training and testing set).]
Figure D.10: The performance comparison of GAME units on the Mandarin data set.
[Bar chart: performance on the Iris data set for individual unit types (training and testing set).]
Figure D.11: The performance comparison of GAME units on the Iris data set.
[Bar chart: RMS error on the Mandarin data set for individual learning methods (training and testing set).]
Figure D.12: The performance comparison of learning methods on the Mandarin data set.
[Charts: RMS error on the Antro training and testing data sets for Simple and Weighted ensembles of Combi units under various penalization strengths.]
Figure D.13: Regularization of the Combi units by various strengths of a penalization for complexity. Compare this figure to Figure 4.14, which displays the maximum, mean and minimum errors of the individual GAME models. When these models are combined into an ensemble, the overfitting is reduced (more for the Simple ensemble than for the Weighted ensemble).
[Bar chart: performance of GAME ensembles with different numbers of member models on the Advertising data set (training and testing).]
Figure D.14: The optimal number of models in the ensemble is 5, according to the results on the Advertising data set. Surprisingly, the ensemble with 10 member models overfitted the data. This experiment should be repeated several times to get reliable results, but we currently lack the computational resources.
Figure D.15: The configuration window of the CombiNeuron unit.