A Hybrid Learning System with a Hierarchical Architecture for Pattern Classification

Don Ajantha Sanjeewa Atukorale, B.Sc. (Hons)

School of Computer Science and Electrical Engineering University of Queensland, St. Lucia, Q4072

Australia

A thesis submitted for the degree of Doctor of Philosophy. August 28, 2001 (Revised November 2002)

Statement of Originality

To the best of my knowledge and belief, this thesis is original and my own work except as acknowledged in the text. Parts of this thesis have been published in the papers listed in the publications section. This thesis has not been submitted, in whole or in part, for a degree at this or any other university.

Don Ajantha Sanjeewa Atukorale, August 28, 2001 (Revised November, 2002).


Abstract

This thesis addresses the use of hybrid learning systems that combine supervised and unsupervised learning methods for pattern classification. Most neural network research applied to pattern recognition has been focussed on supervised learning, and network models like the MLP provide an efficient method to design an arbitrarily complex non-linear classifier. However, there are some problem domains that are not solved in a satisfactory way by means of a single classifier. When the abstraction level of the classification task increases, the shape of the decision regions can become very complex, requiring impossibly large amounts of training data to form the class boundaries. This problem can be alleviated by using unsupervised learning techniques to reduce the number of degrees of freedom in the data. Hybrid learning systems which combine supervised and unsupervised learning methods have been very popular in this regard.

This thesis introduces a novel hybrid system with a hierarchical architecture which is based on the neural gas (NG) algorithm for pattern recognition problems. The NG algorithm in the proposed learning system uses a much faster variation of the original NG algorithm, obtained by reducing the time complexity of its sequential implementation. The computationally expensive part of the adaptation step of the original NG algorithm is the determination of the neighborhood ranking. This requires an explicit ordering of all distances between the reference vectors and the input pattern, and this has time complexity $O(N \log N)$. This problem is addressed here by introducing an implicit ranking method which reduces the time complexity to $O(N)$.

The proposed learning system generates multiple classifications for every data pattern presented, and these are registered as "confidence values". The most suitable functional form for calculating confidence values was determined empirically, and it smoothly assigns confidence values from 1 to 0. These confidence values allow the system to employ a variety of classifier fusion techniques to combine individual classifications to produce the predicted class for a pattern. Four different classifier combination techniques were used in the comparisons. It was shown that combining a network performance measure with confidence values by means of the fuzzy integral leads to the best classification performance. The performance of the proposed system was compared with that of other techniques on three well-known benchmark data sets, and promising results were obtained. Finally, it was shown that the boosting algorithm can be applied to a learning system that uses mixed supervised/unsupervised methods. The boosted learning system gave improved results over those obtained without boosting.


To my wife, Ganga.


Acknowledgments

A number of people have contributed in various ways to make my research work and this thesis possible. Firstly, I am deeply grateful to my friend, Dr. John Ginger, who paved the way for me to enter the University of Queensland, which I will never forget. I am thankful to Dr. Sri Parameswaran who helped me at the start to organize my life at the University of Queensland.

I would like to thank my first supervisor Dr. P.N. Suganthan who gave me the initial motivation and the guidance I needed in the early stages of my candidature. It has been a great privilege to work with my second supervisor, Prof. Tom Downs, for the past two years. I have benefited enormously from his support, advice and guidance. I owe him special thanks for helping me to achieve my goal on time.

It was a pleasure to work with David Chen, Dr. Marcus Gallagher, Kristedjo Kurnianto, Dr. Hugo Navone, Peter Stratton, Ian Wood and Jean Zhu, and I would like to thank them for their help in numerous ways as well. And I am also thankful to Len Allen, Kate Anderson, Tricia Howarth, Dan Johnston, Helen Lakidis, Tracy Miller, Joan Moss, Helen Purchase and Kathleen Williamson for their help in various ways. My work has benefited from discussions with Dr. Peter Bartlett and communications with Gunnar Raetsch, especially regarding the implementation issues of the Boosting algorithm.

My research has been supported by a scholarship from the School of Computer Science and Electrical Engineering and an International Postgraduate Research Scholarship (IPRS) from DETYA.

Finally, I would like to thank my parents and friends, for their support, and my wife Ganga, for her encouragement and for being a part of my life at all times. Thank you everybody!

List of Publications

- Atukorale A. S., Suganthan P. N., Downs T. Combining Supervised and Unsupervised Learning using Neural Gas Networks and the Fuzzy Integral. IEEE Transactions on Neural Networks. (Submitted)

- Atukorale A. S., Suganthan P. N. Numeral Recognition Based on Hierarchical Overlapped Networks. In Murshed N., editor, Cognitive and Neural Models for Word Recognition and Document Processing, World Scientific Press. (to appear)

- Atukorale A. S., Downs T. Using Labeled and Unlabeled Data for Training. In Proceedings of the Sixth International Conference on Pattern Recognition and Information Processing (ICPRIP-2001), volume 1, pages 195-199, Minsk, Belarus, May 2001.

- Atukorale A. S., Suganthan P. N. Hierarchical Overlapped Neural-Gas Network with Application to Pattern Classification. Neurocomputing, volume 35, numbers 1-4, pages 165-176, Elsevier Science, November 2000.

- Atukorale A. S., Suganthan P. N., Downs T. On the Performance of the HONG Network for Pattern Classification. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'2000), volume 2, pages 285-290, Como, Italy, July 2000.

- Atukorale A. S., Suganthan P. N. Multiple HONG Network Fusion by Fuzzy Integral. In Proceedings of the Sixth International Conference on Neural Information Processing (ICONIP'99), volume 2, pages 718-723, Perth, Australia, November 1999.

- Atukorale A. S., Suganthan P. N. Combining Classifiers Based on Confidence Values. In Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR'99), pages 37-40, Bangalore, India, September 1999.

- Atukorale A. S., Suganthan P. N. An Efficient Neural Gas Network for Classification. In Proceedings of the International Conference on Control, Automation, Robotics and Vision (ICARCV'98), pages 1152-1156, Singapore, December 1998.


Contents

Statement of Originality
Abstract
Acknowledgments
List of Publications
List of Figures
List of Tables
Abbreviations
List of Symbols

1 Introduction
  1.1 Motivation of the Project
    1.1.1 Pattern Classification
    1.1.2 Artificial Neural Networks
  1.2 Research Problem and Methodology
  1.3 Organization of the Thesis
  1.4 Original Contributions

2 Unsupervised Learning & Self-Organizing Nets
  2.1 Overview of Artificial Neural Networks
  2.2 The Learning Process
  2.3 Brief Historical Review of ANNs
  2.4 Self Organization Process
    2.4.1 Kernel Methods and Competitive Learning
  2.5 Self-Organizing Feature Maps
    2.5.1 Kohonen's Self-Organizing Feature Map
    2.5.2 Competition Stage
    2.5.3 Cooperation Stage
    2.5.4 Problems Associated with the Basic SOM Algorithm
  2.6 Variants of the Basic SOM Algorithm
  2.7 Neural Gas Algorithm
    2.7.1 Properties of the NG Algorithm
    2.7.2 Time Complexity of the Original NG Algorithm
  2.8 Extensions to NG Algorithm
    2.8.1 Partial Sorting
    2.8.2 Implicit Ranking Method
    2.8.3 Speed-up Procedures

3 Combining Multiple Classifiers
  3.1 Introduction
  3.2 Main Approaches for Combining Classifiers
    3.2.1 Modular-based Approach
    3.2.2 Ensemble-based Approach
  3.3 Specific Examples of Ensemble-based Classifiers
    3.3.1 Simple Voting
    3.3.2 Data Sampling and Re-sampling
    3.3.3 Fuzzy Integrals

4 Combining Supervised/Unsupervised Learning
  4.1 Introduction
  4.2 Hierarchical Overlapped Architecture
  4.3 Combining Overlapped NG Networks
    4.3.1 Simple Methods of Combining ONG Networks
    4.3.2 Combining ONG Networks using Fuzzy Integral
    4.3.3 Determining the Value of $\lambda$
  4.4 Data Sets Used in the Experiments
    4.4.1 UCI SatImage Data Set
    4.4.2 UCI Letter Data Set
    4.4.3 NIST Handwritten Numeral Data Set
  4.5 Experimental Results
  4.6 Medium Sized Data Sets
    4.6.1 Performance on UCI SatImage Data
    4.6.2 Performance on UCI Letters Data
  4.7 Large Sized Data Set
    4.7.1 Feature Extraction Methods on NIST SD3 Database
    4.7.2 Performance on NIST SD3 Data
  4.8 Number of Overlaps in HONG
  4.9 Using Labeled and Unlabeled Data
  4.10 Multiple HONG Networks

5 Boosting the HONG Network
  5.1 Introduction
  5.2 The AdaBoost Algorithm
  5.3 AdaBoost in the HONG Architecture
  5.4 Some Practical Issues Concerning Boosting
  5.5 Experimental Results
  5.6 Boosting Labeled and Unlabeled Data

6 Conclusion
  6.1 Summary and Conclusion
  6.2 Suggestions for Future Extensions

Bibliography

List of Figures

1.1 A pattern classification system.
2.1 A taxonomy of the learning process.
2.2 The self organizing map.
2.3 The basic SOM algorithm.
2.4 Behavior of a decay function as the parameter $t$ reaches its maximum $t_{max}$.
2.5 The basic NG algorithm.
2.6 Behavior of the neighborhood function $h_\lambda(k_i) = \exp(-k_i/\lambda)$ against the rank index $k_i$ and the neighborhood shrinkage rate $\lambda$.
2.7 Here $d_i$ denotes distortions from a set in a six unit (i.e., $N = 6$) network. Rank index $k_i$ is obtained by a complete sorting of the set and the implicit rank index $r_i$ is obtained by using equation (2.6).
2.8 The two phases of the ranking time of the NG algorithm.
3.1 A mixture-of-experts architecture. An example of a modular-based approach to combining classifiers.
3.2 An example of an ensemble-based approach to combining classifiers.
3.3 The Bagging algorithm.
3.4 A simple example to demonstrate the use of the fuzzy integral in multiple network fusion.
3.5 Calculation of the fuzzy integral for Class 1 and Class 2 respectively for the example shown in Figure 3.4.
4.1 Hierarchical overlapped architecture showing three overlapped units, A, B and C, from the base NG network being expanded to the second level.
4.2 How the training patterns are selected for training the second-level ONG networks. Here only one ONG network (A') is shown.
4.3 Variation of confidence values with parameter S. The horizontal axis represents the distance $d_i$, and the vertical axis represents the corresponding confidence values.
4.4 Confidence values produced by the three overlapped networks (in a HONG network) corresponding to the winner and the first two runner-up networks.
4.5 Variation of the fuzzy integral as $\lambda$ is varied for Example 2.
4.6 Pseudo code for the calculation of the fuzzy integral in the HONG network.
4.7 Pseudo code for the HONG network.
4.8 Multiple HONG-base classifier configuration.
5.1 The AdaBoost.M1 algorithm.
5.2 The AdaBoost.M2 algorithm.
5.3 One round of the boosted HONG network, in which the HONG network is trained on a set re-sampled from the full training set according to its current distribution.
5.4 One round of the boosted HONG network using labeled and unlabeled data, in which the HONG network is trained on a set re-sampled from the labeled training set together with the unlabeled training set.
6.1 An example of the HONG network with 3 levels. The full set of ONG networks is not shown in this diagram.

List of Tables

3.1 Hypothetical runs of the boosting algorithm. Assume that pattern number 1 is an outlier and is hard for the learning algorithm to classify correctly. In each round the re-sampled training set consists of patterns from the original training set, but the hard patterns (in this case pattern 1) occur more often in the later rounds.
4.1 Degree of importance values and the fuzzy integral values of Class 1 and Class 2 for various $\lambda$ values.
4.2 Distribution of digits into partitions in the SD3 database.
4.3 Partitions of the SD3 database used in my experiments.
4.4 Parameters used on the three data sets with the HONG network. Here $M$ denotes the number of training patterns in the data set. See text for the number of units used.
4.5 Test error rates on the UCI SatImage dataset by other algorithms compared with the HONG. The first 4,435 patterns are used for training and the last 2,000 patterns are used for testing. The column "Boosting" shows the test error rate using the boosting algorithm.
4.6 Test error rates on the UCI Letters dataset by other algorithms compared with the HONG. The first 16,000 patterns are used for training and the remaining 4,000 patterns are used for testing. Note that Michie et al. [95] used 15,000 patterns for training and 5,000 patterns for testing, a slightly different partitioning of the data than that in the UCI database.
4.7 Results of the base neural gas and the HONG network combined with the fuzzy integral for the NIST SD3 database. The three feature extraction methods (GF, PF, CF) are explained in Section 4.7.1.
4.8 Test error rate for different HONG networks with varying training and test overlaps for the SatImage data set.
4.9 Test error rate for different HONG networks with varying training and test overlaps for the Letters data set.
4.10 Average test error rates on the UCI SatImage and Letters data sets for different proportions of labeled training patterns.
5.1 Modifications needed for the boosting algorithm to operate on a logarithmic scale.
5.2 Test error rates on the UCI SatImage dataset with/without boosting by the algorithms compared. The first 4,435 patterns of the data set are used for training and the last 2,000 patterns are used for testing. Note: the arc-fs algorithm by Breiman is quite similar to the AdaBoost.M1 algorithm.
5.3 Test error rates on the UCI Letters dataset with/without boosting by the algorithms compared. The first 16,000 patterns of the data set are used for training and the remaining 4,000 patterns are used for testing.
5.4 Average test error rates of the HONG network with/without boosting on the UCI SatImage and Letters data sets for different proportions of labeled training patterns.

Abbreviations

$|A|$ - cardinality of set $A$
$d(x, y)$ - Euclidean distance between vectors $x$ and $y$
ANN - Artificial Neural Network
C4.5 - A Classification Tree
CART - Classification and Regression Tree
CF - Contour-based Features
GCS - Growing Cell Structure
GF - Global Features
GSOM - Growing SOM
HME - Hierarchical Mixture-of-Experts
HONG - Hierarchical Overlapped Neural Gas
ME - Mixture-of-Experts
MLP - Multi-layer Perceptron
MST - Minimal Spanning Tree
NIST - National Institute of Standards and Technology
NG - Neural Gas
NN - Nearest Neighbor
ONG - Overlapped Neural Gas
PAC - Probably Approximately Correct
PDE - Partial Distance Elimination
PF - Projection-based Features
SOFM - Self-Organizing Feature Map
SOM - Self-Organizing Map
SPA - Structure Parameter Adaptation

List of Symbols

$V$ - input patterns vector space.
$d$ - dimension of the input patterns.
$A$ - discrete output space of the lattice.
$q$ - dimension of the output space $A$.
$M$ - no. of training patterns.
$v \in V$ - an input vector.
$p(v)$ - probability distribution of the input data vectors.
$w_i$ - reference vector of unit $i$.
$W$ - all reference vectors, i.e., $W = \{w_1, w_2, \ldots, w_N\}$.
$\Delta w_i$ - change of $w_i$.
$i$ - unit/neuron index, i.e., $i = 1, 2, \ldots, N$.
$i^*$ - winning unit in SOM/NG algorithms.
$N$ - max. no. of units/neurons in the base network.
$N'$ - max. no. of units/neurons in the second level networks.
$t$ - adaptation step, $t = 1, 2, \ldots, t_{max}$.
$t_{max}$ - max. no. of adaptation steps.
$c$ - class label index, i.e., $c = 1, 2, \ldots, C$.
$C$ - no. of classes.
$\omega$ - class label of a pattern.
$\Omega$ - set of class labels, i.e., $\Omega = \{\omega_1, \omega_2, \ldots, \omega_C\}$.
$P$ - no. of overlaps considered for the training set.
$Q$ - no. of overlaps considered for the testing set ($Q \le P$).
$j$ - index of the overlapped NG networks, e.g., for training $j = 1, 2, \ldots, P$ and for testing $j = 1, 2, \ldots, Q$.
$d_i$ - Euclidean distance between the input pattern $v$ and the reference vector $w_i$.
$d_{min}$ - min. of distances $d_i$, i.e., $d_{min} = \min\{d_1, d_2, \ldots, d_N\}$.
$d_{max}$ - max. of distances $d_i$, i.e., $d_{max} = \max\{d_1, d_2, \ldots, d_N\}$.
$D$ - all the distance values w.r.t. the input pattern, i.e., $D = \{d_1, d_2, \ldots, d_N\}$.
$k_i$ - explicit ranking index.
$r_i$ - implicit ranking metric.
$\epsilon$ - learning rate.
$h_\lambda(\cdot)$ - neighborhood function of the NG algorithm.
$\lambda$ - characteristic decay constant of the neighborhood function.
$g_i$ - degree of importance in fuzzy literature.
$h_i$ - partial evaluation in fuzzy literature.
$e$ - fuzzy integral.

Chapter 1

Introduction

In the recent past, artificial neural networks and pattern classification have been perhaps the most enthusiastically studied areas within the machine learning community. The widespread interest in these areas is motivated by their potential applications in a broad spectrum of our day-to-day activities. Pattern classification is the object of interest in this thesis, i.e., determining which class an object belongs to, given some of its properties.

This thesis presents a detailed investigation into a hybrid learning system which combines supervised and unsupervised learning methods. Its application to some benchmark pattern classification problems is also described. In order to motivate the objectives of the research project, this chapter introduces the area of research within which the thesis lies and then broadly summarizes the methodology used. In Section 1.3, an outline of the subsequent chapters is given. Finally, the main contributions to the field made during the course of this work are listed.


1.1 Motivation of the Project

This section presents the two main areas studied in this thesis as a motivation for the objectives of the project.

1.1.1 Pattern Classification

Pattern classification is a very general problem, and there are several tasks from a wide range of domains that can be cast as classification tasks. These include character and speech recognition, fault detection, process control, remote sensing, classification of human faces and many more. In many domains, the goal of pattern recognition is to identify the class of an object measured by a set of sensors. A set of measurements is called a pattern. Generally the recognition process is composed of two subprocesses, namely feature extraction and pattern classification [28], as shown in Figure 1.1.

The task of the feature extraction process is to facilitate classification by selecting certain properties from the presented pattern. These selected properties, the features, are then provided to a classifier to decide the class. For many problem domains, however, it is hard to find a sophisticated feature extraction method, particularly when patterns of different classes are highly mixed in feature space. Therefore, the heavier burden of the recognition system tends to fall on the process of pattern classification.

Figure 1.1: A pattern classification system. (An object is measured by a sensor to produce a pattern; feature extraction selects features from the pattern, and the pattern classification stage assigns a class.)


Due to their flexibility and adaptability, artificial neural network architectures offer a variety of ways of dealing with some of the more difficult problems in pattern classification.

1.1.2 Artificial Neural Networks

The main idea behind an artificial neural network is to use several simple computational units, called neurons, connected by weighted links through which activation values are transmitted. The neurons normally have a very simple way to calculate their outputs given received input values, for example forming the sum of the inputs and feeding it through a monotonic transfer function. To use a neural network in a classification task, the pattern to classify is typically fed into the network as activation at a set of input units. This activation is then spread through the network via the connections, finally resulting in activation of the output units, which is then interpreted as the classification result. Training of the network consists of presenting the patterns of the training set to the network, and letting it adjust its connection weights to obtain the correct output.

As more and more computational power becomes available on our desktops, we want to be able to use it to obtain solutions to complex, real-world classification problems. A large number of different artificial neural network architectures exist to tackle different kinds of problems (see [52, 54] for reviews), and a brief historical review of artificial neural networks is given in Chapter 2. As far as learning is concerned, the principal dichotomy in artificial neural networks is between supervised and unsupervised learning. Supervised algorithms are provided with a set of training input patterns along with the correct output for each input pattern. In unsupervised algorithms only input patterns are received, and the algorithm does its best to find intrinsic structure in the pattern distribution.
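As a minimal sketch of the neuron model just described (a weighted sum of the inputs fed through a monotonic transfer function), with hypothetical weights and inputs chosen purely for illustration:

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """One simple computational unit: weighted sum of the inputs,
    fed through a monotonic (logistic sigmoid) transfer function."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

# A hypothetical two-input unit.
print(neuron_output([0.5, -1.0], [0.8, 0.3], bias=0.1))  # approx. 0.55
```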

1.2 Research Problem and Methodology

As mentioned above, supervised and unsupervised learning are the two main research themes in the study of ANNs. Most neural network research applied to pattern classification, and also to other applications, has been focused on supervised learning, and network models like the MLP provide a highly efficient method to design an arbitrarily complex non-linear classifier [52]. However, there are some problem domains that are not solved in a satisfactory way by means of a single powerful classifier. When the abstraction level of the classification task increases, the shapes of the associated decision regions become increasingly complex, requiring impossibly large amounts of training data to form the class boundaries. This problem can be alleviated by using unsupervised learning techniques to reduce the number of degrees of freedom in the data. This is achieved by identifying underlying structure in the data and organizing patterns into clusters.

Hybrid learning systems, which combine supervised and unsupervised learning methods, have become quite popular in recent times. Hybrid systems can be implemented using a hierarchical approach with multiple layers, where each layer is comprised of one or more networks [29, 96, 102]. The basic idea of hierarchical networks is to have processing units on each layer of the network receive connections from a restricted localized subset of the previous layer's processing units [53]. This implies that each processing unit can only act upon partial information from the previous layer's processing units. This can be advantageous because the processing units of each layer only have to concern themselves with the processing of a limited amount of information. And by processing the input information one step at a time, hierarchical networks can provide capabilities that would otherwise be difficult to deliver.

In this study, a hybrid learning system with a hierarchical architecture based on the neural gas algorithm is developed for pattern recognition problems. The main innovation in this approach is that the hierarchy allows patterns of different classes that are close in feature space to be more easily distinguished. The proposed learning system generates multiple classifications for every data pattern presented, and these are registered as confidence values. These confidence values allow the system to employ a variety of classifier fusion techniques to combine individual classifications to produce an overall prediction for a pattern. The proposed learning system favors application areas such as speech and character recognition, data mining, and classification of World Wide Web pages because it is able to handle a data set with a mixture of labeled and unlabeled patterns.

1.3 Organization of the Thesis

The remainder of the thesis is organized as follows. In Chapter 2, a comprehensive review of artificial neural networks, particularly for unsupervised learning, is presented. This includes a detailed description of a variant of the self-organizing map called the neural gas algorithm. This algorithm is used as a basis for the learning system developed in the thesis. Chapter 3 discusses motivations and methods for the combination of multiple predictive models. Chapter 4 discusses the use of hybrid learning systems that combine supervised and unsupervised learning methods. This is the core of the thesis, and here I introduce a hybrid learning system with a hierarchical network architecture, showing how it can handle labeled and unlabeled patterns and comparing its performance with other methods on real-world data. Chapter 5 discusses a popular variant of the boosting algorithm called AdaBoost, which is used to improve the performance of the proposed hybrid learning system. Chapter 6 provides some concluding remarks, including some suggestions for further investigation.

1.4 Original Contributions

The major contributions of this thesis are as follows:

1. A comprehensive review of unsupervised learning. In particular, self-organizing networks and fusion techniques for combining multiple classifiers are discussed.

2. The introduction of an implicit ranking scheme into the original neural gas algorithm in place of the time-consuming explicit ranking scheme. This modification eliminates the explicit sorting mechanism completely and reduces the complexity from $O(N \log N)$ to $O(N)$, where $N$ is the number of units in the network.

3. Development and implementation of a hierarchical architecture called HONG which is based on the neural gas algorithm and combines supervised and unsupervised learning methods. The basic aim is to build a hierarchy in which the higher levels specialize in distinguishing between patterns that are close together in pattern space but belong to different classes.


4. Using confidence values and a measure of network performance with the fuzzy integral to provide a novel way of combining classifier outputs.

5. Application of the HONG network to benchmark data sets and comparison with the best recently published results.

6. Application of boosting techniques to the HONG network to provide improved performance.
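To give a flavor of contribution 2, here is a hedged sketch of one way an implicit rank can replace explicit sorting. The exact mapping the thesis uses (equation (2.6), developed in Section 2.8.2) is not reproduced in this excerpt, so the normalization below is an assumption for illustration only:

```python
import numpy as np

def implicit_ranks(distances):
    """Assumed sketch only: map each distance d_i to a continuous
    pseudo-rank in [0, N-1] using just d_min and d_max, i.e. O(N)
    work with no sorting. The thesis's actual implicit ranking
    metric (equation (2.6)) may differ from this normalization."""
    d = np.asarray(distances, dtype=float)
    d_min, d_max = d.min(), d.max()
    if d_max == d_min:
        return np.zeros_like(d)
    return (len(d) - 1) * (d - d_min) / (d_max - d_min)
```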

Chapter 2

Unsupervised Learning and Self-Organizing Networks

The previous chapter discussed motivation, objectives and the major contributions of this work. However, in order to relate and compare my contributions with existing techniques, it is necessary to discuss relevant works in some detail. This chapter introduces the area of artificial neural networks with a brief historical review and seeks to establish a context for a review of the development of self-organizing feature maps. In this chapter I concentrate mainly on unsupervised learning, which uses competitive learning algorithms and, in particular, I discuss Kohonen's self-organizing feature map algorithm and its variants. Following this, I discuss another variant of the self-organizing map algorithm called the neural gas algorithm, which I have used extensively in my research work.


2.1 Overview of Artificial Neural Networks

Artificial neural networks (ANNs) or simply neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. ANNs are inspired by biological evidence, and attempt to make use of some of the organizational principles that are believed to be used in the human brain. While the biological plausibility of existing ANN models is generally doubtful, there is no doubt that ANNs are complex, non-linear systems which display many interesting and useful properties.

2.2 The Learning Process

This thesis is concerned with learning systems where, usually, the process of learning is based on a certain amount of a priori knowledge and a set of training data. The procedure used to perform the learning process is called a learning algorithm, the function of which is to modify the network parameters (also known as reference vectors) of the network in an orderly fashion so as to attain a desired design objective. There is no unique learning algorithm for the design of neural networks. Rather, there is a diverse range of learning algorithms, each of which has its own strengths and weaknesses. These algorithms mainly differ from each other in the way they formulate their adjustments to the reference vectors in the learning process. Error-correction learning, Hebbian learning and competitive learning are some examples of learning algorithms.

Another factor that has an influence on the learning process is the relationship between the learning model and the environment, sometimes referred to as the learning paradigm [51]. There are three basic classes of learning paradigms: supervised learning, reinforcement learning and unsupervised learning. In supervised learning the network is provided with a set of training inputs along with the correct output for each input. Here the reference vectors are adjusted under the combined influence of the training input and the error signal (the difference between the actual response of the network and the desired response). This adjustment is carried out iteratively in a step-by-step fashion with the aim of eventually making the neural network emulate the teacher. Reinforcement learning is a special case of supervised learning where the network is provided with a reward or quality signal indicating whether the network output was more or less successful. In unsupervised learning there is no external teacher or critic to oversee the learning process. The algorithm explores the underlying structure in the data, or correlations between patterns in the data, and organizes patterns into categories from these correlations. Once the network has become tuned to the statistical regularities of the input data, it develops the ability to form internal representations for encoding features of the input and thereby create new categories automatically. To summarize the above, a taxonomy of the learning process is shown in Figure 2.1.

Figure 2.1: A taxonomy of the learning process. (The learning process splits into learning algorithms, namely error-correction, Hebbian and competitive learning, and learning paradigms, namely supervised, reinforcement and unsupervised learning.)

2.3 Brief Historical Review of ANNs

This section gives a synopsis of the development of ANNs and aims to provide a background for a review of the development of self-organizing feature maps. ANN models go by many names, including connectionist models (connectionism, however, has a very long past; in fact, one can trace the origin of connectionist ideas to the early Greek philosopher Aristotle and his ideas on mental associations [92]), parallel distributed processing models and neuro-computing systems, to name a few. Whatever the name, these models attempt to achieve good performance via dense interconnection of simple computational elements called neurons.

The field of ANNs has its roots in the early work of McCulloch and Pitts [91], published in 1943, which proved that any logical expression could be implemented by an appropriate network of simplified neurons (threshold units). In order to show that any logical expression is computable by their simplified neurons, all that is necessary is to show that they can implement the primitive functions AND, OR and NOT, and this is easily done. And in order to build larger (or more complex) functions, one needs only to connect these primitives together in appropriate ways.

The next milestone in ANN research was Rosenblatt's work on the Perceptron (a network of threshold units with some variable input weights) [111], published in 1958. The outstanding achievement of Rosenblatt's work was a derivation of the convergence conditions for a learning algorithm for the perceptron with a single layer of variable weights. A short time later, Widrow and Hoff introduced the Adaline (adaptive linear element) [132], which was trained by a least mean squares algorithm. The main difference between the perceptron and the Adaline is that the training procedure for the Adaline converges for problems that are not linearly separable (the perceptron convergence procedure does not).

ANNs generated a great deal of enthusiasm from the late fifties to the late sixties, but this enthusiasm was brought to a halt when Minsky and Papert published their book on perceptrons in 1969 [97], which demonstrated fundamental and major limitations of the computing power of one-layer perceptrons. They highlighted the fact that perceptrons cannot learn classification problems that are not linearly separable and expressed the view that it was unlikely that learning rules could be found to solve such problems using networks with more than a simple layer. Since almost all classification problems are not linearly separable, Minsky and Papert's book was very influential and led to a loss of interest in ANNs. For more than 10 years, interest (and funding) was concentrated on more conventional (rule-based) approaches to artificial intelligence.

Then came the discovery of the error back-propagation algorithm, which changed once again the course of research into learning systems. There is a prior claim by Werbos [131], who developed it in his Ph.D. thesis (1974) for a problem unrelated to learning systems. It was developed independently in the learning system context by Parker [103] and by LeCun in 1985 [51], and then popularized through the publication of the seminal book entitled Parallel Distributed Processing by Rumelhart et al. [113]. Threshold units were replaced by units with sigmoidal characteristics, and feedforward networks of these units are now called multi-layer perceptrons (MLPs). The fact that MLPs are trainable universal approximators [56] led to an exponential growth in publications in the area (for detailed reviews see [12, 51, 92, 109, 139] and the references therein).
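As a small aside on the McCulloch-Pitts result mentioned above, the following sketch shows threshold units implementing the primitive functions AND, OR and NOT, and a larger function wired from those primitives. The weights and thresholds are illustrative choices; many others work equally well:

```python
def threshold_unit(inputs, weights, threshold):
    """McCulloch-Pitts style unit: fires (1) iff the weighted sum
    of its inputs reaches the threshold."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

def AND(a, b): return threshold_unit([a, b], [1, 1], threshold=2)
def OR(a, b):  return threshold_unit([a, b], [1, 1], threshold=1)
def NOT(a):    return threshold_unit([a], [-1], threshold=0)

# More complex functions are built by connecting primitives, e.g. XOR:
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))
```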


In the period following the publication of Minsky and Papert's book, there were two other important themes that emerged. These were associative (or content addressable) memory and self-organizing networks (based on competitive learning [114]). The work on self-organizing networks was pioneered by von der Malsburg [127], and was perhaps the first to demonstrate self-organization. Later, a strongly related, more formal algorithm was formulated by Kohonen [76]. The main focus of this thesis is on self-organizing networks, which will be discussed in more detail in the remainder of this chapter.

2.4 Self Organization Process

This section discusses in detail some algorithms for self-organized (or unsupervised) learning. The general goal of unsupervised learning is to construct new representations of the input data that capture its underlying structure, and to do this without a teacher. The form these representations take depends greatly on the architecture of the network as well as the training algorithm. The structure of a self-organizing network can take a variety of different forms, but generally consists of an input layer and an output layer with feed-forward connections from input to output and lateral connections between neurons (also called units) in the output layer.

A key dimension along which various representations of unsupervised learning can be described is their degree of locality. At one extreme along this dimension lies a purely local representation, in which a single unit represents an entire data set. The other extreme is a representation in which each unit represents a single data pattern. Unsupervised learning methods can be divided into two broad classes based on these two extremes of desired representation locality. The first set of algorithms, projection methods, attempt to define interesting representations in a space on which the data points can be projected [10]. Projection methods generally produce fully distributed representations, as each data point is described as a combination of the component directions. There is a close correspondence between this type of self-organized system and the statistical method of principal component analysis, which is a standard technique commonly used for data reduction in statistical pattern recognition and signal processing [100]. The second set of algorithms are called kernel methods [138], since they use neighborhood kernels to represent the degree of locality. These methods typically involve some form of clustering, where the input is approximated by one of a fixed set of points. Here, units that are far away from a data point play no role in representing that point. Kernel methods typically produce local representations, and use competitive learning algorithms. One way of distinguishing the two categories of unsupervised learning is that in projection methods the units cooperate to represent the input, while in kernel methods the units compete. This thesis focuses on the latter class of self-organized systems, which are based on competitive learning.

2.4.1 Kernel Methods and Competitive Learning

The goal of kernel methods is to classify or cluster the input data, so that similar inputs are associated with the same output unit. The key point is that the network must discover the classes from the statistics of the inputs. In the area of competitive learning a rather large number of models exist that have similar goals but differ considerably in the way they work. A common goal of these algorithms is to distribute a number of patterns over a possibly high-dimensional space. The distribution of these patterns should reflect the probability distribution of the input patterns, which in general is not given explicitly but only through sample patterns.

Competitive learning methods can be divided into two classes, hard and soft competitive learning, based on how the adaptation of the units in the network is performed for each input pattern. Hard competitive learning (also known as winner-takes-all learning) comprises methods where each input pattern causes the adaptation of only one unit, the winner. Here stability is achieved by moving the winner unit's reference vector closer to the input pattern, which makes it likely that the same unit will win the competition for that (or a similar) pattern the next time. The Linde-Buzo-Gray (LBG) algorithm [82] and the k-means algorithm [84] are two well-known algorithms of this class. (A large amount of work has been done on learning problems outside the field of ANNs, especially in statistical learning models. These models are not discussed in this thesis.)

A general problem occurring with hard competitive learning is the possible existence of dead units. These are units which, perhaps due to inappropriate initialization, are never the winner for an input pattern, and therefore keep their position indefinitely. These units do not contribute in any way to network performance and are therefore redundant. Common ways to avoid such dead units are (1) to use distinct patterns to initialize the reference vectors according to their probability density function [36] and (2) to penalize frequent winners in some way [46]. Another problem with hard competitive learning is that different random initializations can lead to very different results. The purely local adaptations may not be able to get the system out of an initial poor local minimum. One way to cope with this problem is to modify the winner-takes-all approach of hard competitive learning to the winner-takes-most approach of soft competitive learning. In this case not only the winner but also some other units are adapted. Nowlan [99] showed that soft competitive learning is superior to the traditional winner-takes-all form in terms of both performance and learning speed on two classification tasks: digit and vowel recognition. The self-organizing feature map is a very popular kernel based algorithm that uses soft competitive learning.
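To make the hard/soft distinction concrete, here is a minimal sketch (assuming NumPy, with the reference vectors stacked in a matrix W of shape (N, d); the learning rate and temperature are illustrative values, not settings from this thesis):

```python
import numpy as np

def hard_competitive_step(W, v, eps=0.05):
    """Winner-takes-all: only the single closest unit moves towards v."""
    winner = np.argmin(np.linalg.norm(W - v, axis=1))
    W[winner] += eps * (v - W[winner])

def soft_competitive_step(W, v, eps=0.05, temperature=1.0):
    """Winner-takes-most: every unit moves, weighted by its closeness to v."""
    d2 = np.sum((W - v) ** 2, axis=1)
    weights = np.exp(-d2 / temperature)
    weights /= weights.sum()
    W += eps * weights[:, None] * (v - W)
```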

2.5 Self-Organizing Feature Maps

In the literature, there are basically two feature mapping models that have been developed based on neuro-biological motivation. The first model was originally proposed by Willshaw and von der Malsburg [133] to explain the problem of retinotopic mapping from the retina to the visual cortex. In this model there are two separate two-dimensional lattices of neurons connected together, with one projecting onto the other. (A lattice is an undirected graph in which every non-border vertex has the same fixed number of incident edges, and which usually appears in the form of an array with a rectangular or simplex topology.) One lattice represents presynaptic (i.e., input) neurons, and the other lattice represents postsynaptic (i.e., output) neurons. The two lattices are interconnected by modifiable synapses (or connections). The model is based on the idea that the geometrical proximity of presynaptic neurons creates correlations in their electrical activity. The resulting correlations can be used in the postsynaptic lattice to recognize the connections of neighboring presynaptic neurons and to connect them to neighboring postsynaptic neurons, hence producing a continuous mapping. Note, however, that this model is restricted to mappings where the input dimension is the same as the output dimension.

The second model was proposed by Kohonen [74, 75, 76] and belongs to the class of vector coding algorithms. This model provides a topological mapping by placing the input patterns from a higher-dimensional space onto a lower-dimensional (lattice) space, and thereby facilitates data compression. Kohonen simplified the Mexican hat lateral relations in the Willshaw-von der Malsburg model and modified the weight update rule so as to involve neighborhood relations during learning. An additional novel element is that the range spanned by these neighborhood relations shrinks over time. It appears that Kohonen's model has more practical use than the Willshaw-von der Malsburg model in the sense that it is capable of performing data compression. Certainly Kohonen's model has received much more attention in the literature [67] than the Willshaw-von der Malsburg model.

2.5.1 Kohonen's Self-Organizing Feature Map

Teuvo Kohonen formulated the self-organizing feature map algorithm (SOFM) as a mathematical model of the self-organization of certain structures in the brain, the topographic maps. In the cortex, neurons are often organized in two-dimensional sheets with connections to other areas of the cortex or to sensor or motor neurons elsewhere in the body. For example, the somatosensory cortex is arranged as a topographic map of the sensory skin of the body. The term topographic map here means that neighboring areas on the skin find their neural representations in neighboring neurons in the cortex. A comprehensive treatment of this topic is provided by Kohonen [76], but here I discuss only the main features that relate to this thesis.

Figure 2.2 shows the basic operation of the Kohonen SOM. The map is built by a $q$ (usually two) dimensional lattice $A$ of units (or neurons). This is also called the output space of the map. Each unit is labeled by an index $i \in A$, and has an associated reference vector $w_i$ projecting into the input space $V$.

Figure 2.2: The self organizing map. (The feature mapping takes an input $v$ in the input space $V$ to a winning unit $i^*$, with reference vector $w_{i^*}$, in the discrete output space $A$.)

The principal goal of the SOM algorithm is to transform an incoming signal pattern of arbitrary dimension (say $d$) into a $q$-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion. The algorithm starts by randomly initializing the reference vectors in the lattice (or the network). That is, in the initialization operation no prior order is imposed on the feature map. Once the network has been initialized, the two stages of competition and cooperation are iterated for every sample. In the competition stage, the best matching unit is selected (i.e., the winner), and in the cooperation stage the reference vectors of the winner are adapted along with those of its immediate lattice neighbors.

2.5.2 Competition Stage

Let $d$ denote the dimension of the input space $V$. Denote an input pattern selected at random from $V$ by $v = [v_1, v_2, \ldots, v_d]^T$, where each component $v_j \in \mathbb{R}$, $j = 1, 2, \ldots, d$. In the SOM network, the reference vector of each unit has the same dimension as the input space (i.e., each unit in the network is fully connected to the input vector, so that all input information is connected to all units). Unit $i$ is connected to input $v$ by the reference vector $w_i = [w_{i1}, w_{i2}, \ldots, w_{id}]^T$, where $i = 1, 2, \ldots, N$ and $N$ is the total number of units in the network.

To find the best match of the input pattern $v$ with the various reference vectors $w_i$, there exist at least two possibilities. First, select as winner the unit for which the dot product of the input pattern and the reference vector is the largest,

$$ i^* = \arg\max_i \, (w_i \cdot v), \qquad i = 1, 2, \ldots, N $$

and label the winner as $i^*$. (If there are two or more winners, and this is likely to occur only rarely from a random starting point, select one winner at random.) It is often claimed that this rule is analogous to a certain type of biological circuitry [76]. Demartines et al. [23] have shown that finding the best match of the input pattern with the reference vector using the dot product gives inconsistent results when the reference and the input vectors are not normalized. The second method is to compute the Euclidean distance between the input pattern and the reference vectors, and select the unit with the smallest Euclidean distance,

$$ i^* = \arg\min_i \, \| v - w_i \|, \qquad i = 1, 2, \ldots, N $$

The latter is mathematically more convenient and is sometimes called the nearest-neighbor rule. This is the one most widely used in practice and will be used here.
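A minimal sketch of the two matching rules, assuming NumPy, with the reference vectors stacked in a matrix W of shape (N, d):

```python
import numpy as np

def winner_dot(W, v):
    """Dot-product matching; gives inconsistent results if the
    vectors are not normalized (Demartines et al. [23])."""
    return int(np.argmax(W @ v))

def winner_nearest(W, v):
    """Nearest-neighbor rule: unit with the smallest Euclidean distance to v."""
    return int(np.argmin(np.linalg.norm(W - v, axis=1)))
```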

2.5.3 Cooperation Stage

In the competition stage, the winning neuron $i^*$ determines the spatial location of a topological neighborhood of excited units. According to neuro-biological evidence (for lateral interaction), a neuron that is firing tends to excite the neurons in its immediate neighborhood more than those farther away from it. This observation led to a topological neighborhood around the winning neuron $i^*$ being defined for the SOM as explained below.

Let $l_{i^*,i}$ denote the lateral distance between the winning unit $i^*$ and the excited unit $i$. The topological neighborhood (also referred to as the neighborhood function) $h_{i^*,i}$ can be defined as a unimodal function with the following two requirements:

- It is symmetric about the maximum point defined by $l_{i^*,i} = 0$.
- Its amplitude decreases monotonically with increasing lateral distance $l_{i^*,i}$.

A typical choice of $h_{i^*,i}$ that satisfies these requirements is the Gaussian function,

$$ h_{i^*,i} = \exp\!\left( -\frac{l_{i^*,i}^2}{2\sigma^2} \right) \qquad (2.1) $$

where $\sigma$ is the width of the topological neighborhood. Another feature of the SOM algorithm is that the width of the topological neighborhood $\sigma$ shrinks with time. An exponential decay function is a popular choice for this. This kind of $\sigma$ reduction scheme ensures that the map actually approaches a neighborhood preserving final structure, provided that such a structure exists. (Topographic map representations are not discussed in detail in this thesis, but a comprehensive study can be found in a recent book by Hulle [59].) However, if the dimensionalities of the input space and the output space differ, it is impossible to preserve all similarity relationships among reference vectors in the input space [31, 83]. Such issues are not discussed here.

Once the winning unit and its neighborhood are identified, a simple competitive learning rule can be applied to adapt the reference vectors of the units. Let $\Lambda_{i^*}$ be a neighborhood set of units centered around the winning unit $i^*$. At each learning step, all units of the neighborhood set are updated, whereas those outside the set are not. So, using a discrete-time formalism, given the reference vector $w_i(t)$ for unit $i$ at time $t$, the updated reference vector $w_i(t+1)$ at time $t+1$ is defined by:

$$ w_i(t+1) = \begin{cases} w_i(t) + \epsilon(t)\, h_{i^*,i}(t)\, [\, v - w_i(t)\, ] & \text{if } i \in \Lambda_{i^*} \\ w_i(t) & \text{if } i \notin \Lambda_{i^*} \end{cases} \qquad (2.2) $$

where $\epsilon(t)$ is the learning rate parameter of the algorithm, and $0 < \epsilon(t) < 1$. Equation (2.2) has the effect of moving the reference vectors $w_i$, $i \in \Lambda_{i^*}$, towards the input pattern $v$. Upon repeated presentation of the training patterns, the reference vectors tend to follow the distribution of the input patterns due to the neighborhood updating. In order to stabilize the map at the end of the learning phase, $\epsilon(t)$ is often decreased over time (perhaps to a small residual value or even to zero), as is the neighborhood function $h_{i^*,i}$ by reducing the width $\sigma$. When the neighborhood function vanishes, only the reference vector of the winner is updated, and Kohonen's rule in equation (2.2) becomes identical to a conventional clustering algorithm such as the K-means algorithm. (Although the K-means algorithm and the SOM algorithm are very closely related, their applications are quite different. In the K-means algorithm the number K of clusters should be chosen according to the number of clusters there are in the data, whereas in the SOM algorithm the number of reference vectors can be chosen to be much larger, irrespective of the number of clusters.) The map developed without a neighborhood function is usually referred to as a zero-order topology map [59]. Figure 2.3 shows a pseudo-code representation of the basic SOM algorithm.
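As a runnable companion to the pseudo-code in Figure 2.3 below, here is a minimal sketch of the whole loop (assuming NumPy; the grid size, learning-rate and neighborhood schedules are illustrative choices, not the settings used in this thesis):

```python
import numpy as np

def train_som(data, grid=(10, 10), t_max=10_000, eps0=0.5, sigma0=5.0, seed=0):
    """Minimal SOM: random initialization, then repeated competition
    (nearest reference vector) and cooperation (Gaussian neighborhood
    update, equations (2.1) and (2.2)) with decaying eps and sigma."""
    rng = np.random.default_rng(seed)
    n_units = grid[0] * grid[1]
    W = rng.random((n_units, data.shape[1]))               # reference vectors
    coords = np.indices(grid).reshape(2, -1).T             # lattice positions
    for t in range(t_max):
        v = data[rng.integers(len(data))]                  # sampling
        winner = np.argmin(np.linalg.norm(W - v, axis=1))  # competition
        frac = t / t_max
        eps = eps0 * (0.01 / eps0) ** frac                 # decaying learning rate
        sigma = sigma0 * (0.1 / sigma0) ** frac            # shrinking neighborhood width
        lat2 = np.sum((coords - coords[winner]) ** 2, axis=1)
        h = np.exp(-lat2 / (2.0 * sigma ** 2))             # Gaussian neighborhood (2.1)
        W += eps * h[:, None] * (v - W)                    # cooperative update (2.2)
    return W
```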

Initialization: Choose random values for the initial reference vectors w_i(0). The only restriction is that the w_i(0) be different for each i = 1, 2, ..., N, where N is the number of units in the lattice. Small starting values are also a good idea.
Sampling: Select an input pattern vector x from the input space X.
Competition: Find the winning unit c at time step t by using the Euclidean minimum-distance criterion:
    c = arg min_i || x(t) − w_i(t) ||,   i = 1, 2, ..., N
Cooperation: Update the reference vectors of all units centered around the winning unit c using
    w_i(t+1) = w_i(t) + α(t) h_{c,i}(t) [x(t) − w_i(t)]
where both α(t) and the neighborhood h_{c,i}(t) are varied dynamically during learning, as explained earlier.
Termination: Repeat the sampling, competition and cooperation steps until no noticeable changes in the feature map are observed.

Figure 2.3: The basic SOM algorithm.
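For concreteness, the loop of Figure 2.3 can be written out as a short program. The following is a minimal Python sketch, not the thesis implementation: it uses the Gaussian neighborhood of equation (2.1) in place of the crisp set N_c(t), and the function name, lattice layout and decay schedules are illustrative assumptions.

import numpy as np

def som_train(data, rows, cols, t_max, alpha0=0.5, sigma0=3.0):
    # Initialization: small random reference vectors, one per lattice unit.
    rng = np.random.default_rng(0)
    weights = rng.uniform(-0.1, 0.1, size=(rows * cols, data.shape[1]))
    # Lattice coordinates of each unit, used for the lateral distance d_{c,i}.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(t_max):
        x = data[rng.integers(len(data))]                    # Sampling
        c = np.argmin(np.linalg.norm(x - weights, axis=1))   # Competition: winner c
        alpha = alpha0 * np.exp(-t / t_max)                  # decaying learning rate
        sigma = sigma0 * np.exp(-t / t_max)                  # shrinking neighborhood width
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)           # squared lateral distances
        h = np.exp(-d2 / (2.0 * sigma ** 2))                 # Gaussian neighborhood, eq. (2.1)
        weights += alpha * h[:, None] * (x - weights)        # Cooperation, cf. eq. (2.2)
    return weights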

2.5.4 Problems Associated with the Basic SOM Algorithm

Kohonen's feature map is a special method for conserving the topological relationships in input data, but it has some limitations. For example, in Kohonen's model the neighborhood has to be defined in advance. Also, the property of neighborhood preservation, which distinguishes self-organizing maps from other neural network paradigms, depends on the choice of the output space map topology. However, for real-world data sets a suitable dimensionality to capture most of the information in the data is usually not known a priori, yet the output grid of the SOM lattice has to be specified prior to learning. So, one has to try different lattice structures and somehow determine which lattice yields the highest degree of topology preservation, which is a very time-consuming process. To tackle this problem, one can use a more sophisticated learning scheme which adapts not only the weight vectors of the units, but also the topology of the output space itself (i.e., by increasing or decreasing the number of units in the current map, or by evolving the map to a higher or lower dimensionality). Some examples of such algorithms include topology representing networks [7, 89], the growing cell structure algorithm [39], the SPA neural tree algorithm [80] and the growing hyper-cubical output space algorithm [8].

Formal studies of SOMs and their properties would be greatly simplified if it could be proven that the adaptation laws are derivable from some energy function that is minimized.

This would allow traditional techniques (e.g., the Lyapunov function approach, whose key idea is to find an energy function that decreases on every training step, so that the system converges to the minimum of that function) to be used to ensure convergence of the algorithm. The possibility of the existence of energy functions has been studied in many works. However, it was shown by Erwin et al. [31] that the dynamics of the SOM algorithm cannot be described as a stochastic gradient descent on any single energy function. The only current solution to this problem is to describe the dynamics of the algorithm as a set of energy functions, one for each weight vector. This approach was first proposed by Tolat [126] for one-dimensional maps, and later Erwin et al. [30, 31, 32] generalized it to arbitrary dimensions.

2.6 Variants of the Basic SOM Algorithm

In this section I briefly describe some of the rich variety of versions of the basic SOM algorithm that have been proposed recently. Some of these variants aim at improving topology preservation by using more flexible map structures instead of the fixed lattice, so that they describe the probability density function of the input data more accurately. While some of these methods may be useful for other purposes, they cannot be used for visualization, at least not as easily as the regular grid. Other variants aim at reducing the computational complexity of the SOM; speed of computation is an extremely important aspect in some applications, such as data mining, where vast databases are analyzed. There are a number of ways to define the matching of an input pattern with its network representation (e.g., different metrics [23]), and the neighborhood of a unit can also be defined in various ways. With regard to the latter, several authors have suggested that the definition of the neighborhood function h could be

made dependent on intermediate results. Bauer et al. [8] proposed a growth algorithm called the Growing Self-Organizing Map (GSOM), which can adapt both the output space topology and the weight vectors. The GSOM algorithm starts with a two-unit configuration, learns by means of the basic SOM algorithm, grows the output space accordingly, learns again, and keeps repeating these operations until a pre-specified maximum number of units is reached. Growth can be achieved either by adding units in one of the directions already spanned by the output space or by adding a new dimension, which is decided on the basis of the fluctuations within the Voronoi cells of the units. (A Voronoi cell of unit i consists of those data points which are closer to the reference vector w_i than to any other reference vector, i.e., the input space region which belongs to each reference vector [89].) In this model the authors of [8] decompose the reconstruction error (the deviation between an input and the reference vector of its winning unit) along the different directions which result from projecting the output space back into the input space. This reconstruction error is used as the criterion in the growth algorithm to add units in the direction which has, on average, the largest error amplitude. The structure of the output space generated by the GSOM algorithm has the shape of a hypercube with varying size in different directions.

Another adaptive self-organizing system is the Structure Parameter Adaptive (SPA) neural tree proposed by Li et al. [80]. Tree-structured classifiers have been widely used in pattern recognition tasks and have demonstrated excellent results in the literature. The SPA neural tree can be adapted to a changing environment both parametrically and structurally. It begins with an empty structure, and units are added to the tree when the error rate exceeds a threshold; some units are also deleted if they remain inactive for a long

period. The tree uses a vigilance factor to control the creation of new units and a threshold factor to control the splitting of units into more units. An operational measure is used to control the deletion of units from the tree. In the SPA neural tree architecture, the units of a subtree have reference vectors similar to those of the root unit, reflecting the fact that the architecture can be used as a hierarchical classifier.

Fritzke [37, 38, 39, 41, 42] has also proposed an SOM structure, called the Growing Cell Structure (GCS), which has the ability to automatically find a problem-specific network structure through a growth process. The initial topology of the network is a k-dimensional simplex (e.g., for k = 1 this is a line, for k = 2 a triangle, etc.). The k + 1 vertices of the simplex are the units of the network. The structure is modified dynamically by insertion and deletion of units, using a self-organizing process which is similar to that in Kohonen's SOM model, although heuristics are used to add and remove units. The main advantage of this model is that the network size does not have to be specified in advance; instead, the growth process can be continued until a specified performance criterion is met. Hierarchical maps, supervised SOMs, adaptive-subspace SOMs and SOMs where the neighborhoods are defined in the input space are a few other variants of the basic SOM algorithm, discussed in more detail in [76].

With SOMs, when the input vector distribution has a non-uniform shape, the results of the best-match computations tend to be concentrated on a fraction of the units in the map, whereas if the input vector distribution is more uniform, the reference vector set neatly adapts to the input data. Because of this, researchers have abandoned the definition of topologies on the basis of spatial adjacency relations in the network, and have defined the neighborhoods according to the relative magnitudes of the vectorial differences in the input space. To achieve this, Kangas et al. [66] used the Minimal Spanning Tree (MST) architecture to define the neighborhood relationships, and Martinetz et al. [88] introduced the Neural Gas (NG) algorithm. In this thesis, I will make use of the NG algorithm, which provides the basis for the architecture proposed in Chapter 4.

2.7 Neural Gas Algorithm

Martinetz et al. [87, 88] proposed the neural gas network algorithm for vector quantization, prediction and topology representation a few years ago. The NG algorithm solves most of the above problems associated with the basic SOM algorithm. The model: (i) converges quickly to low distortion errors, (ii) reaches a distortion error lower than that resulting from K-means clustering, maximum-entropy clustering and Kohonen's SOM, and (iii) obeys a gradient descent on an energy surface.

Similar to the SOM algorithm, the NG algorithm uses a soft-max adaptation rule (i.e., it not only adjusts the winning reference vector, but also affects all cluster centers depending on their proximity to the input signal). This serves not only to generate the topographic map but also to avoid confinement to local minima during the adaptation procedure.

In the NG algorithm, the reference vectors w_i are adapted without any fixed topological arrangement of the neural units within the network. Instead, it utilizes a neighborhood ranking of the reference vectors w_i for a given data vector x. The reference vector changes Δw_i are not determined by the relative distances between the neural units within a topologically prestructured lattice, but by the relative distances between the neural units within the input space.

Information about the arrangement of the receptive fields within the input space is implicitly given by the set of distortions, D_x = { ||x − w_i||, i = 1, 2, ..., N }, associated with each x, where N is the number of units in the network [88]. Each time an input signal x is presented, an ordering of the elements of the set D_x is necessary (because of the ranking) to determine the required adjustments of the reference vectors w_i. The resulting adaptation rule can be described as winner-takes-most rather than winner-takes-all. So, using a discrete-time formalism, given the reference vector w_i(t) of unit i at time t, the updated reference vector w_i(t+1) at time t+1 is defined by:

    w_i(t+1) = w_i(t) + α(t) h_λ(k_i) [x(t) − w_i(t)]    (2.3)

where α(t) is the learning rate (0 < α(t) ≤ 1), k_i is a rank index (the rank associated with neural unit i), and h_λ(k_i) is a neighborhood function. The learning rate α(t) = α_i (α_f/α_i)^{t/t_max} decays from α_i to α_f as the adaptation step t reaches its prespecified maximum t_max (see Figure 2.4). The index k_i ∈ {0, 1, ..., N−1} describes the neighborhood ranking of the units, with k_i = 0 for the reference vector closest to the input data vector x (denoted w_{i_0}), k_i = 1 for the second closest reference vector (w_{i_1}) to x, and so on. That is, the sequence (w_{i_0}, w_{i_1}, ..., w_{i_{N−1}}) is the neighborhood ranking relative to the input data vector x, such that

    ||x − w_{i_0}|| < ||x − w_{i_1}|| < ... < ||x − w_{i_{N−1}}||.

The neighborhood function h_λ(k_i) = exp(−k_i/λ(t)) decays to zero for increasing k_i, with neighborhood shrinkage rate λ(t). (Martinetz et al. [88] reported that this exponential function gives the best overall result for the neighborhood function, compared to other choices such as Gaussians.) Here λ(t) = λ_i (λ_f/λ_i)^{t/t_max}, which decays from λ_i to λ_f as the adaptation step t reaches its maximum t_max (see Figure 2.4). The role of λ here is to determine the number of units significantly

changing their reference vectors with the adaptation step (2.3). In the early part of the training process, equation (2.3) updates almost all the reference vectors in the network, but over time it gradually changes to updating the winner unit and the first few runners-up. In the latter part of the training process it updates the reference vector of the winner unit only. Figure 2.5 shows a pseudo-code description of the basic NG algorithm.

Figure 2.4: Behavior of a decay function θ(t) = θ_i (θ_f/θ_i)^{t/t_max} (standing for either α(t) or λ(t)) as the parameter t reaches its maximum t_max.

Initialization: Initialize the reference vectors w_i randomly and the training parameters (α_i, α_f, λ_i, λ_f, t_max), where α_i, λ_i are the initial values of α(t), λ(t) and α_f, λ_f are the final values.
Sampling: Select an input pattern vector x from the input space X.
Sorting: Compute the distortions d_i = ||x − w_i|| and order the distortion set D_x in ascending order.
Updating: Adapt the reference vectors according to:
    w_i(t+1) = w_i(t) + α(t) exp(−k_i/λ(t)) [x(t) − w_i(t)]
The parameters have the following time dependencies: α(t) = α_i (α_f/α_i)^{t/t_max}, λ(t) = λ_i (λ_f/λ_i)^{t/t_max}. Increment the time parameter t by 1.
Termination: Repeat the sampling, sorting and updating steps until the maximum iteration count t_max is reached.

Figure 2.5: The basic NG algorithm.
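The same steps can be transcribed directly into code. The sketch below follows Figure 2.5 under the parameter schedules given in the text; initializing the reference vectors from randomly chosen data points, and the default parameter values, are conveniences of this sketch rather than part of the original specification.

import numpy as np

def ng_train(data, n_units, t_max, a_i=0.5, a_f=0.005, l_i=10.0, l_f=0.01):
    rng = np.random.default_rng(0)
    w = data[rng.integers(len(data), size=n_units)].astype(float)  # Initialization
    for t in range(t_max):
        x = data[rng.integers(len(data))]                  # Sampling
        d = np.linalg.norm(x - w, axis=1)                  # distortions ||x - w_i||
        k = np.argsort(np.argsort(d))                      # Sorting: rank index k_i, O(N log N)
        alpha = a_i * (a_f / a_i) ** (t / t_max)           # alpha(t) schedule
        lam = l_i * (l_f / l_i) ** (t / t_max)             # lambda(t) schedule
        w += alpha * np.exp(-k / lam)[:, None] * (x - w)   # Updating, eq. (2.3)
    return w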

2.7.1 Properties of the NG Algorithm

Let W represent the whole set of reference vectors (i.e., W = {w_1, w_2, ..., w_N}). In the NG algorithm the dynamics of the reference vectors w_i obeys a stochastic gradient descent on the following cost function [87]:

    E(W, λ) = 1/(2C(λ)) Σ_{i=1}^{N} ∫ d^D x  P(x) h_λ(k_i(x, W)) (x − w_i)²    (2.4)

    with  C(λ) = Σ_{k=0}^{N−1} h_λ(k)

where P(x) describes the probability distribution of the input data vectors and D is the dimension of the input space.

The cost function E(W, λ) is related to fuzzy clustering [87]. To see this, note that, in contrast to hard clustering, where each data vector x is deterministically assigned to its closest reference vector, fuzzy clustering associates a data vector x with a reference vector w_i to a certain degree u_i(x), the so-called fuzzy membership of x in cluster i [11]. So, if u_i(x) = h_λ(k_i(x, W)) / C(λ) is chosen, then the average distortion error obtained (which has to be minimized) is given by E(W, λ), and the corresponding gradient descent is given by the adaptation rule in equation (2.3). The shape of the cost function E(W, λ) depends on the neighborhood shrinkage rate λ. By gradually decreasing the parameter λ, any local minima of E(W, λ) emerge slowly, and this helps prevent the set W of reference vectors from getting trapped in suboptimal states. Figure 2.6 shows the behavior of the neighborhood function h_λ(k_i) = exp(−k_i/λ(t)) against the rank index k_i and the neighborhood shrinkage rate λ(t).


Figure 2.6: Behavior of the neighborhood function h_λ(k_i) = exp(−k_i/λ(t)) against the rank index k_i and the neighborhood shrinkage rate λ(t).

2.7.2 Time Complexity of the Original NG Algorithm

Despite all its advantages over the SOM network, the NG network algorithm suffers from high time complexity in its sequential implementation [21, 88]. The computationally expensive part of the adaptation step of the NG algorithm is the determination of the neighborhood ranking. This requires an explicit ordering of all distances between the reference vectors and the input pattern, which has time complexity O(N log N). In a parallel implementation, each reference vector w_i can be assigned to a computational unit (unit i). To determine its rank index k_i for input x, each unit i has to compare the distortion ||x − w_i|| with the distortions ||x − w_j|| of all the other units j ≠ i; if all units perform these comparisons in parallel, each unit i needs O(N) time steps to determine its rank index. Although a highly parallel implementation can improve the processing speed, sequential implementation is still more economical, practical and flexible, especially with the recent wide availability of powerful processors, including digital signal processors. In the next section I describe a method of neighborhood ranking that avoids explicit ordering in a way that allows considerable speedup in the sequential implementation of the ranking process.

2.8 Extensions to the NG Algorithm

The fact that the original NG algorithm suffers from high time complexity in its sequential implementation limits its application to practical problems. Even for hardware implementations, it has been shown that training in high-dimensional domains (e.g., video coding) using the NG algorithm incurs a huge computational cost for the sorting process [112]. Recently, some attempts have been made to address this problem. I discuss these next, and then introduce my own approach, which is an implicit, rather than explicit, ranking scheme.

2.8.1 Partial Sorting

In seeking a solution to the time complexity problem, some authors have proposed partial sorting instead of full sorting of the distances between the reference vectors and the input pattern (i.e., ordering only the top few winning units correctly and leaving all other units in the list unaffected). Ancona et al. [1] discussed the question of sorting accuracy and sorting completeness. Firstly, they addressed the sorting process by measuring the relevance of each position in the distortion list (i.e., the sorted list of the distortion set D_x) and showed that the relative importance decreases monotonically from the top position in the list to the bottom. They concluded that correct identification of the top unit in the list becomes more and more important as training proceeds. This can be explained by the fact that, as training proceeds, the adaptation step in equation (2.3) becomes equivalent to the K-means adaptation rule [87]. Using theoretical analysis and experimental evidence, the authors of [1] concluded that partial sorting is acceptable, and is sufficient to attain a final distortion similar to that attained by the method used in the original NG algorithm.

However, with partial sorting two crucial issues arise: (1) how many positions in the list must be computed exactly to attain an acceptable distortion, and (2) what is the best processing strategy for the positions not included in the partial list. For the first issue, they showed empirically that a few (typically five to ten) top list positions are enough to attain a final distortion similar to that attained by the method used in the original NG algorithm. For the second issue, they analyzed two experimental set-ups. In the first experiment, uniform noise was injected into the remaining list positions (i.e., the list positions that are not partially sorted) and their units were adjusted according to their list positions. In the second experiment, the remaining list positions were ignored, leaving the corresponding units unaffected. They concluded that the second set-up leads to better performance than the first. Choy et al. [21] arrived at similar conclusions.
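The strategy can be illustrated with a top-k selection that orders only the few closest units exactly and leaves the rest untouched. The sketch below uses np.argpartition purely for illustration; it is not the routine used in [1] or [21], and the function name is mine.

import numpy as np

def partial_ranks(d, top=5):
    # Exact ranks for the 'top' smallest distortions only; all other units
    # receive no rank and are simply left unaffected (the second, better
    # performing strategy of [1]).
    top = min(top, len(d))
    idx = np.argpartition(d, top - 1)[:top]   # unordered indices of the top group
    order = idx[np.argsort(d[idx])]           # exact ordering within the top group
    return {int(unit): rank for rank, unit in enumerate(order)}

# For example, with distortions d = (7, 2, 4, 5, 9, 8):
print(partial_ranks(np.array([7, 2, 4, 5, 9, 8]), top=3))   # {1: 0, 2: 1, 3: 2}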

2.8.2 Implicit Ranking Method

My own contribution to this issue is to avoid an explicit ordering process by employing the following implicit ordering metric:

    m_i = (d_i − d_min) / (d_max − d_min)    (2.5)

where d_min and d_max are respectively the minimum and maximum distances between the training pattern and all reference vectors in the network, and d_i = ||x − w_i||. With this metric the best matching (winner) unit will always have an index of 0 (so this method always picks the index of the winner unit correctly, which is the most important part of the sorting process, as shown by Ancona et al. [1]); the worst matching unit will have an index of 1, and all other units will take values between 0 and 1 (i.e., 0 ≤ m_i ≤ 1). Using the m_i of (2.5), a new rank index m'_i can be obtained by

    m'_i = roundup( m_i (N − 1) )    (2.6)

where roundup(·) is the round-up operator.

Note that, in contrast to the partial sorting method, all positions receive a ranking with this implicit metric. Figure 2.7 shows a simple example with six units (i.e., N = 6) where, for simplicity, the d_i have been given integer values. In this example, the m'_i (given by equation (2.6)) differ from the "true" rank indices k_i in that they assign the same rank of 4 to two

of the elements. Note that the ordering implied by the two ranking methods in the example is otherwise the same. The implicit ranking method in fact provides a partial rather than complete ordering. That is, for any two unit positions i and j, if d_i < d_j then m'_i ≤ m'_j always holds. As a consequence of this, distance values which are close together tend to get the same rank index.

    d_i  :  7     2     4     5     9     8        (N = 6;  d_min = 2,  d_max = 9)
    m_i  :  0.71  0     0.29  0.43  1     0.86
    m'_i :  4     0     1     2     5     4
    k_i  :  3     0     1     2     5     4

Figure 2.7: Here the d_i denote distortions from a set D_x in a six-unit (N = 6) network. The rank indices k_i are obtained by a complete sorting of the set, and the implicit rank indices m'_i are obtained by using equation (2.6).

Due to the introduction of the implicit ranking method, the neighborhood function used in equation (2.3) has to be modified accordingly. The new neighborhood function can be calculated as follows:

    h_λ(k_i) = exp( −k_i / λ ) = exp( −roundup( m_i (N − 1) ) / λ )    (2.7)

Since the argument m_i (N − 1) in (2.7) is large when N is large, without loss of generality the roundup(·) operator can be ignored and the calculation simplified further to give:

    h_λ(k_i) ≈ exp( −m_i (N − 1) / λ )

Let λ' = λ / (N − 1). Now the new neighborhood function is given by:

    h'_{λ'}(m_i) = exp( −m_i / λ' )    (2.8)

Note that in this modified neighborhood function, m_i can be obtained directly from equation (2.5), and the scaling factor (i.e., N − 1) is simply incorporated into λ. This further simplifies the reference vector update rule. With these modifications, the steps of the original NG algorithm given in Figure 2.5 change as follows. Instead of explicit sorting, the algorithm now only has to find the minimum and maximum entries in the distortion set, and this has a time complexity of only O(N). The new update rule is then given by:

    w_i(t+1) = w_i(t) + α(t) h'_{λ'(t)}(m_i) [x(t) − w_i(t)]    (2.9)
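Putting equations (2.5), (2.8) and (2.9) together, one adaptation step needs only O(N) work. The following sketch shows the step in isolation; the function name and the guard against d_max = d_min are my additions.

import numpy as np

def ng_step_implicit(w, x, alpha, lam_prime):
    d = np.linalg.norm(x - w, axis=1)      # distortions, one O(N) pass
    d_min, d_max = d.min(), d.max()        # O(N) scan; no sorting involved
    if d_max == d_min:                     # degenerate case: all units equidistant
        return w
    m = (d - d_min) / (d_max - d_min)      # implicit ranking metric, eq. (2.5)
    h = np.exp(-m / lam_prime)             # modified neighborhood function, eq. (2.8)
    w += alpha * h[:, None] * (x - w)      # update rule, eq. (2.9)
    return w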

Now let us consider the improvement offered by this implicit ranking method over the explicit ranking method. In my experiments, I divided the ranking time of the NG algorithm into two distinct phases. The first phase consists of the distance calculation, which is common to both explicit sorting and implicit ranking. The second phase consists of the actual ranking process. In Figure 2.8, t_d denotes the common distance calculation time, t_s denotes the time taken by the explicit ranking procedure (using qsort(), the C quicksort library routine, with time complexity O(N log N)), and t_m denotes the time taken to calculate the implicit ranking metric defined in equation (2.5).

Figure 2.8: The two phases of the ranking time of the NG algorithm: the common distance calculation time t_d is followed by either the explicit ranking time t_s or the implicit ranking time t_m; the difference between t_s and t_m is the improvement.

Since the distance calculation time, t_d, is common to both procedures, it is not included in the comparisons. Hence the percentage time improvement (or speed-up) of the implicit ranking metric over the sorting algorithm can be calculated as:

    speed-up = (t_s − t_m) / t_s × 100%    (2.10)

The above speed-up factor is independent of the dimensionality of the input data, because t_d is not included in equation (2.10). Without loss of generality, for the comparison of the speed-up improvement I used a set of handwritten digit data from the NIST SD3 database, with fixed numbers of input patterns and network units and a fixed t_max. Simulations were performed on a Pentium II 350MHz personal computer; explicit ranking took 153 seconds, while implicit ranking took 28 seconds. With my method, I was able to achieve a speed-up of 81.8% over the explicit ranking method, and reduced the time complexity of the ranking process from O(N log N) to O(N). Even though the above speed-up may seem modest in real terms, it becomes very significant when hierarchical architectures (see next chapter) are built on the NG algorithm. Note also that this modification allows the NG algorithm to be used as a basis for vector quantization training in hardware implementations, as in [112], since no actual sorting process is involved in the update rule.
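The speed-up of equation (2.10) is easy to estimate on any machine. The sketch below times NumPy's sort against the min/max scan on a synthetic distortion set; the size and repeat count are placeholders and do not reproduce the NIST configuration reported above.

import time
import numpy as np

rng = np.random.default_rng(0)
d = rng.random(1000)                       # one synthetic distortion set (N = 1000)

start = time.perf_counter()
for _ in range(10000):
    np.sort(d)                             # explicit ranking, O(N log N)
t_s = time.perf_counter() - start

start = time.perf_counter()
for _ in range(10000):
    (d - d.min()) / (d.max() - d.min())    # implicit ranking metric, O(N)
t_m = time.perf_counter() - start

print("speed-up = %.1f%%" % (100.0 * (t_s - t_m) / t_s))   # eq. (2.10)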


2.8.3 Speed-up Procedures

In addition to the above implicit ranking process, I also follow [21, 88] and update only those reference vectors whose effective learning rate, α(t) h'_{λ'}(m_i), is non-negligible. Since h'_{λ'}(m_i) is non-increasing in m_i, only those reference vectors with sufficiently large values of h'_{λ'}(m_i) are updated. This imposes a further modification on the reference vector updating phase, in the form of a truncated update rule. Let us rewrite equation (2.9) as:

    w_i(t+1) = w_i(t) + η_i(t) [x(t) − w_i(t)]

where η_i(t) = α(t) h'_{λ'}(m_i). Now only those units with a non-negligible effective learning rate η_i(t) are updated. If we choose a threshold η_T for η_i(t), the reference vector update rule in (2.9) is modified so that only those units where η_i(t) > η_T are updated; that is, those for which

    α(t) h'_{λ'}(m_i) > η_T.

Using (2.8) this gives

    α(t) exp( −m_i / λ'(t) ) > η_T,    or    m_i < λ'(t) ln( α(t) / η_T ).

But 0 ≤ m_i ≤ 1, so we can write:

    m_T = min{ 1, λ'(t) ln( α(t) / η_T ) }    (2.11)

That is, the reference vectors are now updated according to the following truncated rule:

    w_i(t+1) = w_i(t) + α(t) h'_{λ'(t)}(m_i) [x(t) − w_i(t)]    if m_i ≤ m_T
    w_i(t+1) = w_i(t)                                           otherwise        (2.12)

where m_T is given by (2.11). The same data set as in the previous section was used to compare the update rules in equations (2.9) and (2.12), with the same schedules for α(t) and λ'(t). For these parameters, the update rule in equation (2.9) performs 106,152,000 updates, but the truncated update rule, with a small threshold η_T, performs only 707,957 updates. Thus the number of updates is reduced by more than 99%. Another interesting property of this modification is that it slightly increases the recognition rate when the network is used as a classifier (as explained in Chapter 4). This is due to the fact that very small weight updates are generally noisy, and eliminating them improves recognition accuracy.
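In code, the truncation amounts to masking the update with the threshold of equation (2.11). The value of η_T below is an illustrative assumption; the threshold used in the experiment is not restated here.

import numpy as np

def ng_step_truncated(w, x, alpha, lam_prime, eta_T=1e-4):
    d = np.linalg.norm(x - w, axis=1)
    span = d.max() - d.min()
    if span == 0.0:
        return w
    m = (d - d.min()) / span                           # eq. (2.5)
    m_T = min(1.0, lam_prime * np.log(alpha / eta_T))  # threshold, eq. (2.11)
    mask = m <= m_T                                    # units with non-negligible rate;
    h = np.exp(-m[mask] / lam_prime)                   # late in training m_T shrinks,
    w[mask] += alpha * h[:, None] * (x - w[mask])      # so few units are touched (eq. 2.12)
    return w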

Chapter 3

Combining Multiple Classifiers

This chapter introduces motivations and methods for the combination of multiple predictive models. The problem of classifier combination has long been of interest to the pattern recognition community, and the fundamental issue is the way in which individual classifiers are combined to produce an output. In this chapter, I discuss the two main approaches that have appeared in the literature, highlighting the strengths and weaknesses of each. From then on I concentrate attention on one of these approaches (ensemble-based) and discuss methods for the creation of individual classifiers and how they can be combined to provide a group consensus.

3.1 Introduction

Suppose we are given several classifiers that could be employed in the solution of a difficult classification problem. It is generally accepted that, instead of choosing a single "best" classifier for the problem, a combination of several classifiers is likely to produce an improved prediction by making use of the various outputs simultaneously. Classifier combination approaches attempt to capture the information in all the classifiers and present a group consensus to give improved performance. Research on combining predictive models in neural computing can be traced back to Nilsson [98], who used a special case of a layered machine, called a committee machine, to provide the group consensus on the final decision. Combining predictive models is well known in a number of other fields, such as finance [86] (with portfolio theory, which has recently been applied by Mani [85] using artificial neural networks), machine learning [3] (evidence combination), econometrics [45] (linear combination of forecasts) and statistics [5, 60]. After combination, the composite classifier is generally more accurate than any one of the individual classifiers making up the system [101]. Combining the output of several classifiers is generally useful only if their outputs disagree for some patterns, a fact that is implied in the terms error independence [110], non-overlapping errors [69] and complementary errors [55].

There are two main approaches to combining classifiers, each of which has its own strengths and weaknesses. The first is the modular approach [119] (sometimes referred to as the dynamic classifier selection technique [134]). The second is the ensemble-based approach [119] (sometimes referred to as the classifier fusion technique [134]). In this section, I discuss both approaches but pay more attention to the latter, since it is employed in the proposed hierarchical architecture described in the next chapter. There are many published applications of these two approaches in the literature. To name a few in the pattern recognition area, there is handwritten recognition [58, 81, 106, 115, 136], remote sensing image

classification [18, 65], personal identity verification [70] and classification of human faces [48]. A theoretical framework for classifier combination can be found in [50, 68]. Although there are many publications on multiple classifier combination techniques, it is difficult to find a survey which covers all the main topics in this area, although an attempt was made by Clemen [22], who cites more than 200 publications in his review. Another attempt is by Sharkey [119, 121], whose work is possibly the most comprehensive. This chapter provides a review from a slightly different perspective, with an emphasis on those methods that are most relevant to the later chapters.

3.2 Main Approaches for Combining Classifiers

3.2.1 Modular-based Approach

A common form of modular system is one that exploits the specialist capabilities of individual modules, so as to achieve results which would not be possible in a single classifier system. An early example of this approach was given by Baxt [9] in 1992, where two networks were trained separately on populations of patients with different likelihoods (low risk and high risk) of a clinical condition (myocardial infarction), and a better accuracy was reported for the dual network model than for the best single model. In this system, if the output generated by the network trained on the low-risk patients was below an empirically set threshold, that output was chosen as the diagnostic output. If the output was above that threshold, the output of the network trained on the high-risk patients was used as the diagnostic output.

A modular architecture is suitable when the data can be well described by a


collection of functions, each of which is defined over a relatively local region of the input space. The modular architecture models such data by allocating different modules to different regions of the space. This approach can be described as the decomposition of a problem into a number of subtasks, each subtask being trained individually, with the predictions combined by a "gating" function that depends on the input. It is possible to identify a few different modes of combining component networks, namely sequential, supervisory and competitive combination [120, 121]. The first two modes are not as widely used or as popular as the third. In the sequential mode, as the word suggests, the output of an earlier network forms the input for the next network; most commonly, this is employed in hybrid systems, for example in speech recognition, where ANNs and hidden Markov models are combined in this way [14]. In the supervisory mode, one network module is trained to select the parameters of a second network on the basis of observations of the performance of the second network; an example is given in [90]. The most widely used modular combination mode, competitive combination, attempts to select the most appropriate module by means of a gating network [64]. Figure 3.1 shows this kind of modular architecture, where the gating network has a set of scalar coefficients g_i that serve to weight the contributions of the various networks. These coefficients are not fixed values, but vary depending on the input. The mixture-of-experts (ME) model of Jacobs et al. [62] is a classic example of this type of modular-based approach. Jordan et al. [63] extended the modular system to a hierarchical architecture called the hierarchical mixture-of-experts (HME), which recursively divides each region of the input space into subregions. This hierarchical architecture has close ties with the classification and regression

tree (CART) model [15] from statistics. Waterhouse et al. [128] further extended the ME model by introducing Bayesian methods to avoid the over-fitting that can occur with traditional maximum likelihood inference.

Figure 3.1: A mixture-of-experts architecture, an example of a modular-based approach to combining classifiers: a gating network produces input-dependent coefficients g_1, g_2, ..., g_N which weight the outputs of Classifier 1, ..., Classifier N before a combiner forms the final output.

3.2.2 Ensemble-based Approach

In this approach a set of networks is trained on what is essentially the same task, and their outputs are combined in some manner to achieve a group consensus. In contrast to the modular-based approach, which tries to choose the best classifier for a given input, the ensemble approach uses all the classifier outputs to reach a final decision. The ME model makes the assumption that a single expert is responsible for a region in the input space, and the job of the gating network is to decide from which region the data point originates. In the ensemble-based approach each network learns the whole region rather than just a subregion. An ensemble-based approach is appropriate when no one model is likely to be correct for all parts of the input space. Figure 3.2 illustrates a typical setup of a classifier ensemble.

Figure 3.2: An example of an ensemble-based approach to combining classifiers: Classifier 1, ..., Classifier N all process the same input, and their outputs are combined into the ensemble output.


As stated earlier, combining the output of several classifiers is useful only if there is disagreement among them. When neural networks are used as classifiers, there are a number of parameters that can be manipulated with this goal in mind. They are:

- the set of initial random weights: a set of neural network classifiers can be trained by varying the initial random weights while using the same training data set.
- the topology: a set of neural network classifiers can be trained by varying the architecture of the network while using the same training data set. For example, in feedforward neural networks this can be achieved by changing the number of hidden units/layers in the network.
- the algorithm: a set of neural network classifiers can be trained by using different training algorithms while using the same training data set.
- the training data: varying the training data is the most widely used method for creating ensembles. There are quite a few different ways of achieving this, including data preprocessing, data sampling and re-sampling methods. By using different preprocessing methods, different feature sets can be extracted from the data, used to train different classifiers in the ensemble, and the outputs combined to obtain the final ensemble output. Alternatively, the input data for individual classifiers in an ensemble can be distorted in different ways, such as by injecting noise or by using non-linear transformations.

One further approach, which is made use of in this thesis, involves data sampling and re-sampling, and this is discussed later in this chapter.

3.3 Specific Examples of Ensemble-based Classifiers

As stated earlier, this chapter focuses its attention on ensemble-based classifiers, since these are employed in the learning system developed in this thesis. This section discusses methods of combining classifier decisions in an ensemble to achieve a group consensus. The idea of combining the output of multiple classifiers has been studied for several years, but there exists no golden rule for choosing a suitable combination method for a given problem. However, if the classifiers of an ensemble produce only labels, one can adopt a voting scheme such as plurality voting or majority voting. If the classifiers produce continuous outputs like posterior probabilities, an average or some other linear combination can be applied. If the classifier outputs are interpreted as fuzzy membership values, belief values or evidences, one can apply fuzzy rules, belief functions or Dempster-Shafer techniques respectively. The following discusses some of these classifier combination techniques in detail.

3.3.1 Simple Voting

The most straightforward form of opinion integration is to let the classifiers cast a vote by forwarding the class they most prefer. The simplest (and quite effective) voting rule is plurality, in which the collective decision is the classification reached by more classifiers than any other; in other words, the class with the most votes wins. A variation on this is the majority voting rule, which chooses the classification made by more than half the classifiers. When there is no agreement among more than half the classifiers, the result is considered an error, or undecided. A correct decision by majority voting is inevitably a correct decision by plurality. For a two-class problem, the majority vote is equivalent to the plurality vote.
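Both rules reduce to counting label frequencies over the classifiers' votes; a small sketch:

from collections import Counter

def plurality_vote(labels):
    # The class predicted by more classifiers than any other wins.
    return Counter(labels).most_common(1)[0][0]

def majority_vote(labels):
    # The class must be chosen by more than half the classifiers;
    # otherwise the result counts as an error / undecided.
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

votes = ["A", "B", "B", "C", "B"]
print(plurality_vote(votes))   # B
print(majority_vote(votes))    # B (3 of 5 votes is more than half)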

3.3.2 Data Sampling and Re-sampling

A common approach to creating different classifiers is to use some form of data sampling technique, such that each classifier of the ensemble is trained on a different subsample of the training data. In the statistics literature, cross-validation and bootstrapping are well known, and are widely used methods for estimating the error of a predictor from a limited set of data. The bootstrapping technique is based on random sampling with replacement from a data set, and the resulting sets of data are called bootstrap replicates. Recently, in the machine learning literature, two new algorithms have been described for creating ensemble members using the bootstrapping technique. These two algorithms are similar in nature but differ in the way they construct their training sets. One is called bagging, the other boosting.

Bagging Classifiers

The bagging algorithm is due to Leo Breiman [17], and is based on bootstrapping and aggregating (hence the acronym bagging). In bagging, a bootstrap sample is generated by uniformly sampling N patterns with replacement from the original training set of N patterns. For a given bootstrap sample, a pattern in the training set has probability 1 − (1 − 1/N)^N of being selected at least once in the N draws. For large N this is about 1 − 1/e ≈ 63.2% [6], which means that, on average, about 63.2% of the patterns in the training set of each ensemble member are distinct, with the remainder being duplicates. Figure 3.3 is a pseudo-code listing for the bagging algorithm.

Input: a sequence S of N labeled patterns (x_1, y_1), ..., (x_N, y_N) with labels y_i ∈ Y; a weak learning system WeakLearn; an integer T specifying the number of iterations.
Repeat: for t = 1, 2, ..., T:
  1. Draw a bootstrap sample S_t from S (an i.i.d. sample of N patterns drawn with replacement).
  2. Obtain a hypothesis h_t = WeakLearn(S_t).
Output: the final hypothesis
    h_fin(x) = arg max_{y ∈ Y} |{ t : h_t(x) = y }|
(i.e., this selects the most often predicted label y; ties are broken arbitrarily).

Figure 3.3: The bagging algorithm.

According to Breiman [17], bagging generates diverse classifiers only if the base learning algorithm is unstable; that is, small changes in the training set result in large changes in the learned classifier. Since neural networks and decision trees are examples of unstable learning algorithms, they are good candidates for this type of ensemble method. In bagging, best results are often obtained when the number of bootstrap replicates is large, but Breiman [17] suggests that this is not too important when procedures like CART or the C4.5 decision tree are used, because their running times are small. When neural networks are used, however, the number of


bootstrap replicates can be an issue.
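The procedure of Figure 3.3 takes only a few lines in code. The sketch below assumes a generic training interface, where train(X, y) returns a callable model; this interface is my own convention for the sketch, not a particular library's API.

import numpy as np
from collections import Counter

def bagging(train, X, y, T, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(n, size=n)          # bootstrap: n i.i.d. draws with replacement
        models.append(train(X[idx], y[idx]))   # each member sees ~63% distinct patterns
    def h_fin(X_new):
        preds = [m(X_new) for m in models]     # T label sequences, one per member
        # plurality vote per test pattern; ties broken arbitrarily
        return [Counter(col).most_common(1)[0][0] for col in zip(*preds)]
    return h_fin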

Boosting Classifiers

Boosting was introduced by Schapire [117] as a method for boosting the performance of a "weak" learning algorithm. In the implementation of boosting, if we have a population of N training patterns available, we select N patterns randomly with replacement and use these to train an initial classifier. So, in this first training "round" there is no difference between bagging and boosting. However, from the second round onwards the selection process is modified, so that patterns that were classified incorrectly in previous rounds are more likely to be selected for training than patterns that were correctly classified. Thus, boosting attempts to produce new classifiers that are better able to classify patterns for which the current ensemble's performance is poor. This is in contrast to bagging, where the re-sampling of the training set does not depend on the performance of earlier classifiers. As a result, boosting must generate its classifiers sequentially, while bagging can generate them in parallel. Table 3.1 shows an example of how boosting might work on an imaginary data set. The two most popular versions of boosting, viz. AdaBoost.M1 and AdaBoost.M2 [34], will be briefly described in Chapter 5, where they are applied to the learning system developed in this thesis. Empirical comparisons of voting classification algorithms such as bagging, boosting and their variants can be found in [6, 24, 101, 104], and the application of AdaBoost to neural networks can be found in [101, 118].


Table 3.1: Hypothetical runs of the boosting algorithm. Assume that pattern number 1 is an outlier and is hard for the learning algorithm to classify correctly. In each round the re-sampled training set consists of patterns from the original training set, but the hard patterns (in this case pattern 1) occur more often in the later rounds. The original patterns of the training set are: 1, 2, 3, 4, 5, 6, 7, 8.

    Round    Re-sampled Training Set
    1        2, 7, 8, 3, 7, 6, 3, 1
    2        1, 4, 5, 4, 1, 5, 6, 4
    3        7, 1, 5, 8, 1, 8, 1, 4
    4        1, 1, 6, 1, 2, 3, 1, 4
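A run like Table 3.1 can be produced by re-weighting the sampling distribution before each draw. The doubling/halving update below is purely illustrative; AdaBoost's actual weight updates are given in Chapter 5.

import numpy as np

def boosting_rounds(n, is_hard, T, seed=0):
    # 'is_hard' marks patterns the classifiers keep getting wrong (pattern 1
    # in Table 3.1); here it stands in for real error feedback.
    rng = np.random.default_rng(seed)
    weights = np.ones(n) / n
    for t in range(T):
        sample = rng.choice(n, size=n, p=weights) + 1     # 1-based pattern numbers
        print("Round", t + 1, ":", list(sample))
        weights = weights * np.where(is_hard, 2.0, 0.5)   # emphasize misclassified patterns
        weights = weights / weights.sum()                 # renormalize to a distribution

boosting_rounds(8, is_hard=np.array([True] + [False] * 7), T=4)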

Bias/Variance Decomposition

The success of a classifier can be expressed by its generalization error: the lower the generalization error, the better the algorithm approximates the target function. It is possible to decompose the generalization error into two terms, called bias and variance. In past studies of bias and variance, it was shown that there is a trade-off between these two terms, and this apparent trade-off is often called the bias/variance dilemma [61]. The bias/variance decomposition is a powerful tool from statistical sampling theory for analyzing supervised learning methods that have quadratic loss functions. The origins of the bias and variance decomposition are in the area of quadratic regression, where both bias and variance are always non-negative and averaging decreases the variance term without affecting the bias term. Generally, in this case, the generalization error can be decomposed into a sum of two terms as follows:

    Error = bias² + variance


For classification problems, Schapire et al. [116] suggested that it may be inherently more difficult, or even impossible, to find a bias and variance decomposition as natural and satisfying as in the quadratic regression case. But recently several proposals for decomposing classification error into bias and variance have been made [16, 72, 77, 125]. While the details of these decompositions differ from author to author, they all attempt to capture the following quantities: the bias term measures the consistent error that the learning algorithm makes over many different runs, and the variance term measures the error that is due to fluctuations from run to run. The basic idea is that, by averaging over many runs, one can reduce the variance term and in that way reduce the expected error. The bias and variance of a classifier can be estimated when the classifier is trained on different sets of data sampled randomly from a given set of patterns, as in bagging and boosting. In order to obtain the best generalization error, there needs to be a compromise between the conflicting requirements of small variance and small bias. Both bagging and boosting reduce the generalization error by reducing the variance term. However, Bauer et al. [6] demonstrated that boosting does more than just reduce variance, and does indeed seem to reduce the bias term as well.

3.3.3 Fuzzy Integrals

The fuzzy integral is a non-linear function that is defined with respect to a fuzzy measure, which in turn is either a belief or a plausibility measure in the sense of Dempster-Shafer theory [124]. Banon [2] has given a detailed description of fuzzy measures, and clarified the specificity of each of the measures.


The fuzzy integral introduced by Sugeno [123], and the associated idea of a fuzzy measure, provide a useful way of aggregating information [19, 20, 78, 124]. In the following I briefly describe the idea behind fuzzy measures and fuzzy integrals.

Definition 3.1 Let Y be a finite set of elements and P(Y) denote the power set of Y (i.e., the set of all subsets of Y). A set function g : P(Y) → [0, 1] having the following properties is called a fuzzy measure:

1) g(∅) = 0;
2) g(Y) = 1;
3) if A, B ∈ P(Y) and A ⊂ B, then g(A) ≤ g(B) (this is called the monotonicity property).

Note how this differs from the definition of a probability measure: the additivity property of a probability measure is replaced by the much weaker monotonicity property, which is the fundamental property of fuzzy measures. By the nature of the definition of the fuzzy measure g, the measure of the union of two disjoint subsets cannot be directly computed from the component measures. In order to extend the concept to include this case, Sugeno [123] introduced the so-called λ-fuzzy measure, which satisfies the following additional property:

    g(A ∪ B) = g(A) + g(B) + λ g(A) g(B),    A, B ∈ P(Y),  A ∩ B = ∅,  for some λ > −1    (3.1)

Equation (3.1) gives the measure of the union of two disjoint fuzzy subsets.


We have that:

    if λ > 0, then g(A ∪ B) > g(A) + g(B);
    if λ < 0, then g(A ∪ B) < g(A) + g(B).

The λ-fuzzy measure for λ = 0 (i.e., g(A ∪ B) = g(A) + g(B)) is identical to a probability measure, and in this case the fuzzy integral corresponds to the probability expectation [2].

The calculation of the fuzzy integral with respect to a fuzzy measure g is as follows. Let Y = {y_1, y_2, ..., y_n} be a finite set of elements that can be used to evaluate some object of interest. Now let us evaluate the object using a function h : Y → [0, 1], so that h(y_i) represents a partial evaluation of the object from the viewpoint of the element y_i. Let the set function g({y_i}) be a fuzzy measure, as defined in Definition (3.1), which denotes the degree of importance of the element y_i in the overall evaluation of the object. (The terms "partial evaluation" and "degree of importance" are standard terms in the fuzzy systems literature, and I employ them in what follows to distinguish between h and g.) As defined by Sugeno [123], the fuzzy integral e, over the set Y, of the function h with respect to the fuzzy measure g is given by:

    e = ∫_Y h(y) ∘ g(·) = sup_{E ⊆ Y} [ min( min_{y ∈ E} h(y), g(E) ) ]

For finite sets Yager [137] has introduced a much simpler expression when dealing with fuzzy integrals. Assume the h(y_i)'s are ordered in descending order, h(y_1) ≥ h(y_2) ≥ ... ≥ h(y_n) (if necessary, Y is rearranged so that this relation holds). Then the fuzzy integral e is given by:

    e = max_{i=1}^{n} [ min( h(y_i), g(A_i) ) ]    (3.2)

where A_i = {y_1, y_2, ..., y_i}.

Equation (3.2) represents the aggregation of n partial evaluations, where the fuzzy integral e can be considered as the overall evaluation of the object. The partial evaluations h(y_i) can be determined either objectively or subjectively. The fuzzy measure g({y_i}) can be subjectively assigned by an expert or estimated objectively. In the illustrative example below I use it as a subjective measure (but in the HONG network it is determined objectively, as will be discussed in Chapter 4). Hereafter g({y_i}) is denoted as g_i.

of the elements can be implemented using equation (3.1), when

 is a  fuzzy

measure, as follows:

  



.   





  



or





 











  









 







for







.  .      .    



using

Definition (3.1) property (3)

CHAPTER 3. COMBINING MULTIPLE CLASSIFIERS

56

The value of λ is obtained from the boundary condition g(A_n) = g(Y) = 1, which is equivalent to solving the equation:

    λ + 1 = ∏_{i=1}^{n} ( 1 + λ g_i )    (3.3)

where λ > −1 and λ ≠ 0. It has been proved [79, 124] that for a fixed set of g_i, 0 < g_i < 1, equation (3.3) has a unique root greater than −1 and not equal to zero. Thus, given the individual partial evaluations h(y_i), the calculation of the fuzzy integral with respect to a λ-fuzzy measure only requires knowledge of the degrees of importance g_i of the elements in Y.
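Finding that root numerically is straightforward. The following sketch brackets it according to whether the sum of the g_i is below or above 1 (a standard property of λ-fuzzy measures) and uses SciPy's brentq root finder; the function name is mine.

import numpy as np
from scipy.optimize import brentq

def solve_lambda(g):
    # Root of eq. (3.3): prod(1 + lambda * g_i) = lambda + 1, lambda > -1, lambda != 0.
    g = np.asarray(g, dtype=float)
    f = lambda lam: np.prod(1.0 + lam * g) - lam - 1.0
    s = g.sum()
    if abs(s - 1.0) < 1e-12:
        return 0.0                        # the g_i are already additive (probability measure)
    if s < 1.0:
        return brentq(f, 1e-9, 1e9)       # the unique root is positive when sum(g_i) < 1
    return brentq(f, -1.0 + 1e-9, -1e-9)  # the unique root lies in (-1, 0) when sum(g_i) > 1

print(round(solve_lambda([0.37, 0.40, 0.13]), 3))   # 0.391, as used in Example 1 below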

The value obtained by comparing the two quantities h(y_i) and g(A_i) in terms of the min operator in (3.2) is interpreted as the grade of agreement between the two measures: (i) the partial evaluation h(y_i), and (ii) the degree of importance g(A_i) of the elements. The final max operator takes the biggest of these terms. Hence fuzzy integration can be interpreted as searching for the maximal grade of agreement between the partial evaluations and the importance of the elements. The following example illustrates the concept of the fuzzy integral and the associated calculations. Details of the above definitions, and other properties of the fuzzy integral, are given in [2, 79, 137].

Example 1: Consider the following simple case of three neural networks configured to classify a two-class problem. That is, Y = {y_1, y_2, y_3}, where y_i = Neural Network i. In this setup, assume that each neural network produces a partial evaluation for both classes. Figure 3.4 illustrates the case where Neural Network 1 produces the partial evaluations 0.7 for Class 1 and 0.5 for Class 2. Also assume that the importance of these networks in the evaluation of the above partial evaluations is assigned by an expert: Neural Network 1 is assigned the value g_1 = g({y_1}) = 0.37. All the partial evaluations and the degrees of importance of the networks are given in Figure 3.4.

    Network             h(y_i) for Class 1   h(y_i) for Class 2   importance
    Neural Network 1         0.7                  0.5             g_1 = 0.37
    Neural Network 2         0.8                  0.85            g_2 = 0.40
    Neural Network 3         0.4                  0.9             g_3 = 0.13

Figure 3.4: A simple example to demonstrate the use of the fuzzy integral in multiple network fusion: three neural networks each produce a partial evaluation of an input sample for both classes, and an integrator combines these with the degrees of importance g_i to select the output class.

Since the degrees of importance of all three networks are known, equation (3.3) can be used to calculate the value of λ as λ = 0.391.

In fuzzy-integral-based decision combination, a fuzzy integral is calculated for each class, and the integral with the largest value indicates the class label. For example, using equation (3.2) the fuzzy integral for Class 1 can be calculated as follows. After rearranging the partial evaluations of Class 1 in descending order, h(y_2) = 0.8, h(y_1) = 0.7, h(y_3) = 0.4, the following is obtained:

    g(A_1) = g_2 = 0.40
    g(A_2) = g_1 + g(A_1) + λ g_1 g(A_1) = 0.37 + 0.40 + 0.391 × 0.37 × 0.40 ≈ 0.83
    g(A_3) = g_3 + g(A_2) + λ g_3 g(A_2) = 0.13 + 0.83 + 0.391 × 0.13 × 0.83 ≈ 1.00

    e_1 = max[ min(0.8, 0.40), min(0.7, 0.83), min(0.4, 1.00) ]
        = max[ 0.40, 0.70, 0.40 ] = 0.70

Similarly, after rearranging the partial evaluations of Class 2 in descending order, and following the same procedure as for e_1 above, e_2 ≈ 0.55. Since the fuzzy integral value for Class 1 is greater than the fuzzy integral value for Class 2, the fuzzy "integrator" assigns Class 1 as the output class for the input sample.

There are other ways in which the integrator in Figure 3.4 could make use of the available information in order to predict the class of the input sample. If, for instance, it were to use simple majority voting, where the predicted class is taken as the one chosen most often by the different networks, the integrator pays no attention to the g_i values and simply labels the input sample as Class 2. If instead it made its decision on the basis of adding the partial evaluations, the two classes would get the following:

    Total evaluation for Class 1 = 0.7 + 0.8 + 0.4 = 1.9
    Total evaluation for Class 2 = 0.5 + 0.85 + 0.9 = 2.25


Since the total evaluation for Class 2 is greater than for Class 1, the integrator would again assign Class 2 as the output class for the given sample. If the integrator used a weighted average, the two classes would get the following:

    Weighted average for Class 1 = 0.37 × 0.7 + 0.40 × 0.8 + 0.13 × 0.4 = 0.631
    Weighted average for Class 2 = 0.37 × 0.5 + 0.40 × 0.85 + 0.13 × 0.9 = 0.642

And once again the integrator would assign Class 2 as the output class for the given sample.

Figure 3.5: Calculation of the fuzzy integral for Class 1 and Class 2 respectively for the example shown in Figure 3.4.

Compared with the other decision-making methods, the main feature of the fuzzy integral method is that it gives more emphasis to networks with high values of both h(·) and g(·) and, as this example shows, it can lead to a different decision being made. In the fuzzy model, if a partial evaluation h(y_i) is smaller than the g(A_i) value, then h(y_i) contributes directly to the overall evaluation. If h(y_i) is larger than g(A_i), then the contribution to the overall evaluation from h(y_i) is cut off at the value of g(A_i). Figure 3.5 shows the calculation of the fuzzy integral pictorially for Classes 1 and 2 respectively. In both plots the horizontal axis represents the three networks y_1, y_2, y_3 (denoted 1, 2, 3) after rearranging them in descending order of their partial evaluations, and the labeling on the vertical axis has two interpretations: the solid line represents h(y_i) for each network y_i, and the dashed line represents g(A_i) for A_i = {y_1, y_2, ..., y_i}. As can be seen, for Class 1 the values of g(A_1) and g(A_2) are much higher than the val-

Chapter 4 Combining Supervised/Unsupervised Learning This chapter discusses the use of hybrid learning systems that combine supervised and unsupervised learning methods. Supervised learning systems are limited in a number of problem domains by the amount of labeled data that is available. To deal with this hybrid learning systems has an advantage in this regard. Hybrid systems can be implemented using a hierarchical approach with multiple layers where the processing units of each layer only have to concern themselves with the processing of a limited (i.e., localized) amount of information. By processing the input information one step at a time, hierarchical networks can provide capabilities that would otherwise be virtually impossible to deliver by a single powerful system. Here I introduce a hierarchical network architecture for pattern recognition problems. The basic aim in this level of structure is to build a hierarchy in which the higher levels specialize in distinguishing between patterns that belong to classes that are close together in pattern space. In this chapter I show how this 61

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

62

new architecture can handle labeled and unlabeled patterns and compare its performance with other methods on real-world data.

4.1 Introduction As described in Chapter 2, supervised and unsupervised learning are the two main research themes in the study of ANNs. Most neural network research applied to pattern recognition, image processing and machine vision has been focused on supervised learning, and network models like the MLP provide a highly efficient method to design an arbitrarily complex non-linear classifier [52]. However, there are some problem domains that are not solved in a satisfactory way by means of a single powerful classifier. When the abstraction level of the classification task increases, the shape of the regions associated together become increasingly complex, requiring impossibly large amounts of training data to form the class boundaries. This problem can be alleviated by using unsupervised learning techniques to reduce the number of degrees of freedom in the data. Hybrid learning systems which combine supervised and unsupervised learning methods have been very popular in this regard. The use of such systems can be traced back to the early eighties when Fukushima introduced a pattern recognition system called the neocognitron [43] which is a hierarchical classifier that can be trained by either supervised or unsupervised rules. Hybrid systems of this type are implemented using a hierarchical approach with multiple layers, where each layer is comprised of one or more networks [29, 96, 102]. The basic idea of hierarchical networks is to have processing units on each layer of the network only receive connections from

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

63

a restricted localized subset of the previous layer’s processing units [53]. This implies that each processing unit can only act upon partial information, which may come from the previous layer’s processing units. This is an advantage of the hierarchical approach because the processing units of each layer only have to concern themselves with the processing of a limited amount of information. By processing the input information one step at a time, hierarchical networks can provide capabilities that would otherwise be impossible to deliver. Some architectures use a bottom-up approach (see eg., [71]) while others use a top-down approach (eg., [25]) in the implementation of the hierarchy. In this context, Hsieh et. al. [57] have proposed a model for pattern recognition which has a multi-layer architecture with self-organizing capability at each layer. In this model, unsupervised learning is first applied to the whole network such that the network acquires an ability to implement rough clustering, then supervised learning is applied on some layers for fine classification. Weijters et. al. [130] have proposed another hybrid system called BP-SOM which is formed by combining back-propagation (BP) learning with the SOM algorithm. Here each hidden layer of the BP algorithm has an associated SOM network. It is shown in [130] that the BP-SOM network and its learning algorithm outperform the BP algorithm on various difficult tasks. Recently Merkl and his colleagues [26, 93, 94] have shown the importance of hierarchically organized neural networks for document classification in information retrieval tasks. Using such an architecture, they were able to reduce the time needed for training and provided more intuitive visualization1 . The next section introduces another type of hybrid system with a hierarchi1

Intuitive visualization is necessary in this kind of application in order to assist the user during the search for relevant documents in an easily accessible fashion.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

64

cal architecture which is based on the neural gas algorithm for pattern recognition problems. The learning algorithm used is a mixed supervised/unsupervised method with most of the learning being unsupervised.

4.2 Hierarchical Overlapped Architecture By retaining the essence of the original NG algorithm, I have been able to develop what I refer to as a hierarchical overlapped neural gas (HONG) network architecture for pattern recognition problems. The structure of the HONG network architecture is an adaptation of the hierarchical overlapped architecture developed for SOMs by Suganthan [122]. The basic aim in this level of structure is to build a hierarchy in which the higher levels specialize in distinguishing between patterns that belong to classes that are close together in pattern space. First a base network is initialized, but as in the SOM algorithm, there is no well defined procedure to initialize the units in this kind of a network. The size of the base network is influenced by the number of distinct classes in the data set as well as the number of training patterns; rules of thumb are used for this (see Section 4.5 for more details). In this architecture, as in the SOM architecture, every unit has a reference vector which has the same dimension as the input feature vector. Once the number of units in the base network has been selected their reference vectors are initialized randomly, and my accelerated version of the NG algorithm, discussed in Chapter 2, is applied to adapt the reference vectors. After this unsupervised NG learning stage, a labeled set of training patterns is presented to the network and the units in the base network are labeled according to the pattern class that they most frequently respond to. The decision boundaries obtained by

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

65

the unsupervised learning stage are now adjusted using the supervised procedure of learning vector quantization (LVQ) [75]. The LVQ algorithm modifies the reference vectors so as to improve the decision boundaries. The real advantage of combining supervised and unsupervised learning in an architecture like this comes when the training set contains a large number of unlabeled patterns and a relatively small number of labeled patterns. In such a situation, the labeled data is used to label the units in the network and to fine tune the labeled network in a supervised manner. In many real world applications (eg., speech and character recognition, data mining and classification of world wide web pages), unlabeled data are available in abundance while labeled data are very expensive. Now let us turn to the second level of the hierarchical structure. At this level a new NG network is created for each unit in the base network. This is depicted in Figure 4.1 where networks A , B and C are networks in the second level that are 





created for the base level units A, B and C. For the ease of explanation, let us consider network A . The reference vectors of the units of network A are initialized 



randomly in a similar fashion to the initialization of the base level unit A, but are now localized about the reference vector of unit A. This can be implemented as follows. Let  represent the set of training patterns used to train the network A (i.e., 









! , where is the dimension of the input patterns).

Also let



denote the reference vector of the base unit A and  denote the reference vector ¼

of the unit  in the second level network A . Then each entry ", " 

      , of

  

the second level reference vectors can be initialized as:

#

¼ 



#  ,&

(4.1)

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

Overlapped NG Networks C

C

66

Confidence Vector of overlapped NG network B’

C’

A’

C

B’

C’ A’

overlapped NG network grown from unit A

B’ overlapped NG network grown from unit B

Second Level

Base Level

C A

B Base NG Network

HONG Architecture Input feature vector

Feature Extractor

Input data vector

Figure 4.1: Hierarchical overlapped architecture showing three overlapped units, A, B and C from the base NG network being expanded to the second level. where , 

   and & is calculated using the Euclidean distance between the reference vector  and the sample training patterns in the set  . That is, &    



 

    



Initializing in this way allows the network A to concentrate on distinguishing 

between the types of pattern that A responds to most strongly. The network A



is trained on all training patterns in  for which base unit A is either the winner

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

67

or one of the  runners-up,  being a prespecified number (Tables 4.8 and 4.9 give typical values). Figure 4.2 shows a hypothetical example to illustrate this process which causes network A to become a specialized network for those classes to 

which base level unit A responds the most. This process of specializing on groups of training patterns of similar types leads to improved classification performance. Each of the networks in the second level (i.e., B , C and the rest) are trained in 



a similar fashion. The second level networks are referred to as overlapped NG networks because they are trained with overlapping sets of training patterns (i.e., individual training patterns are in general employed in the training of more than one second level NG network.). After training, the units in the overlapped NG (ONG) networks are labeled using the same procedure as is used in the base network. Responses of base level units to each pattern in V: (Only winner, first and second runner-up are considered) Pattern

Winner

1st

2nd

v1

D

E

C

v2

A

E

D

v3

C

B

A

v4

B

A

C

v5

E

D

B

v6

A

C

D

v7

D

C

E

A’ VA = {v2, v3, v4, v6}

E

B

C

D A

V = {v1, v2, v3, v4, v5, v6, v7}

Figure 4.2: This shows how the training patterns are selected for training the second level ONG networks. Here only one ONG network (A ) is shown. 

Now let us consider the test phase, which is commenced by feeding a test

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

68

pattern to the base network. I employ the winning unit as well as the  runner-up units in the base network to identify those networks in the second level that will be used to vote on the class of the test pattern. The value of less than

 is set to a little

 (used in training) so that units at the base level which are less likely

to be significant, are ignored. Experience has shown that this leads to improved performance. So, in the test phase, for a given test pattern, we consider the winner and the  runner-up ONG networks (i.e.,    ONG networks) to implement the final classification step. In order to combine the outputs of the    ONG networks, I employ the idea of confidence values. For each of the    ONG networks, the confidence value that a test pattern belongs to class  is calculated using the formula2 :

3





where 





   

       

 

(4.2)

     for all units  in ONG network   class  .

  

Note that,  is the Euclidean distance between input vector  and the nearest unit labeled  in ONG network . The quantities   and   are the minimum and the maximum values of  over all  in the ONG network . Equation (4.2) has been determined empirically as the most suitable functional form for calculating confidence values. It worked well and reasonably stable for all the problems I tested. It assigns a confidence value close to unity when

 is close to   , and when  is further away from   (and moves towards   ) it assigns a confidence value close to zero. Equation (4.2) has the form of one side of a Gaussian curve with mean   and standard deviation equal to 2

These confidence values do not generally sum up to one.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING







    .

69

The parameter  can be used to control how quickly the con-

fidence values for  , which are further away from   , are driven towards zero. As an illustration, Figure 4.3 shows the variation of the confidence value assignment for different values of  . Here the simulations are performed for a 26-class problem where the distances for each class are between 0 and 250. When 

/ , Equation (4.2) assigns all confidence values close to unity, and when   , it assigns almost all confidence values close to zero irrespective of its distance from   . When  is between these two extremes (4.2) assigns confidence values smoothly between 0 and 1 according to its distance from   . Given these facts,

   was chosen as a reasonable value and employed in all my experiments. If a class  is not represented in the ONG network  considered, that class is simply ignored for that network and assigned 3  .

and the curves in Figure 4.3,

In the test phase, for a given test pattern, an ONG network will produce a confidence value between zero and one for each class. Then, if the set of classes are represented by 



         , the confidence values for each class can

be collected into a confidence vector, which for ONG network  is written:

  3 ½      3   where   3 

  and        .

Confidence vectors are illustrated diagrammatically in Figure 4.1 for ONG networks A , B and C . The confidence vectors generated by the various ONG 





networks can be used to predict the class of a given test pattern in a variety of ways, as is explained in the following section.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING 1

1

0.9

0.9

S= 10

0.8

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0

0

50

100

150

200

250

1

0

50

100

150

200

250

0.9

S= 1

0.8

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0

50

100

150

200

S= 0.2

0.8

0.7

250

1

0

0

50

100

150

200

250

1

0.9

0.9

S= 0.1

0.8

S= 0.01

0.8

0.7

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0

0

1

0.9

0

S= 2

0.8

0.7

70

0.1

0

50

100

150

200

250

0

0

50

100

150

200

250

Figure 4.3: Variation of confidence values with parameter S. The horizontal axis represents the distance  (       ), and the vertical axis represents the corresponding confidence values.

4.3 Combining Overlapped NG Networks In the HONG network more than one ONG network is used to vote on the class of a given test pattern with each ONG network outputting its “decision” for a pattern as a confidence vector. These vectors allow the use of various classifier fusion techniques to predict the correct class of a given test pattern and this is the subject of the following sections.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

71

4.3.1 Simple Methods of Combining ONG Networks The most straightforward method of combining ONG networks is to let the individual ONG networks cast a hard vote in terms of the class they most prefer. The vote of ONG network

 is then given by the class  corresponding to the

maximum entry in its confidence vector. That is,

  $% $,3  

where



     .

  

A majority voting rule can then be used to obtain the

ensemble vote. In addition to the above method, since each ONG network outputs its prediction as confidence values, another ensemble prediction for the class can be obtained directly in terms of the confidence values. If a number, 1 of ONG networks in the ensemble are considered, then an overall (or ensemble) confidence vector,  , 

can be found by simply adding the individual confidence vectors. Then the ensemble prediction can be obtained by selecting the class

 with the maximum

entry in the overall confidence vector. That is,



 













$% $,3  

Both of these methods are very easy to implement and are very fast in execution. They do not however, take into count the differences in the individual ONG network capabilities. That is, all the contributing ONG networks are treated equally, which will not generally provide an optimal prediction if the capabilities

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

72

(or the performances) of the ONG networks can be estimated. In such situations, a fuzzy integral methodology can be applied for the ensemble combination and this is described next.

4.3.2 Combining ONG Networks using Fuzzy Integral The basic theory behind the fuzzy integral calculation is discussed in some detail in Chapter 3 (see Section 3.3.3). To summarize, the fuzzy integral non-linearly combines two measures of an object under consideration, namely partial evaluations and degree of importance, to obtain the overall evaluation of the object. Application of the fuzzy integral to the HONG architecture can be explained in terms of the example in Figure 4.4. This figure illustrates, for a given test pattern, an ONG network corresponding to a winner unit and two ONG networks corresponding to its first two runner-up units. These are represented by 5 ,  and  respectively. Confidence values of each class k

Class 0

1

9

hk Class 0 Class 0

1

9

hk

Winner network (W)

First runner-up network (R1)

1

9

hk

Second runner-up network (R2)

Second layer NG networks for three overlaps

Figure 4.4: Confidence values produced by the three overlapped networks (in a HONG network) corresponding to the winner and the first two runner-up networks. Now let lem, let

0

 

          be the set of classes for the classification prob-

.

 .!  .!  be the set of networks and let  be the pattern to be



½

¾

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING classified. Let

 0

73

   be the confidence value3 produced by a network for

the pattern  for class  , i.e., .  is an indication of how certain the network

. is in classifying the pattern  as class  . Thus .   indicates absolute certainty that the pattern  is in class  , and .    implies absolute certainty the pattern is not in  . The degree of importance of the network . for each class  (denoted by   ), can be calculated based on how well these networks perform on labeled data. To do this, in general, a subset of training data that is labeled is used. The performance,

 , of network . for class  is recorded as the ratio of the number of

patterns correctly classified to the total number of patterns presented from that class. Generally each network performs differently for each class and hence the three networks considered here (5 ,



and



 ) have different performance ( ) 

values in the recognition of the various classes. Let



 

 

 !!

Now the degree of importance of each network for each class (  ) can be defined as











 



(4.3)

where  is the “desired” sum of the degree of importance values. The question of setting a value for  is discussed in next section. In fuzzy systems literature the function   is called “partial evaluation”, but hereafter it is referred to as confidence value. 3

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

74

4.3.3 Determining the Value of  

 sets the fuzzy measure   to be either a belief measure (when     ) or a plausibility measure (when    / ). However, if    then     , and the fuzzy measure   is identical to probability measure. The parameter  can be varied while maintaining      . Selecting a very small value for  will produce   with negligible importance values. Also it is obvious that  cannot keep on increasing while keeping    . The value of  employed depends on the individual performances of the networks ( ),

 



From [2], the value of

but normally a value close to unity is used. The following example illustrates the influence of  on outcomes. Example 2: Let us consider again Example 1 discussed in Chapter 3, but now assume that the importance values of the networks are assigned objectively based on performances. Assume that the performances of the networks (see Figure 3.4) are estimated based on how well these networks perform on the training dataset. For simplicity, here the performance of the network is not estimated for each individual class but for each network as a whole. Let these performances of each network be (arbitrarily);

+%$" #*% 



 



+%$" #*% 



 

+%$" #*% 



 

 

The above performance measures of each network can be converted to fuzzy measures by using equation (4.3). Since the performances are estimated irrespec-

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

75

tive of the classes, here  is same for each  , and hence   is same for each  . Let

 ,  and  denote the degrees of importance of the three networks for Class  ,    , once the confidence values for those networks are arranged in descending order. In order to observe the role of  in equation (4.3), a variety of importance values for Classes 1 and 2 are calculated by varying  from 0.1 





to 2.2. All these values and the associated fuzzy integral values are presented in Table 4.1. Table 4.1: Degree of importance values and the fuzzy integral values of Class 1 and Class 2 for various  values.



 0.1 0.2 0.3 0.4 0.6 0.8 0.9 1.1 1.2 1.3 1.4 1.6 1.8 2.0 2.2

0.04 0.09 0.13 0.18 0.27 0.36 0.40 0.49 0.53 0.58 0.62 0.71 0.80 0.89 0.98



0.04 0.08 0.12 0.17 0.25 0.33 0.37 0.45 0.50 0.54 0.58 0.66 0.74 0.83 0.91

Since the values of





0.01 0.03 0.04 0.06 0.09 0.11 0.13 0.16 0.17 0.18 0.20 0.23 0.26 0.28 0.31

0.40 0.45 0.55 0.62 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.71 0.80 0.80 0.80



0.01 0.03 0.04 0.06 0.09 0.11 0.13 0.16 0.17 0.18 0.20 0.23 0.26 0.28 0.31





0.04 0.09 0.13 0.18 0.27 0.36 0.40 0.49 0.53 0.58 0.62 0.71 0.80 0.89 0.98

0.04 0.08 0.12 0.17 0.25 0.33 0.37 0.45 0.50 0.54 0.58 0.66 0.74 0.83 0.91



0.50 0.50 0.50 0.50 0.50 0.51 0.55 0.63 0.66 0.69 0.73 0.80 0.85 0.85 0.85

Final Class 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2

. and   are known for the three networks the fuzzy

integral for the two classes can be calculated as follows:









      



   !



  

 

 



     !   

 





 





 

    



    

         

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

76

  values are very small their contribution to  is negligible. In this case    and   , always. Since    it assigns Class 2 to an When the 







unknown pattern without considering the performance values of the networks. As Figure 4.5 shows this situation exists when  Also when the

 .

  values are as large as their confidence values, the system

assigns Class 2 as the output class for an unknown pattern. As Figure 4.5 shows

 / . This shows that  should at least be restricted to values between  and  for this particular problem. Practitioners this situation exists when

generally give it a value close to unity (either a little smaller or a little greater), and in this range the final result changes negligibly. Once a value for

 is

selected it is kept constant throughout the experiment. 0.9

Class 2 0.8

Fuzzy integral (e)

0.7

Class 1

0.6

0.5

0.4

0.3

0

0.2

0.4

0.6

0.8

1

1.2

dsum

Figure 4.5: Variation of the fuzzy integral as Example 2.

1.4

1.6

1.8

2

2.2

 is varied from  to  for

In the HONG network, the fuzzy integral fuses information from multiple networks in order to achieve a final classification. A fuzzy integral is calculated for each class, and then the integral with the largest value indicates the class label. The pseudo code in Figure 4.6 describes the calculation of the fuzzy integral in the HONG network. The pseudo code in Figure 4.7 describes the overall function

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

77

of the HONG network. FOR (each class  )  FOR (each ONG network . participating in voting)  Calculate .  using equation (4.2) Calculate   using equation (4.3)  Calculate using   as in equation (3.3) Compute the fuzzy integral for class  ,  , using equation (3.2)  Determine the class:   $% $, 

Figure 4.6: Pseudo code for the calculation of the fuzzy integral in the HONG network.

4.4 Data Sets Used in the Experiments 4.4.1 UCI SatImage Data Set The original SatImage data set was generated from the Landsat data purchased from NASA by the Australian Centre for Remote Sensing, and used for research at the university of New South Wales. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red region. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m 80m. Each image contains 2340 3380 such pixels. The data set is a (tiny) sub-area of a scene, consisting of 82 100 pixels. Each line of data corresponds to a 3 3 square neighborhood of pixels completely contained within the 82 100 sub-area. Each line contains the pixel values in

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

78

Train the Base Network  Initialize the network with N units Initialize the reference vectors of each unit in the network Train the network as discussed in Section 2.8 Label the units Apply LVQ algorithm to fine tune the decision boundaries  FOR (Each unit in the Base Network)  Create an ONG network Initialize the reference vectors of each unit in the ONG network using equation (4.1) Obtain the training samples from the Base network as shown in Figure 4.2 Train and Label the ONG network using the same procedure as used in the Base network  Test Phase - for a given test pattern  Identify the winner and the  runner-up ONG networks Call the Fuzzy Integral code (see Figure 4.6) to calculate the class of the test pattern  Figure 4.7: Pseudo code for the HONG network. the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3 3 neighborhood and a number indicating the classification label of the central pixel. The data is given in random order and certain lines of data have been removed so that it is not possible to reconstruct the original image from this data set. This data set contains patterns with 36 attributes, each of which belongs to one of 6 classes. The dataset was divided into a training and a test set with 4,435 patterns in the training set and 2,000 patterns in the test set. More details of this data set can be found in [13, 95].

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

79

4.4.2 UCI Letter Data Set This data set was constructed by Frey and Slate [35]. The objective is to classify each of a large number of black and white rectangular pixel displays as one of the 26 capital letters of the English alphabet. The character images produced are based on 20 different fonts and each letter within these fonts has been randomly distorted to produce a file of 20,000 unique images. The parent fonts represent a full range of character types including script, italic, serif and Gothic. Perfect classification performance is unlikely to be possible with this data set. One of the fonts used, Gothic Roman, appears very different from others. Each image was represented by 16 numerical attributes which were calculated using edge counts and measures of statistical moments that were scaled and discretized into a range of integer values from 0 to 15. This data set contains 20,000 patterns, each of which belongs to one of 26 classes. The first 16,000 were used for training, and the remaining 4,000 patterns were used for testing. More details of this data set can be found in [13].

4.4.3 NIST Handwritten Numeral Data Set In 1988, the Image Recognition Group at the National Institute of Standards and Technology (NIST) undertook a project sponsored by the Bureau of the Census to design and collect a large database of handprinted characters. The database was designed to be used in training and testing high speed, high throughput, character recognition systems. NIST Special Database 1 (SD1) contains 2,100 full page images of handwriting patterns printed by 2,100 different writers, geographically distributed across the United States, with a sampling roughly proportional to population density. The writers used in this collection were permanent census

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

80

field representatives experienced in filling out forms. SD1 embodies a wide range of handwriting styles. The completed forms in this database illustrate the difficulty in recognizing handprinted characters with a computer. Character classifiers typically recognize one individual character at a time. The 2,100 pages contained in SD1 (digits, lower-case and upper-case letters) were segmented into individual images and assigned reference classifications. The SD3 database contains this segmented image collection. Each referenced image in the SD3 database has been manually checked and verified at two independent times, each time by a different person, As a result, SD3 contains 313,389 referenced character images [44]. In the following, only the handwritten numeral patterns were used. They contain 223,124 isolated handwritten numerals written by 2081 writers, scanned at 300 dots per inch. The full set of isolated handwritten numerals was originally partitioned into four sets, each containing about 500 writers, and named hsf 0, hsf 1, hsf 2 and hsf 3 (see Table 4.2). I used the first three partitions for training and the fourth for testing. The training set was further partitioned into a training and a validation set. For this I used the holdout method [33, 73], which partitions a dataset into two mutually exclusive subsets with roughly 2/3 of the data for training and 1/3 for validation. I finally ended up with three non-overlapping sets with 106,152 patterns for training, 53,067 for validation and 63,896 for testing (see Table 4.3). The test set comprises patterns from 586 writers not used in the training and validation sets.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

81

Table 4.2: Distribution of digits into partitions in SD3 database. Partition No. of writers No. of digits

hsf 0 500 53,449

hsf 1 498 53,312

hsf 2 497 52,467

hsf 3 586 63,896

Table 4.3: Partitions of SD3 database used in my experiments. Partition(s) Size Use

hsf 0,1,2 106,152 Training

hsf 0,1,2 53,076 Validation

hsf 3 63,896 Testing

4.5 Experimental Results In the first set of experiments, two medium sized benchmark data sets from UCI repository (SatImage and Letters) were used. Both these data sets have predefined training and test sets as discussed in the earlier section. For each data set the performance of the HONG network with different fusion methods was compared with that of the best recently published results. For this to be a fair comparison, the full set of labels were employed to fine tune the HONG network after the clustering procedure had been implemented. In a later section (Section 4.9) the effects on performance of using less than 100% of the labels are investigated. Following these experiments on the two medium-sized data sets, a second set of experiments was carried out on one large sized data set from the NIST SD3 database. For all data sets the parameters used are given in Table 4.4. The exact values of these parameters are not very critical but can be optimized by trial and error if required. Both the base and the second level units were given the same set of parameters in all experiments.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

82

Table 4.4: Parameters used on the three data sets using HONG network. Here  denotes the number of training patterns in the data set. See text for the number of units used. Data Set SatImage / Letters NIST SD3











0.8 0.7

0.05 0.005

0.7 0.1

0.001 0.85 0.0001 0.90







 

The selection of the numbers of units for the data sets is based on the number of distinct classes in the data set as well as the number of training patterns available. A set of rules of thumb for this were obtained through experimentation. The number of units in the base network is determined using the following. Let

 denote the number of training patterns available and denote the maximum number of classes they belong to. Generally, the number of units  is proportional to both  and . So, the number of units can be found using:    

(4.4)

where is a constant. For medium sized data sets (eg., SatImages, Letters) a value around

,

and for large sized data sets (eg., NIST SD3) a value around



are

choices for the value of that yield good performance. For the ONG networks in the second level the following rule of thumb is used. If all the training patterns are assumed to be well represented in the base network (i.e., each base unit represents one class, and on average each unit activates for  patterns), then the number of units in the second level ONG networks can be estimated using:







 

(4.5)

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

83

is the same constant used in equation (4.4). In what follows each ONG network is initialized with  units in the second level.

where



4.6 Medium Sized Data Sets For both medium sized data sets the units in the HONG network are initialized using the rules of thumb given in equations (4.4) and (4.5). For the SatImage data set, the base network is initialized with 90 units and the second level networks are initialized with between 10 and 50 units depending on the number of training patterns available for training each of the ONG networks. For the Letters data set, the base network is initialized with 130 units and the second level networks are initialized with between 20 and 80 units, again depending on the number of training patterns available for training.

4.6.1 Performance on UCI SatImage Data In the HONG network used on SatImage, four different fusion methods are compared, namely majority voting, confidence values, weighted average and the fuzzy integral. In the case of majority voting, if the winner and the

 runner-up

ONG networks for the test set are considered, each    ONG network produces a class based on the best match and the class with the highest count is chosen as the final class. In the confidence values method, the confidence vectors of each

   ONG network are summed to give an overall confidence vector and the largest entry in this vector indicates the class. In the weighted average method, the weights of the ONG networks are calculated based on how well these networks perform on the training set. Once the weight of each ONG network is

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

84

known, the classification of a pattern can be obtained similarly to the confidence method but this time each confidence vector has an associated weight. The fuzzy integral method allows the confidence vectors produced by the ONG networks to be combined with their performances on the training set in a non-linear fashion as discussed earlier. The results on the test set, averaged over 10 random4 runs for the HONG network, are given in Table 4.5, together with the best recently published results5 . As can be seen, the HONG network with simple majority voting outperforms the classifiers CART [15], C4.5 [105](two popular decision tree algorithms) and MLP used in the comparison without boosting. The results also indicate that the combination of network performance with confidence values by means of the fuzzy integral leads to a further improvement in classification performance. Since the HONG network has an apparently complex architecture, one might expect it to have a low execution speed. But this is not so. It took just over 2 minutes (on a Pentium II, 350 MHz PC running the FreeBSD operating system) to train this data set (i.e., with 4,435 patterns) on the HONG network with the parameters discussed earlier. Once the network was trained and the units labeled, it took just 5 seconds to test all 2,000 test patterns.

4.6.2 Performance on UCI Letters Data I use the same combination methods as I employed on the SatImage dataset and the results are shown in Table 4.6. This shows again that combining confidence 4

That is, running the HONG network 10 times randomly without changing any of its parameters. 5 For some of the algorithms their best results using the boosting algorithm is also given. Boosting techniques, and their application to the HONG network are discussed in Chapter 5.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

85

Table 4.5: Test error rates on the UCI SatImage dataset by other algorithms compared with the HONG. The first 4,435 patterns are used for training and the last 2,000 patterns are used for testing. Note that, the column “Boosting” shows the test error rate using the boosting algorithm. Author Breiman [16] 1996 Freund et. al. [34] 1996 Schwenk et. al. [118] 2000 Michie et. al. [95] 1994 HONG

Algorithm CART C4.5 MLP (36-30-15-6) k-NN with Majority Voting with Confidence Values with Weighted Average with Fuzzy Integral

Error % Boosting 14.8 8.8 14.8 8.9 12.8 8.1 9.4 11.2 9.4 9.2 8.9

values is much better than taking a simple vote and that the additional incorporation of network performance values provides further improvement. Again my results are comparable to the recently published results for this data set. The HONG network took only 8 and half minutes to train 16,000 patterns and 16 seconds to test 4,000 patterns using a sequential computer as used for the previous data set. Table 4.6: Test error rates on the UCI Letters dataset by other algorithms compared with the HONG. The first 16,000 patterns are used for training and the remaining 4,000 patterns are used for testing. Note that, Michie et. al. [95] used 15,000 patterns for training and 5,000 patterns for testing, which is a slightly different partitioning of the data than that in the UCI database. Author Breiman [16] 1996 Freund et. al. [34] 1996 Schwenk et. al. [118] 2000 Michie et. al. [95] 1994 HONG

Algorithm CART C4.5 MLP (16-70-50-26) ALLOC80 with Majority Voting with Confidence Values with Weighted Average with Fuzzy Integral

Error % Boosting 12.4 3.4 13.8 3.3 6.1 1.5 6.4 7.9 6.1 5.7 4.2

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

86

4.7 Large Sized Data Set 4.7.1 Feature Extraction Methods on NIST SD3 Database The original NIST SD3 database consists segmented bi-level (i.e., black and white) images with an image resolution of ! ! along with a classification of the image. It is well known that the performance of a pattern recognition system is closely related to the feature measurements taken by a classifier. On the other hand, the use of large number of feature measurements will increase the complexity or the size of the classifier. Since the NIST SD3 database consists of a large number of features and some redundant features exist (eg., image in a large background) for each image, a few different feature extraction methods were performed. Three different feature extraction methods were used in the experiments. Since feature extraction is a hard and crucial step, each method is briefly explained here. Prior to the feature extraction operation, pre-processing operations were performed on the isolated numerals. First, any isolated blobs were removed from the binary image based on a ratio test. Even though the binary image resolution is

!



!,

the actual numeral part takes up a much smaller area. Therefore

to normalize the original images by size, a minimal square bounding box was fit to the set of nonzero pixels in each image and the containing area was normalized (i.e., down-sampled) to a smaller image. Once this was done, three different feature sets were extracted from each digital images, as is explained below. In the first operation, the global features (GF) were extracted based on the pixel values [135]. The pre-processed binary digit images were normalized to an !! 

pixel resolution and each such image was sub-sampled into ! ! blocks.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

87

The pixel values in each block are then summed. The result is an  grey scale image with pixel values in the range [0,64]. That is, now each image has only feature values. In the second operation, structural projection features (PF) were extracted based on projections, black-to-white transitions and contour profiles similar to [49]. Initially the binary digit image was normalized to a   pixel resolution. The black pixels were then projected onto four main directions (horizontal, vertical, left diagonal and right diagonal) and four different histograms were obtained. Similarly, black-to-white transitions were counted and another four different histograms were obtained. Finally eight contour profiles were computed from eight main directions (i.e., the above mentioned four directions in left-to-right and right-to-left directions). A contour profile value is defined as the number of white pixels separating the border and the first black pixel seen from a given direction. This results in 16 histograms and 6 features were extracted from each of the histograms by sub-sampling with a weighted average. From this,



features were

obtained. In the third operation, structural contour features (CF) were extracted based on contours similar to [49]. First the binary digit image was normalized to an !!

 pixel resolution. Then both inner and outer contours of the digit image

were extracted. At each contour point, the gradient direction was estimated by computing the weighted average of the coordinates of the neighboring pixels. The calculated direction was quantized into one of the 8 uniform quantization intervals. Then the image with direction contours was sub-sampled into   blocks. In each block and for each of the 8 discrete directions, the total number of contour direction pixels was counted. This counting was weighted according to

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

88

its position with respect to the center of the corresponding block. This results in 

(i.e., ! ) contour direction features.

4.7.2 Performance on NIST SD3 Data This experiment shows the HONG network’s ability to handle large sized data sets. Here the base network’s performance is compared with the HONG network’s performance for the three different feature extraction methods discussed above. In each case, the HONG architecture improves upon the classification rate provided by the base level NG network. For training the HONG network I used the full training set (see Table 4.3) without any labels and for the supervised part of the learning (i.e., labeling and fine tuning) I used the full set of labeled data. In order to prevent the network from over-fitting, the training process was stopped at the point where the recognition rate on the validation set was at its best. Since the previous set of experiments showed that fuzzy integral is the best way to combine the ONG networks in the second level, I used only the fuzzy integral in the comparisons for this data set. The parameters used are as given on the bottom row of Table 4.4. I used a smaller neighborhood size for this data set because there are a large number of patterns in this case. And for the same reason, the learning rate was made much more slower than in the previous experiments. The base network was initialized with 250 units and the second level networks were initialized with between 50 and 150 units, using the rules of thumb given in equations (4.4) and (4.5). The training process performed four iterations on 106,152 patterns, and it took about 3.5 hours to train the network. However, for testing on all the patterns, it took only a few minutes. The results I obtained for the three feature extraction methods are shown in

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

89

Table 4.7. This table shows that the features based on the pixel values (i.e., global features GF) gave somewhat better results than the other two features on the test set. Table 4.7: Results of the Base Neural-Gas and the HONG network combined with fuzzy integral for the NIST SD3 database. The three feature extraction methods (GF, PF, CF) are explained in Section 4.7.1.

Method GF PF CF

BASE NG Error Rate (%) Training Validation Testing 0.690 1.100 1.230 0.790 1.230 1.340 0.680 1.170 1.260

HONG Error Rate (%) Training Validation Testing 0.100 0.700 0.780 0.340 0.940 0.970 0.070 0.880 0.960

4.8 Number of Overlaps in HONG As discussed in Section 4.2, each ONG network is trained/tested with overlapping sets of patterns. This overlapping occurs because individual patterns are used to train/test more than one second level ONG network. In unsupervised learning algorithms like the NG algorithm, it is very easy to identify the winner and the runner-up units for a given input pattern. Generally speaking the winner and the first few runner-up units tend to represent similar kinds of clusters in input space. In the HONG network I used the winner and  runner-up units (in the base network) to identify the corresponding ONG networks for training, and

 runner-up units to identify the corresponding ONG networks for testing. The values of  and  are set empirically and the value of  is set to a little less than  so that the units at the base level which have less confidence, are

the winner and

ignored.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

90

Table 4.8 shows test error rates for the SatImage data set for different overlap sizes in the HONG network. Here three different classifier combination methods have been employed. The table indicates that for this dataset, the best value for

 is six and the best value for  is five. Similarly, Table 4.9 shows test error rates for the Letters data set for different overlap sizes and indicates that the best value for  is nine and the best value for  is seven. Experience has shown that for large data sets 



and 

 

generally give best results but for smaller data sets the

two parameters should have their values reduced. Table 4.8: Test error rate for different HONG networks with varying training and test overlaps for the SatImage data set. Number of Overlaps (winner & runners-up) In Training (  ) In Testing (  ) 3 2 3 3 6 4 6 5 6 6 9 7

SatImage Dataset Test Error Rate(%) Majority Confidence Fuzzy 12.1 10.5 10.1 12.2 10.6 10.2 11.1 9.5 9.0 11.2 9.4 8.9 11.5 9.7 9.1 11.9 10.1 9.2

4.9 Using Labeled and Unlabeled Data To demonstrate the HONG network’s ability to handle a data set with labeled and unlabeled training sets, I randomly partitioned the training set of the SatImage and the Letters data sets into labeled and unlabeled sets. For unsupervised training I used all the training patterns without their labels. For the supervised part I used only a proportion of the training patterns with their labels. For each labeled proportion given in Table 4.10 the experiments were carried out with the same

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

91

Table 4.9: Test error rate for different HONG networks with varying training and test overlaps for the Letters data set. Number of Overlaps (winner & runners-up) In Training (  ) In Testing (  ) 5 3 7 4 7 6 9 5 9 7 9 9 11 7

Letters Dataset Test Error Rate(%) Majority Confidence 8.7 7.6 8.7 7.4 8.8 7.4 8.2 6.1 7.9 6.1 7.7 6.4 8.3 6.7

Fuzzy 5.3 5.5 5.6 4.4 4.2 4.6 4.6

HONG parameters for 10 runs. The averaged test error rate over the 10 runs for the network using the fuzzy integral are given in Table 4.10. The HONG network is able to achieve an error rate close to CART and C4.5 (without boosting– see Tables 4.5 and 4.6), using only 60% of the labeled training data for both data sets. Note that the Letters data set performance deteriorates much faster than that for the SatImage data set as the labeled data proportion decreases. This is probably caused by the Letters data set having more classes than the SatImage data set hence requiring more labeled patterns to represent its decision boundaries. Table 4.10: Average test error rates on the UCI SatImage and Letters data sets for different proportions of labeled training patterns. Proportion of Labeled Data 40% 50% 60% 80%

Test Error Rate (%) SatImage Letters 23.32 27.33 17.79 15.52 15.07 13.71 12.38 8.11

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

92

4.10 Multiple HONG Networks In the final experiment of this chapter, I employed more than one HONG network to try to improve performance on the largest data set, i.e., the NIST SD3 data set. For this experiment, each input pattern was analyzed by three different HONG networks each trained on a different set of features, viz the ones detailed in Section 4.7.1. This provided the basis for a composite classifier as depicted in Figure 4.8. In this figure, each HONG network outputs its decision on a pattern as a set of confidence values rather than a crisp value. Since each HONG network is trained on a different feature set, each of them has a different degree of importance in the recognition of the classes. Once the two measures, confidence values and degree of importance, are obtained for each HONG network the three HONG networks are combined using the fuzzy integral to achieve the classification estimates6 .

Input Image

Global Features

HONG 1

Projection Features

HONG 2

Contour Features

HONG 3

Combiner

Final Classification

Figure 4.8: Multiple HONG-base classifier configuration. With the multiple HONG-based classifier I was able to achieve a test error rate of 0.520% for the NIST SD3 data set. The best test error rate for a single HONG network was given by the network which was trained on the global features (GF) and it was 0.780% (see Table 4.7). Thus, the multiple HONG-based classifier im6

This arrangement is somewhat similar to the one used by Tahani et. al. [124] for information fusion in an object recognition system. In [124], the scene under consideration is analyzed by different sensors and their performances are combined with the degree of importance using the fuzzy integral.

CHAPTER 4. COMBINING SUPERVISED/UNSUPERVISED LEARNING

93

proves further on the high classification rates provided by the individual HONG networks. Unfortunately, I was not able to compare with any results for any classifier that employed unsupervised methods on this data set. To my knowledge, the best results published for this data set are by Ha et al.[49] who used a supervised method. The authors of [49] reported a test error rate of 0.460% using the same set of 223,124 samples, but having partitioned it into 40,000 samples for training, 10,000 for validation and 173,124 for testing. They applied a perturbation method to improve performance after combining two MLPs with different feature extraction methods. The feature extraction method I used is not identical to their method.7 Clearly Ha et al. used far less samples for training than in my method (see Table 4.3) but, of course, to learn the structure of data in an unsupervised fashion generally needs more training samples than a supervised method, especially on a large problem such as this.

7

I requested the feature sets and feature extraction routines used in [49] (in order to perform an objective comparison) but I did not succeed.

Chapter 5 Boosting the HONG Network In this chapter a popular variant of the boosting algorithm called AdaBoost is used to improve the performance of the proposed hybrid learning system discussed in the previous chapter. Also, it is shown how the boosting algorithm could be used when the training set contains labeled and unlabeled data.

5.1 Introduction As explained in [34], the AdaBoost algorithm can be implemented by either weighted random sampling from the available data, or by weighting a cost function for each individual pattern. In the HONG network, weighted sampling is more convenient and generally proceeds as follows. Given a population of  training patterns, select



patterns randomly with replacement and use them to train a

(usually “weak”) classifier. A second classifier is then trained in a similar fashion, but with the selection process modified so that those patterns that are classified incorrectly by the first classifier are more likely to be selected for use in training the second one. Further classifiers are trained using the same approach so that 94

CHAPTER 5. BOOSTING THE HONG NETWORK

95

boosting attempts to produce new classifiers that are better able to classify patterns for which previous classifier performance was poor. This chapter describes the two versions of AdaBoost described in [34], viz AdaBoost.M1 and AdaBoost.M2 which are used to improve the performance of the HONG architecture.

5.2 The AdaBoost Algorithm In AdaBoost.M1, after each round1 the probability of incorrectly labeled patterns is increased, and the probability of correctly labeled patterns is decreased. The result of training the th classifier is a hypothesis  of input patterns,

is finite.

0





  0 , where  is the space

     is the space of class labels of the patterns and

Figure 5.1 shows a pseudo-code representation of the AdaBoost.M1

algorithm. The main disadvantage of this algorithm is that it requires a weak learning algorithm whose error is bounded by a constant strictly less than



 .

In practice, the weak learning algorithms used provide no such guarantee. For example, the expected error of a hypothesis which randomly guesses the label of a pattern is    , where is the number of possible classes. For

0

  (i.e.,

  ), the weak hypothesis needs to be just slightly better than random

guessing, but when //  this requirement may often be hard to meet [34]. 1

Training a new classifier using the re-sampled data in the boosting algorithm is called a round of the boosting process.

CHAPTER 5. BOOSTING THE HONG NETWORK

Input:

96

Set  - a sequence of  patterns   .        . / with labels .  0      . 1- a weak learning system.  - an integer specifying the number of iterations.

Initialize:

   for all        .

Repeat:

Do for        





1. Draw a sample  from  with respect to distribution  and use it to train the learning system 1 (Weak Learner) to implement the function     0 .

  1  2. Calculate the error of  :

 



  

  

If 





/ , then set      and abort loop.

3. Set 6





     

4. Update distribution  :

   7  

6







if     . otherwise

where 7 is a normalization constant. weight of   is one) Output:

Final hypothesis:

       







   

Figure 5.1: The AdaBoost.M1 algorithm



6

(So the total

CHAPTER 5. BOOSTING THE HONG NETWORK

97

In general, most classifiers provide more information than just a class label (eg., neural network classifiers can approximate posterior probabilities of classes [108]). It is always useful to use this kind of information rather than to perform a hard decision for one recognized class. The AdaBoost.M2 algorithm is designed to address the above issues and it can be used when a classifier computes confidence values as its outputs for the classes. By doing this, the AdaBoost.M2 algorithm attempts to overcome the problem faced by the earlier version. In this version, the result of training the th classifier is now a hypothesis

   0





  .

That is, each weak hypothesis outputs a vector



  

(here

each component need not sum to one). Another difference between the two is that AdaBoost.M1 is based on a simple error count where AdaBoost.M2 employs a somewhat more complex pseudo-loss error measure. More interestingly, the AdaBoost.M2 algorithm focuses the weak learner not only on hard to classify patterns, but more specifically on improving the discrimination between the correct class and the incorrect class that competes with it [34]. Figure 5.2 shows a pseudo-code representation of the AdaBoost.M2 algorithm. Note that the two versions of AdaBoost are equivalent for binary classification problems and differ only for problems of more than two classes. Breiman [16] introduced another version of boosting, similar to AdaBoost, called arcing, which aims to adaptively re-sample and combine (hence the acronym arcing). Its main difference from AdaBoost is that the classifiers it creates are combined without the weighting scheme used in building the final AdaBoost classifier (i.e., the classifiers have equal votes). It has frequently been shown that adaptive re-sampling techniques such as AdaBoost and arcing do better than the earlier bagging approach [6]. A detailed description of the bagging and the boost-

CHAPTER 5. BOOSTING THE HONG NETWORK

98

ing algorithms is given in Section 3.3.2. Input:

Set  - a sequence of  patterns   .        . / with labels .  0      . 1- a weak learning system.  - an integer specifying the number of iterations.

Initialize: Let 4

 .           .  .   .  " for all  .  4 , where 4 =   . 



Repeat:



Do for         1. Draw a sample  from  with respect to mislabel distribution  and use it to train the learning system 1 (Weak Learner) to implement the function    0    .

  1  2. Calculate the pseudo-loss of  :

   3. Set 6

  . 

 

   .    .

 "





     

4. Update distribution  :

  .  7 . 6 

     

   







where 7 is a normalization constant. Output:

Final hypothesis:

       

    . #







6

Figure 5.2: The AdaBoost.M2 algorithm





CHAPTER 5. BOOSTING THE HONG NETWORK

99

In many experiments with boosting (and also with bagging) it has been observed that, on average, the error keeps reducing as committee size is increased, but the marginal error reduction associated with each additional committee member tends to decrease. That is, each additional member, on average, has less impact on a committee's prediction error than any one of its predecessors [129]. The main problems identified with the AdaBoost algorithm are: (1) it is vulnerable to over-fitting after many rounds, although for short runs this is not noticeable [47], and (2) the performance of AdaBoost can degrade in the presence of noise, because eventually it puts a large amount of weight on the misclassified samples, and this can also lead to significant over-fitting [24, 118]. Recently Rätsch et al. [107] introduced a regularization strategy into the boosting algorithm to alleviate the over-fitting problem for noisy data. They have shown that the regularized version of the AdaBoost algorithm has a better overall generalization performance than the original AdaBoost algorithm. (A method called MadaBoost has also been recently introduced by Domingo et al. [27], which they claim is resistant to certain kinds of noise.)

One interesting property of the AdaBoost algorithm [116, 118] is that its generalization error continues to decrease with further rounds of boosting even after the training error has reached zero (i.e., after the weighted vote of the full ensemble leads to zero training error). This phenomenon is discussed by Schapire et al. [116] in terms of the concept of margins. The margin of a pattern is defined as the difference between the ensemble score (i.e., the sum of weighted votes) of the correct class and the strongest ensemble score of a wrong class. In the two-class case, a pattern has a positive margin if and only if it is correctly classified by the ensemble. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. The authors of [116] have demonstrated that larger margins imply lower generalization error and that AdaBoost tends to increase the margins of the training samples.
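To make the margin definition concrete, here is a short sketch (the function name and array layout are my choices, not from [116]). The array scores holds the accumulated weighted votes of the ensemble, one row per pattern and one column per class.

import numpy as np

def ensemble_margins(scores, labels):
    """Margin of each pattern: ensemble score of the correct class minus the
    strongest ensemble score of any wrong class (positive iff correct)."""
    n = len(labels)
    correct_score = scores[np.arange(n), labels]
    rivals = scores.copy()
    rivals[np.arange(n), labels] = -np.inf   # mask out the correct class
    return correct_score - rivals.max(axis=1)

A positive entry in the returned array means the pattern is correctly classified by the ensemble, and its magnitude indicates the confidence of that prediction.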

5.3 AdaBoost in the HONG Architecture

The HONG network can be boosted using the re-sampling method, and the re-sampled training patterns are generated as follows. Let $D_t(i)$ represent the current distribution of the training patterns at round $t$, where $i = 1, \ldots, P$. The initial distribution $D_1$ is uniform over the training set $S$, so $D_1(i) = 1/P$ for all $i$. Let $C_t(i)$ be the cumulative distribution of $D_t$. That is,

$$C_t(i) = \sum_{j=1}^{i} D_t(j)$$

Now a random pattern $x_i$ can be selected from the set $S$ by generating a random number $r$ between 0 and 1 such that $C_t(i-1) \leq r < C_t(i)$. By repeatedly following the above process $P$ times, a re-sampled training set $S_t$ is generated, and this is carried out for each round of boosting in the HONG network. One iteration of this boosting process is shown in Figure 5.3.

Application of AdaBoost.M1 to a trained HONG network is exactly the same as in the original (random sampling) version described in [34]. Application of AdaBoost.M2 is also largely in accord with the original algorithm, the major difference being the manner in which the pseudo-loss error is calculated. In [34] the pseudo-loss was calculated using the expression

$$\epsilon_t = \frac{1}{2} \sum_{(i, y) \in B} D_t(i, y) \left( 1 - h_t(x_i, y_i) + h_t(x_i, y) \right) \qquad (5.1)$$

where $D_t(i, y)$ is a mislabel distribution³, $h_t(x, y)$ is a plausibility measure and $t$ is a count of the number of rounds of boosting.

³ A mislabel distribution is a distribution defined over the set $B$ of all mislabels, where $B = \{(i, y) : i \in \{1, \ldots, m\},\; y \neq y_i\}$ (see Figure 5.2). Here a mislabel is the pair $(i, y)$ where $i$ is the index of the training pattern and $y$ is an incorrect label associated with pattern $i$.

[Figure: flow diagram of one boosting round; the set S is re-sampled based on the distribution D_t, the HONG network (t) is trained using sample S_t, the training error ε_t is calculated on S, the value of β_t is calculated and stored, and the distribution D_t is updated.]

Figure 5.3: This shows one round ($t$) of the boosted HONG network. Here $S$ represents the full training set and $D_t$ represents its current distribution. The HONG network is trained on the re-sampled training set $S_t$.
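The re-sampling step in Figure 5.3 is ordinary inverse-transform sampling from $D_t$. A minimal sketch, assuming NumPy (np.searchsorted performs the lookup of the index $i$ with $C_t(i-1) \leq r < C_t(i)$; the function name is mine):

import numpy as np

def resample(S, D, rng=np.random.default_rng()):
    """Draw len(S) patterns from S according to the distribution D by
    inverting its cumulative distribution."""
    C = np.cumsum(D)                           # C(i) = D(1) + ... + D(i)
    r = rng.random(len(S))                     # P random numbers in [0, 1)
    idx = np.searchsorted(C, r, side="right")  # index i with C(i-1) <= r < C(i)
    return [S[i] for i in idx]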

In [34] the plausibility measure was calculated based upon probability arguments, but the HONG network uses the fuzzy integral to fuse the two measures, confidence values and the degree of importance, to achieve the classification estimates. In this way the HONG network produces a set of fuzzy integral values (between 0 and 1), one for each class (see Figure 4.6), and I employ these for the plausibility measure ($h_t(x, y)$) required by the AdaBoost.M2 algorithm.

5.4 Some Practical Issues Concerning Boosting

Before I detail the results of the experiments, I would like to point out some of the practical problems I have encountered in the implementation of the boosting algorithm, which are rarely discussed yet common in implementation.

One issue is the numerical underflow problem with the weights of the training set distribution $D_t$. Generally the initial distribution values are very small when $P$ is large, and after a few rounds of boosting the weights of the correctly classified patterns of the previous rounds decrease close to zero very quickly. Depending on the floating point representation, these small weight values are rounded to zero, which stops those patterns from being used in future re-sampled training sets. This numerical representation problem can be overcome by maintaining the weights of the patterns on a logarithmic scale [4]. The necessary modifications are given in Table 5.1, and these modifications do not change the original boosting algorithm in any way other than in the re-sampling process.

Table 5.1: Modifications needed for the boosting algorithm to be in the logarithmic scale. Here $e_i$ denotes whichever exponent of $\beta_t$ the AdaBoost variant in use prescribes for pattern $i$.

Step                     Original                                            Modified (logarithmic scale)
weights initialization   $D_1(i) = 1/P$                                      $\log D_1(i) = -\log P$
training error           $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$  $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} \exp(\log D_t(i))$
weights update           $D_{t+1}(i) = D_t(i)\,\beta_t^{e_i}$                $\log D_{t+1}(i) = \log D_t(i) + e_i \log \beta_t$
normalization            $D_{t+1}(i) \leftarrow D_{t+1}(i)/Z_t$              $\log D_{t+1}(i) \leftarrow \log D_{t+1}(i) - \log Z_t$
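A minimal sketch of the log-scale update in Table 5.1 (the helper name, the per-pattern exponent argument, and the use of SciPy's logsumexp to obtain $\log Z_t$ are my choices, not prescribed by [4]):

import numpy as np
from scipy.special import logsumexp

def update_log_weights(log_D, beta_t, exponents):
    """Log-scale version of the weight update and normalization in Table 5.1."""
    log_D = log_D + exponents * np.log(beta_t)  # D(i) *= beta_t ** e_i, in log space
    return log_D - logsumexp(log_D)             # subtract log Z_t without underflow

Because the weights are only ever needed up to normalization, working with their logarithms in this way leaves the boosting algorithm itself unchanged.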


Another issue in boosting by the re-sampling method occurs when calculating the training error. Here a hypothesis $h_t$ is trained on a re-sampled training set $S_t$, but the training error $\epsilon_t$ should be calculated on the original training set $S$. After a few rounds of boosting, the re-sampled training set may contain the same pattern (possibly an outlier) multiple times, and calculating the training error on $S_t$ will lead to an error value $\epsilon_t \geq 1/2$. Since the boosting process requires $\epsilon_t < 1/2$ to continue, the above situation causes the boosting process to exit from its loop. Generally, for boosting to continue, for a given round $t$, the value of $\epsilon_t$ should be between zero and half (i.e., $0 < \epsilon_t < 1/2$). However, if $\epsilon_t = 0$ on the training set $S$, the algorithm should terminate and make the current hypothesis the final hypothesis (i.e., $h_{fin} = h_t$). That is, if $\epsilon_t$ is zero, then the value of $\beta_t$ also becomes zero (see Figure 5.1 or 5.2). Because of this, the weight assigned to the current hypothesis (i.e., $\log(1/\beta_t)$) becomes infinity and the weights assigned to the previous ($t - 1$) hypotheses become irrelevant. This situation arises very rarely, but if it happens the learning system is no longer an ensemble, but a single classifier.
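In code, these two exit conditions reduce to a pair of guards evaluated at the start of each round; a sketch (the function and its return conventions are hypothetical):

def check_round(eps_t):
    """Exit conditions for a boosting round, with eps_t computed on the full set S."""
    if eps_t >= 0.5:
        return "stop"      # assumption epsilon_t < 1/2 violated: leave the loop
    if eps_t == 0.0:
        return "final"     # beta_t = 0, log(1/beta_t) = inf: keep h_t alone
    return "continue"      # 0 < eps_t < 1/2: set beta_t = eps_t / (1 - eps_t)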

5.5 Experimental Results

The performance of the boosted HONG network is compared on the two UCI data sets SatImage and Letters. Details of these two data sets are given in Sections 4.4.1 and 4.4.2. In this experiment, the parameters used in the HONG network are unchanged from the previous set of experiments performed (i.e., without boosting) in Chapter 4. Hence the HONG parameters used for this set of experiments are as given in Table 4.4. Also, I used only the fuzzy integral in the HONG network (for combining multiple ONG networks) in the comparisons for these two data sets. Boosting by re-sampling is used in each case.

For the SatImage data the results on the test set, averaged over 10 random runs, are compared in Table 5.2. The results shown in the table are after 75 rounds of boosting; improvements in the generalization error beyond 75 rounds of boosting were negligible. The AdaBoost.M2 algorithm gave the best results on the HONG network for this data set. It took about 52 minutes (on a Pentium II PC) to carry out the boosting process for this data set, and just a few seconds to test the full data set once it was trained.

Table 5.2: Test error rates on the UCI SatImage dataset with/without boosting by the algorithms compared. The first 4,435 patterns of the data set are used for training and the last 2,000 patterns are used for testing. Note: The arc-fs algorithm by Breiman is quite similar to the AdaBoost.M1 algorithm.

Author                       Algorithm   Error (%)           Error (%)
                                         Without Boosting    With Boosting
Breiman [16] 1996            CART        14.8                8.8 (arc-fs)
Freund et al. [34] 1996      C4.5        14.8                8.9 (AdaBoost.M1)
Schwenk et al. [118] 2000    MLP         12.8                8.1 (AdaBoost.M2)
                             HONG         8.9                8.8 (AdaBoost.M1), 7.7 (AdaBoost.M2)

The results on the Letters data, again averaged over 10 random runs, are compared in Table 5.3. For this data set, the results shown in the table are after 100 rounds of boosting, beyond which improvements were insignificant. It took about 6 hours to train the boosted HONG network for this data set, and just a few seconds to test the full data set after training. The best results published for this data set are by Schwenk et al. [118], but their boosted MLP system took more than a week on a fast processor (SGI Origin-2000) to train.

Table 5.3: Test error rates on the UCI Letters dataset with/without boosting by the algorithms compared. The first 16,000 patterns of the data set are used for training and the remaining 4,000 patterns are used for testing.

Author                       Algorithm   Error (%)           Error (%)
                                         Without Boosting    With Boosting
Breiman [16] 1996            CART        12.4                3.4 (arc-fs)
Freund et al. [34] 1996      C4.5        13.8                3.3 (AdaBoost.M1)
Schwenk et al. [118] 2000    MLP          6.1                1.5 (AdaBoost.M2)
                             HONG         4.2                3.6 (AdaBoost.M1), 2.9 (AdaBoost.M2)

The results in the two tables show that, as one would expect, AdaBoost.M2 performs better than AdaBoost.M1 on these two multi-class classification problems.

5.6 Boosting Labeled and Unlabeled Data

In the previous set of experiments the full set of labels was employed in training the HONG network. This section discusses how the boosting algorithm can be applied to the HONG network when the training data set contains both labeled and unlabeled data. As in Section 4.9, I randomly partitioned the training set of the SatImage and the Letters data sets into labeled and unlabeled sets. Let $S_L$ represent the portion of the training patterns that are labeled and $S_{UL}$ represent the unlabeled training patterns.

To boost the HONG network in this case, a re-sampled training set $S_t$ is generated by re-sampling the training patterns from the sets $S_L$ and $S_{UL}$ as follows. The labeled training data are re-sampled using the distribution $D_t$ as in the previous set of experiments, but the unlabeled training data are re-sampled using a fixed uniform distribution as in bagging. Once the re-sampled training set $S_t$ is generated, it is fed to the HONG network as in Section 4.9 and a hypothesis $h_t$ is obtained. Here the training error $\epsilon_t$ is calculated on the labeled training set $S_L$. Use of a boosting algorithm in the HONG network with a labeled and an unlabeled training set is illustrated diagrammatically in Figure 5.4.

[Figure: flow diagram of one boosting round with mixed data; the labeled set S_L is re-sampled based on the distribution D_t while the unlabeled set S_UL is sampled uniformly at random, the HONG network (t) is trained on the combined sample S_t, the training error ε_t is calculated on S_L, the value of β_t is calculated and stored, and the distribution D_t is updated.]

Figure 5.4: This shows one round ($t$) of the boosted HONG network using labeled and unlabeled data. Here $S_L$ represents the labeled training set and $D_t$ represents its current distribution. $S_{UL}$ represents the unlabeled training set. The HONG network is trained on the re-sampled training set $S_t$. Note that the original training set $S = S_L \cup S_{UL}$ and the total number of training patterns $P = P_L + P_{UL}$.
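A minimal sketch of the mixed re-sampling step in Figure 5.4, assuming NumPy (the function name is mine):

import numpy as np

def resample_mixed(S_L, S_UL, D_t, rng=np.random.default_rng()):
    """Build S_t: labeled patterns follow the boosting distribution D_t,
    unlabeled patterns are drawn uniformly at random, as in bagging."""
    li = rng.choice(len(S_L), size=len(S_L), p=D_t)   # boosted draw from S_L
    ui = rng.integers(0, len(S_UL), size=len(S_UL))   # uniform draw from S_UL
    return [S_L[i] for i in li] + [S_UL[j] for j in ui]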


Performance of this scheme on the two UCI data sets is detailed in Table 5.4. The parameters used here are the same as for the previous set of experiments. Since the previous set of experiments showed that the AdaBoost.M2 algorithm performs substantially better than the AdaBoost.M1 algorithm for both data sets, the error rates given in Table 5.4 are for AdaBoost.M2 only. Each error rate in Table 5.4 was obtained by averaging over 10 runs.

Table 5.4: Average test error rates of the HONG network with/without boosting on the UCI SatImage and Letters data sets for different proportions of labeled training patterns.

Proportion of    Test Error Rate (%)
Labeled Data     Without Boosting        With Boosting
                 SatImage    Letters     SatImage    Letters
40%              23.32       27.33       27.33       29.73
50%              17.79       15.52       15.50       17.20
60%              15.07       13.71       13.70       12.51
80%              12.38        8.11        8.12        6.34
100%              8.90        4.20        7.70        2.90

With the SatImage data, the boosting algorithm was able to improve performance when the labeled proportion was 50% or more. When the labeled proportion is below 50%, the results deteriorate very quickly. Presumably this is because the system does not have enough labeled patterns to classify the data set satisfactorily. The same was observed with the Letters data set, but here performance is improved by boosting only when the labeled proportion is 60% or more. This is probably because the Letters data set has a larger number of classes than the SatImage data set.

Chapter 6

Conclusion

6.1 Summary and Conclusion

The main theme underlying this thesis has been the use of hybrid learning systems that combine supervised and unsupervised learning methods for pattern classification. Specifically, a novel hybrid learning system with a hierarchical architecture, based on the neural gas algorithm, was developed.

Chapter 2 described the application of neural networks to unsupervised learning problems. The general goal of unsupervised learning is to construct new representations of the input data that capture its underlying structure, and to make this discovery without a teacher. In this chapter we explored only kernel-based methods, which employ soft competitive learning algorithms. We discussed Kohonen's SOM algorithm, which is a special method for conserving the topological relationships in input data.

A rich variety of versions of the basic SOM algorithm have been proposed recently. Some of the variants aim at improving the preservation of the topology by using more flexible map structures instead of the fixed lattice.

Some other variants aim at reducing the computational complexity of the SOM. This thesis investigated an algorithm called the neural gas (NG) algorithm, which is yet another variation of the SOM algorithm. The NG algorithm solves most of the identified problems associated with the basic SOM algorithm. Despite its advantages over the SOM network, the NG network algorithm suffers from a high time complexity in its sequential implementation. The computationally expensive part of the adaptation step of the NG algorithm is the determination of the neighborhood ranking. I introduced an implicit ranking scheme instead of the time-consuming explicit ranking scheme. This modification eliminates the explicit sorting mechanism completely and reduces the complexity from $O(N \log N)$ to $O(N)$, for a given network size $N$.

Compared to the number of applications of Kohonen's SOM algorithm, there are relatively few for the NG algorithm in the literature. I am hopeful that, due to the speed-up method that I have introduced, there might be more applications of the NG algorithm in the future. Certainly, the implicit ranking method will allow the NG algorithm to be used as a basis for supporting vector quantization training in hardware implementations as in [112], since no actual sorting process is involved in the update rule.

Chapter 3 described methods for the combination of multiple predictive models. In this chapter, I discussed the two main approaches that have appeared in the literature, highlighting the strengths and weaknesses of each. From then on I concentrated attention on one of these approaches (ensemble-based) and discussed methods for the creation of individual classifiers and how they can be combined to provide a group consensus. In this regard I discussed simple voting methods, data sampling and re-sampling methods, and non-linear methods like the fuzzy integral.

Chapter 4 proposed the HONG system for pattern classification, which is based on the neural gas algorithm. For this system it was shown that combining confidence values with a network performance measure by means of the fuzzy integral leads to the best classification performance. The performance of the proposed system was compared with that of other techniques on three well-known benchmark data sets, and promising results were obtained. It was also demonstrated that the HONG network can handle a data set with labeled and unlabeled training patterns. The effects on performance of using less than 100% of the labels are very encouraging, and the HONG network was able to achieve a test error rate close to those of CART and C4.5 (without boosting) even when using only 60% of the labeled training data for both the SatImage and Letters data sets.

A large problem, the NIST SD3 database, was tackled using an ensemble of HONG classifiers based on different sets of features extracted from the same data set. This is particularly relevant for unconstrained handwritten data, which often contains conflicting information within the same class due to the various writing styles and instruments used. The ensemble was able to achieve an excellent test error rate for this data set.

Chapter 5 described a popular variant of the boosting algorithm called AdaBoost and showed how this algorithm could be used to improve the performance of the HONG network. The boosting algorithm was able to further improve the performance of the HONG network on the two UCI data sets to which it was applied. In this chapter I also discussed some of the practical problems one may encounter in the implementation of boosting algorithms; these are rarely discussed yet common in implementation.

In addition, I applied the boosting algorithm to the case where the training set contains labeled and unlabeled data. It was shown that for this case also the boosting algorithm can improve the performance of the HONG network.

6.2 Suggestions for Future Extensions

This section considers avenues for further research into the concepts considered in this thesis. Future work will concentrate on some of the following directions:

[Figure: a three-level HONG hierarchy showing a Base Level, a Second Level, and a Third Level.]

Figure 6.1: An example of the HONG network with 3 levels. The full set of ONG networks is not shown in this diagram.

- More than two levels. In the current investigations the behavior of the HONG architecture is examined only up to two levels, but it can certainly grow beyond two levels. The growth process need not be uniform for each level. That is, when it grows to the third level, some of the ONG networks in the second level do not have to grow in the same way as the others (see Figure 6.1). Depending on the available training patterns for each ONG network, the growth can be controlled in a similar fashion to the method used in the SPA neural tree architecture [80]. By this means, some ONG networks can further divide the decision borders of the input patterns, and represent significant and interesting clusters at a higher level of the hierarchy.

- Use of Growing NG. In the HONG architecture, each network is initialized with a fixed number of units, where this selection is based on the number of distinct classes in the data set as well as the number of training patterns available. Instead, it is possible to employ an incremental network model such as the growing neural gas (GNG) algorithm [40]. Such a method can successively add new units to an initially small network by evaluating local statistical measures gathered during the previous adaptation step. Introducing an algorithm like this to the HONG network would eliminate the need to prespecify network sizes and hence eliminate the rules of thumb used in Section 4.5.

- Topology preservation. In the HONG architecture, the topology preservation property does not apply because the base learning algorithm (i.e., the NG algorithm) does not have a topologically prestructured lattice as in the SOM algorithm. It is, however, possible to form a topology preserving map for the NG algorithm by combining it with the competitive Hebb rule, as shown in [40, 88, 89]. Here, at each adaptation step, a connection between the winner and the second-nearest unit is created to capture the neighborhood relationships between the reference vectors. Since the reference vectors are adapted according to the neural gas algorithm, it is possible that reference vectors that were neighboring at an early stage of the adaptation procedure will not be neighboring at a later stage. In such a case, connections that were formed early will no longer be valid and a mechanism is needed to remove them. This can be done by a local connection aging mechanism [89] (a sketch is given after this list). By introducing a topology preserving map into the HONG network it will be possible to visualize clusters in the input space. One complication is that the HONG network does not directly produce a continuous topology preserving map in the higher level networks. That is, each upper level network has a separate topology from its neighboring network topologies. Experiments with artificial data sets of known distributions are required to evaluate the effectiveness of this visualization.

- Different parameters for each level. In the HONG architecture, the higher levels specialize in distinguishing between patterns that belong to classes that are close together in pattern space. So the level of adaptation is different for each level in this architecture. To reflect this, some of the learning parameters (e.g., the learning rate and the neighborhood size) can be given different values for each level rather than using the same value for each level.
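The following is a minimal sketch of the competitive Hebb rule with connection aging referred to in the topology preservation item above (the data structures and the max_age threshold are my choices; see [88, 89] for the original formulations):

def hebb_update(edges, ages, winner, second, max_age=50):
    """Competitive Hebb rule with edge aging: connect the two nearest units,
    then age and prune the winner's other connections."""
    e_new = frozenset((winner, second))
    edges.add(e_new)
    ages[e_new] = 0                       # (re)set the age of the used edge
    for e in list(edges):
        if winner in e and e != e_new:
            ages[e] = ages.get(e, 0) + 1  # age the winner's other edges
            if ages[e] > max_age:         # remove stale neighborhood relations
                edges.discard(e)
                ages.pop(e, None)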

Bibliography

[1] F. Ancona, S. Ridella, S. Rovetta, and R. Zunino. On the Importance of Sorting in “Neural Gas” Training of Vector Quantizers. In Proceedings of the IEEE International Conference on Neural Networks, pages 1804–1808, 1997.

[2] G. Banon. Distinction Between Several Subsets of Fuzzy Measures. Fuzzy Sets and Systems, 5(3):291–305, May 1981.

[3] J. A. Barnett. Computational Methods for a Mathematical Theory of Evidence. In Proceedings of the Seventh International Conference on Artificial Intelligence, Vancouver, B.C., Canada, August 1981.

[4] P. Bartlett. Personal Communication, 2001.

[5] R. Battiti and A. M. Colla. Democracy in Neural Nets: Voting Schemes for Classification. Neural Networks, 7(4):691–707, 1994.

[6] E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants. Machine Learning, 36(1/2):105–139, July 1999.

[7] H.-U. Bauer and K. Pawelzik. Quantifying the Neighborhood Preservation of Self-Organizing Feature Maps. IEEE Transactions on Neural Networks, 3(4):570–579, July 1992.

[8] H.-U. Bauer and T. Villmann. Growing a Hypercubical Output Space in a Self-Organizing Feature Map. IEEE Transactions on Neural Networks, 8(2):218–226, March 1997.

[9] W. G. Baxt. Improving the Accuracy of an Artificial Neural Network Using Multiple Differently Trained Networks. Neural Computation, 4(5):772–780, September 1992.

[10] S. Becker. Unsupervised Learning Procedures for Neural Networks. International Journal of Neural Processing Systems, 2:17–33, 1991.

[11] J. C. Bezdek. A Convergence Theorem for the Fuzzy Isodata Clustering Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(1):1–8, 1980.

[12] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

[13] C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases, 1998. http://www.ics.uci.edu/mlearn/MLRepository.html (as at 14/08/2001).

[14] H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer, Norwell, MA, 1993.

[15] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International, 1984.

[16] L. Breiman. Arcing Classifiers. Technical report, Dept. of Statistics, University of California at Berkeley, 1996.

[17] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, August 1996.

[18] K. J. Cherkauer. Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks. In Thirteenth National Conference on Artificial Intelligence, pages 15–21, Portland, OR, 1996.

[19] J.-H. Chiang. Choquet Fuzzy Integral-Based Hierarchical Networks for Decision Analysis. IEEE Transactions on Fuzzy Systems, 7(1):63–71, February 1999.

[20] S.-B. Cho and J. H. Kim. Multiple Network Fusion Using Fuzzy Logic. IEEE Transactions on Neural Networks, 6(2):497–501, March 1995.

[21] C. S.-T. Choy and W.-C. Siu. Fast Sequential Implementation of “Neural Gas” Network for Vector Quantization. IEEE Transactions on Communications, 46(3):301–304, March 1998.

[22] R. T. Clemen. Combining Forecasts: A Review and Annotated Bibliography. International Journal of Forecasting, 5(4):559–583, 1989.

[23] P. Demartines and F. Blayo. Kohonen Self-Organizing Maps: Is the Normalization Necessary? Complex Systems, 6:105–123, 1992.

[24] T. G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and Randomization. Machine Learning, pages 1–22, 1999.

[25] M. Dittenbach, D. Merkl, and A. Rauber. The Growing Hierarchical Self-organizing Map. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’2000), volume 6, pages 15–19, Como, Italy, July 2000.

[26] M. Dittenbach, D. Merkl, and A. Rauber. Using Growing Hierarchical Self-organizing Maps for Document Classification. In Proceedings of the Eighth Symposium on Artificial Neural Networks (ESANN’2000), April 2000.

[27] C. Domingo and O. Watanabe. MadaBoost: A Modification of AdaBoost. In Thirteenth Annual Conference on Computational Learning Theory (COLT’2000), pages 180–189. Morgan Kaufmann, June 2000.

[28] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience Publication, 1973.

[29] O. K. Ersoy and D. Hong. Parallel, Self-Organizing, Hierarchical Neural Networks. IEEE Transactions on Neural Networks, 1(2):167–178, June 1990.

[30] E. Erwin, K. Obermayer, and K. Schulten. Convergence Properties of Self-Organizing Maps. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 409–414. North-Holland, Amsterdam, 1991.

[31] E. Erwin, K. Obermayer, and K. Schulten. Self-Organizing Maps: Ordering, Convergence Properties and Energy Functions. Biological Cybernetics, 67:47–55, 1992.

[32] E. Erwin, K. Obermayer, and K. Schulten. Self-Organizing Maps: Stationary States, Metastability and Convergence Rate. Biological Cybernetics, 67:35–45, 1992.

[33] A. Flexer. Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice. In Proceedings of the Thirteenth European Meeting on Cybernetics and Systems Research, volume 2, pages 1005–1008, Vienna, Austria, 1996.

[34] Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996.

[35] P. W. Frey and D. J. Slate. Letter Recognition Using Holland-Style Adaptive Classifiers. Machine Learning, 6(2):161–182, March 1991.

[36] B. Fritzke. Unsupervised Ontogenetic Networks. In E. Fiesler and R. Beale, editors, Handbook of Neural Computation, pages C2.4:1–C2.4:16. IOP Publishing and Oxford University Press, 1996.

[37] B. Fritzke. Let It Grow – Self-Organizing Feature Maps With Problem Dependent Cell Structure. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, volume 1, pages 403–408. Elsevier, North-Holland, Amsterdam, 1991.

[38] B. Fritzke. Kohonen Feature Maps and Growing Cell Structures – a Performance Comparison. In L. Giles, S. Hanson, and J. Cowan, editors, Advances in Neural Information Processing Systems 5, pages 123–130. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[39] B. Fritzke. Growing Cell Structures – A Self-Organizing Network for Unsupervised and Supervised Learning. Neural Networks, 7(9):1441–1460, 1994.

[40] B. Fritzke. A Growing Neural Gas Network Learns Topologies. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 625–632. The MIT Press, Cambridge, MA, 1995.

[41] B. Fritzke. Growing Grid – A Self-Organizing Network with Constant Neighborhood Range and Adaptation Strength. Neural Processing Letters, 2(5):9–13, 1995.

[42] B. Fritzke. Growing Self-Organizing Networks – Why? In ESANN’96: European Symposium on Artificial Neural Networks, pages 61–72, 1996.

[43] K. Fukushima. Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition. Neural Networks, 1(2):119–130, 1988.

[44] M. D. Garris. Design, Collection and Analysis of Handwriting Sample Image Databases. The Encyclopedia of Computer Science and Technology, 31(16):189–213, 1994.

[45] C. W. J. Granger. Combining Forecasts: Twenty Years Later (invited review). Journal of Forecasting, 8(3):167–173, July 1989.

[46] S. Grossberg. Adaptive Pattern Classification and Universal Recoding: I. Parallel Development and Coding of Neural Feature Detectors. Biological Cybernetics, 23:121–134, 1976.

[47] A. J. Grove and D. Schuurmans. Boosting in the Limit: Maximizing the Margin of Learned Ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI’98). AAAI Press, June 1998.

[48] S. Gutta, J. R. Huang, P. Jonathon, and H. Wechsler. Mixture of Experts for Classification of Gender, Ethnic Origin, and Pose of Human Faces. IEEE Transactions on Neural Networks, 11(4):948–959, July 2000.

[49] T. M. Ha and H. Bunke. Off-Line, Handwritten Numeral Recognition by Perturbation Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):535–539, May 1997.

[50] L. K. Hansen and P. Salamon. Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, October 1990.

[51] S. Haykin. Neural Networks – A Comprehensive Foundation. Macmillan College Publishing Co., Inc., New York, 1994.

[52] S. Haykin. Neural Networks – A Comprehensive Foundation, Second Ed. Prentice Hall Inc., Upper Saddle River, NJ, 1999.

[53] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.

[54] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1990.

[55] T. K. Ho, J. J. Hull, and S. N. Srihari. Decision Combination in Multiple Classifier Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66–75, January 1994.

[56] K. Hornik, M. Stinchcombe, and H. White. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5):359–366, 1989.

[57] K.-R. Hsieh and W.-T. Chen. A Neural Network Model Which Combines Unsupervised and Supervised Learning. IEEE Transactions on Neural Networks, 4(2):357–360, March 1993.

[58] Y. S. Huang and C. Y. Suen. A Method of Combining Multiple Experts for the Recognition of Unconstrained Handwritten Numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):90–94, January 1995.

[59] M. M. V. Hulle. Faithful Representations and Topographic Maps. John Wiley and Sons, 2000.

[60] R. A. Jacobs. Methods for Combining Experts’ Probability Assessments. Neural Computation, 7(5):867–888, September 1995.

[61] R. A. Jacobs. Bias/Variance Analysis of Mixtures-of-Experts Architecture. Neural Computation, 9(2):369–383, February 1997.

[62] R. A. Jacobs and M. I. Jordan. Adaptive Mixtures of Local Experts. Neural Computation, 3(1):79–87, Spring 1991.

[63] M. I. Jordan and R. A. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6(2):181–214, March 1994.

[64] M. I. Jordan and R. A. Jacobs. Modular and Hierarchical Learning System. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 579–582. The MIT Press, 1995.

[65] A. Jóźwik, S. Serpico, and F. Roli. A Parallel Network of Modified 1-NN and k-NN Classifiers: Application to Remote-Sensing Image Classification. Pattern Recognition Letters, 19:57–62, 1998.

[66] J. A. Kangas, T. K. Kohonen, and J. T. Laaksonen. Variants of Self-Organizing Maps. IEEE Transactions on Neural Networks, 1(1):93–99, March 1990.

[67] S. Kaski, J. Kangas, and T. Kohonen. Bibliography of Self-Organizing Map (SOM) Papers: 1981–1997. Neural Computing Surveys, 1:102–350, 1998.

[68] J. Kittler. Combining Classifiers: A Theoretical Framework. Pattern Analysis and Applications, 1(1):18–27, 1998.

[69] J. Kittler, M. Hatef, R. P. Duin, and J. Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, March 1998.

[70] J. Kittler, J. Matas, K. Jonsson, and M. R. Sánchez. Combining Evidence in Personal Identity Verification Systems. Pattern Recognition Letters, 18:845–852, 1997.

[71] J. Koh, M. Suk, and S. M. Bhandarkar. A Multilayer Self-Organizing Feature Map for Range Image Segmentation. Neural Networks, 8(1):67–86, 1995.

[72] R. Kohavi and D. H. Wolpert. Bias Plus Variance Decomposition for Zero-One Loss Functions. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 275–283. Morgan Kaufmann, 1996.

[73] R. Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995.

[74] T. Kohonen. Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, 43:59–69, 1982.

[75] T. Kohonen. The Self-Organizing Map. Proceedings of the IEEE, 78(9):1464–1480, September 1990.

[76] T. Kohonen. Self-Organizing Maps, Second Edition. Springer-Verlag, Berlin, 1995.

[77] E. B. Kong and T. G. Dietterich. Error-Correcting Output Coding Corrects Bias and Variance. In Machine Learning: Proceedings of the Twelfth International Conference, pages 313–321. Morgan Kaufmann, 1995.

[78] A. S. Kumar, S. Basu, and K. Majumdar. Robust Classification of Multispectral Data Using Multiple Neural Networks and Fuzzy Integral. IEEE Transactions on Geoscience and Remote Sensing, 35(3):787–790, May 1997.

[79] K. Leszczyński, P. Penczek, and W. Grochulski. Sugeno’s Fuzzy Measure and Fuzzy Clustering. Fuzzy Sets and Systems, 15:147–158, 1985.

[80] T. Li, Y. Y. Tang, and L. Fang. A Structure-Parameter-Adaptive (SPA) Neural Tree for the Recognition of Large Character Set. Pattern Recognition, 28(3):315–329, 1995.

[81] X. Lin, X. Ding, M. Chen, R. Zhang, and Y. Wu. Adaptive Confidence Transform Based Classifier Combination for Chinese Character Recognition. Pattern Recognition Letters, 19:975–988, 1998.

[82] Y. Linde, A. Buzo, and R. Gray. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications, 28:84–95, 1980.

[83] Z.-P. Lo and B. Bavarian. On the Rate of Convergence in Topology Preserving Neural Networks. Biological Cybernetics, 65:55–63, 1991.

[84] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press, Berkeley, 1967.

[85] G. Mani. Lowering Variance of Decisions by Using Artificial Neural Network Portfolios. Neural Computation, 3(4):484–486, 1991.

[86] H. M. Markowitz. Portfolio Selection. The Journal of Finance, 7(1):77–91, March 1952.

[87] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. ”Neural Gas” Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4(4):558–569, July 1993.

[88] T. M. Martinetz and K. Schulten. A “Neural-Gas” Network Learns Topologies. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 397–402. North-Holland, Amsterdam, 1991.

[89] T. M. Martinetz and K. J. Schulten. Topology Representing Networks. Neural Networks, 7(3):507–522, 1994.

[90] C. McCormack. Adaptation of Learning Rule Parameters Using a Meta Neural Network. Connection Science, 9(1):123–136, 1997.

[91] W. S. McCulloch and W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.

[92] D. A. Medler. A Brief History of Connectionism. Neural Computing Surveys, 1:61–101, 1998.

[93] D. Merkl. Text Classification with Self-Organizing Maps: Some Lessons Learned. Neurocomputing, 21:61–77, 1998.

[94] D. Merkl and A. Rauber. Uncovering the Hierarchical Structure of Text Archives by Using an Unsupervised Neural Network with Adaptive Architecture. In Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2000), April 2000.

[95] D. Michie, D. J. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York, 1994.

[96] R. Miikkulainen. Script Recognition with Hierarchical Feature Maps. Connection Science, 2(1):83–101, 1990.

[97] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969.

[98] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw Hill, 1965.

[99] S. J. Nowlan. Maximum Likelihood Competitive Learning. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 574–582. Morgan-Kaufmann, 1990.

[100] E. Oja. A Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology, 15:267–273, 1982.

[101] D. Opitz and R. Maclin. Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research, 11:169–198, 1999.

[102] J. Park, V. Govindaraju, and S. N. Srihari. OCR in a Hierarchical Feature Space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):400–407, April 2000.

[103] D. B. Parker. Learning-logic: Casting the Cortex of the Human Brain in Silicon. Technical Report TR-47, Center for Computational Research in Economics and Management Science, Cambridge, MA, 1985.

[104] J. R. Quinlan. Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI’96), volume 1, pages 725–730. AAAI Press, August 1996.

[105] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[106] A. Rahman and M. Fairhurst. A New Hybrid Approach in Combining Multiple Experts to Recognise Handwritten Numerals. Pattern Recognition Letters, 18:781–790, 1997.

[107] G. Rätsch, T. Onoda, and K. R. Müller. Regularizing AdaBoost. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 564–570. The MIT Press, 1999.

[108] M. D. Richard and R. P. Lippmann. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities. Neural Computation, 3(4):461–483, Winter 1991.

[109] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[110] G. Rogova. Combining the Results of Several Neural Network Classifiers. Neural Networks, 7(5):777–781, 1994.

[111] F. Rosenblatt. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65:386–408, 1958.

[112] S. Rovetta and R. Zunino. Efficient Training of Neural Gas Vector Quantizers with Analog Circuit Implementation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 46(6):688–698, June 1999.

[113] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol 1. MIT Press, Cambridge, MA, 1986.

[114] D. E. Rumelhart and D. Zipser. Feature Discovery by Competitive Learning. Cognitive Science, 9:75–112, 1985.

[115] M. Sabourin, A. Mitiche, D. Thomas, and G. Nagy. Classifier Combination for Hand-Printed Digit Recognition. In Proceedings of the International Conference on Document Analysis and Recognition, pages 163–166, Tsukuba Science City, Japan, 1993.

[116] R. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322–330. Morgan Kaufmann, 1997.

[117] R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2):197–227, June 1990.

[118] H. Schwenk and Y. Bengio. Boosting Neural Networks. Neural Computation, 12(8):1869–1887, August 2000.

[119] A. J. C. Sharkey. On Combining Artificial Neural Nets. Connection Science, 8(3 & 4):299–313, 1996.

[120] A. J. C. Sharkey. Modularity, Combining and Artificial Neural Nets. Connection Science, 9(1):3–10, 1997.

[121] A. J. C. Sharkey. Multi-Net Systems. In A. J. C. Sharkey, editor, Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pages 1–30. Springer Verlag, 1999.

[122] P. N. Suganthan. Hierarchical Overlapped SOMs for Pattern Classification. IEEE Transactions on Neural Networks, 10(1):193–196, January 1999.

[123] M. Sugeno. Fuzzy Measures and Fuzzy Integrals – A Survey. In M. Gupta, G. Saridis, and B. Gaines, editors, Fuzzy Automata and Decision Processes, pages 89–102. North Holland, New York, 1977.

[124] H. Tahani and J. M. Keller. Information Fusion in Computer Vision Using the Fuzzy Integral. IEEE Transactions on Systems, Man and Cybernetics, 20(3):733–741, May 1990.

[125] R. Tibshirani. Bias, Variance and Prediction Error for Classification Rules. Technical report, Dept. of Statistics, University of Toronto, November 1996.

[126] V. Tolat. An Analysis of Kohonen’s Self-Organizing Maps using a System of Energy Functions. Biological Cybernetics, 64:155–164, 1990.

[127] C. von der Malsburg. Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85–100, 1973.

[128] S. Waterhouse, D. MacKay, and T. Robinson. Bayesian Methods for Mixture of Experts. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 351–357. Morgan-Kaufmann, 1996.

[129] G. I. Webb. MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning, 40(2):159–196, August 2000.

[130] A. Weijters, A. V. D. Bosch, and H. J. V. D. Herik. Behavioural Aspects of Combining Backpropagation Learning and Self-Organizing Maps. Connection Science, 9(3):235–251, September 1997.

[131] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, Cambridge, MA, 1974.

[132] B. Widrow and M. E. Hoff. Adaptive Switching Circuits. IRE Western Electric Show and Convention Record, 4:96–104, 1960.

[133] D. Willshaw and C. von der Malsburg. How Patterned Neural Connections Can Be Set Up by Self-Organization. Proceedings of the Royal Society of London Series B, 194:431–445, 1976.

[134] K. Woods, W. P. Kegelmeyer Jr., and K. Bowyer. Combination of Multiple Classifiers using Local Accuracy Estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):405–410, April 1997.

[135] J. Wu, H. Yan, and A. Chalmers. Handwritten Digit Recognition Using Two-Layer Self-Organizing Maps. International Journal of Neural Systems, 5(4):357–362, December 1994.

[136] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418–435, May 1992.

[137] R. R. Yager. Element Selection from a Fuzzy Subset Using the Fuzzy Integral. IEEE Transactions on Systems, Man and Cybernetics, 23(2):467–477, March 1993.

[138] R. S. Zemel. A Minimum Description Length Framework for Unsupervised Learning. PhD thesis, University of Toronto, Toronto, Canada, 1993.

[139] G. P. Zhang. Neural Networks for Classification: A Survey. IEEE Transactions on Systems, Man and Cybernetics – Part C, 30(4):451–462, November 2000.