Feature Based Architecture for Decision Fusion by

Nayer Wanas

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Systems Design Engineering

Waterloo, Ontario, 2003

© Nayer Wanas, 2003

I hereby declare that I am the sole author of this thesis. I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research.

Nayer Wanas

I authorize the University of Waterloo to reproduce this thesis by photocopying or other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

Nayer Wanas


The University of Waterloo requires the signatures of all persons using or photocopying this thesis. Please sign below, and give address and date.


Acknowledgements

All thanks and praise are due to the One GOD for giving me the strength and knowledge to complete this work. Without His will and help, nothing could, can, or will be accomplished.

Then, I would like to acknowledge the support and help of many people, without whom this work would not have been realized. It is not possible to enumerate all of them, but I would like to express my gratitude to some people in particular.

First of all, I would like to thank Prof. Mohamed Kamel, who has been an unlimited source of encouragement, support, guidance and patience. He has been a role model and mentor; without his continuous support and supervision this work would not have been possible. I owe a great deal of gratitude to him. I would also like to express my gratitude to Dr. Fakhri Karray for his support. I would like to thank the many anonymous reviewers who gave me valuable comments on many parts of this thesis.

I would like to thank my colleagues in the PAMI lab who provided a good working atmosphere and stimulating discussions. I would also like to thank my friends, especially those at the University of Waterloo, for their support, care and assistance when I needed it most.

Finally, and by no means least, I feel greatly indebted to my parents for their unlimited support and guidance, not only throughout the years of my Ph.D., but throughout my life. They have never stopped believing in me and encouraging me to finish my studies until I received my Ph.D. degree. To them go all my love and prayers.


I dedicate this thesis to my family, especially my parents, my sister, and both my nieces Nour and Sarah.


Abstract

Researchers continue to focus on the design of pattern recognition systems to achieve the best classification rates. Usually, different classification schemes are developed for the problem at hand, and then, by experimental assessment, the best classifier is chosen as the final solution to the problem. However, it has been observed that although one design may outperform the others, the patterns that are misclassified by the different classifiers are not necessarily the same. This observation suggests that the use of multiple classifiers can enhance the decision about the patterns under classification. Also, multiple classifiers can improve the reliability of the final classifier because of the simultaneous use of different features and classification procedures. Combining the individual decisions to provide an aggregated decision is crucial to successful system design.

Decision making is what separates the human species from other species. Every day, we face a variety of choices, and we must decide which of the available actions to take. In this work, architectures and methods for aggregating decisions in a classifier ensemble environment will be investigated. The proposed architectures allow for dynamic decision fusion by utilizing the information and features of a problem, via detector modules, to guide the decision fusion process. The features of the problem sub-space, as well as the output, are used to identify the strengths and weaknesses of the different classifiers. These features are also used to tune the aggregation procedure to better solve the problem. The main focus of this thesis is to make the final decision more dependent on the pattern being classified, and hence more dynamic. Some of the techniques that we propose to implement this dynamic fusion pertain to neural networks.

Various aspects related to the design and performance of these new architectures are studied, including methods to generate the detector. An iterative training algorithm that allows the final classification to determine whether further training should be carried out on components of these architectures is proposed. The performance of these architectures is assessed by testing them on several benchmark problems, and the results are compared to alternative aggregation schemes. The new architectures improve the accuracy of an ensemble, and offer an improvement over existing aggregation techniques. The architectures provide a means to limit the intervention of a user while maintaining a level of accuracy that is competitive with most approaches. The proposed training algorithm is also shown to reduce the training requirements of these architectures.

Contents

1 Introduction
  1.1 Preface
  1.2 Classifier Combination
  1.3 Data Dependence in Multiple Classifier Approaches
  1.4 Objectives and Approach
  1.5 Thesis Overview

2 Sensor Fusion
  2.1 Level of Representation in Fusion
    2.1.1 Signal-Level Fusion
    2.1.2 Pixel-Level Fusion
    2.1.3 Feature-Level Fusion
    2.1.4 Symbol-Level Fusion
  2.2 Computational Methods for Sensor Fusion
  2.3 Classifier Fusion
  2.4 Levels of Classifier Output Information
  2.5 Problems for Multiple Classifier Combination
  2.6 Summary

3 Classifier Ensembles Combining Methods
  3.1 Methods for Creating Ensemble Members
    3.1.1 Varying Training Data
  3.2 Methods for Combining Classifiers in Ensembles
    3.2.1 Averaging
    3.2.2 Non-Linear Combining Methods
    3.2.3 Bayesian Combination
    3.2.4 Dempster-Shafer Approach
    3.2.5 Combination via Choquet Integral
    3.2.6 Behaviour Knowledge Space and Classifier Selection
  3.3 Architectures for Combining Classifiers
    3.3.1 Boosting
    3.3.2 Stacked Generalization
    3.3.3 Hierarchical Mixture of Experts
  3.4 Choosing a Method for Ensemble Creation and Combination
    3.4.1 Taking Account of the Dependency between Classifiers
    3.4.2 Creating Classifiers for Effective Combination
  3.5 Summary

4 Feature-Based Decision Fusion Architectures
  4.1 The Detector Based Architecture
    4.1.1 Classifiers
    4.1.2 Detectors
    4.1.3 The Aggregation Procedure
    4.1.4 Two-dimensional Illustrative Example
    4.1.5 Discussion
  4.2 Feature Based Architecture
    4.2.1 Illustrative Example
    4.2.2 Discussion
  4.3 An Adaptive Training Algorithm for Classifier Ensemble
    4.3.1 Adaptive Training Based on Weighted Average Aggregation
    4.3.2 Evolving Feature Based Architecture
    4.3.3 Discussion
  4.4 Summary

5 Results and Comparative Study
  5.1 Data
  5.2 20-Class Gaussian Problem
    5.2.1 Variation of Number of Classifiers
    5.2.2 Analysis of Error Rates
  5.3 Clouds Data
  5.4 Satimages Data
  5.5 Summary

6 Conclusion and Future Work
  6.1 Future Work

A Data Sets
  A.1 Satimages Data
  A.2 Clouds Data

Bibliography

List of Tables

4.1 Classes within each training subset
4.2 Performance of the Gaussian 20-Class problem (% error)
4.3 Performance of feature-based approach
4.4 Comparison of ensemble approaches based on EVOL Algorithm
4.5 Performance of the feature-based approach
5.1 Error rates for the different combining methods for different numbers of classifiers in the ensemble
5.2 Error rates for the different architectures on the 20-Class Gaussian problem, partially disjoint data
5.3 Error rates for the different architectures on the 20-Class Gaussian problem, disjoint data
5.4 Number of epochs / Training time (in mins) required for the different architectures and algorithms applied to the 20-Class Gaussian problem, partially disjoint data
5.5 Number of epochs / Training time (in mins) required for the different architectures and algorithms applied to the 20-Class Gaussian problem, disjoint data
5.6 Error rates for the different architectures on the Clouds data problem, partially disjoint data
5.7 Error rates for the different architectures on the Clouds data problem, disjoint data
5.8 Number of epochs / Training time (in mins) required for the different architectures and algorithms applied to the Clouds data problem, partially disjoint data
5.9 Number of epochs / Training time (in mins) required for the different architectures and algorithms applied to the Clouds data problem, disjoint data
5.10 Error rates for the different architectures on the Satimages problem, partially disjoint data
5.11 Error rates for the different architectures on the Satimages problem, disjoint data
5.12 Number of epochs / Training time (in mins) required for the different architectures and algorithms applied to the Satimages problem, partially disjoint data
5.13 Number of epochs / Training time (in mins) required for the different architectures and algorithms applied to the Satimages problem, disjoint data

List of Figures

3.1 Two-level stacked generalization architecture
3.2 Two-level hierarchical mixture of experts architecture
4.1 Detector-Based Decision Fusion Architecture
4.2 The two-dimensional 20-Class data set used
4.3 Feature-based decision aggregation architecture
4.4 Adaptive training algorithm
4.5 Performance measures for different values of Γ and P
4.6 Classification error for the best classifier, maximum, and evolving architecture
4.7 Normal training algorithm
4.8 EVOL training algorithm
5.1 Performance measures for different numbers of classifiers

List of Notations

Ci : Classifier i
xi : Input vector for classifier i
y : Output vector for classifier i
yj : Output corresponding to class j
cij(x) : Confidence of classifier i that input x belongs to class j
Pi(yj|x) : Probability that the output of classifier i is yj for a given input vector x
Q(x) : Final classification output for input x
F : Mapping between the output of the classifiers and Q(x)
K : Number of classifiers
N : Number of output classes
wij : Weight assigned by classifier i to output j
I : Number of input features
H : Number of hidden nodes in a neural network classifier
M : Number of patterns in the training set
Mi : Number of patterns in the training set for classifier i
V : Number of patterns in the evaluation set
E : Maximum number of training cycles
e : Number of training cycles in one training step
P : Preset modification constant
Γ : Real-valued constant
δ : Correction factor
Ri : Number of repetitions of training for classifier i

Chapter 1

Introduction

1.1 Preface

Decision making is what separates the human species from other species. Every day, we face a variety of choices, and we must decide which of the available actions to take. Before making a decision, we consider numerous approaches and their possible results. In other words, we consider the effects of multiple approaches. For some problems we may consult numerous experts, or a committee of experts, to help us reach a conclusion or decision. Similarly, in pattern classification, multiple classifier systems are often practical and effective solutions for difficult pattern recognition tasks. Multiple classifier systems are also known as hybrid methods, decision combination, multiple experts, mixture of experts, classifier ensembles, and sensor fusion. In some cases, empirical observations of the performance of specialized classifiers justify the use of multiple classifiers. In other cases, the adoption of multiple classifiers stems from the decomposition of the problem, such as the need to employ a variety of sensor types, or the need to avoid making commitments to arbitrary initial conditions and parameters. The latter derives from the notion that adding randomness to the selection of initial conditions produces a diverse collection that can perform better than a single element. This is supported by the fact that the patterns that are misclassified by the different classifiers are not necessarily the same [56]. Moreover, the use of multiple classifiers improves the reliability of the overall classification process. However, combining the individual decisions to provide an aggregated decision is crucial to successful outcomes.

1.2 Classifier Combination

The use of multiple classifiers has gained momentum in recent years [58,59,80], and researchers have continuously argued for the benefits of using multiple classifiers to solve complex recognition problems [56]. Multiple classifier systems can be classified in a variety of ways. The basic categorization of multiple classifier systems has been by the way the classifiers are arranged. The two basic categories in this regard are the serial suite and the parallel suite. The parallel expert architecture consists of a set of classifiers that are consulted in parallel; the decisions of the various experts are combined by the fusion module. The experts, in this case, are capable of independent and simultaneous operation. On the other hand, the serial suite consists of a set of classifiers arranged in series, or in tandem. This architecture is well suited to situations where the different experts have a ternary decision scheme, in which an expert can be undecided about the input pattern it is presented with. If the current expert is undecided, information is passed to the next expert in the sequence. The experts have to have varying generalization abilities. While parallel architectures are the most popular, more complicated combinations of these basic schemes can also be used, such as parallel-serial or serial-parallel architectures.
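The serial (tandem) suite described above can be sketched as a confidence-gated cascade. This is a minimal illustration in Python (assuming NumPy), not taken from the thesis; the two lambda "experts", the threshold, and the modelling of "undecided" as low confidence are all illustrative assumptions.

```python
import numpy as np

def cascade_classify(x, classifiers, confidence_threshold=0.8):
    """Serial suite: consult experts one at a time and stop as soon as one of
    them is confident enough; an expert that stays below the threshold is
    treated as 'undecided' and the pattern is passed on to the next expert."""
    decision = None
    for classify in classifiers:
        confidences = classify(x)                  # measurement-level output, shape (N,)
        decision = int(np.argmax(confidences))
        if confidences[decision] >= confidence_threshold:
            return decision                        # early exit: later experts are never consulted
    return decision                                # fall back to the last expert's best guess

# Hypothetical experts with different generalization abilities.
hesitant = lambda x: np.array([0.55, 0.45])        # undecided under the 0.8 threshold
sharp    = lambda x: np.array([0.10, 0.90])
print(cascade_classify(x=None, classifiers=[hesitant, sharp]))   # -> 1
```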


Multiple classifiers can also be categorized based on the method of mapping between the input and output of the fusion module. This mapping may be linear or non-linear. Linear combinations are the simplest approaches, in which a weighting factor is assigned to the output of each expert being combined. This weighting factor can be specific to the input pattern, or can be specific to the expert. The weighted average [37] and fuzzy integrals [28] are among the linear combination methods, while the majority and maximum votes [69] are non-linear methods. Stacked generalization [102] and rank-based methods [44], which involve a more complex mapping of the input, use a non-linear mapping in the combining method. Combining methods can also be divided into two different classes depending on the representation methodology. The classifiers can all use the same representation, in which case the classifiers themselves should be different. In multi-representation approaches [56], the different classifiers use different representations of the same inputs. This can be due to the use of different sensors or different features extracted from the same data set. Another categorization of classifier combining methods is whether they encourage specialization [4] in certain areas of the feature space. On the other hand, ensembles of classifiers [23, 65, 100] do not encourage such specialization, and hence the classifiers themselves must have different classification powers. In other words, in an ensemble each base classifier can be used alone to provide a solution for the input pattern, whereas a modular approach needs the coordination of all the classifiers to present a complete solution.
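To make the linear versus non-linear distinction concrete, the sketch below (Python with NumPy; not from the thesis) fuses the confidence outputs of K classifiers once by a weighted average and once by a majority vote. The function names and the toy confidence matrix are illustrative only.

```python
import numpy as np

def weighted_average_combiner(confidences, weights):
    """Linear combiner: weighted average of per-classifier confidence vectors.
    confidences has shape (K, N): K classifiers, N classes."""
    fused = np.average(confidences, axis=0, weights=weights)
    return int(np.argmax(fused))

def majority_vote_combiner(confidences):
    """Non-linear combiner: each classifier casts one hard vote for its top class."""
    votes = np.argmax(confidences, axis=1)
    return int(np.bincount(votes).argmax())

# Three classifiers, four classes.
c = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.7, 0.1, 0.1, 0.1]])
print(weighted_average_combiner(c, weights=[1.0, 1.0, 1.0]))   # -> 1
print(majority_vote_combiner(c))                               # -> 1
```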

1.3 Data Dependence in Multiple Classifier Approaches

The way in which combining methods incorporate data dependence can be seen as a basis for categorizing various voting approaches. A voting method can lack any dependence on the data, or it can be dependent in some way, either implicitly or explicitly. Voting methods that do not show any dependence on the data include the maximum, median, minimum, product and average votes [1, 56, 91]. The rank-based methods, such as the Borda count [44], also lack any dependence on the data. All such methods rely solely on the output of the base classifiers to produce a final representative decision, irrespective of the pattern being classified. In a sense, we can represent the confidence cij(x) that a classifier i has in a given input pattern x being a member of class yj, ∀ j ∈ ∆ = {1, 2, · · · , N}, as a probability

cij(x) = Pi(yj|x)    (1.1)

Then the data-independent operators take the form

Q(x) = Fj(cij(x))    (1.2)

Where the mapping Fj can be linear or non-linear. Some of the well known approaches that do not demonstrate data dependence are the product rule [56], simple averaging, and the maximum and majority votes [65]. Although simple aggregation rules such as these compete with more complex aggregation schemes involving second-level training, they do not guard against unreliable confidences. Such confidences might be generated when using under-trained or over-trained sub-optimal classifiers.

Trained combining methods, in general, are data dependent [22]. This dependency can be either implicit or explicit. Implicit dependencies include methods that train the combiner on the global performance of the data. The weighted average [37, 99], fuzzy integrals [28] and belief theory approaches [104] can be categorized as implicitly data dependent. In these approaches, the combiner is trained on how the base classifiers perform collectively on a set of training data. Boosting [27] can also be categorized as implicitly data dependent, since it depends on clustering the data patterns into general clusters while training. Boosting, however, is different in the sense that the dependence is inherent in the classifiers rather than in the combining method. Behaviour-Knowledge Space (BKS) [46] also clusters the data into a defined number of BKS units, and in turn can also be considered implicitly data dependent. Modular approaches and static classifier selection methods, in which certain classifiers are specified for certain regions, can be considered implicitly data dependent [4]; the regions in which the classifiers are expected to excel are pre-defined during training and do not change dynamically. Methods that post-process the output of the base classifiers, or contain second-level training, while only considering the desired final output are implicitly data dependent. Such methods include the stacked generalization approach [102] and hierarchical fusion [62, 86]. Evolutionary algorithms [69, 86] and decision templates [66] have also been used to determine the weights assigned to the different base classifiers; these, in turn, can be considered implicitly data dependent. Implicitly dependent approaches can be represented as following the form

Q(x) = Fj(wi(cij(x)), cij(x))    (1.3)


Where the weights assigned to any output yij depend on the output of the base classifiers. These approaches do not guard against systematic errors of the individual classifiers. They are also prone to mistakes caused by unrealistic confidences from certain classifiers. A third, and final, way of categorizing combining methods based on their dependence on data is whether they demonstrate explicit data dependence. In the case of either selection or combination, explicitly data-dependent approaches can be defined as

Q(x) = Fj(wi(x), cij(x)).    (1.4)

Dynamic selection [64], in which the choice of the base classifier to use is made during the classification phase, can be considered an explicitly data dependent approach. This is in contrast to static selection [39], in which the input pattern is not considered when making the selection. In dynamic selection, the “competence” of each classifier is estimated in the vicinity of the input pattern being classified. Woods et al. [103] use the DCS-LA approach, in which local accuracy is used to select the classifier of choice for a given pattern, while Giacinto and Roli [33] use a DCS approach based on multiple classifier behaviour (MCB). Both of these approaches are explicitly data dependent. Another example is the mixture of experts [51], where the choice of the classifier is based on both its output and the input pattern being classified. Song et al. [87] use a partial context maximum posterior (PCMP) decision rule, which includes contextual information about the pattern being classified. On the basis of this categorization, it is intuitive to expect that approaches that incorporate explicit data dependence should improve on the performance of other approaches. However, most of the methods proposed in the literature use explicit data dependence in the selection rather than in the combination of classifiers.
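The three forms in Eqs. (1.2)–(1.4) can be contrasted in a short sketch (Python with NumPy; not from the thesis). The gating function that produces the pattern-specific weights wi(x) is a hypothetical stand-in for a detector or competence estimator, and all numbers are illustrative.

```python
import numpy as np

def data_independent(c):
    """Eq. (1.2): Q(x) = F_j(c_ij(x)); here F is a plain average over classifiers."""
    return int(np.argmax(c.mean(axis=0)))

def implicitly_dependent(c, global_weights):
    """Eq. (1.3): the weights reflect how the base classifiers perform collectively
    on training data (e.g. validation accuracies); they do not depend on x."""
    return int(np.argmax(np.average(c, axis=0, weights=global_weights)))

def explicitly_dependent(c, x, gate):
    """Eq. (1.4): the weights w_i(x) are recomputed for every input pattern x."""
    return int(np.argmax(np.average(c, axis=0, weights=gate(x))))

# Toy setting: three classifiers, two classes; the gate favours classifier 0
# whenever the first feature of the pattern is negative.
c = np.array([[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
gate = lambda x: np.array([0.8, 0.1, 0.1]) if x[0] < 0 else np.array([0.1, 0.45, 0.45])
print(data_independent(c),                                     # -> 0
      implicitly_dependent(c, [0.2, 0.4, 0.4]),                # -> 1
      explicitly_dependent(c, np.array([-1.0, 0.5]), gate))    # -> 0
```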

1.4 Objectives and Approach

The objective of this thesis is to develop an architecture for combining classifiers. The architecture makes decision fusion a more explicitly data dependent process by introducing learning into the aggregation phase. Not only that, but the architecture also gathers information about the input and the behaviour of the individual classifiers. This information (i.e., the extracted additional features) is used to tune the aggregation procedure. It is also desirable that the proposed architecture can be used without tuning to achieve optimal classification. Given data to be classified, the architecture generates a classifier ensemble. It further generates an aggregation layer on top of this ensemble to perform the final decision making phase. A global training strategy for the architecture is introduced. This strategy guides the modules during the development phase to better adapt to the learning samples that are not correctly classified at the global level. As a result, the strategy relates the efficiency of the global fusion scheme to the training efficiency on all of the input samples, not just the local modular efficiency. A limited increase in computational resources is the tradeoff for a guaranteed reduction in classification error. To test the proposed architecture, its performance is compared with several classifier combining approaches on several benchmark applications.

1.5 Thesis Overview

In Chapter 2, we present the relevant theories of fusion, specifically, sensor and decision fusion. Chapter 3 introduces the various approaches to combining classifiers. Feature-based architectures for decision fusion in classifier ensembles are proposed in Chapter 4. Decomposition, training and methodologies are discussed in detail. In Chapter 5, the feature-based decision fusion is compared to alternative combining techniques for five benchmark problems. Chapter 6 contains the conclusions, and suggestions for future work.

Chapter 2

Sensor Fusion

The term sensor fusion refers to the use of data from multiple sensors in an intelligent system to form one representation in order to improve accuracy. Sensor fusion must not be confused with the term sensor integration, which refers to the use of multiple sensors to provide information about a sub-task during the different modes of operation. The distinction between these two terms is rather vague. For example, sensor fusion combines the features of weight, shape and color to classify objects; it also integrates the readings from sonar and vision systems to localize objects in the environment. An example of sensor integration is the use of a vision system to locate an object, and then the use of a touch sensor to grip the object when it is lifted. These sensory modalities are broken down into different steps of the target task.

The applications of multi-sensor fusion can be categorized according to the level of representation assigned to the data during the fusion process, or according to the computational methods adopted for the fusion. Luo and Kay [70] have arranged the possible representations into four levels: the signal-level, pixel-level, feature-level, and symbol-level. Based on these levels, the number of features used by the sensory elements is one, two, many, or a representative descriptor. Among the most commonly used computational methods are the opinion pools [7], voting/ranking methods [100], Dempster-Shafer theory [104], and fuzzy integral approaches [28]. In Section 2.1, we will briefly review these different categories.

2.1 Level of Representation in Fusion

2.1.1 Signal-Level Fusion

An example of signal-level fusion is our ability to approximately locate and track an insect buzzing above us. This is achieved by using the information that our ears provide over time. Inter-aural time and intensity differences in the ears are processed to provide an estimate of the insect’s position. Our hearing becomes tuned to the sound detected, and the accumulation of direction measurements facilitates a more accurate estimate of the insect’s location. Approaches to signal-level fusion incorporate multi-sensor or time-series data to improve the quality of the signal. The most common techniques consist of filtering or weighted averaging.

2.1.2 Pixel-Level Fusion

When an object moves within our field of view, the point of activity on the retina maps directly to the target’s horizontal and vertical location in space. Also, the time variation of the image projected onto the retina is detected as motion. This type of pixel-level fusion demonstrates the power of a data representation that contains implicit spatial information in its structure. Pixel-level fusion may be performed on time-series images, images from stereo or trinocular cameras, or multiple sensors with shared optics. Additional types of sensor measurements may be fused into images that serve as cellular representations of the world. Time-series pixel data can be used for optical flow and target tracking, whereas stereo or trinocular fusion allows the extraction of depth information through the triangulation of matched pixels in each image. The images that are separately generated by the different sensors are difficult to fuse due to the registration problems of parallax and image warping.

2.1.3 Feature-Level Fusion

When we approach a staircase, we can locate the limits of the stair treads by noting the edges due to the boundaries of color, texture, and abrupt differences in depth. Also, we may feel the treads with our feet. The fusion of these different features provides a reliable estimate of our position on the stair edge. Feature-level fusion involves the sensory extraction of the features of an object in the environment so that the features can be matched to the corresponding features in a symbolic world model. For instance, a mobile robot may measure the distance to the walls of the room by using sonar, and detect the straight edges of walls and doors using vision. The robot may then match these features to a geometric map of the world to estimate its position or to update the map.

2.1.4 Symbol-Level Fusion

When combined, symbols, such as the words in a sequence, can evoke a specific idea in our brains. Visual or acoustic information is used to detect the words; then symbolic reasoning and memory recall take over. The use of the relationships between the symbols, and between our senses and symbols, allows high-level logical decision making, and a more compact representation of concepts. The most common type of symbol-level fusion is pattern recognition. Feature information is extracted from the sensor data to define a point in feature space. This point may be mapped to a symbolic interpretation of the world based on that symbol’s neighborhood in feature space. Such a neighborhood function may be defined by probability theory [92], Dempster-Shafer’s theory of evidence [83], crisp or fuzzy logic [60, 105], neural networks [106], or other means. Vocabulary, rules of grammar, and context can greatly assist in constraining the range of possible phonemes or words that match a particular sequence of data.

2.2 Computational Methods for Sensor Fusion

Independent of the level of representation, multi-sensor systems can be categorized according to the computational method used. Generally, these methods perform a data-reduction mapping from multiple inputs to a smaller number of outputs. The inputs may be raw sensor readings, pixel values, extracted features, signal estimators, or control signals. Outputs may be an estimated state, recognized objects or events, enhanced features, or control signals. An important aspect of each technique is the way it models uncertainty or errors in sensor information. Ultimately, it is desired that the errors of each model are corrected by the other sensor models and vice versa. If all the models indicate the same error, that is, they make the same mistake, then no combination can rectify the error. However, if the errors do not always correlate, performance can be improved with the proper combination. Sensor fusion approaches aim at combining the beliefs of the set of models used into a single, consensus belief. In the following, we review three generally distinguished approaches: the linear opinion pool, the logarithmic opinion pool, and the voting/ranking approach.


Linear Opinion Pool

The linear pool is a commonly used data fusion technique that is convenient because of its simplicity [7,30]. The linear opinion pool is evaluated as a weighted sum of the output probabilities from each model.

Plinear(A) = Σ_{i=1}^{K} αi Pi(A)    (2.1)

where Plinear(A) is the combined probability, using the linear opinion pool of the set of models, for an event A; αi is the weight given to the ith model; Pi(A) is the probability output by the ith model for the event A; and K is the number of models. The parameters αi are generally chosen such that 0 ≤ αi ≤ 1 and Σi αi = 1.

The linear opinion pool is appealing in that the output is a probability distribution, and the weights αi provide a rough measure for the contribution of the ith model. However, it is noted that the probability distribution of the combiner output, namely, Plinear (A), may be multimodal.

Log Opinion Pool

An alternative to the linear opinion pool is the log opinion pool. If the weights are constrained such that 0 ≤ αi ≤ 1 and Σi αi = 1, then the log opinion pool also outputs a probability distribution. However, as opposed to the linear opinion pool, the output distribution of the log opinion pool is typically unimodal [7, 30]. The log opinion pool consists of a weighted product of the model outputs

Plog(A) = Π_{i=1}^{K} Pi(A)^{αi}.    (2.2)


Note that with this formulation, if any model assigns a probability of zero, then the combined probability is also zero. Hence, an individual model has the capability of a “veto”, whereas in the linear opinion pool, the zero probability is averaged in with the other probabilities.
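A minimal sketch of the two pools (Python with NumPy; not from the thesis). Note that the log pool output is renormalized here so it sums to one, which Eq. (2.2) itself does not require; the weights and probabilities are illustrative.

```python
import numpy as np

def linear_opinion_pool(P, alpha):
    """Eq. (2.1): weighted sum of the models' probability vectors."""
    return np.asarray(alpha) @ np.asarray(P)

def log_opinion_pool(P, alpha):
    """Eq. (2.2): weighted product of the models' probabilities (renormalized here)."""
    fused = np.prod(np.asarray(P, dtype=float) ** np.asarray(alpha)[:, None], axis=0)
    return fused / fused.sum()

# Three models over two events; model 2 assigns zero probability to event 0.
P = [[0.6, 0.4],
     [0.7, 0.3],
     [0.0, 1.0]]
alpha = [1/3, 1/3, 1/3]
print(linear_opinion_pool(P, alpha))   # the zero is merely averaged in
print(log_opinion_pool(P, alpha))      # the zero acts as a veto on event 0
```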

Voting/Ranking Methods

Another simple method for combining the results of multiple models is to use a voting procedure. In this case, each model must output a decision instead of a score. Numerous voting techniques have been presented in the literature, the most popular of which is the majority vote. The majority vote has been applied in several applications [24, 25, 104]. Other voting techniques are the maximum, minimum, and median votes.

Ranking methods are appropriate for problems that involve numerous classes. Rankings do not use class labels or numerical scores, but they do utilize the order of the classes as estimated by the model(s). Ranking methods use class set reduction to reduce the number of class candidates without losing the true class. By reducing the number of classes and reordering the remaining classes, the true class should move to the top of the ranking. The Borda count is among the most popular rank-based methods [44].

The literature suggests other computational methods for sensor fusion, such as the Dempster-Shafer approach [41, 104] and the fuzzy integral method [14, 28]. These methods combine the beliefs of the various models into an overall consensus belief, not only their respective decisions. A detailed description of these methods will be presented in Chapter 3.

Decision fusion can be seen as a special case of sensor fusion. Decision fusion is the process of combining partial soft or hard decisions given by different experts. In this case, the term expert is used since each single module uses expert knowledge to transform the information carried by the measured data into a decision. The experts output their local decisions by generating corresponding scores. These scores are measures of their respective beliefs in the acceptability of the identity claim; the higher the score, the higher the belief in the identity claimed. The output of these experts can be at one of three different levels: the abstract level, the rank level, or the measurement level. In the following section, we shed some light on these levels in the context of classification.

2.3 Classifier Fusion

Multiple classifier systems are often practical and effective solutions for difficult pattern recognition tasks; they are also called combining of multiple classifiers [56,69,79,103,104], decision combination, multiple experts, mixture of experts [48,49,52,74], classifier ensembles [20,26], classifier fusion [8,14,28,54], consensus aggregation [6,7], dynamic classifier selection [103], composite classifier system [16], hybrid methods, cooperative agents or systems [3, 4], opinion pool [5, 7, 24, 25], and so on. The motivation for such systems may be derived from an empirical observation that specialized classifiers are superior in different cases, or it may follow from the nature of the application. In several other cases, the motivation for using such systems is to avoid making commitments to certain initial conditions. The traditional approach to classifier selection (where we compare the available classifiers with a representative sample and choose the best performer) is in contrast to classifier combining. Here, we abandon the attempt to find the best classifier, and instead, try to use all of the available ones in a smart way. The use of such an approach in order to achieve improved classification rates, compared to that achieved by a single classifier, has been widely accepted [58, 59, 80].


There are many ways to utilize more than one classifier in a single recognition problem. A divide-and-conquer approach [13] isolates the types of input for which a specific classifier performs well, and directs the input accordingly. A sequential approach would use one classifier first, and invoke others only if it fails to yield a decision with sufficient confidence. Most research in combining methods, however, focuses on applying all the available classifiers in parallel to the same input and combining their decisions within the parallel suite. At this stage, we can discriminate between the ensemble combination and the modular combination of classifiers. The term ensemble is commonly used for combining a set of redundant classifiers. The redundancy occurs because each classifier provides a solution to the same task. Needless to say, these solutions may be obtained by different means. This is in contrast to the modular approach, in which the task is decomposed into a number of sub-tasks. Each module is concerned with finding a solution for one sub-task, and each component is expected to contribute to completing the whole task. In the remainder of this study, we will be concerned with combining classifiers in an ensemble.

2.4 Levels of Classifier Output Information

Given a pattern space S, consisting of N mutually exclusive sets S = y1 ∪ · · · ∪ yN, where each yj, ∀ j ∈ ∆ = {1, 2, · · · , N}, represents a set of specified patterns called a class. For a sample x from S, the task of a classifier (denoted by C) is to assign x one index yj ∈ ∆ ∪ {N + 1} as a label, so that x is regarded as being of class yj if j ≠ N + 1, with j = N + 1 denoting that x is rejected by C. Regardless of what internal structure a classifier has, or on what theory and methodology it is based, we may simply regard a classifier as a function box that receives an input sample x and outputs a label yj, or in short, C(x) = yj.

Although yj is the only output information we want at the final stage of classification, many of the existing classification algorithms usually supply related information. For example, a Bayes classifier may also supply N values of posterior probabilities P(yj|x), j = 1, · · · , N, for each possible label. In fact, the final label yj is the result of a maximum selection from the N values, and this selection certainly discards some information that may be useful for a multi-classifier combination. Depending on whether some output information other than the single label yj is used, and on what kind of information is used, we have different types of multi-classifier combination problems. Typically, the output information that the various classification algorithms supply can be divided into the following three levels [104]:

• The abstract level: a classifier C outputs a unique label yj or subset YJ from ∆ • The rank level: C ranks all the labels in ∆ (or a subset J ⊂ ∆) in a queue with the label at the top being the first choice • The measurement level: C attributes to each label in ∆ a value for the degree that x has the label

Among these three levels, the measurement level contains the most information, and the abstract level contains the least. From the measurement attributed to each label, we can rank all the labels in ∆, according to a rank rule (e.g., ascending or descending). By choosing the label at the top rank, or more directly, by choosing the label with the maximal value at the measurement level, we can assign a unique label to x. In other words, from the measurement level to the abstract level there is an information reduction process or abstraction process.
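The information-reduction chain from the measurement level to the rank level to the abstract level can be written out directly (Python with NumPy; not from the thesis). The rejection threshold, which plays the role of the extra label N + 1, is an illustrative addition.

```python
import numpy as np

def to_rank_level(measurements):
    """Order the class labels by decreasing measurement value (first = top choice)."""
    return list(np.argsort(measurements)[::-1])

def to_abstract_level(measurements, threshold=None, reject_label=None):
    """Keep only the single best label; optionally reject weak decisions."""
    best = int(np.argmax(measurements))
    if threshold is not None and measurements[best] < threshold:
        return reject_label                       # stands in for the label N + 1
    return best

m = np.array([0.05, 0.55, 0.30, 0.10])            # measurement-level output of one classifier
print(to_rank_level(m))                           # [1, 2, 3, 0]
print(to_abstract_level(m))                       # 1
print(to_abstract_level(m, threshold=0.7, reject_label=4))   # 4, i.e. rejected
```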

2.5 Problems for Multiple Classifier Combination

Depending on which of the three output information levels a combination is based on, the problems of combining multiple classifiers can be summarized in the following three types [104], which cover various applications:

• Type 1: The combination is based on the output information at the abstract level. Given K individual classifiers Ci, i = 1, · · · , K, each of which assigns a label yij to a given input x, producing an event Ci(x) = yij, the problem is to use these events to build an integrated classifier Q which gives x one definitive label yj, i.e. Q(x) = yj, j ∈ ∆ ∪ {N + 1}.

• Type 2: The combination is based on the output information at the rank level. For an input x, each Ci produces a subset Li ⊆ ∆ with all the labels in Li ranked in a queue; the problem is to use these events Ci(x) = Li, i = 1, · · · , K, to build a Q with Q(x) = yj, j ∈ ∆ ∪ {N + 1}.

• Type 3: The combination is based on the output information at the measurement level. For an input x, each Ci produces a real vector Yi = [yi1, · · · , yiN] (where yij denotes the degree to which Ci considers that x has label yj); the problem is to use these events Ci(x) = Yi, i = 1, · · · , K, to build a Q with Q(x) = yj, j ∈ ∆ ∪ {N + 1}.

Regarding Type 1 problems, the individual classifiers may be very different from each other in their theories or methodologies. In fact, any kind of classifier can at least supply the output information at the abstract level, so it can be said that a Type 1 problem covers all kinds of pattern recognition areas. In contrast, a Type 3 problem requires that all the individual classifiers supply the output information at the measurement level. To combine different measurement vectors, we are required to normalize them to the same scale. A Type 2 problem has a generality between Type 1 and Type 3; it requires that all the individual classifiers supply the output information at the rank level.
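For Type 3 problems, the normalization requirement mentioned above can be met, for example, by rescaling each classifier's measurement vector to [0, 1] and then to unit sum. This is only one of several possible choices, sketched here in Python with NumPy (not from the thesis):

```python
import numpy as np

def to_common_scale(measurement_vector):
    """Min-max rescale one classifier's measurement vector and normalize it to sum
    to one, so that vectors produced on very different scales become comparable."""
    v = np.asarray(measurement_vector, dtype=float)
    v = v - v.min()
    if v.max() > 0:
        v = v / v.max()
    total = v.sum()
    return v / total if total > 0 else np.full_like(v, 1.0 / len(v))

# Two classifiers whose raw measurement scales differ by orders of magnitude.
print(to_common_scale([12.0, 48.0, 30.0]))
print(to_common_scale([0.001, 0.009, 0.004]))
```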

2.6 Summary

In this chapter, we have presented some of the basic ideas of sensor fusion, along with a brief account of the various approaches to fusion. Decision fusion was also introduced as a special case of sensor fusion. In the following chapter, we will focus on decision fusion approaches and introduce some of the most commonly used classifier combining schemes. We will narrow the focus of the work in this thesis to methods that combine Type 3 classifier outputs; these classifiers will be arranged in the more common parallel suite, composing an ensemble of classifiers.

Chapter 3

Classifier Ensembles Combining Methods

The main motivation for combining classifiers in redundant ensembles is to improve their generalization ability. The inherent redundancy within the ensemble can also guard against the failure of individual classifiers. The reason that we may expect a classifier to fail on certain inputs is that it has been trained on a limited set of data, from which it is required to estimate the target function. Unless the function is simple or the training set is a perfect representation of the data, perfect generalization cannot be achieved, and it is inevitable that the estimate and the desired target will differ.

Combining a set of imperfect estimators can be viewed as a way to manage the recognized limitations of the individual estimators. Each component classifier is known to make errors; however, the fact that the patterns that are misclassified by the different classifiers are not necessarily the same [56] suggests that the use of multiple classifiers can enhance the decision about the patterns under classification. Combining these classifiers in such a way as to minimize the overall effect of these errors can prove useful.

Tumer and Ghosh [97] have shown that the ensemble error (Eensemble) decreases with an increase in the number of distinct members of the ensemble (K), and improves with a reduction in the correlation (ρ) among the members. This error is related to that achieved by the Bayes rule (Ebayes) by the following equation:

Eensemble = ((1 + ρ(K − 1))/K) Eerror + Ebayes.    (3.1)
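Reading Eerror as the added error contributed by a single ensemble member, the following small sketch (Python; not from the thesis) evaluates Eq. (3.1) for a few values of ρ and K:

```python
def ensemble_error(e_single, e_bayes, K, rho):
    """Evaluate Eq. (3.1): Eensemble = ((1 + rho*(K - 1)) / K) * Eerror + Ebayes."""
    return (1 + rho * (K - 1)) / K * e_single + e_bayes

# Assume an added error of 0.10 per classifier and a Bayes error of 0.05.
for rho in (1.0, 0.5, 0.0):
    print(rho, [round(ensemble_error(0.10, 0.05, K, rho), 3) for K in (1, 5, 25)])
# rho = 1.0 gives no gain over a single classifier; rho = 0.0 shrinks the added
# error roughly as 1/K, which is the behaviour discussed below.
```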

Hence, if ρ = 1, the error of the ensemble is equal to that of a single classifier; it is therefore useless to combine identical classifiers, and the individual classifiers must be substantially different. If ρ = 0, the error decreases as the number of members of the ensemble increases. However, introducing needless classifiers does not necessarily yield improved decisions. Similarly, the agreement of one poor classifier with another does not necessarily produce better decisions. Hansen and Salamon [36] have suggested that ensembles are most useful when each classifier makes independent errors and has an error rate of less than 50%; under these conditions, the ensemble error decreases monotonically with the number of classifiers.

We can examine the effect of combining redundant ensembles in terms of the statistical concepts of bias and variance. Much has been made recently of the fact that the error of an estimator can be expressed in terms of the bias squared plus the variance [75, 77]. Based on a training set (x1, y1), · · · , (xm, ym), a classifier can be trained to construct a function f(x) for the purpose of approximating y for previously unseen observations of x. Following [29], and to explicitly represent the dependence of f on the training data, we use the notation f(x; D) instead of simply f(x). Then, the mean squared error of f as a predictor of y may be written

ED[(f(x; D) − E[y|x])²]    (3.2)

where ED is the expectation operator with respect to the training set D, and E [y | x] is the target function. Therefore, the bias/variance decomposition gives us

ED[(f(x; D) − E[y|x])²] = (ED[f(x; D)] − E[y|x])² + ED[(f(x; D) − ED[f(x; D)])²]    (3.3)

The first term on the right-hand side is the bias (squared) and the second is the variance of a classifier. The bias of a predictor can be characterized as a measure of its ability to generalize correctly to a test set once it is trained. The variance of an estimator can be similarly characterized as a measure of the extent to which the estimator’s output is sensitive to the data on which it was trained, that is, the extent to which the same results are obtained if we use a different set of training data. There is a tradeoff between the bias and variance of training a classifier; the best generalization requires a compromise between the conflicting requirements of a small variance and a small bias. Such a tradeoff exists because attempting to decrease the bias by considering more of the data will likely result in a higher variance, while efforts to decrease the variance by paying less attention to the data usually result in an increased bias. What is required of an estimator is that it generalizes well after training on noisy or unrepresentative data, and avoids over-fitting.

The bias and variance can be approximated by an average over a fixed number of possible training sets. Krogh and Vedelsby [61] have provided an account of the bias and variance in an ensemble, expressing the bias-variance relation in terms of an average over the ensemble members, instead of an average over the possible training sets. The implication is that the members of the ensemble can be created by a variety of methods, as well as by varying the training set. The improvement in performance arising from ensemble combinations is usually the result of a reduction in variance, rather than a reduction in bias. This occurs because the usual effect of ensemble averaging is to reduce the variance of a set of classifiers, while leaving the bias unaltered. Therefore, an effective approach is to create and/or select a set of classifiers with a high variance, but low bias. The variance component can be removed by combining this set of classifiers. Therefore, it is reasonable to try to reduce the bias, since the resultant increased variance is removed by combining. Combining can thus provide a way of circumventing, or at least reducing, the bias-variance tradeoff.

An ensemble which exhibits a high variance should also show a low correlation of errors. Then, one of the main determinants of the effectiveness of an ensemble is the extent to which the members are error-independent, that is, that they make different errors [56]. For instance, Wolpert [102] points out that the more each generalizer has to say, the better the resultant stacked generalization. Jacobs [48] also asserts that the major difficulty with combining expert opinions is that these opinions tend to be correlated or dependent. Ideally, the ensemble is a set of classifiers that do not show any coincident errors. In other words, each of the classifiers generalizes well (a low bias component of error), and when they do make errors on the test set, these errors are not shared with any other classifier (a high variance component of error). With these considerations in mind, there are two key issues in the development of multiple classifier systems: how to create the individual classifiers, and how to perform the combination of these classifiers. In the following, we will introduce some of the commonly used approaches for both.
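Before turning to how ensemble members are created and combined, the decomposition in Eq. (3.3) can be illustrated with a rough numerical experiment (Python with NumPy; not from the thesis). The quadratic target, the noise level, and the two polynomial estimators are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x0, target_at_x0 = 0.5, 0.5 ** 2                   # evaluate bias and variance at one test point

def sample_training_set(m=30):
    x = rng.uniform(-1, 1, m)
    return x, x ** 2 + rng.normal(0, 0.1, m)       # y = x^2 + noise, so E[y|x] = x^2

def estimate_bias_variance(degree, trials=2000):
    preds = []
    for _ in range(trials):                        # average over many training sets D
        x, y = sample_training_set()
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - target_at_x0) ** 2   # (E_D[f(x; D)] - E[y|x])^2
    variance = preds.var()                         # E_D[(f(x; D) - E_D[f(x; D)])^2]
    return bias_sq, variance

print(estimate_bias_variance(degree=1))            # rigid estimator: larger bias, small variance
print(estimate_bias_variance(degree=6))            # flexible estimator: small bias, larger variance
```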

3.1 Methods for Creating Ensemble Members

Since the main reason for combining classifiers in an ensemble is to improve their performance, there is clearly no advantage to be gained from an ensemble that is composed of a set of identical classifiers, that is, classifiers that generalize in the same way; what matters is the similarity of their patterns of generalization. In principle, a set of classifiers can vary in terms of their weights, the time they take to converge, and even their architecture, yet constitute the same solution, since they result in the same pattern of errors when they are tested. The aim, then, is to find classifiers which generalize differently. There are a number of training parameters which can be manipulated with this goal in mind: the initial conditions, the training data, the topology of the classifiers, and the training algorithm. In the following, we review the commonly used methods that have been employed for the creation of ensemble members [84], most of which involve varying the data.

• Varying the set of initial random weights: A set of classifiers can be created by varying the initial random weights from which each classifier is trained, while maintaining the same training data.

• Varying the topology: A set of classifiers can be created by varying the topology or architecture, while maintaining constant training data. Greater diversity occurs if members of the ensemble are different modular systems; the errors made by two modular systems with different internal modular structures may be uncorrelated.

• Varying the algorithm employed: The algorithm used to train the classifiers can be varied, while holding the data constant.


• Varying the data: The methods which seem to be most frequently used for the creation of ensembles are those which involve altering the training data. There are a number of different ways in which this can be done, including sampling data, disjoint training sets, boosting and adaptive re-sampling, different data sources, and preprocessing, or a combination of these techniques. In the following section, we will outline these approaches.

3.1.1 Varying Training Data

Sampling Data

A common approach for creating a set of classifiers for an ensemble is to use some form of sampling technique, such that each classifier in the ensemble is trained on a different sub-sample of the training data. Re-sampling methods which have been used for this purpose include cross-validation [61] and bootstrapping [9], although in statistics, these methods are better known as techniques for estimating the error of an estimator from limited sets of data. In bagging [9], a training set containing N cases is perturbed by sampling with replacement (bootstrap) from the training set. The perturbed data set may contain repeats. This procedure is repeated several times to create a number of different, although overlapping, data sets. Such statistical re-sampling techniques are particularly useful where there is a shortage of data.
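A bagging-style resampler can be written in a few lines (pure Python, standard library only; not from the thesis). Each ensemble member would then be trained on its own bootstrap replicate.

```python
import random

def bootstrap_replicates(training_set, num_classifiers, seed=42):
    """Create one bootstrap replicate (sampling with replacement) per ensemble member.
    Each replicate has the same size as the original set, may contain repeats,
    and the replicates overlap with one another, as in bagging."""
    rng = random.Random(seed)
    n = len(training_set)
    return [[training_set[rng.randrange(n)] for _ in range(n)]
            for _ in range(num_classifiers)]

data = list(range(10))                      # stand-in for 10 labelled training patterns
for replicate in bootstrap_replicates(data, num_classifiers=3):
    print(sorted(replicate))                # note the repeated and the missing patterns
```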

Disjoint Training Sets

A similar method to data sampling is the use of disjoint, or mutually exclusive, training sets, by sampling the data without replacement. There is then no overlap of the data used to train different classifiers. The problem is that, as noted by [96], the size of the training set may be reduced, and this may result in a deteriorated performance.


Boosting and Adaptive Re-sampling

Schapire [82] has demonstrated that a series of weak learners can be converted into strong learners by training the members of an ensemble on patterns that have been filtered by previously trained members of the ensemble. A number of empirical studies [20, 29] have supported the efficacy of the boosting algorithm, although one problem is that it requires large amounts of data. Freund and Schapire [27] have proposed an algorithm, AdaBoost, that largely avoids this problem, although it was developed in the context of boosting. Essentially, in this algorithm, the training sets are adaptively re-sampled, so that the weights in the re-sampling are increased for those cases which are most often misclassified. Drucker [19] has compared the effectiveness of boosting to bagging, and reported a generally superior performance from boosting algorithms. Similarly, Breiman [9, 11] has explored some of the differences between the Freund and Schapire algorithm and bagging, concluding, on the basis of empirical and analytical evidence, that Freund and Schapire’s algorithm is more successful in reducing variance than bagging.
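The adaptive re-weighting at the heart of this family of algorithms can be sketched as follows (Python with NumPy; not from the thesis). The update shown is the usual discrete AdaBoost formulation for labels in {-1, +1}, and `weak_learner` is a hypothetical stand-in for any base training routine that accepts per-sample weights.

```python
import numpy as np

def adaboost_sketch(train, labels, weak_learner, rounds):
    """Adaptive re-sampling/re-weighting: cases that are misclassified get larger
    weights in the next round. `weak_learner(train, labels, w)` is assumed to
    return a predict(x) callable fitted under the sample weights w."""
    labels = np.asarray(labels)
    w = np.full(len(labels), 1.0 / len(labels))    # start from uniform weights
    models, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(train, labels, w)
        pred = np.array([h(x) for x in train])
        err = float(np.sum(w[pred != labels]))     # weighted training error
        if err >= 0.5:                             # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w = w * np.exp(-alpha * labels * pred)     # up-weight the misclassified cases
        w = w / w.sum()
        models.append(h)
        alphas.append(alpha)
    return models, alphas                          # final decision: sign(sum_t alpha_t * h_t(x))
```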

Different Data Sources
Using data from different input sources is another method for varying the data on which classifiers are trained. This can be achieved when different sensors are used. It is worth noting that it is especially useful when these sensors collect different kinds of information.

Preprocessing
The data on which classifiers are trained can also be varied by using different preprocessing methods. For example, different signal processing methods may be applied to the data, or different feature sets extracted. Alternatively, the input data for a set of classifiers can be distorted in different ways, for example, by using different pruning methods [97], by injecting noise [78], or by using non-linear transformations [77, 96].

3.2 Methods for Combining Classifiers in Ensembles

Once a set of classifiers has been created, an effective way of combining their outputs must be found. A variety of schemes have been proposed for combining multiple classifiers [48, 104]. The majority vote [5, 50, 69, 101, 104] is by far the most popular approach. Other voting schemes include the minimum, maximum, median [65], average [73, 90], and product [91] schemes. The weighted average approach [37, 42, 48, 71] tries to evaluate the optimal weights for the various classifiers used. The Behaviour-Knowledge Space (BKS) approach selects the best classifier in some region of the input space, and bases its decision on its output [103]. Other approaches to combining classifiers include rank-based methods such as the Borda count [44], the Bayes approach [69, 104], the Dempster-Shafer theory [18, 40, 79, 104], the fuzzy integral [14, 28, 34, 35, 72, 89], fuzzy connectives [63], fuzzy templates [67], probabilistic schemes [53, 55–57], and combination by a neural network [12]. We can view the combiner as a scheme to assign weights to the classifiers. The weights can be data independent or data dependent [103]. Various architectures for classifier combining have been introduced. Jordan and Jacobs [51] have presented a hierarchical mixture of experts which is based on the divide-and-conquer approach. We can categorize the various methods for classifier combining into the following groups; it should be noted that some of these categories may require a certain level of output from the classifiers.

• Averaging and weighted averaging: Linear opinion pools are one of the most popular aggregation methods, and refer to the linear combination of the outputs of the ensemble members' distributions with the constraint that the resulting combination is itself a distribution [48]. A single output can be created from a set of classifier outputs via simple averaging [95], or by means of a weighted average that takes into account the relative accuracies of the classifiers to be combined [37, 48, 95, 102].

• Non-linear combining methods: Non-linear combining methods that have been proposed include the Dempster-Shafer belief-based methods [79, 104], the use of rank-based information [44], voting [36], and order statistics [95, 98].

• Supra Bayesian: Jacobs [48] has contrasted the supra Bayesian approach to linear combinations. The underlying philosophy of the supra Bayesian approach is that the opinions of the experts are themselves data. Therefore, the probability distribution of the experts can be combined with its own prior distribution.

• Stacked generalization: Under stacked generalization [102] a nonlinear estimator learns how to combine the classifiers with weights that vary over the feature space. The outputs from a set of Level 0 generalizers are used as the input to a Level 1 generalizer, which is trained to produce the appropriate output. The term stacked generalization is used by Wolpert [102] to refer both to this method of stacking classifiers, and the method of creating a set of ensemble members by training on different partitions of the data. It is also possible to view other methods of combining, such as averaging, as instances of stacking with a simple Level 1 generalizer. The same idea has been adapted to regression tasks, where it is called stacked regression.

Next, we will present some of the approaches for combining classifiers.

3.2.1 Averaging

Average Vote
This approach averages the confidence outputs of the individual classifiers for each class across the ensemble. The class yielding the highest average value is chosen as the correct class,

Q(x) = \arg\max_{j=1,\dots,N} \left( \frac{1}{K} \sum_{i=1}^{K} y_{ij}(x) \right)   (3.4)

where N is the number of classes, K is the number of classifiers, and y_{ij}(x) represents the output confidence of the ith classifier for the jth class for the input x.
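A minimal sketch of the average vote of Equation (3.4), assuming the K classifiers' confidences for a given input are already collected in a NumPy array (the function name is hypothetical):

```python
import numpy as np

def average_vote(confidences):
    """confidences has shape (K, N): y_ij(x) for K classifiers and N classes.
    Returns the index j of the class with the highest average confidence (Eq. 3.4)."""
    return int(np.argmax(confidences.mean(axis=0)))

# hypothetical example: three classifiers, four classes
y = np.array([[0.1, 0.7, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.3, 0.3, 0.1]])
print(average_vote(y))   # class 1 has the highest average confidence
```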

Weighted Average
The weighted average approach is similar to the average vote; however, the outputs of the various classifiers are multiplied by a weighting factor, or

Q(x) = \arg\max_{j=1,\dots,N} \left( \frac{1}{K} \sum_{i=1}^{K} w_i y_{ij}(x) \right)   (3.5)

The weights w_i, i = 1, \dots, K, can be derived by minimizing the error of the different classifiers on the training set. For notational convenience, we consider classifiers with a single output y_i, although the generalization to several outputs is straightforward. We denote the true output of a classifier by d(x). Thus, the actual output of each classifier can be written as the desired output plus an error, or

y_i(x) = d(x) + \epsilon_i(x)   (3.6)

Therefore the average squared error of the ith classifier can be expressed as

e_i = E\left[\{y_i(x) - d(x)\}^2\right] = E\left[\epsilon_i^2\right].   (3.7)

If w_i is the weight assigned to the ith classifier, a weighted combination of the outputs that satisfies \sum_{i=1}^{K} w_i = 1 is represented by

y(x) = \sum_{i=1}^{K} w_i y_i(x) = d(x) + \sum_{i=1}^{K} w_i \epsilon_i(x).   (3.8)

The error correlation matrix C is composed of elements represented by

c_{ij} = E\left[\epsilon_i(x) \epsilon_j(x)\right]   (3.9)

where the expectation is estimated by averaging over the M samples of the training set. Thus the total error can be determined to be

e = E\left[\{y(x) - d(x)\}^2\right] = E\left[\left(\sum_{i=1}^{K} w_i \epsilon_i\right)\left(\sum_{j=1}^{K} w_j \epsilon_j\right)\right]   (3.10)
  = \sum_{i=1}^{K} \sum_{j=1}^{K} w_i w_j c_{ij}   (3.11)

Optimal values of the weights can be determined by minimizing the error e. Therefore, while requiring that \sum_{i=1}^{K} w_i = 1, the solution for w_i is given in [37] as

w_i = \frac{\sum_{j=1}^{K} (C^{-1})_{ij}}{\sum_{k=1}^{K} \sum_{j=1}^{K} (C^{-1})_{kj}}.   (3.12)

Verikas et al. [100] argue that one problem with the imposed constraint is that it does not prevent the weights from adopting large negative or positive values. The inverse of C can be unstable: redundancy in the members of a committee leads to linear dependencies in the rows and columns of C. Therefore, they suggest constraining the weights further by requiring that w_i \geq 0, \forall i = 1, \dots, K. They show that the weights are determined by minimizing the error function given by

e = \sum_i w_i e_i + \sum_{i,j} w_i c_{ij} w_j - \sum_i w_i c_{ii} + \lambda \sum_{i=1}^{K} w_i^2   (3.13)

where the last term is used for regularization, and \lambda is a regularization coefficient. While the weighted average guards against the vulnerability of simple averaging to unreliable confidences, collinearity amongst the base classifiers can sometimes undermine the robustness of this approach.
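The weights of Equation (3.12) can be estimated from the classifiers' residual errors on a validation set. A minimal sketch, assuming the residuals are available as a NumPy array and using a pseudo-inverse to guard against the instability noted above (names are illustrative):

```python
import numpy as np

def optimal_combination_weights(errors):
    """Weights of Equation (3.12). errors is an (M, K) array whose columns are
    the residuals eps_i(x) = y_i(x) - d(x) of the K classifiers on M samples."""
    C = (errors.T @ errors) / errors.shape[0]   # error correlation matrix c_ij
    C_inv = np.linalg.pinv(C)                   # pseudo-inverse guards against a singular C
    return C_inv.sum(axis=1) / C_inv.sum()      # rows of C^-1 normalized to sum to one

# hypothetical usage: 200 validation samples, 3 classifiers of increasing error
rng = np.random.default_rng(0)
errors = rng.normal(scale=[0.1, 0.2, 0.3], size=(200, 3))
print(optimal_combination_weights(errors))      # more accurate classifiers get larger weights
```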

3.2.2 Non-Linear Combining Methods

Majority Vote
The correct class is the one most often chosen by the different classifiers. If all the classifiers indicate different classes, or in the case of a tie, the class with the highest overall output is selected as the correct class.


Maximum Vote
The class with the highest overall output is selected as the correct class,

Q(x) = \arg\max_{i=1,\dots,K} y_i(x)   (3.14)

where K is the number of classifiers, and y_i(x) represents the output of the ith classifier for the input vector x.

Nash Vote
Each voter assigns a number between zero and one to each candidate, and the products of the voters' values are then compared across candidates. The candidate with the highest product is the winner,

Q(x) = \arg\max_{j=1,\dots,N} \prod_{i=1}^{K} y_{ij}(x).   (3.15)

Borda Count
For any class j, the Borda count is the sum of the number of classes ranked below j by each classifier. If B_i(j) is the number of classes ranked below class j by the ith classifier, then the Borda count for class j is

B(j) = \sum_{i=1}^{K} B_i(j).   (3.16)

The output is the class with the largest Borda count. Although simple aggregation rules such as those presented compete with more complex aggregation schemes involving second-level training, they do not guard against unreliable confidences. These confidences might be generated when using under-trained or over-trained sub-optimal classifiers.
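A sketch of the majority vote (with the tie-break described above) and of the Borda count of Equation (3.16); the inputs assumed here, hard decisions, confidences, and full class rankings, are illustrative choices rather than part of the original formulation:

```python
import numpy as np

def majority_vote(decisions, confidences):
    """decisions: length-K array of predicted class labels; confidences: (K, N).
    Ties (including the all-different case) fall back to the highest single output."""
    counts = np.bincount(decisions, minlength=confidences.shape[1])
    winners = np.flatnonzero(counts == counts.max())
    if len(winners) == 1:
        return int(winners[0])
    sub = confidences[:, winners]                      # confidences of the tied classes only
    _, j = np.unravel_index(np.argmax(sub), sub.shape)
    return int(winners[j])

def borda_count(rankings, n_classes):
    """rankings: (K, N) array where each row lists classes from best to worst.
    B_i(j) is the number of classes ranked below j; the largest total wins (Eq. 3.16)."""
    scores = np.zeros(n_classes)
    for row in rankings:
        for rank, cls in enumerate(row):
            scores[cls] += n_classes - 1 - rank        # classes ranked below cls
    return int(np.argmax(scores))
```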

3.2.3 Bayesian Combination

The confusion matrix c^i is estimated on a training set for the ith classifier. Element c^i_{jk} denotes the number of data points that are classified as class k, whereas they actually belong to class j. Therefore, the conditional probability that a sample x actually belongs to class j, given that classifier i assigns it to class k, can be estimated as

P(x \in q_j \mid \lambda_i(x) = k) = \frac{c^i_{jk}}{\sum_{l=1}^{N} c^i_{lk}}.   (3.17)

Therefore, if we assume that the different classifiers are independent, a belief value that the input x belongs to class j can be approximated by

Bel(j) = \frac{\prod_{i=1}^{K} P(x \in q_j \mid \lambda_i(x) = k_i)}{\sum_{j=1}^{N} \prod_{i=1}^{K} P(x \in q_j \mid \lambda_i(x) = k_i)}   (3.18)

where k_i is the class assigned to x by the ith classifier. The final decision is then made based on the maximum belief value associated with each output class.
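A sketch of Equations (3.17)–(3.18), assuming each classifier's training confusion matrix is available as an N×N array of counts (function and variable names are assumptions made for illustration):

```python
import numpy as np

def bayesian_belief(confusions, assigned):
    """confusions: list of K confusion matrices c^i (N x N), where c[j, k] is the count
    of class-j samples that classifier i labelled k. assigned: the K labels k_i produced
    for the current input. Returns the normalized beliefs Bel(j) of Eq. (3.18)."""
    N = confusions[0].shape[0]
    belief = np.ones(N)
    for conf, k_i in zip(confusions, assigned):
        col = conf[:, k_i].astype(float)        # counts for the observed label k_i
        belief *= col / col.sum()               # P(x in q_j | lambda_i(x) = k_i), Eq. (3.17)
    return belief / belief.sum()
```

The final class would then be the index of the maximum entry of the returned vector.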

3.2.4 Dempster-Shafer Approach

The Dempster-Shafer approach has been used to combine the outputs of multiple classifiers for handwriting recognition [104] and is also analyzed and compared to Bayesian methods and fuzzy set theory in [41]. The following constitutes a brief description. Consider a set of outcomes of an experiment to be denoted by Θ. For example, in a coin toss, if H represents heads and T represents tails, Θ = (H, T). The set Θ has 2^{n(Θ)} subsets, including the null set and Θ itself, where n(Θ) is the number of elements in Θ. These subsets are known as propositions, and the set of propositions is denoted as P. A proposition consisting of only one element is called a singleton.


For example, if the experiment involves rolling a die, a singleton is the element 2, whereas the proposition that the number is even is (2,4,6). A basic probability assignment (BPA) is assigned to each proposition or subset of Θ, as opposed to each individual element of Θ as in conventional probability theory. If A ∈ P is a subset of Θ, then BPA(A) represents the impact of the evidence (the output of the classifier) on A [104]. From the BPA, a numeric value in the range [0,1] that indicates the belief in proposition A, denoted by bel(A), is computed. The belief in A, bel(A), indicates the degree to which the evidence or classifier output supports A and is given by

bel(A) = \sum_{B \subseteq A} BPA(B).   (3.19)

Given Θ and P, the evidence provided by the classifier induces a set of BPAs from which the beliefs are calculated. With multiple classifiers, each piece of evidence induces a different set of BPAs, which must be combined or fused. If A ∈ P is a subset of Θ which is not the null set, BPA_1(A) is the BPA for one classifier, and BPA_2(A) is the BPA for the other classifier, then the combining rule is given by [41, 104]

BPA(A) = \frac{\sum_{C \cap D = A} BPA_1(C)\, BPA_2(D)}{1 - k}   (3.20)

where BPA(A) is the overall BPA after fusion, C ∈ P, D ∈ P, and k is given by

k = \sum_{C \cap D = \emptyset} BPA_1(C)\, BPA_2(D).   (3.21)

Note that the classifier outputs are assumed to be independent. The BPAs for all A ∈ P are found, and the beliefs bel(A) are computed before proceeding to invoke the decision rule based on the beliefs. Note that if k = 1, the two evidences are in complete conflict, and BPA(A) does not exist. Also, such a formulation can easily be extended to include more than two classifiers [41, 104]. The Bayesian methods and the Dempster-Shafer approach have generally been reported to produce good results, but they involve more complex calculations compared to other approaches.
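A sketch of Dempster's combining rule, Equations (3.20)–(3.21), for two classifiers whose BPAs are represented as dictionaries mapping sets of class labels to masses; this representation, and the example masses, are assumptions made for illustration:

```python
def combine_bpa(bpa1, bpa2):
    """Dempster's rule (Eqs. 3.20-3.21) for two BPAs given as dicts mapping
    frozensets of class labels to masses. Raises if the evidences fully conflict."""
    combined, k = {}, 0.0
    for C, m1 in bpa1.items():
        for D, m2 in bpa2.items():
            inter = C & D
            if inter:
                combined[inter] = combined.get(inter, 0.0) + m1 * m2
            else:
                k += m1 * m2                       # mass assigned to conflicting evidence
    if abs(1.0 - k) < 1e-12:
        raise ValueError("complete conflict: k = 1, BPA(A) does not exist")
    return {A: m / (1.0 - k) for A, m in combined.items()}

def belief(bpa, A):
    """bel(A) = sum of the masses of all subsets B of A (Eq. 3.19)."""
    return sum(m for B, m in bpa.items() if B <= A)

# hypothetical two-classifier example over the classes {0, 1}
bpa1 = {frozenset({0}): 0.7, frozenset({0, 1}): 0.3}
bpa2 = {frozenset({0}): 0.6, frozenset({1}): 0.2, frozenset({0, 1}): 0.2}
fused = combine_bpa(bpa1, bpa2)
print(belief(fused, frozenset({0})))
```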

3.2.5 Combination via Choquet Integral

Background on fuzzy integral and fuzzy measure

Definition 1. A set function g : 2^Z → [0, 1] is a fuzzy measure if

• g(∅) = 0 and g(Z) = 1,
• if A, B ∈ 2^Z and A ⊂ B, then g(A) ≤ g(B),
• if A_n ∈ 2^Z for 1 ≤ n < ∞ and the sequence {A_n} is monotone in the sense of inclusion, then lim_{n→∞} g(A_n) = g(lim_{n→∞} A_n).

In general, the fuzzy measure of a union of two disjoint subsets cannot be directly computed from the fuzzy measures of the subsets. Sugeno [88] has proposed the decomposable, so-called λ-fuzzy measure, satisfying the following additional property:

g(A ∪ B) = g(A) + g(B) + λ g(A) g(B)   (3.22)

for all A, B ⊂ Z with A ∩ B = ∅, and for some λ > −1. Let Z = {z_1, z_2, ..., z_K} be a finite set (a set of committee members in our case), and let g^i = g({z_i}). The values g^i are called the densities of the measure. The value of λ is found from the equation g(Z) = 1, which is equivalent to solving the following equation:

λ + 1 = \prod_{i=1}^{K} (1 + λ g^i).   (3.23)

When g is the λ-fuzzy measure, the values of g(A_i) can be computed recursively as follows:

g(A_1) = g({z_1}) = g^1   (3.24)

and

g(A_i) = g^i + g(A_{i-1}) + λ g^i g(A_{i-1}),  for 1 < i ≤ K.   (3.25)

Definition 2. Let g be a fuzzy measure on Z. The discrete Choquet integral of a function h : Z → R^+ with respect to g is defined as

C_g\{h(z_1), \dots, h(z_K)\} = \sum_{i=1}^{K} \{h(z_i) − h(z_{i−1})\}\, g(A_i)   (3.26)

where the indices i are permuted so that 0 ≤ h(z_1) ≤ ... ≤ h(z_K) ≤ 1, A_i = {z_i, ..., z_K}, and h(z_0) = 0.

Combination Via Choquet Integral
Sugeno's λ-fuzzy measure is adopted, and the fuzzy densities g^i, that is, the degrees of importance of each classifier, are assigned based on the performance of the classifiers on validation data. The densities can be computed as follows:

g^i = \frac{p_i}{\sum_{j=1}^{K} p_j}\, d_S   (3.27)

where p_i is the performance of the ith classifier, and d_S is the desired sum of the fuzzy densities. The committee members are assumed to have N outputs representing N classes, and a data point x needs to be assigned to one of the classes. The class label Q for the data point x is then determined as

Q(x) = \arg\max_{j=1,\dots,N} C_g(j)   (3.28)

where C_g(j) is the Choquet integral for class j. The values of the function h(z) that appear in the Choquet integral are given by the output values of the members of the committee (the evidence provided by the members). Several studies have shown dramatic improvements using fuzzy integrals [28], but on other data sets the approach has been shown to perform poorly. It is safe to say that it is not a universal solution to combining classifiers.
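A sketch of the Choquet-integral combination for one class, assuming the densities g^i of Equation (3.27) have already been computed, that each density is below one, and that the non-trivial root of Equation (3.23) is found by simple bisection (all names are illustrative):

```python
import numpy as np

def lambda_measure(densities, tol=1e-10):
    """Solve lambda + 1 = prod(1 + lambda g_i) (Eq. 3.23) for the root other than 0."""
    g = np.asarray(densities, dtype=float)
    s = g.sum()
    if abs(s - 1.0) < 1e-12:
        return 0.0                                  # additive case, lambda = 0
    f = lambda lam: np.prod(1.0 + lam * g) - (lam + 1.0)
    lo, hi = (-1.0 + 1e-9, -1e-9) if s > 1 else (1e-9, 1.0)
    if s < 1:
        while f(hi) < 0:                            # expand until the bracket contains the root
            hi *= 2.0
    while hi - lo > tol:                            # plain bisection on the bracketed root
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def choquet(h, densities):
    """Discrete Choquet integral of the committee outputs h for one class (Eq. 3.26)."""
    h = np.asarray(h, dtype=float)
    g = np.asarray(densities, dtype=float)
    lam = lambda_measure(g)
    order = np.argsort(h)                           # so that h(z_1) <= ... <= h(z_K)
    h_sorted, g_sorted = h[order], g[order]
    K = len(h_sorted)
    # g(A_i) for A_i = {z_i, ..., z_K}, built backwards with the lambda-rule of Eq. (3.22)
    g_A = np.empty(K)
    g_A[-1] = g_sorted[-1]
    for i in range(K - 2, -1, -1):
        g_A[i] = g_sorted[i] + g_A[i + 1] + lam * g_sorted[i] * g_A[i + 1]
    total, prev = 0.0, 0.0
    for i in range(K):
        total += (h_sorted[i] - prev) * g_A[i]
        prev = h_sorted[i]
    return total

# hypothetical use: outputs of three committee members for one class, with their densities
print(choquet([0.6, 0.8, 0.3], [0.3, 0.4, 0.2]))
```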

3.2.6 Behaviour Knowledge Space and Classifier Selection

A Behaviour Knowledge Space (BKS) [46] is a K-dimensional space, where each dimension corresponds to the decision of one classifier. Each classifier has N + 1 possible decision values chosen from the set 1, ..., N + 1. The intersection of the decisions of the individual classifiers occupies one unit of the BKS, and each unit accumulates the number of incoming samples of each class. The unit which is the intersection of the classifiers' decisions for the current input is called the focal point. For an unknown test sample, the decisions of the individual classifiers index a unit of the BKS, and the unknown sample is assigned to the class with the most training samples in that BKS unit. Generally, the BKS will perform well on the training data, while performing poorly on the testing data.


Similar to the BKS, the Decision Templates (DT) approach [66] generates a template for a given combination of classifier decision outputs. A similarity measure is used to compare the decision pattern of a given input to the templates previously generated. The DT approach is less likely to be over-trained than the BKS, but it still relies on the presence of large training sets. Woods [103] has presented Dynamic Classifier Selection by Local Accuracy (DCS-LA). This approach estimates each classifier's accuracy in local regions of the feature space surrounding an unknown pattern. It then uses the decision of the most accurate classifier to estimate the true class of the input pattern. In this implementation, the local regions are defined in terms of the k-nearest neighbors in the training data. The larger the value of k, the more accurate the estimate generated for any given input pattern. However, the computational cost increases with the value of k.
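A sketch of DCS-LA, assuming sklearn-style classifiers with a predict method and a Euclidean notion of neighbourhood; both are assumptions made here, the thesis does not prescribe an implementation:

```python
import numpy as np

def dcs_la_predict(x, X_train, y_train, classifiers, k=10):
    """Dynamic Classifier Selection by Local Accuracy: estimate each classifier's
    accuracy on the k nearest training samples of x and trust the most accurate one."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]                        # indices of the k-nearest neighbours
    local_acc = [np.mean(clf.predict(X_train[nn]) == y_train[nn]) for clf in classifiers]
    best = int(np.argmax(local_acc))
    return classifiers[best].predict(x.reshape(1, -1))[0]
```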

3.3 Architectures for Combining Classifiers

3.3.1 Boosting

Schapire first proposed a general method to convert a weak learner into one that achieves an arbitrarily high accuracy [82]. The algorithm originally developed was based on the theoretical model known as the weak learning model. This model assumes that weak learning algorithms can do slightly better than random guessing, regardless of the underlying probability distribution D used in generating the samples. The work of Schapire indicates how to boost these weak learners to achieve an arbitrarily high accuracy. The boosting algorithm of Freund and Schapire [27, 82] maintains a weight for each sample in the training set that reflects its importance. Adjusting the weights causes the learner to focus on different examples, leading to different classifiers. Boosting is an iterative algorithm. In each iteration, the weights are adjusted in accordance with the performance of the corresponding classifier. The weight of the misclassified examples is increased. The final classifier aggregates the learned classifiers from each iteration by weighted voting. The weight of each classifier is a function of its accuracy. AdaBoost (Adaptive Boosting) [27] is similar to the bagging algorithm in that sets of classifiers that participate in the voting are generated. Boosting generates classifiers sequentially. In each iteration, the algorithm changes the weights of the training instances, taking into account the errors of previously built classifiers. One reason for the improved performance observed with boosting is that it generates a hypothesis whose error rate is small by combining many hypotheses whose error rates may be large. Boosting may also be helpful for learning problems that have one of the following properties:

• The observed examples tend to have varying degrees of difficulty. For such problems, the boosting algorithm tends to generate distributions that concentrate on the more difficult samples.

• The learning algorithm should be sensitive to changes in the training samples so that significantly different hypotheses can be generated for different training sets. This is related to the reduced variance achieved by boosting. The use of a weighted majority over many hypotheses has the effect of reducing the random variability of the combined hypothesis.

It can be noted that the algorithm requires each classifier to perform better than a random guess for it to be effective. Boosting generally requires large amounts of data to be effective. It can also be paralyzed, when adding more classifiers degrades the existing level of accuracy rather than yielding improved performance.
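A minimal sketch of the AdaBoost-style weight update and weighted vote described above, for two-class problems with labels in {−1, +1}; the train_weak callable and its interface are assumptions made for illustration, not part of the original algorithm's specification:

```python
import numpy as np

def adaboost(X, y, train_weak, n_rounds=25):
    """Minimal AdaBoost loop. train_weak(X, y, w) is assumed to return a weak
    classifier with a .predict method that outputs labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # one weight per training sample
    learners, alphas = [], []
    for _ in range(n_rounds):
        clf = train_weak(X, y, w)
        pred = clf.predict(X)
        err = np.sum(w[pred != y])           # weighted training error
        if err >= 0.5 or err == 0:
            break                            # weak learner no better than chance, or perfect
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)       # raise the weights of misclassified samples
        w /= w.sum()
        learners.append(clf)
        alphas.append(alpha)
    def predict(X_new):
        votes = sum(a * clf.predict(X_new) for a, clf in zip(alphas, learners))
        return np.sign(votes)                # weighted vote of all rounds
    return predict
```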

3.3.2 Stacked Generalization

Wolpert [102] has proposed the stacked generalization framework, a layered architecture. The classifiers at Level 0 receive as input the original data, and each classifier outputs a prediction for its own sub-problem. Successive layers receive as input the predictions of the layer immediately preceding them. A single classifier at the top level outputs the final prediction. Stacked generalization is an attempt to minimize the generalization error by using the classifiers in higher layers to learn the type of errors made by the classifiers immediately below. From this perspective, stacked generalization can be seen as an extension to model selection methods such as cross-validation, which uses a "winner takes all" strategy: the single classifier with the lowest cross-validation error is selected. The idea behind stacked generalization is that there may be more intelligent ways to use a set of classifiers. The role of the higher classifiers is to learn how the previous classifiers have made mistakes, and in which classes they agree or disagree, and then use this knowledge for making predictions. Most of the work on the stacking architecture [10, 93, 102] focuses on two-layered architectures, as depicted in Figure 3.1. The learning phase begins by training each of the Level 0 classifiers using leave-one-out cross-validation, that is, for each pattern in the training set, one pattern is omitted and the remaining patterns are used for training. After training, the left-out pattern is classified. A vector is formed from the predictions of the Level 0 classifiers and the actual class of that pattern. Then, the Level 1 classifier is trained with the training set being the collection of vectors generated by the Level 0 classifiers. The second-level and subsequent-level training would require large quantities of data to be available. It is also prone to error due to unrealistic confidences produced by some of the base classifiers.

[Figure 3.1: Two-level stacked generalization architecture. The input is fed to the Level 0 classifiers (Classifier 1, ..., Classifier n); their predictions feed a Level 1 classifier, which produces the final output.]
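A sketch of building the Level 1 training set; the thesis describes leave-one-out cross-validation, while this illustration uses the lighter k-fold variant and assumes sklearn-style estimators created by the level0_factories callables (these names are illustrative):

```python
import numpy as np

def stack_level1_training_set(X, y, level0_factories, n_folds=5):
    """Build the Level 1 training set by cross-validation: each Level 0 classifier is
    trained without one fold and produces predictions for that held-out fold."""
    n = len(y)
    folds = np.array_split(np.random.permutation(n), n_folds)
    meta_X = np.zeros((n, len(level0_factories)))
    for fold in folds:
        train_idx = np.setdiff1d(np.arange(n), fold)
        for j, make_clf in enumerate(level0_factories):
            clf = make_clf().fit(X[train_idx], y[train_idx])
            meta_X[fold, j] = clf.predict(X[fold])     # held-out predictions become Level 1 inputs
    return meta_X, y                                   # train the Level 1 classifier on (meta_X, y)
```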

3.3.3 Hierarchical Mixture of Experts

Jordan and Jacobs [51] have based their hierarchical mixture of experts on the divide-and-conquer principle. The problem is divided into a set of simple sub-problems whose solutions can be combined to yield a solution of the problem. This follows from the way humans carry out the process of recognition: first at the categorical level, and then at successive subordinate levels. Figure 3.2 illustrates the approach using two hierarchical levels.

[Figure 3.2: Two-level hierarchical mixture of experts architecture. Local experts at the leaves produce outputs µij from the input x; a gating network within each cluster mediates them into the cluster output µi, and a top-level gating network mediates the cluster outputs into the final output µ.]

For each cluster of networks, the outputs of the classifiers are mediated by a gating network. The collective outputs of the different clusters are then mediated by another gating network. The local experts sit as the leaf nodes of this tree. For a given input vector x, each local expert (i, j) produces the output µij as a generalized linear function,

\mu_{ij} = f(U_{ij}, x)   (3.29)

where U_{ij} is a weight matrix, and f is a fixed continuous non-linearity. The gating networks are also generalized linear, with outputs \eta_{ij} = v_{ij}^T x, where v_{ij} is a weight vector. The jth output of the gating network is the softmax of the function \eta_{ij}, where

g_{j|i} = \frac{e^{\eta_{ij}}}{\sum_k e^{\eta_{ik}}}.   (3.30)

The output of cluster i is

\mu_i = \sum_j g_{j|i}\, \mu_{ij}.   (3.31)

Similarly, the final output of the top level of the tree is

\mu = \sum_i g_i\, \mu_i.   (3.32)

The mixture of experts approach is generally considered a modular approach. The various base classifiers are experts in local areas of the feature space, and classifier selection is done based on this pre-defined distribution. Therefore, the misclassification of one classifier would immediately affect the overall performance of the approach.
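A sketch of the forward pass of Equations (3.29)–(3.32) for a two-level tree, assuming linear experts are passed in as callables and that the gating weight matrices are given; all parameter names here are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hme_output(x, V_top, V_gates, experts):
    """Two-level hierarchical mixture of experts forward pass.
    V_top: weights of the top gating network; V_gates[i]: weights of the gate of
    cluster i; experts[i][j]: callable local expert returning mu_ij(x)."""
    g_top = softmax(V_top @ x)                          # g_i at the top level, Eq. (3.30)
    mu = 0.0
    for i, (V, cluster) in enumerate(zip(V_gates, experts)):
        g = softmax(V @ x)                              # g_{j|i} within cluster i
        mu_i = sum(g[j] * expert(x) for j, expert in enumerate(cluster))  # Eq. (3.31)
        mu += g_top[i] * mu_i                           # Eq. (3.32)
    return mu
```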

3.4 Choosing a Method for Ensemble Creation and Combination

As mentioned at the beginning of this chapter, the design of a multiple classifier system involves the design of the classifiers and the decision function for combining their respective outputs. In sections 3.1 to 3.3, we have presented various ways in which ensemble members can be created and combined. However, since more than one method for creating ensembles exists, clearly it is helpful to know which method is likely to produce the best results, or better results than those obtained by choosing the best classifier from a set of available alternatives. Although the definition of the design problem leads us to think that an effective design should address both creating and combining of ensembles simultaneously, most of the design methods described in the literature focus on one or the other. As a matter of fact, most of these methods focus on issues related to the design of the combining function. In particular, methods that focus on the design of a classifier ensemble tend to assume a fixed, simple decision combination function, and intend to generate a set of mutually complementary classifiers that can be combined to achieve optimal accuracy [58, 59]. Alternatively, methods focusing on the design of the combination function assume a given set of carefully designed
classifiers, and intend to find an optimal combination of classifier decisions. In order to perform such optimization, a large set of combination functions of arbitrary complexity is available to the designer. The set ranges from simple voting rules to trainable combination functions. Although some design methods have proved to be very effective, and some papers have investigated the comparative advantages of different methods [31, 43, 65, 91], clear guidelines are not yet available for choosing the best design method for the classification task at hand. As argued in the beginning of this chapter, the effectiveness of an ensemble depends on the extent to which its members make different errors, or are error-independent. Once the importance of the error correlation between the classifiers has been recognized, the main approaches which can be adopted are as follows

• Taking account of the dependency between the classifiers when choosing a method of combining
• Creating classifiers for an effective combination
• Selecting classifiers for an effective combination

3.4.1 Taking account of the Dependency between Classifiers

Methods of combining which take into account the inter-dependency of the classifiers have been proposed. Hashem [37] has described methods to find optimal linear combinations of the members of an ensemble, which consider the relative accuracy of the component classifiers rather than the use of equal combination weights. Rosen [81] alternatively presents a training algorithm that incorporates an error de-correlation penalty term designed to encourage nets to make errors which are uncorrelated with those made by other classifiers.


The extent to which the outputs of a set of classifiers are correlated strongly indicates how they should be combined. For example, if an ensemble does not exhibit any coincident failures with respect to a validation set in a classification task (Level 1 Diversity), then combining the classifiers by means of a simple majority vote produces good results. Good results are also obtained if a simple majority vote is used to combine classifiers which share coincident errors, but where the majority is always correct (Level 2 Diversity). Where there are overlapping errors, more complex methods of combination, such as stacked generalization or some form of weighted averaging (e.g., optimal linear combinations [37, 48, 95, 102]), are likely to be appropriate.

3.4.2 Creating Classifiers for Effective Combination

One approach to creating ensembles is to consider the relative merits of the methods used to create ensemble members, and then choose and apply a method which is more likely to result in classifiers which are diverse. Wolpert [102] has described the available guidance on the choice of methods for generating ensemble members (or Level 0 generalizers, in his terms), as a black art. However, a number of researchers have conducted empirical investigations into the effectiveness of different methods of ensemble creation [75, 97]. The consensus emerging from the field is that methods of creating classifiers by varying the data in some way are more likely to result in classifiers which make different errors. Consequently, the main methods of varying the data adopt different sampling methods, vary the input-output relationships within the data, or adopt re-sampling of some form [11]. Varying the data on which a set of classifiers are trained is more likely, it appears, to result in a set of classifiers that can be combined effectively than varying, for instance, the set of initial conditions from which they are trained, or their topology. The conclusion about the relative
ineffectiveness of varying the initial conditions is supported by the results of [75]. Supposedly, backpropagation is sensitive to the initial weights, but the evidence suggests that although variations in the initial weights may affect the speed of convergence, or whether or not a classifier converges, the resulting differences in generalization are probably slight. It seems that unless the classifier being trained is low in complexity, often only one function that is compatible with a set of data is found. Therefore, regardless of the initial set of weights, the training algorithm, or the topology, a classifier that has learned a particular set of data is likely to show the same pattern of generalization. Of course, it is difficult to argue conclusively against the possibility that altering the initial conditions of a classifier can result in significant changes in the pattern of generalization. However, the evidence suggests varying the initial conditions is likely to be less effective than training classifiers on data sets that are different in some way. We illustrate this point on classifiers that use disjoint training sets. Although varying the data may be an effective way of producing classifiers which generalize differently, this is not necessarily true if the training set is not representative. It is still important to test the resulting classifiers, and establish the extent to which they constitute identical solutions, or make different errors. Using disjoint training sets does not necessarily result in low error correlations, which can be explained with reference to the concept of training set representativeness [17]. A representative training set is one which leads to a function being inferred which is similar, or identical, to that which generates the test set. Therefore, a representative training set leads to a good generalization. The problem is, however, that two representative training sets, even if the data that defined them did not overlap at all, can still lead to very similar functions being inferred. The result is that their patterns of errors on the test set are similar. In contrast, if a candidate set of classifiers are trained by using unrepresentative training sets, the resulting generalization performance is poor. Each
classifier may infer quite different functions, and show different patterns of generalization on the test set. But, as the number of errors increases, so does the probability that the errors made on the test set will overlap. It also follows that the smaller training sets that can result from the use of disjoint samples are also likely to be less representative, and result in the poorer performance noted by [97]. Consequently, there is a delicate balance between training set representativeness and error correlation. What is needed are several training sets, all of which are representative and lead to a good generalization, but exhibit a minimum number of coincident failures. The extent to which the training sets exhibit coincident failures (or the determination of the type of diversity they exemplify) can only be determined through a process of testing the performance of selected ensembles. Training set representativeness indicates the importance of testing and selecting classifiers for effective combining. However, the representativeness of training sets has further implications for effective combining. It has been argued [11] that classifiers trained for combining should be under-regularized such that they fit the training data closely, since the variance that results from this is removed when combining. But the issue is complicated if the data available for training is of poor quality (i.e., noisy and/or unrepresentative). Then, it may be important to avoid over-fitting, and to use some form of regularization. Thus, Raviv and Intrator [78] have obtained better results when they incorporated a form of regularization (weight decay) into the training of their component classifiers on noisy data. Therefore, selecting classifiers has been suggested for effective combination. There are a number of ways in which the concept of selection can be incorporated into the construction of effective ensembles. The point of selecting is to reduce the number of shared failures that a set of classifiers produces. As has been argued [38, 48, 95], the presence of harmful
collinearity, or correlation between the errors made by the component classifiers in an ensemble, reduces the effectiveness of the ensemble itself. Even though certain methods of creating ensemble members are more effective than others, any such methods are still best combined with some form of testing and selection of ensemble members; it cannot be assumed that adopting a particular approach ensures that error independence is achieved. Some researchers have proposed the so-called overproduce and choose paradigm (also called the test and select approach [85]) to design an appropriate multiple classifier system for the task at hand [31, 32, 76]. The idea is to produce an initial large set of candidate classifiers; the choice phase then selects the subset of classifiers that can be combined to achieve optimal accuracy. It is easy to see that such an optimal subset can be obtained by exhaustive enumeration, that is, by assessing on a validation set the classification accuracy provided by all possible subsets, and then choosing the subset exhibiting the best performance. The performance evaluation should be conducted with respect to a given combination function (e.g., the majority voting rule). Unfortunately, if R is the size of the set produced by the overproduction phase, the number of possible subsets is equal to \sum_{i=1}^{R} \binom{R}{i}. Therefore, different strategies have been
proposed in order to limit the computational complexity of the choice phase. Although the choice phase usually assumes a given combination function for evaluating the performances of classifier ensembles, there is a strong interest in techniques that allow the choosing of effective classifier ensembles without assuming a specific combination rule. This can be seen via the analogy with the feature selection problem, where techniques for choosing those features that are most effective for preserving class separability have been developed. Accordingly, techniques for evaluating the degree of error diversity of classifiers that form an ensemble have been used for classifier selection
purposes. In the following, we shall assume that a large ensemble C made up of R classifiers is created by the overproduction phase, or C = {c_1, c_2, ..., c_R}. The goal of the choice phase is to select the subset C* of classifiers that can be combined to achieve optimal accuracy. There are various approaches to choosing the classifiers for successful combining:

• Using Heuristic Rules: Partridge and Yates have demonstrated some techniques that exploit heuristic rules for choosing classifier ensembles [76]. The choose-the-best technique assumes an a priori fixed size n of the optimal subset C*. Then, it selects from the set C the n classifiers with the highest classification accuracy to create the subset C*. The rationale behind such a heuristic choice is that all the classifier subsets exhibit similar degrees of error diversity. Accordingly, the choice is based only on the accuracy value. The other choice technique proposed by Partridge and Yates is referred to as choose the best in the class. For each classifier class, the classifier exhibiting the highest accuracy is chosen. Therefore, a subset C* made up of three classifiers is created if the initial set C is made up of classifiers belonging to three classifier types (e.g., the multilayer perceptron neural net, the k-nearest neighbors classifier, and the radial basis function neural net). With respect to the previous rule, this heuristic rule also takes into account that classifiers belonging to different types should be more error-independent than classifiers of the same type. On the one hand, the use of heuristic rules allows us to greatly reduce the computational complexity of the choice phase, because the evaluation of different classifier subsets is not required. On the other hand, the general validity of such heuristics is obviously not guaranteed.

• Diversity Measures: Several measures of error diversity for classifier ensembles have been
proposed. Partridge and Yates have developed a measure named within-set generalization diversity, or simply GD. GD takes values in the range [0, 1], and provides a measure of the diversity of the classifiers forming the ensemble. Another diversity measure, the pairwise Q statistic, has been proposed by Kuncheva [68]. The Q statistic allows us to evaluate the diversity of two classifiers. Classifiers that tend to categorize the same patterns correctly, that is, classifiers that are positively correlated, have positive values of Q. Classifiers that make errors on different patterns will exhibit negative values of Q. For statistically independent classifiers, Q_{i,k} = 0. The average Q computed over all the possible classifier couples is used to evaluate the diversity of a classifier ensemble. Giacinto and Roli [31] have suggested a simple diversity measure, named compound diversity, or simply CD, based on the compound error probability for two classifiers. As for Q, the average CD computed over all the possible classifier couples is used to evaluate the diversity of a classifier ensemble.

• Methods Based on Search Algorithms: We can see that search algorithms are the most natural way to implement the choice phase required by the overproduce and choose design paradigm. Sharkey et al. [85] have developed an exhaustive search algorithm based on the assumption that the number of candidate classifier ensembles is small. To avoid the complexity of an exhaustive search, three choice techniques are presented. They are based on search algorithms previously used for feature selection purposes, and for the solution of complex optimization problems.

1. Forward Search: The choice phase based on the forward search algorithm begins by creating an ensemble consisting of a single classifier (a sketch is given after this list). Usually, this initial classifier is chosen randomly. Alternatively, the classifier with the highest accuracy can be used. Then, single classifiers are added to form subsets of two classifiers. If such a subset exhibits the highest value of the evaluation function, one more classifier is added to it to form subsets of three classifiers. This iterative process stops when all the subsets of size K + 1 exhibit values of the evaluation function lower than those of size K. In this case, the subset of size K which exhibited the highest value of the evaluation function is selected.

2. Backward Search: The backward search starts from the full classifier set C created by the overproduction phase. Then, eliminating one classifier at a time, all the possible subsets of classifiers are created, and their evaluation function values are assessed. The subset that exhibits the highest value is selected. The iterative process stops when all the subsets of size K − 1 exhibit values of the evaluation function that are lower than those of size K. In that case, the subset of size K that exhibited the highest value of the evaluation function is selected.

3. Tabu Search: The two previous algorithms stop the search process if the evaluation function decreases with respect to the previous step. Because the evaluation function can exhibit non-monotonic behavior, it can be effective to continue the search process even if the evaluation function is decreasing. The tabu search is based on this concept. In addition, the tabu search implements both a forward search strategy and a backward one. The search begins from the full classifier set. At each step, adding and eliminating one classifier creates new subsets. Then, the subset that exhibits the highest value of the evaluation function is selected to create new subsets. It is noteworthy that such
subsets are selected, even if the evaluation function is decreased with respect to the previous step. In order to avoid the creation of the same subsets in the different search steps (i.e., in order to avoid cycles in the search process), a classifier added or eliminated cannot be selected for insertion/deletion for a certain number of search steps. Different stop criteria can be adopted. For example, the search can stop after a certain number of steps, and the best subset created during the search process is returned.

All these search algorithms use an evaluation function for assessing the effectiveness of candidate ensembles. These diversity measures and the accuracy value assessed by the majority voting rule have been used as evaluation functions. Obviously, the search algorithms avoid exhaustive enumeration, but the selection of an optimal classifier ensemble is not guaranteed. It is worth noting that evaluation functions are computed with respect to a validation set in order to avoid over-fitting problems.

• A Method Based on Clustering Classifiers: An approach to the choice phase that allows the identification of an effective subset of classifiers with limited computational effort is desirable. Such an approach is based on the hypothesis that the set C, created by the overproduction phase, is made up of the union of K disjoint subsets Ci. In addition, we assume that the compound error probability between any two classifiers belonging to the same subset is greater than the one between any two classifiers belonging to different subsets. It is easy to see that effective multiple classifier system members can be extracted from different subsets Ci, if the more highly error-correlated classifiers belong to the same subset and the error-independent classifiers belong to different subsets. Therefore, the selection procedure
consists of two phases: the identification of the subsets Ci by the clustering of classifiers, and the extraction of the classifiers from different clusters in order to create an effective classifier ensemble C*. Classifiers are clustered according to the CD measure so that the classifiers that make a large number of coincident errors are assigned to the same cluster, and classifiers that make few coincident errors are assigned to different clusters. At each iteration of the clustering algorithm, one candidate ensemble C* is created by taking from each cluster the classifier that exhibits the maximum average distance from all the other clusters. For each candidate ensemble C*, the classifiers are then combined by majority voting, and the ensemble with the highest performance is chosen.
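As referenced in the forward search item above, the following is a minimal greedy forward-search sketch; evaluate(subset) stands for any evaluation function computed on a validation set, such as majority-vote accuracy or a diversity measure, and both names are assumptions made for illustration:

```python
def forward_search(classifiers, evaluate):
    """Greedy forward search over ensemble subsets: grow the subset one classifier
    at a time and stop when no size-(K+1) subset beats the best size-K subset."""
    remaining = list(classifiers)
    subset, best_score = [], float("-inf")
    while remaining:
        scored = [(evaluate(subset + [c]), c) for c in remaining]
        score, candidate = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break
        subset.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return subset, best_score
```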

3.5 Summary

In this chapter, we have presented and discussed the relevant work in classifier combining. In addition, we have introduced several algorithms and architectures to combine classifiers. Techniques for selecting classifiers to improve their combined performance were also introduced. These various combining methods have been categorized in Chapter 1, where we argued that data dependent techniques have the potential to present superior combining performance. The aim of this thesis is to present a framework by which explicit data dependence can be established. The framework must also allow for the possibility of overcoming the complex requirement of classifier selection, by allowing the classifiers to develop diversity while being trained. A simple architecture, called the feature-based architecture, will be presented in Chapter 4. We also present a training algorithm that allows the classifiers to evolve while training.

Chapter 4

Feature-Based Decision Fusion Architectures

In the previous chapter, we have reviewed various approaches presented in the literature for combining classifier ensembles. Based on the categorization presented in Chapter 1, these approaches can be considered either data independent, implicitly data dependent, or explicitly data dependent. Data independent approaches consider the individual classifiers as black boxes. These simple aggregation approaches base their final output on the individual decisions of these classifiers. Although such simple aggregation rules compete with more complex aggregation schemes involving second-level training, they are susceptible to incorrect estimates of the confidence by the individual classifiers. These confidences might be generated when using under-trained or over-trained sub-optimal classifiers. Another key issue that pertains to the combining of such classifiers, as well as to other ensemble approaches, is the production of diverse classifiers. Methods for creating or selecting classifiers for combining [84] depend on the over-production of classifiers and applying diversity
measures or selection heuristics to choose which classifiers to use. Implicitly data dependent approaches, on the other hand, train the combiner on how the base classifiers perform collectively on a set of training data. Such approaches limit the vulnerability to incorrect estimates of the confidence by individual classifiers, yet collinearity amongst the base classifiers can sometimes undermine the robustness of this approach. They also cannot be considered a universal solution to combining classifiers, or they involve complex calculations compared to other approaches. Finally, the third category is explicitly data dependent approaches, where the competence of each classifier is estimated in the vicinity of the input pattern being classified. Most of the explicitly data dependent approaches in the literature are methods for the selection of classifiers in local sub-spaces, rather than for combining the outputs of the various classifiers being considered. These methods use a pre-defined distribution to guide the use of the various classifiers in the ensemble. This makes most of these aggregation schemes a post-combination of decisions. This investigation focuses on making the decision fusion a more adaptive process. Local classification decisions are combined in a similar way to the parallel suite in decision fusion models [15]. This approach requires the aggregation procedure to gather information about the input beyond what individual classifiers provide. The gathered information (i.e., the extracted additional features) is used to tune the aggregation procedure. This strategy then guides the modules, during the development phase, to adapt to the learning samples that are misclassified on the global level. Consequently, the efficiency of the global fusion scheme is related to the learning efficiency of all of the input samples, not just the efficiency of the local modules.

4.1 The Detector Based Architecture

Figure 4.1 shows the block diagram of a proposed detector-based architecture that incorporates these requirements. In the following subsections, we will introduce each component of this architecture.

[Figure 4.1: Detector-Based Decision Fusion Architecture. The input is fed to the detectors D1, ..., Dm and to the classifiers C1, ..., CK, each of which consists of a classifier element CEi, an error analysis module Ei, and a modification module Mi; all outputs feed the aggregation procedure, which produces the final output.]

4.1.1 Classifiers

Each individual classifier Ci produces some output cij(x) representing its interpretation of the input pattern x. In this context, and similar to most aggregation methods, we are more interested in the usage of the output of these classifiers to help in the aggregation procedure rather than in the methodology of classification. Another goal is to utilize sub-optimal classifiers in the proposed architecture. This will reduce the development requirements and make such an architecture worthwhile. Each classifier can be divided into three modules. These modules are the basic
classifier element CEi, which performs the initial classification, the error analysis module Ei, and the modification module Mi. The module Ei performs error analysis of the output of the classifier element CEi. This analysis can be based on a statistical study of the output of the classifier element CEi to prevent possible errors. These modules must be dynamic, as well as comprehensive. Only relevant information is passed to the modification module. The accuracy of a pattern recognizer depends on the intrinsic overlap of the class distributions, and on the estimation error due to the finite size of the set of learning objects. Errors can be classified into two categories: classification errors and systematic errors. The classification error, defined as the probability of error in classifying new objects, is used as a measure of the accuracy. The classification error depends on the characteristics of the features chosen, the number of features, the size of the learning set, and the procedure used for the estimation of the discriminant function. A number of these items are influenced or determined by a priori knowledge. Systematic errors may be constant or may vary in a regular way [94]. Eliminating those errors improves performance, which is the main role of this module. The techniques and approaches used are totally dependent on the nature of the problem and the classification technique implemented in the classification module. The modules Mi are modification modules that utilize the information from the error analysis modules and operate on the classifier input to produce a modified output that is normalized to a common representation with all other inputs fed into the aggregation procedure. Hence, the comparison between the different outputs of the classifiers Ci is meaningful. Like the error analysis modules, these modules are dependent on the nature of the problem and the classifier used, and a confidence index is associated with each classifier. This index can be interpreted as the conditional probability that the classifier experienced success given a certain input, or
P(xi correct | Input), where xi is the output vector of classifier i. To demonstrate how the error analysis and modification modules can be useful, consider a classifier that automatically recognizes the amount field on a cheque. There are two groups of statistics that can be generated. The first group includes the overall or global statistics of the classifier. The second is concerned with the statistics of each individual classifier on its own output.

• Global classifier statistics

1. Number of Correct: The number of readings that matched the truth, independent of their confidence level. As a raw statistic, this figure gives a rough estimate of how the system performs and how sure one can be of the results this classifier produces.

2. Number of Rejects: The number of readings the classifier failed to identify. These values are not wrong, but they are reported in the final performance of the engine.

3. Number of Misreads: The number of readings reported that are wrong.
   – Complex Misreads: Misread values that have more than two errors. No statistics are generated for these cases because they contain a large amount of ambiguity.
   – Simple Misreads: Misread values that have one or two errors only. These are the cases for which statistics are generated.

• Individual classifier statistics

1. Substitution Errors si(a, b): The rate at which each engine substitutes a certain numeral a in the output string with b. This statistic helps in evaluating the authenticity of the value reported by the engine. Si denotes the total number of samples containing substitution errors in the training set.

2. Insertion Errors Ii(a): The insertion of a number a between two existing numerals. This statistic reflects the amount of certainty in the length of, and the sequence of numbers in, the output. The higher the probability of inserting a certain character, the greater the suspicion in a presented value that contains this number. The total number of samples in the training set that include an insertion error is denoted by Ii.

3. Deletion Errors di(a): The deletion of a certain number a from the output. This statistic is treated differently from the other character-level statistics. Both the insertion and substitution errors depend on characters present in the output, while the deleted character is not present in the output; hence, the total deletion rate of each engine, Di, is used.

These statistics are generated in the error analysis module. The modification module can use these values to correct the output yi of the individual classifier. If the number of numerals in yi is L, then the confidence in the output can be calculated as

for l = 1, 2, ..., L
    ci = ci × (1 − Ii(yi(l)) × Ii) × (1 − si(yi(l)) × Si)
end
ci = ci × (1 − Di) × (1 − No. of simple misreads / (No. of samples − No. of rejects))

These modifications incorporate the types of individual classifier errors, normalized against the number of values on which the statistics are based. We re-iterate that these modifications are dependent on the nature of the classifier and the problem being addressed.

4.1.2 Detectors

Each detector Dl takes the input space and attempts to extract useful information for the aggregation procedure, rather than aiming to solve the classification problem. In other words, it tries to understand and collect information that might be helpful in the aggregation procedure. For instance, in a character recognition problem, the goal is to identify a given character. Each detector Dl, ∀ l ∈ ∆ = {1, 2, ..., M}, is in itself a classifier that produces an output dl. While the individual classifiers try to determine the character, the detectors try to identify the category of the character. This helps the aggregation scheme in determining how to combine the different classification outputs to achieve a better performance.

4.1.3 The Aggregation Procedure

The aggregation procedure represents the fusion layer of all the different outputs to generate a more competent output. The aggregation procedure uses the detectors’ outputs to guide the means of combining the different classification results. The aggregation scheme can be divided into two phases: a learning phase and a decision making phase. The learning phase assigns a weight factor to each input to support each decision maker. This weight factor represents a confidence in the output of each classifier. These confidences are then aggregated using standard classifier-combining methods. The final classification output of this aggregation procedure can be represented as

Q(x) = F_j(d_l(x), c_{ij}(x)).   (4.1)

The aggregation module is represented by F_j, where the combination of the detector outputs d_l and the confidence levels c_{ij} is performed.

4.1.4 Two-dimensional Illustrative Example

This problem is basically a two-dimensional recognition problem. It contains 20 different equiprobable classes, as presented by Auda and Kamel [4]. Each class has a total of 100 samples generated to follow a Gaussian distribution around a random class mean. The classes also assume the same standard deviation. The complexity, and in this case what makes the presented results more general, lies in the non-homogeneous regions in the feature space and the overlapping boundaries between them. Figure 4.2 illustrates the classes within this data set, which can be grouped into three main clusters. Groups 1, 2, and 3 include classes {1,2,3,4,5,6,7}, {8,9}, and {10,11,12,13,14,15,16,17,18,19,20}, respectively. The data is divided into three sets: training, evaluation, and testing; 1000/300/700 entries were used for training, evaluation, and testing, respectively. The evaluation set is used to evaluate the performance of the classifier on unseen data while training; this evaluation may trigger the need to retrain the classifier. The testing set is used to evaluate the final outcome of the classifier. The individual classifiers are six backpropagation neural networks. For consistency, the training procedure and the neural network structure, but not the training data, are the same for all the classifiers. In this experiment, we used the NeuralWorks commercial software [47] to build the networks and test their performance. Six different training subsets are used, each to train one classifier. Table 4.1 indicates the different training subsets. The classifiers have two input, 12 hidden, and 20 output nodes. The learning algorithm is the Extended DBD algorithm with momentum 0.4 and learning coefficient 0.5. The individual decision is the highest output of each classifier.


Figure 4.2: The two-dimensional 20-class data set used (X: first dimension, Y: second dimension)

Table 4.1: Classes within each training subset

Subset | Classes Included
1      | 1,2,3,4,5,6,7,8,9
2      | 8,9,10,11,12,13,14,15,16,17,18,19,20
3      | 1,2,4,5,6,7,8,9,10
4      | 2,4,7,10,11,12,13,14,15,16,17,18,19,20
5      | data from all classes, with emphasis on classes 1 to 7
6      | data from all classes, with emphasis on classes 10 to 20


The classifiers trained using subsets 1 and 2 are expected to misclassify patterns from Groups 3 and 1, respectively, due to the lack of data points from these groups in their training subsets. While subsets 3 and 4 include representative data from some of the classes of all the groups, the corresponding classifiers are still expected to fail in certain subspaces of the input. Finally, subsets 5 and 6 are exposed to all the classes of the data set. This distribution is chosen to produce classifiers with different levels of generalization in different subspaces of the input space. Two detectors, both of which are self-organizing maps, are used. One divides the data into 12 groups, following the hierarchical task-decomposition technique [2]. The other divides the data into the three main groups. The aggregation procedure is yet another backpropagation neural network, taking as input the confidence output of each classifier and the outputs of the detectors. The confidence is evaluated using the confusion matrix of each classifier. If classifier $k$ produces an output $j$, the confidence $c^k(x)$ of that classifier is given by

$$c^k(x) = \frac{c^k_{jj}(x)}{\sum_{i=1}^{N} c^k_{ij}(x)} \qquad (4.2)$$

where $c^k_{ij}$ is the number of samples belonging to class $i$ that classifier $k$ recognized as class $j$; the terms $c^k_{jj}$ are the diagonal elements of the confusion matrix generated from the training data for each classifier. Table 4.2 reports the performance of the different techniques implemented, measured on the evaluation set and the testing set. The testing set is further divided into the three main groups. The various classifier-combination schemes recognize data from classes 8 and 9 (Group 2) perfectly; therefore, this group is not included in the table.
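A short sketch of Eq. (4.2) is given below, under the usual assumption that the confusion matrix is stored with true classes on the rows and predicted classes on the columns; the toy numbers are invented.

```python
import numpy as np

def classifier_confidence(confusion, j):
    """Confidence of a classifier that predicted class j, in the sense of Eq. (4.2).

    confusion : (N, N) array where confusion[i, j] counts training samples of
                true class i that this classifier labelled as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    column_total = confusion[:, j].sum()          # sum over i of c_ij
    return confusion[j, j] / column_total if column_total else 0.0

# Toy 3-class confusion matrix (assumed counts): of the 12 patterns labelled
# as class 0, 8 truly belong to class 0, so the confidence is 8/12.
cm = [[8, 1, 1],
      [3, 9, 0],
      [1, 0, 7]]
print(classifier_confidence(cm, 0))   # 8 / (8 + 3 + 1)
```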


Table 4.2: Performance on the Gaussian 20-class problem (% error)

Data set   | Best NN | Majority Vote | Average Vote | Detector Based
Evaluation | 29.67   | 22.00         | 24.00        | 18.33
Testing    | 27.71   | 20.14         | 22.29        | 18.00
Group 1    | 22.00   | 12.77         | 14.29        | 13.43
Group 3    | 32.18   | 24.73         | 33.45        | 17.45

While we note an improvement in performance from using multiple classifiers, the detector-based approach obtains the best performance. The detectors provide a basis on which the aggregation module can weight the individual classifiers. Moreover, the approach is robust to fluctuations in the outputs that may result from the experimental setup; this robustness is due to the ability of the architecture to adapt to the input data through the detectors.
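For reference, the two baseline combiners in Table 4.2 can be written compactly as below. This is a generic sketch of majority and average voting, not tied to the NeuralWorks setup used in the experiment.

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """predictions: list of class labels, one per classifier."""
    return Counter(predictions).most_common(1)[0][0]

def average_vote(outputs):
    """outputs: (K, N) array of class scores from K classifiers; average, then argmax."""
    return int(np.argmax(np.mean(np.asarray(outputs, dtype=float), axis=0)))

# Toy example with three classifiers and four classes.
print(majority_vote([2, 2, 3]))                     # class 2 wins the vote
print(average_vote([[0.1, 0.6, 0.2, 0.1],
                    [0.3, 0.1, 0.5, 0.1],
                    [0.2, 0.5, 0.2, 0.1]]))         # class 1 has the highest mean score
```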

4.1.5 Discussion

The detector-based architecture allows for the dynamic decision fusion of classifiers. In this architecture, the aggregation procedure has the flexibility to adapt to changes in the input and output in order to improve the final output. The main idea is that the architecture tries to understand changes in the input, by extracting features using the detectors, and uses this understanding to direct the way it performs the aggregation. The aggregation learns how to combine the different decisions in order to improve the overall classification performance. However, there are a few issues regarding this architecture:

• Design of detectors: The detectors require new features for the problem. These features are distinct from those used for classification, yet they assist in choosing among the classifiers. Such features are not always readily available, and extracting them may be a difficult task.


• Design of classifiers: How a classifier should be designed, and when its training should be stopped, is always an issue. It is also known that combining good classifiers does not necessarily lead to improved performance after aggregation. The classifiers must be trained sufficiently and appropriately to achieve an improved classification performance once they are combined.

• Degree of human intervention: The parameters required for the design of the classifiers are a concern. Such design choices leave a margin of improvement that depends on tuning the parameters, and the heuristics for this tuning are not always readily available to typical users. Reducing the influence of such tuning would improve the applicability of the design.

In Section 4.2, we describe the modifications and algorithms applied to the architecture to address these issues.

4.2 Feature Based Architecture

In this modification to the architecture, illustrated in Figure 4.3, we address the issues regarding the design of the detectors. Here, the detectors take as input the output of the ensemble appended to the input features of the ensemble classifiers, in contrast to the initial architecture, where the detectors extract separate features. The detectors generate a weighting factor, or confidence, for each classifier. These weights are represented by $w_{ij}(x, c_{ij})$ and reflect the degree of confidence in each classifier with respect to the input space. The weights may be used to combine the various classifiers through standard voting techniques, such as the maximum or average vote, or through the aggregation learner. This learner takes the output of the classifiers and the corresponding weights from the detectors to produce the combined result. The aggregation can be represented as the mapping

$$Q(x) = F_j\left(w_{ij}(x, c_{ij}),\, c_{ij}(x)\right) \qquad (4.3)$$

where the weights $w_{ij}$ depend on both the input pattern being classified and the individual confidence outputs of the classifiers. The detector provides a learning module for assigning these weights; note that the detector is reduced to a single classifier that assigns them. The aggregation module is represented by $F_j$, which combines the weights $w_{ij}$ and the confidence levels $c_{ij}$. Based on this representation, the feature-based approach is a combination of implicitly and explicitly data-dependent approaches.
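To make Eq. (4.3) concrete, the sketch below builds the detector input from the pattern and the ensemble confidences and turns the detector's raw scores into per-classifier weights. The linear detector, the softmax weighting, and the parameter shapes are assumptions for illustration; the architecture itself trains a neural network for this role.

```python
import numpy as np

def detector_weights(x, confidences, W, b):
    """Hypothetical linear detector producing one weight per classifier.

    x           : (I,) input pattern
    confidences : (K, N) ensemble confidence outputs c_ij for this pattern
    W, b        : assumed detector parameters of shape (K, I + K*N) and (K,)
    """
    z = np.concatenate([np.asarray(x, float), np.asarray(confidences, float).ravel()])
    scores = W @ z + b
    scores -= scores.max()                       # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()    # softmax -> weights w_ij(x, c_ij)
    return w

def feature_based_decision(x, confidences, W, b):
    # Eq. (4.3): combine the weighted confidences and pick the winning class.
    w = detector_weights(x, confidences, W, b)
    return int(np.argmax(w @ np.asarray(confidences, float)))

# Toy usage: I = 2 inputs, K = 3 classifiers, N = 4 classes, random parameters.
rng = np.random.default_rng(1)
I, K, N = 2, 3, 4
W, b = rng.normal(size=(K, I + K * N)), np.zeros(K)
conf = rng.random((K, N))
print(feature_based_decision(rng.random(I), conf, W, b))
```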

Figure 4.3: Feature-based decision aggregation architecture (the input feeds classifiers 1 through N and the detector; the fusion classifier combines their outputs to produce the final decision)

4.2.1 Illustrative Example

In order to investigate the feature-based architecture, we implement various classifier combining approaches. These approaches can be divided into three groups.


• Standard voting approaches, which include the majority, maximum, average, Nash vote, and Borda count approaches (a sketch of the Nash and Borda votes is given below).

• Trained approaches, which include the weighted average, the Bayesian approach, and fuzzy integrals.

• Feature-based approaches, which include the feature-based (FB) approach and voting techniques that use the detector output. These voting techniques are the maximum, average, and Nash voting. Unlike the standard voting and trained approaches, the voting techniques applied in the feature-based setting use the weights generated by the detector to combine the ensemble of classifiers.

Again, we use the 20-class Gaussian problem to illustrate the feature-based architecture. In all our tests, we train five one-hidden-layer backpropagation neural networks [47] with ten sigmoidal hidden units using the error-backpropagation algorithm. Each experiment is repeated ten times, and the best-performing run on the evaluation set is used in further analysis. The training set is divided into five portions. Each classifier is trained with four portions of the training set, with each classifier omitting a different portion of the data; as a result, the different training sets are partially disjoint. If an evaluation set is present, it is used to perform early stopping. The stopping criterion is to save the best network, with respect to the evaluation set, over 10,000 and 100,000 iterations; a test for the best performance is performed every 1000 iterations. The learning scheme is backpropagation with Delta-Bar-Delta, a tanh transfer function, and softmax outputs. For the single best NN, we train one neural network with all the training data and select the best of 10 different trials. No optimization is performed for the trained networks.
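Among the standard voting approaches listed above, the Borda count and the Nash (product) vote are the least commonly written out. The sketch below gives one common formulation of each, under the assumption that every classifier supplies a full score vector over the classes; the exact variants used in the experiments are not restated here.

```python
import numpy as np

def borda_count(outputs):
    """Each classifier ranks the N classes; the best-ranked class earns N-1 points,
    the next N-2, and so on.  The class with the most points wins."""
    outputs = np.asarray(outputs, dtype=float)          # shape (K, N)
    K, N = outputs.shape
    points = np.zeros(N)
    for scores in outputs:
        ranking = np.argsort(scores)                    # ascending: worst ... best
        points[ranking] += np.arange(N)                 # 0 ... N-1 points
    return int(np.argmax(points))

def nash_vote(outputs):
    """Nash (product) vote: multiply the classifiers' scores per class, then argmax."""
    outputs = np.asarray(outputs, dtype=float)
    return int(np.argmax(np.prod(outputs, axis=0)))

# Toy score vectors from three classifiers over four classes.
votes = [[0.1, 0.6, 0.2, 0.1],
         [0.3, 0.1, 0.5, 0.1],
         [0.2, 0.5, 0.2, 0.1]]
print(borda_count(votes), nash_vote(votes))
```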


Table 4.3: Performance of the feature-based approach

Training Cycles   | 10K    | 100K
Best NN           | 20.71  | 27.71
Maximum           | 20.00  | 21.29
Majority          | 18.86  | 19.29
Average           | 14.14  | 14.14
Borda             | 16.57  | 20.57
Nash              | 13.29  | 15.14
Weighted Average  | 14.43  | 15.00
Bayesian          | 16.86  | 16.29
Choquet Integral  | 19.71  | 21.57
Feature Based     | 17.14  | 12.69
Mean Corrcoef     | 0.9354 | 0.8975

The parameters of all the networks are maintained for all the classifiers that are trained. Table 4.3 shows the results of the different combining methods for 10K and 100K training epochs. From these results we observe that using a multiple-classifier approach improves the final classification with respect to the single-network approach. It can also be seen that the trained combining approaches perform better than the fixed combining approaches. Although definitive conclusions cannot be drawn from this limited set of experiments, some preliminary observations can be made. The feature-based approach presents an improvement in classification accuracy over the single classifier and the other classifier-combining methods. When a low number of training epochs is used, the correlation between the individual classifiers is high, which means that they tend to fail and succeed on the same entries. The various combining methods show an improvement over the single network ranging between 3% and 30%. Overtraining the classifiers reduces the correlation between the different classifiers; although the performance of the combining methods decreases, the margin of improvement over the single network increases from 23% to 49%. This occurs because the members of an over-trained ensemble become more distinct in the errors they make.


The trained combining methods are better than the voting methods because of their ability to adapt to the performance of the classifiers. Under the low number of training cycles, the feature-based approach does not achieve an acceptable performance because of the size of the detector and aggregation networks, which increases the training requirements of these modules in the architecture. With the higher number of training cycles, the feature-based architecture achieves a 54% reduction in the error rate relative to the single network.
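The "Mean Corrcoef" row in Table 4.3 can be computed, for example, as the mean pairwise correlation of the classifiers' correctness indicators on the test data. The exact quantity used in the experiments is not restated here, so the following is an assumed but common formulation.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_corrcoef(correct):
    """correct: (K, M) 0/1 matrix; correct[k, m] = 1 if classifier k got sample m right.

    Returns the mean Pearson correlation over all classifier pairs, one common way
    to quantify how similarly the ensemble members succeed and fail.
    """
    correct = np.asarray(correct, dtype=float)
    corrs = [np.corrcoef(correct[a], correct[b])[0, 1]
             for a, b in combinations(range(correct.shape[0]), 2)]
    return float(np.mean(corrs))

# Toy example: three classifiers, six test samples.
hits = [[1, 1, 0, 1, 0, 1],
        [1, 1, 0, 1, 1, 1],
        [0, 1, 0, 1, 0, 1]]
print(round(mean_pairwise_corrcoef(hits), 3))
```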

4.2.2 Discussion

The modification implemented to the architecture addresses the issues related to the design of the detectors. Compared to the detector-based approach, the detectors are reduced to one standard element that uses both the input space and the output of the ensemble to provide a weighting mechanism for each classifier. Unlike other weighting approaches, the feature-based approach also uses the input space to modify the weights assigned to the classifiers. The results indicate that this architecture may improve classification accuracy, and they support the observation that trained combining methods perform better than the standard voting approaches. However, the dimensionality and training requirements are still issues of concern.

• Dimensionality of the detector and aggregation module: The detector and aggregation classifiers use the output of the ensemble and the input space. Hence, the number of inputs to the detector is $K \times N + I$, where $K$ is the number of classifiers in the ensemble, $N$ the number of classes, and $I$ the number of inputs; the members of the ensemble, in comparison, have only $I$ inputs. Likewise, the aggregation network has $K \times N + K$ inputs. For neural networks, the detector is required to tune $(K \times N + I) \times H + H \times K$ weights. This increase in the size of the network means that more training is needed to achieve an acceptable performance. The dimensionality may be reduced by representing the output of the ensemble with an output class and a confidence level, which reduces the input vector of the detector to $2 \times K + I$ values, whereas the aggregation module then has $2 \times K + K$ inputs (see the sketch after this list). For the 20-class problem, this means that the detector network is reduced from a neural net of $102 \times 10 \times 5$ to one of $12 \times 10 \times 5$; similarly, the aggregation network is reduced from $105 \times 10 \times 20$ to $15 \times 10 \times 20$.

• Training Requirements: The training time required to achieve improved performance is problematic, and the dimensionality of the other components of the architecture compounds the problem. To address it, we present a training algorithm that suits the proposed architecture and can determine the extent of the required training.
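The dimensionality reduction described in the first item above can be sketched as follows; the helper names are hypothetical, and only the summarization of each classifier by its winning class and confidence reflects the idea in the text.

```python
import numpy as np

def compress_ensemble_output(confidences):
    """Reduce a (K, N) confidence matrix to 2*K values: for each classifier,
    its winning class index and the confidence of that decision."""
    confidences = np.asarray(confidences, dtype=float)
    winners = np.argmax(confidences, axis=1)                      # K class indices
    best_conf = confidences[np.arange(len(winners)), winners]     # K confidences
    return np.column_stack([winners, best_conf]).ravel()          # length 2*K

def detector_input(x, confidences):
    # Detector input of size 2*K + I instead of K*N + I.
    return np.concatenate([compress_ensemble_output(confidences), np.asarray(x, float)])

# With K = 5 classifiers, N = 20 classes and I = 2 inputs (the 20-class problem),
# the detector input shrinks from 5*20 + 2 = 102 values to 2*5 + 2 = 12.
conf = np.random.default_rng(2).random((5, 20))
print(detector_input([0.4, 0.7], conf).shape)   # (12,)
```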

4.3 An Adaptive Training Algorithm for Classifier Ensembles

Typically, each module in a classifier ensemble is trained independently. During the testing phase, each module makes a decision, based on its training, as to the class of the input data; the decisions are fused using some aggregation scheme, and the result of this fusion is the final classification. These different approaches can be viewed as a means by which relative weights are assigned to each classifier. Although the ensemble model has many benefits, there are still some disadvantages to using it. For example, in practice, we see that although the individual classification accuracy of some of the modules may be high, the final classification accuracy can be much lower [84]. This is due in part to the fact that the decision fusion mechanism may not have enough information about the accuracy expected from a module during the testing phase; such information could allow the aggregation module to carry out a more informed fusion process. Hence, the accuracy of each module is just as important as its behavior within the ensemble. Duin [22] suggests that retraining the base classifiers after training and evaluating the combining classifier is useful, which means that the design of the combined classifier system becomes an iterative procedure. Another problem shared by all classifiers is the duration of training: how much should the individual modules be trained? Depending on the class groupings, individual modules need to be trained to different levels of generalization or specialization, as a result of the severity of the overlap between the classes in the training data. The adaptive training algorithm presented here attempts to alleviate these problems to some extent, in order to create a more robust and systematic training procedure.

In the feature-based architecture presented, as with most multiple-classifier systems, each component is trained independently. Although each component is optimized and achieves a high accuracy, the final classification may not reflect this improvement. The Evolving (EVOL) algorithm presented in this section rectifies this issue. In this algorithm, the ensemble of classifiers, as well as the combining method, is allowed to focus further training on the feature sub-spaces with higher misclassification. The algorithm also provides a method to reduce the amount of required training by focusing on the quality of the training. The adaptive training allows the final classification by the aggregation layer to determine whether or not further training should be carried out at the modular level [45].


In addition, a computed confidence factor for each of the modules allows the algorithm to utilize the best weight set available for each module. The idea is that by increasing the quality of training at the modular level, the final aggregation process is expected to be more accurate. The training and testing algorithms are described below. We note that the data must be divided into training, testing and evaluation sets; the evaluation set should be distinct from the training and test sets but, like the training set, should include representative vectors for all classes. Let $C_i$ represent module $i$, $CF_i$ the confidence factor of the classification from $C_i$, and $\Gamma$ a user-defined threshold such that $0.0 < \Gamma < 1.0$. We also define $Err$ as the number of evaluation samples in error, $0 < P < 100\%$ as a base percentage, and $\delta$ as the modifier for $P$. $K$ is the number of modules, and $CF_i^{best}$ is the best confidence factor obtained for $C_i$ during training. Figure 4.4 shows the flowchart of the algorithm.

During testing, we utilize the best weight set obtained for each of the modules, which assures the highest confidence factors for the modules' local decisions. The algorithm uses the evaluation set to gauge how well each module is performing, not only with respect to itself but also relative to the other modules. The confidence factor acts as a measure of this ability, since it is based on the proportion of incorrectly classified records relative to the total number of records in the evaluation set. In the training algorithm, we continue training each network based on the classifications and votes that were in error on the evaluation set. In order not to destroy the classifications and votes already learnt by a module, the algorithm randomly chooses records from the training set that represent the set of learned classes. The number of records chosen is $\lceil Err \times P \times \delta / 100 \rceil$ and thus depends on the number of records in error and the user-defined constant $P$. A higher error on the evaluation set results in more correctly classified records being chosen from the training set; the number of records chosen to represent the correctly classified classes is therefore a function of the number of records in error.
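A condensed sketch in the spirit of the adaptive training loop described above is given below. The module interface (train, evaluate, save, restore), the record format (feature vector, label), and the stopping test based on the improvement in the confidence factor are assumptions for illustration; they do not reproduce the flowchart of Figure 4.4 exactly.

```python
import math
import random

def adaptive_training(modules, train_set, eval_set, gamma, P, delta, max_rounds=20):
    """Sketch of an adaptive ensemble training loop.  `modules` are assumed to expose
    train(data), evaluate(data) -> (confidence_factor, misclassified_records) and
    save()/restore() for managing their best weight sets.  Records are assumed to
    be (features, label) pairs."""
    best_cf = {m: 0.0 for m in modules}
    done = {m: False for m in modules}
    data = {m: list(train_set) for m in modules}

    for _ in range(max_rounds):
        if all(done.values()):
            break
        for m in modules:
            if done[m]:
                continue
            m.train(data[m])
            cf, errors = m.evaluate(eval_set)        # CF_i and evaluation records in error
            if cf > best_cf[m]:
                m.save()                             # keep the best weight set seen so far
                improvement, best_cf[m] = cf - best_cf[m], cf
                if improvement < gamma:              # negligible gain: stop training module
                    done[m] = True
            # Recompose the module's training data: the records in error plus
            # ceil(Err * P * delta / 100) randomly chosen training records that
            # represent the classes the module already classifies correctly.
            n_keep = math.ceil(len(errors) * P * delta / 100.0)
            error_classes = {label for _, label in errors}
            learned = [r for r in train_set if r[1] not in error_classes]
            data[m] = list(errors) + random.sample(learned, min(n_keep, len(learned)))

    for m in modules:
        m.restore()                                  # use the best weights at test time
    return best_cf
```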


Figure 4.4: Flowchart of the adaptive training algorithm (initialization and composition of the training files, per-module training and evaluation, comparison of the confidence factor with the best saved value, selection of new training data, and training of the aggregation module)
