MULTIPLE CLASSIFIERS BY CONSTRAINED MINIMIZATION

Partha Niyogi        Jean-Benoit Pierrot        Olivier Siohan

Multimedia Communications Research Lab
Bell Laboratories – Lucent Technologies, 600 Mountain Ave, Murray Hill, NJ 07974, USA

This work was done while J.-B. Pierrot was a post-doc at Bell Labs.

ABSTRACT

This paper describes an approach to combining multiple classifiers in order to improve classification accuracy. Since the individual classifiers in an ensemble should be mutually uncorrelated in some fashion to yield higher classification accuracy than a single classifier, we propose to train classifiers by minimizing the correlation between their classification errors. A simple combination strategy for three classifiers is then proposed, and its achievable error rate is analyzed and compared to the performance of an individual classifier. The proposed approach has been evaluated on artificial data and on a nasal/oral vowel classification task. Theoretical analysis and experimental results illustrate the effectiveness of the proposed approach.

1. INTRODUCTION

There has been considerable interest in the possibility of building multiple classifiers and then combining them to obtain superior performance in pattern classification and machine learning applications [1]. In speech recognition, most approaches have focused on combining classifiers built on different feature sets [2, 3] or on different sub-bands [4], where ad-hoc combination strategies are used and compared experimentally. In this work, we adopt a rather different viewpoint and study the properties that individual classifiers should have for their combination to yield an improvement. At the outset, it is clear that multiple copies of the same classifier are no better than a single classifier; the classifiers must therefore differ from each other in some respect. The important questions are: (i) how should the classifiers differ from each other, and how can we engineer them to do so; (ii) how should they then be combined; and (iii) how does the error rate after combination relate to the error rates achievable by individual classifiers.

Following the intuition that the individual classifiers in an ensemble must be uncorrelated in some fashion, the class of algorithms that come under the rubric of boosting [5], bagging [6], and arcing [7] attempt to build uncorrelated classifiers by resampling the data with different distributions and training a separate classifier on each resampled data set. These classifiers are then typically combined by weighted majority voting to yield provable error guarantees. Rather than building separate classifiers by minimizing the empirical risk over differently weighted versions of the dataset, we consider here an approach that explicitly minimizes the correlation between classifiers. To keep matters simple and tractable, we consider a three-classifier system and analyze the possibilities it contains. Experimental evaluation is carried out on artificial datasets as well as on an oral/nasal vowel classification task.

2. A SIMPLE CLASSIFICATION PROBLEM

For simplicity and expository purposes, we consider two-class pattern classification problems. There is a feature space $X$, a two-class label set $Y = \{0, 1\}$ (say), and a probability distribution $P$ on $X \times Y$ according to which labelled examples $(x, y)$ are presented to a learning machine. The goal of the learning machine is to come up with a decision rule $f$ that maps each $x \in X$ into $Y$ so as to minimize the misclassification error. In the following sections, we introduce a combination strategy for three classifiers and describe the corresponding training algorithm.

2.1. A Combination Strategy for Three Classifiers





Suppose we have two uncorrelated classifiers making a prediction about the label, $y$, of a test example, $x$. When both classifiers agree, the ensemble output is their common label. When they disagree, a third classifier is invoked as an arbiter; this third classifier is explicitly trained, using minimum error training, on the examples on which the first two disagree. Given this combination strategy, we now proceed to examine how we might obtain two uncorrelated classifiers without unduly sacrificing the quality of each individual one. A sketch of this combination rule is given below.
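As an illustration only (not the authors' implementation), the combination rule can be sketched as follows; `f1`, `f2`, and `f3` are assumed to be functions mapping an array of feature vectors to 0/1 labels, and `train_fn` stands for any minimum-error training routine:

```python
import numpy as np

def combine_three(f1, f2, f3, X):
    """When f1 and f2 agree, output their common label;
    when they disagree, let the arbiter f3 decide."""
    y1, y2 = f1(X), f2(X)
    return np.where(y1 == y2, y1, f3(X))

def fit_arbiter(f1, f2, X_train, y_train, train_fn):
    """Train the arbiter only on the training points where f1 and f2 disagree."""
    mask = f1(X_train) != f2(X_train)
    return train_fn(X_train[mask], y_train[mask])
```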

2.2. A Constrained Minimization Formulation



For a start, we consider how to obtain two classifiers $f_1$ and $f_2$ that are both at least $\epsilon$-good, i.e., have an error rate of no more than $\epsilon$, while being as uncorrelated as possible. Let $e_f$ be the error function associated with a classifier $f$: $e_f$ is defined as a $\{0,1\}$-valued random variable taking the value 1 if the classifier makes an error, and 0 otherwise:

$$e_f(x, y) = \begin{cases} 1 & \text{if } f(x) \neq y \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The error rate associated with a classifier $f$ is simply defined as:

$$\mathrm{Error}(f) = E[e_f] \qquad (2)$$

where $E$ denotes the mathematical expectation. The correlation between the two classifiers $f_1$ and $f_2$ can be represented in terms of the covariance between their respective errors:

$$C(f_1, f_2) = \mathrm{Cov}(e_{f_1}, e_{f_2}) = E[e_{f_1} e_{f_2}] - E[e_{f_1}]\,E[e_{f_2}] \qquad (3)$$

Then, it is possible to design two classifiers having an error rate of at most $\epsilon$ and being as uncorrelated as possible by solving the following constrained optimization problem:

$$\min_{f_1, f_2 \in \mathcal{Q}} C(f_1, f_2) \qquad (4)$$

subject to

$$\mathrm{Error}(f_1) \leq \epsilon \quad \text{and} \quad \mathrm{Error}(f_2) \leq \epsilon \qquad (5)$$

where $\mathcal{Q}$ is a space of functions from $X$ to $\{0, 1\}$ and the classifiers $f_1$ and $f_2$ are functions drawn from this class.

In the previous discussion, the mathematical expectations are taken over the true probability distribution. Moreover, the true optimization problem is a combinatorial one, hard to solve computationally. In practice, the classifiers $f_1$ and $f_2$ are derived from a labeled set of training examples $\{(x_i, y_i)\}_{i=1}^{N}$. In principle, the class $\mathcal{Q}$ could be any class of functions (classifiers), but to make matters concrete, let us assume that $\mathcal{Q}$ is a set of Radial Basis Function networks, and let us denote by $\theta_1$ and $\theta_2$ the parameters of the classifiers $f_1$ and $f_2$, respectively. In other words,

$$f_\theta(x) = I\left[\sum_j \lambda_j K(x, c_j) > 0\right] \qquad (6)$$

where $I$ is the indicator function and $K(\cdot, c_j)$ denotes a radial basis function (cf. Section 4.1). To get around this difficult optimization problem, we need to define a continuous and differentiable function of the classifier parameters in order to approximate the true correlation and error rates. The indicator function is approximated by a sigmoid, as follows:

$$I(z) \approx \frac{1}{1 + e^{-\alpha z}} \qquad (7)$$

where $\alpha$ is used to control the sharpness of the approximation. The error function is then simply defined by substituting this smooth approximation of the indicator into Eq. (1). The true error rate in Eq. (2) and the correlation in Eq. (3) can then be approximated by replacing the mathematical expectation with its empirical average over the training data. Hence, the classifier parameters $\theta_1$ and $\theta_2$ are trained by solving the constrained minimization problem in which both the objective function in Eq. (4) and the constraints in Eq. (5) are now continuous-valued functions. The whole procedure is quite similar in spirit to the minimum error training paradigm that was proposed for a speech recognition application in [8]. Rather than using a simple Generalized Probabilistic Descent algorithm as in [8], we choose to use the non-linear programming solver CFSQP to solve the constrained optimization problem [9]. The significant difference from the minimum error training paradigm is that instead of obtaining a single best classifier by minimizing the training error (empirical risk minimization), a different objective function is proposed to obtain two relatively uncorrelated classifiers. Once the parameters $\theta_1$ and $\theta_2$ of the two classifiers have been derived, it is possible to extract the training points on which the two classifiers disagree. A third classifier is then built on these training examples only, using an empirical error minimization training paradigm. A sketch of this smoothed training step is given below.
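A minimal sketch of this smoothed training step, assuming Gaussian basis functions with fixed, pre-chosen centers and a common width, a single sigmoid slope `alpha`, a smoothed per-example error of the form |sigmoid(alpha*g) - y|, and SciPy's SLSQP solver as a stand-in for CFSQP; none of these specific choices are taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_output(theta, X, centers, sigma=1.0):
    """g_theta(x) = sum_j theta_j * exp(-||x - c_j||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances to centers
    return np.exp(-d2 / sigma**2) @ theta                       # shape (n_samples,)

def smooth_error(theta, X, y, centers, alpha=5.0):
    """Sigmoid-smoothed 0/1 error of the classifier f_theta(x) = I[g_theta(x) > 0]."""
    p = 1.0 / (1.0 + np.exp(-alpha * rbf_output(theta, X, centers)))  # smoothed indicator
    return np.abs(p - y)                                              # per-example smoothed error

def objective(params, X, y, centers):
    """Empirical covariance between the smoothed errors of the two classifiers (Eq. 3)."""
    t1, t2 = np.split(params, 2)
    e1, e2 = smooth_error(t1, X, y, centers), smooth_error(t2, X, y, centers)
    return np.mean(e1 * e2) - np.mean(e1) * np.mean(e2)

def train_pair(X, y, centers, eps=0.2, seed=0):
    """Minimize the error covariance subject to both empirical error rates being <= eps."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(scale=0.1, size=2 * len(centers))
    cons = [{"type": "ineq",
             "fun": lambda p, k=k: eps - np.mean(smooth_error(np.split(p, 2)[k], X, y, centers))}
            for k in (0, 1)]
    res = minimize(objective, x0, args=(X, y, centers), method="SLSQP", constraints=cons)
    return np.split(res.x, 2)
```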

3. ANALYSIS OF PERFORMANCE

How does the error rate of the combined scheme relate to the error rates of the individual classifiers? Under what conditions is this combination scheme likely to yield performance superior to that achievable by a single optimal classifier (chosen from $\mathcal{Q}$)? To what extent is it profitable to sacrifice the accuracy of individual classifiers in an ensemble of the sort we have described? These are the questions we address here.

3.1. Analysis

Suppose each of the classifiers $f_1$ and $f_2$ has an error rate $p$ and that their correlation is $\rho$. Let us also assume that $q$ is the error rate of the third classifier, trained on the data points on which $f_1$ and $f_2$ disagree, and let $P_e$ be the resulting error rate of the combined 3-classifier system.

Theorem 1  The error rate of the 3-classifier ensemble is
$$P_e = p^2 + \rho + 2q\,\bigl(p(1-p) - \rho\bigr).$$

Lemma 1  The correlation between classifiers $f_1$ and $f_2$ is bounded by: $-p^2 \leq \rho \leq p(1-p)$.

Corollary 1  From the previous theorem and lemma, the following bound on the error rate of the 3-classifier system is obtained: $2qp \leq P_e \leq p$.

Therefore, if we have two classifiers that are $\epsilon$-good each, then the combination strategy will have an error rate that is upper bounded by $\epsilon$. This is true for all $q \leq 1/2$. If $q > 1/2$, however, then the combination strategy will not work. If $\rho$ can be guaranteed to be close to its lower bound of $-p^2$, then $P_e$ is of order $2qp$.
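As a quick numerical illustration of this analysis (the expression below follows the reconstruction of Theorem 1 above; the values of $p$, $\rho$, and $q$ are arbitrary):

```python
def ensemble_error(p, rho, q):
    """Error rate of the 3-classifier ensemble: both base classifiers wrong
    (probability p**2 + rho) plus the arbiter wrong on the disagreement region
    (probability 2*(p*(1-p) - rho), arbiter error rate q)."""
    return p**2 + rho + 2 * q * (p * (1 - p) - rho)

# Two identical classifiers (rho at its maximum p*(1-p)) give back the single-classifier
# rate, while strongly anti-correlated classifiers (rho near -p**2) push towards 2*q*p.
p, q = 0.10, 0.30
print(ensemble_error(p, p * (1 - p), q))   # 0.10  (no gain from duplicates)
print(ensemble_error(p, -p**2, q))         # 0.06  (= 2*q*p, below the single-classifier rate)
```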



The total error rate is seen to depend upon $p$, $\rho$, and $q$. This is illustrated in Fig. 1, which shows the error surface and iso-error lines as a function of $p$ and $\rho$ for a fixed value of $q$. Thus the total error rate can be controlled by varying the error rates of the individual classifiers and the correlation between them. As a matter of fact, by increasing the error rates of the individual classifiers while decreasing their correlation sufficiently, it is in principle possible to decrease the overall error rate. To appreciate this point better, consider the point A plotted in the figure. This represents two identical classifiers, each having an error rate of $p^\star$, the best possible error rate achievable by a single classifier chosen from the class $\mathcal{Q}$. The solid iso-error line represents all choices of $(p, \rho)$ that would give the same $P_e$. Clearly, on one side of this line $P_e$ is actually less than $p^\star$. These lower error rates are achieved by using classifiers whose individual error rates are higher than $p^\star$.

[Figure 1: Total error $P_e$ of the classifier ensemble as a function of $p$ and $\rho$ for a fixed value of $q$. Light gray indicates low values of $P_e$ while dark gray indicates high values.]

In the preceding discussion, we have assumed that the error rates of the two classifiers are equal. This need not be so. Assuming that $p_1$ and $p_2$ are the error rates of $f_1$ and $f_2$, respectively, the previous theorems can be rewritten. The error rate of the ensemble becomes
$$P_e = p_1 p_2 + \rho + q\,\bigl(p_1(1-p_2) + p_2(1-p_1) - 2\rho\bigr).$$

Lemma 2  The correlation between classifiers $f_1$ and $f_2$ is bounded by: $-p_1 p_2 \leq \rho \leq \min(p_1, p_2) - p_1 p_2$.

Corollary 2  From the previous theorem and lemma, the following bound on the error rate of the 3-classifier system is obtained: $q\,(p_1 + p_2) \leq P_e \leq \min(p_1, p_2) + q\,\bigl(p_1 + p_2 - 2\min(p_1, p_2)\bigr)$.

This time, it appears that $P_e$ is not automatically guaranteed to be less than the smaller of the two individual error rates.
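A corresponding numerical check for the unequal-error-rate case (again using the reconstructed expression for $P_e$ above; the specific values are illustrative only):

```python
def ensemble_error_unequal(p1, p2, rho, q):
    """Ensemble error with unequal base error rates: both wrong (p1*p2 + rho)
    plus arbiter wrong on disagreements (p1*(1-p2) + p2*(1-p1) - 2*rho)."""
    return p1 * p2 + rho + q * (p1 * (1 - p2) + p2 * (1 - p1) - 2 * rho)

# Even at the most favorable correlation (rho = -p1*p2), the ensemble error
# q*(p1 + p2) = 0.075 exceeds min(p1, p2) = 0.05 in this example.
print(ensemble_error_unequal(0.05, 0.20, -0.05 * 0.20, 0.30))  # 0.075
```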

4. EXPERIMENTAL RESULTS

4.1. Artificial Data

To illustrate the effectiveness of the proposed approach, some experiments have been performed on simulated datasets. All classifiers have been trained in a space of Radial Basis Function (RBF) networks implementing functions of the following form:
$$g_\theta(x) = \sum_j \lambda_j K(x, c_j)$$
where $c_j$ is the center of the $j$-th basis function and $K$ is defined as
$$K(x, c_j) = \exp\left(-\frac{\|x - c_j\|^2}{\sigma_j^2}\right).$$
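A minimal sketch of an RBF classifier of this form; the centers, width, weights, and toy data below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian basis functions K(x, c_j) = exp(-||x - c_j||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def rbf_classify(X, centers, sigma, lam):
    """Hard decision f(x) = I[g(x) > 0] with g(x) = sum_j lam_j * K(x, c_j), as in Eq. (6)."""
    return (rbf_features(X, centers, sigma) @ lam > 0).astype(int)

# Illustrative use on a toy 2-D problem: two centers, one positive and one negative weight.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centers = np.array([[1.0, 1.0], [-1.0, -1.0]])
lam = np.array([1.0, -1.0])
y_hat = rbf_classify(X, centers, sigma=1.0, lam=lam)   # predicted labels in {0, 1}
```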
