SLAC-PUB-1373 (Rev.)  CS-75-487  April 1975, Revised January 1976

A RECURSIVE PARTITIONING DECISION RULE FOR NONPARAMETRIC CLASSIFICATION*

Jerome H. Friedman
Stanford Linear Accelerator Center
Stanford, California 94305

ABSTRACT

A new criterion for driving a recursive partitioning decision rule for nonparametric classification is presented. The criterion is both conceptually and computationally simple, and can be shown to have strong statistical merit. The resulting decision rule is asymptotically Bayes risk efficient. The notion of adaptively generated features is introduced, and methods are presented for dealing with missing features in both training and test vectors.

(Submitted to IEEE Transactions on Computers)

*This work supported by U.S. ERDA under contract AT(04-3)515.

Introduction

In many classification problems, the underlying conditional probability densities are either partially or completely unknown. Consequently, the classification logic must be designed from information measured from representative samples drawn from each class. The nonparametric classification problem may be stated in the following manner. A random p-dimensional vector x of observed features is thought to belong to one of M populations π1, π2, ..., πM, characterized by density functions f1, f2, ..., fM that are unspecified. On the basis of these features, a decision is made as to which distribution x characterizes, using a training set of vectors drawn from each of the populations, each vector tagged as to the class from which it originated.

The nonparametric decision rules that have received the most attention are the k-nearest neighbor rules, first introduced by Fix and Hodges [1,2]. The k closest training vectors to x (with respect to a specified distance function, or metric) are located, and x is assigned to the class with the largest representation in this set. These authors investigated the rule for the case where k is chosen to be a function of the training sample size N such that

    lim_{N→∞} k(N) = ∞ , while lim_{N→∞} [k(N)/N] = 0 ,

and showed that the procedure is then asymptotically Bayes risk efficient. The rule for fixed k has been investigated by Cover and Hart [3], who showed that for the extreme case of k = 1 (the nearest neighbor decision rule) the asymptotic probability of misclassification is bounded from above by R*[2 - MR*/(M-1)], where R* is the Bayes probability of misclassification.
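For concreteness, the following is a minimal sketch of the k-nearest neighbor rule in Python. It is an illustration added here, not part of the original paper; it assumes NumPy, a Euclidean metric, and majority voting, and the function and variable names are hypothetical.

    import numpy as np

    def knn_classify(x, train_X, train_y, k):
        """Assign x to the class with the largest representation
        among its k nearest training vectors (Euclidean metric)."""
        dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training vector
        nearest = np.argsort(dists)[:k]               # indices of the k closest vectors
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        return labels[np.argmax(counts)]              # majority class among the neighbors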

Despite their desirable statistical properties and intuitive appeal, the k-nearest neighbor rules have not found widespread application to classification problems. This is due, mainly, to their computational complexity: although considerable progress has been made recently in this regard [4], finding the k nearest neighbors of a point is still relatively expensive computationally. Another problem with the nearest neighbor rules (as well as with almost all others) is that they lack invariance under coordinate-wise strictly monotone transformations of the feature axes, so their performance can depend greatly on the choice of a particular metric and feature transformation. For example, a metric that is good in one region of the feature space may not be good in another, and a feature subset that contains a great deal of discriminating information in some regions of the space may contain little or none in other regions. Techniques have been proposed that use the training sample to extract a linear transformation with high discrimination, the decision rule then being applied to the reduced feature set [5,6]. The optimum transformation, however, may not be linear, and discovering the best nonlinear transformation for a particular classification problem and training data sample is a difficult problem for which no general solutions have yet been proposed.

An alternate approach is to design the decision rule so that it contains the desired invariance properties, namely invariance under all strictly monotone transformations of the feature axes. The maximal invariants under such transformations are the coordinate-wise orderings of the training points. Anderson [7] presents decision rules, based on statistically equivalent blocks or distribution-free tolerance regions, that are examples of trying to achieve this invariance. These rules partition the multivariate feature space into regions on the basis of a set of prespecified ordering functions. Although they possess the desired invariance and can be shown to be asymptotically Bayes risk efficient, they may be no more useful than random assignment for moderate sample sizes.

Henrichon and Fu [8] and Meisel and Michalopoulos [9] present heuristic strategies for recursively partitioning the feature space based on the class identities of the training samples. At each stage, the feature used for the partitioning, and the particular partition location, is decided by applying a heuristic measure of the misclassification rate to the training sample subsets. These decision rules maintain the invariance under coordinate-wise monotone transformations that is common to all such partitionings. Although theoretical results concerning Bayes risk are not available for these partitionings, empirical evidence indicates that they can perform well for moderate training sample sizes. In addition, Meisel and Michalopoulos observe that the resulting decision rules can be represented by binary decision trees; they apply a dynamic programming technique to arrive at a decision tree that tends to minimize the average number of comparisons required for a classification decision, given a particular number of partitions.

This note proposes a criterion for driving the recursive partitioning of the feature space that is different from those of Henrichon and Fu, and Meisel and Michalopoulos. The criterion is especially simple and is motivated directly from considerations of Bayes risk efficiency; in fact, the resulting decision rule can be shown to be asymptotically Bayes risk efficient [10]. It makes no assumptions concerning the underlying class probability densities, and, computationally, it is quite fast in both the training and classification stages. Methods are also presented for dealing with missing features in both training and classification vectors.

We first make the restriction to the simplest case of only two classes (M = 2); the extension to the general multiclass problem, given below, is a straightforward and natural extension of the two-class rule [10]. Let f1(x) and f2(x) represent the (unknown) probability density functions of the two classes, and F1(x) and F2(x) the corresponding cumulative distributions. The losses for misclassification are l1 and l2, and π1 and π2 are the corresponding prior class probabilities; we make the further restriction that l1 π1 = l2 π2.

Recursive Partitioning

Consider first a single feature, and suppose for the moment that F1(x) and F2(x) are known univariate distributions. Stoller [11] shows that if one were to cut the real line at a single point, assigning the interval to the left of the cut to one class and the interval to the right to the other, then the Bayes risk of misclassification is minimized by choosing the cut point x* that maximizes the quantity

    D(x) = |F1(x) - F2(x)| ;    (1)

that is,

    D(x*) = max_x D(x) .    (2)

The quantity D(x*) is the well-known Kolmogorov-Smirnov distance between the two distributions.

In many situations a single cut would not provide adequate discrimination, for example if f1(x) and/or f2(x) were multimodal. In this case the Stoller procedure could be extended by reapplying it to each of the two subintervals defined by the first partitioning, resulting in four intervals. This Stoller partitioning can be applied recursively to each interval defined by the previous cuts, until an interval meets a terminal criterion (depending on F1(x) and F2(x) in the interval), at which point it is not divided further. A terminal interval is assigned to the class with the larger probability content in the interval; it is then called a class one cell or a class two cell accordingly.

A natural extension of this procedure to the multivariate case would be to cut, at each stage, on that feature for which the Kolmogorov-Smirnov distance between the two marginal distributions is greatest, the Kolmogorov-Smirnov distance being a well-known measure of the separability of two distributions. As in the univariate case, the partitioning is applied recursively to each subpopulation defined by the previous cuts.

In nonparametric applications the cumulative distributions F1(x) and F2(x) are not known. However, the marginal distributions are easily estimated by the empirical cumulative distribution functions F̂1(x) and F̂2(x), defined by

    F̂i(x) = 0      for x < x_1^(i) ,
    F̂i(x) = k/n    for x_k^(i) <= x < x_(k+1)^(i) ,    (3)
    F̂i(x) = 1      for x >= x_n^(i) ,

where x_k^(i) is the kth point of the ith class, with the points ordered in ascending values of x, and n is the cardinality of the class i subsample under consideration.

A nonparametric recursive partitioning algorithm for two-class discrimination can then proceed as follows. If the subsample under consideration meets the terminal criterion, it is assigned to one of the two classes. Otherwise, the Kolmogorov-Smirnov distance between the empirical marginal distributions of the two classes,

    D(x*_j) = max_{x_j} |F̂1(x_j) - F̂2(x_j)| ,    (4)

is evaluated for each feature j in turn, and the feature for which D(x*_j) is largest is chosen as the one to be cut. That is, the chosen feature j* satisfies

    D(x*_{j*}) = max_j D(x*_j) ,    (5)

and the location of the cut is taken to be x*_{j*}.
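The following minimal Python sketch, added here for illustration (it is not the paper's own code), implements one stage of eqns (1)-(5); NumPy and all function names are assumptions.

    import numpy as np

    def ks_best_cut(x1, x2):
        """Kolmogorov-Smirnov distance between the empirical CDFs of two
        class samples for one feature (eqns 1-3) and the point achieving it."""
        xs = np.sort(np.concatenate([x1, x2]))       # candidate cut points
        F1 = np.searchsorted(np.sort(x1), xs, side='right') / len(x1)  # Fhat_1, eqn (3)
        F2 = np.searchsorted(np.sort(x2), xs, side='right') / len(x2)  # Fhat_2
        D = np.abs(F1 - F2)                          # eqn (1) at the sample points
        i = int(np.argmax(D))                        # eqn (2)
        return D[i], xs[i]

    def choose_cut(X1, X2):
        """Evaluate eqn (4) for each feature of the (n_i, p) class subsamples
        X1 and X2, returning the (distance, feature, cut point) of eqn (5)."""
        best = (-1.0, None, None)
        for j in range(X1.shape[1]):                 # each feature in turn
            D, cut = ks_best_cut(X1[:, j], X2[:, j])
            if D > best[0]:
                best = (D, j, cut)
        return best

The partitioning itself would recur on the two subsamples on either side of the chosen cut until the terminal criterion, discussed below, is met.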

Since the partitioning procedure deals only with marginal distributions, there is nothing that restricts it to the p original features. Based on his knowledge of the problem, the researcher can manufacture new features that are general functions of the original features and add these transgenerated features [8] to the p original ones. The maximization of eqn (5) is then performed over all features, original and manufactured, and at each stage of the partitioning the algorithm chooses the one that yields the best marginal discrimination. Features containing little or no discriminating information are simply ignored, so there is no loss in adding any number of extra features; however, there is a great deal to be gained if one or several of the transgenerated features yield good discrimination.

It is not necessary that these additional features be manufactured in advance of the partitioning. They can be constructed as the partitioning progresses and made dependent upon the particular subsample to which they are applied. For example, at each stage one might add the features

    y_i = w_i · x ,    (6)

where the w_i are the eigenvectors of the matrix B C^(-1) associated with the largest several eigenvalues. Here B is the between-class scatter matrix and C is the within-class scatter matrix for the particular subsample under consideration.
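A sketch of such adaptively generated features follows (an illustration, not the paper's code). It assumes NumPy, and it diagonalizes C^(-1)B, whose eigenvectors give the projection directions in the usual multiple-discriminant formulation; treating this as equivalent in effect to the B C^(-1) criterion quoted above is an assumption made for the sketch.

    import numpy as np

    def scatter_features(X1, X2, m):
        """Projections onto the m leading discriminant directions (eqn 6)
        computed from the current class subsamples X1 and X2."""
        n1, n2 = len(X1), len(X2)
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        mu = (n1 * mu1 + n2 * mu2) / (n1 + n2)        # pooled mean
        B = (n1 * np.outer(mu1 - mu, mu1 - mu)        # between-class scatter
             + n2 * np.outer(mu2 - mu, mu2 - mu))
        C = ((X1 - mu1).T @ (X1 - mu1)                # within-class scatter
             + (X2 - mu2).T @ (X2 - mu2))
        vals, vecs = np.linalg.eig(np.linalg.pinv(C) @ B)
        W = vecs[:, np.argsort(-vals.real)[:m]].real  # w_i for the m largest eigenvalues
        return X1 @ W, X2 @ W                         # y_i = w_i . x for each class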

Thus, the manufactured feature set can itself adapt to different subsamples and different regions of the feature space.

In several applications, we have found it useful to add the single adaptive feature

    y = v · x ,    (7a)

with

    v = [V1 + V2]^(-1) (x̄1 - x̄2) ,    (7b)

where x̄i and Vi (i = 1, 2) are the subsample mean vectors and covariance matrices of the two classes. Here v is the direction of the Fisher linear discriminant. Although these generated features may be motivated by parametric considerations, they are incorporated into the decision rule only where they are found to be useful by the nonparametric Kolmogorov-Smirnov criterion.

It should be noted that the addition of adaptively generated features can cause the resulting decision rule to no longer be invariant under all strictly monotone transformations of the original features. For the features suggested above, however, the rule remains invariant to linear transformations.
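A minimal sketch of this single adaptive feature, under the same assumptions as the earlier sketches (the pseudo-inverse is an implementation guard against a singular covariance sum, not something the text specifies):

    import numpy as np

    def fisher_feature(X1, X2):
        """Fisher direction v of eqn (7b) from the current class subsamples;
        returns the projections y = v . x of eqn (7a) for each class."""
        v = np.linalg.pinv(np.cov(X1.T) + np.cov(X2.T)) @ (X1.mean(axis=0) - X2.mean(axis=0))
        return X1 @ v, X2 @ v

The projected values can then be offered to the Kolmogorov-Smirnov maximization of eqn (5) like any other feature.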

Terminal Criteria

It remains to specify the terminal criterion establishing a subsample as a terminal cell. The partitioning should clearly terminate if the subsample contains training vectors from only a single class, since further partitioning cannot change any class assignments. One possibility is to make this the sole criterion for termination, which results in all of the training vectors themselves being correctly classified by the decision rule. However, this criterion is best only if it is known in advance that there is no overlap in the feature space between the underlying class probability densities.

When the probability densities do overlap, a rule that correctly classifies every training vector does not correctly estimate the class density ratio in the overlap regions. For good performance, the decision rule should estimate this ratio as closely as possible, assigning a cell to class one whenever the estimated ratio favors class one, and to class two otherwise. The cardinality of a cell must therefore be large enough to provide a reasonable estimate of the density ratio, so the partitioning should proceed in a way that insures an adequate number of training vectors in all terminal cells. This can be accomplished by requiring at least k training vectors in each of the two daughter cells of every partition; that is, the maximization of the Kolmogorov-Smirnov distance (eqn 1) is restricted to the range x_(k+1) <= x <= x_(n-k), where n is the cardinality of the cell being partitioned. Here k is a parameter of the procedure, and a reasonable value is problem dependent. It should increase with increasing training sample size N, but more slowly, so that

    lim_{N→∞} k(N) = ∞ , while lim_{N→∞} [k(N)/N] = 0 .

With k(N) so chosen, the recursive partitioning decision rule described in this paper is asymptotically Bayes risk efficient [10].

Suggest Documents