SLAC-PUB-1373 (Rev.)
CS-75-487
April 1975
Revised January 1976

A RECURSIVE PARTITIONING DECISION RULE FOR NONPARAMETRIC CLASSIFICATION*

Jerome H. Friedman
Stanford Linear Accelerator Center
Stanford, California 94305
ABSTRACT

A new criterion for driving a recursive partitioning decision rule for nonparametric classification is presented. The criterion is both conceptually and computationally simple and can be shown to have strong statistical merit. The resulting decision rule is asymptotically Bayes risk efficient. The notion of adaptively generated features is introduced, and methods are presented for dealing with missing features in both training and test vectors.

(Submitted to IEEE Transactions on Computers)

*This work supported by U.S. ERDA under contract AT(04-3)515
Introduction

In many classification problems, the underlying conditional probability densities are either partially or completely unknown. Consequently, the classification decision logic must be designed from information measured from representative training samples drawn from each class. The nonparametric classification problem may be stated in the following manner. A random p-dimensional vector x, characterized by a set of observed features, is thought to belong to one of M populations, π1, π2, ..., πM, that are characterized by density functions f1, f2, ..., fM that are unspecified. On the basis of these features, a decision is made as to which distribution x is assigned.

The nonparametric decision rules that have received the most attention are the k-nearest neighbor rules, first introduced by Fix and Hodges [1,2]. The training samples drawn from each of the M populations are combined into a single set of vectors, each vector tagged as to the class from which it originated. The k closest vectors to x (with respect to a specified distance function, or metric) are located, and x is assigned to the class with the largest representation among them. These authors investigated the rule for fixed k and showed that if k is chosen to be a function of the training sample size, N, such that

    lim_{N→∞} k(N) = ∞ , while lim_{N→∞} [k(N)/N] = 0 ,

the procedure is asymptotically Bayes risk efficient. The rule for the extreme case of k=1 (nearest neighbor rule) has been investigated by Cover and Hart [3]. They show that the asymptotic probability of misclassification is bounded from above by R*[2 - MR*/(M-1)], where R* is the Bayes probability of misclassification.
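For concreteness, the following is a minimal Python sketch of the k-nearest neighbor rule just described; it is illustrative only, with the Euclidean metric and the function name being assumptions rather than part of the rule's specification:

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    """Sketch of the Fix-Hodges k-nearest neighbor rule: locate the k
    training vectors closest to x and assign x to the class with the
    largest representation among them (Euclidean metric assumed)."""
    d = np.linalg.norm(train_X - x, axis=1)    # distances to every tagged vector
    nearest = np.argsort(d)[:k]                # indices of the k closest vectors
    classes, counts = np.unique(train_y[nearest], return_counts=True)
    return classes[np.argmax(counts)]          # majority class among neighbors
```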
Despite their desirable statistical properties and intuitive appeal, the k-nearest neighbor decision rules have not found widespread application to classification problems. This is due, mainly, to their computational complexity: although considerable progress has been made recently in this problem [4], finding the k nearest neighbors to a point using the full p-dimensional feature space is relatively expensive computationally.

Another problem with the nearest neighbor rules (as well as almost all decision rules) is that they lack an invariance with regard to the choice of metric and to coordinate-wise strictly monotone transformations of the feature axes. The performance of these decision rules can depend greatly on these choices. For example, a metric that is good in one region of the feature space may not be good in another, and a feature subset that contains a great deal of discriminating information in some regions of the space may contain little or none in other regions.

Feature subset selection and the search for good linear transformations [5,6] are examples of trying to use the training sample to extract a representation with high discrimination information; the decision rule is then applied with the reduced subset. The optimum transformation, however, may not be linear, and discovering the best nonlinear transformation for a particular problem and training data sample is a difficult problem for which no general solutions have yet been proposed.

An alternate approach is to design the decision rule so that it contains the desired invariance properties, namely invariance under all strictly monotone transformations of the feature axes. Anderson [7] presents decision rules based on statistically equivalent blocks. These rules partition the multivariate feature space into distribution-free tolerance regions on the basis of a set of prespecified ordering functions; the maximal invariants are the ordered population levels of the blocks. These rules possess the desired invariance and can be shown to be asymptotically Bayes risk efficient. Despite these properties, however, they may be no more useful than random assignment for moderate sample sizes.
Henrichon and Fu [8] and Meisel and Michalopoulos [9] present heuristic strategies for recursively partitioning the feature space. At each stage, the partitioning of a particular subsample is decided, using a heuristic criterion, from the class identities of the training vectors in the subsets obtained from previous partitionings. These strategies maintain the desired invariance under strictly monotone transformations, and common sense as well as empirical evidence indicate that they can perform well for moderate training sample sizes. Asymptotic results concerning the Bayes risk of these partitionings, however, are not available. Meisel and Michalopoulos [9] observe that the resulting decision rules can be represented by binary decision trees. They apply a dynamic programming technique to arrive at a decision tree that tends to minimize the number of partitions and the average number of comparisons required, given a particular partitioning of the feature space.

This note proposes a different criterion for driving the recursive feature space partitioning, one based directly on an empirical measure of the misclassification rate computed from the training sample identities. This criterion is especially simple and is motivated by considerations of Bayes risk efficiency. In fact, the resulting decision rule can be shown to be asymptotically Bayes risk efficient, with no assumptions concerning the underlying class probability densities [10]. These decision rules are invariant to all strictly monotone transformations of the features. Computationally, the decision rule is quite fast in both the training and classification stages [11]. Methods for using vectors with missing features, in both training and classification, are presented.

Recursive Partitioning

Consider first the simplest case of only two classes (M=2). The multiclass problem will be seen below to be a natural extension of the two-class rule. Let f1(x) and f2(x) represent the (unknown) probability density functions of the two classes, and F1(x) and F2(x) their corresponding cumulative distributions. Assume that the prior probabilities of the two classes are π1 and π2, and that ℓ1 and ℓ2 are the corresponding losses for misclassification. We make the restriction for the moment that ℓ1π1 = ℓ2π2; extensions to the general case are straightforward [10].

Suppose for the moment that F1(x) and F2(x) are known univariate cumulative distribution functions. Stoller shows that if one were to cut the real line at a point, assigning the region to the left of the cut to one class and the region to the right to the other, then the cut point x* that minimizes the Bayes risk of misclassification is the point that maximizes the quantity

    D(x) = |F1(x) - F2(x)| .                                  (1)

The quantity

    D(x*) = max_x D(x)                                        (2)

is the well-known Kolmogorov-Smirnov distance between the two distributions.
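As a concrete illustration, the following Python sketch locates the cut point x* of eqns (1) and (2) for two univariate samples, using their empirical distributions (defined below in eqn 3); the function name and the sample-based search are illustrative assumptions:

```python
import numpy as np

def stoller_cut(x1, x2):
    """Sketch of the Stoller cut: find x* maximizing D(x) = |F1(x) - F2(x)|
    using the empirical CDFs of two univariate samples x1 and x2."""
    xs = np.sort(np.concatenate([x1, x2]))   # candidate cut points
    # empirical CDFs of each class evaluated at every candidate point
    F1 = np.searchsorted(np.sort(x1), xs, side="right") / len(x1)
    F2 = np.searchsorted(np.sort(x2), xs, side="right") / len(x2)
    D = np.abs(F1 - F2)
    best = int(np.argmax(D))
    return xs[best], D[best]                 # x*, Kolmogorov-Smirnov distance
```

Because the empirical distributions are step functions that change only at sample points, evaluating D(x) over the pooled sample values suffices to find its maximum.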
In many situations, a single cut would not provide adequate discrimination; for example, if f1(x) and/or f2(x) were multimodal. In this case, the Stoller procedure could be extended by reapplying it to each of the two subintervals defined by the first cut, resulting in four intervals, and then recursively to each interval defined by the previous cuts. An interval is partitioned unless it meets a terminal criterion (depending on the points in the interval), at which point it is not divided further. A terminal interval is assigned to one of the two classes: it is called a class one cell unless F2(x) gains more than F1(x) across the interval, at which time it is called a class two cell.

The Kolmogorov-Smirnov distance is a well-known measure of the separability of two distributions. In nonparametric applications, F1(x) and F2(x) are not known. However, they are easily estimated from the training sample by the empirical cumulative distributions F̂1(x) and F̂2(x), defined by

             0 ,     x < x_1^(i)
    F̂i(x) =  k/n ,   x_k^(i) ≤ x < x_(k+1)^(i)                (3)
             1 ,     x ≥ x_n^(i)

where x_k^(i) is the kth point of the ith class, with the points ordered in ascending values of x. Here n is the cardinality of the subsample under consideration.

A natural extension of the Stoller procedure to the multivariate case would be to cut on that feature for which the Kolmogorov-Smirnov distance between the two marginal distributions is greatest. As with the univariate case, one could apply the procedure recursively to each subpopulation defined by the previous partitionings.

A nonparametric recursive partitioning algorithm for two-class discrimination can proceed as follows. If the subsample meets the terminal criterion, it is assigned to one of the two classes. Otherwise, the Kolmogorov-Smirnov distance between the empirical marginal distributions of the two classes,

    D(x_j^*) = max_{x_j} |F̂1(x_j) - F̂2(x_j)| ,               (4)

is evaluated for each feature, j, in turn, and the feature for which D(x_j^*) is largest is chosen as the one to be cut. That is,

    D(x_{j*}^*) = max_j D(x_j^*) .                            (5)

The location of the cut is taken to be x_{j*}^*.
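The complete two-class procedure of eqns (4) and (5) can be sketched in Python as follows; the dictionary-based tree representation and the simple minimum-cell-size stand-in for the terminal criterion (discussed under Terminal Criteria below) are illustrative assumptions:

```python
import numpy as np

def ks_cut(x1, x2):
    """Best cut point and KS distance for one feature (cf. eqns 1-3)."""
    xs = np.sort(np.concatenate([x1, x2]))
    F1 = np.searchsorted(np.sort(x1), xs, side="right") / len(x1)
    F2 = np.searchsorted(np.sort(x2), xs, side="right") / len(x2)
    D = np.abs(F1 - F2)
    i = int(np.argmax(D))
    return xs[i], D[i]

def grow(X, y, min_cell=10):
    """Recursive two-class partitioning driven by eqns (4)-(5);
    min_cell is a stand-in terminal criterion (see Terminal Criteria)."""
    n1, n2 = int(np.sum(y == 0)), int(np.sum(y == 1))
    if n1 == 0 or n2 == 0 or n1 + n2 < min_cell:       # terminal cell
        return {"class": int(n2 > n1)}
    # evaluate D(x_j^*) for each feature j in turn (eqn 4) ...
    cuts = [ks_cut(X[y == 0, j], X[y == 1, j]) for j in range(X.shape[1])]
    j = int(np.argmax([d for _, d in cuts]))           # ... choose j* (eqn 5)
    x_star = cuts[j][0]
    left = X[:, j] <= x_star
    if left.all() or not left.any():                   # degenerate cut: stop
        return {"class": int(n2 > n1)}
    return {"feature": j, "cut": x_star,
            "left": grow(X[left], y[left], min_cell),
            "right": grow(X[~left], y[~left], min_cell)}

def classify(tree, x):
    """Drop a vector down the binary tree to its terminal cell."""
    while "class" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["cut"] else tree["right"]
    return tree["class"]
```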
Since the partitioning procedure deals only with the marginal distributions, there is nothing that restricts it to the p original features. Based on his knowledge of the problem, the researcher can manufacture new, transgenerated features that are general functions of the original ones [8]. At each stage in the partitioning, the feature, original or manufactured, for which D(x_j^*) is largest will be chosen; this maximization can be performed over all features, so the algorithm chooses the one that yields the best marginal discrimination. Features containing little or no discriminating information are simply ignored, so that there is no loss in adding any number of extra features. However, there is a great deal to be gained if one or several of these additional features yield good discrimination in some of the partitioned regions.

It is not necessary that these transgenerated features be manufactured in advance of the partitioning. They can be constructed as the partitioning progresses and made dependent upon the particular subsamples to which they are applied. For example, at each stage of the partitioning one might add to the feature set

    y_i = w_i · x                                             (6)

where the w_i are the eigenvectors of the matrix BC^(-1) associated with the largest eigenvalues. Here B is the between class scatter matrix and C is the within class scatter matrix for the particular subsample under consideration. Thus, the manufactured feature set can itself adapt to different subsamples and different regions of the feature space.
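A sketch of how the adaptively generated features of eqn (6) might be computed for a given subsample follows; the literal use of BC^(-1) (C assumed nonsingular), the two-class labels 0/1, and the function name are assumptions made for illustration:

```python
import numpy as np

def adaptive_features(X, y, n_new=2):
    """Sketch of eqn (6): project the subsample onto the eigenvectors of
    B C^{-1} associated with the largest eigenvalues, where B and C are
    the between- and within-class scatter matrices of the subsample."""
    grand = X.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))
    C = np.zeros_like(B)
    for c in (0, 1):
        Xc = X[y == c]
        m = Xc.mean(axis=0)
        B += len(Xc) * np.outer(m - grand, m - grand)   # between-class scatter
        C += (Xc - m).T @ (Xc - m)                      # within-class scatter
    vals, vecs = np.linalg.eig(B @ np.linalg.inv(C))    # C assumed nonsingular
    order = np.argsort(vals.real)[::-1][:n_new]
    W = vecs[:, order].real          # columns are the w_i of eqn (6)
    return X @ W                     # adaptively generated features y_i = w_i . x
```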
In several applications, we have found it useful to add the single adaptive feature

    y = v · x ,                                               (7a)

    v = [V1 + V2]^(-1) (x̄1 - x̄2) ,                            (7b)

where x̄i and Vi (i=1,2) are the subsample mean and covariance matrices of the two classes. Here v is the direction associated with the Fisher linear discriminant. Although these generated features may be motivated on the basis of parametric considerations, they will be incorporated into the decision rule only if they are found to be useful by the nonparametric Kolmogorov-Smirnov criterion.

It should be noted that the addition of adaptively generated features can cause the resulting decision rule to no longer be invariant under all strictly monotone transformations of the original features. For those suggested above, however, the rule is invariant to linear transformations.
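The single adaptive feature of eqns (7a) and (7b) is equally direct to compute; this sketch assumes two-class labels 0/1:

```python
import numpy as np

def fisher_feature(X, y):
    """Sketch of eqns (7a)-(7b): the adaptive feature y = v . x with
    v = [V1 + V2]^{-1} (xbar1 - xbar2), the Fisher linear discriminant
    direction for the current subsample."""
    m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    V1 = np.cov(X[y == 0].T)        # subsample covariance matrices
    V2 = np.cov(X[y == 1].T)
    v = np.linalg.solve(V1 + V2, m1 - m2)   # eqn (7b)
    return X @ v                            # eqn (7a): y = v . x
```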
Terminal Criteria

It remains to specify the criterion for establishing a subsample as a terminal cell. The partitioning should clearly terminate if the subsample contains training vectors from only a single class, since further partitioning cannot change any class assignments. One possibility is to make this the sole criterion for termination. This results in all of the training vectors themselves being correctly classified by the decision rule. However, this criterion is best only if it is known in advance that there is no overlap in the feature space between the underlying class probability densities.

When the probability densities overlap, the Bayes rule itself does not correctly classify all of the training vectors, and a decision rule that attempts to correctly classify all of them cannot estimate the Bayes rule as closely as possible in the overlap regions. Thus, the partitioning should terminate in a way that insures that the cardinality of a cell is large enough to provide a reasonable estimate of the density ratio in all terminal cells. A cell is partitioned, therefore, only if there are at least k training vectors in each of the two daughter cells. Here k is a parameter whose best value is problem dependent. It should increase with increasing total sample size N, more slowly than N,

    lim_{N→∞} k(N) = ∞ , while lim_{N→∞} [k(N)/N] = 0 ,

so that the recursive partitioning decision rule of this paper is asymptotically Bayes risk efficient. A method for choosing a particular value of k is described below.
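One possible realization of this cardinality criterion is sketched below; the specific growth rate k(N) = N**alpha is a hypothetical choice, shown only to illustrate the required limiting behavior, and is not the method described in the remainder of this section:

```python
def cut_allowed(n_left, n_right, N, alpha=0.5):
    """Sketch of the terminal criterion above: accept a cut only if each
    daughter cell retains at least k(N) training vectors.  The rate
    k(N) = N**alpha is hypothetical; it satisfies k(N) -> infinity
    while k(N)/N -> 0 as the total sample size N grows."""
    k = max(1, int(round(N ** alpha)))   # grows with N, more slowly than N
    return n_left >= k and n_right >= k
```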