SLAC-PUB-3215
STAN-ORION-002
August 1983
Rev. March 1984
PROJECTION PURSUIT DENSITY ESTIMATION*
JEROME H. FRIEDMAN and WERNER STUETZLE
Statistics Department and Stanford Linear Accelerator Center
Stanford University, Stanford, California 94305

ANNE SCHROEDER
Institut National de Recherche en Informatique et en Automatique
Le Chesnay, France
ABSTRACT
The projection pursuit methodology is applied to the multivariate density estimation problem. The resulting nonparametric procedure is often less biased than kernel and near neighbor methods. In addition, graphical information is produced that can be used to help gain geometric insight into the multivariate data distribution.
Submitted to Journal of the American Statistical Association, Theory and Methods
* Work supported by the Department of Energy under contracts DE-AC03-76SF00515 and DE-AT03-81-ER10843, the Office of Naval Research under contract number ONR N00014-81-K-0340, and the Army Research Office under contract number DAAG29-82-K-0056.
1. Introduction
The formal goal of nonparametric density estimation is to estimate the probability density of a p-dimensional random vector $X \in R^p$ on the basis of i.i.d. observations $x_1, \ldots, x_N$, without making the assumption that the density belongs to a particular parametric family. Often in practice a more important objective is to gain geometric insight into the data distribution in $R^p$.
Nonparametric estimation of univariate probability density functions has been extensively studied. Examples of successful methods are the related techniques of kernel estimates (Parzen, 1962; Rosenblatt, 1971), near neighbor estimates (Loftsgaarden and Quesenberry, 1965), and splines (Boneva, Kendall, and Stefanov, 1971). A good overview is given by Tapia and Thompson (1978). The direct extension of these methods to multivariate settings, however, has not been as successful in practice. This can partly be attributed to their deteriorating statistical performance caused by the so-called "curse of dimensionality" (Bellman, 1961), which requires very large spans (radii of neighborhoods) in order to achieve sufficient counts. The resulting estimates are then highly biased. In addition, these methods do not provide any comprehensible information about the structure of the multivariate point cloud.
Our approach to multivariate density estimation is based on the notion of projection pursuit (Friedman and Tukey, 1974; Friedman and Stuetzle, 1981). It attempts to overcome the curse of dimensionality by extending the classical univariate density estimation methods to higher dimensions in a manner that involves only univariate estimation. As a by-product, graphical information is produced that can be quite helpful in exploring and understanding the multivariate data distribution.
2. Overview
The goal of projection pursuit methods is to estimate multivariate functions by combinations of smooth univariate (ridge) functions of carefully selected linear combinations of the variables.
Our projection pursuit density estimation (PPDE) method constructs estimates of the form
$$p_M(x) = p_0(x) \prod_{m=1}^{M} f_m(\theta_m \cdot x), \qquad (1)$$
where:
- $p_M$ is the density estimate (or current model) after $M$ iterations of the procedure.
- $p_0$ is a given multivariate density function to be used as the initial model.
- $\theta_m$ is a unit vector specifying a direction in $R^p$; thus $\theta_m \cdot x = \sum_{i=1}^{p} \theta_{mi} x_i$ is a linear combination of the original coordinate measurements.
- $f_m$ is a univariate function.
From (1), PPDE is seen to approximate the multivariate density by an initially proposed density $p_0$, multiplied (augmented) by a product of univariate functions $f_m$ of linear combinations $\theta_m \cdot x$ of the coordinates. The choice of the initial density is left to the user and should reflect his best a priori knowledge of the data. A Gaussian density with sample mean and sample covariance matrix is often a natural choice. The purpose of PPDE is to choose the directions $\theta_m$ and construct the corresponding functions $f_m(\theta_m \cdot x)$. The product of these functions estimates the ratio of the data density to the initial model density. From (1) we obtain the recursion relation

$$p_M(x) = p_{M-1}(x)\, f_M(\theta_M \cdot x). \qquad (2)$$

Since $f_M$ is used to modify $p_{M-1}$ to obtain $p_M$, we refer to the $f_m$ as "augmenting functions". The recursive definition of the model (2) suggests a stepwise approach for its construction.
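As a concrete illustration of the model form, here is a minimal Python sketch (the function names and the callable representation of $p_0$ and the $f_m$ are our own, not the paper's) that evaluates (1) by applying the recursion (2) one augmenting function at a time:

```python
import numpy as np

def ppde_density(x, p0, thetas, fs):
    """Evaluate the PPDE model (1): p_M(x) = p_0(x) * prod_m f_m(theta_m . x).

    x      -- a single observation, array of shape (p,)
    p0     -- callable initial multivariate density p_0
    thetas -- list of unit direction vectors theta_m, each of shape (p,)
    fs     -- list of univariate augmenting functions f_m
    """
    value = p0(x)
    for theta, f in zip(thetas, fs):
        # One step of the recursion (2): p_M(x) = p_{M-1}(x) * f_M(theta_M . x)
        value *= f(theta @ x)
    return value
```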
At the M-th iteration, there is a current model $p_{M-1}(x)$ constructed in the previous steps. (For the first step, $M = 1$, the current model is the initial model $p_0(x)$ specified by the user.) Given $p_{M-1}(x)$, we seek a new model $p_M(x)$ to serve as a better approximation to the data density $p(x)$. Thus a direction $\theta_M$ and its corresponding augmenting function $f_M(\theta_M \cdot x)$ are chosen to maximize the goodness-of-fit of $p_M(x)$.
We measure relative goodness-of-fit by the cross-entropy term of the Kullback-Leibler distance

$$W = \int p(x) \log p_M(x)\, dx. \qquad (3)$$
From (2) we see that $W$ achieves its maximum at the same location as

$$W(\theta_M, f_M) = \int p(x) \log f_M(\theta_M \cdot x)\, dx. \qquad (4)$$

This is to be maximized under the constraint that $p_M(x)$ be properly normalized, i.e. $\int p_M(x)\, dx = 1$. For a given direction $\theta_M$ and known $p(x)$,

$$f_M(\theta_M \cdot x) = \frac{p_{\theta_M}(\theta_M \cdot x)}{p^{M-1}_{\theta_M}(\theta_M \cdot x)} \qquad (5)$$

is seen to maximize (4). Here $p_{\theta_M}$ and $p^{M-1}_{\theta_M}$ represent the data and current model marginal densities along the (one-dimensional) subspace spanned by $\theta_M$. Using this $f_M$ for given $\theta_M$, it remains to find the direction $\theta_M$ for which (4) achieves the maximum value. The optimal $\theta_M$ and its corresponding augmenting function $f_M(\theta_M \cdot x)$ define the new model through (2).

In actual applications the data density $p(x)$ is unknown. We have, instead, a sample of $N$ i.i.d. observations $x_1, x_2, \ldots, x_N$ from $p(x)$. The cross-entropy $W$ is estimated by the log-likelihood

$$\hat{W} = \frac{1}{N} \sum_{i=1}^{N} \log p_M(x_i). \qquad (6)$$

Analogously, $W(\theta_M, f_M)$ is estimated by

$$\hat{W}(\theta_M, f_M) = \frac{1}{N} \sum_{i=1}^{N} \log \hat{f}_M(\theta_M \cdot x_i), \qquad (7)$$

where $\hat{f}_M(\theta_M \cdot x)$ is an estimate for the ratio of data and model marginals along $\theta_M$. The optimal value $\hat{\theta}_M$ maximizing $\hat{W}(\theta_M, f_M)$, and thus the log-likelihood $\hat{W}$ of the new model, is determined by numerical optimization.
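A minimal Python sketch of this criterion (the names are ours; the crude random search over unit vectors is only a stand-in for whatever numerical optimizer is actually used, and make_f_hat is a hypothetical routine returning the marginal-ratio estimate discussed in Section 3):

```python
import numpy as np

def w_hat(theta, f_hat, X):
    """Sample criterion (7): mean of log f_hat(theta . x_i) over the data X (N, p).
    f_hat is assumed to accept a vector of projections."""
    return np.mean(np.log(f_hat(X @ theta)))

def best_direction(X, make_f_hat, n_trials=500, seed=0):
    """Search random unit directions for the theta maximizing (7).

    make_f_hat(theta) is a hypothetical routine returning an estimate of
    the marginal ratio (5) along theta.
    """
    rng = np.random.default_rng(seed)
    best_theta, best_w = None, -np.inf
    for _ in range(n_trials):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)  # constrain theta to the unit sphere
        w = w_hat(theta, make_f_hat(theta), X)
        if w > best_w:
            best_theta, best_w = theta, w
    return best_theta, best_w
```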
3. Estimation Procedures
We now discuss the estimation of $f(\theta \cdot x)$, the ratio of data and model marginals along a direction $\theta$. First consider the current model marginal $p^{M-1}_{\theta}(\theta \cdot x)$. Without loss of generality we let $\theta$ be the first coordinate axis, that is, $\theta \cdot x = x_1$. Then
$$p^{M-1}_1(x_1) = \int p_{M-1}(x_1, x_2, \ldots, x_p)\, dx_2 \cdots dx_p. \qquad (8)$$

If $p^{M-1}_1(x_1)$ is continuous, then

$$p^{M-1}_1(x_1) = \lim_{h \to 0} \frac{1}{2h} \int_{x_1-h}^{x_1+h} p^{M-1}_1(z)\, dz \qquad (9)$$

$$= \lim_{h \to 0} \frac{1}{2h} \int_{-\infty}^{\infty} I(x_1 - h \le z \le x_1 + h)\, p^{M-1}_1(z)\, dz, \qquad (10)$$

where $I(\cdot)$ is the indicator function

$$I(A) = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$

From (8), one has

$$p^{M-1}_1(x_1) = \lim_{h \to 0} \frac{1}{2h} \int I(x_1 - h \le y_1 \le x_1 + h)\, p_{M-1}(y)\, dy = \lim_{h \to 0} \frac{1}{2h}\, E_{p_{M-1}}\!\left[ I(x_1 - h \le y_1 \le x_1 + h) \right]. \qquad (12)$$

Our estimate of $p^{M-1}_1(x_1)$ is obtained from (12) by using a small finite value for $h$ and employing a Monte Carlo method to estimate the expected value. A Monte Carlo sample $y_1, y_2, \ldots, y_{N_S}$ of size $N_S$ is generated with density $p_{M-1}(x)$, and

$$\hat{p}^{M-1}_1(x_1) = \frac{1}{2hN_S} \sum_{j=1}^{N_S} I(x_1 - h \le y_{j1} \le x_1 + h) \qquad (13)$$

is taken as our estimate of $p^{M-1}_1(x_1)$. Since the choice of $x_1$ as the direction $\theta$ was arbitrary, (13) can equally well be written

$$\hat{p}^{M-1}_{\theta}(\theta \cdot x) = \frac{1}{2hN_S} \sum_{j=1}^{N_S} I(\theta \cdot x - h \le \theta \cdot y_j \le \theta \cdot x + h) \qquad (14)$$

for any $\theta$. Note that the same Monte Carlo sample can be used for all $\theta$ and $x$.
In Appendix 2, we discuss in detail procedures for generating a Monte Carlo sample from the density $p_{M-1}(x)$. By assumption, the data represent a sample from $p(x)$ that can be used, in analogy with (14), to estimate the data marginal $p_{\theta}(\theta \cdot x)$ by

$$\hat{p}_{\theta}(\theta \cdot x) = \frac{1}{2hN} \sum_{i=1}^{N} I(\theta \cdot x - h \le \theta \cdot x_i \le \theta \cdot x + h). \qquad (15)$$
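A minimal Python sketch of these window estimates (the names and the ratio helper are our own; the window half-width $h$ and the Monte Carlo sample size are user choices):

```python
import numpy as np

def window_density(projections, t, h):
    """Indicator-average estimate, as in (14) and (15): the fraction of
    projected points falling in [t - h, t + h], divided by the window width 2h."""
    return np.mean(np.abs(projections - t) <= h) / (2.0 * h)

def marginal_ratio(theta, x, data, mc_sample, h):
    """f_hat(theta . x): the data marginal (15) over the model marginal (14).

    data      -- (N, p) array, the observed sample from p(x)
    mc_sample -- (N_S, p) array, a Monte Carlo sample from the current model
                 p_{M-1}; the same sample can be reused for every theta and x
    """
    t = theta @ x
    p_data = window_density(data @ theta, t, h)        # data marginal, (15)
    p_model = window_density(mc_sample @ theta, t, h)  # model marginal, (14)
    return p_data / p_model
```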