L2-Density Estimation with Negative Kernels

Nadia Oudjane
EDF R&D/OSIRIS
1 av. du Général de Gaulle, 92140 Clamart cedex, France
[email protected]

Christian Musso
ONERA/DTIM
29 av. de la division Leclerc, 92322 Châtillon cedex, France
christian.musso@onera.fr

Abstract

In this paper, we are interested in density estimation using kernels that can take negative values, also called negative kernels. On the one hand, using negative kernels allows one to reduce the bias of the approximation; on the other hand, the resulting approximation can itself take negative values. To obtain a new approximation which is a probability density, we propose to replace the approximation by its $L^2$-projection on the space of $L^2$-probability densities. A similar approach was proposed in [5], but in this paper we describe how to compute this projection and how to generate random variables from it. This approach can be useful for particle filtering, particularly for the regularization step in Regularized Particle Filters [8] or Kernel Filters [7].

1 Introduction

Let us consider an i.i.d. sample $(X_1, \dots, X_N)$ from an unknown probability density function $f$ on $\mathbb{R}^d$. The aim of density estimation is to approximate the density $f$ by a function $\hat{f}$ which depends only on the sample. The density estimation approach considered here is nonparametric and based on kernel decomposition. More specifically, we are interested in kernels that can take negative values (called negative kernels). Such kernels are useful because they reduce the bias of the density approximation, but at the same time they can produce density approximations that take negative values, which is of course undesirable. To alleviate that drawback, we propose to project the negative-kernel density estimate on the subspace of $L^2$-probability densities. This approach leads to a new density estimation method which provides both bias reduction and positivity of the approximation. A similar approach was proposed in [5], but here the description is more precise and complete. In this paper, we describe how to compute that projection. Moreover, a method for generating random variables according to this $L^2$-approximation is presented. This can be useful in nonlinear filtering based on Monte Carlo methods.


Indeed, the main idea of Regularized Particle Filters or Kernel Filters, introduced in [8] and [7], is to replace the discrete empirical approximation of the optimal filter given by the Interacting Particle Filter [2, 6] with a smooth density obtained from density estimation theory. This approach improves the convergence of Particle Filters but induces a bias in the estimation. The results presented in this paper can help to reduce the bias induced by the regularization step.
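As a reminder of what that regularization step looks like in practice, here is a minimal sketch, not taken from [7, 8], in which each resampled particle is perturbed by a scaled kernel sample; the Gaussian kernel, the bandwidth value, and the function name `regularize` are placeholder choices.

```python
import numpy as np

def regularize(particles, weights, h, rng):
    """Regularization step of a Regularized Particle Filter (sketch):
    resample the particles according to their weights, then add kernel
    noise of bandwidth h, i.e. draw from the kernel density estimate
    built on the weighted particles."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)   # multinomial resampling
    noise = h * rng.standard_normal(n)       # Gaussian kernel used as a placeholder
    return particles[idx] + noise

rng = np.random.default_rng(0)
particles = rng.standard_normal(500)
weights = np.full(500, 1.0 / 500)
new_particles = regularize(particles, weights, h=0.3, rng=rng)
```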

2 Kernel density estimation: a brief review

Kernel density estimators were introduced by Rosenblatt [10]. Initially the kernel was itself a density. In this paper, we focus on negative kernels, which were introduced by Parzen [9] and Bartlett [1]. We briefly recall some classical results of density estimation theory. For simplicity, in this section, we assume that samples are one-dimensional, but the results presented here generalize to the $d$-dimensional case (see [11] and [3]).

Let $s \ge 2$ be a positive integer. A function $K$ defined on $\mathbb{R}$ is called a kernel of order $s$ if
$$\int_{\mathbb{R}} K(x)\,dx = 1\,, \qquad k_s = \int_{\mathbb{R}} |x|^s K(x)\,dx < \infty\,,$$
and
$$\int_{\mathbb{R}} x^i K(x)\,dx = 0\,, \quad \text{for all } i = 1, \dots, s-1\,.$$
A kernel of order $s \ge 3$ necessarily takes negative values. In view of the symmetry condition, we will only consider kernels of even order. Let $K$ be a kernel of order $s \ge 2$ and let $h$ be a positive real; the function $K_h$ defined by
$$K_h(x) = \frac{1}{h}\,K\!\left(\frac{x}{h}\right), \quad \text{for all } x \in \mathbb{R}\,, \qquad (1)$$
will be called the scaled kernel associated to $K$ with the bandwidth $h$.
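To make the definition concrete, the following sketch, not from the paper, numerically checks the order conditions for one classical example of a negative kernel, the fourth-order Gaussian-based kernel $K_4(x) = \tfrac{1}{2}(3 - x^2)\,\varphi(x)$, where $\varphi$ is the standard normal density.

```python
import numpy as np

def k4(x):
    """A classical fourth-order kernel built from the Gaussian density.
    It integrates to 1, its moments of order 1 to 3 vanish, and it is
    negative for |x| > sqrt(3): a 'negative kernel' of order s = 4."""
    phi = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - x**2) * phi

# Numerical check of the order conditions on a fine grid (rectangle rule).
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
for i in range(5):
    moment = (x**i * k4(x)).sum() * dx
    print(f"moment {i}: {moment:.6f}")
# Expected: ~1 for i = 0, ~0 for i = 1, 2, 3, and a finite nonzero value for i = 4.
```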

2.1 Error estimation

Proposition 2.1. Let $K$ be a kernel of order $s \ge 2$ and let $f$ be a probability density defined on $\mathbb{R}$ having derivatives well defined and continuous up to order $s+1$. Let $(X_1, \dots, X_N)$ be an i.i.d. sample from $f$, and consider the function $\hat{f}$ defined, for all $x \in \mathbb{R}$, by
$$\hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - X_i}{h}\right) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - X_i)\,. \qquad (2)$$
Then the following approximation for the mean integrated square error (MISE) holds,
$$\mathbb{E}\,\|\hat{f} - f\|_2^2 = \frac{k_s^2\, h^{2s}}{(s!)^2} \int_{\mathbb{R}} \bigl(f^{(s)}(x)\bigr)^2\,dx + \frac{\int_{\mathbb{R}} K^2(x)\,dx}{Nh} + O\!\left(\frac{1}{N}\right) + o\bigl(h^{2s}\bigr)\,, \qquad (3)$$
where the expectation is taken with respect to $(X_1, \dots, X_N)$.

A proof of this expansion can be found in the monograph of Silverman [11]. For the treatment of the error in the $L^1$ sense, see the monographs of Devroye and Györfi [4] and Devroye [3]. Notice that the bias term can be reduced by increasing the order $s$. That is precisely the reason why we are interested in negative kernels.
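To make the estimator (2) concrete, here is a minimal sketch, not from the paper, that evaluates $\hat{f}$ on a grid for a one-dimensional sample, using the illustrative fourth-order kernel of the previous sketch; the sample, grid, and bandwidth are arbitrary choices.

```python
import numpy as np

def k4(x):
    """Fourth-order Gaussian-based kernel (same illustrative kernel as above)."""
    return 0.5 * (3.0 - x**2) * np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, sample, h, kernel=k4):
    """Kernel density estimate (2): (1 / (N h)) * sum_i K((x - X_i) / h)."""
    u = (x_grid[:, None] - sample[None, :]) / h   # shape (n_grid, N)
    return kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)        # i.i.d. sample X_1, ..., X_N
x_grid = np.linspace(-4.0, 4.0, 401)
f_hat = kde(x_grid, sample, h=0.45)       # h = 0.45 is an arbitrary choice
print("min of f_hat:", f_hat.min())       # typically < 0 with a negative kernel
```

The negative minimum illustrates the drawback discussed in Section 3: the bias-reducing kernel sacrifices positivity of the estimate.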

3 L2 -projection of functions on the subspace of probability densities In order to reduce the bias of the kernel approximation fˆ (see (3)), one can use negative kernels with order s ≥ 3. But, in that case, the appoximation fˆ can take negative values which is of course undesirable. We propose here an optimal (in some sense) solution to that problem. In the sequel, results will be stated in the d-dimensional case. More generally, if we are interested in approximating by a probability density, a functionR f , which can possibly take negative values and such that f = 1, one possibility, if f ∈ L1 , is to use the following approximation, f + (x) , f + (x)dx Rd

The optimal choice of the bandwith depends on the unknown underlying density f . Therefore, one way to chose the optimal bandwidth associated to the Epanechnikov kernel, hEpan , is to assume, for instance, that the underlying density f is a centered standard Gaussian density. In this case, the optimal bandwidth, in the MISE sense, is easily computed and is given by hEpan

√ 1 1 = [40 π] 5 N − 5 .

(5)

If the underlying density f is non standard (standard deviation different from one), hEpan is multiplied by the standard deviation of f . To generate a random variable Z according to the Epanechnikov kernel (5), one can use the following algorithm.

(6)

kf1 − gk1 ≤ kf − gk1 ,

(7)

where (a)+ = max(a, 0), for any a ∈ R. Notice that the function f1 has the following property,

2.2 Kernel and bandwidth selection Classically, the kernel, K, and bandwidth, h, are chosen so as to minimize the mean integrated error Ekfˆ − f k1 , or the mean integrated square error (MISE), Ekfˆ − f k22 . For positive kernels (s = 2), the density estimation theory, see [11, 3], provides the optimal choice for the kernel, in the MISE sense. The general expression of this optimal kernel is the Epanechnikov kernel, which is given by the following expression in the one dimensional case, ( 3 (1 − |x|2 ) if |x| < 1 (4) KEpan (x) = 4 0 elsewhere,
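One possible implementation of (5) and of Algorithm 2.2 is sketched below; it is not the authors' code. A direct computation shows that, for $Z = \sqrt{\beta}\,T$ to follow the Epanechnikov density $\tfrac{3}{4}(1 - z^2)$ on $[-1, 1]$, the variable $\beta$ must have density proportional to $\beta^{-1/2}(1 - \beta)$ on $[0, 1]$, i.e. Beta(1/2, 2) in NumPy's $(a, b)$ shape-parameter convention; the sketch uses that parametrization, and the parameters stated in the algorithm above should be read in the authors' own convention.

```python
import numpy as np

def h_epan(n):
    """Reference bandwidth (5) for the Epanechnikov kernel under a standard
    Gaussian assumption on f; multiply by the standard deviation of f when
    f is not standard."""
    return (40.0 * np.sqrt(np.pi)) ** 0.2 * n ** (-0.2)

def epanechnikov_sample(n, rng=None):
    """Algorithm 2.2: Z = sqrt(beta) * T with T uniform on {-1, +1}.
    In numpy's (a, b) convention, beta ~ Beta(1/2, 2) gives Z the
    Epanechnikov density 3/4 * (1 - z^2) on [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    beta = rng.beta(0.5, 2.0, size=n)
    t = rng.choice([-1.0, 1.0], size=n)
    return np.sqrt(beta) * t

rng = np.random.default_rng(1)
z = epanechnikov_sample(100_000, rng)
print("bandwidth for N = 1000:", h_epan(1000))
print("sample variance of Z (theory: 1/5):", z.var())
```

The variance check against the theoretical value $1/5$ is a quick way to validate the generator.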

3 $L^2$-projection of functions on the subspace of probability densities

In order to reduce the bias of the kernel approximation $\hat{f}$ (see (3)), one can use negative kernels of order $s \ge 3$. But in that case the approximation $\hat{f}$ can take negative values, which is of course undesirable. We propose here an optimal (in some sense) solution to that problem. In the sequel, results are stated in the $d$-dimensional case.

More generally, suppose we want to approximate by a probability density a function $f$ which can possibly take negative values and such that $\int f = 1$. One possibility, if $f \in L^1$, is to use the following approximation,
$$f_1(x) = \frac{f^+(x)}{\int_{\mathbb{R}^d} f^+(x)\,dx}\,, \qquad (6)$$
where $(a)^+ = \max(a, 0)$ for any $a \in \mathbb{R}$. Notice that the function $f_1$ has the following property,
$$\|f_1 - g\|_1 \le \|f - g\|_1\,, \qquad (7)$$
for any probability density $g$; see Theorem 3, p. 269, in [4] for a proof. This property does not mean that $f_1$ is the "projection" of $f$ on the space of densities ($L^1$ is not a Hilbert space). It means that replacing $f$ by $f_1$ can only improve the estimation of a probability density. Hence $f_1$ is a good approximation of $f$, but it is not optimal. An alternative, if $f$ is in $L^2$, is to build the projection of $f$ on the subspace of $L^2$-probability densities. The next proposition gives a characterization of this optimal approximation.

Proposition 3.1. Let $f$ be a function defined on $\mathbb{R}^d$ with values in $\mathbb{R}$, such that
$$\int_{\mathbb{R}^d} f(x)\,dx = 1\,, \quad \text{and} \quad \int_{\mathbb{R}^d} f^2(x)\,dx < \infty\,.$$
Then the projection $f_2$ of $f$ on the (convex) subspace of $L^2$-probability densities is determined by
$$f_2(x) = \bigl(f(x) - \alpha^*\bigr)^+\,, \quad \text{for all } x \in \mathbb{R}^d\,, \qquad (8)$$
where $\alpha^*$ is the positive real such that
$$\int_{\mathbb{R}^d} \bigl(f(x) - \alpha^*\bigr)^+\,dx = 1\,. \qquad (9)$$
In consequence, $f_2$ defined by (8) and (9) satisfies a property similar to (7) in the $L^2$ sense, i.e.
$$\|f_2 - g\|_2 \le \|f - g\|_2\,, \qquad (10)$$
for any $L^2$-probability density $g$.

Proof of Proposition 3.1. The proof is based on the characterization of the projection $X^*$ of a point $Y$ on a convex set $S$ in a Hilbert space: $X^*$ is the projection of $Y$ on $S$ if and only if
$$\langle Y - X^*,\, X - X^* \rangle \le 0 \quad \text{for all } X \in S\,.$$
In our case, $S$ is the subspace of $L^2$-probability densities (one easily checks that $S$ is convex). Then it suffices to prove that the quantity
$$\Delta = \int_{\mathbb{R}^d} \bigl[f(x) - (f(x) - \alpha^*)^+\bigr]\, \bigl[g(x) - (f(x) - \alpha^*)^+\bigr]\,dx$$
is nonpositive for any probability density $g \in L^2$, where $\alpha^*$ is defined in (9). Let us introduce the subsets of $\mathbb{R}^d$
$$A^* = \{x \mid f(x) \ge \alpha^*\} \quad \text{and} \quad \bar{A}^* = \{x \mid f(x) < \alpha^*\}\,.$$
We have
$$\begin{aligned}
\Delta &= \int_{A^*} \bigl[f(x) - f(x) + \alpha^*\bigr]\,\bigl[g(x) - f(x) + \alpha^*\bigr]\,dx + \int_{\bar{A}^*} f(x)\,g(x)\,dx \\
&= \alpha^* \left[\int_{A^*} g(x)\,dx - \int_{A^*} \bigl(f(x) - \alpha^*\bigr)^+\,dx\right] + \int_{\bar{A}^*} f(x)\,g(x)\,dx \\
&= \alpha^* \left[\int_{A^*} g(x)\,dx - 1\right] + \int_{\bar{A}^*} f(x)\,g(x)\,dx \\
&= -\alpha^* \int_{\bar{A}^*} g(x)\,dx + \int_{\bar{A}^*} f(x)\,g(x)\,dx \\
&= \int_{\bar{A}^*} g(x)\,\bigl(f(x) - \alpha^*\bigr)\,dx\,.
\end{aligned}$$
This quantity is nonpositive since $g$ is a positive function and $f(x) < \alpha^*$ on $\bar{A}^*$, which ends the first part of the proof. Property (10) is a direct consequence of the projection property.

Figure 1. Illustration of the characterization of $\alpha^*$ in (9): the threshold is chosen so that $\int_{\mathbb{R}^d} (f(x) - \alpha^*)^+\,dx = 1$.
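The projection (8)-(9) is straightforward to compute numerically once $f$ (for instance a negative-kernel estimate $\hat{f}$) has been evaluated on a grid: the map $\alpha \mapsto \int (f - \alpha)^+\,dx$ is continuous and non-increasing, so $\alpha^*$ can be found by bisection. The sketch below is one possible one-dimensional implementation, not the authors' code; the grid, the test function, and the bisection tolerance are arbitrary choices.

```python
import numpy as np

def project_l2(f_vals, dx, tol=1e-10):
    """L2-projection (8)-(9) on a grid: return (f - alpha*)^+ and alpha*,
    with alpha* chosen by bisection so that the result integrates to 1
    (rectangle-rule approximation of the integral)."""
    def mass(alpha):
        return np.clip(f_vals - alpha, 0.0, None).sum() * dx

    lo, hi = 0.0, f_vals.max()      # mass(0) >= 1 since int f = 1; mass(max f) = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return np.clip(f_vals - alpha, 0.0, None), alpha

def project_l1(f_vals, dx):
    """L1-style normalization (6): clip negative values and renormalize."""
    f_plus = np.clip(f_vals, 0.0, None)
    return f_plus / (f_plus.sum() * dx)

# Example with a toy signed function whose integral is approximately 1.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
f_vals = 1.05 * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi) - 0.005
f2, alpha_star = project_l2(f_vals, dx)
print("alpha*:", alpha_star, " integral of f2:", f2.sum() * dx)
```

In higher dimensions the same bisection applies, with the one-dimensional sum replaced by a sum over the $d$-dimensional grid.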

Proposition 3.2. Let $f$ be a function defined on $\mathbb{R}^d$ with values in $\mathbb{R}$, such that
$$\int_{\mathbb{R}^d} f(x)\,dx = 1\,, \qquad \int_{\mathbb{R}^d} f^2(x)\,dx < \infty\,.$$
Then the $L^2$-projection of $f$ on the subspace of $L^2$-probability densities, denoted by $f_2$ (given by (8) and (9)), is as close to the function $f$ as the "$L^1$-projection" $f_1$ given by (6), in the $L^1$ sense, i.e.
$$\|f - f_1\|_1 = \|f - f_2\|_1\,.$$

Proof of Proposition 3.2. Let
$$E_1 = \|f - f_1\|_1\,, \quad \text{and} \quad E_2 = \|f - f_2\|_1\,.$$
Then
$$E_1 - E_2 = \left\| f - \frac{f^+}{\int f^+} \right\|_1 - \bigl\| f - (f - \alpha^*)^+ \bigr\|_1 = 2 \int \left[ \Bigl(f - \frac{f^+}{\int f^+}\Bigr)^{\!+} - \bigl(f - (f - \alpha^*)^+\bigr)^{+} \right] dx\,.$$
Then notice that
$$\int \Bigl(f - \frac{f^+}{\int f^+}\Bigr)^{\!+} dx = \int_{f(x) \ge 0} \Bigl(f - \frac{f^+}{\int f^+}\Bigr)^{\!+} dx + \int_{f(x) < 0} \Bigl(f - \frac{f^+}{\int f^+}\Bigr)^{\!+} dx\,,$$
where the second integral vanishes, since $f^+ = 0$ and $f < 0$ on that set.
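To illustrate Propositions 3.1 and 3.2 numerically (this check is not in the paper), the sketch below compares $\|f - f_1\|_1$ with $\|f - f_2\|_1$ and the corresponding $L^2$ distances on a grid; the signed test function is an arbitrary choice with integral approximately one.

```python
import numpy as np

def alpha_star(f_vals, dx, tol=1e-12):
    """Bisection for alpha* in (9) on a grid."""
    lo, hi = 0.0, f_vals.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.clip(f_vals - mid, 0.0, None).sum() * dx > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = np.linspace(-5.0, 5.0, 4001)
dx = x[1] - x[0]
f = 1.05 * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi) - 0.005   # signed, integral ~ 1

f_plus = np.clip(f, 0.0, None)
f1 = f_plus / (f_plus.sum() * dx)                  # L1-normalization (6)
f2 = np.clip(f - alpha_star(f, dx), 0.0, None)     # L2-projection (8)-(9)

print("||f - f1||_1 =", np.abs(f - f1).sum() * dx)     # Prop. 3.2: these two L1
print("||f - f2||_1 =", np.abs(f - f2).sum() * dx)     # distances agree up to grid error
print("||f - f1||_2 =", np.sqrt(((f - f1) ** 2).sum() * dx))
print("||f - f2||_2 =", np.sqrt(((f - f2) ** 2).sum() * dx))  # smallest, by (10)
```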
