RAPID COMMUNICATION

Importance Sampling: how to approach the optimal density?

Jérôme Morio

Office National d'Études et Recherches Aérospatiales, The French Aerospace Lab, Long-term Design and System Integration Department (ONERA-DPRS-SSD), BP 72, 29 avenue de la Division Leclerc, FR-92322 Châtillon CEDEX, tel: +33 1 46 73 49 09, fax: +33 1 46 73 41 41

E-mail: [email protected]

Abstract. Importance sampling (IS) is a well-known random simulation technique used to estimate rare-event probabilities. It is designed to reduce the variance of Monte-Carlo estimators for a given sample size. IS consists in generating random weighted samples from an auxiliary distribution rather than from the distribution of interest. The crucial part of this algorithm is the choice of an efficient auxiliary PDF, which must be able to generate the rare random events more often. In this article, we analyse how to approach the IS optimal auxiliary density in a simple way with non-parametric importance sampling (NIS). This article is intended for graduate students and general physicists.


Importance Sampling (IS) is a random sampling technique that has been much discussed and compared in practice with Monte-Carlo methods [1, 2, 3]. Nevertheless, the choice of the auxiliary probability density function (PDF) is not always analysed [4, 5], although it is a major source of misestimation. We first review the motivation of the IS technique and show how it can reduce the estimation variance. The best possible variance for an estimate is zero, which leads us to define the IS optimal auxiliary PDF, that is, the PDF that minimizes the variance of the IS estimate. We then focus on a non-parametric approximation of this optimal auxiliary PDF based on Gaussian kernels, called non-parametric importance sampling (NIS). This approach was first presented in [4] and developed more recently in [6], but it is not well known in the engineering community despite its accuracy. We review the fundamentals of this technique, and in the last part of this letter we present some performance results on a simple case together with a comparison with other auxiliary PDFs. The calculations should be practicable for students comfortable with statistics and probability.

1. Importance Sampling Motivation

In this section, we review the fundamentals of the IS technique in a variance-oriented approach, contrary to [1, 2]. We consider a general case, which happens very often in physics, where we want to estimate the following d-dimensional integral:

\[ I_\phi = \int_{\mathbb{R}^d} \phi(x)\, f(x)\, dx \tag{1} \]

where f : ℝ^d → ℝ is the PDF of a random variable X and φ : ℝ^d → ℝ is a positive integrable function, without any loss of generality. Indeed, if φ takes negative values, it can be written as a difference of two positive functions φ = φ⁺ − φ⁻ with φ⁺ = max(φ, 0) and φ⁻ = max(−φ, 0). Let us consider the simple case where φ = 1_A with A a subset of ℝ^d. The term φ(X) is equal to 1 when X ∈ A and is null otherwise. The integral I_φ is thus the probability that X ∈ A with X distributed according to the PDF f.

One relevant way to estimate I_φ is to use Monte-Carlo techniques [7, 8]. The Monte-Carlo estimator of the integral I_φ is simply given by the statistical mean

\[ I_\phi^{MC} = \frac{1}{N} \sum_{i=1}^{N} \phi(X_i) \tag{2} \]

where X_1, ..., X_N are random samples generated according to the PDF f. Let us analyse the accuracy of the estimate I_φ^MC. The law of large numbers guarantees that I_φ^MC converges to I_φ for very large N, and the central limit theorem shows that the estimate displays a 1/√N convergence: quadrupling the number of sampled points halves the error.
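As an illustration (this listing is ours, not part of the original article, and assumes Python with NumPy), here is a minimal sketch of the crude Monte-Carlo estimator (2) for the rare-event case φ = 1_A with A = [5, +∞[ and f = N(0, 1), the example used later in section 5:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(n):
    """Crude Monte-Carlo estimate (equation (2)) of I_phi = P(X >= 5)
    with X ~ N(0, 1), i.e. phi = 1_A and A = [5, +inf)."""
    x = rng.standard_normal(n)   # N samples drawn from the PDF f
    return np.mean(x >= 5.0)     # statistical mean of phi(X_i)

# The true probability is about 2.87e-7, so even a million samples
# will almost always return 0: the rare event is simply never drawn.
print(mc_estimate(1_000_000))
```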


In statistics, the accuracy of an estimate is characterized by its variance. It can be shown that the variance of I_φ^MC depends on N and on the theoretical probability I_φ:

\[ \mathrm{Var}\left(I_\phi^{MC}\right) = \frac{\mathrm{Var}\left(\phi(X)\right)}{N} \tag{3} \]

Because of the definition of the variance, it can be developed in the following way:

\[ \mathrm{Var}\left(I_\phi^{MC}\right) = \frac{1}{N} \left( \int_{\mathbb{R}^d} \phi(x)^2 f(x)\, dx - I_\phi^2 \right) \tag{4} \]

Let us consider the simple case where φ = 1_A. We can then notice that φ(x)² = φ(x) whatever the value of x. Thus, one has

\[ \int_{\mathbb{R}^d} \phi(x)^2 f(x)\, dx = \int_{\mathbb{R}^d} \phi(x) f(x)\, dx = I_\phi \tag{5} \]

The variance of I_φ^MC is thus only a function of N and I_φ:

\[ \mathrm{Var}\left(I_\phi^{MC}\right) = \frac{1}{N} \left( I_\phi - I_\phi^2 \right) \tag{6} \]

The standard deviation σ_{I_φ^MC} of I_φ^MC is defined from the variance:

\[ \sigma_{I_\phi^{MC}} = \sqrt{\mathrm{Var}\left(I_\phi^{MC}\right)} = \frac{1}{\sqrt{N}} \sqrt{I_\phi - I_\phi^2} \tag{7} \]

The 1/√N convergence of I_φ^MC is thus verified. Nevertheless, for rare-event probability estimation, that is, when I_φ is very small, the standard deviation of I_φ^MC tends to √(I_φ/N). In that case, unless the sample size N takes very large values, √(I_φ/N) is much larger than I_φ. The relative error of the estimate I_φ^MC is given by the ratio σ_{I_φ^MC} / I_φ^MC. When I_φ tends to zero, one has

\[ \lim_{I_\phi \to 0} \frac{\sigma_{I_\phi^{MC}}}{I_\phi^{MC}} = \lim_{I_\phi \to 0} \frac{1}{\sqrt{N I_\phi}} = +\infty \tag{8} \]

For instance, with I_φ ≈ 10⁻⁷, reaching a 10% relative error would require N ≈ 10⁹ samples. I_φ^MC therefore cannot estimate I_φ with a good accuracy in the rare-event setting: Monte-Carlo techniques are clearly not adapted to estimating such low probabilities.

2. Importance Sampling fundamentals

The objective of IS is thus to reduce the estimation variance of I_φ^MC without increasing N. The idea is to generate the samples X_1, ..., X_N from an auxiliary PDF h and then estimate I_φ in the following way:

\[ I_\phi^{IS} = \frac{1}{N} \sum_{i=1}^{N} \phi(X_i)\, \frac{f(X_i)}{h(X_i)} \tag{9} \]

The term I_φ^IS is an unbiased estimator of I_φ since

\[ E\left(I_\phi^{IS}\right) = \frac{1}{N} \sum_{i=1}^{N} \int_{\mathbb{R}^d} \phi(x_i)\, \frac{f(x_i)}{h(x_i)}\, h(x_i)\, dx_i = \int_{\mathbb{R}^d} \phi(x) f(x)\, dx = I_\phi \tag{10} \]

with E the expected value operator. The variance of I_φ^IS is given by

\[ \mathrm{Var}\left(I_\phi^{IS}\right) = \frac{\mathrm{Var}\left(\phi(X)\, w(X)\right)}{N} \tag{11} \]

with w(X) = f(X)/h(X). One can then obtain

\[ \mathrm{Var}\left(I_\phi^{IS}\right) = \frac{1}{N} \left( \int_{\mathbb{R}^d} \phi(x)^2 w(x)^2 h(x)\, dx - I_\phi^2 \right) = \frac{1}{N} \left( E\left(\phi(X)^2 w(X)^2\right) - I_\phi^2 \right) \tag{12} \]
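To make this concrete, here is a short sketch of the IS estimator (9), again for P(X ≥ 5) with X ~ N(0, 1) (the listing is ours and assumes Python with NumPy and SciPy; the auxiliary density h = N(5, 1), a Gaussian recentred on the rare region, is an illustrative choice only, not the optimal one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def is_estimate(n):
    """Importance-sampling estimate (equation (9)) of P(X >= 5), X ~ N(0, 1)."""
    f = stats.norm(0.0, 1.0)             # density of interest
    h = stats.norm(5.0, 1.0)             # auxiliary density (illustrative choice)
    x = h.rvs(size=n, random_state=rng)  # samples drawn from h, not from f
    w = f.pdf(x) / h.pdf(x)              # likelihood ratios f(X_i)/h(X_i)
    return np.mean((x >= 5.0) * w)       # weighted mean of phi(X_i) w(X_i)

# A thousand samples already give a usable estimate of the 2.87e-7 probability,
# where crude Monte-Carlo with the same budget would almost surely return 0.
print(is_estimate(1000))
```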

The variance of the IS estimate thus depends notably on the choice of h. If h is well chosen, the IS estimate variance can be very low. Conversely, if h is not adapted to the estimation, the IS estimate can have a higher variance than the Monte-Carlo estimate. In this article, we propose to use an approximation of the IS optimal auxiliary density in order to choose the PDF h.

3. IS optimal auxiliary density

Equation (12) describes the variance of the IS technique, and the objective of IS is to minimize this variance. The IS optimal auxiliary density h_opt is the auxiliary PDF that minimizes Var(I_φ^IS). Since variances are non-negative quantities, the minimum possible variance is zero, and h_opt can be determined by setting the variance in equation (12) to zero:

\[ \frac{1}{N} \left( E\left(\phi(X)^2\, \frac{f(X)^2}{h_{opt}(X)^2}\right) - I_\phi^2 \right) = 0 \tag{13} \]

We then have the following expression:

\[ E\left(\phi(X)^2\, \frac{f(X)^2}{h_{opt}(X)^2}\right) = I_\phi^2 \tag{14} \]

and thus

\[ E\left(\frac{\phi(X)^2 f(X)^2}{I_\phi^2\, h_{opt}(X)^2}\right) = 1 \tag{15} \]

Hence Z = φ(X) f(X) / (I_φ h_opt(X)) is a random variable with a second moment equal to unity. By definition, importance sampling is an unbiased estimation of I_φ and one has

\[ E\left(\phi(X)\, \frac{f(X)}{h_{opt}(X)}\right) = I_\phi \tag{16} \]

and consequently

\[ E\left(\frac{\phi(X)}{I_\phi}\, \frac{f(X)}{h_{opt}(X)}\right) = E(Z) = 1 \tag{17} \]


Since the first two moments of Z are equal to 1, its variance E(Z²) − E(Z)² is zero, so Z is almost surely equal to the constant 1. This determines an expression of h_opt:

\[ h_{opt}(x) = \frac{\phi(x)\, f(x)}{I_\phi} \tag{18} \]
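One can check directly that this density satisfies equation (14); the following verification is not spelled out in the text but follows immediately from equations (12) and (18):

\[ E\left(\phi(X)^2\, \frac{f(X)^2}{h_{opt}(X)^2}\right) = \int_{\mathbb{R}^d} \frac{\phi(x)^2 f(x)^2}{h_{opt}(x)^2}\, h_{opt}(x)\, dx = I_\phi \int_{\mathbb{R}^d} \phi(x) f(x)\, dx = I_\phi^2 \]

since φ(x)² f(x)² / h_opt(x) = I_φ φ(x) f(x). The variance (12) is then exactly (I_φ² − I_φ²)/N = 0.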

The auxiliary PDF h_opt unfortunately depends on I_φ, which is the unknown integral that we are trying to estimate; the PDF h_opt is thus unusable in practice as it stands. Let us examine h_opt in the simple case where φ = 1_A:

\[ h_{opt}(x) = \frac{1_A(x)\, f(x)}{I_\phi} = \begin{cases} f(x)/I_\phi & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases} \tag{19} \]

A single sample generated with this density h_opt would be sufficient to obtain a zero-variance estimate of I_φ. Our objective is thus to determine a density that approaches the optimal density h_opt and also allows an easy generation of random numbers. For that purpose, we propose to use Gaussian kernels K_d [9] to determine an estimated IS optimal density ĥ_opt in the following section.

4. Non-parametric importance sampling (NIS)

The first step of the NIS technique is to approximate the IS optimal auxiliary density. The principle is to apply the IS algorithm with an initial auxiliary PDF g_0 that is not optimal. Consequently, we generate X_1, ..., X_N, N random samples from the density g_0. The IS estimate is given by

\[ I_\phi^{IS} = \frac{1}{N} \sum_{i=1}^{N} \frac{\phi(X_i)\, f(X_i)}{g_0(X_i)} \]

From equation (18), the IS optimal auxiliary density h_opt is defined by

\[ h_{opt}(x) = \frac{\phi(x)\, f(x)}{I_\phi} \tag{20} \]

The term Iφ is not kown but the IS estimate IφIS is nevertheless available. One can thus ˆ opt of IS optimal auxiliary density hopt with define an approximate PDF h ˆ opt (x) = φ(x)f (x) + δN h IφIS + δN

(21)

The term δn is a positive constant satisfying δN → 0 when N becomes large. Using δn is unnecessary and in practice, this constant δn is fixed to 0. Technically, however, this helps to prevent the denominator from being equal to zero. Generating samples that ˆ opt is not possible in the general case in a simple way because of the follow the PDF h term φ(x)f (x). It is thus necessary to approximate it with a PDF fˆφ(x)f (x) determined with an histogram or with weighted Gaussian kernels that enables an easy generation of samples. We choose to use in this article Gaussian kernel density estimator since it is a very well-known tool to estimate a PDF [7]. To determine a good estimate of φ(x)f (x) with weighted Gaussian kernels , we can notably use the random samples X1 , ...XN and

Importance Sampling : how to approach the optimal density?

6

add them a weight so that the weighted distribution follows φ(X)f (X). The weight (x) of each sample is thus w(x) = φ(x)f . Indeed, generating samples with the PDF g0 g0 (x) with the associated weight w is statistically equivalent to generating samples directly from φf since g0 w = φf . By definition of the Gaussian kernel density estimator, each random sample Xi is then the center of a Gaussian kernel with a weight w(xi ). The density fˆφ(x)f (x) has thus the following expression   N X 1 x − x i fˆφ(x)f (x) = w(xi )Kd (22) Nbd i=1 b where Kd is standard d-dimensional Gaussian function with zero mean and a covariance matrix equal to identity. The parameter b is a smoothing parameter called the bandwidth. An expression of the optimal bandwidth is available in [4] but cannot be derived in practice. Several rules have been proposed to estimate an efficient bandwidth in [7] but they are not in the scope of this article. A Matlab code of weighted kernel density estimator is available at http://www.ics.uci.edu/ ihler/code/. ˆ opt that is very close to the IS optimal density. Here We have thus determined a PDF h is its final expression : ˆ ˆ opt (x) = fφ(x)f (x)+δN (23) h IφIS + δN ˆ opt and estimate the integral Let us generate random samples X1∗ , ...XN∗ from the PDF h Iφ with Importance Sampling IφN IS

N 1 X f (Xi∗ ) = φ(Xi∗ ) ˆ opt (X ∗ ) N i=1 h i

(24)
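The whole NIS procedure fits in a short program. The following one-dimensional sketch is ours, not the article's code: it assumes Python with NumPy and SciPy, uses a self-normalised version of equations (22)-(23) (the weights are divided by their sum, which plays the role of 1/I_φ^IS with δ_N = 0), and picks a Silverman-type rule-of-thumb bandwidth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def nis_estimate(n, g0_sampler, g0_pdf, s=5.0):
    """Sketch of NIS for P(X >= s) with X ~ N(0, 1). Names are ours."""
    f = stats.norm(0.0, 1.0)
    phi = lambda x: (x >= s).astype(float)

    # Step 1: plain IS from the initial density g0 (unnumbered estimate above).
    x = g0_sampler(n)
    w = phi(x) * f.pdf(x) / g0_pdf(x)      # weights w(x_i) = phi*f/g0
    if not np.any(w > 0):
        return 0.0                         # g0 never reached the rare set
    xs, ws = x[w > 0], w[w > 0]
    ws = ws / ws.sum()                     # self-normalisation (replaces 1/I_IS)

    # Step 2: weighted Gaussian-kernel model of phi*f (equation (22)),
    # with a Silverman-type rule-of-thumb bandwidth (our choice).
    b = 1.06 * np.sqrt(np.cov(xs, aweights=ws)) * xs.size ** (-0.2)

    # Step 3: sample X* from h_opt-hat (equation (23)): draw a kernel centre
    # with probability ws_i, then add N(0, b^2) noise.
    xstar = rng.choice(xs, size=n, p=ws) + b * rng.standard_normal(n)
    h_hat = stats.norm.pdf((xstar[:, None] - xs[None, :]) / b) @ ws / b

    # Step 4: final IS estimate with h_opt-hat (equation (24)).
    return np.mean(phi(xstar) * f.pdf(xstar) / h_hat)

# g0 = U[5, 10] as in section 5; the result should be close to 2.87e-7.
print(nis_estimate(3000,
                   lambda n: rng.uniform(5.0, 10.0, n),
                   lambda x: stats.uniform(5.0, 5.0).pdf(x)))
```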

The variance of I_φ^NIS is greatly reduced, for a fixed sample size, because the PDF ĥ_opt is derived from an approximation of the IS optimal auxiliary density.

5. Application

We have presented the theoretical part of the determination of ĥ_opt. Let us now consider a simple case where f is a Gaussian PDF with zero mean and a variance equal to 1, N(0, 1), and φ = 1_{[5,+∞[}. Figure 1 shows the theoretical optimal IS density h_opt and the estimated optimal IS density ĥ_opt obtained with N = 3000 samples and g_0 = U[5, 10], the uniform law on [5, 10]. The bandwidth b is determined by the "rule of thumb" given in [7] and is equal to 0.05. The term δ_N is null since N is large. The PDF ĥ_opt estimates the IS optimal auxiliary PDF with a good accuracy.

We can then generate new samples with the PDF ĥ_opt and apply the IS technique. Table 1 shows the estimation of I_{1_{[5,+∞[}} with IS and different auxiliary PDFs h. The theoretical value of I_{1_{[5,+∞[}} is equal to 2.8665 × 10⁻⁷. Monte-Carlo is not able to estimate the probability, which confirms that IS is a valuable method to estimate rare events. Indeed, for all the chosen auxiliary PDFs, the probability is well estimated in mean. Nevertheless, the relative error


is a major parameter that characterizes the quality of a rare-event estimation, and in this case the best results are obtained with the auxiliary PDF ĥ_opt, which seriously decreases the estimation relative error. Optimizing the IS auxiliary density is thus very important to obtain accurate results.

A valuable PDF g_0 is also necessary to obtain an accurate estimate ĥ_opt, as shown in table 2 and figure 2. Indeed, if g_0 is not able to generate some of the rare events, as for instance with g_0 = N(0, 1) or g_0 = U[0, 5], or if g_0 is not able to generate the rare events with high weight w_i, as for instance with g_0 = U[6, 11] (the samples with significant weight lie in the interval [5, 6], which U[6, 11] cannot reach), then the NIS estimation is not possible. Otherwise, the NIS estimation can be very accurate and does not depend too much on g_0.

Importance Sampling is used in many physical applications, and the crucial part of the algorithm is the choice of the auxiliary PDF h; this choice has a great impact on the quality of the IS estimation. We have shown that it is of interest to optimize the IS auxiliary distribution, and that one possible non-parametric approach is to use Gaussian kernels to approach the optimal IS auxiliary density.


Figure 1. Theoretical optimal IS density h_opt (in red) and the estimated optimal IS density ĥ_opt with Gaussian kernel (in blue), obtained with g_0 = U[5, 10].
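Readers who wish to reproduce a figure of this kind can use the following sketch (ours; it assumes Python with NumPy, SciPy and Matplotlib) with the parameters quoted above, N = 3000, b = 0.05 and g_0 = U[5, 10]:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
f = stats.norm(0.0, 1.0)
i_phi = f.sf(5.0)                    # exact P(X >= 5), about 2.8665e-7

# Theoretical optimal density (19): f(x)/I_phi on [5, +inf), 0 elsewhere.
xx = np.linspace(4.0, 10.0, 600)
h_opt = np.where(xx >= 5.0, f.pdf(xx) / i_phi, 0.0)

# Weighted Gaussian-kernel estimate built from g0 = U[5, 10] with b = 0.05.
n, b = 3000, 0.05
x = rng.uniform(5.0, 10.0, n)
w = f.pdf(x) / 0.2                   # w = phi*f/g0; the U[5, 10] pdf is 1/5
w = w / w.sum()
h_hat = stats.norm.pdf((xx[:, None] - x[None, :]) / b) @ w / b

plt.plot(xx, h_opt, "r", label="h_opt (theoretical)")
plt.plot(xx, h_hat, "b", label="h_opt estimate (Gaussian kernels)")
plt.xlabel("x"); plt.legend(); plt.show()
```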

method                       Î_{1_{[5,+∞[}}    relative error
IS with h = ĥ_opt (NIS)      2.87 × 10⁻⁷       1.2%
IS with h = U[5, 10]         2.88 × 10⁻⁷       13%
IS with h = N(0, 5)          2.89 × 10⁻⁷       45%
Monte-Carlo                  0                 ?

Table 1. Estimation of Î_{1_{[5,+∞[}} with IS for N = 1000.

[1] M. Denny. Introduction to importance sampling in rare-event simulations. European Journal of Physics, pages 403–411, July 2001.


Figure 2. Theoretical optimal IS density h_opt (in red) and the estimated optimal IS density ĥ_opt with Gaussian kernel (in blue) for different g_0 (panels: g_0 = U[1, 6], g_0 = U[6, 11], g_0 = N(0, 5), g_0 = N(0, 10)).

method                       Î_{1_{[5,+∞[}}    relative error
NIS with g_0 = N(0, 1)       0                 ?
NIS with g_0 = N(0, 5)       2.85 × 10⁻⁷       23%
NIS with g_0 = N(0, 10)      2.87 × 10⁻⁷       2.5%
NIS with g_0 = U[0, 5]       0                 ?
NIS with g_0 = U[1, 6]       2.86 × 10⁻⁷       1.3%
NIS with g_0 = U[5, 10]      2.87 × 10⁻⁷       1.2%
NIS with g_0 = U[6, 11]      3.75 × 10⁻⁹       248%

Table 2. Estimation of Î_{1_{[5,+∞[}} with NIS for N = 1000 and different initial densities g_0.

[2] P. H. Borcherds. Importance sampling: an illustrative introduction. European Journal of Physics, pages 405–411, April 2000.
[3] T. Booth and J. Hendricks. Importance estimation in forward Monte-Carlo calculations. Nuclear Technology/Fusion, 9:90–100, 1984.
[4] P. Zhang. Nonparametric importance sampling. Journal of the American Statistical Association, 91(434):1245–1253, September 1996.
[5] Z. I. Botev, D. P. Kroese, and T. Taimre. Generalized cross-entropy methods with applications to rare-event simulation and optimization. Simulation, 11:785–806, 2007.
[6] J. C. Neddermeyer. Computationally efficient nonparametric importance sampling. Journal of the American Statistical Association, 104:788–802, 2009.
[7] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. London: Chapman and Hall, 1986.
[8] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, New York, 2005.
[9] S. Barber. All of statistics: a concise course in statistical inference. Journal of the Royal Statistical Society Series A, 168(1):261–261, 2005.