Generating random numbers from a unimodal density by cutting corners

Arif Zaman

March 4, 1996

Abstract
The cutting corners algorithm can be used to generate random numbers from any unimodal density which is specified only as a computer subroutine. This allows for a single program that can generate samples from Normal, Exponential, Gamma, Cauchy, ..., and almost any of your favorite densities with any parameters. Despite the extremely open nature of the problem, the algorithm proposed is extremely efficient, giving a random number in the time that it takes to generate four or five uniform random numbers. Even though it is slightly slower than the fastest algorithms, it more than pays for this in its convenience and generality. For unusual densities it may be the only practical algorithm, and for the standard distributions it provides an easy implementation.
1 Introduction

The cutting corners algorithm provides a single algorithm to generate random numbers from any unimodal density. It is convenient to use, because it involves only writing the density as a computer subroutine. For example, a FORTRAN program to print 100 random numbers from a Student's t distribution with 5 degrees of freedom would look like:

      FUNCTION DENSITY(X)
      DENSITY = (1+X*X/5)**(-3)
      RETURN
      END

      PROGRAM MAIN
      INTEGER T5(20000)
      CALL INICUT(T5,-6.,6.,0.,1.)
      DO 1 I=1,100
    1 PRINT *,RNDCUT(T5)
      END
Here we are assuming that the subroutines INICUT and RNDCUT have been implemented as library subroutines elsewhere. The call to INICUT sets up the random number generator (the arguments are explained later), and the function RNDCUT returns random numbers. Constants of proportionality may be left out of the DENSITY, so that the actual t density

    f(x) = \frac{\Gamma((k+1)/2)}{(k\pi)^{1/2}\,\Gamma(k/2)} \, (1 + x^2/k)^{-(k+1)/2}

simplifies to the formula used in DENSITY.

In the implementation of the above algorithm, a general purpose subroutine to generate discrete random variables from any specified distribution is needed. For this purpose the squared histogram algorithm is used, and it is supplied in the above library. The pair of algorithms, cutting corners for continuous unimodal densities and the squared histogram for discrete distributions, covers most of the situations encountered, and so these should be made a part of standard libraries.

These routines provide simplicity, flexibility and speed in programming. The simplicity of use makes the program easy to understand, and leaves less room for errors. The flexibility allows generating random numbers from a whole class of unrelated density functions. Besides the speed of programming, the rate of generation of random numbers is approximately 28,000 random numbers per second for the cutting corners algorithm and about 70,000 random numbers per second for the squared histogram algorithm, timed using Turbo Pascal on a Gateway 486/33 PC. This can be compared to 160,000 uniform random numbers per second using the built-in congruential generator.

The occasions where this algorithm can be useful are described in the following section. A series of algorithms leading up to the cutting corners algorithm is covered in section 3. Some aspects of implementation are discussed in section 4. The efficiency of the algorithm is examined in section 5. Actual results from simulating various densities are reported in section 6, followed by some thoughts on further speedup techniques in the final section.
2 Applications

The convenience and simplicity of the cutting corners algorithm have already been mentioned. For example, you could define the function f(x) = x(1 - x) and simply ask for some random numbers according to that density (we will ignore proportionality constants in densities). Since the above is a well known Beta distribution, a number of reasonably efficient methods exist to solve that particular problem. But if the density were changed even slightly, so that f(x) = x(1 - x)(2 - x) for 0 < x < 1, then no cookbook algorithms are available.

The cutting corners algorithm provides a simple way to generate numbers from any unimodal density, a class which covers almost all the standard textbook densities. This flexibility makes it the only choice in the following situations:

- Where samples are needed from an unusual density for occasional simulation and no standard subroutines exist or are easily available (as in the above example).
- When samples are needed from a variety of different densities, or from densities belonging to some parametric family. Other than the simple location-scale families and some of the more famous families of distributions, there are very few programs that can generate variables from a variety of densities.
- If random variables are needed from a density that changes at each step, as for example in Markov chains. Once again, most algorithms handle only one density, and cannot generate from changing densities.
- If a density can be specified only as a very long and slow algorithm. For example, if the density is the result of a kernel density estimator, or is computed using numerical integration of some other complicated function. The cutting corners algorithm produces random variables with a minimum of function evaluations.

In fact, when samples are needed from any density other than the uniform, this one algorithm serves to replace an assortment of odd routines for various densities. Of course, if we fix attention on any particular density, it is possible to create a fast customized algorithm. Historically this custom creation of algorithms has been done using human labor and ingenuity. A more recent
approach has been the development of some general ideas toward automating this process, by computing tables based on the density function evaluated at many values. Since this effort has always been thought of as a fixed, one-time cost, all concepts of algorithmic efficiency ignore this phase of the solution. If the time involved in the computations needed to generate the tables is included in the cost of generating a random number, the cutting corners algorithm is the most efficient algorithm by far.

Even without such conditional comparisons, the cutting corners method is extremely efficient. Considering that many inefficient methods, such as the polar method for generating normal variables, continue to be popular simply due to their simplicity, it is clear that many users are willing to trade peak performance for convenience with nearly optimal performance. Furthermore, the method is fully accurate, involving no approximations except the numerical roundoffs associated with computer floating point representations. Except for those involved in extremely long simulations (more than 100 million random numbers), the cutting corners method should suffice for most ordinary users.
3 Algorithms

We begin with a description of some algorithms which clearly are not practical and could easily be improved. The first of these is the simple rejection method, which forms the basis for the cutting corners method. Then an algorithm is described for monotone densities. It is finally modified to handle unimodal densities. The final section on implementation deals with the practical issues needed for improving the performance.
3.1 The rejection method
As a special case of von Neumann's rejection method [1], consider a density function f(x) that is bounded by a rectangle, i.e. 0 ≤ f(x) ≤ M and f(x) = 0 for all x < A or x > B. Pick a point (x, y) at random (uniformly) from the rectangle. If it is underneath the density, i.e. y < f(x), then it is called a hit and the value x is returned. Otherwise the point is a miss: it is rejected and the process starts over again by picking a new point. Clearly the random variable x has the required density. The probability of a hit is

    p = \frac{\int_A^B f(x)\,dx}{(B-A)\,M},

and the number of points until a hit is a geometric random variable with expectation 1/p. Since each point requires an evaluation of f(x) to see if it lies under the density, the expected number of function evaluations for each random number generated is 1/p.
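As a concrete illustration, here is a minimal FORTRAN sketch of this rejection step. The names REJSMP and UNIF are not part of the library described in the introduction: DENS is any user-written density bounded by FMAX on (A, B), and UNIF() is assumed to be whatever uniform (0,1) generator is available.

C     A minimal sketch of the rejection method of section 3.1.
C     DENS is the (possibly unnormalized) density, bounded above
C     by FMAX on (A,B) and zero outside.  UNIF() is an assumed
C     library function returning uniform random numbers in (0,1).
      REAL FUNCTION REJSMP(DENS, A, B, FMAX)
      EXTERNAL DENS, UNIF
      REAL DENS, UNIF, A, B, FMAX, X, Y
 10   CONTINUE
C     Pick a point uniformly in the bounding rectangle.
      X = A + UNIF() * (B - A)
      Y = UNIF() * FMAX
C     A point under the density is a hit; otherwise try again.
      IF (Y .GE. DENS(X)) GO TO 10
      REJSMP = X
      RETURN
      END

On average this costs 1/p density evaluations per random number, which is exactly the expense that the cutting corners method is designed to remove.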
3.2 Monotone densities
If in the previous problem it is known that the density f(x) is monotone decreasing (a similar argument applies for increasing), then there is a way to decrease the number of function evaluations by storing the information obtained from previous evaluations. To do this, save the values (x_i, f(x_i)) in a table every time the density function is evaluated. These points can be used to form an outer and an inner histogram as shown in Fig. 1.
Figure 1: The inner histogram (dark), the ambiguous area (light shade), and the area above the outer histogram (white) when the density has been evaluated at one (left), two (center) and four (right) points.

The actual density function is known to lie between these histograms. With this knowledge, we can often decide whether a point falls above or below the density without any function evaluations. The only time a function evaluation is needed is when the point falls in the ambiguous area between the two histograms (light shaded area). When a point falls under the inner histogram (dark area) it lies under the density, and when it falls outside the outer histogram (white area) it is above the density. As can be seen in Fig. 1, the evaluation of the function at the second point splits the large ambiguous rectangle into four parts. Two of them remain ambiguous, but two of the corners are `cut' from the ambiguous area. Repeated function evaluations continue to reduce the ambiguous area by cutting further corners.
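The key observation of this subsection can be captured in a few lines: given the table of evaluations, a candidate point can often be classified with no call to the density at all. The sketch below is illustrative only (the names CLASS, X and Y are not from the paper's library) and assumes a decreasing density with the table stored as X(0) = A < X(1) < ... < X(N) = B and Y(I) = f(X(I)) at the evaluated points (with Y(0) and Y(N) any valid bounds, such as M and 0).

C     Classify a point (PX,PY) against the inner and outer
C     histograms of a decreasing density: the result is 1 for a
C     certain hit, -1 for a certain miss, and 0 when the point
C     lies in the ambiguous area and f must actually be evaluated.
      INTEGER FUNCTION CLASS(N, X, Y, PX, PY)
      INTEGER N, K
      REAL X(0:N), Y(0:N), PX, PY
C     Find the interval (X(K-1), X(K)) containing PX.
      K = 1
 10   CONTINUE
      IF (PX .GT. X(K) .AND. K .LT. N) THEN
         K = K + 1
         GO TO 10
      END IF
C     For a decreasing f the outer histogram has height Y(K-1)
C     and the inner histogram has height Y(K) on this interval.
      IF (PY .LT. Y(K)) THEN
         CLASS = 1
      ELSE IF (PY .GT. Y(K-1)) THEN
         CLASS = -1
      ELSE
         CLASS = 0
      END IF
      RETURN
      END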
3.3 Unimodal densities
If f is known to be bounded as before, but is unimodal instead of monotone, a similar method still works. Of course, if the mode is known then the distribution can be broken into two parts, each of which is monotone. When the mode is unknown, the same table of values (x_i, f(x_i)) can be used to narrow down the location of the mode. If u < v < w and f(u) < f(v) and f(v) > f(w), then the mode must lie between u and w. On each side of this interval the density is monotone, and hence can be bounded by an inner and an outer histogram as before. This is illustrated in Fig. 2. In the central interval from u to w, the only bound available is M, the global bound for the density f.
Figure 2: The ambiguous area for unimodal densities evaluated at two (left), three (center) and six (right) points. Note that the ambiguous area (light shade) is bounded by M in the central two intervals.
4 Implementation

We will describe a subroutine which implements the above algorithm, without getting into specific details. It needs as input the parameters A and B, which specify the limits of the x values, and M, the maximum height of the density. It returns a real value x in the interval (A, B), according to the specified density. Internally, a table of function evaluations needs to be maintained. In its most simple form, this could simply be a list of (x_i, f(x_i)) at every point where the density has been evaluated. On entry to the subroutine this table would usually be empty. But if some values of the density are well known, or have previously been computed, these may be supplied as an initial table. An overall view of the subroutine is given in Fig. 3.
1. Select a random point (x, y) under the outer histogram.
2. If the point is under the inner histogram, it is a hit, so return with x.
3. Otherwise it is in the ambiguous area, so compute f(x).
4. Insert (x, f(x)) in the table.
5. If y > f(x) then it is a miss, so restart from the beginning.
6. Otherwise it is a hit, so return with x.

Figure 3: The Cutting Corners Algorithm

The first step is itself a difficult problem: how is one to select a random point under the outer histogram? If we consider the histogram as a union of many rectangles, the random point can be generated by a two step procedure. The areas of the rectangles can be computed and used to form a discrete distribution. If a rectangle is selected with probability proportional to its size, and then a point is selected uniformly within the rectangle, the final point will be uniformly distributed under the outer histogram.

Of course, the problem of generating a point from an arbitrary discrete distribution still remains. Fortunately, there are some good algorithms to select a point at random from a discrete distribution. The fastest methods [3] are almost as fast as a single random number generation followed by a single table lookup. In our problem, the areas of the rectangles will keep changing as they get split by function evaluations, so an algorithm is needed which is fast both in the initial set-up time as well as in the time needed to generate random numbers from it. The squared histogram method seems to be one of the best according to this criterion. The set-up time is proportional to the number of points in the distribution, and the algorithm is quite simple. Once a distribution has been set up, the time to generate a random number from it is a small fixed time not depending upon the number of points.
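To make Fig. 3 concrete, the following FORTRAN sketch (not the library routine RNDCUT of the introduction; the names RNDCC and UNIF are placeholders) implements the six steps for a monotone decreasing density. It deliberately trades away the efficiency discussed below: rectangles are chosen by a linear scan of their areas instead of by the squared histogram of the next subsection, and the area table is rebuilt on every call.

C     A self-contained sketch of the algorithm of Fig. 3 for a
C     monotone decreasing density DENS on (A,B), bounded by FMAX.
C     The table is kept between calls, so it should be used with
C     one density at a time.  UNIF() is an assumed uniform (0,1)
C     generator.
      REAL FUNCTION RNDCC(DENS, A, B, FMAX)
      EXTERNAL DENS, UNIF
      REAL DENS, UNIF, A, B, FMAX
      INTEGER MAXPTS, N, I, K
      PARAMETER (MAXPTS=1000)
      REAL X(0:MAXPTS), Y(0:MAXPTS), AREA(MAXPTS)
      REAL TOTAL, U, PX, PY, FX
      SAVE X, Y, N
      DATA N /0/
C     Initial table: only the bounding rectangle is known.
      IF (N .EQ. 0) THEN
         N = 1
         X(0) = A
         Y(0) = FMAX
         X(1) = B
         Y(1) = 0.0
      END IF
C     Step 1: choose a rectangle with probability proportional
C     to its area under the outer histogram (height Y(K-1) for a
C     decreasing density), then a uniform point inside it.
 10   CONTINUE
      TOTAL = 0.0
      DO 20 I = 1, N
         AREA(I) = (X(I) - X(I-1)) * Y(I-1)
         TOTAL = TOTAL + AREA(I)
 20   CONTINUE
      U = UNIF() * TOTAL
      K = 1
 30   CONTINUE
      IF (U .GT. AREA(K) .AND. K .LT. N) THEN
         U = U - AREA(K)
         K = K + 1
         GO TO 30
      END IF
      PX = X(K-1) + UNIF() * (X(K) - X(K-1))
      PY = UNIF() * Y(K-1)
C     Step 2: below the inner histogram is a certain hit.
      IF (PY .LT. Y(K)) THEN
         RNDCC = PX
         RETURN
      END IF
C     Steps 3-4: the point is ambiguous, so evaluate the density
C     and insert the new value into the table (kept sorted in X).
      FX = DENS(PX)
      IF (N .LT. MAXPTS) THEN
         DO 40 I = N, K, -1
            X(I+1) = X(I)
            Y(I+1) = Y(I)
 40      CONTINUE
         X(K) = PX
         Y(K) = FX
         N = N + 1
      END IF
C     Steps 5-6: a miss restarts from step 1, a hit is returned.
      IF (PY .LT. FX) THEN
         RNDCC = PX
         RETURN
      END IF
      GO TO 10
      END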
4.1 The Squared Histogram
Walker [4] first proposed a method, called the method of aliases, for generating a discrete distribution on the integers with arbitrary frequencies. A brief description of the original method can also be found in [1]. The modified name, the squared histogram algorithm, and the simplified explanation of a modification of that method given here are based on class notes by G. Marsaglia. The explanation is best read along with the simple illustrative example shown in Fig. 4.

Consider a discrete distribution on the integers {1, ..., n}, represented as a histogram. The total area under the histogram is 1, and if it were a uniform distribution, each point would have height 1/n. Draw a horizontal line at height 1/n to represent this equal distribution of probabilities. Those strips that have more than this poverty line are considered rich and those below are poor. Find the richest and the poorest strips. We `square' the histogram, in the sense of making all the strips of average height, by repeatedly applying the following simple rule: cut from the height of a rich, and stack it on top of a poor until the poor reaches a height of 1/n (even if this turns the rich into a poor). In less than n steps this process will end up with a picture where each strip has been squared, and contains at most two regions. As an inductive argument for this, note that at each step one poor strip is squared, leaving one less strip to be squared.

The classic Robin Hood algorithm takes from the richest and gives to the poorest. This involves finding the maximum and the minimum of a list n times, which would slow the setup time considerably. If instead we go sequentially from left to right, keeping track of the leftmost rich and the leftmost poor, and giving from the rich to the poor, the method still works. Some care has to be taken for the nouveau poor (the rich who become poor due to excessive taxation), who are to the left of the current leftmost poor, but in essentially a single sweep the histogram can be squared. As a result the set-up cost is reduced greatly, with a very minor increase in the cost for each call to the subroutine.

Finally we are left with two arrays, one identifying the number of the donor of the upper strip, and one specifying the cutoff between the upper and the lower strip. To generate according to the original histogram, simply select a point at random in the squared histogram; if it is in the k-th strip, below the cutoff, return k, and if it is above the cutoff, return the number of the donor to the upper portion of the k-th strip.
Figure 4: The Robin Hood algorithm used to square a histogram. The dotted line is the poverty line. The numbers written above the strips identify the donors. The top left shows the original histogram. The next has a donation from the first rich (strip 2) to the first poor (strip 1). The fourth picture shows where strip two finally got taxed so much that it became a nouveau poor. The bottom right is the final squared histogram.
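A sketch of the two routines just described is given below. It is not the paper's implementation: it uses a stack of rich strips and a stack of poor strips, which yields the same kind of squared histogram as the left-to-right sweep (including the handling of the nouveau poor), and the names SQUARE, RNDSQ and UNIF are placeholders. P(1..N) is assumed to hold probabilities summing to 1.

C     Squaring a histogram: on exit DONOR(K) is the strip that
C     donated the upper part of strip K, and CUT(K) is the level
C     (in probability units) below which strip K keeps its own
C     number.  Strips left exactly at the poverty line keep all
C     of their column (CUT = 1/N, DONOR = themselves).
      SUBROUTINE SQUARE(N, P, DONOR, CUT)
      INTEGER N
      INTEGER DONOR(N)
      REAL P(N), CUT(N)
      INTEGER NMAX
      PARAMETER (NMAX=20000)
      INTEGER POOR(NMAX), RICH(NMAX), NP, NR, I, J, K
      REAL H(NMAX), AVG
      IF (N .GT. NMAX) STOP 'SQUARE: too many strips'
      AVG = 1.0 / REAL(N)
      NP = 0
      NR = 0
      DO 10 I = 1, N
         H(I) = P(I)
         DONOR(I) = I
         IF (H(I) .LT. AVG) THEN
            NP = NP + 1
            POOR(NP) = I
         ELSE
            NR = NR + 1
            RICH(NR) = I
         END IF
 10   CONTINUE
C     Take from a rich strip until a poor strip reaches the
C     poverty line; a rich strip that drops below the line joins
C     the poor list (the nouveau poor).
 20   CONTINUE
      IF (NP .GT. 0 .AND. NR .GT. 0) THEN
         J = POOR(NP)
         K = RICH(NR)
         NP = NP - 1
         CUT(J) = H(J)
         DONOR(J) = K
         H(K) = H(K) - (AVG - H(J))
         IF (H(K) .LT. AVG) THEN
            NR = NR - 1
            NP = NP + 1
            POOR(NP) = K
         END IF
         GO TO 20
      END IF
      DO 30 I = 1, NP
         CUT(POOR(I)) = AVG
 30   CONTINUE
      DO 40 I = 1, NR
         CUT(RICH(I)) = AVG
 40   CONTINUE
      RETURN
      END

C     Sampling from the squared histogram: pick a strip at
C     random, then return the strip itself or its donor
C     according to the cut level.
      INTEGER FUNCTION RNDSQ(N, DONOR, CUT)
      INTEGER N, K
      INTEGER DONOR(N)
      REAL CUT(N), U, UNIF
      EXTERNAL UNIF
      U = UNIF() * REAL(N)
      K = INT(U) + 1
      IF (K .GT. N) K = N
      IF (U - REAL(K-1) .LT. REAL(N) * CUT(K)) THEN
         RNDSQ = K
      ELSE
         RNDSQ = DONOR(K)
      END IF
      RETURN
      END

The set-up is a single pass over the strips, and the sampling cost does not depend on N, which is the property relied on throughout the rest of the paper.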
5 Ambiguous area rates

The performance of the cutting corners method is intricately linked to the rate at which the ambiguous area decreases. To compute this rate, let us first define our terms more precisely. Let (x_i, f(x_i)) for i = 1, ..., n-1 be the list of function evaluations. Let x_{0,n} < x_{1,n} < ... < x_{n-1,n} < x_{n,n} denote the ordered sequence of x_i's, with x_{0,n} = A and x_{n,n} = B by definition. The x's divide (A, B) into n intervals given by I_i = (x_{i-1,n}, x_{i,n}) for i = 1, ..., n. When f is monotone, the strips of the inner histogram have height min(f(x_{i-1,n}), f(x_{i,n})) in interval I_i, while the outer histogram has height max(f(x_{i-1,n}), f(x_{i,n})). When f is unimodal, the histograms are identically defined, except in the intervals where the mode might be located. If

    f(x_{i-1,n}) < f(x_{i,n}) = ... = f(x_{j,n}) > f(x_{j+1,n})   for some i ≤ j,

then the mode is contained somewhere between x_{i-1,n} and x_{j+1,n}, so in all these intervals the outer histogram has height M. For the following discussion we will assume that f(x) is a strictly decreasing function. This makes the shapes of the inner and outer histograms reasonably simple. Even though the argument holds without major modifications when the density is unimodal or has flat spots, the proofs become uglier without any essential differences.

Define F, H_n^+ and H_n^- to be the areas under the density, the outer histogram and the inner histogram respectively. The ambiguous area is defined as A_n = H_n^+ - H_n^-. Notice that the only time A_n changes is when a point falls in the ambiguous area, forcing another function evaluation. In order to analyze the behavior of A_n, we define the splitting process, which is just an abbreviated version of the cutting corners method. The difference is that the splitting process only generates points that cause a change in the ambiguous area. The precise definition is given in Fig. 5.
1. Start with the ambiguous area consisting of a union of rectangles.
2. Select a random point (x, y) in the ambiguous area.
3. Split the rectangle into four pieces by a vertical cut at x and a horizontal cut at f(x). Remove the two diagonal sub-rectangles that are no longer ambiguous (for a decreasing f, the bottom-left and the top-right) from the ambiguous area.
4. Continue with step 2.

Figure 5: The Splitting Process

The behavior of the upper and lower histograms and of the ambiguous area in the splitting process is the same as that for the cutting corners algorithm. By eliminating all the points that are generated in H_n^-, this process provides a much faster way to simulate the behavior of the ambiguous area of the cutting corners algorithm. It also makes it easier to study the theoretical behavior of A_n.
5.1 The triangular density
To begin with, consider the triangular density f(x) = 1 - x on the interval (0, 1).
Theorem 1  For the triangular density the rate of decrease of the ambiguous area A_n is bounded by the limits

    1 ≤ n A_n ≤ 2.

For the triangular density, the ambiguous area is a union of squares along the diagonal. The splitting process picks a square with probability proportional to its area, and then splits it into two squares by picking x uniformly. Simulation of this splitting process suggests that n A_n converges rapidly to a limit. This limit constant raises its head repeatedly, and so we give it the name C_s. The results of 200 repetitions done for A_{100,000} gave the approximate value C_s = 1.571.

The lower bound 1/n ≤ A_n is a deterministic lower bound, achievable only by an evenly spaced grid of n intervals each of length 1/n. The exact value for the expectation of the ambiguous area is hard to find, but the upper bound can be found by using the expectation under the assumption that the x's are independent uniform random variables. More precisely, let x*_1, ..., x*_n be independent variables, uniformly distributed on (A, B). Define A*_n to be the ambiguous area computed as if the x_i's were replaced by the x*_i's, and similarly define I*_i. The length of the interval |I*_i| has a Beta(1, n-1) distribution, so E(|I*_i|^2) = 2/[n(n+1)]. Adding up all the squares,

    E(A^*_n) = \sum_{i=1}^{n} E\big(|I^*_i|^2\big) = \frac{2}{n+1}.

This is an upper bound for the real A_n because the actual x's have a super-uniform distribution, i.e. they are more evenly spaced than the x*'s, and hence have a smaller variance. This is because each new x_i is more likely to split a larger interval, as compared to an independent sequence.

This is about as close to a proof as we shall ever approach in this paper. The point of this and the following `theorems' is to establish some general results about rates of convergence which will enable us to optimize some parameters for the program and obtain some qualitative information of how fast the random number generator will be. In order not to clutter the discussion, all `proofs' are merely suggestions of why a `theorem' might be true, without any attempt to plug all the holes. The numerical examples at the end are the real `proofs'.
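The constant C_s is easy to approximate by simulating the splitting process directly, as the discussion above suggests. The toy program below (the names ESTCS and UNIF are placeholders, and UNIF() is again an assumed uniform generator) tracks the side lengths of the diagonal squares; a single run gives a rough estimate of n A_n, and averaging many runs reproduces C_s close to 1.571. The linear search over squares makes the sketch quadratic in the number of splits, which is why it stops well short of the 100,000 splits used above.

C     Estimate Cs by running the splitting process of Fig. 5 on
C     the triangular density.  S(1..N) are the side lengths of
C     the ambiguous squares and AN is the total ambiguous area.
      PROGRAM ESTCS
      INTEGER NMAX, N, K
      PARAMETER (NMAX=10000)
      REAL S(NMAX), AN, U, V, UNIF
      EXTERNAL UNIF
      S(1) = 1.0
      N = 1
      AN = 1.0
 10   CONTINUE
      IF (N .GE. NMAX) GO TO 30
C     Choose a square with probability proportional to its area.
      U = UNIF() * AN
      K = 1
 20   CONTINUE
      IF (U .GT. S(K)**2 .AND. K .LT. N) THEN
         U = U - S(K)**2
         K = K + 1
         GO TO 20
      END IF
C     Split its side at a uniform point; the two diagonal pieces
C     that stay ambiguous are again squares of sides V*S and
C     (1-V)*S, so the area shrinks by 2*V*(1-V)*S**2.
      V = UNIF()
      AN = AN - 2.0 * V * (1.0 - V) * S(K)**2
      S(N+1) = S(K) * (1.0 - V)
      S(K) = S(K) * V
      N = N + 1
      GO TO 10
 30   CONTINUE
      PRINT *, 'n * An after', N, ' splits:', REAL(N) * AN
      END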
Corollary 1  For the triangular density f(x) = B - x defined on (0, B),

    B^2 ≤ n A_n ≤ 2 B^2.

Corollary 2  The expected number of random numbers generated from the triangular density with n function evaluations is O(n^2), the actual limiting rate approaching n^2 / (4 C_s).

A random number can be generated at step 2 or step 6 of the cutting corners algorithm. We can continue generating points until the (n+1)-st point falls in the ambiguous area A_n. After i-1 function evaluations the ambiguous area consists of i squares. The probability of a point falling in the ambiguous area is p_i = A_i / H_i^+, so an expected 1/p_i - 1 points will be generated at step 2 before the i-th function evaluation is needed. For the triangular density exactly half of the ambiguous area is below the density, so that H_i^+ = 1/2 + A_i/2, and for this same reason we expect only half of the function evaluations to result in a hit at step 6. Thus the total number of random variables expected to be generated by n function evaluations should be about

    \sum_{i=1}^{n+1}\left(\frac{1}{p_i}-1\right) + \sum_{i=1}^{n}\frac{1}{2}
        = \sum_{i=1}^{n+1}\frac{1}{2A_i} - \frac{1}{2}.

Using the result of the previous theorem we can approximate this by

    \frac{(n+2)(n+1)}{4 C_s} - \frac{1}{2}

random variables for n function evaluations, which is the rate mentioned in the corollary.
5.2 A general density
For a general density, note that it is the quantity A_n/F that is scale invariant. The limiting result in general will depend on the density through one rate constant, as given by the following theorem.
Theorem 2  For an arbitrary monotone (or unimodal) density,

    n A_n \to C_s C_f F,    where    C_f = \frac{\left(\int\sqrt{|f'(x)|}\,dx\right)^2}{\int f(x)\,dx}

is a property depending only on the shape of f (not the location or scale).
After some initial n_0 evaluations, any density f will be nearly linear in each of the rectangles composing the ambiguous area. After that point, the behavior of A_n will resemble the behavior of A_n for the triangular density. To make this identification more precise, transform each of the n_0 ambiguous rectangles by `squaring' them, i.e. for rectangle i with sides x_i by y_i, multiply the x axis by sqrt(y_i/x_i) and divide the y axis correspondingly, leaving the area unchanged. If the same transformation is applied individually to each of the strips in the histogram, we get a picture as shown in Fig. 6.

Figure 6: Each strip of the original histogram (left) is linearly transformed to a corresponding strip on the right, with the ambiguous rectangles ending up as squares, but all the areas unchanged.

The original rectangle is now completely distorted, and the transformed function is no longer monotone (or unimodal). We can continue the process of cutting corners in the original histogram. Each new point selected corresponds to a point on the transformed histogram, and each cut of an ambiguous rectangle corresponds to a cut of a square in the new histogram. Thus the splitting process in this case will be indistinguishable from the splitting process in the triangular case, except for the starting configuration of the squares and the size of the triangle. We need to measure the size of the base (x-axis) of the transformed histogram in order to use Corollary 1 to establish the rate of decrease of A_n. Consider lining up all the squares that have been created after the transformation along a straight line. The length of that line would be

    \sum_{i=0}^{n} x_i \sqrt{y_i/x_i} = \sum_{i=0}^{n} \sqrt{|f(x_{i+1}) - f(x_i)|\,(x_{i+1} - x_i)}.

As n gets large, the strips get thinner, and the density function gets more linear within each strip. At the same time, the above sums and the x's and y's start approaching integrals and derivatives, and in fact the above sum approaches \int\sqrt{|f'(x)|}\,dx, whose square is C_f F with C_f as defined in the statement of the theorem. The initial configuration of the n_0 ambiguous rectangles will become immaterial in the limit as n gets much larger than n_0. Finally, for the triangular distribution C_f = 2 (so that C_f F = 1), which recovers the limit n A_n -> C_s observed in Theorem 1.

Corollary 3  For large n, the expected number of random numbers generated with n function evaluations is n^2 / (2 C_s C_f).

As in the proof of Corollary 2, after i-1 function evaluations the probability of a point falling in the ambiguous area is p_i = A_i / H_i^+, and we expect to generate 1/p_i - 1 points before the i-th evaluation is needed. Because of the approximate linearity of f in each of the rectangles, H_i^+ is approximately equal to F + A_i/2. The expected total number of random numbers generated before the n-th function evaluation is needed should be about

    \sum_{i=1}^{n+1}\left(\frac{1}{p_i}-1\right) + \sum_{i=1}^{n}\frac{1}{2}
        = \sum_{i=1}^{n+1}\left(\frac{H_i^+}{A_i} - \frac{1}{2} - \frac{1}{2}\right) + \sum_{i=1}^{n}\frac{1}{2}
        \approx \sum_{i=1}^{n+1}\frac{F}{A_i} - \frac{1}{2}
        \approx \frac{(n+2)(n+1)}{2 C_f C_s} - \frac{1}{2},

which is the rate mentioned in the corollary.
5.3 The first step
Even though the previous sections provide an idea of the asymptotic behavior of A_n, we need some estimates for the user who wants to sample just one random number from each density. The following theorem provides just such an estimate.

Theorem 3  Define λ = log_2((B-A)M / F). If (B-A)M >> F, the ambiguous area has an expected value given by

    E(A_n) \approx (B-A) M \, 2^{-n}    for n < λ.

This approximation stems from the idea that as long as (B-A)M >> F, the probability of a hit is nearly zero. Assuming a truly zero probability of hits, define x_n to be the value of x used in cutting corners after the n-th miss. Then x_0 is distributed uniformly in the interval (A, B), and x_i is distributed uniformly in the interval (A, x_{i-1}). This gives E(x_{n-1} - A) = (B-A) 2^{-n}, and since the ambiguous area at that stage is roughly M(x_{n-1} - A), this gives the value stated in the theorem. Such a halving at every step will continue so long as the ambiguous area is much larger than F, but after λ steps the two will be of similar magnitudes. At that point the exponential decline in A_n will stop, and also the chance of a hit will become substantial, resulting in the first hit. Note that the marginal cost of even the second random number will be much less than that of the first, because most of the work of bringing the ambiguous area into the same order of magnitude as F was done in producing the first one.
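As a quick numerical illustration of the theorem, take the standard Cauchy density 1/(1+x^2) truncated to (0, 6.4e7), one of the test densities of section 6.3. There M = 1 and F = π/2, so λ = log_2(6.4e7 / (π/2)) ≈ 25.3: about 25 halvings of the ambiguous area, and hence roughly that many function evaluations, are spent before the first hit becomes likely, even though very few evaluations per random number are needed afterwards.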
6 Performance

6.1 Summary
According to what has been shown previously, we can describe the performance of the cutting corners method in three separate phases. In the early phase, of generating one or perhaps a few random variables, the cost is related to λ = log_2((B-A)M/F), i.e. the log of the size of the bounding rectangle. In the intermediate phase the time is proportional to the square root of the number of random variates needed. The constant of proportionality will depend on C_f, a constant depending on the shape of the density, as well as on the speed of density evaluation. The duration of the intermediate phase depends on the speed of the density evaluation. Regardless of the speed, eventually the ambiguous area decreases to such a point that the density is evaluated very infrequently. At this final phase the time to generate a random variable is simply the time to generate a point using the squared histogram method, which is a small constant, regardless of the density under consideration. You could almost consider that the adaptive algorithm has `converged' to a fixed squared histogram algorithm.

Ignoring the specifics of the density function f and the bounding rectangle used, overall experience tends to suggest long run timing on the order of approximately 26,000 rps (random numbers per second, timed on a Gateway 80486-33 MHz PC using double precision in Turbo Pascal). After about the first thousand random numbers there is no significant difference, and all densities give a rate of higher than 20,000 rps. To generate the first random number, including setup costs, the rate ranged from 200 to about 4500 rps, depending very much upon the density and the bounding box used. The lower measure (200 rps) comes from densities where the bounding box used was about 10 times the area of the density, while the latter comes from densities where the box was only twice as large as the density. For most reasonable cases the rate was about 3000 rps. Compared to the fact that single precision uniform random numbers between 0 and 1 are generated at approximately 160,000 rps, generating one random number is about 50 times slower, while generating many random numbers from the same density is about 6 times slower.
6.2 The squared histogram
For those interested in discrete random variables, the squared histogram algorithm is even more spectacular. The algorithm to square a histogram is linear in n, the number of strips in the histogram. The observed timings for this setup portion were about 60,000 strips per second. There are almost no overhead calculations other than the time to enter and leave the subroutine, so the setup time is nearly linear over the entire range from n = 2 upwards. Generating a random number from a squared histogram does not depend on the number of strips in the histogram and can be done at the approximate rate of 70,000 random numbers per second.
6.3 Simulation results
The table of standard densities from [2] was used as a test case. From each density, samples of 100,000 random numbers were produced, and detailed statistics about the timing and the number of function evaluations were maintained. The sampling was repeated until the variance in the timings and counts was negligible. A summary of the densities tested is given in Fig. 7. Since location and scale do not influence the performance, all location and scale parameters were removed. If any parameters still remained, the first parameter, called n, was always set to n = 3. If there was yet a second parameter, this parameter was called m and was set to m = 2. All symmetric densities were replaced by their one-sided versions (since adding a random sign at the end is more efficient).

The third column of the table gives the interval of support that was used for each density. If the support of the density was infinite, it was truncated so that the tail probability was around 10^{-8}. For heavy tailed distributions this required choosing a very large truncation interval. This very small tail probability was chosen intentionally to illustrate that even if the rectangle used to enclose the density is extremely large, there is only a very small penalty in terms of performance. The constants related to their performance are given in the fourth and fifth columns. The fourth column is the constant C_f described in Theorem 2, while the fifth column is λ = log_2((B-A)M/F) mentioned in Theorem 3. The λ can be thought of as the number of times the ambiguous area needs to be halved before its area is equal to that of the density. This determines the behavior if only a few random numbers are needed from the density.
Density       f(x)                                  (A, B)         Cf      λ       Time (µs)
Triangular    1 - x                                 (0, 1)         2.00    1.00     5.34
Normal        exp(-x^2/2)                           (0, 6)         2.40    2.26    22.29
Exponential   e^{-x}                                (0, 20)        4.00    4.32    25.40
Cauchy        1/(1 + x^2)                           (0, 6.4e7)     6.28   25.28     8.09
Logistic      e^{-x}/(1 + e^{-x})^2                 (0, 20)        2.87    3.32    29.62
Gumbel        e^{-x} exp(-e^{-x})                   (-3, 20)       5.08    3.08    49.70
Chi Squared   x^{n/2-1} e^{-x/2}                    (0, 40)        4.09    3.27    51.24
Pareto        x^{-(n+1)}                            (1, 465)       5.33   10.44    45.18
t             (1 + x^2/n)^{-(n+1)/2}                (0, 670)       3.66    8.94    44.62
Gamma         x^{n-1} e^{-x}                        (0, 6)         4.47    2.10    45.46
Weibull       x^{n-1} exp(-x^n)                     (0, 3)         4.24    2.82    66.66
Log-Normal    (1/x) exp(-((ln x - n)/m)^2 / 2)      (0, 1.5e6)    13.46   17.75    83.55
Beta          x^{n-1} (1 - x)^{m-1}                 (0, 1)         0.52    0.83    51.16
F             x^{n/2-1} (1 + x n/m)^{-(n+m)/2}      (0, 1e8)       8.21   25.94    69.35

Figure 7: A summary of densities tested
Both of these columns were computed using numerical integration wherever exact answers were not computable. The final column is the average CPU time, measured in microseconds, to evaluate the density function. This time was the typical time for function evaluation averaged over the entire simulation.

Based on the values of λ given in Fig. 7, we expect the Cauchy and F distributions to be the slowest starters, and the Log-Normal and Pareto should also distinguish themselves with moderately slow starts. This is confirmed in the graph of the actual rates shown in Fig. 8. Since most of the densities perform without any appreciable difference, only the unusual densities have been identified on that graph. Another interesting feature to note is that even though the Log-Normal distribution starts off faster than the Cauchy or the F distributions, it slowly lags behind them. This can be explained by the fact that even though the value of λ for the Log-Normal is smaller than the others, it has a value of C_f of nearly twice the other distributions. This means that generating the same number of random variables requires nearly twice as many function evaluations. It is worth noting that regardless of the original speed, all densities become nearly indistinguishable after a few thousand random numbers are generated.

In the particular examples we chose, the function evaluation time was never very large. If the cost of a function evaluation were large, the graph shown in Fig. 9 of the number of function evaluations needed to produce a given number of random numbers would become more relevant. The upper triangular half of the graph shows lines on the log-log scale that have a limiting slope of 1/2. This is simply a consequence of Corollary 3, because the number of function evaluations is proportional to the square root of the number of random variates required. In order to show more detail, the ratio n^2/s (which should converge to 2 C_s C_f according to Corollary 3) is shown in the lower triangular graph. The approach to a limit is clearly visible, and it is easy to verify that the limits obtained in the simulation are close to the theoretical results.
[Figure: log-log plot of the generation rate, `Rate (randoms/sec)', against `Random numbers sampled'; the labelled curves are Triangular, Beta and Normal (the fastest starters) and Student's t, Pareto, Log-Normal, Cauchy and F (the slowest starters).]

Figure 8: Rate of random number generation vs. sample size
[Figure: two panels plotted against `Random numbers sampled (s)'. The upper panel shows `Function evaluations (n)' on a log-log scale; the lower panel shows the ratio n^2/s. Labelled curves include the Triangular, Beta, Normal, Logistic, Gamma, Gumbel, Pareto, Student's t, Log-Normal, Cauchy and F densities.]

Figure 9: The number of function evaluations needed to generate a sample
7 Further Optimization
7.1 Reducing Histogram Squarings
For the cutting corners algorithm, the discrete distribution representing the areas of the rectangles changes at every function evaluation. This forces a re-squaring of the histogram, with one more probability strip. Assuming a linear time of n C_h to square a histogram with n strips, n function evaluations would incur a histogram squaring cost of

    \sum_{i=1}^{n} C_h\, i \approx \frac{C_h n^2}{2}.

This excessive growth with respect to n really subdues a lot of the advantage we could gain out of cutting corners.

Fortunately, there is a way to avoid this penalty by noticing that the histograms do not change radically over a few function evaluations. The trick is to simply not update the histogram at every function evaluation. As an example, if the function has been evaluated 100 times we may use the outer histogram as it was when the function had been evaluated only 50 times. This means that occasionally a point will land outside the outer histogram. We check for such an occurrence, and if it happens, simply reject it and start over. This technique provides a drastic speedup and must be implemented to get any reasonable speeds.

The exact times of when to re-square the histogram require some tuning to get the best timings. In the timings given here the histogram was re-squared if either the ambiguous area dropped to below half of, or the number of strips doubled as compared to, the histogram that was last squared. The factors of 0.5 for the area and 2 for the number of strips are somewhat arbitrary. One could try to find `optimal' values, but the timings are relatively insensitive to these numbers.
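The rule just described fits in one line of code; the sketch below (the name RESQR and the argument names are placeholders, not the paper's routine) simply records that decision, with AMBOLD and NOLD being the ambiguous area and strip count at the time of the last re-squaring.

C     Decide whether the outer histogram should be re-squared:
C     yes if the ambiguous area has at least halved, or if the
C     number of strips has at least doubled, since the last time.
      LOGICAL FUNCTION RESQR(AMBIG, AMBOLD, NSTRIP, NOLD)
      REAL AMBIG, AMBOLD
      INTEGER NSTRIP, NOLD
      RESQR = (AMBIG .LE. 0.5*AMBOLD) .OR. (NSTRIP .GE. 2*NOLD)
      RETURN
      END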
7.2 Speedup by pre-evaluation
There is a temptation to ask why one shouldn't evaluate the function at an evenly spaced grid of points before starting to generate random numbers. The elegance of this method is that even though the grid of points where the function is evaluated is randomly selected, it turns out to be a very efficiently designed grid. This is because the ambiguous area is the union of many rectangles. It is only when a point falls in this area that f needs to be evaluated. Whenever a point falls in any rectangle in here, it splits the rectangle into four pieces, of which only two remain ambiguous. The random selection of points ensures that points fall in the largest rectangles with the greatest probability, so that the function is evaluated at the points where it will decrease the ambiguous area most rapidly.

Despite the above argument, there is an advantage to pre-evaluating the function at a carefully chosen point or two if the bounding box is extremely large. For a large bounding box, evaluating the function at the 99th percentile will cut out most of the ambiguous area in a single cut, rather than waiting for the logarithmic approach of the bisection described in Theorem 3. This will generate drastic speedups in the generation of the first random variable, even though the asymptotic effects of this will be negligible.
7.3 Alternatives to Squaring the Histogram
Even though the squared histogram method serves our needs extremely well, in the early stages there is too much bookkeeping associated with that method. In fact, when there is a very ill-specified bounding box or a poorly specified mode, most of the probability is associated with the two strips on either side of the suspected mode. A very simple method with low setup overhead would tremendously speed up the early evaluations in these cases.
7.4 Programming
The timings described above are from a very pedantic program. Further speedups are possible by streamlining the programming.
7.5 Alternatives to the squared histogram
Even though the squared histogram is one of the fastest methods for large discrete densities, some kind of a tree structured algorithm which allows for rapidly changing densities might be faster in the initial phase. This may be of great benefit to users who are generating only a single sample from many different densities.
8 New Directions

There are many situations where we expect to have a `nearly' unimodal distribution, where there may be very slight modes at other locations. Even in these situations a variation of the above idea may prove useful. Instead of removing the entire corner, we can add a constant ε to the value of the density before cutting. Thus if we compute y_i = f(x_i), for a decreasing f, we can safely say that if x > x_i then f(x) < y_i + ε. With an appropriate choice of ε, this can cover a larger class of densities.

Elaborations on this idea could use non-rectangular shapes to provide a way to simulate any kind of density, without regard to unimodality, as long as it satisfies some kind of Lipschitz (or similar) inequality. Thus every function evaluation would provide a bound on its neighborhood. If done carefully and adaptively, it should be possible to write a completely general random number generator which is still extremely efficient for samples larger than 10,000 or so.
References

[1] Knuth, D. E. (1981), The Art of Computer Programming: Seminumerical Algorithms, 2nd ed., Addison-Wesley.

[2] Mood, A. M., Graybill, F. A., and Boes, D. C. (1974), An Introduction to the Theory of Statistics, 3rd ed., McGraw-Hill, New York.

[3] Marsaglia, G. (1963), Generating Discrete Random Variables in a Computer, Comm. ACM, 6, No. 1, pp. 37-38.

[4] Walker, A. J. (1974), New Fast Method for Generating Discrete Random Numbers with Arbitrary Frequency Distributions, Electronics Letters, 10, No. 8, pp. 127-128.