A simple and effective discretization of a continuous random variable

Zvi Drezner* and Dawit Zerom

California State University - Fullerton, U.S.A.
Abstract

A generally applicable discretization method is proposed to approximate a continuous distribution on the real line with a discrete one supported on a finite set. The method adopts a criterion which is shown to be flexible in approximating higher-order features of the underlying continuous distribution while automatically preserving the mean and variance. To illustrate the effectiveness of the method, several examples covering a wide range of continuous distributions are analyzed. A computer implementation (using R) of the proposed procedure is provided.

Keywords: Continuous distribution; Discrete distribution; Mean and variance; Optimization; Higher-order features.
1 Introduction
The need to approximate a continuous random variable with a discrete one arises in many settings. Discretization may be used to gain the convenience of a discrete sample space, with the aim of reducing the excessive computational requirements of the continuous case. For example, a stochastic optimization problem with a random parameter can be conveniently solved by discretizing the sample
*Corresponding author: Mihaylo College of Business and Economics, California State University, Fullerton, CA 92834-6848, (657) 278-2712, [email protected].
space of the parameter. Discretization is also used in applications where analytic approaches fail to provide a closed-form solution to a problem. For example, stress-strength analysis is an important engineering problem where a system with random strength is subject to a random stress during its functioning, and the system works only when the strength is greater than the stress. The probability that a system operates successfully is termed reliability; see Roy and Dasgupta (2001) and references therein. When the stress or strength variables depend on several stochastic factors, evaluating the reliability of a system requires approximating the probability distributions of both stress and strength. However, these probability distributions are often not analytically tractable and hence approximations are needed. Given a known functional relationship between stress (or strength) and its random sub-components, and assuming the sub-components of the stress (or strength) are independent, one feasible approach to approximating the probability distribution of stress (or strength) is through discretization of the sub-components; see, for example, English et al. (1996). When discretizing a continuous random variable, losing some features of the underlying continuous distribution is unavoidable. So, the goal of a discretization procedure is to maintain the "relevant" properties of the distribution as much as possible, where relevancy is guided by the properties the user would like to preserve. For example, the discretization by Katz (1983) is aimed at maintaining the expectation of the distribution. Other discretization approaches, such as D'Errico and Zaino (1988), Roy and Dasgupta (2001) and Barbiero (2012) (among others), are designed for reliability approximation of a system in engineering applications. In this paper we introduce a computationally simple, yet general, approach to discretization
that can appeal to many users. It is a criterion-driven discretization, and we show using several examples that the approach is quite flexible in approximating higher-order features of the underlying continuous distribution while automatically preserving the mean and variance. The rest of the paper is organized as follows. Section 2 introduces the proposed approach, where we present the theoretical motivation and the optimization algorithm for finding the discrete distribution. Section 3 discusses the R implementation of the proposed algorithm. In Section 4, we illustrate the flexibility of the proposed discretization in capturing higher-order features beyond the mean and variance of the underlying continuous distribution. In the illustrations, we focus on kurtosis as a higher-order feature of interest and consider a wide range of continuous distributions. Guided by reliability estimation problems in engineering applications, Section 5 discusses three analytically tractable examples to further motivate the value of the proposed discretization. The proof of Theorem 1 and execution details of the accompanying R code are available in the Appendix.
2 Proposed approach
Let X be a continuous random variable with mean µ_x and variance σ_x^2. It has a density function f(x), −∞ < a ≤ x ≤ b < ∞, and a corresponding distribution function F(x). The goal is to discretize X into K points by selecting K − 1 boundary values δ_1, ..., δ_{K−1} such that a = δ_0 < δ_1 < ... < δ_{K−1} < δ_K = b. We start by introducing a discretized version of X, denoted by M, where each segment above (note that there are a total of K disjoint segments) is replaced by one point located at the mean of that segment, with corresponding probability equal to the area under the curve for that segment. This discretization automatically preserves µ_x. More formally, under
this discretization and for i = 1, ..., K, the discrete point m_i and the corresponding probability p_i = P(M = m_i) are defined by

$$m_i = \frac{\int_{\delta_{i-1}}^{\delta_i} x f(x)\,dx}{\int_{\delta_{i-1}}^{\delta_i} f(x)\,dx} \qquad \text{and} \qquad p_i = \int_{\delta_{i-1}}^{\delta_i} f(x)\,dx. \tag{1}$$
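As a small numerical illustration (ours, not from the paper), the snippet below evaluates (1) in R for a standard normal X with hypothetical boundaries δ = (−∞, −1, 0, 1, ∞), i.e. K = 4.

```r
# Illustration of (1) for a standard normal X with K = 4 segments.
delta <- c(-Inf, -1, 0, 1, Inf)    # hypothetical boundary values
K <- length(delta) - 1
p <- diff(pnorm(delta))            # p_i: area under f over each segment
m <- sapply(1:K, function(i)       # m_i: mean of each segment
  integrate(function(x) x * dnorm(x), delta[i], delta[i + 1])$value / p[i])
sum(p * m)                         # ~0 = mu_x: the mean is preserved
```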
The discrete random variable M has a proper probability distribution, i.e.

$$\sum_{i=1}^{K} p_i = \sum_{i=1}^{K} \int_{\delta_{i-1}}^{\delta_i} f(x)\,dx = \int_{a}^{b} f(x)\,dx = 1.$$
The mean (µ_m) and variance (σ_m^2) of M are defined by

$$\mu_m = \sum_{i=1}^{K} p_i m_i \qquad \text{and} \qquad \sigma_m^2 = \sum_{i=1}^{K} p_i m_i^2 - \mu_m^2. \tag{2}$$
It is easy to see that µ_m = µ_x, i.e.

$$\sum_{i=1}^{K} p_i m_i = \sum_{i=1}^{K} \int_{\delta_{i-1}}^{\delta_i} x f(x)\,dx = \int_{a}^{b} x f(x)\,dx \equiv \mu_x.$$
The following result, which is proved in the Appendix, motivates an improved version that we will introduce subsequently.

Theorem 1. The variance of M is strictly less than the variance of X, i.e.

$$\sigma_m^2 < \sigma_x^2. \tag{3}$$
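Continuing the numerical illustration above, Theorem 1 can be checked directly for the standard normal case (σ_x^2 = 1):

```r
# Numerical check of Theorem 1 for the standard normal illustration above.
sigma2.m <- sum(p * m^2) - sum(p * m)^2
sigma2.m < 1   # TRUE: Var(M) is strictly below sigma_x^2 = 1
```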
By Theorem 1, one cannot partition the range of the continuous random variable X into K disjoint segments such that 1) each discrete point is defined as the mean of the segment with probability equal to the area under the curve for that segment, and 2) the resulting discrete random variable preserves both µ_x and σ_x^2. This means that one of these two conditions needs to be relaxed. Ideally one would like to preserve both µ_x and σ_x^2. Let D be a new discrete random variable with mean (µ_d) and variance (σ_d^2) defined by

$$\mu_d = \sum_{i=1}^{K} p_i d_i \qquad \text{and} \qquad \sigma_d^2 = \sum_{i=1}^{K} p_i d_i^2 - \mu_d^2$$
where p_i = P(M = m_i) = P(D = d_i). Using M as the basis and guided by Theorem 1, we define D to be the closest to M in the squared-error sense while at the same time satisfying µ_d = µ_x and σ_d^2 = σ_x^2. We will show in Sections 4 and 5 that this discretization criterion (minimizing the squared error between M and D) is not only intuitively and computationally appealing, it also leads to a discrete distribution that is quite flexible in capturing relevant features of the continuous distribution beyond mean and variance. Noting that

$$p_i = \int_{\delta_{i-1}}^{\delta_i} f(x)\,dx,$$

it can be seen that

$$\delta_i = F^{-1}\left(\sum_{j=1}^{i} p_j\right) \tag{4}$$

where F^{-1}(·) is the inverse distribution function of X. We also rewrite m_i as

$$m_i = \frac{1}{p_i}\int_{\delta_{i-1}}^{\delta_i} t f(t)\,dt. \tag{5}$$
So, M depends on the δ's through the p's. More formally, we obtain the discrete distribution (p_i, d_i, i = 1, ..., K) as the solution to the following optimization problem:

$$\min_{p,\,D} \left\{ \sum_{i=1}^{K} p_i \left[m_i - d_i\right]^2 \right\}$$

subject to:

$$\sum_{i=1}^{K} p_i = 1, \qquad \sum_{i=1}^{K} p_i d_i = \mu_x, \qquad \sum_{i=1}^{K} p_i d_i^2 = \mu_x^2 + \sigma_x^2. \tag{6}$$
Problem (6) consists of 2K parameters (p, D). But, observing that m_i is not a function of d_i and using the method of Lagrange multipliers, the dimensionality of the optimization problem can be substantially reduced. The Lagrange function F(D, p, λ) for problem (6) is given by

$$F(D, p, \lambda) = \sum_{i=1}^{K} p_i \left[m_i - d_i\right]^2 + \lambda_1\left(\sum_{i=1}^{K} p_i - 1\right) + \lambda_2\left(\sum_{i=1}^{K} p_i d_i - \mu_x\right) + \lambda_3\left(\sum_{i=1}^{K} p_i d_i^2 - \mu_x^2 - \sigma_x^2\right)$$

where λ_1, λ_2 and λ_3 are the Lagrange multipliers of the constraints. Taking the derivative of F(D, p, λ) with respect to d_i and equating it to 0, we obtain

$$2p_i(d_i - m_i) + \lambda_2 p_i + 2\lambda_3 d_i p_i = 0. \tag{7}$$
Consequently,

$$d_i = \frac{m_i - \frac{1}{2}\lambda_2}{1 + \lambda_3}, \qquad i = 1, \ldots, K. \tag{8}$$
Therefore, D is no longer a parameter vector and hence the number of parameters is reduced to K + 2 (including λ_2 and λ_3).
2.1 Further simplification of the problem
In its current form, (6) is a nonlinear optimization problem where two of the three equality constraints are nonlinear. We now exploit the structure of the problem to further simplify the optimization and facilitate easier implementation of the proposed discretization. Incorporating (8) into the constraints on µ_d and σ_d^2 in (6), we obtain

$$\sum_{i=1}^{K} p_i d_i = \sum_{i=1}^{K} p_i \frac{m_i - \frac{1}{2}\lambda_2}{1 + \lambda_3} = \mu_x \tag{9}$$

$$\sum_{i=1}^{K} p_i d_i^2 = \sum_{i=1}^{K} p_i \left[\frac{m_i - \frac{1}{2}\lambda_2}{1 + \lambda_3}\right]^2 = \mu_x^2 + \sigma_x^2 \tag{10}$$
We know from (2) that

$$\sum_{i=1}^{K} p_i m_i = \mu_x; \qquad \sum_{i=1}^{K} p_i m_i^2 = \sigma_m^2 + \mu_x^2 \tag{11}$$

where µ_m = µ_x. Substituting (11) into (9) yields

$$\mu_x - \frac{1}{2}\lambda_2 = \left[1 + \lambda_3\right]\mu_x \;\Rightarrow\; \lambda_2 = -2\mu_x\lambda_3. \tag{12}$$
Substituting (11) into (10) yields

$$\sigma_m^2 + \mu_x^2 - \mu_x\lambda_2 + \frac{\lambda_2^2}{4} = \left[\mu_x^2 + \sigma_x^2\right](1 + \lambda_3)^2.$$

Further substituting (12):

$$\sigma_m^2 + \mu_x^2 + 2\mu_x^2\lambda_3 + \mu_x^2\lambda_3^2 = \left[\mu_x^2 + \sigma_x^2\right]\left\{1 + 2\lambda_3 + \lambda_3^2\right\} \;\Rightarrow\; 1 + \lambda_3 = \frac{\sigma_m}{\sigma_x}. \tag{13}$$
Substituting (12) and (13) into (8) yields:

$$d_i = \frac{\sigma_x}{\sigma_m}\left[m_i + \mu_x\frac{\sigma_m}{\sigma_x} - \mu_x\right] = \frac{\sigma_x}{\sigma_m}\left[m_i - \mu_x\right] + \mu_x, \qquad i = 1, \ldots, K. \tag{14}$$
Now, using (14),

$$d_i - m_i = \left[\frac{\sigma_x}{\sigma_m} - 1\right]\left[m_i - \mu_x\right].$$

Therefore, the objective function of problem (6) becomes

$$\sum_{i=1}^{K} p_i \left[m_i - d_i\right]^2 = \left[\frac{\sigma_x}{\sigma_m} - 1\right]^2 \sum_{i=1}^{K} p_i \left[m_i - \mu_x\right]^2 = \left[\frac{\sigma_x}{\sigma_m} - 1\right]^2 \sigma_m^2 = (\sigma_x - \sigma_m)^2. \tag{15}$$
Since σ_m < σ_x by Theorem 1, minimizing the objective function in (6) is equivalent to maximizing σ_m. The two constraints equalizing the mean and the variance are automatically satisfied by (14), and therefore the optimization problem (6) is equivalent to a much simpler problem:

$$\max_{p_i,\; i=1,\ldots,K} \left\{ \sum_{i=1}^{K} p_i m_i^2 \right\} \qquad \text{subject to:} \qquad \sum_{i=1}^{K} p_i = 1 \tag{16}$$
where the m_i (which are functions of the p_i) are given by (5). From the optimal vector p, the corresponding D is obtained by (14). It may be noted that when X has a distribution symmetric about its mean µ_x, the number of parameters in problem (16) can be reduced even further because

$$p_i = p_{K+1-i}. \tag{17}$$

Therefore, the number of free probabilities can be cut by almost half, depending on whether K is even or odd.
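As a bridge to the implementation in the next section, the short sketch below applies the rescaling (14) to a candidate pair (p, m); the helper name rescale.d is ours, and mu.x and sigma.x denote the (known) target mean and standard deviation.

```r
# A sketch of the rescaling step (14): given probabilities p and segment
# means m, produce d so that the discrete distribution matches the target
# mean mu.x and standard deviation sigma.x exactly.
rescale.d <- function(p, m, mu.x, sigma.x) {
  sigma.m <- sqrt(sum(p * m^2) - sum(p * m)^2)  # sd of M, from (2)
  (sigma.x / sigma.m) * (m - mu.x) + mu.x       # equation (14)
}
```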
3 Computation
For a continuously distributed random variable X and a given K, we have shown that the discrete distribution (p, D) can be obtained by solving the constrained nonlinear optimization problem in (16). For two specific distributions (standard normal, and Weibull with shape = 1.5 and scale = 1), we have implemented the approach in R; the execution details are available in Appendix (B). To solve the optimization, our R implementation uses the nloptr package (see Ypma, 2014), which is an interface to several free/open-source optimization algorithms. In searching for the optimum, we use equal probabilities as starting values. The code function m.vec.distribution evaluates m_i (see (5)) for a specified distribution (standard normal or Weibull). If a user wishes to adapt m.vec.distribution to other distributions of interest, the lower and upper bounds of δ_i (see (4)) should reflect the support of X, i.e. by changing a and b. The code functions fn.obj.distribution, fn.con.EQ and fn.con.INEQ evaluate the objective function, the equality constraint and the inequality constraint, respectively. When the support of X is (−∞, ∞), the inequality constraint is not needed. The code function fn.con.INEQ is to be used when the support of X is (0, ∞) and must be appropriately adjusted if other supports are desired. Recall that
$$m_i = \frac{1}{p_i}\int_{\delta_{i-1}}^{\delta_i} t f(t)\,dt, \qquad i = 1, 2, \ldots, K.$$
For the majority of continuous distributions, there is no explicit expression for m_i in terms of δ_i; we therefore suggest the use of numerical integration. To keep the R code general, the function m.vec.distribution above uses numerical integration; a minimal sketch of this setup for the standard normal case is given below.
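The following sketch follows the structure described above for the standard normal case; the function bodies and solver settings here are illustrative and may differ from the accompanying code.

```r
library(nloptr)

K <- 10  # number of discrete support points

# m.vec.distribution: segment means m_i implied by the probabilities p,
# via (4) and (5), using numerical integration (standard normal assumed).
m.vec.distribution <- function(p) {
  delta <- c(-Inf, qnorm(cumsum(p)[-K]), Inf)  # boundaries via (4)
  sapply(1:K, function(i)
    integrate(function(t) t * dnorm(t), delta[i], delta[i + 1])$value / p[i])
}

# fn.obj.distribution: negated objective of (16), since nloptr minimizes.
fn.obj.distribution <- function(p) -sum(p * m.vec.distribution(p)^2)

# fn.con.EQ: the equality constraint sum(p) = 1.
fn.con.EQ <- function(p) sum(p) - 1

# Equal probabilities as starting values; ISRES handles the nonlinear
# equality constraint together with the box bounds on p.
res <- nloptr(x0 = rep(1 / K, K),
              eval_f = fn.obj.distribution,
              eval_g_eq = fn.con.EQ,
              lb = rep(1e-6, K), ub = rep(1, K),
              opts = list(algorithm = "NLOPT_GN_ISRES",
                          xtol_rel = 1e-8, maxeval = 50000))
p.opt <- res$solution
m.opt <- m.vec.distribution(p.opt)
```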
For some continuous distributions, there is an explicit expression for m_i, avoiding the need for numerical approximation. Below we provide m_i for three such distributions.

Standard normal distribution

$$m_i = \frac{1}{\sqrt{2\pi}\,p_i}\left(e^{-\delta_{i-1}^2/2} - e^{-\delta_i^2/2}\right), \qquad i = 1, \ldots, K$$

where δ_i is obtained by (4), δ_0 = −∞ and δ_K = ∞.

The Student-t distribution with v degrees of freedom

$$m_i = \frac{\sqrt{v/\pi}\;\Gamma[(v+1)/2]}{p_i\,(v-1)\,\Gamma[v/2]}\left(\left(1 + \delta_{i-1}^2/v\right)^{(1-v)/2} - \left(1 + \delta_i^2/v\right)^{(1-v)/2}\right), \qquad i = 1, \ldots, K$$

where δ_i is obtained by (4), δ_0 = −∞ and δ_K = ∞. Note that both the standard normal and Student-t distributions are symmetric about µ_x = 0. By symmetry,

$$\delta_{K-i} = -\delta_i. \tag{18}$$
Using (18) and (17), and noting that f(−x) = f(x),

$$m_{K+1-i} = \frac{1}{p_{K+1-i}}\int_{\delta_{K-i}}^{\delta_{K+1-i}} x f(x)\,dx = \frac{1}{p_i}\int_{-\delta_i}^{-\delta_{i-1}} x f(x)\,dx = -m_i,$$

further reducing the computational requirement.
The Exponential distribution with rate λ

$$m_i = \frac{1}{\lambda\,p_i}\left(\left(1 + \lambda\delta_{i-1}\right)e^{-\lambda\delta_{i-1}} - \left(1 + \lambda\delta_i\right)e^{-\lambda\delta_i}\right), \qquad i = 1, \ldots, K$$

where δ_i is obtained by (4), δ_0 = 0 and δ_K = ∞.
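For instance, the standard normal closed form can replace the numerical integration in the earlier sketch; the helper below (the name is ours) is a minimal version.

```r
# Closed-form m_i for the standard normal: m_i = (phi(delta_{i-1}) -
# phi(delta_i)) / p_i, with phi(+-Inf) = 0, so no numerical integration.
m.vec.normal.closed <- function(p) {
  K <- length(p)
  delta <- c(-Inf, qnorm(cumsum(p)[-K]), Inf)   # boundaries via (4)
  (dnorm(delta[1:K]) - dnorm(delta[2:(K + 1)])) / p
}
```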
4 Discussion
In practice, the value of a particular discretization is determined by the closeness of the resulting discrete distribution to the underlying continuous distribution with respect to some desired characteristics of the continuous distribution. The proposed discretization is defined through minimization of

$$SS(K) = \sum_{i=1}^{K} p_i \left[m_i - d_i\right]^2;$$

equivalently, maximizing the objective function in (16). In this section, we consider five continuous distributions: standard uniform, beta (shape = scale = 1.25), standard normal, Weibull (shape = 1.5, scale = 1) and exponential (λ = 1), to show that by optimally minimizing the criterion SS(K), the proposed discretization is flexible in capturing relevant features of the continuous distribution beyond mean and variance. We label this optimal approach "method O". Using the five distributions, we examine how well a discretization approximates the kurtosis (a higher-order feature) of a given continuous distribution. We consider K ∈ [5, 20] in increments of 1. For a discrete distribution (p, D), we obtain the kurtosis as

$$\sum_{i=1}^{K} p_i (d_i - \mu)^4 / \sigma^4.$$

As the benchmark discretization (labeled "method B"), we consider the equal-probability discretization, i.e. p_i = 1/K, i = 1, 2, ..., K; to make sure that this discretization maintains the mean and variance of the continuous distribution, the corresponding d_i is obtained by (14). Note that the equal-probability discretization is not optimal in the sense of minimizing SS(K) because, for any K, it will result in a larger SS(K) (denoted here by SS^{(B)}(K)) than the SS(K) which results from allowing the p_i's to be optimally determined (denoted here by SS^{(O)}(K)). So, the ratio R(K) = SS^{(O)}(K)/SS^{(B)}(K) will be less than 1. Comparisons are shown in Figure 1. The first column shows the density plots of the continuous
distributions. Using "method O", the second column shows plots of (p, D) for several K values. For a given K, the third column plots kurtosis values from "method O" (shown in solid lines) and from "method B" (shown in dashed lines). The horizontal line imposed in each plot is the true kurtosis, which is analytically known. The fourth column shows plots of R(K).

Figure (1) here
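For concreteness, a minimal sketch of "method B" for a standard normal X is given below; it reuses the hypothetical helpers m.vec.normal.closed and rescale.d sketched earlier.

```r
# "Method B": equal probabilities p_i = 1/K, with d obtained via (14) so
# that mean 0 and variance 1 of the standard normal are preserved.
method.B <- function(K) {
  p <- rep(1 / K, K)
  m <- m.vec.normal.closed(p)
  d <- rescale.d(p, m, mu.x = 0, sigma.x = 1)
  kurt <- sum(p * d^4)   # kurtosis of (p, d), since mu = 0 and sigma = 1
  list(p = p, d = d, kurtosis = kurt)
}
```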
For the uniform distribution, "method O"'s optimally chosen p's are 1/K (see row 1 and column 2) and hence "method O" reduces to "method B". The beta (shape = scale = 1.25) distribution is similar to the standard uniform over a large part of its support. Reflecting that pattern, the optimal p's from "method O" resemble the equal-probability case (see row 2 and column 2), and consequently R(K) is also close to 1. The performance (in terms of kurtosis approximation) of "method B" is also close to that of "method O". In contrast to the standard uniform and beta (shape = scale = 1.25), the R(K) value for the standard normal is much lower than 1, and the kurtosis approximations of "method O" and "method B" also differ, with "method O" delivering a more accurate kurtosis value. For the non-symmetric distributions (exponential and Weibull), R(K) is even lower than in the standard normal case, and the kurtosis approximation from "method B" is much worse than that from "method O". Furthermore, kurtosis accuracy from "method B" does not seem to improve much with higher K. This illustration indicates that, by optimally minimizing SS(K) (beyond preserving mean and variance), the resulting discrete distribution appears to be flexible in adapting to the overall shape of the underlying continuous distribution (column 1 versus column 2), and such adaptation is reflected in the ability of the proposed discretization to capture higher-order features such as kurtosis.
5 Application-guided analytic examples
Roy and Dasgupta (2001) proposed a discretization approach for use in reliability evaluation; refer to Section 1 for some context. To demonstrate the merit of their method, they considered four stylized engineering problems and reported that their approach (in terms of approximating reliability) is far superior to a benchmark discretization by English et al. (1996). We applied our proposed discretization to the same four engineering examples as in Roy and Dasgupta (2001), and the reliability estimates based on the proposed discretization are substantially more accurate. To keep the generality of our contribution, we do not report these results, but they are readily available upon request. Guided by the important role discretization plays in evaluating the reliability of a system, we construct analytically tractable examples (not necessarily reliability-specific) to further motivate the value of the proposed discretization. The problem is to determine the probability distribution of a response Y which is defined as a function of N independent random sub-components. We consider three examples.

Example 1:
$$Y = \frac{\sqrt{2(N-1)}\,Z}{\sqrt{\sum_{j=1}^{N-1} E_j}} \tag{19}$$
where Z is standard normal and E_1, ..., E_{N−1} are each exponentially distributed with λ = 0.5. It can be shown analytically that Y has a t distribution with 2(N − 1) degrees of freedom; a quick simulation check of this claim is sketched below.
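The snippet is ours, not from the paper, and simply verifies the stated distributional result by Monte Carlo.

```r
# Monte Carlo check that (19) yields a t distribution with 2(N - 1)
# degrees of freedom, here for N = 4 (i.e. t with 6 df).
set.seed(1)
N <- 4
Z <- rnorm(1e5)
E <- matrix(rexp(1e5 * (N - 1), rate = 0.5), ncol = N - 1)
Y <- sqrt(2 * (N - 1)) * Z / sqrt(rowSums(E))
ks.test(Y, pt, df = 2 * (N - 1))   # p-value should be large
```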
Example 2:

$$Y = \exp\left(0.15 \sum_{j=1}^{N} Z_j\right) \tag{20}$$

where the Z_j are each standard normal. Note that ln Y is normally distributed with zero mean and standard deviation 0.15√N.
Example 3:

$$Y = Z_1^2 + Z_2^2 + 3\sum_{j=3}^{N} E_j \tag{21}$$
where Z_1 and Z_2 are each normally distributed with zero mean and variance 1.5. The random variables E_3, ..., E_N are each exponentially distributed with λ = 1. It can be shown analytically that Y has a Gamma distribution with shape = N − 1 and scale = 3. To examine the merit of the proposed discretization in capturing distributional features of Y, we focus on approximating the τ-quantile of Y, F^{-1}(τ), for τ ∈ [0.1, 0.9], via a K-point discretization of the N random sub-components using the proposed approach. For a given K and N, a discretized Y, denoted by y_{D,i}, is obtained from (19), (20) and (21) at all N* = K^N combinations of the discretized sub-components. The probabilities corresponding to the y_{D,i}, denoted by p_{Y,i}, are obtained by multiplying the probabilities of the discretized sub-components, which is possible by independence. Then, with the y_{D,i} arranged in ascending order, the CDF of the discretized distribution of Y is given by

$$G(y_{D,i}) = \sum_{j=1}^{i} p_{Y,j}.$$
Finally, we obtain the discretization-based quantiles of Y as G^{-1}(τ) = min{y_D : G(y_D) ≥ τ}. We consider K ∈ [5, 12] in increments of 1 and N = 4, 5, and 6. Results for Example 1 are shown in the first row of Figure (2) (for N = 4, 5, and 6, respectively), where the dashed lines correspond to G^{-1}(τ) and the solid line is F^{-1}(τ) (the true quantile, which is analytically known). Note that each G^{-1}(τ) is based on a particular K. Similarly, the second and third rows of Figure (2) correspond to Example 2 and Example 3, respectively. A sketch of this quantile computation for Example 2 is given below.
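The sketch assumes p.opt and the corresponding d-values d.opt were obtained as in Section 3; all K^N combinations are enumerated, so it is only practical for small K and N.

```r
# Discretization-based tau-quantile of Y in Example 2, enumerating all
# K^N combinations of the discretized sub-components.
example2.quantile <- function(p.opt, d.opt, N, tau) {
  K <- length(p.opt)
  idx <- as.matrix(expand.grid(rep(list(1:K), N)))  # K^N index combinations
  yD <- exp(0.15 * rowSums(matrix(d.opt[idx], ncol = N)))  # equation (20)
  pY <- apply(matrix(p.opt[idx], ncol = N), 1, prod)       # independence
  ord <- order(yD)
  G <- cumsum(pY[ord])                   # CDF of the discretized Y
  yD[ord][which(G >= tau)[1]]            # G^{-1}(tau)
}
```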
Figure (2) here
Without exception, the results for the examples show that the discretization-based quantiles are very close to the true quantiles of Y, and this seems to hold even for K as small as 5.
References

[1] Barbiero A. "A general discretization procedure for reliability computation in complex stress-strength models," Mathematics and Computers in Simulation, 82, 1667-1676 (2012).

[2] D'Errico J.R. and N.A. Zaino Jr "Statistical tolerancing using a modification of Taguchi's method," Technometrics, 30, 397-405 (1988).

[3] English J.R., T. Sargent, and T.L. Landers "A discretizing approach for stress/strength analysis," IEEE Transactions on Reliability, 45, 84-89 (1996).

[4] Katz D. "Discrete approximations to continuous density functions that are L1 optimal," Computational Statistics & Data Analysis, 1, 175-181 (1983).

[5] Roy D. and T. Dasgupta "A discretizing approach for evaluating reliability of complex systems under stress-strength model," IEEE Transactions on Reliability, 50, 145-150 (2001).

[6] Ypma J. "nloptr: an R interface to free/open-source library for nonlinear optimization (R package)," http://cran.r-project.org/web/packages/nloptr/index.html (2014).
Appendix (A)

Proof of Theorem 1

Because µ_m = µ_x, it suffices to prove the result on the second moments of M and X. Using (2),

$$\sum_{i=1}^{K} p_i m_i^2 = \sum_{i=1}^{K} \left[\int_{\delta_{i-1}}^{\delta_i} x f(x)\,dx\right]^2 \Big/ \int_{\delta_{i-1}}^{\delta_i} f(x)\,dx. \tag{22}$$

First, note that

$$\left(E\left[X \mid \delta_{i-1} < X \le \delta_i\right]\right)^2 < E\left[X^2 \mid \delta_{i-1} < X \le \delta_i\right] \tag{23}$$

and

$$f(x \mid \delta_{i-1} < X \le \delta_i) = \frac{f(x)}{\int_{\delta_{i-1}}^{\delta_i} f(x)\,dx}. \tag{24}$$

Now, using (24), (23) leads to

$$\left[\int_{\delta_{i-1}}^{\delta_i} x f(x)\,dx\right]^2 \Big/ \int_{\delta_{i-1}}^{\delta_i} f(x)\,dx < \int_{\delta_{i-1}}^{\delta_i} x^2 f(x)\,dx. \tag{25}$$