Extension of sample size in Latin Hypercube Sampling with correlated variables

Miroslav Vořechovský
Institute of Structural Mechanics, Faculty of Civil Engineering, Brno University of Technology, Veveří 95, 602 00 Brno, Czech Republic, [email protected]

Abstract. In this paper, we suggest principles of a novel simulation method for analyses of functions g(X) of a random vector X, suitable for cases when the evaluation of g(X) is very expensive. The method is based on the Latin Hypercube Sampling strategy. The paper explains how the statistical, sensitivity and reliability analysis of g(X) can be divided into a hierarchical sequence of simulations with subsets of samples of the random vector X such that (i) the favourable properties of LHS are retained (a low number of simulations needed for statistically significant estimates of statistics of g(X) with low variability of the estimates); and (ii) all subsets can be merged into one consistent set at any time (i.e., the simulation process can be halted, e.g., when a certain prescribed statistical significance of the estimates is reached). An important aspect of the method is that it efficiently simulates the subsets of samples of random vectors with a focus on their correlation structure. The procedure is quite general and can be applied to other simulation techniques (e.g., crude Monte Carlo). The method should serve preferably as a tool for very complex and intensive analyses of nonlinear problems g(X) (involving random/uncertain phenomena) where there is a need for pilot numerical studies, preliminary and subsequently refined estimates of statistics, progressive learning of neural networks, or design of experiments.

Keywords: simulation, Latin hypercube sampling, correlation, progressive sampling, design of experiments, adaptive sample size, neural network learning, response surface, simulated annealing

1. Introduction

In this paper, we deal with the estimation of statistics of a function g(X). Consider a deterministic function Z = g(X), represented by a computational model or a physical experiment (that is expensive to compute/evaluate), where Z ∈ R (or even a vector variable) and X ∈ R^{N_var} is a random vector of N_var marginals (input random variables describing uncertainties). The information on the random vector is limited to the marginal probability distributions (with their parameters) and the correlation matrix T (a symmetric square matrix of order N_var). The task is to perform statistical, sensitivity and possibly reliability analysis of Z given the information above. Suppose an analytical treatment of the transformation of the input variables to Z is not possible. Statistical and probabilistic analyses can be viewed as estimations of probabilistic integrals. Given the joint cumulative distribution function (CDF) of the input random vector, F_X(x), and the output, i.e., the function g(X) of the random vector, the estimate of the mean value of g(·) is, in fact, an approximation to the following integral:


\mu_g = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\mathbf{x}) \, \mathrm{d}F_{\mathbf{X}}(\mathbf{x}) \qquad (1)

Higher statistical moments of the response can be obtained by integrating polynomials in g(·). The probability of failure is obtained similarly; the function g(·) is replaced by the Heaviside (indicator) function H[−g(X)], which equals one for the failure event (g < 0) and zero otherwise. In this way, the domain of integration of the joint CDF above is limited to the failure domain.

The most prevalent technique for this task is Monte Carlo simulation (MCS). MCS is popular for its simplicity and transparency and is also used in benchmarks of other (specialized) methods. In Monte Carlo type techniques, the above integrals are numerically estimated using the following procedure: (i) draw N_sim realizations of X using its distribution F_X(x); (ii) compute the same number of output realizations of Z using the model g(·); and (iii) process the results, see Fig. 1. The main objective of sampling is to select a set of sample points so as to maximize the amount of important information that can be extracted from the output data. The joint CDF is usually defined just by the marginals and covariances, and this information is used for weighting the samples generated by the methods. In other words, the sample set should be as close as possible to the target joint probability density function (PDF), i.e., fulfill the marginal distributions F_1, …, F_{N_var} and the target correlation matrix T.

Since g(·) is expensive to compute (or otherwise evaluate), it pays to use a more sophisticated sampling scheme. A good choice is one of the "variance reduction techniques" called Latin Hypercube Sampling (LHS). The technique was first developed by Conover (Conover, 1975) and later elaborated mainly in (McKay et al., 1979; Iman and Conover, 1980; Ayyub and Lai, 1989). It is a representative of stratified sampling. The LHS strategy has been used by many authors in different fields of engineering and with both simple and very complicated computational models. LHS is suitable especially for statistical and sensitivity calculations. However, it can also be used for probabilistic assessment within the framework of curve fitting, and it can be combined with the Importance Sampling technique to minimize the variance of probability-of-failure estimates by sampling an importance density around the design point (the most probable point on the limit state surface g(X) = 0). Optimal coverage of a space of many variables with a minimum number of samples is also an issue in design of experiments, and LHS and related sampling techniques have their place in that field.

When applying LHS, the choice of sample size is a practical difficulty. A small sample size may not give statistically significant results, while a large sample size may not be feasible for simulations that take hours to compute (an example of such a situation is the statistical analysis of nonlinear fracture mechanics problems, e.g., Vořechovský, 2004; Vořechovský, 2007a). In many computer analyses, the sample size needed to give adequate statistics cannot be determined a priori. Therefore, the ability to extend and refine the experimental design may be important. It is thus desirable to start with a small sample and then extend (refine) the design (use more representative points in the estimation of the integrals) if it is deemed necessary. One needs, however, a sampling technique in which the pilot analyses can be followed by additional sample set(s) without the need to discard the results of the previous sample set(s). This problem is depicted in Fig. 1.


In crude MCS, a new sample subset can be added to the rest without any violation of the consistency of the whole sample set. However, if some kind of variance reduction technique (such as the LHS studied here) is used, one has to proceed with special care to keep the consistency and the variance reduction property at the same time. This paper presents the techniques first suggested in (Vořechovský, 2006) and later extended in another local publication (Vořechovský, 2007b). In Fig. 1 we also illustrate the possible need to increase the dimension of the input random vector: adding a new variable to the current design does not cause any problem in the proposed technique (for either MCS or LHS).

2. The methods: Univariate sampling

First, we examine the possibilities of sampling refinement in the case of sampling from a single random variable X. The algorithms will be presented using sampling from a uniformly (rectangularly) distributed variable U over the unit interval (0, 1), U ∼ R(0, 1), for which we can write:

P(U \le u) \equiv F_U(u) = \begin{cases} 0 & \text{for } u \le 0 \\ u & \text{for } u \in (0, 1) \\ 1 & \text{for } u \ge 1 \end{cases} \qquad (2)

The reason for such a choice is that sampling from U can be understood as the selection of sampling probabilities u_i = π_i, which can be transformed to an arbitrary continuously distributed variable X with P(X ≤ x) ≡ F_X(x) by the inverse transformation of its cumulative distribution function:

x_i = F_X^{-1}(u_i) = F_X^{-1}(\pi_i) \qquad (3)

Figure 1. Illustration of the main idea: Monte Carlo sampling with additional samples. The inputs (random vector X with N_var variables, possibly extended by eventual new variables) consist of an initial sample set and an additional sample set (N_{l−1} and n_l simulations, respectively), merged under correlation control over the inputs into an aggregated sample set of N_l simulations; the outputs Z_1, Z_2, …, Z_N of the function(s) g(X) of the random vector X (a computer model or physical experiment) are then used to estimate correlations between inputs and outputs (sensitivities).
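The inverse transformation in Eq. (3) is straightforward to apply in practice. The following sketch (Python with NumPy/SciPy; the function name and the chosen lognormal marginal are illustrative assumptions, not part of the paper) shows how a set of sampling probabilities could be mapped to realizations of an arbitrary continuous marginal:

```python
import numpy as np
from scipy import stats

def probabilities_to_realizations(pi, marginal):
    """Transform sampling probabilities pi (values in (0, 1)) to realizations
    of a continuous random variable via the inverse CDF, Eq. (3)."""
    pi = np.asarray(pi, dtype=float)
    return marginal.ppf(pi)          # ppf() is the inverse CDF F_X^{-1}

# Example: LHS-median probabilities for n_sim = 8 samples (cf. Eq. 5)
n_sim = 8
pi = (np.arange(1, n_sim + 1) - 0.5) / n_sim

# Illustrative marginal (any scipy.stats continuous distribution would do)
marginal = stats.lognorm(s=0.25, scale=1.0)
x = probabilities_to_realizations(pi, marginal)
print(x)
```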


We know that LHS belongs to the category of stratified sampling techniques, where the probability region is divided into a set of disjoint intervals and samples are selected from these intervals. Optimality of sampling with respect to the random variable's distribution is achieved by a uniform selection of sampling probabilities, i.e., by dividing the probability region into N_sim equiprobable disjoint intervals I_i (i = 1, …, N_sim), each with a probability content of 1/N_sim. From each such i-th interval, one value π_i is chosen and transformed to a realization x_i via Eq. (3). Several possibilities for selecting the sampling probabilities π_i are available; see (Vořechovský, 2004; Vořechovský and Novák, 2009) for an overview.

Now, we must find a strategy for selecting a sequence of sampling probabilities that are uniformly distributed over the probability domain at any instant of sampling. Let us define a run r with its subset of sampling probabilities π. Each run is associated with a certain level l. Moreover, in each run r_l we have a subset of sampling probabilities π_l with a number of simulations n_l. The problem is to find an optimal sequence of these subsets. In this text, we propose and describe two particular designs of aggregated sample sets, denoted LLHS and HSLHS from here on. In both methods, we use the property that some of the already selected sampling probabilities constitute the parent subset π_l; its child subset π_{l+1} is constructed such that each sampling probability "generates" two other sampling probabilities.

In the first design (LLHS), we make use of the LHS method for each additional subset. Therefore, computer implementation into current sampling codes becomes extremely easy. Also, each subset can be treated separately as an unbiased LH-sample set. The total sample set is, however, only an approximation of the exact LH-sample set of the same size. Also, one has to use particular sample sizes to utilize the technique. The second design (HSLHS), a more flexible and efficient one, adds subsets in which all the previous sampling probabilities are used to generate two new sampling probabilities each. The additional subsets are not LH-samples alone. However, if such a subset is combined with all previous subsets, one obtains an exact LH-sample set with a number of simulations equal to the sum of the numbers of simulations of all of them. The size N_sim of the initial LH-sample can be an arbitrary number.

2.1. LLHS (composition of LH-samples)

A simple alternative fully based on aggregation of classical LH-samples can be suggested as follows. The first level (numbered 0) contains just one simulation. For a general level l, we set the number of simulations in the additional subset to:

n_l = 2^l \qquad (4)

At each level, the set π_l of added sampling probabilities is chosen from the probability intervals as in the usual LHS. For example, we can use the probabilistic median of each interval, i.e.:

\pi_l = \{\pi_1, \pi_2, \ldots, \pi_i, \ldots, \pi_{n_l}\}: \qquad \pi_i = \frac{i - 0.5}{n_l} \qquad (5)

By replacing the number 0.5 in the numerator, we can obtain other alternatives of sampling probability selection. Generally, each sampling probability must be bounded by the interval π_i ∈


(b_low; b_upp), where b_low = (i − 1)/n_l is the lower bound and b_upp = i/n_l is the upper bound. There are two possible scenarios of the progressive simulation, and we describe them next.

Scenario I. The simplest simulation procedure is to perform a sequence of runs (levels) of simulations starting right from level 0 (n_0 = 1) and proceeding with level 1 (n_1 = 2), then level 2 (n_2 = 4), etc. The simulation procedure can be stopped after completing an arbitrary number of consecutive runs (levels). The total number N_l of simulations at a certain level l can be computed as the sum of the simulations at all complete preceding levels, N_{l−1}, plus n_l:

N_l = n_l + N_{l-1} = \sum_{k=0}^{l} n_k = \sum_{k=0}^{l} 2^k = 2^l + 2^l - 1 = 2^{l+1} - 1 \qquad (6)

The sampling probabilities for levels 0, 1, …, 5 are plotted in Fig. 2. In the left part of the figure, we plot the sampling probabilities π_l and the interval bounds for all levels l = 0, …, 5. The total number of simulations N_l is always odd. The right-hand side shows the total sampling probabilities for all completed levels up to l (denote them Π_l); they are compared to the sampling probabilities of classical LHS with the same number of simulations.

Scenario II. Another (better) possibility is to start immediately at some nonzero level l. To proceed, one must first fill all the lower levels down to level zero in one step (which means almost doubling the current number of simulations n_l, because N_{l−1} = n_l − 1 simulations must be performed). If yet more subsets of simulations are requested, the procedure follows Scenario I, i.e., by proceeding with levels l + 1, l + 2, etc. To be concrete, suppose the analyst starts with 8 LH-simulations (level 3). In the first refinement, he must add, at once, 7 (= 4 + 2 + 1) simulations that correspond to all preceding levels. Further refinements continue by adding 16, 32, … simulations, corresponding to levels 4, 5, …. The advantage is that correlation control is somewhat more efficient when dealing with larger sample sets: in this example of Scenario II, we never deal with a sample size of less than 8.

Note that in both scenarios we require that, when adding a certain level l, all preceding levels must already have been processed. When this requirement is not met, we lose the property present in LHS, i.e., the reduced variance of the estimates, and we approach the performance of crude Monte Carlo. The reason is that if any level is skipped, no sampling probability gets repeated, but some of them are missing. It can be seen (Fig. 2, right) that a completed sequence of LLHS at a certain level tends to be somewhat more grouped around the middle of the probability region than LHS of the same sample size. In contrast, the true LH-sampling probabilities are always perfectly uniformly distributed over the range (0, 1). This is actually a disadvantage of the proposed method, because this effect will obviously lead to underestimation of the actual variance and distortion of the estimates of the whole PDF of the response function g(X). A sketch of the LLHS construction is given below.
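The following sketch (Python/NumPy; not the author's code, and the function names are illustrative) shows one way the LLHS levels could be generated with the median rule of Eq. (5) and n_l = 2^l from Eq. (4), together with the aggregation of completed levels:

```python
import numpy as np

def llhs_level_probabilities(level):
    """Sampling probabilities of the additional LLHS subset at a given level:
    n_l = 2**level values, placed at the probabilistic medians (Eq. 5)."""
    n_l = 2 ** level
    i = np.arange(1, n_l + 1)
    return (i - 0.5) / n_l

def llhs_aggregated_probabilities(max_level):
    """Union of all subsets up to max_level (Scenario I).
    Total size N_l = 2**(max_level + 1) - 1, cf. Eq. (6)."""
    subsets = [llhs_level_probabilities(l) for l in range(max_level + 1)]
    return np.sort(np.concatenate(subsets))

for l in range(4):
    print(l, llhs_level_probabilities(l))
print("aggregate up to level 3:", llhs_aggregated_probabilities(3))
# The aggregate contains 2**4 - 1 = 15 values; note that they are not the
# exact LHS medians for N = 15, only an approximation (see Fig. 2, right).
```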


Figure 2. LLHS design (levels l = 0, …, 5; numbers of simulations in the new subsets n_l = 1, 2, 4, 8, 16, 32 and total numbers of simulations N_l = 1, 3, 7, 15, 31, 63). Left: selection of sampling probabilities π = F_X(x) at each added level separately, with the interval bounds b. Right: comparison of the total sampling probabilities of the proposed LLHS with the classical LHS (median rule).

Figure 3. HSLHS design with one sample in the initial design (levels l = 0, …, 4; numbers of simulations in the new subsets n_l = 1, 2, 6, 18, 54 and total numbers of simulations N_l = 1, 3, 9, 27, 81). Left: selection of sampling probabilities π = F_X(x) at each added level separately, with the interval bounds b. Right: comparison of the total sampling probabilities of the proposed HSLHS with the classical LHS (median rule).

2.2. HSLHS design

Assume again the starting situation in which an initial LHS design has been completed and evaluated; see the initial sample set in Fig. 1. We want to add a new design such that the aggregate design will be a true LHS design. The newly added subset itself need not be a true LHS design. Also, we may want to remove the severe limitation of the LLHS technique: one may require refining a current design that has an arbitrary number of simulations N_{l−1} at the current stage l. In order to proceed slowly, the new design is proposed to double the previous total sample size, i.e., given the current N_{l−1}, the new subset has n_l = 2N_{l−1}. Therefore, the total sample size triples:

N_l = N_{l-1} + n_l = N_{l-1} + 2N_{l-1} = 3N_{l-1} = N_0 \, 3^l \qquad (7)

If, for example, the first design (level l = 0) started with just one simulation (N_0 = 1), the design at level l has a total of N_l = 3^l simulations; see Fig. 3 (right). In order to maintain perfectly uniformly distributed sampling probabilities at every stage l, the sampling probabilities must be obtained as medians of the respective intervals (Eq. 5). The proposed procedure is illustrated in Fig. 3 for the case when the initial design has one simulation only (N_0 = 1). The left-hand side plot shows how the additional subsets evolve, while the right-hand side shows the aggregated designs and the splitting of each sampling probability into two new ones. A sketch of one HSLHS refinement step follows below.
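To make the construction concrete, the following sketch (Python/NumPy; illustrative only, assuming the current design uses the median rule of Eq. 5) generates the additional HSLHS subset for one refinement step. Every existing probability "generates" two new ones, and the union is again an exact LHS-median set of triple size:

```python
import numpy as np

def hslhs_refine(pi_current):
    """One HSLHS refinement step for an LHS-median design.

    Given the current sampling probabilities (medians (i - 0.5)/N of N
    equiprobable intervals), return the 2N new probabilities that, merged
    with the current ones, form the exact LHS-median design of size 3N
    (cf. Eq. 7).
    """
    pi_current = np.sort(np.asarray(pi_current, dtype=float))
    n_old = pi_current.size
    n_new_total = 3 * n_old
    all_medians = (np.arange(1, n_new_total + 1) - 0.5) / n_new_total
    # The old medians coincide with every third new median (j = 3i - 1),
    # so the added subset consists of the remaining 2N values.
    mask = np.ones(n_new_total, dtype=bool)
    mask[1::3] = False                 # positions j = 2, 5, 8, ... (1-based)
    return all_medians[mask]

# Example: start from a single simulation and refine twice (1 -> 3 -> 9).
pi = np.array([0.5])
for level in range(1, 3):
    new = hslhs_refine(pi)
    pi = np.sort(np.concatenate([pi, new]))
    print(f"level {level}: total {pi.size} probabilities:", pi)
```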


As becomes clear from the comparison of Figs. 2 and 3, in the HSLHS technique the sample size grows more rapidly (compare Eqs. 6 and 7).

3. The methods: Multivariate sampling

3.1. Correlation control for adding and merging simulations

In order to obtain meaningful correlations between the input and output variables of the model, it is essential to capture the input correlations precisely in the simulated values. Several methods are available to control correlations by changing the sample ordering while leaving the values themselves untouched. We propose to use the simulated annealing algorithm for optimizing the combinatorial task of rank ordering, a method that seems to be the most efficient for correlation control over samples of univariate marginals with fixed values (Vořechovský, 2004; Vořechovský and Novák, 2002; Vořechovský and Novák, 2009).

We now show that the computation of the target correlation coefficient of the additional sample set is very simple. Given the previous sample set, one can pre-calculate its contribution to the actual correlation matrix and accommodate the target correlation for the additional sample set accordingly. We show it using the linear Pearson correlation; extensions to other coefficients are simple (e.g., computation of the Spearman correlation is identical except that the sample values are replaced by integer ranks, see Vořechovský and Novák, 2009). The linear Pearson correlation estimator on the data set of a pair of random variables is given by:

\rho_{xy} = \frac{\sum_{i=1}^{N_l} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N_l} (x_i - \bar{x})^2 \; \sum_{i=1}^{N_l} (y_i - \bar{y})^2}} \qquad (8)

We know that the dispersion (variance) D and the standard deviation σ are defined for both X and Y equally as (replace index "x" by "y" to obtain the estimates for variable Y):

\sum_{i=1}^{N_l} (x_i - \bar{x})^2 = (N_l - 1)\, D_x, \qquad \sigma_x = \sqrt{D_x} \qquad (9)

We exploit these formulas to rewrite the correlation coefficient estimate as:

\rho_{xy} = \frac{\sum_{i=1}^{N_l} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{(N_l - 1) D_x \, (N_l - 1) D_y}} = \frac{\sum_{i=1}^{N_l} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y (N_l - 1)} \qquad (10)

Assume now, without loss of generality, standardized variables, i.e., x̄ = 0, ȳ = 0, σ_x = 1, σ_y = 1, N_l > 1. The coefficient then reads:

\rho_{xy} = \frac{\sum_{i=1}^{N_l} x_i y_i}{N_l - 1} \qquad (11)

Suppose all the values of the sample sets are further standardized by 1/\sqrt{N_l - 1}. The correlation can then be computed just as the dot product:

\rho_{xy} = \sum_{i=1}^{N_l} x_i y_i = \sum_{i=1}^{N_{l-1}} x_i y_i + \sum_{i=1}^{n_l} x_i y_i \qquad (12)
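A minimal sketch of this bookkeeping is given below (Python/NumPy; the function and variable names are illustrative assumptions, and the normalization by the target statistics rather than by sample estimates anticipates the recommendation given in the following paragraph). It pre-computes the contribution of the existing subset to the correlation and derives the contribution that the additional subset must deliver so that the aggregated set matches the target:

```python
import numpy as np

def standardize(values, target_mean, target_std, n_total):
    """Standardize by the *target* statistics and scale by 1/sqrt(N_l - 1),
    so that correlations become plain dot products (Eq. 12)."""
    return (np.asarray(values, float) - target_mean) / (target_std * np.sqrt(n_total - 1))

def required_new_contribution(x_old, y_old, rho_target,
                              target_mean=0.0, target_std=1.0, n_total=None):
    """Dot-product contribution the additional subset must provide so that the
    aggregated sample of size n_total = N_l reaches rho_target."""
    if n_total is None:
        raise ValueError("n_total (aggregated sample size N_l) must be given")
    xs = standardize(x_old, target_mean, target_std, n_total)
    ys = standardize(y_old, target_mean, target_std, n_total)
    old_contribution = np.dot(xs, ys)        # first sum in Eq. (12)
    return rho_target - old_contribution     # what the new subset must add

# Illustrative use: an existing subset of 9 standard Gaussian samples,
# to be extended by 18 new samples (one HSLHS step), target correlation -0.9.
rng = np.random.default_rng(1)
x_old, y_old = rng.standard_normal(9), rng.standard_normal(9)
need = required_new_contribution(x_old, y_old, rho_target=-0.9, n_total=27)
print("dot product required from the additional subset:", need)
```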

This dot-product format is very suitable for computing correlations when the whole sample set is divided into two subsets of samples of both variables (of sizes N_{l−1} and n_l), see Fig. 1. We should make clear that the assumed zero mean and unit variance enabled us to develop the correlation coefficient as a weighted average of the two correlation coefficients of the two subsets. Note that, in reality, the correlation coefficient is often computed on each subset separately using estimates of the mean (average) and sample variance. The above development is valid only when the means and variances of the two subsets match. Therefore, for the simulation of small samples, we recommend using the target statistics instead of sample estimates for normalizing the data when computing the correlation coefficients. It turns out that the subsequent addition of samples has almost no negative impact on the quality of the correlation structure of the resulting sample set. More information on the performance of the correlation control is available in other papers (Vořechovský, 2009a; Vořechovský, 2009b).

3.2. Graphical illustration of the techniques

To explain the suggested techniques for higher-dimensional problems, we illustrate them in just two dimensions; the extension to higher dimensions is obvious and most of the above conclusions hold. First, we simulate a pair of statistically uncorrelated uniform variables U_1 and U_2. Each marginal is simulated separately by the methods proposed in the previous section. The vector is then obtained by the pairing method based on the optimization algorithm described in (Vořechovský, 2004; Vořechovský and Novák, 2002; Vořechovský and Novák, 2009); a simplified sketch of this rank-reordering idea is given below. In Figs. 4 and 5, we plot the evolution of the growing sample set generated by LLHS and HSLHS. Traces of the vector realizations (univariate realizations) are plotted on the ordinate and abscissa. From the figures, we see how the sample subset at each level (on the left-hand side) gets combined with the previous subsets to constitute a consistent sample set (see the right-hand side plots). In the case of sampling from a random vector, Scenario II of LLHS is better than Scenario I, because the correlation control has a somewhat richer "manipulation space", i.e., a greater number of simulations is always processed at a time, giving the correlation algorithm a better chance to fulfill the target correlation structure (Vořechovský and Novák, 2002; Vořechovský, 2009a; Vořechovský, 2009b). Using Fig. 4 we document the disadvantage of Scenario I. It can be seen that the number of simulations at level 1 is two, and therefore there are only two possible mutual orderings of the univariate samples: the pattern is either one of perfect positive or of perfect opposite (negative) dependence. If there were more simulations in that run, the (usually unwanted) strong pattern would not occur.
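The following is a simplified sketch of that rank-reordering idea for a single pair of variables (Python/NumPy/SciPy). It is not the author's full algorithm, only a minimal simulated-annealing loop with an assumed pairwise-swap proposal and a simple geometric cooling schedule, optimizing the ordering of one column towards a target Spearman correlation:

```python
import numpy as np
from scipy import stats

def anneal_ordering(x, y, rho_target, n_steps=20000, t0=0.1, cooling=0.9995, seed=0):
    """Reorder the values of y (keeping the values themselves fixed) so that
    the Spearman rank correlation with x approaches rho_target."""
    rng = np.random.default_rng(seed)
    y = np.array(y, dtype=float)
    n = y.size

    def error(y_perm):
        return abs(stats.spearmanr(x, y_perm)[0] - rho_target)

    err, temp = error(y), t0
    for _ in range(n_steps):
        i, j = rng.integers(0, n, size=2)
        y[i], y[j] = y[j], y[i]                    # propose a swap
        new_err = error(y)
        if new_err <= err or rng.random() < np.exp((err - new_err) / temp):
            err = new_err                          # accept the swap
        else:
            y[i], y[j] = y[j], y[i]                # reject: undo the swap
        temp *= cooling
    return y, err

# Illustrative use: 27 LHS-median probabilities for two uniform marginals,
# rearranged towards a target Spearman correlation of -0.9.
n = 27
u1 = (np.arange(1, n + 1) - 0.5) / n
u2 = np.random.default_rng(2).permutation(u1)
u2, final_err = anneal_ordering(u1, u2, rho_target=-0.9)
print("achieved Spearman correlation:", stats.spearmanr(u1, u2)[0])
```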


Figure 4. LLHS – Evolution of the bivariate sampling probabilities (or samples of uniform variables U_1, U_2) with level l, showing both the additional subset π_l and the total sample Π_l. Left: statistically independent case (target correlation ρ = 0). Right: negatively correlated variables (target correlation ρ = −0.9).


Figure 5. HSLHS – Evolution of the bivariate sampling probabilities (or samples of uniform variables U_1, U_2) with level l, showing both the additional subset π_l and the total sample Π_l. Left: statistically independent case (target correlation ρ = 0). Right: negatively correlated variables (target correlation ρ = −0.9).


Figure 6. Examples of simulated samples from the three studied joint distributions (uniform, Gaussian and Weibull bivariate distributions), level 11 (sample size n_11 = 2048) in LLHS. Top row: uncorrelated marginals. Bottom row: negatively correlated marginals (ρ = −0.9).

4. Numerical examples: functions of random vectors

In the following text, we document the convergence of functions of a growing sample of a bivariate distribution to the analytical distributions. In particular, we pick three different functions of a random vector that should be representative of a range of possible relations in g(X). The three functions g_1, g_2 and g_3 are functions of a bivariate random vector that is either jointly Gaussian or Weibull distributed. Both vectors are obtained by transforming samples of a bivariate uniformly distributed vector via Eq. (3). Again, we study correlations of 0 and −0.9. The patterns of samples of the random vectors for the six identified combinations are plotted in Fig. 6.

4.1. Independent variables

First, we study a pair of independent variables. We introduce three different functions that cover the most frequent operations: summation, multiplication, and extremes of random variables. For functions of independent marginals, we present the analytical results first.

4.1.1. Sum of independent Gaussian random variables
We skip the well-known definition of the bivariate Gaussian random vector. The distribution of the sum Z_1 = g_1(X, Y) = X + Y of two normally distributed variables X and Y is normal with mean μ_{X+Y} = μ_X + μ_Y and variance σ²_{X+Y} = σ²_X + σ²_Y. We will use standardized marginals (zero means and unit variances), and therefore the mean value of Z_1 equals zero and the standard deviation equals √2.


4.1.2. Product of independent Gaussian random variables
The distribution of the product Z_2 = g_2(X, Y) = XY of two standard independent normally distributed variables X and Y with zero means and unit variances is given by:

f_{X \cdot Y}(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \exp\!\left(-\frac{x^2 + y^2}{2}\right) \delta(xy - z)\, \mathrm{d}x\, \mathrm{d}y = \frac{1}{\pi} K_0(|z|) \qquad (13)

where δ(x) is the Dirac delta (spike) function and K_n(x) is the modified Bessel function of the second kind. Integrating the above distribution over (−∞, ∞) for the statistical moments provides the analytical mean of zero and unit variance. Note that in the nonstandard case, the mean value is μ_Z = μ_X μ_Y and the variance is σ²_Z = μ²_X σ²_Y + μ²_Y σ²_X + σ²_X σ²_Y.

4.1.3. Minimum of independent Weibull random variables (extremes)
The distribution of the minimum Z_3 = g_3(X, Y) = min(X, Y) of two independent and identically distributed Weibull variables X and Y with Weibull modulus (shape parameter) m and unit scale parameter s is again Weibullian:

F_{\min(X,Y)}(z) = 1 - \exp\!\left[-\left(\frac{z}{s_2}\right)^{m}\right] \qquad (14)

where the shape parameter m (and also the coefficient of variation of Z) remains unchanged from that of X (a series system) and the scale parameter is s_2 = s · 2^{−1/m}. The mean value of Z_3 is given by:

\mu_{\min(X,Y)} = s_2\, \Gamma\!\left(1 + \frac{1}{m}\right) = s \cdot 2^{-1/m}\, \Gamma\!\left(1 + \frac{1}{m}\right) \qquad (15)

In these numerical examples we use the scale parameter s = 1 and the shape parameter m = 12.

4.2. Statistically dependent variables

To examine the effect of strong statistical dependence, we repeat the foregoing examples with identical (respective) marginals, but a negative Spearman correlation of −0.9. The sum of two normally distributed variables is again a Gaussian variable with mean μ_Z = μ_X + μ_Y and standard deviation σ_Z = \sqrt{σ²_X + σ²_Y + 2 σ_X σ_Y ρ}. The moments of the aforementioned operations with normal variables are extremely sensitive to the actual correlation. In the case of the minimum of two correlated Weibullian variables, the analytical solutions are dropped because obtaining them would require a specification of the joint PDF of (X, Y). Therefore, only plots documenting the convergence of the estimated moments are presented.

5. Results and discussion

In this section we briefly comment on the results obtained by examining the three functions from the previous section. The results presented here show the convergence of the estimates of:


1. the mean value, estimated by the arithmetic average;
2. the standard deviation, estimated by the sample standard deviation.

The results presented are averages of the estimated statistics over several different runs; a sketch of such a convergence experiment is given below.
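A minimal sketch of such a convergence experiment (Python/NumPy/SciPy; not the author's code) could look as follows. It uses plain LHS-median samples of the two independent Weibull marginals (s = 1, m = 12) with random pairing (no correlation control), evaluates g_3 = min(W_1, W_2), and compares the estimated mean and standard deviation with the analytical mean from Eq. (15):

```python
import math
import numpy as np
from scipy import stats

m, s = 12.0, 1.0                          # Weibull shape and scale (Section 4.1.3)
weibull = stats.weibull_min(c=m, scale=s)
mean_exact = s * 2.0 ** (-1.0 / m) * math.gamma(1.0 + 1.0 / m)   # Eq. (15)

def lhs_median_sample(n, rng):
    """Independent LHS-median samples of the two Weibull marginals,
    with random mutual ordering (no correlation control in this sketch)."""
    pi = (np.arange(1, n + 1) - 0.5) / n
    w1 = weibull.ppf(rng.permutation(pi))
    w2 = weibull.ppf(rng.permutation(pi))
    return w1, w2

rng = np.random.default_rng(0)
for n in [3, 9, 27, 81, 243, 729]:        # HSLHS-like sample sizes N_l = 3**l
    means, stds = [], []
    for _ in range(30):                   # average over several runs
        w1, w2 = lhs_median_sample(n, rng)
        z3 = np.minimum(w1, w2)           # g_3(W1, W2) = min(W1, W2)
        means.append(z3.mean())
        stds.append(z3.std(ddof=1))
    print(f"N = {n:4d}  mean = {np.mean(means):.4f} "
          f"(exact {mean_exact:.4f})  std = {np.mean(stds):.4f}")
```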

Figure 7. Convergence to the analytical mean and standard deviation for functions of uncorrelated variables. Estimates by HSLHS, LLHS and the classical LHS are plotted against the total number of simulations N_l (levels l = 0, …, 11) for the three functions g_1(N_1, N_2) = N_1 + N_2, g_2(N_1, N_2) = N_1 · N_2 and g_3(W_1, W_2) = min(W_1, W_2).

We observed that LHS gives slightly better results at a given sample size. However, in our eyes, this small increase in accuracy is outweighed by the great advantage of HSLHS, namely the ability to increase the sample size adaptively based on the current progress. We performed numerical tests for numbers of simulations up to 4096; the reason for not going any further is that the typical property and advantage of LHS over crude Monte Carlo sampling (variance reduction of statistical response estimates) disappears for large samples.

Mean value: Both LLHS and HSLHS give estimates of the mean (arithmetic averages) exactly equal to the analytical solution (and also to the classical LHS) in the case of the sum of random variables g_1(N_1, N_2).

Figure 8. Convergence to the analytical mean and standard deviation for functions of negatively correlated variables. Estimates by HSLHS, LLHS and the classical LHS are plotted against the total number of simulations N_l (levels l = 0, …, 11) for the three functions g_1(N_1, N_2) = N_1 + N_2, g_2(N_1, N_2) = N_1 · N_2 and g_3(W_1, W_2) = min(W_1, W_2).

For the product and minimum of random variables (functions g_2(N_1, N_2) and g_3(W_1, W_2), respectively), HSLHS seems to be about equally effective as LHS, while LLHS converges somewhat more slowly. However, if the random variables are negatively correlated, LHS seems to give closer estimates than HSLHS at the same sample size.

Standard deviation: In all three cases of the studied functions of random variables (either uncorrelated or negatively correlated), the actual variance was underestimated by all three methods, LHS, LLHS and HSLHS. However, HSLHS is about as efficient as the classical LHS. LLHS is somewhat worse; the reason is that its sampling probabilities are not perfectly uniformly spread over the interval (0, 1) compared to LHS and HSLHS, and this effect propagates into the resulting variance estimates of g(X).


6. Conclusions

In this paper we examined possibilities for refining sampling sets within the framework of Monte Carlo type simulations. We proposed two refinement techniques designed to give approximately equally good results as Latin Hypercube Sampling; the two methods are denoted LLHS and HSLHS. The methods are designed for small sample sets (from tens up to, at most, thousands of simulations) that can be extended and merged together to constitute a consistent sample set while preserving the variance reduction property. The desired correlation is proposed to be maintained by the previously developed combinatorial optimization technique, which is capable of matching the target correlation structure as the sample size grows. Adaptive refinement can be performed until a stopping criterion is met. The criteria for termination can be, in particular:

1. the user's decision, based e.g. on material or computing resources;
2. the statistical significance of an arbitrary parameter (estimated mean value, standard deviation, sensitivity, etc.) being high enough (e.g., confidence intervals narrowed below a given limit).

LLHS and HSLHS have been shown to perform well in estimating the mean of the response; HSLHS is sometimes slightly better. In the case of LLHS, the sampling flexibility and subset consistency are not for free: underestimation of the response variance tends to be somewhat more pronounced than in the case of traditional LHS (which itself usually underestimates the variance). This feature is explained by the slight non-uniformity of the selected sampling probabilities. HSLHS completely removes this problem, but the sample size grows more rapidly with the number of added subsets of samples.

The typical application involves a computer-based model for which it is impossible to find a way (closed form or numerical) to do the necessary transformation of variables and which is expensive to run in terms of computing resources and time. Examples of applications and extensions are numerous: simulations of random fields, design of physical or computer experiments, pilot numerical studies of complicated functions of random variables, optimal adaptive and importance sampling, optimal progressive learning of neural networks, and other related areas.

Acknowledgements

The author acknowledges financial support provided by the Grant Agency of the Academy of Sciences of the Czech Republic under project no. KJB201720902 and also partial support provided by the Czech Science Foundation under project no. GACR 103/08/0752 (SISMO).

References

B. M. Ayyub and K. L. Lai. Structural reliability assessment using Latin Hypercube Sampling. In A. H-S. Ang, M. Shinozuka, and G. I. Schuëller, editors, ICoSSaR '89, the 5th International Conference on Structural Safety and Reliability, volume 2, pages 1177–1184, San Francisco, CA, USA, 1989. ASCE.


W. J. Conover. On a better method for selecting input variables. Unpublished Los Alamos National Laboratories manuscript, 1975. Reproduced as Appendix A of "Latin Hypercube Sampling and the Propagation of Uncertainty in Analyses of Complex Systems" by J. C. Helton and F. J. Davis, Sandia National Laboratories report SAND2001-0417, printed November 2002.

R. C. Iman and W. J. Conover. Small sample sensitivity analysis techniques for computer models with an application to risk assessment. Communications in Statistics: Theory and Methods, A9(17):1749–1842, 1980.

M. D. McKay, W. J. Conover, and R. J. Beckman. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21:239–245, 1979.

M. Vořechovský and D. Novák. Correlated random variables in probabilistic simulation. In Peter Schießl, Norbert Gebbeken, Manfred Keuser, and Konrad Zilch, editors, 4th International Ph.D. Symposium in Civil Engineering held in Munich, Germany, volume 2, pages 410–417. Millpress, Rotterdam, 2002.

M. Vořechovský and D. Novák. Correlation control in small sample Monte Carlo type simulations I: A Simulated Annealing approach. Probabilistic Engineering Mechanics (Elsevier), 24(3):452–462, 2009.

M. Vořechovský. Stochastic fracture mechanics and size effect. PhD thesis, Brno University of Technology, Brno, Czech Republic, 2004.

M. Vořechovský. Hierarchical Subset Latin Hypercube Sampling. In S. Vejvoda et al., editors, PPK 2006 – Pravděpodobnost porušování konstrukcí, pages 285–298, Brno, Czech Republic, 2006. Brno University of Technology, Faculty of Civil Engineering and Faculty of Mechanical Engineering & Ústav aplikované mechaniky Brno, s.r.o. & Asociace strojních inženýrů & TERIS, a.s., pobočka Brno.

M. Vořechovský. Interplay of size effects in concrete specimens under tension studied via computational stochastic fracture mechanics. International Journal of Solids and Structures (Elsevier), 44(9):2715–2731, 2007.

M. Vořechovský. Stochastic computational mechanics of quasibrittle structures. Number 235 in Vědecké spisy Vysokého učení technického v Brně, Habilitační a inaugurační spisy. Brno University of Technology, Brno, Czech Republic, 2007. Published habilitation thesis presented at Brno University of Technology, 389 pages.

M. Vořechovský. Correlation control in small sample Monte Carlo type simulations II: Theoretical analysis and performance bounds. Probabilistic Engineering Mechanics (Elsevier), in review, 2009.

M. Vořechovský. Correlation control in small sample Monte Carlo type simulations III: Performance study, multivariate modeling and copulas. Probabilistic Engineering Mechanics (Elsevier), in review, 2009.
