A Procedure for Generating Batch-Means Confidence Intervals for Simulation: Checking Independence and Normality

E. Jack Chen, BASF Corporation, 333 Mount Hope Avenue, Rockaway, New Jersey 07866-0909, USA ([email protected])

W. David Kelton, Department of Quantitative Analysis and Operations Management, University of Cincinnati, Cincinnati, Ohio 45221-0130, USA

Abstract: Batch means are sample means of subsets of consecutive subsamples from a simulation output sequence. Independent and normally distributed batch means are not only a requirement for constructing a confidence interval for the mean of the steady-state distribution of a stochastic process, but are also a prerequisite for other simulation procedures such as ranking and selection (R&S). We propose a procedure to generate approximately independent and normally distributed batch means, as determined by the von Neumann test of independence and the chi-square test of normality, and then to construct a confidence interval for the mean of a steady-state expected simulation response. It is our intention for the batch means to play the role of the independent and identically normally distributed observations that confidence intervals and the original versions of R&S procedures require. We perform an empirical study for several stochastic processes to evaluate the performance of the procedure and to investigate the problem of determining valid batch sizes.

Keywords: Batch means, test of independence, test of normality, ranking and selection
1. Introduction

A concern of simulation output analysis is to estimate the sampling error of some unknown parameter. This gives the experimenter an idea of the precision with which the estimate reflects the true but unknown parameter. For example, when estimating a steady-state performance parameter such as the mean $\mu$ of some discrete-time stochastic output process $\{X_i : i \ge 1\}$ via simulation, we would like an algorithm to determine the simulation run length $n$ so that (i) the mean estimator (i.e. the sample mean $\bar{X}(n) = \sum_{i=1}^{n} X_i / n$) is unbiased, (ii) the confidence
interval (c.i.) for $\mu$ is of a pre-specified width, and (iii) the actual coverage probability of the c.i. is close to the nominal coverage probability $1-\alpha$. Because we assume the underlying process is stationary, i.e. the joint distribution of the $X_i$'s is insensitive to time shifts, the mean estimator will be unbiased. If the process is not stationary due to initialization effects, it is common practice in simulation to delete or truncate some number of initial observations in order to approximate the stationarity assumption. However, the usual method of c.i. construction from classical statistics, which requires independent and identically distributed (i.i.d.) normal observations, is not directly applicable since simulation output data are generally correlated and non-normal. Furthermore, many simulation procedures are derived based on the assumption that data are i.i.d. normal: for example, ranking and selection procedures and multiple-comparison procedures
(Bechhofer et al. [1]). The method of batch means is the technique of choice to 'manufacture' data that are approximately i.i.d. normal. See Law and Kelton [2] for more details on batch means. We propose a procedure to obtain batch means that appear to be i.i.d. normal, as determined by the von Neumann [3] test of independence and the chi-square test of normality. In the remainder of the paper, we will use 'data that appear to be i.i.d. normal' to mean that the data pass both the tests of independence and normality. Chen and Kelton [4, 5] used a test of independence to determine simulation run length. Their procedure performs statistical analysis on sample sequences collected on strictly stationary stochastic processes. In the proposed procedure, we apply the statistical tests to the batch means directly, instead of to the original samples. The batch sizes are based entirely on the data and do not require user intervention. The only required condition is that the autocorrelations of the stochastic process output sequence die off as the lag between observations increases, in the sense of $\phi$-mixing (see Billingsley [6]). These weakly dependent, stationary processes typically obey a central limit theorem of the form

\[ \frac{\sqrt{n}\,\bigl(\bar{X}(n) - \mu\bigr)}{\sigma} \;\xrightarrow{\,D\,}\; N(0,1) \quad \text{as } n \to \infty, \tag{1} \]

where $\sigma^2$ is the steady-state variance constant (SSVC), $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$, and $\xrightarrow{\,D\,}$ denotes convergence in distribution [6]. For correlated sequences, the SSVC is

\[ \sigma^2 \;\equiv\; \lim_{n \to \infty} n\,\mathrm{Var}\bigl(\bar{X}(n)\bigr) \;=\; \sum_{i=-\infty}^{\infty} \gamma_i, \]

where $\gamma_i = \mathrm{Cov}(X_k, X_{k+i})$ for any $k$ is the lag-$i$ covariance. A sufficient condition for the SSVC to exist is that the output process is stationary and $\sum_{i=-\infty}^{\infty} |\gamma_i| < \infty$. If the sequence is independent, then the SSVC is equal to the process variance $\sigma^2_X = \mathrm{Var}(X_i)$. Our procedures may fail when the underlying stochastic process does not satisfy the limiting result of equation (1).

The rest of this paper is organized as follows. In Section 2, we present some background on the batch-means method to construct c.i.'s and on the von Neumann test to determine whether the batch means appear to be independent. In Section 3, we present the algorithm to estimate batch sizes so that the batch means appear to be independent and normally distributed. In Section 4, we show some empirical results. In Section 5, we give concluding remarks.

2. Background

This section overviews batch means to estimate the variance of the sample mean and von Neumann's test to determine whether a sequence appears to be independent.
2.1 Batch-Means Method

In the non-overlapping batch-means (NOBM) method, the simulation output sequence $\{X_i : i = 1, 2, \ldots, n\}$ is divided into $b$ adjacent non-overlapping batches, each of size $m$. For simplicity, we assume that $n$ is a multiple of $m$ so that $n = bm$. The sample mean $\bar{X}_j$ for the $j$th batch is

\[ \bar{X}_j = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} X_i \quad \text{for } j = 1, 2, \ldots, b. \]

For a given batch size $m$, we have $\mathrm{Var}(\bar{X}_j) = \sigma^2(m)/m$, where $\sigma^2(m) = \gamma_0 + 2\sum_{i=1}^{m-1} (1 - i/m)\,\gamma_i$. The grand mean of the individual batch means (BM), given by

\[ \hat{\mu} = \frac{1}{b} \sum_{j=1}^{b} \bar{X}_j, \tag{2} \]

is used as a point estimator for $\mu$. Here $\hat{\mu} = \bar{X}(n)$, the sample mean of all $n$ individual $X_i$'s, and we seek to construct a c.i. for $\mu$ based on the point estimator (2).

The method of BM is a well-known technique for estimating the variance of point estimators computed from simulation experiments. The BM method tries to reduce autocorrelation by batching observations. The BM variance estimator (for estimating $\mathrm{Var}(\bar{X}_j)$) is simply the sample variance of the batch means $\bar{X}_j$, the means of subsets of consecutive subsamples, i.e.

\[ S_B^2 = \frac{1}{b-1} \sum_{j=1}^{b} (\bar{X}_j - \hat{\mu})^2. \tag{3} \]

Consequently, the batch-means estimator for $\sigma^2$ is $\hat{\sigma}_B^2 \equiv m S_B^2$. Asymptotic validity of the c.i. constructed by the batch-means method, i.e. that the coverage probability of the c.i. is close to the nominal coverage probability, often depends on the assumption that the batch means are approximately i.i.d. normal. That is, for a large batch size $m$, the batch means are approximately i.i.d. normal with unknown mean $\mu$ and unknown variance $\sigma^2(m)/m$. However, some BM methods (for example, ASAP3 by Steiger et al. [7]) construct a c.i. for the mean by adjusting the c.i. half-width based on the strength of the autocorrelations between BMs. The key idea for ASAP3 is that the batch means may become approximately jointly normally distributed before becoming approximately independent. ASAP3 therefore adjusts the half-width to counter any remaining dependence between the batch means.
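To make these estimators concrete, here is a minimal sketch in Python/NumPy (ours, not the authors' code) of the NOBM quantities: the batch means, the grand mean of equation (2), the variance estimator $S_B^2$ of equation (3), and the implied SSVC estimator $m S_B^2$. The AR(1)-style example sequence at the bottom is purely illustrative.

```python
import numpy as np

def nobm_estimators(x, b):
    """Non-overlapping batch means (NOBM) for an output sequence x split into b batches.

    Returns the batch means, the grand mean (equation (2)), the BM variance
    estimator S_B^2 (equation (3)), and the implied SSVC estimator m * S_B^2.
    Observations beyond a multiple of the batch size are discarded for
    simplicity (the text assumes n = b * m exactly).
    """
    x = np.asarray(x, dtype=float)
    m = len(x) // b                                  # batch size
    batch_means = x[:b * m].reshape(b, m).mean(axis=1)
    grand_mean = batch_means.mean()                  # equation (2)
    s_b2 = batch_means.var(ddof=1)                   # equation (3)
    return batch_means, grand_mean, s_b2, m * s_b2   # last entry: SSVC estimate

# Purely illustrative example: 180 batches from a correlated AR(1)-style sequence.
rng = np.random.default_rng(0)
x = np.empty(18_000)
x[0] = rng.standard_normal()
for i in range(1, len(x)):
    x[i] = 0.5 * x[i - 1] + rng.standard_normal()
bm, mu_hat, s_b2, ssvc_hat = nobm_estimators(x, b=180)
print(mu_hat, s_b2, ssvc_hat)
```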
There are a number of batch-size determination procedures that aim to generate independent batch means. However, these methods have focused on determining a batch size large enough to achieve near independence of the batch means and have ignored the question of normality, based on the assumption that if the batch size is large enough for the batch means to be approximately independent, then it should also be large enough for the batch means to be approximately normally distributed. For instance, the procedure of Law and Carson [8] starts with 400 batches of size 2 and doubles the sample size every other iteration until an estimate of the lag-1 correlation among 400 batch means becomes smaller than 0.4 and larger than the estimated lag-1 correlation among 200 batch means with twice the batch size. A drawback of the method is that it does not address the issue of the normality of the batch means. Another widely studied batch-means technique is the set of procedures LBatch and ABatch [9]. However, these procedures require users to enter the simulation run length at the beginning of the execution. Hence, even though these procedures dynamically determine batch sizes (so that the batch means converge faster to normality), they are fixed-sample-size procedures.

2.2 The von Neumann Test of Independence

The von Neumann test is relatively simple, can be applied when the number of batches is small, and can easily be incorporated into other simulation procedures [10]. Several BM procedures have used the von Neumann ratio to test for independence, e.g. [7, 10]. The simulation software Arena also uses the von Neumann ratio in its automatic batching procedure [11]. We briefly review the von Neumann test [3, 9] for the hypothesis $H_0$: the batch means $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_b$ are uncorrelated. The von Neumann ratio is

\[ C_b = 1 - \frac{\sum_{j=2}^{b} (\bar{X}_j - \bar{X}_{j-1})^2}{2 \sum_{j=1}^{b} (\bar{X}_j - \hat{\mu})^2}. \]

Note that $C_b$ is an estimator of the lag-1 autocorrelation $\rho_1 \equiv \mathrm{Corr}(\bar{X}_j, \bar{X}_{j+1})$, adjusted for end effects that diminish in importance as the number of batches $b$ increases. If $\{X_i\}$ has a monotone non-increasing autocorrelation function, then $\rho_1$ is positive and decreases monotonically to zero as the batch size $m$ increases. The von Neumann test statistic for $H_0$ is

\[ Z = C_b \sqrt{\frac{b^2 - 1}{b - 2}}. \]

Under $H_0$, $Z \approx N(0,1)$, so one rejects $H_0$ at level $\alpha_{\mathrm{ind}}$ if $Z \ge z_{1-\alpha_{\mathrm{ind}}}$, where $z_{1-\alpha_{\mathrm{ind}}}$ is the $1-\alpha_{\mathrm{ind}}$ quantile of the standard normal distribution. Fishman [9] points out that the von Neumann test of independence is likely to accept $H_0$ when $\{X_i\}$ has an autocorrelation function that is negatively correlated and exhibits damped harmonic behavior around zero. In this case, the variance estimator will be biased high instead of biased low. We will not lose any performance in terms of coverage; however, the half-width of the confidence interval for $\mu$ will be wider than necessary.
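A direct transcription of the test as described above, written as a Python sketch (ours; the one-sided rejection rule and the default level $\alpha_{\mathrm{ind}} = 0.1$ follow the text):

```python
import numpy as np
from scipy.stats import norm

def von_neumann_test(batch_means, alpha_ind=0.10):
    """One-sided von Neumann test of H0: the b batch means are uncorrelated.

    Returns (passed, Z), where passed is True when H0 is not rejected at
    level alpha_ind, i.e. when Z < z_{1 - alpha_ind}.
    """
    y = np.asarray(batch_means, dtype=float)
    b = len(y)
    num = np.sum(np.diff(y) ** 2)                    # sum of squared successive differences
    den = 2.0 * np.sum((y - y.mean()) ** 2)
    c_b = 1.0 - num / den                            # estimator of the lag-1 autocorrelation
    z = c_b * np.sqrt((b ** 2 - 1.0) / (b - 2.0))    # approximately N(0, 1) under H0
    return z < norm.ppf(1.0 - alpha_ind), z
```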
3. Manufacturing i.i.d. Normal Batch Means

In this section, we discuss the strategy of manufacturing batch means as well as a procedure to construct a confidence interval for the mean. We refer to the proposed method as the quasi-independent and normal (QIN) procedure.

3.1 Validation of Normality

Batch means that appear to be independent are not necessarily normally distributed, and vice versa. However, the experimental results of Chen and Kelton [12] indicate that c.i.'s constructed under the assumption that samples are i.i.d. normal generally have coverages close to the nominal value even when the samples are independent but not normal. Therefore, in terms of c.i. coverage, it is not as critical to ensure that the batch means are normally distributed as to ensure that the batch means are independent. Nevertheless, ensuring that the batch means are normal can improve the c.i. coverage. Furthermore, some other procedures (e.g. R&S) do require the data to be i.i.d. normal.

To determine whether the batch means appear to be normally distributed, we check the proportions ($p_j$, $j = 1, 2, \ldots, 6$) of the values of the batch means in each interval bounded by $(-\infty,\; \hat{\mu} - 0.9674S,\; \hat{\mu} - 0.4303S,\; \hat{\mu},\; \hat{\mu} + 0.4303S,\; \hat{\mu} + 0.9674S,\; \infty)$, where $\hat{\mu}$ and $S$ are the grand sample mean of the $n$ observations and the standard deviation of the $b$ batch means, respectively [2, 13]. The constants involved in these intervals are strategically chosen so that the proportions of batch means in each interval are equal under the null hypothesis that the batch means are normal. We apply the chi-square test of normality to these batch means, using a confidence level of 0.9. That is, we compute the test statistic

\[ \chi^2 = b \sum_{j=1}^{6} \frac{(p_j - 1/6)^2}{1/6}. \]

We reject the null hypothesis that the batch means are normal when $\chi^2 \ge \chi^2_{0.9,5} = 9.236$, the 0.9 quantile of the $\chi^2$ distribution with 5 degrees of freedom (df). The powers of the independence and normality tests increase as the number of batches used to perform the tests increases. Based on our experimental results, we recommend that the sample size used for this normality test and the von Neumann test be at least 180 (see Section 4). There are other normality tests of greater sophistication, for example the Shapiro-Wilk test [14]. We chose the chi-square test because it is easy to apply and serves our purpose well; however, other normality tests could be used in place of the chi-square test of normality in the method if users desire.
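The six-cell check can be coded directly from the boundaries and critical value given above; the following Python sketch (ours) works with cell counts, which is equivalent to the proportion form of the statistic:

```python
import numpy as np

def chi_square_normality_test(batch_means, grand_mean, crit=9.236):
    """Six-cell chi-square check of normality for b batch means.

    Cell boundaries use the grand mean and the standard deviation S of the
    batch means, with the constants 0.9674 and 0.4303 from the text, so each
    cell has probability 1/6 under normality.  The Pearson statistic on the
    cell counts is compared with the 0.9 quantile of chi-square with 5 df.
    Returns (passed, statistic).
    """
    y = np.asarray(batch_means, dtype=float)
    s = y.std(ddof=1)
    edges = grand_mean + s * np.array(
        [-np.inf, -0.9674, -0.4303, 0.0, 0.4303, 0.9674, np.inf])
    counts, _ = np.histogram(y, bins=edges)
    expected = len(y) / 6.0
    stat = np.sum((counts - expected) ** 2 / expected)
    return stat < crit, stat
```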
3.2 Batch-Means Variance Estimator

Fishman [9] classifies the difference $\hat{\sigma}_B^2 - \sigma^2$ into three categories: (1) error due to finite sample size $n$; (2) error due to ignoring correlation between batches; and (3) error due to random sampling. He collectively refers to the errors in the first two categories as systematic variance error. He points out that under relatively weak conditions, the systematic variance error behaves as $O(1/m)$, and the standard error of the error due to random sampling behaves as $O(1/\sqrt{b})$. Here $O(a_n)$ denotes a quantity that converges to zero at least as fast as the sequence $\{a_n\}$ does, as $n \to \infty$. Hence, if we use a fixed number of batches, then a batch size $m \propto n$ would diminish the systematic variance error most rapidly. On the other hand, if we want to reduce the error due to random sampling, we should increase the number of batches. The hope behind the QIN method is that the systematic variance error has diminished to a level that can be neglected when the batch means $\bar{X}_j$, for $j = 1, 2, \ldots, b$, appear to be i.i.d. normal, as determined by the von Neumann test and the chi-square test. Hence, we do not need to increase the batch size any further, since our goal is to manufacture as many i.i.d. normal batch means as possible with a given sample size. Recall that each batch mean is treated as a sample for subsequent procedures, and a large sample size can reduce the sampling error of those procedures.

If $\{X_i\}$ is an i.i.d. $N(\mu, \sigma^2)$ sequence, then the $1-\alpha$ half-width of a c.i. centered on the sample mean constructed with $n$ observations is

\[ H = z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}. \]

If the sequence is divided into $b$ non-overlapping batch means of batch size $m$, i.e. $n = bm$, then the variance of a batch mean is $\sigma^2/m$ and the $1-\alpha$ half-width constructed with $b$ batch means is

\[ H = z_{1-\alpha/2} \frac{\sigma/\sqrt{m}}{\sqrt{b}} = z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}. \]

That is, for i.i.d. normal sequences with known variance, the batch size has no impact on the c.i. half-width $H$ when the sample size is fixed. However, this property generally does not hold when the variance is unknown, and the c.i. half-width needs to be calculated by

\[ H = t_{1-\alpha/2,\,b-1} \frac{S_B}{\sqrt{b}}, \tag{4} \]

where $t_{1-\alpha/2,\,f}$ is the $1-\alpha/2$ quantile of the $t$ distribution with $f$ df.
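For reference, a short sketch (ours) of the half-width computation of equation (4), using a Student-t quantile from SciPy:

```python
import numpy as np
from scipy.stats import t

def bm_half_width(batch_means, alpha=0.10):
    """1 - alpha confidence-interval half-width from b batch means, per equation (4)."""
    y = np.asarray(batch_means, dtype=float)
    b = len(y)
    return t.ppf(1.0 - alpha / 2.0, b - 1) * y.std(ddof=1) / np.sqrt(b)
```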
Once the sample size is large enough for the BM to pass both the tests of independence and normality, we compute the sample mean and the variance estimator $S_B^2$ based on these BM. The mean $\mu$ and the c.i. half-width are estimated, respectively, by $\hat{\mu}$ and equation (4). The final step in the procedure is to determine whether the c.i. meets the user's half-width requirement: a maximum absolute half-width $\epsilon$ or a maximum relative fraction $r$ of the magnitude of the final mean estimator $\hat{\mu}$. If the precision of the c.i. is satisfied (i.e. $H \le \epsilon$ or $H \le r\hat{\mu}$), then the procedure terminates, and we return the mean estimate $\hat{\mu}$ and the c.i. with half-width $H$. If the precision requirement is not satisfied, the procedure will increase the sample size to shorten the half-width. The procedure will increase the number of batches $b$ to

\[ b' = \left\lceil \left(\frac{H}{\epsilon}\right)^2 b \right\rceil \quad \text{or} \quad b' = \left\lceil \left(\frac{H}{r\hat{\mu}}\right)^2 b \right\rceil. \tag{5} \]

This step will be executed repeatedly until the half-width is within the specified precision.

3.3 The Implementation

The procedure progressively increases the batch size $m$ until the batch means appear to be i.i.d. normal, as determined by the von Neumann and chi-square tests. Figure 1 displays a high-level flow chart of QIN. We divide the entire output sequence into $b = 180$ batches. We allocate a buffer of size $3b$ to keep primitive batches (PB), i.e. averages of $l$ observations. Batch means are then computed from the PBs in the buffer. Initially, each observation is treated as a PB; as the procedure proceeds, $l$ will be doubled every two iterations. The steps between generating more observations are considered an iteration. To facilitate the description of the algorithm, we break each iteration into A and B sub-iterations so that most relevant quantities can be described in simpler forms. The procedure performs both independence and normality tests in both sub-iterations and terminates when the BM appear to be i.i.d. normal.

Table 1 illustrates how sampling progresses from iteration to iteration. The Iteration row lists the index of the iteration. The n row lists the total number of observations that will have been taken by the end of a certain iteration. The pb row lists the total number of PBs in the buffer at a certain iteration. The l row lists the number of observations used to obtain each PB in the buffer. The m row lists the batch size. Note that a PB here could consist of just one observation. For example, at the end of iteration 1B, the total number of observations is $3b$, there are $3b$ PBs in the buffer, and each PB is just the value of one observation. At the beginning of iteration 2A, we reduce the number of PBs in the buffer from $3b$ to $3b/2$ by taking the average of every two consecutive PBs. We will generate $b/2$ PBs at iteration 2A, so we will have $2b$ PBs at the end of the iteration.
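The buffer bookkeeping described above can be sketched as follows (our illustration, not the authors' code): collapse_buffer halves the number of PBs by averaging adjacent pairs, and batch_means_from_pbs forms the b batch means by averaging q consecutive PBs, with q = 1, 2 or 3 as in the algorithm of Section 3.4 below.

```python
import numpy as np

def collapse_buffer(pbs):
    """Halve the number of primitive batches (PBs) by averaging adjacent pairs.

    Called at the start of a k_A iteration, when the buffer of 3b PBs is full;
    afterwards each PB represents twice as many raw observations.
    """
    pbs = np.asarray(pbs, dtype=float)
    return pbs.reshape(-1, 2).mean(axis=1)           # assumes an even number of PBs

def batch_means_from_pbs(pbs, b):
    """Aggregate the available PBs into b batch means by averaging
    q = len(pbs) // b consecutive PBs (q is 1, 2 or 3 in the schedule of Table 1)."""
    pbs = np.asarray(pbs, dtype=float)
    q = len(pbs) // b
    return pbs[:b * q].reshape(b, q).mean(axis=1)
```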
Figure 1. High-level flow chart of QIN
Table 1. Properties of QIN at each iteration

  Iteration   0    1A    1B    2A    2B    kA          kB
  n           b    2b    3b    4b    6b    2^k b       3·2^(k-1) b
  pb          b    2b    3b    2b    3b    2b          3b
  l           1    1     1     2     2     2^(k-1)     2^(k-1)
  m           1    2     3     4     6     2^k         3·2^(k-1)

The size of the buffer used to store the PBs is 3b; l is the number of observations used to compute each PB; $\Delta$ is the incremental sample size (the number of PBs generated at a step); k is the index of iterations. Each iteration k contains two sub-iterations, kA and kB.

3.4 The Quasi-Independent and Normal Algorithm

We aggregate the available PBs into b BMs by averaging adjacent PBs. The procedure will progressively increase l, and consequently the batch size, until these b BMs appear to be i.i.d. normal.

1. Initialization: Set b = 180, l = 1, $\Delta$ = b, and k = 0.

2. Generate $\Delta$ PBs, where each PB is the average of l observations.
3. If this is the initial iteration (i.e. if k = 0), set q = 1. If this is a kA iteration, set q = 2. If this is a kB iteration, set q = 3. Set the value of each batch mean to the average of q PBs in the buffer, i.e.

\[ \bar{X}_j = \frac{1}{q} \sum_{i=q(j-1)+1}^{qj} V_i \quad \text{for } j = 1, 2, \ldots, b, \]

where $V_i$ is the value of the $i$th PB.

4. Check whether these b batch means pass the independence and normality tests.

5. If the batch means pass the independence and normality tests, go to step 10.

6. If this is the initial or a kB iteration, set k = k + 1 and start a kA iteration. If this is a kA iteration, start a kB iteration.

7. If this is a kA iteration (k $\ge$ 1), re-calculate the PBs in the buffer by taking the average of every two consecutive PBs and re-index the resulting 3b/2 PBs into the first half of the buffer. Set l = 2^(k-1) and $\Delta$ = b/2.

8. If this is a kB iteration (k $\ge$ 1), set $\Delta$ = b.

9. Go to step 2.

10. Compute the batch-means variance estimator $S_B^2$ according to equation (3) and the confidence-interval half-width according to equation (4).

11. Let $\epsilon$ be the desired absolute half-width, and let $r\hat{\mu}$ be the desired relative half-width. If the half-width of the c.i. is greater than $\epsilon$ or $r\hat{\mu}$, compute $b'$, the required number of batches, according to (5), generate $b' - b$ additional batches, set $b = b'$, and go to step 10; otherwise the procedure returns the c.i. estimator and terminates.

The QIN procedure starts with an initial sample size of 180 and doubles the sample size every two iterations. We choose the value b = 180 because this is the sample size we used for the tests of independence and normality. Note that we set $\Delta$ = b/2 at A iterations and $\Delta$ = b at B iterations so that the number of PBs will always be a multiple of b at the end of an iteration. The procedure progressively increases the batch size until these b batch means pass the independence and normality tests. Hence, we can use these b batch means to construct a classical c.i. without any adjustment. The procedure needs only to process each observation once and does not require storing the entire output sequence.
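Putting the pieces together, a compact sketch of the batching loop (our reading of steps 1-9, not the authors' implementation) follows; the half-width check of steps 10-11 is omitted. Here next_obs is a hypothetical callable supplying one simulation observation at a time, and the helpers are the ones sketched in Sections 2.2-3.3.

```python
import numpy as np

def qin_batch_means(next_obs, b=180, max_iters=40):
    """Sketch of the QIN batching loop (steps 1-9); not the authors' implementation.

    next_obs is a hypothetical callable returning the next simulation observation.
    Relies on batch_means_from_pbs, collapse_buffer, von_neumann_test and
    chi_square_normality_test from the earlier sketches.
    """
    pbs = []                       # buffer of primitive batches (at most 3b of them)
    l, delta = 1, b                # step 1: PB size and number of PBs to generate
    for _ in range(max_iters):
        # Step 2: generate delta new PBs, each the average of l observations.
        pbs.extend(np.mean([next_obs() for _ in range(l)]) for _ in range(delta))
        # Step 3: form b batch means from q = len(pbs) // b consecutive PBs.
        bm = batch_means_from_pbs(pbs, b)
        # Steps 4-5: stop as soon as the batch means look i.i.d. normal.
        if von_neumann_test(bm)[0] and chi_square_normality_test(bm, bm.mean())[0]:
            return bm
        # Steps 6-9: advance the schedule.  A full buffer (3b PBs) triggers a
        # k_A iteration: collapse PB pairs, double l, and request only b/2 PBs.
        if len(pbs) == 3 * b:
            pbs = list(collapse_buffer(pbs))
            l, delta = 2 * l, b // 2
        else:                      # k_B iteration: request b more PBs of size l
            delta = b
    raise RuntimeError("batch means did not pass both tests within max_iters")
```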
3.5 Discussion

In this section, we discuss the rationale behind the QIN procedure and highlight the differences and similarities between QIN and the procedure of Law and Carson (LC) [8] and the set of methods denoted by LBatch and ABatch [9].

The sample-size incremental strategy of QIN is very similar to that of LC; LC doubles the sample size every two iterations, by setting $n_0 = 600$, $n_1 = 800$ and $n_i = 2n_{i-2}$. Note that in our implementation we need to process each observation only once. However, the stopping criteria are different. Once the stopping criteria are satisfied, LC constructs its c.i. by dividing the entire sequence into 40 batches. LC aggregates every 10 batch means that appear to be independent into a final batch mean, and does not explicitly check whether these batch means appear to be normally distributed. Furthermore, if the obtained c.i. half-width is wider than desired, LC will keep the number of batches at 40 and increase only the batch size.

LBatch and ABatch [9] incorporate two different sample-size incremental strategies: FNB (Fixed Number of Batches) and SQRT (the number of batches and the batch size are each increased by a factor of $\sqrt{2}$ at each iteration). Both FNB and SQRT double the sample size at each iteration. However, LBatch and ABatch require users to enter the sample size $n$ for a simulation run. Let $b$ be the initial number of batches. If $n$ equals $n^*$ (the minimal sample size for batch means with batch size $n^*/b$ to pass the test of independence), then the batch sizes determined by LBatch and ABatch are the same, since these two procedures use the FNB rule exclusively and invoke the SQRT rule only after the batch means pass the test of independence. Fishman [9] points out that BM methods that are based entirely on the FNB rule, for example LC, do obtain an asymptotically valid variance estimator $S_B^2$, but are not statistically efficient. QIN also uses the FNB rule; however, it doubles the sample size every two iterations instead of every iteration. Moreover, QIN continues to use the FNB rule to increase the batch size until the batch means pass the normality test. Only after QIN obtains a batch size sufficiently large that the batch means appear to be i.i.d. normal does it cease to increase the batch size any further. For example, if the obtained c.i. half-width is wider than desired, QIN will only increase the number of batches with the estimated batch size, so that the variance of $S_B^2$ becomes smaller at a faster rate. On the other hand, if $n > n^*$, then LBatch and ABatch will increase both the number of batches and the batch size. Note that LBatch and ABatch do not explicitly check whether the batch means appear to be normally distributed.

3.6 Two-Stage Selection Procedure

To demonstrate how the QIN procedure can be used as a pre-processor to manufacture i.i.d. normal batch means for ranking-and-selection procedures, which are used to select the 'best' design from among k competing designs, we performed some experiments following Chen's [15] selection procedure. Let $\mu_{i_l}$ be the $l$th smallest of the $\mu_i$'s, so that $\mu_{i_1} \le \mu_{i_2} \le \cdots \le \mu_{i_k}$. The best design is the design with the smallest expected response $\mu_{i_1}$. In practice, however, if $\mu_{i_1}$ and $\mu_{i_2}$ are very close together, we might not care if we mistakenly chose design $i_2$, whose expected response
is $\mu_{i_2}$. The practically significant difference $d^*$ (a positive real number) between the best and next-best design is called the indifference zone in the statistical literature and represents the smallest difference about which we care. Let P(CS) denote the probability of correct selection, i.e. the probability that the best design is indeed selected. In indifference-zone selection, the user specifies an indifference amount $d^*$ and a minimal probability of correct selection $P^*$. Chen's selection procedure achieves $\mathrm{P(CS)} \ge P^*$ provided that the difference between the best and the second-best design is at least $d^*$. If the procedure selects a design having a performance measure within $d^*$ of the best, it will be considered a correct selection regardless of whether the selected design is actually the best.

For completeness, we describe Chen's indifference-zone selection procedure here. Let $b_0$ be the number of initial samples or batch means and let $X_{i,j}$ be the $j$th i.i.d. normal sample or batch mean from the $i$th design. It is our intention for the batch means to play the role of the i.i.d. normal observations that the original version of the procedure requires. We compute the first-stage sample means $\bar{X}_i(b_0) = \sum_{j=1}^{b_0} X_{i,j}/b_0$ and marginal sample variances

\[ S_i^2(b_0) = \frac{\sum_{j=1}^{b_0} \bigl(X_{i,j} - \bar{X}_i(b_0)\bigr)^2}{b_0 - 1}, \tag{6} \]

for $i = 1, 2, \ldots, k$. Let $\bar{X}_b(b_0) = \min_{i=1}^{k} \bar{X}_i(b_0)$. Based on the number of initial replications or batches $b_0$ and the sample variance $S_i^2(b_0)$ obtained from the first stage, the number of additional simulation replications or batches for each design in the second stage is $B_i - b_0$, where

\[ B_i = \max\bigl(b_0,\; \lceil (h_t S_i(b_0)/d_i)^2 \rceil \bigr), \quad \text{for } i = 1, 2, \ldots, k, \]

and $d_i = \max\bigl(d^*,\; \bar{X}_i(b_0) - U(\bar{X}_b(b_0))\bigr)$ is the adjusted controlled distance, $U(\bar{X}_b(b_0))$ is the $P^*$ upper confidence limit of $\mu_b$, $\lceil z \rceil$ is the smallest integer that is greater than or equal to the real number $z$, $h_t = \sqrt{2}\, t_{P',\,b_0-1}$, and $P' = 1 - (1 - P^*)/(2(k-1))$. We then compute the overall sample means $\bar{X}_i(B_i) = \sum_{j=1}^{B_i} X_{i,j}/B_i$, $i = 1, 2, \ldots, k$, and select the design with the smallest $\bar{X}_i(B_i)$ as the best one.
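A compact sketch (ours) of the two-stage computation above; the particular form used for the upper confidence limit U(.) is our assumption, chosen as a standard one-sided limit, and first_stage is a hypothetical array of first-stage samples or QIN batch means.

```python
import numpy as np
from scipy.stats import t

def chen_second_stage(first_stage, d_star, p_star):
    """Second-stage sample sizes B_i for Chen's indifference-zone procedure.

    first_stage is a (k, b0) array of first-stage i.i.d.-normal samples (or QIN
    batch means).  The upper confidence limit U(.) of the current best mean is
    taken here as Xbar_b + t_{P*, b0-1} * S_b / sqrt(b0); this particular form
    is our assumption.  Returns (B_i, index of the first-stage best design).
    """
    x = np.asarray(first_stage, dtype=float)
    k, b0 = x.shape
    means = x.mean(axis=1)                            # Xbar_i(b0)
    s = x.std(axis=1, ddof=1)                         # S_i(b0), square root of equation (6)
    best = int(np.argmin(means))
    u_best = means[best] + t.ppf(p_star, b0 - 1) * s[best] / np.sqrt(b0)
    d = np.maximum(d_star, means - u_best)            # adjusted controlled distance d_i
    p_prime = 1.0 - (1.0 - p_star) / (2.0 * (k - 1))
    h_t = np.sqrt(2.0) * t.ppf(p_prime, b0 - 1)
    b_i = np.maximum(b0, np.ceil((h_t * s / d) ** 2)).astype(int)
    return b_i, best
```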
4. Empirical Experiments

In this section, we present some empirical results from simulation experiments using the proposed procedure. We use 180 batch means for the von Neumann test of independence and the chi-square test of normality. We test the procedure with six stochastic processes:

- Observations are i.i.d. uniform between 0 and 1, denoted U(0,1).
- Observations are i.i.d. N(0,1).
- Observations are i.i.d. exponential with mean 1, denoted expon(1).
- A steady-state first-order moving-average process, generated by the recurrence $X_i = \mu + \epsilon_i + \theta \epsilon_{i-1}$ for $i = 1, 2, \ldots$, where the $\epsilon_i$ are i.i.d. N(0,1) and $-1 < \theta < 1$. We set $\mu$ to 2 in our experiments. This process is denoted MA1($\theta$).
- A steady-state first-order auto-regressive process, generated by the recurrence $X_i = \mu + \phi(X_{i-1} - \mu) + \epsilon_i$ for $i = 1, 2, \ldots$, where the $\epsilon_i$ are i.i.d. N(0,1) and $-1 < \phi < 1$. We set $\mu$ to 2 in our experiments. This process is denoted AR1($\phi$). We set $X_0$ to a random variate drawn from a $N(0, 1/(1-\phi^2))$ distribution.
- A steady-state M/M/1 delay-in-queue process with arrival rate $\lambda$ and service rate $\omega = 1$. This process is denoted MM1($\rho$), where $\rho = \lambda/\omega$ is the traffic intensity.
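The six test processes can be generated with a few lines of code; the sketch below (ours) uses a steady-state start for the AR(1) and the Lindley recursion for the M/M/1 delays in queue.

```python
import numpy as np

def ma1(n, theta, mu=2.0, rng=None):
    """MA1(theta): X_i = mu + eps_i + theta * eps_{i-1}, eps_i i.i.d. N(0, 1)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(n + 1)
    return mu + eps[1:] + theta * eps[:-1]

def ar1(n, phi, mu=2.0, rng=None):
    """AR1(phi): X_i = mu + phi * (X_{i-1} - mu) + eps_i, eps_i i.i.d. N(0, 1).
    The initial deviation is drawn from N(0, 1 / (1 - phi^2)) so the chain
    starts in steady state."""
    rng = rng or np.random.default_rng()
    x = np.empty(n)
    dev = rng.normal(0.0, np.sqrt(1.0 / (1.0 - phi ** 2)))
    for i in range(n):
        dev = phi * dev + rng.standard_normal()
        x[i] = mu + dev
    return x

def mm1_delays(n, rho, omega=1.0, rng=None):
    """M/M/1 delay in queue via the Lindley recursion; arrival rate rho * omega,
    service rate omega, starting from an empty system."""
    rng = rng or np.random.default_rng()
    lam = rho * omega
    d = np.empty(n)
    d[0] = 0.0
    for i in range(1, n):
        service = rng.exponential(1.0 / omega)        # service time of customer i-1
        interarrival = rng.exponential(1.0 / lam)     # time between arrivals i-1 and i
        d[i] = max(0.0, d[i - 1] + service - interarrival)
    return d
```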
We first evaluate the performance of the von Neumann test and the chi-square test with different levels of $\alpha$. Based on the experimental results, the confidence levels of the von Neumann test and the chi-square test are set to 90% in the later experiments. Note that a lower confidence level for these tests will increase the chance of committing a Type I error (rejecting the null hypothesis when it is true) and will increase the batch size and the simulation run length. We then check the batch size at which the batch means pass the normality test and check the interdependence between the batch size at which the batch means pass the independence test and the strength of the autocorrelation of the output sequence. We evaluate the performance of using the approximately i.i.d. normal batch means generated by the QIN procedure to estimate the variance of sample means. Finally, we evaluate the performance of using the QIN procedure as a pre-processor for Chen's selection procedure.

4.1 Experiment 1

In this experiment, we used the von Neumann and chi-square tests to check whether a sequence of observations passes the tests of independence and normality. In this setting, the batch size m = 1 and the number of batches b = 180. Table 2 lists the experimental results. Each design point is based on 10 000 independent simulation runs.

Table 2. The percentage of output sequences that pass the tests of independence and normality with 180 batch means at different levels of $\alpha$

                     Independence (%)            Normality (%)
  Process          P(90%)  P(95%)  P(99%)     P(90%)  P(95%)  P(99%)
  U(0,1)             90      95      99         20      35      68
  N(0,1)             90      95      99         97      98     100
  expon(1)           90      95      99       0.07    0.18    0.58
  MA1(0.35)        0.05    0.23     1.9         96      98     100
  MA1(0.50)           0       0    0.02         97      99     100
  MA1(0.75)           0       0       0         97      99     100
  MA1(0.90)           0       0       0         96      98     100
  AR1(0.35)        0.03     0.1    0.85         96      98     100
  AR1(0.50)           0       0    0.01         96      98     100
  AR1(0.75)           0       0       0         94      97      99
  AR1(0.90)           0       0       0         78      86      94
  MM1(0.35)        0.28    0.46     1.1          0       0       0
  MM1(0.50)           0    0.01    0.02          0       0       0
  MM1(0.75)           0       0       0        4.1     6.3      13
  MM1(0.90)           0       0       0         17      23      35
The P(100(1 − $\alpha_{\mathrm{ind}}$)%) columns under Independence list the observed percentage of these 10 000 runs passing the von Neumann test when the nominal probability of the test of independence is set to $1 - \alpha_{\mathrm{ind}}$. The P(100(1 − $\alpha_{\mathrm{nor}}$)%) columns under Normality list the observed percentage of these 10 000 runs passing the chi-square test when the nominal probability of the test of normality is set to $1 - \alpha_{\mathrm{nor}}$. We set $\alpha_{\mathrm{ind}}$ and $\alpha_{\mathrm{nor}}$ to 0.1, 0.05, and 0.01.

The von Neumann test performs very well in terms of Type I error (i.e. independent sequences failing the test of independence); the observed rate is close to the specified $\alpha_{\mathrm{ind}}$ level. A Type II error is the event that we accept the null hypothesis when it is false, i.e. correlated sequences pass the test of independence. For slightly correlated sequences, the frequency of committing a Type II error is high: for example, the MA1(0.15), AR1(0.15) and MM1(0.15) processes. Note that we can use a larger number of batches to increase the power of a test and reduce the Type II error. If we mistakenly treat slightly correlated sequences as being i.i.d., the performance measurements should still be fairly accurate; thus, the low probability of a correct decision for those slightly correlated sequences should not pose a problem. On the other hand, for mildly or highly correlated sequences, the von Neumann test detects the dependence almost all the time.

The normality test performs very well in terms of Type I error; the observed rate is less than the specified $\alpha_{\mathrm{nor}}$ level. For non-normal distributions, the percentage of output sequences that pass the normality test increases as the $\alpha_{\mathrm{nor}}$ level decreases. For distributions that are asymmetric, the output sequences fail the normality test almost all the time. Moreover, the results of the normality test indicate that the autocorrelations among the samples have very little impact on the chi-square normality test when the autocorrelations are small. However, as the autocorrelations become stronger, the chi-square normality test starts to break down, for example for the AR1(0.9) and MM1(0.9) processes. Even though passing the von Neumann test does not guarantee independence, the sequence will only be slightly correlated if it is not actually independent. Hence, the residual correlation (if any) left in the batch means will have little impact on the chi-square normality test in the QIN procedure.

4.2 Experiment 2

In this section, we check the batch size at which the batch means pass the tests of normality and independence individually. In this experiment, the batch size m is determined dynamically and is a random variable. We sequentially increase the batch size (according to the algorithm in Section 3.4) until the test is passed, i.e. the batch size is doubled every two iterations. Table 3 lists the experimental results giving the batch size at which the batch means pass the normality test. The m̄ column lists the average batch size at which the batch means appear to be normal. The stdv m column lists the standard deviation of the batch size m. For instance, the average batch size for U(0,1) observations to pass the normality test is 1.85. One of the referees pointed out that the average of two U(0,1) random variates is a random variate having a triang(0, 1, 0.5) distribution, where triang(a, b, c) denotes a triangular distribution on [a, b] with mode c. Even though this triangular distribution is symmetric, it is certainly not nearly normal. This result indicates that the power of the chi-square normality test with only 5 df is weak.
Table 3. Average batch size m at which the batch means pass the test of normality

  Process     U(0,1)    N(0,1)    expon(1)
  m̄           1.85      1.04      4.41
  stdv m      0.46      0.19      2.14

Table 4. Average batch size m at which the batch means pass the test of independence

                 MA1(θ)            AR1(φ)             MM1(ρ)
  Coefficient    m̄      stdv m     m̄       stdv m     m̄        stdv m
  0.35           3.0    1.3        5.5     2.6        12.6     7.7
  0.50           3.4    1.5        9.3     4.3        29.0     16.6
  0.75           3.7    1.6        24.3    11.2       163.0    91.6
  0.90           3.7    1.6        68.2    33.0       1209.0   671.0
Nevertheless, if we mistakenly treat this kind of non-normal data as being approximately normal, the c.i. coverage is still fairly accurate. If this is a concern, users can use a normality test having higher power.

Table 4 lists the experimental results concerning the batch size at which the batch means pass the test of independence. The Coefficient column lists the coefficient value (θ, φ or ρ) of the corresponding stochastic process. The MA1(θ) columns list the results for the underlying moving-average output sequences, and similarly for AR1(φ) and MM1(ρ). In general, the average batch size at which the batch means appear to be independent increases as the autocorrelation increases. The MA1(θ) processes are only slightly correlated even with θ as large as 0.9; the average batch size at which the batch means pass the test of independence is 3.7. On the other hand, the MM1(0.90) output sequences are highly correlated, and the average batch size at which the batch means first pass the test of independence is as large as 1209. We also performed the experiment using different numbers of batches for the underlying independence and normality tests. The average batch size at which the batch means appear to be independent and normal generally increases with the number of batches used for those tests, since the power of a test increases as the sample size increases. We chose to use a sample size of 180 for our tests because this sample size meets the precision requirement and does not cause the simulation to run longer than necessary. We also experimented with increasing the batch size by 1 at each iteration, which results in too small a batch size for highly correlated sequences. The reason is that the probability of committing a Type II error is significantly increased with a larger number of iterations.

4.3 Experiment 3

In this experiment, we use the QIN procedure to determine batch sizes so that the batch means appear to be i.i.d. normal.
Table 5. Coverage of 90% confidence intervals of independent samples

  Process       U(0,1)    N(0,1)    expon(1)
  μ             0.50      0.00      1.00
  avg r         0.0256    1.0       0.0287
  stdv r        0.0202    0.0       0.0237
  avg samp      365       214       973
  stdv samp     166       111       800

  batches       180       180       180
  avg bsize     2.03      1.19      5.41
  stdv bsize    0.92      0.62      4.45
  avg hw        0.0262    0.118     0.0587
  stdv hw       0.0045    0.0148    0.0137
  coverage      90%       91%       90%

  batches       30        30        30
  avg bsize     12.17     7.12      32.44
  stdv bsize    5.54      3.70      26.68
  avg hw        0.0265    0.119     0.0592
  stdv hw       0.0054    0.020     0.0153
  coverage      90%       90%       90%
Since batch means are often used to estimate the variance of sample means, we evaluate the accuracy of the variance estimated by the procedure. In these experiments, no relative or absolute precisions were specified, so the half-width of the c.i. is the result of the default precision. In this experiment we list the results of c.i. coverage when the entire sequence is divided into 180 batches as well as into 30 batches. Recall that for i.i.d. normal data with known variance, the batch size has no impact on the c.i. half-width H when the sample size is fixed. We would like to determine whether the half-widths constructed with different batch sizes are approximately the same.

Table 5 lists the experimental results obtained from sampling observations from three different i.i.d. processes. The $\mu$ row lists the true mean. The avg r and stdv r rows list, respectively, the average relative precision and the standard deviation of the relative precision of the estimator $\hat{\mu}$. Here, the relative precision is defined as $r = |1 - \hat{\mu}/\mu|$. The avg samp and stdv samp rows list the average sample size and the standard deviation of the sample size, respectively. The avg bsize and stdv bsize rows list the average batch size and the standard deviation of the batch size obtained by the procedure, respectively. The avg hw and stdv hw rows list the average half-width and the standard deviation of the half-width obtained by the procedure, respectively. The coverage row lists the percentage of the c.i.'s that cover the true mean value.

The average batch size is 1.19 when sampling from an i.i.d. normal distribution, reflecting the fact that the underlying observations are independent and normally distributed. On the other hand, the batch means do not pass the normality test until the average batch size is 2.03 when sampling from a U(0,1) distribution.
Table 6. Coverage of 90% confidence intervals of autocorrelated samples

  Process       MA1(0.9)    AR1(0.9)    MM1(0.9)
  μ             2.00        2.00        9.00
  avg r         0.0286      0.0368      0.0121
  stdv r        0.0235      0.0313      0.0103
  avg samp      830         14 560      3 296 931
  stdv samp     980         12 481      2 992 981

  batches       180         180         180
  avg bsize     4.61        80.9        18 316
  stdv bsize    5.45        69.3        16 628
  avg hw        0.114       0.141       0.202
  stdv hw       0.0212      0.0268      0.0641
  coverage      88%         88%         87%

  batches       30          30          30
  avg bsize     27.7        485         109 898
  stdv bsize    32.7        416         99 766
  avg hw        0.120       0.148       0.204
  stdv hw       0.0261      0.0333      0.0674
  coverage      89%         89%         88%
Note that the batch size of 2.03 is greater than the value of 1.85 in Experiment 2, since independent sequences will fail the test of independence $\alpha_{\mathrm{ind}} \times 100\%$ of the time. The exponential distribution is asymmetric; the batch means do not pass the normality test until the average batch size is 5.41. The c.i. coverages are around the specified 90% confidence level. The c.i. half-widths and coverages are approximately the same when the entire output sequence is divided into 180 batches and into 30 batches. This provides some assurance that the batch sizes determined by the QIN procedure are large enough for these batch means to be approximately normal; see Section 3.2.

Table 6 lists the experimental results from stochastic processes of greater complexity. For these three tested processes, the c.i. coverages are around the specified 90% confidence level. Since the steady-state distributions of the MA1 and AR1 processes are normal, we believe that some of the batch means that passed the test of independence may be slightly correlated. For the M/M/1 queueing process, samples are not only highly correlated but also far from normal; the steady-state distribution of the M/M/1 queueing process is not only asymmetric but also discontinuous at x = 0. Hence, the batch size for the M/M/1 queueing process to appear independent and normally distributed is significantly larger than that of the MA1 and AR1 processes. In addition to some correlated batch means passing the test of independence, some non-normal batch means also pass the test of normality; thus, the coverage of the M/M/1 queueing process is slightly lower than that of the MA1 and AR1 processes.

We compare the performance of QIN with other procedures from the literature: QIBatch [5], ASAP [16], ASAP2 [17], ASAP3 [7] and WASSP [18] (a wavelet-based spectral method). Instead of working in the time domain with the original output process, WASSP works in the frequency domain by exploiting a spectral-analysis approach. The WASSP procedure transforms the covariance function to a power spectrum, which is then estimated by a periodogram. Table 7 lists the experimental results from QIBatch, ASAP, ASAP2, ASAP3 and WASSP for the M/M/1 process with traffic intensity 0.9. The results of QIBatch are obtained with the default precision, while the results from all other procedures are obtained with the required relative precision set to 7.5%. These results are extracted directly from the corresponding papers cited above. Since ASAP and its extensions terminate when they detect normality among batch means and deliver a correlation-adjusted c.i. half-width, they are able to provide valid c.i.'s with relatively small sample sizes. On the other hand, the QIBatch and QIN procedures generally require larger sample sizes and deliver tighter c.i.'s by default, because they terminate only after they have allocated samples large enough to obtain batch means that pass both tests of independence and normality. Note that the observed default relative precisions from QIN are no more than 4%. Hence, QIN does not need to invoke the half-width reduction phase unless the required relative precision is high, say less than 4%. We do not think this is a major drawback, since wide half-widths provide little useful information. Furthermore, the batch means generated from QIBatch and QIN can be used as input to simulation procedures that require i.i.d. normal data.

The estimated required sample sizes for QIBatch, ASAP, ASAP2, ASAP3 and WASSP to obtain average MM1(0.9) c.i. half-widths smaller than 0.202 are approximately 2 489 000 ($\approx (0.295/0.202)^2 \times 1\,180\,442$), 3 410 000, 2 716 000, 2 771 000 and 3 276 000, respectively, most of which are less than the average sample size of 3 297 000 used by the QIN procedure to obtain an average c.i. half-width of 0.202. We believe this is because the underlying distribution is extremely non-normal and has a discontinuity point; a very large batch size is therefore required for the batch means to pass the normality test. For example, the average batch size for the MM1(0.9) process BM to pass the test of independence is 1209 (see Table 4), while the average batch size for the BM to pass both the tests of independence and normality is 18 316. For automated procedures, i.e. procedures for which no user intervention is required during execution, the complexities of the algorithms do not matter with respect to ease of use once the procedures are implemented. Nevertheless, the QIN procedure is easy to state and interpret, thus making it attractive to software developers and for practical use.
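As a quick check of the sample-size extrapolation used above, scaling each procedure's average sample size by the squared ratio of its achieved half-width to the 0.202 target gives values close to the quoted figures (to within roughly 1%); the arithmetic below is ours, not output from the procedures.

```python
# Extrapolated sample sizes needed for an average half-width of 0.202,
# using n' = n * (avg_hw / 0.202) ** 2 with the (avg samp, avg hw) pairs of Table 7.
table7 = {
    "QIBatch": (1_180_442, 0.295),
    "ASAP":    (815_755,   0.413),
    "ASAP2":   (281_022,   0.628),
    "ASAP3":   (287_568,   0.627),
    "WASSP":   (388_000,   0.587),
}
for name, (n, hw) in table7.items():
    print(f"{name}: {n * (hw / 0.202) ** 2:,.0f}")
```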
Table 7. Coverage of 90% confidence intervals from QIBatch, ASAP, ASAP2, ASAP3 and WASSP

  Procedure    QIBatch      ASAP       ASAP2      ASAP3      WASSP
  avg samp     1 180 442    815 755    281 022    287 568    388 000
  coverage     89.3%        94.0%      92.0%      89.5%      90.4%
  avg r        0.016        0.046      0.070      0.070      0.065
  avg hw       0.295        0.413      0.628      0.627      0.587
  std hw       0.056        0.134      0.045      0.045      0.085
Table 8. P(CS) of the AR1(φ) processes with P* = 0.95 and the initial batched sample size b0 = 20

                     k = 2                                 k = 5
  φ      P̂(CS)    T               stdv(T)        P̂(CS)    T                stdv(T)
  0.3    0.941    739 (164)       216 (62)       0.947    3533 (781)       755 (205)
  0.9    0.936    10 408 (157)    2836 (56)      0.953    49 613 (732)     10 702 (181)
4.4 Experiment 4

In this experiment, we generate batch means that appear to be i.i.d. normal for Chen's procedure (see Section 3.6) to select the stochastic process that has the smallest expected value. We use various auto-regressive random variables to represent the performance measures. Dudewicz and Zaino [19] and Goldsman et al. [20] have investigated selection procedures in the presence of an auto-regressive process. In order to compare our results with other studies in the literature, we test the stochastic processes used by Goldsman et al. [20], who investigated the P(CS) and sample size with a series of pre-determined batch sizes. There are $k$ alternative designs under consideration. Suppose $X_{i,j} = \mu_i + \phi(X_{i,j-1} - \mu_i) + \epsilon_{i,j}$, $i = 1, 2, \ldots, k$, where the $\epsilon_{i,j}$ are i.i.d. $N(0, 1 - \phi^2)$. The slippage configuration is used, in which $\mu_1$ was set to 0, while $\mu_2 = \mu_3 = \cdots = \mu_k = d^*$. The number of systems in each experiment varied over $k = 2, 5$. We want to select the design with the minimum mean: design 1. In all cases we set the marginal variance of each system, $\sigma_X^2$, to 1. Let $\sigma_1^2$ be the steady-state variance constant of the best system. The indifference amount $d^*$ is set to $\sqrt{\sigma_1^2/70}$ and $\sqrt{\sigma_1^2/1000}$ for $\phi = 0.3$ and $\phi = 0.9$, respectively. Note that for the AR1($\phi$) processes, the SSVC is $\sigma^2 = \sigma_X^2 (1+\phi)/(1-\phi)$. Consequently, the indifference amounts are approximately 0.1629 (i.e. $\sqrt{(1.3/0.7)/70}$) and 0.1378 (i.e. $\sqrt{(1.9/0.1)/1000}$) for $\phi = 0.3$ and $\phi = 0.9$, respectively. The number of initial samples or batch means $b_0$ is set to 20. Furthermore, 1000 independent experiments are performed to estimate the actual P(CS) by $\hat{P}$(CS), the proportion of the 1000 experiments in which we obtained the correct selection.

Table 8 displays the experimental results. The $\hat{P}$(CS) column lists the proportion of correct selections. The T column lists the average number of total unbatched observations and of batched observations (in parentheses) used for each experimental design. The stdv(T) column lists the standard deviations of the number of total unbatched observations and of batched observations (in parentheses) across the independent simulation runs. The $\hat{P}$(CS) values are around the nominal value when the auto-regressive sequences are pre-processed into approximately i.i.d. normal batch means before being used as input to Chen's two-stage selection procedure. The residual autocorrelations between BM may have caused the observed $\hat{P}$(CS) to be lower than desired. If this is a concern, a lower confidence level for the test of independence can be used to reduce the frequency of committing a Type II error; see Table 2. The batch size of each design is determined independently, i.e. the batch sizes may differ among these designs and can differ across simulation runs. The observed average batch sizes are approximately 4.5 and 67 for $\phi = 0.3$ and $\phi = 0.9$, respectively. These batch sizes are consistent with the results in Table 4 of Experiment 2. Furthermore, these results are consistent with the results of Goldsman et al. [20], in the sense that the batch sizes determined by the QIN procedure are close to their pre-determined batch sizes for the allocated total unbatched observations and the observed P(CS).
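The indifference amounts quoted above follow directly from the AR(1) SSVC formula; a short check of the arithmetic (ours):

```python
import math

def ar1_ssvc(phi, var_x=1.0):
    """SSVC of an AR(1) process with marginal variance var_x: var_x * (1 + phi) / (1 - phi)."""
    return var_x * (1.0 + phi) / (1.0 - phi)

print(math.sqrt(ar1_ssvc(0.3) / 70))     # ~0.1629, indifference amount for phi = 0.3
print(math.sqrt(ar1_ssvc(0.9) / 1000))   # ~0.1378, indifference amount for phi = 0.9
```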
5. Conclusions

We have presented an algorithm for estimating the required batch size so that the batch means appear to be i.i.d. normal, as determined by the tests of independence and normality. We have also presented a strategy for building a c.i. for the mean $\mu$ of a steady-state simulation response. The QIN algorithm works well for determining the required batch size for the batch means to become approximately i.i.d. normal. The procedure estimates the required sample size based entirely on data and does not require any user intervention. The QIN procedure can be used as a pre-processor for simulation procedures that require i.i.d. normal data, e.g. Chen's selection procedure. The experimental evaluation reveals that QIN determines batch sizes that are sufficiently large for achieving approximately i.i.d. normal batch means and adequate c.i. coverage. A straightforward test of independence and normality was used to obtain approximately i.i.d. normal batch means, making the procedure easy to understand and simple to implement.

6. Acknowledgements

The authors are grateful to the anonymous referees and Dr David Goldsman for their time and effort in providing guidance on how to improve the paper. Their detailed and helpful comments significantly improved the quality of the paper. The authors also thank Bobbie Chern for helpful comments.

7. References

[1] Bechhofer, R. E., T. J. Santner, and D. M. Goldsman. 1995. Design and Analysis of Experiments for Statistical Selection, Screening and Multiple Comparisons. John Wiley & Sons, Inc., New York.
[2] Law, A. M. and W. D. Kelton. 2000. Simulation Modeling and Analysis. 3rd ed. McGraw-Hill, New York.
[3] von Neumann, J. 1941. Distribution of the Ratio of the Mean Square Successive Difference to the Variance. Annals of Mathematical Statistics, 12, 367–395.
[4] Chen, E. J. and W. D. Kelton. 2003. Determining Simulation Run Length with the Runs Test. Simulation Modelling Practice and Theory, 11(3–4), 237–250.
[5] Chen, E. J. and W. D. Kelton. In press. Confidence-Interval Estimation Using Quasi-Independent Sequences. IIE Transactions.
[6] Billingsley, P. 1999. Convergence of Probability Measures. 2nd ed. John Wiley & Sons, Inc., New York.
[7] Steiger, N. M., E. K. Lada, J. R. Wilson, J. A. Joines, C. Alexopoulos, and D. Goldsman. 2005. ASAP3: A Batch Means Procedure for Steady-State Simulation Output Analysis. ACM Transactions on Modeling and Computer Simulation, 15, 39–73.
[8] Law, A. M. and J. S. Carson. 1979. A Sequential Procedure for Determining the Length of a Steady-State Simulation. Operations Research, 27, 1011–1025.
[9] Fishman, G. S. 2001. Discrete-Event Simulation: Modeling, Programming, and Analysis. Springer-Verlag, New York.
[10] Fishman, G. S. 1978. Grouping Observations in Digital Simulation. Management Science, 24(5), 510–521.
[11] Kelton, W. D., R. P. Sadowski, and D. T. Sturrock. 2007. Simulation with Arena. 4th ed. McGraw-Hill, New York.
[12] Chen, E. J. and W. D. Kelton. 2004. Experimental Performance Evaluation of Histogram Approximation for Simulation Output Analysis. Proceedings of the 2004 Winter Simulation Conference, R. G. Ingalls, M. D. Rossetti, J. S. Smith, and B. A. Peters (eds). Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, 685–693.
[13] Rice, J. A. 1995. Mathematical Statistics and Data Analysis. 2nd ed. Duxbury Press, Belmont, California.
[14] Bratley, P., B. L. Fox, and L. E. Schrage. 1987. A Guide to Simulation. 2nd ed. Springer-Verlag, New York.
[15] Chen, E. J. 2004. Using Ordinal Optimization Approach to Improve Efficiency of Selection Procedures. Journal of Discrete Event Dynamic Systems, 14(2), 153–170.
[16] Steiger, N. M. and J. R. Wilson. 1999. Improved Batching for Confidence Interval Construction in Steady-State Simulation. Proceedings of the 1999 Winter Simulation Conference, P. A. Farrington, H. B. Nembhard, D. T. Sturrock, and G. W. Evans (eds). Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, 442–451.
[17] Steiger, N. M., E. K. Lada, J. R. Wilson, C. Alexopoulos, D. Goldsman, and F. Zouaoui. 2002. ASAP2: An Improved Batch Means Procedure for Simulation Output Analysis. Proceedings of the 2002 Winter Simulation Conference, E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes (eds). Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, 336–344.
[18] Lada, E. K. and J. R. Wilson. In press. A Wavelet-Based Spectral Procedure for Steady-State Simulation Analysis. European Journal of Operational Research.
[19] Dudewicz, E. J. and N. A. Zaino. 1977. Allowance for Correlation in Setting Simulation Run-Length via Ranking-and-Selection Procedures. TIMS Studies in the Management Sciences, 7, 51–61.
[20] Goldsman, D., W. S. Marshall, S. H. Kim, and B. L. Nelson. 2000. Ranking and Selection for Steady-State Simulation. Proceedings of the 2000 Winter Simulation Conference, J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick (eds). Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, 544–553.
E. Jack Chen is a Senior Staff Specialist with BASF Corporation. He received a Ph.D. degree from the University of Cincinnati. His research interests are in the area of computer simulation.

W. David Kelton is a Professor in the Department of Quantitative Analysis and Operations Management at the University of Cincinnati, where he teaches courses in simulation, statistics and operations research. His primary research interests are in simulation methods and applications. He was Editor-in-Chief of the INFORMS Journal on Computing from 2000 to mid-2007, and has served as simulation Area/Department Editor for the INFORMS Journal on Computing, Operations Research, and IIE Transactions.