European Journal of Operational Research 256 (2017) 1–16


Invited Review

Regression and Kriging metamodels with their experimental designs in simulation: A review

Jack P. C. Kleijnen, Tilburg University, Postbox 90153, Tilburg, Netherlands

Article info

Article history: Received 8 July 2015; Accepted 17 June 2016; Available online 20 June 2016.
Keywords: Robustness and sensitivity; Metamodel; Design; Regression; Kriging

Abstract

This article reviews the design and analysis of simulation experiments. It focusses on analysis via two types of metamodel (surrogate, emulator); namely, low-order polynomial regression, and Kriging (or Gaussian process). The metamodel type determines the design of the simulation experiment, which determines the input combinations of the simulation model. For example, a first-order polynomial regression metamodel should use a “resolution-III” design, whereas Kriging may use “Latin hypercube sampling”. More generally, polynomials of first or second order may use resolution III, IV, V, or “central composite” designs. Before applying either regression or Kriging metamodeling, the many inputs of a realistic simulation model can be screened via “sequential bifurcation”. Optimization of the simulated system may use either a sequence of low-order polynomials—known as “response surface methodology” (RSM)—or Kriging models fitted through sequential designs—including “efficient global optimization” (EGO). Finally, “robust” optimization accounts for uncertainty in some simulation inputs. © 2016 Elsevier B.V. All rights reserved.

1. Introduction

In this article we review the design and analysis of simulation experiments. This design depends on the metamodel — also called surrogate model or emulator — which is an explicit and relatively simple approximation of the input/output (I/O) function implicitly defined by the underlying simulation model. Mathematically, we consider w = f_sim(z, r) where w denotes the random simulation output (response), f_sim the I/O function of the simulation model, z the vector with the values of the k simulation inputs, and r the vector with pseudorandom numbers (PRNs) used by the random simulation model. Usually, z is standardized such that the resulting input d has elements −1 ≤ d_j ≤ 1 with j = 1, …, k. Notice that queueing simulations may have the traffic rate as “the” simulation input — not the arrival rate 1/μ_a and the service rate 1/μ_s, where μ denotes the mean (expected value) — so we may have z_1 = μ_s/μ_a. Another input of queueing simulations may be the priority rule (queueing discipline), which is a qualitative input with at least two values or “levels”; e.g., first-in-first-out and smallest-service-time-first (a qualitative input with more than two values needs special care, as explained in Kleijnen (2005, pp. 69–71)). We focus on stochastic simulation, but our review is also relevant for deterministic simulation — where r vanishes — which is often

Tel.: +31 134662029. E-mail address: [email protected]

http://dx.doi.org/10.1016/j.ejor.2016.06.041 0377-2217/© 2016 Elsevier B.V. All rights reserved.

applied in system dynamics and in computer-aided design/computer-aided engineering (CAD/CAE). Obviously, f_sim is an implicit nonlinear transformation of z and r. Metamodeling approximates f_sim(z, r) through y = f_meta(x) + e where y is the metamodel output, which may deviate from w, so the approximation error e may have μ_e ≠ 0. An example of f_meta is a second-order polynomial in d_j (j = 1, …, k); actually, a polynomial of any order is a linear regression (meta)model, as we shall detail in Section 2. Kriging — or Gaussian process (GP) — metamodels are also explicit — but more complicated — models of d_j, as we shall detail in Section 5. Altogether, f_meta is explicit and much simpler than f_sim. We call f_meta “adequate” or “valid” if μ_e = 0. There are several types of metamodels, but we focus on the most popular types; namely, low-order polynomial regression and Kriging metamodels. The close relationship between the type of metamodel and the experimental design can be illustrated as follows. If we assume that a first-order polynomial is an adequate metamodel, then changing one factor (simulation input) at a time enables unbiased estimators of the first-order or “main” effects of these factors. However, these estimators do not have minimum variances; a fractional factorial design with two values per factor of so-called resolution III (abbreviated to R-III) — denoted as a 2_III^{k−p} design — gives estimators with minimum variances, as we shall see in Section 2. By definition, R-III designs have estimators of the main effects that are not confounded with (biased by) each other; we


shall also discuss resolution-IV (R-IV) and resolution-V (R-V) designs. Furthermore, we focus on simulation aimed at sensitivity analysis (SA) and optimization of the underlying real system. For such SA and optimization there are many methods, but we focus on methods that use either low-order polynomials or Kriging and the designs typically used by these two types of metamodels. SA is an ambiguous term; e.g., SA may be either global or local. We focus on global SA; e.g., in screening and Kriging we use global metamodels. Nevertheless, we use local SA in optimization through low-order polynomials, known as response surface methodology (RSM). Notice that many SA methods were recently reviewed in Borgonovo and Plischke (2016) and Wei, Lu, and Song (2015). Because both Kriging and simulation-based optimization are very active fields of research, we update two previous reviews; namely, Kleijnen (2005, 2009). We base our update on Kleijnen (2015), which includes many website addresses for software and hundreds of additional references. We assume that the readers have a basic knowledge of simulation and mathematical statistics. We organize this article as follows. Section 2 summarizes classic linear regression metamodels — including polynomials — and their designs. Section 3 presents solutions when in practice the classic assumptions do not hold. Section 4 explains how sequential bifurcation (SB) can screen hundreds of inputs of realistic simulation models; this section uses the two preceding sections. Section 5 summarizes Kriging and its designs. Section 6 explains simulation optimization that uses either low-order polynomials or Kriging; this section includes robust optimization. This article ends with 61 carefully selected references, including 35 references not older than 2010 (a rather arbitrary cut-off date). 
For the readers’ convenience, an Appendix gives a list of major symbols and their meanings (the symbols in the list are “global” variables that feature throughout this review, whereas subscripts in summation signs such as r in Σ_{r=1}^{m} are “local” variables, so global and local variables have different meanings; without this distinction we would run out of “natural” symbols). Bold symbols denote matrices and vectors; Greek symbols denote parameters that are to be estimated.

2. Classic linear regression metamodels and their designs

Classic linear regression (meta)models have the following mathematical specification:

y = Xβ + e.   (1)

Here y denotes the n-dimensional vector with the dependent (explained) variable of the regression metamodel (y is to be distinguished from w, the response of the simulation model), where n denotes the number of simulated input combinations (in Section 1 the original simulation input is z, and the corresponding standardized simulation input is d). The independent (explanatory) regression variables are x, which are simple functions of z or d; e.g., x = z (first-order effect) or x = z_1 z_2 (interaction between simulation inputs 1 and 2). So, X = (x_{i;g}) in (1) denotes the n × q matrix of independent regression variables with x_{i;g} the value of the independent variable g in combination i (i = 1, …, n; g = 1, …, q). Note that row i of X is x_i = (x_{i;1}, …, x_{i;q}). Furthermore, β denotes the q-dimensional vector with regression parameters. Finally, e denotes the n-dimensional vector with the residuals, so E(e) = E(y) − E(w); Section 1 implies that w denotes the n-dimensional vector with the simulation outputs w_i = f_sim(z_i, r_i) where z_i denotes combination i of the k original simulation inputs that is determined by the n × k design matrix D = (d_{i;j}), and r_i denotes the vector with PRNs used in combination i.
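To make the relation between D and X concrete, the following minimal numpy sketch (our illustration, not from the paper; the function name `build_X` is our own) assembles X from a design matrix D for a first-order polynomial, optionally with two-factor interactions:

```python
import numpy as np

def build_X(D, interactions=False):
    """Assemble the n x q matrix X of (1) from an n x k design matrix D:
    an intercept column, the k main-effect columns x_j = d_j, and
    optionally the k(k-1)/2 interaction columns d_j * d_j'."""
    D = np.asarray(D, dtype=float)
    n, k = D.shape
    cols = [np.ones(n)] + [D[:, j] for j in range(k)]
    if interactions:
        for j in range(k):
            for j2 in range(j + 1, k):
                cols.append(D[:, j] * D[:, j2])
    return np.column_stack(cols)

# 2^2 full factorial: n = 4 combinations of k = 2 standardized inputs.
D = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])
X = build_X(D, interactions=True)   # columns: 1, d1, d2, d1*d2
```

For this full factorial, X is orthogonal, which foreshadows the variance results discussed below.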

In this review we focus on a special case of (1); namely, a second-order polynomial with k simulation inputs:

y = β_0 + Σ_{j=1}^{k} β_j x_j + Σ_{j=1}^{k} Σ_{j′≥j} β_{j;j′} x_j x_{j′} + e   (2)

with the intercept β_0, the k first-order effects β_j (j = 1, …, k), the k(k − 1)/2 two-factor interactions (cross-products) β_{j;j′} (j′ > j), and the k purely quadratic effects β_{j;j}. Interaction means that the effect of an input depends on the values of one or more other inputs; furthermore, β_{j;j′} > 0 means that the inputs j and j′ are complementary, and β_{j;j′} < 0 means that the inputs j and j′ are substitutes. A purely quadratic effect means that the marginal effect of the input is not constant, but diminishes or increases. We point out that the metamodel (2) is nonlinear in x, but — as in (1) — it is linear in β (engineers call this metamodel nonlinear, whereas statisticians call it linear). In this review, we assume that interactions among three or more inputs are unimportant. Our reason is that such interactions are hard to interpret, and are often unimportant in practice. Of course, we should check this assumption; i.e., we should “validate” the estimated metamodel, as we shall see in Section 3.5. To estimate β in (1), we apply the least squares (LS) criterion (or the mathematical L_2 norm) and obtain

β̂ = (X′X)^{−1} X′w.   (3)

Obviously, the LS estimator β̂ exists if and only if the inverse of X′X exists (so the square matrix X′X is nonsingular); e.g., β̂ exists if X is orthogonal (so X′X = nI). If d_i — which determines x_i — is simulated m_i times (usually, m_i > 1 in random simulation), then x_i is repeated m_i times within X — so X is a matrix with Σ_{i=1}^{n} m_i rows and q columns. If m_i is a constant (say) m, then we may replace w by w̄ with the n elements w̄_i = Σ_{r=1}^{m_i} w_{i;r}/m_i so X is an n × q matrix. Note that n denotes the number of different combinations of the k simulation inputs. The LS estimator β̂ is identical to the maximum likelihood estimator (MLE) if we assume that e in (1) is white noise; i.e., e_i is normally, independently, and identically distributed (NIID) with zero mean and a constant variance σ_e². If the metamodel (1) is valid, then obviously σ_e² = σ_w² where σ_w² denotes the variance of w. The white-noise assumption implies that β̂ has the following covariance matrix:

Σ_{β̂} = (X′X)^{−1} σ_w².   (4)

In this equation, σ_w² is unknown, but we can estimate σ_w² = σ_e² through the mean squared residuals (MSR):

MSR = (ŷ − w)′(ŷ − w) / (n − q)   (5)

where ŷ = Xβ̂; obviously, (5) assumes n − q > 0. Together, (5) and (4) give Σ̂_{β̂}. Next we can derive confidence intervals (CIs) and tests for the individual elements of β̂, using their estimated standard deviations (standard errors) s(β̂_g) — which are the square roots of the estimated variances s²(β̂_g) given by the main diagonal of Σ̂_{β̂} — and t_f (Student statistic with f degrees of freedom) with f = n − q:

t_{n−q} = (β̂_g − β_g) / s(β̂_g) with g = 1, …, q.   (6)
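Equations (3) through (6) can be sketched in a few lines of numpy; the data below are synthetic (ours, not the paper's), with a small noise level so the LS estimates land near the true parameters:

```python
import numpy as np

# Synthetic example: n = 8 input combinations, q = 3 regression parameters.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.uniform(-1, 1, 8), rng.uniform(-1, 1, 8)])
beta_true = np.array([2.0, 1.0, -0.5])
w = X @ beta_true + 0.01 * rng.standard_normal(8)   # simulation outputs

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ w            # LS estimator (3)
resid = X @ beta_hat - w                # y_hat - w
msr = resid @ resid / (8 - 3)           # MSR (5), with n - q = 5
cov_hat = msr * XtX_inv                 # estimated covariance matrix, cf. (4)
se = np.sqrt(np.diag(cov_hat))          # standard errors s(beta_g)
t_stat = beta_hat / se                  # (6) under the hypothesis beta_g = 0
```

In practice the t statistics would be compared with the quantiles of the t distribution with n − q degrees of freedom.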

To select a specific design matrix D (which determines X, given a specific metamodel) — with z_i constrained over some set or region — we may decide to minimize Var(β̂_g). It can be proven that these Var(β̂_g) are minimal if X is orthogonal. Usually, the theory on design of experiments (DOE) assumes that the original simulation inputs z_{i;j} are standardized (scaled) such that −1 ≤ d_{i;j} ≤ 1.


Table 1. The Plackett–Burman design for eleven inputs.

i     1  2  3  4  5  6  7  8  9  10 11
1     +  −  +  −  −  −  +  +  +  −  +
2     +  +  −  +  −  −  −  +  +  +  −
3     −  +  +  −  +  −  −  −  +  +  +
4     +  −  +  +  −  +  −  −  −  +  +
5     +  +  −  +  +  −  +  −  −  −  +
6     +  +  +  −  +  +  −  +  −  −  −
7     −  +  +  +  −  +  +  −  +  −  −
8     −  −  +  +  +  −  +  +  −  +  −
9     −  −  −  +  +  +  −  +  +  −  +
10    +  −  −  −  +  +  +  −  +  +  −
11    −  +  −  −  −  +  +  +  −  +  +
12    −  −  −  −  −  −  −  −  −  −  −

Table 2. A one-sixteenth fractional factorial design for seven inputs.

i     1  2  3  4=1×2  5=1×3  6=2×3  7=1×2×3
1     −  −  −    +      +      +       −
2     +  −  −    −      −      +       +
3     −  +  −    −      +      −       +
4     +  +  −    +      −      −       −
5     −  −  +    +      −      −       +
6     +  −  +    −      +      −       −
7     −  +  +    −      −      +       −
8     +  +  +    +      +      +       +
If each simulation input has only two values in the whole experiment with its n input combinations, then this standardization implies the following linear transformation, where L_j denotes the lower value of z_j in the experiment (so L_j = min_i z_{i;j}), H_j the higher value (so H_j = max_i z_{i;j}), and z̄_j the average value of input j in a balanced experiment in which each input is observed at its lower value in n/2 combinations:

d_{i;j} = (z_{i;j} − z̄_j) / ((H_j − L_j)/2)   (i = 1, …, n; j = 1, …, k).   (7)

Consequently, if X is orthogonal, then X′X = nI so (4) gives

Σ_{β̂} = (nI)^{−1} σ_w² = I σ_w²/n.   (8)

This implies that the q estimators in β̂ are statistically independent (Σ_{β̂} is diagonal) and have the same variance (namely, σ_w²/n). Because these estimators have the same estimated variances, we can rank the explanatory variables in order of importance, using either β̂_g or t_{n−q} in (6) with β_g = 0 so t_{n−q} = β̂_g / s(β̂_g); obviously, we do not hypothesize that the intercept β_0 is zero in (2). Because all q estimators are independent, the full regression model with q effects and the reduced model with nonsignificant effects eliminated have identical values for those estimated effects that occur in both models. If X were not orthogonal, then so-called backwards elimination of nonsignificant effects would change the values of the remaining estimates.

The selection of D that gives a “good” β̂ is the goal of DOE, as we shall discuss in the next subsections. In this discussion we initially assume that no input combination is simulated more than once, so m_i = 1 (i = 1, …, n). We discuss the following special cases of (2): (i) all second-order effects β_{j;j′} (j ≤ j′) are zero (so the metamodel becomes a first-order polynomial); (ii) all purely quadratic effects β_{j;j} are zero; for case (ii) we discuss (a) estimating the first-order effects β_j unbiased by the two-factor interactions β_{j;j′} with j < j′, and (b) obtaining unbiased estimators of both β_j and β_{j;j′}. These cases require designs of different resolution (abbreviated to R); e.g., R-III designs for first-order polynomials, which we shall discuss next.

2.1. R-III designs for first-order polynomials

By definition, a R-III design gives unbiased estimators of β_j (j = 1, …, k), assuming a first-order polynomial is a valid metamodel. These designs are also known as Plackett–Burman designs. A subclass of these designs are fractional factorial two-level 2_III^{k−p} designs with positive integer p such that p < k and n = 2^{k−p} ≥ 1 + k. In R-III designs, n is a multiple of 4. For example, 8 ≤ k ≤ 11 implies n = 12; see Table 1 where i = 1, …, n, column 1 stands for (d_{1;1}, …, d_{n;1}) and columns 2 through 11 stand for inputs 2 through 11, − stands for −1,

and + for 1. Another example is the 2_III^{7−4} design in Table 2, so k = 7 and p = 4; we explain Table 2 as follows. The symbol 4 = 1 × 2 stands for d_{i;4} = d_{i;1} d_{i;2} with i = 1, …, n; e.g., the first element (i = 1) in this column is d_{1;4} = d_{1;1} d_{1;2} = (−1)(−1) = +1. The DOE literature calls “4 = 1 × 2” a design generator. A 2^{k−p} design requires p generators; i.e., Table 2 specifies these generators in the last four columns. It is easy to check that this example gives a balanced design and an orthogonal X. If 4 ≤ k ≤ 6, then we ignore some columns of Table 2; e.g., if k = 6 we may ignore the last column. If k = 7, then Table 2 implies a saturated design; by definition, such a design means n = q (obviously, a first-order polynomial means q = 1 + k). A saturated design implies n − q = 0 in (5) so the MSR is undefined. To solve this problem, we may add one or more combinations to Table 2; e.g., either combinations from the full factorial 2^7 design excluding the combinations in Table 2, or the combination at the center of the experimental area where d_j = 0 if d_j is quantitative and d_j is randomly selected as −1 or 1 if d_j is qualitative with two values. Kleijnen (2015, p. 53) details a 2^{15−11} design, including a simple algorithm for constructing this design. Algorithms for the construction of 2_III^{k−p} designs with high k values are presented in Ryan and Bulutoglu (2010) and Shrivastava and Ding (2010). We do not detail 2_III^{k−p} designs with such high k values, because in practice such values are rare in R-III designs. In Section 4 we shall discuss so-called screening designs that are more efficient than 2_III^{k−p} designs. There are also Plackett–Burman designs for 12 ≤ n ≤ 96 with n not a power of two but a multiple of four. Montgomery (2009, p. 326) and Myers, Montgomery, and Anderson-Cook (2009, p. 165) tabulate such designs for 12 ≤ n ≤ 36. These designs are balanced and orthogonal, like 2^{k−p} designs are.

2.2. R-IV designs

By definition, a R-IV design gives unbiased estimators of β_j (j = 1, …, k) in a first-order polynomial — even if two-factor interactions are nonzero, but all “higher-order” effects (among three or more factors) are zero. Remembering that the number of two-factor interactions in (2) is k(k − 1)/2, we obtain q = 1 + k + k(k − 1)/2. Furthermore, X follows from the n × k design matrix D = (d_{i;j}); i.e., row i (i = 1, …, n) of X is

x_i = (1, d_{i;1}, …, d_{i;k}, d_{i;1} d_{i;2}, …, d_{i;k−1} d_{i;k}).   (9)
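The generator construction of Table 2 (Section 2.1) is easy to reproduce; the following sketch (ours, not the paper's) builds the 2_III^{7−4} design and verifies that the first-order X is orthogonal with X′X = nI:

```python
import numpy as np
from itertools import product

# Full 2^3 factorial in the basic inputs 1, 2, 3 (8 runs).
base = np.array(list(product([-1.0, 1.0], repeat=3)))
d1, d2, d3 = base.T
# Generators of Table 2: 4 = 1x2, 5 = 1x3, 6 = 2x3, 7 = 1x2x3.
D7 = np.column_stack([d1, d2, d3, d1 * d2, d1 * d3, d2 * d3, d1 * d2 * d3])

# First-order X: intercept plus the k = 7 main-effect columns; the design
# is saturated (n = q = 8), balanced, and gives an orthogonal X.
X = np.column_stack([np.ones(8), D7])
```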

To construct a R-IV design, we apply the foldover theorem, which states that augmenting a R-III design D with its mirror design −D gives a R-IV design; see Box and Wilson (1951). Obviously, a R-IV design doubles the number of combinations. A R-IV design does not enable unbiased estimators of all the individual two-factor interactions. For example, Table 2 gave a 2_III^{7−4} design, so the R-IV design implies n = 16. Furthermore, X follows from (9) with k = 7 so X has q = 29 columns. Because n < q, X′X is singular so we cannot apply (3) to obtain the LS estimators of the 29 individual regression parameters. Note: Kleijnen (2015, p. 59) explains that we can estimate sums of these interactions; e.g., the 2_IV^{8−4} design with the generators 5 =


1 × 3 × 4, 6 = 2 × 3 × 4, 7 = 1 × 2 × 3, and 8 = 1 × 2 × 4 gives estimators of the 8(8 − 1)/2 = 28 two-factor interactions that are aliased or confounded in seven groups of size four (obviously, 7 × 4 = 28). In general, let us assume that a valid linear regression metamodel is

y = X_1 β_1 + X_2 β_2 + e;   (10)

e.g., X_1 corresponds with the intercept and the first-order effects collected in β_1, and X_2 corresponds with the two-factor interactions β_2. Suppose that we start with a tentative simple metamodel without these interactions, so we obtain

β̂_1 = (X_1′X_1)^{−1} X_1′w.   (11)

If (10) is actually the valid metamodel, then E(w) = E(y). Combining (11) and (10) gives

E(β̂_1) = (X_1′X_1)^{−1} X_1′E(w) = (X_1′X_1)^{−1} X_1′(X_1 β_1 + X_2 β_2) = β_1 + (X_1′X_1)^{−1} X_1′X_2 β_2.   (12)
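The bias term in (12) can be checked numerically; this sketch (ours) computes the matrix (X_1′X_1)^{−1} X_1′X_2 for a small 2_III^{3−1} design with generator 3 = 1 × 2, and then verifies that the foldover design makes the bias vanish:

```python
import numpy as np

# 2_III^{3-1} design with generator 3 = 1x2 (n = 4 runs, k = 3 inputs).
d1 = np.array([-1.0, 1.0, -1.0, 1.0])
d2 = np.array([-1.0, -1.0, 1.0, 1.0])
D = np.column_stack([d1, d2, d1 * d2])

def alias_matrix(D):
    """(X1'X1)^{-1} X1'X2 of (12): X1 holds intercept + main effects,
    X2 holds the two-factor interactions (here three of them)."""
    n, k = D.shape
    X1 = np.column_stack([np.ones(n), D])
    X2 = np.column_stack([D[:, i] * D[:, j]
                          for i in range(k) for j in range(i + 1, k)])
    return np.linalg.inv(X1.T @ X1) @ X1.T @ X2

A3 = alias_matrix(D)                       # nonzero: each main effect is
                                           # aliased with one interaction
A4 = alias_matrix(np.vstack([D, -D]))      # foldover: R-IV, so X1'X2 = 0
```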

This equation includes the alias matrix A = (X_1′X_1)^{−1} X_1′X_2. Equation (12) implies an unbiased estimator of β_1 if either β_2 = 0 or X_1′X_2 = 0. Indeed, R-III designs assume that β_2 = 0 where β_2 consists of the two-factor interactions; R-IV designs ensure that X_1′X_2 = 0 (the k(k − 1)/2 columns for β_{j;j′} with j < j′ are orthogonal to the k + 1 columns for β_j and β_0).

2.3. R-V designs for two-factor interactions

By definition, a R-V design enables LS estimation of the first-order effects, the two-factor interactions, and the intercept — provided all higher-order effects are zero. So, q = 1 + k + k(k − 1)/2. The DOE literature gives tables with generators for 2_V^{k−p} designs; e.g., the 2_V^{8−2} design with the p = 2 generators 7 = 1 × 2 × 3 × 4 and 8 = 1 × 2 × 5 × 6. Unfortunately, 2_V^{k−p} designs are not saturated at all; e.g., the 2_V^{8−2} design implies n = 64 ≫ q = 37. So-called Rechtschaffner designs include saturated R-V designs, but they are not orthogonal; the statistical properties of these designs are further investigated in Qu (2007).

2.4. CCDs for second-degree polynomials

By definition, a CCD or central composite design enables LS estimation of all the effects in a second-order polynomial — provided all higher-order effects are zero. A CCD consists of the following combinations: (i) a R-V design (see Section 2.3); (ii) the central combination 0_k, which denotes a row-vector with k zeroes; (iii) the 2k axial combinations — which form a star design — where the “positive” axial combination for input j is d_j = c while all other (k − 1) inputs are fixed at the center so d_{j′} = 0 with j′ ≠ j, and the “negative” axial combination for input j is d_j = −c and d_{j′} = 0. Notice that c ≠ 1 implies a CCD with five values per simulation input, whereas c = 1 implies a CCD with only three values per input; the usual choice is c ≠ 1. The “optimal” choice of c assumes white noise, which does not hold in practice (see the next section) so we do not detail this choice. Finally, if c ≤ 1, then −1 ≤ d_{i;j} ≤ 1; else −c ≤ d_{i;j} ≤ c. A CCD does not give an orthogonal X (e.g., any two columns corresponding with the intercept or the purely quadratic effects are not orthogonal), and a CCD is rather inefficient (i.e., n ≫ q). CCDs are popular in RSM, which we shall discuss in Section 6.1 (which is a subsection of Section 6 on simulation optimization). For further discussion of CCDs and other types of designs for second-degree polynomials we refer to Khuri and Mukhopadhyay (2010), Kleijnen (2015, pp. 64–66), and Myers et al. (2009, pp. 296–317).
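The three CCD parts listed above can be sketched as follows (our illustration; for small k we use the full 2^k factorial as the R-V part):

```python
import numpy as np
from itertools import product

def ccd(k, c=1.0):
    """Central composite design: a two-level factorial part (here the full
    2^k, which is R-V for small k), the center point 0_k, and the 2k axial
    ("star") points at +/- c on each axis."""
    factorial = np.array(list(product([-1.0, 1.0], repeat=k)))
    center = np.zeros((1, k))
    axial = np.vstack([c * np.eye(k), -c * np.eye(k)])
    return np.vstack([factorial, center, axial])

# k = 2, c = 1: n = 4 + 1 + 4 = 9 > q = 6 for the second-order polynomial (2),
# and each input takes only the three values -1, 0, 1.
D = ccd(k=2, c=1.0)
```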

3. Assumptions versus simulation practice

In the preceding section we detailed the classic assumptions of linear regression metamodels and their concomitant designs; these assumptions stipulate a single type of simulation output (univariate output) and white noise. In practice, however, these assumptions usually do not hold. Indeed, a practical simulation model may give a multivariate output. Furthermore, white noise implies (i) normally distributed output; (ii) no common random numbers (CRN); (iii) homogeneous variances of the simulation output; (iv) a valid metamodel. In this section, we try to answer the following questions: (a) How realistic are the classic assumptions? (b) How can we test these assumptions? (c) If an assumption is violated, can we then transform the simulation’s I/O data such that the assumption holds for the transformed data? (d) If we cannot find such a transformation, which statistical methods can we then apply?

3.1. Multivariate output

Examples of multivariate output are inventory simulations with two outputs; namely, (i) the sum of the holding costs and the ordering costs; (ii) the service (or fill) rate. In general we assume that for r-variate simulation output (in the example, r = 2) we use r univariate linear regression metamodels that are the analogues of (1):

y^{(l)} = X^{(l)} β^{(l)} + e^{(l)} with l = 1, …, r   (13)

where y^{(l)} is the n-dimensional vector with the dependent variable corresponding with simulation output type l; n is the number of simulated input combinations; X^{(l)} = (x_{i;g}^{(l)}) is the n × q_l matrix of independent regression variables with x_{i;g}^{(l)} denoting the value of independent regression variable g in combination i for metamodel l (i = 1, …, n; g = 1, …, q_l); β^{(l)} = (β_1^{(l)}, …, β_{q_l}^{(l)})′ is the vector with the q_l regression parameters for metamodel l; e^{(l)} is the n-dimensional vector with the residuals of metamodel l, in the n combinations. In our review, we assume that all the r fitted regression metamodels are polynomials of the same order (e.g., second-order), so X^{(l)} = X and q_l = q. Note that the regression literature calls the metamodel a multiple regression model if q > 1 and the metamodel has an intercept, and calls it a multivariate regression model if r > 1. The e^{(l)} have the following two properties: (i) their variances may vary with l; e.g., the estimated inventory costs and service percentages have very different variances; (ii) the residuals e_i^{(l)} and e_i^{(l′)} are not independent, because they are (different) transformations of the same PRNs. Consequently, it might seem that we need to replace classic ordinary LS (OLS) by — rather complicated — generalized LS (GLS); see Khuri and Mukhopadhyay (2010). Surprisingly, if X^{(l)} = X, then GLS reduces to OLS computed per output; see Markiewicz and Szczepańska (2007). Consequently, the best linear unbiased estimator (BLUE) of β^{(l)} is

β̂^{(l)} = (X′X)^{−1} X′ w^{(l)}   (l = 1, …, r).   (14)

Using this equation, we can easily obtain CIs and tests for the elements in β̂^{(l)}; i.e., we may use the classic formulas presented in the preceding section. We are not aware of any general designs for multivariate output; also see Khuri and Mukhopadhyay (2010).

3.2. Nonnormal output

In simulation, the normality assumption often holds asymptotically; i.e., if the simulation run is long, then the sample average of the autocorrelated observations is nearly normal. Estimated quantiles, however, may be very nonnormal, especially in case of extreme quantiles such as the 99 percent quantile. The t-statistic


is known to be quite insensitive to nonnormality, whereas the F-statistic is not. Whether the actual simulation run is long enough to make the normality assumption hold, is always hard to know. Therefore it seems good practice to test whether the normality assumption holds, as follows. To test whether a set of simulation outputs has a Gaussian probability density function (PDF), we may use various residual plots and goodness-of-fit statistics; e.g., the chi-square statistic. A basic assumption of these statistics is that the observations are IID. We may therefore obtain “many” (say, 100) replications for a specific input combination (e.g., the base scenario) if the simulation is not computationally expensive. However, if a single simulation run takes relatively much computer time, then we can obtain only “a few” (between 2 and 10) replications so the plots are too rough and the statistical tests lack power. Actually, the white-noise assumption concerns the metamodel’s residuals e, not the simulation model’s outputs w. Given m_i ≥ 1 replications for combination i (i = 1, …, n), we obtain w̄_i = Σ_{r=1}^{m_i} w_{i;r}/m_i and the corresponding ê_i = ŷ_i − w̄_i. For simplicity of presentation, we assume that m_i is a constant so m_i = m. If w_{i;r} has a constant variance σ_w², then w̄_i also has a constant variance; namely, σ_{w̄}² = σ_w²/m. Unfortunately, even if w̄_i has a constant variance σ_{w̄}² and is independent of w̄_{i′} with i′ ≠ i (no CRN), then we can prove that

Σ_ê = [I − X(X′X)^{−1}X′] σ_{w̄}²,   (15)

so ê_i does not have a constant variance, and ê_i and ê_{i′} are not independent. Consequently, the interpretation of the popular plot with estimated residuals should account for variance heterogeneity and correlation of these residuals — which we think is not straightforward. We may apply normalizing transformations to w; e.g., log(w) may be more normally distributed than w. Unfortunately, the metamodel now explains the behavior of the transformed output — not the original output. We also refer to Bekki, Fowler, Mackulak, and Kulahci (2009) and Kleijnen (2015, p. 93). Another transformation is jackknifing, which is a general statistical method for solving the following two types of problems: (i) constructing CIs in case of nonnormal observations, or (ii) reducing possible bias of a given estimator; we shall give examples of both problem types. To explain jackknifing, we use the regression problem in (1). Suppose we want CIs for the q elements of β in case of nonnormal w. For simplicity, we assume m_i = m > 1 (i = 1, …, n). The original OLS estimator β̂ was given in (3). The jackknife then deletes replication r for each combination i, and computes

β̂_{−r} = (X′X)^{−1} X′ w̄_{−r}   (r = 1, …, m)   (16)

where the n-dimensional vector w̄_{−r} has the elements w̄_{i;−r} which denote the average of the m − 1 simulation outputs when excluding the output of replication r; obviously, β̂_{−r} = (β̂_{1;−r}, …, β̂_{q;−r})′. The m estimators β̂_{−r} in (16) are correlated because they share m − 2 elements. For ease of presentation, we focus on β_q (the last element of β). Jackknifing then uses the pseudovalue

J_r = m β̂_q − (m − 1) β̂_{q;−r}.   (17)

In this example, both the original estimator β̂_q and the jackknifed estimators β̂_{q;−r} (r = 1, …, m) are unbiased, so the pseudovalues also remain unbiased. In general, however, it can be proven that possible bias is reduced by the jackknife point estimator J̄ = Σ_{r=1}^{m} J_r/m; examples of biased estimators are ratio estimators and nonlinear estimators such as (23) below. To compute a CI, jackknifing treats the pseudovalues as if they were NIID. So if t_{m−1;1−α/2} denotes the 1 − α/2 quantile of the t_{m−1}-distribution and σ̂_J² denotes Σ_{r=1}^{m} (J_r − J̄)²/[m(m − 1)], then jackknifing gives the following two-sided symmetric (1 − α) CI for β_q:

P(J̄ − t_{m−1;1−α/2} σ̂_J < β_q < J̄ + t_{m−1;1−α/2} σ̂_J) = 1 − α.   (18)
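The jackknife of (16) through (18) can be sketched as follows (our synthetic example, not the paper's; the noise is centered exponential, hence nonnormal, and the t quantile is hard-coded for m = 10):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, m = 6, 2, 10
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])
beta_true = np.array([1.0, 0.5])
# m replications per combination; centered exponential noise is nonnormal.
w_reps = X @ beta_true + rng.exponential(0.2, size=(m, n)) - 0.2

ols = lambda wbar: np.linalg.solve(X.T @ X, X.T @ wbar)
beta_full = ols(w_reps.mean(axis=0))                  # beta_hat from (3)
pseudo = np.empty((m, q))
for r in range(m):                                    # delete replication r: (16)
    beta_minus = ols(np.delete(w_reps, r, axis=0).mean(axis=0))
    pseudo[r] = m * beta_full - (m - 1) * beta_minus  # pseudovalues (17)

J_bar = pseudo.mean(axis=0)
s_J = np.sqrt(pseudo.var(axis=0, ddof=1) / m)         # sigma_hat_J in (18)
t_crit = 2.262                                        # t_{m-1; 1-alpha/2}, m = 10, alpha = 0.05
ci = np.stack([J_bar - t_crit * s_J, J_bar + t_crit * s_J])   # CI (18), per element of beta
```

Here the CI is computed for every element of β at once, whereas the text focuses on the single element β_q.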

Applications of jackknifing in simulation are numerous; see Gordy and Juneja (2010) and Kleijnen (2015, p. 95). Another general statistical method that does not assume normality is known as distribution-free bootstrapping or nonparametric bootstrapping. This bootstrapping may be used to solve two types of problems; namely, problems caused by (i) nonnormal distributions, or (ii) nonstandard statistics. Let us reconsider (18). Now we distinguish between the original observations w and the bootstrapped observations w*; the superscript * is the usual symbol for bootstrapped observations. Standard bootstrapping assumes that the original observations are IID; indeed, w_{i;1}, …, w_{i;m} are IID because the m replications use nonoverlapping PRN streams. Distribution-free bootstrapping means that the m original IID observations are resampled with replacement such that the original sample size (namely, m) remains unchanged. This resampling is applied to each of the n combinations. The resulting w*_{i;1}, …, w*_{i;m} give w̄* (the vector with bootstrap averages), which upon substitution into (3) gives

β̂* = (X′X)^{−1} X′ w̄*.   (19)

To reduce sampling variation, we repeat this resampling (say) B times; B is known as the bootstrap sample size. A typical value for B is 100 or 1,000. A bootstrap sample of size B gives β̂*_b with b = 1, …, B. The so-called percentile method gives the following two-sided (1 − α) CI:

P(β̂*_{q;(Bα/2)} < β_q < β̂*_{q;(B[1−α/2])}) = 1 − α   (20)

where β̂*_{q;(Bα/2)} denotes the α/2 quantile of the empirical density function (EDF) of β̂*_q obtained through the order statistics denoted by the subscript (·), where — for notational simplicity — we assume that Bα/2 is integer; β̂*_{q;(B[1−α/2])} is defined analogously. We point out that this CI is not symmetric. Another example is given in Turner, Balestrini-Robinson, and Mavris (2013); namely, a CI for s_w² (the sample variance of w) if w does not have a Gaussian distribution. We shall mention more examples; namely, CIs for a quantile below (27), for R² below (29), and for cross-validation statistics below (34).

3.3. Heterogeneous variances of simulation outputs

In practice, Var(w_i) changes as the input combination z_i changes. In some applications, however, we may hope that this variance heterogeneity is negligible. Unfortunately, Var(w_i) is unknown so we must estimate it. The classic estimator based on m_i replications is

s²_i = Σ_{r=1}^{m_i} (w_{i;r} − w̄_i)² / (m_i − 1)    (i = 1, ..., n).    (21)

This s²_i itself has high variance; i.e., if w_{i;r} is normally distributed with Var(w_{i;r}) = σ²_i, then Var(s²_i) = 2σ⁴_i/m_i. In practice, we need to compare n estimators s²_i, and we may apply many tests; see Kleijnen (2015, p. 101). The logarithmic transformation may be used not only to obtain Gaussian simulation output but also to obtain simulation outputs with constant variances. (Actually, the logarithmic transformation is a special case of the Box-Cox power transformation; also see Kleijnen (2015, p. 93).) We, however, prefer to accept variance heterogeneity and adapt our analysis, as follows. If the metamodel is valid so E(e) = 0, then the OLS estimator β̂ is still unbiased. However, if Var(w_i) is not constant but the number of replications equals m_i = m, then (4) is replaced by

Σ_β̂ = (X′X)⁻¹ X′ Σ_w̄ X (X′X)⁻¹    (22)

where Σ_w̄ is the n × n diagonal matrix with the first element on its main diagonal equal to σ²_1/m, ..., and the last element on this diagonal equal to σ²_n/m. However, because Var(w_i) is not constant, we may switch from OLS to GLS, as follows. In this case, GLS reduces to weighted LS (WLS), which gives the WLS estimator (say) β̃. However, Var(w_i) must be estimated. Plugging the estimators s²_i (defined in (21)) into WLS results in estimated WLS (EWLS), which gives the nonlinear estimator β̃̂; obviously, β̃̂ is nonnormally distributed and may be biased, so it is difficult to derive an exact CI. In Section 3.2 we have already discussed a simple solution for such a type of problem; namely, jackknifing. In jackknifed EWLS (JEWLS) with m_i = m and without CRN, we proceed analogously to (16):

β̃̂_{−r} = (X′ Σ̂⁻¹_{w̄;−r} X)⁻¹ X′ Σ̂⁻¹_{w̄;−r} w̄_{−r}    (r = 1, ..., m)    (23)

where w̄_{−r} is the vector with the n averages of the m − 1 replications after deleting replication r, and Σ̂_{w̄;−r} is the diagonal matrix with s²_i computed from the same m − 1 replications. Using β̃̂ and β̃̂_{−r}, we compute the pseudovalues that give the desired CI.

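The jackknife recipe behind (16) and (23) can be sketched as follows in Python; the function name and the use of NumPy are our own choices, and the critical value t_{m−1;1−α/2} must be supplied by the user:

```python
import numpy as np

def jackknife(estimator, w, t_crit):
    """Jackknife a statistic of m IID replications (rows of w).

    estimator: maps an array of replications to a scalar estimate,
               e.g., an EWLS coefficient computed from averages.
    t_crit:    the quantile t_{m-1;1-alpha/2}, supplied by the user.
    Returns the pseudovalue-based point estimate and its CI.
    """
    m = w.shape[0]
    theta = estimator(w)                                  # all m replications
    theta_del = np.array([estimator(np.delete(w, r, axis=0))
                          for r in range(m)])             # delete-one estimates
    pseudo = m * theta - (m - 1) * theta_del              # pseudovalues
    est = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(m)                  # standard error
    return est, (est - t_crit * se, est + t_crit * se)
```

For the sample mean, the pseudovalues reduce to the original observations, so the jackknife CI coincides with the classic t-interval; for nonlinear estimators such as EWLS, the pseudovalues also deliver a (partial) bias correction.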
The DOE literature ignores designs for heterogeneous output variances. We propose classic designs with m_i such that the resulting V̂ar(w̄_i) = s²_i/m_i with i = 1, ..., n are approximately constant. First we take a pilot sample of size m_0 ≥ 2 for each combination, which gives s²_i(m_0). Next we select a number of additional replications m̂_i − m_0 with

m̂_i = m_0 × nint[ s²_i(m_0) / min_i s²_i(m_0) ]    (24)

where nint[x] denotes the integer closest to x. We combine the m̂_i replications of the two stages to compute w̄_i and s²_i. From w̄_i we compute β̂. We estimate Σ_β̂ through (22) with Σ_w̄ estimated by a diagonal matrix with elements s²_i(m̂_i)/m̂_i. We compute CIs for β̂_j using t_f with f = m_0 − 1.

Actually, (24) guides the relative numbers of replications m̂_i/m̂_i′. To select absolute numbers m̂, we recommend the following rule derived in Law (2015, p. 505) with a relative estimation error r_ee:
m̂ = min{ r ≥ m : [t_{r−1;1−α/2} √(s²(m)/r)] / |w̄(m)| ≤ r_ee/(1 + r_ee) }.    (25)
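The two-stage allocation (24) and the relative-precision rule (25) are straightforward to code. The following Python sketch is our own (the function names are not from the article); the t quantile in (25) is passed in by the user:

```python
import numpy as np

def second_stage_sizes(s2_pilot, m0):
    """Rule (24): m_i = m0 * nint[s_i^2(m0) / min_i s_i^2(m0)].

    s2_pilot: pilot-sample variances s_i^2(m0) for the n combinations.
    Returns the total sizes m_i; the additional replications are m_i - m0.
    """
    return m0 * np.rint(np.asarray(s2_pilot) / np.min(s2_pilot)).astype(int)

def precision_reached(w_bar, s2, r, t_crit, ree):
    """Stopping rule (25): relative half-width at most ree / (1 + ree)."""
    return t_crit * np.sqrt(s2 / r) / abs(w_bar) <= ree / (1.0 + ree)
```

With a pilot of m0 = 5 and pilot variances (1, 2, 4), rule (24) gives total sizes (5, 10, 20), so (0, 5, 15) additional replications.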

We shall return to the selection of m_i in Section 5.

3.4. Common random numbers (CRN)

CRN are the default in software for discrete-event simulation. CRN are meant to compare the simulation outputs of different simulation input combinations while all other "circumstances" are the same; e.g., waiting times are compared for n servers with different service rates (and different costs, so optimization is required) while the random customer arrivals are the same. If the number of replications is a constant, so m_i = m, then we may arrange the simulation outputs w_{i;r} with i = 1, ..., n and r = 1, ..., m into a matrix W = (w_{i;r}) = (w_1, ..., w_m) with w_r = (w_{1;r}, ..., w_{n;r})′. Obviously, CRN create correlation between w_{i;r} and w_{i′;r}. Moreover, two different replications use nonoverlapping PRN streams, so w_{i;r} and w_{i;r′} with r ≠ r′ are independent; i.e., w_r and w_{r′} are independent. The final goal of CRN is to reduce Var(β̂_g) and Var(ŷ); actually, CRN increase the variance of the estimated intercept.

Note: Effective implementation of CRN requires some care; e.g., we may use two separate nonoverlapping PRN streams for the arrival process and the service process in a single-server simulation to ensure so-called "synchronization", so that (say) stream 1 is used only to generate interarrival times, whereas stream 2 is used only to generate service times (with n different rates). Fortunately, software such as Arena simplifies the implementation of separate PRN streams; i.e., each stream has its own "seed". For details we refer to Law (2015, pp. 592–604).

Note: Bootstrapping in case of CRN and nonconstant m_i is discussed in Kleijnen, Beers, and Nieuwenhuyse (2010). This case arises if m_i is selected such that w̄_i (the average simulation output of combination i) has the same — absolute or relative — width of the CI around w̄_i; see again (25).

If we continue to use OLS in case of CRN, then Σ_β̂ is given by (22), but now Σ_w̄ is not a diagonal matrix. The estimator Σ̂_w̄ is singular if m ≤ n; else we may compute CIs for β̂_j from t_{m−1}. An alternative method requires only m > 1, and uses replication r to estimate β through

β̂_r = (X′X)⁻¹ X′ w_r    (r = 1, ..., m).    (26)

Obviously, the n elements of w_r are correlated because of CRN, and these elements may have different variances. The m estimators β̂_{g;r} (g = 1, ..., q; r = 1, ..., m) are independent and have a common standard deviation (say) σ(β̂_g). These β̂_{g;r} give β̄_g = Σ_{r=1}^m β̂_{g;r}/m and s²(β̄_g) = Σ_{r=1}^m (β̂_{g;r} − β̄_g)²/[m(m − 1)], which gives

t_{m−1} = (β̄_g − β_g)/s(β̄_g)    with g = 1, ..., q.    (27)
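The per-replication approach (26)–(27) can be sketched in Python as follows (the names are ours, and the t quantile is again user-supplied):

```python
import numpy as np

def crn_ols(X, W):
    """Per-replication OLS under CRN; cf. (26)-(27).

    X: (n, q) regression matrix; W: (n, m) outputs, where column r
       holds the n correlated outputs of replication r.
    Returns beta_bar (q,) and the standard errors s(beta_bar_g).
    """
    m = W.shape[1]
    B = np.linalg.lstsq(X, W, rcond=None)[0]   # (q, m): column r is (26)
    beta_bar = B.mean(axis=1)
    se = B.std(axis=1, ddof=1) / np.sqrt(m)    # s(beta_bar_g) in (27)
    return beta_bar, se
```

A CI per coefficient is then beta_bar[g] ± t_{m−1;1−α/2} · se[g], exactly as (27) prescribes.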

Unfortunately, we cannot apply this alternative when estimating a quantile instead of a mean. In case of a quantile, we recommend distribution-free bootstrapping; see Kleijnen (2015, pp. 99, 110) and Kleijnen, Pierreval, and Zhang (2011).

3.5. Validation of metamodels

In practice, we do not know whether E(ŷ_i) = E(w_i) so E(e) = 0. For example, given a "small" experimental area, the estimated first-order polynomial may be adequate for estimating the gradient when searching for the optimum; see Section 6. We now discuss the following validation methods: (i) two related coefficients of determination; namely, R² and R²_adj; (ii) cross-validation. These methods may also be used to compare first-order against second-order polynomials, or linear regression against Kriging metamodels. R² may be defined as

R² = Σ_{i=1}^n (ŷ_i − w̄)² / Σ_{i=1}^n (w̄_i − w̄)² = 1 − Σ_{i=1}^n (ŷ_i − w̄_i)² / Σ_{i=1}^n (w̄_i − w̄)²    (28)

where w̄ = Σ_{i=1}^n w̄_i/n and m_i ≥ 1. If n = q (saturated design), then R² = 1 — even if E(ŷ_i) ≠ E(w_i). If n > q and q increases, then R² increases — whatever the size of |E(ŷ_i) − E(w_i)| is. Because of possible overfitting, the regression literature adjusts R²:

R²_adj = 1 − [(n − 1)/(n − q)] (1 − R²).    (29)
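In Python, (28) and (29) are one-liners; the sketch below is our own helper, taking the n averaged outputs, the regression predictions, and the number of regression parameters q:

```python
import numpy as np

def coefficients_of_determination(w_bar, y_hat, q):
    """R^2 as in (28) and the adjusted R^2 as in (29)."""
    w_bar, y_hat = np.asarray(w_bar), np.asarray(y_hat)
    n = w_bar.size
    r2 = 1.0 - np.sum((y_hat - w_bar) ** 2) / np.sum((w_bar - w_bar.mean()) ** 2)
    r2_adj = 1.0 - (n - 1) / (n - q) * (1.0 - r2)
    return r2, r2_adj
```

A perfect fit gives R² = R²_adj = 1; a useless fit (predicting the overall mean) gives R² = 0 and a negative R²_adj, illustrating the overfitting penalty in (29).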

Critical values for R² or R²_adj are unknown, because these statistics do not have well-known distributions. We might use subjective lower thresholds. However, Kleijnen and Deflandre (2006) demonstrates how to estimate the distributions of these two statistics through distribution-free bootstrapping.

Leave-one-out cross-validation may be defined as follows. For ease of presentation, we assume that X has only n (instead of Σ_{i=1}^n m_i) rows, because m_i = m ≥ 1, so in (3) we may replace w by w̄. Now we delete I/O combination i to obtain (X_{−i}, w̄_{−i}), which we use to compute

β̂_{−i} = (X′_{−i} X_{−i})⁻¹ X′_{−i} w̄_{−i}    (i = 1, ..., n).    (30)

This gives ŷ_{−i} = x′_i β̂_{−i}. We may "eyeball" the scatterplot with (w̄_i, ŷ_{−i}), and decide whether the metamodel is valid. Alternatively, we may compute the Studentized prediction error

t^{(i)}_{m−1} = (w̄_i − ŷ_{−i}) / √[s²(w̄_i) + s²(ŷ_{−i})]    (31)

where s²(w̄_i) = s²_i/m and

s²(ŷ_{−i}) = x′_i Σ̂_{β̂_{−i}} x_i    with    Σ̂_{β̂_{−i}} = s²(w̄_i)(X′_{−i} X_{−i})⁻¹.    (32)

We reject the regression metamodel if

max_i |t^{(i)}_{m−1}| > t_{m−1;1−[α/(2n)]}    (33)

where the well-known Bonferroni inequality implies that α/2 is replaced by α/(2n) — resulting in the experimentwise or familywise type-I error rate α.

Cross-validation affects not only ŷ_{−i}, but also β̂_{−i}; see again (30). We may be interested not only in the predictive performance of the metamodel, but also in its explanatory performance; i.e., do the n estimates β̂_{−i} (i = 1, ..., n) remain stable?

Related to cross-validation are several diagnostic statistics in the regression literature, such as DFFITS, DFBETAS, and Cook's D; the most popular diagnostic statistic, however, is the prediction sum of squares (PRESS):

PRESS = Σ_{i=1}^n (ŷ_{−i} − w̄_i)² / n.    (34)
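Leave-one-out cross-validation per (30) and PRESS per (34) can be sketched as follows (Python; the brute-force refitting is deliberate — regression software uses a shortcut, as noted next):

```python
import numpy as np

def loo_press(X, w_bar):
    """Leave-one-out predictions via (30) and PRESS via (34).

    X: (n, q) regression matrix; w_bar: (n,) averaged outputs.
    Refits OLS n times, deleting combination i each time.
    """
    n = X.shape[0]
    y_loo = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], w_bar[keep], rcond=None)[0]  # (30)
        y_loo[i] = X[i] @ beta_i
    return y_loo, np.mean((y_loo - w_bar) ** 2)  # PRESS as defined in (34)
```

If the metamodel is exactly right (e.g., outputs on a straight line for a first-order polynomial), then every deleted combination is predicted perfectly and PRESS is (numerically) zero.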

Regression software uses a shortcut to avoid the n recomputations in cross-validation. We may apply bootstrapping to estimate the distribution of these validation statistics; see Bischl, Mersmann, Trautmann, and Weihs (2012). If the validation suggests a big fitting error e, then we may consider various transformations. For example, we may replace y and x_j by log(y) and log(x_j) with j = 1, ..., k, so that the first-order polynomial approximates relative changes through k elasticity coefficients. If we assume that f_sim is monotonic, then we may replace w and x_j by their ranks, which is called rank regression. In the preceding subsections, we also considered transformations that make w better satisfy the assumptions of normality and variance homogeneity; unfortunately, different goals of a transformation may conflict with each other. We may also use different regression metamodels for different subsets of the simulation I/O data, as Santos and Santos (2016) explains.

In Section 2 we discussed designs for low-order polynomials. If such a design does not give a valid metamodel, then we do not recommend routinely adding higher-order terms: these terms are hard to interpret. However, if the goal is not to better understand the simulation model but to better predict its output, then we may add higher-order terms; e.g., a 2^k design enables the estimation of the interactions between two inputs, but also among three or more inputs. In the discussion of (29), we have already mentioned the danger of overfitting. A detailed example of fitting a polynomial in simulation optimization is presented in Barton (1997, Fig. 4). Note that adding more explanatory variables is called stepwise regression; eliminating nonsignificant variables (see (27)) is called backwards elimination, which we briefly discussed below (8).

4. Factor screening: sequential bifurcation (SB)

Screening means searching for the really important simulation inputs among the many inputs that can be varied in a simulation experiment. It is realistic to assume sparsity; i.e., only a few inputs among these many inputs are really important. Indeed, the Pareto principle or 20–80 rule states that only a few inputs — say, 20 percent — are really important. Kleijnen (2015, p. 136) presents two


examples, with 281 and 92 inputs, respectively; screening finds only 15 and 11 inputs to be really important. The literature presents several types of screening designs. We focus on designs that treat the simulation model as a black box; i.e., only the I/O of the simulation model is observed. Kleijnen (2015, pp. 137–139) summarizes four types of screening designs; these types use different mathematical assumptions. We focus on sequential bifurcation (SB), because SB is very efficient and effective if its assumptions are satisfied. SB selects the next input combination after analyzing the preceding I/O data. SB is customized; i.e., SB accounts for the specific simulation model.

Note: Borgonovo and Plischke (2016) also summarizes SB, besides Morris's method. The latter method is also discussed in Fédou and Rendas (2015). These two types of screening designs and more types are discussed in Woods and Lewis (2015), including 117 references.

4.1. Deterministic simulations and first-order polynomial metamodels

In this section we explain the basic idea of SB, assuming deterministic simulation and a first-order polynomial metamodel with approximation error e where E(e) = 0, so

y = γ0 + γ1 z1 + · · · + γk zk + e

(35)

where γ_j denotes the first-order effect of the original (nonstandardized) simulation input z_j (j = 1, ..., k). Furthermore, we assume that the signs of the γ_j are known, so that we can define the low and high bounds L_j and H_j of z_j such that γ_j ≥ 0. For example, if z_1 denotes the traffic rate in a waiting-time simulation and this rate may vary between 0.5 and 0.8, then L_1 = 0.5 and H_1 = 0.8, so we assume that γ_1 > 0 if the simulation output is the steady-state mean waiting time (because waiting increases as traffic increases); however, if z_1 denotes the dosage of an anti-fever medicine and this dosage may vary between 1 and 4 tablets (and we assume that more pills lower the fever), then we select L_1 = 4 and H_1 = 1 (so L_1 > H_1), so we assume that γ_1 > 0 if the simulation output is the body temperature. Together, (35) and γ_j ≥ 0 imply that we may rank the inputs such that the most important input has the largest first-order effect; the least important inputs have effects close to zero. Input j is called important if γ_j > c_γ, where c_γ ≥ 0 is specified by the users. In its first step, SB aggregates all k inputs into a single group, and checks whether or not that group has an important effect. So in this step, SB obtains the simulation output w(z = L) with z = (z_1, ..., z_k)′ and L = (L_1, ..., L_k)′; SB also obtains w(z = H) where H = (H_1, ..., H_k)′. Obviously, if all inputs have zero effects, then w(z = L) = w(z = H). However, if one or more inputs have positive effects, then w(z = L) < w(z = H). In practice, not all k inputs have zero effects. However, it may happen that all effects are unimportant (0 ≤ γ_j < c_γ), but w(z = H) − w(z = L) > c_γ; SB will discover this "false importance" in its next steps, as we shall see next. If SB concludes that the group has an important effect, then the next step splits the group into two subgroups: so-called bifurcation. Let k_1 denote the size of subgroup 1 and k_2 the size of subgroup 2 (so k_1 + k_2 = k).
SB obtains w_{k_1}, which denotes w when all k_1 inputs in subgroup 1 are "high" — so z_1 = H_1, ..., z_{k_1} = H_{k_1} — and the remaining inputs are "low" — so z_{k_1+1} = L_{k_1+1}, ..., z_k = L_k. SB compares this w_{k_1} with w_0 = w(z = L); if w_{k_1} − w_0 < c_γ, then none of the individual inputs within subgroup 1 is important, and SB eliminates this subgroup from further experimentation. SB also compares w_{k_1} with w_k = w(z = H); the result w_k − w_{k_1} < c_γ is impossible if at least one important input is a member of subgroup 2. SB continues splitting important subgroups into smaller subgroups, and eliminating unimportant subgroups. It may happen that — in a specific step — SB finds both subgroups to be important, which leads to further experimentation with two important


subgroups in parallel. Finally, SB identifies and estimates all individual inputs that are not in eliminated (unimportant) subgroups. Obviously, assuming γ_j ≥ 0 ensures that first-order effects do not cancel each other within a group. Furthermore, we can define the inputs such that if f_sim is monotonically decreasing in z_j, then this function becomes monotonically increasing in the standardized input d_j. Experience shows that in practice the users often do know the signs of γ_j; e.g., some inputs in an SB case study are transportation speeds, so higher speeds decrease the cost, which is the output of interest. Nevertheless, if in a specific case study it is hard to specify the signs of a few specific inputs, then we should not group these inputs with the other inputs (with known signs). We should treat these inputs individually, and investigate them not through SB but through one of the classic designs discussed in Section 2. Treating such inputs individually is safer than assuming a negligible probability of cancellation within a subgroup. The efficiency of SB — measured by the number of simulated input combinations — improves if the individual inputs are labeled such that they are placed in increasing order of importance. Such labeling implies that the important inputs are clustered; i.e., these inputs are labeled such that they have consecutive numbers and are members of the same subgroup. The efficiency further improves when placing "similar" inputs within the same subgroup; e.g., place all "transportation" inputs in the same subgroup. Anyhow, splitting a group into subgroups of equal size is not necessarily optimal. Academic and practical examples of SB are given in Kleijnen (2015, pp. 136–172).

4.2. Random simulations and second-order polynomials

Now we assume a second-order polynomial plus approximation error e with E(e) = 0.
Moreover, we assume that if γ_j (the first-order effect of original simulation input j) is zero, then γ_{j;j} (purely quadratic effect) and γ_{j;j′} with j < j′ (interaction) are zero; this is the heredity assumption in Wu and Hamada (2009). In Section 2.2 we discussed the foldover principle for constructing R-IV designs from R-III designs; likewise, SB enables the estimation of γ_j unbiased by γ_{j;j} and γ_{j;j′} if SB simulates the mirror input of the original input in its sequential design. Like we did in (7), we standardize the original inputs z such that the standardized inputs x are either −1 or 1 in the experiment. We let w_{−j} denote w when the first j standardized inputs are −1 and the remaining inputs are 1 (so z_1 = L_1, ..., z_j = L_j, z_{j+1} = H_{j+1}, ..., z_k = H_k). Kleijnen (2015, p. 150) shows that an unbiased estimator of β_{j−j′} = Σ_{h=j}^{j′} β_h (where the subscript j−j′ uses the endash to denote "through") is

β̂_{j−j′} = [(w_{j′} − w_{−j′}) − (w_{j−1} − w_{−(j−1)})] / 4.    (36)
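Estimator (36) is a single arithmetic expression. In the following Python sketch the notation is ours: w_high[j] is w with inputs 1 through j at their high levels and the rest low, and w_mirror[j] is its mirror observation; the example also illustrates that the group effect equals the sum of the individual first-order effects.

```python
def sb_group_effect(w_high, w_mirror, j, j2):
    """Estimator (36) of beta_{j-j'} = sum of beta_h for h = j, ..., j'.

    w_high[j]:   output with inputs 1..j at their high levels, rest low
                 (w_high[0] is w(z = L)).
    w_mirror[j]: the mirror observation, inputs 1..j low, rest high.
    """
    return ((w_high[j2] - w_mirror[j2])
            - (w_high[j - 1] - w_mirror[j - 1])) / 4.0
```

For a noise-free first-order polynomial with β_0 = 10 and β = (1, 2, 3), the outputs are w_high = (4, 6, 10, 16) and w_mirror = (16, 14, 10, 4); the estimator then returns 6 for the whole group (inputs 1 through 3) and 5 for the subgroup {2, 3}.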

Let us first assume a fixed m (number of replications per simulated input combination). Let w_{j;r} with r = 1, ..., m denote observation r on w_j. We then obtain the analogue of (36): β̂_{(j−j′);r} = [(w_{j′;r} − w_{−j′;r}) − (w_{j−1;r} − w_{−(j−1);r})]/4. These m observations β̂_{(j−j′);r} give the classic estimators of the mean and the variance of β̂_{j−j′}, analogously to (21). These estimated mean and variance give a t_{m−1}-test, analogously to (27). In SB we apply a one-sided test, because SB assumes γ_j > 0 so β_j > 0.

Now we assume a variable random m. Whereas the preceding t-test assumes a favorite null-hypothesis, the sequential probability ratio test (SPRT) in Wan, Ankenman, and Nelson (2010) considers two comparable hypotheses, and each step selects m such that the SPRT controls the type-I error probability through the whole procedure and holds the type-II error probability at each step. This SPRT assumes that the users select values for the "importance" thresholds c_{γU} and c_{γI}, so SB classifies inputs with β_j ≤

c_{γU} as unimportant, and inputs with β_j ≥ c_{γI} as important; for intermediate inputs — with c_{γU} < β_j < c_{γI} — the power should be "reasonable". In each step, the initial number of replications when estimating β_{j−j′} is a fixed m_{0;j−j′}. We expect that m_{0;j−j′} may be smaller in the early steps, because those steps estimate the sum of the positive first-order effects of bigger groups, so the signal-noise ratio is larger. Statistical details of the SPRT and a Monte Carlo experiment are given in Kleijnen (2015, pp. 154–159) and Shi, Kleijnen, and Liu (2014).

4.3. Multiresponse SB

In practice, simulation models have multiple response types, as we have already discussed in Section 3.1. Shi et al. (2014) extends SB to multiresponse SB (MSB). This MSB selects groups of inputs such that within a group all inputs have the same sign for a specific type of output, so no cancellation of first-order effects occurs. More precisely, we still denote the number of simulation outputs by r. We now define L_j^{(l)} and H_j^{(l)} such that changing the value of the original simulation input j from L_j^{(l)} to H_j^{(l)} increases output l (l = 1, ..., r); this change, however, may decrease another output l′ ≠ l. For example, if there are k inputs and r = 2 outputs, then inputs 1 through k_1 may have the same + signs for both outputs, whereas inputs k_1 + 1 through k have opposite signs for the two outputs; so, changing inputs k_1 + 1 through k from L_j^{(1)} to H_j^{(1)} increases output 1 but decreases output 2. MSB also applies the SPRT summarized in Section 4.2, but now with thresholds c_{γU}^{(l)} and c_{γI}^{(l)}. Kleijnen (2015, pp. 159–172) and Shi et al. (2014) give more details, including extensive Monte Carlo experiments, a case study concerning a logistic system in China, and the validation of the assumptions in MSB.

5. Kriging metamodels and their designs

Kriging metamodels are fitted to simulation I/O data obtained for larger or global experimental areas than the local areas in low-order polynomial metamodels.

5.1. Ordinary Kriging (OK) in deterministic simulation

There are several types of Kriging, but in this subsection we focus on OK. OK is popular and successful in practical deterministic simulation; see Chen, Tsui, Barton, and Meckesheimer (2006). OK assumes the following metamodel:

y (x ) = μ + M (x )

(37)

where μ is the constant mean E[y(x)] and M(x) is a Gaussian stationary process with zero mean. By definition, such a process has covariances that depend only on the distance between the input combinations x and x′. Note that M(x) is called the extrinsic noise, to distinguish it from the intrinsic noise in stochastic simulation, as we shall see in Section 5.3. Let X denote the n × k matrix with the n combinations x_i (i = 1, ..., n). Kriging software standardizes the original simulation inputs z_i to obtain x_i (this software also standardizes the simulation output w); many publications on Kriging software and websites for such software are referenced in Kleijnen (2015, p. 190). OK uses the best linear unbiased predictor (BLUP) criterion. This criterion implies that the predictor ŷ(x_0) for the new combination x_0 is a linear function of the n old outputs w at X:
ŷ(x_0) = Σ_{i=1}^n λ_i w_i = λ′w.    (38)

Furthermore, this criterion requires an “unbiased” predictor, which implies that if x0 = xi , then the predictor is an exact interpolator:  y(xi ) = w(xi ). Finally, the criterion implies that the “best” predictor


minimizes the mean squared error (MSE); because the predictor is unbiased, the MSE equals the variance Var[ŷ(x_0)]. Altogether, the optimal weight vector is

λ′_o = [σ_M(x_0) + 1 (1 − 1′Σ_M⁻¹σ_M(x_0)) / (1′Σ_M⁻¹1)]′ Σ_M⁻¹    (39)

where Σ_M = (cov(y_i, y_{i′})) denotes the n × n matrix with the covariances between the metamodel's "old" outputs, and σ_M(x_0) = (cov(y_i, y_0)) denotes the n-dimensional vector with the covariances between the metamodel's n old outputs y_i and the new output y_0. Note that the weight λ_i decreases with the distance between x_0 and x_i, so λ is not a constant vector (whereas β in linear regression is constant). Substitution of λ_o into (38) gives

ŷ(x_0) = μ + σ_M(x_0)′ Σ_M⁻¹ (w − μ1)    (40)

where 1 denotes an n-dimensional vector with all elements equal to 1. Obviously, ŷ(x_0) in (40) varies with σ_M(x_0), whereas μ, Σ_M, and w remain fixed. The gradient ∇(ŷ) follows from (40); see Lophaven, Nielsen, and Sondergaard (2002, Eq. 2.18). We should not confuse ∇(ŷ) and ∇(w); sometimes we can indeed estimate ∇(w), and use ∇̂(w) to estimate a better OK model; see Qu and Fu (2014) and Ulaganathan, Couckuyt, Dhaene, and Laermans (2014). If τ² denotes Var(y), then

MSE[ŷ(x_0)] = τ² − σ_M(x_0)′ Σ_M⁻¹ σ_M(x_0) + [1 − 1′Σ_M⁻¹σ_M(x_0)]² / (1′Σ_M⁻¹1).    (41)

Obviously, Var[ŷ(x_0)] = 0 if x_0 = x_i. So, Var[ŷ(x_0)] has many local minima. Experimental results suggest that Var[ŷ(x_0)] has local maxima at x_0 approximately halfway between old input combinations (we shall use this property in Section 6.2). Kriging gives bad extrapolations compared with interpolations (linear regression also gives minimal Var[ŷ(x_0)] when x_0 = 0). Obviously, (39) shows that λ_o is a function of Σ_M and σ_M(x_0) — or, switching from covariances to correlations, of R = τ⁻²Σ_M and ρ(x_0) = τ⁻²σ_M(x_0). There are several types of correlation functions; see Rasmussen and Williams (2006, pp. 80–104). Most popular is the Gaussian correlation function:

ρ(h) = Π_{j=1}^k exp(−θ_j h_j²) = exp(−Σ_{j=1}^k θ_j h_j²)    (42)
with distance vector h = (h_j) where h_j = |x_{g;j} − x_{g′;j}| and g, g′ = 0, 1, ..., n. Note that ρ(h) implies that λ_o has relatively high elements (large weights) for x_i close to x_0. Furthermore, the standardization of the original simulation inputs affects the values in h. When estimating the Kriging parameters ψ = (μ, τ², θ′)′ with θ = (θ_j), the most popular criterion is maximum likelihood (ML) — but LS and cross-validation are also used. The estimation of ψ is a mathematical challenge; different values may result from different software packages or from initializing the same package with different starting values. Anyhow, we denote the resulting estimator by ψ̂. Plugging ψ̂ into (40), we obtain

ŷ(x_0, ψ̂) = μ̂ + σ̂_M(x_0)′ Σ̂_M⁻¹ (w − μ̂1).    (43)
Obviously, ŷ(x_0, ψ̂) is a nonlinear predictor. In practice, we simply plug ψ̂ into (41) to obtain MSE[ŷ(x_0, ψ̂)]; moreover, we ignore a possible bias of ŷ(x_0), so s²{ŷ(x_0)} = MSE[ŷ(x_0, ψ̂)]. To compute a CI, we use ŷ(x_0, ψ̂), s²{ŷ(x_0)}, and z_{α/2} (where z_{α/2} is the classic symbol for the α/2 quantile of the standard normal distribution):

P[w(x_0) ∈ ŷ(x_0, ψ̂) ± z_{α/2} s{ŷ(x_0)}] = 1 − α.    (44)
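For concreteness, here is a compact Python sketch of the OK formulas (39)–(44) with the Gaussian correlation function (42). The function names are ours; moreover, the parameters θ and τ² are treated as given, whereas real Kriging software estimates them (the plug-in step of (43)):

```python
import numpy as np

def gauss_corr(A, B, theta):
    """Gaussian correlation (42) between the rows of A and the rows of B."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2
    return np.exp(-(d2 * theta).sum(axis=2))

def ok_predict(X, w, x0, theta, tau2, z=1.96):
    """OK predictor (40), plug-in MSE (41), and the CI (44).

    X: (n, k) old input combinations; w: (n,) old outputs;
    x0: (k,) new combination; z: z_{alpha/2}, here alpha = 0.05.
    """
    n = X.shape[0]
    Sigma = tau2 * gauss_corr(X, X, theta)                    # Sigma_M
    sigma0 = tau2 * gauss_corr(X, x0[None, :], theta)[:, 0]   # sigma_M(x0)
    one = np.ones(n)
    Si_w = np.linalg.solve(Sigma, w)
    Si_1 = np.linalg.solve(Sigma, one)
    Si_s0 = np.linalg.solve(Sigma, sigma0)
    mu = one @ Si_w / (one @ Si_1)                  # GLS estimate of mu
    y0 = mu + sigma0 @ Si_w - mu * (sigma0 @ Si_1)  # predictor (40)
    mse = tau2 - sigma0 @ Si_s0 + (1.0 - one @ Si_s0) ** 2 / (one @ Si_1)  # (41)
    half = z * np.sqrt(max(mse, 0.0))
    return y0, mse, (y0 - half, y0 + half)          # CI as in (44)
```

At an old point the predictor interpolates exactly and the plug-in MSE is zero; between old points the MSE is positive, illustrating the local maxima mentioned below (41).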


There is much software for Kriging; see the many publications and websites in Kleijnen (2015, p. 190).

Note: There are many publications that interpret Kriging in a Bayesian way; see Yuan and Ng (2015). However, we find it hard to come up with a prior distribution for ψ, because we have little intuition about θ. For a discussion of the Bayesian view versus the classical frequentist view we also refer to Efron (2015).

Note: To estimate the true MSE of ŷ(x_0, ψ̂) in (43), we may apply parametric bootstrapping. We may also apply a bootstrap variant called conditional simulation. These two methods are rather complicated, and yet they give CIs with coverages and lengths that are not superior compared with the CI specified in (44); see Kleijnen (2015, pp. 191–197).

Note: Universal Kriging (UK) replaces μ in (37) by f(x)′β, where f(x) is a q-dimensional vector of known functions of x and β is the corresponding q-dimensional vector of unknown parameters; an example of f(x)′β is the second-order polynomial in (2). The disadvantage of UK is that UK requires the estimation of q − 1 additional parameters, besides β_0 = μ. We conjecture that the estimation of these extra parameters explains why UK often has a higher MSE. In practice, most Kriging models use OK instead of UK.

5.2. Designs for deterministic simulation

There is an abundant literature on various design types for Kriging in deterministic simulation, such as orthogonal array, uniform, maximum entropy, minimax, maximin, integrated mean squared prediction error, and "optimal" designs; see the many references in Kleijnen (2015, p. 198). However, the most popular design uses Latin hypercube sampling (LHS); first we shall explain the basic idea of LHS, then its mathematical procedure. LHS assumes that an adequate metamodel is more complicated than a low-order polynomial; LHS does not assume a specific type of metamodel, but focuses on the input space formed by the k standardized simulation inputs.
The result is the n × k matrix X. LHS does not imply a strict mathematical relationship between n and k, whereas DOE may use n = 2^{k−p}, so n increases with k. Nevertheless, if LHS keeps n "small" while k is "large", then "space filling" LHS covers the input space so sparsely that the fitted Kriging metamodel may be inadequate. Therefore a well-known rule-of-thumb for LHS in Kriging is n = 10k; see Loeppky, Sacks, and Welch (2009). Mathematically, LHS proceeds as follows. LHS divides the range of each input into n mutually exclusive and exhaustive intervals of equal probability. LHS gives a noncollapsing design; i.e., if an input turns out to be unimportant, then each remaining individual input is still sampled with one observation per interval (the estimation of the correlation function may benefit from this noncollapsing property). Unfortunately, projections of a LHS combination onto more than one dimension may give "bad" designs. Therefore standard LHS is further refined, leading to so-called maximin LHS and nearly-orthogonal LHS.

Now we switch from LHS — which gives fixed-sample or one-shot designs — to sequential designs that account for f_sim, so the sequential designs are application-driven or customized. In general, sequential procedures require fewer observations than fixed-sample procedures do; in sequential procedures we learn about the behavior of the underlying system as we experiment with this system and collect data (also see Section 4 on SB). In Kriging, however, we need extra computer time if we re-estimate ψ when new I/O data become available. Kleijnen (2015, pp. 203–206) details sequential designs for Kriging, to perform either SA (so the whole experimental area is of interest) or optimization (so only the global optimum is interesting). In a sequential design we may start with a pilot experiment with n_0 combinations of the k inputs selected through LHS, and obtain


the corresponding simulation I/O data. Next we fit a Kriging model to these data. Then we may consider — but not yet simulate — X_cand, which denotes a larger matrix with candidate combinations selected through LHS, and find the "winning" candidate. In SA this winner has the highest s²{ŷ(x)} with x ∈ X_cand; in optimization the winner is discussed in Section 6.2. Experiments show that sequential designs for SA select relatively few combinations in subareas with an approximately linear f_sim, and many combinations in the other subareas; in optimization the winner looks "promising". Next we use the winner as the input to be simulated, which gives additional I/O data. We may now re-fit (update) the Kriging model to the augmented I/O data. We stop if either the Kriging metamodel satisfies a given goal or the computer budget is exhausted. We also refer to Crombecq, Laermans, and Dhaene (2011) for an extensive study of new and state-of-the-art space-filling sequential design methods.

5.3. Stochastic Kriging (SK)

Ankenman, Nelson, and Staum (2010) develops SK, adding the intrinsic noise term ε_r(x_i) for replication r (r = 1, ..., m_i) at combination x_i (i = 1, ..., n) to (37), which after averaging over the replications gives

y ( xi ) = μ + M ( xi ) + ε ( xi )

(45)

where ε_r(x) ∼ N(0, Var[ε_r(x)]) and ε_r(x) is independent of M(x). Obviously, m_i replications without CRN make Σ_ε̄ (the covariance matrix of ε̄) diagonal, with on its main diagonal the n elements Var[ε_r(x_i)]/m_i. Furthermore, CRN and m_i = m give Σ_ε̄ = Σ_ε/m. To estimate Var[ε_r(x_i)], SK may use s²_i defined in (21). Alternatively, SK may use another Kriging metamodel for Var[ε_r(x_i)] — besides the Kriging metamodel for the mean E[y_r(x_i)] — to estimate Var[ε_r(x_i)]. This alternative may give less volatile estimates than the point estimates s²_i. Because s²_i is not normally distributed, the GP is only a rough approximation. We might also replace s²_i by log(s²_i) in the Kriging metamodel; also see Section 3.3 and Kamiński (2015). To get the SK predictor, Ankenman et al. (2010, Eq. 25) replaces Σ_M in OK by Σ_M + Σ_ε̄ and w by w̄:
ŷ(x_0, ψ̂) = μ̂ + σ̂_M(x_0)′ (Σ̂_M + Σ̂_ε̄)⁻¹ (w̄ − μ̂1)

and

s²{ŷ(x_0)} = τ̂² − σ̂_M(x_0)′ (Σ̂_M + Σ̂_ε̄)⁻¹ σ̂_M(x_0) + [1 − 1′(Σ̂_M + Σ̂_ε̄)⁻¹ σ̂_M(x_0)]² / (1′(Σ̂_M + Σ̂_ε̄)⁻¹ 1).

SK for a quantile instead of an average is discussed in Chen and Kim (2013). In the discussion of (44) we mentioned the problems caused by the randomness of ψ̂. To solve this problem we may apply distribution-free bootstrapping; see Van Beers and Kleijnen (2008) and Yin, Ng, and Ng (2009). Usually SK employs the same designs as OK does for deterministic simulation. So, SK often uses one-shot LHS. However, we also need to select the number of replications, mi. In Section 3.3 we have already discussed the analogous problem in regression metamodeling; a simple rule-of-thumb is (25). In sequential designs for SA, we may select the "winner" defined in Section 5.2. In SK we may select this winner, using distribution-free bootstrapping. Van Beers and Kleijnen (2008) selects more input values in the subdomain of the experimental area that gives a highly nonlinear estimated I/O function; this design gives better Kriging predictions than the fixed LHS design — especially for small designs, which are used in expensive simulations. Sequential designs for simulation optimization through SK will follow in Section 6.2.
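The SK point predictor above can be sketched in a few lines of numpy. This is a minimal illustration, not Ankenman et al.'s implementation: the Gaussian correlation function and all parameter values are assumptions, and in practice the Kriging parameters ψ = (μ, τ2, θ) are estimated (e.g., by maximum likelihood) rather than known.

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    # Gaussian correlation function: prod_j exp(-theta_j * h_j^2).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * theta).sum(axis=2)
    return np.exp(-d2)

def sk_predict(X, wbar, var_eps, m, x0, mu, tau2, theta):
    """SK point predictor at x0 (illustrative sketch).

    X: (n, k) simulated combinations; wbar: (n,) replication averages;
    var_eps: (n,) estimated intrinsic variances Var[eps(x_i)];
    m: (n,) replication counts; mu, tau2, theta: GP parameters (assumed
    known here; in practice they are estimated from the I/O data)."""
    Sigma_M = tau2 * gauss_corr(X, X, theta)        # extrinsic covariance matrix
    Sigma_eps = np.diag(var_eps / m)                # intrinsic covariances (no CRN)
    sigma0 = tau2 * gauss_corr(X, x0[None, :], theta).ravel()
    weights = np.linalg.solve(Sigma_M + Sigma_eps, wbar - mu)
    return mu + sigma0 @ weights
```

Note that with zero intrinsic variance the predictor reduces to OK and interpolates the observed averages exactly.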

5.4. Monotonic Kriging

In practice we sometimes know that fsim is monotonic; e.g., if the traffic rate increases, then the mean waiting time increases. More examples were given in Section 4 on screening. The Kriging predictor ŷ, however, may be wiggling if the sample size n is small. To make ŷ monotonic, we may apply distribution-free bootstrapping with acceptance/rejection, as follows. We might assume that if the analysts require monotonicity for the simulation model's I/O function, then they obtain so many replications that the n average simulation outputs w̄i (i = 1, ..., n) also show this property. Technically, however, the heuristic has a weaker requirement; namely, mini(w̄i) < maxi(w̄i+1) with i = 1, ..., n − 1; see Kleijnen and Van Beers (2013, p. 710). Next the heuristic rejects the Kriging metamodel fitted in bootstrap sample b — with b = 1, ..., B and bootstrap sample size B — if this metamodel gives nonmonotonic predictions; a monotonic predictor means that the estimated gradients of the predictor remain positive as the inputs increase. An advantage of monotonic Kriging is that in practice the users better understand and accept the resulting SA. Furthermore, monotonic Kriging may give a smaller MSE and a CI with higher coverage and acceptable length. Finally, the estimated gradients with correct signs may improve optimization. For mathematical details and references we refer to Kleijnen (2015, pp. 212–216). To quantify the performance of this method in SA, we may use the estimated integrated mean squared error (EIMSE), where we average prediction errors over a set of "new" combinations. Whereas OK uses the CI defined in (44), which is symmetric around its point estimate ŷ and may include negative values — even if negative values are impossible, as is the case for waiting times — monotonic Kriging gives an asymmetric CI that does not cover negative values.
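The acceptance/rejection step of this heuristic reduces to a simple monotonicity test on each bootstrapped predictor. The sketch below uses simulated stand-ins for the B bootstrapped predictions on a sorted input grid; in the heuristic itself these predictions come from Kriging models fitted to bootstrapped averages.

```python
import numpy as np

rng = np.random.default_rng(7)

def is_monotonic(pred):
    """Acceptance test: keep a bootstrapped Kriging metamodel only if its
    predictions on a sorted input grid are nondecreasing (i.e., the
    estimated gradients stay nonnegative)."""
    return bool(np.all(np.diff(pred) >= 0.0))

# Hypothetical stand-ins for B = 100 bootstrapped predictors on a sorted grid.
grid = np.linspace(0.0, 1.0, 25)
bootstrap_preds = [grid + 0.01 * rng.normal(size=grid.size) for _ in range(100)]

accepted = [p for p in bootstrap_preds if is_monotonic(p)]
# The point predictor and the asymmetric CI then use only the accepted samples.
```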
Experiments show that — compared with OK — monotonic Kriging gives a smaller (but not significantly smaller) EIMSE, and significantly higher estimated coverages of the CIs for the true outputs at the new combinations — without widening the CI. We may also apply bootstrapped Kriging with acceptance/rejection to preserve other I/O characteristics; e.g., we accept only positive values for outputs such as waiting times, variances, and thicknesses. Furthermore, we may apply bootstrapping with acceptance/rejection to other types of metamodels, such as the linear regression metamodel. Altogether, we think that bootstrapped Kriging is a simple heuristic that may give better predictions and CIs, but further research is needed; see Maatouk and Bay (2016) and Cousin, Maatouk, and Rullière (2015).

5.5. Global sensitivity analysis: Sobol's FANOVA

So far we focused on the predictor ŷ, but now we measure how sensitive ŷ — and w, if ŷ is an adequate predictor — are to the individual inputs d1 through dk and their interactions. We assume that d has a prespecified distribution, as in LHS; see Section 5.2. We apply functional analysis of variance (FANOVA), using variance-based indexes originally proposed by Sobol. FANOVA decomposes the simulation output variance σw2 into fractions that refer to the individual inputs or to sets of inputs; e.g., FANOVA may show that 70 percent of σw2 is caused by the variance in d1, 20 percent by d2, and 10 percent by the interaction between d1 and d2. We can prove the following variance decomposition into a sum of 2^k − 1 components:

σw2 = Σ_{j=1}^{k} σj2 + Σ_{j<j′} σj;j′2 + · · · + σ1;...;k2    (46)

with the main-effect variance σj2 = Var[E(w|dj)], the two-factor interaction variance σj;j′2 = Var[E(w|dj, dj′)] − Var[E(w|dj)] − Var[E(w|dj′)], etc.; in this review we do not give the mathematical definitions of the variances given fixed values for three or more factors, because we assume that these variances are negligible (as we assume in regression analysis). The measure σj2 leads to the first-order sensitivity index or main-effect index ζj = σj2/σw2. So, ζj quantifies the effect of varying dj alone — averaged over the variations in all the other k − 1 inputs; σw2 in the denominator standardizes ζj to provide a fractional contribution (in linear regression we standardize dj so that βj measures the relative first-order effect; see (7)). Likewise, σj;j′2 through σ1;...;k2 are divided by σw2. Altogether we get

Σ_{j=1}^{k} ζj + Σ_{j=1}^{k−1} Σ_{j′=j+1}^{k} ζj;j′ + · · · + ζ1;...;k = 1.    (47)

As k increases, the number of measures in (46) or (47) increases dramatically. So we may assume that only ζ j — and possibly ζ j; j — are important, and verify whether they sum up to a fraction “close enough” to 1 in (47). The estimation of the various sensitivity measures may use LHS, and replace the simulation model by a Kriging metamodel; see Borgonovo and Plischke (2016) and Saltelli et al. (2008, pp. 164- 67). 5.6. Risk analysis (RA) or uncertainty analysis (UA) In FANOVA we assume a given distribution for the simulation inputs. In RA we may wish to estimate P (w > c1 ) with a given threshold value c1 . RA is applied in nuclear engineering, finance, water management, etc. Actually, P (w > c1 ) may be very small — so w > c1 is called a rare event — but may have disastrous consequences. An example of RA may be w denoting the net present value (NPV) and d the cash flows so d is sampled from a given distribution function; spreadsheets are popular software for such NPV computations. The uncertainty about the exact values of d is called subjective or epistemic, whereas the “intrinsic” uncertainty in stochastic simulation is called objective or aleatory. SA and RA address different questions; namely, “Which are the most important inputs in the simulation model of a given real system?” and “What is the probability of a given (disastrous) event happening?”. So, SA may identify those inputs for which the distribution in RA needs further refinement. RA and SA are also detailed in Borgonovo and Plischke (2016). Methodologically, we propose the following method for RA aimed at estimating P (w > c1 ). We use a Monte Carlo method to sample input combination d from its given distribution. Next we use this d as input into the given simulation model. We run the simulation model to transform d into w, which is called propagation of uncertainty about the input. We repeat these steps n times to obtain the EDF of w. Finally, we use this EDF to estimate P (w > c1 ). 
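These Monte Carlo steps can be sketched as follows; the two-input simulation model is a hypothetical stand-in for an expensive simulator (or for its metamodel), and the input distribution is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)

def simulate(d):
    # Hypothetical stand-in for the simulation model w = f_sim(d).
    return d[0] + d[1]

# Step 1: sample n input combinations d from their given distribution.
n = 100_000
D = rng.normal(size=(n, 2))

# Step 2: propagate the input uncertainty through the simulation model.
w = np.array([simulate(d) for d in D])

# Step 3: use the EDF of w to estimate P(w > c1).
c1 = 0.0
p_hat = np.mean(w > c1)
```

For the toy model above w is N(0, 2) distributed, so p_hat estimates a true probability of 0.5; in an expensive simulation, the loop over `simulate` would instead evaluate an inexpensive metamodel.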
This method is also known as nested simulation; see Gordy and Juneja (2010). In expensive simulation, we do not run n simulation runs, but run the simulation's metamodel n times; the true P(w > c1) may then be better estimated through inexpensive sampling of many values from the metamodel, which is estimated from relatively few I/O values obtained from the expensive simulation model. Uncertainty in simulation models, including RA and SA, is also studied by the British community Managing uncertainty in complex models (MUCM) and the French research group GdR MASCOT-NUM. For example, Chevalier et al. (2014) uses Kriging to estimate the excursion set, defined as the set of inputs that give an output that exceeds a given threshold, and quantifies uncertainties in this estimate; a sequential design may reduce this uncertainty. Obviously, the volume of the excursion set is closely related to the failure probability P(w > c1). Notice that Kleijnen et al. (2011) uses a first-order polynomial to estimate which combinations of uncertain inputs form the frontier that separates acceptable and unacceptable outputs; both aleatory and epistemic uncertainty are included. RA is related to the Bayesian approach, which assumes the parameters of the simulation model to be unknown with given prior distributions for these parameters — usually, conjugate priors. After obtaining simulation I/O data, this approach uses the well-known Bayes theorem to compute the posterior distribution of the simulation output. Bayesian model averaging and Bayesian melding formally account not only for the uncertainty of the input parameters but also for the uncertainty in the form of the (simulation) model itself. In practice, however, classical frequentist RA has been applied more often than Bayesian RA; see Helton, Hansen, and Swift (2014). We also refer to the review of classical and Bayesian approaches in Nelson (2016).

6. Simulation optimization

In practice, optimization of real-world systems is important. We also emphasize the crucial role of uncertainty in the input data for simulation models, which implies that robust optimization is important. In academic research, the importance of optimization is demonstrated by the many publications on this topic; e.g., a Google search for "simulation optimization" gave "about 123,000,000 results" on 27 November 2015. The simplest optimization has no constraints for the inputs or the outputs, has no uncertain inputs, and concerns the expected value of a single output, E(w). This E(w) may represent the probability of a binary variable: E(w) = p if P(w = 1) = p and P(w = 0) = 1 − p. However, E(w) excludes quantiles and the mode of the output distribution. Furthermore, the simplest optimization assumes continuous inputs, excluding so-called ranking and selection (R&S) and multiple comparison procedures. There are so many optimization methods that we do not try to summarize these methods.
Instead, we focus on optimization using metamodels, especially linear regression and Kriging. Jalali and Van Nieuwenhuyse (2015) claims that metamodel-based optimization is "relatively common" and that RSM is the most popular metamodel-based method, while Kriging is popular in theoretical publications. We add that linear regression and Kriging for optimization are the focus of Barton and Meckesheimer (2006), including a fully worked example for both regression and Kriging. Furthermore, we focus on expensive simulations; for these simulations, it is impractical to apply optimization methods such as OptQuest; this commercial method is detailed on http://www.opttek.com/OptQuest. In practice, a single simulation run may be computationally inexpensive, but there may be extremely many input combinations. Furthermore, most simulation models have many inputs, which leads to the curse of dimensionality; e.g., k = 7 with 10 values per input requires 10^7 combinations. Moreover, a single run may be expensive if we wish to estimate the steady-state performance of a queueing system with a high traffic rate. Finally, if we wish to estimate a small E(w) = p (e.g., p = 10^−7), then we need extremely long simulation runs (unless we succeed in applying importance sampling).

6.1. Linear regression for optimization: RSM

RSM treats the simulation model as a black box. RSM is sequential; i.e., it uses a sequence of local experiments that is meant to lead to the optimal input combination. RSM has gained a good track record; see Kleijnen (2015, p. 244), Law (2015, pp. 656–679), and Myers et al. (2009). We assume that RSM is applied only after the important inputs and their experimental area have been identified; i.e., before RSM


starts, we may need screening (which we detailed in Section 4). However, RSM and screening are integrated in Chang, Li, and Wan (2014). Methodologically, the goal of RSM is to minimize E(w|z) with simulation output w and original simulation inputs z. To initialize RSM, we select a starting point; e.g., the combination currently used in practice. In the neighborhood of this starting point we fit a first-order polynomial, assuming white noise; notwithstanding this white-noise assumption, RSM allows Var(w) to change in a next step. Unfortunately, there are no general guidelines for determining the appropriate size of the local area in each step (Chang et al., 2014, however, selects this size through a so-called trust region). To fit a first-order polynomial, we use a R-III design; these designs were detailed in Section 2.1. In the next steps, we fit local first-order polynomials. In each of these steps, we use the gradient implied by the first-order polynomial fitted in that step: ∇(ŷ) = γ̂−0, where −0 means that the intercept γ̂0 is removed from the (k + 1)-dimensional vector with the estimated regression parameters γ̂. This ∇(ŷ) estimates the steepest-descent direction. We take a step in the steepest-descent direction, trying intuitively selected values for the step size. After a number of such steps in the steepest-descent direction, w will increase (instead of decrease) because the local first-order polynomial becomes inadequate. When such deterioration occurs, we simulate the n > k combinations of the R-III design, but now we center this design around the best combination found so far. To quantify the adequacy of the local first-order polynomial, we may compute R2 or cross-validation statistics; see Section 3.5. Obviously, a plane implied by a first-order polynomial cannot adequately represent a hill top when searching to maximize w or — equivalently — a valley bottom when minimizing w.
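One local RSM step (fitting a first-order polynomial on a two-level design, then moving in the estimated steepest-descent direction) can be sketched as follows. The simulation model, the step size, and the use of a full 2^2 factorial are illustrative assumptions.

```python
import numpy as np

def sim(z, rng):
    # Hypothetical noisy simulation model with minimum near z = (3, 5).
    return (z[0] - 3) ** 2 + (z[1] - 5) ** 2 + rng.normal(scale=0.1)

rng = np.random.default_rng(1)
center, half_width = np.array([0.0, 0.0]), 1.0

# Local design in standardized coordinates d in {-1, +1}^2.
D = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]], dtype=float)
Z = center + half_width * D                  # original input combinations
w = np.array([sim(z, rng) for z in Z])

# OLS fit of the local first-order polynomial w = g0 + g1*d1 + g2*d2 + e.
X = np.column_stack([np.ones(len(D)), D])
gamma_hat, *_ = np.linalg.lstsq(X, w, rcond=None)
grad = gamma_hat[1:]                         # gradient estimate (intercept removed)

step = 1.0                                   # intuitively selected step size
new_center = center - step * grad / np.linalg.norm(grad)
```

In the full RSM procedure this step is repeated until the response deteriorates, after which a new local design is centered around the best combination found so far.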
So, in the neighborhood of the estimated optimum we fit a second-order polynomial; for this fitting we use a CCD. Next we use the derivatives of this polynomial to estimate the optimum. We may also apply canonical analysis to examine the shape of the optimal subregion: does this subregion give a unique minimum, a saddle point, or a ridge with stationary points? If time permits, then we may try to escape from a possible local minimum and restart the search from a different initial local area. We recommend not to eliminate inputs with nonsignificant effects in a local first-order polynomial: these inputs may become significant in a next local area. The selection of mi (the number of replications) is a moot issue; see Section 3.3 and the SPRT for SB in Section 4.2. A higher-order polynomial requires the simulation of many more input combinations than a lower-order polynomial; moreover, it may give a predictor ŷ with lower bias but higher variance, so its MSE may increase. Barton and Meckesheimer (2006) provides explicit rules for RSM on what to do when either lack of fit or lack of significance is encountered. Assuming mi = m and CRN, we may compute γ̂r (r = 1, ..., m). So, replication r gives ∇(ŷ) — if a first-order polynomial is used — or the optimum — if a second-order polynomial is used. If we find the estimated accuracy to be too low, then we may simulate additional replications (so m increases). We may also use either parametric or distribution-free bootstrapping to derive CIs for ∇(ŷ) and the optimum. These issues, however, require more research. A scale-independent adapted steepest-descent direction that accounts for Σγ̂ (the covariance matrix of γ̂) is discussed in Kleijnen (2015, pp. 252–253). Experimental results suggest that this direction performs better than the classic steepest-descent direction. In practice, simulation models have multiple response types (see again Sections 4.3 and 3.1).
The RSM literature offers several approaches for such situations; see (Khuri & Mukhopadhyay, 2010). We, however, focus on generalized RSM (GRSM), originally published in Angün (2004). GRSM addresses the following constrained

nonlinear random optimization problem:

min_z E(w(1)|z) such that E(w(l)|z) ≥ cl (l = 2, ..., r) and Lj ≤ zj ≤ Hj (j = 1, ..., k).    (48)

GRSM combines RSM and interior-point methods from mathematical programming, avoiding creeping along the boundary of the feasible area that is determined by the constraints on the random outputs and the deterministic inputs. So, GRSM moves faster to the optimum than steepest descent does. Moreover, GRSM is scale independent. For details we refer to Angün (2004) and Kleijnen (2015, pp. 253–258). Obviously, it is uncertain whether the optimum estimated by GRSM is close to the true optimum. The first-order necessary optimality conditions are known as the Karush–Kuhn–Tucker (KKT) conditions. The KKT conditions may be tested through parametric bootstrapping; see Bettonvil, Castillo, and Kleijnen (2009) and Kleijnen (2015, pp. 259–266).

6.2. Kriging metamodels for optimization

Efficient global optimization (EGO) is a well-known sequential method that uses Kriging to balance local and global search; i.e., it balances exploitation and exploration. When EGO selects a new (standardized) combination x0, it estimates the maximum of the expected improvement (EI), comparing w(x0) and — in minimization — mini w(xi) with i = 1, ..., n. We saw below (41) that s2{ŷ(x0)} increases as x0 lies farther away from xi. So, EI reaches its maximum if either ŷ is much smaller than mini w(xi) or s2{ŷ(x0)} is relatively large so ŷ is relatively uncertain. We present only basic EGO for deterministic simulation; also see the classic EGO reference, Jones, Schonlau, and Welch (1998). Note: There are many EGO variants for deterministic and random simulations, constrained optimization, multi-objective optimization including Pareto frontiers, the "admissible set" or "excursion set", robust optimization, estimation of a quantile, and Bayesian approaches; see Preuss et al. (2012) and the many references in Kleijnen (2015, pp. 267–269) plus the file with "Corrections and additions" for Kleijnen (2015, p. 267) available on https://sites.google.com/site/kleijnenjackpc/home/publications. We start with a pilot sample, typically selected through LHS. To the resulting simulation I/O data (X, w), we fit a Kriging metamodel ŷ(x). Next we find fmin = min_{1≤i≤n} w(xi). This gives

EI(x) = E[max(fmin − y(x), 0)].    (49)

Jones et al. (1998) derives the following closed-form expression for the estimator of EI:

ÊI(x) = (fmin − ŷ(x)) Φ[(fmin − ŷ(x))/s{ŷ(x)}] + s{ŷ(x)} φ[(fmin − ŷ(x))/s{ŷ(x)}]    (50)

where Φ and φ denote the cumulative distribution function (CDF) and the PDF of the standard normal distribution. Using (50), we find x̂opt, which denotes the estimate of x that maximizes ÊI(x). (To find x̂opt, we should apply a global optimizer; a local optimizer is undesirable because s{ŷ(xi)} = 0 so ÊI(xi) = 0. Alternatively, we use a set of candidate points selected through LHS.) Next we run the simulation with this x̂opt, and obtain w(x̂opt). Then we fit a new Kriging model to the augmented I/O data (Kamiński (2015) presents methods for avoiding re-estimation of the Kriging parameters). We update n and return to (50) — until we satisfy a stopping criterion; e.g., ÊI(x̂opt) is "close" to 0. For the constrained nonlinear random optimization problem already formalized in (48) we may derive a variant of EGO; see the Note immediately preceding (49). Kleijnen et al. (2010), however,


derives a heuristic called Kriging and integer mathematical programming (KrIMP) for solving (48) augmented with constraints on z. These constraints are necessary if z includes resources such as the number of employees with prespecified expertise and the number of trunk lines in a call-center simulation. Altogether, (48) is augmented with s constraints fg for z and the constraint that z must belong to the set of nonnegative integers N, so KrIMP tries to solve

min_z E(w(1)|z) such that E(w(l)|z) ≥ cl (l = 2, ..., r), fg(z) ≥ cg (g = 1, ..., s), and zj ∈ N (j = 1, ..., k).    (51)

KrIMP combines the following three methods: (i) sequentialized DOE to specify the next combination, like EGO does; (ii) Kriging to analyze the resulting I/O data and obtain explicit functions for E(w(l)|z) (l = 1, ..., r), like EGO does; (iii) integer nonlinear programming (INLP) to estimate the optimal solution from these explicit Kriging models, unlike EGO. Experiments with KrIMP and OptQuest (OptQuest has already been mentioned in the beginning of Section 6) suggest that KrIMP requires fewer simulated combinations and gives better estimated optima.

6.3. Robust optimization

Robust optimization (RO) is crucial in today's uncertain world. The optimal solution for the decision variables — which we may estimate through RSM or Kriging — may turn out to be inferior when we ignore uncertainties in the noncontrollable environmental variables; i.e., these uncertainties create a risk (also see RA in Section 5.6). Taguchians emphasize that in practice some inputs of a manufactured product are under complete control of the engineers, whereas other inputs are not. In simulation, the estimated optimal input combination ẑopt may be completely wrong when we ignore uncertainties in some inputs. Taguchians therefore distinguish between (i) controllable (decision) variables and (ii) noncontrollable (environmental, noise) factors. We denote the number of controllable simulated inputs by kC and the number of noncontrollable simulated inputs by kNC, so kC + kNC = k; obviously, the simulation analysts control all k inputs. Let the first kC of the k simulated inputs be controllable, and the next kNC simulated inputs be noncontrollable. We denote the vector with the kC controllable simulated inputs by zC, and the vector with the kNC noncontrollable simulated inputs by zNC. Note: The goal of RO is the design of robust products or systems, whereas the goal of risk analysis is to quantify the risk of a given engineering design; that design may turn out to be not robust at all.
For example, Kleijnen and Gaury (2003) simulates production-management, using RSM assuming a specific — namely the most likely — combination of noncontrollable simulated inputs. Next, the robustness of  zopt is estimated when the combination of noncontrollable simulated inputs changes; technically, these changes are generated through LHS. In RO, however, we wish to find a solution that — from the start of the analysis — accounts for all possible environments, including their likelihood; i.e., whereas Kleijnen and Gaury (2003) performs an ex post robustness analysis, we wish to perform an ex ante analysis. Taguchians assume a single output (say) w, focusing on its mean μw and its variance caused by the noise factors so σ 2 (w|zC ) > 0. These two outputs are combined into a scalar loss function such as the signal-to-noise or mean-to-variance ratio μw /σw2 ; see (Myers et al., 2009, pp. 486–488). We, however, prefer to use μw and σw (unlike σw2 , the measure σw has the same scale as μw has) separately so that we can use constrained optimization, which is


popular in operations research; i.e., given an upper threshold cσ for σw , we try to solve

min_{zC} E(w|zC) such that σ(w|zC) ≤ cσ.    (52)
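Once inexpensive metamodels for the mean and the standard deviation are available, (52) can be solved, for example, by a simple grid search over zC. The two closed-form "metamodels" below are hypothetical stand-ins for fitted RSM or Kriging models, used only to illustrate the constrained minimization.

```python
import numpy as np

def mean_hat(zc):
    # Hypothetical fitted metamodel for E(w | z_C).
    return (zc - 2.0) ** 2

def std_hat(zc):
    # Hypothetical fitted metamodel for sigma(w | z_C): risk grows with z_C.
    return 0.5 + 0.4 * zc

c_sigma = 1.3                                  # threshold chosen by management
grid = np.linspace(0.0, 4.0, 401)              # candidate values of z_C
feasible = grid[std_hat(grid) <= c_sigma]      # enforce sigma(w|z_C) <= c_sigma
zc_opt = feasible[np.argmin(mean_hat(feasible))]
```

Repeating this search for several values of c_sigma traces out the estimated Pareto-optimal efficiency frontier mentioned below.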

Constrained optimization is also discussed in Myers et al. (2009, p. 492). The Taguchian worldview is successful in production engineering, but statisticians criticize the statistical techniques. Moreover — compared with real-life experiments — simulation experiments have more inputs, more input values, and more input combinations. Myers et al. (2009, pp. 502–506) combines the Taguchian worldview with the statisticians' RSM. Whereas Myers et al. (2009) assumes that the noncontrollable inputs are independently distributed with a common variance, we assume a general covariance matrix (say) ΣNC for the noncontrollable simulated inputs. Whereas Myers et al. (2009) superimposes contour plots for E(w|zC) and σ2(w|zC) to estimate the optimal zC, we use mathematical programming (MP). This MP, however, requires specification of cσ in (52). Unfortunately, managers may find it hard to select a specific value for cσ, so we may try different cσ values and estimate the corresponding Pareto-optimal efficiency frontier. To estimate the variability of this frontier resulting from the estimators of E(w|zC) and σ(w|zC), we may use bootstrapping. We now present some details. Myers et al. (2009) fits a second-order polynomial for zC with zC to be optimized; moreover, the regression metamodel includes the kNC first-order effects of zNC, and the kC × kNC two-factor interactions between the controllable and the noncontrollable simulated inputs:



y = γ0 + Σ_{j=1}^{kC} γj zj + Σ_{j=1}^{kC} Σ_{j′≥j}^{kC} γj;j′ zj zj′ + Σ_{g=kC+1}^{k} γg zg + Σ_{j=1}^{kC} Σ_{g=kC+1}^{k} γj;g zj zg + e.    (53)

If this metamodel is valid so E(e) = 0, then



E(y|zC) = γ0 + Σ_{j=1}^{kC} γj zj + Σ_{j=1}^{kC} Σ_{j′≥j}^{kC} γj;j′ zj zj′ + Σ_{g=kC+1}^{k} γg E(zg) + Σ_{j=1}^{kC} Σ_{g=kC+1}^{k} γj;g zj E(zg).    (54)

Remembering that kC + kNC = k, and defining the kNC-dimensional vector γNC = (γg) and the (kC × kNC) matrix ΓC;NC = (γj;g), gives

σ2(y|zC) = (γNC + Γ′C;NC zC)′ ΣNC (γNC + Γ′C;NC zC) + σe2 = η′ΣNC η + σe2    (55)

where η = γNC + Γ′C;NC zC is the gradient (∂y/∂zg). Obviously, the

larger the gradient's elements are, the larger σ2(y|zC) is. Furthermore, if ΓC;NC = 0, then we cannot control σ2(y|zC) through zC. To estimate the parameters in (53), Taguchians usually apply a crossed design, which combines (i) the design or inner array for zC with its nC combinations and (ii) the design or outer array for zNC with its nNC combinations; altogether, the crossed design has nC × nNC combinations. Like Taguchians, we also apply a crossed design; i.e., we combine a CCD for zC and a R-III design for zNC; these designs were detailed in Section 2.4 and Section 2.1, respectively. Obviously, this combined design also enables the estimation of ΓC;NC. To derive the OLS estimator using this crossed design, we reformulate (53) as a linear regression model like (1), which gives y = ξ′z + e with ξ = (γ0, ..., γkC;k)′ and z defined such that it corresponds with ξ. This reformulation gives

ξ̂ = (Z′Z)−1 Z′w and Σξ̂ = (Z′Z)−1 σw2.    (56)


To estimate E(y|zC) in (54), we simply plug ξ̂ into (54) with known zC and known E(zNC). To estimate σ2(y|zC), we also plug ξ̂ into (55) with known ΣNC. Obviously, plugging in ξ̂ creates bias — which we ignore. Next we solve (52), using (for example) Matlab's fmincon. Examples are presented in Myers et al. (2009) and also in Dellino, Kleijnen, and Meloni (2012) and Yanıkoğlu, den Hertog, and Kleijnen (2016). Dellino et al. (2012) combines the Taguchian worldview and Kriging, replacing the polynomial in (53) by OK. The estimated Kriging metamodels for E(y|zC) and σ(y|zC) are solved through NLP. Changing the threshold cσ in (52) enables estimation of the Pareto frontier. Bootstrapping is applied to quantify the variability of the estimated Kriging metamodels and Pareto frontier. Notice that Koch, Wagner, Emmerich, Bäck, and Konen (2015) derives an EGO variant for multi-criteria optimization. To estimate Kriging metamodels (instead of the RSM metamodel) for E(w|zC) and σ(w|zC) in (52), Dellino et al. (2012) uses the following two approaches: (i) like in RSM, fit one Kriging model for E(w|zC) and one Kriging model for σ(w|zC), estimating both models from the same simulation I/O data; (ii) like in Lee and Park (2006), fit one Kriging model to a relatively small number n of combinations of zC and zNC, and use this metamodel to compute predictions for w for N ≫ n combinations of zC and zNC, accounting for the distribution of zNC. We detail these approaches, as follows. In approach (i) we select the nC combinations of zC in a space-filling way; e.g., we use LHS. The nNC combinations of zNC, however, we sample from the distribution of zNC. If we have no prior information about the likelihood of specific values for zNC, then we use independent uniform distributions for each element of zNC. To sample zNC, we may again use LHS. The result is an nC × nNC crossed design specifying the k = kC + kNC simulation inputs.
Running the simulation with these nC × nNC input combinations gives the outputs wi;j, which give

w̄i = (Σ_{j=1}^{nNC} wi;j)/nNC and si2 = (Σ_{j=1}^{nNC} (wi;j − w̄i)2)/(nNC − 1) with i = 1, ..., nC.    (57)
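A numpy sketch of (57) for approach (i) follows; the small output matrix is hypothetical illustration data, with rows indexing the nC inner-array combinations of zC and columns indexing the nNC sampled outer-array combinations of zNC.

```python
import numpy as np

# Hypothetical outputs from a crossed design: shape (n_C, n_NC).
w = np.array([[3.0, 4.0, 5.0],
              [1.0, 1.0, 4.0]])

wbar = w.mean(axis=1)          # conditional means, as in (57)
s2 = w.var(axis=1, ddof=1)     # conditional variances, as in (57)
```

The resulting wbar and s (the square root of s2) are then each approximated by its own Kriging metamodel.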

The estimators in (57) are unbiased, as they do not use any metamodels. In approach (ii) we select n ≪ nC × nNC combinations of the k = kC + kNC inputs zj (j = 1, ..., kC) and zg (g = kC + 1, ..., k) through a space-filling design (e.g., a LHS design). Next, we use this n × k matrix as simulation input, and obtain the n-dimensional vector with simulation outputs w. To these simulation I/O data we fit a Kriging model, which approximates w as a function of zj and zg. Finally, we use a design with N ≫ n combinations, crossing a space-filling design with NC combinations of zC and a LHS design with NNC combinations of zNC accounting for the distribution of zNC. We use the Kriging model to compute the predictions ŷ of the NC × NNC simulation outputs. We then compute the NC conditional means w̄i and standard deviations si using (57), replacing w by ŷ and replacing nC and nNC by NC and NNC. We use these w̄i and si (i = 1, ..., NC) to fit one Kriging model for w̄i and one for si. Finally, we summarize Ben-Tal et al.'s RO; see Ben-Tal, El Ghaoui, and Nemirovski (2009). If MP ignores the uncertainty in the coefficients of the MP model, then the resulting nominal solution may easily violate the constraints in the given model. RO may give a slightly worse value for the goal variable, but RO increases the probability of satisfying the constraints; i.e., a robust solution is "immune" to variations of the variables within the uncertainty set U. Yanıkoğlu et al. (2016) derives a specific U for p, where p denotes the unknown density function of zNC that is compatible with the given historical data on zNC. This type of RO develops a computationally tractable robust counterpart of the original problem. Com-

pared with the output of the nominal solution in MP, RO may give better worst-case and average outputs.

Acknowledgment

I thank the three referees for their very constructive comments, which helped me correct some errors and improve some explanations in three earlier versions.

Appendix. List of major symbols and their meanings

We first list Latin symbols, then Greek symbols.

B: sample size in bootstrapping
c: axial value in CCD
cg: threshold in constraint g for simulation inputs
cl: threshold for simulation output l
cγ: threshold for importance, so input j is important if γj > cγ
cσ: threshold for σw in robust optimization
di;j: standardized simulation input j in combination i
D = (di;j): n × k design matrix
e: n-dimensional vector with residuals E(y) − E(w)
fsim: I/O function implicitly defined by the simulation model
hj: distance between two values of xj in Kriging metamodel
Hj: highest value of zj in the experiment
k: number of simulation inputs
kC: number of controllable simulated inputs
kNC: number of noncontrollable simulated inputs
Lj: lowest value of zj in the experiment
mi: number of replications of simulation input combination i
M(x): Gaussian stationary process with zero mean
n: number of different simulated input combinations
p: number of generators in a 2^(k−p) design
q: number of explanatory regression variables
r: number of simulation output types
r: vector with pseudorandom numbers (PRNs)
R2: coefficient of determination
ree: relative estimation error
s2: estimated variance
tf: Student statistic with f degrees of freedom
tf;α: α quantile of the tf distribution
U: uncertainty set in robust optimization
wi;r: simulation output in replication r of simulation input combination i
wk1: simulation output with inputs z1 = H1, ..., zk1 = Hk1, zk1+1 = L1, ..., zk = Lk
w−j: simulation output with inputs z1 = L1, ..., zj = Lj, zj+1 = Hj+1, ..., zk = Hk
xi;g: value of independent standardized variable g in combination i
x: vector with q independent standardized variables of the metamodel
X: n × q matrix of independent standardized variables in the metamodel
Xcand: matrix with candidate combinations of independent standardized variables in the metamodel
y: n-dimensional vector with dependent variable
zi;j: original (nonstandardized) simulation input j in combination i
z: vector with k inputs of simulation model
zα/2: α/2 quantile of the standard normal distribution
zC: kC-dimensional vector with controllable simulated inputs


zNC: kNC-dimensional vector with noncontrollable simulated inputs
β0: intercept in polynomial metamodel with standardized inputs
βj: first-order effect of xj (j = 1, . . . , k) in polynomial metamodel
βj;j′: two-factor interaction between xj and xj′ with j < j′ in polynomial metamodel
βj;j: purely quadratic effect of xj in polynomial metamodel
βj−j′: sum of first-order effects of standardized simulation inputs j through j′
γ0: intercept in polynomial metamodel with original simulation inputs
γj: first-order effect of original simulation input j (j = 1, . . . , k)
γj;j′: two-factor interaction between original simulation inputs j and j′ with j < j′
γj;j: purely quadratic effect of original simulation input j
ΔC;NC: (kC × kNC) matrix with two-factor interactions between controllable and noncontrollable simulated inputs
ε(x): intrinsic noise in stochastic Kriging
ζj: first-order sensitivity index in FANOVA
η: gradient with marginal effects of noncontrollable simulated inputs
θ: parameter in Gaussian correlation function
λi: weight of simulation output wi in Kriging predictor
μ: expected value (mean)
ξ: vector with regression parameters in RSM for robust optimization
ρ: vector with correlations between y(xi) and y(x0)
σ²: variance
σM: vector with covariances between y(xi) and y(x0)
ΣM: covariance matrix
τ²: variance of Kriging output
ψ: vector with Kriging parameters μ, τ², and θj
Ω: correlation matrix

References

Angün, M. E. (2004). Black box simulation optimization: Generalized response surface methodology. CentER dissertation series. Tilburg, Netherlands: Tilburg University. Also published in 2011 by VDM Verlag Dr. Müller, Saarbrücken, Germany.
Ankenman, B., Nelson, B., & Staum, J. (2010). Stochastic Kriging for simulation metamodeling. Operations Research, 58(2), 371–382.
Barton, R. R. (1997). Design of experiments for fitting subsystem metamodels. In S. Andradóttir, K. J. Healy, D. H. Withers, & B. L. Nelson (Eds.), Proceedings of the 1997 winter simulation conference (pp. 303–310).
Barton, R. R., & Meckesheimer, M. (2006). Metamodel-based simulation optimization. In S. G. Henderson, & B. L. Nelson (Eds.), Handbooks in operations research and management science: Vol. 13. Simulation (Chapter 18). Amsterdam: Elsevier.
Bekki, J. M., Fowler, J. W., Mackulak, G. T., & Kulahci, M. (2009). Simulation-based cycle-time quantile estimation in manufacturing settings employing non-FIFO dispatching policies. Journal of Simulation, 3(2), 69–128.
Ben-Tal, A., Ghaoui, L. E., & Nemirovski, A. (2009). Robust optimization. Princeton, NJ: Princeton University Press.
Bettonvil, B. W. M., Castillo, E. D., & Kleijnen, J. P. C. (2009). Statistical testing of optimality conditions in multiresponse simulation-based optimization. European Journal of Operational Research, 199(2), 448–458.
Bischl, B., Mersmann, O., Trautmann, H., & Weihs, C. (2012). Resampling methods for meta-model validation with recommendations for evolutionary computation. Evolutionary Computation, 20(2), 249–275.
Borgonovo, E., & Plischke, E. (2016). Sensitivity analysis: a review of recent advances. European Journal of Operational Research, 248(3), 869–887.
Box, G. E. P., & Wilson, K. B. (1951). On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society, Series B, 13(1), 1–38.
Chang, K.-H., Li, M.-K., & Wan, H. (2014). Combining STRONG with screening designs for large-scale simulation optimization. IIE Transactions, 46(4), 357–373.
Chen, V. C. P., Tsui, K.-L., Barton, R. R., & Meckesheimer, M. (2006). A review on design, modeling, and applications of computer experiments. IIE Transactions, 38, 273–291.
Chen, X., & Kim, K.-K. (2013). Building metamodels for quantile-based measures using sectioning. In R. Pasupathy, S.-H. Kim, A. Tolk, R. Hill, & M. E. Kuhl (Eds.), Proceedings of the 2013 winter simulation conference (pp. 521–532).


Chevalier, C., Ginsbourger, D., Bect, J., Vazquez, E., Picheny, V., & Richet, Y. (2014). Fast parallel kriging-based stepwise uncertainty reduction with application to the identification of an excursion set. Technometrics, 56(4), 455–465.
Cousin, A., Maatouk, H., & Rullière, D. (2015). Kriging of financial term-structures. https://hal.archives-ouvertes.fr/hal-01206388. Accessed 13.06.16.
Crombecq, K., Laermans, E., & Dhaene, T. (2011). Efficient space-filling and non-collapsing sequential design strategies for simulation-based modeling. European Journal of Operational Research, 214, 683–696.
Dellino, G., Kleijnen, J. P. C., & Meloni, C. (2012). Robust optimization in simulation: Taguchi and Krige combined. INFORMS Journal on Computing, 24(3), 471–484.
Efron, B. (2015). Frequentist accuracy of Bayesian estimates. Journal of the Royal Statistical Society, Series B, 77(3), 617–646.
Fédou, J.-M., & Rendas, M.-J. (2015). Extending Morris method: identification of the interaction graph using cycle-equitable designs. Journal of Statistical Computation and Simulation, 85(7), 1398–1419.
Gordy, M. B., & Juneja, S. (2010). Nested simulation in portfolio risk measurement. Management Science, 56(11), 1833–1848.
Helton, J. C., Hansen, C. W., & Swift, P. N. (2014). Performance assessment for the proposed high-level radioactive waste repository at Yucca Mountain, Nevada. Reliability Engineering and System Safety, 122, 1–6.
Jalali, H., & Van Nieuwenhuyse, I. (2015). Simulation optimization in inventory replenishment: a classification. IIE Transactions, accepted.
Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13, 455–492.
Kamiński, B. (2015). A method for updating of stochastic Kriging metamodels. European Journal of Operational Research, accepted.
Khuri, A. I., & Mukhopadhyay, S. (2010). Response surface methodology. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 128–149.
Kleijnen, J. P. C. (2005). An overview of the design and analysis of simulation experiments for sensitivity analysis. European Journal of Operational Research, 164(2), 287–300.
Kleijnen, J. P. C. (2009). Kriging metamodeling in simulation: a review. European Journal of Operational Research, 192(3), 707–716.
Kleijnen, J. P. C. (2015). Design and analysis of simulation experiments (2nd). Springer.
Kleijnen, J. P. C., & Van Beers, W. C. M. (2013). Monotonicity-preserving bootstrapped Kriging metamodels for expensive simulations. Journal of the Operational Research Society, 64, 708–717.
Kleijnen, J. P. C., Van Beers, W. C. M., & Van Nieuwenhuyse, I. (2010). Constrained optimization in simulation: a novel approach. European Journal of Operational Research, 202(1), 164–174.
Kleijnen, J. P. C., & Deflandre, D. (2006). Validation of regression metamodels in simulation: bootstrap approach. European Journal of Operational Research, 170(1), 120–131.
Kleijnen, J. P. C., & Gaury, E. G. A. (2003). Short-term robustness of production-management systems: a case study. European Journal of Operational Research, 148(2), 452–465.
Kleijnen, J. P. C., Pierreval, H., & Zhang, J. (2011). Methodology for determining the acceptability of system designs in uncertain environments. European Journal of Operational Research, 209(2), 176–183.
Koch, P., Wagner, T., Emmerich, M. T. M., Bäck, T., & Konen, W. (2015). Efficient multi-criteria optimization on noisy machine learning problems. Applied Soft Computing, 29, 357–370.
Law, A. M. (2015). Simulation modeling and analysis (5th). Boston: McGraw-Hill.
Lee, K.-H., & Park, G.-J. (2006). A global robust optimization using Kriging based approximation model. Journal of the Japanese Society for Mechanical Engineering, 49(3), 779–788.
Loeppky, J. L., Sacks, J., & Welch, W. (2009). Choosing the sample size of a computer experiment: a practical guide. Technometrics, 51(4), 366–376.
Lophaven, S. N., Nielsen, H. B., & Sondergaard, J. (2002). DACE: A Matlab Kriging toolbox, version 2.0. Kongens Lyngby: IMM Technical University of Denmark.
Maatouk, H., & Bay, X. (2016). Gaussian process emulators for computer experiments with inequality constraints. arXiv:1606.01265v1.
Markiewicz, A., & Szczepańska, A. (2007). Optimal designs in multivariate linear models. Statistics & Probability Letters, 77, 426–430.
Montgomery, D. C. (2009). Design and analysis of experiments (7th). Hoboken, NJ: Wiley.
Myers, R. H., Montgomery, D. C., & Anderson-Cook, C. M. (2009). Response surface methodology: process and product optimization using designed experiments (3rd). New York: Wiley.
Nelson, B. L. (2016). 'Some tactical problems in digital simulation' for the next 10 years. Journal of Simulation, 10, 2–11.
Preuss, M., Wagner, T., & Ginsbourger, D. (2012). High-dimensional model-based optimization based on noisy evaluations of computer games. In Y. Hamadi, & M. Schoenauer (Eds.), LION 6: LNCS 7219 (pp. 145–159). Berlin Heidelberg: Springer-Verlag.
Qu, H., & Fu, M. C. (2014). Gradient extrapolated stochastic kriging. ACM Transactions on Modeling and Computer Simulation, 24(4), 23:1–23:25.
Qu, X. (2007). Statistical properties of Rechtschaffner designs. Journal of Statistical Planning and Inference, 137, 2156–2164.
Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Ryan, K. J., & Bulutoglu, D. A. (2010). Minimum aberration fractional factorial designs with large n. Technometrics, 52(2), 250–255.
Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., et al. (2008). Global sensitivity analysis: The primer. Wiley.



Santos, M. I., & Santos, P. M. (2016). Switching regression metamodels in stochastic simulation. European Journal of Operational Research, 251(1), 142–147. Shi, W., Kleijnen, J. P. C., & Liu, Z. (2014). Factor screening for simulation with multiple responses: sequential bifurcation. European Journal of Operational Research, 237(1), 136–147. Shrivastava, A. K., & Ding, Y. (2010). Graph based isomorph-free generation of two-level regular fractional factorial designs. Journal of Statistical Planning and Inference, 140, 169–179. Turner, A. J., Balestrini-Robinson, S., & Mavris, D. (2013). Heuristics for the regression of stochastic simulations. Journal of Simulation, 7, 229–239. Ulaganathan, S., Couckuyt, I., Dhaene, T., & Laermans, E. (2014). On the use of gradients in kriging surrogate models. In A. Tolk, S. Y. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, & J. A. Miller (Eds.), Proceedings of the 2014 winter simulation conference (pp. 2692–2701). Van Beers, W. C. M., & Kleijnen, J. P. C. (2008). Customized sequential designs for random simulation experiments: Kriging metamodeling and bootstrapping. European Journal of Operational Research, 186(3), 1099–1113.

Wan, H., Ankenman, B. E., & Nelson, B. L. (2010). Improving the efficiency and efficacy of controlled sequential bifurcation for simulation factor screening. INFORMS Journal on Computing, 22(3), 482–492.
Wei, P., Lu, Z., & Song, J. (2015). Variable importance analysis: a comprehensive review. Reliability Engineering and System Safety, 142, 399–432.
Woods, D. C., & Lewis, S. M. (2015). Design of experiments for screening. arXiv:1510.05248.
Wu, C. F. J., & Hamada, M. (2009). Experiments: planning, analysis, and parameter design optimization (2nd). New York: Wiley.
Yanikoğlu, I., den Hertog, D., & Kleijnen, J. P. C. (2016). Adjustable robust parameter design with unknown distributions. IIE Transactions, 48(3), 298–312.
Yin, J., Ng, S. H., & Ng, K. M. (2009). A study on the effects of parameter estimation on Kriging model's prediction error in stochastic simulation. In M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, & R. G. Ingalls (Eds.), Proceedings of the 2009 winter simulation conference (pp. 674–685).
Yuan, J., & Ng, S. H. (2015). Calibration, validation and prediction in random simulation models: Gaussian process metamodels and a Bayesian integrated solution. ACM Transactions on Modeling and Computer Simulation, 25(3), Article 18, 1–25.