Procedures for Statistical Validation of Stochastic Simulation Models

Forest Sci., Vol. 27, No. 2, 1981, pp. 349-364 Copyright 1981, by the Society of American Foresters

MARION R. REYNOLDS, JR.
HAROLD E. BURKHART
RICHARD F. DANIELS

ABSTRACT. Proper validation of a stochastic simulation model requires that the predictions of the model be compared with real world data that are independent of the data that were used in the construction of the model. Valid comparisons of real data and model output require an understanding of the nature of the validation problem plus the availability of statistical procedures that are designed to fit the conditions of the problem. In many cases, decisions about the validity of a model are made by a cursory examination of the predicted values or by statistical procedures that may not be appropriate for the problem. This paper develops a framework for testing the validity of models in situations that are frequently encountered in renewable natural resource problems. The suitability of several common statistical tests is discussed and several additional parametric and nonparametric procedures are proposed for the validation problem. The proposed procedures are applied to the validation of a stochastic forest stand simulator using data from 63 loblolly pine plantation plots. The results of the analysis indicated that there were previously undetected deficiencies in the simulator's ability to predict certain aspects of the real system. FOREST SCI. 27:349-364.

ADDITIONAL KEY WORDS. Combining independent tests, loblolly pine, Pinus taeda, nonparametric tests, stand simulator.

COMPUTER-BASED STOCHASTIC SIMULATION MODELS are increasingly being used in the study of renewable natural resource problems where the complexity of the system or process of interest precludes the useful application of other forms of analysis. Simulation models are frequently applied to problems when the general process is composed of a number of simpler processes in a dynamic setting and the interactions of the parts make it impossible to determine analytically the characteristics of the total system. These models allow the inclusion of a substantial amount of detail even in complex systems and thus hopefully achieve an accurate representation of the phenomena under study. As with any model or scientific hypothesis, a simulation model must be subjected to a process of testing and validation before the inferences about the real world obtained from the model can be used with confidence. Most of the work on the problem of validation of simulation models has been from the point of view that the objective of the validation process is not to establish the absolute truth or falsity of the model but rather to determine whether the model will be useful for its intended purpose. Van Horn (1971) has defined validation as "the process of building an acceptable level of confidence that an inference about a simulated process is a correct or valid inference for the actual process." Good discussions of the philosophical issues involved are given in the papers by McKenney (1967), Naylor and Finger (1967), and Schrank and Holt (1967).

The authors are, respectively, Assistant Professor of Forestry and Statistics, Professor of Forestry, and Graduate Research Assistant in the Department of Forestry, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061. This research was supported in part by Cooperative Agreement No. 80-0006 with the U.S. Department of Agriculture Forest Service. The authors wish to thank the referees who provided many helpful suggestions for improving this paper. Manuscript received 26 June 1979 and in revised form 1 December 1980.

VOLUME 27, NUMBER 2, 1981 / 349

The methodology of validation does not seem to have an extensive development although in recent years increasing attention has been devoted to the problem. General procedures and possible statistical tests have been discussed by Freese (1960), Naylor and Finger (1967), Fishman and Kiviat (1968), Van Horn (1971), Mihram (1971 and 1972), Mankin and others (1975), Snee (1977), and Goulding (1979). Initial stages in the validation process involve the examination of the structure and operation of the model to make sure that the program is working as intended. Fishman and Kiviat (1968) and Mihram (1972) call this stage the verification stage. This paper will be concerned with the next stage (sometimes called the validation stage) that involves the testing and comparing of the model output with what is observed in the real world. This stage usually requires that data from the real system be compared with simulated data that is generated using input values that correspond to the observed values of the variables in the real data set. Confidence in conclusions reached about the model will be greatest when the real data used in the validation process is independent of the data used in the construction and calibration of the model. In cases where the collection of new data is not practical, an alternate procedure is to split the data into two or more sets (for example, see Snee 1977). The first set is used in the construction of the model, and the second set is used for validation.

A valid comparison of real data and model output in the validation stage requires an understanding of the nature of the problem plus the availability of statistical procedures that are designed to fit the conditions of the problem. In many cases, decisions about the validity of a model have been made by a cursory examination of the predicted values or by statistical procedures that may not be appropriate for the problem. Thus, there is a real need for a clear specification of an appropriate statistical framework as well as for the development of statistical procedures that are applicable within this framework. This paper develops a framework for a type of validation problem that is frequently encountered in renewable natural resource problems. The appropriateness of several common statistical procedures is discussed, and some additional parametric tests are proposed. Next, some simple nonparametric tests are developed and these parametric and nonparametric procedures are then used to evaluate a stochastic forest stand simulator. Finally, some general conclusions and recommendations are given.

THE STATISTICAL FRAMEWORK

In most cases a stochastic simulation model is constructed in such a way that the output of the model corresponds to some variable or variables of interest in the real system being modeled. Let $Y$ be the main variable of interest and suppose that $Y$ is associated in some way with $p$ other variables represented by the vector $\mathbf{X} = (X_1, X_2, \ldots, X_p)$. The variables in $\mathbf{X}$ can be independent variables measured without error as in a regression problem, or random variables associated in some way with $Y$, or a combination of these two cases. For example, $Y$ might represent total wood volume on a forest stand, and $\mathbf{X}$ might represent certain initial stand and site characteristics such as age, site index, and basal area. Let $F(y, \mathbf{x})$ represent the joint distribution function of $(Y, \mathbf{X})$. The goal in constructing the simulation model is to be able to generate, for given values of $\mathbf{X}$, output that has the same distribution as the conditional distribution of $Y$ given $\mathbf{X}$. Thus, for any observed or specified value of $\mathbf{X}$, the simulation model should be able to mimic the conditional distribution of $Y$. Let $F(y \mid \mathbf{x})$ be the conditional distribution function of $Y$ given $\mathbf{x}$ and let $G(y \mid \mathbf{x})$ be the distribution function of the simulated observations, which depends on the value of $\mathbf{X}$. If the simulation model is an accurate model of the real phenomenon, then $F(y \mid \mathbf{x})$ and $G(y \mid \mathbf{x})$ should be identical for all values of $Y$ and for all values of $\mathbf{X}$ in some specified set of values, say $A$. One way of posing the validation problem is therefore as the problem of testing the hypothesis

$$H_0: F(y \mid \mathbf{x}) = G(y \mid \mathbf{x}) \quad \text{for all } -\infty < y < \infty \text{ and } \mathbf{x} \in A.$$

Note that it is unreasonable to expect a simulated observation generated at a specified value of $\mathbf{X}$ to agree exactly with a real observation associated with the same value of $\mathbf{X}$. Given only the values of a finite number of input variables, it is in general not possible to predict exactly the value of $Y$ even if the model is a "perfect" reflection of the real system. This is because the vector $\mathbf{X}$ does not contain all of the relevant variables and relationships and thus the variable $Y$, given only the value of $\mathbf{X}$, is a random variable, reflecting the random error due to the factors not included in $\mathbf{X}$. For example, two forest stands having the same age, site index and basal area ($\mathbf{X}$) will usually not have exactly the same value for total volume ($Y$). Thus the best that can be done in such situations is to predict the distribution of $Y$.

Suppose that for purposes of validation, $n$ observations $(Y_1, \mathbf{X}_1), (Y_2, \mathbf{X}_2), \ldots, (Y_n, \mathbf{X}_n)$ on $(Y, \mathbf{X})$ are available. These $n$ observations may in some cases result from a random sample of values of $\mathbf{X}$ from $A$, and in other cases some selection of values of $\mathbf{X}$ may be employed to give a desired range of values across $A$. The problem to be considered here is the situation where the $n$ observations on $\mathbf{X}$, however selected, are all distinct so that the corresponding $n$ observations on $Y$ are all obtained under different initial conditions and thus have different distributions. This situation is typical of validation problems encountered with simulation models in renewable natural resources. For example, the observations might be tree volume measurements from $n$ different forest plots or be animal population levels in $n$ different years or areas.

In order to compare the simulation distribution $G(y \mid \mathbf{x})$ with $F(y \mid \mathbf{x})$, simulated observations must be generated to compare with the real observations. Suppose that, given the value of $\mathbf{X}_i$, $m$ runs of the simulation model are made generating $m$ independent and identically distributed values, say $\mathbf{Z}_i = (Z_{i1}, Z_{i2}, \ldots, Z_{im})$. Conditional on $\mathbf{X}_i$, these simulated values are independent of $Y_i$. Under the null hypothesis, the distribution of $Z_{ij}$, $j = 1, 2, \ldots, m$, should be the same as the distribution of $Y_i$ given $\mathbf{X}_i$. Since each of the $n$ pairs of values $(Y_i, \mathbf{Z}_i)$, $i = 1, 2, \ldots, n$, was generated under different values of $\mathbf{X}$, the data should not be grouped into one large set for the purpose of applying a standard test designed for determining whether two samples come from the same distribution. Instead, the $n$ independent pairs must be kept separate for the purpose of analysis. The main objective is not, however, to test $F(y \mid \mathbf{x}_i) = G(y \mid \mathbf{x}_i)$ individually for each $i$, but to test for equality of the distributions over the whole range of $\mathbf{X}$. In addition, the individual pairs have only one real observation so that a test applied to an individual pair would not have much power by itself. The proposed procedure is to apply individual tests and then combine them into one overall test that has reasonable power for the general problem.

The problem of combining $n$ independent tests of the same hypothesis into one overall test has been studied extensively in the statistical literature. Fisher (1938) proposed a method of combining independent tests that is based on the product of the observed significance levels for the $n$ tests. Additional properties and modifications of Fisher's procedure have been studied by Birnbaum (1954), Good (1955), Zelen and Joel (1959), Littell and Folks (1971, 1973), and Pape (1972). Other tests for the combination problem have been based directly on the $n$ test statistics themselves rather than on the observed significance levels of the tests. Investigations of combinations of certain parametric statistics have been carried out by Van Zwet and Oosterhoff (1967), Oosterhoff (1969), Monti and Sen (1976), Koziol and Perlman (1978), and Berk and Cohen (1979). Van Elteren (1960) proposed a test statistic that is a linear combination of Wilcoxon two-sample tests that are applied to the individual tests. The properties of combinations of Wilcoxon tests have also been studied by Hodges and Lehmann (1962) and Noether (1963). Puri (1965) studied the asymptotic efficiencies of linear combinations of independent two-sample t-statistics compared to the linear combination of Wilcoxon two-sample statistics. The validation problem can be considered as a special case of the two treatment problem where one treatment (real) has only one observation in each pair.

The alternatives of interest in the present problem are more general than the location or scale alternative usually investigated in hypothesis testing problems. In addition to the location alternative where the distribution of $Z$ is the same as $Y$ except for a systematic shift or bias in one direction, other possibilities such as shifts in different directions at different values of $\mathbf{X}$, differences in variance, and differences in shape between $F$ and $G$ could reasonably be expected. For this reason, either a test that is sensitive to a very wide class of alternatives is needed, or another possibility is to use several tests, each sensitive to a particular alternative. The latter option usually presents problems in establishing the overall significance level for the combined tests because the tests are not independent.

PARAMETRIC TESTS

The simplest parametric test that might be applied to the problem is the paired t-test where the simulated values at each point are averaged to get a predicted value to compare with the real observation at that point. The test is then based on the differences between the real and predicted values. By using only the means of the simulated values, this test ignores the individual values that would be needed to detect certain types of model inadequacies. Freese (1960) has also pointed out a potential problem with the use of the t-test. If the variances of $Y$ and $Z$ are large, then bias in the model may not be detected if the sample size $n$ is too small to provide reasonable power for the test. On the other hand, the test may detect unimportant bias if the variances are small and $n$ is large. An additional problem is that the assumptions underlying the test may not be satisfied since, conditional on $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$, the $n$ real observations may not come from the same distribution, and thus the differences may not come from the same distribution. For example, the population of $\mathbf{X}$'s may be stratified with a fixed number of values taken from each stratum, in which case the $Y_i$'s will have different distributions. However, if it is assumed that the $n$ values in the sample were selected at random from a population then, unconditionally on $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$, the $n$ differences could be considered as having come from the same distribution. Thus, in many situations, the paired t-test may not be appropriate for the validation problem.

Among the standard statistical techniques, the analysis of variance appears to be a logical choice for application to the current situation. In the analysis of variance framework there are two treatments (real and simulation), and $n$ blocks corresponding to the $n$ points in $A$ where observations are obtained. The treatment sum of squares could be used to test for systematic shifts, and a combination of the treatment and the interaction sums of squares could be used to test for nonsystematic differences in $F$ and $G$. The estimate of the variance would have to be a pooled estimate based on the simulated observations in each block. Although this variance estimate is based entirely on observations generated by the model, the situation where the model variance is not the same as the variance of the real process will tend to produce extreme values of the test statistic and thus be detectable. The only situation that might not be detectable is the case where the simulation variance is inflated and at the same time there is bias in the model.

Even if normality is assumed, however, this analysis of variance test is not strictly valid since the null distribution of the statistic requires the same variance within each block and this is not likely to be the case in many applications.

Puri (1965) studied the asymptotic efficiencies of linear combinations of independent two-sample t-statistics compared to the linear combination of Wilcoxon two-sample statistics proposed by Wilcoxon (1946) and Van Elteren (1960). If $F(y \mid \mathbf{x}_i)$ and $G(y \mid \mathbf{x}_i)$ are normal distributions with the same mean and variance, then the statistic

$$t_i = \frac{Y_i - \bar{Z}_i}{S_i \sqrt{1 + \frac{1}{m}}}$$

has a t-distribution with $m - 1$ degrees of freedom, where $\bar{Z}_i = \sum_{j=1}^{m} Z_{ij}/m$ and $S_i^2 = \sum_{j=1}^{m} (Z_{ij} - \bar{Z}_i)^2/(m - 1)$. For the present case where there are $m$ simulation observations in each pair, the statistic used by Puri is

$$U_1 = \sum_{i=1}^{n} t_i.$$

Note that since there is only one real observation in each pair, the normality assumption for $Y_i$ is critical in justifying the t-distribution for $t_i$. The normality assumption is not as critical in the usual two-sample t-test based on the difference of sample means because the central limit theorem can be used to partially justify the distribution of the test statistic even when the distribution of the observations is not normal. If $m \ge 4$, the variance of $t_i$ is finite and then $U_1$ should be approximately normal for large $n$ by the central limit theorem. Now $E(t_i) = 0$ and

$$\mathrm{Var}(t_i) = \frac{m - 1}{m - 3}$$

under the null hypothesis so that the test statistic

$$U_1^* = U_1 \Big/ \sqrt{n\,\frac{m - 1}{m - 3}}$$

should have approximately a standard normal distribution. Thus, the test can be carried out by comparing the observed value of the statistic with the appropriate critical value from the standard normal distribution. For example, a two-sided test with significance level $\alpha$ would reject if $U_1^* \le c_{\alpha/2}$ or if $U_1^* \ge c_{1-\alpha/2}$, where $c_{\alpha/2}$ and $c_{1-\alpha/2}$ are the $\alpha/2$ and $1 - \alpha/2$ fractiles of the standard normal distribution, respectively.

In the test for treatment differences in the analysis of variance approach discussed previously there are only two treatments and thus the F-test with 1 degree of freedom in the numerator is equivalent to a t-test. The corresponding t-statistic is

$$U_2 = \frac{\bar{Y} - \bar{Z}}{S \sqrt{\frac{1}{n}\left(1 + \frac{1}{m}\right)}}$$

where $\bar{Y} = \sum_i Y_i/n$, $\bar{Z} = \sum_i \bar{Z}_i/n$, and $S^2 = \sum_i S_i^2/n$. The difference between the statistics $U_2$ and $U_1/\sqrt{n}$ is the use of the pooled estimate $S^2$ for the variance instead of the individual within-block estimates. $U_2$ is the product of $1/\sqrt{n}$ and the sum of $n$ dependent t random variables each with $n(m - 1)$ degrees of freedom, which can be shown to result in a t-distribution with $n(m - 1)$ degrees of freedom, while $U_1/\sqrt{n}$ is the product of $1/\sqrt{n}$ and the sum of $n$ independent t random variables each with $m - 1$ degrees of freedom, which is approximately normal but not exactly a t-distribution.
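The statistics $t_i$, $U_1^*$, and $U_2$ can be computed directly from the real observations and the simulated runs. The following is a minimal Python sketch rather than the authors' code: the function names and the toy data are hypothetical, and it assumes that every plot has the same number $m$ of simulated runs.

```python
import math
from statistics import mean, stdev

def t_statistics(y, z):
    """Return t_i = (Y_i - Zbar_i) / (S_i * sqrt(1 + 1/m)) for each plot i."""
    t = []
    for y_i, z_i in zip(y, z):
        m = len(z_i)
        t.append((y_i - mean(z_i)) / (stdev(z_i) * math.sqrt(1 + 1 / m)))
    return t

def u1_star(t, m):
    """Standardize U1 = sum(t_i) by its null variance n(m-1)/(m-3); needs m >= 4."""
    n = len(t)
    return sum(t) / math.sqrt(n * (m - 1) / (m - 3))

def u2(y, z):
    """Pooled-variance statistic U2; assumes every plot has the same m."""
    n, m = len(y), len(z[0])
    s2 = mean(stdev(z_i) ** 2 for z_i in z)            # pooled S^2 = sum(S_i^2) / n
    d_bar = mean(y_i - mean(z_i) for y_i, z_i in zip(y, z))
    return d_bar / math.sqrt(s2 * (1 + 1 / m) / n)

# Hypothetical toy data: n = 3 plots, m = 4 simulator runs per plot.
y = [11.0, 12.0, 9.0]
z = [[9.0, 11.0, 10.0, 10.0],
     [11.0, 13.0, 12.0, 12.0],
     [8.0, 10.0, 9.0, 9.0]]
t = t_statistics(y, z)
print(t, u1_star(t, m=4), u2(y, z))
```

With so few plots the normal approximation for $U_1^*$ would not be trusted; the toy data only illustrate the arithmetic.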

The test based on $U_1 = \sum_{i=1}^{n} t_i$ is sensitive primarily to a general shift in one direction. A test that is sensitive to shifts in different directions on different plots can be based on

$$U_3 = \sum_{i=1}^{n} t_i^2.$$

If the distributions $F$ and $G$ are normal, then $t_i^2$ has an F-distribution with 1 and $m - 1$ degrees of freedom. Thus

$$E(t_i^2) = \frac{m - 1}{m - 3}, \quad m \ge 4,$$

and

$$\mathrm{Var}(t_i^2) = \frac{2(m - 1)^2 (m - 2)}{(m - 3)^2 (m - 5)}, \quad m \ge 6.$$

The moments of the t- and F-distributions are given, for example, in Johnson and Kotz (1970). Using the normal approximation to the distribution of $U_3$, the test can be carried out by comparing

$$U_3^* = \left(U_3 - n\,\frac{m - 1}{m - 3}\right) \Big/ \sqrt{n\,\frac{2(m - 1)^2 (m - 2)}{(m - 3)^2 (m - 5)}}$$

with the appropriate standard normal critical value. Note that we may want to reject the null hypothesis for very small values of $U_3$ as these values tend to indicate an inflated simulation variance. The case where the simulation variance is too small would tend to produce large values of $U_3$. The normal approximation to the distribution of $U_3$ depends on having a reasonably large value of $n$ and also on the normality assumption for the distributions $F$ and $G$ used in establishing the t-distribution for $t_i$.
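As a hedged illustration of the standardization of $U_3$, the sketch below encodes the null moments of $t_i^2$ stated above (an F-distribution with 1 and $m - 1$ degrees of freedom). The function names are hypothetical, and the input is assumed to be a list of already computed $t_i$ values.

```python
import math

def t2_null_moments(m):
    """Null mean and variance of t_i^2 ~ F(1, m-1); the variance needs m >= 6."""
    if m < 6:
        raise ValueError("the variance of t_i^2 is finite only for m >= 6")
    mean = (m - 1) / (m - 3)
    var = 2 * (m - 1) ** 2 * (m - 2) / ((m - 3) ** 2 * (m - 5))
    return mean, var

def u3_star(t, m):
    """Normal-approximation statistic for U3 = sum(t_i^2) over n plots."""
    n = len(t)
    mean, var = t2_null_moments(m)
    u3 = sum(t_i ** 2 for t_i in t)
    return (u3 - n * mean) / math.sqrt(n * var)
```

Large positive values point toward bias or a simulation variance that is too small, and large negative values toward an inflated simulation variance, so a two-sided comparison with the normal critical values may be appropriate.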

If the pooled estimate $S^2 = \sum_i S_i^2/n$ is used in place of each $S_i^2$ in $U_3$, the statistic that corresponds to $U_3/n$ is

$$U_4 = \frac{1}{n} \sum_{i=1}^{n} \frac{(Y_i - \bar{Z}_i)^2}{S^2 \left(1 + \frac{1}{m}\right)}.$$

Under the null hypothesis and the normality assumption, the statistic $U_4$ has an F-distribution with $n$ and $n(m - 1)$ degrees of freedom. This F-test is equivalent to the analysis of variance test obtained by combining the treatment and interaction sums of squares and comparing this sum with the error sum of squares.

As an alternate to $U_3$, a test can be based on the statistic

$$U_5 = \sum_{i=1}^{n} |t_i|.$$

The mean and variance of $|t_i|$ can be obtained from the fact that $E(|t_i|^2) = E(t_i^2) = (m - 1)/(m - 3)$, $m \ge 4$, and

$$E(|t_i|) = \sqrt{\frac{m - 1}{\pi}}\; \frac{\Gamma\!\left(\frac{m - 2}{2}\right)}{\Gamma\!\left(\frac{m - 1}{2}\right)}.$$

A normal approximation to the distribution of $U_5$ can then be easily developed. A test based on $U_5$ should be sensitive to the same types of alternatives as the test based on $U_3$.

Fisher's procedure for combining independent tests can be applied using parametric statistics such as the set $t_1, t_2, \ldots, t_n$. The only requirement is that the set of $n$ statistics be independent and the null distribution of the statistics be continuous. If $\alpha_i$ is the observed significance level for the $i$th individual test, then

$$U_6 = -2 \ln \prod_{i=1}^{n} \alpha_i$$

has a chi-squared distribution with $2n$ degrees of freedom under the null hypothesis. The test rejects the null hypothesis for large values of $U_6$. This approach has the same disadvantage as the other parametric tests based on the $t_i$'s in that it requires normality to assure that the null distribution of $t_i$ is the assumed t-distribution.
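Fisher's combination is simple to compute once the individual significance levels are in hand. The sketch below is illustrative only: the function names are hypothetical, the significance levels are assumed to have been obtained elsewhere (for example from the t-distribution of each $t_i$), and the tail probability uses the closed form of the chi-squared survival function that holds for even degrees of freedom.

```python
import math

def fisher_u6(p_values):
    """U6 = -2 * sum(ln alpha_i) for independent observed significance levels."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_sf_even_df(x, n):
    """P(chi-squared with 2n df > x) = exp(-x/2) * sum_{k<n} (x/2)^k / k!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k          # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Hypothetical significance levels from n = 4 independent tests.
alphas = [0.20, 0.04, 0.11, 0.35]
u6 = fisher_u6(alphas)
print(u6, chi2_sf_even_df(u6, len(alphas)))
```

Because $2n$ is always even here, no numerical integration or external library is needed for the overall significance level.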

NONPARAMETRIC TESTS

The parametric tests discussed so far all require that the distributions $F$ and $G$ be normal. The analysis of variance tests also require homogeneous variances. In many practical situations these requirements are not satisfied, and thus the use of nonparametric tests would be appropriate.

A nonparametric test based on ranks can be developed as follows. Let $R_i$ be the rank of $Y_i$ in the set $Y_i, Z_{i1}, Z_{i2}, \ldots, Z_{im}$. Then, under the null hypothesis, $Y_i$ has the same distribution as $Z_{i1}, Z_{i2}, \ldots, Z_{im}$, and $R_i$ has a uniform distribution on the values $1, 2, \ldots, m + 1$. Van Elteren (1960) proposed using a linear combination of two-sample Wilcoxon tests for the problem of combining independent tests for location. In the case where there are one real and $m$ simulated observations for each pair, Van Elteren's statistic is equivalent to the statistic

$$V_1 = \sum_{i=1}^{n} R_i.$$

If $n$ and $m$ are small, the exact distribution of $V_1$ under the null hypothesis can be computed with some effort. Under the null hypothesis, $E(R_i) = (m + 2)/2$ and $\mathrm{Var}(R_i) = m(m + 2)/12$, and for large $n$ the statistic

$$V_1^* = \left(V_1 - n\,\frac{m + 2}{2}\right) \Big/ \sqrt{\frac{n\,m(m + 2)}{12}}$$

will have approximately a standard normal distribution. This test should be very simple to apply in practice and does not require any assumptions about the distributions $F$ and $G$ other than the requirement that they be continuous. The test based on $V_1$, like the test based on $U_1$, is sensitive primarily to shift alternatives.

An additional nonparametric test that is sensitive to more general alternatives can be based on the statistic

$$V_2 = \sum_{i=1}^{n} \left| R_i - \frac{m + 2}{2} \right|.$$

A test that rejects $H_0$ for large values of $V_2$ should be sensitive to general shift alternatives and to situations where the simulation distribution is shifted in different directions for different values of $\mathbf{X}$. This test is also sensitive to differences in variance between $F$ and $G$. If there is the possibility that the simulation variance is inflated, then the null hypothesis should be rejected for small as well as large values of $V_2$. Thus a test based on $V_2$ could serve as a single general purpose test for the validation problem. The normal approximation to the distribution of $V_2$ can be used when $n$ and $m$ are too large to compute the exact null distribution. When $H_0$ is true,

$$E\left| R_i - \frac{m + 2}{2} \right| = \begin{cases} \dfrac{m(m + 2)}{4(m + 1)} & m \text{ even} \\[2ex] \dfrac{m + 1}{4} & m \text{ odd} \end{cases}$$

and

$$\mathrm{Var}\left( \left| R_i - \frac{m + 2}{2} \right| \right) = \begin{cases} \dfrac{m(m + 2)}{48} + \dfrac{m(m + 2)}{16(m + 1)^2} & m \text{ even} \\[2ex] \dfrac{(m - 1)(m + 3)}{48} & m \text{ odd.} \end{cases}$$

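The test could then be carried out by comparing the statistic

$$V_2^* = \left( V_2 - n\,E\left| R_i - \frac{m + 2}{2} \right| \right) \Big/ \sqrt{n\,\mathrm{Var}\left( \left| R_i - \frac{m + 2}{2} \right| \right)}$$

with the appropriate critical value from the standard normal distribution.

As an alternative to the test based on the statistic $V_2$, the statistic

$$V_3 = \sum_{i=1}^{n} \left( R_i - \frac{m + 2}{2} \right)^2$$

could be used as a basis for a test that is sensitive to the same type of alternatives as $V_2$. Under the null hypothesis

$$E(V_3) = n\,\frac{m(m + 2)}{12}$$

and

$$\mathrm{Var}(V_3) = n\,\frac{(m - 1)\,m\,(m + 2)(m + 3)}{180}.$$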
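The rank statistics need only the ranks $R_i$ and the null moments given above. The following Python sketch is one possible arrangement, with hypothetical function names; it assumes continuous data (no ties), the same $m$ on every plot, and standardizes $V_3$ by its null mean and variance in the same way as $V_1$ and $V_2$.

```python
import math

def rank_of_real(y_i, z_i):
    """R_i: rank of Y_i among {Y_i, Z_i1, ..., Z_im}, assuming no ties."""
    return 1 + sum(1 for z in z_i if z < y_i)

def abs_dev_null_moments(m):
    """Null mean and variance of |R_i - (m+2)/2| for R_i uniform on 1..m+1."""
    if m % 2 == 0:
        mean = m * (m + 2) / (4 * (m + 1))
        var = m * (m + 2) / 48 + m * (m + 2) / (16 * (m + 1) ** 2)
    else:
        mean = (m + 1) / 4
        var = (m - 1) * (m + 3) / 48
    return mean, var

def standardized_rank_statistics(ranks, m):
    """Return (V1*, V2*, V3*); requires m >= 2 so the null variances are positive."""
    if m < 2:
        raise ValueError("m >= 2 is required")
    n = len(ranks)
    c = (m + 2) / 2                                   # null mean of R_i
    v1 = sum(ranks)
    v2 = sum(abs(r - c) for r in ranks)
    v3 = sum((r - c) ** 2 for r in ranks)
    e_abs, var_abs = abs_dev_null_moments(m)
    v1_star = (v1 - n * c) / math.sqrt(n * m * (m + 2) / 12)
    v2_star = (v2 - n * e_abs) / math.sqrt(n * var_abs)
    v3_star = (v3 - n * m * (m + 2) / 12) / math.sqrt(
        n * (m - 1) * m * (m + 2) * (m + 3) / 180)
    return v1_star, v2_star, v3_star
```

In use, the ranks would be built plot by plot, for example `ranks = [rank_of_real(y_i, z_i) for y_i, z_i in zip(y, z)]`, where `y` and `z` are hypothetical containers for the real values and the simulated runs.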

For large $n$ the statistic

$$V_3^* = \left( V_3 - n\,\frac{m(m + 2)}{12} \right) \Big/ \sqrt{n\,\frac{(m - 1)\,m\,(m + 2)(m + 3)}{180}}$$

will be approximately normal.

The practical application of the procedures discussed here requires some guidance on the choice of the values for $n$ and $m$. In many situations $n$ will of necessity be rather small due to a limited amount of available real data. The question then is the choice of the value of $m$. One approach used by Noether (1963) in the context of a randomized block model is to examine the variance of the estimator of the parameter of interest. The expected value of $R_i$ is $m p_i + 1$ where $p_i = P(Z_{ij} < Y_i)$. If $p_i = p$, $i = 1, 2, \ldots, n$, as in a shift alternative model, then the statistic

$$\frac{1}{mn} \sum_{i=1}^{n} (R_i - 1)$$

is an unbiased estimator of $p$. The variance of this statistic is

$$\mathrm{Var}\left( \frac{1}{mn} \sum_{i=1}^{n} R_i \right) = \frac{1}{mn}\, p(1 - p) + \frac{m - 1}{mn}\, (p' - p^2)$$

where $p' = P(Z_{ij} < Y_i \text{ and } Z_{ij} < Y_i')$ and where $Y_i$ and $Y_i'$ are independent with distribution $F(y \mid \mathbf{x}_i)$. If the alternative is close to the null hypothesis we can use the null variance, obtained with $p = 1/2$ and $p' = 1/3$,

$$\frac{1}{mn} \cdot \frac{1}{4} + \frac{m - 1}{mn} \left( \frac{1}{3} - \frac{1}{4} \right)$$

which reduces to

$$\mathrm{Var}\left( \frac{1}{mn} \sum_{i=1}^{n} R_i \right) = \frac{m + 2}{12mn}.$$

For fixed $n$ this is a decreasing function of $m$ which depends on the factor $(m + 2)/m$, so that the marginal reduction in variance from an increase in $m$ is quite small even for moderate size $m$. For example, increasing $m$ from 10 to $\infty$ reduces the variance by only 17 percent. This suggests that past a certain point there is little to be gained from increasing $m$ since the limiting factor is the fixed value of $n$.
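The 17 percent figure follows directly from the null variance $(m + 2)/(12mn)$, whose limit as $m$ grows is $1/(12n)$. A small Python check is sketched below; the function names are hypothetical and the value of $n$ is arbitrary, since it cancels in the ratio.

```python
def null_variance(m, n):
    """Null variance (m + 2) / (12 m n) of the estimator of p."""
    return (m + 2) / (12 * m * n)

def limiting_variance(n):
    """Limit of the null variance as m grows without bound."""
    return 1 / (12 * n)

n = 50  # arbitrary illustrative number of real observations
reduction = 1 - limiting_variance(n) / null_variance(10, n)
print(f"{reduction:.1%}")  # prints 16.7%
```

The same ratio, $2/(m + 2)$, gives the remaining possible gain at any other $m$; at $m = 20$ it is about 9 percent.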

Another case of interest occurs when the limiting factor is the cost of executing the simulation model and the number of potential real observations is large. If $mn$, the number of simulated observations, is fixed at some value, $N$, then the variance of $\frac{1}{mn} \sum_i R_i$ is minimized when $m = 1$ and $n = N$. This gives an equal number of real and simulated values.

Hodges and Lehmann (1962) and Puri (1965) investigated the asymptotic efficiency of the test based on $V_1$ relative to the test based on $U_2$ when the distributions are normal. The efficiency is

$$\frac{3}{\pi} \cdot \frac{m + 1}{m + 2}$$

for the case where there is a uniform shift at all values of $\mathbf{x}$. The efficiency is lowest ($2/\pi$) when $m = 1$ since the test based on $V_1$ is the sign test and the efficiency of the sign test relative to the t-test is $2/\pi$. For moderate values of $m$ the efficiency is close to $3/\pi$, the efficiency of the Wilcoxon rank-sum test relative to the t-test in the standard situation.
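To see how quickly the efficiency approaches its limit, the expression $(3/\pi)(m + 1)/(m + 2)$ can be tabulated for a few values of $m$. This is a small illustrative sketch; the function name is hypothetical.

```python
import math

def efficiency_v1_vs_u2(m):
    """Asymptotic efficiency of the V1 test relative to the U2 test under a uniform shift."""
    return (3 / math.pi) * (m + 1) / (m + 2)

for m in (1, 2, 5, 10, 50):
    print(m, round(efficiency_v1_vs_u2(m), 3))

# The two reference values named in the text.
print(round(2 / math.pi, 3), round(3 / math.pi, 3))
```

At $m = 1$ the expression equals $2/\pi$ exactly, and it increases with $m$ toward $3/\pi$ without reaching it.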

EXAMPLE

The application of the various tests discussed in previous sections can be illustrated using the stand simulator developed by Daniels and Burkhart (1975). This simulator is an individual-tree model. Trees are assigned initial coordinate locations and sizes at an age corresponding to the onset of competition, and then annual diameter and height growth is simulated as a function of size, site, age, and an index reflecting competition from neighbors. Tree growth is adjusted by a random component representing genetic and/or microsite variability, and survival probability is controlled by tree size and competition. Individual tree volumes are calculated by substituting diameter and height values into tree volume equations. Unit area yield estimates are then obtained by summing the individual tree volumes and multiplying by the appropriate expansion factor.

The validity of Daniels and Burkhart's stand simulator, along with several other types of models, has been investigated previously by Daniels and others (1979). Their conclusion, based on visual plots and regression analysis of differences between real and simulated values, was that the stand simulator was comparable to the other models in its ability to predict plot means and thus was acceptable for many management purposes.

The question of whether the simulator is adequate for a particular purpose will of course depend on the purpose. In many cases the objective may be to predict mean volumes averaged over a wide range of stand conditions. For this objective, many other models, such as regression models, have been developed and investigated. But in other situations, the objective may be, for example, to determine what proportion of stands of a particular type will have a merchantable volume below some economically determined lower threshold. A regression model for predicting mean volumes cannot aid in answering this type of question, but a simulation model such as that of Daniels and Burkhart has the flexibility to provide this type of information because it models the distribution of volume and not just mean volume. Since stand simulators are capable of modeling many different aspects of the system, the validation process has to be more demanding and comprehensive than the process used for simpler models.

A set of 63 loblolly pine plantation observations obtained from the Westvaco Corporation was used to assess the performance of Daniels and Burkhart's stand simulator. These data are independent of the data used in construction of the model. The independent data consisted of 0.05- to 0.10-acre (0.02- to 0.04-ha) temporary plots located in coastal plain and piedmont Virginia. In each plot, diameter at breast height (dbh) was recorded by 1-inch (2.54-cm) classes and age, stand density (trees per unit area), and average dominant height (average height of dominant and codominant trees) were determined. Plot volume to a 4-inch (10.2-cm) top diameter (outside bark) was computed for all trees in the 5-inch (12.7-cm) dbh class and above.
Table 1 gives the values of the variables measured on each plot along with the statistics generated from $m = 10$ independent runs of the simulator on each of the 63 plots.

Conclusions about a model that are based solely on the differences between real and predicted values may give some indication of predictive ability averaged over all plots, but can be misleading if the objective is to detect more subtle discrepancies between the model and the real system. For example, if the 63 differences $(Y_i - \bar{Z}_i)$ are obtained using $Y_i$ and $\bar{Z}_i$ from Table 1, it is found that the average difference is $\sum_i (Y_i - \bar{Z}_i)/n = 8.76$ m³/ha with a standard deviation for the differences of 55.61 m³/ha. This means that the simulator is, on the average, 8.76 m³/ha lower than the observed volumes for these 63 plots. This average underprediction appears to be relatively small compared to the magnitude of the volumes. If the paired t-test is used for these values, the value of the t-statistic is 1.25 with 62 degrees of freedom. This value is not quite significant at the 0.20 level for a two-sided test. As was discussed previously, this paired t-test is not really appropriate for this problem but was used here for purposes of illustration.

Now consider some of the other tests that were proposed in the section on parametric tests. The column of simulated standard deviations in Table 1 indicates that the variance of the simulation distribution is not constant from plot to plot. Thus the analysis of variance tests based on the statistics $U_2$ and $U_4$, which use a pooled variance estimate, are not appropriate. The values of the statistics based on the $t_i$'s are

$U_1 = -68.67$, $U_3 = 1448.70$

$U_1^* = -7.63$, $U_3^* = 105.95$.
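The paired t-statistic and the standardized value $U_1^*$ quoted above can be reproduced from the reported summary figures alone. The short Python check below uses only numbers stated in the text ($n = 63$, $m = 10$, mean difference 8.76, standard deviation 55.61, $U_1 = -68.67$); the variable names are hypothetical.

```python
import math

n, m = 63, 10                      # plots and simulator runs per plot
mean_diff, sd_diff = 8.76, 55.61   # m^3/ha, as reported in the text

# Paired t-statistic: mean difference over its standard error.
paired_t = mean_diff / (sd_diff / math.sqrt(n))

# U1* = U1 / sqrt(n (m - 1) / (m - 3)), with U1 as reported in the text.
u1 = -68.67
u1_star = u1 / math.sqrt(n * (m - 1) / (m - 3))

print(round(paired_t, 2), round(u1_star, 2))  # prints 1.25 -7.63
```

For these values the null standard deviation of $U_1$ is exactly 9, since $63 \cdot 9/7 = 81$.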

The value of $U_1^*$ is highly significant when compared with a standard normal critical value. This tends to indicate that there is a systematic difference between the means of the two distributions. Note that the sign of $U_1$ is different from the sign of $\sum_i (Y_i - \bar{Z}_i)/n$. The reason for this anomaly will become apparent later in this example. The value of $U_3^*$ is even more extreme than the value of $U_1^*$. This indicates that there may be an additional problem with the simulator other than just a simple shift in the mean since $U_1$ should be more sensitive to simple shift alternatives than is $U_3$.

The observed values of the nonparametric statistics are

$V_1 = 335$, $V_2 = 283$, $V_3 = 1343$

$V_1^* = -1.71$, $V_2^* = 8.75$, $V_3^* = 10.17$.
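The standardized values can be recomputed from the reported $V_1$, $V_2$, and $V_3$ using the null moments of the rank statistics for even $m$. The following Python check is a sketch with hypothetical variable names; it uses only the values stated in the text ($n = 63$, $m = 10$).

```python
import math

n, m = 63, 10
v1, v2, v3 = 335, 283, 1343        # reported values of the rank statistics

c = (m + 2) / 2                    # null mean of R_i
v1_star = (v1 - n * c) / math.sqrt(n * m * (m + 2) / 12)

# Null mean and variance of |R_i - (m+2)/2| for even m.
e_abs = m * (m + 2) / (4 * (m + 1))
var_abs = m * (m + 2) / 48 + m * (m + 2) / (16 * (m + 1) ** 2)
v2_star = (v2 - n * e_abs) / math.sqrt(n * var_abs)

v3_star = (v3 - n * m * (m + 2) / 12) / math.sqrt(
    n * (m - 1) * m * (m + 2) * (m + 3) / 180)

print(round(v1_star, 2), round(v2_star, 2), round(v3_star, 2))
# prints -1.71 8.75 10.17
```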

The value of $V_1^*$ is not significant at the 0.05 level, but the values of $V_2^*$ and $V_3^*$ are highly significant. Note that the nonparametric statistics do not appear to be as extreme as the comparable parametric statistics ($V_1^*$ compared to $U_1^*$, and $V_2^*$ and $V_3^*$ compared to $U_3^*$). This can be explained by the fact that the most extreme value that $R_i$ can assume is 1 or 11, while $t_i$ can be any real number. Thus extreme discrepancies between the real and simulated values on only a few plots can lead to very large values of $U_1^*$ or $U_3^*$, while the nonparametric statistics may not be greatly affected. For this data set there do appear to be very large differences between the real and simulated values on certain plots. For models and data sets with smaller differences, the parametric and nonparametric statistics should be in closer agreement.

The extremely large observed values of the test statistics $U_3^*$, $V_2^*$, and $V_3^*$ could be caused by a number of factors. Two possible factors are: (1) the presence of bias or shift in the model, i.e., $E(Y_i) \neq E(Z_{ij})$ for some values of $i$; and (2) a simulation variance that is too small. $\mathrm{Var}(Z_{ij})$, the variance of the simulated observations taken at $\underline{X}_i$, can be estimated by $S_i^2$, but it is not possible to get a direct estimate of $\mathrm{Var}(Y_i)$ unless multiple real observations are available at $\underline{X}_i$. Using the squared difference $(Y_i - \bar{Z}_i)^2$ between the real observation and the simulated mean at $\underline{X}_i$, an estimate of a combination of variance and bias can be obtained. Since $Y_i$ and $\bar{Z}_i$ are independent, given the value of $\underline{X}_i$, it can be shown that

$$E\big((Y_i - \bar{Z}_i)^2\big) = \mathrm{Var}(Y_i) + \big(E(Y_i) - E(Z_{ij})\big)^2 + \mathrm{Var}(Z_{ij})/m.$$

Since $E(S_i^2/m) = \mathrm{Var}(Z_{ij})/m$, it follows that $(Y_i - \bar{Z}_i)^2 - S_i^2/m$ is an unbiased estimator of $\mathrm{Var}(Y_i) + (E(Y_i) - E(Z_{ij}))^2$. Now the values of $\mathrm{Var}(Y_i)$ and $(E(Y_i) - E(Z_{ij}))^2$ are not likely to be constant from plot to plot, but a general idea of the magnitudes of the terms involved can be obtained by averaging the individual estimators over all $n$ plots to obtain the estimator

$$S^{*2} = \sum_{i=1}^{n} \big((Y_i - \bar{Z}_i)^2 - S_i^2/m\big)/n.$$

$S^{*2}$ is an estimator of $\sum_{i=1}^{n} \mathrm{Var}(Y_i)/n$ if $E(Y_i) - E(Z_{ij}) = 0$ for all $i$, and the pooled estimator of the simulation variance, $S^2 = \sum_{i=1}^{n} S_i^2/n$, is an estimator of $\sum_{i=1}^{n} \mathrm{Var}(Z_{ij})/n$. The value of $S^{*2}$ is 3103.71 ($S^* = 55.71$ m³/ha) and the value of
$S^2$ is 159.71 ($S = 12.64$ m³/ha). The fact that the value of $S^{*2}$ is so much larger than the value of $S^2$ could be caused by bias in the model ($S^{*2}$ is estimating the variance of the real observations plus a positive bias term) or by the fact that the simulation observations have smaller variance than the real observations ($S^2$ is estimating something smaller than the variance of the real observations). The evidence thus seems to suggest that the simulator is biased, or has too small a variance, or, more likely, suffers from a combination of these two problems.

Once it was determined that there were previously undetected discrepancies between the simulator and the real system, an attempt was made to determine whether the simulator would work well for some restricted range of $\underline{X}$ values. Accordingly, the 63 plots were subdivided into age groups and the various parametric and nonparametric statistics were computed for each age group. These age groups correspond to the subdivisions in Table 1. The values of the statistics are given in Table 2. From the signs of $U_1^*$ and $V_1^*$ it appears that the model overpredicts for the younger stands and underpredicts for the older stands. This explains why the average of the 63 differences $(Y_i - \bar{Z}_i)$ is positive while the sum of the 63 values of $t_i$ is negative. The serious underprediction for older stands more than balances the overprediction for younger stands, but the larger variances on the older stands prevent the values of $t_i$ from becoming excessively large.

TABLE 1. Plot data and corresponding statistics for the stand simulator.

Age    Height  Site    Trees/ha  BA/ha  Y_i(a)   Zbar_i(b)  S_i(c)   t_i(d)   R_i(e)
(yrs)  (m)     index             (m²)   (m³/ha)  (m³/ha)    (m³/ha)
               (m)

11     8.1     16.8    1,581     26.2   62.1     44.2       4.53     3.77     11
12     10.0    18.3    1,013     17.0   49.2     75.2       2.85     -8.69    1
12     8.9     16.8    1,680     22.7   55.8     59.3       2.81     -1.19    3
12     8.9     16.8    2,076     22.7   41.1     56.5       5.63     -2.61    1
12     10.0    18.3    2,372     34.2   91.3     72.6       5.80     3.08     11
13     9.7     16.8    1,903     25.3   61.9     70.9       2.61     -3.28    1
13     10.8    18.3    2,817     30.5   63.5     82.1       7.29     -2.43    1
13     9.7     16.8    2,570     29.6   61.9     70.2       5.60     -1.41    2
13     10.8    18.3    2,372     26.6   62.3     92.0       3.68     -7.72    1
13     8.6     15.2    2,273     21.6   33.0     54.9       5.12     -4.07    1
13     12.0    19.8    2,323     28.9   78.0     113.7      5.96     -5.72    1
13     10.8    18.3    2,125     29.6   88.4     91.3       5.11     -0.55    4
13     10.8    18.3    2,224     23.4   50.9     92.5       4.53     -8.74    1
13     10.8    18.3    1,483     21.1   60.4     92.9       3.64     -8.50    1
14     11.6    18.3    1,829     33.1   118.1    113.6      4.06     1.06     9
14     11.6    18.3    2,520     31.7   87.7     101.7      6.60     -2.03    1

15     11.2    16.8    2,273     26.6   70.3     103.0      6.34     -4.92    1
15     11.2    16.8    1,680     30.1   103.8    108.0      3.47     -1.13    2
15     9.9     15.2    1,285     25.7   84.7     83.2       4.24     0.35     7
15     9.9     15.2    1,977     23.6   55.1     78.7       5.05     -4.46    1
15     9.9     15.2    1,977     20.9   40.2     79.6       3.83     -9.80    1
15     9.9     15.2    1,606     20.7   50.7     82.9       3.95     -7.78    1
15     11.2    16.8    2,817     37.2   106.4    90.3       6.44     2.39     11
15     11.2    16.8    2,471     37.4   122.6    97.2       7.95     3.04     11
15     12.4    18.3    2,471     38.3   139.9    120.5      5.05     3.67     11
15     11.2    16.8    1,754     26.9   85.6     103.9      3.12     -5.61    1
15     11.2    16.8    1,779     32.8   115.4    104.4      4.67     2.25     11
15     8.8     13.7    1,927     20.2   32.3     61.6       3.86     -7.24    1
16     11.9    16.8    1,532     27.5   100.8    123.4      3.13     -6.86    1
16     11.9    16.8    2,446     31.2   90.1     114.3      8.96     -2.58    1
17     12.5    16.8    2,224     32.1   110.7    130.3      5.11     -3.66    1
18     14.5    18.3    1,631     34.7   165.6    190.6      7.78     -3.06    1
18     14.5    18.3    1,112     30.3   154.8    200.9      3.15     -13.96   1
18     14.5    18.3    2,051     39.5   182.6    179.4      4.69     0.67     7

19     16.5    19.8    2,002     47.1   265.2    225.2      7.91     4.82     11
19     13.7    16.8    2,100     38.8   169.1    158.5      5.37     1.88     11
19     15.1    18.3    2,100     44.1   217.3    194.1      5.15     4.30     11
19     15.1    18.3    2,249     37.9   168.9    184.3      9.02     -1.63    1
19     16.5    19.8    2,669     49.6   254.4    198.5      9.78     5.44     11
19     13.7    16.8    1,977     36.0   154.1    168.2      8.57     -1.57    1
21     16.3    18.3    2,076     38.3   196.2    215.6      9.20     -2.01    1
21     16.3    18.3    2,545     39.9   187.1    195.6      12.56    -0.64    3
21     14.8    16.8    2,224     38.8   179.3    184.7      7.50     -0.68    3
21     17.7    19.8    1,779     44.3   269.5    266.0      12.03    0.28     8
21     17.7    19.8    1,853     45.5   275.1    267.3      14.07    0.53     8
21     16.3    18.3    2,002     39.3   204.2    225.4      7.79     -2.60    1
24     17.8    18.3    1,174     32.4   202.2    315.7      9.50     -11.39   1

27     17.6    16.8    2,372     53.0   318.2    213.7      20.10    4.96     11
27     17.6    16.8    2,323     41.6   227.3    223.0      13.30    0.31     8
27     19.2    18.3    1,829     48.2   323.5    281.1      20.65    1.96     11
27     22.3    21.3    1,779     60.6   492.3    316.2      26.26    6.39     11
27     20.7    19.8    1,532     64.0   511.8    312.9      27.24    6.97     11
28     18.0    16.8    2,026     50.5   320.3    235.9      20.58    3.91     11
29     20.0    18.3    1,769     50.5   362.7    289.3      17.15    4.08     11
29     16.7    15.2    1,421     42.2   262.3    275.6      10.02    -1.27    2
29     16.7    15.2    1,997     48.9   291.4    222.4      16.61    3.96     11
29     21.5    19.8    1,927     56.5   434.8    284.0      29.93    4.80     11
29     20.0    18.3    1,174     44.1   333.1    361.7      15.76    -1.73    1
30     17.1    15.2    1,532     49.8   322.9    255.5      15.77    4.07     11
30     18.7    16.8    1,169     49.1   359.0    334.6      16.46    1.41     10
31     22.3    19.8    1,149     49.8   423.8    362.6      16.48    3.54     11
32     19.4    16.8    988       39.0   290.7    373.5      20.47    -3.86    1
33     23.0    19.8    1,730     51.0   416.3    282.2      45.98    2.78     11

a $Y_i$ is the observed volume to a 4-inch top diameter on plot $i$.
b $\bar{Z}_i$ is the average of the 10 simulated volumes for plot $i$.
c $S_i$ is the standard deviation of the 10 simulated volumes for plot $i$.
d $t_i$ is the t-statistic $(Y_i - \bar{Z}_i)/(S_i\sqrt{1 + 1/m})$ for plot $i$.
e $R_i$ is the rank of $Y_i$ among $Y_i$ and the 10 simulated volumes for plot $i$.
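The summary quantities quoted in this example (the mean difference, the paired t-statistic, $S^{*2}$ and $S^2$) can be recomputed from the $Y_i$, $\bar{Z}_i$ and $S_i$ columns of Table 1. The following is a minimal sketch under the formulas given in the text; the variable names are illustrative, and because the tabulated values are rounded, the results agree with the reported figures only approximately.

```python
import math

M = 10  # simulation runs per plot

# Observed volumes Y_i, simulated means Zbar_i and simulated SDs S_i (Table 1).
Y = [62.1, 49.2, 55.8, 41.1, 91.3, 61.9, 63.5, 61.9, 62.3, 33.0, 78.0, 88.4,
     50.9, 60.4, 118.1, 87.7, 70.3, 103.8, 84.7, 55.1, 40.2, 50.7, 106.4,
     122.6, 139.9, 85.6, 115.4, 32.3, 100.8, 90.1, 110.7, 165.6, 154.8, 182.6,
     265.2, 169.1, 217.3, 168.9, 254.4, 154.1, 196.2, 187.1, 179.3, 269.5,
     275.1, 204.2, 202.2, 318.2, 227.3, 323.5, 492.3, 511.8, 320.3, 362.7,
     262.3, 291.4, 434.8, 333.1, 322.9, 359.0, 423.8, 290.7, 416.3]
Z = [44.2, 75.2, 59.3, 56.5, 72.6, 70.9, 82.1, 70.2, 92.0, 54.9, 113.7, 91.3,
     92.5, 92.9, 113.6, 101.7, 103.0, 108.0, 83.2, 78.7, 79.6, 82.9, 90.3,
     97.2, 120.5, 103.9, 104.4, 61.6, 123.4, 114.3, 130.3, 190.6, 200.9, 179.4,
     225.2, 158.5, 194.1, 184.3, 198.5, 168.2, 215.6, 195.6, 184.7, 266.0,
     267.3, 225.4, 315.7, 213.7, 223.0, 281.1, 316.2, 312.9, 235.9, 289.3,
     275.6, 222.4, 284.0, 361.7, 255.5, 334.6, 362.6, 373.5, 282.2]
# One S_i entry is illegible in the scanned table; 3.86 is assumed here because
# it reproduces the printed t_i = -7.24 for that plot.
S = [4.53, 2.85, 2.81, 5.63, 5.80, 2.61, 7.29, 5.60, 3.68, 5.12, 5.96, 5.11,
     4.53, 3.64, 4.06, 6.60, 6.34, 3.47, 4.24, 5.05, 3.83, 3.95, 6.44, 7.95,
     5.05, 3.12, 4.67, 3.86, 3.13, 8.96, 5.11, 7.78, 3.15, 4.69, 7.91, 5.37,
     5.15, 9.02, 9.78, 8.57, 9.20, 12.56, 7.50, 12.03, 14.07, 7.79, 9.50,
     20.10, 13.30, 20.65, 26.26, 27.24, 20.58, 17.15, 10.02, 16.61, 29.93,
     15.76, 15.77, 16.46, 16.48, 20.47, 45.98]

n = len(Y)
d = [y - z for y, z in zip(Y, Z)]  # real minus simulated mean, per plot

# Paired t-test on the differences (shown in the text for illustration only).
mean_d = sum(d) / n
sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))
paired_t = mean_d / (sd_d / math.sqrt(n))

# S*^2: average of (Y_i - Zbar_i)^2 - S_i^2 / m; S^2: pooled simulation variance.
s_star_sq = sum(x * x - s * s / M for x, s in zip(d, S)) / n
s_sq = sum(s * s for s in S) / n

print(f"mean difference = {mean_d:.2f} m3/ha, paired t = {paired_t:.2f}")
print(f"S*^2 = {s_star_sq:.2f}, S^2 = {s_sq:.2f}, S*/S = {math.sqrt(s_star_sq / s_sq):.2f}")
```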

The values of the ratio $S^*/S$ given in Table 2 are almost constant over the four age groups. This indicates that the simulation variance and/or bias is a problem over all age groups. Thus it appears that the model does not adequately mimic the distribution of volume on certain individual stands, but it does do a reasonable job of predicting mean volume averaged over a wide range of stands.

DISCUSSION AND CONCLUSIONS

In the example in the previous section, the values of all of the statistics except those in the analysis of variance tests were computed for purposes of illustration. In this example it did not really matter which test was used, since it was obvious from all of these statistics (except $V_1^*$) that the null hypothesis would be rejected. But in other applications it may not be desirable to calculate all of the statistics that have been discussed. The question is which test or combination of tests is best to use in practice.

TABLE 2. Values of standardized statistics for plots grouped according to age.

Age (yrs)  U1*(a)   U3*(b)   V1*(c)   V2*(d)   V3*(e)   S*/S(f)
11-14      -10.81   56.63    -3.64    4.43     5.04     4.45
15-18      -12.20   82.31    -2.75    4.69     5.68     4.83
19-25      -0.80    34.69    -0.61    3.39     3.80     4.11
26-33      9.32     32.81    3.72     4.90     5.69     4.43

a $U_1^*$ is a standardized parametric statistic based on the sum of individual plot t-statistics.
b $U_3^*$ is a standardized parametric statistic based on the sum of squared individual plot t-statistics.
c $V_1^*$ is a standardized nonparametric statistic based on the sum of ranks.
d $V_2^*$ is a standardized nonparametric statistic based on the sum of absolute deviations for the ranks.
e $V_3^*$ is a standardized nonparametric statistic based on the sum of squared deviations for the ranks.
f $S^*$ is the square root of an estimate of a combination of real variance and model bias, and $S$ is a pooled estimate of simulation variance.
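The $U_1^*$ and $V_1^*$ columns of Table 2 can be approximately recomputed by splitting the Table 1 plots into the four age groups. The sketch below assumes the same null moments as in the overall analysis (a t-distribution with $m - 1$ degrees of freedom for $t_i$, and ranks uniform on $1, \ldots, m+1$); the function name `group_statistics` is illustrative, and small differences in the last digit relative to the printed table are to be expected from rounding.

```python
import math

M = 10  # simulation runs per plot

# Plot ages, t_i and R_i transcribed from Table 1, in table order.
AGE = ([11] + [12] * 4 + [13] * 9 + [14] * 2 + [15] * 12 + [16] * 2 + [17]
       + [18] * 3 + [19] * 6 + [21] * 6 + [24] + [27] * 5 + [28] + [29] * 5
       + [30] * 2 + [31, 32, 33])
T = [3.77, -8.69, -1.19, -2.61, 3.08, -3.28, -2.43, -1.41, -7.72, -4.07,
     -5.72, -0.55, -8.74, -8.50, 1.06, -2.03, -4.92, -1.13, 0.35, -4.46,
     -9.80, -7.78, 2.39, 3.04, 3.67, -5.61, 2.25, -7.24, -6.86, -2.58,
     -3.66, -3.06, -13.96, 0.67, 4.82, 1.88, 4.30, -1.63, 5.44, -1.57,
     -2.01, -0.64, -0.68, 0.28, 0.53, -2.60, -11.39, 4.96, 0.31, 1.96,
     6.39, 6.97, 3.91, 4.08, -1.27, 3.96, 4.80, -1.73, 4.07, 1.41, 3.54,
     -3.86, 2.78]
# Two ranks missing from the scanned table are taken as 1 (an assumption
# consistent with the printed totals V1 = 335, V2 = 283, V3 = 1343).
R = [11, 1, 3, 1, 11, 1, 1, 2, 1, 1, 1, 4, 1, 1, 9, 1,
     1, 2, 7, 1, 1, 1, 11, 11, 11, 1, 11, 1, 1, 1, 1, 1, 1, 7,
     11, 11, 11, 1, 11, 1, 1, 3, 3, 8, 8, 1, 1,
     11, 8, 11, 11, 11, 11, 11, 2, 11, 11, 1, 11, 10, 11, 1, 11]
GROUPS = [(11, 14), (15, 18), (19, 25), (26, 33)]


def group_statistics(lo, hi):
    """Return (n, U1*, V1*) for plots whose age lies in [lo, hi]."""
    idx = [i for i, age in enumerate(AGE) if lo <= age <= hi]
    n = len(idx)
    df = M - 1
    u1_star = sum(T[i] for i in idx) / math.sqrt(n * df / (df - 2))
    rank_mean = (M + 2) / 2
    rank_var = ((M + 1) ** 2 - 1) / 12
    v1_star = (sum(R[i] for i in idx) - n * rank_mean) / math.sqrt(n * rank_var)
    return n, u1_star, v1_star


table2 = {g: group_statistics(*g) for g in GROUPS}
for (lo, hi), (n, u, v) in table2.items():
    print(f"{lo}-{hi}: n = {n:2d}, U1* = {u:7.2f}, V1* = {v:6.2f}")
```

The change of sign between the younger and older groups is the pattern the text uses to explain why $U_1$ and the mean difference have opposite signs.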

The parametric tests based on $U_1^*$ and $U_3^*$ use standard two-sample t-statistics computed for each plot and are thus appealing because of their familiarity. Of these two tests, $U_3^*$ is probably the best general purpose test since it is sensitive to a wide class of alternatives. A potential problem with these tests is that they may not be as robust against departures from normality as the standard two-sample t-test, since there is only one value of $Y_i$ used in each $t_i$. Although $U_1$ and $U_3$ are based on a sum of terms, the calculation of the moments of $t_i$ used in the standardization of $U_1$ and $U_3$ is based on the assumption that $t_i$ has a t-distribution with $m - 1$ degrees of freedom.

The nonparametric tests based on the ranks $R_1, R_2, \ldots, R_n$ have the advantage that no assumptions have to be made about the distribution or variance of $Y$. The only approximation that enters the picture is the normal approximation to the distribution of the sums in $V_1$, $V_2$, and $V_3$ and, unless $n$ is quite small, this approximation should be very good since the null distribution of the ranks is the uniform distribution. The other important advantage of the nonparametric tests is the possibility of higher efficiency than with the parametric tests. If $m$ is not too small, the efficiency of $\sum_i R_i$ relative to the t-test is essentially that of the standard Wilcoxon rank-sum test relative to the t-test. The Wilcoxon test is almost as efficient as the t-test for normal populations and can be much more efficient for nonnormal populations.

Since the application of multiple tests, whether parametric or nonparametric, makes the calculation of the exact significance level very difficult, the best strategy may be to apply a single test based either on $V_2^*$ or on $V_3^*$. These tests are distribution free, easy to calculate, and sensitive to a wide class of alternatives.

The procedures discussed so far can be extended to cases where the number of simulation runs at $\underline{X}_i$ is $m_i$ and $m_1, m_2, \ldots, m_n$ are not all equal. The tests would be essentially the same except that the mean and variance formulas would involve the sum of unequal components. As long as there is not too much difference in the sample sizes, the normal approximation to the sums should be reasonably good. Van Elteren (1960) and Puri (1965) considered the problem of choosing the proper weights in combining the individual tests when the sample sizes are unequal.

The tests can be extended to the case where more than one real observation is available at each $\underline{X}_i$. This problem is essentially the original problem of combining independent tests that has been discussed in the literature except that the alternative is still very general.
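For the extension to unequal numbers of simulation runs, one possible reading of "the sum of unequal components" for the rank-sum statistic is sketched below. It assumes an unweighted sum of ranks, with plot $i$ contributing the null mean $(m_i + 2)/2$ and null variance $((m_i + 1)^2 - 1)/12$ of a rank that is uniform on $1, \ldots, m_i + 1$. This is an illustration only: the function name and the example data are hypothetical, and the question of optimal weights raised by Van Elteren (1960) and Puri (1965) is not addressed.

```python
import math


def standardized_rank_sum(ranks, runs):
    """Standardize the sum of ranks when plot i has runs[i] simulated values.

    Under the null hypothesis ranks[i] is uniform on 1..runs[i] + 1, so the
    null mean and variance of the sum are sums of per-plot components.
    """
    if len(ranks) != len(runs):
        raise ValueError("ranks and runs must have the same length")
    null_mean = sum((m + 2) / 2 for m in runs)
    null_var = sum(((m + 1) ** 2 - 1) / 12 for m in runs)
    return (sum(ranks) - null_mean) / math.sqrt(null_var)


# Hypothetical example: three plots with 10, 10 and 20 simulation runs.
z = standardized_rank_sum(ranks=[1, 5, 21], runs=[10, 10, 20])
print(round(z, 3))
```

When all $m_i$ equal 10, the per-plot mean and variance reduce to 6 and 10, which are the values used for $V_1^*$ in the example.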

The procedures discussed so far apply to problems where a single value $Y_i$ is measured at the point $\underline{X}_i$. In some situations the available data may be in the form of time series data. For example, total volume on each of the $n$ plots may be measured at several time points. In this case $Y_i$ and $Z_{ij}$ would be vectors whose components correspond to the several measurements. Another extension along the same line is the multivariate problem where more than one variable is measured on each plot. For example, the objective may be to simultaneously predict volume, basal area, and number of trees per acre. Both of these extensions require that techniques be developed for comparing real and simulated vectors.

Freese (1960) has proposed a different approach for the problem of comparing model predictions with an established standard. Although he did not assume that $Y$ (in our notation) was a random variable, his basic philosophy may still be applicable. Whereas our approach has been to test the null hypothesis that the model is a correct representation of the real system, Freese's approach is to specify the accuracy required by the user and then test the hypothesis that the model will meet the accuracy requirement. This approach may be useful in situations where the user is primarily concerned with whether $Z_i$ is close enough to $Y_i$ for a particular purpose. There are, however, several problems with Freese's approach. The specification of required accuracy or error must be translated into the variance of a normal distribution under the assumption that there is no bias in the model. Trying to answer the question of whether $Z_i$ will be close to $Y_i$ still does not answer the more difficult question of whether the distribution of $Z_{ij}$ is close to the distribution of $Y_i$. In addition, the accuracy required of the model will vary from one user to another. Rennie and Wiant (1978) and Ek and Monserud (1979) have attempted to overcome this latter objection by defining a critical error, which is the smallest error specification that will lead to acceptance of the hypothesis that the model will meet the error specification. All of these techniques
based on Freese's approach require the interpretation that the error specification is made for a plot selected at random from the population of plots, rather than for a plot with particular characteristics.

A possible problem with the hypothesis testing formulation presented here and by Freese (1960) is that the modeler is put in the position of wanting to accept the null hypothesis. In this position, anything that will reduce the power of the tests being used will tend to make the model look better. For example, reducing the value of $n$ or artificially increasing the variance of the simulated observations may have this effect. In many cases, the main use of the proposed tests may not be for the purpose of reaching a strict accept or reject decision about the model, but rather for providing some kind of objective index for evaluating the fit of the model. In such situations, the modeler must be careful not to try to artificially "fit" the model to the test in order to obtain a value of the test statistic that agrees with the null hypothesis.

In interpreting the results of the validation tests, the user should also keep in mind that the decision to accept the null hypothesis does not mean that the model is correct or that it is the best possible model. On the other hand, the decision to reject the null hypothesis does not necessarily mean that the model is not useful for practical purposes. If the null hypothesis is rejected, the question then is where and how the model fails and what can be done to improve it.

LITERATURE CITED

BERK, R. H., and A. COHEN. 1979. Asymptotically optimal methods of combining tests. J Am Stat Assoc 74:812-814.

BIRNBAUM, A. 1954. Combining independent tests of significance. J Am Stat Assoc 49:559-575.

DANIELS, R. F., and H. E. BURKHART. 1975. Simulation of individual tree growth and stand development in managed loblolly pine plantations. VPI&SU, Sch For and Wildl Resour, FWS-5-75, 69 p.

DANIELS, R. F., H. E. BURKHART, and M. R. STRUB. 1979. Yield estimates for loblolly pine plantations. J For 77:581-583, 586.

EK, A. R., and R. A. MONSERUD. 1979. Performance and comparison of stand growth models based on individual tree and diameter-class growth. Can J Forest Res 9:231-244.

FISHER, R. A. 1938. Statistical methods for research workers. 7th ed. Oliver and Boyd, Edinburgh and London. 356 p.

FISHMAN, G. S., and P. J. KIVIAT. 1968. The statistics of discrete-event simulation. Simulation 10:185-195.

FREESE, F. 1960. Testing accuracy. Forest Sci 6:139-145.

GOOD, I. J. 1955. On the weighted combination of significance tests. J Roy Stat Soc, Ser B, 17:264-265.

GOULDING, C. J. 1979. Validation of growth models in forest management. N Z J For 24:108-124.

HODGES, J. L., JR., and E. L. LEHMANN. 1962. Rank methods for combination of independent experiments in analysis of variance. Ann Math Stat 33:482-497.

JOHNSON, N. L., and S. KOTZ. 1970. Continuous univariate distributions-2. Houghton Mifflin, Boston. 306 p.

KOZIOL, J. A., and M. D. PERLMAN. 1978. Combining independent chi-squared tests. J Am Stat Assoc 73:753-763.

LITTELL, R. C., and J. LER. FOLKS. 1971. Asymptotic optimality of Fisher's method of combining independent tests. J Am Stat Assoc 66:802-806.

LITTELL, R. C., and J. LER. FOLKS. 1973. Asymptotic optimality of Fisher's method of combining independent tests II. J Am Stat Assoc 68:193-194.

MANKIN, J. B., R. V. O'NEILL, H. H. SHUGART, and B. W. RUST. 1975. The importance of validation in ecosystem analysis. In New directions in the analysis of ecological systems, Part I (George S. Innis, ed), p 63-71. Simulation Councils, Inc., La Jolla, Calif. 132 p.

MCKENNEY, J. L. 1967. Critique of: "Verification of computer simulation models." Manage Sci 14:B-102-103.

MIHRAM, G. A. 1971. Simulation: statistical foundations and methodology. Academic Press, New York. 526 p.

MIHRAM, G. A. 1972. Some practical aspects of the verification and validation of simulation models. Oper Res Q 23:17-29.

MONTI, K. L., and P. K. SEN. 1976. The locally optimal combination of independent test statistics. J Am Stat Assoc 71:903-911.

NAYLOR, T. H., and J. M. FINGER. 1967. Verification of computer simulation models. Manage Sci 14:B-92-101.

NOETHER, G. E. 1963. Efficiency of the Wilcoxon two-sample statistic for randomized blocks. J Am Stat Assoc 58:894-898.

OOSTERHOFF, J. 1969. Combination of one-sided statistical tests. Mathematisch Centrum, Amsterdam, Mathematical Centre Tract 28, 148 p.

PAPE, E. S. 1972. A combination of F-statistics. Technometrics 14:89-99.

PURI, M. L. 1965. On the combination of independent two-sample tests of a general class. Rev Int Stat Inst 33:229-241.

RENNIE, J. C., and H. V. WIANT. 1978. Modification of Freese's chi-square test of accuracy. USDI Bur Land Manage Resour Inventory Notes 14:1-3.

SCHRANK, W. E., and C. C. HOLT. 1967. Critique of: "Verification of computer simulation models." Manage Sci 14:B-104-106.

SNEE, R. D. 1977. Validation of regression models: methods and examples. Technometrics 19:415-428.

VAN ELTEREN, P. 1960. On the combination of independent two sample tests of Wilcoxon. Bulletin de l'Institute International de Statistique 37:351-361.

VAN HORN, R. L. 1971. Validation of simulation results. Manage Sci 17:247-258.

VAN ZWET, W. R., and J. OOSTERHOFF. 1967. On the combination of independent test statistics. Ann Math Stat 38:659-680.

WILCOXON, F. 1946. Individual comparisons of grouped data by ranking methods. J Entomol 39:269-270.

ZELEN, M., and L. S. JOEL. 1959. The weighted compounding of two independent significance tests. Ann Math Stat 30:385-395.

