
Edgeworth Series in Quantitative Behavioral Science (Report No. ESQBS-99-2)

The Methodology of Methodological Research: Analyzing the Results of Simulation Experiments

Bruno D. Zumbo
University of Northern British Columbia

Michael R. Harwell
University of Pittsburgh

Abstract: In this paper we advocate response surface methodology (RSM) as a complement to the commonly used narrative and tabular description of simulation results. In essence, RSM provides a graphical and mathematical description of the relationship between the response and explanatory variable(s) in a simulation study. The nature and purpose of response surface methodology are introduced in the context of an example of the effect of outlier contamination on the bias of the sample mean.

Keywords: Monte Carlo methodology, simulation, methodological research

Running Head: Simulation Experiments

Send Correspondence to:

Dr. Bruno D. Zumbo
Departments of Psychology or of Mathematics
University of Northern British Columbia
3333 University Way
Prince George, B.C., CANADA, V2N 4Z9
E-mail: [email protected]

When citing this paper please use the following format (for APA style):

Zumbo, B. D., & Harwell, M. R. (1999). The Methodology of Methodological Research: Analyzing the Results of Simulation Experiments (Paper No. ESQBS-99-2). Prince George, B.C.: University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science.


Envision how you, as a methodologist, would react when reading the results section of a graduate thesis in which the student followed up an interesting complex factorial experiment by reporting the group means in a table and then speaking to the general patterns among the means. We suspect that, at best, you might consider the report incomplete. Although the student might claim a philosophical orientation toward exploratory or graphical methods, even the most rudimentary of such methods would require one to go beyond a simple tabulation of the group means.[1]

The scenario you envisioned above occurs on a regular basis in methodological research. It is well known that methodological researchers often turn to simulation experiments in their investigations. Interestingly, the results of these experiments are often reported in a narrative and tabular fashion. For example, in the summer 1994 issue of the Journal of Educational Statistics, three of the five papers report results of simulation experiments in a tabular manner. Similar descriptions hold for other journals that regularly publish papers describing the results of simulation experiments, such as the British Journal of Mathematical and Statistical Psychology, Communications in Statistics - Simulation and Computation, Psychometrika, Applied Psychological Measurement, and the Journal of Educational Measurement. In drawing attention to these recent papers we are not suggesting that they are incorrect, but rather that they reflect a currently accepted standard for reporting the results of simulation experiments, one that does not involve statistically analyzing the results of the experiments.

Failure to analyze and report the results of simulation experiments using something other than tabular methods makes it unlikely that the magnitude of the effect of a simulation factor (i.e., independent variable) can be reliably assessed, or that complex interactions among simulation factors will be detected. In a discussion of simulation experiments twenty years ago, Hoaglin and Andrews (1975) stated that "Statisticians (who, of all people, should know better) often pay little attention to their own principles of design, and they compound the error by rarely analyzing the results of experiments in statistical theory" (p. 124, italics ours). Clearly, the current state of affairs has not progressed much beyond that observed twenty years ago.

[1] An exemplar of a graphical method for analyzing simulation results is Thissen and Wainer's (1986) XTREE graphical icon. However, for reasons unknown to us this graphical technique has seldom been used in reporting simulation results.


The purpose of this paper is to introduce methodologists to the advantages of the statistical analysis of simulation experiments through response surface methodology (RSM). The fundamental premise from which our suggestions stem is that "simulation plays much the same role in methodological research as experimentation plays in other sciences" (Hoaglin & Andrews, 1975, p. 124). For our purposes, experimentation is defined as a set of conditions purposefully chosen to record data that aid in the resolution of a research question. We will present RSM and highlight its advantages in the context of an example from statistical science. Our hope is that an in-depth illustration of the use of RSM techniques to analyze and report the results of one simulation experiment will, in the spirit of Hoaglin and Andrews (1975), encourage methodologists conducting such studies to consider this method in conjunction with tabular methods. It should be noted, however, that the approach advocated herein is applicable both to the traditional problems of statistical science (e.g., Type I error rates of inferential procedures, properties of estimators) and to the statistical aspects of measurement theory.

What is response surface methodology?

In this section we provide a brief description of RSM; the details will be discussed in the context of the example to follow. A basic introduction to response surface methodology and the design of experiments can be found in Snee and Hare (1992), while more advanced treatments can be found in Box and Draper (1987) and Khuri and Cornell (1987). In its essence, RSM is about empirical model-building. RSM deals with a problem frequently faced by researchers: exploring the relationship between some response variable, for now denoted y, and one or more explanatory variables, denoted x_j, j = 1, 2, ..., p, where the x_j are assumed to be fixed. That is, a researcher conducts an experiment to produce data and then seeks to determine how changes in the explanatory variables (i.e., experimental factors) affect the response variable. This is a familiar objective for social and behavioral scientists; however, with RSM the researcher places more emphasis on the mathematical model (i.e., the response surface) of the relationship between the response and explanatory variables. Therefore, one distinguishing feature of RSM is that it embodies a set of graphical and data-visualization tools for exploring the response model.


To this point, then, RSM can be characterized as the combination of experimental design to produce data and empirical model-building to fit a model. As Box and Draper (1987) note, however, various degrees of knowledge or ignorance may exist about the relationship between the response and explanatory variables. At one extreme, the nature of the relationship is exactly known, and we can write what is referred to in the RSM domain as a mechanistic model, y = f(x_j) + e, where f(x_j) denotes a mathematical function. Of course, when a mechanistic model is known it should certainly be used. However, in much of statistical science (like much of experimental science) we lack knowledge of the mechanistic model; all that we know is that, apart from experimental error, the relationship between the response and the p explanatory variables is likely to be smooth.[2] In this case the relationship can be explored experimentally and useful conclusions drawn. In a sense, we want to approximate the mechanistic model through experimental data. The resulting model is an empirical model and is written y = g(x_j) + e to distinguish it from the mechanistic model, where g(x_j) denotes a mathematical function. In the simplest case of RSM, the empirical model g(x_j) is fit to the experimental data through ordinary least-squares (i.e., regression) methods. In fact, all of the material discussed in behavioral and social science statistics texts such as Darlington (1990) or Draper and Smith (1981) is relevant to the model-building phase of RSM. Recall that in the simplest case of model-building we have only one explanatory variable and are fitting a response curve; with more than one explanatory variable we are fitting a plane (or hyperplane), which in the RSM tradition is simply denoted a surface.

At its core, then, RSM differs little from the standard statistical and experimental design methods taught and used in the social and behavioral sciences; the difference is more one of emphasis than of technique or philosophy. RSM is a combination of experimental design, empirical model-building, and graphical methods to explore the response surface.

[2] Smooth is a mathematical term meaning that the function, or curve, is differentiable at every point; usually continuity of the derivative is also demanded. A function f(x) may fail to be differentiable at x = a if the function is not continuous there (where, intuitively, a function is continuous if its graph contains no breaks such as holes, gaps, or jumps); if it is continuous but the tangent line to the curve at x = a is vertical (and therefore its slope is not defined); or if it is continuous but the direction of the curve changes abruptly. Interestingly, smoothness is not a new assumption for methodologists because it is a tacit assumption of most model-building in the social and behavioral sciences.


A case study in using response surface methodology for simulation experiments

To motivate further discussion of RSM, we will consider, as an example, the effects of outlier contamination on the bias of the sample mean, X̄, as an estimator of the population mean. Bias is commonly measured as

bias ≜ E(θ̂) − θ,

where θ denotes the population parameter, θ̂ its estimator, E is the expected value operator, and ≜ means "equals, by definition." From a practical point of view, bias is an indicator of the tendency for θ̂ to overestimate (or underestimate) θ. Our research question can then be expressed as: What is the effect of outlier contamination on the bias of the sample mean? In order to answer our research question we will first need to specify the experimental design and a region of interest.

Experimental design

Outlier contamination is modeled by a family of mixed normal distributions. The mixed normal family is sometimes referred to as a mixture distribution, compound normal distribution, or contamination distribution (Tukey, 1960). The contamination model is created by drawing the majority of data points from a parent distribution and the remainder from a contamination distribution. For example, the parent population may be normal with a mean of 0 and a standard deviation of 1. The contamination distribution, also normally distributed, may have the same mean but a different standard deviation than the parent. In this case the contamination is symmetric. Alternatively, the contamination distribution may have a different mean and standard deviation than the parent. In this case the contamination is asymmetric. Increasing the difference between the means of the parent and contamination distributions creates increasing degrees of asymmetric outlier contamination. In addition, greater degrees of outlier contamination can be created by increasing the proportion of sampling from the contamination distribution. The notation used to describe contamination distributions is

(1 − p) N(0, 1) + p N(µ_c, σ_c),

where p denotes the proportion of contamination, and µ_c and σ_c denote the mean and the standard deviation of the contamination distribution, respectively. Furthermore, given that the parent distribution is a standard normal distribution, µ_c reflects the difference between the means of the parent and contamination distributions in standard score units. These three parameters (p, µ_c, σ_c) are the independent variables in the design of this study, whose values are selected to create a systematic range of symmetric and asymmetric contamination.
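To make the sampling scheme concrete, the following minimal sketch (not part of the original study; the function and variable names are illustrative only) draws one sample from the contamination model:

```python
import numpy as np

def contaminated_sample(n, p, mu_c, sigma_c, rng):
    """Draw n observations from (1 - p) N(0, 1) + p N(mu_c, sigma_c)."""
    # Each observation comes from the contamination distribution with
    # probability p, and otherwise from the standard normal parent.
    from_contamination = rng.random(n) < p
    parent = rng.normal(0.0, 1.0, size=n)
    contaminant = rng.normal(mu_c, sigma_c, size=n)
    return np.where(from_contamination, contaminant, parent)

rng = np.random.default_rng(1999)
x = contaminated_sample(n=32, p=0.08, mu_c=1.5, sigma_c=1.75, rng=rng)
```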

The region of interest

Given that we have isolated the experimental factors, we now need to determine our region of interest. That is, as usual in experimental design, we need to isolate the combinations of levels that we want to investigate in our experiment. In the RSM literature the combination of levels of the experimental factors is denoted the region of interest. As Box and Draper (1987) note, the region of interest will be determined by the context of the research and the research question. For example, methodological researchers studying the behavior of the two-sample t-test for independent means might use type of score distribution, sample size, and degree of variance heterogeneity as experimental factors.

In our example experiment, the specification of the region of interest is guided by previous studies in which contamination models were used. First, the values selected by Rasmussen (1985), which included a µ_c of 33, a σ_c of 10, and a p of 5%, were considered too extreme to have reasonable practical application in research settings. Next, Mosteller and Tukey (1968) provide an example of a distribution that is sampled at 1% from a contamination distribution with a mean of 0 and a standard deviation of 3 (relative to the parent with a mean of 0 and standard deviation of 1). This contamination model is used by those authors to determine the relative efficiency of some statistical procedures, and they describe this example of contamination as 'extreme' within the context of their example. Given this wide range of 'extreme' degrees of contamination, it was difficult to choose the values of the parameters for this study. Ultimately, the parameter values used by Blair and Higgins (1981) were selected and then slightly modified to create equal intervals between the levels of each parameter.


As a result of this process, three levels of each of the parameters of contamination were chosen. For the proportion of sampling from the contamination distribution, values of 1% (.01), 8% (.08), and 15% (.15) were selected. For the mean shift, values of 0, 1.5, and 3.0 were selected. The µ_c of 0 indicates symmetric contamination; the other two µ_c values represent increasing degrees of asymmetric contamination. For σ_c, values of 0.5, 1.75, and 3.0 were chosen. The σ_c of 0.5 is actually less than the standard deviation of 1.0 in the parent distribution, meaning that the spread of the contamination distribution is less than the spread of the parent distribution; very few outliers are likely to occur in this situation. The σ_c values of 1.75 and 3.0 are greater than the standard deviation of the parent distribution and have the effect of introducing increasing degrees of outlier contamination into the distribution.

The values selected for the parameters of the contamination model are shown in Table 1, which establishes the region of interest for this study. Each numbered box in the table represents a distinct population with a specific combination of parameters of contamination. Therefore, a total of 27 different degrees of outlier contamination were generated in the design of this study. Each contaminated population can be described in terms of the parameters associated with that population. For example, population 2 is a distribution that has a proportion of sampling from the contamination distribution, p, of .01 (1%); its mean shift, µ_c, is zero, so the contamination is symmetric; and the standard deviation of its contamination distribution, σ_c, is 1.75, relative to the parent distribution with a standard deviation of 1. In contrast, population 27 is a distribution that has a p of .15 (15%), a µ_c of 3.0 (so the contamination is asymmetric), and a σ_c of 3.0. The degree of contamination increases from population 1 through population 27, which contains the most extreme degree of outlier contamination explored in this study. For each of the 27 populations, five cells are included, each representing one of the five sample sizes in the design. Therefore, the experimental design is a 3³ × 5 completely crossed design. Specifically, the design contains three levels each of p (1%, 8%, and 15%), µ_c (0, 1.5, and 3.0), and σ_c (0.50, 1.75, and 3.0), as well as five levels of sample size (8, 16, 32, 64, 128). It should be noted that in the field of RSM several authors strongly advocate the use of factorial designs (see, for example, Box & Draper, 1987, pp. 105-107; Snee & Hare, 1992).
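As a small sketch (not from the original report; the level values are those given above), the full crossed design can be enumerated directly:

```python
from itertools import product

p_levels = (0.01, 0.08, 0.15)       # proportion of contamination
mu_c_levels = (0.0, 1.5, 3.0)       # mean shift of the contamination
sigma_c_levels = (0.5, 1.75, 3.0)   # SD of the contamination
n_levels = (8, 16, 32, 64, 128)     # sample sizes

# Completely crossed 3 x 3 x 3 x 5 design: 27 populations x 5 sample sizes.
design = list(product(p_levels, mu_c_levels, sigma_c_levels, n_levels))
assert len(design) == 135
```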


Table 1
Resulting Population Distributions Used in the Study (i.e., Region of Interest)

Each numbered cell identifies one of the 27 contaminated populations; each population is crossed with the five sample sizes n = 8, 16, 32, 64, 128.

                        σ_c
  p      µ_c      0.5   1.75   3.0
 .01     0          1      2     3
         1.5        4      5     6
         3.0        7      8     9
 .08     0         10     11    12
         1.5       13     14    15
         3.0       16     17    18
 .15     0         19     20    21
         1.5       22     23    24
         3.0       25     26    27


During the modeling phase of the experiment we will be interested in assessing the appropriateness of a response model, testing hypotheses regarding individual parameters in the model, and testing the lack of fit of the fitted response model; therefore, we will need several observations in each cell (Box & Draper, 1987, pp. 70-71). It is important to note that replicating and testing for lack of fit allows the model specification to be tested, and that this should be a routine activity in simulation experiments. As a lower bound, the number of observations must, of course, exceed the number of terms in the fitted model. Therefore, to allow ourselves the flexibility of testing the lack of fit of rather complex models, there are 20 observations per cell, resulting in a total of 2160 observations.[3] It should be noted, however, that the manner in which bias is defined above requires that each observation itself be computed from a number of replications. That is, the definition of bias involves the expected value operator, which can be operationalized as an "average over replications." Therefore, to be able to use bias as the response variable, whose value is assumed to be affected by changing the levels of the factors, it will be computed over 1000 replications. In sum, then, each cell of the experimental design will involve 20 observations, each computed from 1000 replications.
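A minimal sketch of this replication scheme (not part of the original report; it reuses the sampler sketched earlier and takes the parent mean of zero as the target parameter θ, consistent with the results reported below):

```python
import numpy as np

def contaminated_sample(n, p, mu_c, sigma_c, rng):
    """(1 - p) N(0, 1) + p N(mu_c, sigma_c), as sketched earlier."""
    mask = rng.random(n) < p
    return np.where(mask, rng.normal(mu_c, sigma_c, n), rng.normal(0.0, 1.0, n))

def bias_observation(n, p, mu_c, sigma_c, rng, reps=1000):
    """One 'observation' of bias: E(theta-hat) - theta, operationalized as
    the sample mean averaged over `reps` replications, minus theta = 0."""
    means = [contaminated_sample(n, p, mu_c, sigma_c, rng).mean() for _ in range(reps)]
    return float(np.mean(means))

# Twenty such observations for a single design cell, as described in the text.
rng = np.random.default_rng(2)
cell = [bias_observation(n=32, p=0.08, mu_c=1.5, sigma_c=1.75, rng=rng) for _ in range(20)]
```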

A check on the data generation methodology

The computer simulation methodology was the same as that used in Zumbo and Jennings (1994). Zumbo and Jennings report extensive tests of the simulation methodology wherein they (a) tested the random number generator, and (b) showed that the mixed normal distributions performed as expected. They concluded that the simulation tests indicate that the method of generating symmetric and asymmetric contamination is functioning as intended. Additional support for the methods used to generate the contamination model in this study is found in Tukey (1960), who demonstrated the analytical accuracy of these models of contamination. Having established that the method of data generation is sound, we can produce the data and proceed with the empirical model-building.

[3] In all of the response surface methodology literature we have investigated (including the standard texts mentioned earlier), the number of observations per cell is quite arbitrary as long as it is greater than the lower bound set by the number of potential parameters in the model. Although it is not yet standard RSM practice, the number of replications may be determined by power analysis. We caution, however, that the minimum effect size may be quite difficult to determine in simulation experiments.


Should the explanatory variables be examined in their original form, or should transformed explanatory variables be used?

In RSM explanatory variables are sometimes transformed. The question the researcher needs to answer is whether transforming the variables will enhance the interpretability of the results. As in most statistical modeling, the set of possible transformations is quite large; however, coded variables (i.e., a design matrix) or "standardized" variables are often used. In essence, as Box and Draper (1987, p. 20) state, transformation is a matter of convenience: it is sometimes convenient not to have to deal with the actual numerical measures of the explanatory variables. As they further state (pp. 4-5), usually the best choice of transformation is not clear-cut and, initially at least, will be the subject of conflicting opinion. Therefore, in the present study the explanatory variables will be treated in their raw metric rather than as coded or standardized variables, because we believe that transforming the variables would not enhance interpretability in our context. In other words, the metric of the original variables is appropriate for the research question at hand.

Response modeling

The next phase in the response surface modeling is to determine a mathematical model that best fits the data collected in the experimental design. As noted earlier, such a determination is possible through standard ordinary least-squares regression analysis. Therefore, if one is using ordinary least-squares estimation, the model-building phase of RSM can be conducted with most commercial statistical packages. We used MINITAB, given its extensive set of options for RSM (available under the heading Design of Experiments, DOE).
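We used MINITAB; purely as an illustration (not from the original study; the file and column names here are hypothetical), a comparable ordinary least-squares fit can be produced in most regression environments, for example:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file of simulation output: one row per observation,
# with columns bias, p, mu_c, sigma_c, and n (names are illustrative).
df = pd.read_csv("bias_observations.csv")

# A first-order response model fit by ordinary least squares.
fit = smf.ols("bias ~ p + mu_c + sigma_c + n", data=df).fit()
print(fit.summary())
```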

Results and Conclusion

The first step in the analysis involves the conventional tabular summary of the results (see Table 2). This highlights the fact that RSM, as we are advocating it, is not a substitute for the tabular summary of the results but rather a complement to it.

Table 2
Tabular results of bias displayed by the factors in the experimental design.

                                 σ_c
  p     µ_c     n        0.5       1.75        3.0
 .01    0       8     .00000    -.00283     .00330
                16    .00148     .00006    -.00177
                32    .00040    -.00164     .00053
                64    .00005     .00146     .00029
                128   .00055    -.00119     .00043
        1.5     8     .01763     .01109     .01694
                16    .01443     .01770     .01795
                32    .01535     .01846     .01678
                64    .01681     .01473     .01457
                128   .01322     .01473     .01512
        3.0     8     .03228     .02995     .02948
                16    .02822     .02989     .02738
                32    .02872     .03084     .03034
                64    .03127     .03115     .02905
                128   .02913     .02997     .03001
 .08    0       8    -.00165     .00233    -.00016
                16   -.00299    -.00200    -.00110
                32    .00203     .00040    -.00020
                64   -.00030    -.00112    -.00100
                128   .00017    -.00036    -.00088
        1.5     8     .12049     .12015     .12102
                16    .11903     .12072     .12167
                32    .11936     .12025     .12055
                64    .11866     .11828     .11884
                128   .11978     .11935     .12104
        3.0     8     .24230     .24144     .24029
                16    .24316     .23835     .24041
                32    .24040     .24292     .24138
                64    .24046     .24220     .23894
                128   .23838     .24054     .24129
 .15    0       8    -.00116     .00166     .00140
                16    .00204    -.00156     .00043
                32    .00014    -.00263    -.00207
                64    .00128    -.00087    -.00008
                128  -.00092     .00007     .00085
        1.5     8     .22595     .22543     .22759
                16    .22624     .22440     .22490
                32    .22614     .22340     .22814
                64    .22411     .22436     .22249
                128   .22424     .22490     .22498
        3.0     8     .44502     .45224     .45132
                16    .44886     .45005     .45149
                32    .44940     .45169     .45145
                64    .44927     .45060     .44816
                128   .44998     .45072     .45133


It appears from Table 2 that bias is affected neither by n nor by σ_c. The null effect of n is in accord with well-known statistical theory. However, bias appears to increase with increasing p and with increasing µ_c. Furthermore, there may be an interaction of p and µ_c, since the effect of µ_c appears to be greater for increasing p.

In the next phase MINITAB was used to fit a series of response models. A series of models was fit because we need to ascertain whether a simple linear model will suffice or whether there is sufficient deviation from additivity in the usual linear model to require a more complex model. This deviation from additivity is reflected in the addition of two-way interaction terms and/or squared terms to the model. The following four models were fit:

bias = b_0 + b_1 p + b_2 µ_c + b_3 σ_c + b_4 n                                               (1)

bias = {Equation (1)} + b_5 σ_c p + b_6 σ_c µ_c + b_7 σ_c n + b_8 p µ_c + b_9 p n + b_10 µ_c n   (2)

bias = {Equation (1)} + b_11 σ_c² + b_12 p² + b_13 µ_c² + b_14 n²                            (3)

bias = {Equation (1)} + {interaction terms} + {squared terms}                                (4)

where model (1) is the simple linear model, model (2) is the linear plus two-way-interactions model, model (3) is the linear plus squared-terms model, and model (4) is the fully quadratic model. We followed the recommended model-building steps of Draper and Smith (1981, p. 40), wherein the residual sum of squares is partitioned into lack-of-fit and pure-error sums of squares. The lack-of-fit tests indicated that models (1) and (3) were inadequate, whereas for models (2) and (4) the lack-of-fit tests gave no reason to doubt the adequacy of the model.
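A sketch of this partition (not from the original report; it continues the hypothetical data frame above): with replicated observations in every design cell, the residual sum of squares for a fitted model splits into pure error and lack of fit, and the ratio of their mean squares gives the lack-of-fit F test:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bias_observations.csv")                      # hypothetical output
fit = smf.ols("bias ~ p + mu_c + sigma_c + n", data=df).fit()  # model (1)

# Pure error: variation of replicate observations around their cell means.
cells = df.groupby(["p", "mu_c", "sigma_c", "n"])["bias"]
ss_pure = ((df["bias"] - cells.transform("mean")) ** 2).sum()
df_pure = len(df) - cells.ngroups

# Lack of fit: the remainder of the residual sum of squares.
ss_lof = fit.ssr - ss_pure
df_lof = fit.df_resid - df_pure

F_lof = (ss_lof / df_lof) / (ss_pure / df_pure)  # compare to F(df_lof, df_pure)
```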

Given the above results it was concluded that the following model would be adequate:

bias = b_0 + b_1 µ_c + b_2 p + b_3 µ_c p.

This model was fit to the data, resulting in the equation

bias = 0.00010 + 0.00007 µ_c − 0.00244 p + 1.00095 µ_c p + error,                            (5)

which explains 99.7% of the variation with a standard error of estimate of 0.00813. It is noteworthy that the more parsimonious model (5) explains the same amount of variation as either model (2) or model (4). Model (5) is accepted as a plausible empirical model.

Having attained a plausible empirical model, we now move to the final stage of the RSM, wherein we investigate the response surface to see whether we can gain more insight into the properties of model (5). We begin by plotting model (5) in three-dimensional space in Figure 1, where µ_c is denoted "shift" and p is denoted "proport". Furthermore, to aid in the interpretation of the three-dimensional plot, spikes were included in the graph (the height of a spike reflects the amount of bias).


Figure 1. Three-Dimensional Response Surface with Spikes (axes: "shift" = µ_c, "proport" = p, and bias).

It is apparent from Figure 1 that the model is not a flat plane but rather a warped surface that looks rather like a "ski hill." This description of model (5) is supported by the series of eight slides of the spinning three-dimensional plot of the response surface (Figure 2). Figure 2 begins with slide #1 (which depicts much the same image as the three-dimensional plot in Figure 1) and progresses through slides #2 to #8, rotating to the left while holding the axis labelled "bias" stationary. That is, the slides progress as if the axis labelled "proport" were moving toward the reader in slides #1 through #3 and then away from the reader in slides #5 through #8. It is apparent from Figure 2 that the surface is not flat (in particular, see slides #3 and #7). In sum, then, model (5) and its graphical depiction support the earlier observation that there is an interaction between p and µ_c. The reader should note that the ease with which one can interpret graphs like those in Figures 1 and 2 increases with experience with RSM and with the quality of the computer technology. We used SYSTAT for the three-dimensional plots in Figures 1 and 2.


Figure 2. Spinning three-dimensional plots of the response surface (slides #1 through #8; axes: bias, "proport", "shift").


In the final step of the RSM we take yet another look at the interaction in model (5). To do so we remind ourselves of the definition of an interaction: the rate of change of the response with respect to one variable, say p, depends on the level of µ_c. In other words, it is the deviation from the additivity reflected in the usual multiple linear regression. To decompose the interaction in model (5) we consider the rates of change (i.e., the "proper" regression coefficients):

∂bias/∂p = −0.0024 + 1.0010 µ_c, or
∂bias/∂µ_c = 0.0001 + 1.0010 p.                                                              (6)

Note that the slope of the regression surface in the p direction depends on µ_c and that, in turn, the slope of the regression surface in the µ_c direction depends on p. Furthermore, we can rewrite (6) as

bias = (−0.0024 + 1.0010 µ_c) p, or
bias = (0.0001 + 1.0010 p) µ_c.                                                              (7)

Equation (7) can be used to create conditional effect plots to help understand the interaction effect. That is, given equation (7) we can plot, for example, the p-bias relationship at the various levels of µ_c, and the µ_c-bias relationship at the various levels of p (see Figure 3).
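A sketch (not from the original report) of conditional effect plots in the style of Figure 3, drawn directly from equation (7):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.0, 0.20, 100)
for mu_c in (0.0, 1.5, 3.0):
    # Panel A of Figure 3: bias = (-0.0024 + 1.0010 * mu_c) * p, from equation (7).
    plt.plot(p, (-0.0024 + 1.0010 * mu_c) * p, label=f"shift = {mu_c}")
plt.xlabel("Proportion Contamination")
plt.ylabel("Bias")
plt.legend()
plt.show()
```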


Figure 3. Conditional Effect Plots. A: p-bias plot at various levels of µ_c (denoted "shift": 0, 1.5, 3.0). B: µ_c-bias plot at various levels of p (.01, .08, .15).


Together the conditional effect plots shed further light on the response model by providing, in essence, a "side profile" of the response surface. That is, it is apparent from Figure 3a that, as suggested in Figures 1 and 2, a µ_c of zero results in little to no bias. However, as µ_c increases the effect of p on bias becomes more pronounced; i.e., the conditional slope is greatest at µ_c equal to 3.0. Similarly, from Figure 3b one can conclude that even a µ_c of 3.0 can have a small effect on bias if p is 0.01. And, as in Figure 3a, as p increases the effect of µ_c on bias becomes more pronounced. The µ_c by p interaction appears to be disordinal in nature, wherein bias is smallest when µ_c and p are zero and increases with increasing µ_c and p.

We return to the original research question: What is the effect of outlier contamination on the bias of the sample mean? Our results indicate that over the region of interest described in Table 1, 99.7% of the variation in bias can be explained by a model containing a linear-by-linear interaction of the proportion of contamination and the shift in location of the contamination distribution. Furthermore, this interaction applies irrespective of sample size or standard deviation of the contamination distribution.
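As an aside (an addition to the original report, not part of its analysis), this empirical conclusion agrees with a direct calculation of the mixture's mean, taking the parent mean of zero as the target parameter:

```latex
% Expected value of the sample mean under the contamination model:
\mathrm{E}(\bar{X}) = (1-p)\cdot 0 + p\,\mu_c = p\,\mu_c,
\qquad\text{so}\qquad
\mathrm{bias} = \mathrm{E}(\bar{X}) - 0 = p\,\mu_c ,
```

which is exactly a µ_c by p interaction with unit coefficient, as model (5) recovered empirically (b_3 = 1.00095).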

General Discussion

The purpose of this paper was to describe the nature and process of response surface methodology (RSM) in the context of an example. In the end, it is apparent that RSM shares much in common with commonly used techniques and approaches in behavioral and social science research. As a complement to the more commonly used narrative and tabular description of simulation results, RSM provides a graphical and mathematical description of the relationship between the response and explanatory variable(s). In cases like the one discussed above, where the functional relationship is relatively simple and pronounced, the narrative and RSM conclusions will be similar. However, one can imagine that when the effects are less pronounced the narrative and RSM conclusions will be complementary but quite distinct. Consider, for example, how multidimensional and conditional effect plots could aid methodological researchers attempting to disentangle the effects (i.e., both main effects and interactions) of several factors on the Type I error rate and power of a statistical test.

In sum, RSM offers a perspective toward, and a language with which to discuss, the planning and analysis of simulation experiments. In particular, RSM brings to the forefront the issues of (a) the region of interest for the experiment, and (b) the scaling and metric of both the response and explanatory variable(s). Finally, RSM highlights what Box and Draper (1987) and Snee and Hare (1992) refer to as the iterative nature of experimentation, wherein we progress from hypothesis to design to experimentation to analysis to further hypothesis to further design to further experimentation, and so on. RSM can be used, then, to respond to Hoaglin and Andrews' (1975) call for statisticians and methodologists alike to pay more attention to their own principles of design and analysis when reporting simulation experiments.

Authors' note: An earlier version of this paper was presented to Division D (Measurement and Research Methodology) at the 1995 Annual Meeting of AERA. We would like to thank Ann Wilson, ACT, for comments on an earlier version of this paper.

References

Blair, R. C., & Higgins, J. J. (1981). A note on the asymptotic relative efficiency of the Wilcoxon rank-sum test relative to the independent means t test under mixtures of two normal distributions. British Journal of Mathematical and Statistical Psychology, 34, 124-128.

Box, G. E. P., & Draper, N. R. (1987). Empirical model-building and response surfaces. New York: John Wiley & Sons.

Darlington, R. B. (1990). Regression and linear models. New York: McGraw-Hill.

Draper, N. R., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: John Wiley & Sons.

Hoaglin, D. C., & Andrews, D. F. (1975). The reporting of computation-based results in statistics. The American Statistician, 29, 122-126.

Khuri, A. I., & Cornell, J. A. (1987). Response surfaces: Designs and analyses. New York: Marcel Dekker.

Mosteller, F., & Tukey, J. W. (1968). Data analysis, including statistics. In G. Lindzey & E. Aronson (Eds.), The handbook of social psychology: Vol. 2. Research methods (pp. 80-203). Reading, MA: Addison-Wesley.

Rasmussen, J. L. (1985). The power of Student's t and Wilcoxon W statistics: A comparison. Evaluation Review, 9(4), 505-510.

Snee, R. D., & Hare, L. B. (1992). The statistical approach to design of experiments. In D. C. Hoaglin & D. S. Moore (Eds.), Perspectives on contemporary statistics (pp. 71-92), MAA Notes No. 21. Washington, DC: Mathematical Association of America.

Thissen, D., & Wainer, H. (1986). XTREE: A multivariate graphical icon applicable in the evaluation of statistical estimators. The American Statistician, 40, 149-153.

Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 448-485). Stanford, CA: Stanford University Press.

Zumbo, B. D., & Jennings, M. J. (1994). The robustness of validity and efficiency of the one-sample t test in the presence of normal contamination. Unpublished manuscript.