VALIDATION OF SIMULATION MODELS

Cor van Dijkum, Dorien DeTombe, Etzel van Kuijk (Editors)

SISWO Publication 403
Amsterdam: SISWO, 1999
SISWO
Netherlands Universities Institute for Coördination of Research in Social Sciences. Plantage Muidergracht 4, 1018 TV Amsterdam tel: 31 20 5270600 / fax: 31 20 6229430
©1999
No part of this book may be reproduced in any form by print, photoprint, microfilm or any other means without the prior written permission from the publisher.
© 1999, Second Impression
Cor van Dijkum, Etzel van Kuijk, Ferdinand Verhulst, Marco Janssen, Bert de Vries, Cornelis Hoede, Helene Weening, Jack P.C. Kleijnen, Dorien J. DeTombe
Validation of Simulation Models
Cor van Dijkum, Dorien DeTombe, Etzel van Kuijk (Eds.)
Amsterdam: SISWO, 1998 / 1999
ISBN-90-676-152-2
Keywords: models, simulation, validation, multi-disciplinarity, complex problems
A publication in cooperation with the Dutch research groups SIMULATION and COMPLEX SOCIETAL PROBLEMS of the NOSMO (Dutch Research Group on Methodology)
Content:

Preface (Cor van Dijkum, Dorien DeTombe, Etzel van Kuijk)
Validation of simulation models in the social sciences (Cor van Dijkum and Etzel van Kuijk)
The validation of metaphors (Ferdinand Verhulst)
Global modeling: managing uncertainty, complexity and incomplete information (Marco Janssen and Bert de Vries)
Graph theoretical analysis as tool for validation and conceptualization (Cornelis Hoede and Helene Margot Weening)
Simulation of educational effects (Henny de Vos and Roel J. Bosker)
Statistical validation of simulation, including case studies (Jack P.C. Kleijnen)
Outside validity of research of complex societal problems (Dorien J. DeTombe)
About the authors
PREFACE

Computer models are used more and more in our society to get a grasp on difficult scientific and social problems. In science, computer models enable us to understand phenomena, such as turbulence in fluids, that were never understood before. For society, threats to mankind, such as a possible climate change or limits to the availability of food and unpolluted fresh water, can be explored with computer models. Science is necessary to accumulate the insight needed for these models. On the other hand, scientists cannot work without computer models to handle their complex knowledge. With those models, however, old scientific questions are back on the stage: how valid is the knowledge provided by the models? How are the facts of the world related to those models? How sound is the logical reasoning in the models? There is no easy answer to such questions. The logic of science, which seems adequate for a number of old(-fashioned) ways to describe and explain the world, shows its limitations when it comes to the surprises of computer models. This book deals with the challenge posed by computer models for science.

Each year the Dutch Federation of Social Science Methodology (NOSMO) discusses current (methodo)logical problems. In 1997 one of the themes discussed was the validation of knowledge generated by computer models. Authoritative lecturers from different disciplines (inside and outside of the social sciences) were invited to explain how they dealt with this question in their field. The written versions of those sessions were the starting points for the articles in this book.

The first chapter of this book describes the development of the question of validation from a historical perspective. Looking at the history of the (social) sciences, van Dijkum and van Kuijk show that the question of the validation of scientific statements is a complicated one. It was a controversial one as well, with opinions ranging from the view of members of the Vienna Circle to the conclusions of Feyerabend. The former proclaimed that science would bring valid knowledge; the latter seemed to believe that in science 'anything goes' (Feyerabend 1975). Between these views on science a broad range of opinions exists. An effort is made to reduce the existing confusion by criticizing some of the opinions and identifying possible
myths. In doing so, the first steps are taken towards some insights into the (re)construction of a more consistent logic of validation for modern science.

In the second chapter Verhulst reaches a similar conclusion, although from a different perspective and discipline. Verhulst introduces models as metaphors for the description of reality. A number of case studies of contemporary research are presented: pollution of the North Sea; the flow field of the Wadden Sea; drillstring dynamics; and the use of metaphors in psychoanalysis. The validation of the results takes a very different form in each of the cases presented. The characteristic that unites these cases is that their validation is certainly not in agreement with the picture of scientific research projected by textbook examples.

In the third chapter Janssen and de Vries show that the problems with the textbook approach to validation also exist in global modeling. Global modeling, also known as integrated assessment modeling, faces the dilemma that the possibilities to validate these models, and the theories they represent, are limited. For those models too many subjective assumptions have to be made, and far too little information is available for rigorous validation. Nevertheless, the quest for better validation methods for these models is kept alive by the need for the type of knowledge gain they promise, for instance to policy makers. In their chapter Janssen and de Vries explore some answers to these problems: an emphasis on model transparency, interactivity, and the model building process; the use of expert-validated meta-models; more explicit recognition of the problems arising from multi-disciplinarity; as well as the use of cultural perspectives to deal explicitly with controversies.

In chapter four Hoede and Weening take things one step further by presenting a new method for the validation of theories. Perhaps the theory of knowledge graphs they developed will prove to be a fruitful method for the validation of integrated assessment models. A first exploration in that direction is made by discussing the possibility of using the theory of knowledge graphs to analyze processes in social science.

An effort to analyze the process of empirical research in the social sciences is made by de Vos and Bosker in chapter five. Their chapter deals with the study of educational effects by means of a (computer) simulation. They use the simulation model they developed to test some of the textbook statistical validation methods. Tested in this way, these methods prove to fall short for the
validation purposes of their model. This seems to be an indication that textbook statistical validation methods need to be re-evaluated on their merits.

In chapter six Kleijnen shows that not all possible solutions for the validation problem need to be sought outside the textbook of statistical methods; opportunities for improvement also lie within these methods. A survey is given of how to validate simulation models through the application of mathematical statistics, stating that such experiments should be guided by the statistical theory on the design of experiments.

Finally, DeTombe argues that most methodological textbooks only refer to validity in reflecting on the research performed. The concept of validity, however, should be widened for the reflection on complex societal problems. This means that next to inside validity (internal and external validity) the concept of outside validity should be used. Using the term outside validity makes it possible to reflect on decisions the researcher makes before formulating the research question. It reflects on the selected phenomena and methods in view of the results of the research and its effect on society.

The articles in this book are the result of the intellectual efforts of the authors. To facilitate this work of science, invited reviewers gave comments on draft versions of the articles. We would like to thank these reviewers for their support in bringing the best out of the authors. As editors we were pleased that members of the SISWO publication board acknowledged that with these articles we completed a valuable contribution to the development of the public discussion of science.
Amsterdam, November 1998. Cor van Dijkum, Dorien DeTombe, Etzel van Kuijk
VALIDATION OF SIMULATION MODELS IN THE SOCIAL SCIENCES

Cor van Dijkum and Etzel van Kuijk
1 INTRODUCTION

Looking at the history of the (social) sciences, it appears that the question of the validation of scientific statements is a complicated one. Perhaps it started with the proclamation of positivists such as Auguste Comte (1818) and neo-positivists such as Schlick and Carnap (Vienna Circle 1929). They proclaimed that science would bring logic instead of irrationality, facts instead of fiction, theories instead of myths, and valid knowledge instead of fallacious perception. It seems to end with the conclusion of Feyerabend (1975) that in science 'anything goes'. Between these views on science, a broad range of simplifications, possible myths, confusion, but also some new insights exists, each acting as the (methodo)logical justification for the research of social scientists. This chapter is an effort to reduce the confusion in this (methodo)logical field by criticizing some of the simplifications and the possible myths. In doing so we try to generate some insights into (re)constructing a more consistent logic of validation for modern science.
2 A STRUGGLE FOR RATIONALITY

The start of modern social science is reflected in the statement of Auguste Comte that society has developed to a new stage: a stage in which the dark ages of feudal regulation of society are fading away and are replaced with rational decisions supported by positive scientific knowledge. This positive idea of science, endorsed by members of the Vienna Circle such as Schlick, Neurath, and Carnap (Vienna Circle 1929), resulted in a program for the rationality of science. In this program the truth had to be found with the help of arguments in a free discussion between subjects. From this program the logic of science would gradually emerge. According to the Vienna Circle, arguments in the scientific discussion had to be grounded in the facts of the world, and explanations of these facts in theory. Statements in a
theory had to correspond with these facts, and this had to be done in a logically sound way. This principle of correspondence was supposed to be essential for valid scientific knowledge.

Later on, among others, Popper (1934) continued the struggle against irrationality, especially against the idleness of modern belief systems which were camouflaged as science. In his quest he took the ideas of the Vienna Circle for granted, but he tried to determine the rules of the logic of science more precisely. In this attempt he dealt with many contemporary logical views and unanswered (methodo)logical questions. To specify the correspondence theory of truth of the Vienna Circle, he used Tarski's effort (1936) to lay a foundation for a theory of truth. Popper followed Tarski's idea that between the world and a (grand) theory a model had to be constructed. According to Tarski, a model is a vehicle that can be used to demonstrate that aspects of a theory are true. Statements inferred from theory could be tested on their truthfulness using the model, because they represent facts in the world. Popper was of the opinion that with such models theories could be compared, and that it could be determined which theories are closer to the truth, i.e. more valid for science.

According to Tarski, the language most suited to reveal the truth of a model was the artificial language of logic. But logic was a field in which different formal languages existed. Of course there was the effort of Hilbert (1934, 1938) to reduce all these different formal languages to one computable formal logic. But the fundamental analysis of Gödel (1931) indicated that there were severe limits to this program of the simplification of logic to computable first-order predicate logic. Popper, however, was attracted by the simplicity of Hilbert's program and tried to reduce the language in which valid scientific statements were expressed to first-order predicate logic. Popper preferred finite chains of simple predicative arguments (Dijkum 1988). This influenced his search for concepts and rules of reasoning which could establish the validity of scientific statements. In general he made a plea for parsimony in the use of concepts in models. If a simple model and a more complicated model could both describe and explain the same facts, then the simple model was preferable. The simple model did more with fewer concepts and thus grasped more information. In this line of reasoning he tried to establish the idea that scientific statements with
simple linear relations between variables always represent more information than more complex statements with non-linear relations between variables. According to Popper, simple linear statements seemed to be more valid for science.

Next, in accordance with Popper's tradition of working with simple statements, researchers in social science followed Reichenbach (1956) in limiting the interaction of causes and effects.¹ Interaction was restricted to the action of a cause and the reaction of an effect. Would one allow that the effect in return influences the cause, then a logical inconsistency would arise.² A variable representing an effect was influenced by a variable representing a cause, but the reverse was not the case: there was no feedback between effect and cause. Hence, infinity is not accessible for the logic of the social sciences. Popper argued this in many ways (Dijkum 1988).
2.1 The rise of three myths

Gradually Popper's program started to dominate the ideas of social scientists. Lakatos (1978), however, showed that a growing number of followers misunderstood Popper's carefully balanced ideas of objectivity, deduction, and falsification, and made them into dogmatic ideas. For example, it is true that Popper at first rejected the principle of induction, but this can be attributed to the historical context. According to Lakatos, Popper's falsification principle had to be understood as a sophisticated falsification principle. In a mature scientific discipline not a single theory is developed, but a number of theories generated by competing research programs. Furthermore, related empirical models are proposed. Successful science would be a matter of comparing and judging those different theories by the quality with which they, and the models originating from them, describe and explain facts. Popper at least understood this. The simplicity of Popper's naive
¹ For Reichenbach a closed chain of cause and effect was physically impossible, because then cause and effect would exist simultaneously, whereas it was clear that a cause precedes its effect (see: Reichenbach 1956, p. 39).
² When variable B stands for effect and variable A for cause, then B = f(A); but because A = g(B), one has to continue: B = f(g(B)), and when f ≠ g⁻¹ there is a contradiction.
falsification principle had a great appeal to social scientists. When only one theory is available, falsification is the only logical strategy left. Exit sophisticated falsificationism; a naive and dogmatic use of ideas of falsification expanded as a new (and very powerful) myth into the scientific community.

The second myth emerged from another source of misunderstanding: Popper's preference for simplicity. "Simple statements, if knowledge is our object, are to be prized more highly than the less simple ones because they tell us more; because their empirical content is greater; and because they are better testable" (Popper 1959, p. 142). In Popper's time, the natural and social sciences were restricted to the use of simple linear models. The reason is that the calculation of the mathematics involved very soon became too complex to perform and understand. It even became an art to explain phenomena with those linear models, an art that was transformed into a habit in many scientific disciplines. But with the aid of new hard- and software, developed after the Second World War, one could go beyond this fixation and explore non-linear models³ in those cases where their use is indicated by theory. In this way many fruitful non-linear models were developed in the natural sciences. This myth, however, is still firmly rooted in the social sciences. The idea that linear models are always better and/or that everything can be modeled with linear models, no matter what subjects or circumstances are modeled, has lost little ground.

The third myth developed from following the same ideological track of simplification. Reichenbach's idea of causality became dominant in the methodology of the social sciences. According to Maruyama (1997) this interpretation of causality became a myth in textbooks and even a taboo to discuss. For decades this blocked the development of alternative research programs into causality in the social sciences. In the practice of social science, an overwhelming majority analyses causality in one direction only:

Cause → Effect

That there are occasions where a cause can be influenced by its effect was not accepted as a valid scientific statement. Moreover, if the relation between cause and effect was quantified, the preference for simplicity meant that only models were analyzed in which there was a linear relation between cause and effect (Dijkum
³ To be exact: non-linear differential equations. Those equations are more complicated than normal non-linear equations.
1988). Statistics, used to formulate stochastic relationships, resulted in models in which Pearson correlation coefficients were used to express the linear relation. Generally, the linear paths that exist between several variables in these models are mapped like this:⁴
Figure 1: Paths for unidirectional causality (a diagram of one-way positive paths between the variables A, B, C, E and F)

By means of the partial correlation coefficient one could determine which relation was in fact produced indirectly by other paths.⁵ Again, no feedback relations are allowed in these models. Thus the third myth emerges: the myth that such analyses are sufficient to express all meaningful causal relations.
2.2 Standard rules for validation

Consequently, the rules of validation that can be found in social science conform to Popper's simplified methodology and the other associated myths. In the well-known book by Campbell and Stanley (1966), which for a long time laid down for many social scientists the rules of "how to do research in social science in the right way", it was stated:
⁴ It is amazing that Reichenbach, on a different level of abstraction, drew a similar unidirectional picture of paths between causes and effect (see: Reichenbach 1956, p. 37).
⁵ In figure 1 it could for example happen that the relation between C and B was a result of the relations between A and B and between A and C. In this case the partial correlation coefficient equals zero, and the relation between C and B disappears.
"It is by now generally understood that the 'null hypothesis' often employed for convenience in stating the hypothesis of an experiment can never be 'accepted' by the data obtained; it can only be 'rejected', or 'fail to be rejected.' Similarly with hypotheses more general they are technically never 'confirmed': where we for convenience use that term we imply rather that the hypothesis was exposed to disconfirmation and was not disconfirmed. This point of view is compatible with all Human philosophies of science, which emphasize the impossibility of deductive proof for inductive laws. Recently Hanson (1958) and (Popper 1959) have been particularly explicit upon this point." (Campbell 1966, p. 35) Campbell and Stanley propagated the hypothetico-deductive method, i.e. from a theory a hypothesis was inferred, from that hypothesis a prediction, for which an attempt was made to falsify it by facts. It was preferred to falsify a causal hypothesis (H1), and as a result to verify a hypothesis (H0) of the non-existence of a causal relation. The causal hypothesis should be falsified in the most pure way. That is, in the situation were the researcher had control over the variables which were the causes of the effects. This could be done in an experiment by varying (manipulating) the relevant independent variables and observing the influence of that variation on the relevant dependent variables (effects). However, there was a problem that needed to be solved: alternative causes could jeopardize the conclusion that there was a causal relation between those variables. Following the strategy of the statistician Fisher (1925) it was tried to control i.e. to filter out, those disturbing factors by using all kinds of research settings. A preference for experimental settings could be found with test psychologists. In this field another source for rules of validation existed. As a result of long and complicated discussions test psychologists focused gradually on the question how traits, described in a theory, were related to each other in an empirical way, e.g. were correlated to each other. In the multi-trait-multi-method matrix of Campbell and Fiske (1959) the correlation between the same traits were shown to be
Cor van Dijkum, Etzel van Kuijk: Validation of models in the social sciences convergent (convergent validity) and the correlation between different traits as to be different (discriminant validity). From the same line of reasoning came the concepts: prediction validity, and criterion validity. As was observed already by Hofstee (1973) in practice those quests for correlation were most of the time limited to linear correlation. Consequently the logic of validation concentrated on the use of techniques which implicitly or explicitly worked with linear models. Psychology has had, and still does have a substantial impact on other social sciences. This is why other disciplines adopted the same line of reasoning. However, most of the time for disciplines such as social psychology, anthropology, economy and sociology it was not possible to conduct experiments designed carefully enough to meet the methodological requirements. As a substitute for the control of variables who could jeopardize the presupposed unidirectional causality, efforts were made to control the research situation in another way. This was done by standardization of the collection of data, for example in interviews and through observation. Afterwards, efforts were made to manage the threats to the validity of the reconstruction of unidirectional causality that disturbing variables incorporate. This was done by using designs of data-analysis, such as elaboration of cross-tabs, partial correlation and path analysis. Again in those approaches, for reasons of parsimony, linear models were preferred.
2.3 Validity in system dynamics

Although mainstream methodology focused on unidirectional causality, researchers such as Forrester (1968), Meadows (1974), Hanneman (1988), Richardson et al. (1981) and Blalock (1969) took a stand for mutual interaction between cause and effect, inspired by the feedback idea of Wiener (1948). Both cause and effect were viewed as variables, and the logical contradiction Reichenbach was afraid of was eliminated by the concept of recursion.
Figure 2: Causal recursion (a loop in which A(t) influences B(t+∆t), which in turn influences A(t+2∆t))

At time t, A has a value that has a consequence for B at time t+∆t. In turn, the value of B at time t+∆t has a consequence for A at time t+2∆t. This means a self-reference of A, thus A = function(A, t). This principle can be expressed in a difference equation:

∆A/∆t = function(A)

A recursive difference equation implies causal recursion. Using this type of equation one can also model the concept of feed forward. This implies that the future state of a variable is represented in an anticipation model. Interactions that arise from this type of model can be foreseen and accounted for in the present, thus presenting a way to model self-fulfilling and self-denying prophecies. It is even possible to create models that can restructure the causal relations between variables, by introducing a meta-model in which causal relations are expressed and can be changed. In this way it is possible to produce the morphogenesis of causal models, by introducing a meta-model that models the parameters of a causal equation in another equation (see for example: Zouwen 1997). Another step ahead is the introduction of non-linear feedback models whose patterns of outcomes can be described on a meta-level using non-linear mathematics, introducing in this way the complexity of self-organizing systems (Dijkum 1997a). A minimal numerical sketch of such a feedback recursion is given below.

The choice to let go of unidirectional causality and to introduce models with the characteristics described above forces us to reconsider the question of validity.
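To make the recursion concrete, the following Python sketch iterates a difference equation in which A drives B and B feeds back on A with a delay of one time step. The variable names, coefficients and step count are hypothetical illustrations, not taken from the chapter.

```python
# A minimal sketch of causal recursion: A influences B after one time step,
# and B feeds back on A after another. All coefficients are hypothetical.

def simulate_feedback(a0=1.0, b0=0.0, alpha=0.8, beta=-0.5, steps=20):
    """Iterate A(t+1) = A(t) + beta*B(t) and B(t+1) = alpha*A(t)."""
    a, b = a0, b0
    history = [(a, b)]
    for _ in range(steps):
        # New values are computed from the previous step's values.
        a, b = a + beta * b, alpha * a
        history.append((a, b))
    return history

if __name__ == "__main__":
    for t, (a, b) in enumerate(simulate_feedback()):
        print(f"t={t:2d}  A={a:7.3f}  B={b:7.3f}")
```

Because A at time t+2∆t depends on A at time t via B, the trajectory shows the delayed autocorrelation effect discussed next.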
The validation of models becomes somewhat more complicated than it was with simple 'one-way causal models'. The effect of feedback structures is, for example, a delay in the influence of A on B and, after another delay, an autocorrelation effect on A. A linear correlation coefficient cannot be used to reflect such phenomena. One could choose time-lagged correlation as a way out, but compared to the standard routines this already requires an advanced use of correlation techniques. The same is true for the solution of the problem of autocorrelation. In each case, the feedback models cannot be validated with the routines used for the validation of uni-causal models.

That is why system dynamics developed other procedures to validate feedback models. Bearing in mind the idea that causal relations are expressed in a nomological network of statements, procedures were aimed at two different levels of validation: the level of theory, and the level of facts.

The first thing that needs to be investigated is the theory: the hypotheses that are derived from the theory, and the structure of the relations between the included variables that they imply. Are all variables included in the model that are relevant for the problem? Are all relevant feedback loops included? Are all included feedback loops relevant for the problem, and do they really exist in the real system the model represents? To continue in a more technical, mathematical sense: do the (difference) equations make sense? Are the parameters of those equations meaningful, and do they relate to realistic values in the real system? Do the dimensions of the variables in the equations fit both sides of the equation without introducing fudge factors?

Next, the behavior of the model has to be validated concerning facts, i.e. observed events. Is the model producing the sequence of events of the real system, and in the right order? Can the model predict events that are meaningful for the problem and which can be observed in reality? Is the model free of anomalies, e.g. free of the production of events that are too extreme to be observed? Is the model "economical" concerning the included feedback loops, i.e. are all these loops necessary to produce all relevant events? Do plausible changes in parameters lead to changes that can be related to corresponding events in the real system?

Finally, the behavior of the model has to be validated concerning possible events. If values of variables are changed, reflecting actions in the real system, does that lead to behavior that can be observed in the real system? Furthermore, is the
effect of such changes invariant under small, plausible changes of the values of parameters in the domain of the variables? (Richardson and Pugh 1981)
3 TOWARDS A NEW METHODOLOGY FOR VALIDITY

These considerations lead us to the conclusion that we are in need of new procedures to validate more complex models. New procedures that imply going beyond Popper's simplified methodology of validation. The systematic approach of Popper, to avoid evident fallacies and to come to a theory of validity that is grounded in logically sound reasoning, remains relevant. But the challenge of concepts such as feedback and non-linearity has to be taken up to produce a meaningful theory for modern science.

3.1 Elementary logic for validity

According to Lakatos, Popper's falsification principle has to be understood in a sophisticated way. One has to compare at least two models and to select the model which best describes and explains the facts. To be able to judge which model is best, one needs a standard, or rather a measure. According to the idea of sophisticated falsification, one component of the measure has to take into account the facts which falsify the model, and the other component has to deal with the facts which verify the model. A conceptual framework to construct such a measure is given in figure 3. For each model a set of statements needs to be defined which contradicts the model, i.e. a set of "forbidden" output. If this set, or a subset thereof, is part of the set of actual observations, the model is falsified. Next to the check for forbidden output, models should be judged according to their adequacy, that is, the ratio of the size of Q to the size of S. Furthermore, the reliability of the model needs to be judged: the ratio of the size of Q to the size of M (see Mankin et al. 1977). These measures are illustrated in a small sketch below.
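A minimal Python sketch of these set-based measures; the concrete sets are hypothetical and serve only to show the computation.

```python
# Hedged sketch of the validation measures described above.
# S: actual observations, M: model output, Q = M ∩ S,
# forbidden: output that would falsify the model.

def validate(model_output, observations, forbidden):
    falsified = bool(forbidden & observations)    # any forbidden output observed?
    q = model_output & observations               # Q: intersection of M and S
    adequacy = len(q) / len(observations)         # |Q| / |S|
    reliability = len(q) / len(model_output)      # |Q| / |M|
    return falsified, adequacy, reliability

S = {"e1", "e2", "e3", "e4"}      # hypothetical actual observations
M = {"e2", "e3", "e5"}            # hypothetical model output
F = {"e9"}                        # hypothetical forbidden output

print(validate(M, S, F))          # (False, 0.5, 0.666...)
```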
Figure 3: Comparing a model with observations (P: set of possible observations; S: set of actual observations; M: set of model output; Q: intersection of M and S; falsifiers: the set of forbidden output)

In the practice of science possible observations are bounded by the available instruments of observation and by the research question. Not all facts of the world are observed, only those which are judged relevant for the interest of the research.⁶ Having a leading question makes it possible to determine and observe relevant facts (relatively) independent of the model and the theory in the background. There is of course an interaction between the theory about the world, in this case the model that represents aspects of the theory, and the observations which are made and compacted in a system of observation. But it is assumed that the coupling between these two sides of science is loose and that the result is an open system. That idea gives rise to another
⁶ Lakatos (1978) described how in science research questions gave rise to both theories and models, as well as to instruments of observation. The latter could be viewed as a "theory of observation".
(dynamic) rule for the evaluation of a theory, i.e. model: a model has to be progressive concerning the new facts coming into the scientific game. The set Q has to grow during the realization of a research program. The application of that rule only makes sense when a model has survived falsification in a minimal way. Besides that, mature science is characterized by not using only one theory, but by letting at least two theories and related models compete. This implies that the above given standard of validation has to be used in a comparative way. This is illustrated in figure 4. The preferred result of such a comparison is that one of the models is both more adequate and more reliable, as in the comparison of the two models pictured in figure 4: model M1 is more adequate and more reliable, and thus preferred.
Figure 4: Comparing two models (Venn diagram of P, S, M1 and M2, with intersections Q1 and Q2)

But one can easily imagine another situation, for example the one illustrated in figure 5.
Figure 5: Comparing two other models (Venn diagram of P, S, M1 and M2, with intersections Q1 and Q2)

In this latter situation M1 is more adequate than M2, but less reliable. Usually one should prefer the more adequate one, because the aim of modeling is to describe and explain as many facts as possible. However, if one has to decide to act on the basis of the prediction of the model, as often occurs in test psychology, one has to evaluate the risk one takes when one supposes that the prediction is right while in reality it is wrong. When the model is not very reliable, the chance is quite high that a prediction of the model is wrong. One then has to deal with the kinds of errors that are recorded in table 1, which were introduced in statistics by Neyman and Pearson (1967).
                                          Prediction is actually right    Prediction is actually wrong
One supposes that prediction is right     No error                        Error of type 1
One supposes that prediction is wrong     Error of type 2                 No error

Table 1: Types of errors
Of course one tries to minimize such errors,⁷ errors that are more likely when the model is less reliable. Statistics could help us avoid getting stuck in a highly adequate but extremely unreliable model, and/or avoid the pitfall of ending with a completely reliable but inadequate model.

To simplify the reasoning involved, it is most of the time assumed that the set of data for falsification or verification has been collected independently of the models.⁸ This implies that the data which are used to estimate parameters, i.e. to calibrate a model, should not be used to verify or falsify the model. There is another complication that makes it not very easy to validate models, whether a single model or models compared with each other: observation of facts and prediction of outcomes of a model cannot be done with absolute accuracy. Because of that, most of the time even falsification cannot be done with absolute certainty. Instead of a clear-cut decision, one has to use the art and science of statistics to calculate degrees of falsification and verification. In the end one has to estimate what can be lost (or won) by acting on the outcomes of a model. Popper suggested that with science one could win information about the world. To get clear what is meant by that, and then not to get lost in a simplified linear view of the world, is more difficult than Popper originally thought, and still seems to be a serious puzzle for modern science.

Anyway, a logical outline has been given of how to validate models. The examples are however still too simple for models of social phenomena. Most of the time a meaningful social model consists of more than two output variables. Besides that, a model with feedback is usually more complicated than outlined in the above linear case. For the set of non-linear models, new mathematics has to be developed to figure out what patterns are generated and to be able to evaluate them. Nevertheless, with the above reasoning a logical framework is set, and this can be used to construct sound techniques for the validation of simulation models. In this
⁷ For example with a technique called 'sequential sampling', invented by the mathematician Wald (1947), which optimizes the decision to choose a model with minimum costs.
⁸ In 'sequential sampling' that assumption is not necessary.
framework more output variables should be included, by extending the two-dimensional space of confrontation of data and model to a higher-dimensional space. Moreover, more mathematically sophisticated validation procedures should be developed to handle non-linearity.
3.2 Statistical measures of similarity for linear models

Fortunately one does not have to construct such procedures from scratch. There are a number of operational techniques that can be used in a reconstruction of sound validation procedures for linear models.⁹ In these procedures the outcome of the model is confronted with facts, i.e. data. A rather classical procedure is to express the similarity of the model-generated and observed trajectories in the distance they have to each other. This can be done by comparing outcomes of variables in the model and the data. Statistics can be introduced by interpreting the model outcomes as expected values of the variable and the data as observed values. One has to assume errors in the measurement of the data, leading to a known distribution of values of the variables. This distribution can be used in such a way that the chance of finding a value in the data that differs from the mean outcome of the model can be calculated. Looking at only one variable, and by means of an example: a rather simple method is to plot the pairs of observations and outcomes of the model at time ti

[...]

combinations may be distinguished. An example with fewer factors (less than, say, fifteen) is the supermarket. In a simulation context, I define DOE as selecting the combinations of factor levels that will actually be simulated when experimenting with the simulation model. To illustrate this problem, I present a case study in the appendix with six factors. After this selection of input combinations, the simulation program is executed or 'run'. Next DOE analyzes the resulting I/O data of the simulation experiment. One goal is to derive conclusions about the importance of the factors; in simulation this is also known as sensitivity analysis, which is related to what-if analysis, optimization, and validation. Unfortunately, there is no standard definition of sensitivity analysis. I define sensitivity analysis as the systematic investigation of the reaction of the simulation responses to extreme values of the model's input or to drastic changes in the model's structure. For example, what happens to the customers' mean waiting time when their arrival rate doubles; what happens if the priority rule is changed by introducing 'fast lanes'? For this analysis, DOE uses regression analysis, also known as Analysis Of Variance or ANOVA. This is based on a metamodel or response surface, which is a model of the underlying simulation model (see Friedman 1996, Kleijnen 1987). In other words, a metamodel is an approximation of the simulation program's I/O transformation. Typically, this model uses one of the following three polynomial approximations:
(i) a first-order polynomial, which consists of an overall or grand mean β0 and k main effects (say) βj with j = 1, ..., k, where k denotes the number of factors;
(ii) the same polynomial augmented with interactions between pairs of factors (two-factor interactions) βj;j' with j' = j + 1, ..., k;
(iii) a second-order polynomial, which adds purely quadratic effects βj;j.
A sketch of fitting such a metamodel is given below.
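As an illustration of such a regression metamodel (not the author's own code; the data-generating function, factor ranges and design grid are hypothetical), the following Python sketch fits a second-order polynomial to simulated I/O data and reports R².

```python
# Hedged sketch: fitting a second-order polynomial metamodel to the I/O of a
# simulation model, here replaced by a hypothetical test function.
import numpy as np

rng = np.random.default_rng(1)

def simulate(z1, z2):
    """Stand-in for one simulation run; this response surface is hypothetical."""
    return 2.0 + 1.5 * z1 - 0.8 * z2 + 0.6 * z1 * z2 + rng.normal(0, 0.1)

# Factor combinations selected by some experimental design (here a small grid).
Z = np.array([(z1, z2) for z1 in (-1, 0, 1) for z2 in (-1, 0, 1)], dtype=float)
y = np.array([simulate(z1, z2) for z1, z2 in Z])

# Regressors of the second-order polynomial: 1, z1, z2, z1*z2, z1^2, z2^2.
X = np.column_stack([np.ones(len(Z)), Z[:, 0], Z[:, 1],
                     Z[:, 0] * Z[:, 1], Z[:, 0] ** 2, Z[:, 1] ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
r2 = 1 - residuals.var() / y.var()     # multiple correlation coefficient R^2
print("estimated effects:", np.round(beta, 2), " R^2 =", round(r2, 3))
```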
Notice that a first-degree polynomial misses interactions and has constant marginal effects; a third-order polynomial would be more difficult to interpret and would need many more simulation runs to estimate the many effects. So a second-order polynomial may be a good compromise, depending on the goal of the metamodel. The validation of this metamodel (not the underlying simulation model, which is the focus of this chapter) may use the well-known multiple correlation coefficient R². An example is given in the appendix. More refined tests (such as cross-validation and Rao's F test) are given in Kleijnen and Sargent (1998). Applications of these linear regression metamodels are abundant: see the references above. These applications cover both discrete-event simulations (such as queueing models) and non-linear differential models. Other types of metamodels are splines and neural networks. I repeat that this type of metamodeling aims at the investigation of the effects of changing factors. A different type of metamodel aims at studying the dynamic behavior (including instabilities) for given factor values; see Young (1997).

DOE and its regression analysis treat the simulation model as a black box: the simulation model's I/O is observed, and the factor effects in the metamodel are estimated. This approach has advantages and disadvantages. An advantage is that DOE can be applied to all simulation models, either deterministic or stochastic, discrete-event or continuous. A disadvantage is that DOE cannot exploit the specific structure of a given simulation model.

DOE is a classic topic in statistics. However, the standard statistical techniques must be adapted such that they account for the following simulation peculiarities.
(i) There are a great many factors in many practical simulation models. For example, the ecological case study (mentioned above) has 281 factors, whereas standard DOE assumes only up to (say) fifteen factors. 'Screening' aims at finding a short list of really important factors; again see Bettonvil and Kleijnen (1997).
(ii) Stochastic simulation models use pseudorandom numbers, which means that analysts have much more control over the noise in their experiments than they have in standard statistical applications. For example, to reduce that noise, analysts may use so-called common and antithetic pseudorandom numbers (sketched below).
(iii) Randomization is of major concern in DOE outside simulation (starting with Fisher). In simulation, however, this randomization problem disappears: pseudorandom numbers take over.
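A minimal sketch of point (ii): antithetic pseudorandom numbers pair each uniform draw u with 1 - u, inducing negative correlation between paired responses and thus reducing the variance of an estimated mean. The response function here is a hypothetical stand-in for a simulation output.

```python
# Hedged sketch of antithetic pseudorandom numbers for variance reduction.
import numpy as np

rng = np.random.default_rng(123)

def response(u):
    """Hypothetical stand-in for a simulation response driven by uniforms."""
    return np.exp(u)   # any monotone transformation illustrates the effect

n = 10_000
u = rng.random(n)

crude = response(rng.random(2 * n))              # 2n independent draws
pairs = 0.5 * (response(u) + response(1 - u))    # n antithetic pairs

print("crude mean     :", crude.mean(), " est. variance ~", crude.var() / (2 * n))
print("antithetic mean:", pairs.mean(), " est. variance ~", pairs.var() / n)
```

Both estimators target the same mean, but the antithetic estimator's variance is markedly smaller because the paired responses are negatively correlated.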
Validation of Simulation Models The regression metamodel shows which factors are most important; that is, which factors have highly significant regression estimates in the metamodel. If possible, information on these factors should be collected, for validation purposes. (If the significant factors are controllable by the users, then the estimated regression effects show how to change these factors to optimize the real system; see Kleijnen and Pala 1998 for an application.) DOE assumes that the area of experimentation is given. A valid simulation model, however, requires that the inputs be restricted to a certain domain of factor combinations. This domain corresponds with the experimental frame in Zeigler (1976)’s seminal book on modeling and simulation. Simulation models are often used in risk analysis: what is the probability of a ‘disaster’? That disaster may be a nuclear accident, an ecological collapse, a financial mis-investment, etc. I emphasize that these disasters are unique events, whereas the supermarket simulation concerns repetitive events (e.g., customer waiting times). Consequently, validation in risk analysis is very difficult; see Jansen and De Vries (1998) in this book. A better term may be credibility; also see Fossett, Harrison, Weintrob, and Gass (1991). A technical issue in risk analysis is that DOE would select extreme combinations of factor values, which typically have extremely low probability of realization. Instead, risk analysis samples from the whole domain of possible combinations, according to a prespecified (joint) probability distribution. This sampling uses the Monte Carlo technique (sometimes refined to Latin hypercube sampling or LHS). Next, analysts improve the risk model’s credibility by applying statistical techniques. For example, regression analysis and contingency tables may detect which factors have significant effects; next analysts - using their expert knowledge - should be able to explain why these factors are important. An example is the case study on nuclear waste disposal in the waste-isolation pilot-plant (WIPP) near Carlsbad, New Mexico (NM), USA. A model was developed at Sandia National Laboratories (SNL) in Albuquerque (NM). The Environmental Protection Agency (EPA) will give permission to start using the WIPP, only if the WIPP simulation model is accepted as credible - and the model’s output shows an acceptable risk. See Helton, Anderson, Marietta, and Rechard (1997) and Kleijnen and Helton (1998). The importance of sensitivity analysis in validation is also emphasized by Fossett et al. (1991); they present three military case studies. Another case study that does
explicitly demonstrate the role of DOE and regression analysis in validation is the ecological simulation in Bettonvil and Kleijnen (1997) and Kleijnen, Van Ham, and Rotmans (1992). The regression metamodel in the latter article helped to detect a serious error in the simulation model: one of the original modules should be split into two modules. Both publications further show that some factors are more important than the ecological experts originally expected. This 'surprise' gives more insight into the simulation model. I present another application in the appendix.
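The following Python sketch illustrates Latin hypercube sampling as just described; the two factor ranges are hypothetical, and for real studies a library implementation (for example in scipy) offers equivalent functionality.

```python
# Hedged sketch of Latin hypercube sampling (LHS): each factor's range is cut
# into n equiprobable strata, one draw is taken per stratum, and the strata
# are randomly paired across factors. Factor ranges are hypothetical.
import numpy as np

rng = np.random.default_rng(7)

def latin_hypercube(n, k):
    """Return an (n, k) LHS sample on the unit cube [0, 1)^k."""
    sample = np.empty((n, k))
    for j in range(k):
        strata = rng.permutation(n)                  # random pairing across factors
        sample[:, j] = (strata + rng.random(n)) / n  # one draw per stratum
    return sample

# Map the unit cube to hypothetical factor domains.
lhs = latin_hypercube(n=100, k=2)
arrival_rate = 0.5 + lhs[:, 0] * (2.0 - 0.5)   # factor 1 in [0.5, 2.0)
service_rate = 1.0 + lhs[:, 1] * (3.0 - 1.0)   # factor 2 in [1.0, 3.0)
print(arrival_rate[:5], service_rate[:5])
```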
3 REAL OUTPUT DATA AVAILABLE: CLASSIC STATISTICAL TESTS

In the Introduction (§1), I mentioned that even if real output data are available, the corresponding real environment may not be measured. As an example I mentioned the sonar search, in which certain characteristics of the sea water were not measured; the real output - namely the detection of mines - is measured. This real output can be compared with the final output of the total simulation model (for the intermediate outputs of the individual modules I refer to the appendix).

Let's first return to the familiar supermarket example. Suppose that the real and simulated outputs (say) x and y are the average waiting times of the Tx and Ty customers served per day: x = (1/Tx) Σ_{t=1}^{Tx} Wt and y = (1/Ty) Σ_{t=1}^{Ty} Vt in the real and the simulated worlds respectively. Suppose further that n days are simulated and m days are observed in the real system. Assume that each day gives an independent and identically distributed (i.i.d.) observation (no seasonality; only busy Saturdays simulated and measured). Define µd = µx - µy. Then the n and m observations give the classic estimators x̄, ȳ, sx² and sy² of the means and variances of x and y. These estimators yield Student's t statistic with n + m - 2 degrees of freedom:

t(n+m-2) = [(n + m - 2)nm]^(1/2) (x̄ - ȳ - µd) / ( [(n - 1)sx² + (m - 1)sy²]^(1/2) (n + m)^(1/2) )     (1)

Obviously, the null hypothesis is that simulated and real average waiting times per day are equal; that is, H0: µd = 0. The power of this test increases as, in equation (1), µd increases (bigger differences are easier to detect), n or m increases (more days simulated or measured), or σx or σy decreases (less noise: more customers per day or
lower traffic rate). Because σd² = σx² + σy² - 2 cov(x, y), the analysts may try to create a positive covariance (or correlation) through the use of trace-driven simulation: see the next section (§4). Notice that a difference such as x̄ - ȳ may be non-significant and yet important: if only a few days are simulated or there is much noise, then an important difference µd may go undetected. The reverse is also possible.

The sonar case study (mentioned before) gives a binary variable: detect or miss a mine. The n simulation runs give a binomial variable with parameters n and (say) p, the detection probability. Analogously, the field test gives a binomial variable with parameters m and q. To test the null hypothesis of equal simulated and real probabilities (H0: p = q), Kleijnen (1995a) uses the t-statistic as an approximate test.

Unfortunately, the t test assumes that the outputs are normal (Gaussian) besides i.i.d., denoted as n.i.i.d. Simulation models usually give non-normal, autocorrelated, possibly non-stationary outputs. A simple solution is available if the simulation is terminating: then each simulation run gives i.i.d. outputs, and the t statistic is known to be not very sensitive to non-normality. Besides the t test, distribution-free tests (such as the rank test) may be applied; see Conover (1971). In practice, however, these tests are rarely applied - unfortunately. In non-terminating simulation, analysts may try to create i.i.d. observations through the batching or subrun approach; see Kleijnen (1987), Law and Kelton (1991). One more alternative statistical technique is bootstrapping, which is a type of Monte Carlo simulation; see Efron and Tibshirani (1993). A small numerical sketch of the two-sample comparison in equation (1) follows.
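A minimal sketch of the comparison in equation (1); the daily averages below are synthetic placeholders, not the case-study numbers.

```python
# Hedged sketch of the two-sample t test of equation (1): comparing n simulated
# and m real daily averages. The data are synthetic placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y_sim = rng.normal(loc=4.9, scale=1.0, size=25)    # n = 25 simulated days
x_real = rng.normal(loc=5.0, scale=1.2, size=15)   # m = 15 observed days

# Pooled-variance two-sample t test of H0: mu_d = mu_x - mu_y = 0.
t, p = stats.ttest_ind(x_real, y_sim, equal_var=True)
print(f"t = {t:.3f}, p = {p:.3f}")                 # reject H0 if p < alpha
```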
4 REAL I/O DATA AVAILABLE: TRACE-DRIVEN SIMULATION

In the introduction (§1), I claimed that it is wrong to analyze a trace-driven simulation by making a scatter plot with the real and simulated outputs (say) x and y, fitting a line y = β0 + β1 x, and testing whether β1 = 1 and β0 = 0; again see Figure 2. Now let ρxy denote the classic linear correlation coefficient between x and y. Suppose the real and the simulated outputs have equal positive means: µx = µy = µ > 0. Then it is easy to prove that imperfect correlation (ρxy < 1) gives 0 < β1 < 1 and 0 < β0 < µ. So the naive regression analysis of the trace-driven simulation is indeed wrong.
Figure 3: Estimated power of the naive and the novel test for logarithmically transformed real and simulated outputs x and y, with varying µs (simulated MST, plotted from 0.40 to 0.60) and the real MST fixed at 0.5, for T = 10 and T = 1000 (jobs per day) in two panels, given n = 10 (days) and α = 0.10 (type I error rate) (Source: Kleijnen et al. 1998, Figure 2)

Kleijnen, Bettonvil, and Van Groenendaal (1998) propose the following test. Compute not only the n differences di (also see equation 1 with n = m), but also the n sums (say) qi = xi + yi. Now make a scatter plot with these differences and sums, fit a line d = γ0 + γ1 q, and test H0: γ0 = 0 and γ1 = 0. Obviously, this (joint, composite) hypothesis implies µd = 0 or µx = µy. Moreover, assuming normality for x and y, it is easy to prove that γ1 = 0 implies equal variances: σx² = σy². To test the joint hypothesis, the analysts can use standard regression software (which uses an F test); a sketch is given at the end of this section.

Kleijnen et al. (1998) evaluate both the naive and the novel regression analyses, applying them to single-server systems with Poisson arrivals and exponential service times (Markov systems with one server, or M/M/1); these systems are terminating, since each day stops after T customers (jobs). This gives the following conclusions; also see Figure 3, where MST denotes mean service time:
(i) the naive test rejects a valid simulation model substantially more often than the novel test does;
(ii) the naive test shows perverse behavior in a certain domain: the worse the simulation model, the higher its probability of acceptance; and
(iii) the novel test does not reject a valid simulation model too often, provided the outputs are transformed logarithmically to realize normality.
These regression analyses assume n.i.i.d. real and simulated outputs (as did the t test in the preceding section). Currently Kleijnen, Cheng, and Bettonvil (1998) are developing a test for non-normal observations; that test uses bootstrapping.
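A minimal sketch of the novel test under the stated n.i.i.d. assumption; the paired real and simulated outputs are synthetic placeholders, and the joint F test is computed with standard regression software, here statsmodels.

```python
# Hedged sketch of the Kleijnen-Bettonvil-Van Groenendaal test for trace-driven
# simulation: regress the differences d = x - y on the sums q = x + y and test
# the joint hypothesis gamma0 = gamma1 = 0. Data are synthetic placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x = rng.lognormal(mean=1.0, sigma=0.3, size=n)       # 'real' daily outputs
y = x * rng.lognormal(mean=0.0, sigma=0.1, size=n)   # 'simulated', trace-driven

# Log-transform to realize (approximate) normality, as recommended above.
lx, ly = np.log(x), np.log(y)
d, q = lx - ly, lx + ly

res = sm.OLS(d, sm.add_constant(q)).fit()
f_res = res.f_test("const = 0, x1 = 0")   # joint F test of both coefficients
print(f_res)                              # a small p-value rejects validity
```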
5 CONCLUSIONS

In practice, validation has many forms, but I focussed on validation through mathematical statistics. Such validation gives quantitative information on the quality of the simulation model (other types of validation - such as animation - give only 'face' validity). Statistical validation may use various tests, depending on the type of data available for the real system. I distinguished the following three situations.
(i) No real data. Then the analysts can still generate simulated data. Their simulation experiment should be guided by DOE; an inferior approach changes only one factor at a time. Regression models provide approximations (metamodels) of the simulation's I/O transformation, and may show which factors are important.
(ii) Only data on real output. Real and simulated outputs may be compared through the Student t test. Alternatives are distribution-free and bootstrap procedures, which - unfortunately - are applied rarely.
(iii) I/O data on real system. Real I/O data enable trace-driven simulation. The validation of trace-driven simulation, however, should not use a scatter plot with real and simulated outputs, fit a line, and test whether that line has unit slope and zero intercept. Instead, two alternatives were discussed: alternative #1 regresses sums and differences; it applies if the outputs are n.i.i.d. Alternative #2 applies bootstrapping; this alternative is still under investigation.
I referenced several case studies to demonstrate the applicability of the various statistical methods. Nevertheless, I think that because validation involves the art of modeling and the philosophy of science, validation will remain controversial.
References

Beck, M.B., J.R. Ravetz, L.A. Mulkey, and T.O. Barnwell (1997), On the problem of model validation for predictive exposure assessments. Stochastic Hydrology and Hydraulics, 11, pp. 229-254
Bettonvil, B. and J.P.C. Kleijnen (1997), Searching for important factors in simulation models with many factors: sequential bifurcation. European Journal of Operational Research, 96, no. 1, pp. 180-194
Conover, W.J. (1971), Practical Non-parametric Statistics. Wiley, New York
Efron, B. and R.J. Tibshirani (1993), Introduction to the Bootstrap. Chapman & Hall, London
Fossett, C.A., D. Harrison, H. Weintrob, and S.I. Gass (1991), An assessment procedure for simulation models: a case study. Operations Research, 39, pp. 710-723
Friedman, L.W. (1996), The Simulation Metamodel. Kluwer, Dordrecht, Netherlands
Helton, J.C., D.R. Anderson, M.G. Marietta, and R.P. Rechard (1997), Performance assessment for the waste isolation pilot plant: from regulation to calculation for 40 CFR 191.13. Operations Research, 45, no. 2, pp. 157-177
Janssen, M. and B. De Vries (1998), Global modelling: managing uncertainty, complexity and incomplete information. In: Validation of Simulation Models, eds. C. van Dijkum, D. DeTombe, and E. van Kuijk, SISWO, Amsterdam
Kleijnen, J.P.C. (1998), Experimental design for sensitivity analysis, optimization, and validation of simulation models. In: Handbook of Simulation, ed. Jerry Banks, Wiley, New York
--- (1995a), Case study: statistical validation of simulation models. European Journal of Operational Research, 87, no. 1, pp. 21-34
--- (1995b), Verification and validation of simulation models. European Journal of Operational Research, 82, no. 1, pp. 145-162
--- (1987), Statistical Tools for Simulation Practitioners. Marcel Dekker, New York
---, B. Bettonvil, and W. Van Groenendaal (1998), Validation of trace-driven simulation models: a novel regression test. Management Science, 44, no. 6, pp. 812-819
---, R.C.H. Cheng, and B. Bettonvil (1998), Validation of trace-driven simulation models: bootstrapped tests. Working paper (in preparation)
--- and J. Helton (1998), Statistical analysis of scatter plots to identify important factors in large-scale simulations. Sandia National Laboratories, Albuquerque, New Mexico
--- and Ö. Pala (1998), Maximizing the simulation output: a competition. Kwantitatieve Methoden (accepted)
--- and R.G. Sargent (1998), A methodology for the fitting and validation of metamodels in simulation. European Journal of Operational Research (accepted conditionally)
---, G. Van Ham, and J. Rotmans (1992), Techniques for sensitivity analysis of simulation models: a case study of the CO2 greenhouse effect. Simulation, 58, no. 6, pp. 410-417
Kozempel, M.F., P. Tomasula, and J.C. Craig (1995), The development of the ERRC food process simulator. Simulation Practice and Theory, 2, no. 4-5
Law, A.M. and W.D. Kelton (1991), Simulation Modeling and Analysis, second edition. McGraw-Hill, New York
Lysyk, T.J. (1989), Stochastic model of Eastern spruce budworm (Lepidoptera: Tortricidae) phenology on white spruce and balsam fir. Journal of Economic Entomology, 82, no. 4, pp. 1161-1168
Pirsig, R.M. (1974), Zen and the Art of Motorcycle Maintenance; An Inquiry into Values. The Bodley Head, London
Sargent, R.G. (1996), Verifying and validating simulation models. Proceedings of the 1996 Winter Simulation Conference, eds. J.M. Charnes, D.M. Morrice, D.T. Brunner, and J.J. Swain, pp. 55-64
Young, P. (1997), Data-based mechanistic modelling of environmental, ecological, economic and engineering systems. MODSIM 97, International Congress on Modelling and Simulation Proceedings, Volume 4, eds. A.D. McDonald and M. McAleer
Zeigler, B. (1976), Theory of modelling and simulation. Wiley Interscience, New York
APPENDIX
CASE STUDY: MINE HUNTING ON THE HIGH SEAS

Explosives on the sea bottom may be detected (hunted) by means of sonar. This sonar may be compared with a torch light that shines at the sea bottom. A simulation model called HUNTOP (mine HUNTing OPerations) was developed for the Dutch navy by the Physics and Electronics Laboratory of the Netherlands organisation for Applied Scientific Research (TNO/FEL).

Kleijnen (1995a) validates this HUNTOP model in two stages. In stage #1 the individual modules are validated. (In stage #2 the total simulation model is treated as one black box and validated; that stage is discussed in §3.) Some of these modules give intermediate output that is hard to observe in practice, and hence hard to validate. Therefore sensitivity analysis is applied to these modules: check whether the factor effects have signs that agree with the experts' prior qualitative knowledge. For example, deeper water gives a wider sonar window; see the main effect β2 in the sonar window module below. Because of time constraints, only the following two modules are validated; I use the symbol z for the original (non-standardized) factor values.

(i) Sonar window module
The sonar rays hit the bottom at an angle determined deterministically by three factors, namely z1 or Sound Velocity Profile (SVP), z2 or average water depth, and z3 or tilt angle. Suffice it here to state that the SVP defines sound velocity (and hence the path of the sonar rays) as a function of sea depth; the SVP is treated as a qualitative factor. The output of the sonar window module, say y, is the minimum distance of the area on the sea bottom that is 'insonified' by the sonar beam ('lighted by the torch').

Consider a set of second-degree polynomials in the two quantitative factors z2 and z3, namely one polynomial per SVP type z1. Each polynomial has six parameters: β0, β2, β3, β2;3, β2;2, and β3;3. To estimate these parameters, Kleijnen (1995a) uses the classical central composite design for two factors, which has nine input combinations; see Table 1 for the standardized values, where + denotes +1, - denotes -1, and c = √k (k denotes the number of factors; here k = 2). (Other values for c can be found in the literature; I propose √k because √k is also the distance to the origin of the combinations with all factors at the absolute value 1; a simulation model is valid only within its experimental frame.)
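Written out in full for one SVP type, this metamodel reads as follows (a reconstruction from the parameter list above, not a formula printed as such; e denotes the approximation error of the polynomial):

$$ y = \beta_0 + \beta_2 z_2 + \beta_3 z_3 + \beta_{2;3}\, z_2 z_3 + \beta_{2;2}\, z_2^2 + \beta_{3;3}\, z_3^2 + e $$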
Combination:   1    2    3    4    5    6    7    8    9
Factor 1:      +    -    +    -    c   -c    0    0    0
Factor 2:      +    +    -    -    0    0    c   -c    0
Table 1: Central composite design for two factors in standardized values (+ denotes +1, - denotes -1, and c = √k with k the number of factors)

The fitted polynomial turns out to give an acceptable ('valid') approximation: the multiple correlation coefficient R² ranges between 0.96 and 0.98 for the four SVPs simulated. Expert knowledge suggests that certain factor effects have specific signs, namely β2 > 0, β3 < 0, and β2;3 < 0. Fortunately, the corresponding estimates turn out to have the correct signs. So this module has the correct I/O transformation, and its validity need not be questioned. (The quadratic effects turned out to be non-significant, so in hindsight simulation runs could have been saved: a smaller design, namely only the first four runs in Table 1, would have sufficed.)

(ii) Visibility module
An object is visible if it is within the 'sonar window' ('torch light circle') and it is not concealed by the bottom profile (for example, a mine may be hidden behind a hill). The output of this module is the time that the object is visible, expressed as a percentage of the time it would have been visible had the bottom been flat. Kleijnen (1995a) varies six inputs, again fits a quadratic polynomial, and again uses a central composite design. This polynomial has 28 regression parameters: one grand mean, six main effects, fifteen two-factor interactions, and six quadratic effects. The classical central composite design for six factors has as many as 77 input combinations. It turns out that R² is 0.86, and that the factor 'upward hill slope' has no significant effects at all: no main effect, no interactions with the other factors, no quadratic effect. These results agree with the experts' qualitative knowledge. So the validity of this module is not questioned either.

I emphasize that central composite designs require many simulation runs. If the computer budget is tight, then alternative designs may be constructed. For example, Kleijnen and Pala (1998) derive a saturated design, that is, a design with the number of runs equal to the number of factor effects to be estimated.
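The procedure of this stage, namely constructing the design, fitting the polynomial, and checking R² and the signs of the estimated effects, can be summarized in a short script. The following is only a minimal sketch, not the actual HUNTOP or Kleijnen (1995a) code: the function module below is a hypothetical stand-in for one simulation module, with made-up coefficients, and the sign expectations are the ones quoted above for the sonar window module.

import numpy as np

k = 2
c = np.sqrt(k)  # axial distance: star points at the same radius sqrt(k) as the factorial points

# The nine combinations of Table 1: 2^2 factorial points, 4 axial ('star') points, 1 center point
design = np.array([
    [+1.0, +1.0], [-1.0, +1.0], [+1.0, -1.0], [-1.0, -1.0],  # factorial part
    [c, 0.0], [-c, 0.0], [0.0, c], [0.0, -c],                # axial part
    [0.0, 0.0],                                              # center point
])

def module(z2, z3):
    # Hypothetical stand-in for one simulation module (e.g. the sonar window);
    # in practice each row of the design would mean one run of the simulation program.
    return 5.0 + 2.0 * z2 - 1.5 * z3 - 0.8 * z2 * z3

y = np.array([module(a, b) for a, b in design])

# Regression matrix for y = b0 + b2*z2 + b3*z3 + b23*z2*z3 + b22*z2^2 + b33*z3^2
z2, z3 = design[:, 0], design[:, 1]
X = np.column_stack([np.ones(len(y)), z2, z3, z2 * z3, z2 ** 2, z3 ** 2])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates of the six parameters
residuals = y - X @ beta
r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

print("estimates (b0, b2, b3, b23, b22, b33):", np.round(beta, 3))
print("R^2 =", round(r_squared, 3))
# Validation check: do the estimated signs agree with the experts' qualitative
# knowledge, here b2 > 0, b3 < 0, and b23 < 0?
print("signs agree with experts:", beta[1] > 0 and beta[2] < 0 and beta[3] < 0)

For the visibility module, with k = 6, the same script would need a design with 77 rows and a regression matrix with 28 columns (1 + 6 + 15 + 6 effects); a saturated design in the sense of Kleijnen and Pala (1998) would instead use exactly 28 runs.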
OUTSIDE VALIDITY OF RESEARCH OF COMPLEX SOCIETAL PROBLEMS Dorien J. DeTombe
1 INTRODUCTION
In the field of methodology much effort is devoted to describing how to conduct scientifically sound research. The difference between just an opinion and a scientific opinion (view) lies mainly in the way the opinion is derived. When the result of a study is based on research that is done according to the carefully described guidelines of scientific research, the results can be called scientific. In everyday words this can be translated to 'this is the truth'. Scientists would immediately comment on such a judgment, pointing to the many subjective issues that influence the results. Postmodernists would even discuss whether there is a truth at all, constructivists might argue that truth is constructed, and so on; see the work of Van Dijkum & Wallner (1992), Habermas (1987), Habermas & Luhmann (1971), Luhmann (1990), and Leydesdorff (1996, 1997). However, in the regular stream of social science research and in social science methodology, many researchers agree that when the methodological guidelines are followed a study can be called scientific. These guidelines, and how they should be applied, are described in many methodology textbooks; every self-respecting university has produced such a book. However, looking at the issue of validity, we see that the concept of validity is only related to the research performed, and not to the decisions taken before or after the actual study. In discussing the subject of validity in relation to complex societal problems we will focus on the decisions a researcher makes before formulating the research question for the actual study.
2 VALIDITY
Validity is often discussed in relation to utility, accuracy, stability and reliability (De Groot, 1969, chapter eight, pp. 239-283; Segers, 1977; Swanborn, 1987; 't Hart, Van Dijk, De Goede, Jansen & Teunissen, 1996, p. 186). Some textbooks make a distinction between external and internal validity; see e.g. Swanborn (1987, p. 189).¹
External validity is related to several moments in the research performed. Some examples of external validity are the validity of the operationalization of the variables, the representativeness of the sample, and the way the outcome of the research can be generalized to other situations.² These three aspects of validity are related. The operationalization of concepts and constructs into really observable variables is the first condition; without it, the further steps are useless for obtaining a scientifically sound result. This is followed by the representativeness of the sample. The third step is the generalization of the research outcome. The demands for validity are cumulative: the more of them are fulfilled, the wider the range of the research results. Each step gives a higher degree of validity. External validity is also related to observation, to the research set-up, and to issues like socially desirable answers and predictability (Swanborn, 1987, p. 189).³
Internal validity is related to the quality of the conclusion of the research project (Swanborn, 1987, p. 189). Internal validity is related to statistical validity: 'Is there a relation between two variables?' Internal validity is also related to the validity of the causal interpretation: 'Is the presumed causal relation also the real causal relation?'
¹ The differentiation between external and internal validity comes from experimental psychology.
² See for instance the methodological textbooks of De Groot (1969), Segers (1977), Swanborn (1987) and 't Hart et al. (1996) for a more extended description of validity.
³ The difference between internal and external validity is not always made; in some textbooks on methodology the differentiation is absent (see 't Hart et al.). However, it is relevant to describe the difference in relation to the terms inside and outside validity.
3 OUTSIDE VALIDITY
However carefully validity is described, and however difficult it is to achieve validity as described above, internal and external validity are still not enough for valid research. The concept of validity is limited to the research performed and leaves out the issues that were not part of it. In the validation of the research, the issues that are not included in the actual study, the pre-decisions, are not included in the evaluation. However, these pre-decisions can be crucial to the validity of the research. In order to give this discussion a vocabulary, we call internal and external validity together inside validity, and we call the reflection on the issues that are left out of the actually performed research outside validity (see DeTombe, 1998). By imagining the research as a horizontal line, we locate outside validity at the two ends of the research, before and after the research is actually performed (see Figure 1).

outside validity      [ inside validity ]      outside validity
________________________________________________________________
start research        performed research        end research

Figure 1: Inside and outside validity

The research starts with questions regarding the selection of the boundaries, methods and tools of the research: which phenomena will be included and which excluded, and which methods and tools should be selected for doing the research? Reflecting on which phenomena, concepts and parts of reality are not included in the research, and on why and how these decisions are made, is the domain of outside validity. Outside validity refers to the boundaries of the research, to the phenomena, concepts and areas that are included or excluded. This also includes the selected research methods and tools.

3.1 Selecting the boundaries of research
In complex societal problems, like the expansion of an airport or the building of a new railroad in a crowded area, many issues about the phenomena and the data are unclear and uncertain. It is often even unclear which actors are involved. Problem handling
starts with defining the problem. This includes defining the boundaries of the problem (see for this discussion DeTombe, 1994; DeTombe & 't Hart, 1996). The boundaries of a problem, what to include and what to exclude, are often difficult to select. It is important to select the boundaries of a problem neither too narrowly nor too widely: when the boundaries are too narrow the focus is too limited, when they are too wide the problem is too complicated to handle. These decisions, the discussion, and the reasons behind them must be included in the research report in order to make it possible to evaluate the outside validity. The concept of outside validity makes it possible to evaluate research on this issue.
The necessity of outside validation can be illustrated by reflecting on a research report about the extension of Schiphol. The Dutch national airport Schiphol is located in a crowded area, which contains much road and rail traffic, many houses, and farmland. In this situation any extension of the airport, or the building of a new one, is problematic. However, no extension is also problematic, for fear of losing economic benefit. In the Netherlands there is a long-running discussion about this issue, in which many actors are involved, each with their own goals, their own agendas, their own interests, their own power, and their own possibilities to support or to prevent certain solutions. In this discussion, which has been going on for many years, many researchers and research bureaus have been asked to reflect on the problem: well-respected and famous large bureaus as well as small unknown ones, scientifically sound bureaus as well as bureaus that primarily have commercial interests.
One bureau was invited by a problem owner, a governmental department, to do research on the effect of the extension of flight routes on the pollution experienced by the people living near the airport. The research was limited to noise pollution. The result of the research was that the noise pollution for these people did not exceed the legal norms decreed by the government. This research result was gladly accepted by the problem owner. However, the result was presented in a way that suggested that the extension of the airport was no problem at all. The research outcome for one selected variable, the noise pollution, was invalidly generalized to the many variables that could influence the decision about extension. Although the report did mention that only one variable was reflected, the way the report was written could easily be read as if all the variables had been reflected (Van der Riet, 1998; Rooze, 1998). If we were to reflect only on inside validity, this research was done validly. However, when outside validity is included, the
limitation of the research to only one variable, the noise pollution, should have been argued in view of the way the report was written. Why were other, obviously important, variables not included, and how would these variables have influenced the report? Variables such as environmental pollution (e.g. caused by kerosene), the danger of too-low flights, or accidents during landing and take-off. Including other variables would give another research outcome. Excluding these variables on purpose, and more or less disguising the limited view in the research report, gives a negative outside validity for the study. Excluding variables is legitimate, but then the research report should contain a clear justification of the limitations of the research.
The same can be said about including or excluding certain phenomena. For instance, in analyzing complex societal problems the formal as well as the informal procedures should be analyzed. In reflecting on the research question 'Why is an underground parking lot built in town A and not in town B?', one should not limit the research to formal reports and legitimate acts only. The legal as well as the illegal acts of the legal as well as the semi-legal actors should be included. Just focusing on the legal side of society often gives a biased view of the problem and of the decisions taken. So not including these aspects can give a negative outside validity.
3.2 Selecting research methods and tools
Researchers are trained to carefully select the methods and tools needed for research. In the everyday practice of scientific research we see that most researchers use domain-related methods and tools, mostly those they are familiar with. When the research is more complicated or goes beyond the boundaries of their field, they often call for the support of methodologists. In the field of handling complex societal problems a wide range of methods and tools is used. Each method and tool has its own specialties and limitations. Selecting a method or tool also depends on the time and money available. Knowing what methods and tools to use, what the effect of a certain method or tool is, and in what circumstances it can be used, is a skill in itself. Each method and tool will influence the outcome of the research. When the selection of
methods and tools is done carefully in relation to the problem, and the research is performed in a valid way, one can consider the research valid on this point. However, when the researcher selects the wrong method or tool on purpose in order to influence the results of the research, there is a negative outside validity.
3.3 Whose view on reality is represented?
Research concerned with analyzing, guiding and evaluating complex societal problems is often done from the view of a problem owner. In complex societal problems many actors are involved, each with their own view on the problem. In presenting the results of research done for one of these actors, one must make it very clear that the research is based on the problem definition of that particular problem owner, and thus that the research results represent only that view. Generalizing that particular view to 'this is the problem, that is the solution' would give a negative outside validity.
In recent years consulting has become a real opportunity for making easy money, also for universities. Because university researchers have the task of earning (a part of) their own funding, scientists are involved in acquiring assignments. A positive result for the problem owner, or presenting the problem owner's view as a scientifically independent view, will increase the possibility of a second assignment. So there is a temptation to tone down or avoid a negative research result, or to present the problem owner's view as a scientifically independent one. This will certainly influence the outside validity of the research negatively.
3.4 Presenting research results
Outside validity is also influenced by the way the research results are presented. The selection of certain methods and tools, the limited time, and the selection of data should at least be made clear in presenting the results. Downplaying or neglecting these issues influences the outside validity negatively.
4 HIDDEN AGENDAS IN DOING RESEARCH
There are many reasons for purposely abusing the validity of research. One reason can be changing a negative research result into a positive one in order to make the research look more important. In the area of medical research this sometimes happens, often when money or honor is involved.
Another reason for abusing the validity of research is political. The 20th century has many examples in which so-called scientific research has been misused for political goals. Research aiming at 'proving' that certain groups of people are inferior to others, such as attempts in the field of psychology to 'prove' that women are inferior to men, blacks to whites, or Jews to Aryans, is an example of this. Such research often has a negative outside validity.
A more subtle form of political influence can be seen in a recent example of a Dutch sociologist who presented results of research in Brazil on the relation between fathers and sons with respect to school results as the results of research on children and parents in relation to school results in general.⁴ Here there is not only a preoccupied male view, but also a total blind spot for real life. This way of doing research results in a negative outside validity. Due to the influence of feminists and of women in science, this kind of negative outside validity, based on male superiority, is nowadays more and more diminished in social science.

⁴ Brazil is a country in which about 50% of the children are raised by single mothers. In such countries fathers contribute not at all, or only slightly, to the children's education.

A clear example of a negative outside validity can be seen in the reports of the research involved in the (political) nature and nurture debate. The discussion was about the justification of the difference in societal position between the (white) middle class and the (black) lower class. The question was whether the better positions given to (white) middle-class people and the lesser positions given to (black) lower-class people could be justified on the basis that (white) middle-class people are more intelligent than (black) lower-class people; thus, whether the societal structure reflects a natural structure. Other political groups tried to
prove that the difference in position was due to nurture and to the chances of being educated. An extreme amount of effort was devoted to proving that white middle-class men deserved their position because they were more intelligent than others. One way to prove this was to study monozygotic twins who were separated from the beginning of their lives. In those cases the starting position is exactly the same, so all differences can be ascribed to the influence of nurture. Much money was spent to prove that nature was dominant over nurture. One researcher was particularly successful in finding such twins and doing research with them. His results were published, studied and cited all over the world, by proponents and by opponents. For a long time, the people who politically believed in, or benefited from, the emphasis on the dominance of nature based their ideas especially on this twin research. Only after decades was it shown that the researcher was a fraud: the research had actually never taken place. The reason for doing this could have been political benefit, personal fame, or money. The inside validity of the research was well taken care of; however, there was a negative outside validity.
Science is part of the real world and therefore also vulnerable to jealousy, lies, and all kinds of fraud. This often influences outside validity negatively. Habermas and the Frankfurter Schule (1987) already made it clear that the side you are on influences the kind of research you are doing, and sometimes also the way you do it. It involves the selection of research topics, and for whom you want to work. Here outside validity is related to the kind of research a group or a field is doing, which research is performed, and how it is performed.

5 CONCLUSIONS
Labeling a methodological issue makes it easier to examine that issue. The discussion of outside validity makes it possible to see what is included and what is excluded in the research, and to reflect on why phenomena are included or excluded, why the methods and tools are selected, and how this is related to the results of the research. The question of outside validity is an important, although often not so easy to answer, question. Validity points to the center of science: finding the truth, valid knowledge, as far as possible. This is and should be the
assignment of science. Because science is often used in society as a legitimization for other purposes, a scientist should try to reach a positive inside and outside validity for the research that he or she is carrying out.
References⁵
Boskma, A.F. & M. Herweijer (1988) Beleidseffectiviteit en case-studies: Een vergelijking van verschillende onderzoeksontwerpen. Beleidswetenschap, 2, pp. 52-69.
Dijkum, C. van & F. Wallner (1992) Constructive realism in discussion. Amsterdam: Sokrates Science Publisher.
Dijkum, C. van (1991) Constructivism in Informatics. In: Dijkum, C. van & F. Wallner (Eds.) Constructive Realism in Discussion. Amsterdam: Sokrates Science Publisher.
Groot, A.D. de (1965) Thought and choice in chess. New York: Mouton.
Groot, A.D. de (1969) Methodology. Foundations of inference and research in the behavioral sciences. The Hague: Mouton.
Groot, A.D. de (1969) Methodologie: Grondslagen van onderzoek en denken in de gedragswetenschappen. Den Haag: Mouton & Co.
DeTombe, D.J. (1994) Defining complex interdisciplinary societal problems. Amsterdam: Thesis Publishers.
DeTombe, D.J. & C. van Dijkum (Eds.) (1996) Analyzing Societal Problems. München: Hampp Verlag.
DeTombe, D.J. & H. 't Hart (1996) Using system dynamic modelling techniques for constructing scenarios of societal problems. In: DeTombe, D.J. & C. van Dijkum (Eds.) Analyzing Societal Problems. München: Hampp Verlag.
DeTombe, D.J. (1998a) Special group support session on the operationalization of the Costen index of noise pollution around Schiphol ('Voldoet de Costen maat nog'), February 1998, TU Delft, unpublished report. A discussion with experts led by Rijkswaterstaat, VROM and TU Delft.
⁵ Additional references, not directly referred to in the text, are also included.
DeTombe, D.J. (1998) Validity of simulation models for handling complex technical policy problems. Simulation in Industry, Nottingham UK, proceedings.
Ganzeboom, H. (1996) Onderwijsdata ouders-kinderen in verscheidene landen. Lezing NOSMO methodologendag.
Habermas, J. (1987) Excursus on Luhmann's Appropriation of the Philosophy of the Subject through Systems Theory. pp. 368-385 in: The Philosophical Discourse of Modernity: Twelve Lectures. Cambridge, MA: MIT Press. (Der philosophische Diskurs der Moderne: zwölf Vorlesungen. Frankfurt a.M.: Suhrkamp, 1985.)
Habermas, J. & N. Luhmann (1971) Theorie der Gesellschaft oder Sozialtechnologie. Frankfurt a.M.: Suhrkamp.
Hart, H. 't, J. van Dijk, M. de Goede, W. Jansen & J. Teunissen (1996) Onderzoeksmethoden. Meppel/Amsterdam: Boom.
Heffen, O. van (1995) Onderzoek van overheidsbeleid. In: Hout, W. & H. Pellikaan (red.) Leren van onderzoek: het onderzoeksproces en methodologische problemen in de sociale wetenschappen. Meppel/Amsterdam: Boom, pp. 166-193.
Hoogstraten, J. (1979) De machteloze onderzoeker. Meppel/Amsterdam: Boom.
Hout, W. & H. Pellikaan (1995) Het onderzoeksproces en methodologische problemen in de sociale wetenschappen. In: Hout, W. & H. Pellikaan (red.) Leren van onderzoek: het onderzoeksproces en methodologische problemen in de sociale wetenschappen. Meppel/Amsterdam: Boom, pp. 16-36.
Hulspas, M. (1998) Tabaksonderzoek als rookgordijn. Intermediair, 34, nr. 24.
Köbben (1996) Kafka in Zoetermeer. NRC Handelsblad, 15 mei 1996.
Leydesdorff, L. (1996) Luhmann's sociological theory: its operationalization and future perspectives. Social Science Information, 35, pp. 283-306.
Leydesdorff, L. (1997) The Non-linear Dynamics of Sociological Reflections. International Sociology, 12, pp. 25-45.
Luhmann, N. (1990) Die Wissenschaft der Gesellschaft. Frankfurt a.M.: Suhrkamp.
Miller, A. (1948) All my sons (theater play). Citation: 'I know how a bug is made in this country.'
Riet, O. van der (1998) What's the Problem? The Problem Formulation Task in (Infrastructure) Planning. TRAIL Onderzoekschool, Delft/Rotterdam, October 30, 1997.
Rooze, E.J. (1998) Supporting Creativity in Problem Handling by Generating Requisite Chaos. München/Mering: Hampp Verlag, in press.
Segers, J.H.G. (1977) Sociologische onderzoeksmethoden: Inleiding tot de structuur van het onderzoeksproces en tot de methoden van dataverzameling. Assen/Amsterdam: Van Gorcum.
Siddiqui, F. (1998) Op weg naar een miljardenstrop. Intermediair, 34, nr. 24.
Swaan, A. (1998) Een discussie aanvraag NWO. Amsterdam: Facta. SISWO discussion NWO multiculturele en pluriforme samenleving.
Swanborn, P.G. (1987) Methoden van sociaal-wetenschappelijk onderzoek. Nieuwe editie. Meppel/Amsterdam: Boom.
Swanborn, P.G. (1996) Case studies. Meppel/Amsterdam: Boom.
Teisman, G.R. (1992) Complexe besluitvorming: Een pluricentrisch perspectief op besluitvorming over ruimtelijke investeringen. 's-Gravenhage: Vuga.
Veenman, J. (1998) The usual suspects. Facta, 6, nr. 4.
Zouwen, van der (1998) Review of the book: 't Hart, H., J. van Dijk, M. de Goede, W. Jansen & J. Teunissen (1996) Onderzoeksmethoden. Meppel/Amsterdam: Boom. Mens & Maatschappij, pp. 400-401.
ABOUT THE AUTHORS

Dr. Dorien J. DeTombe
Dorien J. DeTombe studied social science and computer science. She works in the field of methodology for complex societal problems at the School of Systems Engineering, Policy Analysis and Management at Delft University of Technology. She received her doctorate in the field of methodology for complex societal problems and developed the Compram method (complex problem analyzing method). She has published many articles and some books, and organizes conferences on the topic of methodology of complex societal problems. She is secretary of the Dutch Simulation group (Nosmo), chair of the International Society on Methodology for Complex Societal Problems, chair of the Euro Working Group 21 of Operational Research, and chair of the Dutch research group (Nosmo) on the same topic.

Dr. Cor van Dijkum
Cor van Dijkum studied Physics and Social Psychology in Amsterdam. His PhD was in the field of Methodology (and Simulation). He now works at the Department of Methodology and Statistics of the Faculty of Social Sciences of Utrecht University. He is, among other things, chair of the Dutch Simulation group of the Nosmo. He has published many articles and books about Philosophy, Methodology, Methods of Social Research and Computer Simulation.

Prof. dr. Cornelis Hoede
Cornelis Hoede was born in Amsterdam and studied Theoretical Physics. After being a teacher in Physics and Mathematics he started working at the Twente Technical University. After his PhD on the Ising problem he became professor of Algebra and Discrete Mathematics. He has published many articles and books on the Theory of Graphs and its use in Computer Science and the Social Sciences.

Dr. Marco Janssen
Marco Janssen studied Econometrics and Operations Research at the Erasmus University. In November 1996 he received his PhD degree at Maastricht University for proposed and applied methodological improvements of integrated
assessment modelling. From June 1991 till October 1998 he worked on the development and application of integrated assessment models at RIVM (National Institute for Public Health and the Environment). Since October 1, 1998, he has been a postdoc at the Department of Spatial Economics, Free University in Amsterdam, working on industrial metabolism. He is interested in evolutionary modelling and ecological economics.

Prof. dr. Jack P.C. Kleijnen
Jack P.C. Kleijnen is Professor of Simulation and Information Systems. His research interests are in Simulation, Mathematical Statistics, Information Systems, and Logistics. He has published six books and nearly 150 articles, was a consultant to several organizations in the USA and Europe, and has served on many editorial boards and scientific committees. He spent a few years in the USA, at universities and companies. A number of international fellowships and prizes have been awarded to him. For more information: http://cwis.kub.nl/~few5/center/staff/kleijnen/

Drs. Etzel van Kuijk
Etzel van Kuijk studied Social Sciences in Utrecht and specialised in Methodology and Simulation. He became a researcher in Policy Studies at the Amsterdam Free University and is preparing his PhD in this field. In 1997 he entered the world of Software Management and became a professional consultant for clients of Magnus.

Prof. dr. Ferdinand Verhulst
Ferdinand Verhulst received his master's degree in Astrophysics and Mathematics at the University of Amsterdam. His PhD was in Applied Mathematics at the University of Utrecht. He worked at the University of Amsterdam and the Technical University Delft. His present position is professor of Dynamical Systems at the Mathematical Institute, Utrecht University. He has held guest professorships at Imperial College London, the Academy of Sciences in Moscow, Engineering at Prague, Princeton University, and the Institute of Technology Bandung. He has published many articles and 4 books on Applied Mathematics and Dynamical Systems. He is editor of the Journal of Nonlinear Science, Zeitschrift f. Angewandte Mathematik und Mechanik, Nonlinear Science Today, Chaos-Solitons-Fractals, SIAM Classics and Epsilon Uitgaven.
Dr. Henny de Vos
Henny de Vos was born on May 29, 1966. After a three-year period of working on model building in the shipping industry, she became a PhD student at the University of Twente, where she developed new methodologies, such as simulation, to study school effectiveness. She received her Ph.D. in 1998 on the basis of her dissertation entitled "Education Effects: A Simulation-Based Analysis".

Dr. Bert de Vries
Bert de Vries is a senior scientist at RIVM (National Institute for Public Health and the Environment). He studied Chemistry and received his PhD degree in 1989 at the University of Groningen. His expertise involves modelling and simulation of energy systems. He was project leader of the TARGETS model and is responsible for the energy model of IMAGE. He has published widely on topics such as energy systems and environmental assessments and is a member of several international panels concerned with various aspects of environmental change and energy modelling.

Drs. Helene Margot Weening
Helene Margot Weening was born in Groningen and studied Policy Studies at Twente University. In 1997-1998 she was an assistant professor for Algebra and Discrete Mathematics at the University of Twente.