Journal of Experimental Psychology: Applied 2009, Vol. 15, No. 3, 199–212
© 2009 American Psychological Association 1076-898X/09/$12.00 DOI: 10.1037/a0016533
Evaluating Science Arguments: Evidence, Uncertainty, and Argument Strength

Adam Corner and Ulrike Hahn
Cardiff University

Public debates about socioscientific issues are increasingly prevalent, but the public response to messages about, for example, climate change, does not always seem to match the seriousness of the problem identified by scientists. Is there anything unique about appeals based on scientific evidence—do people evaluate science and nonscience arguments differently? In an attempt to apply a systematic framework to people's evaluation of science arguments, the authors draw on the Bayesian approach to informal argumentation. The Bayesian approach permits questions about how people evaluate science arguments to be posed and comparisons to be made between the evaluation of science and nonscience arguments. In an experiment involving three separate argument evaluation tasks, the authors investigated whether people's evaluations of science and nonscience arguments differed in any meaningful way. Although some differences were observed in the relative strength of science and nonscience arguments, the evaluation of science arguments was determined by the same factors as nonscience arguments. Our results suggest that science communicators wishing to construct a successful appeal can make use of the Bayesian framework to distinguish strong and weak arguments.

Keywords: science, science communication, argumentation, evidence, Bayesian probability

Adam Corner and Ulrike Hahn, School of Psychology, Cardiff University, Cardiff, United Kingdom.
Correspondence concerning this article should be addressed to Adam Corner, School of Psychology, Cardiff University, Tower Building, Park Place, Cardiff CF10 3AT. E-mail: [email protected]
Public debates about socioscientific issues are becoming increasingly prevalent, and in many ways, science no longer belongs to scientists alone. Of course, it is still scientists who conduct scientific experiments and scientists who publish the results of their research in academic journals. But bodies of scientific evidence, and the debates that surround them, are no longer restricted to these circles. Scientific developments are debated by politicians, journalists, and citizens groups. Many of the most important decisions we make (as individuals or as a society) are rooted in our understanding and evaluation of scientific evidence, arguments, and claims. In particular, the communication of messages about socioscientific issues such as climate change has become a matter of some urgency. However, the public response to these messages does not always seem to match the seriousness of the problem the scientists claim to have identified (see, e.g., Lorenzoni, Nicholson-Cole, & Whitmarsh, 2007).

Academic interest in the public understanding of science, driven by concerns such as these, is rapidly increasing. This is reflected in the range of researchers—philosophers, social scientists, communication scholars, policy experts, and science educators—who study the many aspects of science communication (see, e.g., Collins & Evans, 2007; Gregory & Miller, 1998; Pollack, 2005). Such diverse interest in the topic suggests that the process of communicating science to the general public is not straightforward. Could the communication of science be improved, if only we had a better understanding of how lay people evaluate scientific arguments?

Certainly, improvement in scientific literacy is the explicit goal of science educators (see, e.g., von Aufschnaiter, Erduran, Osborne, & Simon, 2008), and the implicit goal of much of the work that has been conducted into the public understanding of science (Irwin & Wynne, 1996). Despite the sustained attention that the issues surrounding science communication have received, a clear understanding of how people interpret and evaluate basic scientific messages has not been forthcoming. In particular, rather little is known about how ordinary members of the public evaluate science arguments.

The reason for the lack of an account of how people interpret and evaluate science arguments is, in our view, simple: A systematic framework for asking questions about how people evaluate science arguments has not been developed. We propose that the Bayesian approach to informal argumentation (see Hahn & Oaksford, 2007) provides just such a framework. The Bayesian approach to informal argumentation treats arguments as claims (or hypotheses) that are accompanied by evidence. When we evaluate the strength of an uncertain claim, we do so probabilistically: How likely is a particular claim to be true, in light of the available evidence? We explain the Bayesian approach in full below. For now, we simply note that possessing such a framework has a distinct advantage: It allows judgments about science arguments to be directly compared with judgments about nonscience arguments.

On the Bayesian account of argumentation, science arguments are simply arguments that happen to be about science. As such, they can be analyzed in exactly the same way as nonscience arguments. The key components that intuitively might be thought to determine the strength of an argument (e.g., how much evidence it contains, the relation of the evidence to the hypothesis, the reliability of the source reporting the evidence) have a straightforward Bayesian interpretation. And crucially, these factors can be identified and
manipulated in both science and nonscience arguments (Hahn & Oaksford, 2007). If there is nothing special about science argument evaluation, then there is no need to treat science arguments as something distinct from nonscience arguments. Designers of science communication messages can draw on existing knowledge about the evaluation of arguments in general. If, however, there are features unique to the evaluation of science arguments, then these features should be the focus of future research.

This article will proceed as follows: First, we give a brief overview of the varied ways in which the issues surrounding science communication have been approached, summarizing existing research specifically on the public evaluation of science arguments. Second, we outline our theoretical framework by introducing the basic principles of the Bayesian approach to informal argumentation. Third, we present data from an experiment involving three separate argument evaluation tasks. This experiment represents the first attempt in the literature to compare experimentally people's evaluations of different types of science and nonscience arguments. Throughout, the practical implications of our data for the successful communication of scientific messages are discussed.
Evaluating Science Arguments

The study of science communication is characterized by a multiplicity of approaches. There is a substantial philosophical literature on science as an epistemology (Knowles, 2003; Kuhn, 1970; Popper, 1959), as well as several prominent (and competing) sociological accounts of how science fits into the world of social actors, and how controversy and consensus develop in science (Brante, Fuller, & Lynch, 1993; Collins & Pinch, 1993; Irwin & Wynne, 1996). Media analysts have examined the roles of different groups in the production, communication, and consumption of science (Friedman et al., 1999). How people perceive risky probabilities is another important component of understanding the general public's perception of science, because the communication of scientific information typically involves the communication of risk (Pidgeon, Kasperson, & Slovic, 2003). Finally, there have been many attempts to measure people's attitudes and perceptions of particular scientific developments, such as climate change (Lorenzoni & Pidgeon, 2006) or nanotechnology (Pidgeon & Rogers-Hayden, 2007).

Even this cursory examination of different ways in which the public evaluation of science has been approached highlights a fundamental problem facing researchers interested in the topic: It is not at all obvious where to start, or which questions to ask. Clearly, to fully answer the question "how do people evaluate science?" the wisdom of many different disciplines must be brought to bear. At a minimum, however, it is essential to understand how nonexperts evaluate arguments about scientific topics.

The most popular framework for analysis of different types of science argument has been Toulmin's (1958) model of argumentation. Toulmin's model is dialectical, because it defines arguments as moves that are made in a conversation. According to Toulmin, an argument can be broken down into several distinct components: a claim (the conclusion whose merits are to be established); data (the facts that are used to support the claim); warrants (the reasons that are used to justify the connections
between the data and the claim); and backing (the basic assumptions that provide the justification for particular warrants). Several studies have drawn on Toulmin's model to develop an account of students' use of science arguments in educational settings. For example, von Aufschnaiter et al. (2008) used it to analyze the verbal conversations of school pupils during science lessons. Although the model is primarily a system for classification, not evaluation, the authors identified patterns of argumentation that contained warrants and backings as demonstrating a higher quality of argument than argumentation based simply on an unwarranted claim. Furthermore, if pupils were able to construct arguments with claims, warrants, and data even once their initial position had been rebutted, they were classified as demonstrating an even higher level of argumentative skill.

Because Toulmin's approach identifies putative "components" of argumentation, it has also been used in recent work attempting to provide a mental models account of argumentation (Green, 2007, 2008), as well as programs aimed at "improving" critical thinking (van Gelder, Bissett, & Cumming, 2004). Johnson-Laird (1983) proposed that individuals represent entities and relations in the external world using mental tokens and relations to construct an internal "mental model" of reality. Drawing on this work, Green (2007) suggested that people comprehend the arguments for and against a conclusion by representing both the structure of the arguments and a mental model of their relative strength (i.e., the relation of the arguments to each other). Noting that the degree of certainty in a conclusion should be derivable from the ratings of strength assigned to the arguments for and against it, Green proposed that the notion of argument strength is commensurate with the mental models approach to reasoning.

However, Toulmin's (1958) framework provides only the most minimal account of the requirements of a 'good argument,' and hence cannot be viewed as a comprehensive account of argument quality. Claims without data and warrants are clearly limited in value, but the purely structural characterization offered by Toulmin's framework cannot distinguish between, for example, data of greater and lesser relevance. Two arguments can have an identical overall structure and be given in exactly the same dialectical circumstances but still differ substantially in their convincingness, simply because they differ in their actual content. This means that Toulmin's approach must be supplemented with an additional measure for capturing the relative strength of arguments (rather than simply classifying their components). Green (2007) suggested that pretesting arguments for their strength was an appropriate method of achieving this, and this is also the approach that has been taken in the social psychological literature on persuasion (see, e.g., Petty & Cacioppo, 1986). However, pretesting only determines which arguments a given group views as strong and which they view as weak. It provides no explanation of these preferences; moreover, it provides no information on whether these preferences are justified or not. It consequently has no purchase in the context of normative concerns about argument quality, which is the frequent focus of science educators. For this, a theory of argument strength is required.
The absence of such a theory has been repeatedly noted in the social psychological literature on persuasion (Areni & Lutz, 1988; O'Keefe & Jackson, 1995; Petty & Wegener, 1991). It has also hampered investigations into how people evaluate arguments about scientific topics (Driver, Newton, & Osborne, 2000; Erduran,
Simon, & Osborne, 2004; Jiménez-Aleixandre, 2002; Jiménez-Aleixandre, Rodríguez, & Duschl, 2000; Korpan, Bisanz, Bisanz, & Henderson, 1997; Kolstø et al., 2006; Kortland, 1996; Kuhn, Shaw, & Felton, 1997; Norris, Phillips, & Korpan, 2003; Patronis, Potari, & Spiliotopoulou, 1999; Ratcliffe, 1999; Sadler, 2004; Simon, Erduran, & Osborne, 2002). These investigations have tended to employ qualitative classification systems developed on an ad hoc basis (Takao & Kelly, 2003). They frequently apply only to the specific set of data for which they were developed (Kuhn et al., 1997) and therefore provide little opportunity for comparison with other studies. It should come as no surprise, therefore, that it has been difficult to reach any firm conclusions about what influences the strength of science arguments. Several factors have consistently been identified as being important for understanding science argument evaluation (e.g., assessments of source reliability, or the ability to evaluate evidence), but there have been no systematic attempts at varying or manipulating these factors. Nor has it been possible to situate them within a single explanatory framework. Moreover, because the normative foundations for the ad hoc criteria used have not been clear, no real verdict on people's competence has been possible. In addition, studies explicitly aimed at comparing the evaluation of science and nonscience arguments are scarce—yet such a comparison seems particularly informative regarding people's capacity for evaluation.

An approach that permitted predictions about argument strength to be made, key variables to be manipulated experimentally, and performance across a number of tasks to be compared would be invaluable in improving our understanding of the evaluation of science arguments. In the next section, we introduce the Bayesian approach to informal argumentation as a tool for this endeavor.
The Bayesian Approach to Informal Argumentation

The Bayesian approach to argumentation starts from a very basic premise—that the strength of an uncertain claim is evaluated probabilistically. From the Bayesian perspective, the extent to which one believes a claim to be true—that is, one's degree of belief in a claim—is something that can be described probabilistically. Degree of belief is simply the subjective estimate that a particular claim is true. Being able to evaluate the claims that other people make is a fundamental social/cognitive ability. It involves not only the assignment of static degrees of belief to claims and hypotheses, but also an evaluation of how degrees of belief change in light of new evidence. The Bayesian approach provides a normative model of belief revision (see, e.g., Howson & Urbach, 1996).

Any claim (or hypothesis) h has a degree of belief (or subjective probability) P(h) associated with it. Without looking out of the window, for example, one's prior degree of belief that it is raining might be maximally uncommitted. Rain and no-rain might be seen as equally probable, that is, P(h) = 0.5. On encountering new evidence e (e.g., droplets of water covering the window), Bayes's theorem provides a rational method of updating the prior belief P(h) to a revised posterior belief P(h|e) based on the new evidence. According to Bayes's theorem, this revised estimate depends on both prior belief and the characteristics of the evidence. Specifically, these are the conditional probabilities P(e|h) and P(e|¬h) that capture the "hit rate" and the "false positive rate" of that piece of evidence with regard to the hypothesis in question (in our example, the probability of droplets given that it is/is not raining). The ratio of hit rate to false positive rate, known as the likelihood ratio, characterizes the diagnosticity of the evidence. In our example, observing droplets of water on the window seems far more likely given that it is raining, P(e|h), than given that it is not (even though there could be other explanations for the droplets). The observation of water droplets will consequently lead to a considerable increase in the belief that it is raining. Specifically, that increase is given by Bayes's theorem, stated formally in Equation 1:

P(h|e) = P(h)P(e|h) / [P(h)P(e|h) + P(¬h)P(e|¬h)]    (1)
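To make the update in Equation 1 concrete, the short Python sketch below computes the posterior for the rain example. The prior of 0.5 comes from the text; the hit rate and false positive rate (0.9 and 0.1) are illustrative values of our own, not figures given in the article.

```python
# Minimal sketch of the belief update in Equation 1 (illustrative values).

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(h|e) = P(h)P(e|h) / [P(h)P(e|h) + P(not h)P(e|not h)]."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)

prior_rain = 0.5          # maximally uncommitted prior (from the text)
p_droplets_if_rain = 0.9  # assumed hit rate of the "droplets" evidence
p_droplets_if_dry = 0.1   # assumed false positive rate

print(posterior(prior_rain, p_droplets_if_rain, p_droplets_if_dry))  # 0.9
```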
The posterior degree of belief provides a measure of how convinced we should be by the evidence for a claim, and, in that sense, a measure of argument strength (Hahn & Oaksford, 2006, 2007). Crucially, the values of the probabilistic quantities in Bayes's theorem are determined by the specific content of the arguments involved. Prior degrees of belief depend on what that belief is, and the value of the evidence depends on the nature of that evidence. Hence, argument strength rests on what the argument is actually about, not simply on the presence or absence of purely structural components such as "warrants" or "backing." This gives the framework much greater resolution than that possessed by Toulmin's dialectical account. That increasing evidence for a claim yields a better argument is captured in the Bayesian framework—only the provision of evidence can lead to an increase in (posterior) degree of belief in a claim. Additionally, however, the Bayesian framework—through the likelihood ratio—naturally captures considerations about the quality and relevance of that evidence. For example, evidence that has no relationship to the claim in question will be no more likely to be observed if the claim is true than if it is false. The likelihood ratio will equal 1, and no belief change will arise as a result of it.

Of course, arguments are not only about facts—many arguments pertain to the desirability of a particular action. The way to accommodate this feature of informal arguments within the Bayesian framework is Bayesian decision theory (Edwards, 1961; Keeney & Raiffa, 1976; Savage, 1954), which provides a guide to decision making in situations where outcomes are uncertain, based not only on the subjective probabilities but also on the subjective utilities involved. For an argument that warns against a particular consequence, the more (subjective) negative utility there is associated with that consequence, the stronger that argument should be (Corner, Hahn, & Oaksford, 2006).

There are several reasons that recommend the Bayesian approach as a heuristically valuable framework for studying science argument evaluation. First, the Bayesian approach has been enormously influential in the philosophy of science, describing and explaining the ways that scientists construct, test and eliminate hypotheses, design experiments and statistically analyze data (Howson & Urbach, 1996; see also Fugelsang, Stein, Green, & Dunbar, 2004). Similarly, epistemologists have used Bayesian principles to explain how people assess the coherence of sets of information, confirm and disconfirm hypotheses, and come to conclusions based on contradictory or disparate evidence (Bovens & Hartmann, 2003).
This suggests that the Bayesian approach and an analysis of scientific arguments are well suited to each other.

Second, the Bayesian approach has already been used to examine informal argumentation. Recent work has demonstrated how Bayesian probability theory can be applied to a wide variety of so-called fallacies of argumentation (the many argument pitfalls that form the focus of textbooks on critical thinking). Here, probability theory has been used to assess people's evaluation of everyday examples of supposedly fallacious arguments (Corner et al., 2006; Hahn & Oaksford, 2007; Hahn, Oaksford, & Corner, 2005; Oaksford & Hahn, 2004). On a practical level, the experimental results of these studies have suggested that people are clearly capable of distinguishing strong and weak versions of a range of informal arguments. On a theoretical level, the fact that the Bayesian framework can be applied to many structurally different types of arguments recommends it as a viable framework for studying informal argument strength.

In fact, the Bayesian approach to human reasoning in general has become increasingly popular (Kemp & Tenenbaum, 2009; Korb, 2004; Nelson, 2005; Oaksford & Chater, 2007; Tentori, Crupi, Bonini, & Osherson, 2007). For example, there is a longstanding tradition of psychological research into category-based inductive inference, which has examined the way people draw novel inferences about members of a category (Osherson, Smith, Wilkie, López, & Shafir, 1990; Rips, 1989). This research has identified a number of key heuristics (such as the similarity or typicality between category members) that govern inference. These heuristics, in turn, have been relocated within a Bayesian framework (Heit, 1998; Kemp & Tenenbaum, 2009).

Because it is a quantitative model of belief revision, Bayes's theorem has been used to construct detailed models of specific aspects of reasoning behavior (e.g., Kemp & Tenenbaum, 2009). In other contexts, such as the philosophy of science and epistemology, the Bayesian framework has typically been employed in a qualitative fashion, and its main asset lies in the fact that it provides a normative, formal framework that captures key intuitions about reasoning and argumentation. As noted famously by Laplace, the theory of probabilities is 'at bottom nothing but common sense reduced to calculus' (Laplace, 1814/1951). Work specifically within the Bayesian approach to informal argumentation has tended to emphasize qualitative predictions, though quantitative modeling has also been conducted (Hahn & Oaksford, 2007; Oaksford & Hahn, 2004).

A similarly qualitative approach seems appropriate for a first investigation into science argument evaluation. For the practitioners of science communication, the quantitative modeling of belief functions is likely to be of less interest than an attempt to identify the broad factors involved in the evaluation of science arguments. The question of most interest to the science communication community is the extent to which fundamental philosophical and scientific intuitions about evidential impact (Bovens & Hartmann, 2003; Howson & Urbach, 1996) are shared by the lay public, because this seems most useful for the improvement of science education and communication. Our focus here is consequently on the heuristic value of the Bayesian framework and its qualitative predictions about evidential impact.
Our aim is to demonstrate that this framework (which has never been applied to informal scientific arguments) allows us to pose questions that previous studies of science communication have been unable to ask. Specifically, the Bayesian approach
permits us to ask questions about content-level factors such as evidential strength, source reliability, and outcome utility. From a practical perspective, this is a significant advancement on previous attempts at analyzing science arguments that have focused only on structural components. In the remainder of this article, we will introduce and report the results of an experiment using the Bayesian approach to compare the evaluation of three different types of science and nonscience arguments.
Overview of Experiment

Using the Bayesian framework as a guide, we sought to examine three forms of argument that are found frequently in attempts to communicate science. In each case we sought to establish whether people were sensitive to quantities implicated by Bayes's theorem and whether they evaluated science arguments any differently from nonscience arguments. The first experimental task was based on the work of Oaksford and Hahn (2004), and involved 'arguments from ignorance' about science and nonscience topics.
Arguments From Ignorance Task

Arguments from ignorance use the absence of evidence (e.g., no side effects found during the testing of a new drug) to support a hypothesis (e.g., that the drug is safe), and have traditionally been treated as informal reasoning fallacies by critical thinking textbooks (Woods, Irvine, & Walton, 2004). However, the industry safety standard for pharmaceutical products involves demonstrating 'no harmful side effects' over a given testing period. Moreover, many high-profile socioscientific arguments seem to take the form of an argument from ignorance—arguments about the safety of nuclear power stations or the MMR vaccination are both founded on a lack of evidence that any danger exists. Similarly, debates about health epidemics such as "Bird Flu" are often based on the absence of evidence that an outbreak will occur. So how should we distinguish strong arguments from ignorance from weak ones?

Arguments from ignorance long evaded formal treatment—because in arguments from ignorance it is not the presence of evidence that supports a claim, but its absence. Oaksford and Hahn (2004) and Hahn and Oaksford (2007) presented appropriate versions of Bayes's theorem that captured such cases. Moreover, they demonstrated formally how and why arguments from ignorance can be acceptable, even though they will typically be weaker than positive arguments based on the same tests. In general, the strength of arguments from ignorance is determined by the same components as positive arguments, namely, the hit rate and false positive rate of the test. This can be illustrated intuitively with two example arguments from ignorance. First, train timetables as a test of which stations a train will stop at have a high "hit rate" P(e|h). That is, the train will (typically) stop at all the stations on the timetable. They also constitute a test with a very low false positive rate, as the train will typically not stop at any stations not listed on the timetable. This means the likelihood ratio is high, and the argument from ignorance that a train does not stop at a particular station because it is not listed on the timetable is strong. By contrast, a shop window is an information source with a much lower hit rate: not everything that is stocked can appear in the window. Hence it seems unjustified to
conclude that a shop does not stock a particular piece of clothing simply because it is not in the window. However, even in this latter case we will become more convinced the more windows of the shop we survey (or the more items that are on display).

Figure 1 plots the impact of increasing "units" of evidence (e.g., the number of different studies that find no side effects of a drug), and an increasing likelihood ratio (e.g., an increasingly reliable source reporting these results), on posterior degree of belief in a hypothesis, P(h|e). Starting from a prior degree of belief of 0.4, a posterior degree of belief is calculated following the addition of each "unit" of evidence. This posterior then becomes the new prior for the next "unit" of evidence, and updating proceeds from there. Where the likelihood ratio is low, that is, where a source of evidence is just as likely to report P(e|h) as they are to report P(e|¬h), the amount of the evidence has little impact. When the likelihood ratio is higher, however, increasing the amount of evidence has a systematic effect on posterior belief in the hypothesis. In other words, the multiplicative nature of Bayes's theorem means that amount of evidence and its quality interact (for a more technical treatment of negative evidence using the Bayesian approach, readers are referred to Hahn & Oaksford, 2007).

Figure 1. Impact of amount of evidence and source reliability (likelihood ratio) on posterior belief in a hypothesis. Each line represents a different likelihood ratio.

Consequently, Figure 1 highlights three normative aspects of Bayesian belief updating that are readily monitored in participants' argument evaluations. First, increasing the thoroughness of a search (and hence amount of negative evidence) should produce higher ratings of argument strength. Second, more reliable sources should produce higher ratings of argument strength. Third, we should observe a multiplicative effect of these two factors on ratings of argument strength (i.e., an interaction). Our first experimental task allowed us to establish whether participants' judgments of both science and nonscience arguments from ignorance were sensitive to these factors.
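The iterative updating behind Figure 1 can be expressed in a few lines of Python. This is a minimal sketch of our own rather than the authors' code: the prior of 0.4 is taken from the text, and the likelihood ratios are illustrative values chosen to show how diagnosticity and amount of evidence interact.

```python
# Sequential Bayesian updating: the posterior after each "unit" of evidence
# becomes the prior for the next unit (illustrative likelihood ratios).

def update(prior, likelihood_ratio):
    """One update where the evidence is likelihood_ratio times more likely
    under the hypothesis than under its negation."""
    return (prior * likelihood_ratio) / (prior * likelihood_ratio + (1 - prior))

for lr in (1.0, 1.5, 2.0, 2.5, 3.0):   # increasingly diagnostic sources
    belief = 0.4                        # prior degree of belief (from the text)
    trajectory = []
    for _ in range(5):                  # five "units" of evidence
        belief = update(belief, lr)
        trajectory.append(round(belief, 2))
    print(f"likelihood ratio {lr}: {trajectory}")
```

With a likelihood ratio of 1 the belief never moves from 0.4, whereas higher ratios produce the systematically rising curves described above.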
Mixed Evidence Task

The second experimental task examined the notion of evidential strength and uncertainty in science and nonscience arguments. The communication of uncertainty in scientific messages is the subject of much controversy—not least in current debates about climate change (Patt, 2007; Zehr, 2000), and the language used by groups such as the Intergovernmental Panel on Climate Change (Budescu,
Broomell, & Por, 2009; Patt & Schrag, 2003). But there is a straightforward way in which the notion of uncertainty can be translated into an experimental task using the Bayesian framework as a guide. In Bayesian terms, a good argument is one that gives evidence in support of a hypothesis, and recent philosophical treatments of Bayesian evidence evaluation have suggested that coherence is an important factor in how favorably a set of evidence is judged (Bovens & Hartmann, 2003). These philosophical analyses are in line with psychological evidence that evidential coherence is an important factor in jury decision-making (Lagnado & Harvey, 2008; Pennington & Hastie, 1986).

The presence of mixed or contradictory evidence should lead to a lower degree of belief in a hypothesis than if all the evidence supported it. But having mixed evidence in support of a hypothesis is not the same as having only evidence that disconfirms it—mixed evidence should provide an intermediate level of support. Evidence may, however, impact negatively on the truth of a hypothesis but be consistent in doing so. This means that although we might expect the strength of an argument to have a monotonic relationship with the amount of confirmatory evidence it contains, the reliability of the sources that provide the evidence will be affected by evidential coherence in a different way. Bovens and Hartmann (2003, see chapter 3 for technical details) demonstrate that, from a Bayesian perspective, incoherent evidence should produce ratings of source reliability that are lower than those produced by consistently disconfirming evidence. In our second experimental task, we tested these predictions using a science and a nonscience argument. While practicing scientists may be aware that uncertainty is an integral part of the scientific process, the public perception of science as a discipline that provides certainty and consistency does not sit comfortably with the notion of scientific evidence that does not cohere. The degree to which a lack of evidential coherence impacts on ratings of source reliability may be greater, therefore, for arguments about scientific topics.

Of course, the communication of science is typically concerned not only with constructing a compelling message, but with achieving a particular goal. The communication of information about greenhouse gas emissions, for example, is typically aimed at reducing future emissions (see, e.g., Schultz, Nolan, Cialdini, Goldstein, & Griskevicius, 2007). The frequent focus of science communication campaigns on practical action means that the effectiveness of these campaigns is likely to be measured by their subsequent outcomes (e.g., observed reductions in greenhouse gas emissions). However, messages aimed at bringing about a particular outcome are fundamentally based on how desirable that outcome is perceived to be. This means that the effectiveness of arguments about outcomes—or consequences—will depend not only on the probability of that outcome occurring, but on the utility associated with that outcome. The third experimental task was focused, therefore, on science and nonscience consequentialist arguments.
Consequentialist Arguments Task

Consequentialist arguments take the generic form of "if A, then B". The strength of a consequentialist argument will depend not only on how likely B is to occur, but also on how
desirable B is. Stating “Mow the lawn, because I will give you £100,” is likely to be a more effective argument than stating “Mow the lawn, because I will give you £10,” assuming that the probability of the two outcomes occurring is equivalent. As noted above, the way to accommodate this feature of consequentialist arguments within the Bayesian framework is Bayesian decision theory (Edwards, 1961; Keeney & Raiffa, 1976; Savage, 1954). Applying decision theory to consequentialist arguments, the more (subjective) negative utility there is associated with a consequence, the stronger that consequentialist argument should be (Corner et al., 2006; Hahn & Oaksford, 2006, 2007). In addition, however, the type of action that would be required to avoid this negative outcome will also contribute to the utility of the argument—and it is often the case that the avoidance of a negative outcome requires a degree of personal sacrifice. Climate change is only the most pertinent example of a scientific topic where the consequences of (in)action play a central role in the debate and many appeals based on scientific data take the form of a consequentialist argument (e.g., healthy eating or antismoking campaigns). We examined the extent to which evaluations of science and nonscience consequentialist arguments were based on outcome negativity, and the amount of sacrifice required to avoid the negative outcome.
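As a rough illustration of the decision-theoretic reading sketched above, the following Python snippet scores a consequentialist appeal by weighing the probability and utility of the outcome against the cost of the action demanded. The scoring rule and all numbers are assumptions of our own for illustration; they are not a model reported in the article.

```python
# Assumed illustration: an appeal is more compelling the more probable and the
# more valuable (or the more costly to ignore) the outcome it invokes, and less
# compelling the larger the sacrifice it demands.

def appeal_strength(p_outcome, outcome_utility, cost_of_action):
    """Subjective expected utility of complying, net of the sacrifice required."""
    return p_outcome * outcome_utility - cost_of_action

# "Mow the lawn, because I will give you £100" vs. "... £10", with the same
# probability of payment and the same (assumed) cost of mowing the lawn.
print(appeal_strength(1.0, 100, 20))  # 80.0 -> the stronger appeal
print(appeal_strength(1.0, 10, 20))   # -10.0 -> the weaker appeal
```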
Method

Participants

One hundred students from the School of Psychology's Undergraduate Participant panel (panel mean age: ca. 20 years, panel female to male ratio: 84% to 16%) participated in the experiment in exchange for course credit. Occasionally a participant did not complete one of the three tasks. The number of completing participants is reported in the Results section for each experimental task.
Design

For each experimental task, the topic, type and order of the arguments were randomized using the Latin Square Confounded method, where participants see only one argument from each topic, and participate once in each experimental condition (Kirk, 1995). This allows multiple responses to be obtained from each participant, but prevents multiple arguments about the same topic being viewed by any one participant (reducing demand characteristics and potential confusion). The order in which participants completed the three experimental tasks was counterbalanced to guard against order effects.

In the Argument From Ignorance task, three variables (Source, Search for evidence, and Class) were manipulated at two levels (reliable/unreliable, thorough/incomplete, and science/nonscience) across four different argument topics, creating a total of 16 distinct arguments. Participants were required to evaluate four arguments (determined by the Latin Square method). Participants provided ratings of argument strength on a scale from 0 (Very unconvincing) to 10 (Very convincing), and also indicated how reliable they thought the source in each
argument was on a scale from 0 (Unreliable) to 10 (Very reliable). In the Mixed Evidence task, two arguments were designed based on a claim and some evidence. Four pieces of evidence accompanied each claim, from four different sources. Two variables were manipulated in this task; Evidence (confirms/mixed/disconfirms) and Class (science/nonscience). This created a total of 6 distinct arguments. In accordance with the Latin Square method, participants evaluated one argument from each topic (evaluating two arguments in total). Participants were required to indicate how likely they thought the claims in the arguments were to be true on a scale from 0 (Unlikely) to 10 (Very likely), and also how reliable they thought the sources providing the evidence were, on a scale from 0 (Unreliable) to 10 (Very reliable). In the Consequentialist Argument task, three variables (Outcome Utility, Level of Sacrifice, and Class) were manipulated at two levels (moderately/very negative outcome utility, small/big sacrifice and science/nonscience argument) across four different argument topics, creating a total of 16 distinct arguments. Using the Latin Square method, all participants were presented with four consequentialist arguments and were required to provide a rating of argument strength for each argument on a numerical scale from 0 (Very unconvincing) to 10 (Very convincing). In addition, participants were required to indicate how bad they thought each outcome was, and how bad it would be if they had to make the sacrifice prescribed by each argument, on a scale from 0 (Very bad) to 10 (Not at all bad).
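For illustration, a Latin-square assignment of the kind described above can be sketched in a few lines of Python. The rotation scheme below is our own assumption (the article does not report the exact square used); it simply shows how each participant group can see every topic once and every source/search condition once.

```python
# Assumed 4 x 4 Latin-square rotation for the Arguments From Ignorance task:
# four topics crossed with four source/search conditions, so that each group
# of participants sees each topic once and each condition once.

TOPICS = ["drug safety", "GM crops", "games console", "clothing store"]
CONDITIONS = ["reliable/thorough", "reliable/incomplete",
              "unreliable/thorough", "unreliable/incomplete"]

def assignment(group):
    """Rotate the conditions across topics; each group gets a different pairing."""
    return [(topic, CONDITIONS[(group + i) % len(CONDITIONS)])
            for i, topic in enumerate(TOPICS)]

for g in range(4):
    print(f"group {g}: {assignment(g)}")
```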
Materials and Procedure

For each experimental task, participants received a booklet containing the arguments to be evaluated. For the Arguments From Ignorance task, the two science topics were: (S1) the safety of a new anti-inflammatory drug, and (S2) the risks associated with GM crops, and the two nonscience topics were: (NS1) the release of a new games console, and (NS2) the presence or absence of a particular item of clothing in a High Street store. An example of these materials is provided in Appendix A. In the Mixed Evidence task the science argument concerned the date of the next visible solar flare, and the nonscience argument concerned the due date of the baby expected by the leader of the Conservative Party and his wife. The materials used in this experimental task are presented in Appendix B. In the Consequentialist Argument task the two scientific topics were: (S1) some predicted effects of climate change, and (S2) the risks associated with high blood pressure. The two nonscientific topics were: (NS1) the potential benefits of installing a car alarm, and (NS2) the risk of an alarm clock failing to go off in the morning. An example of these materials is provided in Appendix C.
Results

Arguments From Ignorance Task

Ninety-nine participants completed the Arguments From Ignorance task. To statistically analyze data from Latin Square Confounded
designs, participant effects within the ratings are factored out and the analyses are conducted on the residuals (Kirk, 1995).1 Figure 2 displays the mean ratings of argument strength obtained in each condition (raw, not residual data). A three-way analysis of variance (ANOVA) was conducted on the residual ratings of argument strength, with Search (thorough/incomplete), Source (reliable/unreliable) and Class (science/nonscience) as independent variables. The presence of a thorough search for evidence in the argument produced higher ratings of argument strength, F(1, 386) = 156.56, p < .001, ηp² = .29, as did a reliable source reporting the results of the search, F(1, 386) = 269.69, p < .001, ηp² = .41. Furthermore, there was a significant interaction between Search and Source, F(1, 386) = 19.82, p < .001, ηp² = .05, such that a report of a thorough search by a reliable source produced the highest ratings of argument strength. These effects are in line with Bayesian prescriptions about how evidential strength and source reliability should interact in impacting on argument evaluation (see Figure 1).

The main focus of our analyses was the comparison between science and nonscience arguments from ignorance. There was a main effect of Class, F(1, 386) = 12.76, p < .001, ηp² = .03, and a significant interaction between Source and Class, F(1, 386) = 11.05, p < .001, ηp² = .03: Nonscience arguments were perceived as significantly more convincing than science arguments, while the combination of an unreliable source with a scientific topic produced especially low ratings of argument strength. That is, ratings of science argument strength seemed to be somewhat polarized.

Source reliability ratings are displayed in Figure 3. A three-way ANOVA was conducted on the residual source reliability ratings. Judgments of source reliability were significantly affected by the Source manipulation, F(1, 386) = 679.1, p < .001, ηp² = .64, and the Search manipulation, F(1, 386) = 50.24, p < .001, ηp² = .12. Although the effect of Class was nonsignificant, there was a significant interaction between Source and Class of argument, F(1, 386) = 25.28, p < .001, ηp² = .06. Participants assigned lower ratings of reliability to unreliable scientific sources than to unreliable nonscientific sources, yet when the scientific sources were reliable they attracted higher ratings of reliability than their equivalent nonscientific sources. This tracks the pattern of polarization observed in the argument strength ratings. Science arguments from ignorance from unreliable sources were rated as less compelling, therefore, only when
the sources themselves were judged to be less reliable. According to Bayes's theorem, less reliable sources should produce less compelling arguments (and more reliable sources more compelling arguments). This suggests that the observed difference in the strength of the science/nonscience arguments was related to the perceived reliability of the sources in the arguments—not to participants using a qualitatively different approach in evaluating the science arguments (the question of why unreliable and weak science arguments should be perceived as less compelling and less reliable than their nonscience counterparts is discussed below).

Figure 2. Mean ratings of argument strength across each condition of the Arguments From Ignorance task. Error bars indicate 1 SD.

Figure 3. Mean ratings of source reliability obtained in each condition of the Arguments From Ignorance task. Error bars indicate 1 SD.
Mixed Evidence Task

Ninety-nine participants completed the Mixed Evidence task. Analyses were based on two dependent measures—a rating of argument strength (the truth of the claim), and a rating of the reliability of the claim. As in the Arguments From Ignorance task, all statistical analyses were conducted on residual ratings, while graphed data are raw, untreated means. Figure 4 displays the mean ratings of argument strength obtained in each condition. An ANOVA was conducted on ratings of argument strength with Evidence (confirms/mixed/disconfirms) and Class (science/nonscience) as independent variables. There was a main effect of Evidence in the expected direction, F(2, 194) = 79.09, p < .001, ηp² = .45, and post hoc Tukey's tests confirmed significant differences between each level of the manipulation (for both the science and nonscience arguments). There was no main effect of Class. However, there was an interaction between Class and Evidence such that science arguments were once again polarized—more convincing when based on strong evidence, and less convincing when based on weak evidence: F(2, 194) = 5.68, p < .01, ηp² = .06.
Footnote 1. Computing residual values is necessary because although participants provide data in every condition of the experiment, the combination of topic and experimental condition differs between participants. Computing a residual transformation permits standard, between-subjects analyses to be conducted. Though this changes the absolute numerical values, it typically leaves the overall shape of the data unaltered. In all the data reported in this article, ANOVAs on raw and residual values produced the same statistical effects.
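A minimal Python sketch of the residualization described in Footnote 1, assuming it amounts to subtracting each participant's own mean rating (an assumption on our part; Kirk, 1995, gives the full treatment of Latin Square Confounded analyses):

```python
# Remove participant effects by centering each participant's ratings on that
# participant's mean; the ANOVA is then run on these residuals (assumed scheme).

from collections import defaultdict

def residualize(ratings):
    """ratings: list of (participant_id, rating); returns (participant_id, residual)."""
    by_participant = defaultdict(list)
    for pid, value in ratings:
        by_participant[pid].append(value)
    means = {pid: sum(vals) / len(vals) for pid, vals in by_participant.items()}
    return [(pid, value - means[pid]) for pid, value in ratings]

# One participant's four argument-strength ratings, centered on their mean of 6.
print(residualize([("p1", 7), ("p1", 5), ("p1", 9), ("p1", 3)]))
# [('p1', 1.0), ('p1', -1.0), ('p1', 3.0), ('p1', -3.0)]
```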
The argument strength ratings are in line with the Bayesian prescription that the more evidence a claim has, the more compelling it should be. The polarization of the science argument strength ratings tracks a similar pattern to the data obtained in the Arguments From Ignorance task. And, once again, the ratings of source reliability provide valuable information about the way in which the arguments were evaluated.

Figure 5 displays the mean ratings of the reliability of the sources of evidence in each condition of the task. An ANOVA was conducted on the residuals of these ratings, with Evidence (confirms/mixed/disconfirms) and Class (science/nonscience) as independent variables. There was a main effect of Evidence, F(2, 194) = 23.35, p < .001, ηp² = .19. As we predicted, however, the ordinal pattern of results is different to the argument strength ratings—ratings of source reliability are higher for the disconfirming evidence than for the mixed evidence. There was a main effect of Class in the reliability ratings, F(1, 194) = 87.39, p < .001, ηp² = .31. Figure 5 shows that the source reliability ratings for the nonscience arguments were consistently low—even when all the evidence confirmed the hypothesis, ratings of source reliability did not reach the midpoint of the response scale.2 Another way of describing this pattern of data is to say that the likelihood ratios for the nonscience argument were all low, and all fairly similar—even the coherent evidence was not deemed especially diagnostic. Thus, the observed polarization in the strength of the arguments—with strong science arguments perceived as more convincing, and weak science arguments less convincing than their nonscience counterparts—would seem to be related to the perceived reliability of the science and nonscience sources. As in the Arguments From Ignorance task, this would suggest that evaluations of both science and nonscience arguments are sensitive to the diagnosticity of the evidence. There is no reason to believe that the science arguments were being evaluated in a qualitatively different way.

Figure 4. Mean ratings of argument strength in the Mixed Evidence task. Error bars indicate 1 SD.

Figure 5. Mean ratings of the reliability of the sources in the Mixed Evidence task. Error bars indicate 1 SD.

Footnote 2. One possible reason for this is that the nonscience argument was based on the testimonies of newspapers—notoriously unreliable sources of evidence (Friedman, Dunwoody, & Rogers, 1999). Whatever the reason for the discrepancy, however, the "squashing" of the source reliability ratings for nonscience arguments provides an explanation for the polarization of the ratings of science argument strength: The reliability of the sources differed substantially between evidence conditions for the science arguments, but not for the nonscience arguments.
Consequentialist Arguments

All 100 participants completed the Consequentialist Arguments task. A three-way analysis of variance (ANOVA) with Outcome Negativity, Level of Sacrifice, and Class (science or nonscience) as independent variables was conducted on these residual ratings of argument strength. Arguments containing more negative outcomes
were rated as significantly stronger (M = 5.66, SD = 2.52) than arguments containing less negative outcomes (M = 4.68, SD = 2.78), F(1, 392) = 20.51, p < .001, ηp² = .05. Arguments requiring a smaller sacrifice were rated as significantly stronger (M = 5.66, SD = 2.56) than arguments requiring a bigger sacrifice (M = 4.77, SD = 2.64), F(1, 392) = 18.66, p < .001, ηp² = .05. There was no effect of Class on ratings of argument strength, p > .05. In addition, none of the interaction terms between these three variables were significant. Ratings of both the science and nonscience consequentialist arguments were influenced by the negativity of the outcome and the sacrifice that was required to avoid this outcome. Because both these factors contribute to the utility component of Bayesian decision theory, these results are as we would expect.

Participants also provided separate ratings of Outcome Utility and Sacrifice Desirability—specifically, participants were asked "how bad would it be if Outcome X occurred?" and "how bad would it be to make Sacrifice Z?". A two-way ANOVA with Level of Sacrifice and Outcome Utility as independent variables, and Outcome Negativity residual ratings and Sacrifice Desirability residual ratings as dependent variables was conducted. Only Level of Sacrifice had a significant effect on Sacrifice Desirability ratings, with more negative ratings of the sacrifice when it was big (M = 4.62, SD = 2.8) than when it was small (M = 7.63, SD = 2.49), F(1, 392) = 192.04, p < .001, ηp² = .33. It is interesting to note, however, that both Outcome Utility, F(1, 392) = 220.16, p < .001, ηp² = .36, and Level of Sacrifice, F(1, 392) = 4.67, p < .05, ηp² = .01, had a significant effect on Outcome Negativity ratings. Specifically, in addition to the expected effect of the manipulation of outcome negativity on ratings of outcome negativity, participants rated the outcome as significantly less negative when the sacrifice required was great (see Figure 6). Participants seemed to be reducing the negativity of the
outcome when avoiding that outcome required a big sacrifice on their part. From the perspective of Bayesian decision theory, the level of sacrifice required should not feed into the perceived negativity of the outcome. This unpredicted effect of sacrifice on outcome negativity was common to the evaluation of science and nonscience arguments.

Figure 6. The effect of outcome utility and level of sacrifice on ratings of outcome negativity in the Consequentialist Arguments task.
Discussion

In the Arguments From Ignorance task, participants assigned lower ratings of convincingness to science arguments from ignorance than to nonscience arguments from ignorance. However, while the strength of the arguments differed according to their class (i.e., science/nonscience), closer inspection revealed that the perceived reliability of the scientific and nonscientific sources also differed. Science arguments from ignorance from unreliable sources were rated as less compelling only when the sources themselves were judged to be less reliable. Thus, it would seem that evaluations of science and nonscience arguments from ignorance were both determined by the diagnosticity of the evidence—a finding with immediate implications for science communication.

Arguments from ignorance have traditionally been treated by philosophers as reasoning fallacies—arguments that seem convincing, but should not be (Woods et al., 2004). In a wide range of science communication contexts, however, arguments from ignorance are used as the basis for important claims (e.g., the safety of pharmaceutical drugs). Our results suggest that drug companies are right to use arguments from ignorance as the basis for their safety claims—people are capable of distinguishing strong and weak arguments from ignorance, for both science and nonscience topics (see also Oaksford & Hahn, 2004). However, our results also identify conditions for the legitimate use of scientific arguments from ignorance—when the search for evidence has been thorough (e.g., valid laboratory tests) and when the evidence is reported by a reliable source (e.g., in an established medical journal). Thus, our findings underscore the importance of the scientific method and the peer review process for the legitimacy of communicating scientific messages.

A question remains, however, as to why unreliable and weak science arguments should be perceived as less compelling and less reliable than their nonscience counterparts. One possible explanation
is the perceived position of scientific knowledge in our lives. Scientific knowledge is taught in classrooms as a collection of facts—certainties—that are arrived at by a rigorous and objective process of hypothesis testing (Simon et al., 2002). It is perhaps no surprise that a science argument lacking evidence from an unreliable source strikes us as particularly unconvincing, and equally plausible that a reliable and evidence-based science argument is more compelling than a comparable nonscience argument.

In the Mixed Evidence task, a similar relationship between argument strength and reliability ratings was observed. In particular, nonscience reliability ratings were consistently "flat," contributing to the polarized ratings of argument strength. However, similarly to the Arguments From Ignorance task, evaluations of science and nonscience arguments were both determined by the diagnosticity of the evidence—stronger evidence and increasing source reliability produced higher ratings of argument strength. Our results have implications, therefore, for the communication of uncertainty in scientific messages—people were sensitive to both the coherence of the evidence and the reliability of the source reporting the evidence. Lack of evidential coherence (i.e., a "mixed message") reflects badly on judgments of source reliability, and this effect seemed to be strongest for the science argument. Contradictory scientific evidence may impact badly on perceptions of scientists themselves—and this will feed into the evaluation of scientific messages. For bodies such as the Intergovernmental Panel on Climate Change (IPCC), presenting a "unified front" may be a critical determinant of how compelling the general public consider their messages to be. Debate continues over the most appropriate way of representing uncertainty in IPCC reports (Budescu et al., 2009; Patt & Schrag, 2003). Our results suggest that an additional factor to consider might be whether the source of the uncertainty is attributed to the experts or the data itself—because disagreement between experts seems to weaken the strength of scientific messages.

In the Consequentialist Arguments task, there was no effect of the science/nonscience manipulation on ratings of argument strength, nor on ratings of outcome or sacrifice utility. However, an unexpected effect was observed—ratings of the perceived negativity of the outcome of both science and nonscience arguments were lower when avoiding that outcome required a big sacrifice. The "impartiality requirement" of decision theory means that these utilities should be independent—the utility of a negative outcome should be unaffected by presenting it alongside a small or large sacrifice (Allingham, 2002; Stewart, Chater, Stott, & Reimers, 2003). Consideration of a potential explanation for this result has implications for the communication of appeals based on scientific evidence. In the social psychological literature the phenomenon of cognitive dissonance (Cooper & Fazio, 1984; Festinger, 1957) is well established. Very simply, to reduce the "dissonance" between their behavior (i.e., being unwilling to make a large sacrifice) and their beliefs (i.e., that climate change may cause widespread flooding), individuals modify either their behavior or their beliefs. We assume that people are typically unwilling to make a big sacrifice, yet a sufficiently negative outcome demands some form of evasive action.
It is possible that participants in this experiment minimized the negativity of the outcome, rather than modifying their behavioral intentions—as a less negative outcome carries a reduced obligation to avoid it.
If consequentialist arguments such as these do engage dissonance mechanisms in this way, then communicators must be cautious in invoking too negative an outcome and linking it to personal sacrifice. Changing environmental behavior may be more effectively achieved by emphasizing the positive effects of proenvironmental behaviors (e.g., the health benefits of cycling rather than using a car). Research on the use of fear appeals in persuasive communication suggests that there is a danger of inducing defensive reactions if the severity of the message is too high (de Vries, Ruiter, & Leegwater, 2002), and that simply increasing severity does not necessarily add to the persuasive impact of a message (Hoog, Stroebe, & de Wit, 2005).

On the other hand, cognitive dissonance is a well-understood psychological phenomenon—and as such, there is a wealth of psychological literature (see, e.g., Cooper & Fazio, 1984; Dickerson, Thibodeau, Aronson, & Miller, 1992; Kantola, Syme, & Campbell, 1984) that could be brought to bear on the communication of science-based consequentialist appeals. For example, Thøgersen (2004) has examined dissonance-based explanations for the "spillover" of one proenvironmental behavior to another. Thøgersen suggests that the underlying reason for performing a behavior will impact on whether dissonance is felt toward other behaviors. If an individual has insulated their loft for environmental rather than financial reasons, they may be more likely to perceive themselves as inconsistent if they do not also fit draft excluders. Our results suggest that dissonance may play a role not just in the spillover of environmental behaviors, but in the evaluation of messages about these behaviors. Thus, the construction of consequentialist appeals about scientific topics like climate change and the performance of proenvironmental behaviors might benefit from paying attention to the considerations that Thøgersen (2004) has identified. In particular, consequentialist appeals about climate change may be able to effectively harness the power of cognitive dissonance if they focus on the environmental (rather than financial) reasons for acting.
General Discussion

In this article we used the Bayesian approach to informal argumentation (Hahn & Oaksford, 2007) as a framework for comparing people’s evaluations of science and nonscience arguments: the sorts of representations of science that the general public might come across in their daily lives. In an experiment involving three separate argument evaluation tasks, we sought to establish whether there were any differences in how people evaluated science and nonscience arguments. Some differences emerged in the perceived strength of science and nonscience arguments. Ratings of the strength of arguments from ignorance and mixed-evidence arguments seemed to be polarized for scientific topics. Crucially, though, people’s evaluations were sensitive to the diagnosticity of the evidence for both science and nonscience arguments. Therefore, although people’s subjective estimates of the parameters that determine argument strength sometimes differed depending on whether the arguments were scientific or not, the evaluation of science arguments was determined by the same factors that determined the evaluation of nonscience arguments. Participants’ evaluations of both science and nonscience arguments were influenced by normatively relevant factors: the amount of evidence, the reliability of the source, and their interaction. The one deviation we observed from normative responding was not related to the evaluation of evidence but to decisions and the assignment of utilities. This deviation, however, was common to both science and nonscience arguments.

On the basis of the three experimental tasks examined in this article, the factors that determine the strength of nonscience arguments would seem to be the very same factors that determine the strength of science arguments. This suggests that the communication of scientific messages should be guided by the same considerations that guide the construction of ostensibly nonscientific messages. From the perspective of designing programs aimed at improving the communication of science, this is encouraging. Science arguments seem subject to the same evaluative framework as nonscience arguments, though future work will need to examine these parallels in more depth.

Previous work on how people evaluate science arguments has been constrained by the lack of a content-level account of argument strength, and research has been unable to move beyond the structural analysis of frameworks such as Toulmin’s (1958). This focus on the procedural and structural features of argumentation is not confined to the study of science arguments: the pragma-dialectical theory of fallacy borrows heavily from Toulmin to identify criteria for distinguishing acceptable from unacceptable patterns of argumentation in general (van Eemeren & Grootendorst, 2004). However, the pragma-dialectical approach simply proposes rules that govern acceptable discourse; it is not a framework for assessing argument strength. The Bayesian approach does not suffer from this problem. By analyzing arguments at the level of their individual content, and by providing a straightforward calculus with which to predict the impact of varying different aspects of arguments, the Bayesian framework offers a valuable metric with which to assess argument strength. The results of the current experiment suggest that this metric is equally applicable to arguments about scientific topics, and thus provides an essential tool for designing and communicating messages about science. Normatively, the factors that make a nonscience argument strong are the same factors that make a science argument strong. Anyone wishing to design an effective scientific argument can make use of the growing body of research that uses the Bayesian framework to distinguish strong and weak arguments (Corner et al., 2006; Hahn & Oaksford, 2007; Oaksford & Hahn, 2004).

The experiment reported in this article represents the first attempt at a systematic, psychological study of science argument evaluation. We therefore think that the main contribution of this article lies not just in the data obtained, but in the demonstration of how the Bayesian framework can be brought to bear on science argument evaluation, and with that on science communication and education. This novel application enables questions to be posed about science argument evaluation that have not been addressed in previous applied research on lay understanding of science. It allows future research to proceed with more systematic questions in hand. And, as demonstrated in this article, these questions can be examined with a minimum degree of mathematical elaboration.
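To illustrate how little machinery is required, the sketch below scores argument strength as a posterior probability that depends on the amount of evidence and the reliability of its source. The specific likelihoods, the independence assumption, and the way reliability is folded into the calculation are illustrative choices made here for exposition; they are not the model tested in this experiment (for full treatments, see Hahn & Oaksford, 2007; Oaksford & Hahn, 2004).

# A toy Bayesian measure of argument strength, in the spirit of the framework
# discussed above. Every numerical value is an illustrative assumption rather
# than a parameter estimated from the data reported in this article.

def argument_strength(prior, n_evidence, p_e_if_true, p_e_if_false, reliability):
    """Posterior probability of a claim given n positive pieces of evidence.

    prior:        P(claim) before any evidence is considered
    n_evidence:   number of independent positive pieces of evidence
    p_e_if_true:  P(a positive piece of evidence | the claim is true)
    p_e_if_false: P(a positive piece of evidence | the claim is false)
    reliability:  probability that each report is relayed faithfully; an
                  unreliable report is treated as uninformative (0.5 either way)
    """
    like_true = reliability * p_e_if_true + (1 - reliability) * 0.5
    like_false = reliability * p_e_if_false + (1 - reliability) * 0.5
    numerator = prior * like_true ** n_evidence
    return numerator / (numerator + (1 - prior) * like_false ** n_evidence)

# Amount of evidence, source reliability, and their interaction (illustrative values):
for n in (1, 5):
    for r in (0.3, 0.9):
        print(f"n = {n}, reliability = {r}: {argument_strength(0.5, n, 0.8, 0.4, r):.2f}")
# Prints 0.56, 0.65, 0.76, 0.96: strength rises with more evidence and with a more
# reliable source, and reliability matters most when there is more evidence to relay.

Even this stripped-down version reproduces the qualitative pattern reported above, which is the sense in which the framework can guide message construction without detailed modeling.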
Of course, as noted earlier, it is possible to make more complex quantitative predictions about argument strength using the Bayesian framework (an approach we have pursued in other work; see Hahn & Oaksford, 2007). However, while some work on the evaluation of science arguments and the communication of science should also take this direction, we suspect that much future work will not proceed along detailed, quantitative lines. A key strength of the Bayesian framework is that it has been shown to capture a wide array of fundamental philosophical and scientific intuitions about evidential impact (Bovens & Hartmann, 2003; Howson & Urbach, 1996). The question of most interest to the science communication community would seem to be the extent to which these intuitions are shared by the lay public, and much of this can be addressed without the need for modeling. In applying the Bayesian framework to the evaluation of informal science arguments and evidence, we hope to contribute toward answering this important practical question.
References

Allingham, M. (2002). Choice theory: A very short introduction. Oxford: Oxford University Press.
Areni, C. S., & Lutz, R. J. (1988). The role of argument quality in the Elaboration Likelihood Model. Advances in Consumer Research, 15, 197–203.
Bovens, L., & Hartmann, S. (2003). Bayesian epistemology. Oxford: Oxford University Press.
Brante, T., Fuller, S., & Lynch, W. (Eds.). (1993). Controversial science: From content to contention. New York: New York Press.
Budescu, D. V., Broomell, S., & Por, H. H. (2009). Improving communication of uncertainty in the reports of the Intergovernmental Panel on Climate Change. Psychological Science, 20, 299–307.
Collins, H. M., & Evans, R. (2007). Rethinking expertise. Chicago: University of Chicago Press.
Collins, H. M., & Pinch, T. J. (1993). The Golem: What you should know about science. New York: Cambridge University Press.
Cooper, J., & Fazio, R. H. (1984). A new look at dissonance theory. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 17). New York: Academic Press.
Corner, A., Hahn, U., & Oaksford, M. (2006). The slippery slope argument: Probability, utility and category boundary re-appraisal. Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 1145–1151). Vancouver: Cognitive Science Society.
de Vries, N., Ruiter, R., & Leegwater, Y. (2002). Fear appeals in persuasive communication. In G. Bartels & W. Nelissen (Eds.), Marketing for sustainability: Towards transactional policy making. Amsterdam: IOS Press.
Dickerson, C. A., Thibodeau, R., Aronson, E., & Miller, D. (1992). Using cognitive dissonance to encourage water conservation. Journal of Applied Social Psychology, 22, 841–854.
Driver, R., Newton, P., & Osborne, J. (2000). Establishing the norms of scientific argumentation in classrooms. Science Education, 84, 287–312.
Edwards, W. (1961). Behavioural decision theory. Annual Review of Psychology, 12, 473–498.
Erduran, S., Simon, S., & Osborne, J. (2004). TAPping into argumentation: Developments in the application of Toulmin’s argument pattern for studying science discourse. Science Education, 88, 915–933.
Festinger, L. (1957). A theory of cognitive dissonance. Stanford: Stanford University Press.
Friedman, S. M., Dunwoody, S., & Rogers, C. L. (Eds.). (1999). Communicating uncertainty: Media coverage of new and controversial science. Hillsdale, NJ: Erlbaum.
Fugelsang, J. A., Stein, C. B., Green, A. E., & Dunbar, K. N. (2004). Theory and data interactions of the scientific mind: Evidence from the molecular and the cognitive laboratory. Canadian Journal of Experimental Psychology, 58, 86–95.
Green, D. W. (2007). A mental model theory of informal argument. In W. Schaeken, A. Vandierendonck, W. Schroyens, & G. d’Ydewalle (Eds.), The mental models theory of reasoning: Refinements and extensions. Hillsdale, NJ: Erlbaum.
Green, D. W. (2008). Persuasion and the contexts of dissuasion: Causal models and informal arguments. Thinking & Reasoning, 14, 28–59.
Gregory, J., & Miller, S. (1998). Science in public: Communication, culture & credibility. Cambridge, MA: Basic Books.
Hahn, U., & Oaksford, M. (2006). A Bayesian approach to informal fallacies. Synthese, 152, 207–237.
Hahn, U., & Oaksford, M. (2007). The rationality of informal argumentation: A Bayesian approach to reasoning fallacies. Psychological Review, 114, 704–732.
Hahn, U., Oaksford, M., & Corner, A. (2005). Circular arguments, begging the question and the formalization of argument strength. Proceedings of AMKCL05–Adaptive Knowledge Representation and Reasoning. Helsinki: Helsinki University.
Heit, E. (1998). A Bayesian analysis of some forms of inductive reasoning. In M. Oaksford & N. Chater (Eds.), Rational models of cognition. Oxford: Oxford University Press.
Hoog, N., Stroebe, W., & de Wit, J. B. F. (2005). The impact of fear appeals on processing and acceptance of action recommendations. Personality & Social Psychology Bulletin, 31, 24–33.
Howson, C., & Urbach, P. (1996). Scientific reasoning: The Bayesian approach. Chicago: Open Court.
Irwin, A., & Wynne, B. (Eds.). (1996). Misunderstanding science? The public reconstruction of science and technology. Cambridge, MA: Cambridge University Press.
Jimenez-Aleixandre, M. P. (2002). Knowledge producers or knowledge consumers? Argumentation and decision making about environmental management. International Journal of Science Education, 24, 1171–1190.
Jimenez-Aleixandre, M. P., Rodriguez, A. B., & Duschl, R. A. (2000). “Doing the lesson” or “doing science”: Argument in high school genetics. Science Education, 84, 757–792.
Johnson-Laird, P. N. (1983). Mental models: Towards a cognitive science of language, inference and consciousness. Cambridge, MA: Harvard University Press.
Kantola, S. J., Syme, G. J., & Campbell, N. A. (1984). Cognitive dissonance and energy conservation. Journal of Applied Psychology, 69, 416–421.
Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives: Preferences and value tradeoffs. New York: Wiley.
Kemp, C., & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116, 20–58.
Kirk, R. E. (1995). Experimental design: Procedures for the behavioural sciences. London: Brooks/Cole.
Knowles, J. (2003). Norms, naturalism and epistemology. New York: Palgrave Macmillan.
Kolstø, S. D., Bungum, B., Arnesen, E., Isnes, A., Kristensen, T., Mathiassen, K., et al. (2006). Science students’ critical examination of scientific information related to socioscientific issues. Science Education, 90, 632–655.
Korb, K. B. (2004). Bayesian informal logic and fallacy. Informal Logic, 24, 41–70.
Korpan, C. A., Bisanz, G. L., Bisanz, J., & Henderson, J. M. (1997). Assessing literacy in science: Evaluation of scientific news briefs. Science Education, 81, 515–532.
Kortland, K. (1996). An STS case study about students’ decision making on the waste issue. Science Education, 80, 673–689.
Kuhn, D., Shaw, V., & Felton, M. (1997). Effects of dyadic interaction on argumentative reasoning. Cognition & Instruction, 15, 287–315.
Kuhn, T. S. (1970). The structure of scientific revolutions. Chicago: The University of Chicago Press.
Lagnado, D. A., & Harvey, N. (2008). The impact of discredited evidence. Psychonomic Bulletin & Review, 15, 1166–1173.
Laplace, P. S. (1951). A philosophical essay on probabilities (F. W. Truscott & F. L. Emory, Trans.). New York: Dover Publications. (Original work published 1814)
Lorenzoni, I., Nicholson-Cole, S., & Whitmarsh, L. (2007). Barriers perceived to engaging with climate change among the UK public and their policy implications. Global Environmental Change, 17, 445–459.
Lorenzoni, I., & Pidgeon, N. F. (2006). Public views on climate change: European and USA perspectives. Climatic Change, 77, 73–95.
Nelson, J. D. (2005). Finding useful questions: On Bayesian diagnosticity, probability, impact and information gain. Psychological Review, 112, 979–999.
Norris, S. P., Phillips, L. M., & Korpan, C. A. (2003). University students’ interpretation of media reports of science and its relationship to background knowledge, interest and reading difficulty. Public Understanding of Science, 12, 123–145.
Oaksford, M., & Chater, N. (2007). Bayesian rationality: The probabilistic approach to human reasoning. Oxford: Oxford University Press.
Oaksford, M., & Hahn, U. (2004). A Bayesian approach to the argument from ignorance. Canadian Journal of Experimental Psychology, 58, 75–85.
O’Keefe, D. J., & Jackson, S. (1995). Argument quality and persuasive effects: A review of current approaches. In S. Jackson (Ed.), Argumentation and values: Proceedings of the ninth Alta conference on argumentation (pp. 88–92). Annandale, VA: Speech Communication Association.
Osherson, D., Smith, E. E., Wilkie, O., López, A., & Shafir, E. (1990). Category based induction. Psychological Review, 97, 185–200.
Patronis, T., Potari, D., & Spiliotopoulou, V. (1999). Students’ argumentation in decision-making on a socio-scientific issue: Implications for teaching. International Journal of Science Education, 21, 745–754.
Patt, A. (2007). Assessing model-based and conflict-based uncertainty. Global Environmental Change, 17, 37–46.
Patt, A. G., & Schrag, D. P. (2003). Using specified language to describe risk and probability. Climatic Change, 61, 17–30.
Pennington, N., & Hastie, R. (1986). Evidence evaluation in complex decision making. Journal of Personality and Social Psychology, 51, 242–258.
Petty, R. E., & Cacioppo, J. T. (1986). Communication and persuasion: Central and peripheral routes to attitude change. New York: Springer-Verlag.
Petty, R. E., & Wegener, D. T. (1991). Thought systems, argument quality, and persuasion. In R. S. Wyer & T. K. Srull (Eds.), Advances in social cognition (Vol. 4, pp. 147–161). Hillsdale, NJ: Erlbaum.
Pidgeon, N., Kasperson, R. E., & Slovic, P. (2003). The social amplification of risk. Cambridge, United Kingdom: Cambridge University Press.
Pidgeon, N. F., & Rogers-Hayden, T. (2007). Opening up nanotechnology dialogue with the publics: Risk communication or “upstream engagement”? Health, Risk and Society, 9, 191–210.
Pollack, H. N. (2005). Uncertain science . . . uncertain world. Cambridge, United Kingdom: Cambridge University Press.
Popper, K. R. (1959). The logic of scientific discovery. London: Hutchison.
Ratcliffe, M. (1999). Evaluation of abilities in interpreting media reports of scientific research. International Journal of Science Education, 21, 1085–1099.
Rips, L. J. (1989). Similarity, typicality and categorisation. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning. New York: Cambridge University Press.
Sadler, T. D. (2004). Informal reasoning regarding socioscientific issues: A critical review of research. Journal of Research in Science Teaching, 41, 513–536.
Savage, L. J. (1954). The foundations of statistics. New York: Wiley.
Schultz, P. W., Nolan, J. M., Cialdini, R. B., Goldstein, N. J., & Griskevicius, V. (2007). The constructive, destructive, and reconstructive power of social norms. Psychological Science, 18, 429–434.
Simon, S., Erduran, S., & Osborne, J. (2002). Enhancing the quality of argumentation in school science. Proceedings of the Annual Meeting of the National Association for Research in Science Teaching. New Orleans: National Association for Research in Science Teaching.
Stewart, N., Chater, N., Stott, H. P., & Reimers, S. (2003). Prospect relativity: How choice options influence decision under risk. Journal of Experimental Psychology: General, 132, 23–46.
Takao, A. Y., & Kelly, G. J. (2003). Assessment of evidence in university students’ scientific writing. Science & Education, 12, 341–363.
Tentori, K., Crupi, V., Bonini, N., & Osherson, D. (2007). Comparison of confirmation measures. Cognition, 103, 107–119.
Thøgersen, J. (2004). A cognitive dissonance interpretation of consistencies and inconsistencies in environmentally responsible behavior. Journal of Environmental Psychology, 24, 93–103.
Toulmin, S. (1958). The uses of argument. Cambridge, UK: Cambridge University Press.
van Eemeren, F. H., & Grootendorst, R. (2004). A systematic theory of argumentation: The pragma-dialectical approach. Cambridge, UK: Cambridge University Press.
van Gelder, T. J., Bissett, M., & Cumming, G. (2004). Cultivating expertise in informal reasoning. Canadian Journal of Experimental Psychology, 58, 142–152.
von Aufschaiter, C., Erduran, S., Osborne, J., & Simon, S. (2008). Arguing to learn and learning to argue: Case studies of how students’ argumentation relates to their scientific knowledge. Journal of Research in Science Teaching, 45, 101–131.
Woods, J., Irvine, A., & Walton, D. (2004). Critical thinking, logic & the fallacies. Toronto: Prentice Hall.
Zehr, S. (2000). Public representations of scientific uncertainty about global climate change. Public Understanding of Science, 9, 85–103.
Appendix A

An Example of the Materials used in the Arguments from Ignorance Task (topic S1)

In the first example (a), the search for evidence is incomplete and the source of the evidence is unreliable. In the second example (b), the search for evidence is thorough and the source of the evidence is reliable.

(a)
Dave: This new anti-inflammatory drug is safe.
Jimmy: How do you know?
Dave: Because I read that there has been one experiment conducted, and it didn’t find any side effects.
Jimmy: Where did you read that?
Dave: I got sent a circular email from excitingnews@wowee.com

(b)
Dave: This new anti-inflammatory drug is safe.
Jimmy: How do you know?
Dave: Because I read that there have been fifty experiments conducted, and they didn’t find any side effects.
Jimmy: Where did you read that?
Dave: I read it in the journal Science just yesterday.
Appendix B

The Materials used in the Mixed Evidence Task

Science Claim
The next visible solar flare will be during the month of November.
Confirming Evidence
Professor Grantham has calculated that the next visible solar flare will occur on November 20th.
Professor Bootley reports that there will be a visible solar flare in the first week of November.
Professor Parry has identified November as the most likely date for the next visible solar flare.
Professor Reddon published a statement which estimated the next visible solar flare to occur during the winter.
Mixed Evidence
Professor Grantham has calculated that the next visible solar flare will occur on November 20th.
Professor Bootley reports that there will be a visible solar flare in the first week of August.
Professor Parry has identified November as the most likely date for the next visible solar flare.
Professor Reddon published a statement which estimated the next visible solar flare to occur during the summer.
Disconfirming Evidence
Professor Grantham has calculated that the next visible solar flare will occur on July 20th.
Professor Bootley reports that there will be a visible solar flare in the first week of August.
Professor Parry has identified July as the most likely date for the next visible solar flare.
Professor Reddon published a statement which estimated the next visible solar flare to occur during the summer.
Non-Science Claim
The leader of the Conservative Party and his wife are going to have a baby in the summer.
Confirming Evidence
The Daily News reports that the leader of the Conservative Party and his wife are expecting a baby in August.
The Globe reports that the leader of the Conservative Party and his wife are expecting a baby in July.
The World Today reports that the leader of the Conservative Party and his wife are expecting a baby in June.
News Update reports that the leader of the Conservative Party and his wife are expecting a baby in May.
Mixed Evidence
The Daily News reports that the leader of the Conservative Party and his wife are expecting a baby in November.
The Globe reports that the leader of the Conservative Party and his wife are expecting a baby in August.
The World Today reports that the leader of the Conservative Party and his wife are expecting a baby in July.
News Update reports that the leader of the Conservative Party and his wife are expecting a baby in March.
Disconfirming Evidence
The Daily News reports that the leader of the Conservative Party and his wife are expecting a baby in November.
The Globe reports that the leader of the Conservative Party and his wife are expecting a baby in December.
The World Today reports that the leader of the Conservative Party and his wife are expecting a baby in January.
News Update reports that the leader of the Conservative Party and his wife are expecting a baby in March.
Appendix C

An Example of the Materials used in the Consequentialist Arguments Task

The first example (a) is from topic S1. The second example (b) is from topic NS2. The slash-separated alternatives indicate the manipulation of the utility of the outcome and the utility of the sacrifice.

(a)
The Intergovernmental Panel on Climate Change (IPCC) have claimed that if global warming continues at the current rate, it will cause global sea levels to rise and thousands of people who live in low-lying areas will lose their homes/tourism will be disrupted. The IPCC has calculated that if everyone switched to using energy efficient light bulbs/we all stopped using aeroplanes, the amount of CO2 saved would stop the sea level from rising.
(b)
Imagine that you are worried about your car being stolen/being scratched. In order to stop people stealing/scratching your car, you can install an alarm system that warns people when they approach your vehicle to keep their distance, at a cost of £20/£200.

Received August 13, 2008
Revision received May 7, 2009
Accepted May 11, 2009