Because Hitler did it! Quantitative tests of Bayesian argumentation ...

1 downloads 0 Views 499KB Size Report
sensitivity (e.g., Hahn, Harris & Corner, 2009; Hahn & Oaksford, 2007a; Oaksford &. Hahn, 2004). The ad Hitlerum. Hitler has been used as an argumentative ...
Because Hitler did it! Quantitative tests of Bayesian argumentation using Ad Hominem

Adam J. L. Harris, Anne S. Hsu, and Jens K. Madsen University College London, UK

Corresponding author: Adam J. L. Harris, Department of Cognitive, Perceptual and Brain Sciences, University College London, London, UK, WC1H 0AP. Tel: +44 (0)20 7679 5412 Email: [email protected]

Word count (main text) = 9,192

1

Abstract Bayesian probability has recently been proposed as a normative theory of argumentation. In this article, we provide a Bayesian formalisation of the ad Hitlerum argument, as a special case of the ad hominem argument. Across 3 experiments, we demonstrate that people’s evaluation of the argument is sensitive to probabilistic factors deemed relevant on a Bayesian formalisation. Moreover, we provide the first quantitative evidence in favour of the Bayesian approach to argumentation. Quantitative Bayesian prescriptions were derived from participants’ stated subjective probabilities (Experiments 1 & 2), as well as from frequency information explicitly provided in the experiment (Experiment 3). Participants’ stated evaluations of the convincingness of the argument were well matched to these prescriptions.

2

Introduction Adolf Hitler is one of the most infamous characters in History. Responsible for atrocities such as the murder of approximately 6 million Jewish people in the Holocaust (see e.g., Gigliotti & Lang, 2005), his name has become synonymous with evil. Consequently, Hitler has since been used as an argumentative tool to argue against propositions. Strauss (1953) coined the term ‘ad Hitlerum’ to refer to the use of Hitler in argumentation: ”Unfortunately, it does not go without saying that in our examination we must avoid the fallacy...ad Hitlerum. A view is not refuted by the fact that it happens to have been shared by Hitler” (pp. 42-43). The status of the ad Hitlerum as an argumentation fallacy is implied in this quote, and stems from the traditional normative standard of argumentation, logic. As Strauss points out, the fact that Hitler shared a view does not necessarily refute that view (for example, using that argument against vegetarianism seems ridiculous), that is, it is not a deductively valid argument. However, there do exist a number of domains for which Hitler's endorsement may well be considered good evidence against a viewpoint (for example, in the domain of human rights). The recent Bayesian approach to argumentation (Hahn & Oaksford, 2006a, 2007a) provides a normative framework within which an argument can be understood as potentially providing some evidence in favour of a position, without the necessity of deductive validity in the manner of binary formal logic. The Bayesian approach models argument content as continuous and probabilistic. The use of a probabilistic framework enables us to distinguish between strong and weak instantiations of the same argument form, and thus to 3

understand why the same argument form might be seen as weak in one context, but strong in another. The current article begins with a discussion of the various forms of the ad Hitlerum, before highlighting its critical components from a Bayesian perspective. Consequently, we suggest that Hitler can be invoked as part of a number of argument forms in order to argue against a proposition. In the current article, however, we focus on perhaps the simplest instantiation of the ad Hitlerum argument. We do, however, propose that the probabilistic components that underlie this particular instantiation can be generalised to many other cases where Hitler is used in argumentation. We subsequently present three empirical studies, which demonstrate that people’s understanding of the ad Hitlerum is well predicted by the relevant Bayesian parameters. Moreover, we extend evidence in favour of the rationality of participants’ reactions to argumentative dialogue by testing the degree to which participants’ understanding of an argument is in line with quantitative predictions derived from the Bayesian framework. This represents an important contribution as extant evidence in support of people being ‘Bayesian arguers’ has focused on their being sensitive to the relevant Bayesian parameters, but has not tested the direct quantitative nature of this sensitivity (e.g., Hahn, Harris & Corner, 2009; Hahn & Oaksford, 2007a; Oaksford & Hahn, 2004). The ad Hitlerum Hitler has been used as an argumentative device ever since the Second World War. Strauss coined the ad Hitlerum name in 1953, but there is no reason to suppose that the argument was not present in colloquial expression prior to that date. Its usage continues today. For instance, a Penn State Trustee compared Reagan’s speech to 4

Young Americans for Freedom to Hitler’s speeches to the Hitler Youth (Moser, 2006), anti-smoking campaigns in Germany have been linked to the Nazi’s attempts to ban smoking in public places (Schneider & Glantz, 2008), and on his radio show, Rush Limbaugh noted that “Adolf Hitler, like Barack Obama, ruled by dictate” (McGlynn, 2010). Indeed, the pervasiveness of the argument is so dominant that Mike Godwin felt it necessary to ‘phrase’ the truism that is the Godwin Law: “As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches one” (Godwin, 1994). The ad Hitlerum therefore remains prevalent in today’s society. It can, however, be seen that the argument can take on different forms. Below we outline those forms and categorise them within the existing catalogue of argument fallacies. The ad Hitlerum as a Slippery Slope Argument Not all references to Hitler are made within the same argument form. Strauss (1953) explicitly considers the ad Hitlerum to be a specific case of the reductio ad absurdum argument. The reductio ad absurdum argues in favour of a proposition by deriving an absurdity from the denial of that proposition (Rescher, 2005). For logicians, the ad absurdum derives a self-contradiction from the denial of that proposition, typically via a logical derivation (Rescher, 2005). It is unclear to us, however, that the ad Hitlerum should be categorised as an instance of the ad absurdum (at least as that argument is defined by logicians). Rather, we suggest that Strauss, not an argumentation scholar himself, was considering an intuitive interpretation of what ‘ad absurdum’ means. We therefore suggest that Strauss himself actually perceived the ad Hitlerum to be a specific instance of a Slippery Slope argument (SSA). SSAs have been somewhat resistant to a stable definition, but 5

Corner, Hahn and Oaksford (2011, p. 135) define them as comprising of the following four components: i.

An initial proposal (A).

ii.

An undesirable outcome (C).

iii.

The belief that allowing (A) will lead to a re-evaluation of (C) in the future.

iv.

The rejection of (A) based on this belief.

Hitler can be used in a SSA, with the recognition of an implicit premise. For example, “You should not adopt policy X because Hitler did, [and look where that led…]” The implication being that the adoption of Policy X would make the more negative aspects of Hitler’s dictatorship (e.g., the Holocaust, WWII) (C) more likely, thus resulting in the rejection of Policy X (A). From a Bayesian perspective, those factors that make SSAs differentially strong and weak have been explicated and supported in Corner et al. (2011). The ad Hitlerum as ad hominem To refer to someone as “Hitler” or a “Nazi” is a direct attack upon that person. Within an argumentative discourse, this is an example of an abusive ad hominem argument. By comparing an individual or their plans to Hitler, the character of the individual is attacked, and thus it is suggested that the proposition they advance should be disregarded (see e.g., Walton, 1989, 1995, on the ad hominem fallacy). The ad hominem has classically been viewed as an argument fallacy from both a logical (e.g., Copi & Cohen, 1994) and pragma-dialectic (e.g., Van Eemeren & Grootendorst, 2004) perspective. From a logical perspective, the ad hominem is considered to be a fallacy of relevance - the characteristics of the individual advancing an argument are 6

not necessarily relevant to the acceptability of that argument. From a pragma-dialectic perspective, the ad hominem is considered fallacious because an attack on an opponent’s character attempts to prevent the opponent from advancing standpoints and, as such, is not a valid defence of one’s standpoint (see Van Eemeren, Garssen, & Meuffels, 2009, current issue, for a detailed pragma-dialectic account of the argumentum ad hominem). Although not a logically valid argument, an understanding of the characteristics of an argumentation opponent can provide relevant information in ascertaining the truth of the proposition under consideration. For example, once an individual’s credibility has been questioned, one is no longer able to have absolute confidence in facts reported by that individual (on source reliability see also, from a Bayesian perspective, Bovens & Hartmann, 2003; Corner, Harris, & Hahn, 2010; Hahn et al., 2009; Schum, 1981; and from a non-Bayesian perspective, e.g., Birnbaum & Stegner, 1979; Walton, 2008a)1. Similarly, if an individual is similar to Hitler in their moral values, this would seem to provide some evidence against following their advice in certain contexts (specifically, moral contexts). Because certain of Hitler’s moral judgments were clearly bad, if an individual with similar moral values to Hitler judges that an action is morally right, one should have less confidence in that judgment than if the individual were not similar to Hitler in this regard. Bovens and Hartmann (2003) and Hahn et al. (2009) both argue that Bayesian probability provides an appropriate normative framework within which to investigate source reliability generally, and the present article provides further empirical support for this notion. Copi and Cohen (1994) extend the ad hominem classification to arguments in which “a conclusion or its proponent are condemned simply because the view 7

defended is defended also by persons widely believed to be of bad character” (p. 123). A simple ad Hitlerum argument clearly takes this argumentation form: “Policy X is not a good idea because Hitler adopted the same policy.” Here, the policy is being condemned because Hitler (whom reasoners will presumably consider to be of bad character) also adopted it. Thus, this is an ad hominem argument because Hitler’s bad character is being used to condemn a policy he proposed. Some researchers might consider that this is too broad an extension of the ad hominem argument because it does not constitute a direct attack against the proponent of the present argument. In the General Discussion we consider the possibility that this example of the ad Hitlerum might be equally well classified as an example of an appeal to (negative) authority. However, in terms of the existing literature, Copi and Cohen’s conceptualisation above provides the best classification for the ad Hitlerum arguments considered in the current article, as examples of the argumentum ad hominem. In the empirical studies that follow, we consider ad Hitlerum arguments that most closely resemble this latter structure. Specifically, the arguments used in our experimental materials take a dialogue form in which one individual argues for the likely ‘badness’ of a proposition by stating that Hitler entertained the same proposition. We note, however, that the aspect of the ad Hitlerum argument form that we focus on is critical for the evaluation of all forms of the argument, including as an SSA and more generic forms of the abusive ad hominem. Essentially our probabilistic formulation determines the likelihood that a proposal is bad given that Hitler shared it. This is relevant for all forms of the ad Hitlerum, as they all necessitate that Hitler is viewed as evidence against the goodness of a proposal.

8

Bayesian Argumentation We will investigate the ad Hitlerum within the Bayesian framework of argumentation (Hahn & Oaksford, 2006a, 2007a)2. Central to the Bayesian approach is the recognition that the amount to which people believe a given proposition is a matter of degree. That is, propositions do not have truth values of 0 or 1, but rather an individual’s degree of belief in a particular proposition can be understood as a probability between 0 and 1 (see also e.g., Evans & Over, 2004; Howson & Urbach, 1996; Oaksford & Chater, 1998, 2007). Before commencing an argumentation dialogue, an individual will hold a prior belief in a given proposition, or hypothesis (h). This belief takes the form of a probability, and is termed the prior, P(h). An argumentation dialogue consists of two parties providing evidence in order to try and convince their argumentation partner of the truth of their position. Thus, the discussants within an argumentation may each provide and receive evidence in support of, or contrary to, the hypothesis under consideration. The normative procedure by which individuals should update their degree of belief in a hypothesis h upon receipt of an item of evidence e is given by Bayes’ Theorem: P ( h | e) =

P ( h ) P (e | h ) P (e )

(1)

where P(h|e) represents the posterior degree of belief that a hypothesis h is true after having received some evidence, e. P(e|h) represents the probability of receiving the evidence e if the hypothesis is true. P(e) is the probability of the evidence occurring regardless of the truth or falsity of the hypothesis, and can be calculated from P(h), P(e|h) and P(e|¬h) (the probability of receiving the evidence if, in fact, the hypothesis is not true (¬h). These probabilities are considered to be subjective degrees of belief. 9

Consequently, the Bayesian framework requires beliefs to be coherent (consistent with the axioms of probability), but does not require correspondence (i.e., matching) between these beliefs and real-world probabilities. The Bayesian approach stipulates that, upon receiving evidence, people should update their probabilistic degrees of belief in a hypothesis in accordance with the prescriptions of Bayes’ Theorem. Empirical evidence has demonstrated that people are sensitive to the relevant probabilistic features of an argument across a variety of argumentation fallacies, including: the argument from ignorance (Hahn, Oaksford & Bayindir, 2005; Oaksford & Hahn, 2004), slippery slope arguments (Corner et al., 2011), circular arguments (Hahn & Oaksford, 2007a) and more prototypical examples of ad hominem, where the focus has been on the role of the prior (Oaksford & Hahn, forthcoming). More generally, people’s treatment of source expertise appears consistent with Bayesian prescriptions (Hahn et al., 2009). In addition, Corner and Hahn (2009) demonstrated that the evaluation of arguments pertaining to current scientific issues of considerable import were also in line with Bayesian prescriptions. In the present article, we extend the results cited above in two ways. Firstly, we consider an additional argument type (a particular instantiation of the ad hominem). Secondly, our experimental design enables us to test the degree to which participants’ evaluations are consistent with the quantitative prescriptions of Bayes’ Theorem. The previous research cited above only explores the qualitative predictions of the Bayesian framework. Precise predictions can be calculated using Bayes’ Theorem if one has information about the relevant conditional probabilities, P(e|h) and P(e|¬h), and the prior, P(h). In Experiments 1 and 2, we collect data from participants pertaining to their subjective conditional probabilities, allowing us, assuming a specified prior, to make precise predictions as to how convincing they should find the argument. In Experiment 3, we again collect 10

these conditional probabilities, but also present a design in which we attempt to define these conditional probabilities such that there is an objective probabilistic prediction against which participants’ ratings can be evaluated (see also Griffiths & Tenenbaum, 2006; Harris & Hahn, 2009). It is prudent to note here that, although standard, subjective Bayesianism makes no correspondence prescriptions, using frequency information from our environment (especially in the absence of other relevant information) is desirable and relevant if we wish to use evidence to maximise the accuracy of our beliefs, as we would in any situation in which an action was to be based on these beliefs (e.g., vote in favour of, or against, the proposed policy)3. Formalising the ad Hitlerum For the type of argument investigated in this paper, our hypothesis concerns the goodness of a proposal given that Hitler endorsed it. From Bayes’ Theorem, we can prescribe normative predictions for how good a proposal should be perceived to be given that Hitler had endorsed it. In this work, we make the simplifying assumption that proposals are either good or not good (i.e., “bad”). For illustration purposes, we use the example proposal of a policy previously implemented by Hitler. Here, we ask participants how likely they expect it is that a policy is good after they hear the claim that Hitler had implemented such a policy in the past. We define participants’ posterior probability that the policy is good, given the association with Hitler as the posterior probability, P(good|Hitler). Bayes’ Theorem (Equation 1) therefore defines P(good|Hitler) as:

P ( good | Hitler ) =

P ( good ) P( Hitler | good ) (2) P ( Hitler )

from which the denominator can be expanded to: 11

P ( good | Hitler ) =

P ( good ) P( Hitler | good ) (3) P ( good ) P( Hitler | good ) + P (¬good ) P( Hitler | ¬good )

where P(Hitler|good) and P(Hitler|¬good) are the probabilities that Hitler had implemented the policy given that it was a good and not good policy respectively. For simplicity, in explaining the experimental design and results of Experiments 1, 2 and 3, we refer to these two conditional probabilities collectively as the likelihood probabilities. Oaksford and Hahn (forthcoming) have analysed the role of the prior in a Bayesian analysis of ad hominem arguments, but here we assume that the prior degree of belief (before any evidence is given) is that the policy is equally likely to be bad as it is to be good, that is P(¬good) = P(good)=.5. We can then rewrite Equation 2 as:

P( good | Hitler ) =

1 (4) P( Hitler | ¬good ) +1 P( Hitler | good )

This shows that the probability of a policy being good only depends on the likelihood ratio: P( Hitler | good ) . (5) P( Hitler | ¬good ) Where the likelihood ratio is less than 1, P(good|Hitler) < P(good), that is, the policy will be perceived as less good once one learns that it was previously implemented by Hitler. Where the likelihood ratio is greater than 1, the opposite will apply. In the case of the ad Hitlerum, we assume that the former condition is more often consistent with people’s beliefs.

12

In the experiments that follow, we not only require participants to rate P(good|Hitler), but subsequently require them to provide ratings of the likelihood probabilities, P(Hitler|good) and P(Hitler|¬good). By setting up the experimental materials in such a way that the prior degree of belief, P(good), can be assumed to equal .5, we can calculate a Bayesian posterior for each argument for each participant, and compare this with participants’ explicit ratings of P(good|Hitler) in order to ascertain the degree of quantitative fit between participants’ argument ratings and the Bayesian prescriptions. Overview of the Current Experiments Experiment 1 provides participants with simple ad Hitlerum arguments in opposition to five different propositions. We test whether participants are sensitive to changes in the topic of the argumentation, and whether these sensitivities can be predicted on the basis of the conditional probabilities they provide pertaining to the likelihood probabilities of Hitler as an item of evidence. If participants are truly sensitive to the important rational considerations of the likelihood probabilities, they should revise their opinion of an argument’s convincingness when provided with additional information referring to these conditional probabilities. This is the main question addressed in Experiment 2. Experiment 3 provides background information about the likelihood probabilities, so as to create an objective rational prediction for the convincingness of these arguments. Because we are unable to control participants’ background knowledge about Hitler, the argument used in Experiment 3 concerns a fictional alien leader, ‘Zhang’, but the form of the argument is identical to the ad Hitlerums we use in Experiments 1 and 2. In addition, Experiment 3 explores a greater range of explicit likelihood probabilities and also examines arguments which 13

are both for and against a proposal (ad Hitlerum arguments are only given against the proposal). In the work that follows, some probabilities are explicitly provided to participants in the experimental materials, whilst others are provided by participants. For clarity we term the former, objective probabilities and the latter, subjective probabilities, which we denote with subscripts subj and obj respectively4.

Experiment 1 We presented participants with a series of dialogues concerning the likely ‘goodness’ of propositions for various topics. All dialogues contained the same structure, and featured two speakers, A and B. In the dialogue, B argues to A that a particular proposition is a bad idea because Hitler had done it (or had liked it). We then asked participants what A’s opinion should be of the proposition after hearing the argument. A participant’s judgment of what A’s opinion should be is represented as the posterior probability Psubj(good (idea)|Hitler (liked/did it)), hereafter written as Psubj(good|Hitler). In order to evaluate the degree of rationality of participants’ assessments, we assessed Psubj(Hitler (liked/did it)|good (idea)) and Psubj(Hitler (liked/did it)|bad (idea)), hereafter written as Psubj(Hitler|good) and Psubj(Hitler|bad), which we collectively refer to as likelihood probabilities. From a participant’s likelihood probabilities, Equation 3 can be used to make a quantitative prediction for what the participant’s judgment of A’s opinion Psubj, predict(good |Hitler) should be5. Here the subscript subj,predict is used to indicate a prediction based on participant’s subjective responses. This can be compared with Psubj(good|Hitler), in order to assess how well participant’s responses can be described by the rational Bayesian framework. 14

Method Participants. After excluding participants who were under 18 years old (in line with departmental ethical guidelines)6, 61 participants (42 female) volunteered for the experiment, which was advertised on http://psych.hanover.edu/research/exponnet.html, a site for recruiting volunteers to participate in web-based experiments. The age range was from 18-60 years (median = 22 years) and participants predominantly reported being from North America (N = 40), with the next largest group (N = 9) being from Africa. Participants received no renumeration for completing this short experiment. Design. A within-participants design was employed. There were five different argument topics, presented in the form of a dialogue between A and B (see Figure 1). Each dialogue had the same structure. The five different topics were: a transportation policy, an economic policy, a religious policy, a law banning smoking in parks, and a film. In the dialogue, we enforced our assumption of neutral priors on whether the proposal is good or bad, P(good) = P(bad)= .5, with A stating, “I have no idea if it’s [referring to the given topic] a good idea or not.” We note that there are other possible pragmatic interpretations of A’s statement: For example, the phrase ‘no idea’, may just reflect a certain degree of probabilistic uncertainty, without guaranteeing an uncertainty of 0.5, but here we assume P(h) = .5. The presentation order of the five topics was randomized across participants.

15

After assessing all five dialogues, participants provided likelihood probabilities for each topic, again with the order randomized across participants (independent of the first presentation order). Materials and procedure. The experiment was programmed in Adobe Flash and run online in the participant’s web browser. The five dialogues were presented sequentially with only one dialogue on the screen at a time. Having read a dialogue, participants were asked to rate the convincingness of the argument in response to the question: “In light of the dialogue above, what do you think A’s opinion should now be of the proposed [transportation policy/economic policy/policy on religion/ law banning smoking in parks/ film]?” Participants made their responses by moving a slider on a scale between “Completely convinced it’s a bad idea” and “Completely convinced it’s a good idea” in the four policy topics, and “Completely convinced it’s unsuitable to show children” and “Completely convinced it’s suitable to show children” in the film topic (see Figure 1). The slider was initially positioned at the halfway point on the scale. The participant’s positioning of the slider on the scale was taken to be directly proportional to the degree of belief that the proposition was good, given the knowledge that Hitler had liked/implemented it. This is represented as the posterior probability Psubj(good|Hitler). This is the main dependent variable of interest and will be referred to in the remainder of this article as the posterior rating. Note that we followed Oaksford and Hahn (2004) in asking about how convinced A should now be. This is in line with the Bayesian theory of argumentation being a normative one (i.e., that arguments can be evaluated according to this objective rational standard). In this 16

instance, the participant is taking the place of the ‘reasonable critic’ in Van Eemeren and Grootendorst’s (2004, p. 1) definition of argumentation: “Argumentation is a verbal, social, and rational activity aimed at convincing a reasonable critic of the acceptability of a standpoint by putting forward a constellation of propositions justifying or refuting the proposition expressed in the standpoint.” After seeing all five argument dialogues, participants were asked, for each topic, two questions designed to elicit their subjective beliefs about the likelihood probabilities, Psubj(Hitler|good) and Psubj(Hitler|bad). The five topics were presented sequentially with only the questions relevant to one topic on the screen at a time. The two questions referring to a single topic were on the screen at the same time. The first question always elicited Psubj(Hitler|good) and the second question elicited Psubj(Hitler|bad) for the given topic. The first questions (eliciting Psubj[Hitler|good]) were of the form: “ Of all German transportation policies between 1925 and 1945 that Historians now recognise as being GOOD, how many do you think Hitler was responsible for?” The second questions were exactly the same, but with "GOOD" replaced with "BAD". The questions for the other political scenarios used the same format with minor word changes. The questions for the film topic read: “Of all the films [SUITABLE/UNSUITABLE] for children, what proportion do you think Hitler would have liked?” For each question, participants used the slider (between “none” and “all”) to indicate their responses. We chose to elicit likelihood probabilities in frequentist form as we preceived it as the most intuitive format for participants in the contexts of these 17

particular experiments (see e.g., Gigerenzer, 2002). Alternative possibilities would be to directly ask for the subjective probability, or to ask for a judgment of confidence in the conditional: “If the policy was good, then Hitler was responsible for it,” with which ordinal predictions could be made on the recognition that participants typically represent such conditionals as conditional probabilities (e.g., Over, Hadjichristidis, Evans, Handley & Sloman, 2007)7. At the end of the experiment, participants provided their age and gender before being thanked and debriefed. Results and Discussion A one-way ANOVA demonstrated that the ad Hitlerum argument was viewed as differentially convincing across the five different topics, F(3.43, 240) = 16.74, p < .001, etap2 = .22 (Greenhouse-Geisser correction for repeated measures applied). This result cannot be explained on the basis of the argument structure (which is identical across conditions), but is readily explainable from a Bayesian perspective, which takes into account the content of the argument in its evaluation. The content is accounted for because it affects the likelihood probabilities, which determine the argument’s convincingness, according to the Bayesian account. We used each participant’s likelihood probabilities to calculate, for each topic, and each participant, a Bayesian prediction for how convinced A should be that the proposal is good in light of the argument, that is Psubj,predict(good|Hitler). As mentioned above, we assumed an initial prior of P(good)=.5. A further simplifying assumption used in this experiment and Experiment 2 is that the judgment given by a contemporary person on the five topics would be the same regardless of whether they thought the topics were to be considered/implemented between 1925 and 1945 or 18

considered/implemented today. The same one-way ANOVA as above was conducted on the Psubj,predict(good|Hitler) values and the main effect of topic was again observed, F(3.50, 209.78) = 13.11, p < .001, etap2 = .18. Figure 2 demonstrates that the average pattern of argument convincingness across topics was well predicted by the Bayesian model. A topic-level analysis between the average value of Psubj,predict(good|Hitler), and the average value of the posterior rating for each topic yielded a correlation of radj(3)= .94, p < .05.8 This result indicates that 89% of the variance in the posterior ratings across the different topics was explained by the Bayesian model. An alternative way to analyse the data is to compute a correlation for each participant individually across their 5 judgments and corresponding predicted judgments and then compute the average of those correlation coefficients9. This individual-level analysis resulted in a mean correlation coefficient of .24. Wallsten, Budescu, Erev, and Diederich (1997) show that, where participants’ judgments are subject to some degree of random error or noise, a group-level average will be closer to the true values underlying those judgments. Hence, the better model fit observed with the group-level average (Figure 2), itself provides some support for the hypothesis that participants’ posterior ratings were (somewhat) noisy estimates of the Bayesian predictions. Experiment 2 The results of Experiment 1 showed that, on average, people’s posterior ratings were well predicted by their likelihood probabilities. Thus, people’s responses to the argument were in line with the predictions of rational Bayesian reasoning. A further prediction of the rational Bayesian model is that if people were given objective information about the likelihood probabilities, this should be taken into account and thus affect their posterior ratings10. In Experiment 2, we therefore provided 19

participants with objective values for these likelihood probabilities. This was done by introducing a third interlocutor into the argument, who provided participants with information pertaining to these likelihood probabilities. Method Participants. 184 participants (76 female), aged between 18 and 78 (median age 27), were recruited using Amazon Mechanical Turk (see e.g., Paolacci, Chandler, & Ipeirotis, 2010, in support of the validity of experimental data obtained via Amazon Mechanical Turk). Participants predominantly reported being from North America (N = 72) and Asia (N = 70). Each participant was paid $0.10 for completing this short experiment. Design. A 3x2 (topic x likelihood probability) between-participants design was employed. We chose films and transport as two of the topics as these gave rise to the least change in participants’ ratings in Experiment 1 (that is, posterior ratings were the closest to the assumed prior of .5). In addition, we also included religion, as this showed the most negative belief ratings, and is associated with the worst of all Hitler’s atrocities, thus providing a strict test for the rational account. The likelihood probability variable refers to whether Pobj(Hitler|Good) was high and Pobj(Hitler|Bad) was low (the positive condition), or vice versa (negative condition). Low and high likelihood probabilities were .2 and .8 respectively.11 Materials and procedure. The initial screens of the experiment were identical to those in Experiment 1 (except now only one topic was shown per participant). Participants were asked to 20

make the same initial rating about the proposal’s goodness as in Experiment 1. This we will refer to as posterior rating1. Following this rating, a new screen appeared, headed with the words “Person C now joins the argument”. The initial dialogue was faded but remained visible, and C’s contribution was presented below it in black. An example of the structure of C’s contribution is as follows (shown for the transportation topic): “Expert historians agree that between 1925-1945, there were just as many GOOD policies on transportation in Germany as there were BAD.

Of all German policies on transportation between 1925 and 1945 that historians now recognise as being GOOD, it is a fact that Hitler was responsible for [80%/20%] of them.

Of all German policies on transportation between 1925 and 1945 that historians now recognise as being BAD, it is a fact that Hitler was responsible for [20%/80%] of them.”

C mentioned the equal number of good and bad policies on the topic in Germany so as to guard against the possibility that the period 1925-1945 might have been perceived as a particularly bad or good period for policies. Such a perception would reduce the consistency between the conditional probability questions asked and those required to calculate a Bayesian posterior degree of belief (the conditional probability questions asked how likely Hitler was to have been involved in something bad or good during the period 1925-1945, whereas the posterior Bayesian question concerns how good something is likely to be in the present day).12 On a following 21

screen, the whole dialogue remained visible and participants were asked, “In light of this new information, what do you think A’s opinion should now be of the proposed policy on religion?” Thus, participants were asked a second time for their responses, which we term posterior rating2. Participants again responded using a slider, as they had for their initial judgments. Although participants were explicitly provided with likelihood probabilities regarding how likely good and bad policies/films were to be implemented/liked by Hitler, participants’ subjective likelihood probabilities would still have been influenced by their own subjective beliefs. Thus, at the end of the experiment, we elicited the likelihood probability judgments, Psubj(Hitler|good) and Psubj(Hitler|bad), from participants, as was done in Experiment 1. Here, participants only provided likelihood probability judgments for the topic that they had read. All other aspects of the procedure were identical to Experiment 1. Results and Discussion The first stage of Experiment 2 was equivalent to a between-participants replication of Experiment 1 with only three topics. However, a factorial ANOVA performed on posterior ratings1 did not replicate Experiment 1’s significant effect of topic (F < 1). We tentatively attribute this difference to different recruitment methods, given that these ratings should be based on participants’ subjective beliefs about the relevant conditional probabilities, which can differ across people. The main analyses of interest, however, concerned the effect of likelihood probability on judgments of posterior ratings2, for which the population averages are displayed in Figure 3. Results showed that ratings were affected by the likelihood probability manipulation, F(1, 178) = 56.50, p < .001, etap2 = .24. There was also a 22

significant interaction between likelihood probability and topic, F(2, 178) = 3.84, p = .04, etap2 = .04. This interaction is explainable from a Bayesian perspective as topics can differ in the degree to which participants would assimilate the values of Pobj(Hitler|good) and Pobj(Hitler|bad) offered to them by Person C in the dialogue. Although no effect of topic was observed in the analyses of posterior ratings1, this is likely to be a noisy estimate of participants’ true ratings (see also, Vul & Pashler, 2008; Wallsten et al., 1997), and these ratings might have been more stable for some topics than for others. The significant difference observed between these three topics in the less noisy, within-participant, Experiment 1 offers some support for this suggestion. We also acknowledge, however, that certain argument topics might be particularly difficult to fully explain within a rational framework. For example, for the topic of religion, people may be more reluctant to adopt a rational framework and to judge a policy on religion associated with Hitler as good, despite being provided with objective likelihood probabilities consistent with such a judgment. Further support for the possibility that certain highly emotive argument topics might be less susceptible to a rational treatment than other topics comes from an analysis of the Bayesian predictions, Psubj,predict(good|Hitler) (calculated as in Experiment 1). An ANOVA conducted on these values replicated the effect of the likelihood probability condition, F(1, 178) = 114.16, p < .001, etap2 = .39, but failed to replicate the significant interaction observed above, F(2, 178) = 1.33, p = .27. This suggests that the interaction cannot be explained in terms of the subjective likelihood probabilities provided by participants. As in Experiment 1, we wished to test the degree to which participants’ posterior ratings were quantitatively predicted by the Bayesian model. Because subjective likelihood probabilities were collected after participants had made their 23

second rating, we correlated Psubj,predict(good|Hitler) with posterior ratings2. Once again, mean ratings for each experimental condition were well predicted by the Bayesian model, radj(4)= .95, p < .01 (see Figure 4), accounting for 89% of the variance in mean responses across experimental conditions. Note that Figure 4 shows that the only topic for which the error bars of posterior ratings2 and Psubj,predict(good|Hitler) do not overlap concern the topic of religion, that most associated with highly emotive atrocities carried out by Hitler. Experiment 3 Experiment 2 showed that participants’ posterior ratings following an ad Hitlerum argument were, in general, well predicted by their subjective likelihood probabilities, in line with the predictions of the Bayesian framework. Experiment 3 was designed as a further exploration of this. Here, as in Experiment 2, likelihood probabilities were explicitly provided to participants. These likelihood probabilities were systematically varied across 17 different pairs of values (see Table 1), which gave rise to a range of predicted judgments regarding the goodness of the proposal. We also sought a more controlled experimental design in which there would be no previous knowledge about the likelihood probabilities, and for which we could investigate the effects of argument direction (using a reference to an individual either to attack or support a proposition). To achieve this, the ad Hitlerum argument was modified to an ‘ad Zhangum’, where the scenario concerned policies on the fictional Planet Xenon, which had once been governed by Zhang. The aim was to maintain the structure of the ad Hitlerum argument while minimising the impact of participants’ prior real-world knowledge of likelihood probabilities and allowing for exploration of both argument directions. 24

Method Participants. 725 participants (305 female), aged between 18 and 79 years (median = 29), were recruited via Amazon Mechanical Turk. Participants predominantly reported being from Asia (N = 395), with 176 from North America. They were paid $0.10 for completing this short experiment. Design. A 17×2 (likelihood probability × argument direction) between-participants design was employed. Participants were randomly assigned to one of 34 experimental conditions. Participants were told that there had been 10 successful and 10 unsuccessful transportation policies that had been implemented on Planet Xenon. Likelihood probabilities pertaining to how likely Zhang was to have been responsible for successful and unsuccessful policies, Pobj(Zhang|successful) and Pobj(Zhang|unsuccessful), were provided in frequency format by showing participants the number of successful and unsuccessful transport policies (out of 10) that Zhang had been responsible for. We used 17 different combinations for Pobj(Zhang|successful) and Pobj(Zhang|unsuccessful) (see Table 1). These objective likelihoods, Pobj(Zhang|successful) and Pobj(Zhang|unsuccessful), can be used to make predictions for how successful the policy should be judged to be, Pobj,predicted(successful|Zhang) (see Table 1). Note that here, unlike in Experiment 2, the likelihood probability information was provided at the very start of the experiment. Also, participants were only asked to judge the goodness of the policy, Psubj(successful|Zhang) (again referred to hereafter as the posterior rating) once, whereas in Experiment 2 they were asked to judge this twice. Argument direction 25

refers to how the reference to Zhang was used in the argument. In the ‘pro’ direction condition, Zhang was used to support the argument that the proposed policy would be successful and in the ‘against’ condition, Zhang was used to support the argument that the policy would be unsuccessful. This variable was introduced to balance the design as, unlike Hitler, Zhang is not a character with existing negative connotations, and therefore can potentially be used as positive evidence, as well as negative evidence. Finally, as in Experiments 1 and 2, and consistent with the Bayesian emphasis on subjective probability, predicted values were calculated from the subjective likelihoods, Psubj(Zhang|successful) and Psubj(Zhang|unsuccessful), which were again elicited from the participants. As before, predictions were calculated using Equation 3 and assuming P(successful) = P(unsuccessful) = .5. Materials and procedure. Participants were informed that transport was new on Planet Xenon and there had consequently only been 20 transportation policies implemented to date. They were told that, of these 20 policies, 10 had been successful and 10 unsuccessful and Zhang had been responsible for some of these policies. On a subsequent screen, participants were presented with 20 transportation policies, 10 in a column labelled ‘Successful’ and 10 in a column labelled ‘Unsuccessful’ (see Figure 5). Those that Zhang had implemented were marked with a green check mark. Within each likelihood probability condition, the specific policies that had been implemented by Zhang were randomised across participants. To ensure that participants processed the information, and were given some initial reason for its presence on the screen, participants were asked, on the basis of this information, to indicate how good they thought Zhang had been for transport on Planet Xenon. To minimise the risk of 26

participants simply copying this response in their subsequent posterior ratings, rather than using a slider, this response was typed as a number between -10 (extremely bad) and 10 (extremely good). On the next screen, participants were provided with an argument in the same format as in Experiment 1. The reference to Hitler was replaced with a reference to Zhang and the argument proponents were introduced as Zeeb and Zorba, two citizens of Planet Xenon. In the ‘against’ condition, all other aspects were identical to Experiment 1. In the ‘pro’ condition, Zorba’s (Protagonist B’s) assertion that “It’s definitely not (a good idea)” was replaced with “It’s definitely a good idea.” After providing a posterior rating, participants provided subjective likelihood probability ratings for Psubj(Zhang|successful) and Psubj(Zhang|unsuccessful), with the question formats following those in Experiment 1. Results and Discussion As a manipulation check, to test whether participants had processed the likelihood probabilities provided relating to Zhang’s successes and failures, P(Zhang|good) and P(Zhang|bad), we first analysed participants’ responses to the question, “How good do you think Zhang was for the transport on Planet Xenon?” A 17×2 factorial ANOVA revealed the expected main effect of likelihood probabilities, F(16, 691) = 20.66, p < .001, etap2 = .32. No effect of argument direction was observed, F(1, 691) = 3.00, p = .084, etap2 = .004, neither was there an interaction between the two variables (F < 1). This is as expected, because the two argument direction conditions are identical at this stage of the experiment. Figure 6 shows that as Zhang objectively became better for transportation, so evaluative perceptions increased. 27

As in Experiments 1 and 2, the main dependent variable of interest was participants’ subjective judgments of how good the policy seemed after reading the argument, the posterior rating, Psubj(successful|Zhang). Figure 7 shows these ratings across experimental conditions. As expected under a rational framework, there was a main effect of likelihood probabilities on posterior ratings, F(16, 691) = 5.96, p