Using Mechanical Turk as a Subject Recruitment Tool for Experimental Research

Adam J. Berinsky
Associate Professor, Department of Political Science
Massachusetts Institute of Technology
[email protected]

Gregory A. Huber
Associate Professor, Department of Political Science
Institution for Social and Policy Studies
Yale University
[email protected]

Gabriel S. Lenz
Assistant Professor, Department of Political Science
University of California, Berkeley
[email protected]

October 7, 2011

We examine the tradeoffs associated with using Amazon.com’s Mechanical Turk (MTurk) interface for subject recruitment. We first describe MTurk and its promise as a vehicle for performing low-cost and easy-to-field experiments. We then assess the internal and external validity of experiments performed using MTurk, employing a framework that can be used to evaluate other subject pools. Next, we investigate the characteristics of samples drawn from the MTurk population. We show that respondents recruited in this manner are often more representative of the U.S. population than in-person convenience samples — the modal sample in published experimental political science — but less representative than subjects in Internet-based panels or national probability samples. Finally, we replicate important published experimental work using MTurk samples.

Interest in experimental research has increased substantially in political science. But experiments can be time consuming and costly to implement, particularly when they involve non-student adult subjects. Amazon.com’s Mechanical Turk (MTurk) has the potential to facilitate low-cost experiments in political science with a diverse subject pool. MTurk is an online, web-based platform for recruiting and paying subjects to perform tasks. Relative to other experimental pools, MTurk is extremely inexpensive, both in terms of the cost of subjects and the time required to implement studies. Not surprisingly, scholars across the social sciences have begun using MTurk to recruit research subjects.1 However, despite this burgeoning line of research, the benefits and potential limitations of using MTurk for subject recruitment in political science research remain relatively unexplored. (For related evaluations in psychology and in economics, see Buhrmester, Kwang, and Gosling 2011; Horton, Rand, and Zeckhauser 2010.)2

This paper addresses a simple but important question: Is MTurk a valid tool for conducting experimental research in political science?3 We present a framework for evaluating subject pools in general and then apply that framework to the MTurk subject pool. While the particular object of study here is the contemporary MTurk community, the types of analysis we undertake can be used to evaluate the strengths and limitations of other subject recruitment procedures.

We first identify threats to the internal and external validity of research conducted using MTurk. Even accounting for these threats, we argue that MTurk is a valuable subject recruitment tool. First, the demographic characteristics of domestic MTurk users are more representative and diverse than the corresponding student and convenience samples typically used in experimental political science studies.

1. As of October 2011, Google Scholar lists 769 social sciences articles with the phrase “Mechanical Turk.” Relevant studies by economists include, e.g., Chandler and Kapelner (2010); Chen and Horton (2010); Horton and Chilton (2010); and Paolacci et al. (2010). Computer scientists have also tested MTurk’s suitability as a source of data for training machine learning algorithms (e.g., Sheng et al. 2008; Sorokin and Forsyth 2008). For example, Snow et al. (2008) assessed the quality of MTurkers’ responses to several classic human language problems, finding that the quality was no worse than the expert data that most researchers use.
2. Analyses have generally found that experiments on Internet samples yield results similar to traditional samples. Based on a comprehensive analysis, for example, Gosling et al. (2004) conclude that Internet samples tend to be diverse, are not adversely affected by nonserious or habitual responders, and produce findings consistent with traditional methods.
3. The MTurk platform is of course limited to conducting research that does not require physical interactions between the subject and either the researcher or other subjects (e.g., to gather DNA samples, administer physical interventions, or observe face-to-face interactions among subjects).


Second, we replicate experimental studies previously conducted using convenience and nationally representative samples, finding that the estimates of average treatment effects are similar in the MTurk and original samples. Third, we find that potential limitations to using MTurk to recruit subjects and conduct research — in particular, concerns about heterogeneous treatment effects, subject attentiveness, and the prevalence of habitual survey takers — are not large problems in practice.

The remainder of the paper proceeds as follows. We begin by providing an overview of the subject-recruitment and data-gathering choices involved in using MTurk. We then lay out a framework to evaluate the threats to internal and external validity that arise from the particular characteristics of a given subject pool. The bulk of our paper concerns an evaluation of the MTurk pool according to these standards. To do this, we first describe results from a series of surveys measuring MTurk subjects’ demographic and political characteristics. Next, we compare our MTurk sample to: (1) the samples used in experiments published in leading political science journals, (2) a high-quality Internet panel sample, and (3) probability samples used in the Current Population Survey (CPS) and the American National Election Studies (ANES). We then demonstrate that the effects of experimental manipulations observed in the MTurk population comport well with those observed in other samples. We conclude by addressing two concerns raised about online samples: whether the MTurk population is dominated by subjects who participate in numerous experiments (or participate more than once in a given experiment) and whether MTurk subjects are effectively engaged with the survey stimuli. These concerns, we conclude, are relatively modest in the context of the MTurk platform.

Recruiting Experimental Subjects Using MTurk

A core problem for experimental researchers in political science is the difficulty and prohibitive cost of recruiting subjects. In recent years, important innovations, such as the NSF-funded Time-Sharing Experiments in the Social Sciences (TESS) project, have enabled broader access to nationally representative samples, but access to these resources remains limited.


Amazon.com’s Mechanical Turk is a promising alternative vehicle for experimental subject recruitment. Amazon.com markets MTurk as a means to recruit individuals to undertake tasks. In practice, these tasks involve a wide array of jobs requiring human intelligence, such as classifying pictures or transcribing handwriting, but tasks can also include taking surveys with embedded experimental manipulations. To initiate a survey using MTurk, a researcher (a “Requester” in Amazon’s vernacular) establishes an account (www.mturk.com), places funds into her account, and then posts a “job listing” using the MTurk web interface that describes the Human Intelligence Task (HIT) to be completed and the compensation to be paid (Amazon assesses Requesters a 10 percent surcharge on all payments). Each HIT has a designated number of tasks, and the Requester can specify how many times an individual MTurk “Worker” can undertake the task. Researchers can also set requirements for subjects, including country of residence and prior “approval rate,” which is the percent of prior HITs submitted by the respondent that were subsequently accepted by Requesters. When MTurk Workers who meet these eligibility requirements log onto their accounts, they can review the list of HITs available to them and choose to undertake any task for which they are eligible.4

The MTurk interface gives the researcher a great deal of flexibility in conducting a study. In addition to using MTurk’s embedded workspace to set up simple tasks, the researcher can also refer subjects to an external website. For instance, subjects might be redirected to a webpage to take a survey with an embedded experimental manipulation.5 Additionally, outside websites make it easy to obtain informed consent, implement additional screening procedures, debrief after an experiment, and collect detailed information about the survey process (including response times for items and respondents’ location when taking the survey, as determined on the basis of the respondents’ Internet Protocol [IP] address).
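To make this workflow concrete, the sketch below shows how a HIT that points to an external survey can be posted programmatically, with location and approval-rate requirements like those described here. It is an illustration only: it uses the modern boto3 MTurk client (which postdates the interface described in this paper), and the survey URL, reward, and thresholds are assumed values rather than those from our studies.

import boto3

# Sketch only: boto3 is an assumption; the paper's studies used Amazon's earlier interface.
mturk = boto3.client("mturk", region_name="us-east-1")

# Workers are sent to an external survey page via an ExternalQuestion.
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/my-survey</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Survey of Public Affairs and National Conditions",
    Description="Complete a short survey about current events (USA only). About 10 minutes.",
    Keywords="survey, current affairs, research, opinion, politics",
    Reward="0.50",                        # dollars per completed assignment (illustrative)
    MaxAssignments=500,                   # number of Workers who may complete the HIT
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=external_question,
    QualificationRequirements=[
        {   # restrict to U.S. workers (system locale qualification)
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
        {   # require a prior approval rate of at least 95 percent
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        },
    ],
)
print(hit["HIT"]["HITId"])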

4. We present screen shots of a sample HIT from a worker’s view in the online appendix.
5. We have successfully used commercial websites like SurveyGizmo and Qualtrics for this process, and any web survey service that can produce a unique worker code should be suitable. Providing subjects with a unique code and having them enter it on the MTurk website ensures that they have completed the task.


The final stage for the researcher is compensating subjects. The researcher can easily authorize payment for the task through the MTurk web interface.6

Cost per Subject and Ease of Recruitment

Each HIT advertisement shows the amount a worker will be paid to complete it. Additionally, the description can list how long the task will take and/or other features (e.g., “fun” or “easy”) that may make the task more attractive to MTurk workers. In practice, we have found that, relative to other experimental pools, recruiting subjects using MTurk is extremely inexpensive. A listing of different studies we have undertaken, including advertised length, payment, and number of completions per day, appears in Table 1. As that table makes clear, for short surveys advertised as taking between 2 and 4 minutes, we have been able to obtain more than 200 subjects per day when paying as little as $.25 per survey. When payments were lower (e.g., $.15 per HIT), recruitment was somewhat slower. At higher pay rates — between $.50 and $.75 per completion — we have been able to recruit over 300 subjects per day.7

To put these costs in perspective, even a relatively high pay rate of $.50 for a 5-minute survey (an effective hourly rate of $6.00) is still associated with a per-respondent cost of $.55 (including Amazon.com’s 10 percent surcharge), or $.11 per survey minute. By contrast, per-subject costs for typical undergraduate samples are about $5-10, for non-student campus samples about $30 (Kam et al. 2007), and for temporary agency subjects between $15 and $20. Outside of the campus setting, private survey firms we have worked with charge at least $10 per subject for a 5-minute survey when respondents are drawn from an Internet panel.

6. If the researcher has arranged for the external website to produce a unique identifier, she can then use these identifiers to reject poor quality work on the MTurk website. For example, if the experiment included mandatory filter questions or questions designed to verify the subject was reading instructions, the worker’s compensation can be made contingent on responses. Finally, a unique identifier also allows the researcher to pay subjects a bonus based on their performance using either the MTurk web interface or Amazon.com’s Application Programming Interface (API). A Python script we have developed and tested to automate the process of paying individual bonuses appears in the online appendix.
7. We are unaware of research using the MTurk interface to recruit large numbers of subjects for longer surveys, although Buhrmester et al. (2011) report being able to recruit about 5 subjects per hour for a survey advertised as taking 30 minutes for a $.02 payment. Other scholars have reported that higher pay increases the speed at which subjects are recruited but does not affect accuracy (Buhrmester, Kwang, and Gosling 2011; Mason and Watts 2009; but see Downs et al. 2010 and Kittur, Chi, and Suh 2008 on potentially unmotivated subjects, a topic addressed in greater detail below).
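The bonus-payment script referenced in footnote 6 appears in the online appendix. Purely as a loose illustration of the same idea, the sketch below approves work and pays a bonus contingent on filter questions using the boto3 client (an assumption, not the original script); the worker and assignment identifiers are hypothetical.

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Hypothetical records matched to MTurk submissions via the unique completion code
# produced by the external survey website.
completions = [
    {"worker_id": "A1EXAMPLE", "assignment_id": "3EXAMPLE", "passed_filter": True},
]

for c in completions:
    if c["passed_filter"]:
        # Pay the base reward, then a performance bonus on top of it.
        mturk.approve_assignment(AssignmentId=c["assignment_id"])
        mturk.send_bonus(
            WorkerId=c["worker_id"],
            AssignmentId=c["assignment_id"],
            BonusAmount="0.25",
            Reason="Bonus for completing all attention-check questions.",
        )
    else:
        mturk.reject_assignment(
            AssignmentId=c["assignment_id"],
            RequesterFeedback="Required filter questions were not answered.",
        )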


MTurk is, in short, extremely inexpensive relative to nearly every alternative other than uncompensated students.8

Assessing the Validity of Research Conducted Using MTurk

MTurk provides researchers with access to inexpensive samples. But are these samples of sufficient quality for political science research? To answer this question, we need to consider the validity of experiments conducted through MTurk using the standards that should be applied to evaluate any subject pool. Threats to the validity of experimental research are generally divided into questions of internal validity and external validity. External validity is an assessment of whether the causal estimates deduced from experimental research would persist in other settings and with other samples. Internal validity pertains to the question of whether causal estimates appropriately reflect the effects of the experimental manipulation among the participants in the original setting.

8. Another promise of MTurk is as an inexpensive tool for conducting panel studies. Panel studies offer several potential advantages. For example, recent research in political science on the rate at which treatment effects decay (Chong and Druckman 2010; Gerber, Gimpel, Green, and Shaw 2011) has led to concerns that survey experiments may overstate the effects of manipulations relative to what one would observe over longer periods of time. For this reason, scholars are interested in mechanisms for exposing respondents to experimental manipulations and then measuring treatment effects over the long term. Panels also allow researchers to conduct pre-treatment surveys and then administer a treatment distant from that initial measurement (allowing time to serve as a substitute for a distracter task). Another potential use of a panel study is to screen a large population and then to select from that initial pool of respondents a subset who better match desired sample characteristics. The MTurk interface provides a mechanism for performing these sorts of panel studies. To conduct a panel survey, the researcher first fields a task as described above. Next, the researcher posts a new task on the MTurk workspace. We recommend that this task be clearly labeled as open only to prior research participants. Finally, the researcher notifies those workers she wishes to perform the new task of its availability. We have written and tested a customizable Perl script that does just this (see the online appendix). In particular, after it is edited to work with the researcher’s MTurk account and to describe the new task, it interacts with the Amazon.com Application Program Interface (API) to send messages through the MTurk interface to each invited worker. As with any other task, workers can be directed to an external website and asked to submit a code to receive payment. Our initial experiences with using MTurk to perform panel studies are positive. In one study, respondents were offered 25 cents for a 3-minute follow-up survey conducted eight days after a first-wave survey. Two reminders were sent. Within five days, 68 percent of the original respondents took the follow-up. In a second study, respondents were offered 50 cents for a 3-minute follow-up survey conducted one to three months after a first-wave interview. Within eight days, almost 60 percent of the original respondents took the follow-up. Consistent with our findings, Buhrmester, Kwang, and Gosling (2011) report a two-wave panel study, conducted three weeks apart, also achieving a 60 percent response rate. They paid respondents 50 cents for the first wave and 50 cents for the second. Analysis of our two studies suggests that the demographic profile does not change significantly in the follow-up survey. Based on these results, we see no obstacle to oversampling demographic or other groups in follow-up surveys, which could allow researchers to study specific groups or improve the representativeness of samples.
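The re-contact script described in footnote 8 was written in Perl and appears in the online appendix. As a rough illustration of that step only, the sketch below sends a follow-up invitation to first-wave participants using the boto3 client’s NotifyWorkers operation (an assumed substitute for the original script); the worker IDs and message text are hypothetical.

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Hypothetical IDs recorded from the first-wave HIT.
first_wave_workers = ["A1EXAMPLE", "A2EXAMPLE", "A3EXAMPLE"]

message = (
    "Thank you for taking our earlier survey. A short follow-up survey "
    "(about 3 minutes, 25 cents) is now available to previous participants. "
    "Search MTurk for the HIT titled 'Follow-up survey for prior respondents'."
)

# NotifyWorkers accepts at most 100 worker IDs per call, so send in batches.
for i in range(0, len(first_wave_workers), 100):
    mturk.notify_workers(
        Subject="Follow-up survey now available",
        MessageText=message,
        WorkerIds=first_wave_workers[i:i + 100],
    )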


Concerns about the external validity of research conducted using student samples have been debated extensively (Sears 1986; Druckman and Kam 2011). For experimental research conducted using MTurk, two concerns raised about student samples are pertinent: (1) whether estimated (average) treatment effects are accurate assessments of treatment effects for other samples, and (2) whether these estimates are reliable assessments of treatment effects for the same sample outside the MTurk setting. The former concern is most likely to be a threat if treatment effects are heterogeneous and the composition of the MTurk sample is unrepresentative of the target population (see Druckman and Kam 2011). For example, if treatment effects are smaller for younger individuals than older ones, a sample dominated by younger individuals will yield estimated treatment effects smaller than what one would observe with a representative sample. The latter concern arises if people behave differently in the MTurk setting than they do outside of that setting.

Given our particular interest in assessing the validity of MTurk samples relative to other convenience samples, we undertake three types of analysis to address potential threats to validity. To address the concerns about external validity, we first compare the characteristics of MTurk samples to other samples used in political science research. Second, we use MTurk to replicate prior experiments performed using other samples to compare estimated treatment effects. The replication exercise also allows us to, in part, address the other concern about generalizability — whether MTurk subjects behave differently than similar subjects in other research settings. Third, we consider whether MTurk samples are dominated by habitual participants whose behavior might be unrepresentative of similar populations not exposed to frequent political surveys.9

Turning to the internal validity of estimates derived from MTurk samples, two concerns are especially pertinent. The first is the possibility that subjects violate treatment assignment by participating in a given task more than once. The second is subject inattentiveness, in which case some subsets of the sample do not attend to the experimental stimuli and are effectively not treated.

9. It should be noted that other convenience samples, such as student or local intercept samples, may also have significant numbers of habitual experimental participants. However, it is important to determine whether this is especially a problem in the MTurk sample, where subjects can easily participate in experiments from their home or work computers.


To explore these concerns, we directly assess subject attentiveness and the apparent prevalence of subjects participating in the same MTurk HIT multiple times from different user accounts.

Assessing Threats to External Validity

We begin our investigation of external validity by assessing the nature of the MTurk subject pool. Specifically, we compare measured characteristics of MTurk survey participants to characteristics of participants in three distinct types of research samples: convenience samples used in experiments published in leading political science journals, a sample generated by a high-quality Internet panel, and probability samples of U.S. residents.

We surveyed 587 MTurk workers in February and March of 2010. We advertised the survey as taking about 10 minutes and paid respondents 50 cents each.10 Because we wish to benchmark MTurk against samples of adult U.S. citizens, we restricted the survey to individuals MTurk classified as 18 or older and living in the U.S.11 We also excluded individuals with approval rates below 95 percent on previous MTurk tasks. As an additional check on U.S. residency, we verified that respondents took the survey from U.S. IP addresses and excluded the 32 individuals (5.8 percent) who did not.12

Comparison of Respondent Characteristics: Local Convenience Samples

Local convenience samples are the modal means of subject recruitment among recent published survey and lab experimental research in political science. We examined all issues of the American Political Science Review, the American Journal of Political Science, and the Journal of Politics from January 2005 to June 2010. Of the 961 articles in these issues, 51 used experimental data. Forty-four of these articles used U.S. subjects exclusively (the complete list of these articles, as well as a summary of subject recruitment methods, appears in the online appendix).

10. The HIT was described as follows: Title: Survey of Public Affairs and National Conditions. Description: Complete a survey to gauge your opinion of national conditions and current events (USA only). Should be no more than 10 mins. Keywords: survey, current affairs, research, opinion, politics, fun. Detailed Posting: Complete this research survey. Usually takes no more than 10 minutes. You can find the survey here: [URL removed]. At the end of the survey, you'll find a code. To get paid, please enter the code below.
11. MTurk classifies individuals as 18 or older based on self-reports. MTurk does not reveal how it classifies individuals as living in a particular country, but it may rely on mailing addresses and credit card billing addresses.
12. These individuals may reside in the U.S. but be traveling or studying abroad. Additionally, although IP address locators seem reliable, we are unaware of research benchmarking their accuracy. Still, to provide as conservative a picture of our sample as possible, we excluded these questionable respondents. Our results did not change when we included them.


Of these 44 articles, more than half used convenience samples for subjects (including student samples, local intercept samples, or temporary agencies).

Table 2 compares our MTurk sample to several convenience samples for a series of measures reported in prior work. After presenting selected demographics and partisanship of our MTurk sample, it displays the average characteristics of the student and adult samples collected by Kam, Wilking, and Zechmeister (2007). Next, the table lists characteristics of two adult convenience samples used in Berinsky and Kinder (2006) — one of the handful of articles that describes the characteristics of its convenience samples. One of these samples is from around Princeton, New Jersey and the other is from around Ann Arbor, Michigan.

On demographic representativeness, the MTurk sample fares well in comparison with these convenience samples.13 Not surprisingly, relative to an average student sample, the MTurk population is substantially older, but it is younger than any of the three non-student adult samples. The MTurk and student samples are similar in terms of gender distribution, but the adult sample reported by Kam et al. and the Ann Arbor Berinsky and Kinder sample are substantially more female. The MTurk sample has a similarly high education level as compared to the two Berinsky and Kinder samples, whereas the Kam et al. adult sample is much less educated (even compared to national probability samples; see Table 3). Finally, in terms of racial composition, the MTurk sample is much more White than the Kam et al. student sample (which is again very different from national samples; see Table 3) but similar to the Kam et al. adult sample and the Berinsky and Kinder Ann Arbor sample.14 More importantly for the purposes of political science experiments, the Democratic party identification skew in the MTurk sample is better, relative to the ANES (see Table 3), than in either adult sample from Berinsky and Kinder. Of course, our point is not to single out Berinsky and Kinder — the distribution of relevant demographic and political variables in their study was, in fact, more representative than that found in several other studies.15 Instead, we simply wish to emphasize that, when compared to the practical alternatives, the MTurk respondent pool has attractive characteristics — even apart from issues of cost.

13. Other researchers have surveyed MTurk respondents and found a similar demographic profile (e.g., Ross et al. 2010).
14. The MTurk sample does have fewer blacks than either of the Berinsky and Kinder adult samples.


Comparison of Respondent Characteristics: Internet Samples and High-Quality, Face-to-Face Probability Samples

Besides local convenience samples, the other dominant form of sample recruitment in the published experimental studies is Internet-based surveys. MTurk also fares reasonably well when compared to these samples. In this section, we compare MTurk to a high-quality Internet survey, the American National Election 2008-2009 Panel Study (ANESP). The firm Knowledge Networks conducted the ANESP by recruiting respondents through a random-digit-dial method for a 21-wave Internet-based panel survey (10 waves of the survey concerned political matters; the other 11 waves did not).16 Since we are treating the ANESP not as a best estimate of true population parameters, but rather as an example of a high-quality Internet sample, we present unweighted results from this survey. Comparing our MTurk survey with the ANESP has an additional advantage: since both are Internet surveys, we hold the “mode” of the survey constant. Additionally, in designing our MTurk survey, we followed the ANESP as closely as possible, using identical question wordings and branching formats.

To put these comparisons in perspective, we benchmark them against nationally representative samples, including the Current Population Survey (CPS) and the 2008 American National Election Study (ANES 2008). These latter two studies use face-to-face probability samples and are widely considered the “gold standard” for survey sampling. In comparing these samples, it should be noted that most differences were statistically significant, even when those differences were substantively trivial (this is due, in part, to the large sample sizes of the ANESP, the CPS, and the ANES). We present these significance tests in the appendix.17

15. Moreover, as the material in the online appendix makes clear, many other studies do not report any information about sample characteristics.
16. Prospective respondents were offered $10 per month to complete surveys on the Internet for 30 minutes each month.


In Table 3, we begin by presenting means and standard errors for key demographic variables. For continuous and near-continuous measures (age, education, and income), we also plot the distributions for these four samples in Figure 1. On many demographics, the MTurk sample is very similar to the unweighted ANESP. Starting with gender, MTurk is only slightly more female than the ANESP, 60 versus 58 percent, and only slightly more educated, 14.9 versus 14.5 years. As Figure 1 shows, both education estimates are somewhat higher than in the face-to-face national probability samples, indicating that both MTurk and the ANESP somewhat underrepresent low-education respondents. On age, we cannot compare the MTurk sample to the ANESP because the latter survey redacts age, but the MTurk sample is notably younger than either of the face-to-face samples. Consistent with the MTurk sample being younger, both mean and median income are lower than in the other samples. On race, MTurk’s characteristics are mixed: it is slightly closer to the CPS on percent White than is the ANESP, but it is considerably worse on percent Black and only slightly better on percent Hispanic.

Not surprisingly, MTurk fares worse in comparison to both the ANESP and the CPS on demographic characteristics related to life-cycle events, such as marital status, homeownership, and religious preference. MTurk subjects are more likely to have never married (51 percent), to rent rather than own their home (53 percent), and to report no religious affiliation (42 percent).18 Finally, the MTurk sample is broadly similar to the other samples in region of residence, with perhaps a slightly larger prevalence of those living in the Northeast.

17. We therefore only report significance tests in the exceptional cases when they are not statistically significant, relying on Kolmogorov-Smirnov tests of differences in distributions (and proportion tests for categorical variables). As shown in the appendix, about 85 percent of the tests between MTurk and the other samples are statistically significant at the 0.10 threshold. In comparison, about 60 percent of the tests between the ANES 2008 and the CPS 2008 are significant.
18. Because the ANESP redacts age, unfortunately, we cannot ascertain whether these differences are rooted solely in the age of the subject pool. On demographics, the only nonsignificant differences between MTurk and the other samples are on gender, marriage separation, Catholic, and region.


We next compare the samples on key political and psychological measures, including partisanship, ideology, political interest, political knowledge, need for cognition, and need to evaluate.19 These measures are often used in experimental research and are common on key political science surveys. These distributions are presented in Table 4 and Figure 2. Beginning with registration and 2008 turnout, the MTurk sample is more similar to the nationally representative samples than is the ANESP.20 MTurk respondents are slightly more Democratic in their partisan identification than are ANESP respondents, and are substantially more liberal in their ideology (a difference especially visible in Figure 2). MTurk respondents are also somewhat more interested in politics than ANESP respondents are, and both samples are considerably more interested than are ANES respondents.21

We also administered a battery of six political knowledge items from the ANESP. This battery includes questions about the line of succession for the presidency, the length of a U.S. Senate term, and the number of federal senators per state. Just before asking these questions, we instructed respondents to provide just their best guess and not to look up answers. For each item, we offered four answer options in a multiple-choice format.22 Based on their responses, MTurk subjects appear more knowledgeable than ANESP respondents, but the gap is not large (though it is statistically significant in four of six cases). Finally, for both need for cognition and need to evaluate, the MTurk sample scores higher than the ANESP, which is itself higher than the ANES 2008. But Figure 2 shows that the distributions are nevertheless quite similar.

19. The need for cognition and need to evaluate scales are from the 2008 ANES. These items were placed on a separate survey of 785 MTurk respondents conducted in May 2011. This study also contained the Kam and Simas (2010) replication discussed below. The HIT was described as follows: Title: Survey of Public Affairs and Values. Description: Relatively short survey about opinions and values (USA only). 10-12 minutes. Keywords: survey, relatively short. Detailed Posting: Complete this research survey. Usually takes 10-12 minutes. You can find the survey here: [URL removed]. At the end of the survey, you'll find a code. To get paid, please enter the code below.
20. In fact, differences between MTurk and the ANES 2008 on registration and turnout are not statistically significant (see the appendix). There are inconsistencies in the ANESP’s measures of turnout and registration (e.g., the survey contains respondents who say they are not registered to vote but report voting) that suggest caution here.
21. This result is somewhat odd because workers visit MTurk to make money, not because they are interested in politics. The higher levels of interest may be due to advertising the survey as about “public affairs.”
22. To check whether MTurk subjects looked up answers to knowledge questions on the Internet, we asked two additional multiple-choice questions of much greater difficulty: who was the first Catholic to be a major party candidate for president, and who was Woodrow Wilson's vice president. Without cheating, we expected respondents to do no better than chance. On the question about the first Catholic candidate, MTurk subjects did worse than chance, with only 10 percent answering correctly (Alfred Smith; many chose an obvious but wrong answer, John F. Kennedy). About a quarter correctly answered the vice presidential question (Thomas Marshall), exactly what one would expect by chance. These results suggest that political knowledge is not inflated much by cheating on MTurk.


Finally, we asked the MTurk sample several attitudinal questions that mirrored questions on the ANESP and the ANES 2008 (see Table 5). These questions asked about support for the prescription drug benefit for seniors, universal healthcare, and a citizenship process for illegal immigrants. The MTurk responses match the ANES well on universal health care — about 50 percent of both samples support it — while those in the ANESP are somewhat less supportive at 42 percent. MTurk also compares reasonably well on the question about a citizenship process for illegal immigrants. Perhaps as a function of the age skew of the sample or a different political environment after the political discussions surrounding the Obama health care initiative, MTurk respondents are less supportive of the prescription drug benefit for seniors compared to the ANES and ANESP — 64 percent of MTurk respondents favor the benefit, compared to 75 percent of ANESP and 80 percent of ANES respondents.

Our MTurk survey also included three additional policy questions from the ANESP that were not included on the 2008 ANES. These asked about support for a constitutional amendment banning gay marriage, raising taxes on people making more than $200,000, and raising taxes on people making less than $200,000. Compared to the ANESP, MTurk subjects express somewhat more liberal views on all three items, with only 16 percent supporting a constitutional amendment banning gay marriage, compared to 31 percent in the ANESP (Table 5).23 On both tax increase items, MTurk subjects are only a few percentage points more liberal in their views (and these differences are not statistically significant).

All told, these comparisons reinforce the conclusion that the MTurk sample does not perfectly match the demographic and attitudinal characteristics of the U.S. population, but neither does it present a wildly distorted view of that population. Statistically significant differences exist between the MTurk sample and the benchmark surveys, but these differences are substantively small. MTurk samples will often be more diverse than convenience samples and will always be more diverse than student samples. Thus, if we treat MTurk as a means for conducting internally valid experiments, instead of as a representative sample, the MTurk respondent pool is very attractive. At the same time, if one is interested in estimating treatment effects that may differ due to any of the factors for which the MTurk sample is less representative, then the MTurk sample may yield estimates that are inaccurate for the larger population.

23. As with the drug benefit, this difference may be due to age or to differences in political circumstances.


For example, if one believes that older or more conservative citizens are particularly responsive to treatments, researchers should be cautious about drawing broader conclusions. Furthermore, given the relative dearth of these sorts of individuals in the MTurk pool, large sample sizes may be necessary to obtain sufficient diversity on these dimensions to estimate differences in treatment effects for these groups, an issue that may guide ex ante targets for sample populations.24 Tables 3-5 and Figures 1 and 2, which compare the MTurk sample to Internet and face-to-face samples, provide clear guidance about the variables for which these concerns should be most salient.

Benchmarking Via Replication of Experimental Effects

To further assess MTurk’s usefulness as a vehicle for experimental research, we also replicated the results reported in three experiments. The first is a classic study of the effect of question wording on survey responses, the second is a canonical framing experiment, and the third is a recently published political science experiment on the effects of risk preferences on susceptibility to framing. In all three cases, the experimental results found using the MTurk sample are highly similar to those found in published research. (Additional question wording and design details necessary to conduct these replications appear in the appendix.)

Experiment 1: Welfare Spending

Rasinski (1989) reports results from a question wording experiment that asked representative samples from the General Social Surveys (GSS) from 1984 to 1986 whether too much or too little was being spent on either “welfare” or “assistance to the poor.” The GSS is a nationally representative face-to-face interview sample with characteristics similar to the ANES face-to-face surveys. Even though welfare and assistance to the poor are thought by policy experts to refer to the same policy, the study found important differences in levels of support between the two question forms. While 20 to 25 percent of the respondents in each year said that too little was being spent on “welfare,” 63 to 65 percent said that too little was being spent on “assistance to the poor” (Rasinski 1989, 391).

24. These discrepancies also suggest the potential utility of using an initial survey to screen large numbers of individuals and then inviting a more representative subset of those respondents to participate in the experiment itself. We discuss the technique for contacting selected respondents for a follow-up survey in footnote 8.


The GSS has continued to ask the spending experiment, and the gap in support for increasing spending between the question forms remains similar over time, ranging from 28 percent to 50 percent, with an average difference of 37 percent (Green and Kern 2010).

We ran the same between-subjects experiment on MTurk (N=329) in our original MTurk demographic survey described above. Respondents were randomly assigned to either the “welfare” or the “assistance to the poor” version of the question. We found a statistically significant 38 percentage point gap (p-value < .001) between the two conditions that is similar in magnitude to that found using the GSS (the comparable gap was 44 percent in the 2010 GSS). Only 17 percent of MTurkers said too little was being spent on “welfare,” while 55 percent said too little was being spent on “assistance to the poor.”25

We also explored whether these treatment effects were heterogeneous. In both the 2010 GSS and the MTurk samples, the experimental effect does not differ reliably across gender, education, or racial lines.26 Furthermore, a test of the difference in the estimated treatment effect by demographic group crossed with differences in experimental administration (MTurk vs. GSS) also yields null results.27 We present this analysis in the appendix.
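The reported gap is a simple difference in proportions between the two randomly assigned question forms. The sketch below shows that comparison with a standard two-sample proportion test; the counts are illustrative values that roughly match the percentages reported above, not the actual data.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts only (approximating 17% vs. 55% "too little" responses with
# N = 329 split across the two question forms); not the authors' raw data.
too_little = np.array([28, 91])    # "welfare" form, "assistance to the poor" form
n_assigned = np.array([165, 164])

z_stat, p_value = proportions_ztest(too_little, n_assigned)
gap = too_little[1] / n_assigned[1] - too_little[0] / n_assigned[0]
print(f"gap = {gap:.0%}, z = {z_stat:.2f}, p = {p_value:.4f}")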

25. The support for increased spending is, on average, somewhat higher in both conditions on the GSS. Specifically, in 2010, the GSS data show that 24 percent think that too little is being spent on “welfare,” while 68 percent think that too little is spent on “spending for the poor.”
26. Prior work similarly finds no evidence that gender or race are associated with differences in effect sizes in the GSS in years earlier than 2010 (Green and Kern 2010).
27. To conduct these tests, we pooled the MTurk and GSS samples. We then ran three ordered probits using the three-category welfare spending response scale as the dependent variable, one for each of the demographic variables (men vs. women; college educated vs. other; blacks vs. all other races). For each of these probits, we included as independent variables a dummy variable for sample (GSS vs. MTurk), a dummy variable for the treatment (“welfare” vs. “assistance to the poor” question form), the demographic variable of interest (education, gender, or race), and interactions among all of these variables. The interactions between the treatment and the demographic variables allow us to test whether heterogeneous treatment effects are present, while the three-way interaction among the demographic variable, the sample, and the treatment allows us to test whether any such heterogeneity differs across the two samples. In all cases, these interaction terms were insignificant (the p-values on the terms range from 0.28 to 0.78). The full ordered probit results are presented in the appendix.
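As a loose sketch of one of the pooled ordered probits described in footnote 27 (here, the gender comparison), the fragment below uses statsmodels’ OrderedModel as a stand-in for whatever software was actually used; the file and column names are hypothetical assumptions.

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Assumed pooled MTurk + GSS data with hypothetical columns:
#   spending (0 = too much, 1 = about right, 2 = too little)
#   mturk (1 = MTurk sample), treat (1 = "welfare" form), female (1 = female)
df = pd.read_csv("pooled_welfare.csv")

df["mturk_x_treat"] = df["mturk"] * df["treat"]
df["treat_x_female"] = df["treat"] * df["female"]
df["mturk_x_female"] = df["mturk"] * df["female"]
df["three_way"] = df["mturk"] * df["treat"] * df["female"]

exog = df[["mturk", "treat", "female",
           "mturk_x_treat", "treat_x_female", "mturk_x_female", "three_way"]]

# Ordered probit; OrderedModel estimates cutpoints, so no constant is included.
model = OrderedModel(df["spending"], exog, distr="probit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())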


Experiment 2: Asian Disease Problem

On a separate MTurk survey, we also replicated (N=450) a classic framing experiment — the “Asian Disease Problem” reported in Tversky and Kahneman (1981).28 This experiment was first conducted using a student sample, but it has been replicated in many other settings (e.g., Bless et al. 1998; Druckman 2001; Jou et al. 1996; Takemura 1994; Kuhberger 1995). All respondents were initially given the following scenario:

Imagine that your country is preparing for the outbreak of an unusual disease, which is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates of the consequences of the programs are as follows:

They were then randomly assigned to one of the two following conditions:

Condition 1, Lives Saved: If Program A is adopted, 200 people will be saved. If Program B is adopted, there is 1/3 probability that 600 people will be saved, and 2/3 probability that no people will be saved.

Condition 2, Lives Lost: If Program A is adopted, 400 people will die. If Program B is adopted, there is 1/3 probability that nobody will die, and 2/3 probability that 600 people will die.

In each of these conditions, respondents are asked to choose one of two policy options. The first is a program with certain consequences. The second is a program that comes with risks — the outcome is probabilistic. These scenarios are exactly the same in their description of the expected consequences of each program, but they differ in framing. In Condition 1, both the certain program and the risky program are described in terms of the likelihood of positive outcomes, namely the lives saved by the programs. In Condition 2, by contrast, the two programs are described in terms of the likelihood of negative outcomes — the lives lost under the different options. Tversky and Kahneman report that, when the problem was framed in terms of “lives saved,” respondents were more likely to pick the certain choice, while when it was framed in terms of “lives lost,” respondents were more likely to pick the risky choice. Framing the outcomes in positive terms therefore produced a reversal of participants’ preferences for the two programs compared to when the outcomes were presented in negative terms.

28. This survey was fielded in January 2010. The HIT was described as follows: Title: Answer a survey about current affairs and your beliefs. Description: “Answer a survey about current affairs and your beliefs. Should take less than 5 minutes.” Paolacci et al. (2010) also report a replication of this experiment.


In the original Tversky and Kahneman experiment, which was run with student samples, 72 percent of respondents picked the certain choice in the “lives saved” condition, as compared to 22 percent who picked the certain choice in the “lives lost” condition. We find a similar pattern among our MTurk sample: 74 percent pick the certain choice in the “lives saved” condition, and 38 percent select the certain choice in the “lives lost” condition (p-value of test of statistical significance of difference < .001). Thus, while the gap is smaller than in the student sample, we still observe the large preference reversal reported in Tversky and Kahneman (and replicated in subsequent experiments with different samples).

Experiment 3: Framing and Risk

The first two studies we replicated were both straightforward experiments with large effect sizes. To demonstrate the utility of MTurk as a tool for subject recruitment for more complex political science experiments, we also replicated a third study, conducted by Kam and Simas (2010), that uses a Knowledge Networks probability sample and was published in the Journal of Politics.29 Kam and Simas used a modified version of the Tversky and Kahneman framing experiment and an original scale measuring individual differences in risk orientation to explore how both the frame of a problem and variation in risk proclivities affect choice under uncertainty. They first use a between-subjects design to test framing effects. As in the original Tversky and Kahneman framing study, the frame (lives lost vs. lives saved) influenced the choices respondents made. Moreover, as predicted, individuals who scored high on the risk orientation scale were more likely to choose the risky policy option, regardless of the frame.30

29. We chose to replicate the Kam and Simas study because it is an excellent example of the way in which contemporary political scientists use experimentation to understand key political dynamics. That study examines the importance both of framing and of the relationship between framing and underlying preferences for risk aversion (i.e., heterogeneity in treatment effects).
30. Kam and Simas also employed a within-subjects design to show that high levels on the risk acceptance scale reduce susceptibility to framing effects across successive framing scenarios. We replicated these results as well (see online appendix).


For this replication, the experimental materials were taken verbatim from Kam and Simas (2010). In Table 6, we present the results for our MTurk sample of the Kam and Simas between-subjects experiment alongside the original Kam and Simas results from Table 2 of their paper (the analyses from their Tables 3 and 4 are presented in the online appendix). The dependent variable in this analysis is a binary variable measuring preference for the risky policy option over the certain policy option.

The similarities between the experimental results are striking. The “lives lost” mortality frame increases support for the risky policy option — the probabilistic outcome — in both experiments and across all three model specifications. Moreover, the coefficients on the effect of the frame are very similar between the original study and our replication (1.07 in the original article compared to 1.18 in the MTurk sample). Additionally, higher levels of risk acceptance are associated with greater support for the risky policy option in both experiments, and the coefficients are also similar across all three specifications.31 Finally, in the third specification for each sample, like Kam and Simas, we find a statistically insignificant coefficient on the interaction between risk acceptance and exposure to the mortality frame (the coefficient is in both cases insignificant by a wide margin, although we find a negative sign on the interaction while Kam and Simas find a positive sign). Overall, although our estimate of the predictive power of risk acceptance is modestly larger than in the original paper, the basic pattern of effects is the same.

Habitual Participants Analysis

A final source of concern about external validity for any self-selected sample is the potential for “habitual customers.” If the same subjects take several surveys, there is the potential for cross-experiment stimuli contamination (Transue et al. 2009). To assess the severity of this problem, we asked our respondents how many political surveys they had taken in the last month on MTurk. The mean was 1.7. Thirty-nine percent of the respondents took no other survey, while 78 percent took two or fewer surveys.

31. We also find a somewhat different pattern of signs for the coefficients on the control variables, notably education and income. However, both in our analysis and in the original Kam and Simas paper, the coefficients on these variables fall short of statistical significance by a wide margin.


We also assessed the prevalence of habitual participants by examining a broad range of experiments run on MTurk. We gathered the unique MTurk ID number for all workers who participated in each of seven studies we conducted from January 2010 to April 2010. The compensation for these studies ranged from 10 cents to 50 cents, and the N ranged from 200 to 587. Across the seven experiments, there were a total of 1,574 unique subjects. Of these subjects, 70 percent participated in only one experiment; another 18 percent participated in two experiments. Only two percent of the subjects participated in 5 or more experiments. While this set of experiments represents only a small proportion of those conducted on MTurk, our findings may illuminate broader trends. Although there are certainly a handful of respondents who participate habitually in experiments, the majority of MTurkers are not chronic study participants. Furthermore, the presence of these habitual responders does not seem to pose a threat to our inferences. In the experiments presented above, we found that the effects did not differ — in either a statistical or a substantive sense — when we examined the habitual respondents and the non-habitual respondents separately.32 All told, our results, combined with other replications of well-known experiments in other fields by other scholars (Horton, Rand, and Zeckhauser 2010), provide further support for the external validity of MTurk as an experimental platform.33
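The overlap tabulation described above can be reproduced with a few lines of code. The sketch below assumes each study’s MTurk results were exported to a file containing a WorkerId column; the file names are hypothetical placeholders, not the study files themselves.

import pandas as pd

# One exported results file per study; names are placeholders.
files = [f"study_{i}.csv" for i in range(1, 8)]

worker_ids = pd.concat(pd.read_csv(f)["WorkerId"] for f in files)
per_worker = worker_ids.value_counts()          # studies completed per unique worker

print("unique workers:", per_worker.size)
print("share in exactly one study:", (per_worker == 1).mean())
print("share in two studies:", (per_worker == 2).mean())
print("share in five or more studies:", (per_worker >= 5).mean())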

32. We conducted the habitual responder analysis for the welfare and Asian disease experiments (see online appendix). We do not perform this analysis for the Kam and Simas study because it was conducted a year after our data on frequent participants were collected.
33. Lawson et al. (2010) successfully replicate Ballew and Todorov’s (2008) ratings of 2006 Senate candidate faces on MTurk. Horton, Rand, and Zeckhauser (2010) replicate several experimental findings in economics. Gabriele Paolacci’s Experimental Turk blog (http://experimentalturk.wordpress.com/) has collected reports of successful replications of several canonical experiments from a diverse group of researchers, including the Asian Disease Problem discussed in this section and other examples from psychology and behavioral economics.


Assessing Threats to Internal Validity

In the previous section, we noted that estimates of experimental treatment effects are similar for habitual and one-time participants. This result addresses both external and internal validity concerns. To further explore the internal validity of experiments conducted using MTurk samples, we also examined whether MTurk users appear to violate treatment assignment and their engagement with experimental stimuli.

Do MTurk Workers Violate Assignment by Participating in Experiments Multiple Times?

We sought to assess whether a given respondent took our survey more than once. By default, each HIT (survey) can be completed only once by a given worker. However, an individual could potentially subvert this process by opening multiple MTurk accounts (though this behavior would violate the terms of the MTurk user agreement). They could then take the survey once from each account, which might expose them to more than one treatment condition. Given the relatively low pay rate of our studies and the availability of other paid work, we do not believe our work is likely to encourage such behavior. Nevertheless, we did check to see if multiple responses came from a single IP address. We found that a total of 7 IP addresses produced two responses each to our demographic survey (i.e., 14 of 587 responses, or 2.4 percent of the total). This pattern is not necessarily evidence of repeat survey taking. It could, for example, be the case that these IP addresses were assigned dynamically to different users at different points in time, or that multiple people took the survey from the same large company, home, or even coffee shop. But even if these are cases of repeat survey taking, only a handful of responses would be contaminated, suggesting that repeat survey taking is not a large problem in the MTurk subject pool.34

Attention, Demand, and Subject Motivation

Given their incentives, MTurk respondents may generally pay greater attention to experimental instruments and survey questions than do other subjects. Since Requesters often specify at least a 95 percent prior “approval rate” — that is, previous Requesters accepted 95 percent or more of the HITs submitted by an individual — respondents have an incentive to read instructions carefully and consider their responses.

Our experiences are consistent with this expectation. In a study conducted by one of the authors, subjects were asked to identify the political office held by a person mentioned in a story they had just read.

34. Researchers can reject and block future work by suspected retakers, or simply exclude duplicate work from their analysis by selecting only the first observation from a given IP address.
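The duplicate-IP screen described above and in footnote 34 amounts to flagging repeated IP addresses and keeping the first response from each. A minimal sketch, assuming the external survey software exported one row per response with (hypothetical) ip and start_time columns:

import pandas as pd

responses = pd.read_csv("survey_responses.csv").sort_values("start_time")

# How many responses share an IP address with an earlier response?
n_possible_retakes = responses["ip"].duplicated().sum()
print(f"{n_possible_retakes} responses came from an already-seen IP address")

# Conservative fix: keep only the first response from each IP address.
deduped = responses.drop_duplicates(subset="ip", keep="first")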


The format of this question was a multiple-choice item with five possible responses. On the MTurk study, 60 percent of the respondents answered the question correctly. An identical question concerning the same article was also included on experiments run through Polimetrix/YouGov, another high-quality Internet panel, and with a sample collected by Survey Sampling International (SSI). The correct answer rates on these platforms were markedly lower than in the MTurk sample — 49 percent on Polimetrix/YouGov and 46 percent on SSI.

While a concern for pleasing the researcher has benefits, it may also have costs. MTurk respondents may pay close attention to experimental stimuli, but they may also exhibit experimental demand characteristics to a greater degree than do respondents in other subject pools, divining the experimenter’s intent and behaving accordingly (Orne 1962; Sears 1986). To avoid this problem and the resulting internal validity concerns, it may be desirable to avoid signaling to subjects ahead of time the particular aims of the experiment.35 Demand concerns are relevant to any experimental research, but future work needs to be done to explore whether these concerns are especially serious with respect to the MTurk respondent pool and how they are affected by recruitment and consent text.

Conclusion

This paper describes the potential advantages and limitations of using Amazon.com’s Mechanical Turk platform as a subject recruitment device for experimental research. We demonstrate that, relative to other convenience samples often used in experimental research in political science, MTurk subjects are often more representative of the general population and substantially less expensive to recruit. MTurk subjects appear to respond to experimental stimuli in a manner consistent with prior research. They are apparently also not currently an excessively overused pool, and habitual responding appears to be a minor concern. Put simply, despite possible self-selection concerns, the MTurk subject pool is no worse than convenience samples used by other researchers in political science. The analysis we undertake for the MTurk pool also provides a template for evaluating the desirability of other means of subject recruitment.

35. In the case of experiments involving deception, it is also feasible to debrief at the conclusion of the experiment.


Despite these advantages, several aspects of MTurk should engender caution. In particular, MTurk subjects are notably younger and more ideologically liberal than the public, which may limit their suitability for some research topics. They also appear to pay more attention to tasks than do other respondents. Finally, as use increases, habitual responding may pose more of an external validity problem.

Interest in experimental research has risen substantially in political science, but experiments can be difficult and costly to implement. MTurk potentially provides an important way to overcome the barriers that subject recruitment costs and difficulties pose to conducting research, by providing easy and inexpensive access to non-student adult subjects. Our results provide researchers with a clearer understanding of the potential advantages of the MTurk tool for conducting experiments, as well as areas where caution may be in order.


Table 1: Task Title, Compensation, and Speed of Completion for Selected MTurk Studies

Task Title                                               Date Launched  Number of  Pay per  Mean Mins.   Completions per day
                                                                        Subjects   Subject  per Subject
Answer a survey about current affairs and your beliefs   1/5/2010       490        $0.15    7            116, 64, 41, 40, 27, 36, 15, 11
2-3 minute survey for political science research         3/16/2010      500        $0.25    2            210, 68, 37, 55, 53, 64, 18
4 minute survey for political science research           4/26/2010      500        $0.40    4            298, 105, 79, 18
3 minute survey for political science research           4/29/2010      200        $0.25    1            200
3-4 minute survey for political science research         5/17/2010      150        $0.45    2            150
5-7 minute survey                                        6/28/2010      400        $0.75    5            321, 79
7-9 minute survey                                        6/24/2010      400        $0.75    6            400
5-7 minute survey                                        7/3/2010       400        $0.50    3            256, 115, 29
2-3 minute survey                                        7/16/2010      200        $0.25    3            200

Note: The remaining subjects for the 1/5/2010 study were recruited as follows: Day 10 (11), Day 11 (10), Day 12 (17), Day 13 (22), Day 14 (29), Day 15 (36).

Table 2: Comparing MTurk Sample with Other Convenience Samples Convenience Samples Adult samples (Berinsky and Kinder 2006)

Demographics Female Age (mean years) Education (mean years) White Black

MTurk 60.1%

Student samples

Adult sample

Exp. 1:

Exp. 2:

(Kam et al. 2007)

(Kam et al. 2007)

Ann Arbor, MI

Princeton, NJ

56.7%

75.7%

66.0%

57.1%

42.5

45.3

15.1

14.9

81.4

72.4

4.4 (0.9)

12.9

22.7

41.9

46.1

46.5

20.6

17.6

16.3

25.8

17.0

10.1

(2.1)

(1.3)

(4.1)

32.3

20.3

45.5

(0.5)

(8.2)

(.916)

14.9

--

5.48

(0.1)

--

(1.29)

83.5

42.5

82.2

(1.6)

--

(3.7)

Party Identification Democrat

(0.8) Independent

23.1 (0.6)

Republican

24.9 (0.7)

None/other

10.2 (0.5)

N 587 (Varies) 109 141 163 Note: Percentages except for age and education with standard errors in parentheses. Adult sample from Kam et al. (2007) is for campus employee participants from their Table 1, Column 1. MTurk survey is from March 2010.


Table 3: Comparing MTurk Sample Demographics to Internet and Face-to-Face Samples

                           MTurk             Internet Sample    Face-To-Face Samples
                                             ANESP              CPS 2008          ANES 2008
Female                     60.1% (2.1)       57.6% (0.9)        51.7% (0.2)       54.9% (1.3)
Education (mean years)     14.9 (0.1)        14.5 (0.1)         13.2 (0.0)        13.5 (0.1)
Age (mean years)           32.3 (0.5)        --                 46.0 (0.1)        46.6 (0.5)
Mean Income                $55,332 ($1,659)  $69,043 ($794)     $62,966 ($130)    $62,501 ($1,147)
Median Income              $45,000           $67,500            $55,000           $47,500
Race
  White                    95.0 (1.6)        90.3 (0.7)         87.3 (0.1)        86.8 (0.9)
  Black                    5.0 (0.9)         9.7 (0.5)          12.7 (0.1)        13.2 (0.6)
  Hispanic                 6.7 (1.1)         5.5 (0.4)          13.7 (0.1)        9.1 (0.5)
Marital Status
  Married                  39.0 (2.1)        63.6 (0.9)         55.7 (0.2)        50.1 (1.3)
  Divorced                 7.1 (1.1)         13.6 (0.7)         10.2 (0.1)        12.9 (0.8)
  Separated                2.5 (0.7)         1.5 (0.2)          2.1 (0.1)         2.9 (0.4)
  Never married            50.6 (2.1)        15.9 (0.7)         25.7 (0.2)        24.7 (1.1)
  Widowed                  0.7 (0.4)         5.4 (0.4)          6.3 (0.1)         7.8 (0.6)
Housing Status
  Rent                     52.7 (2.3)        15.1 (0.7)         --                32.4 (1.2)
  Own home                 47.3 (2.3)        84.9 (0.7)         --                66.1 (1.2)
Religion
  None                     41.8 (2.1)        13.3 (1.0)         --                20.1 (1.1)
  Protestant               20.7 (1.7)        36.4 (1.4)         --                30.6 (1.2)
  Catholic                 16.5 (1.6)        24.2 (1.3)         --                19.1 (1.0)
  Jewish                   4.4 (0.9)         3.8 (0.6)          --                1.3 (0.3)
  Other                    16.5 (1.6)        22.3 (1.2)         --                28.9 (1.1)
Region of US
  Northeast                22.4 (1.7)        16.9 (0.7)         18.0 (0.2)        14.6 (0.9)
  Midwest                  26.8 (1.8)        28.3 (0.9)         23.2 (0.2)        21.2 (1.1)
  South                    30.4 (1.9)        31.4 (0.9)         37.1 (0.2)        42.8 (1.2)
  West                     20.4 (1.7)        23.4 (0.8)         21.7 (0.2)        21.4 (0.9)
N                          587               1602               92360             2323

Note: Percentages except for education, age, and income with standard errors in parentheses. CPS 2008 and ANES 2008 are weighted. ANESP redacts age. MTurk survey is from March 2010. Tests of statistical significance of differences across samples appear in the appendix.

Table 4: Comparing MTurk Sample Political and Psychological Measures to Internet and Face-to-Face Samples

                                                MTurk          Internet Sample   Face-To-Face Samples
                                                               ANESP             CPS 2008       ANES 2008
Registration and Turnout
  Registered                                    78.8% (1.7)    92.0% (0.7)       71.0% (0.2)    78.2% (1.1)
  Voter Turnout 2008                            70.2 (1.5)     89.6 (0.6)        63.6 (0.2)     70.4 (1.1)
Party Identification (mean on 7-point scale,
  7 = Strong Republican)                        3.48 (0.09)    3.90 (0.05)       --             3.70 (0.05)
Ideology (mean on 7-point scale,
  7 = Strong conservative)                      3.39 (0.09)    4.30 (0.05)       --             4.24 (0.04)
Political Interest (mean on 5-point scale,
  5 = Extremely interested)                     2.43 (0.04)    2.71 (0.02)       --             2.93 (0.03)
Political knowledge (% Correct)
  Presidential succession after Vice President  70.0 (1.3)     65.2 (2.0)        --             --
  House vote percentage needed to override
    a veto                                      81.3 (1.7)     73.6 (1.3)        --             --
  Number of terms to which an individual can
    be elected president                        96.2 (0.8)     92.8 (0.7)        --             --
  Length of a U.S. Senate term                  45.0 (2.1)     37.5 (1.3)        --             --
  Number of Senators per State                  85.4 (1.5)     73.2 (1.2)        --             --
  Length of a U.S. House term                   50.1 (2.1)     38.9 (1.3)        --             --
  Average                                       71.3           63.5              --             --
Need for Cognition (mean on 0-1 scale)          .625 (0.012)   .607 (0.006)      --             .548 (0.007)
Need to Evaluate (mean on 0-1 scale)            .628 (0.008)   .579 (0.004)      --             .552 (0.005)
N                                               587/699        1602              92360          2323

Note: Means with standard errors in parentheses. CPS 2008 and ANES 2008 are weighted. Political measures are from the March 2010 MTurk survey (N = 587). Need for Cognition and Need to Evaluate are from the May 2011 MTurk survey (N = 699). Tests of statistical significance of differences across samples appear in the appendix.

Table 5: Comparing MTurk Sample Policy Attitudes to Internet and Face-to-Face Samples

                                                     MTurk          Internet Sample   Face-To-Face Samples
                                                                    ANESP             ANES 2008
Favor prescription drug benefit for seniors          63.5% (2.0)    74.8% (1.1)       80.1% (1.5)
Favor universal healthcare                           47.8 (2.1)     41.7 (1.2)        51.0 (1.9)
Favor citizenship process for illegals               38.4 (2.1)     42.7 (1.2)        49.1 (1.9)
Favor a constitutional amendment banning gay
  marriage                                           15.5 (1.5)     30.7 (1.2)        --
Favor raising taxes on people making more than
  $200,000                                           61.4 (2.1)     55.4 (1.2)        --
Favor raising tax on people making less than
  $200,000                                           6.1 (0.6)      7.1 (0.6)         --
N                                                    587            1602              2323

Note: Percentages supporting each policy with standard errors in parentheses. ANES 2008 is weighted. MTurk survey is from March 2010. Tests of statistical significance of differences across samples appear in the appendix.

Table 6: Replication of Kam and Simas (2010) Table 2 — Risk Acceptance and Preference for the Probabilistic Outcome

Model columns: (H1a) Mortality Frame and Risk Acceptance; (H1b) Adding Controls; (H2) Frame x Risk Acceptance.

                                     Kam and Simas (2010)                  MTurk replication
                                     (H1a)      (H1b)      (H2)            (H1a)      (H1b)      (H2)
Mortality Frame in Trial 1           1.068      1.082      1.058           1.180      1.180      1.410
                                     (0.10)     (0.10)     (0.29)          (0.10)     (0.10)     (0.31)
Risk Acceptance                      0.521      0.628      0.507           0.760      0.780      0.990
                                     (0.31)     (0.32)     (0.48)          (0.29)     (0.31)     (0.42)
Female                                          0.105                                 -0.018
                                                (0.10)                                 (0.11)
Age                                             0.262                                 0.110
                                                (0.22)                                 (0.31)
Education                                       -0.214                                0.025
                                                (0.20)                                 (0.23)
Income                                          0.205                                 -0.024
                                                (0.23)                                 (0.23)
Partisan Ideology                               0.038                                 0.006
                                                (0.19)                                 (0.15)
Risk Acceptance x Mortality Frame                          0.023                                  -0.450
                                                           (0.62)                                 (0.58)
Intercept                            -0.706     -0.933     -0.700          -1.060     -1.100     -1.190
                                     (0.155)    (0.259)    (0.227)         (-0.170)   (-0.290)   (-0.230)
lnL                                  -453.185   -450.481   -453.184        -409.740   -409.662   -409.439
p > χ2                               0.000      0.000      0.000           0.000      0.000      0.000
N                                    752        750        752             699        699        699

Note: Table entries are probit coefficient with standard errors in parentheses. Dependent variable is Preference for the Probabilistic Outcome (0 = deterministic outcome; 1 = probabilistic outcome). All independent variables are scaled to range from 0 to 1. MTurk survey is from May 2010. None of differences between coefficients across studies are statistically significant (see the online appendix).

Figure 1: Comparing MTurk Sample on Selected Demographic Measures to Face-to-Face Samples

Note: CPS 2008 and ANES Face To Face 2008 are weighted. Tests of statistical significance of differences across samples appear in the appendix.


Figure 2: Comparing MTurk Sample Selected Demographics to Face-to-Face Samples

Note: ANES Face To Face 2008 is weighted. Tests of statistical significance of differences across samples appear in the appendix.


Cites

Ballew, Charles C. II, and Alexander Todorov. 2007. "Predicting Political Elections from Rapid and Unreflective Face Judgments." Proceedings of the National Academy of Sciences of the United States of America 104(46): 17948-53.

Berinsky, Adam J., and Donald R. Kinder. 2006. "Making Sense of Issues Through Media Frames: Understanding the Kosovo Crisis." Journal of Politics 68(3): 640-656.

Bless, Herbert, Tilmann Betsch, and Axel Franzen. 1998. "Framing the Framing Effect: The Impact of Context Cues on Solutions to the 'Asian disease' Problem." European Journal of Social Psychology 28(2): 287-291.

Buhrmester, Michael D., Tracy Kwang, and Samuel D. Gosling. 2011. "Amazon's Mechanical Turk: A New Source of Inexpensive, yet High-Quality, Data?" Perspectives on Psychological Science 6(1): 3-5.

Chandler, Dana, and Adam Kapelner. 2010. "Breaking Monotony with Meaning: Motivation in Crowdsourcing Markets." University of Chicago mimeo.

Chen, Daniel L., and John J. Horton. 2010. "The Wages of Pay Cuts: Evidence from a Field Experiment." Harvard University mimeo.

Chong, Dennis, and James N. Druckman. 2010. "Dynamic Public Opinion: Communication Effects over Time." American Political Science Review 104(4): 663-680.

Downs, Julie S., Mandy B. Holbrook, Steve Sheng, and Lorrie Faith Cranor. 2010. "Are Your Participants Gaming the System? Screening Mechanical Turk Workers." In "CHI 2010: 1001 Users," 2399-2402.

Druckman, James N. 2001. "Evaluating Framing Effects." Journal of Economic Psychology 22(1): 91-101.

Druckman, James N., and Cindy D. Kam. 2011. "Students as Experimental Participants: A Defense of the 'Narrow Data Base.'" In James N. Druckman, Donald P. Green, James H. Kuklinski, and Arthur Lupia, eds., Handbook of Experimental Political Science.

Gerber, Alan S., James G. Gimpel, Donald P. Green, and Daron R. Shaw. 2011. "How Large and Long-lasting Are the Persuasive Effects of Televised Campaign Ads? Results from a Randomized Field Experiment." American Political Science Review 105(1): 135-150.

Gosling, Samuel D., Simine Vazire, Sanjay Srivastava, and Oliver P. John. 2004. "Should We Trust Web-Based Studies?" American Psychologist 59(2): 93-104.

Green, Donald P., and Holger L. Kern. 2010. "Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees." Yale University mimeo.

Horton, John J., and Lydia Chilton. 2010. "The Labor Economics of Paid Crowdsourcing." Proceedings of the 11th ACM Conference on Electronic Commerce (forthcoming).

Horton, John J., David G. Rand, and Richard J. Zeckhauser. 2010. "The Online Laboratory: Conducting Experiments in a Real Labor Market." Available at SSRN: http://ssrn.com/abstract=1591202

Jou, Jerwen, James Shanteau, and Richard Harris. 1996. "An Information Processing View of Framing Effects: The Role of Causal Schemas in Decision Making." Memory & Cognition 24(1): 1-15.

Kam, Cindy D., and Elizabeth N. Simas. 2010. "Risk Orientations and Policy Frames." Journal of Politics 72(2): 381-396.

Kam, Cindy D., Jennifer R. Wilking, and Elizabeth J. Zechmeister. 2007. "Beyond the 'Narrow Data Base': Another Convenience Sample for Experimental Research." Political Behavior 29(4): 415-440.

Kittur, Aniket, Ed H. Chi, and Bongwon Suh. 2008. "Crowdsourcing User Studies with Mechanical Turk." In "CHI 2008 Proceedings: Data Collection," 453-456.

Kuhberger, Anton. 1995. "The Framing of Decisions: A New Look at Old Problems." Organizational Behavior & Human Decision Processes 62(2): 230-240.

Lawson, Chappell, Gabriel S. Lenz, Mike Myers, and Andy Baker. 2010. "Looking Like a Winner: Candidate Appearance and Electoral Success in New Democracies." World Politics 62(4): 561-93.

Mason, Winter, and Duncan J. Watts. 2009. "Financial Incentives and the Performance of Crowds." In "Proceedings of the ACM SIGKDD Workshop on Human Computation," 77-85.

Orne, M. T. 1962. "On the Social Psychology of the Psychological Experiment: With Particular Reference to Demand Characteristics and Their Implications." American Psychologist 17(11): 776-83.

Paolacci, Gabriele, Jesse Chandler, and Panagiotis G. Ipeirotis. 2010. "Running Experiments on Amazon Mechanical Turk." Judgment and Decision Making 5(5): 411-19.

Rasinski, Kenneth A. 1989. "The Effect of Question Wording on Public Support for Government Spending." Public Opinion Quarterly 53(3): 388-394.

Ross, Joel, Lily Irani, M. Six Silberman, Andrew Zaldivar, and Bill Tomlinson. 2010. "Who Are the Crowdworkers? Shifting Demographics in Amazon Mechanical Turk." In "CHI EA 2010," 2863-2872.

Sears, David O. 1986. "College Sophomores in the Laboratory: Influences of a Narrow Data Base on Social Psychology's View of Human Nature." Journal of Personality and Social Psychology 51(3): 515-530.

Sheng, Victor S., Foster Provost, and Panagiotis G. Ipeirotis. 2008. "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers." In "Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining," 614-622.

Snow, Rion, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. "Cheap and Fast — But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks." In "EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing," 254-263.

Sorokin, Alexander, and David Forsyth. 2008. "Utility Data Annotation with Amazon Mechanical Turk." University of Illinois at Urbana-Champaign mimeo.

Takemura, Kazuhisa. 1994. "Influence of Elaboration on the Framing of Decision." Journal of Psychology 128(1): 33-39.

Transue, John E., Daniel J. Lee, and John H. Aldrich. 2009. "Treatment Spillover Effects across Survey Experiments." Political Analysis 17(2): 143-161.

Tversky, Amos, and Daniel Kahneman. 1981. "The Framing of Decisions and the Psychology of Choice." Science 211(4481): 453-458.

Online Appendix

Significance Tests between MTurk and the Other Samples
Significance Tests between CPS and ANES Face-To-Face (Placebo)
Replication Details: GSS Welfare Experiment Wording
Tests for Heterogeneous Treatment Effects in the Welfare Question Wording Experiment
Replication Details: Kam and Simas (2010) Question Wording
Comparison of Habitual and Non-Habitual Response
MTurk Worker View of Posted HIT
Sample Python Script for Paying Worker Bonuses
Instructions and Code to Recontact MTurk Respondents for Panel Studies
Perl Code For Recontacting Workers
Listing of Published Experimental Studies and Subject Recruitment

Significance Tests between MTurk and the Other Samples

Significance Tests for Table 3: Comparing MTurk Sample Demographics to Internet and Face-to-Face Samples

                     Statistic                Internet Sample    Face-To-Face Samples
                                              ANESP              CPS 2008          ANES 2008
Female               Dif. in prop. (p-value)  -.0250 (.2751)     .0840 (.0001)     .0509 (.0305)
Education (years)    D-stat (p-value)         .2648 (.0000)      .3361 (.0000)     .3741 (.0000)
Age (years)          D-stat (p-value)         .5044 (.0000)      .3916 (.0000)     .3899 (.0000)
Mean Income          D-stat (p-value)         .1594 (.0000)      .1235 (.0000)     .0925 (.0000)
Race
  White              Dif. in prop. (p-value)  .0471 (.0008)      .0769 (.0000)     .0821 (.0000)
  Black              Dif. in prop. (p-value)  -.0471 (.0008)     -.0769 (.0000)    -.0821 (.0000)
  Hispanic           Dif. in prop. (p-value)  .0177 (.0856)      -.0696 (.0000)    -.0236 (.0761)
Marital Status
  Married            Dif. in prop. (p-value)  -.2456 (.0000)     -.1671 (.0000)    -.1107 (.0000)
  Divorced           Dif. in prop. (p-value)  -.0651 (.0000)     -.0210 (.0000)    -.0485 (.0016)
  Separated          Dif. in prop. (p-value)  .0107 (.0722)      .0040 (.5178)     -.0038 (.6299)
  Never              Dif. in prop. (p-value)  .3470 (.0000)      .2497 (.0000)     .2440 (.0000)
  Widowed            Dif. in prop. (p-value)  -.0471 (.0000)     -.0556 (0.000)    -.0711 (.0000)
Housing Status
  Own home           Dif. in prop. (p-value)  .3769 (.0000)      --                -.1987 (.0000)
Religion
  None               Dif. in prop. (p-value)  .2874 (.0000)      --                .1487 (.0000)
  Protestant         Dif. in prop. (p-value)  -.1802 (.0000)     --                -.0745 (.0004)
  Catholic           Dif. in prop. (p-value)  -.0639 (.0009)     --                -.0095 (.5962)
  Jewish             Dif. in prop. (p-value)  .0132 (.1074)      --                .0314 (.0000)
  Other              Dif. in prop. (p-value)  -.0565 (.0029)     --                -.0961 (.0000)
Region of US
  Northeast          Dif. in prop. (p-value)  .0522 (.0031)      .0370 (.0255)     .0755 (.0000)
  Midwest            Dif. in prop. (p-value)  -.0172 (.4081)     .0468 (.0081)     .0540 (.0061)
  South              Dif. in prop. (p-value)  -.0052 (.8086)     -.0568 (.0058)    -.1198 (.0000)
  West               Dif. in prop. (p-value)  -.0298 (.1259)     -.0269 (.1353)    -.0097 (.6166)

Notes: For proportions, the table shows the difference in proportion between MTurk and the relevant sample with p-values in parentheses. For other variables, the table shows the D-statistic from KS tests with p-values in parentheses. Nonsignificant differences (p > 0.10) are bolded.
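For readers who wish to run comparable tests on their own samples, the sketch below shows one way to compute a difference in proportions and a Kolmogorov-Smirnov D-statistic in Python. It is an illustration under assumed variable names, not the code used to produce the tables in this appendix.

    # Illustrative only: two-sample tests like those reported above,
    # using hypothetical arrays of MTurk and comparison-sample responses.
    import numpy as np
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest

    # Difference in proportions (e.g., percent female) with a z-test p-value.
    female_mturk = np.array([1, 0, 1, 1, 0, 1])        # hypothetical 0/1 indicators
    female_anes = np.array([0, 1, 0, 1, 0, 0, 1, 1])   # hypothetical 0/1 indicators
    counts = np.array([female_mturk.sum(), female_anes.sum()])
    nobs = np.array([len(female_mturk), len(female_anes)])
    diff = female_mturk.mean() - female_anes.mean()
    zstat, pval = proportions_ztest(counts, nobs)
    print("Dif. in prop. = %.4f (p = %.4f)" % (diff, pval))

    # Kolmogorov-Smirnov D-statistic for a continuous variable (e.g., age).
    age_mturk = np.array([22, 25, 31, 40, 28, 35])      # hypothetical
    age_anes = np.array([45, 52, 38, 60, 47, 33, 55])   # hypothetical
    ks = stats.ks_2samp(age_mturk, age_anes)
    print("D-stat = %.4f (p = %.4f)" % (ks.statistic, ks.pvalue))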

Significance Tests for Table 4: Comparing MTurk Sample Political and Psychological Measures to Internet and Face-to-Face Samples

                                            Statistic                Internet Sample   Face-To-Face Samples
                                                                     ANESP             CPS 2008        ANES 2008
Registration and Turnout
  Registered                                Dif. in prop. (p-value)  .1324 (0.000)     .0655 (.0180)   .0141 (1.000)
  Voter Turnout 2008                        Dif. in prop. (p-value)  .1919 (0.000)     .0571 (0.059)   .0181 (0.999)
Party Identification (7-point scale,
  7 = Strong Republican)                    D-stat (p-value)         .1427 (0.000)     --              .1007 (0.001)
Ideology (7-point scale,
  7 = Strong conservative)                  D-stat (p-value)         .2588 (0.000)     --              .2946 (0.000)
Political Interest (5-point scale,
  5 = Extremely interested)                 D-stat (p-value)         .1072 (0.000)     --              .2044 (0.000)
Political knowledge (% Correct)
  Presidential succession after Vice
    President                               Dif. in prop. (p-value)  .0485 (0.338)     --              --
  House vote percentage needed to
    override a veto                         Dif. in prop. (p-value)  .0768 (0.024)     --              --
  # of terms to which an ind. can be
    elected president                       Dif. in prop. (p-value)  .0342 (0.737)     --              --
  Length of a U.S. Senate term              Dif. in prop. (p-value)  .0749 (0.022)     --              --
  Number of Senators per State              Dif. in prop. (p-value)  .1211 (0.000)     --              --
  Length of a U.S. House term               Dif. in prop. (p-value)  .1121 (0.000)     --              --
Need for Cognition (0-1 scale)              D-stat (p-value)         .0707 (0.005)     --              .1600 (0.000)
Need to Evaluate (0-1 scale)                D-stat (p-value)         .1284 (0.000)     --              .2177 (0.000)

Notes: For proportions, the table shows the difference in proportion between MTurk and the relevant sample with p-values in parentheses. For other variables, the table shows the D-statistic from KS tests with p-values in parentheses. Nonsignificant differences (p > 0.10) are bolded.

Significance Tests for Table 5: Comparing MTurk Sample Policy Attitudes to Internet and Face-to-Face Samples

                                                   Statistic                Internet Sample   Face-To-Face Samples
                                                                            ANESP             ANES 2008
Favor prescription drug benefit for seniors        Dif. in prop. (p-value)  .1126 (0.000)     .1978 (0.000)
Favor universal healthcare                         Dif. in prop. (p-value)  .0601 (0.102)     .0720 (0.042)
Favor citizenship process for illegals             Dif. in prop. (p-value)  .0456 (0.360)     .1504 (0.000)
Favor a constitutional amendment banning gay
  marriage                                         Dif. in prop. (p-value)  .2080 (0.000)     --
Favor raising taxes on people making more than
  $200,000                                         Dif. in prop. (p-value)  .0578 (0.128)     --
Favor raising tax on people making less than
  $200,000                                         Dif. in prop. (p-value)  .0142 (1.000)     --

Notes: The table shows the difference in proportion between MTurk and the relevant sample with p-values in parentheses. Nonsignificant differences (p > 0.10) are bolded.

Comparing ANES Face-to-Face Political and Psychological Measures to CPS Sample

Registration and Turnout
  Registered             Dif. in prop. (p-value)   .0514 (0.000)
  Voter Turnout 2008     Dif. in prop. (p-value)   .0390 (0.002)

Notes: Tests are the difference in proportion between ANES Face-To-Face and the CPS with p-values in parentheses. Nonsignificant differences (p > 0.10) are bolded.

Replication Details: GSS Welfare Experiment Wording

We are faced with many problems in this country, none of which can be solved easily or inexpensively. I'm going to name one of these problems, and I'd like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount.

[Random assignment to either:]
  Welfare
  Assistance to the poor

Response options: Too little, About right, Too much, Don't know, No answer.1

1. We also included another wording in a third condition: "Caring for the poor."

Tests for Heterogeneous Treatment Effects in the Welfare Question Wording Experiment

Test by Gender
Variable                     Coefficient (SE)
Female                       -0.452 (0.19)
GSS                          -0.484 (0.16)
"Welfare"                    0.797 (0.20)
GSS*Welfare                  0.291 (0.21)
GSS*Female                   0.223 (0.22)
Female*Welfare               0.278 (0.26)
GSS*Welfare*Female           -0.223 (0.28)
µ1                           -0.178 (0.14)
µ2                           0.714 (0.14)
LL                           543.88
p > χ2                       0.00
N                            2305

Note: The omitted categories are: Male (for the "Female" variable), MTurk (for the "GSS" variable), and "Spending on the poor" wording (for "Welfare").

Test by Education
Variable                     Coefficient (SE)
Some College+                -0.151 (0.32)
GSS                          -0.707 (0.31)
"Welfare"                    0.834 (0.42)
GSS*Welfare                  0.446 (0.42)
GSS*Some College+            0.481 (0.33)
Some College+*Welfare        0.132 (0.44)
GSS*Welfare*Some College+    -0.344 (0.45)
µ1                           -0.057 (0.30)
µ2                           0.836 (0.30)
LL                           544.53
p > χ2                       0.00
N                            2300

Note: The omitted categories are: High school graduate or less (for the "Some College+" variable), MTurk (for the "GSS" variable), and "Spending on the poor" wording (for "Welfare").

Test by Race
Variable                     Coefficient (SE)
Black                        -0.812 (0.47)
GSS                          -0.340 (0.11)
"Welfare"                    0.960 (0.13)
GSS*Welfare                  0.214 (0.14)
GSS*Black                    0.223 (0.49)
Black*Welfare                -0.355 (0.67)
GSS*Welfare*Black            0.352 (0.69)
µ1                           0.036 (0.10)
µ2                           0.941 (0.10)
LL                           591.88
p > χ2                       0.00
N                            2305

Note: The omitted categories are: All other races (for the "Black" variable), MTurk (for the "GSS" variable), and "Spending on the poor" wording (for "Welfare").

Replication Details: Kam and Simas (2010) Question Wording The information below is from the following website: http://www.tessexperiments.org/data/kam593.html TESS DHS 01 - Kam December 2007 - Study Details Note: This page may be removed when the questionnaire is sent to the client. However, it must exist in the version sent to TOST. SNO Survey Name Client Name Great Plains Project Number Project Director Name Team/Area Name

11388 TESS DHS 01 – Kam University of Pennsylvania / TESS K1721 Poom Nukulkij SPQR

Samvar (Include name, type and response values. “None” means none. Blank means standard demos. This must match SurveyMan.)

Standard demos, XPARTY7 (1 Strong Republican; 2 Not Strong Republican; 3 Leans Republican; 4 Undecided/Independent/Other; 5 Leans Democrat; 6 Not Strong Democrat; 7 Strong Democrat; 9 Missing), XIDEO (1 Extremely liberal; 2 Liberal; 3 Slightly liberal; 4 Moderate, middle of the road; 5 Slightly conservative; 6 Conservative; 7 Extremely conservative; 9 Missing).

Specified Pre-coding Required Timing Template Required (y/n) Multi-Media Disposition Information (Used to create Toplines: Provide exact definitions of base(s), referencing question numbers and responses defining the group(s) for which Toplines are desired) Important:

Do not change Question numbers after Version 1; to add a new question, use alpha characters (e.g., 3a, 3b, 3c.) Changing question numbers will cause delays and potentially errors in the program.


TESS DHS 01 - Kam December 2007 - Questionnaire ATTITUDES TOWARDS RISK AND THE FRAMING OF BIOTERRORIST PREVENTION PI: CINDY KAM E-MAIL: [email protected]

STUDY DESIGN: 2 CONDITIONS (MORTALITY FRAME THEN SURVIVAL FRAME OR SURVIVAL FRAME THEN MORTALITY FRAME). RESPONDENTS WILL BE RANDOMLY ASSIGNED TO RECEIVE THE MORTALITY FRAME (FRAME M1) OR THE SURVIVAL FRAME (FRAME S1) FIRST, THEN THE OTHER SECOND.



SAMPLE SPECS: RANDOM SAMPLE OF THE U.S. ADULT POPULATION WITH SAMPLE N=660, 12 QUESTIONS FOR EACH SUBJECT, FOR A TOTAL OF 660 X 12 = 7920 RESPONDENT-QUESTIONS

[GRID - SP] Q1. Some people say you should be cautious about making major changes in life. Suppose these people are located at 1. Others say that you will never achieve much in life unless you act boldly. Suppose these people are located at 7. And others have views in between. Where would you place yourself on this scale? 1 You should be cautious about making major changes in life

2

3

4

5

6

7 You will never achieve much in life unless you act boldly

[GRID - SP] Q2. Suppose you were betting on horses and were a big winner in the third or fourth race. Would you be more likely to continue betting on additional races or take your winnings and stop? Definitely Continue Playing

Probably Continue Playing

Not sure


Probably Take My Winnings

Definitely Take My Winnings

FOR Q3-Q6, SHOW EACH STATEMENT CENTERED IN YELLOW . [SP] Q3. Please rate your level of agreement or disagreement with the following statement: I would like to explore strange places.

[SP] Q4. Please rate your level of agreement or disagreement with the following statement: I like to do frightening things.

[SP] Q5. Please rate your level of agreement or disagreement with the following statement: I like new and exciting experiences, even if I have to break the rules.

[SP] Q6. Please rate your level of agreement or disagreement with the following statement: I prefer friends who are exciting and unpredictable.


[GRID - SP] Q7. In general, how easy or difficult is it for you to accept taking risks? Very easy to take risks

Somewhat easy to take risks

Somewhat difficult to take risks

Very difficult to take risks

[FRAMING SCENARIO 1] [DISPLAY] Experts from the Centers for Disease Control (CDC) recently appeared before Congress to discuss the need to take steps to protect Americans from a possible smallpox epidemic. Although some Americans were vaccinated against smallpox in their youth, those vaccinations are now ineffective against the more powerful smallpox strains that exist today. All 300 million Americans are vulnerable to being infected by smallpox, even though the possibility of a bioterrorist attack remains very small. [DISPLAY] CDC experts have proposed two programs to try to minimize the consequences of a smallpox epidemic. They proposed two alternative programs to combat the disease. These programs would fund research, vaccinations, medical treatment facilities, and the training of medical personnel. As an example, they illustrated the effects of the programs in a medium-sized town in the United States. [DISPLAY3] Scientists believe that an initial outbreak of smallpox in a medium-sized town of 60,000 people in the United States would kill 6,000 people. The scientific estimates of the impacts of two programs, A and B, are as follows: PROGRAMMING NOTE: CREATE DATA-ONLY VARIABLE INDICATING WHETHER S1 OR M1 WAS SELECTED. [FRAME S1: RANDOMLY ASSIGNED TO 1⁄2 RESPONDENTS] If program A is adopted, 2000 people will be saved. If program B is adopted, there is a 1 in 3 chance that 6000 people will be saved and a 2 in 3 chance that no people will be saved. [FRAME M1: RANDOMLY ASSIGNED TO 1⁄2 RESPONDENTS] If program A is adopted, 4000 people will die. If program B is adopted, there is a 1 in 3 chance that nobody will die and a 2 in 3 chance that 6000 people will die. FOR Q9, LINK “PROGRAM A” AND “PROGRAM B” RESPONSES TO SCREENS SHOWING THE NUMBERS ABOVE. [SP] Q9. Imagine you were faced with the decision of adopting program A or program B. Which would you select? Program A Program B PROMPT ONCE. 13

SHOW Q10 IF Q9 IS A VALID, NON-REFUSAL RESPONSE. [GRID - SP] Q10. How certain are you of your preference for Program [INSERT SELECTION FROM Q9: A / B]? Very Certain

Somewhat Certain

Somewhat Uncertain

Very Uncertain

[FRAMING SCENARIO 2] [DISPLAY] Another set of CDC experts have proposed two other programs, C and D. These programs would fund research, vaccinations, medical treatment facilities, and the training of medical personnel. Again, they illustrated the effects of the programs in a medium-sized town in the United States. [DISPLAY5] Scientists believe that an initial outbreak of smallpox in a medium-sized town of 60,000 people in the United States would kill 6,000 people. The scientific estimates of the impacts of these two alternative programs, C and D, are as follows: [FRAME M2, IF R RECEIVED S1] If program C is adopted, 4000 people will die. If program D is adopted, there is a 1 in 3 chance that nobody will die, and a 2 in 3 chance that 6000 people will die. [FRAME S2, IF R RECEIVED M1] If program C is adopted, 2000 people will be saved. If program D is adopted, there is a 1 in 3 chance that 6000 people will be saved and a 2 in 3 chance that no people will be saved. FOR Q11, LINK “PROGRAM C” AND “PROGRAM D” RESPONSES TO SCREENS SHOWING THE NUMBERS ABOVE. [SP] Q11. Imagine you were faced with the decision of adopting program C or program D. Which would you select? Program C Program D PROMPT ONCE. SHOW Q12 IF Q11 IS A VALID, NON-REFUSAL RESPONSE. [GRID - SP] 14

Q12. How certain are you of your preference for Program [INSERT SELECTION FROM Q11: C /D]? Very Certain

Somewhat Certain

Somewhat Uncertain

Very Uncertain

[DISPLAY] This survey involved the effect of attitudes towards risk on policy choices. During the survey, you may have been told that policymakers were considering various smallpox prevention policies. These were not actual policies, but hypothetical scenarios designed to assess whether people are sensitive to how policies are described. If you have any questions about this study, you may contact the University of California, Davis Institutional Review Board by calling 916-703-9151. You may also mail them at IRB Administration, CRISP Building, UC Davis, 2921 Stockton Blvd., Ste. 1400, Rm 1429, Sacramento, CA 95817. Thanks for your participation in the survey. INSERT STANDARD CLOSE.


Table 2: Risk Acceptance and Preference for the Probabilistic Outcome, Trial 1 Kam and Simas (2010)

Mortality Frame in Trial 1 Risk Acceptance

(H1a) Mortality Frame and Risk Acceptance 1.068

(H1b) Adding Controls 1.082

(H2) Frame x Risk Acceptance 1.058

(H1a) Mortality Frame and Risk Acceptance 1.180

(H1b) Adding Controls 1.180

(0.10)

(0.10)

(0.29)

(0.10)

0.521

0.628

0.507

0.760

(0.31)

(0.32)

(0.48)

(0.29)

Female

Difference (MTurk minus Kam and Simas)

MTurk replication (H2) Frame x Risk Acceptance 1.410

(H1a) Mortality Frame and Risk Acceptance 0.112

(H1b) Adding Controls 0.098

(H2) Frame x Risk Acceptance 0.352

(0.10)

(0.31)

(0.139)

(0.141)

(0.427)

0.780

0.990

0.239

0.152

0.483

(0.31)

(0.42)

(0.42)

(0.44)

(0.64)

0.105

-0.018

-0.123

(0.10)

(0.11)

(0.148)

Age

0.262

0.110

-0.152

(0.22)

(0.31)

(0.378)

Education

-0.214

0.025

0.239

(0.20)

(0.23)

(0.304)

0.205

-0.024

-0.229

(0.23)

(0.23)

(0.318)

0.038

0.006

-0.032

(0.19)

(0.15)

(0.242)

Income

Partisan Ideology

Risk Acceptance x Mortality Frame

Intercept

lnL p > χ2 N

0.023

-0.450

0.62

(0.58)

-0.706

-0.933

-0.700

-1.060

-1.100

-1.190

-0.354

-0.167

-0.490

(0.155)

(0.259)

(0.227)

(-0.170)

(-0.290)

(-0.230)

(0.230)

(0.389)

(0.323)

-453.185

-450.481

-453.184

-409.740

-409.662

-409.439

0.000

0.000

0.000

0.000

0.000

0.000

752

750

752

699

699

699

Note: Table entry is the probit coefficient with standard error below. Dependent variable is Preference for the Probabilistic Outcome (0 = Policy A; 1 = Policy B). All independent variables are scaled to range from 0 to 1.


Table 3: Risk Acceptance and Preference for the Probabilistic Outcome, Trial 2

Mortality Frame in Trial 2

Kam and Simas (2010) (H1A) Mortality (H1B) Frame and Adding Risk Acceptance Controls 0.202 0.186

(0.214) -0.107

(0.220) 0.071

(0.307) 0.178

0.182 -0.451 (0.234) -511.744 0.082 750

-0.140 -0.470 (-0.270) -464.400 0.000 699

(0.230) -0.019 (0.357)

Education

Index Intercept lNl p > χ2 N

-0.242 (0.143) -515.394 0.017 752

-0.440 (-0.160) -465.338 0.000 699

(0.134) 0.176 (0.409)

(H1B) Adding Controls 0.274

Partisan Ideology

Age

(0.097) 0.650 (0.290) -0.070 (0.100) -0.210 (0.290) 0.130 (0.220) 0.080

(H1A) Mortality Frame and Risk 0.248

Household Income

Female

(0.096) 0.720 (0.270)

Difference (MTurk minus Kam and Simas)

(0.094) 0.691 (0.302) 0.169 (0.094) 0.153 (0.207) -0.016 (0.191) 0.126

Risk Acceptance

(0.093) 0.544 (0.288)

MTurk replication (H1A) Mortality (H1B) Frame and Adding Risk Acceptance Controls 0.450 0.460

-0.198 (0.215)

(0.135) -0.041 (0.419) -0.239 (0.137) -0.363 (0.356) 0.146 (0.291) -0.046

Note: Table entry is the probit coefficient with standard error below. Dependent variable is Preference for the Probabilistic Outcome on the second trial (0 = Policy C; 1 = Policy D). All independent variables are scaled to range from 0 to 1.


Table 4: Risk Acceptance and Preferences Across Two Trials Kam and Simas (2010)

Mortality Frame in Trial 1

Prob. in Both Trials 1.120 (0.20)

Risk Acceptance

Female

Age

Education

Household Income

Partisan Ideology Index

Intercept

lNl p > χ2 N

MTurk replication

Sure Thing to Prob. -0.870

Prob. to Sure Thing 2.229

Prob. in Both Trials 0.910

(0.25)

(0.29)

(0.21)

Sure Thing to Prob. -1.540

Prob. to Sure Thing 2.440

(0.26)

(0.34)

Difference (MTurk minus Kam and Simas) Both Prob

Switched From Prob

Switched to Prob

-0.210

-0.670

0.211

(0.29)

(0.36)

(0.45)

1.713

1.134

0.910

1.830

0.600

0.860

0.117

-0.534

-0.050

(0.64)

(0.77)

(0.75)

(0.62)

(0.65)

(0.75)

(0.89)

(1.00)

(1.06)

0.354

0.291

0.134

-0.090

0.270

0.500

-0.444

-0.021

0.366

(0.20)

(0.23)

(0.24)

(0.21)

(0.23)

(0.26)

(0.29)

(0.33)

(0.35)

0.490

-0.237

0.104

-0.120

-0.670

-0.028

-0.610

-0.433

-0.132

(0.44)

(0.51)

(0.52)

(0.61)

(0.65)

(0.75)

(0.75)

(0.83)

(0.91)

-0.343

0.248

-0.725

0.170

0.120

-0.071

0.513

-0.128

0.654

(0.40)

(0.48)

(0.48)

(0.46)

(0.47)

(0.59)

(0.61)

(0.67)

(0.76)

0.329

-0.832

-0.539

0.069

0.390

0.220

-0.260

1.222

0.759

(0.46)

(0.54)

(0.53)

(0.46)

(0.47)

(0.59)

(0.65)

(0.71)

(0.79)

-0.066

-0.119

0.158

0.081

0.049

-0.130

0.147

0.168

-0.288

(0.38)

(0.45)

(0.46)

(0.29)

(0.32)

(0.36)

(0.47)

(0.55)

(0.58)

-1.444

-0.040

-1.771

-1.430

-0.270

-2.960

0.014

-0.230

-1.189

(0.52)

(0.58)

(0.63)

(0.57)

(0.60)

(0.75)

(0.77)

(0.83)

(0.98)

-920.590

-837.666

0.000

0.000

750

699

Note: Table entry is the multinomial logit coefficient with standard error below. Baseline reference category for the dependent variable is Sure Thing in Both Trials. All independent variables are scaled to range from 0 to 1.


Comparison of Habitual and Non-Habitual Response

Asian Flu:
                      Non-Habitual Participants       Habitual Participants
Outcome               Lives Saved    Lives Lost       Lives Saved    Lives Lost
Certain               70.59%         39.47%           79.01%         36.92%
Risky                 29.41          60.53            20.99          63.08
N                     119            114              81             65

Note: Non-habitual participants: Pearson chi2(1) = 22.8093, Pr = 0.000. Habitual participants: Pearson chi2(1) = 26.6798, Pr = 0.000. The differences between habitual and non-habitual participants are not significant.

Welfare Spending:
                      Non-Habitual Participants       Habitual Participants
                      Poor           Welfare          Poor           Welfare
Amount Too Little     45.71%         15.19%           61.33%         19.51%
N                     70             75               82             79

Note: Non-habitual participants: Pearson chi2(2) = 16.9818, Pr = 0.000. Habitual participants: Pearson chi2(2) = 29.8591, Pr = 0.000. Differences between habitual and non-habitual participants are not significant.
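The Pearson chi-square statistics reported in the notes above can be recovered from the cell counts implied by the percentages and column Ns. The snippet below is a minimal illustration for the non-habitual Asian Flu comparison (counts of 84, 45, 35, and 69 follow from 70.59% of N = 119 and 39.47% of N = 114); it is not the authors' original code.

    # Illustrative only: recompute the Pearson chi-square for the
    # non-habitual "Asian Flu" comparison from the implied cell counts.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[84, 45],   # Outcome certain: lives-saved frame, lives-lost frame
                      [35, 69]])  # Risky option:    lives-saved frame, lives-lost frame

    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print("Pearson chi2(%d) = %.4f, Pr = %.3f" % (dof, chi2, p))  # approx. 22.81, p < 0.001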

MTurk Worker View of Posted HIT Figure S1: Worker Search view for HITs

Note: Workers can browse for HITs and search using a variety of criteria. The listings include the HIT title, requester name, time allotted, reward amount, and number of HITs available (this is the number of times a given worker can complete a given HIT).


Figure S2: Worker Detail View of HIT from Search Screen

Note: From the search screen, workers can click on a HIT name and see more details about the HIT. They can also preview a HIT if they meet the qualification requirements.


Figure S3: Worker Preview of HIT

Note: From this preview page the worker can accept the HIT.


Sample Python Script for Paying Worker Bonuses

# pay_Batch_XXXX.py
#
# Using Amazon Mechanical Turk command line interface, pay bonuses for
# one batch of jobs. Approval done through Mechanical Turk GUI.
#
# This python file writes out a windows batch (.bat) file to pay bonuses
# from the windows command line using Amazon.com's windows command line
# tools.
#
# It assumes the worker entered a pin when they submitted a hit that
# matches to the csv "bonusTable.csv" matching bonus to the pin that the
# worker entered.

import csv

batch = "XXXX"  # Batch to approve.

# Result file downloaded from MTurk Requester Management page. A
# data file with columns listing worker id and the pin entered.
# The pin column is "Answer.answer".
resultFile = "Batch_"+batch+"_result.csv"

# Bonus file created by analyst.
bonusFile = "bonusTable.csv"

# Create batch grant bonus as .bat text file.
# Append with suffix notyet so as not to accidentally run.
out = open("pay_Batch_"+batch+".bat.notyet","w")
out.write("REM\nREM Batch file to pay bonuses to a set of Mechanical Turk Workers.\nREM\n")
out.write("set /p blah = Are you sure you want to proceed?\n")
out.write('cd "C:\\mech-turk-tools-1.3.0\\bin"\n')

# Loop over bonus file and create a dictionary of bonus payments
# matched to worker pin.
bf = open(bonusFile,"rb")
bonusDict = {}
for row in csv.DictReader(bf):
    bonusDict[row['pin']] = round(float(row['bonus']),2)
bf.close()

bonuscounter = float(0)
workercounter = 0

# Loop over result file, write out bonus line for
# each worker to out batch file.
rf = open(resultFile,"rb")
for row in csv.DictReader(rf):
    # Write out bonus payment, if pin matches to a bonus in bonusDict.
    pin = row['Answer.answer']
    if bonusDict.has_key(pin):
        grantLine = "call grantBonus -workerid %s -amount %s -assignment %s -reason %s >> %s.log\n" % \
            (row['WorkerId'], (float(bonusDict[pin])), row['AssignmentId'], '"Good job."', batch)
        out.write(grantLine)
        workercounter += 1
        bonuscounter += float(bonusDict[pin])
        #print "Paying %s to pin %s.\n" % ((float(bonusDict[pin])/100),pin)
    else:
        print "No bonus for pin %s.\n" % pin
rf.close()
out.close()

# Summarize output.
print "Paying %s workers %s dollars (%s dollars with Amazon fees): average %1.1f cents." % \
    (workercounter, bonuscounter, 1.1*bonuscounter, 100*float(bonuscounter/workercounter))

# To execute the resulting batch file, you will need to remove the .notyet suffix and execute it.
# We suggest appending .alldone after paying bonuses so that you do not accidentally pay bonuses twice.
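As the comments above indicate, the script matches each worker's submitted pin to a bonus amount listed in bonusTable.csv, reading the columns named pin and bonus. A file of that form might look like the example below (the pins and amounts here are made up):

    pin,bonus
    48213,0.50
    77120,0.25
    90455,1.00

After the script runs, rename the generated pay_Batch_XXXX.bat.notyet file to drop the .notyet suffix and execute it from the Windows command line, as the closing comments note.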


Instructions and Code to Recontact MTurk Respondents for Panel Studies

Step-By-Step Instructions for Windows

1. Install ActivePerl: http://www.activestate.com/activeperl/downloads
   ActivePerl runs the code to recontact workers. After downloading the installer, run it, and proceed through the install wizard with the default settings, which will install it to C:\Perl. We recommend keeping it there, as it is quick and easy to navigate to in the Command Prompt.

2. Install necessary packages. When the installation finishes, the ActivePerl folder should appear in your Start Menu's Applications folder. In that folder, find the Perl Package Manager (PPM) and open it. You will see a list of Perl "packages." You need to install two packages: "XML-XPath" and "TimeDate." Type the names of these packages, including the hyphens, into the search bar (one at a time), tag each of them for install (using the "Mark for install" button just to the right of the search bar), and then install them by clicking the green arrow button to the right of the search bar. The PPM should take care of everything else; it will download, unpack, and generate the HTML documentation into your Perl folder. This package manager is not user-friendly, so follow the instructions above exactly, and be prepared to spend a few minutes figuring it out.

3. Save the file mt.pl (shown below) to the folder C:\Perl.

4. Edit mt.pl by right-clicking on it and editing it in WordPad or Notepad (or any other text editor). Do not open it by double-clicking it, as that will run the file in Perl. Follow the instructions in mt.pl. Your email to workers is created within this file.

5. Create the list of workers' IDs you want to recontact, save it to a file called id.txt, and place it in C:\Perl. Create the list by first exporting the first-wave MTurk results to an Excel file. In the Excel file, select the variable WorkerID and paste this variable into a separate text file. The file id.txt should contain only a list of worker IDs, each on a separate line (no headers, just the list). At this stage, you may want to drop workers whom you do not want to recontact. (A short script for automating this step is sketched after these instructions.)

6. Open the command prompt: click Start, point to All Programs, point to Accessories, and then click Command Prompt.

7. In the command prompt, navigate to your Perl folder by typing cd C:\Perl (or whatever you named the Perl folder; the name is case-sensitive), and hit Enter. The new line should now say C:\Perl>.

8. Send the e-mails to workers by typing mt.pl in the command prompt and hitting Enter. Perl will take it from there. (Suggested: Test your code by putting only your own ID in id.txt. See below for suggestions on testing.)
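As referenced in step 5, the id.txt file can also be generated with a short script rather than by hand. The sketch below is an illustration only; it assumes the first-wave results were downloaded to a file named Batch_XXXX_result.csv containing a WorkerId column (the same column name the bonus-payment script above reads).

    # Minimal sketch: write each first-wave WorkerId to id.txt, one per line.
    import csv

    with open("Batch_XXXX_result.csv", newline="") as results, open("id.txt", "w") as out:
        for row in csv.DictReader(results):
            worker_id = row["WorkerId"].strip()
            if worker_id:                       # skip blank rows
                out.write(worker_id + "\n")     # no header, just one ID per line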


Tips:

• In the message you add to mt.pl:
  o Include the survey link.
  o Tell respondents that they need to sign in to Mechanical Turk before clicking on the survey link.
• At the end of the second-wave survey, include a link to the MT HIT along with the unique survey ID.
• When creating the HIT:
  o Don't use any restrictions (such as better than 95% accuracy or country of origin). This helps with people who forget to sign in first.
  o Don't include the link to the survey in the HIT (or people not part of the first study will try to take it).
  o State in your HIT that this is part of a follow-up survey and that you will reject work from people who did not take the original survey.

Quick note and warning on Perl (.pl) files: Until a Perl compiler like ActivePerl is installed, anything saved with a .pl extension will not be recognizable by Windows. Once installed, the icon will change to some black lines and stars. Note: you cannot edit these files by double-clicking on .pl files, as that will execute the file; you must right-click and choose Edit. Since Perl files can send email messages or pay bonuses, be careful! Double-clicking on a .pl file inadvertently can cost you money.

Testing your recontact code:
• Before sending the message, test it on yourself by putting only your own worker ID into the id.txt file.
• Finding your worker ID is surprisingly difficult. One approach is to sign in as a worker, take your own HIT, and enter a note to yourself rather than the survey ID. Finally, log out as a worker, log in as a requester, and find your ID in the results for that HIT.
• You should receive the test e-mail you just sent yourself momentarily.
• Since some workers will forget to sign in before clicking on the survey link, it's worth testing the link to your survey without being signed in.

Perl Code For Recontacting Workers

Save code below to a text file called mt.pl.

#!/usr/bin/perl
use strict;
use warnings;

# Use modules
use LWP;
use Digest::HMAC_SHA1 qw(hmac_sha1);
use MIME::Base64;
use XML::XPath;
use Date::Format;

open(PFILE,"id.txt") || die "cannot open idfile $!";

my $i = 0;
while (<PFILE>) {
    chomp;

    ###### MAKE 1ST CHANGE HERE ######
    # Insert (or adapt) the e-mail subject heading and message text below.
    # (Make sure you leave the ";" at the end of the subject or message.
    # The subject and message should be in quotation marks.)
    my $subject = "Take our 3-minute, follow-up MechTurk survey for 50 cents";
    my $message = "Hello, You recently completed the first wave of a survey for us on Mechanical Turk. We have selected you for the next wave of our study, which is a 3 minute survey that pays 50 cents. Here is the survey link: http://www.surveygizmo.com/s/303811/pa-post-survey-1035. At the end of the survey, please enter the survey code shown to you into the Mechanical Turk task, which is included as a link on the last page of the survey (you should sign in to Mechanical Turk before clicking on that link). If you cannot find that Mechanical Turk task, search for keywords: [insert your key words here] We appreciate your help with our research!";

    ###### MAKE 2ND CHANGE HERE ######
    # Sign up for Amazon web services here:
    #   https://aws-portal.amazon.com/gp/aws/developer/registration/index.html
    # Look up your personal requester "Access Key ID" with this link and insert below
    # (replacing "AKIAJQXKZY6O5P52MTMQ"):
    #   http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key
    my $AWS_ACCESS_KEY_ID = "AKIAJQXKZY6O5P52MTMQ";
    # From the same page, look up your secret access key and insert below
    # (replacing "0/kzAXThLe7aA/Cnpf6ZdNoy0wEBNP/MPMX7RPvj").
    my $AWS_SECRET_ACCESS_KEY = "0/kzAXThLe7aA/Cnpf6ZdNoy0wEBNP/MPMX7RPvj";

    ###### THAT'S IT, NO MORE CHANGES ######
    ###### LEAVE BELOW UNCHANGED ######
    my $SERVICE_NAME = "AWSMechanicalTurkRequester";
    my $SERVICE_VERSION = "2008-04-01";

    # Define authentication routines - never change
    sub generate_timestamp {
        my ($t) = @_;
        return time2str('%Y-%m-%dT%H:%M:%SZ', $t, 'GMT');
    }

    sub generate_signature {
        my ($service, $operation, $timestamp, $secret_access_key) = @_;
        my $string_to_encode = $service . $operation . $timestamp;
        my $hmac = hmac_sha1($string_to_encode, $secret_access_key);
        my $signature = encode_base64($hmac);
        chop $signature;
        return $signature;
    }

    # Calculate the request authentication parameters
    my $operation = "NotifyWorkers";
    my $timestamp = generate_timestamp(time);
    my $signature = generate_signature($SERVICE_NAME, $operation, $timestamp, $AWS_SECRET_ACCESS_KEY);

    # This doesn't change, as it looks at each line of the id file you open.
    my $workerid = $_;

    # Construct the request
    my $parameters = {
        Service => $SERVICE_NAME,
        Version => $SERVICE_VERSION,
        AWSAccessKeyId => $AWS_ACCESS_KEY_ID,
        Timestamp => $timestamp,
        Signature => $signature,
        Operation => $operation,
        Subject => $subject,
        MessageText => $message,
        WorkerId => $workerid,
    };

    # Make the request
    my $url = "http://mechanicalturk.amazonaws.com/onca/xml?";
    my $ua = LWP::UserAgent->new;
    my $response = $ua->post($url, $parameters);
    $i++;
    sleep 5;
}

Listing of Published Experimental Studies and Subject Recruitment using U.S. Samples in APSR, AJPS, and JOP (2005-2010) APSR Of the 7 (out of 12 total) survey experiment articles in the APSR from 2005-2010 that did not use a national US sample, none reported demographic characteristics in a table. Here’s what the text says about the characteristics of each sample: Feddersen et al. (2009): “We conducted a total of six sessions of the experiment in computer labs at Northwestern University (four sessions) and the Experimental Social Science Laboratory (Xlab) at the University of California–Berkeley (two sessions). Subjects were Northwestern or Berkeley undergraduates recruited from the Management and Organizations subject pool, undergraduate social science classes, computer labs (Northwestern), and the Xlab subject pool (Berkeley). Subjects were not selected to have any specialized training in game theory, political science, or economics.” (p. 181-182) Mutz (2007): “Participants were recruited through temporary employment agencies and community groups and either the group treasury or the participants were compensated for their time.” (p. 625) Chong and Druckman (2007): “We recruited participants from a large public university and the general public by inviting them to take part in a study on public opinion at the university’s political psychology laboratory in exchange for a cash payment and a snack.” Footnote 7: “Overall, aside from the disproportionate number of students, the samples were fairly diverse, with liberals, whites, and politically knowledgeable individuals being slightly over-represented (relative to the area’s population).We checked and confirmed that adults and nonadults did not significantly differ from one another in terms of the experimental causal dynamics presented later. This is consistent with other work that shows no differences between students and nonstudents (e.g., Druckman 2004; Kuhberger 1998).” (p. 641) Battaglini et al. (2007): “The experiments were conducted at the Princeton Laboratory for Experimental Social Science and subjects were registered students at Princeton University.” (p. 413) White (2007): Experiment (1) “The experiment was carried out in computer labs at three separate locations: the University of Michigan in Ann Arbor, Michigan; Southern University in Baton Rouge, Louisiana; and Louisiana State University in Baton Rouge, Louisiana. … The sample does not differ dramatically from that of the nation as a whole on most political dimensions, although there are some marked differences, of course, in other demographics. On the important dimensions of partisanship and racial group identification, Black subjects were very similar to the national population. For Whites, the experimental sample is slightly more Democratic and somewhat more racially liberal than the population as a whole. Although 30% of the participants were nonstudents, in terms of age and education the sample still differs from the national population for both Blacks and Whites. The median age of experimental participants is 22 years and the average participant is college educated. Analysis of variance, however, indicated that neither these sociodemographic variables nor the partisan and attitudinal variables differed across conditions, so we can be reasonably confident that the results observed are, in fact, the result of exposure to the various conditions.” (p. 343)


Experiment (2) “This experiment was carried out both in computer labs at the University of Texas in Austin, Texas and Southern University in Baton Rouge, Louisiana, and at various locations in both cities through the use of laptop computers. … One-hundred sixty self-identified African-American subjects and 181 self-identified White subjects participated in the experiment. Participants were much younger than the national population; the median age is approximately 26 years. The participants were also more educated than the national population, with most subjects indicating that they had some college education. Analysis of variance, however, indicates that neither sociodemographic variables nor relevant partisan and attitudinal variables differed across conditions.” (p.347) Levine and Palfrey (2006): “Subjects were recruited by email announcement from a subject pool consisting of registered UCLA students.” (p. 147) Mutz and Reeves (2005): “In three experiments using adults and undergraduate subjects,7 we exposed viewers to systematically different versions of four different political disagreements that were drawn from a larger pool.” “7 Adult subjects were recruited through temporary employment agencies, and they were paid for their participation by the agency at the hourly rate they had agreed on with the agency. Student subjects were recruited from political science courses as part of a class opportunity for extra credit. All subjects were invited to participate “in a study that involves watching television.” In Experiment 1, 75% of the subjects were college students, and the remaining 25% were recruited from the community. In Experiment 2, 45% of the subjects were students, and 55% were drawn from the community. In Experiment 3, all subjects were recruited from the community. We found no systematic difference in the reactions of student and nonstudent subjects.”

AJPS Of the 4 (out of 13 total) survey experiment articles in the AJPS from 2005-2010 that did not use a national US sample, none reported demographic characteristics in a table. Here’s what the text says about the characteristics of each sample: Dickson (2009): “The experiments were carried out at the Center for Experimental Social Science (CESS) at New York University. Subjects signed up for the experiment via a webbased recruitment system that draws from a broad pool of potential participants, almost all of whom are undergraduates. Subjects were not recruited from the author’s courses, and all subjects gave informed consent according to standard human subjects protocols.” (p. 912) Philpot and Walton (2007): “Participants in the 2005 Party Image Study were nonstudent subjects recruited from a number of locations, including an art fair and hotel lobbies, in Michigan and Texas. In total, 469 subjects were recruited for the experiment, including 226 blacks and 210 whites. The mean age of the sample was 42, 62% of the sample was female, 58% of the sample was college educated, and the median income of the sample was between $40,000 and $49,999.” (p. 53-54) Smith (2006): “To operationalize the research design outlined in Table 1, a group of N=132 undergraduates at a large Midwestern university were recruited for the experiment.” (p. 1017) Brader (2005):


“Subjects for this study were adult residents of Massachusetts, who in the summer of 1998 were faced with a Democratic primary race for governor. That race featured Scott Harshbarger, the incumbent attorney general, and Patricia McGovern, a former state senator. In all, 286 subjects from 11 communities participated over the course of 10 weeks leading up to the election. This sample closely resembles the state electorate in a number of ways, including sex (53% women), age (mean is 41), and race (89% white, 4% black). The median household income is slightly below average ($33,500). Finally, subjects are well educated on average (56% have a college degree), making them closer to the likely primary electorate than to the state population. (p. 391)

JOP Of the 12 (out of 19 total) survey experiment articles in the JOP from 2005-2010 that did not use a national US sample, only 1 (Barker and Hansen 2005) reported demographic characteristics in a table. Here’s what the text says about the characteristics of each sample: Boudreau and McCubbins (2010): “To test our hypotheses, we conducted laboratory experiments at a large public university. When recruiting subjects, we posted flyers on campus and sent out campus-wide emails to advertise the experiments. A total of 236 adults who were enrolled in undergraduate classes participated.” (p. 520) • Control for school year and female because small differences across treatment groups (see FN 15) Druckman et al. (2010): “We recruited participants from a large university (students and staff) and from the general public by inviting them to take part in a study on political learning at the university’s political science laboratory in exchange for a cash payment. A total of 416 individuals participated in the study during the early winter of 2008. This voluntary response sample generally reflected the area population fromwhich it was recruited. 5” “5Reflecting the population from which it was recruited, the sample is relatively liberal and Democratic. Also, while there are a disproportionate number of student-aged participants (e.g., less than 25 years old), they do not constitute a majority of the sample. We checked and confirmed that studentaged and nonstudent-aged participants did not significantly differ from one another in terms of the experimental causal dynamics presented below.” (p. 138) Dickson et al. (2009): “We conducted a laboratory experiment to explore the dynamics of enforcement and compliance in the context of the model described above. The paper presents data collected during 12 experimental sessions that were carried out at the Center for Experimental Social Science at New York University. Each of the 230 subjects who participated took part in one session only. Subjects interacted anonymously via networked computers. The experiments were programmed and conducted with the software z-Tree (Fischbacher 1999). Participants signed up via a webbased recruitment system that draws on a large, preexisting pool of potential subjects. (Subjects were not recruited from the authors’ courses.) Almost all subjects were undergraduates from the university.” (p. 1361) Scott and Bornstein (2009): “Participants were 580 undergraduates, recruited for the study in political science classes at the University of California, Davis, in which they received course credit for participation. … Participants were 47.1% women, and ranged in age from 18 to 49, with a mean (and median) age of 19 years old.” (p. 839) Zink et al. (2009): “The respondents (558 undergraduates in Political Science courses at the University of California, Davis)” (p. 913)


Boudreau (2009): “In order to assess the effects that the endorser’s statements have on sophisticated and unsophisticated subjects’ decisions, I conducted laboratory experiments at a large public university. When recruiting subjects, I posted flyers on campus and sent out campus-wide emails to advertise the experiments. A total of 381 adults who were enrolled in undergraduate classes and who were of different genders, ages, and college majors participated.” (p. 971)

Dickson et al. (2008): “The experiment was carried out at the NYU Center for Experimental Social Science (CESS). Our results come from data collected in two experimental sessions involving 18 subjects each, for a total of 36 subjects. Subjects signed up for the experiment via a web-based recruitment system that draws from a broad pool of potential participants; individuals in the subject pool are mostly undergraduates from around the university, though a smaller number came from the broader community. We did not recruit from our classes, and all subjects gave informed consent according to standard human subjects protocols.” (p. 979-980)

Smith et al. (2007): “We recruited subjects from a broad cross section of the population of a mid-sized U.S. city.” FN 11: “Subjects were recruited using newspaper ads, posters and community listserves, which produced a very diverse pool of respondents. The average age was 37, with a median income of $20,000 to $40,000. There were slightly more males (55% of our N) than females (45%), and most were white (approximately 70%). We make no claims that this constitutes a random sample, but do suggest that we have a much more representative pool of subjects than the undergraduate population that is typical of experimental research.” (p. 291)

Nelson et al. (2007): “We sampled members of the Columbus, OH community for our study, aiming for roughly equal representation by blacks and whites. Nonstudent adults were solicited through fliers and newspaper advertisements or were approached in public places such as the public library, bus station, and city marketplaces.” FN 2: “The average age of the respondent was 36. About 22% of the sample had completed high school, 35% completed some college, 27% graduated from college, and 12% had postgraduate education. Fifty-six percent of respondents were male and 44% female. While we deliberately sampled nonstudents, we do not claim our sample is representative of the general population. The ages of our participants range from 18 to 78; incomes range across all five offered income categories (less than $25,000, through over $100,000). Compared to 2000 Census data, our sample resembles the county in terms of median age and income. Our participants have a somewhat higher level of educational attainment. Our sample is unlike the county population in that the sample is more male and more Democratic. Finally, our sample overrepresents African-American participants by design. See the web appendix (http://www.journalofpolitics.org) for details and the complete questionnaire.” (p. 420)

Kam (2007): FN 1: “Despite the effort to recruit subjects from many walks of life, the subject pool reflects a convenience sample drawn from a Midwestern college town. Eighty-two percent of the sample is white; 61% is female. The subjects range from 18 to over 61, with approximately a quarter of the sample aged 61 or over. About a fifth (21%) of subjects identify as Republicans, 19% identify as pure Independents, and 54% identify as Democrats. Two-thirds of the sample possess a Bachelor’s degree or its equivalent.” (p. 19)

Berinsky and Kinder (2006): Experiment (1): “Our first experiment was a between-subjects design carried out in the spring of 2000 in and around Ann Arbor, Michigan. Participants (n = 141) were enlisted through posting advertisements and recruiting at local businesses and voluntary associations and were paid for their participation. We deliberately avoided college students (for reasons spelled out in Sears 1986). As we had hoped, participants came from virtually all walks of life: men and women, black and white, poorly educated and well-educated, young and old, Democratic, Independent, and Republican, engaged in and indifferent to politics (see the supplemental appendix on the Journal of Politics web site (http://journalofpolitics.org/articles.html) for respondent characteristics).” (p. 644) Experiment (2): “Experiment 2 was another between-subjects design, conducted in the spring of 2002 in central New Jersey. As before, participants (n = 163) were recruited in such a way as to guarantee a broad representation of citizens (see the web appendix) and were paid for their participation.” (p. 651)

Barker and Hansen (2005): “Participating in the experiments were 220 university students, most of whom we recruited from political science classes at a large public university.” (p. 327) Table 1 (p. 328) displays demographics.

Cited studies

Barabas, Jason, and Jennifer Jerit. 2010. Are Survey Experiments Externally Valid? American Political Science Review 104(2).
Barker, David C., and Susan B. Hansen. 2005. All Things Considered: Systematic Cognitive Processing and Electoral Decision-making. Journal of Politics 67(2).
Bartels, Brandon, and Diana C. Mutz. 2009. Explaining Processes of Institutional Opinion Leadership. Journal of Politics 71(1).
Battaglini, Marco, Rebecca Morton, and Thomas Palfrey. 2007. Efficiency, Equity, and Timing of Voting Mechanisms. American Political Science Review 101(3).
Berinsky, Adam J., and Donald R. Kinder. 2006. Making Sense of Issues Through Media Frames: Understanding the Kosovo Crisis. Journal of Politics 68(3).
Bianco, William T., Michael S. Lynch, Gary J. Miller, and Itai Sened. 2006. A Theory Waiting to Be Discovered and Used: A Reanalysis of Canonical Experiments on Majority-Rule Decision Making. Journal of Politics 68(4).
Boudreau, Cheryl, and Mathew D. McCubbins. 2010. The Blind Leading the Blind: Who Gets Polling Information and Does It Improve Decisions? Journal of Politics 72(2).
Boudreau, Cheryl. 2009. Closing the Gap: When Do Cues Eliminate Differences between Sophisticated and Unsophisticated Citizens? Journal of Politics 71(3).
Brader, Ted, Nicholas A. Valentino, and Elizabeth Suhay. 2008. What Triggers Public Opposition to Immigration? Anxiety, Group Cues, and Immigration Threat. American Journal of Political Science 52(4).
Brader, Ted. 2005. Striking a Responsive Chord: How Political Ads Motivate and Persuade Voters by Appealing to Emotions. American Journal of Political Science 49(2).
Brooks, Deborah Jordan, and John G. Geer. 2007. Beyond Negativity: The Effects of Incivility on the Electorate. American Journal of Political Science 51(1).
Chong, Dennis, and James N. Druckman. 2007. Framing Public Opinion in Competitive Democracies. American Political Science Review 101(4).
Dickson, Eric S., Catherine Hafer, and Dimitri Landa. 2008. Cognition and Strategy: A Deliberation Experiment. Journal of Politics 70(4).
Dickson, Eric S., Sanford C. Gordon, and Gregory A. Huber. 2009. Enforcement and Compliance in an Uncertain World: An Experimental Investigation. Journal of Politics 71(4).
Dickson, Eric S. 2009. Do Participants and Observers Assess Intentions Differently During Bargaining and Conflict? American Journal of Political Science 53(4).
Druckman, James N., Cari Lynn Hennessy, Kristi St. Charles, and Jonathan Webber. 2010. Competing Rhetoric Over Time: Frames Versus Cues. Journal of Politics 72(1).
Dunning, Thad, and Lauren Harrison. 2010. Cross-cutting Cleavages and Ethnic Voting: An Experimental Study of Cousinage in Mali. American Political Science Review 104(1).
Feddersen, Timothy, Sean Gailmard, and Alvaro Sandroni. 2009. Moral Bias in Large Elections: Theory and Experimental Evidence. American Political Science Review 103(2).
Feldman, Stanley, and Leonie Huddy. 2005. Racial Resentment and White Opposition to Race-Conscious Programs: Principles or Prejudice? American Journal of Political Science 49(1).
Gadarian, Shana Kushner. 2010. The Politics of Threat: How Terrorism News Shapes Foreign Policy Attitudes. Journal of Politics 72(2).
Gartner, Scott Sigmund. 2008. The Multiple Effects of Casualties on Public Support for War: An Experimental Approach. American Political Science Review 102(1).
Gibson, James L. 2008. Group Identities and Theories of Justice: An Experimental Investigation into the Justice and Injustice of Land Squatting in South Africa. Journal of Politics 70(3).
Goren, Paul, Christopher M. Federico, and Miki Caul Kittilson. 2009. Source Cues, Partisan Identities, and Political Value Expression. American Journal of Political Science 53(4).
Großer, Jens, and Arthur Schram. 2006. Neighborhood Information Exchange and Voter Participation: An Experimental Study. American Political Science Review 100(2).
Hainmueller, Jens, and Michael J. Hiscox. 2010. Attitudes toward Highly Skilled and Low-skilled Immigration: Evidence from a Survey Experiment. American Political Science Review 104(1).
Horiuchi, Yusaku, Kosuke Imai, and Naoko Taniguchi. 2007. Designing and Analyzing Randomized Experiments: Application to a Japanese Election Survey Experiment. American Journal of Political Science 51(3).
Huber, Gregory A., and John S. Lapinski. 2006. The 'Race Card' Revisited: Assessing Racial Priming in Policy Contests. American Journal of Political Science 50(2).
Jerit, Jennifer. 2009. How Predictive Appeals Affect Policy Opinions. American Journal of Political Science 53(2).
Kam, Cindy D., and Elizabeth N. Simas. 2010. Risk Orientations and Policy Frames. Journal of Politics 72(2).
Kam, Cindy D. 2007. When Duty Calls, Do Citizens Answer? Journal of Politics 69(1).
Levine, David K., and Thomas R. Palfrey. 2007. The Paradox of Voter Participation? A Laboratory Study. American Political Science Review 101(1).
Lupia, Arthur, and Tasha S. Philpot. 2005. Views from Inside the Net: How Websites Affect Young Adults’ Political Interest. Journal of Politics 67(4).
Malhotra, Neil, and Alexander G. Kuo. 2008. Attributing Blame: The Public’s Response to Hurricane Katrina. Journal of Politics 70(2).
McDermott, Monika L. 2005. Candidate Occupations and Voter Information Shortcuts. Journal of Politics 67(1).
Mutz, Diana C. 2007. Effects of 'In-Your-Face' Television Discourse on Perceptions of a Legitimate Opposition. American Political Science Review 101(4).
Mutz, Diana C., and Byron Reeves. 2005. The New Videomalaise: Effects of Televised Incivility on Political Trust. American Political Science Review 99(1).
Nelson, Thomas E., Kira Sanbonmatsu, and Harwood K. McClerking. 2007. Playing a Different Race Card: Examining the Limits of Elite Influence on Perceptions of Racism. Journal of Politics 69(2).
Peffley, Mark, and Jon Hurwitz. 2007. Persuasion and Resistance: Race and the Death Penalty in America. American Journal of Political Science 51(4).
Philpot, Tasha S., and Hanes Walton, Jr. 2007. One of Our Own: Black Female Candidates and the Voters Who Support Them. American Journal of Political Science 51(1).
Prior, Markus. 2009. Improving Media Effects Research through Better Measurement of News Exposure. Journal of Politics 71(3).
Scott, John T., and Brian H. Bornstein. 2009. What’s Fair in Foul Weather and Fair? Distributive Justice across Different Allocation Contexts and Goods. Journal of Politics 71(3).
Smith, Kevin B., Christopher W. Larimer, Levente Littvay, and John R. Hibbing. 2007. Evolutionary Theory and Political Leadership: Why Certain People Do Not Trust Decision Makers. Journal of Politics 69(2).
Smith, Kevin B. 2006. Representational Altruism: The Wary Cooperator as Authoritative Decision Maker. American Journal of Political Science 50(4).
Tomz, Michael, and Robert P. van Houweling. 2008. Candidate Positioning and Voter Choice. American Political Science Review 102(3).
Tomz, Michael, and Robert P. van Houweling. 2009. The Electoral Implications of Candidate Ambiguity. American Political Science Review 103(1).
Transue, John E. 2007. Identity Salience, Identity Acceptance, and Racial Policy Attitudes: American National Identity as a Uniting Force. American Journal of Political Science 51(1).
White, Ismail K. 2007. When Race Matters and When It Doesn’t: Racial Group Differences in Response to Racial Cues. American Political Science Review 101(2).
Whitt, Sam, and Rick K. Wilson. 2007. The Dictator Game, Fairness and Ethnicity in Postwar Bosnia. American Journal of Political Science 51(3).
Wood, B. Dan, and Arnold Vedlitz. 2007. Issue Definition, Information Processing, and the Politics of Global Warming. American Journal of Political Science 51(3).
Zink, James R., James F. Spriggs II, and John T. Scott. 2009. Courting the Public: The Influence of Decision Attributes on Individuals’ Views of Court Opinions. Journal of Politics 71(3).
