Comparing Modes and Samples in Experiments: When Lab Subjects Meet Real People

Pablo Beramendi (Duke University)
Raymond M. Duch and Akitaka Matsuo (Centre for Experimental Social Sciences, Nuffield College, University of Oxford)

September 17, 2016
1 Introduction: Lab and Online Experiments

Experiments are a powerful tool for helping us understand how individuals make
decisions. Many of the most important experiments conducted in the social sciences involve subjects making decisions that are affected by the choices of other subjects participating in the experiment. Social scientists have traditionally conducted these experiments in laboratory settings, typically with student subject pools, because such settings facilitate interactive decision making: subjects can interact with each other in real time. The lab also provides scholars with an opportunity to exercise tight control over treatment assignment and to ensure high levels of internal validity.

But as the use of experimental methods has become much more widespread, extending in particular to governmental policy makers and firms, there is growing demand to conduct these decision-making experiments with much more diverse and representative subject pools. Regulators want to conduct decision-making experiments with subjects who resemble the individuals in the population they regulate; firms are interested in experiments conducted with subjects resembling their customers. Many academic researchers are likewise interested in diverse and representative subject pools.

As with any experimental design, we are concerned with whether the random assignment of online subjects to treatment and control meets the stringent assumption that "their potential outcome depends solely on whether the subject itself receives the treatment." Of particular concern here is the classic exclusion restriction: are potential outcomes affected simply by the fact that treatments are delivered in an online environment? This essay compares subjects and treatment effects across experimental modes in an attempt to answer this question. We focus on comparing lab and online treatment effects. Strictly speaking, these should be identical. They are not. The essay
indicates why they might not be identical and suggests how online experiments might be designed to minimize violations of the exclusion restriction.

Of particular concern is understanding the mode effects and subject pool effects associated with lab versus internet experiments. Are the treatment effects estimated with internet experiments systematically different from those estimated in a traditional lab? And are these differences driven by mode effects or by subject pool effects? Distinguishing between the two is critical. The former might undermine the internal validity of internet experiments. The latter could represent an advantage for internet experiments because it would bolster arguments regarding their external validity.

There have been attempts to identify the heterogeneity of treatment effects by running parallel experiments in different modes (e.g. Clifford and Jerit 2014, Weigold, Weigold and Russell 2013) and with different subject pools (e.g. Berinsky, Huber and Lenz 2012, Mullinix et al. 2016, Coppock 2016). These studies have explored the similarities and differences in treatment effects from experiments conducted under different settings and environments, but using non-interactive experiments, typically survey experiments. Our contribution to this literature is that we focus on interactive experiments and attempt to distinguish differences in treatment effects associated with the mode from those associated with the subject pool.

This article attempts to disentangle the effects of the mode of experimental administration from the effects of different subject pools by comparing experimental results across modes and samples. To achieve these goals, we run (quasi-)interactive experiments in both traditional lab and online contexts. In the online experiment, subjects recruited from various sources participate in an experiment involving strategic interactions, using a novel strategy for replicating an interactive lab experiment in an online environment. We examine whether subjects in the online equivalent of a lab experiment behave similarly to subjects in
the lab even when their actions are not truly interactive. Our study builds on the tax compliance experiments conducted by Beramendi and Duch (2014). We explore differences in behavioral measures in non-interactive games and also differences in the choices subjects make in the quasi-interactive tax compliance experiment. Of particular interest is their behavior in the quasi-interactive games: do the choices of subjects in a lab, where they interact with subjects in close proximity, resemble those of subjects online, where they "interact" with subjects who are neither proximate nor acting in real time?

Conducting experiments in a laboratory setting has a number of advantages, particularly when subjects interact with each other in real time (Morton and Williams 2009). This is the case for public goods games in which subjects contribute to a common pool and are compensated depending on the size of the contributions of other subjects in the game. Concerns about internal validity should take priority over external validity (McDermott 2002), and lab experiments are the gold standard for ensuring internal validity: researchers have greater control over the environment in a laboratory experiment than in other forms of experiment, which helps establish the internal validity of the causal inference.

However, there are strong arguments for going outside the conventional experimental lab. If one thinks that treatment effects are likely to interact with characteristics of laboratory subjects, and one is concerned with external validity, then a more diverse subject pool becomes important. Druckman and Kam (2011) argue that a student convenience sample could bias estimates of causal effects if the size of a treatment effect depends on a characteristic with little variance in the convenience sample. For instance, Belot, Duch and Miller (2015) compared student convenience samples with non-student subjects and find that, certainly with respect to games involving other-regarding preferences, students can differ significantly from non-students. Online experiments clearly offer such
diversity, but at a potential cost to internal validity.

Our incentivised tax compliance experiments start in the lab, where we can ensure the highest degree of internal validity. The tax compliance experiments were initially designed and implemented in the lab of the Nuffield Centre for Experimental Social Sciences (CESS). Results from four sessions with a total of 66 subjects are the basis for assessing the internal validity of the experimental design and whether there were in fact significant treatment effects.

Treatment effects observed in the lab can interact in a systematic, theoretically interesting fashion with certain characteristics of the population. For this particular experiment, ability, skill levels and effort interact with treatment effects. A convenience sample of Oxford undergraduates might generate a skewed distribution of ability levels. And since we were very interested in the external validity of the treatment effects, the design included online experiments with a more diverse subject population.

The goal of this article is to explore whether the findings from lab experiments using undergraduate student subjects are artefacts of the particular experimental setup: whether the experimental environment of the lab, as well as the use of student subjects, influences the outcomes. To explore this issue thoroughly, we conduct a series of parallel experiments in an online virtual laboratory in which subjects are placed in a quasi-interactive environment where we match their online behavior with subjects' behavior in past laboratory experiments. We conduct a series of such virtual lab experiments with subjects from multiple subject pools. In order to detect the effect of experimental mode, lab versus online, we use student subjects as well as crowd-sourced subjects with broader backgrounds, so that we can identify the peculiarities of student subjects.

Subjects in the lab and online versions of the tax compliance experiments play essentially the same game, although they make decisions in quite different environments. In the
lab version of the experiment subjects are randomly assigned to groups of six; although anonymous, subjects know their group members are among the approximately 24 other participants in the lab facility; and the collective decisions of the six subjects affect the payoffs of each subject in real time. The online environment is quite different. Online subjects are assigned, through a substitution strategy, to groups that have already played the game. Payoffs to the online subjects are then determined by their own choices and the choices that had already been made in previous lab experiments. There is, of course, an absence of physical proximity online that might also affect the choices subjects make. We attempt in this essay to estimate whether treatment effects vary between the lab and online modes.
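To fix ideas, the logic of the design can be written as a simple decomposition (our notation, not the authors'): comparing lab students with online students isolates the mode effect, while comparing online students with online crowd-sourced subjects isolates the subject pool effect.

```latex
% Our notation: \tau_{m,p} is the estimated treatment effect under
% mode m (lab, online) for subject pool p (students, crowd-sourced).
\tau_{\mathrm{lab},\mathrm{stud}} - \tau_{\mathrm{onl},\mathrm{crowd}}
  = \underbrace{(\tau_{\mathrm{lab},\mathrm{stud}} - \tau_{\mathrm{onl},\mathrm{stud}})}_{\text{mode effect}}
  + \underbrace{(\tau_{\mathrm{onl},\mathrm{stud}} - \tau_{\mathrm{onl},\mathrm{crowd}})}_{\text{subject pool effect}}
```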
2 Disentangling the Effects of Subject Pool and Experimental Modes
2.1 Subject Pool Effects
Most laboratory experiments are conducted with undergraduate students (Morton and Williams 2009). Students are an easily available "convenience sample" and are even preferred over non-students because we expect that they are more likely to receive treatments and behave accordingly. This advantage of student subjects can easily turn into a drawback if the ultimate motivation of scientific research is to provide generalizable insights. The primary concern is the heterogeneity of treatment effects: we might end up with biased conclusions if we infer the behavior of the general population solely from data on the behavior of students.

In contrast, online experiments generally allow researchers to recruit participants from a wider population. For instance, researchers frequently make use of subjects recruited from crowdsourcing platforms or other online convenience subject pools because they are
cost efficient. We make use of two online subject pools in our online replication of the tax compliance lab experiment. One subject pool is recruited from Amazon Mechanical Turk (MTurk). There is considerable evidence that online crowd-sourced subjects are a reasonable basis for estimating heterogeneous effects in the general population, although crowd-sourced subject pools are not representative of specific populations (Berinsky, Huber and Lenz 2012). MTurk is one of the most frequently used online crowd-sourcing platforms for micro-tasks, called Human Intelligence Tasks (HITs). People who have registered as a workforce at MTurk (called workers) browse the list of HITs, and can accept and complete HITs. MTurk has become a popular platform in the social sciences because of its cost efficiency compared to traditional methods of data collection, in particular for survey or experimental research. Experimental research employing MTurk subjects has been widely published in leading journals in political science (e.g. Healy and Lenz 2014, Malhotra and Margalit 2014, Grose, Malhotra and Parks Van Houweling 2015), economics (e.g. Horton, Rand and Zeckhauser 2011, Olea and Strzalecki 2014, Kuziemko et al. 2013) and sociology (e.g. Tsvetkova and Macy 2014). The second subject pool is an online subject pool drawn from the UK population, developed by CESS, with demographic characteristics similar to those of the MTurk workers.

A second goal of this essay is to determine whether subject pool composition affects the estimated treatment effects. Researchers have explored whether findings from survey research using traditional sampling, both online and face-to-face, can be replicated using online subjects, especially those from MTurk. Berinsky, Huber and Lenz (2012) report that many of the established results from survey experiments using framing as a treatment can be replicated with the MTurk subject pool (see also Crump, McDonnell and Gureckis 2013, Rand, Greene and Nowak 2012). Other studies have claimed that MTurk subjects are even preferable to traditional samples. For instance, Kuziemko et al. (2013) show that the attrition rate for a multiple-wave survey is lower for MTurk samples than for some other
experiments; they also contend that the MTurk sample is no less representative than conventional telephone surveys of the U.S. population.[1] Grose, Malhotra and Parks Van Houweling (2015) maintain that MTurk subjects are more suitable than online nationally representative samples for some types of treatments, such as reading lengthy texts, because "MTurk respondents are 'workers' who are accustomed to completing lengthy tasks in exchange for payment" (p. 734).

[1] Their comparison is to an unweighted CBS survey.

There are some issues associated with MTurk subjects, and researchers need to follow best practices when implementing research based on MTurk subjects (Paolacci, Chandler and Stern 2010, Horton, Rand and Zeckhauser 2011). For example, they should screen out subjects who do not pay enough attention to survey protocols and instructions. Berinsky, Margolis and Sances (2014) suggest implementing multiple screener questions in an unobtrusive fashion.

This project makes comparisons between three different subject pools: student subjects, online convenience samples from an opt-in panel in the UK, and online convenience samples in the US recruited from MTurk. Subjects from the different pools take part in an identical online experiment in a virtual online lab. This allows us to make a direct comparison of treatment effects as well as a controlled comparison of the association between behavioural attitudes and behaviour in an experiment.
2.2 Mode Effects
It is well known that moving from the lab to the online environment can potentially undermine the internal validity of an experiment (e.g. Clifford and Jerit 2014, Weigold, Weigold and Russell 2013). Among other things, it is difficult to determine who is actually treated and the conditions under which subjects are making their decisions. This essay attempts to measure the threats to internal validity associated with moving
from a very controlled lab environment to an online experimental environment over which the experimenter has much less control.

This essay is primarily concerned with whether there are mode effects associated with the implementation of the tax compliance experiment in online settings. Since we have implemented essentially identical experiments both in the lab and online, we are able to address this question. In particular, we are interested in mode effects associated with experiments involving strategic interaction amongst subjects. As we point out above, there have been some comparisons between experiments conducted with online subject pools and other types of subject pools, but these studies have not focused on experiments that involve strategic interaction amongst individuals. There are experiments in which online subjects play real-time synchronous games (e.g. Mason and Suri 2012), but most studies in behavioural economics using crowdsourced subjects do not use real-time interaction. Many studies use either one-shot games without feedback (e.g. Horton, Rand and Zeckhauser 2011) or sequential interaction in
which subjects leave a virtual lab and come back after several days (e.g. Tsvetkova and Macy 2014). Studies without incentivised treatments also examine mode effects. For instance, Clifford and Jerit (2014) test several psychological measures in parallel lab and online surveys using student subjects and find that, although there are few differences in attention measures and social desirability effects, there is a significant difference in a political knowledge measure, suggesting that subjects consult outside sources in the online setting.

Our online experiment did not implement real-time synchronous interaction. Rather, we put online subjects in a virtual laboratory using the data obtained from laboratory sessions held prior to the online session. When online subjects enter the virtual laboratory, they are randomly assigned to one of the groups from the laboratory sessions, and based on their responses they are substituted for one of the laboratory subjects and participate in the tax compliance experiment as a surrogate.
3 Experimental Setting

The original tax compliance experiment was designed and implemented by Beramendi
and Duch (2014). Their motivation is to understand compliance with redistributive taxation, given that such compliance is the foundation of the modern welfare state. They argue that, beyond arguments about the monitoring ability and efficiency of state institutions and normative concerns for the worse-off in society, the distributive incidence of the public sector over the long run drives the level of compliance. In other words, the distribution of net benefits (as determined by the progressivity of taxes and transfers) is a key determinant of what individuals perceive as collectively fair, and therefore drives their willingness to contribute to the common pool. They test the argument with a series of lab experiments in which subjects are randomly assigned to different fiscal treatments, which approximate the variation in the design of national fiscal systems around the world.
The current project attempts to replicate their experiments using a different mode and different subject pools. The lab and online experiments follow an identical procedure. Subjects are required first to undertake a real effort task (RET) and then to report their earned income from these tasks. In the lab experiment, subjects are randomly assigned to groups of six. In the online experiment, they are assigned to one of the groups formed in the lab and replace a member of that group based on the criteria described below. In both modes, subjects are informed of "deduction" rates, audit rates, and penalties for cheating; they are also informed about how the group earnings are distributed to each of the group members; and they are given instructions on how to earn money in the RET.

Our intuition is that ensuring that subjects are treated in this experimental set-up will be more challenging in the online than in the lab environment. Hence treatment effects could vary significantly across these modes. We are also interested in whether variations in the subject pools correlate, in a systematic fashion, with treatment effects. Accordingly, the experiments gathered extensive information on the characteristics of the online and lab experiment samples. Subjects are asked to answer questions on some basic demographics as well as perform tasks designed to recover basic preferences and personality characteristics: 1) a dictator game to measure other-regarding preferences; 2) a lottery game to measure risk preferences; and 3) an integrity test to measure honesty.
3.1 Laboratory Experiment
The lab experiments were conducted at the CESS experimental lab with 66 University of Oxford students from the CESS student subject pool. At the beginning of each session, participants receive printed instructions, which are read and explained aloud. The experimental sessions were conducted from November 22 to December 3, 2013. The experiment consisted of five modules: two tax compliance modules (with and without auditing), a dictator game,
a lottery game, and a non-incentivised questionnaire. In the first three modules, earnings were denominated in Experimental Currency Units (ECU), at a conversion rate of 300 ECUs to 1 British Pound.

The tax modules consist of ten rounds each. Table 1 summarises the treatments implemented in these two modules of the experiment. Prior to the tax treatments, participants are randomly assigned to groups of six, and the composition of each group remains unchanged. Each round of the two tax modules has two stages. In the first stage, subjects perform a RET in which they compute a series of two-number additions in one minute; their Preliminary Gains depend on the number of correct answers, at 150 ECUs per correct answer. In the second stage, they are asked to report their gains. Depending on the rank of their Preliminary Gains within the group, subjects are assigned to one of the tax terciles. If their report is audited, they pay an additional penalty: 50 percent of unreported gains is also deducted. The audit rate is 0 percent for the first tax module and 10 percent for the second. Deductions applying to the six group members, including any penalty, are pooled and distributed amongst those members in accordance with the redistribution rate in Column 5 of Table 1.

Table 1: Summary of Tax Compliance Experimental Treatments

Session   Subjects   Number of Groups   Tax Terciles              Benefits
1         24         4                  1) 28%; 2) 30%; 3) 32%    Rev / 6
2         18         3                  1) 20%; 2) 30%; 3) 43%    1) 25% Rev; 2) 17.5% Rev; 3) 7.5% Rev
3         12         2                  1) 42%; 2) 45%; 3) 47%    Rev / 6
4         12         2                  1) 42%; 2) 45%; 3) 47%    1) 25% Rev; 2) 17.5% Rev; 3) 7.5% Rev
Total     66         11
Note: Rev denotes the total contribution within the group.

At the end of each round participants are informed of their Preliminary and Declared Gains; whether these amounts have been audited; the total tax contribution in the group; the amount they receive from the deductions in their group; and their earnings in
the round. At the end of each tax module one round is chosen at random, and subjects' earnings are based on their profit for that round. Participants are informed of their earnings for each tax module only at the end of the experiment.

The experiment also measured preferences and socio-demographic characteristics. The demographic variables were gender and income. We included measures of trust, integrity, other-regarding preferences and risk aversion. The details of these games are provided in the Online Appendix.
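To make the round mechanics concrete, here is a minimal sketch of one subject's payoff in a tax round under our reading of the rules above. The function name is ours, and treating the audit penalty as the only consequence of under-reporting is our simplifying assumption, not the authors' implementation.

```python
# Minimal sketch (our reading of the rules above) of one subject's ECU
# earnings in a single tax round. Names and simplifications are ours.

def round_earnings(preliminary, declared, tax_rate, audited, pool_share):
    """Return one subject's earnings in ECUs for a round.

    preliminary -- gains from the real effort task (150 ECUs per correct answer)
    declared    -- the gains the subject chooses to report
    tax_rate    -- tercile tax rate applied to declared gains (e.g. 0.30)
    audited     -- whether the report is audited (probability 0 or 0.10)
    pool_share  -- the subject's share of the pooled group deductions
                   (Rev/6 or the progressive schedule in Table 1)
    """
    deduction = tax_rate * declared
    # If audited, 50 percent of unreported gains is deducted as a penalty.
    penalty = 0.5 * (preliminary - declared) if audited else 0.0
    return preliminary - deduction - penalty + pool_share

# Example: 12 correct answers (1800 ECUs), half declared, a 30% tax rate,
# an audit, and 400 ECUs received back from the group pool.
print(round_earnings(1800, 900, 0.30, True, 400))  # 1800 - 270 - 450 + 400 = 1480
```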
3.2 Online Experiment
The Beramendi and Duch (2014) research question called for implementing an identical version of the lab experiment with a much larger and more diverse online subject pool. In the lab experiment, subjects were assigned to groups and their final payments were conditional on the behaviour of the other subjects in the group. Replicating this group assignment and real-time feedback in an online setting is challenging, particularly for a relatively large number of subjects. We solve this challenge by matching online subjects in real time with the recorded behaviour of lab subjects.

The sequence of the modules in the experiment is mostly the same as in the lab experiment, the main exception being the number of rounds in each tax module. To prevent subjects from losing attention and to reduce the number of dropouts, we reduce the number of rounds in each tax module from ten to five. From the ten rounds of the tax compliance game, five with audits and five without, one round is randomly selected for payment. In the data analysis of the laboratory experiments, we use only the corresponding ten rounds: the first five rounds of each of the two modules. In the tax compliance game, online participants played exactly the same real effort task, two-digit number additions for sixty seconds, with each correct answer converted to 150 ECUs.
The wording of the questions closely follows that of the lab experiment. However, we trimmed down the instructions given to the subjects. In the laboratory experiment the experimenter read the instructions aloud, and experimental subjects could not skip ahead to the next screen until the experimenter finished the recitation. Since this is not possible in an online experiment, we tried to ensure that subjects actually receive the instructions by making them shorter and clearer, by setting a minimum time to be spent on each page of instructions before proceeding, and by adding more examples. The experiment was conducted in a virtual laboratory hosted on Qualtrics, an online survey platform.

A critical component of the experimental design is assigning subjects to a tax rate depending on how they perform in the real effort task relative to the other subjects in their group. In this online experiment we recreate the group environment by randomly assigning subjects to one of the lab groups and instantly replacing one of the lab subjects in the group in each round. After a subject performed the RET, we matched that subject to a lab subject, with the matching based on the similarity of their Preliminary Gains. The online subject's tax rate and after-tax revenues are determined by his or her own contribution and those of the remaining members of the lab group. This matching is accomplished by linking a database on our web server to Qualtrics.[2] We set up an online database that stores the laboratory data on the experimental parameters (such as the audit rate and tax scheme) as well as the real effort task performance and tax compliance in the laboratory sessions. The Qualtrics survey software accesses the database through a web-service function. At the start of the experiment, each participant was randomly assigned to one of the 11 laboratory groups (see Table 1), and the information about the tax scheme for that group is then retrieved from the server and shown on the subject's instruction screen.

[2] A survey experiment using similar technology is found in Boas and Hidalgo (2012).

The matching of online subjects' behavior with previous lab behavior works in the following manner. In each round of the tax compliance game, the Qualtrics survey
program sends the record of an online subject's Preliminary Gain to the web server; a program on the server then retrieves the group's performance for that round and determines which lab subject to replace by finding the subject whose Preliminary Gain is closest to the online subject's Preliminary Gain for the round. The program retrieves the data on the deductions of the other five subjects, and then passes the data on the total contributions, the online subject's tax rate, and the audit information back to Qualtrics. This information is shown on the experimental screen before the online subject is prompted to declare the gain. With the Declared Gains of the online subject and the information received from the web server, the Qualtrics survey program produces the results of the round, and online participants receive feedback on the round's results equivalent to the information the lab subjects received.

We have run three online experiments using different subject pools. The first online experiment used 500 MTurk workers.[3] We published the HIT on October 3, 2014 and finished the data collection within a day. The second, UK online, experiment uses 150 subjects from the CESS subject pool; the field period was February 20 to 27, 2015. The third experiment used student subjects of the University of Oxford; the field period was March 18 to 20, 2016. For both the general online subjects and the student subjects we randomly invited three hundred registered subjects, and the experimental website remained open until the targeted number of completions was reached. We used currency conversion rates different from the lab: 1000 ECUs equal one currency unit, the British Pound Sterling for the lab and UK online experiments, and the US Dollar for MTurk. The experimental settings for the UK experiment are exactly the same as for the US experiment other than the currency of the reward.
[3] The worker qualifications we used were country of residence (United States), number of completed HITs (more than 500), and rate of previous HITs approved by requesters (99 percent or higher).
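The following sketch illustrates the server-side substitution step described above. It is our illustration only: the actual system linked Qualtrics to a web-server database, and the data structures and names (LabSubject, find_replacement) are hypothetical.

```python
# Hypothetical sketch of the server-side matching step described above;
# the real implementation used Qualtrics web-service calls to a database.
from dataclasses import dataclass

@dataclass
class LabSubject:
    subject_id: int
    preliminary_gain: int   # ECUs earned in the RET this round
    deduction: int          # ECUs this lab subject contributed via taxes/penalties

def find_replacement(group: list[LabSubject], online_gain: int):
    """Pick the lab subject whose Preliminary Gain is closest to the online
    subject's, and return the group feedback Qualtrics needs."""
    replaced = min(group, key=lambda s: abs(s.preliminary_gain - online_gain))
    others = [s for s in group if s.subject_id != replaced.subject_id]
    # The online subject inherits the replaced subject's slot; the group
    # total is the other five members' deductions plus the online subject's.
    return replaced, sum(s.deduction for s in others)

# Example: an online subject who earned 1640 ECUs joins a six-member lab group.
group = [LabSubject(i, g, d) for i, (g, d) in enumerate(
    [(2100, 400), (1800, 350), (1650, 300), (1500, 280), (900, 150), (600, 100)])]
replaced, others_total = find_replacement(group, 1640)
print(replaced.subject_id, others_total)  # subject 2 is replaced; 1280 from the others
```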
4 Results: Lab versus Online Experiments Compared

We first compare the socio-demographic and behavioral measures of the different subject
pools before looking into the tax game outcomes. Concerns regarding subject pool effects build on the notion that these samples differ quite significantly in terms of their socio-demographic profiles. There is some evidence to this effect. Figure 1 indicates that the gender distributions of subjects in the lab and online are quite similar except for the UK online sample. Figure 2 confirms that there are age differences between the subject pools. We know that MTurk workers tend to be younger than population survey samples (Berinsky, Huber and Lenz 2012), and, as we would expect, the undergraduate student subjects both in the lab and online are even younger on average. UK online subjects are similar to MTurk subjects. The age distributions for MTurk and UK online are significantly different from those of the lab and online students by both a t-test and a Wilcoxon rank sum test, but MTurk and UK online are not distinguishable from each other.
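A sketch of the kind of pairwise comparison reported throughout this section, assuming a hypothetical subjects.csv with a pool and an age column; the tests are Welch t-tests and Wilcoxon rank sum (Mann-Whitney) tests.

```python
# Hypothetical sketch of the pairwise pool comparisons reported in the text.
from itertools import combinations
from scipy import stats
import pandas as pd

df = pd.read_csv("subjects.csv")  # hypothetical file: columns 'pool' and 'age'

for a, b in combinations(df["pool"].unique(), 2):
    x = df.loc[df["pool"] == a, "age"]
    y = df.loc[df["pool"] == b, "age"]
    t_p = stats.ttest_ind(x, y, equal_var=False).pvalue          # Welch t-test
    w_p = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue  # rank sum
    print(f"{a} vs {b}: t-test p={t_p:.3f}, rank-sum p={w_p:.3f}")
```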
Figure 1: Gender of Subjects
[Bar chart: proportion of subjects by gender (Male/Female) for each pool: MTurk, CESS Online, Lab Student, Online Student.]
Figure 2: Age of Subjects
[Density plot of subject age (roughly 20 to 60) for each pool: MTurk, CESS Online, Lab Student, Online Student.]
Preferences and personality. A more important concern is that the subject pools differ with respect to fundamental preferences; this is the concern raised by the results reported in Belot, Duch and Miller (2015). As we pointed out earlier, both the online and lab experiments included the same set of behavioral measures for recovering the underlying preferences that might be the source of heterogeneity in treatment effects. Comparisons of these metrics across the different subject pools suggest few significant differences.

Other-regarding preferences are very similar across the different subject pools. We employ the classic Dictator Game, described in the Online Appendix, to measure other-regarding preferences. In both the lab and online versions of the Dictator Game subjects have an opportunity to split an endowment of 1000 ECUs between themselves and a recipient. Figure 3 describes the allocation of ECUs to the recipients. A large proportion of subjects allocate either nothing or half of the endowment to the recipients. The
average amount allocated to the recipient is 299 ECUs by student lab subjects; 302 by student online subjects; 283 by MTurk workers; and 408 by UK Online subjects. Student and MTurk subjects are similar: in both a t-test and a Wilcoxon rank sum test, the differences between these groups are insignificant. In contrast, the UK Online subjects are significantly more generous than the other subject pools, as confirmed by both t-tests and Wilcoxon rank sum tests.[4]

[4] The averages of the scores on each measure for the different pools and the test results are presented in the Online Appendix.

Figure 3: Dictator Game
[Bar chart: proportion of subjects by amount sent (0 to 1000 ECUs) for each pool: MTurk, CESS Online, Lab Student, Online Student.]
Integrity represents a personality trait that again may be the basis for heterogeneous treatment effects, particularly in our tax compliance experiments. The test consists of a 10-item questionnaire described in the Online Appendix; summing the ten item scores yields an overall measure of integrity. Figure 4 shows the distribution of this total score, which ranges from 10 (highest integrity) to 40 (lowest integrity). The distribution of scores suggests lower integrity (higher scores) for the student subjects, and in fact the average integrity score is significantly higher for lab subjects than for online subjects. These differences in integrity between a student subject pool and more diverse subject pools are
consistent with the student/non-student differences reported by Belot, Duch and Miller (2015).

Figure 4: Integrity Test: Total Score
[Density plots of the total integrity score (10 to 40) by pool; averages: CESS Online 17.50, Lab Student 19.04, MTurk 17.33, Online Student 18.47.]
A final source of treatment heterogeneity that might vary across the subject pools is risk preferences. The risk preference elicitation game described earlier is designed to measure these preferences in both lab and online subjects. The measure assumes transitive preferences and monotonically non-decreasing utility in monetary earnings: once a subject chooses Option B in a particular lottery, she should choose Option B in all subsequent lotteries. Violations of transitivity are often observed in these games. In this experiment, most lab and MTurk subjects reveal consistent preferences: 7.6 percent of lab subjects and 9.8 percent of MTurk subjects exhibit such inconsistency in the test. The rate is slightly higher for online student subjects, but the difference is not statistically significant. On balance, violations of transitivity occur for less than 10 percent of subjects; we exclude these subjects from the analysis. However, UK Online subjects exhibit surprisingly high inconsistency: 27 percent of these subjects are inconsistent.
Figure 5 shows the distribution of risk preferences after deleting these inconsistent results. The x-axis indicates the risk preference measured as the lottery at which a subject switches choices. For example, a score of 4 means that a subject switched from Option A to Option B at Lottery 4. A score of 0 represents the most risk-seeking choice, while 10 represents the most risk-averse.[5] Online subjects are slightly more risk seeking, but on balance the subject pools are quite similar with respect to risk preferences.

[5] A score of 10 is a logically inconsistent choice, because a subject with this score never switched even though in Lottery 10 Option B provides a higher earning with probability 1. There are eight such subjects among the MTurk participants and four among UK Online.
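A minimal sketch of how such a switch-point measure can be coded, including the single-switch consistency check; the function name and list encoding are our illustration, not the authors' code.

```python
# Hypothetical sketch: code the risk measure as the point at which a subject
# switches from Option A to Option B, and flag multi-switch (inconsistent)
# choice sequences.

def risk_score(choices: list[str]):
    """choices: 'A' or 'B' for Lotteries 1..10, in order.
    Returns (score, consistent): score is the number of A choices before
    the first B (0 = most risk seeking, 10 = never switched)."""
    first_b = next((i for i, c in enumerate(choices) if c == "B"), len(choices))
    # A consistent subject chooses A up to some lottery, then B thereafter.
    consistent = all(c == "B" for c in choices[first_b:])
    return first_b, consistent

print(risk_score(["A"] * 4 + ["B"] * 6))        # (4, True): switched at Lottery 4
print(risk_score(["A", "B", "A"] + ["B"] * 7))  # (1, False): violates single switch
```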
Figure 5: Risk Preference
[Bar charts: proportion of subjects at each switch point (0 to 10) by pool; averages: CESS Online 4.33, Lab Student 3.88, MTurk 3.44, Online Student 4.57.]
On balance, these subject pools do not differ dramatically on the preference and personality measures we administered in these experiments. Potentially more problematic is the fact that, as we pointed out earlier, the lab and online modes differ quite significantly, and these very different environments could affect treatment effects. We now explore this possibility. There are two key behavioural outcomes associated with the
tax compliance experiments.
Real Effort Performance. Recall that the tax compliance experiment required subjects to perform a real effort task that consisted of adding pairs of randomly generated two-digit numbers for sixty seconds. Our expectation is that both lab and online subjects would treat the real effort tasks with similar levels of seriousness and would register similar outcomes. Both subject pools are incentivised, although the online subjects earn less, and for many of them this is one of a large number of HITs that they perform. As we pointed out earlier, there is also some question about how attentive crowd-sourced subjects are to such tasks. Significant differences would raise questions about the use of real effort tasks with online experimental subjects.

Figure 6 shows the distribution of outcomes for both lab and online subjects. Online subjects performed slightly worse than the lab subjects: average Preliminary Gains are 1836 ECUs for lab student subjects (an average of 12.2 correct answers), 1821 for online student subjects (12.1 correct answers), 1579 for MTurk subjects (10.5 correct answers), and 1425 for UK Online subjects (9.5 correct answers). Lab subjects, who are university students, are younger and, on average, better educated, which might explain their higher average. MTurk subjects' performance is higher than that of UK Online subjects, as they are "professionals" in such online work. The online version of the tax compliance experiment required us to match online subjects with lab subjects based on these Preliminary Gains.
Cheating. There are also factors that might produce differences in cheating across the two environments. One might imagine, for example, that subjects in a lab experiment will be more hesitant about cheating, given that they physically interact with other participants in the lab. There is little questioning the anonymity of online subjects, and this might exaggerate cheating levels.
Figure 6: Real Effort Task Performance
[Density plots of Preliminary Gains (0 to 5000 ECUs) by pool; averages: CESS Online 1425, Lab Student 1837, MTurk 1580, Online Student 1821.]
An additional factor that might affect the choices subjects make relates to the design innovations that were necessary in order to replicate the tax compliance experiments online. The innovation here is assigning online subjects to groups of lab subjects who had already played the game; in fact, each online subject was substituted for a real lab subject who had already played the game.[6] Our expectation is that this change in the tax compliance experiment should not lead the online subjects to comply with tax rates differently than lab subjects. One of the goals of the comparisons we implement is to test this conjecture.

[6] There is no deception in how we implement the assignment of online subjects; they are told that they are being paired with subjects who had already made their decisions about how much of their income to declare.

In our lab experiment, subjects are assigned to a six-member group at the start of the tax experiment. The lab experiments took place months before the online experiment. In each of the ten rounds of the online experiment, online subjects are matched to one of the lab subjects based on the similarity of their performance on the real effort task. At the income reporting stage the online subjects receive the identical information that was provided to their matched lab subject: they are informed of the tax regime; they learn their tax rate; and they are informed of the benefits regime, which determines how the
pooled tax revenues are distributed to each member of the group. If the online subjects happened to have identical preferences to their matched lab subjects, then the online subjects would make exactly the same decisions (this controls for treatment effects because the treatment is precisely the same in each pairing).

In order to test this hypothesis, we matched the replacing subjects for each round with the replaced subjects to create a comparable dataset of lab and online results. The matching of lab to online subjects is 1-to-N; in other words, for each data point in the lab data there are N corresponding points in the online data.

Figure 7 compares the lab and online income report rates using the matched dataset. Reporting no income always maximized subjects' expected earnings in this tax experiment, under both audit rate treatments. Nonetheless, the income report rate takes the value of one in a substantial number of online observations. In the positive audit rate case, about 16 percent of UK Online subjects and 12 percent of MTurk subjects reported their full income, while 9 percent of lab student subjects and 7 percent of online student subjects did so. These proportions go down when there is no auditing of subjects' income; nevertheless, 8 percent of UK Online subjects and 11 percent of MTurk subjects still reported their full income, while only 2 percent of lab student subjects and 4 percent of online student subjects did. It is clear that the students, in particular the lab students, are more sensitive to the change in the possibility of auditing. We disentangle the effects of mode and subject pool in the following subsection.
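The claim that full concealment maximizes expected earnings follows from the parameters: the expected audit cost of concealing is the audit rate times the penalty rate, i.e. 5 percent of unreported gains, which is below even the lowest tax rate. A quick check of this arithmetic (our simplification, ignoring the share of one's own contribution returned through the group pool):

```python
# Quick check (our arithmetic) that concealing all income maximizes expected
# earnings: a 10% audit rate with a 50% penalty on unreported gains implies
# an expected cost of 5% of unreported gains, versus tax rates of 20-47%.
AUDIT_RATE, PENALTY_RATE = 0.10, 0.50

def expected_earnings(preliminary, declared, tax_rate):
    expected_penalty = AUDIT_RATE * PENALTY_RATE * (preliminary - declared)
    return preliminary - tax_rate * declared - expected_penalty

for tax_rate in (0.20, 0.30, 0.47):
    hide = expected_earnings(1800, 0, tax_rate)      # declare nothing
    full = expected_earnings(1800, 1800, tax_rate)   # declare everything
    print(f"tax {tax_rate:.0%}: conceal {hide:.0f} vs report {full:.0f}")
# Concealing wins whenever tax_rate > AUDIT_RATE * PENALTY_RATE = 5 percent.
```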
Figure 7: Comparison of Income Report Rate
[Histograms of income report rates (0.00 to 1.00), with and without audit, for MTurk, CESS Online, Lab Student, and Online Student subjects.]
4.1 Comparing Treatment Effects
Subjects in both experiments were assigned to similar tax and audit treatments.[7] Our general expectation is that tax compliance will drop as tax rates rise and that tax compliance will be lower when there is no auditing of income. Table 2 reports results from the first set of models, in which we regress the lab or online income report rate on dummy variables for the tax rate terciles and the audit rate. We include two dummy variables for tax rates, Middle and Low, as well as a "No Audit" dummy variable. The baseline is the high tax rate tercile and a 10 percent audit rate. For both the online and lab models, all estimated coefficients are highly significant and in the expected direction. There are some notable differences between lab and online. First, the effect of no audit is larger for the lab subjects: for lab subjects, moving from the 10 percent audit treatment to the no audit treatment results in an average decrease in reported income of 31 percent. The online subjects are less responsive to the possibility of cheating without retribution. Second, we find similar differences for the effects of tax rates. Lab subjects are more responsive to tax rates: they comply less than online subjects when rates are high, and their compliance rates are more responsive to drops in the tax rates.
[7] The results presented in this paper are limited to these fairly broadly defined tax and audit treatments. Beramendi and Duch (2014) report the actual substantive results from the experiments that focus on treatment effects associated with different levels of progressivity related to both revenue collection and the distribution of benefits.
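A minimal sketch of the specification in Table 2 below, assuming a long-format data set with one row per subject-round; the file and column names are hypothetical, not the authors' own.

```python
# Hypothetical sketch of the Table 2 specification: OLS of the income report
# rate on tax-tercile and audit dummies, estimated separately for each pool.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("tax_compliance_rounds.csv")  # hypothetical: one row per subject-round

for pool in ["CESS Online", "MTurk", "Lab Student", "Online Student"]:
    sub = df[df["pool"] == pool]
    # Baseline category: high tax tercile with a 10 percent audit rate.
    model = smf.ols("report_rate ~ middle_tax + low_tax + no_audit", data=sub).fit()
    print(pool, model.params.round(3).to_dict())
```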
Table 2: Effects of Tax Rates and Auditing

                      CESS Online   MTurk        Lab Student   Online Student
                      (1)           (2)          (3)           (4)
Middle Tax Bracket    0.111***      0.065***     0.076**       0.110***
                      (0.034)       (0.014)      (0.030)       (0.025)
Low Tax Bracket       0.124***      0.117***     0.176***      0.237***
                      (0.030)       (0.013)      (0.030)       (0.024)
No Audit              -0.142***     -0.173***    -0.329***     -0.219***
                      (0.023)       (0.010)      (0.025)       (0.020)
Constant              0.409***      0.511***     0.384***      0.282***
                      (0.028)       (0.011)      (0.024)       (0.019)
Observations          994           4,968        778           968
R2                    0.054         0.066        0.217         0.176
Adjusted R2           0.051         0.066        0.214         0.174
* p