FACULTEIT ECONOMIE EN BEDRIJFSKUNDE
TWEEKERKENSTRAAT 2, B-9000 GENT
Tel.: 32-(0)9-264.34.61
Fax: 32-(0)9-264.35.92

WORKING PAPER

Constructed-response versus multiple choice: the impact on performance in combination with gender

Patricia Everaert 1
Neal Arthur 2

March 2012

2012/777

1 Corresponding author: Patricia Everaert, Ghent University, Faculty of Economics and Business Administration, Department of Accounting and Corporate Finance, Kuiperskaai 55/E, 9000 Gent. E-mail: [email protected]
2 The University of Sydney.

D/2012/7012/10

Abstract

This paper addresses the question of whether the increasing use of multiple-choice questions favours particular student groups, i.e. male or female students. It empirically examines the existence of a gender effect by comparing the relative performance of male and female students on both multiple-choice and constructed-response questions in financial accounting examinations. The study is motivated by the increasing number of students in accounting classes, changes in the gender mix in accounting classes, and debates over appropriate means of assessment. We find that female students outperform male students in answering questions of both formats, but their superiority on multiple-choice questions is smaller than on constructed-response questions. This might suggest that multiple-choice questions favour male students more than female students. The results hold even if we restrict the comparison to multiple-choice and constructed-response questions with the same general content (e.g. exercise type). Furthermore, this diminished advantage was found for both undergraduate and postgraduate students. These results should prompt those involved in assessment to be cautious in planning the type of assessment used in evaluating students.

Keywords: Gender, accounting, assessment, multiple-choice questions.

1. Introduction

The use of multiple-choice questions in accounting examinations has been traced back to the 1930s (Hardaway, 1966).‡ The extent to which multiple-choice questions are used has grown markedly, partly due to changes in technology and partly in response to increasing accounting class sizes, particularly at the undergraduate level.

Evidence in support of the use of multiple-choice questions to evaluate students' understanding of accounting is provided by Bible et al. (2008), who report that scores on accounting multiple-choice questions explained about two-thirds of the variability of the scores on accounting short-answer questions, which they claim is "… suggesting that multiple-choice questions can perform adequate assessment of subject mastery" (Bible et al., 2008, p. S55).§ This form of question is used in a variety of contexts, including undergraduate and postgraduate accounting examinations, standard aptitude tests used to screen applicants for business degrees, and professional entrance examinations.** Given the perceived problems of large undergraduate class sizes (Parker, 2005), and the economies that can be achieved using this form of assessment relative to constructed-response questions, the use of multiple-choice questions in accounting examinations is expected to continue or even increase.†† Research has increasingly questioned whether the use of multiple-choice questions favours particular student groups.‡‡ This paper examines the relationship between the type of examination question (multiple-choice vs. constructed-response), gender and performance.

It is motivated by the trend towards multiple-choice questions in larger undergraduate accounting classes and by evidence from other disciplines indicating that a gender effect exists when an examination consists only of (or is heavily weighted towards) one of these two forms of assessment (e.g. Walstad & Becker, 1994). We test for achievement differences in a large undergraduate class and confirm the prior finding that female students do better than male students on both multiple-choice and constructed-response questions. We find that female students outperform male students in answering questions of both formats, but this superiority is smaller for multiple-choice questions than for constructed-response questions. We also test this result on a second dataset of postgraduate students. For the postgraduate group we find that female students have higher levels of performance, but only on the constructed-response questions. Due to our unique research design we are able to rule out a number of potential reasons for the gender effect documented in the paper. These findings have practical relevance to instructors and administrators in the choice of assessment tools, to comparisons of the performance of male and female students in accounting courses (e.g. Alfan and Othman, 2005), and to comparisons of student performance across time.

The structure of the remainder of this paper is as follows. Section 2 provides an analysis of the related research literature and develops the research hypotheses. Section 3 provides a description of the data and methods used in the empirical analysis. Section 4 contains the results of the empirical tests. The conclusions of the paper are contained in section 5.

‡ The term "multiple-choice question" is a misnomer, as the test item is often stated in the form of a statement or equation rather than a question; the term multiple-choice "item" is arguably more appropriate. In this paper we use the term "question" as it is the most commonly used term in practice.
§ Similar results have been reported based on the analysis of results for economics examinations (Walstad and Becker, 1994).
** Jarnagin and Harris (1977) report that the use of multiple-choice questions on the CPA examination increased from about 9 percent of the examination in 1966 to approximately 38 percent in 1975. Currently, the AICPA examination includes more than 300 multiple-choice questions, which count for 70% of the total examination score. The ICAEW examination papers also contain a mix of multiple-choice and constructed-response questions.
†† Parker (2005, 392) reports that for Australian and New Zealand universities "… accounting lecture class sizes (excluding distance students) in the range of 500-1,000 students are not uncommon".
‡‡ An early example in the accounting literature is Gul et al. (1992), who showed that performance in multiple-choice examinations was affected by cognitive style.

2. Literature review and hypothesis development

Early work on the relationship between gender and performance in accounting courses reported that female students outperform male students (Hendricks, 1978; Weston and Matoney, 1976; Mutchler et al., 1987; Lipe, 1989; Gammie et al., 2003).

Explanations offered for this result include females' higher (i) aptitude for quantitative courses and (ii) levels of intrinsic motivation (Tyson, 1989). These results might be due, at least in part, to the absence of controls for factors apart from gender that might influence performance. In contrast to the results from the US studies reported above, Koh and Koh (1999), using a sample of undergraduate accounting students in Singapore, find evidence that male students outperform female students in examinations.§§ Ethnic or cultural factors may, in part, explain these inconsistent results.*** However, based on our experience in teaching accounting, we expect higher levels of achievement for female students relative to male students. This leads to the following hypothesis:

H1: Female students outperform male students in accounting examinations.

As noted in section 1, there is widespread use of multiple-choice questions in accounting examinations. However, a number of concerns have been expressed in relation to the use of multiple-choice questions, including their ability to examine the same level of understanding as constructed-response questions and whether multiple-choice questions can measure analytical skills adequately (Becker and Johnston, 1999). Existing evidence (e.g. Bible et al., 2008) suggests that multiple-choice questions can "adequately" assess subject mastery.††† Others (e.g. Hancock, 1994; Bacon, 2003) argue that only constructed-response questions can adequately test for a deeper level of understanding of the subject material.

One possible concern with the use of multiple-choice questions for assessment purposes is the potential for gender effects. Evidence from Australia (Bell and Hay, 1987), the UK (Murphy, 1982; Lumsden and Scott, 1987, 1995), Ireland (Bolger and Kellaghan, 1990) and the United States (Bridgeman and Lewis, 1994; Walstad and Robinson, 1997) indicates that males have a relative advantage over females in multiple-choice examinations. Similarly, Ghorpade and Lackritz (1998), using a sample of six human resource management classes, found that women appear to perform relatively better on constructed-response questions, but that there was no significant difference on multiple-choice tests. Together, these results provide some evidence of a relationship between gender, type of examination question and performance.

However, a number of other considerations are important to an understanding of this inter-relationship. First, early research fails to adequately control for factors other than gender that might affect the relationship between question type and performance. These factors include prior university experience and student maturity (e.g. undergraduate compared with postgraduate students). This paper aims to shed more light on this issue by controlling for prior university experience and student maturity. Second, previous studies frequently compare the performance of male and female students across different courses (which have different modes of assessment), or the performance of students in a given course across time (where the mode of assessment varies across time). In this study, we eliminate the noise of comparing across courses or across time by measuring performance on both the constructed-response questions and the multiple-choice questions on the same examination for the same students. Within an accounting examination, we expect that the multiple-choice questions will favour male students. This leads to the following research hypothesis:

H2: There is a gender difference in the exam grades of accounting courses where multiple-choice questions and constructed-response questions are presented to students for assessment purposes. Multiple-choice questions will favour male students compared to constructed-response questions.

§§ This result is based on a multivariate regression model including controls for students' age, prior work experience, mathematics background and academic aptitude.
*** Greenfield (1996) reports a relationship between ethnicity and both achievement and attitudes.
††† Bible et al. (2008) find that scores on multiple-choice questions explain about two-thirds of the variability of the results of the constructed-response questions.

3. Methodology

3.1 Data collection

Data for hypothesis testing were collected at a major Belgian university, which offers a variety of undergraduate and postgraduate programs. The data are for a financial accounting course taken by students enrolled in the three-year Bachelor's degree. All students are enrolled in a common 14-course program in the first year of study. The academic year comprises two semesters and students are enrolled in seven courses in each semester. Each semester, students have 12 weeks of classes followed by a two-week study period to prepare for the end-of-semester examinations. A four-week examination period follows immediately after the study period. There is an accounting course in each semester of the first year of the degree. The data used in this paper are for the introductory accounting course of the first semester of 2008-2009.

The accounting course took the form of a combination of lectures and tutorials. Lectures were of 1.5 hours' duration and involved the whole student group (comprising both postgraduate and undergraduate students) of about 800 students. There was a single lecturer. Tutorials were of one and a half hours' duration and were organized in smaller groups (four groups with about 200 students in each group).‡‡‡ The tutorial instructors were two Teaching Assistants (each teaching two sessions), both with relevant experience in an accounting firm.

The course is an introductory financial accounting course. Its major objective is to provide students with the knowledge and ability to translate the most frequent transactions of a company into a balance sheet and a profit and loss account. Issues addressed are the basic accounting equation; the relation between the balance sheet and the income statement; the conventions for recording transactions in T-accounts/journal entries; and journal entries for purchasing, revenue recognition, investments in non-current assets, financing activities, income taxes and period-end adjustments. All study material was contained in a Dutch textbook, including solution keys to the exercises. Students reported in the course evaluation that the textbook explained the course topics very well. All students were local students and courses were taught in Dutch. Except for the four language courses (two language courses each semester), assessment for all first-year courses was based entirely on an end-of-semester written examination. During the summer, students can retake the examinations for any first- or second-semester courses they failed. The data for the retake group were not analysed in this paper.§§§

‡‡‡ Students were allocated to tutorial groups based on the alphabetic order of students' last names.
§§§ One reason for this was to avoid two observations for a single student.


There were 644 students enrolled in the course. Of these, 68 students had prior university experience, because they either resat (repeated) the course or switched from another degree program. We excluded these students from the sample to create a homogeneous group comprising freshmen only.**** Further, 71 students did not attend the examination and were not included in the sample. Six students received an alternative examination and were also excluded. Four students received special examination facilities in the form of extra time to answer both sections of the examination (due to a medical condition) and were also excluded. Gender data were missing for 34 students, and six students did not attend (or left without attempting any questions in) the exercise part of the examination; these students were also deleted from the sample. This results in a final sample of 455 students.

The accounting examination was a "closed book" examination comprising a mixture of 20 multiple-choice questions (33% weighting) and 4 comprehensive constructed-response questions (67% weighting).†††† There was a single examination. All students received the same constructed-response questions. The multiple-choice questions were arranged into six different sets, which varied only in the sequence of the questions and the sequence of the alternative answers for each question. The examination was structured as two back-to-back examinations: students were allowed one hour and fifteen minutes to complete the multiple-choice questions and two and a half hours to complete the constructed-response questions. Between the two parts of the examination students were allowed a 15-minute break. Before this break all students had to submit their answers to the multiple-choice questions. Hence, total examination time was 3.75 hours, which is normal for the degree.‡‡‡‡ The multiple-choice questions took the form of both theory and exercise type questions. The constructed-response questions were mainly exercise type questions (e.g. journal entries, preparation or completion of a balance sheet and/or profit and loss account), scored out of 40 points in total.

**** If we include the 68 students with prior university experience, we obtain the same results. For hypothesis 1, females show significantly higher performance than males for both the multiple-choice (t = 3.337; p = .001, one-tailed) and the constructed-response exam format (t = 4.203; p = .000, one-tailed). For hypothesis 2, the difference between multiple-choice and constructed-response questions is significantly larger for males than for females (p = .011), while controlling for TA and prior university experience in the ANCOVA analysis. The additional analysis on the difference between the exercise type of multiple-choice questions and the constructed-response exam format reveals the same result (p = .025), while controlling for TA and prior university experience. The same applies for the difference between the theory type of multiple-choice questions and the constructed-response exam format (p = .024).
†††† The multiple-choice questions were answered on a computer-readable form. The answers to the other questions were hand-written by the students.
‡‡‡‡ Of the 12 other courses, two used exclusively multiple-choice questions, one used a combination of constructed-response questions and multiple-choice questions, and the other courses used constructed-response examination questions only.
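To make the sample construction concrete, the following sketch shows how the exclusion steps described above could be applied to a student-level table. It is illustrative only: the file name and the column names (prior_experience, attended_exam, alternative_exam, extra_time, gender, attempted_exercises) are assumptions, not the actual administrative records.

    import pandas as pd

    # Hypothetical student-level file covering the 644 enrolled students (assumed file name).
    enrolled = pd.read_csv("course_2008_2009_sem1.csv")

    # Apply the exclusion rules from section 3.1; all column names are assumptions.
    sample = enrolled[
        (~enrolled["prior_experience"])      # drop the 68 repeaters / degree switchers
        & (enrolled["attended_exam"])        # drop the 71 students who missed the exam
        & (~enrolled["alternative_exam"])    # drop the 6 students with an alternative exam
        & (~enrolled["extra_time"])          # drop the 4 students with special facilities
        & (enrolled["gender"].notna())       # drop the 34 students with missing gender data
        & (enrolled["attempted_exercises"])  # drop the 6 who skipped the exercise part
    ]
    print(len(sample))  # expected: 455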


The multiple-choice section of the examination was graded by awarding points for correct answers; points were not deducted for incorrect answers. Accordingly, if the source of any gender effect is greater risk taking by male students, this will bias against a finding of a gender effect as expressed in Hypothesis 2. A strength of our research design is that if evidence of a gender effect is found, we can rule out the risk-taking explanation of Ben-Shakhar and Sinai (1991).

3.2 Measurement of the dependent variables

First, we use the scores in the multiple-choice section (out of 20) and the constructed-response section of the examination (out of 40). The score on the constructed-response section is converted to a mark out of 20. We label the scores out of 20 for the multiple-choice and constructed-response questions as MC and CR respectively.

Second, to calculate the difference between the multiple-choice and constructed-response sections of the examination, we subtract the mark out of 20 for the constructed-response section from the score out of 20 in the multiple-choice section. Hence, if the difference is positive (negative) for an individual student, the student has performed relatively better in the multiple-choice (constructed-response) section of the examination. We label this variable MC-CR. Our second hypothesis is that the value of MC-CR will vary between the male and female groups within the course. Specifically, if multiple-choice questions favour male students, the value of MC-CR will be larger for male than for female students.

Third, as part of our sensitivity tests, to determine whether any difference between the scores in the multiple-choice and constructed-response sections of the examination is driven by whether the question is of an applied (exercise) type or assesses understanding of concepts (theory type), we calculate a score for each student for each of the two categories of multiple-choice questions. We classify the multiple-choice questions into "exercise" and "theory" categories: if the answer to a question takes the form of a number, the question is classified as "exercise"; in all other cases, it is classified as theory type. Of the 20 multiple-choice questions, 10 are classified as exercise type and 10 as theory type. We calculate a score (converted to a mark out of 20) for each student on the multiple-choice theory and multiple-choice exercise type questions. These variables are labelled MCtheory and MCexercise respectively.

Fourth, the difference measure MCtheory-CR is calculated by taking the difference between the score (out of 20) on the multiple-choice theory questions and the score (out of 20) on the constructed-response questions. Similarly, MCexercise-CR is calculated as the difference between the score on the multiple-choice exercise type questions and the score on the constructed-response questions. Finally, MCtheory-MCexercise is calculated as the difference between the score on the theory multiple-choice questions and the score on the exercise multiple-choice questions. Table 1 (Panel A) provides a list of variable names and descriptions.
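The rescaling and difference scores defined above are simple arithmetic; the following minimal sketch shows one way they could be computed, assuming a DataFrame `sample` with hypothetical raw-score columns mc_total (out of 20), mc_theory_raw and mc_exercise_raw (each out of 10) and cr_raw (out of 40).

    # Rescale all section scores to marks out of 20 (column names are assumptions).
    sample["MC"] = sample["mc_total"]                     # multiple-choice score, already out of 20
    sample["MCtheory"] = sample["mc_theory_raw"] * 2      # 10-point theory score rescaled to 20
    sample["MCexercise"] = sample["mc_exercise_raw"] * 2  # 10-point exercise score rescaled to 20
    sample["CR"] = sample["cr_raw"] / 2                   # 40-point constructed-response score rescaled to 20

    # Difference scores: positive values indicate relatively better multiple-choice performance.
    sample["MC_minus_CR"] = sample["MC"] - sample["CR"]
    sample["MCtheory_minus_CR"] = sample["MCtheory"] - sample["CR"]
    sample["MCexercise_minus_CR"] = sample["MCexercise"] - sample["CR"]
    sample["MCtheory_minus_MCexercise"] = sample["MCtheory"] - sample["MCexercise"]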

Insert Table 1 here

3.3 Control variables and statistical model

As described in section 3.1, students were allocated to one of four tutorial groups and two tutors were responsible for teaching these classes. We control for the potential effects of differences between these tutors by constructing a dummy variable (TA) which takes a value of 1 for tutor A and a value of 0 for tutor B. Analysis of covariance (ANCOVA) is used to test whether differences in performance between the constructed-response and multiple-choice sections of the examination can be explained, at least in part, by gender. This technique is similar to sequential regression and is used where there are one or more independent variables, of which one is a categorical grouping variable. The advantage of ANCOVA (relative to ANOVA) is that it allows for the inclusion of one or more covariates. It thus allows us to control for TA while investigating the main effect of gender.
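As a rough illustration of this model specification, the sketch below fits the ANCOVA as a linear model of the MC-CR difference on a gender factor with TA as a covariate. The paper does not state which software was used; this is only an assumed statsmodels formulation, reusing the hypothetical `sample` DataFrame and the dummy coding of Table 1 (gender: 1 = male, 0 = female; ta: 1 = tutor A, 0 = tutor B).

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # ANCOVA: difference score explained by Gender, controlling for the TA dummy.
    model = smf.ols("MC_minus_CR ~ C(gender) + ta", data=sample).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)  # Type II sums of squares
    print(anova_table)                             # F statistics and p-values for TA and Gender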

4. Descriptive statistics and results of hypothesis testing

4.1 Descriptive statistics

Panel A of Table 2 provides frequency data for the student group. Of the 455 students in the sample, 60% are male and 40% are female.§§§§ The average age of the students (not tabulated) was about 19 years. In terms of tutorial teachers, 38% of the students had TA 1 as an instructor and 62% had TA 2. Table 2 also provides summary descriptive statistics for the dependent variables. The mean for the constructed-response part of the examination is 9.44/20, whereas the mean for the multiple-choice section is 13.32/20. The mean difference between the multiple-choice and constructed-response sections is 3.87. Note that the mean difference between the scores out of 20 for the multiple-choice theory and multiple-choice exercise questions is only 1.85.

§§§§ This share of male students is higher than in countries such as the US, where approximately 55% of undergraduate students are female (AICPA, in Sanders, 2005).


Insert Table 2 here

4.2 Correlations

The correlations, reported in Table 3, indicate that, as expected, students who perform well on one part of the examination typically perform well on the other parts. We find a positive correlation between the scores in the multiple-choice and constructed-response sections of the examination (r = .691; p = .000). As expected, most of the other significant correlations are positive. The negative correlation between CR and MCtheory-MCexercise indicates that students who scored well on the constructed-response questions have a consistent result for MCtheory and MCexercise (i.e. a small difference between MCtheory and MCexercise). This correlation is relatively low (r = -.159) but statistically significant (p = .001).

Insert Table 3 here

4.3 Hypotheses testing

The results show evidence of a significant gender effect. As shown in Table 4 (Panel A), the group mean for performance is significantly higher for female than for male students, both for the multiple-choice questions (p = .003) and for the constructed-response questions (p = .000). Comparing performance on the multiple-choice theory (MCtheory) and exercise type questions (MCexercise) separately, we find that the superior performance of female students holds for both the theory and the exercise type of multiple-choice questions (p = .008 and p = .010 respectively). The results reported in Table 4 (Panel A) support the first hypothesis.
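For readers who want to reproduce the comparison of group means, the sketch below runs a two-sample t-test and converts the p-value to a one-tailed value in the hypothesized direction (female > male). It again assumes the hypothetical `sample` DataFrame and dummy coding used above; the paper does not state whether equal variances were assumed, and the Student's t-test shown here is only one plausible choice.

    from scipy import stats

    # Split the MC scores by gender (0 = female, 1 = male, as in Table 1).
    female = sample.loc[sample["gender"] == 0, "MC"]
    male = sample.loc[sample["gender"] == 1, "MC"]

    # Two-sample t-test, then one-tailed p for the direction "female mean > male mean".
    t_stat, p_two_sided = stats.ttest_ind(female, male)
    p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    print(t_stat, p_one_tailed)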

Insert Table 4 here

The t-tests reported in Table 4 (Panel A) also provide preliminary support for Hypothesis 2. The group means for MC-CR show that the gap between the multiple-choice questions and the constructed-response questions is larger for the male than for the female students (group mean of 3.40 for female students, compared to 4.20 for male students). The difference between these group means is significant (p = .007). This difference exists for both exercise (MCexercise-CR) and theory (MCtheory-CR) types of multiple-choice questions (p = .013 and .016 respectively). There is no difference between the two groups in performance on the different types of multiple-choice questions, MCtheory-MCexercise (p = .413). Hence, the question type (theory MC versus exercise MC) does not seem to explain the gender effect.

Note that while the results from the t-tests are significant, this analysis does not control for TA. To test whether the multiple-choice questions favour male students compared to the constructed-response questions, while controlling for TA, we perform an ANCOVA. As shown in Table 5, gender has a significant impact (p = .012) on the difference between multiple-choice and constructed-response questions (MC-CR). From the group means in Table 4 (Panel A), we know that the difference is larger for males than for females. As a robustness check, we test again whether the mix of multiple-choice questions (exercise vs. theory type) is driving the result reported above. We find that both MCtheory-CR and MCexercise-CR are related to gender (p = .024 and p = .033 respectively) and that the main result reported above applies to both categories of multiple-choice questions. Hence, the results of both the t-test and the ANCOVA support Hypothesis 2.

Together, the results indicate that female students outperform male students on both constructed-response and multiple-choice questions. However, the superiority of female students is larger on the constructed-response questions than on the multiple-choice questions. Hence, female students have a relative advantage over male students on the constructed-response questions (their score on CR is higher than that of the male students). Furthermore, male students have a relative advantage over female students on the multiple-choice questions: compared to the female students, their score on the multiple-choice questions is much higher than their score on the constructed-response questions. This result is not driven by the type of understanding (theory vs. exercise) tested by the multiple-choice exam questions.

Insert Table 5 here

4.4 Additional robustness test

Prior research indicates that the propensity to score better marks on multiple-choice questions is relatively greater for students with prior university experience (Krieg and Uyar, 2001). Given this, a question arises as to whether the gender effect found for the undergraduate group would also exist for students with higher levels of university experience. Therefore, we also test whether the gender effect still holds for more experienced postgraduate students.


The second dataset was collected at the same Belgian university, from an accounting course taken by students enrolled in a one-year postgraduate program (a so-called "master after master" program). A requirement for admission to this business studies program is that applicants hold both a Bachelor's and a Master's degree from a non-business faculty. This second dataset is from the same accounting course, in the first semester of 2008-2009.***** In total, 149 students were enrolled. Of these, 30 students did not attend the examination. Consistent with the rules used to construct dataset 1, we excluded students who took an alternative examination (2), received additional examination facilities (1), had missing gender data (4), or did not return for the second part of the examination (1). This results in a sample of 111 students.

The unique characteristic of the two datasets is that both the undergraduate and the postgraduate students attended the same lectures in a combined class. The tutorial instructor for the postgraduate students was Teaching Assistant 1 and, like the lectures, the tutorial group contained both undergraduate and postgraduate students (resulting in about 200 students per group, as noted above). Furthermore, all examination characteristics were the same as for dataset 1.††††† In summary, and as shown in Table 1 (Panel B), the only difference between the two datasets is that the second dataset contains more mature and experienced students, whereas the first dataset contains only freshman students.

Of the 111 students in the second sample, 62% are male and 38% are female. The average age of the students (not tabulated) was about 23 years. Descriptive statistics are shown in Panel B of Table 2. As shown in Table 4 (Panel B), female students outperform male students only on the constructed-response part of the exam (p = .039). In contrast to the result from the undergraduate dataset, no difference in performance between male and female students was found for the multiple-choice questions (p = .456). However, similar to the results from the undergraduate dataset, the difference between the scores on the multiple-choice questions and the constructed-response questions is significantly larger for the male students than for the female students (p = .008). Finally, when comparing only the scores on exercise type questions (MCexercise-CR), there is still a significantly larger difference for the male than for the female students (p = .008).‡‡‡‡‡ This suggests that the content of the multiple-choice questions does not explain the gender difference.

***** The postgraduate course had the same content (in terms of topics) and used the same textbook as the undergraduate course.
††††† That is, 20 multiple-choice questions, 4 extended open questions graded out of 40, no penalty for wrong answers to the multiple-choice questions, the same weighting of marks (one third for the multiple-choice section and two thirds for the constructed-response section), and a similar organization of back-to-back exams with similar timing.
‡‡‡‡‡ We also tested the hypotheses using the total dataset and included a dummy for postgraduate students. Using an ANCOVA analysis, we find that gender is significant in each of the three models (MC-CR, MCtheory-CR and MCexercise-CR). We also find that the difference between the MC and CR scores is larger for the undergraduate than for the postgraduate students.


Insert Figure 1 here

As shown in Figure 1, the results together indicate that the multiple-choice questions favour male students and that the constructed-response questions favour female students. Stated another way, male students have a relative advantage over female students on the multiple-choice questions, and female students have a relative advantage over male students on the constructed-response questions.

4.5 Discussion of the gender effect

A number of alternative explanations have been offered as to why male students perform relatively better on multiple-choice questions than on constructed-response questions. Explanations that have been proposed relate to differences between males and females with respect to handwriting neatness, risk-taking behaviour, verbal fluency and the frequency of changing answers to multiple-choice questions. These explanations are not viewed as mutually exclusive. Identifying the probable source(s) of the gender effect provides potential considerations for those responsible for assessment, which are discussed in the conclusions to this paper. Based on our research and results, we identify and discuss below potential causes of the gender effect reported in our study.

Early research by Ben-Shakhar and Sinai (1991) finds that part of the reason for the gender effect is the difference between male and female students in the tendency to omit questions in multiple-choice examinations. Their hypothesis was that 'risk-taking' behaviour of males would lead to a smaller number of omitted questions (i.e. more guessing of answers by males) and consequently higher scores in multiple-choice examinations.§§§§§ Whilst their results support this hypothesis, the guessing-tendency differences in their sample explain only a small proportion of the observed gender differences. We rule out this explanation for any observed gender effect in our study, as points were not deducted for incorrect answers to the multiple-choice questions. In this context it is not rational for students to omit answers to any of the multiple-choice questions, and this is reflected in the very high response rate in both samples.******

§§§§§ This assumes, of course, that marks are deducted for incorrect responses.
****** Of all 9,100 (455 x 20) possible answers to the multiple-choice questions in the first dataset, 9,078 answers were recorded, a response rate of 99.76%. For the second dataset, of all 2,220 (111 x 20) possible answers to the multiple-choice questions, 2,218 answers were recorded, a response rate of 99.91%.


Risk-taking behaviour may take other forms. A potentially successful strategy for a student attempting to reach a passing grade could be to allocate a disproportionate amount of time to the multiple-choice section of the paper. If male students adopted this strategy and spent more time on the multiple-choice questions than the female students did, this may explain their relatively better performance on this question type in past research. However, we rule out this explanation, since the exams were organized as back-to-back exams. All students had to hand in the multiple-choice answer form and then returned after the break to start the constructed-response questions. Consequently, all students spent the same amount of time answering the questions in each of the two formats.

Based on previous research, we expect that constructed-response questions favour students with better handwriting, even where graders are given specific instructions to grade answers based on content alone (Marshall and Powers, 1969; Markham, 1976). In general, female students have neater handwriting (Massey, 1983) and this may explain the gender effect in some contexts. Breland et al. (1994) examined this proposition but found no evidence of a handwriting effect. In our study, the constructed-response questions were all exercise type questions and we have no reason to believe that the tidiness of numerical schedules affected the marks awarded, as a holistic approach to grading the answers was not adopted. Each of the 40 marks on the constructed-response examination relates directly to a calculation or journal entry. Consequently, the marking was to a large extent objective.

Another explanation for the gender effect is a difference in "test-wiseness" between male and female students, defined by Evans (1984) as the ability to respond correctly to multiple-choice questions containing extraneous clues (Bacon, 2003).†††††† By nature, test-wiseness is independent of a student's knowledge of the subject matter being assessed. Aspects of test-wiseness include the efficient use of time and identifying clues unintentionally provided by the author of a question. There is evidence that males and females respond to multiple-choice questions in different ways. Female students change their answers approximately twice as often as males (Skinner, 1983). Whilst males have been found to change their answers less often than females, when they do change their answers, they more often go from incorrect to correct relative to the changes made by female students (Pascale, 1974). This process of changing answers may lead to female students running out of time on the multiple-choice section of the examination more often than male students.

†††††† Well-known clues include "the longest alternative is often the correct one" and "if there are two alternatives with opposite meanings, choose one of them".


In summary, we believe that the most likely source of the gender effect found for our samples of accounting students lies in males' superior guessing ability or in females' tendency to change their answers to multiple-choice questions.

5. Conclusions

This paper provides evidence that a gender effect exists in relation to the use of multiple-choice versus constructed-response questions in accounting examinations. The datasets used in this paper have some unique characteristics that allow us to rule out some of the explanations for the gender effect offered in previous studies. We investigated two datasets, one from the first undergraduate year and one from a postgraduate programme. Our data and method allow us to control for factors other than question format that might affect the relative performance of male and female students. First, we kept the subject area similar for both question formats and analysed student marks for both multiple-choice and constructed-response questions within the same course. Second, we used a within-subjects design, in which students were presented with both exam formats on the same day. Third, we organized the two examination formats as back-to-back examinations, so that each student spent a fixed amount of time answering questions of each type. Fourth, we controlled for prior university experience. Fifth, as the constructed-response questions were all exercise type questions, we mitigated the potential impact of any handwriting effects. Sixth, we made a distinction between exercise and theory type multiple-choice questions and found that limiting the comparison to exercise type multiple-choice questions only did not alter the conclusions on the gender effect. Finally, no penalties were applied for incorrect answers to the multiple-choice questions, so differences in risk-taking behaviour between male and female students cannot (fully) explain the gender effect of exam formats.

Our main results are as follows. First, female undergraduate students outperform male students on both constructed-response and multiple-choice accounting examination questions. Second, we find a relative performance difference between multiple-choice and constructed-response examination questions. Females perform relatively consistently across the multiple-choice and constructed-response questions, while males score much higher on the multiple-choice questions than on the constructed-response questions. The results show that the superior performance on multiple-choice questions, compared to constructed-response questions, is larger for male than for female students.


Third, we analysed whether the favourable impact could be explained by the type of multiple-choice questions used. The results show that the relative advantage of multiple-choice questions for male students holds for both theory and exercise type multiple-choice questions. Fourth, we found the gender effect both for the sample of first-year undergraduate students and for the more experienced students in a postgraduate programme. For the more experienced students, the gender effect took the form of higher performance by female students on the constructed-response part of the exam, while there was no significant difference between male and female students on the multiple-choice questions. This finding suggests that in postgraduate examinations, prior university experience helped male students more than female students in finding appropriate ways to answer the multiple-choice questions.

This study is subject to a number of limitations. One limitation relates to the fact that the constructed-response questions were of the exercise type (calculations, journal entries, completing a balance sheet or profit and loss account). Accordingly, we cannot conclude whether the performance differences we observe would also hold if the constructed-response questions were of an essay type. A further limitation is that the evidence from this study does not directly lead to any conclusions as to the superiority of different tools for the assessment of learning, as there are many considerations in developing mechanisms for effective assessment.

More research is needed to investigate whether male and female students differ in their answering strategy for multiple-choice questions. From Skinner (1983) and Pascale (1974), one might hypothesise that female students are more uncertain about their answers and hence more frequently change a correct answer into an incorrect one. Furthermore, more research is needed to further explore the test-wiseness hypothesis. One might consider converting the exercise type of multiple-choice questions into questions where students have to answer only "this statement is correct" or "incorrect". This would avoid giving alternative (say four) numerical solutions, where students can investigate the different options and perform backward calculations to guess the answer. Finally, more research is needed to investigate whether the extent of the gender effect in an accounting context can be reduced by using self-assessment, whereby students assess the perceived correctness of their answers. This method, developed by Hunt (1982), assesses both the correctness of the answer and the accuracy of the self-assessment. Application of this method of assessment has been shown to reduce the gender effect (Hassmén and Hunt, 1994), principally through increasing the scores of female examinees.

The evidence of this paper is of relevance to accounting instructors in a number of ways. First, in courses with a mix of male and female students, the choice of assessment format (exam question type) may result in a change in the gender mix within each of the grades (pass, credit, distinction, etc.). This could have a consequent impact on the selection of graduates into graduate employment if this selection is based, at least in part, on university grades. Secondly, in professional examinations where entry is based on examination results, the use of multiple-choice (constructed-response) examinations may impose a "barrier" to female (male) students. Connor and Vargyas (1992) provide a legal analysis of gender bias in testing and conclude that gender differences may place educational test users at legal risk under United States law. Further, Connor and Vargyas (1992) argue that females might be able to bring claims based on sex discrimination when examinations comprise exclusively multiple-choice questions. Thirdly, the research is of relevance to those concerned with comparing the performance of male and female students. Based on the evidence we provide, the analysis of relative performance (and of changes in relative performance) may need to account for question format (and changes in question format) as well as gender.

What can we do about the gender effect? We call for a balanced examination, comprising both multiple-choice and constructed-response questions. However, with the increasing number of students in accounting classes, it can be very time-consuming to grade constructed-response examinations. We experimented for several years with a constructed-response examination for about 700 students and formulated the following guidelines to reduce marking time without compromising the accuracy or consistency of marking.

1) When preparing an individual constructed-response question and allocating marks to it, keep in mind how you plan to grade it. If the question is to be marked out of (say) 10 marks, make sure you can easily allocate the 10 marks based on the number of steps or parts required in the answer.
2) Avoid questions whose answers will contain consequential errors. Markers' recalculations based on previous answers can be very time-consuming.
3) Use predefined answer sheets. For instance, provide row and column grid lines (with bold lines between questions), so that the marker does not need to search for where each answer is.
4) Distinguish between the space where you expect calculations and a box for the final answer. Again, this saves time in looking for the appropriate answer.


References

Alfan, E. and Othman, M., 2005, "Undergraduate students' performance: the case of University of Malaya", Quality Assurance in Education, 13 (4), pp. 329-343.
Bacon, D., 2003, "Assessing learning outcomes: A comparison of multiple-choice and short-answer questions in a marketing context", Journal of Marketing Education, 25 (1), pp. 31-36.
Bell, R. and J. Hay, 1987, "Differences and biases in English language examination formats", British Journal of Educational Psychology, 57, pp. 212-220.
Ben-Shakhar, G. and Y. Sinai, 1991, "Gender differences in multiple-choice tests: the role of differential guessing tendencies", Journal of Educational Measurement, 28 (1), pp. 23-35.
Becker, W.E. and C. Johnston, 1999, "The relationship between multiple-choice and essay response questions in assessing understanding", Economic Record, 75 (231), pp. 19-28.
Bible, L., M.G. Simkin and W.L. Kuechler, 2008, "Using multiple-choice tests to evaluate students' understanding of accounting", Accounting Education: An International Journal, 17 (Supplement), pp. S55-S68.
Bolger, N. and T. Kellaghan, 1990, "Method of measurement and gender differences in scholastic achievement", Journal of Educational Measurement, 27, pp. 165-174.
Breland, H.M., D.O. Danos, H.D. Kahn, M.Y. Kubota and M.W. Bonner, 1994, "Performance versus objective testing and gender: An exploratory study of an Advanced Placement History examination", Journal of Educational Measurement, 31 (4), pp. 275-293.
Bridgeman, B. and C. Lewis, 1994, "The relationship of essay and multiple-choice scores with grades in college courses", Journal of Educational Measurement, 31, pp. 37-50.
Connor, K. and E.J. Vargyas, 1992, "Legal implications of gender bias in standardised testing", Berkeley Women's Law Journal, 7, pp. 13-89.
Evans, W., 1984, "Test wiseness: an examination of cue-using strategies", The Journal of Experimental Education, 52 (3), pp. 141-144.
Gammie, E., B. Paver, B. Gammie and F. Duncan, 2003, "Gender differences in accounting education: an undergraduate exploration", Accounting Education, 12 (2), pp. 177-196.
Ghorpade, J. and R. Lackritz, 1998, "Equal opportunity in the classroom: test construction in a diversity-sensitive environment", Journal of Management Education, 22 (4), pp. 452-471.
Greenfield, T.A., 1996, "Gender, ethnicity, science achievement and attitudes", Journal of Research in Science Teaching, 33 (8), pp. 901-933.
Gul, F.A., H.Y. Teoh and R. Shannon, 1992, "Cognitive style as a factor in accounting students' performance on multiple-choice examination", Accounting Education, 1 (4), pp. 311-319.
Hancock, G.R., 1994, "Cognitive complexity and the comparability of multiple-choice and constructed-response test formats", Journal of Experimental Education, 62 (2), pp. 143-157.
Hardaway, M., 1966, Testing and Evaluation in Business Education, 3rd Edition (Cincinnati: Southwestern Publishing).
Hassmén, P. and D.P. Hunt, 1994, "Human self-assessment in multiple-choice testing", Journal of Educational Measurement, 31 (2), pp. 149-160.
Hendricks, A., 1978, "Hiring the woman graduate: why and how", The National Public Accountant, (October), pp. 14-16.
Hunt, D.P., 1982, "Effects of human self-assessment responding in learning", Journal of Applied Psychology, 67 (1), pp. 75-82.
Jarnagin, B.D. and J.K. Harris, 1977, "Teaching with multiple-choice questions", The Accounting Review, 52 (4), pp. 930-934.
Koh, M. and H. Koh, 1999, "The determinants of performance in an accountancy degree program", Accounting Education, 8 (1), pp. 13-29.
Krieg, R. and B. Uyar, 2001, "Student performance in business and economics and statistics: Does exam structure matter?", Journal of Economics and Finance, 25 (2), pp. 229-241.
Lipe, M.G., 1989, "Further evidence on the performance of female versus male accounting students", Issues in Accounting Education, 4 (1), pp. 144-152.
Lumsden, K.G. and A. Scott, 1987, "The economics student reexamined: male-female differences in comprehension", Research in Accounting Education, 18 (4), pp. 365-375.
Lumsden, K.G. and A. Scott, 1995, "Economics performance on multiple-choice and essay examinations: a large-scale study of accounting students", Accounting Education, 4 (2), pp. 153-167.
Markham, L.R., 1976, "Influences of handwriting quality on teacher evaluation of written work", American Educational Research Journal, 13 (4), pp. 277-283.
Marshall, J.C. and J.M. Powers, 1969, "Writing neatness, composition errors and essay grades", Journal of Educational Measurement, 6 (2), pp. 97-101.
Massey, A., 1983, "The effects of handwriting and other incidental variables on the GCE 'A' level marks in English Literature", Educational Review, 35 (1), pp. 45-50.
Murphy, R., 1982, "Sex differences in objective test performance", British Journal of Educational Psychology, 52 (2), pp. 213-219.
Mutchler, J.F., J.H. Turner and D.D. Williams, 1987, "The performance of female versus male accounting students", Issues in Accounting Education, 2 (1), pp. 103-111.
Parker, L.A., 2005, "Corporate governance crisis down under: post-Enron accounting education and research inertia", European Accounting Review, 14 (2), pp. 383-394.
Pascale, P.J., 1974, "Changing initial answers on multiple-choice achievement tests", Measurement and Evaluation in Guidance, 6, pp. 236-238.
Sanders, B., 2005, "The supply of accounting graduates and the demand for public accounting recruits - 2005", American Institute of Certified Public Accountants, URL: http://ceae.aicpa.org/NR/rdonlyres/11715FC6-F0A7-4AD6-8D28-6285CBE77315/0/Supply_DemandReport_2005.pdf, accessed 16 October 2008.
Skinner, N.F., 1983, "Switching answers on multiple-choice questions: Shrewdness or shibboleth?", Teaching of Psychology, 10 (4), pp. 206-210.
Tyson, T., 1989, "Grade performance in introductory accounting courses: why female students outperform males", Issues in Accounting Education, 4 (1), pp. 153-160.
Walstad, W.B. and W.E. Becker, 1994, "Achievement differences on multiple-choice and essay tests in economics", The American Economic Review, 84 (2), pp. 193-196.
Walstad, W. and D. Robinson, 1997, "Differential item functioning and male-female differences on multiple-choice tests in economics", Journal of Economic Education, 28, pp. 155-171.
Weston, M. and J. Matoney, 1976, "More college women majoring in accounting: the numbers and some reasons", Woman CPA, 22 (January), pp. 14-15.

Table 1: Variable names, data sets

Panel A: Variables and definitions

Gender: Dummy variable with the value of 1 for male and 0 for female students
TA: Dummy variable for teaching assistant: teacher 1 (value of 1) and teacher 2 (value of 0)
MC: Score on the 20 multiple-choice questions for the accounting course (on 20)
MCtheory: Score on the 10 multiple-choice, theoretical type of questions (rescaled on 20)
MCexercise: Score on the 10 multiple-choice, exercise type of questions (rescaled on 20)
CR: Score on the constructed-response, exercise type of questions (original on 40, but rescaled on 20)
MC-CR: Difference in score for multiple-choice and for constructed-response questions
MCtheory-CR: Difference in score for theoretical multiple-choice (rescaled on 20) and for constructed-response questions
MCexercise-CR: Difference in score for exercise multiple-choice (rescaled on 20) and for constructed-response questions
MCtheory-MCexercise: Difference in score for theoretical multiple-choice (rescaled on 20) and for exercise multiple-choice questions (rescaled on 20)

Panel B: Data sets

             Data set 1                                     Data set 2
Year         Undergraduate: 1st year in Economics           Postgraduate: Master in Complementary
             and Business Administration                    Studies in Business Administration
Maturity     Freshman                                       Mature students
Date         2008-2009, 1st semester                        2008-2009, 1st semester
Students     N = 644                                        N = 149
Sample       N = 455                                        N = 111
Course       Intro to Financial Accounting                  Intro to Financial Accounting
Lecturer     1 single group, prof. X                        1 single group, prof. X
Tutorials    4 groups                                       1 group
TA           TA 1, TA 2                                     TA 1
Exam         Set 1: 20 MC + constructed-response            Set 2: 20 MC + constructed-response
             questions out of 40 (exercise type)            questions out of 40 (exercise type)

Table 2: Descriptive Statistics

Panel A: Data set 1

Frequency table
Gender      TA 1    TA 2    Total       %
Female        61     123      184     40%
Male         110     161      271     60%
Total        171     284      455    100%
%            38%     62%

Descriptives
Variable                  N    Minimum    Maximum     Mean    Std. Dev.
MC                      455       4.00      20.00    13.32         3.03
MCtheory                455       2.00      20.00    14.24         3.29
MCexercise              455       2.00      20.00    12.39         3.79
CR                      455       0.00      19.00     9.44         4.64
MC-CR                   455      -5.00      13.50     3.87         3.36
MCtheory-CR             455      -6.00      16.50     4.80         3.94
MCexercise-CR           455      -7.00      13.50     2.94         3.70
MCtheory-MCexercise     455     -10.00      12.00     1.85         3.66

Panel B: Data set 2

Frequency table
Gender      TA 1    TA 2    Total       %
Female        42       -       42     38%
Male          69       -       69     62%
Total        111       -      111    100%
%           100%

Descriptives
Variable                  N    Minimum    Maximum     Mean    Std. Dev.
MC                      111       4.00      19.00    13.32         2.82
MCtheory                111       5.00      18.33    13.81         3.00
MCexercise              111       2.50      20.00    12.59         3.81
CR                      111       0.50      19.00    10.34         4.02
MC-CR                   111      -4.00      10.00     2.99         3.09
MCtheory-CR             111      -4.00      10.83     3.48         3.01
MCexercise-CR           111      -6.50      14.00     2.25         4.30
MCtheory-MCexercise     111      -8.33       9.17     1.22         3.70

Table 3: Correlation Coefficients
Pearson correlations are reported below the diagonal; Spearman correlations are reported above the diagonal. P-values are shown in parentheses. N = 455. Dataset 1: 1st-year undergraduate.
Variables: (1) MC; (2) MCtheory; (3) MCexercise; (4) CR; (5) MC-CR; (6) MCtheory-CR; (7) MCexercise-CR; (8) MCtheory-MCexercise.

                          (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)
(1) MC                     1      .836     .884     .695    -.118    -.187    -.004    -.201
                                 (.000)   (.000)   (.000)   (.012)   (.000)   (.931)   (.000)
(2) MCtheory             .804       1      .457     .549    -.061     .122    -.231     .369
                        (.000)            (.000)   (.000)   (.194)   (.009)   (.000)   (.000)
(3) MCexercise           .879     .473       1      .635    -.127    -.380     .179    -.607
                        (.000)   (.000)            (.000)   (.007)   (.000)   (.000)   (.000)
(4) CR                   .691     .550     .632       1     -.766    -.725    -.613    -.178
                        (.000)   (.000)   (.000)            (.000)   (.000)   (.000)   (.000)
(5) MC-CR               -.051    -.005    -.078    -.757       1      .879     .864     .096
                        (.276)   (.917)   (.097)   (.000)            (.000)   (.000)   (.040)
(6) MCtheory-CR         -.117     .185    -.349    -.718     .887       1      .536     .528
                        (.012)   (.000)   (.000)   (.000)   (.000)            (.000)   (.000)
(7) MCexercise-CR        .032    -.207     .231    -.609     .874     .544       1     -.385
                        (.494)   (.000)   (.000)   (.000)   (.000)   (.000)            (.000)
(8) MCtheory-MCexercise -.159     .409    -.610    -.159     .061     .528    -.425       1
                        (.001)   (.000)   (.000)   (.001)   (.165)   (.000)   (.000)

Table 4: Group means (standard deviations) and t-tests for gender

Panel A: Data set 1 (N = 455)
Dependent variable      Group mean for female   Group mean for male    t-test    One-tailed Sig.
                        (std dev), N = 184      (std dev), N = 271
MC                      13.79 (2.87)            12.99 (3.114)           2.780     .003
MCtheory                14.70 (3.21)            13.93 (3.308)           2.440     .008
MCexercise              12.89 (3.64)            12.05 (3.851)           2.333     .010
CR                      10.40 (4.02)             8.80 (4.927)           3.654     .000
MC-CR                    3.40 (3.19)             4.20 (3.436)          -2.502     .007
MCtheory-CR              4.30 (3.77)             5.14 (4.03)           -2.231     .013
MCexercise-CR            2.50 (3.65)             3.26 (3.71)           -2.158     .016
MCtheory-MCexercise      1.80 (3.79)             1.88 (3.57)            -.222     .413

Panel B: Data set 2 (N = 111)
Dependent variable      Group mean for female   Group mean for male    t-test    One-tailed Sig.
                        (std dev), N = 42       (std dev), N = 69
MC                      13.29 (2.80)            13.35 (2.84)            -.112     .456
MCtheory                14.01 (3.10)            13.70 (2.95)             .530     .299
MCexercise              12.20 (3.87)            12.83 (3.78)            -.835     .203
CR                      11.20 (3.82)             9.81 (4.08)            1.784     .039
MC-CR                    2.08 (2.75)             3.54 (3.18)           -2.455     .008
MCtheory-CR              2.81 (2.47)             3.88 (3.25)           -1.847     .034
MCexercise-CR            1.00 (4.41)             3.01 (4.07)           -2.448     .008
MCtheory-MCexercise      1.81 (4.04)             0.87 (3.47)           -1.296     .100

Table 5: ANCOVAs for Gender

Dependent variable:          MC-CR              MCtheory-CR         MCexercise-CR
Source          df          F       Sig         F       Sig         F       Sig
TA               1        .078     .780       .300     .584       .006     .939
Gender           1       6.319     .012      5.128     .024      4.594     .033
Total          522

Dataset 1: 1st-year undergraduate.
Note: An ANCOVA on dataset 2 is not relevant, since in dataset 2 all students had the same TA, so there is no need to control for TA. In dataset 2 the ANCOVA results are therefore equal to the t-tests presented in Table 4, Panel B.

Figure 1: Summary of the Gender Effect
[Two panels plotting group means by gender and question format.
Dataset 1: MC: female 13.79, male 12.99; constructed-response: female 10.40, male 8.80.
Dataset 2: MC: female 13.29, male 13.35; constructed-response: female 11.20, male 9.81.]