The Use of Multiple Choice Tests for Formative and Summative Assessment

Tim S Roberts
Faculty of Informatics and Communication
Central Queensland University
Bundaberg, Queensland 4670, Australia
[email protected]
Abstract

This paper describes the use of multiple-choice tests as an essential part of the assessment for a third-year undergraduate course in computer science. Multiple-choice tests are yesterday's news – they have been used for student assessment for many years – but their implementation as presented here differs from the norm in several important respects, in particular their use for formative (as opposed to purely summative) reasons. This paper describes the intent of the tests, their format, and the regulations concerning them, discusses the advantages and disadvantages, and concludes that there are significant gains to be made for both educators and students from their appropriate deployment.[1]

Keywords: formative assessment, summative assessment, multiple-choice tests, MCQ, online assessment.
1 Introduction
All too often, learning has been driven by assessment ("Is this likely to be in the exam?" "Is this important?" "Will this topic be assessable?"), rather than the other way around. Of course, effective learning should be preeminent. Wherever possible, therefore, assessment should have some formative component. The phrase formative assessment is used throughout this paper in the sense of assessment that contributes to student learning, rather than in other senses such as those delineated by Scriven (1973).

Multiple-choice tests have been used extensively in recent years for assessment purposes, most particularly in large classes where marking can consume inordinate amounts of time and resources. Multiple-choice tests have reportedly been used to good effect in courses ranging from computer programming (eg Kuechler & Simkin, 2003) to physical education (eg Ayers, 2001).
[1] Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the Eighth Australasian Computing Education Conference (ACE2006), Hobart, Tasmania, Australia, January 2006. Conferences in Research in Practice in Information Technology, Vol. 52. Denise Tolhurst and Samuel Mann, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
Excellent guides to help instructors construct effective multiple-choice tests can be found in many places (eg Ballantyne & The Teaching and Learning Centre of Murdoch University, 2002; Illinois State University, 2001; Teaching Resources and Continuing Education, 2000). On most occasions, such tests have been employed for exclusively summative assessment.

The advantages of multiple-choice tests to be found in the literature (eg Epstein et al, 2002; Higgins & Tatham, 2003; Kuechler & Simkin, 2003) include that they can:
• test knowledge quickly within large groups
• be used to provide quick feedback
• be automatically scored
• be analysed with regard to difficulty and discrimination, and
• be stored in banks of questions and re-used as required

The literature (eg Wesolowsky, 2000; Paxton, 2000; National Center for Fair & Open Testing, n.d.) also lists many disadvantages, including that multiple-choice tests can:
• take a lot of time to construct
• test knowledge and recall only
• never test literacy, or ability to analyse
• never test creativity, or unique thinking, and
• encourage students to take a surface approach to learning

This paper demonstrates that many of these disadvantages may be easily overcome. It describes an innovative implementation of multiple-choice tests as part of the core assessment for a third-year undergraduate computer science course, though it should be noted that their application as described here would seem to be equally applicable to many other areas.
2 Description
For the last five years, online multiple-choice tests have formed an integral component of the learning and assessment within COIT13152 Operating Systems, a purely theoretical third-year course within the Bachelor of Information Technology degree (and available as an option to students in various other degree programs). Students who understand the theory can progress, usually in the following term, to a more practical course entitled COIT13146 Systems Administration. The online system used for the tests is a course-management system written in-house.

The implementation of the tests differs in several important respects from what might be regarded as standard. The significant differences are:

1. The multiple-choice tests serve both formative and continuous summative assessment purposes. Students are encouraged to research the answers during the period of each of the tests. Each test counts towards a final grade, thereby ensuring participation by all students. However, it is important to note that the primary purpose of the tests is formative, not summative. Students may download and print the test questions, and they may research the answers through any number of resources, including the set text, the course study guide, the course web site, and any other materials they can find in the library or on the Web. The fact that the test is graded means that, in an effort to maximise their mark, students are likely to spend some time in this research, thus providing an environment in which learning can take place. The summative aspect enhances the formative aspect.

2. Tests are sat and submitted online. Tests are available to students from any location with an internet connection and a standard web browser. Students log on using their student ID and pre-assigned password, and are provided with a unique quiz ID both at the beginning and end of the test, which they record and use for any future enquiries. The system automatically records the quiz ID together with the time of log-on and the time of submission (if one is made). The online environment ensures that students at off-campus locations are treated equally to their on-campus colleagues. The online format also enables automatic recording, marking, and analysis of results.

3. A trial practice quiz is provided. This trial quiz, in exactly the same format as the real test, is made available shortly after the start of the term, and is used to ensure all students can log on to, access, and submit a typical quiz without problems. No regard is taken of the correctness or incorrectness of answers, and students may take the quiz as many times as they wish. As in the real test, correct answers are not made available immediately. Instead, they are released approximately one to two weeks prior to the first real test. Thus, the purpose of the practice quiz is not primarily formative assessment, but rather to ensure that any problems that might normally occur during a student's first attempt at a quiz – such as a forgotten or mis-typed password – are discovered prior to the first real quiz taking place.

4. Each test consists of a fixed set of questions (the same set for every student). That is, the questions are deliberately not selected from a large bank of questions at random, nor do different students get different questions. Exactly the opposite is the case: students are informed in advance that if they log on to the test on more than one occasion, they will be presented with the same set of questions. Thus, there is no advantage to be gained by trying to access the test from different locations or at different times in the hope of obtaining an 'easier' set of questions. This is very different from the norm, where questions are often selected from a large test bank, and where the primary function of the test is summative. Here, the emphasis is not on grading but on assisting learning.

5. The tests remain open for a period of 24 or 48 hours. During this time students may download or print the questions, may research the answers using any books, notes, or other resources at their disposal, and may otherwise carry on with their lives. The issue of work or personal commitments becomes, therefore, a non-issue in almost all cases. In those extremely rare cases where students can demonstrate genuine incapacity for 24 or 48 hours, proportionally extra weighting is applied to the remaining assessment items. The alternative, of allowing tests to be re-opened, is not accepted practice, since this would disadvantage other students in their ability to receive timely feedback.

6. Multiple submissions may be made. All submissions are recorded, but only the final submission for each student is counted toward the final grade. Earlier submissions are logged and scored for statistical reasons, and as a check in the event of any technical problems. The ability to make multiple submissions facilitates the process whereby students might make an initial attempt at all of the questions, realise the areas where they might have difficulties, think about the questions again, and, having reflected, submit again.

7. Correct solutions and marks are not provided until after the close of the test. That is, feedback is not instantaneous. Results are typically released one to two days after the close of the test. Following the release of results, students are free to discuss the questions and answers face-to-face or via the class email list. Since every student receives the same questions, it is clearly not feasible to provide correct answers – or even scores – immediately upon submission. This slight delay of a day or two prior to the release of results enables checking of submissions and answers, and gives time for any technical problems to be dealt with.

8. Care is taken with the composition of the questions, to ensure that the majority are of a form where the answers cannot simply be looked up in some text. Rather, selection of the correct answer requires conceptual understanding, or the application of an algorithm, or some other working, or a combination of these. As has been pointed out elsewhere (eg Woodford and Bancroft 2004), construction of appropriate questions, with appropriate distractors, does indeed require some skill, but such skill should be within the bounds of most professional educators. The issue of the time taken to prepare suitable questions has been found in practice to be a non-issue, since the time saved on marking and handling appeals (when compared with other forms of assessment) can be an order of magnitude greater.

9. The test remains open for a short period (perhaps 15 minutes) after the official closing time. This period of grace prevents any problems for students who misjudge the time, or whose local clock may happen to be a little slow, or who experience technical problems in the final minutes. Despite this period of grace, students are encouraged not to leave their submissions until the last moment.

10. Marks are deducted for incorrect answers. Students are penalised one-third of a mark for each incorrect answer, while an answer left blank scores zero. Students scoring less than zero in the test overall have their mark reassigned to zero. Students are encouraged to guess at answers if they can narrow down the possibilities. The rationale for this is given in detail in section 2.3, and a short sketch of the calculation appears immediately after this list.

11. Feedback is invited from students after the test. By encouraging online feedback, any queries can be answered, and difficulties resolved. If a question is judged ambiguous (or just plain wrong), appropriate steps can be taken to rectify any aberrant marks. And, most important, the questions and answers can be discussed, thus providing for further learning.

12. Questions and answers are reviewed after the test. In addition to student feedback, time is taken during the period after the test to go over any or all of the questions or topic areas that seemed to have caused the greatest difficulty. Of course, this benefits not only the students, but also the instructors, in understanding which aspects of the course may require a different, or perhaps greater, emphasis.
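To make the marking scheme concrete, the following is a minimal sketch of how a single test might be scored under the rules in points 6 and 10: a penalty of one-third of a mark per incorrect answer, zero for a blank, a floor of zero for the test overall, and only the final submission counted. It is an illustration only, not the course's actual software; the function names and the small answer key are invented for this example.

```python
# Illustrative sketch only: negative marking (-1/3 per wrong answer), zero for
# blanks, floor of zero overall, and only the final submission graded.

CORRECT = {1: "b", 2: "b", 3: "a"}    # question number -> correct option (example key)
PENALTY = 1.0 / 3                     # one-third of a mark per incorrect answer

def score_submission(answers):
    """answers maps question number -> chosen option, or None if left blank."""
    score = 0.0
    for q, correct_option in CORRECT.items():
        chosen = answers.get(q)
        if chosen is None:
            continue                              # blank: scores zero
        score += 1.0 if chosen == correct_option else -PENALTY
    return max(score, 0.0)                        # negative totals are reassigned to zero

def final_score(submissions):
    """Earlier submissions are logged for statistics; only the last one is graded."""
    return score_submission(submissions[-1]) if submissions else 0.0

# Example: a first attempt followed by a revised final attempt.
attempts = [{1: "a", 2: "b", 3: None},            # first attempt scores 1 - 1/3
            {1: "b", 2: "b", 3: "a"}]             # final attempt scores 3.0
print(final_score(attempts))                      # -> 3.0
```

Under this scheme a student who answers every question purely at random can expect a score of zero, which is precisely the motivation set out in section 2.3.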
2.1 Example Questions

Here are three questions from a test used in 2004:

Q1. A particular system runs almost exclusively very CPU-bound processes. At any particular moment, therefore, most of these processes are likely to be
a. running
b. ready
c. blocked
d. signalling

Q2. In a certain system, process A has arrived at time 0, process B at time 1, and process C at time 2. A needs 5 seconds in the CPU, B 3 seconds, and C 1 second. All processes are totally CPU-bound and process-switching time is negligible, so that after 9 seconds all processes have completed. At what time does process B complete if the process-scheduling algorithm is preemptive priority scheduling, each process has a higher priority than the previous one, and the quantum is 1 second?
a. 3
b. 5
c. 7
d. 9

Q3. Suppose a particular system has 10 tape drives, and that process A currently holds 4, process B holds 3, and process C holds zero. Initially A made a claim (that is, stated its maximum need) for 7, B for 7, and C for 5. If process B now requests all of the remaining drives it needs, how many could the system allocate safely, if using the Banker's algorithm?
a. 0
b. 1
c. 2
d. 3

These three questions represent a typical sample, in the sense that all of the three questions test knowledge and understanding of the content area, while two of the three also require the application of simple algorithms. The distractors have been carefully chosen in each case so as to be at least plausible if the student has only a partial understanding of the problem.[2]

[2] For the interested reader, the correct answers are Q1 b, Q2 b, Q3 a.
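By way of illustration (this sketch is not part of the original course materials, and the variable names are invented), the answer given for Q3 can be checked with the standard safety test of the Banker's algorithm: a request is granted only if, afterwards, some ordering still allows every process to obtain its full claim and run to completion.

```python
# Illustrative sketch: checking Q3 with the Banker's algorithm safety test.
# System: 10 tape drives; A holds 4 (claim 7), B holds 3 (claim 7), C holds 0 (claim 5).

def is_safe(available, allocation, maximum):
    """True if some ordering lets every process obtain its full claim and finish."""
    need = {p: maximum[p] - allocation[p] for p in allocation}
    free, finished, progress = available, set(), True
    while progress:
        progress = False
        for p in allocation:
            if p not in finished and need[p] <= free:
                free += allocation[p]        # p runs to completion and releases its drives
                finished.add(p)
                progress = True
    return len(finished) == len(allocation)

allocation = {"A": 4, "B": 3, "C": 0}
maximum    = {"A": 7, "B": 7, "C": 5}
available  = 10 - sum(allocation.values())   # 3 drives currently free

# B needs 7 - 3 = 4 more drives; try granting it k of the available drives.
for k in range(available + 1):
    trial = dict(allocation, B=allocation["B"] + k)
    state = "safe" if is_safe(available - k, trial, maximum) else "unsafe"
    print(f"grant {k} drive(s) to B -> {state}")
# Only k = 0 leaves the system in a safe state, matching answer (a).
```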
2.2 Instructions

The instructions to students regarding the test are made available well in advance of the test itself, so that all students have the opportunity to clarify, before the test, any matter related to procedure or format that they are unsure of. The two priorities when framing the instructions are simplicity and clarity. A sample set of instructions as provided to students online would include:
• the time the test is to be available,
• that multiple submissions may be made, but in such cases, only the final submission will be counted,
• that extensions are not possible under any circumstances,
• the weeks during which the material that might be tested was covered,
• the number of questions in the test,
• that students MUST work on the test on an individual basis, and that where any communication or collaboration is found to have occurred, all students involved are liable for a score of zero,
• the procedure to be followed in the event of technical or other problems,
• the marking scheme to be employed, and
• a reminder to students in different timezones to make the necessary allowances.
2.3 Rationale

Some students question why marks are deducted for incorrect answers. A typical page describing to students the rationale for deducting marks in multiple-choice tests highlights the following points:
• For any quiz, test, or examination to be valid, it should return a score of 100% to someone who knows everything there is to know about a subject, and a score of 0% to someone who knows nothing at all.
• Blind guesswork is likely to result in some correct answers, even for the person who knows nothing at all about the subject.
• To correct this anomaly, negative marks need to be introduced for incorrect answers. Any multiple-choice test which does not incorporate negative marks as an integral component of the marking scheme will produce invalid results.
• For a test which has N possible responses to each question, a negative score of 1/(N-1) should be introduced for each incorrect answer. So if each question has four possible responses, the penalty for an incorrect answer should be minus one-third of a mark (for example: for a test with 20 questions, and 4 possible responses for each question, random guessing should produce 5 correct answers and 15 incorrect answers, for a score of 5 - 15*(1/3), which equals zero, as required).
• Where the possible correct response can be narrowed down, it is still very much worth guessing (for example, if the answers to 10 questions are known, and the other 10 can be narrowed down to one of only two possibilities, then leaving these doubtful 10 blank will result in a score of 10/20; whereas if guesses are made, with a one in two chance of being correct in each case, one can expect a score of 10 + 5 - 5*(1/3), or 13.33/20). These expected scores are worked through in the sketch below.
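The arithmetic behind the last two points can be reproduced directly. The following is a small illustrative calculation only (not material provided to students); the function name is invented for this example.

```python
# Illustrative sketch: expected scores under the 1/(N-1) negative-marking scheme
# for a 20-question test with 4 options per question (penalty of 1/3 per wrong answer).

QUESTIONS = 20
OPTIONS = 4
PENALTY = 1.0 / (OPTIONS - 1)          # minus one-third of a mark per wrong answer

def expected_score(known, guessed, p_correct):
    """Expected mark when `known` answers are certain and `guessed` answers are
    each correct with probability `p_correct`; any remaining questions are blank."""
    expected_right = guessed * p_correct
    expected_wrong = guessed * (1 - p_correct)
    return known + expected_right - PENALTY * expected_wrong

# Pure blind guessing on all 20 questions: expected score is zero, as required.
print(expected_score(known=0, guessed=QUESTIONS, p_correct=1 / OPTIONS))   # -> 0.0

# 10 answers known, the other 10 narrowed down to one of two possibilities.
print(expected_score(known=10, guessed=10, p_correct=0.5))                 # -> 13.33...
```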
2.4 Statistical Results
Statistics for the years 2000 through 2004, including numbers of students, minimum, maximum, and mean scores, average deviation from the mean, standard deviation, and correlation with the end-of-term examination, are given in Table 1.
Table 1: Student Statistics 2000-2004

                                              2000     2001     2002     2003     2004
No. of tests                                     1        1        1        2        2
Total number of questions                       30       20       20       20       20
Percentage of marks in course                  20%      20%      20%      20%      20%
Number of students [3]                         123      188      232      221      145
Maximum Score                                   20       20       20       20       20
Minimum Score                                    2        3        0        0        0
Mean                                         12.89    13.41    13.95    16.26    12.12
Average Deviation from Mean                   2.89     2.95     3.68     2.94     3.91
Standard Deviation                            3.76     3.76     4.65     3.91     4.71
Correlation Coefficient with final exam [4] +0.326   +0.439   +0.480   +0.461   +0.217

[3] This number refers to the number of students who submitted all items of assessment. The variation in the numbers of students from year to year is more attributable to factors outside of the course (such as the total number of students enrolled in the program) than to any aspect of the course itself.

[4] A figure of plus one would indicate perfect correlation, zero would indicate no correlation, and a negative figure an inverse correlation.
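For instructors wishing to produce comparable figures for their own classes, the quantities reported in Table 1 can be computed along the following lines. This is an illustrative sketch only: the score lists are invented placeholders rather than the actual class data, the population form of the standard deviation is used, and the Pearson coefficient is assumed for the correlation (the paper does not state which coefficient was computed).

```python
# Illustrative sketch: computing Table 1-style statistics from raw scores.
# The lists below are placeholder values, NOT the actual class data.
import math

def summary(test_scores, exam_scores):
    n = len(test_scores)
    mean = sum(test_scores) / n
    avg_dev = sum(abs(x - mean) for x in test_scores) / n           # average deviation from mean
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in test_scores) / n)   # population form
    # Pearson correlation between test scores and final-exam scores (assumed measure):
    exam_mean = sum(exam_scores) / n
    exam_sd = math.sqrt(sum((y - exam_mean) ** 2 for y in exam_scores) / n)
    cov = sum((x - mean) * (y - exam_mean) for x, y in zip(test_scores, exam_scores)) / n
    corr = cov / (std_dev * exam_sd)
    return {"min": min(test_scores), "max": max(test_scores), "mean": round(mean, 2),
            "avg_dev": round(avg_dev, 2), "std_dev": round(std_dev, 2), "corr": round(corr, 3)}

# Placeholder example only:
print(summary(test_scores=[12, 17, 9, 20, 14, 11],
              exam_scores=[55, 70, 40, 85, 60, 52]))
```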
As can be seen, the mean occurred in the range 12 to 14 out of 20, with the single exception of 2003, and marks occurred over almost the entire available range. Student results for each year are shown in Table 2.
Table 2: Student Results in the Multiple-Choice Tests 2000-2004

Grade              Marks        2000         2001         2002         2003         2004
                   (of 20)    #     %      #     %      #     %      #     %      #     %
High Distinction   >=17      23   18.7    24   12.8    84   36.2   140   63.3    30   20.7
Distinction        >=15      18   14.6    66   35.1    26   11.2    25   11.3    18   12.4
Credit             >=13      25   20.3    35   18.6    49   21.1    24   10.8    20   13.8
Pass               >=10      38   30.9    35   18.6    33   14.2    17    7.6    29   20.0
Fail