DEVELOPMENT AND EVALUATION OF COMPUTER ASSISTED EXAM – CIRCUIT THEORY EXAMPLE

Jerzy Rutkowski, Katarzyna Moscinska, Lukasz Chruszczyk
Institute of Electronics
Faculty of Automatic Control, Electronics and Computer Science
Silesian University of Technology
Akademicka 16, 44-100 Gliwice, Poland
[email protected], [email protected], [email protected]

ABSTRACT
The paper presents the redevelopment of assessments and examinations from the traditional model into the computer-assisted multi-choice model. The guidelines for computer-assisted multi-choice test development and evaluation [1] are discussed and verified in practice on a Circuit Theory exam. The information channel concept [2] has been applied in the elaboration of the new model, with maximization of the relative mutual information serving as the criterion for tuning the exam parameters. A genetic algorithm is utilized for this optimization. The numerous advantages of computer-assisted tests have been confirmed and significant time savings have been obtained. The benefits of computer-assisted multi-choice summative assessment have been appreciated not only by the tutors, but by the students as well.
KEY WORDS
e-learning, computer assisted assessment, multi-choice test

1. Introduction
Nowadays, Information and Communication Technologies (ICT) are making great strides in computer-assisted education. They break down barriers to access, decrease costs and, first of all, enhance the quality of learning. Interaction is the main factor determining this quality, and interaction is strongly shaped by the assessment system in use. A good system should allow for practice and for both formative and summative assessment. Moreover, Computer Assisted Assessment (CAA) decreases the costs of learning, as it enables significant time savings when applied to large classes. With a class size of 100 or more, the marking of coursework, tests or exams, together with the administrative tasks related to course delivery, amounts to hundreds of hours [3]. CAA in higher education is a process by which the student can be measured, both during the course (formative assessment) and at the final exam (summative assessment), and this process undoubtedly plays a very important part in any education system. The following problem then arises: "How to assess large groups with a minimum amount of resources while preserving quality?" [4]. Extensive research has already been done in this field and effective CAA systems have been reported [5], [3]. However, further studies are necessary to make the assessment system fully automated and reliable, such that traditional assessment methods can be replaced, in fact practically eliminated. In Section 2, two different approaches to test organization are briefly presented, and multi-choice test development is then discussed in detail. In Section 3, a new, information channel-based approach to test development and evaluation is presented in general terms. This approach has been verified in a practical experiment, the Circuit Theory final exam. The design of the experiment and the obtained results are discussed in Section 4.

2. Classification of tests
There are many types of tests that can be used at both stages of a course, the formative and the summative stage. In general, they can be divided into two categories:
• open-ended tests,
• close-ended tests.
Open-ended and close-ended tests differ in many characteristics, especially with regard to the role of the student when solving the test tasks [6]. A close-ended test limits the student to the set of offered alternatives; the term multi-choice test is then commonly used. An open-ended test has many advantages: it allows students to express their way of reasoning and their own presentation of the obtained solution, and it allows the examiner to check the student's overall knowledge and creativity, not only the ability to solve typical tasks. Advantages of open-ended tests also include the possibility of discovering responses that individuals give spontaneously, thus avoiding the bias that may result from suggesting responses [6]. However, traditional open-ended tests also have significant disadvantages in comparison with close-ended ones, such as:
• lower reliability, as the number of tasks/questions is limited by the marking time effort to practically a few (at most ten) tasks, so they cannot cover all the course topics,
• subjectivity of marking.
For a large group of students, the examiner's person-hours required for such a test become enormously large. It has been suggested [3] that for class sizes in excess of 66, a close-ended test is more time-effective. However, in the Authors' experience, even for the smallest class size of 25, which is acceptable under the University regulations, a computer-automated multiple-choice test saves total time spent on test preparation, marking and presentation of scores. Of course, for both methods of assessment, development of a reliable and complete task/question bank covering all the course topics is crucial. Bank creation is a lengthy and highly sophisticated process [7]. A computer-automated open-ended test requires extensive coding and suffers much larger item nonresponse, which makes such an assessment method impractical. Computer-automated multi-choice tests appear much more suitable, especially for engineering courses. Numerous advantages of computer-assisted tests can be listed [5], [4], in no particular order:
• High quality and widespread question design, comprehensive coverage of the syllabus.
• A formative assessment system can be easily developed. The students can use this type of assessment as a learning resource for as long as they feel it appropriate, and they appreciate the opportunity to receive feedback without losing marks.
• Adaptive assessment can be used to match the test to the students' ability. In that way, interaction and, in consequence, the quality of learning are improved significantly.
• Questions/tasks are stored in a database. The significant time effort of task bank development can therefore be recouped within a short period, practically one or two years. The database is continuously updated, modified and supplemented.
• Cheating is significantly reduced or practically impossible through randomization of tasks. In effect, no two students receive the same test.
• Marking is not affected by human error. The computer carries out the marking; thus marking is done immediately after completing the test and is not subject to claims (see the previous advantage).
• Teaching quality can be better monitored by looking at the performance of questions/tasks. Results are stored in computer memory and their statistical analysis can be done automatically. Reporting software supports both test marking and evaluation.
• No extra software development is required, as the assessment system can be integrated with the existing e-learning platform.
• No printing costs.
• Graphics and multimedia can be included in a test.
• The students' work is stored on the platform and is continuously accessible, so that each individual student can always get a profile of his/her work.

The weakness of computer-automated summative assessment (the final exam) remains the authentication of the student. In traditional exams, the requirement to produce an ID is common, and nowadays this is the only reliable way of authentication in computer-assisted exams as well. This introduces limitations on such exams: same-time and same-place (computer-room) unity is necessary. Advances in the field of biometrics offer hope that in the future authentication can be carried out remotely [5], so that the final exam can be carried out both asynchronously and remotely. It should be emphasized that authentication concerns only exams, not formative assessments. A multi-choice computer-room exam is the subject of the Authors' investigation.

A multi-choice task consists of the task body and a few answers. The exam is generated from a task database. This database normally consists of hundreds of tasks; each individual task can be multiplied by changing the data or the order of answers. A right answer increases the student's point total, a wrong one decreases it, and a missing answer does not change the total [7]. The following guidelines for multi-choice test development are recommended [1]; a sketch of the resulting scoring rule follows the list.
• At least 20 tasks. Not more than 50 tasks in a single session, up to 150 tasks per test, i.e. up to 3 sessions.
• A right answer can be weighted or not. In the latter case, weight 1 is assigned to every right answer.
• Multiple versions of the test can be used, with the same tasks but in a different order and with a different order of choices. The Authors' experience is that for a group of 30 students in one computer room, 4 versions are enough to prevent cheating: two different orders of tasks combined with two different orders of choices.
• Between 2 and 5 choices per task. The optimum suggested by the Authors is 4.
• Penalty points are subtracted from positive points to give the total score. The weight of a penalty point, W, is determined by the weight of the positive point of the task. A fraction of this weight is recommended, e.g. 0.25 or 0.5.
• A Pass/Fail threshold T has to be established. The Authors suggest setting this threshold in the neighborhood of 0.6 of the average score of the top students.
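The guidelines above fix the scoring rule completely once W and T are chosen. The following minimal Python sketch (ours; the names and data layout are illustrative, not part of any particular CAA platform) applies it, assuming unweighted right answers and blank answers encoded as None; the default W and T anticipate the optimum values found in Section 4.

def grade(answers, key, W=0.25, T=7.5):
    """Score one multi-choice test: +1 for a right answer, -W for a wrong
    one, 0 for a blank; Pass if the total reaches the threshold T."""
    score = 0.0
    for given, correct in zip(answers, key):
        if given is None:                 # unanswered task: total unchanged
            continue
        score += 1.0 if given == correct else -W
    return score, "Pass" if score >= T else "Fail"

# Example: 20 tasks, 12 right, 5 wrong, 3 blank -> 12 - 5 * 0.25 = 10.75
key = list("ABCDABCDABCDABCDABCD")
answers = key[:12] + ["A" if k != "A" else "B" for k in key[12:17]] + [None] * 3
print(grade(answers, key))                # (10.75, 'Pass')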
3. Information channel based measurement description
As pointed out earlier, an assessment, a formative test or an exam, can be considered as a measurement of digital data. Such a measurement can be described in terms of Information Theory, developed by C. E. Shannon [2] in the late 1940s. Fig. 1 presents the source-channel-receiver information system, where

X = \{x_1, \ldots, x_M\}    (1)

is the digital source of information, the measured data, characterized by the probability assignment

P_X = \{p(x_1), \ldots, p(x_M)\},    (1a)

Y = \{y_1, \ldots, y_N\}    (2)

is the channel output, the measurement results, characterized by the probability assignment

P_Y = \{p(y_1), \ldots, p(y_N)\},    (2a)

and the channel itself is characterized by M \cdot N transition probabilities that relate the input and output sources of information:

p(y_j / x_i); \quad i = 1, \ldots, M; \; j = 1, \ldots, N.    (3)

[Figure 1 Information system: source-channel-receiver]

Exemplary channels, a binary channel and a three-input/output channel, are presented in Fig. 2.

[Figure 2 Exemplary channels: a binary channel (inputs x_1, x_2, outputs y_1, y_2) and a three-input/output channel (inputs x_1, x_2, x_3, outputs y_1, y_2, y_3), with branches labelled by the transition probabilities p(y_j / x_i)]

Then, for the given probabilistic model of the source and the channel, the information loss H(X/Y), the misinformation H(Y/X) and the mutual information I(X/Y) can be defined. Their graphical interpretation is presented in Fig. 3. The mutual information between events (sources) X and Y is the information provided about the event X by the occurrence of the event Y, or vice versa. It may thus be interpreted as the information provided about the measured data by the occurrence of the measurements.

[Figure 3 Graphical interpretation: the input information H(X) and the output information H(Y) as two overlapping hatched areas, with the mutual information I(X/Y) as the doubly hatched overlap]

I(X/Y) = H(X) - H(X/Y) = H(Y) - H(Y/X)    (4)
where

H(X) = -\sum_{i=1}^{M} p(x_i) \log_2 p(x_i),    (5a)

H(Y) = -\sum_{j=1}^{N} p(y_j) \log_2 p(y_j),    (5b)
H(Y/X) = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 p(y_j / x_i).    (6)

In the case of an examination, the exact probabilistic model of the input source is not known. To roughly evaluate the effectiveness of a multi-choice exam and to optimize the values of its parameters:
• the penalty point weight W,    (7a)
• the Pass/Fail threshold T,    (7b)
a group of L students is selected and examined twice, on the same day: first, using the open-ended, traditional and time-consuming exam, and next, by means of the computer-assisted multi-choice exam (e-exam). The traditional exam is normally based on many years of experience, and it can therefore be assumed as the reference one. Then:
• the probabilistic model of the input source X is designated by the scores of the traditional exam,
• the probabilistic model of the output source Y is designated by the scores of the e-exam; their distribution depends on the parameter settings,
• from a comparison of the scores of the two exams, the channel transition probabilities are designated.

Next, for the set parameters W and T, the mutual information (4) is calculated. To better evaluate the exams, this information is related to the input information, i.e. to the maximum mutual information that could be obtained if the channel were ideal, p(y_i / x_i) = 1; i = 1, \ldots, M = N:

I_r(X/Y) = I(X/Y) / H(X)    (8)

Maximization of this relative mutual information defines the criterion for the optimization of the exam parameters. Genetic algorithm based optimization can be utilized, as described in the next Section.

It must be clearly stated that the proposed experiment compares two methods of measurement (examination), neither of which is ideal. The traditional exam has been assumed as the reference one; however, the e-exam could be assumed as the reference just as well. The obtained mutual information is the same, I(X/Y) = I(Y/X).
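These channel quantities are easy to compute numerically. The sketch below re-expresses the computation of equations (4)-(6) and (8) in Python, for illustration only; the function name and data layout are our assumptions. It derives H(X), H(Y), H(Y/X), I(X/Y) and I_r(X/Y) from a table of joint counts, i.e. the number of students per input/output class.

import math

def channel_measures(counts):
    """counts[i][j] = number of students in input class i (traditional exam)
    and output class j (e-exam). Returns (H_X, H_Y, H_YX, I, I_r),
    following equations (4)-(6) and (8)."""
    total = sum(sum(row) for row in counts)
    p_x = [sum(row) / total for row in counts]                  # input distribution
    p_y = [sum(row[j] for row in counts) / total
           for j in range(len(counts[0]))]                      # output distribution
    H = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)  # eqs (5a), (5b)
    H_X, H_Y = H(p_x), H(p_y)
    # eq (6): H(Y/X) = -sum_ij p(x_i, y_j) * log2 p(y_j / x_i)
    H_YX = -sum((c / total) * math.log2(c / sum(row))
                for row in counts for c in row if c > 0)
    I = H_Y - H_YX                                              # eq (4)
    return H_X, H_Y, H_YX, I, I / H_X                           # I_r per eq (8)

# Sanity check: an ideal channel, p(y_i/x_i) = 1, gives I_r = 1 bit/bit.
print(channel_measures([[50, 0], [0, 50]]))   # (1.0, 1.0, 0.0, 1.0, 1.0)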
Before modeling the information system, a marking scale has to be assumed. In Polish universities a six-grade scale is used: 5 – Excellent, 4.5 – Very good, 4 – Good, 3.5 – Satisfactory, 3 – Sufficient, 2 – Fail. An individual exam can be considered as a measurement whose aim is making a diagnosis. In every diagnostic system the Pass–Fail distinction is the fundamental one; classification of the Pass components is then the next step. A binary system should therefore be considered first: x_1 and y_1 encode Pass (P in short), the positive result, and x_2 and y_2 encode Fail (F in short), the negative result of the exam.

4. Practical experiment – Circuit Theory exam
The presented strategy of e-exam development and evaluation has been verified in practice; the Circuit Theory (CT) exam has been selected. Each year, the Authors and the other four members of the CT team have to examine around 500 students, Electronics & Telecommunication students plus Computer Science students, and this consumes hundreds of person-hours. The team has many years of experience in traditional examining; however, taking into account the time savings and the improvement in teaching quality, redevelopment of the exam into the computer-automated form seems absolutely inevitable. The experiment described in the previous Section has been conducted.

The group of L = 100 student volunteers has been selected randomly and the two exams have been carried out. The traditional exam gave the following distribution of scores:

p(x_1) = p(P) = 0.53, \quad p(x_2) = p(F) = 0.47.    (9)

The e-exam has been carried out in computer rooms, integrated with the existing information system, the Moodle LMS platform. Twenty tasks have been selected from the task bank. These tasks are short problems, typically numerical, stated by text and possibly a drawing: a circuit diagram, a time response, an I–U relationship or another plot. Each task has four choices, described by number(s), a formula or a drawing. After calculating the answer, the student checks one box. Problems are numbered 1, ..., 20, and the boxes are labelled with the letters A, B, C, D for answer-recording purposes. Two problems are presented in Fig. 4:

Problem 4. Find the I–U relationship; R = 100 Ω, E = 10 V, ideal diode. (The four choices A–D are I–U plots.)

Problem 12. The ideal voltage source E has been switched onto the RC series circuit. What is the total energy delivered as t → ∞?
A: W = E^2 C / 2    B: W = E^2 C    C: W = (E^2 / R) \cdot 5RC    D: W = ∞

[Figure 4 Two e-exam problems]

If a student is not capable of solving a problem, he/she leaves all four boxes unchecked. The students have been informed that blind guessing may result in penalty points. Four versions of the test have been randomly distributed to the students, with two different orderings of problems and two different orderings of choices. According to the e-exam guidelines presented in Section 2, the scoring of such an exam is defined by the two parameters W and T (7). It has been assumed that each right answer increases the point total (score) S by one point. Two penalty point weights are considered, W = 0.25 and W = 0.5. Following the guidelines, it has been assumed that the Pass/Fail threshold should be located in the neighborhood of 0.6 of the best scores:

T_a = 0.6 \, S_{top}    (10)

where S_{top} is the average score of the top five students (5% of L).
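As a purely hypothetical numeric illustration of (10), since S_{top} is not reported in the paper: if the top five students averaged S_{top} = 12.5 points out of 20, the threshold would be

T_a = 0.6 \cdot 12.5 = 7.5 \; \text{points},

which happens to equal the value of T returned by the optimization below.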
These two parameters, W and T, have been optimized with maximization of the relative information (8) as the criterion (fitness function). A Genetic Algorithm (GA) [8], run with typical genetic operation parameters, has been utilized. The chromosome consists of five bits,

C = [c_1, c_2, c_3, c_4, c_5]    (11)

The first two bits encode the penalty point weight:

c_1 c_2 = 00 \;\hat{=}\; W = 0.25, \quad c_1 c_2 = 01 \;\hat{=}\; W = 0.5,
c_1 c_2 = 11 \;\hat{=}\; W = 0.75, \quad c_1 c_2 = 10 \;\hat{=}\; W = 1.0    (12)

The next three bits encode the Pass/Fail threshold, as presented in Table 1. Gray code is used.

Table 1. T encoding
c_3 c_4 c_5 | T
000 | T_a - 3W
001 | T_a - 2W
011 | T_a - W
010 | T_a
110 | T_a + W
111 | T_a + 2W
101 | T_a + 3W
100 | T_a + 4W

The GA has returned the following optimum values of the parameters:

W = 0.25, \quad T = 7.5.    (13)
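A minimal sketch, under our naming assumptions, of the chromosome decoding implied by equations (11), (12) and Table 1 follows. A full GA fitness evaluation would re-score the recorded answer sheets with the decoded (W, T), rebuild the count table, and return I_r(X/Y) via the channel_measures sketch from Section 3; that step is omitted here since it depends on the exam data.

# Decoding of the five-bit chromosome C = [c1, ..., c5], eq. (12) and Table 1.
# Both fields use Gray code, so chromosomes differing in one bit decode to
# neighbouring (W, T) values.
W_CODES = {(0, 0): 0.25, (0, 1): 0.5, (1, 1): 0.75, (1, 0): 1.0}   # eq. (12)
T_GRAY = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
          (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0)]              # Table 1 rows

def decode(chromosome, Ta):
    """Map a chromosome to the exam parameters (W, T); Ta per eq. (10)."""
    c1, c2, c3, c4, c5 = chromosome
    W = W_CODES[(c1, c2)]
    k = T_GRAY.index((c3, c4, c5))     # row 0..7 -> T = Ta - 3W ... Ta + 4W
    return W, Ta + (k - 3) * W

# With Ta = 7.5, the chromosome [0, 0, 0, 1, 0] decodes to the optimum (13):
print(decode([0, 0, 0, 1, 0], Ta=7.5))   # (0.25, 7.5)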
For these optimum values (13), the channel is characterized by the transition probabilities presented in Table 2.

Table 2. Binary channel transition probabilities
p(y_j / x_i) | y_1 = P | y_2 = F
x_1 = P | 32/53 | 21/53
x_2 = F | 6/47 | 41/47

The entropies, calculated from equations (5) and (6), and the mutual information, calculated from equations (4) and (8), are:

H(X) = 0.99 bit, H(Y) = 0.96 bit, H(Y/X) = 0.77 bit,
I(X/Y) = 0.19 bit, I_r(X/Y) = 0.19 bit/bit

Next, the optimization has been repeated for the three-input/output model of the system. The Pass subgroup has been divided into two smaller subgroups:
• High scores, marks 5.0, 4.5 and 4.0,
• Average scores, marks 3.5 and 3.0.
The same values of the optimized parameters have been obtained. For these values, the channel is characterized by the transition probabilities presented in Table 3.

Table 3. Three-input/output channel probabilities
p(y_j / x_i) | y_1 = H | y_2 = A | y_3 = F
x_1 = H | 8/18 | 6/18 | 4/18
x_2 = A | 2/35 | 16/35 | 17/35
x_3 = F | 1/47 | 5/47 | 41/47

The entropies, calculated from equations (5) and (6), and the mutual information, calculated from equations (4) and (8), are:

H(X) = 1.48 bit, H(Y) = 1.29 bit, H(Y/X) = 1.02 bit,
I(X/Y) = 0.27 bit, I_r(X/Y) = 0.19 bit/bit
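As a cross-check (our computation, not the Authors'), feeding the counts behind Tables 2 and 3 into the channel_measures sketch from Section 3 reproduces the reported values; the small last-digit deviations come from rounding.

# Joint counts recovered from Tables 2 and 3 (L = 100 students).
print(channel_measures([[32, 21], [6, 41]]))
# ~ (1.00, 0.96, 0.77, 0.19, 0.19)  - the binary-model values above
print(channel_measures([[8, 6, 4], [2, 16, 17], [1, 5, 41]]))
# ~ (1.49, 1.29, 1.01, 0.27, 0.18)  - the three-input/output values above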
The following conclusions can be drawn from this experiment.

1. The multi-choice exam scores are worse than the traditional exam scores. This is an expected phenomenon, resulting first of all from the different marking methods. In the traditional exam, wrong answers are not penalized and fall into two categories: 1) completely wrong or no answer, 2) partially wrong. The latter also increase the point total, by a fraction of a point, usually 0.5. In the e-exam there is no such distinction: a wrong answer always decreases the point total.

2. A significant number of students (21 out of 53) passed the traditional exam but failed the multi-choice one. Most of them are students located in the Average subgroup (17 out of 21). This phenomenon has already been explained in Conclusion 1. Moreover, it can be hypothesized that some students cheated during the traditional exam. At that exam, all students solve the same problems simultaneously, and for that reason such an exam is much less protected from cheating than the multi-choice one. The four versions of the multi-choice test, distributed randomly, practically eliminated cheating.

3. Nearly all students who failed the traditional exam also failed the multi-choice one (41 out of 47). This proves the effectiveness of this exam. When evaluating the effectiveness of a failure detection method, not passing faulty components is the most important task; misclassification of healthy components is not as dangerous as misclassification of faulty ones.

4. Both models of the examination system, the binary model and the three-value model, gave the same results: the same values of the e-exam parameters and of the relative information. However, the three-value classification of scores gave some additional and valuable information regarding the scoring of the High and Average subgroups. Only one student (out of 47) who failed the traditional exam reached the e-exam High class, and only two students (out of 35) classified Average by the traditional exam reached the e-exam High class.

5. The statistical analysis of the e-exam results has brought very valuable information, which will be utilized in the teaching process. It revealed weak points in the students' overall CT knowledge and pointed out areas of the course content that require improvement of knowledge delivery. All the tasks have been divided into three categories, Good, Average and Bad, taking into account the following criteria: easiness, percent choosing and discrimination [1]. A Good task is one that is answered well by the Pass subgroup students but not well by the Fail subgroup students. Bad tasks are those that are low in easiness and/or whose choices distracted even the High subgroup students. Such tasks can be removed; task weighting and scores can then be re-calculated. However, such tasks are very helpful in improving the quality of knowledge delivery. Four tasks of the carried-out e-exam turned out to be Bad; two of them are presented in Fig. 4. Their omission did not change the scores, only moved the thresholds.

The students' opinion has also been surveyed. The following question has been asked: "Did you find the multi-choice exam more effective than the traditional one?" with three possible choices: 1. YES, 2. NO, 3. DIFFICULT TO SAY. Seventy students answered YES, twenty DIFFICULT TO SAY, and only ten NO answers were recorded. This surprisingly positive opinion, given that the e-exam results were worse than the traditional exam results, additionally motivates the Team in their efforts to implement ICT in the assessment of students' knowledge.
5. Conclusion
In the era of the ICT revolution, redevelopment of assessments and examinations from the traditional model into the computer-assisted multi-choice model seems inevitable. A strategy for comparing these two models, together with guidelines for computer-assisted multi-choice test development and evaluation, has been presented and verified in practice on a Circuit Theory exam. All the advantages of computer-assisted tests presented in Section 2 have been confirmed. It is the Authors' firm belief that computer-assisted multi-choice summative assessment (the exam), preceded by extensive formative assessments, may significantly
• improve teacher-student interaction and the quality of knowledge delivery,
• save time costs.
The first belief has to be confirmed by at least two years of experience; to do this, the Team plans to completely replace the traditional model of the final exam with the computer-assisted multi-choice one, starting from the next academic year. The savings in time are self-evident, and they are immediate, as the extensive task bank already exists.

References
[1] J. Ogier, Multichoice Test Marking and Analysis, University of Canterbury – Student Survey and Testing Unit, 2005, online: http://www.stu.canterbury.ac.nz/multichoicetests.shtml
[2] C. E. Shannon, The Mathematical Theory of Communication (Urbana, IL: University of Illinois Press, 1949).
[3] S. Hussmann, Ch. Smaill, The Use of Web-based Learning and Communication Tools in Electrical Engineering, Australian Journal of Engineering Education, online publication, 01, 2003, 1-14, http://www.aaee.com.au/journal/2003/hussmann03.pdf
[4] S. Starkings, How to Assess Large Groups with the Minimal Amount of Resources but Preserving Quality, Proc. International Statistical Institute – ISI 52nd Session, Helsinki, Finland, 1999, http://www.stat.fi/isi99/proceedings.html
[5] D. Whittington, J. Bull, M. Danson, Web-based Assessment: Two UK Initiatives, Proc. 6th Australian World Wide Web Conference, Cairns, Australia, 2000, http://ausweb.scu.edu.au/aw2k/papers/whittington/paper.html
[6] U. Reja, K. Lozar Manfreda, V. Hlebec, V. Vehovar, Open-ended vs. Close-ended Questions in Web Questionnaires, Developments in Applied Statistics, Metodoloski zvezki, 19, Ljubljana: FDV, 2003, 159-177, http://mrvar.fdv.uni-lj.si/pub/mz/mz19/reja.pdf
[7] P. Saloun, D. Salounova, L. Cudek, Electronic Distributed Testing, Proc. International Conference on Engineering Education – ICEE, Ostrava, Czech Republic, CD-ROM, 1999, http://www.ineer.org/Events/ICEE1999/Proceedings/papers/323/323.htm
[8] M. Mitchell, An Introduction to Genetic Algorithms (Complex Adaptive Systems) (Cambridge, MA: MIT Press, 1998).