Imperfect answers in multiple choice questionnaires

Javier Diaz¹, Maria Rifqi¹, Bernadette Bouchon-Meunier¹, Guy Denhière², and Sandra Jhean-Larose²

¹ Université Pierre et Marie Curie - Paris 6, CNRS UMR 7606, DAPA, LIP6, 104 Av. du Pdt. Kennedy, Paris, F-75016, France
² Équipe CHArt: Cognition Humaine et Artificielle, EPHE-CNRS, 41 rue Gay-Lussac, Paris, 75015, France
Abstract. Multiple choice questions (MCQs) are the most common and computationally tractable way of assessing the knowledge of a student, but they constrain students to express a precise answer that does not necessarily represent what they know, leaving no room for ambiguities or doubts. We propose Ev-MCQs (Evidential MCQs), an application of belief function theory to the management of the uncertainty and imprecision of MCQ answers. Intelligent Tutoring Systems (ITS) and e-learning applications could exploit the richness of the information gathered through the acquisition of imperfect answers with Ev-MCQs in order to obtain a richer student model, closer to the real state of the student, considering her degrees of knowledge acquisition and misconception.
1 Introduction
Valid and continuous assessment is necessary for effective instruction to improve student learning [6]. To obtain a measure of a student's knowledge acquisition on a particular subject (or to collect the opinion of an individual in a more general survey), Multiple Choice Questions (MCQs) are arguably the most common method applied in Computer Aided Assessment (CAA) [10], thanks to their intuitive interaction and computability. MCQs can be used as a way of evaluating the results of the learning process (grading), or as a formative assessment that identifies working areas to be treated next or to be considered as acquired. In an MCQ, the student must answer with a precise indication of the correct choices even if, as usually occurs, she³ is not entirely convinced by her own choice. It is normal for a student, when taking an MCQ test, to find herself in a situation where a question appears ambiguous given the options presented. She may be able to recognize some of the options as incorrect, but not be able to establish the correctness of all of them. When in doubt, a student must make a blind choice among the answers that are not completely wrong. Such situations cannot be handled adequately by classical MCQs: they lack a way for students to express ignorance, imprecision and uncertainty. By imposing precise answers, classical MCQs distort the student's output.
³ We will refer to the student/learner as a woman.
Noise resulting from the adaptation of imprecise and uncertain answers is thus gathered and fed into the diagnosis of the student knowledge model; invaluable information about the current state of knowledge of the student is lost and replaced by a crisp rendition of the student's response. An ITS has to help the learner achieve the course's objectives in her quest for knowledge acquisition by first diagnosing her current state of knowledge, which is then used as a basis for proposing activities. An ITS could exploit richer answer information to represent the state of knowledge acquisition: the concepts that have not been fully acquired by the students, the concepts that the students themselves are certain they possess, and the concepts that the students have wrongfully considered as acquired (misconceptions) could all be identified. Knowledge is a messy concept, and there are several types of imperfections [1] that must be dealt with if we want to assess its acquisition. In this article, we present Evidential MCQs (Ev-MCQs) [3], a generalisation of MCQs through the application of belief function theory [11] that allows evaluated learners to express a more flexible answer, closer to their real state of knowledge. We validate their use in the context of knowledge evaluation on a group of junior school students, and we measure the amount of noise we were able to prevent thanks to the expression of imperfect answers.
2 Background and previous work
Several aspects of the use of MCQs in assessment have been criticized: they allow guessing, adding random noise to assessment data; they rely on the recognition of the correct answers; they only probe a superficial level of knowledge; and they allow neither the assessment of partial knowledge nor the detection of dangerous misconceptions. A number of techniques have been developed in order to address these problems. Classical Test Theory, Computerized Adaptive Testing (CAT) and Item Response Theory (IRT) are among the techniques proposed to evaluate students with different levels of knowledge [10]. The particular features of each item (question) presented to the students (discrimination power, difficulty, guessing factor) are used to estimate the level of the trait being evaluated. Based on previous answers, the method selects the most suitable item to present next, getting closer to the real level of the student. From an educational viewpoint, it is important to identify the degree to which a student is sure or confident about the correctness of her responses [7]. If the recognition of ignorance of the correct answer is important, the detection of misinformation is essential. Some authors working in CAA [15] propose extending classical MCQs by allowing the student to specify a degree of global confidence in her answer (by choosing a label) or to suggest an alternative choice that could be the correct answer to the question. MCQs with confidence measurement [2, 5, 7, 8] allow students to express their trust in their answers (selecting an appropriate confidence level from a list), motivated by a scoring scheme that rewards honesty and penalizes guessing. The selection of the confidence level has an underlying probabilistic
interpretation that assumes a student wants to maximize the probability of obtaining more marks. These MCQs minimize guessing, encourage reflection and self-awareness, and improve the validity and reliability of the evaluations [5]. They have been validated by comparing them to more traditional assessment types [4] and have been successfully applied in medicine [8] and computer networks [2]. Moreover, they have been found to be less biased than regular MCQs [7]. These approaches provide some flexibility, but they are still too restrictive to represent imperfect and partial knowledge. Even if some of their marking schemes are inspired by probabilities and Shannon's theory of information [5], their aim is mainly the establishment of a scoring scheme, not the formative assessment of knowledge levels and misconceptions. Furthermore, no treatment or interpretation of the imperfections of the answers is proposed. They recognize severe misconceptions (high levels of confidence in wrong answers), but a finer granularity in the expression of confidence would allow the expression of doubts among the choices, the differentiation of uncertainty, indecision and ignorance, and the diagnosis of different types of misconceptions within the same answer. Even if confidence in the global answer to a question can be expressed, local imprecision and uncertainties about the particular choices remain hidden.
3 Getting the most out of an MCQ
Uncertainty and imprecision are two inherent aspects of human knowledge and they should be considered in an automated learning process. Following a formative assessment approach, they should appear at different stages of learner diagnosis. Formative MCQs must help in the acquisition of this imperfect information by allowing flexible assessment (first stage) and providing approximate reasoning inferences (second stage). In this section, we present some ways to maximize the expressivity of students answering an MCQ: choice indexing, choice preference quantification, general confidence assessment and distractor identification.
3.1 Choice indexing
One of the most common ways of modeling student knowledge acquisition is with an overlay model [16], based on a model of the domain knowledge presented as an ontology (a network of related concepts). An instance of the domain ontology is considered for every student; every concept is qualified according to the learner's knowledge level. By indexing the concepts treated in the learning resources, an ITS can infer the current state of knowledge of the students, either by implicitly analyzing the resources accessed by the students or by explicitly assessing student knowledge through an evaluation. In an MCQ, questions are indexed according to the knowledge they evaluate. Since a student has to give a precise response, only one piece of evidence can be inferred from her answer. A good MCQ [10] should have good distractors: wrong
answers that are not preposterous. The identification of the concepts related to the distractors allows the diagnosis of possible misconceptions.
[Fig. 1. Choice indexing in MCQ: each choice a-d (θ1-θ4) of Question 1 is linked to the concepts c1-c5 it involves; choice a is the correct answer.]
We propose to index the concepts treated in the questions at the choice level (see figure 1). This way, the selection of the correct answer of an item gives information about the acquisition of the concepts related to that choice, and the selection of an incorrect choice identifies the concepts that may have been misunderstood. Moreover, if we allow imprecise answers that point to several choices in a single item (as proposed in the following subsections), we can get multiple pieces of information from a single MCQ. The intensity of the relationships between concepts and choices can be described by conditional belief functions, a generalisation of conditional probabilities in the context of belief function theory [11]. By applying the Generalized Bayesian Theorem [12], the student model can be diagnosed.
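To make the idea concrete, here is a minimal Python sketch of how choice-level indexing could be stored and queried. The structure, function name and particular concept links are illustrative assumptions of ours, not the paper's implementation; figure 1 only shows that choices relate to concepts.

```python
# Illustrative sketch of choice-level indexing (hypothetical structure;
# the concept links below are examples, not those of figure 1).
question_index = {
    "Question 1": {
        "a": {"concepts": ["c1", "c2"], "correct": True},
        "b": {"concepts": ["c3"], "correct": False},
        "c": {"concepts": ["c4"], "correct": False},
        "d": {"concepts": ["c4", "c5"], "correct": False},
    }
}

def evidence_from_choice(question, choice):
    """Return (concept, interpretation) pairs inferred from a selected choice:
    concepts of a correct choice count as acquired, those of an incorrect
    choice as possible misconceptions."""
    info = question_index[question][choice]
    status = "acquired" if info["correct"] else "possible misconception"
    return [(concept, status) for concept in info["concepts"]]

print(evidence_from_choice("Question 1", "d"))
# [('c4', 'possible misconception'), ('c5', 'possible misconception')]
```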
3.2 Choice preference
A precise and absolute answer introduces noise into the assessment process: if we want to assess acquired knowledge, why force a student into a clear-cut answer when her knowledge is imprecise? We propose to let her express her choice preferences by assigning weights representing her belief in the proposed choices. This type of interaction can easily be accomplished by using sliders.
3.3 General confidence
We concur with the works on confidence assessment in MCQs (presented in section 2). If the student is not sure of her answer, a good assessment tool has to let her express this uncertainty. We want to acquire her feeling of knowing [9], of agreeing with her own answer. Alternatively, we can see this as an estimation of her ignorance of the correct answer to the question at hand. General confidence can be expressed either explicitly, by letting the student provide a confidence value, or implicitly, by inferring it from the underlying uncertainty and imprecision representation theory.
3.4 Choice elimination
Another aspect that we consider very informative is the ability to eliminate one or more choices. The identification of distractors by the student is important because, in the context of choice indexing, it provides direct evidence about her stance on the misconceptions attached to the concepts involved.
4 Belief function theory
In this section we present the basic notions of belief functions and their relationship with probabilities; we will use these measures to interpret the weight assignments given by the students to represent their choice preferences in their imperfect answers.

First we define a frame of discernment Θ as the set of possible exclusive values that a variable can take. For instance, we can use Θ = {θ1, θ2, θ3, θ4} to represent the set of possible choices θi of a given question. Belief function theory (also called evidence theory) [11] generalizes probabilities by abandoning the additivity constraint. A Basic Belief Assignment (BBA) is a mapping m: 2^Θ → [0, 1] distributing a unit of belief mass over the subsets of Θ such that Σ_{A∈2^Θ} m(A) = 1. A subset A ⊆ Θ such that m(A) > 0 is called a focal element. The value m(A) represents the fraction of belief (mass) committed to the elements of the subset A ⊆ Θ. Total ignorance is represented by the vacuous belief function, which assigns the whole belief mass unit to the frame itself; partial ignorance is represented by m(Θ) > 0.

The belief measure bel: 2^Θ → [0, 1] gives, for a subset A ⊆ Θ, the total belief in the elements contained in A: bel(A) = Σ_{B⊆A} m(B), ∀A ⊆ Θ. The plausibility measure pl: 2^Θ → [0, 1] gives the belief that could be transferred to A: pl(A) = Σ_{A∩B≠∅} m(B), ∀A ⊆ Θ.

Uncertainty is easily represented in belief function theory, since different amounts of belief can be assigned to different focal elements. Imprecision is expressed by assigning belief masses to subsets of the frame, not necessarily singletons, so different levels of imprecision can be expressed. The case where all focal elements are singletons is called a Bayesian belief assignment, the particular case of probabilities: we then have bel(θi) = pl(θi) = P(θi). Probability only allows the representation of uncertainty: total ignorance must be represented as equiprobability, and there is no way to specify partial ignorance or imprecision.
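As a small illustration of these definitions, the following sketch (our own code, not the authors') computes bel and pl for a BBA whose focal elements are singletons plus the frame, which is exactly the shape of BBA used by Ev-MCQs below.

```python
# Minimal sketch of a BBA over a 4-choice frame, with the bel and pl
# measures defined above. Subsets of the frame are frozensets.

THETA = frozenset({"a", "b", "c", "d"})  # frame of discernment

# Example BBA: mass on two singletons plus mass on the whole frame
# (partial ignorance). The masses sum to 1.
bba = {
    frozenset({"a"}): 0.1,
    frozenset({"c"}): 0.6,
    THETA: 0.3,
}
assert abs(sum(bba.values()) - 1.0) < 1e-9

def bel(A, m):
    """bel(A) = sum of m(B) over all focal elements B included in A."""
    return sum(mass for B, mass in m.items() if B <= A)

def pl(A, m):
    """pl(A) = sum of m(B) over all focal elements B intersecting A."""
    return sum(mass for B, mass in m.items() if B & A)

print(bel(frozenset({"c"}), bba))  # 0.6
print(pl(frozenset({"c"}), bba))   # 0.9 (0.6 plus 0.3 from the frame)
```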
5 Representation and interpretation of imperfect answers
In [3] we presented a way to express imperfect answers by applying belief function theory to MCQs (evidential MCQs, Ev-MCQs). Here, we consider only MCQs where exactly one choice is correct; MCQs with multiple responses can be viewed as a set of true-or-false questions (hence, single response MCQs). One unit of belief mass has to be distributed by the evaluee, so that each choice that
seems plausible can be selected to a certain degree. The student is not forced to assign all of her belief mass to a single option; she can distribute it as she wants. Nor is she forced to distribute all of her belief mass: she can leave some mass unassigned (u = 1 − Σ_{θi∈Θ} m(θi)), indicating her ignorance or lack of confidence in her answer (see figure 2). In Ev-MCQs, the expression of general confidence is thus bound to the choice preference. The acquired freedom allows the student to point at some of the choices that she otherwise wouldn't consider. Moreover, the student is able to identify the choices that are incorrect in her view (the set Distractors; choice b in figure 2), so that the unassigned mass u will not be assigned to Θ, but to Θ \ Distractors.
[Fig. 2. An Ev-MCQ imperfect answer: belief masses m(θi) are assigned to the choices via sliders, choice b is eliminated, and the unassigned mass u = 1 − Σ_{θi∈Θ} m(θi) expresses ignorance or lack of confidence.]
We can represent an answer to the question as a belief mass assignment over the elements θi ∈ Θ. The unassigned mass can be associated to Θ itself, so the only possible focal elements of the BBA that represents the answer are singletons and the frame itself. All the BBAs expressed through Ev-MCQs are normalized. A classical MCQ assigns a weight of 1 to the selected choice θsel and 0 to the others; Ev-MCQs thus generalize classical MCQs, since a precise and certain answer coincides with a classical MCQ answer. If we were to interpret the weights as probability masses instead, imperfect answers would be expressed through probability distributions, and a tool allowing the expression of these answers would have to respect the additivity constraint. Choice elimination would be possible by fixing the probability of a choice to 0, so that the masses would be distributed among the rest of the choices. Since it is not possible to express a degree of partial ignorance directly, it would be necessary to express it in terms of another probability distribution over a universe having two possible states, {Confident, Ignorant}. The two distributions would be independent of each other, and a combination of the two would give a final answer taking into account general confidence and choice preference. Nevertheless, the complexity and interactivity of a tool permitting such an answer would need to be studied carefully.
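Concretely, an Ev-MCQ answer of this shape could be encoded as a BBA as in the following sketch; the function name and encoding are ours, chosen to be consistent with the description above (unassigned mass transferred to Θ \ Distractors).

```python
# Hedged sketch: turning slider masses and eliminated choices into a BBA
# whose focal elements are singletons plus Theta \ Distractors.

def to_bba(masses, distractors, choices=("a", "b", "c", "d")):
    """masses: dict choice -> assigned mass; distractors: eliminated choices.
    The unassigned mass u is transferred to Theta \\ Distractors."""
    u = 1.0 - sum(masses.values())
    assert u >= -1e-9, "assigned masses must not exceed 1"
    bba = {frozenset({c}): m for c, m in masses.items() if m > 0}
    if u > 1e-9:
        bba[frozenset(choices) - frozenset(distractors)] = u
    return bba

# The imperfect answer of figure 2: m(a)=0.1, m(c)=0.6, b eliminated, u=0.3.
print(to_bba({"a": 0.1, "c": 0.6}, distractors={"b"}))
# {frozenset({'a'}): 0.1, frozenset({'c'}): 0.6,
#  frozenset({'a', 'c', 'd'}): 0.3}
```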
6 Normalization of imperfect answers and noise prevention
The purpose of the expression of imperfect answers is to obtain answers closer to the imperfect notions the students have. We estimate that the student diagnosis process suffers from noise caused by the constraint of expressing precise and certain answers via classical MCQs, and that rich information about the current state of knowledge of a student is lost in the adaptation of imperfect notions to a single-choice answer. Ev-MCQs can help to avoid such noise. We define the normalization of an imperfect answer as the selection of a single choice as the answer to an MCQ, taking into account the BBA expressed through an imperfect answer of an Ev-MCQ; it represents the answer the same student would give through a classical MCQ interface. Choice preference, choice elimination and global confidence in the answer all need to be considered in the selection of the choice for the normalized answer. In Smets' interpretation of belief function theory [13], an operation called the pignistic transformation is applied to BBAs in order to obtain a pignistic probability distribution qualifying each of the singletons of the frame of discernment:

BetP(θi) = Σ_{A⊆Θ, θi∈A} m(A)/|A|
In the case of an Ev-MCQ, the only non-singleton focal element is the one carrying the unassigned mass u (denoted m(Θ) and transferred to Θ \ Distractors), so for each non-eliminated choice the transformation reduces to:

BetP(θi) = m(θi) + m(Θ)/|Θ \ Distractors|

and BetP(θi) = 0 for each eliminated choice.
For example, in the case of figure 2, we have Θ = {A, B, C, D}, Distractors = {B}, and m(Θ) = u = 0.3:

– BetP(A) = m(A) + m(Θ)/|{A, C, D}| = 0.1 + 0.3/3 = 0.2
– BetP(B) = 0
– BetP(C) = m(C) + m(Θ)/|{A, C, D}| = 0.6 + 0.3/3 = 0.7
– BetP(D) = m(D) + m(Θ)/|{A, C, D}| = 0 + 0.3/3 = 0.1
We consider that each answer to a question contributes one unit (1) of information. In the case of a precise and certain answer, that unit of information is assigned to the selected choice. In the case of an imperfect answer, the unit of information is distributed among the considered choices, and we use the BetP values of the choices as a way of measuring the information fragments. We define BetPnorm as the highest value of BetP, and we call Rnorm the subset of choices having a BetP value equal to BetPnorm⁴. If Rnorm has only one element, we select it as the student's normalized answer; if Rnorm has several choices, we assume that the student would select one of them randomly. In the case of figure 2, BetPnorm = 0.7 and Rnorm = {C}.
⁴ In fact, since the imperfect answers are expressed by the students by clicking on a bar, we use a sensitivity threshold in the comparison to BetPnorm.
We define a measure of prevented noise as the difference in information on the selected choice between the normalized answer (1) and the imperfect answer (BetPnorm):

Noise = 1 − BetPnorm        (1)

In our example, we have Noise = 1 − 0.7 = 0.3.
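Putting the pieces together, the sketch below (our own code, with assumed names) computes the pignistic values, the normalized answer Rnorm and the prevented noise for the figure 2 example; the eps parameter plays the role of the sensitivity threshold of footnote 4.

```python
# Hedged sketch of the normalization and prevented-noise computation
# of this section, applied to the figure 2 answer.

def bet_p(masses, distractors, choices=("a", "b", "c", "d")):
    """Pignistic probability per choice: BetP = m(choice) + u/|kept choices|
    for non-eliminated choices, 0 for eliminated ones."""
    kept = [c for c in choices if c not in distractors]
    u = 1.0 - sum(masses.values())  # unassigned mass m(Theta)
    return {c: (masses.get(c, 0.0) + u / len(kept)) if c in kept else 0.0
            for c in choices}

def normalize(masses, distractors, choices=("a", "b", "c", "d"), eps=1e-6):
    """Return (R_norm, prevented noise). eps is the sensitivity threshold
    used when comparing BetP values to BetP_norm (footnote 4)."""
    betp = bet_p(masses, distractors, choices)
    best = max(betp.values())          # BetP_norm
    r_norm = [c for c in choices if best - betp[c] <= eps]
    return r_norm, 1.0 - best          # Noise = 1 - BetP_norm

# Figure 2: m(a)=0.1, m(c)=0.6, b eliminated, u=0.3.
print(bet_p({"a": 0.1, "c": 0.6}, {"b"}))
# ≈ {'a': 0.2, 'b': 0.0, 'c': 0.7, 'd': 0.1}
print(normalize({"a": 0.1, "c": 0.6}, {"b"}))
# ≈ (['c'], 0.3)
```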
7 Experimentation
In order to validate the use of Ev-MCQs, we conducted an experiment in the context of knowledge evaluation with a group of 113 junior school students. Two 7th grade (5ème) classes and two 8th grade (4ème) classes took a 45-minute evaluation consisting of 30 questions dealing with their biology courses. Table 1 shows the distribution of the groups and the type of evaluation assigned to them. Half of the students took a classical MCQ evaluation (groups A, C, E, G), and half an Ev-MCQ evaluation (groups B, D, F, H). We used the BetP value of the correct choice to score the performance on the tests. This way, a classical MCQ answer can only have 0 or 1 as its score, while an Ev-MCQ answer can have any value in the interval [0, 1] as its score. The evaluations are graded out of 20. The average performances of each group are presented in the next-to-last column of table 1. The last column shows the average score of the normalized answers for Ev-MCQ evaluations. The weighted average of the scores is 16.05/20. From the results presented in table 1, we can see that the type of evaluation does not have a considerable impact on the performance of the students. We can also see that the difference between the scores of the normalized and unnormalized answers for the Ev-MCQ evaluations is negligible.

Table 1. Group distribution and performance.

Grp.  Class  #Students  MCQ type       Grade avg. /20  Norm. grade avg. /20
A     7th    14         Classical MCQ  15.70           -
B     7th    15         Ev-MCQ         16.03           16.31
C     7th    15         Classical MCQ  16.27           -
D     7th    15         Ev-MCQ         16.52           16.82
E     8th    13         Classical MCQ  16.15           -
F     8th    14         Ev-MCQ         14.64           14.75
G     8th    13         Classical MCQ  16.97           -
H     8th    14         Ev-MCQ         16.15           16.23
Table 2 describes the imperfect answers expressed by the 58 students from the 4 groups who took the Ev-MCQ evaluation. Columns 2, 3 and 4 summarize the types of answers given by each group, presenting the percentages of absolute
answers (correct or incorrect precise and certain answers), uncertain answers (having a choice clearly preferred to the others, |Rnorm| = 1), and undecided answers (having |Rnorm| > 1). Columns 5 and 6 present the minimal and maximal prevented noise percentages per item. Columns 7, 8 and 9 present the number of students in each group and the minimal and maximal prevented noise percentages per student. Finally, in the last column, we have the average percentage of noise prevented for each group. We can see that approximately three quarters of the answers given were absolute (the weighted average of the questions having absolute answers is 77.1%). This means that if the students of groups B, D, F and H had had to take the evaluation with classical MCQs instead of Ev-MCQs, they would have had to adapt their answers for one quarter of the questions, losing in the process important information about their imperfect state of knowledge. The magnitude of the adaptations for each group is given in the last column. The weighted average of prevented noise is 6.61%. This measure gives us the amount of information lost and replaced by noise because of the inflexibility of classical MCQs. This is just an average of the performances: it includes the results of the absolute answers given by students with good performances along with the undecided and uncertain answers given by students with relatively poor performances. The impact of imperfect answers can be seen more clearly by analyzing the particular cases of students with high amounts of prevented noise. These are the students with a low feeling of knowing, whose cognitive state is very fragile and needs to be treated. We can see, for example, in the next-to-last column of table 2, the maximum level of noise prevented from the answers of a single student in each group (17.9%, 11%, 28.7% and 13.9%). Thanks to the increased information gathered through Ev-MCQs, different degrees of misconception can be recognized, so that a pedagogical application (e.g. an ITS) can undertake the necessary measures to correct them.

Table 2. Imperfect answers given through Ev-MCQs.
Grp.  %Abs   %Unc   %Ind   Item noise %    #Students  Student noise %   Noise
                           Min    Max                 Min    Max        avg. %
B     82     4      14     0      19.1     15         0.7    17.9       6
D     73.6   24.6   1.8    0      26.6     15         0.3    11         5.5
F     75     17.6   7.4    0      22.3     14         1.3    28.7       9
H     77.6   17.6   4.8    0      16       14         0      13.9       6.1
In the fifth and sixth columns, we can see the minimum and maximum noise prevented for a single question. The acquisition of imperfect answers allows us to recognize which items cause the most unease to students, and to detect possible problems with the way questions are presented.
8 Conclusion
In this article, we analysed the ways in which classical MCQs can be extended in order to obtain more information from a single answer, by taking advantage of the flexibility provided by an uncertainty and imprecision management theory. Choice preference, confidence assessment and choice elimination were identified as possible extensions to classical MCQs. We presented how Ev-MCQs, a generalization of classical MCQs applying belief function theory to the interpretation of imperfect answers expressed as weight assignments over the choices, allow for a closer representation of the real state of knowledge of a student. We validated the application of Ev-MCQs in the context of knowledge evaluation through experimentation, and we verified their main purposes: increased flexibility of expression and noise prevention. Results showed that Ev-MCQs allowed the acquisition of imperfect answers for approximately 25% of the questions without constraining the expression of precise and certain answers, and that they prevented the replacement of an average of 6.61% of the total information with noise issued from the adaptation of imperfect answers. One of the more important aspects of the imperfect answers presented in this article is the subject of our current research: if the expression of imperfect knowledge is very important, as we have shown, the representation of imperfect answers through BBAs also allows us to exploit the reasoning capabilities of belief function theory. Student modeling is possible through the use of BBAs to represent the state of knowledge of the different concepts covered in the domain model of a pedagogical application (overlay model). The Generalized Bayesian Theorem [12] can be applied to learner diagnosis [14], taking advantage of choice indexing through conditional belief functions.
Acknowledgements. The authors acknowledge the contribution provided by the dean and teachers (5ème & 4ème SVT courses) of the Collège Jean-Baptiste Say school in Paris.
References

1. Bouchon-Meunier, B.: La logique floue et ses applications. Addison-Wesley (1995)
2. Davies, C.: There's no confidence in multiple-choice testing. 6th Inter. CAA Conf., Loughborough Univ., UK (2002) 119–130
3. Diaz, J., Rifqi, M., Bouchon-Meunier, B.: Evidential multiple choice questions. PING Workshop at UM'2007 (2007)
4. Farrell, G.: A comparison of an innovative web-based tool utilizing confidence measurement to the traditional multiple choice, short answer and problem solving questions. 10th Inter. CAA Conf., Loughborough Univ., UK (2006) 176–184
5. Gardner-Medwin, A.R., Gahan, M.: Formative and summative confidence-based assessment. 7th Inter. CAA Conf., Loughborough Univ., UK (2003) 147–155
6. Gronlund, N.: Assessment of Student Achievement. Allyn & Bacon (2005)
7. Hassmen, P., Hunt, D.P.: Human self-assessment in multiple-choice testing. Journal of Educational Measurement 31(2) (1994) 149–160
8. Khan, K.S., Davies, D.A., Gupta, J.K.: Formative self-assessment using multiple true-false questions on the Internet: feedback according to confidence about correct knowledge. Medical Teacher 23(2) (2001) 158–163
9. Koriat, A., Goldsmith, M., Schneider, W., Nakash-Dura, M.: The credibility of children's testimony: Can children control the accuracy of their memory reports? Journal of Experimental Child Psychology 79(4) (2001) 405–437
10. McAlpine, M.: A Summary of Methods of Item Analysis. CAA Centre, Luton (2002)
11. Shafer, G.: A Mathematical Theory of Evidence. Princeton Univ. Press (1976)
12. Smets, P.: Belief functions: the disjunctive rule of combination and the Generalized Bayesian Theorem. International Journal of Approximate Reasoning 9 (1993) 1–35
13. Smets, P., Kennes, R.: The transferable belief model. Artificial Intelligence 66 (1994) 191–234
14. Smets, P., Kennes, R.: The application of the Transferable Belief Model to diagnostic problems. International Journal of Intelligent Systems 13 (1998) 127–158
15. Warburton, B., Conole, G.: Key findings from recent literature on computer-aided assessment. ALT-C 2003, Sheffield, UK (2003)
16. Wenger, E.: Artificial Intelligence and Tutoring Systems. Morgan Kaufmann Publishers, Inc. (1987)
17. Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1 (1978) 3–28