Student Modeling with Atomic Bayesian Networks

Fang Wei and Glenn D. Blank
Computer Science & Engineering, Lehigh University
19 Memorial Drive West, Bethlehem, PA 18015, USA
(001-610)-758-4867, (001-610)-758-4605
[email protected], [email protected]
Abstract. Atomic Bayesian Networks (ABNs) combine several valuable features in student models: prerequisite relationships, concept-to-solution-step relationships, and real-time responsiveness. Recent work addresses some of these features but has not combined them, which we believe is necessary in an ITS that helps students learn in a complex domain, in our case object-oriented design. A refined representation of prerequisite relationships treats relationships between concepts as explicit knowledge units. Theorems show how to reduce the number of parameters required to a small constant, so that each ABN can guarantee a real-time response. We evaluated ABN-based student models with 240 simulated students, investigating their behavior for different types of students and different slip and guess values. Holding slip and guess to equal, small values, ABNs produce accurate diagnostic rates for student knowledge states.
1 Introduction

An Intelligent Tutoring System (ITS) that individualizes its feedback can provide more useful help for each student [7][14][16]. Adaptive tutoring needs to model how well the student currently knows each concept in the problem domain (knowledge level), not just the likely intention of the student's next step in solving a problem. Modeling a student's knowledge level helps the tutoring system provide feedback that clarifies fundamental knowledge, such as concepts, rather than merely pointing out errors in procedural knowledge.

An ITS increases learning when it provides real-time feedback while students solve a procedural problem [16]. Real-time feedback helps students make progress and avoid hacking, which leads to confusion and to leaving the learning environment prematurely. Student modeling with Bayesian networks can provide information about a student's knowledge level for each granular piece of conceptual knowledge [4][7][15]. But the number of parameters and the updating time for Bayesian networks grow exponentially [12][13]. "Probability is not really about numbers; it is about the structure of reasoning" [12]. Though researchers have used probabilistic reasoning to model students' knowledge states, few have precisely investigated the relationships among the causes that explain student solution steps.

Table 1 compares recent work on student modeling using Bayesian networks, focusing on three features. The first two features, diagnose concepts and prerequisites, indicate whether a system diagnoses the knowledge level of concepts or prerequisite
concepts from students' performance. The third feature, real time, shows whether a system runs in real time in response to student work. Check marks indicate that the authors implemented a feature successfully, while a dash indicates a partial solution.

Table 1. Comparisons between different systems

| Author                 | Context                         | Diagnose concepts | Prerequisites | Real time |
|------------------------|---------------------------------|-------------------|---------------|-----------|
| Murray (1998)          | Desktop associate               |                   |               | √         |
| VanLehn et al. (2001)  | Solve physics problems          | –                 |               | √         |
| Millán et al. (2002)   | CAT system                      | √                 |               |           |
| Butz et al. (2004)     | Web-based programming lectures  |                   | √             | –         |
| Millán et al. (2005)   | CAT system                      | √                 | √             |           |
| Our work (2006)        | OO design                       | √                 | √             | √         |
Murray used a simplified Bayesian network algorithm to run in real time in a desktop associate system [11]. This system selects appropriate tasks for a user instead of helping the user learn desktop knowledge. It did not diagnose any concepts in the domain, instead modeling one skill at a time, such as how to open an IE browser. Nor did it model any relationship between skills, such as prerequisites.

VanLehn et al. modeled multiple rules in a solution graph in ANDES, a system that guides students in solving physics problems [1][15][16][17]. In ANDES, the student model predicts a student's next solution step and diagnoses unmastered rules, rather than unknown concepts. Concepts represent knowledge at a finer granularity than rules. For example, one ANDES rule says that if the velocity of an object is constant, then its acceleration is zero, rather than representing the concepts velocity or acceleration. Furthermore, ANDES does not model prerequisite relationships between rules. Its successful real-time feedback diagnoses any error in the current step and guides a student to the next correct solution step; it does not identify the underlying concepts that the student needs to learn to avoid errors.

Millán et al. modeled a relationship in which multiple concepts cause students' answers in a Computer Adaptive Test (CAT) system [7]. The system diagnoses students' knowledge level from a list of problems chosen by a computer with either a random or an adaptive criterion. Millán et al. used 60 random problems to obtain the knowledge level of 14 concepts at a correct diagnosis rate of 90%. Assuming each problem takes a student 5 seconds, their system needs about 5 minutes to compute the student's current knowledge state. Later, Carmona et al. modeled prerequisite relationships [4]; using 40 problems, their new system attained a correct diagnosis rate of 91% for 14 concepts. Run times for diagnosis and update are still quite long.

Butz et al. proposed modeling prerequisite relationships using pre-knowledge from a final-exam pool [3]. Their system, BITS, is a web-based system that teaches C++ programming lectures. Its student model provides learning sequences through the lecture materials that adapt to each student. The authors did not indicate the accuracy of their results. Furthermore, gathering pre-knowledge from a final exam increases the cost of knowledge acquisition, and may be idiosyncratic to the particular set of students who took the final exam.
As Table 1 shows, researchers are recognizing the importance of prerequisite relationships. But none provide a student model that accurately diagnoses concepts and prerequisites in real time. Our work considers both prerequisite and concept-to-solution-step relationships, and shows how to do so in real time, so that a student model can determine where students need help as they solve a problem.

In this paper we present a student model that diagnoses students' knowledge level in CIMEL-ITS, an intelligent tutoring system that helps beginners learn object-oriented design in a CS1 course [2][9][10][18]. This student model provides a refined representation of prerequisite relationships, uses prerequisites to estimate students' current knowledge level, and guarantees real-time responsiveness using an atomic Bayesian network (ABN). Evaluation results using 240 simulated students show that ABNs can diagnose each student's knowledge level quickly and accurately, correctly diagnosing over 93% of 38 concepts after 38 randomly picked distinct problems in the novice object-oriented design domain.

The organization of this paper is as follows: Section 2 describes the knowledge representation for our student model; Section 3 provides a formal definition of an ABN, with theorems limiting the number of parameters to a small constant, and explains the advantages of this approach; Section 4 presents evaluation results for the accuracy of student models using ABNs; and Section 5 outlines our conclusions and future work.
2 Knowledge Representation

According to cognitive science theory, a sound knowledge state exhibits a highly connected and well-defined structure [1][5]. Students need not only knowledge of individual concepts, but also the relationships between concepts, such as similarity, difference, usage and a-part-of, to build up a sound knowledge state. Our knowledge scheme represents these relationships as explicit nodes in a network.

We use a pair (ku, au) to model the causal relationship between an immediate concept or knowledge unit and an action step that a student takes to solve a problem. A ku is a knowledge unit: the knowledge that students need to learn. There are two kinds of ku: concepts and relations between concepts. For example, a relation between concepts can be Attribute_Parameter, which models understanding the difference between the concepts Attribute and Parameter (a common confusion for novices). We have observed in preliminary results that students frequently struggle to understand relationships between concepts, such as the difference between Attribute and Parameter (and when to use which), or between integer and double. An au is an action unit, a single step in a student's solution, e.g. writing a name for an attribute. By the definition of the pair (ku, au), ku directly causes au.

As shown in Figure 1, the knowledge units that students need to understand to solve object-oriented problems are modeled in a Curriculum Information Network (CIN) for the student model. All the knowledge units are connected by prerequisite links. By convention, a prerequisite is a concept that a student needs to understand before understanding another concept. Different teachers may use different curricula, which results in different CINs. So, more broadly, any concept one needs to teach before introducing a new concept is also a prerequisite. Immediate prerequisites are those concepts that strongly relate to a concept and play the most important role in understanding it.
The concept Class in our CIN is not the aggregation of the concepts Attribute and Method. Instead it represents a category of objects. A student understands the concept Class if he can identify correct class names.

[Figure 1 shows the CIN as a graph of knowledge units connected by directed prerequisite links; an edge from A to B means A is a prerequisite of B. Its nodes include concepts such as actor, object, class, attribute, method, variable, parameter, returntype, constructor, int, double, string, datatype and numeric-datatype, as well as relation nodes such as actor_object, object_class, attribute_parameter, datatype_variable and method_returntype.]
Fig. 1. Curriculum information network (CIN) for the student model
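To make the scheme concrete, here is a minimal Python sketch (ours, not the authors' code) of how a fragment of the CIN and a (ku, au) pair might be represented. The data-structure names are illustrative assumptions; of the prerequisite links shown, only Int and Double as immediate prerequisites of Numeric-datatype are stated explicitly in the paper (Section 3.1).

```python
# Illustrative sketch: a fragment of the CIN as a mapping from each
# knowledge unit (ku) to its immediate prerequisites. Relation-type
# knowledge units such as "attribute_parameter" are ordinary nodes,
# just like concept nodes.
CIN = {
    "int": [],
    "double": [],
    "numeric_datatype": ["int", "double"],  # stated in Section 3.1
    "attribute": ["class"],                 # links below are assumed
    "parameter": ["variable"],
    "attribute_parameter": ["attribute", "parameter"],
}

def immediate_prerequisites(ku: str) -> list[str]:
    """Return the immediate prerequisites p1(ku) of a knowledge unit."""
    return CIN.get(ku, [])

# A (ku, au) pair links the most relevant knowledge unit to one action
# unit, e.g. the solution step of writing a name for an attribute.
pair = ("attribute", "write_attribute_name")
```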
3 Atomic Bayesian Network

An Atomic Bayesian Network (ABN) focuses on just one concept, its immediate prerequisites, and its relationship to a solution step. An ABN models both the causal association between a student's solution and the most relevant concepts, and the prerequisite relationships among those concepts. It indicates that mastery of those concepts determines whether the student makes the current solution step correctly. In other words, an ABN models two kinds of relationships: 1) the student needs to understand the concept at the center of an ABN before he can make the current solution step correctly, and 2) the student needs to understand all of the immediate prerequisites of the center concept before he is ready to understand the center concept.

3.1 Definition of an Atomic Bayesian Network (ABN)

As Figure 2 shows, an ABN is a directed graph composed of one edge (ku, au) and multiple edges (immediate-prerequisite(ku), ku), in which ku and au make a pair (ku, au). Immediate-prerequisite(ku) represents the knowledge units that must be taught right before teaching ku. A noisy-and relationship is enforced among all edges (immediate-prerequisite(ku), ku). Noisy-and is an uncertain relationship
generalized from the logical AND [7][12]. Whereas logical AND requires that a student understand all immediate prerequisites of ku before he understands ku, noisy-and allows for uncertainty about the knowledge of immediate prerequisites. It assumes that knowing each immediate prerequisite is independent of knowing the others. For example, the concept Numeric-datatype has two immediate prerequisites: Int and Double. Noisy-and assumes that the joint events (numeric-datatype, int) and (numeric-datatype, double) are mutually independent. Given this independence, the parameters in the conditional probability table for a noisy-and relationship are the product of the conditional probability values of each parent. For more details, see the proof of equation (1).

[Figure 2 shows an ABN: root nodes p1(ku)_1, p1(ku)_2, ..., p1(ku)_N — the immediate prerequisites of ku — point to the center node ku, which points to the leaf node au.]
Fig. 2. Atomic Bayesian Network (ABN)
All of the variables (nodes) in an ABN have binary values, true or false. They are defined as follows:
• The variable au (the leaf node) represents how a student makes a solution step in a constructive exercise, such as an OO design problem. The value true means the student makes a correct step, while false means a wrong step.
• The variable ku in the center represents whether the student knows the most relevant concept for the current solution step. The value is true when the student understands the concept and false otherwise.
• The variables p1(ku) (the root nodes) represent whether the student knows the immediate prerequisites of the center concept. The value is true when the student understands the prerequisite and false otherwise.

When a student understands ku, he might still make a wrong step because of a slip, an unintentional mistake. Or, when a student does not know ku, he might guess the solution correctly. We also consider the possibility of errors in deriving the center concept from its prerequisites: even if the student knows all the immediate prerequisites, he might not understand the center concept; or he might guess the correct meaning of the center concept without understanding any immediate prerequisite. These characteristics of student learning can be applied to determine the conditional probability tables in an ABN. Four variables that determine the conditional probability tables in an ABN are formally defined as follows:

• slip_e is the probability that a student makes a wrong step when he knows ku [8][11][16][17], where e means an evidence: P(au=false | ku=true) = slip_e, or P(au=true | ku=true) = 1 − slip_e.
• guess_e is the probability that a student makes a correct step when he doesn't understand ku [8][11][16][17]: P(au=true | ku=false) = guess_e.
• slip_p is the probability that a student fails to understand the center concept when he knows one immediate prerequisite, where p means a prerequisite (i.e., a slip in the causal relationship from a prerequisite to the center concept).
• guess_p is the probability that a student understands the center concept when he doesn't know any immediate prerequisite.

The conditional probability table between the nodes au and ku can be calculated from slip_e and guess_e by their definitions above. The conditional probability table between the nodes immediate-prerequisite(ku) and ku can be calculated from the definitions of slip_p, guess_p and noisy-and as

P(ku=true | p_1(ku)_i=true, ..., p_1(ku)_j=false) = ∏_{i∈K} (1 − slip_p) · ∏_{j∈K̄} guess_p    (1)

where if i∈K, j∈K̄, then p_1(ku)_i = true and p_1(ku)_j = false; K ∪ K̄ is the set of all immediate prerequisites of the center concept; K = {p_1(ku)_i = true | i ∈ [1, n]} is the set of immediate prerequisites that the student knows, while K̄ is the set of immediate prerequisites that the student does not know.

Proof: From the definition of conditional probability (t means true, f means false):

P(ku=t | p_1(ku)_i=t, ..., p_1(ku)_j=f) = P(ku=t, p_1(ku)_i=t, ..., p_1(ku)_j=f) / P(p_1(ku)_i=t, ..., p_1(ku)_j=f)

Because the events of knowing the immediate prerequisites of a concept are mutually independent, and because noisy-and assumes that the joint events of knowing an immediate prerequisite and knowing the concept are also mutually independent, the conditional probability of a knowledge unit given its prerequisites becomes:

[P(ku=t, p_1(ku)_i=t) / P(p_1(ku)_i=t)] ∗ ... ∗ [P(ku=t, p_1(ku)_j=f) / P(p_1(ku)_j=f)]

Applying the definition of conditional probability again, the conditional probability of an ABN, taking slip and guess into account, becomes:

P(ku=t | p_1(ku)_i=t) ∗ ... ∗ P(ku=t | p_1(ku)_j=f) = (1 − slip_p) ∗ ... ∗ guess_p = ∏_{i∈K} (1 − slip_p) · ∏_{j∈K̄} guess_p
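To make equation (1) concrete, here is a minimal Python sketch (ours, not the authors' implementation) of the two ABN conditional probability tables; the numeric parameter values are assumed for illustration.

```python
def evidence_cpt(slip_e: float, guess_e: float) -> dict:
    """P(au = true | ku), from the definitions of slip_e and guess_e."""
    return {True: 1.0 - slip_e,   # knows ku, may still slip
            False: guess_e}       # doesn't know ku, may still guess

def noisy_and(known: int, unknown: int, slip_p: float, guess_p: float) -> float:
    """Equation (1): P(ku = true | prerequisite states).

    `known` / `unknown` count the immediate prerequisites the student
    does and does not understand (the sets K and K-bar in the paper).
    """
    return (1.0 - slip_p) ** known * guess_p ** unknown

# Example: Numeric-datatype with immediate prerequisites Int and Double.
slip_p = guess_p = 0.1   # Theorems 3-4 justify slip_p == guess_p
print(noisy_and(known=2, unknown=0, slip_p=slip_p, guess_p=guess_p))  # 0.81
print(noisy_and(known=1, unknown=1, slip_p=slip_p, guess_p=guess_p))  # 0.09
```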
Every solution step corresponds to an ABN, which stores the updated values for the ABN of the next solution step. If each knowledge unit has at most k immediate parents, and there are n knowledge units altogether in the domain, an ABN will:

• Need O(1) running time instead of O(2^n) for each solution step, because it has a small bounded number of immediate parents.
• Update O(1) nodes instead of O(n) nodes for each solution step.
• Determine 4 parameters instead of O(n^k) parameters in a noisy-and relationship.

Using an ABN reduces the running time for each step from exponential, for a complete Bayesian network, to constant time, because the ABN only considers its immediate parents, a small bounded number for any knowledge domain. The number of conditional parameters drops to 4, two pairs of guess and slip. The number of nodes that must be updated for each step drops to the number of parents.
In any domain, the number of immediate parents is much smaller than the total number of variables. Other tutoring systems using Bayesian networks have resorted to approximate algorithms in order to avoid exponential running time [7][16]. However, approximate algorithms are arbitrary with respect to how much of the network they consider. An ABN is more efficient, and it is sufficient, because the relationship between a center concept and ancestor prerequisites is tenuous at best. For example, in the CIN of Figure 1, Actor is a prerequisite of Actor_Method (a method is a service that an Actor can use), which in turn is a prerequisite of Method. For the solution step focusing on Method (give a meaningful name for the method), an ABN only updates Actor_Method, not Actor, because the prerequisite relation between Actor and Method is intuitively tenuous; it is sufficient just to update the immediate prerequisite. As we shall see, simulation results preliminarily support our claim about the sufficiency of ABNs (experiments with real students to validate it will be performed in our future research). The use of an ABN accelerates the diagnosis of a student's knowledge state because the ABN has a sufficiently accurate model of the student's conceptual reasoning.

Table 2 compares student modeling with and without prerequisites, assuming A and B are prerequisites of C, and S is a student's solution step. Both approaches start from initial prior probability values of 0.5 [7]. The table shows that when prerequisites are considered, four out of six posterior probabilities are further away from 0.5, which means our student model is less likely to end with undiagnosed states.

Table 2. Comparison between different student modeling approaches. A, B, C are the concepts related to S, a solution step. A and B are prerequisites of C.
|                      | P(A|S=t) | P(B|S=t) | P(C|S=t) | P(A|S=f) | P(B|S=f) | P(C|S=f) |
|----------------------|----------|----------|----------|----------|----------|----------|
| Millán et al. (2002) | 0.56     | 0.60     | 0.65     | 0.36     | 0.28     | 0.17     |
| Our work (2006)      | 0.83     | 0.83     | 0.75     | 0.36     | 0.36     | 0.04     |

[The accompanying diagrams show the two network structures, each node with prior 0.5: Millán et al. connect A, B and C directly to S, while our work adds prerequisite links from A and B to C.]
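To illustrate how posteriors like those in Table 2 arise, here is a minimal sketch of exact inference by enumeration over the four-node network (A, B → C → S) with prerequisite links. The slip and guess values are our assumptions; the paper does not list the parameters behind Table 2, so the printed numbers will differ from it.

```python
from itertools import product

slip_p = guess_p = 0.1   # prerequisite-to-concept uncertainty (assumed)
slip_e = guess_e = 0.2   # concept-to-step uncertainty (assumed)

def p_c(a: bool, b: bool) -> float:
    """Noisy-and, equation (1): P(C=true | A, B)."""
    known = int(a) + int(b)
    return (1 - slip_p) ** known * guess_p ** (2 - known)

def p_s(c: bool) -> float:
    """P(S=true | C) from slip_e / guess_e."""
    return 1 - slip_e if c else guess_e

def posterior(target: str, s_observed: bool) -> float:
    """P(target=true | S=s_observed) by enumerating A, B, C."""
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        joint = 0.5 * 0.5                       # priors of A and B
        joint *= p_c(a, b) if c else 1 - p_c(a, b)
        joint *= p_s(c) if s_observed else 1 - p_s(c)
        den += joint
        if {"A": a, "B": b, "C": c}[target]:
            num += joint
    return num / den

for t in "ABC":
    print(t, round(posterior(t, True), 2), round(posterior(t, False), 2))
```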
3.2 Theoretic Framework for ABN

We propose a theory that reduces the number of parameters an ABN needs. It does so by discovering useful relationships between slip_e and guess_e and between slip_p and guess_p. Theorems 1 and 2 are intended for exercises not involving multiple-choice questions.

ABN Theorem 1: Let ku be a knowledge unit and au be ku's evidence. If the initial prior probabilities for ku and au are P(ku)=0.5 and P(au)=0.5 when there is no information available in the domain, then

P(au=true | ku=true) + P(au=true | ku=false) = 1    (2)
Both ku and au have binary values of true or false. An initial prior probability of 0.5 means that initially it is equally likely that the variable takes the value true or false [7][17]. To avoid any spurious bias in the domain, we choose the initial prior probabilities P(ku)=0.5 and P(au)=0.5 when initially we know nothing about ku and au: whether the student knows the knowledge and whether he can make the solution step correctly. After gaining more evidence in the domain, the updated posterior probabilities will change, but the conditional probability table of this Bayesian network will not, i.e. equation (2) still holds.

Proof: Figure 3 shows the causal relationship between ku and au. From the process of marginalization [13], the marginal probability of au is

P(au=t) = Σ_ku P(au=t, ku)
Fig. 3. Causal relationship between ku and au
From the conditioning rule [13],

P(au=t) = Σ_ku P(au=t | ku) ∗ P(ku) = P(au=t | ku=t) ∗ P(ku=t) + P(au=t | ku=f) ∗ P(ku=f)

∵ P(ku=t)=0.5, P(ku=f)=0.5, and P(au=t)=0.5
∴ 0.5 = P(au=t | ku=t) ∗ 0.5 + P(au=t | ku=f) ∗ 0.5
∴ P(au=t | ku=t) + P(au=t | ku=f) = 1

ABN Theorem 2: By the definition of slip_e, P(au=true | ku=true) = 1 − slip_e, and by the definition of guess_e, P(au=true | ku=false) = guess_e; then slip_e = guess_e, where 0 ≤ slip_e, guess_e ≤ 1.

Proof: ∵ P(au=t | ku=t) + P(au=t | ku=f) = 1
∴ 1 − slip_e + guess_e = 1
∴ slip_e = guess_e

ABN Theorem 3: Let ku, p_1(ku), ..., p_k(ku) be a concept, its immediate prerequisite, ..., and its k-th level immediate prerequisite, with only one immediate prerequisite at each level. If the initial prior probabilities of ku, p_1(ku), ..., and p_k(ku) are P(ku=true) = P(p_1(ku)=true) = ... = P(p_k(ku)=true) = 0.5 when there is no information available in the domain, then

P(p_{k-1}(ku)=true | p_k(ku)=true) + P(p_{k-1}(ku)=true | p_k(ku)=false) = 1    (3)
Proof: Figure 4 shows the causal relationships among ku and its ancestors. From the process of marginalization [13], the marginal probability of a particular ku variable is a summation over all the other ku variables in the network, both preceding and following. For example, for node p_a(ku) (1 ≤ a ≤ k),

P(p_a(ku)=t) = Σ_{p_k(ku), ..., p_{a+1}(ku), p_{a-1}(ku), ..., p_1(ku), ku} P(p_k(ku), p_{k-1}(ku), ..., p_{a+1}(ku), p_a(ku)=t, p_{a-1}(ku), ..., ku)
[Figure 4 depicts the chain p_k(ku) → p_{k-1}(ku) → ... → p_{a+1}(ku) → p_a(ku) → p_{a-1}(ku) → ... → ku.]
Fig. 4. Relationship between ku and its ancestors
This summation shows how to compute the probability of one node from the entire network. However, producing the above joint probability is computationally hard, so we replace it with conditional probabilities for each node, using the Bayesian network factorization [13]. The summation is then equivalent to:

Σ_{p_k(ku), ..., p_{a+1}(ku), p_{a-1}(ku), ..., p_1(ku), ku} P(p_k(ku)) P(p_{k-1}(ku) | p_k(ku)) ⋯ P(p_a(ku)=t | p_{a+1}(ku)) ⋯ P(ku | p_1(ku))

Next, we can separate the ku variables for calculation, as a set of simpler summations inside the summation, each covering fewer ku variables. So the expression is equivalent to:

Σ_{p_{k-1}(ku), ..., p_{a+2}(ku), p_{a-1}(ku), ..., p_1(ku)} { P(p_{k-2}(ku) | p_{k-1}(ku)) ⋯ P(p_{a-1}(ku) | p_a(ku)=t) ⋯ P(p_1(ku) | p_2(ku))
∗ Σ_{p_k(ku)} P(p_{k-1}(ku) | p_k(ku)) P(p_k(ku)) ∗ Σ_{p_{a+1}(ku)} P(p_a(ku)=t | p_{a+1}(ku)) ∗ Σ_{ku} P(ku | p_1(ku)) }

Since Σ_{ku} P(ku | p_1(ku)) = 1 no matter what the value of the variable p_1(ku) is, and since, by the conditioning rule that expresses marginalization through conditional probabilities,

Σ_{p_k(ku)} P(p_{k-1}(ku) | p_k(ku)) P(p_k(ku)) = P(p_{k-1}(ku)),

we repeat this procedure of separating ku variables to reduce the big summation for P(p_a(ku)=true) to a sum of two products of a conditional probability and a prior probability:

P(p_a(ku)=t | p_{a+1}(ku)=t) ∗ P(p_{a+1}(ku)=t) + P(p_a(ku)=t | p_{a+1}(ku)=f) ∗ P(p_{a+1}(ku)=f)

If the prior probability for each ku variable is 0.5, then

P(p_{k-1}(ku)=t | p_k(ku)=t) + P(p_{k-1}(ku)=t | p_k(ku)=f) = 1

ABN Theorem 4: By the definition of slip_p, P(p_{k-1}(ku)=true | p_k(ku)=true) = 1 − slip_p, and by the definition of guess_p, P(p_{k-1}(ku)=true | p_k(ku)=false) = guess_p; then slip_p = guess_p, where 0 ≤ slip_p, guess_p ≤ 1. The proof of ABN Theorem 4 is similar to that of Theorem 2.
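As a sanity check, the constraint behind Theorems 1 and 2 can be verified numerically. This small sketch (ours, for illustration) shows that with P(ku) = 0.5, the marginal P(au = true) stays at 0.5 exactly when slip_e = guess_e.

```python
def p_au_true(slip_e: float, guess_e: float) -> float:
    # Marginalize over ku with P(ku=true) = 0.5:
    # P(au=t) = P(au=t|ku=t) P(ku=t) + P(au=t|ku=f) P(ku=f)
    return 0.5 * (1 - slip_e) + 0.5 * guess_e

assert abs(p_au_true(0.2, 0.2) - 0.5) < 1e-12   # slip == guess keeps P(au)=0.5
assert p_au_true(0.2, 0.3) != 0.5               # slip != guess violates Theorem 1
```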
4 Evaluation

Experimenting with human subjects is expensive, and many unexpected factors can make results deviate from what is hypothesized. Large amounts of data are needed to prove that the student model really works.
Using simulated students avoids problems with limited human-subject resources, at low cost, when testing the student model, until it appears to be ready for testing with human subjects. Several researchers apply simulated students to evaluate their student models [4][6][7][17].

One problem with simulated students is how confident we can be that they represent real students. Millán et al. found that adding more pre-knowledge to simulated students improves the evaluation results [7]. Their work shows that analyzing real students is necessary for designing simulated students and can increase confidence in simulation results. The probability that a real student knows a concept will be very small when the student does not know any prerequisites. Following this intuitive rule from observing real students, a simulated student will not know a concept without knowing any of its prerequisites.

We generated 240 students of 6 types for the simulation. Students of the 6 types understand 5, 10, 15, 20, 25 and 30 concepts in the CIN, respectively. If a simulated student understands a concept, the knowledge level (KL) for this concept is 1, otherwise 0.

[A figure here outlines the simulation workflow: the KL of a simulated student drives a random solution step; the student model (SM), starting from 0.5, computes P(solution step) and updates its estimated KL; P ≥ 0.5 is treated as a correct step.]
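Here is a minimal sketch (ours, not the authors' generator) of how such simulated students might be produced, reusing the CIN prerequisite map sketched in Section 2. The sampling logic follows the stated rule that a student never knows a concept while knowing none of its prerequisites.

```python
import random

def learnable(cin: dict[str, list[str]], knowledge: dict[str, int]) -> list[str]:
    """Concepts not yet known whose prerequisite rule is satisfied: a
    concept is learnable if it has no prerequisites, or the student
    already knows at least one of them (the paper's intuitive rule)."""
    return [ku for ku, pre in cin.items()
            if not knowledge[ku] and (not pre or any(knowledge[p] for p in pre))]

def generate_student(cin: dict[str, list[str]], n_known: int) -> dict[str, int]:
    """Knowledge level (KL) is 1 for understood concepts, else 0."""
    knowledge = {ku: 0 for ku in cin}
    while sum(knowledge.values()) < n_known:
        candidates = learnable(cin, knowledge)
        if not candidates:          # CIN exhausted before reaching n_known
            break
        knowledge[random.choice(candidates)] = 1
    return knowledge

# Six student types, 40 students each -> 240 simulated students,
# knowing 5, 10, 15, 20, 25 and 30 concepts respectively.
students = [generate_student(CIN, n)
            for n in (5, 10, 15, 20, 25, 30) for _ in range(40)]
```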