Bayesian Modeling with Strong vs. Weak Assumptions in the Domain of Skills Assessment
Michel C. Desmarais, Peyman Meshkinfam, Michel Gagnon
Computer Engineering, École Polytechnique de Montréal, Montreal, Canada H3T 2B1
Abstract

Approaches such as Bayesian networks (BN) are considered highly powerful modeling and inference techniques because they make few assumptions and can represent complex relationships among variables with efficiency and parsimony. They can also be learned from training data, and they generally lend themselves to a variety of sound and efficient inference computations. However, in spite of these qualities, BN may not always be the most advantageous technique in comparison to simpler techniques that make stronger assumptions. We investigate this issue in the domain of skills modeling and assessment, where BN have received a wealth of attention. Vomlel's (2004) BN model of basic arithmetic skills is compared, on the basis of predictive accuracy, to a simple Bayes posterior probability update approach under strong independence assumptions, named POKS (Desmarais, Maluf, & Liu, 1995). The results of simulation experiments show that the BN yields better accuracy for predicting concept mastery, but POKS is better at predicting question mastery. We conjecture possible explanations for these findings by analyzing the specifics of the domain model, in particular its closure under union and intersection, and discuss their implications.
1 Introduction
Bayesian modeling with joint conditional probabilities is conceptually and computationally the most straightforward means of computing posterior probabilities. It makes few assumptions and has good reliability given sufficient data. However, because the number of joint conditional probabilities grows exponentially with the number of variables (a full joint distribution over n binary variables has 2^n − 1 free parameters), this approach quickly becomes impractical due to the large amount of data required to calibrate the model. Consequently, reducing the number of conditional probabilities to estimate to a useful minimum represents a fundamental goal in practice. Bayesian networks (BN) address this issue by modeling only the relevant conditional probabilities out of the full joint distribution. They can represent complex relationships among variables with efficiency and parsimony. Because they explicitly model dependencies and independencies, they make few assumptions and generally lend themselves to a variety of sound and efficient inference computations. Bayesian networks limit their assumptions to the conditional independence of a node, given its parents, from all non-descendants of that node (see Neapolitan, 2004, for a good introduction). Moreover, they can be learned from training data. However, in spite of these qualities, BN may not always be the most advantageous technique in comparison to simpler techniques that make stronger assumptions. For the same reason that BN widen the range of application contexts of Bayesian modeling by offering parsimonious models, so do Bayes models with stronger independence assumptions: they offer even more parsimonious representations than do BN. However, they impose further assumptions on the domain model that can lead to invalid inferences. We investigate the tradeoff between model parsimony and predictive accuracy by comparing a BN model with a simple model based on the application of Bayes' rule under strong independence assumptions. The domain modeled is the mastery of individual skills. Bayesian models have been used by a number of researchers in the domain of skills assessment and user modeling, such as Conati, Gertner, and VanLehn (2002), Mislevy, Almond, Yan, and Steinberg (1999), Millán, Trella, Pérez-de-la-Cruz, and Conejo (2000), and Vomlel (2004), to name but a few. We use a BN developed by Vomlel (2004), composed of 20 question items and 19 concept skills, as the comparison point for the BN approach (see figure 1). Using the same data as Vomlel, we apply the Bayesian method named
Figure 1: Bayesian network from Vomlel (2004). The model contains 20 question item nodes, represented by leaf nodes. Other nodes represent concepts or misconceptions, and hidden task nodes (labeled Tnn and drawn with dotted contours). See text for details.
POKS (Desmarais et al., 1995), and compare the respective performance of the two approaches.
2 Vomlel's Bayesian network
Figure 1 illustrates the Bayesian network structure used in this study. It contains a set of concepts and question items that take binary values (mastered or not mastered). Question nodes are leaf nodes in the structure and are labeled Xnn. Concept nodes are the non-leaf nodes with oval and rectangular shapes. There are several types of concept nodes. Nodes with labels starting with "M" are in fact misconceptions, whereas the oval concept nodes represent skills in the domain of fraction arithmetic. For example, the AD concept node (near the top right of figure 1's hierarchy) represents the skill of adding two fractions with a common denominator (e.g., 1/7 + 2/7 = 3/7), and concept node CD (left of node AD) is the skill of finding a common denominator (e.g., 1/2, 2/3 = 3/6, 4/6). Nodes with dotted contours are hidden nodes: they are never directly assessed, whereas all other nodes are. Two concepts are hidden nodes (HV1 and CP). Tasks are also hidden nodes, as they allow more than one question item for a single task; for example, task T8 is linked to two question items, one of which is X9. Figure 1's network structure is adapted from Vomlel (2004). It was initially defined through a constraint-based learning algorithm, the Hugin PC algorithm (HUGIN Expert, 2002), which relies on a series of conditional independence tests. It was later inspected by domain experts for adjustments.
Vomlel (2004) tested a number of variants of figure 1's structure. The structure reported here was one of the best performers; it is described in more detail in Vomlel (2004).
3 The POKS Bayes modeling framework
Bayesian networks, especially those with a hierarchical structure or that can be transformed into one, need not make strong assumptions. The inference algorithms for such structures can be exact given the BN structure's independence assumptions, and these assumptions are themselves subject to statistical tests by the learning algorithm. Taking an almost opposite stance, the POKS approach makes strong independence assumptions, namely that evidence variables are independent given the hypothesis:

$$P(E_1, \ldots, E_n \mid H) = \prod_{i=1}^{n} P(E_i \mid H) \qquad (1)$$
Given the independence assumption of equation (1), the probability update of H can be written in the following posterior odds form:

$$O(H \mid E_1, E_2, \ldots, E_n) = \prod_{i=1}^{n} O(H \mid E_i) \qquad (2)$$
where $O(H \mid E_i)$ represents the odds of H given evidence of $E_i$, with the usual odds semantics $O(H \mid E_i) = \frac{P(H \mid E_i)}{1 - P(H \mid E_i)}$. This allows us to use Bayes' theorem in its version based on odds and likelihood algebra:

$$O(H \mid E) = LS_{EH} \cdot O(H) \qquad (3)$$
$$O(H \mid \bar{E}) = LN_{EH} \cdot O(H) \qquad (4)$$

where LS and LN are respectively the likelihood of sufficiency and the likelihood of necessity:

$$LS_{EH} = P(E \mid H) \,/\, P(E \mid \bar{H}) \qquad (5)$$
$$LN_{EH} = P(\bar{E} \mid H) \,/\, P(\bar{E} \mid \bar{H}) \qquad (6)$$
All conditional probabilities, odds, and likelihood estimates are derived from Vomlel’s data set.
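To make the update concrete, here is a minimal sketch of how equations (3) and (4) can be applied sequentially over a set of observed items. The function names and numeric values are ours and purely illustrative; in the actual study the LS and LN estimates are derived from Vomlel's data set.

```python
# Minimal sketch of the odds-and-likelihood update of equations (3)-(6).
# Function names and example values are hypothetical, not from the paper.

def odds(p):
    """Probability to odds: O = p / (1 - p)."""
    return p / (1.0 - p)

def probability(o):
    """Odds back to probability: P = O / (1 + O)."""
    return o / (1.0 + o)

def update(prior_h, observations):
    """Apply equations (3)/(4) once per observed evidence item.
    `observations` is a list of (ls, ln, success) triples: on a success the
    odds of H are multiplied by LS, on a failure by LN. The independence
    assumption of equation (1) is what justifies the repeated product."""
    o = odds(prior_h)
    for ls, ln, success in observations:
        o *= ls if success else ln
    return probability(o)

# One mastered item with LS = 3.2, then one failed item with LN = 0.4:
print(update(0.4, [(3.2, 0.3, True), (2.5, 0.4, False)]))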
4 POKS network induction
The evidence nodes in equation (2) are determined by the topology of the POKS network. Determining which nodes are linked together is based on the POKS network induction algorithm (Desmarais et al., 1995). This algorithm is applied to Vomlel's (2004) data, the same data that served for the construction of figure 1's BN. The POKS induction algorithm relies on a pairwise analysis of item-to-item relationships. The analysis attempts to identify the order in which knowledge items are mastered and is inspired by the knowledge spaces theory of Falmagne, Koppen, Villano, Doignon, and Johannesen (1990). This theory states that skill acquisition order can be modeled by an AND/OR graph. For our purpose, we impose the stronger assumption that the skill acquisition order can be modeled by a directed acyclic graph, or DAG (see the example in figure 4 in the Discussion below). This assumption allows us to limit the network induction algorithm to a pairwise analysis. We will return to this question, as it has implications for the performance comparison between BN and models with strong independence assumptions. The test to establish a relation A → B consists of three conditions, to each of which a statistical test is applied:

$$P(B \mid A) \geq p_c \qquad (7)$$
$$P(\bar{A} \mid \bar{B}) \geq p_c \qquad (8)$$
$$P(B \mid A) \neq P(B) \qquad (9)$$
Conditions (7) and (8) are verified by a binomial test with two parameters: p_c, the minimal conditional probability of equations (7) and (8), and α_b, the alpha error tolerance level. For this study, both parameters are set at 0.5. Condition (9) is the independence test, verified through a χ² statistic with an alpha error α_c < 0.5. These high alpha error values maximize the number of relations obtained. There is no knowledge engineering effort involved in building the relations among question items in POKS: relations are obtained with an algorithm based on statistical tests over the above three conditions. Although a network of relations is obtained with the POKS induction algorithm, we do not propagate evidence within the network from an observed node beyond its directly connected nodes.¹ In other words, if we have A → B and B → C, no probability update is performed over C upon the observation of A, unless a link A → C is also derived from the data. Experimental results not reported here show that the performance is very close whether we propagate evidence using the POKS scheme in Desmarais et al. (1995) or solely rely on direct links for probability updates.
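As an illustration, the following sketch shows how the three tests could be implemented over binary response vectors with scipy. The function name and its defaults are ours, and condition (8) is written in the contrapositive form reconstructed above; this is a sketch, not the authors' implementation.

```python
# Sketch of the pairwise test for a candidate relation A -> B, following
# conditions (7)-(9). Names and defaults are ours; the study uses
# p_c = 0.5 and alpha_b = 0.5, with a high alpha_c as well.
import numpy as np
from scipy.stats import binomtest, chi2_contingency

def supports_relation(a, b, p_c=0.5, alpha_b=0.5, alpha_c=0.5):
    """a, b: boolean success vectors for items A and B over all examinees."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    # (7) P(B|A) >= p_c: binomial test restricted to examinees mastering A.
    c7 = binomtest(int((a & b).sum()), int(a.sum()), p_c,
                   alternative='greater').pvalue <= alpha_b
    # (8) P(not-A | not-B) >= p_c: the contrapositive condition.
    c8 = binomtest(int((~a & ~b).sum()), int((~b).sum()), p_c,
                   alternative='greater').pvalue <= alpha_b
    # (9) P(B|A) != P(B): chi-square independence test on the 2x2 table.
    table = [[int((a & b).sum()), int((a & ~b).sum())],
             [int((~a & b).sum()), int((~a & ~b).sum())]]
    c9 = chi2_contingency(table)[1] <= alpha_c
    return c7 and c8 and c9
```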
5 Logistic regression with the POKS model
As mentioned, the POKS network builds relations among observable question items; there are no hidden nodes in the network. However, to infer mastery of concepts from the assessment of observable question item nodes, we need to add links from question items to concepts. Logistic regression models are used to link concept nodes to question items. For each concept node in figure 1, a logistic regression model is built over the observable question items the concept is directly or indirectly linked to. For example, concept node CL (left side of figure 1) is linked to question items X1, X2, and X13 through a logistic regression model. The model's parameters are again estimated from Vomlel's data, where concept node mastery was independently assessed by experts (see below).
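The sketch below illustrates this linking step for a single concept under stated assumptions: the response matrix, the expert labels, and the variable names are hypothetical stand-ins for Vomlel's data, which we do not reproduce here.

```python
# Sketch: linking one concept node (e.g., CL) to its question items
# (X1, X2, X13) with logistic regression. The data here is simulated; in
# the study, X holds the 149 students' answers and y the expert labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(149, 3))       # answers to X1, X2, X13
y = (X.sum(axis=1) >= 2).astype(int)        # stand-in for expert assessments

concept_model = LogisticRegression().fit(X, y)

# Given a pattern of observed answers, the model returns the probability
# that the concept is mastered:
print(concept_model.predict_proba([[1, 1, 0]])[0, 1])
```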
6 Methodology and data
The experiments are conducted over a dataset composed of 20 question items and 19 concept nodes. This data was graciously provided to us by Jiří Vomlel. The 20 questions were administered to 149 high school students. Concept nodes are not directly observed, but experts analyzed each individual's test answers to determine the mastery of each concept (except for the two hidden concepts in figure 1). This data was used by Vomlel to build figure 1's Bayesian network model and to conduct simulations assessing predictive accuracy. We use the same data for the POKS simulations. Akin to Vomlel (2004), and in order to avoid over-calibration, model construction and calibration are done on the full data set minus the data case for which a simulation is performed. This implies, for example, that 149 different POKS models are built for the simulations, one for each simulation data case. The number of relations obtained for POKS with alpha error α_b = 0.5 and minimal conditional probability p_c = 0.5 (see section 4) is 206 on average with Vomlel's data of 20 items and 149 data cases.

¹ Limiting propagation to direct neighbours does not correspond to the algorithm described in Desmarais et al. (1995), where propagation is performed in accordance with the algorithm described in Neapolitan (1998). The choice not to propagate further in this study is made in order to use the simplest model possible.
7 Entropy reduction

For both the BN and POKS approaches, the order of questions is adaptive and determined by entropy minimization. The entropy of a single item $X_i$ is defined by the usual formula:

$$H(X_i) = -\left[ P(X_i) \log P(X_i) + Q(X_i) \log Q(X_i) \right]$$

where $Q(X_i) = 1 - P(X_i)$. The entropy of the whole test is the sum of the individual items' entropies:

$$H_T = \sum_{i}^{k} H(X_i)$$

If all items' probabilities are close to 0 or 1, the value of $H_T$ will be small and there will be little uncertainty about the examinee's ability score. We minimize uncertainty by choosing the item with the lowest expected value of test entropy, given by:

$$E_i(H_T') = P(X_i)\, H_T'(X_i = 1) + Q(X_i)\, H_T'(X_i = 0)$$

where $H_T'(X_i = 1)$ is the test entropy after the examinee answers item $i$ correctly and $H_T'(X_i = 0)$ is the entropy after a wrong answer.
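A compact sketch of this selection rule follows. Here `posterior` stands for a model-specific update function (hypothetical, not from either implementation) that returns the success probabilities of all items after a candidate observation.

```python
# Sketch of entropy-driven item selection. `p` is the vector of current
# success probabilities; `posterior(i, outcome)` is assumed to return the
# updated probability vector after observing item i with that outcome.
import numpy as np

def test_entropy(p):
    """H_T = sum_i H(X_i), with H(X_i) = -[p log p + q log q]."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    q = 1.0 - p
    return float(-(p * np.log(p) + q * np.log(q)).sum())

def next_item(p, candidates, posterior):
    """Choose the unobserved item minimizing the expected test entropy
    E_i(H'_T) = P(X_i) H'_T(X_i=1) + Q(X_i) H'_T(X_i=0)."""
    def expected_entropy(i):
        return (p[i] * test_entropy(posterior(i, 1))
                + (1 - p[i]) * test_entropy(posterior(i, 0)))
    return min(candidates, key=expected_entropy)
```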
8 Simulation process

Two types of performance assessment are conducted. The question predictive accuracy assessment provides an estimate of the proportion of questions that are correctly predicted as succeeded or failed. We compare the model's predictions to the actual examinee's answers as the model is given from 0 to all 20 question items as observations. Of course, such performance ends at 20 items with a 100% correct "estimate". The concept predictive accuracy assessment proceeds with the same process of feeding the model with observed questions, but we measure the accuracy of concept prediction. Recall that concepts are not directly observed, but they were independently assessed by experts.
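Schematically, the simulation loop can be pictured as below. The `build_model`, `select_item`, `update`, and `score` names are our own abstraction of the two systems, not an API from either implementation.

```python
# Schematic leave-one-out simulation loop, following the protocol above:
# calibrate on all cases but one, then feed that examinee's answers one at
# a time and score the predictions after each observation.
def simulate(cases, build_model, select_item, score):
    n_items = len(cases[0])
    accuracies = [[] for _ in range(n_items + 1)]
    for held_out in range(len(cases)):
        train = cases[:held_out] + cases[held_out + 1:]
        model = build_model(train)                # 149 models in total
        examinee = cases[held_out]
        accuracies[0].append(score(model, examinee))
        for n_observed in range(1, n_items + 1):
            item = select_item(model)             # entropy minimization,
            model.update(item, examinee[item])    # feed the true answer
            accuracies[n_observed].append(score(model, examinee))
    return [sum(a) / len(a) for a in accuracies]  # mean accuracy per step
```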
9 Question predictive accuracy

Figure 2 illustrates the performance of both the POKS and BN approaches for predicting question item successes.

Figure 2: Question predictive accuracy. Prediction score (0.70 to 1.00) as a function of the number of items administered (0 to 20), for the adaptive BN and POKS sequences and a fixed sequence.
Both curves are averages over the 149 data cases. A third curve (dotted line) shows the performance of a non-adaptive, fixed question sequence, where all examinees get the same question sequence regardless of their previous answers. The fixed sequence orders items by their initial entropies: it starts with items whose average success rate is closest to 0.5 and finishes with items whose success rate is closest to 0 or 1. The simulation results show that the POKS technique predicts answers to questions a few percentage points more accurately than the BN. The gain is not very strong, but it is systematic, and it is more significant after the fifth question, especially relative to the number of items not yet observed. In fact, the BN does not perform systematically better than a fixed question sequence.
10 Concept predictive accuracy
Figure 3 reports the performance of each technique for predicting concept mastery. Recall that mastery of concepts was assessed independently by experts. Concept nodes in the network are not observed during the simulation; thus prediction does not reach 100% accuracy as it does in the question predictive simulation. For this experiment, the performance of the BN model is clearly stronger than that of POKS. The BN approach quickly climbs from 74% correct to about 90% correct within the first 5 items observed, and it stabilizes at close to 92% after a couple more items. The POKS performance is about 5% weaker after the first item, and this difference remains almost stable throughout the remaining items observed.
However, the fact that POKS is weaker after all 20 items are administered indicates that the logistic regression model used does not perform as well as the BN. It suggests that the observed performance gap is mostly attributable to the logistic regression component rather than to the POKS inferences.

An obvious follow-up experiment would be to verify whether combining inferences from POKS and BN, instead of using the logistic regression model, would improve over the BN's current performance at the concept level. Unfortunately, the two systems are not integrated and we could not conduct this experiment in time for this publication. However, the question accuracy results clearly indicate that if concepts were assessed with the classical test construction technique, by which a teacher breaks down a subject matter into a set of more specific topics and assesses the mastery of each topic by one or more test items, possibly with a weighted mean, then concept assessment accuracy would be improved by POKS.

Figure 3: Concept predictive accuracy. Prediction score (0.70 to 1.00) as a function of the number of items administered (0 to 20), for the BN and POKS.

Figure 4: A simple knowledge space composed of 4 items ({a, b, c, d}) and with a partial order that constrains possible knowledge states to {∅, {d}, {d, b}, {d, c}, {d, b, c}, {d, b, c, a}}. The items are fraction arithmetic problems (e.g., 2 × 1/2 = 1 and 2/3 + 5/8 = 31/24).
11 Discussion
Why does a simple Bayesian posterior update scheme, which rests on strong independence assumptions and does not rely on any hidden nodes or knowledge-engineered structure, perform better for question accuracy than a BN? Unfortunately, we do not have a clear answer to this question. However, we note that such a finding may not be exceptional, as other studies have also concluded that simpler approaches relying upon stronger assumptions often outperform more sophisticated approaches under certain conditions (see, e.g., Domingos & Pazzani, 1997). What are these possible conditions here? We conjecture a few observations and partial explanations in the following paragraphs, hoping that they can help bring some light.

First, we note that other experiments over two other knowledge domains (knowledge of UNIX shell commands and mastery of written French) have also shown that the POKS approach performs at least as well as standard techniques in Computer Adaptive Testing, namely Item Response Theory (IRT) (Desmarais & Pu, 2005).

Let us also rule out the explanation that the BN used in this study is ill-structured and underperformant, since it did perform well at the concept prediction level, better in fact than POKS with logistic regression. This observation also makes less likely the explanation that, because POKS uses binary relations only, it needs a smaller sample size than the more complex n-ary relations found in the BN; good performance at the concept level suggests that 149 data cases is probably enough. It appears that, when it comes to concept predictive accuracy, Vomlel's BN can use relations among concepts to effectively predict concept mastery, in spite of its relatively poorer ability to predict question mastery from concept mastery.
Another explanation stems from the fact that Vomlel's BN does not build links directly amongst question items themselves. This practice is typical of the BN used in the knowledge assessment and user modeling research literature. It also makes good sense, since question items and assessment tests have a short life span and are frequently updated: the knowledge engineering effort required to build a BN among test items would prove inefficient, unless the process can be fully automated as in POKS. Nevertheless, by not directly linking question items among themselves, it is conceivable that the predictions miss valuable information that POKS exploits. It is also possible that, by relying on question items to infer concept mastery in order to, in turn, predict question mastery, the evidence propagated loses weight
and gathers noise. This could explain why direct links between question items are more effective, in spite of the strong assumptions used for building these links.

Another, potentially more interesting hint at why POKS did relatively well with question items lies in the structural properties of this domain. These properties are best understood by looking back at the theory of knowledge spaces referred to in section 4. This theory is well known in mathematical psychology, and it states that knowledge items are mastered in a constrained order. For example, we learn to solve figure 4's problems in an order that complies with the arrows. It follows from this structure that if one succeeds at item (c), it is likely she will also succeed at item (d). Conversely, if she fails item (c), she will likely fail item (a). However, item (c) does not significantly inform us about item (b). This structure defines the following possible knowledge states: {∅, {d}, {d, c}, {d, b}, {d, b, c}, {d, b, c, a}}. Other knowledge states are deemed impossible (or unlikely, in a probabilistic framework). Formally, Falmagne and his colleagues argue that if the space of individual knowledge states is closed under union and intersection, then the set of all possible knowledge states can be represented by a directed acyclic graph (DAG)², such as the one in figure 4. This closure implies that, given a relation A → B, the absolute frequency of people who master knowledge item A will necessarily be smaller than the frequency of those who master B. This conclusion does not hold for general BN. For example, assume figure 4's structure stands for the following (a BN taken from Neapolitan, 2004): (a) smoking history, (b) bronchitis, (c) lung cancer, (d) fatigue. It is clear that smoking history (a) could be a much more frequent state than lung cancer (c) or bronchitis (b). It is also clear that, whereas the occurrence of lung cancer could decrease the probability of bronchitis by discounting the latter as a plausible explanation for fatigue, such discounting does not play a role in knowledge structures (e.g., observing figure 4's (c) does not decrease the probability of (b); on the contrary, it could increase it). In short, many interactions found in general BN do not occur in knowledge structures. We conjecture that this reduction in the space of possibilities that characterizes the domain modeled in this experiment, namely the closure under union and intersection in knowledge spaces, warrants the use of strong independence assumptions in Bayesian modeling. It allows the modeling of the domain by a pairwise analysis of variable relations, thereby considerably reducing the computational complexity and the required size of the learning data set. This last explanation is interesting because it links network structural properties (closure under union and intersection) to the level of assumption violation we can expect. However, we must emphasize that this explanation is speculative and not directly supported by empirical evidence from the current experiment. Further investigation is required to support such a claim.

² In fact, Falmagne and colleagues show that the set of all knowledge states is closed under union only, not under intersection, and that an AND/OR graph is the proper structure. For our purpose, we make the assumption/approximation that it is closed under union and intersection and that a DAG is a proper representation of the ordering.
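For illustration, the closure property invoked above can be checked mechanically for figure 4's six states. This little script is ours, not part of the original study; it simply verifies that every pairwise union and intersection of states is itself a state.

```python
# Check that figure 4's knowledge states are closed under union and
# intersection (illustration only).
from itertools import combinations

states = {frozenset(s) for s in
          [(), ('d',), ('d', 'b'), ('d', 'c'), ('d', 'b', 'c'),
           ('d', 'b', 'c', 'a')]}

closed = all((a | b) in states and (a & b) in states
             for a, b in combinations(states, 2))
print(closed)  # True: the state space supports a DAG representation
```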
12 Acknowledgements
We are grateful to Jiří Vomlel for giving us valuable feedback on an early draft of this paper and for providing the data used in this experiment. This work has been supported by the National Research Council of Canada.
13 References
Conati, C., Gertner, A., & VanLehn, K. (2002). Using Bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction, 12(4), 371–417.

Desmarais, M. C., Maluf, A., & Liu, J. (1995). User-expertise modeling with empirically derived probabilistic implication networks. User Modeling and User-Adapted Interaction, 5(3-4), 283–315.

Desmarais, M. C., & Pu, X. (2005). Computer adaptive testing: Comparison of a probabilistic network approach with item response theory. Proceedings of the 10th International Conference on User Modeling (UM'2005), Edinburgh (to appear).

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.

Falmagne, J.-C., Koppen, M., Villano, M., Doignon, J.-P., & Johannesen, L. (1990). Introduction to knowledge spaces: How to build, test, and search them. Psychological Review, 97, 201–224.

HUGIN Expert (2002). HUGIN Explorer, ver. 6.0, computer software (Technical report). http://www.hugin.com.

Millán, E., Trella, M., Pérez-de-la-Cruz, J.-L., & Conejo, R. (2000). Using Bayesian networks in computerized adaptive tests. In M. Ortega & J. Bravo (Eds.), Computers and Education in the 21st Century (pp. 217–228). Kluwer.

Mislevy, R. J., Almond, R. G., Yan, D., & Steinberg, L. S. (1999). Bayes nets in educational assessment: Where the numbers come from. In K. B. Laskey & H. Prade (Eds.), Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99) (pp. 437–446). San Francisco, CA: Morgan Kaufmann.

Neapolitan, R. E. (1998). Probabilistic Reasoning in Expert Systems: Theory and Algorithms. New York: John Wiley & Sons.

Neapolitan, R. E. (2004). Learning Bayesian Networks. New Jersey: Prentice Hall.

Vomlel, J. (2004). Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(Supplementary Issue 1), 83–100.