Bayesian Modelling of Confusability of Phoneme-Grapheme Connections1

Mikko VILENIUS†, Janne V. KUJALA‡, Ulla RICHARDSON‡, Heikki LYYTINEN‡, and Toshio OKAMOTO†

† Graduate School of Information Systems, Artificial Intelligence and Knowledge Computing Laboratory, University of Electro-Communications, Chofugaoka 1-5-1, Chofu-shi, Tokyo, Japan
‡ Agora Center, University of Jyväskylä, P.O. Box 35, FI-40014 University of Jyväskylä, Finland
† {mikko,okamoto}@ai.is.uec.ac.jp, ‡ [email protected], {ulla.richardson,heikki.lyytinen}@jyu.fi

Abstract

Deficiencies in the ability to map letters to sounds are currently considered to be the most likely early signs of dyslexia [4]. This has motivated the use of Literate, a computer game for training this skill, in several Finnish schools and households as a tool in the early prevention of reading disability. In this paper, we present a Bayesian model that uses a student's performance in a game like Literate to infer which phoneme-grapheme connections the student currently confuses with each other. This information can be used to adapt the game to a particular student's skills as well as to provide information about the student's learning progress to parents and teachers. We apply our model to empirical data collected using Literate. Based on these data, we use Bayesian methodology to evaluate and compare three submodels with different restrictions on the possible confusability relations.

1. Introduction

We suggest a Bayesian model of the confusability of phoneme-grapheme connections. The model is motivated by the need to analyse student log files from Literate [2], an adaptive online computer game designed to support the Finnish-language reading acquisition of young children. We use Bayesian methodology because it is insensitive to the adaptation logic used to choose the contents of each trial [3].

In the Finnish language, phoneme-grapheme connections are 100% consistent: each letter has its own phoneme and each phoneme its own letter. Literate teaches these connections through repetition. The game proceeds in the following manner: an item, i.e., a phoneme-grapheme pair, is randomly chosen from a predefined active set of items. The phoneme is presented auditorily, and the student tries to select the corresponding grapheme among multiple choices comprising the correct one (the target) and one or more distractors chosen from the active set. The selection time is limited. Each game session consists of about 30 or more such selections (trials). The active set is varied between sessions.

In the following sections, we define the general model as well as three submodels with different restrictions on the possible confusability relations. We then compare the three submodels to each other to find the one that best fits the empirical data obtained from Literate, and further evaluate that model's goodness of fit.

1 This work was supported by the European Commission's FP6 (MCEXT-CE-2004-014203) and Tekes, the Finnish Funding Agency for Technology and Innovation. The authors would like to thank Benja Fallenstein for discussions.

2. The Model

In this section, we describe the basic construction of our model. We follow the usual Bayesian notation, denoting by θ the unknown quantity we are inferring, by p(θ) the prior distribution, and by p(y | θ) the likelihood of the data y given a value of the unknown parameter. Then, p(θ | y) is the posterior distribution of θ given the data y. These components are described below, yielding the general model; different constraints on the unknown parameter yield three different submodels.

Data We assume that the data of each game session consist of several statistically independent trials y1, ..., yT, each of which is a triple (P, t, a), where P is the set of multiple-choice options, t is the target option, and a is the actual

Seventh IEEE International Conference on Advanced Learning Technologies (ICALT 2007) 0-7695-2916-X/07 $25.00 © 2007

answer of the student.

Unknown parameter The unknown parameter θ to be inferred is a binary matrix representing the confusability relation of the items, i.e., θij = 1 means that phoneme i is confused with letter j.

Likelihood Given the data yk = (P, t, a) of trial k, with target t, answer a, and P the letters in the trial (including the target and the distractors), the likelihood for the matrix θ is given by

p(yk | θ) = (θta / Σj∈P θtj)(1 − δ) + (1/#P) δ,

where #P is the number of letters in the trial. The term with δ = 0.05 adds a small chance of random errors, such as slips in selecting an item even though the student knows the correct answer.

Submodels In mathematics, the following constraints are often considered for relations:

1. In a reflexive (R) relation, θii = 1.
2. In a symmetric (S) relation, θij = θji.
3. In a transitive (T) relation, θij = θjk = 1 ⇒ θik = 1.

We define three submodels by different combinations of the above constraints: TSR, SR, and R. It seems intuitive that the relation should be reflexive in our application, unless the student has systematically learned to connect the wrong grapheme with a certain phoneme. As this seems an unlikely scenario, we assume reflexivity in all submodels. However, it is easy to imagine that transitivity can be violated over long chains of items, and there are also conceivable ways in which symmetry could be violated; see Fig. 1.
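As a minimal sketch of the per-trial likelihood above (our own illustration, not the authors' code; all names are assumptions), the formula can be evaluated directly:

```python
def trial_likelihood(theta, P, t, a, delta=0.05):
    """Likelihood p(y_k | theta) of observing answer `a` on a trial with
    option set P and target t, given the binary confusion matrix theta.

    theta[t][j] = 1 means phoneme t is confusable with letter j.  With
    probability delta the student lapses and answers uniformly at random
    among the len(P) options; otherwise the answer is uniform among the
    options confusable with the target.
    """
    confusable = sum(theta[t][j] for j in P)  # >= 1 under reflexivity
    return (theta[t][a] / confusable) * (1 - delta) + delta / len(P)

# Example: target 0 among options {0, 1}; the student confuses 0 with 1.
theta = {0: {0: 1, 1: 1}, 1: {0: 0, 1: 1}}
p_correct = trial_likelihood(theta, P=[0, 1], t=0, a=0)
p_wrong = trial_likelihood(theta, P=[0, 1], t=0, a=1)
```

In this example both answers get likelihood 0.5: when the two letters are fully confused, the model predicts chance-level performance on that target.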

Figure 1. An example of how symmetry of the confusability relation can be violated. The student's phonetic categories may not be centered on the prototypical phoneme sounds, yielding confusion of A to B but not vice versa.

In the least restrictive R model, the rows of the confusion matrix are mutually independent in the uniform prior

n    TSR          SR                  R
1      1           1                  1
2      2           2                  4
3      5           8                 64
4     15          64              4 096
5     52       1 024          1 048 576
6    203      32 768      1 073 741 824
7    877   2 097 152  4 398 046 511 104

Table 1. Number of matrices nθ for each model, where n is the number of items in the active set.
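The counts in Table 1 follow from closed forms: a reflexive matrix leaves each of the n(n−1) off-diagonal entries free, giving 2^{n(n−1)} R matrices; symmetry halves the free entries, giving 2^{n(n−1)/2} SR matrices; and the TSR matrices are exactly the equivalence relations on n items, counted by the Bell numbers. A sketch reproducing the table (our own illustration):

```python
def count_R(n):
    # Reflexive: diagonal fixed to 1, each of the n*(n-1) off-diagonal entries free.
    return 2 ** (n * (n - 1))

def count_SR(n):
    # Symmetric + reflexive: one free bit per unordered off-diagonal pair.
    return 2 ** (n * (n - 1) // 2)

def count_TSR(n):
    # Transitive + symmetric + reflexive relations are equivalence relations,
    # counted by the Bell number B_n (last entry of row n of the Bell triangle).
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

# e.g. n = 7 gives 877, 2 097 152, and 4 398 046 511 104, matching Table 1.
```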

distribution, and because the dependence of the likelihood on θ separates by the target, the posterior also has this independence, so all computations can be done for each target (row) independently. The posterior probability for a matrix is then the product ∏r p(θr | y) over rows r = 1, ..., n.

Prior Anticipating a wide range of student behaviour, we chose a noninformative flat prior distribution for each of our submodels, giving the same prior probability p(θ) = 1/nθ to each possible matrix; see Table 1.

Inference To have the posterior distribution available at all times, all the binary matrices of the submodel in use were created, and the posterior probability of each matrix was updated after each trial. For inference we can, for example, calculate the posterior marginal distribution of a given element θij of the unknown confusion matrix, yielding the posterior probability that the student confuses those specific items. Modes of the posterior distribution are also of interest: if the mode dominates the posterior distribution, it can be used as a single estimate of the confusability matrix.
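The exhaustive enumerate-and-update scheme described above can be sketched as follows (our own illustration under the R model with a tiny active set; function names and the toy session are assumptions):

```python
from itertools import product

def enumerate_R(n):
    """All reflexive 0/1 matrices on n items (diagonal fixed to 1)."""
    off = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in product((0, 1), repeat=len(off)):
        theta = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
        for (i, j), b in zip(off, bits):
            theta[i][j] = b
        yield theta

def posterior_update(posterior, trial, delta=0.05):
    """One Bayesian step: weight each matrix by the trial likelihood and
    renormalize.  posterior is a list of (theta, probability) pairs."""
    P, t, a = trial
    new = []
    for theta, prob in posterior:
        s = sum(theta[t][j] for j in P)
        lik = (theta[t][a] / s) * (1 - delta) + delta / len(P)
        new.append((theta, prob * lik))
    z = sum(p for _, p in new)
    return [(theta, p / z) for theta, p in new]

# Flat prior over the 2^(n(n-1)) reflexive matrices for n = 2 items.
matrices = list(enumerate_R(2))
post = [(m, 1.0 / len(matrices)) for m in matrices]
# The student answers target 0 correctly ten times among options {0, 1}:
for _ in range(10):
    post = posterior_update(post, ([0, 1], 0, 0))
# Posterior marginal that theta[0][1] = 1, i.e. phoneme 0 confused with letter 1:
marginal_01 = sum(p for theta, p in post if theta[0][1] == 1)
```

After ten correct answers the marginal probability of confusion drops well below the prior value of 0.5, as expected.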

3. Empirical Evaluation of the Submodels

The empirical data include all sessions with 25 or more trials from 420 individual children with varying backgrounds and skill levels. Altogether, the game data consist of 4 946 game sessions, either with a four-letter active set (963 sessions) or a seven-letter active set (3 983 sessions).

3.1. Comparing the submodels: Bayes factors

We compared the submodels using Bayes factors [1]. The results can be found in Table 2. Due to the computational complexity of the SR model, whose symmetry constraint couples the rows of the matrix and thus prevents the row-wise factorization available in the R model, the calculations could be performed only for sessions with a four-letter active set.


Bayes factor     4 item sets              4 and 7 item sets
                 TSR vs. SR   SR vs. R    TSR vs. R
BF ≥ 10               0          488          4 042
10 > BF > 1         802          212            296
1 > BF > 0.1        124          182            293
BF ≤ 0.1             37           81            315
total sessions      963          963          4 946

Table 2. Comparison of the models using Bayes factors. BF > 1 means that the first model is better than the second; BF > 10 is considered strong evidence for the first model [5].
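The Bayes factor between two submodels is the ratio of their marginal likelihoods, p(y | M1)/p(y | M2), each obtained by averaging the session likelihood over the model's flat prior on matrices. A sketch under our earlier toy setup (our own illustration; the two-item matrix lists and the toy session are assumptions):

```python
def marginal_likelihood(matrices, trials, delta=0.05):
    """p(y | M): the session likelihood averaged over the model's flat prior."""
    total = 0.0
    for theta in matrices:
        lik = 1.0
        for P, t, a in trials:
            s = sum(theta[t][j] for j in P)
            lik *= (theta[t][a] / s) * (1 - delta) + delta / len(P)
        total += lik
    return total / len(matrices)

def bayes_factor(m1, m2, trials):
    """BF > 1 favours the first model."""
    return marginal_likelihood(m1, trials) / marginal_likelihood(m2, trials)

# Toy session with n = 2 items: five correct answers for each target.
trials = [([0, 1], 0, 0)] * 5 + [([0, 1], 1, 1)] * 5
tsr = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]                   # equivalence relations
r_all = [[[1, a], [b, 1]] for a in (0, 1) for b in (0, 1)]   # all reflexive matrices
bf = bayes_factor(tsr, r_all, trials)                        # > 1: data favour TSR
```

Even though every TSR matrix is also an R matrix, the R model spreads its prior over extra matrices that explain these data poorly, so the Bayes factor favours TSR; this is the automatic Occam's razor discussed in Section 3.3.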

The overall best model appears to be the TSR model, although in some isolated sessions the SR or R model was better.

3.2. Evaluating the TSR model's fit to the data

A model's fit to data can be evaluated by first conditioning the model on the actual data, then drawing simulated data values from the posterior predictive distribution, and comparing the actual data to these replicated data yrep. Formally this can be done using so-called p-values [1]:

Pr(T(yrep, θ) ≥ T(y, θ) | y),    (1)

where we use the deviance, T(y, θ) = −2 ln p(y | θ), as the test quantity and approximate the p-value as the proportion of replications for which T(yrep^l, θ^l) ≥ T(y, θ^l). The conditioning and replication of the answers were done for each session individually, and only the answers were simulated for the original sequence of trial contents. For each session, L = 1000 replicate sessions were created with independently drawn θ^l, l = 1, ..., L.

Results Of all sessions, about 2% yielded a p-value of 0.01 or less and about 35% yielded a p-value of 0.99 or more. The remaining sessions fit the data with reasonable accuracy.

In a considerable number of sessions the p-values were above 0.99. In most of these sessions the student answered every trial correctly. As we assumed a 5% lapse rate, the model is indeed formally incorrect, as shown by the p-value. Yet the model works well enough for our needs: it does find the correct matrix, the one in which all letters are distinguished from each other. Another common cause of a high p-value (> 0.99) was sessions in which all letters were confused with each other. In these sessions the deviance of the replicated answers was often exactly the same as for the original data, yielding a high p-value due to definition (1), which does not

distinguish between the cases T(yrep^l, θ^l) > T(y, θ^l) and T(yrep^l, θ^l) = T(y, θ^l). With continuous parameters the latter case almost never happens, but with discrete parameters it should be noted that equality of the test quantities does not count as a difference in either direction.

If we eliminate the sessions explained above, in which we have observed that the model works adequately despite the high p-value, then of the 1 728 sessions with a high p-value (> 0.99) we are left with 146 sessions out of the total of 4 946, only 2.95%.

The roughly 100 sessions with a low p-value generally contained a large number of trials. It is natural that with a large amount of data, any deficiencies of the model start to show up. In these cases, it appears that the assumed lapse rate may have been too low.
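The replication scheme above can be sketched as follows (our own illustration; the posterior is the list of weighted matrices from the enumeration step, and all function names are assumptions): draw θ^l from the posterior, simulate answers for the original trial sequence, and compare deviances.

```python
import math
import random

def simulate_answer(theta, P, t, delta=0.05, rng=random):
    """Draw one student answer under the model: lapse with prob. delta,
    otherwise a uniform choice among the options confusable with the target."""
    if rng.random() < delta:
        return rng.choice(P)
    return rng.choice([j for j in P if theta[t][j] == 1])

def deviance(theta, trials, delta=0.05):
    """Test quantity T(y, theta) = -2 ln p(y | theta)."""
    ll = 0.0
    for P, t, a in trials:
        s = sum(theta[t][j] for j in P)
        ll += math.log((theta[t][a] / s) * (1 - delta) + delta / len(P))
    return -2.0 * ll

def ppp_value(posterior, trials, L=1000, rng=random):
    """Posterior predictive p-value: the proportion of replications with
    T(yrep^l, theta^l) >= T(y, theta^l).  posterior: (theta, prob) pairs."""
    thetas, probs = zip(*posterior)
    count = 0
    for _ in range(L):
        theta_l = rng.choices(thetas, weights=probs)[0]
        rep = [(P, t, simulate_answer(theta_l, P, t, rng=rng)) for P, t, _ in trials]
        if deviance(theta_l, rep) >= deviance(theta_l, trials):
            count += 1
    return count / L
```

Note how the all-correct case discussed above shows up in this sketch: when the observed answers already attain the smallest possible deviance under θ^l, every replication satisfies the ≥ comparison and the p-value is 1.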

3.3. Discussion

Generally in statistics, a more detailed, less restricted model should fit the data better. However, Bayesian methods also embody the principle of Occam's razor automatically and quantitatively [3]. The TSR model is the most restrictive and the R model the least restrictive of the three submodels. It seems that the data were generally explained well by the matrices of the TSR model and rarely explained better by the additional matrices of the other submodels. As the TSR model assigns the highest prior probability to these matrices, the Bayes factors supported the TSR model.

For the minority of sessions where the model did not fit the data, the reason appeared to be our assumption of a fixed lapse rate δ. One possible solution would be to make δ another parameter of the model. However, this is not a critical issue, as the model still appeared to find the most descriptive matrix in most cases.

References

[1] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC, 2004.
[2] H. Lyytinen, M. Ronimus, A. Alanko, M. Taanila, and A.-M. Poikkeus. Early identification and prevention of problems in reading acquisition. Nordic Psychology, in press.
[3] D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1992.
[4] F. R. Vellutino, J. M. Fletcher, M. J. Snowling, and D. M. Scanlon. Specific reading disability (dyslexia): what have we learned in the past four decades? Journal of Child Psychology and Psychiatry, 45(1):2–40, 2004.
[5] L. Wasserman. Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44:92–107, 2000.
