GENERAL PARADIGMS FOR IMPLEMENTING ADAPTIVE LEARNING SYSTEMS

Wilhelmiina Hämäläinen
Department of Computer Science, University of Joensuu
P.O. Box 111, FIN-80101 Joensuu, FINLAND
[email protected]

ABSTRACT

In this paper, we consider adaptive learning systems in the framework of context-aware computing. We introduce general paradigms for implementing real adaptivity. The paradigms cover inferring the context (user and use situation), selecting the action in a given context, a social filtering approach when the context is not known, and utilizing information about system dynamics in the form of hidden Markov models.

KEYWORDS

Adaptivity, modelling, intelligent tutoring systems, context-aware computing.

1. INTRODUCTION

The idea of intelligent tutoring systems (ITSs) is to adapt the teaching to individual skills, knowledge, and needs, and to give personal feedback just-in-time. Classically, an ITS consists of four components: domain knowledge (learning material), student model (an abstract representation of the learner), tutoring or expert module, and user interface. The "intelligence" of the system is located in the student model and the tutoring module. The student model stores and updates information about each learner. This information typically contains cognitive information (knowledge level, prerequisite knowledge, performance in tests, errors and misconceptions), user preferences (learning style, goals and habits), action history, and possibly some additional information about the learner's attitudes, emotions and motivation. Usually the system concentrates only on cognitive diagnosis, i.e. it determines the student's knowledge level from her/his performance data. The tutoring module is responsible for selecting suitable actions like generating tests and exercises, giving hints and explanations, suggesting learning topics, and searching for learning material and collaborative partners (Chou et al. 2003, Cheung et al. 2003, Wasson 1997, Shute & Psotka 1994).

Most ITSs are quite restricted: there is a pre-defined learning path, through which the student proceeds sequentially from one concept unit to another. After each unit the student is tested, and it is determined whether s/he can enter the next unit or should stay and practise more at the current level. The system is usually implemented as a rule-based system with pre-defined rules (Hatzilygeroudis & Prentzas 2004). More advanced systems use fuzzy or probabilistic rules (Vos 1999, Hwang 2003) to prevent students from entering the next phase too early. The underlying idea is that the student's knowledge is considered a subset of the expert's knowledge, and the student should work until this gap is filled (Carr & Goldstein 1977). Thus, the adaptivity of such systems means that the students are adapted to some existing model or theory, instead of adapting the model to the reality of the students.


Figure 1. The iterative process of descriptive and predictive modelling.

In this paper, we introduce the opposite approach, in which the model is learnt from real users, and the learners are not bound to one path but can develop freely. The main principle is an iterative cycle of descriptive and predictive modelling (Figure 1), which combines the classical paradigms of data mining and machine learning. First, we collect data and analyze it with several descriptive models, which in turn are used for constructing a suitable predictive model. After applying a predictive model, its outcomes can be analyzed again and new descriptive and predictive models constructed. This principle can be applied in several places: learning probabilistic rules from association rules, classes from clusters, and Markov models from episodes.

To restrict the future development of ITSs as little as possible, we have adopted a wide view of context-aware computing (ubiquitous computing), in which the whole context - the user herself, her actual situation and all relevant information - is used in determining the most appropriate action. In traditional learning systems, this information is gathered directly by the application, but in addition several sensors can be used. A common personal computer can be equipped with a microphone, camera, light-pen, data-glove, etc., which recognize the user and observe her interaction with the computer. There are already applications which analyze voice and camera images to infer the user's current task, mood or intention (Starner et al. 1998, Toivanen et al. 2003, Stiefelhagen et al. 2001). Mobile devices are especially useful, because they are carried by the user. We can get the user's location and nearby people by GPS (Global Positioning System). Indoors, the location can be recognized by a network of ultrasonic or radio beacons. The change of orientation (movements) can be recognized by inertial sensors (acceleration and rotation), motion sensors (change of motion) or a camera. An IR (infra-red) sensor reveals the proximity of humans or other warmth sources. Light level, temperature, pressure and CO gas can be measured by simple sensors. In special education, we can even utilize biological sensors, which measure pulse, blood pressure and body temperature. For example, dyslexia is affected by stress and emotional factors, and often causes secondary symptoms like an increase in body temperature.

In the following, we will first introduce our framework of context-aware computing and the basic idea of selecting the most appropriate action. Then we will introduce three general paradigms for implementing adaptivity in practice. The first two paradigms describe context inference by classification, action selection once the context has been determined, and social filtering when the context is not known. The third paradigm describes how information about context changes can be embedded into context inference and action selection in the form of Markov chains.

2. HIERARCHY OF CONTEXTS

Contexts are usually divided into primary and secondary contexts. The primary or low-level context means the environmental characteristics which can be obtained directly from the sensors: location, time, nearby objects, network bandwidth, orientation, light level, sound, temperature, etc. The sensors can either measure physical parameters of the environment or gather logical information from the host (e.g. current time, GSM cell, selected action), and they are called physical or logical sensors accordingly (Schmidt et al. 1999). The secondary or high-level context means a more abstract context which is derived from the primary context: the user's social situation, current activity, mental state, etc.

However, this division is quite artificial. Features which are secondary contexts for some applications can be further processed and combined to offer higher-level contexts for other applications. Thus we have adopted a hierarchical view, in which we have a continuum of contexts at different abstraction levels. In small systems, there may be no need to separate sensor processing and context inference from the actual application, but generally it is more efficient to separate these tasks. If the same contexts are used in several applications, the sensor processing and the extraction of low-level contexts can be managed in one place. This also reduces data overload, because preprocessing and compressing the data is done at a low level.

In adaptive learning systems, the hierarchical view is especially appropriate. The special nature of educational applications appears only at the highest levels, where we have to define the high-level contexts and select the most suitable actions. Most of the data originates from logical sensors and is thus already higher-level data which does not require preprocessing. Typically we have small but clean data sets of discrete-valued data. Physical sensors also offer valuable information, but their processing does not differ from other applications, and we can use general context servers which preprocess the sensor data and extract the low-level contexts.

In educational systems it is especially important that the system is transparent, i.e. the users can see how it works (Elithorn & Banerji 1984). This concerns not only the evaluators (teachers) but also the students, who are grouped and classified and have the right to know how it is done. When the sensor data is abstracted, it can be processed by symbolic data mining and machine learning techniques, which are easier to design and understand.

3. THE BASIC APPROACH

Our main problem is the following: we are given a set of contexts C = {c1,..., cn} and a set of actions A = {a1,..., am}. The task is to determine the most appropriate action aj in a given situation described by a data vector x. In the ideal case, the situation corresponds to one of the predefined contexts ci. In practice, however, the situation may contain features from several contexts and can be defined as a combination of contexts. There are basically two approaches: either we first infer the current context and then select the action in it, or we infer the most appropriate action directly. We will see that the latter approach also infers the context, but only implicitly.

Let us first consider methods based on explicit context inference. The simplest approach is a combination of discriminative classification and a rule-based system. In discriminative classification only one class, the most probable context, is selected according to the input data, i.e. the data vector x is mapped to a single value ci, i = 1,..., n. The simplest rule-based system consists of deterministic rules of the form "If context is ci, then select action aj". Default actions can be defined for the cases when data is missing. The obvious shortcoming is that we lose a lot of information with such a deterministic policy. A better approach is to use probabilistic classification, which produces the probability distribution P(C = ci|x) over all contexts ci. In the same way, we can learn probabilistic rules for selecting the most appropriate action. The probabilistic rules are of the form "If context is ci, then select action aj with probability p". These can be interpreted as conditional probabilities P(A = aj|C = ci) = p, which give the probability that an action is the most appropriate or desired one in context ci.

Another approach is to infer the action directly, given the data x describing the situation. This kind of approach is used in social filtering. The idea is to select an action which has been selected in similar situations by other users. By recognizing similar situations we in fact cluster the situations, and the resulting cluster can be interpreted as a high-level context.
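To make the probabilistic variant concrete, the following sketch (in plain Python) combines a context distribution P(C = ci|x) with probabilistic rules P(A = aj|C = ci) and selects the action with the highest expected appropriateness by marginalizing over the contexts. All context names, action names and probability values are purely illustrative.

# Sketch: selecting an action from a context distribution and probabilistic rules.
# All names and numbers are illustrative, not taken from the paper.

# P(C = c_i | x): output of a probabilistic classifier for the current data vector x
context_dist = {"reading": 0.6, "exercising": 0.3, "idle": 0.1}

# P(A = a_j | C = c_i): rules "if context is c_i, then select a_j with probability p"
action_rules = {
    "reading":    {"suggest_example": 0.7, "give_quiz": 0.2, "show_hint": 0.1},
    "exercising": {"suggest_example": 0.1, "give_quiz": 0.3, "show_hint": 0.6},
    "idle":       {"suggest_example": 0.3, "give_quiz": 0.5, "show_hint": 0.2},
}

def best_action(context_dist, action_rules):
    """Return the action maximizing sum_i P(a | c_i) * P(c_i | x), with its scores."""
    scores = {}
    for context, p_context in context_dist.items():
        for action, p_action in action_rules[context].items():
            scores[action] = scores.get(action, 0.0) + p_action * p_context
    return max(scores, key=scores.get), scores

action, scores = best_action(context_dist, action_rules)
print(action, scores)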

4. INFERRING THE CONTEXT

We have already observed that the context can be inferred by classification. But before we can classify a data vector, we should know the possible classes (contexts). These can be decided by the system designer, but a better approach is to analyze the data and check whether the situations fall into natural clusters. Thus, we recommend first constructing a descriptive model - a clustering - and then setting the classes to be the main clusters. A special policy is needed if there are clear outliers which do not belong to any cluster.

The next problem is to select the most appropriate classification method for context inference. The selection depends very much on the application, and no general answer can be given. However, we have tried to evaluate and compare the most common classification methods (decision trees, nearest neighbour methods, the naive Bayes model and multi-layer perceptrons) according to the general requirements of context-aware systems (Table 1). The first criteria deal with the efficiency of reasoning (i.e. the actual classification), learning and updating the model. It should be noticed that the nearest neighbour methods do not build an explicit model, so the learning criterion is skipped for them. Another important observation is that there is always a tension between efficiency and accuracy: with suitable independence assumptions and approximation techniques we can speed up model learning, but at the cost of accuracy. Efficiency and accuracy also depend on the specific data set: some methods work very well with low-dimensional data, but are intractable or perform poorly with high-dimensional data. This means that we can suggest only some guidelines concerning the efficiency and accuracy of a given method.

The other criteria evaluate the ability to learn from small training sets, handle incomplete data (noise and missing values) and mixed variables (both discrete and real-valued), and the natural interpretability of the model.

Table 1. Comparison of different classification methods: decision tree models, nearest neighbour approaches, the naive Bayes model and the multi-layer perceptron. A + sign means that the model supports the property, a - sign that it does not.

1a) Efficient reasoning
1b) Efficient learning
1c) Efficient updating
2. Works with small training sets
3. Works with incomplete data
4. Works with mixed variables
5. Accurate classification
6. Natural interpretation

Generally, the naive Bayes model performs best, but its classification accuracy suffers from the strong independence assumption (all leaf variables are assumed to be conditionally independent). In practice, however, naive Bayes models have proved to work well even when there are clear dependencies between variables (Hand et al. 2002). General Bayesian networks would be able to model inter-variable dependencies and obtain slightly better classification, but at the cost of efficiency.

In the educational technology domain, the most important criteria are 1a, 2, 4, 5 and 6. As mentioned earlier, we typically have small but clean data sets which consist of discrete data, either numerical or categorical. A natural interpretation is essential, because the results are often interpreted by teachers and educational scientists. The efficiency of learning the model is not so critical, because in our paradigm it is done only once, after the course (after constructing a descriptive model). In the next course the model should be updated, but not necessarily in real time. Only the actual classification should be done efficiently, so that the system can adapt to the learner's current situation immediately. For example, if the system offers individual exercises for learners, it should detect when easier or more challenging tasks are desired.
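The following sketch illustrates this kind of context inference on a toy scale: a hand-rolled naive Bayes classifier with Laplace smoothing over a few discrete features. The feature names, context labels (assumed to come from an earlier clustering step) and records are invented for illustration only.

from collections import Counter, defaultdict

# Sketch: naive Bayes context inference on small, discrete educational data.
# All feature names, context labels and records are illustrative.

# (features, context) pairs; features are discrete logical-sensor values
data = [
    ({"task": "quiz",    "tempo": "fast", "errors": "few"},  "confident"),
    ({"task": "quiz",    "tempo": "slow", "errors": "many"}, "struggling"),
    ({"task": "reading", "tempo": "slow", "errors": "few"},  "confident"),
    ({"task": "quiz",    "tempo": "slow", "errors": "few"},  "struggling"),
]

def train_naive_bayes(data, alpha=1.0):
    """Estimate P(C) and P(feature = value | C) with Laplace smoothing alpha."""
    class_counts = Counter(label for _, label in data)
    value_counts = defaultdict(Counter)   # (feature, label) -> Counter of values
    values = defaultdict(set)             # feature -> set of observed values
    for features, label in data:
        for f, v in features.items():
            value_counts[(f, label)][v] += 1
            values[f].add(v)
    priors = {c: n / len(data) for c, n in class_counts.items()}
    def likelihood(f, v, c):
        return (value_counts[(f, c)][v] + alpha) / (class_counts[c] + alpha * len(values[f]))
    return priors, likelihood

def classify(features, priors, likelihood):
    """Return the normalized posterior P(C = c | x) for every context c."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for f, v in features.items():
            p *= likelihood(f, v, c)
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}

priors, likelihood = train_naive_bayes(data)
print(classify({"task": "quiz", "tempo": "fast", "errors": "few"}, priors, likelihood))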

5. SELECTING THE MOST APPROPRIATE ACTION

Next, we will consider two approaches for selecting the desired action. When the context has been explicitly inferred, we only need to learn the rules for selecting the most appropriate action. These rules can be learnt from the action history. The other approach is to determine the action directly, without inferring the context. The first approach is preferable in most educational applications, because the context itself contains important information and selecting the best action is not so straightforward. The latter approach, known as social filtering, suits situations where we do not have any previous data and we can trust the users' ability to select the best actions.

1. Learning from the action history can be performed in several ways. Typically, personalized systems require a learning phase in which the user has to teach the system. The user may fill in an explicit query to give initial data to the system, but most users prefer to teach the system on-line, while it is actually used. That is why we will concentrate on such continuous learning. In the beginning, the user has to do everything manually, and the system only observes and records the varying contexts and the actions selected. As the learning proceeds, the system begins to make suggestions. The learning continues, but becomes more invisible the more data the system has collected. Finally, the system becomes so well adapted that the user has to intervene only in exceptional situations.
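In the simplest case, such rules can be estimated from the recorded (context, action) pairs by counting. The sketch below shows this kind of estimate of P(A = aj|C = ci) with Laplace smoothing; the logged contexts and actions are invented for illustration.

from collections import Counter, defaultdict

# Sketch: learning probabilistic action-selection rules from a logged action
# history. The (context, action) log below is illustrative.

history = [
    ("struggling", "show_hint"), ("struggling", "show_hint"),
    ("struggling", "give_easier_task"), ("confident", "give_harder_task"),
    ("confident", "suggest_new_topic"), ("confident", "give_harder_task"),
]

def action_rules(history, alpha=1.0):
    """Estimate P(A = a | C = c) from counts, with Laplace smoothing alpha."""
    actions = sorted({a for _, a in history})
    counts = defaultdict(Counter)          # context -> Counter of selected actions
    for context, action in history:
        counts[context][action] += 1
    rules = {}
    for context, ctr in counts.items():
        total = sum(ctr.values()) + alpha * len(actions)
        rules[context] = {a: (ctr[a] + alpha) / total for a in actions}
    return rules

print(action_rules(history))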

The main problem of this approach is that it cannot adapt to new situations. Explicit teaching is laborious, and continuous learning takes time. In addition, the system cannot suggest new actions which may be more appropriate than the previously selected ones. Especially in educational applications the user may not herself select the best actions for learning. One solution is to set default values for actions. These default values can be defined by the system designers or learnt from previously collected action histories. The best way is to first collect data from test use and construct a descriptive model, which reveals whether there are typical selections in some situations (contexts). The simplest way is to search for association rules between high-level contexts and actions. This also reveals the frequency of the contexts (the most common and the rarest contexts), in addition to the most typical actions. In educational applications we may also want to favour actions which produce good learning outcomes. Thus we can use only the data of those students who have performed well compared to their starting point. Another, especially feasible solution is to combine the probabilistic rules and a naive Bayes classifier into one general Bayesian network. In the construction phase, the descriptive models (clustering for finding contexts and association rules for finding context-action dependencies) are still preferable.

2. Social information filtering (Shardanand & Maes 1995) offers another solution, which combines both the individual preferences and the knowledge in other users' action histories. Social filtering methods are nowadays very popular in recommendation systems, for example for recommending suitable learning material (Chen et al. 2005, Lee 2001, Papanikolaoum & Grigoriadou 2002). The idea is to recommend new items to the user based on other, similar users' preferences. Typically, the similar users are determined by comparing user profiles. An initial profile may be constructed from an explicit query, or it may be learnt from the user's previous selections. The user profiles can be compared in several ways, for example by computing the mean squared difference or the Pearson correlation of the profiles. Notice that by defining nearest neighbours we are in fact partially clustering the users.

A similar method is used in the Hubs and Authorities algorithm (HITS) (Kleinberg 1998) for finding the most relevant Internet pages. In HITS the pages are assigned hub and authority values according to how good pages (authorities) they refer to, and how good referring pages (hubs) they are referred to by. The hub values are updated according to the authority values, and the authority values according to the hub values, until the system converges and the best authorities are selected. The HITS method can easily be applied to other recommendation systems. For example, Chen et al. (2005) have introduced a system which recommends material to the learner according to her/his ability (knowledge level). The difficulty of the material is initially evaluated by experts and is updated according to student feedback (whether they understood the material or not). The student's ability is initialized according to the course unit, and is updated according to how difficult material s/he could understand.

The social filtering methods are especially useful when we have to define the most appropriate action in a new situation. The idea is to find similar contexts (similar users and/or similar situations) and select the most popular action in them. The profile consists of lower-level contexts c1,...,cn (one of them possibly indicating the user) with their assigned values.
The similarity function can be simply the mean squared difference, or we can give more weight to some elements, e.g. favour the user's previous contexts. In the application of the HITS algorithm we cluster the contexts only according to their actions. First we construct a graph of the current context, all actions selected in it, all other contexts in which the same actions have been used, and all their actions. This set may be too large, and we can prune it according to the lower-level contexts. The action values are initialized by the number of contexts in which they have been selected, and the context values by the number of actions used in them. The values are normalized and updated in the normal manner, until the system converges. Finally, the action or actions with the highest values are selected.

The system is easy to implement, it works in new situations, and quite probably it pleases the user. But in educational environments we have one special problem: what pleases the student is not always the same as the best action. For example, if a student has a tendency to laziness, the system may recommend that he sleep through the lecture, as other similar students have done. Thus we should define the goodness of an action very carefully. This requires existing data about the students' actions and learning performance. For a lazy student, the best actions would be those which have activated other similar students and led to good results. Thus, we should weight the actions according to the students' success.
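A rough sketch of this HITS-style scoring over a context-action graph is given below (plain Python). The graph, the context and action names, and the fixed number of iterations used in place of a proper convergence test are illustrative choices, not part of the original algorithm description.

# Sketch: HITS-style scoring of contexts (hubs) and actions (authorities).
# The small context -> actions graph is illustrative.

graph = {
    "ctx_current": ["show_hint", "give_quiz"],
    "ctx_a":       ["show_hint", "suggest_reading"],
    "ctx_b":       ["give_quiz", "show_hint"],
    "ctx_c":       ["suggest_reading"],
}

def hits(graph, iterations=50):
    contexts = list(graph)
    actions = sorted({a for acts in graph.values() for a in acts})
    # Initialize contexts by the number of actions used in them,
    # and actions by the number of contexts in which they were selected.
    ctx_score = {c: float(len(graph[c])) for c in contexts}
    act_score = {a: float(sum(a in graph[c] for c in contexts)) for a in actions}
    for _ in range(iterations):
        # Update action (authority) scores from context (hub) scores and vice versa.
        act_score = {a: sum(ctx_score[c] for c in contexts if a in graph[c]) for a in actions}
        ctx_score = {c: sum(act_score[a] for a in graph[c]) for c in contexts}
        # Normalize so that each set of scores sums to one.
        za, zc = sum(act_score.values()), sum(ctx_score.values())
        act_score = {a: s / za for a, s in act_score.items()}
        ctx_score = {c: s / zc for c, s in ctx_score.items()}
    return act_score, ctx_score

act_score, ctx_score = hits(graph)
print(max(act_score, key=act_score.get), act_score)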

6. CONTEXT DYNAMICS

Context dynamics adds a new dimension to the process. Depending on the application and the individual user, the context changes can follow certain patterns. For example, after a lecture the student goes for a coffee break and wants to switch the system off. After loading a new task, she wants to read the related lecture slides before solving the task. This information about context changes can be utilized in predicting future contexts. Hidden Markov models offer a nice solution for modelling dynamic processes in general. In the following we propose how this modelling of the context process can be combined with our classification paradigm by means of dynamic Bayesian networks.

6.1 Hidden Markov models

Hidden Markov models (HMMs) are a useful tool for modelling processes which evolve in time. The simplest form of HMM, the 1st order HMM, can be thought of as a stochastic finite state machine. The process is modelled as a set of discrete states (high-level contexts) C0,...,CT, and at any time t the system is in one state. The states cannot be observed directly, and thus they are called hidden. We assume the Markov property that the current state Ct depends only on the previous state Ct-1, i.e. P(Ct|C0,...,Ct-1) = P(Ct|Ct-1). The probabilities P(Ct|Ct-1) are called transition probabilities, and they give the probability of moving from state Ct-1 to Ct. In kth order Markov models we generalize the Markov property and assume that the current state depends on the previous k states, i.e. P(Ct|C0,...,Ct-1) = P(Ct|Ct-k,...,Ct-1). In addition to this Markov chain we define a set of output variables or observations (lower-level contexts) O0,...,OT, which can be observed directly. The output variables can be discrete, real-valued, or a combination of both. In each state Ct the system can produce an output Ot with observation probability Po(Ot|Ct), i.e. the observations depend only on the state at the given time.

This kind of approach is sometimes used in context-aware applications. For example, Starner et al. (1998) introduce a virtual game environment in which each room of the game corresponds to a state of a Markov chain. The current room can be predicted from the previous room, in addition to sensor measurements. The model is especially attractive, because we can interpret it as a belief network and use Bayesian methods to reason about the current context or to predict future contexts. In the HMM approach we are also classifying the situation implicitly. The only difference from basic classification is that now we also take the previous context(s) into account, in addition to the data describing the situation.

In context inference, it is quite probable that only some contexts (states) have probabilistic relations. To find those relations, we can once again first construct a descriptive model from the data, and search for episodes - frequently occurring temporal patterns - in the data (see e.g. Toivonen 1996). The serial episodes consist of contexts which typically succeed each other (e.g. context ci succeeds context cj with probability p). If the contexts succeed each other immediately, we can use 1st order HMMs; otherwise higher order HMMs are needed. The parallel episodes consist of contexts which occur close to each other, but whose order is not fixed. In this case, the HMM should contain transitions in both directions (from ci to cj and from cj to ci), and once again higher order HMMs may be needed. If a context does not have any typical successor, all other contexts are equally probable.


Figure 2. A Markov chain of contexts C0,...,CT with associated observations O0,...,OT and actions A0,...,AT.

Hidden Markov models can be generalized by allowing dependencies also between the output variables (lower-level contexts), e.g. Ot depends on the k previous output variables in addition to Ct. When the goal is to select the most appropriate action, it is more useful to add action variables, and dependencies between them, to the Markov model, as proposed in Figure 2. This model takes into account actions which typically precede each other. It is especially useful when a context is associated with a sequence of actions instead of a single action.

Once again it is possible that the actions depend on several previous actions, and a higher order model is needed.
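As a concrete illustration, the sketch below performs the standard forward update for such a first-order model: the belief over high-level contexts is propagated through the transition probabilities and re-weighted by the observation probability of the current lower-level context. All states, observations and probabilities are invented; in practice the transition probabilities could be estimated from the episodes found in the data.

# Sketch: forward updates of a first-order HMM over high-level contexts.
# Transition and observation probabilities are illustrative.

contexts = ["lecture", "exercise", "break"]

# P(C_t | C_{t-1}): transition probabilities between high-level contexts
transition = {
    "lecture":  {"lecture": 0.6, "exercise": 0.2, "break": 0.2},
    "exercise": {"lecture": 0.1, "exercise": 0.7, "break": 0.2},
    "break":    {"lecture": 0.4, "exercise": 0.4, "break": 0.2},
}

# P(O_t | C_t): probability of an observed low-level context in each state
observation = {
    "lecture":  {"slides_open": 0.7, "editor_open": 0.1, "idle": 0.2},
    "exercise": {"slides_open": 0.2, "editor_open": 0.7, "idle": 0.1},
    "break":    {"slides_open": 0.1, "editor_open": 0.1, "idle": 0.8},
}

def forward_step(belief, observed):
    """Compute P(C_t | O_1..t) from the previous belief P(C_{t-1} | O_1..t-1)."""
    predicted = {c: sum(belief[p] * transition[p][c] for p in contexts) for c in contexts}
    updated = {c: predicted[c] * observation[c][observed] for c in contexts}
    z = sum(updated.values())
    return {c: p / z for c, p in updated.items()}

belief = {"lecture": 1.0, "exercise": 0.0, "break": 0.0}   # known initial context
for obs in ["slides_open", "editor_open", "editor_open", "idle"]:
    belief = forward_step(belief, obs)
    print(obs, belief)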

6.2 Dynamic Bayesian networks

Dynamic Bayesian networks (DBNs) (Kjaerulff 1992) are an extension of hidden Markov models. The only difference is that the observations are organized as a Bayesian network. They offer a nice way to embed the dynamic nature of the system into classification, by combining a hidden Markov model with Bayesian classifiers. A dynamic Bayesian model consists of a Markov chain of hidden state variables (a 1st order HMM) and a series of naive Bayesian networks, each of them associated with one hidden state. In a dynamic naive Bayes model, both the root variable C and the leaf variables X1,...,Xn of the naive Bayesian network can depend on the hidden state variable S. For simplicity, we have parameterised the variables with time t. The model structure is represented in Figure 3.

In context-aware applications we can capture the unknown and immeasurable highest-level contexts, like the user's intention or mental state, by the hidden variables S[t]. The root variable C[t] corresponds to the highest-level context which can be observed, e.g. the user's current action, and the leaf variables X1[t],...,Xn[t] correspond to lower-level contexts. This model is especially attractive for context-aware applications, because it separates the predefined high-level contexts (C[t]) from the real but unobservable contexts behind them (S[t]).


Figure 3. A dynamic naive Bayesian model. The hidden state variable S, the context variable C and the other variables Xi are parameterised with time t.

In the same way we could add any other probabilistic classification method to the hidden Markov model. In this case, the HMM gives only the prior probabilities for the contexts, and otherwise the classification is done normally. The same can be done if we use probabilistic clustering (Celeux & Govaert 1995) in the social filtering approaches. The only difference is that now we have to learn the HMM for actions instead of contexts. The HMM gives the prior probability of an action, given the previous action(s), and the probabilities are updated according to the actions used in similar situations. All three paradigms can be combined, if we use classification for context inference and social filtering for action selection, with HMM dynamics added to both of them.
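A minimal sketch of this combination is given below, under a simplifying assumption: the Markov chain is placed directly on the observable high-level context C[t] (no separate hidden variable S[t]), so the HMM only supplies the prior for an otherwise ordinary naive Bayes classification. The contexts, features and probabilities are illustrative.

# Sketch: naive Bayes classification with an HMM-supplied prior, a simplified
# version of the dynamic model above (the Markov chain runs directly over the
# observable high-level context, without a separate hidden state variable).

contexts = ["reading", "solving"]

transition = {                       # P(C_t | C_{t-1}): the HMM prior
    "reading": {"reading": 0.7, "solving": 0.3},
    "solving": {"reading": 0.2, "solving": 0.8},
}
likelihood = {                       # P(x_i = v | C_t) for each lower-level context
    "reading": {"window": {"slides": 0.8, "editor": 0.2},
                "tempo":  {"slow": 0.6, "fast": 0.4}},
    "solving": {"window": {"slides": 0.1, "editor": 0.9},
                "tempo":  {"slow": 0.3, "fast": 0.7}},
}

def classify(previous_belief, features):
    """P(C_t | x_t, previous belief), proportional to prior times naive Bayes likelihood."""
    posterior = {}
    for c in contexts:
        prior = sum(previous_belief[p] * transition[p][c] for p in contexts)
        p = prior
        for feature, value in features.items():
            p *= likelihood[c][feature][value]
        posterior[c] = p
    z = sum(posterior.values())
    return {c: p / z for c, p in posterior.items()}

belief = {"reading": 0.5, "solving": 0.5}
belief = classify(belief, {"window": "editor", "tempo": "fast"})
print(belief)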

7. CONCLUSIONS

In this paper, we have constructed general paradigms for truly adaptive learning systems. The main task is to select the most appropriate action in the given context (user and situation). We have adopted a hierarchical view of context-aware computing to model the different abstraction levels of contexts between sensors and applications. The main principle is that the high-level contexts and the rules for action selection should be learnt from real data. This is achieved by combining descriptive and predictive modelling in an iterative process. We have introduced three concrete paradigms for implementing adaptivity:

Paradigm 1: We infer the high-level context by probabilistic classification and learn probabilistic rules for selecting the best action in the given context. The naive Bayes model proved to be the best candidate for classification, but with general Bayesian networks we can combine context inference and action selection in one model. The descriptive models are suitable for model construction: clustering for defining high-level contexts and association rules for finding dependencies between contexts and actions.

Paradigm 2: We determine the best action directly by social filtering methods. We can either define profiles of lower-level contexts, search for nearest neighbours and select the most popular action among them, or use the HITS algorithm and cluster the contexts according to their actions. The first approach implicitly defines a high-level context, but each context should contain only one (composed) action. The HITS approach works best if we have selected the initial set of situations by another method. Both of these approaches suit situations where we do not have any previous data and we can trust the users' ability to select the best actions.

Paradigm 3: We capture the temporal dependencies between contexts and/or actions by hidden Markov models. In the simplest model, the high-level contexts correspond to the hidden states of the Markov chain and the output variables correspond to the lower-level contexts. Dynamic Bayesian networks are a more sophisticated model, which combines a naive Bayes classifier with a hidden Markov model. The hidden Markov models can also be used with other probabilistic classification or clustering methods to define the prior probabilities of contexts and/or actions.

All three paradigms can be combined, if we first classify the contexts by a dynamic naive Bayesian network and then select the action by the HITS algorithm, combined with a hidden Markov model of actions. Alternatively, we can use dynamic Bayesian networks for both context inference and action selection.

REFERENCES

B. Carr & I. Goldstein (1977). Overlays: a theory of modelling for computer-aided instruction. AI Lab Memo 406, MIT, Cambridge, Massachusetts.
G. Celeux & G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28(5):781-793.
C.-M. Chen, et al. (2005). Personalized e-learning system using item response theory. Computers & Education (44):237-255.
B. Cheung, et al. (2003). SmartTutor: an intelligent tutoring system in web-based adult education. The Journal of Systems and Software (68):11-25.
C.-Y. Chou, et al. (2003). Redefining the learning companion: the past, present, and future of educational agents. Computers & Education (40):225-269.
D. Hand, et al. (2002). Principles of Data Mining. MIT Press.
G.-J. Hwang (2003). A conceptual map model for developing intelligent tutoring systems. Computers & Education (40):217-235.
I. Hatzilygeroudis & J. Prentzas (2004). Using a hybrid rule-based approach in developing an intelligent tutoring system with knowledge acquisition and update capabilities. Expert Systems with Applications (26):447-492.
U. Kjaerulff (1992). A computational scheme for reasoning in dynamic probabilistic networks. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.
J. Kleinberg (1998). Authoritative sources in a hyperlinked environment. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668-677, San Francisco, California.
M.-G. Lee (2001). Profiling students' adaptation styles in web-based learning. Computers & Education (36):121-132.
K. Papanikolaoum & M. Grigoriadou (2002). Towards new forms of knowledge communication: the adaptive dimensions of a web-based learning environment. Computers & Education (39):333-360.
U. Shardanand & P. Maes (1995). Social information filtering: algorithms for automating "word of mouth". Proceedings of the ACM CHI 95 Conference on Human Factors in Computing Systems, vol. 1, pp. 210-217.
V. Shute & J. Psotka (1994). Intelligent tutoring systems: the past, present and future, pp. 570-600. Macmillan, New York.
T. Starner, et al. (1998). Visual contextual awareness in wearable computing. ISWC, pp. 50-57.
R. Stiefelhagen, et al. (2001). Estimating focus of attention based on gaze and sound. Proceedings of the Workshop on Perceptive User Interfaces (PUI 01).
J. Toivanen, et al. (2003). Automatic recognition of emotion in spoken Finnish: preliminary results and applications. Proceedings of Prosodic Interfaces, pp. 85-89.
H. Toivonen (1996). Discovery of Frequent Patterns in Large Data Collections. Ph.D. thesis, Department of Computer Science, University of Helsinki.
H. Vos (1999). Contributions of minimax theory to instructional decision making in intelligent tutoring systems. Computers in Human Behavior (15):531-548.