Statistical models for e-learning data

Silvia Figini (1) and Paolo Giudici (2)

(1) University of Pavia, 27100 Pavia, Italy, [email protected]
(2) University of Pavia, 27100 Pavia, Italy, [email protected], http://www.datamininglab.it
Abstract. In this paper we propose nonparametric approaches for e-learning data. In particular, we want to supply a measure of the relative importance of the exercises, to estimate the knowledge acquired by each student and, finally, to personalize the e-learning platform. The methodology employed is based on a comparison between nonparametric statistics for kernel density classification, parametric models such as generalized linear models, and generalized additive models.
1 Introduction
Smoothing techniques such as density estimation and nonparametric regression have become established tools in applied statistics. There is now a wide variety of texts which describe these methods and a huge literature of research papers. Recent texts include Green and Silverman (1994), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996), and Bowman and Azzalini (1997). A broader framework for the case of regression, known as generalized additive models, is described by Hastie and Tibshirani (1990). Modern statistical computing environments are generally geared towards vector and matrix representations of data. It is therefore a principal aim of this paper to provide simple formulations of smoothing techniques which allow efficient implementation in this type of environment. A second aim of the paper is to address the computational issues which arise when nonparametric methods are applied to large data sets, such as those occurring in e-learning. This paper is structured as follows: in Section 2 we present methods of density estimation for exploring data and in Section 3 we focus on feature selection in this context. Section 4 presents a theoretical proposal to improve dimensionality reduction in a multivariate framework with kernel density estimation. Section 5 reports some inferential models for supervised classification and in Section 6 we show the results from the application of our proposed methodology to an e-learning data set.
2 Density estimation for exploring data
Typically, a kernel estimator has the form:

$$\hat f(y) = \frac{1}{n}\sum_{i=1}^{n} w(y - y_i; h), \qquad (1)$$
where w is itself a probability density, called in this context a kernel function, whose variance is controlled by the parameter h, called the smoothing parameter or bandwidth. The kernel method extends to the estimation of a density function in more than one dimension. It is also possible to study multivariate versions of the kernel function: Scott (1992) describes a variety of more sophisticated techniques for constructing and displaying density estimates in three, four and more dimensions. In p dimensions, with a kernel function defined as the product of univariate components w and smoothing parameters $(h_1, ..., h_p)$, Wand and Jones (1995) derive results for more general kernel functions. An overall measure of how effective $\hat f$ is in estimating f is the mean integrated squared error (MISE) which, in the one-dimensional case, is:

$$MISE(\hat f) = E \int \left[ \hat f(y) - f(y) \right]^2 dy \qquad (2)$$
$$= \int \left[ E\{\hat f(y)\} - f(y) \right]^2 dy + \int \mathrm{var}\{\hat f(y)\}\, dy. \qquad (3)$$
This combination of bias and variance, integrated over the sample space, has been the convenient focus of most of the theoretical work carried out on these estimators. In order to construct a density estimate from the observed data it is necessary to choose a value for the smoothing parameter h. The ideas involved in cross-validation are given a general description by Stone (1974). In the context of density estimation, Rudemo (1982) and Bowman (1984) applied these ideas to the problem of bandwidth choice, through estimation of the integrated squared error (ISE):

$$\int \left\{ \hat f(y) - f(y) \right\}^2 dy = \int \hat f(y)^2\, dy - 2 \int f(y)\hat f(y)\, dy + \int f(y)^2\, dy. \qquad (4)$$

The last term on the right-hand side does not involve h. The other terms can be estimated by:

$$\frac{1}{n}\sum_{i=1}^{n} \int \hat f_{-i}(y)^2\, dy \;-\; \frac{2}{n}\sum_{i=1}^{n} \hat f_{-i}(y_i), \qquad (5)$$

where $\hat f_{-i}(y)$ denotes the estimator constructed from the data without the observation $y_i$. Stone (1984) derived an asymptotic optimality result for bandwidths which are chosen in this cross-validatory fashion. Techniques known as biased cross-validation (Scott and Terrell 1987) and smoothed
cross-validation (Hall et al. 1992) also aim to minimise the ISE but use different estimates of this quantity. These approaches are also strongly related to the 'plug-in' approach. In our application we employed different methods for multivariate density estimates: optimal smoothing, normal optimal smoothing, cross-validation and the Sheather-Jones (Sheather and Jones, 1991) smoothing parameter. Jones et al. (1996) give a helpful and balanced discussion of methods for choosing the smoothing parameter in density estimation.
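As an illustration of cross-validatory bandwidth choice, the following sketch (a minimal example with simulated data, not the exact procedure used in the paper) selects the bandwidth of a Gaussian kernel estimator by maximising the leave-one-out log-likelihood with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KernelDensity

# Simulated univariate sample (placeholder for the e-learning scores).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(40, 8, 200), rng.normal(75, 10, 150)])

# Cross-validated choice of the smoothing parameter h (the bandwidth):
# the score maximised is the leave-one-out log-likelihood, a common
# surrogate for the ISE-based criterion discussed above.
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.linspace(1.0, 15.0, 30)},
    cv=LeaveOneOut(),
)
grid.fit(y.reshape(-1, 1))
h_cv = grid.best_params_["bandwidth"]

# Evaluate the resulting density estimate f_hat on a grid of points.
ys = np.linspace(y.min(), y.max(), 200).reshape(-1, 1)
f_hat = np.exp(grid.best_estimator_.score_samples(ys))
print(f"cross-validated bandwidth: {h_cv:.2f}")
```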
3 Feature selection: a bivariate nonparametric smoothing approach
Our approach to dimensionality reduction is based on Bjerve and Doksum (1993), Doksum et al. (1994) and Jones (1996), who suggest how dependence between variables can be quantified in a local way through the definition of correlation curves and the local dependence function. In order to assess the similarity between two density functions we use the previous methodology and we improve the results using a technique known as the smoothed bootstrap, which involves simulating from $\hat f$ rather than resampling the original data. Taylor (1989) and Scott (1992) discuss and illustrate the role of the smoothed bootstrap in constructing confidence intervals. After that it is possible to improve feature selection with a comparison between curves and surfaces. Formally, the hypotheses are:

$$H_0: f(y) = g(y), \qquad (6)$$
$$H_1: f(y) \neq g(y). \qquad (7)$$
In our case we use an approach for comparing two density estimates $\hat f$ and $\hat g$ based on the following statistic:

$$\mathrm{Difference} = \int \left\{ \hat f(y) - \hat g(y) \right\}^2 dy. \qquad (8)$$

Under the null hypothesis that the two density functions f and g are identical, the two means will be identical if the same smoothing parameter h is used in the construction of each estimate. More generally, we compare more than two groups with the following statistic:

$$\sum_{i=1}^{p} n_i \int \left\{ \hat f_i(y) - \hat f(y) \right\}^2 dy, \qquad (9)$$

where $\hat f_1, ..., \hat f_p$ denote the density estimates for the groups, $\hat f$ denotes the density estimate constructed from the entire set of data, ignoring the group labels, and $n_i$ denotes the sample size for group i. In order to preserve zero bias in the comparisons, a common smoothing parameter should again be employed, including in the combined estimate $\hat f$.
We point out that in comparing densities the quantity $\mathrm{var}\{\hat f(y) - \hat g(y)\}$ is therefore crucial at any point y. The distributional properties of the test statistic are difficult to establish when the null hypothesis is of such a broad form, with no particular shape specified for the common underlying density. In the next Section we propose a multivariate extension of this approach to improve dimensionality reduction.
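To make the comparison concrete, the sketch below (an illustration with simulated data and a simple Riemann approximation of the integral, not the implementation used by the authors) evaluates the statistic in (8) for two kernel density estimates built with a common bandwidth:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Two hypothetical groups of exercise scores.
f_sample = rng.normal(60, 12, 180)
g_sample = rng.normal(66, 12, 150)

# A common bandwidth factor (here the average of the two default factors)
# keeps the comparison unbiased under H0, as discussed above.
h = 0.5 * (gaussian_kde(f_sample).factor + gaussian_kde(g_sample).factor)
f_hat = gaussian_kde(f_sample, bw_method=h)
g_hat = gaussian_kde(g_sample, bw_method=h)

# Approximate Difference = integral of (f_hat - g_hat)^2 dy on a grid.
grid = np.linspace(min(f_sample.min(), g_sample.min()),
                   max(f_sample.max(), g_sample.max()), 512)
step = grid[1] - grid[0]
difference = np.sum((f_hat(grid) - g_hat(grid)) ** 2) * step
print(f"integrated squared difference: {difference:.5f}")
```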
4 A proposal for dimensionality reduction in a multivariate nonparametric smoothing approach
Our proposal for feature selection starts from a simple consideration. The approach described in Section 3 is based on a pairwise comparison between two densities. Our aim is to extend the comparison to a larger set of variables in order to choose the most important variables for prediction and, therefore, to improve dimensionality reduction. Given the input data D tabled as N samples and M features, $X = (x_i, i = 1, ..., M)$, and the target classification variable c, the feature selection problem is to find, from the M-dimensional observation space $R^M$, a subspace of m features, $R^m$, that "optimally" characterizes c. Given a condition defining the "optimal characterization", an algorithm is needed to find the best subspace. The optimal characterization condition often means the minimal classification error. In an unsupervised situation where the classifiers are not specified, minimal error usually requires the maximal statistical dependency of the target class c on the data distribution in the subspace $R^m$ (and vice versa). This scheme is called Maximal Dependency (Max-Dependency). One of the most popular approaches to realize Max-Dependency is Maximal Relevance (Max-Relevance) feature selection: selecting the features with the highest relevance to the target class c. Relevance is usually characterized in terms of correlation or mutual information, the latter being one of the most widely used measures of dependency between variables. Here, we focus on mutual information based feature selection. Given two random variables x and y, their mutual information is defined in terms of the probability density functions p(x), p(y) and p(x, y):

$$I(x; y) = \int \int p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\, dx\, dy. \qquad (10)$$

In Max-Relevance, the selected features $x_i$ are required, individually, to have the largest mutual information $I(x_i; c)$ with the target class c, reflecting the largest dependency on the target class. In terms of sequential search, the m best individual features, i.e., the top m features in the descending ordering of $I(x_i; c)$, are often selected as the m features. In variable selection, however, it has been recognized that combinations of individually good features do not necessarily lead to good classification performance. In other words, "the m best features are not the best m features", see e.g. Cover and Thomas (1991).
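As an illustration of (10) on sample data, the following sketch (a minimal example with simulated data; the discretization into bins is an assumption of the example, not part of the authors' procedure) estimates the mutual information between a continuous feature and a binary class with a plug-in histogram estimator:

```python
import numpy as np

def mutual_information(x, c, bins=10):
    """Histogram-based estimate of I(x; c) in nats for a continuous
    feature x and a discrete class label c (plug-in estimator)."""
    # Joint distribution estimated from a 2-D contingency table.
    joint, _, _ = np.histogram2d(x, c, bins=[bins, len(np.unique(c))])
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_c = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0  # skip empty cells, which contribute nothing to the sum
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_c)[nz])))

rng = np.random.default_rng(2)
c = rng.integers(0, 2, 500)               # binary target class
x = rng.normal(50 + 10 * c, 8)            # informative continuous feature
print(f"I(x; c) ~ {mutual_information(x, c):.3f} nats")
```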
Here we present an empirical minimal-redundancy-maximal-relevance (mRMR) framework to minimize redundancy, and use a series of intuitive measures of relevance and redundancy to select promising features for both continuous and discrete data sets. Although both Max-Relevance and Min-Redundancy have been used intuitively for feature selection, no theoretical analysis has been given of why they can help in selecting optimal features for classification. Thus, the first goal of this paper is to present a theoretical analysis showing that mRMR is equivalent to Max-Dependency for first-order feature selection, but is more efficient.

4.1 Relationships of Max-Dependency, Max-Relevance and Min-Redundancy
In terms of mutual information, the purpose of feature selection is to find a feature set S with m features $\{x_i\}$ which jointly have the largest dependency on the target class c. This scheme, called Max-Dependency, has the following form:

$$\max D(S, c), \qquad D = I(\{x_i, i = 1, ..., m\}; c). \qquad (11)$$
Obviously, when m equals 1, the solution is the feature that maximizes $I(x_j; c)$, $1 \leq j \leq M$. When m > 1, a simple incremental search scheme is to add one feature at a time: given the set with m - 1 features, $S_{m-1}$, the m-th feature can be determined as the one that contributes the largest increase of I(S; c), which takes the form:

$$I(S_m; c) = \int \int p(S_m, c) \log \frac{p(S_m, c)}{p(S_m)p(c)}\, dS_m\, dc \qquad (12)$$
$$= \int \int p(S_{m-1}, x_m, c) \log \frac{p(S_{m-1}, x_m, c)}{p(S_{m-1}, x_m)p(c)}\, dS_{m-1}\, dx_m\, dc$$
$$= \int \cdots \int p(x_1, ..., x_m, c) \log \frac{p(x_1, ..., x_m, c)}{p(x_1, ..., x_m)p(c)}\, dx_1 \cdots dx_m\, dc.$$

Despite the theoretical value of Max-Dependency, it is often hard to obtain an accurate estimate of the multivariate densities $p(x_1, ..., x_m)$ and $p(x_1, ..., x_m, c)$, because of two difficulties in the high-dimensional space: 1) the number of samples is often insufficient and 2) multivariate density estimation often involves computing the inverse of a high-dimensional covariance matrix, which is usually an ill-posed problem. Another drawback of Max-Dependency is its slow computational speed. These problems are most pronounced for continuous feature variables. Even for discrete (categorical) features, the practical problems in implementing Max-Dependency cannot be completely avoided. For example, suppose each feature has three categorical states and there are N samples; K features could then have up to $\min(3^K, N)$ joint states. When the number of joint states increases very quickly and becomes comparable to the number of samples N, the joint probability of these features, as well as the mutual information, cannot be estimated correctly. Hence, although Max-Dependency feature selection might be useful to select a very small
number of features when N is large, it is not appropriate for applications where the aim is to achieve high classification accuracy with a reasonably compact set of features. As the Max-Dependency criterion is hard to implement, an alternative is to select features based on a maximal relevance criterion (Max-Relevance). Max-Relevance searches for features which approximate D(S, c) with the mean value of all the mutual information values between the individual features $x_i$ and the class c:

$$\max D(S, c), \qquad D = \frac{1}{\mathrm{dim}(S)} \sum_{x_i \in S} I(x_i; c). \qquad (13)$$
It is likely that features selected according to Max-Relevance could have rich redundancy, i.e., the dependency among these features could be large. When two features highly depend on each other, the respective class-discriminative power would not change much if one of them were removed. Therefore, the following minimal redundancy (Min-Redundancy) condition can be added to select mutually exclusive features:

$$\min R(S), \qquad R = \frac{1}{\mathrm{dim}(S)^2} \sum_{x_i, x_j \in S} I(x_i; x_j). \qquad (14)$$
The criterion combining the above two constraints is called "minimal-redundancy-maximal-relevance" (mRMR), see e.g. Ding and Peng (2003). We define the operator $\phi(D, R)$ to combine D and R, and consider the following simplest form to optimize D and R simultaneously:

$$\max \phi(D, R), \qquad \phi = D - R. \qquad (15)$$

In practice, incremental search methods can be used to find the near-optimal features defined by $\phi(\cdot)$. Suppose we already have $S_{m-1}$, the feature set with m - 1 features. The task is to select the m-th feature from the set $X - S_{m-1}$. This is done by selecting the feature that maximizes $\phi(\cdot)$. The respective incremental algorithm optimizes the following condition:

$$\max_{x_j \in X - S_{m-1}} \left[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right]. \qquad (16)$$
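A minimal sketch of this incremental search (an illustration only: it relies on scikit-learn's k-nearest-neighbour mutual information estimators and a simulated data set, not on the density estimates discussed in this paper) could look as follows:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, c, n_features):
    """Greedy mRMR selection following criterion (16): at each step pick
    the feature maximising relevance I(x_j; c) minus the mean redundancy
    with the features already selected."""
    n_total = X.shape[1]
    relevance = mutual_info_classif(X, c, random_state=0)
    selected = [int(np.argmax(relevance))]          # start from Max-Relevance
    while len(selected) < n_features:
        candidates = [j for j in range(n_total) if j not in selected]
        scores = []
        for j in candidates:
            redundancy = np.mean([
                mutual_info_regression(X[:, [i]], X[:, j], random_state=0)[0]
                for i in selected
            ])
            scores.append(relevance[j] - redundancy)
        selected.append(candidates[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(3)
c = rng.integers(0, 2, 300)
X = np.column_stack([rng.normal(10 * c, 3, 300),    # relevant feature
                     rng.normal(10 * c, 3, 300),    # relevant but redundant
                     rng.normal(0, 1, 300)])        # pure noise
print("selected feature indices:", mrmr_select(X, c, 2))
```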
The computational complexity of this incremental search method is $O(\mathrm{dim}(S) \times M)$. We now prove the following result. Lemma: the combination of the Max-Relevance and Min-Redundancy criteria (the mRMR criterion) is equivalent to the Max-Dependency criterion if one feature is selected (added) at a time. Here we give an idea of the proof. We assume that $S_{m-1}$, i.e., the set of m - 1 features, has already been obtained. The task is to select the optimal m-th feature $x_m$ from the set $X - S_{m-1}$. The dependency D is represented by mutual information, i.e., $D = I(S_m; c)$,
where $S_m = \{S_{m-1}, x_m\}$ can be treated as a multivariate variable. Thus, by the definition of mutual information, we have:

$$I(S_m; c) = H(c) + H(S_m) - H(S_m, c) = H(c) + H(S_{m-1}, x_m) - H(S_{m-1}, x_m, c), \qquad (17)$$
where $H(\cdot)$ is the entropy of the respective multivariate (or univariate) variables. Now, we define the following quantity $J(S_m) = J(x_1, ..., x_m)$ for scalar variables $x_1, ..., x_m$:

$$J(x_1, ..., x_m) = \int \cdots \int p(x_1, x_2, ..., x_m) \log \frac{p(x_1, x_2, ..., x_m)}{p(x_1) \cdots p(x_m)}\, dx_1 \cdots dx_m. \qquad (18)$$

Similarly, we define $J(S_m, c) = J(x_1, ..., x_m, c)$ as:

$$J(x_1, ..., x_m, c) = \int \cdots \int p(x_1, x_2, ..., x_m, c) \log \frac{p(x_1, x_2, ..., x_m, c)}{p(x_1) \cdots p(x_m) p(c)}\, dx_1 \cdots dx_m\, dc. \qquad (19)$$

We can easily derive from the previous equations:

$$H(S_{m-1}, x_m) = H(S_m) = \sum_{i=1}^{m} H(x_i) - J(S_m), \qquad (20)$$

and

$$H(S_{m-1}, x_m, c) = H(S_m, c) = H(c) + \sum_{i=1}^{m} H(x_i) - J(S_m, c). \qquad (21)$$

By substituting these into the corresponding terms of $I(S_m; c)$, we have:

$$I(S_m; c) = J(S_m, c) - J(S_m) = J(S_{m-1}, x_m, c) - J(S_{m-1}, x_m). \qquad (22)$$
Obviously, Max-Dependency is equivalent to simultaneously maximizing the first term and minimizing the second term. We can use Jensen's inequality to show that the second term, $J(S_{m-1}, x_m)$, is lower-bounded by 0. A related and slightly simpler proof is to consider the inequality $\log(z) \leq z - 1$, with equality if and only if z = 1. We see that:

$$-J(x_1, ..., x_m) = \int \cdots \int p(x_1, ..., x_m) \log \frac{p(x_1) \cdots p(x_m)}{p(x_1, ..., x_m)}\, dx_1 \cdots dx_m$$
$$\leq \int \cdots \int p(x_1, ..., x_m) \left[ \frac{p(x_1) \cdots p(x_m)}{p(x_1, ..., x_m)} - 1 \right] dx_1 \cdots dx_m$$
$$= \int \cdots \int p(x_1) \cdots p(x_m)\, dx_1 \cdots dx_m - \int \cdots \int p(x_1, ..., x_m)\, dx_1 \cdots dx_m = 1 - 1 = 0. \qquad (23)$$
It is easy to verify that the minimum is attained when $p(x_1, ..., x_m) = \prod_{i=1}^{m} p(x_i)$, i.e., when all the variables are independent of each other. As all the m - 1 features have already been selected, this pairwise independence condition means that the mutual information between $x_m$ and any selected feature $x_i$, i = 1, ..., m - 1, is minimized. This is the Min-Redundancy criterion. We can also derive the upper bound of the first term, $J(S_{m-1}, c, x_m)$. For simplicity, let us first show the upper bound of the general form $J(y_1, ..., y_n)$, assuming there are n variables $y_1, ..., y_n$:

$$J(y_1, ..., y_n) = \int \cdots \int p(y_1, ..., y_n) \log \frac{p(y_1, ..., y_n)}{p(y_1) \cdots p(y_n)}\, dy_1 \cdots dy_n$$
$$= \int \cdots \int p(y_1, ..., y_n) \log \frac{p(y_1 | y_2, ..., y_n) \cdots p(y_{n-1} | y_n) p(y_n)}{p(y_1) \cdots p(y_n)}\, dy_1 \cdots dy_n$$
$$= \sum_{i=1}^{n-1} H(y_i) - H(y_1 | y_2, ..., y_n) - \cdots - H(y_{n-1} | y_n)$$
$$\leq \sum_{i=1}^{n-1} H(y_i). \qquad (24)$$

This bound can easily be extended as:

$$J(y_1, ..., y_n) \leq \min \left\{ \sum_{i=2}^{n} H(y_i),\; \sum_{i=1, i \neq 2}^{n} H(y_i),\; ...,\; \sum_{i=1, i \neq n-1}^{n} H(y_i),\; \sum_{i=1}^{n-1} H(y_i) \right\}. \qquad (25)$$
It is easy to verify that the maximum of $J(y_1, ..., y_n)$, or similarly of $J(S_{m-1}, c, x_m)$, is attained when all the variables are maximally dependent. When $S_{m-1}$ has been fixed, this indicates that $x_m$ and c should have the maximal dependency. This is the Max-Relevance criterion. Therefore, a combination of Max-Relevance and Min-Redundancy is equivalent to Max-Dependency for first-order selection. We have the following observations:
– Minimizing $J(S_m)$ only is equivalent to searching for mutually exclusive (independent) features. This is insufficient for selecting highly discriminative features.
– Maximizing $J(S_m, c)$ only leads to Max-Relevance. Clearly, the difference between mRMR and Max-Relevance is rooted in the different definitions of dependency (in terms of mutual information): Max-Relevance does not consider the joint effect of the features on the target class. On the contrary, Max-Dependency considers the dependency between the data distribution in the subspace $R^m$ and the target class c. This difference is critical in many circumstances.
4.2 Computational Issues
We consider mutual-information-based feature selection for both discrete and continuous data. For discrete (categorical) feature variables, the integral operation reduces to summation. In this case, computing mutual information is straightforward, because both the joint and the marginal probability tables can be estimated from the samples of the categorical variables in the data. However, when at least one of the variables x and y is continuous, their mutual information I(x; y) is hard to compute, because it is often difficult to compute the integral in the continuous space based on a limited number of samples. One solution is to incorporate data discretization as a preprocessing step. For applications where it is unclear how to properly discretize the continuous data, an alternative solution is to use a density estimation method (e.g., Parzen windows) to approximate I(x; y), as suggested by earlier work in medical image registration and feature selection. Given N samples of a variable x, the approximate density function $\hat p(x)$ has the following form:

$$\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)}, h), \qquad (26)$$

where $\delta(\cdot)$ is the Parzen window function as explained below, $x^{(i)}$ is the i-th sample, and h is the window width. Parzen has proven that, with properly chosen $\delta(\cdot)$ and h, the estimate $\hat p(x)$ converges to the true density p(x) when N goes to infinity. Usually, $\delta(\cdot)$ is chosen as the Gaussian window:

$$\delta(z, h) = \exp\left( -\frac{z^T \Sigma^{-1} z}{2 h^2} \right) \Big/ \left\{ (2\pi)^{d/2} h^d |\Sigma|^{1/2} \right\}, \qquad (27)$$

where $z = x - x^{(i)}$, d is the dimension of the sample x and $\Sigma$ is the covariance of z. When d = 1, the previous equation returns the estimated marginal density; when d = 2, we can use it to estimate the density of a bivariate variable. For the sake of robust estimation, for $d \geq 2$, $\Sigma$ is often approximated by its diagonal components. It is also possible to extend our proposal to a Bayesian framework. The idea is to employ the Bayes rule, assuming that the feature variables are independent of each other given the target class. Given a sample $s = (x_1, ..., x_m)$ of m features, the posterior probability that s belongs to class $c_k$ is:

$$p(c_k | s) \propto \prod_{i=1}^{m} p(x_i | c_k), \qquad (28)$$
where $p(x_i | c_k)$ is the conditional probability table (or density) learned from the examples in the training process. The Parzen-window density approximation can be used to estimate $p(x_i | c_k)$ for continuous features and, therefore, the posterior class probabilities are obtained.
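A minimal sketch of (26)-(28) (simulated data; a one-dimensional Gaussian window, equal class priors and hypothetical class labels are assumptions of the example):

```python
import numpy as np

def parzen_density(x_grid, samples, h):
    """Univariate Parzen-window estimate (26) with a Gaussian window (27)."""
    z = (x_grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def naive_bayes_posterior(x_new, X, y, h=2.0):
    """Posterior class probabilities (28) with Parzen class-conditional
    densities and equal priors; X has one column per feature."""
    classes = np.unique(y)
    scores = []
    for k in classes:
        Xk = X[y == k]
        # product over features of the class-conditional densities p(x_i | c_k)
        dens = [parzen_density(np.array([x_new[j]]), Xk[:, j], h)[0]
                for j in range(X.shape[1])]
        scores.append(np.prod(dens))
    scores = np.array(scores)
    return classes, scores / scores.sum()

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 400)
X = np.column_stack([rng.normal(55 + 15 * y, 10), rng.normal(60 + 10 * y, 12)])
print(naive_bayes_posterior(np.array([70.0, 68.0]), X, y))
```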
5 Kernel density estimation for supervised classification
Regression models play an important role in many data analyses, providing prediction and classification rules, and data analytic tools for understanding
the importance of different inputs. Although attractively simple, the traditional linear model often fails in these situations: in real life, effects are often not linear. This section describes more automatic, flexible statistical methods that may be used to identify and characterize nonlinear regression effects; these methods are called generalized additive models. In the regression setting, a generalized additive model (Hastie and Tibshirani 1990) has the form:

$$E(Y | X_1, ..., X_p) = \alpha + f_1(X_1) + ... + f_p(X_p). \qquad (29)$$
As usual $X_1, X_2, ..., X_p$ represent the predictors and Y is the outcome; the $f_j$'s are unspecified smooth functions. In our approach, we fit each function using a scatterplot smoother (e.g., a cubic smoothing spline or a kernel smoother), with an algorithm for simultaneously estimating all p functions. For two-class classification, as in our case, recall the logistic regression model for binary data. We relate the mean of the binary response, $\mu(X) = Pr(Y = 1 | X)$, to the predictors via a linear model and the logit link function:

$$\log \frac{\mu(X)}{1 - \mu(X)} = \alpha + \beta_1 X_1 + ... + \beta_p X_p. \qquad (30)$$
The additive logistic regression model replaces each linear term by a more general functional form:

$$\log \frac{\mu(X)}{1 - \mu(X)} = \alpha + f_1(X_1) + ... + f_p(X_p), \qquad (31)$$

where again each $f_j$ is an unspecified smooth function. While the nonparametric form of the functions $f_j$ makes the model more flexible, the additivity is retained and allows us to interpret the model in much the same way as before. The additive logistic regression model is an example of a generalized additive model. In general, the conditional mean $\mu(X)$ of a response Y is related to an additive function of the predictors via a link function g:

$$g[\mu(X)] = \alpha + f_1(X_1) + ... + f_p(X_p). \qquad (32)$$
Examples of classical link functions are the identity link, the probit link function for modeling binomial probabilities (the probit function is the inverse of the Gaussian cumulative distribution function), the logit link function, also for binomial probabilities, and the log link for log-linear or log-additive models for Poisson count data. All of these arise from exponential family sampling models, which in addition include the gamma and negative-binomial distributions. These families generate the well-known class of generalized linear models, which are all extended in the same way to generalized additive models. In our application the target variable is binary, so we fit an additive logistic regression model. In this model the outcome Y can be coded as 0 or 1, and we wish to model $Pr(Y = 1 | X)$, the probability of an event given the values of the covariates $X = (X_1, ..., X_p)$. The generalized additive logistic model has the form:

$$\log \frac{Pr(Y = 1 | X)}{Pr(Y = 0 | X)} = \alpha + f_1(X_1) + ... + f_p(X_p). \qquad (33)$$

The functions $f_1, ..., f_p$ can be estimated by a backfitting algorithm within a Newton-Raphson procedure.
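A minimal sketch of an additive logistic fit (an approximation using spline basis expansions and unpenalized logistic regression in scikit-learn, not the backfitting procedure described above; the data and predictors are simulated):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(5)
# Two hypothetical exercise scores and a pass/fail outcome for the final exam,
# with a deliberately nonlinear effect of the first predictor.
X = rng.uniform(0, 100, size=(376, 2))
logit = -1.0 + 0.04 * X[:, 1] - 0.03 * np.abs(X[:, 0] - 50)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Cubic spline expansion of each predictor gives smooth additive terms f_j(X_j);
# the logistic regression on the expanded basis plays the role of the additive fit.
gam_like = make_pipeline(
    SplineTransformer(degree=3, n_knots=6),
    LogisticRegression(max_iter=1000),
)
gam_like.fit(X, y)
print("training accuracy:", round(gam_like.score(X, y), 3))
```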
6 Application
The application concerns data that come from an e-learning platform of the University of Pavia, where students can improve their knowledge of the English language. To better understand the following analysis, it is necessary to describe how the site is structured. It is formed by 15 levels and every level presents 11 units (10 lessons and a final examination): Assessment, Dialogue, Glossary, Introduction, Listening 1, Listening 2, Pronunciation, Reading, Use of English, Video and Vocabulary. The course is divided into different types of exercises, some with an evaluation (pronunciation, listening and assessment) and others without an evaluation (grammar). The score of the evaluated exercises ranges between 0 and 100, and the threshold to pass an exercise is 50. The data are structured in 5 tables: the data related to the students enrolled in the course, the date and the initial and final time of every session in which a student is connected, the structure of the e-learning web site, a transactional data set of the lessons, and the final examination of each level with its evaluation.

The descriptive analysis is mainly based on the data of the TRK2 table. Initially we examined the variable Status, which identifies which result the student has obtained in the different exercises. The admissible values of this variable are C (completed), I (incomplete), F (failed) and P (passed). The results show that the values of Status equal to C correspond to exercises that have been completed but have not met the minimum threshold of evaluation. We eliminated from the initial table the anomalous observations and the records with Status I, deleting 37203 observations from the initial 147432. The second step was to examine whether there were errors in the assignment of the values P and F: none were found. In a further step, by assigning to each exercise its specific level, it became possible to understand how students are distributed among the levels. We found that only 38 students are present in all levels. At this point we considered only the data related to level 1, which contained 463 students. To build the data set we merged the observations of the TRK1 table related to the same level. The final table contains 376 students and 17 variables. With our application we want to supply a measure of the relative importance of the exercises, to estimate the knowledge acquired by each student and, finally, to personalize the e-learning platform.

For kernel density estimation, the choice of a bandwidth is a compromise between smoothing enough to remove insignificant bumps and not smoothing so much as to smear out real peaks. In our application we compared three different methods to estimate the optimal smoothing parameter (Sheather-Jones, cross-validation and AIC-based methods). In particular, the Sheather and Jones (1991) method produces, in our application, an estimate that follows the data very closely; to find the best smoothing parameter it is therefore better to use and compare cross-validation and AIC-based methods. We have supposed that the relative importance of each exercise can be measured as inversely proportional to the probability of passing the final examination without having completed that exercise.
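As an illustration of this importance measure (a sketch with a hypothetical table layout: one row per student, binary columns indicating whether each exercise was done and whether the final examination was passed; the column names and values are invented for the example):

```python
import pandas as pd

# Hypothetical student-level table: one binary column per exercise
# (1 = exercise done) and the final-examination outcome (1 = passed).
students = pd.DataFrame({
    "X10308": [1, 0, 1, 1, 0, 1],
    "X10702": [1, 1, 0, 1, 0, 0],
    "passed_final": [1, 1, 0, 1, 0, 1],
})

importance = {}
for ex in ["X10308", "X10702"]:
    not_done = students[students[ex] == 0]
    p_pass_without = not_done["passed_final"].mean()  # P(pass | exercise not done)
    # Relative importance taken as inversely proportional to this probability.
    importance[ex] = 1.0 / p_pass_without if p_pass_without > 0 else float("inf")

print(importance)
```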
It is possible to derive that some exercises have a high impact on the final examination: 10702, 10603, 10602 and 10601 relate to Comprehension, while 10503 and 10502 relate to Pronunciation. To estimate the knowledge acquired by each student we analyse the results presented in Table 1, which displays the parameter estimates of the logistic regression.

Table 1. Estimates for the Logit model

Variable    GLM Logit
Intercept   -2.3121
X10308      -0.0396
X10309       0.0291
X10702       0.0344
We compare the results in Table 1 with a nonparametric technique based on generalized additive models: Table 2 shows the outcome of the generalized additive model. In our application we use splines. For smoothing splines it would be possible to set up a penalized least-squares problem and minimize it, but there would be computational difficulties in choosing the smoothing parameters simultaneously. In our case an iterative approach based on the backfitting algorithm is used. In order to choose the best predictive model we apply an ANOVA analysis (Table 3).

Table 2. Estimates for the GAM model

Spline       Chi-square   DF
s(X10308)    30.1602      3
s(X10309)     7.8260      3
s(X10601)     8.4466      3
s(X10602)    10.3671      3
The best model under this evaluation measure is the one based on the nonparametric approach. We also use goodness-of-fit criteria based on the confusion matrix.

Table 3. ANOVA for model choice

Model        Residual deviance
GLM Logit    194.24
GAM          128.20

The confusion matrix (see e.g. Giudici 2003) is used as an indication of the properties
of a classification (discriminant) rule. It contains the number of elements that have been correctly or incorrectly classified for each class. On its main diagonal we can see the number of observations that have been correctly classified for each class, while the off-diagonal elements indicate the number of observations that have been incorrectly classified, under the (explicit or implicit) assumption that each incorrect classification has the same cost. Finally, Table 4 and Table 5 report the confusion matrices for the two models.

Table 4. Confusion matrix for the GLM model

            P(Y=0)   P(Y=1)
O(Y=0)      59       22
O(Y=1)      11       290

Table 5. Confusion matrix for the GAM model

            P(Y=0)   P(Y=1)
O(Y=0)      67       14
O(Y=1)      6        285
In Table 4 and Table 5 the diagonal represents the correct predictions, and we can see that the nonparametric model is the best at predicting student performance: the best model is the GAM.
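For completeness, a minimal sketch of this comparison on held-out data (the model objects and data split are illustrative assumptions; glm_model and gam_model stand for any two fitted classifiers with a scikit-learn interface):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

def compare_models(models, X_test, y_test):
    """Print the confusion matrix and accuracy of each fitted classifier."""
    for name, model in models.items():
        y_pred = model.predict(X_test)
        cm = confusion_matrix(y_test, y_pred)   # rows: observed, columns: predicted
        acc = accuracy_score(y_test, y_pred)    # proportion on the main diagonal
        print(f"{name}: accuracy = {acc:.3f}\n{cm}")

# Example usage with hypothetical fitted models and test data:
# compare_models({"GLM Logit": glm_model, "GAM": gam_model}, X_test, y_test)
```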
7 Conclusion
In this paper we have presented a new approach to personalize e-learning platforms through the analysis of exercise performance. Our idea, based on generalized additive models, gives for each exercise interesting measures of student performance and a probabilistic classification of the exercises. Through these results we can personalize the e-learning platform in near real time and suggest to each student a personalized sequence of lessons. In this paper we have also presented a theoretical method to improve dimensionality reduction; more research is needed in this area, especially concerning the computational aspects. We remark that our proposal can be extended to other application areas such as credit risk, churn risk and, in general, risk management. We are working in this direction.
8 Acknowledgment
This work has been supported by MIUR PRIN funds, "Data Mining for e-business approaches", 2004-2006. The paper, written by Silvia Figini, is the result of a close collaboration between the two authors.
References

1. Azzalini, A., Bowman, A.W.: Applied Smoothing Techniques for Data Analysis. Oxford Statistical Science Series, Oxford (1997)
2. Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Chapman & Hall, London (1996)
3. Giudici, P.: Applied Data Mining. Wiley (2003)
4. Green, P.J., Silverman, B.W.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London (1994)
5. Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, London (1990)
6. Scott, D.W.: Multivariate Density Estimation: Theory, Practice and Visualisation. Wiley, New York (1992)
7. Simonoff, J.S.: Smoothing Methods in Statistics. Springer Verlag, New York (1996)
8. Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman & Hall, London (1995)
9. Bjerve, S., Doksum, K.: Correlation curves: measures of association as functions of covariate values. Ann. Statist., 21, 890-902 (1993)
10. Bowman, A.W.: An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353-360 (1984)
11. Bowman, A.W., Foster, P.J.: Adaptive smoothing and density based tests of multivariate normality. J. Amer. Statist. Assoc., 88, 529-573 (1993)
12. Doksum, K., Blyth, S., Bradlow, E., Meng, X.L., Zhao, H.: Correlation curves as local measures of variance explained by regression. J. Amer. Statist. Assoc., 89, 571-582 (1994)
13. Jones, M.C., Marron, J.S., Sheather, S.J.: A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc., 91, 401-407 (1996)
14. Parzen, E.: On the estimation of a probability density and mode. Ann. Math. Statist., 33, 1065-1076 (1962)
15. Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Ann. Math. Statist., 27, 832-837 (1956)
16. Rudemo, M.: Empirical choice of histograms and kernel density estimators. Scand. J. Statist., 9, 65-78 (1982)
17. Scott, D.W., Terrell, G.: Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc., 82, 1131-1146 (1987)
18. Sheather, S.J., Jones, M.C.: A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B, 53, 683-690 (1991)
19. Stone, M.A.: Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36, 111-147 (1974)
20. Taylor, C.C.: Bootstrap choice of the smoothing parameter in kernel density estimation. Biometrika, 36, 111-147 (1989)
21. Whittle, P.: On the smoothing of probability density functions. J. Roy. Statist. Soc. Ser. B, 55, 549-557 (1958)