
RECENT ADVANCES IN MECHANICS AND RELATED FIELDS, UNIVERSITY OF PATRAS 2003
In Honour of Professor Constantine L. Goudas

EFFICIENCY OF MACHINE LEARNING TECHNIQUES IN PREDICTING STUDENTS’ PERFORMANCE IN DISTANCE LEARNING SYSTEMS

S. B. Kotsiantis, C. J. Pierrakeas, I. D. Zaharakis, P. E. Pintelas
Educational Software Development Laboratory
Department of Mathematics
University of Patras
Greece
e-mail: {sotos, chrpie, john, pintelas}@math.upatras.gr

Keywords: supervised machine learning algorithms, prediction of student performance, distance learning.

Abstract. The ability to predict a student’s performance is very important in university-level distance learning environments. The scope of the research reported here is to investigate the efficiency of machine learning techniques in such an environment. To this end, a number of experiments were conducted using five representative learning algorithms, which were trained on data sets provided by the “informatics” course of the Hellenic Open University. It was found that learning algorithms could enable tutors to predict student performance with satisfactory accuracy long before the final examination. A second aim of the study was to identify the student attributes, if any, that most influence the induction of the learning algorithms. It was found that some obvious and some less obvious attributes demonstrate a strong correlation with student performance. Finally, a prototype version of a software support tool for tutors has been constructed, implementing the Naive Bayes algorithm, which proved to be the most appropriate among the tested learning algorithms.

1 INTRODUCTION

The tutors in a distance-learning course must continuously support their students regardless of the distance between them. A tool that could automatically recognize the level of the students would enable the tutors to personalize the education more effectively. While the tutors would still have the essential role of monitoring and evaluating student progress, the tool could compile the data required for reasonable and efficient monitoring.

This paper examines the use of Machine Learning (ML) techniques to predict students’ performance in a distance learning system. Even though ML techniques have been successfully applied in numerous domains such as pattern recognition, image recognition, medical diagnosis, commodity trading, computer games and various control applications, to the best of our knowledge there is no previous attempt in the domain presented here [10], [15]. Thus, we use a representative algorithm for each of the most common machine learning techniques, namely Decision Trees [11], Bayesian Nets [6], Perceptron-based Learning [9], Instance-Based Learning [1] and Rule Learning [5], so as to investigate the efficiency of ML techniques in such an environment. Indeed, it is shown that learning algorithms can predict student performance with satisfactory accuracy long before the final examination. In this work we also try to find the characteristics of the students that most influence the induction of the algorithms. This reduces the information that needs to be stored and speeds up the induction. For the purpose of our study, the “informatics” course of the Hellenic Open University (HOU) provided the data set. A significant conclusion of this work was that the students’ sex, age, marital status, number of children and occupation attributes do not contribute to the accuracy of the prediction algorithms.

The following section describes the data set of our study.
Some elementary Machine Learning definitions and a more detailed description of the used techniques and algorithms are given in section 3. Section 4 presents the experimental results for the five compared algorithms. The attribute selection methodology used to find the attributes that most influence the induction, as well as whether or not it improves the accuracy of the tested algorithms, is discussed in section 5. Finally, section 6 discusses the conclusions and some future research directions.

2 HELLENIC OPEN UNIVERSITY DISTANCE LEARNING METHODOLOGY AND DATA DESCRIPTION

For the purpose of our study, the “informatics” course of HOU provided the training set. A total of 354 examples (students’ records) have been collected from the module “Introduction to Informatics” (INF10) [16]. Regarding the INF10 module, during an academic year students have to hand in 4 written assignments, optionally participate in 4 face-to-face meetings with their tutor and sit for final examinations after an 11-month period.


A student must submit at least three of the four assignments, and the total mark gathered from the handed-in written assignments must be greater than or equal to 20, for the student to qualify to sit for the final examinations of the module. Table 1 presents the attributes of our data set along with the values of every attribute. The set of attributes was divided into two groups: the “Demographic attributes” group and the “Performance attributes” group.

Table 1. The attributes of our data set.
Student’s demographic attributes
Sex: male, female
Age: …
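As a side illustration, the qualification rule stated above (at least three of the four written assignments handed in, with a total mark of at least 20) can be expressed as a short check. This is a sketch only; the function name and the mark encoding are ours, not part of the study:

```python
def qualifies_for_finals(assignment_marks):
    """assignment_marks: one entry per written assignment;
    None marks an assignment that was not handed in."""
    submitted = [m for m in assignment_marks if m is not None]
    # At least three of the four assignments, total mark >= 20.
    return len(submitted) >= 3 and sum(submitted) >= 20

print(qualifies_for_finals([7.0, None, 8.0, 6.5]))   # True: 3 submitted, total 21.5
print(qualifies_for_finals([9.0, None, None, 8.0]))  # False: only 2 submitted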


The perceptron classifies a new instance x into class 2 if

∑i wi xi > θ

and into class 1 otherwise. It accepts instances one at a time and updates the weights wi as necessary. It initializes its weights wi and θ and then accepts a new instance (x, y), applying the threshold rule to compute the predicted class y΄. If the predicted class is correct (y΄ = y), the perceptron does nothing. However, if the predicted class is incorrect, the perceptron updates its weights. The most common way the perceptron algorithm is used for learning from a batch of training instances is to run the algorithm repeatedly through the training set until it finds a prediction vector which is correct on all of the training set. This prediction rule is then used for predicting the labels on the test set.
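A minimal sketch of this batch training procedure, assuming numeric features and the class-1/class-2 coding used above; the names, the learning rate and the epoch cap are our own choices, not part of the study:

```python
def predict(w, theta, x):
    # Class 2 if the weighted sum exceeds the threshold, class 1 otherwise.
    return 2 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 1

def train_perceptron(X, y, rate=1.0, max_epochs=100):
    """X: list of numeric feature vectors, y: labels in {1, 2}.
    Runs repeatedly through the training set, updating the weights only
    on mistakes, until a full pass produces no errors (or max_epochs)."""
    w, theta = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if predict(w, theta, x) != label:
                mistakes += 1
                step = rate if label == 2 else -rate  # push the sum up or down
                w = [wi + step * xi for wi, xi in zip(w, x)]
                theta -= step
        if mistakes == 0:
            break
    return w, theta
```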
An excellent book about Bayesian networks is provided by [6]. A Bayesian network is a graphical model of the probabilistic relationships among a set of attributes. The Bayesian network structure S is a directed acyclic graph (DAG) whose nodes are in one-to-one correspondence with the attributes. The arcs represent causal influences among the variables, while the lack of possible arcs in S encodes conditional independencies. Moreover, an attribute (node) is conditionally independent of its non-descendants given its parents. Using a suitable training method, one can induce the structure of the Bayesian network from a given training set. In spite of the remarkable power of Bayesian networks, they have an inherent limitation: the computational difficulty of exploring a previously unknown network. Given a problem described by n attributes, the number of possible structure hypotheses is more than exponential in n. In the case that the structure is unknown but the data can be assumed complete, the most common approach is to introduce a scoring function (or score) that evaluates the “fitness” of networks with respect to the training data, and then to search for the best network according to this score. The classifier based on this network and on the given set of attributes X1, X2, . . . , Xn returns the label c that maximizes the posterior probability p(c | X1, X2, . . . , Xn).

Instance-based learning algorithms belong to the category of lazy-learning algorithms [10], as they defer the induction or generalization process until classification is performed. One of the most straightforward instance-based learning algorithms is the nearest neighbour algorithm [1]. k-Nearest Neighbour (kNN) is based on the principle that the examples within a data set will generally exist in close proximity to other examples that have similar properties. If the examples are tagged with a classification label, then the value of the label of an unclassified example can be determined by observing the class of its nearest neighbours. The absolute position of the examples within this space is not as significant as the relative distance between examples. This relative distance is determined using a distance metric. Ideally, the distance metric should minimize the distance between two similarly classified examples, while maximizing the distance between examples of different classes.

In rule induction systems, a decision rule is defined as a sequence of Boolean clauses linked by logical AND operators that together imply membership in a particular class [5]. The general goal is to construct the smallest rule set that is consistent with the training data. A large number of learned rules is usually a sign that the learning algorithm is trying to “remember” the training set instead of discovering the assumptions that govern it. During classification, the left-hand sides of the rules are applied sequentially until one of them evaluates to true, and then the implied class label from the right-hand side of the rule is offered as the class prediction.
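As a small illustration of this first-match classification scheme, a decision rule can be represented as a list of ANDed Boolean clauses plus an implied label. The rules and attribute names below are invented for the example, not taken from the study:

```python
# Each rule is a (clauses, label) pair: a list of Boolean clauses joined
# by AND on the left-hand side, and the implied class on the right.
rules = [
    ([lambda e: e["assignment1"] >= 5, lambda e: e["meetings"] >= 2], "pass"),
    ([lambda e: e["assignment1"] < 3], "fail"),
]

def classify(example, rules, default="pass"):
    # Apply left-hand sides sequentially; the first rule whose clauses
    # all evaluate to true supplies the class prediction.
    for clauses, label in rules:
        if all(clause(example) for clause in clauses):
            return label
    return default

print(classify({"assignment1": 6, "meetings": 3}, rules))  # pass
```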

For the purpose of the present study, a representative algorithm for each described machine learning technique was selected.

3.1 Brief description of the used machine learning algorithms

The most commonly used C4.5 algorithm [12] was the representative of the decision trees in our study. At each level of the partitioning process, C4.5 uses a statistical property known as information gain to determine which attribute best divides the training examples. To avoid overfitting, C4.5 converts the decision tree into a set of rules (one for each path from the root node to a leaf) and then generalizes each rule by removing any of its conditions whose removal improves the estimated accuracy of the rule.

The Naive Bayes algorithm was the representative of the Bayesian networks [3]. It is a simple learning scheme that captures the assumption that every attribute is independent of the rest of the attributes, given the state of the class attribute.
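A minimal sketch of this scheme for discrete attributes, choosing the class c that maximizes p(c) · ∏i p(xi | c). The implementation details (add-one smoothing, log probabilities) are common choices for such a sketch, not necessarily those of the tool described in the paper:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Fit class priors p(c) and per-attribute likelihoods p(x_i | c) by
    frequency counting; attributes are assumed discrete and conditionally
    independent given the class."""
    n = len(y)
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)  # (attr index, class) -> value counts
    domains = defaultdict(set)           # attr index -> observed values
    for x, c in zip(X, y):
        for i, v in enumerate(x):
            value_counts[(i, c)][v] += 1
            domains[i].add(v)

    def predict(x):
        best, best_logp = None, float("-inf")
        for c, n_c in class_counts.items():
            logp = math.log(n_c / n)                 # log prior
            for i, v in enumerate(x):
                num = value_counts[(i, c)][v] + 1    # add-one smoothing
                den = n_c + len(domains[i])
                logp += math.log(num / den)
            if logp > best_logp:
                best, best_logp = c, logp
        return best

    return predict

# Hypothetical usage with two discrete attributes:
clf = train_naive_bayes([("yes", "high"), ("no", "low"), ("yes", "low")],
                        ["pass", "fail", "pass"])
print(clf(("yes", "high")))  # pass
```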



We also used the 3-NN algorithm, with Euclidean distance as the distance metric, which combines robustness to noise with a shorter classification time than a larger k for kNN [14]; a minimal sketch of 3-NN classification is given at the end of this section. Attributes with missing values are given imputed values so that comparisons can be made between every pair of examples on all attributes.

The RIPPER algorithm [2] was the representative of the rule-learning techniques, because it is one of the most commonly used methods that produce classification rules. RIPPER forms rules through a process of repeated growing and pruning. During the growing phase the rules are made more restrictive, in order to fit the training data as closely as possible. During the pruning phase, the rules are made less restrictive, in order to avoid overfitting, which can cause poor performance on unseen examples. The grow heuristic used in RIPPER is the information gain function.

Finally, WINNOW is the representative of the perceptron-based algorithms in our study [9]. It classifies a new instance x into the second class if

∑i wi xi > θ

and into the first class otherwise. It initializes its weights wi and θ to 1 and then accepts a new instance (x, y), applying the threshold rule to compute the predicted class y΄. If y΄ = 0 and y = 1, then the weights are too low; so, for each feature such that xi = 1, it sets wi = wi · α, where α is a number greater than 1, called the promotion parameter. If y΄ = 1 and y = 0, then the weights were too high; so, for each feature such that xi = 1, it decreases the corresponding weight by setting wi = wi · β, where 0 < β < 1 is called the demotion parameter.
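A minimal sketch of the WINNOW update just described, assuming binary (0/1) features and 0/1 class labels; the default α and β values and the fixed number of passes are our own choices for the sketch:

```python
def train_winnow(X, y, alpha=2.0, beta=0.5, epochs=10):
    """X: binary (0/1) feature vectors, y: labels in {0, 1}.
    On a false negative, promote (w_i *= alpha) the weights of the active
    features; on a false positive, demote them (w_i *= beta)."""
    w = [1.0] * len(X[0])
    theta = 1.0                  # the text initializes weights and theta to 1
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
            if pred == 0 and label == 1:     # weights too low: promote
                w = [wi * alpha if xi == 1 else wi for wi, xi in zip(w, x)]
            elif pred == 1 and label == 0:   # weights too high: demote
                w = [wi * beta if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, theta
```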

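Finally, the 3-NN sketch referred to above: classification by majority vote among the three nearest training examples under Euclidean distance. The numeric encoding of the examples is invented for illustration:

```python
import math
from collections import Counter

def three_nn_predict(X_train, y_train, x):
    """Classify x by majority vote among its 3 nearest training examples,
    with Euclidean distance as the distance metric."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:3])
    return votes.most_common(1)[0][0]

# Hypothetical numeric encoding of four training examples:
X_train = [(6.0, 1.0), (2.0, 0.0), (7.5, 1.0), (3.0, 1.0)]
y_train = ["pass", "fail", "pass", "fail"]
print(three_nn_predict(X_train, y_train, (6.5, 1.0)))  # pass
```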