doi:10.2498/iti.2012.0470
Educational Data Mining for Grouping Students in E-learning System
Divna Krpan, Slavomir Stankov
Faculty of Science, Teslina 12, Split, 21000 Croatia
E-mail(s):
[email protected],
[email protected]
Abstract. The popularity and use of e-learning systems have increased over the last decades. Students produce a lot of data through their interactions with such systems, and this data is often not exploited. Since the amount of such data is vast, data mining techniques are an appropriate way to analyze it. In this paper we report practical experience with a specific e-learning system and apply a data mining technique as a tool for grouping students with similar characteristics.
Keywords. educational data mining, learning management system, group modeling
1. Introduction
The use of computers in learning and teaching has advanced considerably over the last decades through different e-learning systems. However, the need for improvement is always present. Students learn and explore content, and by doing so they leave a trail of “breadcrumbs”, or log information. This raises an important research question: what can we do with that data? Such data can easily be recorded in an e-learning system and then analyzed, and as a result the teacher could adapt tests to maximize performance [1]. Data mining (DM) techniques arise as an answer, since they are already used in different areas (e.g. medicine, business, market research) where vast amounts of data are available. Educational data mining (EDM) is described as a new and interesting application of data mining [2]. In general, EDM considers all data collected during any educational process, but here we are mostly interested in e-learning systems. As we describe later, it is possible to combine information collected from an e-learning system with traditional tests. Individual learning and teaching supported by e-learning systems, especially intelligent tutoring systems, offers the advantages of human tutoring [3], supported by tracking of student progress. Students, however, are not isolated individuals; they collaborate.
Software companies form teams because individuals are not capable of handling large projects alone. Students’ collaboration, or working in groups, prepares them for real-world situations. The problem of grading students’ success thus extends to the group. If each group member leaves some kind of trace in the e-learning system, EDM might help us solve the problem [1]. We monitored students’ traces in a specific learning management system, Moodle (http://moodle.org), which is used at our Faculty. Moodle stores detailed records of students’ actions in a database, and we decided to explore the application of EDM in order to give meaning to those records. In the next section, we briefly describe the EDM techniques used in this paper. The third section reviews the literature on research in our area of interest: higher education, Moodle and small groups. The fourth section describes our research, conducted at the Faculty of Science, University of Split (Croatia), in the course Introduction to computing. The purpose of this research was to determine groups of students with similar characteristics based on test results and system logs. As the first step, understanding the domain, we describe the Moodle database and the modules which contain data. After data collection, we performed the DM analysis and interpreted the results. In the fifth section, we discuss the results and future directions for the research.
2. Educational data mining
Data mining (DM) is used in many companies, for example in online shopping [4]. Educational data mining (EDM) is simply DM applied in education. An e-learning system provides data that is transformed into information which teachers use to improve the teaching process. Students also benefit from that information, as mentioned earlier (by getting appropriate content), and by adapting their own learning process. The
goal is to understand learning and teaching in order to improve their quality [1]. EDM might be used for student modeling and for e-learning system evaluation [5]. Teachers group or classify students, discover patterns of misconceptions, etc. In the classroom, teachers observe students’ behavior and analyze test results, and they adapt the instruction according to that feedback. Students differ and respond to changes according to their characteristics, so the teacher adapts constantly. Such information and feedback are often missing in e-learning systems. Since these systems store large amounts of data, EDM techniques might give meaning to the data as a basis for adaptation. Data collected in e-learning systems is personal (demographic information) and academic (learning progress) [2]. EDM could also be used for scheduling, prediction of student dropout, or course enrollment. In general, the EDM process consists of three phases [5]: data collection and preparation, data analysis, and interpretation of results. The process is iterative [4]: after the results are interpreted, the instruction is adapted and the process starts again. Changes might affect the results positively or negatively. DM models of analysis may be descriptive or predictive [6]. Descriptive models describe patterns that are used in decision making, while predictive models are used to predict data (for example, student success or grades). We selected cluster analysis, a descriptive model, as the EDM technique for our case study. Cluster analysis is the process of dividing a data set into clusters (categories or classes). It is used when researchers expect some type of grouping, or as a preparation stage for another DM technique. Cases within the same cluster are more similar to each other than to cases in other clusters. Clustering algorithms do not require a hypothesis to be set before the analysis [7]. There are different algorithms for cluster analysis, based on distance or on similarity [8]. We selected a distance-based approach and the K-means algorithm.
The algorithm calculates the distance between cases, which are then divided into different clusters [9]. The distance measure is the simple Euclidean distance. The K-means algorithm divides the data set into K clusters. Each cluster has a centroid, a value determined as the mean of all instances in the cluster. The initial set of centroids is determined heuristically. The weakness of the K-means algorithm is the selection of the initial centroid set, which is often difficult with a new data set. It is possible to group students and to predict or analyze their response to different teaching strategies [10], and also to detect students who misuse the system or “game the system” [11]. The next section provides closer insight through specific case studies, which were selected as inspiration for our research because of their similar characteristics.
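The K-means procedure described above (Euclidean distance, centroids as cluster means, heuristic initial centroids) can be illustrated with a minimal sketch. This is not the Statistica implementation used in the study. The random choice of initial centroids, the function names and the toy data are assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(cases, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a basic K-means run."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k distinct cases picked at random.
    centroids = [list(c) for c in rng.sample(cases, k)]
    labels = [None] * len(cases)
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(case, centroids[j]))
            for case in cases
        ]
        if new_labels == labels:
            break  # no case changed cluster, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return centroids, labels


# Hypothetical (post-test score, number of actions) pairs, not study data.
cases = [(70, 200), (75, 230), (68, 210), (45, 390), (50, 370), (48, 400)]
centroids, labels = k_means(cases, k=2)
```

Because the variables here are on different scales, the one with the larger range (number of actions) dominates the Euclidean distance. Whether and how variables were standardized in the study is not stated, so the sketch leaves them unscaled.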
3. Review of literature
It is important to learn from the experience of other researchers, so we briefly introduce some interesting case studies. We are interested in EDM techniques that researchers used in similar environments: higher education, Moodle and small groups.
3.1. Educational data mining in higher education
The authors in [12] describe several case studies as inspiration for the integration of different data mining techniques. For example, Gabrilson in 2003 [12] analyzed course test results in order to determine the factors that most affect students’ success. The researchers used data mining prediction and discovered relationships between different variables that influenced the final test results. They worked on changing the discovered variables, and the next generation of students produced better results. Other researchers [13] used classification to predict the final grade based on students’ use of a Web-based system. It is thus possible to spot students at risk of failing or of misusing the system, and to improve the grading of individuals or groups.
3.2. Data mining in Moodle
Moodle is a popular open source LMS and also the subject of different case studies. In [10], researchers applied different DM techniques with the purpose of improving the quality of learning and teaching. The authors state that it is possible to combine information gathered from the system with results and grades from traditional learning and teaching. Teachers used DM techniques to determine whether a module or an activity is appropriate for students, and to classify new students based on their use of the system.
This case study shows that it is possible to combine data from different sources. We also have data collected from the system and offline (traditional pen-and-paper tests). The results are not comparable with our research, since the settings differed (for example, more students and more courses).
3.3. Small group study
The case study in [14] tested the possibilities and limitations of DM. The researchers point to problems of data redundancy and attribute correlation, which complicate pattern discovery. The purpose of that research was to determine the importance of feedback in online tests or quizzes. They conducted 8 online tests in Moodle, which consisted of multiple-choice questions. The tests were integrated with traditional learning and teaching. The research sample consisted of 73 students. The data collected for each student and question consisted of correctness, time spent, whether feedback was requested, the feedback provided, and whether the student actually checked the feedback. Students were asked to answer a learning style questionnaire, but only 90% of them participated, which reduced the sample. A classification algorithm (C4.5) discovered that reading question feedback increases the probability of a correct answer to the next question (related to the previous one by content). It also increases the probability of better final test results. The researchers emphasize that C4.5 results are not reliable on such a small data set, which consequently reduces the reliability of the conclusions drawn. However, they concluded that good test design is very important, and they recommend hierarchical clustering of questions based on correct answers and final test grade. A small data set with specific cases is easier to control and understand in the stage of EDM model design. In that case study the researchers monitored one Moodle module: the quiz. Our case study also uses a small sample, but we monitored more modules as well as offline test results. Our emphasis is on learning and on the lesson module.
4. Case study
Data mining analysis is divided into different stages. In general, there are six stages, based on the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology [15]. In [16] the authors present four stages, but the number of stages
depends on the complexity of each stage. For example, a vast amount of data requires complex data collection and data preparation stages [17]. Because of our small data set, a three-stage research process was appropriate for our case study. Data was collected from an experiment conducted at the Faculty of Science in Split (Croatia) [18], [19]. The research sample consisted of 81 first-year students in the course Introduction to computer science. We closely monitored only the experimental group of 52 students, since they used Moodle. As the first step, students took a pen-and-paper pre-test in order to determine their initial knowledge. In the next phase, students learned using Moodle for three weeks. Finally, they took a pen-and-paper post-test. General student information consisted of gender and pre-test and post-test results. Other data was collected from the Moodle database, from which we extracted data using different SQL queries. We were interested in time spent on the course and its activities, and also in access frequency. Data analysis was conducted with Statistica 8.0 (http://www.statsoft.com/). It was important to understand the Moodle database structure before the data collection process, so we briefly describe the modules used in this case study.
4.1. Moodle database
Moodle is an acronym for Modular Object-Oriented Dynamic Learning Environment, which implies a modular design. The modular design simplifies the import of new modules, content from different courses, etc. Moodle exists in different versions; the current version is 2.x, but we used 1.9.9. Moodle keeps detailed logs of user actions, available through the teacher or administrator user interface. Unfortunately, log information is only available as an HTML document. More detailed information is available through direct access to the Moodle database tables. The Assignment module allows teachers to set an assignment or task for which students are required to upload their solution. Automatic grading is not available, so the teacher manually grades each assignment and provides feedback. Moodle logs access time and view and upload actions. Although there are several database tables whose names begin with mdl_assignment, the table mdl_assignment_submissions contains the important information about user submissions.
The Quiz module contains general quiz settings and questions. Question types are predetermined: multiple choice, true/false, short answer, embedded, numeric, matching and essay. All question types except essay are graded automatically. Moodle logs start and end time, order of questions, question time, and correctness. The table mdl_quiz_attempts contains user information. The Lesson module contains learning content. Each lesson consists of different page types. We used only the branch table page, which contains content, and the question page, which contains questions. Other page types, such as cluster, were not used in order to simplify data collection. An overview of the important lesson database tables is given in Table 1.
Table 1. Lesson tables
Table                  Contains
mdl_lesson             Lesson settings
mdl_lesson_pages       Lesson page content
mdl_lesson_timer       Start and end lesson time
mdl_lesson_grades      Students’ grades
mdl_lesson_branch      Access information (content)
mdl_lesson_attempts    Access information (questions)
We have described only a few modules and tables; in total, Moodle consists of over 200 tables. The number depends on the specific build or version, and also on additionally installed modules.
4.2. Data selection and preparation
Our goal is to determine groups of students with similar characteristics based on online and pen-and-paper test results and on system logs. Moodle stores all information about modules and user activities in its database. Since there are many users on the system, we filtered the information for the specific course and excluded information about administrators and teachers. Moodle is backed by a MySQL database, which is also open source. We used MySQL Workbench for database access (http://www.mysql.com/). It allows running SQL queries, creating views, etc., and exporting SQL query results into different formats such as CSV, which simplified integration with the MS Excel tables provided by the Moodle interface and with the manually entered pen-and-paper test results. In this stage, we selected and integrated students’ data into one document.
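The selection step described above (filter one course, exclude teachers and administrators, export to CSV) might look roughly like the following sketch. It uses an in-memory SQLite database as a stand-in for MySQL. The tables activity_log and user_role, their columns and all values are hypothetical simplifications and do not reproduce the real Moodle schema or the queries used in the study.

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for the MySQL server used in the study.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified tables; these are NOT the real Moodle schema.
conn.execute(
    "CREATE TABLE activity_log ("
    "userid INTEGER, course INTEGER, module TEXT, event TEXT, ts INTEGER)"
)
conn.execute("CREATE TABLE user_role (userid INTEGER, role_name TEXT)")

conn.executemany(
    "INSERT INTO activity_log VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "lesson", "view", 1300000000),
        (1, 10, "lesson", "view", 1300000600),
        (1, 10, "quiz", "attempt", 1300001200),
        (2, 10, "lesson", "view", 1300000100),
        (2, 11, "lesson", "view", 1300000200),    # a different course
        (3, 10, "lesson", "update", 1300000300),  # a teacher's action
    ],
)
conn.executemany(
    "INSERT INTO user_role VALUES (?, ?)",
    [(1, "student"), (2, "student"), (3, "teacher")],
)

COURSE_ID = 10  # hypothetical identifier of the monitored course

# Count actions per student and module, for one course, students only.
query = """
    SELECT l.userid, l.module, COUNT(*) AS actions
    FROM activity_log AS l
    JOIN user_role AS r ON r.userid = l.userid
    WHERE l.course = ? AND r.role_name = 'student'
    GROUP BY l.userid, l.module
    ORDER BY l.userid, l.module
"""
rows = conn.execute(query, (COURSE_ID,)).fetchall()

# Export the result as CSV text, ready to merge with spreadsheet data.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["userid", "module", "actions"])
writer.writerows(rows)
csv_text = buffer.getvalue()
conn.close()
```

Passing the course identifier as a query parameter rather than concatenating it into the SQL string keeps the same query reusable for other courses.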
4.3. Data transformation
Data mining algorithms sometimes require specific data types, so the prepared data has to be transformed into an acceptable format. Data transformation includes cleaning, transformation, integration and reduction. In the data cleaning stage, outliers are usually discarded [20]. Outliers are cases with extremely low or high values, usually the result of some abnormality or error. We did not perform outlier detection, since our sample was small and we could not afford to discard those cases. Instead, we manually checked the correctness of the data in order to eliminate errors. Some data formats stored in the Moodle database had to be transformed; for example, dates are stored in UNIX format, which was not readable for the researchers during the data correctness check. We also aggregated, or grouped, some variables. For example, we summed the number of different activities on a lesson (lesson views and lesson page views) and the time spent on a lesson (as the sum of time spent on its pages). Some modules were excluded from observation since they were not used during the research (forum, blog, chat, etc.). Variables may also be derived from existing variables: we subtracted pre-test results from post-test results in order to get students’ advancement. At this point, we had to deal with missing data, since some students were missing one of the test results. There are different methods for treating missing values [21]. Such values are sometimes substituted with means, constant values (e.g. zero) or the most frequent value. Substitute values could significantly influence the final results, so the recommended technique is to discard such cases or, if a variable has many missing values, to discard the variable. We had already discarded information about rarely used modules. After the data transformation process, 26 variables and 43 cases remained. Some software tools, such as Access Watch, WebStat etc. [5], are used for Web site usage analysis. More complex analysis is performed with statistical tools such as SPSS, SAS, Statistica, etc. In the next section, we describe the application and results of the selected DM algorithms.
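The transformation steps described in this section (converting UNIX timestamps, aggregating lesson activity, deriving advancement, discarding cases with missing test results) can be sketched as follows. The record layout, field names and values are assumptions for illustration and are not the variables or data of the study.

```python
from datetime import datetime, timezone

# Hypothetical raw records; field names and values are illustrative only.
raw = [
    {"student": "s1", "pre": 40.0, "post": 70.0,
     "lesson_views": 12, "page_views": 90,
     "page_seconds": [300, 420, 180], "first_access": 1300000000},
    {"student": "s2", "pre": 35.0, "post": None,  # missing post-test
     "lesson_views": 5, "page_views": 30,
     "page_seconds": [60, 90], "first_access": 1300003600},
    {"student": "s3", "pre": 50.0, "post": 65.0,
     "lesson_views": 20, "page_views": 150,
     "page_seconds": [600, 600], "first_access": 1300007200},
]


def unix_to_text(ts):
    """Convert a UNIX timestamp (seconds) to a readable UTC string."""
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def transform(record):
    """Return a transformed case, or None if a test result is missing."""
    if record["pre"] is None or record["post"] is None:
        return None  # discard the case instead of substituting a value
    return {
        "student": record["student"],
        "first_access": unix_to_text(record["first_access"]),
        # Aggregation: lesson views and lesson page views are summed.
        "lesson_actions": record["lesson_views"] + record["page_views"],
        # Aggregation: lesson time is the sum of time spent on its pages.
        "lesson_time": sum(record["page_seconds"]),
        # Derived variable: post-test result minus pre-test result.
        "advancement": record["post"] - record["pre"],
    }


cases = [c for c in (transform(r) for r in raw) if c is not None]
```

In this toy data the second record is dropped because its post-test result is missing, which mirrors the choice to discard such cases rather than substitute a mean or constant.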
4.4. Research results
It was important to determine the influence of specific variables on test results, so we performed a correlation analysis. Correlation is a statistical measure of the relationship between two variables. Possible values range from -1 to +1. A zero correlation indicates that there is no linear relationship between the variables. A negative correlation means that as one variable goes up, the other goes down. A positive correlation indicates that both variables move in the same direction [22]. A sample of the correlation results is given in Table 2.
Table 2. Sample of correlation results
Variable 1                     Variable 2      Correlation
Finished lessons               Lesson score     0.37
Assignment score               Pre test         0.46
Assignment score               Post test        0.40
Lessons (all visited pages)    Lesson score     0.54
Sum of lesson views            Post test       -0.39
Lesson time                    Post test       -0.42
The number of finished lessons is positively correlated with the lesson score. Since lessons were not graded, lesson scores were not included in the Moodle final course score, which consisted mostly of the assignment submission scores. Students with a better final course score had better pre-test and post-test results, which could indicate that they were better both before and after learning. The number of lessons with all pages visited (at least one visit per page) is positively correlated with the lesson score. The sum of all lesson views is negatively correlated with post-test results, and lesson time (the time students spent learning) is also negatively correlated with post-test results. After the correlation analysis, we performed a cluster analysis using Statistica. The first step is determining the number of clusters. Statistica offers v-fold cross-validation for that purpose. This procedure divides the data set into v subsets (folds) of similar size. The cluster analysis resulted in two clusters. Cluster 1 contains students with better pre-test and post-test results, while Cluster 2 contains students with more activities and more time spent (Table 3).
Table 3. Centroids for cluster analysis
Variable     Cluster 1    Cluster 2
Pre test       44.95        37.06
Post test      71.91        48.27
Actions       222.04       383.20
Time          116.72       177.53
5. Conclusion and discussion
We conducted a data mining analysis on students’ data collected from the Moodle database and on students’ test results. The purpose of this research was to determine whether a DM technique would help us group students. The difference between post-test and pre-test results was used as the measure of success (to determine whether students advanced or not). We expected that students who spent more time and had more activities would perform better, but the analysis showed different results. Such students had lower test scores and course scores, although most of their time was spent on lessons, which contain the learning content. Our first impression was that something could be wrong with the analysis, so we additionally checked the whole process. Since some of our students do not have a computer or internet access, their learning time was organized in the computer lab and supervised by teaching assistants. It is possible that students did not respond well to the controlled environment and were just browsing through lesson pages. Some students spent a lot of time using lessons at home, but their performance was still low. They were probably outliers, which could not be discarded, as explained earlier. The cluster analysis showed results similar to those of the correlation analysis. Students with lower scores had more actions and were grouped in the same cluster. The number of clusters was influenced by the small number of cases and also by the reduced data set (number of variables). Some variables were grouped, while others were completely discarded since students did not use all available activities. Unfortunately, the data cleaning stage forced us to discard cases with missing data. One could ask why we undertook this research, since the number of cases was known before the analysis. The answer is simple: a small data set allows easier control and understanding of specific cases in the stage of EDM model design, especially in the first stages of data collection and transformation. It is easier to spot specific cases and to ensure that the data collection process (e.g. the SQL queries) is valid. The conclusions or results may not be statistically significant, but the model and procedure are important guidance for possible further research. Our recommendation for further research is the development of a Moodle module for dynamic, real-time data collection, and also of a recommendation system for students. The goal is to help students learn better, and to prevent behavior which leads to poor results.
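For readers who wish to repeat the correlation step reported in the research results, the following is a minimal sketch of a correlation coefficient. The paper does not name the coefficient that Statistica computed; the Pearson product-moment coefficient is assumed here, and the sample values are hypothetical, not study data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each sample.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (dev_x * dev_y)


# Hypothetical lesson time (minutes) and post-test scores, not study data.
lesson_time = [60, 90, 120, 150, 180]
post_test = [80, 72, 70, 55, 50]
r = pearson(lesson_time, post_test)  # negative: more time, lower score
```

A negative value of r in this toy sample has the same sign as the lesson time and post-test entry in Table 2, but it says nothing about causation or statistical significance.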
6. References
[1] R. Llorente and M. Morant, "Data Mining in Higher Education," New Fundamental Technologies in Data Mining, pp. 201-220, 2011.
[2] V. Kumar and A. Chadha, "An Empirical Study of the Applications of Data Mining Techniques in Higher Education," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 2, 2011.
[3] B. S. Bloom, "The 2 Sigma problem: The search for methods of group instruction as effective as one-to-one tutoring," Educational Researcher, vol. 13, pp. 4-16, 1984.
[4] O. Maimon and L. Rokach, "Introduction to Knowledge Discovery and Data Mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. New York: Springer, 2010, pp. 1-15.
[5] C. Romero and S. Ventura, "Educational data mining: A survey from 1995 to 2005," Expert Systems with Applications, vol. 33, pp. 135-146, 2007.
[6] L. Rokach, "A survey of Clustering Algorithms," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. New York: Springer, 2010, pp. 269-298.
[7] R. Baker, "Data Mining for Education," in International Encyclopedia of Education (3rd edition), vol. 7, B. McGaw, P. Peterson, and E. Baker, Eds. Oxford, UK: Elsevier, 2010, pp. 112-118.
[8] J. Han and M. Kamber, Data Mining – Concepts and Techniques. San Francisco, USA: Morgan Kaufmann Publishers, 2006.
[9] S. Ayesha, T. Mustafa, A. R. Sattar, and M. I. Khan, "Data Mining Model for Higher Education System," European Journal of Scientific Research, vol. 43, pp. 24-29, 2010.
[10] C. Romero, S. Ventura, P. G. Espejo, and C. Hervás, "Data mining algorithms to classify students," Proceedings of Educational Data Mining, pp. 20-21, 2008.
[11] R. Baker, A. T. Corbett, and K. R. Koedinger, "Detecting Student Misuse of Intelligent Tutoring Systems," in Proceedings of the 7th International Conference on Intelligent Tutoring Systems, 2004, pp. 531-540.
[12] N. Delavari, P. A. Somnuk, and M. R. Beikzadeh, "Data mining application in higher learning institutions," Informatics in Education, vol. 7, pp. 31-54, 2008.
[13] B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer, and W. F. Punch, "Predicting Student Performance: An Application of Data Mining Methods With the Educational Web-Based System LON-CAPA," in 33rd ASEE/IEEE Frontiers in Education Conference, Boulder, CO, 2003.
[14] M. Pechenizkiy, T. Calders, E. Vasilyeva, and P. De Bra, "Mining the student assessment data: Lessons drawn from a small scale case study," Educational Data Mining 2008, pp. 187-187, 2008.
[15] P. Chapman, T. Khabaza, and C. Shearer, "CRISP-DM 1.0, Step by step data mining Guide," SPSS Inc., 2000.
[16] C. Romero, S. Ventura, and E. García, "Data mining in course management systems: Moodle case study and tutorial," Computers & Education, vol. 51, pp. 368-384, 2008.
[17] M. H. Falakmasir and J. Habibi, "Using Educational Data Mining Methods to Study the Impact of Virtual Classroom in E-Learning," in The Third International Conference on Educational Data Mining (EDM2010), Pittsburgh, PA, USA, 2010, pp. 241-248.
[18] M. Musulin, "Vrednovanje sustava e-učenja metodom eksperimenta," thesis, Faculty of Science, University of Split, 2011 (in Croatian).
[19] T. iri, "Dubinska analiza podataka u sustavima e-učenja," thesis, Faculty of Science, University of Split, 2011 (in Croatian).
[20] J. Grzymala-Busse and W. Grzymala-Busse, "Handling Missing Attribute Values," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. New York: Springer, 2010, pp. 33-51.
[21] J. Maletic and A. Marcus, "Data Cleansing: A Prelude to Knowledge Discovery," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. New York: Springer, 2010, pp. 19-32.
[22] A. Grubišić, "Vrednovanje učinka inteligentnih sustava e-učenja," in Fakultet elektrotehnike i računarstva. Zagreb: Sveučilište u Zagrebu, 2007, p. 209.