EdMedia 2015 - Montreal, Quebec, Canada, June 22-24, 2015
Identifying Students with Evasion Risk Using Data Mining

Márcio Aurélio dos Santos Alencar
Institute of Computing, Federal University of Amazonas - UFAM
Manaus, Brazil

Eulanda Miranda dos Santos
Institute of Computing, Federal University of Amazonas - UFAM
Manaus, Brazil

José Francisco de Magalhães Netto
Institute of Computing, Federal University of Amazonas - UFAM
Manaus, Brazil

Abstract: The number of educational institutions that offer distance learning courses is increasing. Studies have shown that student dropout rates in this type of educational system have also increased. Even though Virtual Learning Environments (VLEs) record all the interactions of the students throughout the course, the information provided by a VLE is not, by itself, enough to predict and prevent high student dropout rates. In this context, the objective of this paper is to present a survey of data mining approaches and techniques in order to point out data mining-based solutions that can be employed to predict and prevent high student dropout rates.

Keywords: Moodle, data mining, student dropout rates.
Introduction

Several educational institutions use Virtual Learning Environments (VLEs) for teaching and learning. VLEs offer a range of tools intended to facilitate the monitoring of students in distance courses. However, despite these technological tools, many educational institutions still face the problem of student dropout. Research by Wolff et al. (2014) with students of the Open University, one of the largest distance education institutions in Europe, indicates that the main reason for student dropout is lack of tutor monitoring. On the other hand, when an intervention is performed at the right time, based on real-time identification of students at risk of dropout and on appropriate decision-making, a reduction in dropout may be achieved. Besides lack of tutor monitoring, other factors that increase dropout rates in distance learning courses are: financial difficulties, lack of time, lack of interaction with the VLE, lack of motivation, lack of school knowledge, lack of technological knowledge, sense of isolation, health problems, excessive content in the VLE, and age (Mezzari et al., 2013).

VLEs worldwide store a large amount of data. However, people are usually unable to interpret, without tool support, the information about students' participation in distance education courses that VLEs record. As a consequence, it is necessary to develop new tools to extract these data and to generate useful information from them. Data mining techniques have recently been investigated as a way of using such stored data to identify the factors that may contribute to student dropout. Data mining provides a set of techniques that can help educational systems to address several issues, such as identifying students' needs, personalizing training and predicting the quality of student interactions, by analyzing students' trends and behaviors towards education in order to improve their learning experience (Yadav and Pal, 2012).
More precisely, Educational Data Mining (EDM) is a research domain that seeks to extract and analyze information, recorded by VLEs, related to the process of teaching and learning. According to Gama et al. (2014), EDM may benefit students, teachers and educational institutions. In this context, this paper focuses on proposing an architecture to add an EDM module to the Moodle VLE used in the School of Distance Education of the Amazon Technical Education Center (CETAM EaD). We plan to run experiments using data collected between 2010 and 2013 from the technical courses of
Maintenance and Support in Computer Science, Public Services and Hosting provided to 13 municipalities in the State of Amazonas, Brazil. This dataset was chosen because, after the three technical courses, there was a significant number of dropouts and withdrawals. The main objective of this paper is to present a survey of EDM techniques for identifying students of the CETAM EaD technical courses who are at risk of dropping out. Currently, this identification is carried out manually by the school's administrative team and is known to be a very time-consuming task. Thus, we believe that the early discovery of these students can help to reduce dropout.

This paper is organized as follows. First, the most common and recently proposed EDM methods are discussed. Then, the proposed architecture is presented. Finally, conclusions and suggestions for future work are discussed.
Related Work

There are several studies in the literature that seek to predict academic achievement through data mining techniques. Dropout is a concern for both classroom and distance education (EaD) courses, and EDM helps to identify the reasons why students drop out. According to the 2013 census conducted by ABED (Brazilian Association for Distance Education), the average student dropout rate in distance courses was 19.06% (CensoEaD.BR, 2015). In this section, we discuss some recent work that uses data mining techniques to predict student dropout in distance education courses.

Yükseltürk et al. (2014) investigated the participation of 189 students in an IT (Information Technology) Certification Program in Turkey from 2007 to 2009. They evaluated four machine learning/data mining techniques: k-nearest neighbors (k-NN), Decision Tree (DT), Naive Bayes (NB) and Neural Networks (NN). The inputs for these classifiers were the following nine attributes related to the students: gender, age, education level, previous online experience, occupation, self-efficacy, availability, prior knowledge and locus of control. The authors showed that k-NN achieved a hit rate significantly higher than the rates attained by the remaining methods: k-NN (87%), DT (79.7%), NN (76.8%) and NB (73.9%).

Based on a literature survey, Silva et al. (2014) used a DT with the following attributes: proportion of resources viewed, proportion of activities delivered or participated in, average grade obtained and frequency of activities. The experiment was conducted on data obtained from the participation of 88 students. The results showed a correct prediction for 77 students, i.e., the technique achieved a hit rate of about 87% (77/88 = 87.5%).
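The hit rates quoted above are the fraction of students whose outcome a classifier predicts correctly. As a minimal sketch of how a k-NN classifier and its hit rate could be computed, the following Python example uses a tiny, made-up data set; the two features, their values and the labels are illustrative assumptions and are not taken from any of the cited studies.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def hit_rate(train, test, k=3):
    """Fraction of test examples whose label is predicted correctly."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)


# Hypothetical, already normalised features: (prior knowledge, self-efficacy).
TRAIN = [
    ((0.9, 0.8), "completed"),
    ((0.8, 0.9), "completed"),
    ((0.7, 0.7), "completed"),
    ((0.2, 0.3), "dropout"),
    ((0.1, 0.2), "dropout"),
    ((0.3, 0.1), "dropout"),
]
TEST = [((0.85, 0.75), "completed"), ((0.15, 0.25), "dropout")]

if __name__ == "__main__":
    print(f"hit rate on the toy test set: {hit_rate(TRAIN, TEST):.1%}")
```

A real experiment would use the nine attributes listed above, a proper train/test split or cross-validation, and feature scaling, none of which is shown in this sketch.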
Castro et al. (2007) applied data mining techniques to solve e-learning problems, such as: detection of irregular learning behaviors; classification based on students' learning performance; and clustering according to similar e-learning system usage and systems' adaptability to students' requirements.

In (Cortez and Silva, 2008), two core classes (Mathematics and Portuguese) of two secondary schools from the Alentejo region of Portugal were studied. The authors employed 29 predictive variables and four learning algorithms: Support Vector Machine (SVM), Random Forest (RF), DT and NN. A data set of 788 students who took the 2006 examination was used in their experiments. They reported that DT and NN attained the highest predictive accuracies, 93% and 91% respectively, in a two-class problem (pass/fail). They also reported that both DT and NN achieved a predictive accuracy of 72% on a four-class dataset.

Márquez-Vera et al. (2013) conducted a study of 670 students from the Autonomous University of Zacatecas, from 2009 to 2010. They performed different experiments using 13 algorithms (JRip, NNge, OneR, Prism, Ridor, ADTree, DT (J48), RandomTree, REPTree, SimpleCart, ICRM v1, ICRM v2 and ICRM v3). Their results indicated that ICRM (Interpretable Classification Rule Mining) v3 outperformed the remaining methods, attaining a reported performance of 98.7%.

The K-means clustering technique was employed by França and Amaral (2013) in order to group students according to their learning difficulties. The experiment was conducted on a dataset composed of 890 records related to the activities of 33 students of Object Oriented Programming, taught in the first half of 2010 in the Bachelor's Degree in Computer Science, University of Pernambuco. In this experiment, the authors generated 6 clusters.
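To illustrate the kind of grouping that K-means produces, the sketch below implements the plain algorithm in Python on a handful of made-up activity records. The two features (exercises solved, average grade), the data values and the choice of two clusters are assumptions made for illustration; the cited study worked with 890 records and 6 clusters.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignment


# Hypothetical records: (exercises solved, average grade) per student.
ACTIVITY = [(2, 3.0), (9, 8.5), (3, 2.5), (10, 9.0), (1, 4.0), (8, 8.0)]

if __name__ == "__main__":
    centroids, assignment = kmeans(ACTIVITY, k=2)
    for student, cluster in zip(ACTIVITY, assignment):
        print(student, "-> cluster", cluster)
```

Seeding the centroids with the first k points keeps the sketch reproducible; practical implementations usually rely on random or k-means++ initialisation and several restarts.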
The experiment undertaken by Witten et al. (2005) compared two decision tree algorithms, CART (SimpleCart) and C4.5 (J48), a Bayesian classifier (BayesNet), a logistic model (SimpleLogistic), a rule-based learner (JRip) and Random Forest (RandomForest). The OneR classifier was also considered as a baseline and as an indicator of the predictive power of particular attributes.

Kampff et al. (2008) developed a system using EDM techniques to identify behaviors and
characteristics of students at risk of dropout in a VLE. Once students in this situation are detected, the system alerts the teacher, supporting pedagogical decisions aimed at motivating these students to stay in the course. In addition, the system helps teachers to improve teaching techniques and to identify which students are struggling. Data collected from 161 students over the first three months were used in the experiment. After reviewing the decision tree, the authors noticed that the most relevant results were attained by students who performed all tasks.

Schneider et al. (2011) describe the design and implementation of a process analysis to improve learning in an online self-study course. The learning environment was implemented within an existing Semantic MediaWiki.
Proposal

Early in 2010, 1,622 students from 13 municipalities of the State of Amazonas, Brazil, were enrolled in the technical courses in Computer Maintenance and Support, Public Services, and Tourism offered by CETAM distance education, CETAM EaD (http://ead.cetam.am.gov.br/salasp), with the support of the e-Tec Network (Vocational and Technological Education in Distance Mode) of the Ministry of Education. In 2013, the school office conducted a manual survey of school data and found that 674 students had completed the courses; the remaining 948 students, about 58.5% of those enrolled, had dropped out or withdrawn.

When conducting these courses, CETAM EaD used an academic system integrated with the Moodle Virtual Learning Environment (http://ead.cetam.am.gov.br/sisacad/), which, among its many features, introduced a student progress report. This report, generated dynamically, shows the students' interactions, the dates on which activities (forum, task, questionnaire, book, etc.) were carried out, and the grade attained in each activity (Alencar & Netto, 2013). This information supported teachers and teaching staff and was recorded in the Moodle database (Figure 1).
Fig. 1 Activity report and student grades

The purpose of this paper is to analyze the interactions recorded by CETAM EaD for the technical courses in the period 2010-2013, using Educational Data Mining techniques, in order to offer functionality that shows which students are at risk of dropping out. The proposed architecture is shown in Figure 2. Here, the students use the Moodle VLE and their interactions are recorded in a database, while the user, a member of the administrative or technical staff, can access the academic system and the VLE. The proposal involves adding a new routine to the academic system, which uses the VLE database to apply EDM techniques in order to generate new strategic reports showing students who are at risk of dropping out. These new reports will help the mediators to monitor students.
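As a rough sketch of the data flow in this proposal, the Python example below reads interaction records from a database and produces a list of students to be monitored. Everything specific in it is an assumption made for illustration: the in-memory SQLite database stands in for the VLE database, the `student` and `interaction` tables are hypothetical and do not reproduce the Moodle schema, and the fixed interaction threshold is only a placeholder for the EDM model that the new routine would apply.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the VLE database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE interaction (student_id INTEGER, activity TEXT, grade REAL)"
    )
    conn.executemany(
        "INSERT INTO student VALUES (?, ?)",
        [(1, "Student A"), (2, "Student B"), (3, "Student C")],
    )
    conn.executemany(
        "INSERT INTO interaction VALUES (?, ?, ?)",
        [
            (1, "forum", 8.0),
            (1, "task", 7.5),
            (1, "questionnaire", 9.0),
            (2, "forum", 4.0),
        ],
    )
    return conn


def at_risk_report(conn, min_interactions=2):
    """Return (name, interaction count) for students below the threshold.

    The threshold rule is a placeholder for a trained classifier.
    """
    query = """
        SELECT s.name, COUNT(i.student_id) AS n
        FROM student AS s
        LEFT JOIN interaction AS i ON i.student_id = s.id
        GROUP BY s.id, s.name
        HAVING COUNT(i.student_id) < ?
        ORDER BY n, s.name
    """
    return conn.execute(query, (min_interactions,)).fetchall()


if __name__ == "__main__":
    for name, count in at_risk_report(build_demo_db()):
        print(f"{name}: {count} recorded interaction(s)")
```

The `LEFT JOIN` keeps students with no recorded interactions in the report, which matters here because complete inactivity is exactly the case a monitoring report should not miss.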
Fig. 2 Proposed Architecture

Moodle is an open-source course management system that helps educators create effective online learning communities. Moodle is designed to support a style of learning called social constructionist pedagogy. This style of learning holds that students learn best when they interact with the learning material, construct new material for others, and interact with other students about the material (Romero et al., 2008).

In this work, we plan to proceed in three steps. First, the students' usage data from the Moodle system will be preprocessed, i.e., transformed into a format that can be mined. To preprocess the Moodle data, we can use a database administration tool or a specific preprocessing tool. Second, data mining algorithms will be applied to build and execute the model, which will discover and summarize the knowledge of interest to the user. We intend to use the Weka tool (Waikato Environment for Knowledge Analysis) to apply data mining to the Moodle data. Finally, the third step involves interpretation, evaluation and deployment of the results. These results will be interpreted and used by the teacher for further actions.

We intend to analyze the interactions recorded by students, tutors and teachers, as we believe that these data are of paramount importance for predicting student failure, as shown by Detoni et al. (2014), who applied EDM techniques in the Moodle environment of the Federal University of Pelotas. In another interesting work (Paiva, Bittencourt and Silva, 2013), the authors analyzed the interaction between students and created a Pedagogical Recommendation tool that can enhance the learning experience of students.

To build this new functionality of the academic system, we need to know which attributes should be extracted and how. In the literature, we found several works that used different feature extraction strategies to produce the results attained in their experiments.
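The preprocessing step mentioned above, transforming Moodle usage data into a format that Weka can mine, could for example target Weka's ARFF text format. The Python sketch below writes a small ARFF document; the relation name, attribute names and data rows are hypothetical, and the function is a simplification that assumes attribute names and nominal values contain no spaces or special characters (it does no quoting).

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            spec = kind  # e.g. "numeric"
        else:
            spec = "{" + ",".join(kind) + "}"  # nominal value set
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical per-student summary extracted from the VLE.
ATTRIBUTES = [
    ("logins", "numeric"),
    ("forum_posts", "numeric"),
    ("average_grade", "numeric"),
    ("outcome", ["completed", "dropout"]),
]
ROWS = [(12, 3, 7.5, "completed"), (1, 0, 2.0, "dropout")]

if __name__ == "__main__":
    print(to_arff("moodle_usage", ATTRIBUTES, ROWS), end="")
```

The resulting text can be saved with an `.arff` extension and opened in Weka, where the last attribute would typically be selected as the class to predict.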
The literature review by Romero and Ventura (2010) highlights the importance of the data contained in VLE logs, such as: where students enter and leave, the most popular pages, the number of visits, participation in discussion forums, the number of posts, the time a student remains engaged, etc. After the attributes are extracted, preprocessing must be conducted to make the data set ready for use in experiments. For data mining applications, examples of preprocessing strategies are: data cleansing, integration, transformation, reduction and discretization. In this way, we expect that experiments with various algorithms will allow us to choose the most appropriate classifier for the application in question.

The research of Nachmias and Ram (2009) examined the Virtual TAU Project, launched in the academic year 2000/1, which implemented learning technologies in Israeli higher education. Log files of 58 Moodle course Websites offered by Tel Aviv University (TAU) in the academic year 2008 were examined. About 26,000 students enroll at Tel Aviv University in about 6,000 courses annually, of which about three-quarters have a Web-supported site; nearly all of these online-supported courses use HighLearn.
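The attribute extraction and discretization steps described in this section can be sketched as follows. The Python example counts events per student from a list of log records and then maps a count to a nominal level. The log records, the event names and the cut points are illustrative assumptions; they are not taken from the Moodle log format or from the CETAM EaD data.

```python
from collections import Counter

# Hypothetical log records: (student id, event type).
LOG = [
    ("s1", "login"), ("s1", "forum_post"), ("s1", "forum_post"),
    ("s1", "page_view"), ("s2", "login"), ("s2", "page_view"),
    ("s1", "login"), ("s1", "forum_post"),
]


def extract_features(log, event_types):
    """Count, for every student in the log, how often each event type occurs."""
    counts = Counter(log)  # one counter entry per (student, event) pair
    students = sorted({student for student, _ in log})
    return {s: {e: counts[(s, e)] for e in event_types} for s in students}


def discretize(value, cut_points, labels):
    """Map a number to a nominal label.

    labels must have one more entry than cut_points; a value below the
    i-th cut point (and not below any earlier one) receives the i-th label.
    """
    for bound, label in zip(cut_points, labels):
        if value < bound:
            return label
    return labels[-1]


if __name__ == "__main__":
    features = extract_features(LOG, ["login", "forum_post", "page_view"])
    for student, row in features.items():
        level = discretize(row["forum_post"], [1, 3], ["none", "low", "high"])
        print(student, row, "forum participation:", level)
```

Discretizing counts in this way is one option among the preprocessing strategies listed above; whether it helps depends on the classifier, since some algorithms handle numeric attributes directly.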
Conclusions

This paper presented a study on dropout in distance learning courses and on the main data mining techniques applied to this type of problem. We intend to use EDM techniques to analyze the data of the
Public Services technical course students at CETAM EaD, using data from 2011 to 2013. The next stage of this work involves conducting experiments with the database, which comprises the record of student participation in the Moodle LMS during the investigated period. We intend to compare the prediction results with the results collected manually by the school office during the same period. We believe that the works cited in this paper will provide the basis for determining which algorithm or algorithms generate better results in applications involving data from students of distance learning courses. The objective is to apply a methodology that uses data mining to help the pedagogical staff of CETAM EaD identify students with a withdrawal or dropout profile. In the future, we intend to develop a module to help teachers in the educational process and, with this contribution, to increase the number of students who successfully complete their distance courses.
References

Alencar, M. A. S.; Netto, J. F. M. Facilitando a Tutoria EaD Utilizando o SISACAD. In: 19º Congresso Internacional ABED de Educação a Distância, 2013, Salvador/BA.

Baker, R.; Isotani, S.; Carvalho, A. (2011). Mineração de dados educacionais: Oportunidades para o Brasil. Revista Brasileira de Informática na Educação, 19(02):03.

Castro, F., Vellido, A., Nebot, A. and Mugica, F. (2007). Applying data mining techniques to e-learning problems. In Jain, L.C., Tedman, R. and Tedman, D. (Eds.), Evolution of Teaching and Learning Paradigms in Intelligent Environment, Springer-Verlag, New York, NY, USA, pp. 183-221.

Cortez, P.; Silva, A. Using Data Mining To Predict Secondary School Student Performance. In EUROSIS, A. Brito and J. Teixeira (Eds.), 2008, pp. 5-12.

CensoEaD.BR (2015). Accessed: 15 Oct 2014.

Detoni, D.; Matsumara, R. A.; Cechinel, C. Predição de Reprovação de Alunos de Educação a Distância Utilizando Contagem de Interações. In: 25º Simpósio Brasileiro de Informática na Educação (SBIE), 2014, Dourados-MG.

França, R. S.; Amaral, H. J. C. Mineração de Dados na Identificação de Grupos de Estudantes com Dificuldades de Aprendizagem no Ensino de Programação. RENOTE. Revista Novas Tecnologias na Educação, v. 11, p. 1-10, 2013.

Gama, S.; Jordão, V.; Gonçalves, D. EduVis: Visualizing Educational Information. In Proceedings of NordiCHI 2014, the 8th Nordic Conference on Human-Computer Interaction. Helsinki, Finland. 26-30 Oct 2014. ACM Press.

Kampff, A. J. C.; Reategui, E. B.; Lima, J. V. de. Mineração de dados educacionais para a construção de alertas em ambientes virtuais de aprendizagem como apoio à prática docente. Novas Tecnologias na Educação, v. 6, Nº 2, December 2008.

Paiva, R.; Bittencourt, I. I.; Silva, A. P. (2013). Uma Ferramenta para Recomendação Pedagógica Baseada em Mineração de Dados Educacionais. In: Congresso Brasileiro de Informática na Educação (CBIE 2013), Campinas, SP.

Márquez-Vera, C.; Cano, A.; Romero, C.; Ventura, S. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl. Intell. 38(3): 315-330 (2013).

Mezzari, A.; Tarouco, L. M. R.; Ávila, B. G.; Favero, R. V. M.; Machado, G. R.; Bulegon, A. M. Estratégias para detecção precoce de propensão à evasão. Revista Iberoamericana de Educación a Distancia, v. 16, p. 147, 2013.

Nachmias, R., & Ram, J. (2009). Research Insights from a Decade of Campus-Wide Implementation of Web-Supported Academic Instruction at Tel Aviv University. The International Review of Research in Open and Distance Learning.

Romero, C., Ventura, S. (2010). Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40, 601-618.

Romero, C.; Ventura, S.; Garcia, E. Data mining in course management systems: Moodle case study and tutorial. Computers & Education, 51(1), 368-384, 2008.

Schneider, D. K., Benetos, K. & Ruchat, M. (2011). MediaWikis for research, teaching and learning. In T. Bastiaens & M. Ebner (Eds.), Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2011 (pp. 2084-2093). Chesapeake, VA: AACE.

Silva, J. M. C. da; Preissler Jr., S.; Tessari, R.; Andrade, F. G. Alunos em Risco: como identificá-los por meio de um ambiente virtual de aprendizagem? In: XI Congresso Brasileiro de Ensino Superior a Distância, 2014, Florianópolis.

Wolff, A.; Zdrahal, Z.; Herrmannova, D.; Kuzilek, J. and Hlosta, M. (2014). Developing predictive models for early detection of at-risk students on distance learning modules. In: Machine Learning and Learning Analytics Workshop at The 4th International Conference on Learning Analytics and Knowledge (LAK14), 24-28 March 2014, Indianapolis, Indiana, USA.

Witten, I. H., Frank, E. Data Mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann, 2nd ed., 2005.

Yadav, S. K.; Pal, S. Data Mining: A Prediction for Performance Improvement of Engineering Students using Classification. World of Computer Science and Information Technology Journal (WCSIT), Vol. 2, 51-56, 2012.

Yükseltürk, E., Özekeş, S. & Türel, Y. K. (2014). Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program. European Journal of Open, Distance and E-Learning (EURODL), 17(1), 118-133.