Evaluation of Methods and Algorithms of Educational Data Mining
Oswaldo Moscoso-Zea Equinoctial Technological University, Faculty of Engineering, Quito, Ecuador
[email protected]
Mayra Vizcaino Equinoctial Technological University, Faculty of Engineering, Quito, Ecuador
[email protected]
Sergio Luján-Mora University of Alicante, Department of Software and Computing Systems, Alicante, Spain.
[email protected]
Abstract: Educational Data Mining (EDM) is an evolving discipline that allows the exploration of knowledge from academic environments by means of developing and applying data mining methods and algorithms to information stored in data repositories of higher education institutions. The application of data mining methods and algorithms allows these institutions to better understand the way the lecturers teach, the way the students learn and the activities of organizational processes to improve decision making. This paper describes DM, EDM and the existing methods and algorithms of the discipline. Furthermore, it presents experiments carried out using the classification method with different algorithms of decision trees of EDM to analyse two key performance indicators (KPI) in a private university: student dropout and graduation rate. In addition, it compares these methods and algorithms and suggests which has better precision in certain scenarios.
1. Introduction
Higher Education Institutions (HEI) generate large amounts of data from their organizational systems and applications. These data are not always processed and analyzed appropriately to produce information and knowledge. The production and dissemination of organizational knowledge is a strategic objective that supports HEI in the roadmap for planning, modernization and improvement of academic and research indicators. Setting up a technological infrastructure is necessary for a sound process of data analysis. One of the core technologies of this infrastructure is a data warehouse. A data warehouse is a data repository with a multidimensional design that is used specifically for analysis (Moscoso-Zea, Sampedro, & Luján-Mora, 2016). The information dispersed in different operational databases is migrated to the data warehouse by means of an extraction, transformation and loading (ETL) process. An approach that guides this knowledge creation process is called knowledge discovery in databases (KDD). KDD uses data mining (DM) as the core element for knowledge creation. DM has been applied to different industries and fields of study in recent years with promising results. Some of the fields in which DM is applied are marketing, health, finance and insurance, among others. The application of DM in educational contexts is known as educational data mining (EDM). EDM is an evolving discipline that focuses on the design of models to improve learning experiences and organizational efficiency (Huebner, 2013). EDM uses software tools to discover trends and patterns in educational data to improve decision making in HEI. Many institutions are applying EDM to their educational data and are gaining insights to improve the performance of students, lecturers and staff. Some examples of what institutions are doing in this field of study are the following.
Paul Smith’s College uses learning analytics to increase the graduation rate of its students; Washington University works with the company “Persistence Plus” to improve the evaluation of online courses by using data analytics; Open University uses its historical data to improve the retention rates of its students (Bichsel, 2012); Georgia University carried out experiments using analytic techniques to predict graduation rates and student dropout in online courses (Morris, Wu, & Finnegan, 2015). At Purdue University, researchers have used DM to determine that frequent evaluations in early stages can change the habits of students with low grades. The research team of this university has developed an academic alert system to track the performance of students (Baepler & Murdoch, 2010).
The main motivation for writing this paper is to continue exploiting EDM to discover hidden knowledge in academic data. Therefore, this paper presents the different methods and algorithms of EDM that have been developed by researchers. Furthermore, it describes the experiments carried out to analyze two key performance indicators (KPI) in a private university: student dropout and graduation rate. Classification methods and algorithms of EDM are applied in the analysis. In addition, the study compares these methods and algorithms and suggests which has better precision in certain scenarios. The conclusions of this paper can help data scientists choose the right methods and algorithms in further studies of these KPI. The research question of this work is: “Which are the best methods and algorithms of EDM that can be used to analyze student dropout and graduation rate?” The answer to this question can help researchers in education reduce the time spent on experimentation and analysis of student dropout and graduation rate. The paper is intended to give a clear view of which methods and algorithms are more adequate to study these academic indicators.
After introducing the research topic in this section, the structure of the paper is as follows: Section 2 presents the theoretical background, explaining the definitions of DM and EDM and presenting existing methods and algorithms of EDM; Section 3 describes the methodology of experimentation; Section 4 presents the results of the experiments performed to gain knowledge on student dropout and graduation rate; finally, Section 5 provides the conclusions of the work.
2. Background
The research objective of this work is to study existing methods and algorithms of EDM in order to experiment with those most appropriate for the analysis of data in educational institutions. This section describes the state of the art of EDM along with the existing methods and algorithms of EDM that are needed for the design of experiments.
Data mining
The term DM was coined in the 1990s by researchers in the information systems and databases field. Academics also use terms such as "Data Archeology", "Data Collection", "Knowledge Extraction" and "Data Analysis" when referring to DM. DM is a fundamental part of the KDD process. DM is an approach that uses different information technology (IT) systems and tools to analyze and extract knowledge from information contained in data repositories of organizations. The most commonly used framework for understanding the life cycle of a DM project is the Cross Industry Standard Process for Data Mining (CRISP-DM) (Chapman et al., 2000). The main phases, as shown in Figure 1, are: 1. Understanding the business, 2. Understanding data, 3. Preparing data, 4. Creating models, 5. Evaluation and 6. Deployment.
Figure 1: Phases of CRISP-DM
Educational Data Mining
EDM uses methods, tools and algorithms of DM to investigate data from students, teachers and administrative staff, collaboration between students, administrative data and demographic data of HEI. Since EDM is an evolving discipline, there is no widely accepted definition. However, a definition that fits the objectives of our research is provided by the International Society of Educational Data Mining: "EDM is an emerging discipline, concerned with developing methods for exploring the unique and increasingly large-scale data that come from educational settings, and using those methods to better understand students, and the settings which they learn in" (International Society Educational Data Mining, 2015). The EDM process is shown in Figure 2. In the first step, raw data obtained from an educational environment are preprocessed. These raw data are then modified (a new data set is created) and used with EDM methods or algorithms. In the following step a model is defined and the experiments are carried out. The results of experimentation allow the evaluation and refinement of the process (Ventura & Romero, 2013). EDM explores the organizational context, which is one of the reasons why it is key in the study and improvement of academic indicators such as desertion rate, graduation rate, restructuring processes and organizational management (Bienkowski, Feng, & Means, 2012).
[Figure 2 labels: Educational Environment, Raw Data, Preprocessing, Modified Data, Data Mining, Models/Patterns, Interpretation/Evaluation, Refinement, Hypothesis Formulation, Testing]
Figure 2: Knowledge Discovery with EDM
Methods and Algorithms of EDM
The field of EDM integrates methods, algorithms and techniques which can be used to perform different experiments and to design models. The output of the model implementation allows researchers to predict or obtain patterns from educational data. This paper presents and classifies the algorithms and tools used by researchers in this scientific field. A description of DM methods and a classification according to the type of method is presented in the following list (Siemens & Baker, 2012):
Prediction Methods: The goal of prediction is to design a model that infers some aspect of the data from combinations of other features of the data. For example, information on student dropouts can be collected and analyzed so that predictions can be made and corrective and preventive actions can be targeted at new students. There are three types of methods in this group: classification, regression and latent knowledge estimation.
Structural Discovery Algorithms: The aim is to find a structure in the data without a prior idea of what is to be found. The researcher's main task is to identify the natural structure of the data. This category includes clustering, factor analysis, social network analysis and discovery of domain structures.
Relationship Mining: The goal is to discover relationships between variables in a data set. This category includes association, correlation, sequential pattern mining and causal data mining.
Model Discovery: The results of a mining analysis are used as input for further analysis. Normally the model is obtained through prediction methods.
Another study (Kumar & Vijayalakshmi, 2011) presents EDM methods and their key areas of application, as shown in Table 1.

Table 1: Educational Data Mining Algorithms and Applications

| Method | Applications in EDM |
|---|---|
| Prediction | Detecting student behaviors; developing domain models; predicting and understanding student educational outcomes |
| Clustering | Discovery of new student behavior patterns; investigating similarities and differences between schools |
| Relationship Mining | Discovery of curricular associations in courses and sequences of courses |
| Discovery with models | Discovery of relationships between student behaviors, and student characteristics or contextual variables; analysis of research question across a wide variety of contexts |
| Distillation of Data for Human Judgment | Human identification of patterns in student learning, behavior, or collaboration; labeling data for use in later development of prediction models |

At the 2006 international conference on DM sponsored by the Institute of Electrical and Electronics Engineers (IEEE), a ranking of the eight best DM algorithms in the field was presented. These algorithms are:
Decision Trees: organize data into branches of influences for decision making. The tree trunk represents the initial decision, which starts with a yes or no question, for example whether a student will graduate or not. The next divergent branches are graduation and no graduation, and each further choice has its own divergent branches that lead to an end point.
K-Means Algorithm: is based on cluster analysis. It divides the collected data into clusters whose members share common characteristics.
Apriori Algorithm: normally works on transactional data. For example, the algorithm could predict which products might be bought together in a store.
EM Algorithm: defines parameters by analysing data and predicts the probability of a future output or a random event within the parameters of the data. For example, the EM algorithm could attempt to predict the time of the next eruption of a volcano by analysing the data of previous eruptions.
PageRank Algorithm: is a base algorithm for search engines. It ranks and estimates the relevance of a piece of data within a large data set. An example is the ranking of a web site within the set of all web sites on the internet.
AdaBoost Algorithm: anticipates behavior using observed data in order to be sensitive to statistical extremes.
Nearest Neighbor Algorithm: recognizes patterns in the location of data and associates the data with a bigger identifier.
Naive Bayes Algorithm: predicts the class of an entity based on the data of known observations. For example, if the height of a person is 1.90 meters and the shoe size is 14, the algorithm could predict with a certain probability that the person is a man.
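To make the decision tree idea more concrete, the following minimal sketch shows how a tree learner could choose its first yes/no question by computing information gain, the entropy-based criterion of ID3-style learners (C4.5, on which WEKA's J48 is based, refines it into a gain ratio). The student records and attribute names are invented for illustration and are not taken from the data set of this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical records (assumption): two yes/no attributes and one class.
students = [
    {"validated_courses": "yes", "has_grant": "yes", "graduated": "yes"},
    {"validated_courses": "yes", "has_grant": "no", "graduated": "yes"},
    {"validated_courses": "no", "has_grant": "yes", "graduated": "no"},
    {"validated_courses": "no", "has_grant": "no", "graduated": "no"},
]

candidates = ["validated_courses", "has_grant"]
gains = {a: information_gain(students, a, "graduated") for a in candidates}

# The attribute with the highest gain becomes the question at the tree trunk.
best_attribute = max(gains, key=gains.get)
```

In this toy data the first attribute separates the two classes completely, so it would be placed at the trunk of the tree, while the second attribute carries no information about the class.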
3. Methodology
In this study different DM experiments are carried out. The methodologies used in this research are KDD for the knowledge discovery process and CRISP-DM for the data mining process.
Knowledge Discovery in Databases
One of the methodologies used in this research is KDD. KDD is an approach to discover useful knowledge from a collection of data. This process is shown in Figure 3. It is composed of the following phases:
1. Data selection: selection of source data from operational databases and migration to a target data repository, in this case a data warehouse.
2. Data preprocessing: cleansing and preprocessing of the data by deciding strategies to put the data in the right format, removing duplicates and handling missing fields.
3. Data transformation: creating data sets with the needed variables to reduce the complexity of the analysis.
4. Data mining: application of methods and algorithms to the data set in order to predict trends and discover patterns in the data.
5. Interpretation and evaluation: understanding of results and creation of explicit knowledge by means of visualization of data in reports and dashboards.
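The five KDD phases can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration under stated assumptions: the field names, the sample records and the trivial "mining" step (a dropout rate per group instead of a real classifier) are hypothetical and do not reproduce the process used in this study.

```python
def select(records, program):
    """1. Data selection: keep only the records of the program under study."""
    return [r for r in records if r["program"] == program]


def preprocess(records):
    """2. Preprocessing: drop duplicated student ids and rows with missing fields."""
    seen_ids, clean = set(), []
    for r in records:
        if r["student_id"] in seen_ids or any(v is None for v in r.values()):
            continue
        seen_ids.add(r["student_id"])
        clean.append(r)
    return clean


def transform(records):
    """3. Transformation: keep only the needed variables, encoded as 0/1."""
    return [
        {"first_semester": int(r["first_semester"] == "yes"),
         "dropout": int(r["dropout"] == "yes")}
        for r in records
    ]


def mine(dataset):
    """4. Data mining: a trivial stand-in pattern, the dropout rate per group."""
    rates = {}
    for group in (0, 1):
        rows = [d for d in dataset if d["first_semester"] == group]
        if rows:
            rates[group] = sum(d["dropout"] for d in rows) / len(rows)
    return rates


def interpret(rates):
    """5. Interpretation: turn the pattern into a readable statement."""
    names = {1: "entered in first semester", 0: "entered validating subjects"}
    return {names[g]: f"{rate:.0%} dropout" for g, rate in rates.items()}


# Invented sample records (assumption), including a duplicate and a missing field.
raw = [
    {"student_id": 1, "program": "computer science", "first_semester": "yes", "dropout": "yes"},
    {"student_id": 1, "program": "computer science", "first_semester": "yes", "dropout": "yes"},
    {"student_id": 2, "program": "computer science", "first_semester": "no", "dropout": "no"},
    {"student_id": 3, "program": "computer science", "first_semester": None, "dropout": "no"},
    {"student_id": 4, "program": "law", "first_semester": "yes", "dropout": "no"},
    {"student_id": 5, "program": "computer science", "first_semester": "no", "dropout": "yes"},
]

rates = mine(transform(preprocess(select(raw, "computer science"))))
report = interpret(rates)
```

Keeping each phase as a separate function mirrors the KDD idea that the output of one phase (target data, preprocessed data, transformed data, patterns) is an explicit artifact that can be inspected before moving on.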
[Figure 3 labels: Data, Selection, Target Data, Preprocessing, Preprocessed Data, Transformation, Transformed Data, Data Mining, Patterns, Interpretation/Evaluation, Knowledge]
Figure 3: Knowledge Discovery in Data Bases (KDD)
CRISP-DM
The implementation of experiments with EDM in this study was performed following the CRISP-DM phases. These phases are:
1. Understanding the business: in this step the analyst should understand the vision and goals of the business and how the DM project will benefit the organization. In this phase it is important to have the requirements of the project clearly identified.
2. Understanding data: it is important to identify the data tables and fields that are going to be the subject of analysis.
3. Preparing data: the required data must be migrated to a data set that will be used in the analysis. A data cleansing process must be performed during this transformation.
4. Creating models: the model that will be used for the analysis must be planned and designed.
5. Evaluation: this is the experimentation phase, which requires the selection of algorithms and tools. The output of this phase is the knowledge created with the existing model.
6. Deployment: in this phase the results are presented. If the results are not relevant to the requirements, a new model should be planned.
4. Approach and Findings
This section shows the development of the experiments using the KDD methodology. It analyzes the academic information of an HEI. The primary analysis is carried out to discover and predict trends in two indicators: student dropout and graduation rate. The data analyzed was limited to records of computer science students. The period covered by the DM analysis was from 2002 to 2015. There are two groups for data analysis: students who entered the university from the first semester and students who entered the university validating subjects. The data source contains information collected from the academic system of the university when students enter the institution (personal and educational data) and data collected during their studies. Data selection and preprocessing are performed using different criteria for the representation and application of classification algorithms such as decision trees, Bayesian networks and decision rules. Two influential classes are defined for the analysis: graduation and desertion of students. These indicators are evaluated according to enrollment data; for example, if a student does not enroll for more than two consecutive semesters, the student is considered to have deserted, as indicated in the regulation of the board of academic evaluation, accreditation and quality assurance of higher education in the country of the university. This evaluation board is a public entity that carries out continuous evaluation and accreditation processes. The detailed process of this research is shown in the following KDD steps.
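The desertion rule described above (no enrollment for more than two consecutive semesters) can be expressed as a small labeling function. This is a sketch under assumptions: the enrollment history is represented as a chronological list of booleans, one per semester, and the history of a graduated student is assumed to be cut off at graduation by the caller. The function and parameter names are hypothetical.

```python
def is_dropout(enrolled_by_semester, max_gap=2):
    """Return True if the student missed more than max_gap consecutive semesters.

    enrolled_by_semester: chronological list of booleans (assumed format),
    True when the student enrolled in that semester.
    """
    gap = 0
    for enrolled in enrolled_by_semester:
        gap = 0 if enrolled else gap + 1
        if gap > max_gap:
            return True
    return False


# Two missed semesters followed by a return: not yet desertion under the rule.
returning_student = is_dropout([True, False, False, True])

# Three missed semesters in a row: labeled as desertion.
deserting_student = is_dropout([True, True, False, False, False])
```

A label computed this way could serve as the class attribute for the classification experiments, with `max_gap` left as a parameter in case the regulation changes.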
Data Selection
The data used for this analysis was previously collected in a data warehouse (DW) of the HEI. Data mining could be done directly on this data repository; however, structuring the views for the analysis of the data became increasingly complex. To make the process efficient, a table external to the DW was created with all the fields necessary for the analysis of student dropout and graduation rate. In this step the data in the DW from the year 2002 to the year 2015 was analyzed. The focus group for the analysis was the students of the computer science school. The total number of students in this group was 441. Of these students, 330 entered the university from the first semester, and the remaining 111 entered the school validating subjects.
Data Preprocessing
Once the information of the focus group was available, the data was cleaned by verifying formats and checking that the information was correct. This process was performed using IT tools for data migration.
Data transformation
In order to create the data set, an ETL process was performed. Data was extracted from the data warehouse, with different dimension and fact tables as the source. The transformation was a relatively simple activity because a transformation process had already been performed for the creation of the DW. However, different SQL operations (aggregation and normalization) were executed on the data in order to structure the data set fields with yes, no or 1, 0 values and so facilitate further analysis. Once the data was transformed, it was loaded into the newly created table (data set). The data set was the main input of DM in the different analysis tools used in this investigation (See Figure 4).
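The recoding of fields into yes/no or 1/0 values can be illustrated with a short sketch. The study performed this step with SQL operations on the data warehouse; the Python version below is only an equivalent illustration, and the column names and rows are hypothetical. It also serializes the result as CSV text, a format that tools such as WEKA can load.

```python
import csv
import io

YES_NO = {"yes": 1, "no": 0}


def encode_row(row, binary_fields):
    """Return a copy of the row with the given yes/no fields mapped to 1/0."""
    encoded = dict(row)
    for field in binary_fields:
        encoded[field] = YES_NO[str(row[field]).strip().lower()]
    return encoded


def to_csv(rows, fieldnames):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented rows (assumption); real field names are not given in the paper.
rows = [
    {"student_id": 1, "has_grant": "Yes", "graduated": "no"},
    {"student_id": 2, "has_grant": "no", "graduated": "YES"},
]

encoded_rows = [encode_row(r, ["has_grant", "graduated"]) for r in rows]
csv_text = to_csv(encoded_rows, ["student_id", "has_grant", "graduated"])
```

Normalizing case and whitespace before the lookup makes the recoding fail loudly (with a `KeyError`) on unexpected values instead of silently producing a wrong code.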
Figure 4: Dataset for analysis
Data mining
The DM process starts with the creation of a .CSV file from the resulting data set. This file is the input for the three tools considered for the analysis: WEKA, Orange 3 and Rapid Miner. After a feasibility analysis of these tools, WEKA was selected as the most adequate for this experiment. As explained previously, it was stated at the IEEE conference of 2006 that one of the best methods for EDM is classification (decision trees, naïve Bayes, meta classifiers); therefore, this method is implemented in this analysis. Many algorithms are tested using supervised and unsupervised filters applied to the data set. WEKA offers different classifiers that can be chosen in the tool. In this study the following classifiers were used: Naïve Bayes, Stacking, One R, J48, and Random Tree. The precision of the best classifiers for each algorithm for retention rate is shown in Table 2.
Table 2: Comparison of different algorithms within retention rate dataset
| ALGORITHM | Classifier | Test Mode | Well ranked instances (%) | Badly ranked instances (%) | Kappa Index | Absolute Error |
|---|---|---|---|---|---|---|
| NAIVE BAYES | With 8 data set attributes | Percentage division | 71.33% | 28.66% | 0.28 | 0.31% |
| STACKING | With 8 data set attributes | Percentage division | 72% | 28% | 0 | 0.40% |
| ONE R | With all data set attributes | Cross-validation | 72% | 28% | 0.02 | 0.28% |
| J48 | With data set attributes | Percentage division | 99.09% | 0.90% | 0.97 | 0.01% |
| RANDOM TREE | With all data set attributes | Cross-validation | 74% | 26% | 0.33 | 0.27% |
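The two test modes named in Table 2 differ in how the instances are divided between training and testing. The sketch below shows both on instance indices only. The split fraction and the number of folds are placeholder values, since the paper does not report the settings that were used, and no shuffling is applied in order to keep the example deterministic.

```python
def percentage_split(n_instances, train_fraction=0.66):
    """Percentage division: one training part and one held-out test part."""
    cut = int(n_instances * train_fraction)
    indices = list(range(n_instances))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_instances, k=10):
    """Cross-validation: every instance is used for testing exactly once."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 441 is the number of students in the focus group of this study.
train, test = percentage_split(441)   # 291 training and 150 test instances
splits = list(k_fold_splits(441))     # 10 (train, test) pairs
```

With percentage division the reported figures depend on a single held-out part, whereas cross-validation averages over all folds, so results obtained under the two modes are not directly comparable.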
In the different experiments, J48 (99.09% precision) was the algorithm that performed best for retention rate. In the case of graduation rate, the best algorithm was Random Tree (100% precision). In all cases the comparison was made using the percentage of correctly and incorrectly classified instances. Although the Random Tree algorithm gives the most precise classification for graduation rate, its decision tree is difficult to visualize. Therefore, we recommend the use of the J48 decision tree algorithm. One view of this analysis is shown in Figure 4. This figure shows a decision tree with a classification of students from the computer science school. The first branch of the analysis shows whether the students started their studies in the first level or started by validating courses.
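Table 2 reports a kappa index next to the percentage of correctly classified instances. The following sketch shows how both values could be derived from a confusion matrix using Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The confusion matrices are hypothetical, because the paper does not publish them; the first one is chosen only to reproduce the combination of 72% correct instances and a kappa of 0 that Table 2 lists for Stacking.

```python
def accuracy_and_kappa(confusion):
    """Accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i that were
    predicted as class j. Kappa is undefined when chance agreement equals 1.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical classifier that always predicts the majority class.
majority_only = [[72, 0],
                 [28, 0]]
acc_majority, kappa_majority = accuracy_and_kappa(majority_only)

# Hypothetical classifier that separates both classes reasonably well.
balanced = [[40, 10],
            [10, 40]]
acc_balanced, kappa_balanced = accuracy_and_kappa(balanced)
```

The first case illustrates why the percentage of correct instances alone can be misleading: a classifier can reach 72% by always predicting the larger class, and the kappa of 0 exposes that it does no better than chance. This may be one reason to read both columns of Table 2 together.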
Figure 4: Branch of J48 analysis for desertion rate
Other important results of this paper are the discoveries of knowledge from the experiments with the EDM algorithms. Some of the most important findings in this HEI are:
Graduation: the graduation rate is higher for students born and living in the same city as the institution, and higher for students who entered the school validating courses than for those who started from the beginning.
Student dropout: students who attended public high schools and students who are married have a higher risk of abandoning their studies. Students who enter with a grant and lose it have a higher risk of dropping out.
5. Conclusions
This paper presents a view of EDM and is intended as a source for researchers wishing to experiment with DM in the field of education in areas such as evaluation, enrollment, planning, student welfare and marketing. EDM is projected as an essential discipline for university management that gives visibility to managers to improve decision making. Using DM tools such as WEKA, an evaluation of EDM methods and algorithms was performed. This analysis shows that the decision tree algorithms J48 and Random Tree have better precision in classifying students in the analysis of student dropout and graduation rate. Some potential limitations of the work are that EDM is being used in higher education institutions to improve the teaching and learning process and to improve academic indicators. The knowledge created with these experiments allows the institution to improve the tutoring system and to identify students at risk of dropping out in early stages in order to implement corrective actions. It also gives information on the groups that are more likely to graduate and the groups that are not. With the output of the experiments, new strategies can be implemented to improve academic indicators.
References
Baepler, P., & Murdoch, C. J. (2010). Academic Analytics and Data Mining in Higher Education. International Journal for the Scholarship of Teaching and Learning, 4(2), 1–9.
Bichsel, J. (2012). Analytics in Higher Education: Benefits, Barriers, Progress and Recommendations. Retrieved May 3, 2016, from https://net.educause.edu/ir/library/pdf/ERS1207/ers1207.pdf
Bienkowski, M., Feng, M., & Means, B. (2012). Enhancing teaching and learning through educational data mining and learning analytics: An issue brief. Department of Education’s (ED) Office of Educational Technology, 1–57.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0. CRISP-DM Consortium, 76.
Huebner, R. A. (2013). A Survey of Educational Data-Mining Research. Research in Higher Education Journal, 1–13.
International Society Educational Data Mining. (2015). Educational Data Mining. Retrieved June 29, 2015, from http://www.educationaldatamining.org/
Kumar, S. A., & Vijayalakshmi, M. N. (2011). A Novel Approach in Data Mining Techniques for Educational Data. 3rd International Conference on Machine Learning and Computing (ICMLC 2011), 152–154.
Morris, L., Wu, S.-S., & Finnegan, C. (2015). Predicting retention in online general education courses. American Journal of Distance Education, 19(1), 23–36.
Moscoso-Zea, O., Sampedro, A., & Luján-Mora, S. (2016). Datawarehouse design for Educational Datamining. In Information Technology Based Higher Education and Training ITHET, 1–6. Istanbul - Turkey.
Siemens, G., & Baker, R. S. J. D. (2012). Learning analytics and educational data mining. Proceedings of the 2nd International Conference on Learning Analytics and Knowledge - LAK ’12, 252.
Ventura, S., & Romero, C. (2013). Data mining in education. WIREs Data Mining Knowledge Discovery, (August), 12–27.