Fast Boost Decision Tree Algorithm: A novel classifier ...

ISSN:0254-0223

Vol. 31 (n. 11, 2016)

Fast Boost Decision Tree Algorithm: A novel classifier for the assessment of student performance in Educational data

C. Anuradha1, T. Velmurugan2

1 Research Scholar, Bharathiar University, Coimbatore, India.
2 Associate Professor, PG and Research Dept. of Computer Science, D.G.Vaishnav College, Chennai-106, India.
[email protected]; [email protected]

Abstract: Educational data mining (EDM) is an emerging research field that mines large data sets in order to answer educational research questions and shed light on the learning process. EDM is concerned with developing methods for exploring the unique and increasingly large-scale data that come from educational settings, and with using those methods to better understand students and the settings in which they learn. The primary objective of this research work is to compare the performance of various classification techniques in predicting students’ performance in the end semester examination. A new system, the Fast Boost Decision Tree algorithm, is proposed to improve on the accuracy of existing classification algorithms and to speed up the prediction of students’ results. The proposed system classifies the data and constructs the decision tree by employing a top-down, line search that tests every attribute in the data set used for classification. The dataset for the study was collected from student performance reports from various private colleges in Tamil Nadu state, India. The collected data set has been preprocessed and verified to enhance performance in terms of accuracy. The effectiveness of various classification algorithms is compared with that of the proposed system, and the results are discussed.

Keywords: Educational data mining, Student Performance, Classification Algorithms, Fast Boost Decision Tree Algorithm.

1. INTRODUCTION

Data mining techniques have become the de facto standard in the majority of research studies for analyzing large volumes of data and extracting useful information and knowledge to support decision-making. Various studies, including a previous study by the authors, have shown that data mining techniques find widespread application in the educational decision-making process for improving the performance of students in higher educational institutions. Educational data mining is developing rapidly as a key technique for analyzing data generated in educational settings. Classification techniques are of significant importance among machine learning tasks and are mostly employed in prediction problems. In machine learning problems, feature selection techniques are used to reduce the number of attributes by removing redundant and irrelevant features from the dataset. Classification is among the most useful data mining techniques in e-learning. It is a predictive data mining technique that makes predictions about data values using known results found from different data. Classification techniques are used here to predict student performance. The decision tree algorithm has been used successfully to capture knowledge and is a powerful method of inferring classification models from a set of labeled examples [1]. Recently, there has been increasing interest in combining classifiers to improve on the performance of individual classifiers. Integrating algorithms in this way is intended to make decisions more accurate, reliable and precise. One mechanism for building an ensemble of classifiers is to use different learning methods [2]. An ensemble of classifiers is a set of classifiers whose individual decisions are combined to improve the performance of the overall system.
The most popular techniques for constructing ensembles are Bagging and Boosting. These two techniques manipulate the training examples to generate multiple classifiers [3] [4].
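
As a rough illustration of how these two techniques manipulate the training examples, the Python sketch below draws a bootstrap sample (bagging) and performs one AdaBoost-style reweighting round (boosting). It is a generic textbook sketch, not the proposed Fast Boost Decision Tree algorithm; the function names `bootstrap_sample` and `adaboost_reweight` and the toy inputs are assumptions made for illustration.

```python
import math
import random


def bootstrap_sample(examples, rng):
    """Bagging: draw len(examples) items uniformly with replacement."""
    return [rng.choice(examples) for _ in examples]


def adaboost_reweight(weights, correct):
    """One AdaBoost-style round of example reweighting.

    weights -- current example weights (assumed to sum to 1)
    correct -- booleans, True where the weak classifier was right
    Returns (alpha, new_weights), where alpha is the classifier's vote.
    """
    # Weighted error of the weak classifier on the training examples.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Clamp to avoid division by zero or log(0) for perfect/useless classifiers.
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correctly classified examples, grow the others.
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen only for repeatability
    print(bootstrap_sample(["s1", "s2", "s3", "s4"], rng))
    print(adaboost_reweight([0.25] * 4, [True, True, True, False]))
```

Bagging trains each classifier on a different resampled copy of the data, whereas boosting keeps the data fixed and shifts weight toward the examples the previous classifier got wrong. After one reweighting round, the misclassified examples carry half of the total weight.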

To analyze the students’ performance, the marks scored and the academic performance in each subject are extracted. Classification methods are then applied to separate the weak students from the strong students. This helps instructors to concentrate on the problem areas and to make changes in the course of study, and allows more attention to be given to the lowest performing students. The problem addressed by this research is to find a way to improve the performance of Arts and Science college students in the academic process. This work discusses classification techniques such as C4.5, IBK, J48, KStar, Multilayer Perceptron, Naive Bayes, NNge, PART, PRISM, Random Forest, Simple Cart, Simple Logistic, SMO and ZeroR. The performance of these classification methods is compared in terms of accuracy. The main goal of this research is to develop a data mining based decision-making approach that evaluates the performance of Arts and Science college students and predicts their performance in their courses accurately. We developed a novel Students’ Performance Prediction System that is able to predict students’ performance with high accuracy based on their academic results. The primary objectives of this research work are as follows.

• Find ways to collect data about Arts and Science college students in order to predict their performance.
• Define a process to integrate and prepare the collected data.
• Test the viability of the proposed approach with Arts and Science college students from different disciplines.
• Use data mining to predict the performance of Arts and Science college students.
• Use a novel approach to assess the performance of Arts and Science college students.

In this paper, the Fast Boost Decision Tree (FBDT) algorithm is proposed to predict student performance more accurately than the traditional classification algorithms. To collect student data, a well-structured questionnaire was developed. The dataset for the study was collected from the student performance reports of private colleges in Tamil Nadu state, India. In total, 20224 responses were received from students of these colleges. The classification technique splits the dataset in order to sort out the variation in performance. The results identified yield interesting patterns, and knowledge is acquired from these patterns. The paper is structured as follows. Section 2 reviews the most cited literature in educational data mining. Section 3 describes the materials and methods of the study along with the proposed approach. The experimental evaluation and comparative analysis are given in Section 4. The conclusion of the proposed work is given in Section 5, followed by the references.

2. RELATED WORK

Hythem Hashim et al. discussed data mining methodologies for studying students’ academic performance using the C4.5 algorithm. The objective of their paper is to build a classification model that can be used to improve students’ academic records in the Faculty of Mathematical Science and Statistics. The model was built using C4.5 to predict student performance in many different settings [5]. Ashish et al. reviewed clustering algorithms applied in educational data mining. Their paper addresses the question of how a higher educational institution can harness the power of the massive amount of student data
for its strategic use. This difficult task has been accomplished using various data mining approaches such as clustering, classification and prediction algorithms [6]. Anjana and Jeena discussed predicting college student dropout using EDM techniques. They used the WEKA tool to evaluate the attributes, applied classification techniques such as induction rules and decision trees to the data, and compared the results of these approaches [7]. The paper titled “Performance Analysis and Prediction in Educational Data Mining: A Research Travelogue” by Pooja et al. examines the use of data mining techniques in the field of education and presents a comprehensive survey of educational data mining and its future scope [8]. Arora et al. describe the process of finding the set of weak students based on graduation and post-graduation marks [9]. A recent paper published by Elsevier, “Educational Data Mining: A Survey and a Data Mining-Based Analysis of Recent Works” by Alejandro Peña-Ayala, surveyed papers published from 2010 to 2013 and divided educational data mining approaches by kinds of educational systems, disciplines, tasks, methods and algorithms. The author identified that each educational data mining approach can be organized according to six functionalities: student modeling, student behavior modeling, assessment, student performance modeling, student support and feedback, and curriculum-domain knowledge-sequencing, mostly focusing on academic performance [10]. Komal and Supriya [11] conducted a survey on mining educational data to forecast the failure of engineering students. Their paper reviews the available literature on educational data mining, classification methods and the different feature selection techniques that should be applied to a student dataset.
The research paper titled “Improvement on Classification Models of Multiple Classes through Effectual Processes” by Tarik [12] focuses on improving the results of multi-class classification models through several effective techniques. The collected data are pre-processed, cleaned, filtered and normalized, and the final data are balanced and randomized; a combination of the Naïve Bayes classifier and the Best First Search algorithm is then used to reduce the number of features in the data sets. Finally, a multi-class classification task is carried out with classifiers such as K-Nearest Neighbor, Radial Basis Function and Artificial Neural Network to forecast the students’ performance. Sadaf and Kulkarni discussed the precognition of students’ academic failure using data mining techniques. Their paper proposes to recognize students’ academic failure in advance using various data mining techniques, in particular induction rules, decision trees and naïve Bayes [13]. Al-Radaideh et al. [14] applied classification data mining techniques to improve the quality of higher education by evaluating the main attributes of students that affect their performance. Their study was used to predict a student’s final grade in a course. Ayesha et al. [15] performed a study on student learning behavior, examining factors such as class quizzes, mid-term and final exams, and assignments. This study is intended to help tutors reduce the dropout ratio and improve the performance level of students. Bharadwaj and Pal [16] used the decision tree method for classification to evaluate students’ performance. The objective of their study is to discover knowledge that describes students’ performance in the end semester examination.
This study was quite useful for identifying, at an early stage, students likely to drop out and students who need special attention, allowing the teacher to provide appropriate advising. Bharadwaj and Pal [17] conducted a study on student performance by selecting 300 students from 5 different degree colleges conducting the BCA (Bachelor of Computer Application) course of Dr. R. M. L. Awadh University, Faizabad, India. By means of the Bayesian classification method on 17 attributes, it was found that factors such as students’ grade in the senior secondary exam,
living location, medium of teaching, mother’s qualification, students’ other habits, family annual income and family status were highly correlated with student academic performance. Bray [18], in his study on private tutoring and its implications, observed that the percentage of students receiving private tutoring in India was relatively higher than in Malaysia, Singapore, Japan, China and Sri Lanka. It was also observed that academic performance improved with the intensity of private tutoring, and that this intensity depends on collective factors, namely socio-economic conditions. Chandra and Nandhini [19] used association rule mining to identify students’ failure patterns. The main objective of their study is to identify hidden relationships between the failed courses and to suggest relevant causes of failure in order to improve the performance of low-capacity students. Fadzilah and Abdullah [20] applied data mining techniques to enrollment data, using both descriptive and predictive approaches. Cluster analysis was used to group the data into clusters based on their similarities. For predictive analysis, Neural Network, Logistic Regression and Decision Tree were used. After evaluating these techniques, the Neural Network classifier was found to give the highest classification accuracy. Hijazi and Naqvi [21] conducted a study on student performance by selecting a sample of 300 students (225 males, 75 females) from a group of colleges affiliated to Punjab University of Pakistan. The hypothesis framed was: “Student's attitude towards attendance in class, hours spent in study on daily basis after college, students family income, students mother's age and mother's education are significantly related with student performance”.
By means of simple linear regression analysis, it was found that factors such as mother’s education and student’s family income were highly correlated with student academic performance. Pandey and Pal [22] conducted a study on student performance by selecting 60 students from a degree college of Dr. R. M. L. Awadh University, Faizabad, India. By means of association rules, they found the interestingness of students in opting for the class teaching language. Pathom et al. [23] proposed a classifier algorithm for building a Course Registration Planning Model (CRPM) from a historical dataset. The algorithm was selected by comparing the performance of four classifiers: Bayesian Network, C4.5, Decision Forest and NBTree. The dataset was obtained from student enrollments, including the grade point average (GPA) and grades of undergraduate students. NBTree was the best of the four classifiers and was used to generate the CRPM, which can be used to predict a student’s GPA class and to consider student course sequences for registration planning. Ramasubramanian et al. [24] predicted aspects of higher education students. Their paper analyzes one of the biggest challenges that higher education faces today: institutions would like to know something about the performance of students group-wise. The authors proposed to investigate the performance of students given the large database of a Student Information System (SIS). Generally, students’ problems are classified into different patterns based on the level of the students, such as normal, average and below average. They analyze the SIS database using rough set theory to predict the future of students. Shaeela Ayesha et al. [25] built a data mining model for the higher education system.
The authors applied K-means clustering to analyze the learning behavior of students, which helps the tutor to improve student performance and reduce the dropout ratio to a significant level. Shannaq et al. [26] applied classification as a data mining technique to predict the number of enrolled students by evaluating academic data from enrolled students in order to study the main attributes that may affect students’ loyalty (number of enrolled
students). The extracted classification rules are based on the decision tree as the classification method, and are studied and evaluated using different evaluation methods. This allows university management to prepare the necessary resources for newly enrolled students, and indicates at an early stage which type of students will potentially be enrolled and which areas to concentrate on in higher education systems for support. Waraporn [27] presented the use of data mining techniques, particularly classification, to support high school students in selecting undergraduate programs. Waraporn proposed a classification model to give students guidelines on undergraduate programs so that they can make better academic plans. The decision tree technique was applied to determine which major is best suited to a student. Komal and Supriya [28] conducted a survey on mining educational data to forecast the failure of engineering students. Their paper reviews the available literature on educational data mining, classification methods and the different feature selection techniques that should be applied to a student dataset. Anuradha and Velmurugan [29] applied classification techniques, namely the decision tree algorithm C4.5 (J48), Bayesian classifiers, the k-Nearest Neighbor algorithm and two rule learner algorithms, OneR and JRip, to classify the performance of students and to develop a model of student performance predictors. The classifiers achieved an accuracy of 60%. JRip produced the highest classification accuracy for the Distinction class. Classifying the students based on the attributes revealed that prediction results are not uniform among the classification algorithms.

3. MATERIALS AND METHODS

In this research work, a new method called the Fast Boost Decision Tree algorithm is proposed to evaluate students’ performance.
The methodology starts with data collection through a questionnaire, followed by data preprocessing, the statement of the problem and the MATLAB implementation. The proposed algorithm is compared with various efficient classification algorithms in terms of accuracy.

3.1 Description of Data set

The dataset used in this study for academic performance analysis was taken from students of courses offered by several Arts and Science colleges during the period between 2015 and 2016. In total, data on 20224 students were collected through a questionnaire and entered in MS Excel. Students’ personal and academic details, along with their performance, were collected from the student information system. The target variable was “Student End Semester Marks” (ESM), which was in numeric form as a percentage. It was discretized using pre-processing filters into 4 categories: First Class (score > 60%), Second Class (45 - 60%), Third Class (36 - 45%) and Fail (< 36%). The collected information was integrated into a single table in MS Excel. The student dataset contains attributes such as student’s sex, branch of study, student’s category, grade in High School, grade in Senior Secondary, medium of instruction, living location of the student, accommodation (hostel or day scholar), family size, family type, family annual income, qualification of parents, Previous Semester Mark, Class Test Grade, Seminar Performance, Assignment, General Proficiency, Attendance, Lab Work and End Semester Marks. The influencing attributes are selected and used to classify and predict student performance using MATLAB. A detailed description of the dataset is provided in Table 1.
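
The discretization of ESM into four categories can be sketched as a small Python function. This is only an illustration: the study itself uses MATLAB and pre-processing filters, the function name `esm_category` is hypothetical, and because the stated ranges share their end points, assigning exactly 60% and 45% to Second Class and exactly 36% to Third Class is an assumption.

```python
def esm_category(percentage):
    """Map an End Semester Marks percentage to one of four class labels.

    Thresholds follow the ranges given in the text; the handling of the
    boundary values 60, 45 and 36 is an assumption, not stated in the paper.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percentage > 60:
        return "First Class"
    if percentage >= 45:
        return "Second Class"
    if percentage >= 36:
        return "Third Class"
    return "Fail"


if __name__ == "__main__":
    # Hypothetical marks, used only to exercise each branch.
    for marks in (82.5, 60, 45, 40, 35.9):
        print(marks, esm_category(marks))
```

Keeping the thresholds in one function makes it easy to apply the same binning to other percentage attributes if they use the same class boundaries.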

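Table 1: Description of the attributes used for Classification

Variable | Description | Possible Values
Gender | Students Sex | {Male, Female}
Branch | Students Branch | {BCA, B.SC, B.COM, B.A}
Cat | Students category | {BC, MBC, MSC, OC, SBC, SC}
HSG | Students grade in High School | {O – 90% - 100%, A – 80% - 89%, B – 70% - 79%, C – 60% - 69%, D – 50% - 59%, E – 35% - 49%, FAIL}
SSG | Students grade in Senior Secondary |
Medium | Medium of instruction |
LLoc | Living Location of Student |
HOS | Accommodation facility (hostel or day scholar) |
FSize | Student’s family size |
FType | Family type |
FINC | Family annual income |
FQual | Father’s Qualification |
MQual | Mother’s Qualification |
PSM | Previous Semester Mark |
CTG | Class Test Grade |
SEM_P | Seminar Performance |
ASS | Assignment |
GP | General Proficiency |
ATT | Attendance |
LW | Lab Work |
ESM | End Semester Marks | {First > 60%, Second 45 - 60%, Third 36 - 45%, Fail < 36%}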
