2014 Fourth International Conference on Advanced Computing & Communication Technologies
Mining Students’ Data for Performance Prediction

Tripti Mishra
Research Scholar, Mewar University

Dr. Dharminder Kumar
Professor & Chairman, CSE Department, G.J. University, Hisar

Dr. Sangeeta Gupta
Professor, Management and IT, Bhagwan Parshuram Institute of Technology, Delhi

978-1-4799-4910-6/14 $31.00 © 2014 IEEE DOI 10.1109/ACCT.2014.105

Abstract- A country’s growth is strongly measured by the quality of its education system. The education sector across the globe has witnessed a sea change in its functioning. Today it is recognized as an industry and, like any other industry, it faces challenges; the major challenges of higher education are the decrease in students’ success rate and students leaving a course without completing it. An early prediction of students’ failure may help the management provide timely counseling as well as coaching to increase the success rate and student retention. We use different classification techniques to build a performance prediction model based on students’ social integration, academic integration, and various emotional skills, which have not been considered so far. Two algorithms, J48 (an implementation of C4.5) and Random Tree, have been applied to the records of MCA students of colleges affiliated to Guru Gobind Singh Indraprastha University to predict third semester performance. Random Tree is found to be more accurate in predicting performance than the J48 algorithm.

Keywords- classification, data mining, prediction

I. INTRODUCTION

Primary education adds to a nation’s literacy rate, but higher education has a direct impact on the work force provided to industry and hence directly affects the economy. Many institutions of higher learning have been set up across India. However, the quality of education is judged by the success rate of students and by the extent to which an institute is capable of retaining its students. Predicting students’ performance can help identify the students who are at risk of failure, so that management can provide timely help and take the steps needed to coach these students to improve their performance.

Data mining techniques have been applied to predict the academic performance of students based on their socio-economic condition and previous academic performance. This paper explores the link between students’ emotional skills, together with socio-economic and previous academic performance parameters, and academic performance, using data mining techniques. Emotional skills such as assertion, leadership, and stress management are obtained using the standard Emotional Skills Assessment Process (ESAP).

Data mining tasks can be either descriptive or predictive. Descriptive data mining uses techniques such as association rule mining and clustering to find patterns hidden in large data sets and to support intelligent decision making. Predictive data mining constructs models using rule sets, decision trees, neural networks, support vector machines, etc. to predict the class of new data. The objective of this paper is to predict the third semester performance of MCA students. The rationale for considering the third semester is the observation that most students who drop out of the course do so after the first year, and that students normally take a year to get integrated into an institute’s academic environment. Two decision tree algorithms, J48 and Random Tree, have been used to build the model, and the main contribution of this paper is the model comparison along with finding the impact of various attributes on students’ performance.

The remainder of this paper is organized as follows. Section 2 discusses previous work, followed by the experimental setting in Section 3. Section 4 presents the results, and conclusions are discussed in Section 5.

II. LITERATURE SURVEY

The most cited literature survey papers in educational data mining are by Romero and Ventura [4], Baker and Yacef [16], and Romero and Ventura [5], which indicate performance prediction as one of the emerging fields of educational data mining.

Paris, Affendey and Mustapha [9] have used various subject performance attributes to predict the final CGPA of bachelor of computer science students of a Malaysian university. Various Bayesian classification techniques have been used, and a comparative study suggests that the ensemble method gives the best overall accuracy.
Osmanbegovic and Suljic [7] have considered students’ attitude towards studying, in addition to demographic variables and scores earned at high school, to predict the grade in Business Information for first year students. The entrance exam score, study material, and average weekly hours devoted to studying have been found to have the maximum impact, while the number of household members, distance of residence, and gender have been found to have the least impact. Naïve Bayes is found to be a better classifier than J48.
Nghe, Janecek and Haddawy [17] have considered two diverse populations, the large Can Tho University of Vietnam and the Asian Institute of Technology, a small international postgraduate institute, and achieved similar levels of prediction accuracy for both populations.
The Chi-squared Automatic Interaction Detector (CHAID) has been used by Ramaswami and Bhaskaran [10] to classify 12th class students of select Tamil Nadu schools. Apart from demographic details, students’ health, tuition, care of study at home, etc. have been studied. The prediction accuracy obtained was 44.69%, and the most influential variables were found to be Xth grade marks, location of school, private tuition, etc.
Cheewaprakobkit [14] considered 1600 student records between 2001 and 2011 at a university in Thailand and applied decision tree and neural network classifiers to find the most important factors affecting students’ academic achievement. The decision tree proved to be a better classifier than the neural network, with 1.31% higher accuracy. The number of hours worked per semester, an additional English course, the number of credits enrolled per semester, and the marital status of the students are the major factors affecting performance.
Baradwaj and Pal [2] base their experiment only on previous semester marks, class test grade, seminar performance, assignment, attendance, and lab work to predict end semester marks. Records of 50 MCA students of the 2007 to 2010 session at Purvanchal University were considered. The paper calculates the split information and gain ratio of each predictor and produces prediction rules.
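For reference, split information and gain ratio are the attribute selection measures used by C4.5 (and therefore by J48). The following is a minimal sketch of how they can be computed for one nominal predictor; the function names and the four-record toy data are hypothetical and are not taken from [2] or from the data set of this paper.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of one nominal attribute with respect to the class labels."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy after splitting on the attribute
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: attendance separates the classes perfectly,
# while section carries no information about the result.
result = ["Pass", "Pass", "Fail", "Fail"]
attendance = ["Good", "Good", "Poor", "Poor"]
section = ["A", "B", "A", "B"]

ratio_attendance = gain_ratio(attendance, result)
ratio_section = gain_ratio(section, result)
```

Dividing the information gain by the split information penalizes attributes with many distinct values, which would otherwise be favoured by information gain alone.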
The importance of 24 predictor variables, including demography, scores in maths, Turkish, religion and ethics, science and technology, and level determination exams, has been ranked by Sen, Uçar, and Delen [1] for predicting Turkish secondary education placement results. Application of Artificial Neural Network, Support Vector Machine, Multiple Regression, and Decision Tree models indicated that the most important predictor variables are the determination exam, scholarship, number of siblings, success level in Turkish language, etc.
N. S. Shah [13] has applied various decision tree algorithms (C4.5, Random Forest, BF Tree, REP Tree), functions (Logistic, RBF Network), rules (JRip), and Bayes Net and Naive Bayes to categorize students of the BBA program of the University of Karachi. Out of 42 independent variables, the 5 best variables having the highest effect in determining performance are considered. Random Forest proved to be a more accurate classifier than the J48 decision tree, BF Tree, REP Tree, and JRip rule learner.
Wook et al. [11] have considered a few personality traits such as motivation to study, interests, and learning environment, along with demographic details and previous academic performance, to predict the CGPA of computer science graduates and finally to identify students who are at risk of failing.

Bidgoli, Koshy, Kortemeyer and Punch [3] applied tree classifiers as well as non-tree classifiers to predict the grades of students enrolled in the online education system Learning Online Network with Computer-Assisted Personalized Approach (LON-CAPA), developed at Michigan State University, and found that a combination of multiple classifiers enhanced the prediction accuracy.
Kabakchieva [6] applies data mining classification techniques to 10330 students with 14 attributes, including personal profile, secondary education score, entrance exam score, admission year, etc. The students are classified into five categories: excellent, very good, good, average, and bad. 10-fold cross-validation and percentage split are used for all the classifiers: J48, Bayesian, K-nearest neighbor, OneR, and JRip. J48 has been found to be the most suitable of all the classifiers.
Kabra and Bichkar [15] experimented with 346 first year students of an engineering college, collecting their demographic data (category, gender, etc.), past performance data (SSC or 10th marks, HSC or 10+2 exam marks, etc.), address, and contact number to predict whether a student will PASS, FAIL, or get promoted (when the student fails in 3 theory and 2 practical subjects). The J48 algorithm in WEKA produces a prediction model with an accuracy of 60.46%. The most important attribute in predicting a student’s performance is found to be HSCCET. Social attributes such as category, parents’ occupation, and living location, and other attributes such as gender and medium at secondary level, are found to be less relevant.
Kovačić [18] has explored students dropping out of an open polytechnic of New Zealand due to failure. Enrollment data consisting of socio-demographic variables (age, gender, ethnicity, education, work status, and disability) and study environment (course programme and course block) of 435 students of the polytechnic’s Information System course was collected. A final label consisting of two categories, PASS (those who completed the course) and FAIL (those who did not complete it), was considered. Feature selection indicated that the most important attributes for prediction are ethnicity, course programme, and course block.
III. EXPERIMENTAL SETTING

The major objective of the proposed methodology is to build a classification model that classifies a student’s third semester performance as BAVG (=80%). The classifier has been built by following the Standard Process for Data Mining, which includes business understanding, data understanding, data preparation, modeling, and finally the application of a data mining technique, which is classification in the present study.

A. Data Understanding

The data of MCA students from various institutions affiliated to GGSIP University was collected through a structured questionnaire. A sample of 250 students was collected with 25 attributes covering academic integration, social integration, and emotional skills, as shown in Table I.

B. Data-Preprocessing

The collected data was saved as Excel spreadsheets. The cleaning process required eliminating records with missing values, correcting inconsistent data, identifying outliers, and removing duplicate data. Of the total of 250 instances in the raw data, 215 instances remained after the data cleaning process.

C. Modeling

The open source data mining tool Waikato Environment for Knowledge Analysis (WEKA) has been used for classification. WEKA provides built-in algorithms that can be applied to any data set.
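The cleaning steps described under Data-Preprocessing can be sketched as follows. This is only an illustration under stated assumptions: the paper does not describe how the cleaning was carried out, the respondent identifier `ID` and the function `clean_records` are hypothetical, and outlier identification is omitted.

```python
def clean_records(records, id_field, required_fields):
    """Normalize nominal values, drop incomplete records, drop duplicate IDs."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Correct inconsistent spelling and case, e.g. "english " -> "English"
        norm = {k: (v.strip().title() if isinstance(v, str) else v)
                for k, v in rec.items()}
        # Eliminate records with a missing value in any required attribute
        if any(norm.get(f) in (None, "") for f in required_fields):
            continue
        # Remove duplicates, identified here by the (hypothetical) respondent ID
        if norm[id_field] in seen_ids:
            continue
        seen_ids.add(norm[id_field])
        cleaned.append(norm)
    return cleaned

# Hypothetical raw questionnaire rows (not taken from the paper's data)
raw = [
    {"ID": 1, "GENDER": "Male", "MEDIUM": "English"},
    {"ID": 2, "GENDER": "female ", "MEDIUM": "english"},
    {"ID": 2, "GENDER": "Female", "MEDIUM": "English"},  # duplicate response
    {"ID": 3, "GENDER": "Male", "MEDIUM": ""},           # missing value
]
cleaned = clean_records(raw, id_field="ID", required_fields=["GENDER", "MEDIUM"])
```

In this toy run two of the four rows survive, in the same way that 215 of the 250 collected instances remained after cleaning.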
TABLE I. Attributes Description

Attribute Name | Values | Description
GENDER | Male, Female | Gender
FE | Midschool, Inter, Grad, Postgrad | Father’s Education
ME | Midschool, Inter, Grad, Postgrad | Mother’s Education
FO | Govtjob, Pvtjob, Business | Father’s Occupation
MO | Govtjob, Pvtjob, Business, Housewife | Mother’s Occupation
FI | MIG, HIG, LIG, VHIG | Annual Family Income
LOAN | Yes, No | Educational loan at any level of education
EARLYLIFE | Metro, City, Village | 15 years of life spent
MEDIUM | English, Other | Medium of instruction at school level
TENTH | BLAVG, AVG, ABAVG, EXCL | % marks in 10th
TWELVTH | BLAVG, AVG, ABAVG, EXCL | % marks in 12th
GRAD | BLAVG, AVG, ABAVG, EXCL | % marks in Graduation
FIRST_SEM | BLAVG, AVG, ABAVG, EXCL | % marks in 1st Semester of MCA
SECSEM | BLAVG, AVG, ABAVG, EXCL | % marks in 2nd Semester of MCA
THIRDSEM | BLAVG, AVG, ABAVG, EXCL | % marks in 3rd Semester of MCA
GRADDEGTYPE | Regular, Distance | Type of Graduation Degree
GRADDEGSTREAM | CS, NCS | Graduation Degree Stream
GAPYEAR | Yes, No | Gap year in education
ACADEMICHRS | INSUF, SUF, OPTIMAL | Hours spent on academic activities
ASSERTION | D, S, E* | Assertiveness of the student
EMPATHY | D, S, E | Empathy of the student
DECISIONMAKING | D, S, E | Decision making ability of the student
LEADERSHIP | D, S, E | Leadership ability of the student
DRIVE | D, S, E | Drive of the student
STRESSMGMT | D, S, E | Stress management skill of the student

*D, S, E represent need to develop, need to strengthen, and need to enhance, respectively.
D. Classification

Tree-based methods classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute [12]. J48 is a class for generating a pruned or unpruned C4.5 decision tree, while Random Tree constructs a tree that considers K randomly chosen attributes at each node without pruning. We have used cross-validation for testing, as it has been shown to be more suitable for a limited dataset and gives the best estimate of error [8].

IV. RESULT AND DISCUSSION

J48 and Random Tree were applied to the data set using 10-fold cross-validation. The summary and the rules obtained by J48 are listed in Fig. 1 and Table II, while the summary and rules of Random Tree are shown in Fig. 2 and Table III. The performance of the algorithms is evaluated on the basis of precision, recall, and true positive (TP) rate. Precision is defined as the number of correct positive predictions over the total number of positive predictions, and recall is defined as the number of correct positive predictions over the total number of positive cases. A high precision indicates that the algorithm returns more relevant results than irrelevant ones, and a high recall means that the algorithm returns most of the relevant results. The performance comparison of J48 and Random Tree is shown in Table IV.
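The 10-fold cross-validation used to evaluate both classifiers can be sketched as below. This is a simplified illustration, not WEKA’s own procedure: the helper `k_fold_indices` is hypothetical, and the folds are formed by a plain shuffle without stratifying by class.

```python
import random

def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i
                     for j in fold]
        yield train_idx, test_idx

# 215 is the number of instances left after cleaning
splits = list(k_fold_indices(215, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

Each instance is used for testing exactly once and for training in the other nine rounds, so the error estimate is based on all 215 instances even though the data set is small.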
Figure 1. J48 Result Summary
TABLE II. Rules Derived from J48

1. If (SECSEM = AVG) and (TWELFTH = AVG) and (MEDIUM = English) and (EARLYLIFE = Metro) and (ME = Grad): ABAVG
2. If (SECSEM = AVG) and (TWELFTH = AVG) and (MEDIUM = English) and (EARLYLIFE = City) and (GRAD = AVG): AVG
3. If (SECSEM = AVG) and (TWELFTH = EXCL): AVG
4. If (SECSEM = AVG) and (TWELFTH = ABAVG): AVG
5. If (SECSEM = AVG) and (TWELFTH = BAVG) and (LEADERSHIP = S) and (GRAD = BAVG): BAVG
6. If (SECSEM = AVG) and (TWELFTH = BAVG) and (LEADERSHIP = E): AVG
7. If (SECSEM = EXCL) and (GRADDEGTYPE = Regular): EXCL
8. If (SECSEM = ABAVG) and (FIRSTSEM = ABAVG) and (TENTH = ABVG): ABAVG
9. If (SECSEM = BAVG) and (ME = Inter): BAVG
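As an illustration of how such rules are applied to a new student record, the sketch below encodes four of the J48 rules from Table II (rules 3, 4, 7, and 9) as attribute-value conditions. It is a flat rule lookup, not the WEKA tree itself; the names `RULES` and `classify` and the example records are hypothetical.

```python
# Each rule: (conditions that must all hold, predicted THIRDSEM class)
RULES = [
    ({"SECSEM": "AVG", "TWELFTH": "EXCL"}, "AVG"),           # rule 3
    ({"SECSEM": "AVG", "TWELFTH": "ABAVG"}, "AVG"),          # rule 4
    ({"SECSEM": "EXCL", "GRADDEGTYPE": "Regular"}, "EXCL"),  # rule 7
    ({"SECSEM": "BAVG", "ME": "Inter"}, "BAVG"),             # rule 9
]

def classify(student, rules=RULES):
    """Return the class of the first rule whose conditions all match, else None."""
    for conditions, predicted in rules:
        if all(student.get(attr) == value for attr, value in conditions.items()):
            return predicted
    return None

# Hypothetical student records
prediction_a = classify({"SECSEM": "EXCL", "GRADDEGTYPE": "Regular", "ME": "Grad"})
prediction_b = classify({"SECSEM": "BAVG", "ME": "Inter"})
prediction_c = classify({"SECSEM": "ABAVG"})  # not covered by the four rules
```

Every rule begins with a test on SECSEM, which reflects that the second semester result sits at the root of the tree.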
Figure 2. Random Tree Result Summary
TABLE III. Rules Derived from Random Tree

1. If (SECSEM = BAVG) and (ASSERTION = D) and (FI = MIG): BAVG
2. If (SECSEM = ABAVG) and (LEADERSHIP = E): ABAVG
3. If (SECSEM = EXCL) and (FE = Grad): EXCL
4. If (SECSEM = AVG) and (FO = Govtjob) and (ACADEMICHRS = SUF) and (DRIVE = S): AVG
TABLE IV. Performance Comparison of J48 and Random Tree

Classifier | Measure | ABVG | EXCL | AVG | BAVG | Weighted Average
J48 | TP Rate | 0.792 | 1.000 | 0.900 | 0.923 | 0.884
J48 | Precision | 0.924 | 0.857 | 0.851 | 0.923 | 0.887
J48 | Recall | 0.792 | 1.000 | 0.900 | 0.923 | 0.884
Random Tree | TP Rate | 0.948 | 0.952 | 0.914 | 1.000 | 0.944
Random Tree | Precision | 0.924 | 0.952 | 0.970 | 0.929 | 0.945
Random Tree | Recall | 0.948 | 0.952 | 0.914 | 1.000 | 0.944

Classifier | Correctly Classified Instances | Incorrectly Classified Instances
J48 | 88.3721% | 11.627%
Random Tree | 94.4186% | 5.5814%
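The per-class measures reported in Table IV are derived from a confusion matrix. The sketch below shows the computation; the two-class matrix is invented purely for illustration and does not reproduce the results of this paper, and the function names are hypothetical. Since the TP rate of a class is the same quantity as its recall, the TP Rate and Recall rows of Table IV are identical.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of instances of actual class i predicted as class j."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: all true members
        predicted = sum(row[i] for row in matrix)  # column sum: all predictions
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0    # identical to the TP rate
        metrics[label] = (precision, recall)
    return metrics

def accuracy(matrix):
    """Fraction of correctly classified instances (diagonal over total)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical confusion matrix for two classes (rows: actual, columns: predicted)
labels = ["BAVG", "AVG"]
matrix = [[8, 2],
          [1, 9]]
metrics = per_class_metrics(matrix, labels)
overall = accuracy(matrix)
```

Here BAVG has precision 8/9 and recall 8/10, and 17 of the 20 instances are classified correctly.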
The performance of both algorithms is satisfactory; however, Random Tree attained a higher overall accuracy (94.418%) than J48 (88.372%). The weighted average true positive rate, precision, and recall of Random Tree are also higher than those of J48, in line with the corresponding accuracy.

It is evident from the rules derived from J48 and Random Tree that:
- The result of the second semester is the key influencer of the third semester result. This is expected, as the programming subjects of the second semester form the foundation of the programming subjects of the third semester.
- Consistently good academic performance is a clear indication of good performance in the third semester too.
- Of all the emotional attributes, leadership and drive of the students have been found to affect performance.
- Socio-economic conditions have only a marginal effect on performance.

V. CONCLUSIONS AND FUTURE DIRECTIONS

Today the academic success of students of any professional institution has become a major issue for the management. An early prediction of students at risk of poor performance helps the management take timely action to improve their performance through extra coaching and counseling. This paper focused on identifying attributes that influence students’ third semester performance. The effect of emotional quotient parameters on performance has been established. Random Tree gave a higher prediction accuracy than J48. Future research will include professional courses such as B.Tech, as well as the development of a decision support system to help authorities identify weak students and take timely measures.
REFERENCES

[1] B. Sen, E. Uçar and D. Delen, “Predicting and Analyzing Secondary Education Placement-Test Scores: A Data Mining Approach”, Expert Systems with Applications, Vol. 39, Issue 10, 2012.
[2] B. K. Baradwaj and S. Pal, “Mining Educational Data to Analyze Students’ Performance”, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 6, 2011.
[3] B. M. Bidgoli, D. Koshy, G. Kortemeyer and W. F. Punch, “Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-Based System”, 33rd ASEE/IEEE Frontiers in Education Conference, 2003.
[4] C. Romero and S. Ventura, “Educational Data Mining: A Survey from 1995 to 2005”, Expert Systems with Applications, no. 33, pp. 135–146, 2007.
[5] C. Romero and S. Ventura, “Educational Data Mining: A Review of the State of the Art”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 40, No. 6, 2010.
[6] D. Kabakchieva, “Predicting Student Performance by Using Data Mining Methods for Classification”, Cybernetics and Information Technologies, Vol. 13, 2013.
[7] E. Osmanbegovic and M. Suljic, “Data Mining Approach for Prediction of Student Performance”, Economic Review - Journal of Economics & Business, Vol. 10, Issue 1, 2012.
[8] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[9] I. H. M. Paris, L. S. Affendey and N. Mustapha, “Improving Performance Prediction using Voting Technique in Data Mining”, World Academy of Science, Engineering and Technology, Vol. 38, 2010.
[10] M. Ramaswami and R. Bhaskaran, “A CHAID Based Performance Prediction Model in Educational Data Mining”, International Journal of Computer Science, Vol. 7, Issue 1, No. 1, 2010.
[11] M. Wook, Y. H. Yahaya, N. Wahab, M. R. M. Isa, N. F. Awang and H. Y. Seong, “Predicting NDUM Student’s Academic Performance Using Data Mining Techniques”, Paper presented at International Conference of Computer and Electrical Engineering, ICCEE, December 28-30, 2009.
[12] T. Mitchell, Machine Learning. New York: McGraw Hill, 1997.
[13] N. S. Shah, “Predicting Factors that Affect Students’ Academic Performance by Using Data Mining”, Pakistan Business Review, January 2012.
[14] P. Cheewaprakobkit, “Study of Factor Analysis Affecting Achievements of Undergraduate”, Paper presented at International Multi Conference of Engineers and Computer Scientists, IMECS, Hong Kong, March 13-15, 2013.
[15] R. R. Kabra and R. R. Bichkar, “Performance Prediction of Engineering Students using Decision Trees”, International Journal of Computer Applications, Vol. 36, No. 11, 2011.
[16] R. S. J. D. Baker and K. Yacef, “The State of Educational Data Mining in 2009: A Review and Future Visions”, Journal of Educational Data Mining, Vol. 1, No. 1, 2009.
[17] N. T. Nghe, P. Janecek and P. Haddawy, “A Comparative Analysis of Techniques for Predicting Academic Performance”, Paper presented at 37th ASEE/IEEE Frontiers in Education Conference - Global Engineering: Knowledge Without Borders, Opportunities Without Passports, Milwaukee, WI, October 10-13, 2007.
[18] Z. J. Kovačić, “Early Prediction of Student Success: Mining Students Enrolment Data”, Paper presented at Proceedings of Informing Science & IT Education Conference (InSITE), Cassino, Italy, June 19-24, 2010.