
2014 Fourth International Conference on Advanced Computing & Communication Technologies

Mining Students’ Data for Performance Prediction

Tripti Mishra
Research Scholar, Mewar University

Dr. Dharminder Kumar
Professor & Chairman, CSE Department, G.J. University, Hisar

Dr. Sangeeta Gupta
Professor, Management and IT, Bhagwan Parshuram Institute of Technology, Delhi

978-1-4799-4910-6/14 $31.00 © 2014 IEEE DOI 10.1109/ACCT.2014.105

Abstract- A country’s growth is largely measured by the quality of its education system. The education sector across the globe has witnessed a sea change in its functioning. Today it is recognized as an industry and, like any other industry, it faces challenges; the major challenges in higher education are the decrease in students’ success rate and students leaving a course without completing it. An early prediction of students’ failure may help management provide timely counseling as well as coaching to increase the success rate and student retention. We use different classification techniques to build a performance prediction model based on students’ social integration, academic integration, and various emotional skills, which have not been considered so far. Two algorithms, J48 (an implementation of C4.5) and Random Tree, have been applied to the records of MCA students of colleges affiliated to Guru Gobind Singh Indraprastha University to predict third semester performance. Random Tree is found to be more accurate in predicting performance than the J48 algorithm.

Keywords- classification, data mining, prediction

I. INTRODUCTION

Preliminary education adds to a nation’s literacy rate, but higher education has a direct impact on the work force provided to industry and hence directly affects the economy. Many institutions of higher learning have been set up across India. However, the quality of education is judged by the success rate of students and by the extent to which an institute is capable of retaining its students. Predicting students’ performance can help identify the students who are at risk of failure, so that management can provide timely help and take essential steps to coach these students to improve their performance.

Data mining techniques have been applied to predict the academic performance of students based on their socio-economic condition and previous academic performance. This paper explores the link between the emotional skills of students, along with socio-economic and previous academic performance parameters, and their academic performance, using data mining techniques. Emotional skills such as assertion, leadership and stress management are obtained using the standard Emotional Skills Assessment Process (ESAP).

Data mining tasks can be either descriptive or predictive. Descriptive data mining uses techniques such as association rule mining and clustering to find patterns hidden in large data sets and to support intelligent decision making. Predictive data mining constructs models using rule sets, decision trees, neural nets, support vector machines, etc. to predict the class of a new data set. The objective of this paper is to predict the third semester performance of MCA students. The rationale behind considering the third semester for prediction is the observation that most students who drop out of the course do so after the first year, and that students normally take a year to get integrated into an institute's academic environment. Two decision tree algorithms, J48 and Random Tree, have been used to build the model. The main contribution of this paper is the model comparison, along with finding the impact of various attributes on students’ performance.

The remainder of this paper is organized as follows. Section II discusses previous work, followed by the experimental setting in Section III. Section IV presents the results, and conclusions are discussed in Section V.

II. LITERATURE SURVEY

The most cited literature survey papers in educational data mining are by Romero and Ventura [4], Baker and Yacef [16], and Romero and Ventura [5], which identify performance prediction as one of the emerging fields of educational data mining.

Paris, Affendey and Mustapha [9] have used various subject performance attributes to predict the final CGPA of Bachelor of Computer Science students of a Malaysian university. Various Bayesian classification techniques have been used, and a comparative study suggests that the ensemble method gives the best overall accuracy.

Osmanbegovic and Suljic [7] have considered students’ attitude towards studying, in addition to demographic variables and scores earned at high school, to predict the grade in Business Information for first year students. Entrance exam score, study material and average weekly hours devoted to studying have been found to have the maximum impact, while number of household members, distance of residence and gender have been found to have the least impact. Naïve Bayes is found to be a better classifier than J48.

Nghe, Janecek and Haddawy [17] have considered two diverse populations: Can Tho University, a large university in Vietnam, and the Asian Institute of Technology, a small international postgraduate institute. They achieved similar levels of prediction accuracy for both populations.

Chi-squared Automatic Interaction Detector (CHAID) has been used by Ramaswami and Bhaskaran [10] to classify 12th class students of selected Tamil Nadu schools. Apart from demographic details, students’ health, tuition, care of study at home, etc. have been studied. The prediction accuracy obtained was 44.69%, and the most influential variables were found to be Xth grade marks, location of school, private tuition, etc.

Cheewaprakobkit [14] considered 1,600 student records from 2001 to 2011 at a university in Thailand and applied decision tree and neural network classifiers to find the most important factors affecting students’ academic achievement. The decision tree proved to be a better classifier than the neural network, with 1.31% higher accuracy. Number of hours worked per semester, an additional English course, number of credits enrolled per semester and marital status of the students are the major factors affecting performance.

Baradwaj and Pal [2] base their experiment only on previous semester marks, class test grade, seminar performance, assignment, attendance and lab work to predict end semester marks. Records of 50 MCA students of Purvanchal University from the 2007 to 2010 session were considered. The paper calculates the split info and gain ratio of each predictor and produces prediction rules.

Sen, Uçar and Delen [1] have ranked the importance of 24 predictor variables, including demography, scores in maths, Turkish, religion and ethics, science and technology, and level determination exams, for predicting Turkish secondary education placement results. Application of artificial neural networks, support vector machines, multiple regression and decision trees indicated that the most important predictor variables are the determination exam, scholarship, number of siblings, success level in Turkish language, etc.

N. S. Shah [13] has applied various decision tree algorithms (C4.5, Random Forest, BF Tree, REP Tree), functions (Logistic, RBF Network), rules (JRip) and Bayesian classifiers (Bayes Net, Naive Bayes) to categorize students of the BBA program of the University of Karachi. Out of 42 independent variables, the 5 variables having the highest effect in determining performance are considered. Random Forest proved to be a more accurate classifier than the J48 decision tree, BF Tree, REP Tree and JRip rule learner.

Wook et al. [11] have considered a few personality traits such as motivation to study, interests and learning environment, along with demographic details and previous academic performance, to predict the CGPA of computer science graduates and finally to find the students who are at risk of failing.

Minaei-Bidgoli, Kashy, Kortemeyer and Punch [3] applied tree classifiers as well as non-tree classifiers to predict the grades of students enrolled in online education delivered through the Learning Online Network with Computer-Assisted Personalized Approach (LON-CAPA), developed at Michigan State University, and found that a combination of multiple classifiers enhanced the prediction accuracy.

Kabakchieva [6] applies data mining classification techniques to 10,330 students with 14 attributes, including personal profile, secondary education score, entrance exam score, admission year, etc. The students are classified into five categories: excellent, very good, good, average and bad. 10-fold cross-validation and percentage split are used for all the classifiers: J48, Bayesian, k-nearest neighbor, OneR and JRip. J48 has been found to be the most suitable of all the classifiers.

Kabra and Bichkar [15] experimented with 346 first year students of an engineering college, collecting their demographic data (category, gender, etc.), past performance data (SSC or 10th marks, HSC or 10+2 exam marks, etc.), address and contact number to predict whether a student will PASS, FAIL or get promoted (when the student fails in 3 theory and 2 practical subjects). The J48 algorithm in WEKA produces a prediction model with an accuracy of 60.46%. The most important attribute in predicting a student’s performance is found to be HSCCET. Social attributes such as category, parents’ occupation and living location, and other attributes such as gender and medium at secondary level, are found to be less relevant.

Students dropping out of the Open Polytechnic of New Zealand due to failure have been studied by Kovačić [18]. Enrollment data consisting of socio-demographic variables (age, gender, ethnicity, education, work status and disability) and study environment (course programme and course block) of 435 polytechnic students of an Information Systems course were collected. The final label consisted of two categories: PASS (those who completed the course) and FAIL (those who did not complete it). Feature selection indicated that the most important attributes for prediction are ethnicity, course programme and course block.

III. EXPERIMENTAL SETTING

The major objective of the proposed methodology is to build a classification model that classifies a student’s third semester performance as BAVG (=80%). The classifiers have been built following the standard process for data mining, which includes business understanding, data understanding, data preparation, modeling and finally the application of data mining techniques, which is classification in the present study.

A. Data Understanding
The data of MCA students from various institutions affiliated to GGSIP University was collected through a structured questionnaire. A sample of 250 students was collected with 25 attributes, covering academic integration, social integration and emotional skills, as shown in Table I.

B. Data-Preprocessing
The data collected was saved as Excel spreadsheets. The cleaning process required eliminating records with missing values, correcting inconsistent data, identifying outliers, and removing duplicate data. Of the total of 250 instances in the raw data, 215 instances remained after the data cleaning process.

C. Modeling
The open source data mining tool Waikato Environment for Knowledge Analysis (WEKA) has been used for classification. WEKA provides built-in algorithms that can be applied to any data set.


TABLE I. Attributes Description

Attribute Name    Values                                   Description
GENDER            Male, Female                             Gender
FE                Midschool, Inter, Grad, Postgrad         Father’s Education
ME                Midschool, Inter, Grad, Postgrad         Mother’s Education
FO                Govtjob, Pvtjob, Business                Father’s Occupation
MO                Govtjob, Pvtjob, Business, Housewife     Mother’s Occupation
FI                MIG, HIG, LIG, VHIG                      Annual Family Income
LOAN              Yes, No                                  Educational loan at any level of education
EARLYLIFE         Metro, City, Village                     15 years of life spent
MEDIUM            English, Other                           Medium of instruction at school level
TENTH             BLAVG, AVG, ABAVG, EXCL                  % marks in 10th
TWELVTH           BLAVG, AVG, ABAVG, EXCL                  % marks in 12th
GRAD              BLAVG, AVG, ABAVG, EXCL                  % marks in Graduation
FIRST_SEM         BLAVG, AVG, ABAVG, EXCL                  % marks in 1st Semester of MCA
SECSEM            BLAVG, AVG, ABAVG, EXCL                  % marks in 2nd Semester of MCA
THIRDSEM          BLAVG, AVG, ABAVG, EXCL                  % marks in 3rd Semester of MCA
GRADDEGTYPE       Regular, Distance                        Type of Graduation Degree
GRADDEGSTREAM     CS, NCS                                  Graduation Degree Stream
GAPYEAR           Yes, No                                  Gap year in education
ACADEMICHRS       INSUF, SUF, OPTIMAL                      Hours spent on academic activities
ASSERTION         D, S, E*                                 Assertiveness of the student
EMPATHY           D, S, E                                  Empathy of the student
DECISIONMAKING    D, S, E                                  Decision making ability of the student
LEADERSHIP        D, S, E                                  Leadership ability of the student
DRIVE             D, S, E                                  Drive of the student
STRESSMGMT        D, S, E                                  Stress management skill of the student

*D, S, E represent need to develop, need to strengthen and need to enhance, respectively.

D. Classification
Tree-based methods classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute [12]. J48 is a class for generating a pruned or unpruned C4.5 decision tree, while Random Tree constructs a tree that considers K randomly chosen attributes at each node without pruning. We have used cross-validation for testing, as it has been shown to be more suitable for limited datasets and gives the best estimate of error [8].

IV. RESULT AND DISCUSSION

J48 and Random Tree were applied to the data set using 10-fold cross-validation. The summary and the rules obtained by J48 are listed in Fig. 1 and Table II, while the summary and rules of Random Tree are shown in Fig. 2 and Table III. The performance of the algorithms is evaluated on the basis of recall, precision and true positive (TP) rate. Precision is defined as the number of correct positive predictions over the total number of positive predictions, and recall is defined as the number of correct positive predictions over the total number of positive cases. A high precision indicates that the algorithm returns more relevant results than irrelevant ones, and a high recall means that the algorithm returns most of the relevant results. The performance comparison of J48 and Random Tree is shown in Table IV.
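The definitions of precision and recall given above can be made concrete with a short sketch. The Python below is illustrative only: the function name and the six sample labels are assumptions made for this example and are not taken from the study's data or from WEKA. Each class is scored one-vs-rest, treating that class as the positive class.

```python
from collections import Counter


def per_class_precision_recall(actual, predicted):
    """Return {label: (precision, recall)}, scoring each class one-vs-rest."""
    labels = sorted(set(actual) | set(predicted))
    predicted_count = Counter(predicted)  # times each class was predicted
    actual_count = Counter(actual)        # times each class really occurs
    true_positive = Counter(a for a, p in zip(actual, predicted) if a == p)

    scores = {}
    for label in labels:
        tp = true_positive[label]
        # Precision: correct positive predictions / all positive predictions.
        precision = tp / predicted_count[label] if predicted_count[label] else 0.0
        # Recall: correct positive predictions / all positive cases.
        recall = tp / actual_count[label] if actual_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up labels for six students, for illustration only.
actual = ["AVG", "AVG", "BAVG", "EXCL", "AVG", "BAVG"]
predicted = ["AVG", "BAVG", "BAVG", "EXCL", "AVG", "AVG"]
scores = per_class_precision_recall(actual, predicted)
```

Recall computed this way is the same quantity as the per-class true positive rate, which is why the TP Rate and Recall columns of Table IV carry identical values.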

Figure 1. J48 Result Summary

TABLE II. Rules Derived from J48

1. If (SECSEM = AVG) and (TWELFTH = AVG) and (MEDIUM = English) and (EARLYLIFE = Metro) and (ME = Grad): ABAVG
2. If (SECSEM = AVG) and (TWELFTH = AVG) and (MEDIUM = English) and (EARLYLIFE = City) and (GRAD = AVG): AVG
3. If (SECSEM = AVG) and (TWELFTH = EXCL): AVG
4. If (SECSEM = AVG) and (TWELFTH = ABAVG): AVG
5. If (SECSEM = AVG) and (TWELFTH = BAVG) and (LEADERSHIP = S) and (GRAD = BAVG): BAVG
6. If (SECSEM = AVG) and (TWELFTH = BAVG) and (LEADERSHIP = E): AVG
7. If (SECSEM = EXCL) and (GRADDEGTYPE = Regular): EXCL
8. If (SECSEM = ABAVG) and (FIRSTSEM = ABAVG) and (TENTH = ABAVG): ABAVG
9. If (SECSEM = BAVG) and (ME = Inter): BAVG


Figure 2. Random Tree Result Summary

TABLE III. Rules Derived from Random Tree

1. If (SECSEM = BAVG) and (ASSERTION = D) and (FI = MIG): BAVG
2. If (SECSEM = ABAVG) and (LEADERSHIP = E): ABAVG
3. If (SECSEM = EXCL) and (FE = Grad): EXCL
4. If (SECSEM = AVG) and (FO = Govtjob) and (ACADEMICHRS = SUF) and (DRIVE = S): AVG
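As a rough illustration of how if-then rules such as those in Table III could be applied to a new student record, the following Python sketch encodes the four rules as data and returns the class of the first rule whose conditions all match. The representation, the `classify` helper and the sample record are hypothetical and are not part of WEKA or of the original study; the attribute names and values are those of the table.

```python
# Rules transcribed from Table III as (conditions, predicted class) pairs.
RULES = [
    ({"SECSEM": "BAVG", "ASSERTION": "D", "FI": "MIG"}, "BAVG"),
    ({"SECSEM": "ABAVG", "LEADERSHIP": "E"}, "ABAVG"),
    ({"SECSEM": "EXCL", "FE": "Grad"}, "EXCL"),
    ({"SECSEM": "AVG", "FO": "Govtjob", "ACADEMICHRS": "SUF", "DRIVE": "S"}, "AVG"),
]


def classify(record, rules, default=None):
    """Return the class of the first rule whose conditions all hold for the record."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default  # no listed rule covers this record


# Hypothetical student record, for illustration only.
student = {"SECSEM": "ABAVG", "LEADERSHIP": "E", "FE": "Postgrad", "DRIVE": "S"}
prediction = classify(student, RULES)
uncovered = classify({"SECSEM": "AVG"}, RULES)
```

Because the table lists only selected rules rather than the full tree, a record can match none of them; the sketch returns a default value in that case instead of guessing a class.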

TABLE IV. Performance Comparison of J48 and Random Tree

                                   J48                            Random Tree
Class                              TP Rate  Precision  Recall     TP Rate  Precision  Recall
ABAVG                              0.792    0.924      0.792      0.948    0.924      0.948
EXCL                               1.000    0.857      1.000      0.952    0.952      0.952
AVG                                0.900    0.851      0.900      0.914    0.970      0.914
BAVG                               0.923    0.923      0.923      1.000    0.929      1.000
Weighted Average                   0.884    0.887      0.884      0.944    0.945      0.944

Correctly Classified Instances     88.3721%                       94.4186%
Incorrectly Classified Instances   11.6279%                       5.5814%
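The accuracy percentages in Table IV can be related to instance counts with a small worked calculation. The sketch below assumes that both classifiers were evaluated on all 215 instances that remained after data cleaning; that is an assumption of this example, since the table itself reports only percentages. The function name is hypothetical.

```python
def classified_counts(accuracy_percent, n_instances):
    """Return (correct, incorrect) instance counts implied by an accuracy figure."""
    correct = round(accuracy_percent / 100 * n_instances)
    return correct, n_instances - correct


N_INSTANCES = 215  # assumption: every cleaned instance was used in evaluation

j48_correct, j48_wrong = classified_counts(88.3721, N_INSTANCES)
rt_correct, rt_wrong = classified_counts(94.4186, N_INSTANCES)

# Recompute the percentages from the integer counts as a consistency check.
j48_accuracy = round(100 * j48_correct / N_INSTANCES, 4)
rt_accuracy = round(100 * rt_correct / N_INSTANCES, 4)
```

Under that assumption the figures are consistent with 190 of 215 instances classified correctly by J48 and 203 of 215 by Random Tree, a difference of 13 instances.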

It is evident from the rules derived from J48 and Random Tree that:

• The result of the second semester is the key influencer of the third semester result. This is expected, as the programming subjects of the second semester form the foundation of the programming subjects of the third semester.
• Consistently good academic performance is clearly a good indication of good performance in the third semester too.
• Of all the emotional attributes, leadership and drive of the students have been found to affect performance.
• Socio-economic conditions have only a marginal effect on performance.

The performance of both algorithms is satisfactory; however, a higher overall accuracy (94.418%) was attained by the Random Tree implementation, compared with 88.372% for J48. The weighted-average true positive rate, precision and recall of Random Tree are also higher than those of J48 and in line with the corresponding accuracy.

V. CONCLUSIONS AND FUTURE DIRECTIONS

Today the academic success of students of any professional institution has become a major issue for the management. An early prediction of students at risk of poor performance helps the management take timely action to improve their performance through extra coaching and counseling. This paper focused on identifying the attributes that influence students’ third semester performance. The effect of emotional quotient parameters on performance has been established. Random Tree gave higher prediction accuracy than J48. Future research will include the professional courses of B.Tech, as well as the development of a decision support system to help authorities identify weak students and take timely measures.


REFERENCES

[1] B. Sen, E. Uçar and D. Delen, “Predicting and Analyzing Secondary Education Placement-Test Scores: A Data Mining Approach”, Expert Systems with Applications, Vol. 39, Issue 10, 2012.
[2] B. K. Baradwaj and S. Pal, “Mining Educational Data to Analyze Students’ Performance”, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 6, 2011.
[3] B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer and W. F. Punch, “Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-Based System”, 33rd ASEE/IEEE Frontiers in Education Conference, 2003.
[4] C. Romero and S. Ventura, “Educational Data Mining: A Survey from 1995 to 2005”, Expert Systems with Applications, no. 33, pp. 135–146, 2007.
[5] C. Romero and S. Ventura, “Educational Data Mining: A Review of the State of the Art”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 40, No. 6, 2010.
[6] D. Kabakchieva, “Predicting Student Performance by Using Data Mining Methods for Classification”, Cybernetics and Information Technologies, Vol. 13, 2013.
[7] E. Osmanbegovic and M. Suljic, “Data Mining Approach for Prediction of Student Performance”, Economic Review - Journal of Economics & Business, Vol. 10, Issue 1, 2012.
[8] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[9] I. H. M. Paris, L. S. Affendey and N. Mustapha, “Improving Performance Prediction Using Voting Technique in Data Mining”, World Academy of Science, Engineering and Technology, Vol. 38, 2010.
[10] M. Ramaswami and R. Bhaskaran, “A CHAID Based Performance Prediction Model in Educational Data Mining”, International Journal of Computer Science, Vol. 7, Issue 1, No. 1, 2010.
[11] M. Wook, Y. H. Yahaya, N. Wahab, M. R. M. Isa, N. F. Awang and H. Y. Seong, “Predicting NDUM Student's Academic Performance Using Data Mining Techniques”, paper presented at International Conference of Computer and Electrical Engineering (ICCEE), December 28-30, 2009.
[12] T. Mitchell, Machine Learning. New York: McGraw Hill, 1997.
[13] N. S. Shah, “Predicting Factors that Affect Students’ Academic Performance by Using Data Mining”, Pakistan Business Review, January 2012.
[14] P. Cheewaprakobkit, “Study of Factor Analysis Affecting Achievements of Undergraduate”, paper presented at International MultiConference of Engineers and Computer Scientists (IMECS), Hong Kong, March 13-15, 2013.
[15] R. R. Kabra and R. R. Bichkar, “Performance Prediction of Engineering Students Using Decision Trees”, International Journal of Computer Applications, Vol. 36, No. 11, 2011.
[16] R. S. J. D. Baker and K. Yacef, “The State of Educational Data Mining in 2009: A Review and Future Visions”, Journal of Educational Data Mining, Vol. 1, No. 1, 2009.
[17] N. T. Nghe, P. Janecek and P. Haddawy, “A Comparative Analysis of Techniques for Predicting Academic Performance”, paper presented at 37th ASEE/IEEE Frontiers in Education Conference - Global Engineering: Knowledge Without Borders, Opportunities Without Passports, Milwaukee, WI, October 10-13, 2007.
[18] Z. J. Kovačić, “Early Prediction of Student Success: Mining Students Enrolment Data”, paper presented at Proceedings of Informing Science & IT Education Conference (InSITE), Cassino, Italy, June 19-24, 2010.
