Finding Peculiar Students from Student Database using Outlier Analysis: Data Mining Approach
Lakshmi Sreenivasa Reddy.D1
Dr.B.Raveendrababu2
Department of Computer Science Engineering RISE Krishna Sai Gandhi Group of Institutions Ongole, India
Department of Computer Science Engineering VNR VJIT Hyderabad, India
Vijaya Bhaskar Velpula2 Department of Computer Science Engineering Guntur Engineering College Guntur, India
S.Sailaja2 Department of Computer Science Engineering RISE Krishna Sai Gandhi Group of Institutions Ongole, India
K. Bindhu Madhavi3 Department of Computer Science & Engineering MRR Group of Institutions Hyderabad, India
Abstract: Students who join educational institutions with different behaviors create different problems in class. To bring them onto the right path, mentors should be able to identify such students. Because these students differ in behavior, the teaching faculty should not apply one common teaching approach to all students. These students behave abnormally when compared with other students, and they are treated here as peculiar students. Student data is mostly of mixed type. This paper presents how peculiar students are found using data mining techniques; the techniques used are those for categorical attribute data. For the experiments, data was collected from B.Tech students of different colleges using the ILS questionnaire [1]. We have also investigated the relationship between peculiarity and learning styles.
Index Terms: ILS, Felder and Silverman, NAVF, NBAD, AVF, BAD, categorical.
978-1-4799-6876-3/14/$31.00 ©2014 IEEE

I. INTRODUCTION
Many systems have been developed in the education field to improve the style of learning. Almost all of these methods are based on online courses and online education and use web mining techniques. The track record of peculiar students is marked by ups and downs. Finding peculiar students is another important concern in education: it would help mentors in educational institutions to organize the students more effectively. This paper presents how peculiar students can be found from a student database using data mining techniques. Since the data collected about the students is almost entirely categorical, this paper uses techniques for categorical data. Data mining can also be used to learn about problems in the education system, for example, giving incorrect feedback statements [2], to adapt to the level of progress of the learner [3], and to suggest personalized learning experiences and activities for students [4]. Many methods are available for categorical data, but some of the latest methods are presented here to find peculiar students from a student database.

The Index of Learning Styles questionnaire [1] has been used to collect data from B.Tech students of different branches in different colleges. The questionnaire consists of 44 questions based on different learning styles, together with a class label attribute recording whether a student succeeded in B.Tech or not. These questions are treated as attributes in the student data. Each question has two options. The Active/Reflective learning style is covered by 11 questions. Similarly, the Global/Sequential, Sensing/Intuitive, and Visual/Auditory learning styles have 11 questions each. From these 44 questions, the strengths of 8 learning styles are identified. In [11], numerical models such as k-NN, density-based models, distance-based models, and cluster-related methods are used to find learners' learning styles.

II. PECULIAR STUDENTS' RECORDS ARE OUTLIERS IN EDUCATION DATA
Peculiar students' records do not conform to other students' records. These records show extreme behavior when compared with other data objects and differ significantly from the other students' records in the database, so they are called outliers. Such outliers can lead to wrong decisions. Peculiarity may develop from different factors. These records can be found easily by applying existing methods such as distance-based, density-based, and statistical methods. They arise from different psychological factors of the students.

III. TYPES OF LEARNING STYLES
A. Active/Reflective learning
Active learners are learners who process information actively first: they like to discuss examples and explanations and to test the information with others. Reflective learners, in contrast, do not discuss with others; they like to think alone. They prefer to go through the material before processing it. Reflective learners collect more information from forums, such as discussions, doubts, and explanations, than would be expected of active learners. Reflective learners like frequent reading. They are very careful and prefer to participate passively. Since active learners like to do things themselves, they spend more time on examples, and they do not like others' solutions. Reflective learners like to spend more time on reading materials.

B. Sensing/Intuitive learning
Sensing learners like concrete material and analyzing the performance on questions about facts and theories. Intuitive learners, in contrast, like abstractions of the original theories.
Sensing learners prefer examples when learning concrete material, whereas intuitive learners use examples as supplementary material. Therefore the time spent on content objects tends to be high for sensing learners, and the time spent on examples tends to be low for intuitive learners. Sensing learners solve problems by example, whereas intuitive learners prefer more challenges. Intuitive learners can be expected to understand theories and concepts. Sensing learners work carefully and slowly, with more attention to detail.

C. Visual/Verbal learning
Visual learners like to learn through images, graphics, and flow charts, whereas verbal learners prefer to learn from words. Furthermore, verbal learners like to communicate and discuss with others. Verbal learners tend to make more visits and postings and to spend more time in a discussion forum.

D. Sequential/Global learning
Sequential learners look at the details, whereas global learners prefer to see the "big picture" and its connections to other fields. For global learners, the outline of the course and of its chapters is especially important. Global learners like to interpret predefined solutions and to find new solutions by connecting different fields. Searching through a course acts as a pattern denoting the global learning style. Sequential learners go through the course step by step, in a linear way. Global learners skip learning objects and jump to more complex material.

IV. EXISTING METHODS TO FIND OUTLIERS IN CATEGORICAL DATA
A. Attribute Value Frequency (AVF)
AVF is one of the methods for finding outliers in categorical data. The method [5] uses the frequencies of the attribute values contained in each record. It gives good accuracy at low time complexity and is a simple and fast approach to outlier detection. It needs only one scan over the data, needs little extra space, and does not search over combinations of attribute values. An outlier point X_i is defined based on the AVF score given below:
$$ \mathrm{AVFscore}(X_i) = \frac{1}{m} \sum_{j=1}^{m} f(x_{ij}) \tag{1} $$
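As a minimal sketch of Eq. (1), the score of a record is the average, over its m attributes, of how often each of its attribute values occurs in the data. The function name `avf_scores`, the toy records, and the value of `k` are hypothetical and chosen only for illustration. The sketch also assumes the usual reading of AVF, in which the records with the lowest scores are reported as outliers.

```python
from collections import Counter


def avf_scores(records):
    """Return the AVF score of every record, following Eq. (1).

    records: list of equal-length sequences of categorical values.
    """
    m = len(records[0])  # number of attributes
    # Count how often each value occurs in each attribute.
    freq = [Counter(row[j] for row in records) for j in range(m)]
    # Average the frequencies of the values that make up each record.
    return [sum(freq[j][row[j]] for j in range(m)) / m for row in records]


# Hypothetical toy data: three two-option answers per student.
students = [
    ("a", "a", "b"),
    ("a", "a", "b"),
    ("a", "b", "b"),
    ("b", "b", "a"),
]
scores = avf_scores(students)

k = 1  # number of outliers, supplied by the user
# Assumed reading: the k records with the lowest AVF score are the outliers.
outliers = sorted(range(len(students)), key=scores.__getitem__)[:k]
print(scores, outliers)
```

The need to choose `k` by hand is the input requirement that the thresholded variants later in this section try to remove.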
Here X_i is an object of the collected data, x_{ij} is the value of the j-th attribute in the i-th object, f(x_{ij}) is the frequency of the attribute value x_{ij}, and m is the number of attributes. The method is useful for finding outliers in categorical data, and since the collected data is categorical, it has been used here. However, the method needs an input value specifying the number of outliers, and it cannot find a reliable number of outliers by itself.

2014 IEEE International Conference on MOOC, Innovation and Technology in Education (MITE)

B. BAD score Algorithm
The BAD score algorithm [6] is another method for finding outliers in categorical data. It needs one scan of the dataset for each data object, but only one scan of the entire dataset to find the frequency of each value. It finds outliers based on entropy: it computes an entropy-based score for each object in the dataset and reports as outliers the k objects with the highest BAD scores. The BAD score is defined as follows. Let D be the dataset with attributes {A_1, A_2, ..., A_m}, let D(A_j) be the domain of all distinct values of attribute j, and let V be the set of all distinct values in the database, where

$$ D(A) = D(A_1) \cup D(A_2) \cup \dots \cup D(A_m) = \{V_{1j}, V_{2j}, \dots, V_{kj}\} $$

with 1 ≤ k ≤ n and 1 ≤ j ≤ m, and n the number of records. The BAD score of each record is

$$ \mathrm{BADscore}(X_i) = \frac{1}{\mathrm{Score1}(X_i) + \mathrm{Score2}(X_i)} \tag{2} $$

where

$$ \mathrm{Score1}(X_i) = -\sum_{j=1}^{m} \left[ \sum_{\substack{\forall V_{kj} \in D(A_j) \\ x_{ij} = V_{kj}}} \frac{f(V_{kj}) - 1}{n - 1} \log_{10}\!\left( \frac{f(V_{kj}) - 1}{n - 1} \right) \right] \tag{3} $$

$$ \mathrm{Score2}(X_i) = -\sum_{j=1}^{m} \left[ \sum_{\substack{\forall V_{kj} \in D(A_j) \\ x_{ij} \neq V_{kj}}} \frac{f(V_{kj})}{n - 1} \log_{10}\!\left( \frac{f(V_{kj})}{n - 1} \right) \right] \tag{4} $$

This method can give better results than AVF and Greedy on some datasets, but its time complexity is very high compared with those two methods. Like them, it has the drawback of requiring an input: it cannot find the number of outliers automatically. The remedy is the optimization of the AVF and BAD algorithms.

C. Normally Distributed BAD Score (NBAD) algorithm
Take the dataset D with m attributes A_1, A_2, ..., A_m, where d(A_i) is the domain of distinct values of the attribute A_i. Let K_n be the number of outliers, which are normally distributed. To find K_n, the Gaussian distribution is used. If the AVF frequency of any object is less than Φ, the model treats that record as an outlier:

$$ \Phi = \mu_{BAD}(X_i) - \sigma_{BAD}(X_i) \tag{5} $$

where μ_BAD(X_i) is the mean of all BAD scores for i = 1 to i = n, and σ_BAD(X_i) is the standard deviation of all BAD scores for i = 1 to i = n. This method also finds the optimal number of outliers automatically.

D. Fuzzy based BAD Score (FBAD)
This method calculates fuzzy seeds based on the frequencies of attribute values and, from these fuzzy seeds, a fuzzy threshold value. If the BAD score of a record is greater than the fuzzy score, the record is treated as a regular record; otherwise it is treated as an outlier (peculiar). The fuzzy seeds are calculated as given below.

$$ a = \begin{cases} b - 3\,\mathrm{STD}(f_i) & \text{if } \max(f_i) > 3\,\mathrm{STD}(f_i) \\ b - 2\,\mathrm{STD}(f_i) & \text{if } \max(f_i) > 2\,\mathrm{STD}(f_i) \\ b - \mathrm{STD}(f_i) & \text{otherwise} \end{cases} \tag{6} $$

$$ c = \begin{cases} b + 3\,\mathrm{STD}(f_i) & \text{if } \max(f_i) > 3\,\mathrm{STD}(f_i) \\ b + 2\,\mathrm{STD}(f_i) & \text{if } \max(f_i) > 2\,\mathrm{STD}(f_i) \\ b & \text{otherwise} \end{cases} \tag{7} $$

where

$$ b = \mathrm{mean}(f_i) \tag{8} $$

$$ \mathrm{Fuzzyscore}(x_i) = \begin{cases} 0 & \text{if } f_i < a \\ 2\left\{ \dfrac{f_i - a}{c - a} \right\}^2 & \text{if } a \le f_i \le b \\ 1 - 2\left\{ \dfrac{f_i - a}{c - a} \right\}^2 & \text{if } b \le f_i \le c \\ 1 & \text{if } f_i > c \end{cases} \tag{9} $$

All these methods can be applied to find peculiar students from a student database.
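The BAD score of Eqs. (2) to (4) can be read as the reciprocal of the attribute-wise entropy of the data that remains once a record is left out: values matching the record lose one occurrence, and all frequencies are divided by n − 1. The following is a minimal sketch under that reading. The names `bad_scores` and `_neg_p_log_p`, the toy records, and the treatment of a zero entropy sum as infinity are assumptions made for illustration; they are not details given in the text.

```python
import math
from collections import Counter


def _neg_p_log_p(p):
    # A probability of 0 contributes nothing (the usual 0 * log 0 = 0 convention).
    return -p * math.log10(p) if p > 0 else 0.0


def bad_scores(records):
    """BAD score of every record, following Eqs. (2)-(4) as read here."""
    n, m = len(records), len(records[0])
    freq = [Counter(row[j] for row in records) for j in range(m)]
    scores = []
    for row in records:
        score1 = score2 = 0.0
        for j in range(m):
            for value, f in freq[j].items():
                if value == row[j]:
                    # Value carried by this record: one occurrence is removed, Eq. (3).
                    score1 += _neg_p_log_p((f - 1) / (n - 1))
                else:
                    # Value not carried by this record, Eq. (4).
                    score2 += _neg_p_log_p(f / (n - 1))
        total = score1 + score2
        # Assumption: a zero entropy sum (not covered in the text) maps to infinity.
        scores.append(1.0 / total if total > 0 else math.inf)
    return scores


# Hypothetical toy data: three two-option answers per student.
students = [
    ("a", "a", "b"),
    ("a", "a", "b"),
    ("a", "b", "b"),
    ("b", "b", "a"),
]
bad = bad_scores(students)

k = 1  # number of outliers, supplied by the user
# The k records with the highest BAD score are reported as outliers.
top_k = sorted(range(len(students)), key=bad.__getitem__, reverse=True)[:k]
print(bad, top_k)
```

In this toy data the last record carries the rarest values, so removing it lowers the remaining entropy the most and gives it the highest score.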
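The NBAD cut-off of Eq. (5) replaces the user-supplied number of outliers with a threshold Φ equal to the mean of the scores minus one standard deviation; records scoring below Φ are flagged. The sketch below applies the rule exactly as stated. It assumes the population standard deviation, since the text does not say which form is meant, and it uses a hypothetical list of scores rather than scores from the collected data. The name `nbad_flagged` is illustrative.

```python
from statistics import mean, pstdev


def nbad_flagged(scores):
    """Return (phi, indices of records whose score is below phi), per Eq. (5)."""
    # Assumption: population standard deviation; the text does not specify.
    phi = mean(scores) - pstdev(scores)
    return phi, [i for i, s in enumerate(scores) if s < phi]


# Hypothetical scores for five records.
scores = [4.0, 5.0, 5.0, 6.0, 1.0]
phi, flagged = nbad_flagged(scores)
print(round(phi, 4), flagged)
```

Because Φ is derived from the scores themselves, the number of flagged records is not fixed in advance, which is the sense in which the method finds the number of outliers automatically.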
V. EXPERIMENTAL RESULTS AND DISCUSSION
The data was collected from B.Tech students using the ILS questionnaire of 44 learning-style questions, together with their percentages up to B.Tech. In total, 493 records were collected, with forty-nine attributes, all of them categorical. There is one class label attribute, "B.Tech Result", with two values, "Success" and "Not success". When a student obtains more than 65% marks up to B.Tech, the result is treated as "Success"; otherwise it is treated as "Not success". The algorithms were used to find peculiar students among the 493 records. After the peculiar students were found, the correlation between learning styles and peculiarity was also investigated. After applying both methods to the different branches of B.Tech, the numbers of peculiar students found by NBAD and FBAD are given in Table 1.
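To make the record layout described above concrete, the sketch below tallies the eight learning-style strengths from 44 two-option answers and derives the class label from the 65% rule. Several details are assumptions rather than facts from the text: the answers are encoded as the characters 'a' and 'b', the questions are assumed to cycle through the four dimensions in order (question 1 Active/Reflective, question 2 Sensing/Intuitive, and so on), option 'a' is assumed to count toward the first-named style of each pair, and the function names are hypothetical.

```python
DIMENSIONS = (
    "Active/Reflective",
    "Sensing/Intuitive",
    "Visual/Verbal",
    "Sequential/Global",
)


def style_strengths(answers):
    """Tally the 8 learning-style strengths from 44 answers ('a' or 'b')."""
    if len(answers) != 44:
        raise ValueError("expected 44 answers")
    strengths = {}
    for d, name in enumerate(DIMENSIONS):
        # Assumed layout: every fourth question belongs to the same dimension.
        picks = answers[d::4]
        first, second = name.split("/")
        strengths[first] = picks.count("a")
        strengths[second] = picks.count("b")
    return strengths


def btech_result(percentage):
    """Class label: more than 65% counts as Success."""
    return "Success" if percentage > 65 else "Not success"


# Hypothetical student who alternates between the two options.
answers = "ab" * 22
strengths = style_strengths(answers)
print(strengths, btech_result(70))
```

Each dimension contributes 11 answers, so the two strengths of a pair always sum to 11.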
Comparing the peculiar students found by the two methods, FBAD found more peculiar students than NBAD in each branch. In the Civil branch (82 records), NBAD found 12.1% of the students to be peculiar. Among these peculiar students, 2.4% of all students succeeded in their course; that is, they obtained more than 65% in B.Tech even though they did not succeed in lower courses. The FBAD method found 14.4% of the students to be peculiar.
TABLE 1. COMPARISON OF PECULIAR STUDENTS FOUND BY NBAD AND FBAD

Branch (records)   Peculiar students found by NBAD   Peculiar students found by FBAD
                   Not Success     Success           Not Success     Success
Civil (82)         9.7%            2.4%              12%             2.4%
CSE (110)          5.45%           4.54%             7.27%           3.63%
ECE (215)          4.18%           0%                4.65%           0.93%
EEE (87)           8.08%           4.59%             10.344%         3.44%

Fig. 2. Distribution of learning styles
Among these peculiar students, too, 2.4% of all students succeeded in their courses. In CSE (110 records), the success rate is higher even among peculiar students. According to the NBAD method, 9.99% of the students are peculiar, and 4.54% of all students are peculiar students who succeeded in B.Tech CSE. The FBAD method found 10.3% of all students to be peculiar, of whom 3.63% succeeded in their B.Tech course. According to NBAD, 4.18% of the students in the ECE branch (215 records) are peculiar, and none of them succeeded in their B.Tech course. FBAD found 5.58% of the students to be peculiar, with a success rate of 0.93% among them. In EEE (87 records), NBAD and FBAD found 12.17% and 13.78% of the students, respectively, to be peculiar. Of these, 4.59% and 3.44% succeeded in their B.Tech courses.
Fig. 3. Peculiar students' records under the yellow line
Fig. 1. Comparison of peculiar students found by NBAD

VI. CONCLUSION & FUTURE WORK
By applying data mining techniques such as AVF, BAD, NBAD, and FBAD to education data, peculiar students have been found in each branch. This information is very helpful for mentors in organizing the pupils effectively. In future we will design a decision-making system to select different courses based on student learning styles.

REFERENCES
[1] Index of Learning Styles, designed by Felder & Silverman, NCSU, USA.
[2] Nilakant, K., & Mitrovic, A. (2005). Application of data mining in constraint-based intelligent tutoring systems. In Proceedings of the artificial intelligence in education, AIED (pp. 896–898).
[3] Romero, C., Ventura, S., & Bra, P. D. (2004). Knowledge discovery with genetic programming for providing feedback to courseware author. User Modeling and User-Adapted Interaction: The Journal of Personalization Research, 14(5), 425–464.
[4] Tang, T., & McCalla, G. (2005). Smart recommendation for an evolving e-learning system. International Journal on E-Learning, 4(1), 105–129.
[5] M. E. Otey, A. Ghoting, and A. Parthasarathy, "Fast Distributed Outlier Detection in Mixed-Attribute Data Sets," Data Mining and Knowledge Discovery.
[6] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, and A. Govardhan, "A Novel Approach to Find Outliers in Categorical Dataset," Elsevier AEMDS-2013, pp. 905-912.
[7] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, and A. Govardhan, "Outlier Analysis of Categorical Data using NAVF," Informatica Economica, vol. 17, Cloud computing issue 1, 2013.
[8] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, "Outlier Analysis of Categorical Data using FuzzyAVF," IEEE ICCPCT-2013, pp. 1259-1263.
[9] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, "Learning Styles Vs Suitable Courses," IEEE-MITE 2013, pp. 52-57.
[10] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, "Improving Classifier Accuracy through Outlier Analysis for Categorical Data Using Outlier Factor by Infrequency," IEEE ICCPCT-2013, pp. 1324-1328.
[11] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, and A. Govardhan, "Outlier Analysis of Categorical Data using NAVF," Informatica Economica, vol. 17, issue 1, 2013, pp. 5-13.
[12] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, and A. Govardhan, "A Novel Approach to Find Outliers in Categorical Dataset," Elsevier AEMDS-2013, pp. 905-912.
[13] Lakshmi Sreenivasa Reddy D., B. Raveendrababu, and A. Govardhan, "A Model for Improving Classifier Accuracy for Categorical Data Using Outlier Analysis," International Journal of Computers and Technology (IJCT), Volume 7, Issue 1, 2013, pp. 500-509.
[14] Yongsekim, "IEEE conference on outlier analysis of learners data based on user interface behaviors," 2007.
[15] M. Krishna Murthy, A. Govardhan, Lakshmi Sreenivasa Reddy D., "Outlier Analysis of Categorical Data Using Infrequency," International Journal of Computers and Technology (IJCT), Volume 8, No 3, 2013.
[16] M. Krishna Murthy, A. Govardhan, Lakshmi Sreenivasa Reddy D., "A model to find outliers in mixed attribute datasets using mixed attribute outlier factor," International Journal of Computer Science Issues (IJCSI), Volume 10, Issue 5, No 2, 2013.
[17] Urbancic, T., Skrjanc, M., & Flach, P. (2002). Web-based analysis of data mining and decision support education. AI Communications, 15, 199–204.
[18] Sheard, J., Ceddia, J., Hurst, J., & Tuovinen, J. (2003). Inferring student learning behaviour from website interactions: A usage analysis. Journal of Education and Information Technologies, 8(3), 245–266.
[19] Shen, R., Han, P., Yang, F., Yang, Q., & Huang, J. (2003). Data mining and case-based reasoning for distance learning. Journal of Distance Education Technologies, 1(3), 46–58.
[20] Shen, R., Yang, F., & Han, P. (2002). Data analysis center based on elearning platform. In Proceedings of the 5th international workshop on the internet challenge: technology and applications (pp. 19–28).
[21] Silva, D., & Vieira, M. (2002). Using data warehouse and data mining resources for ongoing assessment in distance learning. In IEEE international conference on advanced learning technologies, Kazan, Russia (pp. 40–45).
[22] Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 12–23.
[23] Wang, F. (2002). On using data-mining technology for browsing log file analysis in asynchronous learning environment. In Conference on educational multimedia, hypermedia and telecommunications (pp. 2005–2006).
[24] Wang, W., Weng, J., Su, J., & Tseng, S. (2004). Learning portfolio analysis and mining in SCORM compliant environment. In ASEE/IEEE frontiers in education conference (pp. 17–24).