Benchmarking Data Mining Methods in CAT

Ibrahim Furkan Ince1, Adem Karahoca2, and Dilek Karahoca2

1 School of Engineering, The University of Tokyo, Japan
2 Department of Software Engineering, Bahcesehir University, Istanbul, Turkey
[email protected], {akarahoca,dilek.karahoca}@bahcesehir.edu.tr
Abstract. In this study, the question-ranking problem of Computer Adaptive Testing (CAT) is benchmarked by employing three popular classifiers: Artificial Neural Network (ANN), Support Vector Machines (SVMs), and Adaptive-Network-Based Fuzzy Inference System (ANFIS), and comparing their ordinal classification performance. The "History of Civilization" course offered at Bahcesehir University is selected as the pilot test. Item Response Theory (IRT) is used to determine the system inputs, which are the item responses of students, the item difficulties of questions, and the question levels. The item difficulties of questions are Gaussian normalized to support ordinal decisions. The distance between predicted and expected class values is employed for accuracy estimation. The comparison study covers ordinal class prediction correctness and a performance analysis based on Receiver Operating Characteristic (ROC) graphs. The results show that ANFIS has better performance and higher accuracy than ANN and SVMs in ordinal question classification when the ordinal decisions are made via the Gaussian normal distribution and ROC graphs are used to detect significant differences among the classifiers.

Keywords: Artificial Intelligence (AI), Computer Adaptive Testing (CAT), Intelligent Question Classification, Artificial Neural Network (ANN), Adaptive-Network-Based Fuzzy Inference System (ANFIS), Support Vector Machines (SVMs).
1 Introduction
Distance education and e-learning have become important trends in computer-assisted teaching with Internet technologies. Since education consists of instruction and learning, instructors try to discover new strategies and learning methods that enhance learning on the learner's side. Education should be learner centered, and learning occurs in a cognitive manner in the learner's mind. Individuals have their own learning styles, and the learning performance of each person cannot be evaluated simply by counting the right and wrong answers on a test. Likewise, implementing the best Web-based system with the newest technologies is not by itself enough to improve student performance; the testing and development must be intelligent. For instance, if an instructor asks a question that the student cannot answer, an easier
question should be asked next. If the student answers correctly this time, a harder question should follow. The advantages of computer-assisted assessment (CAA) are pointed out in [1], which notes that CAA is a way to evaluate students on the Web with intelligent interfaces. This is the behavior of a real teacher in class, and the system must be developed by simulating this real environment on the Web. The system must generate questions intelligently, based on the responses and performance of the students during the evaluation sessions. This kind of testing is called Computer Adaptive Testing (CAT), a methodology that selects questions in order to maximize the performance of the examinee by observing his or her previous performance throughout the test [2]. It behaves as a teacher asking questions to a student in a real classroom: the difficulty of the test depends on the performance and ability of each student. Many researchers have recently studied personalized learning mechanisms that support online learning activities and help learners learn more effectively [3, 4, 5]. New personalized systems include the preferences, interests, and behaviors of users in their development process [6, 7]. Fei et al. study an e-learning project that classifies questions into classes based on the text of questions selected from a pool [8]; their system achieves 78% success in question classification. Hermjakob describes machine-learning-based parsing and question classification for question answering [9]. He uses parse trees, argues that they must be more oriented towards semantics, and matches questions with appropriate answers by a machine learning algorithm. Hacioglu and Ward also study a machine learning technique for question classification using SVMs [10]. More recently, Nguyen and Shimazu introduce an application of sub-tree mining to the question classification problem [11]; their results show that the method reaches a comparable or even better performance than kernel methods while improving testing efficiency. An integrated Genetic Algorithm (GA) and Machine Learning (ML) approach has also been studied for question classification in English–Chinese cross-language question answering [12]. Zhang and Lee present automatic question classification through machine learning, experimenting with five different algorithms: Nearest Neighbors (NN), Naive Bayes (NB), Decision Tree (DT), Sparse Network of Winnows (SNoW), and Support Vector Machines (SVMs) [13]; they describe how a tree kernel can be computed efficiently by dynamic programming. Van Zaanen, Pizzato, and Mollá introduce an approach in which structural information is extracted using machine learning techniques and the patterns found are used to classify the questions [14]. The major problem these systems face is that the derived ordinal decisions may be inconsistent, and combining multiple inputs of data is a challenge for decision analysis [30]. In other words, for intelligent question classification, each classifier needs a decision maker that relates the multiple input combinations of data and accordingly produces the desired output data type and format. In this post-paper [27, 28], the item difficulties of questions are estimated from the item responses and then Gaussian normalized in order to make the most accurate ordinal decisions. The Gaussian-normalized item difficulties are used to produce the third input of the classifier model, the question levels.
Initial difficulty levels and item difficulties are identified in order to initialize the training process for the first time; afterwards, the software runs the classification automatically once learning ends. Practically, question levels and item difficulties serve as a decision maker
that makes an ordinal classification with respect to the third input of the model, the item responses. In other words, without a decision-maker input model, none of the classifiers can classify the questions with respect to the item responses alone. Hence, our system employs a practical method for creating an appropriate decision maker for ordinal classification. In the system design, the item difficulties of the questions are Gaussian normalized only once, at the very beginning; in each subsequent classification process, the initial class values and initial item difficulties are used. The system administrator is tasked with updating the question levels and initial item difficulty values at the end of each test session. Finally, ANN, SVMs, and ANFIS are employed in order to test the performance of the system. According to the ROC graph analysis using the Area Under the Curve (AUC), ANFIS leads significantly in ordinal decision performance, even though the numerical accuracy rates are too close to each other to be significant: 99% correctness for ANFIS, 98% for ANN, and 97% for SVMs.
2 Methods
In recent years, with the proliferation of Web applications, Internet users face difficulties in finding the specific information they are looking for [15, 16]. There are huge information resources on the Web, and similarly, many multiple-choice tests are available to learners on various e-learning systems. These e-learning systems hold millions of questions from different topics in a question pool, but the difficulty levels are generally not determined. The aim here is to categorize the questions intelligently into five groups: 1) Very Easy; 2) Easy; 3) Average; 4) Hard; and 5) Very Hard. The methods should operate at run time within the educational Web-based testing application. Automated classification quality, in terms of accuracy and performance, is the most significant parameter in the evaluation of the entire system. In this regard, three popular classifiers, Artificial Neural Network (ANN), Support Vector Machines (SVMs), and Adaptive-Network-Based Fuzzy Inference System (ANFIS), are employed and benchmarked. The methods and classifiers are explained below, and the corresponding performance evaluation is performed.
2.1 Data Gathering and Preprocessing
An electronic quiz system, a Web application running on an Intranet backbone, is used to perform the CAT. All students are assigned to computers and work independently. The questions and the possible answers are presented to students in random order. In total, 5018 question-and-answer records are taken into consideration for classification. Questions are categorized into five groups: very easy (-1.0), easy (-0.5), average (0.0), hard (0.5), and very hard (1.0). If an item response is correct, it is scored as one; otherwise (a wrong response or no response) it is scored as zero. In the testing phase, some students took the test more than once due to technical problems; in such cases, their best scores are included in the evaluation.
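As a rough illustration of this preprocessing step, the sketch below scores a hypothetical response log and keeps the best score per student and question; the column names and sample records are ours, not from the original system.

```python
import pandas as pd

# Hypothetical raw quiz log: one row per (student, question) attempt.
# Column names and values are illustrative placeholders.
log = pd.DataFrame({
    "student":  [101, 101, 102, 102, 101],
    "question": [1, 2, 1, 2, 1],
    "response": ["A", "C", "B", None, "B"],
    "correct":  ["B", "C", "B", "D", "B"],
})

# Score each attempt: 1 for a correct response, 0 for wrong or missing.
log["score"] = (log["response"] == log["correct"]).astype(int)

# A student may retake the test; keep only the best score per item,
# mirroring the paper's handling of repeated sessions.
scores = (log.groupby(["student", "question"])["score"]
             .max()
             .unstack(fill_value=0))
print(scores)
```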
2.2 Item Response Theory - Item Difficulty
Item Response Theory (IRT) plays an important role in educational measurement [4, 17]. The theory is a statistical framework in which examinees are described by a set of ability scores in certain fields [18]. It is mostly applied in computerized adaptive testing to select the most suitable items for users based on their ability [19, 29]. Traditionally, item difficulty is considered an important score, with items scaled in a range from 0.00 to 1.00. Item difficulty is inversely proportional to the number of correct answers for each question, so the question with the fewest correct answers is considered the hardest question in a test. Following this approach, the item difficulty of each question is estimated in order to classify the questions with respect to their difficulty levels. This is necessary for the initial normalized classification, which is used as a reference for performance evaluation. Item difficulty is defined by Equation 1, where ID, MSCA, and SCAE refer to the item difficulty, the minimum sum of correct answers over all questions, and the sum of correct answers of each question, respectively.

$ID = \frac{MSCA}{SCAE}$    (1)
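A minimal sketch of Equation 1, reusing the 0/1 score matrix from the preprocessing sketch above; the variable names are ours.

```python
import numpy as np

# Sum of correct answers per question (column sums of the 0/1 matrix).
sum_correct = scores.sum(axis=0).to_numpy().astype(float)

# Equation 1: ID = MSCA / SCAE. The question with the fewest correct
# answers receives the maximum item difficulty of 1.0, as for Q13 in Table 1.
item_difficulty = sum_correct.min() / sum_correct
print(dict(zip(scores.columns, item_difficulty.round(2))))
```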
Table 1. Item Difficulty versus Question Levels

Q    1     2     3     4     5     6     7     8     9     10    11    12    13
ID   0.69  0.71  0.57  0.73  0.65  0.62  0.81  0.91  0.65  0.76  0.68  0.86  1.00
QL   0.0   1.0   0.0   -0.5  0.0   -0.5  -0.5  0.5   0.5   -0.5  0.0   -0.5  0.5
Using Equation 1, the item difficulty of each question is estimated. Table 1 shows the corresponding item difficulty of each question, where ID represents the item difficulties of the questions and QL refers to the question levels, which are initially Gaussian normalized.

2.3 Normal Distribution and Selection of Activation Function for Classifiers
In probability theory and statistics, the normal distribution is the most widely used continuous probability distribution; the area under its bell curve integrates to 1, and it maximizes information entropy for a given mean μ and variance σ2. In this study, classification performance is evaluated with respect to Gaussian-normalized initial class values, which deviate evenly around the mean of the class values and thus yield less noisy error rates in terms of classification accuracy. Following this approach, the question levels are defined as ordinal values to be classified into five categories from easiest to hardest. The hyperbolic tangent (tanh) is chosen as the activation function because it fits naturally to the ordinal interval of question levels between [-1, 1]. Equation 2 defines this activation function as follows:
$F(n) = \tanh(n) = \frac{e^{n} - e^{-n}}{e^{n} + e^{-n}}$    (2)

The range of this activation function is between -1 and 1, and Table 1 shows the initial item classes (question levels) after Gaussian normalization.
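The paper does not spell out how the Gaussian-normalized difficulties are snapped onto the five levels; the sketch below is one plausible reading, with z-scoring, a tanh squash into (-1, 1), and nearest-level snapping as our assumptions.

```python
import numpy as np

LEVELS = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # very easy ... very hard

def gaussian_normalize(x):
    """Standardize to zero mean and unit variance (z-scores)."""
    return (x - x.mean()) / x.std()

def to_question_level(item_difficulty):
    # Squash the z-scored difficulties into (-1, 1) with tanh, the
    # activation interval used by the classifiers, then snap each value
    # to the nearest of the five ordinal levels. The snapping rule is
    # our assumption; the paper does not state the exact mapping.
    z = np.tanh(gaussian_normalize(item_difficulty))
    return LEVELS[np.abs(z[:, None] - LEVELS[None, :]).argmin(axis=1)]

ql = to_question_level(item_difficulty)  # `item_difficulty` from the sketch above
print(ql)
```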
2.4 ANN Classification
First, the classification is performed by an ANN, which is built from the neuron model, the most basic component of a neural network [20]. A neuron receives a set of input values xi, which are multiplied by weighting coefficients wi; the sum of the weighted signals, n, is the net input of the neuron [20]. An activation function F converts the net input of the neuron into an output signal, which is transmitted to other neurons [21]. In the ANN classification, 50% of the data is used for training, 25% for validation, and 25% for testing. The number of inputs is three, and thirteen patterns are considered in four hidden layers with five hundred epochs.
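The original implementation is not listed; the sketch below builds a comparable network with scikit-learn's MLPRegressor, assuming illustrative layer widths and synthetic placeholder data in place of the 5018 real response records. The train/validation/test split is reused by the SVMs sketch in the next subsection.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# X: one row per response record with the three inputs used in the paper
# (item response, item difficulty, question level); y: target class value.
# Both are random placeholders here.
rng = np.random.default_rng(0)
X = rng.random((5018, 3))
y = rng.choice([-1.0, -0.5, 0.0, 0.5, 1.0], size=5018)

# 50% training, 25% validation, 25% testing, as in the paper.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.50, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

# Layer widths are our assumption; the paper states four hidden layers
# and 500 epochs but not the number of units per layer.
ann = MLPRegressor(hidden_layer_sizes=(16, 16, 16, 16),  # four hidden layers
                   activation="tanh",                    # fits the [-1, 1] interval
                   max_iter=500,                         # 500 epochs
                   random_state=0)
ann.fit(X_train, y_train)
print("validation R^2:", ann.score(X_val, y_val))
```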
2.5 SVMs Classification
Unlike ANN, SVMs are based on statistical learning theory and minimize the structural risk rather than the training error, which yields an upper bound on the generalization error [22]. The only parameter that needs to be set is a penalty term for misclassification, which determines a trade-off between resolution and generalization performance [22, 23]. The algorithm searches for a unique separating decision surface, supported statistically by the cluster vectors, and defines the classification output as a linear combination of fractional data points, the support vectors, whose inputs are obtained from samples extracted from the training data. Basically, SVMs classify a pattern vector X into a class y ∈ {-1, 1} based on the support vectors X_m and corresponding classes y_m as:

$y = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m K(X_m, X) - b\right)$    (3)

where K(.,.) is a symmetric, positive-definite kernel function which can be freely chosen within mild constraints. The parameters α_m and b are determined by a linearly constrained quadratic programming (QP) problem, which is implemented either as a sequence of smaller-scale sub-problem optimizations or as an incremental scheme that adjusts the solution one training point at a time [22]. Typically, most of the training data X_m have zero coefficients α_m; the non-zero coefficients returned by the constrained QP optimization construct the set of support vectors [22, 24]. Our system assumes that the support vectors and coefficients α_m are Gaussian normalized, which eases obtaining the actual experimental performance and aims at an efficient run-time implementation of the classifier. For this purpose, a radial basis function kernel is employed, and the same inputs as for the ANN and the same data percentages are used for training, validation, and testing.
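A corresponding sketch with scikit-learn's RBF-kernel SVC, reusing the split from the ANN sketch; the penalty value C is an illustrative default, not the paper's setting.

```python
from sklearn.svm import SVC

# A radial-basis-function SVM on the same three inputs. C is the
# misclassification penalty the text refers to.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)             # y holds the five ordinal levels
print("test accuracy:", svm.score(X_test, y_test))
```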
2.6 ANFIS Classification
A Fuzzy Logic System (FLS) can be considered a non-linear mapping from the input space to the output space that employs Gaussian fuzzy sets and linear rule outputs; the parameters of the network are the means and standard deviations of the membership functions and the coefficients of the output linear functions [25]. Hence, the number of membership functions, the number of fuzzy rules, and the number of training epochs are important factors in the design of the classifier model. Figure 1 shows the architectural model design of ANFIS.

[Fig. 1. ANFIS Architecture [25]: membership-function nodes (A1, A2, B1, B2), product and normalization nodes producing firing strengths w1 and w2, weighted rule outputs w1f1 and w2f2, and a summation node yielding the output f.]
The aim of the training process is to minimize the training error between the ANFIS output and the actual class, which allows the fuzzy system to learn features from the input data and to employ these features in the model rules [25]. In this study, the model is designed around the questions of the testing system, with three features in which the item difficulties are Gaussian normalized for the ordinal decision rules. Following the feature extraction process, three inputs are fed into the ANFIS model and one output variable is obtained; the last node calculates the summation of all rule outputs [26]. The same inputs as in the ANN and SVMs classifications are used, and the same question classes are obtained.
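No standard ANFIS implementation is referenced, so the sketch below shows only the forward pass of a small first-order Sugeno ANFIS with Gaussian membership functions, following the layer structure of Fig. 1; the rule count and parameter values are illustrative, and a real system would learn them by hybrid training.

```python
import numpy as np

def gauss_mf(x, mean, sigma):
    """Gaussian membership function used in layer 1 of ANFIS."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

def anfis_forward(x, mf_params, rule_coeffs):
    """One forward pass of a first-order Sugeno ANFIS.

    x           : (3,) input vector (item response, difficulty, level)
    mf_params   : (n_rules, 3, 2) mean/sigma per input per rule
    rule_coeffs : (n_rules, 4) linear output coefficients (3 weights + bias)
    """
    # Layers 1-2: membership degrees and rule firing strengths (product T-norm).
    mu = gauss_mf(x, mf_params[..., 0], mf_params[..., 1])   # (n_rules, 3)
    w = mu.prod(axis=1)                                      # (n_rules,)
    # Layer 3: normalized firing strengths.
    w_bar = w / w.sum()
    # Layers 4-5: rule outputs f_i = p*x1 + q*x2 + r*x3 + s, weighted sum.
    f = rule_coeffs[:, :3] @ x + rule_coeffs[:, 3]
    return float(w_bar @ f)

# Illustrative parameters for two rules.
rng = np.random.default_rng(0)
params = rng.random((2, 3, 2)) + 0.1   # keep sigmas strictly positive
coeffs = rng.random((2, 4))
print(anfis_forward(np.array([1.0, 0.71, 1.0]), params, coeffs))
```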
3 Experimental Results
The three artificial intelligence methods, ANN, SVMs, and ANFIS, are evaluated with respect to the distance between actual and predicted class values. The results show that ANFIS predicts the values nearest to the Gaussian-normalized actual class values (QL); these are listed in Table 2 together with the corresponding predicted classes. In addition to the numerical correctness evaluation, ROC graphs, with the false-positive rate on the X axis and the true-positive rate on the Y axis, are employed as a tool for performance analysis. The area under the ROC graph (AUC) is used as a measure of similarity between two classes.
[Table 2. Outputs of ANN, SVMs, and ANFIS: the predicted class value of each classifier for each of the 13 questions, alongside the Gaussian-normalized actual question level (QL).]
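The paper states that accuracy is estimated from the distance between predicted and expected class values but does not give the exact formula; the sketch below assumes a mean absolute distance normalized by the class range, with illustrative numbers.

```python
import numpy as np

def ordinal_accuracy(expected, predicted, scale=2.0):
    """Accuracy from the mean absolute distance between expected and
    predicted class values, normalized by the class range [-1, 1].
    The normalization is our assumption; the paper only states that
    the distance between classes is used for accuracy estimation."""
    distance = np.abs(np.asarray(expected) - np.asarray(predicted))
    return 1.0 - distance.mean() / scale

ql_actual = np.array([0.0, 1.0, 0.0, -0.5])               # levels as in Table 1
predicted = np.array([0.0031, 0.9923, 0.0006, -0.4978])   # illustrative outputs
print(f"accuracy: {ordinal_accuracy(ql_actual, predicted):.4f}")
```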
The performance of a classifier is considered inversely proportional to the AUC, which always takes values between 0.5 and 1.0. Further, the standard error estimate under the area of the ROC curve, SE(w), is used to observe the average deviation of the ROC results. As the criterion for measuring the ranking quality of the classifiers, a confidence interval is determined. Table 3 shows the ROC graph analysis of the employed classifiers.

Table 3. ROC Graph Analysis of Classifiers

Performance of ANN classification
Class  TPR     FPR     Area
-1     1.0000  1.0000  0.0000
-0.5   1.0000  1.0000  0.3530
0      0.8089  0.60•7  0.2485
0.5    0.7664  0.2942  0.1698
1      0.6791  0.0592  0.0201
AUC 0.7914, SE(w) 0.0227, confidence interval [0.747, 0.836]
[ROC curve for the ANN classifier]
Performance of SVMs classification
Class  TPR     FPR     Area
-1     1.0000  1.0000  1.0000
-0.5   1.0000  1.0000  0.3815
0      0.9459  0.6079  0.2855
0.5    0.8773  0.2948  0.1855
1      0.7186  0.0624  0.0224
AUC 0.8748, SE(w) 0.0215, confidence interval [0.833, 0.917]
[ROC curve for the SVMs classifier]
Performance of ANFIS classification
Class  TPR     FPR     Area
-1     1.0000  1.0000  1.0000
-0.5   1.0000  1.0000  0.2875
0      0.4956  0.6156  0.1493
0.5    0.4735  0.3074  0.0940
1      0.3407  0.0765  0.0130
AUC 0.5438, SE(w) 0.1002, confidence interval [0.347, 0.740]
[ROC curve for the ANFIS classifier]
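The per-class (TPR, FPR) points of Table 3 can be recomputed from hard class predictions as sketched below; the trapezoidal AUC is a generic choice, and the paper's SE(w) and confidence-interval computations are not specified, so they are omitted.

```python
import numpy as np

def roc_points(y_true, y_pred, levels=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """One (TPR, FPR) point per ordinal class, treating each level
    in turn as the positive class, as laid out in Table 3."""
    points = []
    for lvl in levels:
        pos = (y_true == lvl)                        # actual positives
        hit = (y_pred == lvl)                        # predicted positives
        tpr = (hit & pos).sum() / max(pos.sum(), 1)
        fpr = (hit & ~pos).sum() / max((~pos).sum(), 1)
        points.append((tpr, fpr))
    return points

def auc_trapezoid(points):
    """Trapezoidal area under the (FPR, TPR) points, anchored at
    (0, 0) and (1, 1)."""
    pts = sorted((fpr, tpr) for tpr, fpr in points)
    fpr = np.array([0.0] + [p[0] for p in pts] + [1.0])
    tpr = np.array([0.0] + [p[1] for p in pts] + [1.0])
    return np.trapz(tpr, fpr)

# Illustrative use with hard (already snapped) class predictions.
y_true = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 0.0, 0.5])
y_pred = np.array([-1.0, -0.5, 0.0, 0.0, 1.0, 0.0, 0.5])
print(auc_trapezoid(roc_points(y_true, y_pred)))
```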
4 Discussion
The Gaussian normal distribution is employed to determine the ordinal decision rules. In contrast to the classification correctness values, which differ insignificantly (0.99 for ANFIS, 0.98 for ANN, and 0.97 for SVMs), the ROC graphs show a clear performance analysis in which ANFIS behaves most distinctly in terms of the integral distance from the normal line, that is, the area between the ROC curve and the normal line. The ROC graphs show that the lowest error rates belong to ANFIS and the highest to SVMs. ANFIS approaches the 45-degree normal line faster than the others. The ROC curves also show, for each method, how it approaches correct classification: if the path traced by the ROC curve comes closer to the normal line and proceeds to its end point, the method has better performance, a better rate of true classification, and a smaller error of correctness in classification. ANFIS has the closest path to the normal line overall and approaches it more quickly than the others, while ANN stays at a smaller distance from the normal line than SVMs. According to the experimental results, ANFIS is significantly the best of the three methods for classifying the question levels.
5 Conclusion
In this study, an ordinal classification model built on three strong methods (ANN, SVMs, and ANFIS) is benchmarked on multiple-choice question classification. Experiments are conducted to classify questions into five difficulty levels. Practically, the decision rules for the ordinal classification problem are made via the Gaussian normal distribution. The effectiveness of the ANN, SVMs, and ANFIS methods is evaluated by comparing the performance and class correctness for the sample questions (n = 13) using the same three inputs: item responses, item difficulties, and question levels (5018 rows of data, the item responses of students on a test composed of 13 questions). The comparative performance analysis, using classification correctness and ROC analysis, reveals that the Adaptive-Network-Based Fuzzy Inference System (ANFIS) yields better performance than the Artificial Neural Network (ANN) and Support Vector Machines (SVMs) under the particular conditions of the experiment. This study provides an impetus for research on machine learning and artificial intelligence for question classification in Computer Adaptive Testing (CAT) applications.
References

1. Dalziel, J.: Integrating Computer Assisted Assessment with Textbooks and Question Banks: Options for Enhancing Learning. In: Fourth Annual Computer Assisted Assessment Conference, Loughborough, UK (2000)
2. Brusilovsky, P., Peylo, C.: Adaptive and Intelligent Web-based Education Systems. International Journal of Artificial Intelligence in Education 13, 156–169 (2003)
3. Brusilovsky, P.: Adaptive Educational Systems on the World-Wide-Web: A Review of Available Technologies. In: Proceedings of the Fourth International Conference on Intelligent Tutoring Systems, Workshop on WWW-based Tutoring, San Antonio, TX (1998)
4. Chen, C., Duh, L.: Personalized Web-based Tutoring System Based on Fuzzy Item Response Theory. Expert Systems with Applications 34, 2298–2315 (2008)
5. Chen, C., Lee, H., Chen, Y.: Personalized E-learning System Using Item Response Theory. Computers and Education 44(3), 237–255 (2005)
6. Ioannis, H., Jim, P.: Using a Hybrid Rule-based Approach in Developing an Intelligent Tutoring System with Knowledge Acquisition and Update Capabilities. Expert Systems with Applications 26, 477–492 (2004)
7. Lee, M.: Profiling Students' Adaptation Styles in Web-based Learning. Computers and Education 36, 121–132 (2001)
8. Fei, T., Heng, W.J., Toh, K.C., Qi, T.: Question Classification for E-learning by Artificial Neural Network. In: Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and the Fourth Pacific Rim Conference on Multimedia, vol. 3, pp. 1757–1761 (2003)
9. Hermjakob, U.: Parsing and Question Classification for Question Answering. In: Proceedings of the Workshop on Open-domain Question Answering, Annual Meeting of the ACL, Toulouse, France, vol. 12, pp. 1–6 (2001)
10. Hacioglu, K., Ward, W.: Question Classification with Support Vector Machines and Error Correcting Codes. In: Proceedings of HLT-NAACL 2003, Edmonton, Alberta, Canada, pp. 28–30 (2003)
11. Nguyen, M.L., Shimazu, A., Nguyen, T.T.: Sub-tree Mining for Question Classification Problem. In: Twentieth International Joint Conference on Artificial Intelligence (IJCAI 2007), Hyderabad, India, pp. 1695–1700 (2007)
12. Day, M., Ong, C., Hsu, W.: Question Classification in English-Chinese Cross-Language Question Answering: An Integrated Genetic Algorithm and Machine Learning Approach. In: Proceedings of the IEEE International Conference on Information Reuse and Integration (IEEE IRI 2007), Las Vegas, Nevada, USA, pp. 203–208 (2007)
13. Zhang, D., Lee, W.S.: Question Classification Using Support Vector Machines. In: Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Toronto, Canada (2003)
14. van Zaanen, M., Pizzato, L.A., Mollá, D.: Question Classification by Structure Induction. In: Proceedings of the International Joint Conference on Artificial Intelligence, Edinburgh, UK, pp. 1638–1639 (2005)
15. Berghel, H.: Cyberspace 2000: Dealing with Information Overload. Communications of the ACM 40(2), 14–24 (1997)
16. Kobayashi, M., Takeda, K.: Information Retrieval on the Web. ACM Computing Surveys 32(2), 144–173 (2000)
17. Baker, F.B.: Item Response Theory: Parameter Estimation Techniques. Marcel Dekker, New York (1992)
18. Rudner, L.M.: An Online Interactive Computer Adaptive Testing Tutorial (2006), http://edres.org/scripts/cat/catdemo.htm
19. Wainer, H.: Computerized Adaptive Testing: A Primer. Lawrence Erlbaum Associates, Hillsdale (1990)
20. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall International Inc., Englewood Cliffs (1999)
21. Dalton, J., Deshmane, A.: Artificial Neural Networks: An Approach to Increasing Machine Intelligence. IEEE Potentials, 31–33 (1991)
22. Genov, R., Cauwenberghs, G.: Kerneltron: Support Vector "Machine" in Silicon. IEEE Transactions on Neural Networks 14(5) (2003)
23. Girosi, F., Jones, M., Poggio, T.: Regularization Theory and Neural Networks Architectures. Neural Computation 7, 219–269 (1995)
24. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
25. Alturki, F.A., Abdennour, A.B.: Neuro-fuzzy Control of a Steam Boiler-turbine Unit. In: Proceedings of the 1999 IEEE International Conference on Control Applications, vol. 2, pp. 1050–1055 (1999)
26. Jang, J.-S.R.: ANFIS: Adaptive-Network-based Fuzzy Inference System. IEEE Transactions on Systems, Man, and Cybernetics 23(3), 665–685 (1993)
27. Karahoca, A., Karahoca, D., Ince, F.: ANFIS Supported Question Classification in Computer Adaptive Testing (CAT). In: ICSCCW 2009, Fifth International Conference on Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control, pp. 1–4. IEEE, Famagusta (2009)
28. Ince, I.F.: Intelligent Question Classification for E-Learning Environments by Data Mining Techniques. Master's Thesis, Institute of Science, Computer Engineering, Bahcesehir University, Istanbul, Turkey (2008)
29. Brusilovsky, P.: Adaptive and Intelligent Technologies for Web-based Education. In: Rollinger, C., Peylo, C. (eds.) Künstliche Intelligenz, Special Issue on Intelligent Systems and Teleteaching, pp. 19–25 (1999)
30. Hu, Q., Yu, D., Guo, M.: Fuzzy Preference Based Rough Sets. Information Sciences, Special Issue on Intelligent Distributed Information Systems 180(10), 2003–2022 (2010)