Question Classification for E-learning by Artificial Neural Network

Ting Fei**, Wei Jyh Heng*, Kim Chuan Toh**, Tian Qi*
* Institute for Infocomm Research
** National University of Singapore

Abstract

Text categorization is the classification of unstructured text documents with respect to a set of one or more predefined categories. This paper describes our work in exploring automatic question classification for tests used in e-learning systems. Such tests can take the form of multiple-choice tests, as well as fill-in-the-blank and short-answer tests. We acquired 20 texts written for high school students from an e-learning web page; each text is followed by several multiple-choice questions. We propose a text categorization model using an artificial neural network trained by the backpropagation learning algorithm as the text classifier. Our test results show that the system achieves an F1 value of nearly 78%.

Keywords: E-learning, Text Categorization, Artificial Neural Network, Backpropagation, Question Classification, Feature Selection

1. Introduction

Facilitated by technology, e-learning could be an important industry of the 21st century. E-learning is the convergence of the web and learning on all levels, whether it is elementary school, college, or business.

In the last decade e-learning has undergone tremendous development in both technological and scientific terms. For this booming technique, the dynamic nature of the learning procedure, the diverse learner preferences, the customized learning content, and the automatic construction of learning scenarios remain crucial obstacles for most e-learning platforms. Among the numerous components of e-learning, assessment is an important part. Internet computer assisted assessment (CAA) can play both formative and summative roles in e-learning (e.g., practice questions and exams respectively). Dalziel [1] pointed out that one of the essential advantages of Internet CAA is "the networked nature of the approach, which provides for distribution of formative or summative assessment directly to multiple client computers with little or no additional hardware or software installation needed apart from standard access to the Web".

Most CAA tests use objective test questions, with answers being selected from or compared to a limited set of predefined, unequivocally correct responses to a question. The electronic marking of the responses is completely non-subjective because no judgment has to be made on the correctness or otherwise of an answer at the time of marking. Objective tests are particularly appropriate for informal self-assessment, giving students ongoing feedback about their progress.

Questions can also be grouped into banks according to difficulty, area of the curriculum, or type of skill being tested, such as vocabulary, comprehension, analysis, and application. Thus, once a question bank is built up, assessments can be designed by drawing a certain number of questions from each bank, thereby allowing a unique subset of questions to be chosen for each assessment or student when specific or personalized skills and levels of competence are to be examined.

The combination of question banks and Internet CAA systems can provide tests in a standard format and makes the dissemination of up-to-date material more convenient. However, CAA systems have limitations. Constructing good objective questions requires skill and practice and so is initially time consuming, and assessors and invigilators need training in assessment design and examination management.

It is necessary that the difficulty level of the questions in a bank be predefined. In practice, deciding which difficulty level a question belongs to is not an easy task: it is subject to a person's understanding and knowledge level, and given a text and its questions, different people may have different opinions on whether a question is difficult, medium, or easy. The most common method is to determine the statistical characteristics of each question through testing.

The objective of this paper is to investigate the effectiveness of neural network learning techniques as applied to automatic question classification. We use a computer to automatically classify questions into several difficulty levels, given a text and the corresponding answers. This project is restricted to materials in plain text format; that is, we do not consider texts or questions that contain graphs, pictures, or formulas, since such questions are more complicated and require other analytical methods such as graph theory.
2. Question Classification

Our problem is to classify objective questions into three difficulty levels: hard, medium, and easy, where the questions are raised from a given text. We aim to automatically classify multiple-choice questions, since the multiple-choice question is the most common type of objective question in exercises and tests; such tests can also take the form of fill-in-the-blank and short-answer tests. In our experiment, we ignore the wrong choices and explore the relation between the pairs of question and answer (Q&A) and the text. From this point of view, we can use essentially the same system architecture for both multiple-choice questions and short-answer questions.

The classification involves data preprocessing, feature selection, classification, and evaluation.

2.1 Data Preprocessing

2.1.1 Stop Words Refinement

We use a general stoplist as our reference to remove conjunctions, prepositions, articles, and other common words that occur with high frequency in articles. For our text-based question classification problem, there are also some words that occur frequently in the questions but are not common words in other situations.

Question
A primary purpose for building the Suez Canal was to
Which is an accurate statement about the partitioning of Africa by European imperialist nations during the 1800s?
The 19th-century term "White Man's Burden" reflects the idea that
The Sepoy Mutiny in India, the Boxer Rebellion in China, and the Islamic Revolution in Iran were similar in that they
In Europe, a major characteristic of humanism was
Table 1 Typical Questions

Table 1 gives some typical questions from our samples. Besides the usual common words, words such as "primary", "similar", "accurate", "reflects", and "characteristic" frequently occur in the questions. After comparing them with the original stoplist, we add these words and update our stoplist.

2.1.2 The Porter Stemming Algorithm

We use Porter's Algorithm (http://www.tartarus.org/~martin/PorterStemmer/) to do the stemming. Assuming that one is not making use of a stem dictionary, and that the purpose of the task is to improve information retrieval performance, the suffix stripping program will usually be given an explicit list of suffixes and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem.
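As an illustration of this preprocessing stage, the following Python sketch removes stop words and stems the remaining words. The stop-word set shown is only a small illustrative subset of our refined stoplist, and NLTK's PorterStemmer is assumed here as a stand-in for Porter's reference implementation cited above.

    # Illustrative preprocessing: stop-word removal followed by Porter stemming.
    import re
    from nltk.stem.porter import PorterStemmer  # assumed stand-in for Porter's code

    STOPLIST = {"the", "a", "an", "of", "to", "in", "was", "is", "that", "which",
                "primary", "similar", "accurate", "reflects", "characteristic"}
    stemmer = PorterStemmer()

    def preprocess(text):
        """Lower-case, tokenize, drop stop words, and stem the remaining words."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOPLIST]

    def term_frequencies(text):
        """Map each stemmed key word to its frequency of occurrence."""
        freqs = {}
        for term in preprocess(text):
            freqs[term] = freqs.get(term, 0) + 1
        return freqs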
2.2 Feature Selection

After the data preprocessing, the key words and their corresponding term frequencies for both the questions and the text are obtained. For feature selection, each question corresponds to one feature vector. How the difficulty level of a question follows from its feature vector is somewhat subjective, but some rules have been suggested. Raphael [2] offers powerful guidance for helping students analyze and understand questions. Her Question Answer Relationships (QARs) divide questions into two categories: those whose answers are supplied by an author ("in the book" QARs) and those whose answers need to be developed based on the reader's ideas and experiences ("in my head" QARs). QARs help students recognize the kind of thinking they need to engage in when responding to questions, and they also help us derive the feature vector of a question that determines its difficulty level.

Here we chose five components to characterize each question, so our feature vector is five-dimensional. The meaning of the five components is discussed below.

a) Query-Text Relevance

Two vectors are constructed to compute the relevance between the Q&A and the corresponding text. This idea comes from the Vector Space Model (VSM) [3]. One is the query vector, which is composed of the term frequencies of the Q&A. The term frequencies in the text of the key words of the Q&A make up the document vector. That is, we use the key words of the Q&A as the index to form these two vectors. If a key word does not occur in the text, the corresponding entry of the document vector is set to zero. Furthermore, duplicate key words in the Q&A are removed, so in most cases the query vector consists of ones. We use the dot product equation to represent the relevance between question and answer:

cos θ = (d • q) / (|d| |q|)

where d is the document vector, q is the query vector, and θ is the angle between them. If d and q are normalized so that their magnitudes are one, the preceding equation reduces to cos θ = d • q. The similarity score is thus the cosine of the angle between the vectors: the highest scoring document has the smallest angle between itself and the query.
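A minimal sketch of this query-text relevance computation is given below, assuming term-frequency dictionaries such as those produced by the preprocessing sketch in Section 2.1; function and variable names are illustrative.

    # Query-text relevance (feature a): cosine of the angle between the query
    # vector (term frequencies of the Q&A key words) and the document vector
    # (frequencies of those same key words in the text).
    import math

    def query_text_relevance(qa_freqs, text_freqs):
        keys = list(qa_freqs)                       # deduplicated Q&A key words
        q = [qa_freqs[k] for k in keys]             # query vector
        d = [text_freqs.get(k, 0) for k in keys]    # document vector; 0 if absent
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        if norm_q == 0 or norm_d == 0:
            return 0.0                              # no overlap with the text
        return dot / (norm_q * norm_d)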
b) Mean Term Frequency

Although cos θ measures the relevance between the Q&A and the text, it does not directly reflect the term frequencies: if the elements of the document vector are multiplied by a constant, the computed relevance does not change. In order to discriminate between such Q&A, we introduce the mean term frequency factor. As its name implies, the mean term frequency equals the mean value of the document vector.
c) Length of Q&A

The length of the question is also a factor that may influence its difficulty. One kind of multiple-choice question gives a short paragraph describing a certain phenomenon or historical event, followed by a question about the description. This kind of question usually has a longer question stem and requires more analytical ability to answer; from this point of view, it may be more difficult than other, shorter questions.

d) Term Frequency Distribution (Variance)

As mentioned above, term frequency is an essential aspect of the question difficulty level. However, the mean term frequency and the query-text relevance are not enough to represent the characteristics of the term frequency distribution. Therefore, we chose the variance to characterize the term frequency distribution.
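Under the same assumptions as above, features b and d are simple statistics of the document vector, for example:

    # Features b and d: mean and variance of the document vector entries,
    # i.e. the frequencies in the text of the Q&A key words.
    def mean_and_variance(qa_freqs, text_freqs):
        d = [text_freqs.get(k, 0) for k in qa_freqs]   # document vector
        if not d:
            return 0.0, 0.0
        mean = sum(d) / len(d)
        variance = sum((x - mean) ** 2 for x in d) / len(d)
        return mean, variance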
e) Distribution of Q&A in Text

For an easy question, the answer can generally be found directly in the text, and we often wish to find the sentences in the text that contain the answer to a given question, which makes things easier. We define the shortest distance of a Q&A to be the length of the shortest sequence of words in the text that contains most of the key words of the Q&A. If 80% of the information of the Q&A cannot be found within a range of 50 sequential key words, the shortest distance is set to a large number, meaning that the question is hard to answer from the limited information in the text.
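One possible reading of this feature is sketched below, assuming that both the text and the Q&A have already been reduced to stemmed key words; the 80% coverage threshold and the 50-word window follow the description above, while the penalty constant is illustrative.

    # Feature e: shortest run of consecutive text words that covers most
    # (here, at least 80%) of the Q&A key words. If no window of at most
    # 50 words reaches that coverage, a large penalty value is returned.
    LARGE_DISTANCE = 1000   # illustrative penalty constant

    def shortest_distance(text_tokens, qa_keywords, coverage=0.8, max_window=50):
        keys = set(qa_keywords)
        needed = max(1, int(round(coverage * len(keys))))
        best = None
        n = len(text_tokens)
        for start in range(n):
            seen = set()
            for end in range(start, min(n, start + max_window)):
                token = text_tokens[end]
                if token in keys:
                    seen.add(token)
                    if len(seen) >= needed:
                        length = end - start + 1
                        best = length if best is None else min(best, length)
                        break
        return best if best is not None else LARGE_DISTANCE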
2.3 Classification

The power of the artificial neural network (ANN) in text categorization has been demonstrated. We propose to use a two-layer feed-forward neural network trained by backpropagation as the text classifier. The output is a binary vector of dimension 3, i.e. the difficulty level is classified into three classes (easy, medium, hard).
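A minimal sketch of such a classifier is shown below: a feed-forward network with one hidden layer, trained by backpropagation on the five-dimensional feature vectors with three-dimensional binary target vectors. The learning rate, epoch count, and initialization are illustrative assumptions; the paper only varies the hidden-layer size (see Section 3.3).

    # Two-layer feed-forward network (one hidden layer) trained by
    # backpropagation. Inputs are 5-dimensional feature vectors; outputs are
    # 3-dimensional one-hot vectors for the classes easy / medium / hard.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_ann(X, Y, n_hidden=3, lr=0.1, epochs=5000, seed=0):
        """X: (n_samples, 5) features, Y: (n_samples, 3) one-hot targets."""
        rng = np.random.default_rng(seed)
        W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
        b1 = np.zeros(n_hidden)
        W2 = rng.normal(scale=0.5, size=(n_hidden, Y.shape[1]))
        b2 = np.zeros(Y.shape[1])
        for _ in range(epochs):
            # forward pass
            h = sigmoid(X @ W1 + b1)
            out = sigmoid(h @ W2 + b2)
            # backward pass (squared-error loss, sigmoid derivatives)
            d_out = (out - Y) * out * (1.0 - out)
            d_h = (d_out @ W2.T) * h * (1.0 - h)
            W2 -= lr * h.T @ d_out
            b2 -= lr * d_out.sum(axis=0)
            W1 -= lr * X.T @ d_h
            b1 -= lr * d_h.sum(axis=0)
        return W1, b1, W2, b2

    def predict(X, W1, b1, W2, b2):
        """Return the class index (0=easy, 1=medium, 2=hard) of the strongest output."""
        out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
        return out.argmax(axis=1)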
3. Performance Evaluation and Experimental Results

Our data set consists of 233 questions from 20 texts obtained from http://regentsprep.org/regents.cfm. The content was chosen randomly, and the questions have different difficulty levels. 170 questions (73%) are used as training data and the remaining 63 questions (27%) are used as test data. After removing words on the stop list and stemming, the length of the texts ranges from 82 to 1515.

3.1 Experimental Settings

Number of Texts              20
Number of Total Questions    233
Number of Easy Questions     73
Number of Medium Questions   54
Number of Hard Questions     106
Table 2 The set of categories used in the experiment

     Topic of Text                             # of Questions   Text Length
1    African Trading Kingdoms                  5                170
2    Byzantine                                 3                82
3    Chinese Communist Revolution              15               440
4    The Enlightenment                         7                302
5    Exploration                               28               828
6    The French Revolution                     6                229
7    Imperialism                               27               1129
8    Industrial Revolution                     3                390
9    Golden Age of Islam                       2                388
10   Japan                                     6                247
11   Independence Movements in Latin America   8                102
12   The Meiji Restoration                     6                120
13   Middle East                               11               508
14   The Cold War                              22               1192
15   Natural Selection                         12               442
16   Reformation                               14               349
17   Religions & Philosophies                  20               1515
18   The Renaissance                           11               374
19   Russian Revolution                        22               434
20   Scientific Revolution                     4                91
Table 3 Collections for Question Classification System

Category            Training Data   Test Data   Total
Easy Questions      52              21          73
Medium Questions    41              13          54
Hard Questions      77              29          106
Total Questions     170             63          233
Table 4 Data Distribution in the Training and Test Sets
3.2 Evaluation
As performance measures, we use recall and precision. These two parameters are defined as follows.

DEFINITION 1: Precision measures the portion of the assigned categories that were correct. It falls in the range from 0 to 1, with 1 being the best score.

Precision = (Number of correctly assigned questions) / (Number of questions assigned)

DEFINITION 2: Recall measures the portion of the correct categories that were assigned. It falls in the range from 0 to 1, with 1 being the best score.

Recall = (Number of correctly assigned questions) / (Number of questions that should be assigned)

The category assignment of a binary classifier can be evaluated using a two-way contingency table:

              YES is correct   NO is correct
Assigned YES  a                b
Assigned NO   c                d
Table 5 Contingency Table

Precision is defined as a/(a + b), and recall is defined as a/(a + c). These values may be averaged or summed to determine the efficiency for the set of categories as a whole. To average performance across categories, we use the micro-averaging and macro-averaging methods. Macro-averaging consists of computing recall and precision for every category and averaging afterwards. Micro-averaging adds up the numbers of correctly assigned items, assigned items, and items to be assigned over all categories, and then calculates a single value of recall and precision. Micro-averaged performance scores give equal weight to every document and are therefore considered a per-document average; macro-averaged performance scores give equal weight to every category, regardless of its frequency, and are therefore a per-category average. We can combine precision and recall into one measure, named F1, which is defined as

F1 = 2rp / (r + p)

where r and p are recall and precision, respectively.
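A small sketch of the micro-averaged measures follows the contingency-table definitions above: the per-category counts of correctly assigned, assigned, and to-be-assigned items are summed before a single precision, recall, and F1 value are computed. The label encoding (0 = easy, 1 = medium, 2 = hard) is an illustrative assumption.

    # Micro-averaged precision, recall and F1: sum the per-category counts of
    # correctly assigned items (a), assigned items (a + b) and items that
    # should have been assigned (a + c), then compute one precision/recall pair.
    def micro_f1(y_true, y_pred, categories=(0, 1, 2)):
        correct = assigned = should = 0
        for c in categories:
            correct += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
            assigned += sum(1 for p in y_pred if p == c)
            should += sum(1 for t in y_true if t == c)
        precision = correct / assigned if assigned else 0.0
        recall = correct / should if should else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1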
3.3 Experimental Results

Since the number of neurons in the hidden layer may cause under-fitting or over-fitting of the test results, we conduct experiments in which the number of hidden-layer neurons is varied from 3 to 10.

The results were evaluated by precision, recall, and F1 value, all micro-averaged across the different categories. We chose micro-averaging over macro-averaging because the number of documents belonging to each category varies considerably among the categories, so that micro-averaging, a per-document averaging, makes more sense.

Figure 1 shows the experimental results: (a) displays the ANN performance in terms of the least mean square (LMS) error, and (b) shows the test F1 value versus the number of hidden neurons.

Figure 1 Experiment Results
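The hidden-layer sweep behind Figure 1 can be sketched as follows, reusing the illustrative train_ann, predict, and micro_f1 helpers from the earlier sketches; the 170/63 split follows Table 4.

    # Sweep the number of hidden neurons from 3 to 10 and record the test F1.
    # X_train, Y_train, X_test, y_test are assumed to hold the feature vectors
    # and labels of the training/test split described in Table 4.
    def sweep_hidden_sizes(X_train, Y_train, X_test, y_test):
        results = {}
        for n_hidden in range(3, 11):
            params = train_ann(X_train, Y_train, n_hidden=n_hidden)
            y_pred = predict(X_test, *params)
            results[n_hidden] = micro_f1(y_test, y_pred)[2]   # keep the F1 value
        return results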
3.4 Discussions

The objective value (LMS error) of the ANN algorithm decreases as the number of hidden neurons is increased. When N is set to 10, the objective value is about 0.012, which means the trained ANN has little bias with respect to the targets. However, for our problem with 170 training samples, the ANN gives the best test result when the number of hidden neurons is set to N = 3 (≈78%), followed by N = 4 and N = 5 (≈76%). When N is larger, the test results get worse because of over-fitting, and if N is taken too small, under-fitting results are obtained. This verifies that the number of neurons in the hidden layer must be far smaller than the number of training samples in order to avoid over-fitting.

Overall, the performance of the ANN classifier in text categorization is good. Our method of question classification, especially the method of choosing the feature vector, is demonstrated to be applicable. The backpropagation ANN also shows promise for the automatic question classification problem.
4. Application

The automatic question classification system can be used widely in education to automate time-consuming tasks and reduce subjective error. In general, the classification system has the following applications:

a) According to the difficulty level, a teacher can assign different questions to different students. This kind of targeted teaching can improve education quality and efficiency.
b) As mentioned before, questions are usually grouped into question banks according to difficulty levels. Automatic question classification can therefore be used to do this job during the construction or updating of question banks.

c) It helps to standardize test sets. The current technique has to rely on large quantities of tests to give a reasonable classification. Using automatic question classification, a test set can take the form of a standard test without post-classification.
5. Future Work

We have several future research directions:

a) To find the most efficient classifier, a comparison of the ANN with SVM, kNN, and other methods can be conducted to investigate the performance of each method.

b) To further test the effectiveness of the proposed model and to increase the generality of the empirical study, more extensive experiments should be conducted using larger training and test sets.

c) One of the shortcomings of the statistical method used in our model is that it lacks semantic analysis. To strengthen the "semantic understanding" ability, supplemental natural language processing methods, such as expression normalization and sentence understanding, should be used during text preprocessing. In expression normalization [4, 5], synonymous expressions are replaced with one standard expression.

d) For multiple-choice questions, research on the relevance or similarity among the four choices could be carried out to find out the relation between them. The higher the relevance, the harder the question; our feature vector would then have one additional element.

6. Conclusion

In this paper, a classification model based on the popular backpropagation neural network was proposed for the text-based question classification task. Experiments were conducted using the proposed model to automatically classify text-based questions into three difficulty levels. We presented our method of feature selection, which provides an impetus for research on machine learning for question classification. The categorization effectiveness and feasibility of the proposed model were also tested empirically.

References

[1] J. Dalziel. "Integrating Computer Assisted Assessment with Textbooks and Question Banks: Options for Enhancing Learning". Fourth Annual Computer Assisted Assessment Conference, Loughborough, UK, June 2000.

[2] T. Raphael. "Teaching Question Answer Relationships, Revisited". The Reading Teacher, Vol. 39, pp. 516-522, June 1986.

[3] M. Lennon, D. Pierce, B. Tarry, and P. Willett. "An Evaluation of Some Conflation Algorithms for Information Retrieval". Journal of Information Science, Vol. 3, pp. 177-183, 1981.

[4] Taro Watanabe, Mitsuo Shimohata, and Eiichiro Sumita. "Statistical Machine Translation on Paraphrased Corpora". Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 1954-1957, Las Palmas, Canary Islands, Spain, May 2002.

[5] Mitsuo Shimohata and Eiichiro Sumita. "Automatic Paraphrasing Based on Parallel Corpus for Normalization". Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, Spain, May 2002.