2013 International Conference on Computer, Control, Informatics and Its Applications
Text Message Categorization of Collaborative Learning Skills in Online Discussion Using Support Vector Machine Erlin
Unang Rio
Rahmiati
Department of Information Technology STMIK-AMIK Riau Pekanbaru, Indonesia
[email protected]
Department of Information Technology STMIK-AMIK Riau Pekanbaru, Indonesia
[email protected]
Department of Informatics Management STMIK-AMIK Riau Pekanbaru, Indonesia
[email protected]
Abstract—This paper presents a method for text messages categorization of collaborative learning skills in online discussion. Despite of support vector machine is a popular method for text categorization in the research area of machine learning, unfortunately, the use of support vector machine in an educational setting is rare. Usually, text categorization by support vector machine is employed to categorize news article, emails, product reviews and web pages. In an online discussion, text categorization that used to classify the message send by student into a certain category is often manual, requiring human skilled specialists. However, human categorization is not an effective way for number of reasons; time-consuming, labor intensive, lack of consistency in category and costly. Therefore, this paper proposes support vector machine approach to automatic code the message. Results show that support vector machine achieving good classification on eight categories of collaborative learning skill in online discussion as measured based on the error rate, precision, recall and balanced Fmeasure. Keywords—text categorization, online discussion, support vector machine
I.
INTRODUCTION
Text categorization is a prominent method in the research into collaborative learning and active research area in information retrieval and machine learning. Text categorization refers to solving the problem to classify documents into a certain number of pre-defined categories based on their content. As the volume of message in online discussion increases and categorize the message into some classes is needed, thus, categorizing and coding of text message by human are time-consuming, labor intensive and costly. Dumais [1] stated that the weaknesses of human systems are inconsistency in assigning category and needs to adapt to changing category structures. Online discussion presents major challenges to the existing text categorization technique. Online discussion messages are usually incomplete, error-prone, and poorly structured [2]. In an online discussion, text categorization that used to classify the message send by student into a certain category is often
c 978-1-4799-1078-6/13/$31.00 2013 IEEE
manual, requiring skilled specialists. Therefore, how to use a various computer technologies to auto-coding the message is subject of great research value. In recent years, research has shown that there has been extensive study and actively explored various machine learning for text document categorization and classification. Among these are Bayesian network classifier [3], k-nearest neighbor classifier [4] and decision tree [5]. These methods are conventional learning methods compared to those new approaches, however, it has simple algorithms and relatively high efficiency. Recently, a number of researchers have been proposed some new approaches and models. For examples, there are neural network [6], support vector machines [7], fuzzy kmeans [8] and maximum entropy models [9] also have good results. Gabrilovich & Markovitch [10] have been proved that support vector machines (SVM) is one of the best algorithms for text categorization. Meanwhile, Yu, ben Xu & hua Li [11] have argued that neural network also a popular categorization method that can handle linear and non linier problems for text categorization, and both of linear [12] and non linear [13] support vector machine classifier can achieve good result. Unfortunately, the use of support vector machine in an educational setting is rare. Most of support vector machine is used to categorize news article, emails, product reviews and web pages. Hence, this research is focused on automated text message categorization in online discussion using support vector machine that can be trained on a large corpus of messages and associates a high score to pairs of messagecategory that appear in the corpus data. The rest of the paper is organized as follows: In the next section, we describe the collaborative learning skill in online discussion. Section 3, explains the method of this research. The experiment is described in section 4. The experimental results are given in section 5. Finally, the conclusions are given in section 6.
295
II.
collaborative learning skill category based on Soller’s model which is modified version of McManus and Aiken’s Collaborative Skills Network [16] is adopted for this research. In Soller’s model each conversation act is assigned a sentence opener indicating the act’s intention. Students communicated through a sentence opener interface by initiating each contribution with one of the key phrases, which conveys the appropriate dialogue intention.
COLLABORATIVE LEARNING SKILL
Many educators are really aware of the use of collaborative learning and information and communication technology (ICT) to facilitate the learning process for the benefit of their students. Gokhale [14] have argued that collaborative learning is about groups of students, and groups of students and teachers, constructing knowledge together. In collaborative learning that student has various performance levels work together in small groups and the students are responsible for one another's learning as well as their own. Hence, the success of one student helps other students to be successful.
The weakness of the sentence opener approach is limitations in using of ideas or thinking that will be delivered in a discussion. Each student communicated by sentence opener first before posting messages on the online discussion. Therefore, this research is designed to automate text categorization, hence student feel free to deliver their idea without being limited by the sentence opener that has been set previously by the system.
Skill in learning collaboratively means knowing when and how to question, inform, and motivate one’s teammates, knowing how to mediate and facilitate conversation, and knowing how to deal with conflicting opinions [15]. The TABLE I.
No
DEFINITIONS AND EXAMPLE OF COLLABORATIVE LEARNING CONVERSATION SKILLS
Category
Description
1
Acknowledge
:
Inform peers that you read and/or appreciate their comments. Answer yes/no questions.
2
Discuss
:
Reason (positively or negatively) about comments or suggestions made by team members.
3
Inform
:
Direct or advance the conversation by providing information or advice.
4
Maintenance
:
5
Mediate
:
6
Motivate
:
Providing positive feedback and reinforcement.
7
Request
:
Ask for help/advice in solving the problem, or in understanding a team-mates comment.
8
Task
:
Shift the current focus of the group to a new subtask or tool.
Support group cohesion and peer involvement. Recommended an instructor intervene to answer a question.
Table 1 offers brief descriptions of the sub-skill categories of Soller’s model. We take eight sub-skill categories; request, inform, motivate, task, maintenance, acknowledge, discuss and mediate; to be implemented and tested using support vector machine. III.
Examples - Thanks friends for the response that you all give for this question 1. - Ok, I can accept this answer and thanks for your cooperation. - I agree because the example from our lecturer is very clear. - However, I think we can more focusing on the content. - According to my answer before, I have give the explanation about the errors or mistakes that have been occur based on Minicase 2. - At last I know what functional information system is and also provide examples. - Excuse me, but I don't think that's what you meant at all. - Sorry if I was a bit technical answer. - We should ask our lectures when we must submit our assignment. - Very good, your idea is brilliant. It is help us to solve our problem. - Do not give up, you have time to do your assignment until next week. - Could you help me to find an application for doing this task? - Please answer the question below and elaborate your answer based on Chapter 7. - Hi, let's start to discuss about work division. - Are you ready my friend? We will start our discussion on sorting data method.
Information Systems held on Moodle as learning management systems (LMS) in e-learning is examined for one discussion topic. This topic is chosen because it has highest replies than other topics. There are 29 students completed the topic of discussion. The total numbers of messages in the online discussion (that are replies to somebody’s message) are 394 messages.
METHOD
In text categorization method, documents are to be classified into a certain number of categories. In this research, each document represents a message in online discussion. Message on the online discussion forum from one subject SCK3433-02 2008/2009: Management of Organization
A document from online discussion is composed of discussion subject, user name, user ID, date, time and contents of the message. In our approach, we use only the contents of the message. The content of the message is important to analyze the dynamics of the forum and what kind
296
Where d is a set of document, c is set of categories, for any pair (d, c) of document and category [17].
of feedback from one another. It also used to gather information about the quality of the online discussion. Moreover, a text categorization approach codes the content of messages according to the message type. A central idea in text categorization that the many words of the documents are classified into certain message categories.
Auto-coding based on the support vector machine model is shown in fig. 1. First of all, students’ interaction through online discussion as a means of sharing knowledge and solving a problem by posting their idea or solution in the text form. All text on this forum as known as corpus data will be categorized into 8 categories using support vector machine approach. As depicted in the fig. 1, there are three stages in text categorization are applied: pre-processing, dimensionality reduction and classification.
The problem of text categorization may be formalized as the task to approximate an unknown classification function Φ : d × c → Boolean defined as:
⎧true, ifd ∈ c ⎩ false, otherwise
φ (d , c) = ⎨
1)
Fig. 1. Framework for text categorization of online discussion
lower case characters. Next, stop word removal to remove common words that are usually not useful for text categorization and ignored for later processing such as “are”, “is”, “the”, “a”, etc., based on a stop list for general English text. The remaining words then stemmed using the Porter’s algorithm [18] to normalize words derived from the same root such as “computer”, “computation”, “computing” would end up into the common form “comput” after the applying the stemming process. After stemming, we merged the sets of word’s stems from each of the 275 training documents and removed the duplicates. As a result are 1137 terms in the vocabulary is potential as feature set.
A. Pre-Processing Classifier or algorithm cannot be directly interpreted the text. Documents should first be transformed into a representation suitable for the classification algorithms to be applied. In order to transform a text into a feature vector, preprocessing is needed. This stage consists of identifying feature by feature extraction and feature weighting. The main goal of feature extraction is to transform a message from text format into a list of words as feature set, easy to be processed by support vector machine algorithms. This includes tokenization, stop word removal and stemming. Tokenization is used to separate text into individual words. All upper case characters in the words are converted to
297
SVMmulticlass proposed by Joachim and available at http://svmlight.joachims.org/svm_multiclass.html. In order to train the support vector machine, a set of training documents and a specification of the pre-defined categories the documents belong to are required.
B. Dimensionality Reduction The main difficulty in the application of a support vector machine to text categorization is the high dimensionality of the input feature space which is typical for textual data. This is because each term in the feature set represents one dimension in the feature space, as a consequence, the size of the input of the support vector machine depends upon the number of stemmed-words after removing the duplicates. In order to reduce the dimension of the input, feature selection also called term space reduction (TSR) is needed to select, from the original set of term, that when used for document indexing, yield the highest effectiveness [19].
IV.
The messages are firs manually coded by human coder. The total number of messages that coded by human is 394. The list messages of online discussion categorized by two of human coders and calculated the reliability among them. The reliability of human coder is needed to determine how well the human coded the list of messages based on coder training. The reliability test is conducted using multiple reliability coefficients such as percent agreement, Scott’s Pi ( π ), Cohen’s Kappa (k) and Krippendorff’s alpha ( α ). De Wever [21] has argued that reporting multiple reliability indices is of importance considering the fact that no unambiguous standards are available to judge reliability values. The coders do a sample exercise on other messages to familiarize themselves with the model. Two coders should do the analysis independently and have the results cross examined by one another. The reliability value of human coders to categorize the messages as follows: Percent Agreement=77.66%, π =73.49%; k=73.56% and α =73.52%. Krippendorff [22] added that variable with Alpha as low as .667 could be acceptable for drawing tentative conclusions. The values of .667 also appropriate for Scott Pi and Cohen Kappa. The reliability values of categorization messages by human coders as shown in table 2.
There are many kinds of document reduction techniques. Savio and Lee [13] introduce four kinds of methods designed to the dimensional of the feature space; document frequency (DF) method, Category Frequency-Document Frequency (CF-DF) method, product of term occurrence frequency (TF) and the inverse document frequency (IDF) as known as TF.IDF method and Principal Component Analysis (PCA). This research reduced the size of dimension by computing the document frequency (DF). Yang and Pedersen [20] have shown that with DF it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness. This seems to indicate that the terms occurring most frequently in the corpus are the most valuable for text categorization. C. Text Classifier Support vector machine as text classifier must be trained before it can be used for text categorization. Training of the support vector machine classifier is done by using TABLE II.
EXPERIMENT
THE RELIABILITY OF MESSAGE CATEGORIES BY HUMAN CODER
N Agreements
N Disagreements
Percent Agreement
Scott Pi
Cohen’s Kappa
Krippendorff Alpha
306
88
77.66%
73.49%
73.56%
73.52%
Armed with 394 human-code message, support vector machine approach is ready to be trained. The total number of messages is split into two parts. First part for training data and the second part for testing data our text categorization model. We used a set of 275 messages as a training set for training the text classifier. For testing, a test set of 119 messages is used. TABLE III.
In this research, we used 8 categories based on Soller’s model. Each message of text is classified in the same category are presumed to have a similar meaning. There are a total of 394 messages belonging to one of the 8 chosen categories. The distribution of messages among the 8 categories in the training set and the testing set is shown in table 3. In order to use a support vector machine learning approach, we first need to transform a text into a feature vector representation. Hence the feature extraction is needed. We created a program to combine the three phases of feature extraction in C++ language based on Porter’s algorithm.
CLUSTERING DATA SET
Name of Category
No. of Training
No. of Testing
Total
Acknowledge Discuss Inform Maintenance Mediate Motivate Request Task
40 50 60 29 2 20 44 30
15 25 25 12 1 8 20 13
55 75 85 41 3 28 64 43
Total
275
119
394
After stemming, we merged the sets of word’s stems from each of the 275 training documents and removed the duplicates. As a result are 1137 terms in the vocabulary is potential as feature set. After feature extraction phase that select the important terms, we should do term weighting of each word. There are various term weighting approaches studied in the literature. Boolean weighting is one of the most commonly used. In Boolean weighting, the weight of a term is considered to be 1 if the term appears in the document and it
298
is considered to be 0 if the term does not appear in the document.
classification. Several measures in the information retrieval and machine learning have been defined based on this contingency table. The most common performance in machine learning that have been widely applied for evaluating the performance of text classifiers are error rate, precision, recall and F-Measure.
The size of dimension is reduced by document frequency (DF) method. All potential features are ranked in each category based on the term occurs in the message. The top features for each category are chosen as its feature set. This experiment selects 370 terms as text classifier’s input after reduced by document frequency method.
The experimental result of SVM is illustrated in table 4 in 8 by 8 confusion matrix that displays the performance of eight categories of text of collaborative learning skill in online discussions. In this research, confusion matrix C, which is n x n matrix for N-class classifier is used to compute error rate, recall, precision and F-measure.
Next step, text representation to transform the text into a vector suitable for text classifiers or categorization algorithms to be applied. Text representation is a vector of terms weights. Each term represents specific information about the original text. The terms are sometimes referred to as features. Each term usually has an associated weight which represents its contribution to the text. For each training and testing text, we created the space vectors corresponding to the 275 training sentences, where each space vector had a dimensionality of 370.
The confusion matrix gives the number of instances where each digit is correctly classified and also the instances where they are misclassified on using optimization dataset. The row indexes of a confusion matrix correspond to actual values observed and used for model testing; the column indexes correspond to estimated values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing. In an ideal classifier all entries will be zero except diagonal. The sum of the values in the matrix is equal to the number of scoring records in the input data table.
Training of a text classifier based on supervised learning is done by using SVMmulticlass available at http://svmlight.joachim.org/svm_multiclass.html. SVMmulticlass consists of a learning module (svm_multiclass_learn) and a classification module (svm_multiclass_classify). The classification module can be used to apply the learned model to new examples. In the experiment using SVM, the input file “train.txt” contains the training sentences. This file is trained using svm_multiclass_learn and the result in the model which is learned from the training data in train file. The model is written to “model.txt”. To make predictions on test sentences, svm_multiclass_classify reads this file. For all test sentences in “test.txt” the predicted classes are written to “output.txt”. There is one line per test in the output file in the same order as in the test file. The first value in each line is the predicted category, and each of the following numbers is the discriminated values for each of the n categorization. The performance of SVM is largely dependent on the selection of the kernel parameters. The string c is a learning option to determine the trade-off training error and margin, and string t is a kernel option to store with the vector. In SVMmulticlass, the value of t range from 0 to 4. This experiment uses five values of t (0, 1, 2, 3, 4) that is combined with the value of c range from 0.10 – 1.0. Furthermore, we conducted one more experiment by taking the best value of c which was 5 below of 0.8 and 5 above of 0.8. Hence, these experiments have been completed with all possible value of t paired with all possible value of c. Total six experiments was conducted and the best performance when the value of t is 1 and c is 0.80 where the accuracy reaches 79.8 with time processing of 1.06 seconds. Hence, the optimal parameter setting used 0.8 as a value parameter to c and 1 as a value parameter to t. V.
There are two values (actual and predicted) for each of category. For instance, the first cell (1, 1) is the number of actual first category that was predicted to be first category; the value is 15. The second cell (2, 2) is the number of actual second category that was predicted to be second category; the value is 21. There are 4 misclassified for second category; 2 of actual second category that were predicted to be third category, 1 of actual second category that were predicted to be sixth category and 1 of actual second category that were predicted to be seventh category. TABLE IV.
CONFUSION MATRIX OF SVM
Actual Category
Estimate Category
EXPERIMENTAL RESULT
1
2
3
4
5
6
7
8
1
15
0
0
1
0
0
0
0
93% 17%
2
0
21
4
0
0
3
0
0
77.8% 22.2%
3
0
2
19
0
0
0
0
5
73.1% 26.9%
4
0
0
0
9
0
1
0
0
90% 10%
5
0
0
0
0
1
0
0
0
100% 0%
6
0
1
0
0
0
3
0
1
60% 40%
7
0
1
2
2
0
1
20
0
76.9% 23.1%
0
0
0
0
0
0
0
7
100% 0%
100% 0%
84% 16%
76% 24%
75% 25%
100% 0%
37.5% 62.5%
100% 0%
53.9% 46.1%
79.8% 20.2%
8
In order to evaluate a support vector machine task of collaborative learning skill, this research defines a contingency matrix representing the possible outcomes of the
299
Different category numbers of eight categories may cause diverse performance measures as shown in table 5. SVM achieved better results in smaller category classes, e.g. Acknowledge, Mediate and Request than larger ones, e.g. Motivate, Task and Inform. The lowest categorization with SVM is motivate category of 37.50%, while in BPNN is Inform of 56%. The performance of most categories is satisfactory. The micro average value of the overall error rate, recall, precision and F-Measure rate varied between 78.29% and 83.59%. TABLE V.
[4]
[5]
[6]
[7]
PERFORMANCE MEASURE OF SVM
Classifier
[8]
SVM
Categories
Error Rate (%)
Recall (%)
Precision (%)
F-Measure (%)
Acknowledge
0.00
100.00
93.75
96.77
Discuss
16.00
84.00
75.00
79.25
Inform
24.00
76.00
73.07
74.51
Maintenance
25.00
75.00
90.00
81.82
Mediate
0.00
100.00
100.00
100.00
Motivate
62.50
37.50
60.00
46.15
Request
0.00
100.00
76.92
86.95
Task
46.15
53.85
100.00
70.00
Microavg
20.17
78.29
83.59
79.43
[9]
[10]
[11]
[12]
[13]
VI. CONCLUSION This paper shows, through experimental results, that support vector machines is good classifiers for the text categorization. The experiment is conducted using the proposed framework to categorize the text message of online discussion into eight categories of collaborative learning skill. The average reliability values of categorization messages by human coder is 74.55% while the effectiveness of support vector machine that are measured by error rate, recall, precision and F-measure is also 73.70%. It shows that the accuracy of support vector machine approach in coded the messages is near with the human accuracy. The results also demonstrate that support vector machine is able to give a good categorization performance.
[14] [15]
[16]
[17]
[18]
ACKNOWLEDGMENT
[19]
This work are supported by The Ministry of National Education of the Republic of Indonesia and STMIK-AMIK Riau.
[20]
[21]
REFERENCES [1]
[2]
[3]
S. Dumais, “Using SVMs for text categorization,” in IEEE Intelligent Systems, M. A. Hearst, B. Schölkopf, S. Dumais, E. Osuna, and J. Platt, Eds: Trends and Controversies—Support Vector Machines, vol. 13, 1998. A.K.F. Liu, S.C. Li, and S.O. Choy, “An evaluation of automatic text categorization in online discussion analysis,” Seventh IEEE International Conference on Advanced Learning Technologies, Niigata, Japan, pp. 205-209, 2007. W. Lam and K.F. Low, “Automatic document classification based on probabilistic reasoning: Model and performance analysis,” Proceedings
[22]
300
of the IEEE International Conference on Systems, Man and Cybernetics, Vol.3, pp. 2719-2723, 1997. P. Soucy and G.W. Mineau, “A simple k-NN program for text categorization,” The first IEEE International Conference on Data Mining (ICDM_01), Vol. 28, pp. 647-648, 2001. L. Ma, J. Shepherd and Y. Zhang, “ Enhancing text clasification using synopses extraction,” The fourth International Conference on Web Information System Engineering, pp. 115-124, 2003. J. Farkas, “Generating document clusters using thesauri and support vector machines,” Canadian conference on Electrical and ComputerEngineering, Vol.2, pp. 710-713, 1994. T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” 10th European Conference on Machine Learning, pp. 137-142, 1998. M. Benkhalifa, A. Bensaid and A. Mouradi, A, “Text categorization using the semi-supervised fuzzy c-means algorithm,” 18th International conference of the north american fuzzy information processing society – NAFIPS, pp. 561-568, 1999. J. Kazama and J. Tsujii, “Maximum entropy models with inequality constrains: A case study on text categorization,” Machine Learning. 60(1-3), pp. 159-194, 2005. E. Gabrilovich and S. Markovitch, “Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5,” Proceedings of the 21st International conference on machine learning, pp. 321-328, 2004. B. Yu, Z. Ben Xu, C. Hua Li, “Latent semantic analysis for text categorization using support vector machine;” Journal of KnowledgeBased Systems, 21, pp. 900-904, 2008. H.T. Ng, W.B. Goh, and K.L. Low, ”Feature selection, perception learning, and usability case study for text categorization,” 20th Annual international ACM-SIGIR Conference on research and development in information retrieval, pp. 67-73, 1997. L.Savio, Y. Lam, and D. K. Lee, “Feature reduction for support vector machine based text categorization,” Sixth International Conference on Database Systems for Advance Applications(DASFAA), pp. 195-203, 1999. A.A. Gokhale, “Collaborative learning enhance critical thinking,” Journal of Technology Education, 7(1), 1995. A. Soller, “Supporting social interaction in an intelligent collaborative learning system,” Journal of Artificial Intelligence in Education, 12, 2001. M. McManus and R. Aiken, R, “Monitoring computer-based problem solving,” Journal of Artificial Intelligence in Education, 6(4), pp. 307336, 1995. J.M.G. Hidalgo, J.M.G, “Text representation for automatic text categorization,” Eleventh conference of the european chapter of the association for computational linguisitics EACL, 2003. . M.F. Porter, M.F, “An algorithm for suffix stripping,” In Morgan Kaufmann, Readings in information retrieval, Morgan Kaufmann Publishers Inc, pp. 313-316, 1997. F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Survey, 34(1), pp. 1-47, 2002. Y. Yang, and J.O. Pedersen, “A comparative study on feature selection in text categorization,” Proceedings of ICML-97, 14th International conference on machine learning, pp. 412-420, 1997. B. De Wever, T. Schellens, M. Valcke, M, and H. Van Keer, “Content analysis schemes to analyze transcripts of online asynchronous discussion groups: A review,” Computers & Education 46(1), pp. 6–28, 2005. Krippendorff, “Quantitative content analysis: An introduction to its method,” Beverly Hills, Sage Publications, 2004.