Testing the Properties of a Proposed Random Indexing-based Discourse Transcript Analysis Approach
Rodolfo C. Raga, Jr. (a), Jennifer D. Raga (a), Arlyn D. Orense (b), Maria Rosario D. Rodavia (b)
(a) Computer Science Department, Jose Rizal University, 80 Shaw Boulevard, Mandaluyong City, Philippines {jundy.raga, jennie.raga}@gmail.com
(b) School of Computer Science, Arellano University, 2600 Legarda, Manila, Philippines {lynorense, rose.rodavia}@yahoo.com
ABSTRACT
Random Indexing (RI) is an efficient, scalable, and incremental alternative to standard word space methods, which require huge amounts of training data to implement. According to the literature, the incremental nature of RI enables it to be used for similarity computations even after just a few examples have been encountered. We believe that this property is also what makes RI robust for small data sets. In this article, we present our efforts to further test the utility of RI. Building on our previous research on analyzing forum transcripts, we conducted experiments to investigate whether RI can be used to rate the topic relevance of small fragments of text as well as Tagalog-based discourse contributions. We measure our results in terms of agreement with hand-coded corpora using the Kappa statistic. Although initial and not conclusive, our results are promising for the utility of RI, especially in addressing the problem of monitoring and assessing contributions in educational online discussion forums.
Keywords
Online Forum Discussion, Message Analysis, Random Indexing
1. INTRODUCTION
With the recently increasing interest in education in adopting asynchronous online discussion as a tool for promoting interaction and learning among students, the need for tools that can assist teachers in monitoring and assessing students' contributions in these forums has become more pressing than ever. In response, researchers have started to seek methods and techniques that can address this need. One method strongly suggested by previous research is the Word Space Modeling approach. Word Space Modeling (WSM), according to Sahlgren (2006), is “a computational model of word meaning that utilizes the distributional patterns of words collected over large text data to represent semantic similarity between words in terms of spatial proximity”.
Several word space models have already proven their mettle in numerous applications. Foremost among these is the Latent Semantic Analysis (LSA) approach, which has become almost synonymous with information retrieval research. However, most of these models, including LSA, require a huge amount of training data to train and implement on a particular domain. This requirement, we believe, limits their utility in education, where several domains are often intertwined to build a single curriculum.
Recently, however, the potential of other word space modeling techniques such as Random Indexing (RI) has started to be explored. RI differs significantly from other word space methods in various respects, which gives it the potential to address the problems that plague the standard approaches. According to Chatterjee & Mohan (2008), for example, the incremental nature of RI enables it to be used for similarity computations even after just a few examples have been encountered. Gorman & Curran (2006), on the other hand, experimented with RI and found that it is more robust for processing small data sets. Furthermore, the literature suggests that RI, like most Word Space Modeling approaches, supports language-independent processing of data (Jönsson et al., 2008). These studies prompted us to investigate whether these properties of RI can be leveraged for the task of analyzing forum message contributions. From an educational standpoint, verifying the utility of these properties is interesting because they can have significant implications for the design of practical tools that help overcome the information overload that teachers endure in maintaining educational forums.
In our previous work (Raga et al., 2010), we described our proposed approach, which adapts the functionalities of Random Indexing to the task of monitoring the topic relevance of forum message contributions, and implemented it in a prototype system that we call LexNet. In the current work, we extend our experiments on LexNet to further explore the utility of the proposed approach. More particularly, we aim to determine how the system performs on: (1) a dataset consisting of short but topically flat fragments of text, and (2) a dataset of Tagalog-based online forum discourse contributions. Processing the former tests the ability of the proposed approach to handle non-discourse data, whereas the latter tests its language-independent capabilities. We also compare the performance of LexNet against different groups of human annotators to investigate how the strategies adopted by the annotators compare to those of the prototype system.
The rest of this article is organized as follows. In the next section, we provide some background on RI and briefly revisit the strategy we used to adapt it to the task of topic relevance rating. Section 3 describes the experiments we conducted, including the data we used, experiment procedures, and evaluation approach. Section 4 describes the results we generated and presents our analysis. Finally, in Section 5 we present a summary and describe possible future work.
Proceedings of the 8th National Natural Language Processing Research Symposium, pages 52-57, De La Salle University, Manila, 24-25 November 2011
discourse topic structure of a text and thus cannot be directly used to measure topic relevance. To address this issue, in our proposed approach, we supplemented the RI-generated cosine scores with a combination of several surface-level text statistics.
2. RI AND SEMANTIC KNOWLEDGE
Semantic knowledge can be thought of as knowledge about relations among several types of elements, including words, concepts, and percepts. In this regard, the semantic representation that can be generated using RI has to do with knowledge of word-to-word relations.
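To make the notion of word-to-word relations more concrete, the following Python sketch illustrates the general RI idea as it is commonly described in the literature: each word receives a sparse random index vector, a word's context vector accumulates the index vectors of its neighbours, and relatedness is read off the cosine between context vectors. This is only a minimal illustration and not the LexNet implementation. The dimensionality (DIM), the number of non-zero entries (NONZERO), the window size (WINDOW), the function names, and the toy sentence are all assumptions chosen for readability.

```python
import math
import random

DIM = 300      # assumed vector dimensionality (hypothetical value)
NONZERO = 6    # assumed number of +1/-1 entries per index vector
WINDOW = 2     # assumed number of context words on each side of the focus word


def make_index_vector(rng, dim=DIM, nonzero=NONZERO):
    """Return a sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def build_context_vectors(tokens, window=WINDOW, seed=0):
    """Accumulate, for every word, the index vectors of its neighbouring words."""
    rng = random.Random(seed)
    index = {}
    for word in tokens:
        if word not in index:
            index[word] = make_index_vector(rng)
    context = {word: [0] * DIM for word in index}
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # the focus word is not its own context
            for k, value in enumerate(index[tokens[j]]):
                if value:
                    context[focus][k] += value
    return context


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy usage: words that share neighbours tend to receive similar context vectors.
tokens = "the cat drinks milk the dog drinks water".split()
vectors = build_context_vectors(tokens)
similarity = cosine(vectors["cat"], vectors["dog"])
```

Because the context vectors are simply running sums, new text can be folded in at any time without rebuilding the model, which is the incremental property the literature attributes to RI.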
The procedure for using Random Indexing (RI) can be summarized in two steps as follows:
2. The next step involves generating the context vector that represents each word in the vocabulary. These context vectors have the same dimensionality as the index vectors. They are generated automatically by scanning through the text: each time a word occurs in the surrounding context of a focus word, usually within an n x n window, that word's index vector is added to the context vector of the focus word.
Table 1: Cosine Value Interpretation
Cosine Value | Vector Relationship | Interpretation
1 | Identical Vectors | The terms are distributionally similar and assumed to be related
0 | Orthogonal Vectors | The terms are distributionally independent, relatedness cannot be established
-1 | Opposite Vectors | The terms represented are not related
>-1 &