A Framework for Topic Generation and Labeling from MOOC Discussions

Thushari Atapattu, Katrina Falkner
School of Computer Science, University of Adelaide
Adelaide, Australia
{firstname.lastname}@adelaide.edu.au
Abstract
This study proposes a standardised open framework to automatically generate and label discussion topics from Massive Open Online Courses (MOOCs). The proposed framework aims to overcome the difficulties experienced by MOOC participants and teaching staff in effectively locating and navigating the content they need. We analysed two MOOCs offered during 2013, Machine Learning and Statistics: Making Sense of Data, and obtained statistically significant results for automated topic labeling. However, more experiments with additional MOOCs from different MOOC platforms are necessary to generalise our findings.
Author Keywords
MOOC, discussion forums, topic modeling, LDA, Naïve Bayes, Learning Analytics
Introduction
Discussion forums or communities form a significant component of many MOOCs, particularly within the connectivist MOOC model (cMOOCs) [1]. Due to the massive amount of discussion data generated within MOOCs, it is challenging for participants to effectively locate and navigate their information needs unassisted. Similarly, the workload required to manage the magnitude of existing discussions, along with the fact that discussions may develop around any course topic at any point in time, makes it difficult for teaching staff to effectively focus their attention and support learners. In order to address this problem at large scale, we propose a framework to identify key discussion topics, supporting a topic-wise organisation of discussion contents. Our research question is: how can we leverage learner-generated discourse from discussion forum contents to better identify key topics of interest?
Background
Topic analysis of MOOC discussion content using existing topic models (e.g. LDA, the structural topic model) has been explored in some recent works [2, 3]. However, these existing works are of limited use to end-user applications (e.g. visualisation, recommender systems) because they cannot provide labels for the generated topics without human intervention [2-4]. Topic analysis of other text sources, such as document collections, relies on external resources for topic labeling. Ramage et al. [5] propose Labeled LDA, a supervised learning technique trained with a corpus of tagged web pages. Lau et al. [6] introduce an automated topic labeling approach that queries the web to obtain candidate labels from the titles of Wikipedia articles. Magatti et al. [7] utilise the Google directory hierarchy to obtain candidate labels; this approach is only applicable to hierarchical topic models and relies on a pre-existing ontology. While initial works in automated topic labeling show promise, a new mechanism is necessary to cope with the scale present in a MOOC and the dynamic and diverse nature of discussion contents.
Topic Analysis & Labeling
The framework proposed in this paper builds upon existing topic models, introducing a novel solution for automated topic labeling at scale (Figure 1). Topic modeling is an unsupervised approach to discovering hidden thematic structures in large text corpora [8]. This work utilises Latent Dirichlet Allocation (LDA) [8], the state of the art in topic modeling. The aim of topic labeling is to generate a set of candidate labels and develop a ranking mechanism to determine the most relevant and meaningful label for the unlabeled topics extracted from a text corpus (i.e. discussion forums in this work). This work incorporates the titles of lectures introduced by the lecturers within their lecture materials (e.g. video transcripts and lecture slides) to obtain candidate topic labels for discussion topics.
Figure 1. Overview of topic labeling process
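To make the topic-generation step concrete, the following sketch fits an LDA model over per-thread documents. It is illustrative only: the study itself uses the jsLDA implementation (see Method), whereas this sketch assumes the gensim library, and the thread texts and topic count are hypothetical.

```python
# Illustrative LDA sketch, assuming gensim (the paper uses jsLDA).
# Thread texts and the number of topics are hypothetical.
from gensim import corpora, models

# Each document is one discussion thread (title + posts + comments merged).
threads = [
    "how do i compute the cost function for gradient descent theta",
    "neural network hidden layer output weights and units",
    "svm decision boundary for a classification example",
]

# Tokenise and build the bag-of-words corpus that LDA expects.
tokenised = [t.lower().split() for t in threads]
dictionary = corpora.Dictionary(tokenised)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]

# Fit LDA; the study derives 40 (ML) and 25 (STAT) topics.
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                      passes=10, random_state=1)

# Each topic is a ranked list of terms, as in Table 3.
for topic_id, terms in lda.show_topics(num_topics=3, num_words=5,
                                       formatted=False):
    print(topic_id, [term for term, _ in terms])
```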
Given the dynamic nature and scale present in MOOC discussions, it is likely that a large number of topics will be generated, proportional to the number of discussion threads. Due to this scale, directly comparing each topic against the entire corpus of lecture materials to obtain candidate labels is highly inefficient. Further, due to the self-paced nature of most MOOCs, discussions can be generated at any point in time and relate to any course content. Therefore, it is not practical to rely on candidate labels generated at the commencement of the course from early discussions, as these are unlikely to remain suitable as the course progresses.
Course | Users* | Threads | Lecture-related threads
ML | 6368 | 5449 | 972
STAT | 2313 | 1145 | 392
*Anonymous users are counted as 1 unit.
Table 1. Statistics of selected courses

Course | Threads* | Posts | Comments
ML | 972 | 3670 | 1778
STAT | 392 | 1648 | 882
*A thread is the combination of a thread title and its corresponding posts and comments.
Table 2. Statistics of lecture-related discussions
Topic | Topic terms
1 | function cost gradient descent minimum theta square point error algorithm
2 | class boundary point decision positive classification probability svm example line
3 | layer network neural output input hidden node unit weight image
4 | learning algorithm machine learn problem datum example system unsupervised result
5 | matrix pca inverse vector pinv dimension octave loop svd datum
Table 3. Sample topics (unlabeled) from the ML course
In order to address this problem, we introduce text classifiers (i.e. Naïve Bayes) to narrow the focus of topic labeling to a subset of the course contents organised by weeks. We build a text classifier for each week, using the contents of that week's course materials (i.e. its top 50 key terms) as the 'positive' class and the key terms from the course materials of the remaining weeks as the 'negative' class (Figure 1(1)). Candidate topic labels are then generated by measuring the similarity between the topic terms of discussions and the key terms extracted from the course materials of the corresponding week(s). In order to summarise course materials as a set of key terms, we use state-of-the-art term extraction algorithms, including basic term frequency and the Chi-squared algorithm [9]. We also reuse the authors' previous work on extracting key terms from lecture slides [10]. The generated set of candidate labels is ranked based on the similarity scores between discussion content and course content.
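The following sketch illustrates this classify-then-rank step. It is a minimal approximation, assuming scikit-learn: for brevity, a single multiclass Naïve Bayes model stands in for the per-week positive/negative classifiers described above, and the weekly key terms, candidate labels, and use of cosine similarity are hypothetical stand-ins for details the paper does not specify.

```python
# Sketch of week classification and candidate-label ranking, assuming
# scikit-learn. A single multiclass model replaces the paper's per-week
# binary classifiers; all data below is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

# Top key terms per week, extracted from that week's course materials.
week_terms = {
    1: "regression cost function gradient descent theta learning rate",
    2: "classification decision boundary svm margin kernel",
    3: "neural network layer hidden unit weight backpropagation",
}

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(week_terms.values())
clf = MultinomialNB().fit(X, list(week_terms.keys()))

# A discussion topic (its LDA topic terms) is first routed to a week...
topic = "function cost gradient descent minimum theta square point error"
week = clf.predict(vectoriser.transform([topic]))[0]

# ...then candidate labels (lecture titles from that week) are ranked by
# similarity between the topic terms and each title.
candidate_labels = ["Gradient descent", "Cost function", "Linear regression"]
topic_vec = vectoriser.transform([topic])
ranked = sorted(
    candidate_labels,
    key=lambda lbl: cosine_similarity(
        topic_vec, vectoriser.transform([lbl]))[0, 0],
    reverse=True,
)
print(week, ranked)
```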
Methodology

Data
This work utilises Coursera discussion forum data from MOOCs offered during 2013 [11]. For this study, we selected two MOOCs offered in English and not self-paced: the Machine Learning (ML) course (Stanford University, 10 weeks) and the Statistics: Making Sense of Data (STAT) course (University of Toronto, 8 weeks) (Table 1). This work primarily focuses on lecture-related discussions for the analysis (Table 2).

Method
The topic analysis approach involves pre-processing of discussion contents, including removal of non-textual content and stop words, language detection and translation to English, and lemmatisation. To overcome the brevity of individual discussion posts when applying common topic models, we merge all posts and comments belonging to a single thread. We utilise an existing JavaScript implementation of LDA (jsLDA, http://mimno.infosci.cornell.edu/jsLDA/index.html) and obtain 40 and 25 topics from the ML and STAT courses respectively (Table 3). The topics generated from the topic models are used as testing data for the text classifiers, which classify the discussion topics into weeks, while the contents of the course materials (on a weekly basis) are used as training data (Figure 1). Our manual topic labeling task involves three volunteer domain experts for each course. These experts are senior academic staff or postdoctoral researchers from the School of Computer Science at the University of Adelaide. The task of each expert is to examine the given topics and rank the suggested candidate labels: most related to the given topic (rank 1), reasonably related (rank 2), and least related (rank 3). Annotators also had the option to indicate when the generated labels were unrelated, or when more than one label was equally well ranked.
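As an illustration of the pre-processing and thread-merging steps above, the sketch below assumes NLTK; the paper does not name its tool chain, and the language detection and translation step is omitted here for brevity.

```python
# Pre-processing sketch, assuming NLTK (the paper's tools are unnamed);
# language detection and translation to English are omitted.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP = set(stopwords.words("english"))
LEMMATISER = WordNetLemmatizer()

def preprocess(text):
    """Strip non-textual content, drop stop words, and lemmatise."""
    text = re.sub(r"http\S+|[^a-zA-Z\s]", " ", text.lower())
    return [LEMMATISER.lemmatize(tok) for tok in text.split()
            if tok not in STOP]

# Posts and comments of one thread are merged into a single document to
# counter the brevity of individual posts (thread data is hypothetical).
thread = {"title": "Gradient descent question",
          "posts": ["Why does theta diverge?"],
          "comments": ["Try a smaller learning rate."]}
document = preprocess(" ".join([thread["title"],
                                *thread["posts"], *thread["comments"]]))
print(document)
```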
Results and Discussion
We calculated Spearman's rank correlation coefficient (rs) between the ranks generated by the machine and those generated by the human experts (Table 4). Assuming that the rankings were performed independently across the different participants, since each individual obtained a randomly arranged list of candidate labels, the results show a statistically significant positive ranking correlation.
Course | H1 | H2 | H3 | Mean (SD)
ML | 0.575 | 0.612 | 0.637 | 0.608 (0.03)
STAT | 0.705 | 0.669 | 0.525 | 0.633 (0.09)
*H1, H2, H3 denote the human experts involved in each course.
Table 4. Spearman ranking correlation between machine and human topic labeling

Course | H1-H2 | H1-H3 | H2-H3 | Mean (SD)
ML | 0.702 | 0.525 | 0.675 | 0.634 (0.09)
STAT | 0.506 | 0.573 | 0.653 | 0.577 (0.07)
*Human annotator pairs denote different individuals in each course.
Table 5. Inter-rater agreement (Kappa)
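The per-expert correlations in Table 4 can be reproduced from paired rank lists with SciPy, as sketched below; the rank values shown are illustrative, not the study's data.

```python
# Illustrative Spearman's rank correlation between the machine's label
# ranking and one expert's ranking; the rank values are made up.
from scipy.stats import spearmanr

machine_ranks = [1, 2, 3, 1, 2, 3, 2, 1, 3]  # hypothetical per-topic ranks
human_ranks = [1, 3, 2, 1, 2, 3, 1, 2, 3]

rs, p_value = spearmanr(machine_ranks, human_ranks)
print(f"rs = {rs:.3f}, p = {p_value:.3f}")
```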