2012 11th International Conference on Machine Learning and Applications
A Machine Learning based Topic Exploration and Categorization on Surveys

Liana M. Epstein, Philip Garland, and Annabell Suh
Dept. of Methodology, SurveyMonkey, Palo Alto, USA
{liana, philg, annabell}@surveymonkey.com

Clint P. George, Daisy Zhe Wang, and Joseph N. Wilson
Dept. of Computer & Information Science & Engg., University of Florida, Gainesville, USA
{cgeorge, daisyw, jnw}@cise.ufl.edu
Abstract—This paper describes an automatic topic extraction, categorization, and relevance ranking model for multilingual surveys and questions that exploits machine learning algorithms such as topic modeling and fuzzy clustering. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. First, we describe different pre-processing steps we considered for removing noise in the multilingual survey text. Second, we explain our strategy to automatically extract survey categories from surveys based on topic models. Third, we describe different methods to cluster questions under survey categories and group them based on relevance. Last, we describe our experimental results on a large group of unique, real-world survey datasets from the German, Spanish, French, and Portuguese languages and our refining methods to determine meaningful and sensible categories for building question banks. We conclude this document with possible enhancements to the current system and impacts in the business domain.
Figure 1. Survey clustering.
Keywords-topic modeling; survey clustering; fuzzy clustering; categorization
I. INTRODUCTION

As the amount of available text data keeps rising, it becomes challenging for people to locate and track the relevant information they require. Within the domain of multilingual survey texts, we are particularly interested in building language-independent tools for topic discovery, clustering, and ranking of surveys and their questions. Effectively addressing the potentially huge amount of information contained in a large collection of surveys leads us to tools for automatic text summarization and topic extraction. Topic modeling methods such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are designed to assist with these types of problems.

Conventional survey designer systems such as SurveyMonkey provide manually designed, category-specific survey templates (e.g., an Education template or a Customer Feedback template) to ease the survey building process [1]. (We use category and topic interchangeably throughout this paper.) During survey building, template questions can be customized and new questions can be added based on user needs. One disadvantage of this type of system is that the manual labor required to build such templates is high. Similarly, the category template building process does not consider any survey question usage statistics (from the existing surveys in the system) and is a language-specific task. To address this, we are building tools that let the category-specific template building process proceed with much less manual effort and a language-independent system design. Moreover, our proposed system can automatically find commonly occurring categories from multilingual survey data using survey questions' word statistics.

We focus on the task of automatically clustering surveys and questions and ranking them by relevance to a specific topic or survey. This involves several challenges: (a) representing user surveys and questions in a machine-readable form by removing noise terms and stop-words, (b) employing machine learning models that can learn topics from surveys and categorize them with minimal manual intervention, (c) post-processing strategies on the learned model for survey and question clustering, and (d) experiments on our unique set of multilingual (Spanish, German, French, and Portuguese) survey datasets from SurveyMonkey. Fig. 1 and Fig. 2 visualize survey clustering and question clustering; the different colors represent different topic content in the survey text.

The potential impact of this project in the business domain is unparalleled. There is not, to our knowledge, any other
"question bank" in languages such as Spanish, German, and French, nor any other automatic survey and question categorization and ranking system. The categories that emerged from our system were qualitatively different because of cultural differences both in the way questions are asked in different languages and in the information that people in different countries most want to find out. Thus, the process we developed supports automating the construction of culturally relevant question banks from existing survey corpora.

Our approach. Our proposed system uses topic models to model the corpus (document collection) of surveys. Topic models represent documents as bags of words, treating word order as unimportant. These models can represent large document collections with a small number of lower-dimensional topics, each a cluster of similarly behaving words. Document words are assumed to be generated from topic-specific multinomials, and the topic for a particular word is chosen from that document's topic mixture; the topics themselves are assumed to be generated over the corpus vocabulary from a Dirichlet distribution. Blei et al. [2] give a detailed description of this language model and its assumptions. The analysis of topic models depends on exploring the posterior distribution of model parameters and hidden variables conditioned on the observed words. The model parameters are the corpus-level topics or concepts (sets of words with corresponding probabilities) and the document-level topic mixtures. The original topic model assumes the number of topics in the corpus is known beforehand. Teh et al. [3] address this issue with the Hierarchical Dirichlet Process (HDP), a framework that learns a variable number of topics automatically from the data.

Topic models are well suited to a language-independent approach to clustering and ranking surveys because the bag-of-words document model on which they are based is largely independent of semantic structure; inference depends only on the word co-occurrence frequencies in each document of a given corpus. We use a topic modeling algorithm based on HDP [3] to discover topics from surveys. The estimated topics are further used to rank relevant surveys in the corpus and group them (survey clustering). Topic models also provide relevant words and their probabilities for a given topic, which domain experts can use to name the learned categories or topics with minimal manual effort. We also considered the problem of grouping similar questions together (question clustering) to assist survey designers. We used LSI to represent questions because of its computational efficiency compared to more complex models such as LDA and HDP, and we implemented our question clustering system with fuzzy clustering [4] of the questions represented in LSI space. Our results show that our method can automatically find many manually defined survey categories and can group topically similar questions, as well as surveys whose questions are in a language different from the survey group's language.

One of the challenges we faced in designing the multilingual survey categorization system was the demographic and cultural variation in language usage across countries. The variation in question structure was quite visible even within the formal environment imposed by the survey format. For example, many Spanish surveys use specific words and phrases to ask questions politely; during topic model inference, these interfered with forming relevant topics from the survey text. Similarly, many German surveys include a large set of common words from colloquial phrases, which caused the topic-modeling-based ranking and clustering algorithms to form overlapping, hard-to-distinguish question and survey groups. We tackle some of these problems with language-specific lemmatizers and stop-word lists (section III).

This paper is organized as follows. Section II reviews the state of the art in document topic modeling, language-independent text mining, and survey clustering. Section III describes our overall system architecture and algorithms. Section IV describes our unique multilingual datasets, evaluation metrics, results, and analysis. Section V concludes the paper.

Figure 2. Question clustering: q(s1) represents a question from survey 1.
II. RELATED WORK
Topic models are often used to characterize plain text documents and to extract topical content from them. One such model, LSI, can group together words and phrases that exhibit synonymy (or similar meanings), e.g., car and automobile. The LSI method typically performs matrix factorization over a term-document matrix (TF-IDF matrix),
which represents the occurrence of words in documents, and applies eigenvalue decomposition to identify patterns in the relationships between the document terms and concepts or topics. We used LSI to cluster questions under a given survey topic and to build topical question banks because probabilistic topic models such as LDA and HDP are less effective at modeling small documents [5].

In the probabilistic topic modeling setting (e.g., LDA and HDP) [2], [3], a topic is represented by a multinomial distribution over the words in a vocabulary. Topic modeling allows us to represent the properties of a large collection of documents containing numerous words with a small collection of topics. Each document is described by a mixture of topics, and words are chosen from the multinomial that results from mixing that document's topic multinomials. Topic models are designed to handle both polysemy (single words with multiple meanings, such as model and chip) and synonymy. We use topic modeling algorithms such as HDP [3] and LDA [2] to discover topics from surveys.

Survey questions are usually short, which differs substantially from conventional document information retrieval and mining problems. Grant et al. [5] tested the applicability of topic-modeling-based approaches on a Twitter dataset and found that the restricted length of tweets prevents these models from reaching their full potential; aggregating tweets to train the topic model can yield an improved set of topics. Hong et al. [6] report similar observations on a different Twitter dataset. In this paper, we use a similar strategy to model surveys: we aggregate the questions of each survey and treat the result as a single document for topic modeling. Francis et al. [7] described several methods for text mining, including mining surveys (the 2008 CAS Quinquennial Membership Survey). They explained methods such as TF-IDF, k-means, and hierarchical clustering over the survey question and answer words, based on the R package tm. Here, in contrast, we present a comprehensive set of experiments on multilingual survey datasets to which we apply statistical models such as LDA, HDP, and LSI.

III. SYSTEM DESIGN

This section explains our methodology and the system architecture. Fig. 3 gives a graphical representation of our prototype system. It consists of two main modules, one language dependent and the other language independent. The following subsections explain the individual system components in detail.

A. Data pre-processing

This component is part of the language dependent system module. We designed the preprocessor so that a change in the input language does not affect the rest of the system components. First, we tokenize the raw survey questions with a tool that depends on the survey's source language. For Latin-script languages such as Spanish, German, and French, we build the tokenizers using the Python Natural Language Toolkit (NLTK) [8] and predefined regular expressions. For Asian languages such as Japanese, we use morphology-based segmenters (e.g., MeCab and TinySegmenter for Japanese text) to tokenize the survey text. (We excluded Asian languages from our analysis because the datasets were too small to yield meaningful results.) Second, we standardize tokens by removing noise terms and stop-words, using language-dependent stop-word lists. Third, we represent each survey or question as a document in a sparse bag-of-words format, after building a vocabulary of corpus words (separately for each language). Finally, the documents are input to the topic learning model, which learns clusters from the term co-occurrence frequencies of the corresponding documents. See Fig. 3 for more details.
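As a rough sketch of this pipeline, the snippet below tokenizes, lowercases, strips stop-words, and builds the sparse bag-of-words corpus with NLTK and Gensim. The regular expression, the Spanish example, and all names here are illustrative assumptions; the paper's actual tokenizers and stop-word lists are not published.

```python
import re

from gensim import corpora
from nltk.corpus import stopwords  # needs: nltk.download('stopwords')

# Illustrative tokenizer for Latin-script languages: keep runs of letters
# (including accented ones), drop digits and punctuation.
TOKEN_RE = re.compile(r"[^\W\d_]+", re.UNICODE)

def preprocess(texts, lang="spanish"):
    """Tokenize, lowercase, and remove stop-words from raw survey text."""
    stop = set(stopwords.words(lang))
    return [[t for t in TOKEN_RE.findall(text.lower()) if t not in stop]
            for text in texts]

# One "document" per survey: all of its questions aggregated (section II).
surveys = ["¿Cómo califica nuestro servicio? ¿Recomendaría este producto?"]
docs = preprocess(surveys, lang="spanish")

# Sparse bag-of-words over a per-language vocabulary.
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
```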
Figure 3. The system design.

B. Topic learning

As discussed earlier, topic models can learn semantic relationships among words from an observed text collection. In this system, topic modeling serves three main purposes: i) categorizing and ranking surveys, ii) survey sub-categorization and ranking, and iii) clustering survey questions under an identified survey sub-cluster.

Survey ranking is performed to identify relevant surveys that belong to general (top-level) topics such as market research, education, and sports. To perform ranking, we first compute the topic mixtures of the survey documents, which are formed by combining survey questions. To estimate the topical structure from the survey documents, we use HDP [3], which can learn the number of topics automatically (one of our primary goals) along with the topic model from large document collections. A detailed theoretical review of HDP and its inference methods is given by Teh et al. [3]. We use a modified version of the HDP implementation by Wang and Blei [9] in our experiments. The major components of a learned HDP model are the corpus-level topic-word association counts and the document-level topic mixtures. Each topic in the estimated model is represented by its topic-word probabilities; language experts use these words to name survey categories. The document-level topic mixtures indicate how topical a particular survey is for a given topic, which is also quite useful for finding similar surveys and grouping them together.

From our observations of the top-level survey categorization described above, we found that some of the topics found by the HDP estimation process can be further divided into subtopics, and the corresponding surveys can be ranked by subtopic relevance. For modeling survey subtopics, we use the original LDA model [2] because it is more accurate and less computationally expensive than HDP. We use the Gensim package's [10] online variational inference implementation for the model estimation process.

Conventional topic modeling algorithms are designed to work on documents larger than survey questions (section II): the chance of a term re-occurring within the same question is quite low compared to the typical documents used in the topic modeling literature. So, to cluster questions for building question banks, we represent questions in a much simpler format, TF-IDF, and perform LSI, which lets us represent the questions in the smaller LSI space rather than the vocabulary space.
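The paper uses a modified version of Wang and Blei's HDP implementation [9], which is not public; as a stand-in, the sketch below shows the same workflow with Gensim's HdpModel: learn topics without fixing their number, read off topic words for naming categories, and obtain per-survey topic mixtures. bow_corpus and dictionary are assumed from the pre-processing sketch above.

```python
from gensim.models import HdpModel

# bow_corpus and dictionary come from the pre-processing sketch above.
# HDP infers the number of topics from the data, so no K is supplied.
hdp = HdpModel(corpus=bow_corpus, id2word=dictionary)

# Corpus-level topics: top words and probabilities, which language
# experts can inspect to name the learned survey categories.
for topic_id, words in hdp.show_topics(num_topics=10, num_words=8,
                                       formatted=False):
    print(topic_id, [(w, round(p, 3)) for w, p in words])

# Document-level topic mixture (theta) for one survey, usable for
# grouping similar surveys and for the ranking score of section III-C.
theta_d = hdp[bow_corpus[0]]  # list of (topic_id, probability) pairs
```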
C. Survey relevance ranking

We use survey relevance ranking to group together the surveys belonging to an estimated topic (Fig. 1). We use each survey's estimated document topic mixture, \hat{\theta}_d, to rank the surveys by relevance given a topic or set of topics. For a given topic set T \subseteq \{1, \ldots, K\}, where K is the number of topics, we calculate

m(d) = \sum_{k \in T} \ln \hat{\theta}_{d,k} + \sum_{j \notin T} \ln(1 - \hat{\theta}_{d,j})    (1)

for all surveys d = 1, 2, \ldots, D in the corpus and sort them by this score. Here, we assume the document topic mixtures \hat{\theta}_d satisfy the multinomial property \sum_{j=1}^{K} \hat{\theta}_{d,j} = 1. Intuitively, m(d) is large when document d places most of its topic mass on the topics in T and little mass elsewhere, so a document with a high score is highly relevant to that topic set.

D. Question clustering and ranking

One of the goals of this project is to design a system that can recommend useful, relevant survey questions for building question banks, given a selected survey topic (e.g., education). Once we have the surveys that belong to a given topic, we group similar survey questions into question groups and rank them within each group using several ranking scores. We first apply fuzzy C-means (FCM) clustering [4], [11] to the set of survey questions represented in LSI space (section III-B). Second, we rank the questions belonging to a given cluster using measures such as string matching, fuzzy set matching [12], and distance from the cluster centroid. Finally, we remove duplicate questions and present the ranked questions to survey designers (Fig. 2).
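The paper runs FCM through the R package e1071 [11]; the following is a rough Python equivalent assuming the scikit-fuzzy package, with question_bow built per question in the same way as the survey bag-of-words above. The LSI dimensionality and cluster count are illustrative choices, not the paper's settings.

```python
import numpy as np
import skfuzzy as fuzz  # assumption: scikit-fuzzy stands in for R's e1071 [11]
from gensim.models import LsiModel, TfidfModel
from gensim.matutils import corpus2dense

# question_bow: bag-of-words vectors, one per question, built as in
# section III-A; dictionary is the matching vocabulary.
NUM_DIMS, NUM_CLUSTERS = 50, 10  # illustrative sizes

tfidf = TfidfModel(question_bow)
lsi = LsiModel(tfidf[question_bow], id2word=dictionary, num_topics=NUM_DIMS)

# Dense question vectors in the (much smaller) LSI space,
# shaped (features, n_questions) as scikit-fuzzy expects.
X = corpus2dense(lsi[tfidf[question_bow]], num_terms=NUM_DIMS)

# Fuzzy C-means: each question receives a membership degree in
# every cluster rather than a single hard label.
cntr, u, _, d, _, _, _ = fuzz.cmeans(X, c=NUM_CLUSTERS, m=2.0,
                                     error=1e-4, maxiter=300)
hard_assignment = u.argmax(axis=0)  # cluster index per question
# d holds question-to-centroid distances, usable for within-cluster ranking.
```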
IV. EXPERIMENTAL RESULTS AND ANALYSIS

This section describes our datasets, experiments, and observations.

A. Dataset description and experimental setup

We conducted our experiments on both research and real-world datasets from SurveyMonkey, covering a variety of languages including English, Spanish, German, French, Portuguese, and Japanese. In this paper, we only report results on the research datasets for Spanish, German, French, and Portuguese. We only consider surveys having at least five questions for topic modeling, and for vocabulary construction we only consider words with a minimum overall corpus frequency of 10. Table I describes the specific datasets; the reported numbers of surveys and vocabulary sizes are approximate counts, computed after removing noise (stop-words, duplicate surveys, etc.) from the survey text. We also removed English surveys and words from the foreign-language surveys using an English dictionary.

Table I
RESEARCH DATASETS

Language    | # of surveys | Vocabulary size | # of stop-words
Spanish     | 7.3K         | 6.2K            | 350
German      | 3K           | 2.4K            | 400
French      | 9.4K         | 5.3K            | 160
Portuguese  | 2K           | 1.5K            | 240

We also observed that, when we performed topic modeling on the language-specific datasets, most foreign-language words were grouped together into foreign-language topics. For example, surveys in the German collection containing French questions all shared one particular high-probability topic. We ignored those foreign-language topics and surveys in our analysis.
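A minimal sketch of the filtering rules above, assuming the Gensim dictionary from the pre-processing step; survey.questions is a hypothetical accessor, and Gensim's filter_extremes thresholds on document frequency, which only approximates the paper's overall corpus frequency.

```python
MIN_QUESTIONS = 5     # drop surveys with fewer than five questions
MIN_TERM_COUNT = 10   # drop words rarer than this corpus-wide

# survey.questions is a hypothetical accessor used for illustration.
surveys = [s for s in surveys if len(s.questions) >= MIN_QUESTIONS]

# Gensim thresholds on document frequency, a close proxy for overall
# corpus frequency when a term rarely repeats within one short survey.
dictionary.filter_extremes(no_below=MIN_TERM_COUNT, no_above=1.0, keep_n=None)
```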
B. Evaluation metrics

For the English surveys, we had a set of manually identified categories and their associated manually designed survey questions (category templates) [1]. We used these to evaluate the automatically generated survey categories and their relevance-ranked surveys; this manual evaluation was performed by domain experts in survey design. A key aim of this project is to automatically identify topics from the multilingual survey datasets and compare them with the manually identified English survey categories.

The ranking score (1) can also be used to evaluate the cohesiveness of an identified set of categorical surveys: high survey scores represent high relevance to the given category. So, for each category k in a dataset, we compute the mean of the relevancy scores of all grouped surveys:

\mu_k = \operatorname{mean}_{d \in D_k}(\exp(m(d)))    (2)

where m(d) is from (1) and D_k is the set of grouped surveys for topic k. High values of \mu_k indicate that the ranked surveys in D_k are highly cohesive and relevant to topic k. The manual evaluation of survey categories across languages by our domain experts supports this (see section IV-D).

Similarly, to evaluate the cohesiveness of question clusters, we compute each question's distance from the cluster centroids (generated by FCM) and then compute the distance means for each question cluster; a small mean is a good indication of a cohesive cluster. We also use a question's overall frequency of appearance in its associated surveys to compute its importance.
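Both scores translate directly into code. The sketch below assumes theta rows are the estimated topic mixtures \hat{\theta}_d, each summing to one; the epsilon guard against log(0) is our addition, as the paper does not say how zero probabilities are handled.

```python
import numpy as np

def relevance_score(theta, topic_set):
    """m(d) from Eq. (1) for one survey's topic mixture theta (sums to 1)."""
    in_T = np.zeros(len(theta), dtype=bool)
    in_T[list(topic_set)] = True
    eps = 1e-12  # guard against log(0); not specified in the paper
    return (np.log(theta[in_T] + eps).sum()
            + np.log(1.0 - theta[~in_T] + eps).sum())

def category_cohesion(thetas, topic_set):
    """mu_k from Eq. (2): mean of exp(m(d)) over the surveys in D_k."""
    return float(np.mean([np.exp(relevance_score(t, topic_set))
                          for t in thetas]))

# Example: rank two surveys for the topic set T = {0}.
thetas = np.array([[0.70, 0.10, 0.10, 0.10],
                   [0.25, 0.25, 0.25, 0.25]])
T = {0}
ranked = sorted(range(len(thetas)),
                key=lambda i: relevance_score(thetas[i], T), reverse=True)
print(ranked, category_cohesion(thetas, T))
```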
C. Results

We conducted our initial experiments on a toy dataset using the HDP algorithm [3], which showed that it can learn a considerable number of topics (e.g., ~100 topics from 8K surveys). Based on the estimated topic words (e.g., student, teacher), we give an appropriate name (e.g., education) to each topic found by HDP. Most of the manually identified categories [1] are automatically discovered by HDP; moreover, HDP finds additional meaningful topics. In addition, we ranked the relevant surveys for a given topic using (1). To determine subtopics under an identified topic (e.g., education), we collected all the relevant surveys belonging to that topic (based on (1) and a predefined threshold) and performed another round of topic model estimation on them.

Our experiments on the Spanish, German, French, and Portuguese research datasets identified several existing topics and new sub-topics. Tables II and III show subsets of the topics and sub-topics found for Spanish and French.

Table II
SUBSET OF CATEGORIES FOUND FROM A SPANISH REPRESENTATIVE DATASET

Categories                  | Sub-categories
Business                    | Partnerships; web security
Parent satisfaction surveys | Football; grade levels; alumni
Customer feedback           | Product evaluation
Education                   | Alumni; campus selection; student satisfaction survey; course evaluation

Table III
SUBSET OF CATEGORIES FOUND FROM A FRENCH REPRESENTATIVE DATASET

Categories      | Sub-categories
Market research | Consumer preferences; product feedback
Human resources | Manager evaluation; work place evaluation; employee evaluation; facilities and services
Just for fun    | Transportation; vacation travel; media usage

D. Discussion

We observed that the Spanish surveys behave much like the English surveys, i.e., the system found similar, meaningful topics. However, on the German and French datasets, our bag-of-words topic modeling did not perform as well as expected: the majority of topics were polluted with noise, and it was hard to find meaningful topics. We noticed that the survey and question structures in these languages differ considerably from those in English and Spanish; for example, the German surveys contain many questions with diverse topical ranges. We believe these differences in the style of question and survey formation affect the performance of the topic-modeling-based survey categorization algorithm. Our observations are supported quantitatively by the means of the categorical survey ranking scores (2), \mu_k, which represent the quality of the corresponding categories (Fig. 4 and Fig. 5). For the French, Portuguese, and German datasets, \mu_k drops sharply after the first few topic sets (D_k): past those sets, the topic mixtures of the remaining surveys become almost uniform, which adversely affects the ranking scores m(d) and their means \mu_k. This indicates that the first few topic sets are good candidates for question bank building. We believe the results can be further improved with a better lemmatizer for German and French survey text.

Figure 4. The means (\mu_k) of the survey ranking scores of topics (sorted in descending order of \mu_k, (2)) for the Portuguese, Spanish, German, and French research datasets.

Figure 5. The survey counts of the top-ranked topics by \mu_k for the Portuguese, Spanish, German, and French research datasets. The \mu_k do not depend on the set cardinalities.
E. Future work

We have noticed that the majority of survey questions follow one of two question types (structures): Yes/No questions (whose answers are "Yes" or "No") and question-word questions (e.g., What, When, How). Topic models may not be able to distinguish these question types, since they produce a global view of corpus-wide topics and usually cluster these common words into a single cluster. It may therefore be a good idea to group questions into question-type classes beforehand, which could support another sublayer of topics based on question type. We also plan to learn language-specific model hyper-parameters [13] as an alternative to removing language-specific stop-words.
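A minimal sketch of the proposed question-type pre-grouping. The keyword lists are illustrative English examples only; a real multilingual system would need per-language lists.

```python
# Illustrative English keyword lists; each survey language would need
# its own (hypothetical) lists in a real multilingual system.
QUESTION_WORDS = {"what", "when", "where", "who", "why", "how", "which"}
YES_NO_LEADS = {"do", "does", "did", "is", "are", "was", "were",
                "can", "could", "will", "would", "have", "has"}

def question_type(question: str) -> str:
    """Assign a coarse question-type class from the leading token."""
    tokens = question.strip().lower().split()
    if not tokens:
        return "other"
    if tokens[0] in QUESTION_WORDS:
        return "question-word"
    if tokens[0] in YES_NO_LEADS:
        return "yes/no"
    return "other"

# Grouping questions by type first would let topic modeling operate
# within each class, forming the proposed sublayer of topics.
```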
V. CONCLUSION

In this paper, we described the problem of automatic topic discovery and categorization for multilingual surveys and questions. We proposed a system that tackles this problem using well-known topic modeling frameworks, Latent Semantic Indexing, Latent Dirichlet Allocation, and the Hierarchical Dirichlet Process, together with fuzzy clustering methods. We also discussed our experimental results and the refining methods used to improve the question clusters and survey clusters so that they can be used for commercial survey question bank generation.

ACKNOWLEDGMENT

The authors would like to acknowledge the support for this project from SurveyMonkey. We would like to thank Carlos Ibarra and Eric Esteban for helping us evaluate the results.

REFERENCES

[1] (2012) SurveyMonkey. [Online]. Available: http://www.surveymonkey.com
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," JMLR, vol. 3, pp. 993-1022, March 2003.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.
[4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 1981.
[5] C. Grant, C. P. George, C. Jenneisch, and J. N. Wilson, "Online topic modeling for real-time twitter search," TREC 2011 Notebook, 2011.
[6] L. Hong and B. D. Davison, "Empirical study of topic modeling in Twitter," in Proc. of SOMA '10. New York, NY, USA: ACM, 2010, pp. 80-88.
[7] L. Francis and M. Flynn, Text Mining Handbook. Casualty Actuarial Society E-Forum, Spring 2010.
[8] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009.
[9] C. Wang and D. M. Blei, "A split-merge MCMC algorithm for the hierarchical Dirichlet process," CoRR, 2012.
[10] R. Řehůřek and P. Sojka, "Software framework for topic modelling with large corpora," in Proc. of the LREC 2010 Workshop. ELRA, 2010, pp. 45-50.
[11] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel, e1071: Misc Functions of the Dept. of Statistics, TU Wien, R package, 2011.
[12] A. Cohen. (2011) Fuzzy string matching in Python. [Online]. Available: https://github.com/seatgeek/fuzzywuzzy
[13] H. Wallach, D. Mimno, and A. McCallum, "Rethinking LDA: Why priors matter," NIPS, vol. 22, pp. 1973-1981, 2009.