Generating Cohesive Semantic Topics from Latent Factors

Paulo Bicalho, Tiago Cunha, Fernando Mourão, Gisele L. Pappa and Wagner Meira Jr.
Universidade Federal de Minas Gerais, Computer Science, Belo Horizonte, MG, Brazil
Email: {p.bicalho, tocunha, fhmourao, glpappa, meira}@dcc.ufmg.br

Abstract—Extracting topics from posts in social networks is a challenging and relevant computational task. Traditionally, topics are extracted by analyzing syntactic properties of the messages, assuming a high correlation between syntax and semantics. This work proposes SToC, a new method for generating more cohesive and meaningful semantic topics within a context. SToC post-processes the output of a Non-Negative Matrix Factorization (NMF) method in order to determine which latent factors should be further merged to improve cohesion. Based on NMF's output, SToC defines a topic transition graph and uses Markovian theory to merge pairs of topics mutually reachable in this graph. Experiments on two real data samples from Twitter demonstrate that SToC is statistically better than fair baselines in supervised scenarios and able to determine cohesive and semantically valid topics in unsupervised scenarios.
I. INTRODUCTION

Social networks are an excellent source of data for understanding the interests of large groups of people on a variety of topics, including elections, product popularity, and government policies, among others. Indeed, a recent study showed that close to one fifth of US Internet users have posted online or used a social networking site for civic or political engagement [1]. The popularity of social networks, together with the diversity of user profiles, guarantees a good enough sample to understand matters such as those listed above. Topic discovery methods are among the most explored approaches to extract information from large amounts of data. They were conceived to find semantically meaningful topics in a document corpus and are mostly based on matrix decomposition techniques [2]. However, traditional methods for semantic topic discovery usually face a challenge: the semantics of a topic is usually extracted by analyzing syntactic properties of the data, i.e., they assume a high correlation between the syntax and semantics of the text. Hence, these methods usually output clusters that individually encompass more than a single semantic topic (low-cohesion topics) or a number of different clusters referring to the same topic [3]. Furthermore, in the context of social data, posts are frequently short and no information about context is provided, making the task of topic discovery more challenging. In order to overcome the aforementioned problems, this paper proposes a new method that merges topics generated by traditional topic discovery approaches and makes them more cohesive and meaningful within the context being explored. This is done by analyzing the mutual reachability between topics when they are transformed into a transition graph of topics. Cohesion is defined as the semantic relationship between a pair of topics and is measured by a modified version
of the Bhattacharyya distance [4]. SToC (Semantic TOpic Combination) focuses on social data, although it can also be applied to other types of documents. It works in two phases: first, it uses a traditional matrix factorization method to extract topics from data; second, a new method is used to merge these topics in order to increase topic cohesion. The first phase uses a Non-negative Matrix Factorization (NMF) [5] algorithm, which is capable of generating good-quality topics despite vocabulary overlaps. The method takes as input a term-document matrix representation of the documents, using a bag-of-words model, and generates two types of output: (i) a representation of each topic in terms of a weighted combination of terms and (ii) a representation of each document in terms of a weighted combination of topics. In both representations, the weight indicates how closely a particular keyword is related to the corresponding topic and how closely a particular topic is related to the corresponding document. However, these semantic topics may still lack cohesion [3]. In the second phase we use these weights to build a tripartite document-topic-term graph, which is then transformed into a transition graph of topics (i.e., latent factors), and Markovian theory is applied to calculate the reachability between pairs of topics in the graph. Topics mutually reachable are merged into a single new semantic topic. Note that by merging mutually reachable topics we indirectly account for context. For example, it makes sense to merge topics related to abortion and economic crisis in a political context, as the result may be interpreted as a semantic topic that includes issues the government is being criticized for. On the other hand, these two topics should not be placed together in cases where we simply want to know what people think about abortion. These issues are handled by the mutual reachability, which considers terms and documents indirectly shared by different topics with high probabilities as more semantically related. In summary, the main contribution of this paper is a new context-sensitive semantic topic identification method that can be used in any topic detection task and optimizes a desirable property in topic identification: semantic cohesion. SToC is also scalable and efficient, and handles the lack of context usually inherent to microblogs and other social media platforms. SToC was evaluated in two steps. First, we tested the topic identification method on a collection of Twitter posts where the main topics were known. Second, we performed a case study with over 50,000 Twitter followers of Barack Obama, considering more than 700,000 messages posted during the American elections in 2012. The results showed that the method is able to generate more concise and semantically cohesive topics when compared to three baselines.
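To make the first phase concrete, the sketch below builds a bag-of-words term-frequency matrix and factorizes it with NMF. It is only an illustrative approximation using scikit-learn and toy posts; the paper does not prescribe a specific implementation, and the preprocessing described in Section III is omitted here.

    # Illustrative sketch of SToC's first phase (not the authors' implementation):
    # build a term-frequency matrix from posts and factorize it with NMF.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import NMF

    posts = [  # toy examples for illustration only
        "watching the football at albanos while eating the best pesto spaghetti ever",
        "obama and romney debate the economy before the election",
        "the republican convention approved a strict anti-abortion plan",
    ]

    vectorizer = CountVectorizer(stop_words="english")   # TF (bag-of-words) model
    A = vectorizer.fit_transform(posts)                   # m x n matrix: posts x terms

    k = 2                                                 # number of latent factors (data dependent)
    nmf = NMF(n_components=k, init="nndsvda", max_iter=100, random_state=0)
    W = nmf.fit_transform(A)      # m x k: each post as a weighted combination of factors
    H = nmf.components_.T         # n x k: each term as a weighted combination of factors,
                                  # transposed so that A is approximately W H^T as in the paper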
II. RELATED WORK

Topic identification methods are usually based on one of the following approaches: (i) clustering, which includes traditional data mining algorithms applied to textual data [6]; (ii) probabilistic, such as Latent Dirichlet Allocation (LDA), where a generative model explains sets of observations by the similarity inherent to some parts of the data [7]; and (iii) non-probabilistic, which generate good-quality topics regardless of vocabulary overlap. Non-probabilistic methods, such as matrix factorization and sparse coding [8, 3], are the focus of this paper. They assume that there are a few latent factors, not directly observable from the data, that represent most of the original data. In topic identification, each latent factor represents a semantic topic. As one may refer to the same semantic topic using different vocabularies, this type of technique actually generates fragmented or redundant representations of semantic topics, known as semantic sub-topics. In order to deal with this problem of redundancy, Kuhn et al. [9] use a Latent Semantic Indexing (LSI) method followed by a clustering process to identify semantic topics from the source code of a system. Their method identifies N latent factors in the raw data and represents the original files in the new N-dimensional space. Next, it clusters the files using a co-variance matrix built on top of this new space. Wang et al. [10] propose a general topic modeling method, Group Matrix Factorization (GMF), to enhance the scalability and efficiency of non-probabilistic approaches. GMF assumes that the documents have already been categorized into multiple semantic classes, and that there exist class-specific topics for each of the classes as well as shared topics across all classes. Topic modeling is then formalized as the problem of minimizing a general objective function with regularizations and/or constraints on the class-specific and shared topics. Recognizing the difficulty of generating cohesive topics, Choo et al. [11] propose a visual analytics system for topic modeling called UTOPIAN (User-driven Topic modeling based on Interactive Nonnegative Matrix Factorization). UTOPIAN is a semi-supervised system that enables users to interact with the topic modeling method and steer the result in a user-driven manner. The authors propose a variety of user interactions, mostly based on individual topics or documents, using a supervised version of NMF (SS-NMF). Tang et al. [12], in turn, propose a general solution that exploits multiple types of context without arbitrary manipulation of the structure of classical topic models. They formulate different types of context as multiple views of the document corpus, and propose a co-regularization framework to let these views collaborate with each other, vote for the consensus topics, and distinguish them from view-specific topics. Among the aforementioned methods, the work of [9] is the most similar to ours, but it uses a clustering method to merge topics. SToC represents a more scalable and robust strategy for modeling semantic topics.
III. SToC: A FRAMEWORK FOR IDENTIFYING COHESIVE SEMANTIC TOPICS

This section describes SToC, the framework we propose to identify cohesive semantic topics, illustrated in Figure 1. Steps one to three, i.e., text preprocessing, data modeling and extraction of latent factors, do not differ from what has been proposed in the literature so far.
Fig. 1. SToC: a framework for identifying semantic topics.
The main contribution of the framework comprises steps four to six, where semantic sub-topics are merged into cohesive topics. The text preprocessing phase follows the traditional steps of text preprocessing in Information Retrieval, which include the removal of special characters, stop-words, plural and gender markers, and the conversion of verbs to the infinitive. After preprocessing, the data modeling phase creates a posts × terms matrix using the term frequency (TF), which empirically showed better results than a TF-IDF representation. We then use the resulting matrix as input to a Non-negative Matrix Factorization (NMF) method. Finally, we merge the resulting sub-topics by defining a transition graph among distinct latent factors and using Markovian theory to calculate the reachability among factors in this graph. Factors mutually reachable in the graph are merged into a single factor that represents a semantic topic, as detailed below.

A. Identifying latent factors

Many works in the literature have dealt with the problem of identifying latent factors. Here we work with NMF because it does not assume that latent factors compose a space of independent variables. Further, NMF provides an intuitive modeling of documents through these factors, defining each document as a sum of positive components. Specifically, we adopted the traditional NMF implementation based on multiplicative update algorithms [13]. NMF can be briefly described as follows. Given an input matrix A ∈ R^{m×n}, where each row represents a post and each column represents a term, and an integer k < min{m, n} representing the number of desired latent factors, NMF finds two non-negative matrices W ∈ R^{m×k} and H ∈ R^{n×k} such that A ≈ W H^T. Defining k for NMF is not simple, since NMF is a non-convex problem without a unique solution, and there is no guarantee of finding the global minimum [14]. Here we evaluate empirically the impact of distinct values of k on our goal of identifying cohesive topics.

B. Finding Semantic Topics

We have already discussed that matrix factorization methods produce semantic sub-topics, but are not able to guarantee that the latent factors represent distinct and cohesive semantic topics. Based on these properties, we can redefine the semantic topic identification problem as: identify the minimum number of semantic topics, with high internal cohesion, in a set of posts D. We define cohesion as the property of a pair of
topics being semantically related. For instance, posts about football and tennis belong to topics that are semantically close to each other, since both talk about sports. However, measuring cohesion is a challenge, since a given subject may be discussed using distinct vocabularies. Here we ensure the consolidation of cohesive semantic topics by going beyond simple syntactic analysis. In order to merge semantic sub-topics, we first assume latent factors are modeled by a stochastic process, and that there is a probability of reaching a factor f′ when leaving a factor f. Hence, we first represent the posts, terms and topics (latent factors) as a tripartite graph, and transform it into a transition graph among the different latent factors. Then, we perform a random walk between pairs of latent factors in this graph, and merge factors with high probabilities of mutually reaching each other. The premise is that mutually reachable topics talk about similar matters, since they share sub-vocabularies and documents. As already discussed, the way topic combination is performed is context dependent; hence, in different scenarios the same sub-topics may or may not be merged according to the context.

1) Creating a tripartite graph: NMF outputs two matrices. The m × k matrix W represents the m input posts through k latent factors. Each value Wij greater than zero defines both a directed edge from post pi to latent factor fj and an edge in the opposite direction, from fj to pi. Similarly, the n × k matrix H describes each term through the k factors, and each cell Hij greater than zero defines a relation between a term and a latent factor. Based on these relations, we build a weighted directed tripartite graph Gt with three types of nodes, T, F and D, representing the terms, latent factors and posts, respectively. The intuition behind this graph is that a term can be associated with more than one latent factor (e.g., the term depression might be present in topics about health and politics with different meanings), just as a post may talk about one or more latent factors (e.g., the same post can talk about sports and food: “Watching the football at Albanos while eating the best pesto spaghetti ever #goPSG”). Each edge weight represents the intensity of the relationship between two nodes and is derived from W and H. In order to define transition probabilities among nodes in Gt, we use the normalized values of Wij and Hij as edge weights. Edges leaving a term or post towards a latent factor, Wij or Hij, are normalized by the sum of the values of the i-th row of W or H, making the sum of all leaving probabilities equal to 1. Analogously, for edges leaving latent factors towards terms/posts, we normalize their values by the sum of the values of the j-th column.

2) Building the Topic Transition Graph: We then convert the tripartite graph Gt into a new graph G describing only the relationships between latent factors in F. As the edge weights in Gt are normalized, each weight can be interpreted as the probability of leaving one node and directly reaching another. Following this rationale, each latent factor fi ∈ F also has an indirect probability of reaching another factor fj ∈ F through terms and posts. We want to transform these indirect links into direct relations between factors. Having a graph that represents only latent factors, we can then calculate the probabilities of a random walker leaving a latent factor fi and reaching a latent factor fj.
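A minimal NumPy sketch of the edge-weight normalization described above (assuming W and H are available as dense arrays from the NMF step; the function name and signature are ours, not the paper's):

    import numpy as np

    def tripartite_edge_weights(W, H, eps=1e-12):
        """Normalize NMF outputs into the edge weights of the tripartite graph Gt.

        W: m x k (posts x factors), H: n x k (terms x factors), dense NumPy arrays.
        Rows normalized to 1 give post/term -> factor probabilities; columns
        normalized to 1 give factor -> post/term probabilities."""
        W = np.asarray(W, dtype=float)
        H = np.asarray(H, dtype=float)
        post_to_factor = W / (W.sum(axis=1, keepdims=True) + eps)   # each row sums to 1
        term_to_factor = H / (H.sum(axis=1, keepdims=True) + eps)
        factor_to_post = W / (W.sum(axis=0, keepdims=True) + eps)   # each column sums to 1
        factor_to_term = H / (H.sum(axis=0, keepdims=True) + eps)
        return post_to_factor, term_to_factor, factor_to_post, factor_to_term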
In order to compute these probabilities, we transform each two-step probabilistic path into a single edge using Equation 1. Note that we join these probabilities by summing up their contributions, but other kinds of combination could be evaluated.
 1: function JOINTOPICS(k, M)
 2:     while k > 1 and lastK ≠ k do
 3:         lastK = k
 4:         Pmin = getMinTransitionProbability(M)
 5:         M′ = modifyTransitionMatrix(M)
 6:         candidateList = getImpact(M′, k, Pmin)
 7:         candidate ← getBestCandidate(candidateList)
 8:         while candidate.impact > 0 do
 9:             updateTransitionMatrix(M, candidate)
10:             candidate ← getBestCandidate(candidateList)
11:             k−−
12:     return M

13: function GETIMPACT(M, k, Pmin)
14:     for i = 1 → k do
15:         for j = i + 1 → k do
16:             cohesion = sqrt(M[i, j] × M[j, i]) × (1 + |Merge(i, j)|^-1)
17:             impact(i, j) = (cohesion − Pmin) / Pmin
18:     return impact

Fig. 2. Algorithm for merging latent factors.
P(f_i \to f_j) = \sum_{k \in T^{f_i}} P(t_k \mid f_i) \times P(f_j \mid t_k) + \sum_{k \in D^{f_i}} P(d_k \mid f_i) \times P(f_j \mid d_k)    (1)
where T^{f_i} and D^{f_i} are the sets of terms and posts with an input edge from f_i. In Equation 1, the first sum represents all indirect paths from f_i to f_j passing through terms, while the second comprises the indirect paths through documents. The resulting graph G is represented by a stochastic transition matrix M, where the sum of each row of M is equal to 1. As we are concerned with semantic relations among latent factors, and topics that are not semantically correlated may still share a subset of terms, it is important to distinguish effective relations from noisy ones. A noisy relation is defined as an edge in G with transition probability smaller than the random transition probability in a graph of the same size. Noisy relations are identified and removed from G by setting their corresponding cells in matrix M to zero. In order to ensure that M remains stochastic, we re-normalize each of its rows. Further, in order to ensure that the random walking process conducted in the next step converges to a unique solution, we make G irreducible (all of its nodes are mutually reachable) and aperiodic (there is no integer S > 1 that divides the length of every cycle of the graph) by adding a random transition probability between each pair of nodes of G.

3) Merging Topics: Having the topic transition graph, the next step is to merge semantic sub-topics. The idea is to merge topic pairs with a high mutual probability of reaching each other, as long as this mutual probability is higher than the random transition probability, as described by the algorithm presented in Figure 2. The algorithm receives as inputs the number k of latent factors identified by NMF and the transition matrix M generated in the previous step. It is iterative, and at each iteration a minimum transition probability Pmin is defined over all pairs of topics i, j in M as the mean probability of a random walker leaving i (i.e., one minus the self-loop probability of i) times the probability of the random walker going from i to a topic chosen at random (line 4). Next, as the graph G represented by M is usually not a clique, line 5 calculates the probabilities of nodes that are not directly connected reaching each other, generating M′. Since M is irreducible, we can obtain this information by raising M to the power d (M^d), where d is the diameter of G. Next, in line 6, the function getImpact calculates, for each pair of topics, the impact that merging them would produce in M′ w.r.t. cohesion. We quantify cohesion as the mutual transition probability between two topics i, j times a weight (1 + 1/|Merge(i, j)|), where |Merge(i, j)| is the number of distinct latent factors that compose the topic created when merging i and j. This weight tries to prevent two big topics from being merged, since big topics usually have high incoming transition probabilities. The mutual transition is measured, in turn, through the Bhattacharyya distance (sqrt(M[i, j] × M[j, i])), which penalizes pairs of topics with unbalanced or low probabilities of reaching each other. Therefore, we define the impact of a merge by how much larger or smaller the weighted mutual probability of two topics is compared to the random transition, normalized by the latter (line 17). In line 7, the pair of topics with the highest positive impact is selected, the corresponding topics are merged, and M is updated by replacing the two merged topics with a single new one; the transition probabilities of this new topic are also updated (line 9). Next, the pair with the highest impact in M is selected, and this process goes on until there are no remaining pairs with positive impact or the number of resulting topics reaches one.
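To illustrate how Equation 1 and the merging procedure of Figure 2 fit together, the following NumPy sketch builds the topic transition matrix and greedily merges topic pairs. Several details are assumptions on our part rather than the paper's specification: the teleportation value used to make the chain irreducible and aperiodic, the exponent d standing in for the graph diameter, the exact reading of Pmin, and the way rows and columns of M are combined after a merge; the double loop of Fig. 2 is also simplified to one merge per iteration.

    import numpy as np

    def build_topic_transition_matrix(W, H, teleport=0.01, eps=1e-12):
        """Sketch of Equation 1: collapse factor->post/term->factor paths into a
        k x k topic transition matrix M. `teleport` is an assumed small value
        (not given in the paper) making the chain irreducible and aperiodic."""
        W = np.asarray(W, float)
        H = np.asarray(H, float)
        Wr = W / (W.sum(axis=1, keepdims=True) + eps)   # post -> factor (rows sum to 1)
        Hr = H / (H.sum(axis=1, keepdims=True) + eps)   # term -> factor
        Wc = W / (W.sum(axis=0, keepdims=True) + eps)   # factor -> post (columns sum to 1)
        Hc = H / (H.sum(axis=0, keepdims=True) + eps)   # factor -> term
        # Equation 1: indirect paths through documents (W) plus paths through terms (H).
        M = Wc.T @ Wr + Hc.T @ Hr
        M = M / (M.sum(axis=1, keepdims=True) + eps)    # make rows stochastic
        k = M.shape[0]
        M[M < 1.0 / k] = 0.0                            # drop noisy relations
        M = M / (M.sum(axis=1, keepdims=True) + eps)    # re-normalize rows
        M = (1.0 - teleport) * M + teleport / k         # irreducible and aperiodic chain
        return M / M.sum(axis=1, keepdims=True)

    def join_topics(M, d=3, max_iter=1000):
        """Simplified version of the merging procedure (Fig. 2): one merge per
        iteration, stopping when no pair has positive impact. `d` stands in for
        the graph diameter used to expose indirect reachability via M^d."""
        M = np.array(M, float)
        groups = [[i] for i in range(M.shape[0])]       # latent factors per topic
        for _ in range(max_iter):
            k = M.shape[0]
            if k <= 1:
                break
            # Pmin (one possible reading): mean over topics of the probability of
            # leaving i times the probability of picking one of the other topics.
            p_min = np.mean((1.0 - np.diag(M)) / max(k - 1, 1))
            Md = np.linalg.matrix_power(M, d)           # reachability for non-adjacent topics
            best, best_pair = 0.0, None
            for i in range(k):
                for j in range(i + 1, k):
                    size = len(groups[i]) + len(groups[j])
                    cohesion = np.sqrt(Md[i, j] * Md[j, i]) * (1.0 + 1.0 / size)
                    impact = (cohesion - p_min) / p_min
                    if impact > best:
                        best, best_pair = impact, (i, j)
            if best_pair is None:
                break
            i, j = best_pair
            # Assumed update rule (not fully specified in the paper): sum the rows
            # and columns of the merged pair and re-normalize the rows of M.
            M[i, :] += M[j, :]
            M[:, i] += M[:, j]
            M = np.delete(np.delete(M, j, axis=0), j, axis=1)
            M = M / M.sum(axis=1, keepdims=True)
            groups[i] += groups[j]
            del groups[j]
        return M, groups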
IV. EXPERIMENTAL RESULTS

This section reports evaluations of SToC on two datasets derived from Twitter. The first, named Observatory, is composed of 5,431 tweets, collected using keywords defined by specialists and manually classified as belonging to one out of six topics: religion, dengue fever, soccer, election, traffic, or cars. The second, called Election, comprises 708,121 tweets posted by Obama's followers from July to September 2012. For this second dataset, we do not know a priori the topics being discussed, as in most real-world applications. As SToC works with posts × terms matrices, after the text preprocessing phase the Observatory dataset was described by 1,956 terms (a matrix of 5,431 × 1,965), while Election was described by 61,526 terms (a matrix of 708,121 × 61,526).

A. Quantitative Evaluation

Evaluating the topics identified by SToC can be as difficult as obtaining them. For this reason, this section defines two disjoint sets of metrics that evaluate the performance of the semantic topic identification phase, assuming a supervised and an unsupervised scenario. In both analyses, we compare SToC with three other semantic topic identification techniques: (i) NMF with its input parameter k equal to the final number of topics found by SToC, (ii) Matlab's hierarchical average-linkage clustering algorithm, with cosine similarity, applied to the topics extracted by NMF (NMF+HC), and (iii) Weka's [6] version of the expectation-maximization algorithm (EM) applied to the topics created by NMF (NMF+EM). NMF runs for 100 iterations in all cases. Aiming to present fine-grained evaluations of the trade-off between effectiveness and coverage for Observatory, we calculate the precision and recall of the discovered topics. Both metrics depend on how documents are mapped to the topics found and how topics are mapped to known classes. We assume that each post is associated with a single topic, although mappings with many topics per document are possible (see Figure 3), according to the probabilities found by NMF.
Fig. 3. Final topics found by SToC ordered by the frequency of documents assigned to them in Observatory and Election (a), and a histogram of the number of latent factors assigned to each semantic topic after running SToC (b).

TABLE I. Results of a supervised analysis for Observatory.
Method    K    Topics   Precision   Recall    Accuracy
SToC      50   16.2      0.830       0.638     0.649
SToC      100  26.5     •0.837      ▼0.567    ▼0.575
SToC      200  43.3     •0.811      ▼0.479    ▼0.488
SToC      300  55.5     ▼0.764      ▼0.433    ▼0.440
SToC      400  64.7     ▼0.730      ▼0.389    ▼0.399
SToC      500  69.1     ▼0.651      ▼0.306    ▼0.318
NMF       16   -        ▲0.888      ▼0.479    ▼0.469
NMF       26   -        ▲0.902      ▼0.362    ▼0.352
NMF       43   -        ▲0.944      ▼0.276    ▼0.268
NMF       55   -        ▲0.963      ▼0.235    ▼0.228
NMF       65   -        ▲0.967      ▼0.211    ▼0.206
NMF       69   -        ▲0.974      ▼0.208    ▼0.201
NMF       6    -        ▼0.506      ▼0.543    ▼0.543
NMF+HC    50   16.2     ▼0.805      ▼0.535    ▼0.534
NMF+EM    50   5.3      ▼0.447      ▼0.267    ▼0.284
This is the most restrictive way to map posts: it associates each post with its most likely topic and each topic with the most frequent class among the posts assigned to it. During this mapping, 2.6% of the posts from Observatory are ignored, since they present zero probability for all latent factors (for Election, this number is 1%). Similarly, a topic can be mapped to zero or one known class. In cases where the number of topics found is greater than the number of known classes, only the topics with the highest number of documents are mapped to the real classes, and this is reflected in the recall of the methods. Table I shows the average results obtained over 10 runs of the methods for Observatory. The first column reports the values tested for the parameter k, which is data dependent. The column Topics shows the number of topics obtained after the merging process, followed by the values of precision, recall and accuracy. The results obtained by SToC are followed by the results of NMF without any topic merging process. Note that the values of k here were set to the final numbers of topics obtained by SToC. We also report the values with k = 6, which is the known number of topics. For NMF+HC, the stopping criterion was reaching the same number of topics found by SToC. For EM, we set the number of iterations to 100. Results are compared using a t-test with confidence level 0.01, using the best value of k for SToC as reference (first line of the table). The symbol ▲ (▼) in a cell indicates that its result is better (worse) than the reference, and • indicates no statistical evidence of a difference. Analyzing the different values of k tested and the final number of topics obtained during the merging process, we notice that the bigger k is, the greater the number of final topics generated. Note that the best configuration generated 16 topics on average. We knew the real dataset had 6 topics about different subjects, but other topics not explicitly identified could appear together with the selected ones. This is one of the reasons evaluating the method is so difficult.
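As a rough illustration of the mapping described above, the sketch below assigns each post to its most likely topic and each topic to the majority class of its posts, and reports the fraction of posts matched by their topic's class. It is a simplified stand-in, not the exact precision/recall/accuracy computation used in the paper; in particular, it omits the restriction of the class mapping to the largest topics.

    import numpy as np

    def map_and_score(W, true_classes):
        """Assign each post to its most likely topic (argmax over the rows of W),
        map each topic to the most frequent class among its posts, and return the
        fraction of posts whose class matches their topic's class. Posts with zero
        weight for all latent factors are ignored, as described in the text."""
        W = np.asarray(W, dtype=float)
        y = np.asarray(true_classes)
        keep = W.sum(axis=1) > 0
        W, y = W[keep], y[keep]
        topic = W.argmax(axis=1)
        matched = 0
        for t in np.unique(topic):
            classes_in_t = y[topic == t]
            _, counts = np.unique(classes_in_t, return_counts=True)
            matched += counts.max()      # posts agreeing with the topic's majority class
        return matched / len(y)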
TABLE II. Unsupervised analysis for Observatory.

Method    K    Topics   Purity     Purity R   Entropy    Entropy R
SToC      50   16.2      0.523     0.394       6.996     7.567
SToC      100  26.5     ▼0.481     0.360      ▲6.819     7.378
SToC      200  43.3     ▼0.436     0.330      ▲6.674     7.196
SToC      300  55.5     ▼0.414     0.321      ▲6.644     7.127
SToC      400  64.7     ▼0.408     0.323      ▲6.659     7.103
SToC      500  69.1     ▼0.461     0.412      ▲6.851     7.190
NMF       16   -        ▼0.440     0.325      ▲6.886     7.488
NMF       26   -        ▼0.406     0.298      ▲6.502     7.119
NMF       43   -        ▼0.370     0.269      ▲6.127     6.755
NMF       55   -        ▼0.354     0.258      ▲5.932     6.564
NMF       65   -        ▼0.343     0.250      ▲5.810     6.442
NMF       69   -        ▼0.341     0.248      ▲5.742     6.379
NMF+HC    50   16.2     ▼0.486     0.367      •6.981     7.537
NMF+EM    50   5.3      ▲0.838     0.830      ▼8.293     8.408
In terms of precision and recall, we obtained values of 0.83 and 0.638, respectively. Observe that as the number of topics increases, the precision drops slightly, while the recall falls to about half of its original value, which is expected given the mapping strategy defined above. Comparing the results with NMF alone, at first it seems that its precision values are really high and much better than those obtained by SToC. However, looking at the recall, we realize that this precision comes at the cost of a very low recall. For example, with 16 topics, recall decreases from 0.638 for SToC to 0.479 for NMF. When NMF is followed by a hierarchical clustering algorithm, the values of precision are again statistically the same as those obtained by SToC, but the maximum recall found is 0.535, against 0.638 obtained by SToC. The results obtained by EM are always worse than those obtained by SToC.

As for most datasets the topics are unknown a priori, we also define cluster-based metrics for an unsupervised evaluation. Note, however, that these metrics are not ideal, as they do not take context into account. Cohesion, or the intra-cluster similarity, is evaluated considering two traditional metrics: the purity and entropy of the semantic topics [6]. In order to compute both metrics, shown in Eq. 2 and Eq. 3 for datasets with known classes, each semantic topic t_i is paired with the most frequent actual class among the documents assigned to t_i. In these equations, T represents the set of semantic topics (clusters) found, C is the set of known classes, D is the total number of tweets, n_{t_i} is the size of topic t_i, and n_{t_i}^{c_j} is the number of posts belonging to class c_j clustered in topic t_i.

purity(T, C) = \frac{1}{D} \sum_{i=0}^{k} \max_{j} |t_i \cap c_j|    (2)

entropy(T, C) = \sum_{i=1}^{|T|} \frac{n_{t_i}}{n} \times \left( -\frac{1}{\log |C|} \sum_{j=1}^{|C|} \frac{n_{t_i}^{c_j}}{n_{t_i}} \log \frac{n_{t_i}^{c_j}}{n_{t_i}} \right)    (3)
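A direct reading of Eq. 2 and Eq. 3 in code (an illustrative NumPy sketch, assuming topic and class assignments are given as non-negative integer label arrays):

    import numpy as np

    def purity(topics, classes):
        """Eq. 2: fraction of posts covered by each topic's most frequent class."""
        topics, classes = np.asarray(topics), np.asarray(classes)
        D = len(topics)
        total = 0
        for t in np.unique(topics):
            members = classes[topics == t]
            total += np.bincount(members).max()      # max_j |t_i intersect c_j|
        return total / D

    def entropy(topics, classes):
        """Eq. 3: size-weighted, normalized class entropy of each topic."""
        topics, classes = np.asarray(topics), np.asarray(classes)
        n = len(topics)
        n_classes = len(np.unique(classes))
        result = 0.0
        for t in np.unique(topics):
            members = classes[topics == t]
            n_t = len(members)
            p = np.bincount(members) / n_t           # n_{t_i}^{c_j} / n_{t_i}
            p = p[p > 0]
            result += (n_t / n) * (-(1.0 / np.log(n_classes)) * np.sum(p * np.log(p)))
        return result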
Table II presents the results of purity and entropy for Observatory. We also report the value of each metric when computed on clusters generated at random from the sample (columns Purity R and Entropy R), to serve as a reference for the actual metric. Looking at purity, the configuration with k equal to 50 again remains the best. For entropy, on the other hand, the value decreases as the number of topics increases; according to this metric, the best values are those obtained with k equal to 300. We can interpret these results as follows: the maximum frequency of a term, which is responsible for associating it with a single topic, reduces as we increase the number of topics (and this is reflected in the values of purity). However, the same term spreads over only a few distinct topics (and this is reflected in the entropy) while k is smaller than or equal to 300. Hence, according to purity, SToC obtains results better than those achieved by the baselines, except EM; looking at entropy, the opposite is concluded.
TABLE III. Unsupervised analysis for Election.

Method    K    Topics   Purity     Purity R   Entropy    Entropy R
SToC      50   22        0.444     0.419       8.062     8.309
SToC      100  34.9     ▼0.303     0.251      ▲7.930     8.278
SToC      200  53.3     ▼0.367     0.340      ▲7.866     8.220
SToC      300  69.2     •0.409     0.394      ▲7.845     8.204
SToC      400  85.3     •0.403     0.389      ▲7.813     8.192
SToC      500  98.2     •0.405     0.392      ▲7.816     8.200
NMF       22   -        ▼0.246     0.207      ▼8.164     8.541
NMF       35   -        ▼0.187     0.118      ▲7.898     8.355
NMF       53   -        ▼0.171     0.095      ▲7.687     8.207
NMF       69   -        ▼0.168     0.089      ▲7.549     8.111
NMF       85   -        ▼0.165     0.086      ▲7.447     8.031
NMF       98   -        ▼0.159     0.077      ▲7.387     7.994
NMF+HC    50   22       •0.365     0.327      •8.042     8.333
NMF+EM    50   4.6      ▲0.682     0.672      ▼8.263     8.367
A similar analysis was performed using the Election dataset, where we do not have a ground truth as reference. For datasets with unknown classes, instead of considering the purity and entropy of the clusters w.r.t. the real classes, we evaluate them considering the terms observed across different groups, as in [15]: for each term, we calculate its purity and entropy w.r.t. the frequency with which it appears in the different groups. The results are reported in Table III. For this dataset, the results of both purity and entropy are much closer to the results obtained from the random clusters. An analysis of the purity values shows that purity first drops and then increases again as the number of topics grows. Again, the results of entropy contradict those of purity, with purity pointing to EM as the best method and entropy pointing to NMF, whose configuration with 98 topics has an entropy of 7.387. The explanation is the same as before: entropy captures terms spread along different topics, while purity focuses on high-frequency terms associated with topics. Figure 3(b) shows, for k equal to 50, how many latent factors composed each semantic topic at the end of SToC's execution. The number of topics found by SToC was 16 for Observatory and 21 for Election. For Election, there is a large topic containing 22 latent factors (44% of all factors found by NMF) and 16 topics formed by only one latent factor. For Observatory, latent factors were better distributed among the topics. This can be explained by the way the datasets were collected. We know that, for Observatory, tweets do belong to different subjects; Election, in contrast, refers to a more concise subject, and SToC might not have been as good at identifying sub-topics within it. For both datasets, EM presents better values of purity. This may be explained by looking at the topics generated by EM and SToC: EM always finds one giant topic and several very small ones (with high purity), while in SToC the sizes of the topics are better distributed. For instance, EM found one big topic containing 85% of the latent factors on Election, while the biggest topic found by SToC corresponds to 35% of the latent factors (see Figure 3(b)). Since the values reported correspond to the average purity of each topic, and very small topics generally have high purity, these values end up really high.

B. Qualitative Evaluation

As previously discussed, since the method finds semantic topics within a context, the traditional measures for evaluating the quality of topics might not be the most appropriate. This section shows a sample of the topics obtained by SToC in
the dataset Election, and discusses the quality of the results. In the dendrogram in Figure 4, each leaf node represents a latent topic. It is important to emphasize that this dendrogram is not the one with the highest purity: it was selected among the clusterings generated by the 10 different NMF runs as the one that visually presented the best results.
! " #
# $ % !
#
# # &
# # & # ## # &
Fig. 4. Sample of semantic topics found by SToC in the dataset Election.
This choice was based on an analysis of the composition of each final topic and how it related to the events that occurred on the dates the tweets were posted. The purity of the topics shown is 0.37, while the highest obtained was 0.52 and the average was 0.44, as reported in Table III. Again, we believe that a high value of purity is easy to achieve when the number of small topics is large. In particular, purity is 1 if each latent factor gets its own cluster. Thus, purity does not trade off the quality of the clustering of topics against the number of final topics, and better metrics to evaluate the semantic topics should be investigated in the future. Figure 4 shows two topics found by SToC (ids 79 and 80). Looking at the terms describing these topics, we observe that topic 79 contains words related to the candidates in the U.S. presidential election, including obama, romney, ryan, paul, etc. Topic 80 contains terms more related to events that happened during the political campaign, like the controversial Republican convention (08/21/2012) where a strict anti-abortion plan was approved. The foregoing discussion demonstrates that, in supervised scenarios, SToC presents solutions statistically better than the baselines. In turn, it is difficult to evaluate the quality of the results in unsupervised scenarios, because there is no appropriate metric to evaluate semantics within a context. In such cases, we analyzed how the latent factors were merged and manually evaluated the content of the posts assigned to each topic.

V. CONCLUSIONS AND FUTURE WORK

This paper proposed SToC, a new method for determining cohesive semantic topics discussed in a collection of posts. SToC assumes that distinct latent factors defined by an NMF method can be further merged to improve topic cohesion. Empirical evaluations on two real post collections derived from Twitter support this hypothesis. Analyses in supervised scenarios showed that SToC is statistically better than fair baselines, while manual evaluations demonstrate that SToC produces cohesive, context-sensitive and semantically valid topics in unsupervised scenarios.
As future work, we intend to perform further evaluations of SToC's stopping criteria. We also aim to enhance SToC using the approach proposed in [3], in order to better handle data sparsity. Additionally, further qualitative evaluations should be conducted, such as assessments of the interpretability of the discovered semantic topics. Finally, we intend to compare SToC's computational performance against the selected baselines.

REFERENCES

[1] L. Rainie, A. Smith, K. L. Schlozman, H. E. Brady, and S. Verba, “Social media and political engagement,” Pew Research Center's Internet & American Life Project, 2012. [Online]. Available: http://pewinternet.org/Reports/2012/Political-engagement.aspx
[2] A. Pons-Porrata, R. Berlanga-Llavori, and J. Ruiz-Shulcloper, “Topic discovery based on text mining techniques,” Information Processing & Management, vol. 43, no. 3, pp. 752–768, 2007.
[3] X. Cheng, J. Guo, S. Liu, Y. Wang, and X. Yan, “Learning topics in short texts by non-negative matrix factorization on term correlation matrix,” in Proc. of SDM, 2013.
[4] A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distributions,” Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99–109, 1943.
[5] M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics & Data Analysis, vol. 52, no. 1, 2007.
[6] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, Mar. 2003.
[8] L. Bai, J. Guo, Y. Lan, and X. Cheng, “Group sparse topical coding: from code to topic,” in Proc. of ACM WSDM, 2013, pp. 315–324.
[9] A. Kuhn, S. Ducasse, and T. Gîrba, “Semantic clustering: Identifying topics in source code,” Information and Software Technology, vol. 49, no. 3, pp. 230–243, 2007.
[10] Q. Wang, Z. Cao, J. Xu, and H. Li, “Group matrix factorization for scalable topic modeling,” in Proc. of the 35th ACM SIGIR, 2012, pp. 375–384.
[11] J. Choo, C. Lee, C. K. Reddy, and H. Park, “UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization,” IEEE Trans. Vis. Comput. Graph., vol. 19, no. 12, pp. 1992–2001, 2013.
[12] J. Tang, M. Zhang, and Q. Mei, “One theme in all views: modeling consensus topics in multiple contexts,” in Proc. of KDD. ACM, 2013, pp. 5–13.
[13] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, vol. 13, pp. 556–562, 2001.
[14] C. Lin, “On the convergence of multiplicative update algorithms for nonnegative matrix factorization,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1589–1596, 2007.
[15] H.-Y. Kao, S.-H. Lin, J.-M. Ho, and M.-S. Chen, “Mining web informative structures and contents based on entropy analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 41–55, 2004.