Understanding Sparse Topical Structure of Short Text via Stochastic Variational-Gibbs Inference

Tianyi Lin, Siyuan Zhang, Hong Cheng
The Chinese University of Hong Kong
[email protected]  [email protected]  [email protected]

ABSTRACT
With the soaring popularity of online social media such as Twitter, analyzing short text has emerged as an increasingly important task that is challenging for classical topic models, because topic sparsity exists in short text. Topic sparsity refers to the observation that an individual document usually concentrates on a few salient topics, which may be rare in the entire corpus. Understanding this sparse topical structure of short text has been recognized as the key ingredient for mining user-generated Web content and social media, which take the form of extremely short posts and discussions. However, the existing sparsity-enhanced topic models all assume an over-complicated generative process, which severely limits their scalability and prevents them from automatically inferring the number of topics from data. In this paper, we propose a probabilistic Bayesian topic model, namely the Sparse Dirichlet mixture Topic Model (SparseDTM), based on the Indian Buffet Process (IBP) prior, and infer it on large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. Unlike prior work, the proposed approach achieves an exactly sparse topical structure on large short text collections, and automatically identifies the number of topics with a good balance between completeness and homogeneity of topic coherence. Experiments on different genres of large text corpora demonstrate that our approach outperforms various existing sparse topic models. The improvement is significant on large-scale collections of short text.

Keywords
Topic modeling; short text; sparse topical structure; Indian Buffet Process; stochastic variational-Gibbs inference

1. INTRODUCTION
There has been an explosion in the amount of user-generated Web content available in recent years. Google News has launched a project to bring in user-generated news, which has become a


significant part of broadcast news¹. Twitter allows its users to share their lives through short posts, and more than 500 million tweets are created on it every day². This huge amount of user-generated content, normally in the form of very short text, contains rich information that can scarcely be found in traditional text sources. However, analyzing short text poses several challenges, such as topic sparsity, that do not arise in analyzing normal text. Topic sparsity refers to the observation that the content of most short text concentrates on only a narrow range of topics, rather than covering the wide variety of topics spanning the whole corpus. Such sparse topical structure of short text reflects social trends and user interests, and benefits several important data mining tasks, such as social media event detection and personal attribute prediction. Consequently, identifying the meaningful sparse topical structure of "web scale" collections of short text has been recognized as an important problem.

Topic models, such as latent Dirichlet allocation (LDA) [3] and the hierarchical Dirichlet process (HDP) [24], have proven effective at discovering the latent topical structure of unstructured text collections. However, the experience of classical topic models on short text is mixed, since they suffer from insufficient word co-occurrence information in each short text [11, 34]. This has inspired a line of sparsity-enhanced topic models [21, 26, 28, 7, 35, 4, 33, 15, 2, 31] which aim at mining sparsity in the topical structure of text. However, none of these models can scale to large text corpora due to the limitations of their inference procedures. As an alternative to traditional inference methods [1, 25], stochastic variational inference [10] has been proposed to infer topic models on large text corpora. However, this method is not effective for inferring sparsity-enhanced topic models, since the inferred variational distribution is only a dense approximation of the true sparse posterior distribution. Another reason for its inefficiency is that the over-complicated generative processes of sparsity-enhanced topic models do not admit closed-form updates of the variational parameters.

In this paper, we propose a probabilistic Bayesian topic model, namely the Sparse Dirichlet mixture Topic Model (SparseDTM), based on the Indian Buffet Process (IBP) prior [9]. We infer SparseDTM on large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. The proposed approach outperforms existing ones by addressing sparsity in the topical structure of short text corpora.

1 http://en.wikipedia.org/wiki/user-generated_content
2 http://blog.twitter.com/2014/the-2014-yearontwitter

Specifically, previous approaches based on stochastic variational inference only achieve a dense variational distribution over document-level variables when applied to sparsity-enhanced topic models. In contrast, our approach infers the true posterior distribution of document-level latent variables by collapsed Gibbs sampling. Moreover, the proposed approach scales to large text corpora by inferring corpus-level variables through stochastic variational inference. Experiments on different genres of large text corpora demonstrate that our approach achieves very high time and iteration efficiency, without sacrificing predictive performance or topic coherence. Our approach addresses sparsity in the topical structure of text corpora and infers the number of topics automatically. Finally, our approach is shown to perform significantly better on large-scale collections of short text.

We summarize the contributions of this paper as follows.
• We propose a probabilistic Bayesian topic model, SparseDTM, for analyzing large-scale short text corpora, and a novel inference procedure to infer SparseDTM on large-scale short text collections.
• The simple but effective generative process of SparseDTM makes it feasible to combine a local Gibbs sampler with stochastic variational inference, which in turn makes it possible to infer the sparse topical structure of short text.
• The proposed approach outperforms existing approaches since it recovers the true posterior distribution of document-level variables, i.e., the sparse topical structure of text corpora, and infers the number of topics automatically, as confirmed by the high predictive performance and topic coherence presented in Section 5.

The rest of the paper is organized as follows. In Section 2, we discuss the related work. In Section 3, we formally define the problem of modeling the sparse topical structure of text. In Section 4, we introduce the probabilistic Bayesian topic model, SparseDTM, and discuss its probabilistic interpretation. In Section 5, we describe the experiments on different genres of large-scale short text corpora. We conclude this work in Section 6.

2. RELATED WORK
To the best of our knowledge, our approach is the first to analyze large-scale short text corpora via a sparsity-enhanced topic model and stochastic variational-Gibbs inference. The related lines of literature are as follows.

2.1 Sparsity-enhanced Topic Models
Various works have been developed to model the latent sparse topical structure of data [12, 21, 26, 28, 7, 35, 4, 33, 15, 2, 31]. These models can be categorized into two camps: non-probabilistic coding or matrix factorization with sparsity regularization, and probabilistic models with specific priors. In the first category, coding is used to represent the coefficients of topical bases in the topic space to model the generative process of words. For example, non-negative matrix factorization [12] and non-probabilistic sparse topical coding [35, 33] provide a feasible framework for imposing ℓ1 regularization to control the sparsity in the posterior distribution of the topical structure of data. [35] further demonstrates that sparse topical coding significantly outperforms other competing methods in this category. However, this method lacks the ability to address sparsity in the topical structure of user-generated content such as tweets [15]. Another drawback of sparse topical coding and its online variant is their inability to infer the number of topics automatically from data.

Probabilistic sparsity-enhanced topic models improve classical topic models by adopting specific priors to decouple across-data prevalence from within-data proportion when modeling mixed-membership data, such as an entropic prior [21], a spike-and-slab prior [26, 15], an IBP prior [28, 2], a zero-mean Laplace prior [7, 31] and a hierarchical Beta process prior [4]. Gibbs sampling is the standard technique for inferring sparsity-enhanced topic models since it identifies topic sparsity; however, it suffers from very poor scalability. The approach most related to ours is the focused topic model and its variants [28, 2], which model topic sparsity via an IBP prior. However, the generative process of these models is so complicated that stochastic variational inference is not applicable. In contrast, the proposed SparseDTM has a simple but effective generative process, which can be inferred by stochastic variational-Gibbs inference on large-scale corpora.

2.2 Stochastic Variational Inference
Various works have made efforts to infer topic models on large corpora [27, 16, 8, 10, 33]. The underlying idea is to optimize a variational lower bound via stochastic approximation [20], allowing optimization to proceed over a subset of the data. Well-known inference procedures include stochastic variational inference [10], hybrid stochastic variational inference with sampling [16, 27], and stochastic collapsed variational inference [8]. Stochastic variational inference suffers from inferring a dense variational distribution instead of the true sparse posterior distribution, as discussed before. Recently, an online inference procedure was developed in [33] for sparse topical coding; however, this approach can neither address the sparsity in the topical structure of short text corpora, nor infer the number of topics automatically.

2.3 Short Text Mining
Another line of related work designs different strategies and models for analyzing short text, where classical topic models suffer from insufficient word co-occurrence information [11, 34]. Jin et al. [13] cluster short text by transferring topical knowledge from another collection of long text. Shou et al. [22] develop an incremental framework to cluster online streaming tweets. Xu et al. [29] and Kenter and de Rijke [14] analyze the semantic similarity of short text by introducing a semantically similar hashing approach and a saliency-weighted semantic network, respectively. However, these efficient methods are restrictive since they cannot discover the latent structure of short text.

To alleviate the issue of sparse word occurrence patterns, several topic models have been proposed for mining specific forms of text corpora. Specifically, the Biterm Topic Model [5, 30] analyzes a set of unordered word-pair co-occurrences (biterms) in the corpus instead of individual documents. Sridhar [23] induces a distributed representation of words instead of a bag of words, and uses a Gaussian mixture model to capture latent topics. Quan et al. [19] integrate topic modeling with short text aggregation during topic inference. However, this type of approach cannot address sparsity in the topical structure of short text corpora. Another direction is to model the sparsity in the topical structure of short text corpora directly. As the extreme case, Yin and Wang [32] assume that each short text is assigned only one topic, and introduce the Dirichlet multinomial mixture model, a Bayesian variant of the mixture of unigrams [3], for short text clustering. However, most short texts have more than one topic, as confirmed by our experiments in Section 5. Recently, Lin et al. [15] proposed a dual-sparse topic model (DsparseTM) by incorporating sparsity into the latent Dirichlet allocation model. However, DsparseTM can neither scale to large corpora, nor infer the number of topics automatically from text.

3. PROBLEM FORMULATION
In this section, we formally define the problem of modeling the sparse topical structure of text. Let D = {w_j}_{j=1}^{|D|} be a collection of documents, where w_j = (w_{j1}, w_{j2}, ..., w_{jn_j}) is the vector of terms representing the textual content of document j, and w_{ji} denotes the i-th term in document j. V denotes the vocabulary of all distinct words in D.

Definition 1. (Topic, Topical Structure, Topic Modeling) A topic φ in a document collection D is defined as a multinomial distribution over the vocabulary V, i.e., {p(v | φ)}_{v∈V}. It is common to assume that there are K topics in D, where K is unknown. The topical structure of a document j, i.e., θ_j, is defined as a multinomial distribution over the K topics, i.e., {p(φ_k | θ_j)}_{k=1,...,K}. The task of topic modeling is to infer the number of topics K, the K salient topics {φ_k}_{k=1,...,K}, and the topical structure of each document, {θ_j}_{j=1,...,|D|}, from D.

Definition 2. (Sparse Topical Structure) A document j in a document collection D has a sparse topical structure if θ_jk = 0 for most of the topics {φ_k}, 1 ≤ k ≤ K.³ A document collection D has a sparse topical structure if each document in D has a sparse topical structure.

Most classical probabilistic topic models, such as LDA, adopt a Dirichlet prior over the topical structure of documents, which alleviates overfitting when the number of topics K is very large. However, the Dirichlet prior itself has limited control over posterior sparsity in the inferred topical structure, and hence cannot model the sparse topical structure described in Definition 2, which commonly exists in short text as discussed before. Therefore, we introduce auxiliary variables that indicate the set of representative topics of each document.

Definition 3. (Topic Selector) For j ∈ {1, 2, ..., |D|}, a topic selector of document j, i.e., β_j, is defined as a K-dimensional vector whose components are all binary. Each entry β_jk determines whether document j is generated from topic φ_k; document j is generated from φ_k where β_jk = 1.

3 The percentage of topics in one document is less than 10%, as shown in [5, 11].

Table 1: Variables and Notations
Notation        Meaning
T               truncation level
K               number of topics
V               vocabulary
D               a collection of documents
n_j             number of words in document j
n_jk            frequency of topic k in document j
n_kv            frequency of word v in topic k
n_k             frequency of words in topic k
w_ji            i-th word in document j
w               set of all words, i.e., {w_j}_{j=1}^{|D|}
z_ji            topic assigned to the i-th word in document j
z               set of all topic assignments, i.e., {z_j}_{j=1}^{|D|}
θ_j             topical structure of document j
φ_k             word usage of topic k
β_j             topic selector of document j
γ               concentration parameter of the IBP
a               concentration parameter of the DP
η               topic hyperparameter
ρ_t             learning rate for online updates
Γ(·)            Gamma function
B(·,·)          Beta function
Dirichlet(·)    Dirichlet distribution
Beta(·)         Beta distribution
Bernoulli(·)    Bernoulli distribution
Multinomial(·)  Multinomial distribution
I(·)            indicator function

Given a document j, its topic selector β_j determines the set of active topics, and hence models its sparse topical structure θ_j directly. The following proposition clarifies the connection between θ and β.

Proposition 3.1. A document j in a document collection D has a sparse topical structure if

\sum_{k=1}^{K} \beta_{jk} < K,

where K is inferred from D. Furthermore, θ_jk = 0 holds if β_jk = 0. However, this relationship does not hold conversely: even if β_jk = 1, it is possible that θ_jk = 0 due to data sparsity.

We are now ready to give a formal description of our task. Given a collection of documents D and the vocabulary V, our task is to:
1. infer the set of topics appearing in each document j by inferring β_j;
2. learn the number of topics K in the collection D;
3. infer the sparse topical structure of documents, θ;
4. learn the word usage of topics, φ.
All the notations used in this paper are listed in Table 1.
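As a concrete illustration of Definitions 2 and 3 and Proposition 3.1, the short Python sketch below builds a toy topic selector β_j and a matching sparse topical structure θ_j. The variable names follow Table 1, but the numbers are invented for illustration and are not taken from the paper.

```python
import numpy as np

K = 5                                   # number of topics (toy value)
beta_j = np.array([1, 0, 1, 0, 0])      # topic selector: document j may use topics 0 and 2

# theta_j is a distribution over the K topics that is exactly zero
# wherever beta_jk = 0 (Definition 2 / Proposition 3.1).
theta_j = np.zeros(K)
active = beta_j == 1
theta_j[active] = np.random.default_rng(0).dirichlet(0.1 * np.ones(active.sum()))

assert beta_j.sum() < K                 # sparse: fewer selected topics than K
assert np.all(theta_j[beta_j == 0] == 0)
print("active topics:", np.nonzero(theta_j)[0], "theta_j =", theta_j.round(3))
```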

4. APPROACH

The key difficulty of inferring the sparsity in the topical structure of "web scale" text corpora is to balance the effectiveness of the model design against the feasibility of a stochastic inference procedure. Existing sparsity-enhanced topic models [26, 28, 15] are all theoretically sound; however, their over-complicated generative processes prevent feasible stochastic variational

inference. One exception is a simple sparsity-enhanced topic model, namely LIDA [2], which can scale to large text corpora via stochastic variational inference. We propose a Bayesian probabilistic model based on the Indian Buffet Process prior, namely SparseDTM, to model sparsity in the topical structure of large text corpora. Specifically, we develop a unified framework that incorporates the finite IBP prior [9] and heuristic shrinkage and rearrange operations into the generative process. Different from existing sparsity-enhanced topic models, our model has a simple generative process: it adopts the finite IBP prior to select a set of topics for each individual document, while inferring the number of topics automatically through a heuristic that shrinks the topical representation of individual documents during topic inference.

4.1 Model
In this section, we describe the generative process of the Sparse Dirichlet mixture Topic Model (SparseDTM). SparseDTM is a probabilistic Bayesian generative model developed for analyzing a collection of documents. The idea is to combine the finite IBP prior with a Dirichlet mixture model under a symmetric corpus-level topic prior. It is worth mentioning that SparseDTM defines a series of topic selectors and a heuristic shrinkage operation, allowing it to infer both the sparsity in the topical structure of text corpora and the number of topics. Its generative process is feasible for stochastic variational-Gibbs inference. SparseDTM is depicted in Figure 1 and its generative process is as follows (a simulation sketch is given after this list):

For each topic k = 1, 2, ..., T:
  1. π_k ∼ Beta(γ/T, 1),
  2. φ_k ∼ Dirichlet(η 1),

For each document j = 1, 2, ..., |D|:
  1. β_jk ∼ Bernoulli(π_k), for k = 1, ..., T,

Shrinkage operation:
  1. Remove φ_k for all k > K, where K is the largest index with β_jK ≠ 0 for some document j (so that β_jk = 0 for every document j whenever k > K),

For each document j = 1, 2, ..., |D|:
  1. Rearrange operation:
     (a) Set θ_jk = 0 if β_jk = 0,
     (b) Let β'_j collect the non-zero components of {β_jk : 1 ≤ k ≤ K},
     (c) Draw θ'_j ∼ Dirichlet(a β'_j),
     (d) Set θ_j according to θ'_j,
  2. For each word i = 1, 2, ..., n_j:
     (a) sample z_ji from Multinomial(θ_j),
     (b) sample w_ji from Multinomial(φ_{z_ji}).
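To make the generative process above concrete, the following Python sketch simulates a toy corpus from SparseDTM using the finite Beta-Bernoulli (IBP) construction, the shrinkage step, and the rearrange step. It is an illustration only, not the authors' code: the function name, default sizes, and hyperparameter values are our own choices (γ, a and η loosely follow the settings reported in Section 5).

```python
import numpy as np

def generate_sparsedtm_corpus(num_docs=100, T=20, V=500, gamma=5.0, a=0.1,
                              eta=0.01, mean_doc_len=8, seed=0):
    """Illustrative simulation of the SparseDTM generative process
    (finite IBP topic selectors + sparse Dirichlet mixture)."""
    rng = np.random.default_rng(seed)

    # Corpus-level draws: topic activation probabilities and word distributions.
    pi = rng.beta(gamma / T, 1.0, size=T)            # pi_k ~ Beta(gamma/T, 1)
    phi = rng.dirichlet(np.full(V, eta), size=T)     # phi_k ~ Dirichlet(eta * 1)

    # Document-level topic selectors: beta_jk ~ Bernoulli(pi_k).
    beta = rng.binomial(1, pi, size=(num_docs, T))

    # Shrinkage: drop trailing topics that no document selects.
    active_any = beta.sum(axis=0) > 0
    K = int(np.max(np.nonzero(active_any)[0])) + 1 if active_any.any() else 1
    beta, phi = beta[:, :K], phi[:K]

    docs, thetas = [], []
    for j in range(num_docs):
        # Rearrange: theta_j is Dirichlet(a) over the selected topics only,
        # and exactly zero elsewhere (the sparse topical structure).
        theta = np.zeros(K)
        active = np.nonzero(beta[j])[0]
        if len(active) == 0:                 # degenerate case: force one topic
            active = np.array([rng.integers(K)])
        theta[active] = rng.dirichlet(np.full(len(active), a))
        thetas.append(theta)

        # Draw words: z_ji ~ Multinomial(theta_j), w_ji ~ Multinomial(phi_{z_ji}).
        n_j = max(1, rng.poisson(mean_doc_len))
        z = rng.choice(K, size=n_j, p=theta)
        w = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(w)
    return docs, np.array(thetas), phi, beta

docs, thetas, phi, beta = generate_sparsedtm_corpus()
print("shrunk truncation K =", thetas.shape[1])
print("avg. topics per document =", (thetas > 0).sum(axis=1).mean())
```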

Figure 1: The graphical model of SparseDTM.

We make the following remarks:

Finite IBP prior: It is necessary to understand why we apply the finite IBP prior to the topic selector, rather than the infinite IBP prior as in the focused topic model [28]. We observe that the finite IBP prior leads to a much simpler generative process than that of the focused topic model, one that can be inferred using stochastic variational-Gibbs inference. On the other hand, it has been proven in [6] that the finite IBP prior behaves as well as the infinite IBP prior when the truncation level is sufficiently large, which supports SparseDTM theoretically.

Shrinkage operation: The shrinkage operation provides a heuristic way to infer the number of topics. When the finite IBP prior is applied to the topic selector β, it has been shown in [9] that there exists 0 < K ≤ T such that β_jk = 0 for all k > K and all documents j; the shrinkage operation removes the corresponding unused topics.

We infer SparseDTM by stochastic variational-Gibbs inference: corpus-level variables are updated by stochastic variational inference, while document-level variables are inferred by collapsed Gibbs sampling (a minimal code sketch of these two steps is given at the end of this section).

Stochastic variational update for λ: Given a document j, the topic-word variational parameters are updated as

\lambda_{kv} \leftarrow \lambda_{kv} + \rho_t\left(-\lambda_{kv} + \eta + |D|\, n_{jkv}\right),

where n_{jkv} is the number of times the v-th word is assigned to topic k in document j. This update extends to a mini-batch M of documents:

\lambda_{kv} \leftarrow \lambda_{kv} + \rho_t\Big(-\lambda_{kv} + \eta + \frac{|D|}{|M|}\sum_{j\in M} n_{jkv}\Big).   (2)

Collapsed Gibbs sampling for β and z: We further explain how to apply collapsed Gibbs sampling to infer the document-level variables β and z. The idea is to use the variational distribution q(φ | λ) as the "prior" in place of p(φ | η), and to sample each topic assignment z_ji from the conditional

p(z_{ji} = k \mid \vec{z}_{\neg ji}, \vec{\lambda}, \vec{\beta}, a) = \frac{p(z_{ji} = k, \vec{z}_{\neg ji} \mid \vec{\lambda}, \vec{\beta}, a)}{p(\vec{z}_{\neg ji} \mid \vec{\lambda}, \vec{\beta}, a)}.

Discussion: We discuss the possible reasons why it is better to apply stochastic variational-Gibbs inference to SparseDTM. First, the proposed algorithm can extract the sparse posterior distribution of the topical representation of each individual document. Furthermore, our algorithm can be understood theoretically as an approximate Expectation Propagation [17], which has been shown in [27] to be more effective than existing stochastic variational inference.
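The following Python sketch illustrates the two interleaved steps just described, assuming simple count statistics and omitting the topic-selector bookkeeping: a collapsed Gibbs sweep over a document's topic assignments that uses q(φ | λ) in place of p(φ | η), followed by the mini-batch update of λ in Eq. (2). All function and variable names are ours, and the Gibbs conditional shown is a simplified stand-in for the full conditional of SparseDTM, not the authors' implementation.

```python
import numpy as np

def gibbs_resample_doc(doc, z, lam, n_jk, a, rng):
    """One collapsed-Gibbs sweep over a document's topic assignments z_ji,
    treating q(phi | lambda) as the 'prior' over topic-word distributions."""
    K, V = lam.shape
    lam_sum = lam.sum(axis=1)
    for i, v in enumerate(doc):
        n_jk[z[i]] -= 1                          # remove the current assignment
        # Simplified conditional: p(z_ji = k | rest) proportional to
        # (a + n_jk) * E_q[phi_kv]; the full model also involves beta_j.
        p = (a + n_jk) * (lam[:, v] / lam_sum)
        p /= p.sum()
        z[i] = rng.choice(K, p=p)
        n_jk[z[i]] += 1                          # add the new assignment back
    return z, n_jk

def stochastic_lambda_update(lam, minibatch, assignments, D, eta, rho_t):
    """Eq. (2): lambda_kv <- lambda_kv + rho_t * (-lambda_kv + eta
    + (|D| / |M|) * sum_{j in M} n_jkv)."""
    n_kv = np.zeros_like(lam)
    for doc, z in zip(minibatch, assignments):
        np.add.at(n_kv, (z, doc), 1.0)           # accumulate n_jkv over the mini-batch
    return lam + rho_t * (-lam + eta + (D / len(minibatch)) * n_kv)

# Toy usage with random data (shapes only, not real text).
rng = np.random.default_rng(0)
K, V, D, a, eta, kappa = 10, 50, 1000, 0.1, 0.01, 0.8
lam = np.full((K, V), eta) + rng.random((K, V))
minibatch = [rng.integers(V, size=8) for _ in range(4)]
assignments = [rng.integers(K, size=len(d)) for d in minibatch]
for doc, z in zip(minibatch, assignments):
    n_jk = np.bincount(z, minlength=K).astype(float)
    gibbs_resample_doc(doc, z, lam, n_jk, a, rng)
t = 1
rho_t = 1.0 / (1.0 + t) ** kappa                 # rho_t = 1 / (1 + t)^kappa
lam = stochastic_lambda_update(lam, minibatch, assignments, D, eta, rho_t)
print(lam.shape)
```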

5. EXPERIMENT

In this section, we investigate the performance of the proposed algorithm on large-scale collections of short text. The objectives include: (1) a quantitative evaluation of predictive performance and topic coherence; (2) a quantitative measurement of the sparsity in the topical structure; (3) an interpretation of the inferred topics; and (4) a quantitative evaluation of the influence of the truncation level, the mini-batch size, and the learning rate.

5.1 Data Sets

We adopt three different genres of large-scale real-world data sets in our experiments. Stop words are removed from each data set according to a standard list of stop words⁴.
• DBLP. Titles of scientific papers are good examples of short documents. We collect the titles of all conference papers from the DBLP database⁵. This data set contains 1,017,771 short documents and 28,014 unique words. The average length of each document is 6.6 words.
• NYT. This data set⁶, denoted as NYT, is a good representative of collections of user-generated content. It contains 299,752 news articles published in the New York Times between 1987 and 2007. The vocabulary size is 102,660, and the average length of each document is 166.1 words. To further investigate the behavior of our algorithm on short content, we vary the document length by randomly sampling words from each original document (a subsampling sketch is given after Table 2). As a result, we obtain four short text corpora denoted as NYT-1, NYT-2, NYT-3 and NYT-4.

4 http://www.ml-thu.net/~jun/stc.shtml
5 https://www.informatik.uni-trier.de/~ley/db/
6 https://ldc.upenn.edu

• Twitter. We sample one collection of 2,792,323 tweets posted in June 2009 and another collection of 2,818,706 tweets posted in July 2009 from the Twitter data set released by the Stanford Network Analysis Project⁷, denoted as Twitter-1 and Twitter-2. After removing hashtags and words appearing fewer than 10 times, we obtain vocabularies of 88,080 and 89,746 unique words, respectively.

The statistics of the data sets are summarized in Table 2.

Table 2: Statistics of the data sets
Data set    # Documents   Vocabulary size   Avg doc len by words
DBLP        1,017,771     28,014            7
NYT         299,752       102,660           166
Twitter-1   2,792,323     88,080            6
Twitter-2   2,818,706     89,746            6
NYT-1       298,978       24,845            6
NYT-2       298,793       32,631            8
NYT-3       298,362       38,783            11
NYT-4       297,997       43,974            14
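As referenced in the NYT description above, the shortened corpora NYT-1 to NYT-4 are obtained by randomly sampling words from each original article to vary document length. A minimal Python sketch of such subsampling follows; it is our own illustration under the assumption of uniform sampling without replacement, since the paper does not specify the exact procedure beyond the resulting average lengths in Table 2.

```python
import random

def subsample_document(tokens, target_len, seed=None):
    """Randomly keep `target_len` word occurrences of a document
    (uniformly, without replacement) to create a shortened version."""
    rng = random.Random(seed)
    if len(tokens) <= target_len:
        return list(tokens)
    return rng.sample(tokens, target_len)

article = ("the mayor announced a new budget plan for city schools "
           "and local transit projects on monday afternoon").split()
for target_len in (6, 8, 11, 14):   # average lengths reported for NYT-1..NYT-4
    print(target_len, subsample_document(article, target_len, seed=42))
```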

5.2 Metrics
We compare our method with other methods using perplexity and point-wise mutual information (PMI), which describe the usefulness and significance of the mined topics [3, 18].

Predictive Performance. We use perplexity [3, 24] to evaluate the predictive performance of all algorithms. Given D_train and D_test, we split each document w_j in D_test into two parts, w_j = (w_{j1}, w_{j2}), and calculate the perplexity of w_{j2} conditioned on both w_{j1} and D_train as

\text{Perplexity} = \exp\left\{ -\frac{\sum_{j \in D_{test}} \log p(\vec{w}_{j2} \mid \vec{w}_{j1}, D_{train})}{\sum_{j \in D_{test}} |\vec{w}_{j2}|} \right\},   (5)

where |w_{j2}| is the number of tokens in w_{j2}. In this paper, the predictive distribution is approximated by

p(w^* \mid \vec{w}_{j1}, D_{train}) \approx \sum_{k=1}^{K} \frac{(\lambda_{kw^*} + n_{kw^*})(a\bar{\theta}_{jk} + n_{jk})}{\left(\sum_{v=1}^{V}\lambda_{kv} + n_k\right)\left(a\sum_{k=1}^{K}\bar{\theta}_{jk} + n_j\right)}.

Topic Coherence. We use the PMI score to measure topic coherence. For a topic represented by its top N words, the PMI score is

\text{PMI} = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \log\frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}.

Sparsity Ratio. The sparsity ratio [15, 26] quantitatively measures the sparsity in the topical structure of data. For each document j, its sparsity ratio is defined as

\text{Sparsity-ratio}(j) = 1 - \frac{1}{K}\sum_{k=1}^{K} I(\theta_{jk} > 0).
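A small Python sketch of the three metrics above, computed from already-estimated quantities. It assumes the held-out per-document log-likelihoods, the word co-occurrence and marginal probabilities for a topic's top words, and the inferred topic proportions are available as plain arrays and dictionaries; the helper names are ours, not the authors' code.

```python
import numpy as np

def perplexity(log_probs, num_tokens):
    """Eq. (5): exp of the negative average held-out log-likelihood per token."""
    return float(np.exp(-np.sum(log_probs) / np.sum(num_tokens)))

def pmi_score(top_words, p_joint, p_marg, eps=1e-12):
    """Average pairwise PMI over the top-N words of a topic:
    (2 / (N (N-1))) * sum_{i<j} log p(w_i, w_j) / (p(w_i) p(w_j))."""
    N, total = len(top_words), 0.0
    for i in range(N):
        for j in range(i + 1, N):
            wi, wj = top_words[i], top_words[j]
            total += np.log((p_joint[(wi, wj)] + eps) / (p_marg[wi] * p_marg[wj] + eps))
    return 2.0 * total / (N * (N - 1))

def sparsity_ratio(theta):
    """1 - (1/K) * sum_k I(theta_jk > 0), averaged over documents."""
    K = theta.shape[1]
    return float(np.mean(1.0 - (theta > 0).sum(axis=1) / K))
```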

5.3 Experimental Settings
We compare SparseDTM with the LIDA topic model and three HDP-based topic models: HDP, OHDP and TFOHDP. In this paper, we set the learning rate to ρ_t = 1/(1+t)^κ consistently.
8 https://www.cs.princeton.edu/~chongw/resource.html
9 http://github.com/metalgeekcz/onlineHDP

Table 3: Performance of all algorithms on eight data sets
              DBLP               NYT                Twitter-1          Twitter-2
            Perplexity  PMI    Perplexity  PMI    Perplexity  PMI    Perplexity  PMI
SparseDTM   4,957       0.842  11,267      1.154  15,983      0.802  13,257      0.876
LIDA        6,054       0.445  16,950      0.917  32,208      0.435  26,050      0.475
TFOHDP      6,051       0.628  14,031      0.977  26,998      0.486  20,282      0.595
OHDP        5,597       0.738  15,185      1.072  24,603      0.610  20,743      0.590
HDP         8,752       0.678  17,854      1.076  17,384      0.643  24,030      0.591

              NYT-1              NYT-2              NYT-3              NYT-4
            Perplexity  PMI    Perplexity  PMI    Perplexity  PMI    Perplexity  PMI
SparseDTM   6,589       0.522  6,160       0.588  9,192       0.615  8,603       0.686
LIDA        8,874       0.519  11,743      0.503  11,714      0.468  14,465      0.446
TFOHDP      9,710       0.365  11,618      0.439  12,000      0.444  12,695      0.551
OHDP        9,164       0.365  9,967       0.456  11,485      0.477  13,740      0.498
HDP         11,861      0.403  14,415      0.459  12,394      0.469  13,822      0.528

In the following section, we first evaluate the

performance and efficiency of the proposed algorithm, and present the sparsity in the topical structure of held-out documents. Furthermore, we tune κ, the mini-batch size and the truncation level to explore their influence, especially that of the truncation level T on the inferred number of topics K and on the sparsity ratio of held-out documents. Each algorithm is trained for 30 hours per run, and each reported result is the average over 4 independent runs. The prior on the topic simplex is η = 0.01.

5.4 Experimental Results

5.4.1 Comparative performance of all algorithms
The PMI score and perplexity of all algorithms are presented in Table 3. The PMI score as a function of time and of iteration number is presented in Figure 2 and Figure 3, respectively. We set the truncation level T = 300 for LIDA, OHDP and SparseDTM, and the model parameters γ = 5 and a = 0.1 for SparseDTM. The mini-batch size is |M| = 100 and the learning rate is κ = 0.8. We make the following remarks:
DBLP. The proposed algorithm yields the lowest perplexity and highest PMI score, followed by the LIDA topic model and the three HDP topic models. DBLP is an interesting data set in which each document contains relatively many topics even though its length is short. It is reasonable that SparseDTM performs better than the three HDP topic models, which cannot address sparsity in the topical structure of short text corpora. The poor performance of LIDA supports the view that Gibbs sampling is an inappropriate inference method for analyzing large text corpora. To further investigate the efficiency of SparseDTM, we present the PMI score as a function of time and of iteration number. We observe that SparseDTM outperforms the other methods consistently after 3 hours or 5,000 iterations. Furthermore, the reason why TFOHDP behaves worse than HDP and OHDP is that a mini-batch of short text fails to provide sufficient word co-occurrence statistics for adapting model complexity on the fly. This observation supports the use of the finite IBP prior and the shrinkage and rearrange operations to address the sparsity in the topical structure of short text corpora.
NYT. SparseDTM achieves the lowest perplexity and highest PMI score, outperforming the other methods consistently. The perplexity decreases by at least 20% while the PMI score increases by at least 10% for SparseDTM compared with the other methods. The performance of all candidate methods becomes better

on NYT than DBLP, since NYT is a normal text collection in which each document provides sufficient word co-occurrence statistics. The best performance of SparseDTM provides strong evidence that it can work well with user-generated content by accurately addressing the sparsity in the topical structure. To strengthen this argument, we validate all methods on the four shortened NYT data sets and find that SparseDTM is still the best. It is worth noting that the LIDA topic model behaves very well on these small data sets, which shows that predictive performance can be improved by addressing sparsity in the topical structure of short text. On the other hand, Figure 2 and Figure 3 provide evidence that SparseDTM achieves fast convergence and the best performance consistently.
Twitter. The improvement of SparseDTM over the other algorithms is significant on Twitter-1 and Twitter-2. Figure 2 and Figure 3 show that the efficiency of SparseDTM in terms of time and iteration number is remarkably high on Twitter-1 and Twitter-2. We remark that the LIDA topic model suffers from the limitation of its inference procedure on large-scale text corpora, and hence fails to identify sparsity in the topical structure. This strongly supports the fact that many tweets cover more than one topic, and that addressing sparsity in the topical structure is helpful for analyzing online streaming tweets.
Discussion. We highlight the ability of SparseDTM to harvest coherent topics from short text while remaining scalable and respecting the sparse topical structure. The experimental results show that SparseDTM significantly outperforms the other competing algorithms, which can be explained by the fact that the three HDP topic models suffer from the sparse topical structure and the limited word co-occurrence in mini-batches of short text, while the LIDA topic model suffers from the limitation of its inference procedure.

5.4.2 Characteristics of the sparse topical structure of text

We use the sparsity ratio to measure the sparsity in the topical structure of held-out short text, report the final results after 30 hours in Table 4, and show the number of topics per document in Figure 4.
DBLP. Table 4 indicates that each short text in DBLP has a sparse topical structure, which SparseDTM identifies effectively. Furthermore, SparseDTM behaves consistently as |M| varies. Figure 4 shows that the distribution of the number of documents by the number of topics per document is also stable as the mini-batch size |M| varies. We observe that the sparsity ratio is sensible, since the average number of topics in each short text is less than 3.

Figure 2: PMI vs Time on DBLP, NYT, Twitter-1 and Twitter-2.


Figure 3: PMI vs Iteration on DBLP, NYT, Twitter-1 and Twitter-2.

A majority of the short texts in DBLP contain 2 topics. This is reasonable since each document in DBLP is the title of a scientific article, which is commonly related to more than one research topic.
NYT. Table 4 shows that the topical structure of documents in NYT is denser than in DBLP. This makes sense since NYT is a collection of normal-length news documents: each document is long and covers multiple topics. For instance, a political report may contain some analysis of religion or culture, and a sports news story may touch on fashion and entertainment. Therefore, it is reasonable that each document contains roughly 6 topics. In Figure 4, we find that the distribution of the number of documents by the number of topics per document appears inconsistent as |M| varies. As |M| increases, more documents are assigned fewer topics, leading to a sparser topical structure of the held-out document collection. It is also interesting that this difference does not affect the final performance. A possible explanation is the diversity of user-generated news in NYT: the topics tend to become more specific as |M| increases.
Twitter. Table 4 shows that the sparsity in the topical structure of the Twitter corpora is similar to that of DBLP. Twitter is a social platform that allows its users to share information. Each tweet focuses on its author's interests instead of covering multiple topics spanning Twitter. Therefore, the topical structure of the corpora collected from Twitter is reasonably sparser than that of NYT. Figure 4 shows that the distribution of the number of documents by the number of topics per document remains stable as |M| varies. It also explains the difference between Twitter and DBLP: a tweet focuses on its author's interest while the title of a scientific article covers several related research topics, so more documents in Twitter contain a single topic.

5.4.3 Parameter sensitivity
Influence of T. We investigate the influence of T on the number of topics K, the sparsity ratio, and the performance of SparseDTM. We set |M| = 100 and κ = 0.8, and vary T in {5, 10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500}.

Table 4: Sparsity ratio of held-out documents inferred by SparseDTM on DBLP, NYT, Twitter-1 and Twitter-2
|M|     DBLP    NYT     Twitter-1  Twitter-2
100     0.9942  0.9825  0.9944     0.9936
500     0.9928  0.9848  0.9948     0.9941
1000    0.9931  0.9883  0.9948     0.9943

Figure 5a shows that the number of topics found by SparseDTM increases as we enlarge T on the four data sets, and NYT contains the most topics, followed by Twitter and DBLP. This makes sense since NYT consists of normal user-generated news articles, where each document offers more information than a short tweet or a paper title. Figure 5b shows that the sparsity ratio of held-out documents increases and quickly becomes stable as T increases. This result provides evidence that the sparsity ratio inferred by SparseDTM does not decrease as we enlarge T, and thus SparseDTM can effectively address the sparsity in the topical structure of different corpora. Finally, Figure 5c shows that the PMI score of topics found by SparseDTM varies only slightly when T is neither too large nor too small, indicating that the performance of SparseDTM is robust to T.
Influence of |M|. We investigate the sensitivity to the mini-batch size |M|. We fix T = 300 and κ = 0.8, and vary |M| in {10, 50, 100, 300, 500, 1000}. Figure 6 shows the PMI scores obtained by SparseDTM on all data sets. In general, SparseDTM is robust to |M|, indicating that SparseDTM works consistently well on large-scale short text collections regardless of the mini-batch size.
Influence of κ. We turn to the sensitivity to the learning parameter κ. We fix T = 300 and |M| = 500, and vary κ in {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. In Figure 7, we plot the PMI scores obtained by SparseDTM on all data sets. We observe that the performance is nearly the same across data sets. As expected, SparseDTM suffers when κ is too large [10]. In fact, tuning κ is very important for SparseDTM: from Figure 7, we observe that κ = 0.8 and κ = 0.9 are appropriate for short text, e.g., tweets, while κ = 0.6 and κ = 0.7 fit collections of normal user-generated news, e.g., NYT, very well.

Figure 4: Sparse topical structure of held-out documents on DBLP, NYT, Twitter-1 and Twitter-2 (number of documents vs. number of topics per document, for |M| = 100, 500, 1000).

Figure 5: Parameter sensitivity w.r.t. T on DBLP, NYT, Twitter-1 and Twitter-2: (a) Number of Topics vs T; (b) Sparsity Ratio vs T; (c) PMI vs T.

Figure 6: Parameter sensitivity w.r.t. |M| on DBLP, NYT, Twitter-1 and Twitter-2.

Figure 7: Parameter sensitivity w.r.t. κ on DBLP, NYT, Twitter-1 and Twitter-2.

5.4.4 Interpretation of topics
We present some topics learned by SparseDTM and the LIDA topic model on NYT. Table 5 shows the selected topics.

Table 5: Selected topics on the NYT data set
LIDA:
  Topic 1: florida, ballot, election, votes, vote, voter, country, count
  Topic 2: room, official, area, night, local, bed, window, pool
  Topic 3: gas, oil, tree, natural, forest, trees, pipeline, green
SparseDTM (|M| = 500):
  Topic 1: brain, genetic, dna, genes, gene, protein, biotech, chemical
  Topic 2: nba, o neal, phil jackson, davis, spur, michael jordan, portland, shaquille o neal
  Topic 3: religious, churches, nicholas, god, bible, disabled, pastor, denomination
SparseDTM (|M| = 1000):
  Topic 1: power, oil, energy, prices, plant, gas, california, electricity
  Topic 2: religious, jewish, god, jew, muslim, islamic, islam, christian
  Topic 3: bush, white house, administration, presidential, congress, plan, official, nation

We observe that SparseDTM harvests more coherent topics than the LIDA topic model. The topics learned by SparseDTM contain specific, meaningful terms. For example, the second topic of SparseDTM (|M| = 500) is about the Lakers' playoff run, including the dominant player's name, "shaquille o neal", the coach's name, "phil jackson", and the Lakers' opponents, "davis" and "spurs". The third topic talks about religion, including meaningful terms like "nicholas" and "denomination". However, the topics inferred by the LIDA topic model contain too many general terms. For example, its second topic could be about either living or working conditions, or even about the location of a department. Its third topic could fall into either of two different areas, energy or environmental protection. Terms like "natural", "forest" and "trees" are so general that they cannot offer accurate information.
Summary. In this section, we conducted several comprehensive experiments on three genres of large text corpora and multiple tasks to verify SparseDTM. The experimental results demonstrate that SparseDTM outperforms the other algorithms and behaves very well on Twitter. In addition, SparseDTM infers the number of topics and achieves a stable sparsity ratio regardless of T, and fast convergence and stable performance regardless of |M| and κ. Furthermore, SparseDTM addresses sparsity in the topical structure and learns useful topics. In conclusion, SparseDTM is very useful in analyzing massive and streaming short text, which has been recognized as a crucial task in data mining.

6. CONCLUSIONS

In this paper, we propose a probabilistic Bayesian topic model, namely the Sparse Dirichlet mixture Topic Model (SparseDTM), and infer it on large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. The proposed approach outperforms existing approaches since it can discover the posterior distribution of the sparse topical structure of large short text corpora. The simple but effective generative process of SparseDTM makes stochastic variational-Gibbs inference feasible on large text corpora. Experimental results on different genres of large text corpora demonstrate that the proposed method significantly outperforms other competing methods on large-scale collections of short text, and extracts useful topics from about three million tweets within 30 hours while analyzing the sparsity in the topical structure of each tweet. The effectiveness is verified by higher PMI scores and lower perplexity compared with existing methods. The sparse topical structure will be helpful for analyzing the huge volume of short text that has become prevalent in the era of social media.

7. ACKNOWLEDGMENTS

This work is supported by The Chinese University of Hong Kong Direct Grant No. 4055048.

8. REFERENCES

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.
[2] C. Archambeau, B. Lakshminarayanan, and G. Bouchard. Latent IBP compound Dirichlet allocation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):321–333, 2015.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] X. Chen, M. Zhou, and L. Carin. The contextual focused topic model. In KDD, pages 96–104, 2012.


[5] X. Cheng, X. Yan, Y. Lan, and J. Guo. BTM: Topic modeling over short texts. TKDE, 26(12):2928–2941, 2014.
[6] F. Doshi, K. Miller, J. Van Gael, and Y. W. Teh. Variational inference for the Indian buffet process. In AISTATS, pages 137–144, 2009.
[7] J. Eisenstein, A. Ahmed, and E. P. Xing. Sparse additive generative models of text. In ICML, pages 1041–1048, 2011.
[8] J. Foulds, L. Boyles, C. Dubois, P. Smyth, and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, pages 446–454, 2013.
[9] T. L. Griffiths and Z. Ghahramani. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.
[10] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[11] L. Hong and B. D. Davison. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, pages 80–88. ACM, 2010.
[12] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.
[13] O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang. Transferring topical knowledge from auxiliary long texts for short text clustering. In CIKM, pages 775–784. ACM, 2011.
[14] T. Kenter and M. de Rijke. Short text similarity with word embeddings. In CIKM, pages 1411–1420. ACM, 2015.
[15] T. Lin, W. Tian, Q. Mei, and H. Cheng. The dual-sparse topic model: Mining focused topics and focused terms in short text. In WWW, pages 539–550, 2014.
[16] D. Mimno, M. D. Hoffman, and D. M. Blei. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.
[17] T. P. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[18] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In NAACL, pages 100–108, 2010.
[19] X. Quan, C. Kit, Y. Ge, and S. J. Pan. Short and sparse text topic modeling via self-aggregation. In IJCAI, pages 2270–2276. AAAI Press, 2015.
[20] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, pages 400–407, 1951.
[21] M. Shashanka, B. Raj, and P. Smaragdis. Sparse overcomplete latent variable decomposition of counts data. In NIPS, pages 1313–1320, 2008.
[22] L. Shou, Z. Wang, K. Chen, and G. Chen. Sumblr: Continuous summarization of evolving tweet streams. In SIGIR, pages 533–542, 2013.
[23] V. K. R. Sridhar. Unsupervised topic modeling for short texts using distributed representations of words. In NAACL-HLT, pages 192–200, 2015.
[24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
[25] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[26] C. Wang and D. M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, pages 1982–1989, 2009.
[27] C. Wang and D. M. Blei. Truncation-free online variational inference for Bayesian nonparametric models. In NIPS, pages 413–421, 2012.
[28] S. Williamson, C. Wang, K. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, pages 1151–1158, 2010.
[29] J. Xu, P. Liu, G. Wu, Z. Sun, B. Xu, and H. Hao. A fast matching method based on semantic similarity for short texts. In Natural Language Processing and Chinese Computing, pages 299–309. Springer, 2013.
[30] X. Yan, J. Guo, Y. Lan, J. Xu, and X. Cheng. A probabilistic model for bursty topic discovery in microblogs. In AAAI, pages 353–359, 2015.
[31] L. Yang, L. Jing, M. K. Ng, and J. Yu. A discriminative and sparse topic model for image classification and annotation. Image and Vision Computing, 2016.
[32] J. Yin and J. Wang. A Dirichlet multinomial mixture model-based approach for short text clustering. In KDD, pages 233–242, 2014.
[33] A. Zhang, J. Zhu, and B. Zhang. Sparse online topic models. In WWW, pages 1489–1500, 2013.
[34] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval, pages 338–349. Springer, 2011.
[35] J. Zhu and E. P. Xing. Sparse topical coding. In UAI, pages 831–838, 2011.
