Entropy Regularization for Topic Modelling
Nagesh Bhattu Sristy
D. V. L. N. Somayajulu
Department of CSE National Institute of Technology Warangal, India
Department of CSE National Institute of Technology Warangal, India
[email protected]
[email protected]
ABSTRACT
Supervised latent Dirichlet based topic models are variants of latent Dirichlet topic models with the additional capability to discriminate samples. Light supervision strategies have recently been adopted to express rich domain knowledge in the form of constraints. The Posterior Regularization framework was developed for learning models from this weaker form of supervision by expressing a set of constraints over the family of posteriors. Modelling arbitrary problem-specific dependencies is a non-trivial task, increasing the complexity of an already hard inference problem in the context of latent Dirichlet based topic models. In the current work we propose a posterior regularization method for topic models to capture a wide variety of auxiliary supervision. This approach simplifies the computational challenges posed by additional compound terms. We demonstrate the use of this framework in improving the utility of topic models in the presence of entropy constraints, and we experiment with real-world datasets to test these techniques.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Posterior Regularization

Keywords
Posterior Regularization, Topic Modelling, Entropy Regularization
1. INTRODUCTION
Probabilistic topic models have received much attention in the recent past due to their ability to model arbitrary domain-specific dependencies. Latent Dirichlet Allocation (LDA) is one such method for topic modelling. Though it was primarily developed in the context of document modelling, it
has been adapted to a wide range of applications including recommendation systems, information retrieval and sentiment analysis [10]. Topic models inferred in an unsupervised manner, as given in Blei et al. [3], are of limited use for typical discriminative target tasks. Supervised topic models are designed to take labelled data and learn topics which have discriminating ability in addition to describing the data (likelihood). Earlier efforts on supervised topic models by Blei and McAuliffe [2] focused on including additional nodes in the generative process, where the latent topics explain the individual words and collectively explain the label associated with the document. These are called down-stream models. Mimno and McCallum [11] and Lacoste-Julien et al. [9] designed topic models in which the labels associated with the documents affect the topics inferred. These are called upstream models. The two families are akin to generative versus discriminative models, where the latter avoid unnecessary independence assumptions. MEDLDA (maximum entropy discrimination LDA) is another approach to supervised topic modelling, due to Zhu et al. [12], where margin-based constraints derived from supervision and likelihood-based topic modelling are tied together using the maximum entropy discrimination (MED) principle. In addition to modelling margin-based constraints, the MED principle allows regularization, which helps in inducing sparsity and in model averaging. They assume additional mean-field factorizations to derive the distribution over parameters. Though MEDLDA is a down-stream topic model, Zhu et al. [12] show how MEDLDA infers better topic models compared to DiscLDA [9]. MEDLDA achieves its objective through suitable constraints on the posterior derived by LDA; the current work follows a similar approach. The success of supervised learning methods depends on the availability of labelled examples. As machine learning algorithms are applied in widespread domains, the availability of labelled data has become a challenge, and a set of semi-supervised algorithms have been proposed which make use of a limited amount of labelled data and a (large amount of) unlabelled data to derive machine learning models. Recent research has furthered this approach to learn models from alternate forms of supervision known as weak supervision. In weak supervision, a domain expert expresses his ideas about the domain in the form of corpus-level or instance-specific constraints. Examples of such supervision include the feature expectation constraints of Druck et al. [5] for classification, and the number of edges to be preserved in projecting a dependency parser from one language to another in Ganchev
et al. [7]. The general principle is to penalize models for violating constraints, and the algorithm lends itself to Expectation Maximization steps in which maximizing the likelihood and penalizing constraint violations are alternated. Unlabelled data is used for the constraint-based part of the model. In effect, the constraints are applied over the family of posteriors considered; this is called the Posterior Regularization (PR) approach (Ganchev et al. [7]). In the context of topic modelling, there have been several efforts to induce this weaker form of supervision. Balasubramanyan and Cohen [1] also study incorporating entropy constraints into the topic model. As observed by Ganchev et al. [7], the PR framework is an alternative approach to achieve a similar effect, allowing a richer set of constraints to be imposed on the posterior directly. Balasubramanyan and Cohen [1] introduce additional terms into the posterior; they fix some of these terms to be constants to drive the maximization or minimization objective, and model the deviation from these terms as Gaussian. Our approach does not restrict the family of this deviation, but derives the posterior in a maximum entropy sense. Posterior regularization is also studied in a similar context by Zhu et al. [13], referred to as RegBayes. That work uses a similar regularization framework for imposing margin-based constraints on infinite latent SVMs (support vector machines). We have drawn inspiration from Zhu et al. [13]. The contributions of this work are:
• Expressing a family of regularization constraints in the context of latent Dirichlet based topic models
• Entropy regularization for controlling the number of topics assigned to individual words
2. POSTERIOR REGULARIZATION OF LATENT VARIABLE MODELS
We briefly review the optimization framework used in this work. The generalized maximum entropy principle [6] can be extended to introduce constraints over the posteriors arising in topic modelling. The posteriors used in topic modelling contain latent variables, making exact inference for the posterior intractable, so approximate inference methods are applied to solve this complicated inference problem. Gibbs sampling [8] and variational inference [3] are the methods used for inference in topic modelling. Posterior regularization in LDA based topic models, as given in Balasubramanyan and Cohen [1], uses additional assumptions to induce the regularization effect. The current work deals with data-dependent constraints. Similar to the work of Zhu et al. [13], we use regularization of posteriors with latent variables, so the generalized problem considered in this work is given in Lemma 1.

Lemma 1 (RegBayes Optimization Problem). Let r(M) be a probability distribution in the space \Delta of posterior distributions over the model represented by M, let \phi(M, D) be the feature function, let g be a lower semi-continuous function of the expectation of this feature with respect to r(M), and let g^* be its conjugate. Then

\inf_{r(M) \in \Delta} \; KL(r(M) \,\|\, q(M, D)) + g\big(E_{r(M)}[\phi(M, D)]\big)   (1)

= \sup_{\lambda \in B^*} \; -\ln \int_{M} q(M, D)\, \exp\big(\lambda^t \phi(M, D)\big)\, dM \; - \; g^*(-\lambda)   (2)

Figure 1: Graphical Model of Entropy Constrained LDA

The optimal posterior belonging to the above family, in the primal problem subject to the constraint g(E_{r(M)}[\phi(M, D)]), is

\hat{r}(M) = \frac{q(M, D)\, \exp\big(\lambda^t \phi(M, D)\big)}{Z(\lambda, q)}.
The parameters λ are obtained by solving the dual problem in Equation 2. Lemma 1 defines regularization for arbitrary posteriors q(M, D). M can include the topic assignments to individual words, topic distributions for documents, and the discriminative parameters for supervised LDA. To represent this general case summation is replaced by integration.
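To make the form of the optimal posterior concrete, the following is a minimal numerical sketch (our illustration, not from the paper) of tilting a base posterior q(M, D) by exp(λ^t φ(M, D)) over a toy discrete model space; all numbers and variable names are assumptions.

```python
import numpy as np

# Toy base posterior q(M, D) over a discrete model space with 4 configurations.
# In the paper, M ranges over topic assignments, topic distributions, etc.
q = np.array([0.4, 0.3, 0.2, 0.1])

# Feature function phi(M, D) evaluated at each configuration, and a dual
# parameter lambda (assumed here; obtained by solving the dual in Eq. (2)).
phi = np.array([2.0, 1.0, 0.5, 0.0])
lam = -0.8  # a negative lambda penalizes configurations with large phi

# Optimal regularized posterior: r_hat(M) = q(M, D) * exp(lam * phi(M)) / Z(lam, q)
unnorm = q * np.exp(lam * phi)
r_hat = unnorm / unnorm.sum()   # Z(lam, q) is the normalizer
print(r_hat)                    # probability mass shifts toward low-phi configurations
```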
3. ENTROPY CONSTRAINTS
The topic models inferred by LDA do not have any control over the number of topics seen in a document or the number of topics assigned to the words of the vocabulary. Capturing such additional constraints on the model can be achieved through multiple methods. In the current section we show a method of controlling the entropy of the distribution of vocabulary words over topics. We first review the inference algorithms of LDA, and then design our inference algorithms using these approaches.
3.1 Latent Dirichlet Allocation
LDA is a hierarchical Bayesian model developed in the context of a special kind of document clustering called topic modelling. LDA, as given in Blei et al. [3], models each document as having a Dirichlet-distributed topic distribution, with words drawn from fixed multinomials over the vocabulary, one for each topic. The topic model as given in Blei et al. [3] takes the per-topic multinomial distributions over the vocabulary, \beta, as fixed parameters, and we follow a similar approach. In this model \theta_d is the document-specific topic distribution drawn from the Dirichlet Dir(\alpha). The inner plate represents the topic assignments z_n to the individual words w_n in the document. The joint distribution is given in (3):

p(\theta, Z, D \mid \alpha, \beta) = \prod_{i=1}^{|D|} p(\theta_i \mid \alpha) \prod_{n=1}^{N_i} p(z_{in} \mid \theta_i)\, p(w_{in} \mid z_{in})   (3)
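As a complement to Equation (3), here is a minimal sketch (our illustration, not the authors' code) of the LDA generative process with the per-topic multinomials β held fixed; the dimensions and hyperparameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 5, 1000, 50      # topics, vocabulary size, words per document (assumed)
alpha = 0.1                  # symmetric Dirichlet hyperparameter (assumed)
beta = rng.dirichlet(np.ones(V), size=K)   # fixed per-topic multinomials over the vocabulary

def generate_document():
    theta = rng.dirichlet(alpha * np.ones(K))            # document-specific topic distribution
    z = rng.choice(K, size=N_d, p=theta)                 # topic assignment z_n for each word slot
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # word w_n drawn from beta[z_n]
    return theta, z, w

theta, z, w = generate_document()
```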
Two techniques have been proposed for approximating the complex posterior, namely variational inference (Blei et al. [3]) and collapsed Gibbs sampling (Griffiths and Steyvers [8]). We follow variational inference in this paper. To derive an approximation to the posterior of (3), a mean-field variational distribution q(\theta, Z \mid \gamma, \Phi) is assumed, removing the complex coupling in LDA. Taking expectations with respect to this auxiliary, tractable exponential-family distribution yields a lower bound on the log marginal likelihood; this marginal is required as a normalizer in the inference problem of LDA:

L(q; \alpha, \beta) = E_q[\log p(\theta, Z, D \mid \alpha, \beta)] - E_q[\log q(\theta, Z \mid \gamma, \Phi)] \le \log p(D \mid \alpha, \beta)   (4)

Maximizing the bound in (4) is equivalent to minimizing KL(q(\theta, Z \mid \gamma, \Phi) \,\|\, p(\theta, Z \mid D, \alpha, \beta)).
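For concreteness, the following is a minimal sketch (ours, with assumed shapes and hyperparameters) of the standard coordinate-ascent updates that maximize the bound in (4) for a single document, with β held fixed as in the paper: φ_{nk} ∝ β_{k,w_n} exp(Ψ(γ_k)) and γ_k = α + Σ_n φ_{nk}.

```python
import numpy as np
from scipy.special import digamma

def mean_field_update(doc_words, beta, alpha, n_iters=50):
    """Coordinate ascent on the variational bound (4) for one document.

    doc_words: array of word ids; beta: K x V topic-word multinomials (fixed);
    alpha: symmetric Dirichlet hyperparameter. All shapes/values are illustrative.
    """
    K = beta.shape[0]
    N = len(doc_words)
    phi = np.full((N, K), 1.0 / K)        # q(z_n): per-token topic responsibilities
    gamma = alpha + (N / K) * np.ones(K)  # variational Dirichlet parameters for theta
    for _ in range(n_iters):
        # phi_{nk} proportional to beta_{k, w_n} * exp(digamma(gamma_k))
        phi = beta[:, doc_words].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma
```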
Table 1: Datasets

Dataset       Description       No. of Classes
Books         2000 instances    2
DVD           2000 instances    2
Electronics   2000 instances    2
Kitchen       2000 instances    2

No. of features: 473,856

3.2 Entropy Constraints
Controlling the number of times an observation is given a discrete label is commonly studied as sparsity-inducing regularization. There are various methods for achieving this effect. Ganchev et al. [7] show an example of L1/∞ regularization for the same objective, where the dual results in an L∞/1 norm which is more efficient to solve than the primal problem. In the current work we use an alternative approach, enforcing constraints on the entropy of the corresponding random variable. In the remainder of this section we show a method of controlling the sparsity of the number of topic assignments to individual words by using an entropy constraint. The entropy should be small if we want fewer topics assigned to individual words. For each word we observe the distribution over topics given in Equation (5):

p_{wk} = \frac{N_{wk}}{N_w}   (5)

In (5), N_{wk} is the number of times a word w is assigned topic k, and the denominator N_w is the number of tokens of type w in the document collection. The entropy of this distribution is given by Equation (6). If we want the number of topic assignments to a particular word to be small, we can achieve this by constraining the entropy to be small.

C_w = -\sum_k p_{wk} \log(p_{wk})   (6)
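A small sketch (illustrative, using an assumed expected-count array) of Equations (5) and (6): the per-word topic distribution p_{wk} and its entropy C_w computed from the (expected) topic-assignment counts.

```python
import numpy as np

def word_topic_entropies(N_wk, eps=1e-12):
    """N_wk: V x K matrix of (expected) counts of word w assigned to topic k.

    Returns p_wk as in Eq. (5) and the entropy vector C with entries C_w as in Eq. (6).
    """
    N_w = N_wk.sum(axis=1, keepdims=True)          # number of tokens of type w
    p_wk = N_wk / np.maximum(N_w, eps)             # Eq. (5)
    C = -(p_wk * np.log(p_wk + eps)).sum(axis=1)   # Eq. (6): entropy per word
    return p_wk, C
```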
The constraints are of the form (7), where a is a constant, and Equation (8) is the dual (conjugate) of this constraint:

g(C) = \frac{1}{2a}\|C - 0\|_2^2   (7)

g^*(\lambda) = \frac{a}{2}\|\lambda\|_2^2   (8)

Here C represents the vector of all entropies (one for each word). As we do not want hard constraints, we use a constraint on the square of the L2 norm, which is equivalent to L2 regularization of the dual parameter in the dual space. The resulting primal-dual formulation of the objective with these constraints (one for each word) is given in Equations (9) and (10):

\inf_{r(Z) \in \Delta} KL(r(Z) \,\|\, q(Z, D)) + \frac{1}{2a}\|C - 0\|_2^2   (9)

= \sup_{\lambda \in B^*} -\log\big(E_q[\exp(\lambda \cdot C)]\big) - \frac{a}{2}\|\lambda\|_2^2   (10)
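One way to optimize the dual in (10) is gradient ascent on λ; the paper does not specify the optimizer, so the sketch below is our assumption. It uses the standard identity that the gradient of the log-partition term is the negated expected feature under the tilted posterior, here approximated by the current estimate of the expected entropy vector E_r[C]; the step size is also an assumption.

```python
import numpy as np

def dual_gradient_step(lam, expected_C, a, step=0.1):
    """One gradient-ascent step on the dual objective in Eq. (10),

        D(lam) = -log E_q[exp(lam . C)] - (a/2) ||lam||^2,

    with the gradient of the log-partition term approximated by -E_r[C],
    the expected entropy vector under the current regularized posterior.
    """
    grad = -expected_C - a * lam   # approximate gradient of D(lam)
    return lam + step * grad       # entropies are >= 0, so lam is pushed negative,
                                   # which downweights high-entropy topic assignments

# Example with assumed values: 3 vocabulary words, penalty weight a = 1.0
lam = np.zeros(3)
expected_C = np.array([1.2, 0.4, 0.9])
lam = dual_gradient_step(lam, expected_C, a=1.0)
```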
The solution to Equation (10) takes the form

\hat{r}(Z) = \frac{q(Z, D)\, \exp(\lambda \cdot C)}{E_q[\exp(\lambda \cdot C)]}.

As we use the CVB0 algorithm for inference of the LDA model, the collapsed conditional distribution takes the form in Equation (11):

r(z_{dn} = k) \propto \big(\alpha + E_r[N_{dk}^{-dn}]\big) \, \frac{\beta + E_r[N_{wk}^{-dn}]}{|V|\beta + E_r[N_k^{-dn}]} \, \exp\big(\lambda_{w_{dn}} E_r[C_{w_{dn}}]\big)   (11)
The complete generative process is symbolically represented in Figure 1. In this figure the nodes C_w represent aggregate properties used for modelling additional dependencies. Our algorithm for topic model inference alternately optimizes the posterior in Equation (11) and solves Equation (10) to estimate the parameters λ. This way we are able to segregate the core LDA (CVB0) inference from the additional posterior regularization terms in a maximum entropy sense. We could use L1/∞ regularization to achieve the same effect, which results in an even more efficient dual formulation. We could also achieve this by using a Laplacian prior and taking as constraint the minimization of the KL divergence of the distribution in (5) from this prior. We have not evaluated the efficacy of these approaches and leave them for future work.
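The following schematic sketch (ours, not the released implementation) illustrates one sweep of a CVB0-style update in the spirit of Equation (11). We read E_r[C_{w_dn}] as evaluated with the token assigned to topic k, since a topic-independent factor would cancel under normalization; this interpretation, along with the simplified count bookkeeping, is our assumption. In the full algorithm such a sweep would alternate with updates of λ for the dual in Equation (10).

```python
import numpy as np

def entropy_if_assigned(nw_minus, k, eps=1e-12):
    """Entropy of word w's topic distribution if the current token is placed in topic k,
    given the token-excluded expected counts nw_minus (length-K vector)."""
    counts = np.maximum(nw_minus, 0.0)
    counts[k] += 1.0
    p = counts / counts.sum()
    return -(p * np.log(p + eps)).sum()

def regularized_cvb0_sweep(docs, r, lam, alpha, beta, V):
    """One sweep of a CVB0-style responsibility update with the entropy term.

    docs: list of word-id arrays; r: list of N_d x K responsibility matrices;
    lam: length-V dual parameters from the entropy constraints.
    Counts are held fixed within the sweep for simplicity; a real implementation
    would refresh them incrementally after each token update.
    """
    K = r[0].shape[1]
    N_dk = np.stack([rd.sum(axis=0) for rd in r])   # D x K expected doc-topic counts
    N_wk = np.zeros((V, K))                         # V x K expected word-topic counts
    for w_d, rd in zip(docs, r):
        np.add.at(N_wk, w_d, rd)
    N_k = N_wk.sum(axis=0)
    for d, (w_d, rd) in enumerate(zip(docs, r)):
        for n, w in enumerate(w_d):
            # "-dn" counts: remove this token's own contribution
            nd, nw, nk = N_dk[d] - rd[n], N_wk[w] - rd[n], N_k - rd[n]
            ent = np.array([entropy_if_assigned(nw, k) for k in range(K)])
            prob = (alpha + nd) * (beta + nw) / (V * beta + nk) * np.exp(lam[w] * ent)
            rd[n] = prob / prob.sum()
    return r
```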
3.3 Experiments
We have taken the datasets used by Blitzer [4]. This collection consists of reviews of 4 product types, each with 2000 reviews. The characteristics of these datasets are summarized in Table 1. We have taken the implementation of collapsed variational Bayes available at http://www.ics.uci.edu/~asuncion/software/fast.htm and made suitable modifications to implement the optimization objective of the current work. The objective described in the previous section maximizes the likelihood (which has the effect of finding the latent topic distributions) on one hand, and estimates the parameters of the maximum-entropy regularization model, which penalizes the entropy of the word-topic distributions, on the other. These two objectives need not converge to a single point solution, so we evaluate the resulting topic model from both perspectives: first, the sum/average entropy of the word-topic distributions, whose minimization is the main objective of this work; second, how the likelihood is affected. The cumulative entropy is the sum of the entropies of the probability distributions given in Equation (6) over all words in the vocabulary (or only the words to which we want to apply this method). We have plotted the cumulative entropy for various configurations. The number of topics is varied over 10, 20, 30 and 40, and the parameter a in Equation (8) is varied over 0.1, 0.5, 1 and 10. In Figure 2a, the first column of points corresponds to the setting with no regularization; columns 2, 3, 4 and 5 correspond to a being 0.1, 0.5, 1 and 10 respectively. Each line in these figures corresponds to a different number of topics, indicated in its label. As can be seen throughout the results, the setting a = 0.1 is inferior compared to plain LDA; in all other settings the additional regularization results in a decrease in entropy, as required. We also report the likelihood of the corresponding model over all these configurations.
Figure 2: Comparison plot of cumulative entropy / likelihood vs. parameter a for different numbers of topics. Panels: (a) Books Entropy, (b) Books Likelihood, (c) DVD Entropy, (d) DVD Likelihood.
Figure 2b shows how the likelihood varies across these combinations. While calculating the likelihood we did not include the additional entropy constraint terms, as the likelihood objective does not involve them. We have also tested the effect of entropy regularization on classification accuracy by using the topical representation of the documents as input to an external SVM. We observed that for all the datasets, accuracies improved by 2-4% as a result of this entropy regularization.
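For the classification evaluation, the sketch below (assuming scikit-learn, with placeholder arrays standing in for the actual inferred proportions and labels) shows the kind of pipeline described above: document-topic proportions fed to an external linear SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# doc_topic: D x K matrix of inferred document-topic proportions (e.g., normalized gamma);
# labels: length-D array of sentiment labels. Both are stand-in data for illustration only.
doc_topic = np.random.dirichlet(np.ones(20), size=2000)
labels = np.random.randint(0, 2, size=2000)

clf = LinearSVC(C=1.0)                                   # external SVM on topic features
scores = cross_val_score(clf, doc_topic, labels, cv=5)   # 5-fold classification accuracy
print(scores.mean())
```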
4. CONCLUSION
We have demonstrated a new method of achieving entropy regularization for topic modelling using posterior regularization. This framework allows the additional domain-specific modelling to be independent of the basic topic modelling, without changing its basic characteristics such as the conjugate (Dirichlet-multinomial) assumption. This approach can further be applied in various other contexts where the desired effect can be specified in the form of constraints over the posterior.
References
[1] Ramnath Balasubramanyan and William W. Cohen. Regularization of latent variable models to obtain sparsity. In SIAM Data Mining Conference, 2013.
[2] David M. Blei and Jon D. McAuliffe. Supervised topic models. arXiv preprint arXiv:1003.0783, 2010.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] John Blitzer. Domain Adaptation of Natural Language Processing Systems. PhD thesis, University of Pennsylvania, 2008.
[5] Gregory Druck, Gideon Mann, and Andrew McCallum. Learning from labeled features using generalized expectation criteria. Pages 595-602, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-164-4.
[6] Miroslav Dudik. Maximum Entropy Density Estimation and Modeling Geographic Distributions of Species. PhD thesis, Princeton, NJ, USA, 2007. AAI3281302.
[7] Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001-2049, August 2010. ISSN 1532-4435.
[8] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228-5235, 2004.
[9] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, pages 897-904, 2008.
[10] Himabindu Lakkaraju, Chiranjib Bhattacharyya, Indrajit Bhattacharya, and Srujana Merugu. Exploiting coherence for the simultaneous discovery of latent facets and associated sentiments. In SIAM Data Mining Conference, 2011.
[11] David M. Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, pages 411-418, 2008.
[12] Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237-2278, 2012.
[13] Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and infinite latent support vector machines. CoRR, abs/1210.1766, 2012.