Inside Latent Dirichlet Allocation: An Empirical Exploration

James V. Hansen
Marriott School, Brigham Young University, Provo, Utah 84602

Keywords: Latent Dirichlet Allocation; topic modeling; Gibbs sampling

Abstract
Unlike mixture models that have been developed for analyzing text, topic modeling recognizes that documents may be about more than one topic. Latent Dirichlet Allocation (LDA) is an algorithm that assigns probability to words that tend to occur together, thereby identifying possible topics. Most research concerning LDA has focused on theory or algorithmic refinement. This paper focuses on exploring the internal functionality of LDA toward a better understanding of how it can be used and interpreted.


I. INTRODUCTION

In almost every field of endeavor there has been explosive growth in the amount of information generated. Wikipedia articles, blogs, Flickr images, astronomical survey data, social network analytics, and many other areas yield enormous volumes of documents and content. Organizing and analyzing this data in a useful amount of time is well beyond human capabilities; algorithmic tools are essential to organizing, searching, and understanding this information [6].

Consider, for example, a collection of documents for which we seek to identify the underlying "topics" that define the collection. Suppose that each document contains a mixture of different topics. Here a "topic" is a collection of words that have different probabilities of appearance in passages discussing the topic. One topic might contain many occurrences of the words "smooth," "wrinkle," and "iron." Another might contain many occurrences of "copper" and "quartz," with a few occurrences of "iron." Most of the occurrences of "iron" in this second topic are nouns rather than verbs, and it is algorithmically desirable to distinguish the different contexts of such symbols.

Topic modeling offers an approach to addressing these issues. In particular, topic modeling provides a way to infer the latent structure behind a collection of documents. In this context, a topic is a probability distribution over a collection of words, and a topic model is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure for generating the topics. The central goal of a topic model is to provide a "thematic summary" of a collection of documents. In other words, it answers this important question: what themes are these documents discussing? For instance, a collection of news articles might discuss political, sports, and business-related themes [1].

A robust method of topic modeling is Latent Dirichlet Allocation (LDA). LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar [1]. For example, if the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that the creation of each word is attributable to one of the document's topics.

LDA emerged from the study of mixed-membership modeling algorithms. It differs from standard clustering models, which seek to group related articles into disjoint sets, or clusters, that capture the topics prevalent in a corpus. With these models, every document is assigned to a single topic. Even expectation-maximization models, which use probabilistic means to capture uncertainty in cluster assignments, still assume that each document belongs to a single topic.


Yet is an article really about just one topic, such as politics or sports? Concretely, consider an article containing words like audit, fraud, bonds, economy, and profits. A standard clustering model making probabilistic assignments might group this article with other articles related to the topic audit. However, the article also contains words like bonds, economy, and profits, which might mean that it is really an article about finance. Indeed, a probabilistic model might assign it to an audit cluster or to a finance cluster, but realistically what is more important to capture is the fact that the article is about both audit and finance. This perspective is contextually more expressive, particularly when we introduce latent variables, which aim to capture abstract notions such as topics. Importantly, the LDA model is not limited to the analysis of text; it extends to other problems involving collections of data in domains such as collaborative filtering, content-based image retrieval, and bioinformatics.

II. PRIOR WORK

Most LDA research has focused on theory or algorithmic enhancement. A landmark preceding LDA was the recognition [4] that, given a generative model of text, one can fit the model to data using maximum likelihood or Bayesian methods. While the probabilistic latent semantic indexing (pLSI) of [4] was an important development in the probabilistic modeling of text, it was subsequently shown in [1] to be incomplete. In particular, pLSI provides no probabilistic model at the level of documents: each document is represented as a list of numbers, and there is no generative probabilistic model for these numbers. This leads to two important problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious overfitting, and (2) it is unclear how to assign probability to a document outside of the training set.

To address these limitations, Blei et al. [1] developed the basic LDA algorithm. Subsequently, [7] introduced a special-purpose variation called Spatial Latent Dirichlet Allocation (SLDA), designed to discover objects in a collection of images. Later, a more general refinement was developed in [5]: an online variational Bayes (VB) algorithm for LDA based on online stochastic optimization with a natural gradient step, which can be shown to converge to a local optimum of the VB objective function. It is claimed to be able to process massive document collections, including those arriving in a stream. Frigyik et al. [3] describe and contrast three methods of generating samples for LDA: stick-breaking, the Pólya urn, and drawing gamma random variables. Blei [2] surveyed probabilistic topic models, including LDA, arguing that topic models promise to be an important component for summarizing and understanding our growing digitized archive of information; that article also contains a comprehensive bibliography of LDA-related research.


This sampling underscores the theoretical and algorithmic focus of LDA research. Our contribution takes a different tack by providing an empirical examination of LDA. In particular, we explore the following areas of performance:

1. Finding the top words in each latent topic and using these to identify topic themes. This is important to many applications: for example, seeking medical information on particular topics, searching for developments in areas of scientific interest, or finding topics for legal precedent.
2. Predicting topic distributions for example documents. This is fundamental to assessing the topical structure of documents of interest, and can be useful for classification, indexing, and assessing trends in document composition.
3. Comparing the quality of LDA to conventional information retrieval methods. This is useful in determining the robustness and comparative effectiveness of the models.
4. Investigating the role of the model hyper-parameters alpha and gamma. While these parameters are heuristic in nature, it is of interest to determine the effects of high and low settings on the quality of LDA performance.

Details on each of these will be explained as they are addressed. First, we outline at a high level the algorithmic features of LDA.

III. LATENT DIRICHLET ALLOCATION

LDA provides a generative model that seeks to discover how the documents in a dataset were created. That is to say, if we define a dataset as a collection of N documents, with a document defined as a collection of words, a generative model can be used to describe how each document obtained its words. Following [1] and [6], assume the K topic distributions are known for a dataset. Then we have K multinomials containing V elements each, where V is the number of terms in the corpus. Let $\eta_k$ denote the multinomial for the $k$-th topic, where the size of $\eta_k$ is $V$: $|\eta_k| = V$. Given these distributions, the LDA generative process is as follows:

For each document:
(1) Randomly choose a distribution over topics: a multinomial of length K.
For each word in the document:
  (i) Probabilistically draw one of the K topics from the distribution over topics obtained in (1), say topic $k$.
  (ii) Probabilistically draw one of the V words from $\eta_k$.
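To make this generative process concrete, the following short sketch simulates it with numpy for a toy vocabulary. The vocabulary, the topic-word multinomials, and the Dirichlet parameter are illustrative assumptions, not values used elsewhere in this paper.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["smooth", "wrinkle", "iron", "copper", "quartz"]   # V = 5 terms
V, K = len(vocab), 2                                         # two topics

# Assumed topic-word multinomials eta_k; each row sums to 1.
eta = np.array([[0.40, 0.35, 0.20, 0.03, 0.02],   # topic about ironing clothes
                [0.02, 0.03, 0.25, 0.35, 0.35]])  # topic about minerals

alpha = np.full(K, 0.5)   # Dirichlet prior over per-document topic proportions

def generate_document(n_words):
    theta = rng.dirichlet(alpha)            # (1) document's distribution over topics
    words = []
    for _ in range(n_words):
        k = rng.choice(K, p=theta)          # (i) draw a topic for this word
        w = rng.choice(V, p=eta[k])         # (ii) draw a word from eta_k
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
print("topic proportions:", np.round(theta, 2))
print("document:", " ".join(doc))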


This generative model is able to associate multiple topics with a document. Consider our example of a document containing words pertaining to both audit and finance. Step (1) indicates that each document contains topics in different proportions; for example, one document may contain many words drawn from the topic on audit and no words drawn from the topic on finance, while a different document may have an equal number of words drawn from both topics. Steps (i) and (ii) recognize that each individual word in the document is drawn from one of the K topics in proportion to the document's distribution over topics chosen in Step (1), and that the selection of each word depends on the distribution over the V words in the vocabulary determined by the selected topic, $\eta_k$.

However, since the central goal of topic modeling is to automatically discover the topics in a collection of documents, the assumption that we know the K topic distributions is not very helpful. Consequently, we must learn the topic distributions algorithmically. This can be accomplished through collapsed Gibbs sampling.

IV. COLLAPSED GIBBS SAMPLING

A collapsed Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm that generates a sequence of topic assignments for each token; in the limit, these converge to samples drawn from the posterior distribution. In practice the algorithm is run for a sufficiently long time to allow the topics to "converge" (a period sometimes referred to as burn-in), and then the last few samples are used to estimate the posterior distribution over topics for each document and the posterior distribution over words for each topic.

Suppose we have a collection of documents, and we focus our analysis on the use of the 10 words shown in Table 1. Further, assume we have run several iterations of collapsed Gibbs sampling for an LDA model with K = 2 topics, alpha = 10.0, and gamma = 0.1. The corpus-wide assignments at our most recent collapsed Gibbs iteration are summarized in the counts shown in Table 1.



Word        Count in Topic 1    Count in Topic 2
tennis      52                  0
match       15                  0
admission   17                  2
price       9                   25
official    20                  37
owner       9                   32
stadium     0                   75
earnings    1                   23
bankrupt    0                   19
taxes       0                   29

Table 1. Word counts by topic

Let us look at the first step of the algorithm, which is to select a document; here we have a very simple five-word document. The next step is to assign every word in this document to a topic. Here is a set of randomly assigned topics:

Word:   tennis   official   admission   price   owner
Topic:  1        2          1           2       1

We now seek to re-compute the topic assignment for, say, the word "official". We illustrate how to compute the probabilities of assigning "official" to Topic 1 and to Topic 2. The collapsed Gibbs sampling update is expressed as

$$\Pr(\text{Topic } k) = \frac{n_{i,k} + \alpha}{N_i - 1 + K\alpha} \cdot \frac{m_{\text{official},k} + \gamma}{\sum_{w=1}^{V} m_{w,k} + V\gamma},$$

where all counts exclude the current assignment of the word being re-sampled. In words, the first factor represents "how much the given document represents Topic $k$," where

$n_{i,k}$ is the number of current assignments to Topic $k$ in document $i$,
$\alpha$ is a smoothing constant, a Bayesian prior,
$N_i$ is the number of words in document $i$,
$K$ is the number of topics.

The second factor denotes "how much Topic $k$ represents the word 'official', based on assignments in other documents in the corpus," where

$m_{\text{official},k}$ is the number of corpus-wide assignments of the word "official" to Topic $k$,
$\gamma$ is a smoothing parameter, a Bayesian prior,
$\sum_{w=1}^{V} m_{w,k}$ is the sum of all Topic $k$ word counts,
$V$ is the size of the vocabulary.

For our example, we have

Prob(Topic 1) = (13/24) * (20.1/124) = 0.0878


Prob(Topic 2) = (11/24) * (36.1/242) = 0.0681

Finally, we compute the relative probabilities of Topic 1 and Topic 2:

Relative-Prob(Topic 1) = 0.0878/(0.0878 + 0.0681) = 0.5632
Relative-Prob(Topic 2) = 0.0681/(0.0878 + 0.0681) = 0.4368

Given these results, we change the topic designation for "official" in the given document to Topic 1, i.e.,

Word:   tennis   official   admission   price   owner
Topic:  1        1          1           2       1
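The arithmetic above can be verified with a few lines of code. The following minimal sketch (variable names and data structures are our own, not from any particular library) reproduces the two unnormalized probabilities for "official" and their normalized versions.

# Collapsed Gibbs update for one word, following the worked example above.
alpha, gamma = 10.0, 0.1
K, V = 2, 10                      # topics, vocabulary size
N_i = 5                           # words in the example document

# Corpus-wide word-topic counts from Table 1 as (Topic 1, Topic 2) pairs.
m = {"tennis": (52, 0), "match": (15, 0), "admission": (17, 2),
     "price": (9, 25), "official": (20, 37), "owner": (9, 32),
     "stadium": (0, 75), "earnings": (1, 23), "bankrupt": (0, 19),
     "taxes": (0, 29)}

# Current assignments in the document, excluding the word being re-sampled
# ("official", currently in Topic 2): tennis->1, admission->1, price->2, owner->1.
n_ik = [3, 1]                     # document counts for Topic 1 and Topic 2
m_official = [20, 37 - 1]         # subtract the current assignment of "official"
topic_totals = [sum(c[k] for c in m.values()) for k in range(K)]
topic_totals[1] -= 1              # likewise exclude it from Topic 2's total

probs = []
for k in range(K):
    doc_part = (n_ik[k] + alpha) / (N_i - 1 + K * alpha)
    word_part = (m_official[k] + gamma) / (topic_totals[k] + V * gamma)
    probs.append(doc_part * word_part)

total = sum(probs)
print([round(p, 4) for p in probs])          # approximately [0.0878, 0.0684]
print([round(p / total, 4) for p in probs])  # approximately [0.56, 0.44], matching the worked example up to rounding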

This computation is performed for every word in the document, and the entire process is iterated until convergence or until some stopping criterion is met.

By way of comparison, the basic methodology proposed by information retrieval researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in a corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme, a basic vocabulary of "words" or "terms" is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. After normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized). The end product is a term-by-document matrix whose columns contain the tf-idf values for each of the documents in the corpus. Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers [1]. While the tf-idf reduction has some appealing features, notably its basic identification of sets of words that are discriminative for documents in the collection, the approach provides a relatively small reduction in description length and reveals little in the way of inter- or intra-document statistical structure.

V. EXPERIMENTAL EXPLORATION

We have noted that a major feature distinguishing the LDA model is the notion of mixed membership. Established models have assumed that each data point belongs to a single cluster: k-means determines membership simply by the shortest distance to a cluster center, and Gaussian mixture models suppose that each data point is drawn from one of their component mixture distributions. In many instances, however, it is more realistic to think of data as genuinely belonging to more than one cluster or category. For example, if we have a model for text data




that includes both "audit" and "fraud" categories, then an article about financial prosecutions should have membership in both categories rather than being forced into just one.

We have seen that collapsed Gibbs sampling can be used to perform inference in the LDA model. In this experiment we use GraphLab to learn an LDA topic model for Wikipedia text data on persons. In this experiment we:

1. Apply standard preprocessing techniques to the Wikipedia text data
2. Use GraphLab to fit a Latent Dirichlet Allocation (LDA) model
3. Explore and interpret the results, including topic keywords and topic assignments for documents

An example fragment of the Wikipedia text file on people that we use is shown in Table 2. In the original data, each Wikipedia article is represented by a URI, a name, and a string containing the entire text of the article.

URI    name              text
...    Digby Morrell     digby morrell born 10 october 1979 is a former ...
...    Alfred J. Lewy    alfred j lewy aka sandy lewy graduated from ...
...    Harpdog Brown     harpdog brown is a singer and harmonica player who ...

Table 2. Wikipedia people data fragment
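The preprocessing and fitting steps described above can be expressed compactly in code. The sketch below assumes the GraphLab Create Python API (the file name, column names, and parameter values are illustrative); its open-source successor, Turi Create, exposes a very similar interface.

import graphlab as gl

# Load the Wikipedia "people" articles (URI, name, text); the file name is illustrative.
wiki = gl.SFrame.read_csv('people_wiki.csv')

# Standard preprocessing: convert each article to a bag-of-words representation.
wiki['word_count'] = gl.text_analytics.count_words(wiki['text'])

# Fit an LDA topic model with 10 topics; training uses a randomized algorithm
# based on collapsed Gibbs sampling.
topic_model = gl.topic_model.create(wiki['word_count'],
                                    num_topics=10,
                                    num_iterations=200)

# Inspect the top words in each topic and predict a topic distribution
# for a single article.
print(topic_model.get_topics(num_words=10))
powell = wiki[wiki['name'] == 'Colin Powell']
print(topic_model.predict(powell['word_count'], output_type='probability'))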


Recalling our objectives from Section II, we proceed in the following way:



1. Find the top words in each latent topic and use these to identify topic themes
2. Predict topic distributions for example documents
3. Compare the quality of LDA to conventional information retrieval methods
4. Investigate the role of the model hyper-parameters alpha and gamma

The method used to fit the LDA model is a randomized algorithm based on collapsed Gibbs sampling. We commence by identifying the topics learned by our model and then identifying the major themes associated with these topics. Earlier we described a topic as a probability distribution over words in the vocabulary; that is, each topic assigns a particular probability to every one of the unique words that appears in our data. Different topics will assign different probabilities to the same word. For instance, a topic that ends up describing science and technology articles might place more probability on the word 'university' than a topic that describes sports or politics. Looking at the highest-probability words in each topic gives a sense of its major themes. Ideally we would find that each topic is identifiable with some clear theme and that all the topics are relatively distinct.

Here are the words representing the topics that were found, along with their respective weights; the scale of each weight indicates the degree to which the document embodies a given topic.

['university', 'research', 'professor', 'international', 'institute', 'science', 'society', 'studies', 'director', 'national'] (0.03377237807725826)
['played', 'season', 'league', 'team', 'career', 'football', 'games', 'player', 'coach', 'game'] (0.012033499250205755)
['film', 'music', 'album', 'released', 'band', 'television', 'series', 'show', 'award', 'appeared'] (0.011801143226764909)
['university', 'school', 'served', 'college', 'state', 'american', 'states', 'united', 'born', 'law'] (0.008813833898980135)
['member', 'party', 'election', 'minister', 'government', 'elected', 'served', 'president', 'general', 'committee'] (0.008510455845732842)
['work', 'art', 'book', 'published', 'york', 'magazine', 'radio', 'books', 'award', 'arts'] (0.00837981038349629)
['company', 'business', 'years', 'group', 'time', 'family', 'people', 'india', 'million', 'indian'] (0.00654814346663618)


['world', 'won', 'born', 'time', 'year', 'team', 'championship', 'tour', 'championships', 'title'] (0.006458708317991025)
['born', 'british', 'london', 'australian', 'south', 'joined', 'years', 'made', 'england', 'australia'] (0.006314033812829743)
['music', 'de', 'born', 'international', 'la', 'orchestra', 'opera', 'studied', 'french', 'festival'] (0.005795836039797515)

An examination of the respective word sets suggests the following themes to identify each topic:

Topic 1: University research
Topic 2: Competitive sports
Topic 3: Music and entertainment
Topic 4: American colleges and law
Topic 5: Government and politics
Topic 6: Art and media
Topic 7: Business and people
Topic 8: International sports
Topic 9: UK and Australia
Topic 10: International music

VI. MEASURING IMPORTANCE OF TOP WORDS

We can learn more about the topics by exploring how they place probability (weight) on each of their top words. We do this by examining two graphics of the weights for the top words in each topic:

1. The weights of the top 50 words, sorted by size
2. The total weight of the top 10 words

Here is a graphic representation of the top 50 words by weight in each topic.
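Both plots can be generated from a topic-by-vocabulary matrix of word weights. The sketch below is illustrative only: it assumes such a matrix, topic_word, has already been extracted from the fitted model (one row per topic, rows summing to one), and it uses matplotlib.

import numpy as np
import matplotlib.pyplot as plt

def plot_top_word_weights(topic_word, top_n=50):
    # One curve per topic: its top-n word weights in descending order.
    plt.figure()
    for k, row in enumerate(topic_word):
        weights = np.sort(row)[::-1][:top_n]
        plt.plot(weights, label='topic %d' % (k + 1))
    plt.xlabel('word rank')
    plt.ylabel('probability (weight)')
    plt.legend()
    plt.show()

def plot_top10_total_weight(topic_word):
    # Total probability mass each topic places on its 10 highest-weight words.
    mass = np.sort(topic_word, axis=1)[:, ::-1][:, :10].sum(axis=1)
    plt.figure()
    plt.bar(np.arange(1, len(mass) + 1), mass)
    plt.xlabel('topic')
    plt.ylabel('total weight of top 10 words')
    plt.show()

# plot_top_word_weights(topic_word)
# plot_top10_total_weight(topic_word)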








In the graphic above, each line represents one of the ten topics. Observe that for each topic the weights drop dramatically as we descend the ranked list of most important words. Note that the top 10-15 words in each topic are assigned a much greater weight than the remaining words, which number over 547K in total. Next we plot the total weight assigned by each topic to its top 10 words:




We observe that, for our topic model, the top 10 words account for only a small proportion of their topic's total probability mass. Therefore, while we can use the top words to identify broad themes for each topic, the topics may in reality be more complex than a brief 10-word summary.

VII. TOPIC DISTRIBUTIONS

LDA allows for mixed membership, meaning that each document can partially belong to several different topics. For every document, topic membership is expressed as a vector of weights that sum to one. As noted above, the scale of each weight indicates the degree to which the document embodies a given topic. We investigate this phenomenon in our fitted model by examining the topic distributions for a few Wikipedia articles from our corpus. We expect these articles to have the highest weights on the topics whose themes are most relevant to the subject of the article. In particular, an article about a scientist should place relatively high weight on topics related to university research, while an article about a baseball player should place higher weight on topics related to sports or athletics.

Topic distributions for documents can be obtained using a collapsed Gibbs sampler similar to the one described earlier, in which only the word-assignment variables are sampled. To obtain a document-specific topic proportion vector after the fact, the vector is drawn from the conditional distribution given the sampled word assignments in the document. Since these are drawn from


a distribution over topics that the model has learned, slightly different predictions result each time the function is called on a document. This is seen in Tables 3 and 4, where the topic distribution is predicted for an article on Colin Powell.

Topics                       Predictions (first draw)   Predictions (second draw)
University research          0.0215053763441            0.0456989247312
Competitive sports           0.0483870967742            0.0510752688172
Music and entertainment      0.0241935483871            0.0376344086022
American colleges and law    0.155913978495             0.131720430108
Government and politics      0.545698924731             0.588709677419
Art and media                0.0215053763441            0.0241935483871
Business and people          0.0645161290323            0.0456989247312
International sports         0.0618279569892            0.0295698924731
UK and Australia             0.0322580645161            0.0241935483871
International music          0.0241935483871            0.0215053763441

Table 3. Topic distribution for Colin Powell (two draws)

To get a more robust estimate of the topics for each document, we can average a large number of predictions for the same document:

Topics                       Average predictions
Government and politics      0.593494623656
American colleges and law    0.139596774194
Business and people          0.051747311828
Competitive sports           0.0499731182796
University research          0.0404032258065
International sports         0.0338978494624
UK and Australia             0.0255913978495
International music          0.0229838709677
Art and media                0.0228494623656
Music, TV, and film          0.0194623655914

Table 4. Topic distribution averages for Colin Powell
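Averaging repeated draws is straightforward. The sketch below assumes the GraphLab-style predict call used earlier; the function name and the number of repetitions are our own choices, not a library API.

import numpy as np

def average_topic_predictions(model, doc_word_counts, num_repetitions=100):
    # Each call to predict() samples a slightly different topic-proportion
    # vector, so we average many draws to obtain a more stable estimate.
    draws = [np.array(model.predict(doc_word_counts, output_type='probability')[0])
             for _ in range(num_repetitions)]
    return np.mean(draws, axis=0)

# Example, assuming the objects from the earlier sketch:
# powell = wiki[wiki['name'] == 'Colin Powell']
# print(average_topic_predictions(topic_model, powell['word_count']))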


In like manner, we can find the top topics corresponding to the article about the English actor Derek Jacobi (Table 5).

Topics                       Average predictions
Music and entertainment      0.40448
UK and Australia             0.14659
American colleges and law    0.07897
International music          0.07819
International sports         0.06216
Business and people          0.05000
University research          0.04902
Competitive sports           0.04536
Art and media                0.04427
Government and politics      0.04093

Table 5. Topic distribution averages for Derek Jacobi

In this way we find that the topic model has learned some coherent topics, and we have explored these topics as probability distributions over a vocabulary. Moreover, we have seen how individual documents in the Wikipedia data set are allocated to these topics in a way that corresponds with expectations.

VIII. COMPARING LDA TO NEAREST NEIGHBORS FOR DOCUMENT RETRIEVAL

In this section, we use the predicted topic distribution as a representation of each document, similar to the way in which we have previously represented documents by word counts or tf-idf. This provides a way of computing distances between documents, so that we can run a nearest-neighbors search for a given document based on its membership in the topics learned by LDA. We can compare the results with those generated by running nearest neighbors under the usual tf-idf representation. We begin by creating the LDA topic-distribution representation for each document, and we then compare the two representations by finding the nearest neighbors of an example document under each.
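A minimal sketch of how the two retrieval schemes might be set up follows. It uses scikit-learn in place of GraphLab's nearest-neighbors toolkit (our substitution, not the toolkit used in the experiments), and it assumes a document-by-topic matrix, lda_topic_distributions, built from averaged predictions as described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def top_neighbors(features, names, query_name, k=10):
    # Brute-force cosine nearest neighbors for one query document.
    nn = NearestNeighbors(n_neighbors=k, metric='cosine').fit(features)
    query = features[names.index(query_name)].reshape(1, -1)
    dist, idx = nn.kneighbors(query)
    return [(names[i], round(d, 4)) for i, d in zip(idx[0], dist[0])]

names = list(wiki['name'])                                  # assumes the earlier SFrame
tfidf = TfidfVectorizer().fit_transform(list(wiki['text'])).toarray()

print(top_neighbors(tfidf, names, 'Carol Burnett'))                    # tf-idf analogue of Table 6
print(top_neighbors(lda_topic_distributions, names, 'Carol Burnett'))  # LDA analogue of Table 7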



For this example we selected Carol Burnett, an American entertainer.

query_label      reference_label      distance         rank
Carol Burnett    Carol Burnett        0.0              1
Carol Burnett    Mark Burnett         0.623999966172   2
Carol Burnett    Howard J. Burnett    0.660061952991   3
Carol Burnett    Karl Burnett         0.693028719505   4
Carol Burnett    David Burnett        0.70865770622    5
Carol Burnett    Richard Burnett      0.722107555831   6
Carol Burnett    Bernadette Peters    0.73906662441    7
Carol Burnett    Vicki Lawrence       0.748757358344   8
Carol Burnett    Emile Kelman         0.796987862482   9
Carol Burnett    Martha Williamson    0.805813158674   10

Table 6. Nearest neighbors under the tf-idf representation

query_label      reference_label           distance           rank
Carol Burnett    Carol Burnett             0.0                1
Carol Burnett    Jared Gold (organist)     0.00166964501536   2
Carol Burnett    Bruce McDaniel            0.00171341328519   3
Carol Burnett    Kris Myers                0.00190371715833   4
Carol Burnett    Gary Mule Deer            0.00190689799409   5
Carol Burnett    Mike Merritt (musician)   0.00196187642603   6
Carol Burnett    Derek O'Brien (drummer)   0.00196362654254   7
Carol Burnett    Elliot Jacobson           0.00207390661797   8
Carol Burnett    Fuzzbee Morse             0.00208494278556   9
Carol Burnett    Lynne Marie Stewart       0.00218703665636   10

Table 7. Nearest neighbors under the LDA representation

Observe that there is only marginal commonality between the two sets of top 10 nearest neighbors. This reveals that the two models are selecting different features of the documents. With tf-idf, documents are distinguished by the frequency of uncommon words. Since similarity is defined by the specific words used in a document, documents that are "close" under tf-idf tend to be similar in terms of specific details. This is what we see in the example, where the top 10 nearest neighbors are mostly entertainers. Conversely, the LDA representation measures similarity between documents in terms of their topic distributions. Thereby, documents can be "close" if




they share similar themes, even though they may not share many of the same key words. For the article on Carol Burnett, we expect the most important topics to be 'Music, TV, and film.' As a result, we see that the top 10 nearest neighbors represent a variety of fields, including literature, anthropology, and religious studies.

IX. UNDERSTANDING THE ROLE OF LDA HYPER-PARAMETERS

Finally, we consider the effect of the LDA model hyper-parameters alpha and gamma on the characteristics of our fitted model. Recall that alpha is a parameter of the prior distribution over topic weights in each document, while gamma is a parameter of the prior distribution over word weights in each topic. Alpha and gamma can be viewed as smoothing parameters when computing how much each document "represents" a topic (in the case of alpha) or how much each topic "represents" a word (in the case of gamma). In both cases, these parameters serve to reduce the differences across topics or words in terms of these calculated preferences. Concretely, alpha makes the document preferences "smoother" over topics, and gamma makes the topic preferences "smoother" over words.

The objective of this section is to investigate how changing these parameter values affects the characteristics of the resulting topic model. We commence by loading topic models that have been trained using different settings of alpha and gamma. Specifically, we compare the following two models to a baseline topic model trained with alpha = 5.0 and gamma = 0.2:

1. alpha_low: a topic model trained with alpha = 2 and gamma = 0.2
2. alpha_high: a topic model trained with alpha = 45 and gamma = 0.2

A. Low and High Alpha

Since alpha acts to smooth document preferences over topics, the impact of changing its value should be visible when we plot the distribution of topic weights for the same document under models fit with different alpha values. Below, we plot the (sorted) topic weights for the Wikipedia article on Mitt Romney under models fit with high, baseline, and low settings of alpha.
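A sketch of how such a plot might be produced is shown below. The three fitted models and the helper that returns a document's topic-proportion vector are assumed to exist already (the names are illustrative, and matplotlib is assumed).

import numpy as np
import matplotlib.pyplot as plt

# Assumed: topic_proportions(model, doc) returns the document's topic-weight
# vector under a fitted model, e.g. an averaged predict() as sketched earlier.
models = {'alpha = 45 (high)': model_alpha_high,
          'alpha = 5.0 (baseline)': model_baseline,
          'alpha = 2 (low)': model_alpha_low}

plt.figure()
for label, model in models.items():
    weights = np.sort(topic_proportions(model, romney_doc))[::-1]  # sorted, descending
    plt.plot(weights, marker='o', label=label)
plt.xlabel('topic (sorted by weight)')
plt.ylabel('topic weight for the Mitt Romney article')
plt.legend()
plt.show()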




The smoothing enforced by the alpha parameter is evident: when alpha is low, most of the weight in the topic distribution for this article goes to a single topic, but when alpha is high the weight is distributed more evenly across the topics.

B. Low and High Gamma

Just as we were able to see the effect of alpha by plotting topic weights for a document, we expect to visualize the effect of changing gamma by plotting word weights for each topic. Here, however, there are far too many words in our vocabulary to do this with clarity. Instead, we plot the total weight of the top 100 words and the bottom 1000 words for each topic. Below, we plot the sorted total weights of the top 100 words and the bottom 1000 words from each topic in the high, baseline, and low gamma models. We examine the following two models compared to the baseline:

1. gamma_low: trained with gamma = 0.02 and alpha = 5.0
2. gamma_high: trained with gamma = 0.45 and alpha = 5.0
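The quantities being plotted can be computed directly from each model's topic-word weight matrix. The sketch below reuses the topic_word matrix convention from the earlier plotting sketch; the function name and the model matrices are our own, illustrative names.

import numpy as np

def top_bottom_word_mass(topic_word, top_n=100, bottom_n=1000):
    # For each topic: total weight of its top-n words and of its bottom-n words.
    sorted_w = np.sort(topic_word, axis=1)[:, ::-1]   # descending within each topic
    top_mass = sorted_w[:, :top_n].sum(axis=1)
    bottom_mass = sorted_w[:, -bottom_n:].sum(axis=1)
    return np.sort(top_mass)[::-1], np.sort(bottom_mass)[::-1]

# Example, assuming weight matrices for the three gamma settings:
# for name, tw in [('low gamma', tw_gamma_low), ('baseline', tw_baseline),
#                  ('high gamma', tw_gamma_high)]:
#     top, bottom = top_bottom_word_mass(tw)
#     print(name, top[:3], bottom[:3])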








Inspecting the two plots, we see that the low gamma model assigns higher weight to the top words and lower weight to the bottom words of each topic. Conversely, the high gamma model assigns relatively less weight to the top words and greater weight to the bottom words. Accordingly,


increasing gamma yields topics that have a smoother distribution of weight across all the words in the vocabulary.

In sum, we have gained some insight into how the hyper-parameters alpha and gamma influence the characteristics of our LDA topic model, but we have not suggested what settings of alpha or gamma are optimal. Clearly these parameters influence the smoothness of the topic distributions for documents and the word distributions for topics. Yet there is no straightforward conversion between the smoothness of these distributions and the quality of the topic model, and there is currently no known optimal choice for alpha and gamma. Thus, finding a good topic model entails exploring the output by examining topic predictions for documents and testing the impact of hyper-parameter settings, as suggested here.

X. SUMMARY AND CONCLUSION

We set out to explore experimentally the following LDA features:

1. Finding the top words in each latent topic and using these to identify topic themes.
2. Predicting topic distributions for example documents. This is fundamental to assessing the topical structure of documents of interest.
3. Comparing the quality of LDA to conventional information retrieval methods.
4. Investigating the role of the model hyper-parameters alpha and gamma.

It was shown how LDA groups words into groupings that conform to topics, and how topic labels can then be assigned to these groupings. It was found for the experimental corpus that the top 10-15 words in each topic were assigned a much greater weight than the remaining large number of words. It was also found that the top 10 words accounted for only a small proportion of their topic's total probability mass. This suggests that while these words can be used to identify themes, or topic names, for each topic, the analyst needs to be aware that the topics may be more complex than a brief 10-word summary.

LDA allows for mixed membership, meaning that each document can partially belong to multiple different topics. For each document, topic membership is articulated as a vector of weights that sum to one. We investigated this feature in our fitted model by examining the topic distributions for a few Wikipedia articles from our corpus. The articles placed their highest weights on the topics whose themes are most relevant to the subject of the article. Topic distributions for documents were obtained using a collapsed Gibbs sampler. To get a document-specific topic proportion vector after the fact, the vector was drawn from the conditional distribution given the sampled word assignments in




the document. Specific examples were provided for searches on Colin Powell and Derek Jacobi.

We compared document retrieval under a nearest-neighbors tf-idf representation to retrieval under LDA. There was an empty intersection between the two sets of top 10 nearest neighbors, indicating that the two models are selecting different features of the documents. With tf-idf, documents are distinguished by the frequency of uncommon words. Since similarity is defined by the specific words used in a document, documents that are "close" under tf-idf tend to be similar in terms of specific details. This is what we saw in the example, where the top 10 nearest neighbors are mostly entertainers. Conversely, the LDA representation measures similarity between documents in terms of their topic distributions, so documents can be "close" if they share similar themes, even though they may not share many of the same key words. For the article on Carol Burnett, we expect the most important topics to be 'Music, TV, and film.' As a result, we see that the top 10 nearest neighbors represent a wide variety of fields, including literature, anthropology, and religious studies.

Finally, we experimentally evaluated the sensitivity of the LDA model to changes in the hyper-parameters alpha and gamma. Alpha and gamma act as smoothing parameters when computing how much each document "represents" a topic (in the case of alpha) or how much each topic "represents" a word (in the case of gamma). In both cases, these parameters serve to reduce the differences across topics or words in terms of these calculated preferences. When alpha is low, most of the weight in the topic distribution for a given article goes to a single topic, but when alpha is high the weight is distributed more evenly across the topics. In like manner, the low gamma model assigns higher weight to the top words and lower weight to the bottom words of each topic, while the high gamma model assigns relatively less weight to the top words and greater weight to the bottom words. Accordingly, increasing gamma yields topics that have a smoother distribution of weight across all the words in the vocabulary.

LDA has important limitations, several of which have been addressed by the developments outlined in Section II. The fact that these developments have LDA at their core suggests that LDA is central to topic modeling and has had, and will continue to have, a significant influence on the field.






References

[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
[2] D. Blei. Introduction to probabilistic topic models. Communications of the ACM, 2011.
[3] B. A. Frigyik, A. Kapila, and M. R. Gupta. Introduction to the Dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington, UWEETR-2010-0006, 2010.
[4] T. Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence (UAI), 1999.
[5] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems 23 (NIPS 2010).
[6] C. Reed. Latent Dirichlet Allocation: Towards a Deeper Understanding. University of Iowa, 2012.
[7] X. Wang and E. Grimson. Spatial latent Dirichlet allocation. Advances in Neural Information Processing Systems 20 (NIPS 2007).




