Bayesian LDA for mixed-membership clustering analysis: The Rlda package

Pedro Albuquerque (University of Brasília)
Denis Ribeiro do Valle (University of Florida)
Daijiang Li (University of Florida)
Abstract

The goal of this paper is to present the Rlda package for mixed-membership clustering analysis based on the Latent Dirichlet Allocation model adapted to different types of data (i.e., Multinomial, Bernoulli, and Binomial entries). We present the corresponding statistical models and examples of the use of this package in R. Because these types of data frequently emerge in fields as disparate as ecology, remote sensing, marketing, and finance, we believe this package will be of broad interest to users interested in unsupervised pattern recognition, particularly mixed-membership clustering analysis for categorical data.
Keywords: Latent Dirichlet Allocation, mixed-membership clustering, categorical data, pattern recognition.
1. Introduction. The Latent Dirichlet Allocation model (LDA), first proposed by Blei, Ng, and Jordan (2003), has been extensively used for text mining in multiple fields. Tsai (2011) used LDA to construct clusters of tags that represent the most common topics in blogs. Lee, Baker, Song, and Wetherbe (2010) compared LDA against three other frequently used text mining methods: latent semantic analysis, probabilistic latent semantic analysis, and the correlated topic model. The major limitation of LDA, as identified by these authors, was that the method does not consider the relationship between topics (Erosheva and Fienberg 2005). Despite this limitation, however, LDA continues to be used in multiple disciplines. For instance, Griffiths and Steyvers (2004) used LDA to identify the main scientific topics in a large corpus of Proceedings of the National Academy of Sciences articles. In conservation biology, LDA has been used to identify research gaps in the conservation literature (Westgate, Barton, Pierson, and Lindenmayer 2015). LDA has also been proposed as a promising method for the automatic annotation of remote sensing imagery (Lienou, Maître, and Datcu 2010). In marketing, LDA has been used to extract information from product reviews across 15 firms in five markets over four years, enabling the identification of the most important latent dimensions of consumer decision making in each studied market (Tirunillai and Tellis 2014). Finally, in finance, a stock market analysis system based on LDA was used to combine financial news items with stock market data to identify and characterize major events that impact the market. This system was then used to make predictions regarding stock market behavior based on news items identified by LDA (Mahajan, Dey, and Haque 2008). Despite its success in text mining across multiple fields, LDA is a model that need not be
restricted to text mining. More specifically, LDA can be viewed as a mixed-membership model, since each element in the sample can belong to more than one cluster (or state) simultaneously. It is important here to clarify the distinction between mixture and mixed-membership models. In contrast to finite mixture models (McLachlan and Peel 2004), mixed-membership models assume that observational units may only partly belong to any given cluster (also known, depending on the field, as topics, extreme profiles, pure or ideal types, states, subpopulations, or groups). In this case, the degree of membership is a vector of continuous non-negative latent variables that add up to one, whereas in mixture models the membership is a binary indicator (Airoldi, Blei, Erosheva, and Fienberg 2014). There are a few examples of LDA being used for purposes other than text mining. For instance, a modified version of LDA has been extensively used on genetic data to identify populations and admixture probabilities of individuals (Pritchard, Stephens, and Donnelly 2000). Similarly, LDA has been used in ecology to identify plant communities in temperate and tropical forests (Valle, Baiser, Woodall, and Chazdon 2014). The aim of this paper is to present the Rlda package for mixed-membership clustering analysis, describing the underlying novel Bayesian models and illustrating their use in a diverse set of examples. The innovative features of this model are twofold. First, it generalizes LDA to other types of commonly encountered categorical data (i.e., Bernoulli and Binomial). Second, it enables the selection of the optimal number of clusters based on a truncated stick-breaking prior approach which regularizes model results. This paper is organized as follows. Sections 2 and 3 describe the mathematical formulation of the Bayesian LDA mixed-membership cluster model. Section 4 justifies our Rlda package by reviewing currently available R packages and their limitations.
Sections 5 and 6 present examples of the use of the package and the conclusions, respectively.
2. Methods. In the Bayesian LDA mixed-membership cluster model we postulate that each element is allocated to a single cluster, represented by a latent state variable. Specifically, consider a latent matrix Z with dimension L × C where each row represents a sampling unit (l = 1, . . . , L) and each column a possible state or cluster (c = 1, . . . , C). The data generating process postulated for this latent matrix is given by:

$$Z_{l\cdot} \sim \mathrm{Multinomial}(n_l, \boldsymbol{\theta}_l) \quad (1)$$
where n_l is the total number of elements drawn for location l and θ_l = (θ_l1, . . . , θ_lC) is a vector of parameters representing the probability of allocation to each cluster. Following Occam's razor, we intend to create as few clusters as possible, which is achieved by assuming a truncated stick-breaking prior:
$$\theta_{lc} = V_{lc} \prod_{c^*=1}^{c-1} (1 - V_{lc^*}) \quad (2)$$
where Vlc ∼ Beta(1, γ) for c = 1, . . . , C − 1 and VlC = 1 by definition. This truncated stick-breaking prior will force the elements to be aggregated in the minimum number of
clusters, given that θ_lc is stochastically exponentially decreasing in c. Another benefit of this truncated stick-breaking prior is that it avoids the label switching problem because it is an asymmetric prior. The label switching problem is common to many standard mixture and mixed-membership models and arises because the label of each group is an unidentifiable parameter in these models (Stephens 2000). This problem can lead to convergence issues and can make the summary of modeling results very challenging. In the second hierarchical level, we consider a matrix Y with dimension L × S where each row represents a sampling unit (e.g., locations, firms, individuals, plots) and each column a variable that describes these elements. In the Bayesian LDA model for mixed-membership clustering, after integrating over the latent vector Z_l·, the data can follow one of these distributions:

$$Y_{l\cdot} \sim \mathrm{Multinomial}(n_l, \boldsymbol{\theta}_l^{\top}\Phi), \qquad Y_{ls} \sim \mathrm{Bernoulli}(\boldsymbol{\theta}_l^{\top}\boldsymbol{\phi}_s), \qquad Y_{ls} \sim \mathrm{Binomial}(n_{ls}, \boldsymbol{\theta}_l^{\top}\boldsymbol{\phi}_s) \quad (3)$$
for l = 1, . . . , L and s = 1, . . . , S possible variables. Yls represents a random variable, Y l· is a vector with these random variables for location l, nl is the total number of elements in sampling unit l, nls is the total number of elements in sampling unit l and variable s. In these models, φs = (φ1s , . . . , φCs ) is a vector of parameters, while Φ is a C × S matrix of parameters, given by:
$$\Phi = \begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1S} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{C1} & \phi_{C2} & \cdots & \phi_{CS} \end{pmatrix}$$
In the last step, we specify the priors for φ_cs. For the multinomial model, we adopt a Dirichlet prior (i.e., φ_c ∼ Dirichlet(β), where β = (β_1, . . . , β_S) is the hyperparameter vector). For the Bernoulli and Binomial representations, we assume that φ_cs comes from a Beta distribution (i.e., φ_cs ∼ Beta(α_0, α_1)). These models are fit using Gibbs sampling, where parameters are iteratively drawn from their full conditional distributions. From a conceptual perspective, all of these models assume the following matrix decomposition:

$$E[Y_{L\times S}] = K \circ [\Theta_{L\times C} \Phi_{C\times S}] \quad (4)$$

where K is a matrix of constants and ◦ is the Hadamard product. Sparseness is ensured by forcing the entries θ_lc for large c to be close to zero. For the multinomial model, the K matrix contains the total number of elements in each row, whereas for the Bernoulli model this matrix is a matrix of ones (the identity element of the Hadamard product). Finally, for the Binomial model, the K matrix contains the total number of trials of each binomial distribution (i.e., n_ls). Although there are many ways matrices can be decomposed, the key characteristic of the form of matrix decomposition we chose is that each row of Θ is comprised of probabilities that sum to one. As a result, one can interpret Φ (C × S) as the matrix that contains the "pure" features of the data, which are then mixed by the matrix Θ (L × C) and multiplied by K to generate the expected data.
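To make these two building blocks concrete, the following sketch in plain R (illustrative only, not Rlda internals; the dimensions and γ are made up) draws each row of Θ from the truncated stick-breaking prior of Equation 2 and then forms the expected data of Equation 4 for the binomial case:

```r
# Draw theta_l from the truncated stick-breaking prior (Equation 2):
# V_lc ~ Beta(1, gamma) for c < C, V_lC = 1, theta_lc = V_lc * prod(1 - V_lc*).
stick_breaking <- function(C, gamma) {
  V <- c(rbeta(C - 1, 1, gamma), 1)
  theta <- numeric(C)
  remaining <- 1
  for (c in seq_len(C)) {
    theta[c] <- V[c] * remaining
    remaining <- remaining * (1 - V[c])
  }
  theta
}

set.seed(1)
L <- 5; C <- 4; S <- 3
Theta <- t(sapply(rep(C, L), stick_breaking, gamma = 0.5))
Phi   <- matrix(runif(C * S), C, S)   # cluster-specific success probabilities
K     <- matrix(20, L, S)             # binomial trials n_ls for each cell
EY    <- K * (Theta %*% Phi)          # E[Y] = K o (Theta Phi), Equation 4
rowSums(Theta)                        # each row of Theta sums to one
```

With a small γ, mass concentrates on the first few sticks, which is exactly the regularization toward few clusters described above.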
3. Full conditional distributions. The LDA models we created can be fitted to data using a Gibbs sampler in which either (a) all latent variables are explicitly represented, enabling closed-form full conditional distributions (FCDs), or (b) these variables have been integrated out and we rely on the Metropolis-Hastings algorithm. In this section, we provide the closed-form FCDs for alternative (a).
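Alternative (a) amounts to the usual systematic-scan Gibbs recipe; schematically (placeholder update functions, not the package's C++ implementation):

```r
# Schematic Gibbs sampler for alternative (a): each iteration sweeps through
# the closed-form full conditionals for W, V (hence Theta), and phi.
gibbs_sweep <- function(n_iter, state, draw_w, draw_v, draw_phi) {
  for (it in seq_len(n_iter)) {
    state$w   <- draw_w(state)    # cluster assignments from their categorical FCD
    state$v   <- draw_v(state)    # stick-breaking weights from their Beta FCD
    state$phi <- draw_phi(state)  # cluster profiles from their Beta/Dirichlet FCD
  }
  state
}
```

Each draw_* argument stands in for sampling from one of the full conditionals derived below.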
3.1. Bernoulli model. The probability of community membership status for each element W_ls, where $Z_{lc} = \sum_{s=1}^{S} 1(W_{ls} = c)$, is given by:

$$p(W_{ls} = c^* | \ldots) = \frac{\theta_{lc^*}\, \phi_{c^*s}^{y_{ls}} (1 - \phi_{c^*s})^{1 - y_{ls}}}{\sum_{c=1}^{C} \theta_{lc}\, \phi_{cs}^{y_{ls}} (1 - \phi_{cs})^{1 - y_{ls}}} \quad (5)$$
Therefore, W_ls can be drawn from a categorical distribution. The FCD for V_lc is given by:

$$p(V_{lc} | \ldots) = \mathrm{Beta}(z_{lc} + 1,\; z_{l(c^*>c)} + \gamma)$$

where z_lc is the total number of elements in location l classified into cluster c, and z_l(c*>c) is the total number of elements in location l classified in clusters larger than c. This latter quantity is given by $z_{l(c^*>c)} = \sum_{s=1}^{S} \sum_{c^*=c+1}^{C} 1(w_{ls} = c^*)$. Finally, the FCD for φ_cs is given by:

$$p(\phi_{cs} | \ldots) = \mathrm{Beta}(q_{cs}^{(1)} + \alpha_0,\; q_{cs}^{(0)} + \alpha_1)$$

where $q_{cs}^{(j)}$ is the number of elements assigned to group c and for which y_ls = j (i.e., $q_{cs}^{(j)} = \sum_{l=1}^{L} 1(w_{ls} = c, y_{ls} = j)$).
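As a sketch of these two Bernoulli updates in plain R (illustrative helper names, not Rlda internals), given current assignments w and binary data y, both stored as L × S matrices:

```r
# Draw W_ls from its categorical FCD (Equation 5) and phi_cs from its Beta FCD.
draw_w_ls <- function(theta_l, phi_s, y_ls) {
  p <- theta_l * phi_s^y_ls * (1 - phi_s)^(1 - y_ls)  # numerator of Equation 5
  sample(length(theta_l), 1, prob = p / sum(p))        # normalization is implicit
}
draw_phi_cs <- function(y, w, c, s, alpha0, alpha1) {
  q1 <- sum(w[, s] == c & y[, s] == 1)                 # q_cs^(1)
  q0 <- sum(w[, s] == c & y[, s] == 0)                 # q_cs^(0)
  rbeta(1, q1 + alpha0, q0 + alpha1)
}
```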
3.2. Binomial model. For this model, we have n_ls elements for each sampling unit l and variable s. The community membership status of the i-th element is denoted by W_ils, where $Z_{lc} = \sum_{s=1}^{S} \sum_{i=1}^{n_{ls}} 1(W_{ils} = c)$, and its probability is similar to the one for the Bernoulli model:

$$p(W_{ils} = c^* | \ldots) = \frac{\theta_{lc^*}\, \phi_{c^*s}^{x_{ils}} (1 - \phi_{c^*s})^{1 - x_{ils}}}{\sum_{c=1}^{C} \theta_{lc}\, \phi_{cs}^{x_{ils}} (1 - \phi_{cs})^{1 - x_{ils}}} \quad (6)$$

where x_ils are binary random variables such that $\sum_{i=1}^{n_{ls}} x_{ils} = y_{ls}$. Therefore, W_ils can be drawn from a categorical distribution. The FCD for φ_cs is given by:

$$p(\phi_{cs} | \ldots) = \mathrm{Beta}(q_{cs}^{(1)} + \alpha_0,\; q_{cs}^{(0)} + \alpha_1) \quad (7)$$

where, similar to the Bernoulli model, $q_{cs}^{(j)} = \sum_{l=1}^{L} \sum_{i=1}^{n_{ls}} 1(x_{ils} = j, w_{ils} = c)$.
Finally, the FCD for V_lc is given by:

$$p(V_{lc} | \ldots) = \mathrm{Beta}(z_{lc} + 1,\; z_{l(c^*>c)} + \gamma) \quad (8)$$

where z_lc is the total number of elements in location l classified into cluster c and z_l(c*>c) is the total number of elements in location l classified in clusters larger than c. This latter quantity is given by $z_{l(c^*>c)} = \sum_{s=1}^{S} \sum_{i=1}^{n_{ls}} \sum_{c^*=c+1}^{C} 1(w_{ils} = c^*)$.
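The V_lc update has the same Beta form in all three models; in R it reduces to a single rbeta call (illustrative helper; here w_l holds the current cluster labels of all elements in location l):

```r
# Beta FCD for the stick-breaking variable V_lc:
# z_lc = elements of location l in cluster c, z_gt = elements in clusters > c.
draw_v_lc <- function(w_l, c, gamma) {
  z_lc <- sum(w_l == c)
  z_gt <- sum(w_l > c)
  rbeta(1, z_lc + 1, z_gt + gamma)
}
```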
3.3. Multinomial model. For the Multinomial case, if unit i in location l is associated with variable s (i.e., x_il = s such that $y_{ls} = \sum_{i=1}^{n_l} 1(x_{il} = s)$), we have that:

$$p(W_{il} = c^* | \ldots) = \frac{\theta_{lc^*}\, \phi_{c^*s}}{\theta_{l1}\phi_{1s} + \cdots + \theta_{lC}\phi_{Cs}} \quad (9)$$

In this equation, W_il is the group assignment of element i in location l, such that $Z_{lc} = \sum_{i=1}^{n_l} 1(W_{il} = c)$, and it can be sampled from a categorical distribution. Since we assumed a conjugate prior for φ_c with c ∈ {1, . . . , C}, the full conditional distribution for this vector of parameters is a straightforward Dirichlet distribution:

$$p(\boldsymbol{\phi}_c | \ldots) = \mathrm{Dirichlet}([q_{c1} + \beta_1, \ldots, q_{cS} + \beta_S]) \quad (10)$$

where $q_{cs} = \sum_{l=1}^{L} \sum_{i=1}^{n_l} 1(w_{il} = c, x_{il} = s)$.

Finally, the FCD for V_lc is given by:

$$p(V_{lc} | \ldots) = \mathrm{Beta}(z_{lc} + 1,\; z_{l(c^*>c)} + \gamma) \quad (11)$$

where z_l(c*>c) is the total number of elements in observation l classified in clusters larger than c. This quantity is given by $z_{l(c^*>c)} = \sum_{i=1}^{n_l} \sum_{c^*=c+1}^{C} 1(w_{il} = c^*)$.
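Equation 9 is simply a normalized elementwise product, so the multinomial assignment step is one line of R (illustrative names, not Rlda internals):

```r
# Sample W_il for the multinomial model: probability proportional to
# theta_lc * phi_cs (Equation 9), normalized over clusters c = 1, ..., C.
draw_w_il <- function(theta_l, Phi, s) {
  p <- theta_l * Phi[, s]
  sample(length(theta_l), 1, prob = p / sum(p))
}
```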
4. The Rlda package. We found several packages that can fit the Latent Dirichlet Allocation model. Hornik and Grün (2011) developed the topicmodels package, which contains two LDA implementations: one uses variational inference (as described in Blei et al. (2003)) and the other uses Gibbs sampling based on Phan and Nguyen (2013). Similarly, Jones (2016) created the textmineR package, which relies on Gibbs sampling to estimate the topics in a corpus structure. Chang (2012) developed the lda package, which includes the mixed-membership stochastic blockmodel (Airoldi, Blei, Fienberg, and Xing 2008), supervised Latent Dirichlet Allocation (sLDA; Mcauliffe and Blei 2008), and correspondence Latent Dirichlet Allocation (corrLDA; Blei and Jordan 2003). More recently, Roberts, Stewart, and Tingley (2014) created the stm package, which has some unsupervised functions to determine the optimal number of clusters. This unsupervised method relies on the EM algorithm and uses a backward model selection approach to determine the best number of groups. Differently from these approaches, FUNLDA (Backenroth and Ionita-Laza 2017), sparklyr (Luraschi, Ushey, Allaire, RStudio, and Foundation 2017), and PivotalR (Iyer 2017) present LDA implementations based on the machine learning paradigm. Finally,
the last packages we found were LDAvis (Sievert and Shirley 2015) and ldatuning (Nikita 2016). LDAvis is a graphical tool that creates an interactive web-based visualization, which can help users interpret the topics that result from LDA, while ldatuning assists users in tuning the LDA features and helps with the choice of the number of clusters to be used. None of these packages adopt the truncated stick-breaking prior, which enables the selection of the optimal number of clusters and regularizes model results. Furthermore, the fact that our model is fit within the Bayesian paradigm can be useful when the user already expects certain "topics" or "groups" but is unsure about the other ones. This information can potentially be incorporated through the priors adopted for the analysis (Garthwaite, Kadane, and O'Hagan 2005). In addition, none of these packages use distributions other than the Multinomial for the dependent variable. Thus, we believe our Rlda package complements current LDA approaches already available in R. The package Rlda is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=Rlda.
4.1. Software implementation and package workflow. The Rlda package relies on Gibbs samplers. Because of the iterative nature of these algorithms, fitting these models can be time consuming, particularly if implemented directly in R. For this reason, most of our code has been implemented in C++, relying heavily on Rcpp (Eddelbuettel, François, Allaire, Chambers, Bates, and Ushey 2011) and RcppArmadillo (Eddelbuettel and Sanderson 2014). We also used the parallelization provided by parallel and doParallel (Analytics and Weston 2014) to create predictions when using the S3 method predict. The main input for Rlda consists of a data.frame with L observations (rows) and S variables (columns), sampled from a multinomial, Bernoulli, or binomial process, as described by the data generating process in Equation 4. Each cell of this data.frame should contain a non-negative integer giving the number of times that a certain event was observed (for multinomial and binomial data) or whether this event occurred or not (for Bernoulli data). The other inputs that must be provided are the prior's hyperparameters, which can be either elicited from experts or defined according to the user's belief. These hyperparameters must be positive quantities with dimensions that vary depending on the specific type of prior. Finally, another important input consists of the maximum number of clusters, defined by the user through the n_community argument. Once this is specified, our stick-breaking prior approach might be able to identify a smaller number of clusters, as illustrated in the analysis presented in subsection 5.1. The key functions of Rlda are rlda.multinomial, rlda.bernoulli, and rlda.binomial. These functions fit the Multinomial, Bernoulli, and Binomial models, respectively. The choice of which function to use depends on the assumptions the analyst is willing to make regarding the distribution of the data.
The output from these functions consists of an rlda S3 object which contains the log-likelihood and the matrices Θ (L × C) and Φ (C × S) for each iteration. Furthermore, this object also includes information regarding the model's input, such as the maximum number of clusters, the prior's hyperparameters, and summary information regarding the data (e.g., number of observations and variables). We created several auxiliary functions to help the user summarize and interpret the modeling results. For example, the plot function displays a trace-plot of the log-likelihood (useful to
judge model convergence), a boxplot of the estimated θ_lc for each cluster (useful to determine the optimal number of clusters), and a characterization of each cluster based on the estimated φ_cs. The summary function provides the posterior mean (the point estimate under the quadratic loss function), the posterior variance, and (1 − α) credible intervals of the Θ and Φ matrices. Alternatively, users can also rely on the accessor functions getTheta.rlda and getPhi.rlda to obtain the posterior mean of these matrices. Finally, we also created the functions generateBinomialLDA.rlda, generateBernoulliLDA.rlda, and generateMultinomialLDA.rlda to generate simulated data. These functions are useful because they enable the user to check whether the proposed algorithms are able to retrieve the true parameters used to generate the simulated data. Assessment of algorithm convergence for Bayesian models is often done through visual inspection (Gelman, Shirley et al. 2011). In our case, due to the large number of estimated parameters, we assess model convergence by examining the trace-plot of the log-likelihood using the plot method. An asymptotic pattern in this trace-plot suggests that the algorithm has converged. Other diagnostic tests can also be performed using specific packages such as coda (Plummer, Best, Cowles, and Vines 2006). To do this, the interested reader can use the cast method rlda2mcmc.rlda, which converts our rlda object into an mcmc object from the coda package. The output is a list with the two matrices (Θ and Φ) written in mcmc format, with each row representing a Gibbs iteration and each column a parameter from those matrices. In the next section we demonstrate this workflow based on examples for each implemented method in the Rlda package. We describe in greater detail the input needed for estimation, present the output from each method, and interpret the results.
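The visual asymptote check can also be approximated numerically; the heuristic below is our own illustration (not a package function) and simply compares the mean log-likelihood over the last two segments of the chain:

```r
# Rough convergence heuristic: the trace is treated as "flat" when the means
# of its last two segments differ by less than a small relative tolerance.
loglik_flat <- function(ll, frac = 0.25, tol = 0.01) {
  n <- length(ll)
  i2 <- seq(floor(n * (1 - frac)) + 1, n)                          # last segment
  i1 <- seq(floor(n * (1 - 2 * frac)) + 1, floor(n * (1 - frac)))  # previous one
  abs(mean(ll[i2]) - mean(ll[i1])) < tol * abs(mean(ll[i1]))
}
```

For instance, loglik_flat(rep(-500, 200)) is TRUE, while a steadily climbing trace such as seq(-900, -500, length.out = 200) is flagged as not yet converged.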
5. Examples. To illustrate how the models we created are likely to be useful across multiple disciplines, in this section we use the Multinomial, Binomial and Bernoulli models on Marketing, Remote Sensing and Ecology examples, respectively. In each subsection we start with some motivation about the specific problem and then show how the Rlda package can be used to analyze the corresponding dataset.
5.1. Marketing. It is well known that attracting a new customer is often considerably costlier than keeping current customers (Kotler and Armstrong 2006). For this reason, firms can better retain their customers if they pay careful attention to their consumers’ complaints and work to solve them in a satisfactory way. Our first application using the Rlda package considers the LDA for Multinomial entries applied to the field of Marketing. Specifically, we are interested in characterizing firms based on their consumers’ complaints. The data came from the 2015 Consumer Complaint Database and consist of complaints received by the United States Bureau of Consumer Financial Protection regarding financial products and services. In this example, we work only with credit card complaints. This dataset contains information on the number of complaints for each firm (L = 226), categorized according to the specific type of issue (S = 30). Examples of issues include billing disputes, identity theft / fraud, and unsolicited issuance of credit card. The characterization of firms provided by our analysis can be useful to reveal commonalities and differences across firms, which can then be used by managers to identify and potentially adopt the solutions that are
8
Rlda: Bayesian LDA for Mixed-Membership Clustering Analysis.
employed by similar firms to deal with these issues. To use the Rlda package for the Multinomial entry, it is first necessary to create a matrix where each cell represents the total number of cases observed for each sampling unit and type of complaint:

R> library("Rlda")
R> data("complaints")
R> library("reshape2")
R> mat1
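A base-R sketch of constructing such a firm-by-issue count matrix (the data and column names below are made up for illustration; the complaints dataset uses its own):

```r
# Build an L x S matrix of complaint counts: rows are firms (sampling units),
# columns are issue types (variables).
df <- data.frame(
  Company = c("A", "A", "A", "B", "B"),
  Issue   = c("Billing", "Billing", "Fraud", "Billing", "Fraud")
)
mat <- as.data.frame.matrix(table(df$Company, df$Issue))
mat
#   Billing Fraud
# A       2     1
# B       1     1
```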
R> set.seed(9292)
R> beta
R> library("coda")
R> listParms
R> library("rgl")
R> library("car")
R> scatter3d(x = Theta[ ,'Cluster 1'], y = Theta[ ,'Cluster 2'],
+    z = Theta[ ,'Cluster 3'], surface = F, xlab = 'Cluster 1',
+    ylab = 'Cluster 2', zlab = 'Cluster 3',
+    labels = rownames(Theta), id.n = 20)

The firms that best represent each of these clusters (i.e., top firms in each cluster with probability greater than 0.3) are given in Table 1.

Cluster  Firms
1        Portfolio Recovery Associates, Inc., Encore Capital Group, Sterling Jewelers Inc.
2        Comerica, PNC Bank N.A., Barclays PLC
3        Continental Finance Company, LLC, Citibank, Amex
4        Synchrony Financial, Regions Financial Corporation, TD Bank US Holding Company
6        Discover
7        U.S. Bancorp

Table 1: Representative firms for each cluster
5.2. Remote sensing. Because pixels in remote sensing imagery are often large enough to encompass different types of material within each pixel, there has been great interest in the development of methods
that enable researchers to estimate the proportion of the constituent materials (often called endmembers in the remote sensing literature). Indeed, numerous spectral unmixing algorithms have already been developed in the literature, with multiple approaches used for the dimension reduction, endmember determination, and inversion stages (Keshava 2003). The key characteristic of the method that we propose here is that it is an unsupervised method (i.e., it does not require a priori determination of endmembers) that enforces parsimony through our truncated stick-breaking prior. Furthermore, differently from many of the currently existing methods for spectral unmixing, our model explicitly acknowledges the discreteness of the digital numbers used in remote sensing systems and the range of possible values these numbers can take. In our example, we rely on Landsat TM 5 imagery from 2010 of the Iquitos-Nauta road in the Peruvian Amazon. This area has multiple land-use land-cover (LULC) types and is the site in which we have studied the effect of these different LULC types on the malaria vector Anopheles darlingi (Lima, Vittor, Rifai, and Valle 2017). The first step of our analysis consists of choosing a subset of the most heterogeneous pixels in the imagery for the estimation of the endmembers. Then, we identify the endmembers using this subset (matrix Landsat_sample) by supplying to the algorithm the maximum number of elements for the binomial distribution, given by the matrix max2, and the prior hyper-parameters a.phi and b.phi:

R> tmp
R> library("gplots")
R> seq1
R> theta
R> idw.output = as.data.frame(idw)
R> names(idw.output)[1:3]
R> res = ggplot() +
+    geom_tile(data = idw.output, alpha = 0.8,
+      aes(x = long, y = lat, fill = var1.pred)) +
+    scale_fill_gradient(low = "cyan", high = "red", limits = c(0, 1)) +
+    guides(fill = FALSE) +
+    ggtitle(paste0("Cluster ", j)) +
+    theme(axis.title.x = element_blank(),
+      axis.title.y = element_blank(),
+      plot.title = element_text(hjust = 0.5)) +
Pedro Albuquerque, Denis Valle, Daijiang Li
15
+    geom_polygon(data = all_states,
+      aes(x = long, y = lat, group = group),
+      colour = "white", fill = "transparent")
R> }

We can see in Figure 4 that, even without any explicit spatial information, the algorithm was able to identify strong spatial patterns for the different bird communities/clusters. Furthermore, the spatial distribution of these breeding bird communities agrees closely with the biogeographical patterns expected by ecologists.
[Figure 4: ten map panels, Cluster 1 through Cluster 10, with longitude from −180 to −90 and latitude from 30 to 70.]
Figure 4: Spatial distribution of breeding bird communities/clusters. The cyan to red color gradient corresponds to 0 to 1 probabilities.
6. Conclusion. The goal of this paper was to introduce the Rlda package, describing the Bayesian LDA model for mixed membership clustering based on different types of discrete data. In particular, we have demonstrated how to use the model for Multinomial entry in the Marketing example, Binomial trial using the Landsat dataset, and Bernoulli trial using ecological data. The models described here are likely to be broadly useful across multiple scientific fields for unsupervised learning based on categorical data.
Appendix. In this Appendix we provide the derivation for the Full Conditional Distributions associated with each model.
Bernoulli model. For W_ls:

$$p(W_{ls} = c^*|\ldots) = k \times \mathrm{Cat}(W_{ls} = c^*|\boldsymbol{\theta}_l) \times \mathrm{Bernoulli}(y_{ls}|\phi_{c^*s}) = k \times \theta_{lc^*}\, \phi_{c^*s}^{y_{ls}} (1-\phi_{c^*s})^{1-y_{ls}}$$

Since W_ls is a categorical random variable with support in {1, 2, . . . , C}, the sum of the probabilities for all elements must equal one. Therefore, the constant k is given by:

$$k = \frac{1}{\sum_{c=1}^{C} \theta_{lc}\, \phi_{cs}^{y_{ls}} (1-\phi_{cs})^{1-y_{ls}}}$$

As a result, W_ls can be sampled from a categorical distribution. For V_lc:

$$p(V_{lc}|\ldots) \propto \mathrm{Binomial}(z_{lc}|z_{lc} + z_{l(c^*>c)}, V_{lc}) \times \mathrm{Beta}(V_{lc}|1, \gamma)$$
$$\propto V_{lc}^{z_{lc}} (1-V_{lc})^{z_{l(c^*>c)}} \times (1-V_{lc})^{\gamma-1}$$
$$\propto V_{lc}^{(z_{lc}+1)-1} (1-V_{lc})^{(z_{l(c^*>c)}+\gamma)-1}$$
$$p(V_{lc}|\ldots) = \mathrm{Beta}(z_{lc}+1,\; z_{l(c^*>c)}+\gamma)$$

For φ_cs:

$$p(\phi_{cs}|\ldots) \propto \left[\prod_{l=1}^{L} \mathrm{Bernoulli}(y_{ls}|\phi_{cs})^{1(w_{ls}=c)}\right] \times \mathrm{Beta}(\phi_{cs}|\alpha_0, \alpha_1)$$
$$\propto \left[\prod_{l=1}^{L} \phi_{cs}^{1(w_{ls}=c,\, y_{ls}=1)} (1-\phi_{cs})^{1(w_{ls}=c,\, y_{ls}=0)}\right] \times \phi_{cs}^{\alpha_0-1} (1-\phi_{cs})^{\alpha_1-1}$$
$$\propto \phi_{cs}^{\sum_{l=1}^{L} 1(w_{ls}=c,\, y_{ls}=1)+\alpha_0-1}\, (1-\phi_{cs})^{\sum_{l=1}^{L} 1(w_{ls}=c,\, y_{ls}=0)+\alpha_1-1}$$
$$p(\phi_{cs}|\ldots) = \mathrm{Beta}(q_{cs}^{(1)}+\alpha_0,\; q_{cs}^{(0)}+\alpha_1)$$

where $q_{cs}^{(j)} = \sum_{l=1}^{L} 1(w_{ls} = c, y_{ls} = j)$.
Binomial model. For W_ils:

$$p(W_{ils} = c^*|\ldots) = k \times \mathrm{Cat}(W_{ils} = c^*|\boldsymbol{\theta}_l) \times \mathrm{Bernoulli}(x_{ils}|\phi_{c^*s}) = k \times \theta_{lc^*}\, \phi_{c^*s}^{x_{ils}} (1-\phi_{c^*s})^{1-x_{ils}}$$

Since W_ils is a categorical random variable with support in {1, 2, . . . , C}, the sum of the probabilities for all elements must equal one. Therefore, the constant k is given by:

$$k = \frac{1}{\sum_{c=1}^{C} \theta_{lc}\, \phi_{cs}^{x_{ils}} (1-\phi_{cs})^{1-x_{ils}}}$$

As a result, W_ils can be sampled from a categorical distribution. For V_lc:

$$p(V_{lc}|\ldots) \propto \mathrm{Binomial}(z_{lc}|z_{lc} + z_{l(c^*>c)}, V_{lc}) \times \mathrm{Beta}(V_{lc}|1, \gamma)$$
$$\propto V_{lc}^{z_{lc}} (1-V_{lc})^{z_{l(c^*>c)}} \times (1-V_{lc})^{\gamma-1}$$
$$\propto V_{lc}^{(z_{lc}+1)-1} (1-V_{lc})^{(z_{l(c^*>c)}+\gamma)-1}$$
$$p(V_{lc}|\ldots) = \mathrm{Beta}(z_{lc}+1,\; z_{l(c^*>c)}+\gamma)$$

For φ_cs:

$$p(\phi_{cs}|\ldots) \propto \left[\prod_{l=1}^{L} \prod_{i=1}^{n_{ls}} \mathrm{Bernoulli}(x_{ils}|\phi_{cs})^{1(w_{ils}=c)}\right] \times \mathrm{Beta}(\phi_{cs}|\alpha_0, \alpha_1)$$
$$\propto \left[\prod_{l=1}^{L} \prod_{i=1}^{n_{ls}} \phi_{cs}^{1(w_{ils}=c,\, x_{ils}=1)} (1-\phi_{cs})^{1(w_{ils}=c,\, x_{ils}=0)}\right] \times \phi_{cs}^{\alpha_0-1} (1-\phi_{cs})^{\alpha_1-1}$$
$$\propto \phi_{cs}^{\sum_{l=1}^{L}\sum_{i=1}^{n_{ls}} 1(w_{ils}=c,\, x_{ils}=1)+\alpha_0-1}\, (1-\phi_{cs})^{\sum_{l=1}^{L}\sum_{i=1}^{n_{ls}} 1(w_{ils}=c,\, x_{ils}=0)+\alpha_1-1}$$
$$p(\phi_{cs}|\ldots) = \mathrm{Beta}(q_{cs}^{(1)}+\alpha_0,\; q_{cs}^{(0)}+\alpha_1)$$

where $q_{cs}^{(j)} = \sum_{l=1}^{L} \sum_{i=1}^{n_{ls}} 1(w_{ils} = c, x_{ils} = j)$.
Multinomial model. For W_il:

$$p(W_{il} = c^*|\ldots) = k \times \mathrm{Cat}(W_{il} = c^*|\boldsymbol{\theta}_l) \times \mathrm{Cat}(x_{il} = s|\boldsymbol{\phi}_{c^*}) = k \times \theta_{lc^*}\, \phi_{c^*s}$$

Since W_il is a categorical random variable with support in {1, 2, . . . , C}, the sum of the probabilities for all elements must equal one. Therefore, the constant k is given by:

$$k = \frac{1}{\sum_{c=1}^{C} \theta_{lc}\, \phi_{cs}}$$

As a result, W_il can be sampled from a categorical distribution. For V_lc:

$$p(V_{lc}|\ldots) \propto \mathrm{Binomial}(z_{lc}|z_{lc} + z_{l(c^*>c)}, V_{lc}) \times \mathrm{Beta}(V_{lc}|1, \gamma)$$
$$\propto V_{lc}^{z_{lc}} (1-V_{lc})^{z_{l(c^*>c)}} \times (1-V_{lc})^{\gamma-1}$$
$$\propto V_{lc}^{(z_{lc}+1)-1} (1-V_{lc})^{(z_{l(c^*>c)}+\gamma)-1}$$
$$p(V_{lc}|\ldots) = \mathrm{Beta}(z_{lc}+1,\; z_{l(c^*>c)}+\gamma)$$

For φ_c:

$$p(\boldsymbol{\phi}_c|\ldots) \propto \left[\prod_{l=1}^{L} \prod_{i=1}^{n_l} \mathrm{Cat}(x_{il}|\boldsymbol{\phi}_c)^{1(w_{il}=c)}\right] \times \mathrm{Dirichlet}(\boldsymbol{\phi}_c|\boldsymbol{\beta})$$
$$\propto \left[\prod_{l=1}^{L} \prod_{i=1}^{n_l} \phi_{c1}^{1(x_{il}=1,\, w_{il}=c)} \times \cdots \times \phi_{cS}^{1(x_{il}=S,\, w_{il}=c)}\right] \times \phi_{c1}^{\beta_1-1} \times \cdots \times \phi_{cS}^{\beta_S-1}$$
$$\propto \phi_{c1}^{\sum_{l=1}^{L}\sum_{i=1}^{n_l} 1(x_{il}=1,\, w_{il}=c)+\beta_1-1} \times \cdots \times \phi_{cS}^{\sum_{l=1}^{L}\sum_{i=1}^{n_l} 1(x_{il}=S,\, w_{il}=c)+\beta_S-1}$$
$$p(\boldsymbol{\phi}_c|\ldots) = \mathrm{Dirichlet}([q_{c1}+\beta_1, \ldots, q_{cS}+\beta_S])$$

where $q_{cs} = \sum_{l=1}^{L} \sum_{i=1}^{n_l} 1(x_{il} = s, w_{il} = c)$.
Acknowledgements. We thank Justin Millar, Ben Baiser, Gordon Burleigh, and Tamer Kahveci for their numerous comments and suggestions. This work was partly supported by the US Department of Agriculture National Institute of Food and Agriculture McIntire Stennis project 1005163 and by the US National Science Foundation award 1458034 to DV. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References Airoldi EM, Blei D, Erosheva EA, Fienberg SE (2014). Handbook of Mixed Membership Models and Their Applications. CRC Press. Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008). “Mixed Membership Stochastic Blockmodels.” Journal of Machine Learning Research, 9(Sep), 1981–2014. Analytics R, Weston S (2014). “doParallel: Foreach Parallel Adaptor for the Parallel Package.” R package version, 1(8).
Pedro Albuquerque, Denis Valle, Daijiang Li
19
Backenroth D, Ionita-Laza I (2017). FUNLDA: Genomic Latent Dirichlet Allocation. R package version 1.1, URL https://CRAN.R-project.org/package=FUNLDA.
Blei DM, Jordan MI (2003). "Modeling Annotated Data." In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127–134. ACM.
Blei DM, Ng AY, Jordan MI (2003). "Latent Dirichlet Allocation." Journal of Machine Learning Research, 3(Jan), 993–1022.
Chang J (2012). lda: Collapsed Gibbs Sampling Methods for Topic Models. R package version 1.4.2.
Eddelbuettel D, François R, Allaire J, Chambers J, Bates D, Ushey K (2011). "Rcpp: Seamless R and C++ Integration." Journal of Statistical Software, 40(8), 1–18.
Eddelbuettel D, Sanderson C (2014). "RcppArmadillo: Accelerating R with High-Performance C++ Linear Algebra." Computational Statistics & Data Analysis, 71, 1054–1063.
Erosheva EA, Fienberg SE (2005). "Bayesian Mixed Membership Models for Soft Clustering and Classification." In Classification — The Ubiquitous Challenge, pp. 11–26. Springer-Verlag.
Garthwaite PH, Kadane JB, O'Hagan A (2005). "Statistical Methods for Eliciting Probability Distributions." Journal of the American Statistical Association, 100(470), 680–701.
Gelman A, Shirley K, et al. (2011). "Inference from Simulations and Monitoring Convergence." In Handbook of Markov Chain Monte Carlo, pp. 163–174.
Griffiths TL, Steyvers M (2004). "Finding Scientific Topics." Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235.
Hornik K, Grün B (2011). "topicmodels: An R Package for Fitting Topic Models." Journal of Statistical Software, 40(13), 1–30.
Iyer R (2017). PivotalR: A Fast, Easy-to-Use Tool for Manipulating Tables in Databases and a Wrapper of MADlib. R package version 0.1-18.3, URL https://CRAN.R-project.org/package=PivotalR.
Jones TW (2016). textmineR: Functions for Text Mining and Topic Modeling. R package version 2.0.2.
Keshava N (2003). "A Survey of Spectral Unmixing Algorithms." Lincoln Laboratory Journal, 14(1), 55–78.
Kotler P, Armstrong G (2006). Principles of Marketing Management.
Lee S, Baker J, Song J, Wetherbe JC (2010). "An Empirical Comparison of Four Text Mining Methods." In Proceedings of the 43rd Hawaii International Conference on System Sciences (HICSS), pp. 1–10. IEEE.
Lienou M, Maître H, Datcu M (2010). "Semantic Annotation of Satellite Images Using Latent Dirichlet Allocation." IEEE Geoscience and Remote Sensing Letters, 7(1), 28–32.
Lima JMT, Vittor A, Rifai S, Valle D (2017). "Does Deforestation Promote or Inhibit Malaria Transmission in the Amazon? A Systematic Literature Review and Critical Appraisal of Current Evidence." Philosophical Transactions of the Royal Society B, 372(1722), 20160125.
Luraschi J, Ushey K, Allaire J, RStudio, The Apache Software Foundation (2017). sparklyr: R Interface to Apache Spark. R package version 0.5-6, URL https://CRAN.R-project.org/package=sparklyr.
Mahajan A, Dey L, Haque SM (2008). "Mining Financial News for Major Events and Their Impacts on the Market." In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT '08), volume 1, pp. 423–426. IEEE.
Mcauliffe JD, Blei DM (2008). "Supervised Topic Models." In Advances in Neural Information Processing Systems, pp. 121–128.
McLachlan G, Peel D (2004). Finite Mixture Models. John Wiley & Sons.
Nikita M (2016). ldatuning: Tuning of the Latent Dirichlet Allocation Models Parameters. R package version 0.2-0, URL https://CRAN.R-project.org/package=ldatuning.
Pearce JL, Boyce MS (2006). "Modelling Distribution and Abundance with Presence-Only Data." Journal of Applied Ecology, 43(3), 405–412.
Phan XH, Nguyen CT (2013). "GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) Using Gibbs Sampling for Parameter Estimation and Inference."
Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence Diagnosis and Output Analysis for MCMC." R News, 6(1), 7–11. URL https://journal.r-project.org/archive/.
Pritchard JK, Stephens M, Donnelly P (2000). "Inference of Population Structure Using Multilocus Genotype Data." Genetics, 155(2), 945–959.
Roberts ME, Stewart BM, Tingley D (2014). stm: R Package for Structural Topic Models. R package, 1, 12.
Sievert C, Shirley K (2015). LDAvis: Interactive Visualization of Topic Models. R package version 0.3-2, URL https://CRAN.R-project.org/package=LDAvis.
Stephens M (2000). "Dealing with Label Switching in Mixture Models." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4), 795–809.
Tirunillai S, Tellis GJ (2014). "Mining Marketing Meaning from Online Chatter: Strategic Brand Analysis of Big Data Using Latent Dirichlet Allocation." Journal of Marketing Research, 51(4), 463–479.
Tsai FS (2011). "A Tag-Topic Model for Blog Mining." Expert Systems with Applications, 38(5), 5330–5335.
Valle D, Baiser B, Woodall CW, Chazdon R (2014). "Decomposing Biodiversity Data Using the Latent Dirichlet Allocation Model, a Probabilistic Multivariate Statistical Method." Ecology Letters, 17(12), 1591–1601.
Westgate MJ, Barton PS, Pierson JC, Lindenmayer DB (2015). "Text Analysis Tools for Identification of Emerging Topics and Research Gaps in Conservation Science." Conservation Biology, 29(6), 1606–1614.
Affiliation:
Pedro Albuquerque
Faculdade de Economia, Administração e Contabilidade
University of Brasília
Building A-2, Office A1-54/7
Brasília, DF 70910-900
E-mail: [email protected]
URL: http://pedrounb.blogspot.com/

Denis Ribeiro do Valle
Institute of Food and Agricultural Sciences
University of Florida
408 McCarty Hall C, PO Box 110339
Gainesville, FL 32611-0410
E-mail: [email protected]
URL: http://denisvalle.weebly.com/

Daijiang Li
Institute of Food and Agricultural Sciences
University of Florida
408 McCarty Hall C, PO Box 110339
Gainesville, FL 32611-0410
E-mail: [email protected]
URL: http://daijiang.name/