Hierarchical Theme and Topic Modeling Jen-Tzung Chien, Senior Member, IEEE
Abstract—Considering the hierarchical data groupings in a text corpus, e.g. words, sentences and documents, we conduct structural learning and infer the latent themes and topics for sentences and words, respectively, from a collection of documents. The relation between themes and topics under different data groupings is explored through an unsupervised procedure without limiting the number of clusters. A tree stick-breaking process is presented to draw theme proportions for different sentences. We build a hierarchical theme and topic model which flexibly represents heterogeneous documents using Bayesian nonparametrics. Thematic sentences and topical words are extracted. In the experiments, the proposed method is shown to be effective in building a semantic tree structure over sentences and the corresponding words. The superiority of the tree model for selecting expressive sentences in document summarization is also illustrated.

Index Terms—Structural learning, topic model, Bayesian nonparametrics, document summarization
I. INTRODUCTION

Unsupervised learning has the broad goal of extracting features and discovering structure within the given data. Unsupervised learning via probabilistic topic models [1] has been successfully developed for document categorization [2], image analysis [3], text segmentation [4], speech recognition [5], information retrieval [6], document summarization [7], [8] and many other applications. Using a topic model, latent semantic topics are learned from a bag of words to capture the salient aspects embedded in a data collection. In this study, we propose a new topic model to represent a bag of sentences as well as the corresponding words. The concept of a "topic" is well understood in the community. Here, we use another related concept, the "theme". Themes are latent variables which occur at a different level of grouped data, e.g. sentences, so the concepts of themes and topics are different. We model themes and topics separately and estimate them jointly. A hierarchical theme and topic model is constructed. Figure 1 illustrates the hierarchical generation from documents and sentences to words given the themes and topics drawn from their proportions. We explore a semantic tree structure of sentence-level latent variables from a bag of sentences, while the word-level latent variables are learned from a bag of grouped words allocated to individual tree nodes. We build a two-level topic model through a compound process in which the generation of words is conditioned on the theme assigned to the sentence. The motivation of this paper is to go beyond the word level and upgrade the topic model by discovering the hierarchical relations between the latent variables at the word and sentence levels.

This work was supported in part by the Ministry of Science and Technology, Taiwan, under contract MOST 103-2221-E-009-078-MY3. J.-T. Chien is with the Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: [email protected]).
Fig. 1. Conceptual illustration for hierarchical generation of documents (yellow rectangle), sentences (green diamond) and words (blue circle) by using theme and topic proportions.
The benefit of this model is to establish a hierarchical latent variable model which can characterize heterogeneous documents with multiple levels of abstraction in different data groupings. The model is general and can be applied to document summarization and many other information systems.

A. Related work for topic model

The topic model based on latent Dirichlet allocation (LDA) [2] is constructed as a finite-dimensional mixture representation which assumes that 1) the number of topics is fixed and 2) different topics are independent. The hierarchical Dirichlet process (HDP) [9] and the nested Chinese restaurant process (nCRP) [10], [11] were proposed to relax these assumptions. The HDP-LDA model by Teh et al. [9] is a nonparametric extension of LDA where the document representation is allowed to grow structurally as more documents are observed. The number of topics is unknown a priori. Each word token within a document is drawn from a mixture model where the hidden topics are shared across documents. A Dirichlet process (DP) is employed to find flexible data partitions and provide a nonparametric prior over the number of topics for each document. The base measure for the child DPs is itself drawn from a parent DP, so the atoms are shared in a hierarchical way. The model selection problem is tackled through Bayesian nonparametric (BNP) learning. In the literature, sparse topic models were constructed by decoupling sparsity and smoothness for LDA [12] and HDP [13]. A spike-and-slab prior over Dirichlet distributions was applied by using a Bernoulli variable for each word to indicate whether the word appears in the topic or not. In [14], the Indian buffet process compound DP was developed to build a focused topic model. In [15], a hierarchical Dirichlet prior with Polya conditionals was used as the asymmetric Dirichlet prior over
the document-topic distributions. The improvement of using this prior structure over the symmetric Dirichlet prior in LDA was substantial even when not estimating the number of topics. On the other hand, the nCRP conducts BNP inference of topic hierarchies and learns deeply branching trees from a data collection. Using this hierarchical topic model [10], each document is modeled by a path of topics given a random tree where the hierarchically-correlated topics, from the global topic to individual topics, are extracted from the root node to the leaf nodes, respectively. In [16], the topic correlation was strengthened through a doubly correlated nonparametric topic model where annotations were incorporated into the discovery of semantic structure. In [17], the nested DP prior was proposed by replacing the random atoms with random probability measures drawn from a DP in a nested setting. In general, HDP and nCRP were implemented according to the stick-breaking process and the Chinese restaurant process. Approximate inference using Markov chain Monte Carlo (MCMC) [9], [10], [11] and variational Bayes (VB) [18], [19], [20] was developed.

B. Topic model beyond word level

In previous methods, unsupervised learning over text data finds the topic information from a bag of words. Mixed membership modeling is implemented for the representation of words from multiple documents, and a word-level mixture model is built. However, unsupervised learning beyond the word level is required in many information systems. For example, topic models based on a bag of bigrams [21] and a bag of n-gram histories [5] were estimated for language modeling. In a document summarization system, the text representation is evaluated to select representative sentences from multiple documents. Exploring sentence clustering and ranking [8] is essential to find the sentence-level themes and measure the relevance between sentences and documents [22]. In [23], general sentences and specific sentences were identified for document summarization. In [24], [7], a sentence-based topic model based on LDA was proposed to learn word-document and word-sentence associations. Furthermore, an information retrieval system retrieves condensed information from user queries. Finding the underlying themes of documents is beneficial to organize the ranked documents and extract the relevant information. In addition to the word-level topic model, it is desirable to build a hierarchical topic model at the sentence level or even at the document level. In [25], [26], the latent Dirichlet co-clustering model and the segmented topic model were proposed to build topic models across different levels.
C. Main idea of this work

In this study, we construct a hierarchical latent variable model for structural representation of text documents. The thematic sentences and the topical words are learned from hierarchical data groupings. Each path in the tree model covers the general theme at the root node down to the individual themes at the leaf nodes. The themes in different tree nodes contain coherent information but with varying degrees of sharing for sentence representation. We basically build a tree model for sentences according to the nCRP. The theme hierarchy is explored. The brother nodes expand the diversity of themes from different sentences within and across documents. This model not only groups the sentences into nodes but also distinguishes their concepts through different layers. The words of the sentences clustered in a tree node are seen as grouped data. The grouped words in different tree nodes are driven by an HDP. The nCRP compound HDP is developed to build a hierarchical theme and topic model. To reflect the heterogeneous documents in a real-world data collection, a tree stick-breaking process is addressed to draw a subtree of theme proportions. We conduct structural learning and group the sentences into a diversity of themes. The number of themes and the dependencies between themes are learned from data. The words of the sentences within a node are represented by a topic model which is drawn by a DP. All the topics from different nodes are shared under a global DP. The sentence-level themes and the word-level topics are estimated. This approach is evaluated on the tasks of document modeling and summarization.

This paper is organized as follows. Section II surveys BNP learning based on DP, HDP and nCRP. Section III presents the nCRP compound HDP for the representation of sentences and words from multiple documents. A tree stick-breaking process is implemented to draw theme proportions for subtree branches. Section IV formulates the posterior probabilities which are applied for drawing the subtree, themes and topics. In Section V, the experiments on text modeling and document summarization are reported. The conclusions drawn from this study are given in Section VI.

II. BAYESIAN NONPARAMETRIC LEARNING

A. Dirichlet process

The Dirichlet process (DP) is essential for BNP learning. Basically, a DP for a random probability measure G is expressed by G ∼ DP(α0, G0). For any finite measurable partition {A1, . . . , Ar}, a finite-dimensional Dirichlet distribution

(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))   (1)

or G ∼ Dir(α0 G0) is produced from G with two parameters, a concentration parameter α0 > 0 and a base measure G0 which is seen as the mean of draws from the DP. The probability measure of the DP is established by

G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \quad \text{where} \quad \sum_{k=1}^{\infty} \beta_k = 1   (2)

where δφk is a unit mass at the point φk and the infinite sequences of weights {βk} and points {φk} are drawn from α0 and G0, respectively. The solution to the weights β = {βk} can be obtained by the stick-breaking process (SBP), which provides a distribution over infinite partitions of the unit interval, also called the GEM distribution, β ∼ GEM(α0) [27]. Let θ1, θ2, . . . denote the parameters drawn from G for individual words w1, w2, . . . with multinomial distribution function, θi | G ∼ G and wi | θi ∼ p(wi | θi) = Mult(θi) for each i.
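A minimal sketch of the stick-breaking construction in Eq. (2) is given below; it draws a truncated approximation of β ∼ GEM(α0) together with multinomial atoms from a uniform Dirichlet base measure. The truncation level and the toy vocabulary size are assumptions made only for this illustration.

import numpy as np

def sample_gem(alpha0, truncation, rng):
    """Truncated stick-breaking draw of beta ~ GEM(alpha0), Eq. (2)."""
    beta_prime = rng.beta(1.0, alpha0, size=truncation)            # beta'_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta_prime)[:-1]))
    beta = beta_prime * remaining                                  # beta_k = beta'_k * prod_{j<k}(1 - beta'_j)
    beta[-1] = 1.0 - beta[:-1].sum()                               # fold leftover stick mass into the last weight
    return beta

rng = np.random.default_rng(0)
V = 50                                                             # assumed toy vocabulary size
beta = sample_gem(alpha0=1.0, truncation=20, rng=rng)              # weights {beta_k}
phi = rng.dirichlet(np.full(V, 1.0), size=len(beta))               # atoms {phi_k} drawn from an assumed base measure
# G is then the discrete measure sum_k beta[k] * delta_{phi[k]}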
TABLE I
CONVENTION OF NOTATIONS

wd = {wdi} : words in document d
w−(di) : words of all documents except word wdi
θdi or θdji : multinomial parameter of word wdi or wdji
zdi = k : latent variable of word wdi at topic k
z−(di) : topics of all words in w except word wdi
φk : kth topic parameter
cd : tree path corresponding to document d
c−d : tree paths of all documents except document d
β = {βk} : global mixture weights or topic proportions in HDP
πd = {πdk} : topic proportions of document d over cd in HDP
α0 & γ : concentration or strength parameter of a DP
G0, H or Gd : base measure of a global DP or a DP for document d
nk : no. of customers or words sitting at table φk
wdj = {wdji} : words in sentence j of document d
wd(−j) : words of all sentences in wd except sentence j
ydj = l : latent variable of sentence j at theme l
yd(−j) : themes of all sentences in wd except sentence j
ψl : lth theme parameter
td = {tdj} : subtree branches for all sentences in document d
td(−j) : subtree branches for all sentences in wd except wdj
πl = {πlk} : topic proportions of theme l in snCRP
βd = {βdl} : theme proportions of document d over td in snCRP
γs & γw : strength parameters in sentence and word levels
V : vocabulary size
θt,l = {θt,l,v} : multinomial parameters of all words v in tuple (t, l)
η : shared Dirichlet prior parameter for θt,l
Ld : maximum no. of themes or nodes in subtree td
Kl : maximum no. of topics in theme l
L & K : total no. of themes and topics
βla & βlac : theme proportions for ancestor and child nodes
md,t : no. of sentences in document d choosing branch t
md(−j),lac : no. of sentences in wd(−j) allocated in node lac
nd,t,l,v : no. of word wdi = v in document d and tuple (t, l)
n−(di),l,v : no. of word wdi = v in w−(di) with theme l
Nd : total no. of words in document d
Fig. 2. Graphical representation for (a) DP mixture model and (b) HDP mixture model. Yellow and blue denote the variables in document and word levels, respectively.
According to the metaphor of the Chinese restaurant process (CRP) [28], the conditional distribution of the current parameter θi given the previous parameters θ1, . . . , θi−1 is obtained by

\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \alpha_0, G_0 \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha_0} \delta_{\theta_l} + \frac{\alpha_0}{i-1+\alpha_0} G_0 = \sum_{k=1}^{K} \frac{n_k}{i-1+\alpha_0} \delta_{\phi_k} + \frac{\alpha_0}{i-1+\alpha_0} G_0.   (3)

Here, φ1, . . . , φK denote the distinct values taken by the previous parameters θ1, . . . , θi−1 and nk is the number of customers or parameters θi′ that are seated at the table or have chosen the value φk. If the base distribution is continuous, then each φk corresponds to a different table. However, if the base distribution is discrete and finite, φk corresponds to a distinct value, but not a table, because draws from a discrete distribution can be repeated. Considering a continuous base measure, the ith customer sits at table φk with probability proportional to the number of customers nk who are already seated there, and sits at a new table with probability proportional to α0. Figure 2(a) displays the graphical representation of a DP mixture model where each word wi is generated by

\begin{aligned}
\phi_k \mid G_0 &\sim G_0, \quad \text{for each } k\\
\beta \mid \alpha_0 &\sim \mathrm{GEM}(\alpha_0)\\
z_i = k \mid \beta &\sim \beta, \quad \text{for each } i\\
w_i \mid z_i, \{\phi_k\}_{k=1}^{\infty} &\sim \mathrm{Mult}(\theta_i = \phi_{z_i}), \quad \text{for each } i
\end{aligned}   (4)

where the mixture component or latent topic zi = k is drawn from the topic proportions β which are determined by the strength parameter α0. Unsupervised learning of an infinite mixture representation is thereby implemented.

B. Hierarchical Dirichlet process
The HDP [9] deals with the representation of documents or grouped data where each group is associated with a mixture model and data in different groups share a global mixture model. Each document or data grouping wd is associated with a draw from a DP, Gd ∼ DP(α0, G0), which determines how much a member of a shared set of mixture components contributes to that data grouping. The base measure G0 is itself drawn from a global DP by G0 ∼ DP(γ, H) with strength parameter γ and base measure H, which ensures that there is a single set of discrete components shared across the data. Each DP Gd governs the generation of the words wd = {wdi} or their multinomial parameters {θdi} for a document. The global measure G0 and the individual measures Gd in the HDP can be expressed as mixture models with the shared atoms {φk} but different weights β = {βk} and πd = {πdk} given by

G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad G_d = \sum_{k=1}^{\infty} \pi_{dk} \delta_{\phi_k} \;\; \text{for each } d, \qquad \text{where} \;\; \sum_{k=1}^{\infty} \beta_k = \sum_{k=1}^{\infty} \pi_{dk} = 1.   (5)

The atom φk is drawn from the base measure H and the topic proportions β are drawn by an SBP via β | γ ∼ GEM(γ). Note that the topic proportions πd are sampled from an independent
DP Gd given β because the Gd's are independent given G0. We have πd ∼ DP(α0, β). Figure 2(b) shows the HDP mixture model which generates a word wdi in grouped data wd by

\begin{aligned}
\phi_k \mid H &\sim H, \quad \text{for each } k\\
\beta \mid \gamma &\sim \mathrm{GEM}(\gamma)\\
\pi_d \mid \alpha_0, \beta &\sim \mathrm{DP}(\alpha_0, \beta), \quad \text{for each } d\\
z_{di} = k \mid \pi_d &\sim \pi_d, \quad \text{for each } d \text{ and } i\\
w_{di} \mid z_{di}, \{\phi_k\}_{k=1}^{\infty} &\sim \mathrm{Mult}(\theta_{di} = \phi_{z_{di}}), \quad \text{for each } d \text{ and } i.
\end{aligned}   (6)

To implement this HDP, we can apply the stick-breaking construction to connect β and πd. We first conduct the stick-breaking construction for β = {βk} by drawing beta variables

\beta_k' \sim \mathrm{Beta}(1, \gamma), \qquad \beta_k = \beta_k' \prod_{j=1}^{k-1} (1 - \beta_j').   (7)

Then, the stick-breaking construction for the probability measure πd of grouped data wd is performed to bridge the relation

\pi_{dk}' \sim \mathrm{Beta}\Big(\alpha_0 \beta_k, \; \alpha_0 \sum_{l=k+1}^{\infty} \beta_l\Big), \qquad \pi_{dk} = \pi_{dk}' \prod_{j=1}^{k-1} (1 - \pi_{dj}')   (8)

where the beta variable π′dk at the kth draw is determined by the base measures of the two segments {βk, {βl}l>k}, which are scaled by the parameter α0. Basically, the HDP does not involve learning topic hierarchies; only a single level of data grouping, i.e. the document level, is modeled.
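The two-level construction in Eqs. (7)-(8) can be sketched as follows; this is only an illustrative approximation with a fixed truncation level (an assumption for the example), reusing numpy, rng and the sample_gem helper defined above.

def sample_hdp_weights(beta, alpha0, rng):
    """Draw document-level proportions pi_d ~ DP(alpha0, beta) via Eq. (8)."""
    tail = np.concatenate((np.cumsum(beta[::-1])[::-1][1:], [0.0]))    # sum_{l>k} beta_l
    pi_prime = rng.beta(alpha0 * beta + 1e-12, alpha0 * tail + 1e-12)  # pi'_dk ~ Beta(a0*beta_k, a0*sum_{l>k} beta_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi_prime)[:-1]))
    pi_d = pi_prime * remaining                                        # pi_dk = pi'_dk * prod_{j<k}(1 - pi'_dj)
    pi_d[-1] = 1.0 - pi_d[:-1].sum()                                   # absorb leftover stick mass
    return pi_d

beta = sample_gem(alpha0=1.0, truncation=20, rng=rng)   # global weights, Eq. (7)
pi_d = sample_hdp_weights(beta, alpha0=1.0, rng=rng)    # document-specific weights, Eq. (8)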
Fig. 3. An infinitely branching tree structure for representation of words and documents based on nCRP. Thick arrows denote a tree path cd drawn from nine words of a document wd. Here, we simply use the notation d for wd. The yellow rectangle and blue circles denote the document and words, respectively. Each word wdi is assigned a topic parameter φk at a tree node along cd with proportion or probability πdk.

C. The nested Chinese restaurant process

In many practical applications, it is desirable to represent text data based on different levels of aspects. The nCRP [10], [11] was proposed to infer the topic hierarchies and learn a deeply branching tree from a data collection. The resulting hierarchical LDA represents a text document wd by using a path cd of topics from a random tree consisting of the hierarchically-correlated topics {φk}, from the global topic at the root node to individual topics at the leaf nodes, as illustrated in Figure 3. Choosing a path of topics for a document is equivalent to selecting or visiting a chain of restaurants cd by a customer wd through cd ∼ nCRP(α0), which is controlled by a scaling parameter α0. Each word wdi in document wd is assigned a tree node φk or topic label zdi = k with probability or topic proportion πdk. The generative process for a set of text documents w = {wd | d = 1, . . . , D} based on the nCRP is constructed by
1) For each node k in the infinite tree
   a) Draw a topic with parameter φk | H ∼ H.
2) For each document wd = {wdi | i = 1, . . . , Nd}
   a) Draw a tree path by cd ∼ nCRP(α0).
   b) Draw topic proportions over the layers of the tree path cd by a stick-breaking process πd | γ ∼ GEM(γ).
   c) For each word wdi
      i) Choose a layer or a topic by zdi = k | πd ∼ πd.
      ii) Choose a word based on topic zdi = k by wdi | zdi, cd, {φk} ∼ Mult(θdi = φcd(zdi)).
A Gibbs sampling algorithm is developed to sample the posterior tree path and word topic {cd, zdi} for different words in the D documents w = {wd} = {wdi} according to the posterior probabilities of cd and zdi given w and the current values of all other latent variables, i.e. p(cd | c−d, w, z, α0, H) and p(zdi | w, z−(di), cd, γ, H). In these posterior probabilities, the notations c−d and z−(di) denote the latent variables c and z for all documents and words other than document d and word i, respectively. The sampling procedure is performed iteratively from the current state to a new state for the D documents with Nd words in each document wd. In this procedure, the topic parameter φk in each node k of an infinite tree model is sampled from a prior measure φk | H ∼ H. The topic proportions πd of document wd are drawn by πd | γ ∼ GEM(γ) according to an SBP given a scaling parameter γ. Given the samples of tree path cd and topic zdi, the word wdi is distributed by using the multinomial parameter

\theta_{di} = \phi_{c_d(z_{di})}   (9)

of topic zdi = k drawn from the topic proportions πd corresponding to the tree path cd.
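As a rough illustration of this generative process (a simplified sketch only, not the authors' implementation), the following draws a tree path for each document with a nested CRP over a fixed depth and then generates words. The depth, vocabulary size and hyperparameter values are assumptions for the example, and the numpy objects rng and sample_gem come from the earlier sketches.

from collections import defaultdict

def ncrp_choose_child(counts, alpha0, rng):
    """CRP choice among the existing children of a node, or a new child."""
    children = list(counts.keys())
    weights = np.array([counts[c] for c in children] + [alpha0], dtype=float)
    idx = rng.choice(len(weights), p=weights / weights.sum())
    return children[idx] if idx < len(children) else max(children, default=-1) + 1

def draw_tree_path(tree_counts, depth, alpha0, rng):
    """Draw a path c_d of node ids from the root down to the given depth."""
    path, node = [0], 0                          # node 0 is the shared root restaurant
    for _ in range(1, depth):
        child = ncrp_choose_child(tree_counts[node], alpha0, rng)
        tree_counts[node][child] += 1            # seat the customer at this table
        path.append((node, child))
        node = (node, child)
    return path

# toy usage under assumed settings
depth, V, alpha0, gamma = 3, 50, 1.0, 1.0
tree_counts = defaultdict(lambda: defaultdict(int))
topics = defaultdict(lambda: rng.dirichlet(np.full(V, 0.1)))   # phi_k per node (assumed prior)
c_d = draw_tree_path(tree_counts, depth, alpha0, rng)
pi_d = sample_gem(gamma, truncation=depth, rng=rng)            # proportions over the path layers
z = rng.choice(depth, size=20, p=pi_d)                         # a layer/topic for each of 20 words
words = [rng.choice(V, p=topics[c_d[k]]) for k in z]           # w_di ~ Mult(phi_{c_d(z_di)})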
III. HIERARCHICAL THEME AND TOPIC MODEL

Although topic hierarchies are explored in the topic model based on the nCRP, only a single level of data grouping, i.e. the document level, is considered in its generative process. The extension of text representation to different levels of data groupings is required to improve text modeling. In addition, a single tree path cd in the nCRP may not sufficiently reflect the topic variations and theme ambiguities in heterogeneous documents. A flexible topic selection is required to compensate for model uncertainty. By conducting multiple-level unsupervised
learning and flexible topic selection, we are able to upgrade system performance for document modeling. In this study, a hierarchical theme and topic model is proposed to conduct a kind of topical co-clustering [25], [26] over the sentence level and the word level so that one can cluster sentences while clustering words. The nCRP compound HDP is presented to implement the proposed model, where text modeling at the word, sentence and document levels is jointly performed. Following [29], a simplified tree-structured SBP is presented to draw a subtree td, which accommodates the theme and topic variations in document wd.
Fig. 4. An infinitely branching tree structure for representation of words, sentences and documents based on the nCRP compound HDP. Thick arrows denote the subtree branches td = {tdj} drawn for eight sentences {wdj | j = 1, . . . , 8} of a document wd. Here, we simply use the notations sj for wdj and d for wd. The yellow rectangle, green diamonds and blue circles denote the document, sentences and words, respectively. Each sentence wdj is assigned a theme parameter ψl at a tree node connected to the subtree branches with a document-dependent probability βdl, while each word wdji in a tree node is assigned a topic parameter φk with a theme-dependent probability πlk.

A. The nCRP compound HDP

We propose an unsupervised structural learning approach to discover latent variables in different levels of data groupings. For the application of document representation, we aim to explore the latent themes from a set of sentences {wdj} and the latent topics from the corresponding set of words {wdji}. A text corpus w = {wd} = {wdj} = {wdji} is represented by using different levels of aspects. The proposed model is constructed by considering the structure of a document where each document consists of a "bag of sentences" and each sentence consists of a "bag of words". Different from the topic model using the HDP [9] and the hierarchical topic model using the nCRP [10], [11], we develop a tree model to represent a "bag of sentences" and then represent the corresponding words allocated to individual tree nodes. A two-stage procedure is proposed for document modeling as illustrated in Figure 4. In the first stage, each sentence wdj of a document wd is drawn from a mixture of themes where the themes are shared by all sentences from a document collection w. The theme model of a document wd is composed of the themes under its corresponding subtree td for the different sentences {wdj}, which are drawn by a sentence-based nCRP (snCRP) with a control parameter α0

t_d = \{t_{dj}\} \sim \mathrm{snCRP}(\alpha_0)   (10)

where snCRP(·) is defined by considering individual sentences {wdj} under a subtree td rather than individual words {wdi} with a single tree path cd as addressed for nCRP(·) in Section II-C. Notably, we consider a subtree td = {tdj} for thematic representation of the sentences from a heterogeneous document wd. The proportions βd of the theme parameters ψ = {ψl} over a subtree td are drawn according to a tree stick-breaking process (TSBP) which is simplified from the tree-structured stick-breaking process in [29]. The resulting distribution is called the treeGEM, which is expressed by

\beta_d \mid \gamma_s \sim \mathrm{treeGEM}(\gamma_s)   (11)
where γs is a strength parameter. The distribution treeGEM(·) is derived through a TSBP which is described in Section III-D. With a tree structure of themes, the unsupervised grouping of sentences into different layers is obtained. In the second stage, each word wdji in the set of sentences {wdj} allocated to tree node l is drawn from an individual topic mixture model based on a DP. All topics from different nodes of a tree model are shared under a global topic model with parameters {φk} which are sampled by φk | H ∼ H. By treating the words in a tree node as grouped data, the HDP is implemented for topical representation of the whole document collection w, which is composed of the grouped words in different tree nodes. We assume that the words of the sentences in tree node l are conditionally independent; these words are drawn from a topic model with topic parameters {φk}. The sentences in a document given theme l are conditionally independent; these sentences are drawn from a theme model with theme parameters {ψl}. The document-dependent theme proportions βd = {βdl} and the theme-dependent topic proportions πl = {πlk} are produced by a tree stick-breaking process βd | γs ∼ treeGEM(γs) and a standard stick-breaking process πl ∼ GEM(γw), where γs and γw denote the sentence-level and word-level strength parameters, respectively. Given the theme proportions βd and topic proportions πl, the probability measure of each sentence wdj is drawn from a document-dependent theme mixture model

G_{s,d} = \sum_{l=1}^{\infty} \beta_{dl} \delta_{\psi_l}   (12)

while the probability measure of each word wdji in a tree node is drawn from a theme-dependent topic mixture model

G_{w,l} = \sum_{k=1}^{\infty} \pi_{lk} \delta_{\phi_k}.   (13)

Since the theme for sentences is represented by a mixture model of topics for words in Eq. (13), we bridge the relation between the probability measures of themes {ψl} and topics {φk} via

\psi_l \sim \sum_{k} \pi_{lk} \phi_k.   (14)
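For intuition, Eq. (14) simply mixes topic-word multinomials into a theme-level word distribution. The short sketch below is an illustration only, with an assumed truncation of the topic set, and reuses numpy, rng and sample_gem from the earlier examples.

def theme_word_distribution(pi_l, phi):
    """psi_l as a mixture of topic multinomials, Eq. (14): sum_k pi_lk * phi_k."""
    # pi_l: (K,) topic proportions of theme l; phi: (K, V) topic-word probabilities
    return pi_l @ phi

K, V = 10, 50                                   # assumed truncation and vocabulary size
phi = rng.dirichlet(np.full(V, 0.1), size=K)    # K topic-word distributions
pi_l = sample_gem(alpha0=1.0, truncation=K, rng=rng)
psi_l = theme_word_distribution(pi_l, phi)      # a theme-level distribution over the V words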
We implement the so-called nCRP compound HDP in a two-stage procedure and establish the hierarchical theme and topic model for document representation. The generative process for a set of documents in different levels of groupings is accordingly implemented. Having the sampled tree node ydj connected to the tree branches td and the associated topic zdi = k, the word wdji in sentence wdj is drawn by using the multinomial parameter

\theta_{dji} = \phi_{t_d(y_{dj}, z_{di})}.   (15)

The topic is determined by the topic proportions πl of theme ydj = l while the theme is determined by the theme proportions βd of document wd. The generative process of this compound process is described as follows:
1) For each node or theme l in the infinite tree
   a) For each topic k in a tree node
      i) Draw a topic with parameter φk | H ∼ H.
   b) Draw topic proportions by the stick-breaking process πl | γw ∼ GEM(γw).
   c) The theme model is constructed by ψl ∼ Σk πlk φk.
2) For each document wd = {wdj}
   a) Draw a subtree td = {tdj} ∼ snCRP(α0).
   b) Draw theme proportions over the subtree td in different layers by the tree stick-breaking process βd | γs ∼ treeGEM(γs).
   c) For each sentence wdj = {wdji}
      i) Choose a theme ydj = l | βd ∼ βd.
      ii) For each word wdji, or simply wdi
         a. Choose a topic by zdi = k | πl ∼ πl.
         b. Choose a word based on topic zdi = k by wdji | zdi, ydj, td, {φk} ∼ Mult(θdji = φtd(ydj, zdi)).

B. The nCRP for thematic sentences

We implement the snCRP and build an infinitely branching tree model which discovers latent themes from the different sentences of a text corpus. The root node of this tree contains the general theme while the nodes in the leaf layer convey specific themes. The hierarchical clustering of sentences is realized. The snCRP is developed to represent the sentences of a document wd = {wdj} based on the themes, which come from a subtree td = {tdj}. The ambiguity and uncertainty of themes existing in a heterogeneous document can thus be compensated. This is different from the conventional word-based nCRP [10], [11] where only the topics along a single tree path cd are selected to represent the different words in a document wd = {wdi}. The word-based nCRP using the GEM distribution for topic proportions πd ∼ GEM(γ) is now extended to the snCRP using the treeGEM distribution for theme proportions βd ∼ treeGEM(γs) by considering a subtree td. The metaphor of the snCRP is described as shown in Figure 4. There are an infinite number of Chinese restaurants in a city and each restaurant has an infinite number of tables. The tourist wdj of a tour group wd = {wdj} visits the first restaurant in the root node ψ1, where each of its tables has a card showing the next restaurant, which is arranged in the second layer consisting of {ψ2, ψ4, . . .}. Such visits repeat infinitely. Each restaurant is associated with a tree layer. The restaurants in the city are organized into an infinitely-branched and infinitely-deep tree structure. The tourists or sentences wdj from different tour groups or documents wd that are closely related shall visit the same restaurant and sit at the same table. The hierarchical grouping of sentences is therefore obtained by the nonparametric tree model based on this snCRP. Each tree node stands for a theme. Each tourist or sentence wdj is modeled by the multinomial parameter ψl of a theme l from a subtree or chain of restaurants td. The thematic sentences allocated to tree nodes with high probabilities can be selected for document summarization.
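To make the difference from the word-based nCRP concrete, the following sketch (an illustration under the same assumptions and helpers as the earlier examples, not the actual inference) draws one nCRP path per sentence and takes their union as the subtree td of a document; real inference uses the posterior probabilities of Section IV rather than prior draws.

def draw_document_subtree(num_sentences, tree_counts, depth, alpha0, rng):
    """snCRP sketch: each sentence draws its own path; the union forms subtree t_d."""
    paths = [draw_tree_path(tree_counts, depth, alpha0, rng) for _ in range(num_sentences)]
    subtree_nodes = {node for path in paths for node in path}   # themes available to this document
    return paths, subtree_nodes

paths, t_d = draw_document_subtree(num_sentences=8, tree_counts=tree_counts,
                                   depth=3, alpha0=1.0, rng=rng)
# Each sentence j is later assigned a theme y_dj among the nodes of t_d
# according to the treeGEM proportions beta_d (Section III-D).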
Fig. 5. Graphical representation for the hierarchical theme and topic model. Yellow, green and blue denote the variables in document, sentence and word levels, respectively.
C. HDP for topical words

After obtaining the hierarchical grouping of sentences, we treat the words corresponding to a tree node l as grouped data and conduct the HDP over the grouped data in different tree nodes. The grouped words from different documents are recognized under the same theme ψl. Such a tree-based HDP is different from the document-based HDP [9], which treats documents as the grouped data for text representation. An organized topic model is constructed to draw words for individual themes. The theme-dependent topical words are learned from the tree-based HDP. According to the combination of the snCRP and the tree-based HDP, the theme parameter ψl is represented by a mixture model of topic parameters {φk} where the theme-dependent topic proportions {πlk} are used as the mixture weights, as given in Eq. (14). The tree-based HDP is developed to infer the topic proportions πl based on a standard SBP. The tree-based multinomial parameter is inferred to determine the word distribution through

\begin{aligned}
\phi_k \mid H &\sim H, \quad \text{for each } k\\
\pi_l \mid \gamma_w &\sim \mathrm{GEM}(\gamma_w), \quad \psi_l \sim \sum_{k} \pi_{lk} \phi_k, \quad \text{for each } l\\
t_d &\sim \mathrm{snCRP}(\alpha_0), \quad \beta_d \mid \gamma_s \sim \mathrm{treeGEM}(\gamma_s), \quad \text{for each } d\\
y_{dj} = l \mid \beta_d &\sim \beta_d, \quad \text{for each } d \text{ and } j\\
z_{di} = k \mid \pi_l &\sim \pi_l, \quad \text{for each } d \text{ and } i\\
w_{dji} \mid z_{di}, y_{dj}, t_d, \{\phi_k\}_{k=1}^{\infty} &\sim \mathrm{Mult}\big(\phi_{t_d(y_{dj}, z_{di})}\big), \quad \text{for each } d, j \text{ and } i.
\end{aligned}   (16)

A compound process of snCRP and HDP is fulfilled to build the hierarchical theme and topic model as illustrated in the
graphical representation of Figure 5. Each word wdji is drawn by using the topic parameter θdji = φtd(ydj, zdi), which is controlled by the word-level topic zdi. This topic is allocated in the theme or tree node ydj of the subtree td which is selected from sentence wdj. The topical words in different tree nodes with high probability are selected. Such a process could be extended to represent multiple-level data groupings including words, paragraphs, documents, streams and corpora.
Fig. 6. Illustration for conducting (a) the tree stick-breaking process, and finding (b) the hierarchical theme proportions for an infinitely branching tree structure, which meets the constraint of unit length in the estimated theme proportions or in the treeGEM distribution. The dashed arrows and circles represent the future stick-breaking process. The stick-breaking process of a parent node la to its child nodes Ωc = {la0, la1, la2, . . .} is shown in (c). The theme proportion of the first child node of a parent node la is denoted by βla0. After stick-breaking for a parent node la and its child nodes Ωc, we have the replacement βla ← βla0. This example illustrates how the three-layer proportions {β1, β11, β12, β111, β121} or {β10, β110, β120, β111, β121} in Figure 4 are drawn to share a unit-length stick.

D. Tree stick-breaking process

In the implementation, a tree stick-breaking process is incorporated to draw a subtree [30] and determine the theme proportions for representation of the sentences in a heterogeneous document wd = {wdj}. The traditional GEM distribution is not suitable to characterize a tree structure with dependencies between brother nodes and between parent and child nodes. The snCRP is therefore combined with a TSBP, which is developed to draw theme proportions βd for a document wd based on the treeGEM distribution, subject to the constraint that the multinomial parameters satisfy Σl βdl = 1 over all nodes in the subtree td. A variety of aspects from different sentences is revealed through βd.

As illustrated in Figures 6(a) and 6(c), we draw the theme proportions for an ancestor node and its child nodes {la, Ωc = {la0, la1, la2, . . .}} that are connected between two layers by arrows. The theme proportion βla0 of the first child node denotes the initial segment which is succeeded from the ancestor node la. The stick-breaking process is performed for the coming child nodes {la1, la2, . . .}. Given the treeGEM parameter γs, we sample a beta variable

\beta_{l_{ac}}' \sim \mathrm{Beta}(1, \gamma_s) = \frac{\Gamma(1+\gamma_s)}{\Gamma(1)\Gamma(\gamma_s)} (1 - \beta_{l_{ac}}')^{\gamma_s - 1}   (17)

for a child node lac ∈ Ωc. The probability or theme proportion of generating a child node lac is calculated by

\beta_{l_{ac}} = \beta_{l_a} \beta_{l_{ac}}' \prod_{j=1}^{c-1} (1 - \beta_{l_{aj}}').   (18)

In this calculation, the theme proportion βlac of a child node is determined from an initial proportion of the ancestor node βla, which is continuously chopped according to the draws of beta variables from the existing child nodes {β′laj | j = 1, . . . , c−1} to the new child node β′lac. Each child node can be further treated as an ancestor node for a future stick-breaking process to generate the grandchild nodes. Eq. (18) is recursively applied to find the theme proportions βla and βlac for different nodes in the subtree td. The theme proportions of a parent node and its child nodes satisfy the condition

\beta_{l_a} = \beta_{l_{a0}} + \beta_{l_{a1}} + \beta_{l_{a2}} + \cdots.   (19)
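The following sketch applies Eqs. (17)-(19) recursively to turn a subtree of node ids into treeGEM theme proportions; it is an illustrative simplification in which the subtree structure is given in advance (an assumption for this example), and it reuses numpy and rng from the earlier examples.

def tree_stick_breaking(children_of, root, gamma_s, rng, root_mass=1.0):
    """Recursively split a parent's mass among itself and its children, Eqs. (17)-(19)."""
    proportions = {}
    def split(node, mass):
        kids = children_of.get(node, [])
        if not kids:
            proportions[node] = mass
            return
        remaining = mass
        for child in kids:
            b = rng.beta(1.0, gamma_s)            # Eq. (17)
            share = remaining * b                 # Eq. (18): beta_la * b * prod(1 - earlier b's)
            remaining -= share
            split(child, share)
        proportions[node] = remaining             # beta_la <- beta_la0: mass kept by the parent
    split(root, root_mass)
    return proportions                            # values sum to root_mass, consistent with Eq. (19)

# toy subtree: root 0 with two children, one of which has two children (assumed structure)
children_of = {0: [1, 2], 1: [3, 4]}
beta_d = tree_stick_breaking(children_of, root=0, gamma_s=1.0, rng=rng)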
Tree stick-breaking is run for each set of nodes {la, Ωc}. After stick-breaking of a parent node la and its child nodes Ωc, the theme proportion of the parent node is replaced by βla ← βla0. Figure 6(b) shows how the three-layer theme proportions {β1, β11, β12, β111, β121} in Figure 4 are inferred through the TSBP. We accordingly infer the theme proportions βd for a document wd and meet the constraint that the summation over theme proportions of all parent and child nodes has unit length. The inference βd ∼ treeGEM(γs) is completed. Therefore, a stick of unit length is partitioned at random locations. The random theme proportions βd are used to calculate the posterior probability for drawing theme ydj = l for a sentence wdj. In this study, we adopt a single beta parameter γs for the TSBP towards depth as well as branch. This solution is a simplified realization of the TSBP in [29] where two separate beta parameters {γsd, γsb} are adopted to conduct stick-breaking for depth and branch, βd ∼ treeGEM(γsd, γsb). The infinite version of the Dirichlet tree distribution used in [31] could be adopted as the conjugate prior over the tree multinomials βd.

IV. BAYESIAN INFERENCE

Approximate Bayesian inference using Gibbs sampling is developed to infer the posterior parameters or latent variables of the nCRP compound HDP. The latent variables consist of the subtree branch tdj, the sentence-level theme ydj and the word-level topic zdi for each word wdi and sentence wdj in a text corpus w. Each latent variable is iteratively sampled from the posterior probability of this variable given the observed
data w and all the other variables. The sampling of latent variables is performed in an MCMC procedure. In this procedure, we sample a subtree td = {tdj} for a document wd via the snCRP. Each sentence wdj is grouped into a tree node or document-dependent theme ydj = l under the subtree td. Each word wdi of a sentence is then assigned the theme-dependent topic zdi = k, which is sampled according to the tree-based HDP. In what follows, we address the calculation of the posterior probabilities for drawing the tree branch tdj, theme label ydj and topic label zdi.
A. Sampling for document-dependent subtree branches

A document wd is seen as a "bag of sentences" for sampling a subtree td. We iteratively sample a tree branch or choose a table tdj for sentence wdj by using the posterior probability

p(t_{dj} = t \mid t_{d(-j)}, w, y_d, \alpha_0, \eta) \propto p(t_{dj} = t \mid t_{d(-j)}, \alpha_0) \times p(w_{dj} \mid w_{d(-j)}, t_d, y_d, \eta)   (20)

where yd = {ydj, yd(−j)} denotes the set of theme labels for sentence wdj and all the remaining sentences wd(−j) of document wd, and td(−j) denotes the branches for document wd excluding those of sentence wdj. In Eq. (20), the snCRP parameter α0 and the Dirichlet prior parameter η provide control over the size of the inferred tree. The first term p(tdj = t | td(−j), α0) provides the prior on choosing a branch t for a sentence wdj according to the sentence-based nested CRP with the following metaphor. On a culinary vacation, the tourist or sentence wdj enters a restaurant and chooses a table or branch tdj = t using the CRP probability

p(t_{dj} = t \mid t_{d(-j)}, \alpha_0) =
\begin{cases}
\dfrac{m_{d,t}}{m_d - 1 + \alpha_0} & \text{if an occupied table } t \text{ is chosen}\\[4pt]
\dfrac{\alpha_0}{m_d - 1 + \alpha_0} & \text{if a new table is chosen}
\end{cases}   (21)
where md,t denotes the number of tourists or sentences in document wd who are seated at table t and md denotes the total number of sentences in wd. On the next evening, this tourist goes to the restaurant in the next layer, which is identified by the card on the table selected in the current evening. Eq. (21) is applied again to determine whether a new branch at this layer is drawn or not. The prior of a subtree td = {tdj} is accordingly determined in a nested fashion over different layers from different sentences in a document wd = {wdj}. The second term p(wdj | wd(−j), td, yd, η) is calculated by considering the marginal likelihood over the multinomial parameters θt,l = {θt,l,v} of a node ydj = l connected to tree branch tdj = t for all V words in the dictionary, where θt,l,v = p(wdi = v | tdj = t, ydj = l). The parameters θt,l in a tuple (t, l) are assumed to be Dirichlet distributed with a shared prior parameter η

p(\theta_{t,l} \mid \eta) = \frac{\Gamma\big(\sum_{v=1}^{V} \eta\big)}{\prod_{v=1}^{V} \Gamma(\eta)} \prod_{v=1}^{V} \theta_{t,l,v}^{\eta - 1}.   (22)

We can derive the likelihood function of a sentence wdj given a tree branch t and the other sentences wd(−j) of wd as

p(w_{dj} \mid w_{d(-j)}, t_d, y_d, \eta) = \frac{p(w_d \mid t_{dj} = t, y_d, \eta)}{p(w_{d(-j)} \mid t_{dj} = t, y_d, \eta)} = \prod_{l=1}^{L_d} \frac{\Gamma\big(\sum_{v=1}^{V} n_{d(-j),t,l,v} + V\eta\big)}{\prod_{v=1}^{V} \Gamma\big(n_{d(-j),t,l,v} + \eta\big)} \cdot \frac{\prod_{v=1}^{V} \Gamma\big(n_{d,t,l,v} + \eta\big)}{\Gamma\big(\sum_{v=1}^{V} n_{d,t,l,v} + V\eta\big)}   (23)

where Ld denotes the maximum number of nodes in the subtree td, nd,t,l,v denotes the number of words wdi = v in document wd which are allocated in tuple (t, l), and nd(−j),t,l,v denotes the number of words wdi = v of wd, excluding sentence wdj, which are allocated in tuple (t, l). Eq. (23) is obtained since the marginal likelihood is derived by

p(w_d \mid t, y_d, \eta) = \int p(w_d \mid t, y_d, \theta_{t,l}) \, p(\theta_{t,l} \mid \eta) \, d\theta_{t,l} = \int \prod_{l=1}^{L_d} \frac{\Gamma(V\eta)}{(\Gamma(\eta))^{V}} \prod_{v=1}^{V} \theta_{t,l,v}^{n_{d,t,l,v} + \eta - 1} \, d\theta_{t,l} \propto \prod_{l=1}^{L_d} \frac{\prod_{v=1}^{V} \Gamma\big(n_{d,t,l,v} + \eta\big)}{\Gamma\big(\sum_{v=1}^{V} n_{d,t,l,v} + V\eta\big)}   (24)

which is found by arranging an integral over a posterior Dirichlet distribution of θt,l with parameters {nd,t,l,v + η}. Intuitively, a model with larger α0 will tend to a tree with more branches, while a smaller η encourages fewer words in each tree node. The joint effect of a large α0 and a small η in the posterior probability is that more themes are required to explain the data. Using the snCRP, we use sentence wdj to find the corresponding branch tdj. The subtree branches td = {tdj} are drawn to accommodate the thematic variations in a heterogeneous document wd.
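A minimal sketch of Eqs. (21) and (23) is given below; it assumes the word counts for a candidate branch are kept in numpy arrays indexed by (node, word), which is a bookkeeping choice made only for this illustration.

from scipy.special import gammaln

def crp_branch_prior(m_dt, m_d, alpha0):
    """Eq. (21): prior of choosing an occupied branch (m_dt > 0) or a new one (m_dt = 0)."""
    return (m_dt if m_dt > 0 else alpha0) / (m_d - 1 + alpha0)

def log_sentence_likelihood(counts_with, counts_without, eta):
    """Eq. (23): collapsed multinomial-Dirichlet ratio.
    counts_with / counts_without: (L_d, V) word counts including / excluding sentence j."""
    V = counts_with.shape[1]
    log_p = (gammaln(counts_without.sum(axis=1) + V * eta)
             - gammaln(counts_with.sum(axis=1) + V * eta)
             + gammaln(counts_with + eta).sum(axis=1)
             - gammaln(counts_without + eta).sum(axis=1))
    return log_p.sum()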
B. Sampling for document-dependent themes

After finding the subtree branches td = {tdj} for the sentences in a document wd = {wdj} via the snCRP, we sample the document-dependent theme label ydj = l for each sentence wdj according to the posterior probability of the latent variable ydj given document wd and the current values of all the other latent variables

p(y_{dj} = l \mid w_d, y_{d(-j)}, t_d, \gamma_s, \eta) \propto p(y_{dj} = l \mid y_{d(-j)}, t_d, \gamma_s) \times p(w_{dj} \mid w_{d(-j)}, y_d, t_d, \eta).   (25)

The first term represents the distribution of the theme proportion of ydj = l, or lac, in a subtree td given the themes of the other sentences yd(−j). This distribution is calculated as an expectation over the treeGEM distribution which is determined by the TSBP. Considering the draw of the theme proportion for a child node lac given those for the parent node
la and the preceding child nodes {la1, . . . , la(c−1)} in Eq. (18), we derive the first term based on the subtree structure td

\begin{aligned}
p(y_{dj} = l_{ac} \mid y_{d(-j)}, \gamma_s) &= \mathbb{E}[\beta_{l_a} \mid y_{d(-j)}, \gamma_s] \times \mathbb{E}[\beta_{l_{ac}}' \mid y_{d(-j)}, \gamma_s] \prod_{j=1}^{c-1} \mathbb{E}\big[1 - \beta_{l_{aj}}' \mid y_{d(-j)}, \gamma_s\big]\\
&= p(y_{dj} = l_a \mid y_{d(-j)}, \gamma_s) \times \frac{1 + m_{d(-j),l_{ac}}}{1 + \gamma_s + \sum_{u=c}^{L_d} m_{d(-j),l_{au}}} \times \prod_{j=1}^{c-1} \frac{\gamma_s + \sum_{u=j+1}^{L_d} m_{d(-j),l_{au}}}{1 + \gamma_s + \sum_{u=j}^{L_d} m_{d(-j),l_{au}}}
\end{aligned}   (26)

which is a product of the expectations of the theme proportion of the parent node βla, the beta variable for the proportion of the current child node β′lac, and the beta variables for the remaining proportions of the preceding child nodes {β′la1, . . . , β′la(c−1)}. In Eq. (26), md(−j),lac denotes the number of sentences in wd(−j) which are allocated in node lac. The treeGEM parameter γs reflects a kind of pseudo-count of the observations for the proportion. The expectation over the beta variable E[β′ | yd(−j), γs] is derived by

\begin{aligned}
\int \beta' p(\beta' \mid y_{d(-j)}, \gamma_s) \, d\beta' &= \int \beta' p(\beta' \mid \gamma_s) \, p(y_{d(-j)} \mid \beta') \, d\beta'\\
&\propto \int \beta' (1-\beta')^{\gamma_s - 1} (\beta')^{m_{d(-j),l_{ac}}} (1-\beta')^{\sum_{n=c+1}^{L_d} m_{d(-j),l_{an}}} \, d\beta'\\
&= \frac{1 + m_{d(-j),l_{ac}}}{1 + m_{d(-j),l_{ac}} + \gamma_s + \sum_{n=c+1}^{L_d} m_{d(-j),l_{an}}}
\end{aligned}   (27)
which is obtained as the mean of the posterior beta variable p(β′ | yd(−j), γs) with parameters 1 + md(−j),lac and γs + Σn=c+1..Ld md(−j),lan. On the other hand, the second term of Eq. (25) for sampling a theme is the same as that of Eq. (20) for sampling a subtree branch; this term p(wdj | wd(−j), yd, td, η) has been derived in Eq. (23). There are twofold relations between the posterior probabilities of the nCRP and the snCRP. First, we treat the sentence for the snCRP as if it were the word for the nCRP. Second, the calculation of Eq. (26) is done recursively for each set of a parent node and its child nodes; using the nCRP, this calculation is performed by treating tree nodes in different layers at a flat level without considering the subtree structure. In general, the snCRP is completed by first choosing a subtree td for the sentences in a document wd and then assigning each sentence to one of the nodes in the chosen subtree according to the theme proportions βd drawn from the treeGEM. In the generation of the theme assignment ydj = l, the node l assigned to a sentence wdj might be connected to a branch tdj′ of the subtree td drawn by another sentence wdj′ from the same document wd. The variety of themes in a heterogeneous document wd is reflected by the subtree td which is drawn via the snCRP. A whole infinite tree is accordingly established and shared for all documents in a corpus w.
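The first term of Eq. (25) can be computed from the sentence-allocation counts alone. The sketch below is an illustration that assumes the counts md(−j),lau for the children of one parent are given as a numpy array in sibling order; it evaluates Eqs. (26)-(27) for every child of that parent.

def child_theme_prior(parent_prob, sibling_counts, gamma_s):
    """Eq. (26): p(y_dj = l_ac | ...) for each child c of a parent reached with parent_prob."""
    sibling_counts = np.asarray(sibling_counts, dtype=float)
    suffix = np.concatenate((np.cumsum(sibling_counts[::-1])[::-1], [0.0]))  # sum_{u>=c} m
    probs = np.empty(len(sibling_counts))
    log_remaining = 0.0
    for c, m_c in enumerate(sibling_counts):
        expect_b = (1.0 + m_c) / (1.0 + gamma_s + suffix[c])                 # E[beta'_{l_ac}], Eq. (27)
        probs[c] = parent_prob * np.exp(log_remaining) * expect_b
        log_remaining += np.log((gamma_s + suffix[c + 1]) / (1.0 + gamma_s + suffix[c]))
    return probs

# example: a parent reached with probability 0.6 and three children holding 4, 1 and 0 sentences
print(child_theme_prior(0.6, [4, 1, 0], gamma_s=1.0))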
C. Sampling for theme-dependent topics

Finally, we implement the tree-based HDP by sampling the theme-dependent topic zdi = k for each word wdji, or wdi, under a theme ydj = l based on the posterior probability

p(z_{di} = k \mid w, z_{-(di)}, y_{dj} = l, \gamma_w, \eta) \propto p(z_{di} = k \mid z_{-(di)}, y_{dj} = l, \gamma_w) \times p(w_{di} \mid w_{-(di)}, z, y_{dj} = l, \eta)   (28)

where z = {zdi, z−(di)} denotes the topic labels of word wdi and the remaining words w−(di) of a corpus w. In Eq. (28), the first term indicates the distribution of the topic proportion of zdi = k, which is calculated as an expectation over πl ∼ GEM(γw). This calculation is done via a word-level SBP with a control parameter γw. By drawing the beta variable and calculating the topic proportion similarly to Eqs. (17) and (18), we derive the first term in the form

p(z_{di} = k \mid z_{-(di)}, y_{dj} = l, \gamma_w) = \frac{1 + n_{-(di),l,k}}{1 + \gamma_w + \sum_{u=k}^{K_l} n_{-(di),l,u}} \times \prod_{j=1}^{k-1} \frac{\gamma_w + \sum_{u=j+1}^{K_l} n_{-(di),l,u}}{1 + \gamma_w + \sum_{u=j}^{K_l} n_{-(di),l,u}}   (29)

where Kl denotes the maximum number of topics in a theme l and n−(di),l,k denotes the number of words in w−(di) which are allocated in topic k of theme l. The words of the sentences in a node with theme l are treated as the grouped data to carry out the tree-based HDP. The terms in Eq. (29) are obtained as the expectations over the beta variable for the topic proportion of the current break πlk and the remaining proportions of the preceding breaks {πl1, . . . , πl(k−1)}. The second term of Eq. (28) calculates the probability of generating the word wdi = v given w−(di) and the current topic variables z in theme l as expressed by

p(w_{di} = v \mid w_{-(di)}, z, y_{dj} = l, \eta) = \frac{n_{-(di),l,v} + \eta}{\sum_{v=1}^{V} n_{-(di),l,v} + V\eta}   (30)

where n−(di),l,v denotes the number of words wdi = v in w−(di) which are allocated in theme l. Given the current state of the sampler, we iteratively sample each latent variable conditioned on the whole set of observations and the remaining variables. Given a text corpus w, we sequentially sample the subtree branch tdj = t and the theme ydj = l for each individual sentence wdj of document wd via the snCRP. After obtaining a sentence-based tree structure, we sequentially sample the theme-dependent topic zdi = k for each individual word wdi in a tree node based on the tree-based HDP. These samples are iteratively employed to update the corresponding posterior probabilities p(tdj = t | td(−j), w, yd, α0, η), p(ydj = l | wd, yd(−j), td, γs, η) and p(zdi = k | w, z−(di), ydj = l, γw, η) in the Gibbs sampling procedure. The true posteriors are approximated by running sufficient sampling iterations. The hierarchical theme and topic model is established by fulfilling the nCRP compound HDP.
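One Gibbs sweep therefore alternates the three conditional draws. The sketch below is an illustration only, with the counts supplied as numpy arrays (an assumption of this example); it combines the prior term of Eq. (29) with the likelihood term of Eq. (30) to resample one word's topic, reusing numpy and rng from the earlier sketches.

def sample_topic(word_v, n_lk, n_lv, gamma_w, eta, rng):
    """One draw of z_di under theme l: prior of Eq. (29) times likelihood of Eq. (30)."""
    n_lk = np.asarray(n_lk, dtype=float)                         # n_{-(di),l,k} for k = 1..K_l
    suffix = np.concatenate((np.cumsum(n_lk[::-1])[::-1], [0.0]))
    prior = np.empty(len(n_lk))
    log_remaining = 0.0
    for k, n_k in enumerate(n_lk):
        prior[k] = np.exp(log_remaining) * (1.0 + n_k) / (1.0 + gamma_w + suffix[k])
        log_remaining += np.log((gamma_w + suffix[k + 1]) / (1.0 + gamma_w + suffix[k]))
    likelihood = (n_lv[word_v] + eta) / (n_lv.sum() + len(n_lv) * eta)   # Eq. (30)
    post = prior * likelihood
    return rng.choice(len(post), p=post / post.sum())

# One sweep then iterates over the corpus: for each sentence resample t_dj with Eqs. (20)-(23),
# resample y_dj with Eqs. (25)-(27), and for each word resample z_di with sample_topic.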
V. EXPERIMENTS

A. Experimental setup

The experiments were performed by using four public-domain corpora: WSJ (Wall Street Journal) 1987-1992, AP (Associated Press) 1988-1990, NIPS (http://arbylon.net/resources.html and https://archive.ics.uci.edu/ml/datasets/Bag+of+Words) and DUC (Document Understanding Conference) 2007 (http://duc.nist.gov). The WSJ, AP and NIPS corpora contained news documents and conference papers which were used for the evaluation of document representation. The perplexity of the test documents wtest = {wd | d = 1, . . . , D} was calculated by

\mathrm{Perplexity}(w_{\mathrm{test}}) = \exp\left\{ - \frac{\sum_{d=1}^{D} \log p(w_d)}{\sum_{d=1}^{D} N_d} \right\}.   (31)

A better model attains a lower perplexity on the test documents. To investigate the topic coherence of the estimated topic models, we further calculated the pointwise mutual information (PMI) [32]

\mathrm{PMI}(w_{\mathrm{test}}) = \frac{1}{45} \sum_{i < j} \log \frac{p(w_i, w_j)}{p(w_i) \, p(w_j)}   (32)
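For reference, the two evaluation measures can be computed as in the sketch below; it assumes the per-document log-likelihoods, the pairwise co-occurrence probabilities and the marginal word probabilities are already estimated and stored in plain dictionaries, and that the 45 pairs come from the 10 top words of a topic, which is the convention implied by Eq. (32).

from itertools import combinations

def perplexity(log_likelihoods, doc_lengths):
    """Eq. (31): exp of the negative total log-likelihood per word."""
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

def pmi_top_words(top_words, p_joint, p_marginal, eps=1e-12):
    """Eq. (32): average PMI over the pairs of a topic's top words (45 pairs for 10 words)."""
    pairs = list(combinations(top_words, 2))
    scores = [np.log((p_joint[(wi, wj)] + eps) / (p_marginal[wi] * p_marginal[wj] + eps))
              for wi, wj in pairs]
    return np.sum(scores) / len(pairs)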