Text Mining & NLP & ML
(Hierarchical) Topic Modeling
Yueshen Xu (lecturer)
[email protected] /
[email protected]
Data and Knowledge Engineering Research Center
Xidian University
Outline
Background
Some Concepts
Topic Modeling
  Probabilistic Latent Semantic Indexing (PLSI)
  Latent Dirichlet Allocation (LDA)
  Parameter Estimation
Hierarchical Topic Modeling
  Chinese Restaurant Process (CRP)
Supplement & Reference
Scope: basics, not state-of-the-art
Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model
2016/12/27
Software Engineering
Background
Information overload: Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
We need: summarization, visualization, dimensionality reduction
Background
Text Summarization
  Document summarization: What do these docs (or this doc) talk about?
  Review summarization: What do these consumers care about or complain about?
  Short text/tweet summarization: What are people discussing?
Basic requirements: automatic, applicable, explainable
Some Concepts
General concepts: Machine Learning, Data Mining, Information Retrieval, Text Mining, Natural Language Processing, Computational Linguistics, Machine Translation, Dimension Reduction
Topic Modeling: to learn the latent topics from a corpus/document
[Figure: concept map placing Topic Modeling and Latent Semantic Analysis (LSA) among these fields; LSA and topic models are labeled as dimension reduction]
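Topic Modeling
Topic modeling: an example on a Chinese corpus (from my doctoral thesis)
Corpus
  Doc 1: The US dollar's status as the main international currency will remain irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. Speaking recently at the University of Maryland in the US, IMF Managing Director Lagarde called for international governance reform to recognize the reality that emerging economies are increasingly important.
  Doc 2: After being "weaned" from their parent universities, independent colleges may face growing pains in branding, enrollment, and other areas. But after the national, provincial, and municipal governments issued implementation guidelines encouraging private capital to enter education, some independent colleges decisively cut the "umbilical cord" linking them to their parent universities and set out on their own.
  Doc 3: Continue to implement a prudent monetary policy, keep it appropriately balanced between tight and loose with timely anticipatory adjustments and fine-tuning, coordinate with the supply-side structure, and make combined use of quantity, price, and other monetary policies.
  Doc 4: In terms of personnel numbers, this reform goes far beyond the scale of the troop cuts; it is a structural reform and a key step in modernizing the military's organizational structure.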
Topic Modeling
After topic modeling
Corpus: the same four documents (Doc 1 to Doc 4) as on the previous slide
Words highlighted in Doc 1: US dollar (美元), currency (货币), push (推动), balanced (均衡), develop (发展), university (大学), economy (经济)
Learned topics (top words with probabilities):
  Topic 1: finance (金融) 0.074, currency (货币) 0.051, …
  Topic 2: policy (政策) 0.082, reform (改革) 0.063, …
  Topic 3: college (学院) 0.077, education (教育) 0.071, …
  Topic 4: military (军队) 0.083, organization (组织) 0.079, …
  …
Topic Modeling
A topic
  A word cluster: a group of words
  Not clustered randomly, but meaningfully (not semantically)
Models
  Parametric models: Latent Semantic Indexing (LSI), PLSI, Latent Dirichlet Allocation (LDA)
  Non-parametric models (Dirichlet Process): (Nested) Chinese Restaurant Process, Indian Buffet Process, Pitman-Yor Process
Topic Modeling
pLSI Model
One layer of a 'Deep Neural Network'
[Figure: layered graph linking documents d1, d2, …, dM to latent topics z1, z2, …, zK to words w1, w2, …, wN]
Assumptions
  Pairs (d, w) are assumed to be generated independently
  Conditioned on z, w is generated independently of d
  Words in a document are exchangeable
  Documents are exchangeable
  Latent topics z are independent
The generative process
  Select a document d with probability p(d)
  Pick a latent topic z with probability p(z | d) (multinomial distribution)
  Generate a word w with probability p(w | z) (multinomial distribution)
$$p(d, w) = p(w \mid d)\,p(d) = p(d)\sum_{z \in Z} p(w, z \mid d) = p(d)\sum_{z \in Z} p(w \mid z)\,p(z \mid d)$$
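The pLSI factorization p(w | d) = Σ_z p(w | z) p(z | d) is usually fitted with EM. The following is a minimal sketch, not the lecturer's implementation: the function name `plsi_em`, the toy count matrix, and the small smoothing constant `1e-12` are all assumptions made for illustration.

```python
import numpy as np

def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSI by EM on a document-word count matrix of shape (M, V)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation of p(z|d) and p(w|z); each row sums to 1.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    log_liks = []
    for _ in range(n_iter):
        # Mixture decomposition: p(w|d) = sum_z p(w|z) p(z|d), shape (M, V).
        p_w_d = p_z_d @ p_w_z
        log_liks.append(float((counts * np.log(p_w_d + 1e-12)).sum()))
        # E-step: posterior p(z|d,w) proportional to p(z|d) p(w|z), shape (M, K, V).
        resp = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both multinomials from expected counts n(d,w) p(z|d,w).
        expected = counts[:, None, :] * resp
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z, log_liks

# Hypothetical toy corpus: 4 documents over a 4-word vocabulary.
counts = np.array([[4, 3, 0, 0],
                   [3, 5, 0, 1],
                   [0, 0, 4, 4],
                   [0, 1, 3, 5]], dtype=float)
p_z_d, p_w_z, log_liks = plsi_em(counts, n_topics=2)
print(np.round(p_z_d, 2))
```

The log-likelihood tracked here omits the p(d) factor, which does not depend on the topic parameters.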
Topic Modeling
Latent Dirichlet Allocation (LDA)
David M. Blei, Andrew Y. Ng, Michael I. Jordan
Hierarchical Bayesian model; Bayesian pLSI
Generative process of LDA
For each document $d = \{w_1, w_2, \ldots, w_N\}$:
  1. Choose $N \sim \mathrm{Poisson}(\xi)$
  2. Choose $\theta \sim \mathrm{Dir}(\alpha)$
  3. For each of the N words $w_n$ in d:
     a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
     b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on $z_n$
[Figure: plate diagram with nodes α, θ, z, w, β; the plates N (words) and M (documents) denote the number of repetitions ("iterative times")]
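The generative process of LDA can be sketched directly with NumPy's random generator. This is an illustrative sketch only: the function name `generate_lda_corpus`, the toy topic-word matrix `beta`, and all numeric settings are assumptions, and words are represented by integer vocabulary indices.

```python
import numpy as np

def generate_lda_corpus(n_docs, alpha, beta, xi, seed=0):
    """Sample documents from the LDA generative process.

    alpha: Dirichlet parameter vector of length K (assumed known).
    beta:  K x V matrix; row k is the word distribution of topic k.
    xi:    Poisson mean for the document length.
    """
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = beta.shape
    docs, thetas = [], []
    for _ in range(n_docs):
        n_words = rng.poisson(xi)                     # N ~ Poisson(xi)
        theta = rng.dirichlet(alpha)                  # theta ~ Dir(alpha)
        z = rng.choice(n_topics, size=n_words, p=theta)  # z_n ~ Multinomial(theta)
        # w_n ~ p(w | z_n, beta): draw each word from the row of its topic.
        words = np.array([rng.choice(vocab_size, p=beta[k]) for k in z], dtype=int)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas

# Hypothetical setup: 2 topics over a 6-word vocabulary.
beta = np.array([[0.4, 0.4, 0.1, 0.05, 0.03, 0.02],
                 [0.02, 0.03, 0.05, 0.1, 0.4, 0.4]])
alpha = np.array([0.5, 0.5])
docs, thetas = generate_lda_corpus(n_docs=5, alpha=alpha, beta=beta, xi=20)
for words, theta in zip(docs, thetas):
    print(np.round(theta, 2), words[:10])
```

Inference runs this process in reverse: given only the words, it recovers plausible values of θ, z, and β.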
Topic Modeling
Parameter Estimation
Variational Inference (+EM) || Gibbs Sampling (MCMC)
Variational EM Algorithm
Aim: $(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta} \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$
  Initialize α, β
  Repeat until convergence:
    E-step: with α, β fixed, use variational inference to fit each document's variational parameters, which gives a tractable approximation (lower bound) of the likelihood
    M-step: maximize the approximated likelihood with respect to α, β
The key takeaway: EM is quite important
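Besides variational EM, the slide names Gibbs sampling as the other common estimation route. The sketch below is a collapsed Gibbs sampler, not the variational EM algorithm outlined above. The function name, the toy documents, and the fixed symmetric hyperparameters `alpha` and `eta` (the prior on topic-word distributions) are assumptions for illustration; the hyperparameters are held fixed rather than estimated.

```python
import numpy as np

def lda_collapsed_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01,
                        n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with fixed symmetric priors."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total words per topic
    z = []
    # Random initial topic assignment for every word token.
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional p(z_i = k | all other assignments, words).
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + vocab_size * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Point estimates of document-topic and topic-word distributions.
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, phi

# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 0, 1, 2], [0, 0, 1, 1], [3, 4, 3, 4, 5], [4, 4, 3, 5]]
theta, phi = lda_collapsed_gibbs(toy_docs, n_topics=2, vocab_size=6)
print(np.round(theta, 2))
```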
Hierarchical Topic Modeling
Topic modeling alone is not enough: we also want the hierarchical structure among topics
Hierarchical Topic Modeling
Chinese Restaurant Process (Dirichlet Process)
A restaurant has an infinite number of tables, and customers (words) enter it sequentially. The i-th customer ($\theta_i$) sits at an occupied table ($\phi_k$) with probability proportional to the number of customers already seated there, or opens a new table with probability proportional to a concentration parameter.
$\phi_k$: a cluster
Clustering is roughly half of unsupervised learning: clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
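The seating rule of the Chinese Restaurant Process can be simulated in a few lines. In this sketch the concentration parameter is called `gamma` (an assumed name), and the standard rule is used: customer i joins occupied table k with probability n_k / (i - 1 + gamma) and opens a new table with probability gamma / (i - 1 + gamma), where n_k is the number of customers already at table k.

```python
import numpy as np

def sample_crp(n_customers, gamma, seed=0):
    """Seat customers one by one according to the CRP with concentration gamma."""
    rng = np.random.default_rng(seed)
    table_counts = []   # table_counts[k] = customers currently at table k
    assignments = []    # assignments[i] = table index of customer i
    for _ in range(n_customers):
        # Weights: one per occupied table (its size), plus gamma for a new table.
        weights = np.array(table_counts + [gamma], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(table_counts):
            table_counts.append(1)   # open a new table
        else:
            table_counts[k] += 1     # join an occupied table
        assignments.append(k)
    return assignments, table_counts

assignments, table_counts = sample_crp(n_customers=50, gamma=1.0)
print(len(table_counts), "tables with sizes", table_counts)
```

Large tables attract more customers ("rich get richer"), and the number of tables is not fixed in advance, which is why the CRP serves as a prior over an unbounded number of clusters.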
Hierarchical Topic Modeling
The generative process (nested CRP)
Focus on the insight
  1. Let $c_1$ be the root restaurant (only one table)
  2. For each level $l \in \{2, \ldots, L\}$:
     Draw a table from restaurant $c_{l-1}$ using the CRP. Set $c_l$ to be the restaurant referred to by that table
  3. Draw an L-dimensional topic proportion vector $\theta \sim \mathrm{Dir}(\alpha)$
  4. For each word $w_n$:
     Draw $z \in \{1, \ldots, L\} \sim \mathrm{Mult}(\theta)$
     Draw $w_n$ from the topic associated with restaurant $c_z$
Analogy: Matryoshka (Russian) doll
[Figure: graphical model with nodes γ, T, c1, c2, …, cL, α, z_{m,n}, w_{m,n}, β, k and plates N and M]
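The nested CRP steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: restaurants are identified by the tuple of table choices leading to them (the root is the empty tuple), each restaurant's topic is drawn lazily from a symmetric Dirichlet with an assumed parameter `eta`, document length is fixed instead of random, and all function names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def sample_ncrp_path(child_counts, depth, gamma, rng):
    """Walk from the root down to level `depth`, choosing a table by CRP at each level."""
    path = ()  # the root restaurant c_1
    for _ in range(depth - 1):
        counts = child_counts[path]  # customers per table in the current restaurant
        weights = np.array(counts + [gamma], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(counts):
            counts.append(0)         # a new table, hence a new child restaurant
        counts[k] += 1
        path = path + (k,)           # move to the restaurant that table refers to
    return path

def generate_hlda_corpus(n_docs, depth, vocab_size, n_words,
                         gamma=1.0, alpha=1.0, eta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    child_counts = defaultdict(list)
    topics = {}   # restaurant id -> word distribution of its topic
    corpus = []
    for _ in range(n_docs):
        leaf = sample_ncrp_path(child_counts, depth, gamma, rng)
        nodes = [leaf[:l] for l in range(depth)]   # c_1, ..., c_L along the path
        for node in nodes:
            if node not in topics:
                topics[node] = rng.dirichlet(np.full(vocab_size, eta))
        theta = rng.dirichlet(np.full(depth, alpha))       # theta ~ Dir(alpha)
        levels = rng.choice(depth, size=n_words, p=theta)  # z ~ Mult(theta)
        words = [int(rng.choice(vocab_size, p=topics[nodes[l]])) for l in levels]
        corpus.append((nodes, words))
    return corpus, topics

corpus, topics = generate_hlda_corpus(n_docs=10, depth=3, vocab_size=15, n_words=8)
for nodes, words in corpus[:3]:
    print(nodes[-1], words)
```

Every document shares the root topic, while documents whose paths diverge lower in the tree share only the more general topics, which is the "doll inside a doll" insight of the slide.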
Hierarchical Topic Modeling
Examples
Root topic: analysis, obtain, base, system, concentration
Topics below the root (top words):
• activity, compound, acid, derivative, active
• thermal, polymer, acid, property, diamine
• reaction, derivative, yield, synthesis, microwave
• assay, food, quality, content, analysis
• decoction, component, radix, quality, constituent
• compound, ligand, group, investigate, synergistic
• compound, activity, synthesize, salt, derivative
• antioxidant, activity, extract, inhibitory, flavonoid
• interaction, cation, metal, energy, solution
Supplement
Some supplements
Probabilistic Graphical Model
  Modeling a Bayesian network using plates and circles
Generative Model & Discriminative Model: $p(\theta \mid X)$, where X is the data
  Generative model: $p(\theta \mid X) \propto p(X \mid \theta)\,p(\theta)$
    Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, …: mostly unsupervised learning (Naïve Bayes is a supervised exception)
  Discriminative model: models $p(\theta \mid X)$ directly
    LR, KNN, SVM, Boosting, Decision Tree: supervised learning
    Can also be represented by graphical models
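The relation p(θ|X) ∝ p(X|θ)p(θ) can be checked numerically on a small example. The sketch below assumes a coin-flip model that is not in the slides: θ is the probability of heads, the prior is uniform, and the data (7 heads, 3 tails) are made up for illustration. The posterior is evaluated on a grid and normalised.

```python
import numpy as np

# Grid of candidate values for theta, the (unknown) probability of heads.
theta_grid = np.linspace(0.001, 0.999, 999)

# Assumed prior p(theta): uniform over the grid.
prior = np.ones_like(theta_grid)

# Hypothetical data X: 7 heads and 3 tails.
heads, tails = 7, 3
likelihood = theta_grid**heads * (1.0 - theta_grid)**tails  # p(X | theta)

# Posterior is proportional to likelihood times prior; normalise over the grid.
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()

map_estimate = theta_grid[np.argmax(posterior)]
posterior_mean = float((theta_grid * posterior).sum())
print(round(float(map_estimate), 3), round(posterior_mean, 3))
```

With a uniform prior the posterior mode coincides with the maximum-likelihood estimate heads / (heads + tails); a non-uniform prior would pull it away from that value.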
Reference
My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU as a Ph.D. student)
• 'Topic modeling (an introduction)'
• 'Non-parametric Bayesian learning in discrete data'
• 'The research of topic modeling in text mining'
• 'Matrix factorization with user generated content'
• …, etc.
Website: you can download all of my slides
• http://web.xidian.edu.cn/ysxu/teach.html
• http://liu.cs.uic.edu/yueshenxu/
• http://www.slideshare.net/obamaxys2011
• https://www.researchgate.net/profile/Yueshen_Xu
Reference
• David Blei et al. Latent Dirichlet Allocation. JMLR, 2003
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014
Q&A