(Hierarchical) Topic Modeling

Text Mining & NLP & ML


Yueshen Xu (lecturer)

Data and Knowledge Engineering Research Center

Xidian University

Outline
• Background
• Some Concepts
• Topic Modeling
  • Probabilistic Latent Semantic Indexing (PLSI)
  • Latent Dirichlet Allocation (LDA)
• Hierarchical Topic Modeling
  • Chinese Restaurant Process (CRP)
• Parameter Estimation
• Supplement & Reference

Basics, not state-of-the-art

Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model

2016/12/27

Software Engineering

Background
• Information overload: Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, …, etc.
• We need: summarization, visualization, dimensionality reduction

Background
• Text summarization
  • Document summarization: what do these docs (or this doc) talk about?
  • Review summarization: what do these consumers care about or complain about?
  • Short text/tweet summarization: what are people discussing?
• Basic requirements: automatic, applicable, explainable

Some Concepts
• General concepts
  • Text mining
  • Natural language processing
  • Computational linguistics
  • Information retrieval
  • Dimension reduction
  • Topic modeling: to learn the latent topics from a corpus/document
  • Latent semantic analysis (LSA)

[Figure: diagram relating information retrieval, machine learning, data mining, text mining, natural language processing, computational linguistics, machine translation, dimension reduction, LSA, and topic modeling]

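Topic Modeling
• Topic modeling: an example from a Chinese corpus (from my doctoral thesis)

Corpus
Doc 1: The US dollar's status as the main international currency remains irreplaceable for the foreseeable future; the only way out is to push global governance in a more balanced direction. In a recent speech at the University of Maryland, IMF Managing Director Lagarde called for international governance reform to recognize the reality that emerging economies are increasingly important.
Doc 2: After being "weaned" from their parent universities, independent colleges may face growing pains in branding, enrollment, and other areas. However, after the national and provincial/municipal guidelines encouraging private capital to enter education were issued, some independent colleges decisively cut the "umbilical cord" to their parent universities and set out on their own.
Doc 3: Continue to implement a prudent monetary policy, keep it appropriately balanced between tight and loose with timely anticipatory adjustments and fine-tuning, coordinate well with the supply-side structure, and make combined use of quantity-based, price-based, and other monetary policies.
Doc 4: In terms of personnel numbers, this reform goes far beyond the size of the troop reduction; it is a structural reform and a key step in modernizing the military's organizational structure.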

Topic Modeling
• After topic modeling: the same corpus (Doc 1 to Doc 4), with the words of each document highlighted by topic and the top words of each topic listed with their probabilities

Topic     Top words (probability)
Topic 1   finance (金融) 0.074, currency (货币) 0.051, …
Topic 2   policy (政策) 0.082, reform (改革) 0.063, …
Topic 3   college (学院) 0.077, education (教育) 0.071, …
Topic 4   military (军队) 0.083, organization (组织) 0.079, …

Highlighted words in Doc 1: US dollar (美元), currency (货币), promote (推动), balanced (均衡), development (发展), university (大学), economy (经济)

Topic Modeling
• A topic
  • A word cluster, i.e. a group of words
  • Not clustered randomly, but meaningfully (not semantically)
• Models
  • Parametric models
    • Latent Semantic Indexing (LSI)
    • PLSI; Latent Dirichlet Allocation (LDA)
  • Non-parametric models (Dirichlet Process)
    • (Nested) Chinese Restaurant Process
    • Indian Buffet Process
    • Pitman-Yor Process

Topic Modeling
• pLSI model (one layer of a "deep neural network")
  • Pairs (d, w) are assumed to be generated independently
• Assumptions
  • Conditioned on z, w is generated independently of d
  • Words in a document are exchangeable
  • Documents are exchangeable
  • Latent topics z are independent
• The generative process

\[
p(d, w) = p(w \mid d)\, p(d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(w \mid z)\, p(z \mid d)
\]

where p(w | z) and p(z | d) are multinomial distributions.

[Figure: graphical model d → z → w annotated with p(d), p(z | d), p(w | z), drawn as three connected layers d_1 … d_M, z_1 … z_K, w_1 … w_N]
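A minimal sketch of the pLSI decomposition p(d, w) = p(d) Σ_z p(w | z) p(z | d), written with NumPy. The toy sizes and the randomly drawn parameter tables are assumptions for illustration only; they are not from the slides.

```python
import numpy as np

# Hypothetical toy sizes: M documents, K latent topics, N vocabulary words.
M, K, N = 3, 2, 4
rng = np.random.default_rng(0)

# p(d): a distribution over documents.
p_d = rng.dirichlet(np.ones(M))
# p(z | d): one multinomial over topics per document, shape (M, K).
p_z_given_d = rng.dirichlet(np.ones(K), size=M)
# p(w | z): one multinomial over words per topic, shape (K, N).
p_w_given_z = rng.dirichlet(np.ones(N), size=K)


def plsi_joint(p_d, p_z_given_d, p_w_given_z):
    """Return the (M, N) matrix p(d, w) = p(d) * sum_z p(w | z) p(z | d)."""
    # The matrix product sums over the latent topic z.
    p_w_given_d = p_z_given_d @ p_w_given_z
    return p_d[:, None] * p_w_given_d


joint = plsi_joint(p_d, p_z_given_d, p_w_given_z)
```

Summing `joint` over words gives back p(d), and the whole matrix sums to 1, which is a quick sanity check that the factorization is a proper joint distribution.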

Topic Modeling
• Latent Dirichlet Allocation (LDA)
  • David M. Blei, Andrew Y. Ng, Michael I. Jordan
  • Hierarchical Bayesian model; Bayesian pLSI
• Generative process of LDA, for each document d = {w_1, w_2, …, w_N}:
  1. Choose N ~ Poisson(ξ)
  2. Choose θ ~ Dir(α)
  3. For each of the N words w_n in d:
     a) Choose a topic z_n ~ Multinomial(θ)
     b) Choose a word w_n from p(w_n | z_n, β), a multinomial distribution conditioned on z_n

[Figure: plate diagram of LDA with nodes α, θ, z, w, β; plate N (words) and plate M (documents) mark the repeated steps]
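The generative process of LDA can be sketched as a sampler. This is an illustrative sketch only: the function name `generate_lda_corpus`, the toy topic-word matrix `beta`, and all numeric settings are assumptions, not values from the slides.

```python
import numpy as np


def generate_lda_corpus(M, alpha, beta, xi, rng):
    """Sample M documents from the LDA generative process.

    alpha: (K,) Dirichlet parameter for the topic proportions.
    beta:  (K, V) matrix, row k is the word distribution of topic k.
    xi:    Poisson mean of the document length.
    """
    K, V = beta.shape
    docs = []
    for _ in range(M):
        N = rng.poisson(xi)                 # choose N ~ Poisson(xi)
        theta = rng.dirichlet(alpha)        # choose theta ~ Dir(alpha)
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)      # a) topic z_n ~ Multinomial(theta)
            w = rng.choice(V, p=beta[z])    # b) word w_n ~ p(w | z_n, beta)
            words.append(int(w))
        docs.append(words)
    return docs


rng = np.random.default_rng(0)
# Hypothetical topics over a 4-word vocabulary: topic 0 uses words 0-1, topic 1 uses words 2-3.
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5]])
docs = generate_lda_corpus(M=5, alpha=np.array([0.5, 0.5]), beta=beta, xi=8, rng=rng)
```

Each document is a list of word ids; documents whose θ favors one topic mostly contain that topic's words.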

Topic Modeling
• Parameter estimation
  • Variational inference (+EM) or Gibbs sampling (MCMC)
• Variational EM algorithm
  • Aim: \( (\alpha^*, \beta^*) = \arg\max_{\alpha, \beta} \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta) \)
  • Initialize α, β
  • Repeat until convergence:
    • E-step: with α, β fixed, approximate the likelihood through variational inference
    • M-step: maximize the approximate likelihood with respect to α, β

I just want you to know: EM is quite important
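Variational EM is too long to show compactly, so the sketch below illustrates the other estimation route named on this slide, Gibbs sampling, in its collapsed form for LDA. It is a minimal sketch under assumptions: symmetric Dirichlet priors, a hypothetical smoothing parameter `eta` on the topic-word distributions, and a toy corpus of word ids.

```python
import numpy as np


def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA with symmetric priors alpha and eta."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total number of tokens per topic
    z = []                            # topic assignment of every token
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))   # random initial topics
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # p(z_i = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta).
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Point estimates of document-topic and topic-word distributions.
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, phi


# Hypothetical toy corpus over a 4-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1, 1], [3, 2, 3, 2]]
theta, phi = lda_gibbs(toy_docs, K=2, V=4)
```

Here `theta` has one row of topic proportions per document and `phi` one row of word probabilities per topic; both are estimated from the final counts of a single chain, which is a simplification.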

Hierarchical Topic Modeling
• Topic modeling is not enough

[Figure: hierarchical structure]

Hierarchical Topic Modeling
• Chinese Restaurant Process (Dirichlet Process)
  • A restaurant has an infinite number of tables, and customers (words) enter this restaurant sequentially. The i-th customer (θ_i) sits at a table (φ_k) with a probability given by the CRP seating rule.
  • φ_k: clustering, which is about half of unsupervised learning
  • Applications: clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
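The seating rule can be simulated in a few lines. The sketch assumes the standard CRP rule: with i customers already seated, the next customer joins existing table k with probability n_k / (i + γ), where n_k is the number of customers at that table, and opens a new table with probability γ / (i + γ). The name `gamma` for the concentration parameter and the function name `sample_crp` are choices made here, not taken from the slide.

```python
import numpy as np


def sample_crp(n_customers, gamma, rng):
    """Seat customers one by one; return each customer's table and the table sizes."""
    assignments = []
    counts = []                      # counts[k] = customers currently at table k
    for i in range(n_customers):     # i customers are already seated
        # Existing table k: counts[k] / (i + gamma); new table: gamma / (i + gamma).
        probs = np.array(counts + [gamma], dtype=float) / (i + gamma)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)         # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


assignments, counts = sample_crp(50, gamma=1.0, rng=np.random.default_rng(0))
```

Tables with many customers attract more customers, so the number of occupied tables (clusters) grows slowly and does not have to be fixed in advance.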

Hierarchical Topic Modeling
• The generative process (nested CRP); focus on the insight
  1. Let c_1 be the root restaurant (only one table)
  2. For each level l ∈ {2, …, L}: draw a table from restaurant c_{l−1} using the CRP, and set c_l to be the restaurant referred to by that table
  3. Draw an L-dimensional topic proportion vector θ ~ Dir(α)
  4. For each word w_n:
     • Draw z ∈ {1, …, L} ~ Mult(θ)
     • Draw w_n from the topic associated with restaurant c_z
• The nesting works like a Matryoshka (Russian) doll

[Figure: plate diagram of the nested CRP model with nodes γ, c_1, c_2, …, c_L, α, z_{m,n}, w_{m,n}, β and plates T, N, M]
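The four steps can be sketched as follows. This is a sketch under assumptions, not the reference implementation: the tree is stored as a dictionary from path prefixes to child counts, each node's topic is drawn lazily from a symmetric Dirichlet, and every function name and numeric setting is hypothetical.

```python
import numpy as np


def crp_choice(counts, gamma, rng):
    """Pick an existing table (index < len(counts)) or a new one (index == len(counts))."""
    probs = np.array(counts + [gamma], dtype=float)
    return int(rng.choice(len(probs), p=probs / probs.sum()))


def sample_ncrp_path(tree, L, gamma, rng):
    """Draw one path c_1..c_L; `tree` maps a path prefix (tuple) to its child counts."""
    path = ()                                   # step 1: c_1 is the root restaurant
    for _ in range(2, L + 1):                   # step 2: levels 2..L
        counts = tree.setdefault(path, [])
        k = crp_choice(counts, gamma, rng)
        if k == len(counts):
            counts.append(0)                    # a new table points to a new restaurant
        counts[k] += 1
        path = path + (k,)
    return path


def generate_document(path, topics, alpha, n_words, V, rng):
    """Generate word ids for a document whose path through the tree is fixed."""
    L = len(path) + 1
    theta = rng.dirichlet(alpha * np.ones(L))   # step 3: proportions over levels
    words = []
    for _ in range(n_words):
        z = rng.choice(L, p=theta)              # step 4: pick a level (0-based here)
        node = path[:z]                         # the restaurant at that level
        if node not in topics:                  # assumption: topic ~ symmetric Dirichlet
            topics[node] = rng.dirichlet(np.ones(V))
        words.append(int(rng.choice(V, p=topics[node])))
    return words


rng = np.random.default_rng(0)
tree, topics = {}, {}
paths = [sample_ncrp_path(tree, L=3, gamma=1.0, rng=rng) for _ in range(10)]
docs = [generate_document(p, topics, alpha=1.0, n_words=20, V=6, rng=rng) for p in paths]
```

Documents that share a path prefix share the topics along that prefix, so the root topic is used by every document while deeper topics are shared by fewer documents.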

Hierarchical Topic Modeling
• Examples (top words of each topic in a learned hierarchy)
  • Root topic: analysis, obtain, base, system, concentration
  • Descendant topics:
    • activity, compound, acid, derivative, active
    • thermal, polymer, acid, property, diamine
    • reaction, derivative, yield, synthesis, microwave
    • assay, food, quality, content, analysis
    • decoction, component, radix, quality, constituent
    • compound, ligand, group, investigate, synergistic
    • compound, activity, synthesize, salt, derivative
    • antioxidant, activity, extract, inhibitory, flavonoid
    • interaction, cation, metal, energy, solution

Supplement
• Some supplements
  • Probabilistic graphical model: modeling a Bayesian network using plates and circles
  • Generative model & discriminative model: p(θ | X), where X is the data
    • Generative model: p(θ | X) ∝ p(X | θ) p(θ); e.g. Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, …; mostly unsupervised learning
    • Discriminative model: p(θ | X) is modeled directly; e.g. LR, KNN, SVM, boosting, decision tree; supervised learning. These can also be represented by graphical models

Reference
• My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)
  • "Topic modeling (an introduction)"
  • "Non-parametric Bayesian learning in discrete data"
  • "The research of topic modeling in text mining"
  • "Matrix factorization with user generated content"
  • …, etc.
• Website: you can download all of my slides
  • http://web.xidian.edu.cn/ysxu/teach.html
  • http://liu.cs.uic.edu/yueshenxu/
  • http://www.slideshare.net/obamaxys2011
  • https://www.researchgate.net/profile/Yueshen_Xu

Reference
• David Blei et al. Latent Dirichlet Allocation. JMLR, 2003
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014

Q&A