Statistics & Computational Linguistics
Non-parametric Bayesian Learning in Discrete Data
Yueshen Xu
[email protected] /
[email protected]
Middleware, CCNT, ZJU
Outline

- Bayes’ Rule
- Parametric Bayesian Learning
  - Concept & Example
  - Discrete & Continuous Data
  - Text Clustering & Topic Modeling
  - Pros and Cons
  - Some Important Concepts
- Non-parametric Bayesian Learning
  - Dirichlet Process and Process Construction
  - Dirichlet Process Mixture
  - Hierarchical Dirichlet Process
  - Chinese Restaurant Process
  - Example: Hierarchical Topic Modeling
- Markov Chain Monte Carlo
- Reference
- Discussion
Bayes’ Rule

Posterior ∝ Likelihood × Prior

p(Hypothesis | Data) = p(Data | Hypothesis) · p(Hypothesis) / p(Data)

where p(Data | Hypothesis) is the likelihood, p(Hypothesis) the prior, p(Data) the evidence, and p(Hypothesis | Data) the posterior.

Update beliefs in hypotheses in response to data.

Parametric or non-parametric: whether or not the structure of the hypothesis is constrained (examples follow later).

The prior encodes your confidence in the hypothesis before seeing the data.
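As a concrete illustration of the rule, a minimal sketch of a Bayesian update over a discretized set of hypotheses (the grid of coin biases and the flip data are made up):

import numpy as np

# Hypotheses: candidate values for a coin's heads-probability.
hypotheses = np.linspace(0.05, 0.95, 19)
prior = np.full(len(hypotheses), 1.0 / len(hypotheses))  # uniform prior

# Data: 7 heads out of 10 flips.
heads, flips = 7, 10
likelihood = hypotheses**heads * (1 - hypotheses)**(flips - heads)

# Posterior ∝ prior * likelihood; the evidence is just the normalizer.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

print("MAP hypothesis:", hypotheses[np.argmax(posterior)])  # 0.7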
Parametric Bayesian Learning

The evidence is a fact of the data: it is a constant with respect to the hypothesis, not a probability we need to model, so a commonly used trick is to drop it:

p(Hypothesis | Data) ∝ p(Data | Hypothesis) · p(Hypothesis),  i.e.  p(θ|X) ∝ p(X|θ) p(θ)

Parametric or non-parametric refers to the hypothesis. Non-parametric != no parameters.

Example: the Dirichlet distribution

Dir(θ | α) = Γ(α₀) / (Γ(α₁) ⋯ Γ(α_K)) ∏_{k=1}^{K} θ_k^{α_k − 1},  where α₀ = Σ_{k=1}^{K} α_k

Here θ is the variable, and α = (α₁, …, α_K) are the hyper-parameters:
• hyper-parameters are parameters of distributions over parameters
• keep parameter vs. variable distinct
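To make the parameter vs. hyper-parameter distinction concrete, a minimal sketch using NumPy's dirichlet sampler (the specific α values are made up):

import numpy as np

rng = np.random.default_rng(0)

# alpha: hyper-parameters of the Dirichlet prior.
alpha = np.array([2.0, 2.0, 2.0])

# theta: a draw from Dir(alpha) -- a point on the probability simplex,
# itself the *parameter* of a Discrete/Multinomial distribution.
theta = rng.dirichlet(alpha)
print(theta, theta.sum())  # components are positive and sum to 1

# x: a *variable*, generated from the distribution parameterized by theta.
x = rng.choice(len(theta), size=10, p=theta)
print(x)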
Parametric Bayesian Learning

Some examples:
- Clustering: K-Means/Medoid, NMF
- Topic Modeling: LSA, pLSA, LDA
- Hierarchical Concept Building
Parametric Bayesian Learning

Serious problems:
- How could we know the number of clusters? The number of topics? The number of layers?
- Heuristic pre-processing?
- Guessing and tuning.
Parametric Bayesian Learning

Some basics: discrete data vs. continuous data.
- Discrete data (e.g., text) can be modeled as natural numbers.
- Continuous data (e.g., stock prices, trading volumes, signals, quality scores, ratings) can be modeled as real numbers.

Some important concepts (also used in the non-parametric case):

Discrete distribution: X_i | θ ~ Discrete(θ)

p(X | θ) = ∏_{i=1}^{n} Discrete(X_i; θ) = ∏_{j=1}^{m} θ_j^{N_j}

Multinomial distribution: N | n, θ ~ Multi(θ, n)

p(N | n, θ) = (n! / ∏_{j=1}^{m} N_j!) ∏_{j=1}^{m} θ_j^{N_j}

where N_j counts the occurrences of symbol j. Computer scientists often mix the two up: the multinomial differs from the i.i.d. discrete likelihood only by the combinatorial factor.
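A minimal sketch of that difference, using SciPy's multinomial distribution (θ and the counts are made up):

import numpy as np
from scipy.stats import multinomial
from math import factorial

theta = np.array([0.5, 0.3, 0.2])   # parameter of a 3-symbol discrete distribution
counts = np.array([3, 1, 1])        # N_j: how often each symbol appeared; n = 5
n = counts.sum()

# Discrete (i.i.d. sequence) likelihood: prod_j theta_j^{N_j}
seq_likelihood = np.prod(theta**counts)

# Multinomial pmf adds the number of orderings with these counts.
mult_pmf = multinomial.pmf(counts, n=n, p=theta)

coeff = factorial(n) / np.prod([factorial(c) for c in counts])
assert np.isclose(mult_pmf, coeff * seq_likelihood)
print(seq_likelihood, mult_pmf)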
Parametric Bayesian Learning

Some important concepts (cont.)

Why is it better for the prior and the posterior to be conjugate distributions?

Dirichlet distribution: θ | α ~ Dir(α)

Dir(θ | α) = Γ(α₀) / (Γ(α₁) ⋯ Γ(α_K)) ∏_{k=1}^{K} θ_k^{α_k − 1}

Conjugate prior: if the posterior p(θ|X) is in the same family as the prior p(θ), then the prior is called a conjugate prior of the likelihood p(X|θ).
Examples: Binomial ←→ Beta; Multinomial ←→ Dirichlet.

For multinomial counts N with a Dirichlet prior p(θ|α):

p(θ | N, α) ∝ p(N | θ) p(θ | α) = Dir(θ | N + α) = Γ(α₀ + N) / (Γ(α₁ + N₁) ⋯ Γ(α_K + N_K)) ∏_{k=1}^{K} θ_k^{α_k − 1 + N_k}

where N = Σ_k N_k: the posterior is a Dirichlet again, with the counts simply added to the hyper-parameters.
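A minimal sketch of this update (the prior and counts are made up); conjugacy means the posterior is available without any integration:

import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior hyper-parameters
counts = np.array([50, 30, 20])          # observed multinomial counts N

# Posterior is Dir(alpha + N): just add the counts to the hyper-parameters.
alpha_post = alpha + counts

# Posterior mean of theta_k is (alpha_k + N_k) / (alpha_0 + N).
post_mean = alpha_post / alpha_post.sum()
print(post_mean)                          # ≈ [0.495, 0.301, 0.204]

# Draws from the posterior concentrate around the empirical proportions.
print(rng.dirichlet(alpha_post, size=3))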
Parametric Bayesian Learning

Some important concepts (cont.)

Probabilistic graphical model: modeling a Bayesian network using plates and circles.

Generative model: models the joint, p(θ|X) ∝ p(X|θ) p(θ). Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … typically unsupervised learning.

Discriminative model: models p(θ|X) directly. LR, KNN, SVM, Boosting, Decision Tree, … typically supervised learning. These also have graphical-model representations.
Non-parametric Bayesian Learning

When we talk about non-parametric, what do we usually talk about?
- Discrete data: Dirichlet distribution, Dirichlet process, Chinese restaurant process, Pólya urn, Pitman-Yor process, hierarchical Dirichlet process, Dirichlet process mixture, Dirichlet process multinomial model, clustering, …
- Continuous data: Gaussian distribution, Gaussian process, regression, classification, factorization, gradient descent, covariance matrix, Brownian motion, …

The keyword in all of these is infinite (∞).
Non-parametric Bayesian Learning

Dirichlet Process [Yee Whye Teh, et al.]

Let G₀ be a probability measure/distribution (the base distribution), α₀ a positive real number (the concentration parameter), and (A₁, A₂, …, A_r) a partition of the space. A probability distribution G satisfies G ~ DP(α₀, G₀) iff, for every such partition,

(G(A₁), …, G(A_r)) ~ Dir(α₀G₀(A₁), …, α₀G₀(A_r))

Which exact distribution is G₀? We don't know. Which exact distribution is G? We don't know either: a DP is a distribution over distributions.
Non-parametric Bayesian Learning

Where is the infinity? In the construction of the DP. We need to construct a DP, since it does not exist naturally in closed form: stick-breaking, the Pólya urn scheme, and the Chinese restaurant process.

Stick-breaking construction. Take two i.i.d. sequences (β_k)_{k=1}^{∞} and (φ_k)_{k=1}^{∞}:

β_k | α₀ ~ Beta(1, α₀),  φ_k | G₀ ~ G₀

π_k = β_k ∏_{l=1}^{k−1} (1 − β_l),  with Σ_{k=1}^{∞} π_k = 1

G = Σ_{k=1}^{∞} π_k δ_{φ_k}

Here δ_{φ_k} is a point mass at φ_k, and (π_k) is a distribution over the positive integers.

Why is this a DP? … (A truncated draw is sketched below.)
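A minimal sketch of a truncated stick-breaking draw (the K-atom truncation and the standard-normal G₀ are my assumptions for illustration):

import numpy as np

def stick_breaking(alpha0, K, rng):
    """Truncated draw of G ~ DP(alpha0, G0): returns weights pi and atoms phi."""
    beta = rng.beta(1.0, alpha0, size=K)          # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    pi = beta * remaining                         # pi_k = beta_k * prod_{l<k}(1 - beta_l)
    phi = rng.standard_normal(K)                  # phi_k ~ G0 (here: N(0, 1))
    return pi, phi

rng = np.random.default_rng(0)
pi, phi = stick_breaking(alpha0=1.0, K=100, rng=rng)
print(pi.sum())          # close to 1; the tiny remainder is the truncated tail
print(pi[:5], phi[:5])   # G = sum_k pi_k * delta_{phi_k}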
Non-parametric Bayesian Learning

Chinese Restaurant Process: a restaurant with an infinite number of tables; customers (words, generated from θ_i, one-to-one) enter the restaurant sequentially. The i-th customer (θ_i) sits at a table (φ_k) according to the probability

p(occupied table k) = n_k / (i − 1 + α₀),  p(new table) = α₀ / (i − 1 + α₀)

where n_k is the number of customers already at table k.

The tables φ_k induce a clustering, and clustering covers perhaps two-thirds of unsupervised learning: clustering itself, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
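A minimal simulation of this seating process (α₀ and the customer count are made up), showing the "rich get richer" behavior:

import numpy as np

def crp(n_customers, alpha0, rng):
    """Simulate table assignments for the Chinese Restaurant Process."""
    tables = []                                # tables[k] = customers at table k
    assignment = []
    for i in range(n_customers):
        # P(existing table k) = n_k / (i + alpha0); P(new table) = alpha0 / (i + alpha0)
        probs = np.array(tables + [alpha0]) / (i + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)                   # open a new table
        else:
            tables[k] += 1
        assignment.append(k)
    return assignment, tables

rng = np.random.default_rng(0)
assignment, tables = crp(100, alpha0=2.0, rng=rng)
print(len(tables), "tables, sizes:", tables)   # a few large tables, many small ones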
Non-parametric Bayesian Learning

Dirichlet Process Mixture (DPM). (You can draw the graphical model yourself.)

The DP alone is not enough: draws from G land exactly on the atoms, so data points either share an atom or are unrelated. We need similarity instead of cloning, which is what mixture models give us. Mixture model: an element is generated from a mixture/group of (usually latent) variables: GMM, LDA, pLSA, …

DPM: θ_i | G ~ G,  x_i | θ_i ~ F(θ_i). For text data, F(θ_i) is Discrete/Multinomial.

Construction (stick-breaking, as before):

β_k | α₀ ~ Beta(1, α₀),  φ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l),  G = Σ_{k=1}^{∞} π_k δ_{φ_k}

Intuitive, but not directly helpful for inference.
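Combining stick-breaking with a likelihood F gives a generative sketch of a DPM; here F(θ_i) is a unit-variance Gaussian (my simplification for a quick visual check; for text it would be Discrete/Multinomial) and all constants are made up:

import numpy as np

rng = np.random.default_rng(42)
alpha0, K, n = 1.0, 50, 500

# G ~ DP(alpha0, G0) via truncated stick-breaking; G0 = N(0, 5^2) over cluster means.
beta = rng.beta(1.0, alpha0, size=K)
pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
pi /= pi.sum()                                  # renormalize the truncation
phi = rng.normal(0.0, 5.0, size=K)              # atoms: cluster means

# DPM: theta_i | G ~ G, then x_i | theta_i ~ F(theta_i) = N(theta_i, 1).
z = rng.choice(K, size=n, p=pi)                 # which atom each point uses
x = rng.normal(phi[z], 1.0)

print("distinct clusters used:", len(np.unique(z)))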
Non-parametric Bayesian Learning

Dirichlet Process Mixture (DPM)

Its finite counterpart is the Dirichlet Multinomial Mixture Model (DMMM).

What can the DMMM do? Clustering of sparse count vectors such as
(0, 0, 0, Caption, 0, 0, 0, 0, 0, 0, USA, 0, 0, 0, 0, 0, 0, 0, 0, 0, Action, 0, 0, 0, 0, 0, 0, 0, Hero, 0, 0, 0, 0, 0, 0, …)
Non-parametric Bayesian Learning

Hierarchical Dirichlet Process (HDP):

G₀ | γ, H ~ DP(γ, H),  G_j | α₀, G₀ ~ DP(α₀, G₀),  θ_ji | G_j ~ G_j,  x_ji | θ_ji ~ F(θ_ji)

Construction: stick-breaking per group, with the shared base measure G₀ itself drawn from a DP.

Its finite counterpart (with F multinomial) is LDA, i.e., a hierarchical Dirichlet multinomial mixture model.

A very natural model for the statistics folks; for us computer folks… hehe.
Non-parametric Bayesian Learning

Hierarchical Topic Modeling: what can we get from reviews, blogs, question answers, tweets, news, …? Only topics? That is far from enough. What we really need is a hierarchy that illustrates what exactly the text tells people.
Non-parametric Bayesian Learning

Hierarchical Topic Modeling. Prior: the nested CRP/DP (nCRP) [Blei and Jordan, NIPS, 04].

nCRP: in a restaurant, at the 1st level there is one table, which is linked to an infinite number of tables at the 2nd level; each table at the 2nd level is in turn linked to an infinite number of tables at the 3rd level; such a structure is repeated. The CRP is the prior used to choose a table at each level, forming a path: one document, one path.

Like a Matryoshka doll: restaurants nested inside restaurants.
Non-parametric Bayesian Learning

Hierarchical Topic Modeling

L can be infinite, but need not be.

Generative process:
1. Let c₁ be the root restaurant (only one table).
2. For each level l ∈ {2, …, L}: draw a table from restaurant c_{l−1} using the CRP, and set c_l to be the restaurant referred to by that table.
3. Draw an L-dimensional topic-proportion vector θ ~ Dir(α).
4. For each word w_n: draw a level z ∈ {1, …, L} ~ Multi(θ), then draw w_n from the topic associated with restaurant c_z.

[Plate diagram: hyper-parameters γ, α, β; per-document path c₁ → c₂ → ⋯ → c_L; level assignment z_{m,n} and word w_{m,n} for each of N words in each of M documents; topics k.]
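A compact sketch of steps 1–4 (finite depth L; the vocabulary size, Dir(β) topics per restaurant, and all hyper-parameter values are my own toy assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
L, V, alpha, gamma, beta = 3, 20, 1.0, 1.0, 0.1   # depth, vocab size, hyper-params

children = {}     # node id -> list of child counts (a CRP per restaurant)
topics = {}       # node id -> topic: a Dir(beta) draw over the vocabulary

def get_topic(node):
    if node not in topics:
        topics[node] = rng.dirichlet(np.full(V, beta))
    return topics[node]

def ncrp_path():
    """Draw one document's path c_1 -> ... -> c_L through the infinite tree."""
    path = [()]                                   # c_1: the root restaurant
    for _ in range(2, L + 1):
        counts = children.setdefault(path[-1], [])
        probs = np.array(counts + [gamma], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(0)                      # open a new table/child
        counts[k] += 1
        path.append(path[-1] + (k,))              # c_l: restaurant behind table k
    return path

def generate_doc(n_words):
    path = ncrp_path()
    theta = rng.dirichlet(np.full(L, alpha))      # topic proportions over levels
    z = rng.choice(L, size=n_words, p=theta)      # level of each word
    return [rng.choice(V, p=get_topic(path[l])) for l in z]

docs = [generate_doc(10) for _ in range(5)]
print(docs[0])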
Non-parametric Bayesian Learning

What we can get
Markov Chain Monte Carlo

Markov chain: X_m → X_{m+1}, governed by the transition matrix

P = ( p(1|1)   p(2|1)   …  p(|S||1)
      p(1|2)   p(2|2)   …  p(|S||2)
      ⋮        ⋮            ⋮
      p(1||S|) p(2||S|) …  p(|S|||S|) )

Initialization probability: π₀ = {π₀(1), π₀(2), …, π₀(|S|)}

π_n = π_{n−1} P = π_{n−2} P² = ⋯ = π₀ Pⁿ  (Chapman-Kolmogorov equation)

Convergence theorem (the fundamental limit theorem of Markov chains): under the premise of connectivity of P (irreducibility and aperiodicity),

lim_{n→∞} Pⁿ_{ij} = π(j),  with π(j) = Σ_{i=1}^{|S|} π(i) P_{ij}

so every row of lim_{n→∞} Pⁿ equals π, and lim_{n→∞} π₀ Pⁿ = π = {π(1), π(2), …, π(j), …, π(|S|)}: the stationary distribution, reached regardless of π₀.

Sampling: X₀ ~ π₀(x) → X₁ ~ π₁(x) → ⋯ → X_n ~ π(x) → X_{n+1} ~ π(x) → ⋯ Once the chain has converged, every subsequent draw follows the stationary distribution.
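A minimal numerical check of π_n = π₀Pⁿ converging to the stationary distribution (the 3-state matrix is made up):

import numpy as np

# A hypothetical 3-state transition matrix: P[i, j] = p(j | i), rows sum to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

pi = np.array([1.0, 0.0, 0.0])      # pi_0: start deterministically in state 1
for n in range(50):
    pi = pi @ P                     # pi_n = pi_{n-1} P  (Chapman-Kolmogorov)

print(pi)                           # the stationary distribution
print(pi @ P)                       # unchanged: pi(j) = sum_i pi(i) P_ij
print(np.linalg.matrix_power(P, 50))  # every row converges to pi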
Markov Chain Monte Carlo

Gibbs Sampling (a special case of Metropolis-Hastings sampling)

Step 1: Initialize X^{(0)} = x^{(0)} = {x_i^{(0)} : i = 1, 2, …, n}
Step 2: for t = 0, 1, 2, …
  1. x_1^{(t+1)} ~ p(x_1 | x_2^{(t)}, x_3^{(t)}, …, x_n^{(t)})
  2. x_2^{(t+1)} ~ p(x_2 | x_1^{(t+1)}, x_3^{(t)}, …, x_n^{(t)})
  3. …
  4. x_j^{(t+1)} ~ p(x_j | x_1^{(t+1)}, …, x_{j−1}^{(t+1)}, x_{j+1}^{(t)}, …, x_n^{(t)})
  5. …
  6. x_n^{(t+1)} ~ p(x_n | x_1^{(t+1)}, x_2^{(t+1)}, …, x_{n−1}^{(t+1)})

In short: x_i ~ p(x_i | x_{−i}).

[Figure: a 2-D illustration — Gibbs sampling moves between neighboring points A, B, C, D by changing one coordinate at a time.]

You want to know Gibbs sampling for HDP/DPM/nCRP? You had better first understand Gibbs sampling for LDA and the DMMM.
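Before tackling Gibbs sampling for LDA or the DMMM, it helps to see the scheme on the simplest possible target; a minimal sketch for a bivariate Gaussian with correlation ρ (my toy choice), where both full conditionals are closed-form normals:

import numpy as np

rho = 0.8                      # correlation of the target N(0, [[1, rho], [rho, 1]])
rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0              # Step 1: initialize
samples = []

for t in range(20000):         # Step 2: alternate the full conditionals
    # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    # x2 | x1 ~ N(rho * x1, 1 - rho^2), using the *new* x1
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples[5000:])          # discard burn-in
print(np.corrcoef(samples.T)[0, 1])         # ≈ rho
print(samples.mean(axis=0), samples.var(axis=0))  # ≈ (0, 0), (1, 1)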
Reference

• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course. 2007.
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006.
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012.
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003.
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010.
• Gregor Heinrich. Parameter Estimation for Text Analysis. 2008.
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973.
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference.
• Rick Durrett. Probability: Theory and Examples. 2010.
• Christopher Bishop. Pattern Recognition and Machine Learning. 2007.
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model. 2014.
• David P. Williams. Gaussian Processes. Duke University, 2006.
Q&A