Non-parametric Bayesian Learning in Discrete Data

Statistics & Computational Linguistics


Yueshen Xu [email protected] / [email protected]

Middleware, CCNT, ZJU

May 10, 2016

Outline
 Bayes’ Rule
 Parametric Bayesian Learning
   Concept & Example
   Discrete & Continuous Data
   Text Clustering & Topic Modeling
   Pros and Cons
   Some Important Concepts
 Non-parametric Bayesian Learning
   Dirichlet Process and Its Construction
   Dirichlet Process Mixture
   Hierarchical Dirichlet Process
   Chinese Restaurant Process
 Example: Hierarchical Topic Modeling
 Markov Chain Monte Carlo
 Reference
 Discussion

Bayes’ Rule
 Posterior = Prior × Likelihood / Evidence

    p(Hypothesis | Data) = p(Data | Hypothesis) · p(Hypothesis) / p(Data)

  where p(Hypothesis | Data) is the posterior, p(Data | Hypothesis) the likelihood, p(Hypothesis) the prior, and p(Data) the evidence

 Update beliefs in hypotheses in response to data
 Parametric or non-parametric
   Whether the structure of the hypothesis is constrained or not; examples follow later
 The prior encodes how much confidence you place in your initial beliefs
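A minimal numeric sketch of the rule, with two made-up hypotheses and invented prior/likelihood values (not from the slides):

```python
# Bayes' rule over two hypothetical hypotheses H1, H2.
prior = {"H1": 0.5, "H2": 0.5}        # p(Hypothesis), assumed values
likelihood = {"H1": 0.8, "H2": 0.3}   # p(Data | Hypothesis), assumed values

unnorm = {h: likelihood[h] * prior[h] for h in prior}
evidence = sum(unnorm.values())       # p(Data), the normalizing constant
posterior = {h: v / evidence for h, v in unnorm.items()}
print(posterior)                      # {'H1': 0.727..., 'H2': 0.272...}
```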

Parametric Bayesian Learning
 The evidence is a fact: a constant with respect to the hypothesis, carrying no uncertainty
 Commonly used trick: drop it and work with the unnormalized posterior

    p(Hypothesis | Data) ∝ p(Data | Hypothesis) · p(Hypothesis),  i.e.  p(θ|X) ∝ p(X|θ) p(θ)

 “Parametric” vs. “non-parametric” refers to the hypothesis; non-parametric != no parameters
 Hyper-parameters vs. parameters vs. variables, e.g. in the Dirichlet density

    Dir(θ | α) = Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K)) · ∏_{k=1}^{K} θ_k^{α_k − 1},  with α_0 = Σ_{k=1}^{K} α_k

   α_1, …, α_K are hyper-parameters: parameters of a distribution over parameters
   θ is the parameter; the observed data are the variables (parameter vs. variable)

Parametric Bayesian Learning
 Some examples
   Clustering: K-Means/Medoid, NMF
   Topic modeling: LSA, pLSA, LDA
   Hierarchical concept building

Parametric Bayesian Learning
 Serious problems: how could we know
   the number of clusters?
   the number of topics?
   the number of layers?
 Heuristic pre-processing? Guessing and tuning

Parametric Bayesian Learning
 Some basics: discrete data & continuous data
   Discrete data, e.g. text: modeled as natural numbers
   Continuous data, e.g. stock, trading, signal, quality, rating: modeled as real numbers
 Some important concepts (also used in the non-parametric case)
   Discrete (categorical) distribution: X_i | θ ~ Discrete(θ)

    p(X | θ) = ∏_{i=1}^{n} Discrete(X_i; θ) = ∏_{j=1}^{m} θ_j^{N_j}

   Multinomial distribution: N | n, θ ~ Multi(θ, n)

    p(N | n, θ) = n! / (∏_{j=1}^{m} N_j!) · ∏_{j=1}^{m} θ_j^{N_j}

   Computer scientists often mix the two up; they differ only by the multinomial coefficient
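A small sketch of the distinction (the probabilities θ and the observed sequence below are invented): the two likelihoods differ only by the combinatorial factor n!/∏_j N_j!.

```python
import numpy as np
from scipy.special import gammaln

theta = np.array([0.5, 0.3, 0.2])   # hypothetical category probabilities
X = np.array([0, 0, 1, 2, 0, 1])    # a sequence of categorical draws
N = np.bincount(X, minlength=3)     # counts per category, here [3, 2, 1]

# Discrete (categorical) likelihood of the ordered sequence: prod_j theta_j^{N_j}
log_discrete = np.sum(N * np.log(theta))

# Multinomial likelihood of the counts adds the factor n! / prod_j N_j!
n = N.sum()
log_multinomial = gammaln(n + 1) - gammaln(N + 1).sum() + np.sum(N * np.log(theta))

print(np.exp(log_discrete), np.exp(log_multinomial))
```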

Parametric Bayesian Learning
 Some important concepts (cont.)
 Why is it desirable for the prior and the posterior to be conjugate distributions?
 Dirichlet distribution: θ | α ~ Dir(α)

    Dir(θ | α) = Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K)) · ∏_{k=1}^{K} θ_k^{α_k − 1}

 Conjugate prior: if the posterior p(θ|X) is in the same family as the prior p(θ), the prior is called a conjugate prior of the likelihood p(X|θ)
 Examples
   Binomial distribution ←→ Beta distribution
   Multinomial distribution ←→ Dirichlet distribution
 For the Dirichlet-multinomial pair, the posterior p(θ | N, α) ∝ p(N | θ) · p(θ | α) is again a Dirichlet with the counts added:

    p(θ | N, α) = Dir(θ | N + α) = Γ(α_0 + N) / (Γ(α_1 + N_1) ⋯ Γ(α_K + N_K)) · ∏_{k=1}^{K} θ_k^{α_k − 1 + N_k},  N = Σ_k N_k
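A one-line sketch of the conjugate update above: the posterior is obtained by adding the observed counts N to the hyper-parameters α (both vectors here are invented for illustration):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # hypothetical Dirichlet hyper-parameters
N = np.array([10, 3, 2])            # observed category counts

alpha_post = alpha + N              # Dir(theta | N + alpha): conjugacy in one line
post_mean = alpha_post / alpha_post.sum()
print(post_mean)                    # posterior mean of theta: [0.611, 0.222, 0.167]
```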

Parametric Bayesian Learning
 Some important concepts (cont.)
 Probabilistic graphical model
   Modeling a Bayesian network using plates and circles
 Generative model vs. discriminative model
   Generative: p(θ|X) ∝ p(X|θ) p(θ)
    Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … (mostly unsupervised learning)
   Discriminative: model p(θ|X) directly
    LR, KNN, SVM, Boosting, Decision Tree (supervised learning)
   Both families also have graphical-model representations

Non-parametric Bayesian Learning
 When we talk about non-parametric, what do we usually talk about?
   Discrete data: Dirichlet Distribution, Dirichlet Process, Chinese Restaurant Process, Polya Urn, Pitman-Yor Process, Hierarchical Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process Multinomial Model, Clustering, …
   Continuous data: Gaussian Distribution, Gaussian Process, Regression, Classification, Factorization, Gradient Descent, Covariance Matrix, Brownian Motion, …
 The common thread: infinite (∞) dimensionality

Non-parametric Bayesian Learning
 Dirichlet Process [Teh et al.]
   G_0: a probability measure/distribution (the base distribution); α_0: a positive real number; (A_1, A_2, …, A_r): a finite partition of the space; G: a probability distribution. If for every such partition

    (G(A_1), …, G(A_r)) ~ Dir(α_0 G_0(A_1), …, α_0 G_0(A_r))

   then G ~ DP(α_0, G_0)
   Which exact distribution is G_0? We don’t know (it is whatever base we choose)
   Which exact distribution is G? We don’t know (it is itself random)

Non-parametric Bayesian Learning
 Where does the infinity come in? In the construction of the DP
   We need to construct a DP explicitly, since the definition alone does not tell us how to sample one
   Constructions: stick-breaking, Polya urn scheme, Chinese restaurant process
 Stick-breaking construction: (β_k)_{k=1}^∞ and (φ_k)_{k=1}^∞ are i.i.d. sequences with

    β_k | α_0 ~ Beta(1, α_0),  φ_k | G_0 ~ G_0
    π_k = β_k ∏_{l=1}^{k−1} (1 − β_l),  Σ_{k=1}^∞ π_k = 1
    G = Σ_{k=1}^∞ π_k δ_{φ_k}

   δ_{φ_k} is a point mass at φ_k; the weights (π_k) form a distribution over the positive integers
 Why DP? … (a code sketch of the construction follows below)
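A minimal sketch of the stick-breaking construction. Any finite program must truncate the infinite sum, so the truncation level K and the choice of G_0 = N(0, 1) below are illustrative assumptions, not part of the slides:

```python
import numpy as np

def stick_breaking(alpha0, K, base_draw, rng):
    """Draw a truncated G = sum_k pi_k * delta_{phi_k} ~ DP(alpha0, G0)."""
    beta = rng.beta(1.0, alpha0, size=K)                 # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    pi = beta * remaining                                # pi_k = beta_k * prod_{l<k}(1 - beta_l)
    phi = np.array([base_draw(rng) for _ in range(K)])   # atoms phi_k ~ G0
    return pi, phi

rng = np.random.default_rng(0)
# Hypothetical base distribution G0 = N(0, 1); alpha0 controls how fast the sticks shrink.
pi, phi = stick_breaking(alpha0=2.0, K=100, base_draw=lambda r: r.normal(), rng=rng)
print(pi.sum())   # close to 1; the gap is the truncated tail mass
```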

Non-parametric Bayesian Learning
 Chinese Restaurant Process (CRP)
   A restaurant has an infinite number of tables, and customers (words, generated from θ_i, one-to-one) enter sequentially. The i-th customer (θ_i) sits at an occupied table (φ_k) with probability proportional to the number of customers already there, or at a new table with probability proportional to α_0:

    p(sit at occupied table k) = n_k / (i − 1 + α_0),  p(sit at a new table) = α_0 / (i − 1 + α_0)

 The tables φ_k induce a clustering, and clustering covers perhaps two-thirds of unsupervised learning: clustering itself, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
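A minimal simulation of the seating rule above (the number of customers and the value of α_0 are arbitrary choices):

```python
import numpy as np

def crp(n_customers, alpha0, rng):
    """Simulate CRP seating; returns the table index chosen by each customer."""
    tables = []        # tables[k] = number of customers at table k
    seating = []
    for i in range(n_customers):
        # p(existing table k) ∝ n_k, p(new table) ∝ alpha0
        weights = np.array(tables + [alpha0], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables

rng = np.random.default_rng(0)
seating, tables = crp(100, alpha0=2.0, rng=rng)
print(len(tables), tables)   # the number of occupied tables grows roughly like alpha0 * log(n)
```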

Non-parametric Bayesian Learning
 Dirichlet Process Mixture (DPM)
   You can draw the graphical model yourself → a DP alone is not enough → we need similarity instead of cloning → mixture models
   Mixture models: an element is generated from a mixture/group of (usually latent) variables, e.g. GMM, LDA, pLSA, …
   DPM: θ_i | G ~ G,  x_i | θ_i ~ F(θ_i). For text data, F(θ_i) is Discrete/Multinomial
 Construction (stick-breaking, as before):

    β_k | α_0 ~ Beta(1, α_0),  φ_k | G_0 ~ G_0
    π_k = β_k ∏_{l=1}^{k−1} (1 − β_l)
    G = Σ_{k=1}^∞ π_k δ_{φ_k}

 Intuitive, but not directly helpful for inference (a generative sketch follows below)
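Putting the two previous pieces together, here is a minimal sketch (not from the slides) of drawing data from a DPM of discrete distributions via the truncated stick-breaking G; the truncation level K, vocabulary size V, and concentration α_0 are invented for illustration. The construction is easy to sample from even though it is not the route to inference:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha0 = 50, 6, 1.0          # truncation level, vocabulary size, concentration

# G ~ DP(alpha0, G0) via truncated stick-breaking; G0 = Dir(1,...,1) over the simplex
beta = rng.beta(1.0, alpha0, size=K)
pi = beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
phi = rng.dirichlet(np.ones(V), size=K)          # atoms: multinomial parameter vectors

# theta_i | G ~ G (pick an atom), then x_i | theta_i ~ Discrete(theta_i)
z = rng.choice(K, size=200, p=pi / pi.sum())     # which atom each data point uses
x = np.array([rng.choice(V, p=phi[k]) for k in z])
print(np.bincount(z).nonzero()[0])               # the handful of clusters actually used
```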

Non-parametric Bayesian Learning
 Dirichlet Process Mixture (DPM)
   Its finite counterpart: the finite Dirichlet Multinomial Mixture Model (DMMM)
   What can the DMMM do? Cluster sparse discrete vectors such as

    (0,0,0,Caption,0,0,0,0,0,0,USA,0,0,0,0,0,0,0,0,0,Action,0,0,0,0,0,0,0,Hero,0,0,0,0,0,0,…)

   Clustering

Non-parametric Bayesian Learning
 Hierarchical Dirichlet Process (HDP)
   HDP: a shared base G_0 | γ, H ~ DP(γ, H); per group j, G_j | α_0, G_0 ~ DP(α_0, G_0); then θ_ji | G_j ~ G_j and x_ji | θ_ji ~ F(θ_ji)
 Construction: in the finite case with F = Mult, this reduces to LDA
 A very natural model for the statistics folks, but for us computer guys… hehe…
 The HDP with F = Mult can be read as the infinite Hierarchical Dirichlet Multinomial Mixture Model, with LDA as its finite counterpart

Non-parametric Bayesian Learning
 Hierarchical Topic Modeling
   What can we get from reviews, blogs, question answering, Twitter, news, …?
   Only topics? Far from enough
   What we really need is a hierarchy illustrating what exactly the text tells people, like a tree of topics from general to specific

Non-parametric Bayesian Learning
 Hierarchical Topic Modeling
   Prior: nested CRP/DP (nCRP) [Blei et al., NIPS, 04]
   nCRP: at the 1st level there is one restaurant with one table, which is linked to an infinite number of tables at the 2nd level; each table at the 2nd level is linked in turn to an infinite number of tables at the 3rd level; and the structure repeats, like a Matryoshka doll
   The CRP is the prior for choosing a table at each level, which forms a path: one document, one path

Non-parametric Bayesian Learning
 Hierarchical Topic Modeling: the generative process (L can be infinite, but need not be)
   1. Let c_1 be the root restaurant (only one table)
   2. For each level l ∈ {2, …, L}: draw a table from restaurant c_{l−1} using the CRP, and set c_l to be the restaurant referred to by that table
   3. Draw an L-dimensional topic proportion vector θ ~ Dir(α)
   4. For each word w_n: draw z ∈ {1, …, L} ~ Mult(θ), then draw w_n from the topic associated with restaurant c_z
 [Plate diagram omitted: per document m, a path c_1 → c_2 → … → c_L drawn with concentration γ; θ_m ~ Dir(α); z_{m,n} ~ Mult(θ_m); w_{m,n} drawn from topic parameters β; N words per document, M documents, T tables]
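A minimal sketch of steps 1 through 4 in code, using a finite depth L. The CRP helper, the tree data structure, and all hyper-parameter values are illustrative assumptions; the topics themselves are left abstract, so only the path and the per-word level assignments are drawn:

```python
import numpy as np

rng = np.random.default_rng(0)
L, alpha, gamma = 3, 1.0, 1.0   # tree depth, Dir(alpha) concentration, nCRP concentration
tree = {}                       # node -> list of customer counts for its child tables

def ncrp_pick(node):
    """CRP choice among the child tables of `node` (new table with prob ∝ gamma)."""
    counts = tree.setdefault(node, [])
    w = np.array(counts + [gamma], dtype=float)
    k = rng.choice(len(w), p=w / w.sum())
    if k == len(counts):
        counts.append(0)        # open a new table, i.e. a new child restaurant
    counts[k] += 1
    return node + (k,)

def generate_document(n_words):
    path = [()]                               # step 1: the root restaurant c_1
    for _ in range(2, L + 1):                 # step 2: extend the path with the CRP
        path.append(ncrp_pick(path[-1]))
    theta = rng.dirichlet(np.full(L, alpha))  # step 3: topic proportions over levels
    z = rng.choice(L, size=n_words, p=theta)  # step 4: level of each word
    return path, z                            # w_n would come from the topic at path[z_n]

path, z = generate_document(20)
print(path, z)
```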

Non-parametric Bayesian Learning
 What we can get: an inferred topic hierarchy [figure omitted]

Markov Chain Monte Carlo
 Markov chain: X_m → X_{m+1}, governed by a transition matrix over the state space S

    P = [ p(1→1)    p(1→2)    …  p(1→|S|)
          p(2→1)    p(2→2)    …  p(2→|S|)
            ⋮          ⋮       ⋱      ⋮
          p(|S|→1)  p(|S|→2)  …  p(|S|→|S|) ]

 Initial distribution: π_0 = {π_0(1), π_0(2), …, π_0(|S|)}
 π_n = π_{n−1} P = π_{n−2} P² = ⋯ = π_0 Pⁿ (Chapman-Kolmogorov equation)
 Convergence theorem for Markov chains: under the premise of connectivity of P,

    lim_{n→∞} (Pⁿ)_{ij} = π(j),  where π(j) = Σ_{i=1}^{|S|} π(i) P_{ij}

 Hence lim_{n→∞} π_0 Pⁿ = π = {π(1), π(2), …, π(|S|)} regardless of π_0: the stationary distribution (convergence)
 Sampling:

    X_0 ~ π_0(x) → X_1 ~ π_1(x) → ⋯ → X_n ~ π(x) → X_{n+1} ~ π(x) → X_{n+2} ~ π(x) → ⋯

   once the chain has converged, every subsequent draw is a sample from the stationary distribution
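A quick numeric sketch of the convergence claim: iterating π_n = π_{n−1}P on a small, made-up transition matrix reaches the same π from any starting π_0:

```python
import numpy as np

# A hypothetical 3-state transition matrix (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

pi = np.array([1.0, 0.0, 0.0])   # any initial distribution pi_0
for _ in range(100):             # pi_n = pi_{n-1} P
    pi = pi @ P

print(pi)        # the stationary distribution
print(pi @ P)    # unchanged: pi = pi P
```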

Markov Chain Monte Carlo
 Gibbs Sampling (a special case of Metropolis-Hastings sampling)
   Step 1: initialize X_0 = x^(0) = {x_i^(0) : i = 1, 2, …, n}
   Step 2: for t = 0, 1, 2, …
    1. x_1^(t+1) ~ p(x_1 | x_2^(t), x_3^(t), …, x_n^(t))
    2. x_2^(t+1) ~ p(x_2 | x_1^(t+1), x_3^(t), …, x_n^(t))
    3. …
    4. x_j^(t+1) ~ p(x_j | x_1^(t+1), …, x_{j−1}^(t+1), x_{j+1}^(t), …, x_n^(t))
    5. …
    6. x_n^(t+1) ~ p(x_n | x_1^(t+1), x_2^(t+1), …, x_{n−1}^(t+1))
   In short, every coordinate is resampled from its full conditional: x_i ~ p(x_i | x_{−i})
 [Figure omitted: in two dimensions the sampler moves between points A, B, C, D along the coordinate axes]
 If you want to understand Gibbs sampling for HDP/DPM/nCRP, you’d better first understand Gibbs sampling for LDA and the DMMM (a minimal sketch follows below)
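A minimal sketch of the loop above for the simplest non-trivial target, a standard bivariate Gaussian with correlation ρ, whose two full conditionals are known in closed form (ρ, the chain length, and the burn-in are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 5000
x1, x2 = 0.0, 0.0
samples = np.empty((T, 2))

for t in range(T):
    # Full conditionals of a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    samples[t] = (x1, x2)

print(np.corrcoef(samples[1000:].T))   # empirical correlation ≈ rho after burn-in
```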

Reference
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course. 2007.
• Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006.
• David M. Blei. Probabilistic Topic Models. Communications of the ACM, 2012.
• David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. JMLR, 2003.
• David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. Journal of the ACM, 2010.
• Gregor Heinrich. Parameter Estimation for Text Analysis. 2008.
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973.
• Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.
• Rick Durrett. Probability: Theory and Examples. 2010.
• Christopher Bishop. Pattern Recognition and Machine Learning. 2007.
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model. 2014.
• David P. Williams. Gaussian Processes. Duke University, 2006.

Q&A
