
Understanding Maximum Entropy Discrimination Models

Sam Blasiak
September 22, 2011
George Mason University

Schedule

1. Presentation: 25 minutes
2. Exercise: 25 minutes
3. Discussion: 10 minutes

Introduction

Outline
1. Intro: Approaches to Classification
2. Support Vector Machines
3. Maximum Entropy Discrimination
4. Topic Models
5. Topic Models + Maximum Entropy Discrimination

Classification

Setting
- observations x1, ..., xN, with xn ∈ R^K
- associated classes y1, ..., yN, with yn ∈ {−1, 1}
- learn a function that maps observations to classes

Generative Models for Classification

Assume different categories were generated from different theoretical distributions.

[Figure: Theoretical distributions for class −1 and class 1.]

Generative Models for Classification

Model the observations given the category.

[Figure: Estimated distributions for class −1 and class 1, with the decision boundary where the densities cross.]

Generative Models for Classification: Detail

Learn class-conditional densities
- find θ(−) that maximizes p(x|θ(−)) for all negative-class observations
- find θ(+) that maximizes p(x|θ(+)) for all positive-class observations

Classify an unknown observation, xN+1
- take the maximum-likelihood class under the posterior: p(yN+1|xN+1, θ) ∝ p(xN+1|θy) p(y)
- equivalently, use a decision rule (a minimal sketch follows below):
  log [ p(xN+1|θ(+)) / p(xN+1|θ(−)) ] + log [ p(y=1) / p(y=−1) ] ≥ 0  ⇒  class 1
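As a concrete illustration of this recipe, here is a minimal sketch (not from the slides) using one-dimensional Gaussian class-conditional densities; the data, means, and variances are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D training data: class -1 centered at -1, class +1 centered at +1.
x_neg = rng.normal(loc=-1.0, scale=0.5, size=100)
x_pos = rng.normal(loc=+1.0, scale=0.5, size=120)

# Maximum-likelihood fit of each class-conditional Gaussian (theta = mean, std).
theta_neg = (x_neg.mean(), x_neg.std())
theta_pos = (x_pos.mean(), x_pos.std())

# Class prior estimated from label frequencies.
log_prior_ratio = np.log(len(x_pos) / len(x_neg))

def discrimination(x):
    """log p(x|theta+) - log p(x|theta-) + log p(y=1)/p(y=-1)."""
    return (norm.logpdf(x, *theta_pos) - norm.logpdf(x, *theta_neg)
            + log_prior_ratio)

x_new = 0.3
print(1 if discrimination(x_new) >= 0 else -1)  # predicted class for x_new
```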

Generative Models for Classification

What if the models have high variance?

[Figure: High-variance distributions for class −1 and class 1.]

Discriminative Methods

Find a boundary between positive and negative categories.

Discriminative Methods with Kernels

What if the classes are not separable?
- map the observations to an easier space: xi → ϕ(xi)
- Kernel ≈ ⟨ϕ(xi), ϕ(xj)⟩

Discriminative Classifier + Kernel gives good practical solutions
- tf-idf
- the remote homology detection problem
- cepstrum coefficients for speech recognition (sort of)
- Fisher/Eigen Faces

Discriminative Methods with Kernels

Where do good kernels come from?

Maximum Entropy Discrimination (MED)

What is MED?
- like pushing conditional class densities farther apart
- or constructing a kernel using hidden parameters from a generative model

When is MED tractable?

When will MED fail to find a good classifier?

Support Vector Machines (SVMs)

Learn a hyperplane that separates two categories
- maximize the “margin”
- determined by a small number of “support vectors”

SVM detail

Given a training set
- x1:N, xn ∈ X
- y1:N, yn ∈ {−1, 1}

A separating hyperplane
- w^T x = b

SVM objective (a small numeric sketch follows below):

  min_{w,b} (1/2) ||w||²
  such that yn (w^T xn − b) ≥ 1   ∀ n ∈ 1:N
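As a quick illustration (not from the slides), the sketch below fits this objective on invented toy data with scikit-learn's SVC; a large C approximates the hard-margin problem, and the learned w and support vectors can be read off the fitted model. Note that scikit-learn parameterizes the hyperplane as w·x + b = 0, so its intercept corresponds to −b in the slide's notation.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in R^2 (invented for illustration).
X = np.array([[-2.0, -2.0], [-1.5, -1.0], [-1.0, -2.0],
              [ 2.0,  2.0], [ 1.5,  1.0], [ 1.0,  2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM; kernel='linear'
# solves the dual QP of the max-margin objective above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]            # scikit-learn's decision function is w.x + b
print("w =", w)
print("margin width =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```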

SVM intuition

Decrease ||w|| to increase the margin.

[Figure: with w = [0.50, 0.50], the margin points x0 = [−1, −1] and x1 = [1, 1] satisfy w^T x = ∓1 and lie at distance |w^T x| / ||w|| ≈ 1.41 from the hyperplane.]

[Figure: with w = [0.25, 0.25], the same constraints w^T x = ∓1 are met by points twice as far out, x0 = [−2, −2] and x1 = [2, 2], at distance ≈ 2.83 from the hyperplane.]

MED setting (same as generative)

A training set
- x1:N, xn ∈ X
- y1:N, yn ∈ {−1, 1}

The form of two class-conditional probability distributions
- p(x|θ(+)), p(x|θ(−))
- and a prior distribution over class frequencies, p(y)

Define a discrimination function
- L(xn, θ) ≡ log [ p(xn|θ(+)) / p(xn|θ(−)) ] + log [ p(y=1) / p(y=−1) ]
- class 1 if L(xN+1, θ) ≥ 0
- class −1 otherwise

MED objective

Apply the max-margin principle to the decision rule.

[Figure: Log distributions for class −1 and class 1, with a training example, the decision boundary, and the decision-rule margin marked.]

Maximum Entropy Discrimination (MED)

Recall the generative example.

[Figure: Generating distributions for class −1 and class 1, with a training example and the decision boundary.]

MED objective

But over an expected margin.

[Figure: Expected discrimination function magnitude across the input space, with a training example marked.]

MED objective

Goal
- Find a distribution p(θ(+), θ(−)) that maintains a margin

Ensure the expected class of all training examples is correct by a margin, γ
- E_{p(θ(+),θ(−))} [ yn L(xn, θ) ] ≥ γ   ∀ n ∈ 1:N

Penalize larger “distances” from a prior, p0(θ(+), θ(−))
- KL( p(θ(+), θ(−)) || p0(θ(+), θ(−)) )
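Putting the bullets above together, the full MED program reads as follows; this is only a compact restatement of what the slides say, with γ treated as a fixed margin.

```latex
\min_{p(\theta^{(+)},\,\theta^{(-)})}\;
  \mathrm{KL}\!\left(p(\theta^{(+)},\theta^{(-)})\,\big\|\,p_0(\theta^{(+)},\theta^{(-)})\right)
\quad\text{such that}\quad
  \mathbb{E}_{p(\theta^{(+)},\theta^{(-)})}\!\left[\,y_n\,L(x_n,\theta)\,\right]\;\ge\;\gamma,
  \qquad n = 1,\dots,N .
```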

SVM-MED comparison

SVM objective
- min_{w,b} (1/2)||w||² such that yn (w^T xn − b) ≥ 1   ∀ n ∈ 1:N
- Regularizer: size of w
- Margin: minimum distance from the separating hyperplane

MED objective
- min_{p(θ)} KL( p(θ) || p0(θ) ) such that E_{p(θ)}[ yn L(xn, θ) ] ≥ γ   ∀ n ∈ 1:N
- Regularizer: KL divergence from the prior
- Margin: expected category under the decision rule

Exercises

See worksheet


Topic Models

Topic models are generative models that specify how a document (or sequence) can be decomposed into a set of abstract topics.

[Figure: Examples of how a document can be decomposed into topics. The “data” topic is probabilistically composed of words related to data; the “goal” topic is probabilistically composed of words related to goals.]

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a well-known topic model describing counts of words in a set of documents.
- we call vectors of word counts “bag-of-words” (BOW) vectors
- β vectors describe probabilities of a word given a topic
- θ vectors describe probabilities of a topic given a document

[Figure: LDA plate model]
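To make the roles of β and θ concrete, here is a minimal sketch (not part of the slides) of LDA's generative process for a single toy document; the vocabulary, topic count, and Dirichlet hyperparameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["goal", "score", "team", "data", "model", "learn"]
K, V, doc_len = 2, len(vocab), 10
alpha, eta = 0.5, 0.5          # symmetric Dirichlet hyperparameters (invented)

beta = rng.dirichlet(eta * np.ones(V), size=K)   # beta[k]: word distribution of topic k
theta = rng.dirichlet(alpha * np.ones(K))        # theta: topic distribution of this document

words = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)       # draw a topic for this word position
    w = rng.choice(V, p=beta[z])     # draw a word from that topic's distribution
    words.append(vocab[w])
print(words)
```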

Using LDA to classify

Using a bag-of-words representation...
- [n_w1, n_w2, ..., n_wM]^T
- we can classify a document with a discriminative classifier

LDA assigns each word in the corpus to a topic (or an expected topic)
- counting topics gives a bag-of-topics vector, [n_t1, n_t2, ..., n_tK]^T
- similar to classification after dimensionality reduction with PCA (LSA)
- a sketch of this two-stage pipeline follows below
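Here is a minimal sketch of that two-stage pipeline (topics first, classifier second) using scikit-learn; the four-document corpus and its labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

docs = ["the team scored a late goal",
        "the keeper saved the penalty and the team won",
        "the model fits the data with two topics",
        "we learn model parameters from training data"]
labels = [-1, -1, 1, 1]   # toy labels: -1 = sports, +1 = machine learning

# Bag-of-words counts -> expected bag-of-topics features (normalized).
counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Discriminative classifier trained on the topic representation.
clf = LinearSVC().fit(topics, labels)
print(clf.predict(topics))
```

Note that the topics here are learned without looking at the labels, which is exactly the weakness the next slide points out.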

Using LDA to classify II

But the LDA model performs badly for classification!

[Figure: t-SNE embedding of the topic space to two dimensions depicts overlap between documents of different classes.]

MedLDA

MedLDA combines MED and LDA
- BUT MedLDA is not fully faithful to the MED framework!

MED objective
- min_{p(θ)} KL( p(θ) || p0(θ) ) such that E_{p(θ)}[ yi L(xi, θ) ] ≥ γ   ∀ i ∈ 1:N

MedLDA objective
- min_{p(w),α,β} KL( p(w) || p0(w) ) − Σ_i log p(xi|α, β) such that E[ yi w^T z̄i ] ≥ 1   ∀ i ∈ 1:N
- p(w): a distribution over separating hyperplanes
- z̄: the bag-of-topics vector normalized to sum to 1
- α: document-topic vector prior parameter
- β: topic-word vector prior parameter

MedLDA objective detail

MedLDA approximate objective
- min_{p(w),α,β} KL( p(w) || p0(w) ) + KL( q(z; α̂, β̂) || p(x, z|α, β) )
  s.t. E[ yi w^T z̄i ] ≥ 1   ∀ i ∈ 1:N
- approximation needed to solve the optimization problem
  - the form of q(z; α̂, β̂) is specified: a fully factorized distribution
- the approximation matches the MED objective
  - p(θ) = p(w) q(z; α̂, β̂)
  - L(xi, θ) ≡ w^T z̄
  - the prior on w is specified: p0(w) = N(0, I)
  - the MED “prior” on q(z; α̂, β̂) is the posterior distribution p(z|α, β)
    - p(z|α, β) = p(x, z|α, β) / Σ_z p(x, z|α, β)

MedLDA variations

Four variations of MedLDA

1. either supervised or basic LDA
   - the sLDA version alters the distribution over bag-of-words vectors (documents): p(x, z|α, β) → p(y, x, z|w, α, β)
2. either regression or classification
   - classification dual form equivalent to the SVM
   - regression dual form equivalent to Support Vector Regression (SVR)

The MedLDA paper covers 3 of the 4 cases
- omits sLDA with classification
  - bounds on the softmax function (recall the sLDA GLM types) make inference inefficient

MedLDA training algorithm outline

Given:
- a training set consisting of
  - bag-of-words vectors, x1:N ∈ X
  - labels, y1:N ∈ {−1, 1}
- an LDA model with posterior p(z|x, α, β)
- a discrimination function that maps E[z̄] to {−1, 1} given E[w]

Repeat until the change in the objective function < ε:
1. compute E[z̄i] ∀i under the LDA variational approximation
2. optimize the discriminative parameters, E[w], for the set of expected z̄i
3. optimize the generative parameters, α, β, given E[w] and x1:N

MedLDA Lagrangian

MedLDA objective
- KL( p(w) || p0(w) ) + KL( q(z|α̂, β̂) || p(z|α, β) )
  s.t. yi E[ w^T z̄i ] ≥ 1   ∀ i ∈ 1:N

MedLDA Lagrangian
- KL( p(w) || p0(w) ) + KL( q(z|α̂, β̂) || p(z|α, β) ) − Σ_i μi ( yi E[ w^T z̄i ] − 1 )
- where the μi are Lagrange multipliers

MedLDA inference detail

MedLDA dual
- minimize with respect to p(w)
  - the other parameters in the objective do not depend on the Lagrange multipliers, μ

MedLDA Lagrangian, keeping only terms that involve p(w)
- E_{p(w)}[ log p(w)/p0(w) ] − Σ_i μi yi E[ w^T z̄i ] + const
- move the constraint terms inside the expectation:
  E_{p(w)}[ log ( p(w) / ( p0(w) exp( Σ_i μi yi w^T E[z̄i] ) ) ) ] + const

Use Gibbs' inequality to find the minimum
- Gibbs' inequality: KL( q || p ) ≥ 0, with equality when q = p
- p(w) = (1/Z) p0(w) exp( Σ_i μi yi w^T E[z̄i] )

MedLDA dual cont.

Minimum with respect to the primal parameters
- p(w) = (1/Z) p0(w) exp( Σ_i μi yi w^T E[z̄i] )

MedLDA dual
- substitute the minimizing p(w) back into the objective to obtain the dual
- the ability to solve the dual problem depends on whether the normalizing constant, Z, is computable!
- the dual form of the objective for p0(w) = N(0, I) is equivalent to the SVM dual form (the algebra is sketched below)
- it is more efficient to solve the MedLDA QP than to optimize the sLDA bound!
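The Gaussian-prior claim can be checked by completing the square; the short derivation below fills in algebra that is not on the slide. Writing v ≡ Σ_i μi yi E[z̄i]:

```latex
p_0(w) = \mathcal{N}(0, I)
\;\Rightarrow\;
p(w) \propto \exp\!\big(-\tfrac{1}{2}\|w\|^2 + w^{\top} v\big)
= \exp\!\big(-\tfrac{1}{2}\|w - v\|^2 + \tfrac{1}{2}\|v\|^2\big),
\qquad
\log Z = \tfrac{1}{2}\|v\|^2 .
```

Substituting this p(w) back into the Lagrangian leaves, up to terms that do not involve μ, the dual problem

```latex
\max_{\mu \ge 0}\;\; \sum_i \mu_i \;-\; \tfrac{1}{2} \sum_{i,j} \mu_i \mu_j\, y_i y_j\,
\mathbb{E}[\bar z_i]^{\top}\mathbb{E}[\bar z_j],
```

which is the familiar (hard-margin, bias-free) SVM dual with kernel K_ij = E[z̄i]^T E[z̄j].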

More inference detail

Parameter updates are mostly the same as in basic LDA
- the variational parameter φ_{d,n} approximates the expectation of topic z_{d,n}
- in MedLDA the φ_{d,n} update has a closed form and depends on the Lagrange multipliers from the dual solution
- the MedLDA term f(μ) has an effect similar to sLDA's ∂E[A(η^T z̄)] / ∂φ_{d,n}

MED topic models (MedTM)

MedTM
- a generalization from LDA to arbitrary topic models
- not completely faithful to the original MED
  - sometimes learns the discriminative portion of the objective without using a probability distribution over weights

Examples
- Conditional Topic Random Field
  - adds dependence between topics and a per-document input variable to MedLDA
- Upstream Scene Understanding Models
  - the discriminative component is upstream in the model

MED/MedTM key points

- combines positive aspects of both generative and discriminative models
  - allows simultaneous learning of the discrimination function and the mapping to the latent space
  - avoids the pitfall of classifying with a generative model
- the optimization reduces to a QP, which is easier to solve than logistic regression
- flexible: MED can be used with different topic models

Modifications to LDA for classification

Two main approaches to improve classification performance:

Upstream models
- distribution of topic priors conditioned on the document category
- this presentation won't focus on upstream models

[Figure: Example: Dirichlet Multinomial Regression]

Modifications to LDA for classification

Downstream models
- distribution of the document category conditioned on topic assignments

[Figure: Example: Supervised LDA (sLDA)]

A closer look at sLDA

sLDA
- a downstream model
- probability of the response variable conditioned on a vector of normalized topic counts
  - p(yn | z̄, η, δ)
  - z̄ = [z̄1, ..., z̄K]^T, where z̄k = zk / Σ_k' zk' (the fraction of words assigned to topic k)
- can be used for regression or classification
- variational algorithm with a mean-field (fully factorized) approximation for inference

[Figure: sLDA plate model]

Generalized Linear Model (GLM)

sLDA response variables are modelled by a GLM
- GLM(y; z̄, η, δ) ≡ h(y, δ) exp{ ( η^T z̄ · y − A(η^T z̄) ) / δ }
  - h(y, δ): base measure
  - y: sufficient statistic
  - A(η^T z̄): log normalizer (z̄ is the bag-of-topics vector normalized to sum to 1)

Normal distribution
- for regression, y ∈ R
- h(y, δ) = (1/√(2πδ)) exp{ −y²/(2δ) }
- A(η^T z̄) = (η^T z̄)² / 2

Softmax
- for classification, y ∈ {0, 1}
- h(y, δ) = 1
- A(η^T z̄) = log( 1 + exp(η^T z̄) )

A small numeric sketch of these two response models follows below.
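Here is a minimal sketch (not from the slides) that evaluates these two GLM response densities for a toy normalized topic vector; the values of z_bar, eta, and delta are invented.

```python
import numpy as np

def glm_normal_logpdf(y, eta, z_bar, delta=1.0):
    """Gaussian sLDA response: natural parameter eta^T z_bar, dispersion delta."""
    theta = eta @ z_bar
    log_h = -0.5 * np.log(2 * np.pi * delta) - y ** 2 / (2 * delta)
    return log_h + (theta * y - theta ** 2 / 2) / delta

def glm_softmax_logpmf(y, eta, z_bar):
    """Binary (softmax) sLDA response, y in {0, 1}: A(theta) = log(1 + exp(theta))."""
    theta = eta @ z_bar
    return theta * y - np.log1p(np.exp(theta))

z_bar = np.array([0.7, 0.2, 0.1])   # normalized bag-of-topics vector (toy)
eta = np.array([2.0, -1.0, 0.5])    # GLM coefficients (toy)

print(glm_normal_logpdf(1.3, eta, z_bar))           # log density of a real response
print(np.exp(glm_softmax_logpmf(1, eta, z_bar)))    # P(y = 1 | z_bar)
```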

sLDA inference

Parameter updates are mostly the same as in basic LDA

sLDA variational algorithm
- Eq(z)[z_{d,n}] approximates the expectation of topic z_{d,n} under the true posterior
- solve for φ_{d,n} ≡ Eq(z)[z_{d,n}] directly
- for a particular GLM, the φ_{d,n} update depends on ∂E[A(η^T z̄)] / ∂φ_{d,n}

For the softmax (classification) case
- A(η^T z̄) ≡ log( 1 + exp(η^T z̄) )
- additional bounds are needed to compute an approximate upper bound on E[ log( 1 + exp(η^T z̄) ) ]!

Posterior Regularization

- only a single distribution over feature vectors; no weights
- variables that were latent in MED are observed in the training set
- different dual form

References

D.M. Blei and J. McAuliffe. Supervised topic models. Advances in Neural Information Processing Systems, 20:121–128, 2008.

D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. 1999.

S. Lacoste-Julien, F. Sha, and M.I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems 21 (NIPS 08), 2008.

D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proc. of the 24th Conference on Uncertainty in Artificial Intelligence. Citeseer, 2008.

H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239, 2005.

G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

J. Zhu, A. Ahmed, and E.P. Xing. MedLDA: A general framework of maximum margin supervised topic models. stat, 1050:30, 2009.

J. Zhu, L.J. Li, F.F. Li, and E.P. Xing. Large margin learning of upstream scene understanding models.

J. Zhu and E.P. Xing. Conditional topic random fields. In International Conference on Machine Learning (to appear). Citeseer, 2010.