Understanding Maximum Entropy Discrimination Models

Sam Blasiak
George Mason University
September 22, 2011
Schedule

1. Presentation: 25 minutes
2. Exercise: 25 minutes
3. Discussion: 10 minutes
Introduction

Outline
1. Intro: Approaches to Classification
2. Support Vector Machines
3. Maximum Entropy Discrimination
4. Topic Models
5. Topic Models + Maximum Entropy Discrimination
Classification

Setting
- observations x_1, ..., x_N, with x_n ∈ R^K
- associated classes y_1, ..., y_N, with y_n ∈ {−1, 1}

Task
- learn a function that maps observations to classes (a toy setup is sketched below)
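To make the later slides concrete, here is a minimal, hypothetical toy dataset in this setting: N observations in R^K with labels in {−1, 1}. The blob construction and array names are illustrative assumptions, not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 2                       # N observations, each in R^K

# Two Gaussian "blobs", one per class, as a stand-in for real data.
X_pos = rng.normal(loc=+1.5, scale=1.0, size=(N // 2, K))
X_neg = rng.normal(loc=-1.5, scale=1.0, size=(N // 2, K))
X = np.vstack([X_pos, X_neg])                              # observations x_1 .. x_N
y = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])    # labels in {-1, +1}
```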
Generative Models for Classification

Assume the different categories were generated from different theoretical distributions.

Figure: Theoretical class-conditional densities for class −1 and class 1 (two overlapping curves plotted over the range −3 to 3).
Generative Models for Classification

Model the observations given the category.

Figure: Estimated class-conditional densities for class −1 and class 1, with the implied decision boundary marked.
Generative Models for Classification: Detail

Learn class-conditional densities
- find θ(−) that maximizes p(x | θ(−)) over all negative-class observations
- find θ(+) that maximizes p(x | θ(+)) over all positive-class observations

Classify an unknown observation, x_{N+1}
- take the maximum-likelihood class under the posterior
  p(y_{N+1} | x_{N+1}, θ) ∝ p(x_{N+1} | θ_y) p(y)
- equivalently, use the decision rule
  log [ p(x_{N+1} | θ(+)) / p(x_{N+1} | θ(−)) ] + log [ p(y = 1) / p(y = −1) ] ≥ 0  ⇒  class 1

(A numerical sketch of this classifier follows.)
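As a concrete instance of this recipe, the sketch below fits one 1-D Gaussian per class by maximum likelihood and classifies with the equivalent log-ratio decision rule. The Gaussian class-conditional choice is an assumption for illustration, not necessarily the model behind the slides' figures.

```python
import numpy as np
from scipy.stats import norm

def fit_generative(X, y):
    """Maximum-likelihood Gaussian fit for each class: theta = (mean, std)."""
    theta_plus = (X[y == 1].mean(), X[y == 1].std())
    theta_minus = (X[y == -1].mean(), X[y == -1].std())
    prior_plus = np.mean(y == 1)                 # estimate of p(y = 1)
    return theta_plus, theta_minus, prior_plus

def classify(x_new, theta_plus, theta_minus, prior_plus):
    """Decision rule: log p(x|θ+)/p(x|θ-) + log p(y=1)/p(y=-1) >= 0  =>  class +1."""
    log_ratio = (norm.logpdf(x_new, *theta_plus)
                 - norm.logpdf(x_new, *theta_minus)
                 + np.log(prior_plus) - np.log(1 - prior_plus))
    return 1 if log_ratio >= 0 else -1

# Toy 1-D observations drawn from the two classes.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(-1.0, 1.0, 50)])
y = np.concatenate([np.ones(50), -np.ones(50)])
params = fit_generative(X, y)
print(classify(0.8, *params))    # a point near the positive class mean -> 1
```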
Generative Models for Classification

What if the models have high variance?

Figure: High-variance class-conditional densities for class −1 and class 1.
Discriminative Methods

Find a boundary between the positive and negative categories.
Discriminative Methods with Kernels

What if the classes are not separable?
- map the observations to an easier space: x_i → ϕ(x_i)
- Kernel ≈ ⟨ϕ(x_i), ϕ(x_j)⟩

Discriminative Classifier + Kernel gives good practical solutions
- tf-idf [8]
- the remote homology detection problem [7]
- cepstrum coefficients for speech recognition (sort of)
- Fisher/Eigenfaces

(A small kernel computation is sketched below.)
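As a quick illustration of ⟨ϕ(x_i), ϕ(x_j)⟩, the sketch below computes an RBF Gram matrix, which corresponds to an implicit (infinite-dimensional) feature map ϕ. The particular kernel and bandwidth are arbitrary choices for illustration.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) ~ <phi(x_i), phi(x_j)>."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
print(rbf_gram(X).round(3))   # nearby points get values near 1, distant points near 0
```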
Discriminative Methods with Kernels

Where do good kernels come from?
Maximum Entropy Discrimination (MED) [4]

What is MED?
- like pushing the conditional class densities farther apart
- or constructing a kernel from the hidden parameters of a generative model

Things to think about
- When is MED tractable?
- When will MED fail to find a good classifier?
Support Vector Machines (SVMs) [3]

Learn a hyperplane that separates the two categories
- maximize the "margin"
- the hyperplane is determined by a small number of "support vectors"
SVM detail

Given a training set
- x_{1:N}, x_n ∈ X
- y_{1:N}, y_n ∈ {−1, 1}

A separating hyperplane
- w^T x = b

SVM objective
- min_{w,b} (1/2) ||w||^2
  such that y_n (w^T x_n − b) ≥ 1   ∀n ∈ 1:N

(A sketch of fitting this objective with an off-the-shelf solver follows.)
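A minimal sketch of fitting this objective with scikit-learn, assuming linearly separable toy data. Using a very large C approximates the hard-margin problem stated above; the library actually solves the soft-margin dual.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# Large C ~ hard margin: y_n (w^T x_n - b) >= 1 for (almost) all n.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], -clf.intercept_[0]       # sklearn parameterizes w^T x + intercept
print("w:", w, "b:", b)
print("margin width 1/||w||:", 1 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_vectors_))
```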
SVM intuition

Decrease ||w|| to increase the margin.

Figure: With w = [0.50, 0.50], the point x_1 = [1.0, 1.0] satisfies w^T x_1 = 1.00 and lies at distance (w/||w||)^T x_1 = 1.41 from the boundary; x_0 = [−1.0, −1.0] gives w^T x_0 = −1.00 and distance −1.41.

Figure: With w = [0.25, 0.25], the point x_1 = [2.0, 2.0] still satisfies w^T x_1 = 1.00 but lies at distance (w/||w||)^T x_1 = 2.83; x_0 = [−2.0, −2.0] gives w^T x_0 = −1.00 and distance −2.83.

(These numbers are reproduced in the sketch below.)
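The slide's numbers can be checked directly: w^T x stays at the constraint value 1 while the geometric distance (w/||w||)^T x, and hence the margin 1/||w||, grows as ||w|| shrinks.

```python
import numpy as np

for w, x in [(np.array([0.50, 0.50]), np.array([1.0, 1.0])),
             (np.array([0.25, 0.25]), np.array([2.0, 2.0]))]:
    norm_w = np.linalg.norm(w)
    print("w^T x =", w @ x,
          " distance (w/||w||)^T x =", round((w / norm_w) @ x, 2),
          " margin 1/||w|| =", round(1 / norm_w, 2))
# w = [0.50, 0.50]: w^T x = 1.0, distance 1.41
# w = [0.25, 0.25]: w^T x = 1.0, distance 2.83
```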
MED setting (same as generative)

A training set
- x_{1:N}, x_n ∈ X
- y_{1:N}, y_n ∈ {−1, 1}

The form of two class-conditional probability distributions
- p(x | θ(+)), p(x | θ(−))
- and a prior distribution over class frequencies, p(y)

Define a discrimination function
- L(x_n, θ) ≡ log [ p(x_n | θ(+)) / p(x_n | θ(−)) ] + log [ p(y = 1) / p(y = −1) ]
- class 1 if L(x_{N+1}, θ) ≥ 0
- class −1 otherwise
MED objective

Apply the max-margin principle to the decision rule.

Figure: Log class-conditional densities for class −1 and class 1, with the decision boundary, the decision-rule margin, and a training example marked.
Maximum Entropy Discrimination (MED) [4]

Recall the generative example.

Figure: Generating distributions for class −1 and class 1, with the decision boundary and a training example marked.
MED objective

But over an expected margin.

Figure: Expected discrimination function magnitude as a function of the input, with a training example marked.
MED objective

Goal
- Find a distribution p(θ(+), θ(−)) that maintains a margin

Ensure the expected class of every training example is correct by a margin, γ
- E_{p(θ(+), θ(−))}[ y_n L(x_n, θ) ] ≥ γ   ∀n ∈ 1:N

Penalize larger "distances" from a prior, p_0(θ(+), θ(−))
- KL( p(θ(+), θ(−)) || p_0(θ(+), θ(−)) )

(A Monte Carlo check of the margin constraint is sketched below.)
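A hedged Monte Carlo sketch of the expected-margin constraint: sample θ(+), θ(−) from a candidate distribution p(θ) and estimate E_{p(θ)}[y_n L(x_n, θ)] for each training point. The Gaussian class-conditionals and the factorized Gaussian form of p(θ) are illustrative assumptions, not the distribution MED would actually learn.

```python
import numpy as np
from scipy.stats import norm

def L(x, theta_plus, theta_minus, log_prior_ratio=0.0):
    """Discrimination function: log p(x|θ+)/p(x|θ-) + log p(y=1)/p(y=-1)."""
    return (norm.logpdf(x, loc=theta_plus, scale=1.0)
            - norm.logpdf(x, loc=theta_minus, scale=1.0)
            + log_prior_ratio)

def expected_margins(X, y, n_samples=2000, seed=0):
    """Estimate E_{p(θ)}[y_n L(x_n, θ)] under a candidate p(θ) = N(+1, .2) x N(-1, .2)."""
    rng = np.random.default_rng(seed)
    theta_plus = rng.normal(+1.0, 0.2, size=n_samples)    # samples of θ(+)
    theta_minus = rng.normal(-1.0, 0.2, size=n_samples)   # samples of θ(-)
    # L evaluated for every (sample, data point), then averaged over samples.
    vals = L(X[None, :], theta_plus[:, None], theta_minus[:, None])
    return (y[None, :] * vals).mean(axis=0)

X = np.array([1.2, 0.4, -0.9, -2.0])
y = np.array([1, 1, -1, -1])
gamma = 1.0
print(expected_margins(X, y) >= gamma)   # which constraints this candidate p(θ) satisfies
```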
SVM-MED comparison

SVM objective
- min_{w,b} (1/2) ||w||^2 such that y_n (w^T x_n − b) ≥ 1   ∀n ∈ 1:N
- Regularizer: size of w
- Margin: minimum distance from the separating hyperplane

MED objective
- min_{p(θ)} KL( p(θ) || p_0(θ) ) such that E_{p(θ)}[ y_n L(x_n, θ) ] ≥ γ   ∀n ∈ 1:N
- Regularizer: KL divergence from the prior
- Margin: expected category under the decision rule
Exercises
See worksheet
Topic Models

- Topic models are generative models that specify how a document (or sequence) can be decomposed into a set of abstract topics.

Figure: Examples of how a document can be decomposed into topics. The "data" topic is probabilistically composed of words related to data; the "goal" topic is probabilistically composed of words related to goals.
Latent Dirichlet Allocation (LDA)

- Latent Dirichlet Allocation (LDA) [2] is a well-known topic model describing counts of words in a set of documents.
- We call vectors of word counts "bag-of-words" (BOW) vectors.
- The β vectors describe probabilities of a word given a topic.
- The θ vectors describe probabilities of a topic given a document.

Figure: LDA plate model
Using LDA to classify

Using a bag-of-words representation...
- [n_{w1}, n_{w2}, ..., n_{wM}]^T
- we can classify a document with a discriminative classifier

LDA assigns each word in the corpus to a topic (or an expected topic)
- counting topics gives a bag-of-topics vector
- bag-of-topics: [n_{t1}, n_{t2}, ..., n_{tK}]^T
- similar to classification after dimensionality reduction with PCA (LSA)

(A sketch of this pipeline follows.)
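A hedged sketch of the bag-of-words → bag-of-topics → discriminative-classifier pipeline using scikit-learn. The tiny corpus, topic count, and classifier choice are placeholders, and this is a two-stage pipeline (LDA features fed to an SVM), not MedLDA's joint training.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

docs = ["the data table stores data records",
        "our goal is to score a goal this season",
        "query the data for each record",
        "the team goal is winning the match"]
y = np.array([-1, 1, -1, 1])                  # hypothetical labels

# Bag-of-words vectors [n_w1, ..., n_wM]^T
vec = CountVectorizer()
bow = vec.fit_transform(docs)

# Expected bag-of-topics vectors (document-topic proportions), here K = 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
bot = lda.fit_transform(bow)

# Classify in the (much lower-dimensional) topic space
clf = LinearSVC().fit(bot, y)
new_bot = lda.transform(vec.transform(["new data record"]))
print(clf.predict(new_bot))
```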
Using LDA to classify II

But the LDA model performs badly for classification!

Figure: tSNE [9] embedding of the topic space to two dimensions, from [5], depicts overlap between documents of different classes.
MedLDA [10]

MedLDA combines MED and LDA
- BUT MedLDA is not fully faithful to the MED framework!

MED objective
- min_{p(θ)} KL( p(θ) || p_0(θ) ) such that E_{p(θ)}[ y_i L(x_i, θ) ] ≥ γ   ∀i ∈ 1:N

MedLDA objective
- min_{p(w),α,β} KL( p(w) || p_0(w) ) − Σ_i log p(x_i | α, β)
  such that E[ y_i w^T z̄_i ] ≥ 1   ∀i ∈ 1:N
- p(w): a distribution over separating hyperplanes
- z̄: the bag-of-topics vector normalized to sum to 1
- α: document-topic vector prior parameter
- β: topic-word vector prior parameter
MedLDA objective detail

MedLDA approximate objective
- min_{p(w),α,β} KL( p(w) || p_0(w) ) + KL( q(z; α̂, β̂) || p(x, z | α, β) )
  s.t. E[ y_i w^T z̄_i ] ≥ 1   ∀i ∈ 1:N
- an approximation is needed to solve the optimization problem
- the form of q(z; α̂, β̂) is specified: a fully factorized distribution
- the approximation matches the MED objective:
  - p(θ) = p(w) q(z; α̂, β̂)
  - L(x_i, θ) ≡ w^T z̄
  - the prior on w is specified: p_0(w) = N(0, I)
  - the MED "prior" on q(z; α̂, β̂) is the LDA posterior,
    p(z | x, α, β) = p(x, z | α, β) / Σ_z p(x, z | α, β)
MedLDA [10] variations

Four variations of MedLDA
1. either supervised or basic LDA
   - the sLDA version alters the distribution over bag-of-words vectors (documents):
     p(x, z | α, β) → p(y, x, z | w, α, β)
2. either regression or classification
   - the classification dual form is equivalent to the SVM dual
   - the regression dual form is equivalent to Support Vector Regression (SVR)

- the MedLDA paper covers 3 of the 4 cases
- it omits sLDA with classification
  - bounds on the softmax function (recall the sLDA GLM types) make inference inefficient
MedLDA training algorithm outline

Given:
- a training set consisting of
  - bag-of-words vectors, x_{1:N} ∈ X
  - labels, y_{1:N} ∈ {−1, 1}
- an LDA model with posterior p(z | x, α, β)
- a discrimination function that maps E[z̄] to {−1, 1} given E[w]

Repeat until the change in the objective function < ε:
1. compute E[z̄_i] ∀i under the LDA variational approximation
2. optimize the discriminative parameters, E[w], for the set of expected z̄_i
3. optimize the generative parameters, α, β, given E[w] and x_{1:N}

(A schematic of this alternating loop is sketched below.)
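A schematic of the alternating loop, not the authors' implementation: the helpers `expected_bag_of_topics` and `update_topics` are hypothetical stand-ins for the LDA variational updates, and a linear SVM stands in for the discriminative step. It only illustrates the control flow of steps 1-3.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_medlda_like(X_bow, y, n_topics, expected_bag_of_topics, update_topics,
                      n_iters=50, tol=1e-4):
    """Schematic alternating optimization; the two topic-model helpers are
    hypothetical callables supplied by the user."""
    alpha = np.ones(n_topics)
    beta = np.ones((n_topics, X_bow.shape[1]))
    prev_obj = np.inf
    for _ in range(n_iters):
        # 1. E[z-bar_i] for every document under the variational approximation
        z_bar = expected_bag_of_topics(X_bow, alpha, beta)              # hypothetical
        # 2. optimize the discriminative parameters E[w] on the expected topic vectors
        svm = LinearSVC().fit(z_bar, y)
        # 3. optimize the generative parameters given E[w] and the documents;
        #    the helper is assumed to return the new (alpha, beta, objective)
        alpha, beta, obj = update_topics(X_bow, y, svm.coef_, alpha, beta)  # hypothetical
        if abs(prev_obj - obj) < tol:                                   # change < epsilon
            break
        prev_obj = obj
    return alpha, beta, svm
```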
MedLDA Lagrangian

MedLDA objective
- KL( p(w) || p_0(w) ) + KL( q(z | α̂, β̂) || p(z | x, α, β) )
  s.t. y_i E[ w^T z̄_i ] ≥ 1   ∀i ∈ 1:N

MedLDA Lagrangian
- KL( p(w) || p_0(w) ) + KL( q(z | α̂, β̂) || p(z | x, α, β) ) − Σ_i μ_i ( y_i E[ w^T z̄_i ] − 1 )
- where the μ_i are Lagrange multipliers
MedLDA inference detail

MedLDA dual
- minimize with respect to p(w)
- the other parameters in the objective do not depend on the Lagrange multipliers, μ

MedLDA Lagrangian (terms involving p(w))
- E_{p(w)}[ log p(w)/p_0(w) ] − Σ_i μ_i y_i E[ w^T z̄_i ] + const
- move the constraint terms inside the expectation:
  E_{p(w)}[ log p(w) / ( p_0(w) exp( Σ_i μ_i y_i w^T E[z̄_i] ) ) ] + const

Use the Gibbs inequality to find the minimum
- Gibbs inequality: KL( q || p ) ≥ 0, with equality when q = p
- p(w) = (1/Z) p_0(w) exp( Σ_i μ_i y_i w^T E[z̄_i] )
MedLDA dual cont.

Minimum with respect to the primal parameters
- p(w) = (1/Z) p_0(w) exp( Σ_i μ_i y_i w^T E[z̄_i] )

MedLDA dual
- substitute the minimizing p(w) back into the objective to obtain the dual
- the ability to solve the dual problem depends on whether the normalizing constant, Z, is computable!
- the dual form of the objective for p_0(w) = N(0, I) is equivalent to the SVM dual form
- it is more efficient to solve the MedLDA QP than to optimize the sLDA bound!

(A short derivation of the Gaussian case is sketched below.)
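A short derivation sketch of the Gaussian case, reconstructed from the statements above rather than quoted from [10]: completing the square under p_0(w) = N(0, I) makes p(w) Gaussian and turns the dual into an SVM-like quadratic program in the multipliers μ.

```latex
% With p_0(w) = N(0, I), the optimal p(w) is proportional to
%   \exp\{-\tfrac{1}{2}\|w\|^2 + \textstyle\sum_i \mu_i y_i\, w^\top \mathbb{E}[\bar z_i]\},
% i.e. a Gaussian with mean m = \sum_i \mu_i y_i\, \mathbb{E}[\bar z_i], so
\log Z(\mu) = \tfrac{1}{2}\Big\|\sum_i \mu_i y_i\, \mathbb{E}[\bar z_i]\Big\|^2 + \text{const}.
% Substituting back gives the dual
\max_{\mu \ge 0}\;\; \sum_i \mu_i
  \;-\; \tfrac{1}{2}\sum_{i,j} \mu_i \mu_j\, y_i y_j\,
  \mathbb{E}[\bar z_i]^\top \mathbb{E}[\bar z_j],
% which has the same form as the SVM dual with kernel
% K(i, j) = \mathbb{E}[\bar z_i]^\top \mathbb{E}[\bar z_j].
```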
More inference detail

Parameter updates are mostly the same as in basic LDA
- recall the difference between the sLDA updates and the LDA updates
  - the variational parameter φ_{d,n} approximates the expectation of topic z_{d,n}
  - in MedLDA, the φ_{d,n} update has a closed form and depends on the Lagrange multipliers from the dual solution
  - the MedLDA term f(μ) has an effect similar to sLDA's ∂E[A(η^T z̄)] / ∂φ_{d,n}
MED topic models (MedTM) [10]

MedTM
- a generalization from LDA to arbitrary topic models
- not completely faithful to the original MED
  - sometimes the discriminative portion of the objective is learned without a probability distribution over weights
  - i.e., w instead of p(w)

Examples
- Conditional Topic Random Fields [12]
  - adds a dependence between the topics and a per-document input variable to MedLDA
- Upstream Scene Understanding Models [11]
  - the discriminative component is upstream in the model
MED/MedTM key points

- combines positive aspects of both generative and discriminative models
  - allows simultaneous learning of the discrimination function and the mapping to the latent space
  - avoids the pitfall of classifying with a generative model
- the optimization reduces to a QP, which is easier to solve than logistic regression
- flexible: MED can be used with different topic models
Modifications to LDA for classification

Two main approaches to improve classification performance:
- Upstream models
  - the distribution of topic priors is conditioned on the document category
  - this presentation won't focus on upstream models
  - Figure: Example: Dirichlet Multinomial Regression [6]
- Downstream models
  - the distribution of the document category is conditioned on the topic assignments
  - Figure: Example: Supervised LDA (sLDA) [1]
A closer look at sLDA

sLDA
- a downstream model
- the probability of the response variable is conditioned on a vector of normalized topic counts
  - p(y_n | z̄, η, δ)
  - z̄ = [z̄_1, ..., z̄_K]^T, with z̄_k = z_k / Σ_{k'} z_{k'}
- can be used for regression or classification
- variational algorithm with a mean field (fully factorized) approximation for inference

Figure: sLDA plate model
Generalized Linear Model (GLM)

sLDA response variables are modelled by a GLM
- GLM(y; z̄, η, δ) ≡ h(y, δ) exp{ ( y η^T z̄ − A(η^T z̄) ) / δ }
  - h(y, δ): base measure
  - y: sufficient statistic
  - A(η^T z̄): log normalizer
  - (z̄ is the bag-of-topics vector normalized to sum to 1)
- Normal distribution (for regression)
  - y ∈ R
  - h(y, δ) = (1/√(2πδ)) exp{ −y² / (2δ) }
  - A(η^T z̄) = (η^T z̄)² / 2
- Softmax (for classification)
  - y ∈ {0, 1}
  - h(y, δ) = 1
  - A(η^T z̄) = log( 1 + exp(η^T z̄) )

(The two A(·) cases are sketched below.)
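A tiny numerical sketch of the two log-normalizer cases, assuming the standard exponential-family forms above; the η and z̄ values are arbitrary illustrations.

```python
import numpy as np

def A_gaussian(eta_dot_zbar):
    """Log normalizer for the Normal GLM (regression case)."""
    return 0.5 * eta_dot_zbar ** 2

def A_softmax(eta_dot_zbar):
    """Log normalizer for the softmax GLM (binary classification case)."""
    return np.log1p(np.exp(eta_dot_zbar))

eta = np.array([2.0, -1.0, 0.5])
z_bar = np.array([0.5, 0.3, 0.2])        # normalized bag-of-topics vector
t = eta @ z_bar
print(A_gaussian(t), A_softmax(t))
# Softmax case with h = 1, delta = 1: p(y=1 | z_bar) = exp(t - A(t)) = sigmoid(t)
print(np.exp(t - A_softmax(t)))
```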
sLDA inference

Parameter updates are mostly the same as in basic LDA

sLDA variational algorithm
- E_{q(z)}[z_{d,n}] approximates the expectation of topic z_{d,n} under the true posterior
- solve for φ_{d,n} ≡ E_{q(z)}[z_{d,n}] directly
- for a particular GLM, the φ_{d,n} update depends on ∂E[A(η^T z̄)] / ∂φ_{d,n}

For the softmax (classification) case
- A(η^T z̄) ≡ log( 1 + exp(η^T z̄) )
- two bounds, including an additional upper bound, are needed to approximate E[ log( 1 + exp(η^T z̄) ) ]
Posterior Regularization

- only a single distribution over feature vectors; no weights
- variables that were latent in MED are observed in the training set
- a different dual form
References I

[1] D.M. Blei and J. McAuliffe. Supervised topic models. Advances in Neural Information Processing Systems, 20:121-128, 2008.
[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[4] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. 1999.
[5] S. Lacoste-Julien, F. Sha, and M.I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems 21 (NIPS 2008), 2008.
[6] D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proc. of the 24th Conference on Uncertainty in Artificial Intelligence, 2008.
[7] H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239, 2005.
[8] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

References II

[9] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[10] J. Zhu, A. Ahmed, and E.P. Xing. MedLDA: A general framework of maximum margin supervised topic models. 2009.
[11] J. Zhu, L.J. Li, F.F. Li, and E.P. Xing. Large margin learning of upstream scene understanding models.
[12] J. Zhu and E.P. Xing. Conditional Topic Random Fields. In International Conference on Machine Learning, 2010.