A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence
Karl-Michael Schneider (University of Passau)
[email protected]
42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), July 21–26, 2004, Barcelona, Spain
Introduction

Text Classification
• Assignment of text documents to predefined classes
• Popular machine learning technique
• Applications in news categorization, e-mail filtering, user modeling, document storage, etc.

Naive Bayes
• Simple probabilistic model of text generation
• Performs well despite unrealistic independence assumptions [3]

Feature Selection
• Problem with text: high dimensionality (20,000 ∼ 100,000 words)
• Solution: use only a subset of the vocabulary
  – Reduce computational overhead
  – Avoid overfitting to training data
• Filtering approach
  – Score words independently
  – Greedy selection of highest scored words
Naive Bayes: Probabilistic Framework
• Probabilistic model of text generation [5]:
  – Vocabulary modeled by a single multinomial random variable W
  – Document representation: word count vector d = ⟨x_{i1}, …, x_{i|V|}⟩
  – Distribution of documents:
    p(d|c_j) = p(|d|) |d|! ∏_{t=1}^{|V|} p(w_t|c_j)^{n(w_t,d)} / n(w_t,d)!
• Bayes' rule:
  p(c_j|d) = p(c_j) p(d|c_j) / p(d)
• Classification:
  c* = argmax_j p(c_j|d)
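As a minimal sketch of how this decision rule is computed in practice (the function and array names below are illustrative, not from the poster), note that p(|d|), |d|! and the n(w_t,d)! factors are the same for every class, so they can be dropped from the argmax:

```python
import numpy as np

def classify(doc_counts, log_priors, log_word_probs):
    """Multinomial Naive Bayes decision rule:
    c* = argmax_j [ log p(c_j) + sum_t n(w_t, d) log p(w_t|c_j) ].

    doc_counts:     (|V|,)      word counts n(w_t, d)
    log_priors:     (|C|,)      log p(c_j)
    log_word_probs: (|C|, |V|)  log p(w_t|c_j)
    """
    scores = log_priors + log_word_probs @ doc_counts
    return int(np.argmax(scores))
```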
Parameter Estimation
• Maximum likelihood estimates with Laplacean priors:
  p(c_j) = |c_j| / ∑_{j'} |c_{j'}|
  p(w_t|c_j) = (1 + ∑_{d_i ∈ c_j} n(w_t, d_i)) / (|V| + ∑_{s=1}^{|V|} ∑_{d_i ∈ c_j} n(w_s, d_i))
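A minimal sketch of these estimates, assuming the training documents are given as a count matrix X and integer class labels y (names are illustrative):

```python
import numpy as np

def estimate_parameters(X, y, n_classes):
    """Maximum likelihood estimates with Laplacean priors.

    X: (|S|, |V|) word counts n(w_t, d_i);  y: (|S|,) class indices.
    Assumes every class occurs at least once in the training set.
    Returns log p(c_j) and log p(w_t|c_j).
    """
    n_docs, vocab_size = X.shape
    log_priors = np.empty(n_classes)
    log_word_probs = np.empty((n_classes, vocab_size))
    for j in range(n_classes):
        X_j = X[y == j]
        # p(c_j) = |c_j| / sum_j' |c_j'|
        log_priors[j] = np.log(len(X_j) / n_docs)
        # p(w_t|c_j) = (1 + sum_{d in c_j} n(w_t, d)) / (|V| + sum_s sum_{d in c_j} n(w_s, d))
        counts = X_j.sum(axis=0)
        log_word_probs[j] = np.log((1.0 + counts) / (vocab_size + counts.sum()))
    return log_priors, log_word_probs
```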
Naive Bayes: Information Theoretic Framework
• Distribution of words in a document: p(w_t|d) = n(w_t, d)/|d|
• Kullback-Leibler Divergence [4, 1]:
  KL(p, q) = ∑_x p(x) log( p(x) / q(x) )
• Measures the divergence of one probability distribution from another
• Classification [2]:
  c*(d) = argmax_j [ log p(c_j) + ∑_{t=1}^{|V|} n(w_t, d) log p(w_t|c_j) ]
        = argmax_j [ (1/|d|) log p(c_j) + ∑_{t=1}^{|V|} p(w_t|d) log p(w_t|c_j) ]
        = argmax_j [ (1/|d|) log p(c_j) − ∑_{t=1}^{|V|} p(w_t|d) log( p(w_t|d) / p(w_t|c_j) ) ]
          (the term ∑_t p(w_t|d) log p(w_t|d) added in this step does not depend on j)
        = argmin_j [ KL(p(W|d), p(W|c_j)) − (1/|d|) log p(c_j) ]
• Naive Bayes selects the class c* with minimal KL-divergence between d and c*
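The equivalence can be checked directly; the sketch below (illustrative names; zero-count words are skipped since their terms vanish) classifies by the argmin form:

```python
import numpy as np

def classify_kl(doc_counts, log_priors, log_word_probs):
    """Information-theoretic form of the decision rule:
    c* = argmin_j [ KL(p(W|d), p(W|c_j)) - (1/|d|) log p(c_j) ]."""
    d_len = doc_counts.sum()
    p_w_d = doc_counts / d_len                 # p(w_t|d) = n(w_t, d)/|d|
    nz = p_w_d > 0                             # 0 * log 0 terms contribute nothing
    kl = (p_w_d[nz] * (np.log(p_w_d[nz]) - log_word_probs[:, nz])).sum(axis=1)
    return int(np.argmin(kl - log_priors / d_len))
```

For any non-empty document this returns the same class as classify() above, since the two scores differ only by the class-independent term ∑_t p(w_t|d) log p(w_t|d) and a positive factor |d|.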
Feature Selection with Mutual Information

Mutual Information Score
• Mutual Information well known in Information Theory [1]
• Common feature selection score
• Performs well in text classification [7]
• Mutual Information between two random variables [1]:
  MI(X; Y) = ∑_x ∑_y p(x, y) log( p(x, y) / (p(x) p(y)) )
• Application to feature selection: measure the mutual information between a word and the class variable
• Requires individual random variables for all words [5, 6]: p(W_t = 1) = p(W = w_t)
• W_t models the occurrence of w_t at any given position
• Mutual information feature score: MI(w_t) = MI(W_t; C)
• Random variables W_t are not independent
• But requires artificial definition of random variables with wrong assumptions
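One way to compute this position-based score, assuming smoothed estimates of p(c_j) and p(w_t|c_j) such as those sketched above (function and variable names are illustrative):

```python
import numpy as np

def mi_scores(priors, word_probs):
    """MI(w_t) = MI(W_t; C), where W_t is a binary variable over word
    positions with p(W_t=1|c_j) = p(w_t|c_j).

    priors:     (|C|,)      p(c_j)
    word_probs: (|C|, |V|)  p(w_t|c_j), assumed strictly between 0 and 1
    Returns an array of |V| scores.
    """
    p1 = word_probs                     # p(W_t=1 | c_j)
    p0 = 1.0 - word_probs               # p(W_t=0 | c_j)
    marg1 = priors @ word_probs         # p(W_t=1)
    marg0 = 1.0 - marg1
    pri = priors[:, None]
    # MI = sum_j sum_x p(c_j) p(x|c_j) log( p(x|c_j) / p(x) )
    return (pri * p1 * np.log(p1 / marg1) + pri * p0 * np.log(p0 / marg0)).sum(axis=0)
```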
Feature Selection based on KL-Divergence

Motivation
• Mixture model of document generation:
  p(d) = ∑_{j=1}^{|C|} p(c_j) p(d|c_j)
• Goal of KL-divergence feature selection:
  – Maximize similarity of documents within training classes
  – Minimize similarity between different training classes

KL-Divergence Score
• Average KL-divergence of the distribution of words in a training document from its class:
  KL(S) = (1/|S|) ∑_{d_i ∈ S} ∑_{t=1}^{|V|} p(w_t|d_i) log( p(w_t|d_i) / p(w_t|c(d_i)) )
• Computation of KL(S) has time complexity O(|S|)
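A direct computation of KL(S), with a single pass over the training documents (the O(|S|) cost noted above); names are illustrative and the smoothed class distributions are assumed to come from the estimates sketched earlier:

```python
import numpy as np

def kl_training_set(X, y, word_probs):
    """Exact KL(S): average KL-divergence of each training document's
    word distribution from the word distribution of its class.

    X: (|S|, |V|) word counts;  y: (|S|,) class labels;
    word_probs: (|C|, |V|) smoothed p(w_t|c_j).
    """
    total = 0.0
    for counts, j in zip(X, y):
        p_w_d = counts / counts.sum()          # p(w_t|d_i)
        nz = p_w_d > 0                         # zero-count words contribute nothing
        total += (p_w_d[nz] * np.log(p_w_d[nz] / word_probs[j, nz])).sum()
    return total / len(X)
```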
• Approximation of KL(S) based on two assumptions:
  – equal number of occurrences of w_t in documents containing w_t
  – equal document length
• Average probability of w_t in documents that contain w_t:
  p̃_d(w_t|c_j) = p(w_t|c_j) |c_j| / N_jt
  (N_jt: number of documents in c_j that contain w_t; N_t: number of training documents that contain w_t)
• Approximate KL-divergence of a training document from its class:
  K̃L(S) = (1/|S|) ∑_{t=1}^{|V|} ∑_{j=1}^{|C|} N_jt p̃_d(w_t|c_j) log( p̃_d(w_t|c_j) / p(w_t|c_j) )
         = − ∑_{t=1}^{|V|} ∑_{j=1}^{|C|} p(c_j) p(w_t|c_j) log( N_jt / |c_j| )
• Approximate KL-divergence of a training document from the training corpus:
  K̃(S) = − ∑_{t=1}^{|V|} p(w_t) log( N_t / |S| )
• KL-divergence score for w_t:
  KL(w_t) = K̃_t(S) − K̃L_t(S) = ∑_{j=1}^{|C|} p(c_j) p(w_t|c_j) log( N_jt / |c_j| ) − p(w_t) log( N_t / |S| )
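A sketch of the resulting per-word score and of the greedy selection of the highest-scored words. Names are illustrative; in this sketch p(w_t) is taken as the mixture ∑_j p(c_j) p(w_t|c_j), and classes in which w_t never occurs are treated as contributing nothing:

```python
import numpy as np

def kl_feature_scores(X, y, priors, word_probs):
    """KL(w_t) = sum_j p(c_j) p(w_t|c_j) log(N_jt/|c_j|) - p(w_t) log(N_t/|S|).

    X: (|S|, |V|) word counts;  y: (|S|,) class labels;
    priors: (|C|,) p(c_j);  word_probs: (|C|, |V|) smoothed p(w_t|c_j).
    Assumes every class has at least one training document.
    """
    n_docs, vocab_size = X.shape
    contains = X > 0
    N_t = contains.sum(axis=0)                      # documents containing w_t
    class_part = np.zeros(vocab_size)
    for j in range(word_probs.shape[0]):
        docs_j = contains[y == j]
        N_jt = docs_j.sum(axis=0)                   # documents of c_j containing w_t
        size_j = docs_j.shape[0]                    # |c_j|
        class_part += np.where(
            N_jt > 0,
            priors[j] * word_probs[j] * np.log(np.maximum(N_jt, 1) / size_j),
            0.0)
    p_w = priors @ word_probs                       # mixture marginal p(w_t)
    corpus_part = np.where(N_t > 0, p_w * np.log(np.maximum(N_t, 1) / n_docs), 0.0)
    return class_part - corpus_part

def select_features(scores, k):
    """Greedy selection: indices of the k highest-scored words."""
    return np.argsort(scores)[::-1][:k]
```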
Results

[Figure: 20 Newsgroups. Classification Accuracy vs. Vocabulary Size for MI, KL, and dKL]
[Figure: Reuters 21578 microaveraged recall. Precision/Recall Breakeven Point vs. Vocabulary Size for MI, KL, and dKL]
[Figure: Reuters 21578 macroaveraged recall. Precision/Recall Breakeven Point vs. Vocabulary Size for MI, KL, and dKL]

Micro Recall (Reuters 21578)
Vocabulary size   100            5,000          20,000
MI                86.5           88.0           89.3
KL                87.2 (+0.8%)   89.2 (+1.5%)   90.1 (+0.9%)
dKL               88.0 (+1.8%)   89.0 (+1.2%)   89.3 (+0.0%)

Macro Recall (Reuters 21578)
Vocabulary size   100            5,000          20,000
MI                76.1           75.1           78.3
KL                80.1 (+5.3%)   80.7 (+7.4%)   82.2 (+5.0%)
dKL               78.8 (+3.7%)   78.0 (+3.9%)   78.5 (+0.2%)
Conclusions
• Feature score based on approximation of KL-divergence:
  – comparable to or slightly better than mutual information
  – better performance on smaller categories of Reuters 21578
• Feature score based on true KL-divergence (future work):
  – considerably higher performance than mutual information on various datasets
  – automatic feature subset selection for maximum performance
References
[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley, New York, 1991.
[2] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 191–200, 2002.
[3] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.
[4] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[5] Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Learning for Text Categorization: Papers from the AAAI Workshop, pages 41–48. AAAI Press, 1998. Technical Report WS-98-05.
[6] Jason D. M. Rennie. Improving multi-class text classification with Naive Bayes. Master's thesis, Massachusetts Institute of Technology, 2001.
[7] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning (ICML-97), pages 412–420, 1997.