Hierarchical Bayes for Text Classification

Shivakumar Vaithyanathan, Jianchang Mao, Byron Dom
{shiv mao dom}@almaden.ibm.com
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120

In: A.-H. Tan and P. Yu (Eds.): PRICAI 2000 Workshop on Text and Web Mining, Melbourne, pp. 36-43, August 2000.
Abstract. Naive Bayes models have been very popular in several classification tasks. In this paper we study the application of these models to classification tasks where the data is sparse, i.e., where a large number of possible outcomes do not appear in the training data. Traditionally, point estimates of the model parameters, and in particular point estimates based on Laplace's rule, have been popular for such sparse data. In this paper we investigate the use of the integrated likelihood together with different techniques for determining the hyper-parameters of the prior distribution. The evaluations are conducted in the context of text classification.
1.0 Introduction
Naive Bayes approaches form a very popular class of models used in machine learning applications [1]. Their simplicity, based upon the assumption of conditional independence, makes these models very attractive and computationally efficient. Further, results from applying these models in several domains suggest that they can perform very competitively when stacked against more elaborate approaches.
This paper addresses the special case of Naive Bayes models in which the likelihood function is assumed to be multinomial. Further, we are particularly interested in applications where the data is very sparse (the number of observations is very low compared to the dimensionality of the data). Estimating the parameters of a multinomial distribution from sparse data can be problematic, as a large number of possible outcomes do not appear in the training data. The practical implications of this are enormous, and the problem appears in areas such as text classification [1,4], web mining [6] and language modeling [4], to name a few. Several previous attempts have been made to mitigate this problem using smoothing approaches such as Good-Turing and back-off techniques [5], and Bayesian approaches using hierarchical Dirichlet priors [4].
In this paper we investigate the use of Naive Bayes models in the context of text classification. This is an important problem that has received a lot of attention lately due to the large amount of electronic text available. In particular, we investigate the use of the integrated likelihood to perform the classification instead of using point estimates of the parameters; the difference between the two will become clear in the subsequent sections. Further, we experiment with different hyper-parameter estimates, including the use of empirical Bayes [7] and an intuitive technique that uses the distribution of the parameters over the classes to construct a prior.
2.0 Naive Bayes Models
We begin by describing naive Bayes models. Suppose we have a collection of N training instances {d_1, ..., d_N}, where each instance is labeled with one of the classes {C_1, ..., C_K}. Further, assume that each instance is represented as an M-dimensional vector. Our task is to learn the density of each class such that, given a new test instance t, we can assign it to the class with the highest posterior probability. Formally, using Bayes' rule, the assigned class is given by

\arg\max_k \{ P(C_k \mid t) \} = \arg\max_k \left\{ \frac{P(t \mid C_k)\, P(C_k)}{P(t)} \right\}    (1)

where
P(C_k) = the prior probability of class k,
t = the new instance that is to be classified,
P(t \mid C_k) = the conditional probability of the test instance given class k.
To define the problem addressed in this paper and set the tone for the subsequent discussion, note that the evaluation of P(t \mid C_k) is typically performed using a point estimate of the parameter vector, denoted \hat{\theta}_k, which describes the density of class k. For example, if each class is assumed to be Gaussian, then \hat{\theta}_k can be the maximum likelihood (ML) estimates of the mean \mu and the covariance matrix \Sigma. Making the assumption of conditional independence (hence the term "naive") we can write the likelihood function as

P(t \mid C_k) = \prod_{j=1}^{M} P(t_j \mid C_k; \theta_k)    (2)

where t_j is the jth element of the vector t.
In a special case of such Bayes models, the functional form of the likelihood function is assumed to be a multinomial distribution:

P(t \mid C_k; \theta_k) = \binom{|t|}{t_1, \ldots, t_M} \prod_{j} \theta_{kj}^{t_j}    (3)

where
\binom{|t|}{t_1, \ldots, t_M} is the multinomial coefficient,
t_j is the number of occurrences of the feature element (term) j,
|t| is the total number of all features (terms) in the pattern (document), computed as |t| = \sum_j t_j.
Substituting (2) in (1) we can predict the class of the new instance. What remains is to obtain point estimates of the parameter vectors \theta_k from the training data. Typically, maximum likelihood (ML) estimates of the parameters are used, and this can be a particularly vexing problem in cases where the data is very sparse. Evaluation of Bayesian approaches to overcome this problem of sparsity is the central point of this paper and is discussed in depth in the subsequent sections.
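As an illustration of the model just described, the following sketch classifies a term-count vector using ML point estimates of the multinomial parameters. This is our own minimal example, not code from the paper; the function and variable names are ours, and it deliberately exposes the zero-probability problem for unseen terms that motivates the rest of the paper.

```python
import numpy as np

def ml_estimates(counts):
    """ML estimates theta_kj = t_kj / T^(k) from a (K, M) matrix of
    per-class term counts.  Terms unseen in a class get probability zero,
    which is exactly the sparse-data problem addressed in Section 3."""
    return counts / counts.sum(axis=1, keepdims=True)

def classify(t, theta, class_priors):
    """Assign the test count vector t to the arg-max class of equation (1).
    The multinomial coefficient in (3) is class-independent and dropped;
    log-probabilities are used for numerical stability."""
    log_post = np.log(np.asarray(class_priors, dtype=float))
    mask = t > 0                               # only observed terms matter
    with np.errstate(divide="ignore"):
        log_theta = np.log(theta[:, mask])     # -inf where theta_kj is zero
    return int(np.argmax(log_post + log_theta @ t[mask]))
```

With sparse training data many theta_kj are zero, so a single unseen term drives the class likelihood to zero; the Bayesian treatments below avoid this.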
3.0 Bayesian Approach to Parameter Estimation in the Multinomial Distribution
As mentioned in the previous section, sparse data can lead to difficulties in accurately obtaining ML estimates of \theta_k. For multinomial models, a very popular way to overcome this problem is to use Laplace's rule, which can be derived as the expected value of the posterior distribution of the parameter under a uniform prior. The derivation of this expected value is shown in Appendix A. We now investigate other approaches to overcoming this sparsity problem.
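For reference, a minimal sketch (our own, not the authors' code) of Laplace's rule as a smoothed point estimate, matching equation (18) in Appendix A:

```python
import numpy as np

def laplace_estimates(counts):
    """Laplace's rule: the posterior mean under a uniform Dirichlet prior,
    theta_kj = (t_kj + 1) / (T^(k) + M), cf. equation (18) in Appendix A.

    counts: (K, M) array of per-class term counts."""
    M = counts.shape[1]
    return (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + M)
```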
3.1 Classification using Integrated Likelihood
Consider the right-hand side of equation (1). Instead of using a point estimate, we can use the integrated (or marginal) likelihood, as shown in equation (4):

P(t \mid C_k) = \int P(t \mid C_k; \theta)\, P(\theta \mid D_{T_k})\, d\theta    (4)

where P(\theta \mid D_{T_k}) is the posterior distribution of the parameter vector \theta given the training data for class k. To use equation (4) we must first obtain this posterior distribution using a Bayesian learning scheme:

P(\theta \mid D_{T_k}) = \frac{P(D_{T_k} \mid \theta)\, \pi(\theta)}{\int P(D_{T_k} \mid \theta)\, \pi(\theta)\, d\theta}    (5)

where \pi(\theta) is the prior distribution of the parameter vector \theta.
To perform the Bayesian learning shown in equation (5) we first need to choose a functional form for the prior \pi(\theta), then choose the associated hyper-parameters judiciously and solve the integral. A popular choice for the prior \pi(\theta), with a multinomial likelihood function, is its conjugate¹, the Dirichlet distribution. This choice of prior also makes the right-hand side of equation (4) easier to solve, yielding the following result:

\int P(t \mid C_k; \theta)\, \pi(\theta)\, d\theta = \frac{\Gamma(\alpha)}{\Gamma(\alpha + T^{(k)})} \prod_j \frac{\Gamma(\alpha_j + t_{kj})}{\Gamma(\alpha_j)}    (6)

where
\alpha = \sum_j \alpha_j,
t_{kj} = the total number of occurrences of term j in class k,
T^{(k)} = \sum_j t_{kj},
\Gamma(\cdot) = the gamma function.

¹ A conjugate prior is a prior that yields a posterior of the same functional form as the prior.

The choice of the hyper-parameters in the Bayesian formalism is to be determined by one's prior beliefs, based on knowledge of the problem. In general, however, such prior knowledge is difficult to obtain. In its absence one usually uses a "non-informative" prior, typically a uniform prior. For the Dirichlet distribution a uniform prior can be obtained by setting all \alpha_j = 1. In the next subsections we discuss two alternative techniques for determining the values of these hyper-parameters, and in the final section we evaluate the performance of the classifier under different choices of hyper-parameters for the task of document classification.
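As a sketch of how equations (4)-(6) can be evaluated in practice, the following computes the log of the Dirichlet-multinomial integrated likelihood of a test document under class k using log-gamma functions. This is our own illustration, under the assumption that the posterior Dirichlet parameters \alpha_j + t_{kj} from (5) are used and that the class-independent multinomial coefficient is dropped; the function names are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_integrated_likelihood(test_counts, train_counts_k, alpha):
    """Log Dirichlet-multinomial predictive likelihood of a test document
    under class k (cf. equations (4)-(6)), up to the class-independent
    multinomial coefficient.

    test_counts    : length-M term counts t_j of the test document
    train_counts_k : length-M training term counts t_kj for class k
    alpha          : length-M Dirichlet hyper-parameters alpha_j
    """
    post = alpha + train_counts_k          # posterior Dirichlet parameters
    a0, n = post.sum(), test_counts.sum()
    return (gammaln(a0) - gammaln(a0 + n)
            + np.sum(gammaln(post + test_counts) - gammaln(post)))
```

Classification then proceeds as in equation (1), with this integrated likelihood replacing the point-estimate likelihood.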
3.2 Empirical Bayes
One alternative approach to determining these hyper-parameters is to optimize them, in the following sense:

P(D_T \mid \alpha) = \int P(D_T \mid \theta)\, \pi(\theta \mid \alpha)\, d\theta    (7)

where D_T is the training data. Now P(D_T \mid \alpha) will be peaked around the maximum likelihood estimates of \alpha, and these estimates can be obtained by setting \partial P(D_T \mid \alpha) / \partial \alpha_j = 0. A similar approach to language modeling has been used by MacKay and Peto [4]. It is important to note that ML estimates of \alpha can be obtained either separately for each class or as a single set of hyper-parameters for all the classes. These two options result in different definitions of the training data D_T in equation (7), and thus in different solutions of the integral. The two approaches are discussed below.
3.2.1 Empirical Bayes (EB) estimator with the same hyper-parameters for all classes

We can have two different estimates of \alpha depending on our definition of P(D \mid \alpha) in equation (7). In this subsection we use the definition

P(D \mid \alpha) = \prod_k \prod_{i \in C_k} p(d_i \mid \alpha)    (8)

Assuming a Dirichlet form for \pi(\theta \mid \alpha) and differentiating with respect to \alpha_j, we obtain equation (9):

\frac{\partial}{\partial \alpha_j} \prod_k \left[ \frac{\Gamma(\alpha)}{\Gamma(\alpha + T^{(k)})} \prod_j \frac{\Gamma(\alpha_j + t_{kj})}{\Gamma(\alpha_j)} \right] = 0    (9)

where \alpha = \sum_j \alpha_j. It is more convenient to solve \partial \log P(D \mid \alpha) / \partial \alpha_j = 0, which is given in equation (10) below:

\sum_k \left[ \Psi(\alpha) - \Psi(\alpha + T^{(k)}) + \Psi(\alpha_j + t_{kj}) - \Psi(\alpha_j) \right] = 0    (10)

where \Psi(\cdot) is the digamma function. Equation (10) can then be solved for \alpha_j using standard optimization techniques. The high dimensionality of the feature space in document classification problems makes it difficult to use quasi-Newton techniques for the solution. Instead we use simple gradient descent to find the optima of (7), as sketched below.
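A minimal sketch (our own, not the authors' implementation) of this gradient-based optimization of \log P(D \mid \alpha) for a single hyper-parameter vector shared by all classes, using the gradient in equation (10); the learning rate and iteration count follow the settings reported at the end of Section 3.2.2, and the positivity clamp is our own safeguard.

```python
import numpy as np
from scipy.special import digamma

def eb_shared_alpha(counts, lr=1e-4, n_iter=500):
    """Gradient ascent on log P(D | alpha) with one hyper-parameter vector
    alpha shared by all classes (equation (10)).

    counts: (K, M) matrix of training term counts t_kj."""
    K, M = counts.shape
    alpha = np.ones(M)                        # start from the uniform prior
    T = counts.sum(axis=1)                    # T^(k): total count per class
    for _ in range(n_iter):
        a0 = alpha.sum()
        grad = (K * digamma(a0)
                - digamma(a0 + T).sum()
                + digamma(alpha + counts).sum(axis=0)
                - K * digamma(alpha))
        alpha = np.maximum(alpha + lr * grad, 1e-6)   # keep alpha_j positive
    return alpha
```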
3.2.2 EB estimator with different hyper-parameters for each class

If, instead of the definition in equation (8), we define the training data separately for each class, we have

P(D_k \mid \alpha_k) = \prod_{i \in C_k} p(d_i \mid \alpha_k)    (11)

Using (11) and a Dirichlet prior in (7) and differentiating with respect to \alpha_{kj}, we get the following gradient:

\frac{\partial}{\partial \alpha_{kj}} \log P(D_k \mid \alpha_k) = \Psi(\alpha_k) - \Psi(\alpha_k + T^{(k)}) + \Psi(\alpha_{kj} + t_{kj}) - \Psi(\alpha_{kj}) = 0    (12)

where \alpha_k = \sum_j \alpha_{kj}. Again, (12) can be solved using standard gradient descent techniques. Note that we will have a different set of hyper-parameters for each class k. For both empirical Bayes approaches we used gradient descent, run for a total of 500 iterations, with the learning rate varied between 0.0001 and 0.00001. A per-class sketch is given below.
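A corresponding sketch (again our own illustration, not the paper's code) of the per-class estimator, optimizing \log P(D_k \mid \alpha_k) for one class with the gradient in equation (12):

```python
import numpy as np
from scipy.special import digamma

def eb_class_alpha(counts_k, lr=1e-4, n_iter=500):
    """Gradient ascent on log P(D_k | alpha_k) for a single class k,
    following equation (12).

    counts_k: length-M vector of training term counts t_kj for class k."""
    alpha = np.ones_like(counts_k, dtype=float)      # uniform starting point
    Tk = counts_k.sum()
    for _ in range(n_iter):
        a0 = alpha.sum()
        grad = (digamma(a0) - digamma(a0 + Tk)
                + digamma(alpha + counts_k) - digamma(alpha))
        alpha = np.maximum(alpha + lr * grad, 1e-6)  # keep alpha_j positive
    return alpha
```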
4.0 Experiments and Results
To evaluate the performance of the different approaches we experimented with two popular data sets: the popular 21273-document Reuters data set [3] and a USENET data set used in [8]. Some peculiar aspects of the Reuters split are that some classes do not have a single training document and that several other classes have very few training documents. We have, however, maintained this split to be consistent with the work in the literature. For the USENET data set we performed a 4-fold cross-validation, i.e., the data set was split into 4 blocks and in each iteration we trained on 3 blocks and tested on the fourth.
The results of these experiments are shown in Table 1, where we show the results of using Laplace's rule, the integrated likelihood with a uniform prior, empirical Bayes (or evidence maximization), and a prior constructed from the class distribution. All reported results are for the traditional IR evaluation measure, the precision/recall break-even point [3] (a sketch of this measure follows Table 1). It is interesting to note that the integrated likelihood using an empirical Bayes prior (same hyper-parameters for all classes) performs the best on both data sets. For the News data set we note that all the classifiers perform similarly, although it is interesting that the integrated likelihood with a uniform prior performs slightly worse than Laplace's rule. In conclusion, however, the differences in performance between the classifiers are not conclusive. We are presently running experiments on larger and more varied data sets to better assess the performance of these different classifiers.

Table 1. Precision/recall break-even (BE) points (%). Uni = uniform prior, EMP = separate hyper-parameters for each class, EMP2 = same hyper-parameters for all classes, Lap = Laplace's rule.

Data set    Uni      EMP      EMP2     Lap
Reuters     78.29    78.31    79.68    78.22
News        84.68    84.7     84.8     84.8
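For completeness, a small sketch (our own, with hypothetical helper names) of how the precision/recall break-even point for a single binary category can be computed from ranked classifier scores:

```python
import numpy as np

def breakeven_point(scores, labels):
    """Precision/recall break-even point for one binary category: the
    value at the rank where precision and recall are closest.

    scores : classifier score for each test document (higher = more positive)
    labels : 1 if the document truly belongs to the category, else 0"""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                         # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)
    i = int(np.argmin(np.abs(precision - recall)))
    return float((precision[i] + recall[i]) / 2.0)
```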
References

1. Lewis, D. D., Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, ECML, 1998.
2. Kohavi, R., Becker, B., and Sommerfield, D., Improving Simple Bayes, ECML, Poster Papers, 1997.
3. Apte, C., Damerau, F., and Weiss, S. M., Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, 1994.
4. Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P., Using Taxonomy, Discriminants and Signatures for Navigating in Text Databases, VLDB, 1997.
4. MacKay, D. J. C. and Peto, L. C. B., A Hierarchical Dirichlet Language Model. http://wol.ra.phy.cam.ac.uk/mackay/abstracts/lang4.html
5. Katz, S. M., Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401, 1987.
6. Chakrabarti, S., van den Berg, M., and Dom, B., Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, Proceedings of the Eighth International World Wide Web Conference (WWW8), Toronto, Canada, May 1999.
7. Bernardo, J. M. and Smith, A. F. M., Bayesian Theory, Wiley, 1994.
8. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T., Learning to Classify Text from Labeled and Unlabeled Documents, AAAI-98.
Appendix A
First consider the posterior distribution for a Bernoulli variable with a Beta prior; the derivation below is written for the general multinomial/Dirichlet case used in the text, of which the Bernoulli/Beta case is the one-dimensional instance. The conditional expectation (the expected value of the posterior) of \theta_i given the training data D_k is

E(\theta_i \mid D_k) = \int \theta_i\, P(\theta \mid D_k)\, d\theta    (13)

P(\theta \mid D_k) = \frac{P(D_k \mid \theta)\, \pi(\theta)}{\int P(D_k \mid \theta)\, \pi(\theta)\, d\theta}    (14)

Substituting (14) in (13) we get

E(\theta_i \mid D_k) = \frac{\int \theta_i\, P(D_k \mid \theta)\, \pi(\theta)\, d\theta}{\int P(D_k \mid \theta)\, \pi(\theta)\, d\theta}    (15)

Solving (15) with a Dirichlet prior we can write

E(\theta_i \mid D_k) = \frac{\dfrac{\Gamma(\alpha_0)}{\prod_j \Gamma(\alpha_j)} \cdot \dfrac{\Gamma(\alpha_i + t_i + 1)\, \prod_{j \neq i} \Gamma(\alpha_j + t_j)}{\Gamma(\alpha_0 + T^{(k)} + 1)}}{\dfrac{\Gamma(\alpha_0)}{\prod_j \Gamma(\alpha_j)} \cdot \dfrac{\prod_j \Gamma(\alpha_j + t_j)}{\Gamma(\alpha_0 + T^{(k)})}}    (16)

Simplifying (16) we get

E(\theta_i \mid D_k) = \frac{\Gamma(\alpha_i + t_i + 1)\, \Gamma(\alpha_0 + T^{(k)})}{\Gamma(\alpha_i + t_i)\, \Gamma(\alpha_0 + T^{(k)} + 1)}    (17)

Using the identity \Gamma(x+1) = x\,\Gamma(x) (equivalently, \Gamma(x) = (x-1)! when the arguments are integers), (17) simplifies to

E(\theta_i \mid D_k) = \frac{\alpha_i + t_i}{\alpha_0 + T^{(k)}}    (18)

For a uniform prior all \alpha_i = 1 and \alpha_0 = M. Therefore Laplace's rule can be written as

E(\theta_i \mid D_k) = \frac{1 + t_i}{M + T^{(k)}}
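As a numerical sanity check of this derivation (our own sketch, not part of the original paper), the posterior mean can be computed directly from the gamma-function ratio in (17) and compared against the closed form in (18):

```python
import numpy as np
from scipy.special import gammaln

def posterior_mean_via_gammas(alpha, t, i):
    """E(theta_i | D) from the ratio of gamma functions in equation (17),
    with b = alpha + t and b0 = sum(b)."""
    b = np.asarray(alpha, dtype=float) + np.asarray(t, dtype=float)
    b0 = b.sum()
    return float(np.exp(gammaln(b[i] + 1) + gammaln(b0)
                        - gammaln(b[i]) - gammaln(b0 + 1)))

# A uniform prior (all alpha_j = 1) reproduces Laplace's rule (1 + t_i)/(M + T^(k)):
alpha = np.ones(5)                 # M = 5 features
t = np.array([3, 0, 1, 0, 2])      # T^(k) = 6
assert np.isclose(posterior_mean_via_gammas(alpha, t, 0), (1 + 3) / (5 + 6))
```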